
Conversation

@jorendorff (Contributor)

Implementation of the FAMST algorithm from Almansoori & Telek (2025).

Features:

  • Generic over data type T and distance function (see the sketch below)
  • Uses NN-Descent for ANN graph construction
  • Three-phase approach: ANN graph, component connection, edge refinement
  • O(dn log n) time complexity, O(dn + kn) space complexity

Paper: https://arxiv.org/abs/2507.14261

(Entirely vibe-coded by Opus 4.5.)
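
To make "generic over data type T and distance function" concrete, here is a minimal sketch of the kind of distance closures such an interface accepts. The slice-based point types are illustrative only; the PR's exact trait bounds may differ.

```rust
// Sketch only: examples of distance functions the generic interface is meant
// to accept. The PR's exact trait bounds may differ.
fn euclidean(a: &[f64], b: &[f64]) -> f64 {
    a.iter()
        .zip(b)
        .map(|(x, y)| (x - y) * (x - y))
        .sum::<f64>()
        .sqrt()
}

fn hamming(a: &[u8], b: &[u8]) -> f64 {
    // Non-numeric data works too: the algorithm only ever asks for pairwise distances.
    a.iter().zip(b).filter(|(x, y)| x != y).count() as f64
}

fn main() {
    assert_eq!(euclidean(&[0.0, 0.0], &[3.0, 4.0]), 5.0);
    assert_eq!(hamming(b"karolin", b"kathrin"), 3.0);
}
```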

@jorendorff requested a review from a team as a code owner (January 14, 2026 01:03)
Copilot AI review requested due to automatic review settings (January 14, 2026 01:03)

Copilot AI left a comment

Copilot encountered an error and was unable to review this pull request. You can try again by re-requesting a review.

Comment on lines +251 to +254
// Max-heap: larger distances have higher priority
self.distance
    .partial_cmp(&other.distance)
    .unwrap_or(std::cmp::Ordering::Equal)

Copilot AI Jan 14, 2026

The NeighborEntry struct implements Ord by comparing f64 values directly, but f64 comparison with NaN values can be problematic. While partial_cmp is used with unwrap_or(Equal) as a fallback, this could hide bugs where NaN distances are computed. Consider adding validation to ensure the distance function never returns NaN, or document this assumption clearly.

This issue also appears in the following locations of the same file:

  • line 629
  • line 384
  • line 472
Suggested change
// Max-heap: larger distances have higher priority
self.distance
    .partial_cmp(&other.distance)
    .unwrap_or(std::cmp::Ordering::Equal)
// Max-heap: larger distances have higher priority.
// We assume distances are finite and not NaN; treat NaN as a programming error.
match self.distance.partial_cmp(&other.distance) {
    Some(ordering) => ordering,
    None => panic!("NeighborEntry::cmp encountered NaN distance; distances must be finite"),
}
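
If NaN-free distances cannot be guaranteed up front, another option is `f64::total_cmp`, which defines a total order over all floats. A sketch of that alternative; the field names are guesses at the PR's `NeighborEntry`, not its actual definition:

```rust
use std::cmp::Ordering;

// Field names are a guess at the PR's NeighborEntry; only the f64 matters here.
struct NeighborEntry {
    index: usize,
    distance: f64,
}

impl PartialEq for NeighborEntry {
    fn eq(&self, other: &Self) -> bool {
        self.cmp(other) == Ordering::Equal
    }
}

impl Eq for NeighborEntry {}

impl PartialOrd for NeighborEntry {
    fn partial_cmp(&self, other: &Self) -> Option<Ordering> {
        Some(self.cmp(other))
    }
}

impl Ord for NeighborEntry {
    fn cmp(&self, other: &Self) -> Ordering {
        // f64::total_cmp (stable since Rust 1.62) implements the IEEE 754
        // totalOrder predicate, so NaNs sort to the extremes instead of being
        // silently treated as equal and corrupting the heap invariant.
        self.distance.total_cmp(&other.distance)
    }
}
```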

Comment on lines +585 to +599
// Add edges where j < i (reverse direction in ANN graph)
for (i, (neighbors, distances)) in ann_graph
    .neighbors
    .iter()
    .zip(ann_graph.distances.iter())
    .enumerate()
{
    for (&j, &d) in neighbors.iter().zip(distances.iter()) {
        if j < i {
            // Check if this edge isn't already added
            all_edges.push(Edge::new(j, i, d));
        }
    }
}


Copilot AI Jan 14, 2026

The edge deduplication logic in this function adds edges twice from the ANN graph - once for i < j (lines 578-582) and once for j < i (lines 592-596). This creates duplicate edges unnecessarily before deduplication. The directed ANN graph edges should only need to be added once. Consider refactoring to avoid creating duplicate edges in the first place, which would improve performance.

Suggested change
// Add edges where j < i (reverse direction in ANN graph)
for (i, (neighbors, distances)) in ann_graph
    .neighbors
    .iter()
    .zip(ann_graph.distances.iter())
    .enumerate()
{
    for (&j, &d) in neighbors.iter().zip(distances.iter()) {
        if j < i {
            // Check if this edge isn't already added
            all_edges.push(Edge::new(j, i, d));
        }
    }
}
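
For illustration, one way to avoid ever pushing the same ANN edge twice is to normalize the endpoints when edges are first collected and deduplicate once. This sketch assumes `Edge` stores (smaller index, larger index, weight), which may not match the PR's exact types:

```rust
struct Edge {
    a: usize,
    b: usize,
    weight: f64,
}

impl Edge {
    fn new(a: usize, b: usize, weight: f64) -> Self {
        Edge { a, b, weight }
    }
}

// Collect each undirected ANN edge exactly once by normalizing (i, j) to
// (min, max) up front, then sorting and deduplicating by endpoints. Both
// directions of an ANN edge carry the same distance, so which duplicate
// survives does not matter.
fn collect_unique_edges(neighbors: &[Vec<usize>], distances: &[Vec<f64>]) -> Vec<Edge> {
    let mut all_edges = Vec::new();
    for (i, (nbrs, dists)) in neighbors.iter().zip(distances).enumerate() {
        for (&j, &d) in nbrs.iter().zip(dists) {
            if i != j {
                all_edges.push(Edge::new(i.min(j), i.max(j), d));
            }
        }
    }
    all_edges.sort_by(|x, y| (x.a, x.b).cmp(&(y.a, y.b)));
    all_edges.dedup_by_key(|e| (e.a, e.b));
    all_edges
}
```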

Comment on lines +160 to +172
let n = data.len();
if n == 0 {
return FamstResult {
edges: vec![],
total_weight: 0.0,
};
}
if n == 1 {
return FamstResult {
edges: vec![],
total_weight: 0.0,
};
}

Copilot AI Jan 14, 2026

There are no tests for edge cases such as empty data (n=0) or single-point data (n=1). While the code handles these cases, adding explicit tests would ensure this behavior is maintained and documented.

This issue also appears in the following locations of the same file:

  • line 142
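
For illustration, here is a sketch of such tests, written against a generic builder closure because the exact entry-point signature is not visible in this hunk; adapt the closure to whatever the module actually exports.

```rust
// Sketch of degenerate-input tests. `build_mst` stands in for however the test
// module invokes the FAMST entry point; the (edges, total_weight) tuple mirrors
// the FamstResult fields visible in this diff, and the n = 2 case assumes a
// Euclidean distance function.
fn check_degenerate_inputs<F>(build_mst: F)
where
    F: Fn(&[Vec<f64>]) -> (Vec<(usize, usize, f64)>, f64),
{
    // n = 0: no points, no edges, zero weight.
    let (edges, total) = build_mst(&[]);
    assert!(edges.is_empty());
    assert_eq!(total, 0.0);

    // n = 1: a single point also yields an empty MST.
    let (edges, total) = build_mst(&[vec![1.0, 2.0]]);
    assert!(edges.is_empty());
    assert_eq!(total, 0.0);

    // n = 2: exactly one edge, whose weight is the pairwise distance.
    let (edges, total) = build_mst(&[vec![0.0, 0.0], vec![3.0, 4.0]]);
    assert_eq!(edges.len(), 1);
    assert!((total - 5.0).abs() < 1e-9);
}
```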


for i in 0..n {
let mut heap = BinaryHeap::with_capacity(k);
let mut indices: Vec<usize> = (0..n).filter(|&j| j != i).collect();
Collaborator

this is O(n)!!
We need an O(k) solution here
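
For illustration, a rejection-sampling sketch that is expected O(k) per point when k << n (the regime FAMST targets). It uses rand 0.8's `Rng::gen_range`; the names are illustrative, not the PR's:

```rust
use rand::Rng;

// Draw k distinct random indices from 0..n, excluding `i`, without building an
// O(n) candidate vector. Each draw does an O(k) scan of the small output
// buffer, which is cheap for the k values used in ANN graphs.
fn sample_initial_neighbors<R: Rng>(rng: &mut R, n: usize, k: usize, i: usize) -> Vec<usize> {
    assert!(k < n, "need at least k + 1 points to pick k distinct neighbors");
    let mut out = Vec::with_capacity(k);
    while out.len() < k {
        let j = rng.gen_range(0..n);
        if j != i && !out.contains(&j) {
            out.push(j);
        }
    }
    out
}
```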

}

// Initialize with random neighbors using max-heap for each point
let mut heaps: Vec<BinaryHeap<NeighborEntry>> = Vec::with_capacity(n);
Collaborator

Can't we allocate one gigantic vector and slice it instead of creating millions of tiny heaps?
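
For illustration, a sketch of the single-allocation layout being suggested: one `n * k` buffer, with each point's candidates accessed as a slice. Names and the update rule are illustrative, not the PR's:

```rust
// One contiguous n * k buffer instead of n separately allocated heaps.
// Point i's candidate neighbors live in candidates[i * k .. (i + 1) * k].
#[derive(Clone, Copy)]
struct Candidate {
    index: usize,
    distance: f64,
}

struct NeighborTable {
    k: usize,
    candidates: Vec<Candidate>,
}

impl NeighborTable {
    fn new(n: usize, k: usize) -> Self {
        let unfilled = Candidate { index: usize::MAX, distance: f64::INFINITY };
        NeighborTable { k, candidates: vec![unfilled; n * k] }
    }

    fn row_mut(&mut self, i: usize) -> &mut [Candidate] {
        &mut self.candidates[i * self.k..(i + 1) * self.k]
    }

    // Replace the current worst candidate if the new one is closer. With small
    // k, a linear scan for the maximum stands in for the per-point max-heap.
    fn try_insert(&mut self, i: usize, cand: Candidate) -> bool {
        let row = self.row_mut(i);
        let (worst_pos, worst_dist) = row
            .iter()
            .enumerate()
            .max_by(|a, b| a.1.distance.total_cmp(&b.1.distance))
            .map(|(pos, c)| (pos, c.distance))
            .expect("k must be > 0");
        if cand.distance < worst_dist {
            row[worst_pos] = cand;
            true
        } else {
            false
        }
    }
}
```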


// Initialize with random neighbors using max-heap for each point
let mut heaps: Vec<BinaryHeap<NeighborEntry>> = Vec::with_capacity(n);
let mut neighbor_sets: Vec<HashSet<usize>> = vec![HashSet::with_capacity(k); n];
Collaborator

k should be small enough that we can simply do a find on the slice instead of instantiating a hashset.
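
For illustration, the duplicate check against a k-sized row without a `HashSet`:

```rust
// For the small k used in ANN graphs, a linear scan over a slice that is
// already in cache is typically cheaper than hashing, and it avoids allocating
// one HashSet per point.
fn is_known_neighbor(row: &[usize], candidate: usize) -> bool {
    row.contains(&candidate)
}

fn main() {
    let row = [17, 3, 42, 8];
    assert!(is_known_neighbor(&row, 42));
    assert!(!is_known_neighbor(&row, 5));
}
```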

Comment on lines +323 to +329
// Sample from reverse neighbors
let reverse: Vec<usize> = reverse_neighbors[i].iter().copied().collect();
let sample_size =
    ((reverse.len() as f64 * config.nn_descent_sample_rate).ceil() as usize).max(1);
let mut sampled_reverse = reverse.clone();
sampled_reverse.shuffle(rng);
sampled_reverse.truncate(sample_size);
@aneubeck (Collaborator) Jan 14, 2026

There are exactly k forward neighbors for every node, but there can be "arbitrarily" many reverse neighbors, so using the reverse ones will cause scaling problems!

If we connect a reverse neighbor with another reverse neighbor, then that pair can also be found by starting from the reverse-reverse neighbor and following two forward edges.

The combination of a reverse and a forward neighbor can be found by following two forward edges starting from the reverse neighbor.

The only combination which cannot be translated is the path from a forward neighbor to a reverse neighbor. In this scenario the reverse path is again a forward-to-reverse path, and the middle trick would mean connecting two reverse neighbors. So those pairs won't be found, which is probably fine.

This change also means that we will create candidate edges for nodes other than the one under consideration, though. For that reason, it might be best to reserve 2k slots per node and then clean up after every iteration (or when the slots overflow).
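
For illustration, a sketch of the forward-only local join described above; the callback and distance function are placeholders for the PR's own candidate-update machinery:

```rust
// For every point, propose each pair (j, l) of its forward neighbors as mutual
// candidates. From j's perspective this recovers the reverse/forward
// combinations via the middle point, without ever materializing unbounded
// reverse-neighbor lists. Note that the proposals update the candidate rows of
// j and l, i.e. of nodes other than the one whose neighbor list is being
// scanned, which is why the comment above suggests reserving ~2k slots per row
// and compacting after each iteration.
fn forward_local_join<D>(
    neighbors: &[Vec<usize>],              // k forward neighbors per point
    distance: D,
    mut propose: impl FnMut(usize, usize, f64),
) where
    D: Fn(usize, usize) -> f64,
{
    for nbrs in neighbors {
        for (a, &j) in nbrs.iter().enumerate() {
            for &l in &nbrs[a + 1..] {
                if j != l {
                    propose(j, l, distance(j, l));
                }
            }
        }
    }
}
```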
