Add famst crate: Fast Approximate Minimum Spanning Tree #93
Conversation
Implementation of the FAMST algorithm from Almansoori & Telek (2025).

Features:
- Generic over data type T and distance function
- Uses NN-Descent for ANN graph construction
- Three-phase approach: ANN graph, component connection, edge refinement
- O(dn log n) time complexity, O(dn + kn) space complexity

Paper: https://arxiv.org/abs/2507.14261
```rust
// Max-heap: larger distances have higher priority
self.distance
    .partial_cmp(&other.distance)
    .unwrap_or(std::cmp::Ordering::Equal)
```
Copilot AI · Jan 14, 2026
The NeighborEntry struct implements Ord by comparing f64 values directly, but f64 comparison with NaN values can be problematic. While partial_cmp is used with unwrap_or(Equal) as a fallback, this could hide bugs where NaN distances are computed. Consider adding validation to ensure the distance function never returns NaN, or document this assumption clearly.
This issue also appears in the following locations of the same file:
- line 629
- line 384
- line 472
Suggested change, from:

```rust
// Max-heap: larger distances have higher priority
self.distance
    .partial_cmp(&other.distance)
    .unwrap_or(std::cmp::Ordering::Equal)
```

to:

```rust
// Max-heap: larger distances have higher priority.
// We assume distances are finite and not NaN; treat NaN as a programming error.
match self.distance.partial_cmp(&other.distance) {
    Some(ordering) => ordering,
    None => panic!("NeighborEntry::cmp encountered NaN distance; distances must be finite"),
}
```
```rust
// Add edges where j < i (reverse direction in ANN graph)
for (i, (neighbors, distances)) in ann_graph
    .neighbors
    .iter()
    .zip(ann_graph.distances.iter())
    .enumerate()
{
    for (&j, &d) in neighbors.iter().zip(distances.iter()) {
        if j < i {
            // Check if this edge isn't already added
            all_edges.push(Edge::new(j, i, d));
        }
    }
}
```
Copilot AI · Jan 14, 2026
The edge deduplication logic in this function adds edges twice from the ANN graph - once for i < j (lines 578-582) and once for j < i (lines 592-596). This creates duplicate edges unnecessarily before deduplication. The directed ANN graph edges should only need to be added once. Consider refactoring to avoid creating duplicate edges in the first place, which would improve performance.
```rust
let n = data.len();
if n == 0 {
    return FamstResult {
        edges: vec![],
        total_weight: 0.0,
    };
}
if n == 1 {
    return FamstResult {
        edges: vec![],
        total_weight: 0.0,
    };
}
```
Copilot AI · Jan 14, 2026
There are no tests for edge cases such as empty data (n=0) or single-point data (n=1). While the code handles these cases, adding explicit tests would ensure this behavior is maintained and documented.
This issue also appears in the following locations of the same file:
- line 142
```rust
for i in 0..n {
    let mut heap = BinaryHeap::with_capacity(k);
    let mut indices: Vec<usize> = (0..n).filter(|&j| j != i).collect();
```
this is O(n)!!
We need an O(k) solution here
```rust
}

// Initialize with random neighbors using max-heap for each point
let mut heaps: Vec<BinaryHeap<NeighborEntry>> = Vec::with_capacity(n);
```
Can't we allocate one gigantic vector and slice it instead of creating millions of tiny heaps?
```rust
// Initialize with random neighbors using max-heap for each point
let mut heaps: Vec<BinaryHeap<NeighborEntry>> = Vec::with_capacity(n);
let mut neighbor_sets: Vec<HashSet<usize>> = vec![HashSet::with_capacity(k); n];
```
k should be small enough that we can simply do a find on the slice instead of instantiating a hashset.
```rust
// Sample from reverse neighbors
let reverse: Vec<usize> = reverse_neighbors[i].iter().copied().collect();
let sample_size =
    ((reverse.len() as f64 * config.nn_descent_sample_rate).ceil() as usize).max(1);
let mut sampled_reverse = reverse.clone();
sampled_reverse.shuffle(rng);
sampled_reverse.truncate(sample_size);
```
Every node has exactly k forward neighbors, but a node can have "arbitrarily" many reverse neighbors, so using the reverse lists will cause scaling problems.
If we connect a reverse neighbor with another reverse neighbor, that pair can also be found by starting from the reverse-reverse neighbor and following two forward edges. The combination of a reverse and a forward neighbor can be found by starting from the reverse neighbor and following two forward edges.
The only combination that cannot be translated is the path from a forward neighbor to a reverse neighbor. In that scenario, the reversed path is again a forward-to-reverse path, and the middle trick would mean connecting two reverse neighbors. So those pairs won't be found, which is probably fine.
This change also means that edges are created for nodes other than the one under consideration, though. For that reason, it might be best to reserve 2k slots per node and clean up after every iteration (or whenever a slot overflows).
(Entirely vibe-coded by Opus 4.5.)