
Conversation

@jorendorff (Contributor)

Implementation of the FAMST algorithm from Almansoori & Telek (2025).

Features:

  • Generic over data type T and distance function (see the sketch below)
  • Uses NN-Descent for ANN graph construction
  • Three-phase approach: ANN graph, component connection, edge refinement
  • O(dn log n) time complexity, O(dn + kn) space complexity

Paper: https://arxiv.org/abs/2507.14261

(Entirely vibe-coded by Opus 4.5.)
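
To make "generic over data type T and distance function" concrete, here is a minimal sketch of the kind of distance closures such an interface accepts. The slice-based point types are illustrative only; the PR's exact trait bounds may differ.

```rust
// Sketch only: examples of distance functions the generic interface is meant
// to accept. The PR's exact trait bounds may differ.
fn euclidean(a: &[f64], b: &[f64]) -> f64 {
    a.iter()
        .zip(b)
        .map(|(x, y)| (x - y) * (x - y))
        .sum::<f64>()
        .sqrt()
}

fn hamming(a: &[u8], b: &[u8]) -> f64 {
    // Non-numeric data works too: the algorithm only ever asks for pairwise distances.
    a.iter().zip(b).filter(|(x, y)| x != y).count() as f64
}

fn main() {
    assert_eq!(euclidean(&[0.0, 0.0], &[3.0, 4.0]), 5.0);
    assert_eq!(hamming(b"karolin", b"kathrin"), 3.0);
}
```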

@jorendorff requested a review from a team as a code owner (January 14, 2026 01:03)
Copilot AI review requested due to automatic review settings (January 14, 2026 01:03)

Copilot AI left a comment

Copilot encountered an error and was unable to review this pull request. You can try again by re-requesting a review.

Comment on lines +251 to +254
// Max-heap: larger distances have higher priority
self.distance
    .partial_cmp(&other.distance)
    .unwrap_or(std::cmp::Ordering::Equal)

Copilot AI Jan 14, 2026

The NeighborEntry struct implements Ord by comparing f64 values directly, but f64 comparison with NaN values can be problematic. While partial_cmp is used with unwrap_or(Equal) as a fallback, this could hide bugs where NaN distances are computed. Consider adding validation to ensure the distance function never returns NaN, or document this assumption clearly.

This issue also appears in the following locations of the same file:

  • line 629
  • line 384
  • line 472
Suggested change
// Max-heap: larger distances have higher priority
self.distance
    .partial_cmp(&other.distance)
    .unwrap_or(std::cmp::Ordering::Equal)
// Max-heap: larger distances have higher priority.
// We assume distances are finite and not NaN; treat NaN as a programming error.
match self.distance.partial_cmp(&other.distance) {
    Some(ordering) => ordering,
    None => panic!("NeighborEntry::cmp encountered NaN distance; distances must be finite"),
}
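
If NaN-free distances cannot be guaranteed up front, another option is `f64::total_cmp`, which defines a total order over all floats. A sketch of that alternative; the field names are guesses at the PR's `NeighborEntry`, not its actual definition:

```rust
use std::cmp::Ordering;

// Field names are a guess at the PR's NeighborEntry; only the f64 matters here.
struct NeighborEntry {
    index: usize,
    distance: f64,
}

impl PartialEq for NeighborEntry {
    fn eq(&self, other: &Self) -> bool {
        self.cmp(other) == Ordering::Equal
    }
}

impl Eq for NeighborEntry {}

impl PartialOrd for NeighborEntry {
    fn partial_cmp(&self, other: &Self) -> Option<Ordering> {
        Some(self.cmp(other))
    }
}

impl Ord for NeighborEntry {
    fn cmp(&self, other: &Self) -> Ordering {
        // f64::total_cmp (stable since Rust 1.62) implements the IEEE 754
        // totalOrder predicate, so NaNs sort to the extremes instead of being
        // silently treated as equal and corrupting the heap invariant.
        self.distance.total_cmp(&other.distance)
    }
}
```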

Comment on lines +585 to +599
// Add edges where j < i (reverse direction in ANN graph)
for (i, (neighbors, distances)) in ann_graph
    .neighbors
    .iter()
    .zip(ann_graph.distances.iter())
    .enumerate()
{
    for (&j, &d) in neighbors.iter().zip(distances.iter()) {
        if j < i {
            // Check if this edge isn't already added
            all_edges.push(Edge::new(j, i, d));
        }
    }
}


Copilot AI Jan 14, 2026

The edge deduplication logic in this function adds edges twice from the ANN graph - once for i < j (lines 578-582) and once for j < i (lines 592-596). This creates duplicate edges unnecessarily before deduplication. The directed ANN graph edges should only need to be added once. Consider refactoring to avoid creating duplicate edges in the first place, which would improve performance.

Suggested change
// Add edges where j < i (reverse direction in ANN graph)
for (i, (neighbors, distances)) in ann_graph
    .neighbors
    .iter()
    .zip(ann_graph.distances.iter())
    .enumerate()
{
    for (&j, &d) in neighbors.iter().zip(distances.iter()) {
        if j < i {
            // Check if this edge isn't already added
            all_edges.push(Edge::new(j, i, d));
        }
    }
}
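
For illustration, one way to avoid ever pushing the same ANN edge twice is to normalize the endpoints when edges are first collected and deduplicate once. This sketch assumes `Edge` stores (smaller index, larger index, weight), which may not match the PR's exact types:

```rust
struct Edge {
    a: usize,
    b: usize,
    weight: f64,
}

impl Edge {
    fn new(a: usize, b: usize, weight: f64) -> Self {
        Edge { a, b, weight }
    }
}

// Collect each undirected ANN edge exactly once by normalizing (i, j) to
// (min, max) up front, then sorting and deduplicating by endpoints. Both
// directions of an ANN edge carry the same distance, so which duplicate
// survives does not matter.
fn collect_unique_edges(neighbors: &[Vec<usize>], distances: &[Vec<f64>]) -> Vec<Edge> {
    let mut all_edges = Vec::new();
    for (i, (nbrs, dists)) in neighbors.iter().zip(distances).enumerate() {
        for (&j, &d) in nbrs.iter().zip(dists) {
            if i != j {
                all_edges.push(Edge::new(i.min(j), i.max(j), d));
            }
        }
    }
    all_edges.sort_by(|x, y| (x.a, x.b).cmp(&(y.a, y.b)));
    all_edges.dedup_by_key(|e| (e.a, e.b));
    all_edges
}
```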

Comment on lines +160 to +172
let n = data.len();
if n == 0 {
return FamstResult {
edges: vec![],
total_weight: 0.0,
};
}
if n == 1 {
return FamstResult {
edges: vec![],
total_weight: 0.0,
};
}

Copilot AI Jan 14, 2026

There are no tests for edge cases such as empty data (n=0) or single-point data (n=1). While the code handles these cases, adding explicit tests would ensure this behavior is maintained and documented.

This issue also appears in the following locations of the same file:

  • line 142
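
For illustration, here is a sketch of such tests, written against a generic builder closure because the exact entry-point signature is not visible in this hunk; adapt the closure to whatever the module actually exports.

```rust
// Sketch of degenerate-input tests. `build_mst` stands in for however the test
// module invokes the FAMST entry point; the (edges, total_weight) tuple mirrors
// the FamstResult fields visible in this diff, and the n = 2 case assumes a
// Euclidean distance function.
fn check_degenerate_inputs<F>(build_mst: F)
where
    F: Fn(&[Vec<f64>]) -> (Vec<(usize, usize, f64)>, f64),
{
    // n = 0: no points, no edges, zero weight.
    let (edges, total) = build_mst(&[]);
    assert!(edges.is_empty());
    assert_eq!(total, 0.0);

    // n = 1: a single point also yields an empty MST.
    let (edges, total) = build_mst(&[vec![1.0, 2.0]]);
    assert!(edges.is_empty());
    assert_eq!(total, 0.0);

    // n = 2: exactly one edge, whose weight is the pairwise distance.
    let (edges, total) = build_mst(&[vec![0.0, 0.0], vec![3.0, 4.0]]);
    assert_eq!(edges.len(), 1);
    assert!((total - 5.0).abs() < 1e-9);
}
```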


for i in 0..n {
let mut heap = BinaryHeap::with_capacity(k);
let mut indices: Vec<usize> = (0..n).filter(|&j| j != i).collect();
Collaborator

this is O(n)!!
We need an O(k) solution here
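
For illustration, a rejection-sampling sketch that is expected O(k) per point when k << n (the regime FAMST targets). It uses rand 0.8's `Rng::gen_range`; the names are illustrative, not the PR's:

```rust
use rand::Rng;

// Draw k distinct random indices from 0..n, excluding `i`, without building an
// O(n) candidate vector. Each draw does an O(k) scan of the small output
// buffer, which is cheap for the k values used in ANN graphs.
fn sample_initial_neighbors<R: Rng>(rng: &mut R, n: usize, k: usize, i: usize) -> Vec<usize> {
    assert!(k < n, "need at least k + 1 points to pick k distinct neighbors");
    let mut out = Vec::with_capacity(k);
    while out.len() < k {
        let j = rng.gen_range(0..n);
        if j != i && !out.contains(&j) {
            out.push(j);
        }
    }
    out
}
```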

}

// Initialize with random neighbors using max-heap for each point
let mut heaps: Vec<BinaryHeap<NeighborEntry>> = Vec::with_capacity(n);
Collaborator

Can't we allocate one gigantic vector and slice it instead of creating millions of tiny heaps?
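
For illustration, a sketch of the single-allocation layout being suggested: one `n * k` buffer, with each point's candidates accessed as a slice. Names and the update rule are illustrative, not the PR's:

```rust
// One contiguous n * k buffer instead of n separately allocated heaps.
// Point i's candidate neighbors live in candidates[i * k .. (i + 1) * k].
#[derive(Clone, Copy)]
struct Candidate {
    index: usize,
    distance: f64,
}

struct NeighborTable {
    k: usize,
    candidates: Vec<Candidate>,
}

impl NeighborTable {
    fn new(n: usize, k: usize) -> Self {
        let unfilled = Candidate { index: usize::MAX, distance: f64::INFINITY };
        NeighborTable { k, candidates: vec![unfilled; n * k] }
    }

    fn row_mut(&mut self, i: usize) -> &mut [Candidate] {
        &mut self.candidates[i * self.k..(i + 1) * self.k]
    }

    // Replace the current worst candidate if the new one is closer. With small
    // k, a linear scan for the maximum stands in for the per-point max-heap.
    fn try_insert(&mut self, i: usize, cand: Candidate) -> bool {
        let row = self.row_mut(i);
        let (worst_pos, worst_dist) = row
            .iter()
            .enumerate()
            .max_by(|a, b| a.1.distance.total_cmp(&b.1.distance))
            .map(|(pos, c)| (pos, c.distance))
            .expect("k must be > 0");
        if cand.distance < worst_dist {
            row[worst_pos] = cand;
            true
        } else {
            false
        }
    }
}
```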


// Initialize with random neighbors using max-heap for each point
let mut heaps: Vec<BinaryHeap<NeighborEntry>> = Vec::with_capacity(n);
let mut neighbor_sets: Vec<HashSet<usize>> = vec![HashSet::with_capacity(k); n];
Collaborator

k should be small enough that we can simply do a find on the slice instead of instantiating a hashset.
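
For illustration, the duplicate check against a k-sized row without a `HashSet`:

```rust
// For the small k used in ANN graphs, a linear scan over a slice that is
// already in cache is typically cheaper than hashing, and it avoids allocating
// one HashSet per point.
fn is_known_neighbor(row: &[usize], candidate: usize) -> bool {
    row.contains(&candidate)
}

fn main() {
    let row = [17, 3, 42, 8];
    assert!(is_known_neighbor(&row, 42));
    assert!(!is_known_neighbor(&row, 5));
}
```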

Comment on lines +323 to +329
// Sample from reverse neighbors
let reverse: Vec<usize> = reverse_neighbors[i].iter().copied().collect();
let sample_size =
    ((reverse.len() as f64 * config.nn_descent_sample_rate).ceil() as usize).max(1);
let mut sampled_reverse = reverse.clone();
sampled_reverse.shuffle(rng);
sampled_reverse.truncate(sample_size);
@aneubeck (Collaborator) Jan 14, 2026

There are exactly k forward neighbors for every node, but there can be "arbitrarily" many reverse neighbors, so using the reverse ones will cause scaling problems!

If we connect a reverse neighbor with another reverse neighbor, then that pair can also be found by starting from the reverse-reverse neighbor and following two forward edges.

The combination of a reverse and a forward neighbor can be found by following two forward edges starting from the reverse neighbor.

The only combination which cannot be translated is the path from a forward neighbor to a reverse neighbor. In this scenario the reverse path is again a forward-to-reverse path, and the middle trick would mean connecting two reverse neighbors. So those pairs won't be found, which is probably fine.

This change also means that we will create candidate edges for nodes other than the one under consideration, though. For that reason, it might be best to reserve 2k slots per node and then clean up after every iteration (or when the slots overflow).
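
For illustration, a sketch of the forward-only local join described above; the callback and distance function are placeholders for the PR's own candidate-update machinery:

```rust
// For every point, propose each pair (j, l) of its forward neighbors as mutual
// candidates. From j's perspective this recovers the reverse/forward
// combinations via the middle point, without ever materializing unbounded
// reverse-neighbor lists. Note that the proposals update the candidate rows of
// j and l, i.e. of nodes other than the one whose neighbor list is being
// scanned, which is why the comment above suggests reserving ~2k slots per row
// and compacting after each iteration.
fn forward_local_join<D>(
    neighbors: &[Vec<usize>],              // k forward neighbors per point
    distance: D,
    mut propose: impl FnMut(usize, usize, f64),
) where
    D: Fn(usize, usize) -> f64,
{
    for nbrs in neighbors {
        for (a, &j) in nbrs.iter().enumerate() {
            for &l in &nbrs[a + 1..] {
                if j != l {
                    propose(j, l, distance(j, l));
                }
            }
        }
    }
}
```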
