-
Notifications
You must be signed in to change notification settings - Fork 81
feat(remove): refactor Remove with mark remove mode #1551
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
base: main
Are you sure you want to change the base?
Conversation
Summary of ChangesHello @LHT129, I'm Gemini Code Assist1! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed! This pull request significantly enhances the vector management capabilities within the indexing system by introducing explicit modes for adding and removing vectors. The core change revolves around a 'mark remove' feature, which allows vectors to be logically deleted without immediate physical removal, providing greater flexibility for index maintenance and recovery. This refactoring impacts the fundamental Highlights
🧠 New Feature in Public Preview: You can now enable Memory to help Gemini Code Assist learn from your team's feedback. This makes future code reviews more consistent and personalized to your project's style. Click here to enable Memory in your admin console. Using Gemini Code AssistThe full guide for Gemini Code Assist can be found on our documentation page, here are some quick tips. Invoking Gemini You can request assistance from Gemini at any point by creating a comment using either
Customization To customize Gemini Code Assist for GitHub experience, repository maintainers can create a configuration file and/or provide a custom code review style guide (such as PEP-8 for Python) by creating and adding files to a Limitations & Feedback Gemini Code Assist may make mistakes. Please leave feedback on any instances where its feedback is incorrect or counter productive. You can react with 👍 and 👎 on @gemini-code-assist comments. If you're interested in giving your feedback about your experience with Gemini Code Assist for Github and other Google products, sign up here. You can also get AI-powered code generation, chat, as well as code reviews directly in the IDE at no cost with the Gemini Code Assist IDE Extension. Footnotes
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Code Review
This pull request introduces a significant refactoring of the Remove functionality to support a "mark for removal" (soft delete) mode. This is a well-executed change that improves the flexibility of the index. The API for Remove has been updated to handle batch operations and return a count of removed items, which is a good improvement. The implementation of soft deletes in HGraph and IVF through LabelTable and CombinedFilter is solid. I've included a few suggestions to improve performance and correctness in specific implementations.
I am having trouble creating individual review comments. Click here to see my feedback.
mockimpl/vsag/simpleflat.cpp (432-451)
The current implementation of Remove is inefficient for batch operations. It iterates through the IDs to be removed and, for each ID, it finds the element and resizes the underlying vectors. This results in O(M*N) complexity where M is the number of IDs to remove and N is the total number of elements. The repeated resizing of ids_ and data_ inside the loop is particularly expensive.
I suggest a more efficient implementation with O(N) complexity using a two-pointer approach to remove elements in-place in a single pass.
tl::expected<uint32_t, Error>
SimpleFlat::Remove(const std::vector<int64_t>& ids, RemoveMode mode) {
if (ids.empty()) {
return 0;
}
std::unordered_set<int64_t> ids_to_remove(ids.begin(), ids.end());
int64_t original_num_elements = num_elements_;
int64_t write_idx = 0;
for (int64_t read_idx = 0; read_idx < original_num_elements; ++read_idx) {
if (ids_to_remove.find(ids_[read_idx]) == ids_to_remove.end()) {
if (read_idx != write_idx) {
ids_[write_idx] = ids_[read_idx];
std::memcpy(data_.data() + write_idx * dim_,
data_.data() + read_idx * dim_,
dim_ * sizeof(float));
}
write_idx++;
}
}
uint32_t removed_count = original_num_elements - write_idx;
if (removed_count > 0) {
num_elements_ = write_idx;
ids_.resize(num_elements_);
data_.resize(num_elements_ * dim_);
}
return removed_count;
}src/algorithm/brute_force.cpp (150-179)
The Remove function incorrectly returns 1 regardless of the number of IDs successfully removed. The function signature indicates it should return uint32_t, which is the count of removed items. This is a bug. The function should count the number of successful removals and return that count.
uint32_t
BruteForce::Remove(const std::vector<int64_t>& ids, RemoveMode mode) {
CHECK_ARGUMENT(not use_attribute_filter_,
"remove is not supported when use_attribute_filter is true");
std::scoped_lock lock(this->add_mutex_, this->label_lookup_mutex_);
uint32_t removed_count = 0;
for (auto label : ids) {
const auto last_inner_id = static_cast<InnerIdType>(this->total_count_ - 1);
const auto inner_id = this->label_table_->GetIdByLabel(label);
CHECK_ARGUMENT(inner_id <= last_inner_id, "the element to be remove is invalid");
const auto last_label = this->label_table_->GetLabelById(last_inner_id);
this->label_table_->MarkRemove(label);
--this->label_table_->total_count_;
if (inner_id < last_inner_id) {
Vector<float> data(dim_, allocator_);
GetVectorByInnerId(last_inner_id, data.data());
this->label_table_->MarkRemove(last_label);
--this->label_table_->total_count_;
this->inner_codes_->InsertVector(data.data(), inner_id);
this->label_table_->Insert(inner_id, last_label);
}
this->total_count_--;
removed_count++;
}
return removed_count;
}src/algorithm/ivf.cpp (542-551)
The Remove implementation only handles RemoveMode::MARK_REMOVE. If another mode like REMOVE_AND_REPAIR is passed, the function does nothing and returns 0. This could be misleading. It would be better to explicitly handle unsupported modes, for example by throwing an exception, to make the behavior clear.
uint32_t
IVF::Remove(const std::vector<int64_t>& ids, RemoveMode mode) {
if (mode != RemoveMode::MARK_REMOVE) {
throw VsagException(ErrorType::UNSUPPORTED_INDEX_OPERATION, "IVF only supports MARK_REMOVE mode.");
}
uint32_t delete_count = 0;
std::scoped_lock label_lock(this->label_lookup_mutex_);
delete_count = this->label_table_->MarkRemove(ids);
delete_count_ += delete_count;
return delete_count;
}
74b36e9 to
f2c6246
Compare
| return true; | ||
| this->total_count_--; | ||
| } | ||
| return 1; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Remove should return the number of successfully deleted
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
only support mark remove, the repair remove will be supported soon
| GetNumElements() const override; | ||
|
|
||
| [[nodiscard]] int64_t | ||
| GetNumberRemoved() const override { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
rename GetNumberRemoved to GetNumRemoved for short
src/impl/filter/combined_filter.h
Outdated
|
|
||
| using CombinedFilterPtr = std::shared_ptr<CombinedFilter>; | ||
|
|
||
| } // namespace vsag No newline at end of file |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
add a blank line at EOF
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
done
src/index/hnsw.cpp
Outdated
| } | ||
| for (auto id : ids) { | ||
| try { | ||
| std::unique_lock lock(rw_mutex_); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
use std::scoped_lock instead
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
done
f4726c0 to
cb0be73
Compare
Codecov Report❌ Patch coverage is ❌ Your patch check has failed because the patch coverage (78.98%) is below the target coverage (80.00%). You can increase the patch coverage or adjust the target coverage. @@ Coverage Diff @@
## main #1551 +/- ##
==========================================
- Coverage 91.57% 91.29% -0.29%
==========================================
Files 328 329 +1
Lines 19263 19396 +133
==========================================
+ Hits 17641 17707 +66
- Misses 1622 1689 +67
Flags with carried forward coverage won't be shown. Click here to find out more.
Continue to review full report in Codecov by Sentry.
🚀 New features to boost your workflow:
|
1c1f44a to
0534b39
Compare
- add and refactor new interfaces - implement if mark remove for HGraph, IVF Signed-off-by: LHT129 <[email protected]>
closed: #1541