close
Skip to content

feat: support merging zonemap index segments#7128

Open
Xuanwo wants to merge 5 commits into
mainfrom
xuanwo/zonemap-segment-merge
Open

feat: support merging zonemap index segments#7128
Xuanwo wants to merge 5 commits into
mainfrom
xuanwo/zonemap-segment-merge

Conversation

@Xuanwo
Copy link
Copy Markdown
Collaborator

@Xuanwo Xuanwo commented Jun 5, 2026

ZoneMap already supports fragment-scoped segment builds and multi-segment query, but its segments could not be consolidated through the canonical distributed indexing merge path. This adds a ZoneMap merge primitive that rewrites source segments into one self-contained physical segment while preserving fragment coverage and rows-per-zone metadata, and updates the distributed indexing guide to include zone map merge support.

@github-actions github-actions Bot added A-index Vector index, linalg, tokenizer A-docs Documentation enhancement New feature or request labels Jun 5, 2026
@Xuanwo Xuanwo marked this pull request as ready for review June 5, 2026 10:03
Copy link
Copy Markdown

@claude claude Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Claude Code Review

This repository is configured for manual code reviews. Comment @claude review to trigger a review and subscribe this PR to future pushes, or @claude review once for a one-time review.

Tip: disable this comment in your organization's Code Review settings.

@Xuanwo
Copy link
Copy Markdown
Collaborator Author

Xuanwo commented Jun 5, 2026

@claude review once

@codecov
Copy link
Copy Markdown

codecov Bot commented Jun 5, 2026

Codecov Report

❌ Patch coverage is 93.65854% with 26 lines in your changes missing coverage. Please review.

Files with missing lines Patch % Lines
rust/lance/src/index/scalar/zonemap.rs 73.21% 9 Missing and 6 partials ⚠️
rust/lance-index/src/scalar/zonemap.rs 75.60% 6 Missing and 4 partials ⚠️
rust/lance/src/index/scalar_logical.rs 99.66% 0 Missing and 1 partial ⚠️

📢 Thoughts on this report? Let us know!

Comment thread docs/src/guide/distributed_indexing.md Outdated

Within a single commit, built segments must have disjoint fragment coverage.

`merge_existing_index_segments(...)` currently supports vector, inverted, bitmap,
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🟡 The new paragraph in docs/src/guide/distributed_indexing.md says merge_existing_index_segments supports "vector, inverted, bitmap, and zone map segments", but BTree is also supported (it was added in #6889 / c7ae55a). The list should include btree to avoid misleading users; suggest "vector, inverted, bitmap, btree, and zone map segments".

Extended reasoning...

What the bug is

The PR adds a new prose paragraph at docs/src/guide/distributed_indexing.md:109-113 describing which index families merge_existing_index_segments(...) can merge. It enumerates the supported types as "vector, inverted, bitmap, and zone map segments" and tells users that "other scalar index families … cannot be merged into a larger physical segment until they add a merge implementation." BTree is missing from this list even though it has a working merge_segments implementation.

Why the existing code already supports BTree

The supported-type guard in rust/lance/src/index.rs:1136-1138 explicitly checks all_btree = source_segments.iter().all(segment_has_btree_details) and includes it in the validity test alongside vector / inverted / bitmap / zonemap. The dispatch at rust/lance/src/index.rs:1158-1159 routes the all-BTree case to crate::index::scalar::btree::merge_segments (the trailing else arm). The BTree merge primitive itself lives at rust/lance/src/index/scalar/btree.rs:90 (pub(crate) async fn merge_segments). All three pieces were landed by PR #6889 (commit c7ae55a 'feat: add segmented BTree index merge_segments support'), which is already on main and predates this PR.

Why this matters

The docs now actively contradict the code. A user reading the new paragraph will reasonably conclude that BTree segments cannot be merged via merge_existing_index_segments and that they must keep paying the cost of many small per-fragment BTree segments, or wait for someone to "add a merge implementation" that already exists. The same paragraph also tells them which families fall under the "cannot be merged" caveat, so the omission is doubly misleading — it both removes BTree from the supported list and implies it belongs in the unsupported list.

Step-by-step proof

  1. Stage two BTree segments with disjoint fragment coverage (the existing tests in rust/lance/src/index/scalar_logical.rs already do exactly this — see test_open_named_scalar_index_uses_all_btree_segments and test_btree_segment_search_is_exact_across_fragments, which call execute_uncommitted per fragment and then commit).
  2. Call dataset.merge_existing_index_segments(staged).await.
  3. merge_existing_index_segments (index.rs:1127ff) computes all_btree = true, passes the guard at line 1138 (!all_btree is false → no error), and at line 1158-1163 enters the else branch, calling crate::index::scalar::btree::merge_segments(self, source_segments).await?.
  4. That returns a merged IndexMetadata, which is the contract the new paragraph implies is unavailable for BTree.

So the API does merge BTree today; only the prose says otherwise.

Addressing the duplicate-refutation note

A refuter argued bug_001 should be dropped in favor of earlier IDs (bug_002 / bug_003) per the duplicates rule. The synthesis step has already taken bug_001 as the canonical instance and merged the duplicates into it, which is the standard resolution — the report itself is still correct and the underlying defect is real. The other two verifier confirmations independently checked the code paths and reached the same conclusion (one even with file:line citations matching mine), so I am submitting under bug_001.

Fix

In docs/src/guide/distributed_indexing.md:109, change "vector, inverted, bitmap, and zone map segments" to "vector, inverted, bitmap, btree, and zone map segments" (or equivalent ordering). No code change is required.

Severity: nit — documentation-only inconsistency, no runtime impact, but it is introduced by the same paragraph this PR adds, so it is in-scope to correct here.

Copy link
Copy Markdown
Contributor

@BubbleCal BubbleCal left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

One concern about the merge-then-commit path.

uuid: new_uuid,
fields: vec![field_id],
dataset_version: dataset.manifest.version,
fragment_bitmap: Some(fragment_bitmap),
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Could we add a merge-then-commit regression test here? Since the merged segment keeps only live fragment coverage, committing it under the same index name can compare against the old segment's original coverage and report retired fragments as orphaned.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

A-docs Documentation A-index Vector index, linalg, tokenizer enhancement New feature or request

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants