
feat: add partition artifacts for external vector backends #6463

Open
Xuanwo wants to merge 25 commits into main from xuanwo/partition-artifact-backend

@Xuanwo
Collaborator

@Xuanwo Xuanwo commented Apr 9, 2026

This PR adds a partition-addressable intermediate layout (a "partition artifact") that external IVF_PQ backends can produce and Lance can consume directly, replacing a lossy hand-off via a generic dataset.

Background

External backends such as pylance-cuvs already do the expensive work of an IVF_PQ build: they assign each row to a partition and encode PQ codes for the full dataset. Today the only way to hand that result back to Lance is to materialize it as a generic dataset; the finalizer then re-scans and re-groups the rows by partition, throwing away the partitioning the backend just computed.

Layout

An artifact is a small manifest plus a fixed number of bucketed Lance files:

artifact/
├── manifest.json
└── partitions/
    ├── bucket-00000.lance
    ├── bucket-00001.lance
    └── ...

Each row carries row_id, part_id, and pq_code, and is routed to a bucket file by bucket = part_id % num_buckets. The manifest records, per logical partition, the file it lives in and the (offset, num_rows) ranges inside that file:

partition 17 ->
  path: partitions/bucket-00005.lance
  ranges:
    - { offset: 1200, num_rows: 800 }
    - { offset: 4096, num_rows: 512 }
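The routing rule and the manifest entry above can be modeled in a few lines. This is a minimal sketch, not the code in this PR: the names `RowRange`, `PartitionEntry`, `bucket_for`, and `bucket_path` are hypothetical, and `num_buckets = 12` is an assumed value chosen only because it maps partition 17 to bucket 5 as in the example.

```python
from dataclasses import dataclass, field


@dataclass
class RowRange:
    """One contiguous run of rows for a partition inside a bucket file."""
    offset: int
    num_rows: int


@dataclass
class PartitionEntry:
    """Manifest entry for one logical partition (hypothetical shape)."""
    path: str
    ranges: list = field(default_factory=list)


def bucket_for(part_id: int, num_buckets: int) -> int:
    # Routing rule from the description: bucket = part_id % num_buckets.
    return part_id % num_buckets


def bucket_path(bucket: int) -> str:
    # File naming as shown in the layout tree: zero-padded to five digits.
    return f"partitions/bucket-{bucket:05d}.lance"


# Assumed bucket count; the description does not state the real value.
num_buckets = 12

entry = PartitionEntry(
    path=bucket_path(bucket_for(17, num_buckets)),
    ranges=[RowRange(1200, 800), RowRange(4096, 512)],
)

# Total rows in partition 17 is the sum over its recorded ranges.
total_rows = sum(r.num_rows for r in entry.ranges)
```

Several partitions share one bucket file whenever their ids are congruent modulo `num_buckets`, which is why the manifest has to record ranges rather than whole files.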

At finalize time, Lance reads only the recorded ranges for the partition it is building.
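A rough illustration of the range-restricted read, assuming a bucket file can be treated as an indexable sequence of rows. The function `read_partition` and the toy data are hypothetical; the real reader in this PR is `PartitionArtifactShuffleReader`, whose API is not shown here.

```python
def read_partition(bucket_rows, ranges):
    """Return only the rows covered by the recorded (offset, num_rows) ranges."""
    rows = []
    for offset, num_rows in ranges:
        rows.extend(bucket_rows[offset:offset + num_rows])
    return rows


# Toy bucket holding two partitions (5 and 17) written as two sorted runs.
# Each row is (row_id, part_id, pq_code).
bucket_rows = [
    (10, 5, b"\x01"), (11, 5, b"\x02"),                      # run 1, partition 5
    (12, 17, b"\x03"), (13, 17, b"\x04"), (14, 17, b"\x05"),  # run 1, partition 17
    (15, 5, b"\x06"),                                        # run 2, partition 5
    (16, 17, b"\x07"), (17, 17, b"\x08"),                    # run 2, partition 17
]

# Ranges the manifest would hold for partition 17 in this toy bucket.
ranges_17 = [(2, 3), (6, 2)]

partition_17 = read_partition(bucket_rows, ranges_17)
row_ids_17 = [row_id for row_id, _, _ in partition_17]
```

Rows belonging to partition 5 are never touched, which is the point of recording ranges per partition.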

Write path

The writer is streaming and bounded in memory. For each input batch it routes rows into per-bucket in-memory buffers. When a buffer fills, the writer sorts it by part_id, appends it to the bucket file, and records the covered ranges in the manifest. Because each flush appends its own sorted run, a partition can end up with several ranges in the same bucket file. There is no second read/sort/rewrite pass at finish().
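The write path can be sketched with an in-memory model. Everything here is an assumption for illustration: the class `ArtifactWriterSketch`, its `buffer_rows` threshold, and the use of Python lists in place of Lance bucket files are not taken from `PartitionArtifactBuilder`. The sketch also assumes that `finish()` only flushes buffers that are still partially filled.

```python
from collections import defaultdict
from itertools import groupby


class ArtifactWriterSketch:
    """Toy model of the streaming writer; lists stand in for bucket files."""

    def __init__(self, num_buckets, buffer_rows):
        self.num_buckets = num_buckets
        self.buffer_rows = buffer_rows      # flush threshold per bucket
        self.buffers = defaultdict(list)    # bucket -> pending rows
        self.files = defaultdict(list)      # bucket -> rows already "on disk"
        self.manifest = defaultdict(list)   # part_id -> [(offset, num_rows)]

    def write_batch(self, rows):
        # Each row is (row_id, part_id, pq_code).
        for row in rows:
            bucket = row[1] % self.num_buckets
            self.buffers[bucket].append(row)
            if len(self.buffers[bucket]) >= self.buffer_rows:
                self._flush(bucket)

    def _flush(self, bucket):
        # Sort the buffer by part_id, then append one run per partition.
        buf = sorted(self.buffers.pop(bucket, []), key=lambda r: r[1])
        out = self.files[bucket]
        for part_id, run in groupby(buf, key=lambda r: r[1]):
            run = list(run)
            self.manifest[part_id].append((len(out), len(run)))
            out.extend(run)

    def finish(self):
        # Flush leftovers only; existing data is never re-read or re-sorted.
        for bucket in list(self.buffers):
            self._flush(bucket)
        return dict(self.manifest)


writer = ArtifactWriterSketch(num_buckets=2, buffer_rows=3)
writer.write_batch([
    (0, 3, b"a"), (1, 1, b"b"), (2, 3, b"c"),
    (3, 1, b"d"), (4, 0, b"e"), (5, 3, b"f"),
])
manifest = writer.finish()
```

In this run, partition 3 is written by two separate flushes, so its manifest entry holds two ranges in the same bucket. Peak memory is bounded by roughly `num_buckets * buffer_rows` rows regardless of dataset size.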

Changes

Rust:

  • PartitionArtifactBuilder and PartitionArtifactShuffleReader
  • precomputed_partition_artifact_uri plumbed into the existing vector finalization flow

Python:

  • accelerator="cuvs" becomes a thin runtime delegation to the external pylance-cuvs backend. No CUDA/cuVS code lives in this tree.
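One way to picture a thin runtime delegation is a lazy import keyed by the accelerator name. This is a generic sketch under stated assumptions, not the code in this PR: `load_backend` and the `backends` mapping are hypothetical, and `pylance_cuvs` is a guessed import name for the pylance-cuvs package.

```python
import importlib


def load_backend(accelerator, backends):
    """Import the external backend module registered for an accelerator name."""
    module_name = backends.get(accelerator)
    if module_name is None:
        raise ValueError(f"unknown accelerator: {accelerator!r}")
    try:
        # The import happens only when the accelerator is requested, so the
        # core package carries no hard dependency on the backend.
        return importlib.import_module(module_name)
    except ImportError as exc:
        raise ImportError(
            f"accelerator={accelerator!r} requires the external module "
            f"{module_name!r}, which is not installed"
        ) from exc


# Hypothetical registry; the module name is an assumption.
BACKENDS = {"cuvs": "pylance_cuvs"}
```

A caller would resolve the backend once, for example `backend = load_backend("cuvs", BACKENDS)`, and hand the build over to it. Users without the backend installed get an import error only if they ask for that accelerator.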

Non-goals

The on-disk IVF_PQ format, finalizer semantics, and the CPU build path are unchanged. This PR only adds a new input boundary for external backends.

@github-actions github-actions bot added the python label Apr 9, 2026
@github-actions
Contributor

github-actions bot commented Apr 9, 2026

ACTION NEEDED
Lance follows the Conventional Commits specification for release automation.

The PR title and description are used as the merge commit message. Please update your PR title and description to match the specification.

For details on the error please inspect the "PR Title Check" action.

@Xuanwo Xuanwo changed the title from "Add partition artifacts for external vector backends" to "feat: add partition artifacts for external vector backends" Apr 9, 2026
@github-actions github-actions bot added the enhancement New feature or request label Apr 9, 2026
@codecov

codecov bot commented Apr 15, 2026

Codecov Report

❌ Patch coverage is 77.79456% with 147 lines in your changes missing coverage. Please review.

| Files with missing lines | Patch % | Lines |
|---|---|---|
| rust/lance/src/index/vector/partition_artifact.rs | 81.84% | 91 Missing and 21 partials ⚠️ |
| rust/lance/src/index/vector/ivf.rs | 18.18% | 14 Missing and 4 partials ⚠️ |
| rust/lance/src/index/vector/builder.rs | 19.04% | 16 Missing and 1 partial ⚠️ |
