Merged
34 commits
d60d6a5
a clean dataset loader
jshook Apr 6, 2026
53b3b32
disable legacy loaders
jshook Apr 6, 2026
43a843d
docs for local/remote, improved errors and robustness
jshook Apr 6, 2026
0c4d637
This change keeps the simplified catalog-driven loader structure, but…
tlwillke Apr 6, 2026
e1c357f
Set all datasets to NOSCRUB
jshook Apr 6, 2026
7733fe7
add local datasets stub
jshook Apr 6, 2026
967744b
include baseurl indirection, catalog auto-discovery for local
jshook Apr 6, 2026
ab1a928
set to COSINE, update HelloVectorWorld, name datasets
jshook Apr 6, 2026
328473d
override gitignore for example
jshook Apr 6, 2026
d07834d
move entries example to a managed file
jshook Apr 6, 2026
e8c6b5b
refine dataset loaders to use yaml settings
jshook Apr 7, 2026
139e107
fully switch over to new dataset loader, remove previous versions whi…
jshook Apr 7, 2026
1c11fe8
comment out non-shared datasets
jshook Apr 7, 2026
9bc8365
fix env var tests for windows
jshook Apr 7, 2026
801eb55
add docs for datasets
jshook Apr 7, 2026
3c8a8a9
Updated javadoc and default similarity metrics in dataset_metadata.
tlwillke Apr 7, 2026
592de64
Moved dataset-specific yaml index parameters to index-parameters dire…
tlwillke Apr 7, 2026
3756302
Refactored the yaml dataset-catalogs. Caching of public and protecte…
tlwillke Apr 7, 2026
39c0082
Update rat-excludes for yaml files.
tlwillke Apr 7, 2026
856e1e9
Incremental rat-excludes update.
tlwillke Apr 7, 2026
4717bec
Corrected path references in benchmarking doc.
tlwillke Apr 7, 2026
cdf5d7d
Improved handling of null returned by expectedSize during S3 transfer…
tlwillke Apr 7, 2026
fa128d7
Updated run-bench.yml GHA to remove .env secret and add protected cat…
tlwillke Apr 8, 2026
2e88381
Minor tweak to GHA secrets fix.
tlwillke Apr 8, 2026
f759507
Added a comment about how to generate the catalog secret.
tlwillke Apr 8, 2026
3af0d9a
Remove unnecessary suppression of unchecked warnings in new loader.
tlwillke Apr 8, 2026
5d63794
Scrubbed .hdf5 references from the docs and YAML files.
tlwillke Apr 8, 2026
78a3c4c
Minor correction to benchmarking.md directory reference.
tlwillke Apr 8, 2026
671d87c
Make default dataset metadata loading robust to both repo-root and mo…
tlwillke Apr 8, 2026
c947aa5
Added happy-path test coverage for constructor-driven remote catalog …
tlwillke Apr 8, 2026
36131a7
Cache included remote catalogs locally so previously downloaded remot…
tlwillke Apr 8, 2026
271d73d
Ignore the .catalog-cache.
tlwillke Apr 8, 2026
dbb2469
add safety redaction around logging
jshook Apr 8, 2026
3696935
Cleaned up datasets.yml with defaults and removed neighborhood watch …
tlwillke Apr 8, 2026
10 changes: 10 additions & 0 deletions .gitignore
@@ -3,10 +3,18 @@ local/
.mvn/wrapper/maven-wrapper.jar
.java-version
.bob/
dataset_
**/local_datasets/**

### Bench caches
pq_cache/
index_cache/
dataset_cache/

### Data catalogs
jvector-examples/yaml-configs/dataset-catalogs/*.yaml
jvector-examples/yaml-configs/dataset-catalogs/*.yml
!jvector-examples/yaml-configs/dataset-catalogs/public-catalog.yaml

### Logging (or whatever you use)
logging/
@@ -49,3 +57,5 @@ hdf5/
# JMH generated files
dependency-reduced-pom.xml
results.csv
**/datasets/custom/**
**/dataset_cache/**
63 changes: 28 additions & 35 deletions docs/benchmarking.md
@@ -18,7 +18,7 @@ The general procedure for running benchmarks is mentioned below. The following s
- [Specify the dataset](#specifying-datasets) names to benchmark in `datasets.yml`.
- Certain datasets will be downloaded automatically. If using a different dataset, make sure the dataset files are downloaded and made available (refer the section on [Custom datasets](#custom-datasets)).
- Adjust the benchmark parameters in `default.yml`. This will affect the parameters for all datasets to be benchmarked. You can specify custom parameters for a specific dataset by creating a file called `<your-dataset-name>.yml` in the same folder.
- Decide on the kind of measurements and logging you want and configure them in `run.yml`.
- Decide on the kind of measurements and logging you want and configure them in `run-config.yml`.

You can run the configured benchmark with maven:
```sh
@@ -35,7 +35,7 @@ Datasets are assumed to be Fvec/Ivec based unless the entry in the `datasets.yml

You'll notice that datasets are grouped into categories. The categories can be arbitrarily chosen for convenience and are not currently considered by the benchmarking system.

For HDF5 files, the substrings `-angular`, `-euclidean` and `-dot` correspond to cosine similarity, L2 distance, and dot product similarity functions (these substrings ARE considered to be part of the "dataset name"). Currently, Fvec/Ivec datasets are implicitly assumed to use cosine similarity (changing this requires editing `DataSetLoaderMFD.java`).
Dataset similarity functions are configured in `jvector-examples/yaml-configs/dataset-metadata.yml`.

Example `datasets.yml`:

@@ -67,15 +67,15 @@ construction:
```
will build and benchmark four graphs, one for each combination of M and ef in {(32, 100), (64, 100), (32, 200), (64, 200)}. This is particularly useful when running a Grid search to identify the best performing parameters.
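
The grid expansion described above can be sketched in Java as follows. This is only an illustration: `ParameterGrid` and `combinations` are hypothetical names that do not appear in this PR, and the benchmark may enumerate its parameters differently.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper, not part of JVector: enumerates the Cartesian product
// of two parameter lists, as in the (M, ef) grid described above.
class ParameterGrid {
    // Returns every {M, ef} pair; ef varies in the outer loop so the order
    // matches the listing (32, 100), (64, 100), (32, 200), (64, 200).
    static List<int[]> combinations(List<Integer> ms, List<Integer> efs) {
        List<int[]> out = new ArrayList<>();
        for (int ef : efs) {
            for (int m : ms) {
                out.add(new int[] {m, ef});
            }
        }
        return out;
    }

    public static void main(String[] args) {
        for (int[] c : combinations(List.of(32, 64), List.of(100, 200))) {
            System.out.println("M=" + c[0] + ", ef=" + c[1]);
        }
    }
}
```

The number of graphs built is the product of the list lengths, so adding a third value to either list raises the count from four to six.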

### run.yml
### run-config.yml

This file contains configurations for
- Specifying the measurements you want to report, like QPS, latency and recall
- Specifying where to output these measurements, i.e. to the console, or to a file, or both.

The configurations in this file are "run-level", meaning that they are shared across all the datasets being benchmarked.

See `run.yml` for a full list of all options.
See `run-config.yml` for a full list of all options.

## Running `bench` from the command line

@@ -92,39 +92,32 @@ mvn compile exec:exec@bench -pl jvector-examples -am -DbenchArgs="glove nytimes"

## Custom Datasets

### Custom Fvec/Ivec datasets

Using fvec/ivec datasets requires them to be configured in `DataSetLoaderMFD.java`. Some datasets are already pre-configured; these will be downloaded and used automatically on running the benchmark.

To use a custom dataset consisting of files `base.fvec`, `queries.fvec` and `neighbors.ivec`, do the following:
- Ensure that you have three files:
- `base.fvec` containing N D-dimensional float vectors. These are used to build the index.
- `queries.fvec` containing Q D-dimensional float vectors. These are used for querying the built index.
- `neighbors.ivec` containing Q K-dimensional integer vectors, one for each query vector, representing the exact K-nearest neighbors for that query among the base vectors.
The files can be named however you like.
- Save all three files somewhere in the `fvec` directory in the root of the `jvector` repo (if it doesn't exist, create it). It's recommended to create at least one sub-folder with the name of the dataset and copy or move all three files there.
- Edit `DataSetLoaderMFD.java` to configure a new dataset and it's associated files:
```java
put("cust-ds", new MultiFileDatasource("cust-ds",
"cust-ds/base.fvec",
"cust-ds/query.fvec",
"cust-ds/neighbors.ivec"));
Datasets are configured via YAML catalog files under `jvector-examples/datasets/`. The loader recursively discovers all `.yaml`/`.yml` files in that directory tree. See `jvector-examples/datasets/public/example_datasets_config.yaml` for the full format reference.
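
Recursive discovery of catalog files could look roughly like the sketch below. It assumes nothing about the loader's real implementation: `CatalogDiscovery` and `findCatalogFiles` are hypothetical names, and only standard `java.nio.file` calls are used.

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.List;
import java.util.stream.Collectors;
import java.util.stream.Stream;

// Hypothetical sketch, not the loader's actual code: walks a directory tree
// and collects every regular file ending in .yaml or .yml.
class CatalogDiscovery {
    static List<Path> findCatalogFiles(Path root) {
        // Files.walk returns a lazy stream that must be closed after use.
        try (Stream<Path> paths = Files.walk(root)) {
            return paths
                    .filter(Files::isRegularFile)
                    .filter(CatalogDiscovery::isYaml)
                    .sorted()
                    .collect(Collectors.toList());
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    static boolean isYaml(Path p) {
        String name = p.getFileName().toString();
        return name.endsWith(".yaml") || name.endsWith(".yml");
    }

    public static void main(String[] args) {
        // The default path is the catalog root named in the text above.
        Path root = Path.of(args.length > 0 ? args[0] : "jvector-examples/datasets");
        findCatalogFiles(root).forEach(System.out::println);
    }
}
```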

To add a custom fvec/ivec dataset:

1. Create a directory under `jvector-examples/datasets/` (e.g. `custom/mydata/`).
2. Add a `.yaml` file mapping your dataset name to its files:
```yaml
_defaults:
  cache_dir: ${DATASET_CACHE_DIR:-dataset_cache}

my-dataset:
  base: my_base_vectors.fvecs
  query: my_query_vectors.fvecs
  gt: my_ground_truth.ivecs
```
3. Place your fvec/ivec files in the same directory (or specify a `cache_dir` / `base_url` to fetch them from a remote source).
4. Add the dataset's similarity function to `jvector-examples/yaml-configs/dataset-metadata.yml`:
```yaml
my-dataset:
  similarity_function: COSINE
  load_behavior: NO_SCRUB
```
The file paths are resolved relative to the `fvec` directory. `cust-ds` is the name of the dataset and can be changed to whatever is appropriate.
- In `jvector-examples/yaml-configs/datasets.yml`, add an entry corresponding to your custom dataset. Comment out other datasets which you do not want to benchmark.
5. Add the dataset name to `jvector-examples/yaml-configs/datasets.yml` so BenchYAML can find it:
```yaml
custom:
- cust-ds
- my-dataset
```

## Custom HDF5 datasets

HDF5 datasets consist of a single file. The Hdf5Loader looks for three HDF5 datasets within the file, `train`, `test` and `neighbors`. These correspond to the base, query and neighbors vectors described above for fvec/ivec files.

To use an HDF5 dataset, edit `jvector-examples/yaml-configs/datasets.yml` to add an entry like the following:
```yaml
category:
- <dataset-name>.hdf5
```

BenchYAML looks for hdf5 datasets with the name `<dataset-name>.hdf5` in the `hdf5` folder in the root of this repo. If the file doesn't exist, BenchYAML will attempt to automatically download the dataset from ann-benchmarks.com. If your dataset is not from ann-benchmarks.com, simply ensure that the dataset is available in the `hdf5` folder and edit `datasets.yml` accordingly.
For remote datasets, use `base_url` to specify where files should be downloaded from. The `${VAR}` and `${VAR:-default}` syntax is supported for environment variable expansion. See the example config for details.
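
As a rough illustration of the `${VAR}` and `${VAR:-default}` expansion mentioned above, here is a minimal Java sketch. `EnvExpander` and `expand` are hypothetical names, not the loader's API. Two behaviors are assumptions borrowed from shell conventions rather than taken from this PR: an empty value falls back to the default, and an unset variable with no default raises an error.

```java
import java.util.Map;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical sketch, not the loader's actual code.
class EnvExpander {
    // Matches ${NAME} or ${NAME:-default}; group 1 is the variable name,
    // group 2 is the optional default (null when absent).
    private static final Pattern VAR =
            Pattern.compile("\\$\\{([A-Za-z_][A-Za-z0-9_]*)(?::-([^}]*))?\\}");

    static String expand(String input, Map<String, String> env) {
        Matcher m = VAR.matcher(input);
        StringBuilder sb = new StringBuilder();
        while (m.find()) {
            String value = env.get(m.group(1));
            String fallback = m.group(2);
            // Assumption: like the shell's ":-", an empty value also uses the default.
            if ((value == null || value.isEmpty()) && fallback != null) {
                value = fallback;
            }
            if (value == null) {
                throw new IllegalArgumentException("Unset variable: " + m.group(1));
            }
            // quoteReplacement keeps '$' and '\' in the value literal.
            m.appendReplacement(sb, Matcher.quoteReplacement(value));
        }
        m.appendTail(sb);
        return sb.toString();
    }

    public static void main(String[] args) {
        // Resolves against the real process environment.
        System.out.println(expand("${DATASET_CACHE_DIR:-dataset_cache}", System.getenv()));
    }
}
```

Passing the environment as a `Map` rather than calling `System.getenv()` inside `expand` keeps the function testable without touching process state.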
@@ -94,11 +94,11 @@ public static void main(String[] args) throws IOException {
RunConfig runCfg = RunConfig.loadDefault();
artifacts = RunArtifacts.open(runCfg, allConfigs);
} catch (java.io.FileNotFoundException e) {
// Legacy yamlSchemaVersion "0" behavior: no run.yml
// Legacy yamlSchemaVersion "0" behavior: no run-config.yml
// - logging disabled
// - console shows compute selection
// - compute selection comes from legacy search.benchmarks if present, else default
System.err.println("WARNING: run.yml not found. Falling back to deprecated legacy behavior: "
System.err.println("WARNING: run-config.yml not found. Falling back to deprecated legacy behavior: "
+ "no logging, console mirrors computed benchmarks.");

Map<String, List<String>> legacyBenchmarks = null;
@@ -16,7 +16,7 @@

package io.github.jbellis.jvector.example;

import io.github.jbellis.jvector.example.benchmarks.datasets.DataSetLoaderMFD;
import io.github.jbellis.jvector.example.benchmarks.datasets.DataSets;
import io.github.jbellis.jvector.example.reporting.RunArtifacts;
import io.github.jbellis.jvector.example.yaml.MultiConfig;
import io.github.jbellis.jvector.example.yaml.RunConfig;
@@ -36,9 +36,8 @@ public static void main(String[] args) throws IOException {
// Run-level policy config (benchmarks/console/logging + run metadata)
RunConfig runCfg = RunConfig.loadDefault();

// Load dataset
var ds = new DataSetLoaderMFD().loadDataSet(datasetName)
.orElseThrow(() -> new RuntimeException("dataset " + datasetName + " not found"))
var ds = DataSets.loadDataSet(datasetName).orElseThrow(
() -> new RuntimeException("dataset " + datasetName + " not found"))
.getDataSet();

// Run artifacts + selections (sys_info/dataset_info/experiments.csv)

This file was deleted.
