This repository contains code and data for the paper "Investigating the Robustness of Embedding Models on Noisy Input Texts".
The paper studies the robustness of embedding models integrated into Retrieval-Augmented Generation (RAG) systems, specifically when handling noisy, poorly structured input. It reports three experiments:
- Comparing embedding models on the 36 subcorpora of the Synthetic Noise dataset.
- Evaluating the normalized differences of similarity values between the original and noisy input texts (see the sketch after this list).
- Studying the impact of chunking strategies on several retrieval datasets.
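The metric in the second experiment is defined precisely in the paper; as a rough, non-authoritative sketch, the snippet below assumes the normalized difference is the relative drop in cosine similarity, (sim(q, original) - sim(q, noisy)) / sim(q, original), computed with the sentence-transformers library and BGE small (en, v1.5) via its Hugging Face ID BAAI/bge-small-en-v1.5:

```python
from sentence_transformers import SentenceTransformer

# Illustration only: the paper defines the exact normalization. Here we
# assume the relative drop (sim(q, original) - sim(q, noisy)) / sim(q, original)
# of cosine similarities as one plausible reading.
model = SentenceTransformer("BAAI/bge-small-en-v1.5")

query = "What causes inflation?"
original = "Inflation is driven by rising production costs and increased demand."
noisy = "Inflatoin is drivn by rising producton costs and incrased demannd."

# With normalize_embeddings=True, the dot product equals cosine similarity.
q, o, n = model.encode([query, original, noisy], normalize_embeddings=True)
sim_original = float(q @ o)
sim_noisy = float(q @ n)
print(f"normalized difference: {(sim_original - sim_noisy) / sim_original:.4f}")
```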
There are two ways you can install the dependencies to run the code.
If you have the Poetry package manager for Python installed already, you can simply set up everything with:
```sh
poetry install && poetry shell
```

After the installation of all dependencies, you will end up in a new shell with a loaded venv. You can exit the shell at any time with `exit`.
You can also create a venv yourself and use pip to install the dependencies:

```sh
python3 -m venv venv
source venv/bin/activate
pip install .
```

The code uses the flash-attn library, which has to be installed separately. You can install it with the following command after activating the Poetry shell or the venv:
```sh
pip install flash-attn
```

Please note that the chunking experiment requires an older version of the mteb library, which you can install with

```sh
pip install mteb==1.7.10
```

or by modifying the pyproject.toml file to pin this version. For the remaining experiments, you have to use a newer version of the mteb library (1.14.10).
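If you manage the environment with Poetry, you can let it pin the version for you instead of editing pyproject.toml by hand; a minimal sketch, assuming you run it from the repository root:

```sh
# Pin mteb to the version required by the chunking experiment;
# this updates the mteb entry in pyproject.toml and installs it.
poetry add mteb==1.7.10

# Switch back to the newer version for the similarity and retrieval experiments.
poetry add mteb==1.14.10
```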
You can run the evaluation experiments (from the repository root) with

```sh
chunkeval evaluate [EXPERIMENT]
```

where EXPERIMENT is one of similarity, retrieval, or chunking.
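For example, to run the similarity experiment:

```sh
chunkeval evaluate similarity
```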
Analogously, you can run the visualization code with
```sh
chunkeval visualize [PLOT]
```

where PLOT is one of similarity, retrieval, or case-study. The case study specifically compares the embedding models Arctic M and GTE Qwen2 1.5B.
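For example, to generate the case-study plot comparing Arctic M and GTE Qwen2 1.5B:

```sh
chunkeval visualize case-study
```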
The following embedding models are used in the experiments:
| Embedding Model | Model Size (M Param.) | Dimensionality | Max Tokens | nDCG@10 on MS-MARCO |
|---|---|---|---|---|
| BGE small (en, v1.5) | 33 | 384 | 512 | 40.83 |
| Arctic M (v1.5) | 109 | 768 | 512 | 42.03 |
| Stella (1.5B, en, v5) | 1,543 | 1,024 | 512 | 45.22 |
| Jina (v3) | 572 | 1,024 | 8,192 | 40.82 |
| GTE Qwen2 (1.5B) | 1,776 | 1,536 | 32,000 | 43.36 |
| GTE Qwen2 (7B) | 7,613 | 3,584 | 32,000 | 45.98 |
We provide visualizations for comparing the retrieval performance (nDCG@10) and the normalized differences of similarity values between Arctic M and GTE Qwen2 1.5B.
The following two tables show the results of the first experiment, comparing the embedding models on the 36 subcorpora of the Synthetic Noise dataset (ID: in-distribution, OOD: out-of-distribution).
| Embedding Model | Context Type | 100 tokens (68%), start | 100, mid | 100, end | 300 tokens (24%), start | 300, mid | 300, end | 500 tokens (14%), start | 500, mid | 500, end |
|---|---|---|---|---|---|---|---|---|---|---|
| BGE small | ID | 78.54 | 77.37 | 77.00 | 74.88 | 66.49 | 61.76 | 75.52 | 62.51 | 60.68 |
| baseline: 77.69 | OOD | 70.69 | 61.54 | 59.71 | 33.87 | 11.55 | 7.89 | 25.02 | 2.28 | 4.41 |
| Arctic M | ID | 79.97 | 79.32 | 79.85 | 81.48 | 72.90 | 70.50 | 80.11 | 67.01 | 62.45 |
| baseline: 79.64 | OOD | 73.00 | 67.64 | 66.04 | 54.15 | 13.05 | 18.57 | 19.67 | 7.12 | 8.45 |
| Stella | ID | 81.90 | 79.72 | 78.16 | 80.30 | 75.11 | 73.11 | 78.82 | 74.13 | 67.11 |
| baseline: 81.95 | OOD | 72.80 | 56.94 | 53.84 | 42.69 | 10.82 | 17.48 | 30.48 | 7.68 | 8.01 |
| Embedding Model | Context Type | 100 tokens (68%), start | 100, mid | 100, end | 500 tokens (14%), start | 500, mid | 500, end | 4000 tokens (2%), start | 4000, mid | 4000, end |
|---|---|---|---|---|---|---|---|---|---|---|
| Jina | ID | 77.03 | 74.59 | 74.31 | 75.65 | 67.14 | 61.90 | 69.91 | 61.94 | 54.11 |
| baseline: 77.33 | OOD | 68.24 | 56.79 | 58.41 | 31.23 | 6.91 | 5.47 | 14.06 | 8.51 | 5.14 |
| GTE Qwen2 1.5B | ID | 83.78 | 82.34 | 81.68 | 76.42 | 71.61 | 67.30 | 73.66 | 69.16 | 64.13 |
| baseline: 77.84 | OOD | 83.60 | 60.01 | 59.56 | 31.06 | 25.93 | 6.87 | 17.68 | 17.51 | 33.20 |
| GTE Qwen2 7B | ID | 83.83 | 83.60 | 83.49 | 79.44 | 74.71 | 71.03 | 78.75 | 73.86 | 66.98 |
| baseline: 83.04 | OOD | 77.44 | 66.48 | 59.90 | 37.44 | 33.44 | 23.01 | 17.83 | 6.56 | 33.85 |
The following table shows the results of the comparison study of chunking strategies on several retrieval datasets; a sketch of what the naive strategy might look like follows the table.
| Dataset | Avg. #Tokens | Full | Naive | Late | Semantic | Optimal |
|---|---|---|---|---|---|---|
| FiQA2018 | 176.2 | 47.33 | 46.21 | 47.51 | 42.60 | - |
| NFCorpus | 37.1 | 36.69 | 35.54 | 36.80 | 35.18 | - |
| SciFact | 316.5 | 72.33 | 71.83 | 73.21 | 71.09 | - |
| TREC-COVID | 318.4 | 77.68 | 72.95 | 75.52 | 70.85 | - |
| MS-MARCO Doc. (300k) | 1,604.5 | 70.18 | 73.01 | 71.61 | - | - |
| NarrativeQA (LongEmbed) | 74,843.6 | 34.25 | 73.99 | 40.56 | - | - |
| Synth. Noise samples (OOD) | 2,106.2 | 22.93 | 44.45 | 41.63 | 34.35 | 78.61 |
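As a point of reference for the table, below is a minimal sketch of fixed-size chunking, assuming "Naive" refers to fixed-size splitting; the whitespace tokenization and the chunk_size/overlap defaults are illustrative stand-ins, not the repository's actual implementation:

```python
def naive_chunks(text: str, chunk_size: int = 512, overlap: int = 32) -> list[str]:
    """Split text into fixed-size, optionally overlapping chunks.

    Hypothetical illustration of fixed-size ("naive") chunking; it uses
    whitespace-separated words as a stand-in for tokenizer tokens.
    """
    words = text.split()
    step = max(chunk_size - overlap, 1)  # guard against non-positive steps
    return [" ".join(words[i:i + chunk_size]) for i in range(0, len(words), step)]


# Example: a 2000-word document yields chunks of at most 512 words each.
document = " ".join(f"word{i}" for i in range(2000))
chunks = naive_chunks(document)
print(len(chunks), [len(c.split()) for c in chunks])
```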
This repository is licensed under the MIT License. See the LICENSE file for details.