ChunkEval

This repository contains code and data for the paper "Investigating the Robustness of Embedding Models on Noisy Input Texts".

The paper studies the robustness of embedding models integrated into Retrieval-Augmented Generation (RAG) systems, specifically their behavior on noisy, poorly structured input. The paper reports three experiments:

  • Comparing embedding models on the 36 subcorpora of the Synthetic Noise dataset.
  • Evaluating the normalized differences of similarity values between the original and noisy input texts.
  • Studying the impact of chunking strategies on several retrieval datasets.
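The quantity in the second experiment can be illustrated with a small sketch. The following is a minimal Python example, assuming cosine similarity and normalizing the clean-vs-noisy similarity drop by the clean score; the function name `normalized_sim_diff` and this exact normalization are illustrative assumptions, not necessarily the paper's precise definition:

```python
import numpy as np

def cos_sim(a, b):
    """Cosine similarity between two embedding vectors."""
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

def normalized_sim_diff(query, original, noisy):
    """Similarity drop caused by noise, normalized by the clean score."""
    s_orig = cos_sim(query, original)
    s_noisy = cos_sim(query, noisy)
    return (s_orig - s_noisy) / s_orig

# Toy vectors standing in for real model embeddings.
q = np.array([1.0, 0.0, 1.0])      # query embedding
orig = np.array([1.0, 0.1, 1.0])   # embedding of the clean passage
noisy = np.array([1.0, 1.0, 0.2])  # embedding of the noise-corrupted passage
delta = normalized_sim_diff(q, orig, noisy)
```

A value near 0 means the noise barely moved the similarity; larger values mean the model's representation degraded more under noise.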

Install dependencies

There are two ways you can install the dependencies to run the code.

Using Poetry (recommended)

If you have the Poetry package manager for Python installed already, you can simply set up everything with:

poetry install && poetry shell

After the installation of all dependencies, you will end up in a new shell with a loaded venv. You can exit the shell at any time with exit.

Using Pip (alternative)

You can also create a venv yourself and use pip to install dependencies:

python3 -m venv venv
source venv/bin/activate
pip install .

Installing flash-attn

The code uses the flash-attn library, which has to be installed separately. You can install it with the following command after activating the poetry shell or the venv:

pip install flash-attn

Please note that the chunking experiment requires an older version of the mteb library, which you can install with

pip install mteb==1.7.10

or by modifying the pyproject.toml file to pin the correct version. The remaining experiments require a newer version of the mteb library (1.14.10).

Run evaluation code

You can run the evaluation experiments (from the repository root) with

chunkeval evaluate [EXPERIMENT]

where EXPERIMENT is one of similarity, retrieval, or chunking.

Run visualization

Analogously, you can run the visualization code with

chunkeval visualize [PLOT]

where PLOT is one of similarity, retrieval, or case-study. The case study specifically compares the embedding models Arctic M and GTE Qwen2 1.5B.

List of used embedding models

The following embedding models are used in the experiments:

| Embedding Model | Model Size (M Param.) | Dimensionality | Max Tokens | NDCG@10 on MS-MARCO |
|---|---|---|---|---|
| BGE small (en, v1.5) | 33 | 384 | 512 | 40.83 |
| Arctic M (v1.5) | 109 | 768 | 512 | 42.03 |
| Stella (1.5B, en, v5) | 1,543 | 1,024 | 512 | 45.22 |
| Jina (v3) | 572 | 1,024 | 8,192 | 40.82 |
| GTE Qwen2 (1.5B) | 1,776 | 1,536 | 32,000 | 43.36 |
| GTE Qwen2 (7B) | 7,613 | 3,584 | 32,000 | 45.98 |

Visualizations

We provide visualizations for comparing the retrieval performance (nDCG@10) and the normalized differences of similarity values between Arctic M and GTE Qwen2 1.5B.

  • Retrieval performance: Arctic M
  • Retrieval performance: GTE Qwen2 1.5B
  • Similarity differences: Arctic M
  • Similarity differences: GTE Qwen2 1.5B

Result tables

The following two tables show the results of the first experiment, focusing on the comparison of embedding models on the 36 subcorpora of the Synthetic Noise dataset.

| Embedding Model | Context Type | 100 tokens (68%): start | mid | end | 300 tokens (24%): start | mid | end | 500 tokens (14%): start | mid | end |
|---|---|---|---|---|---|---|---|---|---|---|
| BGE small (baseline: 77.69) | ID | 78.54 | 77.37 | 77.00 | 74.88 | 66.49 | 61.76 | 75.52 | 62.51 | 60.68 |
| | OOD | 70.69 | 61.54 | 59.71 | 33.87 | 11.55 | 7.89 | 25.02 | 2.28 | 4.41 |
| Arctic M (baseline: 79.64) | ID | 79.97 | 79.32 | 79.85 | 81.48 | 72.90 | 70.50 | 80.11 | 67.01 | 62.45 |
| | OOD | 73.00 | 67.64 | 66.04 | 54.15 | 13.05 | 18.57 | 19.67 | 7.12 | 8.45 |
| Stella (baseline: 81.95) | ID | 81.90 | 79.72 | 78.16 | 80.30 | 75.11 | 73.11 | 78.82 | 74.13 | 67.11 |
| | OOD | 72.80 | 56.94 | 53.84 | 42.69 | 10.82 | 17.48 | 30.48 | 7.68 | 8.01 |

| Embedding Model | Context Type | 100 tokens (68%): start | mid | end | 500 tokens (14%): start | mid | end | 4000 tokens (2%): start | mid | end |
|---|---|---|---|---|---|---|---|---|---|---|
| Jina (baseline: 77.33) | ID | 77.03 | 74.59 | 74.31 | 75.65 | 67.14 | 61.90 | 69.91 | 61.94 | 54.11 |
| | OOD | 68.24 | 56.79 | 58.41 | 31.23 | 6.91 | 5.47 | 14.06 | 8.51 | 5.14 |
| GTE Qwen2 1.5B (baseline: 77.84) | ID | 83.78 | 82.34 | 81.68 | 76.42 | 71.61 | 67.30 | 73.66 | 69.16 | 64.13 |
| | OOD | 83.60 | 60.01 | 59.56 | 31.06 | 25.93 | 6.87 | 17.68 | 17.51 | 33.20 |
| GTE Qwen2 7B (baseline: 83.04) | ID | 83.83 | 83.60 | 83.49 | 79.44 | 74.71 | 71.03 | 78.75 | 73.86 | 66.98 |
| | OOD | 77.44 | 66.48 | 59.90 | 37.44 | 33.44 | 23.01 | 17.83 | 6.56 | 33.85 |

This table shows the results of the comparison study of chunking strategies on several retrieval datasets.

| Dataset | #Tokens | Full | Naive | Late | Semantic | Optimal |
|---|---|---|---|---|---|---|
| FiQA2018 | 176.2 | 47.33 | 46.21 | 47.51 | 42.60 | - |
| NFCorpus | 37.1 | 36.69 | 35.54 | 36.80 | 35.18 | - |
| SciFact | 316.5 | 72.33 | 71.83 | 73.21 | 71.09 | - |
| TREC-COVID | 318.4 | 77.68 | 72.95 | 75.52 | 70.85 | - |
| MS-MARCO Doc. (300k) | 1,604.5 | 70.18 | 73.01 | 71.61 | - | - |
| NarrativeQA (LongEmbed) | 74,843.6 | 34.25 | 73.99 | 40.56 | - | - |
| Synth. Noise samples (OOD) | 2,106.2 | 22.93 | 44.45 | 41.63 | 34.35 | 78.61 |
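To make the contrast between the Naive and Late strategies concrete, here is a minimal sketch. It assumes fixed-size chunks and mean-pooling; real late chunking pools the *contextual* token embeddings produced by a long-context model, which are faked here with random vectors, and the function names are illustrative, not this repository's API:

```python
import numpy as np

def naive_chunks(tokens, size):
    """Naive chunking: split the token sequence first; each chunk would
    then be embedded independently, losing cross-chunk context."""
    return [tokens[i:i + size] for i in range(0, len(tokens), size)]

def late_chunk_embeddings(token_embs, size):
    """Late chunking: embed the whole document once (token_embs holds one
    vector per token), then mean-pool token vectors within chunk boundaries,
    so each chunk vector still reflects full-document context."""
    return [token_embs[i:i + size].mean(axis=0)
            for i in range(0, len(token_embs), size)]

tokens = list(range(10))            # toy "document" of 10 token ids
chunks = naive_chunks(tokens, 4)    # three chunks of sizes 4, 4, 2

token_embs = np.random.rand(10, 8)  # stand-in for contextual token embeddings
chunk_vecs = late_chunk_embeddings(token_embs, 4)
```

The table above suggests why this distinction matters: on long, noisy documents, strategies that keep some document-level context (Late) or split before embedding (Naive) can clearly outperform embedding the full text at once.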

License

This repository is licensed under the MIT License. See the LICENSE file for details.
