This repository contains code and data for the paper "Investigating the Robustness of Embedding Models on Noisy Input Texts".
The paper studies the robustness of embedding models integrated into Retrieval-Augmented Generation (RAG) systems, specifically when handling noisy, poorly structured input. It reports three experiments:
- Comparing embedding models on the 36 subcorpora of the Synthetic Noise dataset.
- Evaluating the normalized differences of similarity values between the original and noisy input texts (see the sketch after this list).
- Studying the impact of chunking strategies on several retrieval datasets.
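The metric in the second experiment is defined precisely in the paper; as a rough, non-authoritative sketch, the snippet below assumes the normalized difference is the relative drop in cosine similarity, (sim(q, original) - sim(q, noisy)) / sim(q, original), computed with the sentence-transformers library and BGE small (en, v1.5) via its Hugging Face ID BAAI/bge-small-en-v1.5:

```python
from sentence_transformers import SentenceTransformer

# Illustration only: the paper defines the exact normalization. Here we
# assume the relative drop (sim(q, original) - sim(q, noisy)) / sim(q, original)
# of cosine similarities as one plausible reading.
model = SentenceTransformer("BAAI/bge-small-en-v1.5")

query = "What causes inflation?"
original = "Inflation is driven by rising production costs and increased demand."
noisy = "Inflatoin is drivn by rising producton costs and incrased demannd."

# With normalize_embeddings=True, the dot product equals cosine similarity.
q, o, n = model.encode([query, original, noisy], normalize_embeddings=True)
sim_original = float(q @ o)
sim_noisy = float(q @ n)
print(f"normalized difference: {(sim_original - sim_noisy) / sim_original:.4f}")
```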
There are two ways you can install the dependencies to run the code.
If you have the Poetry package manager for Python installed already, you can simply set up everything with:
```sh
poetry install && poetry shell
```

After the installation of all dependencies, you will end up in a new shell with a loaded venv. You can exit the shell at any time with `exit`.
You can also create a venv yourself and use pip to install the dependencies:

```sh
python3 -m venv venv
source venv/bin/activate
pip install .
```

The code uses the flash-attn library, which has to be installed separately. You can install it with the following command after activating the Poetry shell or the venv:
```sh
pip install flash-attn
```

Please note that the chunking experiment requires an older version of the mteb library, which you can install with

```sh
pip install mteb==1.7.10
```

or by modifying the pyproject.toml file to pin this version. For the remaining experiments, you have to use a newer version of the mteb library (1.14.10).
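If you manage the environment with Poetry, you can let it pin the version for you instead of editing pyproject.toml by hand; a minimal sketch, assuming you run it from the repository root:

```sh
# Pin mteb to the version required by the chunking experiment;
# this updates the mteb entry in pyproject.toml and installs it.
poetry add mteb==1.7.10

# Switch back to the newer version for the similarity and retrieval experiments.
poetry add mteb==1.14.10
```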
You can run the evaluation experiments (from the repository root) with

```sh
chunkeval evaluate [EXPERIMENT]
```

where EXPERIMENT is one of similarity, retrieval, or chunking.
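For example, to run the similarity experiment:

```sh
chunkeval evaluate similarity
```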
Analogously, you can run the visualization code with
```sh
chunkeval visualize [PLOT]
```

where PLOT is one of similarity, retrieval, or case-study. The case study specifically compares the embedding models Arctic M and GTE Qwen2 1.5B.
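For example, to generate the case-study plot comparing Arctic M and GTE Qwen2 1.5B:

```sh
chunkeval visualize case-study
```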
The following embedding models are used in the experiments:
| Embedding Model | Model Size (M Param.) | Dimensionality | Max Tokens | nDCG@10 on MS-MARCO |
|---|---|---|---|---|
| BGE small (en, v1.5) | 33 | 384 | 512 | 40.83 |
| Arctic M (v1.5) | 109 | 768 | 512 | 42.03 |
| Stella (1.5B, en, v5) | 1,543 | 1,024 | 512 | 45.22 |
| Jina (v3) | 572 | 1,024 | 8,192 | 40.82 |
| GTE Qwen2 (1.5B) | 1,776 | 1,536 | 32,000 | 43.36 |
| GTE Qwen2 (7B) | 7,613 | 3,584 | 32,000 | 45.98 |
We provide visualizations for comparing the retrieval performance (nDCG@10) and the normalized differences of similarity values between Arctic M and GTE Qwen2 1.5B.
The following two tables show the results of the first experiment, comparing the embedding models on the 36 subcorpora of the Synthetic Noise dataset (ID: in-distribution, OOD: out-of-distribution).
| Embedding Model | Context Type | 100 tokens (68%), start | 100, mid | 100, end | 300 tokens (24%), start | 300, mid | 300, end | 500 tokens (14%), start | 500, mid | 500, end |
|---|---|---|---|---|---|---|---|---|---|---|
| BGE small | ID | 78.54 | 77.37 | 77.00 | 74.88 | 66.49 | 61.76 | 75.52 | 62.51 | 60.68 |
| baseline: 77.69 | OOD | 70.69 | 61.54 | 59.71 | 33.87 | 11.55 | 7.89 | 25.02 | 2.28 | 4.41 |
| Arctic M | ID | 79.97 | 79.32 | 79.85 | 81.48 | 72.90 | 70.50 | 80.11 | 67.01 | 62.45 |
| baseline: 79.64 | OOD | 73.00 | 67.64 | 66.04 | 54.15 | 13.05 | 18.57 | 19.67 | 7.12 | 8.45 |
| Stella | ID | 81.90 | 79.72 | 78.16 | 80.30 | 75.11 | 73.11 | 78.82 | 74.13 | 67.11 |
| baseline: 81.95 | OOD | 72.80 | 56.94 | 53.84 | 42.69 | 10.82 | 17.48 | 30.48 | 7.68 | 8.01 |
| Embedding Model | Context Type | 100 tokens (68%), start | 100, mid | 100, end | 500 tokens (14%), start | 500, mid | 500, end | 4000 tokens (2%), start | 4000, mid | 4000, end |
|---|---|---|---|---|---|---|---|---|---|---|
| Jina | ID | 77.03 | 74.59 | 74.31 | 75.65 | 67.14 | 61.90 | 69.91 | 61.94 | 54.11 |
| baseline: 77.33 | OOD | 68.24 | 56.79 | 58.41 | 31.23 | 6.91 | 5.47 | 14.06 | 8.51 | 5.14 |
| GTE Qwen2 1.5B | ID | 83.78 | 82.34 | 81.68 | 76.42 | 71.61 | 67.30 | 73.66 | 69.16 | 64.13 |
| baseline: 77.84 | OOD | 83.60 | 60.01 | 59.56 | 31.06 | 25.93 | 6.87 | 17.68 | 17.51 | 33.20 |
| GTE Qwen2 7B | ID | 83.83 | 83.60 | 83.49 | 79.44 | 74.71 | 71.03 | 78.75 | 73.86 | 66.98 |
| baseline: 83.04 | OOD | 77.44 | 66.48 | 59.90 | 37.44 | 33.44 | 23.01 | 17.83 | 6.56 | 33.85 |
The following table shows the results of the comparison study of chunking strategies on several retrieval datasets; a sketch of what the naive strategy might look like follows the table.
| Dataset | Avg. #Tokens | Full | Naive | Late | Semantic | Optimal |
|---|---|---|---|---|---|---|
| FiQA2018 | 176.2 | 47.33 | 46.21 | 47.51 | 42.60 | - |
| NFCorpus | 37.1 | 36.69 | 35.54 | 36.80 | 35.18 | - |
| SciFact | 316.5 | 72.33 | 71.83 | 73.21 | 71.09 | - |
| TREC-COVID | 318.4 | 77.68 | 72.95 | 75.52 | 70.85 | - |
| MS-MARCO Doc. (300k) | 1,604.5 | 70.18 | 73.01 | 71.61 | - | - |
| NarrativeQA (LongEmbed) | 74,843.6 | 34.25 | 73.99 | 40.56 | - | - |
| Synth. Noise samples (OOD) | 2,106.2 | 22.93 | 44.45 | 41.63 | 34.35 | 78.61 |
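As a point of reference for the table, below is a minimal sketch of fixed-size chunking, assuming "Naive" refers to fixed-size splitting; the whitespace tokenization and the chunk_size/overlap defaults are illustrative stand-ins, not the repository's actual implementation:

```python
def naive_chunks(text: str, chunk_size: int = 512, overlap: int = 32) -> list[str]:
    """Split text into fixed-size, optionally overlapping chunks.

    Hypothetical illustration of fixed-size ("naive") chunking; it uses
    whitespace-separated words as a stand-in for tokenizer tokens.
    """
    words = text.split()
    step = max(chunk_size - overlap, 1)  # guard against non-positive steps
    return [" ".join(words[i:i + chunk_size]) for i in range(0, len(words), step)]


# Example: a 2000-word document yields chunks of at most 512 words each.
document = " ".join(f"word{i}" for i in range(2000))
chunks = naive_chunks(document)
print(len(chunks), [len(c.split()) for c in chunks])
```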
This repository is licensed under the MIT License. See the LICENSE file for details.