mirpy

mirpy is a Python library for working with AIRR-seq and immune repertoire data. It provides building blocks for:

parsing clonotype and repertoire tables;
loading and updating V/J segment libraries;
repertoire-level statistics and diversity analysis;
distance calculation and sequence matching;
prototype-based embeddings and comparative analysis.

The package is designed as a reusable toolkit rather than a single pipeline.

Installation

Requirements:

Python 3.11+
a C/C++ build toolchain for compiled extensions

Install from PyPI:

pip install mirpy-lib

Install from source (one-shot):

git clone https://github.com/antigenomics/mirpy.git
cd mirpy
pip install .

Install from source (editable development mode):

git clone https://github.com/antigenomics/mirpy.git
cd mirpy
./setup.sh
pip install -e .

If you only need the package for project usage, prefer pip install mirpy-lib. If you plan to develop or run docs/notebooks locally, use the cloned repo setup.

Main modules

mir.common: clonotypes, repertoires, parsers, segment libraries
mir.distances: aligners, search, graph-based utilities
mir.basic: diversity, sampling, segment usage, pgen helpers
mir.embedding: repertoire and prototype embeddings
mir.comparative: overlap, matching, TCRnet-style comparisons
mir.biomarkers: enrichment and biomarker detection utilities

Quick start

Load a segment library

from mir.common.gene_library import GeneLibrary

lib = GeneLibrary.load_default(
    loci={"TRA", "TRB"},
    species={"human"},
    source="imgt",
)

If a requested organism/locus pair is missing in the default local segment file, mirpy will download the missing V and J segments from IMGT and append them to the default resource file automatically.

Parse a table with clonotypes

from mir.common.parser import VDJtoolsParser

parser = VDJtoolsParser(sep="\t")
clonotypes = parser.parse("example.tsv")

Work with repertoires

from mir.common.repertoire import LocusRepertoire

repertoire = LocusRepertoire(clonotypes=clonotypes, locus="TRB")
print(repertoire.duplicate_count)
print(repertoire.clonotype_count)

# Functional/canonical filtering using IMGT functionality annotations.
from mir.common.filter import filter_functional, filter_canonical
from mir.common.gene_library import GeneLibrary

imgt_lib = GeneLibrary.load_default(loci={"TRB"}, species={"human"}, source="imgt")
functional_rep = filter_functional(repertoire, gene_library=imgt_lib)
canonical_rep = filter_canonical(repertoire, gene_library=imgt_lib)

You can also load a repertoire directly from a file:

from mir.common.parser import VDJtoolsParser
from mir.common.repertoire import LocusRepertoire

repertoire = LocusRepertoire(
    clonotypes=VDJtoolsParser(sep="\t").parse("example.tsv"),
    locus="TRB",
)

Pool repertoires across samples

mirpy provides pooled repertoires with configurable identity rules.

from mir.common.pool import pool_samples

# Pool by nucleotide CDR3 + V/J calls.
pooled_ntvj = pool_samples(
    [sample_rep_1, sample_rep_2],
    rule="ntvj",
    weighted=True,
)

# Pool an entire dataset by amino-acid CDR3 + V/J and keep contributing sample ids.
pooled_aavj = pool_samples(
    dataset,
    rule="aavj",
    include_sample_ids=True,
)

Supported pooling rules:

ntvj: key is (junction, v_gene, j_gene)
nt: key is (junction,)
aavj: key is (junction_aa, v_gene, j_gene)
aa: key is (junction_aa,)

For each pooled key, the representative clonotype is selected by frequency (duplicate_count when weighted=True, otherwise row occurrences), while duplicate_count is reassigned to the total sum over grouped rows. The pooled clonotype metadata contains incidence (unique samples) and occurrences (independent rearrangement rows).

Prototype-based embeddings with TCREMP

TCREmp (T-Cell Receptor EMbedding with Prototypes) embeds immune receptor clonotypes as distance vectors to a fixed set of prototype clonotypes, enabling rapid downstream analysis, dimensionality reduction, and machine learning.

from mir.embedding.tcremp import TCREmp
from mir.common.clonotype import Clonotype

# Build a TCREMP model with default prototypes (fast, fixed-gap junction alignment)
model = TCREmp.from_defaults("human", "TRB", n_prototypes=1000, junction_method="fixed_gap")

# Embed clonotypes as distance vectors to all prototypes
clonotypes = [
    Clonotype(v_gene="TRBV10-3*01", j_gene="TRBJ2-7*01", junction_aa="CASSIRSSYEQYF"),
    Clonotype(v_gene="TRBV20-1*01", j_gene="TRBJ1-1*01", junction_aa="CSARDSSYEQYF"),
]
X = model.embed(clonotypes)  # shape: (2, 3000) — float32 array

# Embeddings combine three distance types per prototype: V-germline, J-germline, and junction
# Column layout: [v_1, j_1, junc_1, v_2, j_2, junc_2, ..., v_K, j_K, junc_K]
print(X.shape)  # (2, 3000)

For full DP alignment semantics, use junction_method="biopython" (slower, ~90x):

model_bio = TCREmp.from_defaults("human", "TRB", n_prototypes=100, junction_method="biopython")

For custom prototypes:

model_custom = TCREmp.from_file("prototypes.tsv", species="human", locus="TRB")

Paired-chain embedding uses the same idea, but concatenates the TRA and TRB embeddings for each PairedClonotype:

from mir.common.parser import VDJdbFullPairedParser
from mir.common.single_cell import build_tenx_sample_from_cell_clonotypes
from mir.common.single_cell_repair import impute_missing_chains
from mir.embedding.tcremp import PairedTCREmp

parser = VDJdbFullPairedParser()

# Keep only complete TRA/TRB rows.
strict_df, strict_meta = parser.parse_cell_clonotypes_file(
    "vdjdb_full.txt.gz",
    species="HomoSapiens",
    include_incomplete=False,
)
strict_sample = build_tenx_sample_from_cell_clonotypes(
    strict_df,
    sample_id="vdjdb_full_human_strict",
    barcode_metadata=strict_meta,
)

# Or keep incomplete rows and impute the missing chain before pairing.
impute_df, impute_meta = parser.parse_cell_clonotypes_file(
    "vdjdb_full.txt.gz",
    species="HomoSapiens",
    include_incomplete=True,
)
imputed_df = impute_missing_chains(impute_df)
imputed_sample = build_tenx_sample_from_cell_clonotypes(
    imputed_df,
    sample_id="vdjdb_full_human_imputed",
    barcode_metadata=impute_meta,
)

paired_model = PairedTCREmp.from_defaults("human", "TRA_TRB", n_prototypes=500)
paired_clonotypes = imputed_sample.paired_locus_repertoires["TRA_TRB"].paired_clonotypes
X_pair = paired_model.embed(paired_clonotypes)

The VDJdb full parser uses one synthetic barcode per source row and stores the row id plus antigen/MHC annotations in sample.single_cell_repertoire.barcode_metadata. A paired analysis notebook is available at notebooks/tcremp_vdjdb_analysis_paired.ipynb.

Key features:

Distance formula: d(a, b) = s(a,a) + s(b,b) − 2·s(a,b) ensures metric properties.
Smart n_jobs auto-switch: with n_jobs=None (default), TCREmp chooses 1 or os.cpu_count() based on workload len(clonotypes) * n_prototypes.
Pre-computed germline distances: V/J distances are cached for O(1) lookup.
Biologically interpretable: Each embedding dimension corresponds to distance to a specific prototype.

n_jobs behavior:

n_jobs=None (default): auto policy based on workload threshold.
n_jobs=1: force serial execution.
n_jobs>1: force explicit worker count.

In auto mode, BioPython junction alignment stays serial because thread overhead usually dominates for that backend.

Why threshold is not based on clonotypes alone:

Parallel splitting is done on the input clonotype list (query axis).
However, each query is scored against all prototypes in the C backend.
So total work is still proportional to n_clonotypes * n_prototypes, and both terms matter for the auto-switch decision.

Default method guidance:

fixed_gap remains the default because it is substantially faster and is the intended production embedding backend.
biopython can be used when full dynamic-programming semantics are required, but no repository benchmark currently shows a consistent downstream quality gain that justifies making it the default.
End-to-end embedding pipelines may show smaller speedups (for example 3-4x) than raw pairwise core-kernel comparisons due to non-alignment overhead.

Repertoire internals and lazy tabular backend

LocusRepertoire supports three internal representations:

eager clonotype list (clonotypes),
lazy column bundles (_pending_cols) for fast parser paths,
Polars table (_polars_table) for tabular I/O and grouped operations.

When a repertoire is built from a Python clonotype list, no Polars table is created up front. The table is generated lazily on first to_polars() call and cached for reuse. Count properties (clonotype_count, duplicate_count) stay fast and avoid full materialization when lazy columns or Polars data are available.

Mask and match sequences

from mir.basic.alphabets import (
    aa_to_reduced,
    mask,
    matches,
    matches_aa_reduced,
    NT_MASK,
    AA_MASK,
)

nt_masked = mask("ATCGAT", (2, 5), NT_MASK)
assert nt_masked == b"ATNNNT"

aa = "CASTIV"
reduced = aa_to_reduced(aa)

# Matching ignores mask symbols: N for nucleotides, X for amino-acid alphabets.
assert matches(mask(aa, 0, AA_MASK), aa, AA_MASK)
assert matches_aa_reduced(aa, mask(reduced, 3, AA_MASK))

Resources

Example notebooks are available in notebooks/.
API and module/function documentation: https://antigenomics.github.io/mirpy/modules.html
Notebook gallery in docs: https://antigenomics.github.io/mirpy/examples.html
Docs source tree: docs/
Agent skill guide for LLM-assisted workflows (Claude, GitHub Copilot, similar agents): skills/mirpy/SKILL.md

Project status

The library is actively evolving. Some modules are more mature than others, and the public API may still change in places.

Name		Name	Last commit message	Last commit date
Latest commit History 337 Commits
.github/workflows		.github/workflows
.vscode		.vscode
assets		assets
docs		docs
mir		mir
notebooks		notebooks
skills/mirpy		skills/mirpy
tests		tests
.gitignore		.gitignore
CMakeLists.txt		CMakeLists.txt
LICENSE		LICENSE
README.md		README.md
benchmarks.md		benchmarks.md
pyproject.toml		pyproject.toml
requirements.txt		requirements.txt
setup.sh		setup.sh

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Repository files navigation

mirpy

Installation

Main modules

Quick start

Load a segment library

Parse a table with clonotypes

Work with repertoires

Pool repertoires across samples

Prototype-based embeddings with TCREMP

Repertoire internals and lazy tabular backend

Mask and match sequences

Resources

Project status

About

Uh oh!

Releases 1

Packages

Uh oh!

Uh oh!

Contributors

Uh oh!

Languages

Folders and files

Latest commit

History

Repository files navigation

mirpy

Installation

Main modules

Quick start

Load a segment library

Parse a table with clonotypes

Work with repertoires

Pool repertoires across samples

Prototype-based embeddings with TCREMP

Repertoire internals and lazy tabular backend

Mask and match sequences

Resources

Project status

About

Resources

License

Uh oh!

Stars

Watchers

Forks

Releases 1

Packages 0

Uh oh!

Uh oh!

Contributors

Uh oh!

Languages

Packages