mirpy is a Python library for working with AIRR-seq and immune repertoire data.
It provides building blocks for:
- parsing clonotype and repertoire tables;
- loading and updating V/J segment libraries;
- repertoire-level statistics and diversity analysis;
- distance calculation and sequence matching;
- prototype-based embeddings and comparative analysis.
The package is designed as a reusable toolkit rather than a single pipeline.
Requirements:
- Python 3.11+
- a C/C++ build toolchain for compiled extensions
Install from PyPI:
pip install mirpy-libInstall from source (one-shot):
git clone https://github.com/antigenomics/mirpy.git
cd mirpy
pip install .Install from source (editable development mode):
git clone https://github.com/antigenomics/mirpy.git
cd mirpy
./setup.sh
pip install -e .If you only need the package for project usage, prefer pip install mirpy-lib.
If you plan to develop or run docs/notebooks locally, use the cloned repo setup.
mir.common: clonotypes, repertoires, parsers, segment librariesmir.distances: aligners, search, graph-based utilitiesmir.basic: diversity, sampling, segment usage, pgen helpersmir.embedding: repertoire and prototype embeddingsmir.comparative: overlap, matching, TCRnet-style comparisonsmir.biomarkers: enrichment and biomarker detection utilities
from mir.common.gene_library import GeneLibrary
lib = GeneLibrary.load_default(
loci={"TRA", "TRB"},
species={"human"},
source="imgt",
)If a requested organism/locus pair is missing in the default local segment file,
mirpy will download the missing V and J segments from IMGT and append them to
the default resource file automatically.
from mir.common.parser import VDJtoolsParser
parser = VDJtoolsParser(sep="\t")
clonotypes = parser.parse("example.tsv")from mir.common.repertoire import LocusRepertoire
repertoire = LocusRepertoire(clonotypes=clonotypes, locus="TRB")
print(repertoire.duplicate_count)
print(repertoire.clonotype_count)
# Functional/canonical filtering using IMGT functionality annotations.
from mir.common.filter import filter_functional, filter_canonical
from mir.common.gene_library import GeneLibrary
imgt_lib = GeneLibrary.load_default(loci={"TRB"}, species={"human"}, source="imgt")
functional_rep = filter_functional(repertoire, gene_library=imgt_lib)
canonical_rep = filter_canonical(repertoire, gene_library=imgt_lib)You can also load a repertoire directly from a file:
from mir.common.parser import VDJtoolsParser
from mir.common.repertoire import LocusRepertoire
repertoire = LocusRepertoire(
clonotypes=VDJtoolsParser(sep="\t").parse("example.tsv"),
locus="TRB",
)mirpy provides pooled repertoires with configurable identity rules.
from mir.common.pool import pool_samples
# Pool by nucleotide CDR3 + V/J calls.
pooled_ntvj = pool_samples(
[sample_rep_1, sample_rep_2],
rule="ntvj",
weighted=True,
)
# Pool an entire dataset by amino-acid CDR3 + V/J and keep contributing sample ids.
pooled_aavj = pool_samples(
dataset,
rule="aavj",
include_sample_ids=True,
)Supported pooling rules:
ntvj: key is(junction, v_gene, j_gene)nt: key is(junction,)aavj: key is(junction_aa, v_gene, j_gene)aa: key is(junction_aa,)
For each pooled key, the representative clonotype is selected by frequency
(duplicate_count when weighted=True, otherwise row occurrences), while
duplicate_count is reassigned to the total sum over grouped rows. The pooled
clonotype metadata contains incidence (unique samples) and occurrences
(independent rearrangement rows).
TCREmp (T-Cell Receptor EMbedding with Prototypes) embeds immune receptor clonotypes
as distance vectors to a fixed set of prototype clonotypes, enabling rapid downstream
analysis, dimensionality reduction, and machine learning.
from mir.embedding.tcremp import TCREmp
from mir.common.clonotype import Clonotype
# Build a TCREMP model with default prototypes (fast, fixed-gap junction alignment)
model = TCREmp.from_defaults("human", "TRB", n_prototypes=1000, junction_method="fixed_gap")
# Embed clonotypes as distance vectors to all prototypes
clonotypes = [
Clonotype(v_gene="TRBV10-3*01", j_gene="TRBJ2-7*01", junction_aa="CASSIRSSYEQYF"),
Clonotype(v_gene="TRBV20-1*01", j_gene="TRBJ1-1*01", junction_aa="CSARDSSYEQYF"),
]
X = model.embed(clonotypes) # shape: (2, 3000) β float32 array
# Embeddings combine three distance types per prototype: V-germline, J-germline, and junction
# Column layout: [v_1, j_1, junc_1, v_2, j_2, junc_2, ..., v_K, j_K, junc_K]
print(X.shape) # (2, 3000)For full DP alignment semantics, use junction_method="biopython" (slower, ~90x):
model_bio = TCREmp.from_defaults("human", "TRB", n_prototypes=100, junction_method="biopython")For custom prototypes:
model_custom = TCREmp.from_file("prototypes.tsv", species="human", locus="TRB")Paired-chain embedding uses the same idea, but concatenates the TRA and TRB
embeddings for each PairedClonotype:
from mir.common.parser import VDJdbFullPairedParser
from mir.common.single_cell import build_tenx_sample_from_cell_clonotypes
from mir.common.single_cell_repair import impute_missing_chains
from mir.embedding.tcremp import PairedTCREmp
parser = VDJdbFullPairedParser()
# Keep only complete TRA/TRB rows.
strict_df, strict_meta = parser.parse_cell_clonotypes_file(
"vdjdb_full.txt.gz",
species="HomoSapiens",
include_incomplete=False,
)
strict_sample = build_tenx_sample_from_cell_clonotypes(
strict_df,
sample_id="vdjdb_full_human_strict",
barcode_metadata=strict_meta,
)
# Or keep incomplete rows and impute the missing chain before pairing.
impute_df, impute_meta = parser.parse_cell_clonotypes_file(
"vdjdb_full.txt.gz",
species="HomoSapiens",
include_incomplete=True,
)
imputed_df = impute_missing_chains(impute_df)
imputed_sample = build_tenx_sample_from_cell_clonotypes(
imputed_df,
sample_id="vdjdb_full_human_imputed",
barcode_metadata=impute_meta,
)
paired_model = PairedTCREmp.from_defaults("human", "TRA_TRB", n_prototypes=500)
paired_clonotypes = imputed_sample.paired_locus_repertoires["TRA_TRB"].paired_clonotypes
X_pair = paired_model.embed(paired_clonotypes)The VDJdb full parser uses one synthetic barcode per source row and stores the
row id plus antigen/MHC annotations in
sample.single_cell_repertoire.barcode_metadata. A paired analysis notebook is
available at notebooks/tcremp_vdjdb_analysis_paired.ipynb.
Key features:
- Distance formula:
d(a, b) = s(a,a) + s(b,b) β 2Β·s(a,b)ensures metric properties. - Smart
n_jobsauto-switch: withn_jobs=None(default), TCREmp chooses1oros.cpu_count()based on workloadlen(clonotypes) * n_prototypes. - Pre-computed germline distances: V/J distances are cached for O(1) lookup.
- Biologically interpretable: Each embedding dimension corresponds to distance to a specific prototype.
n_jobs behavior:
n_jobs=None(default): auto policy based on workload threshold.n_jobs=1: force serial execution.n_jobs>1: force explicit worker count.
In auto mode, BioPython junction alignment stays serial because thread overhead usually dominates for that backend.
Why threshold is not based on clonotypes alone:
- Parallel splitting is done on the input clonotype list (query axis).
- However, each query is scored against all prototypes in the C backend.
- So total work is still proportional to
n_clonotypes * n_prototypes, and both terms matter for the auto-switch decision.
Default method guidance:
fixed_gapremains the default because it is substantially faster and is the intended production embedding backend.biopythoncan be used when full dynamic-programming semantics are required, but no repository benchmark currently shows a consistent downstream quality gain that justifies making it the default.- End-to-end embedding pipelines may show smaller speedups (for example 3-4x) than raw pairwise core-kernel comparisons due to non-alignment overhead.
LocusRepertoire supports three internal representations:
- eager clonotype list (
clonotypes), - lazy column bundles (
_pending_cols) for fast parser paths, - Polars table (
_polars_table) for tabular I/O and grouped operations.
When a repertoire is built from a Python clonotype list, no Polars table is
created up front. The table is generated lazily on first to_polars() call and
cached for reuse. Count properties (clonotype_count, duplicate_count) stay
fast and avoid full materialization when lazy columns or Polars data are
available.
from mir.basic.alphabets import (
aa_to_reduced,
mask,
matches,
matches_aa_reduced,
NT_MASK,
AA_MASK,
)
nt_masked = mask("ATCGAT", (2, 5), NT_MASK)
assert nt_masked == b"ATNNNT"
aa = "CASTIV"
reduced = aa_to_reduced(aa)
# Matching ignores mask symbols: N for nucleotides, X for amino-acid alphabets.
assert matches(mask(aa, 0, AA_MASK), aa, AA_MASK)
assert matches_aa_reduced(aa, mask(reduced, 3, AA_MASK))- Example notebooks are available in notebooks/.
- API and module/function documentation: https://antigenomics.github.io/mirpy/modules.html
- Notebook gallery in docs: https://antigenomics.github.io/mirpy/examples.html
- Docs source tree: docs/
- Agent skill guide for LLM-assisted workflows (Claude, GitHub Copilot, similar agents): skills/mirpy/SKILL.md
The library is actively evolving. Some modules are more mature than others, and the public API may still change in places.
