
Full CRAM 3.1 support & performance tuning of CRAM read/write#1763

Open
tfenne wants to merge 43 commits into dev from tf_cram_31

Conversation

@tfenne
Member

@tfenne tfenne commented Mar 30, 2026

Description

Ok, so this is a monster PR. I'm not even sure it should be a single PR, but I wanted to put it up here to get some feedback on how best to proceed. I've created it as a draft for that reason. It builds upon the work by @cmnbroad in #1739 while implementing the necessary additions to write CRAM 3.1 at comparable compression ratios to htslib/samtools.

This was done with the assistance of Claude Code, though it also represents a significant investment of my own time spent prompting, reviewing, fixing, and redirecting.

The PR breaks down roughly as follows:

Many of the additions and changes are strongly inspired by work in htslib!

Currently, with the normal profile selected, htsjdk emits CRAMs approximately the same size as htslib's. With some 1KG low-pass data I've been testing, the results are actually slightly smaller. However, runtime to generate the CRAMs is ~2x slower than htslib/samtools. That gap could probably be shrunk further (it started out around 5-6x slower), but I think there will always be a substantial gap due to a) not having easy access to SIMD instructions and b) not having a good libdeflate wrapper on the JVM.

The read path for CRAM 3.1 predates this PR, but the changes here provide substantial performance improvements. Again, testing on low-pass WGS data I'm seeing a 40-50% speedup.

Assuming this PR/changeset has legs, I will do some broader benchmarking and report on both compression and runtime performance.

@jkbonfield I wouldn't expect you to review this PR, but if you have thoughts on how to validate the results of this work beyond the broad set of CRAM unit tests in the htsjdk repo, I would be very grateful for your input.

Things to think about before submitting:

  • Make sure your changes compile and new tests pass locally.
  • Add new tests or update existing ones:
    • A bug fix should include a test that previously would have failed and passes now.
    • New features should come with new tests that exercise and validate the new functionality.
  • Extend the README / documentation, if necessary
  • Check your code style.
  • Write a clear commit title and message
    • The commit message should describe what changed and is targeted at htsjdk developers
    • Breaking changes should be mentioned in the commit message.

@jkbonfield

jkbonfield commented Mar 31, 2026

For testing, hts-specs has a corpus of test files designed to validate CRAM decoder implementations.

https://github.com/samtools/hts-specs/tree/master/test/cram

The 3.0/README.md covers the basics of the CRAM structure, adding one feature at a time: a blank file with header, then basic one-read files without reference-based encoding, and so on. This mostly tests structures, features, encodings, etc. By the end it's also testing compression codecs.

The 3.1 directory has 4 CRAM files with progressive levels of new compression codecs (basically htslib's profiles, I think). They're whole-file real-world examples, as beyond the new codecs there are no format changes.

More useful for 3.1 is the codecs directory, which has raw codec-compressed byte streams for you to decode and validate. These are shrunk-down copies of the data in https://github.com/jkbonfield/htscodecs-corpus/tree/master/dat, although I think that may be a little incomplete now. Basically it's designed to test things like the rans, fqz, arith and name tokeniser codecs in isolation, for unit-testing purposes.

This is mostly just integration testing. The nature of CRAM is that even with the same codec we could produce subtly different encoded byte streams (eg it's up to the implementation how it approximates the probability tables in rANS and handles rounding errors when the frequencies don't divide up perfectly, but the data stream contains the necessary lookup tables to decode, so that's what matters).

So what we need for testing are:

  • Can we round-trip our own data without losing information
  • Can we round-trip with other implementations; java enc -> htscodecs dec, or htscodecs enc -> java dec. The test data here is attempting to validate that bit.
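The first bullet can be sketched as a simple round-trip property test. Here the JDK's Deflater/Inflater stands in for a CRAM codec (illustrative only, not an htsjdk API); cross-implementation tests for the second bullet would replace one side with htscodecs-produced streams.

```java
import java.io.ByteArrayOutputStream;
import java.util.zip.DataFormatException;
import java.util.zip.Deflater;
import java.util.zip.Inflater;

// Sketch: compress then decompress and compare to the original. A real CRAM
// round-trip test would do this per codec and per data series.
public final class RoundTripSketch {
    public static byte[] roundTrip(byte[] data) {
        final Deflater def = new Deflater();
        def.setInput(data);
        def.finish();
        final ByteArrayOutputStream comp = new ByteArrayOutputStream();
        final byte[] tmp = new byte[4096];
        while (!def.finished()) comp.write(tmp, 0, def.deflate(tmp));
        def.end();

        final Inflater inf = new Inflater();
        inf.setInput(comp.toByteArray());
        final ByteArrayOutputStream out = new ByteArrayOutputStream();
        try {
            while (!inf.finished()) out.write(tmp, 0, inf.inflate(tmp));
        } catch (final DataFormatException e) {
            throw new RuntimeException(e);
        }
        inf.end();
        return out.toByteArray();
    }
}
```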

@jkbonfield

jkbonfield commented Mar 31, 2026

Some comments regarding CRAM 4.0. This is long, but it may explain why I think CRAM 4 is now something to ignore. (Mainly speaking to you as someone looking at adding it for Noodles).

Personally, I think I made an error when I added it to htslib, but the history here is important to understand how we got to where we did.

Initially I had two aims to improve on CRAM 3.0:

  • Fix some glaring issues in the CRAM format.
  • Improve CRAM compression and if possible also speed.
  • A third bonus was to simplify where possible (eg cull the kitchen sink of CORE bit-wise encodings).

Fixing the CRAM format included things like replacing ITF8/LTF8 with a true universal variable length encoder that doesn't need to know the expected bit-size (somewhat of an anathema to proper variable length encoding). It also separated signed from unsigned values, so we no longer take 5 bytes to store -1 (or worse for LTF8). It also gained better ways of doing constant values. I also fixed some deficiencies such as how to discover whether the original data had MD/NM (by adding MD:* and NM:* tags with the "*" being auto-generated on decode, so the presence and location are known). Further improvements would be around =/X vs M in cigar, and so on. There are quite a few little improvements to make it work better as a round-trip tool. I added a CRAM v4 idea tracker on hts-specs not just because I was hoping for community ideas, but primarily as a way to date stamp the ideas I had and prevent patent squatters like MPEG from patenting obvious future improvements!
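The varint point can be illustrated with a sketch of the general LEB128-plus-zigzag technique (not CRAM 4's actual wire encoding): with a zigzag mapping, -1 encodes in a single byte rather than the five that ITF8 needs.

```java
// Sketch (not htsjdk/CRAM 4 API): a LEB128-style unsigned varint plus a
// zigzag mapping for signed values.
public final class VarintSketch {
    // Encode an unsigned value 7 bits at a time, high bit = "more bytes follow".
    public static int writeUnsigned(long value, byte[] out, int pos) {
        while ((value & ~0x7FL) != 0) {
            out[pos++] = (byte) ((value & 0x7F) | 0x80);
            value >>>= 7;
        }
        out[pos++] = (byte) value;
        return pos;
    }

    // Zigzag maps small negatives to small unsigned values: -1 -> 1, 1 -> 2, ...
    public static long zigzag(long v)   { return (v << 1) ^ (v >> 63); }
    public static long unzigzag(long v) { return (v >>> 1) ^ -(v & 1); }

    public static int writeSigned(long value, byte[] out, int pos) {
        return writeUnsigned(zigzag(value), out, pos);
    }
}
```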

Looking at the compression side, I regret not adding in zstd, but it was a bit too new then. Zstd should totally replace Gzip (deflate) and LZMA. Deflate is just hampered by a maximum window size of 32kb, which makes deduplication - the core part of LZ - fail on long reads and similar data types. At high levels zstd approaches lzma for ratio. It's not quite as good, but close, and frankly lzma is almost never worth the CPU cost so I doubt it's used much. Similarly bzip2 should be ditched in favour of libbsc, which betters it on pretty much every aspect. Combined we'd be reducing the number of dependencies while also using more modern high performant libraries. There's even a branch of io_lib with libbsc and zstd. This was demonstrated to MPEG at a conference in their first call-for-proposals on a new genomics format. Something I now strongly regret taking part in! (Ultimately they only took the name tokeniser, but have now regretted going with their own entropy encoder as it's simply way too slow.)

The other aspect was improving ratios via custom codecs. The old rANS wins here because it's so much faster as an entropy encoder, and sometimes entropy encoding (or minimal statistical encoding) is the best you can get, so the complex LZ searching methods aren't worth it. (Before CRAM 3.0 we even had zlib being tried with Z_HUFFMAN_ONLY as an option, because it sometimes beat normal zlib while being vastly faster.) The ideas from there eventually ended up in CRAM 3.1.

One aspect of that was the notion of data transforms. This is an idea taken out of PNG, and later used by myself for ZTR. Data starts off as raw, but can have a transformation applied to it. Decoding is like peeling the layers off an onion. If it's not raw format yet, you pass it through the transform indicated in the data-stream and repeat, until you get to the original raw data. These could be delta if it's a series of integer values; size squashing - eg an array of 16bit values becoming 8-bit with an escape mechanism for out of bound data; run-length encoding; bit-packing (eg ACGT strings taking up 2bits per value instead of 8); or even bigger transformations such as BWT and MTF. These can be implemented in CRAM as layered encodings, much like we already have BYTE_ARRAY_LEN which contains two sub-encodings for the literal and length fields. RLE would be a direct comparison to that in data layout. I initially implemented PACK and RLE here.
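The onion-layer idea can be sketched with two toy transforms; the formats below are invented for illustration and are not CRAM's or ZTR's actual layouts.

```java
import java.io.ByteArrayOutputStream;

// Sketch only: two toy transforms from the "onion" idea. Real CRAM/ZTR
// transforms carry metadata in the data stream; the formats here are invented.
public final class TransformSketch {
    // Run-length encode as (byte, count) pairs; counts capped at 255.
    public static byte[] rle(byte[] in) {
        final ByteArrayOutputStream out = new ByteArrayOutputStream();
        for (int i = 0; i < in.length; ) {
            int j = i;
            while (j < in.length && in[j] == in[i] && j - i < 255) j++;
            out.write(in[i]);
            out.write(j - i);
            i = j;
        }
        return out.toByteArray();
    }

    // Bit-pack a 4-symbol alphabet (eg ACGT mapped to 0..3) at 2 bits/symbol.
    public static byte[] pack2bit(byte[] symbols) {
        final byte[] out = new byte[(symbols.length + 3) / 4];
        for (int i = 0; i < symbols.length; i++) {
            out[i / 4] |= (symbols[i] & 3) << ((i % 4) * 2);
        }
        return out;
    }
}
```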

CRAM 4 was an experimental melting pot of compression improvements alongside the format changes.

However it rapidly became apparent that getting any traction on these notions was going to be slow in htsjdk, so I took the decision to attempt to produce a cut down CRAM 4, renamed to CRAM 3.1. The theory was if we only had compression codec changes and we had a high performance separate library to implement them then the code changes needed in other CRAM implementations were very minimal; just updating the list of legal codec numbers and attaching them to callouts to a library API. This isn't much different to how existing implementations may already be handling lzma, bzip2 and deflate formats. (Although I'm aware Java has some native implementations, I was incorrectly assuming the faster C implementations were being utilised mostly.)

A consequence of this is that the separate onion-layer transform approach had to go, as that was a CRAM encoding change rather than a trivial codec API (ie compress-buffer and decompress-buffer functions). The transforms got consumed into the rANS-Nx16 codec instead, although minus delta, which was a mistake as it would greatly help certain tag types such as ONT signals.

It wasn't such a bad idea, but ultimately it didn't help at all: there was no willingness to simply link against an external library, yet we suffered splitting CRAM updates into two new versions instead of one.

There are still improvements that could be made to CRAM, but I wouldn't make them now, so ultimately I think CRAM 4.0 is dead in the water. I'm considering removing it from htslib, if only to reduce the potential attack vector for supplying malicious files.

If I was to pick this up again, it'd probably be to start again from scratch and build on something like the recent OpenZL. It wouldn't have the same compression ratios as archive-mode CRAM, unless we extended it, but I like the notion of a standard structure that off-the-shelf tooling can decode. Just as BGZF is basically decodable with zcat, with OpenZL it could include all those useful chunking strategies and data transformations. I still have plans for a BGZF2, but based on OpenZL now instead of zstd. I just got sucked into other projects and my time to work on this evaporated, but one day!

My rough prototype showed enormous gains for multi-sample VCF and long-read SAM files. It's not up to CRAM performance, but closer to CRAM than BAM for Revio and with a rewrite to use OpenZL it ought to be far closer still.

@tfenne
Member Author

tfenne commented Mar 31, 2026

@jkbonfield Thank you so much for both your pointers on testdata in htsspecs, and the history around CRAM 4. That's a lot to digest, and I think probably I'll reach out via email re: future formats (or finally make it to a GA4GH call). But the short version is that between AI-assisted coding and Project Panama in Java 21/22+, I think it might be much easier to spin up Java wrappers around native libraries and build Java-enabled implementations of future compression formats much faster.

@tfenne tfenne force-pushed the tf_cram_31 branch 2 times, most recently from 94ec6dc to c944b6c on April 1, 2026 13:14
@tfenne tfenne requested review from nh13 and yfarjoun April 1, 2026 13:33
@tfenne tfenne marked this pull request as ready for review April 1, 2026 13:33
@tfenne
Member Author

tfenne commented Apr 1, 2026

@yfarjoun and @nh13 I've marked this as ready for review. I've been back over the code with multiple AIs doing code review, and have fixed up everything they found that I agreed with.


@nh13 nh13 left a comment


This needs some battle-testing; it's too big for a static code review :/

public ByteBuffer decompress(final ByteBuffer input) {
// Read ISIZE from the last 4 bytes of the GZIP block to size the output
final int isizeOffset = input.limit() - 4;
final int isize = input.duplicate().order(ByteOrder.LITTLE_ENDIAN).position(isizeOffset).getInt();

issue:
ISIZE is mod 2^32, so you should use Integer.toUnsignedLong for the 2GB-4GB case and check if isize is zero, but I have no idea how to handle it if the actual size was > 4GB. I think it doesn't matter for CRAM, but this is a public function...
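A hedged sketch of what this suggestion might look like (class and method names are illustrative, not the PR's actual code). ISIZE is the uncompressed size mod 2^32 per the GZIP spec, so reading it as unsigned covers the 2GB-4GB range; sizes >= 4GB remain inherently ambiguous.

```java
import java.nio.ByteBuffer;
import java.nio.ByteOrder;

// Sketch: read the GZIP ISIZE field as an unsigned 32-bit value.
public final class IsizeSketch {
    public static long readIsize(ByteBuffer input) {
        final int isizeOffset = input.limit() - 4;
        final ByteBuffer dup = input.duplicate().order(ByteOrder.LITTLE_ENDIAN);
        dup.position(isizeOffset);
        final int raw = dup.getInt();
        // Integer.toUnsignedLong avoids a negative size for the 2GB-4GB case;
        // sizes >= 4GB wrap mod 2^32 and cannot be recovered from ISIZE alone.
        return Integer.toUnsignedLong(raw);
    }
}
```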

}

private void grow(final int minCapacity) {
int newCapacity = buf.length << 1;

nitpick: could overflow if buf.length > 1GB
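One possible overflow-safe variant, as a sketch (names illustrative): doubling via `<< 1` goes negative once the length exceeds 1GB, so the result needs clamping.

```java
// Sketch: compute a doubled capacity without overflowing to a negative value.
public final class GrowSketch {
    // Largest array size the JVM reliably allows.
    private static final int MAX_ARRAY_SIZE = Integer.MAX_VALUE - 8;

    public static int newCapacity(int oldCapacity, int minCapacity) {
        final int doubled = oldCapacity << 1;          // may overflow to negative
        int newCap = Math.max(doubled, minCapacity);
        if (newCap < 0 || newCap > MAX_ARRAY_SIZE) {   // overflow or too big
            if (minCapacity < 0) throw new OutOfMemoryError("capacity overflow");
            newCap = (minCapacity > MAX_ARRAY_SIZE) ? Integer.MAX_VALUE : MAX_ARRAY_SIZE;
        }
        return newCap;
    }
}
```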

CHANGELOG.md Outdated
### CRAM 3.1 Write Support

- Enable CRAM 3.1 writing with all spec codecs: rANS Nx16, adaptive arithmetic Range coder, FQZComp, Name Tokenisation, and STRIPE
- Add configurable compression profiles (FAST, NORMAL, BEST, ARCHIVE) with trial compression for automatic codec selection

suggestion:

Suggested change
- Add configurable compression profiles (FAST, NORMAL, BEST, ARCHIVE) with trial compression for automatic codec selection
- Add configurable compression profiles (FAST, NORMAL, SMALL, ARCHIVE) with trial compression for automatic codec selection

README.md Outdated

> **NOTE: _HTSJDK has only partial support for the latest Variant Call Format Specification. VCFv4.3 can be read but not written, VCFv4.4 can be read in lenient mode only, and there is no support for BCFv2.2._**

> **NOTE: _HTSJDK now supports both reading and writing CRAM 3.1 files. CRAM 3.1 write support includes all codecs defined in the specification (rANS Nx16, adaptive arithmetic Range coder, FQZComp, Name Tokenisation, and STRIPE), configurable compression profiles (FAST, NORMAL, BEST, ARCHIVE), and trial compression for automatic codec selection. Files produced by htsjdk are interoperable with samtools/htslib._**

suggestion:

Suggested change
> **NOTE: _HTSJDK now supports both reading and writing CRAM 3.1 files. CRAM 3.1 write support includes all codecs defined in the specification (rANS Nx16, adaptive arithmetic Range coder, FQZComp, Name Tokenisation, and STRIPE), configurable compression profiles (FAST, NORMAL, BEST, ARCHIVE), and trial compression for automatic codec selection. Files produced by htsjdk are interoperable with samtools/htslib._**
> **NOTE: _HTSJDK now supports both reading and writing CRAM 3.1 files. CRAM 3.1 write support includes all codecs defined in the specification (rANS Nx16, adaptive arithmetic Range coder, FQZComp, Name Tokenisation, and STRIPE), configurable compression profiles (FAST, NORMAL, SMALL, ARCHIVE), and trial compression for automatic codec selection. Files produced by htsjdk are interoperable with samtools/htslib._**

@@ -107,7 +109,7 @@ public void writeAlignment(final SAMRecord alignment) {
*/
// TODO: retained for backward compatibility for disq in order to run GATK tests (remove before merging this branch)

suggestion: can we remove this now?

int basesRemaining = 0;
boolean firstLen = true;
int lastLen = 0;
int qualOffset = 0;

nitpick: can remove

@@ -335,84 +402,150 @@ private static void writeString(final ByteBuffer tokenStreamBuffer, final String

nitpick:

Suggested change
tokenStreamBuffer.put(val.getBytes(StandardCharsets.UTF_8));

this.byteOffsetOfContainer = containerByteOffset;

final ContentDigests hasher = ContentDigests.create(ContentDigests.ALL);
// htslib does not write content digest tags (BD/SD/B5/S5/B1/S1) into slice headers.

suggestion: add to breaking changes in the CHANGELOG?

import htsjdk.samtools.*;
import htsjdk.samtools.cram.structure.CRAMCompressionProfile;
import htsjdk.samtools.cram.structure.CRAMEncodingStrategy;
import htsjdk.samtools.util.CloserUtil;

suggestion:

Suggested change
import htsjdk.samtools.util.CloserUtil;

* @param reader the reader to read from
* @return the decoded value
*/
public static long readUnsignedLTF8(final CRAMByteReader reader) {

suggestion: these are super dense; I think ITF8.java and the rest of this code is more readable.

cmnbroad and others added 14 commits April 11, 2026 06:34
…e helpers.

Implement STRIPE encoding for Range and rANS Nx16 codecs. Add Range coder
helper methods to CompressionHeaderEncodingMap. Implement FQZComp quality
score encoder and wire it into the CRAM 3.1 write profile. Add javadoc to
Range codec and CompressionUtils methods.
…ion.

Add CRAM compression profiles (FAST, NORMAL, SMALL, ARCHIVE) and wire
encoding strategy through the write API. Implement trial compression for
ARCHIVE profile with FQZComp parameter training to select optimal codec
parameters per data series.
Simple main() class that reads SAM/BAM/CRAM and writes SAM/BAM/CRAM with
a selectable compression profile (fast, normal, small, archive). Reports
record count, timing, and input/output file sizes for quick comparison.

Also adds CRAMCompressionProfile.valueOfCaseInsensitive() for case-insensitive
profile lookup by name.
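A case-insensitive enum lookup of this shape might look like the following sketch (the enum here is illustrative, not the actual CRAMCompressionProfile):

```java
// Sketch: resolve a profile by name regardless of case, without relying on
// callers upper-casing before Enum.valueOf.
public enum ProfileSketch {
    FAST, NORMAL, SMALL, ARCHIVE;

    public static ProfileSketch valueOfCaseInsensitive(String name) {
        for (final ProfileSketch p : values()) {
            if (p.name().equalsIgnoreCase(name)) return p;
        }
        throw new IllegalArgumentException("No profile named: " + name);
    }
}
```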
…hed mate pairs.

Match htslib/samtools behavior: strip NM/MD tags during CRAM encoding when
stored values match recomputed values (preserving non-standard values verbatim),
and regenerate missing NM/MD from read features + reference during decoding.
Handle htslib's cF internal tag for embed_ref=2 mode compatibility.

Implement attached mate pair encoding within slices, storing only a record
offset (NF) instead of full mate info (MF, NS, NP, TS) for mate pairs in
the same slice.

Update test SAM/BAM files to include NM/MD tags that conformant CRAM readers
now generate.
…ion testing.

Switch low-entropy integer data series (MF, MQ, FC, BS, TL, FN) from GZIP to
rANS Nx16 Order 1 where rANS produces better compression. Keep GZIP for
high-entropy positional data (NP, FP) and byte-array series (IN, SC) where
LZ77 outperforms pure entropy coding — rANS produced 97% ratio on NP vs
GZIP's 76%.
…ugging.

Change the external block content IDs assigned to each CRAM data series to match
the values used by htslib's cram_DS_ID enum (cram_structs.h). This is a purely
cosmetic change for newly written files — existing CRAM files still decode
correctly because content IDs are stored in each container's compression header
encoding map, not hardcoded in the decoder.

The previous HTSJDK IDs (1-32) and htslib IDs (11-46) were both arbitrary but
different, making it difficult to compare per-block compression statistics
between the two implementations. With matching IDs, tools like CramStats can
directly compare blocks by content ID across htsjdk-written and samtools-written
CRAM files.
…cution.

Restructure CRAM31Tests into a base class + 4 subclasses (one per compression
profile: fast, normal, small, archive). Each class has a data provider that
supplies the 3 input files, and the test method exercises both samtools and
HTSJDK conversion paths. Since TestNG parallelizes by class, the 4 profile
classes now run concurrently instead of sequentially.

Also passes the matching compression profile to the HTSJDK write step so all
four profiles are actually tested (previously always used NORMAL).

Wall-clock time for fast+normal: ~1m11s vs ~4m30s for the original 12 tests.
…didates with htslib.

TrialCompressor fixes:
- Production phase now calls setMethod(winner.getMethod()) so getMethod() returns
  the correct method for block headers (was returning the last trial block's
  per-block-best instead of the overall winner)
- blocksUntilRetrial now set after initial trial completes (was only set in the
  re-trial branch, causing immediate spurious re-trials)
- Empty blocks delegate to the winner's compress() instead of returning raw empty
  bytes (GZIP requires header framing even for empty payloads)
- Added setMethod() to ExternalCompressor for subclass method updates, with
  null-guard on getMethod() to catch use-before-compress errors

TrialCompressor now properly implements htslib's trial approach: trial NTRIALS=3
non-empty blocks accumulating sizes, pick winner, use for TRIAL_SPAN=70 blocks,
then re-trial. Each block's header always matches its actual compressed data.

Profile fixes:
- SMALL profile now uses NORMAL's primary codec assignments (rANS + FQZComp +
  NameTokeniser) instead of all-GZIP, with BZIP2 + GZIP/rANS as trial
  candidates — matching htslib's SMALL (level 6, use_rans=1, use_bz2=1)
- ARCHIVE trial candidates include Range coder + BZIP2 + GZIP/rANS
- Compress-then-getMethod ordering in createCompressedBlockForStream

New test: testGetMethodMatchesCompressedDataThroughoutLifecycle verifies the
method/data consistency invariant across 80 blocks (trial + production phases).
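The trial/production lifecycle described above can be sketched as follows. The candidate codecs are stand-ins and the class is illustrative, not htsjdk's actual TrialCompressor.

```java
import java.util.Arrays;

// Sketch of the trial-compression pattern: trial NTRIALS blocks across all
// candidates, pick an overall winner, use it for TRIAL_SPAN blocks, re-trial.
public final class TrialSketch {
    public interface Codec { byte[] compress(byte[] in); }

    static final int NTRIALS = 3, TRIAL_SPAN = 70;
    private final Codec[] candidates;
    private final long[] totals;
    private int trialsDone = 0;
    private int blocksUntilRetrial = 0;
    private Codec winner = null;

    public TrialSketch(Codec... candidates) {
        this.candidates = candidates;
        this.totals = new long[candidates.length];
    }

    public byte[] compress(byte[] block) {
        if (winner == null) {
            // Trial phase: run every candidate, accumulate sizes, emit this
            // block's best so the block header always matches its data.
            byte[] best = null;
            for (int i = 0; i < candidates.length; i++) {
                final byte[] out = candidates[i].compress(block);
                totals[i] += out.length;
                if (best == null || out.length < best.length) best = out;
            }
            if (++trialsDone == NTRIALS) {
                int w = 0;
                for (int i = 1; i < totals.length; i++) if (totals[i] < totals[w]) w = i;
                winner = candidates[w];
                blocksUntilRetrial = TRIAL_SPAN; // set after the trial completes
            }
            return best;
        }
        // Production phase: use the winner; schedule a re-trial afterwards.
        final byte[] out = winner.compress(block);
        if (--blocksUntilRetrial == 0) {
            winner = null;
            trialsDone = 0;
            Arrays.fill(totals, 0L);
        }
        return out;
    }
}
```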
…am dedup, and all-MATCH elimination.

Four improvements to NameTokenisationEncode that reduce normal-profile read name
compression from 23.7MB to 18.8MB (samtools: 17.2MB) on the 1KG benchmark:

1. Per-token-type rANS flag sets matching htslib tok3 level -3 — STRIPE for
   integer streams (DIGITS, DIFF, DUP), ORDER-1 for STRING, PACK+RLE for TYPE.
2. Fix remaining()/limit() bug in tryCompress — ORDER and STRIPE flag guards
   were checking buffer position instead of data size, silently skipping those
   flags on every iteration after the first.
3. All-MATCH TYPE stream elimination — omit TOKEN_TYPE streams that are entirely
   MATCH after byte 0 (the decoder already regenerates these per the spec).
4. Cross-position duplicate compressed stream detection — emit 3-byte dup
   references for byte-identical compressed streams instead of full data.
…into CRAM.

GzipCodec is a reusable, ByteBuffer-based GZIP compressor/decompressor that
operates directly on Deflater/Inflater without stream wrappers. Supports both
standard GZIP and BGZF output formats, and decompresses either transparently.

Key design points:
- ByteBuffer API for both input and output, with convenience allocating methods
- Stateful (Deflater, Inflater, CRC32 reused across calls), not thread-safe
- Supports DeflaterFactory/InflaterFactory for pluggable implementations
- Configurable compression level and deflate strategy

GZIPExternalCompressor now delegates to GzipCodec, accepting an optional
deflate strategy parameter for future use (e.g., Deflater.FILTERED).
tfenne added 13 commits April 11, 2026 06:34
…implifications.

Comprehensive refactor of the rANS 4x8 and Nx16 encoder/decoder implementations:

API changes:
- RANSEncode.compress() and RANSDecode.uncompress() now accept and return byte[]
  instead of ByteBuffer. External compressor wrappers simplified accordingly.
- All callers updated (ExternalCompressors, NameTokenisationEncode, TokenStreams, tests).

Internal refactor (ByteBuffer -> byte[]):
- All encoder/decoder internals converted from ByteBuffer to byte[] with explicit
  offset tracking (int[] posHolder pattern for mutable positions).
- Encoders write backwards into byte arrays, eliminating the O(N) Utils.reverse()
  pass that was previously required after forward-write encoding.
- Frequency table read/write methods use byte[] with posHolder.
- Added byte[] overloads for CompressionUtils.readUint7/writeUint7.

Structural simplifications:
- Eliminated ArithmeticDecoder wrapper class — inlined as plain int[][] frequencies
  and byte[][] reverseLookup arrays directly on RANSDecode.
- Removed combined advanceSymbolNx16/advanceSymbol4x8 methods from RANSDecodingSymbol
  — callers use advanceSymbolStep() + Utils.RANSDecodeRenormalize*() two-step pattern.
- Removed all dead ByteBuffer methods from RANSEncodingSymbol, RANSDecodingSymbol, Utils.
- Flattened package structure: moved rans4x8/ and ransnx16/ subpackage classes into
  parent rans/ package, enabling package-private field access for inline optimization.
- Made RANSEncodingSymbol fields package-private; inlined putSymbolNx16 arithmetic
  into RANSNx16Encode Order-0 loop (pos kept as local int instead of int[] heap array).
- Selective decoder reset: only reset frequency/symbol rows actually used in previous
  decode, not all 65,536 entries.
- Moved encoder/decoder state initialization from lazy-init methods to constructors,
  making all state arrays final fields.
- Made param classes final, added final modifiers throughout, added javadoc.

Performance: CRAM round-trip benchmark improved from ~189s to ~97s (2x speedup),
with the TrialCompressor change (separate commit) and these rANS changes together.
- Share encoder/decoder instances in RansTest, RangeTest, and RangeInteropTest
  instead of creating new instances per DataProvider row. Each codec eagerly
  allocates large internal symbol/frequency tables, and hundreds of instances
  caused heap exhaustion when test classes ran in parallel.

- Rewrite FakeReferenceSequenceFile to use a single shared byte[] buffer that
  grows as needed, instead of allocating a fresh chromosome-sized array (up to
  247MB) on every getSequence() call.

- Convert memory-heavy DataProviders in ContainerTest, ContainerFactoryTest,
  and CRAMFileBAIIndexTest to return Iterator<Object[]> for lazy consumption,
  avoiding materializing all test data simultaneously.

- Add QuietTestWrapper.lazy(Supplier) for deferred evaluation in DataProviders.

- Fix CRAMReferenceRegion.fetchReferenceBasesByRegion to set regionLength from
  the requested length rather than the returned array length, preventing issues
  with oversized reference arrays.

- Increase test heap to 10G and reduce parallelism to 1.5x CPUs to provide
  more headroom for the CRAM 3.1 write tests.
Move FN_NumberOfReadFeatures from GZIP to rANS Nx16 Order 1 in the NORMAL
profile, matching samtools behavior where rANS outperforms GZIP for this
data series.

Simplify tag trial compression to GZIP + rANS Order 0 (dropping Order 1
which provided negligible benefit for tags). Add BZIP2 as a trial candidate
for SA:Z and XA:Z tags where its BWT-based compression excels on the
structured alignment data (repeated contig names, CIGAR patterns), saving
~7MB on typical WGS data.
Replace Math.floor() with integer division in RangeCoder (both encode and
decode paths) to eliminate int-to-double-to-long conversion overhead on
52.5M calls per file. Replace Math.floorDiv(f,2) with f>>1 in ByteModel
renormalization. Remove redundant RANSEncodingSymbol.reset() loop since
set() overwrites all fields and zero-frequency symbols are never accessed.

Convert RangeCoder encoding output from ByteBuffer to byte[] to eliminate
bounds checking and position tracking overhead. Interleave ByteModel
symbols[] and frequencies[] into a single int[] array for cache locality
during the linear scan in encode/decode.

Bump test heap to 12G to fix OOM on large_aux test.

Benchmarked: ~3% encode speedup (94.3s -> 91.5s), byte-identical output.
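The arithmetic rewrites can be illustrated with a small sketch (names invented; the equivalences hold for the non-negative operands a range coder works with):

```java
// Sketch of the two rewrites described above.
public final class RangeMathSketch {
    // Before: (long) Math.floor(range / (double) totalFreq)
    // After:  plain integer division, avoiding the int->double->long round trip.
    public static long scale(long range, int totalFreq) {
        return range / totalFreq;
    }

    // Before: Math.floorDiv(f, 2); after: f >> 1 (identical for f >= 0).
    public static int halve(int f) {
        return f >> 1;
    }
}
```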
…ing.

Replace three compiled regex patterns (READ_NAME_PATTERN, DIGITS0_PATTERN,
DIGITS_PATTERN) with hand-written character-class splitting and digit checks.
The regex tokenizer was called 50M+ times per file, creating a new Matcher
object each time. The hand-written loop produces identical token sequences.

Also reuse a single RANSNx16Encode instance across tryCompress() calls
instead of allocating a fresh 256x256 RANSEncodingSymbol matrix per trial.

Benchmarked: 27% encode speedup (94.3s -> 68.8s), byte-identical output.
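The regex-free splitting technique can be sketched like this. Only two token classes (digit runs vs. everything else) are shown; the real name tokeniser distinguishes more, so this is the technique only, not its rules.

```java
import java.util.ArrayList;
import java.util.List;

// Sketch: split a read name into runs of digits vs. non-digits in a single
// pass, with no Pattern/Matcher allocation per call.
public final class TokenizeSketch {
    public static List<String> tokenize(String name) {
        final List<String> tokens = new ArrayList<>();
        int i = 0;
        while (i < name.length()) {
            final int start = i;
            final boolean digits = isDigit(name.charAt(i));
            while (i < name.length() && isDigit(name.charAt(i)) == digits) i++;
            tokens.add(name.substring(start, i));
        }
        return tokens;
    }

    private static boolean isDigit(char c) { return c >= '0' && c <= '9'; }
}
```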
htslib does not write content digest tags (BD/SD/B5/S5/B1/S1) into CRAM
slice headers — these are optional per the spec. We were computing SHA-512,
SHA-1, and CRC32 on both bases and quality scores for every record, costing
~21% of total encode time. Block-level CRC32 (required by CRAM 3.0+)
already provides data integrity verification.

Benchmarked: 42% cumulative encode speedup (94.3s -> 54.4s) with 136KB
smaller output (no digest tags in slice headers).
… output optional.

The name tokeniser decoder was allocating a fresh RANSNx16Decode per token
stream (~50 per slice), each constructing 256x256 arrays totaling 1MB+.
Share a single instance across all token streams within a decode call.

Also make CramConverter's output argument optional — omitting it reads all
records without writing, useful for benchmarking the decode path.

Benchmarked: ~4% decode speedup (180s -> 172s on 51.7M record CRAM).
…eReader/Writer.

ByteArrayInputStream.read() was the #1 hotspot in CRAM decode at 10.34% of
CPU time due to its synchronized methods. Replace with CRAMByteReader and
CRAMByteWriter — final classes wrapping byte[] + int pos with no
synchronization and no InputStream/OutputStream inheritance, enabling
aggressive JIT inlining.

Add CRAMByteReader/CRAMByteWriter overloads for ITF8 and LTF8 varint
encoding/decoding (called per-record for every integer data series).

Also optimize ByteArrayStopCodec.read() to scan directly in the underlying
byte[] for the stop byte instead of reading one byte at a time into a BAOS.

Benchmarked: 23% decode speedup (180s -> 138s), encode unchanged (54s -> 53s).
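The idea behind an unsynchronized byte reader can be sketched as follows (illustrative, not htsjdk's actual CRAMByteReader):

```java
// Sketch: a final class over byte[] + pos with no synchronization and no
// InputStream inheritance, so the JIT can inline reads aggressively.
public final class ByteReaderSketch {
    private final byte[] buf;
    private int pos;

    public ByteReaderSketch(byte[] buf) { this.buf = buf; }

    public int read()      { return buf[pos++] & 0xFF; }
    public int remaining() { return buf.length - pos; }

    // Scan directly in the underlying array for a stop byte (cf. the
    // ByteArrayStopCodec change) instead of reading one byte at a time.
    public byte[] readToStop(byte stop) {
        final int start = pos;
        while (pos < buf.length && buf[pos] != stop) pos++;
        final byte[] out = java.util.Arrays.copyOfRange(buf, start, pos);
        if (pos < buf.length) pos++; // consume the stop byte
        return out;
    }
}
```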
…ministic test counts.

initializeTests() was called from a @BeforeSuite method which can be invoked
concurrently by multiple TestNG threads during parallel class execution. The
unsynchronized static TEST_DATAs ArrayList would get populated multiple times,
inflating the test count by thousands (e.g., 4,824 -> 8,442 for
VariantContextWritersUnitTest). Make initializeTests() synchronized and
idempotent so it only runs once regardless of how many threads call it.
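The fix pattern, as a minimal sketch (names illustrative): synchronize the initializer and guard it with a flag so concurrent callers populate the list exactly once.

```java
import java.util.ArrayList;
import java.util.List;

// Sketch: a synchronized, idempotent static initializer safe under
// concurrent @BeforeSuite invocation.
public final class InitSketch {
    private static final List<Object[]> TEST_DATA = new ArrayList<>();
    private static boolean initialized = false;

    public static synchronized void initializeTests() {
        if (initialized) return; // idempotent: later callers are no-ops
        for (int i = 0; i < 3; i++) TEST_DATA.add(new Object[]{i});
        initialized = true;
    }

    public static synchronized int testCount() { return TEST_DATA.size(); }
}
```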
…o single pass.

Previously CRAM decode performed 3-4 separate passes over each record's features:
restoreReadBases (2 sub-passes), getCigarForReadFeatures, and calculateMdAndNm
(via a separately-derived CIGAR). The new restoreBasesAndTags method in
CRAMRecordReadFeatures does all of this in a single iteration, simultaneously
filling bases, building CIGAR elements, and accumulating the MD string and NM
count. Base normalization (toBamReadBasesInPlace) is also folded in.

The CIGAR is cached on CRAMCompressionRecord so toSAMRecord() doesn't need
to recompute it. Handles edge cases: reads extending beyond the reference
boundary, unmapped reads, and the no-features fast path.

Made SequenceUtil.bamReadBaseLookup public for use by the fused method.

Benchmarked: 38% cumulative decode speedup (180s -> 111s on 51.7M records).
…ng CRAM decode.

Tag IDs (2-byte name + 1-byte type) are invariant within a slice, but the
decoder previously created new String objects for key and keyType3Bytes and
recomputed the binary tag code for every tag on every record (~150-200M
allocations for 51M records). Introduce TagKeyCache, which pre-computes this
metadata once from the tag ID dictionary and reuses it via a new ReadTag
constructor. The cache uses parallel arrays with a linear scan, well suited
to the 5-20 unique tag IDs typical of a slice. Also pre-resolve the tag data
series codecs to eliminate HashMap<Integer,...> autoboxing in the inner
loop. ~9% decode speedup.
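The parallel-array-plus-linear-scan idea can be sketched as follows. Everything here is a hypothetical simplification: the class, field names, and bit layout of the packed tag ID are assumptions, not htsjdk's actual TagKeyCache:

```java
// Sketch of a parallel-array cache with linear scan. For the handful of
// unique tag IDs in a slice, scanning two flat arrays beats a HashMap:
// no hashing, no Integer boxing, and the arrays stay hot in CPU cache.
final class TagKeys {
    private final int[] ids = new int[32];        // packed tag name+type codes
    private final String[] keys = new String[32]; // pre-built two-char tag names
    private int size;

    String keyFor(final int tagId) {
        for (int i = 0; i < size; i++) {
            if (ids[i] == tagId) return keys[i];  // cache hit: no allocation
        }
        // Miss: build the key once; assumes name chars packed in the top bytes.
        final String key = "" + (char) ((tagId >> 16) & 0xFF)
                              + (char) ((tagId >> 8) & 0xFF);
        ids[size] = tagId;
        keys[size] = key;
        size++;
        return key;
    }
}
```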
LibdeflateDeflater/Inflater call super.end() in their constructors to
free the unused JDK zlib stream, but only override the byte[] methods.
Callers using the ByteBuffer overloads (e.g. GzipCodec) hit
"Deflater has been closed" because the base Deflater's ByteBuffer
methods check the freed native stream.

- Add ByteBuffer overrides to LibdeflateDeflater (setInput, deflate)
  and LibdeflateInflater (setInput, inflate) that delegate to the
  byte[] implementations
- Update GzipCodec compress/decompress to use byte[] deflater/inflater
  API as a belt-and-suspenders fix
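The shape of the delegation fix, as a simplified sketch (the class and fields below are hypothetical stand-ins, not the actual LibdeflateDeflater, which extends java.util.zip.Deflater):

```java
import java.nio.ByteBuffer;
import java.util.Arrays;

// Sketch: the ByteBuffer overload drains into a byte[] and delegates to the
// array-based path, so nothing ever touches the freed native zlib stream.
class DelegatingCodec {
    byte[] pending = new byte[0];

    void setInput(final byte[] b, final int off, final int len) {
        pending = Arrays.copyOfRange(b, off, off + len);
    }

    // ByteBuffer overload: copy out the remaining bytes, then reuse the
    // byte[] implementation above.
    void setInput(final ByteBuffer buf) {
        final byte[] tmp = new byte[buf.remaining()];
        buf.get(tmp);
        setInput(tmp, 0, tmp.length);
    }
}
```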
@tfenne tfenne changed the base branch from master to dev April 11, 2026 14:00
tfenne added 15 commits April 12, 2026 05:03
Profile-guided optimizations targeting the CRAM encoding hot path:

CompressionHeaderFactory (12.8% → ~3%):
- Eliminated redundant O(tags × records) iteration by gathering tag IDs
  during dictionary building instead of a separate discovery pass
- Skip record iteration entirely for fixed-size tag types (A/c/C/I/i/f/s/S)
- Z-type tags go directly to ByteArrayStopEncoding without size range scan
- Combined getDataForTag + getByteSizeRangeOfTagValues into single pass for
  B-type tags

FQZComp quality encoder:
- Pool 65K ByteModel arrays across compress() calls with reset() instead
  of reallocating. Added ByteModel.reset() method.
- Pool qualData byte[] across calls to avoid per-block allocation

MD5 checksums:
- Cache MessageDigest instance via ThreadLocal instead of calling
  MessageDigest.getInstance("MD5") on every invocation
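  A minimal sketch of that caching pattern, assuming a standalone helper class (names are hypothetical):

```java
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;

// Sketch: one MessageDigest per thread, created lazily and reused across
// calls instead of invoking MessageDigest.getInstance("MD5") every time.
final class Md5Cache {
    private static final ThreadLocal<MessageDigest> MD5 =
            ThreadLocal.withInitial(() -> {
                try {
                    return MessageDigest.getInstance("MD5");
                } catch (final NoSuchAlgorithmException e) {
                    throw new IllegalStateException(e); // MD5 is always present
                }
            });

    static byte[] digest(final byte[] data) {
        final MessageDigest md = MD5.get();
        md.reset();   // defensive: clear any state left by a prior caller
        return md.digest(data);
    }
}
```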

Name tokeniser:
- Cache parsed integer values in EncodeToken to avoid repeated
  Integer.parseInt() calls in distributeTokensForPosition hot loop
- Use int-based EncodeToken constructor for DELTA/DELTA0 tokens
Bug fixes:
- Fix incorrect offset in RangeEncode bzip2 fallback path
- Add missing stop byte guard in ByteArrayStopCodec
- Reset stale encoder state between NameTokenisationEncode compress() calls
- Fix CRAMByteWriter.grow() overflow for buffers >1GB
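The grow() overflow fix follows a standard pattern; sketched below with a hypothetical helper (not the actual CRAMByteWriter code):

```java
// Sketch of overflow-safe capacity growth: doubling an int capacity past
// ~1GB wraps negative, so detect the wrap and fall back to the exact
// requirement instead of allocating with a negative size.
final class Capacity {
    static int newCapacity(final int oldCapacity, final int minNeeded) {
        final int doubled = oldCapacity << 1;      // naive growth policy
        if (doubled < 0 || doubled < minNeeded) {  // wrapped, or still too small
            return minNeeded;
        }
        return doubled;
    }
}
```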

Correctness:
- Use US_ASCII charset for read name encoding (per SAM/BAM spec)
- Add equals/hashCode to GZIPExternalCompressor including compression level

Cleanup:
- Remove dead imports, variables, regex patterns, and stale TODO comments
- Reformat LTF8 CRAMByteReader/Writer methods to match ITF8 style
- Add thread-safety note to TrialCompressor javadoc
- Update CHANGELOG and README (profile names, CRAM 3.1 default, digest removal)
…ests

CRAIQueryTest (11 tests): writes a multi-contig CRAM+CRAI programmatically
and verifies that region queries, unmapped queries, multi-interval queries,
and alignment-start queries all return exactly the expected records. Uses
small reads-per-slice (50) to exercise cross-container index entries.

CodecRoundTripPropertyTest (356 tests): verifies decode(encode(X)) == X
for all CRAM codecs (RANS4x8, RANSNx16, Range, FQZComp, NameTokenisation)
across data patterns that target codec-specific edge cases: PACK boundary
(16 distinct symbols), RLE-friendly runs, maximum entropy, alternating
patterns, and size edge cases (empty, single byte).
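The property under test can be illustrated with a toy codec (a run-length encoder standing in for the real CRAM codecs; the class below is purely illustrative):

```java
import java.io.ByteArrayOutputStream;

// Sketch of the decode(encode(X)) == X round-trip property, using a toy
// run-length codec. The interesting inputs are the same edge cases listed
// above: empty input, a single byte, long runs, alternating bytes.
final class ToyRle {
    static byte[] encode(final byte[] in) {
        final ByteArrayOutputStream out = new ByteArrayOutputStream();
        int i = 0;
        while (i < in.length) {
            int j = i;
            while (j < in.length && in[j] == in[i] && j - i < 255) j++;
            out.write(j - i);   // run length, 1..255
            out.write(in[i]);   // run value
            i = j;
        }
        return out.toByteArray();
    }

    static byte[] decode(final byte[] in) {
        final ByteArrayOutputStream out = new ByteArrayOutputStream();
        for (int i = 0; i < in.length; i += 2) {
            final int runLength = in[i] & 0xFF;
            for (int k = 0; k < runLength; k++) out.write(in[i + 1]);
        }
        return out.toByteArray();
    }
}
```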
…mes)

- ReadTag: consolidate intToNameType3Bytes/intToNameType4Bytes into a
  shared intToNameType(int, boolean) method; add roundtrip and packing
  layout tests for all tag ID int/String conversions
- CRAMCompressionRecord: replace tagIdsIndex TODO with comment explaining
  that the shared MutableInt is intentional for tag dictionary grouping
- Slice.validateAlignmentSpanForReference: document that reads extending
  beyond the reference is permitted (matches htslib behavior)
- Slice.setReferenceMD5: document that alignmentStart < 1 is expected for
  multi-ref and unmapped slices
Replace the TODO with a javadoc explanation: supplementary and secondary
reads are always DETACHED and never mate-linked, so synthetic name
propagation cannot help them. This is safe because htsjdk always sets
preserveReadNames=true when encoding. The javadoc also documents what htslib
does (it forces name preservation when SA tags are present) in case lossy
mode is ever added.
Tests that htsjdk correctly decodes all 65 CRAM files from the hts-specs
test suite (3.0/passed/) and the 4 CRAM 3.1 integration files, comparing
decoded records against the paired SAM ground truth.

Organized into 16 small test classes for effective parallel execution:
- File definition, headers, containers, unmapped, mapped (no ref/with ref)
- Embedded reference, tags, multi-container, compression codecs
- Lossy modes (names/quality/sequence), data layout, corner cases
- Slice tags, compression levels, CRAM 3.1 cross-version

Handles known CRAM/SAM differences:
- Auto-generated MD/NM tags stripped before comparison when SAM lacks them
- Signed/unsigned B-array tag type mismatch tolerated (CRAM limitation)

Two lossy-mode tests disabled pending investigation:
- 1003_qual: detached mate start decoded as 0 instead of expected value
- 1007_seq: soft-clip CIGAR not preserved when sequence is '*'

Also documents that 0200_cmpr_hdr.cram (container with no slices) triggers
an IndexOutOfBoundsException in htsjdk's container initialization.
Container.distributeIndexingParametersToSlices() threw
IndexOutOfBoundsException when the slice list was empty (e.g. a container
with only a compression header and no data, as in hts-specs test
0200_cmpr_hdr.cram). Guard with an early return for empty slice lists.

Also fix CRAMIterator.nextContainer() to skip containers that match the
query but produce no records, rather than returning MATCHES_FILTER with
an empty record list which caused NoSuchElementException on the next
iterator.next() call.
1003_qual: Records with FLAG=0 (not paired) but non-zero PNEXT are valid
SAM; CRAM, however, does not preserve mate info for non-paired reads.
Compare with relaxed mate fields for non-paired records.

1007_seq: Document the root cause -- restoreBasesAndTags() short-circuits
CIGAR construction when CF_UNKNOWN_BASES is set, returning a simple
readLength-M CIGAR instead of processing SC features for soft clips.
Remains disabled pending refactoring of the feature processing loop.
restoreBasesAndTags() previously short-circuited when isUnknownBases was
set, returning a simple readLength-M CIGAR without processing read
features. This lost soft clips, hard clips, and other CIGAR operations
stored in the SC/HC/etc. data series.

Fix by running the full feature processing loop even when isUnknownBases,
but skipping base array writes, reference lookups, and MD/NM computation.
The CIGAR state machine (positions, operators, element list) is driven
entirely by feature positions and operators, so it works correctly
without base data. A boolean flag `doBasesAndMdNm` guards all base/ref
work with zero overhead on the normal (non-unknown-bases) path.

Re-enables the hts-specs 1007_seq compliance test which validates CIGAR
10S80M10S is preserved when sequence is '*'.
…ed query

Index tests (23 tests): verify CRAI-based query results against expected
counts from the hts-specs README for all 7 index test files (1400-1406):
simple, unmapped, 3-ref, multi-ref, multi-slice, multi-slice-ref, and
long+short read mixes.

Round-trip tests (52 tests across 3 classes): read each SAM from
hts-specs 3.0/passed/, write to CRAM, read back, compare against
original. Split into basic (unmapped/mapped/embedded-ref), tags+layout,
and index-file groups for parallel execution.

Fix CRAMFileReader.queryUnmapped() for all-unmapped CRAM files: when
getStartOfLastLinearBin() returns -1 (no mapped reads), skip the seek
and iterate from the beginning, matching BAMFileReader's behavior. This
previously caused RuntimeEOFException by seeking to an invalid offset.
The existing FQZComp interop tests decode pre-compressed hts-specs test
vectors but only round-trip synthetic data. Add round-trip tests that
compress and decompress the actual hts-specs quality data files (q4, q8,
q40+dir, qvar) which represent realistic Illumina binned, unbinned, and
long-read quality distributions. Parses the newline-delimited raw data
into per-record quality arrays as FQZComp requires.
These three classes rebuilt CRAM files and indexes from scratch for every
test invocation, despite the data being read-only. Total: ~680s → ~130s.

CRAMFileCRAIIndexTest (207s → ~14s):
  Change @BeforeMethod to @BeforeClass so the BAM→CRAM conversion,
  CRAI index build, and ReferenceSource creation happen once per class
  instead of once per test. Remove singleThreaded=true since shared
  state is now immutable.

CRAMFileBAIIndexTest (248s → ~80s):
  Add ConcurrentMap caches for CRAM bytes, BAI bytes, and CRAM files
  keyed by reads-per-slice. DataProviders now return cached data instead
  of rebuilding. Test methods receive pre-built BAI bytes alongside CRAM
  bytes to avoid redundant index creation.

CRAMIndexPermutationsTests (223s → ~32s):
  Add ConcurrentMap caches for BAI-indexed and CRAI-indexed CRAM files
  keyed by encoding strategy. Each of the 12 strategies builds one
  CRAM+BAI and one CRAM+CRAI, then reuses across all 139 query tests.
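The caching pattern shared by these three classes can be sketched as below (a simplified stand-in; the map key, value type, and names are assumptions):

```java
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.ConcurrentMap;

// Sketch of read-only test-fixture caching: build each expensive artifact
// at most once per key and share it across all tests that need it.
final class FixtureCache {
    private static final ConcurrentMap<Integer, byte[]> CRAM_BYTES =
            new ConcurrentHashMap<>();

    static byte[] cramFor(final int readsPerSlice) {
        // computeIfAbsent applies the builder atomically, at most once per
        // key, even when data providers run on multiple TestNG threads.
        return CRAM_BYTES.computeIfAbsent(readsPerSlice, FixtureCache::buildCram);
    }

    private static byte[] buildCram(final int readsPerSlice) {
        return new byte[readsPerSlice];  // stand-in for the BAM->CRAM conversion
    }
}
```

Because the cached artifacts are immutable byte arrays and files, tests can consume them concurrently without per-test rebuilds.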
The CEUTrio.HiSeq.WGS.b37.NA12878.20.21.v3.0.samtools.cram file was the
dominant cost in 6 test classes (~540s combined). Subsample to ~150K records
using `samtools view --subsample 0.229 --subsample-seed 42`, preserving
CRAM 3.0 format. The subsampled file retains both contigs (chr20: 51K,
chr21: 87K), unmapped reads (11.5K), 36/38 unique FLAG values, and 2064
unique CIGAR operations -- sufficient to exercise all compression profiles
including ARCHIVE (100K reads-per-slice).

Update HtsCRAMCodec31Test hardcoded first-read-name assertion to match
the new first record after subsampling.

Full test suite: 34,189 tests pass, wall-clock ~3m21s (was ~5m).
…e every 200th

Drop the 150 and 200 reads-per-slice variants, keeping only 100 (smallest)
and 300 (largest) since CRAMIndexPermutationsTests already covers 12
strategy permutations. Increase query sampling from every 100th to every
200th mapped read. Class runtime: 126s → 36s.
BCF2EncoderDecoderUnitTest (106s → <1s):
  Reduce the forCombinations list from 17 to 9 values (one representative
  per BCF2 type plus one null per non-CHAR type). This cuts the 3-way
  cartesian product from 17^3=4913 to 9^3=729 test sequences while still
  covering all type combinations and null handling. Total tests: 13848 → 1588.

SeekableStreamGZIPinputStreamIntegrationTest (82s → 4s):
  Reduce from 1M to 100K VCF records per test file (sufficient to exercise
  multi-block BGZF boundaries) and from 10 to 5 first-record-length
  variants. Total tests: 20 → 10.

Full suite: 21,909 tests pass in 2m12s (was ~5m with 34,189 tests).