
Full CRAM 3.1 support & performance tuning of CRAM read/write#1763

Open
tfenne wants to merge 43 commits into dev from tf_cram_31

Conversation

@tfenne
Member

@tfenne tfenne commented Mar 30, 2026

Description

Ok, so this is a monster PR. I'm not even sure it should be a single PR, but I wanted to put it up here to get some feedback on how best to proceed. I've created it as a draft for that reason. It builds upon the work by @cmnbroad in #1739 while implementing the necessary additions to write CRAM 3.1 at comparable compression ratios to htslib/samtools.

This was done with the assistance of Claude Code, though it also represents a significant investment of my own time spent prompting, reviewing, fixing, and redirecting.

The PR breaks down roughly as follows:

Many of the additions and changes are strongly inspired by work in htslib!

Currently, with the normal profile selected, htsjdk emits CRAMs approximately the same size as htslib's. With some 1KG low-pass data I've been testing, the results are actually slightly smaller. However, runtime to generate the CRAMs is ~2x slower than htslib/samtools. That gap could probably be shrunk further (it started out around 5-6x slower), but I think there will always be a substantial gap due to a) not having easy access to SIMD instructions and b) not having a good libdeflate wrapper on the JVM.

The read path for CRAM 3.1 predates this PR, but the changes here provide substantial performance improvements. Again, testing on low-pass WGS data I'm seeing a 40-50% speedup.

Assuming this PR/changeset has legs, I will do some broader benchmarking and report on both compression and runtime performance.

@jkbonfield I wouldn't expect you to review this PR, but if you have thoughts on how to validate the results of this work beyond the broad set of CRAM unit tests in the htsjdk repo, I would be very grateful for your input.

Things to think about before submitting:

  • Make sure your changes compile and new tests pass locally.
  • Add new tests or update existing ones:
    • A bug fix should include a test that previously would have failed and passes now.
    • New features should come with new tests that exercise and validate the new functionality.
  • Extend the README / documentation, if necessary
  • Check your code style.
  • Write a clear commit title and message
    • The commit message should describe what changed and is targeted at htsjdk developers
    • Breaking changes should be mentioned in the commit message.

@jkbonfield

jkbonfield commented Mar 31, 2026

For testing, hts-specs has a corpus of test files designed to validate CRAM decoder implementations.

https://github.com/samtools/hts-specs/tree/master/test/cram

The 3.0/README.md covers the basics of the CRAM structure, adding one feature at a time: a blank file with header, then basic one-read files without reference-based encoding, and so on. This mostly tests structures, features, encodings, etc. By the end it's also testing compression codecs.

The 3.1 directory has 4 CRAM files with progressive levels of new compression codecs (basically htslib's profiles, I think). They're whole-file real-world examples, as beyond the new codecs there are no format changes.

More useful for 3.1 is the codecs directory, which has raw codec-compressed byte streams for you to decode and validate. These are shrunk-down copies of the data in https://github.com/jkbonfield/htscodecs-corpus/tree/master/dat, although I think that may be a little incomplete now. Basically it's designed to test things like the rans, fqz, arith and name tokeniser codecs in isolation, for unit-testing purposes.

This is mostly just integration testing. The nature of CRAM is that even with the same codec we could produce subtly different encoded byte streams (eg it's up to the implementation how it approximates the probability tables in rANS and handles rounding errors when the frequencies don't divide up perfectly, but the data stream contains the necessary lookup tables to decode, so that's what matters).

So what we need for testing are:

  • Can we round-trip our own data without losing information
  • Can we round-trip with other implementations; java enc -> htscodecs dec, or htscodecs enc -> java dec. The test data here is attempting to validate that bit.
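The first bullet can be sketched as a simple round-trip property test. Here the JDK's Deflater/Inflater stands in for a CRAM codec (illustrative only, not an htsjdk API); cross-implementation tests for the second bullet would replace one side with htscodecs-produced streams.

```java
import java.io.ByteArrayOutputStream;
import java.util.zip.DataFormatException;
import java.util.zip.Deflater;
import java.util.zip.Inflater;

// Sketch: compress then decompress and compare to the original. A real CRAM
// round-trip test would do this per codec and per data series.
public final class RoundTripSketch {
    public static byte[] roundTrip(byte[] data) {
        final Deflater def = new Deflater();
        def.setInput(data);
        def.finish();
        final ByteArrayOutputStream comp = new ByteArrayOutputStream();
        final byte[] tmp = new byte[4096];
        while (!def.finished()) comp.write(tmp, 0, def.deflate(tmp));
        def.end();

        final Inflater inf = new Inflater();
        inf.setInput(comp.toByteArray());
        final ByteArrayOutputStream out = new ByteArrayOutputStream();
        try {
            while (!inf.finished()) out.write(tmp, 0, inf.inflate(tmp));
        } catch (final DataFormatException e) {
            throw new RuntimeException(e);
        }
        inf.end();
        return out.toByteArray();
    }
}
```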

@jkbonfield

jkbonfield commented Mar 31, 2026

Some comments regarding CRAM 4.0. This is long, but it may explain why I think CRAM 4 is now something to ignore. (Mainly speaking to you as someone looking at adding it for Noodles).

Personally, I think I made an error when I added it to htslib, but the history here is important to understand how we got to where we did.

Initially I had two aims to improve on CRAM 3.0:

  • Fix some glaring issues in the CRAM format.
  • Improve CRAM compression and if possible also speed.
  • A third bonus was to simplify where possible (eg cull the kitchen sink of CORE bit-wise encodings).

Fixing the CRAM format included things like replacing ITF8/LTF8 with a true universal variable length encoder that doesn't need to know the expected bit-size (somewhat of an anathema to proper variable length encoding). It also separated signed from unsigned values, so we no longer take 5 bytes to store -1 (or worse for LTF8). It also gained better ways of doing constant values. I also fixed some deficiencies such as how to discover whether the original data had MD/NM (by adding MD:* and NM:* tags with the "*" being auto-generated on decode, so the presence and location are known). Further improvements would be around =/X vs M in cigar, and so on. There are quite a few little improvements to make it work better as a round-trip tool. I added a CRAM v4 idea tracker on hts-specs not just because I was hoping for community ideas, but primarily as a way to date stamp the ideas I had and prevent patent squatters like MPEG from patenting obvious future improvements!
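The varint point can be illustrated with a sketch of the general LEB128-plus-zigzag technique (not CRAM 4's actual wire encoding): with a zigzag mapping, -1 encodes in a single byte rather than the five that ITF8 needs.

```java
// Sketch (not htsjdk/CRAM 4 API): a LEB128-style unsigned varint plus a
// zigzag mapping for signed values.
public final class VarintSketch {
    // Encode an unsigned value 7 bits at a time, high bit = "more bytes follow".
    public static int writeUnsigned(long value, byte[] out, int pos) {
        while ((value & ~0x7FL) != 0) {
            out[pos++] = (byte) ((value & 0x7F) | 0x80);
            value >>>= 7;
        }
        out[pos++] = (byte) value;
        return pos;
    }

    // Zigzag maps small negatives to small unsigned values: -1 -> 1, 1 -> 2, ...
    public static long zigzag(long v)   { return (v << 1) ^ (v >> 63); }
    public static long unzigzag(long v) { return (v >>> 1) ^ -(v & 1); }

    public static int writeSigned(long value, byte[] out, int pos) {
        return writeUnsigned(zigzag(value), out, pos);
    }
}
```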

Looking at the compression side, I regret not adding in zstd, but it was a bit too new then. Zstd should totally replace Gzip (deflate) and LZMA. Deflate is just hampered by a maximum window size of 32kb, which makes deduplication - the core part of LZ - fail on long reads and similar data types. At high levels zstd approaches lzma for ratio. It's not quite as good, but close, and frankly lzma is almost never worth the CPU cost so I doubt it's used much. Similarly bzip2 should be ditched in favour of libbsc, which betters it on pretty much every aspect. Combined we'd be reducing the number of dependencies while also using more modern high performant libraries. There's even a branch of io_lib with libbsc and zstd. This was demonstrated to MPEG at a conference in their first call-for-proposals on a new genomics format. Something I now strongly regret taking part in! (Ultimately they only took the name tokeniser, but have now regretted going with their own entropy encoder as it's simply way too slow.)

The other aspect was improving ratios via custom codecs. The old rANS wins here because it's so much faster as an entropy encoder, and sometimes entropy encoding (or minimal statistical encoding) is the best you can get, so the complex LZ searching methods aren't worth it. (Before CRAM 3.0 we even had zlib being tried with Z_HUFFMAN_ONLY as an option, because it sometimes beat normal zlib while being vastly faster.) The ideas from there eventually ended up in CRAM 3.1.

One aspect of that was the notion of data transforms. This is an idea taken out of PNG, and later used by myself for ZTR. Data starts off as raw, but can have a transformation applied to it. Decoding is like peeling the layers off an onion. If it's not raw format yet, you pass it through the transform indicated in the data-stream and repeat, until you get to the original raw data. These could be delta if it's a series of integer values; size squashing - eg an array of 16bit values becoming 8-bit with an escape mechanism for out of bound data; run-length encoding; bit-packing (eg ACGT strings taking up 2bits per value instead of 8); or even bigger transformations such as BWT and MTF. These can be implemented in CRAM as layered encodings, much like we already have BYTE_ARRAY_LEN which contains two sub-encodings for the literal and length fields. RLE would be a direct comparison to that in data layout. I initially implemented PACK and RLE here.
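The onion-layer idea can be sketched with two toy transforms; the formats below are invented for illustration and are not CRAM's or ZTR's actual layouts.

```java
import java.io.ByteArrayOutputStream;

// Sketch only: two toy transforms from the "onion" idea. Real CRAM/ZTR
// transforms carry metadata in the data stream; the formats here are invented.
public final class TransformSketch {
    // Run-length encode as (byte, count) pairs; counts capped at 255.
    public static byte[] rle(byte[] in) {
        final ByteArrayOutputStream out = new ByteArrayOutputStream();
        for (int i = 0; i < in.length; ) {
            int j = i;
            while (j < in.length && in[j] == in[i] && j - i < 255) j++;
            out.write(in[i]);
            out.write(j - i);
            i = j;
        }
        return out.toByteArray();
    }

    // Bit-pack a 4-symbol alphabet (eg ACGT mapped to 0..3) at 2 bits/symbol.
    public static byte[] pack2bit(byte[] symbols) {
        final byte[] out = new byte[(symbols.length + 3) / 4];
        for (int i = 0; i < symbols.length; i++) {
            out[i / 4] |= (symbols[i] & 3) << ((i % 4) * 2);
        }
        return out;
    }
}
```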

CRAM 4 was an experimental melting pot of compression improvements alongside the format changes.

However it rapidly became apparent that getting any traction on these notions was going to be slow in htsjdk, so I took the decision to attempt to produce a cut down CRAM 4, renamed to CRAM 3.1. The theory was if we only had compression codec changes and we had a high performance separate library to implement them then the code changes needed in other CRAM implementations were very minimal; just updating the list of legal codec numbers and attaching them to callouts to a library API. This isn't much different to how existing implementations may already be handling lzma, bzip2 and deflate formats. (Although I'm aware Java has some native implementations, I was incorrectly assuming the faster C implementations were being utilised mostly.)

A consequence of this is that the separate onion-layer transform approach had to go, as that was a CRAM encoding change rather than a trivial codec API (ie compress-buffer and decompress-buffer functions). The transforms got consumed into the rANS-Nx16 codec instead, although minus delta, which was a mistake as it would greatly help certain tag types such as ONT signals.

It wasn't such a bad idea, but ultimately it didn't help at all: there was no willingness to simply link against an external library, yet we suffered splitting CRAM updates into two new versions instead of one.

There are still improvements that could be made to CRAM, but I wouldn't make them now, so ultimately I think CRAM 4.0 is dead in the water. I'm considering removing it from htslib, if only to reduce the potential attack vector for supplying malicious files.

If I was to pick this up again, it'd probably be to start again from scratch and build on something like the recent OpenZL. It wouldn't have the same compression ratios as archive-mode CRAM, unless we extended it, but I like the notion of a standard structure that off-the-shelf tooling can decode. Just as BGZF is basically decodable with zcat, with OpenZL it could include all those useful chunking strategies and data transformations. I still have plans for a BGZF2, but based on OpenZL now instead of zstd. I just got sucked into other projects and my time to work on this evaporated, but one day!

My rough prototype showed enormous gains for multi-sample VCF and long-read SAM files. It's not up to CRAM performance, but closer to CRAM than BAM for Revio and with a rewrite to use OpenZL it ought to be far closer still.

@tfenne
Member Author

tfenne commented Mar 31, 2026

@jkbonfield Thank you so much for both your pointers on testdata in htsspecs, and the history around CRAM 4. That's a lot to digest, and I think probably I'll reach out via email re: future formats (or finally make it to a GA4GH call). But the short version is that between AI-assisted coding and Project Panama in Java 21/22+, I think it might be much easier to spin up Java wrappers around native libraries and build Java-enabled implementations of future compression formats much faster.

@tfenne tfenne force-pushed the tf_cram_31 branch 2 times, most recently from 94ec6dc to c944b6c on April 1, 2026 13:14
@tfenne tfenne requested review from nh13 and yfarjoun April 1, 2026 13:33
@tfenne tfenne marked this pull request as ready for review April 1, 2026 13:33
@tfenne
Member Author

tfenne commented Apr 1, 2026

@yfarjoun and @nh13 I've marked this as ready for review. I've been back over the code with multiple AIs doing code review, and have fixed up everything they found that I agreed with.


@nh13 nh13 left a comment


This needs some battle-testing; it's too big for a static code review :/

public ByteBuffer decompress(final ByteBuffer input) {
// Read ISIZE from the last 4 bytes of the GZIP block to size the output
final int isizeOffset = input.limit() - 4;
final int isize = input.duplicate().order(ByteOrder.LITTLE_ENDIAN).position(isizeOffset).getInt();

issue:
ISIZE is mod 2^32, so you should use Integer.toUnsignedLong for the 2GB-4GB case and check if isize is zero, but I have no idea how to handle it if the actual size was > 4GB. I think it doesn't matter for CRAM, but this is a public function...
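A hedged sketch of what this suggestion might look like (class and method names are illustrative, not the PR's actual code). ISIZE is the uncompressed size mod 2^32 per the GZIP spec, so reading it as unsigned covers the 2GB-4GB range; sizes >= 4GB remain inherently ambiguous.

```java
import java.nio.ByteBuffer;
import java.nio.ByteOrder;

// Sketch: read the GZIP ISIZE field as an unsigned 32-bit value.
public final class IsizeSketch {
    public static long readIsize(ByteBuffer input) {
        final int isizeOffset = input.limit() - 4;
        final ByteBuffer dup = input.duplicate().order(ByteOrder.LITTLE_ENDIAN);
        dup.position(isizeOffset);
        final int raw = dup.getInt();
        // Integer.toUnsignedLong avoids a negative size for the 2GB-4GB case;
        // sizes >= 4GB wrap mod 2^32 and cannot be recovered from ISIZE alone.
        return Integer.toUnsignedLong(raw);
    }
}
```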

}

private void grow(final int minCapacity) {
int newCapacity = buf.length << 1;

nitpick: could overflow if buf.length > 1GB
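One possible overflow-safe variant, as a sketch (names illustrative): doubling via `<< 1` goes negative once the length exceeds 1GB, so the result needs clamping.

```java
// Sketch: compute a doubled capacity without overflowing to a negative value.
public final class GrowSketch {
    // Largest array size the JVM reliably allows.
    private static final int MAX_ARRAY_SIZE = Integer.MAX_VALUE - 8;

    public static int newCapacity(int oldCapacity, int minCapacity) {
        final int doubled = oldCapacity << 1;          // may overflow to negative
        int newCap = Math.max(doubled, minCapacity);
        if (newCap < 0 || newCap > MAX_ARRAY_SIZE) {   // overflow or too big
            if (minCapacity < 0) throw new OutOfMemoryError("capacity overflow");
            newCap = (minCapacity > MAX_ARRAY_SIZE) ? Integer.MAX_VALUE : MAX_ARRAY_SIZE;
        }
        return newCap;
    }
}
```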

CHANGELOG.md Outdated
### CRAM 3.1 Write Support

- Enable CRAM 3.1 writing with all spec codecs: rANS Nx16, adaptive arithmetic Range coder, FQZComp, Name Tokenisation, and STRIPE
- Add configurable compression profiles (FAST, NORMAL, BEST, ARCHIVE) with trial compression for automatic codec selection

suggestion:

Suggested change
- Add configurable compression profiles (FAST, NORMAL, BEST, ARCHIVE) with trial compression for automatic codec selection
- Add configurable compression profiles (FAST, NORMAL, SMALL, ARCHIVE) with trial compression for automatic codec selection

README.md Outdated

> **NOTE: _HTSJDK has only partial support for the latest Variant Call Format Specification. VCFv4.3 can be read but not written, VCFv4.4 can be read in lenient mode only, and there is no support for BCFv2.2._**

> **NOTE: _HTSJDK now supports both reading and writing CRAM 3.1 files. CRAM 3.1 write support includes all codecs defined in the specification (rANS Nx16, adaptive arithmetic Range coder, FQZComp, Name Tokenisation, and STRIPE), configurable compression profiles (FAST, NORMAL, BEST, ARCHIVE), and trial compression for automatic codec selection. Files produced by htsjdk are interoperable with samtools/htslib._**

suggestion:

Suggested change
> **NOTE: _HTSJDK now supports both reading and writing CRAM 3.1 files. CRAM 3.1 write support includes all codecs defined in the specification (rANS Nx16, adaptive arithmetic Range coder, FQZComp, Name Tokenisation, and STRIPE), configurable compression profiles (FAST, NORMAL, BEST, ARCHIVE), and trial compression for automatic codec selection. Files produced by htsjdk are interoperable with samtools/htslib._**
> **NOTE: _HTSJDK now supports both reading and writing CRAM 3.1 files. CRAM 3.1 write support includes all codecs defined in the specification (rANS Nx16, adaptive arithmetic Range coder, FQZComp, Name Tokenisation, and STRIPE), configurable compression profiles (FAST, NORMAL, SMALL, ARCHIVE), and trial compression for automatic codec selection. Files produced by htsjdk are interoperable with samtools/htslib._**

@@ -107,7 +109,7 @@ public void writeAlignment(final SAMRecord alignment) {
*/
// TODO: retained for backward compatibility for disq in order to run GATK tests (remove before merging this branch)

suggestion: can we remove this now?

int basesRemaining = 0;
boolean firstLen = true;
int lastLen = 0;
int qualOffset = 0;

nitpick: can remove

@@ -335,84 +402,150 @@ private static void writeString(final ByteBuffer tokenStreamBuffer, final String

nitpick:

Suggested change
tokenStreamBuffer.put(val.getBytes(StandardCharsets.UTF_8));

this.byteOffsetOfContainer = containerByteOffset;

final ContentDigests hasher = ContentDigests.create(ContentDigests.ALL);
// htslib does not write content digest tags (BD/SD/B5/S5/B1/S1) into slice headers.

suggestion: add to breaking changes in the CHANGELOG?

import htsjdk.samtools.*;
import htsjdk.samtools.cram.structure.CRAMCompressionProfile;
import htsjdk.samtools.cram.structure.CRAMEncodingStrategy;
import htsjdk.samtools.util.CloserUtil;

suggestion:

Suggested change
import htsjdk.samtools.util.CloserUtil;

* @param reader the reader to read from
* @return the decoded value
*/
public static long readUnsignedLTF8(final CRAMByteReader reader) {

suggestion: these are super dense; I think ITF8.java and the rest of this code is more readable.

cmnbroad and others added 14 commits April 11, 2026 06:34
…e helpers.

Implement STRIPE encoding for Range and rANS Nx16 codecs. Add Range coder
helper methods to CompressionHeaderEncodingMap. Implement FQZComp quality
score encoder and wire it into the CRAM 3.1 write profile. Add javadoc to
Range codec and CompressionUtils methods.
…ion.

Add CRAM compression profiles (FAST, NORMAL, SMALL, ARCHIVE) and wire
encoding strategy through the write API. Implement trial compression for
ARCHIVE profile with FQZComp parameter training to select optimal codec
parameters per data series.
Simple main() class that reads SAM/BAM/CRAM and writes SAM/BAM/CRAM with
a selectable compression profile (fast, normal, small, archive). Reports
record count, timing, and input/output file sizes for quick comparison.

Also adds CRAMCompressionProfile.valueOfCaseInsensitive() for case-insensitive
profile lookup by name.
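A case-insensitive enum lookup of this shape might look like the following sketch (the enum here is illustrative, not the actual CRAMCompressionProfile):

```java
// Sketch: resolve a profile by name regardless of case, without relying on
// callers upper-casing before Enum.valueOf.
public enum ProfileSketch {
    FAST, NORMAL, SMALL, ARCHIVE;

    public static ProfileSketch valueOfCaseInsensitive(String name) {
        for (final ProfileSketch p : values()) {
            if (p.name().equalsIgnoreCase(name)) return p;
        }
        throw new IllegalArgumentException("No profile named: " + name);
    }
}
```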
…hed mate pairs.

Match htslib/samtools behavior: strip NM/MD tags during CRAM encoding when
stored values match recomputed values (preserving non-standard values verbatim),
and regenerate missing NM/MD from read features + reference during decoding.
Handle htslib's cF internal tag for embed_ref=2 mode compatibility.

Implement attached mate pair encoding within slices, storing only a record
offset (NF) instead of full mate info (MF, NS, NP, TS) for mate pairs in
the same slice.

Update test SAM/BAM files to include NM/MD tags that conformant CRAM readers
now generate.
…ion testing.

Switch low-entropy integer data series (MF, MQ, FC, BS, TL, FN) from GZIP to
rANS Nx16 Order 1 where rANS produces better compression. Keep GZIP for
high-entropy positional data (NP, FP) and byte-array series (IN, SC) where
LZ77 outperforms pure entropy coding — rANS produced 97% ratio on NP vs
GZIP's 76%.
…ugging.

Change the external block content IDs assigned to each CRAM data series to match
the values used by htslib's cram_DS_ID enum (cram_structs.h). This is a purely
cosmetic change for newly written files — existing CRAM files still decode
correctly because content IDs are stored in each container's compression header
encoding map, not hardcoded in the decoder.

The previous HTSJDK IDs (1-32) and htslib IDs (11-46) were both arbitrary but
different, making it difficult to compare per-block compression statistics
between the two implementations. With matching IDs, tools like CramStats can
directly compare blocks by content ID across htsjdk-written and samtools-written
CRAM files.
…cution.

Restructure CRAM31Tests into a base class + 4 subclasses (one per compression
profile: fast, normal, small, archive). Each class has a data provider that
supplies the 3 input files, and the test method exercises both samtools and
HTSJDK conversion paths. Since TestNG parallelizes by class, the 4 profile
classes now run concurrently instead of sequentially.

Also passes the matching compression profile to the HTSJDK write step so all
four profiles are actually tested (previously always used NORMAL).

Wall-clock time for fast+normal: ~1m11s vs ~4m30s for the original 12 tests.
…didates with htslib.

TrialCompressor fixes:
- Production phase now calls setMethod(winner.getMethod()) so getMethod() returns
  the correct method for block headers (was returning the last trial block's
  per-block-best instead of the overall winner)
- blocksUntilRetrial now set after initial trial completes (was only set in the
  re-trial branch, causing immediate spurious re-trials)
- Empty blocks delegate to the winner's compress() instead of returning raw empty
  bytes (GZIP requires header framing even for empty payloads)
- Added setMethod() to ExternalCompressor for subclass method updates, with
  null-guard on getMethod() to catch use-before-compress errors

TrialCompressor now properly implements htslib's trial approach: trial NTRIALS=3
non-empty blocks accumulating sizes, pick winner, use for TRIAL_SPAN=70 blocks,
then re-trial. Each block's header always matches its actual compressed data.

Profile fixes:
- SMALL profile now uses NORMAL's primary codec assignments (rANS + FQZComp +
  NameTokeniser) instead of all-GZIP, with BZIP2 + GZIP/rANS as trial
  candidates — matching htslib's SMALL (level 6, use_rans=1, use_bz2=1)
- ARCHIVE trial candidates include Range coder + BZIP2 + GZIP/rANS
- Compress-then-getMethod ordering in createCompressedBlockForStream

New test: testGetMethodMatchesCompressedDataThroughoutLifecycle verifies the
method/data consistency invariant across 80 blocks (trial + production phases).
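The trial/production lifecycle described above can be sketched as follows. The candidate codecs are stand-ins and the class is illustrative, not htsjdk's actual TrialCompressor.

```java
import java.util.Arrays;

// Sketch of the trial-compression pattern: trial NTRIALS blocks across all
// candidates, pick an overall winner, use it for TRIAL_SPAN blocks, re-trial.
public final class TrialSketch {
    public interface Codec { byte[] compress(byte[] in); }

    static final int NTRIALS = 3, TRIAL_SPAN = 70;
    private final Codec[] candidates;
    private final long[] totals;
    private int trialsDone = 0;
    private int blocksUntilRetrial = 0;
    private Codec winner = null;

    public TrialSketch(Codec... candidates) {
        this.candidates = candidates;
        this.totals = new long[candidates.length];
    }

    public byte[] compress(byte[] block) {
        if (winner == null) {
            // Trial phase: run every candidate, accumulate sizes, emit this
            // block's best so the block header always matches its data.
            byte[] best = null;
            for (int i = 0; i < candidates.length; i++) {
                final byte[] out = candidates[i].compress(block);
                totals[i] += out.length;
                if (best == null || out.length < best.length) best = out;
            }
            if (++trialsDone == NTRIALS) {
                int w = 0;
                for (int i = 1; i < totals.length; i++) if (totals[i] < totals[w]) w = i;
                winner = candidates[w];
                blocksUntilRetrial = TRIAL_SPAN; // set after the trial completes
            }
            return best;
        }
        // Production phase: use the winner; schedule a re-trial afterwards.
        final byte[] out = winner.compress(block);
        if (--blocksUntilRetrial == 0) {
            winner = null;
            trialsDone = 0;
            Arrays.fill(totals, 0L);
        }
        return out;
    }
}
```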
…am dedup, and all-MATCH elimination.

Four improvements to NameTokenisationEncode that reduce normal-profile read name
compression from 23.7MB to 18.8MB (samtools: 17.2MB) on the 1KG benchmark:

1. Per-token-type rANS flag sets matching htslib tok3 level -3 — STRIPE for
   integer streams (DIGITS, DIFF, DUP), ORDER-1 for STRING, PACK+RLE for TYPE.
2. Fix remaining()/limit() bug in tryCompress — ORDER and STRIPE flag guards
   were checking buffer position instead of data size, silently skipping those
   flags on every iteration after the first.
3. All-MATCH TYPE stream elimination — omit TOKEN_TYPE streams that are entirely
   MATCH after byte 0 (the decoder already regenerates these per the spec).
4. Cross-position duplicate compressed stream detection — emit 3-byte dup
   references for byte-identical compressed streams instead of full data.
…into CRAM.

GzipCodec is a reusable, ByteBuffer-based GZIP compressor/decompressor that
operates directly on Deflater/Inflater without stream wrappers. Supports both
standard GZIP and BGZF output formats, and decompresses either transparently.

Key design points:
- ByteBuffer API for both input and output, with convenience allocating methods
- Stateful (Deflater, Inflater, CRC32 reused across calls), not thread-safe
- Supports DeflaterFactory/InflaterFactory for pluggable implementations
- Configurable compression level and deflate strategy

GZIPExternalCompressor now delegates to GzipCodec, accepting an optional
deflate strategy parameter for future use (e.g., Deflater.FILTERED).
tfenne added 13 commits April 11, 2026 06:34
…implifications.

Comprehensive refactor of the rANS 4x8 and Nx16 encoder/decoder implementations:

API changes:
- RANSEncode.compress() and RANSDecode.uncompress() now accept and return byte[]
  instead of ByteBuffer. External compressor wrappers simplified accordingly.
- All callers updated (ExternalCompressors, NameTokenisationEncode, TokenStreams, tests).

Internal refactor (ByteBuffer -> byte[]):
- All encoder/decoder internals converted from ByteBuffer to byte[] with explicit
  offset tracking (int[] posHolder pattern for mutable positions).
- Encoders write backwards into byte arrays, eliminating the O(N) Utils.reverse()
  pass that was previously required after forward-write encoding.
- Frequency table read/write methods use byte[] with posHolder.
- Added byte[] overloads for CompressionUtils.readUint7/writeUint7.

Structural simplifications:
- Eliminated ArithmeticDecoder wrapper class — inlined as plain int[][] frequencies
  and byte[][] reverseLookup arrays directly on RANSDecode.
- Removed combined advanceSymbolNx16/advanceSymbol4x8 methods from RANSDecodingSymbol
  — callers use advanceSymbolStep() + Utils.RANSDecodeRenormalize*() two-step pattern.
- Removed all dead ByteBuffer methods from RANSEncodingSymbol, RANSDecodingSymbol, Utils.
- Flattened package structure: moved rans4x8/ and ransnx16/ subpackage classes into
  parent rans/ package, enabling package-private field access for inline optimization.
- Made RANSEncodingSymbol fields package-private; inlined putSymbolNx16 arithmetic
  into RANSNx16Encode Order-0 loop (pos kept as local int instead of int[] heap array).
- Selective decoder reset: only reset frequency/symbol rows actually used in previous
  decode, not all 65,536 entries.
- Moved encoder/decoder state initialization from lazy-init methods to constructors,
  making all state arrays final fields.
- Made param classes final, added final modifiers throughout, added javadoc.

Performance: CRAM round-trip benchmark improved from ~189s to ~97s (2x speedup),
with the TrialCompressor change (separate commit) and these rANS changes together.
- Share encoder/decoder instances in RansTest, RangeTest, and RangeInteropTest
  instead of creating new instances per DataProvider row. Each codec eagerly
  allocates large internal symbol/frequency tables, and hundreds of instances
  caused heap exhaustion when test classes ran in parallel.

- Rewrite FakeReferenceSequenceFile to use a single shared byte[] buffer that
  grows as needed, instead of allocating a fresh chromosome-sized array (up to
  247MB) on every getSequence() call.

- Convert memory-heavy DataProviders in ContainerTest, ContainerFactoryTest,
  and CRAMFileBAIIndexTest to return Iterator<Object[]> for lazy consumption,
  avoiding materializing all test data simultaneously.

- Add QuietTestWrapper.lazy(Supplier) for deferred evaluation in DataProviders.

- Fix CRAMReferenceRegion.fetchReferenceBasesByRegion to set regionLength from
  the requested length rather than the returned array length, preventing issues
  with oversized reference arrays.

- Increase test heap to 10G and reduce parallelism to 1.5x CPUs to provide
  more headroom for the CRAM 3.1 write tests.
Move FN_NumberOfReadFeatures from GZIP to rANS Nx16 Order 1 in the NORMAL
profile, matching samtools behavior where rANS outperforms GZIP for this
data series.

Simplify tag trial compression to GZIP + rANS Order 0 (dropping Order 1
which provided negligible benefit for tags). Add BZIP2 as a trial candidate
for SA:Z and XA:Z tags where its BWT-based compression excels on the
structured alignment data (repeated contig names, CIGAR patterns), saving
~7MB on typical WGS data.
Replace Math.floor() with integer division in RangeCoder (both encode and
decode paths) to eliminate int-to-double-to-long conversion overhead on
52.5M calls per file. Replace Math.floorDiv(f,2) with f>>1 in ByteModel
renormalization. Remove redundant RANSEncodingSymbol.reset() loop since
set() overwrites all fields and zero-frequency symbols are never accessed.

Convert RangeCoder encoding output from ByteBuffer to byte[] to eliminate
bounds checking and position tracking overhead. Interleave ByteModel
symbols[] and frequencies[] into a single int[] array for cache locality
during the linear scan in encode/decode.

Bump test heap to 12G to fix OOM on large_aux test.

Benchmarked: ~3% encode speedup (94.3s -> 91.5s), byte-identical output.
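The arithmetic rewrites can be illustrated with a small sketch (names invented; the equivalences hold for the non-negative operands a range coder works with):

```java
// Sketch of the two rewrites described above.
public final class RangeMathSketch {
    // Before: (long) Math.floor(range / (double) totalFreq)
    // After:  plain integer division, avoiding the int->double->long round trip.
    public static long scale(long range, int totalFreq) {
        return range / totalFreq;
    }

    // Before: Math.floorDiv(f, 2); after: f >> 1 (identical for f >= 0).
    public static int halve(int f) {
        return f >> 1;
    }
}
```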
…ing.

Replace three compiled regex patterns (READ_NAME_PATTERN, DIGITS0_PATTERN,
DIGITS_PATTERN) with hand-written character-class splitting and digit checks.
The regex tokenizer was called 50M+ times per file, creating a new Matcher
object each time. The hand-written loop produces identical token sequences.

Also reuse a single RANSNx16Encode instance across tryCompress() calls
instead of allocating a fresh 256x256 RANSEncodingSymbol matrix per trial.

Benchmarked: 27% encode speedup (94.3s -> 68.8s), byte-identical output.
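The regex-free splitting technique can be sketched like this. Only two token classes (digit runs vs. everything else) are shown; the real name tokeniser distinguishes more, so this is the technique only, not its rules.

```java
import java.util.ArrayList;
import java.util.List;

// Sketch: split a read name into runs of digits vs. non-digits in a single
// pass, with no Pattern/Matcher allocation per call.
public final class TokenizeSketch {
    public static List<String> tokenize(String name) {
        final List<String> tokens = new ArrayList<>();
        int i = 0;
        while (i < name.length()) {
            final int start = i;
            final boolean digits = isDigit(name.charAt(i));
            while (i < name.length() && isDigit(name.charAt(i)) == digits) i++;
            tokens.add(name.substring(start, i));
        }
        return tokens;
    }

    private static boolean isDigit(char c) { return c >= '0' && c <= '9'; }
}
```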
htslib does not write content digest tags (BD/SD/B5/S5/B1/S1) into CRAM
slice headers — these are optional per the spec. We were computing SHA-512,
SHA-1, and CRC32 on both bases and quality scores for every record, costing
~21% of total encode time. Block-level CRC32 (required by CRAM 3.0+)
already provides data integrity verification.

Benchmarked: 42% cumulative encode speedup (94.3s -> 54.4s) with 136KB
smaller output (no digest tags in slice headers).
… output optional.

The name tokeniser decoder was allocating a fresh RANSNx16Decode per token
stream (~50 per slice), each constructing 256x256 arrays totaling 1MB+.
Share a single instance across all token streams within a decode call.

Also make CramConverter's output argument optional — omitting it reads all
records without writing, useful for benchmarking the decode path.

Benchmarked: ~4% decode speedup (180s -> 172s on 51.7M record CRAM).
…eReader/Writer.

ByteArrayInputStream.read() was the #1 hotspot in CRAM decode at 10.34% of
CPU time due to its synchronized methods. Replace with CRAMByteReader and
CRAMByteWriter — final classes wrapping byte[] + int pos with no
synchronization and no InputStream/OutputStream inheritance, enabling
aggressive JIT inlining.

Add CRAMByteReader/CRAMByteWriter overloads for ITF8 and LTF8 varint
encoding/decoding (called per-record for every integer data series).

Also optimize ByteArrayStopCodec.read() to scan directly in the underlying
byte[] for the stop byte instead of reading one byte at a time into a BAOS.

Benchmarked: 23% decode speedup (180s -> 138s), encode unchanged (54s -> 53s).
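The idea behind an unsynchronized byte reader can be sketched as follows (illustrative, not htsjdk's actual CRAMByteReader):

```java
// Sketch: a final class over byte[] + pos with no synchronization and no
// InputStream inheritance, so the JIT can inline reads aggressively.
public final class ByteReaderSketch {
    private final byte[] buf;
    private int pos;

    public ByteReaderSketch(byte[] buf) { this.buf = buf; }

    public int read()      { return buf[pos++] & 0xFF; }
    public int remaining() { return buf.length - pos; }

    // Scan directly in the underlying array for a stop byte (cf. the
    // ByteArrayStopCodec change) instead of reading one byte at a time.
    public byte[] readToStop(byte stop) {
        final int start = pos;
        while (pos < buf.length && buf[pos] != stop) pos++;
        final byte[] out = java.util.Arrays.copyOfRange(buf, start, pos);
        if (pos < buf.length) pos++; // consume the stop byte
        return out;
    }
}
```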
…ministic test counts.

initializeTests() was called from a @BeforeSuite method which can be invoked
concurrently by multiple TestNG threads during parallel class execution. The
unsynchronized static TEST_DATAs ArrayList would get populated multiple times,
inflating the test count by thousands (e.g., 4,824 -> 8,442 for
VariantContextWritersUnitTest). Make initializeTests() synchronized and
idempotent so it only runs once regardless of how many threads call it.
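The fix pattern, as a minimal sketch (names illustrative): synchronize the initializer and guard it with a flag so concurrent callers populate the list exactly once.

```java
import java.util.ArrayList;
import java.util.List;

// Sketch: a synchronized, idempotent static initializer safe under
// concurrent @BeforeSuite invocation.
public final class InitSketch {
    private static final List<Object[]> TEST_DATA = new ArrayList<>();
    private static boolean initialized = false;

    public static synchronized void initializeTests() {
        if (initialized) return; // idempotent: later callers are no-ops
        for (int i = 0; i < 3; i++) TEST_DATA.add(new Object[]{i});
        initialized = true;
    }

    public static synchronized int testCount() { return TEST_DATA.size(); }
}
```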
…o single pass.

Previously CRAM decode performed 3-4 separate passes over each record's features:
restoreReadBases (2 sub-passes), getCigarForReadFeatures, and calculateMdAndNm
(via a separately-derived CIGAR). The new restoreBasesAndTags method in
CRAMRecordReadFeatures does all of this in a single iteration, simultaneously
filling bases, building CIGAR elements, and accumulating the MD string and NM
count. Base normalization (toBamReadBasesInPlace) is also folded in.

The CIGAR is cached on CRAMCompressionRecord so toSAMRecord() doesn't need
to recompute it. Handles edge cases: reads extending beyond the reference
boundary, unmapped reads, and the no-features fast path.

Made SequenceUtil.bamReadBaseLookup public for use by the fused method.

Benchmarked: 38% cumulative decode speedup (180s -> 111s on 51.7M records).
…ng CRAM decode.

Tag IDs (2-byte name + 1-byte type) are invariant within a slice, but the
decoder previously created new String objects for key and keyType3Bytes and
recomputed the binary tag code for every tag on every record (~150-200M
allocations for 51M records). Introduce TagKeyCache, which pre-computes this
metadata once from the tag ID dictionary and reuses it via a new ReadTag
constructor. The cache uses parallel arrays with a linear scan, well suited
to the 5-20 unique tag IDs typical of a slice. Also pre-resolve the tag data
series codecs to eliminate HashMap<Integer,...> autoboxing in the inner
loop. ~9% decode speedup.
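The parallel-array-plus-linear-scan idea can be sketched as follows. Everything here is a hypothetical simplification: the class, field names, and bit layout of the packed tag ID are assumptions, not htsjdk's actual TagKeyCache:

```java
// Sketch of a parallel-array cache with linear scan. For the handful of
// unique tag IDs in a slice, scanning two flat arrays beats a HashMap:
// no hashing, no Integer boxing, and the arrays stay hot in CPU cache.
final class TagKeys {
    private final int[] ids = new int[32];        // packed tag name+type codes
    private final String[] keys = new String[32]; // pre-built two-char tag names
    private int size;

    String keyFor(final int tagId) {
        for (int i = 0; i < size; i++) {
            if (ids[i] == tagId) return keys[i];  // cache hit: no allocation
        }
        // Miss: build the key once; assumes name chars packed in the top bytes.
        final String key = "" + (char) ((tagId >> 16) & 0xFF)
                              + (char) ((tagId >> 8) & 0xFF);
        ids[size] = tagId;
        keys[size] = key;
        size++;
        return key;
    }
}
```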
LibdeflateDeflater/Inflater call super.end() in their constructors to
free the unused JDK zlib stream, but only override the byte[] methods.
Callers using the ByteBuffer overloads (e.g. GzipCodec) hit
"Deflater has been closed" because the base Deflater's ByteBuffer
methods check the freed native stream.

- Add ByteBuffer overrides to LibdeflateDeflater (setInput, deflate)
  and LibdeflateInflater (setInput, inflate) that delegate to the
  byte[] implementations
- Update GzipCodec compress/decompress to use byte[] deflater/inflater
  API as a belt-and-suspenders fix
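The shape of the delegation fix, as a simplified sketch (the class and fields below are hypothetical stand-ins, not the actual LibdeflateDeflater, which extends java.util.zip.Deflater):

```java
import java.nio.ByteBuffer;
import java.util.Arrays;

// Sketch: the ByteBuffer overload drains into a byte[] and delegates to the
// array-based path, so nothing ever touches the freed native zlib stream.
class DelegatingCodec {
    byte[] pending = new byte[0];

    void setInput(final byte[] b, final int off, final int len) {
        pending = Arrays.copyOfRange(b, off, off + len);
    }

    // ByteBuffer overload: copy out the remaining bytes, then reuse the
    // byte[] implementation above.
    void setInput(final ByteBuffer buf) {
        final byte[] tmp = new byte[buf.remaining()];
        buf.get(tmp);
        setInput(tmp, 0, tmp.length);
    }
}
```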
@tfenne tfenne changed the base branch from master to dev April 11, 2026 14:00
tfenne added 15 commits April 12, 2026 05:03
Profile-guided optimizations targeting the CRAM encoding hot path:

CompressionHeaderFactory (12.8% → ~3%):
- Eliminated redundant O(tags × records) iteration by gathering tag IDs
  during dictionary building instead of a separate discovery pass
- Skip record iteration entirely for fixed-size tag types (A/c/C/I/i/f/s/S)
- Z-type tags go directly to ByteArrayStopEncoding without size range scan
- Combined getDataForTag + getByteSizeRangeOfTagValues into single pass for
  B-type tags

FQZComp quality encoder:
- Pool 65K ByteModel arrays across compress() calls with reset() instead
  of reallocating. Added ByteModel.reset() method.
- Pool qualData byte[] across calls to avoid per-block allocation

MD5 checksums:
- Cache MessageDigest instance via ThreadLocal instead of calling
  MessageDigest.getInstance("MD5") on every invocation
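  A minimal sketch of that caching pattern, assuming a standalone helper class (names are hypothetical):

```java
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;

// Sketch: one MessageDigest per thread, created lazily and reused across
// calls instead of invoking MessageDigest.getInstance("MD5") every time.
final class Md5Cache {
    private static final ThreadLocal<MessageDigest> MD5 =
            ThreadLocal.withInitial(() -> {
                try {
                    return MessageDigest.getInstance("MD5");
                } catch (final NoSuchAlgorithmException e) {
                    throw new IllegalStateException(e); // MD5 is always present
                }
            });

    static byte[] digest(final byte[] data) {
        final MessageDigest md = MD5.get();
        md.reset();   // defensive: clear any state left by a prior caller
        return md.digest(data);
    }
}
```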

Name tokeniser:
- Cache parsed integer values in EncodeToken to avoid repeated
  Integer.parseInt() calls in distributeTokensForPosition hot loop
- Use int-based EncodeToken constructor for DELTA/DELTA0 tokens
Bug fixes:
- Fix incorrect offset in RangeEncode bzip2 fallback path
- Add missing stop byte guard in ByteArrayStopCodec
- Reset stale encoder state between NameTokenisationEncode compress() calls
- Fix CRAMByteWriter.grow() overflow for buffers >1GB
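The grow() overflow fix follows a standard pattern; sketched below with a hypothetical helper (not the actual CRAMByteWriter code):

```java
// Sketch of overflow-safe capacity growth: doubling an int capacity past
// ~1GB wraps negative, so detect the wrap and fall back to the exact
// requirement instead of allocating with a negative size.
final class Capacity {
    static int newCapacity(final int oldCapacity, final int minNeeded) {
        final int doubled = oldCapacity << 1;      // naive growth policy
        if (doubled < 0 || doubled < minNeeded) {  // wrapped, or still too small
            return minNeeded;
        }
        return doubled;
    }
}
```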

Correctness:
- Use US_ASCII charset for read name encoding (per SAM/BAM spec)
- Add equals/hashCode to GZIPExternalCompressor including compression level

Cleanup:
- Remove dead imports, variables, regex patterns, and stale TODO comments
- Reformat LTF8 CRAMByteReader/Writer methods to match ITF8 style
- Add thread-safety note to TrialCompressor javadoc
- Update CHANGELOG and README (profile names, CRAM 3.1 default, digest removal)
…ests

CRAIQueryTest (11 tests): writes a multi-contig CRAM+CRAI programmatically
and verifies that region queries, unmapped queries, multi-interval queries,
and alignment-start queries all return exactly the expected records. Uses
small reads-per-slice (50) to exercise cross-container index entries.

CodecRoundTripPropertyTest (356 tests): verifies decode(encode(X)) == X
for all CRAM codecs (RANS4x8, RANSNx16, Range, FQZComp, NameTokenisation)
across data patterns that target codec-specific edge cases: PACK boundary
(16 distinct symbols), RLE-friendly runs, maximum entropy, alternating
patterns, and size edge cases (empty, single byte).
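The property under test can be illustrated with a toy codec (a run-length encoder standing in for the real CRAM codecs; the class below is purely illustrative):

```java
import java.io.ByteArrayOutputStream;

// Sketch of the decode(encode(X)) == X round-trip property, using a toy
// run-length codec. The interesting inputs are the same edge cases listed
// above: empty input, a single byte, long runs, alternating bytes.
final class ToyRle {
    static byte[] encode(final byte[] in) {
        final ByteArrayOutputStream out = new ByteArrayOutputStream();
        int i = 0;
        while (i < in.length) {
            int j = i;
            while (j < in.length && in[j] == in[i] && j - i < 255) j++;
            out.write(j - i);   // run length, 1..255
            out.write(in[i]);   // run value
            i = j;
        }
        return out.toByteArray();
    }

    static byte[] decode(final byte[] in) {
        final ByteArrayOutputStream out = new ByteArrayOutputStream();
        for (int i = 0; i < in.length; i += 2) {
            final int runLength = in[i] & 0xFF;
            for (int k = 0; k < runLength; k++) out.write(in[i + 1]);
        }
        return out.toByteArray();
    }
}
```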
…mes)

- ReadTag: consolidate intToNameType3Bytes/intToNameType4Bytes into a
  shared intToNameType(int, boolean) method; add roundtrip and packing
  layout tests for all tag ID int/String conversions
- CRAMCompressionRecord: replace tagIdsIndex TODO with comment explaining
  that the shared MutableInt is intentional for tag dictionary grouping
- Slice.validateAlignmentSpanForReference: document that reads extending
  beyond the reference is permitted (matches htslib behavior)
- Slice.setReferenceMD5: document that alignmentStart < 1 is expected for
  multi-ref and unmapped slices
Replace the TODO with a javadoc explanation: supplementary and secondary
reads are always DETACHED and never mate-linked, so synthetic name
propagation cannot help them. This is safe because htsjdk always sets
preserveReadNames=true when encoding. The javadoc also documents what htslib
does (it forces name preservation when SA tags are present) in case lossy
mode is ever added.
Tests that htsjdk correctly decodes all 65 CRAM files from the hts-specs
test suite (3.0/passed/) and the 4 CRAM 3.1 integration files, comparing
decoded records against the paired SAM ground truth.

Organized into 16 small test classes for effective parallel execution:
- File definition, headers, containers, unmapped, mapped (no ref/with ref)
- Embedded reference, tags, multi-container, compression codecs
- Lossy modes (names/quality/sequence), data layout, corner cases
- Slice tags, compression levels, CRAM 3.1 cross-version

Handles known CRAM/SAM differences:
- Auto-generated MD/NM tags stripped before comparison when SAM lacks them
- Signed/unsigned B-array tag type mismatch tolerated (CRAM limitation)

Two lossy-mode tests disabled pending investigation:
- 1003_qual: detached mate start decoded as 0 instead of expected value
- 1007_seq: soft-clip CIGAR not preserved when sequence is '*'

Also documents that 0200_cmpr_hdr.cram (container with no slices) triggers
an IndexOutOfBoundsException in htsjdk's container initialization.
Container.distributeIndexingParametersToSlices() threw
IndexOutOfBoundsException when the slice list was empty (e.g. a container
with only a compression header and no data, as in hts-specs test
0200_cmpr_hdr.cram). Guard with an early return for empty slice lists.

Also fix CRAMIterator.nextContainer() to skip containers that match the
query but produce no records, rather than returning MATCHES_FILTER with
an empty record list which caused NoSuchElementException on the next
iterator.next() call.
1003_qual: Records with FLAG=0 (not paired) but non-zero PNEXT are valid
SAM; CRAM, however, does not preserve mate info for non-paired reads.
Compare with relaxed mate fields for non-paired records.

1007_seq: Document the root cause -- restoreBasesAndTags() short-circuits
CIGAR construction when CF_UNKNOWN_BASES is set, returning a simple
readLength-M CIGAR instead of processing SC features for soft clips.
Remains disabled pending refactoring of the feature processing loop.
restoreBasesAndTags() previously short-circuited when isUnknownBases was
set, returning a simple readLength-M CIGAR without processing read
features. This lost soft clips, hard clips, and other CIGAR operations
stored in the SC/HC/etc. data series.

Fix by running the full feature processing loop even when isUnknownBases,
but skipping base array writes, reference lookups, and MD/NM computation.
The CIGAR state machine (positions, operators, element list) is driven
entirely by feature positions and operators, so it works correctly
without base data. A boolean flag `doBasesAndMdNm` guards all base/ref
work with zero overhead on the normal (non-unknown-bases) path.

Re-enables the hts-specs 1007_seq compliance test which validates CIGAR
10S80M10S is preserved when sequence is '*'.
…ed query

Index tests (23 tests): verify CRAI-based query results against expected
counts from the hts-specs README for all 7 index test files (1400-1406):
simple, unmapped, 3-ref, multi-ref, multi-slice, multi-slice-ref, and
long+short read mixes.

Round-trip tests (52 tests across 3 classes): read each SAM from
hts-specs 3.0/passed/, write to CRAM, read back, compare against
original. Split into basic (unmapped/mapped/embedded-ref), tags+layout,
and index-file groups for parallel execution.

Fix CRAMFileReader.queryUnmapped() for all-unmapped CRAM files: when
getStartOfLastLinearBin() returns -1 (no mapped reads), skip the seek
and iterate from the beginning, matching BAMFileReader's behavior. This
previously caused RuntimeEOFException by seeking to an invalid offset.
The existing FQZComp interop tests decode pre-compressed hts-specs test
vectors but only round-trip synthetic data. Add round-trip tests that
compress and decompress the actual hts-specs quality data files (q4, q8,
q40+dir, qvar) which represent realistic Illumina binned, unbinned, and
long-read quality distributions. Parses the newline-delimited raw data
into per-record quality arrays as FQZComp requires.
These three classes rebuilt CRAM files and indexes from scratch for every
test invocation, despite the data being read-only. Total: ~680s → ~130s.

CRAMFileCRAIIndexTest (207s → ~14s):
  Change @BeforeMethod to @BeforeClass so the BAM→CRAM conversion,
  CRAI index build, and ReferenceSource creation happen once per class
  instead of once per test. Remove singleThreaded=true since shared
  state is now immutable.

CRAMFileBAIIndexTest (248s → ~80s):
  Add ConcurrentMap caches for CRAM bytes, BAI bytes, and CRAM files
  keyed by reads-per-slice. DataProviders now return cached data instead
  of rebuilding. Test methods receive pre-built BAI bytes alongside CRAM
  bytes to avoid redundant index creation.

CRAMIndexPermutationsTests (223s → ~32s):
  Add ConcurrentMap caches for BAI-indexed and CRAI-indexed CRAM files
  keyed by encoding strategy. Each of the 12 strategies builds one
  CRAM+BAI and one CRAM+CRAI, then reuses across all 139 query tests.
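The caching pattern shared by these three classes can be sketched as below (a simplified stand-in; the map key, value type, and names are assumptions):

```java
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.ConcurrentMap;

// Sketch of read-only test-fixture caching: build each expensive artifact
// at most once per key and share it across all tests that need it.
final class FixtureCache {
    private static final ConcurrentMap<Integer, byte[]> CRAM_BYTES =
            new ConcurrentHashMap<>();

    static byte[] cramFor(final int readsPerSlice) {
        // computeIfAbsent applies the builder atomically, at most once per
        // key, even when data providers run on multiple TestNG threads.
        return CRAM_BYTES.computeIfAbsent(readsPerSlice, FixtureCache::buildCram);
    }

    private static byte[] buildCram(final int readsPerSlice) {
        return new byte[readsPerSlice];  // stand-in for the BAM->CRAM conversion
    }
}
```

Because the cached artifacts are immutable byte arrays and files, tests can consume them concurrently without per-test rebuilds.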
The CEUTrio.HiSeq.WGS.b37.NA12878.20.21.v3.0.samtools.cram file was the
dominant cost in 6 test classes (~540s combined). Subsample to ~150K records
using `samtools view --subsample 0.229 --subsample-seed 42`, preserving
CRAM 3.0 format. The subsampled file retains both contigs (chr20: 51K,
chr21: 87K), unmapped reads (11.5K), 36/38 unique FLAG values, and 2064
unique CIGAR operations -- sufficient to exercise all compression profiles
including ARCHIVE (100K reads-per-slice).

Update HtsCRAMCodec31Test hardcoded first-read-name assertion to match
the new first record after subsampling.

Full test suite: 34,189 tests pass, wall-clock ~3m21s (was ~5m).
…e every 200th

Drop the 150 and 200 reads-per-slice variants, keeping only 100 (smallest)
and 300 (largest) since CRAMIndexPermutationsTests already covers 12
strategy permutations. Increase query sampling from every 100th to every
200th mapped read. Class runtime: 126s → 36s.
BCF2EncoderDecoderUnitTest (106s → <1s):
  Reduce the forCombinations list from 17 to 9 values (one representative
  per BCF2 type plus one null per non-CHAR type). This cuts the 3-way
  cartesian product from 17^3=4913 to 9^3=729 test sequences while still
  covering all type combinations and null handling. Total tests: 13848 → 1588.

SeekableStreamGZIPinputStreamIntegrationTest (82s → 4s):
  Reduce from 1M to 100K VCF records per test file (sufficient to exercise
  multi-block BGZF boundaries) and from 10 to 5 first-record-length
  variants. Total tests: 20 → 10.

Full suite: 21,909 tests pass in 2m12s (was ~5m with 34,189 tests).