# Provide a way to prevent overshooting past IEND (fixes #531) #684

ElliotSis wants to merge 3 commits into …
Force-pushed from 703f7ca to a88c3a9.
## Buffering issues addressed

1. **Overshoot** (fixes image-rs#531): Standard `BufReader` can overshoot and read past the `IEND` chunk boundary. This fix is needed to support decoding concatenated images from a single stream (simulated in Skia's [Codec_end](https://source.chromium.org/chromium/chromium/src/+/main:third_party/skia/tests/CodecExactReadTest.cpp;l=50-94) test), where the stream must be left positioned exactly at the end of the first image's `IEND` chunk for subsequent decodes to work.
2. **Blocking reads from pipe-like inputs**: Android allows decoding PNGs from blocking, non-seekable inputs such as pipes, through additional buffering layers (like Skia's `SkFrontBufferedStream`). If the decoder asks for a large refill when only a small number of PNG bytes remain, the lower layer may perform an over-wide read and block waiting for more data. This is highlighted by a timeout failure in the AOSP CTS [testDecodePngFromPipe](https://android.googlesource.com/platform/cts/+/master/tests/tests/graphics/src/android/graphics/cts/BitmapFactoryTest.java#859) test.

## Seek vs limited reads approach

Although issue 1 (overshoot) could theoretically be resolved by seeking backward to the correct boundary after an overly wide read, this approach is not adequate for issue 2 because pipes are not genuinely seekable.

To address both issues with a single solution, a limited buffering approach is implemented that caps reads at the dynamic state boundary. The decoder APIs retain their currently unused `Seek` bound, although some integrations might not fully support it.

## How it's achieved

- Introduced a new public trait `LimitBufRead` that abstracts controlled buffering, replacing the generic `BufRead` bound on `Decoder` and `Reader` (retaining the `Seek` bound) in a backwards-compatible manner. The trait is designed without sealed constraints so it can be implemented by downstream callers (e.g., a potential future `SkStream` integration).
- Exposed a `LimitBufReader` wrapper which restricts buffer refills to the bytes expected by the decoder state machine.
- Added `expected_read_limit` to `StreamingDecoder` to calculate the expected byte count dynamically.
- Added regression tests in `tests/limit_buf_read.rs` verifying concatenated stream decodes and eager-buffered blocking source handling (via `BlockOnLimit`, simulating blocking buffering wrappers).

## Testing

To verify the correctness of the new `LimitBufRead` usage and `LimitBufReader` APIs, I ran all existing integration and unit tests after replacing the `BufReader`s with `LimitBufReader` and wrapping the `Cursor`s in `LimitBufReader`. All tests passed.

## Benchmarks

Additionally, I ran the decoder criterion benchmark suite across four distinct configurations to collect comparative metrics:

1. **Memory standard** (`Cursor<&[u8]>`)
2. **Memory limit** (`Cursor<&[u8]>` wrapped in a `LimitBufReader`)
3. **File standard** (`File` wrapped in a `BufReader`)
4. **File limit** (`File` wrapped in a `LimitBufReader`)

The raw benchmark metrics are compiled below:

<details>
<summary>Click to expand benchmark results table</summary>

(On my machine)

| Benchmark Image / Case | Memory Standard | Memory Limit | Memory Overhead | File Standard | File Limit | File Overhead |
| :--- | :---: | :---: | :---: | :---: | :---: | :---: |
| **1. Real-world files (`decode`)** | | | | | | |
| `Fantasy_Digital_Painting.png` | `23.61 ms` | `23.88 ms` | **+1.2%** | `24.76 ms` | `24.72 ms` | **-0.2%** |
| `Exoplanet_Phase_Curve_(Diagram).png` | `28.64 ms` | `28.77 ms` | **+0.5%** | `28.97 ms` | `29.01 ms` | **+0.2%** |
| `Lohengrin_Illustrated.png` | `23.87 ms` | `24.69 ms` | **+3.4%** | `26.63 ms` | `26.90 ms` | **+1.0%** |
| `paletted-zune.png` | `10.99 ms` | `11.12 ms` | **+1.2%** | `11.51 ms` | `11.68 ms` | **+1.5%** |
| `kodim02.png` | `3.66 ms` | `3.73 ms` | **+1.4%** | `3.87 ms` | `3.89 ms` | **+0.7%** |
| `kodim07.png` | `4.64 ms` | `4.70 ms` | **+1.3%** | `4.83 ms` | `4.91 ms` | **+1.7%** |
| `kodim17.png` | `3.86 ms` | `3.89 ms` | **+0.9%** | `4.02 ms` | `4.08 ms` | **+1.3%** |
| `kodim23.png` | `3.55 ms` | `3.62 ms` | **+1.1%** | `3.75 ms` | `3.79 ms` | **+1.1%** |
| `Exoplanet_indexed_gimp.png` | `8.90 ms` | `8.97 ms` | **+0.8%** | `9.26 ms` | `9.28 ms` | **+0.2%** |
| `lorem_ipsum_screenshot.png` | `1.23 ms` | `1.23 ms` | **0.0%** | `1.30 ms` | `1.35 ms` | **+4.0%** |
| `lorem_ipsum_oxipng.png` | `795.66 µs` | `801.32 µs` | **+0.7%** | `832.11 µs` | `848.90 µs` | **+2.0%** |
| `tango-icon-new-128.png` | `108.77 µs` | `109.29 µs` | **+0.4%** | `120.16 µs` | `137.03 µs` | **+14.0%** |
| `tango-icon-new-32.png` | `14.08 µs` | `14.31 µs` | **+1.6%** | `23.21 µs` | `37.15 µs` | **+60.1%** |
| `tango-icon-new-16.png` | `7.75 µs` | `8.10 µs` | **+4.5%** | `16.86 µs` | `42.04 µs` | **+149.3%** |
| `Transparency.png` | `87.43 µs` | `88.17 µs` | **+0.9%** | `97.67 µs` | `125.70 µs` | **+28.7%** |
| **2. 4K chunks (Fragmented)** | | | | | | |
| `8x8.png` | `964.47 ns` | `1.22 µs` | **+27.0%** | `9.83 µs` | `20.10 µs` | **+104.4%** |
| `128x128.png` | `13.22 µs` | `15.04 µs` | **+13.7%** | `33.48 µs` | `94.32 µs` | **+181.7%** |
| `2048x2048.png` | `3.14 ms` | `3.57 ms` | **+13.7%** | `7.08 ms` | `20.98 ms` | **+196.3%** |
| `12288x12288.png` | `177.81 ms` | `191.36 ms` | **+7.6%** | `308.89 ms` | `772.36 ms` | **+150.0%** |
| **3. 64K chunks (Stable)** | | | | | | |
| `128x128.png` | `11.30 µs` | `12.20 µs` | **+7.9%** | `29.20 µs` | `44.02 µs` | **+50.7%** |
| `2048x2048.png` | `3.55 ms` | `3.72 ms` | **+4.9%** | `7.90 ms` | `8.79 ms` | **+11.2%** |
| `12288x12288.png` | `143.38 ms` | `157.81 ms` | **+10.1%** | `249.59 ms` | `283.96 ms` | **+13.7%** |
| **4. 2G chunks (Massive-chunk)** | | | | | | |
| `2048x2048.png` | `3.44 ms` | `3.93 ms` | **+14.1%** | `7.63 ms` | `8.03 ms` | **+5.2%** |
| `12288x12288.png` | `141.82 ms` | `152.83 ms` | **+7.7%** | `255.86 ms` | `262.81 ms` | **+2.7%** |
| **5. Incremental row-by-row** | | | | | | |
| `128x128-4k-idat` | `12.59 µs` | `13.93 µs` | **+10.7%** | `32.71 µs` | `91.12 µs` | **+178.5%** |

</details>

Duplicating all the tests and benchmarks on CI felt redundant, and skipping them keeps the PR diff clean and focused. Let me know if you think otherwise.

## Conclusions and BufReader by-default rationale

1. **Negligible CPU overhead in memory decoding**: In memory-to-memory decoding (`Cursor`), `LimitBufReader` introduces low overhead (+27% in the worst fragmented case), showing that the limit tracking has little CPU cost.
2. **OS system-call overhead**: Under file I/O, the overhead is dictated by context-switch costs. When chunk structures are highly fragmented (4K chunks), `LimitBufReader` must refill the buffer in smaller increments to respect boundaries, resulting in higher overhead (e.g., +160% to +200% regression in pathological cases).
3. **Low overhead on standard real-world files**: For normal real-world photos and diagrams, which use larger, continuous `IDAT` chunks, the File I/O overhead of `LimitBufReader` remains low (mostly between 1.5% and 15%).

This data supports keeping standard `BufReader` as the default decoder baseline for general performance, while exposing `LimitBufReader` as an opt-in wrapper with relatively low overhead for most files.
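As a rough sketch of the shape such an API could take (all method and field names below are illustrative assumptions, not the PR's actual signatures), a limit-aware reader caps each refill at the byte count the decoder state machine asks for, so the buffer never drags in data past a chunk boundary:

```rust
use std::io::{Read, Result};

/// Sketch of the `LimitBufRead` idea: buffered reading whose refills can
/// be capped so the buffer never overshoots a boundary such as the end of
/// the `IEND` chunk. Names here are illustrative, not the PR's exact API.
trait LimitBufRead: Read {
    /// Fill the internal buffer with at most `limit` fresh bytes.
    fn fill_buf_limited(&mut self, limit: usize) -> Result<&[u8]>;
    /// Mark `amt` buffered bytes as consumed.
    fn consume(&mut self, amt: usize);
}

/// Minimal wrapper in the spirit of `LimitBufReader`: it pulls from the
/// inner source in increments no larger than the caller-provided limit.
struct LimitBufReader<R> {
    inner: R,
    buf: Vec<u8>,
    pos: usize,
}

impl<R: Read> LimitBufReader<R> {
    fn new(inner: R) -> Self {
        Self { inner, buf: Vec::new(), pos: 0 }
    }
}

impl<R: Read> Read for LimitBufReader<R> {
    fn read(&mut self, out: &mut [u8]) -> Result<usize> {
        let data = self.fill_buf_limited(out.len())?;
        let n = data.len().min(out.len());
        out[..n].copy_from_slice(&data[..n]);
        self.consume(n);
        Ok(n)
    }
}

impl<R: Read> LimitBufRead for LimitBufReader<R> {
    fn fill_buf_limited(&mut self, limit: usize) -> Result<&[u8]> {
        if self.pos == self.buf.len() {
            // Refill with at most `limit` bytes: never read past the
            // boundary the decoder state machine asked for.
            self.buf.resize(limit, 0);
            let n = self.inner.read(&mut self.buf)?;
            self.buf.truncate(n);
            self.pos = 0;
        }
        Ok(&self.buf[self.pos..])
    }
    fn consume(&mut self, amt: usize) {
        self.pos += amt;
    }
}
```

The key property is that the underlying stream's position advances by at most `limit` per refill, which is what leaves a concatenated stream positioned exactly at the first image's `IEND`.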
Force-pushed from a88c3a9 to 1e33fd5.
If we're looking at supporting reading directly from pipes, I think we should go ahead and drop the `Seek` bound. The original PR that introduced the `Seek` bound … @197g has just wrapped up and merged a rework of the metadata API in …
@ElliotSis could you document the motivation for making a custom limited reader with custom bookkeeping instead of using `std::io::Take`?
Yeah, I've considered doing that, but I didn't have the context as to why the `Seek` bound was there in the first place. Since removing it is a non-semver-breaking relaxation and safe to do, I went ahead and updated the PR to drop the `Seek` bound.
To give some context, the requirements I was trying to satisfy were: …

Considering the above, I think introducing a new trait and a blanket impl for existing `BufRead` implementors … However, I think you're right that the …
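For illustration, a non-sealed trait with a blanket impl over existing `BufRead` types might look like the following sketch (the `fill_buf_limited` method name is an assumption for illustration, not the PR's actual code):

```rust
use std::io::{BufRead, Result};

/// Sketch: the trait defaults to ordinary unrestricted buffering, so a
/// blanket impl keeps every existing `BufRead` source compiling
/// unchanged, while a dedicated wrapper can honor the limit for real.
trait LimitBufRead: BufRead {
    /// `limit` is a hint; plain `BufRead` types may ignore it.
    fn fill_buf_limited(&mut self, _limit: usize) -> Result<&[u8]> {
        self.fill_buf()
    }
}

// Backwards-compatible blanket impl: any `BufRead` is a `LimitBufRead`.
impl<T: BufRead> LimitBufRead for T {}
```

This is what makes the bound change non-breaking: callers already passing a `BufRead` (e.g. a `Cursor`) keep working without modification.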
Force-pushed from 496f0ae to 5735215.
FYI: I also ended up removing the …
Force-pushed from 5735215 to f058761.
…port pipes

- Refactored `LimitBufReader` to be a newtype wrapper around `std::io::BufReader<std::io::Take<R>>`, completely eliminating custom buffering bookkeeping while preserving correctness.
- Dropped the `Seek` bound from `Decoder`, `Reader`, and `ReadDecoder`, since it was unused by the decoding logic. Relaxing this bound enables natively decoding from non-seekable sources like pipes without compilation or runtime blockers.
- Cleaned up tests by removing unnecessary `Seek` bounds and mock `Seek` implementations.
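The `BufReader<Take<R>>` shape described above can be sketched roughly as follows; `set_limit` is a helper name I chose for illustration, and the PR's actual API may differ. Std's `Take` enforces the read limit, so no custom bookkeeping is needed:

```rust
use std::io::{BufRead, BufReader, Read, Result, Take};

/// Sketch of the refactor: `LimitBufReader` as a thin newtype over
/// `BufReader<Take<R>>`, delegating all buffering to the std types.
struct LimitBufReader<R>(BufReader<Take<R>>);

impl<R: Read> LimitBufReader<R> {
    fn new(inner: R) -> Self {
        // Start with a zero limit; the decoder raises it to the number
        // of bytes its state machine expects before each refill.
        LimitBufReader(BufReader::new(inner.take(0)))
    }

    /// Allow up to `n` more bytes to be pulled from the underlying reader.
    fn set_limit(&mut self, n: u64) {
        self.0.get_mut().set_limit(n);
    }
}

impl<R: Read> Read for LimitBufReader<R> {
    fn read(&mut self, buf: &mut [u8]) -> Result<usize> {
        self.0.read(buf)
    }
}

impl<R: Read> BufRead for LimitBufReader<R> {
    fn fill_buf(&mut self) -> Result<&[u8]> {
        self.0.fill_buf()
    }
    fn consume(&mut self, amt: usize) {
        self.0.consume(amt)
    }
}
```

Because `Take` caps every `read` issued by the inner `BufReader`, a refill can never pull bytes past the current limit, even though the `BufReader` itself would happily fill its whole capacity.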
Force-pushed from f058761 to bacbe88.
Well, technically we may only need …
@197g the PR currently drops the `Seek` bound. Thanks both for the feedback!
## Buffering issues addressed
1. **Overshoot past the `IEND` chunk** (fixes #531): Standard `BufReader` can overshoot and read past the `IEND` chunk boundary. This fix is needed to support decoding concatenated images from a single stream (simulated in Skia's Codec_end test), where the stream must be left positioned exactly at the end of the first image's `IEND` chunk for subsequent decodes to work.
2. **Blocking reads from pipes**: Eager buffering (e.g. `BufReader`'s 8KB refills) causes JNI/Skia layers to greedily request more bytes. Once the PNG data is fully consumed, the JNI loop calls `read()` again to satisfy the remaining buffer capacity, causing it to block indefinitely. This is highlighted by a timeout failure in the AOSP CTS testDecodePngFromPipe test.

## Seek vs limited reads approach
Although issue 1 (overshoot) could theoretically be resolved by seeking backward to the correct boundary after a read that's too wide, this approach does not seem adequate for issue 2 because pipes are not genuinely seekable.
The AOSP CTS test deadlock occurs because the main test thread retains an open reference to the pipe's `writeFd` even after the `writeThread` closes the `FileOutputStream`, preventing the pipe from naturally closing and returning EOF. Theoretically, we could fix the test by closing `writeFd` in the main thread, but that would only hide a deeper production issue: if a standard decoder eagerly over-reads on an open, blocking stream (where the writer remains alive), the read blocks the entire decode thread indefinitely even after the image has been fully received.

To address both issues with a single solution, a limited buffering approach is implemented that caps reads at the dynamic state boundary, and the currently unused `Seek` bound is removed.

## How it's achieved
- Introduced a new public trait `LimitBufRead` that abstracts controlled buffering, replacing the generic `BufRead` bound on `Decoder` and `Reader` in a backwards-compatible manner. The trait is designed without sealed constraints so it can be implemented by downstream callers (e.g., a potential future `SkStream` integration).
- Exposed a `LimitBufReader` wrapper which restricts buffer refills to the bytes expected by the decoder state machine.
- Added `expected_read_limit` to `StreamingDecoder` to calculate the expected byte count dynamically.
- Added regression tests in `tests/limit_buf_read.rs` verifying concatenated stream decodes and eager-buffered blocking source handling (via `BlockOnLimit`, simulating blocking buffering wrappers).

## Testing
To verify the correctness of the new `LimitBufRead` usage and `LimitBufReader` APIs, I ran all existing integration and unit tests after replacing the `BufReader`s with `LimitBufReader` and wrapping the `Cursor`s in `LimitBufReader`. All tests passed.

## Benchmarks
Additionally, I ran the decoder criterion benchmark suite across four distinct configurations to collect comparative metrics:
1. **Memory standard** (`Cursor<&[u8]>`)
2. **Memory limit** (`Cursor<&[u8]>` wrapped in a `LimitBufReader`)
3. **File standard** (`File` wrapped in a `BufReader`)
4. **File limit** (`File` wrapped in a `LimitBufReader`)

The raw benchmark metrics are compiled below:
<details>
<summary>Click to expand benchmark results table</summary>

(On my machine)

| Benchmark Image / Case | Memory Standard | Memory Limit | File Standard | File Limit |
| :--- | :---: | :---: | :---: | :---: |
| **1. Real-world files (`decode`)** | | | | |
| `Fantasy_Digital_Painting.png` | `23.75 ms` | `23.99 ms` | `24.51 ms` | `24.56 ms` |
| `Exoplanet_Phase_Curve_(Diagram).png` | `29.20 ms` | `28.86 ms` | `29.32 ms` | `29.48 ms` |
| `Lohengrin_Illustrated.png` | `23.96 ms` | `24.90 ms` | `26.96 ms` | `27.36 ms` |
| `paletted-zune.png` | `11.06 ms` | `11.16 ms` | `11.52 ms` | `11.68 ms` |
| `kodim02.png` | `3.68 ms` | `3.72 ms` | `3.86 ms` | `3.92 ms` |
| `kodim07.png` | `4.70 ms` | `4.71 ms` | `4.86 ms` | `4.91 ms` |
| `kodim17.png` | `3.91 ms` | `3.92 ms` | `4.06 ms` | `4.08 ms` |
| `kodim23.png` | `3.58 ms` | `3.63 ms` | `3.79 ms` | `3.90 ms` |
| `Exoplanet_indexed_gimp.png` | `8.89 ms` | `8.93 ms` | `9.28 ms` | `9.31 ms` |
| `lorem_ipsum_screenshot.png` | `1.22 ms` | `1.23 ms` | `1.29 ms` | `1.36 ms` |
| `lorem_ipsum_oxipng.png` | `800.91 µs` | `800.60 µs` | `833.58 µs` | `850.02 µs` |
| `tango-icon-new-128.png` | `109.17 µs` | `109.57 µs` | `121.41 µs` | `137.98 µs` |
| `tango-icon-new-32.png` | `13.87 µs` | `14.16 µs` | `23.34 µs` | `37.21 µs` |
| `tango-icon-new-16.png` | `7.73 µs` | `8.04 µs` | `17.15 µs` | `41.73 µs` |
| `Transparency.png` | `87.30 µs` | `88.26 µs` | `97.66 µs` | `125.38 µs` |
| **2. 4K chunks (Fragmented)** | | | | |
| `8x8.png` | `972.78 ns` | `1.18 µs` | `10.07 µs` | `20.09 µs` |
| `128x128.png` | `13.02 µs` | `14.33 µs` | `32.41 µs` | `94.49 µs` |
| `2048x2048.png` | `3.99 ms` | `4.48 ms` | `8.42 ms` | `21.92 ms` |
| `12288x12288.png` | `195.50 ms` | `171.30 ms` | `316.29 ms` | `815.29 ms` |
| **3. 64K chunks (Stable)** | | | | |
| `128x128.png` | `11.27 µs` | `11.99 µs` | `29.19 µs` | `43.61 µs` |
| `2048x2048.png` | `3.34 ms` | `3.26 ms` | `6.89 ms` | `7.89 ms` |
| `12288x12288.png` | `133.93 ms` | `141.75 ms` | `243.02 ms` | `275.64 ms` |
| **4. 2G chunks (Massive-chunk)** | | | | |
| `2048x2048.png` | `2.79 ms` | `3.04 ms` | `6.51 ms` | `7.25 ms` |
| `12288x12288.png` | `131.59 ms` | `142.90 ms` | `242.90 ms` | `251.84 ms` |
| **5. Incremental row-by-row** | | | | |
| `128x128-4k-idat` | `13.81 µs` | `15.16 µs` | `33.42 µs` | `92.27 µs` |

</details>

Duplicating all the tests and benchmarks on CI felt redundant, and skipping them keeps the PR diff clean and focused. Let me know if you think otherwise.
## Conclusions and BufReader by-default rationale
1. **Negligible CPU overhead in memory decoding**: In memory-to-memory decoding (`Cursor`), `LimitBufReader` introduces low overhead (+20% on the tiny fragmented `8x8.png`, otherwise negligible on most files), showing that the limit tracking has little CPU cost.
2. **OS system-call overhead**: Under file I/O, the overhead is dictated by context-switch costs. When chunk structures are highly fragmented (4K chunks), `LimitBufReader` must refill the buffer in smaller increments to respect boundaries, resulting in higher overhead (e.g., +160% to +190% regression in pathological cases).
3. **Low overhead on standard real-world files**: For normal real-world photos and diagrams, which use larger, continuous `IDAT` chunks, the File I/O overhead of `LimitBufReader` remains low (mostly between 1% and 15%).

This data supports keeping standard `BufReader` as the default decoder baseline for general performance, while exposing `LimitBufReader` as an opt-in wrapper with relatively low overhead for most files.
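To make the concatenated-stream requirement concrete without pulling in the `png` crate itself, here is a toy length-prefixed format (entirely my own, for illustration) where limited, exact reads leave the stream positioned precisely at the next record's boundary, analogous to stopping at the end of `IEND`:

```rust
use std::io::Read;

// Toy stand-in for the concatenated-image scenario: two length-prefixed
// records in one stream. A decoder that reads exactly `len` bytes leaves
// the cursor on the next record's boundary; one that over-reads (like an
// unrestricted BufReader refill) would swallow part of the second record.
fn decode_record<R: Read>(r: &mut R) -> std::io::Result<Vec<u8>> {
    let mut len = [0u8; 1];
    r.read_exact(&mut len)?;
    let mut body = vec![0u8; len[0] as usize];
    r.read_exact(&mut body)?; // limited read: never past the boundary
    Ok(body)
}
```

Decoding twice in a row from the same stream only works because the first decode consumes exactly its own bytes, which is the property the `IEND` overshoot fix restores for PNG.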