
#420: Support multiple file Wikipedia dump archives #435

Draft

mawiesne wants to merge 10 commits into master from 420-Support-multiple-file-Wikipedia-dump-archives

Conversation

Contributor

mawiesne commented Jan 17, 2026

What's in the PR

  • adds a method for handling multi-file dump archives to IDecompressor (see the sketch below)
  • adds related implementations in BZip2Decompressor, GZipDecompressor, and UniversalDecompressor, each relying on java.io.SequenceInputStream
  • WIP -> Tests
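
A minimal sketch of the contract this adds. The multi-part method name getInputStreamSequence appears in the commit log below; the single-file method name and parameter types are assumptions for illustration:

```java
import java.io.File;
import java.io.IOException;
import java.io.InputStream;
import java.util.List;

public interface IDecompressor {

    /** Existing single-file entry point (name assumed for this sketch). */
    InputStream getInputStream(File dumpFile) throws IOException;

    /**
     * Returns one logical stream over an ordered list of dump parts,
     * internally backed by java.io.SequenceInputStream.
     */
    InputStream getInputStreamSequence(List<File> parts) throws IOException;
}
```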

How to test manually

  • mvn clean verify

Automatic testing

  • PR includes unit tests

Documentation

  • PR updates documentation

mawiesne self-assigned this Jan 17, 2026
mawiesne force-pushed the 420-Support-multiple-file-Wikipedia-dump-archives branch from 1f1bdea to cd9c9a8 on January 17, 2026 at 19:42
mawiesne added the 🆕Enhancement and java labels Jan 17, 2026
mawiesne added this to the 2.0.2 milestone Jan 17, 2026
mawiesne force-pushed the 420-Support-multiple-file-Wikipedia-dump-archives branch from cd9c9a8 to 0b1fb64 on February 4, 2026 at 20:08
- adds method for handling multi-file dump archives to IDecompressor
- adds related impls in BZip2Decompressor, GZipDecompressor and UniversalDecompressor, each relying on SequenceInputStream
- WIP -> Tests
rzo1 force-pushed the 420-Support-multiple-file-Wikipedia-dump-archives branch from 0b1fb64 to d0052ec on April 24, 2026 at 09:54
rzo1 added 9 commits April 24, 2026 13:41
- BZip2Decompressor.getInputStreamSequence now enables
  decompressConcatenated on BZip2CompressorInputStream; without the
  flag only the first part of a multi-file bz2 archive was decoded
  (see the sketch after this list).
- Close already-opened part streams on failure in BZip2Decompressor
  and GZipDecompressor to avoid file-descriptor leaks.
- UniversalDecompressor.getInputStreamSequence rejects lists whose
  parts do not share a single archive format, preventing silent
  misdecoding; also drops an unreachable null-element branch.
- Replace the legacy synchronized Vector with ArrayList in the bz2
  and gz sequence paths.
- Tidy IDecompressor.getInputStreamSequence javadoc (typo, wrong
  "or null" clause, clearer contract).
- Extend AbstractDecompressorTest with common contract tests for
  getInputStreamSequence (null list, empty list, null element,
  directory element); SevenZipDecompressorTest overrides these to
  assert UnsupportedOperationException.
- Add multi-part round-trip tests for bz2 and gz that generate
  fixtures via BZip2CompressorOutputStream / GZIPOutputStream and
  verify the decompressed stream equals the byte concatenation of
  all parts.
- Add UniversalDecompressor dispatch tests (bz2, gz), rejection of
  7z multi-file, unsupported extension, and mixed-extension lists.
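
For reference, the Commons Compress flag the first bullet enables looks roughly like this (file handling is illustrative; note that a later commit in this PR replaces compressed-side concatenation with per-part decompressors):

```java
import java.io.BufferedInputStream;
import java.io.FileInputStream;
import java.io.IOException;
import java.io.InputStream;
import org.apache.commons.compress.compressors.bzip2.BZip2CompressorInputStream;

final class Bz2Sketch {
    // Without the second argument set to true, BZip2CompressorInputStream
    // stops after the first bz2 member and the remaining data is ignored.
    static InputStream openConcatenatedBz2(String path) throws IOException {
        InputStream raw = new BufferedInputStream(new FileInputStream(path));
        return new BZip2CompressorInputStream(raw, /* decompressConcatenated = */ true);
    }
}
```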
Each Wikimedia multi-file dump part (e.g. pages-articles1.xml-p1p10.bz2)
is a standalone XML document with its own <mediawiki> root and <siteinfo>
preamble, so the decompressed byte concatenation is not well-formed XML
and cannot be fed to a single SAXParser.parse() call. This commit adds
the plumbing to parse such a sequence as one logical document without
touching the existing single-stream pipeline.

- MultiPartDumpWriter (mwdumper): DumpWriter decorator that forwards
  writeStartWiki / writeSiteinfo only on the first invocation, swallows
  per-part writeEndWiki / close, and exposes finish() to emit the single
  terminal writeEndWiki and close the delegate exactly once (see the
  sketch after this commit message).
- AbstractXmlDumpReader: split readDump() into a protected doParse() that
  runs SAX without closing the writer, plus the existing readDump() which
  calls doParse() followed by writer.close(). The single-stream contract
  is unchanged.
- MultiPartXmlDumpReader (wikimachine): readDumps(parts, writer, factory)
  iterates parts, instantiates a fresh reader per part via the factory,
  routes events through MultiPartDumpWriter, and guarantees finish() runs
  on the failure path as well.
- Unit tests for MultiPartDumpWriter (lifecycle collapsing, passthrough,
  idempotent finish, null-delegate rejection).
- Integration tests for MultiPartXmlDumpReader using in-memory XML parts
  against WikiXMLDumpReader, asserting exact event ordering, null/empty
  input rejection, and delegate-close on parse failure.

Consumers (XML2Binary, DataMachineGenerator, TimeMachineGenerator) are
not yet wired up to this API; that is a separate follow-up.
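
A sketch of the decorator's lifecycle collapsing, using a trimmed stand-in for mwdumper's DumpWriter (the real interface has more callbacks):

```java
import java.io.IOException;
import java.util.Objects;

// Trimmed stand-in for mwdumper's DumpWriter; only the lifecycle
// callbacks the decorator intercepts are shown.
interface DumpWriter {
    void writeStartWiki() throws IOException;
    void writeEndWiki() throws IOException;
    void close() throws IOException;
}

final class MultiPartDumpWriter implements DumpWriter {
    private final DumpWriter delegate;
    private boolean headerWritten;
    private boolean finished;

    MultiPartDumpWriter(DumpWriter delegate) {
        this.delegate = Objects.requireNonNull(delegate, "delegate");
    }

    @Override
    public void writeStartWiki() throws IOException {
        if (!headerWritten) {          // forwarded only for the first part
            delegate.writeStartWiki();
            headerWritten = true;
        }
    }

    @Override
    public void writeEndWiki() { /* swallowed; emitted once in finish() */ }

    @Override
    public void close() { /* swallowed; the delegate is closed in finish() */ }

    /** Emits the single terminal writeEndWiki and closes the delegate once. */
    void finish() throws IOException {
        if (finished) {
            return;                    // idempotent, as the tests assert
        }
        finished = true;
        try {
            delegate.writeEndWiki();
        } finally {
            delegate.close();
        }
    }
}
```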
Wikimedia publishes large XML dumps split across several files named
<prefix>-<role><N>.xml-p<start>p<end>.<ext> (e.g. pages-articles1.xml-
p1p297012.bz2). The existing DataMachineFiles scanner used
.contains("pages-articles.xml") and therefore silently ignored every
multi-part file. This commit adds grouping + ordering support without
wiring the XML consumers yet.

- New util DumpFileDiscovery (wikimachine): recognises the multi-part
  page-range suffix, matches role names under both single-file and
  multi-part schemes (rejecting look-alikes such as
  pages-articles-multistream), and orders a collection of parts by
  ascending start page id (see the sketch after this list).
- DataMachineFiles: internally stores pages-articles and
  pages-meta-current as ordered List<File>. Legacy singular getters
  keep returning the first part for backwards compatibility; new
  getInputPagesArticlesFiles() / getInputPagesMetaCurrentFiles()
  expose the full ordered list. Role matching uses DumpFileDiscovery
  and now correctly picks up multi-part names. SQL roles (pagelinks,
  categorylinks) stay single-file.
- TimeMachineFiles: metaHistoryFiles is now a List<String>. Legacy
  setMetaHistoryFile/getMetaHistoryFile still work (singleton list);
  new setMetaHistoryFiles/getMetaHistoryFiles accept and expose the
  ordered list; checkAll() verifies every part is readable.
- Tests for DumpFileDiscovery (pattern matching, rejection of
  look-alikes, stable ordering with ranged and unranged files),
  DataMachineFiles multi-part grouping, and TimeMachineFiles list
  setter/getter behaviour and validation.
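
A sketch of the ordering idea from the first bullet. The pageRangeStart name comes from a later commit in this PR; the regex and the sort position of unranged files are assumptions:

```java
import java.io.File;
import java.util.Comparator;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;
import java.util.stream.Collectors;

final class DumpFileDiscoverySketch {

    // Matches the multi-part page-range suffix, e.g.
    // "pages-articles1.xml-p1p297012.bz2" -> start page id 1.
    private static final Pattern PAGE_RANGE =
            Pattern.compile(".*\\.xml-p(\\d+)p(\\d+)\\.[A-Za-z0-9]+$");

    /** Start page id of a ranged part; unranged files sort last here. */
    private static long pageRangeStart(File file) {
        Matcher m = PAGE_RANGE.matcher(file.getName());
        return m.matches() ? Long.parseLong(m.group(1)) : Long.MAX_VALUE;
    }

    // A method reference to the private static member resolves fine
    // from within the same class.
    static List<File> orderByPageRange(List<File> parts) {
        return parts.stream()
                .sorted(Comparator.comparingLong(DumpFileDiscoverySketch::pageRangeStart))
                .collect(Collectors.toList());
    }
}
```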
The earlier implementation fed a SequenceInputStream of raw compressed
parts to a single decompressor (GZIPInputStream for gz, relying on RFC
1952 multi-member support; BZip2CompressorInputStream with
decompressConcatenated=true for bz2). That worked locally but failed
on Java 17 / HSQLDB CI runs with decoding stopping after the first
part — only one part's content was returned.

Root cause: GZIPInputStream detects a subsequent concatenated member
by inspecting the underlying stream's available() count after the
trailer of the current member. SequenceInputStream's available()
returns the *current* underlying stream's count, which is zero at
the boundary between parts before the switch to the next part has
been triggered by a read(). On timing/buffer-sensitive paths this
made GZIPInputStream conclude the stream ended after part one.

Fix: wrap each part in its own decompressor (GZIPInputStream /
BZip2CompressorInputStream) and concatenate the *decompressed*
streams with SequenceInputStream. Identical semantic result, no
dependence on compressed-side multi-member detection, and
consistent between bz2 and gz. The existing multi-part round-trip
tests still pass and the CI-reproducible failure is gone.
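
A sketch of the fixed gz path under these assumptions (method and class names are illustrative); the bz2 path is identical with BZip2CompressorInputStream in place of GZIPInputStream:

```java
import java.io.IOException;
import java.io.InputStream;
import java.io.SequenceInputStream;
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;
import java.util.zip.GZIPInputStream;

final class MultiPartStreams {
    // Decompress each part separately, then concatenate the *decompressed*
    // streams; nothing depends on GZIPInputStream spotting a second member.
    static InputStream gzSequence(List<InputStream> rawParts) throws IOException {
        List<InputStream> decoded = new ArrayList<>(rawParts.size());
        try {
            for (InputStream raw : rawParts) {
                decoded.add(new GZIPInputStream(raw));
            }
        } catch (IOException e) {
            // Close already-opened part streams to avoid descriptor leaks,
            // attaching close failures as suppressed (cf. closeQuietly below).
            for (InputStream open : decoded) {
                try {
                    open.close();
                } catch (IOException closeFailure) {
                    e.addSuppressed(closeFailure);
                }
            }
            throw e;
        }
        return new SequenceInputStream(Collections.enumeration(decoded));
    }
}
```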
Plumbing was in place (DumpFileDiscovery grouping in DataMachineFiles /
TimeMachineFiles, MultiPartXmlDumpReader for per-part SAX dispatch with
a shared DumpWriter); this commit connects the consumer pipelines so a
dump split across several Wikimedia files is ingested as a single
logical document.

- XML2Binary (datamachine): new XML2Binary(List<InputStream>,
  DataMachineFiles) constructor routes the list through
  MultiPartXmlDumpReader.readDumps with SimpleXmlDumpReader::new. The
  legacy single-stream constructor is unchanged.
- DataMachineGenerator.processInputDump: opens one decompressed stream
  per configured pages-articles / pages-meta-current part (favouring
  meta-current when present) and hands the list to the multi-part
  constructor. Single-file dumps collapse to a one-element list with
  identical semantics to the legacy path.
- DumpTableInputStream (wikimachine): adds default
  initialize(List<InputStream>, DumpTableEnum) that dispatches to the
  single-stream initializer for size-1 lists and throws
  UnsupportedOperationException otherwise (see the sketch after this
  list). Subclasses override when they can read across parts.
- XMLDumpTableInputStreamThread (timemachine): new constructor that
  drives MultiPartXmlDumpReader for a List<InputStream>, selecting the
  Page/Revision/Text reader per DumpTableEnum. Single-stream mode
  unchanged.
- XMLDumpTableInputStream: overrides the list-based initialize to use
  the new multi-part thread; the single-stream path is preserved.
- TimeMachineGenerator: each of createRevisionParser, createPageParser,
  and createTextParser now opens one decompressed stream per configured
  meta-history part via a shared helper and passes the list to
  DumpTableInputStream.initialize.
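
A sketch of the default-method dispatch from the DumpTableInputStream bullet; the return type, enum values, and surrounding interface shape are assumptions:

```java
import java.io.IOException;
import java.io.InputStream;
import java.util.List;

// Stand-in for the real wikimachine enum.
enum DumpTableEnum { PAGE, REVISION, TEXT }

interface DumpTableInputStream {

    InputStream initialize(InputStream input, DumpTableEnum table) throws IOException;

    /**
     * Size-1 lists fall back to the single-stream path; everything else
     * is rejected unless a subclass can actually read across parts.
     */
    default InputStream initialize(List<InputStream> parts, DumpTableEnum table)
            throws IOException {
        if (parts.size() == 1) {
            return initialize(parts.get(0), table);
        }
        throw new UnsupportedOperationException(
                "multi-part dumps are not supported by " + getClass().getName());
    }
}
```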
readDumps used to leave its InputStream arguments open; callers
(DataMachineGenerator, TimeMachineGenerator) also never closed the
streams they opened via the decompressor, multiplying the leak by the
number of parts in a multi-file dump.

- readDumps now closes every stream in the parts list before returning,
  on both the success and failure paths. Exceptions raised during close
  or during the wrapper's final flush are attached as suppressed to any
  primary error from the parse.
- The error path is simplified: a single primary-exception slot is
  threaded through parse, per-part close, and wrapper.finish(), then
  rethrown at the end (see the sketch after this list). Replaces the
  earlier dual-finish call that relied on finish()'s idempotency for
  correctness.
- Javadoc documents the ownership contract so callers no longer need
  to close the streams themselves.
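
A sketch of the single primary-exception slot, reusing the MultiPartDumpWriter sketch above; readDumps' real signature and parse loop are elided:

```java
import java.io.IOException;
import java.io.InputStream;
import java.util.List;

final class ReadDumpsSketch {

    // One primary slot threaded through parse, per-part close, and
    // finish(); later failures become suppressed exceptions on the first.
    static void readDumps(List<InputStream> parts, MultiPartDumpWriter wrapper)
            throws IOException {
        IOException primary = null;
        try {
            parseParts(parts, wrapper);
        } catch (IOException e) {
            primary = e;
        }
        for (InputStream part : parts) {      // success and failure paths
            try {
                part.close();
            } catch (IOException e) {
                primary = addOrSuppress(primary, e);
            }
        }
        try {
            wrapper.finish();
        } catch (IOException e) {
            primary = addOrSuppress(primary, e);
        }
        if (primary != null) {
            throw primary;
        }
    }

    private static void parseParts(List<InputStream> parts, MultiPartDumpWriter wrapper)
            throws IOException {
        // per-part SAX dispatch elided in this sketch
    }

    private static IOException addOrSuppress(IOException primary, IOException next) {
        if (primary == null) {
            return next;
        }
        primary.addSuppressed(next);
        return primary;
    }
}
```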
BZip2Decompressor and GZipDecompressor carried byte-identical
closeQuietly helpers: the partial-open cleanup routine that closes
every stream collected so far and attaches IOExceptions from close
as suppressed on the primary error. This commit consolidates them into
a single protected static method on AbstractDecompressor so any future
decompressor that wires up a multi-part path picks it up for free.
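
Roughly the shape of the consolidated helper (signature assumed):

```java
import java.io.IOException;
import java.io.InputStream;
import java.util.List;

public abstract class AbstractDecompressor {

    /**
     * Closes every part stream opened so far, attaching close failures
     * as suppressed exceptions on the caller's primary error.
     */
    protected static void closeQuietly(List<InputStream> opened, Throwable primary) {
        for (InputStream stream : opened) {
            try {
                stream.close();
            } catch (IOException e) {
                primary.addSuppressed(e);
            }
        }
    }
}
```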
Three small encapsulation / API hygiene touch-ups surfaced during
review:

- DumpFileDiscovery.pageRangeStart is demoted from package-private to
  private. It is only referenced internally via the
  DumpFileDiscovery::pageRangeStart method reference inside
  orderByPageRange; method references to private static members resolve
  fine from the same class. No reason to leak the compare key into the
  package.
- MultiPartXmlDumpReader.ReaderFactory no longer extends
  BiFunction<InputStream, DumpWriter, AbstractXmlDumpReader>. Extending
  the JDK SAM pulled BiFunction.andThen into the public surface, which
  has no meaningful semantics here. It is now a standalone
  @FunctionalInterface with a single create(InputStream, DumpWriter)
  method; method references such as SimpleXmlDumpReader::new continue
  to match the SAM unchanged (see the sketch after this list).
- MultiPartDumpWriter gains an explicit "not thread-safe" javadoc note.
  Matches the DumpWriter contract and documents the single-threaded
  expectation the multi-part pipeline relies on.
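
The second item's standalone SAM, sketched with a minimal reader stub (the DumpWriter stand-in from the MultiPartDumpWriter sketch above is reused; the real types are richer):

```java
import java.io.IOException;
import java.io.InputStream;

// Minimal stub; readDump() and abort() mirror the single-file strategy
// methods named in the commits, everything else is omitted.
abstract class AbstractXmlDumpReader {
    abstract void readDump() throws IOException;
    abstract void abort();
}

// Standalone SAM: nothing like BiFunction.andThen leaks into the public
// surface, yet method references such as SimpleXmlDumpReader::new
// still match the single abstract method.
@FunctionalInterface
interface ReaderFactory {
    AbstractXmlDumpReader create(InputStream input, DumpWriter writer);
}
```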
…gies

The thread class had two operating modes crammed into one body: the
single-file path set xmlReader and left the multi-part fields null;
the multi-part path did the opposite. run() and abort() switched on
(parts != null). Classic "state machine in nullable fields".

Replaces the mode flags with two final strategies assigned once in
each constructor:

- ParseTask parseTask — what run() invokes (reader.readDump() for the
  single-file path, MultiPartXmlDumpReader.readDumps(...) for the
  multi-part path).
- Runnable abortAction — what abort() invokes (reader.abort() for the
  single-file path, a no-op for multi-part).

No null fields, no runtime mode checks, and the multi-part path no
longer needs to keep a separate DumpWriter + ReaderFactory as state
purely to reconstruct the action at run() time.
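
A sketch of the resulting shape, reusing the stubs above and assuming MultiPartXmlDumpReader.readDumps from the earlier commit; the class name and error propagation are illustrative:

```java
import java.io.IOException;
import java.io.InputStream;
import java.util.List;

@FunctionalInterface
interface ParseTask {
    void parse() throws IOException;
}

final class XmlDumpTableInputStreamThreadSketch extends Thread {

    private final ParseTask parseTask;      // assigned once, never null
    private final Runnable abortAction;     // assigned once, never null

    // Single-file mode: both strategies delegate to the reader.
    XmlDumpTableInputStreamThreadSketch(AbstractXmlDumpReader reader) {
        this.parseTask = reader::readDump;
        this.abortAction = reader::abort;
    }

    // Multi-part mode: parse via MultiPartXmlDumpReader, abort is a no-op.
    XmlDumpTableInputStreamThreadSketch(List<InputStream> parts,
                                        DumpWriter writer,
                                        ReaderFactory factory) {
        this.parseTask = () -> MultiPartXmlDumpReader.readDumps(parts, writer, factory);
        this.abortAction = () -> { };
    }

    @Override
    public void run() {
        try {
            parseTask.parse();              // no runtime mode checks
        } catch (IOException e) {
            // error propagation elided in this sketch
        }
    }

    void abort() {
        abortAction.run();
    }
}
```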
Contributor

rzo1 commented Apr 24, 2026

@mawiesne Missing pieces are in.

