#420: Support multiple file Wikipedia dump archives (#435)
Draft
Conversation
- Adds a method for handling multi-file dump archives to IDecompressor.
- Adds related implementations in BZip2Decompressor, GZipDecompressor, and UniversalDecompressor, each relying on SequenceInputStream.
- WIP; tests to follow.
- BZip2Decompressor.getInputStreamSequence now enables decompressConcatenated on BZip2CompressorInputStream; without the flag only the first part of a multi-file bz2 archive was decoded.
- Close already-opened part streams on failure in BZip2Decompressor and GZipDecompressor to avoid file-descriptor leaks.
- UniversalDecompressor.getInputStreamSequence rejects lists whose parts do not share a single archive format, preventing silent misdecoding; also drops an unreachable null-element branch.
- Replace the legacy synchronized Vector with ArrayList in the bz2 and gz sequence paths.
- Tidy the IDecompressor.getInputStreamSequence javadoc (typo, wrong "or null" clause, clearer contract).
- Extend AbstractDecompressorTest with common contract tests for getInputStreamSequence (null list, empty list, null element, directory element); SevenZipDecompressorTest overrides these to assert UnsupportedOperationException.
- Add multi-part round-trip tests for bz2 and gz that generate fixtures via BZip2CompressorOutputStream / GZIPOutputStream and verify the decompressed stream equals the byte concatenation of all parts.
- Add UniversalDecompressor dispatch tests (bz2, gz), rejection of 7z multi-file input, unsupported extensions, and mixed-extension lists.
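A minimal sketch of the bz2 sequence path this commit describes, assuming Commons Compress' BZip2CompressorInputStream(InputStream, boolean) constructor; the helper name and exact shape are illustrative, not the PR's code, and a later commit (below) revises this approach:

```java
import java.io.BufferedInputStream;
import java.io.File;
import java.io.FileInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.SequenceInputStream;
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;

import org.apache.commons.compress.compressors.bzip2.BZip2CompressorInputStream;

// Raw parts are concatenated first, then decoded once with
// decompressConcatenated=true so decoding continues past the first member.
static InputStream bz2Sequence(List<File> files) throws IOException {
    List<InputStream> parts = new ArrayList<>();  // ArrayList, not the legacy Vector
    try {
        for (File f : files) {
            parts.add(new BufferedInputStream(new FileInputStream(f)));
        }
    } catch (IOException e) {
        // close already-opened part streams to avoid file-descriptor leaks
        for (InputStream in : parts) {
            try {
                in.close();
            } catch (IOException suppressed) {
                e.addSuppressed(suppressed);
            }
        }
        throw e;
    }
    InputStream raw = new SequenceInputStream(Collections.enumeration(parts));
    return new BZip2CompressorInputStream(raw, true);
}
```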
Each Wikimedia multi-file dump part (e.g. pages-articles1.xml-p1p10.bz2) is a standalone XML document with its own <mediawiki> root and <siteinfo> preamble, so the decompressed byte concatenation is not well-formed XML and cannot be fed to a single SAXParser.parse() call. This commit adds the plumbing to parse such a sequence as one logical document without touching the existing single-stream pipeline.
- MultiPartDumpWriter (mwdumper): a DumpWriter decorator that forwards writeStartWiki / writeSiteinfo only on the first invocation, swallows per-part writeEndWiki / close, and exposes finish() to emit the single terminal writeEndWiki and close the delegate exactly once.
- AbstractXmlDumpReader: readDump() is split into a protected doParse() that runs SAX without closing the writer, plus the existing readDump(), which calls doParse() followed by writer.close(). The single-stream contract is unchanged.
- MultiPartXmlDumpReader (wikimachine): readDumps(parts, writer, factory) iterates over the parts, instantiates a fresh reader per part via the factory, routes events through MultiPartDumpWriter, and guarantees finish() runs on the failure path as well.
- Unit tests for MultiPartDumpWriter (lifecycle collapsing, passthrough, idempotent finish, null-delegate rejection).
- Integration tests for MultiPartXmlDumpReader using in-memory XML parts against WikiXMLDumpReader, asserting exact event ordering, null/empty input rejection, and delegate-close on parse failure.
Consumers (XML2Binary, DataMachineGenerator, TimeMachineGenerator) are not yet wired up to this API; that is a separate follow-up.
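A condensed sketch of the decorator's lifecycle collapsing, assuming the mwdumper DumpWriter interface; field names and details are illustrative:

```java
import java.io.IOException;
import java.util.Objects;

import org.mediawiki.importer.*;

public class MultiPartDumpWriter implements DumpWriter {
    private final DumpWriter delegate;
    private boolean startWikiWritten;
    private boolean siteinfoWritten;
    private boolean finished;

    public MultiPartDumpWriter(DumpWriter delegate) {
        this.delegate = Objects.requireNonNull(delegate, "delegate");
    }

    // start events are forwarded only for the first part
    @Override public void writeStartWiki() throws IOException {
        if (!startWikiWritten) { startWikiWritten = true; delegate.writeStartWiki(); }
    }
    @Override public void writeSiteinfo(Siteinfo info) throws IOException {
        if (!siteinfoWritten) { siteinfoWritten = true; delegate.writeSiteinfo(info); }
    }

    // per-page events pass straight through
    @Override public void writeStartPage(Page page) throws IOException { delegate.writeStartPage(page); }
    @Override public void writeEndPage() throws IOException { delegate.writeEndPage(); }
    @Override public void writeRevision(Revision rev) throws IOException { delegate.writeRevision(rev); }

    // per-part terminators are swallowed; finish() emits them exactly once
    @Override public void writeEndWiki() throws IOException { /* swallowed */ }
    @Override public void close() throws IOException { /* swallowed */ }

    public void finish() throws IOException {
        if (!finished) { finished = true; delegate.writeEndWiki(); delegate.close(); }
    }
}
```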
Wikimedia publishes large XML dumps split across several files named
<prefix>-<role><N>.xml-p<start>p<end>.<ext> (e.g. pages-articles1.xml-
p1p297012.bz2). The existing DataMachineFiles scanner used
.contains("pages-articles.xml") and therefore silently ignored every
multi-part file. This commit adds grouping + ordering support without
wiring the XML consumers yet.
- New util DumpFileDiscovery (wikimachine): recognises the multi-part
page-range suffix, matches role names under both single-file and
multi-part schemes (rejecting look-alikes such as
pages-articles-multistream), and orders a collection of parts by
ascending start page id (see the sketch after this list).
- DataMachineFiles: internally stores pages-articles and
pages-meta-current as ordered List<File>. Legacy singular getters
keep returning the first part for backwards compatibility; new
getInputPagesArticlesFiles() / getInputPagesMetaCurrentFiles()
expose the full ordered list. Role matching uses DumpFileDiscovery
and now correctly picks up multi-part names. SQL roles (pagelinks,
categorylinks) stay single-file.
- TimeMachineFiles: metaHistoryFiles is now a List<String>. Legacy
setMetaHistoryFile/getMetaHistoryFile still work (singleton list);
new setMetaHistoryFiles/getMetaHistoryFiles accept and expose the
ordered list; checkAll() verifies every part is readable.
- Tests for DumpFileDiscovery (pattern matching, rejection of
look-alikes, stable ordering with ranged and unranged files),
DataMachineFiles multi-part grouping, and TimeMachineFiles list
setter/getter behaviour and validation.
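An illustrative sketch of the discovery logic under the naming scheme above; the regex, the matchesRole helper, and the class shape are assumptions, while pageRangeStart and orderByPageRange are named in the commits:

```java
import java.io.File;
import java.util.ArrayList;
import java.util.Collection;
import java.util.Comparator;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

final class DumpFileDiscoverySketch {

    // multi-part page-range suffix: ...<N>.xml-p<start>p<end>.<ext>
    private static final Pattern PAGE_RANGE =
            Pattern.compile(".*\\.xml-p(\\d+)p(\\d+)\\.[^.]+$");

    // role matches under both schemes; a look-alike such as
    // pages-articles-multistream does not match role "pages-articles"
    static boolean matchesRole(String fileName, String role) {
        return fileName.matches(".*-" + Pattern.quote(role)
                + "(\\.xml|\\d+\\.xml-p\\d+p\\d+)\\.[^.]+$");
    }

    // ordering key: ascending start page id; unranged files sort first
    private static long pageRangeStart(File f) {
        Matcher m = PAGE_RANGE.matcher(f.getName());
        return m.matches() ? Long.parseLong(m.group(1)) : -1L;
    }

    static List<File> orderByPageRange(Collection<File> parts) {
        List<File> ordered = new ArrayList<>(parts);
        ordered.sort(Comparator.comparingLong(DumpFileDiscoverySketch::pageRangeStart));
        return ordered;
    }
}
```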
The earlier implementation fed a SequenceInputStream of raw compressed parts to a single decompressor (GZIPInputStream for gz, relying on RFC 1952 multi-member support; BZip2CompressorInputStream with decompressConcatenated=true for bz2). That worked locally but failed on Java 17 / HSQLDB CI runs: decoding stopped after the first part, so only one part's content was returned.

Root cause: GZIPInputStream detects a subsequent concatenated member by inspecting the underlying stream's available() count after the trailer of the current member. SequenceInputStream's available() returns the *current* underlying stream's count, which is zero at the boundary between parts before a read() has triggered the switch to the next part. On timing/buffer-sensitive paths this made GZIPInputStream conclude the stream ended after part one.

Fix: wrap each part in its own decompressor (GZIPInputStream / BZip2CompressorInputStream) and concatenate the *decompressed* streams with SequenceInputStream. The semantic result is identical, nothing depends on compressed-side multi-member detection, and bz2 and gz behave consistently. The existing multi-part round-trip tests still pass and the CI-reproducible failure is gone.
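A minimal sketch of the fixed gz path under the shape described above (helper name illustrative):

```java
import java.io.BufferedInputStream;
import java.io.File;
import java.io.FileInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.SequenceInputStream;
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;
import java.util.zip.GZIPInputStream;

// One decoder per part, concatenated on the *decompressed* side, so
// nothing depends on GZIPInputStream's multi-member detection across
// part boundaries.
static InputStream gzSequence(List<File> files) throws IOException {
    List<InputStream> decoded = new ArrayList<>();
    try {
        for (File f : files) {
            decoded.add(new GZIPInputStream(new BufferedInputStream(new FileInputStream(f))));
        }
    } catch (IOException e) {
        // close decoders opened so far to avoid file-descriptor leaks
        for (InputStream in : decoded) {
            try {
                in.close();
            } catch (IOException suppressed) {
                e.addSuppressed(suppressed);
            }
        }
        throw e;
    }
    return new SequenceInputStream(Collections.enumeration(decoded));
}
```

Note that this sketch opens every part eagerly (each GZIPInputStream constructor reads its header); a lazy Enumeration that opens parts on demand would also satisfy the description.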
Plumbing was in place (DumpFileDiscovery grouping in DataMachineFiles / TimeMachineFiles, MultiPartXmlDumpReader for per-part SAX dispatch with a shared DumpWriter); this commit connects the consumer pipelines so that a dump split across several Wikimedia files is ingested as a single logical document.
- XML2Binary (datamachine): a new XML2Binary(List<InputStream>, DataMachineFiles) constructor routes the list through MultiPartXmlDumpReader.readDumps with SimpleXmlDumpReader::new. The legacy single-stream constructor is unchanged.
- DataMachineGenerator.processInputDump: opens one decompressed stream per configured pages-articles / pages-meta-current part (favouring meta-current when present) and hands the list to the multi-part constructor. Single-file dumps collapse to a one-element list with semantics identical to the legacy path.
- DumpTableInputStream (wikimachine): adds a default initialize(List<InputStream>, DumpTableEnum) that dispatches to the single-stream initializer for size-1 lists and throws UnsupportedOperationException otherwise. Subclasses override it when they can read across parts.
- XMLDumpTableInputStreamThread (timemachine): a new constructor drives MultiPartXmlDumpReader for a List<InputStream>, selecting the Page/Revision/Text reader per DumpTableEnum. Single-stream mode is unchanged.
- XMLDumpTableInputStream: overrides the list-based initialize to use the new multi-part thread; the single-stream path is preserved.
- TimeMachineGenerator: createRevisionParser, createPageParser, and createTextParser each now open one decompressed stream per configured meta-history part via a shared helper and pass the list to DumpTableInputStream.initialize.
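A hypothetical caller-side sketch of the DataMachineGenerator wiring; IDecompressor.getInputStream(File) and the helper name are assumed signatures, while getInputPagesArticlesFiles() and the list-based XML2Binary constructor are named in the commits:

```java
import java.io.File;
import java.io.IOException;
import java.io.InputStream;
import java.util.ArrayList;
import java.util.List;

// getInputStream(File) signature is an assumption for illustration
static void ingestPagesArticles(DataMachineFiles files, IDecompressor decompressor)
        throws IOException {
    List<InputStream> parts = new ArrayList<>();
    for (File part : files.getInputPagesArticlesFiles()) {
        parts.add(decompressor.getInputStream(part));  // one decoded stream per part
    }
    // single-file dumps collapse to a one-element list with identical semantics;
    // internally this routes through:
    // MultiPartXmlDumpReader.readDumps(parts, writer, SimpleXmlDumpReader::new);
    new XML2Binary(parts, files);
}
```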
readDumps used to leave its InputStream arguments open; callers (DataMachineGenerator, TimeMachineGenerator) also never closed the streams they opened via the decompressor, multiplying the leak by the number of parts in a multi-file dump.
- readDumps now closes every stream in the parts list before returning, on both the success and failure paths. Exceptions raised during close or during the wrapper's final flush are attached as suppressed to any primary error from the parse.
- The error path is simplified: a single primary-exception slot is threaded through the parse, the per-part close, and wrapper.finish(), then rethrown at the end. This replaces the earlier dual-finish call that relied on finish()'s idempotency for correctness.
- Javadoc documents the ownership contract, so callers no longer need to close the streams themselves.
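A sketch of the resulting readDumps discipline under the types introduced earlier (assumed shape, not the PR's exact code):

```java
import java.io.IOException;
import java.io.InputStream;
import java.util.List;

import org.mediawiki.importer.DumpWriter;

// One primary-exception slot threaded through parse, per-part close,
// and wrapper.finish(), then rethrown at the end.
static void readDumps(List<InputStream> parts, DumpWriter writer,
                      MultiPartXmlDumpReader.ReaderFactory factory) throws IOException {
    MultiPartDumpWriter wrapper = new MultiPartDumpWriter(writer);
    IOException primary = null;
    try {
        for (InputStream part : parts) {
            factory.create(part, wrapper).readDump();  // per-part close is swallowed by the wrapper
        }
    } catch (IOException e) {
        primary = e;
    } finally {
        for (InputStream part : parts) {
            try {
                part.close();  // readDumps owns the streams now
            } catch (IOException e) {
                if (primary == null) primary = e; else primary.addSuppressed(e);
            }
        }
        try {
            wrapper.finish();  // single terminal writeEndWiki + delegate close
        } catch (IOException e) {
            if (primary == null) primary = e; else primary.addSuppressed(e);
        }
    }
    if (primary != null) throw primary;
}
```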
BZip2Decompressor and GZipDecompressor carried byte-identical closeQuietly helpers: the partial-open cleanup routine that closes every stream collected so far and attaches IOExceptions from close as suppressed on the primary error. This commit consolidates them into a single protected static method on AbstractDecompressor, so any future decompressor that wires up a multi-part path picks it up for free.
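The consolidated helper might look roughly like this (parameter types assumed):

```java
import java.io.IOException;
import java.io.InputStream;
import java.util.List;

// On AbstractDecompressor: close everything opened so far and attach
// close failures to the primary error, which stays primary.
protected static void closeQuietly(List<InputStream> opened, IOException primary) {
    for (InputStream in : opened) {
        try {
            in.close();
        } catch (IOException e) {
            primary.addSuppressed(e);
        }
    }
}
```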
Three small encapsulation / API hygiene touch-ups surfaced during review:
- DumpFileDiscovery.pageRangeStart is demoted from package-private to private. It is only referenced internally via the DumpFileDiscovery::pageRangeStart method reference inside orderByPageRange; method references to private static members resolve fine from the same class. There is no reason to leak the compare key into the package.
- MultiPartXmlDumpReader.ReaderFactory no longer extends BiFunction<InputStream, DumpWriter, AbstractXmlDumpReader>. Extending the JDK SAM pulled BiFunction.andThen into the public surface, where it has no meaningful semantics. It is now a standalone @FunctionalInterface with a single create(InputStream, DumpWriter) method; method references such as SimpleXmlDumpReader::new continue to match the SAM unchanged.
- MultiPartDumpWriter gains an explicit "not thread-safe" javadoc note. This matches the DumpWriter contract and documents the single-threaded expectation the multi-part pipeline relies on.
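A sketch of the standalone SAM as described (javadoc omitted):

```java
import java.io.InputStream;

import org.mediawiki.importer.DumpWriter;

// Standalone functional interface: no BiFunction.andThen in the public
// surface, and constructor references still match the single method.
@FunctionalInterface
public interface ReaderFactory {
    AbstractXmlDumpReader create(InputStream in, DumpWriter writer);
}

// usage unchanged:
// MultiPartXmlDumpReader.readDumps(parts, writer, SimpleXmlDumpReader::new);
```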
…gies
The thread class had two operating modes crammed into one body: the single-file path set xmlReader and left the multi-part fields null; the multi-part path did the opposite. run() and abort() switched on (parts != null), a classic "state machine in nullable fields". This commit replaces the mode flags with two final strategies assigned once in each constructor:
- ParseTask parseTask: what run() invokes (reader.readDump() for the single-file path, MultiPartXmlDumpReader.readDumps(...) for the multi-part path).
- Runnable abortAction: what abort() invokes (reader.abort() for the single-file path, a no-op for multi-part).
No null fields, no runtime mode checks, and the multi-part path no longer needs to keep a separate DumpWriter + ReaderFactory as state purely to reconstruct the action at run() time.
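A condensed sketch of the two-strategy shape (class body abridged, error reporting elided; the type names are assumptions based on the commit message):

```java
import java.io.IOException;
import java.io.InputStream;
import java.util.List;

import org.mediawiki.importer.DumpWriter;

final class XmlDumpTableInputStreamThreadSketch extends Thread {

    @FunctionalInterface
    interface ParseTask { void parse() throws IOException; }

    private final ParseTask parseTask;
    private final Runnable abortAction;

    // single-file mode: both strategies bound to the one reader
    XmlDumpTableInputStreamThreadSketch(AbstractXmlDumpReader reader) {
        this.parseTask = reader::readDump;
        this.abortAction = reader::abort;
    }

    // multi-part mode: parse via readDumps; nothing to abort per reader
    XmlDumpTableInputStreamThreadSketch(List<InputStream> parts, DumpWriter writer,
                                        MultiPartXmlDumpReader.ReaderFactory factory) {
        this.parseTask = () -> MultiPartXmlDumpReader.readDumps(parts, writer, factory);
        this.abortAction = () -> { };
    }

    @Override
    public void run() {
        try {
            parseTask.parse();
        } catch (IOException e) {
            // surfaced to the consumer in the real class; elided here
        }
    }

    void abort() {
        abortAction.run();
    }
}
```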
Contributor
@mawiesne Missing pieces are in.
What's in the PR
java.io.SequenceInputStream

How to test manually
mvn clean verify

Automatic testing

Documentation