#420: Support multiple file Wikipedia dump archives (#435)
Draft
Conversation
- Adds a method for handling multi-file dump archives to IDecompressor.
- Adds related implementations in BZip2Decompressor, GZipDecompressor, and UniversalDecompressor, each relying on SequenceInputStream.
- WIP; tests to follow.
- BZip2Decompressor.getInputStreamSequence now enables decompressConcatenated on BZip2CompressorInputStream; without the flag only the first part of a multi-file bz2 archive was decoded.
- Close already-opened part streams on failure in BZip2Decompressor and GZipDecompressor to avoid file-descriptor leaks.
- UniversalDecompressor.getInputStreamSequence rejects lists whose parts do not share a single archive format, preventing silent misdecoding; also drops an unreachable null-element branch.
- Replace the legacy synchronized Vector with ArrayList in the bz2 and gz sequence paths.
- Tidy the IDecompressor.getInputStreamSequence javadoc (typo, wrong "or null" clause, clearer contract).
- Extend AbstractDecompressorTest with common contract tests for getInputStreamSequence (null list, empty list, null element, directory element); SevenZipDecompressorTest overrides these to assert UnsupportedOperationException.
- Add multi-part round-trip tests for bz2 and gz that generate fixtures via BZip2CompressorOutputStream / GZIPOutputStream and verify the decompressed stream equals the byte concatenation of all parts.
- Add UniversalDecompressor dispatch tests (bz2, gz), rejection of 7z multi-file input, unsupported extensions, and mixed-extension lists.
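A minimal sketch of the bz2 sequence path this commit describes, assuming Commons Compress' BZip2CompressorInputStream(InputStream, boolean) constructor; the helper name and exact shape are illustrative, not the PR's code, and a later commit (below) revises this approach:

```java
import java.io.BufferedInputStream;
import java.io.File;
import java.io.FileInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.SequenceInputStream;
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;

import org.apache.commons.compress.compressors.bzip2.BZip2CompressorInputStream;

// Raw parts are concatenated first, then decoded once with
// decompressConcatenated=true so decoding continues past the first member.
static InputStream bz2Sequence(List<File> files) throws IOException {
    List<InputStream> parts = new ArrayList<>();  // ArrayList, not the legacy Vector
    try {
        for (File f : files) {
            parts.add(new BufferedInputStream(new FileInputStream(f)));
        }
    } catch (IOException e) {
        // close already-opened part streams to avoid file-descriptor leaks
        for (InputStream in : parts) {
            try {
                in.close();
            } catch (IOException suppressed) {
                e.addSuppressed(suppressed);
            }
        }
        throw e;
    }
    InputStream raw = new SequenceInputStream(Collections.enumeration(parts));
    return new BZip2CompressorInputStream(raw, true);
}
```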
Each Wikimedia multi-file dump part (e.g. pages-articles1.xml-p1p10.bz2) is a standalone XML document with its own <mediawiki> root and <siteinfo> preamble, so the decompressed byte concatenation is not well-formed XML and cannot be fed to a single SAXParser.parse() call. This commit adds the plumbing to parse such a sequence as one logical document without touching the existing single-stream pipeline.
- MultiPartDumpWriter (mwdumper): a DumpWriter decorator that forwards writeStartWiki / writeSiteinfo only on the first invocation, swallows per-part writeEndWiki / close, and exposes finish() to emit the single terminal writeEndWiki and close the delegate exactly once.
- AbstractXmlDumpReader: readDump() is split into a protected doParse() that runs SAX without closing the writer, plus the existing readDump(), which calls doParse() followed by writer.close(). The single-stream contract is unchanged.
- MultiPartXmlDumpReader (wikimachine): readDumps(parts, writer, factory) iterates over the parts, instantiates a fresh reader per part via the factory, routes events through MultiPartDumpWriter, and guarantees finish() runs on the failure path as well.
- Unit tests for MultiPartDumpWriter (lifecycle collapsing, passthrough, idempotent finish, null-delegate rejection).
- Integration tests for MultiPartXmlDumpReader using in-memory XML parts against WikiXMLDumpReader, asserting exact event ordering, null/empty input rejection, and delegate-close on parse failure.
Consumers (XML2Binary, DataMachineGenerator, TimeMachineGenerator) are not yet wired up to this API; that is a separate follow-up.
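A condensed sketch of the decorator's lifecycle collapsing, assuming the mwdumper DumpWriter interface; field names and details are illustrative:

```java
import java.io.IOException;
import java.util.Objects;

import org.mediawiki.importer.*;

public class MultiPartDumpWriter implements DumpWriter {
    private final DumpWriter delegate;
    private boolean startWikiWritten;
    private boolean siteinfoWritten;
    private boolean finished;

    public MultiPartDumpWriter(DumpWriter delegate) {
        this.delegate = Objects.requireNonNull(delegate, "delegate");
    }

    // start events are forwarded only for the first part
    @Override public void writeStartWiki() throws IOException {
        if (!startWikiWritten) { startWikiWritten = true; delegate.writeStartWiki(); }
    }
    @Override public void writeSiteinfo(Siteinfo info) throws IOException {
        if (!siteinfoWritten) { siteinfoWritten = true; delegate.writeSiteinfo(info); }
    }

    // per-page events pass straight through
    @Override public void writeStartPage(Page page) throws IOException { delegate.writeStartPage(page); }
    @Override public void writeEndPage() throws IOException { delegate.writeEndPage(); }
    @Override public void writeRevision(Revision rev) throws IOException { delegate.writeRevision(rev); }

    // per-part terminators are swallowed; finish() emits them exactly once
    @Override public void writeEndWiki() throws IOException { /* swallowed */ }
    @Override public void close() throws IOException { /* swallowed */ }

    public void finish() throws IOException {
        if (!finished) { finished = true; delegate.writeEndWiki(); delegate.close(); }
    }
}
```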
Wikimedia publishes large XML dumps split across several files named
<prefix>-<role><N>.xml-p<start>p<end>.<ext> (e.g. pages-articles1.xml-
p1p297012.bz2). The existing DataMachineFiles scanner used
.contains("pages-articles.xml") and therefore silently ignored every
multi-part file. This commit adds grouping + ordering support without
wiring the XML consumers yet.
- New util DumpFileDiscovery (wikimachine): recognises the multi-part
page-range suffix, matches role names under both single-file and
multi-part schemes (rejecting look-alikes such as
pages-articles-multistream), and orders a collection of parts by
ascending start page id (see the sketch after this list).
- DataMachineFiles: internally stores pages-articles and
pages-meta-current as ordered List<File>. Legacy singular getters
keep returning the first part for backwards compatibility; new
getInputPagesArticlesFiles() / getInputPagesMetaCurrentFiles()
expose the full ordered list. Role matching uses DumpFileDiscovery
and now correctly picks up multi-part names. SQL roles (pagelinks,
categorylinks) stay single-file.
- TimeMachineFiles: metaHistoryFiles is now a List<String>. Legacy
setMetaHistoryFile/getMetaHistoryFile still work (singleton list);
new setMetaHistoryFiles/getMetaHistoryFiles accept and expose the
ordered list; checkAll() verifies every part is readable.
- Tests for DumpFileDiscovery (pattern matching, rejection of
look-alikes, stable ordering with ranged and unranged files),
DataMachineFiles multi-part grouping, and TimeMachineFiles list
setter/getter behaviour and validation.
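An illustrative sketch of the discovery logic under the naming scheme above; the regex, the matchesRole helper, and the class shape are assumptions, while pageRangeStart and orderByPageRange are named in the commits:

```java
import java.io.File;
import java.util.ArrayList;
import java.util.Collection;
import java.util.Comparator;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

final class DumpFileDiscoverySketch {

    // multi-part page-range suffix: ...<N>.xml-p<start>p<end>.<ext>
    private static final Pattern PAGE_RANGE =
            Pattern.compile(".*\\.xml-p(\\d+)p(\\d+)\\.[^.]+$");

    // role matches under both schemes; a look-alike such as
    // pages-articles-multistream does not match role "pages-articles"
    static boolean matchesRole(String fileName, String role) {
        return fileName.matches(".*-" + Pattern.quote(role)
                + "(\\.xml|\\d+\\.xml-p\\d+p\\d+)\\.[^.]+$");
    }

    // ordering key: ascending start page id; unranged files sort first
    private static long pageRangeStart(File f) {
        Matcher m = PAGE_RANGE.matcher(f.getName());
        return m.matches() ? Long.parseLong(m.group(1)) : -1L;
    }

    static List<File> orderByPageRange(Collection<File> parts) {
        List<File> ordered = new ArrayList<>(parts);
        ordered.sort(Comparator.comparingLong(DumpFileDiscoverySketch::pageRangeStart));
        return ordered;
    }
}
```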
The earlier implementation fed a SequenceInputStream of raw compressed parts to a single decompressor (GZIPInputStream for gz, relying on RFC 1952 multi-member support; BZip2CompressorInputStream with decompressConcatenated=true for bz2). That worked locally but failed on Java 17 / HSQLDB CI runs: decoding stopped after the first part, so only one part's content was returned.

Root cause: GZIPInputStream detects a subsequent concatenated member by inspecting the underlying stream's available() count after the trailer of the current member. SequenceInputStream's available() returns the *current* underlying stream's count, which is zero at the boundary between parts before a read() has triggered the switch to the next part. On timing/buffer-sensitive paths this made GZIPInputStream conclude the stream ended after part one.

Fix: wrap each part in its own decompressor (GZIPInputStream / BZip2CompressorInputStream) and concatenate the *decompressed* streams with SequenceInputStream. The semantic result is identical, nothing depends on compressed-side multi-member detection, and bz2 and gz behave consistently. The existing multi-part round-trip tests still pass and the CI-reproducible failure is gone.
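A minimal sketch of the fixed gz path under the shape described above (helper name illustrative):

```java
import java.io.BufferedInputStream;
import java.io.File;
import java.io.FileInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.SequenceInputStream;
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;
import java.util.zip.GZIPInputStream;

// One decoder per part, concatenated on the *decompressed* side, so
// nothing depends on GZIPInputStream's multi-member detection across
// part boundaries.
static InputStream gzSequence(List<File> files) throws IOException {
    List<InputStream> decoded = new ArrayList<>();
    try {
        for (File f : files) {
            decoded.add(new GZIPInputStream(new BufferedInputStream(new FileInputStream(f))));
        }
    } catch (IOException e) {
        // close decoders opened so far to avoid file-descriptor leaks
        for (InputStream in : decoded) {
            try {
                in.close();
            } catch (IOException suppressed) {
                e.addSuppressed(suppressed);
            }
        }
        throw e;
    }
    return new SequenceInputStream(Collections.enumeration(decoded));
}
```

Note that this sketch opens every part eagerly (each GZIPInputStream constructor reads its header); a lazy Enumeration that opens parts on demand would also satisfy the description.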
Plumbing was in place (DumpFileDiscovery grouping in DataMachineFiles / TimeMachineFiles, MultiPartXmlDumpReader for per-part SAX dispatch with a shared DumpWriter); this commit connects the consumer pipelines so that a dump split across several Wikimedia files is ingested as a single logical document.
- XML2Binary (datamachine): a new XML2Binary(List<InputStream>, DataMachineFiles) constructor routes the list through MultiPartXmlDumpReader.readDumps with SimpleXmlDumpReader::new. The legacy single-stream constructor is unchanged.
- DataMachineGenerator.processInputDump: opens one decompressed stream per configured pages-articles / pages-meta-current part (favouring meta-current when present) and hands the list to the multi-part constructor. Single-file dumps collapse to a one-element list with semantics identical to the legacy path.
- DumpTableInputStream (wikimachine): adds a default initialize(List<InputStream>, DumpTableEnum) that dispatches to the single-stream initializer for size-1 lists and throws UnsupportedOperationException otherwise. Subclasses override it when they can read across parts.
- XMLDumpTableInputStreamThread (timemachine): a new constructor drives MultiPartXmlDumpReader for a List<InputStream>, selecting the Page/Revision/Text reader per DumpTableEnum. Single-stream mode is unchanged.
- XMLDumpTableInputStream: overrides the list-based initialize to use the new multi-part thread; the single-stream path is preserved.
- TimeMachineGenerator: createRevisionParser, createPageParser, and createTextParser each now open one decompressed stream per configured meta-history part via a shared helper and pass the list to DumpTableInputStream.initialize.
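A hypothetical caller-side sketch of the DataMachineGenerator wiring; IDecompressor.getInputStream(File) and the helper name are assumed signatures, while getInputPagesArticlesFiles() and the list-based XML2Binary constructor are named in the commits:

```java
import java.io.File;
import java.io.IOException;
import java.io.InputStream;
import java.util.ArrayList;
import java.util.List;

// getInputStream(File) signature is an assumption for illustration
static void ingestPagesArticles(DataMachineFiles files, IDecompressor decompressor)
        throws IOException {
    List<InputStream> parts = new ArrayList<>();
    for (File part : files.getInputPagesArticlesFiles()) {
        parts.add(decompressor.getInputStream(part));  // one decoded stream per part
    }
    // single-file dumps collapse to a one-element list with identical semantics;
    // internally this routes through:
    // MultiPartXmlDumpReader.readDumps(parts, writer, SimpleXmlDumpReader::new);
    new XML2Binary(parts, files);
}
```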
readDumps used to leave its InputStream arguments open; callers (DataMachineGenerator, TimeMachineGenerator) also never closed the streams they opened via the decompressor, multiplying the leak by the number of parts in a multi-file dump.
- readDumps now closes every stream in the parts list before returning, on both the success and failure paths. Exceptions raised during close or during the wrapper's final flush are attached as suppressed to any primary error from the parse.
- The error path is simplified: a single primary-exception slot is threaded through the parse, the per-part close, and wrapper.finish(), then rethrown at the end. This replaces the earlier dual-finish call that relied on finish()'s idempotency for correctness.
- Javadoc documents the ownership contract, so callers no longer need to close the streams themselves.
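A sketch of the resulting readDumps discipline under the types introduced earlier (assumed shape, not the PR's exact code):

```java
import java.io.IOException;
import java.io.InputStream;
import java.util.List;

import org.mediawiki.importer.DumpWriter;

// One primary-exception slot threaded through parse, per-part close,
// and wrapper.finish(), then rethrown at the end.
static void readDumps(List<InputStream> parts, DumpWriter writer,
                      MultiPartXmlDumpReader.ReaderFactory factory) throws IOException {
    MultiPartDumpWriter wrapper = new MultiPartDumpWriter(writer);
    IOException primary = null;
    try {
        for (InputStream part : parts) {
            factory.create(part, wrapper).readDump();  // per-part close is swallowed by the wrapper
        }
    } catch (IOException e) {
        primary = e;
    } finally {
        for (InputStream part : parts) {
            try {
                part.close();  // readDumps owns the streams now
            } catch (IOException e) {
                if (primary == null) primary = e; else primary.addSuppressed(e);
            }
        }
        try {
            wrapper.finish();  // single terminal writeEndWiki + delegate close
        } catch (IOException e) {
            if (primary == null) primary = e; else primary.addSuppressed(e);
        }
    }
    if (primary != null) throw primary;
}
```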
BZip2Decompressor and GZipDecompressor carried byte-identical closeQuietly helpers: the partial-open cleanup routine that closes every stream collected so far and attaches IOExceptions from close as suppressed on the primary error. This commit consolidates them into a single protected static method on AbstractDecompressor, so any future decompressor that wires up a multi-part path picks it up for free.
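The consolidated helper might look roughly like this (parameter types assumed):

```java
import java.io.IOException;
import java.io.InputStream;
import java.util.List;

// On AbstractDecompressor: close everything opened so far and attach
// close failures to the primary error, which stays primary.
protected static void closeQuietly(List<InputStream> opened, IOException primary) {
    for (InputStream in : opened) {
        try {
            in.close();
        } catch (IOException e) {
            primary.addSuppressed(e);
        }
    }
}
```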
Three small encapsulation / API hygiene touch-ups surfaced during review:
- DumpFileDiscovery.pageRangeStart is demoted from package-private to private. It is only referenced internally via the DumpFileDiscovery::pageRangeStart method reference inside orderByPageRange; method references to private static members resolve fine from the same class. There is no reason to leak the compare key into the package.
- MultiPartXmlDumpReader.ReaderFactory no longer extends BiFunction<InputStream, DumpWriter, AbstractXmlDumpReader>. Extending the JDK SAM pulled BiFunction.andThen into the public surface, where it has no meaningful semantics. It is now a standalone @FunctionalInterface with a single create(InputStream, DumpWriter) method; method references such as SimpleXmlDumpReader::new continue to match the SAM unchanged.
- MultiPartDumpWriter gains an explicit "not thread-safe" javadoc note. This matches the DumpWriter contract and documents the single-threaded expectation the multi-part pipeline relies on.
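A sketch of the standalone SAM as described (javadoc omitted):

```java
import java.io.InputStream;

import org.mediawiki.importer.DumpWriter;

// Standalone functional interface: no BiFunction.andThen in the public
// surface, and constructor references still match the single method.
@FunctionalInterface
public interface ReaderFactory {
    AbstractXmlDumpReader create(InputStream in, DumpWriter writer);
}

// usage unchanged:
// MultiPartXmlDumpReader.readDumps(parts, writer, SimpleXmlDumpReader::new);
```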
…gies
The thread class had two operating modes crammed into one body: the single-file path set xmlReader and left the multi-part fields null; the multi-part path did the opposite. run() and abort() switched on (parts != null), a classic "state machine in nullable fields". This commit replaces the mode flags with two final strategies assigned once in each constructor:
- ParseTask parseTask: what run() invokes (reader.readDump() for the single-file path, MultiPartXmlDumpReader.readDumps(...) for the multi-part path).
- Runnable abortAction: what abort() invokes (reader.abort() for the single-file path, a no-op for multi-part).
No null fields, no runtime mode checks, and the multi-part path no longer needs to keep a separate DumpWriter + ReaderFactory as state purely to reconstruct the action at run() time.
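A condensed sketch of the two-strategy shape (class body abridged, error reporting elided; the type names are assumptions based on the commit message):

```java
import java.io.IOException;
import java.io.InputStream;
import java.util.List;

import org.mediawiki.importer.DumpWriter;

final class XmlDumpTableInputStreamThreadSketch extends Thread {

    @FunctionalInterface
    interface ParseTask { void parse() throws IOException; }

    private final ParseTask parseTask;
    private final Runnable abortAction;

    // single-file mode: both strategies bound to the one reader
    XmlDumpTableInputStreamThreadSketch(AbstractXmlDumpReader reader) {
        this.parseTask = reader::readDump;
        this.abortAction = reader::abort;
    }

    // multi-part mode: parse via readDumps; nothing to abort per reader
    XmlDumpTableInputStreamThreadSketch(List<InputStream> parts, DumpWriter writer,
                                        MultiPartXmlDumpReader.ReaderFactory factory) {
        this.parseTask = () -> MultiPartXmlDumpReader.readDumps(parts, writer, factory);
        this.abortAction = () -> { };
    }

    @Override
    public void run() {
        try {
            parseTask.parse();
        } catch (IOException e) {
            // surfaced to the consumer in the real class; elided here
        }
    }

    void abort() {
        abortAction.run();
    }
}
```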
Contributor
@mawiesne Missing pieces are in.
What's in the PR
java.io.SequenceInputStream

How to test manually
mvn clean verify

Automatic testing

Documentation