[PLUG] how do you deal with a 2.8tb xml file?

Rogan Creswick creswick at gmail.com
Thu Mar 12 07:40:21 UTC 2009


On Tue, Feb 17, 2009 at 2:24 PM, Rogan Creswick <creswick at gmail.com> wrote:
> On Tue, Feb 17, 2009 at 1:15 PM, Eric Wilhelm
> <scratchcomputing at gmail.com> wrote:
>> `man 7z` implies that `7z x -so` gives you something resembling
>> `gunzip -c`
>
> Yup -- I just got that to work.  7z does some "too fancy for my
> tastes" pipe detection / etc.
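
Incidentally, consuming that stream from a Java process only takes a
few lines -- a rough sketch, with a placeholder archive name:

  import java.io.InputStream;

  public class SevenZipPipe {
      public static void main(String[] args) throws Exception {
          // "7z x -so <archive>" writes the decompressed data to stdout
          // (7z's status chatter goes to stderr), so the process's
          // InputStream behaves much like the output of `gunzip -c`.
          Process p = new ProcessBuilder(
              "7z", "x", "-so", "pages-meta-history.xml.7z").start();

          InputStream in = p.getInputStream();
          byte[] buf = new byte[1 << 20];
          long total = 0;
          int n;
          while ((n = in.read(buf)) != -1) {
              total += n;
          }
          System.out.println("decompressed bytes: " + total);
      }
  }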

Just following up with the solution I'm using now.  I learned a few
interesting things in the process, and thought I'd share.

I needed to extract the top N most-revised Wikipedia pages -- each
page lives in a <page> tag, and each <page> contains a number of
<revision> tags.

The first attempt was to use a streaming XML parser to read in each
<page> element, count the revisions, and write it to disk if it made
the cut.  This failed horribly.  Many Wikipedia pages have more than
3GB of revisions, a 32-bit JVM can't address more than about 3GB, and
I was (foolishly) using Java's serialization API to serialize the
content through a gzip stream -- that only nets you about 50%
compression, whereas writing plain text through the same stream will
compress to 20-25% of the original size (or better).
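
The gist of that first attempt, as a sketch (StAX here purely for
illustration; the buffering of each page's content for output -- the
part that actually blew up -- is left out):

  import javax.xml.stream.XMLInputFactory;
  import javax.xml.stream.XMLStreamConstants;
  import javax.xml.stream.XMLStreamReader;

  public class RevisionCounter {
      public static void main(String[] args) throws Exception {
          // Pull events off the decompressed dump on stdin, so no
          // whole <page> ever has to fit in memory for the counting.
          XMLStreamReader xml = XMLInputFactory.newInstance()
              .createXMLStreamReader(System.in, "UTF-8");

          long revisions = 0;
          while (xml.hasNext()) {
              int event = xml.next();
              if (event == XMLStreamConstants.START_ELEMENT
                      && "revision".equals(xml.getLocalName())) {
                  revisions++;
              } else if (event == XMLStreamConstants.END_ELEMENT
                      && "page".equals(xml.getLocalName())) {
                  System.out.println("page had " + revisions + " revisions");
                  revisions = 0;
              }
          }
      }
  }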

Since the out-of-memory issue took me by surprise (the entire data
set is only 17GB, 7zipped, and I hadn't counted the articles yet),
and it was a Friday, I decided to scan the whole archive for <page>
tags and record the byte offsets.  Grep was the perfect tool
(--byte-offset), and it ran fast enough to complete over the weekend.

Using the output of grep to calculate and sort the page entries by
size was a minor adventure, but not too troublesome.  (There are over
11 million pages, by the way, and the largest has 52GB of revision
text!)
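
The arithmetic itself is simple: a page's size is just the gap
between its byte offset and the next one's.  A sketch of that step
(the file name is a placeholder; the input is the raw
`grep --byte-offset '<page>'` output, i.e. lines of the form
OFFSET:<page line>):

  import java.io.BufferedReader;
  import java.io.FileReader;
  import java.util.ArrayList;
  import java.util.Collections;
  import java.util.List;

  public class PageSizes {
      public static void main(String[] args) throws Exception {
          // Each grep output line starts with "BYTE_OFFSET:";
          // keep just the offsets.
          List<Long> offsets = new ArrayList<Long>();
          BufferedReader in =
              new BufferedReader(new FileReader("page-offsets.txt"));
          String line;
          while ((line = in.readLine()) != null) {
              offsets.add(Long.parseLong(
                  line.substring(0, line.indexOf(':'))));
          }
          in.close();

          // A page's size is the distance to the next page's offset.
          // (The final page would need the total file size; skipped here.)
          List<Long> sizes = new ArrayList<Long>();
          for (int i = 0; i + 1 < offsets.size(); i++) {
              sizes.add(offsets.get(i + 1) - offsets.get(i));
          }
          Collections.sort(sizes, Collections.reverseOrder());
          System.out.println("largest page: ~" + sizes.get(0) + " bytes");
      }
  }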

That obviously dictated a fixed-memory solution, so I hacked out
another Java app that does not use an XML parser at all.  Instead, it
searches for <page> tags, streams each page's content through a
(100MB-buffered) gzip stream to disk using a UUID as the file name,
and counts the <revision> tags as it goes.  The UUIDs and revision
counts are tracked, and once the article threshold is passed, those
counts are used to delete the smaller files on disk.
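
Boiled down -- and leaving out the bookkeeping that records the
UUID/revision-count pairs and prunes the smaller files once the
threshold is known -- the extraction loop is shaped something like
this (it assumes the dump keeps its tags on separate lines):

  import java.io.BufferedOutputStream;
  import java.io.BufferedReader;
  import java.io.FileOutputStream;
  import java.io.InputStreamReader;
  import java.io.OutputStreamWriter;
  import java.io.Writer;
  import java.util.UUID;
  import java.util.zip.GZIPOutputStream;

  public class PageExtractor {
      // roughly 100MB of buffering underneath each gzip stream
      private static final int BUFFER = 100 * 1024 * 1024;

      public static void main(String[] args) throws Exception {
          BufferedReader in =
              new BufferedReader(new InputStreamReader(System.in, "UTF-8"));

          Writer out = null;
          String uuid = null;
          long revisions = 0;
          String line;

          while ((line = in.readLine()) != null) {
              if (line.contains("<page>")) {
                  // new page: start a fresh gzipped file named by a UUID
                  uuid = UUID.randomUUID().toString();
                  out = new OutputStreamWriter(new GZIPOutputStream(
                      new BufferedOutputStream(
                          new FileOutputStream(uuid + ".gz"), BUFFER)),
                      "UTF-8");
                  revisions = 0;
              }
              if (out != null) {
                  out.write(line);
                  out.write('\n');
                  if (line.contains("<revision>")) {
                      revisions++;
                  }
                  if (line.contains("</page>")) {
                      out.close();
                      out = null;
                      System.out.println(uuid + "\t" + revisions);
                  }
              }
          }
      }
  }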

Unfortunately, even gzipped, the articles I'm extracting will still
need over 110GB of space -- and 188 hours to extract.  (Thanks to
'pv', the pipe viewer, for providing a progress bar and time
estimates!)

I have been able to set up a cromfs that's 7zip-compressed, however,
so that's my next attempt. (Writing to the cromfs instead of through a
gzipped stream.)

Thanks for all the suggestions!

--Rogan


