[PLUG] how do you deal with a 2.8TB XML file?

Rogan Creswick creswick at gmail.com
Tue Feb 17 06:27:46 UTC 2009


I need to run some experiments on the full histories of a large number
of Wikipedia articles, so I downloaded the only dataset archive
supplied by Wikipedia that includes page histories -- the *entire*
data set, as of the time the crawler stopped working.  (It's dated
Jan. 3, 2008, if anyone is curious, and if anyone wants a copy, just
let me know.  It's no longer hosted on the Wikipedia servers for some
reason, and I don't have a checksum either <sigh>, so hopefully it's
actually valid. ;)

The problem is two-fold:

   (1) I just don't have enough space.  The 7zip file is 17 gigs, and
the contents will extract to 2.8 terabytes (I know that from
inspecting the 7zip archive's listing).  I *think* I can partially get
around this by using FUSE and cromfs to mount the 7z file as a
compressed filesystem -- or maybe skip the filesystem entirely and
stream the extraction; see the first sketch below.

   (2) The real problem... I've had issues running XML/XSLT tools on
XML files that were hundreds of megs in size.  How on earth do I
manage a 2.8TB XML file?  Assuming I can get the 7z file to mount as a
filesystem, are there any off-the-shelf tools that will actually
process it?  (I obviously don't have the virtual memory or temp file
space for the whole thing.)  I'm searching for the schema, and hoping
that I can put together sample data to test XSLT transforms on, or
somehow prune the file down to a "manageable" 100 gigs or so, but even
then, we're talking about gigantic chunks of XML.  The second sketch
below is the sort of streaming pass I have in mind instead.

Anyhow, I'm just biting into these problems, and beyond the untested
sketches above I don't have many ideas, so any suggestions would be
very welcome.  (For comparison, our "big" machine here has 300 gigs
free, so I am somewhat interested in suggestions on how to build a
3TB+ filesystem, but as much fun as that would be, we probably can't
actually afford it, so I'm looking for software solutions first.)

Thanks!
Rogan


