[PLUG] how do you deal with a 2.8tb xml file?

Eric Wilhelm scratchcomputing at gmail.com
Tue Feb 17 07:43:12 UTC 2009


# from Rogan Creswick
# on Monday 16 February 2009 22:27:

>   (2) The real problem.... I've had issues running xml/xslt tools on
>xml files that were 100's of megs in size.  How on earth do I manage a
>2.8tb xml file?  Assuming I can get this zip file to mount as a
>filesystem, are there any off-the-shelf tools that will actually
>process it?

You should be able to stream the decompression on a pipe into any SAX 
parser (e.g. Expat) without needing any sort of huge memory space.  Not 
sure how long one pass will take (with decompression), but you probably 
don't want to try anything that isn't a streaming extraction.  
Maybe start with just the first several hundred/thousand lines -- 
assuming reasonable line lengths.

The good news is that if you know what you're looking for, you can 
probably get a reasonably small subset of the data out in one pass and 
into something saner than XML!  Perhaps a bdb?

--Eric
-- 
Introducing change is like pulling off a bandage: the pain is a memory
almost as soon as you feel it.
--Paul Graham
---------------------------------------------------
    http://scratchcomputing.com
---------------------------------------------------
