[PLUG] seek()able compressed files?
Eric Wilhelm
scratchcomputing at gmail.com
Fri Dec 21 01:33:22 UTC 2007
Hi all,
I've been reading about Tim Bray's "Wide Finder" project, where he's
basically looking at "grep for multicore" in the context of a
4-million-line Apache logfile (about 1GB uncompressed) on an 8-core Sun
machine.
http://www.tbray.org/ongoing/When/200x/2007/09/20/Wide-Finder
Having recently written a clusterable Apache log-processing tool, I was
intrigued by this, particularly the conclusion that the 8 cores can
process the file faster than any disk can deliver it to the CPU. But
the times on his page are based on a hot cache and require mmap'ing the
entire file. A hot disk cache is great if you're doing benchmarks, but
not so great in the real world, where you're stuck doing something
useful (e.g. processing new files all the time.)
Yeah, a RAID1 of 3.0Gb/s SATA disks would theoretically keep the 8
cores fed at 0.75GB/s -- just fast enough to keep up with the Perl
code, but I'm guessing that the mmap has to read the entire file before
any work can start (or -- graciously assuming a lazy/parallel mmap() --
at least the rindex() call needs to be looking at the end of the 1GB
string before the Nth process can get started.)
In my own experiments, I found that it is faster to read gzip'd data
off the disk/NFS and then process the other end of a gunzip pipe (the
kernel runs the pipe concurrently for free -- yay!) Particularly when
you're getting into clustering, the gigabit Ethernet of the fileserver
is still going to fill up at some point, so slinging 1/12th of the data
around and carrying a slight gunzip overhead on each node's CPU is a
win (even given Ethernet trunking at the server, etc.)
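For reference, the pipe trick is just something like this (a rough,
untested sketch -- the filename and the pattern are made up):

  #!/usr/bin/perl
  use strict;
  use warnings;

  my $file = 'access_log.gz';    # hypothetical filename
  # gunzip runs as a separate process; the kernel overlaps it with us
  open my $fh, '-|', 'gunzip', '-c', $file
    or die "can't start gunzip: $!";
  my $hits = 0;
  while (my $line = <$fh>) {
    $hits++ if $line =~ m{GET /ongoing/When/};   # stand-in pattern
  }
  close $fh;
  print "$hits matching lines\n";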
Ok, for those who are still with me... how do I parallelize the gunzip
or bunzip2 process for a single file? That is, say I want 4 nodes to
work on one file -- is it possible (or worth it) to seek() into the
compressed file, re-sync the decompression stream, and then start
working? From my reading of the gzip and bzip2 format specs, it seems
that you could conceivably find the start of the Nth compressed block
and start decompression there -- but the compressed blocks don't appear
to be labeled in any convenient way (even when the --rsyncable option
is given to gzip.) Yes, both specs state that the format is not
intended to be randomly accessible. My reading also implies that the
compressed data may not even be byte-aligned, at which point I'm
guessing the whole re-sync thing would be a no-go due to the overhead
of all that bit-twiddling (not to mention the 8x increase in the
feeling-around time.)
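For what it's worth, here's the sort of thing I was picturing for
bzip2: every block starts with a 48-bit magic (0x314159265359), but
since blocks aren't byte-aligned the lazy way to hunt for it is to
expand the file to one character per bit -- which is exactly the 8x
feeling-around cost I'm worried about. Rough, untested sketch (and the
magic can also show up by chance inside the compressed payload):

  #!/usr/bin/perl
  use strict;
  use warnings;

  my $file = shift or die "usage: $0 file.bz2\n";
  open my $fh, '<:raw', $file or die "open $file: $!";
  my $data = do { local $/; <$fh> };
  my $bits = unpack 'B*', $data;    # one char per bit: 8x the memory

  # 0x314159265359 spelled out as bits
  my $magic = '001100010100000101011001001001100101001101011001';

  my $pos = 0;
  while (($pos = index($bits, $magic, $pos)) >= 0) {
    printf "block candidate at byte %d, bit %d\n", int($pos / 8), $pos % 8;
    $pos++;
  }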
Does seek() over NFS even gain me anything? That is, would the node
still be pulling all of the data up to the seek point or does NFS
bypass the intermediate data and only ship what gets read() down the
wire?
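(I suppose I could just measure it: sysseek() deep into a file on the
NFS mount, sysread() a small window, and watch nfsstat / the interface
counters on the client. The path and offsets below are made up.)

  #!/usr/bin/perl
  use strict;
  use warnings;
  use Fcntl qw(SEEK_SET);

  # hypothetical NFS path
  open my $fh, '<:raw', '/net/fileserver/logs/access_log.gz'
    or die "open: $!";
  sysseek $fh, 750_000_000, SEEK_SET or die "seek: $!";
  defined(sysread $fh, my $buf, 65_536) or die "read: $!";
  printf "read %d bytes from ~750MB in\n", length $buf;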
Break the files into smaller chunks and use the filesystem as my
seek()? That's what I've been concluding thus far.
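i.e. pre-chop the logs into line-aligned, individually gzip'd chunks
that each node can grab by name. Something like this (sketch only; the
chunk size and naming are arbitrary):

  #!/usr/bin/perl
  # usage: zcat access_log.gz | perl chop_log.pl
  use strict;
  use warnings;

  my $lines_per_chunk = 500_000;    # assumption -- tune for your cluster
  my ($chunk, $count, $out) = (0, 0, undef);

  while (my $line = <STDIN>) {
    unless ($count++ % $lines_per_chunk) {
      close $out if $out;
      my $name = sprintf 'chunk_%03d.gz', $chunk++;
      open $out, '|-', "gzip -c > $name" or die "gzip: $!";
    }
    print $out $line;
  }
  close $out if $out;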
Alternatively, any thoughts on a chunkable compression format or aspects
of alternate network filesystems that I'm overlooking?
--Eric
--
"Beware of bugs in the above code; I have only proved it correct, not
tried it."
--Donald Knuth
---------------------------------------------------
http://scratchcomputing.com
---------------------------------------------------