[PLUG] seek()able compressed files?

Eric Wilhelm scratchcomputing at gmail.com
Fri Dec 21 01:33:22 UTC 2007


Hi all,

I've been reading about Tim Bray's "Wide Finder" project, where he's 
basically looking at "grep for multicore" in the context of a 
4-million-line Apache logfile (about 1GB uncompressed) on an 8-core 
Sun machine.

  http://www.tbray.org/ongoing/When/200x/2007/09/20/Wide-Finder

Having recently written a clusterable Apache log processing tool, I was 
particularly interested in the conclusion that the 8 cores can process 
the file faster than any disk can deliver it to the CPU.  The catch is 
that the times on his page are based on a hot cache and require 
mmap'ing the entire file.  A hot disk cache is great if you're doing 
benchmarks, but not so great in the real world, where you're stuck 
doing something useful (e.g. processing new files all the time).

Yeah, a RAID1 of 3.0Gb/s SATA disks would theoretically keep the 8 cores 
fed at 0.75GB/s -- just fast enough to keep up with the Perl code, but 
I'm guessing that the mmap has to read the entire file before any work 
can start (or -- generously assuming a lazy/parallel mmap() -- at least 
the rindex() call needs to be looking at the end of the 1GB string 
before the Nth process can get started).
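
For the uncompressed case, I'd expect the Nth worker could skip the 
giant mmap entirely and do something like this (an untested sketch; the 
argument handling and worker count are invented):

  #!/usr/bin/perl
  use strict;
  use warnings;
  use Fcntl qw(SEEK_SET);

  # Each worker owns the lines that *begin* inside its slice of the
  # byte range: seek just before the slice, throw away the line it
  # lands in (the previous worker finishes that one), then work until
  # a line starts past the end of the slice.
  my ($file, $n, $workers) = @ARGV;          # e.g. access_log 3 8
  my $size  = -s $file or die "cannot stat $file: $!";
  my $start = int($size *  $n      / $workers);
  my $end   = int($size * ($n + 1) / $workers);

  open(my $fh, '<', $file) or die "open $file: $!";
  if ($start > 0) {
    seek($fh, $start - 1, SEEK_SET) or die "seek: $!";
    <$fh>;                                   # finish the straddling line
  }
  while (tell($fh) < $end) {
    defined(my $line = <$fh>) or last;
    # ... per-line work goes here ...
  }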

In my own experiments, I found that it is faster to read gzip'd data off 
the disk/NFS and then process the other end of a gunzip pipe (the kernel 
runs the pipe concurrently for free -- yay!)  Particularly when you're 
getting into clustering, the gigabit Ethernet on the fileserver is still 
going to fill up at some point, so slinging 1/12th of the data around 
and carrying a slight gunzip overhead on each node's CPU is a win (even 
given Ethernet trunking at the server, etc.).
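
For reference, the pipe version of the Wide Finder loop looks roughly 
like this (a rough sketch, assuming 'gunzip' is in the PATH and the log 
is a plain .gz; the report format at the end is my own):

  #!/usr/bin/perl
  use strict;
  use warnings;

  # Read a gzip'd apache log through a gunzip pipe; the kernel runs
  # gunzip concurrently with this process, so decompression overlaps
  # with the matching work.
  my $file = shift or die "usage: $0 access_log.gz\n";
  open(my $fh, '-|', 'gunzip', '-c', $file)
    or die "cannot start gunzip: $!";

  my %count;
  while (my $line = <$fh>) {
    # roughly the Wide Finder pattern
    $count{$1}++
      if $line =~ m{GET /ongoing/When/\d\d\dx/(\d\d\d\d/\d\d/\d\d/[^ .]+) };
  }
  close($fh) or die "gunzip failed (status $?)\n";

  for my $k (sort { $count{$b} <=> $count{$a} } keys %count) {
    print "$count{$k}: $k\n";
  }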

Ok, for those who are still with me...  How do I parallelize the gunzip 
or bunzip2 process for a single file?  That is, say I want 4 nodes to 
work on one file -- is it possible (or worth it) to attempt to seek() 
into the compressed file, re-sync the decompression stream, and then 
start working?  From my reading of the gzip and bzip2 format specs, it 
seems that you could conceivably find the start of the Nth compressed 
block and start decompression there -- but the compressed blocks don't 
appear to be labeled in any convenient way (even when the --rsyncable 
option is given to gzip).  Yes, both specs state that the format is not 
intended to be randomly accessible.  My reading also implies that the 
compressed blocks may not even be byte-aligned, at which point I'm 
guessing the whole re-sync scheme would be a no-go due to the overhead 
of all that bit-twiddling (not to mention the 8x increase in the 
feeling-around time).
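
For bzip2, the closest thing I can picture is brute-force scanning for 
the 48-bit block-header magic (0x314159265359) one bit at a time, which 
is roughly what bzip2recover does.  A slow, rough sketch (it assumes a 
perl built with 64-bit integers, and the magic can also show up by 
accident inside compressed data):

  #!/usr/bin/perl
  use strict;
  use warnings;

  # Slide a 48-bit window over the file one bit at a time and report
  # every position where it matches the bzip2 block-header magic.
  # Blocks are bit-aligned, so a byte-wise search would miss roughly
  # 7 out of 8 of them.
  my $file = shift or die "usage: $0 file.bz2\n";
  open(my $fh, '<:raw', $file) or die "open $file: $!";

  my $magic  = 0x314159265359;
  my $mask   = (1 << 48) - 1;
  my ($window, $bitpos) = (0, 0);

  while (read($fh, my $byte, 1)) {
    my $b = ord($byte);
    for my $i (reverse 0 .. 7) {             # most significant bit first
      $window = (($window << 1) | (($b >> $i) & 1)) & $mask;
      $bitpos++;
      printf "possible block start at bit %d (byte %.1f)\n",
        $bitpos - 48, ($bitpos - 48) / 8
          if $window == $magic;
    }
  }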

Does seek() over NFS even gain me anything?  That is, would the node 
still be pulling all of the data up to the seek point or does NFS 
bypass the intermediate data and only ship what gets read() down the 
wire?

Break the files into smaller chunks and use the filesystem as my seek()?  
That's what I've been concluding thus far.
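
At compression time that might look something like this (just a sketch; 
the chunk size and the naming scheme are arbitrary):

  #!/usr/bin/perl
  use strict;
  use warnings;

  # "Use the filesystem as the seek()": read a log on stdin and split
  # it into gzip'd chunks, always breaking on a line boundary so any
  # node can grab one chunk and gunzip it independently.
  my $chunk_lines = 500_000;
  my ($n, $lines, $out) = (0, 0, undef);

  while (my $line = <STDIN>) {
    if (not defined $out) {
      my $name = sprintf 'chunk.%03d.gz', $n++;
      open($out, '|-', "gzip -c > $name") or die "cannot start gzip: $!";
      $lines = 0;
    }
    print {$out} $line;
    if (++$lines >= $chunk_lines) {
      close($out) or die "gzip failed (status $?)\n";
      undef $out;
    }
  }
  close($out) if defined $out;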

Alternatively, any thoughts on a chunkable compression format or aspects 
of alternate network filesystems that I'm overlooking?

--Eric
-- 
"Beware of bugs in the above code; I have only proved it correct, not
tried it."
--Donald Knuth
---------------------------------------------------
    http://scratchcomputing.com
---------------------------------------------------


