[PLUG] Question on efficiently searching large files for a simple text match

Dale Snell ddsnell at frontier.com
Sun Sep 30 01:49:33 UTC 2012


On Sat, 29 Sep 2012 17:37:51 -0700
website reader <website.reader3 at gmail.com> wrote:

> To all:
> 
> This is a question on efficiently searching for text items in a large
> file > 1 gigabyte in size.
> 
> I have a list of about 2 to 5 thousand items where an item is a couple
> of text words such as "Side 2050" and S is always in the starting
> column, and have to search a large file around 24 gigabytes in size
> for these items.  The file is a simple text file, delineated by line
> feeds.
> 
> Using the typical shell grep command script such as:
> 
> grep "Side 2050"
> grep "Side 2061"
> etc.
> 
> results in a very slow execution time, since the 24 gig file has to be
> searched for each line of the script and I am finding this to be very
> laborious and time consuming, not to mention all the hits on the hard
> drive as the script grinds through each line.
> 
> I am aware of combining the grep into a pattern set, but then run into
> command line length limitations.  Is it best to go this way?
> 
> What is the quickest way to do this type of search, when large files (
> > 1 gig ) are involved?
> 
> Thanks for your tips, or suggestions.
> 
> - Randall

How about something like this:

    $ grep -E "^Side" really_big_file

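The caret (^) anchors the search to the beginning of the line, so
grep only has to test the start of each line.  Without the anchor,
grep may have to scan the whole line for the search string, which
takes more time.  Note that this prints every line that begins with
"Side", not just the items on your list, so you would still have to
filter the (much smaller) result against the list.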
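
To pick out only the listed items in a single pass, grep's -f option
might be worth a try: it reads the patterns from a file, one per
line, so the command line length limit doesn't come into it.  Here's
a minimal sketch, not something I've timed on a 24 gig file.  The
names items.txt, patterns.txt and hits.txt are made up, the sample
data is a tiny stand-in, and it assumes the items contain no regex
metacharacters (just letters, digits and spaces):

```shell
# Stand-in data so the sketch runs as-is; all file names are hypothetical.
workdir=$(mktemp -d)
printf '%s\n' 'Side 2050' 'Side 2061' > "$workdir/items.txt"
printf '%s\n' \
    'Side 2050 first record' \
    'Side 2055 not on the list' \
    'Other Side 2061 not at line start' \
    'Side 2061 second record' > "$workdir/really_big_file"

# Turn each item into an anchored pattern: "Side 2050" becomes "^Side 2050".
# Assumption: the items contain no regex metacharacters.
sed 's/^/^/' "$workdir/items.txt" > "$workdir/patterns.txt"

# One pass over the big file, testing each line against all the patterns.
grep -f "$workdir/patterns.txt" "$workdir/really_big_file" > "$workdir/hits.txt"
cat "$workdir/hits.txt"
```

One caveat: the patterns match as prefixes, so an item like
"Side 205" would also pick up lines starting with "Side 2050".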

There are some programs (e.g., Strigi or Hyper Estraier) that are
claimed to be extremely fast at searching text files.  I've never
used any of them, but from what I've been able to see, many of
them are meant to be part of a web site.  They also need to
pre-scan the text files in order to build an index, which probably
takes a while.  OTOH, if you have to scan the same set of files
over and over, the indexing time may be worth paying.  (Strigi in
particular is supposed to have high-speed replacements for find
and grep.)

--Dale

--
"I've measured it from side to side;
'Tis three feet long and two feet wide."
    -- William Wordsworth, describing a pond


