[PLUG] Data extraction

drew wymore drew.wymore at gmail.com
Sun Apr 4 21:31:47 UTC 2010


On Sun, Apr 4, 2010 at 12:31 PM, Michael Rasmussen <michael at jamhome.us> wrote:
>
> On Sun, Apr 04, 2010 at 12:10:03PM -0700, drew wymore wrote:
>> I have a large data set that is being exported from an Oracle DB,
>> unfortunately I can't work with the data directly in Oracle or this
>> wouldn't be a problem. I can export it as CSV and work with it.
>> ... I don't really care which language I
>> do it in and whether I do it directly from csv or a database source
>> other than Oracle (because I can't).
>>
>> Any clue sticks, ideas or links to something that might help me solve
>> this problem appreciated.
>
> With apologies to Randal...
>
> Assume you export to CSV and, for the purposes of this simple example,
> that no text fields have embedded commas.
>
> And if the data of interest is in the third column:
>
>  3,14,word,blah,blech,bz
>  4,18,term,more,stuff
>
> then:
>
>  perl -ne '@F=split /,/; $words{$F[2]}++;
>    END{ foreach $word (sort { $words{$a} <=> $words{$b} } keys %words)
>    { print "$word\t$words{$word}\n"; } } ' file_of_data.csv
>
> Assuming you want it sorted by word frequency.
>
> Disclaimer:  I'm at my in-laws' for Easter dinner and didn't test that.
> I'm reasonably sure that it's close enough that any gaps will serve
> as an exercise for the reader.
>
> --
>      Michael Rasmussen, Portland Oregon
>  Trading kilograms for kilometers since 2003
>    Be appropriate && Follow your curiosity
>          http://www.jamhome.us/
> The Fortune Cookie Fortune today is:
> At once it struck me what quality went to form a man of achievement,
> especially in literature, and which Shakespeare possessed so enormously
> -- I mean negative capability, that is, when a man is capable of being
> in uncertainties, mysteries, doubts, without any irritable reaching
> after fact and reason.
>                -- John Keats
> _______________________________________________
> PLUG mailing list
> PLUG at lists.pdxlinux.org
> http://lists.pdxlinux.org/mailman/listinfo/plug
>


Thanks Rich and Michael. I'll give the perl a shot and see what
happens. As for the data layout: it's 5 columns with roughly 1100
rows. The column I'm interested in has a variable number of words per
entry, but no entry exceeds a couple hundred words.

I did enable full-text searching within MySQL, which works fine for
searching but doesn't give me the flexibility I'm looking for to
just get a count of unique words. I did find something in PHP
that is supposed to work, but it's barfing on the array that's being
returned by the MySQL query.
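
A minimal sketch of the unique-word count described above, written in
Python rather than Perl or PHP. It assumes the text of interest is in the
third column of the CSV export and that a "word" is a run of letters,
digits, or apostrophes, compared case-insensitively. The function name
count_words, the column index, the word pattern, and the sample rows are
all illustrative assumptions, not anything taken from the real data set.

```python
import csv
import io
import re
from collections import Counter


def count_words(csv_file, column=2):
    """Count how often each word appears in one column of a CSV stream."""
    counts = Counter()
    for row in csv.reader(csv_file):
        if len(row) <= column:
            continue  # skip blank or short rows
        # Lowercase the free-text field and split it into words.
        counts.update(re.findall(r"[a-z0-9']+", row[column].lower()))
    return counts


# Hypothetical sample rows standing in for the real export.
sample = io.StringIO(
    '3,14,"the quick fox, the lazy dog",blah,blech\n'
    '4,18,The fox again,more,stuff\n'
)
counts = count_words(sample)

# Most frequent words first, then the number of distinct words.
for word, n in counts.most_common():
    print(f"{word}\t{n}")
print(len(counts), "unique words")
```

Because the csv module handles quoted fields, embedded commas in the text
column should not break the split. For a file on disk, something like
count_words(open(path, newline="")) would be the equivalent call.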

Drew-


