[PLUG] OT: Databases
Curtis Poe
cp at onsitetech.com
Thu Sep 19 22:04:30 UTC 2002
> On Wed, 18 Sep 2002, Jeremy Bowen wrote:
> This is probably more of a datatbase question than a Linux question.
>
> I need someone who can show me how to do the following (consultant or just
> someone good with databases and programs):
>
> List A (45,000 records) has: Name, Address, City, State, Zip
>
> List B (100,000) has: Name, Address, City, State, Zip, Additional Data,
> Additional Data.......
>
> I need to compare the two lists and MARK the records in List A that exist
in
> list B (They need to be marked in List A). Using Linux would be great but
I
> don't have a functioning Linux box right now and I need the data on the
> Windows box because the software the will ultimately use the list has not
> Linux alternative.
I think a database is likely overkill for this. Let's assume for a moment
the following facts:
1. The lists are in CSV (comma separated value) flat files.
2. The data is rather dirty (arbitrary spacing and caPitaliZation), but
none is missing.
The strategy I would try is:
1. Remove all non-word characters, leaving only letters, numbers, and
underscores.
2. Lower case all letters.
3. Join the fields in List A into one field and make them hash keys.
4. Iterate over List B and check for existence in List A. If it exists,
stuff into an array.
The following quick, dirty, and untested code (in Perl) should give you an
idea of what I mean:
#!/usr/bin/perl -w
use strict;
use Text::CSV_XS;
my $csv = Text::CVS_XS->new;
my ( $list_a, $list_b ) = ('list_a.txt','list_b.txt');
my %list_a;
open LIST_A, "< $list_a" or die $!;
while (<LIST_A>) {
next if ! $_; # skip empty lines
if ( $csv->parse($_) ) {
my $data = join '', $csv->fields;
$data =~ s/\W//g; # that's a capital W
$data = lc $data;
$list_a{ $data } = 1;
}
else {
warn "List_a: could not parse " . $csv->error_input;
}
}
close LIST_A;
my @match;
open LIST_B, "< $list_b" or die $!;
while( <LIST_B> ) {
next if ! $_;
if ( $csv->parse($_) ) {
my $data = join '', $csv->fields;
$data =~ s/\W//g;
$data = lc $data;
push @match => [$csv->fields] if exists $list_a{ $data };
}
else {
warn "List_b: could not parse " . $csv->error_input;
}
}
close LIST_B;
# @match now has a list of matching records (an array of arrayrefs)
Let me know if you have any questions.
--
Cheers,
Curtis Poe
Senior Programmer
ONSITE! Technology, Inc.
www.onsitetech.com
503-233-1418
Taking e-Business and Internet Technology To The Extreme!
More information about the PLUG
mailing list