[PLUG] OT: Databases

Curtis Poe cp at onsitetech.com
Thu Sep 19 22:04:30 UTC 2002


> On Wed, 18 Sep 2002, Jeremy Bowen wrote:
> This is probably more of a datatbase question than a Linux question.
>
> I need someone who can show me how to do the following (consultant or just
> someone good with databases and programs):
>
> List A (45,000 records) has: Name, Address, City, State, Zip
>
> List B (100,000) has: Name, Address, City, State, Zip, Additional Data,
> Additional Data.......
>
> I need to compare the two lists and MARK the records in List A that exist
in
> list B (They need to be marked in List A). Using Linux would be great but
I
> don't have a functioning Linux box right now and I need the data on the
> Windows box because the software the will ultimately use the list has not
> Linux alternative.

I think a database is likely overkill for this.  Let's assume for a moment
the following facts:

1.  The lists are in CSV (comma separated value) flat files.
2.  The data is rather dirty (arbitrary    spacing and caPitaliZation), but
none is missing.

The strategy I would try is:

1.  Remove all non-word characters, leaving only letters, numbers, and
underscores.
2.  Lower case all letters.
3.  Join the fields in List A into one field and make them hash keys.
4.  Iterate over List B and check for existence in List A.  If it exists,
stuff into an array.

The following quick, dirty, and untested code (in Perl) should give you an
idea of what I mean:

#!/usr/bin/perl -w
use strict;
use Text::CSV_XS;
my $csv = Text::CVS_XS->new;

my ( $list_a, $list_b ) = ('list_a.txt','list_b.txt');
my %list_a;
open LIST_A, "< $list_a" or die $!;
while (<LIST_A>) {
    next if ! $_; # skip empty lines
    if ( $csv->parse($_) ) {
        my $data = join '', $csv->fields;
        $data =~ s/\W//g; # that's a capital W
        $data = lc $data;
        $list_a{ $data } = 1;
    }
    else {
      warn "List_a: could not parse " . $csv->error_input;
    }
}
close LIST_A;

my @match;

open LIST_B, "< $list_b" or die $!;
while( <LIST_B> ) {
    next if ! $_;
    if ( $csv->parse($_) ) {
        my $data = join '', $csv->fields;
        $data =~ s/\W//g;
        $data = lc $data;
        push @match => [$csv->fields] if exists $list_a{ $data };
    }
    else {
        warn "List_b: could not parse " . $csv->error_input;
    }
}
close LIST_B;
# @match now has a list of matching records (an array of arrayrefs)

Let me know if you have any questions.

--
Cheers,
Curtis Poe
Senior Programmer
ONSITE! Technology, Inc.
www.onsitetech.com
503-233-1418

Taking e-Business and Internet Technology To The Extreme!





More information about the PLUG mailing list