[PLUG] Removing Duplicate Rows from SQL Dump

Rich Shepard rshepard at appl-ecosys.com
Tue Aug 16 03:09:53 UTC 2011


   My brain's stuck trying to figure out a way to automate removing
duplicate (and triplicate) rows from a database table that's been exported
as a .sql file.

   Here's the situation: the file has 12.5K rows of duplicate and triplicate
data. Duplicate rows have the same values in the loc_name, sample_date, and
param columns, but the quant column may differ; the other columns from the
table are not relevant here.

   I thought I had a Python solution, but it's not working. There must be an
awk solution for this, but I haven't found it yet. Of course, you're all
welcome to submit solutions in whatever tool or language you like, because I
don't want to spend my life going through all these rows by hand. Here are
some sample rows:

\N	CIL	2007-08-29	Manganese	0.00499999989	mg/L	\N	\N	\N	\N
\N	CIL	2007-08-29	Manganese	0.00499999989	mg/L	\N	\N	\N	\N
\N	CIL	2007-08-29	Mercury	0.846000016	mg/L	\N	\N	\N	\N
\N	CIL	2007-08-29	Mercury	0.846000016	mg/L	\N	\N	\N	\N
\N	CIL	2007-08-29	Molybdenum	1.88999999	mg/L	\N	\N	\N	\N
\N	CIL	2007-08-29	Molybdenum	1.88999999	mg/L	\N	\N	\N	\N
\N	CIL	2007-08-29	Nickel	0.140000001	mg/L	\N	\N	\N	\N
\N	CIL	2007-08-29	Nickel	0.140000001	mg/L	\N	\N	\N	\N

   Where values differ, we'll use the first one, since there's no way to
determine which one is actually correct (if either). After all, there should
be only one value for a measured parameter at a specific location on a given
date.
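
   For reference, this is the kind of approach I had in mind: a minimal
Python sketch that keeps the first row seen for each (loc_name, sample_date,
param) key and drops the later duplicates. It assumes the dump is
tab-separated with one record per line, and that loc_name, sample_date, and
param are the 2nd-4th columns (those positions are guessed from the samples
above; adjust KEY_COLS if the real layout differs).

#!/usr/bin/env python
# Minimal dedup sketch, assuming a tab-separated dump with one record per
# line, and assuming loc_name, sample_date, and param are the 2nd-4th
# columns (adjust KEY_COLS if the real layout differs).  Keeps the first
# row seen for each key and drops later duplicates/triplicates.

import sys

KEY_COLS = (1, 2, 3)    # 0-based indices of loc_name, sample_date, param

def dedup(src, dst):
    seen = set()
    for line in src:
        fields = line.rstrip('\n').split('\t')
        key = tuple(fields[i] for i in KEY_COLS)
        if key in seen:
            continue        # keep only the first value for each key
        seen.add(key)
        dst.write(line)

if __name__ == '__main__':
    with open(sys.argv[1]) as src, open(sys.argv[2], 'w') as dst:
        dedup(src, dst)

Invocation would be something like: python dedup_rows.py dump.sql
dump_dedup.sql (the script and file names are just placeholders).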

TIA,

Rich
