[PLUG] gawk: modify field contents

Pete Lancashire pete at petelancashire.com
Wed Jul 8 17:01:26 UTC 2015


A couple questions

Floating point numbers without a leading '<' (e.g., ',0.01,,') are written
to the output file and, if there is a blank field immediately following,
insert a zero (0) in that following field.

What are the FP fields ?

If a FP field and the following field is empty for example 123.4,<blank>
change this to 123.4,0

Sample 10321000__1981-09-17

The field NH4 is 0.13 and NO2 is empty. This should be translated to 0.13,0


Issue: Not knowing which are the FP fields, one would not know which
following fields need to be changed

Floating point numbers with a leading less-than symbol and space, '< ',
have that stripped off leaving the number in the field. The following blank
field has a one (1) inserted.

same sample

The field Co is < space 3 and ceneq1 is empty the translation should be 3,1


See above, need to know which are the FP fields


To make the program generic, floating point numbers with less-than symbols

but no following space, and those with a leading minus sign (-) also have
those symbols stripped off and a 1 inserted in the following field.

ditto

All fields with 'M' and a variable amount of white space have the noise
removed so it's a blank field written to the output file.

If contains a M change the fields to empty


The hardest part it knowing which are the FP fields. If you restrict
yourself to using
regex's it could be done but you would end up with something like

   <regex>{3}; <regex>{6}, <regex>{22}

If I was doing this and had a list of the fields I'd do the RTL process
either in Perl (I've not used python)
where one would have an array of which fields are FP something like
(0,0,0,1,0,0,1,1,0,....) where 1 is FP
and then read a line split into an array, loop through each fields if the
index of the 'if fp' array ='s 1 then
with a switch/case (makes it easy to add more logic) do what you want.

-pete back to work










On Wed, Jul 8, 2015 at 9:28 AM, Pete Lancashire <pete at petelancashire.com>
wrote:

> I did a cut/past of the data but have a question
>
> the 9th row does not match any of the others is this correct ?  If not,
> can you attached the sample data as a file vs. pasting it
>
> site.code sampdat    tmp.h2o tmp.air disc.cfs Turb
> 10321000  1965-10-01                 98
> 10321000  1965-11-01                 141
> 10321000  1965-12-01                 128
> 10321000  1979-07-30 26.00           80       10.0
> 10321000  1979-08-29 21.50           29
> 10321000  1979-08-29 21.50           19       13.0
> 10321000  1979-10-26 12.50           42       6.8
> .9        13         29      0.2     0        29
> 10321000  1980-06-23 20.50           2070     23.0
> 10321000  1980-07-29 24.00           203      11.0
> 10321000  1980-08-25 19.50           49       11.0
> 10321000  1980-09-30 17.00           47       6.0
> 10321000  1980-11-21 5.00            108      9.5
>
>
>
>
> On Tue, Jul 7, 2015 at 4:41 PM, Rich Shepard <rshepard at appl-ecosys.com>
> wrote:
>
>> On Tue, 7 Jul 2015, Larry Brigman wrote:
>>
>> > Messed that one up.... Sample data needed/wanted to verify proper
>> operation.
>>
>> Larry, et al.:
>>
>>    Here's what I want to accomplish. Don't care if the tool is written in
>> awk, sed, python, or a bash shell script; I want to save days and weeks of
>> tedious data cleaning exported from sloppy Excel spreadsheets.
>>
>>    The following 20 lines of data are for testing while developing a data
>> cleaning program. The first line (the header) is written, un-molested, to
>> the output file.
>>
>>    All strings and integer fields are written to the output file.
>>
>>    Floating point numbers without a leading '<' (e.g., ',0.01,,') are
>> written
>> to the output file and, if there is a blank field immediately following,
>> insert a zero (0) in that following field.
>>
>>    Floating point numbers with a leading less-than symbol and space, '< ',
>> have that stripped off leaving the number in the field. The following
>> blank
>> field has a one (1) inserted.
>>
>>    To make the program generic, floating point numbers with less-than
>> symbols
>> but no following space, and thosse with a leading minus sign (-) also have
>> those symbols stripped off and a 1 inserted in the following field.
>>
>>    All fields with 'M' and a variable amount of white space have the noise
>> removed so it's a blank field written to the output file.
>>
>> ----------------------------------------------------
>>
>> site.code,sampdat,tmp.h2o,tmp.air,disc.cfs,Turb,Cond,DO,pH,disc.cms,Alk,hard,TDS.mgl,TDS.tpd,TSS,ANC,HCO3,CO3,N.tot,N.org,NH4,NO2,NO3,o-PO3,o-P,org.C,CN,ceneq1,Ca,Mg,Na,K,Cl,SO4,F,ceneq1,Si,As,Ba,ceneq1,B,ceneq1,Ca,ceneq1,Cr,ceneq1,Co,ceneq1,Cu,ceneq1,F,ceneq1,Pb,ceneq1,Mn,ceneq1,Mo,ceneq1,Ni,ceneq1,Ag,ceneq1,Sr,ceneq1,V,ceneq1,Zn,ceneq1,Al,ceneq1,Li,ceneq1,Se,ceneq1
>>
>> 10321000,1965-10-01,,,98,,490,,8.4,2.80,,150,,85.60,,,238,4,,,,,,,,,,,40,12,46,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,
>>
>> 10321000,1965-11-01,,,141,,507,,8.3,4.00,,170,,122.00,,213,252,4,,,,,,,,,,,50,11,42,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,
>>
>> 10321000,1965-12-01,,,128,,564,,8.1,3.60,,190,359,129.00,,238,290,0,,,,,,,,,,,56,13,47,6.4,16,37,0.5,0,40,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,
>> 10321000,1979-07-30,26.00,,80,10.0,458,7.50,8.8,2.30,,160,299,61.30,83,235,280,3,0.66,0.62,0.01,,,,,,,,44,11,35,7.1,11,24,0.4,0,25,8.00,90.00,0,,,<
>> 2,,,,< 3,,M ,,< 10,,U ,,< 1,,,,,,U ,,,,,,< 3,,,,,,< 1,
>>
>> 10321000,1979-08-29,21.50,,29,,518,,,0.82,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,
>>
>> 10321000,1979-08-29,21.50,,19,13.0,509,8.50,8.4,0.54,,160,201,11.60,78,185,220,3,0.69,0.42,0.01,,,,,5.4,,,46,11,48,9.6,21,39,0.6,0,26,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,
>> 10321000,1979-10-26,12.50,,42,6.8,610,11.00,8.9,1.20,,190,395,42.50,,259,230,42,0.81,0.49,0.09,,,,,,,,54,14,55,11,27,48,0.6,0,30,10.00,100.00,0,,,M
>> ,,0.00,,< 3,,M ,,< 10,,0,,M ,,,,M ,,0,,,,,,< 3,,,,,,0,
>> .9,13,29,0.2,0,29,7.00,90.00,0,,,< 1,,0.00,,<
>> 3,,10,,20,,0,,10,,,,0,,0,,,,,,M ,,,,,,0,
>>
>> 10321000,1980-06-23,20.50,,2070,23.0,362,6.70,,59.00,,140,,1370.00,,,,,,,0.04,,,,,8.9,,,42,9.2,24,5.2,7.9,11,0.3,0,32,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,
>> 10321000,1980-07-29,24.00,,203,11.0,427,8.10,8.2,5.70,,160,257,147.00,,180,230,0,0.63,0.5,0.01,,,,,,,,46,9.9,31,5.9,10,20,0.5,0,25,7.00,90.00,0,,,M
>> ,,0.00,,< 3,,M ,,40,,M ,,20,,,,0,,0,,,,,,10,,,,,,0,
>>
>> 10321000,1980-08-25,19.50,,49,11.0,545,9.20,7.7,1.40,,170,342,45.60,,216,260,0,1.4,1.1,0,,,,,10,,,50,12,47,8.2,20,40,0.8,0,26,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,
>>
>> 10321000,1980-09-30,17.00,,47,6.0,530,11.00,8.4,1.30,,170,336,40.00,,223,260,6,0.82,0.74,0,,,,,4.4,,,47,12,50,8.8,23,36,0.6,0,25,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,
>> 10321000,1980-11-21,5.00,,108,9.5,518,12.20,8.5,3.10,,180,346,99.10,,190,220,8,0.6,0.46,0.02,,,,,,,,52,12,45,8.5,21,52,0.5,0,28,7.00,100.00,0,,,<
>> 1,,0.00,,< 3,,M ,,10,,0,,10,,,,0,,0,,,,,,< 3,,,,,,0,
>> 10321000,1981-01-28,3.50,,127,14.0,510,12.00,8.7,3.60,,170,355,113.00,,240,270,19,,,0.05,,,,,,,,50,12,46,6.5,19,38,0.5,0,30,8.00,100.00,0,,,<
>> 1,,0.00,,< 3,,0,,30,,M ,,20,,,,0,,0,,,,,,10,,,,,,0,
>>
>> 10321000,1981-03-25,9.50,,110,24.0,589,11.80,8.8,3.10,,200,364,110.00,,240,250,17,0.55,0.51,0.05,,,,,4.9,,,56,14,53,6.8,20,47,0.5,0,28,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,
>> 10321000,1981-05-20,13.00,,117,16.0,475,,8.6,3.30,,180,,91.60,,200,230,7,1,0.94,0.08,,,,,,,,55,10,36,7.7,14,,0.5,0,27,9.00,,,,,<
>> 1,,0.00,,< 3,,M ,,20,,M ,,20,,,,M ,,0,,,,,,,,7,,,,0,
>> 10321000,1981-07-21,25.50,,6,8.8,602,9.20,8.6,0.16,,200,382,5.77,,240,270,12,1,0.71,0.13,,,,,,,,54,15,55,10,25,43,0.5,0,34,14.00,120.00,0,,,<
>> 1,,,,< 3,,M ,,10,,M ,,10,,,,M ,,0,,,,,,M ,,,,,,0,
>> 10321000,1981-09-17,21.50,,10,3.0,487,10.20,8.1,0.27,,170,292,7.39,,200,250,0,1.2,0.97,0.13,,,,,,,,44,14,37,8,15,25,0.6,0,24,9.00,100.00,0,,,<
>> 1,,0.00,,< 3,,M ,,10,,M ,,20,,,,M ,,0,,,,,,10,,,,,,1,
>> 10321000,1981-12-04,3.00,,23,3.5,630,13.00,8.8,0.65,,210,406,24.30,,240,260,15,0.57,0.38,0.09,0.02,0.1,,,,,,62,13,58,9.5,33,56,0.6,0,30,9.00,100.00,0,,,<
>> 1,,< 10,,< 3,,< 1,,30,,< 1,,20,,,,M ,,< 1,,,,,,< 3,,,,,,< 1,
>> ----------------------------------------------------
>> TIA,
>>
>> Rich
>> _______________________________________________
>> PLUG mailing list
>> PLUG at lists.pdxlinux.org
>> http://lists.pdxlinux.org/mailman/listinfo/plug
>>
>
>



More information about the PLUG mailing list