[PLUG] Translating ^M to \n [WORKING]

Robert Citek robert.citek at gmail.com
Wed Aug 14 03:25:46 UTC 2019


On Tue, Aug 13, 2019 at 10:47 AM Rich Shepard <rshepard at appl-ecosys.com>
wrote:

> On Tue, 13 Aug 2019, Robert Citek wrote:
>
> > Sounds like you used Emacs to do the equivalent of this:
> >
> > < hatchery_returns-2019-08-12.csv \
> > tr -s '\r\n' '\n' |
> > sed -e 's/, /,/g;s/,$//' \
> > > hatchery_returns-2019-08-12.cleaned.csv
> >
> > Is that right?
>
> Robert,
>
> Nope.
>
> On the command line I ran:
>
> dd if=<infile> bs=1 | tr '\r' '\n' > <outfile>
>
> Then I put the outfile in an emacs buffer. No space at the beginning of the
> file. Then I cleaned it by removing extraneous spaces and removing the
> terminal comma when there were values for the last field in the line.
>

Interesting.  I did a histogram on the number of fields.  Is it expected
that the number of fields is not consistent across all records?

$ cat hatchery_returns-2019-08-12.csv | tr '\r' '\n' | awk -F, '{print NF}' |
    sort | uniq -c
   2 0
 100 41
 100 53
10599 93
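
To see which records have the odd field counts, something along these lines
might help. It is only a sketch: field_histogram and odd_records are names I
made up for illustration, and the split on "," is naive, so it assumes no
quoted field contains a comma.

```shell
# Hypothetical helper: histogram of field counts for CSV text on stdin.
# Output format is "<number of records> <number of fields>", like uniq -c.
# Assumption: fields never contain embedded commas (naive split on ",").
field_histogram() {
    tr '\r' '\n' |
        awk -F, '{ count[NF]++ } END { for (n in count) print count[n], n }' |
        sort -k2,2n
}

# Hypothetical helper: list "line number: field count" for every record
# whose field count differs from the expected count given as $1.
odd_records() {
    tr '\r' '\n' |
        awk -F, -v want="$1" 'NF != want { print NR ": " NF }'
}
```

For example, `< hatchery_returns-2019-08-12.csv odd_records 93 | head` would
show the first few records that do not have 93 fields.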

FWIW, cat is much faster than dd with bs=1:

$ dd if=hatchery_returns-2019-08-12.csv bs=1 | tr '\r' '\n' | md5
12746089+0 records in
12746089+0 records out
12746089 bytes transferred in 37.538310 secs (339549 bytes/sec)
f5450d6738a7d3242700a003266b03e0

$ time -p cat hatchery_returns-2019-08-12.csv | tr '\r' '\n' | md5
f5450d6738a7d3242700a003266b03e0
real 1.21
user 1.24
sys 0.03

Or did you mean to write bs=1m (bs=1M with GNU dd)?

$ dd if=hatchery_returns-2019-08-12.csv bs=1m | tr '\r' '\n' | md5
12+1 records in
12+1 records out
12746089 bytes transferred in 1.227313 secs (10385361 bytes/sec)
f5450d6738a7d3242700a003266b03e0

Although I'm wondering why use dd (or even cat) at all, since the shell can
redirect the file straight into tr:

$ time -p < hatchery_returns-2019-08-12.csv tr '\r' '\n' | md5
f5450d6738a7d3242700a003266b03e0
real 1.20
user 1.23
sys 0.01
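
If you want to convince yourself that the three ways of feeding tr give the
same bytes, here is a rough check on a tiny made-up sample rather than the
hatchery file. I use cksum only because it is in POSIX, whereas md5 and
md5sum vary by platform; the variable names are just for illustration.

```shell
# Build a small CR-terminated sample (made-up data, not the hatchery file).
sample=$(mktemp)
printf 'a, b,\rc, d,\r' > "$sample"

# Three ways to feed the same file to tr. With no file argument, cksum
# prints only the checksum and byte count, so the results compare directly.
via_dd=$(dd if="$sample" bs=1 2>/dev/null | tr '\r' '\n' | cksum)
via_cat=$(cat "$sample" | tr '\r' '\n' | cksum)
via_redirect=$(< "$sample" tr '\r' '\n' | cksum)

# All three lines should be identical.
echo "$via_dd"
echo "$via_cat"
echo "$via_redirect"

rm -f "$sample"
```

The redirect form does the same work with one process fewer, which is why
it is the one I would reach for.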

Regards,
- Robert


