[PLUG] Translating ^M to \n [WORKING]
Robert Citek
robert.citek at gmail.com
Wed Aug 14 03:25:46 UTC 2019
On Tue, Aug 13, 2019 at 10:47 AM Rich Shepard <rshepard at appl-ecosys.com>
wrote:
> On Tue, 13 Aug 2019, Robert Citek wrote:
>
> > Sounds like you used Emacs to do the equivalent of this:
> >
> > < hatchery_returns-2019-08-12.csv \
> > tr -s '\r\n' '\n' |
> > sed -e 's/, /,/g;s/,$//' \
> > > hatchery_returns-2019-08-12.cleaned.csv
> >
> > Is that right?
>
> Robert,
>
> Nope.
>
> On the command line I ran:
>
> dd if=<infile> bs=1 | tr '\r' '\n' > <outfile>
>
> Then I put the outfile in an emacs buffer. No space at the beginning of the
> file. Then I cleaned it by removing extraneous spaces and removing the
> terminal comma when there were values for the last field in the line.
>
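For reference, the cleanup described above (translate the carriage returns, drop the space after each comma, strip a trailing comma) could be sketched as a small shell function. This is only a sketch under assumptions: clean_csv and the sample data are made up for illustration, and it uses tr -s so that a CR LF pair collapses to a single newline, which also drops empty lines.

```shell
# Hypothetical helper: normalize line endings and tidy up the commas.
clean_csv() {
    # tr -s turns each CR into LF and squeezes runs of LF into one,
    # so CR, LF, and CR LF line endings all end up as a single newline.
    tr -s '\r' '\n' < "$1" |
    sed -e 's/, /,/g' -e 's/,$//'
}

# Made-up sample with CR LF line endings, stray spaces, and a trailing comma.
sample=$(mktemp)
printf 'a, b,c,\r\nd,e, f\r\n' > "$sample"
clean_csv "$sample"
rm -f "$sample"
```

Whether stripping every trailing comma is right depends on whether an empty last field is meaningful in the data, so treat the second sed expression as optional.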
Interesting. I did a histogram on the number of fields. Is it expected
that the number of fields is not consistent across all records?
$ cat hatchery_returns-2019-08-12.csv | tr '\r' '\n' | awk -F, '{print NF}' | sort | uniq -c
      2 0
    100 41
    100 53
  10599 93
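To make the histogram above easier to reproduce, here is a minimal sketch on a tiny made-up sample. The function name field_histogram and the sample records are assumptions for illustration, not the hatchery data. It also shows where a field count of 0 comes from: an empty record.

```shell
# Hypothetical helper: tally how many records have each field count.
field_histogram() {
    tr '\r' '\n' < "$1" |      # CR-terminated records become lines
    awk -F, '{print NF}' |     # number of comma-separated fields per line
    sort -n |                  # group equal counts together
    uniq -c                    # count each group
}

# Made-up sample: four CR-terminated records, one of them empty.
sample=$(mktemp)
printf 'a,b,c\r\rd,e\rf,g,h\r' > "$sample"
field_histogram "$sample"
rm -f "$sample"
```

On this sample the tally is one record with 0 fields, one with 2, and two with 3; the column padding that uniq -c prints varies between systems.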
FWIW, cat is much faster than dd with bs=1:
$ dd if=hatchery_returns-2019-08-12.csv bs=1 | tr '\r' '\n' | md5
12746089+0 records in
12746089+0 records out
12746089 bytes transferred in 37.538310 secs (339549 bytes/sec)
f5450d6738a7d3242700a003266b03e0
$ time -p cat hatchery_returns-2019-08-12.csv | tr '\r' '\n' | md5
f5450d6738a7d3242700a003266b03e0
real 1.21
user 1.24
sys 0.03
Or did you mean to write bs=1m?
$ dd if=hatchery_returns-2019-08-12.csv bs=1m | tr '\r' '\n' | md5
12+1 records in
12+1 records out
12746089 bytes transferred in 1.227313 secs (10385361 bytes/sec)
f5450d6738a7d3242700a003266b03e0
Although I'm wondering why use dd (or even cat) at all, since the shell can redirect the file straight into tr:
$ time -p < hatchery_returns-2019-08-12.csv tr '\r' '\n' | md5
f5450d6738a7d3242700a003266b03e0
real 1.20
user 1.23
sys 0.01
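The comparison above boils down to three ways of feeding the same bytes to tr. A minimal sketch that checks they agree, on a made-up sample file; it uses cksum rather than md5 only because cksum is available on both BSD and Linux, and the variable names are arbitrary:

```shell
# Made-up sample with CR line endings.
sample=$(mktemp)
printf 'a,b\rc,d\r' > "$sample"

# Same translation, three ways of reading the input.
via_dd=$(dd if="$sample" bs=1 2>/dev/null | tr '\r' '\n' | cksum)
via_cat=$(cat "$sample" | tr '\r' '\n' | cksum)
via_redirect=$(tr '\r' '\n' < "$sample" | cksum)

if [ "$via_dd" = "$via_cat" ] && [ "$via_cat" = "$via_redirect" ]; then
    echo "all three pipelines produce identical output"
else
    echo "outputs differ"
fi
rm -f "$sample"
```

The redirect form starts one process fewer than the cat form, and dd with bs=1 issues one read and one write per byte, which would account for the roughly 30x difference in the timings above.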
Regards,
- Robert