[PLUG] Trying to avoid the ls limit
Martin A. Brown
martin at linux-ip.net
Sun Aug 3 23:08:13 UTC 2014
Hello Randall,
> I just now ran into the wall while processing files in a long
> simulation run, there are over 50,000 files in one directory and
> now the bash shell expansion wild card character * is expanding
> command line arguments and then the infamous "too many arguments"
> message is given.
> This is particularly bad while trying the ls command: "ls *file*"
> (I apparently understand that there is a limitation in the
> readdir() command buffer size)
OK, so you are doing something like this:
$SOME_CMD -- $( ls -- *file* )
And getting back from Linux the dreaded E2BIG error, translated by
the bash shell to "Argument list too long".  Here's how I simulated
your problem:
md5sum -- $( find / -type f 2>/dev/null )
bash: /usr/bin/md5sum: Argument list too long
If you'd like to know more about why it is this way, you probably
want to go read up on ARG_MAX [0] and maybe find the spot in the XSH
section of the Single UNIX Specification which details the behaviour
of exec().  Under Linux, the glibc package provides a command called
'getconf' which will tell you the settings of a variety of system
runtime parameters.  So, you can find out what that value is with:
getconf -a
getconf ARG_MAX
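For a rough sense of scale, you can compare that limit against the
bytes a glob expansion would consume (a sketch, assuming a GNU
userland; the exact accounting also includes the environment and
per-argument pointer overhead, so treat it as an estimate):

```shell
# Sketch: compare the kernel's limit to what a glob would consume.
getconf ARG_MAX            # bytes available for argv plus environ
printf '%s\0' * | wc -c    # bytes the current directory's names occupy
```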
Whenever I run into E2BIG (which is a rarity, now that I have been
hornswoggled by it a few times over the years), I have taken some
refuge in the xargs command, which was expressly written to deal
with this sort of situation. Let me illustrate:
find / -type f | xargs -- md5sum # -- see quoting warnings below
The find command now produces a stream of output which xargs
reads.  The glorious xargs utility turns that stream into a sequence
of calls to 'exec' without falling afoul of E2BIG.  I use this
technique all of the time, now, even when the number of files may be
small. Why? Because now, I don't have to think "will this work,
even if the number of files is large?" It simply will.
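If you want to see xargs carving a stream into several exec() calls,
here is a toy illustration (the -n flag caps the arguments per
invocation just to make the batching visible; normally xargs picks a
large batch size on its own):

```shell
# Toy sketch: force small batches so the multiple exec()s are visible.
seq 1 10 | xargs -n 4 echo
# echo runs three times: with 1..4, with 5..8, then with 9 10.
```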
I will add a comment on quoting here.  The above command has a nasty
gotcha.  Many systems will have files with special characters in
their names, and these can trip up the default word-splitting that
xargs performs when building the argument list for $CMD (md5sum,
above).  As a result, I would recommend the following command, which
eliminates the quoting concern:
find / -type f -print0 | xargs --null --no-run-if-empty -- md5sum --
This is almost identical, but avoids the issues with special
characters (such as spaces or newlines) in filenames which plague
shell programming.
> I did find a c program under getdents(2) which gets around this
> problem of listing lots and lots of files in one directory, but found
> out that I can use a certain find command:
>
> find . -maxdepth 1 -type f {parameter_here} -print
>
> which will take the wild card character * okay when manually
> entered. But when I attempt to use something simple like
>
> lsb *parameter*
>
> where lsb is a bash shell script then the infamous "too many
> arguments" error shows up again.
>
> I can toggle the bash "set -f" for turning off wildcard expansion,
> but I really need to toggle this off to get the command line
> parameter *parameter* without expansion, then drop it inside a
> simple bash script, then turn it on to execute the find line like
> above.
>
> Right now bash is expanding * in command parameter #1 before
> dropping it into the bash script.
>
> Any ideas on how to do this?  I would like to try to avoid
>
> set -f ; lsb *filename* ; set +f
>
> if possible
I would not choose to 'set' and re-'set' shell options as you are
considering, but it is certainly an option.  (I worry always about
not getting it correct, and re-enabling the option in all of the
right places.)
I would choose something like this, assuming a CPU-bound job:
find "$DIR" -type f -name '*parameter*' -print0 \
    | xargs --null --no-run-if-empty --max-procs "$CPU_COUNT" -- \
        $CPU_HEAVY_JOB --
One final note of warning....if you study the above command and know
what's happening on the system, you can also see that the 'find'
command occurs potentially long before the $CPU_HEAVY_JOB occurs.
This means that it's eminently possible for files discovered by
'find' to be missing by the time that they are encountered in
processing by $CPU_HEAVY_JOB.
Finally, I would agree with the comment by Wes....why so many files
in a single directory? I, too, would recommend changing that, if
you can. There are some filesystems which exhibit deteriorating
performance with large numbers of files in a single directory, never
mind the deteriorating ability of fragile-minded humans to be able
to understand what's stored in that directory.
Anyway, the net upshot is, if you encounter the E2BIG error
'Argument list too long', it is probably time to rewrite that section
of your shell (or whatever) program to perform the exec actions
either in a loop or using xargs.
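The loop variant can be sketched like so (bash-specific, using
read -d '' to honour the NUL delimiters from -print0):

```shell
#!/bin/bash
# Sketch of the loop approach: one exec per file, so ARG_MAX never
# enters the picture.  Slower than xargs batching, but simple.
find . -maxdepth 1 -type f -print0 |
while IFS= read -r -d '' f; do
    md5sum -- "$f"
done
```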
-Martin
[0] ARG_MAX, that wily beast, in a few manpages and online pages
http://www.in-ulm.de/~mascheck/various/argmax/
man 2 execve
--
Martin A. Brown
http://linux-ip.net/