[PLUG] Trying to avoid the ls limit

Martin A. Brown martin at linux-ip.net
Sun Aug 3 23:08:13 UTC 2014


Hello Randall,

> I just now ran into the wall while processing files in a long 
> simulation run, there are over 50,000 files in one directory and 
> now the bash shell expansion wild card character * is expanding 
> command line arguments and then the infamous "too many arguments" 
> message is given.

> This is particularly bad while trying the ls command: "ls *file*" 
> (I apparently understand that there is a limitation in the 
> readdir() command buffer size)

OK, so you are doing something like this:

   $SOME_CMD -- $( ls -- *file* )

And getting back from Linux the dreaded E2BIG error, translated by 
the bash shell to "Argument list too long".  Here's how I simulated 
your problem:

   md5sum --  $( find / -type f 2>/dev/null )
   bash: /usr/bin/md5sum: Argument list too long

If you'd like to know more about why it is this way, you probably 
want to go read up on ARG_MAX [0] and maybe find the spot in the XSH 
section of the Single UNIX Specification which details the behaviour 
of exec().  Under Linux, the glibc package has a command called 
'getconf' which will tell you the settings of a variety of system 
runtime parameters.  So, you can find out what that value is with:

   getconf -a
   getconf ARG_MAX
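
If you want to see the limit in action, something like the following 
should trip it (purely illustrative; the exact count needed depends 
on your ARG_MAX value and on the size of your environment):

   /bin/true $(seq 1 500000)
   bash: /bin/true: Argument list too long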

Whenever I run into E2BIG (which is a rarity, now that I have been 
hornswoggled by it a few times over the years), I have taken some
refuge in the xargs command, which was expressly written to deal 
with this sort of situation.  Let me illustrate:

   find / -type f | xargs -- md5sum   # -- see quoting warnings below

The find command now produces a stream of output which xargs 
reads.  The glorious xargs utility turns that stream into a 
sequence of calls to 'exec' without falling afoul of E2BIG.  I use 
this technique all of the time now, even when the number of files 
may be small.  Why?  Because now I don't have to think "will this 
work, even if the number of files is large?"  It simply will.
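
You can watch xargs do its batching with a toy example.  Each line 
of output below corresponds to one invocation of echo, so a count 
greater than one means xargs split the input across several execs 
(the numbers here are illustrative and will vary with your limits):

   seq 1 200000 | xargs echo | wc -l           # more than 1: several execs
   seq 1 200000 | xargs -n 1000 echo | wc -l   # or pick the batch size yourself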

I will add a comment on quoting here.  The above command has a nasty 
gotcha.  Many systems will create files with special characters in 
their names, and these can trip up the default word-splitting that 
xargs performs when it builds the argument list for $CMD (md5sum, 
above).  As a result, I would recommend the following command, which 
eliminates the quoting concern:

   find / -type f -print0 | xargs --null --no-run-if-empty -- md5sum --

This is almost identical, but avoids the issues with special 
characters (such as spaces) in filenames which plague shell 
programming.
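
If you want to see that failure mode concretely, a throwaway demo 
(scratch paths, just for illustration) looks like this:

   mkdir -p /tmp/demo && touch '/tmp/demo/file with spaces'
   find /tmp/demo -type f | xargs md5sum              # md5sum is handed three bogus arguments
   find /tmp/demo -type f -print0 | xargs -0 md5sum   # one correct argument; works fine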

> I did find a c program under getdents(2) which gets around this 
> problem of listing lots and lots files in one directory, but found 
> out that I can use a certain find command:
>
> find . -maxdepth 1 -type f {parameter_here} -print
>
> which will take the wild card character * okay when manually 
> entered.  But when I attempt to use something simple like
>
> lsb *parameter*
>
> where lsb is a bash shell script then the infamous "too many 
> arguments" error shows up again.
>
> I can toggle the bash "set -f" for turning off wildcard expansion, 
> but I really need to toggle this off to get the command line 
> parameter *parameter* without expansion, then drop it inside a 
> simple bash script, then turn it on to execute the find line like 
> above.
>
> Right now bash is expanding * in command parameter #1 before 
> dropping it into the bash script.
>
> Any ideas on how to do this?  I would like to try to avoid
>
> set -f ; lsb *filename* ; set +f
>
> if possible

I would not choose to 'set' and re-'set' shell options as you are 
considering, but it is certainly an option.  (I always worry about 
not getting it correct, and about resetting the options in all of 
the right places.)

I would choose something like this, assuming a CPU-bound job:

   find "$DIR" -type f -name '*parameter*' -print0 \
     | xargs --null --no-run-if-empty --max-procs "$CPU_COUNT" -- \
       $CPU_HEAVY_JOB --

One final note of warning....if you study the above command and know 
what's happening on the system, you can also see that the 'find' 
command occurs potentially long before the $CPU_HEAVY_JOB occurs. 
This means that it's eminently possible for files discovered by 
'find' to be missing by the time that they are encountered in 
processing by $CPU_HEAVY_JOB.
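
If you would still like to keep a small 'lsb'-style wrapper script, 
the trick is to pass the pattern quoted, so the shell never expands 
it, and to let find do the matching inside the script.  A minimal 
sketch (the script name and interface here are just my guess at what 
yours looks like):

   #!/bin/bash
   # lsb: list files in the current directory whose names match a pattern.
   # Usage:  lsb '*parameter*'    <-- single quotes stop the shell expanding it
   pattern=${1:?usage: lsb PATTERN}
   find . -maxdepth 1 -type f -name "$pattern" -print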

Finally, I would agree with the comment by Wes....why so many files 
in a single directory?  I, too, would recommend changing that, if 
you can.  There are some filesystems which exhibit deteriorating 
performance with large numbers of files in a single directory, never 
mind the deteriorating ability of fragile-minded humans to 
understand what's stored in that directory.

Anyway, the net upshot is, if you encounter the E2BIG error 
'Argument list too long', it is probably time to rewrite that section 
of your shell (or whatever) program to perform the exec actions 
either in a loop or using xargs.
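
For completeness, the loop variant I mention might look something 
like this (the per-file command is hypothetical):

   find "$DIR" -type f -name '*parameter*' -print0 |
   while IFS= read -r -d '' f; do
       per_file_command -- "$f"    # one exec per file, odd filenames handled
   done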

-Martin

  [0] ARG_MAX, that wily beast, in a few manpages and online pages
      http://www.in-ulm.de/~mascheck/various/argmax/
      man 2 execve

-- 
Martin A. Brown
http://linux-ip.net/


