[PLUG] Rsync for backup - logrotate again ...

Felix Lee felix.1 at canids.net
Wed Oct 22 09:15:03 UTC 2003


Russell Senior <seniorr at aracnet.com>:
> Try a "slow rename":
>   ln old-name new-name
>   rsync -a -H [...] src dest
>   rm old-name
>   rsync -a -H [...] src dest
> Presto-chango?  Or not?

won't be helpful for renaming/moving directories, like
    mv ~/cumshots ~/art/cumshots

1. detecting files that were renamed but didn't otherwise change:
     on src
       generate a list of (srcfile, mtime, size)
       send it to dest
     on dest
       generate a list of (destfile, mtime, size)
       find all destfiles and srcfiles with identical (mtime, size)
         checksum each destfile
         ask src for checksum of each srcfile
         if checksums match,
            assume it's the same file
            do a rename
     this is a probabilistic algorithm: it can produce false-positive
     matches (very rarely), but that's no worse than rsync's usual
     duplicate check, which can also produce false positives (very
     rarely).
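
     a rough sketch of this step in python, assuming for clarity
     that both trees are local; a real version would run the src
     half on the other host and ship the lists and checksums over
     the wire, and "src"/"dest" are just placeholder paths:

       import hashlib
       import os

       def index(root):
           # map (mtime, size) -> relative paths under root
           idx = {}
           for dirpath, _dirs, files in os.walk(root):
               for name in files:
                   path = os.path.join(dirpath, name)
                   st = os.stat(path)
                   key = (int(st.st_mtime), st.st_size)
                   idx.setdefault(key, []).append(
                       os.path.relpath(path, root))
           return idx

       def checksum(path):
           h = hashlib.md5()
           with open(path, "rb") as f:
               for block in iter(lambda: f.read(1 << 16), b""):
                   h.update(block)
           return h.hexdigest()

       def detect_renames(src_root, dest_root):
           # yield (old dest path, new src path) pairs that look
           # like the same file under a different name
           src_idx = index(src_root)
           for key, dest_paths in index(dest_root).items():
               src_paths = src_idx.get(key, [])
               for dpath in dest_paths:
                   if dpath in src_paths:
                       continue   # same name, plain rsync handles it
                   dsum = checksum(os.path.join(dest_root, dpath))
                   for spath in src_paths:
                       if checksum(os.path.join(src_root, spath)) == dsum:
                           yield dpath, spath
                           break

       for old, new in detect_renames("src", "dest"):
           os.renames(os.path.join("dest", old),
                      os.path.join("dest", new))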

2. detecting files that were both renamed and changed:
     first take care of files that were renamed but didn't change
     then take care of files that weren't renamed but did change
     you're left with a list of src files with no known dest
       and dest files with no known src
     pretend the remaining src files are one huge file and the
       remaining dest files are another,
       and run the rsync rolling-checksum algorithm between them
     (simplistic version:
       pack the src files and dest files into a .tar
       sync src.tar to dest.tar
       unpack dest.tar
       remove dest files that are no longer in dest.tar)
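
     a minimal local sketch of that simplistic version in python,
     assuming the leftover file lists are already in hand; the /tmp
     paths are placeholders, and --no-whole-file just forces rsync's
     delta pass even though both tars happen to be local here:

       import os
       import subprocess
       import tarfile

       def pack(paths, root, tar_path):
           # pack the given files (relative to root) into one tar,
           # sorted so both tars lay out shared data in the same order
           with tarfile.open(tar_path, "w") as tar:
               for rel in sorted(paths):
                   tar.add(os.path.join(root, rel), arcname=rel)

       def sync_leftovers(src_root, src_left, dest_root, dest_left):
           pack(src_left, src_root, "/tmp/src.tar")
           pack(dest_left, dest_root, "/tmp/dest.tar")

           # let rsync's rolling-checksum pass move whatever it can
           # between the two huge pseudo-files
           subprocess.run(["rsync", "-a", "--no-whole-file",
                           "/tmp/src.tar", "/tmp/dest.tar"],
                          check=True)

           # unpack the updated tar on the dest side ...
           with tarfile.open("/tmp/dest.tar") as tar:
               wanted = set(tar.getnames())
               tar.extractall(dest_root)

           # ... and drop dest files that are no longer in it
           for rel in dest_left:
               if rel not in wanted:
                   os.remove(os.path.join(dest_root, rel))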

3. improving the rsync algorithm for large files (eg, database
   files, or the huge pseudo-file in part 2 above):
     so rsync slices a destination file into blocks and sends
     checksums of each block to the src, on the theory that the
     src can use the checksums to avoid sending unchanged data
     (no matter where it is in the src file).
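
     to make that concrete, here's the dest side's half of the job
     in python: slice a file into fixed-size blocks and checksum
     each one.  (real rsync sends a weak rolling checksum plus a
     strong checksum per block; plain md5 here just keeps the
     sketch short.)

       import hashlib

       def block_checksums(path, block_size=700):
           sums = []
           with open(path, "rb") as f:
               while True:
                   block = f.read(block_size)
                   if not block:
                       break
                   sums.append(hashlib.md5(block).hexdigest())
           return sums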

     and one reason rsync sucks for large files is that the blocksize
     defaults to 700 bytes, which was based on experiments in
     transmitting the linux kernel source, which is 1) a collection
     of pretty small files, and 2) pretty small overall.

     there's an option (-B/--block-size) to change the blocksize,
     but it applies to the entire rsync run.  if you want to tune it
     by hand, you'll probably want to rsync the big files separately
     from everything else.

     rsync could be taught to adjust this blocksize dynamically,
     perhaps set it to something like 1% of the file size.  this
     will need some experimenting.
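
     a starting point to experiment with, in python; the 1% figure,
     the clamps, and the paths are all guesses, though -B/--block-size
     itself is a real rsync option:

       import os
       import subprocess

       def tuned_block_size(path, fraction=0.01, lo=700, hi=128 * 1024):
           # roughly 1% of the file, kept within sane limits
           size = os.path.getsize(path)
           return max(lo, min(hi, int(size * fraction)))

       def rsync_big_file(src, dest):
           bs = tuned_block_size(src)
           subprocess.run(["rsync", "-a", "--block-size=%d" % bs,
                           src, dest], check=True)

       rsync_big_file("/var/lib/db/huge.db", "backuphost:/backup/")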

     another thing to explore is recursive subdivision: split a file
     into, say, 100 pieces, then split mismatched blocks into smaller
     pieces, down to some threshold.  this probably becomes a set of
     time/space tradeoffs: if you have the memory, you don't need to
     keep recomputing checksums; a single read pass can generate all
     the checksums you might need and save them.
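
     a toy sketch of the recursion in python, cheating by opening
     both versions locally (the real thing would fetch checksums
     for each range over the network); note it only catches
     in-place changes, it doesn't handle data that shifted position
     the way the rolling checksum does:

       import hashlib

       def _digest(f, offset, length):
           # rereads and rehashes on every call; with enough memory
           # you'd compute every level of checksums in one pass
           f.seek(offset)
           return hashlib.md5(f.read(length)).digest()

       def changed_ranges(old, new, offset, length,
                          pieces=100, threshold=64 * 1024):
           # return (offset, length) ranges where the two files differ
           if _digest(old, offset, length) == _digest(new, offset, length):
               return []                    # whole range matches
           if length <= threshold:
               return [(offset, length)]    # small enough, just resend
           # split the mismatched range into pieces and recurse
           step = max(1, length // pieces)
           ranges, pos = [], offset
           while pos < offset + length:
               chunk = min(step, offset + length - pos)
               ranges.extend(changed_ranges(old, new, pos, chunk,
                                            pieces, threshold))
               pos += chunk
           return ranges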

this is off the top of my head.  I'll have to sit down sometime
and do some performance analysis of large-file rsyncs, and make
sure I understand what the bottlenecks are, before I plunge into
doing any coding.  if I ever do get around to looking at this.

and then there's plenty of potential auto-tuning to explore, like
the disk-speed vs. network-speed question.
--



