[PLUG-TALK] High speed document scanners (2)

Pete Lancashire pete at petelancashire.com
Fri Sep 2 16:21:55 UTC 2011


Some background

My scanning setup grew out of wanting to scan a set of magazines that
is no longer in print. When the publisher sold the periodical, the new
owner trashed the original masters, plus a minimum of three complete
unbound sets. Total pages for just the years of interest is about
200,000. Adding the others, my guess is about 250,000.

Before 1948 the paper size was 9 x 12; after that it was reduced to
approx 8.5 x 11, and at about the same time it became stack bound
(this is where from 2 to about 20 11 x 17 sheets are folded in half
and then multiple sets are stacked into the final magazine).

I was originally going to find someone/place with a paper guillotine
and cut the magazines at the spine, but didn't really want to. In many
magazines there are illustrations that lose parts when you do that,
especially when the fold was not centered.

Hardware, scanner #1

In the meantime I kept placing low bids on Epson GT-30000 scanners.
The GT-30000 does 11x17 and has a duplexing ADF; the other attraction
was that it is advertised to have a SCSI-3 interface. Doing the math
at 600 DPI, 24 bits per pixel, it comes to 205 MBytes per scan. The
'specification' says it can do 0.79 msec / line in color mode. This is
pretty much a lie; even if it could, the SCSI I/F can only do 75% of
the necessary data rate. And that is RAW SCSI, which is VERY hard to
do.
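
A rough sketch of that arithmetic, in Python. It assumes the scan area
is exactly 11 x 17 inches and that a scan line runs across the 11 inch
side; both are assumptions, and a slightly larger real scan area would
account for the gap between this result and the 205 MByte figure.

```python
# Back-of-envelope numbers for one 11 x 17 inch color scan at 600 DPI.
# Assumption: scan area is exactly 11 x 17 inches, with each scan line
# running across the 11 inch side.

DPI = 600
WIDTH_IN, LENGTH_IN = 11, 17
BYTES_PER_PIXEL = 3        # 24 bits per pixel
MSEC_PER_LINE = 0.79       # quoted color-mode line time

bytes_per_line = WIDTH_IN * DPI * BYTES_PER_PIXEL
lines = LENGTH_IN * DPI
scan_bytes = bytes_per_line * lines

# Time for one pass and the data rate needed to keep up with it.
scan_seconds = lines * MSEC_PER_LINE / 1000
required_mb_per_sec = bytes_per_line / (MSEC_PER_LINE / 1000) / 1e6

print(f"raw scan size      : {scan_bytes / 1e6:.1f} MB")
print(f"time per pass      : {scan_seconds:.2f} s")
print(f"required data rate : {required_mb_per_sec:.1f} MB/s")
print(f"75% of that        : {0.75 * required_mb_per_sec:.1f} MB/s")
```

Under those assumptions the quoted line time works out to roughly
25 MB/s sustained for about 8 seconds per pass.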

BTW totally forget about using USB2. Find me a USB2 setup that can do
25 MB/Sec for a total of 205 MB without telling the sender to wait and
I'll buy you lunch.

Scanner #2

The next application I have is getting closer to what Keith has: it is
8.5 x 11 and can be individual sheets. My difference is I will need
color at 600 DPI. I want it to do duplexing but not have to do the
paper roller coaster that the GT-30000 does. In many cases the paper
is not much more than glossy newsprint and is over 50 years old. The
GT-30000 can scan, flip, scan, flip and re-feed a 17 inch page in just
over 2 seconds, which is pretty scary, and the old thin paper can't
take it.

Software

I need to take a scanned image, do some simple color/contrast/etc
correction, then start the magic.

The magic is:
  color reduce (in most cases the page is white, black, and one color)
  realign (rotate)
  find the center of a double page
  convert to pdf
  find the page number (this is what I was working on last)
  lossless compress
  OCR - someday
  save as pdf
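
As an illustration of the color reduce step only, here is a minimal
Python sketch that snaps each pixel to the nearest of three colors. It
is not the code I am using; the palette values, the function names and
the tiny test image are all made up for the example.

```python
# Hypothetical sketch: reduce a page to white, black and one spot color
# by mapping every pixel to the nearest palette entry.
# PALETTE values are invented for illustration, not measured from a scan.

PALETTE = [
    (255, 255, 255),  # paper white
    (0, 0, 0),        # black ink
    (200, 30, 30),    # assumed spot color (red)
]

def nearest_index(pixel, palette=PALETTE):
    """Index of the palette color closest to pixel (squared RGB distance)."""
    return min(
        range(len(palette)),
        key=lambda i: sum((p - q) ** 2 for p, q in zip(pixel, palette[i])),
    )

def color_reduce(rows, palette=PALETTE):
    """Map rows of (r, g, b) tuples to rows of palette indices."""
    return [[nearest_index(px, palette) for px in row] for row in rows]

# Tiny 2 x 3 stand-in for a scan: yellowed paper, gray-ish ink, faded red.
scan = [
    [(240, 235, 220), (35, 30, 40), (180, 60, 50)],
    [(250, 250, 245), (210, 40, 35), (10, 10, 10)],
]
indexed = color_reduce(scan)
print(indexed)
```

An index into three colors fits in 2 bits instead of 24, so this step
shrinks the data a lot before any lossless compression is applied.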

Since I don't have everything the way I want it, for now I am saving
the original scans in a lossless compressed format.

The things that got in the way

Using Windows, just looking at the computer would usually cause the
system to tell the scanner 'wait'; basically I just gave up.

Move to Linux

To keep this short:

I could never get the data rate high enough to let the scanner go full
speed. After many emails to Epson and joining the LINUX-SCSI and SANE
developers groups, I got:

1 Epson is going to put out a note and change their SCSI spec (have
not seen it yet) to something closer to the truth
2 Two changes have been made to the Linux SCSI driver; there is still
one open case
3 Quite a bit of work has been done by the maintainer of the Epson SANE modules

With all the above, and using /dev/null as the destination, I ended
up with approx 13-14 seconds per scan.
BTW a 4 GHz XP box maxed out was about 45 seconds per scan.

The next issue was writing to the disk: with the ADF going and at 14
seconds/205 MB per scan my disks could not keep up. In the end I
acquired via Free Geek a quad dual core Pentium box with 32 GB of RAM,
24 of which is configured. The other plus is that it has 64 bit PCI-X
slots, one of which now has a dedicated Adaptec 320 MB/Sec SCSI card.
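
To put numbers on the disk problem, a quick Python sketch of the
sustained write rate those scan times imply. The 16 GB RAM disk size
is a hypothetical value picked for the example; I have not stated the
real size here.

```python
# Sustained write rate implied by the per-scan times in this post.
SCAN_MB = 205      # raw size per scan
LINUX_SEC = 14     # seconds per scan on Linux after the fixes
XP_SEC = 45        # seconds per scan on the maxed-out XP box
RAMDISK_GB = 16    # hypothetical RAM disk size, an assumption

linux_mb_per_sec = SCAN_MB / LINUX_SEC
xp_mb_per_sec = SCAN_MB / XP_SEC

# How many raw scans the assumed RAM disk can buffer (1 GB = 1000 MB),
# and how long the ADF can run before it fills if nothing is drained.
scans_in_ramdisk = (RAMDISK_GB * 1000) // SCAN_MB
buffer_minutes = scans_in_ramdisk * LINUX_SEC / 60

print(f"Linux : {linux_mb_per_sec:.1f} MB/s sustained")
print(f"XP    : {xp_mb_per_sec:.1f} MB/s sustained")
print(f"RAM disk holds {scans_in_ramdisk} scans, "
      f"about {buffer_minutes:.0f} minutes of scanning")
```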

In the end, other than the RAM disk, it looks like I didn't need the
server's performance, although with 8 cores it can run SMP-aware image
manipulation code damn fast. Have not yet tried OCR.

Don't think any of this is relevant but ..

-pete


>  This full-color eight page document (two of which are business cards) was scanned on my ages-old unit a few weeks back:
>  http://dl.dropbox.com/u/9361768/2011_08_10_13_26_39.pdf
>
> This is a B&W only 14-page scan with a mixture of text and graphics:
>  http://dl.dropbox.com/u/9361768/2011_03_16_16_10_03.pdf
>
> So, totally roughing things out, and we come up with ~242KB/page. With a "perfect" 40PPS feedrate, you're looking at a maximum bandwidth need of ~77Mbps (not counting protocol overheads, etc...). Either way, 100Mbps should be sufficient for shuffling around the generated PDFs.
>
> What's the Real World workflow? Averaged over one hour blocks, and after watching the average office worker use equipment, my gut says a mature workflow could best-case shuffle ~1,500 pages/hour. Or, maybe one box every ~90 minutes. Which, when you think about it, pretty much explains why bulk scanning services can get away with anywhere from $0.10-$0.25/page...
>
> I sure hope you have a DMS in mind too; there's all kinds of ACL, auditing/tracking, and general management fun to be had here.
>
> -
> Gregg Berkholtz
> Datacenter consulting, hosting & support since 1995
>  www.tocici.com  |  503-488-5461 x17  |  AS14613
>
>
>> Keith
>>
>> --
>> Keith Lofstrom          keithl at keithl.com         Voice (503)-520-1993
>> KLIC --- Keith Lofstrom Integrated Circuits --- "Your Ideas in Silicon"
>> Design Contracting in Bipolar and CMOS - Analog, Digital, and Scan ICs
>