[PLUG] Disk IO in Linux?

Steve Bonds 1s7k8uhcd001 at sneakemail.com
Thu Oct 10 08:23:41 UTC 2002


On Wed, 9 Oct 2002, Kyle Hayes wrote:

> 11b) hard disk cache.  Modern drives have this.  You can turn it off
> and should if you really want high levels of data integrity.  There
> was a large discussion on the ReiserFS list and the MySQL list some
> time ago (two years?).  You write to the drive, it puts it into cache
> and tells you it's done.  The power goes out.  Some drives do not use
> rotational energy to write the cache to disk (because the cache is too
> big).  See the actual drive specs.  But, don't assume that because the
> drive told you that it wrote the data that the magnetic fields are
> changed on the platters.

Anyone using a hard disk write-back cache without a UPS and auto-flush on
power failure is begging for pain.  I've felt that pain and it's not
fun.  (Can you say 'fish out the backup tapes', boys and girls?)

I will freely admit to being behind the times on IDE technology, but all
my SCSI drives have write cache OFF even though the system is
UPSed.  Unless I'm running a benchmark to impress my friends. ;-)  Turning
it on gets that extra 20%, but is never appropriate for data that can't be
recreated easily.
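
For the IDE-inclined, the on-drive write cache can be toggled from
software.  Here's a rough sketch of what hdparm's -W0 flag does under
the hood, assuming the HDIO_DRIVE_CMD ioctl and WIN_SETFEATURES from
<linux/hdreg.h> plus the ATA subfeature codes 0x02/0x82 for
enable/disable (SCSI drives want a mode-page change instead -- the WCE
bit in the caching page -- which isn't shown here):

  /* wcache_off.c -- sketch: turn off the on-drive write cache on an
   * IDE disk, roughly what "hdparm -W0 /dev/hda" does. */
  #include <fcntl.h>
  #include <stdio.h>
  #include <sys/ioctl.h>
  #include <linux/hdreg.h>

  int main(void)
  {
      /* args[0] = ATA command, args[2] = feature register:
       * 0x82 disables the write cache, 0x02 would re-enable it. */
      unsigned char args[4] = { WIN_SETFEATURES, 0, 0x82, 0 };
      int fd = open("/dev/hda", O_RDONLY);  /* device is just an example */

      if (fd < 0 || ioctl(fd, HDIO_DRIVE_CMD, args) != 0) {
          perror("disable write cache");
          return 1;
      }
      return 0;
  }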

> I would not necessarily say that it was speed that the DB people were
> after.  That is a factor.  I think that one of the biggest things they
> wanted was to be able to ensure that blocks were written to disk in a
> specific order.

My experience has been that they're after the higher performance.
Double-buffering by the OS just wastes CPU cycles if you're already
using the database's internal cache, possibly with another layer of
caching at the disk array level.

I think the reason you hear the "specific order" argument is that
old-timers learned to manage databases on JBODs (bare disks bereft of
caching/striping/RAID).

Those pesky database people-- always trying to get down into the meat of
the hardware to optimize the database layout for the current data
set.  Unfortunately, most of the databases I deal with have areas of
constantly moving "local busy-ness" which generally correspond to that
day/week/month's data.  Since (unfortunately) most databases have
archiving/data removal put in as an afterthought (if at all), this little
section of high I/O keeps creeping forward in the database over time.

The net result is all that careful tuning is completely useless after a
few weeks or months since the hotspot has moved.

(My solution: loads of battery-backed cache on the hardware (absorbs your
random I/O), and stripe* the fury out of everything (speeds the sequential
I/O).  This gets you 80% of the way there with 10% of the effort and it
works just as well on day 1 as on day 100.)

*I find striping across hardware RAID-5 groups to be particularly
effective since many RAID-5 controllers are slow on sequential
writes.  Striping over them gives 'em a chance to catch up between I/Os.

> With the OS doing one or more layers of caching, elevator algorithms
> etc.  and the drives or disk controllers doing the same thing, it is
> very easy, often common, for blocks to be written out of order.  If
> you are doing a very important transaction into a fully ACID compliant
> DB and the part of the transaction that says "DONE!" is written to
> disk, but the previous two blocks aren't when the power goes out,
> you'll have a corrupt database.

Yeah, this is why any database worth anything opens its files with
the O_SYNC option to ensure that the kernel doesn't return from the
write() until the hardware says the data is safe.  This option
periodically seems to break on Linux.  Most UNIX variants have ways of
"translating" this safety feature to be faster and less safe.  Not
recommended.  ;-)
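
The difference is just a flag at open() time.  A minimal sketch, with
an invented filename (and a real database would also check every
return code and probably keep fsync()/fdatasync() around as a
fallback):

  /* osync_write.c -- sketch: open with O_SYNC so write() doesn't
   * return until the kernel believes the data is safely on disk. */
  #include <fcntl.h>
  #include <stdio.h>
  #include <string.h>
  #include <unistd.h>

  int main(void)
  {
      const char msg[] = "COMMIT\n";
      int fd = open("journal.log",
                    O_WRONLY | O_CREAT | O_APPEND | O_SYNC, 0600);

      if (fd < 0) {
          perror("open");
          return 1;
      }
      /* With O_SYNC this blocks until the kernel has pushed the data
       * to the drive -- though a lying write-back cache on the drive
       * itself can still defeat it, as above. */
      if (write(fd, msg, strlen(msg)) != (ssize_t)strlen(msg)) {
          perror("write");
          return 1;
      }
      close(fd);
      return 0;
  }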

Writing data out of order isn't necessarily a problem if ALL I/O is going
through the buffer cache.  (And anyone mixing buffered and unbuffered I/O
is begging for lots of pain.)  The buffer cache need only ensure that
multiple writes to the *same data* arrive in order, which is usually not a
problem.  Reads of that data will come out of the most-recently-written
buffer, so the application never even needs to know it wasn't on disk yet.
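
A tiny illustration of that last point, assuming both the write and
the read go through ordinary buffered I/O (the filename is made up):

  /* readback.c -- sketch: a read right after a buffered write is
   * served from the buffer cache, whether or not the data has
   * actually hit the platters yet. */
  #include <fcntl.h>
  #include <stdio.h>
  #include <unistd.h>

  int main(void)
  {
      char buf[16] = { 0 };
      int fd = open("scratch.dat", O_RDWR | O_CREAT | O_TRUNC, 0600);

      if (fd < 0)
          return 1;
      write(fd, "new-version", 11);    /* lands in the buffer cache   */
      lseek(fd, 0, SEEK_SET);
      read(fd, buf, 11);               /* comes back from that buffer */
      printf("read back: %s\n", buf);  /* "new-version", no fsync yet */
      fsync(fd);                       /* only now is it forced out   */
      close(fd);
      return 0;
  }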

Losing data when the power goes out or the OS crashes is a problem
with any write-back cache, and a major reason to avoid them if there
is no way to externally recreate the lost/corrupt data.

> By forcing the order of disk writes, you usually slow down disk
> access, not speed it up.  But, you make sure that you'll have a
> coherent DB after a sudden power failure.

Just to beat the point some more, the coherency problem isn't necessarily
related to the ordering-- it's a risk of having data "in transit".

One of the main reasons raw I/O is faster for big databases is that
they do their own caching, so the buffer cache does nothing but add
the overhead of searching it to each I/O for no net gain.
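
The same effect is available these days without a raw device by
opening with O_DIRECT (new-ish in the 2.4 kernels).  A minimal sketch,
assuming a kernel and filesystem that support it, with an invented
path (O_DIRECT needs block-aligned buffers and transfer sizes):

  /* direct_write.c -- sketch: bypass the kernel buffer cache so the
   * database's own cache is the only cache in play. */
  #define _GNU_SOURCE              /* for O_DIRECT */
  #include <fcntl.h>
  #include <stdlib.h>
  #include <string.h>
  #include <unistd.h>

  int main(void)
  {
      void *buf;
      int fd;

      /* 4096-byte alignment covers the usual device block sizes. */
      if (posix_memalign(&buf, 4096, 4096) != 0)
          return 1;
      memset(buf, 0x55, 4096);

      fd = open("tablespace.dat", O_WRONLY | O_CREAT | O_DIRECT, 0600);
      if (fd < 0)
          return 1;

      /* Goes straight to the device: no copy into the buffer cache,
       * no cache search on later O_DIRECT reads. */
      if (write(fd, buf, 4096) != 4096)
          return 1;

      close(fd);
      free(buf);
      return 0;
  }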

Wow.  That sure got long.  ;-)

  -- Steve




