[PLUG] Disk IO in Linux?

Kyle Hayes kyle_hayes at speakeasy.net
Thu Oct 10 02:50:50 UTC 2002


On Wednesday 09 October 2002 18:24, Steve Bonds wrote:
> Disclaimer:  Some of this is based on general UNIX architecture and not
> specifically on Linux.  I look to the others on the list to provide
> gentle correction where needed.  ;-)

OK. I have a few minor corrections, but I am not a kernel developer either.

Well, I lied about the corrections being minor.  This got long.  Sorry, I 
deal with databases and filesystems all day long as a systems architect, 
so it tends to leak over :-)

> On Wed, 9 Oct 2002, Alex Daniloff wrote:
> > Linux kernel uses unbuffered IO file access
> > if a program (e.g. DB engine) reads and writes
> > to the raw partition/device.
> >
> > I assume it's because DB engine handles its
> > data transaction on raw device/partition
> > without going through the kernel IO calls.
>
> No, the DB will still need to use kernel I/O calls.  (I.e. read() or
> write() system calls.)  Without those calls the DB would need to know
> how to talk directly to the hardware, and that would defeat much of the
> purpose of the OS.  ;-)  I.e. your database would need to know your disk
> SCSI ID, drive type, SCSI card, etc. etc.

Steve's got this right.

_All_ normal block device access goes through buffering.  This buffering 
has been called various things in Linux.  In 2.2 there were buffers for 
reading and writing.  In 2.4, that was unified to one cache.  In 2.5/6 
they're changing it again.

This makes database people unhappy because when the low level kernel 
routines tell the C library that they wrote the block to disk, it might 
not really be on disk, but could be in a buffer somewhere (yes, I'm 
simplifying).  There are system calls and special attributes and some ways 
around this (think fsync() and friends).  But, that's a pain.  And, it 
doesn't solve the whole problem (see later).
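
For the curious, the usual dance looks something like this.  It's an 
untested sketch, and the path and block size are made up:

/* Write a block and ask the kernel to push it to the device with
 * fsync().  Even after fsync() returns, the data may still be sitting
 * in the drive's own write cache (see further down). */
#include <fcntl.h>
#include <stdio.h>
#include <string.h>
#include <unistd.h>

int main(void)
{
    char block[4096];
    memset(block, 0xAB, sizeof(block));

    int fd = open("/tmp/dbfile", O_WRONLY | O_CREAT, 0644);
    if (fd < 0) { perror("open"); return 1; }

    if (write(fd, block, sizeof(block)) != (ssize_t)sizeof(block)) {
        perror("write");   /* data may only be in the page cache here */
        return 1;
    }
    if (fsync(fd) < 0) {   /* force the kernel to flush it to the device */
        perror("fsync");
        return 1;
    }
    close(fd);
    return 0;
}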

Hence, you get the raw devices.  These are special devices created on 
demand that give another interface to some existing block device.  This 
extra interface is outside the whole set of caching etc. that the kernel 
does.  It talks about as directly to the hardware as the kernel can do.  
If you write a block on that interface, when it's done, the drive has told 
it that it's done (it still could be in memory on the drive!  users of 
speedy IDE drives beware, I've lost data due to this).
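
If you want to poke at one of these yourself, it looks roughly like the 
sketch below.  The device name is only an example, the 512-byte alignment 
is an assumption (it depends on the device's sector size), and writing to 
a bound raw device clobbers whatever is on the underlying disk, so be 
careful:

/* Write one sector through a raw device node.  /dev/raw/raw1 is assumed
 * to already be bound to a block device (see the raw(8) discussion
 * below).  Raw I/O generally requires the buffer, the file offset and
 * the transfer size to all be sector aligned. */
#include <fcntl.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <unistd.h>

#define SECTOR 512

int main(void)
{
    void *buf;
    if (posix_memalign(&buf, SECTOR, SECTOR) != 0) {
        fprintf(stderr, "posix_memalign failed\n");
        return 1;
    }
    memset(buf, 0, SECTOR);

    int fd = open("/dev/raw/raw1", O_RDWR);
    if (fd < 0) { perror("open /dev/raw/raw1"); return 1; }

    /* This bypasses the kernel's buffer cache: when pwrite() returns,
     * the block has been handed to the drive (which may still be
     * holding it in its own cache). */
    if (pwrite(fd, buf, SECTOR, 0) != SECTOR) {
        perror("pwrite");
        return 1;
    }
    close(fd);
    free(buf);
    return 0;
}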

> > Does all above apply to the case if drives support DMA?
>
> This happens below the level at which a database interfaces with the
> OS.  Databases have no knowledge of whether your drives use DMA or
> not.  You can have both raw and block devices on any type of hard drive,
> DMA or no DMA.
>
> Here's a description of the data flow from your database to your disk,
> on a sample disk write.  For you nitpickers, keep in mind this is
> abbreviated for clarity and is not 100% technically perfect.  ;-)
>
> 1) database
> 2) system call (i.e. "write()")
> 3) OS kernel system call interface

3b) VFS call, this handles most of the high level filesystem stuff that is 
independent of the actual implementation of the filesystem.

> 4) OS filesystem driver (unless DB is on raw/block disk, then skip this)

I've never tried to mount a filesystem on a raw device.  It might be 
possible, in which case this layer would get hit.

> 5) OS buffer cache (unless DB is on raw disk, then skip this)
> 7) LVM driver (if used)
> 8) RAID device driver (if used)

Are these before or after the buffer cache?  I can't for the life of me 
remember.

> 6) OS block device driver [not sure if this is before or after RAID/LVM]

This was out of order above.  There's not really much here other than a 
pseudo-vtable of file-ops.

> 9) SCSI/IDE driver
> 10) SCSI/IDE hardware (this is where DMA comes in)

Actually, there are a few more pieces to DMA.  It is set up by the driver 
and other parts of the kernel (DMA requires locked pages etc., so the VM 
subsystem is partially involved).

> 11) hard drive firmware

11b) hard disk cache.  Modern drives have this.  You can turn it off, and 
should if you really want high levels of data integrity.  There was a 
large discussion on the ReiserFS list and the MySQL list some time ago 
(two years?).  You write to the drive, it puts the data into its cache 
and tells you it's done.  The power goes out.  Some drives do not use 
rotational energy to write the cache to disk (because the cache is too 
big).  See the actual drive specs.  But don't assume that just because 
the drive told you it wrote the data, the magnetic fields have actually 
changed on the platters.

> 12) bits on a platter
>
> > Why it's nessesary to bind raw devices to block devices
> > ( bind /dev/raw/raw1 to /dev/hdb1 )
> > if a database engine reads and writes to the raw partition
> > without such binding?
>
> Linux doesn't bind a particular raw device to a particular block
> device.  Most other unixes use something like /dev/dsk/<block> +
> /dev/rdsk/<char> where the <block> and <char> device names are the same.
>
> I think the actual command used is "raw /dev/raw/raw1 /dev/hdb1" based
> on your example.

Can't remember the exact command off the top of my head, but I recall the 
syntax being similar to that of the loop-back device.
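
From what I remember, the raw(8) tool itself is just a thin wrapper 
around an ioctl on /dev/rawctl.  The sketch below is based on the 2.4-era 
<linux/raw.h> interface and is only meant to show the idea; use the real 
raw(8) utility for actual setup:

/* Roughly what "raw /dev/raw/raw1 /dev/hdb1" does: look up the block
 * device's major/minor numbers and hand them to the raw driver via
 * the RAW_SETBIND ioctl on the control device. */
#include <fcntl.h>
#include <linux/raw.h>      /* struct raw_config_request, RAW_SETBIND */
#include <stdio.h>
#include <sys/ioctl.h>
#include <sys/stat.h>
#include <sys/sysmacros.h>  /* major(), minor() */
#include <unistd.h>

int main(void)
{
    struct stat st;
    if (stat("/dev/hdb1", &st) < 0) { perror("stat /dev/hdb1"); return 1; }

    int ctl = open("/dev/rawctl", O_RDWR);
    if (ctl < 0) { perror("open /dev/rawctl"); return 1; }

    struct raw_config_request rq;
    rq.raw_minor   = 1;                  /* bind /dev/raw/raw1 ...       */
    rq.block_major = major(st.st_rdev);  /* ... to /dev/hdb1's major ... */
    rq.block_minor = minor(st.st_rdev);  /* ... and minor numbers        */

    if (ioctl(ctl, RAW_SETBIND, &rq) < 0) { perror("RAW_SETBIND"); return 1; }

    close(ctl);
    return 0;
}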

> > Will this binding improve performance of the DB engine or
> > this needs to be done only in order to read from Linux what
> > has been written on a raw device?
>
> This is needed to tell linux unambiguously where to find the data for
> that raw device.  ;-)  Look at "man raw" for more info.  It does not
> appear to be optional, so it's not a performance question.
>
> My take on the raw interface is that it's kind of klugy.  Linux was
> built with block devices from the beginning, and working around them is
> likely to result in finding more bugs than you might find otherwise. 
> ;-)

Raw devices have been around on other forms of Unix for a long time.  The 
discussion about giving Linux raw devices that worked just like those of 
other Unix variants ended with Linus saying "no way, I'm not going to 
clutter up my already full /dev with more cruft!".  OK, I'm taking 
liberties with what he said, but that is the meat of the argument.  The 
device space was already nearly full (running out of major numbers) and 
the /dev tree is full of all kinds of things that aren't needed most of 
the time.  devfs solves some of the problems.

Raw devices allow specific block devices to have unbuffered access.  They 
do this without cluttering /dev and without overuse of major numbers.

I would not necessarily say that it was speed that the DB people were 
after.  That is a factor.  I think that one of the biggest things they 
wanted was to be able to ensure that blocks were written to disk in a 
specific order. 

With the OS doing one or more layers of caching, elevator algorithms etc. 
and the drives or disk controllers doing the same thing, it is very easy, 
often common, for blocks to be written out of order.  If you are doing a 
very important transaction into a fully ACID compliant DB and the part of 
the transaction that says "DONE!" is written to disk, but the previous two 
blocks aren't when the power goes out, you'll have a corrupt database.
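
In user-space terms, the DB has to do something like the sketch below to 
get that ordering.  The file name and record layout are invented, and as 
noted above fsync() still can't help you if the drive's own cache is 
lying:

/* Make the transaction's data blocks durable before the commit record.
 * The fsync() between the two writes plays the role a hardware write
 * barrier would play further down the stack. */
#include <fcntl.h>
#include <stdio.h>
#include <string.h>
#include <unistd.h>

static int write_all(int fd, const void *buf, size_t len)
{
    const char *p = buf;
    while (len > 0) {
        ssize_t n = write(fd, p, len);
        if (n < 0) return -1;
        p += n;
        len -= (size_t)n;
    }
    return 0;
}

int main(void)
{
    char data[2][4096];
    memset(data, 0x5A, sizeof(data));
    const char commit[] = "DONE!";

    int fd = open("/tmp/txlog", O_WRONLY | O_CREAT | O_APPEND, 0644);
    if (fd < 0) { perror("open"); return 1; }

    /* 1. write the transaction's data blocks */
    if (write_all(fd, data[0], sizeof(data[0])) < 0 ||
        write_all(fd, data[1], sizeof(data[1])) < 0) { perror("write"); return 1; }

    /* 2. make sure they are on disk before the commit record goes out */
    if (fsync(fd) < 0) { perror("fsync"); return 1; }

    /* 3. only now write and flush the "DONE!" record */
    if (write_all(fd, commit, sizeof(commit)) < 0) { perror("write"); return 1; }
    if (fsync(fd) < 0) { perror("fsync"); return 1; }

    close(fd);
    return 0;
}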

Some kinds of disk controllers (SCSI??, FibreChannel??) can have "write 
barriers".  When the controller hits one of these in the list of commands, 
it makes sure that all writes before the barrier are flushed to disk 
before it does one past the barrier.  So, in my transaction example above, 
you'd put a write barrier just before the "DONE!" block (I think one is 
put after it too so that transactions don't get mixed).  All the blocks 
that were part of that transaction will be flushed to disk before the 
"DONE!" block.

By forcing the order of disk writes, you usually slow down disk access, not 
speed it up.  But, you make sure that you'll have a coherent DB after a 
sudden power failure.

I actually think that raw devices (as done on Linux) are a fairly elegant 
solution to the problem of having direct disk access.  However, I also 
think that a system that makes you need such access is definitely not that 
good.  Most Unix variants fall into that category.

Best,
Kyle




