[PLUG] Disk IO in Linux?
Kyle Hayes
kyle_hayes at speakeasy.net
Thu Oct 10 02:50:50 UTC 2002
On Wednesday 09 October 2002 18:24, Steve Bonds wrote:
> Disclaimer: Some of this is based on general UNIX architecture and not
> specifically on Linux. I look to the others on the list to provide
> gentle correction where needed. ;-)
OK. I have a few minor corrections, but I am not a kernel developer either.
Well, I lied about the corrections being minor. This got long. Sorry, I deal with
databases and filesystems all day long as a systems architect, so it tends
to leak over :-)
> On Wed, 9 Oct 2002, Alex Daniloff wrote:
> > Linux kernel uses unbuffered IO file access
> > if a program (e.g. DB engine) reads and writes
> > to the raw partition/device.
> >
> > I assume it's because DB engine handles its
> > data transaction on raw device/partition
> > without going through the kernel IO calls.
>
> No, the DB will still need to use kernel I/O calls (i.e., read() or
> write() system calls). Without those calls the DB would need to know
> how to talk directly to the hardware, and that would defeat much of the
> purpose of the OS. ;-) I.e. your database would need to know your disk
> SCSI ID, drive type, SCSI card, etc. etc.
Steve's got this right.
_All_ normal block device access goes through buffering. This buffering
has been called various things in Linux. In 2.2 there were separate buffer
and page caches. In 2.4, those were unified into a single page cache. In
2.5/2.6 they're changing it again.
This makes database people unhappy because when the low level kernel
routines tell the C library that they wrote the block to disk, it might
not really be on disk, but could be in a buffer somewhere (yes, I'm
simplifying). There are system calls and special attributes and some ways
around this (think fsync() and friends). But that's a pain, and it
doesn't solve the whole problem (more on that below).
Hence, you get the raw devices. These are special devices created on
demand that give another interface to some existing block device. This
extra interface is outside the whole set of caching etc. that the kernel
does. It talks to the hardware about as directly as the kernel allows.
If you write a block through that interface, the call returns only once
the drive has said it's done (though the data could still be in memory on
the drive! Users of speedy IDE drives beware; I've lost data due to this).
> > Does all above apply to the case if drives support DMA?
>
> This happens below the level at which an database interfaces with the
> OS. Databases have no knowledge of whether your drives use DMA or
> not. You can have both raw and block devices on any type of hard drive,
> DMA or no DMA.
>
> Here's a description of the data flow from your database to your disk,
> on a sample disk write. For you nitpickers, keep in mind this is
> abbreviated for clarity and is not 100% technically perfect. ;-)
>
> 1) database
> 2) system call (i.e. "write()")
> 3) OS kernel system call interface
3b) VFS layer: this handles most of the high-level filesystem work that is
independent of the actual implementation of the filesystem.
> 4) OS filesystem driver (unless DB is on raw/block disk, then skip this)
I've never tried to mount a filesystem on a raw device. It might be
possible, in which case this layer would get hit.
> 5) OS buffer cache (unless DB is on raw disk, then skip this)
> 7) LVM driver (if used)
> 8) RAID device driver (if used)
Are these before or after the buffer cache? I can't for the life of me
remember.
> 6) OS block device driver [not sure if this is before or after RAID/LVM]
This was out of order above. There's not really much at this layer other
than a pseudo-vtable of file operations.
> 9) SCSI/IDE driver
> 10) SCSI/IDE hardware (this is where DMA comes in)
Actually, there are a few more pieces to DMA. It is set up by the driver
and other parts of the kernel (DMA requires locked pages and the like, so
the VM subsystem is partially involved).
> 11) hard drive firmware
11b) hard disk cache. Modern drives have this. You can turn it off, and
should (e.g. with "hdparm -W 0" on IDE drives), if you really want high
levels of data integrity. There was a
large discussion on the ReiserFS list and the MySQL list some time ago
(two years?). You write to the drive, it puts it into cache and tells you
it's done. The power goes out. Some drives do not use rotational energy
to write the cache to disk (because the cache is too big). See the actual
drive specs. But, don't assume that because the drive told you that it
wrote the data that the magnetic fields are changed on the platters.
> 12) bits on a platter
>
> > Why it's necessary to bind raw devices to block devices
> > ( bind /dev/raw/raw1 to /dev/hdb1 )
> > if a database engine reads and writes to the raw partition
> > without such binding?
>
> Linux doesn't bind a particular raw device to a particular block
> device. Most other unixes use something like /dev/dsk/<block> +
> /dev/rdsk/<char> where the <block> and <char> device names are the same.
>
> I think the actual command used is "raw /dev/raw/raw1 /dev/hdb1" based
> on your example.
Can't remember off the top of my head, but I remember the syntax being
similar to that of the loop-back device.
> > Will this binding improve performance of the DB engine or
> > this needs to be done only in order to read from Linux what
> > has been written on a raw device?
>
> This is needed to tell Linux unambiguously where to find the data for
> that raw device. ;-) Look at "man raw" for more info. It does not
> appear to be optional, so it's not a performance question.
>
> My take on the raw interface is that it's kind of klugy. Linux was
> built with block devices from the beginning, and working around them is
> likely to result in finding more bugs than you might find otherwise.
> ;-)
Raw devices have been around on other forms of Unix for a long time. The
discussion of giving Linux raw devices that worked just like those of other
Unix variants ended with Linus saying "no way, I'm not going to clutter up
my already full /dev with more cruft!". OK, I'm taking liberties with what
he said, but that is the meat of the argument. The device space was
already nearly full (running out of major numbers) and the /dev tree is
full of all kinds of things that aren't needed most of the time. devfs
solves some of the problems.
Raw devices allow specific block devices to have unbuffered access. They
do this without cluttering /dev and without overuse of major numbers.
I would not necessarily say that it was speed that the DB people were
after. That is a factor. I think that one of the biggest things they
wanted was to be able to ensure that blocks were written to disk in a
specific order.
With the OS doing one or more layers of caching, elevator algorithms etc.
and the drives or disk controllers doing the same thing, it is very easy,
and in fact common, for blocks to be written out of order. If you are doing a
very important transaction into a fully ACID compliant DB and the part of
the transaction that says "DONE!" is written to disk, but the previous two
blocks aren't when the power goes out, you'll have a corrupt database.
Some kinds of disk controllers (SCSI??, FibreChannel??) can have "write
barriers". When the controller hits one of these in the list of commands,
it makes sure that all writes before the barrier are flushed to disk
before it does one past the barrier. So, in my transaction example above,
you'd put a write barrier just before the "DONE!" block (I think one is
put after it too so that transactions don't get mixed). All the blocks
that were part of that transaction will be flushed to disk before the
"DONE!" block.
By forcing the order of disk writes, you usually slow down disk access, not
speed it up. But, you make sure that you'll have a coherent DB after a
sudden power failure.
I actually think that raw devices (as done on Linux) are a fairly elegant
solution to the problem of having direct disk access. However, I also
think that a system that makes you need such access is definitely not that
good. Most Unix variants fall into that category.
Best,
Kyle