[PLUG-TALK] SSD lstat performance questions and seeking hard proof

Tom tomas.kuchta.lists at gmail.com
Mon Nov 28 04:33:19 UTC 2016


Hi Richard,
Linode instances come with internal SSD storage. Perhaps that would be a
good target for your experiment. Check it out.
I feel that you do have a lot going on over the NFS and 15 VMs
sharing it. I used a similar architecture and I also felt that a web app
+ DB bottleneck was disk bound. I did a lot of experiments,
including replacing NFS with internal disks and SAS SSDs.
Unfortunately, in my case it did NOT lead to a significant enough
improvement in application response time. Local disks, and especially
SSDs, made the OSes way more responsive, but not the web application. My
conclusion at the time: Linux caching can be quite good at speeding
up network access to a lot of small files with small enough throughput -
see your RAM disk experiment. Anyway, my root cause was DB related. In
the end it was resolved via an application architecture change
influenced by a "real DBA".
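One quick way to see how much the guest's page cache is already helping:
drop the caches and time the same request cold and then warm. A rough
sketch - run as root on the VM, and substitute a real URL for the
placeholder:

  sync
  echo 3 > /proc/sys/vm/drop_caches        # drop page cache + dentries/inodes
  time curl -s -o /dev/null http://your-site.example/   # cold-cache request
  time curl -s -o /dev/null http://your-site.example/   # warm-cache request

If the warm run is already fast, the storage mostly matters for the first
(cold) hit, and faster disks buy you faster cache warm-up rather than
faster steady-state responses.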
Tomas
On Fri, 2016-11-25 at 17:31 -0800, Richard Powell wrote:
> On 11/25/2016 3:17 PM, Chris Schafer wrote:
> > I feel like you aren't completely describing the architecture.
> Well, that's true.  I was just trying to provide what I thought was
> most relevant.  :-)
> 
> > It seems like there is some virtualization.  A NAS. Networking of
> > unknown configuration.
> Yes.  I'm using VMware ESXi 5.5.  The primary drives for the VMs
> are being served up from NFS shares on a 10-spindle RAID 5 array of
> 15k SAS drives.  The shares are served over an internal 1Gb
> network.  There is a 50GB write-cache SSD on that network storage
> device.  But that doesn't help with the reads at all.
> 
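If the bottleneck really is the virtual disk sitting on that NFS
datastore, it should show up inside the guest as I/O wait and high read
latency while the slow page loads. A quick way to watch it (sysstat
package; the device names will differ on your VMs):

  iostat -x 1
  # watch the await / r_await column (ms per request) and %util for the
  # VM's virtual disk while the page loads; low values mean the time is
  # being spent somewhere other than the disk
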
> > 10k is a lot of on board ssd.
> Indeed it is.  It's not just the storage though.  That's for a
> completely new server that includes 256GB RAM and two 10-core
> processors (2.4GHz).  It includes roughly 9TB of usable SSD storage
> with RAID 6.
> 
> > Also this seems like you are doing a lot of things on this array.
> Mostly just shared hosting.  But spread across multiple VMs.  All
> VMs have their primary hard drives on that same storage array.
> There are approximately 15 VMs being served up from that same storage
> array.
> 
> > Give that the mix could have a significant effect.  You could
> > probably test on AWS instances using different storage types before
> > jumping in.  
> I'm curious.  How could AWS simulate the scenario of having SSDs
> directly installed on a server running ESXi, and also loading the
> VMs' files from that same SSD storage?  I mean, perhaps I could use
> an AWS scenario to compare performance to my own.  But that wouldn't
> necessarily tell me how switching to directly connected SSDs will
> affect my current situation.
> 
> Thanks for the response.
> Richard
> 
> 
> 
> > 
> > On Nov 25, 2016 3:15 PM, "Richard" <plug at hackhawk.net> wrote:
> > > Hello,
> > > 
> > > I am seeking advice before moving forward with a potential large
> > > investment.  I don't want to make such a large purchase unless I'm
> > > absolutely certain it's going to solve the problem that I perceive
> > > to be my biggest problem right now.  I figured there would be a
> > > plethora of expertise on this list.  :-)
> > > 
> > > I'm considering switching from network storage of NFS shares (SAS
> > > 15k RAID 5, 10 spindles) to solid state drives directly connected
> > > to the server.  But alas, the SSDs are extremely expensive, and
> > > I'm not sure how to go about ensuring they're going to improve
> > > things for me.  I can only surmise that they will.
> > > 
> > > Here is what I've found by running strace on some of my larger
> > > web-based PHP applications.  As one example, I've got one
> > > WordPress install that opens 1,000+ PHP files.  The strace is
> > > showing 6,000+ lstat operations across all of these files, and it
> > > is taking roughly 4 seconds to get through all of this.  Not being
> > > super knowledgeable about interpreting the strace logs, I do
> > > wonder if the 4 seconds is mostly related to disk latency, or if
> > > some large percentage of those 4 seconds is also attributed to CPU
> > > and memory as the files are processed/compiled/interpreted.  My
> > > monitoring of memory and CPU has not revealed anything
> > > significant.
> > > 
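strace can give you exactly that split. A rough sketch, assuming you can
run the app from the CLI or attach to a php-fpm worker (the script name
and PID are placeholders, and -w needs a reasonably recent strace):

  # if 'real' is much larger than 'user' + 'sys', the request is mostly
  # waiting (disk/network), not computing
  time php index.php

  # per-syscall summary: -f follows children, -c counts, -w sums the
  # wall-clock time spent in each syscall
  strace -f -c -w -o summary.txt php index.php
  # or attach to a running worker:  strace -f -c -w -p <php-fpm pid>

  # per-call latency, e.g. which lstat calls are slow; the <seconds>
  # at the end of each line is the time spent inside that call
  strace -f -T -e trace=lstat,stat,open -o calls.txt php index.php

If the lstat/open totals in the summary account for most of those 4
seconds, the time is going into the filesystem path; if they are small,
it is PHP doing the work and faster disks will not buy much.
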
> > > I have some suspicion that by switching from the network storage
> > > to directly attached SSDs, I will reduce my example app's response
> > > time by 2 or more seconds.  And, if this is true, then I would
> > > happily spend that $10k+ and switch directions in how I've been
> > > managing my network.  However, if the payoff only turns out to be
> > > 1 second or less shaved off the response time, then it's not
> > > really worth the investment to me.
> > > 
> > > How might someone go about getting hard data on such a thing?  Is
> > > there such a thing as an open source lab available where someone
> > > like me can come in and run a real world test that specifically
> > > applies to my particular situation?  If I were to buy a new car,
> > > I'd expect to test drive the thing.  Well, can I do the same thing
> > > with a $10k+ server investment?  Sadly my experience tells me no.
> > > But I figured I'd ask others anyway.
> > > 
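Short of a lab, you can at least make the before/after comparison
repeatable: record a baseline of response times now, then run the exact
same loop against a trial copy of one representative site on whatever
faster storage you can borrow (a local SSD, or even tmpfs on one test
VM). A rough sketch, with a placeholder URL:

  # 20 timed requests, sorted so the median and the tail are easy to read
  for i in $(seq 1 20); do
    curl -s -o /dev/null -w '%{time_total}\n' http://your-site.example/
  done | sort -n
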
> > > One test that surprised me was when I mounted ramdisks for 4 of
> > > the most highly accessed folders/files of this web application.
> > > It resulted in virtually no improvement.  It had me wondering if
> > > the lstats are still having to access the root partition for their
> > > work, and even though the file read performance might be improved
> > > by switching to a ramdisk, perhaps the lstats are still having to
> > > run against the root partition, which is on an NFS network share.
> > > Does that make sense to anyone here that might be in the know?
> > > Anyway, I need to know if it's the processing/compiling that is
> > > the bottleneck, or if the lstats are the bottleneck, or some
> > > combination of the two.  I don't want to just guess about it.
> > > 
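That result makes sense: lstat takes a full path and the kernel resolves
every directory component of it, so a ramdisk mounted over just four
directories only speeds up the calls whose targets actually live inside
those mounts - the other 1,000+ files WordPress opens (and their parent
directories) are still on the root filesystem. You can see the per-call
cost directly; a rough sketch, with placeholder paths for one file on the
ramdisk and one on the root filesystem:

  # the <seconds> at the end of each strace line is the time spent in
  # that lstat call; compare the ramdisk path against the root-fs path
  strace -T -e trace=lstat \
    php -r 'lstat("/mnt/ramdisk/test.php"); lstat("/var/www/html/index.php");'
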
> > > For the record, I know that I can improve this application's
> > > performance with caching mechanisms.  I've already proven this to
> > > be true.  The problem is that I'm trying to increase performance
> > > across the board for everyone on my servers.  I don't want to
> > > enforce caching on my customers, as that comes with an entirely
> > > different set of problems.
> > > 
> > > Thanks in advance for any advice.  And...  Happy Thanksgiving and
> > > Black Friday.
> > > Richard
> > > 
> > > 
> > > 