[PLUG] Finger Pointing

Kristian Erik Hermansen kristian.hermansen at gmail.com
Sat Feb 16 22:09:07 UTC 2008


On Feb 16, 2008 7:49 AM, Michael Rasmussen <mikeraz at patch.com> wrote:
> Try that with a commercial product.
>
> Once again I floated on a contented cloud of appreciation for Open Source
> software and the community around it.
>
> Then I thought about work.  About our SNMP polling system and the (latest)
> product whose MIB isn't properly interpreted.  Not only do we have both
> companies claiming fault with the other we also have to be an intermediary
> for the discussion.  There's no way (that we've found yet) for a support
> or technical engineer at Vendor A to get in direct contact with a support
> or technical engineer at Vendor B.
>
> Have any of you ever encountered a problem like this with Open Source groups?
> Where parties would find fault with the other and refuse to discuss it?
>
> I don't want to let my mood get unjustifiably utopian today.

I will do you one better and tell you a brief story involving a
project I worked on while I was at Cisco :-)

So, I'm writing this massive automated computational "cloud"
infrastructure in C and Python (similar to Amazon EC2).  My boss asks
me to do it.  He wants to "virtualize everything" related to Cisco
Security Agent product testing.  Cool!  So, a couple months into the
project and we start noticing some major issues with the VMware VIX
API.  At first we just thought we were dumb, and writing some code
improperly, but as time went on we started to realize that the VIX API
has some problems.  After trying to work around them for a while, we
kept hoping a new release from VMware would solve some issues.
Finally, it came.  Or did it?

We upgraded all the machines to the latest VMware, in hopes that the
problem was fixed, and gave it a whirl.  Now things blew up even
worse.  This time, instead of just hanging, things were segfaulting.
I called up VMware support.  They gave us the runaround.  Many calls
and much debugging later, we finally got someone to open up a real
ticket.  I explained the problem, and they didn't know what was going
on, but told us to try all these hidden VMware options to see if they
might help.  Some of them actually made the problem worse, so that was
no help.  This back and forth with the vendor went on for about two
months.

I finally got sick of hearing "suggestions" about what to do and broke
out gdb, stepping through the VIX API myself, without debugging
symbols.  It was a major pain, and took me a couple days just to
figure out what was happening.  Well, in the end, I came to a
startling conclusion.  At some strange points between certain VIX API
calls, the connection handler would accidentally free an old pointer,
disrupt the connection, and ultimately dereference a pointer to dead
code (or invalid code).  When I got to the end of debugging, I found
that the VIX API would call this function panic(), which would call
panic_panic(), right before death of the process.  Now, to test my
suspicions, I actually opened up libvmwarevix.so and patched the
binary in place a few times to see if I could circumvent the bugs.
After a day of hacking around the places I thought could
cause the issue, I struck gold, and found a way to patch it so that the
incorrect dereferencing would not occur.  I tested the modified VMware
VIX library shared object on our infrastructure, and it worked!
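
For anyone curious what "patching the binary in place" looks like
mechanically, here is a minimal Python sketch.  It is an illustration
only: the helper name patch_bytes, the offset, and the byte values are
all hypothetical, and it does not reproduce the actual change made to
libvmwarevix.so.  It overwrites a fixed-length byte sequence at a known
file offset, and refuses to write unless the bytes currently there
match what you expect.  The demo runs on a throwaway temp file and
replaces a 5-byte x86 call instruction (opcode 0xe8 plus a 4-byte
relative offset) with five NOPs (0x90).

```python
import os
import tempfile


def patch_bytes(path, offset, expected, replacement):
    """Overwrite bytes at `offset` in place, only if they match `expected`."""
    # Same-length replacement keeps every other offset in the file unchanged.
    if len(expected) != len(replacement):
        raise ValueError("replacement must be the same length as the original")
    with open(path, "r+b") as f:
        f.seek(offset)
        current = f.read(len(expected))
        # Guard against patching the wrong build or the wrong location.
        if current != expected:
            raise ValueError(f"unexpected bytes at {offset:#x}: {current.hex()}")
        f.seek(offset)
        f.write(replacement)


def demo():
    """Patch a stand-in file (hypothetical bytes, not a real library)."""
    fd, path = tempfile.mkstemp()
    try:
        with os.fdopen(fd, "wb") as f:
            # Two NOPs, a 5-byte call (e8 + rel32), one NOP.
            f.write(b"\x90\x90\xe8\x01\x02\x03\x04\x90")
        # Replace the call with five NOPs so it is never executed.
        patch_bytes(path, 2, b"\xe8\x01\x02\x03\x04", b"\x90" * 5)
        with open(path, "rb") as f:
            return f.read()
    finally:
        os.remove(path)


if __name__ == "__main__":
    print(demo().hex())
```

In practice you would work on a copy of the shared object and keep the
original around, since a wrong patch just moves the crash somewhere
else.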

Now, you would think that VMware would be so happy that I took all
this time out and debugged their issue, even telling them the exact
locations in the API that needed to be fixed.  But no!  I called them
up, asked to be escalated to a senior engineer, and they finally called
me back.  I gave them the whole story.  The engineer was surprised,
but quite intrigued about this, because he said "we actually have an
internal bug number for this, and I've seen it before".  Great!  He
asks me to give him a week to try and get a fix into VMware for it,
and that he will call me back.

A week goes by.  Two more weeks.  I finally call on the third week.  I
talk to the engineer.  "yeah, we looked at the bug, and we know
exactly what the issue is, but in order to fix it properly we need to
rewrite a lot of code, and we cannot do that at this time".  I was
stunned.  I asked, "so when can I expect it to be fixed?".  To which
he replied, "we will not be fixing this bug any time soon".  I just
about shit my pants.  They knew there was something wrong and they
*refused* to fix it?!?!?!  Wtf?!!  This would never happen in the Open
Source community!!!

I had my boss call up VMware and give them a bunch of heat.  Wouldn't
you know, after that, they agreed to fix the bug :-)  You can see the
end result of the project I led here:
http://video.vmware.com/kickapps/_Virtual-Insanity/VIDEO/72491/5054.html;jsessionid=A1E13A01E4C890CB50E3D80A4FCCCBF1?as=5054
-- 
Kristian Erik Hermansen
--
"It has been just so in all my inventions. The first step is an
intuition--and comes with a burst, then difficulties arise. This thing
gives out and then that--'Bugs'--as such little faults and
difficulties are called--show themselves and months of anxious
watching, study and labor are requisite before commercial success--or
failure--is certainly reached" -- Thomas Edison in a letter to
Theodore Puskas on November 18, 1878


