[PLUG] My mail server is down for the moment...

Ronald Chmara ron at Opus1.COM
Thu Jul 20 10:14:30 UTC 2006


On Jul 19, 2006, at 7:09 PM, Elliott Mitchell wrote:
>> From: Ronald Chmara <ron at Opus1.COM>
>> I know it's not much help now, but I have a *very* small list of
>> things I *never* do remotely:
> Oh, you just lack courage!  :-)   ...insanity helps too.

Well, occasionally I *will* do "cowboy" work, but those gigs come with 
hefty disclaimers and sign-offs. Or it's on my "own" old hardware 
(which, of course, I have already made nearly all the mistakes on, and 
thus know how to recover).

I guess *never* is more of a "never if I expect to keep a client for 
more than 6 months". Even when they've signed off to a policy which 
states "this may render the system permanently unusable", mgmt. folks 
want to know "how many minutes" until a system is back.

>> 1. kernel upgrades (if it bonks on the BIOS level, no remote daemon
>> can save you... the only fix would be a magic box that tunneled raw
>> VGA output, raw PS2/USB input, etc. over TCP/IP... is there such a
>> beast?)
> The thing is, being stuck at the LI prompt is actually more secure
> than a problematic kernel. Your stuff is inaccessible to you, but it
> keeps the intruders out as well.

Well, so does a melted motherboard. :-) Increases the time-to-recovery 
though.

> With a distribution kernel though, this isn't all that hazardous. They
> tend to ship working kernels, and the associated scripts tend to ensure
> no steps are skipped (`pkgadd` a kernel patch on SunOS remotely, no
> problem!). Also with GRUB understanding the filesystem and not needing
> to be rerun, the danger has been greatly lessened. Still something to
> avoid if you can, but a danger that can be minimized.

Yeah, if one can *get* to grub, and/or a functioning init level, it's 
possible to use a rollback/grub.conf change script to roll back to a 
working state if, say, the rollback job isn't killed by a sysad getting 
a good login within 10 minutes after reboot/init/etc.
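
For the curious, the sort of rollback job I mean looks roughly like 
this (a sketch only; the paths assume a Red Hat-ish grub layout, and 
the KERNEL_OK flag file is just something made up for illustration):

    # Before the risky reboot: stash the known-good config.
    cp /boot/grub/grub.conf /boot/grub/grub.conf.known-good

    # Armed at boot time (rc.local or an @reboot cron job): if nobody
    # checks in within 10 minutes, put the old config back and reboot.
    (
      sleep 600
      if [ ! -f /root/KERNEL_OK ]; then
        cp /boot/grub/grub.conf.known-good /boot/grub/grub.conf
        /sbin/reboot
      fi
    ) &
    # A good login within the grace period just does
    # "touch /root/KERNEL_OK" (or kills the sleeping job) and the
    # rollback never fires.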

> Installing a new custom kernel on a system 600 miles away which uses
> LILO and Sun-style disk slices, definitely not for the faint of heart.

...custom binary RAID modules (gee, thanks, Dell!)...
...netboot+wireless drivers...

Basically, anything that has to do with lower-level access to a 
bootable kernel and enough being init'ed to log in.

> Finally got it to work for me though (the previous times I was a lot
> closer, this time would have been a downtime of more than a week).
>> 2. RAID juggling (if /boot is on a SCSI RAID 5, for example)
>> 3. For that matter, anything SCSI. Requires waaaay too many goat
>> sacrifices when *at* the console to even consider doing remotely. (Or,
>> alternately, really good high end cables and terminators, which I
>> *never* seem to see IRL.)
> You've got a method to install SCSI disks remotely? Impressive.

Some colos allow you to ship hardware to be installed with minimal 
instructions like "stick this in the RAID module bay with the failure 
light". I'd advise against doing it, or expect to burn long distance 
minutes pretty heavily.

>> 4. Any crypto library changes, or pam changes, if my sole admin access
>> is ssh. (Want to know how to make an sshd session work when the daemon
>> dies because libraries have changed? So do I.)
> Problematic on SunOS where open() never returns EBUSY; elsewhere,
> pretty easy. You just have to make sure to only kill the server sshd
> process, and then test that you can get back in successfully (meaning
> you keep a terminal with a root shell open while testing). Not danger
> free, but something I do without many worries.

I've nuked this lots of times on Linux over the years, doing hand 
re-compiles (for speed reasons) of various crypto libs; it's the "make 
install" step that bites. I just got into the habit of keeping a "safe, 
outside of program space" login open (usually telnet from a machine 
inside the perimeter). Maybe this rule is just from working with some 
cruftier boxen, or maybe I'm missing something simple. I dunno.
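
These days the dance I'd try looks roughly like this (a sketch, not 
gospel; port 2222 and the test pid-file path are just examples), 
proving the new libraries work with a second daemon before touching 
the real one:

    # From an already-open root shell, start a throwaway sshd on an
    # alternate port so it links against the freshly-installed libs:
    /usr/sbin/sshd -p 2222 -o PidFile=/var/run/sshd-test.pid

    # Prove you can actually get in through it:
    ssh -p 2222 root@localhost true && echo "new libs look OK"

    # Only then bounce the real daemon; your existing session keeps
    # the old libraries mapped, so it survives the restart:
    kill $(cat /var/run/sshd.pid)
    /usr/sbin/sshd

    # ...test a fresh login on port 22 before logging anything out,
    # then clean up the lifeline:
    kill $(cat /var/run/sshd-test.pid)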

>> 5. ifconfig changes (without a crond/atd backup script to restore
>> settings) on single NIC machines.
>> 6. For that matter, ipchains/iptables/ipfw changes (without a
>> crond/atd backup script to restore settings) on even multi-NIC
>> machines.
> Yeah, know that one too. /Somewhat/ less dangerous than kernel upgrades
> (less actually if you're using a package manager).
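
(For anyone who hasn't rigged the atd safety net from 5 and 6, it's 
nothing fancier than roughly this; the filename is made up, and the 
cleanup assumes the restore is the only job in the at queue:)

    # Stash the working rules, and queue an automatic restore:
    iptables-save > /root/iptables.known-good
    echo "iptables-restore < /root/iptables.known-good" | at now + 5 minutes

    # ...load the new rules, confirm you can still get in, then cancel
    # the restore (this removes *every* queued at job, hence the caveat):
    atrm $(atq | awk '{print $1}')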
>> Anybody else have warnings to add to the historical pool of "Ooops",
>> so future admin folks can spend less time pounding their heads
>> mercilessly into keyboards?
> Not really, you've gotten the ones you have to worry most about. Though
> of your list I think only 1, 5 and 6 are really major. 2 and 3 you need
> to be at the console to juggle the hardware anyway.

Well, 2 and 3 are really things that should be possible to do mostly 
remotely (aside from the physical moving of disks), *provided* that the 
booting system components and the RAID components being juggled are 
separate.

Example: say you have a big fileserver with lots of images. 'images' is 
a RAID mount at /home/Images, and the server can boot with or without 
the RAID or its modules. If it's software RAID, this can be juggled 
remotely. If it's hardware/BIOS RAID... it's trickier.
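
Roughly what "juggled remotely" looks like with Linux software RAID 
(device names are made up, and the new disk still needs partitioning 
and a pair of remote hands):

    # Kick the sick member out of the array:
    mdadm /dev/md0 --fail /dev/sdc1 --remove /dev/sdc1
    # ...colo tech swaps the drive, you partition it to match...
    mdadm /dev/md0 --add /dev/sdc1
    cat /proc/mdstat      # watch the rebuild from the comfort of home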

Hm... Oh wait. I don't have anything in the list yet about MBR-related 
proggies... they belong somewhere in there, no? Upgrades that 
(not-so-nicely) "repair" or "refresh" or "replace" an MBR?
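
Cheap insurance before letting anything near the MBR, at least (device 
name is an example; note the full 512 bytes includes the partition 
table, not just the 446 bytes of boot code):

    dd if=/dev/hda of=/root/mbr-backup.img bs=512 count=1
    # ...and if the "helpful" upgrade eats it:
    # dd if=/root/mbr-backup.img of=/dev/hda bs=512 count=1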

-Bop
--
4245 NE Alberta Ct.
Portland, OR 97218
503-282-1370



