
9/28/2009

Month of Chaos

Keeping my head barely above water is about all the progress made in the month of September.
- budget prep crunch for my bosses, so I was kept hopping, fulfilling requests for information
- swine flu - one employee out with it, plus other possibly related absences (caring for a sick child, picking a child up from child-care with a fever, etc.). On top of that, 2 staff on vacation, 1 staff out for training, a regional flood that caused multiple staff absences over a 2-day period, and a 2-day family emergency of my own.
- I still have the "critical", "urgent" projects from over 3 weeks ago outstanding: extranet infrastructure, Windows 2008 domain upgrade, database cluster hardware replacement, Shanghai VPN concentrator setup, transition to the new Exchange server in Hong Kong, and migrating mailboxes out of the NYC offices ASAP to get off a rickety hardware platform as well as make progress toward centralization. Progress has been made in fits and starts, but most of my time and my staff's has gone to dealing with crises on an almost daily basis.
- The Highly Visible Crises, in LIFO order:
  - NYC circuit flap resulting in no failover
  - centralized Exchange server mysteriously bogged down for 2 minutes
  - multiple mysterious losses of connectivity on the Gig/E link connecting our stretched VLAN to the secondary data center
  - various weekend power outages from flooding/storms
  - security door card-scanner controllers down
  - new remote contractor's VPN phone problem dragging on for over a week and now "must be fixed now"
  - 18-hour total loss of connectivity to a branch office while awaiting a local operations visit to restart the router
  - urgent planning and discussions for replacing the database cluster server hardware ASAFP
  - two SAN issues in a single day
  - revelations about free disk requirements for our Compellent SAN
  - HP Virtual Connect interface disconnected from HP Onboard Administrator, and a hurried deployment of new servers to migrate VMs onto in order to fix Virtual Connect
  - HP iLO bug: remotely mapping A: does not work -- the only stinking way to deploy a server is via HP RDP -- pita.
Lessons Learned
- HP iLO firmware bug prevents remotely mapping the A: drive. So there is just about no way to do a manual install and hit F6 during setup to load the boot-from-SAN disk drivers when you are using a blade server. Perhaps this will be fixed by a firmware upgrade, but that's a house of cards I'd rather not disturb right now. I thought I had made some progress with an nLite build of the Windows installation, but it keeps telling me that I don't have the right disk driver for 64-bit Windows 2003. Before we deploy a remote blade enclosure on the other side of the world we *have* to get a Windows install working without having to first implement an RDP infrastructure (chicken and egg thing...). A rough sketch of the answer-file approach I've been poking at is at the bottom of this post.
- Duh to Local Operations: just because somebody can connect to the internet but their phone can't connect to the VPN doesn't mean there is a problem with the phone itself, requiring re-requesting the phone and building in a 3-day delay before this problem gets abdicated on account of lack of gumption to figure it out.
- IPSec VPN - phone or not - *really* doesn't work well when a user buys a router and sticks it in front of his 8-year-old DSL modem. This was another really phun day of telling a 60+ guy how to unplug his PC from his router and into his DSL modem, and guessing at how to log on to it and browse for possible options for turning off NAT. At one point he got his ISP tech support on his other phone, put it on speaker, put me on speaker, and had me try to talk to her. English was her second language (or third), and that didn't work out, so he held the phone up to the speaker and then to his ear, and I would translate her question to him, coach him through our answer back, and then explain her instructions on what to open, etc.
- Although it may be technically possible to add a tray to a Data Domain Restorer appliance on the fly, it didn't really work that way for us. A graceful shutdown first is much more advisable.
- A Compellent SAN works best with 15-20% *unallocated* free space to allow for faster rebuilds of the RAID sets of failed disks, etc. We are pushing 15% of "free" but allocated space. And we had a failing SATA drive around the time a software bug in the 4.3.2 code resulted in it "losing" the preferred paths to the SATA disks. That caused the "unlucky" I/O requests coming down the wrong path to cross the backplane instead, which resulted in 4500ms (yes, 4.5 s) disk latency for the pool overall. That in turn meant over 12 hours of troubleshooting and cussing and a conference call and some more stuff before the bug was isolated and resolved. That was an extraordinarily bad day in a totally crappy month.
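
For what it's worth, here is a rough sketch of the unattended answer-file approach I've been poking at to sidestep the F6 problem - getting the textmode (boot-from-SAN) storage driver into Windows 2003 setup without a floppy. This is not something I have working yet: the description string and driver file names below are placeholders (they have to match whatever is in your HBA vendor's txtsetup.oem), and the txtsetup.oem plus driver files go into a $OEM$\Textmode folder in the distribution (the exact location differs between CD-boot and winnt32-based installs).

    ; winnt.sif -- unattended answer file (sketch, not yet proven to work here)
    [Data]
    UnattendedInstall = Yes

    [Unattended]
    UnattendMode  = FullUnattended
    OemPreinstall = Yes              ; required so setup copies the $OEM$ tree

    ; Load the storage driver during textmode setup instead of pressing F6
    [MassStorageDrivers]
    "QLogic Fibre Channel Adapter" = "OEM"   ; placeholder -- must match the description in txtsetup.oem's [scsi] section

    [OEMBootFiles]
    txtsetup.oem
    ql2300.sys       ; placeholder -- substitute your HBA vendor's 64-bit boot driver files
    ql2300.inf

nLite can supposedly inject textmode drivers directly as well, but since it keeps rejecting the 64-bit driver I may end up hand-rolling the answer file like this instead.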