The Son of the last of a long line of thinkers. (delascabezas) wrote,

my day...

so yeah, let's see.

apparently the electricians who Sinai contracted for the ongoing infrastructure upgrade are complete and total wastes of space.

Our entire data center now runs on a building UPS - the juice should never go down.

Unless the main circuit for the failover is 60 amp, instead of 100 amp.

When that happens, the circuit breaker blows up like a Chinese New Year party favor.

What happens then is that lots and lots of the data center has no power - servers crash, as do racks and racks of Cisco gear, since, of course, they no longer have individual UPSes - we are on building-wide now!

The problem is identified (by the ozone and smoke) - extra emergency wiring is used to bypass the fried circuit - supposedly 3 phase power will be restored in 20 minutes.

Instead of 3 phase power, we end up with 2 phase, over wiring which was not made to handle that kind of current* - result: wire so hot that the coating on it melts, and it burns you if you touch it.

Amazing we didn't have a fire - which would have been the end of things, since the block we were running from, apparently, was the backup grid for the fucking EMERGENCY POWER - y'know, the things like the fire suppression system, and backup lights.


So yeah, now we have a super hot server room that is running on melty wire - the juice is cut, new wires are brought in, they get the old circuit partially removed (now that it has gone from liquid to solid state once more).

Of course, the dipshits patch us into a circuit that either has an open neutral or a faulty ground. Three machines die booting - the main web server goes down like Halley's comet.

Again juice is cut.

Damage assessment - three machines removed from racks by flashlight (folks, don't try this at home - it's like playing Twister with power tools as extra partners).

A secure circuit is found.

Power is restored.

The secure circuit collapses under load - not once, not twice, but thrice - each time, it comes back, but each time, it takes another machine with it.

Net result:

2pm, running on emergency power circuit AND "secure" circuit.

5 servers down.

WWW on backup unit, last sync was 18 hours ago - clients cannot FTP - we don't even know where to start.

Replacement hardware is in the air - I got to see what a slagged raid array looks like.

Tape drives have been largely inaccessible, since one of the last crashes kicked the main backup DB into divide-by-0 space, leaving us having to re-merge ~2 terabytes of data back into the DB before we can figure out what we have in the deck to deal.

People are hot, angry, and shit is up, for now, but everyone is watching pagers warily.

I want to go home, I definitely don't make enough to deal with this shit.

Best estimate is that the replacement box will be here by 7pm - we should have things restored by ~9ish IF we can find good tape.

I am supposed to stay "on call".

*edit - I have corrected my verbiage and description, props to bruteforcemethd for straightening me out.

