Wednesday, July 29, 2009


Last night, or should I say early this morning, we learned first-hand just how crucial good collaboration is to a smooth recovery of an outage.

If you haven't yet heard, a 'power bump' hit much of Utah and in the aftermath, somehow all of our redundant power systems didn't switch over as expected and the Data Center suffered a partial power outage. This, of course, caused system outages on both sides of the Data Center, resulting in lots of downed systems, web pages, etc.

While the recovery from this event actually went pretty smoothly, we quickly found that without collaboration, chaos would reign. We ended up having a 'scribe' attached to each of the core recovery teams, BYU OIT, CHQ ICS, and Electric Shop, taking notes and running messages around. A phone bridge was formed to recover the CHQ systems and it didn't take long to learn that one single person had to be 'in charge'. Initially, when everyone was saying what they knew and thought, the result was a cacaphony of sound that required lots of repeating. Once order was established, things began to flow - but that's really more about Major Incident Coordination, a topic for a later blog.

OK, so I know that everyone is going, "well duh! Of course when you have a major outage you need collaboration, but those don't really happen all that often, how on earth does it apply to daily stuff?" Well, I think it does.

While the scale is smaller, I know we daily run into situations, large and small, while working to resolve incidents, where more than one mind working on it helps to resolve the issue better and faster. Sometimes, however, there seems to be a hesitation to engage others. Perhaps it's just that it doesn't come to mind. Maybe it's that we don't want to be seen as less than adequate by asking for help. It could even be that the challenge of finding the resolution is so engaging that we want to keep it just for ourselves. Whatever the reason we may not do it now, we should try to remember that our focus with Incident Management is to restore service as quickly as possible, which should override any hesitation we have to bring someone else in to help.

With that said, however, we also want to remember that whenever we collaborate with others in resolving incidents, the learnings found and solutions determined should be a point for us to make sure gets documented and shared via the Knowledgebase. That way, next time the same situation occurs, we or others can take advantage of the previous collaboration and resolve the incident even more quickly. It's something we did last night for sure.

Sometime between 1:30 and 2:00 am, when it became clear that the power was stable and system recovery was complete for some systems and welll underway for others, we sat down and focused on creating a list of things that we learned and things that needed follow-up. We found that things like call-tree lists needed refreshing, procedures updated, and system configurations that need followup. Certainly there will be formal reporting on the situation and the outcome. Probably the first non-email round will happen in Production Council tomorrow.

Perhaps the nicest thing about the effort last night was the chance to serve with others and build feelings of teamwork. Despite a considerably stressed situation, I never saw one single instance of short temper and I did see lots of consideration for others and thankfulness for that consideration. And isn't that really what's most important in the long run?

No comments: