Last night, or should I say early this morning, we learned first-hand just how crucial good collaboration is to a smooth recovery from an outage.
If you haven't yet heard, a 'power bump' hit much of Utah and in the aftermath, somehow all of our redundant power systems didn't switch over as expected and the Data Center suffered a partial power outage. This, of course, caused system outages on both sides of the Data Center, resulting in lots of downed systems, web pages, etc.
While the recovery from this event actually went pretty smoothly, we quickly found that without collaboration, chaos would reign. We ended up having a 'scribe' attached to each of the core recovery teams, BYU OIT, CHQ ICS, and Electric Shop, taking notes and running messages around. A phone bridge was formed to recover the CHQ systems, and it didn't take long to learn that one single person had to be 'in charge'. Initially, when everyone was saying what they knew and thought, the result was a cacophony of sound that required lots of repeating. Once order was established, things began to flow - but that's really more about Major Incident Coordination, a topic for a later blog.
OK, so I know that everyone is going, "well duh! Of course when you have a major outage you need collaboration, but those don't really happen all that often, how on earth does it apply to daily stuff?" Well, I think it does.
While the scale is smaller, I know we daily run into situations, large and small, while working to resolve incidents, where more than one mind working on it helps to resolve the issue better and faster. Sometimes, however, there seems to be a hesitation to engage others. Perhaps it's just that it doesn't come to mind. Maybe it's that we don't want to be seen as less than adequate by asking for help. It could even be that the challenge of finding the resolution is so engaging that we want to keep it just for ourselves. Whatever the reason we may not do it now, we should try to remember that our focus with Incident Management is to restore service as quickly as possible, which should override any hesitation we have to bring someone else in to help.
With that said, however, we also want to remember that whenever we collaborate with others in resolving incidents, the learnings found and solutions determined should be a point for us to make sure gets documented and shared via the Knowledgebase. That way, next time the same situation occurs, we or others can take advantage of the previous collaboration and resolve the incident even more quickly. It's something we did last night for sure.
Sometime between 1:30 and 2:00 am, when it became clear that the power was stable and system recovery was complete for some systems and well underway for others, we sat down and focused on creating a list of things that we learned and things that needed follow-up. We found that call-tree lists needed refreshing, procedures needed updating, and some system configurations needed follow-up. Certainly there will be formal reporting on the situation and the outcome. Probably the first non-email round will happen in Production Council tomorrow.
Perhaps the nicest thing about the effort last night was the chance to serve with others and build feelings of teamwork. Despite a considerably stressed situation, I never saw one single instance of short temper and I did see lots of consideration for others and thankfulness for that consideration. And isn't that really what's most important in the long run?
Wednesday, July 29, 2009
Thursday, July 9, 2009
Incident Management Training
Now that you’ve had your intro to the newly updated Incident Management Process with the first round of training, it’s time to learn how to really make it work. Using feedback from the training and day-to-day experience, we’ve built a ‘hands-on’ training session that will really help you get focused on some key points of the new process. We’ll learn more about and work on using the Accountability Model, Collaboration, and Major Incident Coordination. We also think that mixing together the various groups attending the training will enhance our ability to understand each other and work together better. Because this is a hands-on activity, class sizes are limited to ensure everyone has a good chance to participate and learn. To do this, we need the following for each session:
• 2 people from ESPM, Leadership Council, or Account Management
• 6 people from Development
• 16 people from Production Services
It is vital that each person in OIT is an expert at knowing and executing our Incident Management process. Unless your Supervisor specifically asks you not to attend this training, please sign up and attend one of the scheduled sessions.
Sessions are scheduled as follows:
• Fridays 10-Noon, 7/10 thru 8/28
• Tuesdays 8-10, 7/14 thru 9/1
• Tuesdays 1:30-3:30, 7/14 thru 9/1
• Every other Wed 2-4, 7/15 thru 8/26
All sessions are in MB270/276
Thursday, July 2, 2009
It's Been a While!
Never fear!
It isn't that we haven't been making progress, it's simply that we haven't done a good job of keeping communication going out. Previously, all of our blogs were simultaneously written for and published in a hardcopy newsletter. Quite frankly, that was JUST TOO MUCH WORK!
So now, we are going to just blog here and we think we'll be able to post at least once per week.
Now, to get everyone up-to-date - - -
Previously at PIM Central (PIM=Process Improvement Management), we had developed the core PIM process and were testing it out with an Incident Management effort. What's happened since then and what is happening now - - -
Incident Management:
- Incident PIM has completed the process design and it is officially in production. The documentation is on BYU's eRoom here: https://eroom.byu.edu/eRoom/OIT/OITPolicies/0_3096
- The introductory round of training was completed for 230+ people in OIT
- The second round of training starts July 10 - it will be very different from the first. We're going to be doing a very hands-on simulation that will have people working through the process and learning about the key concepts through experience. 28 sessions are scheduled from July 10 through Sept 1
- Key concepts to focus on for Incident PIM are: Teamwork, Accountability Model, Collaboration, New Statuses, New Priority Matrix
- We're really excited about the Accountability Model - with it, each OIT group has 100% accountability for certain types of incidents: Service Desk for incidents that affect an individual, Operations for incidents that affect systems, and Engineering for incidents that require a Design Change. The old 'dummy' escalation model is hereby banished from our world (you know, escalate all incidents to the next group 'up' if you can't solve it). We're still learning how to make it work, but it sure feels like the right thing to do!
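To make the contrast concrete, the Accountability Model boils down to a single-owner routing rule: each incident type maps to exactly one accountable group, instead of being escalated 'up' a chain. Here's a minimal sketch of that idea; the type names and function are illustrative assumptions, not OIT's actual tooling.

```python
# Illustrative sketch of the Accountability Model routing rule.
# Each incident type has exactly ONE accountable group (100% ownership),
# replacing the old "escalate to the next group up" behavior.
# The type names and mapping below are assumptions for illustration.

ACCOUNTABLE_GROUP = {
    "individual": "Service Desk",    # incidents affecting a single person
    "system": "Operations",          # incidents affecting shared systems
    "design_change": "Engineering",  # incidents requiring a design change
}

def route_incident(incident_type: str) -> str:
    """Return the single group accountable for this incident type."""
    try:
        return ACCOUNTABLE_GROUP[incident_type]
    except KeyError:
        raise ValueError(f"Unknown incident type: {incident_type!r}")

print(route_incident("system"))  # Operations
```

The point of the lookup table is that there is no fallback path: an incident either has a clear owner or the classification itself is wrong, which forces the conversation back to "what kind of incident is this?" rather than "who can we hand this to next?"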
- Some of the updates haven't yet been implemented because of tool constraints
- The Process Advisory Team (PAT) charter has been approved and they will start to meet monthly
- The PAT will focus on getting the metrics in place that are documented in the Continuous Improvement Plan
- Change PIM is far into the design process after doing a lot of learning about ITIL Change recommendations and looking into what gaps we have at BYU
- The new design accounts for Change from the large perspective, starting with University IT Objectives down through detailed change plans - the information flow and communication channels will greatly increase OIT's ability to eliminate surprises and increase the effectiveness of our changes
- A comprehensive Risk Analysis model has been developed and extensively tested - this will allow us to size change models so that the process is scalable without adding excess bureaucracy (we hope!!!)
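The idea of 'sizing' a change model can be sketched very simply: score the risk, then pick a lighter or heavier process based on the score. The factors, thresholds, and model names below are assumptions for illustration only, not the actual model the team developed.

```python
# Illustrative sketch only: a simple likelihood-x-impact risk score used
# to size a change model, so low-risk changes follow a lighter process.
# The 1-5 scales, thresholds, and model names are assumptions, not the
# actual BYU OIT Risk Analysis model.

def risk_score(likelihood: int, impact: int) -> int:
    """Both factors rated 1 (low) to 5 (high); score is their product."""
    if not (1 <= likelihood <= 5 and 1 <= impact <= 5):
        raise ValueError("ratings must be between 1 and 5")
    return likelihood * impact

def change_model(score: int) -> str:
    """Map a risk score to a change-process size."""
    if score <= 4:
        return "standard"  # pre-approved, minimal review
    if score <= 12:
        return "normal"    # regular review and approval
    return "major"         # full review and detailed rollback planning

print(change_model(risk_score(2, 2)))  # standard
```

The scalability comes from the mapping step: routine low-score changes skip the heavyweight review entirely, which is what keeps the process from turning into bureaucracy for everyone.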
- We've been able to capitalize on Leadership Council's directives and models including the 5-box work model and the system container panels concepts to build and tie-together the Change process
Problem Management:
- Problem PIM is also into the design process
- The new design is really starting to clarify how Problem Management fits into our work and is clearly going to be the link between Design Incidents (Engineering accountable) and Change Management (implementing the design change into production)
Best of all - EVERY ONE of these PIM efforts has led to great collaborative conversations between all OIT groups. Relationships have been built and strengthened. It has extended the conversation into other efforts and smoothed the way for coming to agreement. It's a good thing.