Friday, October 9, 2009

Making the Case for ITIL

While looking up ITIL definitions, I stumbled upon this great article by Anthony Orr and Erin Casteel. Their article begins:
"Following a basic maintenance plan is the best way to keep an automobile running efficiently and at peak performance, in both good and poor driving conditions. Getting regular oil changes and tune-ups, checking the water level, keeping the tires properly inflated, and other such tasks not only will keep your car running smoothly, but also will lesson the chance of a breakdown on a long journey. Similarly, following the best practices outlined in the IT Infrastructure Library, better known as simply ITIL, will help your IT organization run at peak performance and efficiency on its journey through the current economic landscape."

For the full article, see ITSMWatch.com Article

Great points!

Sean

Wednesday, September 16, 2009

Incident Process Advisory Team

Well, we’ve met in Incident PIM meetings and discussed theories and concepts for months. We’ve drafted and redrafted proposals that finally got passed in Leadership Council. We’ve done a series of training sessions (with more to come) to help instill our ideas in the hearts of our OIT employees. And this week the Incident Process Advisory Team (PAT) started meeting.

The Incident PAT is where the rubber meets the road. This is where we analyze metrics to see if the process is actually working. We discuss ways we need to improve. And then we make assignments to implement those improvements. This week we specifically discussed changes to Incident ticket categorization, creation of reports to measure metrics, and ticket resolution. The time for abstract theorizing is over. This is the process in action!
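
To give a feel for what the PAT's metric reports might look like, here's a tiny, hypothetical Python sketch that computes mean time to resolution per incident category. The field names and sample tickets are invented for illustration; they aren't pulled from our real ticketing data.

# Hypothetical sketch: mean time to resolution per incident category.
# Field names and sample records are invented for illustration only.
from collections import defaultdict
from datetime import datetime
from statistics import mean

tickets = [
    {"category": "Individual", "opened": "2009-09-14 08:10", "resolved": "2009-09-14 09:40"},
    {"category": "System",     "opened": "2009-09-14 10:00", "resolved": "2009-09-14 13:30"},
    {"category": "Individual", "opened": "2009-09-15 11:05", "resolved": "2009-09-15 11:50"},
]

def hours_to_resolve(ticket):
    """Elapsed hours between opening and resolving a ticket."""
    fmt = "%Y-%m-%d %H:%M"
    opened = datetime.strptime(ticket["opened"], fmt)
    resolved = datetime.strptime(ticket["resolved"], fmt)
    return (resolved - opened).total_seconds() / 3600

by_category = defaultdict(list)
for ticket in tickets:
    by_category[ticket["category"]].append(hours_to_resolve(ticket))

for category, durations in sorted(by_category.items()):
    print(f"{category}: {mean(durations):.1f} h average over {len(durations)} tickets")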

Thursday, August 20, 2009

Adding the WOW to Process Improvement

We've all likely heard by now of the WOW principle - championed by the Arbinger Institute ("The Anatomy of Peace") - and with it our duty (or should I say opportunity?) to WOW everyone with whom we work. At OIT we're in a unique position at BYU to assist people from every corner of campus and to affect their view of our organization, which, in turn, affects their opinion of BYU itself.

With the Process Improvement efforts, we're trying to take efficiency and visibility to a whole new level. It has nothing to do with bowing our heads and submitting to a Process - it's all about WOWing our customers, managers, employees, and co-workers - because with these new Processes we can get a lot more done in less time. That means happier bosses, happier students, happier vendors - happier everyone!

Let's really try to apply these Processes with feeling, not just because our boss expects us to. Let's WOW this University.

Tuesday, August 4, 2009

Accountability Model

First, from the dictionary:
  • Accountable - Liable to being called to account; answerable
  • -ability or -ibility is a suffix meaning - Ability, inclination, or suitability for a specified action or condition
  • Model - a pattern or mode of structure or formation

So, we're really talking about a pattern or structure that shows us how to select the group most suitable to be answerable for a specific task.

What!?! What weirdness is this and what does it have to do with me?

As we progress with our process reviews and redesign, we're finding more and more that the better we define the Accountability Model, the smoother things go. It's a point that we illustrate in our latest round of Incident Management training. It's also the point that a large number of our trainees say was their biggest learning when they come to the training.

Essentially, our Accountability Model for Incident Management says that we can distribute all incidents into three specific types - Individual, System, and Design, and each type of incident has a specific group that is 100% accountable to get that incident resolved.

Individual incidents are those affecting an individual, whether it is their account, computer, connection, or ability to do what they need to. An individual incident can't be duplicated using another account or on another machine, etc - you get the idea. Accountability for individual incidents resides with either the Service Desks or Field Services, depending on the specific details of the incident.

System incidents are those affecting many individuals and are resolved by restoring, restarting, or resetting a system component such as a server, switch, router, system, etc. Accountability for system incidents resides with either Operations or Field Services, again, depending on the specific details of the incident.

Design incidents are those that require a design change to resolve the incident. No amount of restore, restart, or reset will get the customer back in service. Accountability for design incidents belongs to Engineering.

The idea is to get the incident assigned to the correct accountability group as quickly as possible so the incident can be resolved quickly. We're having no more of this transferring incidents from group to group to group like a hot potato. It's important that we learn how to do this well so that our customers' incidents can be resolved as quickly as possible.
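
To make that routing concrete, here's a small, hypothetical Python sketch of the Accountability Model's decision logic. The incident types and accountable groups come straight from the description above; the function names and incident fields are invented for illustration and aren't how our ticketing tool actually works.

# Hypothetical sketch of the Accountability Model routing described above.
# The incident fields and function names are illustrative only.
from dataclasses import dataclass

# Each incident type has one group that is 100% accountable for resolution.
ACCOUNTABLE_GROUPS = {
    "Individual": "Service Desk or Field Services",
    "System": "Operations or Field Services",
    "Design": "Engineering",
}

@dataclass
class Incident:
    summary: str
    needs_design_change: bool     # no restore/restart/reset will fix it
    affects_many_users: bool      # more than one person impacted
    reproducible_elsewhere: bool  # fails on other accounts/machines too

def classify(incident: Incident) -> str:
    """Sort an incident into Individual, System, or Design."""
    if incident.needs_design_change:
        return "Design"
    if incident.affects_many_users or incident.reproducible_elsewhere:
        return "System"
    return "Individual"

def route(incident: Incident) -> str:
    """Return the group accountable for getting the incident resolved."""
    return ACCOUNTABLE_GROUPS[classify(incident)]

# Example: one user's login fails only on their own account and machine.
login_issue = Incident("Login fails for one user", False, False, False)
print(classify(login_issue), "->", route(login_issue))
# Individual -> Service Desk or Field Services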

On ITSMWatch.com I found the following quote that talks about the value of the Accountability Model:

"There are many jewels of IT service management (ITSM) guidance in the IT Infrastructure Library (ITIL), but of the most valuable is the inherent accountability model that is referenced throughout the framework. Utilizing this accountability model can make all the difference in changing “hard work” efforts to “smart work” efforts.

Having clearly defined accountability and responsibility allows for individuals to understand what service they provide; what their role is in the overall management of a given service, process, or function; and provides a mechanism for aligning their daily activities with business priorities."

See the whole article here ==> http://www.itsmwatch.com/itil/article.php/3794216

Even better, we think the Accountability Model helps us to understand and respect one another. Too often, it seemed that the old way of just pushing an incident along the path "up" the levels of IT led too many people to think that the "first level" Service Desk wasn't as smart or as capable as the "higher levels" of Operations or Engineering. We affectionately call that the "Dummy Model" and want nothing further to do with it. Our Accountability Model abolishes the levels and focuses instead on areas of suitable expertise.

It's kind of fun in the training sessions to see how this plays out. When we do the wrap-up at the end, it's always interesting to hear how eyes are opened and how respect is generated for each of the OIT groups and how the Accountability Model helps people to 'get it'.

Wednesday, July 29, 2009

Collaboration

Last night, or should I say early this morning, we learned first-hand just how crucial good collaboration is to a smooth recovery of an outage.

If you haven't yet heard, a 'power bump' hit much of Utah and in the aftermath, somehow all of our redundant power systems didn't switch over as expected and the Data Center suffered a partial power outage. This, of course, caused system outages on both sides of the Data Center, resulting in lots of downed systems, web pages, etc.

While the recovery from this event actually went pretty smoothly, we quickly found that without collaboration, chaos would reign. We ended up having a 'scribe' attached to each of the core recovery teams, BYU OIT, CHQ ICS, and Electric Shop, taking notes and running messages around. A phone bridge was formed to recover the CHQ systems, and it didn't take long to learn that one single person had to be 'in charge'. Initially, when everyone was saying what they knew and thought, the result was a cacophony that required lots of repeating. Once order was established, things began to flow - but that's really more about Major Incident Coordination, a topic for a later blog.

OK, so I know that everyone is going, "well duh! Of course when you have a major outage you need collaboration, but those don't really happen all that often, how on earth does it apply to daily stuff?" Well, I think it does.

While the scale is smaller, I know we run into situations daily, large and small, while working to resolve incidents, where more than one mind on the problem helps resolve the issue better and faster. Sometimes, however, there seems to be a hesitation to engage others. Perhaps it's just that it doesn't come to mind. Maybe it's that we don't want to be seen as less than adequate by asking for help. It could even be that the challenge of finding the resolution is so engaging that we want to keep it just for ourselves. Whatever the reason, we should remember that our focus with Incident Management is to restore service as quickly as possible, and that should override any hesitation we have to bring someone else in to help.

With that said, however, we also want to remember that whenever we collaborate with others to resolve incidents, we should make sure the lessons learned and the solutions found get documented and shared via the Knowledgebase. That way, the next time the same situation occurs, we or others can take advantage of the previous collaboration and resolve the incident even more quickly. It's something we did last night for sure.

Sometime between 1:30 and 2:00 am, when it became clear that the power was stable and system recovery was complete for some systems and well underway for others, we sat down and focused on creating a list of things that we learned and things that needed follow-up. We found that call-tree lists needed refreshing, procedures needed updating, and some system configurations need follow-up. Certainly there will be formal reporting on the situation and the outcome. Probably the first non-email round will happen in Production Council tomorrow.

Perhaps the nicest thing about the effort last night was the chance to serve with others and build feelings of teamwork. Despite a considerably stressful situation, I never saw one single instance of short temper, and I did see lots of consideration for others and thankfulness for that consideration. And isn't that really what's most important in the long run?

Thursday, July 9, 2009

Incident Management Training

Now that you’ve had your intro to the newly updated Incident Management Process with the first round of training, it’s time to learn how to really make it work. Using feedback from the training and day-to-day experience, we’ve built a ‘hands-on’ training session that will really help you get focused on some key points of the new process. We’ll learn more about and work on using the Accountability Model, Collaboration, and Major Incident Coordination. We also think that mixing together the various groups attending the training will enhance our ability to understand each other and work together better. Because this is a hands-on activity, class sizes are limited to ensure everyone has a good chance to participate and learn. To do this, we need the following for each session:
• 2 people from ESPM, Leadership Council, or Account Management
• 6 people from Development
• 16 people from Production Services

It is vital that each person in OIT is an expert at knowing and executing our Incident Management process. Unless your Supervisor specifically asks you not to attend this training, please sign up and attend one of the scheduled sessions.

Sessions are scheduled as follows:
• Fridays 10-Noon, 7/10 thru 8/28
• Tuesdays 8-10, 7/14 thru 9/1
• Tuesdays 1:30-3:30, 7/14 thru 9/1
• Every other Wed 2-4, 7/15 thru 8/26


All sessions are in MB270/276

Thursday, July 2, 2009

It's Been a While!

Never fear!

It isn't that we haven't been making progress, it's simply that we haven't done a good job of keeping communication going out. Previously, all of our blogs were simultaneously written for and published in a hardcopy newsletter. Quite frankly, that was JUST TOO MUCH WORK!

So now, we are going to just blog here and we think we'll be able to post at least once per week.

Now, to get everyone up-to-date - - -
Previously at PIM Central (PIM=Process Improvement Management), we had developed the core PIM process and were testing it out with an Incident Management effort. What's happened since then and what is happening now - - -

Incident Management:
  • Incident PIM has completed the process design and it is officially in production. The documentation is on BYU's eRoom here https://eroom.byu.edu/eRoom/OIT/OITPolicies/0_3096
  • The introductory round of training was completed for 230+ people in OIT
  • The second round of training starts July 10 - it will be very different from the first. We're going to be doing a very hands-on simulation that will have people working through the process and learning about the key concepts through experience. 28 sessions are scheduled from July 10 through Sept 1
  • Key concepts to focus on for Incident PIM are: Teamwork, Accountability Model, Collaboration, New Statuses, New Priority Matrix (see the illustrative priority-matrix sketch after this list)
  • We're really excited about the Accountability Model - with it, each OIT group has 100% accountability for certain types of incidents: Service Desk for incidents that affect an individual, Operations for incidents that affect systems, and Engineering for incidents that require a Design Change. The old 'dummy' escalation model is hereby banished from our world (you know, escalate all incidents to the next group 'up' if you can't solve it). We're still learning how to make it work, but it sure feels like the right thing to do!
  • Some of the updates haven't yet been implemented because of tool constraints
  • The Process Advisory Team (PAT) charter has been approved and they will start to meet monthly
  • The PAT will focus on getting the metrics in place that are documented in the Continuous Improvement Plan
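
Since the new Priority Matrix gets mentioned above but isn't spelled out here, here's a generic ITIL-style impact-and-urgency lookup in Python purely to illustrate the concept; the values in our actual matrix may well differ.

# Illustrative ITIL-style priority matrix (impact x urgency -> priority).
# These values are a generic example, not OIT's actual matrix.
PRIORITY = {
    ("high",   "high"):   "P1 - Critical",
    ("high",   "medium"): "P2 - High",
    ("high",   "low"):    "P3 - Medium",
    ("medium", "high"):   "P2 - High",
    ("medium", "medium"): "P3 - Medium",
    ("medium", "low"):    "P4 - Low",
    ("low",    "high"):   "P3 - Medium",
    ("low",    "medium"): "P4 - Low",
    ("low",    "low"):    "P5 - Planning",
}

def priority(impact: str, urgency: str) -> str:
    """Look up the priority for a given impact/urgency pair."""
    return PRIORITY[(impact.lower(), urgency.lower())]

print(priority("High", "Medium"))  # P2 - High
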
Change Management:
  • Change PIM is far into the design process after doing a lot of learning about ITIL Change recommendations and looking into what gaps we have at BYU
  • The new design accounts for Change from the large perspective, starting with University IT Objectives down through detailed change plans - the information flow and communication channels will greatly increase OIT's ability to eliminate surprises and increase the effectiveness of our changes
  • A comprehensive Risk Analysis model has been developed and extensively tested - this will allow us to size change models so that the process is scalable without adding excess bureaucracy (we hope!!!)
  • We've been able to capitalize on Leadership Council's directives and models including the 5-box work model and the system container panels concepts to build and tie-together the Change process

Problem Management:

  • Problem PIM is also into the design process
  • The new design is really starting to clarify how Problem Management fits into our work and is clearly going to be the link between Design Incidents (Engineering accountable) and Change Management (implementing the design change into production)

Best of all - EVERY ONE of these PIM efforts has led to great collaborative conversations between all OIT groups. Relationships have been built and strengthened. It has extended the conversation into other efforts and smoothed the way for coming to agreement. It's a good thing.