Easy ITIL - Demystifying the Complexity of ITIL & ITSM
The Value in ITIL Problem Management
Easy ITIL Series by Reid Cooper
August 4th, 2009
Most small to medium and some large businesses are very good at quickly restoring services after a business-impacting IT event, but after the dust settles few of them follow through to ensure the incident does not come back to impact the business again and again. They spend a high percentage of their precious IT resource time firefighting recurring incidents but very little on preventing the fires. That is where ITIL Problem Management comes into play.
What is ITIL?
IT Infrastructure Library (ITIL) is the most widely accepted approach to IT Service Management (ITSM) in the world. ITIL is a compendium of 20 years of best practice guidance in ITSM from world experts, business people and ITSM practitioners compiled by the Office of Government Commerce (OGC) of the United Kingdom. The ITIL prime directive is to ensure the IT resources effectively support the business.
What is Problem Management?
The goal of ITIL Problem Management is to minimize the adverse impact of Incidents and Problems on the business caused by errors within the IT infrastructure, and to prevent the recurrence of Incidents related to these errors. In order to achieve this goal, Problem Management seeks to get to the root cause of incidents and then to initiate actions to improve or correct the situation.
Incident and Problem Management Differentiated
For those new to ITIL, it may be helpful to understand the difference between an Incident and a Problem using the ITIL lexicon. In ITIL, an Incident is basically any unplanned negative impact on a service. An Incident could be as simple as a slight degradation in performance below the agreed service level or as severe as the total unavailability of one or more IT services. So, the primary focus of Incident Management is the rapid restoration of services to levels expected or agreed with the users or consumers of those services. What Incident Management does to restore services as quickly as possible may be replacing failed hardware, patching an application, backing out a problematic change or something as simple as rebooting a server. Depending on the nature of the problem, rebooting the server may in fact restore service, and the consumer of the impacted service is again happy and productive… for an unknown period of time. And that’s OK, because the focus of Incident Management is rapid service restoration, and that in some cases can be achieved with a simple reboot. However, unlike replacing failed hardware, patching an application or backing out a problematic change, rebooting the server fixes nothing in the long term. Rebooting clears memory and resets pointers and in many cases will restore service for a period, but the reason the server failed is still lurking and we are still at risk of the incident recurring.
A Problem, on the other hand, is the underlying or root cause of one or more Incidents. The relationship of incident to problem is one-to-one or many-to-one. The focus of Problem Management is the elimination, avoidance or management of the risk associated with the problem to the extent that it does not have an unacceptable recurring impact on business operations. The preference is always problem elimination, but in some cases elimination of the problem is not possible or cost effective.
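The one-to-one or many-to-one relationship described above can be sketched as a simple lookup. This is only an illustration: the incident and problem identifiers are made up, and the function name is hypothetical rather than part of any ITIL tool.

```python
from collections import defaultdict

# Hypothetical links: each incident points at exactly one problem,
# while one problem may collect many incidents (many-to-one).
incident_to_problem = {
    "INC-1001": "P-1001",
    "INC-1017": "P-1001",
    "INC-1042": "P-1001",
    "INC-1050": "P-1050",
}


def incidents_by_problem(links):
    """Group incident IDs under the problem they are linked to."""
    grouped = defaultdict(list)
    for incident_id, problem_id in links.items():
        grouped[problem_id].append(incident_id)
    return dict(grouped)


grouped = incidents_by_problem(incident_to_problem)

# Problems with more than one linked incident are the recurring ones.
recurring = {pid: ids for pid, ids in grouped.items() if len(ids) > 1}
print(recurring)
```

Grouping this way makes recurring problems stand out, because they are the ones with more than one incident attached.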
Options Presented by Problem Management
In our server reboot example above, let’s assume we have worked with our technical resources and confirmed that the actual root cause of the incident is a memory leak in an application. An application with a memory leak does not free the memory it has used as it should, so over time all the memory on the server is allocated, the Operating System can no longer function, the server freezes up and all hosted applications and services stop functioning. The reboot will clear memory and restore service for a period, but there is a high probability the freeze incident will recur and further impact business operations. Rebooting in this case treats the symptoms but ignores the disease.
The Problem Management process brings a thorough and systematic approach to tracking, diagnosing and mitigating the risk posed by defects in the infrastructure. In the example above, once service is restored it would not be uncommon for the stressed technical resources to go off to fight another IT fire or go back to working on another project that is falling behind. Unless there is a Problem Management process active within the IT organization, the business is sure to encounter the same failure again and again until the problem is mitigated. Problem Management ensures the risk posed by the problem remains highly visible until mitigated and offers a number of approaches to managing or eliminating the risks.
The very best solution would be to work with the application support personnel and schedule a corrective change through Change and Release Management. However, finding the leak and correcting and implementing the required software fix may be a prolonged process that consumes large amounts of expensive programmer time leaving the business at risk until mitigated.
In addition to eliminating the defect from the infrastructure that is causing the negative impact on service, Problem Management can usually identify workarounds that lower the business impact to acceptable levels or avoid the impact entirely. Again, problem elimination is always the best solution, but in some cases it takes a considerable amount of time, and interim solutions are helpful in reducing or avoiding business impact to allow a thorough diagnosis and elimination process to work its magic.
If during the diagnosis phase we identify another way for the users to accomplish what they were attempting without using the offending application, even if it involves a few additional keystrokes or seconds, then we have a workaround. The workaround may not be as fast or easy as using the offending application, but it is good enough as a temporary solution if the offending application is not available. This workaround would be recorded in the Known Error database used by the Help/Service Desk personnel to instruct users how to get around the problem when it does occur. We may even have the Service Desk personnel send an e-mail targeted at users of the application to make them aware of the situation and instruct them on what to do if the incident recurs. This reduces the time it takes the end users to return to productivity, minimizing both the impact on the business and the calls to the Service Desk.
During the diagnosis we may also determine approximately how long it takes for the server memory to be consumed by the leak. In that scenario we could schedule server reboots at regular intervals in off peak times to avoid server shutdowns. This approach entirely avoids the risk of business disruption associated with the incident recurring. If the offending application is scheduled for replacement in the near term, it may be more cost effective to simply work around or avoid the incident until the replacement system is in place.
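The arithmetic behind such a reboot schedule can be sketched as follows. This is a minimal illustration that assumes the leak grows at a roughly constant rate; the memory figures, the function names and the 50% safety factor are all assumptions chosen for the example, not values from a real system.

```python
def hours_until_exhaustion(total_mb, baseline_mb, leak_mb_per_hour):
    """Estimate hours until memory runs out, assuming a constant leak rate."""
    if leak_mb_per_hour <= 0:
        raise ValueError("leak rate must be positive")
    free_mb = total_mb - baseline_mb
    if free_mb <= 0:
        raise ValueError("no free memory left after baseline usage")
    return free_mb / leak_mb_per_hour


def reboot_interval_hours(exhaustion_hours, safety_factor=0.5):
    """Reboot well before exhaustion; the safety factor is an assumed margin."""
    return exhaustion_hours * safety_factor


# Assumed figures: 16 GB server, 4 GB used right after a reboot, 64 MB/hour leak.
exhaustion = hours_until_exhaustion(16384, 4096, 64)
interval = reboot_interval_hours(exhaustion)
print(f"Memory exhausted after about {exhaustion:.0f} hours; "
      f"reboot every {interval:.0f} hours or less, in an off-peak window.")
```

With these assumed numbers the server would freeze after roughly eight days, so a reboot every four days during off-peak hours leaves a comfortable margin. A real leak rarely grows at a perfectly steady rate, so the interval should be checked against observed memory usage.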
In a classical IT shop, if any form of incident follow-up was performed, the options would generally be surfaced and decided within an IT vacuum from an IT perspective. In ITIL Problem Management, business representatives are involved in the decision process and the options that are in the best interest of the business are selected for implementation.
Summary
The focus of Problem Management is to reduce or eliminate the business impact associated with recurring incidents. As the examples above illustrate, there are many ways to achieve that objective with the best option determined by the needs of the business versus what makes the most sense from a purely IT perspective. The examples above are all reactive but there is also a proactive side of problem management that works very closely with other ITIL processes to design defects out of the infrastructure.
The cost of implementing Problem Management is quickly offset by improvements in service availability and quality that soon translate into increases in customer satisfaction. It is also an essential link in freeing your costly IT technical support staff from time-consuming and stressful firefighting to allow them to focus on the planned changes needed for the business to compete in an aggressive business market.
Simplified IT Problem Management Using the 5-Point Template
Easy ITIL Series by Reid Cooper
August 4th, 2009
Most companies large enough to have a dedicated IT department or group have already devised a means of recording and tracking Incidents through service restoration and closure. Few have made the significant investment required to obtain a high-dollar Service Management System (SMS) like Remedy that directly supports ITIL processes. Even if your company has made such an investment, depending on the quality of the vendor implementation of ITIL process support, you may be blessed or cursed with such a tool.
If you are fortunate enough to have one of the better SMS implementations of ITIL process support, you can simply start using the Problem Management module by creating Problem tickets for all incidents that meet the criteria established in your policy and start working your defined process. However, if your existing SMS is not so well endowed, that shortfall can be overcome using the simple 5-Point Template to both guide the Problem Management process and serve as your historical Problem record repository.
The 5-Point Template
The 5-Point Template is a simple approach to Problem Management that is not included in the official ITIL documentation set but establishes process and structure consistent with the ITIL Problem Management process. Policy supporting the 5-Point process would require the assigned problem owner, generally the manager of the support group with stewardship for the failed configuration item, to address the following questions for each and every incident that resulted in significant business impact:
1. How did the problem manifest itself to the end user?
· Identify all applications or services that were unavailable
· Detailed text description of any abnormal performance or behavior of applications or services using business versus technology terms.
· Describe how the incident was detected or by whom it was first reported and when.
2. What is the underlying technical description of the problem?
· Identify all Configuration Items (CI) involved
· Detailed text description of how/why the incident occurred; root cause?
3. What was the impact on the business?
· Number of users impacted (costs?) in each business unit?
· Duration of the outage (costs?) for each business unit?
· Loss of customers/business (costs?) for each business unit?
4. What was done to restore service?
· Event time-line detailing steps taken and individuals involved to diagnose reason for service impact and restore service
5. What has or will be done to ensure the incident does not recur?
· Description of workarounds to minimize business impact until the planned solution is implemented
· Verifiable steps that have or will be taken to prevent recurrence.
· Change ticket numbers and implementation dates
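For shops that keep these records outside a Service Management System, the five questions above could be captured in a small record like the one sketched here. The class and field names are hypothetical, chosen only to mirror the five points, and the completeness check is an illustrative addition rather than part of the template itself.

```python
from dataclasses import dataclass, fields


@dataclass
class FivePointRecord:
    """Free-form answers to the five questions (hypothetical field names)."""
    user_manifestation: str = ""      # 1. How the problem appeared to the end user
    technical_description: str = ""   # 2. CIs involved and root cause
    business_impact: str = ""         # 3. Users, duration and losses per business unit
    restoration_steps: str = ""       # 4. Timeline of steps taken to restore service
    recurrence_prevention: str = ""   # 5. Workarounds, verifiable steps, change tickets

    def unanswered(self):
        """Return the names of the points that are still blank."""
        return [f.name for f in fields(self) if not getattr(self, f.name).strip()]


record = FivePointRecord(
    user_manifestation="Order entry screens froze for all branch users.",
    restoration_steps="09:15 freeze reported; 09:40 server rebooted; 09:50 service restored.",
)
print(record.unanswered())
```

A check like `unanswered()` gives the problem owner a quick view of which questions remain open before the record goes to review.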
The 5-Point Template is primarily free-form text and can be added on an enhancement tab or addressed as local customization in most generic Service Management Systems. If not, it can be easily duplicated in a simple DBMS like MS-Access. Each problem should have a unique identifier and, where possible, links to the Incidents caused by the problem and to the Change tickets that either caused or resolved the problem. This is important because in many companies, change is the primary cause of incidents.
The 5-Point Template Header
Each Problem will be linked to one or more Incident or Change records using header and association data like the following:
· Unique identifier (could be the number of the first associated Incident with a “P” prefix)
· Incident/Problem Title (40 characters or less)
· Date/Time Incident Detected/Reported
· Date/Time Incident Began
· Date/Time Service restored
· Duration of incident in Days/Hours/Mins
· Severity of Incident
· Impacted CI Class
· Impacted CI(s)
· Incidents (ID Numbers) caused by this problem. Links to Incident Records.
· Was incident caused by change? If yes, change number/identification? Links to Change Management System.
· Team assigned responsibility for Incident/Problem?
· Current status of the problem and date current status implemented
o In progress
o Pending review by the PM Review Board
o Pending Change Management Action
o Known Error
o Closed through Mitigation
o Closed by Management Exception
· Has an associated Incident recurred since problem closure (Y/N).
o If yes, the problem must be reopened.
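A few of the header rules above (the "P" prefix, the 40-character title, the duration in days/hours/minutes, the status list and the reopen-on-recurrence rule) can be sketched in code. This is an assumption-laden illustration, not a schema from any product: the class, field and method names are hypothetical, only a subset of the header fields is modeled, and reopening is represented by setting the status back to "In progress".

```python
from dataclasses import dataclass
from datetime import datetime
from enum import Enum


class Status(Enum):
    IN_PROGRESS = "In progress"
    PENDING_REVIEW = "Pending review by the PM Review Board"
    PENDING_CHANGE = "Pending Change Management Action"
    KNOWN_ERROR = "Known Error"
    CLOSED_MITIGATION = "Closed through Mitigation"
    CLOSED_EXCEPTION = "Closed by Management Exception"


CLOSED = {Status.CLOSED_MITIGATION, Status.CLOSED_EXCEPTION}


@dataclass
class ProblemHeader:
    title: str
    began: datetime
    restored: datetime
    incident_ids: list            # incident numbers as strings, first one first
    status: Status = Status.IN_PROGRESS

    def __post_init__(self):
        if len(self.title) > 40:
            raise ValueError("title must be 40 characters or less")
        if not self.incident_ids:
            raise ValueError("at least one incident must be linked")

    @property
    def problem_id(self):
        """Number of the first associated incident with a 'P' prefix."""
        return "P" + self.incident_ids[0]

    def duration(self):
        """Incident duration as (days, hours, minutes)."""
        total_minutes = int((self.restored - self.began).total_seconds() // 60)
        days, remainder = divmod(total_minutes, 24 * 60)
        hours, minutes = divmod(remainder, 60)
        return days, hours, minutes

    def record_recurrence(self, incident_id):
        """Link a new incident; a closed problem is reopened (assumed: In progress)."""
        self.incident_ids.append(incident_id)
        if self.status in CLOSED:
            self.status = Status.IN_PROGRESS


header = ProblemHeader(
    title="Application server freeze",
    began=datetime(2009, 8, 3, 9, 15),
    restored=datetime(2009, 8, 4, 11, 45),
    incident_ids=["1001"],
)
print(header.problem_id, header.duration())
```

Keeping the reopen rule inside `record_recurrence` means a recurrence cannot be linked to a closed problem without the status changing, which matches the intent of the last header item.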
Supporting Policy
Problem Management policy should state that all problems remain open, active and highly visible until the final problem resolution is in place and confirmed effective. I highly recommend that targets be established for the expected time to diagnose and schedule the remediation of problems, and that compliance with these targets be incorporated into the performance objectives for each of the managers of the technology support groups. My experience has been that five business days to identify root cause and have a viable solution plan ready to present for Change Management approval is about the correct amount of time given the risk to the business of incident recurrence.
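A target of five business days can be turned into a due date with a few lines of code. The sketch below is a simplification: it assumes a Monday-to-Friday working week and ignores public holidays, and the function name is hypothetical.

```python
from datetime import date, timedelta


def add_business_days(start, days):
    """Return the date that falls `days` business days after `start`.

    Assumes a Monday-to-Friday week; public holidays are not considered.
    """
    current = start
    remaining = days
    while remaining > 0:
        current += timedelta(days=1)
        if current.weekday() < 5:  # 0-4 are Monday to Friday
            remaining -= 1
    return current


# An incident reported on Tuesday, August 4th, 2009.
reported = date(2009, 8, 4)
print(add_business_days(reported, 5))  # 2009-08-11
```

An organization with its own holiday calendar would need to extend the weekday test accordingly.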
It is also recommended that a weekly Problem Management Status Report detailing the specifics of high impact incidents and efforts toward problem identification and resolution be circulated to IT and business unit managers to maintain visibility of the risks until properly mitigated. The number of open problems, total duration of downtime and average time to mitigate problems should also be tracked and reported by technology team.
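The three figures mentioned above could be computed per team along these lines. The records, team names and dictionary keys are invented for illustration; a real report would draw them from the problem repository.

```python
from collections import defaultdict

# Hypothetical problem records; days_to_mitigate is None while a problem is open.
problems = [
    {"team": "Server", "open": True, "downtime_hours": 4.0, "days_to_mitigate": None},
    {"team": "Server", "open": False, "downtime_hours": 2.5, "days_to_mitigate": 6},
    {"team": "Network", "open": False, "downtime_hours": 1.0, "days_to_mitigate": 3},
    {"team": "Network", "open": False, "downtime_hours": 0.5, "days_to_mitigate": 5},
]


def weekly_summary(records):
    """Open problems, total downtime and average days to mitigate, per team."""
    totals = defaultdict(lambda: {"open": 0, "downtime": 0.0, "days": []})
    for rec in records:
        team = totals[rec["team"]]
        team["downtime"] += rec["downtime_hours"]
        if rec["open"]:
            team["open"] += 1
        elif rec["days_to_mitigate"] is not None:
            team["days"].append(rec["days_to_mitigate"])

    summary = {}
    for name, team in totals.items():
        days = team["days"]
        summary[name] = {
            "open_problems": team["open"],
            "total_downtime_hours": team["downtime"],
            # No average is reported until at least one problem is mitigated.
            "avg_days_to_mitigate": sum(days) / len(days) if days else None,
        }
    return summary


report = weekly_summary(problems)
for team_name, figures in report.items():
    print(team_name, figures)
```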
The policy should specify which incidents will be initially addressed by the Problem Management process and which are exempt from the current formal process. Using these guidelines, create a problem record in the 5-Point Template format for all unresolved high impact incidents for the past quarter. This will form the initial Problem Management workload to begin the control and operational phases of Problem Management. When the load is complete, it is time to start the engine and execute the defined Problem Management process.
Summary
Employing the 5-Point Template and surrounding policy and process is an excellent way for small to medium sized companies to get the benefit of Problem Management without a large investment in high end service management systems that support ITIL processes. This article was not meant to be a complete guide to Problem Management implementation but rather to plant the seed that you don’t need to spend lots of money to get the benefit of ITIL processes. There is also a basic Problem Management governance structure, which is the topic of another article I plan to publish that will shed more light on the process. Start simple and mature the processes as you go. One of my favorite expressions regarding ITIL implementation is that you can eat the ITIL elephant if you do it one bite at a time.
For those interested in learning more about the ITIL value equation for your company or obtaining on-site assistance planning and implementing ITIL processes, please contact me directly. If you would like to learn more about ITIL in general, I would suggest starting with the ITIL V3 Foundation class and acquiring the five volume set of manuals through Amazon.com or other web sources. The books are generally priced at a little over $100 per volume. The ITIL name is owned by the UK Office of Government Commerce (OGC), and the books are published by The Stationery Office (TSO). Books in the set include:
· ITIL – Service Strategy (ISBN 978-0-11-331045-6)
· ITIL – Service Design (ISBN 978-0-11-331047-0)
· ITIL – Service Transition (ISBN 978-0-11-331048-7)
· ITIL – Service Operation (ISBN 978-0-11-331046-3)
· ITIL – Continual Service Improvement (ISBN 978-0-11-331049-4)






