The Beetil Blog

Responding to incidents

In the previous blog post in this Sensible Service Management Series we looked at the core of servicing customers: managing Requests.

In Beetil, everything we respond to is called an Incident.  Let’s talk about incidents in the strictest sense of the word: dealing with things going wrong.

Everything we talked about last time regarding Requests still applies. We will add some more considerations now for when something needs to be fixed.   Recall we talked about three main capabilities:

1. Record

Provide a single point of contact with multiple channels to access it.  Make sure you keep a record of all requests as Incidents in Beetil. Record all interactions with your users. Track all your responses and record what you did about them.

2. Respond

Make sure someone owns every request. Build up information in the Beetil knowledgebase.  Use external information.  Provide scripts for how to deal with common requests. Use Beetil to pass requests to someone else.  Regularly monitor how long requests are taking and chase up the slow ones.

3. Report

Make sure you don’t drop the ball on any requests.  Look for trends to help you improve your service.

OK, so now let’s extend that.  Sometimes a user requests help with a service not working as expected: something needs to be fixed. That is an “incident” in the strictest use of the word.

Sometimes that “user” reporting an incident is an internal person picking up an error before it affects any of the “real” users consuming the service.  It can even be a software program detecting the error and automatically alerting us.

Elaborating on “Responding”

In any case, if something needs fixing we need to elaborate on the “2. Respond” part of the Request process we described last time.

Here is how we expand that part of the process:

2.1. Categorise

We didn’t talk much about this last time.  It helps to categorise all incidents, whether general requests or issues to be fixed.  It is particularly important when we are fixing stuff: a general idea of what type of incident it is lets us determine how serious it is and how wide and severe the impact is, and lets us pass it to the right person first time.
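As a rough illustration of why categorising early helps, here is a minimal Python sketch that turns a category plus an impact estimate into a priority and a first assignee.  The category names, team names, thresholds and priority rule are all hypothetical.  They are not Beetil features, just one way of writing the idea down.

```python
from dataclasses import dataclass

# Hypothetical routing table: category -> team that should see it first.
ROUTING = {
    "email": "messaging-team",
    "network": "network-team",
    "general-request": "service-desk",
}


@dataclass
class Incident:
    summary: str
    category: str
    users_affected: int  # how wide the impact is
    service_down: bool   # how severe the impact is


def triage(incident: Incident) -> tuple[str, str]:
    """Return (priority, team) for an incident. The rules are illustrative only."""
    wide = incident.users_affected > 10  # assumed threshold for "wide" impact
    if incident.service_down and wide:
        priority = "high"
    elif incident.service_down or wide:
        priority = "medium"
    else:
        priority = "low"
    # Categories we do not recognise fall back to the service desk.
    team = ROUTING.get(incident.category, "service-desk")
    return priority, team
```

The point is not the particular rule, but that a category and an impact estimate recorded up front are enough to decide who gets the incident and how urgently.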

2.2. Diagnose

This is where keeping records of what we did in the past, and building up the knowledge base, really pay off.  Search Beetil’s Incident records, Problem records, and the Knowledgebase, to see if we know what the cause is and how to fix it.  If you get a match, fix it if you can or pass it to somebody who can. This is called Level 1 support.

If you don’t get a match or can’t fix it, pass it to Level 2 support: those who have the technical skills to do specialist diagnosis and resolution.  If they can’t fix things, they refer the incident to Level 3 support: the folk who built or supplied the stuff that is not working, often a supplier external to your organisation.
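The hand-off between support levels described above can be sketched in a few lines of Python.  This is only a toy model: the three handler functions, the symptom strings and the fixes are invented for illustration, and a real Level 1 lookup would search Beetil’s records rather than a dictionary.

```python
from typing import Optional

# Stand-in for the Incident records, Problem records and Knowledgebase.
KNOWN_FIXES = {"printer offline": "Restart the print spooler"}


def level_1(symptom: str) -> Optional[str]:
    # Level 1: look for a match in what we already know.
    return KNOWN_FIXES.get(symptom)


def level_2(symptom: str) -> Optional[str]:
    # Level 2: specialist diagnosis (stubbed out here).
    if "database" in symptom:
        return "Rebuilt the corrupted index"
    return None


def level_3(symptom: str) -> Optional[str]:
    # Level 3: whoever built or supplied the failing component (stubbed out).
    return "Fix supplied by the vendor"


def pass_up_levels(symptom: str) -> tuple[int, str]:
    """Try each support level in turn and return (level, resolution)."""
    for level, handler in enumerate((level_1, level_2, level_3), start=1):
        resolution = handler(symptom)
        if resolution is not None:
            return level, resolution
    raise RuntimeError("No support level could resolve the incident")
```

For example, `pass_up_levels("printer offline")` is resolved at Level 1 from the known fixes, while an unrecognised symptom falls all the way through to Level 3.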

2.3. Escalate

All this passing around amongst support groups is called Functional Escalation, but when we talk of “escalating” we usually think of Hierarchical Escalation, i.e. telling somebody more senior.  We hierarchically escalate because:

  • The incident impact is serious enough that they should know about it
  • A fix can’t be found
  • Someone is not responding fast or well enough considering the severity of the incident

That person might make a call that this is a Major Incident.  This means dropping the normal process described here and switching to a crisis-response process that we will talk about in a future blog post.
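The three reasons for hierarchical escalation listed above can be written as a simple check.  The sketch below is an assumption-laden example: the severity scale (1 is most severe), the time limits and the field names are all made up, and you would tune them to your own service levels.

```python
from dataclasses import dataclass
from datetime import timedelta


@dataclass
class IncidentStatus:
    severity: int                       # 1 = most severe (hypothetical scale)
    fix_found: bool
    time_since_last_update: timedelta


# Hypothetical limits on how long an incident may go without progress.
MAX_SILENCE = {
    1: timedelta(hours=1),
    2: timedelta(hours=4),
    3: timedelta(days=1),
}


def escalation_reasons(status: IncidentStatus, diagnosis_exhausted: bool) -> list[str]:
    """Return the reasons, if any, to tell somebody more senior."""
    reasons = []
    if status.severity == 1:
        reasons.append("impact is serious")
    if diagnosis_exhausted and not status.fix_found:
        reasons.append("no fix can be found")
    # Unknown severities get the most relaxed limit.
    limit = MAX_SILENCE.get(status.severity, timedelta(days=1))
    if status.time_since_last_update > limit:
        reasons.append("response too slow for this severity")
    return reasons
```

An empty list means there is nothing to escalate.  Anything else is worth a conversation with somebody more senior.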

2.4. Resolve

Somebody is not getting the service they expect.  The incident process must focus on restoring that service.  Sometimes that is not the same thing as fixing the underlying problem (fixing the Problem is a different process we will talk about some other time).  If we have to fix the problem in order to get the user back on track again we will, but sometimes there is a Workaround: a way to get them back up and running without fixing anything.  For example, with some software, simply logging off and on again may get them around an issue and working again.  Or rebooting a server may make the problem go away.  (There is an old joke that “a problem gone is a problem solved”.)

You can find Workarounds in Beetil’s Problem records, and you can also record them in the knowledgebase.

Eventually a Problem may cause so many Incidents that we have to hold the user up without a Workaround while we properly diagnose it and nail it once and for all.  Whether that inconvenience is outweighed by the ongoing cost of recurring incidents is a management call.  But in general the Incident process takes whatever Workarounds or temporary fixes it can to get service restored to the user as quickly as possible.

2.5. Close

This applies to all Incidents and Requests.  Before you close the ticket make sure:

  • You tell the user it is done
  • You make sure the user thinks so too: that they are happy with the outcome
  • The Incident is properly categorised so our reporting data is useful.
  • The Incident has a record of everything that happened and what workaround or fix you used.  In the future, you or one of your colleagues may be grateful you wrote it down.
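The closing checklist above lends itself to a small guard that refuses to close a ticket until every item is ticked off.  Here is a minimal sketch, assuming a hypothetical `Ticket` record with one field per checklist item.  It is not Beetil’s data model.

```python
from dataclasses import dataclass


@dataclass
class Ticket:
    user_told: bool = False   # we told the user it is done
    user_happy: bool = False  # the user agrees and is happy with the outcome
    category: str = ""        # final category, used for reporting
    notes: str = ""           # what happened and what workaround or fix was used


def closing_blockers(ticket: Ticket) -> list[str]:
    """Return the checklist items still outstanding; an empty list means OK to close."""
    blockers = []
    if not ticket.user_told:
        blockers.append("tell the user it is done")
    if not ticket.user_happy:
        blockers.append("confirm the user is happy with the outcome")
    if not ticket.category.strip():
        blockers.append("categorise the incident")
    if not ticket.notes.strip():
        blockers.append("record what happened and the workaround or fix used")
    return blockers
```

A freshly created `Ticket()` fails all four checks, and a ticket is only safe to close once `closing_blockers` returns an empty list.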

 

ITIL confuses things by talking a lot about finding and fixing the underlying problem, and recovering the broken service, as part of the Incident process.  We will keep it clean by talking about all that as part of the Problem process (coming up in a future blog post).  Keep it nice and crisp:

  • Incident process is about getting the user(s) working again as quickly as possible, however we manage that.
  • Problem process is about fixing the underlying cause(s).

There is a huge body of knowledge out there about Incidents and Requests, which you can investigate further as you need to.  ITIL has a lot (in the version 3 book Service Operation and the Operational Support and Analysis intermediate course).  The Helpdesk Institute produce a lot of useful material too.  COBIT 5 is my choice for a formal definition of what should be happening, what should be produced, and by whom.

For now, start with:

1. Record

2. Respond

2.1 Categorise

2.2 Diagnose

2.3 Escalate

2.4 Resolve

2.5 Close

3. Report

 
