Functional safety - which standards apply and what are their scopes?

I'm struggling to grasp the application of the concept of functional safety.

I think I understand the basic concept - where we are relying upon an electrical circuit/system, we need to be able to rely upon the safety function proportionally to the risk.

However, this is pretty much my limit. I want to understand the scope of the standards and which standard applies, and the BS scope sections aren't making things much clearer. A simple example that I have come across is with ventilation systems, where a fan is ventilating an area to prevent a build-up of nasty chemicals or explosive gases.

Is there anything preventing a system from having two independent fans with users monitoring them?

Is this required to be a functional safety system? If so, is this under the 13849 standard or another? What should the system look like?

Thanks

  • Oh gosh what a huge subject! But a really good question.

    The basic standard is IEC 61508. There are a whole heap of industry specific standards that come off this one, but that's where they all start. But I'll warn you, 61508 is HUGE. It does tell you how to work out the answer to your fan question for example, but in pages and pages of stuff even when you've found the right part (it has seven parts). ISO 13849 is one of these standards, and yes probably is the right one in your case (unfortunately I don't have a copy of it).

    However the basic principles are pretty simple, and you've nailed it with "we need to be able to rely upon the safety function proportionally to the risk". And the risk depends on what other mitigations you have in place, so yes, if you have two fans and you can demonstrate that you can detect and repair a failure of the first fan before the second fan will credibly fail, then that's a perfectly good mitigation - provided you can show that the detection itself is sufficiently reliable. And that it's not likely (within the level of risk that you're prepared to accept) that both fans will fail, for example due to a power supply failure.
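
    Just to make that sum concrete, here's a rough back-of-an-envelope sketch (in Python, only because it's easy to read) of the sort of calculation involved for two monitored fans. The failure rate, repair time and common-cause beta factor are purely illustrative numbers I've made up, not figures from any standard or real project:

```python
# Rough sketch: rate of losing both fans at once, for two monitored fans
# with a common-cause (beta factor) contribution.  All numbers are
# illustrative only - they are not from any standard or real analysis.

fan_failure_rate = 1e-4   # failures per hour, per fan (made up)
mean_repair_time = 24.0   # hours to detect and repair a failed fan (made up)
beta = 0.05               # fraction of failures assumed common cause, e.g. shared supply

# Independent part: one fan fails, then the other fails before the first is repaired.
independent_rate = (2 * (1 - beta) * fan_failure_rate
                    * ((1 - beta) * fan_failure_rate * mean_repair_time))

# Common-cause part: one event takes out both fans together.
common_cause_rate = beta * fan_failure_rate

total_loss_of_ventilation_rate = independent_rate + common_cause_rate

print(f"Independent double failure: {independent_rate:.1e} per hour")
print(f"Common cause failure:       {common_cause_rate:.1e} per hour")
print(f"Total:                      {total_loss_of_ventilation_rate:.1e} per hour")
```

    With numbers like those the common-cause term dominates by roughly a factor of ten, which is exactly why the shared power supply (or the shared frost) matters more than the second fan.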

    I should mention at this point that I was the independent safety assessor for the tunnel ventilation system on Crossrail (Elizabeth Line) which was a rather extreme case of exactly this problem! In that case the contractor and Crossrail decided that the consequences of ventilation failure occurring at the same time as a tunnel fire were so high that a high integrity fan control and monitoring system was required. But that was unusual.

    So back to your question: IEC 61508, or any other related standard such as ISO 13849, won't give you a direct answer to your problem. What they will say is that you need to assess all the possible failure modes and consequences and then determine all potential mitigations. UK law (HASAWA) then says that you must put these mitigations in unless you can show they are not reasonably practicable (i.e. the cost of putting the measure in, e.g. by using a high integrity system, would be disproportionate to the risk). 61508 does give guidance as to how to consider duplication, and more complex arrangements such as 3 out of 5 and so on. But if you're going there you probably want someone to guide you.
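
    For what it's worth, the raw arithmetic behind duplication and M-out-of-N arrangements is just combinatorics. A quick sketch with a made-up per-channel figure, and - importantly - assuming the channels fail independently, which is exactly the assumption 61508 makes you challenge:

```python
from math import comb

def prob_system_fails(n_channels: int, k_needed: int, p_channel_fail: float) -> float:
    """Probability that fewer than k_needed of n_channels are working,
    assuming each channel fails independently with probability p_channel_fail."""
    return sum(
        comb(n_channels, failed)
        * p_channel_fail**failed
        * (1 - p_channel_fail)**(n_channels - failed)
        for failed in range(n_channels - k_needed + 1, n_channels + 1)
    )

p = 0.01  # illustrative probability that a single channel has failed on demand
print(f"Single channel:                   {p:.0e}")
print(f"1-out-of-2 (either fan will do):  {prob_system_fails(2, 1, p):.1e}")
print(f"3-out-of-5 (need 3 of 5 working): {prob_system_fails(5, 3, p):.1e}")
```

    The moment you add a realistic common-cause contribution those impressive numbers collapse, which is where the guidance in 61508 (and the person guiding you) earn their keep.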

    Very simplistically, you basically only have to use a high integrity system where failures cannot be mitigated by another system. For example in railway signalling, if the control system sets the signal to green when it should be red there's not much that can be done to stop trains crashing, so we make those very high integrity systems. But it seems in your case that you may well be able to achieve the effect (ensuring that if a fan fails there isn't an explosive gas build-up) by suitable duplication and monitoring. Just don't forget those pesky common mode failures like power supplies, or both fans freezing up in winter...

    Cheers,

    Andy 

  • Thank you for being so helpful Andy. So essentially there is nothing within the standard forcing you to use electronic means? It only comes into play if a functional safety circuit has been chosen by the designer based upon the risk assessment (where a safe system of work could be perfectly adequate), as opposed to "this system has a functional safety impact so it requires a safety circuit"?

  • Hi,

    The way I'd normally expect this to be looked at is that you decide the acceptable hazard rate (say in this case the probability that a person is exposed to hazardous fumes), and then you work down the various measures that will prevent this happening. Some of this will be probabilities of technical failures, and some will be human issues - people not following processes, or just making mistakes. When you've factored all those in you can see if you need a high integrity system to make the numbers work.

    There are two main techniques for this: Fault Tree Analysis (FTA) and Layers of Protection Analysis (LOPA). They basically do the same thing, just visualised in slightly different ways.

    In FTA you start with the event you don't want to happen, then work out what combination of things would immediately lead to that happening (e.g. "Fans fail" and "Failure of fans does not raise alarm"), then you work out what combination of events would lead to those, and so on and so on until you get to fundamental failures you can put a probability of failure on (e.g. "fan bearing breaks", "operator does not set alarm", and the one you are trying to find the correct answer to "electronic system fails to control fan speed"). You then play around with the probability you can control (the integrity of the electronics) until you get the answer you want. OR you put in another human or technical protection.
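
    If it helps to see the arithmetic, a fault tree is just AND gates (multiply the probabilities, assuming independence) and OR gates (roughly add them, while they're small). A toy version of the fan example, with entirely made-up figures:

```python
# Toy fault tree for "fan failure goes undetected".  Every figure here is
# made up purely to show the mechanics - this is not a real analysis.

# Basic events (probability of being in the failed state on a given demand)
p_fan_bearing_breaks   = 1e-3
p_fan_controller_fails = 1e-3   # the figure you'd tune for the electronics
p_alarm_not_set        = 1e-2   # operator error
p_alarm_hardware_fails = 1e-3

# OR gate: the fan stops for either reason (small-probability approximation)
p_fan_fails = p_fan_bearing_breaks + p_fan_controller_fails

# OR gate: the failure is not annunciated
p_no_alarm = p_alarm_not_set + p_alarm_hardware_fails

# AND gate (top event): the fan fails AND nobody is alerted (assumes independence)
p_top_event = p_fan_fails * p_no_alarm

print(f"P(fan fails)              = {p_fan_fails:.1e}")
print(f"P(no alarm raised)        = {p_no_alarm:.1e}")
print(f"P(undetected fan failure) = {p_top_event:.1e}")
```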

    In LOPA you start with the failure you are looking at, e.g. "fan fails", and think of all the measures that could come between that and the hazard, and again put figures on them to see if they give the final rate you want. It's an easier process, but doesn't handle a complicated combination of things going wrong so well. But it is often good for process control systems where there may just be a series of individual barriers between a failure and a hazard.
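
    Numerically LOPA is even simpler: take the frequency of the initiating event and multiply through the probability that each independent layer fails to stop it. Again a sketch with invented numbers, just to show the shape of the sum:

```python
# Toy LOPA sum for "fan fails" leading to a hazardous gas build-up.
# Frequencies and probabilities of failure on demand (PFD) are invented.

initiating_event_freq = 0.1   # fan failures per year

protection_layers = {
    "fan failure alarm / monitoring":      0.1,   # PFD: misses the failure 1 time in 10
    "operator responds to the alarm":      0.1,   # PFD: operator misses or ignores it
    "fixed gas detection trips the plant": 0.01,  # PFD
}

mitigated_freq = initiating_event_freq
for layer, pfd in protection_layers.items():
    mitigated_freq *= pfd

print(f"Unmitigated frequency: {initiating_event_freq} per year")
print(f"Mitigated frequency:   {mitigated_freq:.1e} per year")
# Compare the mitigated figure with your tolerable hazard rate; if it isn't low
# enough, add another layer or make one of the existing layers higher integrity.
```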

    Oh, and all this typically depends on working out the probability of humans making mistakes. And that's not something I would personally recommend engineers do themselves. If you get to work with really good Human Factors people you realise how good they are at that stuff and how much we can get it wrong.

    If you want to look this up, try searching for "SIL allocation". (Safety Integrity Level allocation.) Or of course LOPA and FTA - you might want to start with LOPA.
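
    Once the FTA or LOPA tells you how much the safety function has to achieve, the SIL falls out of the bands in IEC 61508-1. For a low-demand safety function the bands of average probability of failure on demand (PFDavg) look like this - a sketch from memory, so do check the standard itself, and note that high-demand / continuous mode uses failure rates per hour instead:

```python
# IEC 61508 low-demand SIL bands, expressed as PFDavg ranges (from memory -
# check the standard before relying on these).
SIL_BANDS = [
    (1e-2, 1e-1, "SIL 1"),
    (1e-3, 1e-2, "SIL 2"),
    (1e-4, 1e-3, "SIL 3"),
    (1e-5, 1e-4, "SIL 4"),
]

def sil_for_required_pfd(pfd_required: float) -> str:
    """Return the SIL band that a required PFDavg falls into."""
    for low, high, sil in SIL_BANDS:
        if low <= pfd_required < high:
            return sil
    return "beyond SIL 4" if pfd_required < 1e-5 else "no SIL required"

# e.g. the analysis says the safety function may fail at most 1 time in 500 demands:
print(sil_for_required_pfd(1 / 500))   # SIL 2
```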

    Incidentally, I glibly said at the top "decide the acceptable hazard rate", i.e. decide how many people you are happy to kill in a thousand / million year period. Practically this can be a nightmare, trying to get people to agree to this: no-one wants to take responsibility for this figure, but equally no-one's going to give you unlimited funds. So although I've given you all the theory above, in practice very often people just do what was done in another application to avoid anyone having to make that judgement. Which is fine until someone dies and the questions start being asked about your evidence that you've done enough.

    A pragmatic midway is to use the same approach as LOPA and FTA to think about all the ways that a hazard could occur, and all the barriers that would prevent it, but instead of putting hard numbers on the probabilities, just use a judgement of "does this look like enough?" More times than we might like to admit this is what we do in practice, and often it is enough, depending on the level of risk. It will still give you a clue about whether the electronics is the most important bit - in which case maybe it does need to be high integrity - or whether actually there's a huge number of other things that would need to go wrong first, some of which seem incredible. If the risk is that someone's exposed to fumes and may need a day or two in hospital, this approach may be enough; if the result of fume build-up is an explosion that rocks half a town (Flixborough!), it won't be.

    It's a very common issue in fault monitoring systems. You put in a fault monitoring system to improve reliability, but then someone realises that it could also detect a potential safety issue arising. The question is, at that point does the monitoring system itself become safety critical? It all depends on the probability of the fault occurring in the first place, and on whether another system (which could be the remaining levels of human inspection), combined with the low integrity monitoring system, provides a sufficient probability of detection. This is an area I've been (and am) doing a lot of work on in the rail industry.

    Sooner or later the integrity of monitoring systems will need to get higher as they get more and more functional safety loaded onto them - because we don't want humans doing monitoring / testing for safety functions, partly because they make errors (and, sadly, partly because they're expensive), but mainly, from my point of view, because humans doing inspections are exposing themselves to risk. However, high integrity monitoring systems are very expensive, so it is a complex calculation of risk. Welcome to my world...
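
    The sum behind that question is the same style as before - a sketch, with invented figures, of how a modest-integrity monitor plus the remaining human inspection might (or might not) give enough probability of detection between them:

```python
# Illustrative only: does the monitoring system need to be safety-rated?
# Two (assumed independent) ways of catching the fault before it matters.

p_monitor_misses_fault    = 1e-2   # low integrity monitoring system fails to flag it
p_inspection_misses_fault = 1e-1   # periodic human inspection also misses it

p_fault_undetected = p_monitor_misses_fault * p_inspection_misses_fault
print(f"P(fault goes undetected) = {p_fault_undetected:.0e}")

# If that combined figure meets the target, the monitor alone needn't be high
# integrity - but only if the two routes really are independent, and only for
# as long as the human inspections actually keep happening.
```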

    Hope that helps. I got into this world almost by accident: I spent years developing non-safety-critical electronics, and then found myself developing the highest level of safety-critical systems. Exactly the same electronics, but now in an application where if it failed the wrong way many people were likely to die - and nothing was going to stop that. It's a really interesting world, trying to work out if you've done enough. (And yes, I do sleep at night!) But as above, a lot of my time is currently spent checking people's arguments that they don't need any level of integrity, because their system will never carry that much responsibility by itself - which again is a really interesting area.

    Cheers,

    Andy