Functional safety - which standards apply and what are their scopes?

I'm struggling to grasp the application of the concept of functional safety.

I think I understand the basic concept - where we are relying upon an electrical circuit/system, we need to be able to rely upon the safety function proportionally to the risk.

However, this is pretty much my limit. I want to understand the scope of the standards and which standard applies, and the BS scope sections aren't making things much clearer. A simple example that I have come across is with ventilation systems, where a fan is ventilating an area to prevent a build-up of nasty chemicals or explosive gases.

Is there anything preventing a system from having two independent fans with users monitoring them?

Is this required to be a functional safety system? If so, is this under the 13849 standard or another? What should the system look like?

Thanks

  • Oh gosh what a huge subject! But a really good question.

    The basic standard is IEC 61508. There are a whole heap of industry specific standards that come off this one, but that's where they all start. But I'll warn you, 61508 is HUGE. It does tell you how to work out the answer to your fan question for example, but in pages and pages of stuff even when you've found the right part (it has seven parts). ISO 13849 is one of these standards, and yes probably is the right one in your case (unfortunately I don't have a copy of it).

    However the basic principles are pretty simple, and you've nailed it with "we need to be able to rely upon the safety function proportionally to the risk". And the risk depends on what other mitigations you have in place, so yes, if you have two fans and you can demonstrate that you can detect and repair a failure of the first fan before the second fan will credibly fail, then that's a perfectly good mitigation - provided you can show that the detection itself is sufficiently reliable. And that it's not likely (within the level of risk that you're prepared to accept) that both fans will fail, for example due to a power supply failure.
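
    Just to make that sum concrete, here's a rough back-of-an-envelope sketch (in Python, only because it's easy to read) of the sort of calculation involved for two monitored fans. The failure rate, repair time and common-cause beta factor are purely illustrative numbers I've made up, not figures from any standard or real project:

```python
# Rough sketch: rate of losing both fans at once, for two monitored fans
# with a common-cause (beta factor) contribution.  All numbers are
# illustrative only - they are not from any standard or real analysis.

fan_failure_rate = 1e-4   # failures per hour, per fan (made up)
mean_repair_time = 24.0   # hours to detect and repair a failed fan (made up)
beta = 0.05               # fraction of failures assumed common cause, e.g. shared supply

# Independent part: one fan fails, then the other fails before the first is repaired.
independent_rate = (2 * (1 - beta) * fan_failure_rate
                    * ((1 - beta) * fan_failure_rate * mean_repair_time))

# Common-cause part: one event takes out both fans together.
common_cause_rate = beta * fan_failure_rate

total_loss_of_ventilation_rate = independent_rate + common_cause_rate

print(f"Independent double failure: {independent_rate:.1e} per hour")
print(f"Common cause failure:       {common_cause_rate:.1e} per hour")
print(f"Total:                      {total_loss_of_ventilation_rate:.1e} per hour")
```

    With numbers like those the common-cause term dominates by roughly a factor of ten, which is exactly why the shared power supply (or the shared frost) matters more than the second fan.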

    I should mention at this point that I was the independent safety assessor for the tunnel ventilation system on Crossrail (Elizabeth Line) which was a rather extreme case of exactly this problem! In that case the contractor and Crossrail decided that the consequences of ventilation failure occurring at the same time as a tunnel fire were so high that a high integrity fan control and monitoring system was required. But that was unusual.

    So back to your question: IEC 61508, or any other related standard such as ISO 13849, won't give you a direct answer to your problem. What they will say is that you need to assess all the possible failure modes and consequences and then determine all potential mitigations. UK law (HASAWA) then says that you must put these mitigations in unless you can show they are not reasonably practicable (i.e. the cost of putting the measure in, e.g. by using a high integrity system, would be disproportionate to the risk). 61508 does give guidance as to how to consider duplication, and more complex arrangements such as 3 out of 5 and so on. But if you're going there you probably want someone to guide you.
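
    For what it's worth, the raw arithmetic behind duplication and M-out-of-N arrangements is just combinatorics. A quick sketch with a made-up per-channel figure, and - importantly - assuming the channels fail independently, which is exactly the assumption 61508 makes you challenge:

```python
from math import comb

def prob_system_fails(n_channels: int, k_needed: int, p_channel_fail: float) -> float:
    """Probability that fewer than k_needed of n_channels are working,
    assuming each channel fails independently with probability p_channel_fail."""
    return sum(
        comb(n_channels, failed)
        * p_channel_fail**failed
        * (1 - p_channel_fail)**(n_channels - failed)
        for failed in range(n_channels - k_needed + 1, n_channels + 1)
    )

p = 0.01  # illustrative probability that a single channel has failed on demand
print(f"Single channel:                   {p:.0e}")
print(f"1-out-of-2 (either fan will do):  {prob_system_fails(2, 1, p):.1e}")
print(f"3-out-of-5 (need 3 of 5 working): {prob_system_fails(5, 3, p):.1e}")
```

    The moment you add a realistic common-cause contribution those impressive numbers collapse, which is where the guidance in 61508 (and the person guiding you) earn their keep.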

    Very simplistically, you basically only have to use a high integrity system where failures cannot be mitigated by another system. For example in railway signalling, if the control system sets the signal to green when it should be red there's not much that can be done to stop trains crashing, so we make those very high integrity systems. But it seems in your case that you may well be able to achieve the effect (ensuring that if a fan fails there isn't an explosive gas build-up) by suitable duplication and monitoring. Just don't forget those pesky common mode failures like power supplies, or both fans freezing up in winter...

    Cheers,

    Andy 

  • Thank you for being so helpful Andy. So essentially there is nothing within the standard forcing you to use electronic means? It only comes into play if a functional safety circuit has been chosen by the designer based upon the risk assessment (where a safe system of work could be perfectly adequate), as opposed to "this system has a functional safety impact so it requires a safety circuit"?

  • Hi,

    The way I'd normally expect this to be looked at is that you decide the acceptable hazard rate (say in this case the probability that a person is exposed to hazardous fumes), and then you work down the various measures that will prevent this happening. Some of this will be probabilities of technical failures, and some will be human issues - people not following processes, or just making mistakes. When you've factored all those in you can see if you need a high integrity system to make the numbers work.

    There are two main techniques for this: Fault Tree Analysis (FTA) and Layers of Protection Analysis (LOPA). They basically do the same thing, just visualised in slightly different ways.

    In FTA you start with the event you don't want to happen, then work out what combination of things would immediately lead to that happening (e.g. "Fans fail" and "Failure of fans does not raise alarm"), then you work out what combination of events would lead to those, and so on and so on until you get to fundamental failures you can put a probability of failure on (e.g. "fan bearing breaks", "operator does not set alarm", and the one you are trying to find the correct answer to "electronic system fails to control fan speed"). You then play around with the probability you can control (the integrity of the electronics) until you get the answer you want. OR you put in another human or technical protection.
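
    If it helps to see the arithmetic, a fault tree is just AND gates (multiply the probabilities, assuming independence) and OR gates (roughly add them, while they're small). A toy version of the fan example, with entirely made-up figures:

```python
# Toy fault tree for "fan failure goes undetected".  Every figure here is
# made up purely to show the mechanics - this is not a real analysis.

# Basic events (probability of being in the failed state on a given demand)
p_fan_bearing_breaks   = 1e-3
p_fan_controller_fails = 1e-3   # the figure you'd tune for the electronics
p_alarm_not_set        = 1e-2   # operator error
p_alarm_hardware_fails = 1e-3

# OR gate: the fan stops for either reason (small-probability approximation)
p_fan_fails = p_fan_bearing_breaks + p_fan_controller_fails

# OR gate: the failure is not annunciated
p_no_alarm = p_alarm_not_set + p_alarm_hardware_fails

# AND gate (top event): the fan fails AND nobody is alerted (assumes independence)
p_top_event = p_fan_fails * p_no_alarm

print(f"P(fan fails)              = {p_fan_fails:.1e}")
print(f"P(no alarm raised)        = {p_no_alarm:.1e}")
print(f"P(undetected fan failure) = {p_top_event:.1e}")
```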

    In LOPA you start with the failure you are looking at, e.g. "fan fails", and think of all the measures that could come between that and the hazard, and again put figures on them to see if they give the final rate you want. It's an easier process, but doesn't handle a complicated combination of things going wrong so well. But it is often good for process control systems where there may just be a series of individual barriers between a failure and a hazard.
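
    Numerically LOPA is even simpler: take the frequency of the initiating event and multiply through the probability that each independent layer fails to stop it. Again a sketch with invented numbers, just to show the shape of the sum:

```python
# Toy LOPA sum for "fan fails" leading to a hazardous gas build-up.
# Frequencies and probabilities of failure on demand (PFD) are invented.

initiating_event_freq = 0.1   # fan failures per year

protection_layers = {
    "fan failure alarm / monitoring":      0.1,   # PFD: misses the failure 1 time in 10
    "operator responds to the alarm":      0.1,   # PFD: operator misses or ignores it
    "fixed gas detection trips the plant": 0.01,  # PFD
}

mitigated_freq = initiating_event_freq
for layer, pfd in protection_layers.items():
    mitigated_freq *= pfd

print(f"Unmitigated frequency: {initiating_event_freq} per year")
print(f"Mitigated frequency:   {mitigated_freq:.1e} per year")
# Compare the mitigated figure with your tolerable hazard rate; if it isn't low
# enough, add another layer or make one of the existing layers higher integrity.
```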

    Oh, and all this typically depends on working out the probability of humans making mistakes. And that's not something I would personally recommend engineers do themselves. If you get to work with really good Human Factors people you realise how good they are at that stuff and how much we can get it wrong.

    If you want to look this up, try searching for "SIL allocation". (Safety Integrity Level allocation.) Or of course LOPA and FTA - you might want to start with LOPA.
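
    Once the FTA or LOPA tells you how much the safety function has to achieve, the SIL falls out of the bands in IEC 61508-1. For a low-demand safety function the bands of average probability of failure on demand (PFDavg) look like this - a sketch from memory, so do check the standard itself, and note that high-demand / continuous mode uses failure rates per hour instead:

```python
# IEC 61508 low-demand SIL bands, expressed as PFDavg ranges (from memory -
# check the standard before relying on these).
SIL_BANDS = [
    (1e-2, 1e-1, "SIL 1"),
    (1e-3, 1e-2, "SIL 2"),
    (1e-4, 1e-3, "SIL 3"),
    (1e-5, 1e-4, "SIL 4"),
]

def sil_for_required_pfd(pfd_required: float) -> str:
    """Return the SIL band that a required PFDavg falls into."""
    for low, high, sil in SIL_BANDS:
        if low <= pfd_required < high:
            return sil
    return "beyond SIL 4" if pfd_required < 1e-5 else "no SIL required"

# e.g. the analysis says the safety function may fail at most 1 time in 500 demands:
print(sil_for_required_pfd(1 / 500))   # SIL 2
```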

    Incidentally, I glibly said at the top "decide the acceptable hazard rate", i.e. decide how many people you are happy to kill in a thousand / million year period. Practically this can be a nightmare, trying to get people to agree to this: no-one wants to take responsibility for this figure, but equally no-one's going to give you unlimited funds. So although I've given you all the theory above, in practice very often people just do what was done in another application to avoid anyone having to make that judgement. Which is fine until someone dies and the questions start being asked about your evidence that you've done enough.

    A pragmatic midway is to use the same approach as LOPA and FTA to think about all the ways that a hazard could occur, and all the barriers that would prevent it, but instead of putting hard numbers on the probabilities, just use a judgement of "does this look like enough?" More times than we might like to admit this is what we do in practice, and often it is enough, depending on the level of risk. It will still give you a clue about whether the electronics is the most important bit - in which case maybe it does need to be high integrity - or whether actually there's a huge number of other things that would need to go wrong first, some of which seem incredible. If the risk is that someone's exposed to fumes and may need a day or two in hospital, this approach may be enough; if the result of fume build-up is an explosion that rocks half a town (Flixborough!), it won't be.

    It's a very common issue in fault monitoring systems. You put in a fault monitoring system to improve reliability, but then someone realises that it could also detect a potential safety issue arising. The question is, at that point does the monitoring system itself become safety critical? It all depends on the probability of the fault occurring in the first place, and on whether another system (which could be the remaining levels of human inspection), combined with the low integrity monitoring system, provides a sufficient probability of detection. This is an area I've been (and am) doing a lot of work on in the rail industry.

    Sooner or later the integrity of monitoring systems will need to get higher as they get more and more functional safety loaded onto them - because we don't want humans doing monitoring / testing for safety functions, partly because they make errors (and, sadly, partly because they're expensive), but mainly, from my point of view, because humans doing inspections are exposing themselves to risk. However, high integrity monitoring systems are very expensive, so it is a complex calculation of risk. Welcome to my world...
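
    The sum behind that question is the same style as before - a sketch, with invented figures, of how a modest-integrity monitor plus the remaining human inspection might (or might not) give enough probability of detection between them:

```python
# Illustrative only: does the monitoring system need to be safety-rated?
# Two (assumed independent) ways of catching the fault before it matters.

p_monitor_misses_fault    = 1e-2   # low integrity monitoring system fails to flag it
p_inspection_misses_fault = 1e-1   # periodic human inspection also misses it

p_fault_undetected = p_monitor_misses_fault * p_inspection_misses_fault
print(f"P(fault goes undetected) = {p_fault_undetected:.0e}")

# If that combined figure meets the target, the monitor alone needn't be high
# integrity - but only if the two routes really are independent, and only for
# as long as the human inspections actually keep happening.
```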

    Hope that helps. I got into this world almost by accident: I spent years developing non-safety-critical electronics, and then found myself developing the highest level of safety-critical systems. Exactly the same electronics, but now in an application where if it failed the wrong way many people were likely to die - and nothing was going to stop that. It's a really interesting world, trying to work out if you've done enough. (And yes, I do sleep at night!) But as above, a lot of my time is currently spent checking people's arguments that they don't need any level of integrity, because their system will never carry that much responsibility by itself - which again is a really interesting area.

    Cheers,

    Andy