Fault Tree Analysis of parallel system for critical services

Hello Everyone,

I'm planning to develop a reliability and maintainability study for data centers, since the client requires one. My method is fault tree analysis. However, I am stuck on the fundamentals of Boolean algebra and am not sure which approach to use. Below is an example.

Assume there are two redundant systems, System 1 and System 2. Both systems have components A, B and C, and these components are assumed to be identical across the two systems (A1, B1, C1 for System 1 and A2, B2, C2 for System 2).

The diagram below shows the approach I have used.

Which of the fault trees above is the more appropriate representation of the overall critical failure?

Regards,

  • First, your question is not well posed. A fault tree for a system failure is determined mainly by the detailed architecture of the system: what is connected to what, and how. You can't sensibly ask the question in the abstract.

    Second, a major issue with FTA is that there are no criteria for correctness. You just make your best guess at a fault tree, and there is no technique to tell you whether you have it right or wrong. If you are not experienced with fault trees, then you need someone who is to inspect your work and comment on it. But that is probably true for almost all the approaches you might want to try.

  • In addition, are we concerned with the probability of occurrence, or with the overall estimated downtime (including failures), at each of the "outputs" of the "gates" in this kind of FTA?

  • So, apologies if any of what follows is preaching about "sucking eggs", but sometimes in a Forum like this, you don't know where to begin.

    An OR gate in this case means that a failure of A, B or C at the input will cause a failure of the system at the level of the output. An AND gate is analogous, except that all of its inputs must fail before the output fails.

    So, for a dual power supply to a cubicle full of dual redundant equipment, total system outage is experienced if Supply A AND Supply B fail.
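    To put numbers on the gates, here is a minimal Python sketch. It assumes the input events are statistically independent (no common-cause failures), which is a simplification, and the probabilities used are hypothetical placeholders rather than real data. The helper names `or_gate` and `and_gate` are illustrative only.

```python
from math import prod


def or_gate(probs):
    """P(output fails) for an OR gate: at least one input fails.

    Assumes the input events are independent.
    """
    return 1.0 - prod(1.0 - p for p in probs)


def and_gate(probs):
    """P(output fails) for an AND gate: every input fails.

    Assumes the input events are independent.
    """
    return prod(probs)


# Hypothetical failure probabilities over some mission time.
p_supply_a = 0.01
p_supply_b = 0.01

# Total outage of the dual-fed cubicle: Supply A AND Supply B fail.
p_total_outage = and_gate([p_supply_a, p_supply_b])

# A single system that fails if A, B or C fails.
p_single_system = or_gate([0.01, 0.02, 0.03])
```

    Under the independence assumption, the AND gate multiplies small probabilities together, which is why redundancy looks so good on paper. Any shared cause (one upstream breaker, one cable route) breaks that assumption and needs its own event in the tree.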

    If we want to look at system failure, however, then it would be slightly more complicated.

    For example, if the cubicle only contains two identical system components A and B, then a failure of (power supply A AND component B) OR (power supply B AND component A) leads to a system failure.
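    The Boolean expression above can be checked by brute force. The sketch below enumerates every combination of failed and working items, evaluates the expression exactly as written, and sums the probability of the failing states. It assumes independent failures, and the names (`ps_a`, `comp_b`, and so on) and probabilities are hypothetical. It covers only the cross-combinations named above; other combinations a real tree might include are not modelled.

```python
from itertools import product

NAMES = ["ps_a", "ps_b", "comp_a", "comp_b"]


def system_fails(ps_a, ps_b, comp_a, comp_b):
    """Top event as stated in the text; True means 'has failed'."""
    return (ps_a and comp_b) or (ps_b and comp_a)


def top_event_probability(p_fail):
    """Sum the probability of every state in which the top event occurs.

    p_fail maps each name in NAMES to its failure probability.
    Assumes the four basic events are independent.
    """
    total = 0.0
    for state in product([False, True], repeat=len(NAMES)):
        weight = 1.0
        for name, failed in zip(NAMES, state):
            weight *= p_fail[name] if failed else 1.0 - p_fail[name]
        if system_fails(*state):
            total += weight
    return total


# Hypothetical numbers, for illustration only.
p = {"ps_a": 0.1, "ps_b": 0.1, "comp_a": 0.1, "comp_b": 0.1}
p_top = top_event_probability(p)
```

    Enumeration is exponential in the number of basic events, so it only suits small trees, but it makes a useful cross-check on a hand-drawn gate structure.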

    For a data-centre, the overall resilience can be complex, and depends on how the resilience is built into the systems and the power supplies, along with the dependencies of each system on the other systems.

    For example, we would need to know not only whether we had multiple redundant supplies, but also what failure of each of those did to each system. So, first we need to consider things like this: network equipment each fed by n+1 or n+n supplies from diverse sources could be quite reliable, whereas a server fed from only one of the power sources will cause part of a system to fail if that supply fails, unless a "fully redundant and duplicated" server fed from the other source can immediately take over. Then we'd need to consider the inter-dependencies of the systems on each other:

    • Computer system fails if [at least part of] the data network fails
    • Computer system relies on the Storage Area Network (SAN)
    • SAN fails if [at least part of] the data network fails
    • Data network (or access to it) may lose some functionality if the SAN fails, for example if access to SAN is used for account credentials

    etc.
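    The dependencies listed above can be sketched as a small graph, with failures propagated until nothing changes. This is only an illustration: the system names are hypothetical labels for the items in the list, and partial loss of functionality (such as the network losing credential lookups when the SAN fails) is crudely treated as full failure.

```python
# Each key fails if any system in its set has failed.
# Simplification: partial degradation is modelled as full failure.
DEPENDS_ON = {
    "computer": {"network", "san"},
    "san": {"network"},
    "network": {"san"},
}


def affected(initially_failed, depends_on=DEPENDS_ON):
    """Return every system that ends up failed, given the initial failures.

    Repeats the propagation until a pass adds nothing new, so circular
    dependencies (network <-> SAN) terminate cleanly.
    """
    failed = set(initially_failed)
    changed = True
    while changed:
        changed = False
        for system, deps in depends_on.items():
            if system not in failed and deps & failed:
                failed.add(system)
                changed = True
    return failed


after_san_loss = affected({"san"})
after_computer_loss = affected({"computer"})
```

    In this toy model a SAN failure takes out everything, while a computer failure stays contained. That asymmetry is the kind of thing the dependency list is meant to expose before any probabilities are attached.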

    This type of thinking is necessary for each "mission critical" element of the data centre to determine its overall "uptime" and really has a more enterprise focus than simple power supply failure analysis.