
Fail-Fast vs. Fail-Safe: What Is the Most Reliable Software Strategy?

  • Ackley Wyndam
  • May 19, 2022
  • 8 minute read

I love cooking and use my Thermomix a lot. If you haven’t heard of it, it’s a kitchen robot… well, a magical super cooking machine. Its designers took a fail-safe approach instead of a fail-fast one. That’s a smart choice in this case, but it has drawbacks.

For example, my machine once tried to recover from a failure and ended up in an infinite recovery loop. I literally couldn’t get the food out because the lid was sealed shut. But normally, it’s one of the most reliable devices I own.


Which approach should we take when we are developing our tools and how does that impact our long-term reliability?

Fail-Fast vs. Fail-Safe Approach

In case you aren’t familiar with the terms: a fail-fast system stops immediately and visibly when it hits an unexpected condition. A fail-safe system tries to recover and keep going, even with bad input.

Java leans towards the fail-fast approach, whereas JavaScript leans a bit more towards fail-safe. Their respective approaches to missing values are a good example. In Java, dereferencing a null throws a NullPointerException, which fails the code instantly and clearly. In JavaScript, reading a missing property yields “undefined”, which can propagate through the system before anything fails.
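The same contrast can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the `order` dictionary and both function names are hypothetical. Indexing a missing key fails at once, while `dict.get` quietly hands back `None` and lets it travel onward.

```python
def shipping_label_fail_fast(order: dict) -> str:
    # Indexing raises KeyError right here if "address" is missing,
    # so the failure points at the real cause.
    address = order["address"]
    return f"Ship to: {address}"


def shipping_label_fail_safe(order: dict) -> str:
    # .get() returns None for a missing key; the bad value keeps
    # flowing and surfaces later as a label that reads "None".
    address = order.get("address")
    return f"Ship to: {address}"
```

With an empty order, the first function raises immediately, while the second returns a label that looks valid to the code around it.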

Which One Should We Pick?

This is hard to tell. There’s very little research, and I can’t think of a way to measure this sort of methodology objectively with the scientific method. It has both technical and core business aspects, so it’s hard to determine anything conclusively. What I can say for sure is that this shouldn’t be a decision for senior executives alone. It’s a policy that management should work out together with engineering to mitigate the downside risk. This applies to you whether you’re an engineer or a business leader, and whether you’re a Silicon Valley startup, Amazon or a bank. These principles are universal.

Companies using microservices are probably more committed to some form of fail-safe. Resiliency is a common trait of microservices, and it sits in the fail-safe camp.

Modern approaches to fail-safe try to avoid some of its pitfalls by using thresholds to limit failure. A good example is the circuit breaker, both the physical kind and the software-based kind. A circuit breaker disconnects functionality that keeps failing so it doesn’t produce a cascading failure.
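A software circuit breaker can be sketched roughly as follows. This is a minimal, single-threaded illustration rather than a production implementation; the class name, the `max_failures` and `reset_after` parameters, and the injectable `clock` are all assumptions made for the example.

```python
import time


class CircuitOpenError(RuntimeError):
    """Raised when the breaker refuses to call the protected function."""


class CircuitBreaker:
    def __init__(self, max_failures=3, reset_after=30.0, clock=time.monotonic):
        self.max_failures = max_failures  # consecutive failures before opening
        self.reset_after = reset_after    # seconds to wait before a trial call
        self.clock = clock                # injectable so tests can control time
        self.failures = 0
        self.opened_at = None             # None means the circuit is closed

    def call(self, func, *args, **kwargs):
        if self.opened_at is not None:
            if self.clock() - self.opened_at < self.reset_after:
                # Still cooling down: skip the call instead of piling on.
                raise CircuitOpenError("circuit is open; call skipped")
            # Cooldown elapsed: allow one trial call ("half-open").
            # A single further failure reopens the circuit.
            self.opened_at = None
            self.failures = self.max_failures - 1
        try:
            result = func(*args, **kwargs)
        except Exception:
            self.failures += 1
            if self.failures >= self.max_failures:
                self.opened_at = self.clock()
            raise
        self.failures = 0  # any success closes the circuit fully
        return result
```

The threshold is what limits the damage: after `max_failures` consecutive errors, callers get an immediate `CircuitOpenError` instead of waiting on a dependency that is already struggling.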

Companies that pick the fail-fast approach take some risks but reap some big rewards. When you pick that approach, the failure can be painful if a bug reaches production, but there are two significant advantages:

  • It’s easier to catch bugs in fail-fast systems during the development/debugging cycle
  • These bugs are usually easier to fix

The fail-fast approach handles such bugs better since there’s a lower risk of a cascading effect. A fail-safe environment can try to recover from an error and thereby postpone it. As a result, the developer sees the error at a much later stage and might miss its root cause.

Historically, I’ve preferred fail-fast; I believe it makes systems more stable by the time we reach production. But this is anecdotal and very hard to prove empirically. I think a fail-fast system requires some appetite for risk, both from engineering and from the executives. Maybe even more so from the executives.

Notice that despite that opinion I said that the Thermomix was smart to pick fail-safe. Thermomix is hardware running in an unknown and volatile environment. This means a fix in production would be nearly impossible and very expensive to deploy. Systems like that must survive the worst scenarios.

We need to learn from the previous failures. Successful companies use both approaches, so it’s very hard to pick the best approach.

Hybrid Environment in the Cloud

A more common “strategy” for handling failure is to combine the best aspects of both worlds:

  • Fail-fast when invoking local code or services, e.g. DB
  • Fail-safe when depending on remote resources, e.g. remote web service

The core assumption behind this direction is that we can control our local environment and test it well. Businesses can’t rely on a random service in the cloud. They can build fault-tolerant systems by shielding themselves from external risks while still taking the calculated risks of a fail-fast system.

Defining Failure

When discussing failure, we tend to think of a 500 error page, a crash, etc. Those are serious P1 failures, but they are by no means the only type of failure, or even the worst. A crash usually marks a problem we can fix and even work around by automatically spinning up a new server instance. This is actually a failure we can handle relatively elegantly.

A far more sinister failure is data corruption. A bug can cause bad data to make its way into the database and potentially cause long-term problems. Even security risks and crashes can result from corrupted data and those will be much harder to fix. A fail-fast system can sometimes nip such issues in the bud.

With cloud computing, we’re seeing a rise in defensive programming such as circuit breakers, retries, etc. This is unavoidable, as the assumption behind this is that everything in the cloud can fail. We need to develop core knowledge on the failures we can expect. One approach I found useful is to review logs from long-running integration tests (nightly tests).

An important part of a good QA process is long-running tests that take hours to run and stress the system. When reviewing the logs of these tests, we can sometimes notice issues that didn’t fail but conflict with our assumptions about the system. This can help find the insidious bugs that went through.

Don’t Fix the Bug

Not right away. Well, unless it’s in production, obviously…

We should understand a bug before we fix it. Why didn’t the testing process find it? Is it a cascading effect or is it missing test coverage? How did we miss that?

When developers resolve a bug, they should be able to answer those questions on the issue tracker. Then comes the hard problem: finding the root cause of the failure and fixing the process so such issues won’t happen again. This is obviously an extreme approach to take on every bug, so we need to apply some discretion when we pick the bugs to focus on. But it must always apply to a bug in production. We must investigate bugs in production thoroughly since failure in the cloud can be very problematic to the business, especially when experiencing exponential growth.

Debugging Failure

Now that we have a general sense of the subject, let’s get into the more practical aspects of a blog focused on debugging. There’s no special innovation here. Debugging a fail-fast system is pretty darn easy.

But there are some gotchas, tips, and tricks we can use to promote fail-fast. There are other strategies we can use to debug a fail-safe system.

Ensuring We Fail Fast

Use the following strategies:

  • Throw exceptions – define the contract of every API in the documentation and fail immediately if the API is invoked with out-of-bounds state, values, etc.
  • Enforce this strategy with unit tests – go over every statement made in the documentation for every API and write a test that enforces that behavior
  • If you rely on external sources, create tests for sources that are unavailable from the start, that respond slowly, and that become unavailable suddenly
  • Define low timeouts and never retry

The core idea is to fail quickly. Say we need to invoke an Amazon web service. A networking issue can trigger a failure. A fail-fast system expects such a failure and immediately presents an error to the user.
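The first two strategies in the list above can be sketched like this. Everything here is hypothetical: `set_temperature`, its 0 to 120 range, and the test names are invented for illustration, and the tests are written as plain functions in the style a runner such as pytest would collect.

```python
def set_temperature(celsius):
    """Hypothetical API. Contract: `celsius` is a number from 0 to 120 inclusive."""
    # bool is a subclass of int in Python, so reject it explicitly.
    if isinstance(celsius, bool) or not isinstance(celsius, (int, float)):
        raise TypeError(f"celsius must be a number, got {type(celsius).__name__}")
    # Fail immediately on out-of-bounds values instead of clamping them.
    if not 0 <= celsius <= 120:
        raise ValueError(f"celsius must be within 0..120, got {celsius}")
    return float(celsius)


def test_accepts_values_inside_the_contract():
    assert set_temperature(0) == 0.0
    assert set_temperature(120) == 120.0


def test_rejects_values_outside_the_contract():
    for bad in (-1, 120.5):
        try:
            set_temperature(bad)
        except ValueError:
            continue
        raise AssertionError(f"{bad!r} should have been rejected")
```

Each sentence in the docstring maps to one test, so a change to the contract has to show up in both places.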

Intelligent Failure for Fail-Safe

The core idea isn’t so much to avoid failure, since failure is unavoidable. The core idea is to soften the blow of a failure. Take the Amazon web service example from above: a fail-safe environment could cache responses from Amazon and try to show an older response when a call fails.

The problem here is that users might get out-of-date information, and this might cause a cascading effect. It might also take us longer to find and fix the problem, since the system appears to be in order.
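A cached fallback of this kind might look like the sketch below. It is an assumption-laden illustration: `CachedFallback` and the `fetch` callable are hypothetical names, the cache is an unbounded in-memory dictionary, and every fallback is logged so the mitigation stays visible.

```python
import logging

logger = logging.getLogger("failsafe")


class CachedFallback:
    """Wraps a fetch function and serves the last good value when it fails."""

    def __init__(self, fetch):
        self.fetch = fetch  # hypothetical callable: key -> value, may raise
        self.cache = {}     # last successful value per key

    def get(self, key):
        """Return (value, is_stale)."""
        try:
            value = self.fetch(key)
        except Exception:
            if key not in self.cache:
                raise  # nothing to fall back on, so the failure surfaces
            # Log every mitigation; otherwise stale data hides the outage.
            logger.warning("fetch failed for %r; serving stale value", key, exc_info=True)
            return self.cache[key], True
        self.cache[key] = value
        return value, False
```

Returning the `is_stale` flag alongside the value lets the caller decide whether to tell the user that the data may be out of date.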

The obvious tip here is to log and alert on every failure and mitigation so we can address them. But there’s another hybrid approach that isn’t as common but might be interesting to some.

Hybrid Fail-Safe

A hybrid fail-safe environment starts as a fail-fast environment. This is also true for the testing and staging environments. The core innovation is wrappers that enclose individual components and provide a fail-safe layer. This can be very similar to Cloudflare or Amazon CloudFront providing a cached version of a website.

But how can we apply this in the code or the ops layer?

When the system is nearing production, we need to review the fault points within the system, focusing on external dependencies but also on internal components.

A simplistic example, like the Amazon one from above, fails quickly by default. The fail-safe wrapper can retry the operation and implement various policies. There are ready-made tools that let us define a fail-safe strategy after the fact, e.g. Failsafe, Spring Retry and many others. Some of these tools work at the SaaS API level and can mitigate availability/networking issues.

This has the downside of adding a production component that’s mostly missing in development and QA. But it includes many of the advantages of fail-fast and keeps the code relatively clean.
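As a rough sketch of such a wrapper, the decorator below adds a retry policy around a function that stays fail-fast on its own. It does not reproduce the API of any of the tools mentioned above; `with_retries` and its parameters are hypothetical names chosen for the example.

```python
import functools
import time


def with_retries(attempts=3, delay=0.0,
                 retry_on=(ConnectionError, TimeoutError), sleep=time.sleep):
    """Build a decorator that retries only the listed transient errors."""
    if attempts < 1:
        raise ValueError("attempts must be at least 1")

    def decorate(func):
        @functools.wraps(func)
        def wrapper(*args, **kwargs):
            for attempt in range(1, attempts + 1):
                try:
                    return func(*args, **kwargs)
                except retry_on:
                    if attempt == attempts:
                        raise  # policy exhausted: fail like the original
                    sleep(delay)
                # Any other exception (e.g. a contract violation) is not
                # caught here, so real bugs still fail fast.
        return wrapper
    return decorate
```

The wrapped function stays unchanged for development and QA; only the production wiring applies something like `with_retries(attempts=3, delay=0.5)(fetch)`, which is exactly the trade-off described above.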

Additional Best Practices for All

Here are some best practices you should keep in mind, regardless of the strategy you pick:

  • Run the software in the debugger with exception breakpoints turned on. Exclude APIs that use exceptions to control flow (ugh, please fix those APIs) from the breakpoint. This lets you challenge your assumptions about the reliability of the application
  • Make sure the environment is random. If you use native code, randomize memory locations. Always randomize test execution to promote failure
  • Proper code review – I can’t stress this enough. I love code reviews. I despise nitpicking! When I get a response on variable naming, code styling etc. it pushes my buttons… Sometimes comments like that ignore an actual bug. People hate code review because of that type of nitpicking. Companies should train developers in substantive processes and evaluation.
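The advice to randomize test execution can be sketched as follows, assuming tests are plain callables. `run_in_random_order` is a hypothetical helper, not part of any test framework; the point is that the seed is reported so a failing order can be replayed.

```python
import random


def run_in_random_order(tests, seed=None):
    """Run test callables in a shuffled order; return (seed, order of names)."""
    if seed is None:
        seed = random.randrange(2**32)
    # Print the seed first so a failing run can be reproduced exactly.
    print(f"test order seed: {seed}")
    order = list(tests)
    random.Random(seed).shuffle(order)
    for test in order:
        test()
    return seed, [test.__name__ for test in order]
```

Passing a previously printed seed back in replays the same order, which helps when a failure only appears because one test leaks state into another.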

TL;DR

Failure can come in many shapes and forms. We should accept that failure happens. It happens to Amazon, Facebook and Google despite all their efforts to avoid it. We need to decide on a strategy. Make assumptions and get support from senior management all the way through engineering.

We need to make choices:

  • Do we fail more often and recover quickly?
  • Do we fail rarely but take time to recover?

Software reliability is still a function of QA/testing. But ultimately, failure is inevitable and we need to make strategic choices. I believe most startups should focus on fail-fast, since the growth mindset makes it very hard to keep fail-safe strategies functional. Since we have QA and testing, most of these issues are outliers and they are very hard to optimize for.

Source: hackernoon.com

