Microservice Testing: A New Dawn

Wahome
14 min read · Sep 24, 2019


Sunrise at the Maasai Mara. Photo from Javi Lorbada on unsplash.com

“A best effort verification of the system — and making smart bets and tradeoffs given our needs”

Testing is an integral part of the software development process, yet it is arguably its least understood. It's essential for the betterment of software, but it is also incredibly difficult and time consuming, a point that most find rather unfortunate.

What is testing anyway; software testing to be precise?

To a great many people, software testing means something along the lines of: evaluating and verifying that a software product or application does “what it's supposed to do” before being shipped to production. More broadly, it means any activity aimed at evaluating an attribute or capability of a program or system and determining that it meets its “required results”.

Although crucial to software quality and widely practiced by programmers and testers, software testing remains an elusive art, perhaps due to a limited understanding of the ever-evolving principles of software.

There's a pretty good explanation for this, one that requires a trip down memory lane and that we shall get to in a short while.

In the meantime, let’s explore Cindy Sridharan’s 2017 blog post titled Testing Microservices, the sane way in which she makes a revolutionary observation about what testing really is:

“… Bigger companies can afford this level of sophistication, but for the rest of us treating testing as what it really is — a best effort verification of the system — and making smart bets and tradeoffs given our needs appears to be the best way forward.” ~ Cindy Sridharan

That post by Cindy Sridharan is a masterpiece that has been acclaimed and critiqued in equal measure, and one that I highly recommend. Sure, it's quite lengthy and you may not see eye to eye with her on everything, but there's plenty to love and to nod in agreement with. It's packed with thought-provoking observations, interesting points of view and nuggets of distilled wisdom.

Right. Time for that trip to the Smithsonian museum.

Golden Age of Software Testing

Photo by Fredy Jacob from unsplash.com

The 1990s were the golden age of software testing.

As an industry, we still had quite a lot to figure out. Global or local data? File and variable naming conventions? Time constraints versus memory utilization? Library, procedure or inline code? Use or reuse? But the biggest of them all: fixing bugs that occurred in the field when the only way to get bug reports was by post, phone or email, and the only way to update software was by mailing out a new set of floppies.

You see, we were once upon a time not very experienced in writing code. And after shipping that code, fixing it was a really, really painful proposition. No wonder so much time and effort was put into testing. There was no choice but to double check developers’ work and try to ensure that as few bugs as possible made it into the released product.

The 1990s were also the era when the monolith was the default architectural style. Applications composed of other applications had begun to surface (the first experiments date back to the 1980s), but monoliths were all the rage.

While things have moved along considerably, with microservices and whatnot, we might still be hung up on the 90s; at least on the testing end of things, as Cindy points out:

“As an industry, we’re beholden to test methodologies invented in an era vastly different to the current one we’re in.

People still seem to be enamored with ideas such as full test coverage (so much so that at certain companies a merge is blocked if a patch or a new feature branch ends up fractionally decreasing the test coverage of the codebase), test-driven development and complete end-to-end testing at the system level.”

In the words of Prince, we test like it’s 1999.

The Distributed Monolith

Microservice complexity

João Vazao Vasques makes a bold assertion that he calls “an important truth, our north star, that will guide us on this journey” in his 2018 blog post Your Distributed Monoliths are secretly plotting against you:

“Most implementations of microservices are nothing more than distributed monoliths.”

It’s very easy to create a distributed monolith when designing a system using the microservice architectural style; most have certainly done it. It’s also quite easy to allow the monolith to quietly sneak in elsewhere.

Matthew Skelton in his presentation at the London DevOps Enterprise Summit of June 2017 spoke about the types of “software monoliths” that can creep into a project:

  • Application monolith: single block of code deployed as a unit
  • Joined at the DB: difficult to change separately
  • Monolithic build: rebuild everything; one gigantic CI build
  • Monolithic releases: coupled release; smaller components bundled together into a “release”
  • Monolithic thinking: standardization; “one-size-fits-all” for teams

The “testing monolith”, similar to the monolithic build and monolithic release types that Matthew describes, has been suggested as a sixth type of monolith. Continuous Delivery consultant Steve Smith has gone so far as to argue that end-to-end testing should be considered harmful.

Steve's point of view is that (monolithically) spinning everything up in order to verify the presence, or lack thereof, of issues in a system is fundamentally incompatible with Continuous Delivery. Besides, it's a proposition that greatly suffers from the fallacy of decomposition and the cheap investment fallacy: the idea that testing a whole system will be cheaper than testing its constituent parts.

“Any advantage you gain by talking to the real system is overwhelmed by the need to stamp out non-determinism” ~ Martin Fowler

The challenges of testing microservice-based applications can be even more insidious than this. Testing microservices is, granted, particularly difficult: it is inherently harder than testing monoliths due to the distributed nature of the code under test.

Testing is more than just debugging.

The difficulty of software testing stems from the complexity of software: we cannot completely test a program of even moderate complexity. Testing is more than just debugging. Its purpose can be quality assurance, verification and validation, or reliability estimation; it can also be used as a generic metric. Correctness testing and reliability testing are two major areas of testing. Ultimately, software testing is a trade-off between budget, time and quality.

A further complication has to do with the dynamic nature of programs. If a failure occurs during preliminary testing and the code is changed, the software may now work for a test case that it didn't work for previously. But its behavior on pre-error test cases that it passed before can no longer be guaranteed. To account for this possibility, testing should be restarted, an expense that is often prohibitive.

An interesting analogy that likens the difficulty of software testing to pesticides is the Pesticide Paradox, defined by Boris Beizer:

“Every method you use to prevent or find bugs leaves a residue of subtler bugs against which those methods are ineffectual.”

Full Stack in a Box

Real-time graph of microservice dependencies at Amazon in 2008. https://twitter.com/Werner/status/741673514567143424

The full-stack-in-a-box testing strategy entails replicating a cloud environment locally and testing everything in one local instance.

As you must be imagining — cringing even — it’s no mean feat given how elaborate and fragile a setup it is, often requiring a team in its own right to build, maintain, troubleshoot and evolve the infrastructure.

“If anyone so much as sneezes, my service becomes untestable.” ~ Tyler Treat

That's without any attempt to bring into perspective the scale (read: the number of dependencies) most microservice architectures bear; case in point, Amazon's graph of microservice dependencies from back in 2008.

Cindy Sridharan recounts the trials and tribulations of her first-hand experience with this “fallacy” in a setup with just two services. She explains that at one of her previous companies, they tried to spin up the entire stack in a Vagrant box (the Vagrant repo itself was called something along the lines of “full-stack in a box”), the idea being that a simple vagrant up would enable any engineer to spin up the stack in its entirety on their laptop.

“… asking to boot a cloud on a dev machine is equivalent to becoming multi-substrate, supporting more than one cloud provider, but one of them is the worst you’ve ever seen (a single laptop)” ~ Fred Hébert

On top of being difficult to pull off, this localized testing strategy doesn't scale. With a local deployment, you have to run most of the services (and their dependencies) to get a fully running system. That stretches even high-end 16GB RAM machines quite hard.

“Software complexity (and therefore that of bugs) grows to the limits of our ability to manage that complexity.” ~ Boris Beizer

Moreover, even with modern DevOps best practices like infrastructure-as-code and immutable infrastructure, trying to replicate a cloud environment locally doesn’t offer benefits commensurate with the effort required to get it off the ground and subsequently maintain it; a relationship governed by the law of diminishing returns.

The Spectrum of Testing

Image from Cindy Sridharan

Historically, testing has referred to an activity confined to a pre-production or pre-release phase, often carried out by siloed testing/QA teams.

Driven by the “build it / run it / own it” ethos popularized by Amazon (at whose core is the rationale that being woken up repeatedly at 2 a.m. by your service's pager is quite a powerful incentive to find and fix root causes and to focus on quality while writing code), this model is slowly being phased out. Development teams are now responsible for testing as well as operating the services they author.

“Good testing involves balancing the need to mitigate risk against the risk of trying to gather too much information” ~ Jerry Weinberg

According to Cindy Sridharan, this new evolving model is incredibly powerful as it truly allows development teams to think about the scope, goal, tradeoffs and payoffs of the entire spectrum of testing in a manner that's realistic as well as sustainable. “In order to be able to craft a holistic strategy for understanding how our services function, and gain confidence in their correctness, it becomes salient to be able to pick and choose the right subset of testing techniques given the availability, reliability and correctness requirements of the service,” she adds.

You can’t test in quality, but you can code it in.

Quality is no less important, of course, but achieving it requires a different focus than in the past. Testing ought to be at the heart of the development and DevOps culture, providing new opportunities for testers.

The spectrum of testing illustrated above, adapted from Cindy Sridharan, broadly partitions software testing into pre-production testing and testing in production. This, she says, can be used to encompass a variety of activities, including many practices that traditionally used to fall under the umbrella of “release engineering” or Operations or QA.

“…but it encompasses some of the most common forms of testing seen in the wild all the same.”

She however counsels that while the illustration presents the testing taxonomy as a binary, the reality isn't quite as neatly delineable as depicted. “For instance, profiling falls under the ‘testing in production’ column, but it can very well be done during development time, in which case it becomes a form of pre-production testing,” she explains.

Pre-production Testing

Test Pyramid

Pre-production testing is predominantly, as Cindy states:

“A best effort verification of the correctness of a system as well as a best effort simulation of the known failure modes”

As such, in the crucible of pre-production testing, the goal of its suite of test methods is not necessarily to prove there aren't any bugs (except perhaps in parsers and in any application that deals with money or safety or where human lives are at stake), but to assure that the known-knowns are well covered and that instrumentation is in place for the known-unknowns.

Software bugs will almost always exist in any software module of moderate size: not because programmers are careless or irresponsible, but because the complexity of software is generally intractable, and humans have only a limited ability to manage complexity. It's also true that for any complex system, design defects can never be completely ruled out.

Well, if that’s the case, what is to be said about the scope of pre-production testing?

“The scope of pre-production testing is only as good as our ability to conceive good heuristics that might prove to be a precursor of production bugs.”

The scope of pre-production testing is rather dependent on the ability to intuit the boundaries of a system, the happy code paths (success cases) and perhaps more importantly the sad paths (error and exception handling), and continuously refine these heuristics over time.

That scope is heavily curtailed by the implicit assumptions the system is built upon and by the plethora of biases held by the software engineers on a development team, because invariably the person (or team) writing the code also writes the tests; code reviews notwithstanding.
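
To make this concrete, here is a minimal, hypothetical sketch (using pytest; the names reserve_stock and OutOfStockError are invented for illustration) of what covering a happy path alongside a couple of sad paths might look like for a small slice of inventory logic:

```python
import pytest


class OutOfStockError(Exception):
    """Raised when a reservation exceeds the available quantity."""


def reserve_stock(available: int, requested: int) -> int:
    """Return the stock remaining after reserving `requested` items."""
    if requested < 0:
        raise ValueError("requested quantity must be non-negative")
    if requested > available:
        raise OutOfStockError(f"only {available} item(s) in stock")
    return available - requested


def test_reserve_stock_happy_path():
    # The success case: a valid reservation reduces the stock level.
    assert reserve_stock(available=10, requested=3) == 7


def test_reserve_stock_rejects_overdraw():
    # A sad path: reserving more than is available must fail loudly.
    with pytest.raises(OutOfStockError):
        reserve_stock(available=1, requested=5)


def test_reserve_stock_rejects_negative_quantity():
    # Another sad path: nonsensical input is rejected rather than ignored.
    with pytest.raises(ValueError):
        reserve_stock(available=10, requested=-1)
```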

The infographic below highlights the essence of some of the pre-production test methods:

Image from https://martinfowler.com/articles/microservice-testing/#conclusion-summary

The Unit of Test

Single Responsibility Principle. Image from LearnStuff.io

A microservice architecture is the natural consequence of applying the Single Responsibility principle at the architectural level. Microservices are thus built on the notion of splitting up units of business logic into standalone services, where every individual service is responsible for a standalone piece of business or infrastructural functionality.

Microservices are, more often than not, stateful entities: they encapsulate state and behavior, akin to an Object or an Actor.

Cindy states that in her experience, an individual microservice (with the exception of perhaps network proxies) is almost always a software frontend (with a dollop of business logic) to some sort of stateful backend like a database or a cache. In such systems, most, if not all, rudimentary units of functionality often involve some form of (hopefully non-blocking) I/O — be it reading bytes off the wire or reading some data from disk.

“Not all I/O is equal”

Given the bounded context of a microservice, different forms of I/O have different stakes. For instance, protocol parsing libraries, RPC clients, database drivers, AMQP clients and so forth all perform I/O, yet these are different forms of I/O with varying significance within a microservice's tangible boundaries of applicability.

Consider the example of testing a microservice that is responsible for managing inventory. It's certainly more prudent to verify that items are created successfully in the database than it is to ascertain that HTTP parsing works as expected. Granted, a bug in the HTTP parsing library can act as a single point of failure for such a service and is hence an important aspect to verify, but it is also subservient to the primary responsibility of the service.

“What is of the essence here is that the most important unit of functionality a microservice provides happens to be an abstraction of the underlying I/O involved to its persistent backend, and as such should become the hermetic unit of base functionality under test.”

Domain logic often manifests as complex calculations and a collection of state transitions. Since these types of logic are highly state-based, there is little value in trying to isolate the units. This means that, as far as possible, real domain objects should be used for all collaborators of the unit under test.
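
As a minimal sketch of this idea, consider a hypothetical inventory repository whose hermetic unit under test is the abstraction over its persistent backend. Rather than mocking out the database driver, the test exercises the real driver against a throwaway in-memory SQLite database, so the unit still performs real I/O while the test stays fast and self-contained (all names here are illustrative):

```python
import sqlite3


class InventoryRepository:
    """Thin abstraction over the service's stateful backend (its unit of test)."""

    def __init__(self, conn: sqlite3.Connection) -> None:
        self.conn = conn
        self.conn.execute(
            "CREATE TABLE IF NOT EXISTS items (sku TEXT PRIMARY KEY, quantity INTEGER)"
        )

    def add_item(self, sku: str, quantity: int) -> None:
        self.conn.execute(
            "INSERT INTO items (sku, quantity) VALUES (?, ?)", (sku, quantity)
        )

    def quantity_of(self, sku: str) -> int:
        row = self.conn.execute(
            "SELECT quantity FROM items WHERE sku = ?", (sku,)
        ).fetchone()
        return row[0] if row else 0


def test_created_item_is_persisted():
    # An in-memory database keeps the test hermetic and fast while the unit
    # under test still performs real I/O through a real driver.
    repo = InventoryRepository(sqlite3.connect(":memory:"))
    repo.add_item("sku-123", quantity=5)
    assert repo.quantity_of("sku-123") == 5
```

A production service would of course sit in front of a real database over the network, but the shape of the test, exercising the I/O abstraction together with its real collaborators, remains the same.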

Testing (QA) In Production

QA in Production. Image from https://martinfowler.com/articles/qa-in-production.html

“I’m more and more convinced that staging environments are like mocks — at best a pale imitation of the genuine article and the worst form of confirmation bias.” ~ Cindy Sridharan

Testing in production has gotten a bad rap and is frowned upon — despite the fact that we all do it, all the time; consciously or otherwise.

In reality, every deploy is a test; in production (because every deploy is a unique, never-to-be-replicated combination of an artifact, environment, infrastructure, and time of day). Every user performing an action on your system is a test; in production. Increasing scale and changing traffic patterns are tests; in production.

Oh! There’s more.

Distributed systems exist in a perpetual state of partial degradation. Failure is the only constant. Failure is happening on your systems right now, in a hundred ways you aren’t aware of and may never learn about. So obsessing over individual errors will drive you straight to the madhouse.

“This is an industry that’s largely in denial about failure, and the denial is only just beginning to lift.” ~ Charity Majors

I must insist here that the proposition of testing in production does not in any way imply throwing caution to the wind, or adopting such an attitude, or taking away the “sanctity” of production.

On the contrary. There's a lot of daylight between throwing your code over the wall, past any form of pre-production or pre-release safeguards, and then waiting to get paged, versus shipping it with alert eyes on it as it goes out, watching your instrumentation, and actively flexing the new code. The job of modern software engineers is not done until they have watched users use their code in production.
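
As a purely hypothetical illustration of what “alert eyes” can look like in code, the sketch below guards a new pricing path behind a small rollout fraction, instruments it with logs and latency, and falls back to the known-good path when the new one misbehaves; every name and number here is invented for the example:

```python
import logging
import random
import time
from dataclasses import dataclass

logging.basicConfig(level=logging.INFO)
logger = logging.getLogger("checkout")

NEW_PRICING_ROLLOUT = 0.05  # expose the new path to roughly 5% of requests at first


@dataclass
class Order:
    id: str
    items: int
    unit_price: float


def compute_price_v1(order: Order) -> float:
    # The known-good path stays in place as the fallback.
    return order.items * order.unit_price


def compute_price_v2(order: Order) -> float:
    # The new path being flexed in production, here with a bulk discount.
    discount = 0.9 if order.items >= 10 else 1.0
    return order.items * order.unit_price * discount


def compute_price(order: Order) -> float:
    if random.random() < NEW_PRICING_ROLLOUT:
        started = time.monotonic()
        try:
            price = compute_price_v2(order)
            logger.info(
                "pricing_v2 ok order=%s latency_ms=%.2f",
                order.id,
                (time.monotonic() - started) * 1000,
            )
            return price
        except Exception:
            # A surprise in the new path is logged and the request falls back
            # to the known-good behaviour instead of failing the user.
            logger.exception("pricing_v2 failed order=%s", order.id)
    return compute_price_v1(order)


if __name__ == "__main__":
    print(compute_price(Order(id="ord-1", items=12, unit_price=2.5)))
```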

Pre-production testing is great for finding defects you expect to happen, but many production defects are surprises.

Tests can only help you with scenarios you already know about; the known-knowns. They are a good way of making sure that a system behaves as intended, but they cannot tell you whether the intended behaviour is correct. Tests simply cannot cover every scenario.

Developers need to get comfortable with the idea of testing and evolving their systems based on the sort of accurate feedback they can only derive from observing how these systems behave in production. Sole reliance on pre-production testing won't stand them in good stead, not just for the future but also for the increasingly distributed present of even nominally non-trivial architectures.

The difference between Observability and Monitoring boils down to the known-unknowns and the unknown-unknowns.

Good production monitoring can provide valuable feedback about scenarios you hadn’t foreseen and help you adjust your system’s behaviour accordingly. Quality is as much about learning the correct behaviour of a system as it is about safeguarding that behaviour — an aspect that is often overlooked.

Summary

Because a microservice architecture relies more on over-the-wire (remote) dependencies and less on in-process components, your testing strategy and test environments need to adapt to these changes.

Given how broad a spectrum testing is, there’s really no one true way of doing it right. Any approach is going to involve making compromises and tradeoffs.

Your application is being tested in production every single day by the people who use it. You just need to find a way to use all the data users are already generating.

Finding the right balance of pre-production and production quality practices can help you gain a more realistic and holistic understanding of the quality of your system.

Thank you for reading. I sincerely hope it was a nice read.

You can catch me at:

GitHub: kwahome

Twitter: @kwahome_
