A Good Test

2020 May 31st


Tags: software, testing, guide, verification

Testing is meant to provide confidence.

As many of you may know, I have always been opinionated about tests, whether they were tests for the software I was writing or tests to prove my aptitude in university calculus. High-quality tests that measure with reliability and validity instill confidence that our software will perform as we have instructed. From that viewpoint, I think that testing is often the weak point for many software codebases.

Then what makes a good test in practice? What provides that confidence? When is a test worth writing? This is where I would get stuck, often resulting in not writing a single test. Honestly, in rare cases that can be the right choice. More often, having confidence in your software saves you from firefighting and makes you more productive as you modify and augment the code in the future. So why do I dislike creating tests?

Testing Error

I briefly mentioned two measurements of a "quality test" above: reliability and validity.

Test Quality | Description
Reliability | The consistency of your assessment. Does it achieve the same result for the same (or similar) input?
Validity | Does our test actually measure what we want to measure?

If we want confidence, we want our tests to be as reliable and as valid as possible. Defending these qualities becomes harder as a codebase grows in complexity, yet the need for tests and for the confidence they bring grows right along with it. A common reliability problem is the flaky test: a test that sometimes passes and sometimes fails for the same code. Not reliable at all.
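
To make that concrete, here is a hypothetical sketch of a flaky Jest-style test (saveProfile and loadProfile are made-up helpers). The fixed timeout is the flaw: the result depends on machine speed and scheduling rather than on the code under test.

// Hypothetical flaky test: it races a fixed timeout against async work
// instead of awaiting it, so the outcome depends on machine speed.
it('saves the user profile', async () => {
  saveProfile({ name: 'Ada' }) // hypothetical async helper; its promise is never awaited

  // The flaw: 50ms is sometimes enough for the save to finish, sometimes not.
  await new Promise(resolve => setTimeout(resolve, 50))

  expect(await loadProfile('Ada')).toBeDefined() // hypothetical helper
})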

For validity, I want to touch on two additional ways to measure the quality of our testing: Type 1 error and Type 2 error.

Error | Description
Type 1 | The false positive: rejecting a true conclusion
Type 2 | The false negative: accepting a false conclusion

Having valid tests is impossible with these types of errors in our testing setup. Invalid tests can be especially detrimental given the false confidence they may provide.
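
As a small, made-up illustration: a test can pass while proving nothing, quietly accepting the false conclusion that the code works (add() here is a hypothetical function).

// Hypothetical Type 2 error: the assertion is too weak, so the test
// accepts the false conclusion that add() behaves correctly.
it('adds two numbers', () => {
  const result = add(2, 2)             // imagine add() mistakenly returns 0

  expect(typeof result).toBe('number') // still passes: a false negative
})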

Completely eliminating these risks may be impossible. Instead, test creators make trade-offs and choose designs that reduce one type of error while tolerating some of the other. For example, the creators of a driver's test would rather fail a competent driver (a false positive) than license an unsafe one (a false negative). Medical screening tests often prefer false positives over false negatives, following up a positive initial result with more precise (and expensive) tests.

Tests in Software

We have an established basis for test quality... so why is it that testing is still a weak point for software? Surprise answer: testing is complex.

I have written tests for small utilities, video games, and small to large scale web applications. I think everyone has manually tested their software, but I want to explore automated testing. Regardless of codebase complexity, we are pushed towards two broad categories of automated software testing: unit testing and integration testing.

Side note: These categories are broken down further and given standardized names in industry. However, I will stick with integration tests as a broad category since it covers the idea of building assertions against increasingly higher-level systems and components.

Integration Tests

Integration tests provide confidence that pieces of software work together. An integration test may exercise how components within your system interact with each other, or how your system interacts with other systems entirely. The focus is on how well those integrations behave. By their nature, reliable and valid integration tests are harder to create: every additional component the test relies on introduces a new failure point. Integration tests are also brittle in evolving codebases, since even a small centralized change can break many different integration tests at once.

Integration tests are also far more likely to be flaky. Tasked with reaching across component boundaries, and sometimes relying on real network protocols, they are prone to timeouts and other failures in the communication layers between these independent pieces. Tests that sometimes fail erode reliability, and with it our confidence.
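
For a rough sketch of what I mean, assuming an Express app exported from ./app and the supertest library, an integration test might look like this. Every hop it makes is another place for things to go wrong.

import request from 'supertest'
import { app } from './app' // assumed Express application

// Integration test: routing, middleware, and the data layer are exercised together,
// and the HTTP hop is exactly where timeouts and flakiness tend to creep in.
it('creates a user through the HTTP API', async () => {
  const response = await request(app).post('/users').send({ name: 'Ada' })

  expect(response.status).toBe(201)
  expect(response.body.name).toBe('Ada')
})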

Integration tests certainly have their time and place. They take time and effort to implement, but they may be exactly what provides confidence for some types of software (think: government compliance, medical devices, etc.). Complex integrations demand integration tests; just be wary of their pitfalls.

Unit Tests

Unit tests are individual tests concerned with distinct pieces of code that can be tested in isolation. The name unit test comes from the idea that each test should represent only one single unit of work. Application state is not required and outputs must be deterministic. This means that unit tests should be highly reliable. Further, with strong isolation boundaries for the unit of work, our developers can reason better about validity.

import { shallow } from 'enzyme' // shallow rendering (assumed test setup)

// React Button
function Button({ label }: { label: string }) {
  return <button>{label}</button>
}

// Button unit test
it('renders the consumed label prop', () => {
  const button = shallow(<Button label="⛵" />)

  expect(button.text()).toBe('⛵')
})

A good unit test in my mind is one that tests the contract of the unit of work. At the start of my university career, professors encouraged us to write the contracts for our functions explicitly. A contract describes the function as a black box: regardless of internal implementation, the user will expect my <Button /> to render the label. Further, if a project is lacking documentation, seeing the contract outlined by its tests can fill that void. The contract is always my basis for starting to test a unit of work.

// <Button /> consumes a string called label and renders it within an HTML button.
function Button({label}: {label: string}) ...

Unit tests deliver a ton of confidence for the effort they require. Knowing how to create meaningful tests (ones that provide additional confidence) is a skill and a muscle to train. It will still be time-consuming, and it requires subscribing to some set of best practices to maximize the confidence you get back for the effort you put in.

Towards a testing strategy

Let's take on the role of designing our testing strategy. Great! We decide to create unit tests. However, writing tests is hard, defining the boundaries of a unit of work is difficult, and it is not as rewarding as feature development. That being said, releasing broken code feels far worse than writing tests.

How many tests do we need? How can we be confident?

A starting point could be to establish code review best practices and encourage each other to write good tests.

One possible automated solution is to introduce a test coverage policy. Test coverage (code coverage) measures the proportion of code executed while running a collection of tests. If a test suite has high code coverage, it may be likely that fewer bugs will slip through. For me, test coverage guarantees are dubious claims towards validity at best. To put it nicely: test coverage should be used to encourage developers to write tests, and confidence should come from your tests, not from achieving some level of code coverage.
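
If you do adopt a coverage policy, it typically lives in the test runner's configuration. A minimal sketch, assuming Jest as the runner: the suite fails whenever global coverage drops below the thresholds.

// jest.config.ts (sketch): fail the suite when coverage drops below the bar
import type { Config } from 'jest'

const config: Config = {
  collectCoverage: true,
  coverageThreshold: {
    global: {
      lines: 80,    // % of lines executed by the suite
      branches: 70, // % of branches executed by the suite
    },
  },
}

export default config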

Code coverage

There are a number of different styles of code coverage. I have seen line coverage used most frequently.

Coverage Type | Description
Function Coverage | Has each function been executed?
Line Coverage (default) | Has each line / statement been executed?
Branch Coverage | Has each branch (ex: an if and its else) been executed?
Path Coverage | Has every possible control path been followed?

Path coverage is the most demanding form of coverage. It requires that every combination of conditional statements and loops is executed and explored. It may provide the most protection, but it is probably far too expensive (in both implementation and execution resources) to implement in every situation.

Consider this fairly simple function in pseudocode:

function coolNumberProducer(a: integer, b: integer, c: integer, d: integer) {
  if (a > 10) {
    a = a * 2
  }

  if (b > 10) {
    b = b * 2
  }

  if (c > 10) {
    c = c * 2
  }

  if (d > 10) {
    d = d * 2
  }

  return a + b + c + d
}

Coverage Type | Minimum # of tests for 100% coverage
Function Coverage | 1
Line Coverage (default) | 1
Branch Coverage | 2
Path Coverage | 2^4 = 16
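
To make the gap between branch and path coverage concrete, a hypothetical pair of tests like the following executes every branch of coolNumberProducer (each if taken once and skipped once), yet explores only 2 of its 16 paths.

// Two tests reach 100% branch coverage: every if is taken once and skipped once...
it('doubles every argument above 10', () => {
  expect(coolNumberProducer(11, 11, 11, 11)).toBe(88)
})

it('leaves small arguments untouched', () => {
  expect(coolNumberProducer(1, 1, 1, 1)).toBe(4)
})
// ...yet only 2 of the 16 possible paths are ever executed.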

Are we stuck?

Yes and no. This is a clear point of friction for both the test strategy designers and the developers writing unit tests. But even with our unit testing groundwork, we have built a lot of confidence already. Maybe line coverage helps set a standard bar for confidence, but I have found that it leads to creating tests for the sake of creating tests, failing to provide confidence or to cover our contracts.

Static verification

We are lucky that our discipline straddles the line between creativity and objectivity. We have desired outcomes we can test for, but the manner of reaching them is ours to explore. Our confidence, then, should come from expected outcomes. Static verification lets us assert these outcomes without writing explicit tests. Imagine a world where we can prove the correctness of our programs. Clever compilers and interpreters can build rules about the code we write and use them to deduce correctness. To assist and guide these verifiers, the developer can also write additional constraints / rules.

// the compiler knows that a and b are both unsigned integers (natural numbers)
function add(a: unsigned_int, b: unsigned_int) {
  /* the compiler knows that the result is an unsigned int
     and that the result is greater than or equal to both a and b */
  return a + b
}

Hoare logic is built around the idea of a Hoare triple, {Pre-condition} Command {Post-condition}, for the subject of your proof. These triples allow us to prove partial correctness (termination needs to be shown separately).

Hoare Logic | Example
Pre-condition | a: number (assert that variable a is a number)
Commands | function() {} (the actual code)
Post-condition | result >= a (assert that result is greater than or equal to a)

// pre-conditions: a is a number, b is a number
function max(a: number, b: number) {
  // post-condition 1: if there's a result it must be greater than or equal to a
  ensures(res => res >= a)
  // post-condition 2: if there's a result it must be greater than or equal to b
  ensures(res => res >= b)

  // commands
  if (a >= b) {
    return a
  } else {
    return b
  }
}

A verifier (there's even one for JavaScript) can analyze our pre-conditions, compile our code, and assert our post-conditions. We have mathematically proven that this code is correct. We did not have to write a single test, and we can have 100% confidence that this code performs as we expect. Verifiers (or solvers) are also clever enough to produce counterexamples when our commands fail to produce the desired output, which helps ensure the validity of our code.
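
To illustrate the counterexample idea, here is a hypothetical broken max (assuming the same ensures()-style post-conditions as above): the verifier cannot prove the second post-condition and could report an input like a = 1, b = 2 that violates it.

// Hypothetical broken implementation: the verifier cannot prove res >= b
function max(a: number, b: number) {
  ensures(res => res >= a)
  ensures(res => res >= b) // unprovable; a counterexample such as a = 1, b = 2 violates it

  return a // always returning a breaks the second post-condition
}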

Mathematically proven correctness is second to none for confidence, but this approach lacks the developer ergonomics that would make it accessible and commonplace. Our example above is fairly straightforward, but building a proof for a more complex piece of code can be involved or even impossible. Testing is about confidence; our testing policy and testing approach must be about impact.

Confidence

Despite my dislike for writing tests, they are a huge shortcut to confidence in your project. I'm grateful for the tests I do have once they are written. That being said, quality testing practices are something you have to actively defend and preserve. Exploring the balance between confidence and implementation cost is a useful guideline for deciding where to focus your efforts.

For me, confidence comes from a mixture of things. I heavily prefer the contract-based unit test. I also rely on languages and tooling that provide confidence for different parts of the development process. Choosing a typed language (such as TypeScript) when possible means that your compiler can catch issues before they are deployed. Using analytics and metrics to watch for bugs and server performance can provide confidence after deployment as well. Finally, remember that we are trying to build confidence in our software, not in our tests. A huge part of confidence should also come from writing and reviewing quality code.

Friends, look for what could help you feel confident in your code. Write some tests if you have to. Thank you for reading. Thank you for exploring ideas with me.

