A Method for Finding Missing Unit Tests

The Coverage Illusion

Code coverage is the most widely used proxy for test quality: if 80% of lines are executed during tests, you have 80% line coverage. But coverage is a weak measure, because a test can execute code without verifying that it is correct.

Consider:

def calculate_discount(price, is_member):
    if is_member:
        return price * 0.9
    return price

def test_discount():
    result = calculate_discount(100, True)
    # No assertion!

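This test executes the member branch, so a coverage tool marks those lines as covered, but it verifies nothing. (Only the final return price goes unexecuted; a second assertion-free call with is_member=False would bring the function to 100% line coverage.) The function could return price * 0.5 and the test would still pass.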
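For contrast, here is a minimal sketch of tests that do verify behavior. The test names are illustrative, and math.isclose is used only to avoid comparing floats exactly.

```python
import math


def calculate_discount(price, is_member):
    if is_member:
        return price * 0.9
    return price


def test_member_gets_ten_percent_off():
    # Fails if the multiplier is changed to anything other than 0.9.
    assert math.isclose(calculate_discount(100, True), 90.0)


def test_non_member_pays_full_price():
    # Exercises the branch the original test never reached.
    assert calculate_discount(100, False) == 100
```

Coverage for the member branch is the same as before; the difference is that a wrong multiplier now makes a test fail.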

The paper addresses a harder question: which tests are missing? Not "is code covered?" but "is behavior verified?"

Mutation Testing Basics

Mutation testing systematically introduces bugs (mutations) into code and checks if tests catch them. A mutant is a modified version of the code with one small change:

Replace < with <=
Replace + with -
Replace True with False
Delete a statement

If at least one test fails when a mutant is introduced, the mutant is "killed." If every test still passes, the mutant "survives," indicating a gap in what the tests actually verify.
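A hand-rolled illustration of killed versus surviving mutants is sketched below. Real tools generate mutants automatically; here the mutant is written out by hand, and is_adult, the two tests, and the survives helper are all hypothetical names chosen for the example.

```python
def is_adult(age):
    return age >= 18


def is_adult_mutant(age):
    # Mutation: ">=" replaced with ">"
    return age > 18


def weak_test(fn):
    # Checks values far from the boundary only.
    assert fn(30) is True
    assert fn(5) is False


def boundary_test(fn):
    # Checks the boundary value itself.
    assert fn(18) is True


def survives(fn, tests):
    """Return True if every test passes against fn (the mutant survives)."""
    for test in tests:
        try:
            test(fn)
        except AssertionError:
            return False
    return True


print(survives(is_adult_mutant, [weak_test]))                 # True
print(survives(is_adult_mutant, [weak_test, boundary_test]))  # False
```

The surviving mutant in the first call points directly at the missing boundary test.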

The mutation score measures test effectiveness:

\text{Mutation Score} = \frac{\text{Killed Mutants}}{\text{Total Mutants}}

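A test suite with a 90% mutation score catches 90% of the synthetic bugs that were generated. This tends to correlate better with real bug detection than line coverage does, because killing a mutant requires an assertion to fail, not just a line to run.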
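As a worked example of the formula, the helper below computes the score from two counts. The function name and the input validation are assumptions made for this sketch.

```python
def mutation_score(killed, total):
    """Fraction of generated mutants that the test suite killed."""
    if total <= 0:
        raise ValueError("total must be positive")
    if not 0 <= killed <= total:
        raise ValueError("killed must be between 0 and total")
    return killed / total


print(mutation_score(45, 50))  # 0.9
```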

The Combinatorial Problem

Mutation testing is computationally expensive. For a codebase with $n$ mutation points and $m$ possible mutations per point, the total number of mutants is $O(n \times m)$. In the naive approach, each mutant requires a run of the full test suite.

For a modest project with 10,000 lines and 5 mutations per line:

50,000 \text{ mutants} \times 10 \text{ seconds/test run} = 500,000 \text{ seconds} \approx 139 \text{ hours}
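The same estimate can be reproduced for other project sizes with a few lines of Python. This is a back-of-the-envelope sketch that assumes one full suite run per mutant and no parallelism; the function name is made up for the example.

```python
def full_run_hours(lines, mutations_per_line, suite_seconds):
    """Estimated wall-clock hours for naive, sequential mutation testing."""
    mutants = lines * mutations_per_line
    return mutants * suite_seconds / 3600


hours = full_run_hours(10_000, 5, 10)
print(f"{hours:.1f} hours")  # 138.9 hours, i.e. roughly 139
```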

The paper proposes techniques to reduce this cost.

Prioritizing Mutations

Not all mutations are equally valuable. A mutation in dead code or error handling for impossible conditions doesn't represent real risk.

The method prioritizes mutations in:

Code with high cyclomatic complexity (more branches = more logic to verify)
Recently changed code (more likely to contain bugs)
Code with low existing test coverage (obvious gaps)

Each code region $r$ is scored as a weighted sum, with tunable weights $w_1$, $w_2$, and $w_3$, so that mutation testing focuses on high-value areas first:

\text{Priority}(r) = w_1 \cdot \text{complexity}(r) + w_2 \cdot \text{churn}(r) + w_3 \cdot (1 - \text{coverage}(r))
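A minimal sketch of this scoring follows. The weight values, the Region class, and the assumption that complexity and churn are already normalized to the range 0 to 1 are all illustrative choices, not values taken from the paper.

```python
from dataclasses import dataclass


@dataclass
class Region:
    name: str
    complexity: float  # assumed normalized to [0, 1]
    churn: float       # assumed normalized to [0, 1]
    coverage: float    # fraction of lines covered, in [0, 1]


def priority(region, w1=0.4, w2=0.3, w3=0.3):
    """Weighted sum from the formula above; the weights are placeholders."""
    return (
        w1 * region.complexity
        + w2 * region.churn
        + w3 * (1 - region.coverage)
    )


regions = [
    Region("utils.format_date", complexity=0.2, churn=0.1, coverage=0.95),
    Region("billing.apply_discount", complexity=0.8, churn=0.9, coverage=0.4),
]

# Mutate the highest-priority regions first.
ranked = sorted(regions, key=priority, reverse=True)
print([r.name for r in ranked])
```

With these inputs, the complex, frequently changed, poorly covered billing code ranks ahead of the stable utility function.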

Equivalent Mutants

Some mutations don't change program behavior. These "equivalent mutants" cannot be killed because they're semantically identical to the original.

# Original
for i in range(0, n):
    process(i)

# Equivalent mutant
for i in range(0, n, 1):
    process(i)

Equivalent mutants deflate mutation scores, since they count toward the total but can never be killed, and they waste testing effort. The paper uses static analysis to identify and filter likely equivalents.

A mutant is equivalent to the original if it produces the same output for all inputs in the program's domain. Determining this in general is undecidable, but heuristics catch common cases:

Mutations in unreachable code
Mutations that cancel out (e.g., x + 1 - 1)
Mutations in logging or debug statements
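One of these heuristics, skipping mutations that fall inside logging or debug statements, can be sketched with Python's standard ast module. This is only an illustration under stated assumptions: the set of logger names, the treatment of print as a debug statement, and the (line, description) shape of a mutation are hypothetical, and the paper's own static analysis may work differently.

```python
import ast

# Assumed naming convention for logger objects; adjust per codebase.
LOGGER_NAMES = {"logging", "logger", "log"}


def logging_lines(source):
    """Return the set of line numbers occupied by logging or print calls."""
    lines = set()
    for node in ast.walk(ast.parse(source)):
        if not isinstance(node, ast.Call):
            continue
        func = node.func
        is_log_call = (
            isinstance(func, ast.Attribute)
            and isinstance(func.value, ast.Name)
            and func.value.id in LOGGER_NAMES
        )
        is_print = isinstance(func, ast.Name) and func.id == "print"
        if is_log_call or is_print:
            lines.update(range(node.lineno, node.end_lineno + 1))
    return lines


def filter_mutations(source, mutations):
    """Drop (line, description) mutations that fall on logging lines."""
    skip = logging_lines(source)
    return [m for m in mutations if m[0] not in skip]


SAMPLE = (
    "import logging\n"
    "\n"
    "def f(x):\n"
    "    logging.debug('x=%s', x)\n"
    "    return x + 1\n"
)

candidates = [(4, "replace 'x=%s' with ''"), (5, "replace + with -")]
print(filter_mutations(SAMPLE, candidates))  # only the line-5 mutation remains
```

Mutants in logging calls are not always strictly equivalent, since the log output changes, but tests rarely observe that output, so filtering them saves effort at little cost.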

Test Generation Suggestions

When a mutant survives, the method suggests what test is missing. It analyzes:

Which code path contains the surviving mutant
What input values reach that path
What assertion would distinguish mutant from original

For example, if a mutant changes price * 0.9 to price * 0.8, the suggestion might be:

Missing test: verify discount calculation
Input: price=100, is_member=True
Expected: 90
Actual (mutant): 80

This gives developers actionable feedback rather than just "your tests are incomplete."
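The core of such a suggestion, finding an input on which the mutant and the original disagree, can be sketched as a simple search over candidate inputs. The Suggestion class, suggest_test, and the candidate list are hypothetical; a real tool would derive candidates from path analysis rather than a hand-written list.

```python
from dataclasses import dataclass
from typing import Any, Optional


@dataclass
class Suggestion:
    args: tuple
    expected: Any  # output of the original code
    actual: Any    # output of the mutant


def suggest_test(original, mutant, candidates) -> Optional[Suggestion]:
    """Return the first candidate input that distinguishes mutant from original."""
    for args in candidates:
        expected = original(*args)
        actual = mutant(*args)
        if expected != actual:
            return Suggestion(args, expected, actual)
    return None  # no candidate distinguishes them; possibly an equivalent mutant


def discount(price, is_member):
    return price * 0.9 if is_member else price


def discount_mutant(price, is_member):
    # Mutation: 0.9 replaced with 0.8
    return price * 0.8 if is_member else price


suggestion = suggest_test(discount, discount_mutant, [(100, False), (100, True)])
print(suggestion)
```

The non-member input is skipped because both versions return the same value there; only the member input exposes the mutant, which matches the suggestion shown above.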

Incremental Analysis

The paper emphasizes incremental application. Running full mutation analysis on every commit is impractical. Instead:

On each commit, identify changed functions
Generate mutations only for changed code
Run only tests that cover changed code
Report surviving mutants

This reduces analysis time from hours to minutes while focusing on code most likely to contain new bugs.

\text{Time}_{\text{incremental}} = O(\text{changed lines} \times \text{mutations} \times \text{relevant tests})
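The test-selection step can be sketched as a set intersection between the changed functions and a per-test coverage map. Both inputs are assumed to come from elsewhere (a diff tool and a coverage run, respectively), and the function and test names are invented for the example.

```python
def select_tests(changed_functions, coverage_map):
    """Return the tests that execute at least one changed function.

    coverage_map maps a test name to the set of function names it executes.
    """
    changed = set(changed_functions)
    return sorted(
        test for test, functions in coverage_map.items() if functions & changed
    )


coverage_map = {
    "test_discount": {"calculate_discount"},
    "test_cart_total": {"cart_total", "calculate_discount"},
    "test_format_date": {"format_date"},
}

print(select_tests({"calculate_discount"}, coverage_map))
# ['test_cart_total', 'test_discount']
```

Only mutants generated for calculate_discount would then be run, and only against the two selected tests.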

Practical Limitations

Mutation testing assumes that small syntactic changes represent realistic bugs. This isn't always true. Real bugs often involve:

Misunderstood requirements (logic is wrong, not just off-by-one)
Integration issues (works in isolation, fails in combination)
Concurrency bugs (timing-dependent, hard to synthesize)

The paper acknowledges these limitations. Mutation testing complements rather than replaces other approaches such as integration tests, property-based testing, and manual review.

Takeaway

Code coverage answers "did tests run this code?" Mutation testing answers "would tests catch a bug here?" The gap between these questions represents risk.

The paper's contribution is making mutation testing practical through prioritization, equivalent mutant detection, and incremental analysis. These techniques bring mutation testing from research curiosity to viable engineering practice.

Tools like PIT (Java), mutmut (Python), and Stryker (JavaScript) make mutation testing available in everyday projects. They're worth running periodically, especially before releases, to find tests you didn't know you were missing.