
Martin Fowler - Eradicating Non-Determinism in Tests

https://martinfowler.com/articles/nonDeterminism.html

Eradicating Non-Determinism in Tests

Non-deterministic tests are tests that sometimes pass and sometimes fail. Left uncontrolled, non-deterministic tests can completely destroy the value of an automated regression suite. In this article I outline how to deal with non-deterministic tests. Initially quarantine helps to reduce their damage to other tests, but you still have to fix them soon. Therefore I discuss treatments for the common causes for non-determinism: lack of isolation, asynchronous behavior, remote services, time, and resource leaks.


 Footnotes

1: Yes, I know many advocates of TDD consider that a primary virtue of testing is the way it drives requirements and design. I agree that this is a big benefit, but I consider the regression suite to be the single biggest benefit that automated tests give us. Even without TDD tests are worth the cost for that.

2: Sometimes, of course, a test failure is due to a change in what the code is supposed to do, but the test hasn't been updated to reflect the new behavior. This is essentially a bug in the tests, but is equally easy to fix if it's caught right away.

3: There is a useful role for non-deterministic tests. Tests seeded from a randomizer can help hunt out edge cases. Performance tests will always come back with different values. But these kinds of tests are quite different from automated regression tests, which are my focus here.

4: This works well for the Mingle team as they are skillful enough to find and fix non-deterministic tests quickly and disciplined enough to ensure they do it quickly. If your build remains broken for long due to your quarantine tests failing you will lose the value of continuous integration. So for most teams I'd advise keeping the quarantined tests out of the main pipeline.

5: There's no hard-and-fast definitions here, but I'm using the early Extreme Programming terminology of using "unit test" to mean something fine-grained and "functional test" as a test that's more end-to-end and feature related.

6: One trick is to create the initial database and copy it using file system commands before opening it for each test run. File system copies are often faster than loading data using the database commands.

7: Of course this trick only works when you can conduct the test without committing any transactions.

8: Although you'll still need a timeout in case you never get a reply - and that timeout is subject to the same danger when you move to a different environment. Fortunately you can set that timeout to be pretty high, which minimizes the chances of that biting you.

9: In that case, however, the tests will run very slowly. You may want to consider aborting the whole test suite if you reach the wait limit.

10: If your asynchronous behavior is triggered from the UI, it's often a good UI choice to have some indicator to show an asynchronous operation is in progress. Having this be part of the UI also helps testing as the hooks required to stop this indicator can be the same hooks as detecting when to progress the test logic.

11: There are other advantages to using a test double in these circumstances, even if the remote system is deterministic. Often response time is too slow to use a remote system. If you can only talk to a live system, then your tests can generate significant, and unappreciated, load on that system.

12: You could reseed your datastore for each test based on the current time. But that's a lot of work, and fraught with potential timing errors.

13: In this case the clock stub is a common way to break isolation; each test that uses it should ensure it's properly re-initialized.

14: One of my colleagues likes to force a test run just before and after midnight in order to catch tests that use the current time and assume it's the same day an hour or two later. This is especially good at times like the last day of the month.

15: Although, of course, this isn't always a non-determinism bug, but one that's due to a change in environment. Depending on how close the clock ticks are to the id allocation, it could result in non-deterministic behavior.


Lack of Isolation

Asynchronous Behavior

Remote Services

Time

Resource Leaks

Testing with a game app (LibGdx). Game apps have some special characteristics.

https://github.com/libgdx/libgdx/issues/5995


Lack of Isolation

In order to get tests to run reliably, you must have clear control over the environment in which they run, so you have a well-known state at the beginning of the test. If one test creates some data in the database and leaves it lying around, it can corrupt the run of another test which may rely on a different database state.

Case study: a website selling auto parts, mainly tires and brakes, for example.

We have Dev, Staging, and Live environments, plus the local machine of each developer (3 devs + 1 lead).

We also have some child or subsidiary sites: one for the Canadian market, one for tire-only sales, one for the Shopify channel, etc.

Only some of the main sites have Staging/Dev environments; the others are a lone Live site or a lone Dev site.

We also have a blog (WordPress, obviously) embedded into a PHP framework (a common but not top-popular one, and outdated - PHP 5 vs. 8, for example).

And we have two teams from two different countries, plus a Backend team (the real backend?) working in .NET.

=> So think about it: every time we write a test, think about isolation.
Areas where isolation matters:
- APIs
- Orders
- Order payments: PayPal, PayFlow, Amz, Affirms
- Orders are stored in many tables: order, order_detail, order_history, order_log (possibly a table for each payment method), and some methods also log to the file system.
- A table stores the API_response (for calls to the other Backend/Warehouse .NET system, for example).

Therefore I find it's really important to focus on keeping tests isolated. Properly isolated tests can be run in any sequence. As you get to larger operational scope of functional tests, it gets progressively harder to keep tests isolated. When you are tracking down a non-determinism, lack of isolation is a common and frustrating cause.

Keep your tests isolated from each other, so that execution of one test will not affect any others.

One trick that's handy when you're using databases is to conduct your tests inside a transaction, and then roll back the transaction at the end of the test. That way the transaction manager cleans up for you, reducing the chance of errors.

https://www.sqlshack.com/rollback-sql-rolling-back-transactions-via-the-rollback-sql-query/

- So if we have 3 insert queries, one after another, into 3 different tables, can we roll back all of the inserts after the test is done?

Between these insert queries there are select queries to read the data just inserted. For example: we insert an order record, then query it to grab the order_id for order_history...

Can transactions work? In other words, does the select query see the data? Yes: inside a transaction, the same connection sees its own uncommitted changes, so the select can read the just-inserted order before anything is committed. A transaction is an atomic unit of work: either all of the inserts commit, or all of them are rolled back.

What about AUTO_INCREMENT? If a transaction can be rolled back, is the AUTO_INCREMENT counter rolled back too? No: in MySQL, auto-increment values consumed by a rolled-back insert are not reclaimed, so gaps remain. That seems practical but not logical at first sight, yet it is deliberate: it lets concurrent inserts take ids without waiting on each other's transactions.

https://dba.stackexchange.com/questions/89638/rollback-doesnt-work-after-insert-into-newly-created-destination-table
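The questions above can be checked directly. Below is a minimal sketch using Python's sqlite3 (the table names order_t and order_history are invented for illustration; the same pattern applies to MySQL via PDO/mysqli as long as the tables are transactional, e.g. InnoDB):

```python
import sqlite3

# autocommit mode so we control the transaction explicitly
conn = sqlite3.connect(":memory:", isolation_level=None)
conn.execute("CREATE TABLE order_t (id INTEGER PRIMARY KEY AUTOINCREMENT, total REAL)")
conn.execute("CREATE TABLE order_history (order_id INTEGER, note TEXT)")

conn.execute("BEGIN")
cur = conn.execute("INSERT INTO order_t (total) VALUES (99.5)")
order_id = cur.lastrowid
# SELECTs on the same connection DO see the uncommitted row:
row = conn.execute("SELECT id FROM order_t WHERE id = ?", (order_id,)).fetchone()
assert row is not None
conn.execute("INSERT INTO order_history (order_id, note) VALUES (?, 'created')", (order_id,))

conn.execute("ROLLBACK")   # tear-down: the transaction manager cleans up
count = conn.execute("SELECT COUNT(*) FROM order_t").fetchone()[0]
print(count)  # 0 - all inserts across all tables were rolled back
```

Note the limitation from footnote 7: this only works when the code under test never commits, and in MySQL any DDL statement causes an implicit commit, which breaks the trick.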

Another approach is to do a single build of a mostly-immutable starting fixture before running a group of tests. Then ensure that the tests don't change that initial state (or if they do, they reverse the changes in tear-down). This tactic is more error-prone than rebuilding the fixture for each test, but it may be worthwhile iff it takes too long to build the fixture each time.

Although databases are a common cause for isolation problems, there are plenty of times you can get these in-memory too. In particular, beware of static data and singletons. A good example of this kind of problem is contextual environment, such as the currently logged-in user.
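As a sketch of the in-memory variant (the Session class and its names are invented): a module-level "current user" leaks between tests unless tear-down resets it.

```python
# Invented example of static/contextual state shared by all tests.
class Session:
    current_user = None          # singleton-style static data

def login(user):
    Session.current_user = user

def reset_session():
    # Call this in every test's tear-down so one test's login
    # cannot leak into the next test.
    Session.current_user = None

login("admin")
assert Session.current_user == "admin"
reset_session()                        # tear-down
assert Session.current_user is None    # next test starts from a known state
```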

Some people prefer to put less emphasis on isolation and more on defining clear dependencies to force tests to run in a specified order. I prefer isolation because it gives you more flexibility in running subsets of tests and parallelizing tests.

https://github.com/libgdx/libgdx/issues/5995

It sucks, but I don't see unit testing catching on in game development (outside of situations where the logic can be tested independently) any time soon.

@tommyettinger You make a good point about not breaking backwards compatibility. I disagree with you on the rest of what you said. Testing a game presents all of the exact same challenges you face testing anything with a graphical user interface. I've unit tested games before. The scenarios you mentioned were exactly the kind of things I did test.

To make it backwards compatible, we just need to add a configuration option. We could call it startOnInitialize, which would default to true. Then you just have to set that to false in your tests.


https://gamesfromwithin.com/when-is-it-ok-not-to-tdd

Never use bare sleeps to wait for asynchronous responses: use a callback or polling.

        //pseudo-code
        makeAsyncCall
        startTime = Time.now;
        while(! responseReceived) {
          if (Time.now - startTime > waitLimit) 
            throw new TestTimeoutException;
          sleep (pollingInterval);
        }
        readResponse
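The pseudo-code above, translated into a runnable Python sketch; the asynchronous call is simulated with a background thread, and wait_limit / polling_interval are arbitrary values:

```python
import threading
import time

response = {}

def make_async_call():
    # Simulated provider that responds after a short delay.
    def respond():
        time.sleep(0.05)
        response["body"] = "ok"
    threading.Thread(target=respond).start()

class TestTimeoutException(Exception):
    pass

def await_response(wait_limit=2.0, polling_interval=0.01):
    start = time.monotonic()
    while "body" not in response:          # poll, never a bare sleep
        if time.monotonic() - start > wait_limit:
            raise TestTimeoutException()
        time.sleep(polling_interval)
    return response["body"]

make_async_call()
print(await_response())  # ok
```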

In some crawl scripts I have used a mechanism that looks like polling: I wait a bit after an unsuccessful call and retry 2-3 times in a nested condition.
No response
All this advice is handy for async calls where you expect a response from the provider, but how about those where there is no response? These are calls where we invoke a command on something and expect it to happen without any acknowledgment. This is the trickiest case since you can test for your expected response, but there's nothing you can do to detect a failure other than timing out. If the provider is something you're building you can handle this by ensuring the provider implements some way of indicating that it's done - essentially some form of callback. Even if only the testing code uses it, it's worth it - although often you'll find this kind of functionality is valuable for other purposes too[10]. If the provider is someone else's work, you can try persuasion, but otherwise may be stuck. Although this is also a case when using Test Doubles for remote services is worthwhile (which I'll discuss more in the next section).

Gerard Meszaros's book, xUnit Test Patterns, contains lots of good patterns for constructing tests.

If you have a general failure in something asynchronous, such that it's not responding at all, then you'll always be waiting for timeouts and your test suite will take a long time to fail. To combat this it's a good idea to use a smoke test to check that the asynchronous service is responding at all and stop the test run right away if it isn't.
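A sketch of such a smoke check (the host, port, and bootstrap wiring are assumptions):

```python
import socket

def service_is_up(host, port, timeout=1.0):
    """Cheap smoke check: can we even open a TCP connection?"""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False

# In the suite's bootstrap you would abort immediately rather than
# letting every test wait out its own timeout, e.g.:
#   if not service_is_up("async-service.internal", 8080):
#       sys.exit("async service down - aborting test run")

# Demonstrate against a listener we control:
server = socket.socket()
server.bind(("127.0.0.1", 0))
server.listen(1)
port = server.getsockname()[1]
assert service_is_up("127.0.0.1", port)      # service responding
server.close()
assert not service_is_up("127.0.0.1", port)  # service down -> abort early
```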

You can also often side-step the asynchrony completely. Gerard Meszaros's Humble Object pattern says that whenever you have some logic that's in a hard-to-test environment, you should isolate the logic you need to test from that environment. In this case it means put most of the logic you need to test in a place where you can test it synchronously. The asynchronous behavior should be as minimal (humble) as possible, that way you don't need that much testing of it.
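A minimal Humble Object sketch (all names invented): the interesting logic lives in a plain synchronous class, while the asynchronous wrapper is too thin ("humble") to need much testing.

```python
class OrderPricer:
    """All the interesting logic - test this synchronously."""
    def total(self, items):
        return sum(qty * price for qty, price in items)

class AsyncOrderHandler:
    """The humble part: just hands the message off to the pricer."""
    def __init__(self, pricer):
        self.pricer = pricer
    def handle(self, message, callback):
        # In real code this would run on a message queue or event loop;
        # here it simply delegates, so there is almost nothing to test.
        callback(self.pricer.total(message["items"]))

# The logic can be tested directly, with no async machinery at all:
assert OrderPricer().total([(2, 10.0), (1, 5.0)]) == 25.0
```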


Remote Services

Using a double has a downside, in particular when we are testing across a broad scope. How can we be sure that the double behaves in the same way that the remote system does? We can tackle this again using tests, a form of test that I call Contract Tests. These run the same interaction with the remote system and the double, and check that the two match. In this case 'match' may not mean coming up with the same result (due to the non-determinism), but results that share the same essential structure. Integration Contract Tests need to be run frequently, but not as part of our system's deployment pipeline. Periodic running based on the rate of change of the remote system is usually best.
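A sketch of a contract test that compares the essential structure of the two responses rather than their values (the service functions and field names are invented):

```python
# Invented stand-ins: in a real contract test, real_service would hit
# the remote system and double_service would be your test double.
def real_service(order_id):
    return {"id": order_id, "status": "shipped", "updated_at": "2024-01-02"}

def double_service(order_id):
    return {"id": order_id, "status": "pending", "updated_at": "1970-01-01"}

def shape(response):
    # The "essential structure": which keys exist and what types they hold.
    # Values may legitimately differ between real system and double.
    return {k: type(v).__name__ for k, v in response.items()}

assert shape(real_service(42)) == shape(double_service(42))
print("double matches the real service's structure")
```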

For writing these kinds of test doubles, I'm a big fan of Self Initializing Fakes - since these are very simple to manage.

Some people are firmly against using Test Doubles in functional tests, believing that you must test with a real connection in order to ensure end-to-end behavior. While I sympathize with their argument, automated tests are useless if they are non-deterministic. So any advantage you gain by talking to the real system is overwhelmed by the need to stamp out non-determinism.

Time
I've heard so many problems due to direct calls to the system clock that I'd argue for finding a way to use code analysis to detect any direct calls to the system clock and failing the build right there. Even a simple regex check might save you a frustrating debugging session after a call at an ungodly hour.
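A sketch of that "simple regex check" as a build step. The forbidden patterns below are assumptions for a Python codebase; adapt them to your stack (e.g. date()/time() in PHP, DateTime.Now in .NET) and to whatever injected Clock abstraction your team has approved.

```python
import re

# Patterns for direct system-clock calls that should fail the build.
# These are illustrative; tune them for your codebase.
FORBIDDEN = [r"\btime\.time\(", r"\bdatetime\.now\(", r"\bdatetime\.today\("]

def clock_violations(source):
    """Return the forbidden patterns found in a source string."""
    return [p for p in FORBIDDEN if re.search(p, source)]

good = "now = clock.now()        # injected clock - fine"
bad = "stamp = datetime.now()    # direct system clock - fail the build"

assert clock_violations(good) == []
assert clock_violations(bad)          # non-empty -> fail the build here
```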

Resource Leaks
Usually the best way to handle these kinds of resources is through a Resource Pool. If you do this then a good tactic is to configure the pool to a size of 1 and make it throw an exception should it get a request for a resource when it has none left to give. That way the first test to request a resource after the leak will fail - which makes it a lot easier to find the problem test.
This idea of limiting resource pool sizes is about increasing constraints to make errors more likely to crop up in tests. This is good because we want errors to show in tests so we can fix them before they manifest themselves in production. This principle can be used in other ways too. One story I heard was of a system which generated randomly named temporary files, didn't clean them up properly, and crashed on a collision. This kind of bug is very hard to find, but one way to manifest it is to stub the randomizer for testing so it always returns the same value. That way you can surface the problem more quickly.
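A sketch of the size-1 pool trick (class names invented; real pools for DB connections or sockets behave the same way):

```python
class PoolExhausted(Exception):
    pass

class ResourcePool:
    def __init__(self, size=1):
        self.free = [object() for _ in range(size)]
    def acquire(self):
        if not self.free:
            # With size=1, the first test to run after the leak fails
            # right here, pointing straight at the offending test.
            raise PoolExhausted("resource leaked by an earlier test?")
        return self.free.pop()
    def release(self, resource):
        self.free.append(resource)

pool = ResourcePool(size=1)
r = pool.acquire()
pool.release(r)            # a well-behaved test releases in tear-down
leaked = pool.acquire()    # this test "forgets" to release...
try:
    pool.acquire()         # ...so the very next acquire fails fast
except PoolExhausted as e:
    print("leak detected:", e)
```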


