Fixing intermittent oranges

1. Check whether the test matches one of the known intermittent orange patterns. If it does, fix that pattern in the hope that it really was a test problem.

2. Try to narrow down the few places in the code base where the cause of the failure could live. This step usually takes some time, as many orange bugs occur in parts of the code that you are reading for the first time. After you identify those places, read the code very carefully, looking for things which might be failing. Basically, assume that *anything* can fail, that all error paths can be taken, and so on, and read your way through the call stack to determine what would happen if something fails. If the code is too complex to follow, inject the failing condition yourself (for example, by replacing |if (failed) return error_code;| with |return error_code;|) and see what happens. After a while, you'll find the exact condition in the code that generates the failure, and from there you can work your way to a fix.

3. If anywhere in the process you need to reproduce something, write a small shell script that runs the test over and over and counts the failures, so that you can estimate the failure "probability" (failures divided by total runs). Then run the same script a large number of times on a build containing your potential fix to see whether the fix really works. You can run this kind of thing overnight on a machine, but if you don't have a machine dedicated to it, you can use the try server and the self-serve API to trigger as many test runs as you want.
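
The rerun loop from step 3 could be sketched roughly as follows. This is a minimal sketch in Python rather than shell, and it rests on one assumption: the test command exits with a non-zero status when the test fails. The function names |count_failures| and |failure_rate| are made up for this illustration and are not part of any existing tool.

```python
import subprocess


def count_failures(command, runs):
    """Run `command` `runs` times and return how many runs exited non-zero.

    Assumption: a non-zero exit status means the test failed.
    """
    failures = 0
    for _ in range(runs):
        result = subprocess.run(
            command,
            stdout=subprocess.DEVNULL,  # discard test output; only the status matters here
            stderr=subprocess.DEVNULL,
        )
        if result.returncode != 0:
            failures += 1
    return failures


def failure_rate(command, runs):
    """Return the observed fraction of failing runs (failures / runs)."""
    if runs <= 0:
        raise ValueError("runs must be a positive integer")
    return count_failures(command, runs) / runs
```

To use it, pass the command as a list of arguments, for example |failure_rate(["./run-test.sh", "path/to/test"], 500)|, where |./run-test.sh| is a placeholder for whatever command runs the test on your build. Run it once on the unfixed build and once on the build with your potential fix, then compare the two rates. Keep in mind that a rare failure needs many runs before zero observed failures means much.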
