Failure Rates vs Change Rates
I’ve been spending a lot of time recently data mining our test suite at Songkick.com. I’ve been using our own internal version of www.limited-red.com, which records all our test failures for every build on our continuous integration server.
With the Cucumber features I decided to plot the number of times a feature file has been significantly changed against the number of times the feature has failed in the build. The theory is that a change to a feature file (the plain-text specification of behaviour) most often represents a change in functionality.
So I wrote a simple script checking the number of commits to each feature file.
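A minimal sketch of such a script, assuming the feature files live under features/ and that it is run from the root of the Git repository, looks something like this:

```ruby
#!/usr/bin/env ruby
# Count how many commits have touched each Cucumber feature file.
# Assumes we are in the root of the Git repository and that the
# feature files live under features/.

feature_changes = Dir.glob('features/**/*.feature').map do |file|
  commit_count = `git log --oneline --follow -- "#{file}"`.lines.count
  [file, commit_count]
end

# Most frequently changed features first.
feature_changes.sort_by { |_, count| -count }.each do |file, count|
  puts format('%-60s %d', file, count)
end
```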
Feature changes compared with feature failures
Throwing that into a pretty graph, here is a snapshot of some of the features (I’ve changed the names of the features to hide top secret Songkick business).
Insights
Based on this I identified two possible groups of features:
Features failing more often than the code around them is changing
Features which are robust and are not breaking when the code around them is changing.
Further investigation into the suspect features highlighted 3 causes:
Brittle step definitions
Brittle features coupled to the UI
Tests with a high dependency on asynchronous behaviour
Holes in the data – step definitions
We only recorded the change rate of feature files in Git. Features could be broken without the feature file ever changing, for example if a common step definition is broken. Next steps are to identify all the step definitions used by a feature and examine how often those step definitions have changed.
First find the change count for all the step definitions.
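Something like the following does the job, assuming the step definitions live under features/step_definitions/:

```ruby
# Count how many commits have touched each step definition file.
# Assumes the step definitions live under features/step_definitions/.
step_def_changes = Dir.glob('features/step_definitions/**/*.rb').each_with_object({}) do |file, counts|
  counts[file] = `git log --oneline --follow -- "#{file}"`.lines.count
end

# Most frequently changed step definition files first.
step_def_changes.sort_by { |_, count| -count }.each do |file, count|
  puts "#{file}: #{count} commits"
end
```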
Then we work out which step definitions a feature uses. We can do this by running Cucumber with the JSON formatter and matching up step definitions (and hence step definition files) to feature files:
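A sketch of that mapping, assuming the report has been generated with something like cucumber --dry-run --format json --out cucumber.json so that each step carries the location of the step definition it matched:

```ruby
require 'json'

# Map each feature file to the step definition files it uses, based on
# the "match" location Cucumber's JSON formatter records for each step.
report = JSON.parse(File.read('cucumber.json'))

feature_step_defs = Hash.new { |hash, key| hash[key] = [] }

report.each do |feature|
  feature_file = feature['uri']
  (feature['elements'] || []).each do |scenario|
    (scenario['steps'] || []).each do |step|
      next unless step['match'] && step['match']['location']
      # Locations look like "features/step_definitions/foo_steps.rb:12".
      step_def_file = step['match']['location'].split(':').first
      feature_step_defs[feature_file] << step_def_file
    end
  end
end

feature_step_defs.each do |feature_file, step_def_files|
  puts feature_file
  step_def_files.uniq.each { |file| puts "  #{file}" }
end
```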
Change rate vs Failure rate with step definition changes
Combining those two bits of data, we can now add the step definition change rates for each feature to our original graph.
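A rough sketch of that combination, reusing the feature_step_defs and step_def_changes hashes from the snippets above:

```ruby
# For each feature, total up the change counts of the step definition
# files it uses. feature_step_defs and step_def_changes come from the
# earlier snippets.
feature_step_def_changes = feature_step_defs.each_with_object({}) do |(feature, step_files), totals|
  totals[feature] = step_files.uniq.map { |file| step_def_changes.fetch(file, 0) }.inject(0, :+)
end

feature_step_def_changes.sort_by { |_, changes| -changes }.each do |feature, changes|
  puts "#{feature}: #{changes} step definition changes"
end
```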
We can also examine an individual breakdown of the step definition change rates for a feature.
Holes in using the step definition change rate
The step definition changes from Git are at the file level (the *_step.rb file), so a change in Git may not touch a step definition actually used by a feature. Hence we may be counting changes which are not relevant to a feature. Further work would be to examine the Git diffs and check whether a change touched a step definition used by a feature.
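One possible approach (untried): the Cucumber JSON match locations already give the file and line of each step definition a feature uses, so Git’s line-range log can tell us which commits actually touched those lines:

```ruby
# Count the commits whose diffs touched a given line of a step
# definition file, using Git's line-range log (git log -L). The line
# number comes from the Cucumber JSON match location, e.g.
# "features/step_definitions/foo_steps.rb:12".
def commits_touching_line(file, line)
  log = `git log -L #{line},#{line}:#{file}`
  log.scan(/^commit \h{40}/).count
end
```

This only tracks the line where the step definition starts rather than its whole body, so it is an approximation rather than a full diff analysis.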
Conclusions
Our tests hide lots of interesting information that can provide evidence of areas where we can make improvements. It’s important to realise that, like anything in statistics, our data mining does not yield facts, just suggestions. At Songkick we are already mining this information with Cucumber and using it to help improve and learn about our tests.