Joseph Wilk

Things with code, creativity and computation.

Mining Cucumber Features

Failure Rates vs Change Rates

I’ve been spending a lot of time recently data mining our test suite at Songkick.com. I’ve been using our own internal version of www.limited-red.com, which records all our test failures for every build on our continuous integration server.

With the Cucumber features I decided to plot the number of times a feature file has been significantly changed against the number of times the feature has failed in the build. The theory is that a change to a feature file (the plain text specification of behaviour) most often represents a change in functionality.

So I wrote a simple script checking the number of commits to each feature file.

features_folder = ARGV[0]
feature_files = Dir["#{features_folder}/**/*.feature"]

feature_files.each do |feature_file|
  # git log --oneline prints one line per commit that touched this file
  change_logs = `git log --oneline #{feature_file} 2>/dev/null`
  change_count = change_logs.split("\n").count
  puts "#{feature_file},#{change_count}"
end
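Assuming the script is saved as feature_changes.rb (an illustrative name), it prints one file,count row per feature:

ruby feature_changes.rb features > feature_changes.csv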

Throwing that into a pretty graph, here is a snapshot of some of the features (I’ve changed the names of the features to hide top secret Songkick business).

[Graph: feature changes compared with feature failures]
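The failure counts come from our build records in Limited Red. As a rough sketch of how the two data sets line up (the failures.csv name and format are assumptions, standing in for whatever export your build records give you):

require 'csv'

changes  = {}
failures = {}

# feature_changes.csv comes from the script above; failures.csv is an
# assumed export of feature_file,failure_count rows from the build records.
CSV.foreach('feature_changes.csv') { |file, count| changes[file]  = count.to_i }
CSV.foreach('failures.csv')        { |file, count| failures[file] = count.to_i }

# One row per feature: file, change count, failure count.
changes.each do |file, change_count|
  puts "#{file},#{change_count},#{failures.fetch(file, 0)}"
end

Redirecting that output to a combined.csv gives one row per feature, ready to plot.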

Insights

Based on this I identified two possible groups of features (see the sketch after the list):

  • Features failing more often than the code around them is changing

  • Features which are robust and do not break when the code around them is changing.
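A crude way of flagging the first group is to pick out any feature whose failure count clearly outstrips its change count. A sketch building on the combined.csv from the join above; the 2x threshold is an arbitrary assumption, and the right cut-off depends on how many builds you run:

require 'csv'

# combined.csv rows: feature_file,change_count,failure_count
CSV.foreach('combined.csv') do |file, change_count, failure_count|
  changes  = change_count.to_i
  failures = failure_count.to_i
  # Flag features failing noticeably more often than they change.
  puts "suspect: #{file} (#{failures} fails vs #{changes} changes)" if failures > changes * 2
end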

Further investigation into the suspect features highlighted three causes (the first two are illustrated after the list):

  • Brittle step definitions

  • Brittle features coupled to the UI

  • Tests with a high dependency on asynchronous behaviour
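To illustrate the first two causes (an invented example using Capybara, not one of our real features): a step definition that drives the page through its exact markup breaks every time the markup changes, while a declarative step that hides the UI details in a helper survives it.

# Brittle: coupled to the exact markup of the sign-up page.
When /^I press the "(.*)" button$/ do |label|
  find(:xpath, "//div[@id='signup']/form/button[text()='#{label}']").click
end

# More robust: states intent; sign_up_as is a hypothetical helper
# that owns the selectors in one place.
When /^I sign up$/ do
  sign_up_as('user@example.com')
end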

Holes in the data – step definitions

We only recorded the change rate of feature files in Git. Features could be broken without ever changing the feature file, for example if a common step definition is broken. Next steps are to identify all the step definitions used by a feature and examine how often the step definitions have changed.

First we find the change count for each step definition file.

step_files = Dir["features/**/*_steps.rb"]

step_files.each do |step_file|
  # One line per commit that touched this step definition file
  change_logs = `git log --oneline #{step_file} 2>/dev/null`
  change_count = change_logs.split("\n").count
  puts "#{step_file},#{change_count}"
end
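Run the same way (step_changes.rb is again an illustrative name), this gives one file,count row per step definition file:

ruby step_changes.rb > step_changes.csv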

Then we work out which step definitions a feature uses. We can do this by running Cucumber with the JSON formatter and matching up step definitions (and hence step definition files) to feature files:

require 'json'
require 'pp'

`cucumber features --dry-run --format json --out .json.out`
features_json = JSON.parse(File.read('.json.out'))

stats = Hash.new { |h, k| h[k] = [] }
features_json['features'].each do |feature|
  feature_name = feature['name']
  # The JSON does not have the feature file. Find the file via the feature name. Messy.
  feature_file = `egrep -riE "feature:? *#{feature_name}" features/`.split(":")[0]
  feature["elements"].each do |element|
    element["steps"].each do |step|
      file_location = step['match']['location']
      file, _ = file_location.split(":")
      if file =~ /_steps\.rb$/
        stats[feature_file] = (stats[feature_file] + [file]).uniq
      end
    end
  end
end
pp stats
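For our suite this prints a hash mapping each feature file to the step definition files it exercises, along these lines (file names invented):

{"features/sign_up.feature" =>
  ["features/step_definitions/user_steps.rb",
   "features/step_definitions/email_steps.rb"]}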

Change rate vs Failure rate with step definition changes

Combining those two bits of data, we can now add the step definition change rates for a feature to our original graph.
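Roughly, that means summing the change counts of the step definition files each feature uses. A sketch carrying on from the stats hash built above, plus the step_changes.csv output of the earlier script:

require 'csv'

# Change counts per step definition file, from the earlier script's output.
step_changes = {}
CSV.foreach('step_changes.csv') { |file, count| step_changes[file] = count.to_i }

# stats maps feature file => step definition files it uses (built above).
stats.each do |feature_file, step_files|
  total = step_files.map { |f| step_changes.fetch(f, 0) }.inject(0, :+)
  puts "#{feature_file},#{total}"
end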

We can also examine an individual breakdown of the step definition change rates for a feature:

Holes in using the step definition change rate

The step definition changes from Git are at the file level (the *_steps.rb file), so a change in Git may not touch a step definition used by a feature. Hence we may be counting changes which are not relevant for a feature. Further work would be to examine the Git diffs and check whether a change touched a step definition used by a feature.
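One possible sketch: walk each commit’s diff for a step file and only count the commits whose added or removed lines look like step definitions rather than helper code. The regexp and file name here are assumptions, and matching a feature’s specific steps would also need the step text from the JSON output above:

STEP_DEF = /^[+-]\s*(Given|When|Then)\s*[(\/]/

# Count commits whose diff for step_file touched a step definition line.
def step_definition_change_count(step_file)
  commits = `git log --format=%H -- #{step_file}`.split("\n")
  commits.count do |sha|
    diff = `git show #{sha} -- #{step_file}`
    diff =~ STEP_DEF
  end
end

puts step_definition_change_count('features/step_definitions/user_steps.rb')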

Conclusions

Our tests hide lots of interesting information that can provide evidence of areas where we can make improvements. It’s important to realise that, like anything in statistics, our data mining does not yield facts, just suggestions. At Songkick we are already mining this information with Cucumber and using it to help improve and learn about our tests.
