Metrics for Plain Text Acceptance Tests

10 Nov

There has been a lot of activity around the value of metrics for source code and tests. In the Ruby world, tools like metric_fu provide a wealth of analysis.

While working on my Cucumber talk for Rails Underground, I started investigating how we could apply metrics to the customer-focused plain text of Cucumber. For those not familiar with Cucumber, it’s an acceptance testing framework that allows non-technical people to write plain-text describing the behaviors of their system. The developers/testers then map the plain-text to tests.

Having spent time teaching people about the plain-text side of Cucumber, I often found myself recommending the same guidelines and warning against the same plain-text anti-patterns. This led me to think about providing metrics that score the customer's plain-text.

Why would we want plain-text acceptance test metrics?

  • Help plain-text beginners avoid bad practices early on.
  • Help improve the quality of plain-text.
  • Help with quality review when new features arrive at a high rate.

Why does the quality of the plain-text matter?

Why focus on quality when the plain-text's primary goal is to be easy for the customer to use?

  • The developer builds the domain-specific language by mapping plain-text to Ruby. Higher quality plain-text could make these mappings easier to manage without any major impact on readability.
  • Higher quality text is easier to read, edit and understand.
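
To make the idea of "mapping plain text to Ruby" concrete, here is a minimal sketch in plain Ruby. It is not Cucumber's real API (Cucumber's step definitions pair a regular expression with a block in a similar spirit); the STEP_MAPPINGS registry and the run_step helper are hypothetical names invented for illustration.

```ruby
# Hypothetical, simplified registry: NOT Cucumber's real API, only the idea behind it.
# Each pattern recognises one phrasing of a step and maps it to Ruby code.
STEP_MAPPINGS = {
  /^there are (\d+) cucumbers$/ => ->(world, count) { world[:cucumbers] = count.to_i },
  /^I eat (\d+) cucumbers$/     => ->(world, count) { world[:cucumbers] -= count.to_i }
}.freeze

# Find the first pattern that matches the step text and run its block.
# Returns false when no mapping exists, i.e. the developer has a new step to write.
def run_step(world, text)
  STEP_MAPPINGS.each do |pattern, action|
    match = pattern.match(text)
    next unless match

    action.call(world, *match.captures)
    return true
  end
  false
end

world = {}
run_step(world, "there are 5 cucumbers")
run_step(world, "I eat 5 cucumbers")
p world
```

Every new phrasing the customer invents for the same idea needs another pattern in the registry, which is why consistent plain-text could keep the mappings manageable.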

Who would find it useful?

Initially, developers.

  • In some scenarios the developers write the features from discussions and give to the customer to review.
  • Developers may tweak/review customer written changes/features.
  • Developers often edit/tweak plain-text from the customer to enable reuse of existing test code.
  • In open source projects, developers often write the Cucumber features. Metrics are something they are comfortable with.

Can you measure quality in plain-text?

First, it's important to distinguish acceptance tests from pure plain-text. Within acceptance tests we have some degree of structure, for example using Given/When/Then to describe scenarios.

Cucumber Example:

Scenario: Eating all cucumbers
  Given there are 5 cucumbers
  When I eat 5 cucumbers
  Then I should have 0 cucumbers

This structure reduces the complexity of analysing the quality of the text. It provides us with different structural elements which have different rules/guidelines on what their content should be.
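
As a rough sketch of how that structure can be exploited, the following plain Ruby splits a scenario into a title and keyword-tagged steps, so that different rules could later be applied to each element. It is a deliberately naive line-based parser written for illustration, not the Gherkin library; parse_scenario, Step and STEP_KEYWORDS are hypothetical names.

```ruby
# Keywords that introduce a step (an assumption: only the English keywords).
STEP_KEYWORDS = %w[Given When Then And But].freeze

Step = Struct.new(:keyword, :text)

# Split scenario text into its title and a list of keyword-tagged steps.
# Lines that are neither a "Scenario:" header nor a step are ignored.
def parse_scenario(source)
  title = nil
  steps = []
  source.each_line do |line|
    line = line.strip
    if line.start_with?("Scenario:")
      title = line.sub("Scenario:", "").strip
    else
      keyword, text = line.split(" ", 2)
      steps << Step.new(keyword, text) if STEP_KEYWORDS.include?(keyword)
    end
  end
  { title: title, steps: steps }
end

scenario = parse_scenario(<<~FEATURE)
  Scenario: Eating all cucumbers
    Given there are 5 cucumbers
    When I eat 5 cucumbers
    Then I should have 0 cucumbers
FEATURE

scenario[:steps].each { |step| puts "#{step.keyword}: #{step.text}" }
```

Once the text is broken down like this, a rule for titles or for Then steps only has to look at a small, well-defined piece of text rather than free prose.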

The problem with measuring the quality of text is that it is far more subjective than in code. So while we cannot be absolute in our assessment of quality, we can try to codify smells that *could* indicate areas in the text that *could* be improved. This is pretty much true for all metrics: they are guidelines, not absolutes (Dan North highlights the dangers of absolute metrics in the Parable of Metrics).
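
A smell codified as a guideline rather than an absolute might look like the sketch below. Scenario length is used purely as an illustration, and the threshold of 7 steps is an arbitrary assumption; scenario_smells, MAX_STEPS and STEP_PATTERN are hypothetical names. The check returns warnings that a person can act on or ignore, and the threshold can be overridden per call.

```ruby
# Arbitrary, hypothetical default; a team would tune or override it.
MAX_STEPS = 7
STEP_PATTERN = /^\s*(Given|When|Then|And|But)\b/

# Return a list of warnings (never hard failures) for a scenario's plain text.
def scenario_smells(text, max_steps: MAX_STEPS)
  step_count = text.each_line.count { |line| line.match?(STEP_PATTERN) }
  warnings = []
  if step_count > max_steps
    warnings << "Long scenario: #{step_count} steps (threshold #{max_steps}); could it be split?"
  end
  warnings
end

text = <<~SCENARIO
  Scenario: Eating all cucumbers
    Given there are 5 cucumbers
    When I eat 5 cucumbers
    Then I should have 0 cucumbers
SCENARIO

p scenario_smells(text)                # returns [] because 3 steps is under the default
p scenario_smells(text, max_steps: 2)  # returns one warning about a long scenario
```

Keeping the threshold a parameter reflects the subjectivity: what counts as "too long" for one project may be perfectly reasonable for another.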

So what useful metrics could we look at?

Plain text Metrics

From my experience with Cucumber I would suggest examining:

Feedback

What do you think of the idea?

Can you think of any other useful plain-text metrics?

  • Thanks for all the feedback. I'm focusing my coding experiments on this in my rotten project: http://github.com/josephwilk/rotten
  • Perryn Fowler has a vaguely related post: http://www.jroller.com/perryn/entry/given_when_then_and_how

    Might be hard to detect automatically...
  • One more thought relating to the scenario length smell is the step to scenario ratio. (I think Matt has brought this up on the ML before.) We could find out how much reuse of steps is occurring by looking at the length of scenarios versus the number of actual steps used. The number that this tool produced would be very subjective as some projects may choose to take a very declarative approach and sacrifice step reuse.

    re: ignoring warnings, the other code smell detectors use a YAML file to do this so I think that would be the best option.
  • Greg Hnatiuk
    Speaking of putting comments in the feature files, I think comments themselves in feature files are a smell.

    Narratives for Features, Scenarios, Outlines, Backgrounds, and Examples are permissive enough that you shouldn't need to explain things further with comments.
  • I think this is a very interesting idea. Gherkin (the library) is pretty mature now, so implementing this can be done with a simple gherkin listener.

    Another thing that drives me crazy is formatting - when people left-align steps with the word following the keyword, and not the keyword itself. That could be detectable.

    Another one could be detection of duplication - and suggest refactoring to Scenario Outline.

    I'm building a CLI for gherkin itself, and I'd be happy to include a smell detector in the gherkin gem!

    I'm not sure how you would ignore warnings - putting comments in the feature files to achieve that is ugly noise. Maybe a YAML file could do it - referring to files and names. In order to avoid orphans in the YAML file we could also check for orphans and error if they are found.
  • Hi Joe,
    I like the idea. I think it would be best to phrase these as "feature smells". As you point out, these measures of feature quality can be very subjective. Just like code smells, these would be indicators that something might be wrong but don't necessarily dictate that it is. With that in mind it would be nice if such a Gherkin-smell detector allowed you to ignore warnings and set thresholds much like the code smell detectors.

    One thing that really bugs me is when I see variable names in the features, e.g. "Given 10 admin_users exist". I see this a lot in table headers too. So maybe have a Gherkin-smell for code-isms creeping into the features. (An obvious exception to this would be if you are creating a tool for developers.) I'm sure I could think of some more but I think you've outlined the big ones.