Joseph Wilk

Things with code, creativity and computation.

A Little Bit of Pig

Currently in the Science team at Songkick I’ve been working with Apache Pig to generate lots of interesting metrics for our business intelligence. We use Amazon’s Elastic MapReduce and Pig so we don’t have to run complex, long-running and intensive queries on our live database; we can run them on Amazon in a timely fashion instead. So let’s dive into Pig and how we use it at Songkick.com.

Pig (what’s with all these silly names?)

The Apache project Pig is a data flow language designed for analysing large datasets. It provides a high-level platform for creating MapReduce programs that run on Hadoop. It is a little like SQL, but Pig programs are by their structure suitable for parallelisation, which is why they are great at handling very large data sets.

Here’s how we use Pig and Elastic MapReduce in the Science team at Songkick.

Data (Pig food)

Let’s start by uploading some huge and interesting data about Songkick’s artists to S3. We dump a table from MySQL (along with a lot of other tables) and then query that data with Pig on Hadoop. While we could extract all the artist data by querying the live table, it’s actually faster to use mysqldump and dump the table as a TSV file.

For example, it took 35 minutes to dump our artist table with the SQL query ‘SELECT * FROM artists’, but only 10 minutes to dump the entire table with mysqldump.

We format the table dump as TSV and push it to S3, as that makes it super easy to use Amazon’s Elastic MapReduce with Pig.

shell> mysqldump --user=joe --password --fields-optionally-enclosed-by='"' \
                 --fields-terminated-by='\t' --tab /tmp/path_to_dump/ songkick artist_trackings

Unfortunately this has to be run on the db machine, since mysqldump needs access to the file system to save the data. If this is a problem for you, there is a Ruby script for dumping tables to TSV: http://github.com/apeckham/mysqltsvdump/blob/master/mysqltsvdump.rb
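If you cannot get file-system access to the db machine, something like the following Ruby sketch works too (it assumes the mysql2 gem and made-up connection details; the mysqltsvdump script linked above takes a similar approach):

require 'mysql2'

# Hypothetical connection details -- substitute your own.
client = Mysql2::Client.new(:host     => 'dbhost',
                            :username => 'joe',
                            :password => 'secret',
                            :database => 'songkick')

# Write each row out as one tab-separated line.
File.open('/tmp/artist_trackings.tsv', 'w') do |out|
  client.query('SELECT * FROM artist_trackings', :as => :array).each do |row|
    out.puts(row.join("\t"))
  end
end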

Launching (Pig catapult)

We will be using Amazon’s Elastic MapReduce to run our Pig scripts. We can start our job flow in interactive Pig mode, which allows us to ssh to the master node and run the Pig script line by line.
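Launching an interactive job flow from Ruby might look something like this (a sketch: ‘--alive’ keeps the cluster running and ‘--pig-interactive’ installs Pig, but flag spellings varied between releases of the elastic-mapreduce CLI, so check --help for your version):

# Fire up a small cluster we can ssh into and experiment on.
system('elastic-mapreduce', '--create', '--alive',
       '--name', 'pig playground',
       '--num-instances', '1',
       '--pig-interactive')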

Examples (Dancing Pigs)

An important thing to note when running Pig scripts interactively is that they defer execution until they have to expose a result (for example on DUMP or STORE). This means you get nice schema checks and validation helping ensure your Pig script is valid, without actually executing it over your large dataset.

We are going to calculate the average number of users tracking an artist, counting only users who logged in during the last 30 days.

Here is the Pig script, step by step, with example data shown after each statement:

-- Define some useful dates we will use later
%default TODAYS_DATE `date +%Y/%m/%d`
%default 30_DAYS_AGO `date -d "$TODAYS_DATE - 30 day" +%Y-%m-%d`

-- Pig is smart enough, when given a folder, to go and find the files, decompress them if necessary and load them.
-- Note we have to specify the schema, as Pig cannot infer it from our TSV file.
trackings = LOAD 's3://songkick/db/trackings/$TODAYS_DATE/' AS (id:int, artist_id:int, user_id:int);
users = LOAD 's3://songkick/db/users/$TODAYS_DATE/' AS (id:int, username:chararray, last_logged_in_at:chararray);

trackings
<1, 1, 1>
<2, 1, 2>

users
<1,'josephwilk', '11/06/2012'>
<2,'elisehuard', '11/06/2012'>
<3,'tycho', '11/06/2010'>
<3,'tycho', '11/06/2010'>
-- Filter users to only those who logged in during the last 30 days.
-- Pig does not understand dates, so just treat them as strings.
active_users = FILTER users BY last_logged_in_at >= '$30_DAYS_AGO';

active_users
<1,'josephwilk', '11/06/2012'>
<2,'elisehuard', '11/06/2012'>
active_users_and_trackings = JOIN active_users BY id, trackings BY user_id;

-- Group all the users tracking an artist so we can count them.
active_users_and_trackings_grouped = GROUP active_users_and_trackings BY trackings::artist_id;

<1, 1, {<1,'josephwilk', '11/06/2012'>, <2,'elisehuard', '11/06/2012'>}>
trackings_per_artist = FOREACH active_users_and_trackings_grouped GENERATE group, COUNT($1) AS number_of_trackings;

<{<1,'josephwilk', '11/06/2012'>, <2,'elisehuard', '11/06/2012'>}, 2>
-- Group all the counts so we can calculate the average.
all_trackings_per_artist = GROUP trackings_per_artist ALL;

<{{<1,'josephwilk', '11/06/2012'>, <2,'elisehuard', '11/06/2012'>}, 2}>
-- Calculate the average.
average_artist_trackings_per_active_user = FOREACH all_trackings_per_artist
  GENERATE '$TODAYS_DATE' AS dt, AVG(trackings_per_artist.number_of_trackings);

<{<'11/06/2012', 2>}>
-- Now we have done the work, store the result in S3.
STORE average_artist_trackings_per_active_user INTO
  's3://songkick/stats/average_artist_trackings_per_active_user/$TODAYS_DATE';

Debugging Pigs (Pig autopsy)

In an interactive Pig session there are two useful commands for debugging: DESCRIBE, to see the schema of a relation, and ILLUSTRATE, to see the schema with sample data:

DESCRIBE users;
users: {id:int, username:chararray, created_at:chararray, trackings:int}

ILLUSTRATE users;
----------------------------------------------------------------------
| users   | id: int | username:chararray | created_at | trackings:int |
----------------------------------------------------------------------
|         | 18      | Joe                | 10/10/13   | 1000          |
|         | 20      | Elise              | 10/10/14   | 2300          |
----------------------------------------------------------------------

Automating Elastic MapReduce (Pig robots)

Once you are happy with your script you’ll want to automate all of this. I currently do this with a cron task which at regular intervals uses the elastic-mapreduce-ruby lib to fire up an Elastic MapReduce job flow and point it at the Pig script to execute.

It’s important to note that I store the Pig scripts on S3, so it’s easy for elastic-mapreduce to find them.

Follow the instructions to install elastic-mapreduce-ruby: https://github.com/tc/elastic-mapreduce-ruby

To avoid having to call elastic-mapreduce with hundreds of arguments, a colleague has written a little Python wrapper to make it quick and easy to use: https://gist.github.com/2911006

You’ll need to configure where your elastic-mapreduce tool is installed AND where you want Elastic MapReduce to log to on S3 (this means you can debug your job if things go wrong!).

Now all we need to do is pass the script the path to the pig script on S3.

./emrjob s3://songkick/lib/stats/pig/average_artist_trackings_per_active_user.pig
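If you would rather go without the wrapper, the heart of it is small enough to sketch in Ruby (flag spellings varied between releases of the elastic-mapreduce CLI, so treat these as an assumption and check --help for your version):

#!/usr/bin/env ruby
# Create a job flow and hand it the Pig script stored on S3.
script = ARGV[0] # e.g. the s3://songkick/lib/stats/... path above

system('elastic-mapreduce', '--create',
       '--name', "pig: #{File.basename(script)}",
       '--log-uri', 's3://songkick/logs/emr/', # S3 path for debug logs (made up)
       '--pig-script', '--args', script)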

Testing with PigUnit (Simulating Pigs)

Pig scripts can still take a long time to run, even with all that Hadoop magic. Thankfully there is a testing framework, PigUnit:

http://pig.apache.org/docs/r0.8.1/pigunit.html#Overview

Unfortunately this is where you have to step into writing Java. So I skipped it. Sshhh.

References

  1. Apache Pig official site: http://pig.apache.org

  2. Nearest Neighbours with Apache Pig and JRuby: http://thedatachef.blogspot.co.uk/2011/10/nearest-neighbors-with-apache-pig-and.html

  3. Helpers for messing with Elastic MapReduce in Ruby: https://github.com/tc/elastic-mapreduce-ruby

  4. mysqltsvdump: http://github.com/apeckham/mysqltsvdump/blob/master/mysqltsvdump.rb

Examples Alone Are Not a Specification

The Gherkin syntax used by Cucumber enforces that feature files contain scenarios, which are examples of the behaviour of a feature. However, Gherkin does not enforce that a specification is present. Examples are great at helping us understand specifications, but they are not specifications themselves.

What do we mean when we say specification?

definition: A detailed, exact statement of particulars

In a Gherkin feature the specification lives in the free-form description between the Feature: line and the first Scenario:.

Let’s look at a real example:

A Feature with just Examples

A Cucumber example based on a feature (which I have modified) from rspec-expectations, part of the RSpec test library:

Feature: be_within matcher

  Scenario: basic usage
    Given a file named "be_within_matcher_spec.rb" with:
      """
      describe 27.5 do
        it { should be_within(0.5).of(27.9) }
        it { should be_within(0.5).of(27.1) }
        it { should_not be_within(0.5).of(28) }
        it { should_not be_within(0.5).of(27) }

        # deliberate failures
        it { should_not be_within(0.5).of(27.9) }
        it { should_not be_within(0.5).of(27.1) }
        it { should be_within(0.5).of(28) }
        it { should be_within(0.5).of(27) }
      end
      """
    When I run `rspec be_within_matcher_spec.rb`
    Then the output should contain all of these:
      | 8 examples, 4 failures                     |
      | expected 27.5 not to be within 0.5 of 27.9 |
      | expected 27.5 not to be within 0.5 of 27.1 |
      | expected 27.5 to be within 0.5 of 28       |
      | expected 27.5 to be within 0.5 of 27       |
So where is the explanation of what be_within does? If I want to know how be_within works, I want a single, concise explanation, not five or six different examples. Examples add value later, validating that specification.

A Feature with both Specification and Examples

Let’s add back in the specification part of the feature (drum roll):

Feature: be_within matcher

  Normal equality expectations do not work well for floating point values.
  Consider this irb session:

      > radius = 3
        => 3 
      > area_of_circle = radius * radius * Math::PI
        => 28.2743338823081 
      > area_of_circle == 28.2743338823081
        => false 

  Instead, you should use the be_within matcher to check that the value
  is within a delta of your expected value:

      area_of_circle.should be_within(0.1).of(28.3)

  Note that the difference between the actual and expected values must be
  smaller than your delta; if it is equal, the matcher will fail.

  Scenario: basic usage
    Given a file named "be_within_matcher_spec.rb" with:
      """
      describe 27.5 do
        it { should be_within(0.5).of(27.9) }
        it { should be_within(0.5).of(27.1) }
        it { should_not be_within(0.5).of(28) }
        it { should_not be_within(0.5).of(27) }

        # deliberate failures
        it { should_not be_within(0.5).of(27.9) }
        it { should_not be_within(0.5).of(27.1) }
        it { should be_within(0.5).of(28) }
        it { should be_within(0.5).of(27) }
      end
      """
    When I run `rspec be_within_matcher_spec.rb`
    Then the output should contain all of these:
      | 8 examples, 4 failures                     |
      | expected 27.5 not to be within 0.5 of 27.9 |
      | expected 27.5 not to be within 0.5 of 27.1 |
      | expected 27.5 to be within 0.5 of 28       |
      | expected 27.5 to be within 0.5 of 27       |

That’s better: we get an explanation of why this matcher exists and how to use it.

Imagine RSpec without the specification

I think of a Cucumber feature without a specification much like an RSpec example without any English sentence/description.

context "" do
  it "" do
    user = Factory(:user)
    user.generate_password
    user.activate

    get "/session/new", :user_id => user.id

    last_response.should == "Welcome #{user.name}"
  end
end

Feels a little odd, doesn’t it?
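For contrast, here is the same example with its descriptions restored (the wording is mine, purely illustrative):

context "a user with an activated account" do
  it "welcomes the user by name when they sign in" do
    user = Factory(:user)
    user.generate_password
    user.activate

    get "/session/new", :user_id => user.id

    last_response.should == "Welcome #{user.name}"
  end
end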

Cucumber Features as Documentation (for real)

RSpec is an example of a project that has taken its Cucumber features and published them as its documentation. Browsing through those features quickly highlights how important it is to have a specification as well as examples. Imagine an API with nothing but examples, leaving you the detective work of trying to figure out what the thing actually does.

Documentation needs to explain/specify what something does as well as provide examples. If you really want anyone to read your feature, provide both examples and a specification.

Co-chair of the Agile Alliance Functional Testing Tools Group

I’m happy to announce that I have joined Elisabeth Hendrickson as co-chair of the Agile Alliance Functional Testing Tools group (AAFTT; yes, the name is a bit of a mouthful).

I came across the AAFTT group when they organised a pattern-writing workshop in London in 2010, facilitated by Linda Rising. It was an opportunity to bring together some of the thought leaders in testing and share experiences. Somehow I managed to sneak in and proceeded to steal all these experts’ knowledge. The AAFTT also runs an open space prior to the Agile conference. Last year I was surprised to find people who had travelled to the conference just for the AAFTT open space!

As a developer who loves messing around with testing tools, I’ll happily admit I know little about testing as a profession. The scope and breadth of ideas that the AAFTT exposed me to has left me excited and hungry to play with new tools, ideas and patterns to make testing as fun as it should be. I hope, as a developer who is obsessed with testing, I can help blur the lines between developers and testers. I’m both.

Want to know more about the Agile Alliance Functional Testing Tools group?

And keep your ears open for any events!

Mining Cucumber Features

Failure Rates vs Change Rates

I’ve been spending a lot of time recently data mining our test suite at Songkick.com. I’ve been using our own internal version of www.limited-red.com, which records all our test failures for every build on our continuous integration server.

With the Cucumber features I decided to plot the number of times a feature file has been significantly changed against the number of times the feature has failed in the build. The theory is that a change to a feature file (the plain-text specification of behaviour) most often represents a change in functionality.

So I wrote a simple script to count the number of commits to each feature file.

features_folder = ARGV[0]
feature_files = Dir["#{features_folder}/**/*.feature"]

feature_files.each do |feature_file|
  change_logs = `git log --oneline #{feature_file} 2>/dev/null`
  change_count = change_logs.split("\n").count # one line per commit
  puts "#{feature_file},#{change_count}"
end

Feature changes compared with feature failures

Throwing that into a pretty graph, here is a snapshot of some of the features (I’ve changed the names of the features to hide top-secret Songkick business).

Insights

Based on this I identified two possible groups of features:

  • Features failing more often than the code around them is changing

  • Features which are robust and are not breaking when the code around them is changing.

Further investigation into the suspect features highlighted three causes:

  • Brittle step definitions

  • Brittle features coupled to the UI

  • Tests with a high dependency on asynchronous behaviour

Holes in the data – step definitions

We only recorded the change rate of feature files in Git. Features could be broken without the feature file ever changing, for example if a common step definition is broken. The next step is to identify all the step definitions used by a feature and examine how often those step definitions have changed.

First, find the change count for each step definition file.

step_files = Dir["features/**/*_steps.rb"]

step_files.each do |step_file|
  change_logs = `git log --oneline #{step_file} 2>/dev/null`
  change_count = change_logs.split("\n").count
  puts "#{step_file},#{change_count}"
end

Then we work out which step definitions a feature uses. We can do this by running Cucumber with the JSON formatter and matching up step definitions (and hence step definition files) to feature files:

require 'json'
require 'pp'

`cucumber features --dry-run --format json --out .json.out`
features_json = JSON.parse(File.read('.json.out'))

stats = Hash.new { |h, k| h[k] = [] }
features_json['features'].each do |feature|
  feature_name = feature['name']
  # The JSON does not include the feature file path. Find the file via the feature name. Messy.
  feature_file = `egrep -riE "feature:? *#{feature_name}" features/`.split(":")[0]
  feature["elements"].each do |element|
    element["steps"].each do |step|
      file_location = step['match']['location']
      file, _ = file_location.split(":")
      if file =~ /_steps\.rb$/
        stats[feature_file] = (stats[feature_file] + [file]).uniq
      end
    end
  end
end
pp stats

Change rate vs Failure rate with step definition changes

Combining those two bits of data, we can now add the step definition change rates for a feature to our original graph.
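As a rough sketch of the combination (the CSV file names are hypothetical; they stand for the outputs of the scripts above):

require 'csv'

feature_changes = {} # feature_file => commit count
CSV.foreach('feature_changes.csv') { |file, count| feature_changes[file] = count.to_i }

step_changes = {} # step_file => commit count
CSV.foreach('step_changes.csv') { |file, count| step_changes[file] = count.to_i }

feature_steps = Hash.new { |h, k| h[k] = [] } # feature_file => step files it uses
CSV.foreach('feature_steps.csv') { |feature, step| feature_steps[feature] << step }

# For each feature, sum the change counts of the step files it uses.
feature_steps.each do |feature, steps|
  step_change_total = steps.map { |s| step_changes[s] || 0 }.inject(0, :+)
  puts "#{feature},#{feature_changes[feature] || 0},#{step_change_total}"
end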

We can also examine an individual breakdown of the step definition change rates for a feature:

Holes in using the step definition change rate

The step definition changes from git are at the file level (the *_steps.rb file), so a change in git may not touch a step definition used by a feature. Hence we may be counting changes which are not relevant for a feature. Further work would be to examine the git diffs and check if a change touched a step definition used by a feature.
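A crude starting point for that diff examination (a sketch: it only counts added or removed lines that look like the start of a step definition, rather than tracking which feature uses them):

# Count added/removed step definition lines across a step file's history.
def step_definition_line_changes(step_file)
  diff = `git log -p --oneline #{step_file} 2>/dev/null`
  diff.lines.count { |line| line =~ %r{^[+-]\s*(Given|When|Then)\s*/} }
end

puts step_definition_line_changes('features/step_definitions/user_steps.rb')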

Conclusions

Our tests hide lots of interesting information that can provide evidence of areas where we can make improvements. It’s important to realise that, like anything in statistics, our data mining does not yield facts, just suggestions. At Songkick we are already mining this information with Cucumber and using it to help improve and learn about our tests.

Conferences and Accessibility: Please Try Harder

I speak at a lot of conferences around the world and I’ve been happy to hear a lot of healthy debate around the diversity of the programming community. I really enjoyed and felt inspired by the talk from Joshua Wehner on “Must It Always Be About Sex” at Nordic Ruby and Ruby Lugdunum.

It got me thinking about the issue of accessibility at conferences. I use a wheelchair and happily throw myself at any conference regardless of the obstacles. But not everyone is as crazy or as flexible as I am. Once at the conference the organisers are always willing to help, but I wonder if we are excluding disabled people who may dismiss the chance to talk or attend as unfeasible due to a lack of information about access. Even if that’s only a few people, those people are valuable members of our communities and deserve the chance to participate if they want to.

How would we ever know if we were discriminating? If no-one turns up in a wheelchair, we are none the wiser.

I understand conferences are expensive and complicated events to organise. I would ask two small things of every conference organiser:

1. Try to select an accessible venue if possible. For both speakers and attendees.

2. Make it clear on the conference website whether the venue, party and hack sessions are accessible or not.

My life was profoundly changed through speaking and attending conferences, in many ways beyond programming. I hope we can make sure anyone who wants that opportunity won’t be put off.

Startups Infiltrating Agile and XP

I’m very excited to be talking at this year’s Agile2011 conference. It’s a goal I’ve been working towards for a long time.

I’ve been to lots of agile conferences where I have listened to consultants talk about how they made their multinationals or large organisations better through X. Interesting, but often not much I could take away and apply at a startup.

Sadly there are only 4 talks at Agile2011 tagged with Startups and 6 tagged with Lean Startup.

I’m passionate about startups and I want there to be more focus on them at agile events. Having spoken to lots of startups all over the world and living in the London startup hub, I all too often hear people dismiss Agile and XP events as not relevant. While there is a lot of activity around the Lean Startup, little is said about Extreme Programming and startups, and the rich experience of the Agile community tends to get lost. The Lean Startup has provided a movement which has helped deliver ideas from agile to startups, something agile conferences have failed at.

I’m happy to be injecting a dose of startup with my talk at Agile2011 about Acceptance testing in the world of the startup. I’ll be drawing from my experiences of working at the startup Songkick.com. Looking at the things we did badly, things we did well and things we still have no clue about.

I’m also helping organise events in London and other major cities around Extreme Programming and Startups.

The Art of Cucumber London Workshop

I’ll be hosting a half-day workshop, “The Art of Cucumber”, at Skills Matter in London. If you are interested in learning more about Cucumber, understanding how Cucumber can fit into software development, and patterns for writing healthy Cukes, this workshop is for you.

The workshop will also give you a sneak peek into my talk at Agile2011 (Salt Lake City) and Spa2011 (London) without the big conference ticket price.

Book a place: http://theartofcucumber.eventbrite.com/

Testing Outside of the Ruby World

I recently spoke at the Scotland Ruby Conference about interesting testing tools and ideas outside of the Ruby community. The goal was to inspire the Ruby community to push the state of the art in testing.


Story Smells: The Valueless Story

“Why bother discussing the value or writing it down for the stories, everybody already knows what it is”

Problem

The value of a story to a stakeholder is not discussed or written on the card. The group participating in the story-writing workshop feels there is no point in dealing with the value, since it seems obvious.

To a degree this is true: we should (hopefully) already know the high-level business values that we want to achieve before we try to write any story cards. Otherwise we may end up with a vomit of user stories.

Avoiding discussing or writing down the value on story cards can result in cards that state a role and an action, with no hint of why the feature is wanted.

Whose value is it anyway?

While we may already know our business values when writing stories, I tend to shift the discussion to the user’s value. We know what we want; now, why is a user going to do what we want? Note that this shift does not always apply: sometimes we want to force a user to do something that they don’t want to do (such as filling in a captcha).

Problems

Not discussing the value

  • Missing differences in underlying assumptions about why this feature is wanted.

    • I’ve witnessed too many groups who, when jogged to think about the value, start asking the right questions and discover underlying differences in otherwise implicit assumptions.
  • Creating features you don’t need.

Not writing the value on a story card

  • Difficulty prioritising cards

    • Knowing a feature’s value to a role can help guide a card’s prioritisation.
  • Physical cards on a board without any value mean it’s not quickly clear why you are building a feature.

    • When a card is being worked on by various roles, knowing the value helps guide you to re-assess and ask relevant questions throughout the card’s journey to being delivered.

Using high level obvious values

This problem can also be a symptom of expressing the value at too high a level. We all know we need to:

  • Protect revenue
  • Increase revenue
  • Manage cost
  • Increase brand value
  • Make the product remarkable
  • Provide more value to your customers

So putting it as a value for a narrative feels contrived and pointless.

Solutions

Ask ‘why?’

If a group feels the value is obvious, they will have no trouble popping the why stack. Popping the why stack quickly uncovers questions and can lead to a refinement of what seemed obvious. At worst, everyone in the group gains a shared understanding, aligning what they all saw as obvious.

Feature Injection

Try structuring narratives using the Feature Injection format:

  In order to [value]
  As a [role]
  I want [feature]

Starting with the value as the very first thing you write can help ensure you don’t progress onto the other points until you have at least discussed it.

Avoid obvious values

Make sure when you’re discussing the value that you drop down from the highest-level values to something that seems sensible and concrete to everyone.

But what if the value really is obvious?

OK, sometimes a value does seem obvious; perhaps it’s the same as a previous card’s. It is still important to write it down for the story and at least ask ‘is this value obvious to everyone?’ and ‘is this value meaningful to everyone?’. Sometimes that is discussion enough.

Page Object Pattern

What Is the Page Object Pattern?

The Page Object pattern maps a UI page to a class, where a page could, for example, be an HTML page. The functionality to interact with, or make assertions about, that page is captured within the page class; these methods are then called by a test. So ultimately we are introducing a gatekeeper to the GUI of a page.

Why use the Page Object Pattern?

  • Readable DSL for tests

  • Promotes Reuse

  • Centralise UI coupling – One place to make changes around the UI.

Implementing the Page Pattern in Cucumber

Within Cucumber there are two main ways we can encapsulate the page UI:

The Page Object Pattern

features/pages/login_page.rb

class LoginPage
  include Capybara::DSL # assumes Capybara is loaded, e.g. in env.rb

  def login(user, password)
    fill_in :user, :with => user
    fill_in :password, :with => password
    click_button 'login'
  end

  def visit
    # call the session explicitly; a bare `visit` here would recurse
    page.visit "/login"
  end
end

features/step_definitions/user_steps.rb

Given /^I login with username "([^"]*)" and password "([^"]*)"$/ do |username, password|
  login_page = LoginPage.new

  login_page.visit
  login_page.login(username, password)
end

The Page Step definition Pattern

Cucumber step definitions are all defined at the same scope, but we use folders and files to create logical organisation. We can create folders for UI step definitions and domain step definitions.

  • features/domain/step_definitions/*

  • features/ui/step_definitions/*

We create a step definition file mapping to a UI page: features/ui/step_definitions/login_page_steps.rb

Given /^I login with username "([^"]*)" and password "([^"]*)"$/ do |username, password|
  fill_in :user, :with => username
  fill_in :password, :with => password
  click_button 'login'
end

What’s the right way to encapsulate the UI?

Just using step definitions for organisation within a project can have a number of problems:

Global scope within Cucumber’s World

Instance variables are global across all step definitions:

Given /^I mess with scope$/ do
  @this_can_be_seen_by_every_other_step = 'uh oh'
end

Managed and run through Cucumber

There is no easy way to reuse the code outside of Cucumber, or to test it in isolation. By isolating the test code we can easily provide adapters for reuse in different test frameworks (for example, similar to what email-spec does).

The Page Object pattern (and adding another layer of abstraction) has a couple of nice properties:

  • Bounded scope (if you use your classes/objects nicely)

  • Isolated units that can be invoked and controlled independently of overarching testing framework
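As a sketch of that isolation, the LoginPage object above can be driven from RSpec with no Cucumber in sight (assuming Capybara is configured for your app; the greeting text is made up):

require 'capybara'
require 'capybara/dsl'

describe 'logging in' do
  include Capybara::DSL # the spec shares Capybara's session with LoginPage

  it 'greets the user' do
    login_page = LoginPage.new
    login_page.visit
    login_page.login('josephwilk', 'cuker')

    page.should have_content('Welcome josephwilk') # hypothetical greeting copy
  end
end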

Should I be using the Page Object Pattern?

Yes, No, Maybe.

Extra layers of abstraction introduce complexity, so the Page Object pattern should be reserved for when there is a sufficiently high burden of maintenance (which usually means lots of step definitions).

Outside of the Page Object pattern, it’s important to realise the weaknesses of using step definitions as your only modelling tool. Regardless of which metaphor you decide to organise around, it’s a good habit to push the code out of the step definitions.