Joseph Wilk

Things with code, creativity and computation.

Isolating External Dependencies in Clojure

Isolating external dependencies helps make testing easier. We can focus on a specific unit of code and we can avoid slow tests calling real services or databases.

Clojure provides many different ways of achieving isolation. Let's explore what's possible:

Redefining functions

We can redefine vars, and hence functions, in a limited scope with with-redefs.

The documentation suggests its usefulness in testing:

Useful for mocking out functions during testing.

Let's look at an example where we want to isolate a function that logs to a file:

(defn log-fn [] #(spit "report.xml" %))

(defn log [string] ((log-fn) string))

And the test removing the dependency on the filesystem:

(with-redefs [log-fn (fn [] (fn [data] data))]
  (log "hello"))

It's important to note that with-redefs changes are visible in all threads, and it does not play well with concurrency:

with-redefs can permanently change a var if applied concurrently:

(defn ten-sixty-six [] 1066)
(doall 
  (pmap #(with-redefs [ten-sixty-six (fn [] %)] (ten-sixty-six))
        (range 20 100)))

(ten-sixty-six) ; => 49 Ouch!

The parallel redefs conflict with each other when setting back the var to its original value.

Another option is alter-var-root, which globally redefines a var and hence a function. alter-var-root can also alter functions in other namespaces.

Writing our test to use alter-var-root:

(alter-var-root
 (var log-fn)
 (fn [real-fn] ; We are passed the function we are about to stub.
   (fn [] (fn [data] (println data)))))

It's important to note that we have to reset the var if we want to restore the system to its previous state for other tests.
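A minimal sketch of one way to do that (assuming we simply capture the original function in another var before stubbing):

(def original-log-fn log-fn) ; capture the real function before stubbing

(alter-var-root (var log-fn)
                (constantly (fn [] (fn [data] (println data)))))

;; ...run the test...

;; Restore the original definition so other tests see the real thing.
(alter-var-root (var log-fn) (constantly original-log-fn))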

Redefining Dynamic Vars

Using dynamic vars we can rebind the value and hence we can use this as an injection point. Again if we can rebind vars we can rebind functions. Note though that we have to mark those functions as dynamic:

;The real http request
(defn ^:dynamic http-get-request [url] (http/get url))
(defn get [url] (http-get-request url))

(defn fake-http-get [url] "{}")

(fact "make a http get request"
  (binding [http-get-request fake-http-get]
    (get "/some-resource")) => "{}")

Unlike alter-var-root and with-redefs, dynamic vars are bound at a thread-local level, so the stubbings are only visible in that thread. This makes it safe for tests being run concurrently!
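Here is a minimal sketch of that thread-local behaviour (assuming a dynamic var *answer* defined purely for illustration):

(def ^:dynamic *answer* 42)

@(future (binding [*answer* 1] *answer*)) ; => 1, the binding is only visible on that thread
*answer*                                  ; => 42, the root value is untouched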

Atoms & Refs (Global vars in disguise)

While insidious, evil, malicious and ugly, we could use atoms or refs to contain a dependency.

(def cache (atom (fn [method, & args] (apply (resolve (symbol (str "memcache/" method))) args))))

(defn get [key] (@cache "get" key))

And in our test:

(reset! cache (fn [method, & args] (apply (resolve (symbol (str "fake-cache/" method))) args)))

Yuck, let's never speak of that again.

Midje

The Midje testing framework provides stubbing methods through provided. In the core of Midje this uses our previously visited alter-var-root.

Let's see how our example would look using Midje:

The code:

(defn log-fn [] #(spit "report.xml" %))

(defn log [string] ((log-fn) string))

And our test that uses provided:

(fact "it should spit out strings"
  (log "hello") => "hello"
  (provided
    (log-fn) => (fn [data] data)))

It's important to note that provided is scoped in its effect: it is only active during the form before the Midje "=>" assertion arrow.

Conceptually think of it like this:

(provided
  (log-fn) => (fn [data] (println data))
  (captured-output (log "hello"))) => (contains "hello")

Flexibility

Midje's provided gives very fine-grained control over when a stub is used:

(do
  (log "mad hatter")   ;will use the stub
  (log "white rabbit") ;will not use the stub
)
(provided
  (log "mad hatter") => (fn [data] (println data)))

And we can go further and define argument matcher functions, giving huge flexibility over when a stub should be run.
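A minimal sketch using Midje's built-in anything matcher (notify here is a hypothetical function that calls log internally):

(fact "log is stubbed whatever argument it receives"
  (notify "mad hatter") => anything   ; notify is hypothetical and calls log
  (provided
    (log anything) => nil))           ; anything matches any argument passed to log

Midje also has an as-checker wrapper for using an ordinary predicate as an argument matcher.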

Safety

Midje validates your stubs and checks you're not doing anything too crazy that would fundamentally break everything.

Higher order functions

We can isolate dependencies by passing in functions which wrap that dependency. This abstracts the details of the dependency and provides a point where we can inject our own functions which bypass the dependency.

For example:

(defn extract-urn [data]
  (let [urn-getter #(:urn data)]
    (do-it urn-getter)))

In our tests:

  (do-it (fn [] 10))

Simple and beautiful.

Substituting namespaces

We can switch the namespace that a block of functions are evaluated in. Hence we can swap in a completely new implementation (such as a fake) by changing the namespace.

An example faking out a key value store:

(defn get-it [arg & [namespace]]
  ((ns-resolve (or namespace *ns*) 'get) arg))

A fake implementation of this service:

(ns test.cache.fake)

(def cache (atom {}))

(defn get [arg]
  (@cache arg))

(defn put [arg value]
  (swap! cache assoc arg value))

And our test:

(fact "it should do something useful"
  (get-it "1234" 'test.cache.fake) => "1234")

Alternatively, if we don't want the mess of injecting new namespaces into our functions, we could change namespace aliases to achieve the same effect:

(ns example
  (:require [cache.memcache :as memcache]))

(when (= (System/getenv "ENV") "TEST")
  (ns-unalias 'example 'memcache)
  (require 'test.cache.fake)
  (alias 'memcache 'test.cache.fake))

(defn get [arg]
  (memcache/get arg))

; ...

(when (= (System/getenv "ENV") "TEST")
  ;Cleanup our rebinding of the memcache alias
  (ns-unalias 'example 'memcache))

When running our tests we mutate the behaviour of the system by setting the ENV environment variable to TEST.

Runtime Polymorphism

Switch behaviour based on the type of an argument. During testing we can inject a specific type which will behave differently from the real system.

Clojure gives us protocols and multimethods to achieve this:

Protocols

(:require [fake-service :as fake]
          [service      :as service])

(defprotocol Service
  (do-get [this arg]))

(deftype FakeService []
  Service
  (do-get [this arg] (fake/do-get arg)))

(deftype RealService []
  Service
  (do-get [this arg] (service/do-get arg)))

And in our test:

(do-get (FakeService.) "cheshire")

Multimethods

Similarly, we can use the type of an argument to select different behaviour.

(:require [fake-service :as fake]
          [service      :as service])

; Dispatch on the class of the service argument (the FakeService / RealService types above).
(defmulti do-get (fn [service param] (class service)))

(defmethod do-get FakeService [service param]
  (fake/do-get param))

(defmethod do-get RealService [service param]
  (service/do-get param))

And in our test:

(do-get (FakeService.) "rabbit")

Defining functions at Runtime

Using environment variables it's possible to switch which functions are defined at runtime. def (and defn) always defines a var at the top level of a namespace.

Here is an example inspired from Midje’s source code:

(defn init! []
  (case (System/getenv "ENV")
    "TEST"
    (do
      (defn get [key]       (fake/get key))
      (defn set [key value] (fake/set key value)))
    ;; else
    (do
      (defn get [key]       (memcache/get key))
      (defn set [key value] (memcache/set key value)))))

We would run our tests with ENV=TEST lein test.

How should I isolate dependencies in Clojure?

Having explored what we can do, what should we do?

There are a number of choices and a lot depends on your programming and testing style:

The Purest form of isolation

Passing in functions that wrap our dependencies means we do not have to mutate the code under test. This is the ideal form of isolation. This is where we want to be.

But sometimes either aesthetics or control might make us look elsewhere.

Functions with many parameters can become ugly and cumbersome to use.

Using external libraries we cannot design the way we want (though we can try by wrapping the heck out of any library).

Finally, integration tests are hard, if not impossible, with this form of dependency isolation.

The Aesthetic small touch form of isolation

alter-var-root is (very) scary, but the guard rails of Midje make it an easy way to isolate dependencies. It also supports flexibility in how we stub functions based on the arguments they are called with (or completely ignoring the arguments). This flexibility is extremely powerful and is a big plus for Midje's provided.

The danger ever present with this form of isolation is ignoring your tests telling you about a problem in your design.

The Simple small touch form of isolation

While Midje provides lots of power and flexibility, it does so at the cost of slightly awkward syntax and a lot of crazy macros (I say this having stared into the heart of Midje). For example, functions run in parallel do not work with provided.

with-redefs, binding and alter-var-root provide the flexibility to handle different testing scenarios, and no prior knowledge of an external tool is required.

If you don’t need the power of Midje or fear its complexity you can happily use nothing but Clojure’s standard library. Maybe you will even learn something about how Clojure works!

The Java small touch form of isolation.

Since Clojure supports Java interop it's always possible to fall back to using Java, OO and dependency injection. If you love Java, this is a good path.

The Crazy Large touch form of isolation

Namespace switching is a shortcut around having to stub out every single function. In one sweep we redefine all the functions of a namespace. This might be more useful for integration tests than unit tests.

That shortcut does come at a cost: we still have to maintain our fake namespace every time something changes in the real namespace, and our production code is left wrapped in ns-resolve or an ugly switch based on environment settings. Ugly!

I don’t recommend using this form of isolation regularly but in edge cases it can be very convenient, though people will still think you are crazy.

A Developers Guide to Creating Presentations

So your talk got selected. Great!

Oh crap! Now the panic hits: you have to actually create a presentation.

Every presenter wants to give the best possible presentation they can, one that sticks in people's minds.

Here are some hard earned lessons for getting the best out of your presentation.

Accept that most of our preconceptions of how we learn are wrong.

Forget those high information dense, black and white acetate slides with professors droning on about solving the towers of Hanoi while you scribble your own notes while downing your fifth coffee.

“Attending a lecture is a passive experience for the student. Of all teaching events, the lecture is most likely to promote basic assumption dependence and sleep”

http://www.dailyprincetonian.com/2009/10/15/24142/

The good news

That does not mean you cannot give a super dense, deeply detailed presentation. It means if you want people to get the most out of your presentation you need to think outside a boring lecture. Luckily fun, humor, creativity, color, interaction and invoking emotions are a key part of helping people learn effectively. Think back to your favourite teacher: why were they your favourite? I'm guessing they introduced some of those things into their teaching.

Creating a presentation

Let's look at some general ideas to help make your presentation great:

1. Dedicate time

Creating a presentation (content, theme, styling, finding pictures, practicing) all takes time. As you get better you can speed up and hone your tactics for presentations.

My first presentation took 3 months. Make sure you start early and give yourself enough time!

2. Research the current state of the art

Doing a talk on Continuous Deployment? Search for every Continuous Deployment presentation that has been given. Watch them all, steal the good ideas and throw away the bad. Try and make sure you are saying something new or presenting a different slant on the topic.

3. Don’t be on your own

Experienced speakers have developed a repertoire of little tricks, ideas and thoughts for presentations. They have also sat through a lot of presentations and have good experience of being the audience. Don't be shy about asking someone who has spoken before for advice, or even finding someone who would be happy to be your mentor.

If nothing else watch some of the most popular speakers on Confreaks and see what tricks/styles they use.

Conferences are not very good at helping get the best out of you when it comes to creating your talk. Once your talk is submitted and accepted you are often left on your own until the day of your talk. I believe this is something fundamentally broken with conferences today. Realise this and find someone to be your mentor.

4. Practice frees the subconscious

Practicing a presentation is an important part of getting better and more confident at giving presentations. Practice wherever and whenever you can, at work, at your local meetup and in the shower (yes I’ve done this). Don’t forget you are not learning if you don’t get feedback from your audience.

There is however another secret reason to practice.

When you know a presentation and are confident in giving it you become freer to improvise, to play to the crowd, to react to the room. Your mind is no longer concerned with messing it up; it's free to improvise and be creative. Understand what I mean? Watch any of Jim Weirich's talks.

5. Borrow confidence from the Samurais

The hard truth is it's easier to pay attention to someone who seems confident about what they are saying than to someone who is very nervous. There is an old and secret trick I learnt to calm nerves before a presentation:

“When faced with a crisis, if one puts some spittle on his earlobe and exhales deeply through his nose, he will overcome anything at hand. This is a secret matter.”

Bushido: The Way of the Samurai

While it sounds a little silly, I found this trick to be very useful. It gives you something to focus on to calm yourself, a ritual with which you find a sense of comfort, and deep breathing is a well-known way of increasing oxygen to reduce stress.

6. Talk to your audience

If you want to engage your audience you need to talk to them, not your notes or your laptop. Look around the audience as you speak. This engenders engagement with the whole audience, eye contact draws us into a conversation and draws the audience into your story.

Going further, always try to remove any obstacles between you and the audience, like those pesky lecterns or tables. You want the audience to connect with you and associate you as one of them. This can help encourage people to listen to what you have to say. Barriers create a separation between you and the audience, and while that can be overcome, why add the challenge in the first place?

7. Time

If you have a 40 minute slot for talking, you do not have to fill it. In fact most people's attention spans dip massively towards that figure. Personally I believe 30 minutes is the sweet spot for a presentation, though you should not feel pressured to pad your talk. Use what time you need to get your point across in the most concise way you can.

8. The final curtain, end your presentation

The end of the presentation is one of the most powerful moments to set the mood of the room and help fire the discussion in the after-talk chit-chat. You have done all the hard work to get here; don't whimper out with a quiet "that's all" or a jarring quick finish.

You want to leave a short summary of the ideas of your presentation and the big questions the audience can discuss afterwards. The best presentations leave the room alive with discussions.

Creating your slides

Now let's take a deeper look at some ideas to help you create great slides:

1. Learn your tools

You know your developer tools, right? Spend any time practicing and getting sharper with your IDE?

If you want to make great presentations you have to invest time in mastering your presentation tools.

Have you ever watched the Keynote tutorials? How about spending an hour playing with animations and finding out what's possible? Why not look at what is possible with presentation tools by watching other people's talks?

2. Shape first, design later

Before getting too caught up in making your slides beautiful and full of pictures of cats, first think out the flow, order and shape of your presentation. A good presentation has a natural flow where each slide leads into the other.

I usually start by brainstorming all the ideas I have about a presentation on sticky notes. No need for each sticky to contain exact details, just words, ideas or thoughts, leaving a mass of mess on the wall.

Then on another wall, try to extract a story, putting the stickies into a beginning, middle and end.

3. Deviate from the default

Keynote and Powerpoint provide lots of nice default (boring) themes. Default themes are a good way to knock something together quickly. But if you use a default theme it's hard to stand out from the crowd.

Do you want your presentation to stand out and be memorable? Do you want people leaving your presentation remembering the funny use of Star Wars Lego characters?

4. Experiment with design

Don’t be afraid to play around with the design of your slides. Experiment lots until you start to like what you have. I spend the most time on the first slide and I usually have hundreds of different designs until I’m happy.

5.  Pick an original theme

Pick a theme/style for your presentation. Try and choose a theme that has sufficient material so you can find lots of pictures. Look to other peoples presentations for ideas.

Examples:

Street art

Tv Programs / Mad Men

Silent Movies

6. Kill them with your first slide

Your very first slide will probably be looked at longer than any of your other slides. You want to immediately capture the audience's attention and excite them. Show them how you are going to tell the story of your presentation, introducing your theme.

Which of these first slides would most engage you?

7. Fonts make the theme

Default fonts can work in a presentation if you have little text on the slide or you have very powerful images/colours. If you want to make your presentation stand out, try different fonts. There are thousands of readable fonts free to download. Explore which ones work for your presentation and your theme.

For example this slide uses a font which fits very well with its DIY/tools theme:

Always remember to pick a font that does not compromise the readability of your content. Use that font consistently throughout your whole presentation.

8. Live and die by your theme

Maintain consistency of your theme throughout. The more daring and original the theme the harder it can be to find media. With great risk can come great reward.

9. Invest time or money into images

There are two paths to great images that help make your presentation stand out.

Buy images

On average I spend £100 per presentation on buying images (Mostly on http://www.istockphoto.com). That is one of my key secrets for standing out in a presentation. Spending money buys you great, high resolution and original images.

Hunt or create images

There are lots of free sources of images on the Internet, such as Flickr. It can take more time to hunt around and find the right images but it's possible. Or create your own images to give your presentation that personal, unique touch.

10. Minimalism is king with content

When we first start a presentation we often overfill the slides with content. It makes sense as we are thinking out what we are going to talk about. As you practice your presentation, that content slowly sinks into your brain (or your notes), allowing you to weed out as much content as possible from the slides, leaving the minimal possible text on each slide. The audience is left listening to you rather than trying to read overly detailed, complex slides.

Anti-patterns of presentations

There are some common anti-patterns in presentations. Let's look at some with examples:

1. Bullet points of death

You can get away with a few of these slides mixed into an interesting presentation. Rely on them too much and it gets boring quickly.

Do you want to keep my attention?

2. Breaking continuity of images, text and content

One of the hardest challenges of adding images to your presentation is ensuring they feel part of the presentation, fitting with your content and style.

Let's look at an example where images do not fit in the presentation:

Now improve that slide by making the image feel less jarring with the rest of the content.

Using framing or picture frames is an easy way to help an image fit in a presentation.

Without a frame:

With a frame:

3. Inconsistent design

If you want to make your presentation beautiful having consistency in design is important.

Consider how the diagrams in this presentation break consistency.

We have non-shaded, hard edged blocks of colour in one diagram.

Then in following diagrams we have round blocks with shading.

4. The death of colour

Colour stimulates my brain. Do you want to stimulate it or send it to sleep?

5. Live code demo fail

You have 30ish minutes of my time. I don’t want to spend that time watching you make typos and debugging an error. You immediately make me lose my concentration, I’ll start flicking through my twitter feed. You have been working hard through your presentation to engage me, why throw it away?

Either practice a heck of a lot or pre-record your demos. I personally use Screenflick for all my recorded demos.

6. Black and white code.

How often do you read source code without syntax highlighting?

For me, that’s never. Highlighting code is also a great excuse to add some colour and life to your slides.

It's not tricky, for example, to carry syntax highlighting from TextMate over into Keynote.

7. Hello my name is

The slides where you tell me who you are, what company you work for, what open source projects you work on, your cat's name, your dog's name, and what you ate for breakfast.

Earn me wanting to know who you are through giving a stimulating presentation.

Content over character.

Avoid breaking the flow of the presentation with a slide which has nothing to do with your topic or content.

Final words

Creating a presentation is hard work. Creating a great presentation is lots and lots of work, and even then you are never sure if your audience will think it's great. Public speaking is stressful and takes a lot of concentration and confidence.

No matter what happens with your talk, be proud of yourself. You decided to put yourself out there, up on stage, trying to explain your ideas to people. This is already a great achievement. Don't be too disheartened or critical of yourself if you are not happy with your talk.

Take a moment in that euphoric buzz of the applause to enjoy yourself.

Then work out how you can do better!

Good luck!

The Aesthetics of Density

Programming languages can be described as Dense.

What does dense mean?

Closely compacted in substance. Having the constituent parts crowded closely together:

What does it mean for a programming language to be dense?

I consider there to be two axes for the density of programming languages:

Density of syntax The syntax is very dense/compact.

For example, Assembly has a very dense syntax: abbreviated commands, small register names, etc. It takes a lot to express simple things.

A simple “for” loop in Assembly

mov cx, 3
startloop:
   cmp cx, 0
   jz endofloop
   push cx
loopy:
   Call ClrScr
   pop cx
   dec cx
   jmp startloop
endofloop:
   ; Loop ended
   ; Do what ever you have to do here

Density of expression The means of expressing simple concepts or solutions is very compact.

This is a little fuzzier than syntax; it can depend on what you are trying to express, and languages often provide many different ways to express something. For example string processing in Erlang is a lot messier than in, say, Ruby. Paul Graham measures this form of density by the number of elements.

As an example PROLOG scores highly in expressive density. One of the main reasons is that when you give up control of execution (imperative style) and describe the problem (declarative style), you increase the expressive density.

The towers of hanoi in Prolog:

move(1,X,Y,_) :-
    write('Move top disk from '),
    write(X),
    write(' to '),
    write(Y),
    nl.
move(N,X,Y,Z) :-
    N>1,
    M is N-1,
    move(M,X,Z,Y),
    move(1,X,Y,_),
    move(M,Z,Y,X).

Programming languages fit along a spectrum within these forms of density. Ruby provides the means to express concepts very syntactically densely. Just look at Ruby Golf (solving a problem with the smallest possible number of characters) for example:

def fizzbuzz(n)
n%3<1&&f="Fizz";n%5<1?"#{f}Buzz":f||n.to_s
end

It is also always possible to build a DSL within a programming language to maximise density.

Where does Density fit with Literate Programming?

Dense syntax moves code away from being an easily accessible form of documentation.

Density of expression can also move code away from being easily accessible as documentation. For example, do you understand how that PROLOG Towers of Hanoi works?

The more focused a language is on expressive/syntactical density the further it moves the art of programming away from Literate programming where we focus on our code being the documentation. Much like writing an essay:

Instead of writing code containing documentation, the literate programmer writes documentation containing code.

Ross Williams. FunnelWeb Tutorial Manual, pg 4.

The readability of the code to humans is the priority.

Under the literate programming paradigm, the central activity of programming becomes that of conveying meaning to other intelligent beings rather than merely convincing the computer to behave in a particular way.

Ross Williams. FunnelWeb Tutorial Manual, pg 4.

Density within our heads

One could argue that dense code can still be literate in style. It's just that you have to fit all the programming language's syntax into your head. It's not unrealistic to ask developers to know the syntax/API of their language, though holding it all in memory when it's particularly dense can be challenging.

If you're a Clojure programmer you might have a good understanding of this code as documentation:

(def ^{:dynamic true
       :doc "some doc here"}
     *allow-default-prerequisites* false)

And if you're a Ruby or Perl programmer you might read this with ease:

$!.is_a?(MonkeyError)

Can dense languages be a good idea?

“The quantity of meaning compressed into a small space by algebraic signs, is another circumstance that facilitates the reasonings we are accustomed to carry on by their aid.”

  • Charles Babbage, quoted in Iverson’s Turing Award Lecture

Is there a trade-off in moving to a more dense form of expression in helping shape the way we think and the kind of thoughts we have?

How easy is it to hold a dense language in our heads, remembering all the syntax in order to easily read code?

Regular Expressions

While regular expressions are not a programming language, they are one of the best examples of a very dense language, both syntactically and expressively, that has persisted in its syntax through many programming languages.

Is that a sign that regular expressions have succeeded in encoding pattern matching text?

Write Once

Do you understand this pattern?

/^[\w]$/

How about we push the complexity level and try some of the more esoteric symbols in regular expressions:

Do you know what this does?

/\b[A-Z0-9._%+-]+@[A-Z0-9.-]+\.[A-Z]{2,4}\b/i

How about this?

Full email detection regular expression (RFC822)

While regular expressions are very well suited to small patterns, with a very dense language our ability to parse complex statements is reduced.

Which has a knock-on effect for maintenance: it's read-only, and even then it's not easy to read.

Readability

In fact it's considered bad practice to write a regular expression of the form above. It's understood that it's hard to read, and hence programmers have to add to the dense language to increase readability:

/
^                                             # start of string
(                                             # first group start
  (?:
    (?:[^?+*{}()[\]\\|]+                      # literals and ^, $
     | \\.                                    # escaped characters
     | \[ (?: \^?\\. | \^[^\\] | [^\\^] )     # character classes
          (?: [^\]\\]+ | \\. )* \]
     | \( (?:\?[:=!]|\?<[=!]|\?>)? (?1)?? \)  # parenthesis, with recursive content
     | \(\? (?:R|[+-]?\d+) \)                 # recursive matching
     )
    (?: (?:[?+*]|\{\d+(?:,\d*)?\}) [?+]? )?   # quantifiers
  | \|                                        # alternative
  )*                                          # repeat content
)                                             # end first group
$                                             # end of string
/

This is definitely not literate programming, comments and code are clearly separate things.

Named capture groups are also an optional feature to improve and document readability.

user_regexp = %r{
   (?<username> [a-z]+ ){0}

   (?<ip_number> [0-9]{1,3} ){0}
   (?<ip_address> (\g<ip_number>\.){3}\g<ip_number> ){0}

   (?<admin> true | false ){0}

   \g<username>:\g<ip_address>:\g<admin>
 }x

Mnemonics

Our memory also struggles to find mnemonics or associations to remember the full vocabulary of regexps:

#Some easy ones
\w #word
\s #space

#Harder ones
(?<!pat)
(?<=pat)
(?!pat)
(?=pat)

Reducing the Density of Regular Expressions

Creating a DSL for parsing text is a huge domain. The power of regular expressions is very clear.

Yet there have been attempts in various languages to move regular expressions towards a more verbose form to improve readability.

Regexp::English

The Perl community has attempted to provide a more English, verbose syntax for regular expressions:

Regexp::English provides an alternate regular expression syntax, one that is slightly more verbose than the standard mechanisms

Lets look at an example:

        use Regexp::English;

        my $re = Regexp::English
                -> start_of_line
                -> literal('Flippers')
                -> literal(':')
                -> optional
                        -> whitespace_char
                -> end
                -> remember
                        -> multiple
                                -> digit;

        while (<INPUT>) {
                if (my $match = $re->match($_)) {
                        print "$match\n";
                }
        }

Better?

Loving the Density of Regular Expressions

Clearly there has been some recognition among developers that regexps could be improved by being more verbose. It's interesting that these attempts are considered failures. It would imply the majority of developers prefer dense regexps.

“you can document them with comments, named capture groups, composing them from well-named variables. of course, no one does those things.” Tom Stuart

In the Perl community some people have given up completely on the humans and their dense, hard to maintain regular expressions. They create tools to decode the density automatically:

use YAPE::Regex::Explain;

print YAPE::Regex::Explain->new(qr/this.*(?:that)?(?!another)/)
      ->explain;

Outputs:

The regular expression:

(?-imsx:this.*(?:that)?(?!another))

matches as follows:

NODE                     EXPLANATION
----------------------------------------------------------------------
(?-imsx:                 group, but do not capture (case-sensitive)
                         (with ^ and $ matching normally) (with . not
                         matching \n) (matching whitespace and #
                         normally):
----------------------------------------------------------------------
  this                     'this'
----------------------------------------------------------------------
  .*                       any character except \n (0 or more times
                           (matching the most amount possible))
----------------------------------------------------------------------
  (?:                      group, but do not capture (optional
                           (matching the most amount possible)):
----------------------------------------------------------------------
    that                     'that'
----------------------------------------------------------------------
  )?                       end of grouping
----------------------------------------------------------------------
  (?!                      look ahead to see if there is not:
----------------------------------------------------------------------
    another                  'another'
----------------------------------------------------------------------
  )                        end of look-ahead
----------------------------------------------------------------------
)                        end of grouping
----------------------------------------------------------------------

Some snippets on developers thoughts about regular expressions:

“Love them, probably because they’re a form of arcane magic and they make me feel special for being able to control them”

“I think they are geek candy … sometimes used to show off maximum geekness”

“Love them because they make me look clever and LIKE A H4XX0rrrrr!”

The Aesthetics of Density

Regular expressions have succeeded and they are one of the few very dense languages to have done so.

The density of regular expressions makes the initial barrier to getting started, and to becoming an expert, high. They are far from what we imagine in a literate programming style, yet once you have the syntax in your head, movement becomes fluid; you think in regular expressions. Dense languages with a very limited common syntax set allow experimenting rapidly. Without practice a dense language quickly drops from your mind and you struggle to fit the problem into the right expressive form.

There is an undeniable beauty in the density of regular expressions, in both syntax and expression.

Finding prime numbers using a single Regexp:

/^1?$|^(11+?)\1+$/

It’s also drink absinthe and cut off your own ear crazy.

I wrote this PROLOG code for my thesis. I have no idea how it works now and it would take me about a month of playing with it to get back to a state where the dense language was back in my head and I could express ideas in the PROLOG way.

I spent over a month adding nothing more than a single “!” mark in the code.

abdemo_holds_ats([holds_at(F,T)|Gs],R1,R3,N1,N3,D) :-
     !,
     abdemo([holds_at(F,T)],R1,R2,N1,N2,D),

     %cut added Joseph Wilk 16/03/2004
     !,
     abdemo_holds_ats(Gs,R2,R3,N2,N3,D).

I still feel its some of the most beautiful code I’ve written. Why?

I revel in the expressive density. Bending my brain to express my thoughts in the densely expressive PROLOG way.

I guiltily dip into the syntactical density because it’s like the detailing on the brush strokes of a painting.

Would I ever write this code in a production system that developers would have to maintain? Hell no.

Would I consider this literate programming? Hell no. Just look at the hundreds of lines of comments.

But for the sake of art and realising a form of flow I’ve not encountered since, I would happily revel in the aesthetics of density.

Michael Wolf “Architecture of Density no.36”: http://www.flickr.com/photos/worldeconomicforum/6751247749/

Why Are You SHOUTING Programmer?

Being shouted at is not a lot of fun. So why do we shout in code?

Shouting code

Compare

"how_many_monkeys_can_a_monkey_eat_before_it_explodes"

and:

"HOW_MANY_MONKEYS_CAN_A_MONKEY_EAT_BEFORE_IT_EXPLODES"

How do you read those differently in your head?

We associate capital letters with someone shouting.

Now let's turn to two functionally equivalent pieces of Ruby code:

class Monkey
  def capacity
    10
  end

  def stomach
      "I can fit #{capacity} monkeys in my belly"
  end
end

And

class Monkey
  CAPACITY = 10

  def stomach
    "I can fit #{CAPACITY} monkeys in my belly"
  end
end

Constants are shouted. Why do we shout? Because:

  • We are angry

  • We have something we think is important and we want everyone to hear it.

Why in Ruby are constants uppercase? Well, they don't have to be; Ruby only constrains us to ensuring the first letter is a capital. We get warnings if we try to reassign their value, but ultimately they are just Fixnums. In order to stick with the Ruby naming convention we use '_' and all caps.

So it's a combination of telling Ruby that this value is a constant and fitting with the naming scheme in Ruby.

We shout because society indicates to us that's the normal behaviour and we all want to be nice citizens of the Ruby republic.

As a side effect constants feel like they are more important than the other variables or methods. They should take our attention first.

Reading uppercase is slow

Wait a minute, isn't uppercase text harder to read? There is evidence [1] to show that all-caps text is less legible and less readable than lower case. So constants are harder for us to read.

lowercase permits reading by word units, while all capitals tend to be read letter by letter

Numbers, letters and uppercase

A common use of shouting case is constants used to remove magic numbers from calculations.

1000 * 667895 / LIMIT + 475436

The brain recognises numbers and letters very differently. In general the brain can recognise words faster than a sequence of digits, since with a word we do not need to read each character in order to recognise the word.

To see for yourself try and read the following:

The hmaun mind does not raed eervy letter by istlef but the word as aa woelh.

Compare how much more time it takes you to read the numbers.

124 3456 3234 5443 3342 55334 66554 47567

By uppercasing the constant we are slowing down this natural ability to read words.

Compare again:

1000 * 667895 / LIMIT + 475436

With:

1000 * 667895 / limit + 475436

Do we really need this further uppercase difference? With the instinctive separation between words and numbers, the extra effort and cognitive slowdown has little value.

Immutability vs Mutability

A legitimate case where it does become useful to use shouting case is where you want to distinguish between variables/functions and constants. Expressing that a value is immutable, in comparison to a mutable variable, is important in a language like Ruby where immutability is not the norm.

1000 * 667895 + scale / LIMIT + 475436 - radius

Shouting in technicolor

Editors often provide color markup for words in uppercase. For example for Ruby in Textmate:

Though editors also colour variables/functions and numbers differently, shouting provides us with a discernible way to see all constants within the code at a glance, based on colour.

History of shouting in code

How or where did this convention of uppercasing constants come from? Why is it a convention? When did we start shouting in our code?

What made us so mad?

Assembly

It started with assembly: the convention was to uppercase variable names and lowercase instructions.

ADCTL  = 0x30
staa ADCTL,X

Variable names were limited in length, so often they were acronyms, which in English are often capitalised (we just skipped the dots). Registers, memory & caches all had nice acronyms which you could reference in your assembly.

Assembly was more machine-centric than human-centric. Not quite shouting as we know it now (though it's understandable we were angry with all that ugly code).

FLOW-MATIC

The birth of programming in English. It was not a pretty birth; this thing was born shouting very loudly. EVERYTHING is in capitals, even the programming language's name!

INPUT  INVENTORY FILE=A
 PRICE FILE=B,
 OUTPUT PRICED-INV FILE=C
 UNPRICED-INV FILE=D,
 HSP D.

Flow-Matic had the built-in constant ZERO. Our first example of a constant, but where everything is capitalised it is not distinguished from other code.

FORTRAN

FORTRAN was a confused language when it came to shouting. The use of lowercase letters in keywords was strictly nonstandard.

IF (IA+IC-IB) 777,777,705
IF (IB+IC-IA) 777,777,799
STOP 1
C USING HERON'S FORMULA WE CALCULATE THE
C AREA OF THE TRIANGLE
S = FLOATF (IA + IB + IC) / 2.0
AREA = SQRT( S * (S - FLOATF(IA)) * (S - FLOATF(IB)) + (S - FLOATF(IC)))
WRITE OUTPUT TAPE 6, 601, IA, IB,
STOP
END

But then the liberation came and after a bloody battle FORTRAN was renamed Fortran.

In this new post-revolution age Fortran's compiler was a liberal one, not caring about shouting or case at all.

program helloworld
     print *, "Hello, world."
end program helloworld

Society, on the other hand, was still very keen to tell its citizens when they should shout. It was a social coding convention that local variables be in lowercase and language keywords be in uppercase.

Language keywords were more important and hence shouted. Far more important than those pesky human named local variables. This inverted the previous Assembly conventions on the use of case.

LISP

Common Lisp is case sensitive, but by default the Common Lisp reader upcases symbol names as it reads them, so however you case the name it refers to the same function:

(defun hi () "Hi!")
(hi) ;; outputs "Hi!"
(HI) ;; outputs "Hi!"
(Hi) ;; outputs "Hi!"

LISP had a social convention to only use lowercase (one might think to avoid confusing situations like the one above). Did LISPeans shout at all? They did when it came to documentation strings:

“In a documentation string, when you need to refer to function arguments, names of classes, or other lisp objects, write these names in all uppercase, so that they are easy to find”

This was to help humans easily find them and because documentation generation tools could detect them.

Shouting the references in unstructured text made them clearly visible to both machines and humans.

COBOL

COBOL is another of those shouting languages which liked everything in uppercase. Which makes reading a COBOL program akin to having someone shout very loudly in your face. Until you cry. Lots.

01 RECORD-NAME.
02 DATA-NAME-1-ALPHA PIC X(2).
02 DATA-NAME-2.
03 DATA-NAME-3-NUMERIC PIC 99.
03 DATA-NAME-4.
04 DATA-NAME-5-ALPHA PIC X(2).
04 DATA-NAME-6-NUMERIC PIC 9(5).
02 DATA-NAME-7-ALPHA PIC X(6).

The only thing that was not upper case was comments.

It helps if all comments are in lower-case, to differentiate from actual commands which should always be in upper-case

Comments were not important, so no need to shout them, which in turn makes them easier to read. Perhaps there is an understanding here that shouting makes code hard to read; comments, which might contain a lot of text, should be easier to read.

If you were still in doubt about COBOL’s evilness: user defined constants were distinguished by using a single character variable name. MAD. YES THAT WAS WORTH SHOUTING.

Basic

In BASIC, keywords were capitalised to distinguish them from variable names. The case is insignificant; it's for the humans, not the compiler.

LET m = 2
LET a = 4
LET force = m*a
PRINT force
END

Keywords are important, so shout them. But in turn make it easier to read the user-named variables by leaving them lower case.

C

C uses uppercase by convention for object-like Macros which get replaced during pre-processing.

#define BUFFER_SIZE 1024
foo = (char *) malloc (BUFFER_SIZE);

Uppercase is used to define a templating language within C, which we can quickly distinguish from the code. It also makes the job of the pre-processor easier when parsing macros.

So is shouting a bad thing?

When it comes to expressing ideas in nothing but text we use everything we can to provide structure and separation to help improve clarity. Shouting or uppercasing words provides a very powerful way of rapidly distinguishing certain aspects of text.

How programming languages spend this limited currency of instinctive separation reflects the language's understanding of readability (I'm looking at you, COBOL) and what they find important enough to earn shouting case.

However most modern languages give you the choice of shouting. In Ruby we can skip it altogether.

Try not shouting for a while. See how it makes you feel.

And always:

TRY AND AVOID SHOUTING FOR TOO LONG AS IT IS HARD TO READ.

Keep it sharp, short and loud.

References

[1] Type and Layout: How Typography and Design Can Get Your Message Across – Or Get in the Way

Recurrent Neural Networks in Ruby

We look at how neural networks work, what is different about recurrent networks, and a library which allows us to use recurrent networks in Ruby (tlearn-rb).

What the heck is a Recurrent Network?

First let's look briefly at how a neural network works:

Neural Networks

Neural networks are modelled on neurones in the human brain. Put very simply, the artificial neuron, given some inputs (the dendrites), sums them to produce an output (the neuron's axon) which is usually passed through some non-linear function. The inputs are usually weighted before being summed.

By taking a set of training data we can teach a neural network such that it can be applied to new data outside of the training set. For example we could have as inputs the states of a chess board and the output as a rank for how good the position is for white. We could after training, input an unseen board state and as output get a rank for how effective the position is for white.

As a neural network is trained it builds up the set of weights for the connections between nodes. Through many training iterations comparing expected outputs and the inputs these weights are built up.

Feedforward Neural Network

In some problems the order in which the inputs arrive at the network is important. A normal network fails at this as there is no explicit sense of the relationships between sets of inputs.

Lets consider an example. A network that is trained to detect how satisfying a word sounds to children.

We feed our network all the syllables of a word and get an output:

["mon", "key"]
["o", "key", "do", "key"]

When we feed the syllable "key" into the neural network it will always return the same output, irrespective of what came before it. This misses a relationship between the syllables of the word.

A recurrent network aims to solve this problem by using both the input layer and the output layer to devise weights of the hidden layer.

Recurrent Network

Going back to our example:

["mon", "key"]
["o", "key", "do", "key"]

When we feed "key" into the neural network the weight returned will be affected by what the previous input was, ["mon"] or ["o", "key", "do"].

So our recurrent neural network would detect that “o-key-do-key” has a rhythm between the syllables that is appealing to children.

A recurrent network allows us to decide when to wipe the previous output and start again. So in our example we would reset the output layer after we have fed in all the syllables of a word. We are interested in the relationships between syllables of a word, not syllables of different words.

So all this is a complicated way of saying Recurrent networks have state. Yes.

Recurrent Networks in Ruby

There was no Ruby library that supported recurrent networks. There was an attempt to add recurrent networks to FANN (which has a ruby-fann gem with bindings) but it was never merged in.

So I adapted the TLearn C library which supports Recurrent Neural Networks and wrapped it in Ruby Love.

It’s having some trouble coming to terms with its new found rubyness, so there is a big alpha warning hanging on the door.

Installing TLearn

gem install tlearn

Using TLearn

require 'tlearn'

tlearn = TLearn::Run.new(:number_of_nodes => 10,
                         :'output_nodes'  => 5..6,
                         :linear          => 7..10,
                         :weight_limit    => 1.00,
                         :connections     => [{1..6   => 0},
                                              {1..4   => :i1..:i3},
                                              {1..4   => 7..10},
                                              {5..6   => 1..4},
                                              {7..10  => [1..4, {:min => 1.0, :max => 1.0}, :fixed, :'one_to_one']}])

training_data = [[{[0, 0, 0] => [0, 1]}],
                 [{[1, 1, 1] => [1, 0]}]]

tlearn.train(training_data, sweeps = 200)

tlearn.fitness([1, 1, 1], sweeps = 200)
# => [0.2, 0.9]

Wait! What does that output mean?

In our example we had 2 outputs. The result we get from running the fitness test is the final output weights:

[0.2, 0.9]

In this example we can think of the first output as rank 1, and the second output as rank 2. We look at which has the highest weighting in the fitness test. In this case it shows us that the input is classified as rank 2. So really we can map the outputs to many different classifications.

How is state reset?

TLearn resets the state for each list of training examples:

[{[0, 0, 0] => [0, 1]}, {[0, 0, 0] => [0, 1]}],
# State will be reset here
[{[1, 1, 1]  => [1, 0]}]

Wait! What the heck does all that config mean?

Part of the work of using neural networks is finding the right configuration settings. TLearn supports a lot of different options. Let's look at what all those configuration options mean. (Check out the TLearn GitHub README for full details of the config options.)

:number_of_nodes => 10

The total number of nodes in this network (not including input nodes)

:'output_nodes'    => 5..6

Which nodes are used for output.

:linear          => 7..10

Nodes 7 to 10 are linear. This defines the activation function of the nodes. The activation function is how all the weights and input are combined for a node to create an output. Linear nodes output the inner-product of the input and weight vectors.

:weight_limit    => 1.00

Limit of 1.0 must not be exceeded in the random initialization of weights.

Connections

Connections specify how all the nodes of the neural network connect. This is the architecture of the neural network. Lets look at the connection settings:

{1..6   => 0}

Node 0 feeds into nodes 1 to 6. Node 0 is the bias node that is always 1.

 {1..4   => :i1..:i3}

The input nodes 1-3 feed into each node from 1 to 4.

{1..4  => 7..10},

Nodes 7-10 feed into each node from 1 to 4.

{5..6   => 1..4},

Nodes 1-4 feed into each node from 5 to 6.

 {7..10  => [1..4, {:min => 1.0, :max => 1.0}, :fixed, :'one_to_one']}

This connection contains a couple of special options. Rather than nodes 1-4 each being fed into node 7, node 1 only connects with node 7, node 2 only with node 8, node 3 only with node 9, and node 4 only with node 10. The :'one_to_one' option causes this. The weights of the connections between these nodes are fixed at 1.0 and never change throughout training.

So put all these together our full neural network is:

Urmm… So how do I know what connection settings to use?

When it comes to deciding how many hidden nodes to have in your network there is a general rule:

The optimal number of hidden nodes is usually between the size of the input and size of the output layers

When deciding what connections to specify in your neural network you can start with everything connected to everything and slowly experiment with pruning connections/nodes which will increase the performance of your network without radically affecting the output efficiency.

It's important to have the bias node connect to all the nodes in the hidden layer and the output layer. This is required so a zero input to the neural network can generate outputs other than 0.

With recurrent networks it is important to build connections and nodes in your network to maintain state. It is quite possible with TLearn to build a plain old neural network with no state. It can be helpful, as in the example given above, to draw out your state, hidden layer and output layer nodes and use this to decide how the network connects.

How do you decide what activation functions to use? Linear, bipolar, etc. Checkout this great paper on the effectiveness of different functions: http://www.cscjournals.org/csc/manuscript/Journals/IJAE/volume1/Issue4/IJAE-26.pdf

One neat (crazy) experimental (crazy) path to explore is neural network topologies generated by using a genetic algorithm to assess the effectiveness of the network: http://www.cs.ucf.edu/~kstanley/neat.html.

TLearn’s Source

If you want to peer into the heart of TLearn the source code is on Github:

git clone git://github.com/josephwilk/tlearn-rb.git

Further reading

Fake Execution

A little RubyGem for faking out execution in your tests and inspecting afterwards what was run.

Why FakeExecution?

I've been creating internal tools for developers to help improve productivity. These tools, written in Ruby, ended up doing lots of shell scripting. The scripts started becoming fairly complicated, so I wanted some test feedback. How could I easily test execution?

Enter FakeExecution.

Installing

gem install fake_execution

How do I use it?

require 'fake_execution/safe'

FakeExecution.activate!

`echo *` # This is not executed

`git checkout git://github.com/josephwilk/fake-execution.git`
`touch monkeys`
system("git add monkeys")
system('git commit -m "needs more monkeys"')
`git push`

FakeExecution.deactivate!

cmds[0].should =~ /echo/
cmds[1].should =~ /git checkout/
cmds[2].should == 'touch monkeys'

`echo *` # outputs: echo *

But I use Rspec

 require 'fake_execution/spec_helper'

 describe "monkeys" do
   include FakeExecution::SpecHelpers

   it "should touch the monkey" do
     `touch monkey`

     cmds[0].should == 'touch monkey'
   end
 end

Source code

http://github.com/josephwilk/fake_execution

Conferences and the Cult of Celebrity

How much should character be a factor for a conference talk being selected?  Big names sell conference tickets. Yet I believe that we can do a lot more to help promote conferences where content takes greater weight than character and in the process help people who have never spoken before start speaking at conferences.

Risk.

Conference organisers take a big risk running a conference.

Will they cover their venue cost?

Which ultimately leads on to will they sell enough tickets?

Once they have enough tickets, will it be a good conference?

Other successful conferences define a pattern of how a conference is laid out. One obvious way of dealing with risk is to follow the example of a success.

I really respect people who organise conferences; they are putting in a lot of effort and their own time to make a successful event.

But I wonder if maybe we could be focusing more on content and less on character.

Selling a presentation

When you submit a talk, an aspect of convincing the organisers to let you talk is who you are. Persuasion by character (ethos, if you like your ancient Greek).

Most submission forms give you plenty of space to sell yourself.

Content not character.

For NordicRuby 2012 the organisers took a step towards pushing for interesting content over character. They recognised that conscious and subconscious persuasion by character is a powerful means of convincing.

During most of their reviewing process they removed all the names from the proposals.

“Another thing we’re doing differently this year is that we’re starting out with anonymous proposals. Each card in Trello just shows the proposal’s title and description. No information about the speaker. We do this to avoid bias in the first stages.”

Talks were selected based on the content. Only later, once the talks had been whittled down, did they introduce the speaker names. Character was considered, it was just delayed to the late selection stage.

This tweet caught my attention as it highlighted that for Jsconf.com.au proposals are selected without names attached. Content is king, character does not matter!

Lonestar Ruby conference submissions, your name, your email, no big bio to sell how great you are. Content rules.

Keynotes: where character is King

The keynote is the staple diet of most conferences. It's where the big names are rolled out and luminaries of our industry share their wise thoughts with us.

Akin to live music, they are the headline act we pay for; the others are the support acts. We might skip them, or give them half of our attention.

What effect does having two tiers of talks at a conference have?

The keynote talk is deemed more important, irrespective of content, because of the character of the speaker. Often we don’t even know what they will be talking about, just that they are keynoting.

Some examples (I’m not placing any blame on these examples; I respect the conferences and the organisers involved, they just help illustrate my point).

We assume (based on authority) that important people will have important things to tell us (and sometimes they do have important ideas to share with us).

Are the other, non-keynote talks of the conference of lesser importance? This is where content starts to win: because all these speakers are non-keynoters, we don’t have as many leading assumptions about their authority or character. We go to the talks whose content interests us.

Maybe all talks should be driven by the content and less by character. Maybe all talks are as important as each other.

Conferences are already starting to kill off the idea of having to have a keynote to sell your conference. FutureRuby and NordicRuby are two examples of conferences with no keynote that are still highly regarded by those who attended.

Will a keynote really make your conference better?

The Content Conference

Let’s see if we can take this a step further and define the content conference.

  1. Talk authors are not revealed during review/selection. (There is no bias by character.)

  2. Only publish the talk titles and content on the conference website. (Sell the conference on its content not its characters.)

  3. No keynotes. All talks are equal.

Conference Mentors

If we want to focus on content and remove character (as much as possible) we have to deal with people with little experience but with great ideas. We need to help make it easier to give feedback and help mentor those people so they can best express their ideas.

Being accepted to speak is not the end of contact; it’s the start.

A group of experienced speakers could act as mentors, giving feedback and helping people deliver the best presentations they can.

Final Words

There is a place in the conference world for events which focus on bringing big names to an audience. Looking back at the 30 conferences I’ve spoken at so far, I’ve come to the conclusion that the conferences I really valued were those that pushed for content over character. Conferences are popping up all the time that follow the content conference ideas and understand the conscious and subconscious bias of character.

I feel that if we push for content over character and improve the proposal/speaker setup, we can find new people with great, interesting, crazy ideas and encourage them to submit proposals and speak.

I hope these ideas about the content conference might help conference organisers think about how they structure their conference and proposal system.

I am offering my time to mentor anyone who wants help writing their first conference proposal.

I am also happy to join any conference that wants a set of experienced speakers to help new speakers get the best out of their presentations.

Crazy, huh?

So what do you think?

A Little Bit of Pig

Currently in the Science team at Songkick I’ve been working with Apache Pig to generate lots of interesting metrics for our business intelligence. We use Amazon’s Elastic MapReduce and Pig to avoid running complex, long-running and intensive queries on our live db; we can run them on Amazon in a timely fashion instead. So let’s dive into Pig and how we use it at Songkick.com.

Pig (what’s with all these silly names)

The Apache Pig project is a data flow language designed for analysing large datasets. It provides a high-level platform for creating MapReduce programs that run on Hadoop. It’s a little like SQL, but Pig programs are, by their structure, suited to parallelization, which is why they are great at handling very large data sets.

Here’s how we use Pig and Elastic MapReduce at Songkick in our Science team.

Data (Pig food)

Let’s start by uploading some huge and interesting data about Songkick’s artists onto S3. We dump a table from MySQL (along with a lot of other tables) and then query that data with Pig on Hadoop. While we could extract all the artist data by querying the live table, it’s actually faster to use mysqldump and dump the table as a TSV file.

For example, it took 35 minutes to dump our artist table with the SQL query ‘select * from artists’. It takes 10 minutes to dump the entire table with mysqldump.

We format the table dump as TSV and push it to S3, as that makes it super easy to use Amazon’s Elastic MapReduce with Pig.

shell> mysqldump --user=joe --password  --fields-optionally-enclosed-by='"'
                  --fields-terminated-by='\t' --tab /tmp/path_to_dump/ songkick artist_trackings

Unfortunately this has to be run on the db machine since mysqldump needs access to the file system to save the data. If this is a problem for you there is a Ruby script for dumping tables to TSV: http://github.com/apeckham/mysqltsvdump/blob/master/mysqltsvdump.rb
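
If you’d rather not run anything on the db host, a rough sketch of the same idea in Ruby might look like the following. This is not what the linked script does; the connection details, table name and use of the mysql2 gem are assumptions for illustration:

# Hedged sketch: dump a table to TSV from a client machine instead of using
# mysqldump --tab (which must run on the db host). Host, credentials and
# table name are placeholders.
require "mysql2"
require "csv"

client = Mysql2::Client.new(host: "db.example.com", username: "joe",
                            password: ENV["DB_PASSWORD"], database: "songkick")

CSV.open("/tmp/artist_trackings.tsv", "w", col_sep: "\t") do |tsv|
  # Stream rows so we never hold the whole table in memory.
  client.query("SELECT * FROM artist_trackings",
               stream: true, cache_rows: false).each do |row|
    tsv << row.values
  end
end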

Launching (Pig catapult)

We will be using Amazon’s Elastic MapReduce to run our Pig scripts. We can start our job in interactive Pig mode, which allows us to ssh to the box and run the Pig script line by line.

Examples (Dancing Pigs)

An important thing to note when running Pig scripts interactively is that Pig defers execution until it has to expose a result (a DUMP or STORE). This means you get nice schema checks and validations that help ensure your Pig script is valid without actually executing it over your large dataset.

We are going to calculate the average number of users tracking an artist, counting only users who have logged in within the last 30 days.

The Pig script:

-- Define some useful dates we will use later
%default TODAYS_DATE `date +%Y/%m/%d`
%default 30_DAYS_AGO `date -d "$TODAYS_DATE - 30 day" +%Y-%m-%d`

-- Pig is smart enough when given a folder to go and find the files, decompress them if necessary and load them.
-- Note we have to specify the schema as Pig does not know it from our TSV file.
trackings = LOAD 's3://songkick/db/trackings/$TODAYS_DATE/' AS (id:int, artist_id:int, user_id:int);
users = LOAD 's3://songkick/db/users/$TODAYS_DATE/' AS (id:int, username:chararray, last_logged_in_at:chararray);
trackings
<1, 1, 1>
<2, 1, 2>

users
<1,'josephwilk', '11/06/2012'>
<2,'elisehuard', '11/06/2012'>
<3,'tycho', '11/06/2010'>
-- Filter users to only those who logged in within the last 30 days
-- Pig does not understand dates, so we just compare them as strings
active_users = FILTER users BY last_logged_in_at >= '$30_DAYS_AGO';
active_users
<1,'josephwilk', '11/06/2012'>
<2,'elisehuard', '11/06/2012'>
active_users_and_trackings = JOIN active_users BY id, trackings BY user_id;

-- group all the users tracking an artist so we can count them
active_users_and_trackings_grouped = GROUP active_users_and_trackings BY trackings::artist_id;
<1, 1, {<1,'josephwilk', '11/06/2012'>, <2,'elisehuard', '11/06/2012'>}>
trackings_per_artist = FOREACH active_users_and_trackings_grouped GENERATE group, COUNT(active_users_and_trackings) AS number_of_trackings;
<{<1,'josephwilk', '11/06/2012'>, <2,'elisehuard', '11/06/2012'>}, 2>
-- group all the counts so we can calculate the average
all_trackings_per_artist = GROUP trackings_per_artist ALL;
<{{<1,'josephwilk', '11/06/2012'>, <2,'elisehuard', '11/06/2012'>}, 2}>
-- Calculate the average
average_artist_trackings_per_active_user = FOREACH all_trackings_per_artist
  GENERATE '$TODAYS_DATE' AS dt, AVG(trackings_per_artist.number_of_trackings);
<{<'11/06/2012', 2>}>
-- Now we have done the work, store the result in S3
STORE average_artist_trackings_per_active_user INTO
  's3://songkick/stats/average_artist_trackings_per_active_user/$TODAYS_DATE';

Debugging Pigs (Pig autopsy)

In an interactive Pig session there are two useful commands for debugging: DESCRIBE, to see the schema, and ILLUSTRATE, to see the schema with sample data:

DESCRIBE users;
users: {id:int, username:chararray, created_at:chararray, trackings:int}

ILLUSTRATE users;
----------------------------------------------------------------------
| users   | id: int | username:chararray | created_at | trackings:int |
----------------------------------------------------------------------
|         | 18      | Joe                | 10/10/13   | 1000          |
|         | 20      | Elise              | 10/10/14   | 2300          |
----------------------------------------------------------------------

Automating Elastic MapReduce (Pig robots)

Once you are happy with your script you’ll want to automate all of this. I currently do this with a cron task which, at regular intervals, uses the elastic-mapreduce-ruby lib to fire up an Elastic MapReduce job and point it at the Pig script to execute.

It’s important to note that I store the Pig scripts on S3 so it’s easy for elastic-mapreduce to find them.

Follow the instructions to install elastic-mapreduce-ruby: https://github.com/tc/elastic-mapreduce-ruby

To avoid having to call elastic-mapreduce with hundreds of arguments, a colleague has written a little Python wrapper to make it quick and easy to use: https://gist.github.com/2911006

You’ll need to configure where your elastic-mapreduce tool is installed AND where you want Elastic MapReduce to log to on S3 (this means you can debug your job if things go wrong!).

Now all we need to do is pass the script the path to the pig script on S3.

./emrjob s3://songkick/lib/stats/pig/average_artist_trackings_per_active_user.pig
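
To give a rough idea of what such a wrapper does (the real one lives in the gist above), here is a hedged sketch in Ruby. The install path, log bucket and elastic-mapreduce flags are assumptions and may differ from your setup or CLI version:

#!/usr/bin/env ruby
# Hedged sketch of a wrapper around the elastic-mapreduce CLI.
# Paths, bucket names and flags are assumptions -- check your install/version.
EMR_TOOL = File.expand_path("~/elastic-mapreduce-ruby/elastic-mapreduce")
LOG_URI  = "s3://songkick/emr-logs/"       # hypothetical S3 log location

pig_script = ARGV.fetch(0) { abort "usage: emrjob s3://bucket/path/to/script.pig" }

cmd = [EMR_TOOL,
       "--create",                          # start a new job flow
       "--name", "pig: #{File.basename(pig_script)}",
       "--log-uri", LOG_URI,                # lets you debug failed jobs later
       "--pig-script", "--args", pig_script,
       "--num-instances", "4"]              # size the cluster to your data

system(*cmd) or abort "elastic-mapreduce failed"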

Testing with PigUnit (Simulating Pigs)

Pig scripts can still take a long time to run even with all that Hadoop magic. Thankfully there is a testing framework, PigUnit.

http://pig.apache.org/docs/r0.8.1/pigunit.html#Overview

Unfortunately this is where you have to step into writing Java. So I skipped it. Sshhh.

References

  1. Apache Pig official site: http://pig.apache.org

  2. Nearest Neighbours with Apache Pig and JRuby: http://thedatachef.blogspot.co.uk/2011/10/nearest-neighbors-with-apache-pig-and.html

  3. Helpers for messing with Elastic MapReduce in Ruby https://github.com/tc/elastic-mapreduce-ruby

  4. mysqltsvdump http://github.com/apeckham/mysqltsvdump/blob/master/mysqltsvdump.rb

Examples Alone Are Not a Specification

The Gherkin syntax used by Cucumber enforces that feature files contain scenarios, which are examples of the behaviour of a feature. However, Gherkin has no constraint on whether a specification is present. Examples are great at helping us understand specifications, but they are not specifications themselves.

What do we mean when we say specification?

definition: A detailed, exact statement of particulars

In a Gherkin feature the specification lives in the free-form description text between the “Feature:” line and the first “Scenario:”.

Let’s look at a real example:

A Feature with just Examples

A Cucumber example based on a feature (which I have modified) from the RSpec test library rspec-expectations:

Feature: be_within matcher
  Scenario: basic usage
    Given a file named "be_within_matcher_spec.rb" with:
      """
      describe 27.5 do
        it { should be_within(0.5).of(27.9) }
        it { should be_within(0.5).of(27.1) }
        it { should_not be_within(0.5).of(28) }
        it { should_not be_within(0.5).of(27) }
        # deliberate failures
        it { should_not be_within(0.5).of(27.9) }
        it { should_not be_within(0.5).of(27.1) }
        it { should be_within(0.5).of(28) }
        it { should be_within(0.5).of(27) }
      end
      """
    When I run `rspec be_within_matcher_spec.rb`
    Then the output should contain all of these:
      | 8 examples, 4 failures                     |
      | expected 27.5 not to be within 0.5 of 27.9 |
      | expected 27.5 not to be within 0.5 of 27.1 |
      | expected 27.5 to be within 0.5 of 28       |
      | expected 27.5 to be within 0.5 of 27       |

So where is the explanation of what be_within does? If I want to know how be_within works I want a single, concise explanation, not five or six different examples. Examples add value later, validating that specification.

A Feature with both Specification and Examples

Let’s add back in the specification part of the Feature. Drum roll…

Feature: be_within matcher

  Normal equality expectations do not work well for floating point values.
  Consider this irb session:

      > radius = 3
        => 3 
      > area_of_circle = radius * radius * Math::PI
        => 28.2743338823081 
      > area_of_circle == 28.2743338823081
        => false 

  Instead, you should use the be_within matcher to check that the value
  is within a delta of your expected value:

      area_of_circle.should be_within(0.1).of(28.3)

  Note that the difference between the actual and expected values must be
  smaller than your delta; if it is equal, the matcher will fail.

  Scenario: basic usage
    Given a file named "be_within_matcher_spec.rb" with:
      """
      describe 27.5 do
        it { should be_within(0.5).of(27.9) }
        it { should be_within(0.5).of(27.1) }
        it { should_not be_within(0.5).of(28) }
        it { should_not be_within(0.5).of(27) }

        # deliberate failures
        it { should_not be_within(0.5).of(27.9) }
        it { should_not be_within(0.5).of(27.1) }
        it { should be_within(0.5).of(28) }
        it { should be_within(0.5).of(27) }
      end
      """
    When I run `rspec be_within_matcher_spec.rb`
    Then the output should contain all of these:
      | 8 examples, 4 failures                     |
      | expected 27.5 not to be within 0.5 of 27.9 |
      | expected 27.5 not to be within 0.5 of 27.1 |
      | expected 27.5 to be within 0.5 of 28       |
      | expected 27.5 to be within 0.5 of 27       |

That’s better; we get an explanation of why this method exists and how to use it.

Imagine RSpec without the specification

I think of a Cucumber feature without a specification much like an Rspec example without any English sentence/description.

context "" do
  it "" do
    user = Factory(:user)
    user.generate_password
    user.activate

    get "/session/new", :user_id => user.id

    last_response.should == "Welcome #{user.name}"
  end
end

Feels a little odd, doesn’t it?
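
For contrast, here is the same example with descriptions restored (the wording is mine, purely illustrative); the strings play the same role the specification text plays in a feature file:

# Same spec, but with a context and example description filled in
# (hypothetical wording, for illustration only).
context "a user with a freshly generated password" do
  it "welcomes them by name when they start a new session" do
    user = Factory(:user)
    user.generate_password
    user.activate

    get "/session/new", :user_id => user.id

    last_response.should == "Welcome #{user.name}"
  end
end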

Cucumber Features as Documentation (for real)

Rspec is an example of a project that has taken its Cucumber features and published them as its documentation. Browsing through those features quickly highlights how important it is to have a specification as well as examples. Imagine an API with nothing but examples, leaving you the detective work of figuring out what the thing actually does.

Documentation needs to explain/specify what something does as well as provide examples. If you really want anyone to read your feature, provide both examples and a specification.