Joseph Wilk

Things with code, creativity and computation.

Clojure and Kinesis at Scale

I’ve been working over the last year in the data team at SoundCloud building a realtime data pipeline using Clojure and Amazon’s Kinesis. Kinesis is Amazon’s equivalent to Kafka: “Real-Time data processing on the Cloud”. This is a summary of what was built, some lessons learnt and all the details in-between.

Fig 1: Overall system flow

Tapping Real traffic

The first step was to tee traffic from a live system to a test system without compromising the live system’s function. The main job of the live system is logging JSON events to file (which eventually end up somewhere like HDFS). Tailing the logs of the live system gives us access to the raw data we want to forward on to our test system. A little Go program watches the logs, parses out the events and forwards them in batches to test instances that push to Kinesis. Hence we had live data flowing through the system, and after launch a test setup to experiment with. Sean Braithwaite was the mastermind behind this little bit of magic.

Fig 2: Tapping traffic into the canary Kinesis pipeline

Sending to Kinesis

All Kinesis sending happens in an application called the EventGateway (also written in Clojure). This endpoint is one of the most heavily loaded services in SoundCloud (at points it receives more traffic than the rest of SoundCloud combined). The EventGateway does a couple of things, but at its core it validates and broadcasts JSON messages. Hence this is where our Kinesis client slots in.

Squeezing Clojure Reflection

It’s worth mentioning that in order for the EventGateway to be performant we had to remove all reflection in tight loops through type hints. It simply could not keep up without this. Turning reflection warnings on became a common pattern while working in Clojure.

Project.clj

:profiles {:dev {:global-vars {*warn-on-reflection* true *assert* false}}}
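The hints themselves are ordinary Clojure type hints on the hot paths. As a small illustration (not EventGateway code), hinting the argument below removes the reflective call the warning would otherwise flag:

;; Illustrative only: without the ^String hint Clojure resolves .getBytes
;; via reflection on every call, which is far too slow in a tight loop.
(defn message-bytes [^String message]
  (.getBytes message "UTF-8"))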

Kinesis

The EventGateway posts to Kinesis in batches, with separate producers and consumers communicating through a ConcurrentLinkedQueue: producers push messages onto the queue and workers drain it. We rolled our own Clojure Kinesis client on top of Amazon’s Java libraries rather than using Amazonica.

;; Java Amazon libraries used
[com.amazonaws/aws-java-sdk "1.9.33"         :exclusions [joda-time]]
[com.amazonaws/amazon-kinesis-client "1.1.0" :exclusions [joda-time]]

Amazonica was good for getting started quickly in the initial phase, but there were a few reasons we switched to our own unique snowflake (which still looked a little like Amazonica):

  • Amazonica did not support batch mode for Kinesis. Under initial tests it was impossible to scale without batching.
  • We wanted to inject our own telemetry at low levels to learn more about how Kinesis was behaving.
  • Some of its sensible defaults were not so sensible (for example, encoding the data with nippy by default).
  • Ultimately most of any Kinesis client/server is configuration and tuning.
  • Amazonica’s source is hard to read, with a little too much alter-var-root going on:
;; Ugh. It's not just me, right?
(alter-var-root
  #'amazonica.aws.kinesis/get-shard-iterator
  (fn [f]
    (fn [& args]
      (:shard-iterator (apply f args)))))

Pushing Messages onto the Queue

Very simple: we just add a message to the ConcurrentLinkedQueue. An environment variable lets us gradually scale the percentage of traffic that is added to the queue up or down.

(require '[environ.core :refer :all])
(import 'java.util.concurrent.ConcurrentLinkedQueue)

(def kinesis-message-queue (ConcurrentLinkedQueue.))
(def hard-limit-queue-size 1000)
(def queue-size (atom 0))

(defn send [message]
  (when-let [threshold (env :kinesis-traffic)]
    (when (> (rand-int 100) (- 100 (or (Integer/valueOf ^String threshold) 0)))
      (when (<= @queue-size hard-limit-queue-size)
        (.add kinesis-message-queue message)
        (swap! queue-size inc)))))
Failure

The queue pusher operates within a wider system, and any failure due to Amazon being unreachable should not impede the function of that system. For the client this means:

  • Not exceeding memory limits, enforced with a hard cap on queue size (since ConcurrentLinkedQueue is unbounded).
  • Backing off workers when polling the queue to prevent CPU spinning.

When we cannot send messages to Kinesis we instead log them to disk and into our normal logging pipeline (usually ending up in HDFS). Hence we could replay them at a later date if required.

Sending batches to Kinesis

The workers, operating in separate threads, consume messages from the ConcurrentLinkedQueue and collect them into a batch:

(loop [batch []
       batch-start-time (time/now)
       poll-misses 0]
  (if (batch-ready-to-send? batch batch-start-time)
    batch
    (if-let [event (.poll kinesis-message-queue)]
      (do
        (swap! queue-size dec)
        (recur (conj batch event) batch-start-time 0))
      (do
        (Thread/sleep (exponential-backoff-sleep poll-misses))
        (recur batch batch-start-time (inc poll-misses))))))
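batch-ready-to-send? isn’t shown in the post; given the age-or-size rule described below, a minimal sketch (with made-up limits, assuming clj-time behind the time alias) might look like:

(require '[clj-time.core :as time])

;; Sketch only: the real thresholds are tuned per stream. Kinesis caps a
;; PutRecords request at 500 records.
(def max-batch-size 500)
(def max-batch-age-millis 5000)

(defn batch-ready-to-send? [batch batch-start-time]
  (or (>= (count batch) max-batch-size)
      (>= (time/in-millis (time/interval batch-start-time (time/now)))
          max-batch-age-millis)))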

When polling the queue we apply an exponential back-off if no messages are available:

(defn exponential-backoff-sleep
  "Exponential backoff with jitter and a max timeout."
  [misses]
  (let [max-timeout 1000
        jitter-order 10]
    (Math/min
     max-timeout
     (Math/round (+ (Math/exp misses)
                    (* (Math/random)
                       jitter-order))))))
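To get a feel for the curve (jitter aside), the sleep follows e^misses until it hits the cap:

(map exponential-backoff-sleep [0 2 4 6 8])
;; => roughly (1 7 55 403 1000), each plus up to ~10ms of jitter, capped at 1000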

Once the batch is ready (in terms of age or size) it’s sent to Kinesis:

(defn- send-events
  "Performs a putRecords request to send the batch to Kinesis.
   Returns the events that failed."
  [^AmazonKinesisClient client stream-name events]
  (try+
   (let [result (.putRecords client (events->put-records-request events stream-name))]
     (if (pos? (.getFailedRecordCount result))
       (let [failed-events (failures->events result events)]
         (count-failures telemetry (failures->error-codes result))
         failed-events)
       []))
   ;; batch-level catch clauses elided; see "Failure" below
   ))
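failures->events and failures->error-codes aren’t shown; a sketch of the former, relying on the fact that putRecords returns its result entries in the same order as the request entries, with failed entries carrying an error code:

(import '[com.amazonaws.services.kinesis.model PutRecordsResult PutRecordsResultEntry])

;; Sketch: pair each result entry with the event that produced it and keep
;; the events whose entries report an error code.
(defn- failures->events [^PutRecordsResult result events]
  (->> (map vector (.getRecords result) events)
       (filter (fn [[^PutRecordsResultEntry entry _]] (.getErrorCode entry)))
       (mapv second)))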

Note this is also where we decide the partition key. In our case it’s important for the same user to be located on the same partition: when consuming from Kinesis a worker is allocated a partition to work from, and it would miss events if a user’s events were spread across multiple partitions.

(defn- events->put-records-request
  "Takes a batch of JsonNodes and a stream name and produces a PutRecordsRequest."
  [batch event-stream]
  (let [batch-list  (java.util.ArrayList.)
        put-request (PutRecordsRequest.)]
    (.setStreamName put-request event-stream)
    (doseq [^ObjectNode event batch]
      (.remove event ^String failure-metadata)
      (let [request-entry (PutRecordsRequestEntry.)
            request-data  (.getBytes (str event))
            request-buf   (ByteBuffer/wrap request-data 0 (alength request-data))
            ;; Partition on the user id so a user's events stay on one shard
            partition-key (:user-id event)]
        (doto request-entry
          (.setData         request-buf)
          (.setPartitionKey partition-key))
        (.add batch-list request-entry)))
    (.setRecords put-request batch-list)
    put-request))
Failure

Failure can occur on individual records within a batch or in the batch as a whole.

Individual failures
  1. These messages are re-added to the queue so we can try again. If a message fails an nth time it is considered invalid, rejected from Kinesis and logged as an error.
Batch level
  1. Amazon had an Internal Failure. We don’t know what went wrong. (We see this regularly in normal operation.)

  2. Amazon Kinesis is not resolvable (AmazonClientException/AmazonServiceException).

  3. Exceeding the read/write limits of Kinesis (ProvisionedThroughputExceededException).

The last of these is our backpressure signal. In all of these cases, at worst we log the batch to disk for replay later, as sketched below.
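To make that concrete, here is a sketch of how these batch-level cases might wrap send-events; log-batch-to-disk! is a hypothetical stand-in for our normal logging pipeline:

(import '[com.amazonaws AmazonClientException]
        '[com.amazonaws.services.kinesis AmazonKinesisClient]
        '[com.amazonaws.services.kinesis.model ProvisionedThroughputExceededException])

(defn- send-batch-with-fallback
  "Sketch only. Returns the events that still need retrying."
  [^AmazonKinesisClient client stream-name batch]
  (try
    (send-events client stream-name batch)
    (catch ProvisionedThroughputExceededException _
      ;; Backpressure signal: hand the whole batch back so the caller can
      ;; back off before retrying it.
      batch)
    (catch AmazonClientException _
      ;; Amazon unreachable or an InternalFailure: at worst, log to disk
      ;; (log-batch-to-disk! is hypothetical) so we can replay later.
      (log-batch-to-disk! batch)
      [])))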

Consuming Messages from Kinesis

For consumption we run a different application stream for every worker. All workers have their own streams and their own checkpoints, so they operate independently of each other. Some examples of the workers we have running:

  • Logging events to S3
  • Calculating listening time
  • Forwarding certain messages on to various other systems (like RabbitMQ)

Launching a worker is pretty simple with the Amazon Java Kinesis library.

(import '[com.amazonaws.services.kinesis.clientlibrary.lib.worker Worker])

(defn -main [& args]
  (let [worker-fn (fn [events] (print events))
        config    (KinesisClientLibConfiguration. worker-fn)  ;; I'm airbrushing over the Java classes
        processor (reify IRecordProcessorFactory worker-fn)   ;; Ultimately this is a lot of config wrapped in Java fun
        ^Worker worker (Worker. processor config)]
    (future (.run worker))))
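For a less airbrushed picture, here is a sketch of what that wiring can look like against the KCL’s v1 interfaces; the application name, stream name and the treatment of the record bytes are illustrative:

(import '[com.amazonaws.auth DefaultAWSCredentialsProviderChain]
        '[com.amazonaws.services.kinesis.clientlibrary.interfaces
          IRecordProcessor IRecordProcessorFactory]
        '[com.amazonaws.services.kinesis.clientlibrary.lib.worker
          KinesisClientLibConfiguration Worker]
        '[com.amazonaws.services.kinesis.model Record]
        '[java.util UUID])

(defn- record->string [^Record record]
  (let [^java.nio.ByteBuffer buf (.getData record)
        bytes (byte-array (.remaining buf))]
    (.get buf bytes)
    (String. bytes "UTF-8")))

(defn record-processor-factory
  "Wraps a plain worker-fn (a fn over a seq of event strings) in the KCL's
   processor interfaces."
  [worker-fn]
  (reify IRecordProcessorFactory
    (createProcessor [_]
      (reify IRecordProcessor
        (initialize [_ shard-id])
        (processRecords [_ records checkpointer]
          (worker-fn (map record->string records))
          (.checkpoint checkpointer))
        (shutdown [_ checkpointer reason])))))

(defn start-worker!
  "Sketch: application and stream names would come from config."
  [worker-fn app-name stream-name]
  (let [config (KinesisClientLibConfiguration. app-name
                                               stream-name
                                               (DefaultAWSCredentialsProviderChain.)
                                               (str (UUID/randomUUID)))
        worker (Worker. (record-processor-factory worker-fn) config)]
    (future (.run worker))))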

One of the hardest parts of setting up a worker is getting the configuration right to ensure that the consumers get through the events fast enough. Events are held in Kinesis for 24 hours after entry, and hence there is a minimum consumption rate.
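As a back-of-the-envelope check (with made-up numbers): catching up means consuming faster than events arrive before the oldest records age out of the 24 hour window:

;; Illustrative arithmetic only.
(defn min-consumption-rate
  "Events/sec a consumer must sustain to clear a backlog before the oldest
   records expire."
  [incoming-per-sec backlog-events hours-until-oldest-expires]
  (+ incoming-per-sec
     (/ backlog-events (* hours-until-oldest-expires 60.0 60))))

;; e.g. 10k events/sec incoming, a 100M event backlog and 12h of headroom:
(min-consumption-rate 10000 1e8 12) ;; => ~12315 events/sec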

Counting events in and events out with Prometheus made it easier to get the correct consumption rates.

Fig: Entry/exit rates

Via the Amazon console you also get access to various graphs around read/write rates and limits.

Finally, you can also look at Amazon’s DynamoDB table for the Kinesis stream, which provides insight into metrics around leases: how many were revoked, stolen, never finished, etc.

Here is an example of one of our Kinesis worker configurations, covered in scribblings from me trying to work out the right settings.

  {
   ;;default 1 sec, cannot be lower than 200ms
   ;;If we are not reading fast enough this is a good value to tweak
   :idle-time-between-reads-in-millis 500

   ;;Clean up leases for shards that we've finished processing (don't wait
   ;;until they expire)
   :cleanup-leases-upon-shard-completion true

   ;;If the heartbeat count does not increase within the configurable timeout period,
   ;;other workers take over processing of that shard.
   ;;*IMPORTANT* If this time is shorter than time for a worker to checkpoint all nodes
   ;;will keep stealing each others leases producing a lot of contention.
   :failover-time-millis ...

   ;;Max records returned by a single GetRecords call. Cannot exceed 10,000
   :max-records 4500

   ;;Process records even if GetRecords returned an empty record list.
   :call-process-records-even-for-empty-record-list false

   ;;Sleep for this duration if the parent shards have not completed processing,
   ;;or we encounter an exception.
   :parent-shard-poll-interval-millis 10000

   ;;By default, the KCL begins with the most recently added record.
   ;;This instead always reads data from the beginning of the stream.
   :initial-position-in-stream  :TRIM_HORIZON}
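As an aside on how a map like this gets used: assuming the KCL’s with* builder methods, the settings can be threaded onto the KinesisClientLibConfiguration roughly like so:

(import '[com.amazonaws.services.kinesis.clientlibrary.lib.worker
          KinesisClientLibConfiguration InitialPositionInStream])

;; Sketch: thread the Clojure settings map onto the KCL configuration object.
(defn apply-worker-config [^KinesisClientLibConfiguration config opts]
  (-> config
      (.withIdleTimeBetweenReadsInMillis (:idle-time-between-reads-in-millis opts))
      (.withCleanupLeasesUponShardCompletion (:cleanup-leases-upon-shard-completion opts))
      (.withFailoverTimeMillis (:failover-time-millis opts))
      (.withMaxRecords (:max-records opts))
      (.withCallProcessRecordsEvenForEmptyRecordList (:call-process-records-even-for-empty-record-list opts))
      (.withParentShardPollIntervalMillis (:parent-shard-poll-interval-millis opts))
      (.withInitialPositionInStream
        (InitialPositionInStream/valueOf (name (:initial-position-in-stream opts))))))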

Monitoring

Prometheus (http://prometheus.io/), a monitoring tool built at SoundCloud, was core to developing, scaling and monitoring this pipeline. Amazon provides some useful graphs within the AWS console, but the more detailed feedback was very helpful, even if some of it was removed later.

Exception Logging pattern

All exceptions are counted and logged. This was a very useful pattern for driving out errors and spotting leaks in the interactions with Kinesis and in consumption:

(Using a Clojure wrapper around Prometheus: https://github.com/josephwilk/prometheus-clj)

(try+
  (worker-fn raw-events)
  (catch Exception e
    ;; Count failures based on the exception class
    (inc-counter :failed-batches {:type worker-type :error-code (str (.getClass e))})
    (log/error e)))

Note Kinesis regularly spits out “InternalFailure” exceptions. That’s all you get…

Kinesis Internal failures

A Cloud Pipeline in Pictures

In my previous post about Building Clojure services at scale I converted the system metrics to sound. With so many machines processing so many events it’s easy to lose track of the amount of work being done in the cloud. To make this feel more real I captured metrics across all the machines involved and created 3D renderings of the system’s function using openFrameworks and meshes:

Thanks

This work was a team effort by the Data team at SoundCloud: a lot of advice, collaboration and hard work. Kudos to everyone.
