Spark Streaming Microbatch Metrics, Programmatically via the REST API

TL;DR: metric collection script.

The Spark Streaming web UI shows a number of interesting metrics over time. Tan and I were specifically interested in the (micro)batch start times, processing times and scheduling delays, which we could find no documented way of obtaining programmatically. We were running Spark 2.0.0 on YARN 2.7.2 in cluster mode.

All I could find was this StackOverflow question, which suggested either scraping the Spark UI webpage (as, horribly, some have done) or hitting the JSON API endpoint at /api/v1/. Unfortunately, this endpoint does not provide the metrics we see on the Spark Streaming web UI.

Not directly.

It turns out that you can use the /jobs/ endpoint (under /api/v1/applications/[app-id]/) to reconstruct the metrics you see on the Spark Streaming web UI: the batch start time, processing time and scheduling delay. The key to this reconstruction lies in the BatchInfo class definition in the Spark codebase.
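
On YARN, the Spark UI of a running application (and with it this REST API) is reachable through the ResourceManager proxy. Here is a minimal sketch of fetching the per-job JSON that way, assuming the default ResourceManager web port 8088, the standard proxy path and the requests library; adjust both if your cluster is configured differently:

import requests

def fetch_jobs(master, application_id):
    # Spark's monitoring REST API, reached through the YARN ResourceManager proxy.
    url = ("http://{master}:8088/proxy/{app}/api/v1/"
           "applications/{app}/jobs").format(master=master, app=application_id)
    response = requests.get(url)
    response.raise_for_status()
    return response.json()  # one dict per Spark job, with submission/completion times

jobs = fetch_jobs("ec2-52-40-144-150.us-west-2.compute.amazonaws.com",
                  "application_1469205272660_0006")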

I wrote a script that parses the JSON from this endpoint and reconstructs these metrics, given the application ID (the one YARN generates for you on submission) and the YARN master URL. All times are in seconds. A sample execution is:

python get_spark_streaming_batch_statistics.py \
  --master ec2-52-40-144-150.us-west-2.compute.amazonaws.com \
  --applicationId application_1469205272660_0006

Sample output (batch start time, processing time, scheduling delay):

  18:36:55 3.991 3783.837
  18:36:56 4.001 3786.832
  18:36:57 3.949 3789.862
  ...

The script builds a map from each batch to its start time and to all the jobs it contains, along with each job's start and completion times. Simple arithmetic then yields the metrics above. It is easy to modify the script to print the raw timestamps instead of the delays, if one wishes to.
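
For illustration, here is a rough sketch of that arithmetic, mirroring the definitions in BatchInfo: the scheduling delay is the time from the batch's submission to its first job starting, and the processing time is the time from the first job starting to the last job completing. The parse_batch_time and parse_ts helpers below are hypothetical stand-ins for the parsing the actual script does; I assume they return epoch seconds:

from collections import defaultdict

def reconstruct(jobs, parse_batch_time, parse_ts):
    # Group each job's (start, completion) times under the batch it belongs to.
    batches = defaultdict(list)
    for job in jobs:
        batch_time = parse_batch_time(job)  # hypothetical helper: batch time in epoch seconds
        if batch_time is None or "completionTime" not in job:
            continue  # not a streaming batch job, or still running
        batches[batch_time].append(
            (parse_ts(job["submissionTime"]), parse_ts(job["completionTime"])))
    for batch_time in sorted(batches):
        starts, ends = zip(*batches[batch_time])
        scheduling_delay = min(starts) - batch_time  # how long the batch waited to start
        processing_time = max(ends) - min(starts)    # wall-clock time spent on the batch
        print(batch_time, processing_time, scheduling_delay)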

Do file an issue with any questions or contributions you have.