Optimised code saves a lot of time and cost. But optimising Spark code is challenging and time-consuming without understanding where the application is facing bottlenecks and what runtime a user can realistically aim for. Such bottlenecks mainly have three causes:
- The driver compute time is large
- There is a task skew
- There is a lack of tasks, implying parallelism issues
Depending on the situation, there are two possible optimisation goals: we can either reduce the runtime of the application, or make it do the same work with fewer resources. Either way, we need a tool that identifies the bottlenecks standing between us and the desired goal.
The performance of any Spark application can be observed via the YARN ResourceManager UI or the Spark Web UI, but these do not provide detailed metrics that point out the bottlenecks faced by the application. Various tools are available for Spark performance monitoring, such as Sparklens, Sparklint, Dr. Elephant and SparkOscope, that provide more granular details of application performance.
In this blog, we discuss Sparklens in detail and show how it can be used to generate performance metrics for Spark optimisation. We cover the following aspects:
- What is Sparklens?
- How to use Sparklens?
- Detailed performance metrics provided by Sparklens report
  - Driver vs executor wall clock time
  - Estimate of ideal application runtime
  - Critical path time
  - Optimal number of executors
  - Compute hours
  - Skewness and lack of tasks
  - Common terminologies
What is Sparklens?
Sparklens is an offering by Qubole. It is a profiling and performance prediction tool for Spark with a built-in Spark scheduler simulator. It helps in identifying the bottlenecks that a Spark application is facing and reports the application's critical path time.
The Sparklens report gives us an idea of the next steps to take for optimisation. Its primary goal is to make it easy to understand the scalability limits of Spark applications and how efficiently a given application uses the compute resources provided to it. An application may or may not run faster with more executors; Sparklens can answer this question by looking at a single run of the application. It turns Spark application tuning into a well-defined process instead of something learnt by trial and error, thus saving developer time, compute time and cost.
How to use Sparklens?
Sparklens is very easy to use. There are no prerequisites and nothing has to be installed separately. We just have to pass the two options below to spark-submit, and the Sparklens report is generated right after the run.
--packages qubole:sparklens:0.3.1-s_2.11
--conf spark.extraListeners=com.qubole.sparklens.QuboleJobListener
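To show where these two options sit in a complete command, here is a minimal Python sketch that assembles a spark-submit command line. The helper name `sparklens_submit_command` and the application file `my_job.py` are placeholders of ours, not part of Sparklens; only the package coordinates and the listener class come from the options above.

```python
import shlex

# Values taken from the two options shown above.
SPARKLENS_PACKAGE = "qubole:sparklens:0.3.1-s_2.11"
SPARKLENS_LISTENER = "com.qubole.sparklens.QuboleJobListener"


def sparklens_submit_command(app, app_args=()):
    """Return a spark-submit argument list with the Sparklens options added.

    `app` and `app_args` are whatever you would normally pass to
    spark-submit; this hypothetical helper only prepends the two options.
    """
    cmd = [
        "spark-submit",
        "--packages", SPARKLENS_PACKAGE,
        "--conf", f"spark.extraListeners={SPARKLENS_LISTENER}",
        app,
    ]
    cmd.extend(app_args)
    return cmd


if __name__ == "__main__":
    # my_job.py is a placeholder application name.
    print(shlex.join(sparklens_submit_command("my_job.py", ["--date", "2020-01-01"])))
```

The function only builds the argument list; passing it to `subprocess.run` (or pasting the printed line into a shell) is what actually launches the job.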
Detailed performance metrics provided by Sparklens report
- Driver vs executor wall clock time: The total Spark application wall clock time can be divided into time spent on the driver and time spent on the executors. When a Spark application spends too much time on the driver, it wastes the executors' compute time, because the executors sit idle while the driver works.
A few common causes of high driver load, or even driver out-of-memory errors, are:
- Misconfiguration of spark.sql.autoBroadcastJoinThreshold. Spark uses this threshold to decide whether to broadcast a relation to all the nodes in a join operation. On first use, the whole relation is materialised at the driver node. Sometimes multiple tables are also broadcast as part of the query execution.
Explicit result collection at the driver is another cause, and we should try to write the application so that it is avoided. Such work can be delegated to the executors. For example, if some results have to be saved to a particular file, they can either be collected at the driver or written out by an executor.
- Estimate of ideal application runtime: The ideal application time is computed by assuming ideal partitioning of data (tasks == cores and no skew) in all stages of the application.
- Critical path time: The minimum time the application will take, even with an unlimited number of executors. The difference between the actual application runtime and the critical path time is the real elasticity of the Spark application. If that difference is large, adding executors can bring the runtime down.
- Optimal number of executors: If autoscaling or dynamic allocation is enabled, we can see how many executors were available at any given time. Sparklens plots the executors used by the different Spark jobs within the application, together with the minimal (ideal) number of executors that could have finished the same work in the same amount of wall clock time.
- Compute hours: This tells us how much of the compute resources given to an application are not used at all.
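The arithmetic behind these estimates can be sketched in a few lines of Python. This is a simplified model under our own assumptions (stages run strictly one after another, and the driver time is a single number); it is not Sparklens's actual scheduler simulation, and all names below are hypothetical.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class AppRun:
    """Hypothetical summary of one application run (all times in seconds)."""
    driver_seconds: float                   # wall clock time spent on the driver
    stage_task_seconds: List[List[float]]   # task durations, one list per stage
    total_cores: int                        # executors * cores per executor
    wall_clock_seconds: float               # measured end-to-end runtime


def ideal_time(run: AppRun) -> float:
    """Driver time plus each stage's work spread evenly over all cores."""
    return run.driver_seconds + sum(
        sum(tasks) / run.total_cores for tasks in run.stage_task_seconds
    )


def critical_path_time(run: AppRun) -> float:
    """Driver time plus the longest task of each stage (unlimited executors)."""
    return run.driver_seconds + sum(max(tasks) for tasks in run.stage_task_seconds)


def wasted_core_hours(run: AppRun) -> float:
    """Core-hours that were allocated but not spent running tasks."""
    available = run.wall_clock_seconds * run.total_cores
    used = sum(sum(tasks) for tasks in run.stage_task_seconds)
    return (available - used) / 3600.0


if __name__ == "__main__":
    # Made-up numbers: 2 cores, one even stage and one skewed stage.
    run = AppRun(
        driver_seconds=60,
        stage_task_seconds=[[30] * 8, [10, 10, 10, 100]],
        total_cores=2,
        wall_clock_seconds=290,
    )
    print("ideal time (s):", ideal_time(run))
    print("critical path (s):", critical_path_time(run))
    print("wasted core-hours:", round(wasted_core_hours(run), 4))
```

In this made-up run the critical path is 100 seconds shorter than the measured wall clock time, so more executors could help, while the single 100-second task in the second stage sets a floor that no number of executors can remove.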
All of the above parameters can also be viewed graphically in the Sparklens UI.
- Skewness and lack of tasks: Executors can also waste compute time when a stage has fewer tasks than available cores, or when a few tasks run much longer than the rest (skew).
- Common terminologies: The Sparklens report ends with detailed definitions of the major parameters it mentions, which makes it self-contained and clear.
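Skew and lack of tasks can each be expressed as a simple per-stage ratio. The sketch below is modelled on the parallelism and task-skew ratios described in the report's terminology section, as far as we understand them; the function name and the sample numbers are our own, so treat it as an illustration rather than Sparklens's exact computation.

```python
from statistics import median


def stage_ratios(task_seconds, total_cores):
    """Return parallelism and skew ratios for one stage.

    p_ratio:   number of tasks divided by number of cores; a value
               below 1 means some cores had no task to run.
    task_skew: longest task divided by the median task; a value
               well above 1 means a few tasks dominate the stage.
    """
    return {
        "p_ratio": len(task_seconds) / total_cores,
        "task_skew": max(task_seconds) / median(task_seconds),
    }


if __name__ == "__main__":
    # Made-up stage: 4 tasks on 8 cores, one task ten times longer than the rest.
    print(stage_ratios([10, 10, 10, 100], total_cores=8))
```

For this made-up stage, half of the cores have nothing to do, and the longest task takes ten times as long as the median one. Repartitioning the data would be the natural first thing to try in both cases.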
The bottlenecks of a Spark job can be identified through the report, whether they come from the resource, data or script side. The Sparklens report identifies the minimum time that could be achieved, and with a fixed target time we can understand how far we are from optimised code. The critical path time may not always be achievable, but we can try to get as close to it as possible. Based on the findings of Sparklens, a detailed action and implementation plan can be prepared. Sparklens can be rolled out across different pipelines because it is easy to use: the same fixed set of options is passed to spark-submit each time. The reports contain the same metrics on every run, which makes them easier to understand and compare.