
Benchmarking in the Real World: Apache Spark™ 3.5.0 on EC2

In our conversations with clients, we've found that Spark is a go-to tool for many data engineering and analytics tasks. It's battle-tested, robust, and capable of handling virtually any amount of data. However, a common challenge we hear about is the difficulty of managing an organization's Spark deployment. Which clusters should be used? How many clusters are needed? Should you use EMR or EMR Serverless?


The answer often depends on the specific characteristics of your organization's data and workloads. Still, some general guidelines can help improve your total cost of ownership (TCO) without falling back on "it depends." At Underspend, we've benchmarked various Spark programs on different EC2 instances and found that the cost of running the same program can differ by more than 100% from one instance type to another. The key takeaway is that the choice of instance type for Spark deserves deliberate attention, because the cost difference can be significant.


Our results are based on a real-world PySpark program provided by one of our clients. The program uses no UDFs, translates to clean Spark code, and runs on Spark 3.5.0. While the TPC-DS benchmark is widely used and has its place, we've found this program to be more representative of what companies actually run. Here are our findings:



The most cost-effective instances were c7a.xlarge and c7g.xlarge, each a meaningful improvement over its previous-generation counterpart (c6a and c6g, respectively). Running the query on the most expensive instance (r6i.xlarge) cost $0.54 more than running it on the cheapest ones, a gap equal to 142% of the query's cost on the cheapest instance.
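To make the arithmetic behind those two figures concrete, here is a small Python sketch that works backward from them. The only inputs are the $0.54 gap and the 142% ratio quoted above. The per-run costs it prints are derived approximations, not measured values from our benchmark, and the function name `implied_costs` is ours for illustration.

```python
def implied_costs(absolute_gap_usd: float, gap_ratio_to_cheapest: float) -> tuple[float, float]:
    """Recover approximate per-run costs from the cost gap and its ratio to the cheapest cost.

    If gap = ratio * cheapest, then cheapest = gap / ratio,
    and most_expensive = cheapest + gap.
    """
    cheapest = absolute_gap_usd / gap_ratio_to_cheapest
    most_expensive = cheapest + absolute_gap_usd
    return cheapest, most_expensive


# Inputs taken from the text: a $0.54 gap that equals 142% of the cheapest cost.
cheapest, most_expensive = implied_costs(0.54, 1.42)

print(f"Implied cheapest cost per run:       about ${cheapest:.2f}")
print(f"Implied most expensive cost per run: about ${most_expensive:.2f}")
print(f"Most expensive / cheapest:           about {most_expensive / cheapest:.2f}x")
```

In other words, a gap of 142% means the most expensive instance costs roughly 2.4 times as much per run as the cheapest one.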


Your workloads might differ in various aspects (e.g., memory usage, data) from the one we tested. Nevertheless, if you're running a workload frequently, it's advisable to benchmark it on various instances and optimize the instance type to achieve the best total cost of ownership.
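If you want to run that comparison yourself, the bookkeeping can be as simple as the Python sketch below. It assumes cost per run is the hourly price per node times the number of nodes times the runtime in hours, which ignores storage, data transfer, and any managed-service fees. The instance names, prices, and runtimes are placeholders chosen for illustration, not results from our benchmark; substitute your own measurements and current pricing.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class BenchmarkRun:
    instance_type: str
    hourly_price_usd: float  # per-node hourly price (placeholder value)
    node_count: int          # number of nodes in the cluster
    runtime_seconds: float   # measured wall-clock runtime (placeholder value)

    @property
    def cost_usd(self) -> float:
        # Assumed cost model: price per node-hour * nodes * hours of runtime.
        return self.hourly_price_usd * self.node_count * self.runtime_seconds / 3600


def rank_by_cost(runs: list[BenchmarkRun]) -> list[BenchmarkRun]:
    """Return the runs sorted from cheapest to most expensive per run."""
    return sorted(runs, key=lambda run: run.cost_usd)


# PLACEHOLDER data: hypothetical instance names, prices, and runtimes.
runs = [
    BenchmarkRun("instance-a", hourly_price_usd=0.20, node_count=4, runtime_seconds=1800),
    BenchmarkRun("instance-b", hourly_price_usd=0.25, node_count=4, runtime_seconds=1200),
    BenchmarkRun("instance-c", hourly_price_usd=0.15, node_count=4, runtime_seconds=3000),
]

ranked = rank_by_cost(runs)
for run in ranked:
    print(f"{run.instance_type}: ${run.cost_usd:.2f} per run")
```

Note that in this made-up data the instance with the lowest hourly price is the most expensive per run, because it takes the longest to finish. That is why the comparison should use cost per run rather than the hourly rate alone.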


© Underspend 2024. All rights reserved. Apache, Apache Spark, Spark and the Spark logo are trademarks of the Apache Software Foundation.
