Spark performance tuning is key to avoiding bottlenecks in your system's resources and maintaining optimal performance. Doing it thoroughly requires deep knowledge of Spark internals, but here are some basic steps you can take right now.
Spark Performance Tuning
A good way to reduce your program’s memory usage is to store its resilient distributed datasets (RDDs) in serialized form.
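For example, here is a minimal sketch of serialized caching, assuming an existing SparkContext named sc and a hypothetical input file data.txt:

import org.apache.spark.storage.StorageLevel

// MEMORY_ONLY_SER keeps each partition as one serialized byte array,
// trading extra CPU on access for a much smaller memory footprint.
val rdd = sc.textFile("data.txt")
rdd.persist(StorageLevel.MEMORY_ONLY_SER)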
As in any other distributed application, serialization is crucial in Spark. It can improve execution time and make more efficient use of system resources.
Formats that consume a lot of memory or are slow to serialize tend to bog down your system, so they are the first thing to look at when tuning a Spark application.
Spark supports two serialization libraries: Java serialization and Kryo. By default, Spark serializes objects with Java's ObjectOutputStream framework, which can handle any class. While this option is flexible, it can be quite slow and produce large serialized formats for some classes.
You can switch to the Kryo library to speed up the process. It doesn't support every Serializable type, and you will need to register the classes you use, but Kryo is far faster than Java's default solution in most cases.
Change the spark.serializer setting in SparkConf(). To register your classes, pass them to registerKryoClasses on the same configuration:
import org.apache.spark.SparkConf

val conf = new SparkConf()
  .set("spark.serializer", "org.apache.spark.serializer.KryoSerializer")
  .registerKryoClasses(Array(classOf[Person], classOf[Furniture]))
  .set("spark.kryo.registrationRequired", "true")
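With spark.kryo.registrationRequired set to "true", Kryo fails fast on any unregistered class instead of silently falling back to writing full class names, which helps you catch missing registrations early. You can then build your context from this configuration (a brief sketch, continuing from the conf above):

import org.apache.spark.SparkContext

// Every job run through this context now serializes with Kryo.
val sc = new SparkContext(conf)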
Spark relies heavily on memory for both execution and storage. Data can be cached in your system's RAM for later use, and computations such as aggregations, joins, sorts, and shuffles also use a lot of memory at runtime.
Resource constraints may prompt you to reduce memory consumption. In those cases, you must take three factors into account: the overhead of garbage collection (when objects have a high turnover rate), the cost of accessing objects, and the objects' memory footprint.
The two configurations to keep in mind are spark.memory.fraction, which sets how much of the JVM heap Spark uses for execution and storage combined, and spark.memory.storageFraction, which sets how much of that region is protected from eviction.
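As an illustration (a sketch only; the defaults are 0.6 and 0.5 respectively, and the values below are merely examples, not recommendations):

val conf = new SparkConf()
  // Fraction of (JVM heap - 300MB) shared by execution and storage.
  .set("spark.memory.fraction", "0.5")
  // Portion of that region where cached blocks are immune to eviction.
  .set("spark.memory.storageFraction", "0.4")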
When memory is not a constraint, caching data in it is an easy way to improve performance.
dataFrame.cache() and spark.catalog.cacheTable() are two functions in Spark SQL that let you cache tables. When a table no longer needs to stay in memory, you can call dataFrame.unpersist() or spark.catalog.uncacheTable() to remove it.
import org.apache.spark.sql.SparkSession

val spark: SparkSession = SparkSession.builder().getOrCreate()
val df = spark.read.csv("data.csv")
val df2 = df.cache()
In this case, we have cached a DataFrame after creating it from a CSV file.
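The table-based variant works through the catalog. A short sketch, assuming the df from above and a hypothetical view name people:

// Register the DataFrame as a temporary view so the catalog can see it.
df.createOrReplaceTempView("people")

// Cache the table by name, then drop it from memory when done.
spark.catalog.cacheTable("people")
spark.catalog.uncacheTable("people")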
In-memory caching configurations can be altered using the setConf method or SET commands. For example:
- spark.sql.inMemoryColumnarStorage.compressed: this property dictates whether Spark SQL automatically chooses a compression codec for every column based on statistics of the data. It defaults to true.
- spark.sql.inMemoryColumnarStorage.batchSize: this integer property controls the batch size for columnar caching. Larger batches can improve memory utilization and compression but risk Out Of Memory (OOM) errors. Its default value is 10000. Both properties can be set as shown below.
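A minimal sketch of adjusting these settings, assuming an active SparkSession named spark (the batch size below is only illustrative):

// Programmatically, through the session's runtime configuration.
spark.conf.set("spark.sql.inMemoryColumnarStorage.compressed", "true")
spark.conf.set("spark.sql.inMemoryColumnarStorage.batchSize", "20000")

// The same settings can be applied with SQL SET commands.
spark.sql("SET spark.sql.inMemoryColumnarStorage.batchSize=20000")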
You can’t achieve optimal performance without selecting the right degree of parallelism.
If there aren't enough partitions, the cluster can't make use of all its cores. On the other hand, too many partitions create a large number of small tasks, and your system may run into excessive scheduling overhead.
You can set the spark.default.parallelism property to change the default number of partitions. Base it on the number of CPU cores available in your cluster; the official tuning guide suggests two to three tasks per core.
spark-submit --conf spark.default.parallelism=60
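As a sketch of the programmatic alternative (assuming an existing SparkContext sc built from this configuration, and a hypothetical input file data.txt):

val conf = new SparkConf().set("spark.default.parallelism", "60")

// Individual datasets can also be repartitioned explicitly.
val rdd = sc.textFile("data.txt").repartition(60)
println(rdd.getNumPartitions)  // prints 60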
Spark performance tuning is an advanced topic that needs more than a short tutorial, but the suggestions above should give you a rough idea of how to start optimizing your Spark computations.