Spark provides its users with both spark.default.parallelism and spark.sql.shuffle.partitions to control parallelism. In the hands of experienced users, they work wonders on the partitioning process. However, someone new to Spark may be confused about how to use them.
That is why we prepared this article, explaining the difference between spark.sql.shuffle.partitions and spark.default.parallelism.
What Is Spark Shuffle?
Before we can delve into the difference between spark.sql.shuffle.partitions and spark.default.parallelism, we must first understand the Spark shuffle. In short, it is a mechanism that both redistributes and re-partitions data.
Its main goal is to regroup records across partitions, for example so that all rows sharing the same key end up in the same partition. Achieving this makes it one of the most expensive operations to run, since it moves data across executors and worker nodes in the cluster.
Both spark.sql.shuffle.partitions and spark.default.parallelism are configuration properties that we can use to specify how many partitions a shuffle should create.
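To make this concrete, here is a minimal PySpark sketch (assuming a local SparkSession and made-up column names) of a wide transformation that forces a shuffle:

from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("shuffle-demo").getOrCreate()

df = spark.createDataFrame([("a", 1), ("b", 2), ("a", 3)], ["key", "value"])

# groupBy is a wide transformation: rows with the same key must end up in
# the same partition, so Spark shuffles data between executors here.
counts = df.groupBy("key").count()

# The number of post-shuffle partitions comes from spark.sql.shuffle.partitions
# (default 200), unless adaptive query execution coalesces them.
print(counts.rdd.getNumPartitions())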
Difference Between spark.sql.shuffle.partitions vs spark.default.parallelism
spark.default.parallelism
The first and most obvious characteristic of spark.default.parallelism is that it only applies to RDDs. The reason is that this setting was introduced alongside the RDD API, which means it has no effect on DataFrame operations.
Because it is tied to RDDs, its default value is derived from the total number of cores across all executor nodes in the cluster. If you are running Spark locally on your computer, it defaults to the number of cores on your machine.
When working with RDDs, wide transformations such as join(), groupByKey(), and reduceByKey() trigger the data shuffle process. That is why it is necessary to set your partition count to the desired value before running such operations.
You do so with the following setting (keep in mind that spark.default.parallelism is read when the SparkContext is created, so in practice it is usually configured through SparkConf or spark-submit rather than changed at runtime):
spark.conf.set("spark.default.parallelism",100)
We want to emphasize that while reduceByKey() does trigger a shuffle, the partition count does not change: the shuffled RDD inherits the partition count of its parent RDD.
spark.sql.shuffle.partitions
Just as spark.default.parallelism is tied to RDDs, spark.sql.shuffle.partitions was introduced together with the DataFrame API. It therefore only applies to DataFrame (and Dataset) operations, and it has a default value of 200.
Of course, this value is not fixed at 200; you can change it at runtime using the conf method, as in the following code:
spark.conf.set("spark.sql.shuffle.partitions",100)
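As a quick sketch (assuming a plain local SparkSession), the new value takes effect on the next wide DataFrame operation:

from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.appName("sql-shuffle-demo").getOrCreate()
spark.conf.set("spark.sql.shuffle.partitions", 100)

df = spark.range(10_000).withColumn("key", F.col("id") % 10)

# groupBy() triggers a shuffle; the result lands in 100 partitions
# (adaptive query execution on Spark 3.x may coalesce them further).
agg = df.groupBy("key").count()
print(agg.rdd.getNumPartitions())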
The Size Of Shuffle Partition
Spark shuffling is an interesting case, as it can either harm or benefit your workload depending on the size of your dataset, the available memory, and the number of cores.
From our experience, it is better to reduce the number of shuffle partitions when dealing with a smaller amount of data. If you keep the partition count high, you risk ending up with a huge pile of partitions that each contain only a few records.
In other words, you run a lot of tasks without enough data in each to make their scheduling overhead worthwhile.
On the contrary, having lots of data with too few partitions means fewer tasks, each running for a long time, and you may even run into memory errors.
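As a hedged sketch of how you might match the partition count to the data size (the counts here are illustrative, not recommendations):

from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("partition-sizing-demo").getOrCreate()

# Small dataset: shrink the partition count so each task handles enough rows
# to justify its scheduling overhead (coalesce avoids a full shuffle).
small = spark.range(1_000).coalesce(4)

# Large dataset: spread the rows over more partitions so each task stays
# small enough for executor memory (repartition performs a full shuffle).
large = spark.range(100_000_000).repartition(400)

print(small.rdd.getNumPartitions(), large.rdd.getNumPartitions())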
Conclusion
We didn't just tell you the difference between spark.sql.shuffle.partitions and spark.default.parallelism, but also explained what a Spark shuffle is and how to approach it. We also covered how to choose a partition size that makes better use of your memory.
We hope that you enjoyed this article and are looking forward to our next publication.