[SPARK-49741][DOCS] Add spark.shuffle.accurateBlockSkewedFactor to config docs page

### What changes were proposed in this pull request?

`spark.shuffle.accurateBlockSkewedFactor` was added in Spark 3.3.0 in https://issues.apache.org/jira/browse/SPARK-36967. It is a useful shuffle configuration that prevents issues where `HighlyCompressedMapStatus` wrongly estimates shuffle block sizes when the block size distribution is skewed, which can cause the shuffle reducer to fetch too much data and run out of memory (OOM). This PR adds the config to the Spark configuration docs page to make it discoverable.
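As an illustration (not part of this PR), here is a minimal sketch of how a user might enable the config when building a session. The `5.0` value is an assumed example chosen to mirror the docs' recommendation to match `spark.sql.adaptive.skewJoin.skewedPartitionFactor`, and the `spark.shuffle.accurateBlockThreshold` setting is an existing Spark config shown with an illustrative value:

```scala
import org.apache.spark.sql.SparkSession

// Illustrative values only: 5.0 mirrors the recommendation to match
// spark.sql.adaptive.skewJoin.skewedPartitionFactor; tune for your workload.
val spark = SparkSession.builder()
  .appName("accurate-block-skewed-factor-example")
  .config("spark.shuffle.accurateBlockSkewedFactor", "5.0")
  // Blocks larger than this size are always recorded accurately (illustrative value).
  .config("spark.shuffle.accurateBlockThreshold", "100m")
  .getOrCreate()
```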

### Why are the changes needed?

To make this useful config discoverable to users so they can resolve shuffle fetch OOM issues themselves.

### Does this PR introduce _any_ user-facing change?

Yes, this is a documentation fix. Before this PR, `spark.shuffle.accurateBlockSkewedFactor` was missing from the `Shuffle Behavior` section on [the Configuration page](https://spark.apache.org/docs/latest/configuration.html); now it is listed there.

### How was this patch tested?

In the IDE:
<img width="1633" alt="image" src="https://github.com/user-attachments/assets/616a94b9-2408-491c-a17b-c6dbdff14465">
Updated:
<img width="1274" alt="image" src="https://github.com/user-attachments/assets/ba170e9a-eba2-4fdf-85eb-a3aebefc055e">

### Was this patch authored or co-authored using generative AI tooling?

No

Closes apache#48189 from timlee0119/add-accurate-block-skewed-factor-to-doc.

Authored-by: Tim Lee <[email protected]>
Signed-off-by: Hyukjin Kwon <[email protected]>
timlee0119 authored and himadripal committed Oct 19, 2024
1 parent bd8c212 commit 54771b2
Showing 2 changed files with 13 additions and 1 deletion.
1 change: 0 additions & 1 deletion core/src/main/scala/org/apache/spark/internal/config/package.scala

@@ -1386,7 +1386,6 @@ package object config {
 
   private[spark] val SHUFFLE_ACCURATE_BLOCK_SKEWED_FACTOR =
     ConfigBuilder("spark.shuffle.accurateBlockSkewedFactor")
-      .internal()
       .doc("A shuffle block is considered as skewed and will be accurately recorded in " +
         "HighlyCompressedMapStatus if its size is larger than this factor multiplying " +
         "the median shuffle block size or SHUFFLE_ACCURATE_BLOCK_THRESHOLD. It is " +
13 changes: 13 additions & 0 deletions docs/configuration.md

@@ -1232,6 +1232,19 @@ Apart from these, the following properties are also available, and may be useful
   </td>
   <td>2.2.1</td>
 </tr>
+<tr>
+  <td><code>spark.shuffle.accurateBlockSkewedFactor</code></td>
+  <td>-1.0</td>
+  <td>
+    A shuffle block is considered as skewed and will be accurately recorded in
+    <code>HighlyCompressedMapStatus</code> if its size is larger than this factor multiplying
+    the median shuffle block size or <code>spark.shuffle.accurateBlockThreshold</code>. It is
+    recommended to set this parameter to be the same as
+    <code>spark.sql.adaptive.skewJoin.skewedPartitionFactor</code>. Set to -1.0 to disable this
+    feature by default.
+  </td>
+  <td>3.3.0</td>
+</tr>
 <tr>
   <td><code>spark.shuffle.registration.timeout</code></td>
   <td>5000</td>
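
As a quick reading aid for the new table entry above, here is a small sketch that paraphrases the documented rule; it is an illustration of the description only, not the actual `HighlyCompressedMapStatus` implementation, and all names in it are hypothetical:

```scala
// Paraphrase of the documented rule (illustration only, not Spark's actual code):
// with a positive factor, a block is recorded accurately when it exceeds
// factor * median block size, or when it exceeds spark.shuffle.accurateBlockThreshold.
def recordedAccurately(
    blockSize: Long,
    medianBlockSize: Long,
    skewedFactor: Double,           // spark.shuffle.accurateBlockSkewedFactor
    accurateBlockThreshold: Long    // spark.shuffle.accurateBlockThreshold, in bytes
): Boolean = {
  (skewedFactor > 0 && blockSize > medianBlockSize * skewedFactor) ||
    blockSize > accurateBlockThreshold
}
```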
