1.1.1 Case 12: Spark CBO Optimization
Scenario
An SQL statement can be converted into multiple execution plans. An optimizer helps the SQL compiler select the most efficient execution plan. Traditional databases, such as Oracle, provide rule-based optimization (RBO) and cost-based optimization (CBO).
- RBO
Rules are abstracted from best practices. As long as SQL statements comply with these rules, the resulting execution plans are fairly good regardless of table contents and data distribution.
- CBO
The cost of every candidate execution plan is evaluated based on data distribution and structure, and the cheapest plan is selected.
At present, a majority of Spark optimizations are RBO-based: tens of optimization rules, including predicate pushdown, constant folding, and projection pruning. These rules are helpful but insensitive to data, so an RBO optimizer is unaware of changes to the data distribution in a table. Consequently, the execution plan generated by RBO may not be the optimal choice. CBO removes this blind spot: it evaluates SQL statements against the actual data distribution and generates the execution plan that is cheapest among the candidate options.
The most noticeable advantage of CBO lies in Join algorithm selection. Imagine that two tables need to be joined, and after one of them is filtered, the result set is smaller than the broadcast threshold. Without CBO, Spark is unaware that the table has been downsized and selects the SortMergeJoin algorithm, which incurs many shuffle operations and degrades performance. With CBO, Spark knows how small the filtered table has become and selects the BroadcastHashJoin algorithm instead, broadcasting the downsized table to every node, which eliminates the shuffle operations and improves performance.
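As an illustration, consider a join of the following shape. The table names and sizes are hypothetical; 10 MB is the default value of Spark's spark.sql.autoBroadcastJoinThreshold setting.

```sql
-- Hypothetical tables: orders (~100 GB) and customers (~1 GB).
-- The filter keeps only a small fraction of customers, a few MB in total.
SELECT o.order_id, c.name
FROM orders o
JOIN customers c ON o.customer_id = c.id
WHERE c.register_date < '2015-01-01';

-- Without CBO: Spark sees only the original ~1 GB size of customers
-- and picks SortMergeJoin, shuffling both sides.
-- With CBO and fresh column statistics: the filtered size is estimated
-- to fall below the broadcast threshold, so BroadcastHashJoin is chosen
-- and the small result set is broadcast to every node, with no shuffle.
```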
Procedure
The philosophy behind Spark CBO is evaluating the size of intermediate result sets, which are generated by each operator, based on statistics in tables and columns and selecting the optimal execution plan based on the evaluation result.
1. Set configuration options.
− Add the configuration option spark.sql.cbo to the spark-defaults.conf file and set it to true.
The default value is false.
− Execute the SQL statement set spark.sql.cbo=true on the Spark client.
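For example, the session-level switch can be set and then checked on the Spark client using standard SET syntax (the spark-defaults.conf entry instead takes effect for new sessions):

```sql
SET spark.sql.cbo=true;   -- enable CBO for this session
SET spark.sql.cbo;        -- print the current value to verify
```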
2. Run COMPUTE STATS commands to compute data distribution statistics.
NOTE: Perform this step only once before executing the SQL statements. If data sets are inserted, modified, or deleted, run the COMPUTE STATS commands again on the affected tables or columns to acquire the latest data distribution statistics.
− For tables, run the COMPUTE STATS FOR TABLE src command to compute table statistics, including the number of rows, the number of files, and the total size of physical storage.
− For columns:
Run the COMPUTE STATS FOR TABLE src ON COLUMNS command to compute statistics about all columns.
Run the COMPUTE STATS FOR TABLE src ON COLUMNS name, age command to compute statistics about the name and age columns of the table.
Column statistics are supported for numeric, date, time, and string types.
For numeric, date, and time columns, the statistics include the maximum value (Max), minimum value (Min), number of distinct values (NDV), number of nulls, and histograms (equi-width or equi-height).
For string columns, the statistics include the maximum value (Max), minimum value (Min), maximum length, average length, NDV, number of nulls, and equi-width histograms.
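Putting the commands above together for a hypothetical table src(name STRING, age INT):

```sql
-- Table-level statistics: row count, file count, physical storage size.
COMPUTE STATS FOR TABLE src;

-- Column-level statistics for every column of src.
COMPUTE STATS FOR TABLE src ON COLUMNS;

-- Column-level statistics for selected columns only.
COMPUTE STATS FOR TABLE src ON COLUMNS name, age;
```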
3. Perform CBO.
− Automatic optimization: Submit SQL statements as usual. Spark automatically determines whether they have room for improvement and, if so, selects the optimal Join algorithm.
− Manual optimization: Run the DESC FORMATTED src command to view the statistics and tune the SQL statements based on the data distribution they reveal.
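For manual tuning, a typical workflow is to inspect the stored statistics and then check which plan the optimizer actually produced. EXPLAIN is standard Spark SQL; the query and the table src2 below are hypothetical.

```sql
-- View the statistics stored for the table.
DESC FORMATTED src;

-- Check which Join algorithm the optimizer selected for a query.
EXPLAIN
SELECT a.name, b.age
FROM src a JOIN src2 b ON a.name = b.name;
```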



