
Spark: Case 12: Spark CBO Optimization

Latest reply: Nov 28, 2018 08:34:21

1.1.1 Case 12: Spark CBO Optimization

Scenario

An SQL statement can be converted into multiple execution plans. An optimizer helps the SQL compiler select the most efficient execution plan. Traditional databases, such as Oracle, provide rule-based optimization (RBO) and cost-based optimization (CBO).

• RBO

Rules are abstracted from best practices. As long as SQL statements comply with these rules, the resulting execution plans will be reasonably good, regardless of the content of the data tables or the data distribution.

• CBO

The cost of every candidate execution plan is evaluated based on the data distribution and data structure, and the cheapest plan is selected.

At present, most Spark optimizers are RBO-based and ship with dozens of optimization rules, including predicate pushdown, constant folding, and projection pruning. These rules are helpful but insensitive to data: an RBO optimizer is unaware of changes in the data distribution of a table, so the execution plan it generates may not be the optimal choice. CBO closes this gap. It evaluates SQL statements against the actual data distribution and generates the cheapest execution plan among the candidate options.

The most noticeable advantage of CBO lies in Join algorithm selection. Imagine that two tables need to be joined, and after one of the tables is filtered, the result set is smaller than the broadcast threshold. Without CBO, Spark is unaware that the original table has been downsized and selects the SortMergeJoin algorithm, which incurs many shuffle operations and degrades performance. With CBO, Spark knows what has happened to the original table and selects the BroadcastHashJoin algorithm instead, broadcasting the downsized table to every node, eliminating the shuffle and improving performance.
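As an illustration of this scenario (hypothetical table and column names; EXPLAIN COST is the open-source Spark command for viewing a plan together with its statistics):

```sql
-- Hypothetical tables: orders is large; customers shrinks below the
-- broadcast threshold once the filter on country is applied.
EXPLAIN COST
SELECT o.order_id, c.name
FROM orders o
JOIN customers c
  ON o.customer_id = c.id
WHERE c.country = 'DE';
-- With CBO enabled and statistics collected, the plan should show
-- BroadcastHashJoin for customers instead of SortMergeJoin.
```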

Procedure

The philosophy behind Spark CBO is to estimate, from table and column statistics, the size of the intermediate result set produced by each operator, and to select the optimal execution plan based on those estimates.

1. Set configuration options. Either:

• Add the configuration option spark.sql.cbo to the spark-defaults.conf file and set it to true (the default value is false), or

• Execute the SQL statement set spark.sql.cbo=true on the Spark client.
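The two ways of enabling CBO can be sketched as follows (option name as used in this article; open-source Apache Spark names the equivalent switch spark.sql.cbo.enabled):

```sql
-- Cluster-wide, in spark-defaults.conf:
--   spark.sql.cbo  true

-- Or per session, on the Spark SQL client:
set spark.sql.cbo=true;
```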

2. Run COMPUTE STATS commands to compute data distribution statistics.

Note: Perform this step only once, before executing the SQL statements. If data is inserted, modified, or deleted, run the COMPUTE STATS commands again on the affected tables or columns to obtain up-to-date data distribution statistics.

• For tables, run the COMPUTE STATS FOR TABLE src command to compute table-level statistics, including the number of rows, the number of files, and the total size of the physical storage.

• For columns:

Run the COMPUTE STATS FOR TABLE src ON COLUMNS command to compute statistics for all columns.

Run the COMPUTE STATS FOR TABLE src ON COLUMNS name, age command to compute statistics for only the name and age columns.

Column statistics are supported for numeric, date, time, and string types.

For numeric, date, and time columns, the statistics include the maximum value (Max), the minimum value (Min), the number of distinct values (NDV), the number of nulls, and histograms (equi-width or equi-height).

For string columns, the statistics include the maximum value (Max), the minimum value (Min), the maximum length, the average length, the NDV, the number of nulls, and equi-width histograms.
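Taken together, the statistics commands from this step look like this (command syntax as given in this article; open-source Apache Spark instead uses ANALYZE TABLE src COMPUTE STATISTICS, optionally with FOR COLUMNS name, age):

```sql
-- Table-level statistics: row count, file count, total physical size.
COMPUTE STATS FOR TABLE src;

-- Column-level statistics for every column of src.
COMPUTE STATS FOR TABLE src ON COLUMNS;

-- Column-level statistics for the name and age columns only.
COMPUTE STATS FOR TABLE src ON COLUMNS name, age;
```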

3. Perform CBO.

• Automatic optimization: Enter SQL statements as usual. Spark automatically determines whether a better plan is available and, if so, selects the optimal Join algorithm.

• Manual optimization: Run the DESC FORMATTED src command to view the statistics, and tune SQL statements based on the observed data distribution.
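For manual optimization, the stored statistics can be inspected before rewriting a query. A minimal sketch with hypothetical table names (in open-source Spark, the BROADCAST join hint can then force the broadcast join manually):

```sql
-- Inspect the recorded statistics for src.
DESC FORMATTED src;

-- If the statistics show that src is small after filtering, a broadcast
-- hint can select BroadcastHashJoin without relying on the optimizer:
SELECT /*+ BROADCAST(s) */ t.key, s.value
FROM large_table t
JOIN src s ON t.key = s.key;
```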

 

