Value distribution of each column in a database table usually meets service rules. However, if
a source system has data quality issues, some columns of a LibrA table loaded with the data
will suffer seriously skewed value distribution. Use the XX video cloud project as an example.
There are 900 million records in the vehicle traffic record table, and up to 500 million records
have value NULL in the EXIT_TIME column. Therefore, when EXIT_TIME is used for
GROUP BY or JOIN in a query, the computing workload on a single DN is much greater
than that on other DNs, greatly reducing the query efficiency.
Note that the data skew caused by data quality issues is often hard to discover. For example,
in the vehicle traffic record table, the EXIT_TIME column is not a distribution key, which
makes data distribution seem even. However, when EXIT_TIME is used for GROUP BY in
a query, the uneven value grouping still reduces the query performance.
If a non-distribution key is often used for GROUP BY or JOIN and it contains a large
number of NULL values, you can filter out the records that have NULL values in the query
while ensuring the service logic.