MySQL的直方图是如何影响执行计划生成的?
创建直方图有哪些注意事项?
直方图和索引对优化器的选择上有什么差异,又该如何选择?
如何判断直方图对执行计划的影响?
MySQL官方blog的这篇文章用非常具体的示例回答了这一系列问题,let’s go。
原文地址为https://dev.mysql.com/blog-archive/histogram-statistics-in-mysql/,以下为译文:
什么是直方图
- 每个表中有多少行?
- 每一列有多少不同的值?
- 数据如何分布在每一列中?
CREATE TABLE bedtime ( person_id INT, time_of_day TIME);
1) SELECT * FROM bedtime WHERE time_of_day BETWEEN "22:00:00" AND "23:59:00" 2) SELECT * FROM bedtime WHERE time_of_day BETWEEN "12:00:00" AND "14:00:00"
如何创建和删除直方图统计
为了管理直方图统计数据,我们扩展了ANALYZE TABLE,增加了两个新的子句:
ANALYZE TABLE tbl_name UPDATE HISTOGRAM ON col_name [, col_name] WITH N BUCKETS; ANALYZE TABLE tbl_name DROP HISTOGRAM ON col_name [, col_name];
mysql> ANALYZE TABLE payment UPDATE HISTOGRAM ON amount WITH 32 BUCKETS; +----------------+-----------+----------+---------------------------------------------------+ | Table | Op | Msg_type | Msg_text | +----------------+-----------+----------+---------------------------------------------------+ | sakila.payment | histogram | status | Histogram statistics created for column 'amount'. | +----------------+-----------+----------+---------------------------------------------------+ 1 row in set (0.27 sec) mysql> ANALYZE TABLE payment UPDATE HISTOGRAM ON amount, payment_date WITH 32 BUCKETS; +----------------+-----------+----------+---------------------------------------------------------+ | Table | Op | Msg_type | Msg_text | +----------------+-----------+----------+---------------------------------------------------------+ | sakila.payment | histogram | status | Histogram statistics created for column 'amount'. | | sakila.payment | histogram | status | Histogram statistics created for column 'payment_date'. | +----------------+-----------+----------+---------------------------------------------------------+
请注意,必须指定桶的数量,并且可以在 1 到 1024 的范围内(默认为100)。您应该为数据集选择多少个桶取决于几个因素;您有多少个不同的值,您的数据集有多大偏差,您需要多高的准确性等。
但是,在一定数量的桶之后,(再继续加大桶的数据量)对准确性的提高效果相当低。所以我们建议从较低的数字开始,例如 32,如果您发现它不符合您的需求,则增加它。
mysql> ANALYZE TABLE payment DROP HISTOGRAM ON payment_date; +----------------+-----------+----------+---------------------------------------------------------+ | Table | Op | Msg_type | Msg_text | +----------------+-----------+----------+---------------------------------------------------------+ | sakila.payment | histogram | status | Histogram statistics removed for column 'payment_date'. | +----------------+-----------+----------+---------------------------------------------------------+
mysql> ANALYZE TABLE customer UPDATE HISTOGRAM ON c_birth_day, c_foobar, c_birth_month WITH 32 BUCKETS; +----------------+-----------+----------+----------------------------------------------------------+ | Table | Op | Msg_type | Msg_text | +----------------+-----------+----------+----------------------------------------------------------+ | tpcds.customer | histogram | status | Histogram statistics created for column 'c_birth_day'. | | tpcds.customer | histogram | status | Histogram statistics created for column 'c_birth_month'. | | tpcds.customer | histogram | Error | The column 'c_foobar' does not exist. | +----------------+-----------+----------+----------------------------------------------------------+ 3 rows in set (0.15 sec)
直方图的创建在数据库内部是如何实现的?
如果您已经阅读了MySQL手册,您可能已经看到了新的系统变量histogram_generation_max_mem_size。这个变量将控制服务器在生成直方图统计数据时允许使用的内存大小(以字节计)。那你为什么要控制它呢?
当您指定想要构建一个直方图时,服务器将把所有数据读入内存并在内存中执行所有工作(包括排序)。如果您想在一个非常大的表上生成一个直方图,那么您可能要冒着将数百兆字节的数据读入内存的风险,这可能是不可取的。因此,为了处理这个问题,MySQL将计算在给定由系统变量histogram_generation_max_mem_size指定的内存量的情况下,它可以将多少行数据放入内存中。如果它意识到它只能在给定的内存限制内装入行的一个子集,它将求助于抽样。这可以通过查看属性“采样率”来观察:
mysql> SET histogram_generation_max_mem_size = 1000000; Query OK, 0 rows affected (0.00 sec) mysql> ANALYZE TABLE customer UPDATE HISTOGRAM ON c_birth_country WITH 16 BUCKETS; +----------------+-----------+----------+------------------------------------------------------------+ | Table | Op | Msg_type | Msg_text | +----------------+-----------+----------+------------------------------------------------------------+ | tpcds.customer | histogram | status | Histogram statistics created for column 'c_birth_country'. | +----------------+-----------+----------+------------------------------------------------------------+ 1 row in set (0.22 sec) mysql> SELECT histogram->>'$."sampling-rate"' -> FROM information_schema.column_statistics -> WHERE table_name = "customer" -> AND column_name = "c_birth_country"; +---------------------------------+ | histogram->>'$."sampling-rate"' | +---------------------------------+ | 0.048743243211626014 | +---------------------------------+
在这里,我们可以看到优化器通过读取“c_birth_country”列中大约4.8%的数据创建了一个直方图。值得注意的是,抽样是不确定的,因此如果使用抽样,在同一个数据集上的两次后续调用“ANALYZE TABLE tbl UPDATE HISTOGRAM…”可能会给您两个不同的直方图。
Query examples
那么,使用直方图统计可以得到什么呢?让我们看看TPC-DS Benchmark Suite中的几个查询,其中添加一个直方图可以在查询执行时间上产生很大的差异。下面我们将使用规模系数为1的TPC-DS,这意味着数据库的大小大约为1GB。这台机器是英特尔酷睿i7-4770,运行Debian Stretch和MySQL 8.0 RC1。这个配置是相当标准的,除了innodb_buffer_pool_size被增加到2G,以便我们可以将整个数据库放入缓冲池中。
为了让优化器实际使用直方图提供的统计数据,您只需确保优化器开关“condition_fanout_filter”处于打开状态。注意,这在默认情况下是打开的。
Query 90
mysql> SELECT CAST(amc AS DECIMAL(15, 4)) / CAST(pmc AS DECIMAL(15, 4)) am_pm_ratio -> FROM (SELECT COUNT(*) amc -> FROM web_sales, -> household_demographics, -> time_dim, -> web_page -> WHERE ws_sold_time_sk = time_dim.t_time_sk -> AND ws_ship_hdemo_sk = household_demographics.hd_demo_sk -> AND ws_web_page_sk = web_page.wp_web_page_sk -> AND time_dim.t_hour BETWEEN 9 AND 9 + 1 -> AND household_demographics.hd_dep_count = 2 -> AND web_page.wp_char_count BETWEEN 5000 AND 5200) at, -> (SELECT COUNT(*) pmc -> FROM web_sales, -> household_demographics, -> time_dim, -> web_page -> WHERE ws_sold_time_sk = time_dim.t_time_sk -> AND ws_ship_hdemo_sk = household_demographics.hd_demo_sk -> AND ws_web_page_sk = web_page.wp_web_page_sk -> AND time_dim.t_hour BETWEEN 15 AND 15 + 1 -> AND household_demographics.hd_dep_count = 2 -> AND web_page.wp_char_count BETWEEN 5000 AND 5200) pt -> ORDER BY am_pm_ratio -> LIMIT 100; +-------------+ | am_pm_ratio | +-------------+ | 1.27619048 | +-------------+ 1 row in set (1.48 sec)
View Code
mysql> ANALYZE TABLE web_page UPDATE HISTOGRAM ON wp_char_count WITH 8 BUCKETS; +----------------+-----------+----------+----------------------------------------------------------+ | Table | Op | Msg_type | Msg_text | +----------------+-----------+----------+----------------------------------------------------------+ | tpcds.web_page | histogram | status | Histogram statistics created for column 'wp_char_count'. | +----------------+-----------+----------+----------------------------------------------------------+ 1 row in set (0.06 sec) mysql> SELECT ... +-------------+ | am_pm_ratio | +-------------+ | 1.27619048 | +-------------+ 1 row in set (0.50 sec)
View Code
mysql> SELECT -> (SELECT COUNT(*) FROM web_page WHERE web_page.wp_char_count BETWEEN 5000 AND 5200) -> / -> (SELECT COUNT(*) FROM web_page) AS ratio; +--------+ | ratio | +--------+ | 0.0167 | +--------+ 1 row in set (0.00 sec)
Query 61
mysql> SELECT promotions, -> total, -> CAST(promotions AS DECIMAL(15, 4)) / CAST(total AS DECIMAL(15, 4)) * 100 -> FROM (SELECT SUM(ss_ext_sales_price) promotions -> FROM store_sales, -> store, -> promotion, -> date_dim, -> customer, -> customer_address, -> item -> WHERE ss_sold_date_sk = d_date_sk -> AND ss_store_sk = s_store_sk -> AND ss_promo_sk = p_promo_sk -> AND ss_customer_sk = c_customer_sk -> AND ca_address_sk = c_current_addr_sk -> AND ss_item_sk = i_item_sk -> AND ca_gmt_offset = -5 -> AND i_category = 'Home' -> AND ( p_channel_dmail = 'Y' -> OR p_channel_email = 'Y' -> OR p_channel_tv = 'Y' ) -> AND s_gmt_offset = -5 -> AND d_year = 2000 -> AND d_moy = 12) promotional_sales, -> (SELECT SUM(ss_ext_sales_price) total -> FROM store_sales, -> store, -> date_dim, -> customer, -> customer_address, -> item -> WHERE ss_sold_date_sk = d_date_sk -> AND ss_store_sk = s_store_sk -> AND ss_customer_sk = c_customer_sk -> AND ca_address_sk = c_current_addr_sk -> AND ss_item_sk = i_item_sk -> AND ca_gmt_offset = -5 -> AND i_category = 'Home' -> AND s_gmt_offset = -5 -> AND d_year = 2000 -> AND d_moy = 12) all_sales -> ORDER BY promotions, -> total -> LIMIT 100; +------------+------------+--------------------------------------------------------------------------+ | promotions | total | CAST(promotions AS DECIMAL(15, 4)) / CAST(total AS DECIMAL(15, 4)) * 100 | +------------+------------+--------------------------------------------------------------------------+ | 3213210.07 | 5966836.78 | 53.85114741 | +------------+------------+--------------------------------------------------------------------------+ 1 row in set (2.78 sec)
View Code
mysql> ANALYZE TABLE store UPDATE HISTOGRAM ON s_gmt_offset WITH 8 BUCKETS; +-------------+-----------+----------+---------------------------------------------------------+ | Table | Op | Msg_type | Msg_text | +-------------+-----------+----------+---------------------------------------------------------+ | tpcds.store | histogram | status | Histogram statistics created for column 's_gmt_offset'. | +-------------+-----------+----------+---------------------------------------------------------+ 1 row in set (0.06 sec) mysql> SELECT ... +------------+------------+--------------------------------------------------------------------------+ | promotions | total | CAST(promotions AS DECIMAL(15, 4)) / CAST(total AS DECIMAL(15, 4)) * 100 | +------------+------------+--------------------------------------------------------------------------+ | 3213210.07 | 5966836.78 | 53.85114741 | +------------+------------+--------------------------------------------------------------------------+ 1 row in set (1.37 sec)
View Code
有了这个直方图,查询执行时间下降到不到1.4秒,提高了2倍。原因是在第一个计划中,优化器选择第一个派生表在store表上执行全表扫描,然后分别在<item、store_sales、date_dim、customer和customer_address中执行主键查找。但是,当它意识到store表将返回比它预期的更多的行,优化器选择对Item表执行全表扫描,并分别在store_sales、store、date_dim、customer和最后的customer_address中执行主键查找。
But, why not an index?
mysql> CREATE INDEX s_gmt_offset_idx ON store (s_gmt_offset); Query OK, 0 rows affected (0.53 sec) Records: 0 Duplicates: 0 Warnings: 0 mysql> SELECT ... +------------+------------+--------------------------------------------------------------------------+ | promotions | total | CAST(promotions AS DECIMAL(15, 4)) / CAST(total AS DECIMAL(15, 4)) * 100 | +------------+------------+--------------------------------------------------------------------------+ | 3213210.07 | 5966836.78 | 53.85114741 | +------------+------------+--------------------------------------------------------------------------+ 1 row in set (1.41 sec)
然而,有两个原因可以解释为什么你可能会考虑直方图而不是索引:
1. 维护索引是有成本的。如果您有一个索引,那么每次INSERT/UPDATE/DELETE都会导致索引被更新。这不是免费的,而且会影响您的性能。另一方面,直方图只创建一次,并且永远不会更新,除非您明确地要求它。因此,它不会损害您的插入/更新/删除性能。
2. 如果您有一个索引,优化器将执行我们称为“index dives” 的操作,以估计给定范围内的记录数量。这也有一定的成本,如果查询中有非常长的in -list,那么成本可能会太高。在这种情况下,直方图统计要便宜得多,因此可能更合适。
译者注:简单地理解,index dives就是MySQL在对where id in (***,***,……)这种语句生成执行计划的时候,通过扫描索引页的方式来估算符合条件的数据行数,这种方式潜在的问题就是,如果In里面的值很多,以至于符合条件的数据页面很多,那么仅在执行计划评估阶段,就需要扫描大量的数据页面,可能会造成一定的性能损耗,如果换一种评估方式,也就是基于统计信息做评估,就可以避免潜在的扫描大量的索引页的情况(但是基于统计信息的预估也不是完美的,最大的问题是不够精准)。index dives的参数为eq_range_index_dive_limit,默认为200。
检查直方图统计信息
直方图统计数据作为JSON对象存储在数据字典中,这使得它们既灵活又可读。例如,您可以使用内置的JSON函数从直方图中提取信息。假设您想知道您的柱状图是何时为“payment”表中的“amount”列创建/更新的。你可以很容易地使用JSON反引用提取操作符来查找这些信息:
mysql> SELECT -> HISTOGRAM->>'$."last-updated"' AS last_updated -> FROM INFORMATION_SCHEMA.COLUMN_STATISTICS -> WHERE -> SCHEMA_NAME = "sakila" -> AND TABLE_NAME = "payment" -> AND COLUMN_NAME = "amount"; +----------------------------+ | last_updated | +----------------------------+ | 2017-09-15 11:54:25.000000 | +----------------------------+
或者假设你想找出直方图中有多少个桶与你在ANALYZE TABLE语句中指定的桶的数量进行比较:
mysql> SELECT -> TABLE_NAME, -> COLUMN_NAME, -> HISTOGRAM->>'$."number-of-buckets-specified"' AS num_buckets_specified, -> JSON_LENGTH(HISTOGRAM, '$.buckets') AS num_buckets_created -> FROM INFORMATION_SCHEMA.COLUMN_STATISTICS -> WHERE -> SCHEMA_NAME = "sakila"; +------------+--------------+-----------------------+---------------------+ | TABLE_NAME | COLUMN_NAME | num_buckets_specified | num_buckets_created | +------------+--------------+-----------------------+---------------------+ | payment | amount | 32 | 19 | | payment | payment_date | 32 | 32 | +------------+--------------+-----------------------+---------------------+
关于可以从直方图中提取什么样的信息,我们参考了手册中的更多信息。
优化器跟踪
如果你想知道直方图所做的估计,最简单的方法是查看EXPLAIN输出:
mysql> EXPLAIN SELECT * FROM customer WHERE c_birth_day BETWEEN 1 AND 10; +----+-------------+----------+------------+------+---------------+------+---------+------+-------+----------+-------------+ | id | select_type | table | partitions | type | possible_keys | key | key_len | ref | rows | filtered | Extra | +----+-------------+----------+------------+------+---------------+------+---------+------+-------+----------+-------------+ | 1 | SIMPLE | customer | NULL | ALL | NULL | NULL | NULL | NULL | 98633 | 11.11 | Using where | +----+-------------+----------+------------+------+---------------+------+---------+------+-------+----------+-------------+ 1 row in set, 1 warning (0.00 sec) mysql> ANALYZE TABLE customer UPDATE HISTOGRAM ON c_birth_day WITH 32 BUCKETS; +----------------+-----------+----------+--------------------------------------------------------+ | Table | Op | Msg_type | Msg_text | +----------------+-----------+----------+--------------------------------------------------------+ | tpcds.customer | histogram | status | Histogram statistics created for column 'c_birth_day'. | +----------------+-----------+----------+--------------------------------------------------------+ 1 row in set (0.10 sec) mysql> EXPLAIN SELECT * FROM customer WHERE c_birth_day BETWEEN 1 AND 10; +----+-------------+----------+------------+------+---------------+------+---------+------+-------+----------+-------------+ | id | select_type | table | partitions | type | possible_keys | key | key_len | ref | rows | filtered | Extra | +----+-------------+----------+------------+------+---------------+------+---------+------+-------+----------+-------------+ | 1 | SIMPLE | customer | NULL | ALL | NULL | NULL | NULL | NULL | 98633 | 32.12 | Using where | +----+-------------+----------+------------+------+---------------+------+---------+------+-------+----------+-------------+ 1 row in set, 1 warning (0.00 sec)
mysql> EXPLAIN SELECT * FROM customer WHERE c_birth_day <= 20 AND c_birth_year = 1967; +----+-------------+----------+------------+------+---------------+------+---------+------+-------+----------+-------------+ | id | select_type | table | partitions | type | possible_keys | key | key_len | ref | rows | filtered | Extra | +----+-------------+----------+------------+------+---------------+------+---------+------+-------+----------+-------------+ | 1 | SIMPLE | customer | NULL | ALL | NULL | NULL | NULL | NULL | 98633 | 6.38 | Using where | +----+-------------+----------+------------+------+---------------+------+---------+------+-------+----------+-------------+ 1 row in set, 1 warning (0.00 sec)
mysql> SET OPTIMIZER_TRACE = "enabled=on"; Query OK, 0 rows affected (0.00 sec) mysql> SET OPTIMIZER_TRACE_MAX_MEM_SIZE = 1000000; Query OK, 0 rows affected (0.00 sec) mysql> EXPLAIN SELECT * FROM customer WHERE c_birth_day <= 20 AND c_birth_year = 1967; mysql> SELECT JSON_EXTRACT(TRACE, "$**.filtering_effect") FROM INFORMATION_SCHEMA.OPTIMIZER_TRACE; +----------------------------------------------------------------------------------------+ | JSON_EXTRACT(TRACE, "$**.filtering_effect") | +----------------------------------------------------------------------------------------+ | [[{"condition": "(`customer`.`c_birth_day` <= 20)", "histogram_selectivity": 0.6376}]] | +----------------------------------------------------------------------------------------+ 1 row in set (0.00 sec)
mysql> SELECT
-> (SELECT count(*) FROM customer WHERE c_birth_day <= 20)
-> /
-> (SELECT COUNT(*) FROM customer) AS ratio;
+--------+
| ratio |
+--------+
| 0.6376 |
+--------+
1 row in set (0.03 sec)