  • In this tutorial, we're going to see why MySQL creates a query plan before each query request.

    在本教程中,我們將瞭解為什麼 MySQL 會在每個查詢請求之前創建查詢計劃。

  • Then, we're going to use a series of examples to show how the EXPLAIN output can be used to identify where MySQL spends time on your query and why, and we'll learn which are the important pieces of information to look for in order to spot performance issues.

    然後,我們將使用一系列示例來說明如何使用 EXPLAIN 輸出來識別 MySQL 在查詢上花費時間的地方和原因,並瞭解要發現性能問題需要查找哪些重要資訊。

  • Once we have a candidate query for optimization, we need to analyze why it is slow or why it impacts the system so much.

    一旦有了需要優化的候選查詢,我們就需要分析其速度慢或對系統影響大的原因。

  • The main tool to do that is the EXPLAIN statement, which provides information about the query plan chosen by the optimizer.

    EXPLAIN 語句是實現這一功能的主要工具,它提供了優化器所選查詢計劃的相關資訊。

  • The optimizer has to make a few decisions before executing each query.

    在執行每個查詢之前,優化器必須做出一些決定。

  • For example, what is cheaper from a response time perspective?

    例如,從響應時間的角度來看,什麼更便宜?

  • Fetch the data directly from a table, or go to an index and stop here because all the columns required by the client are in the index, or get the location of the records from the index and go to the table to get the actual data.

    直接從表中獲取數據,或者轉到索引並在此停止,因為客戶需要的所有列都在索引中,或者從索引中獲取記錄的位置,然後轉到表中獲取實際數據。

  • The first method, fetching data directly from the table, is called a full table scan.

    第一種方法是直接從表中獲取數據,稱為全掃描。

  • This is normally the most expensive because all rows must be fetched from the table and checked against a condition, yet this method works best for small tables.

    這種方法通常成本最高,因為必須從表中獲取所有記錄並根據條件進行檢查,但這種方法最適合小型表。

  • In the second set of options, we access an index.

    在第二組選項中,我們訪問一個索引。

  • If the index has all the necessary data and there is no need to access the table, we have what is called a covering index.

    如果索引擁有所有必要的數據,並且不需要訪問表,那麼我們就擁有了所謂的覆蓋索引。
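
    As a rough illustration against the world sample database used later in this tutorial (column and index names follow that schema; whether the index covers the query depends on the selected columns):

        -- The secondary index on CountryCode also contains the primary key (ID),
        -- so this query can be answered from the index alone (a covering index):
        SELECT ID, CountryCode FROM city WHERE CountryCode = 'FRA';

        -- Selecting columns that are not in the index forces a lookup back into the table:
        SELECT Name, Population FROM city WHERE CountryCode = 'FRA';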

  • However, this is less often the case, so the index is used to filter out rows and then access those rows from the table.

    但這種情況並不常見,是以索引用於過濾出記錄,然後從表中訪問這些記錄。

  • Usually, this is the cheapest way to access a table.

    通常,這是訪問表格最便宜的方法。

  • Still, if the client selects a large number of rows, this may not be valid anymore.

    不過,如果客戶端選擇了大量的行,這可能就不再有效了。

  • Therefore, the optimizer has to make a lot of decisions based on particular database statistics before the query is executed.

    是以,在執行查詢之前,優化器必須根據特定的數據庫統計數據做出大量決策。

  • As a result, our goal will be to observe what the optimizer thinks is the most expensive subtask so that we could eliminate or enhance that part.

    是以,我們的目標是觀察優化器認為最昂貴的子任務是什麼,從而消除或增強這部分任務。

  • If you have a slow query, the first thing to try is running it with EXPLAIN.

    如果查詢速度較慢,首先要嘗試使用 EXPLAIN 運行查詢。

  • This will show the query plan, in other words, the list of things expected to happen when the query is executed.

    這將顯示查詢計劃,換句話說,就是執行查詢時預計會發生的事情列表。

  • If you instead use EXPLAIN ANALYZE before the statement, you'll get both the estimation of what the planner expected, along with what actually happened when the query ran.

    如果在語句前使用 EXPLAIN ANALYZE,就能同時獲得規劃器的預期結果和查詢運行時的實際結果。

  • Consider the following statement.

    請看下面的陳述。

  • EXPLAIN DELETE FROM city.

    EXPLAIN DELETE FROM city.

  • This query is not executed, so it won't delete all the rows.

    該查詢不會被執行,是以不會刪除所有記錄。

  • Therefore, it's safe to obtain the query plan.

    是以,獲取查詢計劃是安全的。

  • To actually execute the query, we can use EXPLAIN ANALYZE, as we'll see later in more detail.

    要實際執行查詢,我們可以使用 EXPLAIN ANALYZE,稍後我們將詳細介紹。

  • This is not only going to show the query plan for deleting those rows, it is actually going to delete them.

    這不僅會顯示刪除這些記錄的查詢計劃,還會實際刪除這些記錄。

  • Usually, it's more difficult to compare the timings of operations when doing inserts, updates, or deletes using EXPLAIN ANALYZE.

    通常情況下,在使用 EXPLAIN ANALYZE 進行插入、更新或刪除操作時,比較操作的時間比較困難。

  • This is because the underlying data will change while executing the same queries.

    這是因為在執行相同查詢時,底層數據會發生變化。

  • Optionally, we can add the FORMAT option to specify whether we want the results returned in traditional table format, JSON, or tree-style format.

    作為選項,我們可以添加 FORMAT 選項,指定我們希望以傳統表格格式、JSON 格式還是樹形格式返回結果。

  • Keep in mind that each format will show more or less information about the query plan.

    請記住,每種格式都會顯示或多或少的查詢計劃資訊。

  • For instance, the JSON format is the most verbose of all.

    例如,JSON 格式是最冗長的格式。
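
    As a quick sketch, the FORMAT option goes right after the EXPLAIN keyword; the statement below is just an illustrative example:

        EXPLAIN FORMAT=TRADITIONAL SELECT * FROM city WHERE CountryCode = 'FRA';
        EXPLAIN FORMAT=JSON        SELECT * FROM city WHERE CountryCode = 'FRA';
        EXPLAIN FORMAT=TREE        SELECT * FROM city WHERE CountryCode = 'FRA';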

  • Let's see some examples to dig deeper into the query plan.

    讓我們看一些示例來深入瞭解查詢計劃。

  • We will use the sample database world (link in the description), and for the first example we will execute a simple SELECT on the city table with a condition on a non-indexed column called name.

    我們將使用描述中的示例數據庫世界鏈接,在第一個示例中,我們將在城市表上執行一個簡單的 SELECT,條件是在一個名為 name 的非索引列上。
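
    Written out against the world sample database, the first example might look like this (the city name in the WHERE clause is just an illustrative value):

        EXPLAIN SELECT * FROM city WHERE Name = 'London';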

  • Since there is no index that can be used, it will require a full table scan to evaluate the query.

    由於沒有可以使用的索引,是以需要進行全表掃描來評估查詢。

  • The table access type shows whether a query accesses the table using an index, a scan, or something similar.

    表訪問類型顯示查詢是否使用索引、掃描或類似方式訪問表。

  • Since the costs associated with each type fluctuate greatly, it is also one of the most important values to look for in the EXPLAIN output to determine which part of the query to work on to improve the performance.

    由於與每種類型相關的成本波動很大,是以它也是 EXPLAIN 輸出中最重要的值之一,可用於確定查詢的哪個部分以提高性能。

  • The output has the access type set to ALL, which is the most basic access type, because it scans all rows for the table.

    輸出的訪問類型設置為 ALL,這是最基本的訪問類型,因為它會掃描表的所有行。

  • It is also the most expensive one, and for this reason, this type is written in ALL uppercase.

    它也是最昂貴的一種,是以,這種類型的文字全部大寫。

  • It's estimated that 4046 rows will be examined, and for each row a WHERE clause will be applied.

    估計將檢查 4046 條記錄,每條記錄都將應用 WHERE 子句。

  • It's expected that 10% of the rows examined will match the WHERE clause.

    預計 10%的檢查記錄將與 WHERE 子句匹配。
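
    Roughly, the relevant columns of the traditional-format output for such a query could look like the abridged sketch below (the figures are the ones mentioned here; the exact output depends on your data):

        id: 1   select_type: SIMPLE   table: city   type: ALL
        possible_keys: NULL   key: NULL   rows: 4046   filtered: 10.00
        Extra: Using where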

  • Here, the optimizer uses default values to estimate the filtering effect of various conditions, so we can't really use this filtering value to estimate whether an index is useful or not.

    在這裡,優化器使用默認值來估計各種條件的過濾效果,是以我們無法真正使用這個過濾值來估計索引是否有用。

  • This is the traditional format; however, it doesn't show the relationship between the executed subtasks, so it's more difficult to get an overview of the query plan.

    這是一種傳統格式,但它不能顯示已執行子任務之間的關係,是以更難概覽查詢計劃。

  • Which format is preferred depends on your needs.

    選擇哪種格式取決於您的需求。

  • For example, the traditional format is easier to use for seeing which indexes are used and other basic information about the query plan, while the JSON format provides much more detail.

    例如,傳統格式更容易查看所使用的索引和查詢計劃的其他基本資訊,而 JSON 格式則能提供更多詳細信息。

  • The TREE style is the newest format, and is the format we're going to use for the next examples.

    TREE 樣式是最新的格式,我們將在接下來的示例中使用這種格式。

  • The TREE format focuses on describing how the query is executed, in terms of the relationship between the parts of the query, and also the order in which these parts are executed.

    TREE 格式側重於描述查詢的執行方式,即查詢各部分之間的關係,以及這些部分的執行順序。

  • In this case, the EXPLAIN output is organized into a series of nodes.

    在這種情況下,EXPLAIN OUTPUT 被組織成一系列節點。

  • At the lowest level, there are nodes that scan tables or search indexes.

    最底層是掃描表或搜索索引的節點。

  • Higher level nodes take the results from the lower level nodes and operate on them.

    高層節點從低層節點獲取結果並對其進行操作。

  • Here, it can be easier to understand the execution by reading the output from the inside out.

    在這裡,通過閱讀 INSIDE OUT 的輸出,可以更容易地理解執行過程。

  • The TREE format is also the default format for the EXPLAIN ANALYZE statement, which was introduced in MySQL 8.0.18.

    TREE 格式也是 EXPLAIN ANALYZE 語句的默認格式,這是 MySQL 8.0.18 的新功能。

  • This TREE format output gives us a good overview of how the query is executed.

    通過 TREE 格式的輸出,我們可以很好地瞭解查詢是如何執行的。

  • First, there is a table scan on the city table, and then a filter is applied for the name column.

    首先,對城市表進行表掃描,然後對姓名列應用過濾器。

  • Here, we also have an estimated cost for an expected 4000 rows.

    在這裡,我們也有一個預計 4000 行的估算成本。

  • Please note that this estimation cost is represented in some internal MySQL units.

    請注意,該估算成本是以 MySQL 的某些內部單位表示的。

  • For example, reading a row from the disk can have an associated cost of 2, while reading the same row from the memory will have a cost of 1.

    例如,從磁盤讀取一行的相關成本為 2,而從內存讀取同一行的成本為 1。

  • Moving to the actual statistics, we see that the first row was read in about 4 milliseconds, and all other rows were read in roughly 5 milliseconds.

    在實際統計數據中,我們可以看到第一行的讀取時間約為 4 毫秒,所有其他行的讀取時間約為 5 毫秒。

  • There was a single loop for this query, because there was no join clause involved.

    該查詢只有一個循環,因為不涉及 JOIN 子句。

  • In this case, the estimate was pretty accurate regarding the number of returned rows.

    在這種情況下,對返回行數的估計相當準確。

  • Then, these rows are passed to the second phase for filtering, where we see a slight increase in the actual time of execution.

    然後,這些行將進入第二階段進行過濾,在這一階段,我們看到實際執行時間略有增加。
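
    Put together, the EXPLAIN ANALYZE tree for this first example looks roughly like the abridged sketch below (cost figures omitted, numbers rounded to the ones discussed here, and the filter value is illustrative):

        -> Filter: (city.`Name` = 'London')  (rows=405) (actual time=4.1..5.2 rows=2 loops=1)
            -> Table scan on city  (rows=4046) (actual time=4.0..5.0 rows=4046 loops=1)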

  • The second example is similar to the first, except that the filter condition is changed to use the country code column, which has a secondary, non-unique index.

    第二個示例與第一個示例類似,只是將篩選條件改為使用國家代碼列,該列有一個二級非唯一索引。

  • This should make it cheaper to access the matching rows.

    這樣訪問匹配行的成本會更低。

  • For example, let's retrieve all French cities.

    例如,讓我們檢索所有法國城市。

  • Select ALL from city, where country code is equal to FRA.

    從國家代碼等於 FRA 的城市中選擇 ALL。
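
    Spelled out, and checking the plan at the same time, this is roughly (column and index names follow the world sample database):

        EXPLAIN FORMAT=TREE SELECT * FROM city WHERE CountryCode = 'FRA';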

  • This time, only a lookup on the country code index can be used for the query.

    這一次,只能使用國家代碼索引進行查詢。

  • It's estimated that 40 rows will be accessed, which is exactly what InnoDB reports when asked how many rows will match.

    估計將訪問 40 條記錄,這正是 InnoDB 在被問及將匹配多少條記錄時的回答。

  • This is because an index will also bring some statistics with it.

    這是因為索引也會帶來一些統計數據。

  • For the next example, we'll use the country language table, which has a primary key with two columns, country code and language.

    在下一個示例中,我們將使用國家語言表,該表的主鍵有兩列,即國家代碼和語言。

  • Imagine that we want to find all languages spoken in a single country.

    想象一下,我們想找到一個國家使用的所有語言。

  • In that case, we'll need to filter on the country code, but not on language.

    在這種情況下,我們需要過濾國家代碼,而不是語言。

  • A query that can be used to find all languages spoken in China is Select ALL from country language, where country code is CHN.

    要查找中國使用的所有語言,可以使用如下查詢:Select ALL from country language,其中國家代碼為 CHN。
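
    Against the world sample schema, that query is roughly:

        SELECT * FROM countrylanguage WHERE CountryCode = 'CHN';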

  • The index on the primary key can still be used to perform the filtering.

    主鍵上的索引仍可用於執行過濾。

  • The EXPLAIN output shows that with a PRIMARY key tag, and also which column of the index was actually used.

    EXPLAIN 輸出顯示了主標記和使用的索引列。

  • As always, only the leftmost prefix of the index can be used for filtering.

    一如既往,只有索引的左側部分可用於篩選。
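
    For contrast, a condition on Language alone is not a leftmost prefix of the (CountryCode, Language) primary key, so it cannot use that index for filtering; a hypothetical example:

        -- Language is the second column of the primary key, so this needs a scan instead:
        SELECT * FROM countrylanguage WHERE Language = 'Chinese';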

  • For the last example, we'll use a mix of various features and with multiple query blocks.

    在最後一個示例中,我們將混合使用各種功能和多個查詢塊。

  • To save time, I'll just copy-paste the query, but you can find it in the description.

    為了節省時間,我將直接複製粘貼詢問內容,但您可以在描述中找到。

  • This query will select the top 10 smallest countries by area, and then for these countries, find the largest cities by population.

    該查詢將選擇面積最小的前 10 個國家,然後在這些國家中找到人口最多的城市。
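
    The exact query is linked in the description; a rough reconstruction against the world sample schema (the continent used in the filter is an assumption, as are the aliases) could look like this:

        SELECT ci.Name AS city_name, co.Name AS country_name, ci.Population
          FROM (SELECT Code, Name
                  FROM country
                 WHERE Continent = 'Europe'   -- assumed continent filter
                 ORDER BY SurfaceArea
                 LIMIT 10) AS co              -- the 10 smallest countries by area
         INNER JOIN city AS ci
            ON ci.CountryCode = co.Code
         ORDER BY ci.Population DESC
         LIMIT 5;                             -- keep only the largest cities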

  • The query plan starts out with a subquery that uses the country table to find the 10 smallest countries by area.

    查詢計劃以一個子查詢開始,該子查詢使用國家表按面積查找 10 個最小的國家。

  • Here we can see how the CO table is a materialized subquery created by first doing a table scan on the country table, then applying a filter for the continent, then sorting based on the surface area, and then limiting the result to 10 rows.

    在這裡,我們可以看到 CO 表是一個具體化的子查詢,它首先對國家表進行表掃描,然後應用大洲過濾器,再根據表面積排序,最後將結果限制為 10 行。

  • Once the derived table has been constructed, it can be used as the first table for the join with the CI table.

    一旦構建了派生表,就可以將其作為與 CI 表連接的第一個表。

  • The second part of the nested loop is simpler, as it just consists of an index lookup on the CI table using the country code index.

    嵌套循環的第二部分比較簡單,只需使用國家代碼索引對 CI 表進行索引查找即可。

  • Here, the estimated cost was about 4 for an expected 17 rows.

    在這裡,預計 17 行的估計成本約為 4 行。

  • There were 10 loops, one for each of the 10 countries, and each fetching an average of 2 rows for a total of 20 rows.

    共有 10 個循環,10 個國家各一個,每個循環平均獲取 2 行,共獲取 20 行。

  • So, in this case, the estimate was not very accurate, because the query exclusively picks small countries.

    是以,在這種情況下,估算並不十分準確,因為查詢只挑選小國。

  • It is estimated that the join will return 174 rows, which comes from the 10 rows in the derived table multiplied by the roughly 17 rows estimated per index lookup in the CI table.

    據估計,連接將返回 174 行,這來自派生表中的 10 行,再乘以 CI 表中每次索引查找估計的約 17 行。

  • When the nested loop has been resolved using the inner join, the result is streamed, that is, not materialized, for the sorting, and the first 5 rows are returned.

    使用內連接解決嵌套循環後,對結果進行流式排序,即不是物化排序,而是返回前 5 行。

  • The total cost of the query is estimated to be around 4 milliseconds.

    據估計,查詢的總成本約為 4 毫秒。

  • So, what can we do with this information?

    那麼,我們能利用這些資訊做些什麼呢?

  • When looking at a plan, there are two questions that you might want to ask yourself.

    在查看計劃時,您可能要問自己兩個問題。

  • Is the runtime shown by EXPLAIN ANALYZE justified for the given query?

    EXPLAIN ANALYZE 子句顯示的運行時間對於給定查詢是否合理?

  • If the query is slow, where does the runtime jump?

    如果查詢速度慢,運行時間會跳到哪裡?

  • Looking for jumps in the execution time of the query will reveal what is really going on.

    查找查詢執行時間的跳變可以揭示真實情況。

  • It's not possible to give general advice here, because there are too many things that can cause issues, but here are some tips.

    這裡不可能提供一般性的建議,因為有太多的事情會導致問題,但這裡有一些提示。

  • What we should pay attention to is whether the estimates and the actual costs differ from each other.

    我們應該注意的是,估算費用和實際費用之間是否存在差異。

  • If there is a big difference, the optimizer may make poor decisions.

    如果差別很大,優化器就會做出錯誤的決定。

  • Possible causes for this difference could be that either the optimizer doesn't have up-to-date statistics, or the optimizer estimates are off for some reason.

    造成這種差異的原因可能是優化器沒有最新的統計數據,或者優化器的估計值因某種原因出現偏差。

  • Running an ANALYZE TABLE statement is definitely a good thing to start with.

    運行 ANALYZE TABLE 子句絕對是一個好的開始。

  • This will collect statistics for the table.

    這將收集表格的統計數據。
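
    For example, to refresh the statistics for the city table:

        ANALYZE TABLE city;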

  • However, under normal circumstances, you don't need to analyze the table, as this happens automatically when certain thresholds are reached.

    不過,在正常情況下,您不需要對錶格進行分析,因為當達到某些閾值時,分析會自動進行。

  • For example, the index statistics take care of themselves in everyday situations, but on high-traffic tables, for example ones with lots of deletes, the statistics can become inaccurate and more time is needed to select the right rows.

    例如,在日常情況下,索引會自顧自地工作,但在高流量表(例如,有大量刪除的表)上,索引可能會感到困惑,需要更多時間來選擇正確的行。

  • In this case, we might run ANALYZE TABLE when there is no other traffic.

    在這種情況下,我們可以在沒有其他流量時運行 ANALYZE TABLE。

  • So, it's definitely worth considering other causes for the bad estimates as well.

    是以,絕對值得考慮造成估算錯誤的其他選項。

  • Let's see such an example.

    讓我們來看看這樣一個例子。

  • We'll switch to a clean database, and we'll create a simple table.

    我們將切換到一個乾淨的數據庫,並創建一個簡單的表。

  • This table has an autoincrement primary key and two integer values.

    該表有一個自動遞增主鍵和兩個整數值。

  • Then, we add an index on the first integer value column with an ALTER TABLE ... ADD INDEX statement on the test estimates table (a sketch follows below).

    然後,我們在第一列整數值上添加索引,更改表格,測試估計值,並添加索引。
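
    The table and column names are not spelled out here, so the sketch below simply assumes a table called test_estimates with integer columns val1 and val2:

        CREATE TABLE test_estimates (
            id   INT UNSIGNED NOT NULL AUTO_INCREMENT PRIMARY KEY,
            val1 INT NOT NULL,
            val2 INT NOT NULL
        );

        -- Index on the first integer value column, as described above.
        ALTER TABLE test_estimates ADD INDEX idx_val1 (val1);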

  • Then, we'll load some data using a simple Python script in the MySQL Shell tool.

    然後,我們將在 MySQL Shell 工具中使用簡單的 Python 腳本加載一些數據。

  • This script will execute the insert statement 100,000 times.

    該腳本將執行插入語句 100,000 次。

  • The integer value is also an autoincremented value from 1 to 100,000.

    整數值也是一個自動遞增值,從 1 到 100,000 不等。

  • After loading the data, we make sure that optimizer statistics are created.

    加載數據後,我們將確保創建優化器統計數據。

  • And now, let's execute a query and check the estimates.

    現在,讓我們執行查詢並檢查估算結果。

  • Explain analyze, select all from test estimates, where 2 multiplied by the value is less than 3.

    解釋分析,從測試估計值中選擇所有 2 乘以值小於 3 的值。
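
    With the assumed names from the earlier sketch, the statement would be:

        EXPLAIN ANALYZE SELECT * FROM test_estimates WHERE 2 * val1 < 3;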

  • In many cases, MySQL might not be able to estimate the where clause properly, even though it has an index.

    在很多情況下,即使有索引,MySQL 也可能無法正確估計 where 子句。

  • This is because it only has statistics on columns and not on expressions.

    這是因為它只統計列,而不統計表達式。

  • What we see here is a major overestimation of the data returned from the where clause.

    我們在這裡看到的是對 where 子句返回數據的嚴重高估。

  • Because MySQL thinks that it will need to read the whole table, it considers that the index is overhead, so it goes ahead and scans the whole table, ignoring the index.

    因為 MySQL 認為它需要讀取整個表,認為索引是開銷,所以它會繼續掃描整個表,忽略索引。

  • In this case, trying to avoid the expression on database level is the best choice.

    在這種情況下,儘量避免在數據庫層面使用表達式是最好的選擇。

  • However, if that is not possible, adding an index on the expression will fix the statistics and will also ensure significantly better performance.

    但是,如果無法做到這一點,添加索引將修復統計數據,並確保顯著提高性能。

  • So, let's add an index on the expression.

    是以,讓我們為表達式添加一個索引。
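
    MySQL 8.0.13 and later support functional index key parts, so an index on the expression itself could look like this (again with the assumed table and column names):

        ALTER TABLE test_estimates ADD INDEX idx_expr ((2 * val1));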

  • The key takeaway from this example is to take a look at the row counters for estimates versus actual execution.

    從這個示例中得到的主要啟示是,查看行計數器的估計值與實際執行值。

  • A large difference, that is, a couple of orders of magnitude or more between the estimated number of rows and the actual numbers of rows is a sign that you need to look closer at it.

    如果估計行數與實際行數相差很大,即相差幾個數量級或更多,就說明你需要仔細研究一下。

  • The optimizer chooses its plan based on the estimates, but looking at the actual execution may tell you that another plan would have been better.

    優化器會根據估計值選擇計劃,但實際執行情況可能會告訴你,另一個計劃會更好。

  • If you run a query twice, the second will likely be much faster simply because of caching, regardless of whether the plan was better or worse.

    如果將一個查詢運行兩次,不管計劃是好是壞,第二次可能會因為緩存而快得多。

  • This represents hot cache behavior, meaning that the data needed for the query was already in the database or the operating system caches.

    這代表了熱緩存行為,即查詢所需的數據已經存在於數據庫或作業系統緩存中。

  • Basically, the data was still in the cache from when it was loaded in the first place.

    從根本上說,它被留在了最初加載數據的緩存中。

  • Whether your cache is hot or cold is a thing to be very careful of when analyzing queries.

    在分析查詢時,您的緩存是熱的還是冷的,這一點要非常小心。

  • One way to solve this problem is to repeatedly run the query and check if it takes the same amount of time each run.

    解決這個問題的方法之一是重複運行查詢,並檢查每次運行所需的時間是否相同。

  • This means that the amount of cached data is staying constant and not impacting the results.

    這意味著緩存數據的數量保持不變,不會影響結果。

  • In this case, it's 100% cached.

    在這種情況下,它是 100% 緩存的。

  • Explain Analyze is a profiling tool for your queries that will show you where MySQL spends time on your query and why.

    Explain Analyze 是用於查詢的剖析工具,可顯示 MySQL 在查詢上花費的時間及其原因。

  • It will plan the query, instrument it, and execute it while counting rows and measuring time spent at various points in the execution plan.

    它將對查詢進行規劃、工具化和執行,同時計算行數並測量執行計劃中各點所花費的時間。

  • It is important to understand how the queries are actually executed based on the statistics available.

    重要的是要了解查詢是如何根據現有統計數據實際執行的。

  • Once you see how to read query plans and understand how each of the underlying node types works in MySQL, then you should be confident managing queries on a production database.

    一旦你瞭解瞭如何讀取查詢計劃,並理解了 MySQL 上每種底層節點類型的工作方式,那麼你就應該有信心管理生產數據庫上的查詢。
