Identify Slow Queries

To help users identify slow queries, analyze and improve the performance of SQL execution, TiDB outputs the statements whose execution time exceeds tidb_slow_log_threshold (The default value is 300 milliseconds) to slow-query-file (The default value is “tidb-slow.log”).

TiDB enables the slow query log by default. You can enable or disable the feature by modifying the system variable tidb_enable_slow_log.

Usage example

  1. # Time: 2019-08-14T09:26:59.487776265+08:00
  2. # Txn_start_ts: 410450924122144769
  3. # User@Host: root[root] @ localhost [127.0.0.1]
  4. # Conn_ID: 3086
  5. # Exec_retry_time: 5.1 Exec_retry_count: 3
  6. # Query_time: 1.527627037
  7. # Parse_time: 0.000054933
  8. # Compile_time: 0.000129729
  9. # Rewrite_time: 0.000000003 Preproc_subqueries: 2 Preproc_subqueries_time: 0.000000002
  10. # Optimize_time: 0.00000001
  11. # Wait_TS: 0.00001078
  12. # Process_time: 0.07 Request_count: 1 Total_keys: 131073 Process_keys: 131072 Prewrite_time: 0.335415029 Commit_time: 0.032175429 Get_commit_ts_time: 0.000177098 Local_latch_wait_time: 0.106869448 Write_keys: 131072 Write_size: 3538944 Prewrite_region: 1
  13. # DB: test
  14. # Is_internal: false
  15. # Digest: 50a2e32d2abbd6c1764b1b7f2058d428ef2712b029282b776beb9506a365c0f1
  16. # Stats: t:pseudo
  17. # Num_cop_tasks: 1
  18. # Cop_proc_avg: 0.07 Cop_proc_p90: 0.07 Cop_proc_max: 0.07 Cop_proc_addr: 172.16.5.87:20171
  19. # Cop_wait_avg: 0 Cop_wait_p90: 0 Cop_wait_max: 0 Cop_wait_addr: 172.16.5.87:20171
  20. # Cop_backoff_regionMiss_total_times: 200 Cop_backoff_regionMiss_total_time: 0.2 Cop_backoff_regionMiss_max_time: 0.2 Cop_backoff_regionMiss_max_addr: 127.0.0.1 Cop_backoff_regionMiss_avg_time: 0.2 Cop_backoff_regionMiss_p90_time: 0.2
  21. # Cop_backoff_rpcPD_total_times: 200 Cop_backoff_rpcPD_total_time: 0.2 Cop_backoff_rpcPD_max_time: 0.2 Cop_backoff_rpcPD_max_addr: 127.0.0.1 Cop_backoff_rpcPD_avg_time: 0.2 Cop_backoff_rpcPD_p90_time: 0.2
  22. # Cop_backoff_rpcTiKV_total_times: 200 Cop_backoff_rpcTiKV_total_time: 0.2 Cop_backoff_rpcTiKV_max_time: 0.2 Cop_backoff_rpcTiKV_max_addr: 127.0.0.1 Cop_backoff_rpcTiKV_avg_time: 0.2 Cop_backoff_rpcTiKV_p90_time: 0.2
  23. # Mem_max: 525211
  24. # Disk_max: 65536
  25. # Prepared: false
  26. # Plan_from_cache: false
  27. # Succ: true
  28. # Plan: tidb_decode_plan('ZJAwCTMyXzcJMAkyMAlkYXRhOlRhYmxlU2Nhbl82CjEJMTBfNgkxAR0AdAEY1Dp0LCByYW5nZTpbLWluZiwraW5mXSwga2VlcCBvcmRlcjpmYWxzZSwgc3RhdHM6cHNldWRvCg==')
  29. use test;
  30. insert into t select * from t;

Fields description

Identify Slow Queries - 图1

Note

The unit of all the following time fields in the slow query log is “second”.

Slow query basics:

  • Time: The print time of log.
  • Query_time: The execution time of a statement.
  • Parse_time: The parsing time for the statement.
  • Compile_time: The duration of the query optimization.
  • Optimize_time: The time consumed for optimizing the execution plan.
  • Wait_TS: The waiting time of the statement to get transaction timestamps.
  • Query: A SQL statement. Query is not printed in the slow log, but the corresponding field is called Query after the slow log is mapped to the memory table.
  • Digest: The fingerprint of the SQL statement.
  • Txn_start_ts: The start timestamp and the unique ID of a transaction. You can use this value to search for the transaction-related logs.
  • Is_internal: Whether a SQL statement is TiDB internal. true indicates that a SQL statement is executed internally in TiDB and false indicates that a SQL statement is executed by the user.
  • Index_names: The index names used by the statement.
  • Stats: The health state, internal version, total row count, modified row count, and load state of statistics that are used during this query. pseudo indicates that the statistics information is unhealthy. If the optimizer attempts to use some statistics that are not fully loaded, the internal state is also printed. For example, the meaning of t1:439478225786634241[105000;5000][col1:allEvicted][idx1:allEvicted] can be understood as follows:
    • t1: statistics on table t1 are used during query optimization.
    • 439478225786634241: the internal version.
    • 105000: the total row count in the statistics.
    • 5000: the number of rows modified since the last statistics collection.
    • col1:allEvicted: statistics on the column col1 are not fully loaded.
    • idx1:allEvicted: statistics on the index idx1 are not fully loaded.
  • Succ: Whether a statement is executed successfully.
  • Backoff_time: The waiting time before retry when a statement encounters errors that require a retry. The common errors as such include: lock occurs, Region split, and tikv server is busy.
  • Plan: The execution plan of a statement. Execute the SELECT tidb_decode_plan('xxx...') statement to parse the specific execution plan.
  • Binary_plan: The execution plan of a binary-encoded statement. Execute the SELECT tidb_decode_binary_plan('xxx...') statement to parse the specific execution plan. The Plan and Binary_plan fields carry the same information. However, the format of execution plans parsed from the two fields are different.
  • Prepared: Whether this statement is a Prepare or Execute request or not.
  • Plan_from_cache: Whether this statement hits the execution plan cache.
  • Plan_from_binding: Whether this statement uses the bound execution plans.
  • Has_more_results: Whether this statement has more results to be fetched by users.
  • Rewrite_time: The time consumed for rewriting the query of this statement.
  • Preproc_subqueries: The number of subqueries (in the statement) that are executed in advance. For example, the where id in (select if from t) subquery might be executed in advance.
  • Preproc_subqueries_time: The time consumed for executing the subquery of this statement in advance.
  • Exec_retry_count: The retry times of this statement. This field is usually for pessimistic transactions in which the statement is retried when the lock is failed.
  • Exec_retry_time: The execution retry duration of this statement. For example, if a statement has been executed three times in total (failed for the first two times), Exec_retry_time means the total duration of the first two executions. The duration of the last execution is Query_time minus Exec_retry_time.
  • KV_total: The time spent on all the RPC requests on TiKV or TiFlash by this statement.
  • PD_total: The time spent on all the RPC requests on PD by this statement.
  • Backoff_total: The time spent on all the backoff during the execution of this statement.
  • Write_sql_response_total: The time consumed for sending the results back to the client by this statement.
  • Result_rows: The row count of the query results.
  • IsExplicitTxn: Whether this statement is in an explicit transaction. If the value is false, the transaction is autocommit=1 and the statement is automatically committed after execution.
  • Warnings: The JSON-formatted warnings that are generated during the execution of this statement. These warnings are generally consistent with the output of the SHOW WARNINGS statement, but might include extra warnings that provide more diagnostic information. These extra warnings are marked as IsExtra: true.

The following fields are related to transaction execution:

  • Prewrite_time: The duration of the first phase (prewrite) of the two-phase transaction commit.
  • Commit_time: The duration of the second phase (commit) of the two-phase transaction commit.
  • Get_commit_ts_time: The time spent on getting commit_ts during the second phase (commit) of the two-phase transaction commit.
  • Local_latch_wait_time: The time that TiDB spends on waiting for the lock before the second phase (commit) of the two-phase transaction commit.
  • Write_keys: The count of keys that the transaction writes to the Write CF in TiKV.
  • Write_size: The total size of the keys or values to be written when the transaction commits.
  • Prewrite_region: The number of TiKV Regions involved in the first phase (prewrite) of the two-phase transaction commit. Each Region triggers a remote procedure call.
  • Wait_prewrite_binlog_time: The time used to write binlogs when a transaction is committed.
  • Resolve_lock_time: The time to resolve or wait for the lock to be expired after a lock is encountered during a transaction commit.

Memory usage fields:

  • Mem_max: The maximum memory space used during the execution period of a SQL statement (the unit is byte).

Hard disk fields:

  • Disk_max: The maximum disk space used during the execution period of a SQL statement (the unit is byte).

User fields:

  • User: The name of the user who executes this statement.
  • Host: The host name of this statement.
  • Conn_ID: The Connection ID (session ID). For example, you can use the keyword con:3 to search for the log whose session ID is 3.
  • DB: The current database.

TiKV Coprocessor Task fields:

  • Request_count: The number of Coprocessor requests that a statement sends.
  • Total_keys: The number of keys that Coprocessor has scanned.
  • Process_time: The total processing time of a SQL statement in TiKV. Because data is sent to TiKV concurrently, this value might exceed Query_time.
  • Wait_time: The total waiting time of a statement in TiKV. Because the Coprocessor of TiKV runs a limited number of threads, requests might queue up when all threads of Coprocessor are working. When a request in the queue takes a long time to process, the waiting time of the subsequent requests increases.
  • Process_keys: The number of keys that Coprocessor has processed. Compared with total_keys, processed_keys does not include the old versions of MVCC. A great difference between processed_keys and total_keys indicates that many old versions exist.
  • Num_cop_tasks: The number of Coprocessor tasks sent by this statement.
  • Cop_proc_avg: The average execution time of cop-tasks, including some waiting time that cannot be counted, such as the mutex in RocksDB.
  • Cop_proc_p90: The P90 execution time of cop-tasks.
  • Cop_proc_max: The maximum execution time of cop-tasks.
  • Cop_proc_addr: The address of the cop-task with the longest execution time.
  • Cop_wait_avg: The average waiting time of cop-tasks, including the time of request queueing and getting snapshots.
  • Cop_wait_p90: The P90 waiting time of cop-tasks.
  • Cop_wait_max: The maximum waiting time of cop-tasks.
  • Cop_wait_addr: The address of the cop-task whose waiting time is the longest.
  • Rocksdb_delete_skipped_count: The number of scans on deleted keys during RocksDB reads.
  • Rocksdb_key_skipped_count: The number of deleted (tombstone) keys that RocksDB encounters when scanning data.
  • Rocksdb_block_cache_hit_count: The number of times RocksDB reads data from the block cache.
  • Rocksdb_block_read_count: The number of times RocksDB reads data from the file system.
  • Rocksdb_block_read_byte: The amount of data RocksDB reads from the file system.
  • Rocksdb_block_read_time: The time RocksDB takes to read data from the file system.
  • Cop_backoff_{backoff-type}_total_times: The total times of backoff caused by an error.
  • Cop_backoff_{backoff-type}_total_time: The total time of backoff caused by an error.
  • Cop_backoff_{backoff-type}_max_time: The longest time of backoff caused by an error.
  • Cop_backoff_{backoff-type}_max_addr: The address of the cop-task that has the longest backoff time caused by an error.
  • Cop_backoff_{backoff-type}_avg_time: The average time of backoff caused by an error.
  • Cop_backoff_{backoff-type}_p90_time: The P90 percentile backoff time caused by an error.

backoff-type generally includes the following types:

  • tikvRPC: The backoff caused by failing to send RPC requests to TiKV.
  • tiflashRPC: The backoff caused by failing to send RPC requests to TiFlash.
  • pdRPC: The backoff caused by failing to send RPC requests to PD.
  • txnLock: The backoff caused by lock conflicts.
  • regionMiss: The backoff caused by that processing requests fails when the TiDB Region cache information is outdated after Regions are split or merged.
  • regionScheduling: The backoff caused by that TiDB cannot process requests when Regions are being scheduled and the Leader is not selected.
  • tikvServerBusy: The backoff caused by that the TiKV load is too high to handle new requests.
  • tiflashServerBusy: The backoff caused by that the TiFlash load is too high to handle new requests.
  • tikvDiskFull: The backoff caused by that the TiKV disk is full.
  • txnLockFast: The backoff caused by that locks are encountered during data reads.

Fields related to Resource Control:

  • Resource_group: the resource group that the statement is bound to.
  • Request_unit_read: the total read RUs consumed by the statement.
  • Request_unit_write: the total write RUs consumed by the statement.
  • Time_queued_by_rc: the total time that the statement waits for available resources.
  • tidb_slow_log_threshold: Sets the threshold for the slow log. The SQL statement whose execution time exceeds this threshold is recorded in the slow log. The default value is 300 (ms).

  • tidb_query_log_max_len: Sets the maximum length of the SQL statement recorded in the slow log. The default value is 4096 (byte).

  • tidb_redact_log: Determines whether to desensitize user data using ? in the SQL statement recorded in the slow log. The default value is 0, which means to disable the feature.

  • tidb_enable_collect_execution_info: Determines whether to record the physical execution information of each operator in the execution plan. The default value is 1. This feature impacts the performance by approximately 3%. After enabling this feature, you can view the Plan information as follows:

    1. > select tidb_decode_plan('jAOIMAk1XzE3CTAJMQlmdW5jczpjb3VudChDb2x1bW4jNyktPkMJC/BMNQkxCXRpbWU6MTAuOTMxNTA1bXMsIGxvb3BzOjIJMzcyIEJ5dGVzCU4vQQoxCTMyXzE4CTAJMQlpbmRleDpTdHJlYW1BZ2dfOQkxCXQRSAwyNzY4LkgALCwgcnBjIG51bTogMQkMEXMQODg0MzUFK0hwcm9jIGtleXM6MjUwMDcJMjA2HXsIMgk1BWM2zwAAMRnIADcVyAAxHcEQNQlOL0EBBPBbCjMJMTNfMTYJMQkzMTI4MS44NTc4MTk5MDUyMTcJdGFibGU6dCwgaW5kZXg6aWR4KGEpLCByYW5nZTpbLWluZiw1MDAwMCksIGtlZXAgb3JkZXI6ZmFsc2UJMjUBrgnQVnsA');
    2. +------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
    3. | tidb_decode_plan('jAOIMAk1XzE3CTAJMQlmdW5jczpjb3VudChDb2x1bW4jNyktPkMJC/BMNQkxCXRpbWU6MTAuOTMxNTA1bXMsIGxvb3BzOjIJMzcyIEJ5dGVzCU4vQQoxCTMyXzE4CTAJMQlpbmRleDpTdHJlYW1BZ2dfOQkxCXQRSAwyNzY4LkgALCwgcnBjIG51bTogMQkMEXMQODg0MzUFK0hwcm9jIGtleXM6MjUwMDcJMjA2HXsIMg |
    4. +------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
    5. | id task estRows operator info actRows execution info memory disk |
    6. | StreamAgg_17 root 1 funcs:count(Column#7)->Column#5 1 time:10.931505ms, loops:2 372 Bytes N/A |
    7. | └─IndexReader_18 root 1 index:StreamAgg_9 1 time:10.927685ms, loops:2, rpc num: 1, rpc time:10.884355ms, proc keys:25007 206 Bytes N/A |
    8. | └─StreamAgg_9 cop 1 funcs:count(1)->Column#7 1 time:11ms, loops:25 N/A N/A |
    9. | └─IndexScan_16 cop 31281.857819905217 table:t, index:idx(a), range:[-inf,50000), keep order:false 25007 time:11ms, loops:25 N/A N/A |
    10. +------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+

If you are conducting a performance test, you can disable the feature of automatically collecting the execution information of operators:

  1. set @@tidb_enable_collect_execution_info=0;

The returned result of the Plan field has roughly the same format with that of EXPLAIN or EXPLAIN ANALYZE. For more details of the execution plan, see EXPLAIN or EXPLAIN ANALYZE.

For more information, see TiDB specific variables and syntax.

Memory mapping in slow log

You can query the content of the slow query log by querying the INFORMATION_SCHEMA.SLOW_QUERY table. Each column name in the table corresponds to one field name in the slow log. For table structure, see the introduction to the SLOW_QUERY table in Information Schema.

Identify Slow Queries - 图2

Note

Every time you query the SLOW_QUERY table, TiDB reads and parses the current slow query log.

For TiDB 4.0, SLOW_QUERY supports querying the slow log of any period of time, including the rotated slow log file. You need to specify the TIME range to locate the slow log files that need to be parsed. If you don’t specify the TIME range, TiDB only parses the current slow log file. For example:

  • If you don’t specify the time range, TiDB only parses the slow query data that TiDB is writing to the slow log file:

    1. select count(*),
    2. min(time),
    3. max(time)
    4. from slow_query;
    1. +----------+----------------------------+----------------------------+
    2. | count(*) | min(time) | max(time) |
    3. +----------+----------------------------+----------------------------+
    4. | 122492 | 2020-03-11 23:35:20.908574 | 2020-03-25 19:16:38.229035 |
    5. +----------+----------------------------+----------------------------+
  • If you specify the time range, for example, from 2020-03-10 00:00:00 to 2020-03-11 00:00:00, TiDB first locates the slow log files of the specified time range, and then parses the slow query information:

    1. select count(*),
    2. min(time),
    3. max(time)
    4. from slow_query
    5. where time > '2020-03-10 00:00:00'
    6. and time < '2020-03-11 00:00:00';
    1. +----------+----------------------------+----------------------------+
    2. | count(*) | min(time) | max(time) |
    3. +----------+----------------------------+----------------------------+
    4. | 2618049 | 2020-03-10 00:00:00.427138 | 2020-03-10 23:00:22.716728 |
    5. +----------+----------------------------+----------------------------+

Identify Slow Queries - 图3

Note

If the slow log files of the specified time range are removed, or there is no slow query, the query returns NULL.

TiDB 4.0 adds the CLUSTER_SLOW_QUERY system table to query the slow query information of all TiDB nodes. The table schema of the CLUSTER_SLOW_QUERY table differs from that of the SLOW_QUERY table in that an INSTANCE column is added to CLUSTER_SLOW_QUERY. The INSTANCE column represents the TiDB node address of the row information on the slow query. You can use CLUSTER_SLOW_QUERY the way you do with SLOW_QUERY.

When you query the CLUSTER_SLOW_QUERY table, TiDB pushes the computation and the judgment down to other nodes, instead of retrieving all slow query information from other nodes and executing the operations on one TiDB node.

SLOW_QUERY / CLUSTER_SLOW_QUERY usage examples

Top-N slow queries

Query the Top 2 slow queries of users. Is_internal=false means excluding slow queries inside TiDB and only querying slow queries of users.

  1. select query_time, query
  2. from information_schema.slow_query
  3. where is_internal = false
  4. order by query_time desc
  5. limit 2;

Output example:

  1. +--------------+------------------------------------------------------------------+
  2. | query_time | query |
  3. +--------------+------------------------------------------------------------------+
  4. | 12.77583857 | select * from t_slim, t_wide where t_slim.c0=t_wide.c0; |
  5. | 0.734982725 | select t0.c0, t1.c1 from t_slim t0, t_wide t1 where t0.c0=t1.c0; |
  6. +--------------+------------------------------------------------------------------+

Query the Top-N slow queries of the test user

In the following example, the slow queries executed by the test user are queried, and the first two results are displayed in reverse order of execution time.

  1. select query_time, query, user
  2. from information_schema.slow_query
  3. where is_internal = false
  4. and user = "test"
  5. order by query_time desc
  6. limit 2;

Output example:

  1. +-------------+------------------------------------------------------------------+----------------+
  2. | Query_time | query | user |
  3. +-------------+------------------------------------------------------------------+----------------+
  4. | 0.676408014 | select t0.c0, t1.c1 from t_slim t0, t_wide t1 where t0.c0=t1.c1; | test |
  5. +-------------+------------------------------------------------------------------+----------------+

Query similar slow queries with the same SQL fingerprints

After querying the Top-N SQL statements, continue to query similar slow queries using the same fingerprints.

  1. Acquire Top-N slow queries and the corresponding SQL fingerprints.

    1. select query_time, query, digest
    2. from information_schema.slow_query
    3. where is_internal = false
    4. order by query_time desc
    5. limit 1;

    Output example:

    1. +-------------+-----------------------------+------------------------------------------------------------------+
    2. | query_time | query | digest |
    3. +-------------+-----------------------------+------------------------------------------------------------------+
    4. | 0.302558006 | select * from t1 where a=1; | 4751cb6008fda383e22dacb601fde85425dc8f8cf669338d55d944bafb46a6fa |
    5. +-------------+-----------------------------+------------------------------------------------------------------+
  2. Query similar slow queries with the fingerprints.

    1. select query, query_time
    2. from information_schema.slow_query
    3. where digest = "4751cb6008fda383e22dacb601fde85425dc8f8cf669338d55d944bafb46a6fa";

    Output example:

    1. +-----------------------------+-------------+
    2. | query | query_time |
    3. +-----------------------------+-------------+
    4. | select * from t1 where a=1; | 0.302558006 |
    5. | select * from t1 where a=2; | 0.401313532 |
    6. +-----------------------------+-------------+

Query slow queries with pseudo stats

  1. select query, query_time, stats
  2. from information_schema.slow_query
  3. where is_internal = false
  4. and stats like '%pseudo%';

Output example:

  1. +-----------------------------+-------------+---------------------------------+
  2. | query | query_time | stats |
  3. +-----------------------------+-------------+---------------------------------+
  4. | select * from t1 where a=1; | 0.302558006 | t1:pseudo |
  5. | select * from t1 where a=2; | 0.401313532 | t1:pseudo |
  6. | select * from t1 where a>2; | 0.602011247 | t1:pseudo |
  7. | select * from t1 where a>3; | 0.50077719 | t1:pseudo |
  8. | select * from t1 join t2; | 0.931260518 | t1:407872303825682445,t2:pseudo |
  9. +-----------------------------+-------------+---------------------------------+

Query slow queries whose execution plan is changed

When the execution plan of SQL statements of the same category is changed, the execution slows down, because the statistics is outdated, or the statistics is not accurate enough to reflect the real data distribution. You can use the following SQL statement to query SQL statements with different execution plans.

  1. select count(distinct plan_digest) as count,
  2. digest,
  3. min(query)
  4. from cluster_slow_query
  5. group by digest
  6. having count > 1
  7. limit 3\G

Output example:

  1. ***************************[ 1. row ]***************************
  2. count | 2
  3. digest | 17b4518fde82e32021877878bec2bb309619d384fca944106fcaf9c93b536e94
  4. min(query) | SELECT DISTINCT c FROM sbtest25 WHERE id BETWEEN ? AND ? ORDER BY c [arguments: (291638, 291737)];
  5. ***************************[ 2. row ]***************************
  6. count | 2
  7. digest | 9337865f3e2ee71c1c2e740e773b6dd85f23ad00f8fa1f11a795e62e15fc9b23
  8. min(query) | SELECT DISTINCT c FROM sbtest22 WHERE id BETWEEN ? AND ? ORDER BY c [arguments: (215420, 215519)];
  9. ***************************[ 3. row ]***************************
  10. count | 2
  11. digest | db705c89ca2dfc1d39d10e0f30f285cbbadec7e24da4f15af461b148d8ffb020
  12. min(query) | SELECT DISTINCT c FROM sbtest11 WHERE id BETWEEN ? AND ? ORDER BY c [arguments: (303359, 303458)];

Then you can query the different plans using the SQL fingerprint in the query result above:

  1. select min(plan),
  2. plan_digest
  3. from cluster_slow_query
  4. where digest='17b4518fde82e32021877878bec2bb309619d384fca944106fcaf9c93b536e94'
  5. group by plan_digest\G

Output example:

  1. *************************** 1. row ***************************
  2. min(plan): Sort_6 root 100.00131380758702 sbtest.sbtest25.c:asc
  3. └─HashAgg_10 root 100.00131380758702 group by:sbtest.sbtest25.c, funcs:firstrow(sbtest.sbtest25.c)->sbtest.sbtest25.c
  4. └─TableReader_15 root 100.00131380758702 data:TableRangeScan_14
  5. └─TableScan_14 cop 100.00131380758702 table:sbtest25, range:[502791,502890], keep order:false
  6. plan_digest: 6afbbd21f60ca6c6fdf3d3cd94f7c7a49dd93c00fcf8774646da492e50e204ee
  7. *************************** 2. row ***************************
  8. min(plan): Sort_6 root 1 sbtest.sbtest25.c:asc
  9. └─HashAgg_12 root 1 group by:sbtest.sbtest25.c, funcs:firstrow(sbtest.sbtest25.c)->sbtest.sbtest25.c
  10. └─TableReader_13 root 1 data:HashAgg_8
  11. └─HashAgg_8 cop 1 group by:sbtest.sbtest25.c,
  12. └─TableScan_11 cop 1.2440069558121831 table:sbtest25, range:[472745,472844], keep order:false

Query the number of slow queries for each TiDB node in a cluster

  1. select instance, count(*) from information_schema.cluster_slow_query where time >= "2020-03-06 00:00:00" and time < now() group by instance;

Output example:

  1. +---------------+----------+
  2. | instance | count(*) |
  3. +---------------+----------+
  4. | 0.0.0.0:10081 | 124 |
  5. | 0.0.0.0:10080 | 119771 |
  6. +---------------+----------+

Query slow logs occurring only in abnormal time period

If you find problems such as decreased QPS or increased latency for the time period from 2020-03-10 13:24:00 to 2020-03-10 13:27:00, the reason might be that a large query crops up. Run the following SQL statement to query slow logs that occur only in abnormal time period. The time range from 2020-03-10 13:20:00 to 2020-03-10 13:23:00 refers to the normal time period.

  1. SELECT * FROM
  2. (SELECT /*+ AGG_TO_COP(), HASH_AGG() */ count(*),
  3. min(time),
  4. sum(query_time) AS sum_query_time,
  5. sum(Process_time) AS sum_process_time,
  6. sum(Wait_time) AS sum_wait_time,
  7. sum(Commit_time),
  8. sum(Request_count),
  9. sum(process_keys),
  10. sum(Write_keys),
  11. max(Cop_proc_max),
  12. min(query),min(prev_stmt),
  13. digest
  14. FROM information_schema.CLUSTER_SLOW_QUERY
  15. WHERE time >= '2020-03-10 13:24:00'
  16. AND time < '2020-03-10 13:27:00'
  17. AND Is_internal = false
  18. GROUP BY digest) AS t1
  19. WHERE t1.digest NOT IN
  20. (SELECT /*+ AGG_TO_COP(), HASH_AGG() */ digest
  21. FROM information_schema.CLUSTER_SLOW_QUERY
  22. WHERE time >= '2020-03-10 13:20:00'
  23. AND time < '2020-03-10 13:23:00'
  24. GROUP BY digest)
  25. ORDER BY t1.sum_query_time DESC limit 10\G

Output example:

  1. ***************************[ 1. row ]***************************
  2. count(*) | 200
  3. min(time) | 2020-03-10 13:24:27.216186
  4. sum_query_time | 50.114126194
  5. sum_process_time | 268.351
  6. sum_wait_time | 8.476
  7. sum(Commit_time) | 1.044304306
  8. sum(Request_count) | 6077
  9. sum(process_keys) | 202871950
  10. sum(Write_keys) | 319500
  11. max(Cop_proc_max) | 0.263
  12. min(query) | delete from test.tcs2 limit 5000;
  13. min(prev_stmt) |
  14. digest | 24bd6d8a9b238086c9b8c3d240ad4ef32f79ce94cf5a468c0b8fe1eb5f8d03df

Parse other TiDB slow log files

TiDB uses the session variable tidb_slow_query_file to control the files to be read and parsed when querying INFORMATION_SCHEMA.SLOW_QUERY. You can query the content of other slow query log files by modifying the value of the session variable.

  1. set tidb_slow_query_file = "/path-to-log/tidb-slow.log"

Parse TiDB slow logs with pt-query-digest

Use pt-query-digest to parse TiDB slow logs.

Identify Slow Queries - 图4

Note

It is recommended to use pt-query-digest 3.0.13 or later versions.

For example:

  1. pt-query-digest --report tidb-slow.log

Output example:

  1. # 320ms user time, 20ms system time, 27.00M rss, 221.32M vsz
  2. # Current date: Mon Mar 18 13:18:51 2019
  3. # Hostname: localhost.localdomain
  4. # Files: tidb-slow.log
  5. # Overall: 1.02k total, 21 unique, 0 QPS, 0x concurrency _________________
  6. # Time range: 2019-03-18-12:22:16 to 2019-03-18-13:08:52
  7. # Attribute total min max avg 95% stddev median
  8. # ============ ======= ======= ======= ======= ======= ======= =======
  9. # Exec time 218s 10ms 13s 213ms 30ms 1s 19ms
  10. # Query size 175.37k 9 2.01k 175.89 158.58 122.36 158.58
  11. # Commit time 46ms 2ms 7ms 3ms 7ms 1ms 3ms
  12. # Conn ID 71 1 16 8.88 15.25 4.06 9.83
  13. # Process keys 581.87k 2 103.15k 596.43 400.73 3.91k 400.73
  14. # Process time 31s 1ms 10s 32ms 19ms 334ms 16ms
  15. # Request coun 1.97k 1 10 2.02 1.96 0.33 1.96
  16. # Total keys 636.43k 2 103.16k 652.35 793.42 3.97k 400.73
  17. # Txn start ts 374.38E 0 16.00E 375.48P 1.25P 89.05T 1.25P
  18. # Wait time 943ms 1ms 19ms 1ms 2ms 1ms 972us
  19. .
  20. .
  21. .

Identify problematic SQL statements

Not all of the SLOW_QUERY statements are problematic. Only those whose process_time is very large increase the pressure on the entire cluster.

The statements whose wait_time is very large and process_time is very small are usually not problematic. This is because the statement is blocked by real problematic statements and it has to wait in the execution queue, which leads to a much longer response time.

ADMIN SHOW SLOW command

In addition to the TiDB log file, you can identify slow queries by running the ADMIN SHOW SLOW command:

  1. ADMIN SHOW SLOW recent N
  2. ADMIN SHOW SLOW TOP [internal | all] N

recent N shows the recent N slow query records, for example:

  1. ADMIN SHOW SLOW recent 10

top N shows the slowest N query records recently (within a few days). If the internal option is provided, the returned results would be the inner SQL executed by the system; If the all option is provided, the returned results would be the user’s SQL combinated with inner SQL; Otherwise, this command would only return the slow query records from the user’s SQL.

  1. ADMIN SHOW SLOW top 3
  2. ADMIN SHOW SLOW top internal 3
  3. ADMIN SHOW SLOW top all 5

TiDB stores only a limited number of slow query records because of the limited memory. If the value of N in the query command is greater than the records count, the number of returned records is smaller than N.

The following table shows output details:

Column nameDescription
startThe starting time of the SQL execution
durationThe duration of the SQL execution
detailsThe details of the SQL execution
succWhether the SQL statement is executed successfully. 1 means success and 0 means failure.
conn_idThe connection ID for the session
transaction_tsThe start ts of the transaction
userThe user name for the execution of the statement
dbThe database involved when the statement is executed
table_idsThe ID of the table involved when the SQL statement is executed
index_idsThe ID of the index involved when the SQL statement is executed
internalThis is a TiDB internal SQL statement
digestThe fingerprint of the SQL statement
sqlThe SQL statement that is being executed or has been executed