When SQL 2019 was announced, one of the changes that interested me most was the new capture mode for Query Store, called Custom. Until then, there were only three modes:
- None: only statistics for already-captured queries will be recorded. New queries executed after setting this capture mode will be ignored.
- All: all queries and their statistics will be captured.
- Auto: Query Store itself will decide what queries will be captured.
So the next question was “What does Auto capture?”. There is no documentation on this, so it is a mystery, and the only way to find out would be to execute the same load on two identical copies of the same database, one with the All capture mode and the other with the Auto capture mode, and compare the information Query Store has recorded on each.
So the Custom mode arrived as a way of giving us some control over how Query Store behaves, but let’s see how much we can actually control how queries are filtered, and how Query Store achieves this.
How customizable can Custom be?
Once the Query Store capture mode is set to Custom, four parameters become available:
Stale Threshold
There is a predefined set of intervals to select from; the chosen interval applies to all the settings below (Execution Count, Total Compile CPU Time, and Total Execution CPU Time).
Execution Count
Number of executions a query must reach within the allocated Stale Threshold.
Total Compile CPU Time (ms)
Total time spent compiling the query within the allocated Stale Threshold. This threshold applies to the sum of the compilation times of all the plans generated for the same query.
Total Execution CPU Time (ms)
Total CPU time spent executing the query within the allocated Stale Threshold. This threshold applies to the total CPU of the query, regardless of the execution plan used.
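Translated into T-SQL, configuring these four parameters looks something like the minimal sketch below, using the documented ALTER DATABASE … SET QUERY_STORE syntax; [YourDatabase] is a placeholder, and the values shown mirror the defaults we will discuss later in this post:

```sql
ALTER DATABASE [YourDatabase]
SET QUERY_STORE = ON
(
    QUERY_CAPTURE_MODE = CUSTOM,
    QUERY_CAPTURE_POLICY =
    (
        STALE_CAPTURE_POLICY_THRESHOLD = 1 HOURS, -- Stale Threshold
        EXECUTION_COUNT = 30,                     -- Execution Count
        TOTAL_COMPILE_CPU_TIME_MS = 1000,         -- Total Compile CPU Time (ms)
        TOTAL_EXECUTION_CPU_TIME_MS = 100         -- Total Execution CPU Time (ms)
    )
);
```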
Exceeding any of these thresholds within the allocated Stale Threshold will cause Query Store to capture the query’s statistics. But something is missing, isn’t it?
Out of the 13 available metrics, we can only use two of them for filtering (Execution Count and CPU Time), plus the Compile Time, which is not even shown anywhere in the regular reports. Not even the Duration metric, which all Query Store reports default to, can be used to configure the query filters in this Custom capture mode. It is understandable that some of those metrics were left out (I can’t think of any scenario where I’d be interested in capturing queries that returned more than 100 rows, for example), so let’s assume these metrics were selected as a compromise between speed (we don’t want Query Store to increase its impact on the server by performing complex calculations to filter the captured queries) and giving DBAs some control, so they don’t have to choose between capturing All and whatever Auto captures.
How does the Custom mode really capture queries?
Regular Query Store captures information in memory and, once the time interval closes, aggregates the data; during this final calculation it removes the duplicate entries it maintains for some of the queries’ statistics while the interval is still open and capturing live data. So common sense dictates Query Store will do something similar with this Custom capture mode: wait until the Stale Threshold time has passed and then clean up the queries that didn’t exceed any of the defined thresholds.
To confirm it, I’ve run several tests on two copies of the same database: one with the All capture mode as a control database (QSDB_ALL), and another one with different Custom capture settings (QSDB_CUSTOM). To avoid queries being captured due to a threshold other than the one being tested, the out-of-scope thresholds were set to very high values (millions of executions, and of CPU and Compile milliseconds).
Testing: Execution Count
Custom setting: 10 executions / 1 hour Stale Threshold.
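In T-SQL, the configuration for this first test would look roughly like this, with the out-of-scope thresholds raised to arbitrarily large values:

```sql
ALTER DATABASE [QSDB_CUSTOM]
SET QUERY_STORE
(
    QUERY_CAPTURE_MODE = CUSTOM,
    QUERY_CAPTURE_POLICY =
    (
        STALE_CAPTURE_POLICY_THRESHOLD = 1 HOURS,
        EXECUTION_COUNT = 10,                    -- the threshold under test
        TOTAL_COMPILE_CPU_TIME_MS = 10000000,    -- out of scope: set very high
        TOTAL_EXECUTION_CPU_TIME_MS = 10000000   -- out of scope: set very high
    )
);
```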
I executed a simple UPDATE command 10 times on each database right after clearing the Query Store cache; the results follow the sketch below.
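The run itself was along these lines; dbo.TestTable and the UPDATE are hypothetical stand-ins for the actual test objects:

```sql
-- Start both databases from a clean slate
ALTER DATABASE [QSDB_ALL]    SET QUERY_STORE CLEAR;
ALTER DATABASE [QSDB_CUSTOM] SET QUERY_STORE CLEAR;
GO
USE [QSDB_CUSTOM];  -- repeat the batch below on QSDB_ALL as well
GO
UPDATE dbo.TestTable SET Amount = Amount + 1 WHERE Id = 1;
GO 10  -- SSMS/sqlcmd shorthand: execute the previous batch 10 times
```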
QSDB_ALL captures all executions, but what happened with QSDB_CUSTOM? I executed the query exactly the same number of times on both databases, yet the only measurement that seems similar is the compile time. Let’s try executing the query 10 more times on each database and see what happens.
Now we have 11 vs 20 executions. If this were the only issue, we could just add the threshold value minus one to the first hour of statistics and we’d get the real execution count. But unless the Statistics Collection Interval is set to 1 hour, you are missing the execution details: was the query executed regularly every 5 minutes, or did all 10 executions suddenly take place in the last minutes of the interval?
Not only are we missing execution counts, we are also missing all the other statistics for those executions, such as CPU. What if you have a process running every 5 minutes to process a huge amount of data, and that data is deleted at XX:44? You’d be missing all information regarding the first 9 executions, and suddenly the 10th would be captured and look innocuous.
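To see how executions are spread across collection intervals, the per-interval statistics on each database can be compared with a query along these lines (the LIKE filter is a hypothetical stand-in for the actual test statement):

```sql
-- Per-interval execution counts and CPU for a given query text
SELECT  i.start_time,
        i.end_time,
        rs.count_executions,
        rs.avg_cpu_time  -- microseconds
FROM    sys.query_store_runtime_stats          AS rs
JOIN    sys.query_store_runtime_stats_interval AS i
            ON i.runtime_stats_interval_id = rs.runtime_stats_interval_id
JOIN    sys.query_store_plan       AS p  ON p.plan_id        = rs.plan_id
JOIN    sys.query_store_query      AS q  ON q.query_id       = p.query_id
JOIN    sys.query_store_query_text AS qt ON qt.query_text_id = q.query_text_id
WHERE   qt.query_sql_text LIKE N'%UPDATE dbo.TestTable%'
ORDER BY i.start_time;
```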
Testing: Total Compile CPU Time (ms)
Custom setting: 40 ms / 1 hour Stale Threshold.
For this test, the same UPDATE command was executed once on both databases.
As expected, QSDB_ALL has logged the query’s statistics but, since the query didn’t reach the threshold set on QSDB_CUSTOM, it doesn’t appear in that database’s Query Store tables. So let’s create a new index and run the same query again.
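Any schema change that invalidates the cached plan will do; for illustration, a hypothetical index on the test table:

```sql
-- Creating an index on the referenced table forces the next execution
-- of the UPDATE to compile a new plan
CREATE NONCLUSTERED INDEX IX_TestTable_Amount
    ON dbo.TestTable (Amount);
```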
Both plans can be found on QSDB_ALL, but there is still nothing on QSDB_CUSTOM. This may be because the total compilation time of both plans is under 40 ms, so let’s try again with a new plan.
Finally, we have some signs of life on QSDB_CUSTOM. And since the compilation time of this particular plan hasn’t exceeded 40 ms on its own, it is safe to assume the compilation times of all the plans generated for this query totalled at least 40 ms. But, since we know 3 plans were generated, why is this plan listed as #1 on the database? Let’s rerun the query using one of the previous plans and see what happens.
The new plan is there, but it is listed as #2, corresponding to plan #3 on the QSDB_ALL database. Unfortunately, the details of previous executions have been lost, but the good news is that new plans (and old plans being reused) will be captured from this point on.
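To check which plans each Query Store has captured and how much compile time the query has accumulated, something like this can be run on both databases (durations are stored in microseconds):

```sql
-- Plans captured per query, with query-level compile statistics
SELECT  q.query_id,
        p.plan_id,
        q.count_compiles,
        q.avg_compile_duration,                                     -- microseconds
        q.count_compiles * q.avg_compile_duration AS total_compile_duration
FROM    sys.query_store_query AS q
JOIN    sys.query_store_plan  AS p ON p.query_id = q.query_id
ORDER BY q.query_id, p.plan_id;
```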
Testing: Total Execution CPU Time (ms)
Custom setting: 100 ms / 1 hour Stale Threshold.
For this test, the same UPDATE command was executed 100 times on both databases, and here are the results.
The total CPU time is recorded as 35496 microseconds. Compared to the data captured in the QSDB_ALL database, that’s roughly 100000 microseconds (the 100 milliseconds defined in the threshold) missing. And not only that: the first 73 executions of the query are missing too.
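A quick way to compare the totals on both databases is to aggregate the runtime statistics per query, for example:

```sql
-- Total executions and total CPU (microseconds) per query
SELECT  q.query_id,
        SUM(rs.count_executions)                   AS total_executions,
        SUM(rs.count_executions * rs.avg_cpu_time) AS total_cpu_us
FROM    sys.query_store_runtime_stats AS rs
JOIN    sys.query_store_plan          AS p ON p.plan_id  = rs.plan_id
JOIN    sys.query_store_query         AS q ON q.query_id = p.query_id
GROUP BY q.query_id;
```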
Back to the previous example: imagine the data processing has no data to process at all until XX:55, and then the CPU time of that single execution exceeds the threshold and gets logged. If you only see information on the CPU-intensive executions, the regular executions are not captured, and therefore the query would not be listed on the Regressed Queries report.
How does Auto compare with Custom when using the default values?
If the Auto-equivalent parameters were publicly documented, this would be an easy task, and you wouldn’t have to worry so much about which queries are being missed. Normally, I’d assume the default Custom settings are equivalent to the ones used by the Auto mode, but since the previous tests showed some data could be missed, I decided to run several tests to confirm it myself.
Execution count (default: 30)
I ran several simple queries to compare their results, and only after they were executed 30 times did Auto kick in and start recording their executions. The total CPU and Compile durations were lower than any threshold I could have set using the Custom mode, so let’s assume 30 executions is the measurement that emulates Auto’s behaviour.
Total Compile CPU Time (ms) (default: 1000)
By running a SELECT statement joining up to 70 copies of the same table and adding/removing tables from the INNER JOIN clause, I confirmed that only queries with a compilation time over 1000000 microseconds (1000 milliseconds) were captured. So this default parameter seems to be consistent with the one used by the Auto mode.
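As a rough illustration of the kind of statement used (dbo.TestTable is a hypothetical stand-in), each additional self-join inflates the optimizer’s work:

```sql
SELECT  COUNT(*)
FROM    dbo.TestTable AS t1
JOIN    dbo.TestTable AS t2 ON t2.Id = t1.Id
JOIN    dbo.TestTable AS t3 ON t3.Id = t2.Id
-- ...keep adding/removing copies (up to 70 in my tests) until the
-- compilation time crosses the threshold being measured...
OPTION (RECOMPILE);  -- force a fresh compilation on every run
```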
Total Execution CPU Time (ms) (default: 100)
By running a SELECT statement with a limited row count, I was able to confirm this threshold is set at 100000 microseconds. As with the compile time, converting this into milliseconds gets us the default value of 100 milliseconds.
Stale Threshold (default: 1 hour)
Using the Execution Count as a filter (set to 11), I executed the same query 10 times and, after advancing the server’s time one hour, ran the same query another 10 times. The query was not recorded, which shows the Stale Threshold resets its statistics after 1 hour when using the Auto capture mode.
We have shown all 4 default parameters used by the Custom capture mode are consistent with the Auto capture mode’s results, and we now understand how the Auto capture mode works. If you don’t have the opportunity to work with SQL 2019 instances, this should give you some insight into how Query Store decides what is worth capturing.
What capture mode should I use?
If you can work with SQL 2019, I’d suggest using the Custom capture mode with the default settings, and adjusting them once you are familiar with the load and can define a baseline for the queries that, in your particular environment, would get the most out of Query Store whilst reducing its impact on your instances.
If you are limited to SQL 2017, definitely use the Auto capture mode, while being aware of its limitations and which queries may fall through the cracks.
For lower environments I’d suggest using the All capture mode: those environments usually hold much less data than production ones, and query regressions would be harder to identify there due to the gaps in the Auto and Custom capture modes. If you do that, educate your database developers to use the Query Store feature so they can analyse the impact of their changes before it is too late and those changes have caused issues in production.
Notes on this post
These tests were all performed on a SQL 2019 RTM instance.