When SQL 2019 was announced, one of the changes that interested me most was the new capture mode for Query Store, called Custom. Until then, there were only three modes:
- None: only statistics for already-captured queries will be recorded. New queries executed after setting this capture mode will be ignored.
- All: all queries and their statistics will be captured.
- Auto: Query Store itself will decide what queries will be captured.
So the next question was “What does Auto capture?”. There is no documentation on this, so it is a mystery, and the only way to find out would be to execute the same load on two identical copies of the same database, one with the All capture mode and the other with the Auto capture mode, and compare the information Query Store has recorded on each.
So the Custom mode arrived as a way of giving us some control over how Query Store behaves, but let’s see how much we can actually control how queries are filtered, and how Query Store achieves this.
How customizable can Custom be?
Once the Query Store capture mode is set to Custom, four parameters become available:
Stale Threshold
There is a predefined set of intervals to select from; the chosen interval applies to all the settings below (Execution Count, Total Compile CPU Time, and Total Execution CPU Time).
Execution Count
Number of executions a query must reach within the allocated Stale Threshold.
Total Compile CPU Time (ms)
Total time spent compiling the query within the allocated Stale Threshold. This threshold applies to the sum of the compilation times of all the plans generated for the same query.
Total Execution CPU Time (ms)
Total CPU time spent executing the query within the allocated Stale Threshold. This threshold applies to the total CPU of the query, regardless of the execution plan used.
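Translated into T-SQL, configuring these four parameters looks something like the minimal sketch below, using the documented ALTER DATABASE … SET QUERY_STORE syntax; [YourDatabase] is a placeholder, and the values shown mirror the defaults we will discuss later in this post:

```sql
ALTER DATABASE [YourDatabase]
SET QUERY_STORE = ON
(
    QUERY_CAPTURE_MODE = CUSTOM,
    QUERY_CAPTURE_POLICY =
    (
        STALE_CAPTURE_POLICY_THRESHOLD = 1 HOURS, -- Stale Threshold
        EXECUTION_COUNT = 30,                     -- Execution Count
        TOTAL_COMPILE_CPU_TIME_MS = 1000,         -- Total Compile CPU Time (ms)
        TOTAL_EXECUTION_CPU_TIME_MS = 100         -- Total Execution CPU Time (ms)
    )
);
```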
Exceeding any of these thresholds within the allocated Stale Threshold will cause Query Store to capture the query’s statistics. But something is missing, isn’t it?
Out of the 13 available metrics, we can only use two of them for filtering (Execution Count and CPU Time), plus the Compile Time, which is not even shown anywhere in the regular reports. Not even the Duration metric, which all Query Store reports default to, can be used to configure the query filters in this Custom capture mode. It is understandable that some of those metrics were left out (I can’t think of any scenario where I’d be interested in capturing queries that returned more than 100 rows, for example), so let’s assume these metrics were selected as a compromise between speed (we don’t want Query Store to increase its impact on the server by performing complex calculations to filter the captured queries) and giving DBAs some control, so they don’t have to choose between capturing All and whatever Auto captures.
How does the Custom mode really capture queries?
Regular Query Store captures information in memory and, once the time interval closes, aggregates the data; during this final calculation it removes the duplicate entries it maintains for some of the queries’ statistics while the interval is still open and capturing live data. So common sense dictates Query Store will do something similar with this Custom capture mode: wait until the Stale Threshold time has passed and then clean up the queries that didn’t exceed any of the defined thresholds.
To confirm it, I’ve run several tests on two copies of the same database: one with the All capture mode as a control database (QSDB_ALL), and another one with different Custom capture settings (QSDB_CUSTOM). To avoid queries being captured due to a threshold other than the one being tested, the out-of-scope thresholds were set to very high values (millions of executions, and of CPU and Compile milliseconds).
Testing: Execution Count
Custom setting: 10 executions / 1 hour Stale Threshold.
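In T-SQL, the configuration for this first test would look roughly like this, with the out-of-scope thresholds raised to arbitrarily large values:

```sql
ALTER DATABASE [QSDB_CUSTOM]
SET QUERY_STORE
(
    QUERY_CAPTURE_MODE = CUSTOM,
    QUERY_CAPTURE_POLICY =
    (
        STALE_CAPTURE_POLICY_THRESHOLD = 1 HOURS,
        EXECUTION_COUNT = 10,                    -- the threshold under test
        TOTAL_COMPILE_CPU_TIME_MS = 10000000,    -- out of scope: set very high
        TOTAL_EXECUTION_CPU_TIME_MS = 10000000   -- out of scope: set very high
    )
);
```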
I executed a simple UPDATE command 10 times on each database right after clearing the Query Store cache; the results follow the sketch below.
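The run itself was along these lines; dbo.TestTable and the UPDATE are hypothetical stand-ins for the actual test objects:

```sql
-- Start both databases from a clean slate
ALTER DATABASE [QSDB_ALL]    SET QUERY_STORE CLEAR;
ALTER DATABASE [QSDB_CUSTOM] SET QUERY_STORE CLEAR;
GO
USE [QSDB_CUSTOM];  -- repeat the batch below on QSDB_ALL as well
GO
UPDATE dbo.TestTable SET Amount = Amount + 1 WHERE Id = 1;
GO 10  -- SSMS/sqlcmd shorthand: execute the previous batch 10 times
```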
QSDB_ALL captures all executions, but what happened with QSDB_CUSTOM? I executed the query exactly the same number of times on both databases, yet the only measurement that seems similar is the compile time. Let’s try executing the query 10 more times on each database and see what happens.
Now we have 11 vs 20 executions. If this were the only issue, we could just add the threshold value minus one to the first hour of statistics and we’d get the real execution count. But unless the Statistics Collection Interval is set to 1 hour, you are missing the execution details: was the query executed regularly every 5 minutes, or did all 10 executions suddenly take place in the last minutes of the interval?
Not only are we missing execution counts, we are also missing all the other statistics for those executions, such as CPU. What if you have a process running every 5 minutes to process a huge amount of data, and that data is deleted at XX:44? You’d be missing all information regarding the first 9 executions, and suddenly the 10th would be captured and look innocuous.
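To see how executions are spread across collection intervals, the per-interval statistics on each database can be compared with a query along these lines (the LIKE filter is a hypothetical stand-in for the actual test statement):

```sql
-- Per-interval execution counts and CPU for a given query text
SELECT  i.start_time,
        i.end_time,
        rs.count_executions,
        rs.avg_cpu_time  -- microseconds
FROM    sys.query_store_runtime_stats          AS rs
JOIN    sys.query_store_runtime_stats_interval AS i
            ON i.runtime_stats_interval_id = rs.runtime_stats_interval_id
JOIN    sys.query_store_plan       AS p  ON p.plan_id        = rs.plan_id
JOIN    sys.query_store_query      AS q  ON q.query_id       = p.query_id
JOIN    sys.query_store_query_text AS qt ON qt.query_text_id = q.query_text_id
WHERE   qt.query_sql_text LIKE N'%UPDATE dbo.TestTable%'
ORDER BY i.start_time;
```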
Testing: Total Compile CPU Time (ms)
Custom setting: 40 ms / 1 hour Stale Threshold.
For this test, the same UPDATE command was executed once on both databases.
As expected, QSDB_ALL has logged the query’s statistics but, since the query didn’t reach the threshold set on QSDB_CUSTOM, it doesn’t appear in that database’s Query Store tables. So let’s create a new index and run the same query again.
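Any schema change that invalidates the cached plan will do; for illustration, a hypothetical index on the test table:

```sql
-- Creating an index on the referenced table forces the next execution
-- of the UPDATE to compile a new plan
CREATE NONCLUSTERED INDEX IX_TestTable_Amount
    ON dbo.TestTable (Amount);
```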
Both plans can be found on QSDB_ALL, but there is still nothing on QSDB_CUSTOM. This may be because the total compilation time of both plans is under 40 ms, so let’s try again with a new plan.
Finally, we have some signs of life on QSDB_CUSTOM. And since the compilation time of this particular plan hasn’t exceeded 40 ms on its own, it is safe to assume the compilation times of all the plans generated for this query totalled at least 40 ms. But, since we know 3 plans were generated, why is this plan listed as #1 on the database? Let’s rerun the query using one of the previous plans and see what happens.
The new plan is there, but it is listed as #2, corresponding to plan #3 on the QSDB_ALL database. Unfortunately, the details of previous executions have been lost, but the good news is that new plans (and old plans being reused) will be captured from this point on.
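To check which plans each Query Store has captured and how much compile time the query has accumulated, something like this can be run on both databases (durations are stored in microseconds):

```sql
-- Plans captured per query, with query-level compile statistics
SELECT  q.query_id,
        p.plan_id,
        q.count_compiles,
        q.avg_compile_duration,                                     -- microseconds
        q.count_compiles * q.avg_compile_duration AS total_compile_duration
FROM    sys.query_store_query AS q
JOIN    sys.query_store_plan  AS p ON p.query_id = q.query_id
ORDER BY q.query_id, p.plan_id;
```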
Testing: Total Execution CPU Time (ms)
Custom setting: 100 ms / 1 hour Stale Threshold.
For this test, the same UPDATE command was executed 100 times on both databases, and here are the results.
The total CPU time is recorded as 35496 microseconds. Compared to the data captured in the QSDB_ALL database, that’s roughly 100000 microseconds (the 100 milliseconds defined in the threshold) missing. And not only that: the first 73 executions of the query are missing too.
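A quick way to compare the totals on both databases is to aggregate the runtime statistics per query, for example:

```sql
-- Total executions and total CPU (microseconds) per query
SELECT  q.query_id,
        SUM(rs.count_executions)                   AS total_executions,
        SUM(rs.count_executions * rs.avg_cpu_time) AS total_cpu_us
FROM    sys.query_store_runtime_stats AS rs
JOIN    sys.query_store_plan          AS p ON p.plan_id  = rs.plan_id
JOIN    sys.query_store_query         AS q ON q.query_id = p.query_id
GROUP BY q.query_id;
```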
Back to the previous example: imagine the data processing has no data to process at all until XX:55, and then the CPU time of that single execution exceeds the threshold and gets logged. If you only see information on the CPU-intensive executions, the regular executions are not captured, and therefore the query would not be listed on the Regressed Queries report.
How does Auto compare with Custom when using the default values?
If the Auto-equivalent parameters were publicly documented, this would be an easy task, and you wouldn’t have to worry so much about which queries are being missed. Normally, I’d assume the default Custom settings are equivalent to the ones used by the Auto mode, but since the previous tests showed some data could be missed, I decided to run several tests to confirm it myself.
Execution count (default: 30)
I ran several simple queries to compare their results, and only after they were executed 30 times did Auto kick in and start recording their executions. The total CPU and Compile durations were lower than any threshold I could have set using the Custom mode, so let’s assume 30 executions is the measurement that emulates Auto’s behaviour.
Total Compile CPU Time (ms) (default: 1000)
By running a SELECT statement joining up to 70 copies of the same table and adding/removing tables from the INNER JOIN clause, I confirmed that only queries with a compilation time over 1000000 microseconds (1000 milliseconds) were captured. So this default parameter seems to be consistent with the one used by the Auto mode.
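As a rough illustration of the kind of statement used (dbo.TestTable is a hypothetical stand-in), each additional self-join inflates the optimizer’s work:

```sql
SELECT  COUNT(*)
FROM    dbo.TestTable AS t1
JOIN    dbo.TestTable AS t2 ON t2.Id = t1.Id
JOIN    dbo.TestTable AS t3 ON t3.Id = t2.Id
-- ...keep adding/removing copies (up to 70 in my tests) until the
-- compilation time crosses the threshold being measured...
OPTION (RECOMPILE);  -- force a fresh compilation on every run
```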
Total Execution CPU Time (ms) (default: 100)
By running a SELECT statement with a limited row count, I was able to confirm this threshold is set at 100000 microseconds. As with the compile time, converting this into milliseconds gets us the default value of 100 milliseconds.
Stale Threshold (default: 1 hour)
Using the Execution Count as a filter (set to 11), I executed the same query 10 times and, after advancing the server’s time one hour, ran the same query another 10 times. The query was not recorded, which shows the Stale Threshold resets its statistics after 1 hour when using the Auto capture mode.
We have shown all 4 default parameters used by the Custom capture mode are consistent with the Auto capture mode’s results, and we now understand how the Auto capture mode works. If you don’t have the opportunity to work with SQL 2019 instances, this should give you some insight into how Query Store decides what is worth capturing.
What capture mode should I use?
If you can work with SQL 2019, I’d suggest using the Custom capture mode with the default settings, and adjusting them once you are familiar with the load and can define a baseline for the queries that, in your particular environment, would get the most out of Query Store whilst reducing its impact on your instances.
If you are limited to SQL 2017, definitely use the Auto capture mode, while being aware of its limitations and which queries may fall through the cracks.
For lower environments I’d suggest using the All capture mode: those environments usually hold much less data than production ones, and query regressions would be harder to identify there due to the gaps in the Auto and Custom capture modes. If you do that, educate your database developers to use the Query Store feature so they can analyse the impact of their changes before it is too late and those changes have caused issues in production.
Notes on this post
These tests were all performed on a SQL 2019 RTM instance.