Analysing an execution plan is fairly easy, even if you are using SSMS with no additional add-ons.
You can compare two plans too, either with SSMS itself or with some third-party tools that do a good job highlighting potential problems.
The problem comes when you are used to Query Store and these “rainbow queries” have hundreds of execution plans.
Since all tools rely on a DBA to look into the plan and identify the actual issues, you’d have to review one at a time. What if you could do it automatically?
Query Store already has the plans, and they contain a ton of information:
The statements executed (only one for Query Store’s plans, but plans in the SQL cache or stored in exported plan files could contain several).
Indexes that SQL server considers may improve the performance of the query (also known as “missing indexes”).
Filtered indexes that exist on the tables but were not used due to the values passed to the query in the WHERE clauses.
The details on each node of the execution plan (the little boxes that contain the details).
Columns accessed by the statement (for either read or write operations).
Statistics used by the SQL Engine to generate the execution plan.
So I thought… why not get all this information from the plan?
There is no way I was gonna be able to navigate all that nightmare, but I remembered something I had written almost two years ago to take an XML file and turn it into smaller chunks (one per line) to make it more manageable and to be able to extract certain fields. Back then I was only listing the tables and columns involved, so I decided to update it and extract more information.
The process goes as follows:
The execution plan’s XML is extracted into a temp table (one line at a time, treated as if it were plain text), turning every line into valid XML on its own (replacing closing tags such as “</label>” with “<label/>”, and adding “/>” so each line becomes an XML fragment by itself).
Certain lines are modified (such as turning multi-line close tags into a separate custom one).
Based on the contents of each line, store the information into the proper table (indexes, statistics, columns) whilst keeping track of the node this information is associated to.
QDSToolbox’s PlanMiner does all this, and combined with the plans available in Query Store, this can be used to incrementally load the new plans’ information into those tables for later consumption.
The next step would be defining rules to decide which execution plans have issues, what kind of situations should be flagged, or what reports to generate (a sketch of one such rule follows below):
Missing indexes with an impact > 80%.
Queries that would be impacted if a certain table or column is modified or dropped.
Queries that are strictly read-only and could be offloaded to a read-only replica in your Availability Groups.
Queries using forced indexes that would need rewriting if the indexes are removed.
New plans generated in the last 24 hours with potential issues.
These rules are not in place at the moment, just the extraction of the data. I am no expert in performance, but anyone who wants to try this tool and define some rules to apply to the execution plans is welcome to do so: I’ll try my best to integrate them with the tool and allow enough flexibility in the rules so anyone can define their own.
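As an illustration, here is a rough sketch of what the first rule above could look like if applied directly to the plan XML stored in Query Store (rather than to PlanMiner’s tables); the 80% threshold and the XQuery path are purely illustrative:
;WITH XMLNAMESPACES (DEFAULT 'http://schemas.microsoft.com/sqlserver/2004/07/showplan')
SELECT
qsp.query_id,
qsp.plan_id,
mig.node.value('@Impact', 'float') AS [MissingIndexImpact]
FROM sys.query_store_plan qsp
-- Plans are stored as nvarchar(max); TRY_CAST returns NULL for plans that cannot be converted to XML
CROSS APPLY (SELECT TRY_CAST(qsp.query_plan AS XML) AS plan_xml) x
CROSS APPLY x.plan_xml.nodes('//MissingIndexes/MissingIndexGroup') mig(node)
WHERE mig.node.value('@Impact', 'float') > 80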
Whenever I see a query regression in Query Store due to SQL going nuts and using an execution plan which is obviously not the best available, I wonder why it is doing that. But since I don’t have access to its source code (and even if I had, I probably wouldn’t fully understand it), the only thing I could do was either force the better plan and hope SQL would use it (which it doesn’t always do: there are certain limitations to that), or update the statistics on all the tables involved in the query and hope for the best.
This was a very manual operation:
Open the query text.
Identify all the tables involved in the query, which may require checking the definition of any views it accesses, and that’s assuming the query text is not truncated with some of them lost because of it.
Find all the statistics on the identified tables that have a small sample rate, or a high percentage of rows updated since the statistics were last computed, rendering them obsolete.
Write a script to update them using a sample rate appropriate for the row count of each table (sketched below).
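A rough sketch of what those last two steps could look like for a single table; the table name, the 25 percent sample rate, and the 20%/10% thresholds are purely illustrative:
DECLARE @TableName NVARCHAR(262) = N'[dbo].[MyTable]'
SELECT
N'UPDATE STATISTICS ' + @TableName + N' ' + QUOTENAME(st.name) + N' WITH SAMPLE 25 PERCENT;' AS [UpdateCommand],
sp.last_updated,
sp.rows,
sp.rows_sampled,
sp.modification_counter
FROM sys.stats st
CROSS APPLY sys.dm_db_stats_properties(st.object_id, st.stats_id) sp
WHERE st.object_id = OBJECT_ID(@TableName)
AND sp.rows > 0
AND
(
sp.modification_counter > 0.20 * sp.rows -- more than 20% of the rows changed since the last update
OR sp.rows_sampled < 0.10 * sp.rows -- the last update used a low sample rate
)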
But since the execution plans detail all the statistics used by the SQL engine to generate the plan, and they are kept in Query Store, I decided to extract those details from them.
The new tool added to the QDSToolBox, StatisticsUsed, does just that: given a query ID (or a list of them), or even a whole object, it generates a table with the details of the statistics used. Based on factors like when the statistics were last updated, the row count of the table, or the percentage of rows changed since then, it generates the appropriate UPDATE STATISTICS command. This is what the final result looks like.
Query Store contains a lot of information on the activity on your databases, and that can take some space (for more details, check my previous post). We can limit the amount of data stored by adjusting its basic settings, and what data is stored by adjusting the capture mode. Once the data is stored, there are two different processes to clean up the stored data:
Date-based cleanup: uses the retention settings configured to decide what information will be deleted on a regular basis.
Size-based cleanup: if the current usage of Query Store space exceeds 90% of its max size, an internal SQL process analyzes the information stored for all queries and starts deleting them to clear disk space. The extended event query_store_size_retention_cleanup_finished captures details on its outcome. There are no details on how this calculation is performed, but it seems to be strictly CPU-based.
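If you want to keep track of when that event fires, a minimal extended event session could look like the one below (the session and file names are illustrative; the event belongs to the qds package):
CREATE EVENT SESSION [QDS_SizeCleanup] ON SERVER
ADD EVENT qds.query_store_size_retention_cleanup_finished
ADD TARGET package0.event_file (SET filename = N'QDS_SizeCleanup');
ALTER EVENT SESSION [QDS_SizeCleanup] ON SERVER STATE = START;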
This automatic size-based cleanup can have a huge impact on your servers, as explained in this post from my colleague Mark Wilkinson titled “The perfect storm“. The summary: size-based cleanup caused a 70% CPU peak on an 8-core server and took more than one hour to complete, impacting all applications connecting to the server. So you can either increase the maximum size of your Query Store so the date-based cleanup will kick in before the size-based one does, or…
Manually cleaning your Query Store
The only option available to clean your Query Store seems to be clearing it completely, but the documentation presents another solution for finer-grained cleanup, using these stored procedures (examples follow the list):
sp_query_store_remove_plan: removes a plan and related entries (runtime stats and wait stats are plan-specific and will be removed as well)
sp_query_store_remove_query: removes a query from Query Store (along with its plans and their runtime & wait stats).
sp_query_store_reset_exec_stats: deletes the statistics (runtime & waits) for a specific plan.
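For reference, all three take the query or plan ID as a parameter; the IDs below are just placeholders:
EXECUTE sys.sp_query_store_remove_plan @plan_id = 42;
EXECUTE sys.sp_query_store_remove_query @query_id = 73;
EXECUTE sys.sp_query_store_reset_exec_stats @plan_id = 42;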
However, that puts the burden on the administrator, who has to decide which plans & queries can be safely removed and will have a definitive impact on the Query Store space usage. It would not make much sense to delete details of a query that will be executed right afterwards, since the plan & query details would have to be stored once again, and those are the components that take the most space.
With these 3 stored procedures there are a number of custom cleanup processes that can be implemented:
Delete queries associated with a particular object, such as a stored procedure used by administrators or auditors with no significant impact on the database.
Delete all plans associated with a query that has a forced plan in place, since they will no longer be used by the query.
The Query Store custom cleanup script
For the issue explained in Mark’s post, I wrote an SP to clean certain queries from Query Store (a sketch of how these can be identified follows the list):
Any query flagged as internal, since these only cover queries such as index & statistics maintenance.
Any query referencing an object no longer present in the database (“orphan” queries).
Ad hoc queries that have not been executed recently (fewer than X times in the past Y hours).
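This is not the actual SP’s code, just a rough illustration of how the first two categories can be identified:
SELECT
q.query_id,
q.object_id,
q.is_internal_query
FROM sys.query_store_query q
WHERE q.is_internal_query = 1 -- internal queries (index & statistics maintenance)
OR (q.object_id <> 0 AND OBJECT_NAME(q.object_id) IS NULL) -- "orphan" queries: the referenced object no longer exists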
When used against the same SQL instance that had suffered the 70% CPU peak due to the automated cleanup, it ran on 10 different databases simultaneously with no noticeable impact, and reduced the space used by our databases’ Query Store from 9-10 GB each to 6-8 GB. With the automated size-based cleanup kicking in once utilization exceeded 90%, this SP was configured to run on databases that had exceeded 80% of space usage, and we haven’t had any issues since then.
There are some other functionalities on my TO DO list, such as thresholds to clean up queries (based on the ones used by the Custom capture mode, but allowing for more parameters to configure it).
Introducing QDS Toolbox: QDSCacheCleanup
An improved version of the original procedure has been published in ChannelAdvisor’s public GitHub repository as part of the QDS Toolbox, which I’ll be exploring in upcoming posts. This particular component of the QDS Toolbox (QDSCacheCleanup) is designed to be deployed either on each of your databases or in a single database (like the one where you have deployed your maintenance procedures, such as Ola Hallengren’s scripts) to perform the cleanup operation and log (if desired) the operations performed. You can install it and run it in test mode to analyze which queries would be deleted depending on the parameters you set, before deploying it to production.
The reports generated are available in multiple formats:
As tables
In text format (for summaries only)
**********************************
* Stale queries found *
**********************************
# of Queries : 5
# of Plans : 5
KBs of query texts : 0
KBs of execution plans : 390
KBs of runtime stats : 3
KBs of wait stats : 0
**********************************
* Orphan queries found *
**********************************
# of Queries : 59
# of Plans : 60
KBs of query texts : 65
KBs of execution plans : 3435
KBs of runtime stats : 51
KBs of wait stats : 4
Stored in SQL tables
The parameters are stored in an XML format to ease the process of adding/removing parameters in future updates of this script
The captures were extracted from a SQL 2017 instance. The original cleanup SP has been deployed on hundreds of SQL 2017 instances and several SQL 2019 instances, and the version mentioned in this post and published on GitHub has been developed and tested on both SQL 2017 and SQL 2019 instances.
Notes and captures of the QDS Toolbox components may vary from the current released version, so please refer to the documentation available on GitHub and the comments in the procedures’ code.
Since Query Store objects are stored in the PRIMARY filegroup and that cannot be changed, it is important to understand how big it could get so your filegroup won’t get filled by it. This also means that piecemeal restores will include your Query Store data in the partial-restore sequence. The SSMS UI only provides a very basic summary of Query Store’s data usage: just a number and a pie chart that doesn’t give you any details at all.
So first of all, let’s understand how much space a single query takes.
How much space does a query take?
First, let’s differentiate between the query data itself and the metrics. The query’s data is stored in the following three tables: sys.query_store_query, sys.query_store_plan, and sys.query_store_query_text.
sys.query_store_query holds one entry per query, with its compilation statistics and context settings, taking 409 bytes per entry.
sys.query_store_plan contains some metrics for the plans, along with the plan itself in XML format. That accounts for a total of 549 bytes plus the size of the XML plan.
sys.query_store_query_text contains the actual text of the queries. In addition to the size of the query text, stored as an nvarchar, each entry stores 54 extra bytes of information.
Doing some math, one simple query that uses a single plan takes:
549 bytes for the plan’s additional data
409 bytes for the query’s additional data
54 bytes for the query text’s metadata
A total of 1012 bytes, and that’s not even counting the size of the query and its plan. For a very simple query such as the one below, that accounts for 90 bytes of text and 4818 bytes of an XML plan.
So that query alone took 5920 bytes, a bit short of 6 KB. But that only covers the query itself and its plan; none of the information represented in the graphical report SSMS generates, which is stored in two separate tables, is included yet.
But all this still only gives us “static” figures: those numbers are not going to change anytime soon unless the object changes (new queries and plans substitute the existing ones), or there is a change in the execution plan due to changes in indexes, statistics… (adding more plans to Query Store).
Execution metrics
The details of each query (organized based on the plan it uses) are stored in two tables:
sys.query_store_runtime_stats contains all the execution metrics of each plan, with separate entries for successful, failed (returning an error) and cancelled executions. Each entry requires 653 bytes, so if the same plan is executed 3 times in the same interval (one successful execution, one failed execution, and one cancelled execution), that would mean up to 3*653 = 1959 bytes per interval.
sys.query_store_wait_stats contains the wait stats details of each plan, one entry per wait type (if there are any waits at all for the plan). These waits are not a 1:1 representation of all the wait stats a query can experience, but an aggregation into the categories described in the official documentation. Still, a single query could experience several of them, and each entry takes 315 bytes.
Still, there are too many variables and unknowns to estimate how much space a single query or object can take in Query Store, but it is possible to get the size it currently takes to help estimate how much it will take in the future.
Total space used by one object
The following query calculates the size of the objects and statistics stored in Query Store associated with a given object (function or stored procedure):
DECLARE @ObjectName NVARCHAR(262) = '[schema].[procedure]'
DECLARE @query TABLE
(
query_id BIGINT,
query_text_id BIGINT
)
DECLARE @QueryCount INT
DECLARE @QueryTextSize INT
DECLARE @PlanCount INT
DECLARE @PlanSize INT
DECLARE @RuntimesCount INT
DECLARE @WaitsCount INT
INSERT INTO @query
SELECT query_id, query_text_id
FROM sys.query_store_query
WHERE object_id = OBJECT_ID(@ObjectName)
SET @QueryCount = @@ROWCOUNT
SELECT @PlanCount = COUNT(1), @PlanSize = SUM(DATALENGTH(query_plan)) FROM sys.query_store_plan qsp
INNER JOIN @query q
ON qsp.query_id = q.query_id
SELECT @QueryTextSize = SUM(DATALENGTH(query_sql_text))
FROM sys.query_store_query_text qsqt
INNER JOIN @query q
ON qsqt.query_text_id = q.query_text_id
SELECT @RuntimesCount = COUNT(1) FROM sys.query_store_runtime_stats qsrs
INNER JOIN sys.query_store_plan qsp
ON qsrs.plan_id = qsp.plan_id
INNER JOIN @query q
ON qsp.query_id = q.query_id
SELECT @WaitsCount = COUNT(1) FROM sys.query_store_wait_stats qsws
INNER JOIN sys.query_store_plan qsp
ON qsws.plan_id = qsp.plan_id
INNER JOIN @query q
ON qsp.query_id = q.query_id
SELECT
@QueryCount AS [QueryCount],
@QueryTextSize AS [QueryTextSize],
@PlanCount AS [PlanCount],
@PlanSize AS [PlanXMLSize],
(@QueryCount * (409+54)) + @QueryTextSize AS [QueryTotalSpace],
(@PlanCount * (549)) + @PlanSize AS [PlanTotalSpace],
(@RuntimesCount * (653)) AS [RuntimeMetricSize],
(@WaitsCount * (315)) AS [WaitsMetricSize],
(((@QueryCount * (409+54)) + @QueryTextSize) + ((@PlanCount * (549)) + @PlanSize)) + (@RuntimesCount * (653)) + (@WaitsCount * (315)) AS [TotalSpaceBytes],
(((@QueryCount * (409+54)) + @QueryTextSize) + ((@PlanCount * (549)) + @PlanSize) + (@RuntimesCount * (653)) + (@WaitsCount * (315))) / 1024.0 AS [TotalSpaceKiloBytes]
Other tables
There are a few more tables that, due to their small size, I haven’t included in the previous calculations:
sys.database_query_store_options: contains not only the parameters of the Query Store configuration, but also their current state, so you can find out if there is any deviation from your settings (such as Query Store turning into READ_ONLY because it ran out of space). It takes approximately 662 bytes (the size varies depending on the length of some of its nvarchar columns). A quick check for such deviations is sketched after this list.
sys.database_automatic_tuning_options: despite its name not reflecting its relation to Query Store, this table contains the configuration settings for automatic tuning through Query Store and their current state, similar to sys.database_query_store_options. It takes 500 bytes.
sys.dm_db_tuning_recommendations: if automatic tuning is enabled, every query that has a tuning recommendation will have an entry in this table, consisting of the recommendation itself stored in JSON format along with some more information, such as why the recommendation was produced, its current state (applied, reverted…) and the time of each event associated with the recommendation. Since it contains 2 JSON documents with varying sizes, it is not possible to estimate the size of each entry or the number of entries it may contain. (It would be possible, given the values the nvarchar columns contain, but since that may change in future releases it would not be an accurate representation.)
sys.query_context_settings: contains information about the semantics affecting the queries’ execution. Each entry takes 39 bytes and there are very few entries (under 50 in close to 1000 production databases analyzed): with its total size under 2 KB this is negligible.
sys.query_store_runtime_stats_interval: contains the details of the intervals themselves, such as the start/end time. Each interval requires one entry of 28 bytes.
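As for the deviation check mentioned above, a quick query against sys.database_query_store_options does the job:
SELECT
desired_state_desc,
actual_state_desc,
readonly_reason
FROM sys.database_query_store_options
WHERE desired_state_desc <> actual_state_desc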
So we have the details of space utilization for a given object, to understand the impact of each query/object in terms of space and why Query Store needs that much space. But what happens when you have thousands of objects, or all your SQL code resides on the application side and your database only receives DDL and DML commands or ad hoc queries? How can you find out what is taking so much space in your Query Store, and add more detail to the initial pie chart?
Query Store space usage detailed
Although I have been referring to them as “tables”, the objects listed above are system views that rely on a combination of actual system tables that cannot be queried directly, and in-memory tables the SQL engine uses to improve Query Store performance during data capture operations.
As seen below, the object sys.query_store_runtime_stats has some of its data in a persisted internal table (in red), while some data is contained in the in-memory table (in blue).
However, when looking for particular entries (in this example, the first entry), the information we are trying to retrieve has already been persisted to disk and no access to in-memory tables is required.
The query below analyzes space usage from those internal tables’ indexes to get a better understanding of the pie chart:
SET NOCOUNT ON
SELECT
s.name AS [SchemaName],
t.name AS [TableName],
i.name AS [IndexName],
p.rows AS RowCounts,
(SUM(a.total_pages) * 8)/1024.0 AS TotalSpaceMB,
(SUM(a.used_pages) * 8)/1024.0 AS UsedSpaceMB
FROM
sys.internal_tables t WITH (NOLOCK)
INNER JOIN
sys.indexes i WITH (NOLOCK)
ON t.OBJECT_ID = i.object_id
INNER JOIN
sys.partitions p WITH (NOLOCK)
ON i.object_id = p.OBJECT_ID AND i.index_id = p.index_id
INNER JOIN
sys.allocation_units a WITH (NOLOCK)
ON p.partition_id = a.container_id
LEFT OUTER JOIN
sys.schemas s WITH (NOLOCK)
ON t.schema_id = s.schema_id
INNER JOIN sys.filegroups f WITH (NOLOCK)
ON f.data_space_id = i.data_space_id
WHERE
t.NAME LIKE 'plan_persist%'
AND t.is_ms_shipped = 0
AND i.OBJECT_ID > 255
GROUP BY
s.name, t.name, i.name, t.create_date, p.Rows, f.name
ORDER BY
s.name, t.name, i.name
Playing a bit with the numbers from a real production database (aggregating data per table rather than per index), we get a more detailed pie chart.
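A trimmed-down, per-table version of the previous query (just a sketch) is enough to feed that summary:
SELECT
t.name AS [TableName],
(SUM(a.total_pages) * 8)/1024.0 AS TotalSpaceMB,
(SUM(a.used_pages) * 8)/1024.0 AS UsedSpaceMB
FROM sys.internal_tables t
INNER JOIN sys.indexes i
ON t.object_id = i.object_id
INNER JOIN sys.partitions p
ON i.object_id = p.object_id AND i.index_id = p.index_id
INNER JOIN sys.allocation_units a
ON p.partition_id = a.container_id
WHERE t.name LIKE 'plan_persist%'
GROUP BY t.name
ORDER BY UsedSpaceMB DESC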
Notes on this post
This post was written and tested on a SQL 2017 CU15 instance.
When SQL 2019 was announced, one of the changes that interested me the most was the new capture mode for Query Store, called Custom. Until then, the only 3 modes were:
None: only statistics for the already captured queries will be captured. New queries executed after setting this capture mode will be ignored.
All: all queries and their statistics will be captured.
Auto: Query Store itself will decide what queries will be captured.
So the next question was “What does Auto capture?”. There is no documentation regarding it, so it is a mystery, and the only way to find out would be executing the same load on two identical copies of the same database, one with the All capture mode and another with the Auto capture mode, and comparing the information Query Store has recorded on each.
So the Custom mode arrived as a way of giving us some control over how Query Store behaves, but let’s see how much we can actually control how queries are filtered, and how Query Store achieves this.
How customizable can Custom be?
Once the Query Store Capture Mode is set to Custom, 4 parameters are enabled:
Stale Threshold
There is a predefined set of intervals to select from that applies to all the settings below (Execution Count, Total Compile CPU Time, and Total Execution CPU).
Execution Count
Number of executions a certain query must reach in the allocated Stale Threshold.
Total Compile CPU Time (ms)
Total time spent compiling the query in the allocated Stale Threshold. This threshold applies to the sum of compilation times of all the plans generated for the same query.
Total Execution CPU Time (ms)
Total CPU time spent executing the query in the allocated Stale Threshold. This threshold applies to the total CPU of the query, regardless of the execution plan used.
Exceeding any of these thresholds in the allocated Stale Threshold will cause the Query Store to capture the query’s statistics. But something is missing, isn’t it?
Out of the 13 available metrics, we can only use two of them for the filtering (Execution Count and CPU Time), plus the Compile Time, which is not even shown anywhere in the regular reports. Not even the Duration metric, which all Query Store reports default to, can be used to configure the query filters on this Custom capture mode. It is understandable that some of those metrics are left out (I can’t think of any scenario where I’d be interested in capturing queries that returned more than 100 rows, for example), so let’s assume these metrics were selected as a compromise between speed (we don’t want Query Store to increase its impact on the server by performing complex calculations to filter the captured queries) and giving DBAs some control, so they don’t have to choose between capturing All and whatever Auto captures.
How does the Custom mode really capture queries?
Regular Query Store captures information in memory, and once the time interval is closed it aggregates the data; during this final calculation it removes the duplicate entries in some of the queries’ statistics that it maintains while the time interval is still open and capturing live data. So common sense dictates Query Store will do something similar with this Custom capture mode, waiting until the Stale Threshold time has passed and cleaning up the queries that didn’t exceed any of the defined thresholds.
To confirm it, I’ve run several tests on two copies of the same database, one with the All capture mode as a control database (QSDB_ALL), and another one with different Custom capture settings (QSDB_CUSTOM). To avoid queries being captured due to a threshold different from the one being tested, the out-of-scope thresholds were set to very high values (millions of executions, CPU and compile times).
Testing: Execution Count.
Custom setting: 10 executions / 1 hour Stale Threshold.
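In T-SQL, that configuration corresponds to something like the statement below (the out-of-scope thresholds are deliberately set to very high values, as explained above):
ALTER DATABASE [QSDB_CUSTOM]
SET QUERY_STORE = ON
(
QUERY_CAPTURE_MODE = CUSTOM,
QUERY_CAPTURE_POLICY =
(
STALE_CAPTURE_POLICY_THRESHOLD = 1 HOURS, -- 1 hour Stale Threshold
EXECUTION_COUNT = 10,
TOTAL_COMPILE_CPU_TIME_MS = 1000000, -- out of scope for this test
TOTAL_EXECUTION_CPU_TIME_MS = 1000000 -- out of scope for this test
)
)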
I executed a simple UPDATE command 10 times on each database right after clearing the Query Store cache, and these are the results.
QSDB_ALL captured all executions, but what happened with QSDB_CUSTOM? I executed the query exactly the same number of times on both databases, yet the only measurement that seems to be similar is the compile time. Let’s try executing the query 10 more times on each database and see what happens.
Now we have 11 vs 20 executions. If this were the only issue, we could just add the threshold value minus 1 to the first hour of statistics and we’d get the correct execution count. But unless the Statistics Collection Interval is set to 1 hour, you are missing execution details: was the query executed regularly every 5 minutes, or did all 10 executions suddenly take place in the last minutes of the interval?
Not only are we missing execution counts, we are also missing all other statistics on those queries, like CPU. What if you have a process running every 5 minutes to process a huge amount of data, and that data is deleted at XX:44? You’d be missing all information regarding the first 9 executions, and suddenly the 10th would be captured and look innocuous.
Testing: Total Compile CPU Time (ms)
Custom setting: 40 ms/ 1 hour Stale Threshold.
For this test, the same UPDATE command was executed once on both databases.
As expected, QSDB_ALL has logged the query’s statistics, but since it didn’t reach the threshold set on QSDB_CUSTOM, it won’t appear in its Query Store tables. So let’s create a new index and run the same query again.
Both queries can be found on QSDB_ALL, but there is still nothing on QSDB_CUSTOM. This may be due to the total compilation time of both queries taking less than 40 ms, so let’s try again with a new plan.
Finally, we have some signs of life on QSDB_CUSTOM. And since the compilation time of this particular plan hasn’t exceeded 40 ms, it is safe to assume the compilations of all the plans generated for this query totalled at least 40 ms. But, since we know 3 plans were generated, why is this plan #1 on the database? Let’s rerun the query using one of the previous plans and see what happens.
The new plan is there, but it shows as #2, corresponding to plan #3 on the QSDB_ALL database. Unfortunately, the details of previous executions have been lost, but the good news is that new plans (and old plans being reused) will be captured.
Testing: Total Execution CPU Time (ms)
Custom setting: 100 ms / 1 hour Stale Threshold.
For this test, the same UPDATE command was executed 100 times on both databases, and here are the results.
The total CPU time is recorded as 35496 microseconds. Compared to the data captured in the QSDB_ALL database, that’s roughly 100000 microseconds (the 100 milliseconds defined in the threshold) missing. But not only that, the first 73 executions of the query are missing too.
Back to the previous example: imagine the data processing has no data to process at all until XX:55, and then the CPU time of that single execution exceeds the threshold and gets logged. If you only see information about the CPU-intensive executions, the regular executions are not captured and therefore the query would not be listed on the Regressed Queries report.
How does Auto compare with Custom when using the default values?
If the Auto-equivalent parameters were publicly available, this would be an easy task, and you wouldn’t worry that much about which queries are being missed. Normally, I’d assume the default settings are equivalent to the ones used by the Auto mode, but since the previous tests showed some data could be missed, I decided to run several tests to confirm it myself.
Execution count (default: 30)
I ran several simple queries to compare their results, and only after they were executed 30 times did Auto kick in and start recording their executions. The total CPU and compile duration were lower than any threshold I could have set using the Custom mode, so let’s assume 30 executions can serve as a measurement to emulate Auto’s behaviour.
Total Compile CPU Time (ms) (default: 1000)
By running a SELECT statement joining up to 70 copies of the same table and adding/removing tables from the INNER JOIN clause, I confirmed that only queries with a compilation time over 1000000 microseconds (1000 milliseconds, i.e. one second) were captured. So this default parameter seems to be consistent with the one used by the Auto mode.
Total Execution CPU Time (ms) (default: 100)
By running a SELECT statement with a limited row count, I was able to confirm this threshold is set at 100000 microseconds. As with the compile time, this must be converted into milliseconds, which gives us the default value of 100 milliseconds.
Stale Threshold (default: 1 hour)
Using the Execution Count as a filter (set to 11), I executed the same query 10 times, and after advancing the server’s time one hour, ran the same query another 10 times. The query was not recorded, which proves the Stale Threshold resets its statistics after 1 hour, matching the Auto capture mode.
We have proved all 4 default parameters used by the Custom capture mode are consistent with the Auto capture mode results, and we now understand how the Auto capture mode works. If you don’t have the opportunity to work with SQL 2019 instances, this will give you some insight into how Query Store decides what is worth capturing.
What capture mode should I use?
If you can work with SQL 2019, I’d suggest using the Custom capture mode with the default settings, and adjusting them once you are familiar with the load and can define baselines for the queries that, for your particular environment, will get the most out of Query Store whilst reducing its impact on your instances.
If you are limited to SQL 2017, definitely use the Auto capture mode, being aware of its limitations and which queries may fall through the cracks.
For lower environments I’d suggest using the All capture mode: those environments usually hold much less data than the production ones, and query regressions would be harder to identify there due to the gaps in the Auto and Custom capture modes. If you do that, educate your database developers to use the Query Store feature so they can analyse the impact of their actions before it is too late and their changes have caused issues in production.
Notes on this post
These tests were all performed on a SQL 2019 RTM instance.
Enabling Query Store is easy, but configuring it properly can be tricky. In this post we’ll analyse all the different settings and how to properly adapt them to get the most out of Query Store for your particular situation.
General
The general settings are portrayed differently in the GUI than they are when changing settings using T-SQL. This post will focus on the settings accessible using SSMS.
Operation Mode
OFF: Query Store is disabled. It won’t capture any information on running queries, and no changes on its settings or data are allowed.
READ_ONLY: Query Store is enabled for read-only operations, and settings changes.
READ_WRITE: Query Store is enabled for read & write operations and settings changes. It will capture queries’ information based on its current configuration.
Monitoring
This group of settings configures how data is aggregated and flushed to disk.
Data Flush Interval (minutes)
The frequency at which data is persisted to disk. This event can trigger an internal cleanup process that in the best-case scenario provokes a surge in CPU usage, and in the worst case may change your database’s Query Store into read-only mode. A whole post will be published about Query Store’s space requirements and cleanup processes.
Since the amount of data loaded into memory doesn’t tend to be much, it is normally safe to use small intervals: test different values and monitor their impact on the CPU and I/O of the server if you find any issue.
Statistics Collection Interval
There is a predefined set of intervals to select from: the smaller the interval, the more granular information you can get from the Query Store. Let’s see two opposite examples, assuming a certain regularity on the database’s activity:
1 Minute: details of the activity on the database can be looked into down to the minute level, so analyses like query regression can highlight issues just one minute after they happen.
1 Hour: details and reports are only fully accurate on the hour. All reports such as query regression or top resource consumers won’t be comparable to previous intervals until the whole hour has passed and the data has been aggregated, potentially causing a delay when analysing current issues. On the other hand, on stable systems it will save much disk space, since this interval could require roughly 1/60 of what the “1 minute” interval does. The math on Query Store’s space requirements is more complex than that, but you can get an overall idea.
Query Store Retention
Defines what data will remain in Query Store, and for how long before it is deleted.
Max Plans Per Query
A query may have different execution plans caused by changes in indexes, statistics, or the parameters used in the query. The default may be a bit too much (200 plans for a single query), but I have encountered some queries with dozens of plans, like the one below.
Max Size (MB)
Query Store’s objects are stored in the PRIMARY filegroup of the database, and this cannot be changed. Take this into consideration when estimating the database size, and if possible try always to move all your database objects out of the PRIMARY filegroup so that Query Store and the actual data won’t have to compete for space.
Query Store Capture Mode
All: Captures all queries, no matter how small they look.
Auto: Lets Query Store decide which queries are worth capturing.
Custom: Lets you configure certain settings to decide what queries will be captured.
None: No new queries will be captured, but statistics on the queries already registered in Query Store will still be captured.
Size Based Cleanup Mode
Auto: When Query Store reaches 90% of its Max Size, an internal process will start deleting queries (starting with the ones with a lower CPU consumption) until it is below 85%.
Off: No cleanup will be triggered based on Query Store’s current space utilisation.
Stale Query Threshold (Days)
Any statistics stored older than this number of days will be deleted.
Wait Statistics Capture Mode
Flag (On/Off) to enable the capture of Wait Statistics introduced in SQL 2017.
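All of the settings described above can also be configured through T-SQL; here is a sketch with illustrative values (the database name is a placeholder):
ALTER DATABASE [MyDatabase]
SET QUERY_STORE = ON
(
OPERATION_MODE = READ_WRITE,
DATA_FLUSH_INTERVAL_SECONDS = 900, -- Data Flush Interval (15 minutes)
INTERVAL_LENGTH_MINUTES = 60, -- Statistics Collection Interval
MAX_STORAGE_SIZE_MB = 1000, -- Max Size (MB)
MAX_PLANS_PER_QUERY = 200,
CLEANUP_POLICY = (STALE_QUERY_THRESHOLD_DAYS = 30), -- Stale Query Threshold (Days)
SIZE_BASED_CLEANUP_MODE = AUTO,
QUERY_CAPTURE_MODE = AUTO,
WAIT_STATS_CAPTURE_MODE = ON
)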
Query Store Capture Policy
This functionality was introduced in SQL 2019 in order to improve on the opaque “Auto” Query Store Capture Mode previously available. Exceeding any of these thresholds within the time stated in the Stale Threshold parameter will cause Query Store to log the query’s stats.
Execution Count
Number of executions a certain query must reach in the allocated Stale Threshold.
Total Compile CPU Time (ms)
Total time spent compiling the query in the allocated Stale Threshold. This threshold applies to the sum of compilation times of all the plans generated for the same query.
Total Execution CPU Time (ms)
Total CPU time spent executing the query in the allocated Stale Threshold. This threshold applies to the total CPU of the query, regardless of the execution plan used.
Stale Threshold
There is a predefined set of intervals to select from: this applies to all the settings above (Execution Count, Total Compile CPU Time, and Total Execution CPU).
There are some caveats on this capture mode, though:
It is not possible to review what information the Custom mode is analysing to decide what to persist or not. Although Query Store keeps the data belonging to the open interval in memory until the interval is closed and aggregated, there is no way to see what data is temporarily stored during the Stale Threshold interval before it is closed and persisted to the system tables. Once I finish my investigation on the new capture mode I’ll publish a detailed post on the subject.
Notes on this post
All the tests have been performed on a SQL 2019 instance using SSMS 18.4. Different versions of both SQL server and the SSMS client may present slight differences.
Query Store is a feature introduced in SQL 2016 that keeps track of multiple metrics (CPU, duration, I/O…) of the queries executed in your database. This information is aggregated at configurable intervals, so the activity can be analysed with different levels of detail depending on your needs.
Although this functionality has been available since SQL 2016, starting with SQL 2017 wait statistics are captured as well, along with the corresponding changes in SSMS to access these new metrics’ reports.
However, I don’t recommend enabling this functionality on just any SQL 2017 instance you can get your hands on. A fix was released in SQL 2017’s Cumulative Update 15 to solve a bug that would cause access violation errors when accessing corrupted execution plans. Without this patch, opening the Query Store reports may wind up loading one of those corrupted plans: best case you will get an error message and a memory dump; worst case your instance will crash.
Enabling Query Store
Query Store is enabled on a per-database basis, so open the properties of the database you want to enable it on.
Let’s start by enabling it with the default values. A detailed explanation of the available settings will be provided in a following post, but for the purpose of learning how to use the SSMS interface built on top of it, the defaults will serve just fine.
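If you prefer T-SQL over the properties dialog, enabling Query Store with its defaults boils down to the following (the database name is a placeholder):
ALTER DATABASE [MyDatabase] SET QUERY_STORE = ON;
ALTER DATABASE [MyDatabase] SET QUERY_STORE (OPERATION_MODE = READ_WRITE);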
From this moment on, Query Store is capturing details of the queries being executed on the database, provided they exceed the minimum thresholds set by the AUTO capture mode. These thresholds are not published, but they are meant to exclude queries that, due to their low number of executions and small impact on the SQL engine (be it CPU, duration, I/O, memory…), are not deemed worth capturing.
Accessing the reports
The reports shipped with SSMS cover most if not all the data captured by Query Store
Regressed Queries
These are the queries whose metrics have worsened. These metrics can measure a number of values, from total duration (the default used) to the standard deviation of the queries’ memory consumption, and selecting one or another will change the queries flagged as regressed.
This is a useful report when there is a recent change in the SQL performance, such as a sudden increase in the CPU, or deterioration of a particular query’s performance. If the issue has been caused by a change in the execution plan, just select the ones you want to compare and hit the “Compare Plans” button (only two plans at a time).
Overall Resource Consumption
This report gives an overview of some basic measurements. In my opinion, whilst the “CPU Time” can give you an idea on how much of the server’s CPU was occupied by queries executed on the current database, the other metrics, lacking the proper context, don’t provide useful information.
Top Resource Consuming Queries
This report sorts the queries captured by any metric selected. The default is “Duration (ms) / Total”, but any combination is possible.
Queries With Forced Plans
Query Store allows you to force a specific plan to be used for a query. This can be useful when outdated statistics would otherwise persuade the SQL engine to use a plan that would cause a performance impact, such as switching from an index seek to an index scan operation. Those queries can be found using this view.
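Plan forcing itself is done with a pair of stored procedures (the IDs below are just placeholders):
EXECUTE sys.sp_query_store_force_plan @query_id = 42, @plan_id = 7;
-- And to stop forcing it:
EXECUTE sys.sp_query_store_unforce_plan @query_id = 42, @plan_id = 7;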
Queries With High Variation
There are queries that may have a very regular duration over time, but with certain executions that deviate from the usual behaviour. This report will highlight them so you can identify whether they represent an issue and should be looked further into.
Query Wait Statistics
Since query wait statistics were only introduced in SQL 2017, they were not part of the original system tables and don’t fully integrate with the pre-existing reports, so this separate report fills that need. It lets you drill down from the general view (all wait types) to the specific queries that incurred a specific wait type.
Tracked Queries
Contrary to what the name seems to indicate, we can’t select queries to track information on them over time, or get a filtered view of Query Store’s data on them. This is not a report but a detailed view of a specific query, so once a specific query has been identified, its details can be seen here.
Notes on this post
All the tests have been performed on a SQL 2019 instance using SSMS 18.4. Different versions of both SQL server and the SSMS client may present slight differences.
Also, in order to capture query details for the post, I modified some of the Query Store settings, such as setting the capture mode to ALL (in order to capture every query executed) and reducing the Statistics Collection Interval to 1 minute, so quick changes would be captured by Query Store rather than waiting for it to gather several hours of data before obtaining any report.