Extracting information from execution plans

Analysing an execution plan is fairly easy, even if you are using SSMS with no additional add-ons.

You can compare two plans too, either with SSMS itself or with some third-party tools that do a good job of highlighting potential problems.

The problem comes when you are used to Query Store and these “rainbow queries” have hundreds of execution plans.

Sometimes SQL Server has a hard time deciding which plan to use, and grabs far too many crayons.

Since all these tools rely on a DBA to look into the plan and identify the actual issues, you’d have to review one plan at a time. What if you could do it automatically?

Query Store already has the plans, and there is a ton of information they contain:

  • The statements executed (only one for Query Store’s plans, but plans in the SQL cache or stored in exported plan files could contain several).
  • Indexes that SQL Server considers may improve the performance of the query (also known as “missing indexes”).
  • Filtered indexes that exist on the tables but were not used due to the values passed in the query’s WHERE clauses.
  • The details of each node of the execution plan (the little boxes in the graphical plan).
  • Cursors used in the statement.
  • Operations involving indexes (scan, seek, update…).
  • Columns accessed by the statement (for either read or write operations).
  • Statistics used by the SQL Engine to generate the execution plan.

So I thought… why not get all this information from the plan?

… right, because an execution plan’s XML can be complex even for the simplest SELECT.

There was no way I was going to navigate that whole nightmare by hand, but I remembered something I had written almost two years ago to take an XML file and break it into smaller chunks (one per line) to make it more manageable and to extract certain fields. Back then I was only listing the tables and columns involved, so I decided to update it to extract more information.

The process goes as follows:

  • The execution plan’s XML is loaded into a temp table one line at a time, treated as if it were plain text, and every line is turned into valid XML on its own (replacing closing tags such as “</label>” with “<label/>”, and appending “/>” to open tags so each line becomes a self-contained XML fragment).
  • Certain lines are modified (for example, multi-line close tags are turned into a separate custom one).
  • Based on the contents of each line, the information is stored in the proper table (indexes, statistics, columns) whilst keeping track of the node the information is associated with (see the sketch below).
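
To give an idea of the transformation, here is a minimal T-SQL sketch of the line-patching logic (illustrative only, not PlanMiner’s actual code):

DECLARE @Line NVARCHAR(MAX) = N'<RelOp NodeId="0" PhysicalOp="Index Seek" EstimateRows="1">';

SELECT PatchedLine =
    CASE
        WHEN @Line LIKE '</%'   -- closing tag: </RelOp> becomes <RelOp/>
            THEN N'<' + SUBSTRING(@Line, 3, LEN(@Line) - 3) + N'/>'
        WHEN @Line LIKE '%/>'   -- already self-closing: leave as-is
            THEN @Line
        WHEN @Line LIKE '%>'    -- open tag: self-close it by replacing the final ">" with "/>"
            THEN STUFF(@Line, LEN(@Line), 1, N'/>')
        ELSE @Line
    END;

Once each line casts cleanly to XML, attributes such as PhysicalOp or EstimateRows can be read with .value() and routed to the proper staging table.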

QDSToolbox’s PlanMiner does all this and, combined with the plans available in Query Store, can be used to incrementally load new plans’ information into those tables for later consumption.

The next step would be defining rules to identify which execution plans have issues, what kinds of situations should be flagged, or what reports to generate (a sketch of the first rule follows the list):

  • Missing indexes with an impact > 80%.
  • Queries that would be impacted if a certain table or column is modified or dropped.
  • Queries that are strictly read-only and could be offloaded to a read-only replica in your Availability Groups.
  • Queries using forced indexes that would need rewriting if the indexes are removed.
  • New plans generated in the last 24 hours with potential issues.
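
As a taste of what the first rule could look like once the plan data is loaded, here is a hedged sketch; the table and column names (PlanMiner_MissingIndexes, ImpactPercent…) are illustrative, not PlanMiner’s actual schema:

SELECT  mi.PlanID,
        mi.DatabaseName,
        mi.TableName,
        mi.ImpactPercent
FROM    dbo.PlanMiner_MissingIndexes AS mi -- hypothetical table name
WHERE   mi.ImpactPercent > 80              -- flag only high-impact suggestions
ORDER BY mi.ImpactPercent DESC;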

These rules are not in place at the moment, just the extraction of the data. I am no expert in performance, but anyone who wants to try this tool and define some rules to apply to the execution plans is welcome to do so: I’ll try my best to integrate them into the tool and allow enough flexibility so anyone can define their own.

SQL, why are you acting up? SQL & Statistics

Whenever I see a query regression in Query Store caused by SQL Server going nuts and using an execution plan which is obviously not the best available, I wonder why it is doing that. But since I don’t have access to its source code (and even if I did, I probably wouldn’t fully understand it), the only thing I could do was either force the better plan and hope SQL Server would use it (which it doesn’t always do: there are certain limitations to that), or update the statistics on all the tables involved in the query and hope for the best.

This was a very manual operation:

  • Open the query text.
  • Identify all the tables involved in the query, which may include accessing the definition of the views it uses, and that’s assuming the query text is not truncated with some of them lost as a result.
  • Find all the statistics on the identified tables that have a low sample rate, or a high percentage of rows updated since the statistics were last refreshed, rendering them obsolete.
  • Write a script to update them using a sample rate appropriate to the row count of the table (see the sketch below).
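
As an illustration of that last step, here is a minimal sketch that picks a sample rate from the table’s row count and generates the command (the thresholds are illustrative, adjust them to your environment):

DECLARE @RowCount   BIGINT  = 5000000,        -- sample values
        @SchemaName SYSNAME = N'dbo',
        @TableName  SYSNAME = N'Orders',
        @StatName   SYSNAME = N'IX_Orders_CustomerID';

SELECT N'UPDATE STATISTICS '
     + QUOTENAME(@SchemaName) + N'.' + QUOTENAME(@TableName)
     + N' (' + QUOTENAME(@StatName) + N')'
     + CASE
           WHEN @RowCount < 1000000   THEN N' WITH FULLSCAN;'          -- small tables: scan it all
           WHEN @RowCount < 100000000 THEN N' WITH SAMPLE 10 PERCENT;'
           ELSE N' WITH SAMPLE 1 PERCENT;'                             -- huge tables: keep it cheap
       END AS UpdateCommand;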

But since execution plans detail all the statistics used by the SQL Engine to generate the plan, and they are kept in Query Store, I decided to extract those details from them.

You know a task is annoying when querying an XML file looks easier.
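
For reference, this is roughly the kind of XML digging involved; a minimal sketch assuming plans captured on SQL Server 2017 or later, which record the OptimizerStatsUsage element:

WITH XMLNAMESPACES (DEFAULT 'http://schemas.microsoft.com/sqlserver/2004/07/showplan')
SELECT  p.plan_id,
        s.value('@Table',           'NVARCHAR(256)') AS TableName,
        s.value('@Statistics',      'NVARCHAR(256)') AS StatisticsName,
        s.value('@SamplingPercent', 'FLOAT')         AS SamplingPercent,
        s.value('@LastUpdate',      'DATETIME2')     AS LastUpdate
FROM    sys.query_store_plan AS p
CROSS APPLY (SELECT TRY_CAST(p.query_plan AS XML) AS plan_xml) AS c
CROSS APPLY c.plan_xml.nodes('//OptimizerStatsUsage/StatisticsInfo') AS t(s)
WHERE   p.query_id = 1; -- sample query_id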

The new tool added to the QDSToolBox, StatisticsUsed, does just that: given a query ID (or a list of them), or even a whole object, it generates a table with the details of the statistics used. Based on factors like when the statistics were last updated, the row count of the table, or the percentage of rows changed since then, it generates the appropriate UPDATE STATISTICS command. This is what the final result looks like:

Demo of the result of StatisticsUsed

Query Store: time for some spring cleaning

Query Store contains a lot of information on the activity on your databases, and that can take some space (for more details, check my previous post). We can limit the amount of data stored by adjusting its basic settings, and what data is stored by adjusting the capture mode. Once the data is stored, there are two different processes to clean it up:

  • Date-based cleanup: uses the retention settings configured to decide what information will be deleted on a regular basis.
  • Size-based cleanup: if the current usage of Query Store space exceeds 90% of its maximum size, an internal SQL process analyzes the information stored for all queries and starts deleting it to clear disk space. The extended event query_store_size_retention_cleanup_finished captures details on its outcome (a capture sketch follows the list). There are no details on how this calculation is performed, but it seems to be strictly CPU-based.
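
If you want to watch for it, a small sketch to capture that event (the session name and file target are illustrative, and I believe the event lives in the qds package; double-check on your version):

CREATE EVENT SESSION [QDS_SizeCleanup] ON SERVER
ADD EVENT qds.query_store_size_retention_cleanup_finished
ADD TARGET package0.event_file (SET filename = N'QDS_SizeCleanup.xel');

ALTER EVENT SESSION [QDS_SizeCleanup] ON SERVER STATE = START;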

This automatic size-based cleanup can have a huge impact on your servers, as explained in this post from my colleague Mark Wilkinson, titled “The perfect storm“. The summary: size-based cleanup caused a 70% CPU peak on an 8-core server and took more than an hour to complete, impacting all applications connecting to the server. So you can either increase the maximum size of your Query Store so the date-based cleanup kicks in before the size-based one does, or…

Manually cleaning your Query Store

The only option available to clean your Query Store might seem to be clearing it completely, but the documentation presents another solution to perform some fine-grained cleanup, using a few stored procedures (sample calls follow the list):

  • sp_query_store_remove_plan: removes a plan and its related entries (runtime stats and wait stats are plan-specific, so they will be removed as well).
  • sp_query_store_remove_query: removes a query from Query Store (along with its plans and their runtime & wait stats).
  • sp_query_store_reset_exec_stats: deletes the statistics (runtime & waits) for a specific plan.
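
Their usage is straightforward (the plan_id & query_id values are just examples):

EXECUTE sys.sp_query_store_remove_plan      @plan_id  = 42;  -- drop one plan
EXECUTE sys.sp_query_store_remove_query     @query_id = 7;   -- drop a query, its plans & stats
EXECUTE sys.sp_query_store_reset_exec_stats @plan_id  = 42;  -- wipe a plan's runtime & wait stats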

However, that puts the burden on the administrator, who has to decide which plans & queries can be safely removed and will have a definite impact on Query Store space usage. It would not make much sense to delete the details of a query that will be executed right afterwards, since the plan & query details would have to be stored all over again, and those are the components that take the most space.

With these 3 stored procedures, there are a number of custom cleanup processes that can be implemented:

  • Delete queries associated with a particular object, such as a stored procedure used by administrators or auditors with no significant impact on the database.

  • Delete all non-forced plans associated with a query that has a forced plan in place, since they will no longer be used by the query (see the sketch below).
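
A hedged sketch of that second idea, looping over the non-forced plans and removing them one by one (not the published SP’s actual code):

DECLARE @plan_id BIGINT;

DECLARE plan_cursor CURSOR LOCAL FAST_FORWARD FOR
    SELECT p.plan_id
    FROM sys.query_store_plan AS p
    WHERE p.is_forced_plan = 0
      AND EXISTS (SELECT 1 FROM sys.query_store_plan AS f
                  WHERE f.query_id = p.query_id AND f.is_forced_plan = 1);

OPEN plan_cursor;
FETCH NEXT FROM plan_cursor INTO @plan_id;
WHILE @@FETCH_STATUS = 0
BEGIN
    EXECUTE sys.sp_query_store_remove_plan @plan_id = @plan_id;
    FETCH NEXT FROM plan_cursor INTO @plan_id;
END;
CLOSE plan_cursor;
DEALLOCATE plan_cursor;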

The Query Store custom cleanup script

For the issue explained in Mark’s post, I wrote an SP to clean certain queries from Query Store:

  • Any query flagged as internal, since those only contain operations such as index & statistics maintenance.
  • Any query referencing an object no longer present in the database (“orphan” queries).
  • Ad hoc queries that have not been executed recently (fewer than X times in the past Y hours); a detection sketch for this rule follows the list.
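
A minimal sketch of how that last rule could be detected (the thresholds map to the X & Y above; this is illustrative, not the published SP’s actual code):

DECLARE @MinExecutionCount INT = 2,   -- X: executed fewer times than this...
        @RetentionHours    INT = 24;  -- Y: ...and not at all in this many hours

SELECT  q.query_id
FROM    sys.query_store_query AS q
JOIN    sys.query_store_plan AS p
    ON  p.query_id = q.query_id
JOIN    sys.query_store_runtime_stats AS rs
    ON  rs.plan_id = p.plan_id
WHERE   q.object_id = 0           -- ad hoc text, not tied to an object
    AND q.is_internal_query = 0
GROUP BY q.query_id
HAVING  SUM(rs.count_executions) < @MinExecutionCount
    AND MAX(rs.last_execution_time) < DATEADD(HOUR, -@RetentionHours, SYSUTCDATETIME());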

When used against the same SQL instance that had suffered the 70% CPU peak due to the automated cleanup, it ran on 10 different databases simultaneously with no noticeable impact, and reduced the space used by our databases’ Query Store from 9-10 GB each to 6-8 GB. Since the automated size-based cleanup kicks in once utilization exceeds 90%, this SP was configured to run on databases that had exceeded 80% of space usage, and we haven’t had any issues since.

There are some other functionalities on my TO DO list, such as thresholds to clean up queries (based on the ones used by the Custom Capture mode, but allowing more parameters to configure them).

Introducing QDS Toolbox: QDSCacheCleanup

An improved version of the original procedure has been published in ChannelAdvisor’s public GitHub repository as part of the QDS Toolbox, which I’ll be exploring in the following posts. This particular component of the QDS Toolbox (QDSCacheCleanup) is designed to be deployed either on each of your databases or in a single database (like the one where you have deployed your maintenance procedures, such as Ola Hallengren’s scripts) to perform the cleanup operation and log (if desired) the operations performed. You can install it and run it in test mode to analyze which queries would be deleted depending on the parameters you set, before deploying it to production.

The reports generated are available in multiple formats:

As tables

A summary of the estimated cleanup results
All the queries targeted for deletion based on the input parameters

In text format (for summaries only)

**********************************
*       Stale queries found      *
**********************************
# of Queries : 5
# of Plans : 5
KBs of query texts : 0
KBs of execution plans : 390
KBs of runtime stats : 3
KBs of wait stats : 0
 
 
**********************************
*      Orphan queries found      *
**********************************
# of Queries : 59
# of Plans : 60
KBs of query texts : 65
KBs of execution plans : 3435
KBs of runtime stats : 51
KBs of wait stats : 4

Stored in SQL tables

Summary of all executions, allowing you to compare the results depending on the different parameters selected
List of all the queries marked for deletion, and the reason for them to be marked, along with the parameters that helped make that decision

The parameters are stored in XML format to ease the process of adding/removing parameters in future updates of this script:

<Root xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance">
  <CleanupParameters>
    <CleanAdhocStale>0</CleanAdhocStale>
    <CleanStale>1</CleanStale>
    <Retention>24</Retention>
    <MinExecutionCount>2</MinExecutionCount>
    <CleanOrphan>1</CleanOrphan>
    <CleanInternal>1</CleanInternal>
  </CleanupParameters>
</Root>

Notes on this post

The captures were extracted from a SQL 2017 instance. The original cleanup SP has been deployed on hundreds of SQL 2017 instances and several SQL 2019 instances, and the version mentioned in this post and published on GitHub has been developed and tested on both SQL 2017 and 2019 instances.

Notes and captures of the QDS Toolbox components may vary from the current released version, so please refer to the documentation available on GitHub and the comments in the procedures’ code.