
Monday, July 31, 2023

Monitoring Long-Running SQL Server Agent Jobs with T-SQL (Updated)


As a SQL Server DBA, especially in production, you'll eventually hit SQL Agent jobs that run way longer than they should. They blow past SLAs, block other jobs, or start impacting users and apps. SSMS Job Activity Monitor is great for manual checks, but what you really need is automation that pages you the moment a job crosses your threshold.

This post shares a battle-tested T-SQL script that finds long-running Agent jobs and emails you an HTML report. It's dead simple to deploy, needs no third-party tools, and adapts easily to any environment. Even if you have a fancy monitoring suite, I still keep this in my toolkit: it's lightweight, portable, and gets the job done anywhere, anytime.


Script Overview

The script is divided into two main parts:

  • Part 1 identifies active/running jobs whose run durations exceed a set threshold.

  • Part 2 formats the results and sends an email alert with the job details.

You can execute the whole script at once or run each part separately based on your workflow.


Setup Instructions

  1. Configure Database Mail if it is not already enabled (a quick verification snippet follows this list).

  2. Update the variables at the top of the script, especially the runtime threshold (in minutes) and email address.

  3. Schedule this script as a SQL Agent job or run manually as required.

  4. Review output in SSMS or wait for the automated email alert.

T-SQL Script:

-- See blog post: https://sqlpal.blogspot.com/2023/07/long-running-sql-server-jobs.html
-- 
/*==========================================================================================
  Script Name : Monitor_Long_Running_SQL_Agent_Jobs.sql

  Purpose     : 
      Identify currently running SQL Server Agent jobs that exceed a configurable 
      runtime threshold and optionally send an HTML email alert with details 
      (job name, duration, and command text).

  Context / Typical Use:
      - Scheduled as a SQL Agent job (e.g., every 5–10 minutes) on production 
        instances to detect runaway or stuck jobs.
      - Can be run interactively to quickly see which jobs are currently 
        running long.

  Key Behaviors:
      - Uses msdb job metadata plus dm_exec dynamic views to correlate SQL Agent 
        job execution with live session/request runtime metrics.
      - Filters only *currently running* jobs and applies a runtime threshold 
        (default: > 30 minutes) to avoid noise from short/expected jobs.
      - Builds an HTML table via FOR XML so the email is readable and easy to scan.
      - Skips the email step entirely when no qualifying long-running jobs exist.

  Assumptions / Requirements:
      - Database Mail is configured and usable on this instance.
      - Caller has appropriate permissions on msdb and server-level DMVs.
      - SQL Server Agent uses the standard "SQLAgent - TSQL JobStep" program_name 
        format for job sessions.

  Configuration Variables (at top of script):
      @MinutesThreshold  - Minimum runtime (minutes) before flagging as "long-running"
      @EmailRecipients   - Semicolon-delimited list of email addresses to notify
      @MailProfile       - Database Mail profile name (optional; uses default if NULL)

  How to Adjust:
      - Modify @MinutesThreshold for your SLA (e.g., 15 for aggressive monitoring).
      - Update @EmailRecipients for environment-specific oncall lists.
      - Set @MailProfile if you use named profiles instead of the default.

  Output:
      - Result set in SSMS with long-running jobs.
      - HTML email with one row per long-running job, when any are found.

  Maintenance Notes:
      - When changing thresholds, recipients, or HTML layout, keep this header 
        updated so behavior matches documentation.
      - If SQL Server version or Agent behavior changes (e.g., program_name 
        format), revisit the session-to-job mapping query.

  Last Updated:
      - 2026-01-29 : Added configurable @MinutesThreshold and @EmailRecipients variables.

==========================================================================================*/

-- ==========================================
-- CONFIGURATION VARIABLES - Modify these per environment
-- ==========================================
DECLARE @MinutesThreshold INT = 30;        -- Jobs running longer than this (minutes) trigger alerts
DECLARE @EmailRecipients  NVARCHAR(500) = '<dba-team@company.com;oncall@company.com>';  -- Semicolon-delimited recipients
DECLARE @MailProfile      NVARCHAR(128) = NULL;  -- Optional: Database Mail profile name; NULL uses default

-- PART 1: Check for active/running jobs and their duration so far
SET NOCOUNT ON;

IF OBJECT_ID('tempdb..#job_durations') IS NOT NULL 
    DROP TABLE #job_durations;

;WITH cte_jobs AS
(
    SELECT 
           sja.session_id AS [Session ID],
           sj.name        AS [Job Name], 
           r.command      AS [Command],

           -- WHY: We derive a simplified job status from start/stop times so that 
           --      downstream logic can just check [Status] instead of multiple datetime columns.
           [Status] = CASE
                WHEN sja.start_execution_date IS NULL THEN 'Not running'
                WHEN sja.start_execution_date IS NOT NULL 
                     AND sja.stop_execution_date IS NULL THEN 'Running'
                WHEN sja.start_execution_date IS NOT NULL 
                     AND sja.stop_execution_date IS NOT NULL THEN 'Not running'
           END,

           sja.start_execution_date AS [Job Start Time],

           r.total_elapsed_time         AS [Elapsed Milliseconds],
           r.total_elapsed_time / 1000  AS [Elapsed Seconds],
           r.total_elapsed_time / 1000 / 60        AS [Elapsed Minutes],
           r.total_elapsed_time / 1000 / 60 / 60   AS [Elapsed Hours],
           sj.job_id                    AS [Job ID] 

    FROM msdb..sysjobs sj
    INNER JOIN msdb..sysjobactivity sja 
        ON sj.job_id = sja.job_id

    -- WHY: SQL Agent job steps appear as sessions whose program_name starts with
    --      'SQLAgent - TSQL JobStep'.  This trick lets us map from dm_exec_sessions
    --      back to the msdb job_id by decoding the program_name.
    INNER JOIN (
        SELECT 
            TRY_CONVERT(binary(30), SUBSTRING(program_name, 30, 34), 1) AS job_id,
            *
        FROM sys.dm_exec_sessions 
        WHERE program_name LIKE 'SQLAgent - TSQL JobStep%'
    ) s ON sj.job_id = s.job_id

    LEFT JOIN sys.dm_exec_requests r 
        ON s.session_id = r.session_id

    WHERE sja.session_id = (
        SELECT MAX(session_id) 
        FROM msdb..sysjobactivity 
        WHERE job_id = sj.job_id
    )
)

-- Persist and enrich only the jobs that are currently running and above the threshold
SELECT 
    cte.*,

    -- Convert the duration into words
    -- WHY: This human-readable string is meant for quick eyeballing in an email 
    --      or SSMS result grid, especially during incident response.
    [Duration] = 
        CASE WHEN [Elapsed Hours] > 0 
             THEN CAST([Elapsed Hours] AS VARCHAR(20)) + ' Hours' 
             ELSE '' 
        END +

        CASE WHEN [Elapsed Minutes] > 0 AND [Elapsed Hours] = 0 
             THEN CAST([Elapsed Minutes] AS VARCHAR(20)) + ' Minutes' 
             ELSE '' 
        END +
        CASE WHEN [Elapsed Minutes] > 0 AND [Elapsed Hours] > 0 
             THEN ', ' + CAST([Elapsed Minutes] - ([Elapsed Hours]*60) AS VARCHAR(20)) 
                      + ' Minutes' 
             ELSE '' 
        END +

        CASE WHEN [Elapsed Seconds] > 0 AND [Elapsed Minutes] = 0 AND [Elapsed Hours] = 0 
             THEN CAST([Elapsed Seconds] AS VARCHAR(20)) + ' Seconds' 
             ELSE '' 
        END +
        CASE WHEN [Elapsed Seconds] > 0 AND [Elapsed Minutes] > 0 
             THEN ', ' + CAST([Elapsed Seconds] - ([Elapsed Minutes] * 60) AS VARCHAR(20)) 
                      + ' Seconds' 
             ELSE '' 
        END   
INTO #job_durations
FROM cte_jobs cte
WHERE [Status] = 'Running'            -- Only currently running jobs matter here
  AND [Elapsed Minutes] > @MinutesThreshold 

SELECT 
    [Session ID], 
    [Job Name], 
    [Command], 
    [Job Start Time], 
    [Duration], 
    [Elapsed Milliseconds] 
FROM #job_durations 
ORDER BY [Elapsed Milliseconds] DESC;

-- WHY: If there are no rows, we skip sending an email. Using GOTO here keeps
--      the control flow simple in a single script that may run under Agent.
IF @@ROWCOUNT = 0 
    GOTO QUIT;

-- PART 2: Send email alert

-- Variables for HTML-formatted email
DECLARE 
    @html_body       NVARCHAR(MAX), 
    @html_table_head VARCHAR(1000),  
    @html_table_tail VARCHAR(1000);

SET @html_table_tail = '</table></body></html>';

SET @html_table_head = 
      '<html><head>'
    + '<style>'
    + 'td {border: solid black;border-width: 1px;padding-left:5px;padding-right:5px;'
    + 'padding-top:1px;padding-bottom:1px;font: 15px arial} '
    + '</style>'
    + '</head>'
    + '<body>'  
    + '<pre style="font-weight: bold">'
    + 'SQL Server Instance: ' + @@SERVERNAME
    + '<br> <br> Report generated on : ' + CAST(GETDATE() AS VARCHAR(100))
    + '<br> <br> Long-running threshold: ' + CAST(@MinutesThreshold AS VARCHAR(10)) + ' minutes'
    + ' <br> <table cellpadding=0 cellspacing=0 border=0>' 
    + '<br> <tr> ' 
    + '<td bgcolor=#E6E6FA><b>Job Name</b></td>'
    + '<td bgcolor=#E6E6FA><b>Duration</b></td>'
    + '<td bgcolor=#E6E6FA><b>Command</b></td></tr>';

SET @html_body = (
    SELECT    
        td = LEFT([Job Name], 50), '',
        td = Duration, '',
        td = [Command], ''
    FROM #job_durations 
    ORDER BY [Elapsed Milliseconds] DESC
    FOR XML RAW('tr'), ELEMENTS
);

-- Combine header, body, and footer into a complete HTML document
SELECT @html_body = @html_table_head + ISNULL(@html_body, '') + @html_table_tail;

-- Send email with HTML body

DECLARE @EmailSubject NVARCHAR(500) = 'Long Running SQL Server Jobs found (>' + CAST(@MinutesThreshold AS VARCHAR(10)) + ' min)';
EXEC msdb.dbo.sp_send_dbmail  
       @profile_name = @MailProfile,  -- NULL uses default profile
       @recipients   = @EmailRecipients,  
       @body         = @html_body,  
       @subject      = @EmailSubject,
       @body_format  = 'HTML';

QUIT:
SET NOCOUNT OFF;   -- harmless closing statement; a bare label at the very end of a batch can raise a syntax error


Example Output

For a demo, I started two sample jobs that do nothing but run the WAITFOR DELAY command. Here are the results displayed on the screen:






If a job exceeds your chosen threshold, you’ll receive an HTML-formatted email alert summarizing job details:











These next two screenshots are from a real/production job:












You can adapt thresholds and recipients to suit your production environment or integrate further enhancements as needed.
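
For example, if certain jobs are expected to run long by design (a nightly ETL load, say), you can exclude them so they don't generate noise. A minimal tweak to the WHERE clause in Part 1; the job names are purely illustrative:

WHERE [Status] = 'Running'
  AND [Elapsed Minutes] > @MinutesThreshold
  AND [Job Name] NOT IN (N'Nightly ETL Load', N'Weekly Index Maintenance')   -- illustrative exclusions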


Tips

  • Customize thresholds at the top of the script for reuse and clarity.

  • Add exclusions or filters for specific job names as needed.

  • Provide context for others: update email subjects, include server or environment tags.

  • Enhance with automation: Schedule this script as a recurring SQL Agent job (a scheduling sketch follows this list).

  • Extend for more detail: Add columns for the affected database, more job properties, or job step text.
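
Here is a minimal sketch of scheduling the monitor to run every 10 minutes. The job name, step command, and schedule are illustrative; the wrapper procedure is hypothetical, and you could instead paste the full script into the step command:

USE msdb;
GO

EXEC dbo.sp_add_job
     @job_name = N'DBA - Monitor long-running Agent jobs';

EXEC dbo.sp_add_jobstep
     @job_name      = N'DBA - Monitor long-running Agent jobs',
     @step_name     = N'Run monitoring script',
     @subsystem     = N'TSQL',
     @database_name = N'master',
     @command       = N'EXEC dbo.usp_Monitor_Long_Running_Jobs;';   -- hypothetical wrapper proc

EXEC dbo.sp_add_jobschedule
     @job_name             = N'DBA - Monitor long-running Agent jobs',
     @name                 = N'Every 10 minutes',
     @freq_type            = 4,     -- daily
     @freq_interval        = 1,
     @freq_subday_type     = 4,     -- units of minutes
     @freq_subday_interval = 10;    -- every 10 minutes

EXEC dbo.sp_add_jobserver
     @job_name = N'DBA - Monitor long-running Agent jobs';   -- targets the local server by default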


Conclusion

My hope is that this script gives DBAs a quick, flexible way to keep tabs on long-running SQL Server Agent jobs - no third-party tools required. With just a few tweaks, it can scale nicely for large enterprise setups or work just as well in smaller environments. Feel free to adapt it to your needs, and if you come up with improvements or new ideas, share them in the comments!


Extending to Long-Running Query Monitoring

If you want to take things a step further, you can easily tweak this script’s logic to keep an eye on long-running queries instead of just jobs. By querying dynamic management views like sys.dm_exec_requests and sys.dm_exec_sessions, you can spot active queries that go over a set duration. Grab details like the start_time and total_elapsed_time, join in the actual query text with sys.dm_exec_sql_text, and you’ll be able to create a similar alert system for queries that might be dragging down performance. It’s a great way to dive deeper into SQL Server DMVs and start building your own custom monitoring solutions that go beyond just tracking jobs.
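
As a starting point, here is a minimal sketch of that idea: it flags any active user request that has been running longer than a threshold. The threshold and filters are illustrative, so adjust them to your environment:

DECLARE @MinutesThreshold INT = 30;   -- illustrative threshold

SELECT r.session_id,
       s.login_name,
       s.program_name,
       r.start_time,
       r.total_elapsed_time / 1000 / 60 AS elapsed_minutes,
       r.status,
       r.wait_type,
       t.text AS query_text
FROM sys.dm_exec_requests r
INNER JOIN sys.dm_exec_sessions s
    ON s.session_id = r.session_id
CROSS APPLY sys.dm_exec_sql_text(r.sql_handle) t
WHERE s.is_user_process = 1
  AND r.total_elapsed_time / 1000 / 60 > @MinutesThreshold
ORDER BY r.total_elapsed_time DESC;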



Wednesday, July 19, 2023

Tracking the Inaccurate Cardinality Estimates


When it comes to query performance, Cardinality Estimates have lately been a hot topic with each new release of SQL Server. Each improvement makes the query optimizer smarter, helps it generate more efficient query execution plans, and makes queries run faster.

The SQL Server query optimizer, being 'cost-based', fundamentally relies on cardinality estimation, using per-attribute summary statistics (histograms). This principle applies to any Relational Database Management System (RDBMS), such as Oracle, that employs a cost-based optimizer.

The following description and warning included with the SSMS 'create Extended Events' wizard should be self-explanatory regarding why cardinality estimates matter.

Occurs when an operator outputs significantly more rows than estimated by the Query Optimizer. Use this event to identify queries that may be using sub-optimal plans due to cardinality estimate inaccuracy. Using this event can have a significant performance overhead so it should only be used when troubleshooting or monitoring specific problems for brief periods of time.


Wait a minute, according to that description, it's bad for performance when the optimizer underestimates the row count. Does that mean I don't need to worry about it if it overestimates the row count? Actually, overestimation is also bad for performance, perhaps not as severely, but it's still problematic enough that you should not ignore it. When the optimizer overestimates the row count:

  • The optimizer will allocate too much memory to the query, resulting in wasted memory. If this happens frequently, it can starve the SQL Server of memory, leading to spills to disk, hard paging, and so on (see the memory-grant check after this list).
  • The query will be executed with a parallel plan when a serial plan would not only have been faster but also would have avoided the possibility of the infamous CXPACKET waits.
  • The optimizer is more likely to choose scanning the entire index over seeking.
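
One quick way to spot grant-versus-usage imbalance on a live system is sys.dm_exec_query_memory_grants. Here is a minimal sketch; the 10x ratio is just an illustrative cutoff:

-- Active queries whose granted memory far exceeds what they are actually using
SELECT mg.session_id,
       mg.requested_memory_kb,
       mg.granted_memory_kb,
       mg.used_memory_kb,
       mg.ideal_memory_kb,
       t.text AS query_text
FROM sys.dm_exec_query_memory_grants mg
CROSS APPLY sys.dm_exec_sql_text(mg.sql_handle) t
WHERE mg.granted_memory_kb > 10 * mg.used_memory_kb   -- illustrative ratio
ORDER BY mg.granted_memory_kb DESC;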


At a very basic level, a cardinality estimate is the optimizer's estimate of the row count for each operation in the query, particularly for columns used in filter conditions (the WHERE clause) and columns used in join predicates.

How are these estimates calculated?


While the internal algorithms evolve and keep changing, in general the Optimizer uses the statistics created on the underlying objects/tables. Statistics are created automatically whenever an index is created, and you can also create them separately, independent of any index. Additionally, SQL Server will automatically create any statistics it needs if the AUTO_CREATE_STATISTICS setting is ON for the database. There are three key pieces of information in statistics:

  • The header includes, among other things, the last update date and the number of rows sampled. This information can be obtained from the catalog view sys.stats and the function sys.dm_db_stats_properties.
  • Density measures the uniqueness of a column: low density indicates higher uniqueness, and vice versa. This information can be obtained from DBCC SHOW_STATISTICS with the DENSITY_VECTOR option.
  • Histogram shows the number of distinct values in a column and the distribution of the data in it. For example, consider the distribution of data for the Code column with 5 distinct values:



If there is an index (and, consequently, statistics on column values) on the Code column, and assuming the stats are up to date and accurate, the Optimizer is (rightfully so) more likely to use a Seek against that index for the following query, because it knows from the statistics that the index contains far fewer records for Code = 'E'.

SELECT * FROM Blood_Type WHERE Code = 'E';


If the filter condition instead contains Code = 'F', the Optimizer is more likely to not only ignore the index but also perform a full scan on the table/clustered index!
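
You can verify what the optimizer "sees" for the Code column by looking at the statistics directly. A minimal sketch, assuming the index on Code is named IX_Blood_Type_Code (the name is illustrative):

-- Header, density vector, and histogram for the statistics behind the index
DBCC SHOW_STATISTICS ('dbo.Blood_Type', 'IX_Blood_Type_Code');

-- Or just the histogram, to compare the estimated rows for Code = 'E' vs. Code = 'F'
DBCC SHOW_STATISTICS ('dbo.Blood_Type', 'IX_Blood_Type_Code') WITH HISTOGRAM;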

While there have been significant improvements in how CEs are calculated, there will still be instances where those estimates differ, sometimes significantly, from reality, i.e., the actual row counts.

Before I get to the Extended Events, let me share a couple of queries that I use to look at the metadata for statistics and their histograms.

SELECT -- TOP 10
       SCHEMA_NAME(o.schema_id)          AS [object_owner],
       o.name                            AS [object_name],
       s.name                            AS [stats_name],
       (sp.rows_sampled * 100) / sp.rows AS percent_sampled,
       sp.*,
       s.auto_created,
       s.user_created,
       s.has_filter,
       s.is_temporary,
       s.is_incremental
FROM sys.objects o
INNER JOIN sys.stats s
    ON s.object_id = o.object_id
CROSS APPLY sys.dm_db_stats_properties(s.object_id, s.stats_id) sp
WHERE 1=1
  AND o.is_ms_shipped = 0
  AND s.auto_created = 0
--AND o.name = 'LookupCodes'
ORDER BY sp.rows DESC;

The query retrieves information about statistics for user-created objects, excluding system objects and auto-created statistics, and sorts the result by the number of rows covered by each statistic in descending order. It also calculates the percentage of rows sampled for each statistic.












Get index/statistics histograms:

SELECT TOP 100
       [schemas].name AS [schema_name],
       t.name         AS table_name,
       i.name         AS index_name,
       STATS_DATE(i.object_id, i.index_id) AS stats_update_date,
       INDEX_COL([schemas].name + '.' + t.name, i.index_id, 1) AS ind_column,
       h.*
FROM sys.tables t
INNER JOIN sys.schemas [schemas]
    ON t.schema_id = [schemas].schema_id
INNER JOIN sys.indexes i
    ON i.object_id = t.object_id
INNER JOIN sys.stats s
    ON i.object_id = s.object_id
   AND i.index_id = s.stats_id
INNER JOIN sys.stats_columns sc
    ON s.object_id = sc.object_id
   AND s.stats_id = sc.stats_id
   AND sc.stats_column_id = 1
-- INNER JOIN sys.columns c ON c.object_id = s.object_id AND c.column_id = sc.column_id
CROSS APPLY sys.dm_db_stats_histogram(i.object_id, i.index_id) h
WHERE 1=1
  AND t.is_ms_shipped = 0
-- AND i.is_unique = 0
-- AND i.index_id > 1
  AND t.name = 'tblxxx'
  AND i.name = 'idx_nc_tblxxx1';



The query retrieves detailed information about the specified non-clustered index, including its name, schema, last update date for statistics, the name of the first column in the index, and histogram data.











Now, the remainder of the article will focus on leveraging the Extended Events feature to detect these irregularities, often referred to as 'skewed cardinality estimates' in SQL Server documentation. 

I would also like to emphasize the importance of selective data capture, to avoid gathering excessive information.





Creating the extended event

(If you prefer T-SQL scripts to the GUI wizard, you will find them towards the end of this article.)

To create the extended event, connect to the SQL Server instance, expand Management > Extended Events, then right-click Sessions to launch the New Session Wizard:















Click Next on the Introduction page, if it appears on your screen.

On the next page, enter a name for the XE session; I am calling it xe_inaccurate_cardinality_estimate.


Click Next and on the next page, select Do Not Use a Template.


On the next page, make sure to first click on the Channel drop down and check Debug. This is important. Otherwise the "inaccurate_cardinality_estimate" event will not get displayed.












In the Event Library search box, type "inaccurate_cardinality_estimate" and click on the event name to select it. Click the > button to add the event.











Click Next, which brings up the option to capture Global Fields. Here, I will check the boxes next to database_id, database_name, sql_text, and plan_handle. Click Next.



In the following screen, you can apply filters to refine the results as needed, such as by specifying a filter based on the database_id. I'll be adding a filter that selects events where the actual row count exceeds 1000. Without such a filter, SQL Server might capture events even when the estimated row count is 1 and the actual count is 2. Would you consider this a significant deviation from the estimate? It certainly is from the perspective of Extended Events (XE).

To enhance the effectiveness and practicality of this approach, you may also consider adding an additional filter to capture events where the estimated row count significantly differs from the actual row count. However, for now, I'll stick with filtering based on actual row counts only.

 











Click Next, which brings up the page to select the storage location for the captured events.










Check the box in front of "Save data to a file...", which should automatically be filled in with the name of the event, "xe_inaccurate_cardinality_estimate", as the storage file name. You can enter a complete path for the event file; if you don't, the event files will by default be stored under \MSSQL\Log within the instance's installation folder.


Adjust the maximum file size and rollover settings to your preferences, then click Next.

Click Finish to create the event.
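
If you would rather not click through the wizard at all, the session can also be created with T-SQL. The sketch below approximates what the wizard produces for the choices above; the event field and action names should be verified against your SQL Server version (you can also right-click an existing session in Object Explorer and use Script Session As to get the exact definition), and the predicate mirrors the "actual rows over 1000" filter chosen earlier:

CREATE EVENT SESSION [xe_inaccurate_cardinality_estimate] ON SERVER
ADD EVENT sqlserver.inaccurate_cardinality_estimate
(
    ACTION (sqlserver.database_id,
            sqlserver.database_name,
            sqlserver.plan_handle,
            sqlserver.sql_text)
    WHERE ([actual_rows] >= 1000)            -- verify the field name on your version
)
ADD TARGET package0.event_file
    (SET filename = N'xe_inaccurate_cardinality_estimate')
WITH (MAX_DISPATCH_LATENCY = 30 SECONDS, STARTUP_STATE = OFF);
GO

-- Start the session when you are ready to capture
ALTER EVENT SESSION [xe_inaccurate_cardinality_estimate] ON SERVER STATE = START;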

Wednesday, July 12, 2023

A high level view of the most costly queries by CPU


This article is about identifying high-CPU-usage queries in SQL Server at the batch and stored procedure levels.

Understanding which SQL Server queries consume the most CPU resources is crucial for database performance tuning. However, focusing solely on individual statements might not always provide the complete picture. Instead, examining queries at a batch or stored procedure level offers a broader view, enabling more effective optimizations. This approach aggregates CPU usage across all statements within a batch or stored procedure, facilitating a targeted investigation of high-impact areas.

Overview

The goal here is to aggregate CPU usage by all statements within the same batch or stored procedure, identified by their shared 'plan_handle'. This high-level view helps in pinpointing which queries are most costly in terms of CPU resources. Later on, we can drill down to individual statements with the highest CPU consumption for detailed analysis.


Important Note: CPU time is measured in microseconds, where 1,000,000 microseconds equal one second. For demonstration purposes, this guide focuses on the top 5 queries by CPU usage, limited to plans that are cached and have been executed more than once. Adjustments may be necessary depending on the activity level and type in your specific SQL instance.


The following query aggregates CPU times for batches or stored procedures, ranking them to identify the top consumers.



IF OBJECT_ID('tempdb..#top_costly_queries') IS NOT NULL
DROP TABLE #top_costly_queries;
GO

;WITH cte AS 
(
    SELECT TOP 5 
            SUM(qs.total_worker_time) AS total_cpu_time,
            SUM(qs.execution_count) AS total_execution_count,
            SUM(qs.total_worker_time) / SUM(qs.execution_count) AS avg_cpu_time,
            SUM(qs.total_elapsed_time) / SUM(qs.execution_count) AS avg_elapsed_time,
            COUNT(*) AS number_of_statements,
            qs.plan_handle
    FROM   sys.dm_exec_query_stats AS qs
    GROUP  BY qs.plan_handle
    HAVING SUM(qs.execution_count) > 1
    ORDER  BY total_cpu_time DESC   -- without this ORDER BY, TOP 5 would return an arbitrary five plans
)
SELECT 
    DB_NAME(qp.dbid) AS db_name,
    COALESCE(QUOTENAME(DB_NAME(qp.dbid)) + N'.' + QUOTENAME(OBJECT_SCHEMA_NAME(qp.objectid, qp.dbid)) + 
             N'.' + QUOTENAME(OBJECT_NAME(qp.objectid, qp.dbid)), '') AS obj_name,
    qp.objectid,
    qp.query_plan,
    cte.*
INTO   #top_costly_queries
FROM   cte
       CROSS APPLY sys.dm_exec_query_plan(cte.plan_handle) AS qp
ORDER  BY total_cpu_time DESC;






Analyzing the Results

After identifying the batches or stored procedures with the highest CPU usage, we focus on the specific statements within them:


SELECT t.*,
       SUBSTRING(st.text, ( qs.statement_start_offset / 2 ) + 1, (
            (CASE qs.statement_end_offset 
                WHEN -1 THEN DATALENGTH(st.text)
                ELSE qs.statement_end_offset 
            END - qs.statement_start_offset ) / 2 ) + 1) AS statement_text,
       qs.total_worker_time / qs.execution_count AS avg_cpu_time_stmt,
       qs.*,
       st.*
FROM   sys.dm_exec_query_stats AS qs
       INNER JOIN #top_costly_queries AS t ON t.plan_handle = qs.plan_handle
       CROSS APPLY sys.dm_exec_sql_text(qs.sql_handle) AS st
ORDER  BY qs.total_worker_time DESC;


Partial Results:












This detailed examination helps pinpoint specific statements within the identified batches or stored procedures that warrant further investigation due to their high CPU usage.

Analyzing the results, we might find outliers indicating inefficient queries. The next step involves reviewing the execution plan of these queries, accessible via the 'query_plan' link in the query results, to identify optimization opportunities.

Conclusion


The effectiveness of SQL Server performance tuning lies in a holistic approach. While this guide focuses on CPU usage, other factors such as I/O wait times and overall server performance also play crucial roles. Regularly reviewing and optimizing high-impact queries ensures efficient resource utilization and optimal database performance. Remember, the impact of running performance tuning queries in a production environment should be carefully considered, as they can be resource-intensive themselves. Always aim for a balanced approach, taking into account the full spectrum of performance metrics.