SharePoint Diagnostic Studio 2010

When troubleshooting SharePoint 2010 issues, you often need a tool that can give you performance, availability, and usage metrics over a period of time across the farm.

One such tool is Microsoft’s SharePoint Diagnostic Studio 2010, which is available free as part of the SharePoint 2010 Administration Toolkit v2.0.
Download here: http://www.microsoft.com/download/en/details.aspx?displaylang=en&id=20022

What’s interesting to note is that SharePoint Diagnostic Studio 2010 is also used by Microsoft support personnel, so when you contact Microsoft Support, depending on the nature of the issue in your SharePoint environment, you may be asked to provide the data collected by this tool for analysis.

SharePoint Diagnostic Studio 2010 saves a lot of time because it provides a single interface for collecting and reporting data from ULS logs, HTTP requests, Windows events, timer jobs, hardware specifications, performance counters, and SQL Server (deadlocks, query I/O, and an overall SQL overview).

Here is a list of Preconfigured Reports that are available in SharePoint Diagnostic Studio 2010:

Base

HTTP Requests

This report displays all HTTP requests across the entire farm. When you select a row from the top report, the full trace from the request is fetched and displayed in the bottom pane.

Windows Events

This report displays critical and SharePoint-related events from the Windows Event Logs on all machines in the farm. Use this report to look for critical issues that occurred during the specified time frame.

ULS Trace Issues

This report displays problems detected in the Unified Logging Service (ULS) trace logs. High-level traces that occur at the time of an issue might provide clues as to the root cause. When you select a row from the top report, the full trace from the request or timer job is fetched and displayed in the bottom pane.

Timer Jobs

This report displays all Timer Job executions. When you select a row from the top report, the full trace from the timer job is fetched and displayed in the bottom pane.

Performance Counters

This report shows key performance counter data over time for counters that are collected in the Usage database.


Capacity

SQL Server Query IO Over Time

This report shows expensive stored procedure I/O over time. The graph shows the five most expensive queries or stored procedures over time, based on the SQL Server Dynamic Management Views.

CPU

This report shows the percentage of processor time used by each process on each server.

Process Memory (MB)

This report shows the amount of available memory on the servers over time.


Performance

SQL Read Intensive Queries

This report shows SQL queries that read more than 50,000 pages (1 page = 8 kilobytes).

Queries that read too much data can cause SQL Server to respond slowly by forcing useful data out of memory and causing other queries to perform expensive physical reads. This can affect users that have content on the same SQL Server.
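As a rough illustration of the report’s threshold, the following sketch (a hypothetical `flag_read_intensive` helper, not part of the tool) converts logical page reads into megabytes using the 8-kilobyte page size mentioned above:

```python
PAGE_SIZE_KB = 8            # SQL Server page size (1 page = 8 KB)
READ_THRESHOLD_PAGES = 50_000  # the report's cutoff for "read intensive"

def flag_read_intensive(queries):
    """queries: list of (query_text, logical_reads_in_pages).

    Returns (query_text, megabytes_read) for queries over the threshold.
    """
    flagged = []
    for text, pages in queries:
        if pages > READ_THRESHOLD_PAGES:
            flagged.append((text, pages * PAGE_SIZE_KB / 1024))  # pages -> MB
    return flagged
```

A query reading 60,000 pages, for example, has pulled roughly 469 MB of data through the buffer pool.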

If a correlation id is present in the query text, you can use it to find the request or timer job that generated the query. Copy the correlation id into the filter field of the HTTP Requests or Timer Jobs reports.

Latency Tier Breakdown

This report shows a moving average of server-side HTTP request page latency over time. The time spent rendering the request is broken down into the three tiers that a typical HTTP request passes through during processing.
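The moving average that the graph plots can be sketched as a simple trailing window (the tool’s actual window size is not documented here, so the `window` parameter is an assumption):

```python
def moving_average(latencies, window=5):
    """Trailing moving average over per-request latencies.

    Early points use however many samples are available so far.
    """
    out = []
    for i in range(len(latencies)):
        lo = max(0, i - window + 1)          # start of the trailing window
        out.append(sum(latencies[lo:i + 1]) / (i + 1 - lo))
    return out
```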

SQL Server

If SQL Server queries take longer than 250ms, use the SQL Overview report to identify SQL Server bottlenecks.

Application Servers

If service calls take longer than 250ms, you can use the Service Call Duration column in the HTTP Requests report to find the requests most impacted by service calls.

Web Server

If there does not appear to be a bottleneck in either the SQL Server or application server tier, and requests take longer than 250ms on the Web server, use the Duration column in the HTTP Requests report to see the slowest requests overall.

You can examine the Latency All Requests report to see if the issue is isolated to a single computer.

Finally, you can examine the CPU report to determine if one or more Web servers or application servers are exhibiting excessive processor usage.

Changed Objects

This report displays all object types that have changed over a specific period of time based on information in the change log. The change log is a history of changes that have occurred in a content database. It provides search crawlers and other features a means of querying for only those changes that have occurred since a previous crawl.

Data points are collected once every k minutes (where, by default, k is 5). This report displays the data aggregated across all content databases as a stacked bar graph. Each stack represents a different object type (see the corresponding legend).
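The aggregation described above can be sketched as follows (a hypothetical helper; the tool itself reads this data from the change log, and `k` defaults to 5 minutes as noted):

```python
from collections import defaultdict

def bucket_changes(events, k=5):
    """events: list of (minute_offset, object_type) change-log entries.

    Groups changes into k-minute buckets, counting per object type --
    one stack per object type in each bar of the stacked bar graph.
    """
    buckets = defaultdict(lambda: defaultdict(int))
    for minute, obj_type in events:
        buckets[(minute // k) * k][obj_type] += 1
    return {start: dict(counts) for start, counts in buckets.items()}
```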

You can filter these results by the database or the object type. For instance, you can customize the report to show only changed objects made on a database named ‘contentdb1’, assuming it is in the filter dropdown. Similarly, you can customize the report to show only data for the changes with object type ‘List’ to see all list-level changes.

These data can contribute to an overall understanding of what types of changes are occurring around a specific timeframe.  From these data, you can further examine the HTTP Requests report to determine what requests are causing these changes, or you can look at the Changed Objects Per Database report to view the same data using a different pivot.  Additionally, you can look at the Change Types report and the Change Types Per Database report to look further into the types of changes being made to the objects.

If you just want to save the current set of results to share, click the Export button.

Changed Objects per Database

This report displays all object types that have changed in specific content databases over a specific period of time based on information in the change log. The change log is a history of changes that have occurred in a content database. It provides search crawlers and other features a means of querying for only those changes that have occurred since a previous crawl.

Change Types

This report displays all object types that have been changed over a specific period of time based on information in the change log. The change log is a history of changes that have occurred in a content database. It provides search crawlers and other features a means of querying for only those changes that have occurred since a previous crawl.

Change Types Per Database

This report displays all object types that have changed in specific content databases over a specific period of time based on information in the change log. The change log is a history of changes that have occurred in a content database. It provides search crawlers and other features a means of querying for only those changes that have occurred since a previous crawl.

Latency, All Requests

This report plots the duration of all requests (up to a limit of 50,000).

Use this report to spot abnormal patterns in usage. For example, a poorly performing site might consistently take 5 seconds to load, which would present as a horizontal band at the 5 second mark. For a more detailed view, you can zoom in to a smaller area, then go to the HTTP Requests report and look for requests taking around 5 seconds.

Latency spikes will appear as columns. If the spikes have a regular period, you might look at the Timer Jobs report to see if a particular job runs during the same time.

Latency Percentiles

This report shows several key percentile thresholds over a period of time to give you an idea of how many requests are affected by a particular latency spike.

For example, if the fastest 25% of all requests are taking a second or more, then it is likely that an outage in some shared resource (such as the network or SQL Server computer) is affecting all requests. Use the Latency Tier Breakdown report to look for issues in shared resources.

On the other hand, if 75% of all requests complete quickly, but the 95th percentile is very high, then you might need to look for a root cause that affects a smaller number of requests, such as blocking in a single database, or custom code that is only used by a subset of sites.
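The percentile thresholds the report charts can be reproduced from raw request durations with the standard library (a sketch; the tool computes these server-side from the Usage database):

```python
import statistics

def latency_percentiles(durations_ms):
    """Return the 25th, 50th, 75th, and 95th percentile latencies.

    statistics.quantiles with n=100 yields the 99 percentile cut points;
    index p-1 is the p-th percentile.
    """
    cuts = statistics.quantiles(sorted(durations_ms), n=100, method="inclusive")
    return {p: cuts[p - 1] for p in (25, 50, 75, 95)}
```

A healthy 75th percentile combined with a very high 95th percentile is the pattern described above: most requests are fine, but a small subset is badly affected.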

To see logs for the slowest requests, you can view the HTTP Requests report and sort the list by clicking on the header of the Duration column.

You may also want to use the usage reports like Requests Per User and Application Workload to look for users or applications that are placing unexpected load on the system.

SQL Deadlocks

This report lists SQL deadlocks. SQL Server uses deadlock detection to prevent the server from hanging when two incompatible queries are executed. To resolve a deadlock, one or more of the queries are canceled. SharePoint is able to recover from some deadlocks and retry. However, deadlocks can also cause some requests to fail.

SQL Blocking

This report lists SQL queries that have blocked other SQL queries.

Blocking can halt all activity on the farm. When blocked requests cannot be processed by the affected database, all available Web server memory will eventually be consumed, causing the affected servers to stop responding or crash.

This report displays the request or timer job responsible for generating the blocking query, if possible, along with any associated logs. These can be useful if the block is caused by a specific end user transaction. In such cases, restructuring a list or redesigning an application that uses custom queries may be indicated.

Some blocking cannot be avoided. For example, nightly database maintenance tasks necessarily lock large parts of a database.

Availability

Availability

This report charts the availability of the HTTP Web Service. Drops in availability indicate periods when your users may have been unable to access their SharePoint sites.

This report calculates availability by dividing the number of successful web requests by the total number of requests sent to the server. An attempt is made to remove requests coming from automated agents, such as a search crawler, from this calculation. However, some unknown automated agents may not be excluded.
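The calculation amounts to the following sketch (a hypothetical helper; the tool derives these counts from the Usage database, and its crawler-exclusion logic is more involved than a simple subtraction):

```python
def availability(total_requests, failed_requests, crawler_requests):
    """Percentage of user requests that succeeded.

    Known automated-agent traffic (e.g. the search crawler) is removed
    from the denominator before computing the ratio.
    """
    user_total = total_requests - crawler_requests
    if user_total <= 0:
        return 100.0  # no user traffic in this interval
    return 100.0 * (user_total - failed_requests) / user_total
```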

Zoom into a period of low availability by selecting it with your mouse. Subsequent reports used in this investigation will load faster if you select a smaller time range.

Once you’ve narrowed down the time range, you might want to use the Failed User Requests report to examine details about requests that failed in the selected time period.

Crashes reduce availability by terminating the process without allowing requests to complete. Because the process doesn’t have an opportunity to write logs during a crash, requests that were running at the time of the crash will not appear in these reports, and their impact on availability will not be displayed in the graph. Regardless, crashes should be investigated.

Scheduled Worker Process Recycles rarely reduce availability. The server will attempt to gracefully allow requests from one process to complete while simultaneously starting another process to handle new requests. Frequent, unscheduled recycles during periods of high traffic may cause some requests to fail if the server cannot respond to the increased demands of running multiple processes in parallel.

SQL Overview Report

This report displays information that can help you understand the overall health of the SQL Server computers in your farm.

SQL Server Locking/Blocking

SQL Server query blocking can increase some SQL Server query duration values, and might contribute to availability issues and increased latency.

• Average Lock Wait Time: Locks are held on SQL Server resources, such as rows read or modified during a transaction, to prevent concurrent use of resources by different transactions. For example, an update will hold an XLOCK and it will block a shared read lock. A high lock wait time means there is a blocking issue in the SQL Server tier, and you should pay attention to slow updating threads, as they will block reads.
• Average Latch Wait Time: A latch is primarily used to synchronize database pages. Each latch is associated with a single allocation unit. A latch wait occurs when a latch request cannot be granted immediately because the latch is held by another thread in a conflicting mode. Unlike locks, a latch is released immediately after the operation, even in write operations. High latch wait time might mean it is taking too long to load a specific page into memory.

When Lock Wait Time is high, examine the SQL Blocking report to identify the queries holding onto the locks.

You can examine the SQL Deadlocks report to identify queries that may have generated failed requests.

SQL Server Disk IO

A common SQL Server performance issue is an I/O bottleneck. When SQL Server does not have sufficient I/O bandwidth to process incoming queries, performance across all requests will degrade, and performance across all farm Web servers will suffer.

• Average Disk Queue Length: This metric is for overall Disk I/O. Higher values translate to higher overall I/O pressure, and if you have higher than 10, it is possible there is an I/O bottleneck.
• Average Logical Reads / s: This metric is for the Read Disk I/O. Higher values translate to higher Read I/O pressure.
• Average Logical Writes / s: This metric is for the Write Disk I/O. Higher values translate to higher Write I/O pressure.

When there is an I/O bottleneck, examine the SQL Read-Intensive Traces report to see what specific queries are consuming the most resources.

SQL Server CPU

When SQL Server computer processor usage is excessively high, SQL queries are queued, and Web server performance degrades. Processor and I/O performance are related, so when SQL Server processor usage is high, I/O is usually high as well. An average processor usage of 80% is considered a bottleneck.
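The two rule-of-thumb thresholds the report descriptions give (a sustained disk queue length above 10, and average processor usage at or above 80%) can be combined into a quick triage check (a hypothetical helper, not part of the tool):

```python
def sql_bottlenecks(avg_disk_queue_length, avg_cpu_percent):
    """Apply the rough thresholds from the SQL Overview report text.

    Returns the list of suspected bottleneck types, possibly empty.
    """
    issues = []
    if avg_disk_queue_length > 10:   # overall I/O pressure heuristic
        issues.append("I/O")
    if avg_cpu_percent >= 80:        # CPU bottleneck heuristic
        issues.append("CPU")
    return issues
```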

When there is a CPU bottleneck, take a look at the SQL Read-Intensive Traces report and click the CPU column to sort by the most expensive queries.

Worker Process Recycles

Recycles do not usually affect availability. IIS will create a new process, gracefully allow existing requests to complete, and then shut down the recycled process cleanly. However, the first browse to a new process can be delayed while the process is initialized.

By default, SharePoint schedules worker process recycle jobs to take place overnight. Frequent recycles during working hours can increase the latency of end user requests. Check to see if web.config settings may have been changed, or if the recycle settings have been modified in IIS.

Failed User Requests

These user requests failed, or they were so slow that users may have assumed they failed.

Select a failed request to fetch its trace logs. Look for traces that mention a failure in some component of the system. If the cause isn’t apparent, look at the Windows Events report for signs of a system failure on the machine or in IIS.

If a request failed because it was too slow, look for a gap in the log, which may be highlighted. If the lines preceding the gap indicate that the delay occurred in SQL Server, this request was most likely a lock victim. Look at the SQL Blocking report to find the blocking query that is the root cause of the issue.

Some requests, such as downloads of large files, may be expected to be slow.

Crashes

This report displays all of the IIS worker process crashes that occurred in the specified time range. After a row is selected in the top report, the last few seconds of traces from the crashing process are displayed in the bottom panel. These traces may provide an indication of why the crash occurred.

Crashes can dramatically affect availability. The availability report may underestimate the impact of crashes because requests that are being executed at the time of a crash are not recorded. Even when a crash doesn’t noticeably affect availability, it can lead to data loss or other problems and should be investigated.

Usage

Requests Per URL

This report displays the most commonly requested URLs.

Requests Per User

This report displays the percentage of requests made by the most common user accounts. Some system accounts, such as the search crawler service account, may be expected to generate a lot of requests. At certain times, individual users may also perform operations that create an unexpected peak in resource usage.

Application Workload

This report displays the amount of time spent serving requests from a variety of client applications in a given time range.  The table provides an estimate of which resources are being consumed by the client’s requests.

High total durations require additional memory on the Web front ends, which is consumed for as long as the request is executing.

High SQL durations imply high SQL IO or CPU use, or the clients may simply execute a lot of requests and are blocked behind other queries.

High Web server durations may suggest high CPU use in the front-end tier.

Requests Per Site

This report displays the percentage of requests made to each site in the farm.


– The above information on the SharePoint Diagnostic Studio 2010 reports has been taken from the tool and is accurate at the time of writing.
