Newsletter

 

For a Free Quote...

Latest Blog Posts

Telnet Network News

Telnet Network News - We'll keep you up to date with what's happening in the industry.
5 minutes reading time (981 words)

The 3 Most Important KPI's to Monitor On Your Windows Servers

b2ap3_thumbnail_Monitoring_Networks.jpgMuch like monitoring the heath of your body, monitoring the health of your IT systems can get complicated. There are potentially hundreds of data points that you could monitor, but I am often asked by customers to help them decide what they should monitor. This is mostly due to there being so many available KPI options that can be implemented.

However, once you begin to monitor a particular KPI, then to some degree you are implicitly stating that this KPI must be important (since I am monitoring it) and therefore I must also respond when the KPI creates an alarm. This can easily (and quickly) lead to “monitor sprawl” where you end up monitoring so many data point and generating so many alerts that you can’t really understand what is happening – or worse yet – you begin to ignore some alarms because you have too many to look at.

In the end, one of the most important aspects of designing a sustainable IT monitoring system is to really determine what the critical performance indicators are, and then focus on those. In this blog post, I will highlight the 3 most important KPI's to monitor on your windows servers. Although, as you will see, these same KPI’s would be suited for any server platform.

1. Processor Utilization

Most monitoring systems have a statically defined threshold for processor utilization somewhere between 75% and 85%. In general, I agree that 80% should be the “simple” baseline threshold for core utilization.

However, there is more than meets the eye to this KPI. It is very common for a CPU to exceed this threshold for a short period of time. Without some consideration for the length of time that this mark is broken, a system could easily generate a large number of alerts that are not actionable by the response team.

I usually recommend a “grace period” of about 5 minutes before an alarm should be created. This provides enough time for a common CPU spike to return to an OK state, but is also short enough that when a real bottleneck occurs due to CPU utilization, the monitoring team is alerted promptly.

It is also important to take into consideration the type of server that you are monitoring. A well scoped out VM should in fact see high average utilization. In that case, it may be useful to also monitor a value like the total percentage interrupt time. You may want to alarm when total percentage interrupt time is greater than 10% for 10 minutes. This value, combined with the standard CPU utilization mentioned above can provide a simple but effective KPI for CPU health.

2- Memory Utilization

Similar to CPU, memory bottlenecks are usually considered to take place at around 80% memory utilization. Again, memory utilization spikes are common enough (especially in VM’s) that we want to allow for some time before we raise an alarm. Typically, memory utilization over 80-85% for 5 minutes is a good criteria to start with.

This can be adjusted over time as you get to understand the performance of particular servers or groups of servers. For example, Exchange servers typically have a different memory usage pattern compared to Web servers or traditional file servers. It is important to baseline these various systems and make appropriate deviations in the alert criteria for each.

The amount of paging on a server is also a memory related KPI which is important to track. If your monitoring system is able to track memory pages per second, then I recommend also including this KPI in your monitoring views. Together with standard committed memory utilization these KPI’s provide a solid picture of memory health on a server.

3- Disk Utilization

Disk Drive monitoring encompasses a few different aspects of the drives. The most basic of course is drive utilization. This is commonly measured as an amount of free disk space (and not as an amount of used disk space).

This KPI can should be measured both as a percentage of free space – 10% is the most common threshold I see – as well as an absolute value, for example 200MB free. Both of these metrics are important to watch and should have individual alerts associated with their capacity KPI. It is also key to understand that a system drive might need a different threshold as compared to nonsystem drives.

A second aspect of HDD performance is the KPI’s associated with the time it takes for disk reads and writes. This is commonly described as “average disk seconds per transfer” although you may see this described in other terms. In this case the hardware that is used greatly influences the correct thresholds for such a KPI, so I cannot make a recommendation here. However, most HDD manufacturers will provide a KPI for their drives that is appropriate. You can usually find information on the vendors website for your specific drives.

The last component of drive monitoring seems obvious, but I have seen many monitoring systems that unfortunately ignore it (usually because it is not enabled by default and nobody ever thinks to check) and that is pure logical drive availability. For example checking the availability on a server of the C:\ , D:\ and E:\ Drives (or whatever should exist). This is simple, but can be a lifesaver when a drive is lost for some reason and you want to be alerted quickly.

Summary:

In order to make sure that your Windows servers are fully operational, there are few really critical KPIs that I think you should focus on. By eliminating some of the “alert noise” you can make sure that important alerts are not lost.

Of course each server has some application / service functions that also need to be monitored. We will explore the best practices for server application monitoring in a further blog post.

http://www.telnetnetworks.ca/en/contact-us.html

Thanks to NMSaaS for the article. 

×
Stay Informed

When you subscribe to the blog, we will send you an e-mail when there are new updates on the site so you wouldn't miss them.

Infosim® Global Webinar Day - How to prevent - Or ...
Bell Rolls Out 290Mbps Tri-band LTE-A, Claiming No...

Related Posts

 

Comments

No comments made yet. Be the first to submit a comment
Saturday, 30 November 2024

Captcha Image

Contact Us

Address:

Telnet Networks Inc.
4145 North Service Rd. Suite 200
Burlington, ON  L7L 6A3
Canada

Phone:

(800) 561-4019

Fax:

613-498-0075

For More Information about Telnet Networks, our products, or our services, or to request a quote please feel free to contact us directly.