Troubleshooting Server Performance
The specific issue discussed below may also be useful as a general example of troubleshooting and performance monitoring.
Problem: After upgrading to Citrix Presentation Server 4.5, we observe higher average CPU utilization and a high rate of context switches. Before the upgrade we often received warnings in Citrix Performance Monitor for %interrupt; this issue continues and is perhaps seen more often on the 4.5 servers.
Background: We run PS4.5 with published applications and desktops on a Microsoft Windows 2003 SP2 server on a physical machine. We also run several "high maintenance" accounting applications as published applications on two PS4.5 servers, which are virtual machines on a VMware Virtual Infrastructure 3.0 cluster. All of these have exhibited the symptoms above only since the upgrade to 4.5. We are also still running 4.0 on several other servers in the same Citrix farm, and client machines use various versions of PNA (predominantly 8.x).
Investigation regarding context switches
A lot of good resources turned up:
Intel: Using Windows Performance Monitor
Analyzing Processor Activity
Since this issue occurs on both physical and virtual servers, it does not appear to be a VM problem, but I will investigate this avenue as well to ensure correct and optimal configuration.
VMware: improving scalability for Citrix PS
- definition: CPUs share their time among all threads according to priority. When a CPU stops working on one thread and starts working on another, that is a context switch.
- monitoring: As a ballpark rule of thumb, there should "normally" be no more than 28000 context switches per CPU on a system.
- What to look for
- Page file - too small, or allowed to grow dynamically - recommendation: set to a larger fixed size.
- Consider write cache on RAID controller
- insufficient hardware
- poorly designed device drivers or applications
- PerfMon - system/context switches
- SysInternals - Process Explorer - View > select columns > Process Performance > context switches, context switch delta
- pstat.exe (Windows Resource Kit or Support Tools)
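The rule of thumb above can be turned into a quick check against a reading from the PerfMon System\Context Switches/sec counter. This is only a minimal sketch: it assumes the 28000 figure refers to a per-second rate per CPU (the notes do not say so explicitly), and the function names and sample readings are made up for illustration.

```python
# Rule-of-thumb figure quoted in the notes above.
# Assumption: it is a per-second rate for each CPU.
RULE_OF_THUMB_PER_CPU = 28000


def context_switch_ratio(switches_per_sec, cpu_count,
                         limit_per_cpu=RULE_OF_THUMB_PER_CPU):
    """Return the observed rate as a multiple of the rule-of-thumb ceiling."""
    if cpu_count < 1:
        raise ValueError("cpu_count must be at least 1")
    return switches_per_sec / (cpu_count * limit_per_cpu)


def is_excessive(switches_per_sec, cpu_count,
                 limit_per_cpu=RULE_OF_THUMB_PER_CPU):
    """True when the system-wide rate exceeds the ceiling for this CPU count."""
    return context_switch_ratio(switches_per_sec, cpu_count, limit_per_cpu) > 1.0


if __name__ == "__main__":
    # Hypothetical reading: 300000 switches/sec on a 2-CPU server.
    reading, cpus = 300000, 2
    print("ratio to ceiling:", round(context_switch_ratio(reading, cpus), 1))
    print("excessive:", is_excessive(reading, cpus))
```

A ratio well above 1 only says the rate is unusual. Finding the thread or driver responsible still needs the per-process and per-thread tools listed above.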
Some asides that came up during this investigation explained some issues we have had with virtualizing Citrix servers. We needed to keep 2 CPUs in each VM after converting the servers, which is the opposite of the VMware recommendations we have seen.
- The multiprocessor HAL had not been downgraded to single processor HAL.
- Hidden devices in device manager had not all been removed.
1. Click Start, click Run, type cmd.exe, and then press ENTER.
2. Type set devmgr_show_nonpresent_devices=1, and then press ENTER.
3. Type Start DEVMGMT.MSC, and then press ENTER.
4. Click View, and then click Show Hidden Devices.
5. Expand the Network Adapters tree.
6. Right-click the dimmed network adapter, and then click Uninstall.
Also uninstall any other physical devices that are not needed.
- Interesting: on the VM servers, the %CPU values listed in Task Manager for the individual processes of all users did not appear to add up to what the Performance tab showed (a discrepancy of at least 50%). This was not observed on the physical server.
- For both VMs and physical servers, Citrix Performance Monitor was showing warnings and intermittent error conditions on %CPU, %interrupt, and context switches/sec.
- The VMs' CPU utilization on the host machine is extremely high. The server with the greatest number of users maxed out the host CPU for much of the time I watched it.
- Watching Performance Monitor for a few minutes showed context switches/sec in the hundreds of thousands.
- I opened Process Explorer and set the view to show context switches and context switch deltas. At times it reported that up to 50% of CPU was due to hardware interrupts. (This was not as dramatic when I checked the physical machine, so I wonder whether it is a reporting issue related to VMware's magic behind the scenes.) The highest context switch delta was also for hardware interrupts, so Process Explorer was no help in isolating it further.
- To isolate which driver or program might be causing this issue, I redirected the output of pstat.exe to a file and looked for the thread with the highest count of context switches. I took the memory address of that thread and looked it up in the bottom section of the output to find which module's address range it fell in. In this case it was CDM.SYS.
- A Google search for CDM.SYS turned up multiple articles about Citrix servers. I believe CDM stands for Client Drive Mapping. Of greatest interest is an article about a hotfix for PS4.5:
http://support.citrix.com/article/CTX114121 (I see a lot of other post-FR1 hotfixes out there too.)
The issue resolved in this hotfix is:
"Winlogon.exe shows higher than average CPU consumption on the server. The issue occurs because the server refreshes the smart card reader state more frequently than necessary. This occurs even if smart cards are not being used. With this fix, the reader state is refreshed only once per noticeable event."
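The address lookup from the pstat.exe step above (take the busiest thread's address, then find which module's address range contains it) can be sketched as follows. This is only an illustration of the technique: it assumes the thread and module lists have already been read out of the pstat.exe output by hand, the tuple layouts and function names are hypothetical, and every address, size, and count in the sample data is made up.

```python
from bisect import bisect_right


def busiest_thread(threads):
    """threads: list of (address, context_switches). Return the busiest one."""
    return max(threads, key=lambda t: t[1])


def find_module(address, modules):
    """modules: list of (name, load_address, size_in_bytes).

    Return the name of the module whose range contains address, or None.
    """
    ordered = sorted(modules, key=lambda m: m[1])
    bases = [m[1] for m in ordered]
    # Index of the last module that loads at or below the address.
    i = bisect_right(bases, address) - 1
    if i < 0:
        return None
    name, base, size = ordered[i]
    return name if address < base + size else None


if __name__ == "__main__":
    # Made-up sample values, for illustration only.
    modules = [
        ("ntoskrnl.exe", 0x80800000, 0x200000),
        ("CDM.SYS", 0xF7000000, 0x10000),
    ]
    threads = [(0x80812345, 1200), (0xF7001234, 950000)]
    address, switches = busiest_thread(threads)
    print(hex(address), switches, find_module(address, modules))
```

Sorting the modules by load address keeps the lookup to a single binary search, which is handy when several threads need to be checked against a long driver list.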