This was an interesting problem I had recently and is well worthy of a blog post! I was dealing with a server that would stop functioning on the network roughly every two days. There was nothing extraordinary about this server, and we have quite a few with very similar configurations. The customer would reboot the server to get it functioning again, and we would then log on remotely to try to determine the cause. After a few crashes we noticed that each one was preceded by Event ID 2019: "The server was unable to allocate from the system NonPaged pool because the pool was empty."
I started watching the server's non-paged pool usage with Task Manager and Poolmon but was not able to determine what was causing the problem. At this stage I still wasn't sure whether it was a hardware or software issue, so I decided to restore the server onto one of ours in the office and let it run for two days. This was over the bank holiday weekend and, lo and behold, the restored server experienced the same issue. This was great news, because it ruled out the customer's hardware and gave me the opportunity to do further analysis.
I ran Process Explorer, Task Manager and Poolmon but still could not determine the cause (I am not sure I was using Poolmon correctly). I have had experience with analysing minidumps, so I thought it would be a good idea to get a full memory dump, but I needed a way to trigger a BSOD. In the back of my head I was thinking Sysinternals, and I found a reference to NotMyFault.exe, which has a /crash switch. I was able to use this to create a BSOD and get a much needed memory dump. You can also use Ctrl+ScrlLk+ScrlLk (hold the right Ctrl key and press Scroll Lock twice), but this must first be enabled in the registry.
Opening this memory dump (C:\WINDOWS\MEMORY.DMP) in Debugging Tools for Windows allowed me to do some further analysis. Running the !vm command gave me the following information:
1: kd> !vm
*** Virtual Memory Usage ***
Physical Memory: 524002 ( 2096008 Kb)
Page File: \??\C:\pagefile.sys
Current: 2095104 Kb Free Space: 1766344 Kb
Minimum: 2095104 Kb Maximum: 4190208 Kb
Available Pages: 178832 ( 715328 Kb)
ResAvail Pages: 439715 ( 1758860 Kb)
Locked IO Pages: 3528 ( 14112 Kb)
Free System PTEs: 234209 ( 936836 Kb)
Free NP PTEs: 319 ( 1276 Kb)
Free Special NP: 0 ( 0 Kb)
Modified Pages: 229 ( 916 Kb)
Modified PF Pages: 229 ( 916 Kb)
NonPagedPool Usage: 64932 ( 259728 Kb)
NonPagedPool Max: 65536 ( 262144 Kb)
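To put a number on how little headroom is left, here is a minimal Python sketch that parses the two NonPagedPool lines from the !vm output above. The function names (parse_vm_pages, nonpaged_headroom) are my own and not part of any debugger tooling, and the regular expression assumes the "Label: pages ( kb Kb)" layout shown in this dump.

```python
import re

# The two relevant lines, copied from the !vm output above.
VM_OUTPUT = """\
NonPagedPool Usage: 64932 ( 259728 Kb)
NonPagedPool Max: 65536 ( 262144 Kb)
"""


def parse_vm_pages(text):
    """Return {label: (pages, kb)} for lines shaped like 'Label: pages ( kb Kb)'."""
    pattern = re.compile(
        r"^(.+?):[ \t]+(\d+)[ \t]+\([ \t]*(\d+)[ \t]+Kb\)[ \t]*$",
        re.MULTILINE,
    )
    return {
        m.group(1).strip(): (int(m.group(2)), int(m.group(3)))
        for m in pattern.finditer(text)
    }


def nonpaged_headroom(text):
    """Compute how full the non-paged pool is and how much is left."""
    values = parse_vm_pages(text)
    used_pages, used_kb = values["NonPagedPool Usage"]
    max_pages, max_kb = values["NonPagedPool Max"]
    return {
        "percent_used": 100.0 * used_pages / max_pages,
        "free_pages": max_pages - used_pages,
        "free_kb": max_kb - used_kb,
    }


if __name__ == "__main__":
    print(nonpaged_headroom(VM_OUTPUT))
```

With the figures from this dump, that works out to roughly 99% of the pool in use, with only 604 pages (2416 Kb) left.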
The !vm output shows that NonPagedPool Usage (64932 pages) is very close to the NonPagedPool Max value (65536 pages). I then ran !poolused 2, which sorts the pool tags by non-paged pool consumed, and the top entry was:
kd> !poolused 2
Sorting by NonPaged Pool Consumed
Pool Used:
NonPaged Paged
Tag Allocs Used Allocs Used
AvgU 401672 86761152 0 0 UNKNOWN pooltag 'AvgU', please update pooltag.txt
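As a rough sanity check on that top entry, the sketch below splits the !poolused row into its columns and compares it with the NonPagedPool Usage figure from !vm (259728 Kb). The helper name parse_poolused_line is hypothetical, and the whitespace split assumes the tag itself contains no spaces, which holds for AvgU but not for every pool tag.

```python
# The top row from "!poolused 2" and the NonPagedPool Usage figure from "!vm".
POOLUSED_ROW = "AvgU 401672 86761152 0 0 UNKNOWN pooltag 'AvgU', please update pooltag.txt"
NONPAGED_USED_KB = 259728


def parse_poolused_line(line):
    """Split one !poolused row into tag, non-paged and paged alloc counts and bytes.

    Assumes the tag has no embedded spaces (true for 'AvgU').
    """
    fields = line.split()
    np_allocs, np_bytes, p_allocs, p_bytes = (int(x) for x in fields[1:5])
    return {
        "tag": fields[0],
        "nonpaged_allocs": np_allocs,
        "nonpaged_bytes": np_bytes,
        "paged_allocs": p_allocs,
        "paged_bytes": p_bytes,
    }


def summarise(row, nonpaged_used_kb):
    """Average allocation size and share of the non-paged pool for one tag."""
    entry = parse_poolused_line(row)
    return {
        "tag": entry["tag"],
        "avg_alloc_bytes": entry["nonpaged_bytes"] / entry["nonpaged_allocs"],
        "percent_of_nonpaged": 100.0 * entry["nonpaged_bytes"] / (nonpaged_used_kb * 1024),
    }


if __name__ == "__main__":
    print(summarise(POOLUSED_ROW, NONPAGED_USED_KB))
```

For this dump, the AvgU tag holds about 82.7 MB across 401672 allocations, which is an average of 216 bytes each and roughly a third of the non-paged pool in use. A very large number of small allocations under one tag is the sort of pattern you would expect from a leak.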
Although AvgU is an unknown pool tag, it was logical to guess that it belonged to the antivirus product AVG 9, and a reference I found confirmed it. Uninstalling AVG from our test server led to the problem disappearing.
The customer had purchased and installed AVG 9 themselves, so we told them to log a support call with AVG to get a resolution.
Getting to the root cause of this problem was very rewarding. It also highlighted the importance of being able to restore the machine elsewhere, both to rule out hardware and to allow further diagnosis.
Saturday, 17 April 2010