Here is an interesting performance case: on a Windows Server 2003 cluster, services could no longer be stopped. For example, Automatic Updates, Network Connections, and Windows Management Instrumentation remained stuck in the "Stopping" state.

After talking with the customer, we learned that whenever the problem was present, the overall performance of the server was affected. After a restart, the problem disappeared for 24 hours. A possible memory leak?!

Here is the configuration:

- Windows 2003, Enterprise Edition, EN-US, 5.2.3790 SP2
- Software: MS SQL Server Enterprise Edition, Version 9.00.3228.00, English
- 4 Node Cluster
- 32bit system

Pool Tag: _Total Pool Paged Bytes
Type: Pool_Paged
First/Average/Last/High Values: 48.18 MB / 97.81 MB / 94.48 MB / 116.73 MB

Pool Tag: _Total Pool Non-Paged Bytes
Type: Pool_Non-Paged
First/Average/Last/High Values: 40.31 MB / 51.58 MB / 52.48 MB / 62.41 MB

Boot.ini does not contain the /3GB switch:

multi(0)disk(0)rdisk(0)partition(2)\WINDOWS="Windows Server 2003, Enterprise" /fastdetect /NoExecute=OptOut
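The check for /3GB can also be scripted when several nodes need to be verified. The following is a minimal sketch, assuming the boot entry is available as a single string; the function name `boot_switches` is hypothetical, and the parsing simply treats every whitespace-separated token starting with "/" as a switch.

```python
def boot_switches(entry: str) -> set[str]:
    """Return the lower-cased switch names (without values) found in a boot.ini entry.

    Assumption: switches are whitespace-separated tokens that start with "/".
    """
    switches = set()
    for token in entry.split():
        if token.startswith("/"):
            # "/NoExecute=OptOut" -> "/noexecute"
            switches.add(token.split("=", 1)[0].lower())
    return switches


# Boot entry copied from the post.
ENTRY = (
    r'multi(0)disk(0)rdisk(0)partition(2)\WINDOWS='
    r'"Windows Server 2003, Enterprise" /fastdetect /NoExecute=OptOut'
)

found = boot_switches(ENTRY)
print(sorted(found))
print("/3GB present:", "/3gb" in found)
```

The comparison is done in lower case because boot.ini switches are often written with mixed casing (/3GB, /3gb).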

So the paged and nonpaged pools looked fine, and so did Perfmon, apart from a few disks that were responding rather slowly. Dead end! Before long, I received an email from the customer saying that the problem was occurring at that very moment. I opened a Remote Desktop session and could see that the server was barely responding. What can you do? Task Manager:
[Screenshot: Task Manager]
 
This svchost instance (PID 800) had gone haywire. Hmm... Let's see what our friend Process Explorer (a.k.a. Super Task Manager!) has to say.

[Screenshot: Process Explorer]

At first glance everything looked fine... until the Handles section: 38,693! Several services were running under that svchost instance. We decided to take a user-mode dump of this PID using ADPlus.
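The kind of check done by eye in Process Explorer can be expressed as a small filter. This is only a sketch: the `ProcInfo` type, the `handle_outliers` function, and the 10,000-handle threshold are assumptions made for illustration, and every sample row except the svchost.exe entry is hypothetical.

```python
from typing import NamedTuple


class ProcInfo(NamedTuple):
    name: str
    pid: int
    handles: int


def handle_outliers(procs, threshold=10_000):
    """Return processes whose handle count exceeds the threshold, largest first.

    The default threshold is an arbitrary illustration value, not a documented limit.
    """
    flagged = [p for p in procs if p.handles > threshold]
    return sorted(flagged, key=lambda p: p.handles, reverse=True)


SAMPLE = [
    ProcInfo("svchost.exe", 800, 38_693),   # values from the post
    ProcInfo("sqlservr.exe", 1234, 2_100),  # hypothetical
    ProcInfo("explorer.exe", 2345, 650),    # hypothetical
]

for proc in handle_outliers(SAMPLE):
    print(f"{proc.name} (PID {proc.pid}): {proc.handles} handles")
```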

Comment: 'Full dump in Hang Mode for SVCHOST.EXE_running_on_XXX'

Windows Server 2003 Version 3790 (Service Pack 2) MP (8 procs) Free x86 compatible
Product: Server, suite: Enterprise TerminalServer SingleUserTS
Machine Name:
Debug session time: Thu Nov 20 20:02:02.000 2008 (GMT+2)
System Uptime: 20 days 23:02:01.609
Process Uptime: 20 days 23:01:44.000

0:000> ~

.  0  Id: 320.324 Suspend: 1 Teb: 7ffdf000 Unfrozen
   1  Id: 320.340 Suspend: 1 Teb: 7ffdc000 Unfrozen
   2  Id: 320.344 Suspend: 1 Teb: 7ffdb000 Unfrozen
   3  Id: 320.350 Suspend: 1 Teb: 7ffd7000 Unfrozen

....

3725  Id: 320.53b4 Suspend: 1 Teb: 7f87e000 Unfrozen
3726  Id: 320.24bc Suspend: 1 Teb: 7f87c000 Unfrozen
3727  Id: 320.4cd0 Suspend: 1 Teb: 7fa22000 Unfrozen
3728  Id: 320.23e4 Suspend: 1 Teb: 7fa00000 Unfrozen
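The IDs in this listing are hexadecimal, in the form process.thread, so 320 is the same process that Task Manager showed as PID 800. A minimal sketch for parsing such lines follows; the regular expression and the name `parse_thread_line` are assumptions based only on the lines shown here.

```python
import re

# Optional "." marks the current thread; IDs are hex "pid.tid".
THREAD_LINE = re.compile(
    r"^\s*\.?\s*(\d+)\s+Id:\s*([0-9a-fA-F]+)\.([0-9a-fA-F]+)\s"
)


def parse_thread_line(line):
    """Return (index, pid, tid) as integers, or None if the line does not match."""
    match = THREAD_LINE.match(line)
    if match is None:
        return None
    index, pid_hex, tid_hex = match.groups()
    return int(index), int(pid_hex, 16), int(tid_hex, 16)


SAMPLE = """\
.  0  Id: 320.324 Suspend: 1 Teb: 7ffdf000 Unfrozen
   1  Id: 320.340 Suspend: 1 Teb: 7ffdc000 Unfrozen
3728  Id: 320.23e4 Suspend: 1 Teb: 7fa00000 Unfrozen
"""

threads = [t for t in map(parse_thread_line, SAMPLE.splitlines()) if t]
print(threads[0])   # (0, 800, 804)
print(len(threads)) # 3
```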

3728 threads... now that is quite an svchost! Most of these threads had the following stack, so this user-mode dump did not give us much additional information.

24fbf718 77ce252f rpcrt4!NdrProxySendReceive+0x43
24fbfb00 77ce25a6 rpcrt4!NdrClientCall2+0x206
24fbfb20 77c64f87 rpcrt4!ObjectStublessClient+0x8b
24fbfb30 58b75d19 rpcrt4!ObjectStubless+0xf
24fbfb5c 752ef4c2 wmiprvsd!CInterceptor_IWbemProvider::Query+0x53
24fbfbbc 752ef3d1 wbemcore!CWbemNamespace::DynAux_ExecQueryExtendedAsync+0xf1
24fbfc78 752e4896 wbemcore!CQueryEngine::ExecComplexQuery+0x1f4
24fbfd84 752e460b wbemcore!CQueryEngine::ExecQlQuery+0x37
24fbfe28 752efbaf wbemcore!CQueryEngine::ExecQuery+0x2ad
24fbfe6c 752f026b wbemcore!CWbemNamespace::Exec_CreateInstanceEnum+0x107
24fbfeac 752e9750 wbemcore!CAsyncReq_CreateInstanceEnumAsync::Execute+0x32
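Reading the stack from the bottom up, a WMI instance enumeration in wbemcore is handed to a provider through wmiprvsd, and the thread then waits inside an RPC call. When thousands of threads need to be compared, tallying the modules per stack is one way to spot the pattern. The sketch below assumes the three-column layout shown above (two hex columns, then module!function+offset); the name `modules_in_stack` is hypothetical.

```python
import re
from collections import Counter

# Two 8-digit hex columns, then module!function with an optional +0xNN offset.
FRAME = re.compile(
    r"^[0-9a-f]{8}\s+[0-9a-f]{8}\s+(\w+)!(\S+?)(?:\+0x[0-9a-f]+)?$",
    re.IGNORECASE,
)


def modules_in_stack(stack_text):
    """Count how many frames of a stack belong to each module."""
    counts = Counter()
    for line in stack_text.splitlines():
        match = FRAME.match(line.strip())
        if match:
            counts[match.group(1)] += 1
    return counts


# Stack copied from the post.
STACK = """\
24fbf718 77ce252f rpcrt4!NdrProxySendReceive+0x43
24fbfb00 77ce25a6 rpcrt4!NdrClientCall2+0x206
24fbfb20 77c64f87 rpcrt4!ObjectStublessClient+0x8b
24fbfb30 58b75d19 rpcrt4!ObjectStubless+0xf
24fbfb5c 752ef4c2 wmiprvsd!CInterceptor_IWbemProvider::Query+0x53
24fbfbbc 752ef3d1 wbemcore!CWbemNamespace::DynAux_ExecQueryExtendedAsync+0xf1
24fbfc78 752e4896 wbemcore!CQueryEngine::ExecComplexQuery+0x1f4
24fbfd84 752e460b wbemcore!CQueryEngine::ExecQlQuery+0x37
24fbfe28 752efbaf wbemcore!CQueryEngine::ExecQuery+0x2ad
24fbfe6c 752f026b wbemcore!CWbemNamespace::Exec_CreateInstanceEnum+0x107
24fbfeac 752e9750 wbemcore!CAsyncReq_CreateInstanceEnumAsync::Execute+0x32
"""

for module, frames in modules_in_stack(STACK).most_common():
    print(module, frames)
```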

We decided to obtain a full memory dump of that machine; the resources would automatically fail over to another node (it is a failover cluster, after all). This server was behind a KVM switch, was remote, and had no NMI capability (see the post about NMI).

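We resorted to the old trick of killing csrss.exe with pskill. Terminating this critical system process forces a bugcheck, and the system then writes the memory dump: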

pskill \\computer -u username -p password PID

Perfect, now we had a complete dump. After 8 hours of transfer, we could start debugging:

!PROCESS 8ed9e020 SessionId: 0 Cid: 0320 Peb: 7ffdb000 ParentCid: 01a4
DirBase: bff51160 ObjectTable: e1ee2e20 HandleCount: 46853.
Image: svchost.exe

The threads had multiplied:

5620 !lp 8ed9e020 0320 01a4 8 3866195 10.718 20.843
4:18:49:33.999* 46853 1620 57896 1437732 478136 svchost.exe

After checking a few threads (picked at random), we could see that the SQL WMI Provider was the source of this handle leak.
According to the SQL team, this is a known issue and was first addressed here:

Cumulative update package 7 for SQL Server 2005 Service Pack 2
http://support.microsoft.com/kb/949095/en-us

A change window was scheduled that same evening, and the patch was installed on all nodes. The problem did not reappear, and after one month of monitoring the case could be closed.

Yet another happy customer!

Dan Andrus
- Support Escalation Engineer / Enterprise Platforms Support (Core)