The FTP server hanged in the load test and used about 2 GB memory. In my previous investigation, the FTP server hang was due to the deadlock triggered by the hang in PostgreSQL worker processes. As to high memory usage, it probably was because the deadlock blocked the finalizer thread. However, when I dig into the dump, I did not find any deadlock this time. Then I noticed that even there was not any connection to the server, it still used about 50% CPU.
0:013> !runaway 7
User Mode Time
Thread Time
13:834 0 days 12:35:08.281
14:f34 0 days 0:00:06.328
10:8b0 0 days 0:00:01.718
0:9f8 0 days 0:00:01.687
9:ed0 0 days 0:00:01.562
7:478 0 days 0:00:00.296
6:878 0 days 0:00:00.203
8:ebc 0 days 0:00:00.140
4:7ec 0 days 0:00:00.078
5:994 0 days 0:00:00.031
12:804 0 days 0:00:00.000
11:8a8 0 days 0:00:00.000
3:858 0 days 0:00:00.000
2:9d8 0 days 0:00:00.000
1:670 0 days 0:00:00.000
Kernel Mode Time
Thread Time
13:834 8 days 16:27:26.875
10:8b0 0 days 0:00:49.500
14:f34 0 days 0:00:20.546
9:ed0 0 days 0:00:14.187
0:9f8 0 days 0:00:01.328
7:478 0 days 0:00:00.437
6:878 0 days 0:00:00.265
8:ebc 0 days 0:00:00.156
4:7ec 0 days 0:00:00.125
2:9d8 0 days 0:00:00.093
5:994 0 days 0:00:00.046
3:858 0 days 0:00:00.046
12:804 0 days 0:00:00.015
11:8a8 0 days 0:00:00.000
1:670 0 days 0:00:00.000
Elapsed Time
Thread Time
0:9f8 10 days 2:46:48.317
2:9d8 10 days 2:46:48.270
1:670 10 days 2:46:48.270
3:858 10 days 2:46:47.708
4:7ec 10 days 2:46:47.692
5:994 10 days 2:46:46.833
6:878 10 days 2:46:46.817
7:478 10 days 2:46:44.301
8:ebc 10 days 2:46:44.161
9:ed0 10 days 2:46:44.098
10:8b0 10 days 2:46:43.958
11:8a8 10 days 2:46:43.911
12:804 10 days 2:36:43.301
13:834 10 days 0:20:23.708
14:f34 10 days 0:11:55.114
Thread 13 used most of the kernel and user mode time. It caused the starvation of the other threads. Then I displayed the unmanaged and managed stack traces of this thread along with its registers.
0:013> r
eax=00000002 ebx=00002000 ecx=00041000 edx=775f9a94 esi=00260000 edi=00040000
eip=77610191 esp=77588ee8 ebp=77588f00 iopl=0 nv up ei pl nz na pe nc
cs=001b ss=0023 ds=0023 es=0023 fs=003b gs=0000 efl=00000206
ntdll!RtlpExtendHeap+0x10d:
77610191 23c7 and eax,edi
0:013> kb 100
ChildEBP RetAddr Args to Child
77588f00 775df8c1 001fa000 00040000 00277602 ntdll!RtlpExtendHeap+0x10d
77588fe8 77608752 0003fff8 00040000 00000000 ntdll!RtlpAllocateHeap+0x7db
77589060 775df258 00260000 00800000 0003fff8 ntdll!RtlAllocateHeap+0x1e3
775890a8 77607cd1 002768ca 00260000 00000000 ntdll!RtlpAllocateUserBlock+0xa2
77589120 77605955 776058a6 00000000 77094de9 ntdll!RtlpLowFragHeapAllocFromContext+0x785
7758918c 75e21859 00260000 00000000 00000f30 ntdll!RtlAllocateHeap+0x17c
775891c8 7568ccec 00000001 7568334c 00000000 nsi!NsiAllocateAndGetTable+0xb8
7758922c 7568d07c 049107c8 77589264 00000001 IPHLPAPI!GetIfTable+0x5a
7758924c 7568d0f3 70a0c704 00000001 04910000 IPHLPAPI!InternalGetTable+0x33
7758926c 70a0235b 70a0c704 04910000 00000000 IPHLPAPI!InternalGetIfTable+0x1f
77589280 70a016d6 d6f74c01 775893b0 00000001 inetmib1!LoadIfTable+0x70
775892b4 70a048f7 00000001 775893b4 77589308 inetmib1!UpdateCache+0x51
775892c4 72a8380d 000000a1 06ad9c68 775893b4 inetmib1!MibGetIfEntry+0xd
77589308 72a8415a 06ac18d0 775893b0 775893b4 snmpapi!ProcessQuery+0x26
77589330 72a842e7 0560cf58 00000000 775893a8 snmpapi!ProcessQueryList+0x48
77589358 70a02072 0560cf58 000000a1 775893a8 snmpapi!SnmpTfxQuery+0x41
77589374 00b11431 000000a1 775893a8 775893b0 inetmib1!SnmpExtensionQuery+0x1c
775894fc 00b136e2 072101f0 03202ed0 77589668 ftpaccess!getmacs+0x381 [d:\fts\licensing\client\sdks\client sdk\ipsa13\source\hardware.c @ 108]
77589634 00b14413 072101f0 03202ed0 77589704 ftpaccess!license_checkhw+0x202 [d:\fts\licensing\client\sdks\client sdk\ipsa13\source\license.c @ 184]
77589648 00b1506b 77589668 9c569d50 03202ed0 ftpaccess!license_initA+0x23 [d:\fts\licensing\client\sdks\client sdk\ipsa13\source\license.c @ 747]
77589704 00b14fd4 072101f0 00000001 032c0cb0 ftpaccess!CIpswitchLicense::Validate+0x6b [d:\fts\licensing\client\sdks\client sdk\ipsa86\ipswitchlicense.cpp @ 708]
77589728 00b19e29 77589954 7758974c 9c569c34 ftpaccess!CIpswitchLicense::Validate+0xc4 [d:\fts\licensing\client\sdks\client sdk\ipsa86\ipswitchlicense.cpp @ 689]
77589b68 00b1a4f5 77589ce0 77589ccc 06abe000 ftpaccess!IFTPLicense_Impl::Validate+0xc9 [d:\fts\ftpaccess\src\ftplicense_impl.cpp @ 183]
77589c84 00b1a6a0 77589d14 00000104 77589ccc ftpaccess!InitializeLicensing+0x95 [d:\fts\ftpaccess\src\ftplicense_impl.cpp @ 1077]
77589d38 717d609d 77589d54 717d6041 011d2548 ftpaccess!IFTPCreateLicense+0x40 [d:\fts\ftpaccess\src\ftplicense_impl.cpp @ 1168]
0:013> !clrstack
OS Thread Id: 0x834 (13)
ESP EIP
77589ccc 77610191 [NDirectMethodFrameStandalone: 77589ccc] .IFTPCreateLicense(Char*, UInt32)
77589cdc 035865a2 UtilAPI.Core.IsOKToRun()
77589ce0 0358625f [InlinedCallFrame: 77589ce0]
77589f58 03585cdc SSHServerAPI.Transport.FTPTransport.ProcessCommand()
77589fa4 03585499 SSHServerAPI.Transport.FTPTransport.ProcessCommand(Byte[], Int32)
77589fc0 03588b74 SSHServerAPI.Transport.FTPTransport.OnPlainData(Byte[], SSLApi.SSLClient)
77589fd4 03588a9c SSLApi.SSLServer.DecryptDataHelper(SSLApi.SSLClient)
7758e7fc 03587a47 SSLApi.SSLServer.DecryptData(Byte[], Int32, SSLApi.SSLClient)
7758e820 035878bc SSHServerAPI.Transport.FTPTransport.ReadCallback(System.IAsyncResult)
7758e858 70f5ee09 System.Net.LazyAsyncResult.Complete(IntPtr)
7758e890 70f61eba System.Net.ContextAwareResult.CompleteCallback(System.Object)
7758e898 717d01ef System.Threading.ExecutionContext.Run(System.Threading.ExecutionContext, System.Threading.ContextCallback, System.Object)
7758e8b0 70f625b1 System.Net.ContextAwareResult.Complete(IntPtr)
7758e8c8 70f5ed70 System.Net.LazyAsyncResult.ProtectedInvokeCallback(System.Object, IntPtr)
7758e8f8 70f624b4 System.Net.Sockets.BaseOverlappedAsyncResult.CompletionPortCallback(UInt32, UInt32, System.Threading.NativeOverlapped*)
7758e930 7177cdc4 System.Threading._IOCompletionCallback.PerformIOCompletionCallback(UInt32, UInt32, System.Threading.NativeOverlapped*)
7758ead0 72bb1b4c [GCFrame: 7758ead0]
From the stack trace, it seemed that thread 13 was blocked in the heap manager.
0:013> uf ntdll!RtlpExtendHeap
...
ntdll!RtlpExtendHeap+0xf1:
77610175 8b4d0c mov ecx,dword ptr [ebp+0Ch]
77610178 81c100100000 add ecx,1000h
7761017e 394d08 cmp dword ptr [ebp+8],ecx
77610181 7431 je ntdll!RtlpExtendHeap+0x130 (776101b4)
ntdll!RtlpExtendHeap+0xff:
77610183 d16d08 shr dword ptr [ebp+8],1
77610186 394d08 cmp dword ptr [ebp+8],ecx
77610189 7303 jae ntdll!RtlpExtendHeap+0x10a (7761018e)
ntdll!RtlpExtendHeap+0x107:
7761018b 894d08 mov dword ptr [ebp+8],ecx
ntdll!RtlpExtendHeap+0x10a:
7761018e 8b4640 mov eax,dword ptr [esi+40h]
77610191 23c7 and eax,edi ; current IP
77610193 f7d8 neg eax
77610195 1bc0 sbb eax,eax
77610197 83e03c and eax,3Ch
7761019a 83c004 add eax,4
7761019d 50 push eax
7761019e 53 push ebx
7761019f 8d4508 lea eax,[ebp+8]
776101a2 50 push eax
776101a3 6a00 push 0
776101a5 8d45fc lea eax,[ebp-4]
776101a8 50 push eax
776101a9 6aff push 0FFFFFFFFh
776101ab e8b87bfeff call ntdll!ZwAllocateVirtualMemory (775f7d68)
776101b0 85c0 test eax,eax
776101b2 7cc1 jl ntdll!RtlpExtendHeap+0xf1 (77610175)
...
I unassembled RtlpExtendHeap function. There is indeed a loop. So thread 13 has been spinning in a loop most of the time. Then the question is why this thread has been spinning in an infinite loop.
0:013> !gle
LastErrorValue: (Win32) 0x8 (8) - Not enough storage is available to process this command.
LastStatusValue: (NTSTATUS) 0xc0000017 - {Not Enough Quota} Not enough virtual memory or paging file quota is available to complete the specified operation.
So it seemed that due to the high memory usage, the heap manager failed to allocate memory and somehow ran into an infinite loop. It caused the starvation of the other threads and thereby the service hang. Then the next question is what caused the high memory usage. The usual answer is memory leak. Then which code caused the memory leak -- unmanged code, managed code, or both?
0:013> !eeheap -gc
Number of GC Heaps: 1
generation 0 starts at 0x01343340
generation 1 starts at 0x01343334
generation 2 starts at 0x011d1000
ephemeral segment allocation context: none
segment begin allocated size
011d0000 011d1000 0134334c 0x0017234c(1516364)
Large object heap starts at 0x021d1000
segment begin allocated size
021d0000 021d1000 021d68d0 0x000058d0(22736)
Total Size 0x177c1c(1539100)
------------------------------
GC Heap Size 0x177c1c(1539100)
Since the managed heap only used about 1.5 MB memory, the memory leak did not come from the managed code.
0:013> !heap -s
LFH Key : 0x1ed5084c
Termination on corruption : DISABLED
Heap Flags Reserv Commit Virt Free List UCR Virt Lock Fast
(k) (k) (k) (k) length blocks cont. heap
-----------------------------------------------------------------------------
00260000 00000002 1921552 1921552 1921552 43 177 262 0 ec LFH
00010000 00008000 64 4 64 2 1 1 0 0
000c0000 00001002 1088 136 1088 23 7 2 0 0 LFH
00790000 00041002 256 28 256 2 5 1 0 0
00050000 00001002 1088 968 1088 3 9 2 0 0 LFH
00b40000 00001002 64 32 64 30 1 1 0 0
011c0000 00001002 7232 5504 7232 1385 130 4 0 1c LFH
External fragmentation 25 % (130 free blocks)
032e0000 00001002 256 128 256 2 10 1 0 5 LFH
00c40000 00001002 64 12 64 4 2 1 0 0
035a0000 00001002 256 92 256 2 2 1 0 0 LFH
04d70000 00001002 1280 628 1280 224 44 2 0 2 LFH
04f90000 00001002 1088 628 1088 112 25 2 0 6 LFH
04de0000 00011002 256 4 256 1 2 1 0 0
052d0000 00001002 64 4 64 2 1 1 0 0
054d0000 00001002 64 4 64 2 1 1 0 0
05200000 00001002 64 16 64 13 1 1 0 0
053f0000 00001002 256 116 256 5 5 1 0 3 LFH
05f00000 00001003 256 4 256 2 1 1 0 bad
060c0000 00001003 256 4 256 2 1 1 0 bad
04910000 00001002 64 4 64 1 1 1 0 0
-----------------------------------------------------------------------------
One heap used about 1.9 GB memory. So it seemed that the memory leak was from the unmanaged code.
0:013> !handle 0 f
...
245442 Handles
Type Count
Event 63093
Section 16
File 125795
Directory 2
Mutant 4
Semaphore 36
Key 56460
Token 1
Thread 27
IoCompletion 7
Timer 1
There were a lot of event, file and key handles open. It seemed that the memory leak was caused by the handle leak.
The event was named.
Handle 000f0188
Type Event
Attributes 0
GrantedAccess 0x1f0003:
Delete,ReadControl,WriteDac,WriteOwner,Synch
QueryState,ModifyState
HandleCount 64096
PointerCount 64098
Name \BaseNamedObjects\licensestatus
No object specific information available
It was pretty straighfoward to find out which code left the handle open. It was the license code.
There was no name associated with the file handle. They might be sockets.
Handle 000f0190
Type File
Attributes 0
GrantedAccess 0x100080:
Synch
ReadAttr
HandleCount 2
PointerCount 3
No object specific information available
The registry key handle had a name associcated to it as well,
Handle 000f01b4
Type Key
Attributes 0
GrantedAccess 0x3:
None
QueryValue,SetValue
HandleCount 2
PointerCount 3
Name \REGISTRY\MACHINE\SOFTWARE\Microsoft\RFC1156Agent\CurrentVersion\Parameters
No object specific information available
The name had something to do with SNMP. The license code used SNMP to communicate with the license server.
Since these three types of handles had been leaking at the same time, it indicated that they might be related and probably all came from the license code.
After the code review, the event handle leak was because of the early return from a license activation function without closing it. The license activation function is called when the server creates a license object or refreshes an existing license object. What is interesting was why the server needed to create so many license objects or keep refreshing the license object. From the following stack trace of the managed code,
77589ccc 77610191 [NDirectMethodFrameStandalone: 77589ccc] .IFTPCreateLicense(Char*, UInt32)
77589cdc 035865a2 UtilAPI.Core.IsOKToRun()
77589ce0 0358625f [InlinedCallFrame: 77589ce0]
77589f58 03585cdc SSHServerAPI.Transport.FTPTransport.ProcessCommand()
It indicated that the server tried to create the license object when the client connected to it . However, the license object should be created during the server startup and would not be set to NULL in the lifetime of the server. Therefore, it seemed that the license object had never been successfully created and in the load test, each connection from the client triggered the server to create the license object and thereby invoked the license activation function. Then I checked the per-process global license object. It was indeed NULL. When I checked the server log, I found tons of the error message -- "License file is for wrong hardware". It seemed that the culprit was the license file.
Friday, March 13, 2009
Subscribe to:
Post Comments (Atom)
2 comments:
Prefer a named event over an unnamed event. Since it is more debugger-friendly.
When WinDBG loads the dump file, it displays the system and process up times, for instance,
System Uptime: 10 days 3:17:01.859
Process Uptime: 10 days 2:46:49.000.
The !runaway extension command displays thread elapsed times, such as
0:013> !runaway 7
...
Elapsed Time
Thread Time
0:9f8 10 days 2:46:48.317
2:9d8 10 days 2:46:48.270
1:670 10 days 2:46:48.270
3:858 10 days 2:46:47.708
4:7ec 10 days 2:46:47.692
5:994 10 days 2:46:46.833
6:878 10 days 2:46:46.817
7:478 10 days 2:46:44.301
8:ebc 10 days 2:46:44.161
9:ed0 10 days 2:46:44.098
10:8b0 10 days 2:46:43.958
11:8a8 10 days 2:46:43.911
12:804 10 days 2:36:43.301
13:834 10 days 0:20:23.708
14:f34 10 days 0:11:55.114
The question is when the up/elapsed time starts -- when a process/thread is created or when a debugger attaches to a process.
I attached the debugger to the server and investigated the issue within ONE day, therefore, the up/elapsed time is from the beginning of the life cycle of a process/thread.
Post a Comment