Conficker causes LSASS to consume CPU Time on Domain Controllers

Hi, Gautam here. I wanted to blog about a high-impact problem we have been seeing recently.

The problem has to do with LSASS consuming a lot of CPU time on your Domain Controllers (DCs). The cause of this high CPU turns out to be Conficker-infected computers throwing bad passwords against the DCs. The rate of bad passwords can be as high as 10,000 per minute from multiple clients.

Technical information on Conficker can be found here.

The problem could manifest itself in many ways, some being:

  1. Slow authentication and logons being reported by users
  2. Slow mail flow
  3. Slow resource access (resources could be file shares, printers and more) or even complete failure in resource access

Some of the above problems take time to be narrowed down. You typically will have to go through a few other pieces before you narrow down on the domain controllers being bogged down with high CPU time.

Background:

CPU usage on domain controllers continues to be very high (I'm rating high = 70% and above as long as this is not normal for the DC). On looking closer, you find LSASS.EXE eating up all of this CPU. Perfmon reports show the CPU usage stays more or less consistent throughout the day. It doesn't climb down during off-peak hours.

As you can imagine, this high CPU usage affects other workflows which are AD dependent – including Exchange/SharePoint/Authentication etc.

If you temporarily pull the network cable from the DC and wait a few minutes, LSASS drops back down to ~1% or whatever value is normal in your setup. Ned Pyle has the logic of pulling out the network cable described in a previous post in detail.

In this case as well, we saw that pulling out the network cable brings down the LSASS CPU usage to normal limits. Plugging it back in makes LSASS shoot up again to 80%-90% CPU.

If you follow the steps which Ned has documented in the blog, network traces will show a HUGE number of authentication requests coming into the DCs. Now it's not always easy to differentiate between bad and good traffic when you are looking at 100MB worth of network traffic.

In this case, however, what you are bound to see is something like the below in the network traces. I highly recommend using Netmon 3*; its Conversations feature is ideal for working through the large traces you are bound to get when collecting network traces from a DC.

09:54:16.593    192.168.0.1    DC01.CONTOSO.COM    KerberosV5    KerberosV5:AS Request Cname: User1 Realm: CONTOSO.COM Sname: krbtgt/CONTOSO.COM

09:54:16.625    DC01.CONTOSO.COM    192.168.0.1    KerberosV5    KerberosV5:KRB_ERROR - KDC_ERR_PREAUTH_FAILED (24)

OR

09:54:16.531    192.168.0.2    DC01.CONTOSO.COM    TCP    TCP:Flags=......S., SrcPort=4614, DstPort=Microsoft-DS(445), PayloadLen=0, Seq=3314092510, Ack=0, Win=65535 ( ) = 65535

09:54:16.531    DC01.CONTOSO.COM    192.168.0.2    TCP    TCP:Flags=...A..S., SrcPort=Microsoft-DS(445), DstPort=4614, PayloadLen=0, Seq=1831638666, Ack=3314092511, Win=17520 ( Scale factor not supported ) = 17520

09:54:16.531    192.168.0.2    DC01.CONTOSO.COM    TCP    TCP:Flags=...A...., SrcPort=4614, DstPort=Microsoft-DS(445), PayloadLen=0, Seq=3314092511, Ack=1831638667, Win=65535 (scale factor 0x0) = 65535

09:54:16.531    192.168.0.2    DC01.CONTOSO.COM    SMB    SMB:C; Negotiate, Dialect = PC NETWORK PROGRAM 1.0, LANMAN1.0, Windows for Workgroups 3.1a, LM1.2X002, LANMAN2.1, NT LM 0.12

09:54:16.531    DC01.CONTOSO.COM    192.168.0.2    SMB    SMB:R; Negotiate, Dialect is NT LM 0.12 (#5), SpnegoNegTokenInit

09:54:16.578    192.168.0.2    DC01.CONTOSO.COM    SMB    SMB:C; Session Setup Andx, NTLM NEGOTIATE MESSAGE

09:54:16.578    DC01.CONTOSO.COM    192.168.0.2    SMB    SMB:R; Session Setup Andx, NTLM CHALLENGE MESSAGE - NT Status: System - Error, Code = (22) STATUS_MORE_PROCESSING_REQUIRED

09:54:16.593    192.168.0.2    DC01.CONTOSO.COM    TCP    TCP:Flags=...A...F, SrcPort=4614, DstPort=Microsoft-DS(445), PayloadLen=0, Seq=3314092888, Ack=1831639470, Win=64732 (scale factor 0x0) = 64732

09:54:16.593    DC01.CONTOSO.COM    192.168.0.2    TCP    TCP:Flags=...A...F, SrcPort=Microsoft-DS(445), DstPort=4614, PayloadLen=0, Seq=1831639470, Ack=3314092889, Win=17143 (scale factor 0x0) = 17143

09:54:16.593    192.168.0.2    DC01.CONTOSO.COM    TCP    TCP:Flags=...A...., SrcPort=4614, DstPort=Microsoft-DS(445), PayloadLen=0, Seq=3314092889, Ack=1831639471, Win=64732 (scale factor 0x0) = 64732

Now, in the above three examples of network traffic, the first one with the Kerberos KDC_ERR_PREAUTH_FAILED is a sure-shot bad password attempt. The other two traces aren't necessarily always bad authentication attempts, but are data connections to LSARPC, which I saw in three of the four recent cases I had with this issue.
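
To put numbers on the Kerberos case, the trace can be exported as text and the KDC_ERR_PREAUTH_FAILED replies counted per client. The following is a minimal sketch, assuming each exported line has the whitespace-separated columns shown above (time, source, destination, protocol, description); the function name and that export layout are assumptions for illustration, not something Netmon guarantees.

```python
from collections import Counter


def count_preauth_failures(trace_lines):
    """Count KDC_ERR_PREAUTH_FAILED replies per client address.

    Assumes each line looks like the frames quoted in this post:
    time, source, destination, protocol, description.
    """
    failures = Counter()
    for line in trace_lines:
        parts = line.split(None, 4)
        if len(parts) < 5:
            continue  # not a frame summary line
        _time, _source, destination, protocol, description = parts
        if protocol == "KerberosV5" and "KDC_ERR_PREAUTH_FAILED" in description:
            # The error travels from the DC back to the client,
            # so the destination column identifies the client.
            failures[destination] += 1
    return failures


sample = [
    "09:54:16.593    192.168.0.1    DC01.CONTOSO.COM    KerberosV5    "
    "KerberosV5:AS Request Cname: User1 Realm: CONTOSO.COM Sname: krbtgt/CONTOSO.COM",
    "09:54:16.625    DC01.CONTOSO.COM    192.168.0.1    KerberosV5    "
    "KerberosV5:KRB_ERROR - KDC_ERR_PREAUTH_FAILED (24)",
]

for client, count in count_preauth_failures(sample).most_common():
    print(client, count)
```

Sorting with `most_common()` puts the noisiest clients first, which is the same "top talkers" view you would otherwise build by hand from the Conversations tree.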

SPA reports will show a high number of calls to SAMSRV or LSARPC. Tim Springston, who runs his own excellent AD-related blog, has discussed using SPA here.

With the TOP users obtained from both SPA and the network traces, we explored 3 of the top client computers. We pulled MPSReports (an often-used PSS data collection tool) from these client computers. The first thing which stood out in the event logs was all the Audit Failure Logon/Logoff Event ID 529 entries in the Security event logs.

Note: by default, only Success for Logon/Logoff and Account Logon is enabled. And in this case, the Domain Controllers were running with the defaults. The client computers had Failure for Logon/Logoff enabled.

This of course led us to...

  1. Checking this customer's account lockout policy – we saw they did not have account lockouts enabled.
  2. Enabling Failure for Account Logon on a policy which applied to the Domain Controllers as well.

No sooner had the failure-audit policy applied to the DC than the Security event logs filled with Audit Failure Account Logon Event ID 675 entries. Here is an example of a 675 event.

Event Type:    Failure Audit
Event Source:    Security
Event Category:    Account Logon
Event ID:    675
Date:        3/23/2009
Time:        3:03:57 AM
User:        NT AUTHORITY\SYSTEM
Computer:    DC01
Description:
Pre-authentication failed:

    User Name:    User1
    User ID:        %{S-1-5-21-xxxxxxxxxx-xxxxxxxxxx-xxxxxxxxx-xxxxx}
    Service Name:    krbtgt/CONTOSO.COM
    Pre-Authentication Type:    0x2
    Failure Code:    0x18
    Client Address:    192.168.0.100 <- IP of the computer which is throwing the bad credentials
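
The Failure Code in this event is the Kerberos error number written in hexadecimal: 0x18 is decimal 24, the same KDC_ERR_PREAUTH_FAILED (24) seen in the network trace. A small sketch of that conversion follows; the table holds only a handful of well-known Kerberos error codes and is not a complete list, and the helper name is hypothetical.

```python
# A few well-known Kerberos error codes (deliberately incomplete).
KERBEROS_ERRORS = {
    0x06: "KDC_ERR_C_PRINCIPAL_UNKNOWN",
    0x12: "KDC_ERR_CLIENT_REVOKED",
    0x17: "KDC_ERR_KEY_EXPIRED",
    0x18: "KDC_ERR_PREAUTH_FAILED",
    0x25: "KRB_AP_ERR_SKEW",
}


def describe_failure_code(text):
    """Turn a 'Failure Code' field such as '0x18' into a readable name."""
    code = int(text, 16)  # int() accepts the 0x prefix when base is 16
    name = KERBEROS_ERRORS.get(code, "unknown (not in this short table)")
    return f"{text} = {code} = {name}"


print(describe_failure_code("0x18"))  # 0x18 = 24 = KDC_ERR_PREAUTH_FAILED
```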

Using EVENTCOMBMT to pull out the relevant event IDs from various DCs (namely 529, 644, 675, 676, and 681) and a little bit of Office Excel magic, I quickly had a list of ~100 computers sending bad passwords within a 30-minute time frame. The total number of failed logons was enough to drive up the LSASS.EXE CPU usage. LSASS, of course, was only doing its job of keeping up with the load and failing the bad authentication attempts.
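
The Excel step boils down to grouping the failures by Client Address. Here is a rough sketch of that tally, assuming each event's description is available as text in the layout of the 675 example above; the real EVENTCOMBMT output is laid out differently and would need its own parsing, and the function and sample events here are made up for illustration.

```python
import re
from collections import Counter, defaultdict

CLIENT_RE = re.compile(r"Client Address:\s*(\d{1,3}(?:\.\d{1,3}){3})")
USER_RE = re.compile(r"User Name:\s*(\S+)")


def summarize_675_events(event_texts):
    """Return (failures per client, set of user names tried per client)."""
    attempts = Counter()
    users = defaultdict(set)
    for text in event_texts:
        client = CLIENT_RE.search(text)
        if not client:
            continue  # no client address in this event text
        address = client.group(1)
        attempts[address] += 1
        user = USER_RE.search(text)
        if user:
            users[address].add(user.group(1))
    return attempts, users


# Made-up event descriptions in the layout shown in this post.
events = [
    "User Name:    User1\nFailure Code:    0x18\nClient Address:    192.168.0.100",
    "User Name:    User2\nFailure Code:    0x18\nClient Address:    192.168.0.100",
    "User Name:    User1\nFailure Code:    0x18\nClient Address:    192.168.0.101",
]

attempts, users = summarize_675_events(events)
for address, count in attempts.most_common():
    print(address, count, sorted(users[address]))
```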

Putting it all together:

The pattern (logon attempts for multiple users from a single computer) and the rate (hundreds of attempts per minute per computer) of the bad passwords were a pretty sure sign of malware activity. A few more client computers, which we picked up from the SPA and Netmon reports, revealed traces of Conficker. With the Microsoft PSS Security team and the customer's own antivirus vendor involved, the customer was able to patch, scan, and clean their computers, and after this effort the LSASS CPU usage on the DCs dropped dramatically.
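
Those two signals, several user names from one computer and a high per-minute rate, can be expressed as a simple filter. This is only an illustrative sketch: the thresholds `min_users` and `min_per_minute` are assumptions to tune for your environment, the records are synthetic, and a match is a reason to inspect the computer, not proof of infection.

```python
from collections import Counter, defaultdict
from datetime import datetime


def suspicious_clients(records, min_users=2, min_per_minute=100):
    """Flag clients that tried several user names at a high failure rate.

    records: iterable of (timestamp, client_address, user_name) for
    failed logons. Both thresholds are illustrative assumptions.
    """
    per_minute = Counter()
    users = defaultdict(set)
    for timestamp, client, user in records:
        minute = timestamp.replace(second=0, microsecond=0)
        per_minute[(client, minute)] += 1
        users[client].add(user)

    # Busiest single minute seen for each client.
    peak = Counter()
    for (client, _minute), count in per_minute.items():
        peak[client] = max(peak[client], count)

    return sorted(
        client
        for client in users
        if len(users[client]) >= min_users and peak[client] >= min_per_minute
    )


# Synthetic data: one noisy client cycling through five user names,
# and one user who simply mistyped a password twice.
base = datetime(2009, 3, 23, 3, 3, 0)
records = [
    (base.replace(second=i % 60), "192.168.0.100", f"User{i % 5}")
    for i in range(150)
]
records += [(base, "192.168.0.7", "User1"), (base, "192.168.0.7", "User1")]

print(suspicious_clients(records))  # ['192.168.0.100']
```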

So from high LSASS CPU – to network traces leading to TOP client computers – to security events – to DC security events – back to the client computers! As you can imagine, it took some time to nail down the first time. The 2nd, 3rd and 4th cases were traced to unpatched computers infected by Conficker much more quickly.

I hope with this blog post out, someone will save themselves a LOT of time and effort when facing such an issue.

  • Gautam Anand
  • Excellent post Gautam.

    This analysis leads one to wonder if Windows Domain Controllers should be enhanced to automatically detect systems generating too many bad authentication requests and ignore requests from those systems for some configurable period of time.  Of course the DC should also log a unique event when it begins to ignore requests from the suspect system.

    John

  • Thanks Gautam...

    From one Anand to another....

  • Thanks John. Interesting thought.

    Ignoring abnormally large traffic from a particular IP could be addressed via a firewall rule. The important part here is to put a number on "large" traffic. Over the network, there isn't much difference between good authentication and bad authentication. In either case, it's going to be Kerberos or NTLM.

    Additionally, for proactive troubleshooting / upkeep - monitoring software like SCOM with the ADMP (Active Directory Management Pack) can be used to send out an alert when a certain threshold (for CPU usage of LSASS, for example) is reached. In most of my cases, had SCOM been in place, the customers would have been notified of the High CPU usage on day one of it happening.

    * Flagging a SCOM alert (sending an alert mail) notifying the administrator “the CPU usage on DC1 is higher than the thresholds you have manually configured and SCOM itself has learnt”

    * SCOM's health rollup feature is excellent for pinpointing a point of failure on a monitored system or distributed application.

         -   Gautam

  • Thanks, that was very useful information. It worked very well to eradicate the worm on one of our sites.

    For more information on how we proceeded you can check

    http://www.ldap389.info/en/2010/03/07/

  • DUDE you saved my ass@!!!! we had no way of telling how and whoo, after a few audit failure selections to audit and BAM! thank you so so so so so so much!!!!!!