Tuesday, July 15, 2014

LSASS at high CPU after SCCM 2012 updated to R2

This has been a tough nut to crack. It came down to two Microsoft bugs that can come together in a perfect storm, so they are worth thinking about.

We started to have enterprise-wide issues with multiple applications, and it was determined that all our domain controllers were running close to, or at, 100% CPU. LSASS was the offending process. Of course, LSASS (Local Security Authority Subsystem Service) is the authenticating heart and soul of a domain controller, so all we could be certain of was that something on our network was beating on the domain controllers really hard. We started to investigate and took a Microsoft engineer along for the ride.

During office hours the issue was very intermittent, but we also noticed a similar effect every night around 11:30 pm. Since the general load on the domain is much smaller at night, it was not causing issues then, but it was very measurable, with CPUs running at 80% or more. Since that was a predictable event, we decided to focus on it.

A sniff using Microsoft's NetMon application was clumsy, since I was trying to determine which machines on our network were creating the traffic, and NetMon does not easily provide a 'top talkers' list (maybe someone can educate me on that). We used Wireshark instead and did some late-night sniffs looking for the culprit.

http://www.wireshark.org/
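For anyone wanting a quick 'top talkers' list, Wireshark has Endpoints and Conversations views under its Statistics menu. The same idea can be sketched in a few lines of Python against a packet list exported to CSV. This is only an illustration: it assumes the export uses Wireshark's default column names ("Source", "Length"), the function name top_talkers is made up, and the sample rows are invented rather than taken from our capture.

```python
import csv
import io
from collections import Counter


def top_talkers(csv_text, n=10):
    """Return the n source addresses that sent the most bytes."""
    bytes_by_source = Counter()
    for row in csv.DictReader(io.StringIO(csv_text)):
        # "Source" and "Length" are assumed to be the exported column names.
        bytes_by_source[row["Source"]] += int(row["Length"])
    return bytes_by_source.most_common(n)


# Invented sample rows, not data from the capture described in this post.
sample = """No.,Time,Source,Destination,Protocol,Length,Info
1,0.000,10.0.0.11,10.0.0.1,LDAP,300,searchRequest
2,0.001,10.0.0.12,10.0.0.1,LDAP,200,searchRequest
3,0.002,10.0.0.11,10.0.0.1,SMB,500,Read Request
"""

print(top_talkers(sample, 2))
```

When thousands of sources each send a similar amount, as in our case, the list comes out flat rather than dominated by one host, which is itself a useful clue.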

Imagine my surprise to find that the traffic was coming from our entire workstation environment (which is around 13,000 Windows PCs)!

Closer inspection of the sniff seemed to reveal that the traffic was a series of group policy updates, with each workstation performing 4 GPO updates every time it received a new signature for SCEP (System Center Endpoint Protection, formerly known as Forefront AntiVirus). What the hell?!

Using our Microsoft SCOM (System Center Operations Manager) system, we determined that the nighttime episode had been occurring for some weeks. When we compared that with our change management system, it seemed to coincide with the upgrade of our SCCM (System Center Configuration Manager) system from SCCM 2012 to SCCM 2012 R2. At this point our SCCM administrator reached out to Microsoft PSS to understand why. In our opinion, the workstations should not have been doing ANY group policy updates, let alone 4!

Microsoft explained that if you are delivering your SCEP signatures by group policy (which we are!) (and probably should not be!) (a hangover from the pre-SCEP Forefront days!) then the SCEP client will compare the URL of your WSUS (Windows Server Update Services) server as defined by the policy with the URL of your WSUS server as configured in the SCCM server. If it finds they do not match, it concludes that the name (URL) of the WSUS server, last delivered by group policy, is out of date and initiates a gpupdate to get fresh information.

(the punchline is coming soon...)

Obviously we jumped into the policy and the configuration, only to find they were the same:

WSUS Server set in GPO:

HTTP://CM2012SUP.MYCOMPANY.ORG:8530

WSUS Server being set by ConfigMgr:

http://CM2012SUP.MYCOMPANY.ORG:8530

As you can see, the only difference is the upper/lower case of the word 'http'.

Surely not, I hear you cry! Yep, we discovered that a bug was introduced in SCCM 2012 R2 whereby the comparison of the strings is case sensitive. The cycle of believing the URL to be different and then initiating a gpupdate occurs four times before giving up.
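To make the difference concrete, here is a small Python sketch of the two kinds of comparison. It is not Microsoft's code and says nothing about how the SCEP client is actually written; same_server is a made-up helper that only shows why a plain string comparison reports a mismatch while a comparison that ignores case in the scheme and host name does not.

```python
from urllib.parse import urlsplit

GPO_URL = "HTTP://CM2012SUP.MYCOMPANY.ORG:8530"
CONFIGMGR_URL = "http://CM2012SUP.MYCOMPANY.ORG:8530"


def same_server(url_a, url_b):
    """Compare two URLs, ignoring case in the scheme and host name.

    Hypothetical helper for illustration; the path is deliberately ignored.
    """
    def key(url):
        parts = urlsplit(url)
        return (parts.scheme.lower(), (parts.hostname or "").lower(), parts.port)

    return key(url_a) == key(url_b)


# Case-sensitive string comparison: the two URLs look different.
print(GPO_URL == CONFIGMGR_URL)             # False
# Scheme and host compared without regard to case: same server.
print(same_server(GPO_URL, CONFIGMGR_URL))  # True
```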

So 6,000 group policies x 13,000 workstations x 4 refreshes = 312 million policies being accessed simultaneously.
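If you want to plug in your own environment's numbers, the multiplication is trivial to script. The figures below are the ones quoted in this post; the variable names are just for illustration.

```python
group_policies = 6_000          # group policies in the domain
workstations = 13_000           # Windows PCs receiving SCEP signatures
refreshes_per_signature = 4     # gpupdate attempts before the client gives up

policy_reads = group_policies * workstations * refreshes_per_signature
print(f"{policy_reads:,} policy reads")  # 312,000,000 policy reads
```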

After changing the URL in the group policy to match the SCCM configuration (you cannot do it the other way around) the problem was resolved.

But wait, there is more...

Due to another bug, this enormous amount of traffic was being sent between the domain controllers and our XP workstations two bytes at a time! This Microsoft article explains:

http://support.microsoft.com/kb/319440

The article advises adding the following registry entry on the client; a reboot is not required.

HKEY_LOCAL_MACHINE\Software\Microsoft\Windows NT\CurrentVersion\Winlogon
Entry: BufferPolicyReads
Type: DWORD
Value: 1
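For a handful of machines you can add the value by hand, but it can also be scripted. The sketch below uses Python's standard winreg module purely as an illustration; it assumes Python is available on the client and that the script runs with administrative rights, and the function name is made up. Across thousands of workstations you would more likely push the value with whatever deployment tooling you already use.

```python
import sys

KEY_PATH = r"Software\Microsoft\Windows NT\CurrentVersion\Winlogon"
VALUE_NAME = "BufferPolicyReads"
VALUE_DATA = 1


def enable_buffered_policy_reads():
    """Write BufferPolicyReads = 1 (DWORD) under HKEY_LOCAL_MACHINE."""
    if sys.platform != "win32":
        raise RuntimeError("The registry only exists on Windows.")
    import winreg  # imported here so the module still loads elsewhere

    # Open (or create) the Winlogon key with permission to set values.
    with winreg.CreateKeyEx(
        winreg.HKEY_LOCAL_MACHINE, KEY_PATH, 0, winreg.KEY_SET_VALUE
    ) as key:
        winreg.SetValueEx(key, VALUE_NAME, 0, winreg.REG_DWORD, VALUE_DATA)
```

Call enable_buffered_policy_reads() from an elevated prompt on the client to apply it.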

So in this case, with the 312 million policies at, say, 1 KB each:
1,024 bytes / 2-byte chunks = 512 packets per policy
512 x 312,000,000 policy transactions = 159,744,000,000, or ~160 billion packets.
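The same back-of-the-envelope estimate in Python, in case you want to try a different policy size. The 1 KB per policy is my guess from above, not a measured value, and the variable names are only illustrative.

```python
policy_reads = 312_000_000   # policy transactions from the earlier estimate
policy_size_bytes = 1_024    # assumed average policy size (a guess)
chunk_size_bytes = 2         # bytes transferred per read on XP / Server 2003

chunks_per_policy = policy_size_bytes // chunk_size_bytes
total_packets = policy_reads * chunks_per_policy

print(chunks_per_policy)      # 512
print(f"{total_packets:,}")   # 159,744,000,000
```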

We are in the process of upgrading our clients from Windows XP to Windows 7, and this 2-byte bug only exists in XP and Server 2003.

Cheers!
