Monday, June 23, 2014

Exchange 2013 CAS server exhausts memory due to leak in MSExchangeRPCProxyAppPool

So, we had two incidents in 48 hours where users were disconnected from their mailboxes for several minutes. The issue originated on one of our CAS servers, which had run out of memory. Specifically:

Log Name:      System
Source:        Microsoft-Windows-Resource-Exhaustion-Detector
Date:          6/20/2014 8:21:37 AM
Event ID:      2004
Task Category: Resource Exhaustion Diagnosis Events
Level:         Warning
Keywords:      Events related to exhaustion of system commit limit (virtual memory).
User:          SYSTEM
Computer:      excas1.MYCOMPANY.COM
Description:
Windows successfully diagnosed a low virtual memory condition. The following programs consumed the most virtual memory: w3wp.exe (17344) consumed 7262064640 bytes, w3wp.exe (3032) consumed 460181504 bytes, and MSExchangeFrontendTransport.exe (2640) consumed 446877696 bytes.
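To check how often this warning has been logged, the System log can be queried for the same source and event ID. This is a minimal sketch, assuming it is run locally on the CAS server; the seven-day look-back window is an arbitrary choice for illustration.

```powershell
# Look for Resource-Exhaustion-Detector warnings (Event ID 2004) in the System log.
# The 7-day look-back is an assumption; adjust it to suit.
$filter = @{
    LogName      = 'System'
    ProviderName = 'Microsoft-Windows-Resource-Exhaustion-Detector'
    Id           = 2004
    StartTime    = (Get-Date).AddDays(-7)
}

# Get-WinEvent raises an error when nothing matches, so silence that case.
Get-WinEvent -FilterHashtable $filter -ErrorAction SilentlyContinue |
    Select-Object TimeCreated, MachineName, Message |
    Format-List
```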

The first w3wp.exe entry in that list (process 17344, roughly 7 GB) is an IIS worker process, which tells us that the problem originated in IIS. The first step in diagnosing the issue is therefore to identify which application pool that worker process belongs to. For me, this is a Windows Server 2012 machine, but the process is similar on Windows Server 2008.


  • Open Task Manager and go to the Details tab.
  • Right-click any existing column header and choose Select columns.
  • Scroll down and tick Command line.
  • Sort the data by memory and find the runaway process.

The Command line value is really long, but the part after -ap names the application pool that owns the process. In my case it was:

c:\windows\system32\inetsrv\w3wp.exe -ap "MSExchangeRpcProxyAppPool"

So, it looks like MSExchangeRpcProxyAppPool is at fault; in my case it was taking a crazy 7 GB of memory. I smelt a memory leak.
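The Task Manager steps above can also be scripted. The sketch below is one way to do it, assuming PowerShell 3.0 or later (for Get-CimInstance) and an elevated prompt; the function names Get-AppPoolName and Get-W3wpMemory are made up for this example.

```powershell
# Hypothetical helper: pull the application pool name out of a w3wp.exe command line.
function Get-AppPoolName {
    param([string]$CommandLine)
    if ($CommandLine -match '-ap\s+"([^"]+)"') { return $Matches[1] }
    return $null
}

# Hypothetical helper: list IIS worker processes with their pool name and
# private memory, largest first.
function Get-W3wpMemory {
    Get-CimInstance -ClassName Win32_Process -Filter "Name = 'w3wp.exe'" |
        ForEach-Object {
            # The process may exit between the two calls, so tolerate a miss.
            $proc = Get-Process -Id $_.ProcessId -ErrorAction SilentlyContinue
            [pscustomobject]@{
                ProcessId = $_.ProcessId
                AppPool   = Get-AppPoolName $_.CommandLine
                PrivateGB = [math]::Round($proc.PrivateMemorySize64 / 1GB, 2)
            }
        } |
        Sort-Object -Property PrivateGB -Descending
}

# On the CAS server:
# Get-W3wpMemory | Format-Table -AutoSize
```

On a server that only has PowerShell 2.0, Get-WmiObject Win32_Process exposes the same CommandLine property.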

Solution
In Exchange 2013 there is a feature that is enabled by default called SSL-Offload. You can read about it here:

http://technet.microsoft.com/en-us/library/dn635115(v=exchg.150).aspx

However, there is a snag: the feature is not really there. The code was left out until Exchange 2013 SP1, so on anything earlier the setting is enabled but has nothing behind it. Due to a bug, the Exchange RpcHttpConfigurator service sees that the feature is enabled, attempts to examine the configuration every 15 minutes, and leaks memory as a result. I have read that one solution is to disable the auto-configuration, thus:

The registry setting below disables the automatic configuration update process from occurring. The impact of this change is any future updates to Outlook Anywhere settings will not propagate to servers without reverting the changes and restarting the Microsoft Exchange ServiceHost service.

Be aware that once deployed, rarely are changes made to Outlook Anywhere settings. After completing the registry change, the Microsoft Exchange ServiceHost process must be recycled.


Registry Key: HKEY_LOCAL_MACHINE\SYSTEM\CurrentControlSet\Services\MSExchangeServiceHost\RpcHttpConfigurator
Registry Value: PeriodicPollingMinutes
Default Value: 15

Setting this value to "0" disables the periodic polling that leads to the resource leak.
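For reference, the registry change described above could be inspected and rehearsed from PowerShell roughly as follows. This is a sketch under assumptions: that the key lives under HKEY_LOCAL_MACHINE, that the PeriodicPollingMinutes value already exists on your server, and that the service name matches the MSExchangeServiceHost key shown above. The -WhatIf switches keep it a dry run.

```powershell
$key = 'HKLM:\SYSTEM\CurrentControlSet\Services\MSExchangeServiceHost\RpcHttpConfigurator'

# Show the current polling interval (15 is the documented default).
# This raises an error if the value has never been created on this server.
Get-ItemProperty -Path $key -Name PeriodicPollingMinutes

# Dry run only: -WhatIf reports what would happen without changing anything.
# Remove -WhatIf to actually apply the change and recycle the service.
Set-ItemProperty -Path $key -Name PeriodicPollingMinutes -Value 0 -WhatIf
Restart-Service -Name MSExchangeServiceHost -WhatIf
```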


To be clear, I think this is a terrible idea. The configurator is essential to maintaining the health of the whole ecosystem. Having worked this issue with Microsoft, the far better solution is to simply switch off SSL-Offload and make a note to revisit that decision once SP1 is installed.

I had (and still have) some concerns that disabling this feature might change the "Require SSL" check-box on the IIS root, which would then be inherited by everything beneath it. I tried to reproduce that issue and could not; however, I recommend baselining those settings just in case.
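One way to take that baseline is to record the SSL flags for the site and each application under it before making any change. This is a rough sketch, assuming the WebAdministration module is available on the CAS server and that the Exchange front-end site is called "Default Web Site"; the output file name is made up for the example.

```powershell
Import-Module WebAdministration

$site = 'Default Web Site'   # assumption: the site hosting the Exchange front end

# The site root plus every application beneath it (for example /owa, /ecp, /rpc).
$locations = @($site) + @(Get-WebApplication -Site $site | ForEach-Object { $site + $_.path })

$baseline = foreach ($location in $locations) {
    $raw = Get-WebConfigurationProperty -PSPath 'MACHINE/WEBROOT/APPHOST' -Location $location `
        -Filter 'system.webServer/security/access' -Name 'sslFlags'

    # The result may be a plain value or an object wrapping it in .Value; handle both.
    $flags = if ($raw.PSObject.Properties['Value']) { $raw.Value } else { $raw }

    [pscustomobject]@{ Location = $location; SslFlags = "$flags" }
}

# Keep a copy to compare against after the change.
$baseline | Export-Csv -Path .\ssl-baseline.csv -NoTypeInformation
```

Running the same script after the change and comparing the two CSV files shows whether any "Require SSL" setting moved.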

So, on to the solution I prefer. Run the following in the Exchange Management Shell (obviously change the server list for your own environment):

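# CAS servers to update; replace with your own server names
$strServers = "excas1","excas2","excas3","excas4"
foreach ($Srvr in $strServers)
{
    # Switch off SSL offloading for Outlook Anywhere on this server
    Get-OutlookAnywhere -Server $Srvr | Set-OutlookAnywhere -SSLOffloading:$false
}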

You will then want to perform an IIS restart on each CAS server.
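The check and the restart can be scripted along the same lines. This is a sketch, assuming it is run from the Exchange Management Shell and that PowerShell remoting is enabled on the CAS servers; if it is not, running iisreset locally on each server does the same job.

```powershell
$strServers = "excas1","excas2","excas3","excas4"

# Confirm that SSL offloading now shows as False on every server
$strServers |
    ForEach-Object { Get-OutlookAnywhere -Server $_ } |
    Format-List Server, SSLOffloading

# Restart IIS on each server in turn (assumes PowerShell remoting is enabled)
foreach ($Srvr in $strServers)
{
    Invoke-Command -ComputerName $Srvr -ScriptBlock { iisreset /restart }
}
```

Restarting one server at a time, rather than all at once, keeps the remaining CAS servers available to clients while each one recycles.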

Cheers!
