SQL Server 2012 – Bug – CACHESTORE_OBJCP Memory Leak sp_trace_getdata
This is to document a bug we have uncovered in SQL Server 2012. Microsoft have acknowledged this bug and are releasing a fix for it in SQL Server 2012 SP1 CU3.
We have monitoring software which registers a trace and reads data from this trace in order to provide metrics. We noticed that after moving to our new SQL 2012 platform that we had performance problems every few days during our peak/busy season. At one point they were happening every other day. The symptoms we saw were an increase in CPU, and an increase in recompiles. We felt that the CPU increase was due to the recompiles, but the key thing was that the busier the system was, the quicker we knew the problem was going to happen – recompiles went up high, CPU maxed and the system crawled to a halt.
In order to resolve as we couldn’t figure what was going on initially, we failed over to our HA node. Our solution uses AlwaysOn and we had a HA node on the local site which was synchronous so this was easy and took a matter of seconds. Failing over to our secondary replica instantly fixed the problem, but we would see the problem occur in the next day or so and we would have to flip back. This went on for a week or so and we raised support tickets with Microsoft to get to the bottom of the issue.
To cur a long story short, by the time we were through to the right department and we had run all the standard diagnostics, we had already determined ourselves that the issue was due to the CACHESTORE_OBJCP entry in sys.dm_os_memory_cache_counters and specifically that the value for pages_kb was excessively high. We recorded a value of 8.5GB during one of the issues, which was high for our 140GB RAM server.
CACHESTORE_OBJCP is a cache store in memory for object plans (stored procs, triggers and functions). As for any of the cache stores, if they get 75% towards the calculated ‘Cache Plan Pressure Limit’ (see technet article at the end) it triggers an internal ‘memory pressure’ event. This is to alert SQL Server that there is memory pressure and SQL Server starts reacting to that trigger.
Tracking down the leak!
We were able to track down the statement which caused the object olan cache store to grow by monitoring the dm_exec_cached_plans dmv after failover (ordering by size_in_bytes DESC). The field to look at here as well is the objtype column – we don’t care about prepared types when we’re looking at the OBJCP cache store which only deals with procs, triggers and functions plans.
We collected readings from dm_exec_cached_plans every 3 minutes and we noticed one compiled plan which arose to the top of the list every time. It grew to the maximum size_in_bytes this dmv would allow with it being and int (214783647):
We can only presume that it further grows in the background to make up the value of the OBJCP cachestore value.
We can figure out which SQL statement the plan belongs to by CROSS APPLYing sys.dm_exec_sql_text:
SELECT TOP 10 * FROM sys.dm_exec_cached_plans cp CROSS APPLY sys.dm_exec_sql_text(cp.plan_handle) order by size_in_bytes DESC
For us this showed the statement sp_trace_getdata which we tracked down to the trace SQL Sentry (performance monitoring) uses to collect data.
That’s not to say that SQL Sentry is the cause here. We raised it to SQL Sentry’s attention however and they are keen to learn about this as we are.
We found that restarting the SQL Sentry service resolved the issue during peak times. The pages_kb value for CACHESTORE_OBJCP would then reset and start growing slowly again. We then put a job in place which kills the trace (SQL Sentry restarts the trace automatically) and this stopped the issue getting out of control and causing us problems at peak times.
We raised these facts with Microsoft and asked them if there were any known bugs around sp_trace_getdata or OPENROWSET which may cause the object case to bloat. They used our data collected to uncover this was a bug and it has been scheduled to be included in SP1 CU3 which is scheduled for March.
Obviously, this bug is not due to our monitoring service being SQL Sentry, this bug could have quite happily raised itself from any product which uses the same method, and if we had Sentry pointing at a different version of SQL there would have been no problem.
Keep an eye on http://blogs.msdn.com/b/sqlreleaseservices/ for the release of SQL Server 2012 SP1 CU3 to resolve this issue.