Skip to content

Keep your bottlenecks in place!!

by Dan Thompson on July 31st, 2013
Bottlenecks

It’s great if your business decide they want to invest in your database platform, as an architect it means you can design all those things you’ve always wanted to and have a proper data tier, the way all the books and articles intended. That being said, if you consider your current data tier and think about the waits/bottlenecks you see day to day which could be TempDB, Memory, IO, CPU, Network - What actually happens if you remove those bottlenecks? Bottlenecks never actually go, they just move to other places.

For example, take I/O into consideration. If you put a Fusion I/O card and/or SSD drives in place and move all your file groups up to the fusion I/O card, what results would you get? Yes, you’ve more than likely removed the I/O bottleneck you had before, but have you just moved the bottleneck to the CPU which now has more data to sort and join coming at it at quicker rates than it has ever seen before? Did you invest all that money in improving the I/O because you just wanted to clear that wait-type? Did you save any of that money to invest in some faster CPUs (…or a new server as mostly seems the case for CPU upgrades!)? Just because your data moves around from the disks quicker, does it mean that your server can now perform quicker necessarily?

Looking at TempDB contention – is this caused again by badly performing I/O subsystems? or perhaps not the most efficient number of TempDB data files? Obviously, the first thing we should consider here is why is TempDB getting hammered so much, but I’m presuming that angle has already been crossed off as a not-reasonably-possible-task. Again though, what happens if we move the bottleneck? Let’s say we put in some high iops storage to remove the bottleneck, now you can create more and more temp tables than before and perform data operations quicker into those temp tables and your developers who keep writing that bad code which uses TempDB so heavily can now rest assured they can remain shoddy coders for a little while longer. What’s your next option when TempDB consumes all that IO and you need something faster than those high iops disks/cards? Did you just move the finish line a little further into the distance rather than make it into a circuit track? Again, if TempDB can put data in and out quicker, does that mean it can necessarily sort and join data quicker? Again, you could potentially be moving your bottleneck to the CPUs.

One concern here is that your newly invested resources actually mask problems. What if you put that epic table which gets used all the time on your fast SSD/IO cards? Yeah, it can read into memory the pages it needs quicker and it’s likely to reduce the changes of any deadlocks or the duration of any blocking, but what happens if that nice plan with it’s SEEK operation goes bad and it starts scanning the data? Again your bottleneck is removed, so it’s able to scan the data at a huge rate which may go unnoticed, but then what does it do with it? Does it put all of those pages in memory? Does it start to order your whole table? Again, you might find your CPU or TempDB takes a knock.

The same thing happens with network, it just means there’s more bandwidth & latency for increased traffic and more data to process out of the door. Do your switches become the bottleneck or does the application tier now have more knocking on its door and you end up with ASYNC_NETWORK_IO waits instead?

I think the correct way to use these new high-iops devices is to add them to your solution and migrate carefully selected objects over to them as and when they become a problem. Not only do you look like a hero when there is a live problem, or when the developers have been trying to import the contents of the Internet into a heavily indexed table, but it also allows you to think about your storage in a tiered manor. Performance-Storage, Mid-line storage, and archive-storage. Now apply that to those very large tables and tell those pesky developers to consider archiving strategies for that data!

There are ways round some of these issues. I’m sure resource governor can assist in reducing the saturation to your CPU by placing an artificial limit on that resource for the general processes which can allow your important processes through even when the server is really busy.

I think bottlenecks and waits are one of the most important (if not THE important) consideration when accessing the performance of a server, but it’s really important to think about those “what if I did this..” questions. Are you just moving impending doom of your server to another day later down the line? Are you removing the obstacles which are in your way at the current moment in time, just to find that you have more obstacles behind them? I’m sure a lot of people jump at the chance of spending those hardware budgets on those annoying wait types, but think about the bigger picture for this or you could have to explain why the business have invested all that money in that expensive kit only to find no real return from their investment.