Reading up on file caching in Windows

C# (the programming language I use most in my day-to-day coding) and other high-level languages like it are great, because they abstract away a lot of the ugly details of programming. Every once in a while, though, a problem comes along that forces one to peel away these layers of abstraction and understand the inner workings of what’s beneath in order to fully analyze it.

The most recent instance of this was a situation where a database server would randomly hang for a couple of seconds every now and then (see my Server Fault question about this). This article is a collection of the things I learned and helpful resources I came across while trying to solve this problem.

Analysis and reproduction

The first thing I did was fire up Process Monitor to record what the process in question was doing during the hang and in the seconds leading up to it. Process Monitor is a great tool for situations like this, as it provides a very detailed picture of which API calls a process makes, which flags it passes, et cetera. The next step in understanding the problem was to dig into the documentation of the APIs I saw the process calling (most notably CreateFile, WriteFile and FlushFileBuffers, which, confusingly enough, Process Monitor refers to as FlushBuffersFile). By the way, Windows Sysinternals Administrator’s Reference is a great book for getting the most out of Process Monitor and the various other Sysinternals tools.
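To get a feel for how the write and flush calls behave in isolation, a small timing probe can help. The following is only a minimal sketch in Python, not the code the database server runs: the function name timed_write_and_flush and its parameters are made up for illustration. It relies on os.fsync, which on Windows goes through the C runtime's _commit and, as far as I know, ends up in FlushFileBuffers.

```python
import os
import time


def timed_write_and_flush(path, payload, rounds=5):
    """Write `payload` to `path` `rounds` times, flushing after each write.

    Returns one (write_seconds, flush_seconds) tuple per round.
    Hypothetical helper for illustration only.
    """
    # O_BINARY only exists on Windows; fall back to 0 elsewhere.
    flags = os.O_WRONLY | os.O_CREAT | os.O_TRUNC | getattr(os, "O_BINARY", 0)
    fd = os.open(path, flags)
    timings = []
    try:
        for _ in range(rounds):
            start = time.perf_counter()
            view = memoryview(payload)
            while len(view) > 0:
                # os.write may accept only part of the buffer, so loop.
                view = view[os.write(fd, view):]
            written = time.perf_counter()
            # Ask the OS to push cached data for this file to disk.
            os.fsync(fd)
            flushed = time.perf_counter()
            timings.append((written - start, flushed - written))
    finally:
        os.close(fd)
    return timings
```

Calling it with, say, a 4 KB payload and a few hundred rounds and then looking at the largest flush time gives a rough idea of whether the flush is where the time goes. The writes themselves normally return quickly because they only land in the cache.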

I then wrote a tool that parsed the Process Monitor log file and replayed the same API calls, so I could reproduce the problem without having to fiddle with the production system. While I am obviously spoiled by the nice APIs in .NET, it was actually kind of fun to be coding at a lower level again, working with handles, pointers, byte arrays and the like.
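The idea behind such a replay tool can be sketched roughly as follows. This is not my actual tool, just a simplified illustration in Python with several assumptions: the log was exported from Process Monitor as CSV with Operation, Path and Detail columns, the Detail text of a write looks like "Offset: 0, Length: 4,096", and the names parse_events and replay are hypothetical. To stay away from the real files, the sketch writes zero-filled buffers into a scratch directory instead of the original paths.

```python
import csv
import io
import os
import re
import time


def _field(detail, name):
    """Pull a number such as 'Length: 4,096' out of the Detail column."""
    match = re.search(name + r":\s*([\d,]+)", detail)
    return int(match.group(1).replace(",", "")) if match else 0


def parse_events(csv_text):
    """Return (operation, path, offset, length) for writes and flushes."""
    events = []
    for row in csv.DictReader(io.StringIO(csv_text)):
        op = row["Operation"]
        if op == "WriteFile":
            detail = row["Detail"]
            events.append((op, row["Path"],
                           _field(detail, "Offset"), _field(detail, "Length")))
        elif op == "FlushBuffersFile":
            events.append((op, row["Path"], 0, 0))
    return events


def replay(events, scratch_dir):
    """Replay events against scratch files; return (operation, seconds)."""
    flags = os.O_RDWR | os.O_CREAT | getattr(os, "O_BINARY", 0)
    handles = {}  # original path -> descriptor of its scratch file
    timings = []
    try:
        for op, path, offset, length in events:
            if path not in handles:
                name = "replay_%d.bin" % len(handles)
                handles[path] = os.open(os.path.join(scratch_dir, name), flags)
            fd = handles[path]
            start = time.perf_counter()
            if op == "WriteFile":
                os.lseek(fd, offset, os.SEEK_SET)
                os.write(fd, bytes(length))  # zero-filled stand-in data
            else:
                os.fsync(fd)  # stand-in for the recorded flush
            timings.append((op, time.perf_counter() - start))
    finally:
        for fd in handles.values():
            os.close(fd)
    return timings
```

A real export should probably be opened with the utf-8-sig encoding so that a byte order mark at the start of the file does not end up in the first column name. The sketch also ignores the flags passed to CreateFile and the timing between calls, both of which may well matter for reproducing a hang.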

What Process Monitor and Process Explorer did not tell me, however, was what mode was used to open the file (which had happened long before I had begun my analysis). Luckily, I found a blog post explaining how to determine the share mode of opened file handles. It involves using WinDbg to look at the kernel structures for the file object in question. Using WinDbg is always fun, since it has such an intuitive user interface (not). And as is usually the case, because I use WinDbg so rarely, I hadn’t set up symbols correctly on the machine I was using, so WinDbg didn’t want to do anything at first. I am not exactly sure what it needed symbols for in this situation, but anyway.

Understanding the cache manager

Another great book I always rely on in situations like this is Windows Internals, which provides detailed insights into the inner workings of Windows, in this case Windows’ cache manager. There is also this presentation on the cache manager providing similar detail. It’s quite interesting, actually, how the cache manager is implemented and how it relies on the memory manager to do the actual caching. Then there is a two-part in-depth video ([1], [2]) on Channel 9 with Molly Brown, who at the time was (and maybe still is, I don’t know) responsible for the cache manager. She mentions a few additional interesting points about the cache manager, for instance how it monitors an application and tries to figure out the pattern in which it accesses a file, so that it can provide intelligent read-ahead to improve IO performance.

Finally, there is a bunch of documentation on MSDN, such as this intro to file caching and an article on how to evaluate memory and cache usage, as well as a couple of support documents ([1], [2]) and a blog post about the problems the file cache can cause under certain circumstances. While this article on file cache performance tuning was written for Windows 2000, a lot of its recommendations should still be applicable, since according to Molly Brown in one of the Channel 9 videos mentioned above, the cache manager has changed little since it was first created during the NT days (side note: because of this, I have ordered Inside Windows NT, the first book in the Windows Internals series, to learn about the origins of NT and see how much of it is still there).

Conclusion

As my tests are still underway, I haven’t yet been able to determine the root cause of the performance issues or find a definitive solution for them. If you have any additional insights, tips or things I should look at, please leave a comment or post an answer to my Server Fault question.

I am also curious to see what changes have been made in Windows Server 2008 R2, but unfortunately, Windows Internals 6th Edition Part 2, which covers the cache manager, hasn’t been released yet.
