I have a colleague who is extremely smart and very detail-oriented. Whenever he writes code, he seems to think about optimizing it at a much finer level than anybody else would (such as whether it’s faster in Delphi to determine the length of a string once and store it in a local variable, or to call Length() two or three times). Since he works on some of our core systems, it’s generally great that he optimizes them for speed. One of his pet topics is the memory manager he uses. Over the past couple of years, he has alternated between three different ones. I thought he was crazy for this, because memory management, like networking, is such a low-level thing that most programmers don’t concern themselves with it and simply take for granted that it works. At least that’s what I thought. I don’t anymore, having seen how much of a difference the choice of memory manager can make.
The memory manager does make a difference
Even though I do most of my new development in C# these days, I recently had to do some work on an old Delphi application of mine. This application is highly parallel, and I have taken great care to write it in a lock-free fashion for optimal performance. As the amount of traffic it had to handle grew, however, we noticed that its CPU usage barely exceeded 12% on an eight-core system. One core out of eight is 12.5%, so despite its many threads the application was effectively using just one processor core.
These threads don’t share a single data structure, nor do they have to go through code that would block them, since all the critical parts are lock-free. But of course, all threads do share one thing: the memory manager. Delphi makes it very easy to swap the default memory manager for something else, so I compiled my app with my colleague’s favorite memory manager of the day and, lo and behold, performance improved dramatically. This experience converted me, and I started looking around for other memory managers that might be useful.
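To give an idea of why swapping is so easy: a replacement memory manager normally ships as a unit that you list first in the project’s uses clause, and that unit installs itself through the RTL’s SetMemoryManager hook. The following is only a minimal sketch of that hook, not a real memory manager. It wraps the current manager and counts GetMem calls. It assumes the Delphi 2006 declaration of TMemoryManagerEx (newer Delphi versions declare Size as NativeInt), and the names CountingGetMem, OldMM, NewMM and AllocCount are made up for illustration.

```delphi
program CountingMMSketch;
{$APPTYPE CONSOLE}

uses
  Windows; // for InterlockedIncrement

var
  OldMM, NewMM: TMemoryManagerEx;
  AllocCount: Integer = 0; // hypothetical counter, for illustration only

// Assumption: this signature matches TMemoryManagerEx.GetMem in Delphi 2006.
// Newer Delphi versions declare Size as NativeInt instead of Integer.
function CountingGetMem(Size: Integer): Pointer;
begin
  InterlockedIncrement(AllocCount);  // counter is shared by all threads
  Result := OldMM.GetMem(Size);      // delegate the real work to the old manager
end;

var
  P: Pointer;
begin
  // Install: copy the current manager and replace only its GetMem entry.
  GetMemoryManager(OldMM);
  NewMM := OldMM;
  NewMM.GetMem := CountingGetMem;
  SetMemoryManager(NewMM);

  GetMem(P, 128);   // now routed through CountingGetMem
  FreeMem(P);       // still handled by the original manager

  // Restore the original manager before the program shuts down.
  SetMemoryManager(OldMM);
  Writeln('GetMem calls seen: ', AllocCount);
end.
```

A full replacement fills in every entry of the record instead of delegating. It has to be installed before anything else allocates memory, which is why such units need to come first in the uses clause.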
Finding the perfect one
For the type of applications I write (long-running, high-throughput server applications), the memory manager should first and foremost scale well with the number of threads. The memory manager I had used originally didn’t scale at all: even on a multi-processor machine with multiple threads, it effectively serialized my application, so I was no better off than with a single-threaded application on a single-core machine. Another important aspect is low memory fragmentation, since the application has to keep running for days or even weeks. There is also the overhead incurred by memory management itself, though I have found that to be negligible. As my code is mostly CPU-constrained rather than memory-constrained, a slight increase in memory consumption is not a problem.
Here is a list of the memory managers I have found people in the Delphi community using, each with a short comment on my experience with it:
- Default Delphi Memory Manager: The default memory manager in Borland Delphi 2006 (the Delphi version I still use the most) is based on FastMM (see next bullet point). It was this memory manager that I had the aforementioned scalability issues with. It seems to be working fine, though, for pretty much everything else we do that is not heavily multithreaded. I can’t speak for the memory managers that ship with newer versions of Delphi as I haven’t used those.
- FastMM: This memory manager was created by Pierre le Riche as a replacement for the default memory manager. According to its creator it “scales well in multi-threaded applications, is not prone to memory fragmentation”, which I have found to be mostly true. We have seen some weird access violations, though, in an application using this memory manager that didn’t occur with other memory managers. It might not be FastMM’s fault; maybe we are doing something we are not supposed to do and other memory managers are simply more forgiving. Whatever it is, I am hesitant to use FastMM in any of my projects.
- TBBMM: This one I came across while looking for alternatives to FastMM. It is essentially a wrapper around Intel’s memory manager from their Threading Building Blocks library. I am not quite sure what the added benefit of the wrapper is (I couldn’t find source code on their site), but it is pretty simple to include the Threading Building Blocks memory manager oneself. In the experiments I did for my application, this memory manager was the fastest and most scalable overall. However, it seemed to be prone to memory fragmentation, as I was getting some nasty out-of-memory exceptions in a common scenario in my application (the application allocated a block of memory and after a while freed it, then a short while later allocated a slightly larger block, freed that, and so on).
- Visual C++ Runtime Memory Manager: Since one of our libraries written in C++ already required Microsoft’s memory manager, it seemed natural to give it a try for other parts of the application as well. The results were similar to Intel’s: it scaled well (though it was a bit slower than Intel’s), but exhibited the same problems with fragmentation.
- Miscellaneous: In my search I have come across a couple of other memory managers. Some seemed rather obscure or new and unproven, so I didn’t spend much time with them. I include them here for the sake of completeness, though (in alphabetical order): google-perftools, Nexus Memory Manager, SynScaleMM and TopMemoryManager.
Important note: While most memory managers can be downloaded for free (for instance as part of some redistributable), they may not be free to use under all circumstances. Intel, for instance, has different licenses for open-source and commercial use.
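For anyone who wants to check a memory manager against the scenario that caused my out-of-memory exceptions, the allocation pattern can be sketched roughly as follows. This is only an illustration of the pattern, not my actual application code: the sizes and the number of rounds are made-up values, and whether it triggers fragmentation depends on the memory manager and on what else the process allocates in between. My reading of the problem is that each freed block leaves a hole that is slightly too small for the next request.

```delphi
program GrowingBlocksSketch;
{$APPTYPE CONSOLE}

const
  // All three values are assumptions chosen for illustration.
  StartSize = 1024 * 1024;  // first block: 1 MB
  GrowBy    = 64 * 1024;    // each following block is 64 KB larger
  Rounds    = 100;

var
  Block: Pointer;
  Size, I: Integer;
begin
  Size := StartSize;
  for I := 1 to Rounds do
  begin
    GetMem(Block, Size);        // allocate a block...
    FillChar(Block^, Size, 0);  // ...use it for a while...
    FreeMem(Block);             // ...and free it again
    Inc(Size, GrowBy);          // the next request is slightly larger
  end;
  Writeln('Completed ', Rounds, ' allocate/free rounds');
end.
```

To compare memory managers, run the same loop under each one and watch the virtual size of the process. In a real server, other threads allocate in between, which is what makes the holes hard to reuse.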
Doing your own memory management
While rolling your own is probably a pretty bad idea when it comes to memory management, I did in fact end up doing something like it. I am still using the default memory manager for most of my memory management needs. However, I converted some of the data structures that I use the most to fixed-size records. This allows me to allocate one large chunk of memory and then carve out pieces to produce the records as needed. An integer-sized flag in each record tracks whether it is in use. When my current chunk runs out of free records, I just allocate a new one. Working with records also cuts back on some of the overhead that comes from using objects, although I am not sure that this overhead is really significant, especially since you trade in some flexibility and convenience by using records. It did prove to be much faster, though.
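The idea can be sketched roughly like this. It is a simplified, single-threaded illustration and not my production code: the record layout, the chunk size and the names TPoolRecord, RecordsPerChunk, AcquireRecord and ReleaseRecord are all made up for the example. In a multithreaded application the usage flag would have to be claimed atomically (for instance with an interlocked compare-and-exchange), and a free list would avoid the linear scan.

```delphi
program RecordPoolSketch;
{$APPTYPE CONSOLE}

uses
  SysUtils; // AllocMem lives here in older Delphi versions

const
  RecordsPerChunk = 1024; // assumed chunk size, tune for your workload

type
  PPoolRecord = ^TPoolRecord;
  TPoolRecord = record
    InUse: Integer;  // integer-sized usage flag: 0 = free, 1 = taken
    Id: Integer;     // hypothetical payload fields
    Value: Double;
  end;

  PChunk = ^TChunk;
  TChunk = record
    Next: PChunk;    // chunks form a singly linked list
    Records: array[0..RecordsPerChunk - 1] of TPoolRecord;
  end;

var
  Chunks: PChunk = nil; // head of the chunk list

function NewChunk: PChunk;
begin
  Result := AllocMem(SizeOf(TChunk)); // zero-filled, so every flag starts as free
  Result^.Next := Chunks;
  Chunks := Result;
end;

function AcquireRecord: PPoolRecord;
var
  Chunk: PChunk;
  I: Integer;
begin
  // Look for a free record in the existing chunks first.
  Chunk := Chunks;
  while Chunk <> nil do
  begin
    for I := 0 to RecordsPerChunk - 1 do
      if Chunk^.Records[I].InUse = 0 then
      begin
        Chunk^.Records[I].InUse := 1;
        Result := @Chunk^.Records[I];
        Exit;
      end;
    Chunk := Chunk^.Next;
  end;
  // Every record is taken: allocate a new chunk and hand out its first record.
  Chunk := NewChunk;
  Chunk^.Records[0].InUse := 1;
  Result := @Chunk^.Records[0];
end;

procedure ReleaseRecord(Rec: PPoolRecord);
begin
  Rec^.InUse := 0; // the memory stays in the chunk and can be handed out again
end;

procedure FreeAllChunks;
var
  Next: PChunk;
begin
  while Chunks <> nil do
  begin
    Next := Chunks^.Next;
    FreeMem(Chunks);
    Chunks := Next;
  end;
end;

var
  A, B: PPoolRecord;
begin
  A := AcquireRecord;
  A^.Id := 1;
  ReleaseRecord(A);
  B := AcquireRecord;
  Writeln(A = B); // prints TRUE: the released slot is reused
  FreeAllChunks;
end.
```

Acquiring and releasing a record never touches the memory manager. Only allocating a new chunk does, so with a large enough chunk the shared memory manager drops out of the hot path almost entirely.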