Donnie’s Parallelized Buddhabrot Solver
Last updated: 2007-09-04
Anytime you get a brand new computer, you almost immediately want to find something to do with it that really pushes it to the limit. I recently got a quad-core Intel box with 4GB of RAM, and once I got it all set up I really wanted to use it for something that would leverage the capabilities of the machine. Since I have wanted to write a program to generate Buddhabrot images for quite a while, I decided that this would be a great project to try out. Since this is a computationally intensive task, I chose C++ since I will need all the help I can get.
To give a bit of background information about the Buddhabrot problem, it really is just a different way of rendering the Mandelbrot set. For the traditional Mandelbrot set, programs will compute the number of times a given function on the complex plane iterates (traditionally it’s Zn+1 = Zn² + c) before escaping from a particular region. The Buddhabrot rendering method simply changes this by recording each value of Z as the computation progresses, and then each point is plotted on the final image. Deeply iterated images are the most detailed and interesting. It takes a lot of computation, to be sure.
There are several different ways to generate a Buddhabrot image. The traditional way is to randomly generate points within a particular region of the complex plane, compute the results of iteration, then plot those points. Repeat this process a lot of times, and voila you get your beautiful Buddhabrot image.
A second option is to simply scan the entire region of the complex plane, choosing regularly spaced sample points to generate the resulting image. This works fine enough for a view of the entire Mandelbrot set, but if you want to zoom in on a particular region then this becomes very unacceptable. But, it’s easy enough to implement, and that is the one I chose. (It’s also the harder one to parallelize, so I figure that it’s always easier to switch to the traditional approach from what I have.)
The first version of my Buddhabrot generator was single-threaded, and it worked well enough, but of course it was slow. My programs generate ASCII PPM format images, since it is very simple to write such images to the console or to a file using C++ stream IO. Also, my initial program only generates grayscale images;I haven’t quite gotten around to implementing the color support I want, yet.
Of course, once you have a sequential version of a program, the next step is to figure out how to parallelize it. The Buddhabrot computation is what they call “embarrassingly parallel” - since individual computations don’t depend on the values of other computations, you can partition the task in any way you want, and it won’t have much effect on the difficulty of implementation.
Now, I just happen to have a pthreads thread-pool class lying around from one of the classes I teach; it’s part of a lab assignment, so of course I implemented my own version to see how hard the problem is. I thought this would be a good place to start, so I created a simple “job” class to compute chunks of the Buddhabrot image and store the values. Since I have 4GB of RAM I decided to make it really simple and just store STL vectors filled with the results of the iteration computations. I ended up with something like this:
- Sample 1: z0, num_iters, { z1, z2, z3, … }
- Sample 2: z0, num_iters, { z1, z2, z3, … }
- Sample 3: z0, num_iters, { z1, z2, z3, … }
- …
Once a job has completed generating samples of its entire region, the job goes into a “completed jobs” queue where it is picked up by an aggregator thread. The aggregator is responsible for taking each sample’s list of values and updating the Buddhabrot image properly. (This takes a little bit of math, since you are mapping points in the complex plane into a set of pixels, but it is really quite straightforward.)
So the general process is as follows. The main thread starts up the thread-pool with N threads (I pick N = 4 since I have four cores and the threads don’t need to wait around for IO or anything), throws all the jobs into the thread-pool, and then every second it pulls all completed jobs out of the thread-pool and updates the image it’s building. Simple!
This worked okay, for a while, at least. Suddenly all four cores were doing something, and of course that is going to seem like progress when the last version of the program would use one and leave the other three sitting around with nothing to do.
But, eventually a problem showed up. I couldn’t figure it out at first, because the program seemed to be doing so well at first. But when I would choose certain parameters, my program would use up more and more memory, until the point that the system would start swapping to disk, and then my CPU utilization would jump from 100% down to about 15%. It was exasperating!
But then I started thinking about it, and I realized that the aggregator thread was simply not able to keep up with the worker threads. In retrospect this is pretty obvious; there is really only one part of the design that can vary its memory usage, and that’s these potentially huge lists of samples that the jobs were generating and storing, until the point where the aggregator could actually get to them. If the worker threads outpace the aggregator thread, the memory use will climb through the roof.
At first blush, a reasonable solution might seem to be to limit the number of completed jobs that can be stored in the queue. If a worker finishes a job and there’s no more room for it, the worker just waits until the aggregator can get some of the jobs wrapped up. This would certainly work, but it would be pretty inefficient: If only one worker is blocked then the aggregator can take over those CPU cycles. If multiple workers are blocked, and this is the most likely result, then now you are wasting some CPU cycles.
The problem here is that the design simply moves around too much information between threads! If one thread sends another thread a big giant message, then that other thread has to process that big giant message before it can do anything else. If you can somehow reduce the amount of information that has to be passed between threads, your whole program is going to go faster. That’s just obvious.
So, this is exactly what I did. Instead of having all of the results stored in the job objects, I gave each thread its own local copy of the Buddhabrot image being generated. There is simply no way this will require as much memory as all those iteration values use, and it reduces the communication requirements between threads down to nearly zero. This also means that threads can update local storage, so there are no synchronization requirements either.
The threads do still need to know who should compute what row, but that turns out to be pretty easy to solve. The Buddhabrot set, just like any rendering of the Mandelbrot set, has certain regions that are fast to compute, and other regions that are very slow to compute. The center is usually the worst; if you have a very large iteration limit and you are computing a large image, you are going to have this big gob of pixels in the center that go all the way out to that full iteration limit, and thus they are going to take a really long time to compute. So it’s not good enough to just divide the image into N sections and then hand each section to one of N threads; some of the threads are going to finish right away, wasting CPU cycles that could be spent on the hard parts of the image.
So, I implemented a very simple row-counter. When a thread is ready for more work, it simply asks the row-counter for the coordinate of the next row to compute. This is shared state, one integer basically, and all the threads can have nearly identical loads. Easy peasy.
Another benefit of this altered design is that the aggregator thread really only needs to do its work once everything else is completed. So the workers are no longer limited by the performance of an aggregator thread; they can just go until they’re done. The main thread simply monitors the shared state (again, polling it every second or so), and when all rows are completed, the main thread can go through and combine all of the worker threads’ results. Since the worker threads store much less information than in the previous design, this aggregation process goes very quickly.
Once I got these changes all coded up, I was delighted to find that the Buddhabrot generator ran much more efficiently. CPU usage was nearly 100% (or nearly 400% if you count each core separately), and memory usage sat at a cool 2%.
So I cranked up the iteration count a bit, and something a little disturbing started happening. Suddenly, instead of my program spending nearly 100% of its time in user-space, it started spending as much as 25% in system calls. This isn’t good - where are those cycles going?!
I investigated the synchronization code first and found that everything was working pretty well there. Very little time was being lost to thread contention issues. So it had to be something else.
I wasn’t sure this was going to be the problem, but I had a hunch about something so I went back to my STL vector code. You see, in order to keep my Mandelbrot fractal functions generic, the functions take a pointer to an STL vector of complex values. If you pass in null, the function knows you don’t care about the intermediate values. If you pass in a non-null pointer, the function stores the intermediate values for you. This lets me easily switch between several different fractal functions. (Right now I have Mandelbrot, Mandelbar (tricorn), and a few variations of these like the Cactus fractal.)
This is great, but each time you move to a new sample, you don’t want those old values around anymore. So, my original code created the vector local-variable inside the loop that iterated over the samples. Each time a sample’s computation was complete, the vector would go out of scope and get cleaned up, and each time a new sample was computed, a new vector would be initialized. Nice and clean.
The only problem with this was that my code, or more accurately the STL vector code, was doing a ton of memory management! Hence the massive amount of time being spent in system calls.
To test this theory I simply moved the vector variable into the data-members of the thread class. This way the vector can be reused for each new sample; you just call vector<T>::clear() on it when you’re ready to start the next sample. Fortunately, when the vector clears its contents, it retains the capacity to store that many values, so subsequent use of the vector will be much more efficient from a memory-management perspective.
After I made this small tweak, I recompiled the code and gave it a run. The memory usage jumped about 0.3% from the extra memory I wasn’t releasing, but the time spent in system calls went from 25% down to around 0.2%. Sweet.
So, now my Buddhabrot generator seems to be pretty efficient. There is probably room to improve the efficiency of individual computations, but in a general sense, the program now seems to use system resources pretty efficiently. And I have to say, it’s a whole lot faster than the original single-threaded version. Man, is that ever true.
In summary, here are the lessons that I have learned from this project so far:
-
When parallelizing a computationally intensive task, try to minimize the amount of information that has to be passed between threads, and minimize the amount of aggregation processing that needs to be performed. If this aggregation is performed on the fly in one thread, it will very likely impede the progress of other threads at some point, when the nature of the computation pushes the aggregator over the edge.
In those cases, there are two choices: You can block worker threads until the aggregator catches up, which will waste CPU cycles. Or, you can keep an unbounded queue of messages/data for the aggregator to consume, which will waste memory, and then CPU cycles when the virtual memory manager starts to thrash. So best to avoid this completely, if it’s at all possible.
-
For computationally intensive tasks in general, reuse as much work as possible. The obvious realization of this is to cache frequently-used values that don’t change across large chunks of the computation. This is always a good idea. Most people who need to optimize code have done this kind of thing, and we all know it can make a big difference.
But there is a second, less obvious place where you can apply this technique. If you are using standardized collection classes, cache the collection objects and reuse them. These classes often do significant memory management under the hood, and if your collection object can be cleared of its contents but still retain the capacity to store that amount of data again, you just eliminated a huge amount of memory management overhead from your program.
-
If you are running a program with N worker threads on a computer with N cores, do yourself a favor and use the
niceutility. Especially if the computer is also your webserver and mailserver.