Taming the Parallel Beast

Many programmers seem to think parallelism is hard. A quick Internet search will yield numerous blogs commenting on the difficulty of writing parallel programs (or parallelizing existing serial code). There do seem to be many challenges for novices. Here’s a representative list:

  • Finding the parallelism. This can be difficult because when we tune code for serial performance, we often use memory in ways that limit the available parallelism. Simple fixes for serial performance often complicate the original algorithm and hide the parallelism that is present.
  • Avoiding the bugs. Certainly, there is a class of bugs such as data races, deadlocks, and other synchronization problems that affect parallel programs, and which serial programs don’t have. And in some senses they are worse, because timing-sensitive bugs are often hard to reproduce -- especially in a debugger.
  • Tuning performance. Serial programmers have to worry about granularity, throughput, cache size, memory bandwidth, and memory locality. But for parallel programs, the programmer also has to consider the parallel overheads and unique problems, like false sharing of cache lines.
  • Ensuring future proofing. Serial programmers don’t worry whether the code they are writing will run well on next year’s processors -- it’s the job of the processor companies to maintain upward compatibility. But parallel programmers need to think about how their code will run on a wide range of machines, including machines with two, four, or even more processors. Software that is tuned for today’s quad-core processors may still be running unchanged on future 16-, 32- or even 64-core machines.
  • Using modern programming methods. Object-oriented programming makes it much less obvious where the program is spending its time.
  • Other reasons that parallel programming is considered hard include the complexity of the effort, insufficient help for developers unfamiliar with the techniques, and a lack of tools for dealing with parallel code. When adding parallelism to existing code, it can also be difficult to make all the changes needed to add parallelism all at once, and to ensure that there is enough testing to eliminate timing-sensitive bugs.

Use Serial Modeling to Evolve Serial Code to Parallel
The key to success in introducing parallelism is to rely on a well-proven programming method called serial modeling. Using serial modeling tools and technique, programmers can achieve parallelization with enhanced performance and without synchronization issues. The essence of the method involves consistently checking and resolving problems, and beginning early in the process to slowly evolve the code from pure serial, to serial but capable of being run in parallel, to truly parallel.

The first step is to measure where the application spends time -- effort spent in hot areas will be effective, while effort spent elsewhere is wasted. The next step is to use a serial modeling tool to evaluate opportunities for potential parallelization and determining what would happen if this code ran in parallel. This kind of tool observes the execution of the program, and uses the serial behavior to predict the performance and bugs that might occur if the program actually executed in parallel.

Checking for problems early in the evolution process, while a program is still serial, ensures that you don’t waste time on parallelization efforts that are doomed because of poor performance. You can then model parallelizations that resolve the performance issues or, if no alternatives are practical, focus your efforts on more profitable locations.

The tool can also model the correctness of the theoretical parallel program, and detect race conditions and other synchronization errors while still running the serial program. Although the program still runs serially, it is easy to debug and test, and it computes the same results. The programmer can change the program to resolve the potential races, and after each change, the program remains a serial program (with annotations) and can be tested and debugged using normal processes.

When the program has fully evolved, the result is a correct serial program with annotations describing a parallelization with known good performance and no synchronization issues. The final step in the process is to convert those annotations to parallel code. After conversion, the parallel program can undergo final tuning and debugging with the other tools. The beast has been tamed.

Photo: @iStockphoto.com/angelhell

What’s Next For OpenACC?

The OpenACC parallel programming standard emerged late last year with the goal of making it easier for developers to tap graphics process units (GPUs) to accelerate applications. The scientific and technical programming community is a key audience for this development. Jeffrey Vetter, professor at Georgia Tech’s College of Computing and leader of the Future Technologies Group at Oak Ridge National Laboratory, recently discussed the standard. He is currently project director for the National Science Foundation’s (NSF) Track 2D Experimental Computer Facility, a cooperative effort that involves the Georgia Institute of Technology and Oak Ridge, among other institutions. Track 2D’s Keeneland Project employs GPUs for large-scale heterogeneous computing.

Q: What problems does OpenACC address?

Jeffrey Vetter: We have this Keeneland Project -- 360 GPUs deployed in its initial delivery system. We are responsible for making that available to users across NSF. The thinking behind OpenACC is that all of those people may not have the expertise or funding to write CUDA code or OpenCL code for all of their scientific applications.

Some science codes are large, and any rewriting of them -- whether it is for acceleration or a new architecture of any type -- creates another version of the code and the need to maintain that software. Some of the teams, like the climate modeling team, just don’t want to do that. They have validated their codes. They have a verification test that they run, and they don’t want to have different versions of their code floating around.

It is a common problem in software engineering: People branch their code to add more capability to it, and at some point they have to branch it back together again. In some cases, it causes conflict.

OpenACC really lets you keep your applications looking like normal C or C++ or Fortran code, and you can go in and put the pragmas in the code. It’s just an annotation on the code that’s available to the compiler. The compiler takes that and says, “The user thinks this particular block or structured loop is a good candidate for acceleration.”

Q: What’s the impact on scientific/technical users?

J.V.: We have certain groups of users that are very sophisticated and willing to do most anything to port their code to a GPU -- write new version of code, sit down with an architecture expert and optimize it.

But some don’t want to write any new code other than putting pragmas in the code. They really are conservative in that respect. A lot of the large codes out there used by DOE labs just haven’t been ported to GPUs because there’s uncertainty over what sort of performance improvement they might see, as well as a lack of time to just go and explore that space.

What we are trying to do is broaden the user base on the system and make GPUs, and in fact other types of accelerators, more relevant for other users who are more conservative.

After a week of just going through the OpenACC tutorials, users should be able to go in and start experimenting with accelerating certain chunks of their applications. And those would be people who don’t have experience in CUDA or OpenCL.

Q: Does OpenACC have sufficient support at this point?

J.V.: PGI, CAPS and Cray: We expect they will start adhering to OpenACC with not too much trouble. What’s less certain is how libraries and performance analysis tools and debugging tools will work with the new standard. One thing that someone needs to make happen is to ensure that there is really a development environment around OpenACC.

OpenMP was a decade ago -- they had the same issue. They had to create the specification and the pragmas and other language constructs, and people had to create the runtime system that executes the code and does the data movement.

Q: What types of applications need acceleration?

J.V.: Generally, we have been looking at applications that have this high computational intensity. You have things like molecular dynamics and reverse time migration and financial modeling -- things that basically have the characteristic that you take a kernel and put it in a GPU and it runs there for many iterations, without having to transfer data off the GPU.

OpenACC itself is really targeted at kernels that have a structured block or a structured loop that is regular. That limits the applicability of the compiler to certain applications. There will be applications with unstructured mesh or loops that are irregular in that they have conditions or some type of compound statements that make it impossible for the compiler to analyze. Users will have to unroll those loops so an OpenACC compiler has enough information to generate the code.

Some kernels are not going to work well on OpenACC whether you work manually or with a compiler. There isn’t any magic. I’m supportive, but trust and verify.