Software performance with the roofline model

2021-10-23
4 min read

Terminology

Memory bandwidth

In the context of the roofline model, memory bandwidth is the rate (unit: bytes/second) at which data can be read from DRAM or from the caches. This measure is the first bound on how fast a piece of software can run.

Peak floating-point performance

Peak floating-point performance indicates the theoretical maximum number of floating-point operations (FLOPs) a compute unit can process per second (unit: FLOP/s). This measure is the second bound on how fast a piece of software can run.

Arithmetic intensity

With memory bandwidth and peak floating-point performance as measures of the hardware, we continue with arithmetic intensity (AI) as a measure of the target software.

The AI of a program is the number of FLOPs it performs relative to the amount of memory it accesses (unit: FLOP/byte). In other words, AI describes how many compute instructions are executed for each byte loaded from memory. The more instructions performed for the same amount of memory traffic, the higher the AI.
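As a worked example, consider the classic SAXPY kernel; the FLOP and byte counts below are the standard textbook tally (two loads, one store, no cache reuse assumed), not measurements from this post:

```c
#include <stddef.h>

/* SAXPY kernel y[i] = a * x[i] + y[i]: per element it performs
 * 2 FLOPs (one multiply, one add) and moves 12 bytes with 4-byte
 * floats (load x[i], load y[i], store y[i]), assuming no reuse. */
void saxpy(int n, float a, const float *x, float *y) {
    for (int i = 0; i < n; i++)
        y[i] = a * x[i] + y[i];
}

/* AI = FLOPs / bytes moved = 2 / 12 ≈ 0.17 FLOP/byte. */
double saxpy_ai(void) {
    return 2.0 / (3.0 * sizeof(float));
}
```

An AI this low already hints that SAXPY will be limited by memory traffic on most machines, not by the floating-point units.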

(figure: arithmetic intensity as FLOPs per byte of memory traffic)

Interpreting roofline

The process of optimizing software performance is sometimes more art than engineering. Modern computer architectures and compilers are inherently complex, and choosing the right optimization in a time-efficient manner becomes harder with every newly introduced technology. Because of this, we want a formal tool in our toolkit that gives us a strategy for making software faster with an engineering approach.

Before we go further with the roofline: the first thing a programmer should consider is an optimal algorithm. Runtime complexity should not only be of interest to theoretical computer scientists. Only once the algorithm is optimal should software performance be the next step.

The roofline itself is a simple plot involving the three measures above, all of which we can gather fairly easily: the hardware’s memory bandwidth and peak floating-point performance, and the software’s arithmetic intensity. The hardware’s memory bandwidth and peak floating-point performance are two upper bounds on the speed of any software running on the platform. Together they form the canonical shape of a ‘roofline’, hence the name.

(figure: the canonical roofline plot)

Depending on its AI, a program is either memory bound or compute bound. Its theoretical performance can be computed with the simple formula min(peak floating-point performance, AI × memory bandwidth). In the plot below, the actual performance of software 1 and 2 is shown together with the roofline.
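The min() formula above can be sketched directly; the hardware numbers here (100 GFLOP/s peak, 25 GB/s bandwidth) are made up for illustration:

```c
/* Roofline: attainable GFLOP/s = min(peak GFLOP/s, AI * bandwidth GB/s). */
double attainable(double peak_gflops, double bw_gbs, double ai) {
    double mem_roof = ai * bw_gbs;               /* the slanted part  */
    return mem_roof < peak_gflops ? mem_roof     /* memory bound      */
                                  : peak_gflops; /* compute bound     */
}
```

For a hypothetical machine with peak = 100 GFLOP/s and bandwidth = 25 GB/s, a kernel with AI = 0.5 FLOP/byte is capped at 12.5 GFLOP/s (memory bound), while a kernel with AI = 8 FLOP/byte hits the flat 100 GFLOP/s roof (compute bound). The ridge point, where the two roofs meet, sits at AI = peak / bandwidth.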

(figure: software 1 and 2 plotted against the roofline)

Software 1 is memory bound; software 2 is compute bound. A memory-bound program like software 1 can be made faster by increasing its arithmetic intensity. Software 2, by contrast, can only get faster on faster hardware, i.e., no further software optimization will help.

Optimization strategies

Optimally, any software’s performance should be near the roofline. In reality, however, the attained performance is often nowhere near the roofline, and we often don’t know why. In the next plot, software 3 doesn’t seem to achieve its performance potential despite a high AI.

(figure: software 3 sitting well below the roofline)

Often, the performance bottlenecks are:

  • Bad cache locality
  • Lack of vectorization / SIMD
  • Load imbalance
  • etc.

For kernels with a high AI, the following optimizations can be applied:

  • Improve instruction-level parallelism: concretely, unroll loops or split a single loop into multiple separate loops, depending on the kernel.
  • Balance the floating-point operation mix: reaching peak FLOP/s of course requires most executed instructions to be floating-point operations.
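A minimal sketch of the unrolling point, using the classic sum-reduction example (the choice of four accumulators is illustrative):

```c
/* Naive reduction: every add depends on the previous one, so the loop
 * is serialized on the latency of the floating-point adder. */
double sum_naive(const double *a, int n) {
    double s = 0.0;
    for (int i = 0; i < n; i++)
        s += a[i];
    return s;
}

/* Unrolled with four independent accumulators: the four adds per
 * iteration do not depend on each other, exposing instruction-level
 * parallelism to the out-of-order core. */
double sum_unrolled(const double *a, int n) {
    double s0 = 0, s1 = 0, s2 = 0, s3 = 0;
    int i = 0;
    for (; i + 4 <= n; i += 4) {
        s0 += a[i];
        s1 += a[i + 1];
        s2 += a[i + 2];
        s3 += a[i + 3];
    }
    for (; i < n; i++)  /* remainder elements */
        s0 += a[i];
    return (s0 + s1) + (s2 + s3);
}
```

Note that the unrolled version reassociates the additions, so a compiler will only do this transformation automatically under relaxed floating-point flags such as -ffast-math.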

For kernels with a low AI, the following optimizations can be applied:

  • Unit-stride access: optimizing for unit-stride memory accesses engages hardware prefetching, which significantly increases the effective memory bandwidth.
  • Ensure memory affinity: most microprocessors today include a memory controller on the same chip as the processors. If the system has two multicore chips, then some addresses go to the DRAM local to one chip, and the rest must cross a chip interconnect to reach the DRAM local to the other chip; this latter case lowers performance. This optimization allocates data, and the threads tasked with that data, to the same memory-processor pair, so that a processor rarely has to access memory attached to other chips.
  • Use software prefetching: the highest performance usually requires keeping many memory operations in flight, which is easier to achieve via prefetching than by waiting until the data is actually requested by the program. On some computers, software prefetching delivers more bandwidth than hardware prefetching alone.
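The unit-stride point can be illustrated by the two loop orders over a row-major 2D array; the array size is arbitrary:

```c
/* Row-major 2D array: a[i][j] and a[i][j+1] are adjacent in memory. */
#define N 256

/* Stride-1 inner loop: consecutive addresses, which the hardware
 * prefetcher recognizes and streams ahead of the loop. */
double sum_unit_stride(double a[N][N]) {
    double s = 0.0;
    for (int i = 0; i < N; i++)
        for (int j = 0; j < N; j++)
            s += a[i][j];
    return s;
}

/* Stride-N inner loop: each access jumps N * 8 bytes, touching a new
 * cache line every time and defeating the prefetcher. */
double sum_strided(double a[N][N]) {
    double s = 0.0;
    for (int j = 0; j < N; j++)
        for (int i = 0; i < N; i++)
            s += a[i][j];
    return s;
}

/* Both orders compute the same result; only the access pattern,
 * and hence the achieved bandwidth, differs. */
int patterns_agree(void) {
    static double a[N][N];
    for (int i = 0; i < N; i++)
        for (int j = 0; j < N; j++)
            a[i][j] = (double)(i + j);
    return sum_unit_stride(a) == sum_strided(a);
}
```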

Practical tools

  • Generating roofline plots: ERT
  • Measuring software’s AI: perf

Sources