Optimizing Performance Requires an Open Mind
Performance optimization can be great fun. Finding new ways to make a code base run faster, have more throughput, or support more users is immensely satisfying. Pursuing optimizations can also lead you down plenty of dark alleys with disappointing dead ends. Figuring out which optimizations to pursue and avoiding the ones that end up making performance worse requires decent knowledge of the technology stack that you're using and an open mind to explore the code to find where it most needs to be sped up.
Our intuition can easily lead us astray when trying to reason about what parts of a program are slow and should be the target of optimization. One reason that this is such a common problem is that the human brain processes things differently than a computer. While the brain is a massively parallel processing structure, the microprocessor is essentially a sequential processing machine. Sure, it does do some parallelization, but not nearly on the level of the brain, so the types of computation that are easy in each system are vastly different.
These large differences in computational models between the brain and the microprocessor lead to misconceptions in what we think would be the optimal algorithm for any given problem. To make up a simple example, consider what you would do if given a huge matrix of numbers and asked to manually change them all to zero to reset the matrix. Assume that it is a sparse matrix, so most of the numbers are already zero. Why would you do this manually? I don't know. It's a contrived example, so humor me.
Since it is a sparse matrix, the fastest way to clear the matrix is probably to find the non-zero values and change them to zero. The reason this algorithm can be stated so simply is because the "find non-zero values" operation is almost instantaneous for us. Our visual system is excellent at pattern matching, and we can easily pick out the non-zero values in a field of zeros.
The same is not true of a computer program. The same algorithm translated to a program would have to scan through every element of the array, check if it's zero already, and change it to zero if it's not. Checking if every element is zero is a waste of time for the computer since it has to touch every element anyway. It might as well simply write a zero to every memory location, and since the matrix is probably contiguous in memory, it could do this operation even faster by writing larger blocks of zeros at once. It's the same task, but the brain and the microprocessor each lend themselves to different optimal solutions.
These differences come up all the time in programming. It's common when processing a lot of values in a loop to only do a certain calculation on some of the values. If the combination of the filtering of values and the calculation on the values that pass the filter can be converted into one calculation on all of the values, the latter option is often faster than the former operation. Not only does it optimize for the type of calculations that are most efficient for a microprocessor, but it also removes a code branch inside of a loop. The resulting loop makes more efficient use of the hardware because more instructions can be in flight in the processor pipeline at once and there will be less data-dependent branch mis-predictions.
All this talk about processor pipelines and branch prediction leads into another point. Getting good at recognizing where code performance can be improved is a skill that requires a good mental model of how computation is done at multiple levels of your technology stack. If you're programming in a web framework like Ruby on Rails on Django, you'll have a better idea of how to optimize your code if you know the underlying programming language better. Going further, knowing more about how compilers and interpreters work will enable you to see more optimizations.
You can keep drilling down to gain an even better understanding of how computation works in a computer. Assembly languages have quite different ways of doing computation than high-level languages, with multitudes of addressing modes and special instructions for controlling program flow and transforming data in interesting ways. Microprocessor architecture has all sorts of ways to speed up code as well, with deep pipelines, multiple FPUs (Floating Point Units) and ALUs (Arithmetic Logic Units), branch prediction, instruction trace caches, and cache-memory hierarchies. Computer Organization and Design, and Computer Architecture: A Quantitative Approach are good books to start exploring microprocessor architecture.
Although knowledge is power, knowledge can also be dangerous, so knowledge of the tech stack should be used wisely. Once you have a better idea of how computer computation works, it's easy to fall into the trap of trying to micro-optimize every line of code. I'm not sure why that's the case. Maybe after learning about all of those cool optimizations it's hard to resist attempting to do them by hand, but if learning about compilers and hardware architectures teaches you anything, it should teach you that tons of optimizations are being done to your code automatically.
Instead of spending your time trying to micro-optimize stuff that the compiler and hardware can do better, you should be focusing on algorithm complexity and domain knowledge, areas where those lower-level tools can't possibly help you. Knowing about the lower levels of abstraction can help you make better higher-level decisions, but you shouldn't use that knowledge as an excuse to attempt to duplicate what your tools are already doing for you.
I've inherited code bases from other embedded software engineers who did not trust GCC to compile their code correctly with optimizations turned on. My impression is that it is a common belief in the embedded software world that the only safe way to compile C or C++ code is to do it with optimizations turned all the way off. While it may be true that a particular code base works with optimizations turned off, and it crashes with optimizations turned on, that doesn't mean GCC optimizations aren't safe. GCC is a pretty well-proven tool, and if I had to put money on it, I would bet that GCC is correct and there's a runtime problem in the code rather than the other way around.
If code doesn't work or it's unstable with optimizations turned on, it's most likely because of a race condition or memory leak in the code, and compiling without optimizations is hiding the problem. Trust me, you don't want to run production code with optimizations turned off. It's like buying a Ferrari and never driving it over 20mph. You are sacrificing a tremendous amount of performance for the illusion of safety.
Consider a few of the things that the first level of GCC optimizations gives you: faster subroutine calls, branch prediction from probabilities, and register allocation. Register allocation alone is huge. Without it, the assembly code that's generated will read every variable in your code from memory, perform the operation in the expression being evaluated, and store the result back to memory in the variable specified on the left-hand side of the expression. Even temporary variables that are only used once to calculate intermediate results are assigned a memory location and values are loaded and stored to those locations.
Register allocation uses the processor's fastest memory to store intermediate values, and operations can be chained together using these registers as operands. Register allocation also enables the processor to do register renaming and use a much larger internal set of registers not available to the assembly language. Without register allocation, not enough registers get used to take advantage of register renaming, and the processor pipeline utilization is so restricted that not many instructions can be scheduled to execute at once. Why waste all of that power by turning off optimizations?
Other optimization levels add even more performance gains by doing a deeper analysis of the code to do things like reorder instructions, unroll loops, and inline functions. These are riskier optimizations for code that's not robust against race conditions, but the real problem that should be fixed in this case is the code. Higher optimization levels also are more likely to increase code size. The first level will almost certainly reduce code size as well as increase performance because it's primarily removing instructions and condensing code to make it faster. The higher levels start doing things that increase code size, and code size may be a real issue in some applications. In any case, if you're optimizing performance, you should definitely turn compiler optimizations on to some appropriate level. Otherwise, you're ignoring a huge performance gain for no good reason.
What about doing some of these optimizations by hand where they're needed most? Well, compilers and VMs (virtual machines used for interpreted languages and newer compiled languages like Java and C#) can do all kinds of intricate optimizations that would take us humans way too long to do in every place that they would help, and get it right every time. The compiler or VM writer has to get it right once—granted, for the more complex general case—and then the compiler or VM can use that optimization every place that it detects it to be applicable.
For example, Steve Yegge describes a great optimization that was figured out way back in 1994 in his Dynamic Optimizations Strike Back post. It's called type-feedback optimization, and what it does is during run-time the VM watches the type of the dynamic method calls. When a certain type is used in a method call in a loop, the VM assumes that type will be used the next time through the loop and inlines the method. If it was wrong, a guard instruction will catch the misprediction and recover, but if it was right, the program gets a nice speed-up.
That type-feedback optimization is one example. Compilers and VMs implement hundreds, if not thousands of optimizations, and they keep getting better and better all the time. Even if we could do these optimizations slightly better by hand, why would we try? The compiler can do so many more optimizations than we could even remember and do them on much larger code bases because it's automated. Trying to compete with that kind of scale is rather silly. We should be looking for higher-level optimizations that organize the overall structure of the code, and use the best algorithm for the task that we're trying to accomplish in the code. Then we're working with the compiler, leveraging the strengths of both us as programmers and the compiler as an automated optimization system, instead of trying to compete on a level that ends up being a waste of time.
I've come this far without even mentioning that you should profile your code before optimizing. Profiling is hugely important because if you don't measure where your code is wasting time, you're much more likely to make the wrong optimizations based solely on intuition. The places where a program is slow can catch you by surprise, and the optimizations that fix it can be equally surprising (sometimes surprisingly easy).
At this point it's pretty much common knowledge that you have to measure before you optimize, but you still need to be careful in how you measure. Using microbenchmarks to measure the performance of small snippets of code before and after optimizations can be incredibly misleading. It's much better to measure the performance of code in a real application with a real environment and a real workload. Reality can be drastically different than microbenchmarks, and things like disk and network accesses, process interactions, and realistic data sets can have a huge impact on a real program while a microbenchmark may not even run into those issues.
Caching can also have profound effects on the run-time performance of code, and microbenchmarks may hide the fact that certain optimizations perform well in isolation but trash the cache for the rest of a larger program. Organizing data structures to take advantage of caching can drastically speed up a program, but be careful because different processors have different cache characteristics. You could be optimizing for a particular, narrow cache architecture. If a program will run on different cache architectures, profiling should be done on as many of those architectures as possible to make sure the code isn't being optimized into a corner.
Optimizations can make code brittle. Heavily optimized code becomes completely dependent on the context under which it was optimized, and if that context changes—because the processor or the compiler or the operating system changed—the optimizations can actually make code slower than if the optimizations weren't done at all. It's best to first make code clear, clean, and readable. Most of the time if the code is well-designed, it doesn't need to be optimized further. Making it clean is good enough. Most poorly performing code I've come across is also messy, verbose, ugly code. Cleaning it up has the dual benefits of making it easier to understand and improving its performance. Only optimize when performance becomes a problem, measure where it's a problem, and focus on improving the code that is actually a problem.
When you do optimize code, remember that your brain does computation much differently than a computer does so your intuition is probably wrong, your knowledge and understanding of the technology stack can greatly help you find optimizations, your compiler is your friend so you should work with it, and your profiler will give you the best answers if you measure reality. The best optimizations may not be obvious, but with an open mind, you can find them.
Silicon Vs. Grey Matter
Our intuition can easily lead us astray when trying to reason about what parts of a program are slow and should be the target of optimization. One reason that this is such a common problem is that the human brain processes things differently than a computer. While the brain is a massively parallel processing structure, the microprocessor is essentially a sequential processing machine. Sure, it does do some parallelization, but not nearly on the level of the brain, so the types of computation that are easy in each system are vastly different.
These large differences in computational models between the brain and the microprocessor lead to misconceptions in what we think would be the optimal algorithm for any given problem. To make up a simple example, consider what you would do if given a huge matrix of numbers and asked to manually change them all to zero to reset the matrix. Assume that it is a sparse matrix, so most of the numbers are already zero. Why would you do this manually? I don't know. It's a contrived example, so humor me.
Since it is a sparse matrix, the fastest way to clear the matrix is probably to find the non-zero values and change them to zero. The reason this algorithm can be stated so simply is because the "find non-zero values" operation is almost instantaneous for us. Our visual system is excellent at pattern matching, and we can easily pick out the non-zero values in a field of zeros.
The same is not true of a computer program. The same algorithm translated to a program would have to scan through every element of the array, check if it's zero already, and change it to zero if it's not. Checking if every element is zero is a waste of time for the computer since it has to touch every element anyway. It might as well simply write a zero to every memory location, and since the matrix is probably contiguous in memory, it could do this operation even faster by writing larger blocks of zeros at once. It's the same task, but the brain and the microprocessor each lend themselves to different optimal solutions.
These differences come up all the time in programming. It's common when processing a lot of values in a loop to only do a certain calculation on some of the values. If the combination of the filtering of values and the calculation on the values that pass the filter can be converted into one calculation on all of the values, the latter option is often faster than the former operation. Not only does it optimize for the type of calculations that are most efficient for a microprocessor, but it also removes a code branch inside of a loop. The resulting loop makes more efficient use of the hardware because more instructions can be in flight in the processor pipeline at once and there will be less data-dependent branch mis-predictions.
Know Your Tech Stack
All this talk about processor pipelines and branch prediction leads into another point. Getting good at recognizing where code performance can be improved is a skill that requires a good mental model of how computation is done at multiple levels of your technology stack. If you're programming in a web framework like Ruby on Rails on Django, you'll have a better idea of how to optimize your code if you know the underlying programming language better. Going further, knowing more about how compilers and interpreters work will enable you to see more optimizations.
You can keep drilling down to gain an even better understanding of how computation works in a computer. Assembly languages have quite different ways of doing computation than high-level languages, with multitudes of addressing modes and special instructions for controlling program flow and transforming data in interesting ways. Microprocessor architecture has all sorts of ways to speed up code as well, with deep pipelines, multiple FPUs (Floating Point Units) and ALUs (Arithmetic Logic Units), branch prediction, instruction trace caches, and cache-memory hierarchies. Computer Organization and Design, and Computer Architecture: A Quantitative Approach are good books to start exploring microprocessor architecture.
Although knowledge is power, knowledge can also be dangerous, so knowledge of the tech stack should be used wisely. Once you have a better idea of how computer computation works, it's easy to fall into the trap of trying to micro-optimize every line of code. I'm not sure why that's the case. Maybe after learning about all of those cool optimizations it's hard to resist attempting to do them by hand, but if learning about compilers and hardware architectures teaches you anything, it should teach you that tons of optimizations are being done to your code automatically.
Instead of spending your time trying to micro-optimize stuff that the compiler and hardware can do better, you should be focusing on algorithm complexity and domain knowledge, areas where those lower-level tools can't possibly help you. Knowing about the lower levels of abstraction can help you make better higher-level decisions, but you shouldn't use that knowledge as an excuse to attempt to duplicate what your tools are already doing for you.
Don't Fight The Compiler
I've inherited code bases from other embedded software engineers who did not trust GCC to compile their code correctly with optimizations turned on. My impression is that it is a common belief in the embedded software world that the only safe way to compile C or C++ code is to do it with optimizations turned all the way off. While it may be true that a particular code base works with optimizations turned off, and it crashes with optimizations turned on, that doesn't mean GCC optimizations aren't safe. GCC is a pretty well-proven tool, and if I had to put money on it, I would bet that GCC is correct and there's a runtime problem in the code rather than the other way around.
If code doesn't work or it's unstable with optimizations turned on, it's most likely because of a race condition or memory leak in the code, and compiling without optimizations is hiding the problem. Trust me, you don't want to run production code with optimizations turned off. It's like buying a Ferrari and never driving it over 20mph. You are sacrificing a tremendous amount of performance for the illusion of safety.
Consider a few of the things that the first level of GCC optimizations gives you: faster subroutine calls, branch prediction from probabilities, and register allocation. Register allocation alone is huge. Without it, the assembly code that's generated will read every variable in your code from memory, perform the operation in the expression being evaluated, and store the result back to memory in the variable specified on the left-hand side of the expression. Even temporary variables that are only used once to calculate intermediate results are assigned a memory location and values are loaded and stored to those locations.
Register allocation uses the processor's fastest memory to store intermediate values, and operations can be chained together using these registers as operands. Register allocation also enables the processor to do register renaming and use a much larger internal set of registers not available to the assembly language. Without register allocation, not enough registers get used to take advantage of register renaming, and the processor pipeline utilization is so restricted that not many instructions can be scheduled to execute at once. Why waste all of that power by turning off optimizations?
Other optimization levels add even more performance gains by doing a deeper analysis of the code to do things like reorder instructions, unroll loops, and inline functions. These are riskier optimizations for code that's not robust against race conditions, but the real problem that should be fixed in this case is the code. Higher optimization levels also are more likely to increase code size. The first level will almost certainly reduce code size as well as increase performance because it's primarily removing instructions and condensing code to make it faster. The higher levels start doing things that increase code size, and code size may be a real issue in some applications. In any case, if you're optimizing performance, you should definitely turn compiler optimizations on to some appropriate level. Otherwise, you're ignoring a huge performance gain for no good reason.
What about doing some of these optimizations by hand where they're needed most? Well, compilers and VMs (virtual machines used for interpreted languages and newer compiled languages like Java and C#) can do all kinds of intricate optimizations that would take us humans way too long to do in every place that they would help, and get it right every time. The compiler or VM writer has to get it right once—granted, for the more complex general case—and then the compiler or VM can use that optimization every place that it detects it to be applicable.
For example, Steve Yegge describes a great optimization that was figured out way back in 1994 in his Dynamic Optimizations Strike Back post. It's called type-feedback optimization, and what it does is during run-time the VM watches the type of the dynamic method calls. When a certain type is used in a method call in a loop, the VM assumes that type will be used the next time through the loop and inlines the method. If it was wrong, a guard instruction will catch the misprediction and recover, but if it was right, the program gets a nice speed-up.
That type-feedback optimization is one example. Compilers and VMs implement hundreds, if not thousands of optimizations, and they keep getting better and better all the time. Even if we could do these optimizations slightly better by hand, why would we try? The compiler can do so many more optimizations than we could even remember and do them on much larger code bases because it's automated. Trying to compete with that kind of scale is rather silly. We should be looking for higher-level optimizations that organize the overall structure of the code, and use the best algorithm for the task that we're trying to accomplish in the code. Then we're working with the compiler, leveraging the strengths of both us as programmers and the compiler as an automated optimization system, instead of trying to compete on a level that ends up being a waste of time.
Experiment With Reality
I've come this far without even mentioning that you should profile your code before optimizing. Profiling is hugely important because if you don't measure where your code is wasting time, you're much more likely to make the wrong optimizations based solely on intuition. The places where a program is slow can catch you by surprise, and the optimizations that fix it can be equally surprising (sometimes surprisingly easy).
At this point it's pretty much common knowledge that you have to measure before you optimize, but you still need to be careful in how you measure. Using microbenchmarks to measure the performance of small snippets of code before and after optimizations can be incredibly misleading. It's much better to measure the performance of code in a real application with a real environment and a real workload. Reality can be drastically different than microbenchmarks, and things like disk and network accesses, process interactions, and realistic data sets can have a huge impact on a real program while a microbenchmark may not even run into those issues.
Caching can also have profound effects on the run-time performance of code, and microbenchmarks may hide the fact that certain optimizations perform well in isolation but trash the cache for the rest of a larger program. Organizing data structures to take advantage of caching can drastically speed up a program, but be careful because different processors have different cache characteristics. You could be optimizing for a particular, narrow cache architecture. If a program will run on different cache architectures, profiling should be done on as many of those architectures as possible to make sure the code isn't being optimized into a corner.
Optimizations can make code brittle. Heavily optimized code becomes completely dependent on the context under which it was optimized, and if that context changes—because the processor or the compiler or the operating system changed—the optimizations can actually make code slower than if the optimizations weren't done at all. It's best to first make code clear, clean, and readable. Most of the time if the code is well-designed, it doesn't need to be optimized further. Making it clean is good enough. Most poorly performing code I've come across is also messy, verbose, ugly code. Cleaning it up has the dual benefits of making it easier to understand and improving its performance. Only optimize when performance becomes a problem, measure where it's a problem, and focus on improving the code that is actually a problem.
When you do optimize code, remember that your brain does computation much differently than a computer does so your intuition is probably wrong, your knowledge and understanding of the technology stack can greatly help you find optimizations, your compiler is your friend so you should work with it, and your profiler will give you the best answers if you measure reality. The best optimizations may not be obvious, but with an open mind, you can find them.
0 Response to "Optimizing Performance Requires an Open Mind"
Post a Comment