So all at once today, several people came up to me and told me that a particular program I'm responsible for was broken. It didn't entirely surprise me -- I'd had a couple of reports that the program didn't work when compiled for 64-bit, even after I thought that I had fixed the issue. What did surprise me was that everybody was emphatic that it was the build done this week that had introduced the breakage. As far as I knew, that was impossible -- literally nothing had been changed in several weeks.
From the symptoms that people were reporting, I was pretty sure that this was another instance of a bug I had fixed a while ago. The program in question interacts with a particular bit of hardware via some memory-mapped registers. There's one very important rule for accessing those registers: your accesses must be 4 bytes wide or smaller. If you break this rule, you read back all-ones instead of a valid value. The previous bug that I had quashed happened because I got a bit too clever in how I pulled data out of those registers, and when compiled for 64-bit the program ended up trying a 64-bit (8-byte) read on a register. I made some changes to the program at that point to ensure that all accesses to the memory-mapped registers went through a single pair of read/write functions, and wrote those functions to try to guarantee that all accesses would be of a valid width.
Knowing all that, I had a really good starting point, but I was still mystified as to how things had been broken in the last week. I quickly confirmed that build X-1 worked while build X would always fail. But I also verified that they were built from exactly the same code. WTF? The first thing that I did was jump right into the disassembly of the two builds. I wanted to verify that I was actually issuing the read with the right width. I knew exactly which read was failing, so it was easy to track it down in the disassembly in both cases. This is where I started to get very confused, though: the assembly that did the read was identical in both builds, up to renaming of the registers and the offset at which a particular stack variable lived.
Ok, so the next step is to single-step through the failing code. I set a breakpoint right after the instruction that reads from the memory-mapped register. gdb blows right past it and the program completes. Damn, I must have been looking at the wrong bit of disassembly after all. I set a breakpoint on a function that gets called right after the failing read so that I can figure out what I'm actually supposed to be looking at. I do hit this breakpoint, but when I look backwards from where I am, I see the exact same read instruction that I set the first breakpoint on -- and I never hit it! Ok, this time I'll set the breakpoint just before the read, and single-step to it. I set the breakpoint, and pretty soon gdb hits a branch that I can't make heads or tails of. It doesn't correspond to any if statement that I can see in the code. I ignore it and step past it, and the program jumps away from what I believed to be the failing read instruction. Ok, that explains why my breakpoint wasn't triggering, but why isn't the program reading from the hardware? I keep stepping until gdb gets to an SSE instruction, and that's when the penny finally drops.
Later versions of gcc try to use SSE instructions for accessing pieces of memory that are at least 16 bytes wide. However, gcc only ever emits an SSE instruction that accesses memory that's aligned to a 16-byte boundary -- the aligned versions of the instructions are way more efficient. Now, the C code that was failing looked like this:
for (i = 0; i < count; i++)
    array[i] = read_register(REGISTER_ARRAY_BASE + i);
gcc was clever enough to inline read_register, and after doing that it realized that I was copying from one contiguous piece of memory to another. So it unrolled the loop and used 16-byte SSE memory instructions to do the copy. But there's a catch: gcc only issues the aligned instructions, and it can't guarantee that the destination array is properly aligned. So that mysterious branch that I couldn't figure out? gcc was testing the alignment of the destination array: if it was properly aligned, it jumped to the unrolled loop with SSE instructions. If it wasn't aligned, it jumped to a fallback loop that only used 1-byte accesses. This fallback loop was where I kept setting that breakpoint that never got hit. So the reason that one build consistently worked while the other didn't was that the stack layout changed subtly between the two builds for whatever reason. In the failing build, the destination array happened to be aligned to the 16-byte boundary needed to use the SSE instructions, and everything went to hell.
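To make the shape of that generated code concrete, here is a rough C sketch of the dispatch described above. It is a hand-written reconstruction under assumptions, not actual gcc output: the name copy_like_vectorizer and the copy_path enum are made up for illustration, a 16-byte memcpy stands in for a single aligned SSE move, and the fallback copies one 32-bit element at a time for simplicity instead of reproducing the narrow accesses of the real fallback loop.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical names, used only for this illustration. */
enum copy_path { PATH_FALLBACK = 0, PATH_WIDE = 1 };

/* Copies count 32-bit words from src to dst and reports which path ran. */
static enum copy_path copy_like_vectorizer(uint32_t *dst,
                                           const uint32_t *src,
                                           size_t count)
{
    size_t i = 0;
    enum copy_path path = PATH_FALLBACK;

    /* The "mysterious branch": is the destination 16-byte aligned? */
    if (((uintptr_t)dst & 15u) == 0 && count >= 4) {
        path = PATH_WIDE;
        /* Wide path: 16 bytes per step. On memory-mapped registers with a
           4-byte access limit, this is the access that reads back garbage. */
        for (; i + 4 <= count; i += 4)
            memcpy(&dst[i], &src[i], 16);
    }

    /* Fallback path (and leftover elements): one narrow access at a time. */
    for (; i < count; i++)
        dst[i] = src[i];

    return path;
}
```

Calling this with the same source but a destination that is shifted by one element flips the path taken, which is the whole story of build X-1 versus build X: same code, different stack offset, different branch.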
Let me tell you, I got some really funny reactions from some people at work today when I told them that the new build was broken because the stack layout happened to change.
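The post doesn't say how the accessors ended up being fixed. One common way to keep a compiler from merging or widening register accesses like this is to route them through a volatile-qualified pointer of exactly the width you want, since the compiler then has to perform each access as written. A minimal sketch, assuming a hypothetical reg_base pointer; here it is backed by an ordinary array so the example runs anywhere, whereas on real hardware it would come from mapping the device. The index-based signatures are also an assumption, not the original program's API.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical stand-in for the mapped register window. */
static uint32_t fake_registers[8];
static volatile uint32_t *const reg_base = fake_registers;

/* Every read is exactly one 32-bit load; volatile keeps the compiler
   from combining neighbouring reads into a wider one. */
static inline uint32_t read_register(size_t index)
{
    return reg_base[index];
}

/* Every write is exactly one 32-bit store, for the same reason. */
static inline void write_register(size_t index, uint32_t value)
{
    reg_base[index] = value;
}
```

With accessors like these, the copy loop from the post can still be inlined, but each iteration has to remain its own 4-byte access.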