Tuesday, February 27, 2007

Finding bugs that don't want to be found

I'm a big complainer when it comes to modern IDEs and debuggers. They mostly suck. I've got a good example of how they suck.

For the past week I've been working on a bug which only reproduces about 1 out of 500 times I run my validation test. It takes about 30 seconds to run the test, so even if I script the validation, it still takes minutes to reproduce the problem. Obviously there is some crazy race condition going on.

So where would you start? At this point nearly every debugger in the world is useless. You cannot sit there and run your program 500 times, it would take forever. And even if you did manage to reproduce it in the debugger, if you step over you precious bug you're screwed.

Well I started off by adding assertions... everywhere. Assertions are kind of like printf's except they only do the printing on a condition and when they get hit, they normally exit the program. One benefit of assertions over printfs is that you can keep them in your code. They also serve as documentation for other programmers who comes along.

I was able to get my assertion to be hit. But it wasn't enough. In order to figure out what was wrong I needed to know the state of some data structures when the assertion was hit. So I needed to use printfs... there was no other way to do it. I started dumping lots and lots of data.

After I inserted my logging code I tried reproducing the bug again. Every run I would dump about 5MB of logging data... for 200 runs thats 1GB. After about 250 runs I hit my assertion. I looked through the logging output and found my bug... I was locking a data structure just a couple of lines too late. I moved the lock up a couple of lines and the bug vanished. It was a two-line fix to very hard problem.

In a perfect world finding this bug could have been much easier. It would be really nice if when the assertion was hit, a debugger would automatically be launched on that program. Once the debugger was launched I could view all of the data structures that got corrupted. In this particular case that wouldn't have been enough. I also needed the execution history of how the data structure got corrupt. However that too would not have been enough because there was another process which was writing to the data structure too. I also needed to know know how that process was changing the data structure....

To solve this problem I not only needed to know what code I ran, or the data structures when that code was run, I also needed to know how those data structures changed over time. Thats where the difficulty of debugging comes in, figuring out when and how data structures change.

Can visual studio help you with this? How about gdb or eclipse? No, no and no. They all suck at debugging and not a single one would help you get closer to finding the problem. I need something else...

No comments: