Svaka mu čast.
In year 2000, the group that is responsible for the JPEG file format, called the JPEG group, decided to come up with a new version of the JPEG format. It was called JPEG2000. It had some really cool ideas. One of them was that it supported streaming of images. One JP2 image contained multiple resolutions of the same image, and the lower resolutions were kept upfront. So, when you are downloading the image, you will get a lower resolution immediately. The advantage was that on low speed connections, the browser could show a low-res image pretty quickly. Also, devices that were low-res could simply stop downloading the image when they had the resolution they wanted
At the time, JPEG2000 was hoping that the JP2 standard would make its way into browsers (spoiler: it still hasn’t). We wanted to use it to build mapping applications. We had encoded aerial photos into the JP2 format, and we had a server that returned SVG maps. Since there was no browser support, I built an ActiveX control that would stream the JP2 image and overlay it with the SVG maps. It was very cool. Our resolution was like 10 times better than Google Maps (at the time).
I was using a library called Kakadu library. It was an open-source library that could stream and parse JP2 images. Kakadu performed really really well. It was fast! Except that once in a while it would get stuck. Randomly. No pattern. So, they pulled me in and I suspected a problem that arises because of thread contention. So, I got to debugging it. At the time, I was young, and I understood multi-threading pretty well, but hadn’t really fixed a thread-contention problem. I was excited.
So, I dug into their code and started debugging it. The first thing I figured is when I debug, the problem goes away. Crap! Essentially, the debugger itself acts as a synchronization mechanism, and it changes the timing of how the instructions in the thread execute.
So, I started adding logs. The next thing I figured out is that when I add logs, the problem goes away! Crap again! Again, since the logging goes to the file system, the file system acts as a synchronization mechanism and throws the timing off.
So, I can’t debug, and I can’t log. It’s a fairly complicated piece of engineering, that I haven’t written. So, before I solve the real problem, I need to figure out how to troubleshoot in a multi-threaded environment. Crap!
So, I started thinking that it’s the synchronization by things outside the code that throws the timing off, right? So, as long as I keep the code inside of the external synchronization very tight, I might prevent the timing from going off. So, I started minimizing my log statements. Eventually, I figured out that if I put one character logs, I am fine. And I can’t put too many 1-character logs.
So, the first thing I had to figure out was if the code takes a path that is different when the problem occurs versus when the problem doesn’t occur. Remember, I could put only single character logs, and not too many of them. So, I started reading through the code without trying to understand it. Whenever I reached a decision point, I would put 2 logs in the 2 branches. One branch logged “\”, the other branch logged “/”. When I saw a loop, I logged “|” inside the loops. When I would see the logs, I would see the log as
I would run the app and note down these strings of characters when the code ran fine, then note down the characters when it didn’t. Next I compared the strings of characters to find the deviation. Finally, I would trace back through the code to find the spot at which the log message deviated.
Since I couldn’t put too many logs, I had to be judicious. Luckily, the Kakadu code was structured very very nicely. I want to kiss the Kakadu developers (even though they caused the bug). All their code was layered very nicely. They had high-level functions that called lower-level functions, that called even-lower-level functions. So, I picked the topmost layer and put my magic single-character logs there. When I found the deviation, I would understand the code to figure out why the deviation happened. Usually, it was because a lower-level routine behaved differently. So, I had to remove all my logs, and then add similar logs in a lower-level routine. I did this layer by layer till I found the bug.
This entire process took about three weeks. It was a one character fix. There was a busy-wait loop in the rendering thread that waited for data to be loaded by a producer thread. It checked a counter using < instead of <=. Usually, the counter would go from counter < expected to counter > expected, and it would work fine. In the rare condition that the = condition was satisfied, the rendering thread would prematurely parse the data, get an exception and cleanly exit. This would stop all rendering.
Three weeks, one character. I should get a T-shirt that says that.
Fixing this bug really showed me the value of building your code in layers.