Debugging Misadventures: Down the Rabbit Hole
For the past few years for Christmas, I’ve asked people in my family for stories from their lives rather than a physical gift. Last year, my sister Emma, who works as a dentist, asked me for the same. This is the story I sent my sister, along with some illustrations by my very multitalented coworker Jessica Liu.
A lot of my day-to-day work as a software engineer involves debugging. We expect the system to do X but it does Y instead. Why?
One of my coworkers, Karl, really enjoys this process. He imagines the bug as a sneaky adversary, running around a maze, trying to evade him as he methodically searches through each turn. In the end, he always finds his hidden foe, wins a round of this game, and rests awaiting his next tour of the maze.
All of my most bizarre debugging adventures come from my time working at Figma. I’ve visited other offices, found bugs in Chrome, and been bedevilled by a “Back to the Future” easter egg left in third-party software, but the most outrageous of them put me in an hour-long Lyft ride from San Francisco to Mountain View on a Wednesday afternoon to meet a stranger who’d never heard of Figma.
Debugging is often a process of making and testing hypotheses. Since you think the system should be doing one thing but it’s doing another, it must mean that one of your assumptions about how the system works is wrong.
For example, let’s say that your friend is trying to get into your apartment using your electronic keypad but it isn’t working. They might call you and tell you “Hey, I can’t get in.” You told them the code before, so they should be able to get in. Something you’re assuming is true clearly isn’t, so you ask them some questions to test your hypotheses.
“The code is 1499, is that what you’re entering?”
“Yeah, that’s what I entered. I’ll try it again right now… Nope. Still nothing.”
The first assumption you were testing was that they had the correct code. It turns out that assumption was correct, so that can’t be the problem. Time to test a different hypothesis.
“Do the buttons light up when you press them?”
“Yep. They light up. When I hit the second 9, the keypad turns red, blinks, and then goes dark.”
Okay, so the batteries aren’t dead. Your assumption that the keypad has power was right too. What could possibly be wrong?
“Uhh, are you sure you’re entering it in right?”
“Yes. Duh. I’m not an idiot. I press 1, then I press 4, then 9, then 9 again. It goes red. I turn the handle. Door doesn’t budge.”
“Are you sure you’re telling me the code right? My hands are freezing out here trying to enter this code, and your fireplace inside is just taunting me.”
“What fireplace? Wait. What address are you at?”
“907 Elk St, like you told me.”
You check the last text message you sent your friend. You did indeed send “907 Elk St”. You live at “907 Elm”.
Different bugs can require wildly different amounts of work to figure out.
One of the key things affecting how long it takes to diagnose and fix a bug is the duration of the feedback cycle. Finding the root cause of a bug can take anywhere from one to hundreds of hypothesis tests. When it takes 100 cycles of “test hypothesis → form new hypothesis”, a feedback cycle of 10 seconds versus 10 minutes makes the difference between finding the cause in about 17 minutes and finding the cause in two days of full-time work.
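For the curious, here is that arithmetic as a few lines of Python. It's only a back-of-the-envelope sketch: the function name is made up for illustration, and the 8-hour working day is my assumption about what “full-time” means.

```python
def total_wait_minutes(cycles: int, seconds_per_cycle: float) -> float:
    """Total time spent waiting on feedback across all cycles, in minutes."""
    return cycles * seconds_per_cycle / 60


CYCLES = 100            # hypothesis tests needed to find the cause
WORKDAY_HOURS = 8       # assumption: one full-time working day

fast_minutes = total_wait_minutes(CYCLES, 10)        # 10-second feedback cycle
slow_minutes = total_wait_minutes(CYCLES, 10 * 60)   # 10-minute feedback cycle

slow_hours = slow_minutes / 60
slow_workdays = slow_hours / WORKDAY_HOURS

print(f"10-second cycle: {fast_minutes:.1f} minutes")
print(f"10-minute cycle: {slow_hours:.1f} hours, or {slow_workdays:.1f} working days")
```

The same 100 hypotheses cost a coffee break in one case and most of a week's momentum in the other, and that's before counting the time spent thinking between tests.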
All sorts of things can disrupt this feedback cycle.
Some bugs are tricky because they’re non-deterministic, which is a technobabble way of saying “even when you do the exact same things in the exact same order, you get different results.” Every once in a while, the laundry machine seems to swallow a sock, but you don’t know why. You think you might be able to stop it by always tying your socks together before putting them in the laundry machine. On the next load of laundry, you tie all your socks together in pairs and put them in. When you pull them out all your socks are still there! But did you fix the problem, or did you just get lucky this time? You can neither accept nor reject your hypothesis yet.
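To put rough numbers on “did you just get lucky?”, here is a small Python sketch. Everything in it is hypothetical: I'm assuming the machine eats a sock in one load out of five, and that each load is independent of the last. Neither number comes from a real laundry machine.

```python
import math


def chance_of_lucky_streak(p_loss: float, clean_loads: int) -> float:
    """Probability of seeing `clean_loads` loads in a row with no lost sock
    even if the fix did nothing at all (loads assumed independent)."""
    return (1 - p_loss) ** clean_loads


def loads_needed(p_loss: float, max_luck: float = 0.05) -> int:
    """Smallest number of clean loads after which plain luck becomes
    less likely than `max_luck`."""
    return math.ceil(math.log(max_luck) / math.log(1 - p_loss))


P_LOSS = 0.2  # assumption: a sock goes missing in 1 load out of 5

print(f"One clean load by luck alone: {chance_of_lucky_streak(P_LOSS, 1):.0%}")
print(f"Clean loads needed before luck is under 5%: {loads_needed(P_LOSS)}")
```

Under those assumptions, a single clean load tells you very little, since you'd see one four times out of five anyway. The rarer the bug, the more loads of laundry you have to run before you can believe your fix.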
Another especially frustrating category of bug is nicknamed the “Heisenbug”, after the physicist Werner Heisenberg. Heisenberg was one of the first to assert that the mere act of observing a phenomenon may cause it to change its behaviour. In terms of software, what this means is that as soon as you start trying to change the code to have it tell you more about what it’s doing and what might be going wrong, the bug stops appearing! Imagine if your dentistry patient’s toothache went away when they opened their mouth even slightly.
But most of the gnarliest bugs I’ve had to work on at Figma suffer from a third kind of feedback loop problem: we can’t reproduce the problem, but some of our users can, and they hit it all the time.
This can happen for all sorts of reasons. They might have some Chrome extension installed that none of us do, and they didn’t realize that was an important part of the puzzle. Their network administrator might’ve blocked some communication channel that’s essential for Figma’s functioning over the internet. After diagnosing this kind of problem, we can usually either find a way of reproducing it by installing their software, or tell them to talk to their own coworkers to resolve the problem.
But the most frustrating version of the “no reproduction” issue is when something about their hardware is interacting poorly with our software. And the most fickle kind of hardware, one that Figma deals with more directly than most other companies, is the graphics card.
A graphics card is the component in your computer that’s specially designed to efficiently perform the kinds of computations needed to update the millions of pixels on your screen at sixty frames per second.
I’m fond of computer graphics in part because the bugs can be so entertaining. Sometimes instead of text being drawn legibly on the screen, it’s upside-down and flickering madly. It’s got a special hall-of-mirrors kind of beauty to it.
These bugs are much less entertaining when they’re causing customers distress and we can’t reproduce the problems on our own hardware.
The first time I was involved in debugging a graphics card problem was in October 2017. A designer at one of our early customers, a recruiting software company called Lever, reported that for some files, their entire canvas went black in Figma. We couldn’t reproduce it, but the designer at Lever consistently could. Thankfully, Lever’s and Figma’s offices were a short bus ride apart, so I went with our CTO Evan Wallace and one other coworker to investigate. They lent us their laptop (a MacBook Pro with an unusual graphics card inside), and I watched Evan test hypothesis after hypothesis about what could be going wrong on the machine using some special diagnostic software he’d written. After about 30 minutes, he was able to find a fix. We thanked our gracious hosts at Lever, and returned to our office, everyone quite happy.
The next time was in June 2018. This time, a number of our users had written in to us reporting that with a change in Figma’s graphics code, they started seeing patterns of visual noise appear in their design files. After chatting with many of the users through our support system, we were able to deduce that they all had the same series of graphics card: the GeForce GTX 10 series. This time, none of the affected users were anywhere in San Francisco, so we couldn’t just drop by their office. We could ask them to hop on a phone call, but it would be a lot to ask for them to sit with us for hours, typing what we asked them to into their computer word by word or giving us total remote access. Thankfully, I looked online and saw that Best Buy had laptops with this kind of graphics card in stock. So I took a car to the San Francisco Best Buy location, talked to one of the sales associates to confirm that the laptop I was looking at had the card I needed, charged a few thousand dollars to the company card our CEO Dylan Field had lent me, and was on my merry way. A few hours later we’d figured out the cause and had a workaround to dodge the problem.
But by far the most outrageous instance of this was in May 2019. We had just started rolling out changes I was spearheading to a core graphics component of Figma when a small portion of our user base wrote in to tell us something was wrong. Instead of the image taking up the full screen as they expected, it was about an eighth of the expected size, stuck up in the top left corner. After some careful analysis, we realized that they were all on Windows 7, and all using a specific version of a graphics card manufactured by Intel. Now knowing how this dance works, we looked to see if any of them were in San Francisco. Nope. One in Mumbai, one in Indonesia. We actually didn’t find any Figma users reporting this problem anywhere in the United States. Digging further, we discovered that this was a pretty old laptop, now out of vogue in America, but relatively common in Russia and South Asia. It was so old, in fact, that nobody sold it any more.
We knew that to diagnose this sanely, we needed to have the device in hand. My coworker Ryan mentioned a friend of his works in the device lab at Dropbox and they might be able to lend us a device. Large companies like Dropbox and Google keep large repositories of hardware configurations in their device lab to deal with exactly this contingency. But Figma was still under 100 people at the time, so we were far from having a fully stocked warehouse of obscure graphics cards.
While we waited to hear back from Ryan’s contact, the third teammate working with me on this graphics component, Lauren, turned to me and said “Hey Jamie, how many followers do you have on Twitter again?” “Just over 3000”, I responded. “Why don’t you see if any of them can help you?”
I thought there was no way that would work, but I had nothing to lose, so I sent out a tweet saying “Okay, I need to debug on a really specific old laptop configuration. I need a Window 7 or 8 laptop with an Intel HD Graphics card. If you have one and are in the SF bay area, I’ll buy you a meal or a drink if you can let me use it!”
Literally one minute later, a man named Scott replied “How long do you need it for?”
After confirming with him that he could reproduce the problem we’d been seeing on this laptop, and that he’d be willing to lend me the laptop for a few weeks, I asked if he’d be willing to meet me in a coffee shop somewhere convenient for him the next day. He agreed.
The next day I hopped in a Lyft from Figma’s office in San Francisco destined for a Starbucks in Mountain View. Once I arrived at the Starbucks, I did the awkward “waiting for someone you’ve never met before” dance. Maybe you’ve done this before for a date where the profile picture was unclear, or maybe for a networking connection set up through a friend. After a few of the Starbucks patrons assured me that they were, indeed, not Scott, I sat down and waited.
Scott walked in a few minutes later. I stood up to greet him, and we made some idle chatter while I got us both drinks from the cashier. We settled in at a table nearby. As it turns out, the laptop Scott was going to lend me miraculously already had none of his own personal files on it. It was a spare laptop he kept around to drive custom karaoke events at different anime conventions. As long as I could return or replace the laptop before the next event, he was happy to lend it to me. Scott works as a Technical Support Engineer at a big company, and just happened to follow me on Twitter after seeing my blog post about color. He’d never heard of Figma before, though he was interested in it when I explained a bit more.
After talking for about half an hour, I thanked Scott profusely and returned to San Francisco in another hour-long car ride. The day after, I was able to isolate the problem, create a workaround, and unblock our project. Reluctant to give up the laptop since it was the only way we had of debugging similar issues, I asked my manager for permission to expense sending a replacement laptop to Scott. The request was approved, and now Scott is the happy owner of a nicer, newer laptop, paid for by a company he had never heard of, for deciding to help a person he had never met.
Scott is a great guy.
Thanks to Nikhil Thota, Spencer Chang, Andee Liao, and Lauren Budorick for providing feedback on drafts of this post, to Jessica Liu for the illustrations, and to Karl, Evan, Dylan, Ryan, and Lauren for being a part of this story!