Technical Debt
Manifest Destiny?
Technical debt, while it still sends shudders down most product leaders’ spines, is no longer a four-letter word. We have all learned to accept technical debt. The best product leaders have found ways to manage and even harness it.
Others have written about technical debt, including Ward Cunningham, Martin Fowler, Andrea Goulet, Maiz Lulkin, Eric Ries, and Joel Spolsky. We’ll build upon their insights and add our own, along with those of the product leaders we spoke to.
When a new venture is just starting out and doesn’t yet have product-market fit, accumulating technical debt should not be concerning. In fact, it might even be advisable. What’s the point of a perfect code base for a feature that doesn’t resonate with your intended users? Any time spent refactoring code means that less engineering time is available for iterating on the product as the team searches for product-market fit. The goal of the company at this stage is to gather as much user feedback as possible to refine and reshape the product.
However, herein lies a paradox: once the product is successful, the engineering team is rarely given the opportunity to go back and fix the shortcuts it took to get something out the door in its quest for product-market fit. As the product gains momentum, founders will push engineering to build out the backlog of great new features that will fuel the next wave of growth. This is exhibit #1 on the path to technical debt accumulation.
There are several common reasons why startups need to refactor their code base.
Build for 1x, Engineer for 10x, Architect for 100x
Steve Kaufer, founder/CEO of TripAdvisor, had some practical advice that works well when engineers are teamed with technical PMs: don’t view a project as complete until the code base has been cleaned up, or leave enough time between projects to do the same. Non-technical PMs are more inclined to view a project as complete the moment it has shipped, not understanding that potential problems might still lie beneath the surface.
Bug-Bash. Kaufer’s technique works well as long as the bugs and clean-up issues are well known. But what happens when the bugs appear weeks or months down the road? We run into the same quality-features-time tradeoff, with quality again at risk of taking a back seat. Jen Fitzpatrick, currently SVP at Google, came up with a great approach to tackle the bug accumulation problem in the mid-2000s. She started a program called ‘bug-bash,’ where the entire engineering team would focus only on addressing bugs for a week. New feature work was prohibited. PMs put in a lot of pre-work to prioritize the bugs, and special awards were given to engineers who fixed the hairiest bugs, the most P1 bugs, the greatest number of bugs, etc. The driving spirit behind bug-bash was an acceptance that housecleaning is a necessary and critical part of the product life cycle, and that product teams should embrace and celebrate it rather than shun it. Regular bug-bashes have now become a cultural element of Google engineering teams.
Re-cast refactoring as a business problem. How will the business benefit when you address the technical debt, beyond hard-to-quantify improvements in software engineers’ productivity? Selina Tobaccowala, who worked as the President/CTO at SurveyMonkey when the late Dave Goldberg was CEO, recounts that he would say: “When a CTO talks to me, it’s like going to a car mechanic and hearing about all the work that needs to be done, without understanding how a car works.” For a business-oriented CEO like Goldberg, the best approach for a product leader is to present refactoring as a way to unlock business opportunities. For instance: “Entering the European market requires accepting CHF, GBP and EUR in credit and debit card formats. So, we need to refactor the codebase to accept multiple currencies and forms of payment.” When scaling rapidly, focusing on business value helps determine when to invest scarce engineering resources in refactoring and when, in Selina’s words, to simply “let some fires burn.”
Which box? When to rewrite the entire code base? Rushabh Doshi offers a three-part approach to managing technical debt:
With all of the approaches above, it’s important to treat refactoring the same way as new feature creation work, that is, assign the requisite number of resources for the appropriate amount of time.
When refactoring is finished, the team should run A/A tests, in which the old and new code bases run in parallel. This tests the new code and provides a fail-safe mechanism in case the new code doesn’t perform as expected (surprise!). Product leaders should not succumb to the temptation to add new features to the new code base while it is being rewritten. This might make the rewrite more palatable to business leaders, but it can also increase the probability of new bugs and messier code, essentially adding new technical debt while paying down old debt.
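The parallel run described above can be sketched in a few lines of Python. This is a minimal illustration under our own assumptions, not any particular team's implementation: the names `ParallelRunner`, `legacy_total`, and `rewritten_total` are hypothetical, and the rewritten function contains a deliberate bug so that the comparison has something to catch. The old code path always produces the answer that is returned, which is what makes it a fail-safe; the new code path is only observed.

```python
from dataclasses import dataclass, field
from typing import Any, Callable, List, Tuple


@dataclass
class ParallelRunner:
    """Runs the old and new code paths side by side (hypothetical helper).

    The legacy result is always what the caller receives, so a faulty
    rewrite cannot affect users while it is still being verified.
    """
    legacy: Callable[[Any], Any]
    rewritten: Callable[[Any], Any]
    mismatches: List[Tuple[Any, Any, Any]] = field(default_factory=list)

    def __call__(self, request: Any) -> Any:
        expected = self.legacy(request)
        try:
            actual = self.rewritten(request)
        except Exception as exc:
            # A crash in the new code is recorded, never surfaced to the user.
            self.mismatches.append((request, expected, repr(exc)))
            return expected
        if actual != expected:
            # Record the disagreement for engineers to investigate later.
            self.mismatches.append((request, expected, actual))
        return expected


def legacy_total(amounts: List[int]) -> int:
    """Old implementation: sums every line item, including refunds."""
    total = 0
    for amount in amounts:
        total += amount
    return total


def rewritten_total(amounts: List[int]) -> int:
    """New implementation with a deliberate bug: it drops negative refunds."""
    return sum(a for a in amounts if a > 0)


if __name__ == "__main__":
    runner = ParallelRunner(legacy_total, rewritten_total)
    for order in ([1, 2, 3], [5, -2]):
        print(order, "->", runner(order))
    print("mismatches:", runner.mismatches)
```

Once the mismatch list stays empty over a representative volume of real traffic, the team has some evidence that the new code base can take over. Choosing how much traffic counts as representative is a judgment call that this sketch does not address.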
Everything we’ve shared so far may sound perfectly logical on paper, but how does it really happen in practice? For example, has anyone actually survived a full rewrite? It turns out that Deep lived through a full code rewrite at LinkedIn along with Mohak Shroff, current SVP and Head of Engineering at LinkedIn, who at the time was a key engineering exec responsible for the successful rewrite. We share more about their experience below.
LinkedIn launched in 2003 to connect the world’s professionals. Its technology stack was originally built on the best-in-class hardware (Sun Solaris servers) and software (Oracle DB) that was available at that time. The company experienced steady growth in both its user base and revenue, while continuing to build new software on this technology stack. However, by late 2008 it was becoming apparent that the stack could not scale to keep up with the evolving technology needs of the platform.
Software release frequency had been reduced by 50%, and the site had to be taken offline for hours at times for routine maintenance. Where Twitter had its ‘Fail Whale,’ LinkedIn had its ‘IN Wizard.’ The technology stack was no longer state-of-the-art for what the platform needed, but a full-scale ‘lift and shift’ seemed overwhelming and downright impractical. Engineering time was precious, so throwing hardware and off-the-shelf software (e.g., libraries of open-source code) at the problem was deemed more expedient.
However, at a certain scale, off-the-shelf solutions no longer exist: you can’t just wish away the problems, and reality stares you in the face. The development process became process-heavy, because, according to Mohak, “process was all we had.” LinkedIn’s tech stack was complex, with no one person having a complete grasp of the full system, and as a result, everybody became risk-averse and resorted to lots of testing and certification before making any changes. Mohak reflects that, “Every time you hide complexity from engineers, you complicate the process and create confusion amongst your engineering teams. A great race car driver needs to understand how the car works.”
Meanwhile, the company was preparing to go public in May 2011. Earlier that year, a group of engineering and product executives agreed that the status quo simply couldn’t last and that a massive code refactoring was needed. In the spirit of keeping the feature train running, Mohak proposed dedicating 30 engineers to refactoring (resource boxing) for a period of six months (time boxing). Deep, recognizing that more effort was needed and that a carpe diem attitude was more appropriate, suggested taking the full engineering team of 300 people and redeploying them for six weeks on the rewrite.
Deep’s proposal was met with great skepticism, especially from the CTO at the time, David Henke. David had lived through the failed Yahoo Panama rewrite and bore scars from that attempt. As Mohak reflects, in the tech industry, “the curse of knowledge weighs heavy.” Based on experience and his deep knowledge of LinkedIn’s systems, David listed 15 different reasons why a complete code rewrite was a fool’s errand. But eventually the company went ahead with the rewrite and succeeded. Here are some of the key factors that contributed to their success.
* * * * *
The term debt has negative connotations. However, technical debt is an inevitable part of the development process. It’s not about cutting corners. Rather, it’s about making smart choices. Having a perfect code base is not only impossible, but also impractical. Beautiful code is worthless if you aren’t getting things done. The real test of a great product leader is the rigor with which they take on technical debt and address its consequences on an ongoing basis.