October 15, 2023

Technical Debt

Manifest Destiny?

Deep Nishar

Tom Eisenmann


Why Refactor?

There are several common reasons why startups need to refactor their code base.

Always prioritizing ship dates or new features when presented with quality-features-time tradeoffs. This happens when founders pursue hypergrowth, and also when a product manager is not especially technical. Rushabh Doshi, CPO at Digit and formerly a product leader at Facebook and YouTube, notes that a mismatch between goals assigned to PMs (e.g., shipping on time) and to engineering (e.g., bug counts) can exacerbate the pressure to compromise quality in order to gain a time or feature edge. As Rushabh puts it, “As soon as your engineering team writes a single line of code, you start accruing technical debt. We imagine features will work the way they are supposed to, and we don’t imagine bugs. But in reality, all new features have warts and bugs.” Failing to understand and accept this reality, and failing to budget time for refactoring, results in the rapid accumulation of technical debt.
Unexpected rapid growth. One of the maxims Deep abided by during his operational life was to “build for 1x, engineer for 10x and architect for 100x,” where “x” represents current usage levels. Building a product in this way enables smoother scaling. However, it’s not always possible to predict the rate of scaling for a successful new product. As Andrey Khushid shared with us, the Covid-19 pandemic increased Miro’s product usage by 10x within a few months, requiring him to dedicate more than half of his engineering capacity to code refactoring. While this can be a ‘good’ problem to have, architecture that doesn’t scale well can also be fatal. Web 1.0 startups like Friendster got crushed by rapid growth; Friendster’s users were frustrated by slow response times as servers struggled to process their interactions. Friendster never managed to recover, and it ceded its early-mover advantage to new entrants, in particular MySpace.
Infrastructure shift. This is technically not a cause of technical debt, but it may require massive refactoring of the code base, with no new features to show for it. Many enterprise software companies have transitioned from an on-premises client-server architecture to a cloud service. However, this shift can cause massive disruption to engineering schedules and consume lots of resources. At the end of the transition, the customer most likely gets the same feature functionality as before, leading to heated executive suite debates whenever this type of transition is contemplated.
New requirements. Jared Smith, president of Qualtrics, points out that new feature requirements or opportunities sometimes emerge that simply weren’t on the radar when the product was originally designed. He gives the example of a team inventing a great spell checker and realizing that it should be incorporated across all of an application’s features. It’s time to refactor! Or, imagine that it’s 1970 and you are an engineer coding a financial application that tracks customer account balances. Processing power and memory are at a premium and you’ve been trained to be parsimonious with them. So, when it comes time to parameterize the year, you store it as a two-digit integer. This works swimmingly through 1999, but the software also needs to work in the new millennium. It’s time to rewrite, or at least patch, the code!
Engineering incompetence. Again, it’s not technically a cause of technical debt, but sloppy work by less competent engineers can be a root cause of the need to refactor. This is especially likely to happen in young startups when founders lack the connections and track record needed to attract top-notch engineering talent.
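
The two-digit year example above can be sketched in a few lines of Python. This is a minimal illustration, not the code of any real financial system: the function names and the pivot value of 70 are assumptions made for the example. It shows why arithmetic on two-digit years breaks at the century boundary, and how a windowing patch (one common fix from the Y2K era) works without changing the stored data.

```python
PIVOT = 70  # assumed cutoff: 70-99 map to 1970-1999, 00-69 map to 2000-2069


def years_elapsed_naive(opened_yy: int, today_yy: int) -> int:
    """Original-style arithmetic performed directly on two-digit years."""
    return today_yy - opened_yy


def expand_year(yy: int, pivot: int = PIVOT) -> int:
    """Windowing patch: interpret a stored two-digit year as a four-digit year."""
    if not 0 <= yy <= 99:
        raise ValueError(f"expected a two-digit year, got {yy}")
    return 1900 + yy if yy >= pivot else 2000 + yy


def years_elapsed_patched(opened_yy: int, today_yy: int) -> int:
    """Same calculation, but on expanded four-digit years."""
    return expand_year(today_yy) - expand_year(opened_yy)


# Account opened in 1985 (stored as 85), evaluated in 2000 (stored as 0).
print(years_elapsed_naive(85, 0))    # -85: the account appears to be from the future
print(years_elapsed_patched(85, 0))  # 15
```

Windowing is a patch rather than a cure: with this pivot, the same failure reappears in 2070, so the patch itself is a small piece of technical debt that a later rewrite to four-digit years would retire.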

Build for 1x, Engineer for 10x, Architect for 100x

Ways to Pay Down Debt

Steve Kaufer, founder/CEO of TripAdvisor, offers practical advice that works well when engineers are teamed with technical PMs: don’t view a project as complete until the code base has been cleaned up, or leave enough time between projects to do the same. Non-technical PMs are more inclined to view a project as complete the moment it’s been shipped, not understanding that potential problems might still lie beneath the surface.

Bug-Bash. Kaufer’s technique works well as long as the bugs and clean-up issues are well known. But what happens when the bugs appear weeks or months down the road? We run into the same quality-features-time tradeoff, with quality again at risk of taking a back seat. Jen Fitzpatrick, currently SVP at Google, came up with a great approach to tackle the bug accumulation problem in the mid-2000s. She started a program called ‘bug-bash,’ where the entire engineering team would focus only on addressing bugs for a week. New feature work was prohibited. PMs put in a lot of pre-work to prioritize the bugs, and special awards were given to engineers who fixed the hairiest bugs, the most P1 bugs, the greatest number of bugs, etc. The driving spirit behind bug-bash was an acceptance that house cleaning is a necessary and critical part of the product life cycle, and that product teams should embrace and celebrate it rather than shun it. Regular bug-bashes have now become a cultural element of Google engineering teams.

Re-cast refactoring as a business problem. How will the business benefit when you address the technical debt, beyond hard-to-quantify improvements in software engineers’ productivity? Selina Tobaccowala, who worked as the President/CTO at SurveyMonkey when the late Dave Goldberg was CEO, recounts that he would say: “When a CTO talks to me, it’s like going to a car mechanic and hearing about all the work that needs to be done, without understanding how a car works.” For a business-oriented CEO like Goldberg, the best approach for a product leader is to present refactoring as a way to unlock business opportunities. For instance: “Entering the European market requires accepting CHF, GBP and euros in credit and debit card formats. So, we need to refactor the codebase to accept multiple currencies and forms of payment.” When scaling rapidly, focusing on business value helps determine when to invest scarce engineering resources in refactoring and when, in Selina’s words, to simply “let some fires burn.”

Which box? When to rewrite the entire code base? Rushabh Doshi offers a three-part approach to managing technical debt:

Time box: Rushabh is a big fan of this approach, which is similar to the bug bashes advocated by Jen at Google: it allocates portions of sprint cycles, or entire cycles, to refactoring. Time boxing makes code refactoring a routine part of the product life cycle. With this approach, fixing code is not left to the junior members of the technical team as ‘starter’ projects. Experienced engineers who understand the code base are charged with refactoring. According to Rushabh, refactoring as you build works best with disciplined engineering teams that are able to withstand the pressure to divert resources to building new features. Such pressure can be very strong in startups where product-led growth is a priority.
Resource box: With this approach, a few members of the engineering team are dedicated to code refactoring for substantial periods of time (say, a few calendar quarters). This becomes their only job, and their output is viewed as on par with, or even more important than, feature delivery. Their goal is to improve the developer experience (DevEx) and make the rest of the team more productive. A common failure mode of this approach is to assign junior team members to this task as a way to learn the code base and get their ‘hands dirty.’ Unfortunately, when a team member lacks the full system view of the code base and barely understands its functionality, this starter project becomes an unproductive nightmare.
Full re-write: This approach, which may strike terror in the hearts of engineers who’ve been through a full re-write, is required in one of three circumstances:
The codebase has been neglected forever and the proverbial wheels are coming off the bus. The hair-raising risk here: Can the teams who created this mess in the first place be trusted to fix it?
The underlying technology is changing in fundamental ways – as with the move from client-server based software to cloud services described above.
The business has changed in fundamental ways – for instance, after a large acquisition that requires the code base across two organizations to be reconciled (e.g., Google buying YouTube).

With all of the approaches above, it’s important to treat refactoring the same way as new feature creation work, that is, assign the requisite number of resources for the appropriate amount of time.

When refactoring is finished, the team should run A/A tests so that both the old and new code bases are run in parallel. This tests the new code and provides a fail-safe mechanism in case the new code doesn’t perform as expected (surprise!). Product leaders should not succumb to the temptation to add new features to the new code base while it is being re-written. This might make the rewrite more palatable to business leaders, but it can also increase the probability of new bugs and messier code – essentially, adding new technical debt while paying down old debt.
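
One way to picture the parallel run described above is a thin wrapper that calls both implementations, serves the old result, and records any disagreement. The Python sketch below is illustrative only: `ParallelRunner`, `legacy_fee`, and `refactored_fee` are hypothetical names invented for this example, and a real system would also need to account for side effects, latency, and traffic sampling.

```python
from dataclasses import dataclass, field
from typing import Any, Callable, Dict, List


@dataclass
class ParallelRunner:
    """Runs old and new code side by side; callers always get the old result."""

    legacy: Callable[..., Any]
    candidate: Callable[..., Any]
    mismatches: List[Dict[str, Any]] = field(default_factory=list)

    def __call__(self, *args: Any) -> Any:
        expected = self.legacy(*args)  # the old code stays the source of truth
        try:
            actual = self.candidate(*args)
        except Exception as exc:  # a crash in the new code must not reach users
            self.mismatches.append(
                {"args": args, "expected": expected, "error": repr(exc)}
            )
            return expected
        if actual != expected:
            self.mismatches.append(
                {"args": args, "expected": expected, "actual": actual}
            )
        return expected


# Hypothetical pair of implementations: a 2% fee in whole cents.
def legacy_fee(balance_cents: int) -> int:
    return balance_cents * 2 // 100  # truncates fractions of a cent


def refactored_fee(balance_cents: int) -> int:
    return (balance_cents * 2 + 50) // 100  # rounds half up: a behavior change


fee = ParallelRunner(legacy_fee, refactored_fee)
fee(100)  # both versions agree
fee(75)   # versions disagree; the legacy answer is still served
print(fee.mismatches)
```

Because the legacy result is always the one returned, the new code can be exercised on real inputs without exposing users to its surprises; the mismatch log then becomes the checklist for deciding when it is safe to switch over.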

Case Study: LinkedIn

Everything we’ve shared so far may sound perfectly logical on paper, but how does it really happen in practice? For example, has anyone actually survived a full rewrite? It turns out that Deep lived through a full code rewrite at LinkedIn along with Mohak Shroff, current SVP and Head of Engineering at LinkedIn, who at the time was a key engineering exec responsible for the successful rewrite. We share more about their experience below.

LinkedIn launched in 2003 to connect the world’s professionals. Its technology stack was originally built on the best-in-class hardware (Sun Solaris servers) and software (Oracle DB) that was available at that time. The company experienced steady growth in both its user base and revenue, while continuing to build new software on this technology stack. However, by late 2008 it was becoming apparent that things were not able to scale to keep up with the evolving technology needs of the platform.

Software release frequency had been reduced by 50%, and the site had to be taken offline for hours at times for routine maintenance. When Twitter had its ‘Fail Whale,’ LinkedIn had its ‘IN Wizard.’ The technology stack was no longer state-of-the-art for what the platform needed, but a full scale ‘lift and shift’ seemed overwhelming and downright impractical. Engineering time was precious, so throwing hardware and off-the-shelf software (e.g., libraries of open-source code) at the problem was deemed more expedient.

However, at a certain scale, off-the-shelf solutions no longer exist: you can’t just wish away the problems, and reality stares you in the face. The development process became process-heavy because, according to Mohak, “process was all we had.” LinkedIn’s tech stack was complex, with no one person having a complete grasp of the full system. As a result, everybody became risk-averse and resorted to lots of testing and certification before making any changes. Mohak reflects that, “Every time you hide complexity from engineers, you complicate the process and create confusion amongst your engineering teams. A great race car driver needs to understand how the car works.”

Meanwhile, the company was preparing to go public in May 2011. Earlier that year, a group of engineering and product executives agreed that the status quo simply couldn’t last and that a massive code refactoring was needed. In the spirit of keeping the feature functionality train running, Mohak proposed dedicating 30 engineers to refactoring (resource boxing) for six months (along with time boxing). Deep, recognizing that more effort was needed and a carpe diem attitude was more appropriate, suggested taking the full engineering team of 300 people and redeploying them for six weeks on the rewrite.

Deep’s proposal was met with great skepticism — especially from the CTO at the time, David Henke. David had lived through the failed Yahoo Panama rewrite and bore scars from that attempt. As Mohak reflects, in the tech industry, “the curse of knowledge weighs heavy.” Based on experience and his deep knowledge of LinkedIn’s systems, David listed 15 different reasons why a complete code rewrite was a fool’s errand. But eventually the company went ahead with the rewrite and succeeded. Here were some of the key factors that contributed to their success.

Executive alignment: LinkedIn’s CEO, CFO, CRO and the rest of the executive team all saw the challenges of the current system, and agreed to support the rewrite team. The engineering and product teams could see that senior management was fully committed to the rewrite and felt they had the support needed to move forward. For example, Deep stood up in front of the entire company to explain the need for the feature hiatus and said, “Blame me if this fails.”
Planning before execution: The team spent months understanding the system and meticulously planned the sequence of the rewrite. This is where Henke’s experience with Panama was brought to bear, as he applied the key lessons he had learned from that rewrite to this one. For example, an important lesson he shared was to “Aim, aim, and aim some more, before you shoot.” He added that many code rewrites are treated like a second-grade soccer game with every player running after the ball, when in reality they need to be treated like a major project, with planning that takes months and action that may require a few weeks.
Don’t forget the small wins: Early refactoring gave the team a sense of the efficiencies that could be gained. By building a Continuous Integration environment with lots of automated testing as one of the first projects, they were able to apply quick ‘hot releases’ and avoid complete system shutdowns. Rapid iteration by coding in small chunks worked well after this, and they saw a faster time to recovery if something didn’t go as planned. This was a big morale booster and converted many of the remaining holdouts in engineering.
Honor the deadline: Meeting the deadline respects the company’s patience with the process and builds trust in the engineering and product teams.
You only do this once: The team finished the bulk of the work within the deadline, and made an important decision to allocate 10-20% of all future engineering resources to infrastructure projects (DevEx), thus obviating the need for future large-scale rewrites.

* * * * *

The term debt has negative connotations. However, technical debt is an inevitable part of the development process. It’s not about cutting corners. Rather, it’s about making smart choices. Having a perfect code base is not only impossible, but it is also impractical. Beautiful code is worthless if you aren’t getting things done. The real test of a great product leader is the rigor with which they decide to take on technical debt and address its consequences on an ongoing basis.
