Working at Nextdoor, I’m always surrounded by super talented people. Many of our early employees and founders came from companies like Google, LinkedIn, Netflix, Amazon, and Twitter. Coming to Nextdoor from these established companies, they brought with them a wealth of best practices, engineering paradigms, and good rules of thumb. I’ve been fortunate enough to work on many different teams and learn from many great engineers. While it’s impossible to write down everything I’ve learned, I’ve collected some of the important patterns that any engineer should follow.
Note that some of these items pull in opposite directions. Your experience as an engineer will help you decide when to lean on which one. I’ve grouped these learnings into sections on Architecting, Developing, and Debugging software; but other than that, they are in no particular order.
Know your short/mid/long-term solutions.
It’s always important to understand how long your work is going to live. You will frequently be in situations where you need a solution right now, yet the project isn’t going away anytime soon. Chances are a quick-n-dirty solution will pop into your head very quickly, and the right-but-time-consuming solution has to be tabled. Take a little extra time to come up with several short-term solutions, and for your final decision make sure that your approach paves the path toward the desired future.
Doing the right thing should be easy.
(aka “Sane by Default”, or “Doing the wrong things should be hard”)
Your work will always serve as an example for others on how to do things. If you’re adding new functionality make sure that any assumptions it makes are sane. If you want this new feature to be leveraged by other engineers — make it really easy to use. The easiest thing to do is to not do anything. Automate what you can so that anyone leveraging your technology gets the benefits for free.
Consider a situation where you are refactoring code, or introducing a new pattern for analyzing data in a spreadsheet. It may be tempting to get your work done quickly and be done with it. Take the extra time to generalize your work and make it the new standard for all your peers. Whether it’s moving all database calls into a DAL, or disseminating your spreadsheet with formulas to all employees — do it right, make it easy, and getting everyone to use it will be simple.
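As a minimal sketch of the DAL idea above (the table and class names here are illustrative, not Nextdoor’s actual code): once all database access lives behind one small class, the easy thing and the right thing are the same thing, because callers never hand-write SQL.

```python
# Hypothetical data-access layer (DAL): every database call for "users"
# goes through this one class, so callers never touch SQL directly.
import sqlite3


class UserDAL:
    def __init__(self, conn):
        self.conn = conn
        self.conn.execute(
            "CREATE TABLE IF NOT EXISTS users (id INTEGER PRIMARY KEY, name TEXT)"
        )

    def add(self, name):
        """Insert a user and return its new id."""
        cur = self.conn.execute("INSERT INTO users (name) VALUES (?)", (name,))
        self.conn.commit()
        return cur.lastrowid

    def get(self, user_id):
        """Return the user's name, or None if the id doesn't exist."""
        row = self.conn.execute(
            "SELECT name FROM users WHERE id = ?", (user_id,)
        ).fetchone()
        return row[0] if row else None
```

With this in place, “use the DAL” is the path of least resistance for every peer, which is exactly the point of making the right thing easy.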
Add monitoring & alerting early on.
Monitoring and alerting is the bridge between product and engineering. Where unit tests confirm that your code works — monitoring confirms that your product works. The historical data of monitoring can be invaluable when making changes, so adding it early on is imperative. Hopefully adding monitoring is as easy as 1 line of code (see “Doing the right thing should be easy”) and everyone will start monitoring vital parts of your product.
For self-hosting, consider Graphite or Munin. Some monitoring companies have a decent free tier: Geckoboard, Librato, and New Relic. Nextdoor uses Datadog for monitoring, alerting, and performance analytics.
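To make “monitoring in one line of code” concrete, here is a hedged sketch using a toy in-memory metrics store. The names (`MetricStore`, `timed`) are hypothetical; in production the store would forward to statsd, Graphite, or Datadog, but the shape is the same: instrumenting a function costs a single decorator line.

```python
# Toy stand-in for a metrics backend (statsd, Graphite, Datadog, ...).
# All names here are illustrative, not a real client library.
import time
from collections import defaultdict


class MetricStore:
    def __init__(self):
        self.counters = defaultdict(int)
        self.timings = defaultdict(list)

    def incr(self, name, value=1):
        self.counters[name] += value

    def timing(self, name, ms):
        self.timings[name].append(ms)


metrics = MetricStore()


def timed(metric_name):
    """Decorator: monitoring a function is one line at its definition."""
    def wrap(fn):
        def inner(*args, **kwargs):
            start = time.monotonic()
            try:
                return fn(*args, **kwargs)
            finally:
                # Count every call (including failures) and record latency.
                metrics.incr(metric_name + ".calls")
                metrics.timing(metric_name + ".ms",
                               (time.monotonic() - start) * 1000)
        return inner
    return wrap


@timed("signup.create_user")  # <- the one line of monitoring
def create_user(name):
    return {"name": name}
```

Because the cost of adopting it is one line, everyone on the team actually does it, which is the whole point of “Doing the right thing should be easy.”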
Edit: I’ve written about the challenges here in a separate blog.
Limiting the stack / Leveraging the existing stack
In hindsight, “the stack is too big” is painfully clear, and scary. Have the foresight to always question whether the new technology is worth the complexity. It’s not just about avoiding “shiny-new-toy” syndrome: avoid increasing the learning curve for new engineers, avoid distributing sources of truth, and avoid duplicating business logic.
It’s also possible that the new tech stack you’re considering is only useful for one of its features. Say ElasticSearch is better at some things than Postgres. But if you’d be leveraging only 5% of ElasticSearch’s features and Postgres is already up and running, then weigh the benefit against having to learn and maintain a new system, and remember the engineering and financial costs as well.
Use the right tools for the job.
Contrary to the point above if you find yourself commonly thinking “I can trick it into doing what I want if I…” then perhaps you should look into other tools that are better suited. Hacking is fun but eventually it becomes unmaintainable, and errors become un-debuggable. Find the best tool for the job and use it. Experience will give you a good gut feeling when you’ve outgrown your current stack, and the overhead of adding new tools is worth the time savings you’d get otherwise.
Engineers love to fantasize and go off on tangents. Many conversations become a web of “And another way we could do this is…” When your initial insight tells you that there are a million ways to approach a problem, it’s important to come to a conclusion in a reasonable amount of time. Do not spiral down into endless debate. Set aside a finite number of hours to think over the issue or discuss it with colleagues. This will keep you lean about approaches and weed out the really silly ones.
It’s important to realize that continuing to come up with solutions for the same problem will have diminishing returns over time. Go with your gut feeling and say, “we should be able to answer this in 30 minutes” before your next meeting.
Separation of responsibilities.
As your work grows you’ll inevitably have to split it into parts. These parts should have a clear separation of responsibilities. If the backend is already doing the heavy lifting, try to limit the work on the frontend. Small, straightforward services/components don’t break — so move responsibility up the chain.
Clear separation of work also helps with debugging. You’ll save a lot of time if you never have to ask, “where is the logic for handling/generating this data?”
Pain of failures ought to be felt by those who can fix it (and those who caused it).
This guideline might be my favorite, and it ought to be considered in both directions: those with the power to fix an issue should feel the pain of the issue, and those who feel the pain ought to be able to fix it. Here “pain” means any issue, burden, or annoyance within the work.
To create a culture of ownership and sustained improvement it is critical that any mistakes or failures are first seen by those who can fix the root cause. This applies whether talking about being paged at night, comfort of the chairs, or even on a more technical level — a service losing access to a shared resource.
If you’re in a situation where you feel the burden of a failure in another department, do everything you can to either become empowered to address the failures or redirect the burden to its origin. When multiple parties can fix the issue, you have the luxury of asking who should fix it. Try to find a balance between assigning causality and disseminating the knowledge.
Example 1: You may be in a situation where the cause of an issue comes from a different department — one that is unable to fix the issue. Take, for example, deadlines set by management which cause workers to cut corners too much. This can be rephrased as “the management does not have sufficient visibility into downsides of cutting corners.” Empower yourself with data, and dashboards showing that going faster in the short term is actually slower in the long run.
Example 2: As an opposite example, a graphics designer may feel the pain (negative feedback from customers/management) of an incorrectly implemented design. One approach is to have this designer daisy-chain the feedback to engineers. This works, but the loop is slow. Shorten the loop by having the designer and engineer work closely together. Possibly even make it a requirement that an engineer’s work is reviewed by the designer before it ships.
Example 3: Lastly there are situations where the pain is simply not necessary. In the situation of a “shared resource” between parties A and B it can be crucial that a culprit A does not affect the innocent bystander B. If someone is late for a train and squeezes in by jamming the doors it can slow down or even break the train causing everyone, not just the culprit, to be late. Observe situations where innocent bystanders pay the price, and see how the cause and effect can be limited to just the culprit. This is sometimes referred to as controlling the blast radius.
Cleanup should be easy.
When developing some augmentation, something temporary, or something that is possibly throw-away code — it pays to architect your solution such that undoing it all is very easy. Deeply embedded solutions will likely remain in the code or in your system for too long because people won’t be interested in rolling up their sleeves and figuring out how to untangle the mess. Clean engineering is more durable, and debugging it is a pleasure!
Solutions via Engineering or via People.
Not everything requires writing code. Also, sometimes people problems can be solved through code. Learn to question various solutions and always think if it’s possible to “engineer out of a problem.” On the flip side if your engineering work is very large — see if it’s possible to have a human solution.
For example: employees are supposed to fill out TPS reports but keep forgetting to write the due date. One solution is to email everyone in the company and remind them to write the due date. But you can engineer yourself out of this problem by moving TPS reports to Google Forms and marking the field as required.
Another example: some data in your database is corrupt for one reason or another. You have 400 bad records out of hundreds of thousands. You could write a script that fixes them, but it would take a couple of days to really make sure the script works and doesn’t make things worse. Alternatively… just fix them all manually. All in all it will be faster.
Nextdoor engineers send a “changelog” email whenever manual changes to production are done. This is a cultural agreement that helps communication across the board. Some changelog emails are automated, but many still have to be done manually.
Write Testable Code
Ideally you can follow the Test Driven Development paradigm. But even if you aren’t — it’s important to think about your tests while you’re writing the code. This thought alone will help you write cleaner code with necessary abstractions around the messy parts.
Tested code should be your pride, and a bragging right. Taking the time to write tests makes the development faster in the long run.
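One hedged sketch of what “think about your tests while writing the code” buys you (the function name and report shape here are illustrative): push the messy, nondeterministic part — the wall clock — behind an injectable parameter, and the core logic becomes trivially testable.

```python
# The messy part (current time) is injectable, so tests never depend
# on when they happen to run. Names are illustrative, not real code.
from datetime import datetime, timezone


def is_report_overdue(due_date, now=None):
    """Pure decision logic: overdue iff `now` is past the due date.

    `now` defaults to the real clock in production, but tests can pass
    a fixed datetime to exercise both branches deterministically.
    """
    now = now or datetime.now(timezone.utc)
    return now > due_date
```

The seam costs one optional parameter, yet without it this function could only be tested flakily, by racing the real clock. That one small abstraction around the messy part is exactly what thinking about tests up front tends to produce.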
Develop for debug-ability (or self-teaching failures)
No one can write code that never breaks, so make sure that your code is easy to debug when it does. Someone down the road who doesn’t have the necessary context or institutional knowledge will be debugging your code. Write it so that it is self-explanatory. Your code already explains what it does, so limit your comments to explaining the why, not the what.
If you’re relying on log statements then make sure they make sense out of context also. One pattern I follow is to explain how to debug directly in the log message that handles a failure.
Multiple failures caught. Search logs for XYZ to find details.
This sort of log message will help anyone, even an engineer who starts next week, figure out what to do. The same idea applies during development. In Nextdoor’s infrastructure, if a unit test tries to reach out to a third-party service you get an error such as:
You are trying to access service XYZ but it’s not mocked. Use subsystem_mock(XYZ) or subsystem_use_real(XYZ)!
This error message tells you exactly how to resolve the problem! Great!
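A guard like that can be sketched in a few lines. This is a hypothetical illustration: the function names `subsystem_mock` and `subsystem_use_real` mirror the error message above, but the implementation here is mine, not Nextdoor’s.

```python
# Self-teaching failure: the error message tells the engineer exactly
# which two calls would fix the test. Illustrative sketch only.
_allowed = {}


def subsystem_mock(name):
    """Register `name` as mocked for this test run."""
    _allowed[name] = "mock"


def subsystem_use_real(name):
    """Explicitly allow real calls to `name` (use sparingly in tests)."""
    _allowed[name] = "real"


def call_service(name):
    """Gateway for all third-party calls; refuses unregistered services."""
    if name not in _allowed:
        raise RuntimeError(
            f"You are trying to access service {name} but it's not mocked. "
            f"Use subsystem_mock({name!r}) or subsystem_use_real({name!r})!"
        )
    return f"{name} via {_allowed[name]}"
```

The key design choice is that the guard names its own remedies: the exception is the documentation, so nobody has to go hunting through a wiki mid-debug.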
Understand when some extra work can become a slippery slope, and cut that work into a separate step. You can apply this even to important features: have a quick-follow to your main project instead of allowing scope creep. Only experience will really tell you the right amount of work to break off into a quick-follow. Make sure your teammates agree!
No worse than before
Often when preparing for a significant overhaul of some functionality you may end up changing the logic significantly while keeping the end result nearly the same. Use the “no worse than before” approach to understand your impact, and make sure that all changes are improvements. This is also a good argument to prevent scope-creep when a teammate asks for additional features.
All or nothing
When you are changing a pattern, or introducing a new concept/class — aim for replacing all instances of the old pattern. This can be a daunting amount of work, but it’s the right thing to do. Getting stuck in half-transition, or allowing your code to have multiple ways of doing the same thing can be detrimental to future development.
Use time correlation to notice changes.
Being able to ask “What just changed?” is crucial. With proper architecture you can look at the relevant dashboards, note events (like production releases), and note manual changes (email changelogs are a blessing).
Going from alert to failure should be fast.
Software is not a baby. When it screams it should also give you details about what exactly is wrong. Of course sometimes it’s impossible, which makes it even more important to be as detailed as possible when an alert is fired off.
Create a tradition of going back to your alerts and improving them. Whenever your alert goes off ask yourself, “what is my first step in debugging this? Can the alert provide me this information?”
Canary alerts (via tools like Pingdom) have their utility: they are high-level alerts that something, somewhere, is wrong. Go figure out what… For an alert to be actionable (and to minimize debugging time) it’s good to have fine-grained alerts in addition to catch-alls. Using Datadog you can create a generic latency monitor, but by leveraging their tagging system your monitor can tell you which host is latent, or even which specific view. The challenge with fine-grained alerts is making sure they are not noisy! Start with “quiet” alerts (just email or Slack) and keep tuning them before promoting them into real pager-worthy ones.
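As a rough illustration of the tagging idea (the metric name and threshold here are assumptions, not Nextdoor’s actual monitors), a Datadog-style monitor query grouped by a tag looks roughly like:

```
avg(last_5m):avg:app.request.latency{env:prod} by {host} > 0.5
```

Grouping `by {host}` turns one generic latency monitor into a multi-alert: when it fires, the notification names the specific offending host instead of just saying “latency is high somewhere.”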
Lastly, when you create your monitors and alarms, take an extra 10 minutes to write down the first steps you recommend for debugging when the alarm goes off. Collect these “runbooks” in an easy-to-find location, and add a link to the specific runbook directly in your alarm.
Bring in all the right people
If debugging is time-sensitive, start gathering more minds. Escalate the page if needed, or get a group of people around the table. The cost of the site being down always justifies bringing in extra help. If nothing else, the rubber-ducky effect is real!
Bad alerts are evil alerts.
If an alert is not actionable — change it. Changing it can mean adjusting the value, or disabling the alert altogether. The danger of an alert being useless noise can often outweigh the danger of this one alert not going off. With sufficient alerting — if something really bad happens then multiple alerts are fired. This gives you the freedom to be strict about cleaning up annoying and useless alerts. Too many fake alerts can drown out an important alert. Have fine-tuned alerts, and keep them lean.
There are countless pieces of wisdom that life outside of academia has to offer, and unsurprisingly they apply mostly outside of academia. Some are specific to tech, some apply to all types of engineering, and others are applicable to all fields. Got another good rule of thumb? Got a situation you’d like to share? Start a discussion below!
I’ve been extremely fortunate to work with the brightest minds in the industry. I love learning and sharing my knowledge with others; if you like pushing yourself to be the best you can be — we’re hiring!
Summary of 22 product-engineering patterns that college didn’t teach me was originally published in Nextdoor Engineering on Medium.