Having worked in a few startups that hired rapidly, I have noticed that there is always a first moment when someone exclaims that some department can no longer fit in the largest conference room. It is said aloud, sometimes with a subtle chuckle, but always with some pride in the acknowledgment that “this is how far we have come everybody but never forget I am Employee #17.”
This is how far we have come everybody but never forget I am Employee #17. — Employee #17
(I must admit I was that guy once.)
When I first dipped my toes into management, my company and I worked out an arrangement where I was effectively both a tech lead and a manager. (This is also known as a “TLM”.) It was a great way to transition myself into a new management role. In hindsight, I stayed in this dual role for too long (years), and I would not recommend it for first-timers. I now believe that with proper coaching and guidance, the transition period for new managers should be no longer than two quarters.
Taking on new roles demands a great deal of attention to identify the…
Co-author: Anuj Desai
Anuj and I have had the privilege of working with several teams on their roadmaps and sprint plans over the last few years, and we noticed that it is not always obvious what a healthy backlog should look like, or how it can be maintained. Unhealthy backlogs, on the other hand, are much more apparent.
Let’s declare <X> bankruptcy.
- Each of us, at some point in our careers, where X is often email, backlog, and sometimes PagerDuty.
The foremost example of an unhealthy backlog is one that grows in length uncontrollably. Week after week, more than…
Promotions and compensation adjustments are some of the most important functions of management. By electing to recognize (or not recognize) individuals and their contributions, we construct a system of incentives that encourage and reward certain behaviors. This shapes the culture of the company and is an important aspect of talent development.
Am I going to get promoted this quarter? — Everyone.
When adjusting the dual levers of promotions and raises, we want to:
Together, these create virtuous cycles of learning…
This is a short explainer of event-driven servers, intended to help readers gain an intuitive understanding of event loops. It could be useful when:
In the classic client-server architecture, a server accepts connections from clients, receives data on the new socket, forwards it to the application for processing, and then sends data back to the client on the same socket.
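The accept, receive, process, and reply flow described above can be sketched with Python's standard `socket` module. This is only an illustrative sketch, not code from the article: `handle`, `serve_once`, and `demo` are hypothetical names, and the "application" here simply upper-cases whatever it receives. Note that every step blocks, which is the limitation event-driven servers are designed to work around.

```python
import socket
import threading


def handle(data: bytes) -> bytes:
    """Hypothetical application logic: upper-case the request."""
    return data.upper()


def serve_once(listener: socket.socket) -> None:
    """Accept one client, read one request, and reply on the same socket."""
    conn, _addr = listener.accept()   # blocks until a client connects
    with conn:
        data = conn.recv(1024)        # blocks until data arrives
        conn.sendall(handle(data))    # send the response back to the client


def demo() -> bytes:
    """Run one request/response round trip on the loopback interface."""
    # Port 0 asks the operating system for any free port.
    listener = socket.create_server(("127.0.0.1", 0))
    port = listener.getsockname()[1]

    server = threading.Thread(target=serve_once, args=(listener,))
    server.start()

    with socket.create_connection(("127.0.0.1", port)) as client:
        client.sendall(b"hello")
        reply = client.recv(1024)

    server.join()
    listener.close()
    return reply


if __name__ == "__main__":
    print(demo())
```

While `serve_once` is waiting in `accept` or `recv`, it cannot do anything else, so a server written this way needs one thread or process per connection to serve clients concurrently.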
At the beginning of 2019, Engineering@Affirm set aggressive performance goals for our React apps and affirm.js¹ to improve user experience. To drive this effort, we started out by improving instrumentation, and measured, in granular detail, the performance of each of our apps and of affirm.js. Shortly after, we coordinated a concerted effort across the organization and prioritized optimization projects across engineering teams, which included code-splitting and CDN improvements.
The common web page optimizations are well covered by several other articles. …
How much are you learning from your postmortems?
Young startups often follow a familiar narrative: in the pursuit of product-market fit, engineers march to the drumbeat of “move fast and break things”. The company values speed of execution and agility over everything else. However, as systems become more complex (for example, through multiple product pivots), failures happen. Eventually, as the business gains significance and prominence, incidents and breaches become increasingly painful and costly. The company looks to its larger peers for guidance, and finds Google’s SRE book, or chances upon some of John Allspaw’s writing.
Earlier this year, I helped load test a gunicorn application on an EC2 instance. This was a 5th generation EC2 instance running a modern Linux distribution, and gunicorn was writing to log files on a single EBS gp2 volume. To our collective awe, we noticed that it was I/O-bound on logging to files on disk.
Even more surprisingly, changing that EC2 instance to use a RAID0 configuration over two EBS volumes worked really well, and we were able to double the number of gunicorn workers on that EC2 instance until it hit the next limiting factor.
Suppose we have a web app, named MyApp. It could be on Django, Spring, or Ruby on Rails, but it started out as a single, small application. All the code is in the same repository, everything is deployed in a single artifact, and all its tables are in the same database. As the app grows and attracts more users, it gets more data. It also gets more developers, more tables in the database, and gets hosted on more machines. The codebase starts to snowball. As we get more successful, we try to scale.
It depends on which issues start becoming…
Inspecting greenlets and tracing coroutines for optimal performance.
The effects of greenlet contention can be difficult to measure and investigate. It sporadically affects unrelated tasks that just happen to be waiting in the queue and can have an outsized effect on response latency, causing spikes in the high percentiles.
At Affirm, we learned the importance of instrumenting greenlets in order to observe their switching behaviors under different combinations of workloads. This article briefly describes how we use coroutines, some of the problems we encountered, and how we solved them.
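To make the contention effect concrete, here is a small sketch of how one task that does not yield can delay an unrelated task waiting in the queue. It is an assumption-laden stand-in: it uses the standard library's `asyncio` rather than greenlets, and it is not the instrumentation used at Affirm. The names `hog`, `measure_queue_delay`, and `run` are hypothetical. The shared principle is cooperative scheduling, where a queued task cannot start until the running one gives up control.

```python
import asyncio
import time


async def hog(block_s: float) -> None:
    """Hold the event loop without yielding (simulated with a blocking sleep)."""
    time.sleep(block_s)


async def measure_queue_delay(block_s: float) -> float:
    """Return how long an unrelated task waited before it first ran."""
    started_at = []

    async def quick() -> None:
        # Record the moment this task actually gets to run.
        started_at.append(time.perf_counter())

    scheduled_at = time.perf_counter()
    # Both tasks are queued now; the loop runs them in scheduling order.
    hog_task = asyncio.create_task(hog(block_s))
    quick_task = asyncio.create_task(quick())
    await asyncio.gather(hog_task, quick_task)
    return started_at[0] - scheduled_at


def run(block_s: float) -> float:
    """Synchronous wrapper so the measurement can be called from plain code."""
    return asyncio.run(measure_queue_delay(block_s))


if __name__ == "__main__":
    print(f"queue delay with no hog:   {run(0.0):.4f}s")
    print(f"queue delay with 50ms hog: {run(0.05):.4f}s")
```

The quick task does no work of its own, yet its start is pushed back by roughly the time the hog holds the loop. In a real service, that waiting time shows up as latency spikes in requests that have nothing to do with the slow one.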
Python has a Global Interpreter Lock, which means threads run concurrently…