by Chas Peacock
For NFL athletes, the championship is the ultimate goal. It’s a career high, and it’s also a pretty big deal for those working behind the scenes on the beloved commercials that debut during the game.
For us, a commercial during the game turned out to be a dress rehearsal.
Of course, at the time we had no idea that the traffic levels during our commercial on February 2 would pale in comparison to the traffic we would receive roughly a month later. During the game, tons of customers jammed our servers to enter the contest we advertised, offering them a chance to win a lifetime of groceries—but soon even more would be shopping for groceries to prepare for a global pandemic.
Thankfully, the traffic surge helped us create a playbook for dealing with incredible stress on our systems.
Know your limits
We anticipated that a commercial and contest at this massive scale would drive a huge influx of users. So to support it, we tried to exercise our traffic capacity on our on-premise servers. Our hope was that we had enough power to handle 1.5 million people hitting the site in 10 minutes, which averages out to roughly 2,500 arrivals per second, with an even higher peak while the commercial aired.
We quickly learned that it was not possible.
So our goal became limiting downtime to as little as possible for as few people as possible, essentially spreading out the traffic curve so everyone who wanted Eva Longoria to give them free groceries could enter during the commercial or shortly afterward. That meant our servers needed to support a lot: signups, new app downloads, logins, forgotten-password requests, and every other interaction that needed to happen in those crucial ten minutes.
Thankfully, we had a plan and were able to set our site up for what we thought would be our max capacity. That night things went off fairly well, and our teams worked hard to ensure we could handle this level of capacity again.
But the bigger challenge was coming, and we underestimated it: in March, within six hours of Tom Hanks testing positive for COVID and the NBA shutting down, we saw nearly double the traffic from the night of the game.
But this time it was relatively smooth sailing.
Why? Well, to prepare for the commercial we did serious testing of how much traffic we could handle, and then followed it up with a real-life massive traffic event. We ran scenarios across our systems to see whether the mobile app could let a sizable number of users enter the contest while also handling the regular traffic of people shopping for groceries. We had never tested at this scale, and coming out of it we learned a lot of important lessons. Of course, things didn’t turn out perfectly, but the evening of the game we handled our hiccups swiftly.
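We haven’t named our load-testing tooling here, but to make the idea concrete, here is a minimal sketch of that kind of scenario test using k6 (a hypothetical tool choice; the endpoints, stages, and numbers are all invented for illustration): a burst of contest entries layered on top of steady shopping traffic.

```typescript
// Hypothetical k6 load-test sketch: ramp toward a commercial-sized spike
// while keeping a steady stream of ordinary shopping traffic in the mix.
import http from 'k6/http';
import { sleep } from 'k6';

export const options = {
  scenarios: {
    // Burst of contest entries, ramping up around "air time".
    contest_entries: {
      executor: 'ramping-vus',
      startVUs: 0,
      stages: [
        { duration: '2m', target: 2000 },  // ramp up as the commercial airs
        { duration: '10m', target: 2000 }, // hold the spike
        { duration: '5m', target: 0 },     // ramp back down
      ],
      exec: 'enterContest',
    },
    // Regular grocery shoppers who must keep working through the spike.
    shopping_traffic: {
      executor: 'constant-vus',
      vus: 300,
      duration: '17m',
      exec: 'browseAndSearch',
    },
  },
};

export function enterContest() {
  http.post('https://example.test/contest/entries', JSON.stringify({ userId: 'load-test' }), {
    headers: { 'Content-Type': 'application/json' },
  });
  sleep(1);
}

export function browseAndSearch() {
  http.get('https://example.test/search?q=tortillas');
  sleep(3);
}
```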
To solve for a huge influx, we figured out the maximum traffic per second we could handle, and then worked out how to slow things down before we brought the site to a crawl with no way to restart. Our work showed us that the best way to ensure the experience worked was to streamline it.
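Capping traffic per second at the front door is classically done with something like a token bucket: admit requests while tokens remain, and turn the rest away quickly instead of letting queues pile up until nothing recovers. A minimal sketch in TypeScript, with illustrative names and limits rather than our production code:

```typescript
// Illustrative token-bucket throttle: admit at most `ratePerSecond` requests,
// allow a small burst, and fail fast for everyone else.
class TokenBucket {
  private tokens: number;
  private lastRefill = Date.now();

  constructor(private ratePerSecond: number, private burst: number) {
    this.tokens = burst;
  }

  tryTake(): boolean {
    const now = Date.now();
    const elapsedSeconds = (now - this.lastRefill) / 1000;
    // Refill based on elapsed time, never exceeding the burst ceiling.
    this.tokens = Math.min(this.burst, this.tokens + elapsedSeconds * this.ratePerSecond);
    this.lastRefill = now;
    if (this.tokens >= 1) {
      this.tokens -= 1;
      return true;
    }
    return false;
  }
}

// Hypothetical numbers: admit ~2,500 requests/second with room for short bursts.
const frontDoor = new TokenBucket(2500, 5000);

export function handleRequest(serve: () => void, turnAway: () => void): void {
  if (frontDoor.tryTake()) {
    serve();    // within capacity: let the request through
  } else {
    turnAway(); // over capacity: a quick "try again" beats a site-wide crawl
  }
}
```

The important design choice is failing fast: a user who is politely turned away can retry in a minute, while a site that accepts everything and grinds to a halt serves no one.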
Spread the problem around
Much of our data is processed in our on-premise data center—our databases, the networks that run between them, the servers that actually run heb.com. It’s an incredible support system, but it’s physical. With a looming pandemic, we didn’t have time to throw more machines at the problem.
So we had to find solutions that didn’t involve horizontally scaling our architecture. Instead, we focused on capacity planning at the front door of our experience, turning off functionality that users wouldn’t miss so we could focus on what they really wanted. Losing a little bit of personalization (like reordering your last order) was a small sacrifice to ensure you were able to place an order at all. You don’t need to know whether an item’s coupon was clipped when looking at search results, so we saved power there. We only acted on about 15% of the opportunities to trim functionality like that, but it still took enough heat off us that we could do all the things we needed.
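One way to picture this kind of graceful degradation is a set of flags checked on hot paths, so expensive extras (like coupon status in search results or reorder suggestions) can be skipped while the system is under load. A hedged sketch; the flag names and the search shape are invented for illustration:

```typescript
// Illustrative degradation flags: flip these at runtime to shed optional work.
interface DegradationFlags {
  skipCouponStatusInSearch: boolean; // don't look up clipped coupons per result
  skipReorderSuggestions: boolean;   // drop "reorder your last order" personalization
}

// In practice these would come from a config service so they can change live.
const flags: DegradationFlags = {
  skipCouponStatusInSearch: true,
  skipReorderSuggestions: true,
};

interface SearchResult {
  sku: string;
  name: string;
  couponClipped?: boolean; // omitted entirely while we're shedding load
}

async function search(
  query: string,
  findProducts: (q: string) => Promise<SearchResult[]>,
  lookUpClippedCoupons: (skus: string[]) => Promise<Set<string>>,
): Promise<SearchResult[]> {
  const results = await findProducts(query);
  if (!flags.skipCouponStatusInSearch) {
    // The expensive per-user lookup only happens when we can afford it.
    const clipped = await lookUpClippedCoupons(results.map(r => r.sku));
    for (const r of results) {
      r.couponClipped = clipped.has(r.sku);
    }
  }
  return results;
}
```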
We also added gate-keeping experiences to give us space (“hey we’re a bit busy, try again”). We worked to convert heavily demanded products into “Best Available.” With probably a dozen levers to pull, the goal was to find the right mix for the moment and to find out whether we could create new levers in real time. We also picked our battles. When we noticed a few savvy users had created time-slot scrapers to grab in-demand time slots, we decided to focus on the larger problem of giving users more time slots rather than building defenses against the scrapers. By saving that time, we were able to add services that got to the root of our issue faster and got more customers the groceries they needed.
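A “hey we’re a bit busy, try again” gate can be as simple as a lever checked in front of the order flow that returns a friendly try-again response (for example, an HTTP 503 with a retry hint) instead of piling more load onto an overloaded backend. A minimal sketch with hypothetical lever names:

```typescript
// Illustrative gate-keeping lever: when pulled, new checkout attempts get a
// polite "we're busy" response instead of adding to an overloaded backend.
interface Levers {
  checkoutGateClosed: boolean;      // "hey we're a bit busy, try again"
  substituteBestAvailable: boolean; // swap high-demand items for "Best Available"
}

const levers: Levers = { checkoutGateClosed: false, substituteBestAvailable: true };

interface GateResponse {
  status: number;
  retryAfterSeconds?: number;
  message: string;
}

export function gateCheckout(): GateResponse {
  if (levers.checkoutGateClosed) {
    return {
      status: 503,
      retryAfterSeconds: 120,
      message: "We're a bit busy right now. Please try again in a couple of minutes.",
    };
  }
  return { status: 200, message: 'ok' };
}
```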
Remember what you did (you’ll need to do it again)
For a situation that’s ever-evolving, you’ve got to be as future-savvy as possible. How can we make sure we can turn all of this stuff off when we no longer need it? How can we add more levers if we need more support?
Right now, nobody knows what’s going to happen next, so we need to be cognizant of the tools we have in reserve if we continue to see load stressors. We need to save space in our roadmaps to reset and return to the best possible experience for our customers. That means preserving stable features and finding effective incremental improvements, while also standing up net-new tech stacks so that we have true elasticity for the next unexpected event.
During the initial chaos of the crisis, what was most inspiring was seeing how the teams came together. We didn’t wait for central leadership. Partners did what they thought was the right thing as quickly as they could, just like we had the night of the commercial. We took the playbook we had created then and ran with it. Everyone was on top of their game from start to finish, because we knew everything we did added up to helping Texans get what they needed fast. Hopefully, now our team is even stronger for the next time Texas needs us.
Chas Peacock is the Director of Engineering. You can connect with him on LinkedIn here.