Is developer compensation becoming bimodal?

Developer compensation has skyrocketed since the demise of the Google et al. wage-suppressing no-hire agreement, to the point where compensation rivals and maybe even exceeds compensation in traditionally remunerative fields like law, consulting, etc.

Those fields have sharply bimodal income distributions. Are programmers in for the same fate? Let’s see what data we can find. First, let’s look at data from the National Association for Law Placement, which shows when legal salaries become bimodal.

Lawyers in 1991

First-year lawyer salaries in 1991. $40k median, trailing off with the upper end just under $90k

Median salary is $40k, with the numbers slowly trickling off until about $90k. According to the BLS $90k in 1991 is worth $160k in 2016 dollars. That’s a pretty generous starting salary.

Lawyers in 2000

First-year lawyer salaries in 2000. $50k median; bimodal with peaks at $40k and $125k

By 2000, the distribution had become bimodal. The lower peak is about the same in nominal (non-inflation-adjusted) terms, putting it substantially lower in real (inflation-adjusted) terms, and there’s an upper peak at around $125k, with almost everyone coming in under $130k. $130k in 2000 is $180k in 2016 dollars. The peak on the left has moved from roughly $30k in 1991 dollars to roughly $40k in 2000 dollars; both of those translate to roughly $55k in 2016 dollars. People in the right mode are doing better, while people in the left mode are doing about the same.
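
The inflation adjustments above are easy to reproduce. Here’s a quick sketch using rough annual-average CPI-U values (the index numbers below are approximations I’m plugging in for illustration, not figures from the post; exact BLS numbers will shift the results slightly):

    # Rough CPI-U annual averages (assumed values; check BLS for exact figures).
    CPI = {1991: 136.2, 2000: 172.2, 2016: 240.0}

    def to_2016_dollars(amount, year):
        """Convert a nominal dollar amount from `year` into 2016 dollars."""
        return amount * CPI[2016] / CPI[year]

    print(f"{to_2016_dollars(90_000, 1991):,.0f}")   # ~158,600 -> the "$160k in 2016 dollars" figure
    print(f"{to_2016_dollars(130_000, 2000):,.0f}")  # ~181,200 -> the ~$180k upper peak in 2000
    print(f"{to_2016_dollars(30_000, 1991):,.0f}")   # ~52,900 -> left mode in 1991, roughly $55k
    print(f"{to_2016_dollars(40_000, 2000):,.0f}")   # ~55,700 -> left mode in 2000, roughly $55k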

I won’t belabor the point with more graphs, but if you look at more recent data, the middle area between the two modes has hollowed out, increasing the level of inequality within the field. As a profession, lawyers have gotten hit hard by automation, and in real terms, 95%-ile offers today aren’t really better than they were in 2000. But 50%-ile and even 75%-ile offers are worse off due to the bimodal distribution.

Programmers in 2015

Enough about lawyers! What about programmers? Unfortunately, it’s hard to get good data on this. Anecdotally, it sure seems to me like we’re going down the same road. But almost all of the public data sources that are available, like H1B data, have salary numbers and not total compensation numbers. Since compensation at the upper end is disproportionately bonus and stock, most data sets I can find don’t capture what’s going on.

One notable exception is the new grad compensation data recorded by Dan Zhang and Jesse Collins:

First-year programmer compensation in 2016. Compensation ranges from $50k to $250k

There’s certainly a wide range here, and while it’s technically bimodal, there isn’t a huge gulf in the middle like you see in law and business. Note that this data is mostly bachelor’s grads with a few master’s grads. PhD numbers, which sometimes go much higher, aren’t included.

Do you know of a better (larger) source of data? This is from about 100 data points, members of the “Hackathon Hackers” Facebook group, in 2015. Dan and Jesse also have data from 2014, but it would be nice to get data over a wider timeframe and just plain more data. Also, this data is pretty clearly biased towards the high end – if you look at national averages for programmers at all levels of experience, the average comes in much lower than the average for new grads in this data set. The data here match the numbers I hear when we compete for people, but the population of “people negotiating offers at Microsoft” also isn’t representative.

If we had more representative data it’s possible that we’d see a lot more data points in the $40k to $60k range along with the data we have here, which would make the data look bimodal. It’s also possible that we’d see a lot more points in the $40k to $60k range, many more in the $70k to $80k range, some more in the $90k+ range, etc., and we’d see a smooth drop-off instead of two distinct modes.
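
If someone does collect a bigger sample, checking for bimodality doesn’t require anything fancy. Here’s a minimal sketch of one way to do it (the use of scikit-learn and the made-up offer numbers are my own choices for illustration, not part of the data set above): compare one- and two-component Gaussian mixtures on log compensation and see whether two well-separated modes genuinely fit better than one.

    import numpy as np
    from sklearn.mixture import GaussianMixture

    def looks_bimodal(comps, seed=0):
        """Crude bimodality check: does a 2-component Gaussian mixture fit
        log-compensation better (lower BIC) than a single Gaussian, and are
        the two component means well separated?"""
        x = np.log(np.asarray(comps, dtype=float)).reshape(-1, 1)
        g1 = GaussianMixture(1, random_state=seed).fit(x)
        g2 = GaussianMixture(2, random_state=seed).fit(x)
        means = np.sort(g2.means_.ravel())
        separation = (means[1] - means[0]) / np.sqrt(g2.covariances_.max())
        return g2.bic(x) < g1.bic(x) and separation > 2.0

    # Made-up offers clustered around $60k and $150k:
    print(looks_bimodal([55e3, 60e3, 65e3, 70e3, 140e3, 150e3, 160e3, 170e3]))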

Stepping back from the meager data we have and looking at the circumstances, “should” programmer compensation be bimodal? Most other fields that have bimodal compensation have a very different compensation structure than we see in programming. For example, top law and consulting firms have an up-or-out structure, which is effectively a tournament, and which distorts compensation in a way that makes a bimodal outcome more likely. Additionally, competitive firms pay the same rate to all 1st year employees, which they determine by matching whoever appears to be paying the most. For example, this year, Cravath announced that it would pay first-year associates $180k, and many other firms followed suit. Like most high-end firms, Cravath has a salary schedule that’s entirely based on experience:

  • 0 years: $180k
  • 1 year: $190k
  • 2 years: $210k
  • 3 years: $235k
  • 4 years: $260k
  • 5 years: $280k
  • 6 years: $300k
  • 7 years: $315k

In software, compensation tends to be on a case-by-case basis, which makes it much less likely that we’ll see a sharp peak the way we do in law. If I had to guess, I’d say that while the dispersion in programmer compensation is increasing, it’s not bimodal, but I don’t really have the right data set to conclusively say anything. Please point me to any data you have that’s better.

Appendix A: please don’t send me these

  • H-1B: mostly salary only.
  • Stack Overflow survey: salary only. Also, data is skewed by the heavy web focus of the survey – I stopped doing the survey when none of their job descriptions matched anyone in my entire building, and I know other people who stopped for the same reason.
  • Glassdoor: weirdly inconsistent about whether or not it includes stock compensation. Numbers for some companies seem to, but numbers for other companies don’t.
  • O’Reilly survey: salary focused.
  • BLS: doesn’t make fine-grained distribution available.
  • IRS: they must have the data, but they’re not sharing.
  • IDG: only has averages.
  • internal company data: too narrow.
  • compensation survey companies like PayScale: when I’ve talked to people from these companies, they acknowledge that they have very poor visibility into large company compensation, but that’s what drives the upper end of the market (outside of finance).
  • #talkpay on twitter: numbers skew low1.

Appendix B: wtf?

Since we have both programmer and lawyer compensation handy, let’s examine that. Programming pays so well that it seems a bit absurd. If you look at other careers with similar compensation, there are multiple factors that act as barriers or disincentives to entry.

If you look at law, you have to win the prestige lottery and get into a top school, which will cost hundreds of thousands of dollars. Then you have to win the grades lottery and get good enough grades to get into a top firm. And then you have to continue winning tournaments to avoid getting kicked out, which requires sacrificing any semblance of a personal life. Consulting, investment banking, etc., are similar. Compensation appears to be proportional to the level of sacrifice (e.g., investment bankers are paid better, but work even longer hours than lawyers).

Medicine seems to be a bit better from the sacrifice standpoint because there’s a cartel which limits entry into the field, but the combination of medical school and residency is still incredibly brutal compared to most jobs at places like Facebook and Google.

Programming also doesn’t have a licensing body limiting the number of programmers, nor is there the same prestige filter where you have to go to a top school to get a well-paying job. Sure, there are a lot of startups who basically only hire from MIT, Stanford, CMU, and a few other prestigious schools, and I see job ads like the following whenever I look at startups:

Our team of 14 includes 6 MIT alumni, 3 ex-Googlers, 1 Wharton MBA, 1 MIT Master in CS, 1 CMU CS alum, and 1 “20 under 20” Thiel fellow. Candidates often remark we’re the strongest team they’ve ever seen.

We’re not for everyone. We’re an enterprise SaaS company your mom will probably never hear of. We work really hard 6 days a week because we believe in the future of mobile and we want to win.

That happens. But, in programming, measuring people by markers of prestige seems to be a Silicon Valley startup thing and not a top-paying companies thing. Big companies, which pay a lot better than startups, don’t filter people out by prestige nearly as often. Not only do you not need the right degree from the right school, you also don’t need to have the right kind of degree, or any degree at all. Although it’s getting rarer to not have a degree, I still meet new hires with no experience and either no degree or a degree in an unrelated field (like sociology or philosophy).

How is it possible that programmers are paid so well without these other barriers to entry that similarly remunerative fields have? One possibility is that we have a shortage of programmers. If that’s the case, you’d expect more programmers to enter the field, bringing down compensation. CS enrollments have been at record levels recently, so this may already be happening. Another possibility is that programming is uniquely hard in some way, but that seems implausible to me. Programming doesn’t seem inherently harder than electrical engineering or chemical engineering and it certainly hasn’t gotten much harder over the past decade, but during that timeframe, programming has gone from having similar compensation to most engineering fields to paying much better. The last time I was negotiating with an EE company about offers, they remarked to me that their VPs don’t make as much as I do, and I work at a software company that pays relatively poorly compared to its peers. There’s no reason to believe that we won’t see a flow of people from engineering fields into programming until compensation is balanced.

Another possibility is that U.S. immigration laws act as a protectionist barrier to prop up programmer compensation. It seems impossible for this to last (why shouldn’t there be really valuable non-U.S. companies?), but it does appear to be somewhat true for now. When I was at Google, one thing that was remarkable to me was that they’d pay you approximately the same thing in a small midwestern town as in Silicon Valley, but they’d pay you much less in London. Whenever one of these discussions comes up, people always bring up the “fact” that SV salaries aren’t really as good as they sound because the cost of living is so high, but companies will not only match SV offers in Seattle, they’ll match them in places like Madison, Wisconsin. My best guess for why this happens is that someone in the midwest can credibly threaten to move to SV and take a job at any company there, whereas someone in London can’t2. While we seem unlikely to loosen current immigration restrictions, our immigration restrictions have caused and continue to cause people who would otherwise have founded companies in the U.S. to found companies elsewhere. Given that the U.S. doesn’t have a monopoly on people who found startups and that we do our best to keep people who want to found startups here out, it seems inevitable that there will eventually be Facebooks and Googles founded outside of the U.S. who compete for programmers the same way companies compete inside the U.S.

Another theory that I’ve heard a lot lately is that programmers at large companies get paid a lot because of the phenomenon described in Kremer’s O-ring model. This model assumes that productivity is multiplicative. If your co-workers are better, you’re more productive and produce more value. If that’s the case, you expect a kind of assortative matching where you end up with high-skill firms that pay better, and low-skill firms that pay worse. This model has a kind of intuitive appeal to it, but it can’t explain why programming compensation has higher dispersion than (for example) electrical engineering compensation. With the prevalence of open source, it’s much easier to utilize the work of productive people outside your firm than in most fields. This model should be less true of programming than of most engineering fields, but the dispersion in compensation is higher.
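
To make the O-ring intuition concrete, here’s a toy sketch (my own illustration with made-up skill numbers, not Kremer’s actual model): when team output is the product of individual skills, total output across firms is maximized by grouping high-skill workers together and low-skill workers together, which is the assortative matching the model predicts.

    from itertools import permutations

    # Skill levels of four workers; each firm employs two workers and
    # produces the *product* of its workers' skills (O-ring style).
    skills = [0.5, 0.6, 0.9, 1.0]

    def total_output(assignment):
        firm_a, firm_b = assignment[:2], assignment[2:]
        return firm_a[0] * firm_a[1] + firm_b[0] * firm_b[1]

    best = max(permutations(skills), key=total_output)
    print(best[:2], best[2:], total_output(best))
    # Prints a sorted split, (0.5, 0.6) and (0.9, 1.0): matching high with
    # high and low with low beats any mixed assignment.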

I don’t understand this at all and would love to hear a compelling theory for why programming “should” pay more than other similar fields, or why it should pay as much as fields that have much higher barriers to entry.


  1. People often worry that comp surveys will skew high because people want to brag, but the reality seems to be that numbers skew low because people feel embarrassed about sounding like they’re bragging. I have a theory that you can see this reflected in the prices of other goods. For example, if you look at house prices, they’re generally predictable based on location, square footage, amenities, and so on. But there’s a significant penalty for having the largest house on the block, for what (I suspect) is the same reason people with the highest compensation disproportionately don’t participate in #talkpay: people don’t want to admit that they have the highest pay, have the biggest house, or drive the fanciest car. Well, some people do, but on average, bragging about that stuff is seen as quite gauche. [return]
  2. There’s a funny move some companies will do where they station the new employee in Canada for a year before importing them into the U.S., which gets them into a visa process that’s less competitive. But this is enough of a hassle that most employees balk at the idea. [return]

Why's that company so big? I could do that in a weekend

I can’t think of a single large software company that doesn’t regularly draw internet comments of the form “What do all the employees do? I could build their product myself.” Benjamin Pollack and Jeff Atwood called out people who do that with Stack Overflow. But Stack Overflow is relatively obviously lean, so the general response is something like “oh, sure maybe Stack Overflow is lean, but FooCorp must really be bloated”. And since most people have relatively little visibility into FooCorp, for any given value of FooCorp, that sounds like a plausible statement. After all, what product could possibly require hundreds, or even thousands of engineers?

A few years ago, in the wake of the rapgenius SEO controversy, a number of folks called for someone to write a better Google. Alex Clemmer responded that maybe building a better Google is a non-trivial problem. Considering how much of Google’s $500B market cap comes from search, and how much money has been spent by tens (hundreds?) of competitors in an attempt to capture some of that value, it seems plausible to me that search isn’t a trivial problem. But in the comments on Alex’s posts, multiple people respond and say that Lucene basically does the same thing Google does and that Lucene is poised to surpass Google’s capabilities in the next few years.

What would Lucene at Google’s size look like? If we do a naive back of the envelope calculation on what it would take to index a significant fraction of the internet (often estimated to be 1 trillion (T) or 10T documents), we might expect a 1T document index to cost something like $10B1. That’s not a feasible startup, so let’s say that instead of trying to index 1T documents, we want to maintain an artisanal search index of 1B documents. Then our cost comes down to $12M/yr. That’s not so bad – plenty of startups burn through more than that every year. While we’re in the VC-funded hypergrowth mode, that’s fine, but once we have a real business, we’ll want to consider trying to save money. At $12M/yr for the index, a performance improvement that lets us trim our costs by 3% is worth roughly $360k/yr. With those kinds of costs, it’s surely worth it to have at least one engineer working full-time on optimization, if not more.
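
Here’s that back-of-the-envelope math spelled out, using the assumptions from the footnote (roughly 5M documents per machine, 2x replication, 10 geographic clusters, and a VM at about $0.352/hr; all of these are rough inputs, not measurements):

    HOURS_PER_YEAR = 24 * 365          # ~8,760
    DOCS_PER_MACHINE = 5e6             # one Lucene shard per machine (rough)
    REPLICATION = 2                    # one replica per shard
    CLUSTERS = 10                      # geographic regions
    VM_COST_PER_HOUR = 0.352           # GCE n1-highmem-8-ish price, 2016

    def yearly_index_cost(num_docs):
        machines = (num_docs / DOCS_PER_MACHINE) * REPLICATION * CLUSTERS
        return machines * VM_COST_PER_HOUR * HOURS_PER_YEAR

    print(f"1T docs: ${yearly_index_cost(1e12) / 1e9:.1f}B/yr")   # ~ $12.3B/yr
    print(f"1B docs: ${yearly_index_cost(1e9) / 1e6:.1f}M/yr")    # ~ $12.3M/yr
    print(f"3% savings on 1B docs: ${0.03 * yearly_index_cost(1e9) / 1e3:.0f}k/yr")  # ~ $370k/yr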

Businesses that actually care about turning a profit will spend a lot of time (hence, a lot of engineers) working on optimizing systems, even if an MVP for the system could have been built in a weekend. There’s also a wide body of research that’s found that decreasing latency has a roughly linear effect on revenue over a pretty wide range of latencies and businesses. Businesses should keep adding engineers to work on optimization until the cost of adding an engineer equals the revenue gain plus the cost savings at the margin. This is often many more engineers than people realize.
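
As a toy version of that marginal argument: keep hiring optimization engineers while the marginal saving from one more engineer exceeds their fully loaded cost. Every number below (the infrastructure bill, the per-engineer cost, and the diminishing-returns curve) is invented purely for illustration:

    INFRA_COST = 500e6        # $/yr spent on hardware (assumed)
    ENGINEER_COST = 400e3     # fully loaded $/yr per engineer (assumed)
    FIRST_ENG_SAVINGS = 0.03  # first optimization engineer trims 3% (assumed)
    DECAY = 0.8               # each additional engineer is 20% less effective (assumed)

    def marginal_saving(n):
        """Savings contributed by the n-th optimization engineer (1-indexed)."""
        return INFRA_COST * FIRST_ENG_SAVINGS * DECAY ** (n - 1)

    n = 0
    while marginal_saving(n + 1) > ENGINEER_COST:
        n += 1
    print(n)  # break-even headcount under these made-up assumptions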

And that’s just performance. Features also matter: when I talk to engineers working on basically any product at any company, they can often point to seemingly trivial individual features that add integer percentage points to revenue. Just as with performance, people underestimate how many engineers you can add to a product before engineers stop paying for themselves.

Additionally, features are often much more complex than outsiders realize. If we look at search, how do we make sure that different forms of dates and phone numbers give the same results? How about internationalization? Each language has unique quirks that have to be accounted for. In French, “l’foo” should often match “un foo” and vice versa, but American search engines from the 90s didn’t actually handle that correctly. How about tokenizing Chinese queries, where words don’t have spaces between them, and sentences don’t have unique tokenizations? How about Japanese, where queries can easily contain four different alphabets? How about handling Arabic, which is mostly read right-to-left, except for the bits that are read left-to-right? And that’s not even the most complicated part of handling Arabic! It’s fine to ignore this stuff for a weekend-project MVP, but ignoring it in a real business means ignoring the majority of the market! Some of these are handled ok by open source projects, but many of the problems involve open research problems.
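
To give a flavor of why “just tokenize the query” is not one feature but dozens, here’s a toy sketch of two of the issues above. The elision rule is a crude stand-in I made up; a real engine needs full per-language analyzers:

    import re

    def tokenize_french(query):
        """Naive French analyzer: split on whitespace, then strip an elided
        article so that l'foo and "un foo" can both match on "foo"."""
        tokens = query.lower().split()
        return [re.sub(r"^(l|d|j|qu)'", "", t) for t in tokens]

    print(tokenize_french("l'ordinateur"))   # ['ordinateur']
    print(tokenize_french("un ordinateur"))  # ['un', 'ordinateur']

    # Chinese has no spaces, so whitespace tokenization returns one giant
    # token; a real engine needs a segmenter, and sentences can segment in
    # more than one valid way.
    print("北京大学生".split())  # ['北京大学生'] -- is it 北京 大学生 or 北京大学 生?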

There’s also security! If you don’t “bloat” your company by hiring security people, you’ll end up like hotmail or yahoo, where your product is better known for how often it’s hacked than for any of its other features.

Everything we’ve looked at so far is a technical problem. Compared to organizational problems, technical problems are straightforward. Distributed systems are considered hard because real systems might drop something like 0.1% of messages, corrupt an even smaller percentage of messages, and see latencies in the microsecond to millisecond range. When I talk to higher-ups and compare what they think they’re saying to what my coworkers think they’re saying, I find that the rate of lost messages is well over 50%, every message gets corrupted, and latency can be months or years2. When people imagine how long it should take to build something, they’re often imagining a team that works perfectly and spends 100% of its time coding. But that’s impossible to scale up. The question isn’t whether or not there will be inefficiencies, but how much inefficiency. A company that could eliminate organizational inefficiency would be a larger innovation than any tech startup, ever. But when doing the math on how many employees a company “should” have, people usually assume that the company is an efficient organization.

This post happens to use search as an example because I ran across some people who claimed that Lucene was going to surpass Google’s capabilities any day now, but there’s nothing about this post that’s unique to search. If you talk to people in almost any field, you’ll hear stories about how people wildly underestimate the complexity of the problems in the field. The point here isn’t that it would be impossible for a small team to build something better than Google search. It’s entirely plausible that someone will have an innovation as great as PageRank, and that a small team could turn that into a viable company. But once that company is past the VC-funded hyper growth phase and wants to maximize its profits, it will end up with a multi-thousand person platforms org, just like Google’s, unless the company wants to leave hundreds of millions or billions of dollars a year on the table due to hardware and software inefficiency. And the company will want to handle languages like Thai, Arabic, Chinese, and Japanese, each of which is non-trivial. And the company will want to have relatively good security. And there are the hundreds of little features that users don’t even realize are there, each of which provides a noticeable increase in revenue. It’s “obvious” that companies should outsource their billing, except that when you talk to companies that handle their own billing, they can point to individual features that increase conversion by single or double digit percentages that they can’t get from Stripe or Braintree. That fifty person billing team is totally worth it, beyond a certain size. And then there’s sales, which most engineers don’t even think of3, not to mention research (which, almost by definition, involves a lot of bets that don’t pan out).

It’s not that all of those things are necessary to run a service at all; it’s that almost every large service is leaving money on the table if they don’t seriously address those things. This reminds me of a common fallacy we see in unreliable systems, where people build the happy path with the idea that the happy path is the “real” work, and that error handling can be tacked on later. For reliable systems, error handling is more work than the happy path. The same thing is true for large services – all of this stuff that people don’t think of as “real” work is more work than the core service4.

I’m experimenting with writing blog posts stream-of-consciousness, without much editing. Both this post and my last post were written that way. Let me know what you think of these posts relative to my “normal” posts!

Thanks to Leah Hanson, Joel Wilder, Kay Rhodes, and Ivar Refsdal for corrections.


  1. In public benchmarks, Lucene appears to get something like 30 QPS - 40 QPS when indexing wikipedia on a single machine. See anandtech, Haque et al., ASPLOS 2015, etc. I’ve seen claims that Lucene can run 10x faster than that on wikipedia but I haven’t seen a reproducible benchmark setup showing that, so let’s say that we can expect to get something like 30 QPS - 300 QPS if we index a wikipedia-sized corpus on one machine.

    Those benchmarks appear to be indexing English Wikipedia, articles only. That’s roughly 50 GB and approximately 5m documents. Estimates of the size of the internet vary, but public estimates often fall into the range of 1 trillion (T) to 10T documents. Say we want to index 1T documents, and we can put 5m documents per machine: we need 1T/5m = 200k machines to handle all of the extra documents. None of the off-the-shelf sharding/distribution solutions that are commonly used with Lucene can scale to 200k machines, but let’s posit that we can solve that problem and can operate a search cluster with 200k machines. We’ll also need to have some replication so that queries don’t return bad results if a single machine goes down. If we replicate every machine once, that’s 400k machines. But that’s 400k machines for just one cluster. If we only have one cluster sitting in some location, users in other geographic regions will experience bad latency to our service, so maybe we want to have ten such clusters. If we have ten such clusters, that’s 4M machines.

    In the Anandtech wikipedia benchmark, they get 30 QPS out of a single-socket Broadwell Xeon D with 64 GB of RAM (enough to fit the index in memory). If we don’t want to employ the army of people necessary to build out and run 4M machines worth of datacenters, AFAICT the cheapest VM that’s plausibly at least as “good” as that machine is the GCE n1-highmem-8, which goes for $0.352/hr. If we multiply that out by 4M machines, that’s a little over $1.4M an hour, or a little more than $12B a year for a service that can’t even get within an order of magnitude of the query rate or latency necessary to run a service like Google or Bing. And that’s just for the index – even a minimal search engine also requires crawling. BTW, people will often claim that this is easy because they have much larger indices in Lucene, but with a posting-list based algorithm like Lucene, you can very roughly think of query rate as inversely related to the number of postings. When you ask these people with their giant indices what their query rate is, you’ll inevitably find that it’s glacial by internet standards. For reference, the core of twitter was a rails app that could handle something like 200 QPS until 2008. If you look at what most people handle with Lucene, it’s often well under 1 QPS, with documents that are much smaller than the average web document, using configurations that damage search relevance too much to be used in commercial search engines (e.g., using stop words). That’s fine, but the fact that people think that sort of experience is somehow relevant to web search is indicative of the problem this post is discussing.

    That also assumes that we won’t hit any other scaling problem if we can make 400k VM clusters. But finding an open source index which will scale not only to the number of documents on the internet, but also the number of terms, is non-trivial. Before you read the next section, try guessing how many unique terms there are online. And then if we shard the internet so that we have 5m documents per machine, try guessing how many unique terms you expect to see per shard.

    When I ask this question, I often hear guesses like “ten million” or “ten billion”. But without even looking at the entire internet, just looking at one single document on github, we can find a document with fifty million unique terms:

    Crista Lopes: The largest C++ file we found in GitHub has 528MB, 57 lines of code. Contains the first 50,847,534 primes, all hard coded into an array.

    So there are definitely more than ten million unique terms on the entire internet! In fact, there’s a website out there that has all primes under one trillion. I believe there are something like thirty-seven billion of those. If that website falls into one shard of our index, we’d expect to see more than thirty-seven billion terms in a single shard; that’s more than most people guess we’ll see on the entire internet, and that’s just in one shard that happens to contain one somewhat pathological site. If we try to put the internet into any existing open source index that I know of, not only will it not be able to scale out enough horizontally, many shards will contain data weird enough to make the entire shard fall over if we run a query. That’s nothing against open source software; like any software, it’s designed to satisfy the needs of its users, and none of its users do anything like index the entire internet. As businesses scale up, they run into funny corner cases that people without exposure to the particular domain don’t anticipate.

    People often object that you don’t need to index all of this weird stuff. There have been efforts to build web search engines that only index the “important” stuff, but it turns out that if you ask people to evaluate search engines, some people will type in the weirdest queries they can think of and base their evaluation off of that. And others type in what they think of as normal queries for their day-to-day work even if they seem weird to you (e.g., a biologist might query for GTGACCTTGGGCAAGTTACTTAACCTCTCTGTGCCTCAGTTTCCTCATCTGTAAAATGGGGATAATA). If you want to be anything but a tiny niche player, you have to handle not only the weirdest stuff you can think of, but the weirdest stuff that many people can think of.

    [return]
  2. Recently, I was curious why an org that’s notorious for producing unreliable services produces so many unreliable services. When I asked around about why, I found that upper management was afraid of sending out any sort of positive message about reliability because they were afraid that people would use that as an excuse to slip schedules. Upper management changed their message to include reliability about a year ago, but if you talk to individual contributors, they still believe that the message is that features are the #1 priority and slowing down on features to make things more reliable is bad for your career (and based on who’s getting promoted the individual contributors appear to be right). Maybe in another year, the org will have really gotten the message through to the people who hand out promotions, and in another couple of years, enough software will have been written with reliability in mind that they’ll actually have reliable services. Maybe. That’s just the first-order effect. The second-order effect is that their policies have caused a lot of people who care about reliability to go to companies that care more about reliability and less about demo-ing shiny new features. They might be able to fix that in a decade. Maybe. That’s made harder by the fact that the org is in a company that’s well known for having PMs drive features above all else. If that reputation is possible to change, it will probably take multiple decades. [return]
  3. For a lot of products, the sales team is more important than the engineering team. If we build out something rivaling Google search, we’ll probably also end up with the infrastructure required to sell a competitive cloud offering. Google actually tried to do that without having a serious enterprise sales force and the result was that AWS and Azure basically split the enterprise market between them. [return]
  4. This isn’t to say that there isn’t waste or that different companies don’t have different levels of waste. I see waste everywhere I look, but it’s usually not what people on the outside think of as waste. Whenever I read outsiders’ descriptions of what’s wasteful at the companies I’ve worked at, they’re almost inevitably wrong. Friends of mine who work at other places also describe the same dynamic. [return]

Developer hiring and the market for lemons

Joel Spolsky has a classic blog post on “Finding Great Developers” where he popularized the meme that great developers are impossible to find, a corollary of which is that if you can find someone, they’re not great. Joel writes,

The great software developers, indeed, the best people in every field, are quite simply never on the market.

The average great software developer will apply for, total, maybe, four jobs in their entire career.

If you’re lucky, if you’re really lucky, they show up on the open job market once, when, say, their spouse decides to accept a medical internship in Anchorage and they actually send their resume out to what they think are the few places they’d like to work at in Anchorage.

But for the most part, great developers (and this is almost a tautology) are, uh, great, (ok, it is a tautology), and, usually, prospective employers recognize their greatness quickly, which means, basically, they get to work wherever they want, so they honestly don’t send out a lot of resumes or apply for a lot of jobs.

Does this sound like the kind of person you want to hire? It should. The corollary of that rule–the rule that the great people are never on the market–is that the bad people–the seriously unqualified–are on the market quite a lot. They get fired all the time, because they can’t do their job. Their companies fail–sometimes because any company that would hire them would probably also hire a lot of unqualified programmers, so it all adds up to failure–but sometimes because they actually are so unqualified that they ruined the company. Yep, it happens.

These morbidly unqualified people rarely get jobs, thankfully, but they do keep applying, and when they apply, they go to Monster.com and check off 300 or 1000 jobs at once trying to win the lottery.

Astute readers, I expect, will point out that I’m leaving out the largest group yet, the solid, competent people. They’re on the market more than the great people, but less than the incompetent, and all in all they will show up in small numbers in your 1000 resume pile, but for the most part, almost every hiring manager in Palo Alto right now with 1000 resumes on their desk has the same exact set of 970 resumes from the same minority of 970 incompetent people that are applying for every job in Palo Alto, and probably will be for life, and only 30 resumes even worth considering, of which maybe, rarely, one is a great programmer. OK, maybe not even one.

Joel’s claim is basically that “great” developers won’t have that many jobs compared to “bad” developers because companies will try to keep “great” developers. Joel also posits that companies can recognize prospective “great” developers easily. But these two statements are hard to reconcile. If it’s so easy to identify prospective “great” developers, why not try to recruit them? You could just as easily make the case that “great” developers are overrepresented in the market because they have better opportunities and it’s the “bad” developers who will cling to their jobs. This kind of adverse selection is common in companies that are declining; I saw that in my intern cohort at IBM1, among other places.

Should “good” developers be overrepresented in the market or underrepresented? If we listen to the anecdotal griping about hiring, we might ask if the market for developers is a market for lemons. This idea goes back to Akerlof’s Nobel prize winning 1970 paper, “The Market for ‘Lemons’: Quality Uncertainty and the Market Mechanism”. Akerlof takes used car sales as an example, splitting the market into good used cars and bad used cars (bad cars are called “lemons”). If there’s no way to distinguish between good cars and lemons, good cars and lemons will sell for the same price. Since buyers can’t distinguish between good cars and bad cars, the price they’re willing to pay is based on the quality of the average in the market. Since owners know if their car is a lemon or not, owners of non-lemons won’t sell because the average price is driven down by the existence of lemons. This results in a feedback loop which causes lemons to be the only thing available.
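
Akerlof’s unraveling argument is easy to see in a toy simulation (the car values below are made up): buyers offer the average value of whatever is still for sale, sellers whose cars are worth more than that offer withdraw, and the process repeats until only the lemons are left.

    # Each car's true value to its owner; owners know this, buyers don't.
    cars = [1000, 3000, 5000, 7000, 9000]   # $1k lemons up to $9k gems

    def market_unravels(cars):
        on_market = list(cars)
        while True:
            offer = sum(on_market) / len(on_market)   # buyers pay the average
            still_selling = [v for v in on_market if v <= offer]
            if still_selling == on_market:            # nobody else withdraws
                return on_market, offer
            on_market = still_selling

    remaining, price = market_unravels(cars)
    print(remaining, price)   # only the cheapest car is left; the price collapses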

This model is certainly different from Joel’s model. Joel’s model assumes that “great” developers are sticky – that they stay at each job for a long time. This comes from two assumptions; first, that it’s easy for prospective employers to identify who’s “great”, and second, that once someone is identified as “great”, their current employer will do anything to keep them (as in the market for lemons). But the first assumption alone is enough to prevent the developer job market from being a market for lemons. If you can tell that a potential employee is great, you can simply go and offer them twice as much as they’re currently making (something that I’ve seen actually happen). You need an information asymmetry to create a market for lemons, and Joel posits that there’s no information asymmetry.

If we put aside Joel’s argument and look at the job market, there’s incomplete information, but both current and prospective employers have incomplete information, and whose information is better varies widely. It’s actually quite common for prospective employers to have better information than current employers!

Just for example, there’s someone I’ve worked with, let’s call him Bob, who’s saved two different projects by doing the grunt work necessary to keep the project from totally imploding. The projects were both declared successes, promotions went out, they did a big PR blitz which involves seeding articles in all the usual suspects, like Wired, and so on and so forth. That’s worked out great for the people who are good at taking credit for things, but it hasn’t worked out so well for Bob. In fact, someone else I’ve worked with recently mentioned to me that management keeps asking him why Bob takes so long to do simple tasks. The answer is that Bob’s busy making sure the services he works on don’t have global outages when they launch, but that’s not the kind of thing you get credit for in Bob’s org. The result of that is that Bob has a network who knows that he’s great, which makes it easy for him to get a job anywhere else at market rate. But his management chain has no idea, and based on what I’ve seen of offers today, they’re paying him about half what he could make elsewhere. There’s no shortage of cases where information transfer inside a company is so poor that external management has a better view of someone’s productivity than internal management. I have one particular example in mind, but if I just think of the Bob archetype, off the top of my head, I know of four people who are currently in similar situations. It helps that I currently work at a company that’s notorious for being dysfunctional in this exact way, but this happens everywhere. When I worked at a small company, we regularly hired great engineers from big companies that were too clueless to know what kind of talent they had.

Another problem with the idea that “great” developers are sticky is that this assumes that companies are capable of creating groups that developers want to work for on demand. This is usually not the case. Just for example, I once joined a team where the TL was pretty strongly against using version control or having tests. As a result of those (and other) practices, it took five devs one year to produce 10k lines of kinda-sorta working code for a straightforward problem. Additionally, it was a pressure cooker where people were expected to put in 80+ hour weeks, where the PM would shame people into putting in longer hours. Within a year, three of the seven people who were on the team when I joined had left; two of them went to different companies. The company didn’t want to lose those two people, but it wasn’t capable of creating an environment that would keep them.

Around when I joined that team, a friend of mine joined a really great team. They do work that materially impacts the world, they have room for freedom and creativity, a large component of their jobs involves learning new and interesting things, and so on and so forth. Whenever I heard about someone who was looking for work, I’d forward them that team. That team is now full for the foreseeable future because everyone whose network included that team forwarded people into that team. But if you look at the team that lost three out of seven people in a year, that team is hiring. A lot. The result of this dynamic is that, as a dev, if you join a random team, you’re overwhelmingly likely to join a team that has a lot of churn. Additionally, if you know of a good team, it’s likely to be full.

Joel’s model implicitly assumes that, proportionally, there are many more dysfunctional developers than dysfunctional work environments.

At the last conference I attended, I asked most people I met two questions:

  1. Do you know of any companies that aren’t highly dysfunctional?
  2. Do you know of any particular teams that are great and are hiring?

Not one single person told me that their company meets the criteria in (1). A few people suggested that, maybe, Dropbox is ok, or that, maybe, Jane Street is ok, but the answers were of the form “I know a few people there and I haven’t heard any terrible horror stories yet, plus I sometimes hear good stories”, not “that company is great and you should definitely work there”. Most people said that they didn’t know of any companies that weren’t a total mess.

A few people had suggestions for (2), but the most common answer was something like “LOL no, if I knew that I’d go work there”. The second most common answer was of the form “I know some people on the Google Brain team and it sounds great”. There are a few teams that are well known for being great places to work, but they’re so few and far between that it’s basically impossible to get a job on one of those teams. A few people knew of actual teams that they’d strongly recommend who were hiring, but that was rare. Much rarer than finding a developer who I’d want to work with who would consider moving. If I flipped the question around and asked if they knew of any good developers who were looking for work, the answer was usually “yes”2.

Another problem with the idea that “great” developers are impossible to find because they join companies and then stick is that developers (and companies) aren’t immutable. Because I’ve been lucky enough to work in environments that allow people to really flourish, I’ve seen a lot of people go from unremarkable to amazing. Because most companies invest pretty much nothing in helping people improve, a company can do really well here without investing much effort.

On the flip side, I’ve seen entire teams of devs go on the market because their environment changed. Just for example, I used to know a lot of people who worked at company X under Marc Yun. It was the kind of place that has low attrition because people really enjoy working there. And then Marc left. Over the next two years, literally everyone I knew who worked there left. This one change both created a lemon in the searching-for-a-team job market and put a bunch of good developers on the market. This kind of thing happens all the time, even more now than in the past because of today’s acquisition-heavy environment.

Is developer hiring a market for lemons? Well, it depends on what you mean by that. Both developers and hiring managers have incomplete information. It’s not obvious if having a market for lemons in one direction makes the other direction better or worse. The fact that joining a new team is uncertain makes developers less likely to leave existing teams, which makes it harder to hire developers. But the fact that developers often join teams which they dislike makes it easier to hire developers. What’s the net effect of that? I have no idea.

From where I’m standing, it seems really hard to find a good manager/team, and I don’t know of any replicable strategy for doing so; I have a lot of sympathy for people who can’t find a good fit because I get how hard that is. But I have seen replicable strategies for hiring, so I don’t have nearly as much sympathy for hiring managers who complain that hiring “great” developers is impossible.

When a hiring manager complains about hiring, in every single case I’ve seen so far, the hiring manager has one of the following problems:

  1. They pay too little. The last time I went looking for work, I found a 6x difference in compensation between companies who might hire me in the same geographic region. Basically all of the companies thought that they were competitive, even when they were at the bottom end of the range. I don’t know what it is, but companies always seem to think that they pay well, even when they’re not even close to being in the right range. Almost everyone I talk to tells me that they pay as much as any reasonable company. Sure, there are some companies out there that pay a bit more, but they’re overpaying! You can actually see this if you read Joel’s writing – back when he wrote the post I’m quoting above, he talked about how well Fog Creek paid. A couple years later, he complained that Google was overpaying for college kids with no experience, and more recently he’s pretty much said that you don’t want to work at companies that pay well.

  2. They pass on good or even “great” developers3. Earlier, I claimed that I knew lots of good developers who are looking for work. You might ask, if there are so many good developers looking for work, why’s it so hard to find them? Joel claims that out of a 1000 resumes, maybe 30 people will be “solid” and 970 will be “incompetent”. It seems to me it’s more like 200 will be solid and 20 will be really good. It’s just that almost everyone uses the same filters, so everyone ends up fighting over the 30 people who they think are solid.

    Matasano famously solved their hiring problem by using a different set of filters and getting a different set of people. Despite the resounding success of their strategy, pretty much everyone insists on sticking with the standard strategy of picking people with brand name pedigrees and running basically the same interview process as everyone else, bidding up the price of folks who are trendy and ignoring everyone else.

    If I look at developers I know who are in high-demand today, a large fraction of them went through a multi-year period where they were underemployed and practically begging for interesting work. These people are very easy to hire if you can find them.

  3. They’re trying to hire for some combination of rare skills. Right now, if you’re trying to hire for someone with experience in deep learning and, well, anything else, you’re going to have a bad time.

  4. They’re much more dysfunctional than they realize. I know one hiring manager who complains about how hard it is to hire. What he doesn’t realize is that literally everyone on his team is bitterly unhappy and a significant fraction of his team gives anti-referrals to friends and tells them to stay away.

    That’s an extreme case, but it’s quite common to see a VP or founder baffled by why hiring is so hard when employees consider the place to be mediocre or even bad.

Of these problems, (1), low pay, is both the most common and the simplest to fix.

In the past few years, Oracle and Alibaba have spun up new cloud computing groups in Seattle. This is a relatively competitive area, and both companies have reputations that work against them when hiring4. If you believe the complaints about how hard it is to hire, you wouldn’t think one company, let alone two, could spin up entire cloud teams in Seattle. Both companies solved the problem by paying substantially more than their competitors were offering for people with similar experience. Alibaba became known for such generous offers that when I was negotiating my offer from Microsoft, MS told me that they’d match an offer from any company except Alibaba. I believe Oracle and Alibaba have hired hundreds of engineers over the past few years.

Most companies don’t need to hire anywhere near hundreds of people; they can pay competitively without hiring so many developers that the entire market moves upwards, but they still refuse to do so, while complaining about how hard it is to hire.

(2), filtering out good potential employees, seems like the modern version of “no one ever got fired for hiring IBM”. If you hire someone with a trendy background who’s good at traditional coding interviews and they don’t work out, who could blame you? And no one’s going to notice all the people you missed out on. Like (1), this is something that almost everyone thinks they do well and they’ll say things like “we’d have to lower our bar to hire more people, and no one wants that”. But I’ve never worked at a place that doesn’t filter out a lot of people who end up doing great work elsewhere. I’ve tried to get underrated programmers5 hired at places I’ve worked, and I’ve literally never succeeded in getting one hired. Once, someone I failed to get hired managed to get a job at Google after something like four years being underemployed (and is a star there). That guy then got me hired at Google. Not hiring that guy didn’t only cost them my brilliant friend, it eventually cost them me!

BTW, this illustrates a problem with Joel’s idea that “great” devs never apply for jobs. There’s often a long time period where a “great” dev has an extremely hard time getting hired, even through their network who knows that they’re great, because they don’t look like what people think “great” developers look like. Additionally, Google, which has heavily studied which hiring channels give good results, has found that referrals and internal recommendations don’t actually generate much signal. While people will refer “great” devs, they’ll also refer terrible ones. The referral bonus scheme that most companies set up skews incentives in a way that makes referrals worse than you might expect. Because of this and other problems, many companies don’t weight referrals particularly heavily, and “great” developers still go through the normal hiring process, just like everyone else.

(3), needing a weird combination of skills, can be solved by hiring people with half or a third of the expertise you need and training people. People don’t seem to need much convincing on this one, and I see this happen all the time.

(4), dysfunction seems hard to fix. If I knew how to do that, I’d be a manager.

As a dev, it seems to me that teams I know of that are actually good environments that pay well have no problems hiring, and that teams that have trouble hiring can pretty easily solve that problem. But I’m biased. I’m not a hiring manager. There’s probably some hiring manager out there thinking: “every developer I know who complains that it’s hard to find a good team has one of these four obvious problems; if only my problems were that easy to solve!”

Thanks to Leah Hanson, David Turner, Tim Abbott, Vaibhav Sagar, Victor Felder, Ezekiel Smithburg, Juliano Bortolozzo Solanho, Stephen Tu, Pierre-Yves Baccou, Jorge Montero, Ben Kuhn, and Lindsey Kuper for comments and corrections.

If you liked this post, you’d probably enjoy this other post on the bogosity of claims that there can’t possibly be discrimination in tech hiring.


  1. The folks who stayed describe an environment that’s mostly missing mid-level people they’d want to work with. There are lifers who’ve been there forever and will be there until retirement, and there are new grads who land there at random. But, compared to their competitors, there are relatively few people with 5-15 years of experience. The person I knew who lasted the longest stayed until the 8 year mark, but he started interviewing with an eye on leaving when he found out the other person on his team who was competent was interviewing; neither one wanted to be the only person on the team doing any work, so they raced to get out the door first. [return]
  2. This section kinda makes it sound like I’m looking for work. I’m not looking for work, although I may end up forced into it if my partner takes a job outside of Seattle. [return]
  3. Moishe Lettvin has a talk I really like, where he talks about a time when he was on a hiring committee and they rejected every candidate that came up, only to find that the “candidates” were actually anonymized versions of their own interviews!

    The bit about when he first started interviewing at Microsoft should sound familiar to MS folks. As is often the case, he got thrown into the interview with no warning and no preparation. He had no idea what to do and, as a result, wrote up interview feedback that wasn’t great. “In classic Microsoft style”, his manager forwarded the interview feedback to the entire team and said “don’t do this”. “In classic Microsoft style” is a quote from Moishe, but I’ve observed the same thing. I’d like to talk about how we have a tendency to do extremely blameful postmortems and how that warps incentives, but that probably deserves its own post.

    Well, I’ll tell one story, in remembrance of someone who recently left for Google. Shortly after that guy joined, he was in the office on a weekend (a common occurrence on his team). A manager from another team pinged him on chat and asked him to sign off on some code from the other team. The new guy, wanting to be helpful, signed off on the code. On Monday, the new guy talked to his mentor and his mentor suggested that he not help out other teams like that. Later, there was an outage related to the code. In classic Microsoft style, the manager from the other team successfully pushed the blame for the outage from his team to the new guy.

    [return]
  4. For a while, Oracle claimed that the culture of the Seattle office is totally different from mainline-Oracle culture, but from what I’ve heard, they couldn’t resist Oracle-ifying the Seattle group and that part of the pitch is no longer convincing. [return]
  5. This footnote is a response to Ben Kuhn, who asked me, what types of devs are underrated and how would you find them? I think this group is diverse enough that there’s no one easy way to find them. There are people like “Bob”, who do critical work that’s simply not noticed. There are also people who are just terrible at interviewing, like Jeshua Smith. I believe he’s only once gotten a performance review that wasn’t excellent (that semester, his manager said he could only give out one top rating, and it wouldn’t be fair to give it to only one of his two top performers, so he gave them both average ratings). In every place he’s worked, he’s been well known as someone who you can go to with hard problems or questions, and much higher ranking engineers often go to him for help. I tried to get him hired at two different companies I’ve worked at and he failed both interviews. He sucks at interviews. My understanding is that his interview performance almost kept him from getting his current job, but his references were so numerous and strong that his current company decided to take a chance on him anyway. But he only had those references because his old org has been disintegrating. His new company picked up a lot of people from his old company, so there were many people at the new company that knew him. He can’t get the time of day almost anywhere else. Another person I’ve tried and failed to get hired is someone I’ll call Ashley, who got rejected in the recruiter screening phase at Google for not being technical enough, despite my internal recommendation that she was one of the strongest programmers I knew. But she came from a “nontraditional” background that didn’t fit the recruiter’s idea of what a programmer looked like, so that was that. Nontraditional is a funny term because it seems like most programmers have a “nontraditional” background, but you know what I mean.

    There’s enough variety here that there isn’t one way to find all of these people. Having a filtering process that’s more like Matasano’s and less like Google, Microsoft, Facebook, almost any YC startup you can name, etc., is probably a good start.

    [return]

Should I buy ECC memory?

Jeff Atwood, perhaps the most widely read programming blogger, has a post that makes a case against using ECC memory. My read is that his major points are:

  1. Google didn’t use ECC when they built their servers in 1999
  2. Most RAM errors are hard errors and not soft errors
  3. RAM errors are rare because hardware has improved
  4. If ECC were actually important, it would be used everywhere and not just servers. Paying for optional stuff like this is downright enterprisey

Let’s take a look at these arguments one by one:

1. Google didn’t use ECC in 1999

If you do things just because Google once did them, here are some things you might do:

A. Put your servers into shipping containers.

Articles are still written today about what a great idea this is, even though this was an experiment at Google that was deemed unsuccessful. Turns out, even Google’s experiments don’t always succeed. In fact, their propensity for “moonshots” means that they have more failed experiments than most companies. IMO, that’s a substantial competitive advantage for them. You don’t need to make that advantage bigger than it already is by blindly copying their failed experiments.

B. Cause fires in your own datacenters

Part of the post talks about how awesome these servers are:

Some people might look at these early Google servers and see an amateurish fire hazard. Not me. I see a prescient understanding of how inexpensive commodity hardware would shape today’s internet.

The last part of that is true. But the first part has a grain of truth, too. When Google started designing their own boards, one generation had a regrowth1 issue that caused a non-zero number of fires.

BTW, if you click through to Jeff’s post and look at the photo that the quote refers to, you’ll see that the boards have a lot of flex in them. That caused problems and was fixed in the next generation. You can also observe that the cabling is quite messy, which also caused problems, and was also fixed in the next generation. There were other problems, but I’ll leave those as an exercise for the reader.

C. Make servers that injure your employees

One generation of Google servers had infamously sharp edges, giving them the reputation of being made of “razor blades and hate”.

D. Create weather in your datacenters

From talking to folks at a lot of large tech companies, it seems that most of them have had a climate control issue resulting in clouds or fog in their datacenters. You might call this a clever plan by Google to reproduce Seattle weather so they can poach MS employees. Alternately, it might be a plan to create literal cloud computing. Or maybe not.

Note that these are all things Google tried and then changed. Making mistakes and then fixing them is common in every successful engineering organization. If you’re going to cargo cult an engineering practice, you should at least cargo cult current engineering practices, not something that was done in 1999.

When Google used servers without ECC back in 1999, they found a number of symptoms that were ultimately due to memory corruption, including a search index that returned effectively random results to queries. The actual failure mode here is instructive. I often hear that it’s ok to ignore ECC on these machines because it’s ok to have errors in individual results. But even when you can tolerate occasional errors, ignoring errors means that you’re exposing yourself to total corruption, unless you’ve done a very careful analysis to make sure that a single error can only contaminate a single result. In research that’s been done on filesystems, it’s been repeatedly shown that despite making valiant attempts at creating systems that are robust against a single error, it’s extremely hard to do so and basically every heavily tested filesystem can have a massive failure from a single error (see the output of Andrea and Remzi’s research group at Wisconsin if you’re curious about this). I’m not knocking filesystem developers here. They’re better at that kind of analysis than 99.9% of programmers. It’s just that this problem has been repeatedly shown to be hard enough that humans cannot effectively reason about it, and automated tooling for this kind of analysis is still far from a pushbutton process. In their book on warehouse scale computing, Google discusses error detection and correction, and ECC memory is cited as their slam dunk case: the case where it’s obvious that you should use hardware error correction2.

Google has great infrastructure. From what I’ve heard of the infra at other large tech companies, Google’s sounds like the best in the world. But that doesn’t mean that you should copy everything they do. Even if you look at their good ideas, it doesn’t make sense for most companies to copy them. They created a replacement for Linux’s work stealing scheduler that uses both hardware run-time information and static traces to allow them to take advantage of new hardware in Intel’s server processors that lets you dynamically partition caches between cores. If used across their entire fleet, that could easily save Google more money in a week than stackexchange has spent on machines in their entire history. Does that mean you should copy Google? No, not unless you’ve already captured all the lower hanging fruit, which includes things like making sure that your core infrastructure is written in highly optimized C++, not Java or (god forbid) Ruby. And the thing is, for the vast majority of companies, writing in a language that imposes a 20x performance penalty is a totally reasonable decision.

2. Most RAM errors are hard errors

The case against ECC quotes this section of a study on DRAM errors (the bolding is Jeff’s):

Our study has several main findings. First, we find that approximately 70% of DRAM faults are recurring (e.g., permanent) faults, while only 30% are transient faults. Second, we find that large multi-bit faults, such as faults that affects an entire row, column, or bank, constitute over 40% of all DRAM faults. Third, we find that almost 5% of DRAM failures affect board-level circuitry such as data (DQ) or strobe (DQS) wires. Finally, we find that chipkill functionality reduced the system failure rate from DRAM faults by 36x.

This is somewhat ironic, as this quote doesn’t sound like an argument against ECC; it sounds like an argument for chipkill, a particular class of ECC. Putting that aside, Jeff’s post points out that hard errors are twice as common as soft errors, and then mentions that they run memtest on their machines when they get them. First, a 2:1 ratio isn’t so large that you can just ignore soft errors. Second, the post implies that hard errors are fixed at the start of a machine’s life, so that a memtest run on delivery will catch them and they can’t surface later. That’s incorrect. You can think of electronics as wearing out just the same way mechanical devices wear out. The mechanisms are different, but the effects are similar. In fact, if you compare reliability analysis of chips vs. other kinds of reliability analysis, you’ll find they often use the same families of distributions to model failures. Third, Jeff’s line of reasoning implies that ECC can’t help with detection or correction of hard errors, which is not only incorrect but directly contradicted by the quote.

So, how often are you going to run memtest on your machines to try to catch these hard errors, and how much data corruption are you willing to live with? One of the key uses of ECC is not to correct errors, but to signal errors so that hardware can be replaced before silent corruption occurs. No one’s going to consent to shutting down everything on a machine every day to run memtest (that would be more expensive than just buying ECC memory), and even if you could convince people to do that, it won’t catch as many errors as ECC will.

When I worked at a company that owned about 1000 machines, we noticed that we were getting strange consistency check failures, and after maybe half a year we realized that the failures were more likely to happen on some machines than others. The failures were quite rare, maybe a couple of times a week on average, so it took a substantial amount of time to accumulate the data, and more time for someone to realize what was going on. Without knowing the cause, analyzing the logs to figure out that the errors were caused by single bit flips (with high probability) was also non-trivial. We were lucky that, as a side effect of the process we used, the checksums were calculated in a separate process, on a different machine, at a different time, so that an error couldn’t corrupt the result and propagate that corruption into the checksum. If you merely try to protect yourself with in-memory checksums, there’s a good chance you’ll perform a checksum operation on already corrupted data and compute a valid checksum of bad data, unless you’re doing some really fancy stuff with calculations that carry their own checksums (and if you’re that serious about error correction, you’re probably using ECC regardless). Anyway, after completing the analysis, we found that memtest couldn’t detect any problems, but that replacing the RAM on the bad machines caused a one to two order of magnitude reduction in error rate. Most services don’t have the kind of checksumming we had; those services will simply silently write corrupt data to persistent storage and never notice problems until a customer complains.
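
To make the checksumming point concrete, here’s a minimal Python sketch (the names and the single-record example are made up for illustration, and this is far simpler than the pipeline described above). The key property is that the checksum is computed from an independent copy of the data, ideally in a different process at a different time, and verified against the bytes that were actually persisted; a checksum computed from a buffer that’s already been corrupted is just a valid checksum of bad data.

import zlib

def checksum(data):
    # CRC32 is enough to catch single bit flips; use something stronger if
    # you also care about multi-bit or adversarial corruption.
    return zlib.crc32(data)

# Producer side (ideally its own process or machine): compute the checksum
# from the authoritative copy of the data before handing it off.
record = b"some payload"
expected = checksum(record)

# Verifier side (later, elsewhere): re-read the persisted bytes and compare
# against the independently computed checksum.
def verify(persisted, expected):
    return checksum(persisted) == expected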

3. Due to advances in hardware manufacturing, errors are very rare

The data in the post isn’t sufficient to support this assertion. Note that since RAM usage has been increasing and continues to increase at a fast exponential rate, RAM failures would have to decrease at a greater exponential rate to actually reduce the incidence of data corruption. Furthermore, as chips continue to shrink, features get smaller, making the kind of wearout issues discussed in “2” more common. For example, at 20nm, a DRAM capacitor might hold something like 50 electrons, and that number will get smaller for the next generation of DRAM as things continue to shrink.

The 2012 study that Atwood quoted has this graph on corrected errors (a subset of all errors) on ten randomly selected failing nodes (6% of nodes had at least one failure):

We’re talking between 10 and 10k errors for a typical node that has a failure, and that’s a cherry-picked study from a post that’s arguing that you don’t need ECC. Note that the nodes here only have 16GB of RAM, which is an order of magnitude less than modern servers often have, and that this was on an older process node that was less vulnerable to noise than current nodes are. For anyone who’s used to dealing with reliability issues and just wants to know the FIT rate, the study finds a FIT rate of between 0.057 and 0.071 faults per Mbit (which, contra Atwood’s assertion, is not a shockingly low number). If you take the most optimistic FIT rate, 0.057, and do the calculation for a server without much RAM (here, I’m using 128GB, since the servers I see nowadays typically have between 128GB and 1.5TB of RAM), you get an expected value of 0.057 * 1000 * 1000 * 8760 / 1000000000 = 0.5 faults per year per server. Note that this is for faults, not errors. From the graph above, we can see that a fault can easily cause hundreds or thousands of errors per month. Another thing to note is that there are multiple nodes that don’t have errors at the start of the study but develop errors later on.
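
Here’s the same back-of-the-envelope calculation as a few lines of Python, with the assumed RAM size made explicit so it’s easy to redo for a bigger server:

# Expected memory faults per server per year from a FIT rate.
# FIT here is faults per billion device-hours per Mbit, as in the study.
fit_per_mbit = 0.057           # most optimistic rate from the study
ram_gb = 128                   # assumed server RAM
mbit = ram_gb * 1024 * 8       # ~1e6 Mbit, matching the rough figure above
hours_per_year = 8760
faults_per_year = fit_per_mbit * mbit * hours_per_year / 1e9
print(faults_per_year)         # ~0.5 faults per server per year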

Sun/Oracle famously ran into this a number of decades ago. Transistors and DRAM capacitors were getting smaller, much as they are now, and memory usage and caches were growing, much as they are now. Between having smaller transistors that were less resilient to transient upset as well as more difficult to manufacture, and having more on-chip cache, the vast majority of server vendors decided to add ECC to their caches. Sun decided to save a few dollars and skip the ECC. The direct result was that a number of Sun customers reported sporadic data corruption. It took Sun multiple years to spin a new architecture with ECC cache, and Sun made customers sign an NDA to get replacement chips. Of course there’s no way to cover up this sort of thing forever, and when it came out, Sun’s reputation for producing reliable servers took a permanent hit, much like the time they tried to cover up poor performance results by introducing a clause into their terms of service disallowing benchmarking.

Another thing to note here is that when you’re paying for ECC, you’re not just paying for ECC, you’re paying for parts (CPUs, boards) that have been qual’d more thoroughly. You can easily see this with disk failure rates, and I’ve seen many people observe this in their own private datasets. In terms of public data, I believe Andrea and Remzi’s group had a SIGMETRICS paper a few years back that showed that SATA drives were 4x more likely than SCSI drives to have disk read failures, and 10x more likely to have silent data corruption. This relationship held true even with drives from the same manufacturer. There’s no particular reason to think that the SCSI interface should be more reliable than the SATA interface, but it’s not about the interface. It’s about buying a high-reliability server part vs. a consumer part. Maybe you don’t care about disk reliability in particular because you checksum everything and can easily detect disk corruption, but there are some kinds of corruption that are harder to detect.

4. If ECC were actually important, it would be used everywhere and not just servers.

Rephrased slightly, this argument is “If this feature were actually important for servers, it would be used in non-servers”. You could make this argument about a fair number of server hardware features. This is actually one of the more obnoxious problems facing large cloud vendors.

They have enough negotiating leverage to get most parts at cost, but that only works where there’s more than one viable vendor. Some of the few areas where there aren’t any viable competitors include CPUs and GPUs. Luckily for them, they don’t need that many GPUs, but they need a lot of CPUs, and the bit about CPUs has been true for a long time. There have been a number of attempts by CPU vendors to get into the server market, but each attempt so far has been fatally flawed in a way that made it obvious from an early stage that the attempt was doomed (and these are often 5 year projects, so that’s a lot of time to spend on a doomed project). The Qualcomm effort has been getting a lot of hype, but when I talk to folks I know at Qualcomm they all tell me that the current chip is basically for practice, since Qualcomm needed to learn how to build a server chip from all the folks they poached from IBM, and that the next chip is the first one that has any hope of being competitive. I have high hopes for Qualcomm as well as for other ARM efforts to build good server parts, but those efforts are still a ways away from bearing fruit.

The near total unsuitability of current ARM (and POWER) options (not including hypothetical variants of Apple’s impressive ARM chip) for most server workloads in terms of performance per TCO dollar is a bit of a tangent, so I’ll leave that for another post, but the point is that Intel has the market power to make people pay extra for server features, and they do so. Additionally, some features are genuinely more important for servers than for mobile devices with a few GB of RAM and a power budget of a few watts that are expected to randomly crash and reboot periodically anyway.

Conclusion

Should you buy ECC RAM? That depends. For servers, it’s probably a good bet considering the cost, although it’s hard to do a real cost/benefit analysis because it’s so hard to figure out the cost of silent data corruption, or the cost of having some risk of burning half a year of developer time tracking down intermittent failures only to find that they were caused by using non-ECC memory.

For normal desktop use, I’m pro-ECC, but if you don’t have regular backups set up, setting up backups probably has a better ROI than ECC. That said, if you have backups without ECC, you can easily write corrupt data into your primary store and replicate that corrupt data into your backups.

Thanks to Prabhakar Ragde, Tom Murphy, Jay Weisskopf, Leah Hanson, Joe Wilder, and Ralph Corderoy for discussion/comments/corrections. Also, thanks (or maybe anti-thanks) to Leah for convincing me that I should write up this off the cuff verbal comment as a blog post. Apologies for any errors, the lack of references, and the stilted prose; this is basically a transcription of half of a conversation and I haven’t explained terms, provided references, or checked facts in the level of detail that I normally do.


  1. One of the funnier examples of this I can think of, at least to me, is the magical self-healing fuse. Although there are many implementations, you can think of a fuse on a chip as basically a resistor. If you run some current through it, you should get a connection. If you run a lot of current through it, you’ll heat up the resistor and eventually destroy it. This is commonly used to fuse off features on chips, or to do things like set the clock rate, with the idea being that once a fuse is blown, there’s no way to unblow the fuse.

    Once upon a time, there was a semiconductor manufacturer that rushed their manufacturing process a bit and cut the tolerances a bit too fine in one particular process generation. After a few months (or years), the connection between the two ends of the fuse could regrow and cause the fuse to unblow. If you’re lucky, the fuse will be something like the high-order bit of the clock multiplier, which will basically brick the chip if changed. If you’re not lucky, it will be something that results in silent data corruption.

    I heard about problems in that particular process generation from that manufacturer from multiple people at different companies, so this wasn’t an isolated thing. When I say this is funny, I mean that it’s funny when you hear this story at a bar. It’s maybe less funny when you discover, after a year of testing, that some of your chips are failing because their fuse settings are nonsensical, and you have to respin your chip and delay the release for 3 months. BTW, this fuse regrowth thing is another example of a class of error that can be mitigated with ECC.

    This is not the issue that Google had; I only mention this because a lot of people I talk to are surprised by the ways in which hardware can fail.

    [return]
  2. In case you don’t want to dig through the whole book, most of the relevant passage is:

    In a system that can tolerate a number of failures at the software level, the minimum requirement made to the hardware layer is that its faults are always detected and reported to software in a timely enough manner as to allow the software infrastructure to contain it and take appropriate recovery actions. It is not necessarily required that hardware transparently corrects all faults. This does not mean that hardware for such systems should be designed without error correction capabilities. Whenever error correction functionality can be offered within a reasonable cost or complexity, it often pays to support it. It means that if hardware error correction would be exceedingly expensive, the system would have the option of using a less expensive version that provided detection capabilities only. Modern DRAM systems are a good example of a case in which powerful error correction can be provided at a very low additional cost. Relaxing the requirement that hardware errors be detected, however, would be much more difficult because it means that every software component would be burdened with the need to check its own correct execution. At one early point in its history, Google had to deal with servers that had DRAM lacking even parity checking. Producing a Web search index consists essentially of a very large shuffle/merge sort operation, using several machines over a long period. In 2000, one of the then monthly updates to Google’s Web index failed prerelease checks when a subset of tested queries was found to return seemingly random documents. After some investigation a pattern was found in the new index files that corresponded to a bit being stuck at zero at a consistent place in the data structures; a bad side effect of streaming a lot of data through a faulty DRAM chip. Consistency checks were added to the index data structures to minimize the likelihood of this problem recurring, and no further problems of this nature were reported. Note, however, that this workaround did not guarantee 100% error detection in the indexing pass because not all memory positions were being checked—instructions, for example, were not. It worked because index data structures were so much larger than all other data involved in the computation, that having those self-checking data structures made it very likely that machines with defective DRAM would be identified and excluded from the cluster. The following machine generation at Google did include memory parity detection, and once the price of memory with ECC dropped to competitive levels, all subsequent generations have used ECC DRAM.

    [return]

File crash consistency and filesystems are hard

$
0
0

I haven’t used a desktop email client in years. None of them could handle the volume of email I get without at least occasionally corrupting my mailbox. Pine, eudora, and outlook have all corrupted my inbox, forcing me to restore from backup. How is it that desktop mail clients are less reliable than gmail, even though my gmail account not only handles more email than I ever had on desktop clients, but also allows simultaneous access from multiple locations across the globe? Distributed systems have an unfair advantage, in that they can be robust against total disk failure in a way that desktop clients can’t, but none of the file corruption issues I’ve had have been from total disk failure. Why has my experience with desktop applications been so bad?

Well, what sort of failures can occur? Crash consistency (maintaining consistent state even if there’s a crash) is probably the easiest property to consider, since we can assume that everything, from the filesystem to the disk, works correctly; let’s consider that first.

Crash Consistency

Pillai et al. had a paper and presentation at OSDI ‘14 on exactly how hard it is to save data without corruption or data loss.

Let’s look at a simple example of what it takes to save data in a way that’s robust against a crash. Say we have a file that contains the text a foo and we want to update the file to contain a bar. The pwrite function looks like it’s designed for this exact thing. It takes a file descriptor, what we want to write, a length, and an offset. So we might try

pwrite([file], "bar", 3, 2)  // write 3 bytes at offset 2

What happens? If nothing goes wrong, the file will contain a bar, but if there’s a crash during the write, we could get a boo, a far, or any other combination. Note that you may want to consider this an example over sectors or blocks and not chars/bytes.

If we want atomicity (so we either end up with a foo or a bar but nothing in between) one standard technique is to make a copy of the data we’re about to change in an undo log file, modify the “real” file, and then delete the log file. If a crash happens, we can recover from the log. We might write something like

creat(/dir/log);                 // create the undo log
write(/dir/log, "2,3,foo", 7);   // log the offset, the length, and the old data
pwrite(/dir/orig, "bar", 3, 2);  // write 3 bytes at offset 2 of the "real" file
unlink(/dir/log);                // delete the log once the update is done

This should allow recovery from a crash without data corruption via the undo log, at least if we’re using ext3 and we made sure to mount our drive with data=journal. But we’re out of luck if, like most people, we’re using the default1 – with the default data=ordered, the write and pwrite syscalls can be reordered, causing the write to orig to happen before the write to the log, which defeats the purpose of having a log. We can fix that.

creat(/dir/log);
write(/dir/log, "2, 3, foo");
fsync(/dir/log);  // don't allow write to be reordered past pwrite
pwrite(/dir/orig, "bar", 3, 2);
fsync(/dir/orig);
unlink(/dir/log);

That should force things to occur in the correct order, at least if we’re using ext3 with data=journal or data=ordered. If we’re using data=writeback, a crash during the write or fsync to log can leave log in a state where the filesize has been adjusted for the log write, but the data hasn’t been written, which means that the log will contain random garbage. This is because with data=writeback, metadata is journaled, but data operations aren’t, which means that data operations (like writing data to a file) aren’t ordered with respect to metadata operations (like adjusting the size of a file for a write).

We can fix that by adding a checksum to the log file when creating it. If the contents of log don’t contain a valid checksum, then we’ll know that we ran into the situation described above.

creat(/dir/log);
write(/dir/log, "2, 3, [checksum], foo");  // add checksum to log file
fsync(/dir/log);
pwrite(/dir/orig, "bar", 3, 2);
fsync(/dir/orig);
unlink(/dir/log);

That’s safe, at least on current configurations of ext3. But it’s legal for a filesystem to end up in a state where the log is never created unless we issue an fsync to the parent directory.

creat(/dir/log);
write(/dir/log, "2, 3, [checksum], foo");
fsync(/dir/log);
fsync(/dir);  // fsync parent directory of log file
pwrite(/dir/orig, "bar", 3, 2);
fsync(/dir/orig);
unlink(/dir/log);

That should prevent corruption on any Linux filesystem, but if we want to make sure that the file actually contains “bar”, we need another fsync at the end.

creat(/dir/log);
write(/dir/log, "2, 3, [checksum], foo");
fsync(/dir/log);
fsync(/dir);
pwrite(/dir/orig, "bar", 3, 2);
fsync(/dir/orig);
unlink(/dir/log);
fsync(/dir);  // fsync parent directory again so the unlink is persisted

That results in consistent behavior and guarantees that our operation actually modifies the file after it’s completed, as long as we assume that fsync actually flushes to disk. OS X and some versions of ext3 have an fsync that doesn’t really flush to disk. OS X requires fcntl(F_FULLFSYNC) to flush to disk, and some versions of ext3 only flush to disk if the inode changed (which would only happen at most once a second on writes to the same file, since the inode mtime has one second granularity), as an optimization.
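
To pull the whole sequence together, here’s a minimal Python sketch of the final protocol above: an undo log with a checksum, plus fsyncs on the log, the original file, and the parent directory. The helper names and the log format are made up for illustration, and this ignores recovery, concurrent writers, and the fsync caveats just mentioned.

import os
import zlib

def fsync_path(path):
    # fsync a file or directory by path; the directory fsync is what makes
    # the creation/removal of the log's directory entry durable.
    fd = os.open(path, os.O_RDONLY)
    try:
        os.fsync(fd)
    finally:
        os.close(fd)

def pwrite_with_undo_log(orig, offset, new_bytes, log):
    logdir = os.path.dirname(log) or "."
    # Save the bytes we're about to overwrite so recovery can undo a torn write.
    with open(orig, "rb") as f:
        f.seek(offset)
        old = f.read(len(new_bytes))
    entry = b"%d,%d,%s" % (offset, len(old), old)
    entry += b",%d" % zlib.crc32(entry)    # checksum so a torn log write is detectable
    with open(log, "wb") as logf:
        logf.write(entry)
        logf.flush()
        os.fsync(logf.fileno())            # log contents durable before touching orig
    fsync_path(logdir)                     # log's directory entry durable
    fd = os.open(orig, os.O_WRONLY)
    try:
        os.pwrite(fd, new_bytes, offset)   # the actual update
        os.fsync(fd)                       # update durable before we delete the log
    finally:
        os.close(fd)
    os.unlink(log)
    fsync_path(logdir)                     # persist the unlink so recovery isn't re-run

Calling pwrite_with_undo_log("/dir/orig", 2, b"bar", "/dir/log") corresponds to the final pseudocode sequence; a real implementation would also need a recovery routine that checks the log’s checksum at startup and either undoes or discards the partial update, and on OS X you’d want fcntl with F_FULLFSYNC rather than a plain fsync.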

Even if we assume fsync issues a flush command to the disk, some disks ignore flush directives for the same reason fsync is gimped on OS X and some versions of ext3 – to look better in benchmarks. Handling that is beyond the scope of this post, but the Rajimwale et al. DSN ‘11 paper and related work cover that issue.

Filesystem semantics

When the authors examined ext2, ext3, ext4, btrfs, and xfs, they found that there are substantial differences in how code has to be written to preserve consistency. They wrote a tool that collects block-level filesystem traces, and used that to determine which properties don’t hold for specific filesystems. The authors are careful to note that they can only determine when properties don’t hold – if they don’t find a violation of a property, that’s not a guarantee that the property holds.

Different filesystems have very different properties

Xs indicate that a property is violated. The atomicity properties are basically what you’d expect, e.g., no X for single sector overwrite means that writing a single sector is atomic. The authors note that the atomicity of single sector overwrite sometimes comes from a property of the disks they’re using, and that running these filesystems on some disks won’t give you single sector atomicity. The ordering properties are also pretty much what you’d expect from their names, e.g., an X in the “Overwrite -> Any op” row means that an overwrite can be reordered with some operation.

After they created a tool to test filesystem properties, they then created a tool to check if any applications rely on any potentially incorrect filesystem properties. Because invariants are application specific, the authors wrote checkers for each application tested.

Everything is broken

The authors find issues with most of the applications tested, including things you’d really hope would work, like LevelDB, HDFS, Zookeeper, and git. In a talk, one of the authors noted that the developers of sqlite have a very deep understanding of these issues, but even that wasn’t enough to prevent all bugs. That speaker also noted that version control systems were particularly bad about this, and that the developers had a pretty lax attitude that made it very easy for the authors to find a lot of issues in their tools. The most common class of error was incorrectly assuming ordering between syscalls. The next most common class of error was assuming that syscalls were atomic2. These are fundamentally the same issues people run into when doing multithreaded programming. Correctly reasoning about re-ordering behavior and inserting barriers correctly is hard. But even though shared memory concurrency is considered a hard problem that requires great care, writing to files isn’t treated the same way, even though it’s actually harder in a number of ways.

Something to note here is that while btrfs’s semantics aren’t inherently less reliable than ext3/ext4, many more applications corrupt data on top of btrfs because developers aren’t used to coding against filesystems that allow directory operations to be reordered (ext2 is perhaps the most recent widely used filesystem that allowed that reordering). We’ll probably see a similar level of bug exposure when people start using NVRAM drives that have byte-level atomicity. People almost always just run some tests to see if things work, rather than making sure they’re coding against what’s legal in a POSIX filesystem.

Hardware memory ordering semantics are usually well documented in a way that makes it simple to determine precisely which operations can be reordered with which other operations, and which operations are atomic. By contrast, here’s the ext manpage on its three data modes:

journal: All data is committed into the journal prior to being written into the main filesystem. ordered: This is the default mode. All data is forced directly out to the main file system prior to its metadata being committed to the journal. writeback: Data ordering is not preserved – data may be written into the main filesystem after its metadata has been committed to the journal. This is rumoured to be the highest-throughput option. It guarantees internal filesystem integrity, however it can allow old data to appear in files after a crash and journal recovery.

The manpage literally refers to rumor. This is the level of documentation we have. If we look back at our example where we had to add an fsync between the write(/dir/log, "2, 3, foo") and pwrite(/dir/orig, "bar", 3, 2) to prevent reordering, I don’t think the necessity of the fsync is obvious from the description in the manpage. If you look at the hardware memory ordering “manpage” above, it specifically defines the ordering semantics, and it certainly doesn’t rely on rumor.

This isn’t to say that filesystem semantics aren’t documented anywhere. Between lwn and LKML, it’s possible to get a good picture of how things work. But digging through all of that is hard enough that it’s still quite common for there to be long, uncertain discussions on how things work. A lot of the information out there is wrong, and even when information was right at the time it was posted, it often goes out of date.

When digging through archives, I’ve often seen a post from 2005 cited to back up the claim that OS X fsync is the same as Linux fsync, and that OS X fcntl(F_FULLFSYNC) is even safer than anything available on Linux. Even at the time, I don’t think that was true for the 2.4 kernel, although it was true for the 2.6 kernel. But since 2008 or so Linux 2.6 with ext3 will do a full flush to disk for each fsync (if the disk supports it, and the filesystem hasn’t been specially configured with barriers off).

Another issue is that you often also see exchanges like this one:

Dev 1: Personally, I care about metadata consistency, and ext3 documentation suggests that journal protects its integrity. Except that it does not on broken storage devices, and you still need to run fsck there.
Dev 2: as the ext3 authors have stated many times over the years, you still need to run fsck periodicly anyway.
Dev 1: Where is that documented?
Dev 2: linux-kernel mailing list archives.
Dev 3: Probably from some 6-8 years ago, in e-mail postings that I made.

Where’s this documented? Oh, in some mailing list post from 6-8 years ago (which makes it 12-14 years ago as of today). I don’t mean to pick on filesystem devs. The fs devs whose posts I’ve read are quite polite compared to LKML’s reputation; they generously spend a lot of their time responding to basic questions and I’m impressed by how patient the expert fs devs are with askers, but it’s hard for outsiders to trawl through a decade and a half of mailing list postings to figure out which ones are still valid and which ones have been obsoleted!

In their OSDI 2014 talk, the authors of the paper we’re discussing noted that when they reported bugs they’d found, developers would often respond “POSIX doesn’t let filesystems do that”, without being able to point to any specific POSIX documentation to support their statement. If you’ve followed Kyle Kingsbury’s Jepsen work, this may sound familiar, except that devs respond with “filesystems don’t do that” instead of “networks don’t do that”. I think this is understandable, given how much misinformation is out there. Not being a filesystem dev myself, I’d be a bit surprised if this post doesn’t have at least one bug in it.

Filesystem correctness

We’ve already encountered a lot of complexity in saving data correctly, and this only scratches the surface of what’s involved. So far, we’ve assumed that the disk works properly, or at least that the filesystem is able to detect when the disk has an error via SMART or some other kind of monitoring. I’d always figured that was the case until I started looking into it, but that assumption turns out to be completely wrong.

The Prabhakaran et al. SOSP 05 paper examined how filesystems respond to disk errors in some detail. They created a fault injection layer that allowed them to inject disk faults and then ran things like chdir, chroot, stat, open, write, etc. to see what would happen.

Between ext3, reiserfs, and NTFS, reiserfs is the best at handling errors and it seems to be the only filesystem where errors were treated as first class citizens during design. It’s mostly consistent about propagating errors to the user on reads, and calling panic on write failures, which triggers a restart and recovery. This general policy allows the filesystem to gracefully handle read failure and avoid data corruption on write failures. However, the authors found a number of inconsistencies and bugs. For example, reiserfs doesn’t correctly handle read errors on indirect blocks and leaks space, and a specific type of write failure doesn’t prevent reiserfs from updating the journal and committing the transaction, which can result in data corruption.

Reiserfs is the good case. The authors found that ext3 ignored write failures in most cases, and rendered the filesystem read-only in most cases for read failures. This seems like pretty much the opposite of the policy you’d want. Ignoring write failures can easily result in data corruption, and remounting the filesystem as read-only is a drastic overreaction if the read error was a transient error (transient errors are common). Additionally, ext3 did the least consistency checking of the three filesystems and was the most likely to not detect an error. In one presentation, one of the authors remarked that the ext3 code had lots of comments like “I really hope a write error doesn’t happen here” in places where errors weren’t handled.

NTFS is somewhere in between. The authors found that it has many consistency checks built in, and is pretty good about propagating errors to the user. However, like ext3, it ignores write failures.

The paper has much more detail on the exact failure modes, but the details are mostly of historical interest as many of the bugs have been fixed.

It would be really great to see an updated version of the paper, and in one presentation someone in the audience asked if there was more up to date information. The presenter replied that they’d be interested in knowing what things look like now, but that it’s hard to do that kind of work in academia because grad students don’t want to repeat work that’s been done before, which is pretty reasonable given the incentives they face. Doing replications is a lot of work, often nearly as much work as the original paper, and replications usually give little to no academic credit. This is one of the many cases where the incentives align very poorly with producing real world impact.

The Gunawi et al. FAST 08 paper is another one that it would be great to see replicated today. That paper follows up on the paper we just looked at, and examines the error handling code in different file systems, using a simple static analysis tool to find cases where errors are being thrown away. Being thrown away is defined very loosely in the paper — code like the following

if (error) {
    printk("I have no idea how to handle this error\n");
}

is considered not throwing away the error. Errors are considered to be ignored if the execution flow of the program doesn’t depend on the error code returned from a function that returns an error code.

With that tool, they find that most filesystems drop a lot of error codes:


Rank   By % broken: FS (Frac.)      By Viol/Kloc: FS (Viol/Kloc)
1      IBM JFS (24.4)               ext3 (7.2)
2      ext3 (22.1)                  IBM JFS (5.6)
3      JFFS v2 (15.7)               NFS Client (3.6)
4      NFS Client (12.9)            VFS (2.9)
5      CIFS (12.7)                  JFFS v2 (2.2)
6      MemMgmt (11.4)               CIFS (2.1)
7      ReiserFS (10.5)              MemMgmt (2.0)
8      VFS (8.4)                    ReiserFS (1.8)
9      NTFS (8.1)                   XFS (1.4)
10     XFS (6.9)                    NFS Server (1.2)


Comments they found next to ignored errors include: “Should we pass any errors back?”, “Error, skip block and hope for the best.”, “There’s no way of reporting error returned from ext3_mark_inode_dirty() to userspace. So ignore it.“, “Note: todo: log error handler.“, “We can’t do anything about an error here.”, “Just ignore errors at this point. There is nothing we can do except to try to keep going.”, “Retval ignored?”, and “Todo: handle failure.”

One thing to note is that in a lot of cases, ignoring an error is more of a symptom of an architectural issue than a bug per se (e.g., ext3 ignored write errors during checkpointing because it didn’t have any kind of recovery mechanism). But even so, the authors of the papers found many real bugs.

Error recovery

Every widely used filesystem has bugs that will cause problems on error conditions, which brings up two questions: can recovery tools robustly fix errors, and how often do errors occur? On the first question, the Gunawi et al. OSDI 08 paper finds that fsck, a standard utility for checking and repairing file systems, “checks and repairs certain pointers in an incorrect order … the file system can even be unmountable after”.

At this point, we know that it’s quite hard to write files in a way that ensures their robustness even when the underlying filesystem is correct, the underlying filesystem will have bugs, and that attempting to repair corruption to the filesystem may damage it further or destroy it. How often do errors happen?

Error frequency

The Bairavasundaram et al. SIGMETRICS ‘07 paper found that, depending on the exact model, between 5% and 20% of disks would have at least one error over a two year period. Interestingly, many of these were isolated errors – 38% of disks with errors had only a single error, and 80% had fewer than 50 errors. A follow-up study looked at corruption and found that silent data corruption that was only detected by checksumming happened on .5% of disks per year, with one extremely bad model showing corruption on 4% of disks in a year.

It’s also worth noting that they found very high locality in error rates between disks on some models of disk. For example, there was one model of disk that had a very high error rate in one specific sector, making many forms of RAID nearly useless for redundancy.

That’s another study it would be nice to see replicated. Most studies on disks focus on the failure rate of the entire disk, but if what you’re worried about is data corruption, errors in non-failed disks are more worrying than disk failure, which is easy to detect and mitigate.

Conclusion

Files are hard. Butler Lampson has remarked that when they came up with threads, locks, and condition variables at PARC, they thought that they were creating a programming model that anyone could use, but that there’s now decades of evidence that they were wrong. We’ve accumulated a lot of evidence that humans are very bad at reasoning about these kinds of problems, which are very similar to the problems you have when writing correct code to interact with current filesystems. Lampson suggests that the best known general purpose solution is to package up all of your parallelism into as small a box as possible and then have a wizard write the code in the box. Translated to filesystems, that’s equivalent to saying that as an application developer, writing to files safely is hard enough that it should be done via some kind of library and/or database, not by directly making syscalls.

Sqlite is quite good in terms of reliability if you want a good default. However, some people find it to be too heavyweight if all they want is a file-based abstraction. What they really want is a sort of polyfill for the file abstraction that works on top of all filesystems without having to understand the differences between different configurations (and even different versions) of each filesystem. Since that doesn’t exist yet, when no existing library is sufficient, you need to checksum your data, since you will get silent errors and corruption. The only questions are whether you detect the errors, and whether your record format limits the damage to a single record when corruption happens or lets it destroy the entire database. As far as I can tell, most desktop email client developers have chosen to go the route of destroying all of your email if corruption happens.
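
As a sketch of what “only destroys a single record” can look like in practice, here’s a toy length-plus-CRC record format in Python (the format and names are invented for illustration; a real format would also need to handle resynchronizing after a bad record, compaction, and so on):

import struct
import zlib

# Toy log format: each record is a 4-byte length, a 4-byte CRC32 of the
# payload, then the payload itself. Files must be opened in binary mode.
def append_record(f, payload):
    f.write(struct.pack("<II", len(payload), zlib.crc32(payload)))
    f.write(payload)

def read_records(f):
    while True:
        header = f.read(8)
        if len(header) < 8:
            return                     # clean end of file (or truncated header)
        length, crc = struct.unpack("<II", header)
        payload = f.read(length)
        if len(payload) < length or zlib.crc32(payload) != crc:
            return                     # corrupt or torn record: stop, but keep everything before it
        yield payload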

These studies also hammer home the point that conventional testing isn’t sufficient. There were multiple cases where the authors of a paper wrote a relatively simple tool and found a huge number of bugs. You don’t need any deep computer science magic to write the tools. The error propagation checker from the paper that found a ton of bugs in filesystem error handling was 4k LOC. If you read the paper, you’ll see that the authors observed that the tool had a very large number of shortcomings because of its simplicity, but despite those shortcomings, it was able to find a lot of real bugs. I wrote a vaguely similar tool at my last job to enforce some invariants, and it was literally two pages of code. It didn’t even have a real parser (it just went line-by-line through files and did some regexp matching to detect the simple errors that it’s possible to detect with just a state machine and regexes), but it found enough bugs that it paid for itself in development time the first time I ran it.
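
To give a sense of how little code that kind of checker can be, here’s a toy Python version (the checked function names, the regex, and the “invariant” are all made up for illustration; my actual tool and the one from the paper checked different things):

import re
import sys

# Toy line-by-line checker: flag calls to functions that return an error
# code when the call appears as a bare statement, i.e. the return value
# looks like it's being discarded.
CHECKED_FUNCS = ("fsync", "pwrite", "unlink")
CALL_AS_STATEMENT = re.compile(r"^\s*(%s)\s*\(" % "|".join(CHECKED_FUNCS))

def check_file(path):
    problems = []
    with open(path) as f:
        for lineno, line in enumerate(f, start=1):
            if CALL_AS_STATEMENT.match(line):
                problems.append((path, lineno, line.strip()))
    return problems

if __name__ == "__main__":
    for path in sys.argv[1:]:
        for p, lineno, line in check_file(path):
            print("%s:%d: return value ignored? %s" % (p, lineno, line))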

Almost every software project I’ve seen has a lot of low hanging testing fruit. Really basic random testing, static analysis, and fault injection can pay for themselves in terms of dev time pretty much the first time you use them.

Appendix

I’ve probably covered less than 20% of the material in the papers I’ve referred to here. Here’s a bit of info about some other neat info you can find in those papers, and others.

Pillai et al., OSDI ‘14: this paper goes into much more detail about what’s required for crash consistency than this post does. It also gives a fair amount of detail about how exactly applications fail, including diagrams of traces that indicate which false assumptions are embedded in each trace.

Chidambaram et al., FAST ‘12: the same filesystem primitives are responsible for both consistency and ordering. The authors propose alternative primitives that separate these concerns, allowing better performance while maintaining safety.

Rajimwale et al. DSN ‘11: you probably shouldn’t use disks that ignore flush directives, but in case you do, here’s a protocol that forces those disks to flush using normal filesystem operations. As you might expect, the performance of this is quite bad.

Prabhakaran et al. SOSP ‘05: This has a lot more detail on filesystem responses to error than was covered in this post. The authors also discuss JFS, an IBM filesystem for AIX. Although it was designed for high reliability systems, it isn’t particularly more reliable than the alternatives. Related material is covered further in DSN ‘08, StorageSS ‘06, DSN ‘06, FAST ‘08, and USENIX ‘09, among others.

Gunawi et al. FAST ‘08: Again, much more detail than is covered in this post on when errors get dropped, and how they wrote their tools. They also have some call graphs that give you one rough measure of the complexity involved in a filesystem. The XFS call graph is particularly messy, and one of the authors noted in a presentation that an XFS developer said that XFS was fun to work on since they took advantage of every possible optimization opportunity, regardless of how messy it made things.

Bairavasundaram et al. SIGMETRICS ‘07: There’s a lot of information on disk error locality and disk error probability over time that isn’t covered in this post. A followup paper in FAST ‘08 has more details.

Gunawi et al. OSDI ‘08: This paper has a lot more detail about when fsck doesn’t work. In a presentation, one of the authors mentioned that fsck is the only program that’s ever insulted him. Apparently, if you have a corrupt pointer that points to a superblock, fsck destroys the superblock (possibly rendering the disk unmountable), tells you something like “you dummy, you must have run fsck on a mounted disk”, and then gives up. In the paper, the authors reimplement basically all of fsck using a declarative model, and find that the declarative version is shorter, easier to understand, and much easier to extend, at the cost of being somewhat slower.

Memory errors are beyond the scope of this post, but memory corruption can cause disk corruption. This is especially annoying because memory corruption can cause you to take a checksum of bad data and write a bad checksum. It’s also possible to corrupt in-memory pointers, which often results in something very bad happening. See the Zhang et al. FAST ‘10 paper for more on how ZFS is affected by that. There’s a meme going around that ZFS is safe against memory corruption because it checksums, but that paper found that critical things held in memory aren’t checksummed, and that memory errors can cause data corruption in real scenarios.

The sqlite devs are serious about both documentation and testing. If I wanted to write a reliable desktop application, I’d start by reading the sqlite docs and then talking to some of the core devs. If I wanted to write a reliable distributed application I’d start by getting a job at Google and then reading the design docs and postmortems for GFS, Colossus, Spanner, etc. J/k, but not really.

We haven’t looked at formal methods at all, but there have been a variety of attempts to formally verify properties of filesystems, such as SibylFS.

This list isn’t intended to be exhaustive. It’s just a list of things I’ve read that I think are interesting.

Update: many people have read this post and suggested that, in the first file example, you should use the much simpler protocol of copying the file you want to modify to a temp file, modifying the temp file, and then renaming the temp file to overwrite the original file. In fact, that’s probably the most common comment I’ve gotten on this post. If you think this solves the problem, I’m going to ask you to pause for five seconds and consider the problems this might have. First, you still need to fsync in multiple places. Second, you will get very poor performance with large files. People have also suggested using many small files to work around that problem, but that will also give you very poor performance unless you do something fairly exotic. Third, if there’s a hardlink, you’ve now made the problem of crash consistency much more complicated than in the original example. Fourth, you’ll lose file metadata, sometimes in ways that can’t be fixed up after the fact. That problem can, on some filesystems, be worked around with ioctls, but that only sometimes fixes the issue and now you’ve got fs-specific code to preserve correctness even in the non-crash case. And that’s just the beginning. The fact that so many people thought that this was a simple solution to the problem demonstrates that this problem is one that people are prone to underestimating, even when they’re explicitly warned that people tend to underestimate this problem!
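
For comparison, here’s roughly what the rename-based approach looks like once you add the fsyncs it still needs. This is a Python sketch with illustrative names; it doesn’t address the performance, hardlink, or metadata problems above, and a real version would use a unique temp file name:

import os

def replace_file_contents(path, new_bytes):
    dirname = os.path.dirname(path) or "."
    tmp = path + ".tmp"                  # illustrative; collides if two writers race
    fd = os.open(tmp, os.O_WRONLY | os.O_CREAT | os.O_TRUNC, 0o644)
    try:
        os.write(fd, new_bytes)
        os.fsync(fd)                     # new contents durable before the rename can expose them
    finally:
        os.close(fd)
    os.rename(tmp, path)                 # atomically swap in the new file
    dfd = os.open(dirname, os.O_RDONLY)
    try:
        os.fsync(dfd)                    # persist the rename itself
    finally:
        os.close(dfd)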

If you liked this, you’ll probably enjoy this post on cpu bugs.

Thanks to Leah Hanson, Katerina Barone-Adesi, Jamie Brandon, Kamal Marhubi, Joe Wilder, David Turner, Benjamin Gilbert, Tom Murphy, Chris Ball, Joe Doliner, Alexy Romanov, Mindy Preston, Paul McJones, and Evan Jones for comments/discussion.


  1. Turns out some commercially supported distros only support data=ordered. Oh, and when I said data=ordered was the default, that’s only the case pre-2.6.30. After 2.6.30, there’s a config option, CONFIG_EXT3_DEFAULTS_TO_ORDERED. If that’s not set, the default becomes data=writeback. [return]
  2. Cases where overwrite atomicity is required were documented as known issues, and all such cases assumed single-block atomicity and not multi-block atomicity. By contrast, multiple applications (LevelDB, Mercurial, and HSQLDB) had bad data corruption bugs that came from assuming appends are atomic.

    That seems to be an indirect result of a commonly used update protocol, where modifications are logged via appends, and then logged data is written via overwrites. Application developers are careful to check for and handle errors in the actual data, but the errors in the log file are often overlooked.

    There are a number of other classes of errors discussed, and I recommend reading the paper for the details if you work on an application that writes files.

    [return]

Big company vs. startup work and pay

There’s a meme that’s been going around for a while now: you should join a startup because the money is better and the work is more technically interesting. Paul Graham says that the best way to make money is to “start or join a startup”, which has been “a reliable way to get rich for hundreds of years”, and that you can “compress a career’s worth of earnings into a few years”. Michael Arrington says that you’ll become a part of history. Joel Spolsky says that by joining a big company, you’ll end up playing foosball and begging people to look at your code. Sam Altman says that if you join Microsoft, you won’t build interesting things and may not work with smart people. They all claim that you’ll learn more and have better options if you go work at a startup. Some of these links are a decade old now, but the same ideas are still circulating and those specific essays are still cited today.

Let’s look at these points one by one.

  1. You’ll earn much more money at a startup
  2. You won’t do interesting work at a big company
  3. You’ll learn more at a startup and have better options afterwards

1. Earnings

The numbers will vary depending on circumstances, but we can do a back of the envelope calculation and adjust for circumstances afterwards. Median income in the U.S. is about $30k/yr. The somewhat bogus zeroth order lifetime earnings approximation I’ll use is $30k * 40 = $1.2M. A new grad at Google/FB/Amazon with a lowball offer will have a total comp (salary + bonus + equity) of $130k/yr. According to glassdoor’s current numbers, someone who makes it to T5/senior at Google should have a total comp of around $250k/yr. These are fairly conservative numbers1.

Someone who’s not particularly successful, but not particularly unsuccessful, will probably make senior in five years2. For our conservative baseline, let’s assume that we’ll never make it past senior, into the pay grades where compensation really skyrockets. We’d expect earnings (total comp including stock, but not benefits) to look something like:

Year   Total comp   Cumulative
0      130k         130k
1      160k         290k
2      190k         480k
3      220k         700k
4      250k         950k
5      250k         1.2M
9      250k         2.2M
39     250k         9.7M

Looks like it takes six years to gross a U.S. career’s worth of income. If you want to adjust for the increased tax burden from earning a lot in a few years, add an extra year. Maybe add one to two more years if you decide to live in the bay or in NYC. If you decide not to retire, lifetime earnings for a 40 year career comes in at almost $10M.
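
If you want to adjust these numbers for your own situation, the cumulative column is just a running sum of the yearly figures. A quick sketch, using the same assumed ramp as the table:

# Total comp in $k/yr: ramp over the first five years, then flat at senior.
comp_by_year = [130, 160, 190, 220, 250]

def cumulative(years):
    total = 0
    for y in range(years + 1):
        total += comp_by_year[y] if y < len(comp_by_year) else comp_by_year[-1]
    return total

print(cumulative(5), cumulative(9), cumulative(39))   # 1200, 2200, 9700 ($k)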

One common, but false, objection to this is that your earnings will get eaten up by the cost of living in the bay area. Not only is this wrong, it’s actually the opposite of correct. You can work at these companies from outside the bay area: most of them will pay you maybe 10% less if you work out of a satellite office of a trendy company headquartered in SV or Seattle, in a location where the cost of living is around the U.S. median (at least if you work in the US – pay outside of the US is often much lower for reasons that don’t really make sense to me). Market rate at smaller companies in these areas tends to be very low. When I interviewed in places like Portland and Madison, there was a 3x-5x difference between what most small companies were offering and what I could get at a big company in the same city. In places like Austin, where the market is a bit thicker, it was a 2x-3x difference. The difference in pay at 90%-ile companies is greater, not smaller, outside of the SF bay area.

Another objection is that most programmers at most companies don’t make this kind of money. If, three or four years ago, you’d told me that there’s a career track where it’s totally normal to make $250k/yr after a few years, doing work that was fundamentally pretty similar to the work I was doing then, I’m not sure I would have believed it. No one I knew made that kind of money, except maybe the CEO of the company I was working at. Well him, and folks who went into medicine or finance.

The only difference between then and now is that I took a job at a big company. When I took that job, the common story I heard at orientation was basically “I never thought I’d be able to get a job at Google, but a recruiter emailed me and I figured I might as well respond”. For some reason, women were especially likely to have that belief. Anyway, I’ve told that anecdote to multiple people who didn’t think they could get a job at some trendy large company, who then ended up applying and getting in. And what you’ll realize if you end up at a place like Google is that most of the people there are just normal programmers like you and me. If anything, I’d say that Google is, on average, less selective than the startup I worked at. When you only have to hire 100 people total, and half of them are folks you worked with as a technical fellow at one big company and then as an SVP at another one, you can afford to hire very slowly and be extremely selective. Big companies will hire more than 100 people per week, which means they can only be so selective.

Despite the hype about how hard it is to get a job at Google/FB/wherever, your odds aren’t that bad, and they’re certainly better than your odds striking it rich at a startup, for which Patrick McKenzie has a handy cheatsheet:

Roll d100. (Not the right kind of geek? Sorry. rand(100) then.)
0~70: Your equity grant is worth nothing.
71~94: Your equity grant is worth a lump sum of money which makes you about as much money as you gave up working for the startup, instead of working for a megacorp at a higher salary with better benefits.
95~99: Your equity grant is a life changing amount of money. You won’t feel rich — you’re not the richest person you know, because many of the people you spent the last several years with are now richer than you by definition — but your family will never again give you grief for not having gone into $FAVORED_FIELD like a proper $YOUR_INGROUP.
100: You worked at the next Google, and are rich beyond the dreams of avarice. Congratulations.
Perceptive readers will note that 100 does not actually show up on a d100 or rand(100).

For a more serious take that gives approximately the same results, 80000 hours finds that the average value of a YC founder after 5-9 years is $18M. That sounds great! But there are a few things to keep in mind here. First, YC companies are unusually successful compared to the average startup. Second, in their analysis, 80000 hours notes that 80% of the money belongs to 0.5% of companies. Another 22% are worth enough that founder equity beats working for a big company, but that leaves 77.5% where that’s not true.

If you’re an employee and not a founder, the numbers look a lot worse. If you’re a very early employee you’d be quite lucky to get 1/10th as much equity as a founder. If we guess that 30% of YC startups fail before hiring their first employee, that puts the mean equity offering at $1.8M / .7 = $2.6M. That’s low enough that for 5-9 years of work, you really need to be in the 0.5% for the payoff to be substantially better than working at a big company unless the startup is paying a very generous salary.
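
To spell out that arithmetic (a rough sketch; the 1/10th equity ratio and the 30% fail-before-first-hire rate are the guesses from the paragraph above):

# Rough expected value of very early employee equity at a YC startup.
founder_value_after_5_to_9_years = 18e6   # 80000 hours' average founder value
employee_equity_ratio = 1.0 / 10          # lucky very early employee vs. founder
p_startup_hires_anyone = 0.7              # guess: 30% fail before the first hire
employee_ev = founder_value_after_5_to_9_years * employee_equity_ratio / p_startup_hires_anyone
print(employee_ev)                        # ~2.6e6, i.e. ~$2.6M over 5-9 years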

There’s a sense in which these numbers are too optimistic. Even if the company is successful and has a solid exit, there are plenty of things that can make your equity grant worthless. It’s hard to get statistics on this, but anecdotally, this seems to be the common case in acquisitions.

Moreover, the pitch that you’ll only need to work for four years is usually untrue. To keep your lottery ticket until it pays out (or fizzles out), you’ll probably have to stay longer. The most common form of equity at early stage startups is ISOs, which, by definition, expire at most 90 days after you leave. If you get in early and leave after four years, you’ll have to exercise your options if you want a chance at the lottery ticket paying off. If the company hasn’t yet landed a large valuation, you might be able to get away with paying O(median US annual income) to exercise your options. If the company looks like a rocketship and VCs are piling in, you’ll have a massive tax bill, too, all for a lottery ticket.

For example, say you joined company X early on and got options for 1% of the company when it was valued at $1M, so the cost of exercising all of your options is only $10k. Maybe you got lucky and four years later, the company is valued at $1B and your options have only been diluted to .5%. Great! For only $10k you can exercise your options and then sell the equity you get for $5M. Except that the company hasn’t IPO’d yet, so if you exercise your options, you’re stuck with a tax bill from making $5M, and by the time the company actually has an IPO, your stock could be worth anywhere from $0 to $LOTS. In some cases, you can sell your non-liquid equity for some fraction of its “value”, but my understanding is that it’s getting more common for companies to add clauses that limit your ability to sell your equity before the company has an IPO. And even when your contract doesn’t have a clause that prohibits you from selling your options on a secondary market, companies sometimes use backchannel communications to keep you from being able to sell your options.

Of course not every company is like this – I hear that Dropbox has generously offered to buy out people’s options at their current valuation for multiple years running and they now hand out RSUs instead of options, and Pinterest now gives people seven years to exercise their options after they leave – but stories like that are uncommon enough that they’re notable. The result is that people are incentivized to stay at most startups, even if they don’t like the work anymore. From chatting with my friends at well regarded highly-valued startups, it sounds like many of them have a substantial fraction of zombie employees who are just mailing it in and waiting for a liquidity event. A common criticism of large companies is that they’ve got a lot of lifers who are mailing it in, but most large companies will let you leave any time after the first year and walk away with a pro-rated fraction of your equity package3. It’s startups where people are incentivized to stick around even if they don’t care about the job.

At a big company, we have a career’s worth of income in six years with high probability once you get your foot in the door. This isn’t quite as good as the claim that you’ll be able to do that in three or four years at a startup, but the risk at a big company is very low once you land the job. In startup land, we have a lottery ticket that appears to have something like a 0.5% chance of paying off for very early employees. Startups might have had a substantially better expected value when Paul wrote about this in 2004, but big company compensation has increased much faster than compensation at the median startup. We’re currently in the best job market the world has ever seen for programmers. That’s likely to change at some point. The relative returns on going the startup route will probably look a lot better once things change, but for now, saving up some cash while big companies hand it out like candy doesn’t seem like a bad idea.

2. Interesting work

We’ve established that big companies will pay you decently. But there’s more to life than making money. After all, you spend 40+ hours a week working. How interesting is the work at big companies? Joel claimed that large companies don’t solve interesting problems and that Google is paying untenable salaries to kids with more ultimate frisbee experience than Python, whose main job will be to play foosball in the googleplex; Sam Altman said something similar (but much more measured) about Microsoft; every third Michael O. Church comment is about how Google tricks a huge number of overqualified programmers into taking jobs that no one wants; and basically every advice thread on HN or reddit aimed at new grads will have multiple people chime in on how the experience you get at startups is better than the experience you’ll get slaving away at a big company.

The claim that big companies have boring work is too broad and absolute to even possibly be true. It depends on what kind of work you want to do. When I look at conferences where I find a high percentage of the papers compelling, the stuff I find to be the most interesting is pretty evenly split between big companies and academia, with the (very) occasional paper by a startup. For example, looking at ISCA this year, there’s a 2:1 ratio of papers from academia to industry (and all of the industry papers are from big companies). But looking at the actual papers, a significant fraction of the academic papers are reproducing unpublished work that was done at big companies but not published, sometimes multiple years ago. If I only look at the new work that I’m personally interested in, it’s about a 1:1 ratio. There are some cases where a startup is working in the same area and not publishing, but that’s quite rare and large companies do much more research that they don’t publish. I’m just using papers as a proxy for having the kind of work I like. There are also plenty of areas where publishing isn’t the norm, but large companies do the bulk of the cutting edge work.

Of course YMMV here depending on what you want to do. I’m not really familiar with the landscape of front-end work, but it seems to me that big companies don’t do the vast majority of the cutting edge non-academic work, the way they do with large scale systems. IIRC, there’s an HN comment where Jonathan Tang describes how he created his own front-end work: he had the idea, told his manager about it, and got approval to make it happen. It’s possible to do that kind of thing at a large company, but people often seem to have an easier time pursuing that kind of idea at a small company. And if your interest is in product, small companies seem like the better bet (though, once again, I’m pretty far removed from that area, so my knowledge is secondhand).

But if you’re interested in large systems, at both of my last two jobs, I’ve seen speculative research projects with 9-figure pilot budgets approved. In the pitch for one of these projects, the argument wasn’t even that the project would make the company money. It was that a specific research area was important to the company, and that this infrastructure project would enable the company to move faster in that research area. Since the company makes $X billion a year, the project only needed to move the needle by a small percentage to be worth it. And so a research project whose goal was to speed up the progress of another research project was approved. Startups simply don’t have the resources to throw that much money at research problems that aren’t core to their business. And many problems that would be hard problems at startups are curiosities at large companies. Work at Google and have a question that requires running a query that takes 10k machines? No problem! But that’s basically impossible to do at a startup, not even considering the fact that you can run the query across data startups can’t possibly get.

The flip side of this is that there are experiments that startups have a very easy time doing that established companies can’t do. When I was at EC a number of years ago, back when Facebook was still relatively young, the Google ad auction folks remarked to the FB folks that FB was doing the sort of experiments they’d do if they were small enough to do them, but they couldn’t just change the structure of their ad auctions now that there was so much money going through their pipeline. As with everything else we’re discussing, there’s a tradeoff here and the real question is how to weight the various parts of the tradeoff, not which side is better in all ways.

The Michael O. Church claim is somewhat weaker: big companies have cool stuff to work on, but you won’t be allowed to work on them until you’ve paid your dues working on boring problems. A milder phrasing of this is that getting to do interesting work is a matter of getting lucky and landing on an initial project you’re interested in, but the key thing here is that most companies can give you a pretty good estimate about how lucky you’re going to be. Google is notorious for its blind allocation process, and I know multiple people who ended up at MS because they had the choice between a great project at MS and blind allocation at Google, but even Google has changed this to some extent and it’s not uncommon to be given multiple team options with an offer. In that sense, big companies aren’t much different from startups. It’s true that there are some startups that will basically only have jobs that are interesting to you (e.g., FaunaDB if you’re interested in building a distributed database). But at any startup that’s bigger and less specialized, there’s going to be work you’re interested in and work you’re not interested in, and it’s going to be up to you to figure out if your offer lets you work on stuff you’re interested in.

Something to note is that if, per “1”, you have the leverage to negotiate a good compensation package, you also have the leverage to negotiate for work that you want to do. We’re in what is probably the best job market for programmers ever. That might change tomorrow, but until it changes, you have a lot of power to get work that you want.

If this sounds completely foreign to you and you don’t have that kind of leverage, I understand. That was me a few years ago. Taking a job at a trendy big company is one way to get that leverage. Companies really want you to make it to “senior engineer” (where total comp starts at $250k to $350k, depending on the company); hiring is very expensive for them and they’re heavily incentivized to mentor the people they hire until they’re valuable and productive. Some companies are better at this than others, but the average big company that people want to work for has a lot more resources devoted to helping people learn than almost any startup. The goal at most big companies is to get everyone to the senior level. Of course they’ll keep hiring, which means there will always be non-senior people, but the definition of senior engineer is basically someone who can independently find and solve problems and doesn’t require any handholding, i.e., someone who’s easy to scale horizontally. Google even has an (unevenly enforced) policy that people who don’t “eventually” get to senior should be managed out, and you’ll notice that they’re not known for having a high involuntary attrition rate. They, and most other big companies, take teaching seriously.

3. Learning / Experience

What about the claim that experience at startups is more valuable? We don’t have the data to do a rigorous quantitative comparison, but qualitatively, everything’s on fire at startups, and you get a lot of breadth putting out fires, but you don’t have the time to explore problems as deeply.

I spent the first seven years of my career at a startup and I loved it. It was total chaos, which gave me the ability to work on a wide variety of different things and take on more responsibility than I would have gotten at a bigger company. I did everything from add fault tolerance to an in-house distributed system to owning a quarter of a project that added ARM instructions to an x86 chip, creating both the fastest ARM chip at the time, as well as the only chip capable of switching between ARM and x86 on the fly4. That was a great learning experience.

But I’ve had great learning experiences at big companies, too. At Google, my “starter” project was to join a previously one-person project, read the half finished design doc, provide feedback, and then start implementing. The impetus for the project was that people were worried that a certain class of applications would require Google to double the number of machines it owns if a somewhat unlikely but not impossible scenario happened. That wasn’t too much different from my startup experience, except for that bit about actually having a design doc, and that cutting infra costs could save billions a year instead of millions a year.

The next difference was that, at some point, people way above my pay grade made the decision to get serious about the project, and a lot of high-powered people ended up getting brought in to work on the project or at least provide input, folks like Norm Jouppi, Geoff Hinton, and Jeff Dean.

Was that project a better or worse learning experience than the equivalent project at a startup? At a startup, the project probably would have continued to be a two-person show, and I would have learned all the things you learn when you bang out a project with not enough time and resources and do half the thing yourself. Instead, I ended up owning a fraction of the project and merely provided feedback on the rest, and it was merely a matter of luck (timing) that I had significant say on fleshing out the architecture. I definitely didn’t get the same level of understanding I would have if I implemented half of it myself. On the other hand, the larger team meant that we actually had time to do things like design reviews and code reviews, and I got feedback from people who have way more experience and knowledge than me. My experience at MS is similar – I only own maybe a quarter of the project I’m working on, and there’s an architect above me who’s extremely well regarded and probably has veto power on architectural decisions. But when I had a question the other day, I emailed a Turing award winner and got a response back within an hour. It’s almost impossible to have access to the same breadth and depth of expertise at a startup. As a result, there are things I’ve learned in an hour long design review that it would have taken me months or years to learn if I was implementing things myself.

If you care about impact, it’s also easier to have a large absolute impact at a large company, due to the scale that big companies operate at. If I implemented what I’m doing now for a company the size of the startup I used to work for, it would have had an impact of maybe $10k/month. That’s nothing to sneeze at, but it wouldn’t have covered my salary. But the same thing at a big company is worth well over 1000x that. There are simply more opportunities to have high impact at large companies because they operate at a larger scale. The corollary to this is that startups are small enough that it’s easier to have an impact on the company itself, even when the impact on the world is smaller in absolute terms. Nothing I do is make or break for a large company, but when I worked at a startup, it felt like what we did could change the odds of the company surviving.

As far as having better options after having worked for a big company or having worked for a startup, if you want to work at startups, you’ll probably have better options with experience at startups. If you want to work on the sorts of problems that are dominated by large companies, you’re better off with more experience in those areas, at large companies. There’s no right answer here.

Conclusion

The compensation tradeoff has changed a lot over time. When Paul Graham was writing in 2004, he used $80k/yr as a reasonable baseline for what “a good hacker” might make. Adjusting for inflation, that’s about $100k/yr now. But the total comp for “a good hacker” is $250k+/yr, not even counting perks like free food and having really solid insurance. The tradeoff has heavily tilted in favor of large companies.

The interesting work tradeoff has also changed a lot over time, but the change has been… bimodal. The existence of AWS and Azure means that ideas that would have taken millions of dollars in servers and operational expertise can be done with almost no fixed cost and low marginal costs. The scope of things you can do at an early-stage startup that were previously the domain of well funded companies is large and still growing. But at the same time, if you look at the work Google and MS are publishing at top systems conferences, startups are farther from being able to reproduce the scale-dependent work than ever before (and a lot of the most interesting work doesn’t get published). Depending on what sort of work you’re interested in, things might look relatively better or relatively worse at big companies.

In any case, the reality is that the difference between types of companies is smaller than the differences between companies of the same type. That’s true whether we’re talking about startups vs. big companies or mobile gaming vs. biotech. This is recursive. The differences between different managers and teams at a company can easily be larger than the differences between companies. If someone tells you that you should work for a certain type of company, that advice is guaranteed to be wrong much of the time, whether that’s a VC advocating that you should work for a startup or a Turing award winner telling you that you should work in a research lab.

As for me, well, I don’t know you and it doesn’t matter to me whether you end up at a big company, a startup, or something in between. Whatever you decide, I hope you get to know your manager well enough to know that they have your back, your team well enough to know that you like working with them, and your project well enough to know that you find it interesting. Personally, I’m a bit tired of the sort of nonsense you see at big companies after two stints at big companies5, and I might want to trade that for the sort of nonsense you see at startups next time I look for work, but that’s just me. You should figure out what the relevant tradeoffs are for you.

Jocelyn Goldfein on big companies vs. small companies.

Patrick McKenzie on providing business value vs. technical value, with a response from Yossi Kreinin.

Yossi Kreinin on passion vs. money, with a rebuttal to this post on regret minimization.

Update: The responses on this post have been quite divided. Folks at big companies usually agree, except that the numbers seem low to them, especially for new grads. This is true even for people living in places like Madison and Austin, which have a cost of living similar to the U.S. median. On the other hand, a lot of people vehemently maintain that the numbers in this post are basically impossible. A lot of people are really invested in the idea that they’re making about as much as possible. If you’ve decided that making less money is the right tradeoff for you, that’s fine and I don’t have any problem with that. But if you really think that you can’t make that much money and you don’t believe me, I recommend talking to one of the hundreds of thousands of engineers at one of the many large companies that pays well.

Thanks to Kelly Eskridge, Leah Hanson, Julia Evans, Alex Clemmer, Ben Kuhn, Malcolm Matalka, Nick Bergson-Shilcock, Joe Wilder, Nat Welch, Darius Bacon, Lindsey Kuper, Prabhakar Ragde, Pierre-Yves Baccou, David Turner, Oskar Thoren, Katerina Barone-Adesi, Scott Feeney, Ralph Corderoy, Ezekiel Benjamin Smithburg, and Kyle Littler for comments/corrections/discussion.


  1. In particular, the glassdoor numbers seem low for an average. I suspect that’s because their average is weighed down by older numbers, while compensation has skyrocketed the past seven years. The average numbers on glassdoor don’t even match the average numbers I heard from other people in my midwestern satellite office in a large town two years ago, and the market has gone up sharply since then. More recently, on the upper end, I know someone fresh out of school who has a total comp of almost $250k/yr ($350k equity over four years, a $50k signing bonus, plus a generous salary). As is normal, they got a number of offers with varying compensation levels, and then Facebook came in and bid him up. The companies that are serious about competing for people matched the offers, and that was that. This included bids in Seattle and Austin that matched the bids in SV. If you’re negotiating an offer, the thing that’s critical isn’t to be some kind of super genius. It’s enough to be pretty good, know what the market is paying, and have multiple offers. This person was worth every penny, which is why he got his offers, but I know several people who are just as good who make half as much just because they only got a single offer and had no leverage.

    Anyway, the point of this footnote is just that the total comp for experienced engineers can go way above the numbers mentioned in the post. In the analysis that follows, keep in mind that I’m using conservative numbers and that an aggressive estimate for experienced engineers would be much higher. Just for example, at Google, senior is level 5 out of 11 on a scale that effectively starts at 3. At Microsoft, it’s 63 out of a weirdo scale that starts at 59 and goes to 70-something and then jumps up to 80 (or something like that, I always forget the details because the scale is so silly). Senior isn’t a particularly high band, and people at senior often have total comp substantially greater than $250k/yr. Note that these numbers also don’t include the above market rate of stock growth at trendy large companies in the past few years. If you’ve actually taken this deal, your RSUs have likely appreciated substantially.

    [return]
  2. This depends on the company. It’s true at places like Facebook and Google, which make a serious effort to retain people. It’s nearly completely untrue at places like IBM, National Instruments (NI), and Epic Systems, which don’t even try. And it’s mostly untrue at places like Microsoft, which tries, but in the most backwards way possible.

    Microsoft (and other mid-tier companies) will give you an ok offer and match good offers from other companies. That by itself is already problematic since it incentivizes people who are interviewing at Microsoft to also interview elsewhere. But the worse issue is that they do the same when retaining employees. If you stay at Microsoft for a long time and aren’t one of the few people on the fast track to “partner”, your pay is going to end up severely below market, sometimes by as much as a factor of two. When you realize that, and you interview elsewhere, Microsoft will match external offers, but after getting underpaid for years, by hundreds of thousands or millions of dollars (depending on how long you’ve been there), the promise of making market rate for a single year and then being underpaid for the foreseeable future doesn’t seem very compelling. The incentive structure appears as if it were designed to cause people who are between average and outstanding to leave. I’ve seen this happen with multiple people and I know multiple others who are planning to leave for this exact reason. Their managers are always surprised when this happens, but they shouldn’t be; it’s eminently predictable.

    The IBM strategy actually makes a lot more sense to me than the Microsoft strategy. You can save a lot of money by paying people poorly. That makes sense. But why bother paying a lot to get people in the door and then incentivizing them to leave? While it’s true that the very top people I work with are well compensated and seem happy about it, there aren’t enough of those people that you can rely on them for everything.

    [return]
  3. Some are better about this than others. Older companies, like MS, sometimes have yearly vesting, but a lot of younger companies, like Google, have much smoother vesting schedules once you get past the first year. And then there’s Amazon, which backloads its offers, knowing that they have a high attrition rate and won’t have to pay out much. [return]
  4. Sadly, we ended up not releasing this for business reasons that came up later. [return]
  5. My very first interaction with an employee at big company X orientation was having that employee tell me that I couldn’t get into orientation because I wasn’t on the list. I had to ask how I could get on the list, and I was told that I’d need an email from my manager to get on the list. This was at around 7:30am because orientation starts at 7:30 and then runs for half a day for reasons no one seems to know (I’ve asked a lot of people, all the way up to VPs in HR). When I asked if I could just come back later in the day, I was told that if I couldn’t get in within an hour I’d have to come back next week. I also asked if the fact that I was listed in some system as having a specific manager was evidence that I was supposed to be at orientation and was told that I had to be on the list. So I emailed my manager, but of course he didn’t respond because who checks their email at 7:30am? Luckily, my manager had previously given me his number and told me to call if I ever needed anything, and being able to get into orientation and not have to show up at 7:30am again next week seemed like anything, so I gave him a call. Naturally, he asked to talk to the orientation gatekeeper; when I relayed that to the orientation guy, he told me that he couldn’t talk on the phone – you see, he can only accept emails and can’t talk on the phone, not even just to clarify something. Five minutes into orientation, I was already flabbergasted. But, really, I should have considered myself lucky – the other person who “wasn’t on the list” didn’t have his manager’s phone number, and as far as I know, he had to come back the next week at 7:30am to get into orientation. I asked the orientation person how often this happens, and he told me “very rarely, only once or twice per week”.

    That experience was repeated approximately every half hour for the duration of orientation. I didn’t get dropped from any other orientation stations, but when I asked, I found that every station had errors that dropped people regularly. My favorite was the station where someone was standing at the input queue, handing out a piece of paper. The piece of paper informed you that the machine at the station was going to give you an error with some instructions about what to do. Instead of following those instructions, you had to follow the instructions on the piece of paper when the error occurred.

    These kinds of experiences occupied basically my entire first week. Now that I’m past onboarding and onto the regular day-to-day, I have a surreal Kafka-esque experience a few times a week. And I’ve mostly figured out how to navigate the system (usually, knowing the right person and asking them to intervene solves the problem). What gets me about this isn’t the actual experience, but that most people I talk to who’ve been here a while think that it literally cannot be any other way and that things could not possibly be improved; new hires from younger companies almost always agree that the company is bizarrely screwed up in ways that are incomprehensible. Curiously, people who have been here just as long but who are very senior tend to agree that the company is quite messed up. I wish I had enough data on that to tell which way the causation runs. Something that’s even curiouser is that the company invests a fair amount of effort to give people the impression that things are as good as they could possibly be. At orientation, we got a very strange version of history that made it sound as if the company had pioneered everything from the GUI to the web, with multiple claims that we have the best X in the world, even when X is not best in class but in fact worst in class, so bad that X is a running joke internally. It’s not clear to me what the company gets out of making sure that most employees don’t understand what the downsides are in our own products and processes.

    Whatever the reason, the attitude that things couldn’t possibly be improved isn’t just limited to administrative issues. A friend of mine needed to find a function to do something that’s a trivial one-liner on Linux, but that’s considerably more involved on our OS. His first attempt was to use boost, but it turns out that the documentation for doing this on our OS is complicated enough that boost got this wrong and has had a bug in it for years. A couple of days and 72 lines of code later, he managed to figure out how to create a function to do this trivial-on-Linux thing. Since he wasn’t sure if he was missing something, he forwarded the code review to two very senior engineers (one level below Distinguished Engineer). They weren’t sure and forwarded it on to the CTO, who said that he didn’t see a simpler way to accomplish the same thing in our OS with the APIs as they currently are.

    Later, my friend had a heated discussion with someone on the OS team, who maintained that the documentation on how to do this was very clear, and that it couldn’t be clearer, nor could the API be any easier. This is despite this being so hard to do that boost has been wrong for seven years, and that two very senior engineers didn’t feel confident enough to review the code and passed it up to a CTO.

    Another curious thing is how easy it is to see from the outside that things don’t have to be this way. A while back, I did a round of interviews at other local companies, and they all explicitly disavowed absorbing corporate culture from the company I’m describing; they insisted they weren’t like company X across the street, which is all screwed up from having hired too many employees from this company.

    I’m going to stop here. I’ve been writing down big company stories and saving them, but a mere half a year of big company stories is longer than my blog. Not just longer than this post or any individual post, but longer than everything else on my blog combined, which is a bit over 100k words. Typical estimates for words per page vary between 250 and 1000, putting my rate of surreal experiences at somewhere between 100 and 400 pages every six months. I’m not sure this rate is inherently different from the rate you’d get at startups, but there’s a different flavor to the stories and you should have an idea of the flavor by this point.

    [return]

Normalization of deviance in software: how broken practices become standard

Have you ever mentioned something that seems totally normal to you only to be greeted by surprise? Happens to me all the time, when I describe something everyone at work thinks is normal. For some reason, my conversation partner’s face morphs from pleasant smile to rictus of horror. Here are a few representative examples.

There’s the company that is perhaps the nicest place I’ve ever worked, combining the best parts of Valve and Netflix. The people are amazing and you’re given near total freedom to do whatever you want. But as a side effect of the culture, they lose perhaps half of new hires in the first year, some voluntarily and some involuntarily. Totally normal, right?

There’s the company that’s incredibly secretive about infrastructure. For example, there’s the team that was afraid that, if they reported bugs to their hardware vendor, the bugs would get fixed and their competitors would be able to use the fixes. Solution: request the firmware and fix bugs themselves! More recently, I know a group of folks outside the company who tried to reproduce the algorithm in the paper the company published earlier this year. The group found that they couldn’t reproduce the result, and that the algorithm in the paper resulted in an unusual level of instability; when asked about this, one of the authors responded “well, we have some tweaks that didn’t make it into the paper” and declined to share the tweaks, i.e., the company purposely published an unreproducible result to avoid giving away the details, as is normal. This company enforces secrecy by having a strict policy of firing leakers. This is introduced at orientation with examples of people who got fired for leaking (e.g., the guy who leaked that a concert was going to happen inside a particular office), and by announcing firings for leaks at the company all hands. The result of those policies is that I know multiple people who are afraid to forward emails about things like insurance updates for fear of forwarding the wrong email and getting fired; instead, they use another computer to retype the email and pass it along, or take photos of the email on their phone. Normal.

There’s the office where I asked one day about the fact that I almost never saw two particular people in the same room together. I was told that they had a feud going back a decade, and that things had actually improved – for years, they literally couldn’t be in the same room because one of the two would get too angry and do something regrettable, but things had now cooled to the point where the two could, occasionally, be found in the same wing of the office or even the same room. These weren’t just random people, either. They were the two managers of the only two teams in the office. Normal!

There’s the company whose culture is so odd that, when I sat down to write a post about it, I found that I’d not only written more than for any other single post, but more than all other posts combined (which is well over 100k words now, the length of a moderate book). This is the same company where someone recently explained to me how great it is that, instead of using data to make decisions, we use political connections, and that the idea of making decisions based on data is a myth anyway; no one does that. This is also the company where all four of the things they told me to get me to join were false, and the job ended up being the one thing I specifically said I didn’t want to do. When I joined this company, my team didn’t use version control for months and it was a real fight to get everyone to use version control. Although I won that fight, I haven’t won the fight to get people to run a build, let alone run tests, before checking in, so the build is broken multiple times per day. When I mentioned that I thought this was a problem for our productivity, I was told that it’s fine because it affects everyone equally because that kind of breakage is totally normal.

There’s the company that created multiple massive initiatives to recruit more women into engineering roles, where women still get rejected in recruiter screens for not being technical enough after being asked questions like “was your experience with algorithms or just coding?”, as is normal. I thought that my referral with a very strong recommendation would have prevented that, but I forgot how normal the company was.

There’s the company where I worked on a four person effort with a multi-hundred million dollar budget and a billion dollar a year impact, where requests for things that cost hundreds of dollars routinely took months or were denied.

You might wonder if I’ve just worked at places that are unusually screwed up. Sure, the companies are generally considered to be ok places to work, and two of them are considered to be among the best places to work, but maybe I’ve just ended up at places that are overrated. But I have the same experience when I hear stories about how other companies work, even places with stellar engineering reputations, except that it’s me that’s shocked and my conversation partner who thinks their story is normal.

There’s the company that adopted “move fast and break nothing” as its motto, and continues to regularly break everything while writing blog posts about how careful they are about breaking things. I said “the company”, but if you tweak the exact wording of the motto this actually applies to many normal Bay Area startups.

There are the companies that use @flaky, which include the vast majority of Python-using SF Bay Area unicorns. If you don’t know what this is, it’s a library that lets you add a Python decorator to those annoying flaky tests that sometimes pass and sometimes fail. When I asked multiple co-workers and former co-workers from three different companies what they thought this did, they all guessed that it re-runs the test multiple times and reports a failure if any of the runs fail. Close, but not quite. It’s technically possible to use @flaky for that, but in practice it’s used to re-run the test multiple times and report a pass if any of the runs pass. The company that created @flaky is effectively a storage infrastructure company, and the library is widely used at its major competitor. Marking tests that expose potential bugs as passing is totally normal; after all, that’s what ext2/ext3/ext4 do with write errors.
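
For concreteness, here’s roughly what that looks like with the flaky library; the test body below is just a stand-in for a real test with a race condition, and with min_passes=1 a test that fails twice and passes once is reported as passing.

    # Typical @flaky usage (flaky ships as a pytest/nose plugin).
    import random

    from flaky import flaky

    @flaky(max_runs=3, min_passes=1)
    def test_eventually_consistent_read():
        # Stand-in for a test that only fails some of the time because of a real bug.
        assert random.random() > 0.3

My understanding is that setting min_passes equal to max_runs would give the behavior my co-workers guessed at, but that’s not how the library is typically used.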

There’s the company with a reputation for having great engineering practices that had 2 9s of reliability last time I checked, for reasons that are entirely predictable from their engineering practices. This is the second thing in a row that can’t be deanonymized because multiple companies find it to be normal. Here, I’m not talking about companies trying to be the next reddit or twitter where it’s, apparently, totally fine to have 1 9. I’m talking about companies that sell platforms that other companies rely on, where an outage will cause dependent companies to pause operations for the duration of the outage. Multiple companies that build infrastructure find practices that lead to 2 9s of reliability to be completely and totally normal.
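
For a sense of what those numbers mean, here’s the quick conversion from “number of nines” to the downtime budget per year:

    # Downtime budget per year for N nines of availability.
    HOURS_PER_YEAR = 24 * 365

    for nines in (1, 2, 3, 4):
        availability = 1 - 10 ** -nines           # 90%, 99%, 99.9%, 99.99%
        downtime_hours = HOURS_PER_YEAR * (1 - availability)
        print(f"{nines} nine(s): ~{downtime_hours:.1f} hours of downtime per year")
    # 1 nine:  ~876 hours (over a month)
    # 2 nines: ~88 hours (more than three days)
    # 3 nines: ~8.8 hours
    # 4 nines: ~53 minutes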

As far as I can tell, what happens at these companies is that they started by concentrating almost totally on product growth. That’s completely and totally reasonable, because companies are worth approximately zero when they’re founded; they don’t bother with things that protect them from losses, like good ops practices or actually having security, because there’s nothing to lose (well, except for user data when the inevitable security breach happens, and if you talk to security folks at unicorns you’ll know that these happen).

The result is a culture where people are hyper-focused on growth and ignore risk. That culture tends to stick even after the company has grown to be worth well over a billion dollars and has something to lose. Anyone who comes into one of these companies from Google, Amazon, or another place with solid ops practices is shocked. Often, they try to fix things, and then leave when they can’t make a dent.

Google probably has the best ops and security practices of any tech company today. It’s easy to say that you should take these things as seriously as Google does, but it’s instructive to see how they got there. If you look at the codebase, you’ll see that various services have names ending in z, as do a curiously large number of variables. I’m told that’s because, once upon a time, someone wanted to add monitoring. It wouldn’t really be secure to have google.com/somename expose monitoring data, so they added a z. google.com/somenamez. For security. At the company that is now the best in the world at security.

Google didn’t go from adding z to the end of names to having the world’s best security because someone gave a rousing speech or wrote a convincing essay. They did it after getting embarrassed a few times, which gave people who wanted to do things “right” the leverage to fix fundamental process issues. It’s the same story at almost every company I know of that has good practices. Microsoft was a joke in the security world for years, until multiple disastrously bad exploits forced them to get serious about security. That makes it sound simple, but if you talk to people who were there at the time, the change was brutal. Despite a mandate from the top, there was vicious political pushback from people whose position was that the company got to where it was in 2003 without wasting time on practices like security. Why change what’s worked?

You can see this kind of thing in every industry. A classic example that tech folks often bring up is hand-washing by doctors and nurses. It’s well known that germs exist, and that washing hands properly very strongly reduces the odds of transmitting germs and thereby significantly reduces hospital mortality rates. Despite that, trained doctors and nurses still often don’t do it. Interventions are required. Signs reminding people to wash their hands save lives. But when people stand at hand-washing stations to require others walking by to wash their hands, even more lives are saved. People can ignore signs, but they can’t ignore being forced to wash their hands.

This mirrors a number of attempts at tech companies to introduce better practices. If you tell people they should do it, that helps a bit. If you enforce better practices via code review, that helps a lot.

The data are clear that humans are really bad at taking the time to do things that are well understood to incontrovertibly reduce the risk of rare but catastrophic events. We will rationalize that taking shortcuts is the right, reasonable thing to do. There’s a term for this: the normalization of deviance. It’s well studied in a number of other contexts including healthcare, aviation, mechanical engineering, aerospace engineering, and civil engineering, but we don’t see it discussed in the context of software. In fact, I’ve never seen the term used in the context of software.

Is it possible to learn from others’ mistakes instead of making every mistake ourselves? The state of the industry makes this sound unlikely, but let’s give it a shot. John Banja has a nice summary paper on the normalization of deviance in healthcare, with lessons we can attempt to apply to software development. One thing to note is that, because Banja is concerned with patient outcomes, there’s a close analogy to devops failure modes, but normalization of deviance also occurs in cultural contexts that are less directly analogous.

The first section of the paper details a number of disasters, both in healthcare and elsewhere. Here’s one typical example:

A catastrophic negligence case that the author participated in as an expert witness involved an anesthesiologist’s turning off a ventilator at the request of a surgeon who wanted to take an x-ray of the patient’s abdomen (Banja, 2005, pp. 87-101). The ventilator was to be off for only a few seconds, but the anesthesiologist forgot to turn it back on, or thought he turned it back on but had not. The patient was without oxygen for a long enough time to cause her to experience global anoxia, which plunged her into a vegetative state. She never recovered, was disconnected from artificial ventilation 9 days later, and then died 2 days after that. It was later discovered that the anesthesia alarms and monitoring equipment in the operating room had been deliberately programmed to a “suspend indefinite” mode such that the anesthesiologist was not alerted to the ventilator problem. Tragically, the very instrumentality that was in place to prevent such a horror was disabled, possibly because the operating room staff found the constant beeping irritating and annoying.

Turning off or ignoring notifications because there are too many of them and they’re too annoying? An erroneous manual operation? This could be straight out of the post-mortem of more than a few companies I can think of, except that the result was a tragic death instead of the loss of millions of dollars. If you read a lot of tech post-mortems, every example in Banja’s paper will feel familiar even though the details are different.

The section concludes,

What these disasters typically reveal is that the factors accounting for them usually had “long incubation periods, typified by rule violations, discrepant events that accumulated unnoticed, and cultural beliefs about hazards that together prevented interventions that might have staved off harmful outcomes”. Furthermore, it is especially striking how multiple rule violations and lapses can coalesce so as to enable a disaster’s occurrence.

Once again, this could be from an article about technical failures. That makes the next section, on why these failures happen, seem worth checking out. The reasons given are:

The rules are stupid and inefficient

The example in the paper is about delivering medication to newborns. To prevent “drug diversion,” nurses were required to enter their password onto the computer to access the medication drawer, get the medication, and administer the correct amount. In order to ensure that the first nurse wasn’t stealing drugs, if any drug remained, another nurse was supposed to observe the process, and then enter their password onto the computer to indicate they witnessed the drug being properly disposed of.

That sounds familiar. How many technical postmortems start off with “someone skipped some steps because they’re inefficient”, e.g., “the programmer force pushed a bad config or bad code because they were sure nothing could go wrong and skipped staging/testing”? The infamous November 2014 Azure outage happened for just that reason. At around the same time, a dev at one of Azure’s competitors overrode the rule that you shouldn’t push a config that fails tests because they knew that the config couldn’t possibly be bad. When that caused the canary deploy to start failing, they overrode the rule that you can’t deploy from canary into staging with a failure because they knew their config couldn’t possibly be bad and so the failure must be from something else. That postmortem revealed that the config was technically correct, but exposed a bug in the underlying software; it was pure luck that the latent bug the config revealed wasn’t as severe as the Azure bug.

Humans are bad at reasoning about how failures cascade, so we implement bright line rules about when it’s safe to deploy. But the same thing that makes it hard for us to reason about when it’s safe to deploy makes the rules seem stupid and inefficient!
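
As a sketch of what a bright-line rule can look like when it’s enforced by the deploy tooling rather than by willpower (all of the names here are hypothetical, not any particular company’s system), the important property is that there’s no “I’m sure it’s fine” override:

    # Minimal sketch of a staged-deploy gate; deploy_to and health_check are
    # hypothetical callables standing in for a real deploy system.
    STAGES = ["canary", "staging", "production"]

    def promote(config, deploy_to, health_check):
        """Push a config one stage at a time and stop at the first unhealthy stage."""
        for stage in STAGES:
            deploy_to(stage, config)
            if not health_check(stage):
                raise RuntimeError(
                    f"{stage} is unhealthy; refusing to promote further. "
                    "There is deliberately no manual override here."
                )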

Knowledge is imperfect and uneven

People don’t automatically know what should be normal, and when new people are onboarded, they can just as easily learn deviant processes that have become normalized as reasonable processes.

Julia Evans described to me how this happens:

new person joins
new person: WTF WTF WTF WTF WTF
old hands: yeah we know we’re concerned about it
new person: WTF WTF wTF wtf wtf w…
new person gets used to it
new person #2 joins
new person #2: WTF WTF WTF WTF
new person: yeah we know. we’re concerned about it.

The thing that’s really insidious here is that people will really buy into the WTF idea, and they can spread it elsewhere for the duration of their career. Once, after doing some work on an open source project that’s regularly broken and being told that it’s normal to have a broken build, and that they were doing better than average, I ran the numbers, found that the project was basically worst in class, and wrote something about the idea that it’s possible to have a build that nearly always passes with pretty much zero effort. The most common comment I got in response was, “Wow that guy must work with superstar programmers. But let’s get real. We all break the build at least a few times a week”, as if running tests (or for that matter, even attempting to compile) before checking code in requires superhuman abilities. But once people get convinced that some deviation is normal, they often get really invested in the idea.
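
For what it’s worth, “pretty much zero effort” can be as little as a hook that refuses to push when the tests fail. Here’s a minimal sketch of a git pre-push hook; the pytest command is a placeholder for whatever build and test command the project actually uses.

    #!/usr/bin/env python3
    # Minimal sketch of .git/hooks/pre-push: refuse to push if the tests fail.
    import subprocess
    import sys

    result = subprocess.run(["pytest", "-q"])  # placeholder test command
    if result.returncode != 0:
        print("Tests failed; aborting the push.")
        sys.exit(1)

A CI job that runs the same command on every change, plus a norm that a red build blocks merges, gets you most of the rest of the way.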

I’m breaking the rule for the good of my patient

The example in the paper is of someone who breaks the rule that you should wear gloves when finding a vein. Their reasoning is that wearing gloves makes it harder to find a vein, which may result in their having to stick a baby with a needle multiple times. It’s hard to argue against that. No one wants to cause a baby extra pain!

The second worst outage I can think of occurred when someone noticed that a database service was experiencing slowness. They pushed a fix to the service, and in order to prevent the service degradation from spreading, they ignored the rule that you should do a proper, slow, staged deploy. Instead, they pushed the fix to all machines. It’s hard to argue against that. No one wants their customers to have degraded service! Unfortunately, the fix exposed a bug that caused a global outage.

The rules don’t apply to me/You can trust me

most human beings perceive themselves as good and decent people, such that they can understand many of their rule violations as entirely rational and ethically acceptable responses to problematic situations. They understand themselves to be doing nothing wrong, and will be outraged and often fiercely defend themselves when confronted with evidence to the contrary.

As companies grow up, they eventually have to impose security that prevents every employee from being able to access basically everything. And at most companies, when that happens, some people get really upset. “Don’t you trust me? If you trust me, how come you’re revoking my access to X, Y, and Z?”

Facebook famously let all employees access everyone’s profile for a long time, and you can even find HN comments indicating that some recruiters would explicitly mention that as a perk of working for Facebook. And I can think of more than one well-regarded unicorn where everyone still has access to basically everything, even after their first or second bad security breach. It’s hard to get the political capital to restrict people’s access to what they believe they need, or are entitled, to know. A lot of trendy startups have core values like “trust” and “transparency” which make it difficult to argue against universal access.

Workers are afraid to speak up

There are people I simply don’t give feedback to because I can’t tell if they’d take it well or not, and once you say something, it’s impossible to un-say it. In the paper, the author gives an example of a doctor with poor handwriting who gets mean when people ask him to clarify what he’s written. As a result, people guess instead of asking.

In most company cultures, people feel weird about giving feedback. Everyone has stories about a project that lingered on for months after it should have been terminated because no one was willing to offer explicit feedback. This is a problem even when cultures discourage meanness and encourage feedback: cultures of niceness seem to have as many issues around speaking up as cultures of meanness, if not more. In some places, people are afraid to speak up because they’ll get attacked by someone mean. In others, they’re afraid because they’ll be branded as mean. It’s a hard problem.

Leadership withholding or diluting findings on problems

In the paper, this is characterized by flaws and weaknesses being diluted as information flows up the chain of command. One example is how a supervisor might take sub-optimal actions to avoid looking bad to superiors.

I was shocked the first time I saw this happen. I must have been half a year or a year out of school. I saw that we were doing something obviously non-optimal, and brought it up with the senior person in the group. He told me that he didn’t disagree, but that if we did it my way and there was a failure, it would be really embarrassing. He acknowledged that my way reduced the chance of failure without making the technical consequences of failure worse, but it was more important that we not be embarrassed. Now that I’ve been working for a decade, I have a better understanding of how and why people play this game, but I still find it absurd.

Solutions

Let’s say you notice that your company has a problem that I’ve heard people at most companies complain about: people get promoted for heroism and putting out fires, not for preventing fires; and people get promoted for shipping features, not for doing critical maintenance work and bug fixing. How do you change that?

The simplest option is to just do the right thing yourself and ignore what’s going on around you. That has some positive impact, but the scope of your impact is necessarily limited. Next, you can convince your team to do the right thing: I’ve done that a few times for practices I feel are really important and are sticky, so that I won’t have to continue to expend effort on convincing people once things get moving.

But if the incentives are aligned against you, it will require an ongoing and probably unsustainable effort to keep people doing the right thing. In that case, the problem becomes convincing someone to change the incentives, and then making sure the change works as designed. How to convince people is worth discussing, but long and messy enough that it’s beyond the scope of this post. As for making the change work, I’ve seen many “obvious” mistakes repeated, both in places I’ve worked and those whose internal politics I know a lot about.

Small companies have it easy. When I worked at a 100 person company, the hierarchy was individual contributor (IC) -> team lead (TL) -> CEO. That was it. The CEO had a very light touch, but if he wanted something to happen, it happened. Critically, he had a good idea of what everyone was up to and could basically adjust rewards in real-time. If you did something great for the company, there’s a good chance you’d get a raise. Not in nine months when the next performance review cycle came up, but basically immediately. Not all small companies do that effectively, but with the right leadership, they can. That’s impossible for large companies.

At large company A (LCA), they had the problem we’re discussing and a mandate came down to reward people better for doing critical but low-visibility grunt work. There were too many employees for the mandator to directly make all decisions about compensation and promotion, but the mandator could review survey data, spot check decisions, and provide feedback until things were normalized. My subjective perception is that the company never managed to achieve parity between boring maintenance work and shiny new projects, but got close enough that people who wanted to make sure things worked correctly didn’t have to significantly damage their careers to do it.

At large company B (LCB), ICs agreed that it’s problematic to reward creating new features more richly than doing critical grunt work. When I talked to managers, they often agreed, too. But nevertheless, the people who get promoted are disproportionately those who ship shiny new things. I saw management attempt a number of cultural and process changes at LCB. Mostly, those took the form of pronouncements from people with fancy titles. For really important things, they might produce a video, and enforce compliance by making people take a multiple-choice quiz after watching the video. The net effect I observed among other ICs was that people talked about how disconnected management was from the day-to-day life of ICs. But, for the same reasons that normalization of deviance occurs, that information seems to have no way to reach upper management.

It’s sort of funny that this ends up being a problem about incentives. As an industry, we spend a lot of time thinking about how to incentivize consumers into doing what we want. But then we set up incentive systems that are generally agreed upon as incentivizing us to do the wrong things, and we do so via a combination of a game of telephone and cargo cult diffusion. Back when Microsoft was ascendant, we copied their interview process and asked brain-teaser interview questions. Now that Google is ascendant, we copy their interview process and ask algorithms questions. If you look around at trendy companies that are younger than Google, most of them basically copy their ranking/leveling system, with some minor tweaks. The good news is that, unlike many companies people previously copied, Google has put a lot of thought into most of their processes and made data driven decisions. The bad news is that Google is unique in a number of ways, which means that their reasoning often doesn’t generalize, and that people often cargo cult practices long after they’ve become deprecated at Google.

This kind of diffusion happens for technical decisions, too. Stripe built a reliable message queue on top of Mongo, so we build reliable message queues on top of Mongo1. It’s cargo cults all the way down2.

The paper has specific sub-sections on how to prevent normalization of deviance, which I recommend reading in full.

  • Pay attention to weak signals
  • Resist the urge to be unreasonably optimistic
  • Teach employees how to conduct emotionally uncomfortable conversations
  • System operators need to feel safe in speaking up
  • Realize that oversight and monitoring are never-ending

Let’s look at how the first one of these, “pay attention to weak signals”, interacts with a single example, the “WTF WTF WTF” a new person gives off when they join the company.

If a VP decides something is screwed up, people usually listen. It’s a strong signal. And when people don’t listen, the VP knows what levers to pull to make things happen. But when someone new comes in, they don’t know what levers they can pull to make things happen or who they should talk to almost by definition. They give out weak signals that are easily ignored. By the time they learn enough about the system to give out strong signals, they’ve acclimated.

“Pay attention to weak signals” sure sounds like good advice, but how do we do it? Strong signals are few and far between, making them easy to pay attention to. Weak signals are abundant. How do we filter out the ones that aren’t important? And how do we get an entire team or org to actually do it? These kinds of questions can’t be answered in a generic way; this takes real thought. We mostly put this thought elsewhere. Startups spend a lot of time thinking about growth, and while they’ll all tell you that they care a lot about engineering culture, revealed preference shows that they don’t. With a few exceptions, big companies aren’t much different. At LCB, I looked through the competitive analysis slide decks and they’re amazing. They look at every last detail on hundreds of products to make sure that everything is as nice for users as possible, from onboarding to interop with competing products. If there’s any single screen where things are more complex or confusing than any competitor’s, people get upset and try to fix it. It’s quite impressive. And then when LCB onboards employees in my org, a third of them are missing at least one of an alias/account, an office, or a computer, a condition which can persist for weeks or months. The competitive analysis slide decks talk about how important onboarding is because you only get one chance to make a first impression, and then employees are onboarded with the impression that the company couldn’t care less about them and that it’s normal for quotidian processes to be pervasively broken. LCB can’t even get the basics of employee onboarding right, let alone really complex things like acculturation. This is understandable – external metrics like user growth or attrition are measurable, and targets like how to tell if you’re acculturating people so that they don’t ignore weak signals are softer and harder to determine, but that doesn’t mean they’re any less important. People write a lot about how things like using fancier languages or techniques like TDD or agile will make your teams more productive, but having a strong engineering culture is a much larger force multiplier.

Thanks to Ezekiel Benjamin Smithburg and Marc Brooker for introducing me to the term Normalization of Deviance, and Kelly Eskridge, Leah Hanson, Sophie Rapoport, Ezekiel Benjamin Smithburg, Julia Evans, Dmitri Kalintsev, Ralph Corderoy, Jamie Brandon, Egor Neliuba, and Victor Felder for comments/corrections/discussion.


  1. People seem to think I’m joking here. I can understand why, but try Googling mongodb message queue. You’ll find statements like “replica sets in MongoDB work extremely well to allow automatic failover and redundancy”. Basically every company I know of that’s done this and has anything resembling scale finds this to be non-optimal, to say the least, but you can’t actually find blog posts or talks that discuss that. All you see are the posts and talks from when they first tried it and are in the honeymoon period. This is common with many technologies. You’ll mostly find glowing recommendations in public even when, in private, people will tell you about all the problems. Today, if you do the search mentioned above, you’ll get a ton of posts talking about how amazing it is to build a message queue on top of Mongo, this footnote, and maybe a couple of blog posts by Kyle Kingsbury depending on your exact search terms.

    If there were an acute failure, you might see a postmortem, but while we’ll do postmortems for “the site was down for 30 seconds”, we rarely do postmortems for “this takes 10x as much ops effort as the alternative and it’s a death by a thousand papercuts”, “we architected this thing poorly and now it’s very difficult to make changes that ought to be trivial”, or “a competitor of ours was able to accomplish the same thing with an order of magnitude less effort”. I’ll sometimes do informal postmortems by asking everyone involved oblique questions about what happened, but more for my own benefit than anything else, because I’m not sure people really want to hear the whole truth. This is especially sensitive if the effort has generated a round of promotions, which seems to be more common the more screwed up the project. The larger the project, the more visibility and promotions, even if the project could have been done with much less effort.

    [return]
  2. I’ve spent a lot of time asking about why things are the way they are, both in areas where things are working well, and in areas where things are going badly. Where things are going badly, everyone has ideas. But where things are going well, as in the small company with the light-touch CEO mentioned above, almost no one has any idea why things work. It’s magic. If you ask, people will literally tell you that it seems really similar to some other place they’ve worked, except that things are magically good instead of being terrible for reasons they don’t understand. But it’s not magic. It’s hard work that very few people understand. Something I’ve seen multiple times is that, when a VP leaves, a company will become a substantially worse place to work, and it will slowly dawn on people that the VP was doing an amazing job at supporting not only their direct reports, but making sure that everyone under them was having a good time. It’s hard to see until it changes, but if you don’t see anything obviously wrong, either you’re not paying attention or someone or many someones have put a lot of work into making sure things run smoothly. [return]

We saw some really bad Intel CPU bugs in 2015, and we should expect to see more in the future

2015 was a pretty good year for Intel. Their quarterly earnings reports exceeded expectations every quarter. They continue to be the only game in town for the serious server market, which continues to grow exponentially; from the earnings reports of the two largest cloud vendors, we can see that AWS and Azure grew by 80% and 100%, respectively. That growth has effectively offset the damage Intel has seen from the continued decline of the desktop market. For a while, it looked like cloud vendors might be able to avoid the Intel tax by moving their computation onto FPGAs, but Intel bought one of the two serious FPGA vendors and, combined with their fab advantage, they look well positioned to dominate the high-end FPGA market the same way they’ve been dominating the high-end server CPU market. Also, their fine for anti-competitive practices turned out to be $1.45B, much less than the benefit they gained from their anti-competitive practices1.

Things haven’t looked so great on the engineering/bugs side of things, though. I don’t keep track of Intel bugs unless they’re so serious that people I know are scrambling to get a patch in because of the potential impact, and I still heard about two severe bugs this year in the last quarter of the year alone. First, there was the bug found by Ben Serebrin and Jan Beulich, which allowed a guest VM to fault in a way that would cause the CPU to hang in a microcode infinite loop, allowing any VM to DoS its host.

Major cloud vendors were quite lucky that this bug was found by a Google engineer, and that Google decided to share its knowledge of the bug with its competitors before publicly disclosing. Black hats spend a lot of time trying to take down major services. I’m actually really impressed by both the persistence and the cleverness of the people who spend their time attacking the companies I work for. If, buried deep in our infrastructure, we have a bit of code running at DPC that’s vulnerable to slowdown because of some kind of hash collision, someone will find and exploit that, even if it takes a long and obscure sequence of events to make it happen. And they’ll often wait until an inconvenient time to start the attack, such as Christmas, or one of the big online shopping days. If this CPU microcode hang had been found by one of these black hats, there would have been major carnage for most cloud hosted services at the most inconvenient possible time2.

Shortly after the Serebrin/Beulich bug was found, a group of people found that running prime95, a commonly used tool for benchmarking and burn-in, caused their entire system to lock up. Intel’s response to this was:

Intel has identified an issue that potentially affects the 6th Gen Intel® Core™ family of products. This issue only occurs under certain complex workload conditions, like those that may be encountered when running applications like Prime95. In those cases, the processor may hang or cause unpredictable system behavior.

which reveals almost nothing about what’s actually going on. If you look at their errata list, you’ll find that this is typical, except that they normally won’t even name the application that was used to trigger the bug. For example, one of the current errata lists has entries like

  • Certain Combinations of AVX Instructions May Cause Unpredictable System Behavior
  • AVX Gather Instruction That Should Result in #DF May Cause Unexpected System Behavior
  • Processor May Experience a Spurious LLC-Related Machine Check During Periods of High Activity
  • Page Fault May Report Incorrect Fault Information

As we’ve seen, “unexpected system behavior” can mean that we’re completely screwed. Machine checks aren’t great either – they cause Windows to blue screen and Linux to kernel panic. An incorrect address on a page fault is potentially even worse than a mere crash, and if you dig through the list you can find a lot of other scary sounding bugs.

And keep in mind that the Intel errata list has the following disclaimer:

Errata remain in the specification update throughout the product’s lifecycle, or until a particular stepping is no longer commercially available. Under these circumstances, errata removed from the specification update are archived and available upon request.

Once they stop manufacturing a stepping (the hardware equivalent of a point release), they reserve the right to remove the errata and you won’t be able to find out what errata your older stepping has unless you’re important enough to Intel.

Anyway, back to 2015. We’ve seen at least two serious bugs in Intel CPUs in the last quarter3, and it’s almost certain there are more bugs lurking. Back when I worked at a company that produced Intel compatible CPUs, we did a fair amount of testing and characterization of Intel CPUs; as someone fresh out of school who’d previously assumed that CPUs basically worked, I was surprised by how many bugs we were able to find. Even though I never worked on the characterization and competitive analysis side of things, I still personally found multiple Intel CPU bugs just in the normal course of doing my job, poking around to verify things that seemed non-obvious to me. Turns out things that seem non-obvious to me are sometimes also non-obvious to Intel engineers. As more services move to the cloud and the impact of system hang and reset vulnerabilities increases, we’ll see more black hats investing time in finding CPU bugs. We should expect to see a lot more of these when people realize that it’s much easier than it seems to find these bugs. There was a time when a CPU family might only have one bug per year, with serious bugs happening once every few years, or even once a decade, but we’ve moved past that. In part, that’s because “unpredictable system behavior” has moved from being an annoying class of bugs that forces you to restart your computation to an attack vector that lets anyone with an AWS account attack random cloud-hosted services, but it’s mostly because CPUs have gotten more complex, making them more difficult to test and audit effectively, while Intel appears to be cutting back on validation effort. Ironically, we have hardware virtualization that’s supposed to help us with security, but the virtualization is so complicated4 that the hardware virtualization implementation is likely to expose “unpredictable system behavior” bugs that wouldn’t otherwise have existed. This isn’t to say it’s hopeless – it’s possible, in principle, to design CPUs such that a hang bug on one core doesn’t crash the entire system. It’s just that it’s a fair amount of work to do that at every level (cache directories, the uncore, etc., would have to be modified to operate when a core is hung, as well as OS schedulers). No one’s done the work because it hasn’t previously seemed important.

Update

After writing this, an ex-Intel employee said “even with your privileged access, you have no idea” and a pseudo-anonymous commenter on reddit made this shocking comment:

As someone who worked in an Intel Validation group for SOCs until mid-2014 or so I can tell you, yes, you will see more CPU bugs from Intel than you have in the past from the post-FDIV-bug era until recently.

Why?

Let me set the scene: It’s late in 2013. Intel is frantic about losing the mobile CPU wars to ARM. Meetings with all the validation groups. Head honcho in charge of Validation says something to the effect of: “We need to move faster. Validation at Intel is taking much longer than it does for our competition. We need to do whatever we can to reduce those times… we can’t live forever in the shadow of the early 90’s FDIV bug, we need to move on. Our competition is moving much faster than we are” - I’m paraphrasing. Many of the engineers in the room could remember the FDIV bug and the ensuing problems caused for Intel 20 years prior. Many of us were aghast that someone highly placed would suggest we needed to cut corners in validation - that wasn’t explicitly said, of course, but that was the implicit message. That meeting there in late 2013 signalled a sea change at Intel to many of us who were there. And it didn’t seem like it was going to be a good kind of sea change. Some of us chose to get out while the getting was good. As someone who worked in an Intel Validation group for SOCs until mid-2014 or so I can tell you, yes, you will see more CPU bugs from Intel than you have in the past from the post-FDIV-bug era until recently.

I haven’t been able to confirm this story from another source I personally know, although another anonymous commenter said “I left INTC in mid 2013. From validation. This … is accurate compared with my experience.” Another anonymous person didn’t hear that speech, but found that at around that time, “velocity” became a buzzword and management spent a lot of time talking about how Intel needs more “velocity” to compete with ARM, which appears to confirm the sentiment, if not the actual speech.

I’ve also heard from formal methods people that, around that time, there was an exodus of formal verification folks. One story I’ve heard is that people left because they were worried about being made redundant. I’m told that, at the time, early retirement packages were being floated around and people strongly suspected layoffs. Another story I’ve heard is that things got really strange due to Intel’s focus on the mobile battle with ARM, and people wanted to leave before things got even worse. But it’s hard to say if this means anything, since Intel has been losing a lot of people to Apple because Apple offers better compensation packages and the promise of being less dysfunctional.

I also got anonymous stories about bugs. One person who works in HPC told me that when they were shopping for Haswell parts, a little bird told them that they’d see drastically reduced performance on variants with greater than 12 cores. When they tried building out both 12-core and 16-core systems, they found that they got noticeably better performance on their 12-core systems across a wide variety of workloads. That’s not better per-core performance – that’s better absolute performance. Adding 4 more cores reduced the performance on parallel workloads! That was true both in single-socket and two-socket benchmarks.

There’s also a mysterious hang during idle/low-activity bug that Intel doesn’t seem to have figured out yet.

And then there’s this Broadwell bug that hangs Linux if you don’t disable low-power states.

And of course Intel isn’t the only company with bugs – this AMD bug found by Robert Swiecki not only allows a VM to crash its host, it also allows a VM to take over the host.

I doubt I’ve even heard of all the recent bugs and stories about verification/validation. Feel free to send other reports my way.

Thanks to Leah Hanson, Jeff Ligouri, Derek Slager, Ralph Corderoy, Joe Wilder, Nate Martin, Hari Angepat, and a number of anonymous tipsters for comments/corrections/discussion.


  1. As with the Apple, Google, Adobe, etc., wage-fixing agreement, legal systems are sending the clear message that businesses should engage in illegal and unethical behavior since they’ll end up getting fined a small fraction of what they gain. This is the opposite of the Becker-ian policy that’s applied to individuals, where sentences have gotten jacked up on the theory that, since many criminals aren’t caught, the criminals that are caught should have severe punishments applied as a deterrence mechanism. The theory is that the criminals will rationally calculate the expected sentence from a crime, and weigh that against the expected value of a crime. If, for example, the odds of being caught are 1% and we increase the expected sentence from 6 months to 50 years, criminals will calculate that the expected sentence has changed from 2 days to 6 months, thereby reducing the effective value of the crime and causing a reduction in crime. We now have decades of evidence that the theory that long sentences will deter crime is either empirically false or that the effect is very small; turns out that people who commit crimes on impulse don’t deeply study sentencing guidelines before committing crimes. Ironically, for white-collar corporate crimes where Becker’s theory might more plausibly hold, Becker’s theory isn’t applied. [return]
  2. Something I find curious is how non-linear the level of effort of the attacks is. Google, Microsoft, and Amazon face regular, persistent, attacks, and if they couldn’t trivially mitigate the kind of unsophisticated attack that’s been severely affecting linode availability for weeks, they wouldn’t be able to stay in business. If you talk to people at various bay area unicorns, you’ll find that a lot of them have accidentally DoS’d themselves when they hit an external API too hard during testing. In the time that it takes a sophisticated attacker to find a hole in Azure that will cause an hour of disruption across 1% of VMs, that same attacker could probably completely take down ten unicorns for a much longer period of time. And yet, these attackers are hyper focused on the most hardened targets. Why is that? [return]
  3. The fault-into-microcode-infinite-loop bug also affects AMD processors, but basically no one runs a cloud on AMD chips. I’m pointing out Intel examples because Intel bugs have higher impact, not because Intel is buggier. Intel has a much better track record on bugs than AMD. IBM is the only major microprocessor company I know of that’s been more serious about hardware verification than Intel, but if you have an IBM system running AIX, I could tell you some stories that will make your hair stand on end. Moreover, it’s not clear how effective their verification groups can be since they’ve been losing experienced folks without being able to replace them for over a decade, but that’s a topic for another post. [return]
  4. See this code for a simple example of how to use Intel’s API for this. The example is simplified, so much so that it’s not really useful except as a learning aid, and it still turns out to be around 1000 lines of low-level code. [return]

The Nyquist theorem and limitations of sampling profilers today, with glimpses of tracing tools from the future

Perf is probably the most widely used general purpose performance debugging tool on Linux. There are multiple contenders for the #2 spot, and, like perf, they’re sampling profilers. Sampling profilers are great. They tend to be easy-to-use and low-overhead compared to most alternatives. However, there are large classes of performance problems sampling profilers can’t debug effectively, and those problems are becoming more important.

For example, consider a Google search query. Below, we have a diagram of how a query is carried out. Each of the black boxes is a rack of machines and each line shows a remote procedure call (RPC) from one machine to another.

The diagram shows a single search query coming in, which issues RPCs to over a hundred machines (shown in green), each of which delivers another set of requests to the next, lower level (shown in blue). Each request at that lower level also issues a set of RPCs, which aren’t shown because there’s too much going on to effectively visualize. At that last leaf level, the machines do 1ms-2ms of work, and respond with the result, which gets propagated and merged on the way back, until the search result is assembled. While that’s happening, 20-100 other search queries will touch any given leaf machine. A single query might touch a couple thousand machines to get its results. If we look at the latency distribution for RPCs, we’d expect that, with that many RPCs, any particular query will almost certainly see some RPCs hit their 99%-ile worst case (tail) latency – and much worse than mere 99%-ile, actually.
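
To make that concrete, here’s a back-of-the-envelope sketch. The per-RPC tail probability and the fan-out counts below are illustrative assumptions, not Google’s actual numbers; the point is just how quickly the odds of hitting at least one tail-latency RPC approach certainty as fan-out grows:

```c
/* tail.c - compile with: cc tail.c -lm */
#include <math.h>
#include <stdio.h>

/* Probability that a query fanning out to n leaf machines sees at least one
 * response slower than the per-RPC 99%-ile. Illustrative numbers only. */
int main(void) {
    double p_slow = 0.01;             /* chance a single RPC lands in its own tail */
    int fanouts[] = {10, 100, 2000};  /* hypothetical fan-out counts */
    for (int i = 0; i < 3; i++) {
        int n = fanouts[i];
        double p_any = 1.0 - pow(1.0 - p_slow, n);
        printf("fan-out %4d: P(at least one 99%%-ile RPC) = %.4f\n", n, p_any);
    }
    return 0;
}
```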

That latency translates directly into money. It’s now well established that adding user latency reduces ad clicks, reduces the odds that a user will complete a transaction and buy something, reduces the odds that a user will come back later and become a repeat customer, etc. Over the past ten to fifteen years, the understanding that tail latency is an important factor in determining user latency, and that user latency translates directly to money, has trickled out from large companies like Google into the general consciousness. But debugging tools haven’t kept up.

Sampling profilers, the most common performance debugging tool, are notoriously bad at debugging problems caused by tail latency because they aggregate events into averages. But tail latency is, by definition, not average.

For more on this, let’s look at this wide-ranging Dick Sites talk1 which covers, among other things, the performance tracing framework that Dick and others have created at Google. By capturing “every” event that happens, it lets us easily debug performance oddities that would otherwise be difficult to track down. We’ll take a look at three different bugs to get an idea about the kinds of problems Google’s tracing framework is useful for.

First, we can look at another view of the search query we just saw above: given a top-level query that issues some number of RPCs, how long does it take to get responses?

Time goes from left to right. Each row is one rpc, with the blue bar showing when the RPC was issued and when it finished. We can see that the first RPC is issued and returns before 93 other RPCs go out. When the last of those 93 RPCs is done, the search result is returned. We can see that two of the RPCs take substantially longer than the rest; the slowest RPC gates the result of the search query.

To debug this problem, we want a couple things. Because the vast majority of RPCs in a slow query are normal, and only a couple are slow, we need something that does more than just show aggregates, like a sampling profiler would. We need something that will show us specifically what’s going on in the slow RPCs. Furthermore, because weird performance events may be hard to reproduce, we want something that’s cheap enough that we can run it all the time, allowing us to look at any particular case of bad performance in retrospect. In the talk, Dick Sites mentions having a budget of about 1% of CPU for the tracing framework they have.

In addition, we want a tool that has time-granularity that’s much shorter than the granularity of the thing we’re debugging. Sampling profilers typically run at something like 1 kHz (1 ms between samples), which gives little insight into what happens in a one-time event, like a slow RPC that still executes in under 1ms. There are tools that will display what looks like a trace from the output of a sampling profiler, but the resolution is so poor that these tools provide no insight into most performance problems. While it’s possible to crank up the sampling rate on something like perf, you can’t get as much resolution as we need for the problems we’re going to look at.

Getting back to the framework, to debug something like this, we might want to look at a much more zoomed in view. Here’s an example with not much going on (just tcpdump and some packet processing with recvmsg), just to illustrate what we can see when we zoom in.

The horizontal axis is time, and each row shows what a CPU is executing. The different colors indicate that different things are running. The really tall slices are kernel mode execution, the thin black line is the idle process, and the medium height slices are user mode execution. We can see that CPU0 is mostly handling incoming network traffic in a user mode process, with 18 switches into kernel mode. CPU1 is maybe half idle, with a lot of jumps into kernel mode, doing interrupt processing for tcpdump. CPU2 is almost totally idle, except for a brief chunk when a timer interrupt fires.

What’s happening is that every time a packet comes in, an interrupt is triggered to notify tcpdump about the packet. The packet is then delivered to the process that called recvmsg on CPU0. Note that running tcpdump isn’t cheap, and it actually consumes 7% of a server if you turn it on when the server is running at full load. This only dumps network traffic, and it’s already at 7x the budget we have for tracing everything! If we were to look at this in detail, we’d see that Linux’s TCP/IP stack has a large instruction footprint, and workloads like tcpdump will consistently come in and wipe that out of the l1i and l2 caches.

Anyway, now that we’ve seen a simple example of what it looks like when we zoom in on a trace, let’s look at how we can debug the slow RPC we were looking at before.

We have two views of a trace of one machine here. At the top, there’s one row per CPU, and at the bottom there’s one row per RPC. Looking at the top set, we can see that there are some bits where individual CPUs are idle, but that the CPUs are mostly quite busy. Looking at the bottom set, we can see parts of 40 different searches, most of which take around 50us, with the exception of a few that take much longer, like the one pinned between the red arrows.

We can also look at a trace of the same timeframe showing which locks are being held and which threads are executing. The arcs between the threads and the locks show when a particular thread is blocked, waiting on a particular lock. If we look at this, we can see that the time spent waiting for locks is sometimes much longer than the time spent actually executing anything. The thread pinned between the arrows is the same thread that’s executing that slow RPC. It’s a little hard to see what’s going on here, so let’s focus on that single slow RPC.

We can see that this RPC spends very little time executing and a lot of time waiting. We can also see that we’d have a pretty hard time trying to find the cause of the waiting with traditional performance measurement tools. According to stackoverflow, you should use a sampling profiler! But tools like oprofile are useless since they’ll only tell us what’s going on when our RPC is actively executing. What we really care about is what our thread is blocked on and why.

Instead of following the advice from stackoverflow, let’s look at the second view of this again.

We can see that, not only is this RPC spending most of its time waiting for locks, it’s actually spending most of its time waiting for the same lock, with only a short chunk of execution time between the waiting. With this, we can look at the cause of the long wait for a lock. Additionally, if we zoom in on the period between waiting for the two locks, we can see something curious.

It takes 50us for the thread to start executing after it gets scheduled. Note that the wait time is substantially longer than the execution time. The waiting is because an affinity policy was set which will cause the scheduler to try to schedule the thread back to the same core so that any data that’s in the core’s cache will still be there, giving you the best possible cache locality, which means that the thread will have to wait until the previously scheduled thread finishes. That makes intuitive sense, but consider, for example, a 2.2GHz Skylake, where the latencies to the l2 and l3 caches are 6.4ns and 21.2ns, respectively. Is it worth changing the affinity policy to speed this kind of thing up? You can’t tell from this single trace, but with the tracing framework used to generate this data, you could do the math to figure out if you should change the policy.

In the talk, Dick notes that, given the actual working set size, it would be worth waiting up to 10us to schedule on another CPU sharing the same l2 cache, and 100us to schedule on another CPU sharing the same l3 cache2.
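
As a rough illustration of what “doing the math” might look like, here’s a sketch that estimates the break-even wait time from the working-set size, using the Skylake latency figures quoted above. The working-set sizes, and the simplifying assumption that the refill cost is roughly (cache lines in the working set) × (extra latency per line), are mine rather than from the talk, and they ignore bandwidth limits and prefetching:

```c
#include <stdio.h>

/* Rough break-even estimate: how long is it worth waiting for the original
 * (cache-warm) core rather than migrating to a core that only shares the l2
 * or l3 with it and refilling the local cache from there? Assumes refill
 * cost ~= (working-set cache lines) * (extra latency per line). Latency
 * figures are the ~2.2GHz Skylake numbers quoted above; working-set sizes
 * are hypothetical. */
int main(void) {
    const double line_bytes = 64.0;
    const double l2_extra_ns = 6.4;   /* per-line cost to pull data from a shared l2 */
    const double l3_extra_ns = 21.2;  /* per-line cost to pull data from a shared l3 */
    const double working_sets_bytes[] = {64e3, 512e3, 2e6};  /* hypothetical */

    for (int i = 0; i < 3; i++) {
        double lines = working_sets_bytes[i] / line_bytes;
        printf("working set %5.0f kB: worth waiting ~%6.1f us (vs shared l2), ~%6.1f us (vs shared l3)\n",
               working_sets_bytes[i] / 1e3,
               lines * l2_extra_ns / 1e3,
               lines * l3_extra_ns / 1e3);
    }
    return 0;
}
```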

Something else you can observe from this trace is that, if you care about a workload that resembles Google search, basically every standard benchmark out there is bad, and the standard technique of running N copies of spec is terrible. That’s not a straw man. People still do that in academic papers today, and some chip companies use SPEC to benchmark their mobile devices!

Anyway, that was one performance issue where we were able to see what was going on because of the ability to see a number of different things at the same time (CPU scheduling, thread scheduling, and locks). Let’s look at a simpler single-threaded example on a single machine where a tracing framework is still beneficial:

This is a trace from gmail, circa 2004. Each row shows the processing that it takes to handle one email. Well, except for the last 5 rows; the last email shown takes so long to process that displaying all of the processing takes 5 rows of space. If we look at each of the normal emails, they all look approximately the same in terms of what colors (i.e., what functions) are called and how much time they take. The last one is different. It starts the same as all the others, but then all this other junk appears that only happens in the slow email.

The email itself isn’t the problem – all of that extra junk is the processing that’s done to reindex the words from the emails that had just come in, which was batched up across multiple emails. This picture caused the Gmail devs to move that batch work to another thread, reducing tail latency from 1800ms to 100ms. This is another performance bug that it would be very difficult to track down with standard profiling tools. I’ve often wondered why email almost always appears quickly when I send to gmail from gmail, and it sometimes takes minutes when I send work email from outlook to outlook. My guess is that a major cause is that it’s much harder for the outlook devs to track down tail latency bugs like this than it is for the gmail devs to do the same thing.

Let’s look at one last performance bug before moving on to discussing what kind of visibility we need to track these down. This is a bit of a spoiler, but with this bug, it’s going to be critical to see what the entire machine is doing at any given time.

This is a histogram of disk latencies on storage machines for a 64kB read, in ms. There are two sets of peaks in this graph. The ones that make sense, on the left in blue, and the ones that don’t, on the right in red.

Going from left to right on the peaks that make sense, first there’s the peak at 0ms for things that are cached in RAM. Next, there’s a peak at 3ms. That’s way too fast for the 7200rpm disks we have to transfer 64kB; the time to get a random point under the head is already (1/(7200/60)) / 2 s = 4ms. That must be the time it takes to transfer something from the disk’s cache over PCIe. The next peak, at near 25ms, is the time it takes to seek to a point and then read 64kB off the disk.

Those numbers don’t look so bad, but the 99%-ile latency is a whopping 696ms, and there are peaks at 250ms, 500ms, 750ms, 1000ms, etc. And these are all unreproducible – if you go back and read a slow block again, or even replay the same sequence of reads, the slow reads are (usually) fast. That’s weird! What could possibly cause delays that long? In the talk, Dick Sites says “each of you think of a guess, and you’ll find you’re all wrong”.

That’s a trace of thirteen disks in a machine. The blue blocks are reads, and the red blocks are writes. The black lines show the time from the initiation of a transaction by the CPU until the transaction is completed. There are some black lines without blocks because some of the transactions hit in a cache and don’t require actual disk activity. If we wait for a period where we can see tail latency and zoom in a bit, we’ll see this:

We can see that there’s a period where things are normal, and then some kind of phase transition into a period where there are 250ms gaps (4) between periods of disk activity (5) on the machine for all disks. This goes on for nine minutes. And then there’s a phase transition and disk latencies go back to normal. That it’s machine wide and not disk specific is a huge clue.

Using that information, Dick pinged various folks about what could possibly cause periodic delays that are a multiple of 250ms on an entire machine, and found out that the cause was kernel throttling of the CPU for processes that went beyond their usage quota. To enforce the quota, the kernel puts all of the relevant threads to sleep until the next multiple of a quarter second. When the quarter-second hand of the clock rolls around, it wakes up all the threads, and if those threads are still using too much CPU, the threads get put back to sleep for another quarter second. The phase change out of this mode happens when, by happenstance, there aren’t too many requests in a quarter second interval and the kernel stops throttling the threads.
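
The throttling in this story was in Google’s kernel, but mainline Linux has a similar mechanism in CFS bandwidth control: a cgroup that exhausts its CPU quota has its threads put to sleep until the next period. As a rough sketch of how you might check whether this is happening to you, the following reads the throttling counters from a cgroup v1 cpu controller; the mount path is an assumption and varies by system (cgroup v2 exposes similar counters in its combined cpu.stat file):

```c
#include <stdio.h>
#include <string.h>

/* Print CFS bandwidth throttling counters for a cgroup. Assumes a cgroup v1
 * cpu controller mounted at /sys/fs/cgroup/cpu; adjust the path for your
 * system and cgroup of interest. */
int main(void) {
    const char *path = "/sys/fs/cgroup/cpu/cpu.stat";  /* assumed default path */
    FILE *f = fopen(path, "r");
    if (!f) {
        perror(path);
        return 1;
    }
    char key[64];
    long long value;
    while (fscanf(f, "%63s %lld", key, &value) == 2) {
        /* nr_throttled: periods in which the group hit its quota and was put to sleep.
         * throttled_time: total nanoseconds spent throttled. */
        if (!strcmp(key, "nr_periods") || !strcmp(key, "nr_throttled") ||
            !strcmp(key, "throttled_time")) {
            printf("%s = %lld\n", key, value);
        }
    }
    fclose(f);
    return 0;
}
```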

After finding the cause, an engineer found that this was happening on 25% of disk servers at Google, for an average of half an hour a day, with periods of high latency as long as 23 hours. This had been happening for three years3. Dick Sites says that fixing this bug paid for his salary for a decade. This is another bug where traditional sampling profilers would have had a hard time. The key insight was that the slowdowns were correlated and machine wide, which isn’t something you can see in a profile.

One question you might have is, is this because of some flaw in existing profilers, or can profilers provide enough information that you don’t need to use tracing tools to track down rare, long-tail, performance bugs? I’ve been talking to Xi Yang about this, who had an ISCA 2015 paper and talk describing some of his work. He and his collaborators have done a lot more since publishing the paper, but the paper still contains great information on how far a profiling tool can be pushed. As Xi explains in his talk, one of the fundamental limits of a sampling profiler is how often you can sample.

This is a graph of the number of executed instructions per clock (IPC) over time in Lucene, which is the core of Elasticsearch.

At 1kHz, which is the default sampling interval for perf, you basically can’t see that anything changes over time at all. At 100kHz, which is as fast as perf runs, you can tell something is going on, but not what. The 10MHz graph is labeled SHIM because that’s the name of the tool presented in the paper. At 10MHz, you get a much better picture of what’s going on (although it’s worth noting that 10MHz is substantially lower resolution than you can get out of some tracing frameworks).

If we look at the IPC in different methods, we can see that we’re losing a lot of information at the slower sampling rates:

These are the top 10 hottest methods in Lucene, ranked by execution time; these 10 methods account for 74% of the total execution time. With perf, it’s hard to tell which methods have low IPC, i.e., which methods are spending time stalled. But with SHIM, we can clearly see that there’s one method that spends a lot of time waiting, #4.

In retrospect, there’s nothing surprising about these graphs. We know from the Nyquist theorem that, to observe a signal with some frequency, X, we have to sample at a rate of at least 2X. A lot of factors that affect performance change at frequencies higher than 1kHz (e.g., CPU p-state changes), so we should expect to be unable to directly observe many of the things that affect performance with perf or other traditional sampling profilers. If we care about microbenchmarks, we can get around this by repeatedly sampling the same thing over and over again, but for rare or one-off events, it may be hard or impossible to do that.
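
As a toy illustration of the aliasing problem (synthetic numbers, not Lucene measurements), consider an “IPC” signal that flips between 0.5 and 2.5 every 100us. A 10MHz sampler resolves both levels; a 1kHz sampler, whose interval happens to be a multiple of the signal’s period, reports a perfectly flat and badly wrong picture:

```c
#include <stdio.h>

/* Toy aliasing demo: a synthetic "IPC" signal alternates between 0.5 and 2.5
 * every 100us (a 5kHz square wave). Sampling at 10MHz resolves the
 * alternation; sampling at 1kHz, whose 1000us interval is an exact multiple
 * of the 200us period, always lands on the same phase and sees a flat 0.5.
 * Purely illustrative, not a measurement of any real workload. */
static double ipc_at(double t_us) {
    return ((long long)(t_us / 100.0)) % 2 ? 2.5 : 0.5;
}

static void sample(const char *name, double interval_us, double duration_us) {
    double min = 1e9, max = -1e9, sum = 0;
    long long n = 0;
    for (double t = 0.3; t < duration_us; t += interval_us, n++) {  /* small phase offset */
        double v = ipc_at(t);
        if (v < min) min = v;
        if (v > max) max = v;
        sum += v;
    }
    printf("%-6s samples=%9lld  min=%.1f max=%.1f mean=%.2f\n",
           name, n, min, max, sum / n);
}

int main(void) {
    double duration_us = 1e6;              /* one second of simulated time */
    sample("1kHz", 1000.0, duration_us);   /* one sample per ms, like perf's default */
    sample("10MHz", 0.1, duration_us);     /* one sample per 100ns, like SHIM */
    return 0;
}
```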

This raises a few questions:

  1. Why does perf sample so infrequently?
  2. How does SHIM get around the limitations of perf?
  3. Why are sampling profilers dominant?

1. Why does perf sample so infrequently?

This comment from events/core.c in the linux kernel explains the limit:

perf samples are done in some very critical code paths (NMIs). If they get too much CPU time, the system can lock up and not get any real work done.

As we saw from the tcpdump trace in the Dick Sites talk, interrupts take a significant amount of time to get processed, which limits the rate at which you can sample with an interrupt based sampling mechanism.

2. How does SHIM get around the limitations of perf?

Instead of having an interrupt come in periodically, like perf, SHIM instruments the runtime so that it periodically runs a code snippet that can squirrel away relevant information. In particular, the authors instrumented the Jikes RVM, which injects yield points into every method prologue, method epilogue, and loop backedge. At a high level, injecting a code snippet into every function prologue and epilogue sounds similar to what Dick Sites describes in his talk.

The details are different, and I recommend both watching the Dick Sites talk and reading the Yang et al. paper if you’re interested in performance measurement, but the fundamental similarity is that both of them decided that it’s too expensive to have another thread break in and sample periodically, so they both ended up injecting some kind of tracing code into the normal execution stream.

It’s worth noting that sampling, at any frequency, is going to miss waiting on (for example) software locks. Dick Sites’s recommendation for this is to timestamp based on wall clock (not CPU clock), and then try to find the underlying causes of unusually long waits.

3. Why are sampling profilers dominant?

We’ve seen that Google’s tracing framework allows us to debug performance problems that we’d never be able to catch with traditional sampling profilers, while also collecting the data that sampling profilers collect. From the outside, SHIM looks like a high-frequency sampling profiler, but it does so by acting like a tracing tool. Even perf is getting support for low-overhead tracing. Intel added hardware support for certain types of tracing in Broadwell and Skylake, along with kernel support in 4.1 (with user mode support for perf coming in 4.3). If you’re wondering how much overhead these tools have, Andi Kleen claims that the Intel tracing support in Linux has about a 5% overhead, and Dick Sites mentions in the talk that they have a budget of about 1% overhead.

It’s clear that state-of-the-art profilers are going to look a lot like tracing tools in the future, but if we look at the state of things today, the easiest options are all classical profilers. You can fire up a profiler like perf and it will tell you approximately how much time various methods are taking. With other basic tooling, you can tell what’s consuming memory. Between those two numbers, you can solve the majority of performance issues. Building out something like Google’s performance tracing framework is non-trivial, and cobbling together existing publicly available tools to trace performance problems is a rough experience. You can see one example of this when Marek Majkowski debugged a tail latency issue using System Tap.

In Brendan Gregg’s page on Linux tracers, he says “[perf_events] can do many things, but if I had to recommend you learn just one [tool], it would be CPU profiling”. Tracing tools are cumbersome enough that his top recommendation on his page about tracing tools is to learn a profiling tool!

Now what?

If you want to use a tracing tool like the one we looked at today, your options are:

  1. Get a job at Google
  2. Build it yourself
  3. Cobble together what you need out of existing tools

1. Get a job at Google

I hear Steve Yegge has good advice on how to do this. If you go this route, try to attend orientation in Mountain View. They have the best orientation.

2. Build it yourself

If you look at the SHIM paper, there’s a lot of cleverness built in to get really fine-grained information while minimizing overhead. I think their approach is really neat, but considering the current state of things, you can get a pretty substantial improvement without much cleverness. Fundamentally, all you really need is some way to inject your tracing code at the appropriate points, some number of bits for a timestamp, plus a handful of bits to store the event.

Say you want to trace transitions between user mode and kernel mode. The transitions between waiting and running will tell you what the thread was waiting on (e.g., disk, timer, IPI, etc.). There are maybe 200k transitions per second per core on a busy node. 200k events with a 1% overhead is 50ns per event per core. A cache miss is well over 100 cycles, so our budget is less than one cache miss per event, meaning that each record must fit within a fraction of a cache line. If we have 20 bits of timestamp (RDTSC >> 8 bits, giving ~100ns resolution and 100ms range) and 12 bits of event, that’s 4 bytes, or 16 events per cache line. Each core has to have its own buffer to avoid cache contention. To map RDTSC times back to wall clock times, calling gettimeofday along with RDTSC at least every 100ms is sufficient.
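
Here’s a minimal sketch of what such a record and per-core buffer might look like, assuming x86 and GCC/Clang for the __rdtsc intrinsic. The event codes, the buffer size, and the absence of any per-CPU indexing or draining logic are placeholders, not a real design:

```c
#include <stdint.h>
#include <x86intrin.h>   /* __rdtsc() */

/* One 4-byte trace record: 20 bits of (RDTSC >> 8) timestamp and 12 bits of
 * event code, as sketched above. Event codes and sizes are hypothetical. */
enum {
    EV_SYSCALL_ENTER = 1,   /* user -> kernel */
    EV_SYSCALL_EXIT  = 2,   /* kernel -> user */
    EV_SCHED_IN      = 3,   /* thread starts running */
    EV_SCHED_OUT     = 4,   /* thread stops running; real code would record why */
};

#define NUM_CORES       40
#define EVENTS_PER_CORE (1 << 20)   /* 4MB per core here; scale toward the 1GB-4GB
                                       total below for a 30s-120s window */

struct core_buf {
    uint32_t events[EVENTS_PER_CORE];
    uint32_t next;
} __attribute__((aligned(64)));     /* keep each core's buffer off shared cache lines */

static struct core_buf bufs[NUM_CORES];  /* real code would index by the current CPU */

static inline void trace_event(struct core_buf *b, uint32_t event_code) {
    uint32_t ts20 = (uint32_t)(__rdtsc() >> 8) & 0xFFFFF;  /* ~100ns ticks, ~100ms range */
    b->events[b->next++ & (EVENTS_PER_CORE - 1)] = (ts20 << 12) | (event_code & 0xFFF);
}

int main(void) {
    /* Log a couple of fake transitions on "core 0" just to exercise the path. */
    trace_event(&bufs[0], EV_SYSCALL_ENTER);
    trace_event(&bufs[0], EV_SYSCALL_EXIT);
    return 0;
}
```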

Now, say the machine is serving 2000 QPS. That’s 20 99%-ile tail events per second and 2 99.9% tail events per second. Since those events are, by definition, unusually long, Dick Sites recommends a window of 30s to 120s to catch those events. If we have 4 bytes per event * 200k events per second * 40 cores, that’s about 32MB/s of data. Writing to disk while we’re logging is hopeless, so you’ll want to store the entire log while tracing, which will be in the range of 1GB to 4GB. That’s probably fine for a typical machine in a datacenter, which will have between 128GB and 256GB of RAM.

My not-so-secret secret hope for this post is that someone will take this idea and implement it. That’s already happened with at least one blog post idea I’ve thrown out there, and this seems at least as valuable.

3. Cobble together what you need out of existing tools

If you don’t have a magical framework that solves all your problems, the tool you want is going to depend on the problem you’re trying to solve.

For figuring out why things are waiting, Brendan Gregg’s writeup on off-CPU flamegraphs is a pretty good start if you don’t have access to internal Google tools. For that matter, his entire site is great if you’re doing any kind of Linux performance analysis. There’s info on Dtrace, ftrace, SystemTap, etc. Most tools you might use are covered, although PMCTrack is missing.

The problem with all of these is that they’re all much higher overhead than the things we’ve looked at today, so they can’t be run in the background to catch and effectively replay any bug that comes along if you operate at scale. Yes, that includes dtrace, which I’m calling out in particular because any time you have one of these discussions, a dtrace troll will come along to say that dtrace has supported that for years. It’s like the common lisp of trace tools, in terms of community trolling.

Anyway, if you’re on Windows, Bruce Dawson’s site seems to be the closest analogue to Brendan Gregg’s site. If that doesn’t have enough detail, there’s always the Windows Internals books.

This is a bit far afield, but for problems where you want an easy way to get CPU performance counters, likwid is nice. It has a much nicer interface than perf stat, lets you easily only get stats for selected functions, etc.

Thanks to Nathan Kurz, Xi Yang, Leah Hanson, John Gossman, Dick Sites, and Hari Angepat for comments/corrections/discussion.

P.S. Xi Yang, one of the authors of SHIM is finishing up his PhD soon and is going to be looking for work. If you want to hire a performance wizard, he has a CV and resume here.


  1. The talk is amazing and I recommend watching the talk instead of reading this post. I’m writing this up because I know if someone told me I should watch a talk instead of reading the summary, I wouldn’t do it. Ok, fine. If you’re like me, maybe you’d consider reading a couple of his papers instead of reading this post. I once heard someone say that it’s impossible to disagree with Dick’s reasoning. You can disagree with his premises, but if you accept his premises and follow his argument, you have to agree with his conclusions. His presentation is impeccable and his logic is implacable. [return]
  2. This oversimplifies things a bit since, if some level of cache is bandwidth limited, spending bandwidth to move data between cores could slow down other operations more than this operation is sped up by not having to wait. But even that’s oversimplified since it doesn’t take into account the extra power it takes to move data from a higher level cache as opposed to accessing the local cache. But that’s also oversimplified, as is everything in this post. Reality is really complicated, and the more detail we want the less effective sampling profilers are. [return]
  3. This sounds like a long time, but if you ask around you’ll hear other versions of this story at every company that creates systems complex beyond human understanding. I know of one chip project at Sun that was delayed for multiple years because they couldn’t track down some persistent bugs. At Microsoft, they famously spent two years tracking down a scrolling smoothness bug on Vista. The bug was hard enough to reproduce that they set up screens in the hallways so that they could casually see when the bug struck their test boxes. One clue was that the bug only struck high-end boxes with video cards, not low-end boxes with integrated graphics, but that clue wasn’t sufficient to find the bug.

    After quite a while, they called the Xbox team in to use their profiling expertise to set up a system that could capture the bug, and once they had the profiler set up it immediately became apparent what the cause was. This was back in the AGP days, where upstream bandwidth was something like 1/10th downstream bandwidth. When memory would fill up, textures would get ejected, and while doing so, the driver would lock the bus and prevent any other traffic from going through. That took long enough that the video card became unresponsive, resulting in janky scrolling.

    It’s really common to hear stories of bugs that can take an unbounded amount of time to debug if the proper tools aren’t available.

    [return]

We only hire the best means we only hire the trendiest

An acquaintance of mine, let’s call him Mike, is looking for work after getting laid off from a contract role at Microsoft, which has happened to a lot of people I know. Like me, Mike has 11 years in industry. Unlike me, he doesn’t know a lot of folks at trendy companies, so I passed his resume around to some engineers I know at companies that are desperately hiring. My engineering friends thought Mike’s resume was fine, but most recruiters rejected him in the resume screening phase.

When I asked why he was getting rejected, the typical response I got was:

  1. Tech experience is in irrelevant tech
  2. “Experience is too random, with payments, mobile, data analytics, and UX.”
  3. Contractors are generally not the strongest technically

This response is something from a recruiter that was relayed to me through an engineer; the engineer was incredulous at the response from the recruiter. Just so we have a name, let’s call this company TrendCo. It’s one of the thousands of companies that claims to have world class engineers, hire only the best, etc. This is one company in particular, but it’s representative of a large class of companies and the responses Mike has gotten.

Anyway, (1) is code for “Mike’s a .NET dev, and we don’t like people with Windows experience”.

I’m familiar with TrendCo’s tech stack, which multiple employees have told me is “a tire fire”. Their core systems top out under 1k QPS, which has caused them to go down under load. Mike has worked on systems that can handle multiple orders of magnitude more load, but his experience is, apparently, irrelevant.

(2) is hard to make sense of. I’ve interviewed at TrendCo and one of the selling points is that it’s a startup where you get to do a lot of different things. TrendCo almost exclusively hires generalists but Mike is, apparently, too general for them.

(3), combined with (1), gets at what TrendCo’s real complaint with Mike is. He’s not their type. TrendCo’s median employee is a recent graduate from one of maybe ten “top” schools with 0-2 years of experience. They have a few experienced hires, but not many, and most of their experienced hires have something trendy on their resume, not a boring old company like Microsoft.

Whether or not you think there’s anything wrong with having a type and rejecting people who aren’t your type, as Thomas Ptacek has observed, if your type is the same type everyone else is competing for, “you are competing for talent with the wealthiest (or most overfunded) tech companies in the market”.

If you look at new grad hiring data, it looks like FB is offering people with zero experience > $100k/yr salary, $100k signing bonus, and $150k in RSUs, for an amortized total comp > $160k/yr, including $240k in the first year. Google’s package has > $100k salary, a variable signing bonus in the $10k range, and $187k in RSUs. That comes in a bit lower than FB, but it’s much higher than most companies that claim to only hire the best are willing to pay for a new grad. Keep in mind that compensation can go much higher for contested candidates, and that compensation for experienced candidates is probably higher than you expect if you’re not a hiring manager who’s seen what competitive offers look like today.

By going after people with the most sought after qualifications, TrendCo has narrowed their options down to either paying out the nose for employees, or offering non-competitive compensation packages. TrendCo has chosen the latter option, which partially explains why they have, proportionally, so few senior devs – the compensation delta increases as you get more senior, and you have to make a really compelling pitch to someone to get them to choose TrendCo when you’re offering $150k/yr less than the competition. And as people get more experience, they’re less likely to believe the part of the pitch that explains how much the stock options are worth.

Just to be clear, I don’t have anything against people with trendy backgrounds. I know a lot of these people who have impeccable interviewing skills and got 5-10 strong offers last time they looked for work. I’ve worked with someone like that: he was just out of school, his total comp package was north of $200k/yr, and he was worth every penny. But think about that for a minute. He had strong offers from six different companies, of which he was going to accept at most one. Including lunch and phone screens, the companies put in an average of eight hours apiece interviewing him. And because they wanted to hire him so much, the companies that were really serious spent an average of another five hours apiece of engineer time trying to convince him to take their offer. Because these companies had, on average, a ⅙ chance of hiring this person, they have to spend at least an expected (8+5) * 6 = 78 hours of engineer time1. People with great backgrounds are, on average, pretty great, but they’re really hard to hire. It’s much easier to hire people who are underrated, especially if you’re not paying market rates.

I’ve seen this hyperfocus on hiring people with trendy backgrounds from both sides of the table, and it’s ridiculous from both sides.

On the referring side of hiring, I tried to get a startup I was at to hire the most interesting and creative programmer I’ve ever met, who was tragically underemployed for years because of his low GPA in college. We declined to hire him and I was told that his low GPA meant that he couldn’t be very smart. Years later, Google took a chance on him and he’s been killing it since then. He actually convinced me to join Google, and at Google, I tried to hire one of the most productive programmers I know, who was promptly rejected by a recruiter for not being technical enough.

On the candidate side of hiring, I’ve experienced both being in demand and being almost unhireable. Because I did my undergrad at Wisconsin, which is one of the 25 schools that claims to be a top 10 cs/engineering school, I had recruiters beating down my door when I graduated. But that’s silly – that I attended Wisconsin wasn’t anything about me; I just happened to grow up in the state of Wisconsin. If I grew up in Utah, I probably would have ended up going to school at Utah. When I’ve compared notes with folks who attended schools like Utah and Boise State, their education is basically the same as mine. Wisconsin’s rank as an engineering school comes from having professors who do great research which is, at best, weakly correlated to effectiveness at actually teaching undergrads. Despite getting the same engineering education you could get at hundreds of other schools, I had a very easy time getting interviews and finding a great job.

I spent 7.5 years in that great job, at Centaur. Centaur has a pretty strong reputation among hardware companies in Austin who’ve been around for a while, and I had an easy time shopping for local jobs at hardware companies. But I don’t know of any software folks who’ve heard of Centaur, and as a result I couldn’t get an interview at most software companies. There were even a couple of cases where I had really strong internal referrals and the recruiters still didn’t want to talk to me, which I found funny and my friends found frustrating.

When I could get interviews, they often went poorly. A typical rejection reason was something like “we process millions of transactions per day here and we really need someone with more relevant experience who can handle these things without ramping up”. And then Google took a chance on me and I was the second person on a project to get serious about deep learning performance, which was a 20%-time project until just before I joined. We built the fastest deep learning system in the world. From what I hear, they’re now on the Nth generation of that project, but even the first generation thing we built has better per-node performance and performance per dollar than any other production system I know of today, years later (excluding follow-ons to that project, of course).

While I was at Google I had recruiters pinging me about job opportunities all the time. And now that I’m at boring old Microsoft, I don’t get nearly as many recruiters reaching out to me. I’ve been considering looking for work2 and I wonder how trendy I’ll be if I do. Experience in irrelevant tech? Check! Random experience? Check! Contractor? Well, no. But two out of three ain’t bad.

My point here isn’t anything about me. It’s that here’s this person3 who has wildly different levels of attractiveness to employers at various times, mostly due to superficial factors that don’t have much to do with actual productivity. This is a really common story among people who end up at Google. If you hired them before they worked at Google, you might have gotten a great deal! But no one (except Google) was willing to take that chance. There’s something to be said for paying more to get a known quantity, but a company like TrendCo that isn’t willing to do that cripples its hiring pipeline by only going after people with trendy resumes.

I don’t mean to pick on startups like TrendCo in particular. Boring old companies have their version of what a trendy background is, too. A friend of mine who’s desperate to hire can’t do anything with some of the resumes I pass his way because his group isn’t allowed to hire anyone without a degree. Another person I know is in a similar situation because his group won’t talk to people who aren’t already employed.

Not only are these decisions non-optimal for companies, they create a path dependence in employment outcomes that causes individual good (or bad) events to follow people around for decades. You can see similar effects in the literature on career earnings in a variety of fields4.

Thomas Ptacek has this great line about how “we interview people whose only prior work experience is ‘Line of Business .NET Developer’, and they end up showing us how to write exploits for elliptic curve partial nonce bias attacks that involve Fourier transforms and BKZ lattice reduction steps that take 6 hours to run.” If you work at a company that doesn’t reject people out of hand for not being trendy, you’ll hear lots of stories like this. Some of the best people I’ve worked with went to schools you’ve never heard of and worked at companies you’ve never heard of until they ended up at Google. Some are still at companies you’ve never heard of.

If you read Zach Holman, you may recall that when he said that he was fired, someone responded with “If an employer has decided to fire you, then you’ve not only failed at your job, you’ve failed as a human being.” A lot of people treat employment status and credentials as measures of the inherent worth of individuals. But a large component of these markers of success, not to mention success itself, is luck.

Solutions?

I can understand why this happens. At an individual level, we’re prone to the fundamental attribution error. At an organizational level, fast growing organizations burn a large fraction of their time on interviews, and the obvious way to cut down on time spent interviewing is to only interview people with “good” qualifications. Unfortunately, that’s counterproductive when you’re chasing after the same tiny pool of people as everyone else.

Here are the beginnings of some ideas. I’m open to better suggestions!

Moneyball

Billy Beane and Paul DePodesta took the Oakland A’s, a baseball franchise with nowhere near the budget of top teams, and created what was arguably the best team in baseball by finding and “hiring” players who were statistically underrated for their price. The thing I find really amazing about this is that they publicly talked about doing this, and then Michael Lewis wrote a book, titled Moneyball, about them doing this. Despite the publicity, it took years for enough competitors to catch on that the A’s strategy stopped giving them a very large edge.

You can see the exact same thing in software hiring. Thomas Ptacek has been talking about how they hired unusually effective people at Matasano for at least half a decade, maybe more. Google bigwigs regularly talk about the hiring data they have and what hasn’t worked. I believe that, years ago, they talked about how focusing on top schools wasn’t effective and didn’t turn up better-performing employees, but that doesn’t stop TrendCo from focusing its hiring efforts on top schools.

Training / mentorship

You see a lot of talk about moneyball, but for some reason people are less excited about… trainingball? Practiceball? Whatever you want to call taking people who aren’t “the best” and teaching them how to be “the best”.

This is another one where it’s easy to see the impact through the lens of sports, because there is so much good performance data. Since it’s basketball season, if we look at college basketball, for example, we can identify a handful of programs that regularly take unremarkable inputs and produce good outputs. And that’s against a field of competitors where every team is expected to coach and train their players.

When it comes to tech companies, most of the competition isn’t even trying. At the median large company, you get a couple days of “orientation”, which is mostly legal mumbo jumbo and paperwork, and the occasional “training”, which is usually a set of videos and a set of multiple-choice questions that are offered up for compliance reasons, not to teach anyone anything. And you’ll be assigned a mentor who, more likely than not, won’t provide any actual mentorship. Startups tend to be even worse! It’s not hard to do better than that.

Considering how much money companies spend on hiring and retaining “the best”, you’d expect them to spend at least a (non-zero) fraction on training. It’s also quite strange that companies don’t focus more on training and mentorship when trying to recruit. Specific things I’ve learned in specific roles have been tremendously valuable to me, but it’s almost always either been a happy accident or something I went out of my way to do. Most companies don’t focus on this stuff. Sure, recruiters will tell you that “you’ll learn so much more here than at Google, which will make you more valuable”, implying that it’s worth the $150k/yr pay cut, but if you ask them what, specifically, they do to make a better learning environment than Google, they never have a good answer.

Process / tools / culture

I’ve worked at two companies that both have effectively infinite resources to spend on tooling. One of them, let’s call them ToolCo, is really serious about tooling and invests heavily in tools. People describe tooling there with phrases like “magical”, “the best I’ve ever seen”, and “I can’t believe this is even possible”. And I can see why. For example, if you want to build a project that’s millions of lines of code, their build system will make that take somewhere between 5s and 20s (assuming you don’t enable LTO or anything else that can’t be parallelized)5. In the course of a regular day at work you’ll use multiple tools that seem magical because they’re so far ahead of what’s available in the outside world.

The other company, let’s call them ProdCo, pays lip service to tooling but doesn’t really value it. People describing ProdCo tools use phrases like “world class bad software”, “I am 2x less productive than I’ve ever been anywhere else”, and “I can’t believe this is even possible”. ProdCo has a paper on a new build system; their claimed numbers for speedup from parallelization/caching, onboarding time, and reliability are at least two orders of magnitude worse than the equivalent at ToolCo. And, in my experience, the actual numbers are worse than the claims in the paper. In the course of a day of work at ProdCo, you’ll use multiple tools that are multiple orders of magnitude worse than the equivalent at ToolCo in multiple dimensions. These kinds of things add up and can easily make a larger difference than “hiring only the best”.

Processes and culture also matter. I once worked on a team that didn’t use version control or have a bug tracker. For every no-brainer item on the Joel test, there are teams out there that make the wrong choice.

Although I’ve only worked on one team that completely failed the Joel test, every team I’ve worked on has had glaring deficiencies that are technically trivial (but sometimes culturally difficult) to fix. When I was at Google, we had really bad communication problems between the two halves of our team that were in different locations. My fix was brain-dead simple: I started typing up meeting notes for all of our local meetings and discussions and taking questions from the remote team about things that surprised them in our notes. That’s something anyone could have done, and it was a huge productivity improvement for the entire team. I’ve literally never found an environment where you can’t massively improve productivity with something that trivial. Sometimes people don’t agree (e.g., it took months to get the non-version-control-using-team to use version control), but that’s a topic for another post.

Programmers are woefully underutilized at most companies. What’s the point of hiring “the best” and then crippling them? You can get better results by hiring undistinguished folks and setting them up for success, and it’s a lot cheaper.

Conclusion

When I started programming, I heard a lot about how programmers are down to earth, not like those elitist folks who have uniforms involving suits and ties. You can even wear t-shirts to work! But if you think programmers aren’t elitist, try wearing a suit and tie to an interview sometime. You’ll have to go above and beyond to prove that you’re not a bad cultural fit. We like to think that we’re different from all those industries that judge people based on appearance, but we do the same thing, only instead of saying that people are a bad fit because they don’t wear ties, we say they’re a bad fit because they do, and instead of saying people aren’t smart enough because they don’t have the right pedigree… wait, that’s exactly the same.

Thanks to Kelley Eskridge, Laura Lindzey, John Hergenroeder, Kamal Marhubi, Julia Evans, Steven McCarthy, Lindsey Kuper, Leah Hanson, Darius Bacon, Pierre-Yves Baccou, Kyle Littler, Jorge Montero, and Mark Dominus for discussion/comments/corrections.


  1. This estimate is conservative. The math only works out to 78 hours if you assume that you never incorrectly reject a trendy candidate and that you don’t have to interview candidates that you “correctly” fail to find good candidates. If you add in the extra time for those, the number becomes a lot larger. And if you’re TrendCo, and you won’t give senior ICs $200k/yr, let alone new grads, you probably need to multiply that number by at least a factor of 10 to account for the reduced probability that someone who’s in high demand is going to take a huge paycut to work for you.

    By the way, if you do some similar math you can see that the “no false positives” thing people talk about is bogus. The only way to reduce the risk of a false positive to zero is to not hire anyone. If you hire anyone, you’re trading off the cost of firing a bad hire vs. the cost of spending engineering hours interviewing.

    [return]
  2. I consider this to generally be a good practice, at least for folks like me who are relatively early in their careers. It’s good to know what your options are, even if you don’t exercise them. When I was at Centaur, I did a round of interviews about once a year and those interviews made it very clear that I was lucky to be at Centaur. I got a lot more responsibility and a wider variety of work than I could have gotten elsewhere, I didn’t have to deal with as much nonsense, and I was pretty well paid. I still did the occasional interview, though, and you should too! If you’re worried about wasting the time of the hiring company, when I was interviewing speculatively, I always made it very clear that I was happy in my job and unlikely to change jobs, and most companies are fine with that and still wanted to go through with interviewing. [return]
  3. It’s really not about me in particular. At the same time I couldn’t get any company to talk to me, a friend of mine who’s a much better programmer than me spent six months looking for work full time. He eventually got a job at Cloudflare, was half of the team that wrote their DNS, and is now one of the world’s experts on DDoS mitigation for companies that don’t have infinite resources. That guy wasn’t even a networking person before he joined Cloudflare. He’s a brilliant generalist who’s created everything from a widely used JavaScript library to one of the coolest toy systems projects I’ve ever seen. He probably could have picked up whatever problem domain you’re struggling with and knocked it out of the park. Oh, and between the blog posts he writes and the talks he gives, he’s one of Cloudflare’s most effective recruiters. [return]
  4. I’m not going to do a literature review because there are just so many studies that link career earnings to external shocks, but I’ll cite a result that I found to be interesting: Lisa Kahn’s 2010 Labour Economics paper.

    There have been a lot of studies showing that, for some particular negative shock (like a recession), graduating into the negative shock reduces lifetime earnings. But most of those studies show that, over time, the effect gets smaller. When Kahn looked at national unemployment as a proxy for the state of the economy, she found the same thing. But when Kahn looked at state-level unemployment, she found that the effect actually compounded over time.

    The overall evidence on what happens in the long run is equivocal. If you dig around, you’ll find studies where earnings normalize after “only” 15 years, causing a large but effectively one-off loss in earnings, and studies where the effect gets worse over time. The results are mostly technically not contradictory because they look at different causes of economic distress when people get their first job, and it’s possible that the differences in results are because the different circumstances don’t generalize. But the “good” result is that it takes 15 years for earnings to normalize after a single bad setback. Even a very optimistic reading of the literature reveals that external events can and do have very large effects on people’s careers. And if you want an estimate of the bound on the “bad” case, check out, for example, the Guiso, Sapienza, and Zingales paper that claims to link the productivity of a city today to whether or not that city had a bishop in the year 1000.

    [return]
  5. During orientation, the back end of the build system was down so I tried building one of the starter tutorials on my local machine. I gave up after an hour when the build was 2% complete. I know someone who tried to build a real, large scale, production codebase on their local machine over a long weekend, and it was nowhere near done when they got back. [return]

Notes on Google's Site Reliability Engineering book

The book starts with a story about a time [Margaret Hamilton](https://en.wikipedia.org/wiki/Margaret_Hamilton_(scientist)) brought her young daughter with her to NASA, back in the days of the Apollo program. During a simulation mission, her daughter caused the mission to crash by pressing some keys that caused a prelaunch program to run during the simulated mission. Hamilton submitted a change request to add error checking code to prevent the error from happening again, but the request was rejected because the error case should never happen.

On the next mission, Apollo 8, that exact error condition occurred and a potentially fatal problem that could have been prevented with a trivial check took NASA’s engineers 9 hours to resolve.

This sounds familiar – I’ve lost track of the number of dev post-mortems that have the same basic structure.

This is an experiment in note-taking for me in two ways. First, I normally take pen and paper notes and then scan them in for posterity. Second, I normally don’t post my notes online, but I’ve been inspired to try this by Jamie Brandon’s notes on books he’s read. My handwritten notes are a series of bullet points, which may not translate well into markdown. One issue is that my markdown renderer doesn’t handle more than one level of nesting, so things will get artificially flattened. There are probably more issues. Let’s find out what they are! In case it’s not obvious, asides from me are in italics.

Chapter 1: Introduction

Everything in this chapter is covered in much more detail later.

Two approaches to hiring people to manage system stability:

Traditional approach: sysadmins

  • Assemble existing components and deploy to produce a service
  • Respond to events and updates as they occur
  • Grow team to absorb increased work as service grows
  • Pros
    • Easy to implement because it’s standard
    • Large talent pool to hire from
    • Lots of available software
  • Cons
    • Manual intervention for change management and event handling causes size of team to scale with load on system
    • Ops is fundamentally at odds with dev, which can cause pathological resistance to changes, which causes a similarly pathological response from devs, who reclassify “launches” as “incremental updates”, “flag flips”, etc.

Google’s approach: SREs

  • Have software engineers do operations
  • Candidates should be able to pass or nearly pass normal dev hiring bar, and may have some additional skills that are rare among devs (e.g., L1 - L3 networking or UNIX system internals).
  • Career progress comparable to dev career track
  • Results
    • SREs would be bored by doing tasks by hand
    • Have the skillset necessary to automate tasks
    • Do the same work as an operations team, but with automation instead of manual labor
  • To avoid manual labor trap that causes team size to scale with service load, Google places a 50% cap on the amount of “ops” work for SREs
    • Upper bound. Actual amount of ops work is expected to be much lower
  • Pros
    • Cheaper to scale
    • Circumvents devs/ops split
  • Cons
    • Hard to hire for
    • May be unorthodox in ways that require management support (e.g., product team may push back against decision to stop releases for the quarter because the error budget is depleted)

I don’t really understand how this is an example of circumventing the dev/ops split. I can see how it’s true in one sense, but the example of stopping all releases because an error budget got hit doesn’t seem fundamentally different from the “sysadmin” example where teams push back against launches. It seems that SREs have more political capital to spend and that, in the specific examples given, the SREs might be more reasonable, but there’s no reason to think that sysadmins can’t be reasonable.

Tenets of SRE

  • SRE team responsible for latency, performance, efficiency, change management, monitoring, emergency response, and capacity planning

Ensuring a durable focus on engineering

  • 50% ops cap means that extra ops work is redirected to product teams on overflow
  • Provides feedback mechanism to product teams as well as keeps load down
  • Target max 2 events per 8-12 hour on-call shift
  • Postmortems for all serious incidents, even if they didn’t trigger a page
  • Blameless postmortems

2 events per shift is the max, but what’s the average? How many on-call events are expected to get sent from the SRE team to the dev team per week?

How do you get from a blameful postmortem culture to a blameless postmortem culture? Now that everyone knows that you should have blameless postmortems, everyone will claim to do them. Sort of like having good testing and deployment practices. I’ve been lucky to be on an on call rotation that’s never gotten paged, but when I talk to folks who joined recently and are on call, they have not so great stories of finger pointing, trash talk, and blame shifting. The fact that everyone knows you’re supposed to be blameless seems to make it harder to call out blamefulness, not easier.

Move fast without breaking SLO

  • Error budget. 100% is the wrong reliability target for basically everything
  • Going from 5 9s to 100% reliability isn’t noticeable to most users and requires tremendous effort
  • Set a goal that acknowledges the trade-off and leaves an error budget
  • Error budget can be spent on anything: launching features, etc.
  • Error budget allows for discussion about how phased rollouts and 1% experiments can maintain tolerable levels of errors
  • Goal of SRE team isn’t “zero outages” – SRE and product devs are incentive aligned to spend the error budget to get maximum feature velocity

It’s not explicitly stated, but for teams that need to “move fast”, consistently coming in way under the error budget could be taken as a sign that the team is spending too much effort on reliability.

I like this idea a lot, but when I discussed this with Jessica Kerr, she pushed back on this idea because maybe you’re just under your error budget because you got lucky and a single really bad event can wipe out your error budget for the next decade. Followup question: how can you be confident enough in your risk model that you can purposefully consume error budget to move faster without worrying that a downstream (in time) bad event will put you overbudget? Nat Welch (a former Google SRE) responded to this by saying that you can build confidence through simulated disasters and other testing.
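To make the bookkeeping concrete, here’s a minimal sketch of the error-budget arithmetic the chapter describes. The SLO target and request counts below are hypothetical, not anything Google-specific.

```python
# Minimal sketch of error-budget bookkeeping.  The SLO target and the
# request counts are made-up illustrative numbers.

def error_budget(slo_target: float, total_requests: int) -> int:
    """Failed requests allowed this period under the SLO."""
    return int((1 - slo_target) * total_requests)

def remaining_budget(slo_target: float, total_requests: int,
                     failed_requests: int) -> int:
    return error_budget(slo_target, total_requests) - failed_requests

if __name__ == "__main__":
    slo = 0.999            # three nines for the quarter (hypothetical)
    total = 2_500_000_000  # requests served this quarter
    failed = 1_800_000     # requests that violated the SLO

    budget = error_budget(slo, total)
    left = remaining_budget(slo, total, failed)
    print(f"budget={budget:,} consumed={failed:,} remaining={left:,}")
    # A team far under budget might choose to ship riskier changes;
    # a team over budget would slow or stop releases.
```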

Monitoring

  • Monitoring should never require a human to interpret any part of the alerting domain
  • Three valid kinds of monitoring output
    • Alerts: human needs to take action immediately
    • Tickets: human needs to take action eventually
    • Logging: no action needed
    • Note that, for example, graphs are a type of log

Emergency Response

  • Reliability is a function of MTTF (mean-time-to-failure) and MTTR (mean-time-to-recovery)
  • For evaluating responses, we care about MTTR
  • Humans add latency
  • Systems that don’t require humans to respond will have higher availability due to lower MTTR
  • Having a “playbook” produces 3x lower MTTR
    • Having hero generalists who can respond to everything works, but having playbooks works better

I personally agree, but boy do we like our on-call heroes. I wonder how we can foster a culture of documentation.
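As a back-of-the-envelope illustration of the MTTF/MTTR point, here’s a tiny calculation. The MTTF and MTTR figures are invented; the 3x factor is the one the chapter cites for playbooks.

```python
# Availability as a function of MTTF and MTTR; the hour figures below
# are invented for illustration.

def availability(mttf_h: float, mttr_h: float) -> float:
    return mttf_h / (mttf_h + mttr_h)

mttf = 700.0            # hours between failures (hypothetical)
mttr_no_playbook = 3.0  # hours to recover when responders improvise
mttr_playbook = 1.0     # ~3x lower MTTR with a playbook, per the chapter

print(f"without playbook: {availability(mttf, mttr_no_playbook):.5f}")
print(f"with playbook:    {availability(mttf, mttr_playbook):.5f}")
```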

Change management

  • 70% of outages due to changes in a live system. Mitigation:
    • Implement progressive rollouts
    • Monitoring
    • Rollback
  • Remove humans from the loop, avoid standard human problems on repetitive tasks

Demand forecasting and capacity planning

  • Straightforward, but a surprising number of teams/services don’t do it

Provisioning

  • Adding capacity riskier than load shifting, since it often involves spinning up new instances/locations, making significant changes to existing systems (config files, load balancers, etc.)
  • Expensive enough that it should be done only when necessary; must be done quickly
    • If you don’t know what you actually need and overprovision, that costs money

Efficiency and performance

  • Load slows down systems
  • SREs provision to meet capacity target with a specific response time goal
  • Efficiency == money

Chapter 2: The production environment at Google, from the viewpoint of an SRE

No notes on this chapter because I’m already pretty familiar with it. TODO: maybe go back and read this chapter in more detail.

Chapter 3: Embracing risk

  • Ex: if a user is on a smartphone with 99% reliability, they can’t tell the difference between 99.99% and 99.999% reliability

Managing risk

  • Reliability isn’t linear in cost. It can easily cost 100x more to get one additional increment of reliability
    • Cost associated with redundant equipment
    • Cost of building out features for reliability as opposed to “normal” features
    • Goal: make systems reliable enough, but not too reliable!

Measuring service risk

  • Standard practice: identify metric to represent property of system to optimize
  • Possible metric = uptime / (uptime + downtime)
    • Problematic for a globally distributed service. What does uptime really mean?
  • Aggregate availability = successful requests / total requests
    • Obv, not all requests are equal, but aggregate availability is an ok first order approximation
  • Usually set quarterly targets
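A tiny sketch of the two availability definitions above, with made-up numbers for one quarter:

```python
# Uptime-based vs. request-based ("aggregate") availability.
# All numbers are invented for illustration.

uptime_s, downtime_s = 7_770_000, 6_000          # ~90-day quarter
uptime_availability = uptime_s / (uptime_s + downtime_s)

successful, total = 2_499_000_000, 2_500_000_000  # requests this quarter
aggregate_availability = successful / total

print(f"uptime-based:  {uptime_availability:.5f}")
print(f"request-based: {aggregate_availability:.5f}")
# For a globally distributed service that is never entirely "down",
# the request-based number is the one that's actually measurable.
```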

Risk tolerance of services

  • Usually not objectively obvious
  • SREs work with product owners to translate business objectives into explicit objectives

Identifying risk tolerance of consumer services

TODO: maybe read this in detail on second pass

Identifying risk tolerance of infrastructure services

Target availability
  • Running ex: Bigtable
    • Some consumer services serve data directly from Bigtable – need low latency and high reliability
    • Some teams use Bigtable as a backing store for offline analysis – care more about throughput than reliability
  • Too expensive to meet all needs generically
    • Ex: Bigtable instance
    • Low-latency Bigtable user wants low queue depth
    • Throughput oriented Bigtable user wants moderate to high queue depth
    • Success and failure are diametrically opposed in these two cases!
Cost
  • Partition infra and offer different levels of service
  • In addition to obv. benefits, allows service to externalize the cost of providing different levels of service (e.g., expect latency oriented service to be more expensive than throughput oriented service)

Motivation for error budgets

No notes on this because I already believe all of this. Maybe go back and re-read this if involved in debate about this.

Chapter 4: Service level objectives

Note: skipping notes on terminology section.

  • Ex: Chubby planned outages
    • Google found that Chubby was consistently over its SLO, and that global Chubby outages would cause unusually bad outages at Google
    • Chubby was so reliable that teams were incorrectly assuming that it would never be down and failing to design systems that account for failures in Chubby
    • Solution: take Chubby down globally when it’s too far above its SLO for a quarter to “show” teams that Chubby can go down

What do you and your users care about?

  • Too many indicators: hard to pay attention
  • Too few indicators: might ignore important behavior
  • Different classes of services should have different indicators
    • User-facing: availability, latency, throughput
    • Storage: latency, availability, durability
    • Big data: throughput, end-to-end latency
  • All systems care about correctness

Collecting indicators

  • Can often do naturally from server, but client-side metrics sometimes needed.

Aggregation

  • Use distributions and not averages
  • User studies show that people usually prefer slower average with better tail latency
  • Standardize on common defs, e.g., average over 1 minute, average over tasks in cluster, etc.
    • Can have exceptions, but having reasonable defaults makes things easier
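Here’s a small sketch of why distributions beat averages, using a synthetic latency sample with a heavy tail. The numbers are invented; the point is that the mean hides exactly the behavior the tail percentiles expose.

```python
# Synthetic latency sample: mostly-fast requests with a slow tail.
import random
import statistics

random.seed(0)
latencies_ms = [random.gauss(50, 10) if random.random() < 0.98
                else random.gauss(2000, 300)
                for _ in range(100_000)]

def percentile(data, p):
    """Nearest-rank percentile; fine for a sketch."""
    data = sorted(data)
    k = max(0, min(len(data) - 1, int(round(p / 100 * len(data))) - 1))
    return data[k]

print(f"mean = {statistics.mean(latencies_ms):7.1f} ms")
print(f"p50  = {percentile(latencies_ms, 50):7.1f} ms")
print(f"p99  = {percentile(latencies_ms, 99):7.1f} ms")
# The mean looks comfortable (well under 100ms) while the p99 that
# users actually notice is around two seconds -- exactly the
# information an average throws away.
```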

Choosing targets

  • Don’t pick target based on current performance
    • Current performance may require heroic effort
  • Keep it simple
  • Avoid absolutes
    • Unreasonable to talk about “infinite” scale or “always” available
  • Minimize number of SLOs
  • Perfection can wait
    • Can always redefine SLOs over time
  • SLOs set expectations
    • Keep a safety margin (internal SLOs can be defined more loosely than external SLOs)
  • Don’t overachieve
    • See Chubby example, above
    • Another example is making sure that the system isn’t too fast under light loads

Chapter 5: Eliminating toil

Carla Geisser: “If a human operator needs to touch your system during normal operations, you have a bug. The definition of normal changes as your systems grow.”

  • Def: Toil
    • Not just “work I don’t want to do”
    • Manual
    • Repetitive
    • Automatable
    • Tactical
    • No enduring value
    • O(n) with service growth
  • In surveys, find 33% toil on average
    • Numbers can be as low as 0% and as high as 80%
    • Toil > 50% is a sign that the manager should spread toil load more evenly
  • Is toil always bad?
    • Predictable and repetitive tasks can be calming
    • Can produce a sense of accomplishment, can be low-risk / low-stress activities

Section on why toil is bad. Skipping notetaking for that section.

Chapter 6: Monitoring distributed systems

  • Why monitor?
    • Analyze long-term trends
    • Compare over time or do experiments
    • Alerting
    • Building dashboards
    • Debugging

As Alex Clemmer is wont to say, our problem isn’t that we move too slowly, it’s that we build the wrong thing. I wonder how we could get from where we are today to having enough instrumentation to be able to make informed decisions when building new systems.

Setting reasonable expectations

  • Monitoring is non-trivial
  • 10-12 person SRE team typically has 1-2 people building and maintaining monitoring
  • Number has decreased over time due to improvements in tooling/libs/centralized monitoring infra
  • General trend towards simpler/faster monitoring systems, with better tools for post hoc analysis
  • Avoid “magic” systems
  • Limited success with complex dependency hierarchies (e.g., “if DB slow, alert for DB, otherwise alert for website”).
    • Used mostly (only?) for very stable parts of system
  • Rules that generate alerts for humans should be simple to understand and represent a clear failure

Avoiding magic includes avoiding ML?

  • Lots of white-box monitoring
  • Some black-box monitoring for critical stuff
  • Four golden signals
    • Latency
    • Traffic
    • Errors
    • Saturation

Interesting examples from Bigtable and Gmail from chapter not transcribed. A lot of information on the importance of keeping alerts simple also not transcribed.

The long run

  • There’s often a tension between long-run and short-run availability
  • Can sometimes fix unreliable systems through heroic effort, but that’s a burnout risk and also a failure risk
  • Taking a controlled hit in short-term reliability is usually the better trade

Chapter 7: Evolution of automation at Google

  • “Automation is a force multiplier, not a panacea”
  • Value of automation
    • Consistency
    • Extensibility
    • MTTR
    • Faster non-repair actions
    • Time savings

Multiple interesting case studies and explanations skipped in notes.

Chapter 8: Release engineering

  • This is a specific job function at Google

Release engineer role

  • Release engineers work with SWEs and SREs to define how software is released
    • Allows dev teams to focus on dev work
  • Define best practices
    • Compiler flags, formats for build ID tags, etc.
  • Releases automated
  • Models vary between teams
    • Could be “push on green” and deploy every build
    • Could be hourly builds and deploys
    • etc.
  • Hermetic builds
    • Building same rev number should always give identical results
    • Self-contained – this includes versioning everything down to the compiler used
    • Can cherry-pick fixes against an old rev to fix production software
  • Virtually all changes require code review
  • Branching
    • All code in main branch
    • Releases are branched off
    • Fixes can go from master to branch
    • Branches never merged back
  • Testing
    • CI
    • Release process creates an audit trail that runs tests and shows that tests passed
  • Config management
  • Many possible schemes (all involve storing config in source control and having strict config review)
  • Use mainline for config – config maintained at head and applied immediately
    • Originally used for Borg (and pre-Borg systems)
    • Binary releases and config changes decoupled!
  • Include config files and binaries in same package
    • Simple
    • Tightly couples binary and config – ok for projects with few config files or where few configs change
  • Package config into “configuration packages”
    • Same hermetic principle as for code
  • Release engineering shouldn’t be an afterthought!
    • Budget resources at beginning of dev cycle

Chapter 9: Simplicity

  • Stability vs. agility
    • Can make things stable by freezing – need to balance the two
    • Reliable systems can increase agility
    • Reliable rollouts make it easier to link changes to bugs
  • Virtue of boring!
  • Essential vs. accidental complexity
    • SREs should push back when accidental complexity is introduced
  • Code is a liability
    • Remove dead code or other bloat
  • Minimal APIs
    • Smaller APIs easier to test, more reliable
  • Modularity
    • API versioning
    • Same as code, where you’d avoid misc/util classes
  • Releases
    • Small releases easier to measure
    • Can’t tell what happened if we released 100 changes together

Chapter 10: Alerting from time-series data

Borgmon

  • Similar-ish to Prometheus
  • Common data format for logging
  • Data used for both dashboards and alerts
  • Formalized a legacy data format, “varz”, which allowed metrics to be viewed via HTTP
  • Adding a metric only requires a single declaration in code
    • low user-cost to add new metric
  • Borgmon fetches /varz from each target periodically
    • Also includes synthetic data like health check, if name was resolved, etc.,
  • Time series arena
    • Data stored in-memory, with checkpointing to disk
    • Fixed sized allocation
    • GC expires oldest entries when full
    • conceptually a 2-d array with time on one axis and items on the other axis
    • 24 bytes for a data point -> 1M unique time series for 12 hours at 1-minute intervals = 17 GB
  • Borgmon rules
    • Algebraic expressions
    • Compute time-series from other time-series
    • Rules evaluated in parallel on a threadpool
  • Counters vs. gauges
    • Def: counters are non-decreasing
    • Def: gauges can take any value
    • Counters preferred to gauges because gauges can lose information depending on sampling interval (see the sketch after this list)
  • Alerting
    • Borgmon rules can trigger alerts
    • Have minimum duration to prevent “flapping”
    • Usually set to two duration cycles so that missed collections don’t trigger an alert
  • Scaling
    • Borgmon can take time-series data from other Borgmon (uses binary streaming protocol instead of the text-based varz protocol)
    • Can have multiple tiers of filters
  • Prober
    • Black-box monitoring that monitors what the user sees
    • Can be queried with varz or directly send alerts to Alertmanager
  • Configuration
    • Separation between definition of rules and targets being monitored
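The sketch below (referenced from the counters-vs-gauges item above) shows the kind of information loss the notes describe: a short burst between scrapes vanishes from a sampled gauge, but still shows up in a rate computed from a counter. This isn’t Borgmon code; the traffic pattern and the 60-second scrape interval are made up.

```python
# Counter vs. gauge under a burst shorter than the scrape interval.
SCRAPE_INTERVAL_S = 60

per_second_qps = [10] * 300            # five quiet minutes...
for s in range(149, 159):              # ...with a 10-second burst mid-minute-3
    per_second_qps[s] = 1000

counter_samples, gauge_samples, running_total = [], [], 0
for second, qps in enumerate(per_second_qps, start=1):
    running_total += qps
    if second % SCRAPE_INTERVAL_S == 0:
        counter_samples.append(running_total)  # monotonically increasing
        gauge_samples.append(qps)              # instantaneous value only

# A rule in the spirit of rate(): delta of the counter over the interval.
rates = [(b - a) / SCRAPE_INTERVAL_S
         for a, b in zip(counter_samples, counter_samples[1:])]

print("gauge samples (QPS at scrape time):", gauge_samples)
print("rate from counter (avg QPS per interval):",
      [round(r, 1) for r in rates])
# The counter-derived rate shows the burst no matter when scrapes land;
# the gauge only shows it if a scrape happens to hit it.
```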

Chapter 11: Being on-call

  • Typical response time
    • 5 min for user-facing or other time-critical tasks
    • 30 min for less time-sensitive stuff
  • Response times linked to SLOs
    • Ex: 99.99% for a quarter is 13 minutes of downtime; clearly can’t have response time above 13 minutes
    • Services with looser SLOs can have response times in the 10s of minutes (or more?)
  • Primary vs secondary on-call
    • Work distribution varies by team
    • In some, secondary can be backup for primary
    • In others, secondary handles non-urgent / non-paging events, primary handles pages
  • Balanced on-call
    • Def: quantity: percent of time on-call
    • Def: quality: number of incidents that occur while on call

This is great. We should do this. People sometimes get really rough on-call rotations a few times in a row and considering the infrequency of on-call rotations there’s no reason to expect that this should randomly balance out over the course of a year or two.

  • Balance in quantity
    • >= 50% of SRE time goes into engineering
    • Of remainder, no more than 25% spent on-call
  • Prefer multi-site teams
    • Night shifts are bad for health, multi-site teams allow elimination of night shifts
  • Balance in quality
    • On average, dealing with an incident (incl root-cause analysis, remediation, writing postmortem, fixing bug, etc.) takes 6 hours.
    • => shouldn’t have more than 2 incidents in a 12-hour on-call shift
    • To stay within upper bound, want very flat distribution of pages, with median value of 0
  • Compensation – extra pay for being on-call (time-off or cash)

Chapter 12: Effective troubleshooting

No notes for this chapter.

Chapter 13: Emergency response

  • Test-induced emergency
  • Ex: want to flush out hidden dependencies on a distributed MySQL database
    • Plan: block access to 1/100 of DBs
    • Response: dependent services report that they’re unable to access key systems
    • SRE response: SRE aborts exercise, tries to roll back permissions change
    • Rollback attempt fails
    • Attempt to restore access to replicas works
    • Normal operation restored in 1 hour
    • What went well: dependent teams escalated issues immediately, were able to restore access
    • What we learned: had an insufficient understanding of the system and its interaction with other systems, failed to follow incident response that would have informed customers of outage, hadn’t tested rollback procedures in test env
  • Change-induced emergency
    • Changes can cause failures!
  • Ex: config change to abuse prevention infra pushed on Friday triggered crash-loop bug
    • Almost all externally facing systems depend on this, become unavailable
    • Many internal systems also have dependency and become unavailable
    • Alerts start firing within seconds
    • Within 5 minutes of config push, engineer who pushed change rolled back change and services started recovering
    • What went well: monitoring fired immediately, incident management worked well, out-of-band communications systems kept people up to date even though many systems were down, luck (engineer who pushed change was following real-time comms channels, which isn’t part of the release procedure)
    • What we learned: push to canary didn’t trigger same issue because it didn’t hit a specific config keyword combination; push was considered low-risk and went through less stringent canary process, alerting was too noisy during outage
  • Process-induced emergency

No notes on process-induced example.

Chapter 14: Managing incidents

This is an area where we seem to actually be pretty good. No notes on this chapter.

Chapter 15: Postmortem culture: learning from failure

I’m in strong agreement with most of this chapter. No notes.

Chapter 16: Tracking outages

  • Escalator: centralized system that tracks ACKs to alerts, notifies other people if necessary, etc.
  • Outalator: gives time-interleaved view of notifications for multiple queues
    • Also saves related email and allows marking some messages as “important”, can collapse non-important messages, etc.

Our version of Escalator seems fine. We could really use something like Outalator, though.

Chapter 17: Testing for reliability

Preaching to the choir. No notes on this section. We could really do a lot better here, though.

Chapter 18: Software engineering in SRE

  • Ex: Auxon, capacity planning automation tool
  • Background: traditional capacity planning cycle
    • 1) collect demand forecasts (quarters to years in advance)
    • 2) Plan allocations
    • 3) Review plan
    • 4) Deploy and config resources
  • Traditional approach cons
    • Many things can affect plan: increase in efficiency, increase in adoption rate, cluster delivery date slips, etc.
    • Even small changes require rechecking allocation plan
    • Large changes may require total rewrite of plan
    • Labor intensive and error prone
  • Google solution: intent-based capacity planning
    • Specify requirements, not implementation
    • Encode requirements and autogenerate a capacity plan
    • In addition to saving labor, solvers can do better than human generated solutions => cost savings
  • Ladder of examples of increasingly intent based planning
    • 1) Want 50 cores in clusters X, Y, and Z – why those resources in those clusters?
    • 2) Want 50-core footprint in any 3 clusters in region – why that many resources and why 3?
    • 3) Want to meet demand with N+2 redundancy – why N+2?
    • 4) Want 5 9s of reliability. Could find, for example, that N+2 isn’t sufficient
  • Found that greatest gains are from going to (3)
    • Some sophisticated services may go for (4)
  • Putting constraints into tools allows tradeoffs to be consistent across fleet
    • As opposed to making individual ad hoc decisions
  • Auxon inputs
    • Requirements (e.g., “service must be N+2 per continent”, “frontend servers no more than 50ms away from backend servers”)
    • Dependencies
    • Budget priorities
    • Performance data (how a service scales)
    • Demand forecast data (note that services like Colossus have derived forecasts from dependent services)
    • Resource supply & pricing
  • Inputs go into solver (mixed-integer or linear programming solver)
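To make “intent-based” concrete, here’s a toy linear program in the spirit of the ladder above: meet demand even with any two clusters down, at minimum cost. It assumes a reasonably recent scipy is available; the clusters, per-core costs, and demand figure are invented, and this is nothing like Auxon itself.

```python
# Toy intent-based capacity plan: "meet demand with N+2 redundancy"
# expressed as linear constraints.  All inputs are invented.
from itertools import combinations
from scipy.optimize import linprog

clusters = ["us-east", "us-west", "eu", "asia", "sa"]
cost_per_core = [1.0, 1.1, 1.3, 1.2, 0.9]   # relative $ per core
demand = 900                                 # cores needed to serve load

# For every pair of clusters that could be down, the remaining capacity
# must still cover demand:  sum(rest) >= demand  ->  -sum(rest) <= -demand
n = len(clusters)
A_ub, b_ub = [], []
for down in combinations(range(n), 2):
    A_ub.append([0.0 if i in down else -1.0 for i in range(n)])
    b_ub.append(-demand)

res = linprog(c=cost_per_core, A_ub=A_ub, b_ub=b_ub,
              bounds=[(0, None)] * n, method="highs")

for name, cores in zip(clusters, res.x):
    print(f"{name:8s} {cores:7.1f} cores")
print(f"total cost: {res.fun:.1f}")
```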

No notes on why SRE software, how to spin up a group, etc. TODO: re-read back half of this chapter and take notes if it’s ever directly relevant for me.

Chapter 19: Load balancing at the frontend

No notes on this section. Seems pretty similar to what we have in terms of high-level goals, and the chapter doesn’t go into low-level details. It’s notable that they do [redacted] differently from us, though. For more info on lower-level details, there’s the Maglev paper.

Chapter 20: Load balancing in the datacenter

  • Flow control
  • Need to avoid unhealthy tasks
  • Naive flow control for unhealthy tasks
    • Track number of requests to a backend
    • Treat backend as unhealthy when threshold is reached
    • Cons: generally terrible
  • Health-based flow control
    • Backend task can be in one of three states: {healthy, refusing connections, lame duck}
    • Lame duck state can still take connections, but sends backpressure request to all clients
    • Lame duck state simplifies clean shutdown
  • Def: subsetting: limiting pool of backend tasks that a client task can interact with
    • Clients in RPC system maintain pool of connections to backends
    • Using pool reduces latency compared to doing setup/teardown when needed
    • Inactive connections are relatively cheap, but not free, even in “inactive” mode (reduced health checks, UDP instead of TCP, etc.)
  • Choosing the correct subset
    • Typ: 20-100, chosen based on workload
  • Subset selection: random
    • Bad utilization
  • Subset selection: round robin
    • Order is permuted; each round has its own permutation
  • Load balancing
    • Subset selection is for connection balancing, but we still need to balance load
  • Load balancing: round robin
    • In practice, observe 2x difference between most loaded and least loaded
    • In practice, most expensive request can be 1000x more expensive than cheapest request
    • In addition, there’s random unpredictable variation in requests
  • Load balancing: least-loaded round robin
    • Exactly what it sounds like: round-robin among least loaded backends
    • Load appears to be measured in terms of connection count; may not always be the best metric
    • This is per client, not globally, so it’s possible to send requests to a backend with many requests from other clients
    • In practice, for large services, find that most-loaded task uses twice as much CPU as least-loaded; similar to normal round robin
  • Load balancing: weighted round robin
    • Same as above, but weight with other factors
    • In practice, much better load distribution than least-loaded round robin
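Here’s a hedged sketch of the subset-selection idea from the list above: backends are shuffled once per “round” and each client in that round takes a consecutive slice, so connections spread evenly. It’s only an illustration of the scheme the notes describe, not Google’s actual algorithm.

```python
# Round-based subsetting: same permutation within a round, disjoint
# slices per client.  Backend names and sizes are invented.
import random

def pick_subset(backends, client_id, subset_size):
    """Return the subset of backends this client should connect to."""
    clients_per_round = len(backends) // subset_size
    round_id, slot = divmod(client_id, clients_per_round)

    shuffled = list(backends)
    random.Random(round_id).shuffle(shuffled)   # one permutation per round
    start = slot * subset_size
    return shuffled[start:start + subset_size]

backends = [f"backend-{i}" for i in range(300)]
# With subset_size=30, each round serves 10 clients, each connected to a
# disjoint 30-backend slice, so connection counts stay balanced.
for client in range(3):
    print(client, pick_subset(backends, client, subset_size=30)[:3], "...")
```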

I wonder what Heroku meant when they responded to Rap Genius by saying “after extensive research and experimentation, we have yet to find either a theoretical model or a practical implementation that beats the simplicity and robustness of random routing to web backends that can support multiple concurrent connections”.

Chapter 21: Handling overload

  • Even with “good” load balancing, systems will become overloaded
  • Typical strategy is to serve degraded responses, but under very high load that may not be possible
  • Modeling capacity as QPS or as a function of requests (e.g., how many keys the requests read) is failure prone
    • These generally change slowly, but can change rapidly (e.g., because of a single checkin)
  • Better solution: measure directly available resources
  • CPU utilization is usually a good signal for provisioning
    • With GC, memory pressure turns into CPU utilization
    • With other systems, can provision other resources such that CPU is likely to be limiting factor
    • In cases where over-provisioning CPU is too expensive, take other resources into account

How much does it cost to generally over-provision CPU like that?

  • Client-side throttling
    • Backends start rejecting requests when customer hits quota
    • Requests still use resources, even when rejected – without throttling, backends can spend most of their resources on rejecting requests
  • Criticality
    • Seems to be priority but with a different name?
    • First-class notion in RPC system
    • Client-side throttling keeps separate stats for each level of criticality
    • By default, criticality is propagated through subsequent RPCs
  • Handling overloaded errors
    • Shed load to other DCs if DC is overloaded
    • Shed load to other backends if DC is ok but some backends are overloaded
  • Clients retry when they get an overloaded response
    • Per-request retry budget (3)
    • Per-client retry budget (10%)
    • Failed retries from client cause “overloaded; don’t retry” response to be returned upstream

Having a “don’t retry” response is “obvious”, but relatively rare in practice. A lot of real systems have a problem with failed retries causing more retries up the stack. This is especially true when crossing a hardware/software boundary (e.g., filesystem read causes many retries on DVD/SSD/spinning disk, fails, and then gets retried at the filesystem level), but seems to be generally true in pure software too.
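A sketch of the retry behavior described in the list above: a per-request cap of 3 attempts, a per-client retry budget of roughly 10%, and an “overloaded; don’t retry” signal returned upstream once retries are exhausted. The numbers come from the notes; the code structure is invented.

```python
# Per-client retry budget plus "don't retry" propagation (illustrative).

class RetryBudget:
    def __init__(self, ratio=0.1):           # ~10% of requests may be retried
        self.ratio = ratio
        self.requests = 0
        self.retries = 0

    def record_request(self):
        self.requests += 1

    def record_retry(self):
        self.retries += 1

    def can_retry(self):
        return self.retries < self.ratio * max(self.requests, 1)

OVERLOADED = "overloaded"
OVERLOADED_DONT_RETRY = "overloaded; don't retry"

def call_with_retries(send, budget, max_attempts=3):
    budget.record_request()
    for attempt in range(max_attempts):
        status = send()
        if status not in (OVERLOADED, OVERLOADED_DONT_RETRY):
            return status
        last_attempt = attempt == max_attempts - 1
        if status == OVERLOADED_DONT_RETRY or last_attempt or not budget.can_retry():
            break
        budget.record_retry()                 # about to send another attempt
    # Exhausted local retries: tell our caller not to pile retries on top.
    return OVERLOADED_DONT_RETRY

budget = RetryBudget()
# send() would be the real RPC; here a stub that is always overloaded.
print(call_with_retries(lambda: OVERLOADED, budget))  # "overloaded; don't retry"
```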

Chapter 22: Addressing cascading failures

  • Typical failure scenarios?
  • Server overload
  • Ex: have two servers
    • One gets overloaded, failing
    • Other one now gets all traffic and also fails
  • Resource exhaustion
    • CPU/memory/threads/file descriptors/etc.
  • Ex: dependencies among resources
    • 1) Java frontend has poorly tuned GC params
    • 2) Frontend runs out of CPU due to GC
    • 3) CPU exhaustion slows down requests
    • 4) Increased queue depth uses more RAM
    • 5) Fixed memory allocation for entire frontend means that less memory is available for caching
    • 6) Lower hit rate
    • 7) More requests into backend
    • 8) Backend runs out of CPU or threads
    • 9) Health checks fail, starting cascading failure
    • Difficult to determine cause during outage
  • Note: policies that avoid servers that serve errors can make things worse
    • fewer backends available, which get too many requests, which then become unavailable
  • Preventing server overload
    • Load test! Must have realistic environment
    • Serve degraded results
    • Fail cheaply and early when overloaded
    • Have higher-level systems reject requests (at reverse proxy, load balancer, and on task level)
    • Perform capacity planning
  • Queue management
    • Queues do nothing in steady state
    • Queued reqs consume memory and increase latency
    • If traffic is steady-ish, better to keep small queue size (say, 50% or less of thread pool size)
    • Ex: Gmail uses queueless servers with failover when threads are full
    • For bursty workloads, queue size should be function of #threads, time per req, size/freq of bursts
    • See also, adaptive LIFO and CoDel
  • Graceful degradation
    • Note that it’s important to test graceful degradation path, maybe by running a small set of servers near overload regularly, since this path is rarely exercised under normal circumstances
    • Best to keep simple and easy to understand
  • Retries
    • Always use randomized exponential backoff
    • See previous chapter on only retrying at a single level
    • Consider having a server-wide retry budget
  • Deadlines
    • Don’t do work where deadline has been missed (common theme for cascading failure)
    • At each stage, check that deadline hasn’t been hit
    • Deadlines should be propagated (e.g., even through RPCs)
  • Bimodal latency
    • Ex: problem with long deadline
    • Say frontend has 10 servers, 100 threads each (1k threads of total cap)
    • Normal operation: 1k QPS, reqs take 100ms => 100 worker threads occupied (1k QPS * .1s)
    • Say 5% of operations don’t complete and there’s a 100s deadline
    • That consumes 5k threads (50 QPS * 100s)
    • Frontend oversubscribed by 5x. Success rate = 1k / (5k + 95) = 19.6% => 80.4% error rate

Using deadlines instead of timeouts is great. We should really be more systematic about this.

Not allowing systems to fill up with pointless zombie requests by setting reasonable deadlines is “obvious”, but a lot of real systems seem to have arbitrary timeouts at nice round human numbers (30s, 60s, 100s, etc.) instead of deadlines that are assigned with load/cascading failures in mind.
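Here’s a minimal sketch of the two items above, deadline propagation and randomized exponential backoff, combined in one helper. The API is invented; a real RPC system would carry the absolute deadline in request metadata.

```python
# Deadline propagation + jittered exponential backoff (illustrative only).
import random
import time

class DeadlineExceeded(Exception):
    pass

def remaining(deadline):
    return deadline - time.monotonic()

def check_deadline(deadline):
    """Each stage checks before doing work instead of relying on timeouts."""
    if remaining(deadline) <= 0:
        raise DeadlineExceeded()

def call_with_backoff(send, deadline, base_delay=0.05, max_delay=2.0):
    delay = base_delay
    while True:
        check_deadline(deadline)
        try:
            # The callee gets the same absolute deadline, not a fresh timeout.
            return send(deadline)
        except (IOError, TimeoutError):
            # Randomized exponential backoff, capped by both max_delay and
            # the time left before the deadline.
            sleep_for = min(random.uniform(0, delay), max_delay,
                            remaining(deadline))
            if sleep_for <= 0:
                raise DeadlineExceeded()
            time.sleep(sleep_for)
            delay = min(delay * 2, max_delay)

# Usage: give every hop the same absolute deadline, e.g.
#   deadline = time.monotonic() + 0.5
#   call_with_backoff(flaky_rpc, deadline)
```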

  • Try to avoid intra-layer communication
    • Simpler, avoids possible cascading failure paths
  • Testing for cascading failures
    • Load test components!
    • Load testing both reveals the breaking point and ferrets out components that will totally fall over under load
    • Make sure to test each component separately
    • Test non-critical backends (e.g., make sure that spelling suggestions for search don’t impede the critical path)
  • Immediate steps to address cascading failures
    • Increase resources
    • Temporarily stop health check failures/deaths
    • Restart servers (only if that would help – e.g., in GC death spiral or deadlock)
    • Drop traffic – drastic, last resort
    • Enter degraded mode – requires having built this into service previously
    • Eliminate batch load
    • Eliminate bad traffic

Chapter 23: Distributed consensus for reliability

  • How do we agree on questions like…
    • Which process is the leader of a group of processes?
    • What is the set of processes in a group?
    • Has a message been successfully committed to a distributed queue?
    • Does a process hold a particular lease?
    • What’s the value in a datastore for a particular key?
  • Ex1: split-brain
    • Service has replicated file servers in different racks
    • Must avoid writing simultaneously to both file servers in a set to avoid data corruption
    • Each pair of file servers has one leader & one follower
    • Servers monitor each other via heartbeats
    • If one server can’t contact the other, it sends a STONITH (shoot the other node in the head)
    • But what happens if the network is slow or packets get dropped?
    • What happens if both servers issue STONITH?

This reminds me of one of my favorite distributed database postmortems. The database is configured as a ring, where each node talks to and replicates data into a “neighborhood” of 5 servers. If some machines in the neighborhood go down, other servers join the neighborhood and data gets replicated appropriately.

Sounds good, but in the case where a server goes bad and decides that no data exists and all of its neighbors are bad, it can return results faster than any of its neighbors, as well as tell its neighbors that they’re all bad. Because the bad server has no data it’s very fast and can report that its neighbors are bad faster than its neighbors can report that it’s bad. Whoops!

  • Ex2: failover requires human intervention
    • A highly sharded DB has a primary for each shard, which replicates to a secondary in another DC
    • External health checks decide if the primary should failover to its secondary
    • If the primary can’t see the secondary, it makes itself unavailable to avoid the problems from “Ex1”
    • This increases operational load
    • Problems are correlated and this is relatively likely to run into problems when people are busy with other issues
    • If there’s a network issue, there’s no reason to think that a human will have a better view into the state of the world than machines in the system
  • Ex3: faulty group-membership algorithms
    • What it sounds like. No notes on this part
  • Impossibility results
    • CAP: you can’t give up partition tolerance in real networks, so the real choice is between C and A
    • FLP: async distributed consensus can’t guarantee progress with an unreliable network

Paxos

  • Sequence of proposals, which may or may not be accepted by the majority of processes
    • Not accepted => fails
    • Sequence number per proposal, must be unique across system
  • Proposal
    • Proposer sends seq number to acceptors
    • Acceptor agrees if it hasn’t seen a higher seq number
    • Proposers can try again with higher seq number
    • If proposer recvs agreement from majority, it commits by sending commit message with value
    • Acceptors must journal to persistent storage when they accept
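To make the proposal flow concrete, here’s a heavily simplified, in-memory sketch of a single-decree round: prepare, then accept on majority agreement. A real acceptor would journal its state to persistent storage before replying, and a real proposer would retry with a higher sequence number on failure.

```python
# Minimal single-decree Paxos sketch (illustrative, not production code).

class Acceptor:
    def __init__(self):
        self.promised_seq = -1     # highest sequence number promised
        self.accepted_seq = -1     # sequence number of accepted value, if any
        self.accepted_value = None

    def prepare(self, seq):
        """Phase 1: agree only if we haven't seen a higher sequence number."""
        if seq > self.promised_seq:
            self.promised_seq = seq
            return True, self.accepted_seq, self.accepted_value
        return False, self.accepted_seq, self.accepted_value

    def accept(self, seq, value):
        """Phase 2: the commit message sent after a majority agreed."""
        if seq >= self.promised_seq:
            self.promised_seq = seq
            self.accepted_seq = seq
            self.accepted_value = value   # journal before replying, in real life
            return True
        return False

def propose(acceptors, seq, value):
    """Proposer: succeed only if a majority agrees in both phases."""
    majority = len(acceptors) // 2 + 1
    promises = [a.prepare(seq) for a in acceptors]
    granted = [p for p in promises if p[0]]
    if len(granted) < majority:
        return False                      # caller retries with a higher seq
    # If some acceptor already accepted a value, we must propose that value.
    prior = max(granted, key=lambda p: p[1])
    chosen = prior[2] if prior[1] >= 0 else value
    acks = sum(a.accept(seq, chosen) for a in acceptors)
    return acks >= majority

acceptors = [Acceptor() for _ in range(5)]
print(propose(acceptors, seq=1, value="leader=replica-3"))  # True
print(propose(acceptors, seq=0, value="leader=replica-1"))  # False: stale seq
```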

Patterns

  • Distributed consensus algorithms are a low-level primitive
  • Reliable replicated state machines
  • Reliable replicated data and config stores
    • Non distributed-consensus-based systems often use timestamps: problematic because clock synchrony can’t be guaranteed
    • See Spanner paper for an example of using distributed consensus
  • Leader election
    • Equivalent to distributed consensus
    • Where the work of the leader can be performed by one process or sharded, the leader election pattern allows writing the distributed system as if it were a simple program
    • Used by, for example, GFS and Colossus
  • Distributed coordination and locking services
    • Barrier used, for example, in MapReduce to make sure that Map is finished before Reduce proceeds
  • Distributed queues and messaging
    • Queues: can tolerate failures from worker nodes, but system needs to ensure that claimed tasks are processed
    • Can use leases instead of removal from queue
    • Using RSM means that system can continue processing even when queue goes down
  • Performance
    • Conventional wisdom that consensus algorithms can’t be used for high-throughput low-latency systems is false
    • Distributed consensus at the core of many Google systems
    • Scale makes this worse for Google than most other companies, but it still works
  • Multi-Paxos
    • Strong leader process: unless a leader has not yet been elected or a failure occurs, only one round trip required to reach consensus
    • Note that another process in the group can propose at any time
    • Can ping pong back and forth and pseudo-livelock
    • Not unique to Multi-Paxos
    • Standard solutions are to elect a proposer process or use rotating proposer
  • Scaling read-heavy workloads
    • Ex: Photon allows reads from any replica
    • Reading from a stale replica requires extra work, but doesn’t produce incorrect results
    • To guarantee reads are up to date, do one of the following:
    • 1) Perform a read-only consensus operation
    • 2) Read data from replica that’s guaranteed to be most-up-to-date (stable leader can provide this guarantee)
    • 3) Use quorum leases
  • Quorum leases
    • Replicas can be granted lease over some (or all) data in the system
  • Fast Paxos
    • Designed to be faster over WAN
    • Each client can send Propose to each member of a group of acceptors directly, instead of through a leader
    • Not necessarily faster than classic Paxos – if RTT to acceptors is long, we’ve traded one message across a slow link plus N in parallel across fast links for N across slow links
  • Stable leaders
    • “Almost all distributed consensus systems that have been designed with performance in mind use either the single stable leader pattern or a system of rotating leadership”

TODO: finish this chapter?

Chapter 24: Distributed cron

TODO: go back and read in more detail, take notes.

Chapter 25: Data processing pipelines

  • Examples of this are MapReduce or Flume
  • Convenient and easy to reason about the happy case, but fragile
    • Initial install is usually ok because worker sizing, chunking, parameters are carefully tuned
    • Over time, load changes, causes problems

Chapter 26: Data integrity

  • Definition not necessarily obvious
    • If an interface bug causes Gmail to fail to display messages, that’s the same as the data being gone from the user’s standpoint
    • 99.99% uptime means 1 hour of downtime per year. Probably ok for most apps
    • 99.99% good bytes in a 2GB file means 200K corrupt. Probably not ok for most apps
  • Backup is non-trivial
    • May have mixture of transactional and non-transactional backup and restore
    • Different versions of business logic might be live at once
    • If services are independently versioned, maybe have many combinations of versions
    • Replicas aren’t sufficient – replicas may sync corruption
  • Study of 19 data recovery efforts at Google
    • Most common user-visible data loss caused by deletion or loss of referential integrity due to software bugs
    • Hardest cases were low-grade corruption discovered weeks to months later

Defense in depth

  • First layer: soft deletion
    • Users should be able to delete their data
    • But that means that users will be able to accidentally delete their data
    • Also, account hijacking, etc.
    • Accidentally deletion can also happen due to bugs
    • Soft deletion delays actual deletion for some period of time
  • Second layer: backups
    • Need to figure out how much data it’s ok to lose during recovery, how long recovery can take, and how far back backups need to go
    • Want backups to go back forever, since corruption can go unnoticed for months (or longer)
    • But changes to code and schema can make recovery of older backups expensive
    • Google usually has 30 to 90 day window, depending on the service
  • Third layer: early detection
    • Out-of-band integrity checks
    • Hard to do this right!
    • Correct changes can cause checkers to fail
    • But loosening checks can cause failures to get missed

No notes on the two interesting case studies covered.

Chapter 27: Reliable product launches at scale

No notes on this chapter in particular. A lot of this material is covered by or at least implied by material in other chapters. Probably worth at least looking at example checklist items and action items before thinking about launch strategy, though. Also see appendix E, launch coordination checklist.

Chapters 28-32: Various chapters on management

No notes on these.

Notes on the notes

I like this book a lot. If you care about building reliable systems, reading through this book and seeing what the teams around you don’t do seems like a good exercise. That being said, the book isn’t perfect. The two big downsides for me stem from the same issue: this is one of those books that’s a collection of chapters by different people. Some of the editors are better than others, so some chapters are clearer than others, and because the chapters seem designed to be readable as standalone chapters, there’s a fair amount of redundancy if you read the book straight through. Depending on how you plan to use the book, that can be a positive, but it’s a negative to me. But even including the downsides, I’d say that this is the most valuable technical book I’ve read in the past year, and I’ve covered probably 20% of the content in this set of notes. If you really like these notes, you’ll probably want to read the full book.

If you’re on the fence about the book, you can preview the first three chapters, plus half of the fourth (along with parts of other chapters) in Google books. If you want to buy a copy, you can get one on Amazon, of course, but the ebook is a lot cheaper through Google books than through Amazon (or at least that was true when I bought it last week).

If you found this set of notes way too dry, maybe try this much more entertaining set of notes on a totally different book. If you found this to only be slightly too dry, maybe try this set of notes on classes of errors commonly seen in postmortems. In any case, I’d appreciate feedback on these notes. Writing up notes is an experiment for me. If people find these useful, I’ll try to write up notes on books I read more often. If not, I might try a different approach to writing up notes or some other kind of post entirely.

Modest list of programming blogs


This is one of those “N technical things every programmer must read” lists, except that “programmer” is way too broad a term and the styles of writing people find helpful for them are too different for any such list to contain a non-zero number of items (if you want the entire list to be helpful to everyone). So here’s a list of some things you might want to read, and why you might (or might not) want to read them.

Alex Clemmer

This post on why making a competitor to Google search is harder than it sounds is a post in classic Alex Clemmer style. The post looks at a position that’s commonly believed (web search isn’t all that hard and someone should come up with a better Google) and explains why that’s not an obviously correct position. That’s also a common theme of his comments elsewhere, such as these comments on stack ranking at MS, implementing POSIX on Windows, the size of the Windows codebase, and Bing.

If you follow his online commenting, it’s mostly Microsoft-related rants; much more current than Mini-MSFT.

Allison Kaptur

Explorations of various areas, often Python related, such as this series on the Python interpreter and this series on the CPython peephole optimizer. Also, thoughts on broader topics like debugging and learning.

Often detailed, with inline code that’s meant to be read and understood (with the help of exposition that’s generally quite clear).

Chris Fenton

Computer related projects, by which I mean things like reconstructing the Cray-1A and building mechanical computers. Rarely updated, presumably due to the amount of work that goes into the creations, but almost always interesting.

The blog posts tend to be high-level, more like pitch decks than design docs, but there’s often source code available if you want more detail.

Code Words

This is a quarterly publication from RC. Posts vary from floating point implementations in various languages to how git works to image processing.

I wonder why web publications like this don’t get more press. There’s been a bit of a revival lately, and we’ve seen plenty of high quality publications, from high-profile efforts like The Macro to unpublicized gems like Snowsuit, but you don’t really see people talking about these much. Or I don’t, anyway.

Dan McKinley

A lot of great material on how engineering companies should be run. He has a lot of ideas that sound like common sense, e.g., choose boring technology, until you realize that it’s actually uncommon to find opinions that are so sensible.

Mostly distilled wisdom (as opposed to, say, detailed explanations of code).

David Dalrymple

A mix of things from writing a 64-bit kernel from scratch shortly after learning assembly to a high-level overview of computer systems. Rarely updated, with few posts, but each post has a lot to think about.

Eli Bendersky

I think of this as “the C++ blog”, but it’s much wider ranging than that. It’s too wide ranging for me to sum up, but if I had to commit to a description I might say that it’s a collection of deep dives into various topics, often (but not always) relatively low-level, along with short blurbs about books, often (but not always) technical.

The book reviews tend to be easy reading, but the programming blog posts are often a mix of code and exposition that really demands your attention; usually not a light read.

Evan Jones

A wide variety of technical tidbits, from how integer division behavior varies by language to data corruption that isn’t corrected by Ethernet or TCP checksums. Posts are usually bite-sized and easy reads.

EPITA Systems Lab

Low-level. A good example of a relatively high-level post from this blog is this post on the low fragmentation heap in Windows. Posts like how to hack a pinball machine and how to design a 386 compatible dev board are typical.

Posts are often quite detailed, with schematic/circuit diagrams. This is relatively heavy reading and I try to have pen and paper handy when I’m reading this blog.

Fabrice Bellard

Not exactly a blog, but every time a new project appears on the front page, it’s amazing. Some examples are QEMU, FFMPEG, a 4G LTE base station that runs on a PC, a javascript PC emulator that can boot Linux, etc.

Gary Bernhardt

Another “not exactly a blog”, but it’s more informative than most blogs, not to mention more entertaining. This is the best “blog” on the pervasive brokenness of modern software that I know of.

Greg Wilson

Writeups of papers that (should) have an impact on how people write software, like this paper on what causes failures in distributed systems or this paper on what makes people feel productive. Not updated much, but Greg still blogs on his personal site.

The posts tend to be extended abstracts that tease you into reading the paper, rather than detailed explanations of the methodology and results.

Gustavo Duarte

Explanations of how Linux works, as well as other low-level topics. This particular blog seems to be on hiatus, but “0xAX” seems to have picked up the slack with the linux-insides project.

If you’ve read Love’s book on Linux, Duarte’s explanations are similar, but tend to be more about the idea and less about the implementation. They’re also heavier on providing diagrams and context. “0xAX” is a lot more focused on walking through the code than either Love or Duarte.

Jessica Kerr

Jessica is probably better known for her talks than her blog? Her talks are great! My favorite is probably this talk, which explains different concurrency models in an easy to understand way, but the blog also has a lot of material I like.

As is the case with her talks, the diagrams often take a concept and clarify it, making something that wasn’t obvious seem very obvious in retrospect.

John Regehr

I think of this as the “C is harder than you think, even if you think C is really hard” blog, although the blog actually covers a lot more than that. Some commonly covered topics are fuzzing, compiler optimization, and testing in general.

Posts tend to be conceptual. There’s often code as examples, but the code tends to be light and easy to read, making Regehr’s blog a relatively smooth and easy read even though it covers a lot of important ideas.

Juho Snellman

A lot of posts about networking, generally written so that they make sense even with minimal networking background. I wish more people with this kind of knowledge (in depth knowledge of systems, not just networking knowledge in particular) would write up explanations for a general audience. Also has interesting non-networking content, like this post on Finnish elections.

Julia Evans

AFAICT, the theme is “things Julia has learned recently”, which can be anything from Huffman coding to how to be happy when working in a remote job. When the posts are on a topic I don’t already know, I learn something new. When they’re on a topic I know, they remind me that the topic is exciting and contains a lot of wonder and mystery.

Many posts have more questions than answers, and are more of a live-blogged exploration of a topic than an explanation of the topic.

Kamal Marhubi

Technical explorations of various topics, with a systems-y bent. Kubernetes. Git push. Syscalls in Rust. Also, some musings on programming in general.

The technical explorations often get into enough nitty gritty detail that this is something you probably want to sit down to read, as opposed to skim on your phone.

Kyle Kingsbury

90% of Kyle’s posts are explanations of distributed systems testing, which expose bugs in real systems that most of us rely on. The other 10% are musings on programming that are as rigorous as Kyle’s posts on distributed systems. Possibly the most educational programming blog of all time.

For those of us without a distributed systems background, understanding posts often requires a bit of Googling, despite the extensive explanations in the posts.

Marc Brooker

A mix of theory and wisdom from a distributed systems engineer on EBS at Amazon. The theory posts tend to be relatively short and easy to swallow; not at all intimidating, as theory sometimes is.

Marek Majkowski

This used to be a blog about random experiments Marek was doing, like this post on bitsliced SipHash. Since Marek joined Cloudflare, this has turned into a list of things Marek has learned while working in Cloudflare’s networking stack, like this story about debugging slow downloads.

Posts tend to be relatively short, but with enough technical specifics that they’re not light reads.

Mary Rose Cook

Lengthy and very-detailed explanations of technical topics, mixed in with a wide variety of other posts.

The selection of topics is eclectic, and explained at a level of detail such that you’ll come away with a solid understanding of the topic. The explanations are usually fine grained enough that it’s hard to miss what’s going on, even if you’re a beginner programmer.

Nitsan Wakart

More than you ever wanted to know about writing fast code for the JVM, from how GC affects data structures to the subtleties of volatile reads.

Posts tend to involve lots of Java code, but the takeaways are often language agnostic.

Oona Raisanen

Adventures in signal processing. Everything from deblurring barcodes to figuring out what those signals from helicopters mean. If I’d known that signals and systems could be this interesting, I would have paid more attention in class.

Paul Khuong

Some content on Lisp, and some on low-level optimizations, with a trend towards low-level optimizations.

Posts are usually relatively long and self-contained explanations of technical ideas with very little fluff.

Rachel Kroll

Years of debugging stories from a long-time SRE, along with stories about big company nonsense. The stories often have details that make them sound like they come from Google, but anyone who’s worked at Microsoft, IBM, Oracle, or another large company will find them familiar.

This reminds me a bit of Google’s SRE book, except that the content is ordered chronologically instead of by topic, and it’s conveyed through personal stories rather than impersonal case studies.

Russell Smith

Homemade electronics projects from vim on a mechanical typewriter to building an electrobalance to proof spirits.

Posts tend to have a fair bit of detail, down to diagrams explaining parts of circuits, but the posts aren’t as detailed as specs. But there are usually links to resources that will teach you enough to reproduce the project, if you want.

Steve Yegge

This is one of the few programming blogs where I regularly go back and re-read posts from the archive. I learn something new every time. Posts span the entire stack, from how individual programmers can improve at programming to how orgs can improve at recruiting. I re-read that last post before posting the link here and this bit jumped out:

Well, in case you hadn’t noticed, they’re kicking our butts at recruiting. Even in our own backyard. Professor Ed Lazowska at the University of Washington told us last year that Google’s getting about 3 times as many UW hires as we are. A candidate at last week’s recruiting trip told me that of the nine or ten students he considered to be the best programmers at the UW, about half of them went to Google; only two went to Amazon, and the rest went to “no-name” places.

Actually, his story had one more interesting tidbit: he said that although Microsoft is considered one of the top three places to work by the UW CS students (along with Google and Amazon), he claims that Microsoft is hiring lots of mediocre programmers. He said they gave offers to a whole bunch of programmers who he knows aren’t any good — and this guy was my strongest interviewee of the trip, so I was inclined to trust his judgement. He said that in his eyes, this disqualified Microsoft as a potential employer.

That’s not to say we don’t lose candidates to Microsoft. We do! Microsoft has determined that Amazon is very good at talent assessment, but crappy at selling the candidates and clinching the deal. So when Microsoft hears from a candidate that they’ve got a full-time offer from us, Microsoft doesn’t even interview the person. They take the candidate for a ride in the company hummer, have execs wine and dine them, let them spend the day with the team they’re going to join, show them the private office with a door they’ll get so they can concentrate on innovation… it’s a straight sell job after we’ve made an offer.

This is exactly what happened to me with Microsoft – I had a number of offers from other companies (though not Amazon), and someone from Microsoft called me up and sold me on Microsoft. I technically had an interview, but the interview was basically a sales job. Basically every time I re-read a Steve Yegge post, I notice that the post reflects some recent experience of mine.

Ted Unangst

A mix of technical posts about security and BSD and commentary on how broken software is. Some examples of the latter are this post on automation failure and this post on how Netflix handles CDs.

Even when there’s code, the posts tend to be about a high-level idea that just happens to be illustrated by the code, which makes this a lighter read than you’d expect from sheer amount of code.

Rebecca Frankel

As far as I know, Rebecca doesn’t have a programming blog, but if you look at her apparently off-the-cuff comments on other people’s posts as a blog, it’s one of the best written programming blogs out there. She used to be prolific on Piaw’s Buzz (and probably elsewhere, although I don’t know where), and you occasionally see comments elsewhere, like on this Steve Yegge blog post about brilliant engineers1. I wish I could write like that.

RWT

This isn’t updated anymore, but I find the archives to be fun reading for insight into what people were thinking about microprocessors and computer architecture over the past two decades. It can be a bit depressing to see that the same benchmarking controversies we had 15 years ago are being repeated today, sometimes with the same players. If anything, I’d say that the average benchmark you see passed around today is worse than what you would have seen 15 years ago, even though the industry as a whole has learned a lot about benchmarking since then.

Vyacheslav Egorov

In-depth explanations of how V8 works and how various constructs get optimized, by a compiler dev on the V8 team. If I’d known compilers were this interesting, I would have taken a compilers class back when I was in college.

Often takes topics that are considered hard and explains them in a way that makes them seem easy. Lots of diagrams, where appropriate, and detailed exposition on all the tricky bits.

Yossi Kreinin

Mostly dormant since the author started doing art, but the archives have a lot of great content about hardware, low-level software, and general programming-related topics that aren’t strictly programming.

90% of the time, when I get the desire to write a post about a common misconception software folks have about hardware, Yossi has already written the post and taken a lot of flak for it so I don’t have to :-).

I also really like Yossi’s career advice, like this response to Patrick McKenzie and this post on how managers get what they want and not what they ask for.

This blog?

Common themes include:

I still sort of can’t believe that anyone reads my writing on purpose. If I had to think of one flattering thing to say about my blog, it would be that even though my blog posts are often substantially longer than Steve Yegge’s, I have literally not seen a single person complain about the length in internet comments. I expect that’s a self-defeating prophecy, though :-).

The end

Note that this list is relatively tilted towards blogs I find to be underrated. So it doesn’t include, for example, the high scalability blog, mechanical sympathy, or Patrick McKenzie even though I think they’re great. In that case, you might say that it’s strange that I have folks like Steve Yegge and Kyle Kingsbury listed. What can I say? I still consider them underrated. This list also doesn’t include blogs that mostly aren’t about programming, so it doesn’t include, for example, Ben Kuhn’s excellent blog.

Anyway, that’s all for now, but this list is pretty much off the top of my head, so I’ll add more as more blogs come to mind. I’ll also keep this list updated with what I’m reading as I find new blogs. Please please please suggest other blogs I might like, and don’t assume that I already know about a blog because it’s popular. Just for example, I had no idea who either Jeff Atwood or Zed Shaw were until a few years ago, and they were (and still are) probably two of the most well known programming bloggers in existence. Even with centralized link aggregators like HN and reddit, blog discovery has become haphazard and random with the decline of blogrolls and blogging as a dialogue rather than a monologue. Also, please don’t assume that I don’t want to read something just because it’s different from the kind of blog I normally read. I’d love to read more from UX or front-end folks; I just don’t know where to find that kind of thing!

This post was inspired by the two posts Julia Evans has on blogs she reads and by the Chicago undergraduate mathematics bibliography, which I’ve found to be the most useful set of book reviews I’ve ever encountered.

Thanks to Bartłomiej Filipek and Sean Barrett for suggestions on what to add to the list. I haven’t had time to write them up, but I’ll probably add https://fgiesen.wordpress.com/, http://fabiensanglard.net/, http://preshing.com/, http://huonw.github.io/, and https://randomascii.wordpress.com/, among others. Also, thanks to Lindsey Kuper for discussion/corrections.


  1. Quote follows below, since I can see from my analytics data that relatively few people click any individual link, and people seem especially unlikely to click a link to read a comment on a blog, even if the comment is great:

    The key here is “principally,” and that I am describing motivation, not self-evaluation. The question is, what’s driving you? What gets you working? If its just trying to show that you’re good, then you won’t be. It has to be something else too, or it won’t get you through the concentrated decade of training it takes to get to that level.

    Look at the history of the person we’re all presuming Steve Yegge is talking about. He graduated (with honors) in 1990 and started at Google in 1999. So he worked a long time before he got to the level of Google’s star. When I was at Google I hung out on Sunday afternoons with a similar superstar. Nobody else was reliably there on Sunday; but he always was, so I could count on having someone to talk to. On some Sundays he came to work even when he had unquestionably legitimate reasons for not feeling well, but he still came to work. Why didn’t he go home like any normal person would? It wasn’t that he was trying to prove himself; he’d done that long ago. What was driving him?

    The only way I can describe it is one word: fury. What was he doing every Sunday? He was reviewing various APIs that were being proposed as standards by more junior programmers, and he was always finding things wrong with them. What he would talk about, or rather, rage about, on these Sunday afternoons was always about some idiocy or another that someone was trying make standard, and what was wrong with it, how it had to be fixed up, etc, etc. He was always in a high dudgeon over it all.

    What made him come to work when he was feeling sick and dizzy and nobody, not even Larry and Sergey with their legendary impatience, not even them, I mean nobody would have thought less of him if he had just gone home & gone to sleep? He seemed to be driven, not by ambition, but by fear that if he stopped paying attention, something idiotically wrong (in his eyes) might get past him, and become the standard, and that was just unbearable, the thought made him so incoherently angry at the sheer wrongness of it, that he had to stay awake and prevent it from happening no matter how legitimately bad he was feeling at the time.

    It made me think of Paul Graham’s comment: “What do I mean by good people? One of the best tricks I learned during our startup was a rule for deciding who to hire. Could you describe the person as an animal?… I mean someone who takes their work a little too seriously; someone who does what they do so well that they pass right through professional and cross over into obsessive.

    What it means specifically depends on the job: a salesperson who just won’t take no for an answer; a hacker who will stay up till 4:00 AM rather than go to bed leaving code with a bug in it; a PR person who will cold-call New York Times reporters on their cell phones; a graphic designer who feels physical pain when something is two millimeters out of place.”

    I think a corollary of this characterization is that if you really want to be “an animal,” what you have cultivate in yourself is partly ambition, but it is partly also self-knowledge. As Paul Graham says, there are different kinds of animals. The obsessive graphic designer might be unconcerned about an API that is less than it could be, while the programming superstar might pass by, or create, a terrible graphic design without the slightest twinge of misgiving.

    Therefore, key question is: are you working on the thing you care about most? If its wrong, is it unbearable to you? Nothing but deep seated fury will propel you to the level of a superstar. Getting there hurts too much; mere desire to be good is not enough. If its not in you, its not in you. You have to be propelled by elemental wrath. Nothing less will do.

    Or it might be in you, but just not in this domain. You have to find what you care about, and not just what you care about, but what you care about violently: you can’t fake it.

    (Also, if you do have it in you, you still have to choose your boss carefully. No matter how good you are, it may not be trivial to find someone you can work for. There’s more to say here; but I’ll have to leave it for another comment.)

    Another clarification of my assertion “if you’re wondering if you’re good, then you’re not” should perhaps be said “if you need reassurance from someone else that you’re good, then you’re not.” One characteristic of these “animals” is that they are such obsessive perfectionists that their own internal standards so far outstrip anything that anyone else could hold them to, that no ordinary person (i.e. ordinary boss) can evaluate them. As Steve Yegge said, they don’t go for interviews. They do evaluate each other – at Google the superstars all reviewed each other’s code, reportedly brutally – but I don’t think they cared about the judgments of anyone who wasn’t in their circle or at their level.

    I agree with Steve Yegge’s assertion that there are an enormously important (small) group of people who are just on another level, and ordinary smart hardworking people just aren’t the same. Here’s another way to explain why there should be a quantum jump – perhaps I’ve been using this discussion to build up this idea: its the difference between people who are still trying to do well on a test administered by someone else, and the people who have found in themselves the ability to grade their own test, more carefully, with more obsessive perfectionism, than anyone else could possibly impose on them.

    School, for all it teaches, may have one bad lasting effect on people: it gives them the idea that good people get A’s on tests, and better ones get A+’s on tests, and the very best get A++’s. Then you get the idea that you go out into the real world, and your boss is kind of super-professor, who takes over the grading of the test. Joel Spolsky is accepting that role, being boss as super-professor, grading his employees tests for them, telling them whether they are good.

    But the problem is that in the real world, the very most valuable, most effective people aren’t the ones who are trying to get A+++’s on the test you give them. The very best people are the ones who can make up their own test with harder problems on it than you could ever think of, and you’d have to have studied for the same ten years they have to be able even to know how to grade their answers.

    That’s a problem, incidentally, with the idea of a meritocracy. School gives you an idea of a ladder of merit that reaches to the top. But it can’t reach all the way to the top, because someone has to measure the rungs. At the top you’re not just being judged on how high you are on the ladder. You’re also being judged on your ability to “grade your own test”; that is to say, your trustworthiness. People start asking whether you will enforce your own standards even if no one is imposing them on you. They have to! because at the top people get given jobs with the kind of responsibility where no one can possibly correct you if you screw up. I’m giving you an image of someone who is working himself sick, literally, trying grade everyone else’s work. In the end there is only so much he can do, and he does want to go home and go to bed sometimes. That means he wants people under him who are not merely good, but can be trusted not to need to be graded. Somebody has to watch the watchers, and in the end, the watchers have to watch themselves.

    [return]

Notes on concurrency bugs


Do concurrency bugs matter? From the literature, we know that most reported bugs in distributed systems have really simple causes and can be caught by trivial tests, even when we only look at bugs that cause really bad failures, like loss of a cluster or data corruption. The filesystem literature echoes this result – a simple checker that looks for totally unimplemented error handling can find hundreds of serious data corruption bugs. Most bugs are simple, at least if you measure by bug count. But if you measure by debugging time, the story is a bit different.

Just from personal experience, I’ve spent more time debugging complex non-deterministic failures than all other types of bugs combined. In fact, I’ve spent more time debugging some individual non-deterministic bugs (weeks or months) than on all other bug types combined. Non-deterministic bugs are rare, but they can be extremely hard to debug and they’re a productivity killer. Bad non-deterministic bugs take so long to debug that relatively large investments in tools and prevention can be worth it1.

Let’s see what the academic literature has to say on non-deterministic bugs. There’s a lot of literature out there, so let’s narrow things down by looking at one relatively well studied area: concurrency bugs. We’ll start with the literature on single-machine concurrency bugs and then look at distributed concurrency bugs.

Fonseca et al. DSN ‘10

They studied MySQL concurrency bugs from 2003 to 2009 and found the following:

More non-deadlock bugs (63%) than deadlock bugs (40%)

Note that these numbers sum to more than 100% because some bugs are tagged with multiple causes. This is roughly in line with the Lu et al. ASPLOS ‘08 paper (which we’ll look at later), which found that 30% of the bugs they examined were deadlock bugs.

15% of examined failures were semantic

The paper defines a semantic failure as one “where the application provides the user with a result that violates the intended semantics of the application”. The authors also find that “the vast majority of semantic bugs (92%) generated subtle violations of application semantics”. By their nature, these failures are likely to be undercounted – it’s pretty hard to miss a deadlock, but it’s easy to miss subtle data corruption.

15% of examined failures were latent

The paper defines latent as bugs that “do not become immediately visible to users.”. Unsurprisingly, the paper finds that latent failures are closely related to semantic failures; 92% of latent failures are semantic and vice versa. The 92% number makes this finding sound more precise than it really is – it’s just that 11 out of the 12 semantic failures are latent and vice versa. That could have easily been 11 out of 11 (100%) or 10 out of 12 (83%).

That’s interesting, but it’s hard to tell from that if the results generalize to projects that aren’t databases, or even projects that aren’t MySQL.

Lu et al. ASPLOS ‘08

They looked at concurrency bugs in MySQL, Firefox, OpenOffice, and Apache. Some of their findings are:

97% of examined non-deadlock bugs were atomicity-violation or order-violation bugs

Of the 74 non-deadlock bugs studied, 51 were atomicity bugs, 24 were ordering bugs, and 2 were categorized as “other”.

An example of an atomicity violation is this bug from MySQL:

Thread 1:

if (thd->proc_info)
  fputs(thd->proc_info, ...)

Thread 2:

thd->proc_info = NULL;

For anyone who isn’t used to C or C++, thd is a pointer, and -> is the operator to access a field through a pointer. The first line in thread 1 checks if the field is null. The second line calls fputs, which writes the field. The intent is to call fputs only if proc_info isn’t NULL, but there’s nothing preventing another thread from setting proc_info to NULL “between” the first and second lines of thread 1.
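
A rough sketch of the usual fix pattern for this kind of check-then-use race is for the reader to snapshot the pointer exactly once while holding the same lock the writer takes. This is not MySQL’s actual fix; the struct, mutex name, and output stream below are assumptions for illustration.

// Sketch of the check-then-use fix pattern; not MySQL's actual fix.
// The struct, mutex, and output stream are assumptions for illustration.
#include <cstdio>
#include <mutex>

struct Thd { const char *proc_info = "copying to tmp table"; };

Thd thd;
std::mutex thd_mutex;

// Reader: snapshot the pointer exactly once, under the lock.
void report_proc_info() {
  std::lock_guard<std::mutex> guard(thd_mutex);
  const char *info = thd.proc_info;  // single read; the writer can't clear it mid-check
  if (info)
    std::fputs(info, stderr);
}

// Writer: clears the field under the same lock.
void clear_proc_info() {
  std::lock_guard<std::mutex> guard(thd_mutex);
  thd.proc_info = nullptr;
}

int main() {
  report_proc_info();
  clear_proc_info();
  report_proc_info();  // prints nothing; the snapshot is null
}

With both sides holding thd_mutex, the writer can no longer clear proc_info between the check and the write.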

Like most bugs, this bug is obvious in retrospect, but if we look at the original bug report, we can see that it wasn’t obvious at the time:

Description: I’ve just noticed with the latest bk tree than MySQL regularly crashes in InnoDB code … How to repeat: I’ve still no clues on why this crash occurs.

As is common with large codebases, fixing the bug once it was diagnosed was more complicated than it first seemed. This bug was partially fixed in 2004, resurfaced again and was fixed in 2008. A fix for another bug caused a regression in 2009, which was also fixed in 2009. That fix introduced a deadlock that was found in 2011.

An example ordering bug is the following bug from Firefox:

Thread 1:

mThread=PR_CreateThread(mMain, ...);

Thread 2:

void mMain(...) {
  mState = mThread->State;
  }

Thread 1 launches Thread 2 with PR_CreateThread. Thread 2 assumes that, because the line that launched it assigned to mThread, mThread is valid. But Thread 2 can start executing before Thread 1 has assigned to mThread! The authors note that they call this an ordering bug and not an atomicity bug even though the bug could have been prevented if the line in thread 1 were atomic because their “bug pattern categorization is based on root cause, regardless of possible fix strategies”.
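
One common fix pattern for this kind of ordering bug is to have the child thread wait until the parent has published the handle. Here’s a minimal sketch using C++11 primitives rather than NSPR’s PR_CreateThread, so the names and structure are assumptions, not Firefox’s actual fix:

// Sketch of one fix for the ordering bug: the child blocks until the parent
// has published mThread. Assumed example; not Firefox's actual fix.
#include <condition_variable>
#include <cstdio>
#include <mutex>
#include <thread>

std::mutex m;
std::condition_variable cv;
std::thread *mThread = nullptr;  // handle published by the parent
bool published = false;

void child_main() {
  std::unique_lock<std::mutex> lock(m);
  cv.wait(lock, [] { return published; });  // block until the parent publishes
  // mThread is now guaranteed to be set; the original bug read it too early.
  std::printf("handle published: %p\n", static_cast<void *>(mThread));
}

int main() {
  std::thread t(child_main);
  {
    std::lock_guard<std::mutex> lock(m);
    mThread = &t;       // publish the handle...
    published = true;   // ...and only then let the child proceed
  }
  cv.notify_one();
  t.join();
}

The other common fix is to not share the handle at all and instead pass whatever the child needs as an argument when the thread is created.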

An example of an “other” bug, one of only two studied, is this bug in MySQL:

Threads 1…n:

rw_lock(&lock);

Watchdog thread:

if (lock_wait_time[i] > fatal_timeout)
  assert(0);

This can cause a spurious crash when there’s more than the expected amount of work. Note that the study doesn’t look at performance bugs, so a bug where lock contention causes things to slow to a crawl but a watchdog doesn’t kill the program wouldn’t be considered.

An aside that’s probably a topic for another post is that hardware often has deadlock or livelock detection built in, and that when a lock condition is detected, hardware will often try to push things into a state where normal execution can continue. After detecting and breaking deadlock/livelock, an error will typically be logged in a way that it will be noticed if it’s caught in lab, but that external customers won’t see. For some reason, that strategy seems rare in the software world, although it seems like it should be easier in software than in hardware.

Deadlock occurs if and only if the following four conditions are true:

  1. Mutual exclusion: at least one resource must be held in a non-shareable mode. Only one process can use the resource at any given instant of time.
  2. Hold and wait or resource holding: a process is currently holding at least one resource and requesting additional resources which are being held by other processes.
  3. No preemption: a resource can be released only voluntarily by the process holding it.
  4. Circular wait: a process must be waiting for a resource which is being held by another process, which in turn is waiting for the first process to release the resource (see the sketch after this list).
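
As a minimal sketch of the circular wait condition (my own example, not from the study): two threads that take the same two locks in opposite orders can deadlock, and acquiring both locks in a fixed global order, or atomically with std::scoped_lock, breaks the cycle.

// Build with something like: clang++ -std=c++17 -pthread deadlock.cc
#include <mutex>
#include <thread>

std::mutex a, b;

void thread1() {
  std::lock_guard<std::mutex> la(a);
  std::lock_guard<std::mutex> lb(b);  // may block on a lock held by thread 2...
}

void thread2_deadlock_prone() {
  std::lock_guard<std::mutex> lb(b);
  std::lock_guard<std::mutex> la(a);  // ...which is blocked on a lock held by thread 1
}

// Fixed: acquire both locks with a deadlock-avoiding algorithm, or always
// acquire them in the same global order.
void thread2_fixed() {
  std::scoped_lock both(a, b);
}

int main() {
  std::thread t1(thread1);
  std::thread t2(thread2_fixed);  // run the fixed variant so the demo terminates
  t1.join();
  t2.join();
}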

There’s nothing about these conditions that is unique to either hardware or software, and it’s easier to build mechanisms that can back off and replay to relax (2) in software than in hardware. Anyway, back to the study findings.

96% of examined concurrency bugs could be reproduced by fixing the relative order of 2 specific threads

This sounds like great news for testing. Testing only orderings between thread pairs is much more tractable than testing all orderings between all threads. Similarly, 92% of examined bugs could be reproduced by fixing the order of four (or fewer) memory accesses. However, there’s a kind of sampling bias here – only bugs that could be reproduced could be analyzed for a root cause, and bugs that only require ordering between two threads or only a few memory accesses are easier to reproduce.

97% of examined deadlock bugs were caused by two threads waiting for at most two resources

Moreover, 22% of examined deadlock bugs were caused by a thread acquiring a resource held by the thread itself. The authors state that pairwise testing of acquisition and release sequences should be able to catch most deadlock bugs, and that pairwise testing of thread orderings should be able to catch most non-deadlock bugs. The claim seems plausibly true when read as written; the implication seems to be that virtually all bugs can be caught through some kind of pairwise testing, but I’m a bit skeptical of that due to the sample bias of the bugs studied.

I’ve seen bugs with many moving parts take months to track down. The worst bug I’ve seen consumed nearly a person-year’s worth of time. Bugs like that mostly don’t make it into studies like this because it’s rare that a job allows someone the time to chase bugs that elusive. How many bugs like that are out there is still an open question.

Caveats

Note that all of the programs studied were written in C or C++, and that this study predates C++11. Moving to C++11 and using atomics and scoped locks would probably change the numbers substantially, not to mention moving to an entirely different concurrency model. There’s some academic work on how different concurrency models affect bug rates, but it’s not really clear how that work generalizes to codebases as large and mature as the ones studied, and by their nature, large and mature codebases are hard to do randomized trials on when the trial involves changing the fundamental primitives used. The authors note that 39% of examined bugs could have been prevented by using transactional memory, but it’s not clear how many other bugs might have been introduced if transactional memory were used.

Tools

There are other papers on characterizing single-machine concurrency bugs, but in the interest of space, I’m going to skip those. There are also papers on distributed concurrency bugs, but before we get to that, let’s look at some of the tooling for finding single-machine concurrency bugs that’s in the literature. I find the papers to be pretty interesting, especially the model checking work, but realistically, I’m probably not going to build a tool from scratch if something is available, so let’s look at what’s out there.

HapSet

Uses run-time coverage to generate interleavings that haven’t been covered yet. This is out of NEC labs; googling NEC labs HapSet returns the paper and some patent listings, but no obvious download for the tool.

CHESS

Generates unique interleavings of threads for each run. They claim that, by not tracking state, the checker is much simpler than it would otherwise be, and that they’re able to avoid many of the disadvantages of tracking state via a detail that can’t properly be described in this tiny little paragraph; read the paper if you’re interested! Supports C# and C++. The page claims that it requires Visual Studio 2010 and that it’s only been tested with 32-bit code. I haven’t tried to run this on a modern *nix compiler, but IME requiring Visual Studio 2010 means that it would be a moderate effort to get it running on a modern version of Visual Studio, and a substantial effort to get it running on a modern version of gcc or clang. A quick Google search indicates that this might be patent encumbered2.

Maple

Uses coverage to generate interleavings that haven’t been covered yet. Instruments pthreads. The source is up on github. It’s possible this tool is still usable, and I’ll probably give it a shot at some point, but it depends on at least one old, apparently unmaintained tool (PIN, a binary instrumentation tool from Intel). Googling (Binging?) for either Maple or PIN gives a number of results where people can’t even get the tool to compile, let alone use the tool.

PACER

Samples using the FastTrack algorithm in order to keep overhead low enough “to consider in production software”. Ironically, this was implemented on top of the Jikes RVM, which is unlikely to be used in actual production software. The only reference I could find for an actually downloadable tool is a completely different pacer.

ConLock / MagicLock / MagicFuzzer

There’s a series of tools that are from one group which claims to get good results using various techniques, but AFAICT the source isn’t available for any of the tools. There’s a page that claims there’s a version of MagicFuzzer available, but it’s a link to a binary that doesn’t specify what platform the binary is for and the link 404s.

OMEN / WOLF

I couldn’t find a page for these tools (other than their papers), let alone a download link.

SherLock / AtomChase / Racageddon

Another series of tools that aren’t obviously available.

Tools you can actually easily use

Valgrind / DRD / Helgrind

Instruments pthreads and easy to use – just run valgrind with the appropriate tool option (--tool=drd or --tool=helgrind) on the binary. May require a couple of tweaks if using C++11 threading.

clang thread sanitizer (TSan)

Can find data races. Flags when happens-before is violated. Works with pthreads and C++11 threads. Easy to use (just pass -fsanitize=thread to clang).
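
For a concrete picture of what TSan flags, here’s a minimal racy program (an assumed example, not taken from the TSan docs). Compiling it with -fsanitize=thread and running it produces a data race report pointing at the two unsynchronized writes.

// race.cc -- a minimal data race that TSan reports at runtime.
// Build and run with something like:
//   clang++ -std=c++11 -g -fsanitize=thread -pthread race.cc && ./a.out
#include <thread>

int counter = 0;  // shared and unsynchronized on purpose

int main() {
  std::thread t([] { counter++; });  // racy write from a second thread
  counter++;                         // racy write from the main thread
  t.join();
  return 0;
}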

A side effect of being so easy to use and actually available is that tsan has had a very large impact in the real world:

One interesting incident occurred in the open source Chrome browser. Up to 15% of known crashes were attributed to just one bug [5], which proved difficult to understand - the Chrome engineers spent over 6 months tracking this bug without success. On the other hand, the TSAN V1 team found the reason for this bug in a 30 minute run, without even knowing about these crashes. The crashes were caused by data races on a couple of reference counters. Once this reason was found, a relatively trivial fix was quickly made and patched in, and subsequently the bug was closed.

clang -Wthread-safety

Static analysis that uses annotations on shared state to determine if state wasn’t correctly guarded.
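
Here’s a minimal sketch of what the annotations look like. The hand-rolled Mutex wrapper is an assumption for illustration; real code usually uses the macro header from the clang docs or an already-annotated mutex class.

// guarded.cc -- sketch of clang's thread safety annotations.
// Check with: clang++ -std=c++11 -Wthread-safety -fsyntax-only guarded.cc
struct __attribute__((capability("mutex"))) Mutex {
  void lock() __attribute__((acquire_capability()));
  void unlock() __attribute__((release_capability()));
};

Mutex mu;
int counter __attribute__((guarded_by(mu)));  // must hold mu to touch counter

void ok() {
  mu.lock();
  counter++;  // fine: mu is held
  mu.unlock();
}

void not_ok() {
  counter++;  // -Wthread-safety warns that this requires holding mu
}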

FindBugs

General static analysis for Java with many features. Has @GuardedBy annotations, similar to -Wthread-safety.

CheckerFramework

Java framework for writing checkers. Has many different checkers. For concurrency in particular, uses @GuardedBy, like FindBugs.

rr

Deterministic replay for debugging. Easy to get and use, and appears to be actively maintained. Adds support for time-travel debugging in gdb.

DrDebug/PinPlay

General toolkit that can give you deterministic replay for debugging. Also gives you “dynamic slicing”, which is watchpoint-like: it can tell you what statements affected a variable, as well as what statements are affected by a variable. Currently Linux only; claims Windows and Android support coming soon.

Other tools

This isn’t an exhaustive list – there’s a ton of literature on this, and this is an area where, frankly, I’m pretty unlikely to have the time to implement a tool myself, so there’s not much value for me in reading more papers to find out about techniques that I’d have to implement myself3. However, I’d be interested in hearing about other tools that are usable.

One thing I find interesting about this is that almost all of the papers for the academic tools claim to do something novel that lets them find bugs not found by other tools. They then run their tool on some codebase and show that the tool is capable of finding new bugs. But since almost no one goes and runs the older tools on any codebase, you’d never know if one of the newer tools only found a subset of the bugs that one of the older tools could catch.

Furthermore, you see cycles (livelock?) in how papers claim to be novel. Paper I will claim that it does X. Paper II will claim that it’s novel because it doesn’t need to do X, unlike Paper I. Then Paper III will claim that it’s novel because, unlike Paper II, it does X.

Distributed systems

Now that we’ve looked at some of the literature on single-machine concurrency bugs, what about distributed concurrency bugs?

Leesatapornwongsa et al. ASPLOS 2016

They looked at 104 bugs in Cassandra, MapReduce, HBase, and Zookeeper. Let’s look at some example bugs, which will clarify the terminology used in the study and make it easier to understand the main findings.

Message-message race

This diagram is just for reference, so that we have a high-level idea of how different parts fit together in MapReduce:

Block diagram of MapReduce

In MapReduce bug #3274, a resource manager sends a task-init message to a node manager. Shortly afterwards, an application master sends a task-kill preemption to the same node manager. The intent is for the task-kill message to kill the task that was started with the task-init message, but the task-kill can win the race and arrive before the task-init. This example happens to be a case where two messages from different nodes are racing to get to a single node.

Another example is MapReduce bug #5358, where an application master sends a kill message to a node manager running a speculative task because another copy of the task finished. However, before the message is received by the node manager, the node manager’s task completes, causing a complete message to be sent to the application master, which throws an exception because it receives a complete message for a task that has already completed.

Message-compute race

One example is MapReduce bug #4157, where the application master unregisters with the resource manager. The application master then cleans up, but that clean-up races against the resource manager sending kill messages to the application’s containers via node managers, causing the application master to get killed. Note that this is classified as a race and not an atomicity bug, which we’ll get to shortly.

Compute-compute races can happen, but they’re outside the scope of this study since this study only looks at distributed concurrency bugs.

Atomicity violation

For the purposes of this study, atomicity bugs are defined as “whenever a message comes in the middle of a set of events, which is a local computation or global communication, but not when the message comes either before or after the events”. According to this definition, the message-compute race we looked at above isn’t a atomicity bug because it would still be a bug if the message came in before the “computation” started. This definition also means that hardware failures that occur inside a block that must be atomic are not considered atomicity bugs.

I can see why you’d want to define those bugs as separate types of bugs, but I find this to be a bit counterintuitive, since I consider all of these to be different kinds of atomicity bugs because they’re different bugs that are caused by breaking up something that needs to be atomic.

In any case, by the definition of this study, MapReduce bug #5009 is an atomicity bug. A node manager is in the process of committing data to HDFS. The resource manager kills the task, which doesn’t cause the commit state to change. Any time the node tries to rerun the commit task, the task is killed by the application master because a commit is believed to already be in progress.

Fault timing

A fault is defined to be a “component failure”, such as a crash, timeout, or unexpected latency. At one point, the paper refers to “hardware faults such as machine crashes”, which seems to indicate that some faults that could be considered software faults are defined as hardware faults for the purposes of this study.

Anyway, for the purposes of this study, an example of a fault-timing issue is MapReduce bug #3858. A node manager crashes while committing results. When the task is re-run, later attempts to commit all fail.

Reboot timing

In this study, reboots are classified separately from other faults. MapReduce bug #3186 illustrates a reboot bug.

A resource manager sends a job to an application master. If the resource manager is rebooted before the application master sends a commit message back to the resource manager, the resource manager loses its state and throws an exception because it’s getting an unexpected complete message.

Some of their main findings are:

47% of examined bugs led to latent failures

That’s a pretty large difference when compared to the DSN ‘10 paper that found that 15% of examined multithreading bugs were latent failures. It’s plausible that this is a real difference and not just something due to a confounding variable, but it’s hard to tell from the data.

This is a large difference from what studies on “local” concurrency bugs found. I wonder how much of that is just because people mostly don’t even bother filing and fixing bugs on hardware faults in non-distributed software.

64% of examined bugs were triggered by a single message’s timing

44% were ordering violations, and 20% were atomicity violations. Furthermore, > 90% of bugs involved three messages (or fewer).

32% of examined bugs were due to fault or reboot timing. Note that, for the purposes of the study, a hardware fault or a reboot that breaks up a block that needed to be atomic isn’t considered an atomicity bug – here, atomicity bugs are bugs where a message arrives in the middle of a computation that needs to be atomic.

70% of bugs had simple fixes

30% were fixed by ignoring the badly timed message and 40% were fixed by delaying or ignoring the message.

Bug causes?

After reviewing the bugs, the authors propose common fallacies that lead to bugs:

  1. One hop is faster than two hops
  2. Zero hops are faster than one hop
  3. Atomic blocks can’t be broken

On (3), the authors note that it’s not just hardware faults or reboots that break up atomic blocks – systems can send kill or pre-emption messages that break up an atomic block. A fallacy I’ve commonly seen in post-mortems that’s not listed here goes something like “bad nodes are obviously bad”. A classic example of this is when a system starts “handling” queries by dropping them quickly, causing a load balancer to shift traffic to the bad node because it’s handling traffic so quickly.

One of my favorite bugs in this class from an actual system was in a ring-based storage system where nodes could do health checks on their neighbors and declare that their neighbors should be dropped if the health check fails. One node went bad, dropped all of its storage, and started reporting its neighbors as bad nodes. Its neighbors noticed that the bad node was bad, but because the bad node had dropped all of its storage, it was super fast and was able to report its good neighbors before the good neighbors could report the bad node. After ejecting its immediate neighbors, the bad node got new neighbors and raced the new neighbors, winning again for the same reason. This was repeated until the entire cluster died.

Tools

Mace

A set of language extensions (on C++) that helps you build distributed systems. Mace has a model checker that can check all possible event orderings of messages, interleaved with crashes, reboots, and timeouts. The Mace model checker is actually available, but AFAICT it requires using the Mace framework, and most distributed systems aren’t written in Mace.

Modist

Another model checker that checks different orderings. Runs only one interleaving of independent actions (partial order reduction) to avoid checking redundant states. Also interleaves timeouts. Unlike Mace, doesn’t inject reboots. Doesn’t appear to be available.

Demeter

Like Modist, in that it’s a model checker that injects the same types of faults. Uses a different technique to reduce the state space, which I don’t know how to summarize succinctly. See paper for details. Doesn’t appear to be available. Googling for Demeter returns some software used to model X-ray absorption?

SAMC

Another model checker. Can inject multiple crashes and reboots. Uses some understanding of the system to avoid redundant re-orderings (e.g., if a series of messages is invariant to when a reboot is injected, the system tries to avoid injecting the reboot between each message). Doesn’t appear to be available.

Jepsen

As was the case for non-distributed concurrency bugs, there’s a vast literature on academic tools, most of which appear to be grad-student code that hasn’t been made available.

And of course there’s Jepsen, which doesn’t have any attached academic papers, but has probably had more real-world impact than any of the other tools because it’s actually available and maintained. There’s also chaos monkey, but if I’m understanding it correctly, unlike the other tools listed, it doesn’t attempt to create reproducible failures.

Conclusion

Is this where you’re supposed to have a conclusion? I don’t have a conclusion. We’ve looked at some literature and found out some information about bugs that’s interesting, but not necessarily actionable. We’ve read about tools that are interesting, but not actually available. And then there are some tools based on old techniques that are available and useful.

For example, the idea inside clang’s TSan, using “happens-before” to find data races, goes back ages. There’s a 2003 paper that discusses “combining two previously known race detection techniques – lockset-based detection and happens-before-based detection – to obtain fewer false positives than lockset-based detection alone”. That’s actually what TSan v1 did, but with TSan v2 they realized the tool would be more impactful if they only used happens-before because that avoids false positives, which means that people will actually use the tool. That’s not something that’s likely to turn into a paper that gets cited zillions of times, though. For anyone who’s looked at how afl works, this story should sound familiar. AFL is eminently practical and has had a very large impact in the real world, mostly by eschewing fancy techniques from the recent literature.

If you must have a conclusion, maybe the conclusion is that individuals like Kyle Kingsbury or Michal Zalewski have had an outsized impact on industry, and that you too can probably pick an underserved area in testing and have a curiously large impact on an entire industry.

Unrelated miscellania

Rose Ames asked me to tell more “big company” stories, so here’s a set of stories that explains why I haven’t put a blog post up for a while. The proximate cause is that my VP has been getting negative comments about my writing. But the reasons for that are a bit of a long story. Part of it is the usual thing: the comments I receive personally skew very heavily positive, but the comments my manager gets run the other way, because it’s weird to email someone’s manager just because you like their writing, but you might send an email if their writing really strikes a nerve.

That explains why someone in my management chain was getting emailed about my writing, but it doesn’t explain why the emails went to my VP. That’s because I switched teams a few months ago, and the org that I was going to switch into overhired and didn’t have any headcount. I’ve heard conflicting numbers about how much they overhired, from 10 or 20 people to 10% or 20% (the org is quite large, and 10% would be much more than 20), as well as conflicting stories about why it happened (honest mistake vs. some group realizing that there was a hiring crunch coming and hiring as much as possible to take all of the reqs from the rest of the org). Anyway, for some reason, the org I would have worked in hired more than it was allowed to by at least one person and instituted a hiring freeze. Since my new manager couldn’t hire me into that org, he transferred into an org that had spare headcount and hired me into the new org. The new org happens to be a sales org, which means that I technically work in sales now; this has some impact on my day-to-day life since there are some resources and tech talks that are only accessible by people in product groups, but that’s another story. Anyway, for reasons that I don’t fully understand, I got hired into the org before my new manager, and during the months it took for the org chart to get updated I was shown as being parked under my VP, which meant that anyone who wanted to fire off an email to my manager would look me up in the directory and accidentally email my VP instead.

It didn’t seem like any individual email was a big deal, but since I don’t have much interaction with my VP and I don’t want to only be known as that guy who writes stuff which generates pushback from inside the company, I paused blogging for a while. I don’t exactly want to be known that way to my manager either, but I interact with my manager frequently enough that at least I won’t only be known for that.

I also wonder if these emails to my manager/VP are more likely at my current employer than at previous employers. I’ve never had this happen (that I know of) at another employer, but the total number of times it’s happened here is low enough that it might just be coincidence.

Then again, I was just reading the archives of a really insightful internal blog and ran across a note that mentioned that the series of blog posts was being published internally because the author got static from Sinofsky about publishing posts that contradicted the party line, which eventually resulted in the author agreeing to email Sinofsky comments related to anything under Sinofsky’s purview instead of publishing the comments publicly. But now that Sinofsky has moved on, the author wanted to share emails that would have otherwise been posts internally.

That kind of thing doesn’t seem to be a freak occurrence around here. Around the same time I saw that thing about Sinofsky, I ran across a discussion on whether or not a PM was within their rights to tell someone to take down a negative review from the app store. Apparently, a PM found out that someone had written a negative rating on the PM’s product in some app store and emailed the rater, telling them that they had to take the review down. It’s not clear how the PM found out that the rater worked for us (do they search the internal directory for every negative rating they find?), but they somehow found out and then issued their demand. Most people thought that the PM was out of line, but there were a non-zero number of people (in addition to the PM) who thought that employees should not say anything that could be construed as negative about the company in public.

I feel like I see more of this kind of thing now than I have at other companies, but the company’s really too big to tell if anyone’s personal experience generalizes. Anyway, I’ll probably start blogging again now that the org chart shows that I report to my actual manager, and maybe my manager will get some emails about that. Or maybe not.

Thanks to Leah Hanson, David Turner, Justin Mason, Joe Wilder, Matt Dziubinski, Alex Blewitt, Bruno Kim Medeiros Cesar, Luke Gilliam, Ben Karas, Julia Evans, Michael Ernst, and Stephen Tu for comments/corrections.


  1. If you’re going to debug bugs. I know some folks at startups who give up on bugs that look like they’ll take more than a few hours to debug because their todo list is long enough that they can’t afford the time. That might be the right decision given the tradeoffs they have, but it’s not the right decision for everyone. [return]
  2. Funny thing about US patent law: you owe treble damages for willfully infringing on a patent. A direct effect of this is that two out of three of my full-time employers have very strongly recommended that I don’t read patents, so I avoid reading patents that aren’t obviously frivolous. And by frivolous, I don’t mean patents for obvious things that any programmer might independently discover, because patents like that are often upheld as valid. I mean patents for things like how to swing on a swing. [return]
  3. I get the incentives that lead to this, and I don’t begrudge researchers for pursuing career success by responding to those incentives, but as a lowly practitioner, it sure would be nice if the incentives were different. [return]

How I learned to program


Tavish Armstrong has a great document where he describes how and when he learned the programming skills he has. I like this idea because I’ve found that the paths that people take to get into programming are much more varied than stereotypes give credit for, and I think it’s useful to see that there are many possible paths into programming.

Personally, I spent a decade working as an electrical engineer before taking a programming job. When I talk to people about this, they often want to take away a smooth narrative of my history. Maybe it’s that my math background gives me tools I can apply to a lot of problems, maybe it’s that my hardware background gives me a good understanding of performance and testing, or maybe it’s that the combination makes me a great fit for hardware/software co-design problems. People like a good narrative. One narrative people seem to like is that I’m a good problem solver, and that problem solving ability is generalizable. But reality is messy. Electrical engineering seemed like the most natural thing in the world, and I picked it up without trying very hard. Programming was unnatural for me, and didn’t make any sense at all for years. If you believe in the common “you either have it or you don’t” narrative about programmers, I definitely don’t have it. And yet, I now make a living programming, and people seem to be pretty happy with the work I do.

How’d that happen? Well, if we go back to the beginning, before becoming a hardware engineer, I spent a fair amount of time doing failed kid-projects (e.g., writing a tic-tac-toe game and AI) and not really “getting” programming. I do sometimes get a lot of value out of my math or hardware skills, but I suspect I could teach someone the actually applicable math and hardware skills I have in less than a year. Spending five years in school and a decade in industry to pick up those skills was a circuitous route to getting where I am. Amazingly, I’ve found that my path has been more direct than that of most of my co-workers, giving the lie to the narrative that most programmers are talented whiz kids who took to programming early.

And while I only use a small fraction of the technical skills I’ve learned on any given day, I find that I have a meta-skill set that I use all the time. There’s nothing profound about the meta-skill set, but because I often work in new (to me) problem domains, I find my meta-skill set to be more valuable than my actual skills. I don’t think that you can communicate the importance of meta-skills (like communication) by writing a blog post any more than you can explain what a monad is by saying that it’s like a burrito. That being said, I’m going to tell this story anyway.

Ineffective fumbling (1980s - 1996)

Many of my friends and I tried and failed multiple times to learn how to program. We tried BASIC, and could write some simple loops, use conditionals, and print to the screen, but never figured out how to do anything fun or useful.

We were exposed to some kind of lego-related programming, uhhh, thing in school, but none of us had any idea how to do anything beyond what was in the instructions. While it was fun, it was no more educational than a video game and had a similar impact.

One of us got a game programming book. We read it, tried to do a few things, and made no progress.

High school (1996 - 2000)

Our ineffective fumbling continued through high school. Due to an interest in gaming, I got interested in benchmarking, which eventually led to learning about CPUs and CPU microarchitecture. This was in the early days of Google, before Google Scholar, and before most CS/EE papers could be found online for free, so this was mostly material from enthusiast sites. Luckily, the internet was relatively young, as were the users on the sites I frequented. Much of the material on hardware was targeted at (and even written by) people like me, which made it accessible. Unfortunately, a lot of the material on programming was written by and targeted at professional programmers, things like Paul Hsieh’s optimization guide. There were some beginner-friendly guides to programming out there, but my friends and I didn’t stumble across them.

We had programming classes in high school: an introductory class that covered Visual Basic and an AP class that taught C++. Both classes were taught by someone who didn’t really know how to program or how to teach programming. My class had a couple of kids who already knew how to program and were making good money doing programming competitions on topcoder, but they failed to test out of the intro class because that test included things like a screenshot of the VB6 IDE, where you got a point for correctly identifying what each button did. The class taught about as much as you’d expect from a class where the pre-test involved identifying UI elements from an IDE.

The AP class the year after was similarly effective. About halfway through the class, a couple of students organized an independent study group which worked through an alternate textbook because the class was clearly not preparing us for the AP exam. I passed the AP exam because it was one of those multiple choice tests that’s possible to pass without knowing the material.

Although I didn’t learn much, I wouldn’t have graduated high school if not for AP classes. I failed enough individual classes that I almost didn’t have enough credits to graduate. I got the necessary credits for two reasons. First, a lot of the teachers had a deal where, if you scored well on the AP exam, they would give you a passing grade in the class (usually an A, but sometimes a B). Second, even that wouldn’t have been enough if my chemistry teacher hadn’t also changed my grade to a passing grade when he found out I did well on the AP chemistry test1.

Other than not failing out of high school, I’m not sure I got much out of my AP classes. My AP CS class actually had a net negative effect on my learning to program because the AP test let me opt out of the first two intro CS classes in college (an introduction to programming and a data structures course). In retrospect, I should have taken the intro classes, but I didn’t, which left me with huge holes in my knowledge that I didn’t really fill in for nearly a decade.

College (2000 - 2003)

Because I’d nearly failed out of high school, there was no reasonable way I could have gotten into a “good” college. Luckily, I grew up in Wisconsin, a state with a “good” school that used a formula to determine who would automatically get admitted: the GPA cutoff depended on standardized test scores, and anyone with standardized test scores above a certain mark was admitted regardless of GPA. During orientation, I talked to someone who did admissions and found out that my year was the last year they used the formula.

I majored in computer engineering and math for reasons that seem quite bad in retrospect. I had no idea what I really wanted to study. I settled on either computer engineering or engineering mechanics because both of those sounded “hard”.

I made a number of attempts to come up with better criteria for choosing a major. The most serious was when I spent a week talking to professors in an attempt to find out what day-to-day life in different fields was like. That approach had two key flaws. First, most professors don’t know what it’s like to work in industry; now that I work in industry and talk to folks in academia, I see that most academics who haven’t done stints in industry have a lot of misconceptions about what it’s like. Second, even if I managed to get accurate descriptions of different fields, it turns out that there’s a wide body of research that indicates that humans are basically hopeless at predicting which activities they’ll enjoy. Ultimately, I decided by coin flip.

Math

I wasn’t planning on majoring in math, but my freshman intro calculus course was so much fun that I ended up adding a math major. That only happened because a high-school friend of mine passed me the application form for the honors calculus sequence because he thought I might be interested in it (he’d already taken the entire calculus sequence as well as linear algebra). The professor for the class covered the material at an unusually fast pace: he finished what was supposed to be a year-long calculus textbook partway through the semester and then lectured on his research for the rest of the semester. The class was theorem-proof oriented and didn’t involve any of that yucky memorization that I’d previously associated with math. That was the first time I’d found school engaging in my entire life and it made me really look forward to going to math classes. I later found out that non-honors calculus involved a lot of memorization when the engineering school required me to go back and take calculus II, which I’d skipped because I’d already covered the material in the intro calculus course.

If I hadn’t had a friend drop the application for honors calculus in my lap, I probably wouldn’t have majored in math and it’s possible I never would have found any classes that seemed worth attending. Even as it was, all of the most engaging undergrad professors I had were math professors2 and I mostly skipped my other classes. I don’t know how much of that was because my math classes were much smaller, and therefore much more customized to the people in the class (computer engineering was very trendy at the time, and classes were overflowing), and how much was because these professors were really great teachers.

Although I occasionally get some use out of the math that I learned, most of the value was in becoming confident that I can learn and work through the math I need to solve any particular problem.

Engineering

In my engineering classes, I learned how to debug and how computers work down to the transistor level. I spent a fair amount of time skipping classes and reading about topics of interest in the library, which included things like computer arithmetic and circuit design. I still have fond memories of Koren’s Computer Arithmetic Algorithms and Chandrakasan et al.’s Design of High-Performance Microprocessor Circuits. I also started reading papers; I spent a lot of time in libraries reading physics and engineering papers that mostly didn’t make sense to me. The notable exception was systems papers, which I found to be easy reading. I distinctly remember reading the Dynamo paper (this was HP’s paper on JITs, not the more recent Amazon work of the same name), but I can’t recall any other papers I read back then.

Internships

I had two internships, one at Micron where I “worked on” flash memory, and another at IBM where I worked on the POWER6. The Micron internship was a textbook example of a bad internship. When I showed up, my manager was surprised that he was getting an intern and had nothing for me to do. After a while (perhaps a day), he found an assignment for me: press buttons on a phone. He’d managed to find a phone that used Micron flash chips; he handed it to me, told me to test it, and walked off.

After poking at the phone for an hour or two and not being able to find any obvious bugs, I walked around and found people who had tasks I could do. Most of them were only slightly less manual than “testing” a phone by mashing buttons, but I did one not-totally-uninteresting task, which was to verify that a flash chip’s controller behaved correctly. Unlike my other tasks, this was amenable to automation and I was able to write a perl script to do the testing for me.

I chose perl because someone had a perl book on their desk that I could borrow, which seemed like as good a reason as any at the time. I called up a friend of mine to tell him about this great “new” language and we implemented Age of Renaissance, a board game we’d played in high school. We didn’t finish, but perl was easy enough to use that we felt like we could write a program that actually did something interesting.

Besides learning perl, I learned that I could ask people for books and read them, and I spent most of the rest of my internship half keeping an eye on a manual task while reading the books people had lying around. Most of the books had to do with either analog circuit design or flash memory, so that’s what I learned. None of the specifics have really been useful to me in my career, but I learned two meta-items that were useful.

First, no one’s going to stop you from spending time reading at work or spending time learning (on most teams). Micron did its best to keep interns from learning by having a default policy of blocking interns from having internet access (managers could override the policy, but mine didn’t), but no one will go out of their way to prevent an intern from reading books when their other task is to randomly push buttons on a phone.

Second, I learned that there are a lot of engineering problems we can solve without anyone knowing why. One of the books I read was a survey of then-current research on flash memory. At the time, flash memory relied on some behaviors that were well characterized but not really understood. There were theories about how the underlying physical mechanisms might work, but determining which theory was correct was still an open question.

The next year, I had a much more educational internship at IBM. I was attached to a logic design team on the POWER6, and since they didn’t really know what to do with me, they had me do verification on the logic they were writing. They had a relatively new tool called SixthSense, which you can think of as a souped-up quickcheck. The obvious skill I learned was how to write tests using a fancy testing framework, but the meta-thing I learned which has been even more useful is the fact that writing a test-case generator and a checker is often much more productive than the manual test-case writing that passes for automated testing in most places.
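To make that concrete, here’s a minimal sketch of the generator-plus-checker idea in Python. Everything in it is hypothetical (the toy adder, the injected bug, the bias toward boundary values); SixthSense itself operated on hardware logic and was far more sophisticated, so treat this as an illustration of the approach rather than a description of that tool.

    import random

    def reference_add(a, b, width=8):
        """Reference model: what an 8-bit adder should do (wraps on overflow)."""
        return (a + b) % (1 << width)

    def buggy_add(a, b, width=8):
        """Hypothetical implementation under test, with a bug injected for illustration."""
        result = (a + b) % (1 << width)
        if a == 0x80 and b == 0x80:  # deliberately wrong on one corner case
            result ^= 1
        return result

    def generate_case(width=8):
        """Test-case generator: random inputs, biased toward boundary values."""
        boundary = [0, 1, (1 << width) - 1, 1 << (width - 1)]
        def pick():
            return random.choice(boundary) if random.random() < 0.5 else random.randrange(1 << width)
        return pick(), pick()

    def check(n_cases=10000):
        """Checker: compare the implementation against the reference on generated cases."""
        for _ in range(n_cases):
            a, b = generate_case()
            expected, actual = reference_add(a, b), buggy_add(a, b)
            if expected != actual:
                print("FAIL: %d + %d: expected %d, got %d" % (a, b, expected, actual))
                return False
        print("all cases passed")
        return True

    if __name__ == "__main__":
        random.seed(0)
        check()

The point isn’t the few lines of generator; it’s that once you have a generator and a checker, you get thousands of test cases for free instead of hand-writing each one.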

The other thing I encountered for the first time at IBM was version control (CVS, unfortunately). Looking back, I find it a bit surprising that not only did I never use version control in any of my classes, but I’d never met any other students who were using version control. My IBM internship was between undergrad and grad school, so I managed to get a B.S. degree without ever using or seeing anyone use version control.

Computer Science

I took a couple of CS classes. The first was algorithms, which was poorly taught and, as a result, so heavily curved that I got an A despite not learning anything at all. The course involved no programming, and while I could have done some implementation in my free time, I was much more interested in engineering and didn’t try to apply any of the material.

The second course was databases. There were a couple of programming projects, but they were all projects where you got some scaffolding and only had to implement a few key methods to make things work, so it was possible to do ok without having any idea how to program. I got involved in a competition to see who could attend the fewest possible classes, didn’t learn anything, and scraped by with a B.

Grad school (2003 - 2005)

After undergrad, I decided to go to grad school for a couple of silly reasons. One was a combination of “why not?” and the argument that most of my professors gave, which was that you’ll never go if you don’t go immediately after undergrad because it’s really hard to go back to school later. But the reason that people don’t go back later is that they have more information (they know what both school and work are like), and they almost always choose work! The other major reason was that I thought I’d get a more interesting job with a master’s degree. That’s not obviously wrong, but it appears to be untrue in general for people going into electrical engineering and programming.

I don’t know that I learned anything that I use today, either in the direct sense or in a meta sense. I had some great professors3 and I made some good friends, but I think that this wasn’t a good use of time because of two bad decisions I made at the age of 19 or 20. Rather than attend a school that had a lot of people working in an area I was interested in, I went with a school that gave me a fellowship but had only one person working in an area I was really interested in. That person left just before I started.

I ended up studying optics, and while learning a new field was a lot of fun, the experience was of no particular value to me, and I could have had fun studying something I had more of an interest in.

While I was officially studying optics, I still spent a lot of time learning unrelated things. At one point, I decided I should learn Lisp or Haskell, probably because of something Paul Graham wrote. I couldn’t find a Lisp textbook in the library, but I found a Haskell textbook. After I worked through the exercises, I had no idea how to accomplish anything practical. But I did learn about list comprehensions and got in the habit of using higher-order functions.

Based on internet comments and advice, I had the idea that learning more languages would teach me how to be a good programmer, so I worked through introductory books on Python and Ruby. As far as I can tell, this taught me basically nothing useful and I would have been much better off learning about a specific area (like algorithms or networking) than learning lots of languages.

First real job (2005 - 2013)

Towards the end of grad school, I mostly looked for, and found, electrical/computer engineering jobs. The one notable exception was Google, which called me up in order to fly me out to Mountain View for an interview. I told them that they probably had the wrong person because they hadn’t even done a phone screen, so they offered to do a phone interview instead. I took the phone interview expecting to fail because I didn’t have any CS background, and I failed as expected. In retrospect, I should have asked to interview for a hardware position, but at the time I didn’t know they had hardware positions, even though they’d been putting together their own servers and designing some of their own hardware for years.

Anyway, I ended up at a little chip company called Centaur. I was hesitant about taking the job because the interview was the easiest interview I had at any company4, which made me wonder if they had a low hiring bar, and therefore relatively weak engineers. It turns out that, on average, that’s the best group of people I’ve ever worked with. I didn’t realize it at the time, but this would later teach me that companies that claim to have brilliant engineers because they have super hard interviews are full of it, and that the interview difficulty one-upmanship a lot of companies promote is more of a prestige play than anything else.

But I’m getting ahead of myself – my first role was something they call “regression debug”, which included debugging test failures for both newly generated tests as well as regression tests. The main goal of this job was to teach new employees the ins-and-outs of the x86 architecture. At the time, Centaur’s testing was very heavily based on chip-level testing done by injecting real instructions, interrupts, etc., onto the bus, so debugging test failures taught new employees everything there is to know about x86.

The Intel x86 manual is thousands of pages long and it isn’t sufficient to implement a compatible x86 chip. When Centaur made its first x86 chip, they followed the Intel manual in perfect detail, and left all instances of undefined behavior up to individual implementors. When they got their first chip back and tried it, they found that some compilers produced code that relied on the behavior that’s technically undefined on x86, but happened to always be the same on Intel chips. While that’s technically a compiler bug, you can’t ship a chip that isn’t compatible with actually existing software, and ever since then, Centaur has implemented x86 chips by making sure that the chips match the exact behavior of Intel chips, down to matching officially undefined behavior5.

For years afterwards, I had encyclopedic knowledge of x86 and could set bits in control registers and MSRs from memory. I didn’t have a use for any of that knowledge at any future job, but the meta-skill of not being afraid of low-level hardware comes in handy pretty often, especially when I run into compiler or chip bugs. People look at you like you’re a crackpot if you say you’ve found a hardware bug, but because we were so careful about characterizing the exact behavior of Intel chips, we would regularly find bugs and then have discussions about whether we should match the bug or match the spec (the Intel manual).

The other thing I took away from the regression debug experience was a lifelong love of automation. Debugging often involves a large number of mechanical steps. After I learned enough about x86 that debugging became boring, I started automating debugging. At that point, I knew how to write simple scripts but didn’t really know how to program, so I wasn’t able to totally automate the process. However, I was able to automate enough that, for 99% of failures, I just had to glance at a quick summary to figure out what the bug was, rather than spend what might be hours debugging. That turned what was previously a full-time job into something that took maybe 30-60 minutes a day (excluding days when I’d hit a bug that involved some obscure corner of x86 I wasn’t already familiar with, or some bug that my script couldn’t give a useful summary of).
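I no longer have those scripts, but the flavor was something like the sketch below: pattern-match each failure log against known signatures and print a one-line summary, falling back to “debug by hand” for anything unrecognized. The log format and the signatures here are invented for illustration; the real x86 failure logs were much messier.

    import re
    import sys

    # Hypothetical failure signatures: a regex over the log maps to a one-line diagnosis.
    SIGNATURES = [
        (re.compile(r"mismatch .* reg=(\w+)"), "architectural register mismatch in %s"),
        (re.compile(r"timeout waiting for (\w+)"), "bus timeout waiting on %s"),
        (re.compile(r"unexpected exception (\w+)"), "spurious exception: %s"),
    ]

    def summarize(log_text):
        """Return a one-line summary for a failure log, or None if it needs a human."""
        for pattern, template in SIGNATURES:
            match = pattern.search(log_text)
            if match:
                return template % match.groups()
        return None

    if __name__ == "__main__":
        for path in sys.argv[1:]:
            with open(path) as f:
                summary = summarize(f.read())
            print("%s: %s" % (path, summary or "no known signature -- debug by hand"))

Glancing at one line per failure instead of reading every log is the difference between a full-time job and a half hour a day.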

At that point, I did two things that I’d previously learned in internships. First, I started reading at work. I began with online commentary about programming, but there wasn’t much of that, so I asked if I could expense books and read them at work. This seemed perfectly normal because a lot of other people did the same thing, and there were at least two people who averaged more than one technical book per week, including one person who averaged a technical book every 2 or 3 days.

I settled in at a pace of somewhere between a book a week and a book a month. I read a lot of engineering books that imparted some knowledge that I no longer use, now that I spend most of my time writing software; some “big idea” software engineering books like Design Patterns and Refactoring, which I didn’t really appreciate because I was just writing scripts; and a ton of books on different programming languages, which doesn’t seem to have had any impact on me.

The only book I read back then that changed how I write software in a way that’s obvious to me was The Design of Everyday Things. The core idea of the book is that while people beat themselves up for failing to use hard-to-understand interfaces, we should blame designers for designing poor interfaces, not users for failing to use them.

If you ever run into a door that you incorrectly try to pull instead of push (or vice versa) and have some spare time, try watching how other people use the door. Whenever I do this, I’ll see something like half the people who try the door use it incorrectly. That’s a design flaw!

The Design of Everyday Things has made me a lot more receptive to API and UX feedback, and a lot less tolerant of programmers who say things like “it’s fine – everyone knows that the arguments to foo and bar just have to be given in the opposite order” or “Duh! Everyone knows that you just need to click on the menu X, select Y, navigate to tab Z, open AA, go to tab AB, and then slide the setting to AC.”

I don’t think all of that reading was a waste of time, exactly, but I would have been better off picking a few sub-fields in CS or EE and learning about them, rather than reading the sorts of books O’Reilly and Manning produce.

It’s not that these books aren’t useful, it’s that almost all of them are written to make sense without any particular background beyond what any random programmer might have, and you can only get so much out of reading your 50th book targeted at random programmers. IMO, most non-academic conferences have the same problem. As a speaker, you want to give a talk that works for everyone in the audience, but a side effect of that is that many talks have relatively little educational value to experienced programmers who have been to a few conferences.

I think I got positive things out of all that reading as well, but I don’t know yet how to figure out what those things are.

As a result of my reading, I also did two things that were, in retrospect, quite harmful.

One was that I really got into functional programming and used a functional style everywhere I could. Immutability, higher-order X for any possible value of X, etc. The result was code that I could write and modify quickly that was incomprehensible to anyone but a couple of coworkers who were also into functional programming.

The second big negative was that I became convinced that perl was causing us a lot of problems. We had perl scripts that were hard to understand and modify. They’d often be thousands of lines of code with only one or two functions, no tests, and every obscure perl feature you could think of. Static! Magic sigils! Implicit everything! You name it, we used it. For me, the last straw was when I inserted a new function between two functions that didn’t explicitly pass any arguments or return values and broke the script: one of the functions was returning a value into an implicit variable that was getting read by the next function, so putting an unrelated function in between the two closely coupled functions broke the data flow.
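The perl specifics (implicit variables like $_) don’t translate directly into other languages, but the failure mode does. Here’s a rough Python analogy, with every name invented: two functions silently coupled through shared module-level state, so that calling an unrelated function between them breaks the second one.

    # Hypothetical analogy: hidden coupling through a module-level variable,
    # standing in for perl's implicit variables.
    _last_result = None

    def parse_config(path):
        global _last_result
        _last_result = {"source": path}  # leaves its result in shared state

    def log_progress(message):
        global _last_result
        print(message)
        _last_result = None  # innocently clobbers the shared state

    def apply_config():
        # Silently depends on whatever parse_config left behind.
        if _last_result is None:
            raise RuntimeError("no config parsed?")
        print("applying", _last_result)

    parse_config("build.cfg")
    log_progress("about to apply config")  # the new function "in between"
    apply_config()                         # now blows up

Nothing at the call sites hints at the dependency, which is exactly why inserting an innocent-looking function in the middle was enough to break the script.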

After that, I convinced a bunch of people to use Ruby and started using it myself. The problem was that I only managed to convince half of my team to do this. The other half kept using Perl, which resulted in language fragmentation. Worse yet, in another group, they also got fed up with Perl, but started using Python, resulting in the company having code in Perl, Python, and Ruby.

Centaur has an explicit policy of not telling people how to do anything, which precludes having team-wide or company-wide standards. Given the environment, using a “better” language seemed like a natural thing to do, but I didn’t recognize the cost of fragmentation until, later in my career, I saw a company that uses standardization to good effect.

Anyway, while I was causing horrific fragmentation, I also automated away most of my regression debug job. I got bored of spending 80% of my time at work reading and I started poking around for other things to do, which is something I continued for my entire time at Centaur. I like learning new things, so I did almost everything you can do related to chip design. The only things I didn’t do were circuit design (the TL of circuit design didn’t want a non-specialist interfering in his area) and a few roles where I was told “Dan, you can do that if you really want to, but we pay you too much to have you do it full-time.”

If I hadn’t interviewed regularly (about once a year, even though I was happy with my job), I probably would’ve wondered if I was stunting my career by doing so many different things, because the big chip companies produce specialists pretty much exclusively. But in interviews I found that my experience was valued because it was something they couldn’t get in-house. The irony is that every single role I was offered would have turned me into a specialist. Big chip companies talk about wanting their employees to move around and try different things, but when you dig into what that means, it’s that they like to have people work one very narrow role for two or three years before moving on to their next very narrow role.

For a while, I wondered if I was doomed to either eventually move to a big company and pick up a hyper-specialized role, or stay at Centaur for my entire career (not a bad fate – Centaur has, by far, the lowest attrition rate of any place I’ve worked because people like it so much). But I later found that software companies building hardware accelerators actually have generalist roles for hardware engineers, and that software companies have generalist roles for programmers, although that might be a moot point since most software folks would probably consider me an extremely niche specialist.

Regardless of whether spending a lot of time in different hardware-related roles makes you think of me as a generalist or a specialist, I picked up a lot of skills which came in handy when I worked on hardware accelerators, but that don’t really generalize to the pure software project I’m working on today. A lot of the meta-skills I learned transfer over pretty well, though.

If I had to pick the three most useful meta-skills I learned back then, I’d say they were debugging, bug tracking, and figuring out how to approach hard problems.

Debugging is a funny skill to claim to have because everyone thinks they know how to debug. For me, I wouldn’t even say that I learned how to debug at Centaur, but that I learned how to be persistent. Non-deterministic hardware bugs are so much worse than non-deterministic software bugs that I always believe I can track down software bugs. In the absolute worst case, when there’s a bug that isn’t caught in logs and can’t be caught in a debugger, I can always add tracing information until the bug becomes obvious. The same thing’s true in hardware, but “recompiling” to add tracing information takes 3 months per “recompile”; compared to that experience, tracking down a software bug that takes three months to figure out feels downright pleasant.

Bug tracking is another meta-skill that everyone thinks they have, but when I look at most projects I find that they literally don’t know what bugs they have and they lose bugs all the time due to a failure to triage bugs effectively. I didn’t even know that I’d developed this skill until after I left Centaur and saw teams that don’t know how to track bugs. At Centaur, depending on the phase of the project, we’d have between zero and a thousand open bugs. The people I worked with most closely kept a mental model of what bugs were open; this seemed totally normal at the time, and the fact that a bunch of people did this made it easy for people to be on the same page about the state of the project and which areas were ahead of schedule and which were behind.

Outside of Centaur, I find that I’m lucky to even find one person who’s tracking what the major outstanding bugs are. Until I’ve been on the team for a while, people are often uncomfortable with the idea of taking a major problem and putting it into a bug instead of fixing it immediately because they’re so used to bugs getting forgotten that they don’t trust bugs. But that’s what bug tracking is for! I view this as analogous to teams whose test coverage is so low and staging system is so flaky that they don’t trust themselves to make changes because they don’t have confidence that issues will be caught before hitting production. It’s a huge drag on productivity, but people don’t really see it until they’ve seen the alternative.

Perhaps the most important meta-skill I picked up was learning how to solve large problems. When I joined Centaur, I saw people solving problems I didn’t even know how to approach. There were folks like Glenn Henry, a fellow from IBM back when IBM was at the forefront of computing, and Terry Parks, who Glenn called the best engineer he knew at IBM. It wasn’t that they were 10x engineers; they didn’t just work faster. In fact, I can probably type 10x as quickly as Glenn (a hunt and peck typist) and could solve trivial problems that are limited by typing speed more quickly than him. But Glenn, Terry, and some of the other wizards knew how to approach problems that I couldn’t even get started on.

I can’t cite any particular a-ha moment. It was just eight years of work. When I went looking for problems to solve, Glenn would often hand me a problem that was slightly harder than I thought possible for me. I’d tell him that I didn’t think I could solve the problem, he’d tell me to try anyway, and maybe 80% of the time I’d solve the problem. We repeated that for maybe five or six years before I stopped telling Glenn that I didn’t think I could solve the problem. Even though I don’t know when it happened, I know that I eventually started thinking of myself as someone who could solve any open problem that we had.

Grad school, again (2008 - 2010)

At some point during my tenure at Centaur, I switched to being part-time and did a stint taking classes and doing a bit of research at the local university. For reasons which I can’t recall, I split my time between software engineering and CS theory.

I read a lot of software engineering papers and came to the conclusion that we know very little about what makes teams (or even individuals) productive, and that the field is unlikely to have actionable answers in the near future. I also got my name on a couple of papers that I don’t think made meaningful contributions to the state of human knowledge.

On the CS theory side of things, I took some graduate level theory classes. That was genuinely educational and I really “got” algorithms for the first time in my life, as well as complexity theory, etc. I could have gotten my name on a paper that I didn’t think made a meaningful contribution to the state of human knowledge, but my would-be co-author felt the same way and we didn’t write it up.

I originally tried grad school again because I was considering getting a PhD, but I didn’t find the work I was doing to be any more “interesting” than the work I had at Centaur, and after seeing the job outcomes of people in the program, I decided there was less than 1% chance that a PhD would provide any real value to me and went back to Centaur full time.

RC (Spring 2013)

After eight years at Centaur, I wanted to do something besides microprocessors. I had enough friends at other hardware companies to know that I’d be downgrading in basically every dimension except name recognition if I switched to another hardware company, so I started applying to software jobs.

While I was applying to jobs, I heard about RC. It sounded great, maybe even too great: when I showed my friends what people were saying about it, they thought the comments were fake. It was a great experience, and I can see why so many people raved about it, to the point where real comments sound impossibly positive. It was transformative for a lot of people; I heard a lot of exclamations like “I learned more here in 3 months here than in N years of school” or “I was totally burnt out and this was the first time I’ve been productive in a year”. It wasn’t transformative for me, but it was as fun a 3 month period as I’ve ever had, and I even learned a thing or two.

From a learning standpoint, the one major thing I got out of RC was feedback from Marek, whom I worked with for about two months. While the freedom and lack of oversight at Centaur was great for letting me develop my ability to work independently, I basically didn’t get any feedback on my work6 since they didn’t do code review while I was there, and I never really got any actionable feedback in performance reviews.

Marek is really great at giving feedback while pair programming, and working with him broke me of a number of bad habits as well as teaching me some new approaches for solving problems. At a meta level, RC is relatively more focused on pair programming than most places and it got me to pair program for the first time. I hadn’t realized how effective pair programming with someone is in terms of learning how they operate and what makes them effective. Since then, I’ve asked a number of super productive programmers to pair program and I’ve gotten something out of it every time.

Second real job (2013 - 2014)

I was in the right place at the right time to land on a project that was just transitioning from Andy Phelps’ pet 20% time project into what would later be called the Google TPU.

As far as I can tell, it was pure luck that I was the second engineer on the project as opposed to the fifth or the tenth. I got to see what it looks like to take a project from its conception and turn it into something real. There was a sense in which I got that at Centaur, but every project I worked on was either part of a CPU, or a tool whose goal was to make CPU development better. This was the first time I worked on a non-trivial project from its inception, where I wasn’t just working on part of the project but the whole thing.

That would have been educational regardless of the methodology used, but it was a particularly great learning experience because of how the design was done. We started with a lengthy discussion on what core algorithm we were going to use. After we figured out an algorithm that would give us acceptable performance, we wrote up design docs for every major module before getting serious about implementation.

Many people consider writing design docs to be a waste of time nowadays, but going through this process, which took months, had a couple big advantages. The first is that working through a design collaboratively teaches everyone on the team everyone else’s tricks. It’s a lot like the kind of skill transfer you get with pair programming, but applied to design. This was great for me, because as someone with only a decade of experience, I was one of the least experienced people in the room.

The second is that the iteration speed is much faster in the design phase, where throwing away a design just means erasing a whiteboard. Once you start coding, iterating on the design can mean throwing away code; for infrastructure projects, that can easily be person-years or even tens of person-years of work. Since working on the TPU project, I’ve seen a couple of teams on projects of similar scope insist on getting “working” code as soon as possible. In every single case, that resulted in massive delays as huge chunks of code had to be re-written, and in a few cases the project was fundamentally flawed in a way that required the team to start over from scratch.

I get that on product-y projects, where you can’t tell how much traction you’re going to get from something, you might want to get an MVP out the door and iterate, but for pure infrastructure, it’s often possible to predict how useful something will be in the design phase.

The other big thing I got out of the job was a better understanding of what’s possible when a company makes a real effort to make engineers productive. Something I’d seen repeatedly at Centaur was that someone would come in, take a look around, find the tooling to be a huge productivity sink, and then make a bunch of improvements. They’d then feel satisfied that they’d improved things a lot and then move on to other problems. Then the next new hire would come in, have the same reaction, and do the same thing. The result was tools that improved a lot while I was there, but not to the point where someone coming in would be satisfied with them. Google was the only place I’d worked where a lot of the tools seem like magic compared to what exists in the outside world7. Sure, people complain that a lot of the tooling is falling over, that there isn’t enough documentation, and that a lot of it is out of date. All true. But the situation is much better than it’s been at any other company I’ve worked at. That doesn’t seem to actually be a competitive advantage for Google’s business, but it makes the development experience really pleasant.

Third real job (2015 - Present)

It’s hard for me to tell what I’ve learned until I’ve had a chance to apply it elsewhere, so this section is a TODO until I move onto another role. I feel like I’m learning a lot right now, but I’ve noticed that feeling like I’m learning a lot at the time is weakly correlated to whether or not I learn skills that are useful in the long run. Unless I get re-org’d or someone makes me an offer I can’t refuse, it seems unlikely that I’d move on until my current project is finished, which seems likely to be at least another 6-12 months.

What about the bad stuff?

When I think about my career, it seems to me that it’s been one lucky event after the next. I’ve been unlucky a few times, but I don’t really know what to take away from the times I’ve been unlucky.

For example, I’d consider my upbringing to be mildly abusive. I remember having nights where I couldn’t sleep because I’d have nightmares about my father every time I fell asleep. Being awake during the day wasn’t a great experience, either. That’s obviously not good, and in retrospect it seems pretty directly related to the academic problems I had until I moved out, but I don’t know that I could give useful advice to a younger version of myself. Don’t be born into an abusive family? That’s something people would already do if they had any control over the matter.

Or to pick a more recent example, I once joined a team that scored a 1 on the Joel Test. The Joel Test is now considered to be obsolete because it awards points for things like “Do you have testers?” and “Do you fix bugs before writing new code?”, which aren’t considered best practices by most devs today. Of the items that aren’t controversial, many seem so obvious that they’re not worth asking about, things like:

  • Can you make a build in one step?
  • Do you make daily builds?
  • Do you have a bug database?
  • Do new candidates write code during their interview?

For anyone who cares about this kind of thing, it’s clearly not a great idea to join a team that does, at most, 1 item off of Joel’s checklist. Getting first-hand experience on a team that scored a 1 didn’t give me any new information that would make me reconsider my opinion.

You might say that I should have asked about those things. It’s true! I should have, and I probably will in the future. However, when I was hired, the TL who was against version control and other forms of automation hadn’t been hired yet, so I wouldn’t have found out about this if I’d asked. Furthermore, even if he’d already been hired, I’m still not sure I would have found out about it – this is the only time I’ve joined a team and then found that most of the factual statements made during the recruiting process were untrue. When I was on that team, every day featured a running joke between team members about how false the recruiting pitch was.

I could try to prevent similar problems in the future by asking for concrete evidence of factual claims (e.g., if someone claims the attrition rate is X, I could ask for access to the HR database to verify), but considering that I have a finite amount of time and the relatively low probability of being told outright falsehoods, I think I’m going to continue to prioritize finding out other information when I’m considering a job and just accept that there’s a tiny probability I’ll end up in a similar situation in the future.

When I look at the bad career-related stuff I’ve experienced, almost all of it falls into one of two categories: something obviously bad that was basically unavoidable, or something obviously bad that I don’t know how to reasonably avoid, given limited resources. I don’t see much to learn from that. That’s not to say that I haven’t made and learned from mistakes. I’ve made a lot of mistakes and do a lot of things differently as a result of mistakes! But my worst experiences have come out of things that I don’t know how to prevent in any reasonable way.

This also seems to be true for most people I know. For example, something I’ve seen a lot is that a friend of mine will end up with a manager whose view is that managers are people who dole out rewards and punishments (as opposed to someone who believes that managers should make the team as effective as possible, or someone who believes that managers should help people grow). When you have a manager like that, a common failure mode is that you’re given work that’s a bad fit, and then maybe you don’t do a great job because the work is a bad fit. If you ask for something that’s a better fit, that’s refused (why should you be rewarded with doing something you want when you’re not doing good work? Instead, you should be punished by having to do more of this thing you don’t like), which causes a spiral that ends in the person leaving or getting fired. In the most recent case I saw, the firing was a surprise to both the person getting fired and their closest co-workers: my friend had managed to find a role that was a good fit despite the best efforts of management; when management decided to fire my friend, they didn’t bother to consult the co-workers on the new project, who thought that my friend was doing great and had been doing great for months!

I hear a lot of stories like that, and I’m happy to listen because I like stories, but I don’t know that there’s anything actionable here. Avoid managers who prefer doling out punishments to helping their employees? Obvious but not actionable.

Conclusion

The most common sort of career advice I see is “you should do what I did because I’m successful”. It’s usually phrased differently, but that’s the gist of it. That basically never works. When I compare notes with friends and acquaintances, it’s pretty clear that my career has been unusual in a number of ways, but it’s not really clear why.

Just for example, I’ve almost always had a supportive manager who’s willing to not only let me learn whatever I want on my own, but who’s willing to expend substantial time and effort to help me improve as an engineer. Most folks I’ve talked to have never had that. Why the difference? I have no idea.

One story might be: the two times I had unsupportive managers, I quickly found other positions, whereas a lot of friends of mine will stay in roles that are a bad fit for years. Maybe I could spin it to make it sound like the moral of the story is that you should leave roles sooner than you think, but both of the bad situations I ended up in, I only ended up in because I left a role sooner than I should have, so the advice can’t be “prefer to leave roles sooner than you think”. Maybe the moral of the story should be “leave bad roles more quickly and stay in good roles longer”, but that’s so obvious that it’s not even worth stating. Every strategy that I can think of is either incorrect in the general case, or so obvious there’s no reason to talk about it.

Another story might be: I’ve learned a lot of meta-skills that are valuable, so you should learn these skills. But you probably shouldn’t. The particular set of meta-skills I’ve picked have been great for me because they’re skills I could easily pick up in places I worked (often because I had a great mentor) and because they’re things I really strongly believe in doing. Your circumstances and core beliefs are probably different from mine and you have to figure out for yourself what it makes sense to learn.

Yet another story might be: while a lot of opportunities come from serendipity, I’ve had a lot of opportunities because I spend a lot of time generating possible opportunities. When I passed around the draft of this post to some friends, basically everyone told me that I emphasized luck too much in my narrative and that all of my lucky breaks came from a combination of hard work and trying to create opportunities. While there’s a sense in which that’s true, many of my opportunities also came out of making outright bad decisions.

For example, I ended up at Centaur because I turned down the chance to work at IBM for a terrible reason! At the end of my internship, my manager made an attempt to convince me to stay on as a full-time employee, but I declined because I was going to grad school. But I was only going to grad school because I wanted to get a microprocessor logic design position, something I thought I couldn’t get with just a bachelor’s degree. But I could have gotten that position if I hadn’t turned my manager down! I’d just forgotten the reason that I’d decided to go to grad school and incorrectly used the cached decision as a reason to turn down the job. By sheer luck, that happened to work out well and I got better opportunities than anyone I know from my intern cohort who decided to take a job at IBM. Have I “mostly” been lucky or prepared? Hard to say; maybe even impossible.

Careers don’t have the logging infrastructure you’d need to determine the impact of individual decisions. Careers in programming, anyway. Many sports now track play-by-play data in a way that makes it possible to try to determine how much of success in any particular game or any particular season was luck and how much was skill.

Take baseball, which is one of the better understood sports. If we look at the statistical understanding we have of performance today, it’s clear that almost no one had a good idea about what factors made players successful 20 years ago. One thing I find particularly interesting is that we now have a much better understanding of which factors are fundamental and which factors come down to luck, and it’s not at all what almost anyone would have thought 20 years ago. We can now look at a pitcher and say something like “they’ve gotten unlucky this season, but their foo, bar, and baz rates are all great so it appears to be bad luck on balls in play as opposed to any sort of decline in skill”, and we can also make statements like “they’ve done well this season but their fundamental stats haven’t moved so it’s likely that their future performance will be no better than their past performance before this season”. We couldn’t have made a statement like that 20 years ago. And this is a sport that’s had play-by-play video available going back what seems like forever, where play-by-play stats have been kept for a century, etc.
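To make that kind of statement concrete: one standard stat of this flavor is BABIP (batting average on balls in play), which strips out home runs and strikeouts to isolate what happens once a ball is actually in the field of play. A pitcher whose opponents’ BABIP is far above the league-typical ~.300 while their strikeout and walk rates are unchanged is usually just getting unlucky. The numbers below are made up for illustration.

    # BABIP = (H - HR) / (AB - K - HR + SF): batting average on balls in play.
    def babip(hits, home_runs, at_bats, strikeouts, sac_flies):
        return (hits - home_runs) / (at_bats - strikeouts - home_runs + sac_flies)

    # Hypothetical opponents' season line: lots of hits allowed, strikeouts unchanged.
    print(round(babip(hits=180, home_runs=20, at_bats=600, strikeouts=150, sac_flies=5), 3))
    # -> 0.368, well above the league-typical ~0.300, which points to bad luck
    # on balls in play rather than a decline in skill.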

In this sport where everything is measured, it wasn’t until relatively recently that we could disambiguate between fluctuations in performance due to luck and fluctuations due to changes in skill. And then there’s programming, where it’s generally believed to be impossible to measure people’s performance and the state of the art in grading people’s performance is that you ask five people for their comments on someone and then aggregate the comments. If we’re only just now able to make comments on what’s attributable to luck and what’s attributable to skill in a sport where every last detail of someone’s work is available, how could we possibly be anywhere close to making claims about what comes down to luck vs. other factors in something as nebulous as a programming career?

In conclusion, life is messy and I don’t have any advice.

Appendix A: meta-skills I’d like to learn

Documentation

I once worked with Jared Davis, a documentation wizard whose documentation was so good that I’d go to him to understand how a module worked before I talked to the owner of the module. As far as I could tell, he wrote documentation on things he was trying to understand to make life easier for himself, but his documentation was so good that it was a force multiplier for the entire company.

Later, at Google, I noticed a curiously strong correlation between the quality of initial design docs and the success of projects. Since then, I’ve tried to write solid design docs and documentation for my projects, but I still have a ways to go.

Fixing totally broken situations

So far, I’ve only landed on teams where things are much better than average and on teams where things are much worse than average. You might think that, because there’s so much low hanging fruit on teams that are much worse than average, it should be easier to improve things on teams that are terrible, but it’s just the opposite. The places that have a lot of problems have problems because something makes it hard to fix the problems.

When I joined the team that scored a 1 on the Joel Test, it took months of campaigning just to get everyone to use version control.

I’ve never seen an environment go from “bad” to “good” and I’d be curious to know what that looks like and how it happens. Yossi Kreinin’s thesis is that only management can fix broken situations. That might be true, but I’m not quite ready to believe it just yet, even though I don’t have any evidence to the contrary.

Appendix B: other “how I became a programmer” stories

Kragen. Describes 27 years of learning to program. Heavy emphasis on conceptual phases of development (e.g., understanding how to use provided functions vs. understanding that you can write arbitrary functions)

Julia Evans. Started programming on a TI-83 in 2004. Dabbled in programming until college (2006-2011) and has been working as a professional programmer ever since. Some emphasis on the “journey” and how long it takes to improve.

Tavish Armstrong. 4th grade through college. Emphasis on particular technologies (e.g., LaTeX or Python).

Caitie McCaffrey. Started programming in AP computer science. Emphasis on how interests led to a career in programming.

Matt DeBoard. Spent 12 weeks learning Django with the help of a mentor. Emphasis on the fact that it’s possible to become a programmer without programming background.

Kristina Chodorow. Started in college. Emphasis on alternatives (math, grad school).

Michael Bernstein. Story of learning Haskell over the course of years. Emphasis on how long it took to become even minimally proficient.

Thanks to Leah Hanson, Lindsey Kuper, Kelley Eskridge, Jeshua Smith, Tejas Sapre, Joe Wilder, Adrien Lamarque, Maggie Zhou, Lisa Neigut, Steve McCarthy, Darius Bacon, Kaylyn Gibilterra, and Sarah Ransohoff for comments/criticism/discussion.


  1. If you happen to have contact information for Mr. Swanson, I’d love to be able to send a note saying thanks. [return]
  2. Wayne Dickey, Richard Brualdi, Andreas Seeger, and a visiting professor whose name escapes me. [return]
  3. I strongly recommend Andy Weiner for any class, as well as the guy who taught mathematical physics when I sat in on it, but I don’t remember who that was or if that’s even the exact name of the class. [return]
  4. with the exception of one government lab, which gave me an offer on the strength of a non-technical on-campus interview. I believe that was literally the first interview I did when I was looking for work, but they didn’t get back to me until well after interview season was over and I’d already accepted an offer. I wonder if that’s because they went down the list of candidates in some order and only got to me after N people turned them down or if they just had a six month latency on offers. [return]
  5. Because Intel sees no reason to keep its competitors informed about what it’s doing, this results in a substantial latency when matching new features. They usually announce enough information that you can implement the basic functionality, but behavior on edge cases may vary. We once had a bug (noticed and fixed well before we shipped, but still problematic) where we bought an engineering sample off of ebay and implemented some new features based on the engineering sample. This resulted in an MWAIT bug that caused Windows to hang; Intel had changed the behavior of MWAIT between shipping the engineering sample and shipping the final version.

    I recently saw a post that claims that you can get great performance per dollar by buying some engineering samples off of ebay. Don’t do this. Engineering samples regularly have bugs. Sometimes those bugs are actual bugs, and sometimes it’s just that Intel changed their minds. Either way, you really don’t want to run production systems off of engineering samples.

    [return]
  6. I occasionally got feedback by taking a problem I’d solved to someone and asking them if they had any better ideas, but that’s much less in depth than the kind of feedback I’m talking about here. [return]
  7. To pick one arbitrary concrete example, look at version control at Microsoft from someone who worked on Windows Vista:

    In small programming projects, there’s a central repository of code. Builds are produced, generally daily, from this central repository. Programmers add their changes to this central repository as they go, so the daily build is a pretty good snapshot of the current state of the product.

    In Windows, this model breaks down simply because there are far too many developers to access one central repository. So Windows has a tree of repositories: developers check in to the nodes, and periodically the changes in the nodes are integrated up one level in the hierarchy. At a different periodicity, changes are integrated down the tree from the root to the nodes. In Windows, the node I was working on was 4 levels removed from the root. The periodicity of integration decayed exponentially and unpredictably as you approached the root so it ended up that it took between 1 and 3 months for my code to get to the root node, and some multiple of that for it to reach the other nodes. It should be noted too that the only common ancestor that my team, the shell team, and the kernel team shared was the root.

    Google and Microsoft both maintained their own forks of perforce because that was the most scalable source control system available at the time. Google would go on to build piper, a distributed version control system (in the distributed systems sense, not in the git sense) that solved the scaling problem while having a dev experience that wasn’t nearly as painful. But that option wasn’t really on the table at Microsoft. In the comments to the post quoted above, a then-manager at Microsoft commented that the possible options were:

    1. federate out the source tree, and pay the forward and reverse integration taxes (primarily delay in finding build breaks), or…
    2. remove a large number of the unnecessary dependencies between the various parts of Windows, especially the circular dependencies.
    3. Both 1&2

    #1 was the winning solution in large part because it could be executed by a small team over a defined period of time. #2 would have required herding all the Windows developers (and PMs, managers, UI designers…), and is potentially an unbounded problem.

    Someone else commented, to me, that they were on an offshoot team that got the one-way latency down from months to weeks. That’s certainly an improvement, but why didn’t anyone build a system like piper? I asked that question of people who were at Microsoft at the time, and I got answers like “when we started using perforce, it was so much faster than what we’d previously had that it didn’t occur to people that we could do much better” and “perforce was so much faster than xcopy that it seemed like magic”.

    This general phenomenon, where people don’t attempt to make a major improvement because the current system is already such a huge improvement over the previous system, is something I’d seen before and even something I’d done before. This example happens to use Microsoft and Google, but please don’t read too much into that. There are systems where things are flipped around and the system at Google is curiously unwieldy compared to the same system at Microsoft.

    [return]

Is developer compensation becoming bimodal?

Developer compensation has skyrocketed since the demise of the Google et al. wage-suppressing no-hire agreement, to the point where compensation rivals and maybe even exceeds compensation in traditionally remunerative fields like law, consulting, etc.

Those fields have sharply bimodal income distributions. Are programmers in for the same fate? Let’s see what data we can find. First, let’s look at data from the National Association for Law Placement, which shows when legal salaries become bimodal.

Lawyers in 1991

First-year lawyer salaries in 1991. $40k median, trailing off with the upper end just under $90k

Median salary is $40k, with the numbers slowly trickling off until about $90k. According to the BLS $90k in 1991 is worth $160k in 2016 dollars. That’s a pretty generous starting salary.

Lawyers in 2000

First-year lawyer salaries in 2000. $50k median; bimodal with peaks at $40k and $125k

By 2000, the distribution had become bimodal. The lower peak is about the same in nominal (non-inflation-adjusted) terms, putting it substantially lower in real (inflation-adjusted) terms, and there’s an upper peak at around $125k, with almost everyone coming in under $130k. $130k in 2000 is $180k in 2016 dollars. The peak on the left has moved from roughly $30k in 1991 dollars to roughly $40k in 2000 dollars; both of those translate to roughly $55k in 2016 dollars. People in the right mode are doing better, while people in the left mode are doing about the same.

I won’t belabor the point with more graphs, but if you look at more recent data, the middle area between the two modes has hollowed out, increasing the level of inequality within the field. As a profession, lawyers have gotten hit hard by automation, and in real terms, 95%-ile offers today aren’t really better than they were in 2000. But 50%-ile and even 75%-ile offers are worse off due to the bimodal distribution.

Programmers in 2015

Enough about lawyers! What about programmers? Unfortunately, it’s hard to get good data on this. Anecdotally, it sure seems to me like we’re going down the same road. Unfortunately, almost all of the public data sources that are available, like H1B data, have salary numbers and not total compensation numbers. Since compensation at the upper end is disproportionately bonus and stock, most data sets I can find don’t capture what’s going on.

One notable exception is the new grad compensation data recorded by Dan Zhang and Jesse Collins:

First-year programmer compensation in 2016. Compensation ranges from $50k to $250k

There’s certainly a wide range here, and while it’s technically bimodal, there isn’t a huge gulf in the middle like you see in law and business. Note that this data is mostly bachelors grads with a few master’s grads. PhD numbers, which sometimes go much higher, aren’t included.

Do you know of a better (larger) source of data? This is from about 100 data points, members of the “Hackathon Hackers” Facebook group, in 2015. Dan and Jesse also have data from 2014, but it would be nice to get data over a wider timeframe and just plain more data. Also, this data is pretty clearly biased towards the high end – if you look at national averages for programmers at all levels of experience, the average comes in much lower than the average for new grads in this data set. The data here match the numbers I hear when we compete for people, but the population of “people negotiating offers at Microsoft” also isn’t representative.

If we had more representative data it’s possible that we’d see a lot more data points in the $40k to $60k range along with the data we have here, which would make the data look bimodal. It’s also possible that we’d see a lot more points in the $40k to $60k range, many more in the $70k to $80k range, some more in the $90k+ range, etc., and we’d see a smooth drop-off instead of two distinct modes.
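
To make that distinction concrete, here’s a rough sketch of how you might check whether a sample looks unimodal or bimodal. This is my own toy example, not part of the original data analysis: it assumes numpy and scipy are available and that there’s a hypothetical comp.csv with one total-compensation number per line.

    # Rough sketch: smooth the sample with a kernel density estimate and
    # count local maxima of the smoothed density. The file name and the
    # number of grid points are arbitrary choices, not from any real survey.
    import numpy as np
    from scipy.stats import gaussian_kde

    comp = np.loadtxt("comp.csv")  # one total-compensation number per line

    kde = gaussian_kde(comp)
    xs = np.linspace(comp.min(), comp.max(), 512)
    density = kde(xs)

    # A grid point is a mode if the smoothed density is higher there than
    # at either neighboring grid point.
    modes = [int(xs[i]) for i in range(1, len(xs) - 1)
             if density[i] > density[i - 1] and density[i] > density[i + 1]]
    print(f"{len(modes)} mode(s), at roughly {modes}")

Of course, with data that’s biased towards the high end, a check like this will happily report one mode even if the true distribution has two, so it’s no substitute for more representative data.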

Stepping back from the meager data we have and looking at the circumstances, “should” programmer compensation be bimodal? Most other fields that have bimodal compensation have a very different compensation structure than we see in programming. For example, top law and consulting firms have an up-or-out structure, which is effectively a tournament; that distorts compensation and makes a bimodal outcome more likely. Additionally, competitive firms pay the same rate to all first-year employees, which they determine by matching whoever appears to be paying the most. For example, this year, Cravath announced that it would pay first-year associates $180k, and many other firms followed suit. Like most high-end firms, Cravath has a salary schedule that’s based entirely on experience:

  • 0 years: $180k
  • 1 year: $190k
  • 2 years: $210k
  • 3 years: $235k
  • 4 years: $260k
  • 5 years: $280k
  • 6 years: $300k
  • 7 years: $315k

In software, compensation tends to be on a case-by-case basis, which makes it much less likely that we’ll see a sharp peak the way we do in law. If I had to guess, I’d say that while the dispersion in programmer compensation is increasing, it’s not bimodal, but I don’t really have the right data set to conclusively say anything. Please point me to any data you have that’s better.

Appendix A: please don’t send me these

  • H-1B: mostly salary only.
  • Stack Overflow survey: salary only. Also, data is skewed by the heavy web focus of the survey – I stopped doing the survey when none of their job descriptions matched anyone in my entire building, and I know other people who stopped for the same reason.
  • Glassdoor: weirdly inconsistent about whether or not it includes stock compensation. Numbers for some companies seem to, but numbers for other companies don’t.
  • O’Reilly survey: salary focused.
  • BLS: doesn’t make fine-grained distribution available.
  • IRS: they must have the data, but they’re not sharing.
  • IDG: only has averages.
  • internal company data: too narrow.
  • compensation survey companies like PayScale: when I’ve talked to people from these companies, they acknowledge that they have very poor visibility into large company compensation, but that’s what drives the upper end of the market (outside of finance).
  • #talkpay on twitter: numbers skew low1.

Appendix B: wtf?

Since we have both programmer and lawyer compensation handy, let’s compare the two. Programming pays so well that it seems a bit absurd. If you look at other careers with similar compensation, there are multiple factors that act as barriers or disincentives to entry.

If you look at law, you have to win the prestige lottery and get into a top school, which will cost hundreds of thousands of dollars. Then you have to win the grades lottery and get good enough grades to get into a top firm. And then you have to continue winning tournaments to avoid getting kicked out, which requires sacrificing any semblance of a personal life. Consulting, investment banking, etc., are similar. Compensation appears to be proportional to the level of sacrifice (e.g., investment bankers are paid better, but work even longer hours than lawyers).

Medicine seems to be a bit better from the sacrifice standpoint because there’s a cartel which limits entry into the field, but the combination of medical school and residency is still incredibly brutal compared to most jobs at places like Facebook and Google.

Programming also doesn’t have a licensing body limiting the number of programmers, nor is there the same prestige filter where you have to go to a top school to get a well-paying job. Sure, there are a lot of startups who basically only hire from MIT, Stanford, CMU, and a few other prestigious schools, and I see job ads like the following whenever I look at startups:

Our team of 14 includes 6 MIT alumni, 3 ex-Googlers, 1 Wharton MBA, 1 MIT Master in CS, 1 CMU CS alum, and 1 “20 under 20” Thiel fellow. Candidates often remark we’re the strongest team they’ve ever seen.

We’re not for everyone. We’re an enterprise SaaS company your mom will probably never hear of. We work really hard 6 days a week because we believe in the future of mobile and we want to win.

That happens. But, in programming, measuring people by markers of prestige seems to be a Silicon Valley startup thing and not a top-paying companies thing. Big companies, which pay a lot better than startups, don’t filter people out by prestige nearly as often. Not only do you not need the right degree from the right school, you also don’t need to have the right kind of degree, or any degree at all. Although it’s getting rarer to not have a degree, I still meet new hires with no experience and either no degree or a degree in an unrelated field (like sociology or philosophy).

How is it possible that programmers are paid so well without these other barriers to entry that similarly remunerative fields have? One possibility is that we have a shortage of programmers. If that’s the case, you’d expect more programmers to enter the field, bringing down compensation. CS enrollments have been at record levels recently, so this may already be happening. Another possibility is that programming is uniquely hard in some way, but that seems implausible to me. Programming doesn’t seem inherently harder than electrical engineering or chemical engineering and it certainly hasn’t gotten much harder over the past decade, but during that timeframe, programming has gone from having similar compensation to most engineering fields to paying much better. The last time I was negotiating with an EE company about offers, they remarked to me that their VPs don’t make as much as I do, and I work at a software company that pays relatively poorly compared to its peers. There’s no reason to believe that we won’t see a flow of people from engineering fields into programming until compensation is balanced.

Another possibility is that U.S. immigration laws act as a protectionist barrier to prop up programmer compensation. It seems impossible for this to last (why shouldn’t there be really valuable non-U.S. companies?), but it does appear to be somewhat true for now. When I was at Google, one thing that was remarkable to me was that they’d pay you approximately the same thing in a small midwestern town as in Silicon Valley, but they’d pay you much less in London. Whenever one of these discussions comes up, people always bring up the “fact” that SV salaries aren’t really as good as they sound because the cost of living is so high, but companies will not only match SV offers in Seattle, they’ll match them in places like Madison, Wisconsin. My best guess for why this happens is that someone in the midwest can credibly threaten to move to SV and take a job at any company there, whereas someone in London can’t2. While we seem unlikely to loosen current immigration restrictions, our immigration restrictions have caused and continue to cause people who would otherwise have founded companies in the U.S. to found companies elsewhere. Given that the U.S. doesn’t have a monopoly on people who found startups and that we do our best to keep people who want to found startups here out, it seems inevitable that there will eventually be Facebooks and Googles founded outside of the U.S. who compete for programmers the same way companies compete inside the U.S.

Another theory that I’ve heard a lot lately is that programmers at large companies get paid a lot because of the phenomenon described in Kremer’s O-ring model. This model assumes that productivity is multiplicative. If your co-workers are better, you’re more productive and produce more value. If that’s the case, you expect a kind of assortative matching where you end up with high-skill firms that pay better, and low-skill firms that pay worse. This model has a kind of intuitive appeal to it, but it can’t explain why programming compensation has higher dispersion than (for example) electrical engineering compensation. With the prevalence of open source, it’s much easier to utilize the work of productive people outside your firm than in most fields. This model should apply less to programming than to most other engineering fields, but the dispersion in compensation is higher.
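
To see why multiplicative productivity pushes towards that kind of sorting, here’s a toy sketch (my own made-up numbers, not Kremer’s) where team output is proportional to the product of worker skill levels:

    # Toy O-ring-style production: output is proportional to the product
    # of worker skill levels. All numbers are made up for illustration.
    def output(skills, value_per_unit=1_000_000):
        total = 1.0
        for q in skills:
            total *= q
        return value_per_unit * total

    high_skill_team = [0.9, 0.9, 0.9]
    low_skill_team = [0.5, 0.5, 0.5]

    # How much is upgrading one open slot from a 0.5 hire to a 0.9 hire worth?
    gain_high = output(high_skill_team + [0.9]) - output(high_skill_team + [0.5])
    gain_low = output(low_skill_team + [0.9]) - output(low_skill_team + [0.5])
    print(f"gain on the high-skill team: {gain_high:,.0f}")  # ~292k
    print(f"gain on the low-skill team:  {gain_low:,.0f}")   # ~50k

The high-skill team gains much more from the same hire, so it can outbid everyone else for skilled workers, which is the sorting (and the pay dispersion) the model predicts.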

I don’t understand this at all and would love to hear a compelling theory for why programming “should” pay more than other similar fields, or why it should pay as much as fields that have much higher barriers to entry.


  1. People often worry that comp surveys will skew high because people want to brag, but the reality seems to be that numbers skew low because people feel embarrassed about sounding like they’re bragging. I have a theory that you can see this reflected in the prices of other goods. For example, if you look at house prices, they’re generally predictable based on location, square footage, amenities, and so on. But there’s a significant penalty for having the largest house on the block, for what (I suspect) is the same reason people with the highest compensation disproportionately don’t participate in #talkpay: people don’t want to admit that they have the highest pay, have the biggest house, or drive the fanciest car. Well, some people do, but on average, bragging about that stuff is seen as quite gauche. [return]
  2. There’s a funny move some companies will do where they station the new employee in Canada for a year before importing them into the U.S., which gets them into a visa process that’s less competitive. But this is enough of a hassle that most employees balk at the idea. [return]

Why's that company so big? I could do that in a weekend

I can’t think of a single large software company that doesn’t regularly draw internet comments of the form “What do all the employees do? I could build their product myself.” Benjamin Pollack and Jeff Atwood called out people who do that with Stack Overflow. But Stack Overflow is relatively obviously lean, so the general response is something like “oh, sure maybe Stack Overflow is lean, but FooCorp must really be bloated”. And since most people have relatively little visibility into FooCorp, for any given value of FooCorp, that sounds like a plausible statement. After all, what product could possibly require hundreds, or even thousands, of engineers?

A few years ago, in the wake of the rapgenius SEO controversy, a number of folks called for someone to write a better Google. Alex Clemmer responded that maybe building a better Google is a non-trivial problem. Considering how much of Google’s $500B market cap comes from search, and how much money has been spent by tens (hundreds?) of competitors in an attempt to capture some of that value, it seems plausible to me that search isn’t a trivial problem. But in the comments on Alex’s posts, multiple people respond and say that Lucene basically does the same thing Google does and that Lucene is poised to surpass Google’s capabilities in the next few years.

What would Lucene at Google’s size look like? If we do a naive back-of-the-envelope calculation on what it would take to index a significant fraction of the internet (often estimated to be 1 trillion (T) or 10T documents), we might expect a 1T document index to cost something like $10B1. That’s not a feasible startup, so let’s say that instead of trying to index 1T documents, we want to maintain an artisanal search index of 1B documents. Then our cost comes down to $12M/yr. That’s not so bad – plenty of startups burn through more than that every year. While we’re in VC-funded hypergrowth mode, that’s fine, but once we have a real business, we’ll want to consider trying to save money. At $12M/yr for the index, a performance improvement that lets us trim our costs by 3% is worth $360k/yr. With those kinds of costs, it’s surely worth it to have at least one engineer working full-time on optimization, if not more.

Businesses that actually care about turning a profit will spend a lot of time (hence, a lot of engineers) working on optimizing systems, even if an MVP for the system could have been built in a weekend. There’s also a wide body of research that’s found that decreasing latency has a roughly linear effect on revenue over a pretty wide range of latencies and businesses. Businesses should keep adding engineers to work on optimization until the cost of adding an engineer equals the revenue gain plus the cost savings at the margin. This is often many more engineers than people realize.

And that’s just performance. Features also matter: when I talk to engineers working on basically any product at any company, they’ll often find that there are seemingly trivial individual features that can add integer percentage points to revenue. Just as with performance, people underestimate how many engineers you can add to a product before engineers stop paying for themselves.

Additionally, features are often much more complex than outsiders realize. If we look at search, how do we make sure that different forms of dates and phone numbers give the same results? How about internationalization? Each language has unique quirks that have to be accounted for. In French, “l’foo” should often match “un foo” and vice versa, but American search engines from the 90s didn’t actually handle that correctly. How about tokenizing Chinese queries, where words don’t have spaces between them, and sentences don’t have unique tokenizations? How about Japanese, where queries can easily contain four different alphabets? How about handling Arabic, which is mostly read right-to-left, except for the bits that are read left-to-right? And that’s not even the most complicated part of handling Arabic! It’s fine to ignore this stuff for a weekend-project MVP, but ignoring it in a real business means ignoring the majority of the market! Some of these are handled ok by open source projects, but many of the problems involve open research problems.
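
To make the tokenization point concrete, here’s a toy sketch (mine, not any real search engine’s tokenizer) of how far naive whitespace tokenization gets you outside of English:

    # Toy tokenizer: split on whitespace, the way a weekend-project MVP might.
    def naive_tokens(text):
        return text.lower().split()

    # French elision: a query for "un foo" should usually match "l'foo", but
    # the whitespace tokenizer produces completely different tokens.
    print(naive_tokens("l'foo"))     # ["l'foo"]
    print(naive_tokens("un foo"))    # ['un', 'foo']

    # Chinese has no spaces between words, so an entire sentence comes back
    # as one giant token and only an exact-sentence query would match it.
    print(naive_tokens("我喜欢看书"))  # ['我喜欢看书']

Real tokenization for these languages (let alone Japanese or Arabic) requires language-specific analysis, which is exactly the kind of work that never shows up in a weekend estimate.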

There’s also security! If you don’t “bloat” your company by hiring security people, you’ll end up like hotmail or yahoo, where your product is better known for how often it’s hacked than for any of its other features.

Everything we’ve looked at so far is a technical problem. Compared to organizational problems, technical problems are straightforward. Distributed systems are considered hard because real systems might drop something like 0.1% of messages, corrupt an even smaller percentage of messages, and see latencies in the microsecond to millisecond range. When I talk to higher-ups and compare what they think they’re saying to what my coworkers think they’re saying, I find that the rate of lost messages is well over 50%, every message gets corrupted, and latency can be months or years2. When people imagine how long it should take to build something, they’re often imagining a team that works perfectly and spends 100% of its time coding. But that’s impossible to scale up. The question isn’t whether or not there will be inefficiencies, but how much inefficiency. A company that could eliminate organizational inefficiency would be a larger innovation than any tech startup, ever. But when doing the math on how many employees a company “should” have, people usually assume that the company is an efficient organization.

This post happens to use search as an example because I ran across some people who claimed that Lucene was going to surpass Google’s capabilities any day now, but there’s nothing about this post that’s unique to search. If you talk to people in almost any field, you’ll hear stories about how people wildly underestimate the complexity of the problems in the field. The point here isn’t that it would be impossible for a small team to build something better than Google search. It’s entirely plausible that someone will have an innovation as great as PageRank, and that a small team could turn that into a viable company. But once that company is past the VC-funded hypergrowth phase and wants to maximize its profits, it will end up with a multi-thousand-person platforms org, just like Google’s, unless the company wants to leave hundreds of millions or billions of dollars a year on the table due to hardware and software inefficiency. And the company will want to handle languages like Thai, Arabic, Chinese, and Japanese, each of which is non-trivial. And the company will want to have relatively good security. And there are the hundreds of little features that users don’t even realize are there, each of which provides a noticeable increase in revenue. It’s “obvious” that companies should outsource their billing, except that when you talk to companies that handle their own billing, they can point to individual features that increase conversion by single or double digit percentages that they can’t get from Stripe or Braintree. That fifty-person billing team is totally worth it, beyond a certain size. And then there’s sales, which most engineers don’t even think of3, not to mention research (which, almost by definition, involves a lot of bets that don’t pan out).

It’s not that all of those things are necessary to run a service at all; it’s that almost every large service is leaving money on the table if they don’t seriously address those things. This reminds me of a common fallacy we see in unreliable systems, where people build the happy path with the idea that the happy path is the “real” work, and that error handling can be tacked on later. For reliable systems, error handling is more work than the happy path. The same thing is true for large services – all of this stuff that people don’t think of as “real” work is more work than the core service4.

I’m experimenting with writing blog posts stream-of-consciousness, without much editing. Both this post and my last post were written that way. Let me know what you think of these posts relative to my “normal” posts!

Thanks to Leah Hanson, Joel Wilder, Kay Rhodes, and Ivar Refsdal for corrections.


  1. In public benchmarks, Lucene appears to get something like 30 QPS - 40 QPS when indexing wikipedia on a single machine. See anandtech, Haque et al., ASPLOS 2015, etc. I’ve seen claims that Lucene can run 10x faster than that on wikipedia but I haven’t seen a reproducible benchmark setup showing that, so let’s say that we can expect to get something like 30 QPS - 300 QPS if we index a wikipedia-sized corpus on one machine.

    Those benchmarks appear to be indexing English Wikipedia, articles only. That’s roughly 50 GB and approximately 5m documents. Estimates of the size of the internet vary, but public estimates often fall into the range of 1 trillion (T) to 10T documents. Say we want to index 1T documents, and we can put 5m documents per machine: we need 1T/5m = 200k machines to handle all of the extra documents. None of the off-the-shelf sharding/distribution solutions that are commonly used with Lucene can scale to 200k machines, but let’s posit that we can solve that problem and can operate a search cluster with 200k machines. We’ll also need to have some replication so that queries don’t return bad results if a single machine goes down. If we replicate every machine once, that’s 400k machines. But that’s 400k machines for just one cluster. If we only have one cluster sitting in some location, users in other geographic regions will experience bad latency to our service, so maybe we want to have ten such clusters. If we have ten such clusters, that’s 4M machines.

    In the Anandtech wikipedia benchmark, they get 30 QPS out of a single-socket Broadwell Xeon D with 64 GB of RAM (enough to fit the index in memory). If we don’t want to employ the army of people necessary to build out and run 4M machines worth of datacenters, AFAICT the cheapest VM that’s plausibly at least as “good” as that machine is the GCE n1-highmem-8, which goes for $0.352/hr. If we multiply that out by 4M machines, that’s a little over $1.4M an hour, or a little more than $12B a year for a service that can’t even get within an order of magnitude of the query rate or latency necessary to run a service like Google or Bing. And that’s just for the index – even a minimal search engine also requires crawling. BTW, people will often claim that this is easy because they have much larger indices in Lucene, but with a posting-list based algorithm like Lucene, you can very roughly think of query rate as inversely related to the number of postings. When you ask these people with their giant indices what their query rate is, you’ll inevitably find that it’s glacial by internet standards. For reference, the core of twitter was a rails app that could handle something like 200 QPS until 2008. If you look at what most people handle with Lucene, it’s often well under 1 QPS, with documents that are much smaller than the average web document, using configurations that damage search relevance too much to be used in commercial search engines (e.g., using stop words). That’s fine, but the fact that people think that sort of experience is somehow relevant to web search is indicative of the problem this post is discussing.
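
    Putting that arithmetic into a few lines (same rough assumptions as above: ~5m wikipedia-sized documents per machine, 2x replication, ten clusters, $0.352/hr per VM; this is napkin math, not real pricing):

        # Back-of-the-envelope index cost using the assumptions above.
        docs_per_machine = 5_000_000
        vm_cost_per_hour = 0.352
        hours_per_year = 24 * 365

        def yearly_index_cost(total_docs, replication=2, clusters=10):
            machines = total_docs / docs_per_machine * replication * clusters
            return machines * vm_cost_per_hour * hours_per_year

        print(f"1B docs: ${yearly_index_cost(1e9):,.0f}/yr")   # roughly $12M/yr
        print(f"1T docs: ${yearly_index_cost(1e12):,.0f}/yr")  # roughly $12B/yr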

    That also assumes that we won’t hit any other scaling problem if we can make 400k VM clusters. But finding an open source index which will scale not only to the number of documents on the internet, but also the number of terms, is non-trivial. Before you read the next section, try guessing how many unique terms there are online. And then if we shard the internet so that we have 5m documents per machine, try guessing how many unique terms you expect to see per shard.

    When I ask this question, I often hear guesses like “ten million” or “ten billion”. But without even looking at the entire internet, just looking at one single document on github, we can find a document with fifty million unique terms:

    Crista Lopes: The largest C++ file we found in GitHub has 528MB, 57 lines of code. Contains the first 50,847,534 primes, all hard coded into an array.

    So there are definitely more than ten million unique terms on the entire internet! In fact, there’s a website out there that has all primes under one trillion. I believe there are something like thirty-seven billion of those. If that website falls into one shard of our index, we’d expect to see more than thirty-seven billion terms in a single shard; that’s more than most people guess we’ll see on the entire internet, and that’s just in one shard that happens to contain one somewhat pathological site. If we try to put the internet into any existing open source index that I know of, not only will it not be able to scale out enough horizontally, many shards will contain data weird enough to make the entire shard fall over if we run a query. That’s nothing against open source software; like any software, it’s designed to satisfy the needs of its users, and none of its users do anything like index the entire internet. As businesses scale up, they run into funny corner cases that people without exposure to the particular domain don’t anticipate.

    People often object that you don’t need to index all of this weird stuff. There have been efforts to build web search engines that only index the “important” stuff, but it turns out that if you ask people to evaluate search engines, some people will type in the weirdest queries they can think of and base their evaluation off of that. And others type in what they think of as normal queries for their day-to-day work even if they seem weird to you (e.g., a biologist might query for GTGACCTTGGGCAAGTTACTTAACCTCTCTGTGCCTCAGTTTCCTCATCTGTAAAATGGGGATAATA). If you want to be anything but a tiny niche player, you have to handle not only the weirdest stuff you can think of, but the weirdest stuff that many people can think of.

    [return]
  2. Recently, I was curious why an org that’s notorious for producing unreliable services produces so many unreliable services. When I asked around about why, I found that upper management was afraid of sending out any sort of positive message about reliability because they were afraid that people would use that as an excuse to slip schedules. Upper management changed their message to include reliability about a year ago, but if you talk to individual contributors, they still believe that the message is that features are the #1 priority and slowing down on features to make things more reliable is bad for your career (and, based on who’s getting promoted, the individual contributors appear to be right). Maybe in another year, the org will have really gotten the message through to the people who hand out promotions, and in another couple of years, enough software will have been written with reliability in mind that they’ll actually have reliable services. Maybe. That’s just the first-order effect. The second-order effect is that their policies have caused a lot of people who care about reliability to go to companies that care more about reliability and less about demo-ing shiny new features. They might be able to fix that in a decade. Maybe. That’s made harder by the fact that the org is in a company that’s well known for having PMs drive features above all else. If that reputation is possible to change, it will probably take multiple decades. [return]
  3. For a lot of products, the sales team is more important than the engineering team. If we build out something rivaling Google search, we’ll probably also end up with the infrastructure required to sell a competitive cloud offering. Google actually tried to do that without having a serious enterprise sales force and the result was that AWS and Azure basically split the enterprise market between them. [return]
  4. This isn’t to say that there isn’t waste or that different companies don’t have different levels of waste. I see waste everywhere I look, but it’s usually not what people on the outside think of as waste. Whenever I read outsider’s descriptions of what’s wasteful at the companies I’ve worked at, they’re almost inevitably wrong. Friends of mine who work at other places also describe the same dynamic. [return]

Developer hiring and the market for lemons

Joel Spolsky has a classic blog post on “Finding Great Developers” where he popularized the meme that great developers are impossible to find, a corollary of which is that if you can find someone, they’re not great. Joel writes,

The great software developers, indeed, the best people in every field, are quite simply never on the market.

The average great software developer will apply for, total, maybe, four jobs in their entire career.

If you’re lucky, if you’re really lucky, they show up on the open job market once, when, say, their spouse decides to accept a medical internship in Anchorage and they actually send their resume out to what they think are the few places they’d like to work at in Anchorage.

But for the most part, great developers (and this is almost a tautology) are, uh, great, (ok, it is a tautology), and, usually, prospective employers recognize their greatness quickly, which means, basically, they get to work wherever they want, so they honestly don’t send out a lot of resumes or apply for a lot of jobs.

Does this sound like the kind of person you want to hire? It should. The corollary of that rule–the rule that the great people are never on the market–is that the bad people–the seriously unqualified–are on the market quite a lot. They get fired all the time, because they can’t do their job. Their companies fail–sometimes because any company that would hire them would probably also hire a lot of unqualified programmers, so it all adds up to failure–but sometimes because they actually are so unqualified that they ruined the company. Yep, it happens.

These morbidly unqualified people rarely get jobs, thankfully, but they do keep applying, and when they apply, they go to Monster.com and check off 300 or 1000 jobs at once trying to win the lottery.

Astute readers, I expect, will point out that I’m leaving out the largest group yet, the solid, competent people. They’re on the market more than the great people, but less than the incompetent, and all in all they will show up in small numbers in your 1000 resume pile, but for the most part, almost every hiring manager in Palo Alto right now with 1000 resumes on their desk has the same exact set of 970 resumes from the same minority of 970 incompetent people that are applying for every job in Palo Alto, and probably will be for life, and only 30 resumes even worth considering, of which maybe, rarely, one is a great programmer. OK, maybe not even one.

Joel’s claim is basically that “great” developers won’t have that many jobs compared to “bad” developers because companies will try to keep “great” developers. Joel also posits that companies can recognize prospective “great” developers easily. But these two statements are hard to reconcile. If it’s so easy to identify prospective “great” developers, why not try to recruit them? You could just as easily make the case that “great” developers are overrepresented in the market because they have better opportunities and it’s the “bad” developers who will cling to their jobs. This kind of adverse selection is common in companies that are declining; I saw that in my intern cohort at IBM1, among other places.

Should “good” developers be overrepresented in the market or underrepresented? If we listen to the anecdotal griping about hiring, we might ask if the market for developers is a market for lemons. This idea goes back to Akerlof’s Nobel-prize-winning 1970 paper, “The Market for ‘Lemons’: Quality Uncertainty and the Market Mechanism”. Akerlof takes used car sales as an example, splitting the market into good used cars and bad used cars (bad cars are called “lemons”). If there’s no way to distinguish between good cars and lemons, good cars and lemons will sell for the same price. Since buyers can’t distinguish between good cars and bad cars, the price they’re willing to pay is based on the average quality of cars on the market. Since owners know if their car is a lemon or not, owners of non-lemons won’t sell because the average price is driven down by the existence of lemons. This results in a feedback loop which causes lemons to be the only thing available.
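
To make the feedback loop concrete, here’s a tiny simulation sketch with toy numbers of my own choosing (not Akerlof’s): quality is uniform on [0, 1], sellers value a car at its quality, and buyers will pay 1.5x the average quality they expect to find on the market.

    # Toy Akerlof-style unraveling: buyers pay 1.5x the expected (average)
    # quality of cars on the market; owners of cars worth more than that
    # price withdraw, which drags the average down further.
    import random

    random.seed(0)
    cars = [random.random() for _ in range(100_000)]  # quality of each car

    for _ in range(20):
        if not cars:
            break
        avg_quality = sum(cars) / len(cars)
        price = 1.5 * avg_quality
        cars = [q for q in cars if q <= price]  # better cars leave the market

    if cars:
        print(f"{len(cars)} of 100,000 cars still for sale; "
              f"best remaining quality: {max(cars):.3f}")
    else:
        print("no cars left on the market")

After a handful of rounds, essentially only the worst cars are left, which is the unraveling Akerlof describes.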

This model is certainly different from Joel’s model. Joel’s model assumes that “great” developers are sticky – that they stay at each job for a long time. This comes from two assumptions; first, that it’s easy for prospective employers to identify who’s “great”, and second, that once someone is identified as “great”, their current employer will do anything to keep them (as in the market for lemons). But the first assumption alone is enough to prevent the developer job market from being a market for lemons. If you can tell that a potential employee is great, you can simply go and offer them twice as much as they’re currently making (something that I’ve seen actually happen). You need an information asymmetry to create a market for lemons, and Joel posits that there’s no information asymmetry.

If we put aside Joel’s argument and look at the job market, there’s incomplete information, but both current and prospective employers have incomplete information, and whose information is better varies widely. It’s actually quite common for prospective employers to have better information than current employers!

Just for example, there’s someone I’ve worked with, let’s call him Bob, who’s saved two different projects by doing the grunt work necessary to keep the project from totally imploding. The projects were both declared successes, promotions went out, they did a big PR blitz which involves seeding articles in all the usual suspects, like Wired, and so on and so forth. That’s worked out great for the people who are good at taking credit for things, but it hasn’t worked out so well for Bob. In fact, someone else I’ve worked with recently mentioned to me that management keeps asking him why Bob takes so long to do simple tasks. The answer is that Bob’s busy making sure the services he works on don’t have global outages when they launch, but that’s not the kind of thing you get credit for in Bob’s org. The result of that is that Bob has a network who knows that he’s great, which makes it easy for him to get a job anywhere else at market rate. But his management chain has no idea, and based on what I’ve seen of offers today, they’re paying him about half what he could make elsewhere. There’s no shortage of cases where information transfer inside a company is so poor that external management has a better view of someone’s productivity than internal management. I have one particular example in mind, but if I just think of the Bob archetype, off the top of my head, I know of four people who are currently in similar situations. It helps that I currently work at a company that’s notorious for being dysfunctional in this exact way, but this happens everywhere. When I worked at a small company, we regularly hired great engineers from big companies that were too clueless to know what kind of talent they had.

Another problem with the idea that “great” developers are sticky is that this assumes that companies are capable of creating groups that developers want to work for on demand. This is usually not the case. Just for example, I once joined a team where the TL was pretty strongly against using version control or having tests. As a result of those (and other) practices, it took five devs one year to produce 10k lines of kinda-sorta working code for a straightforward problem. Additionally, it was a pressure cooker where people were expected to put in 80+ hour weeks, where the PM would shame people into putting in longer hours. Within a year, three of the seven people who were on the team when I joined had left; two of them went to different companies. The company didn’t want to lose those two people, but it wasn’t capable of creating an environment that would keep them.

Around when I joined that team, a friend of mine joined a really great team. They do work that materially impacts the world, they have room for freedom and creativity, a large component of their jobs involves learning new and interesting things, and so on and so forth. Whenever I heard about someone who was looking for work, I’d forward them that team. That team is now full for the foreseeable future because everyone whose network included that team forwarded people into that team. But if you look at the team that lost three out of seven people in a year, that team is hiring. A lot. The result of this dynamic is that, as a dev, if you join a random team, you’re overwhelmingly likely to join a team that has a lot of churn. Additionally, if you know of a good team, it’s likely to be full.

Joel’s model implicitly assumes that, proportionally, there are many more dysfunctional developers than dysfunctional work environments.

At the last conference I attended, I asked most people I met two questions:

  1. Do you know of any companies that aren’t highly dysfunctional?
  2. Do you know of any particular teams that are great and are hiring?

Not one single person told me that their company meets the criteria in (1). A few people suggested that, maybe, Dropbox is ok, or that, maybe, Jane Street is ok, but the answers were of the form “I know a few people there and I haven’t heard any terrible horror stories yet, plus I sometimes hear good stories”, not “that company is great and you should definitely work there”. Most people said that they didn’t know of any companies that weren’t a total mess.

A few people had suggestions for (2), but the most common answer was something like “LOL no, if I knew that I’d go work there”. The second most common answer was of the form “I know some people on the Google Brain team and it sounds great”. There are a few teams that are well known for being great places to work, but they’re so few and far between that it’s basically impossible to get a job on one of them. A few people knew of actual teams that they’d strongly recommend who were hiring, but that was rare. Much rarer than finding a developer who I’d want to work with who would consider moving. If I flipped the question around and asked if they knew of any good developers who were looking for work, the answer was usually “yes”2.

Another problem with the idea that “great” developers are impossible to find because they join companies and then stick around is that developers (and companies) aren’t immutable. Because I’ve been lucky enough to work in environments that allow people to really flourish, I’ve seen a lot of people go from unremarkable to amazing. Because most companies invest pretty much nothing in helping people, you can do really well in this area without investing much effort.

On the flip side, I’ve seen entire teams of devs go on the market because their environment changed. Just for example, I used to know a lot of people who worked at company X under Marc Yun. It was the kind of place that has low attrition because people really enjoy working there. And then Marc left. Over the next two years, literally everyone I knew who worked there left. This one change both created a lemon in the searching-for-a-team job market and put a bunch of good developers on the market. This kind of thing happens all the time, even more now than in the past because of today’s acquisition-heavy environment.

Is developer hiring a market for lemons? Well, it depends on what you mean by that. Both developers and hiring managers have incomplete information. It’s not obvious if having a market for lemons in one direction makes the other direction better or worse. The fact that joining a new team is uncertain makes developers less likely to leave existing teams, which makes it harder to hire developers. But the fact that developers often join teams which they dislike makes it easier to hire developers. What’s the net effect of that? I have no idea.

From where I’m standing, it seems really hard to find a good manager/team, and I don’t know of any replicable strategy for doing so; I have a lot of sympathy for people who can’t find a good fit because I get how hard that is. But I have seen replicable strategies for hiring, so I don’t have nearly as much sympathy for hiring managers who complain that hiring “great” developers is impossible.

When a hiring manager complains about hiring, in every single case I’ve seen so far, the hiring manager has one of the following problems:

  1. They pay too little. The last time I went looking for work, I found a 6x difference in compensation between companies who might hire me in the same geographic region. Basically all of the companies thought that they were competitive, even when they were at the bottom end of the range. I don’t know what it is, but companies always seem to think that they pay well, even when they’re not even close to being in the right range. Almost everyone I talk to tells me that they pay as much as any reasonable company. Sure, there are some companies out there that pay a bit more, but they’re overpaying! You can actually see this if you read Joel’s writing – back when he wrote the post I’m quoting above, he talked about how well Fog Creek paid. A couple years later, he complained that Google was overpaying for college kids with no experience, and more recently he’s pretty much said that you don’t want to work at companies that pay well.

  2. They pass on good or even “great” developers3. Earlier, I claimed that I knew lots of good developers who are looking for work. You might ask, if there are so many good developers looking for work, why’s it so hard to find them? Joel claims that out of a 1000 resumes, maybe 30 people will be “solid” and 970 will be “incompetent”. It seems to me it’s more like 200 will be solid and 20 will be really good. It’s just that almost everyone uses the same filters, so everyone ends up fighting over the 30 people who they think are solid.

    Matasano famously solved their hiring problem by using a different set of filters and getting a different set of people. Despite the resounding success of their strategy, pretty much everyone insists on sticking with the standard strategy of picking people with brand name pedigrees and running basically the same interview process as everyone else, bidding up the price of folks who are trendy and ignoring everyone else.

    If I look at developers I know who are in high-demand today, a large fraction of them went through a multi-year period where they were underemployed and practically begging for interesting work. These people are very easy to hire if you can find them.

  3. They’re trying to hire for some combination of rare skills. Right now, if you’re trying to hire for someone with experience in deep learning and, well, anything else, you’re going to have a bad time.

  4. They’re much more dysfunctional than they realize. I know one hiring manager who complains about how hard it is to hire. What he doesn’t realize is that literally everyone on his team is bitterly unhappy and a significant fraction of his team gives anti-referrals to friends and tells them to stay away.

    That’s an extreme case, but it’s quite common to see a VP or founder baffled by why hiring is so hard when employees consider the place to be mediocre or even bad.

Of these problems, (1), low pay, is both the most common and the simplest to fix.

In the past few years, Oracle and Alibaba have spun up new cloud computing groups in Seattle. This is a relatively competitive area, and both companies have reputations that work against them when hiring4. If you believe the complaints about how hard it is to hire, you wouldn’t think one company, let alone two, could spin up entire cloud teams in Seattle. Both companies solved the problem by paying substantially more than their competitors were offering for people with similar experience. Alibaba became known for such generous offers that when I was negotiating my offer from Microsoft, MS told me that they’d match an offer from any company except Alibaba. I believe Oracle and Alibaba have hired hundreds of engineers over the past few years.

Most companies don’t need to hire anywhere near hundreds of people; they can pay competitively without hiring so many developers that the entire market moves upwards, but they still refuse to do so, while complaining about how hard it is to hire.

(2), filtering out good potential employees, seems like the modern version of “no one ever got fired for hiring IBM”. If you hire someone with a trendy background who’s good at traditional coding interviews and they don’t work out, who could blame you? And no one’s going to notice all the people you missed out on. Like (1), this is something that almost everyone thinks they do well and they’ll say things like “we’d have to lower our bar to hire more people, and no one wants that”. But I’ve never worked at a place that doesn’t filter out a lot of people who end up doing great work elsewhere. I’ve tried to get underrated programmers5 hired at places I’ve worked, and I’ve literally never succeeded in getting one hired. Once, someone I failed to get hired managed to get a job at Google after something like four years being underemployed (and is a star there). That guy then got me hired at Google. Not hiring that guy didn’t only cost them my brilliant friend, it eventually cost them me!

BTW, this illustrates a problem with Joel’s idea that “great” devs never apply for jobs. There’s often a long time period where a “great” dev has an extremely hard time getting hired, even through their network who knows that they’re great, because they don’t look like what people think “great” developers look like. Additionally, Google, which has heavily studied which hiring channels give good results, has found that referrals and internal recommendations don’t actually generate much signal. While people will refer “great” devs, they’ll also refer terrible ones. The referral bonus scheme that most companies set up skews incentives in a way that makes referrals worse than you might expect. Because of this and other problems, many companies don’t weight referrals particularly heavily, and “great” developers still go through the normal hiring process, just like everyone else.

(3), needing a weird combination of skills, can be solved by hiring people with half or a third of the expertise you need and training people. People don’t seem to need much convincing on this one, and I see this happen all the time.

(4), dysfunction, seems hard to fix. If I knew how to do that, I’d be a manager.

As a dev, it seems to me that teams I know of that are actually good environments that pay well have no problems hiring, and that teams that have trouble hiring can pretty easily solve that problem. But I’m biased. I’m not a hiring manager. There’s probably some hiring manager out there thinking: “every developer I know who complains that it’s hard to find a good team has one of these four obvious problems; if only my problems were that easy to solve!”

Thanks to Leah Hanson, David Turner, Tim Abbott, Vaibhav Sagar, Victor Felder, Ezekiel Smithburg, Juliano Bortolozzo Solanho, Stephen Tu, Pierre-Yves Baccou, Jorge Montero, Ben Kuhn, and Lindsey Kuper for comments and corrections.

If you liked this post, you’d probably enjoy this other post on the bogosity of claims that there can’t possibly be discrimination in tech hiring.


  1. The folks who stayed describe an environment that’s mostly missing mid-level people they’d want to work with. There are lifers who’ve been there forever and will be there until retirement, and there are new grads who land there at random. But, compared to their competitors, there are relatively few people with 5-15 years of experience. The person I knew who lasted the longest stayed until the 8-year mark, but he started interviewing with an eye on leaving when he found out the other person on his team who was competent was interviewing; neither one wanted to be the only person on the team doing any work, so they raced to get out the door first. [return]
  2. This section kinda makes it sound like I’m looking for work. I’m not looking for work, although I may end up forced into it if my partner takes a job outside of Seattle. [return]
  3. Moishe Lettvin has a talk I really like, where he talks about a time when he was on a hiring committee and they rejected every candidate that came up, only to find that the “candidates” were actually anonymized versions of their own interviews!

    The bit about when he first started interviewing at Microsoft should sound familiar to MS folks. As is often the case, he got thrown into the interview with no warning and no preparation. He had no idea what to do and, as a result, wrote up interview feedback that wasn’t great. “In classic Microsoft style”, his manager forwarded the interview feedback to the entire team and said “don’t do this”. “In classic Microsoft style” is a quote from Moishe, but I’ve observed the same thing. I’d like to talk about how we have a tendency to do extremely blameful postmortems and how that warps incentives, but that probably deserves its own post.

    Well, I’ll tell one story, in remembrance of someone who recently left my former team for Google. Shortly after that guy joined, he was in the office on a weekend (a common occurrence on his team). A manager from another team pinged him on chat and asked him to sign off on some code from the other team. The new guy, wanting to be helpful, signed off on the code. On Monday, the new guy talked to his mentor and his mentor suggested that he not help out other teams like that. Later, there was an outage related to the code. In classic Microsoft style, the manager from the other team successfully pushed the blame for the outage from his team to the new guy.

    Note that this guy isn’t included in my 3/7 stat because he joined shortly after I did, and I’m not trying to cherry-pick a window with the highest possible attrition.

    [return]
  4. For a while, Oracle claimed that the culture of the Seattle office is totally different from mainline-Oracle culture, but from what I’ve heard, they couldn’t resist Oracle-ifying the Seattle group and that part of the pitch is no longer convincing. [return]
  5. This footnote is a response to Ben Kuhn, who asked me, what types of devs are underrated and how would you find them? I think this group is diverse enough that there’s no one easy way to find them. There are people like “Bob”, who do critical work that’s simply not noticed. There are also people who are just terrible at interviewing, like Jeshua Smith. I believe he’s only once gotten a performance review that wasn’t excellent (that semester, his manager said he could only give out one top rating, and it wouldn’t be fair to give it to only one of his two top performers, so he gave them both average ratings). In every place he’s worked, he’s been well known as someone who you can go to with hard problems or questions, and much higher ranking engineers often go to him for help. I tried to get him hired at two different companies I’ve worked at and he failed both interviews. He sucks at interviews. My understanding is that his interview performance almost kept him from getting his current job, but his references were so numerous and strong that his current company decided to take a chance on him anyway. But he only had those references because his old org has been disintegrating. His new company picked up a lot of people from his old company, so there were many people at the new company that knew him. He can’t get the time of day almost anywhere else. Another person I’ve tried and failed to get hired is someone I’ll call Ashley, who got rejected in the recruiter screening phase at Google for not being technical enough, despite my internal recommendation that she was one of the strongest programmers I knew. But she came from a “nontraditional” background that didn’t fit the recruiter’s idea of what a programmer looked like, so that was that. Nontraditional is a funny term because it seems like most programmers have a “nontraditional” background, but you know what I mean.

    There’s enough variety here that there isn’t one way to find all of these people. Having a filtering process that’s more like Matasano’s and less like Google, Microsoft, Facebook, almost any YC startup you can name, etc., is probably a good start.

    [return]

Programming books you might want to consider reading

There are a lot of “12 CS books every programmer must read” lists floating around out there. That’s nonsense. The field is too broad for almost any topic to be required reading for all programmers, and even if a topic is that important, people’s learning preferences differ too much for any book on that topic to be the best book on the topic for all people.

This is a list of topics and books where I’ve read the book, am familiar enough with the topic to say what you might get out of learning more about the topic, and have read other books and can say why you’d want to read one book over another.

Algorithmic game theory / auction theory / mechanism design

Why should you care? Some of the world’s biggest tech companies run on ad revenue, and those ads are sold through auctions. This field explains how and why they work. Additionally, this material is useful any time you’re trying to figure out how to design systems that allocate resources effectively.1

In particular, incentive compatible mechanism design (roughly, how to create systems that provide globally optimal outcomes when people behave in their own selfish best interest) should be required reading for anyone who designs internal incentive systems at companies. If you’ve ever worked at a large company that “gets” this and one that doesn’t, you’ll see that the company that doesn’t get it has giant piles of money that are basically being lit on fire because the people who set up incentives created systems that are hugely wasteful. This field gives you the background to understand what sorts of mechanisms give you what sorts of outcomes; reading case studies gives you a very long (and entertaining) list of mistakes that can cost millions or even billions of dollars.
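
To make the incentive compatibility idea concrete, here’s a minimal simulation sketch (mine, not from any of the books below): in a sealed-bid second-price auction, bidding your true value does at least as well on average as shading or inflating your bid, which is the classic dominant-strategy result. The uniform value distribution and the specific strategies are just assumptions for illustration.

  import random

  def second_price_auction(bids):
      """Return (winning bidder index, price paid); the winner pays the second-highest bid."""
      order = sorted(range(len(bids)), key=lambda i: bids[i], reverse=True)
      return order[0], bids[order[1]]

  def average_utility(bid_strategy, trials=200000, n_rivals=3):
      """Average surplus for bidder 0, whose private value is uniform on [0, 1].
      Rivals bid their values truthfully."""
      total = 0.0
      for _ in range(trials):
          value = random.random()
          bids = [bid_strategy(value)] + [random.random() for _ in range(n_rivals)]
          winner, price = second_price_auction(bids)
          if winner == 0:
              total += value - price
      return total / trials

  # Truthful bidding is (weakly) best: shading gives up profitable wins,
  # inflating sometimes wins at a price above the bidder's value.
  print("truthful:   ", average_utility(lambda v: v))
  print("shade 20%:  ", average_utility(lambda v: 0.8 * v))
  print("inflate 20%:", average_utility(lambda v: 1.2 * v))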

Krishna; Auction Theory

The last time I looked, this was the only game in town for a comprehensive, modern introduction to auction theory. Covers the classic second price auction result in the first chapter, and then moves on to cover risk aversion, bidding rings, interdependent values, multiple auctions, asymmetric information, and other real-world issues.

Relatively dry. Unlikely to be motivating unless you’re already interested in the topic. Requires an understanding of basic probability and calculus.

Steiglitz; Snipers, Shills, and Sharks: eBay and Human Behavior

Seems designed as an entertaining introduction to auction theory for the layperson. Requires no mathematical background and relegates math to the small print. Covers maybe 1/10th of the material of Krishna, if that. Fun read.

Cramton, Shoham, and Steinberg; Combinatorial Auctions

Discusses things like how FCC spectrum auctions got to be the way they are and how “bugs” in mechanism design can leave hundreds of millions or billions of dollars on the table. This is one of those books where each chapter is by a different author. Despite that, it still manages to be coherent and I didn’t mind reading it straight through. It’s self-contained enough that you could probably read this without reading Krishna first, but I wouldn’t recommend it.

Shoham and Leyton-Brown; Multiagent Systems: Algorithmic, Game-Theoretic, and Logical Foundations

The title is the worst thing about this book. Otherwise, it’s a nice introduction to algorithmic game theory. The book covers basic game theory, auction theory, and other classic topics that CS folks might not already know, and then covers the intersection of CS with these topics. Assumes no particular background in the topic.

Nisan, Roughgarden, Tardos, and Vazirani; Algorithmic Game Theory

A survey of various results in algorithmic game theory. Requires a fair amount of background (consider reading Shoham and Leyton-Brown first). For example, chapter five is basically Devanur, Papadimitriou, Saberi, and Vazirani’s JACM paper, Market Equilibrium via a Primal-Dual Algorithm for a Convex Program, with a bit more motivation and some related problems thrown in. The exposition is good and the result is interesting (if you’re into that kind of thing), but it’s not necessarily what you want if you want to read a book straight through and get an introduction to the field.

Algorithms / Data Structures / Complexity

Why should you care? Well, there’s the pragmatic argument: even if you never use this stuff in your job, most of the best paying companies will quiz you on this stuff in interviews. On the non-bullshit side of things, I find algorithms to be useful in the same way I find math to be useful. The probability of any particular algorithm being useful for any particular problem is low, but having a general picture of what kinds of problems are solved problems, what kinds of problems are intractable, and when approximations will be effective, is often useful.

McDowell; Cracking the Coding Interview

Some problems and solutions, with explanations, matching the level of questions you see in entry-level interviews at Google, Facebook, Microsoft, etc. I usually recommend this book to people who want to pass interviews but not really learn about algorithms. It has just enough to get by, but doesn’t really teach you the why behind anything. If you want to actually learn about algorithms and data structures, see below.

Dasgupta, Papadimitriou, and Vazirani; Algorithms

Everything about this book seems perfect to me. It breaks up algorithms into classes (e.g., divide and conquer or greedy), and teaches you how to recognize what kind of algorithm should be used to solve a particular problem. It has a good selection of topics for an intro book, it’s the right length to read over a few weekends, and it has exercises that are appropriate for an intro book. Additionally, it has sub-questions in the middle of chapters to make you reflect on non-obvious ideas to make sure you don’t miss anything.

I know some folks don’t like it because it’s relatively math-y/proof focused. If that’s you, you’ll probably prefer Skiena.

Skiena; The Algorithm Design Manual

The longer, more comprehensive, more practical, less math-y version of Dasgupta. It’s similar in that it attempts to teach you how to identify problems, use the correct algorithm, and give a clear explanation of the algorithm. The book is well motivated with “war stories” that show the impact of algorithms in real world programming.

CLRS; Introduction to Algorithms

This book somehow manages to make it into half of these “N books all programmers must read” lists despite being so comprehensive and rigorous that almost no practitioners actually read the entire thing. It’s great as a textbook for an algorithms class, where you get a selection of topics. As a class textbook, it’s a nice bonus that it has exercises that are hard enough that they can be used for graduate level classes (about half the exercises from my grad level algorithms class were pulled from CLRS, and the other half were from Kleinberg & Tardos), but this is wildly impractical as a standalone introduction for most people.

Just for example, there’s an entire chapter on Van Emde Boas trees. They’re really neat – it’s a little surprising that a balanced-tree-like structure with O(lg lg n) insert and delete, as well as find, successor, and predecessor, is possible, but a first introduction to algorithms shouldn’t include Van Emde Boas trees.

Kleinberg & Tardos; Algorithm Design

Same comments as for CLRS – it’s widely recommended as an introductory book even though it doesn’t make sense as an introductory book. Personally, I found the exposition in Kleinberg to be much easier to follow than in CLRS, but plenty of people find the opposite.

Demaine; Advanced Data Structures

This is a set of lectures and notes and not a book, but if you want a coherent (but not intractably comprehensive) set of material on data structures that you’re unlikely to see in most undergraduate courses, this is great. The notes aren’t designed to be standalone, so you’ll want to watch the videos if you haven’t already seen this material.

Okasaki; Purely Functional Data Structures

Fun to work through, but, unlike the other algorithms and data structures books, I’ve yet to be able to apply anything from this book to a problem domain where performance really matters.

For a couple years after I read this, when someone would tell me that it’s not that hard to reason about the performance of purely functional lazy data structures, I’d ask them about part of a proof that stumped me in this book. I’m not talking about some obscure super hard exercise, either. I’m talking about something in the main body of the text that the author considered too obvious to explain. No one could explain it. Reasoning about this kind of thing is harder than people often claim.

Dominus; Higher Order Perl

A gentle introduction to functional programming that happens to use Perl. You could probably work through this book just as easily in Python or Ruby.

If you keep up with what’s trendy, this book might seem a bit dated today, but only because so many of the ideas have become mainstream. If you’re wondering why you should care about this “functional programming” thing people keep talking about, and some of the slogans you hear don’t speak to you or are even off-putting (types are propositions, it’s great because it’s math, etc.), give this book a chance.

Levitin; Algorithms

I ordered this off Amazon after seeing these two blurbs: “Other learning-enhancement features include chapter summaries, hints to the exercises, and a detailed solution manual.” and “Student learning is further supported by exercise hints and chapter summaries.” One of these blurbs is even printed on the book itself, but after getting the book, the only self-study resources I could find were some Yahoo Answers posts asking where you could find hints or solutions.

I ended up picking up Dasgupta instead, which was available off an author’s website for free.

Mitzenmacher & Upfal; Probability and Computing: Randomized Algorithms and Probabilistic Analysis

I’ve probably gotten more mileage out of this than out of any other algorithms book. A lot of randomized algorithms are trivial to port to other applications and can simplify things a lot.

The text has enough of an intro to probability that you don’t need to have any probability background. Also, the material on tail bounds (e.g., Chernoff bounds) is useful for a lot of CS theory proofs and isn’t covered in the intro probability texts I’ve seen.
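
As a tiny illustration of why tail bounds are handy (my example, not the book’s): a Chernoff/Hoeffding-style bound tells you how many random samples you need before an empirical average is unlikely to be off by more than some epsilon, which is the kind of guarantee many randomized algorithms are built on.

  import math
  import random

  def samples_needed(eps, delta):
      """Samples so that the empirical mean of i.i.d. draws in [0, 1] is within eps
      of the true mean with probability at least 1 - delta (Hoeffding bound)."""
      return math.ceil(math.log(2.0 / delta) / (2.0 * eps * eps))

  def estimate_fraction(predicate, population, eps=0.01, delta=1e-6):
      """Estimate the fraction of population satisfying predicate, without a full scan."""
      n = samples_needed(eps, delta)
      hits = sum(predicate(random.choice(population)) for _ in range(n))
      return hits / n

  # Hypothetical usage: what fraction of these numbers are divisible by 7?
  data = range(10000000)
  print(estimate_fraction(lambda x: x % 7 == 0, data))  # ~0.1429, within 0.01 w.h.p.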

Sipser; Introduction to the Theory of Computation

Classic intro to theory of computation. Turing machines, etc. Proofs are often given at an intuitive, “proof sketch”, level of detail. A lot of important results (e.g., Rice’s Theorem) are pushed into the exercises, so you really have to do the key exercises. Unfortunately, most of the key exercises don’t have solutions, so you can’t check your work.

For something with a more modern topic selection, maybe see Arora & Barak.

Bernhardt; Computation

Covers a few theory of computation highlights. The explanations are delightful and I’ve watched some of the videos more than once just to watch Bernhardt explain things. Targeted at a general programmer audience with no background in CS.

Kearns & Vazirani; An Introduction to Computational Learning Theory

Classic, but dated and riddled with errors, with no errata available. When I wanted to learn this material, I ended up cobbling together notes from a couple of courses, one by Klivans and one by Blum.

Operating Systems

Why should you care? Having a bit of knowledge about operating systems can save days or weeks of debugging time. This is a regular theme on Julia Evans’s blog, and I’ve found the same thing to be true of my experience. I’m hard pressed to think of anyone who builds practical systems and knows a bit about operating systems who hasn’t found their operating systems knowledge to be a time saver. However, there’s a bias in who reads operating systems books – it tends to be people who do related work! It’s possible you won’t get the same thing out of reading these if you do really high-level stuff.

Silberschatz, Galvin, and Gagne; Operating System Concepts

This was what we used at Wisconsin before the comet book became standard. I guess it’s ok. It covers concepts at a high level and hits the major points, but it’s lacking in technical depth, details on how things work, advanced topics, and clear exposition.

BTW, I’ve heard very good things about the comet book. I just can’t say much about it since I haven’t read it.

Cox, Kaashoek, and Morris; xv6

This book is great! It explains how you can actually implement things in a real system, and it comes with its own implementation of an OS that you can play with. By design, the authors favor simple implementations over optimized ones, so the algorithms and data structures used are often quite different than what you see in production systems.

This book goes well when paired with a book that talks about how more modern operating systems work, like Love’s Linux Kernel Development or Russinovich’s Windows Internals.

Love; Linux Kernel Development

The title can be a bit misleading – this is basically a book about how the Linux kernel works: how things fit together, what algorithms and data structures are used, etc. I read the 2nd edition, which is now quite dated. The 3rd edition has some updates, but introduced some errors and inconsistencies, and is still dated (it was published in 2010, and covers 2.6.34). Even so, it’s a nice introduction to how a relatively modern operating system works.

The other downside of this book is that the author loses all objectivity any time Linux and Windows are compared. Basically every time they’re compared, the author says that Linux has clearly and incontrovertibly made the right choice and that Windows is doing something stupid. On balance, I prefer Linux to Windows, but there are a number of areas where Windows is superior, as well as areas where there’s parity but Windows was ahead for years. You’ll never find out what they are from this book, though.

Russinovich, Solomon, and Ionescu; Windows Internals

The most comprehensive book about how a modern operating system works. It just happens to be about Windows. Coming from a *nix background, I found this interesting to read just to see the differences.

This is definitely not an intro book, and you should have some knowledge of operating systems before reading this. If you’re going to buy a physical copy of this book, you might want to wait until the 7th edition is released (early in 2017).

Downey; The Little Book of Semaphores

Takes a topic that’s normally one or two sections in an operating systems textbook and turns it into its own 300 page book. The book is a series of exercises, a bit like The Little Schemer, but with more exposition. It starts by explaining what a semaphore is, and then has a series of exercises that build up higher-level concurrency primitives.

This book was very helpful when I first started to write threading/concurrency code. I subscribe to the Butler Lampson school of concurrency, which is to say that I prefer to have all the concurrency-related code stuffed into a black box that someone else writes. But sometimes you’re stuck writing the black box, and if so, this book has a nice introduction to the style of thinking required to write maybe possibly not totally wrong concurrent code.
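
For a taste of the style, here’s a sketch in Python of one of the classic constructions the exercises build up to: a reusable barrier made only of semaphores and a counter, in the preloaded two-turnstile style. Treat it as my sketch of the idea rather than code from the book.

  import threading

  class Barrier:
      """Reusable barrier for n threads, built from semaphores and a counter."""
      def __init__(self, n):
          self.n = n
          self.count = 0
          self.mutex = threading.Semaphore(1)
          self.turnstile = threading.Semaphore(0)   # gate into the barrier
          self.turnstile2 = threading.Semaphore(0)  # gate out of the barrier

      def wait(self):
          # Phase 1: the last thread to arrive preloads the first turnstile with n permits.
          with self.mutex:
              self.count += 1
              if self.count == self.n:
                  for _ in range(self.n):
                      self.turnstile.release()
          self.turnstile.acquire()

          # Phase 2: the last thread to leave preloads the second turnstile, which keeps
          # a fast thread from racing ahead into the next round's phase 1 too early.
          with self.mutex:
              self.count -= 1
              if self.count == 0:
                  for _ in range(self.n):
                      self.turnstile2.release()
          self.turnstile2.acquire()

  def worker(i, barrier):
      for round_num in range(3):
          barrier.wait()
          print(f"thread {i} passed the barrier in round {round_num}")

  if __name__ == "__main__":
      b = Barrier(4)
      threads = [threading.Thread(target=worker, args=(i, b)) for i in range(4)]
      for t in threads:
          t.start()
      for t in threads:
          t.join()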

I wish someone would write a book in this style, but both lower level and higher level. I’d love to see exercises like this, but starting with instruction-level primitives for a couple different architectures with different memory models (say, x86 and Alpha) instead of semaphores. If I’m writing grungy low-level threading code today, I’m overwhelmingly likely to be using C++11 threading primitives, so I’d like something that uses those instead of semaphores, which I might have used if I was writing threading code against the Win32 API. But since that book doesn’t exist, this seems like the next best thing.

I’ve heard that Doug Lea’s Concurrent Programming in Java is also quite good, but I’ve only taken a quick look at it.

Computer architecture

Why should you care? The specific facts and trivia you’ll learn will be useful when you’re doing low-level performance optimizations, but the real value is learning how to reason about tradeoffs between performance and other factors, whether that’s power, cost, size, weight, or something else.

In theory, that kind of reasoning should be taught regardless of specialization, but my experience is that comp arch folks are much more likely to “get” that kind of reasoning and do back of the envelope calculations that will save them from throwing away a 2x or 10x (or 100x) factor in performance for no reason. This sounds obvious, but I can think of multiple production systems at large companies that are giving up 10x to 100x in performance which are operating at a scale where even a 2x difference in performance could pay a VP’s salary – all because people didn’t think through the performance implications of their design.
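
To make that concrete, here’s the flavor of back-of-the-envelope calculation I mean. Every number below is made up for illustration (request rate, per-request CPU cost, machine cost), so treat it as a template, not a real estimate.

  # All numbers are hypothetical, purely to show the shape of the calculation.
  requests_per_sec = 1_000_000       # aggregate load on the service
  cpu_sec_per_request = 0.010        # CPU cost of the current design
  cores_per_machine = 32
  dollars_per_machine_year = 10_000  # fully loaded: hardware, power, ops

  cores = requests_per_sec * cpu_sec_per_request
  machines = cores / cores_per_machine
  cost = machines * dollars_per_machine_year
  print(f"current design:    ~{machines:,.0f} machines, ~${cost:,.0f}/year")

  # A design that wastes 10x the CPU per request needs ~10x the machines;
  # at this (made-up) scale, that difference is tens of millions of dollars a year.
  print(f"10x slower design: ~{10 * machines:,.0f} machines, ~${10 * cost:,.0f}/year")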

Hennessy & Patterson; Computer Architecture: A Quantitative Approach

This book teaches you how to do systems design with multiple constraints (e.g., performance, TCO, and power) and how to reason about tradeoffs. It happens to mostly do so using microprocessors and supercomputers as examples.

New editions of this book have substantive additions and you really want the latest version. For example, the latest version added, among other things, a chapter on data center design that answers questions like: how much opex/capex is spent on power, power distribution, and cooling vs. support staff and machines; what’s the effect of using lower power machines on tail latency and result quality (Bing search results are used as an example); and what other factors should you consider when designing a data center.

Assumes some background, but that background is presented in the appendices (which are available online for free).

Shen & Lipasti; Modern Processor Design

Presents most of what you need to know to architect a high performance Pentium Pro (1995) era microprocessor. That’s no mean feat, considering the complexity involved in such a processor. Additionally, presents some more advanced ideas and bounds on how much parallelism can be extracted from various workloads (and how you might go about doing such a calculation). Has an unusually large section on value prediction, because the authors invented the concept and it was still hot when the first edition was published.

For pure CPU architecture, this is probably the best book available.

Hill, Jouppi, and Sohi; Readings in Computer Architecture

Read for historical reasons and to see how much better we’ve gotten at explaining things. For example, compare Amdahl’s paper on Amdahl’s law (two pages, with a single non-obvious graph presented, and no formulas), vs. the presentation in a modern textbook (one paragraph, one formula, and maybe one graph to clarify, although it’s usually clear enough that no extra graph is needed).

This seems to be worse the further back you go; since comp arch is a relatively young field, nothing here is really hard to understand. If you want to see a dramatic example of how we’ve gotten better at explaining things, compare Maxwell’s original paper on Maxwell’s equations to a modern treatment of the same material. Fun if you like history, but a bit of a slog if you’re just trying to learn something.
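
For reference, the “one paragraph, one formula” modern presentation of Amdahl’s law really is this small: if a fraction p of the work is sped up by a factor s, the overall speedup is 1 / ((1 - p) + p / s). A trivial sketch:

  def amdahl_speedup(p, s):
      """Overall speedup when a fraction p of the work is sped up by a factor s."""
      return 1.0 / ((1.0 - p) + p / s)

  # Speeding up 90% of the work by 10x only yields ~5.3x overall, and even an
  # infinite speedup of that 90% is capped at 10x by the remaining serial 10%.
  print(amdahl_speedup(0.90, 10))    # ~5.26
  print(amdahl_speedup(0.90, 1e12))  # ~10.0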

Misc

Beyer, Jones, Petoff, and Murphy; Site Reliability Engineering

A description of how Google handles operations. Has the typical Google tone, which is off-putting to a lot of folks with a “traditional” ops background, and assumes that many things can only be done with the SRE model when they can, in fact, be done without going full SRE.

For a much longer description, see this 22 page set of notes on Google’s SRE book.

Fowler, Beck, Brant, Opdyke, and Roberts; Refactoring

At the time I read it, it was worth the price of admission for the section on code smells alone. But this book has been so successful that the ideas of refactoring and code smells have become mainstream.

Steve Yegge has a great pitch for this book:

When I read this book for the first time, in October 2003, I felt this horrid cold feeling, the way you might feel if you just realized you’ve been coming to work for 5 years with your pants down around your ankles. I asked around casually the next day: “Yeah, uh, you’ve read that, um, Refactoring book, of course, right? Ha, ha, I only ask because I read it a very long time ago, not just now, of course.” Only 1 person of 20 I surveyed had read it. Thank goodness all of us had our pants down, not just me.

If you’re a relatively experienced engineer, you’ll recognize 80% or more of the techniques in the book as things you’ve already figured out and started doing out of habit. But it gives them all names and discusses their pros and cons objectively, which I found very useful. And it debunked two or three practices that I had cherished since my earliest days as a programmer. Don’t comment your code? Local variables are the root of all evil? Is this guy a madman? Read it and decide for yourself!

DeMarco & Lister; Peopleware

This book seemed convincing when I read it in college. It even had all sorts of studies backing up what they said. No deadlines is better than having deadlines. Offices are better than cubicles. Basically all devs I talk to agree with this stuff.

But virtually every successful company is run the opposite way. Even Microsoft is remodeling buildings from individual offices to open plan layouts. Could it be that all of this stuff just doesn’t matter that much? If it really is that important, how come companies that are true believers, like Fog Creek, aren’t running roughshod over their competitors?

This book agrees with my biases and I’d love for this book to be right, but the meta evidence makes me want to re-read this with a critical eye and look up primary sources.

Drummond; Renegades of the Empire

This is the story of the development of DirectX. It also explains how Microsoft’s aggressive culture got to be the way it is today. The intro reads:

Microsoft didn’t necessarily hire clones of Gates (although there were plenty on the corporate campus) so much as recruit those who shared some of Gates’s more notable traits – arrogance, aggressiveness, and high intelligence.

Gates is infamous for ridiculing someone’s idea as “stupid”, or worse, “random”, just to see how he or she defends a position. This hostile managerial technique invariably spread through the chain of command and created a culture of conflict.

Microsoft nurtures a Darwinian order where resources are often plundered and hoarded for power, wealth, and prestige. A manager who leaves on vacation might return to find his turf raided by a rival and his project put under a different command or canceled altogether.

On interviewing at Microsoft:

“What do you like about Microsoft?” “Bill kicks ass”, St. John said. “I like kicking ass. I enjoy the feeling of killing competitors and dominating markets”.

St. John was hired, and he could do no wrong for years. This book tells the story of him and a few others like him. Read this book if you’re considering a job at Microsoft. I wish I’d read this before joining and not after!

Math

Why should you care? From a pure ROI perspective, I doubt learning math is “worth it” for 99% of jobs out there. AFAICT, I use math more often than most programmers, and I don’t use it all that often. But having the right math background sometimes comes in handy and I really enjoy learning math. YMMV.

Bertsekas; Introduction to Probability

Introductory undergrad text that tends towards intuitive explanations over epsilon-delta rigor. For anyone who cares to do more rigorous derivations, there are some exercises at the back of the book that go into more detail.

Has many exercises with available solutions, making this a good text for self-study.

Ross; A First Course in Probability

This is one of those books where they regularly crank out new editions to make students pay for new copies of the book (this is presently priced at a whopping $174 on Amazon)2. This was the standard text when I took probability at Wisconsin, and I literally cannot think of a single person who found it helpful. Avoid.

Brualdi; Introductory Combinatorics

Brualdi is a great lecturer, one of the best I had in undergrad, but this book was full of errors and not particularly clear. There have been two new editions since I used this book, but according to the Amazon reviews the book still has a lot of errors.

For an alternate introductory text, I’ve heard good things about Camina & Lewis’s book, but I haven’t read it myself. Also, Lovasz is a great book on combinatorics, but it’s not exactly introductory.

Apostol; Calculus

Volume 1 covers what you’d expect in a calculus I + calculus II book. Volume 2 covers linear algebra and multivariable calculus. It covers linear algebra before multivariable calculus, which makes multivariable calculus a lot easier to understand.

It also makes a lot of sense from a programming standpoint, since a lot of the value I get out of calculus is its applications to approximations, etc., and that’s a lot clearer when taught in this sequence.
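
As a trivial example of the kind of approximation I mean (mine, not Apostol’s): the first-order expansion f(x + h) ≈ f(x) + f'(x) * h, which for the square root around 1 gives sqrt(1 + h) ≈ 1 + h / 2.

  import math

  # First-order (linear) approximation: f(x + h) ~ f(x) + f'(x) * h.
  # For f(x) = sqrt(x) around x = 1, this gives sqrt(1 + h) ~ 1 + h / 2.
  for h in (0.1, 0.01, 0.001):
      approx = 1 + h / 2
      exact = math.sqrt(1 + h)
      print(f"h={h}: approx={approx:.6f} exact={exact:.6f} error={abs(approx - exact):.2e}")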

This book is probably a rough intro if you don’t have a professor or TA to help you along. The Springer SUMS series tends to be pretty good for self-study introductions to various areas, but I haven’t actually read their intro calculus book, so I can’t recommend it.

Stewart; Calculus

Another one of those books where they crank out new editions with trivial changes to make money. This was the standard text for non-honors calculus at Wisconsin, and the result was that I taught a lot of people to do complex integrals with the methods covered in Apostol, which are much more intuitive to many folks.

This book takes the approach that, for a type of problem, you should pattern match to one of many possible formulas and then apply the formula. Apostol is more about teaching you a few tricks and some intuition that you can apply to a wide variety of problems. I’m not sure why you’d buy this unless you were required to for some class.

Hardware basics

Why should you care? People often claim that, to be a good programmer, you have to understand every abstraction you use. That’s nonsense. Modern computing is too complicated for any human to have a real full-stack understanding of what’s going on. In fact, one reason modern computing can accomplish what it does is that it’s possible to be productive without having a deep understanding of much of the stack that sits below the level you’re operating at.

That being said, if you’re curious about what sits below software, here are a few books that will get you started.

Nisan & Schocken; nand2tetris

If you only want to read one single thing, this should probably be it. It’s a “101” level intro that goes down to gates and boolean logic. As implied by the name, it takes you from NAND gates to a working Tetris program.
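
As a taste of the early chapters, everything gets built up from a single NAND primitive. This is my sketch in Python; the book has you do the same thing in its own simple hardware description language.

  def nand(a, b):
      """The one primitive gate; everything below is derived from it."""
      return 0 if (a and b) else 1

  def not_(a):
      return nand(a, a)

  def and_(a, b):
      return not_(nand(a, b))

  def or_(a, b):
      return nand(not_(a), not_(b))

  def xor_(a, b):
      return and_(or_(a, b), nand(a, b))

  def half_adder(a, b):
      """Add two bits: returns (sum, carry)."""
      return xor_(a, b), and_(a, b)

  # Truth table for a half adder built entirely out of NAND gates.
  for a in (0, 1):
      for b in (0, 1):
          s, c = half_adder(a, b)
          print(f"{a} + {b} -> sum={s} carry={c}")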

Roth; Fundamentals of Logic Design

Much more detail on gates and logic design than you’ll see in nand2tetris. The book is full of exercises and appears to be designed to work for self-study. Note that the link above is to the 5th edition. There are newer editions, but they don’t seem to be much improved, have a lot of errors in the new material, and are much more expensive.

Weste, Harris, and Banerjee; CMOS VLSI Design

One level below boolean gates, you get to VLSI, a historical acronym (very large scale integration) that doesn’t really have any meaning today.

Broader and deeper than the alternatives, with clear exposition. Explores the design space (e.g., the section on adders doesn’t just mention a few different types in an ad hoc way, it explores all the tradeoffs you can make). Also, has both problems and solutions, which makes it great for self-study.

Kang & Leblebici; CMOS Digital Integrated Circuits

This was the standard text at Wisconsin way back in the day. It was hard enough to follow that the TA basically re-explained pretty much everything necessary for the projects and the exams. I find that it’s ok as a reference, but it wasn’t a great book to learn from.

Compared to this book, Weste et al. spend a lot more effort talking about tradeoffs in design (e.g., when creating a parallel prefix tree adder, what does it really mean to be at some particular point in the design space?).

Pierret; Semiconductor Device Fundamentals

One level below VLSI, you have how transistors actually work.

Really beautiful explanation of solid state devices. The text nails the fundamentals of what you need to know to really understand this stuff (e.g., band diagrams), and then uses those fundamentals along with clear explanations to give you a good mental model of how different types of junctions and devices work.

Streetman & Banerjee; Solid State Electronic Devices

Covers the same material as Pierret, but seems to substitute mathematical formulas for the intuitive understanding that Pierret goes for.

Ida; Engineering Electromagnetics

One level below transistors, you have electromagnetics.

Two to three times thicker than other intro texts because it has more worked examples and diagrams. Breaks things down into types of problems and subproblems, making things easy to follow. For self-study, a much gentler introduction than Griffiths or Purcell.

Shanley; Pentium Pro and Pentium II System Architecture

Unlike the other books in this section, this book is about practice instead of theory. It’s a bit like Windows Internals, in that it goes into the details of a real, working, system. Topics include hardware bus protocols, how I/O actually works (e.g., APIC), etc.

The problem with a practical introduction is that there’s been an exponential increase in complexity ever since the 8080. The further back you go, the easier it is to understand the most important moving parts in the system, and the more irrelevant the knowledge. This book seems like an ok compromise in that the bus and I/O protocols had to handle multiprocessors, and many of the elements that are in modern systems were in these systems, just in a simpler form.

Not covered

Of the books that I’ve liked, I’d say this captures at most 25% of the software books and 5% of the hardware books. On average, the books that have been left off the list are more specialized. This list is also missing many entire topic areas, like PL, practical books on how to learn languages, networking, etc.

The reasons for leaving off topic areas vary; I don’t have any PL books listed because I don’t read PL books. I don’t have any networking books because, although I’ve read a couple, I don’t know enough about the area to really say how useful the books are. The vast majority of hardware books aren’t included because they cover material that you wouldn’t care about unless you were a specialist (e.g., Skew-Tolerant Circuit Design or Ultrafast Optics). The same goes for areas like math and CS theory, where I left off a number of books that I think are great but have basically zero probability of being useful in my day-to-day programming life, e.g., Extremal Combinatorics. I also didn’t include books I didn’t read all or most of, unless I stopped because the book was atrocious. This means that I don’t list classics I haven’t finished like SICP and The Little Schemer, since those books seem fine and I just didn’t finish them for one reason or another.

This list also doesn’t include books on history and culture, like Inside Intel or Masters of Doom. I’ll probably add a category for those at some point, but I’ve been trying an experiment where I try to write more like Julia Evans (stream of consciousness, fewer or no drafts). I’d have to go back and re-read the books I read 10+ years ago to write meaningful comments, which doesn’t exactly fit with the experiment. On that note, since this list is from memory and I got rid of almost all of my books a couple years ago, I’m probably forgetting a lot of books that I meant to add.

If you liked this, you might also like this list of programming blogs, which is written in a similar style.


  1. Also, if you play boardgames, auction theory explains why fixing game imbalance via an auction mechanism is non-trivial and often makes the game worse. [return]
  2. I talked to the author of one of these books. He griped that the used book market destroys revenue from textbooks after a couple years, and that authors don’t get much in royalties, so you have to charge a lot of money and keep producing new editions every couple of years to make money. That griping goes double in cases where a new author picks up a classic book that someone else originally wrote, since the original author often has a much larger share of the royalties than the new author, despite doing no work on the later editions. [return]

HN comments are underrated


HN comments are terrible. On any topic I’m informed about, the vast majority of comments are pretty clearly wrong. Most of the time, there are zero comments from people who know anything about the topic and the top comment is reasonable sounding but totally incorrect. Additionally, many comments are gratuitously mean. You’ll often hear mean comments backed up with something like “this is better than the other possibility, where everyone just pats each other on the back with comments like ‘this is great’”, as if being an asshole is some sort of talisman against empty platitudes. I’ve seen people push back against that; when pressed, people often say that it’s either impossible or inefficient to teach someone without being mean, as if telling someone that they’re stupid somehow helps them learn. It’s as if people learned how to explain things by watching Simon Cowell and can’t comprehend the concept of an explanation that isn’t littered with personal insults. Paul Graham has said, “Oh, you should never read Hacker News comments about anything you write”. Most of the negative things you hear about HN comments are true.

And yet, I haven’t found a public internet forum with better technical commentary. On topics I’m familiar with, while it’s rare that a thread will have even a single comment that’s well-informed, when those comments appear, they usually float to the top. On other forums, well-informed comments are either non-existent or get buried by reasonable sounding but totally wrong comments when they appear, and they appear even more rarely than on HN.

By volume, there are probably more interesting technical “posts” in comments than in links. Well, that depends on what you find interesting, but that’s true for my interests. If I see a low-level optimization comment from nkurz, a comment on business from patio11, a comment on how companies operate by nostrademons, I almost certainly know that I’m going to read an interesting comment. There are maybe 20 to 30 people I can think of who don’t blog much, but write great comments on HN and I doubt I even know of half the people who are writing great comments on HN1.

I compiled a very abbreviated list of comments I like because comments seem to get lost. If you write a blog post, people will refer back to it years later, but comments mostly disappear. I think that’s sad – there’s a lot of great material on HN (and yes, even more not-so-great material).

What’s the deal with MS Word’s file format?

Basically, the Word file format is a binary dump of memory. I kid you not. They just took whatever was in memory and wrote it out to disk. We can try to reason why (maybe it was faster, maybe it made the code smaller), but I think the overriding reason is that the original developers didn’t know any better.

Later as they tried to add features they had to try to make it backward compatible. This is where a lot of the complexity lies. There are lots of crazy workarounds for things that would be simple if you allowed yourself to redesign the file format. It’s pretty clear that this was mandated by management, because no software developer would put themselves through that hell for no reason.

Later they added a fast-save feature (I forget what it is actually called). This appends changes to the file without changing the original file. The way they implemented this was really ingenious, but complicates the file structure a lot.

One thing I feel I must point out (I remember posting a huge thing on slashdot when this article was originally posted) is that 2 way file conversion is next to impossible for word processors. That’s because the file formats do not contain enough information to format the document. The most obvious place to see this is pagination. The file format does not say where to paginate a text flow (unless it is explicitly entered by the user). It relies on the formatter to do it. Each word processor formats text completely differently. Word, for example, famously paginates footnotes incorrectly. They can’t change it, though, because it will break backwards compatibility. This is one of the only reasons that Word Perfect survives today – it is the only word processor that paginates legal documents the way the US Department of Justice requires.

Just considering the pagination issue, you can see what the problem is. When reading a Word document, you have to paginate it like Word – only the file format doesn’t tell you what that is. Then if someone modifies the document and you need to resave it, you need to somehow mark that it should be paginated like Word (even though it might now have features that are not in Word). If it was only pagination, you might be able to do it, but practically everything is like that.

I recommend reading (a bit of) the XML Word file format for those who are interested. You will see large numbers of flags for things like “Format like Word 95”. The format doesn’t say what that is – because it’s pretty obvious that the authors of the file format don’t know. It’s lost in a hopeless mess of legacy code and nobody can figure out what it does now.

Fun with NULL

Here’s another example of this fine feature:

  #include <stdio.h>
  #include <string.h>
  #include <stdlib.h>
  #define LENGTH 128

  int main(int argc, char **argv) {
      char *string = NULL;
      int length = 0;
      if (argc > 1) {
          string = argv[1];
          length = strlen(string);
          if (length >= LENGTH) exit(1);
      }

      char buffer[LENGTH];
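      // Note: passing a NULL pointer to memcpy is undefined behavior even when
      // length is 0, which is what allows a compiler to assume string != NULL below.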
      memcpy(buffer, string, length);
      buffer[length] = 0;

      if (string == NULL) {
          printf("String is null, so cancel the launch.\n");
      } else {
          printf("String is not null, so launch the missiles!\n");
      }

      printf("string: %s\n", string);  // undefined for null but works in practice

      #if SEGFAULT_ON_NULL
      printf("%s\n", string);          // segfaults on null when bare "%s\n"
      #endif

      return 0;
  }

  nate@skylake:~/src$ clang-3.8 -Wall -O3 null_check.c -o null_check
  nate@skylake:~/src$ null_check
  String is null, so cancel the launch.
  string: (null)

  nate@skylake:~/src$ icc-17 -Wall -O3 null_check.c -o null_check
  nate@skylake:~/src$ null_check
  String is null, so cancel the launch.
  string: (null)

  nate@skylake:~/src$ gcc-5 -Wall -O3 null_check.c -o null_check
  nate@skylake:~/src$ null_check
  String is not null, so launch the missiles!
  string: (null)

It appears that Intel’s ICC and Clang still haven’t caught up with GCC’s optimizations. Ouch if you were depending on that optimization to get the performance you need! But before picking on GCC too much, consider that all three of those compilers segfault on printf("string: "); printf("%s\n", string) when string is NULL, despite having no problem with printf("string: %s\n", string) as a single statement. Can you see why using two separate statements would cause a segfault? If not, see here for a hint: https://gcc.gnu.org/bugzilla/show_bug.cgi?id=25609

How do you make sure the autopilot backup is paying attention?

Good engineering eliminates users being able to do the wrong thing as much as possible… . You don’t design a feature that invites misuse and then use instructions to try to prevent that misuse.

There was a derailment in Australia called the Waterfall derailment [1]. It occurred because the driver had a heart attack and was responsible for 7 deaths (a miracle it was so low, honestly). The root cause was the failure of the dead-man’s switch.

In the case of Waterfall, the driver had 2 dead-man switches he could use - 1) the throttle handle had to be held against a spring at a small rotation, or 2) a bar on the floor could be depressed. You had to do 1 of these things, the idea being that you prevent wrist or foot cramping by allowing the driver to alternate between the two. Failure to do either triggers an emergency brake.

It turns out that this driver was fat enough that when he had a heart attack, his leg was able to depress the pedal enough to hold the emergency system off. Thus, the dead-man’s system never triggered with a whole lot of dead man in the driver’s seat.

I can’t quite remember the specifics of the system at Waterfall, but one method to combat this is to require the pedal to be held halfway between released and fully depressed. The idea being that a dead leg would fully depress the pedal so that would trigger a brake, and a fully released pedal would also trigger a brake. I don’t know if they had that system but certainly that’s one approach used in rail.

Either way, the problem is equally possible in cars. If you lose consciousness and your foot goes limp, a heavy enough leg will be able to hold the pedal down a bit depending on where it’s positioned relative to the pedal and the leverage it has on the floor.

The other major system I’m familiar with for ensuring drivers are alive at the helm is called ‘vigilance’. The way it works is that periodically, a light starts flashing on the dash and the driver has to acknowledge that. If they do not, a buzzer alarm starts sounding. If they still don’t acknowledge it, the train brakes apply and the driver is assumed incapacitated. Let me tell you some stories of my involvement in it.

When we first started, we had a simple vigi system. Every 30 seconds or so (for example), the driver would press a button. Ok cool. Except that then drivers became so hard-wired to pressing the button every 30 seconds that we were having instances of drivers falling asleep/dozing off and still pressing the button right on every 30 seconds because it was so ingrained into them that it was literally a subconscious action.

So we introduced random-timing vigilance, where the time varies 30-60 seconds (for example) and you could only acknowledge it within a small period of time once the light started flashing. Again, drivers started falling asleep/semi asleep and would hit it as soon as the alarm buzzed, each and every time.

So we introduced random-timing, task-linked vigilance and that finally broke the back of the problem. Now, the driver has to press a button, or turn a knob, or do a number of different activities and they must do that randomly-chosen activity, at a randomly-chosen time, for them to acknowledge their consciousness. It was only at that point that we finally nailed down driver alertness.

See also.

Prestige

Curious why he would need to move to a more prestigious position? Most people realize by their 30s that prestige is a sucker’s game; it’s a way of inducing people to do things that aren’t much fun and they wouldn’t really want to do on their own, by lauding them with accolades from people they don’t really care about.

Why is FedEx based in Memphis?

… we noticed that we also needed:
(1) A suitable, existing airport at the hub location.
(2) Good weather at the hub location, e.g., relatively little snow, fog, or rain.
(3) Access to good ramp space, that is, where to park and service the airplanes and sort the packages.
(4) Good labor supply, e.g., for the sort center.
(5) Relatively low cost of living to keep down prices.
(6) Friendly regulatory environment.
(7) Candidate airport not too busy, e.g., don’t want arriving planes to have to circle a long time before being able to land.
(8) Airport with relatively little in cross winds and with more than one runway to pick from in case of winds.
(9) Runway altitude not too high, e.g., not high enough to restrict maximum total gross take off weight, e.g., rule out Denver.
(10) No tall obstacles, e.g., mountains, near the ends of the runways.
(11) Good supplies of jet fuel.
(12) Good access to roads for 18 wheel trucks for exchange of packages between trucks and planes, e.g., so that some parts could be trucked to the hub and stored there and shipped directly via the planes to customers that place orders, say, as late as 11 PM for delivery before 10 AM.
So, there were about three candidate locations, Memphis and, as I recall, Cincinnati and Kansas City.
The Memphis airport had some old WWII hangers next to the runway that FedEx could use for the sort center, aircraft maintenance, and HQ office space. Deal done – it was Memphis.

Why etherpad joined Wave, and why it didn’t work out as expected

The decision to sell to Google was one of the toughest decisions I and my cofounders ever had to wrestle with in our lives. We were excited by the Wave vision though we saw the flaws in the product. The Wave team told us about how they wanted our help making wave simpler and more like etherpad, and we thought we could help with that, though in the end we were unsuccessful at making wave simpler. We were scared of Google as a competitor: they had more engineers and more money behind this project, yet they were running it much more like an independent startup than a normal big-company department. The Wave office was in Australia and had almost total autonomy. And finally, after 1.5 years of being on the brink of failure with AppJet, it was tempting to be able to declare our endeavor a success and provide a decent return to all our investors who had risked their money on us.

In the end, our decision to join Wave did not work out as we had hoped. The biggest lessons learned were that having more engineers and money behind a project can actually be more harmful than helpful, so we were wrong to be scared of Wave as a competitor for this reason. It seems obvious in hindsight, but at the time it wasn’t. Second, I totally underestimated how hard it would be to iterate on the Wave codebase. I was used to rewriting major portions of software in a single all-nighter. Because of the software development process Wave was using, it was practically impossible to iterate on the product. I should have done more diligence on their specific software engineering processes, but instead I assumed because they seemed to be operating like a startup, that they would be able to iterate like a startup. A lot of the product problems were known to the whole Wave team, but we were crippled by a large complex codebase built on poor technical choices and a cumbersome engineering process that prevented fast iteration.

The accuracy of tech news

When I’ve had inside information about a story that later breaks in the tech press, I’m always shocked at how differently it’s perceived by readers of the article vs. how I experienced it. Among startups & major feature launches I’ve been party to, I’ve seen: executives that flat-out say that they’re not working on a product category when there’s been a whole department devoted to it for a year; startups that were founded 1.5 years before the dates listed in Crunchbase/Wikipedia; reporters that count the number of people they meet in a visit and report that as the “team size”, because the company refuses to release that info; funding rounds that never make it to the press; acquisitions that are reported as “for an undisclosed sum” but actually are less than the founders would’ve made if they’d taken a salaried job at the company; project start dates that are actually when the project was staffed up to its current size and ignore the year or so that a small team spent working on the problem (or the 3-4 years that other small teams spent working on the problem); and algorithms or other technologies that are widely reported as being the core of the company’s success, but actually aren’t even used by the company.

Self-destructing speakers from Dell

As the main developer of VLC, we know about this story since a long time, and this is just Dell putting crap components on their machine and blaming others. Any discussion was impossible with them. So let me explain a bit…

In this case, VLC just uses the Windows APIs (DirectSound), and sends signed integers of 16bits (s16) to the Windows Kernel.

VLC allows amplification of the INPUT above the sound that was decoded. This is just like replay gain, broken codecs, badly recorded files or post-amplification and can lead to saturation.

But this is exactly the same if you put your mp3 file through Audacity and increase it and play with WMP, or if you put a DirectShow filter that amplifies the volume after your codec output. For example, for a long time, VLC ac3 and mp3 codecs were too low (-6dB) compared to the reference output.

At worst, this will reduce the dynamics and saturate a lot, but this is not going to break your hardware.

VLC does not (and cannot) modify the OUTPUT volume to destroy the speakers. VLC is a Software using the OFFICIAL platforms APIs.

The issue here is that Dell sound cards output power (that can be approached by a factor of the quadratic of the amplitude) that Dell speakers cannot handle. Simply said, the sound card outputs at max 10W, and the speakers only can take 6W in, and neither their BIOS or drivers block this.

And as VLC is present on a lot of machines, it’s simple to blame VLC. “Correlation does not mean causation” is something that seems too complex for cheap Dell support…

Learning on the job, startups vs. big companies

Working for someone else’s startup, I learned how to quickly cobble solutions together. I learned about uncertainty and picking a direction regardless of whether you’re sure it’ll work. I learned that most startups fail, and that when they fail, the people who end up doing well are the ones who were looking out for their own interests all along. I learned a lot of basic technical skills, how to write code quickly and learn new APIs quickly and deploy software to multiple machines. I learned how quickly problems of scaling a development team crop up, and how early you should start investing in automation.

Working for Google, I learned how to fix problems once and for all and build that culture into the organization. I learned that even in successful companies, everything is temporary, and that great products are usually built through a lot of hard work by many people rather than great ah-ha insights. I learned how to architect systems for scale, and a lot of practices used for robust, high-availability, frequently-deployed systems. I learned the value of research and of spending a lot of time on a single important problem: many startups take a scattershot approach, trying one weekend hackathon after another and finding nobody wants any of them, while oftentimes there are opportunities that nobody has solved because nobody wants to put in the work. I learned how to work in teams and try to understand what other people want. I learned what problems are really painful for big organizations. I learned how to rigorously research the market and use data to make product decisions, rather than making decisions based on what seems best to one person.

We failed this person, what are we going to do differently?

Having been in on the company’s leadership meetings where departures were noted with a simple ‘regret yes/no’ flag, it was my experience that no single departure had any effect. Mass departures did, trends did, but one person never did, even when that person was a founder.

The rationalizations always put the issue back on the departing employee, “They were burned out”, “They had lost their ability to be effective”, “They have moved on”, “They just haven’t grown with the company” never was it “We failed this person, what are we going to do differently?”

AWS’s origin story

Anyway, the SOA effort was in full swing when I was there. It was a pain, and it was a mess because every team did things differently and every API was different and based on different assumptions and written in a different language.

But I want to correct the misperception that this led to AWS. It didn’t. S3 was written by its own team, from scratch. At the time I was at Amazon, working on the retail site, none of Amazon.com was running on AWS. I know, when AWS was announced, with great fanfare, they said “the services that power Amazon.com can now power your business!” or words to that effect. This was a flat out lie. The only thing they shared was data centers and a standard hardware configuration. Even by the time I left, when AWS was running full steam ahead (and probably running Reddit already), none of Amazon.com was running on AWS, except for a few, small, experimental and relatively new projects. I’m sure more of it has been adopted now, but AWS was always a separate team (and a better managed one, from what I could see.)

Why is Windows so slow?

I (and others) have put a lot of effort into making the Linux Chrome build fast. Some examples are multiple new implementations of the build system (http://neugierig.org/software/chromium/notes/2011/02/ninja.h... ), experimentation with the gold linker (e.g. measuring and adjusting the still off-by-default thread flags https://groups.google.com/a/chromium.org/group/chromium-dev/... ) as well as digging into bugs in it, and other underdocumented things like ‘thin’ ar archives.

But it’s also true that people who are more of Windows wizards than I am a Linux apprentice have worked on Chrome’s Windows build. If you asked me the original question, I’d say the underlying problem is that on Windows all you have is what Microsoft gives you and you can’t typically do better than that. For example, migrating the Chrome build off of Visual Studio would be a large undertaking, large enough that it’s rarely considered. (Another way of phrasing this is it’s the IDE problem: you get all of the IDE or you get nothing.)

When addressing the poor Windows performance, people first bought SSDs, something that never even occurred to me (“your system has enough RAM that the kernel cache of the file system should be in memory anyway!”). But for whatever reason on the Linux side some Googlers saw fit to rewrite the Linux linker to make it twice as fast (this effort predated Chrome), and all Linux developers now get to benefit from that. Perhaps the difference is that when people write awesome tools for Windows or Mac they try to sell them rather than give them away.

Why is Windows so slow, an insider view

I’m a developer in Windows and contribute to the NT kernel. (Proof: the SHA1 hash of revision #102 of [Edit: filename redacted] is [Edit: hash redacted].) I’m posting through Tor for obvious reasons.

Windows is indeed slower than other operating systems in many scenarios, and the gap is worsening. The cause of the problem is social. There’s almost none of the improvement for its own sake, for the sake of glory, that you see in the Linux world.

Granted, occasionally one sees naive people try to make things better. These people almost always fail. We can and do improve performance for specific scenarios that people with the ability to allocate resources believe impact business goals, but this work is Sisyphean. There’s no formal or informal program of systemic performance improvement. We started caring about security because pre-SP3 Windows XP was an existential threat to the business. Our low performance is not an existential threat to the business.

See, component owners are generally openly hostile to outside patches: if you’re a dev, accepting an outside patch makes your lead angry (due to the need to maintain this patch and to justify in shiproom the unplanned design change), makes test angry (because test is on the hook for making sure the change doesn’t break anything, and you just made work for them), and makes PM angry (due to the schedule implications of code churn). There’s just no incentive to accept changes from outside your own team. You can always find a reason to say “no”, and you have very little incentive to say “yes”.

What’s the probability of a successful exit by city?

See link for giant table :-).

The hiring crunch

Broken record: startups are also probably rejecting a lot of engineering candidates that would perform as well or better than anyone on their existing team, because tech industry hiring processes are folkloric and irrational.

Too long to excerpt. See the link!

Should you leave a bad job?

I am 42-year-old very successful programmer who has been through a lot of situations in my career so far, many of them highly demotivating. And the best advice I have for you is to get out of what you are doing. Really. Even though you state that you are not in a position to do that, you really are. It is okay. You are free. Okay, you are helping your boyfriend’s startup but what is the appropriate cost for this? Would he have you do it if he knew it was crushing your soul?

I don’t use the phrase “crushing your soul” lightly. When it happens slowly, as it does in these cases, it is hard to see the scale of what is happening. But this is a very serious situation and if left unchecked it may damage the potential for you to do good work for the rest of your life.

The commenters who are warning about burnout are right. Burnout is a very serious situation. If you burn yourself out hard, it will be difficult to be effective at any future job you go to, even if it is ostensibly a wonderful job. Treat burnout like a physical injury. I burned myself out once and it took at least 12 years to regain full productivity. Don’t do it.

  • More broadly, the best and most creative work comes from a root of joy and excitement. If you lose your ability to feel joy and excitement about programming-related things, you’ll be unable to do the best work. Note that this issue is separate from, and parallel to, burnout! If you are burned out, you might still be able to feel the joy and excitement briefly at the start of a project/idea, but they will fade quickly as the reality of day-to-day work sets in. Alternatively, if you are not burned out but also do not have a sense of wonder, it is likely you will never get yourself started on the good work.

  • The earlier in your career it is now, the more important this time is for your development. Programmers learn by doing. If you put yourself into an environment where you are constantly challenged and are working at the top threshold of your ability, then after a few years have gone by, your skills will have increased tremendously. It is like going to intensively learn kung fu for a few years, or going into Navy SEAL training or something. But this isn’t just a one-time constant increase. The faster you get things done, and the more thorough and error-free they are, the more ideas you can execute on, which means you will learn faster in the future too. Over the long term, programming skill is like compound interest. More now means a LOT more later. Less now means a LOT less later.

So if you are putting yourself into a position that is not really challenging, that is a bummer day in and day out, and you get things done slowly, you aren’t just having a slow time now. You are bringing down that compound interest curve for the rest of your career. It is a serious problem. If I could go back to my early career I would mercilessly cut out all the shitty jobs I did (and there were many of them).

Creating change when politically unpopular

A small anecdote. An acquaintance related a story of fixing the ‘drainage’ in their back yard. They were trying to grow some plants that were sensitive to excessive moisture, and the plants were dying. Not watering them or watering them only a little didn’t seem to change anything; they died. A professional gardener suggested that their problem was drainage. So they dug down about 3’ (where the soil was very very wet) and tried to build in better drainage. As they were on the side of a hill, water table issues were not considered. It turned out their “problem” was that the water main that fed their house and the houses up the hill was so pressurized at their property (because it had to maintain pressure at the top of the hill too) that the pipe seams were leaking and it was pumping gallons of water into the ground underneath their property. The problem wasn’t their garden, the problem was that the city water supply was poorly designed.

While I have never been asked if I was an engineer on the phone, I have experienced similar things to Rachel in meetings and with regard to suggestions. Co-workers will create an internal assessment of your value and then respond based on that assessment. If they have written you off they will ignore you, if you prove their assessment wrong in a public forum they will attack you. These are management issues, and something which was sorely lacking in the stories.

If you are the “owner” of a meeting and someone is trying to be heard and isn’t, it is incumbent on you to let them be heard. By your position power as “the boss” you can naturally interrupt a discussion to collect more data from other members. It’s also important to ask questions like “does anyone have any concerns?” to draw out people who have valid input but are too timid to share it.

In a highly political environment there are two ways to create change. One is through overt manipulation, which is to collect political power to yourself and then exert it to enact change. The other is covert manipulation, which is to enact change subtly enough that the political organism doesn’t react (i.e., without “triggering the antibodies”, as it’s sometimes called).

The problem with the latter is that if you help make positive change while keeping everyone not pissed off, no one attributes it to you (which is good for the change agent, because if they knew, the antibodies would react, but bad if your manager doesn’t recognize it). I asked my manager what change he wanted to be ‘true’ yet he (or others) had been unsuccessful making true. He gave me one, and 18 months later that change was in place. He didn’t believe that I was the one who had made the change. I suggested he pick a change he wanted to happen and not tell me, then in 18 months we could see if that one happened :-). But he also didn’t understand enough about organizational dynamics to know that making change without having the source of that change point back at you was even possible.

How to get tech support from Google

Heavily relying on Google product? ✓
Hitting a dead-end with Google’s customer service? ✓
Have an existing audience you can leverage to get some random Google employee’s attention? ✓
Reach front page of Hacker News? ✓
Good news! You should have your problem fixed in 2-5 business days. The rest of us suckers relying on Google services get to stare at our inboxes helplessly, waiting for a response to our support ticket (which will never come). I feel like it’s almost a rite of passage these days to rely heavily on a Google service, only to have something go wrong and be left out in the cold.

Taking funding

IIRC PayPal was very similar - it was sold for $1.5B, but Max Levchin’s share was only about $30M, and Elon Musk’s was only about $100M. By comparison, many early Web 2.0 darlings (Del.icio.us, Blogger, Flickr) sold for only $20-40M, but their founders had only taken small seed rounds, and so the vast majority of the purchase price went to the founders. 75% of a $40M acquisition = 3% of a $1B acquisition.

Something for founders to think about when they’re taking funding. If you look at the gigantic tech fortunes - Gates, Page/Brin, Omidyar, Bezos, Zuckerberg, Hewlett/Packard - they usually came from having a company that was already profitable or was already well down the hockey-stick user growth curve and had a clear path to monetization by the time they sought investment. Companies that fight tooth & nail for customers and need lots of outside capital to do it usually have much worse financial outcomes.

StackOverflow vs. Experts-Exchange

A lot of the people who were involved in some way in Experts-Exchange don’t understand Stack Overflow.

The basic value flow of EE is that “experts” provide valuable “answers” for novices with questions. In that equation there’s one person asking a question and one person writing an answer.

Stack Overflow recognizes that for every person who asks a question, 100 - 10,000 people will type that same question into Google and find an answer that has already been written. In our equation, we are a community of people writing answers that will be read by hundreds or thousands of people. Ours is a project more like wikipedia – collaboratively creating a resource for the Internet at large.

Because that resource is provided by the community, it belongs to the community. That’s why our data is freely available and licensed under creative commons. We did this specifically because of the negative experience we had with EE taking a community-generated resource and deciding to slap a paywall around it.

The attitude of many EE contributors, like Greg Young who calculates that he “worked” for half a year for free, is not shared by the 60,000 people who write answers on SO every month. When you talk to them you realize that on Stack Overflow, answering questions is about learning. It’s about creating a permanent artifact to make the Internet better. It’s about helping someone solve a problem in five minutes that would have taken them hours to solve on their own. It’s not about working for free.

As soon as EE introduced the concept of money they forced everybody to think of their work on EE as just that – work.

Making money from amazon bots

I saw that one of my old textbooks was selling for a nice price, so I listed it along with two other used copies. I priced it $1 cheaper than the lowest price offered, but within an hour both sellers had changed their prices to $.01 and $.02 cheaper than mine. I reduced it two times more by $1, and each time they beat my price by a cent or two. So what I did was reduce my price by a few dollars every hour for one day until everybody was priced under $5. Then I bought their books and changed my price back.

What running a business is like

While I like the sentiment here, I think the danger is that engineers might come to the mistaken conclusion that making pizzas is the primary limiting reagent to running a successful pizzeria. Running a successful pizzeria is more about schlepping to local hotels and leaving them 50 copies of your menu to put at the front desk, hiring drivers who will both deliver pizzas in a timely fashion and not embezzle your (razor-thin) profits while also costing next-to-nothing to employ, maintaining a kitchen in sufficient order to pass your local health inspector’s annual visit (and dealing with 47 different pieces of paper related to that), being able to juggle priorities like “Do I take out a bank loan to build a new brick-oven, which will make the pizza taste better, in the knowledge that this will commit $3,000 of my cash flow every month for the next 3 years, or do I hire an extra cook?”, sourcing ingredients such that they’re available in quantity and quality every day for a fairly consistent price, setting prices such that they’re locally competitive for your chosen clientele but generate a healthy gross margin for the business, understanding why a healthy gross margin really doesn’t imply a healthy net margin and that the rent still needs to get paid, keeping good-enough records such that you know whether your business is dying before you can’t make payroll and such that you can provide a reasonably accurate picture of accounts for the taxation authorities every year, balancing 50% off medium pizza promotions with the desire to not cannibalize the business of your regulars, etc etc, and by the way tomato sauce should be tangy but not sour and cheese should melt with just the faintest wisp of a crust on it.

Do you want to write software for a living? Google is hiring. Do you want to run a software business? Godspeed. Software is now 10% of your working life.

How to handle mismanagement?

The way I prefer to think of it is: it is not your job to protect people (particularly senior management) from the consequences of their decisions. Make your decisions in your own best interest; it is up to the organization to make sure that your interest aligns with theirs.

Google used to have a severe problem where code refactoring & maintenance was not rewarded in performance reviews while launches were highly regarded, which led to the effect of everybody trying to launch things as fast as possible and nobody cleaning up the messes left behind. Eventually launches started getting slowed down, Larry started asking “Why can’t we have nice things?”, and everybody responded “Because you’ve been paying us to rack up technical debt.” As a result, teams were formed with the express purpose of code health & maintenance, those teams that were already working on those goals got more visibility, and refactoring contributions started counting for something in perf. Moreover, many ex-Googlers who were fed up with the situation went to Facebook and, I’ve heard, instituted a culture there where grungy engineering maintenance is valued by your peers.

None of this would’ve happened if people had just heroically fallen on their own sword and burnt out doing work nobody cared about. Sometimes it takes highly visible consequences before people with decision-making power realize there’s a problem and start correcting it. If those consequences never happen, they’ll keep believing it’s not a problem and won’t pay much attention to it.

Some downsides of immutability

People who aren’t exactly lying

It took me too long to figure this out. There are some people who truly, and passionately, believe something they say to you, and realistically they personally can’t make it happen, so you can’t really bank on that ‘promise.’

I used to think those people were lying to take advantage, but as I’ve gotten older I have come to recognize that these ‘yes’ people get promoted a lot. And for some of them, they really do believe what they are saying.

As an engineer I’ve found that once I can ‘calibrate’ someone’s ‘yes-ness’ I can then work with them, understanding that they only make ‘wishful’ commitments rather than ‘reasoned’ commitments.

So when someone, like Steve Jobs, says “we’re going to make it an open standard!”, my first question then is “Great, I’ve got your support in making this an open standard, so I can count on you to wield your position influence to aid me when folks line up against that effort, right?” If the answer to that question is no, then they were lying.

The difference is subtle of course, but important. Steve clearly doesn’t go to standards meetings and vote, etc., but if Manager Bob gets pushback from accounting that he’s going to exceed his travel budget by sending 5 guys to the Open Video Chat Working Group, which is championing the Facetime protocol as an open standard, then Manager Bob goes to Steve and says “I need your help here, these 5 guys are needed to argue this standard and keep it from being turned into a turd by the 5 guys from Google who are going to attend,” and then Steve whips off a one-liner to accounting that says “Get off this guy’s back, we need this.” Then it’s all good. If, on the other hand, he says “We gotta save money, send one guy,” well, in that case I’m more sympathetic to the accusation of prevarication.

What makes engineers productive?

For those who work inside Google, it’s well worth it to look at Jeff & Sanjay’s commit history and code review dashboard. They aren’t actually all that much more productive in terms of code written than a decent SWE3 who knows his codebase.

The reason they have a reputation as rockstars is that they can apply this productivity to things that really matter; they’re able to pick out the really important parts of the problem and then focus their efforts there, so that the end result ends up being much more impactful than what the SWE3 wrote. The SWE3 may spend his time writing a bunch of unit tests that catch bugs that wouldn’t really have happened anyway, or migrating from one system to another that isn’t really a large improvement, or going down an architectural dead end that’ll just have to be rewritten later. Jeff or Sanjay (or any of the other folks operating at that level) will spend their time running a proposed API by clients to ensure it meets their needs, or measuring the performance of subsystems so they fully understand their building blocks, or mentally simulating the operation of the system before building it so they can rapidly test out alternatives. They don’t actually write more code than a junior developer (oftentimes, they write less), but the code they do write gives them more information, which helps ensure that they write the right code.

I feel like this point needs to be stressed a whole lot more than it is, as there’s a whole mythology that’s grown up around 10x developers that’s not all that helpful. In particular, people need to realize that these developers rapidly become 1x developers (or worse) if you don’t let them make their own architectural choices - the reason they’re excellent in the first place is that they know how to determine whether certain work is going to be useless and avoid doing it in the first place. If you dictate that they do it anyway, they’re going to be just as slow as any other developer.

Do the work, be a hero

I got the hero speech too, once. If anyone ever mentions the word “heroic” again and there isn’t a burning building involved, I will start looking for new employment immediately. It seems that in our industry it is universally a code word for “We’re about to exploit you because the project is understaffed and under-budgeted for time, and that is exactly as we planned it, so you’d better cowboy up.”

Maybe it is different if you’re writing Quake, but I guarantee you the 43rd best selling game that year also had programmers “encouraged onwards” by tales of the glory that awaited after the death march.

Learning English from watching movies

I was once speaking to a good friend of mine here, in English.
“Do you want to go out for yakitori?”
“Go fuck yourself!”
“[switches to Japanese] Have I recently done anything very major to offend you?”
“No, of course not.”
“Oh, OK, I was worried. So that phrase, that’s something you would only say under extreme distress when you had maximal desire to offend me, or I suppose you could use it jokingly between friends, but neither you nor I generally talk that way.”
“I learned it from a movie. I thought it meant ‘No.’”

Being smart and getting things done

True story: I went to a talk given by one of the ‘engineering elders’ (these were low Emp# engineers who were considered quite successful and were to be emulated by the workers :-)). This person stated that when they came to work at Google they were given the XYZ system to work on (sadly I’m prevented from disclosing the actual system). They remarked how they spent a couple of days looking over the system, which was complicated and creaky; they couldn’t figure it out, so they wrote a new system. Yup, and they committed that. This person is a coding God, are they not? (sarcasm) I asked what happened to the old system (I knew, but was interested in their perspective) and they said it was still around because a few things still used it, but (quite proudly) nearly everything else had moved to their new system.

So if you were reading carefully, this person created a new system to ‘replace’ an existing system which they didn’t understand and got nearly everyone to move to the new system. That made them uber because they got something big to put on their internal resume, and a whole crapload of folks had to write new code to adapt from the old system to this new system, which imperfectly recreated the old system (remember, they didn’t understand the original), such that those parts of the system that relied on the more obscure bits had yet to be converted (because nobody understood either the dependent code or the old system, apparently).

Was this person smart? Blindingly brilliant according to some of their peers. Did they get things done? Hell yes, they wrote the replacement for the XYZ system from scratch! One person? Can you imagine? Would I hire them? Not unless they were the last qualified person in my pool and I was out of time.

That anecdote encapsulates the dangerous side of smart people who get things done.

Public speaking tips

Some kids grow up on football. I grew up on public speaking (as behavioral therapy for a speech impediment, actually). If you want to get radically better in a hurry:

Too long to excerpt. See the link.

A reason a company can be a bad fit

I can relate to this, but I can also relate to the other side of the question. Sometimes it isn’t me, it’s you. Take someone who gets things done and suddenly in your organization they aren’t delivering. Could be them, but it could also be you.

I had this experience working at Google. I had a horrible time getting anything done there. Now, I spent a bit of time evaluating that, since it had never been the case in my career up to that point that I was unable to move the ball forward, and I really wanted to understand why. The short answer was that Google had developed a number of people who spent much, if not all, of their time preventing change. It took me a while to figure out what motivated someone to be anti-change.

The fear was about risk and safety. Folks moved around a lot, so you had people in charge of systems they didn’t build, didn’t understand all the moving parts of, and were apt to get a poor rating if they broke. When dealing with people in that situation, one could either educate them and bring them along, or steamroll over them. Education takes time, and during that time the ‘teacher’ doesn’t get anything done. This favors steamrolling evolutionarily :-)

So you can hire someone who gets stuff done, but if getting stuff done in your organization requires them to be an asshole, and they aren’t up for that, well they aren’t going to be nearly as successful as you would like them to be.

What working at Google is like

I can tell that this was written by an outsider, because it focuses on the perks and rehashes several cliches that have made their way into the popular media but aren’t all that accurate.

Most Googlers will tell you that the best thing about working there is having the ability to work on really hard problems, with really smart coworkers, and lots of resources at your disposal. I remember asking my interviewer whether I could use things like Google’s index if I had a cool 20% idea, and he was like “Sure. That’s encouraged. Oftentimes I’ll just grab 4000 or so machines and run a MapReduce to test out some hypothesis.” My phone screener, when I asked him what it was like to work there, said “It’s a place where really smart people go to be average,” which has turned out to be both true and honestly one of the best things that I’ve gained from working there.

NSA vs. Black Hat

This entire event was a staged press op. Keith Alexander is a ~30 year veteran of SIGINT, electronic warfare, and intelligence, and a Four-Star US Army General — which is a bigger deal than you probably think it is. He’s a spy chief in the truest sense and a master politician. Anyone who thinks he walked into that conference hall in Caesars without a near perfect forecast of the outcome of the speech is kidding themselves.

Heckling Alexander played right into the strategy. It gave him an opportunity to look reasonable compared to his detractors, and, more generally (and alarmingly), to have the NSA look more reasonable compared to opponents of NSA surveillance. It allowed him to “split the vote” with audience reactions, getting people who probably have serious misgivings about NSA programs to applaud his calm and graceful handling of shouted insults; many of those people probably applauded simply to protest the hecklers, who after all were making it harder for them to follow what Alexander was trying to say.

There was no serious Q&A on offer at the keynote. The questions were pre-screened; all attendees could do was vote on them. There was no possibility that anything would come of this speech other than an effectively unchallenged full-throated defense of the NSA’s programs.

Are deadlines necessary?

Interestingly one of the things that I found most amazing when I was working for Google was a nearly total inability to grasp the concept of ‘deadline.’ For so many years the company just shipped it by committing it to the release branch and having the code deploy over the course of a small number of weeks to the ‘fleet’.

Sure there were ‘processes’, like “Canary it in some cluster and watch the results for a few weeks before turning it loose on the world.” but being completely vertically integrated is a unique sort of situation.

Debugging on Windows vs. Linux

Being a very experienced game developer who tried to switch to Linux, I have posted about this before (and gotten flamed heavily by reactionary Linux people).

The main reason is that debugging is terrible on Linux. gdb is just bad to use, and all these IDEs that try to interface with gdb to “improve” it do it badly (mainly because gdb itself is not good at being interfaced with). Someone needs to nuke this site from orbit and build a new debugger from scratch, and provide a library-style API that IDEs can use to inspect executables in rich and subtle ways.

Productivity is crucial. If the lack of a reasonable debugging environment costs me even 5% of my productivity, that is too much, because games take so much work to make. At the end of a project, I just don’t have 5% effort left any more. It requires everything. (But the current Linux situation is way more than a 5% productivity drain. I don’t know exactly what it is, but if I were to guess, I would say it is something like 20%.)

What happens when you become rich?

What is interesting is that people don’t even know they have a complex about money until they get “rich.” I’ve watched many people, perhaps a hundred, go from “working to pay the bills” to “holy crap I can pay all my current and possibly my future bills with the money I now have.” That doesn’t include the guy who lived in our neighborhood and won the CA lottery one year.

It affects people in ways they don’t expect. If it’s sudden (like a lottery win or a sudden IPO surge) it can be difficult to process. But it is important to realize that one is processing an exceptional event, like having a loved one die or a spouse suddenly divorcing you.

Not everyone feels “guilty”, not everyone feels “smug.” A lot of millionaires and billionaires in the Bay Area are outwardly unchanged. But the bottom line is that the emotion comes from the cognitive dissonance between values and reality. What do you value? What is reality?

One woman I knew at Google was massively conflicted when she started work at Google. She always felt that she would help the homeless folks she saw, if she had more money than she needed. Upon becoming rich (on Google stock value), now she found that she wanted to save the money she had for her future kids education and needs. Was she a bad person? Before? After? Do your kids hate you if you give away their college education to the local foodbank? Do your peers hate you because you could close the current food gap at the foodbank and you don’t?

Microsoft’s Skype acquisition

This is Microsoft’s ICQ moment. Overpaying for a company at the moment when its core competency is becoming a commodity. Does anyone have the slightest bit of loyalty to Skype? Of course not. They’re going to use whichever video chat comes built into their smartphone, tablet, computer, etc. They’re going to use Facebook’s eventual video chat service or something Google offers. No one is going to actively seek out Skype when so many alternatives exist and are deeply integrated into the products/services they already use. Certainly no one is going to buy a Microsoft product simply because it has Skype integration. Who cares if it’s FaceTime, Facebook Video Chat, or Google Video Chat? It’s all the same to the user.

With $7B they should have just given away about 15 million Windows Mobile phones in the form of an epic PR stunt. It’s not a bad product – they just need to make people realize it exists. If they want to flush money down the toilet they might as well engage users in the process right?

What happened to Google Fiber?

I worked briefly on the Fiber team when it was very young (basically from 2 weeks before to 2 weeks after launch - I was on loan from Search specifically so that they could hit their launch goals). The bottleneck when I was there was local government regulations, and in fact Kansas City was chosen because it had a unified city/county/utility regulatory authority that was very favorable to Google. To lay fiber to the home, you either need rights-of-way on the utility poles (which are owned by Google’s competitors) or you need permission to dig up streets (which requires a mess of permitting from the city government). In either case, the cable & phone companies were in very tight with local regulators, and so you had hostile gatekeepers whose approval you absolutely needed.

The technology was awesome (1G Internet and HDTV!), the software all worked great, and the economics of hiring contractors to lay the fiber itself actually worked out. The big problem was regulatory capture.

With Uber & AirBnB’s success in hindsight, I’d say that the way to crack the ISP business is to provide your customers with the tools to break the law en masse. For example, you could imagine an ISP startup that basically says “Here’s a box, a wire, and a map of other customers’ locations. Plug into their jack, and if you can convince others to plug into yours, we’ll give you a discount on your monthly bill based on how many you sign up.” But Google in general is not willing to break laws - they’ll go right up to the boundary of what the law allows, but if a regulatory agency says “No, you can’t do that”, they won’t do it rather than fight the agency.

Indeed, Fiber is being phased out in favor of Google’s acquisition of WebPass, which does basically exactly that but with wireless instead of fiber. WebPass only requires the building owner’s consent, and leaves the city out of it.

???

How did HN get the commenter base that it has? If you read HN, on any given week, there are at least as many good, substantial comments as there are posts. This is different from every other modern public news aggregator I can find out there, and I don’t really know what the ingredients are that make HN successful.

For the last couple years (ish?), the moderation regime has been really active in trying to get a good mix of stories on the front page and in tamping down on gratuitously mean comments. But there was a period of years where the moderation could be described as sparse, arbitrary, and capricious, and while there are fewer “bad” comments now, it doesn’t seem like good moderation actually generates more “good” comments.

The ranking scheme seems to penalize posts that have a lot of comments on the theory that flamebait topics will draw a lot of comments. That sometimes prematurely buries stories with good discussion, but much more often, it buries stories that draw pointless flamewars. If you just read HN, it’s hard to see the effect, but if you look at forums that use comments as a positive factor in ranking, the difference is dramatic – those other forums that boost topics with many comments (presumably on the theory that vigorous discussion should be highlighted) often have content-free flame wars pinned at the top for long periods of time.

Something else that HN does that’s different from most forums is that user flags are weighted very heavily. On reddit, a downvote only cancels out an upvote, which means that flamebait topics that draw a lot of upvotes, like “platform X is cancer” or “Y is doing some horrible thing”, often get pinned to the top of r/programming for an entire day, since the number of people who don’t want to see that is drowned out by the number of people who upvote outrageous stories. If you read the comments for one of the “X is cancer” posts on r/programming, the top comment will almost inevitably be that the post has no content, that the author of the post is a troll who never posts anything with content, and that we’d be better off with less flamebait by the author at the top of r/programming. But the people who will upvote outrage porn outnumber the people who will downvote it, so that kind of stuff dominates aggregators that use raw votes for ranking. Having flamebait drop off the front page quickly is significant, but it doesn’t seem sufficient to explain why there are so many more well-informed comments on HN than on other forums with roughly similar traffic.

Maybe the answer is that people come to HN for the same reason people come to Silicon Valley – despite all the downsides, there’s a relatively large concentration of experts there across a wide variety of CS-related disciplines. If that’s true, and it’s a combination of path dependence and network effects, that’s pretty depressing, since that’s not replicable.

If you liked this curated list of comments, you’ll probably also like this list of books and this list of blogs.

This is part of an experiment where I write up thoughts quickly, without proofing or editing. Apologies if this is less clear than a normal post. This is probably going to be the last post like this, for now, since, by quickly writing up a post whenever I have something that can be written up quickly, I’m building up a backlog of post ideas that require re-reading the literature in an area or running experiments.

P.S. Please suggest other good comments! By their nature, HN comments are much less discoverable than stories, so there are a lot of great comments that I haven’t seen.


  1. if you’re one of those people, you’ve probably already thought of this, but maybe consider, at the margin, blogging more and commenting on HN less? As a result of writing this post, I looked through my old HN comments and noticed that I wrote this comment three years ago, which is another way of stating the second half of this post I wrote recently. Comparing the two, I think the HN comment is substantially better written. But, like most HN comments, it got some traffic while the story was still current and is now buried, and AFAICT, nothing really happened as a result of the comment. The blog post, despite being “worse”, has gotten some people to contact me personally, and I’ve had some good discussions about that and other topics as a result. Additionally, people occasionally contact me about older posts I’ve written; I continue to get interesting stuff in my inbox as a result of having written posts years ago. Writing your comment up as a blog post will almost certainly provide more value to you, and if it gets posted to HN, it will probably provide no less value to HN.

    Steve Yegge has a pretty good list of reasons why you should blog that I won’t recapitulate here. And if you’re writing substantial comments on HN, you’re already doing basically everything you’d need to do to write a blog, except that you’re putting the text into a little box on HN instead of into a static site generator or some hosted blogging service. BTW, I’m not just saying this for your benefit: my selfish reason for writing this appeal is that I really want to read the Nathan Kurz blog on low-level optimizations, the Jonathan Tang blog on what it’s like to work at startups vs. big companies, etc.


File crash consistency and filesystems are hard

I haven’t used a desktop email client in years. None of them could handle the volume of email I get without at least occasionally corrupting my mailbox. Pine, Eudora, and outlook have all corrupted my inbox, forcing me to restore from backup. How is it that desktop mail clients are less reliable than gmail, even though my gmail account not only handles more email than I ever had on desktop clients, but also allows simultaneous access from multiple locations across the globe? Distributed systems have an unfair advantage, in that they can be robust against total disk failure in a way that desktop clients can’t, but none of the file corruption issues I’ve had have been from total disk failure. Why has my experience with desktop applications been so bad?

Well, what sort of failures can occur? Crash consistency (maintaining consistent state even if there’s a crash) is probably the easiest property to consider, since we can assume that everything, from the filesystem to the disk, works correctly; let’s consider that first.

Crash Consistency

Pillai et al. had a paper and presentation at OSDI ‘14 on exactly how hard it is to save data without corruption or data loss.

Let’s look at a simple example of what it takes to save data in a way that’s robust against a crash. Say we have a file that contains the text a foo and we want to update the file to contain a bar. The pwrite function looks like it’s designed for this exact thing. It takes a file descriptor, what we want to write, a length, and an offset. So we might try

pwrite([file], "bar", 3, 2)  // write 3 bytes at offset 2

What happens? If nothing goes wrong, the file will contain a bar, but if there’s a crash during the write, we could get a boo, a far, or any other combination. Note that you may want to consider this an example over sectors or blocks and not chars/bytes.

If we want atomicity (so we either end up with a foo or a bar but nothing in between) one standard technique is to make a copy of the data we’re about to change in an undo log file, modify the “real” file, and then delete the log file. If a crash happens, we can recover from the log. We might write something like

creat(/dir/log);
write(/dir/log, "2,3,foo", 7);
pwrite(/dir/orig, "bar", 3, 2);
unlink(/dir/log);

This should allow recovery from a crash without data corruption via the undo log, at least if we’re using ext3 and we made sure to mount our drive with data=journal. But we’re out of luck if, like most people, we’re using the default [1] – with the default data=ordered, the write and pwrite syscalls can be reordered, causing the write to orig to happen before the write to the log, which defeats the purpose of having a log. We can fix that.

creat(/dir/log);
write(/dir/log, "2, 3, foo");
fsync(/dir/log);  // don't allow write to be reordered past pwrite
pwrite(/dir/orig, "bar", 3, 2);
fsync(/dir/orig);
unlink(/dir/log);

That should force things to occur in the correct order, at least if we’re using ext3 with data=journal or data=ordered. If we’re using data=writeback, a crash during the write or fsync to log can leave log in a state where the filesize has been adjusted for the write of “2, 3, foo”, but the data hasn’t been written, which means that the log will contain random garbage. This is because with data=writeback, metadata is journaled, but data operations aren’t, which means that data operations (like writing data to a file) aren’t ordered with respect to metadata operations (like adjusting the size of a file for a write).

We can fix that by adding a checksum to the log file when creating it. If the contents of log don’t contain a valid checksum, then we’ll know that we ran into the situation described above.

creat(/dir/log);
write(/dir/log, "2, 3, [checksum], foo");  // add checksum to log file
fsync(/dir/log);
pwrite(/dir/orig, "bar", 3, 2);
fsync(/dir/orig);
unlink(/dir/log);

That’s safe, at least on current configurations of ext3. But it’s legal for a filesystem to end up in a state where the log is never created unless we issue an fsync to the parent directory.

creat(/dir/log);
write(/dir/log, "2, 3, [checksum], foo");
fsync(/dir/log);
fsync(/dir);  // fsync parent directory of log file
pwrite(/dir/orig, "bar", 3, 2);
fsync(/dir/orig);
unlink(/dir/log);

That should prevent corruption on any Linux filesystem, but if we want to make sure that the file actually contains “bar”, we need another fsync at the end.

creat(/dir/log);
write(/dir/log, "2, 3, [checksum], foo");
fsync(/dir/log);
fsync(/dir);
pwrite(/dir/orig, "bar", 3, 2);
fsync(/dir/orig);
unlink(/dir/log);
fsync(/dir);

That results in consistent behavior and guarantees that our operation actually modifies the file after it’s completed, as long as we assume that fsync actually flushes to disk. OS X and some versions of ext3 have an fsync that doesn’t really flush to disk. OS X requires fcntl(F_FULLFSYNC) to flush to disk, and some versions of ext3 only flush to disk if the inode changed (which would only happen at most once a second on writes to the same file, since the inode mtime has one-second granularity), as an optimization.
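
As an aside, here’s a minimal sketch of what a “flush this file for real” helper might look like if you only care about Linux and OS X; the helper name full_fsync and the fallback behavior are choices made for illustration, not something taken from the post’s sources:

#include <fcntl.h>
#include <unistd.h>

// Sketch: flush a file descriptor to stable storage. On OS X, fsync() may not
// flush the drive's write cache, so try fcntl(F_FULLFSYNC) first and fall back
// to plain fsync() if that fails (e.g., on filesystems that don't support it).
int full_fsync(int fd) {
#ifdef __APPLE__
    if (fcntl(fd, F_FULLFSYNC) == 0)
        return 0;
#endif
    return fsync(fd);
}

Even that only covers what the OS promises; as noted below, some disks ignore the flush command entirely.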

Even if we assume fsync issues a flush command to the disk, some disks ignore flush directives for the same reason fsync is gimped on OS X and some versions of ext3 – to look better in benchmarks. Handling that is beyond the scope of this post, but the Rajimwale et al. DSN ‘11 paper and related work cover that issue.

Filesystem semantics

When the authors examined ext2, ext3, ext4, btrfs, and xfs, they found that there are substantial differences in how code has to be written to preserve consistency. They wrote a tool that collects block-level filesystem traces, and used that to determine which properties don’t hold for specific filesystems. The authors are careful to note that they can only determine when properties don’t hold – if they don’t find a violation of a property, that’s not a guarantee that the property holds.

Different filesystems have very different properties

Xs indicate that a property is violated. The atomicity properties are basically what you’d expect, e.g., no X for single sector overwrite means that writing a single sector is atomic. The authors note that the atomicity of single sector overwrite sometimes comes from a property of the disks they’re using, and that running these filesystems on some disks won’t give you single sector atomicity. The ordering properties are also pretty much what you’d expect from their names, e.g., an X in the “Overwrite -> Any op” row means that an overwrite can be reordered with some operation.

After they created a tool to test filesystem properties, they then created a tool to check if any applications rely on any potentially incorrect filesystem properties. Because invariants are application specific, the authors wrote checkers for each application tested.

Everything is broken

The authors find issues with most of the applications tested, including things you’d really hope would work, like LevelDB, HDFS, Zookeeper, and git. In a talk, one of the authors noted that the developers of SQLite have a very deep understanding of these issues, but even that wasn’t enough to prevent all bugs. That speaker also noted that version control systems were particularly bad about this, and that the developers had a pretty lax attitude that made it very easy for the authors to find a lot of issues in their tools. The most common class of error was incorrectly assuming ordering between syscalls. The next most common class of error was assuming that syscalls were atomic [2]. These are fundamentally the same issues people run into when doing multithreaded programming. Correctly reasoning about re-ordering behavior and inserting barriers correctly is hard. But even though shared memory concurrency is considered a hard problem that requires great care, writing to files isn’t treated the same way, even though it’s actually harder in a number of ways.
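
To make those two classes of error concrete, here’s a sketch, in the same pseudocode style as the examples above, of the common “write a temp file, then rename it over the original” pattern (this is a generic illustration, not code from any of the applications studied):

creat(/dir/config.tmp);
write(/dir/config.tmp, "new contents");
rename(/dir/config.tmp, /dir/config);  // assumes rename is atomic and ordered after the data write

On a filesystem that can reorder these operations, the rename can become durable before the data does, so a crash can leave an empty or partially written config file. Pinning the ordering down explicitly looks something like:

creat(/dir/config.tmp);
write(/dir/config.tmp, "new contents");
fsync(/dir/config.tmp);  // make the data durable before the rename
rename(/dir/config.tmp, /dir/config);
fsync(/dir);             // make the rename itself durable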

Something to note here is that while btrfs’s semantics aren’t inherently less reliable than ext3/ext4, many more applications corrupt data on top of btrfs because developers aren’t used to coding against filesystems that allow directory operations to be reordered (ext2 is perhaps the most recent widely used filesystem that allowed that reordering). We’ll probably see a similar level of bug exposure when people start using NVRAM drives that have byte-level atomicity. People almost always just run some tests to see if things work, rather than making sure they’re coding against what’s legal in a POSIX filesystem.

Hardware memory ordering semantics are usually well documented in a way that makes it simple to determine precisely which operations can be reordered with which other operations, and which operations are atomic. By contrast, here’s the ext manpage on its three data modes:

journal: All data is committed into the journal prior to being written into the main filesystem.

ordered: This is the default mode. All data is forced directly out to the main file system prior to its metadata being committed to the journal.

writeback: Data ordering is not preserved – data may be written into the main filesystem after its metadata has been committed to the journal. This is rumoured to be the highest-throughput option. It guarantees internal filesystem integrity, however it can allow old data to appear in files after a crash and journal recovery.

The manpage literally refers to rumor. This is the level of documentation we have. If we look back at our example where we had to add an fsync between the write(/dir/log, “2, 3, foo”) and pwrite(/dir/orig, “bar”, 3, 2) to prevent reordering, I don’t think the necessity of the fsync is obvious from the description in the manpage. If you look at the hardware memory ordering “manpage” above, it specifically defines the ordering semantics, and it certainly doesn’t rely on rumor.

This isn’t to say that filesystem semantics aren’t documented anywhere. Between lwn and LKML, it’s possible to get a good picture of how things work. But digging through all of that is hard enough that it’s still quite common for there to be long, uncertain discussions on how things work. A lot of the information out there is wrong, and even when information was right at the time it was posted, it often goes out of date.

When digging through archives, I’ve often seen a post from 2005 cited to back up the claim that OS X fsync is the same as Linux fsync, and that OS X fcntl(F_FULLFSYNC) is even safer than anything available on Linux. Even at the time, I don’t think that was true for the 2.4 kernel, although it was true for the 2.6 kernel. But since 2008 or so Linux 2.6 with ext3 will do a full flush to disk for each fsync (if the disk supports it, and the filesystem hasn’t been specially configured with barriers off).

Another issue is that you often also see exchanges like this one:

Dev 1: Personally, I care about metadata consistency, and ext3 documentation suggests that journal protects its integrity. Except that it does not on broken storage devices, and you still need to run fsck there.
Dev 2: as the ext3 authors have stated many times over the years, you still need to run fsck periodically anyway.
Dev 1: Where is that documented?
Dev 2: linux-kernel mailing list archives.
Dev 3: Probably from some 6-8 years ago, in e-mail postings that I made.

Where’s this documented? Oh, in some mailing list post 6-8 years ago (which makes it 12-14 years from today). I don’t mean to pick on filesystem devs. The fs devs whose posts I’ve read are quite polite compared to LKML’s reputation; they generously spend a lot of their time responding to basic questions and I’m impressed by how patient the expert fs devs are with askers, but it’s hard for outsiders to troll through a decade and a half of mailing list postings to figure out which ones are still valid and which ones have been obsoleted!

In their OSDI 2014 talk, the authors of the paper we’re discussing noted that when they reported bugs they’d found, developers would often respond “POSIX doesn’t let filesystems do that”, without being able to point to any specific POSIX documentation to support their statement. If you’ve followed Kyle Kingsbury’s Jepsen work, this may sound familiar, except devs respond with “filesystems don’t do that” instead of “networks don’t do that”. I think this is understandable, given how much misinformation is out there. Not being a filesystem dev myself, I’d be a bit surprised if I don’t have at least one bug in this post.

Filesystem correctness

We’ve already encountered a lot of complexity in saving data correctly, and this only scratches the surface of what’s involved. So far, we’ve assumed that the disk works properly, or at least that the filesystem is able to detect when the disk has an error via SMART or some other kind of monitoring. I’d always figured that was the case until I started looking into it, but that assumption turns out to be completely wrong.

The Prabhakaran et al. SOSP 05 paper examined how filesystems respond to disk errors in some detail. They created a fault injection layer that allowed them to inject disk faults and then ran things like chdir, chroot, stat, open, write, etc. to see what would happen.

Between ext3, reiserfs, and NTFS, reiserfs is the best at handling errors and it seems to be the only filesystem where errors were treated as first class citizens during design. It’s mostly consistent about propagating errors to the user on reads, and calling panic on write failures, which triggers a restart and recovery. This general policy allows the filesystem to gracefully handle read failure and avoid data corruption on write failures. However, the authors found a number of inconsistencies and bugs. For example, reiserfs doesn’t correctly handle read errors on indirect blocks and leaks space, and a specific type of write failure doesn’t prevent reiserfs from updating the journal and committing the transaction, which can result in data corruption.

Reiserfs is the good case. The authors found that ext3 ignored write failures in most cases, and rendered the filesystem read-only in most cases for read failures. This seems like pretty much the opposite of the policy you’d want. Ignoring write failures can easily result in data corruption, and remounting the filesystem as read-only is a drastic overreaction if the read error was a transient error (transient errors are common). Additionally, ext3 did the least consistency checking of the three filesystems and was the most likely to not detect an error. In one presentation, one of the authors remarked that the ext3 code had lots of comments like “I really hope a write error doesn’t happen here” in places where errors weren’t handled.

NTFS is somewhere in between. The authors found that it has many consistency checks built in, and is pretty good about propagating errors to the user. However, like ext3, it ignores write failures.

The paper has much more detail on the exact failure modes, but the details are mostly of historical interest as many of the bugs have been fixed.

It would be really great to see an updated version of the paper, and in one presentation someone in the audience asked if there was more up to date information. The presenter replied that they’d be interested in knowing what things look like now, but that it’s hard to do that kind of work in academia because grad students don’t want to repeat work that’s been done before, which is pretty reasonable given the incentives they face. Doing replications is a lot of work, often nearly as much work as the original paper, and replications usually give little to no academic credit. This is one of the many cases where the incentives align very poorly with producing real world impact.

The Gunawi et al. FAST ‘08 paper is another one it would be great to see replicated today. That paper follows up the paper we just looked at, and examines the error handling code in different file systems, using a simple static analysis tool to find cases where errors are being thrown away. Being thrown away is defined very loosely in the paper — code like the following

if (error) {
    printk("I have no idea how to handle this error\n");
}

is considered not throwing away the error. Errors are considered to be ignored if the execution flow of the program doesn’t depend on the error code returned from a function that returns an error code.
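
In other words (a hypothetical illustration using a made-up write_block helper, not code from any of the filesystems studied), the first call below counts as a dropped error because nothing ever examines the return value, while the second does not, even though it does nothing useful with the error:

err = write_block(dev, blocknr, buf);      // return value never examined: counted as dropped
err = write_block(dev, blocknr + 1, buf);
if (err)
    printk("write failed, continuing anyway\n");  // control flow depends on err: not counted as dropped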

With that tool, they find that most filesystems drop a lot of error codes:


Rank   FS (by % broken)   % broken   FS (by Viol/Kloc)   Viol/Kloc
1      IBM JFS            24.4       ext3                7.2
2      ext3               22.1       IBM JFS             5.6
3      JFFS v2            15.7       NFS Client          3.6
4      NFS Client         12.9       VFS                 2.9
5      CIFS               12.7       JFFS v2             2.2
6      MemMgmt            11.4       CIFS                2.1
7      ReiserFS           10.5       MemMgmt             2.0
8      VFS                8.4        ReiserFS            1.8
9      NTFS               8.1        XFS                 1.4
10     XFS                6.9        NFS Server          1.2


Comments they found next to ignored errors include: “Should we pass any errors back?”, “Error, skip block and hope for the best.”, “There’s no way of reporting error returned from ext3_mark_inode_dirty() to user space. So ignore it.“, “Note: todo: log error handler.“, “We can’t do anything about an error here.”, “Just ignore errors at this point. There is nothing we can do except to try to keep going.”, “Retval ignored?”, and “Todo: handle failure.”

One thing to note is that in a lot of cases, ignoring an error is more of a symptom of an architectural issue than a bug per se (e.g., ext3 ignored write errors during checkpointing because it didn’t have any kind of recovery mechanism). But even so, the authors of the papers found many real bugs.

Error recovery

Every widely used filesystem has bugs that will cause problems on error conditions, which brings up two questions: how well do the standard recovery tools handle the resulting corruption, and how often do errors actually occur? On the first question, the Gunawi et al. OSDI ‘08 paper finds that fsck, a standard utility for checking and repairing file systems, “checks and repairs certain pointers in an incorrect order … the file system can even be unmountable after”.

At this point, we know that it’s quite hard to write files in a way that ensures their robustness even when the underlying filesystem is correct, the underlying filesystem will have bugs, and that attempting to repair corruption to the filesystem may damage it further or destroy it. How often do errors happen?

Error frequency

The Bairavasundaram et al. SIGMETRICS ‘07 paper found that, depending on the exact model, between 5% and 20% of disks would have at least one error over a two year period. Interestingly, many of these were isolated errors – 38% of disks with errors had only a single error, and 80% had fewer than 50 errors. A follow-up study looked at corruption and found that silent data corruption that was only detected by checksumming happened on .5% of disks per year, with one extremely bad model showing corruption on 4% of disks in a year.

It’s also worth noting that they found very high locality in error rates between disks on some models of disk. For example, there was one model of disk that had a very high error rate in one specific sector, making many forms of RAID nearly useless for redundancy.

That’s another study it would be nice to see replicated. Most studies on disk focus on the failure rate of the entire disk, but if what you’re worried about is data corruption, errors in non-failed disks are more worrying than disk failure, which is easy to detect and mitigate.

Conclusion

Files are hard. Butler Lampson has remarked that when they came up with threads, locks, and condition variables at PARC, they thought that they were creating a programming model that anyone could use, but that there’s now decades of evidence that they were wrong. We’ve accumulated a lot of evidence that humans are very bad at reasoning about these kinds of problems, which are very similar to the problems you have when writing correct code to interact with current filesystems. Lampson suggests that the best known general purpose solution is to package up all of your parallelism into as small a box as possible and then have a wizard write the code in the box. Translated to filesystems, that’s equivalent to saying that as an application developer, writing to files safely is hard enough that it should be done via some kind of library and/or database, not by directly making syscalls.

SQLite is quite good in terms of reliability if you want a good default. However, some people find it to be too heavyweight if all they want is a file-based abstraction. What they really want is a sort of polyfill for the file abstraction that works on top of all filesystems without having to understand the differences between different configurations (and even different versions) of each filesystem. Since that doesn’t exist yet, when no existing library is sufficient, you need to checksum your data since you will get silent errors and corruption. The only questions are whether or not you detect the errors and whether or not your record format only destroys a single record when corruption happens, or if it destroys the entire database. As far as I can tell, most desktop email client developers have chosen to go the route of destroying all of your email if corruption happens.
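
For a sense of what per-record checksumming can look like, here’s a sketch where each record carries its own length and checksum, so a reader can discard a single corrupt record instead of the whole file. The record layout is an assumption made for illustration (it isn’t a format from any particular application), and crc32 here is zlib’s, so this needs to link against -lz:

#include <stdint.h>
#include <stdio.h>
#include <zlib.h>  // crc32(); link with -lz

// Each record on disk is [4-byte length][4-byte CRC32 of payload][payload],
// written in native byte order (fine for a sketch).
int write_record(FILE *f, const void *payload, uint32_t len) {
    uint32_t crc = (uint32_t)crc32(0L, (const unsigned char *)payload, len);
    if (fwrite(&len, sizeof len, 1, f) != 1) return -1;
    if (fwrite(&crc, sizeof crc, 1, f) != 1) return -1;
    if (fwrite(payload, 1, len, f) != len) return -1;
    return 0;
}

// Returns 1 if a record was read and its checksum matched; 0 on EOF or on a
// torn/corrupt record, which the caller can skip or treat as end-of-log.
int read_record(FILE *f, void *buf, uint32_t bufsize, uint32_t *out_len) {
    uint32_t len, crc;
    if (fread(&len, sizeof len, 1, f) != 1) return 0;
    if (fread(&crc, sizeof crc, 1, f) != 1) return 0;
    if (len > bufsize || fread(buf, 1, len, f) != len) return 0;
    if ((uint32_t)crc32(0L, (const unsigned char *)buf, len) != crc) return 0;  // corruption detected
    *out_len = len;
    return 1;
}

This only covers detecting corruption after the fact; making the writes durable and correctly ordered still requires the fsync dance described earlier in the post.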

These studies also hammer home the point that conventional testing isn’t sufficient. There were multiple cases where the authors of a paper wrote a relatively simple tool and found a huge number of bugs. You don’t need any deep computer science magic to write the tools. The error propagation checker from the paper that found a ton of bugs in filesystem error handling was 4k LOC. If you read the paper, you’ll see that the authors observed that the tool had a very large number of shortcomings because of its simplicity, but despite those shortcomings, it was able to find a lot of real bugs. I wrote a vaguely similar tool at my last job to enforce some invariants, and it was literally two pages of code. It didn’t even have a real parser (it just went line-by-line through files and did some regexp matching to detect the simple errors that it’s possible to detect with just a state machine and regexes), but it found enough bugs that it paid for itself in development time the first time I ran it.

Almost every software project I’ve seen has a lot of low hanging testing fruit. Really basic random testing, static analysis, and fault injection can pay for themselves in terms of dev time pretty much the first time you use them.
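
As one example of what “really basic” means, here’s a hedged fault-injection sketch in Python: a drop-in file object whose writes fail randomly, so a test can check that injected failures surface as handled errors rather than silent corruption. The class and failure rate are made up for illustration; the same idea can be applied at other layers (an LD_PRELOAD shim, a FUSE filesystem, and so on).

    import io
    import random

    class FlakyFile(io.FileIO):
        """A file object that randomly fails writes, for fault-injection tests."""

        def __init__(self, path, mode="w", failure_rate=0.01, seed=None):
            super().__init__(path, mode)
            self.failure_rate = failure_rate
            self.rng = random.Random(seed)

        def write(self, data):
            if self.rng.random() < self.failure_rate:
                raise OSError(5, "injected I/O error")  # errno 5 == EIO
            return super().write(data)

Run the code under test with FlakyFile in place of open(), using a fixed seed so failures are reproducible, and assert that every injected error is either handled or propagated instead of being swallowed.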

Appendix

I’ve probably covered less than 20% of the material in the papers I’ve referred to here. Here’s a bit about some other neat things you can find in those papers, and in others.

Pillai et al., OSDI ‘14: this paper goes into much more detail about what’s required for crash consistency than this post does. It also gives a fair amount of detail about how exactly applications fail, including diagrams of traces that indicate which false assumptions are embedded in each trace.

Chidambaram et al., FAST ‘12: the same filesystem primitives are responsible for both consistency and ordering. The authors propose alternative primitives that separate these concerns, allowing better performance while maintaining safety.

Rajimwale et al. DSN ‘11: you probably shouldn’t use disks that ignore flush directives, but in case you do, here’s a protocol that forces those disks to flush using normal filesystem operations. As you might expect, the performance of this is quite bad.

Prabhakaran et al. SOSP ‘05: This has a lot more detail on filesystem responses to errors than was covered in this post. The authors also discuss JFS, an IBM filesystem for AIX. Although it was designed for high-reliability systems, it isn’t particularly more reliable than the alternatives. Related material is covered further in DSN ‘08, StorageSS ‘06, DSN ‘06, FAST ‘08, and USENIX ‘09, among others.

Gunawi et al. FAST ‘08: Again, much more detail than is covered in this post on when errors get dropped, and how they wrote their tools. They also have some call graphs that give you one rough measure of the complexity involved in a filesystem. The XFS call graph is particularly messy, and one of the authors noted in a presentation that an XFS developer said that XFS was fun to work on since they took advantage of every possible optimization opportunity regardless of how messy it made things.

Bairavasundaram et al. SIGMETRICS ‘07: There’s a lot of information on disk error locality and disk error probability over time that isn’t covered in this post. A follow-up paper in FAST ‘08 has more details.

Gunawi et al. OSDI ‘08: This paper has a lot more detail about when fsck doesn’t work. In a presentation, one of the authors mentioned that fsck is the only program that’s ever insulted him. Apparently, if you have a corrupt pointer that points to a superblock, fsck destroys the superblock (possibly rendering the disk unmountable), tells you something like “you dummy, you must have run fsck on a mounted disk”, and then gives up. In the paper, the authors reimplement basically all of fsck using a declarative model, and find that the declarative version is shorter, easier to understand, and much easier to extend, at the cost of being somewhat slower.

Memory errors are beyond the scope of this post, but memory corruption can cause disk corruption. This is especially annoying because memory corruption can cause you to take a checksum of bad data and write a bad checksum. It’s also possible to corrupt in-memory pointers, which often results in something very bad happening. See the Zhang et al. FAST ‘10 paper for more on how ZFS is affected by that. There’s a meme going around that ZFS is safe against memory corruption because it checksums, but that paper found that critical things held in memory aren’t checksummed, and that memory errors can cause data corruption in real scenarios.

The sqlite devs are serious about both documentation and testing. If I wanted to write a reliable desktop application, I’d start by reading the sqlite docs and then talking to some of the core devs. If I wanted to write a reliable distributed application I’d start by getting a job at Google and then reading the design docs and postmortems for GFS, Colossus, Spanner, etc. J/k, but not really.

We haven’t looked at formal methods at all, but there have been a variety of attempts to formally verify properties of filesystems, such as SibylFS.

This list isn’t intended to be exhaustive. It’s just a list of things I’ve read that I think are interesting.

Update: many people have read this post and suggested that, in the first file example, you should use the much simpler protocol of copying the file to be modified to a temp file, modifying the temp file, and then renaming the temp file to overwrite the original file. In fact, that’s probably the most common comment I’ve gotten on this post. If you think this solves the problem, I’m going to ask you to pause for five seconds and consider the problems it might have. First, you still need to fsync in multiple places. Second, you will get very poor performance with large files. People have also suggested using many small files to work around that problem, but that will also give you very poor performance unless you do something fairly exotic. Third, if there’s a hardlink, you’ve now made the problem of crash consistency much more complicated than in the original example. Fourth, you’ll lose file metadata, sometimes in ways that can’t be fixed up after the fact. That problem can, on some filesystems, be worked around with ioctls, but that only sometimes fixes the issue, and now you’ve got filesystem-specific code just to preserve correctness even in the non-crash case. And that’s just the beginning. The fact that so many people thought this was a simple solution demonstrates that this is a problem people are prone to underestimating, even when they’re explicitly warned that people tend to underestimate it!
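
For reference, here’s roughly what the suggested copy-and-rename protocol looks like in Python once the fsyncs are in place, assuming a typical Linux filesystem. This is a sketch of what commenters are proposing, not a recommendation: it illustrates the first problem above (you need multiple fsyncs) and does nothing about the hardlink, metadata, or large-file performance problems.

    import os
    import shutil

    def rewrite_file(path, modify):
        """Copy-modify-rename sketch; assumes no hardlinks and ignores metadata."""
        tmp = path + ".tmp"          # a real version needs a unique name in the same directory
        shutil.copyfile(path, tmp)   # copies the whole file: slow for large files
        with open(tmp, "r+b") as f:
            modify(f)
            f.flush()
            os.fsync(f.fileno())     # fsync #1: the temp file's data
        os.rename(tmp, path)         # atomic on POSIX, but not yet durable
        dirfd = os.open(os.path.dirname(path) or ".", os.O_DIRECTORY)
        try:
            os.fsync(dirfd)          # fsync #2: the directory, so the rename survives a crash
        finally:
            os.close(dirfd)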

If you liked this, you’ll probably enjoy this post on CPU bugs.

Thanks to Leah Hanson, Katerina Barone-Adesi, Jamie Brandon, Kamal Marhubi, Joe Wilder, David Turner, Benjamin Gilbert, Tom Murphy, Chris Ball, Joe Doliner, Alexy Romanov, Mindy Preston, Paul McJones, and Evan Jones for comments/discussion.


  1. Turns out some commercially supported distros only support data=ordered. Oh, and when I said data=ordered was the default, that’s only the case for kernels before 2.6.30. After 2.6.30, there’s a config option, CONFIG_EXT3_DEFAULTS_TO_ORDERED; if that’s not set, the default becomes data=writeback. [return]
  2. Cases where overwrite atomicity is required were documented as known issues, and all such cases assumed single-block atomicity and not multi-block atomicity. By contrast, multiple applications (LevelDB, Mercurial, and HSQLDB) had bad data corruption bugs that came from assuming appends are atomic.

    That seems to be an indirect result of a commonly used update protocol, where modifications are logged via appends, and then logged data is written via overwrites. Application developers are careful to check for and handle errors in the actual data, but the errors in the log file are often overlooked.

    There are a number of other classes of errors discussed, and I recommend reading the paper for the details if you work on an application that writes files.

    [return]