Is developer compensation becoming bimodal?

Developer compensation has skyrocketed since the demise of the Google et al. wage-suppressing no-hire agreement, to the point where compensation rivals and maybe even exceeds compensation in traditionally remunerative fields like law, consulting, etc.

Those fields have sharply bimodal income distributions. Are programmers in for the same fate? Let’s see what data we can find. First, let’s look at data from the National Association for Law Placement, which shows when legal salaries become bimodal.

Lawyers in 1991

First-year lawyer salaries in 1991. $40k median, trailing off with the upper end just under $90k

Median salary is $40k, with the numbers slowly trickling off until about $90k. According to the BLS $90k in 1991 is worth $160k in 2016 dollars. That’s a pretty generous starting salary.

Lawyers in 2000

First-year lawyer salaries in 2000. $50k median; bimodal with peaks at $40k and $125k

By 2000, the distribution had become bimodal. The lower peak is about the same in nominal (non-inflation-adjusted) terms, putting it substantially lower in real (inflation-adjusted) terms, and there’s an upper peak at around $125k, with almost everyone coming in under $130k. $130k in 2000 is $180k in 2016 dollars. The peak on the left has moved from roughly $30k in 1991 dollars to roughly $40k in 2000 dollars; both of those translate to roughly $55k in 2016 dollars. People in the right mode are doing better, while people in the left mode are doing about the same.
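
The inflation adjustments above are easy to reproduce. Here’s a quick sketch using rough annual-average CPI-U values (the index numbers below are approximations I’m plugging in for illustration, not figures from the post; exact BLS numbers will shift the results slightly):

    # Rough CPI-U annual averages (assumed values; check BLS for exact figures).
    CPI = {1991: 136.2, 2000: 172.2, 2016: 240.0}

    def to_2016_dollars(amount, year):
        """Convert a nominal dollar amount from `year` into 2016 dollars."""
        return amount * CPI[2016] / CPI[year]

    print(f"{to_2016_dollars(90_000, 1991):,.0f}")   # ~158,600 -> the "$160k in 2016 dollars" figure
    print(f"{to_2016_dollars(130_000, 2000):,.0f}")  # ~181,200 -> the ~$180k upper peak in 2000
    print(f"{to_2016_dollars(30_000, 1991):,.0f}")   # ~52,900 -> left mode in 1991, roughly $55k
    print(f"{to_2016_dollars(40_000, 2000):,.0f}")   # ~55,700 -> left mode in 2000, roughly $55k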

I won’t belabor the point with more graphs, but if you look at more recent data, the middle area between the two modes has hollowed out, increasing the level of inequality within the field. As a profession, lawyers have gotten hit hard by automation, and in real terms, 95%-ile offers today aren’t really better than they were in 2000. But 50%-ile and even 75%-ile offers are worse off due to the bimodal distribution.

Programmers in 2015

Enough about lawyers! What about programmers? Unfortunately, it’s hard to get good data on this. Anecdotally, it sure seems to me like we’re going down the same road. But almost all of the public data sources that are available, like H1B data, have salary numbers and not total compensation numbers. Since compensation at the upper end is disproportionately bonus and stock, most data sets I can find don’t capture what’s going on.

One notable exception is the new grad compensation data recorded by Dan Zhang and Jesse Collins:

First-year programmer compensation in 2016. Compensation ranges from $50k to $250k

There’s certainly a wide range here, and while it’s technically bimodal, there isn’t a huge gulf in the middle like you see in law and business. Note that this data is mostly bachelor’s grads with a few master’s grads. PhD numbers, which sometimes go much higher, aren’t included.

Do you know of a better (larger) source of data? This is from about 100 data points, members of the “Hackathon Hackers” Facebook group, in 2015. Dan and Jesse also have data from 2014, but it would be nice to get data over a wider timeframe and just plain more data. Also, this data is pretty clearly biased towards the high end – if you look at national averages for programmers at all levels of experience, the average comes in much lower than the average for new grads in this data set. The data here match the numbers I hear when we compete for people, but the population of “people negotiating offers at Microsoft” also isn’t representative.

If we had more representative data it’s possible that we’d see a lot more data points in the $40k to $60k range along with the data we have here, which would make the data look bimodal. It’s also possible that we’d see a lot more points in the $40k to $60k range, many more in the $70k to $80k range, some more in the $90k+ range, etc., and we’d see a smooth drop-off instead of two distinct modes.
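
If someone does collect a bigger sample, checking for bimodality doesn’t require anything fancy. Here’s a minimal sketch of one way to do it (the use of scikit-learn and the made-up offer numbers are my own choices for illustration, not part of the data set above): compare one- and two-component Gaussian mixtures on log compensation and see whether two well-separated modes genuinely fit better than one.

    import numpy as np
    from sklearn.mixture import GaussianMixture

    def looks_bimodal(comps, seed=0):
        """Crude bimodality check: does a 2-component Gaussian mixture fit
        log-compensation better (lower BIC) than a single Gaussian, and are
        the two component means well separated?"""
        x = np.log(np.asarray(comps, dtype=float)).reshape(-1, 1)
        g1 = GaussianMixture(1, random_state=seed).fit(x)
        g2 = GaussianMixture(2, random_state=seed).fit(x)
        means = np.sort(g2.means_.ravel())
        separation = (means[1] - means[0]) / np.sqrt(g2.covariances_.max())
        return g2.bic(x) < g1.bic(x) and separation > 2.0

    # Made-up offers clustered around $60k and $150k:
    print(looks_bimodal([55e3, 60e3, 65e3, 70e3, 140e3, 150e3, 160e3, 170e3]))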

Stepping back from the meager data we have and looking at the circumstances, “should” programmer compensation be bimodal? Most other fields that have bimodal compensation have a very different compensation structure than we see in programming. For example, top law and consulting firms have an up-or-out structure, which is effectively a tournament, and which distorts compensation in a way that makes a bimodal outcome more likely. Additionally, competitive firms pay the same rate to all 1st year employees, which they determine by matching whoever appears to be paying the most. For example, this year, Cravath announced that it would pay first-year associates $180k, and many other firms followed suit. Like most high-end firms, Cravath has a salary schedule that’s entirely based on experience:

  • 0 years: $180k
  • 1 year: $190k
  • 2 years: $210k
  • 3 years: $235k
  • 4 years: $260k
  • 5 years: $280k
  • 6 years: $300k
  • 7 years: $315k

In software, compensation tends to be on a case-by-case basis, which makes it much less likely that we’ll see a sharp peak the way we do in law. If I had to guess, I’d say that while the dispersion in programmer compensation is increasing, it’s not bimodal, but I don’t really have the right data set to conclusively say anything. Please point me to any data you have that’s better.

Appendix A: please don’t send me these

  • H-1B: mostly salary only.
  • Stack Overflow survey: salary only. Also, data is skewed by the heavy web focus of the survey – I stopped doing the survey when none of their job descriptions matched anyone in my entire building, and I know other people who stopped for the same reason.
  • Glassdoor: weirdly inconsistent about whether or not it includes stock compensation. Numbers for some companies seem to, but numbers for other companies don’t.
  • O’Reilly survey: salary focused.
  • BLS: doesn’t make fine-grained distribution available.
  • IRS: they must have the data, but they’re not sharing.
  • IDG: only has averages.
  • internal company data: too narrow.
  • compensation survey companies like PayScale: when I’ve talked to people from these companies, they acknowledge that they have very poor visibility into large company compensation, but that’s what drives the upper end of the market (outside of finance).
  • #talkpay on twitter: numbers skew low1.

Appendix B: wtf?

Since we have both programmer and lawyer compensation handy, let’s examine that. Programming pays so well that it seems a bit absurd. If you look at other careers with similar compensation, there are multiple factors that act as barriers or disincentives to entry.

If you look at law, you have to win the prestige lottery and get into a top school, which will cost hundreds of thousands of dollars. Then you have to win the grades lottery and get good enough grades to get into a top firm. And then you have to continue winning tournaments to avoid getting kicked out, which requires sacrificing any semblance of a personal life. Consulting, investment banking, etc., are similar. Compensation appears to be proportional to the level of sacrifice (e.g., investment bankers are paid better, but work even longer hours than lawyers).

Medicine seems to be a bit better from the sacrifice standpoint because there’s a cartel which limits entry into the field, but the combination of medical school and residency is still incredibly brutal compared to most jobs at places like Facebook and Google.

Programming also doesn’t have a licensing body limiting the number of programmers, nor is there the same prestige filter where you have to go to a top school to get a well-paying job. Sure, there are a lot of startups who basically only hire from MIT, Stanford, CMU, and a few other prestigious schools, and I see job ads like the following whenever I look at startups:

Our team of 14 includes 6 MIT alumni, 3 ex-Googlers, 1 Wharton MBA, 1 MIT Master in CS, 1 CMU CS alum, and 1 “20 under 20” Thiel fellow. Candidates often remark we’re the strongest team they’ve ever seen.

We’re not for everyone. We’re an enterprise SaaS company your mom will probably never hear of. We work really hard 6 days a week because we believe in the future of mobile and we want to win.

That happens. But, in programming, measuring people by markers of prestige seems to be a Silicon Valley startup thing and not a top-paying companies thing. Big companies, which pay a lot better than startups, don’t filter people out by prestige nearly as often. Not only do you not need the right degree from the right school, you also don’t need to have the right kind of degree, or any degree at all. Although it’s getting rarer to not have a degree, I still meet new hires with no experience and either no degree or a degree in an unrelated field (like sociology or philosophy).

How is it possible that programmers are paid so well without these other barriers to entry that similarly remunerative fields have? One possibility is that we have a shortage of programmers. If that’s the case, you’d expect more programmers to enter the field, bringing down compensation. CS enrollments have been at record levels recently, so this may already be happening. Another possibility is that programming is uniquely hard in some way, but that seems implausible to me. Programming doesn’t seem inherently harder than electrical engineering or chemical engineering and it certainly hasn’t gotten much harder over the past decade, but during that timeframe, programming has gone from having similar compensation to most engineering fields to paying much better. The last time I was negotiating with an EE company about offers, they remarked to me that their VPs don’t make as much as I do, and I work at a software company that pays relatively poorly compared to its peers. There’s no reason to believe that we won’t see a flow of people from engineering fields into programming until compensation is balanced.

Another possibility is that U.S. immigration laws act as a protectionist barrier to prop up programmer compensation. It seems impossible for this to last (why shouldn’t there be really valuable non-U.S. companies?), but it does appear to be somewhat true for now. When I was at Google, one thing that was remarkable to me was that they’d pay you approximately the same thing in a small midwestern town as in Silicon Valley, but they’d pay you much less in London. Whenever one of these discussions comes up, people always bring up the “fact” that SV salaries aren’t really as good as they sound because the cost of living is so high, but companies will not only match SV offers in Seattle, they’ll match them in places like Madison, Wisconsin. My best guess for why this happens is that someone in the midwest can credibly threaten to move to SV and take a job at any company there, whereas someone in London can’t2. While we seem unlikely to loosen current immigration restrictions, our immigration restrictions have caused and continue to cause people who would otherwise have founded companies in the U.S. to found companies elsewhere. Given that the U.S. doesn’t have a monopoly on people who found startups and that we do our best to keep people who want to found startups here out, it seems inevitable that there will eventually be Facebooks and Googles founded outside of the U.S. who compete for programmers the same way companies compete inside the U.S.

Another theory that I’ve heard a lot lately is that programmers at large companies get paid a lot because of the phenomenon described in Kremer’s O-ring model. This model assumes that productivity is multiplicative. If your co-workers are better, you’re more productive and produce more value. If that’s the case, you expect a kind of assortative matching where you end up with high-skill firms that pay better, and low-skill firms that pay worse. This model has a kind of intuitive appeal to it, but it can’t explain why programming compensation has higher dispersion than (for example) electrical engineering compensation. With the prevalence of open source, it’s much easier to utilize the work of productive people outside your firm than in most fields. This model should be less true of programming than of most engineering fields, but the dispersion in compensation is higher.
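
To make the O-ring intuition concrete, here’s a toy sketch (my own illustration with made-up skill numbers, not Kremer’s actual model): when team output is the product of individual skills, total output across firms is maximized by grouping high-skill workers together and low-skill workers together, which is the assortative matching the model predicts.

    from itertools import permutations

    # Skill levels of four workers; each firm employs two workers and
    # produces the *product* of its workers' skills (O-ring style).
    skills = [0.5, 0.6, 0.9, 1.0]

    def total_output(assignment):
        firm_a, firm_b = assignment[:2], assignment[2:]
        return firm_a[0] * firm_a[1] + firm_b[0] * firm_b[1]

    best = max(permutations(skills), key=total_output)
    print(best[:2], best[2:], total_output(best))
    # Prints a sorted split, (0.5, 0.6) and (0.9, 1.0): matching high with
    # high and low with low beats any mixed assignment.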

I don’t understand this at all and would love to hear a compelling theory for why programming “should” pay more than other similar fields, or why it should pay as much as fields that have much higher barriers to entry.


  1. People often worry that comp surveys will skew high because people want to brag, but the reality seems to be that numbers skew low because people feel embarrassed about sounding like they’re bragging. I have a theory that you can see this reflected in the prices of other goods. For example, if you look at house prices, they’re generally predictable based on location, square footage, amenities, and so on. But there’s a significant penalty for having the largest house on the block, for what (I suspect) is the same reason people with the highest compensation disproportionately don’t participate in #talkpay: people don’t want to admit that they have the highest pay, have the biggest house, or drive the fanciest car. Well, some people do, but on average, bragging about that stuff is seen as quite gauche. [return]
  2. There’s a funny move some companies will do where they station the new employee in Canada for a year before importing them into the U.S., which gets them into a visa process that’s less competitive. But this is enough of a hassle that most employees balk at the idea. [return]

Why's that company so big? I could do that in a weekend

I can’t think of a single large software company that doesn’t regularly draw internet comments of the form “What do all the employees do? I could build their product myself.” Benjamin Pollack and Jeff Atwood called out people who do that with Stack Overflow. But Stack Overflow is relatively obviously lean, so the general response is something like “oh, sure maybe Stack Overflow is lean, but FooCorp must really be bloated”. And since most people have relatively little visibility into FooCorp, for any given value of FooCorp, that sounds like a plausible statement. After all, what product could possibly require hundreds, or even thousands of engineers?

A few years ago, in the wake of the rapgenius SEO controversy, a number of folks called for someone to write a better Google. Alex Clemmer responded that maybe building a better Google is a non-trivial problem. Considering how much of Google’s $500B market cap comes from search, and how much money has been spent by tens (hundreds?) of competitors in an attempt to capture some of that value, it seems plausible to me that search isn’t a trivial problem. But in the comments on Alex’s posts, multiple people respond and say that Lucene basically does the same thing Google does and that Lucene is poised to surpass Google’s capabilities in the next few years.

What would Lucene at Google’s size look like? If we do a naive back of the envelope calculation on what it would take to index a significant fraction of the internet (often estimated to be 1 trillion (T) or 10T documents), we might expect a 1T document index to cost something like $10B1. That’s not a feasible startup, so let’s say that instead of trying to index 1T documents, we want to maintain an artisanal search index of 1B documents. Then our cost comes down to $12M/yr. That’s not so bad – plenty of startups burn through more than that every year. While we’re in the VC-funded hypergrowth mode, that’s fine, but once we have a real business, we’ll want to consider trying to save money. At $12M/yr for the index, a performance improvement that lets us trim our costs by 3% is worth roughly $360k/yr. With those kinds of costs, it’s surely worth it to have at least one engineer working full-time on optimization, if not more.
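
Here’s that back-of-the-envelope math spelled out, using the assumptions from the footnote (roughly 5M documents per machine, 2x replication, 10 geographic clusters, and a VM at about $0.352/hr; all of these are rough inputs, not measurements):

    HOURS_PER_YEAR = 24 * 365          # ~8,760
    DOCS_PER_MACHINE = 5e6             # one Lucene shard per machine (rough)
    REPLICATION = 2                    # one replica per shard
    CLUSTERS = 10                      # geographic regions
    VM_COST_PER_HOUR = 0.352           # GCE n1-highmem-8-ish price, 2016

    def yearly_index_cost(num_docs):
        machines = (num_docs / DOCS_PER_MACHINE) * REPLICATION * CLUSTERS
        return machines * VM_COST_PER_HOUR * HOURS_PER_YEAR

    print(f"1T docs: ${yearly_index_cost(1e12) / 1e9:.1f}B/yr")   # ~ $12.3B/yr
    print(f"1B docs: ${yearly_index_cost(1e9) / 1e6:.1f}M/yr")    # ~ $12.3M/yr
    print(f"3% savings on 1B docs: ${0.03 * yearly_index_cost(1e9) / 1e3:.0f}k/yr")  # ~ $370k/yr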

Businesses that actually care about turning a profit will spend a lot of time (hence, a lot of engineers) working on optimizing systems, even if an MVP for the system could have been built in a weekend. There’s also a wide body of research that’s found that decreasing latency has a roughly linear effect on revenue over a pretty wide range of latencies and businesses. Businesses should keep adding engineers to work on optimization until the cost of adding an engineer equals the revenue gain plus the cost savings at the margin. This is often many more engineers than people realize.
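
As a toy version of that marginal argument: keep hiring optimization engineers while the marginal saving from one more engineer exceeds their fully loaded cost. Every number below (the infrastructure bill, the per-engineer cost, and the diminishing-returns curve) is invented purely for illustration:

    INFRA_COST = 500e6        # $/yr spent on hardware (assumed)
    ENGINEER_COST = 400e3     # fully loaded $/yr per engineer (assumed)
    FIRST_ENG_SAVINGS = 0.03  # first optimization engineer trims 3% (assumed)
    DECAY = 0.8               # each additional engineer is 20% less effective (assumed)

    def marginal_saving(n):
        """Savings contributed by the n-th optimization engineer (1-indexed)."""
        return INFRA_COST * FIRST_ENG_SAVINGS * DECAY ** (n - 1)

    n = 0
    while marginal_saving(n + 1) > ENGINEER_COST:
        n += 1
    print(n)  # break-even headcount under these made-up assumptions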

And that’s just performance. Features also matter: when I talk to engineers working on basically any product at any company, they can often point to seemingly trivial individual features that add integer percentage points to revenue. Just as with performance, people underestimate how many engineers you can add to a product before engineers stop paying for themselves.

Additionally, features are often much more complex than outsiders realize. If we look at search, how do we make sure that different forms of dates and phone numbers give the same results? How about internationalization? Each language has unique quirks that have to be accounted for. In French, “l’foo” should often match “un foo” and vice versa, but American search engines from the 90s didn’t actually handle that correctly. How about tokenizing Chinese queries, where words don’t have spaces between them, and sentences don’t have unique tokenizations? How about Japanese, where queries can easily contain four different alphabets? How about handling Arabic, which is mostly read right-to-left, except for the bits that are read left-to-right? And that’s not even the most complicated part of handling Arabic! It’s fine to ignore this stuff for a weekend-project MVP, but ignoring it in a real business means ignoring the majority of the market! Some of these are handled ok by open source projects, but many of the problems involve open research problems.
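
To give a flavor of why “just tokenize the query” is not one feature but dozens, here’s a toy sketch of two of the issues above. The elision rule is a crude stand-in I made up; a real engine needs full per-language analyzers:

    import re

    def tokenize_french(query):
        """Naive French analyzer: split on whitespace, then strip an elided
        article so that l'foo and "un foo" can both match on "foo"."""
        tokens = query.lower().split()
        return [re.sub(r"^(l|d|j|qu)'", "", t) for t in tokens]

    print(tokenize_french("l'ordinateur"))   # ['ordinateur']
    print(tokenize_french("un ordinateur"))  # ['un', 'ordinateur']

    # Chinese has no spaces, so whitespace tokenization returns one giant
    # token; a real engine needs a segmenter, and sentences can segment in
    # more than one valid way.
    print("北京大学生".split())  # ['北京大学生'] -- is it 北京 大学生 or 北京大学 生?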

There’s also security! If you don’t “bloat” your company by hiring security people, you’ll end up like hotmail or yahoo, where your product is better known for how often it’s hacked than for any of its other features.

Everything we’ve looked at so far is a technical problem. Compared to organizational problems, technical problems are straightforward. Distributed systems are considered hard because real systems might drop something like 0.1% of messages, corrupt an even smaller percentage of messages, and see latencies in the microsecond to millisecond range. When I talk to higher-ups and compare what they think they’re saying to what my coworkers think they’re saying, I find that the rate of lost messages is well over 50%, every message gets corrupted, and latency can be months or years2. When people imagine how long it should take to build something, they’re often imagining a team that works perfectly and spends 100% of its time coding. But that’s impossible to scale up. The question isn’t whether or not there will be inefficiencies, but how much inefficiency. A company that could eliminate organizational inefficiency would be a larger innovation than any tech startup, ever. But when doing the math on how many employees a company “should” have, people usually assume that the company is an efficient organization.

This post happens to use search as an example because I ran across some people who claimed that Lucene was going to surpass Google’s capabilities any day now, but there’s nothing about this post that’s unique to search. If you talk to people in almost any field, you’ll hear stories about how people wildly underestimate the complexity of the problems in the field. The point here isn’t that it would be impossible for a small team to build something better than Google search. It’s entirely plausible that someone will have an innovation as great as PageRank, and that a small team could turn that into a viable company. But once that company is past the VC-funded hyper growth phase and wants to maximize its profits, it will end up with a multi-thousand person platforms org, just like Google’s, unless the company wants to leave hundreds of millions or billions of dollars a year on the table due to hardware and software inefficiency. And the company will want to handle languages like Thai, Arabic, Chinese, and Japanese, each of which is non-trivial. And the company will want to have relatively good security. And there are the hundreds of little features that users don’t even realize are there, each of which provides a noticeable increase in revenue. It’s “obvious” that companies should outsource their billing, except that when you talk to companies that handle their own billing, they can point to individual features that increase conversion by single or double digit percentages that they can’t get from Stripe or Braintree. That fifty person billing team is totally worth it, beyond a certain size. And then there’s sales, which most engineers don’t even think of3, not to mention research (which, almost by definition, involves a lot of bets that don’t pan out).

It’s not that all of those things are necessary to run a service at all; it’s that almost every large service is leaving money on the table if they don’t seriously address those things. This reminds me of a common fallacy we see in unreliable systems, where people build the happy path with the idea that the happy path is the “real” work, and that error handling can be tacked on later. For reliable systems, error handling is more work than the happy path. The same thing is true for large services – all of this stuff that people don’t think of as “real” work is more work than the core service4.

I’m experimenting with writing blog posts stream-of-consciousness, without much editing. Both this post and my last post were written that way. Let me know what you think of these posts relative to my “normal” posts!

Thanks to Leah Hanson, Joel Wilder, Kay Rhodes, and Ivar Refsdal for corrections.


  1. In public benchmarks, Lucene appears to get something like 30 QPS - 40 QPS when indexing wikipedia on a single machine. See anandtech, Haque et al., ASPLOS 2015, etc. I’ve seen claims that Lucene can run 10x faster than that on wikipedia but I haven’t seen a reproducible benchmark setup showing that, so let’s say that we can expect to get something like 30 QPS - 300 QPS if we index a wikipedia-sized corpus on one machine.

    Those benchmarks appear to be indexing English Wikipedia, articles only. That’s roughly 50 GB and approximately 5m documents. Estimates of the size of the internet vary, but public estimates often fall into the range of 1 trillion (T) to 10T documents. Say we want to index 1T documents, and we can put 5m documents per machine: we need 1T/5m = 200k machines to handle all of the extra documents. None of the off-the-shelf sharding/distribution solutions that are commonly used with Lucene can scale to 200k machines, but let’s posit that we can solve that problem and can operate a search cluster with 200k machines. We’ll also need to have some replication so that queries don’t return bad results if a single machine goes down. If we replicate every machine once, that’s 400k machines. But that’s 400k machines for just one cluster. If we only have one cluster sitting in some location, users in other geographic regions will experience bad latency to our service, so maybe we want to have ten such clusters. If we have ten such clusters, that’s 4M machines.

    In the Anandtech wikipedia benchmark, they get 30 QPS out of a single-socket Broadwell Xeon D with 64 GB of RAM (enough to fit the index in memory). If we don’t want to employ the army of people necessary to build out and run 4M machines worth of datacenters, AFAICT the cheapest VM that’s plausibly at least as “good” as that machine is the GCE n1-highmem-8, which goes for $0.352/hr. If we multiply that out by 4M machines, that’s a little over $1.4M an hour, or a little more than $12B a year for a service that can’t even get within an order of magnitude of the query rate or latency necessary to run a service like Google or Bing. And that’s just for the index – even a minimal search engine also requires crawling. BTW, people will often claim that this is easy because they have much larger indices in Lucene, but with a posting-list based algorithm like Lucene, you can very roughly think of query rate as inversely related to the number of postings. When you ask these people with their giant indices what their query rate is, you’ll inevitably find that it’s glacial by internet standards. For reference, the core of twitter was a rails app that could handle something like 200 QPS until 2008. If you look at what most people handle with Lucene, it’s often well under 1 QPS, with documents that are much smaller than the average web document, using configurations that damage search relevance too much to be used in commercial search engines (e.g., using stop words). That’s fine, but the fact that people think that sort of experience is somehow relevant to web search is indicative of the problem this post is discussing.

    That also assumes that we won’t hit any other scaling problem if we can make 400k VM clusters. But finding an open source index which will scale not only to the number of documents on the internet, but also the number of terms, is non-trivial. Before you read the next section, try guessing how many unique terms there are online. And then if we shard the internet so that we have 5m documents per machine, try guessing how many unique terms you expect to see per shard.

    When I ask this question, I often hear guesses like “ten million” or “ten billion”. But without even looking at the entire internet, just looking at one single document on github, we can find a document with fifty million unique terms:

    Crista Lopes: The largest C++ file we found in GitHub has 528MB, 57 lines of code. Contains the first 50,847,534 primes, all hard coded into an array.

    So there are definitely more than ten million unique terms on the entire internet! In fact, there’s a website out there that has all primes under one trillion. I believe there are something like thirty-seven billion of those. If that website falls into one shard of our index, we’d expect to see more than thirty-seven billion terms in a single shard; that’s more than most people guess we’ll see on the entire internet, and that’s just in one shard that happens to contain one somewhat pathological site. If we try to put the internet into any existing open source index that I know of, not only will it not be able to scale out enough horizontally, many shards will contain data weird enough to make the entire shard fall over if we run a query. That’s nothing against open source software; like any software, it’s designed to satisfy the needs of its users, and none of its users do anything like index the entire internet. As businesses scale up, they run into funny corner cases that people without exposure to the particular domain don’t anticipate.

    People often object that you don’t need to index all of this weird stuff. There have been efforts to build web search engines that only index the “important” stuff, but it turns out that if you ask people to evaluate search engines, some people will type in the weirdest queries they can think of and base their evaluation off of that. And others type in what they think of as normal queries for their day-to-day work even if they seem weird to you (e.g., a biologist might query for GTGACCTTGGGCAAGTTACTTAACCTCTCTGTGCCTCAGTTTCCTCATCTGTAAAATGGGGATAATA). If you want to be anything but a tiny niche player, you have to handle not only the weirdest stuff you can think of, but the weirdest stuff that many people can think of.

    [return]
  2. Recently, I was curious why an org that’s notorious for producing unreliable services produces so many unreliable services. When I asked around about why, I found that upper management was afraid of sending out any sort of positive message about reliability because they were afraid that people would use that as an excuse to slip schedules. Upper management changed their message to include reliability about a year ago, but if you talk to individual contributors, they still believe that the message is that features are the #1 priority and slowing down on features to make things more reliable is bad for your career (and based on who’s getting promoted the individual contributors appear to be right). Maybe in another year, the org will have really gotten the message through to the people who hand out promotions, and in another couple of years, enough software will have been written with reliability in mind that they’ll actually have reliable services. Maybe. That’s just the first-order effect. The second-order effect is that their policies have caused a lot of people who care about reliability to go to companies that care more about reliability and less about demo-ing shiny new features. They might be able to fix that in a decade. Maybe. That’s made harder by the fact that the org is in a company that’s well known for having PMs drive features above all else. If that reputation is possible to change, it will probably take multiple decades. [return]
  3. For a lot of products, the sales team is more important than the engineering team. If we build out something rivaling Google search, we’ll probably also end up with the infrastructure required to sell a competitive cloud offering. Google actually tried to do that without having a serious enterprise sales force and the result was that AWS and Azure basically split the enterprise market between them. [return]
  4. This isn’t to say that there isn’t waste or that different companies don’t have different levels of waste. I see waste everywhere I look, but it’s usually not what people on the outside think of as waste. Whenever I read outsiders’ descriptions of what’s wasteful at the companies I’ve worked at, they’re almost inevitably wrong. Friends of mine who work at other places also describe the same dynamic. [return]

Developer hiring and the market for lemons

Joel Spolsky has a classic blog post on “Finding Great Developers” where he popularized the meme that great developers are impossible to find, a corollary of which is that if you can find someone, they’re not great. Joel writes,

The great software developers, indeed, the best people in every field, are quite simply never on the market.

The average great software developer will apply for, total, maybe, four jobs in their entire career.

If you’re lucky, if you’re really lucky, they show up on the open job market once, when, say, their spouse decides to accept a medical internship in Anchorage and they actually send their resume out to what they think are the few places they’d like to work at in Anchorage.

But for the most part, great developers (and this is almost a tautology) are, uh, great, (ok, it is a tautology), and, usually, prospective employers recognize their greatness quickly, which means, basically, they get to work wherever they want, so they honestly don’t send out a lot of resumes or apply for a lot of jobs.

Does this sound like the kind of person you want to hire? It should. The corollary of that rule–the rule that the great people are never on the market–is that the bad people–the seriously unqualified–are on the market quite a lot. They get fired all the time, because they can’t do their job. Their companies fail–sometimes because any company that would hire them would probably also hire a lot of unqualified programmers, so it all adds up to failure–but sometimes because they actually are so unqualified that they ruined the company. Yep, it happens.

These morbidly unqualified people rarely get jobs, thankfully, but they do keep applying, and when they apply, they go to Monster.com and check off 300 or 1000 jobs at once trying to win the lottery.

Astute readers, I expect, will point out that I’m leaving out the largest group yet, the solid, competent people. They’re on the market more than the great people, but less than the incompetent, and all in all they will show up in small numbers in your 1000 resume pile, but for the most part, almost every hiring manager in Palo Alto right now with 1000 resumes on their desk has the same exact set of 970 resumes from the same minority of 970 incompetent people that are applying for every job in Palo Alto, and probably will be for life, and only 30 resumes even worth considering, of which maybe, rarely, one is a great programmer. OK, maybe not even one.

Joel’s claim is basically that “great” developers won’t have that many jobs compared to “bad” developers because companies will try to keep “great” developers. Joel also posits that companies can recognize prospective “great” developers easily. But these two statements are hard to reconcile. If it’s so easy to identify prospective “great” developers, why not try to recruit them? You could just as easily make the case that “great” developers are overrepresented in the market because they have better opportunities and it’s the “bad” developers who will cling to their jobs. This kind of adverse selection is common in companies that are declining; I saw that in my intern cohort at IBM1, among other places.

Should “good” developers be overrepresented in the market or underrepresented? If we listen to the anecdotal griping about hiring, we might ask if the market for developers is a market for lemons. This idea goes back to Akerlof’s Nobel prize winning 1970 paper, “The Market for ‘Lemons’: Quality Uncertainty and the Market Mechanism”. Akerlof takes used car sales as an example, splitting the market into good used cars and bad used cars (bad cars are called “lemons”). If there’s no way to distinguish between good cars and lemons, good cars and lemons will sell for the same price. Since buyers can’t distinguish between good cars and bad cars, the price they’re willing to pay is based on the quality of the average in the market. Since owners know if their car is a lemon or not, owners of non-lemons won’t sell because the average price is driven down by the existence of lemons. This results in a feedback loop which causes lemons to be the only thing available.
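
Akerlof’s unraveling argument is easy to see in a toy simulation (the car values below are made up): buyers offer the average value of whatever is still for sale, sellers whose cars are worth more than that offer withdraw, and the process repeats until only the lemons are left.

    # Each car's true value to its owner; owners know this, buyers don't.
    cars = [1000, 3000, 5000, 7000, 9000]   # $1k lemons up to $9k gems

    def market_unravels(cars):
        on_market = list(cars)
        while True:
            offer = sum(on_market) / len(on_market)   # buyers pay the average
            still_selling = [v for v in on_market if v <= offer]
            if still_selling == on_market:            # nobody else withdraws
                return on_market, offer
            on_market = still_selling

    remaining, price = market_unravels(cars)
    print(remaining, price)   # only the cheapest car is left; the price collapses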

This model is certainly different from Joel’s model. Joel’s model assumes that “great” developers are sticky – that they stay at each job for a long time. This comes from two assumptions; first, that it’s easy for prospective employers to identify who’s “great”, and second, that once someone is identified as “great”, their current employer will do anything to keep them (as in the market for lemons). But the first assumption alone is enough to prevent the developer job market from being a market for lemons. If you can tell that a potential employee is great, you can simply go and offer them twice as much as they’re currently making (something that I’ve seen actually happen). You need an information asymmetry to create a market for lemons, and Joel posits that there’s no information asymmetry.

If we put aside Joel’s argument and look at the job market, there’s incomplete information, but both current and prospective employers have incomplete information, and whose information is better varies widely. It’s actually quite common for prospective employers to have better information than current employers!

Just for example, there’s someone I’ve worked with, let’s call him Bob, who’s saved two different projects by doing the grunt work necessary to keep the project from totally imploding. The projects were both declared successes, promotions went out, they did a big PR blitz which involves seeding articles in all the usual suspects, like Wired, and so on and so forth. That’s worked out great for the people who are good at taking credit for things, but it hasn’t worked out so well for Bob. In fact, someone else I’ve worked with recently mentioned to me that management keeps asking him why Bob takes so long to do simple tasks. The answer is that Bob’s busy making sure the services he works on don’t have global outages when they launch, but that’s not the kind of thing you get credit for in Bob’s org. The result of that is that Bob has a network who knows that he’s great, which makes it easy for him to get a job anywhere else at market rate. But his management chain has no idea, and based on what I’ve seen of offers today, they’re paying him about half what he could make elsewhere. There’s no shortage of cases where information transfer inside a company is so poor that external management has a better view of someone’s productivity than internal management. I have one particular example in mind, but if I just think of the Bob archetype, off the top of my head, I know of four people who are currently in similar situations. It helps that I currently work at a company that’s notorious for being dysfunctional in this exact way, but this happens everywhere. When I worked at a small company, we regularly hired great engineers from big companies that were too clueless to know what kind of talent they had.

Another problem with the idea that “great” developers are sticky is that this assumes that companies are capable of creating groups that developers want to work for on demand. This is usually not the case. Just for example, I once joined a team where the TL was pretty strongly against using version control or having tests. As a result of those (and other) practices, it took five devs one year to produce 10k lines of kinda-sorta working code for a straightforward problem. Additionally, it was a pressure cooker where people were expected to put in 80+ hour weeks, where the PM would shame people into putting in longer hours. Within a year, three of the seven people who were on the team when I joined had left; two of them went to different companies. The company didn’t want to lose those two people, but it wasn’t capable of creating an environment that would keep them.

Around when I joined that team, a friend of mine joined a really great team. They do work that materially impacts the world, they have room for freedom and creativity, a large component of their jobs involves learning new and interesting things, and so on and so forth. Whenever I heard about someone who was looking for work, I’d forward them that team. That team is now full for the foreseeable future because everyone whose network included that team forwarded people into that team. But if you look at the team that lost three out of seven people in a year, that team is hiring. A lot. The result of this dynamic is that, as a dev, if you join a random team, you’re overwhelmingly likely to join a team that has a lot of churn. Additionally, if you know of a good team, it’s likely to be full.

Joel’s model implicitly assumes that, proportionally, there are many more dysfunctional developers than dysfunctional work environments.

At the last conference I attended, I asked most people I met two questions:

  1. Do you know of any companies that aren’t highly dysfunctional?
  2. Do you know of any particular teams that are great and are hiring?

Not one single person told me that their company meets the criteria in (1). A few people suggested that, maybe, Dropbox is ok, or that, maybe, Jane Street is ok, but the answers were of the form “I know a few people there and I haven’t heard any terrible horror stories yet, plus I sometimes hear good stories”, not “that company is great and you should definitely work there”. Most people said that they didn’t know of any companies that weren’t a total mess.

A few people had suggestions for (2), but the most common answer was something like “LOL no, if I knew that I’d go work there”. The second most common answer was of the form “I know some people on the Google Brain team and it sounds great”. There are a few teams that are well known for being great places to work, but they’re so few and far between that it’s basically impossible to get a job on one of those teams. A few people knew of actual teams that they’d strongly recommend who were hiring, but that was rare. Much rarer than finding a developer who I’d want to work with who would consider moving. If I flipped the question around and asked if they knew of any good developers who were looking for work, the answer was usually “yes”2.

Another problem with the idea that “great” developers are impossible to find because they join companies and then stick is that developers (and companies) aren’t immutable. Because I’ve been lucky enough to work in environments that allow people to really flourish, I’ve seen a lot of people go from unremarkable to amazing. Because most companies invest pretty much nothing in helping people improve, a company can do really well here without investing much effort.

On the flip side, I’ve seen entire teams of devs go on the market because their environment changed. Just for example, I used to know a lot of people who worked at company X under Marc Yun. It was the kind of place that has low attrition because people really enjoy working there. And then Marc left. Over the next two years, literally everyone I knew who worked there left. This one change both created a lemon in the searching-for-a-team job market and put a bunch of good developers on the market. This kind of thing happens all the time, even more now than in the past because of today’s acquisition-heavy environment.

Is developer hiring a market for lemons? Well, it depends on what you mean by that. Both developers and hiring managers have incomplete information. It’s not obvious if having a market for lemons in one direction makes the other direction better or worse. The fact that joining a new team is uncertain makes developers less likely to leave existing teams, which makes it harder to hire developers. But the fact that developers often join teams which they dislike makes it easier to hire developers. What’s the net effect of that? I have no idea.

From where I’m standing, it seems really hard to find a good manager/team, and I don’t know of any replicable strategy for doing so; I have a lot of sympathy for people who can’t find a good fit because I get how hard that is. But I have seen replicable strategies for hiring, so I don’t have nearly as much sympathy for hiring managers who complain that hiring “great” developers is impossible.

When a hiring manager complains about hiring, in every single case I’ve seen so far, the hiring manager has one of the following problems:

  1. They pay too little. The last time I went looking for work, I found a 6x difference in compensation between companies who might hire me in the same geographic region. Basically all of the companies thought that they were competitive, even when they were at the bottom end of the range. I don’t know what it is, but companies always seem to think that they pay well, even when they’re not even close to being in the right range. Almost everyone I talk to tells me that they pay as much as any reasonable company. Sure, there are some companies out there that pay a bit more, but they’re overpaying! You can actually see this if you read Joel’s writing – back when he wrote the post I’m quoting above, he talked about how well Fog Creek paid. A couple years later, he complained that Google was overpaying for college kids with no experience, and more recently he’s pretty much said that you don’t want to work at companies that pay well.

  2. They pass on good or even “great” developers3. Earlier, I claimed that I knew lots of good developers who are looking for work. You might ask, if there are so many good developers looking for work, why’s it so hard to find them? Joel claims that out of a 1000 resumes, maybe 30 people will be “solid” and 970 will be “incompetent”. It seems to me it’s more like 200 will be solid and 20 will be really good. It’s just that almost everyone uses the same filters, so everyone ends up fighting over the 30 people who they think are solid.

    Matasano famously solved their hiring problem by using a different set of filters and getting a different set of people. Despite the resounding success of their strategy, pretty much everyone insists on sticking with the standard strategy of picking people with brand name pedigrees and running basically the same interview process as everyone else, bidding up the price of folks who are trendy and ignoring everyone else.

    If I look at developers I know who are in high-demand today, a large fraction of them went through a multi-year period where they were underemployed and practically begging for interesting work. These people are very easy to hire if you can find them.

  3. They’re trying to hire for some combination of rare skills. Right now, if you’re trying to hire for someone with experience in deep learning and, well, anything else, you’re going to have a bad time.

  4. They’re much more dysfunctional than they realize. I know one hiring manager who complains about how hard it is to hire. What he doesn’t realize is that literally everyone on his team is bitterly unhappy and a significant fraction of his team gives anti-referrals to friends and tells them to stay away.

    That’s an extreme case, but it’s quite common to see a VP or founder baffled by why hiring is so hard when employees consider the place to be mediocre or even bad.

Of these problems, (1), low pay, is both the most common and the simplest to fix.

In the past few years, Oracle and Alibaba have spun up new cloud computing groups in Seattle. This is a relatively competitive area, and both companies have reputations that work against them when hiring4. If you believe the complaints about how hard it is to hire, you wouldn’t think one company, let alone two, could spin up entire cloud teams in Seattle. Both companies solved the problem by paying substantially more than their competitors were offering for people with similar experience. Alibaba became known for such generous offers that when I was negotiating my offer from Microsoft, MS told me that they’d match an offer from any company except Alibaba. I believe Oracle and Alibaba have hired hundreds of engineers over the past few years.

Most companies don’t need to hire anywhere near hundreds of people; they can pay competitively without hiring so many developers that the entire market moves upwards, but they still refuse to do so, while complaining about how hard it is to hire.

(2), filtering out good potential employees, seems like the modern version of “no one ever got fired for hiring IBM”. If you hire someone with a trendy background who’s good at traditional coding interviews and they don’t work out, who could blame you? And no one’s going to notice all the people you missed out on. Like (1), this is something that almost everyone thinks they do well and they’ll say things like “we’d have to lower our bar to hire more people, and no one wants that”. But I’ve never worked at a place that doesn’t filter out a lot of people who end up doing great work elsewhere. I’ve tried to get underrated programmers5 hired at places I’ve worked, and I’ve literally never succeeded in getting one hired. Once, someone I failed to get hired managed to get a job at Google after something like four years being underemployed (and is a star there). That guy then got me hired at Google. Not hiring that guy didn’t only cost them my brilliant friend, it eventually cost them me!

BTW, this illustrates a problem with Joel’s idea that “great” devs never apply for jobs. There’s often a long time period where a “great” dev has an extremely hard time getting hired, even through their network who knows that they’re great, because they don’t look like what people think “great” developers look like. Additionally, Google, which has heavily studied which hiring channels give good results, has found that referrals and internal recommendations don’t actually generate much signal. While people will refer “great” devs, they’ll also refer terrible ones. The referral bonus scheme that most companies set up skews incentives in a way that makes referrals worse than you might expect. Because of this and other problems, many companies don’t weight referrals particularly heavily, and “great” developers still go through the normal hiring process, just like everyone else.

(3), needing a weird combination of skills, can be solved by hiring people with half or a third of the expertise you need and training people. People don’t seem to need much convincing on this one, and I see this happen all the time.

(4), dysfunction seems hard to fix. If I knew how to do that, I’d be a manager.

As a dev, it seems to me that teams I know of that are actually good environments that pay well have no problems hiring, and that teams that have trouble hiring can pretty easily solve that problem. But I’m biased. I’m not a hiring manager. There’s probably some hiring manager out there thinking: “every developer I know who complains that it’s hard to find a good team has one of these four obvious problems; if only my problems were that easy to solve!”

Thanks to Leah Hanson, David Turner, Tim Abbott, Vaibhav Sagar, Victor Felder, Ezekiel Smithburg, Juliano Bortolozzo Solanho, Stephen Tu, Pierre-Yves Baccou, Jorge Montero, Ben Kuhn, and Lindsey Kuper for comments and corrections.

If you liked this post, you’d probably enjoy this other post on the bogosity of claims that there can’t possibly be discrimination in tech hiring.


  1. The folks who stayed describe an environment that’s mostly missing mid-level people they’d want to work with. There are lifers who’ve been there forever and will be there until retirement, and there are new grads who land there at random. But, compared to their competitors, there are relatively few people with 5-15 years of experience. The person I knew who lasted the longest stayed until the 8 year mark, but he started interviewing with an eye on leaving when he found out the other person on his team who was competent was interviewing; neither one wanted to be the only person on the team doing any work, so they raced to get out the door first. [return]
  2. This section kinda makes it sound like I’m looking for work. I’m not looking for work, although I may end up forced into it if my partner takes a job outside of Seattle. [return]
  3. Moishe Lettvin has a talk I really like, where he talks about a time when he was on a hiring committee and they rejected every candidate that came up, only to find that the “candidates” were actually anonymized versions of their own interviews!

    The bit about when he first started interviewing at Microsoft should sound familiar to MS folks. As is often the case, he got thrown into the interview with no warning and no preparation. He had no idea what to do and, as a result, wrote up interview feedback that wasn’t great. “In classic Microsoft style”, his manager forwarded the interview feedback to the entire team and said “don’t do this”. “In classic Microsoft style” is a quote from Moishe, but I’ve observed the same thing. I’d like to talk about how we have a tendency to do extremely blameful postmortems and how that warps incentives, but that probably deserves its own post.

    Well, I’ll tell one story, in remembrance of someone who recently left for Google. Shortly after that guy joined, he was in the office on a weekend (a common occurrence on his team). A manager from another team pinged him on chat and asked him to sign off on some code from the other team. The new guy, wanting to be helpful, signed off on the code. On Monday, the new guy talked to his mentor and his mentor suggested that he not help out other teams like that. Later, there was an outage related to the code. In classic Microsoft style, the manager from the other team successfully pushed the blame for the outage from his team to the new guy.

    [return]
  4. For a while, Oracle claimed that the culture of the Seattle office is totally different from mainline-Oracle culture, but from what I’ve heard, they couldn’t resist Oracle-ifying the Seattle group and that part of the pitch is no longer convincing. [return]
  5. This footnote is a response to Ben Kuhn, who asked me, what types of devs are underrated and how would you find them? I think this group is diverse enough that there’s no one easy way to find them. There are people like “Bob”, who do critical work that’s simply not noticed. There are also people who are just terrible at interviewing, like Jeshua Smith. I believe he’s only once gotten a performance review that wasn’t excellent (that semester, his manager said he could only give out one top rating, and it wouldn’t be fair to give it to only one of his two top performers, so he gave them both average ratings). In every place he’s worked, he’s been well known as someone who you can go to with hard problems or questions, and much higher ranking engineers often go to him for help. I tried to get him hired at two different companies I’ve worked at and he failed both interviews. He sucks at interviews. My understanding is that his interview performance almost kept him from getting his current job, but his references were so numerous and strong that his current company decided to take a chance on him anyway. But he only had those references because his old org has been disintegrating. His new company picked up a lot of people from his old company, so there were many people at the new company that knew him. He can’t get the time of day almost anywhere else. Another person I’ve tried and failed to get hired is someone I’ll call Ashley, who got rejected in the recruiter screening phase at Google for not being technical enough, despite my internal recommendation that she was one of the strongest programmers I knew. But she came from a “nontraditional” background that didn’t fit the recruiter’s idea of what a programmer looked like, so that was that. Nontraditional is a funny term because it seems like most programmers have a “nontraditional” background, but you know what I mean.

    There’s enough variety here that there isn’t one way to find all of these people. Having a filtering process that’s more like Matasano’s and less like Google, Microsoft, Facebook, almost any YC startup you can name, etc., is probably a good start.

    [return]

Should I buy ECC memory?

Jeff Atwood, perhaps the most widely read programming blogger, has a post that makes a case against using ECC memory. My read is that his major points are:

  1. Google didn’t use ECC when they built their servers in 1999
  2. Most RAM errors are hard errors and not soft errors
  3. RAM errors are rare because hardware has improved
  4. If ECC were actually important, it would be used everywhere and not just servers. Paying for optional stuff like this is downright enterprisey

Let’s take a look at these arguments one by one:

1. Google didn’t use ECC in 1999

If you do things just because Google once did them, here are some things you might do:

A. Put your servers into shipping containers.

Articles are still written today about what a great idea this is, even though this was an experiment at Google that was deemed unsuccessful. Turns out, even Google’s experiments don’t always succeed. In fact, their propensity for “moonshots” means that they have more failed experiments than most companies. IMO, that’s a substantial competitive advantage for them. You don’t need to make that advantage bigger than it already is by blindly copying their failed experiments.

B. Cause fires in your own datacenters

Part of the post talks about how awesome these servers are:

Some people might look at these early Google servers and see an amateurish fire hazard. Not me. I see a prescient understanding of how inexpensive commodity hardware would shape today’s internet.

The last part of that is true. But the first part has a grain of truth, too. When Google started designing their own boards, one generation had a regrowth1 issue that caused a non-zero number of fires.

BTW, if you click through to Jeff’s post and look at the photo that the quote refers to, you’ll see that the boards have a lot of flex in them. That caused problems and was fixed in the next generation. You can also observe that the cabling is quite messy, which also caused problems, and was also fixed in the next generation. There were other problems, but I’ll leave those as an exercise for the reader.

C. Make servers that injure your employees

One generation of Google servers had infamously sharp edges, giving them the reputation of being made of “razor blades and hate”.

D. Create weather in your datacenters

From talking to folks at a lot of large tech companies, it seems that most of them have had a climate control issue resulting in clouds or fog in their datacenters. You might call this a clever plan by Google to reproduce Seattle weather so they can poach MS employees. Alternately, it might be a plan to create literal cloud computing. Or maybe not.

Note that these are all things Google tried and then changed. Making mistakes and then fixing them is common in every successful engineering organization. If you’re going to cargo cult an engineering practice, you should at least cargo cult current engineering practices, not something that was done in 1999.

When Google used servers without ECC back in 1999, they found a number of symptoms that were ultimately due to memory corruption, including a search index that returned effectively random results to queries. The actual failure mode here is instructive. I often hear that it’s ok to ignore ECC on these machines because it’s ok to have errors in individual results. But even when you can tolerate occasional errors, ignoring errors means that you’re exposing yourself to total corruption, unless you’ve done a very careful analysis to make sure that a single error can only contaminate a single result. In research that’s been done on filesystems, it’s been repeatedly shown that despite making valiant attempts at creating systems that are robust against a single error, it’s extremely hard to do so and basically every heavily tested filesystem can have a massive failure from a single error (see the output of Andrea and Remzi’s research group at Wisconsin if you’re curious about this). I’m not knocking filesystem developers here. They’re better at that kind of analysis than 99.9% of programmers. It’s just that this problem has been repeatedly shown to be hard enough that humans cannot effectively reason about it, and automated tooling for this kind of analysis is still far from a pushbutton process. In their book on warehouse scale computing, Google discusses error detection and correction, and ECC memory is cited as their slam dunk case: the case where it’s obvious that you should use hardware error correction2.

Google has great infrastructure. From what I’ve heard of the infra at other large tech companies, Google’s sounds like the best in the world. But that doesn’t mean that you should copy everything they do. Even if you look at their good ideas, it doesn’t make sense for most companies to copy them. They created a replacement for Linux’s work stealing scheduler that uses both hardware run-time information and static traces to allow them to take advantage of new hardware in Intel’s server processors that lets you dynamically partition caches between cores. If used across their entire fleet, that could easily save Google more money in a week than stackexchange has spent on machines in their entire history. Does that mean you should copy Google? No, not unless you’ve already captured all the lower hanging fruit, which includes things like making sure that your core infrastructure is written in highly optimized C++, not Java or (god forbid) Ruby. And the thing is, for the vast majority of companies, writing in a language that imposes a 20x performance penalty is a totally reasonable decision.

2. Most RAM errors are hard errors

The case against ECC quotes this section of a study on DRAM errors (the bolding is Jeff’s):

Our study has several main findings. First, we find that approximately 70% of DRAM faults are recurring (e.g., permanent) faults, while only 30% are transient faults. Second, we find that large multi-bit faults, such as faults that affects an entire row, column, or bank, constitute over 40% of all DRAM faults. Third, we find that almost 5% of DRAM failures affect board-level circuitry such as data (DQ) or strobe (DQS) wires. Finally, we find that chipkill functionality reduced the system failure rate from DRAM faults by 36x.

This is somewhat ironic, as this quote doesn’t sound like an argument against ECC; it sounds like an argument for chipkill, a particular class of ECC. Putting that aside, Jeff’s post points out that hard errors are twice as common as soft errors, and then mentions that they run memtest on their machines when they get them. First, a 2:1 ratio isn’t so large that you can just ignore soft errors. Second, the post implies that hard errors are fixed at the start of a machine’s life, so that a memtest run on delivery will catch them and they can’t surface later. That’s incorrect. You can think of electronics as wearing out just the same way mechanical devices wear out. The mechanisms are different, but the effects are similar. In fact, if you compare reliability analysis of chips vs. other kinds of reliability analysis, you’ll find they often use the same families of distributions to model failures. Third, Jeff’s line of reasoning implies that ECC can’t help with detection or correction of hard errors, which is not only incorrect but directly contradicted by the quote.

So, how often are you going to run memtest on your machines to try to catch these hard errors, and how much data corruption are you willing to live with? One of the key uses of ECC is not to correct errors, but to signal errors so that hardware can be replaced before silent corruption occurs. No one’s going to consent to shutting down everything on a machine every day to run memtest (that would be more expensive than just buying ECC memory), and even if you could convince people to do that, it won’t catch as many errors as ECC will.

When I worked at a company that owned about 1000 machines, we noticed that we were getting strange consistency check failures, and after maybe half a year we realized that the failures were more likely to happen on some machines than others. The failures were quite rare, maybe a couple of times a week on average, so it took a substantial amount of time to accumulate the data, and more time for someone to realize what was going on. Without knowing the cause, analyzing the logs to figure out that the errors were caused by single bit flips (with high probability) was also non-trivial. We were lucky that, as a side effect of the process we used, the checksums were calculated in a separate process, on a different machine, at a different time, so that an error couldn’t corrupt the result and propagate that corruption into the checksum. If you merely try to protect yourself with in-memory checksums, there’s a good chance you’ll perform a checksum operation on already corrupted data and compute a valid checksum of bad data, unless you’re doing some really fancy stuff with calculations that carry their own checksums (and if you’re that serious about error correction, you’re probably using ECC regardless). Anyway, after completing the analysis, we found that memtest couldn’t detect any problems, but that replacing the RAM on the bad machines caused a one to two order of magnitude reduction in error rate. Most services don’t have the kind of checksumming we had; those services will simply silently write corrupt data to persistent storage and never notice problems until a customer complains.
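
To make the checksumming point concrete, here’s a minimal Python sketch (the names and the single-record example are made up for illustration, and this is far simpler than the pipeline described above). The key property is that the checksum is computed from an independent copy of the data, ideally in a different process at a different time, and verified against the bytes that were actually persisted; a checksum computed from a buffer that’s already been corrupted is just a valid checksum of bad data.

import zlib

def checksum(data):
    # CRC32 is enough to catch single bit flips; use something stronger if
    # you also care about multi-bit or adversarial corruption.
    return zlib.crc32(data)

# Producer side (ideally its own process or machine): compute the checksum
# from the authoritative copy of the data before handing it off.
record = b"some payload"
expected = checksum(record)

# Verifier side (later, elsewhere): re-read the persisted bytes and compare
# against the independently computed checksum.
def verify(persisted, expected):
    return checksum(persisted) == expected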

3. Due to advances in hardware manufacturing, errors are very rare

The data in the post isn’t sufficient to support this assertion. Note that since RAM usage has been increasing and continues to increase at a fast exponential rate, RAM failures would have to decrease at a greater exponential rate to actually reduce the incidence of data corruption. Furthermore, as chips continue to shrink, features get smaller, making the kind of wearout issues discussed in “2” more common. For example, at 20nm, a DRAM capacitor might hold something like 50 electrons, and that number will get smaller for the next generation of DRAM as things continue to shrink.

The 2012 study that Atwood quoted has this graph on corrected errors (a subset of all errors) on ten randomly selected failing nodes (6% of nodes had at least one failure):

We’re talking between 10 and 10k errors for a typical node that has a failure, and that’s a cherry-picked study from a post that’s arguing that you don’t need ECC. Note that the nodes here only have 16GB of RAM, which is an order of magnitude less than modern servers often have, and that this was on an older process node that was less vulnerable to noise than current nodes are. For anyone who’s used to dealing with reliability issues and just wants to know the FIT rate, the study finds a FIT rate of between 0.057 and 0.071 faults per Mbit (which, contra Atwood’s assertion, is not a shockingly low number). If you take the most optimistic FIT rate, 0.057, and do the calculation for a server without much RAM (here, I’m using 128GB, since the servers I see nowadays typically have between 128GB and 1.5TB of RAM), you get an expected value of 0.057 * 1000 * 1000 * 8760 / 1000000000 = 0.5 faults per year per server. Note that this is for faults, not errors. From the graph above, we can see that a fault can easily cause hundreds or thousands of errors per month. Another thing to note is that there are multiple nodes that don’t have errors at the start of the study but develop errors later on.
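
Here’s the same back-of-the-envelope calculation as a few lines of Python, with the assumed RAM size made explicit so it’s easy to redo for a bigger server:

# Expected memory faults per server per year from a FIT rate.
# FIT here is faults per billion device-hours per Mbit, as in the study.
fit_per_mbit = 0.057           # most optimistic rate from the study
ram_gb = 128                   # assumed server RAM
mbit = ram_gb * 1024 * 8       # ~1e6 Mbit, matching the rough figure above
hours_per_year = 8760
faults_per_year = fit_per_mbit * mbit * hours_per_year / 1e9
print(faults_per_year)         # ~0.5 faults per server per year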

Sun/Oracle famously ran into this a number of decades ago. Transistors and DRAM capacitors were getting smaller, much as they are now, and memory usage and caches were growing, much as they are now. Between having smaller transistors that were less resilient to transient upset as well as more difficult to manufacture, and having more on-chip cache, the vast majority of server vendors decided to add ECC to their caches. Sun decided to save a few dollars and skip the ECC. The direct result was that a number of Sun customers reported sporadic data corruption. It took Sun multiple years to spin a new architecture with ECC cache, and Sun made customers sign an NDA to get replacement chips. Of course there’s no way to cover up this sort of thing forever, and when it came out, Sun’s reputation for producing reliable servers took a permanent hit, much like the time they tried to cover up poor performance results by introducing a clause into their terms of service disallowing benchmarking.

Another thing to note here is that when you’re paying for ECC, you’re not just paying for ECC, you’re paying for parts (CPUs, boards) that have been qual’d more thoroughly. You can easily see this with disk failure rates, and I’ve seen many people observe this in their own private datasets. In terms of public data, I believe Andrea and Remzi’s group had a SIGMETRICS paper a few years back that showed that SATA drives were 4x more likely than SCSI drives to have disk read failures, and 10x more likely to have silent data corruption. This relationship held true even with drives from the same manufacturer. There’s no particular reason to think that the SCSI interface should be more reliable than the SATA interface, but it’s not about the interface. It’s about buying a high-reliability server part vs. a consumer part. Maybe you don’t care about disk reliability in particular because you checksum everything and can easily detect disk corruption, but there are some kinds of corruption that are harder to detect.

4. If ECC were actually important, it would be used everywhere and not just servers.

Rephrased slightly, this argument is “If this feature were actually important for servers, it would be used in non-servers”. You could make this argument about a fair number of server hardware features. This is actually one of the more obnoxious problems facing large cloud vendors.

They have enough negotiating leverage to get most parts at cost, but that only works where there’s more than one viable vendor. Some of the few areas where there aren’t any viable competitors include CPUs and GPUs. Luckily for them, they don’t need that many GPUs, but they need a lot of CPUs, and the bit about CPUs has been true for a long time. There have been a number of attempts by CPU vendors to get into the server market, but each attempt so far has been fatally flawed in a way that made it obvious from an early stage that the attempt was doomed (and these are often 5 year projects, so that’s a lot of time to spend on a doomed project). The Qualcomm effort has been getting a lot of hype, but when I talk to folks I know at Qualcomm they all tell me that the current chip is basically for practice, since Qualcomm needed to learn how to build a server chip from all the folks they poached from IBM, and that the next chip is the first one that has any hope of being competitive. I have high hopes for Qualcomm as well as for other ARM efforts to build good server parts, but those efforts are still a ways away from bearing fruit.

The near total unsuitability of current ARM (and POWER) options (not including hypothetical variants of Apple’s impressive ARM chip) for most server workloads in terms of performance per TCO dollar is a bit of a tangent, so I’ll leave that for another post, but the point is that Intel has the market power to make people pay extra for server features, and they do so. Additionally, some features are genuinely more important for servers than for mobile devices with a few GB of RAM and a power budget of a few watts that are expected to randomly crash and reboot periodically anyway.

Conclusion

Should you buy ECC RAM? That depends. For servers, it’s probably a good bet considering the cost, although it’s hard to do a real cost/benefit analysis because it’s so hard to figure out the cost of silent data corruption, or the cost of having some risk of burning half a year of developer time tracking down intermittent failures only to find that they were caused by using non-ECC memory.

For normal desktop use, I’m pro-ECC, but if you don’t have regular backups set up, setting up backups probably has a better ROI than ECC. That said, if you have backups without ECC, you can easily write corrupt data into your primary store and replicate that corrupt data into your backups.

Thanks to Prabhakar Ragde, Tom Murphy, Jay Weisskopf, Leah Hanson, Joe Wilder, and Ralph Corderoy for discussion/comments/corrections. Also, thanks (or maybe anti-thanks) to Leah for convincing me that I should write up this off the cuff verbal comment as a blog post. Apologies for any errors, the lack of references, and the stilted prose; this is basically a transcription of half of a conversation and I haven’t explained terms, provided references, or checked facts in the level of detail that I normally do.


  1. One of the funnier examples of this I can think of, at least to me, is the magical self-healing fuse. Although there are many implementations, you can think of a fuse on a chip as basically a resistor. If you run some current through it, you should get a connection. If you run a lot of current through it, you’ll heat up the resistor and eventually destroy it. This is commonly used to fuse off features on chips, or to do things like set the clock rate, with the idea being that once a fuse is blown, there’s no way to unblow the fuse.

    Once upon a time, there was a semiconductor manufacturer that rushed their manufacturing process a bit and cut the tolerances a bit too fine in one particular process generation. After a few months (or years), the connection between the two ends of the fuse could regrow and cause the fuse to unblow. If you’re lucky, the fuse will be something like the high-order bit of the clock multiplier, which will basically brick the chip if changed. If you’re not lucky, it will be something that results in silent data corruption.

    I heard about problems in that particular process generation from that manufacturer from multiple people at different companies, so this wasn’t an isolated thing. When I say this is funny, I mean that it’s funny when you hear this story at a bar. It’s maybe less funny when you discover, after a year of testing, that some of your chips are failing because their fuse settings are nonsensical, and you have to respin your chip and delay the release for 3 months. BTW, this fuse regrowth thing is another example of a class of error that can be mitigated with ECC.

    This is not the issue that Google had; I only mention this because a lot of people I talk to are surprised by the ways in which hardware can fail.

    [return]
  2. In case you don’t want to dig through the whole book, most of the relevant passage is:

    In a system that can tolerate a number of failures at the software level, the minimum requirement made to the hardware layer is that its faults are always detected and reported to software in a timely enough manner as to allow the software infrastructure to contain it and take appropriate recovery actions. It is not necessarily required that hardware transparently corrects all faults. This does not mean that hardware for such systems should be designed without error correction capabilities. Whenever error correction functionality can be offered within a reasonable cost or complexity, it often pays to support it. It means that if hardware error correction would be exceedingly expensive, the system would have the option of using a less expensive version that provided detection capabilities only. Modern DRAM systems are a good example of a case in which powerful error correction can be provided at a very low additional cost. Relaxing the requirement that hardware errors be detected, however, would be much more difficult because it means that every software component would be burdened with the need to check its own correct execution. At one early point in its history, Google had to deal with servers that had DRAM lacking even parity checking. Producing a Web search index consists essentially of a very large shuffle/merge sort operation, using several machines over a long period. In 2000, one of the then monthly updates to Google’s Web index failed prerelease checks when a subset of tested queries was found to return seemingly random documents. After some investigation a pattern was found in the new index files that corresponded to a bit being stuck at zero at a consistent place in the data structures; a bad side effect of streaming a lot of data through a faulty DRAM chip. Consistency checks were added to the index data structures to minimize the likelihood of this problem recurring, and no further problems of this nature were reported. Note, however, that this workaround did not guarantee 100% error detection in the indexing pass because not all memory positions were being checked—instructions, for example, were not. It worked because index data structures were so much larger than all other data involved in the computation, that having those self-checking data structures made it very likely that machines with defective DRAM would be identified and excluded from the cluster. The following machine generation at Google did include memory parity detection, and once the price of memory with ECC dropped to competitive levels, all subsequent generations have used ECC DRAM.

    [return]

File crash consistency and filesystems are hard

$
0
0

I haven’t used a desktop email client in years. None of them could handle the volume of email I get without at least occasionally corrupting my mailbox. Pine, eudora, and outlook have all corrupted my inbox, forcing me to restore from backup. How is it that desktop mail clients are less reliable than gmail, even though my gmail account not only handles more email than I ever had on desktop clients, but also allows simultaneous access from multiple locations across the globe? Distributed systems have an unfair advantage, in that they can be robust against total disk failure in a way that desktop clients can’t, but none of the file corruption issues I’ve had have been from total disk failure. Why has my experience with desktop applications been so bad?

Well, what sort of failures can occur? Crash consistency (maintaining consistent state even if there’s a crash) is probably the easiest property to consider, since we can assume that everything, from the filesystem to the disk, works correctly; let’s consider that first.

Crash Consistency

Pillai et al. had a paper and presentation at OSDI ‘14 on exactly how hard it is to save data without corruption or data loss.

Let’s look at a simple example of what it takes to save data in a way that’s robust against a crash. Say we have a file that contains the text a foo and we want to update the file to contain a bar. The pwrite function looks like it’s designed for this exact thing. It takes a file descriptor, what we want to write, a length, and an offset. So we might try

pwrite([file], "bar", 3, 2)  // write 3 bytes at offset 2

What happens? If nothing goes wrong, the file will contain a bar, but if there’s a crash during the write, we could get a boo, a far, or any other combination. Note that you may want to consider this an example over sectors or blocks and not chars/bytes.

If we want atomicity (so we either end up with a foo or a bar but nothing in between) one standard technique is to make a copy of the data we’re about to change in an undo log file, modify the “real” file, and then delete the log file. If a crash happens, we can recover from the log. We might write something like

creat(/dir/log);                 // create the undo log
write(/dir/log, "2,3,foo", 7);   // log the offset, the length, and the old data
pwrite(/dir/orig, "bar", 3, 2);  // write 3 bytes at offset 2 of the "real" file
unlink(/dir/log);                // delete the log once the update is done

This should allow recovery from a crash without data corruption via the undo log, at least if we’re using ext3 and we made sure to mount our drive with data=journal. But we’re out of luck if, like most people, we’re using the default1 – with the default data=ordered, the write and pwrite syscalls can be reordered, causing the write to orig to happen before the write to the log, which defeats the purpose of having a log. We can fix that.

creat(/dir/log);
write(/dir/log, "2, 3, foo");
fsync(/dir/log);  // don't allow write to be reordered past pwrite
pwrite(/dir/orig, "bar", 3, 2);
fsync(/dir/orig);
unlink(/dir/log);

That should force things to occur in the correct order, at least if we’re using ext3 with data=journal or data=ordered. If we’re using data=writeback, a crash during the write or fsync to log can leave log in a state where the filesize has been adjusted for the log write, but the data hasn’t been written, which means that the log will contain random garbage. This is because with data=writeback, metadata is journaled, but data operations aren’t, which means that data operations (like writing data to a file) aren’t ordered with respect to metadata operations (like adjusting the size of a file for a write).

We can fix that by adding a checksum to the log file when creating it. If the contents of log don’t contain a valid checksum, then we’ll know that we ran into the situation described above.

creat(/dir/log);
write(/dir/log, "2, 3, [checksum], foo");  // add checksum to log file
fsync(/dir/log);
pwrite(/dir/orig, "bar", 3, 2);
fsync(/dir/orig);
unlink(/dir/log);

That’s safe, at least on current configurations of ext3. But it’s legal for a filesystem to end up in a state where the log is never created unless we issue an fsync to the parent directory.

creat(/dir/log);
write(/dir/log, "2, 3, [checksum], foo");
fsync(/dir/log);
fsync(/dir);  // fsync parent directory of log file
pwrite(/dir/orig, "bar", 3, 2);
fsync(/dir/orig);
unlink(/dir/log);

That should prevent corruption on any Linux filesystem, but if we want to make sure that the file actually contains “bar”, we need another fsync at the end.

creat(/dir/log);
write(/dir/log, "2, 3, [checksum], foo");
fsync(/dir/log);
fsync(/dir);
pwrite(/dir/orig, "bar", 3, 2);
fsync(/dir/orig);
unlink(/dir/log);
fsync(/dir);  // fsync parent directory again so the unlink is persisted

That results in consistent behavior and guarantees that our operation actually modifies the file after it’s completed, as long as we assume that fsync actually flushes to disk. OS X and some versions of ext3 have an fsync that doesn’t really flush to disk. OS X requires fcntl(F_FULLFSYNC) to flush to disk, and some versions of ext3 only flush to disk if the inode changed (which would only happen at most once a second on writes to the same file, since the inode mtime has one second granularity), as an optimization.
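
To pull the whole sequence together, here’s a minimal Python sketch of the final protocol above: an undo log with a checksum, plus fsyncs on the log, the original file, and the parent directory. The helper names and the log format are made up for illustration, and this ignores recovery, concurrent writers, and the fsync caveats just mentioned.

import os
import zlib

def fsync_path(path):
    # fsync a file or directory by path; the directory fsync is what makes
    # the creation/removal of the log's directory entry durable.
    fd = os.open(path, os.O_RDONLY)
    try:
        os.fsync(fd)
    finally:
        os.close(fd)

def pwrite_with_undo_log(orig, offset, new_bytes, log):
    logdir = os.path.dirname(log) or "."
    # Save the bytes we're about to overwrite so recovery can undo a torn write.
    with open(orig, "rb") as f:
        f.seek(offset)
        old = f.read(len(new_bytes))
    entry = b"%d,%d,%s" % (offset, len(old), old)
    entry += b",%d" % zlib.crc32(entry)    # checksum so a torn log write is detectable
    with open(log, "wb") as logf:
        logf.write(entry)
        logf.flush()
        os.fsync(logf.fileno())            # log contents durable before touching orig
    fsync_path(logdir)                     # log's directory entry durable
    fd = os.open(orig, os.O_WRONLY)
    try:
        os.pwrite(fd, new_bytes, offset)   # the actual update
        os.fsync(fd)                       # update durable before we delete the log
    finally:
        os.close(fd)
    os.unlink(log)
    fsync_path(logdir)                     # persist the unlink so recovery isn't re-run

Calling pwrite_with_undo_log("/dir/orig", 2, b"bar", "/dir/log") corresponds to the final pseudocode sequence; a real implementation would also need a recovery routine that checks the log’s checksum at startup and either undoes or discards the partial update, and on OS X you’d want fcntl with F_FULLFSYNC rather than a plain fsync.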

Even if we assume fsync issues a flush command to the disk, some disks ignore flush directives for the same reason fsync is gimped on OS X and some versions of ext3 – to look better in benchmarks. Handling that is beyond the scope of this post, but the Rajimwale et al. DSN ‘11 paper and related work cover that issue.

Filesystem semantics

When the authors examined ext2, ext3, ext4, btrfs, and xfs, they found that there are substantial differences in how code has to be written to preserve consistency. They wrote a tool that collects block-level filesystem traces, and used that to determine which properties don’t hold for specific filesystems. The authors are careful to note that they can only determine when properties don’t hold – if they don’t find a violation of a property, that’s not a guarantee that the property holds.

Different filesystems have very different properties

Xs indicate that a property is violated. The atomicity properties are basically what you’d expect, e.g., no X for single sector overwrite means that writing a single sector is atomic. The authors note that the atomicity of single sector overwrite sometimes comes from a property of the disks they’re using, and that running these filesystems on some disks won’t give you single sector atomicity. The ordering properties are also pretty much what you’d expect from their names, e.g., an X in the “Overwrite -> Any op” row means that an overwrite can be reordered with some operation.

After they created a tool to test filesystem properties, they then created a tool to check if any applications rely on any potentially incorrect filesystem properties. Because invariants are application specific, the authors wrote checkers for each application tested.

Everything is broken

The authors find issues with most of the applications tested, including things you’d really hope would work, like LevelDB, HDFS, Zookeeper, and git. In a talk, one of the authors noted that the developers of sqlite have a very deep understanding of these issues, but even that wasn’t enough to prevent all bugs. That speaker also noted that version control systems were particularly bad about this, and that the developers had a pretty lax attitude that made it very easy for the authors to find a lot of issues in their tools. The most common class of error was incorrectly assuming ordering between syscalls. The next most common class of error was assuming that syscalls were atomic2. These are fundamentally the same issues people run into when doing multithreaded programming. Correctly reasoning about re-ordering behavior and inserting barriers correctly is hard. But even though shared memory concurrency is considered a hard problem that requires great care, writing to files isn’t treated the same way, even though it’s actually harder in a number of ways.

Something to note here is that while btrfs’s semantics aren’t inherently less reliable than ext3/ext4, many more applications corrupt data on top of btrfs because developers aren’t used to coding against filesystems that allow directory operations to be reordered (ext2 is perhaps the most recent widely used filesystem that allowed that reordering). We’ll probably see a similar level of bug exposure when people start using NVRAM drives that have byte-level atomicity. People almost always just run some tests to see if things work, rather than making sure they’re coding against what’s legal in a POSIX filesystem.

Hardware memory ordering semantics are usually well documented in a way that makes it simple to determine precisely which operations can be reordered with which other operations, and which operations are atomic. By contrast, here’s the ext manpage on its three data modes:

journal: All data is committed into the journal prior to being written into the main filesystem. ordered: This is the default mode. All data is forced directly out to the main file system prior to its metadata being committed to the journal. writeback: Data ordering is not preserved – data may be written into the main filesystem after its metadata has been committed to the journal. This is rumoured to be the highest-throughput option. It guarantees internal filesystem integrity, however it can allow old data to appear in files after a crash and journal recovery.

The manpage literally refers to rumor. This is the level of documentation we have. If we look back at our example where we had to add an fsync between the write(/dir/log, "2, 3, foo") and pwrite(/dir/orig, "bar", 3, 2) to prevent reordering, I don’t think the necessity of the fsync is obvious from the description in the manpage. If you look at the hardware memory ordering “manpage” above, it specifically defines the ordering semantics, and it certainly doesn’t rely on rumor.

This isn’t to say that filesystem semantics aren’t documented anywhere. Between lwn and LKML, it’s possible to get a good picture of how things work. But digging through all of that is hard enough that it’s still quite common for there to be long, uncertain discussions on how things work. A lot of the information out there is wrong, and even when information was right at the time it was posted, it often goes out of date.

When digging through archives, I’ve often seen a post from 2005 cited to back up the claim that OS X fsync is the same as Linux fsync, and that OS X fcntl(F_FULLFSYNC) is even safer than anything available on Linux. Even at the time, I don’t think that was true for the 2.4 kernel, although it was true for the 2.6 kernel. But since 2008 or so Linux 2.6 with ext3 will do a full flush to disk for each fsync (if the disk supports it, and the filesystem hasn’t been specially configured with barriers off).

Another issue is that you often also see exchanges like this one:

Dev 1: Personally, I care about metadata consistency, and ext3 documentation suggests that journal protects its integrity. Except that it does not on broken storage devices, and you still need to run fsck there.
Dev 2: as the ext3 authors have stated many times over the years, you still need to run fsck periodicly anyway.
Dev 1: Where is that documented?
Dev 2: linux-kernel mailing list archives.
Dev 3: Probably from some 6-8 years ago, in e-mail postings that I made.

Where’s this documented? Oh, in some mailing list post from 6-8 years ago (which makes it 12-14 years ago as of today). I don’t mean to pick on filesystem devs. The fs devs whose posts I’ve read are quite polite compared to LKML’s reputation; they generously spend a lot of their time responding to basic questions and I’m impressed by how patient the expert fs devs are with askers, but it’s hard for outsiders to trawl through a decade and a half of mailing list postings to figure out which ones are still valid and which ones have been obsoleted!

In their OSDI 2014 talk, the authors of the paper we’re discussing noted that when they reported bugs they’d found, developers would often respond “POSIX doesn’t let filesystems do that”, without being able to point to any specific POSIX documentation to support their statement. If you’ve followed Kyle Kingsbury’s Jepsen work, this may sound familiar, except that devs respond with “filesystems don’t do that” instead of “networks don’t do that”. I think this is understandable, given how much misinformation is out there. Not being a filesystem dev myself, I’d be a bit surprised if this post doesn’t have at least one bug in it.

Filesystem correctness

We’ve already encountered a lot of complexity in saving data correctly, and this only scratches the surface of what’s involved. So far, we’ve assumed that the disk works properly, or at least that the filesystem is able to detect when the disk has an error via SMART or some other kind of monitoring. I’d always figured that was the case until I started looking into it, but that assumption turns out to be completely wrong.

The Prabhakaran et al. SOSP 05 paper examined how filesystems respond to disk errors in some detail. They created a fault injection layer that allowed them to inject disk faults and then ran things like chdir, chroot, stat, open, write, etc. to see what would happen.

Between ext3, reiserfs, and NTFS, reiserfs is the best at handling errors and it seems to be the only filesystem where errors were treated as first class citizens during design. It’s mostly consistent about propagating errors to the user on reads, and calling panic on write failures, which triggers a restart and recovery. This general policy allows the filesystem to gracefully handle read failure and avoid data corruption on write failures. However, the authors found a number of inconsistencies and bugs. For example, reiserfs doesn’t correctly handle read errors on indirect blocks and leaks space, and a specific type of write failure doesn’t prevent reiserfs from updating the journal and committing the transaction, which can result in data corruption.

Reiserfs is the good case. The authors found that ext3 ignored write failures in most cases, and rendered the filesystem read-only in most cases for read failures. This seems like pretty much the opposite of the policy you’d want. Ignoring write failures can easily result in data corruption, and remounting the filesystem as read-only is a drastic overreaction if the read error was a transient error (transient errors are common). Additionally, ext3 did the least consistency checking of the three filesystems and was the most likely to not detect an error. In one presentation, one of the authors remarked that the ext3 code had lots of comments like “I really hope a write error doesn’t happen here” in places where errors weren’t handled.

NTFS is somewhere in between. The authors found that it has many consistency checks built in, and is pretty good about propagating errors to the user. However, like ext3, it ignores write failures.

The paper has much more detail on the exact failure modes, but the details are mostly of historical interest as many of the bugs have been fixed.

It would be really great to see an updated version of the paper, and in one presentation someone in the audience asked if there was more up to date information. The presenter replied that they’d be interested in knowing what things look like now, but that it’s hard to do that kind of work in academia because grad students don’t want to repeat work that’s been done before, which is pretty reasonable given the incentives they face. Doing replications is a lot of work, often nearly as much work as the original paper, and replications usually give little to no academic credit. This is one of the many cases where the incentives align very poorly with producing real world impact.

The Gunawi et al. FAST 08 paper is another one that it would be great to see replicated today. That paper follows up on the paper we just looked at, and examines the error handling code in different file systems, using a simple static analysis tool to find cases where errors are being thrown away. Being thrown away is defined very loosely in the paper — code like the following

if (error) {
    printk("I have no idea how to handle this error\n");
}

is considered not throwing away the error. Errors are considered to be ignored if the execution flow of the program doesn’t depend on the error code returned from a function that returns an error code.

With that tool, they find that most filesystems drop a lot of error codes:


Rank   By % broken: FS (Frac.)      By Viol/Kloc: FS (Viol/Kloc)
1      IBM JFS (24.4)               ext3 (7.2)
2      ext3 (22.1)                  IBM JFS (5.6)
3      JFFS v2 (15.7)               NFS Client (3.6)
4      NFS Client (12.9)            VFS (2.9)
5      CIFS (12.7)                  JFFS v2 (2.2)
6      MemMgmt (11.4)               CIFS (2.1)
7      ReiserFS (10.5)              MemMgmt (2.0)
8      VFS (8.4)                    ReiserFS (1.8)
9      NTFS (8.1)                   XFS (1.4)
10     XFS (6.9)                    NFS Server (1.2)


Comments they found next to ignored errors include: “Should we pass any errors back?”, “Error, skip block and hope for the best.”, “There’s no way of reporting error returned from ext3_mark_inode_dirty() to userspace. So ignore it.“, “Note: todo: log error handler.“, “We can’t do anything about an error here.”, “Just ignore errors at this point. There is nothing we can do except to try to keep going.”, “Retval ignored?”, and “Todo: handle failure.”

One thing to note is that in a lot of cases, ignoring an error is more of a symptom of an architectural issue than a bug per se (e.g., ext3 ignored write errors during checkpointing because it didn’t have any kind of recovery mechanism). But even so, the authors of the papers found many real bugs.

Error recovery

Every widely used filesystem has bugs that will cause problems on error conditions, which brings up two questions: can recovery tools robustly fix errors, and how often do errors occur? On the first question, the Gunawi et al. OSDI 08 paper finds that fsck, a standard utility for checking and repairing file systems, “checks and repairs certain pointers in an incorrect order … the file system can even be unmountable after”.

At this point, we know that it’s quite hard to write files in a way that ensures their robustness even when the underlying filesystem is correct, the underlying filesystem will have bugs, and that attempting to repair corruption to the filesystem may damage it further or destroy it. How often do errors happen?

Error frequency

The Bairavasundaram et al. SIGMETRICS ‘07 paper found that, depending on the exact model, between 5% and 20% of disks would have at least one error over a two year period. Interestingly, many of these were isolated errors – 38% of disks with errors had only a single error, and 80% had fewer than 50 errors. A follow-up study looked at corruption and found that silent data corruption that was only detected by checksumming happened on .5% of disks per year, with one extremely bad model showing corruption on 4% of disks in a year.

It’s also worth noting that they found very high locality in error rates between disks on some models of disk. For example, there was one model of disk that had a very high error rate in one specific sector, making many forms of RAID nearly useless for redundancy.

That’s another study it would be nice to see replicated. Most studies on disks focus on the failure rate of the entire disk, but if what you’re worried about is data corruption, errors in non-failed disks are more worrying than disk failure, which is easy to detect and mitigate.

Conclusion

Files are hard. Butler Lampson has remarked that when they came up with threads, locks, and condition variables at PARC, they thought that they were creating a programming model that anyone could use, but that there’s now decades of evidence that they were wrong. We’ve accumulated a lot of evidence that humans are very bad at reasoning about these kinds of problems, which are very similar to the problems you have when writing correct code to interact with current filesystems. Lampson suggests that the best known general purpose solution is to package up all of your parallelism into as small a box as possible and then have a wizard write the code in the box. Translated to filesystems, that’s equivalent to saying that as an application developer, writing to files safely is hard enough that it should be done via some kind of library and/or database, not by directly making syscalls.

Sqlite is quite good in terms of reliability if you want a good default. However, some people find it to be too heavyweight if all they want is a file-based abstraction. What they really want is a sort of polyfill for the file abstraction that works on top of all filesystems without having to understand the differences between different configurations (and even different versions) of each filesystem. Since that doesn’t exist yet, when no existing library is sufficient, you need to checksum your data, since you will get silent errors and corruption. The only questions are whether you detect the errors, and whether your record format limits the damage to a single record when corruption happens or lets it destroy the entire database. As far as I can tell, most desktop email client developers have chosen to go the route of destroying all of your email if corruption happens.
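
As a sketch of what “only destroys a single record” can look like in practice, here’s a toy length-plus-CRC record format in Python (the format and names are invented for illustration; a real format would also need to handle resynchronizing after a bad record, compaction, and so on):

import struct
import zlib

# Toy log format: each record is a 4-byte length, a 4-byte CRC32 of the
# payload, then the payload itself. Files must be opened in binary mode.
def append_record(f, payload):
    f.write(struct.pack("<II", len(payload), zlib.crc32(payload)))
    f.write(payload)

def read_records(f):
    while True:
        header = f.read(8)
        if len(header) < 8:
            return                     # clean end of file (or truncated header)
        length, crc = struct.unpack("<II", header)
        payload = f.read(length)
        if len(payload) < length or zlib.crc32(payload) != crc:
            return                     # corrupt or torn record: stop, but keep everything before it
        yield payload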

These studies also hammer home the point that conventional testing isn’t sufficient. There were multiple cases where the authors of a paper wrote a relatively simple tool and found a huge number of bugs. You don’t need any deep computer science magic to write the tools. The error propagation checker from the paper that found a ton of bugs in filesystem error handling was 4k LOC. If you read the paper, you’ll see that the authors observed that the tool had a very large number of shortcomings because of its simplicity, but despite those shortcomings, it was able to find a lot of real bugs. I wrote a vaguely similar tool at my last job to enforce some invariants, and it was literally two pages of code. It didn’t even have a real parser (it just went line-by-line through files and did some regexp matching to detect the simple errors that it’s possible to detect with just a state machine and regexes), but it found enough bugs that it paid for itself in development time the first time I ran it.
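
To give a sense of how little code that kind of checker can be, here’s a toy Python version (the checked function names, the regex, and the “invariant” are all made up for illustration; my actual tool and the one from the paper checked different things):

import re
import sys

# Toy line-by-line checker: flag calls to functions that return an error
# code when the call appears as a bare statement, i.e. the return value
# looks like it's being discarded.
CHECKED_FUNCS = ("fsync", "pwrite", "unlink")
CALL_AS_STATEMENT = re.compile(r"^\s*(%s)\s*\(" % "|".join(CHECKED_FUNCS))

def check_file(path):
    problems = []
    with open(path) as f:
        for lineno, line in enumerate(f, start=1):
            if CALL_AS_STATEMENT.match(line):
                problems.append((path, lineno, line.strip()))
    return problems

if __name__ == "__main__":
    for path in sys.argv[1:]:
        for p, lineno, line in check_file(path):
            print("%s:%d: return value ignored? %s" % (p, lineno, line))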

Almost every software project I’ve seen has a lot of low hanging testing fruit. Really basic random testing, static analysis, and fault injection can pay for themselves in terms of dev time pretty much the first time you use them.

Appendix

I’ve probably covered less than 20% of the material in the papers I’ve referred to here. Here’s a bit of info about some other neat info you can find in those papers, and others.

Pillai et al., OSDI ‘14: this paper goes into much more detail about what’s required for crash consistency than this post does. It also gives a fair amount of detail about how exactly applications fail, including diagrams of traces that indicate which false assumptions are embedded in each trace.

Chidambaram et al., FAST ‘12: the same filesystem primitives are responsible for both consistency and ordering. The authors propose alternative primitives that separate these concerns, allowing better performance while maintaining safety.

Rajimwale et al. DSN ‘11: you probably shouldn’t use disks that ignore flush directives, but in case you do, here’s a protocol that forces those disks to flush using normal filesystem operations. As you might expect, the performance of this is quite bad.

Prabhakaran et al. SOSP ‘05: This has a lot more detail on filesystem responses to error than was covered in this post. The authors also discuss JFS, an IBM filesystem for AIX. Although it was designed for high reliability systems, it isn’t particularly more reliable than the alternatives. Related material is covered further in DSN ‘08, StorageSS ‘06, DSN ‘06, FAST ‘08, and USENIX ‘09, among others.

Gunawi et al. FAST ‘08: Again, much more detail than is covered in this post on when errors get dropped, and how they wrote their tools. They also have some call graphs that give you one rough measure of the complexity involved in a filesystem. The XFS call graph is particularly messy, and one of the authors noted in a presentation that an XFS developer said that XFS was fun to work on since they took advantage of every possible optimization opportunity, regardless of how messy it made things.

Bairavasundaram et al. SIGMETRICS ‘07: There’s a lot of information on disk error locality and disk error probability over time that isn’t covered in this post. A followup paper in FAST ‘08 has more details.

Gunawi et al. OSDI ‘08: This paper has a lot more detail about when fsck doesn’t work. In a presentation, one of the authors mentioned that fsck is the only program that’s ever insulted him. Apparently, if you have a corrupt pointer that points to a superblock, fsck destroys the superblock (possibly rendering the disk unmountable), tells you something like “you dummy, you must have run fsck on a mounted disk”, and then gives up. In the paper, the authors reimplement basically all of fsck using a declarative model, and find that the declarative version is shorter, easier to understand, and much easier to extend, at the cost of being somewhat slower.

Memory errors are beyond the scope of this post, but memory corruption can cause disk corruption. This is especially annoying because memory corruption can cause you to take a checksum of bad data and write a bad checksum. It’s also possible to corrupt in-memory pointers, which often results in something very bad happening. See the Zhang et al. FAST ‘10 paper for more on how ZFS is affected by that. There’s a meme going around that ZFS is safe against memory corruption because it checksums, but that paper found that critical things held in memory aren’t checksummed, and that memory errors can cause data corruption in real scenarios.

The sqlite devs are serious about both documentation and testing. If I wanted to write a reliable desktop application, I’d start by reading the sqlite docs and then talking to some of the core devs. If I wanted to write a reliable distributed application I’d start by getting a job at Google and then reading the design docs and postmortems for GFS, Colossus, Spanner, etc. J/k, but not really.

We haven’t looked at formal methods at all, but there have been a variety of attempts to formally verify properties of filesystems, such as SibylFS.

This list isn’t intended to be exhaustive. It’s just a list of things I’ve read that I think are interesting.

Update: many people have read this post and suggested that, in the first file example, you should use the much simpler protocol of copying the file you want to modify to a temp file, modifying the temp file, and then renaming the temp file to overwrite the original file. In fact, that’s probably the most common comment I’ve gotten on this post. If you think this solves the problem, I’m going to ask you to pause for five seconds and consider the problems this might have. First, you still need to fsync in multiple places. Second, you will get very poor performance with large files. People have also suggested using many small files to work around that problem, but that will also give you very poor performance unless you do something fairly exotic. Third, if there’s a hardlink, you’ve now made the problem of crash consistency much more complicated than in the original example. Fourth, you’ll lose file metadata, sometimes in ways that can’t be fixed up after the fact. That problem can, on some filesystems, be worked around with ioctls, but that only sometimes fixes the issue and now you’ve got fs-specific code to preserve correctness even in the non-crash case. And that’s just the beginning. The fact that so many people thought that this was a simple solution to the problem demonstrates that this problem is one that people are prone to underestimating, even when they’re explicitly warned that people tend to underestimate this problem!
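
For comparison, here’s roughly what the rename-based approach looks like once you add the fsyncs it still needs. This is a Python sketch with illustrative names; it doesn’t address the performance, hardlink, or metadata problems above, and a real version would use a unique temp file name:

import os

def replace_file_contents(path, new_bytes):
    dirname = os.path.dirname(path) or "."
    tmp = path + ".tmp"                  # illustrative; collides if two writers race
    fd = os.open(tmp, os.O_WRONLY | os.O_CREAT | os.O_TRUNC, 0o644)
    try:
        os.write(fd, new_bytes)
        os.fsync(fd)                     # new contents durable before the rename can expose them
    finally:
        os.close(fd)
    os.rename(tmp, path)                 # atomically swap in the new file
    dfd = os.open(dirname, os.O_RDONLY)
    try:
        os.fsync(dfd)                    # persist the rename itself
    finally:
        os.close(dfd)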

If you liked this, you’ll probably enjoy this post on cpu bugs.

Thanks to Leah Hanson, Katerina Barone-Adesi, Jamie Brandon, Kamal Marhubi, Joe Wilder, David Turner, Benjamin Gilbert, Tom Murphy, Chris Ball, Joe Doliner, Alexy Romanov, Mindy Preston, Paul McJones, and Evan Jones for comments/discussion.


  1. Turns out some commercially supported distros only support data=ordered. Oh, and when I said data=ordered was the default, that’s only the case pre-2.6.30. After 2.6.30, there’s a config option, CONFIG_EXT3_DEFAULTS_TO_ORDERED. If that’s not set, the default becomes data=writeback. [return]
  2. Cases where overwrite atomicity is required were documented as known issues, and all such cases assumed single-block atomicity and not multi-block atomicity. By contrast, multiple applications (LevelDB, Mercurial, and HSQLDB) had bad data corruption bugs that came from assuming appends are atomic.

    That seems to be an indirect result of a commonly used update protocol, where modifications are logged via appends, and then logged data is written via overwrites. Application developers are careful to check for and handle errors in the actual data, but the errors in the log file are often overlooked.

    There are a number of other classes of errors discussed, and I recommend reading the paper for the details if you work on an application that writes files.

    [return]

Big company vs. startup work and pay

There’s a meme that’s been going around for a while now: you should join a startup because the money is better and the work is more technically interesting. Paul Graham says that the best way to make money is to “start or join a startup”, which has been “a reliable way to get rich for hundreds of years”, and that you can “compress a career’s worth of earnings into a few years”. Michael Arrington says that you’ll become a part of history. Joel Spolsky says that by joining a big company, you’ll end up playing foosball and begging people to look at your code. Sam Altman says that if you join Microsoft, you won’t build interesting things and may not work with smart people. They all claim that you’ll learn more and have better options if you go work at a startup. Some of these links are a decade old now, but the same ideas are still circulating and those specific essays are still cited today.

Let’s look at these points one by one.

  1. You’ll earn much more money at a startup
  2. You won’t do interesting work at a big company
  3. You’ll learn more at a startup and have better options afterwards

1. Earnings

The numbers will vary depending on circumstances, but we can do a back of the envelope calculation and adjust for circumstances afterwards. Median income in the U.S. is about $30k/yr. The somewhat bogus zeroth order lifetime earnings approximation I’ll use is $30k * 40 = $1.2M. A new grad at Google/FB/Amazon with a lowball offer will have a total comp (salary + bonus + equity) of $130k/yr. According to glassdoor’s current numbers, someone who makes it to T5/senior at Google should have a total comp of around $250k/yr. These are fairly conservative numbers1.

Someone who’s not particularly successful, but not particularly unsuccessful, will probably make senior in five years2. For our conservative baseline, let’s assume that we’ll never make it past senior, into the pay grades where compensation really skyrockets. We’d expect earnings (total comp including stock, but not benefits) to look something like:

Year   Total comp   Cumulative
0      130k         130k
1      160k         290k
2      190k         480k
3      220k         700k
4      250k         950k
5      250k         1.2M
9      250k         2.2M
39     250k         9.7M

Looks like it takes six years to gross a U.S. career’s worth of income. If you want to adjust for the increased tax burden from earning a lot in a few years, add an extra year. Maybe add one to two more years if you decide to live in the bay or in NYC. If you decide not to retire, lifetime earnings for a 40 year career comes in at almost $10M.
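
If you want to adjust these numbers for your own situation, the cumulative column is just a running sum of the yearly figures. A quick sketch, using the same assumed ramp as the table:

# Total comp in $k/yr: ramp over the first five years, then flat at senior.
comp_by_year = [130, 160, 190, 220, 250]

def cumulative(years):
    total = 0
    for y in range(years + 1):
        total += comp_by_year[y] if y < len(comp_by_year) else comp_by_year[-1]
    return total

print(cumulative(5), cumulative(9), cumulative(39))   # 1200, 2200, 9700 ($k)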

One common, but false, objection to this is that your earnings will get eaten up by the cost of living in the bay area. Not only is this wrong, it’s actually the opposite of correct. You can work at these companies from outside the bay area: most of them will pay you maybe 10% less if you work out of a satellite office of a trendy company headquartered in SV or Seattle, in a location where the cost of living is around the U.S. median (at least if you work in the US – pay outside of the US is often much lower for reasons that don’t really make sense to me). Market rate at smaller companies in these areas tends to be very low. When I interviewed in places like Portland and Madison, there was a 3x-5x difference between what most small companies were offering and what I could get at a big company in the same city. In places like Austin, where the market is a bit thicker, it was a 2x-3x difference. The difference in pay at 90%-ile companies is greater, not smaller, outside of the SF bay area.

Another objection is that most programmers at most companies don’t make this kind of money. If, three or four years ago, you’d told me that there’s a career track where it’s totally normal to make $250k/yr after a few years, doing work that was fundamentally pretty similar to the work I was doing then, I’m not sure I would have believed it. No one I knew made that kind of money, except maybe the CEO of the company I was working at. Well him, and folks who went into medicine or finance.

The only difference between then and now is that I took a job at a big company. When I took that job, the common story I heard at orientation was basically “I never thought I’d be able to get a job at Google, but a recruiter emailed me and I figured I might as well respond”. For some reason, women were especially likely to have that belief. Anyway, I’ve told that anecdote to multiple people who didn’t think they could get a job at some trendy large company, who then ended up applying and getting in. And what you’ll realize if you end up at a place like Google is that most of the people there are just normal programmers like you and me. If anything, I’d say that Google is, on average, less selective than the startup I worked at. When you only have to hire 100 people total, and half of them are folks you worked with as a technical fellow at one big company and then as an SVP at another one, you can afford to hire very slowly and be extremely selective. Big companies will hire more than 100 people per week, which means they can only be so selective.

Despite the hype about how hard it is to get a job at Google/FB/wherever, your odds aren’t that bad, and they’re certainly better than your odds striking it rich at a startup, for which Patrick McKenzie has a handy cheatsheet:

Roll d100. (Not the right kind of geek? Sorry. rand(100) then.)
0~70: Your equity grant is worth nothing.
71~94: Your equity grant is worth a lump sum of money which makes you about as much money as you gave up working for the startup, instead of working for a megacorp at a higher salary with better benefits.
95~99: Your equity grant is a life changing amount of money. You won’t feel rich — you’re not the richest person you know, because many of the people you spent the last several years with are now richer than you by definition — but your family will never again give you grief for not having gone into $FAVORED_FIELD like a proper $YOUR_INGROUP.
100: You worked at the next Google, and are rich beyond the dreams of avarice. Congratulations.
Perceptive readers will note that 100 does not actually show up on a d100 or rand(100).

For a more serious take that gives approximately the same results, 80000 hours finds that the average value of a YC founder after 5-9 years is $18M. That sounds great! But there are a few things to keep in mind here. First, YC companies are unusually successful compared to the average startup. Second, in their analysis, 80000 hours notes that 80% of the money belongs to 0.5% of companies. Another 22% are worth enough that founder equity beats working for a big company, but that leaves 77.5% where that’s not true.

If you’re an employee and not a founder, the numbers look a lot worse. If you’re a very early employee you’d be quite lucky to get 1/10th as much equity as a founder. If we guess that 30% of YC startups fail before hiring their first employee, that puts the mean equity offering at $1.8M / .7 = $2.6M. That’s low enough that for 5-9 years of work, you really need to be in the 0.5% for the payoff to be substantially better than working at a big company unless the startup is paying a very generous salary.
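
To spell out that arithmetic (a rough sketch; the 1/10th equity ratio and the 30% fail-before-first-hire rate are the guesses from the paragraph above):

# Rough expected value of very early employee equity at a YC startup.
founder_value_after_5_to_9_years = 18e6   # 80000 hours' average founder value
employee_equity_ratio = 1.0 / 10          # lucky very early employee vs. founder
p_startup_hires_anyone = 0.7              # guess: 30% fail before the first hire
employee_ev = founder_value_after_5_to_9_years * employee_equity_ratio / p_startup_hires_anyone
print(employee_ev)                        # ~2.6e6, i.e. ~$2.6M over 5-9 years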

There’s a sense in which these numbers are too optimistic. Even if the company is successful and has a solid exit, there are plenty of things that can make your equity grant worthless. It’s hard to get statistics on this, but anecdotally, this seems to be the common case in acquisitions.

Moreover, the pitch that you’ll only need to work for four years is usually untrue. To keep your lottery ticket until it pays out (or fizzles out), you’ll probably have to stay longer. The most common form of equity at early stage startups is ISOs, which, by definition, expire at most 90 days after you leave. If you get in early and leave after four years, you’ll have to exercise your options if you want a chance at the lottery ticket paying off. If the company hasn’t yet landed a large valuation, you might be able to get away with paying O(median US annual income) to exercise your options. If the company looks like a rocketship and VCs are piling in, you’ll have a massive tax bill, too, all for a lottery ticket.

For example, say you joined company X early on and got options for 1% of the company when it was valued at $1M, so the cost of exercising all of your options is only $10k. Maybe you got lucky and four years later, the company is valued at $1B and your options have only been diluted to .5%. Great! For only $10k you can exercise your options and then sell the equity you get for $5M. Except that the company hasn’t IPO’d yet, so if you exercise your options, you’re stuck with a tax bill from making $5M, and by the time the company actually has an IPO, your stock could be worth anywhere from $0 to $LOTS. In some cases, you can sell your non-liquid equity for some fraction of its “value”, but my understanding is that it’s getting more common for companies to add clauses that limit your ability to sell your equity before the company has an IPO. And even when your contract doesn’t have a clause that prohibits you from selling your options on a secondary market, companies sometimes use backchannel communications to keep you from being able to sell your options.

Of course not every company is like this – I hear that Dropbox has generously offered to buy out people’s options at their current valuation for multiple years running and they now hand out RSUs instead of options, and Pinterest now gives people seven years to exercise their options after they leave – but stories like that are uncommon enough that they’re notable. The result is that people are incentivized to stay at most startups, even if they don’t like the work anymore. From chatting with my friends at well regarded highly-valued startups, it sounds like many of them have a substantial fraction of zombie employees who are just mailing it in and waiting for a liquidity event. A common criticism of large companies is that they’ve got a lot of lifers who are mailing it in, but most large companies will let you leave any time after the first year and walk away with a pro-rated fraction of your equity package3. It’s startups where people are incentivized to stick around even if they don’t care about the job.

At a big company, we have a career’s worth of income in six years with high probability once you get your foot in the door. This isn’t quite as good as the claim that you’ll be able to do that in three or four years at a startup, but the risk at a big company is very low once you land the job. In startup land, we have a lottery ticket that appears to have something like a 0.5% chance of paying off for very early employees. Startups might have had a substantially better expected value when Paul wrote about this in 2004, but big company compensation has increased much faster than compensation at the median startup. We’re currently in the best job market the world has ever seen for programmers. That’s likely to change at some point. The relative returns on going the startup route will probably look a lot better once things change, but for now, saving up some cash while big companies hand it out like candy doesn’t seem like a bad idea.

2. Interesting work

We’ve established that big companies will pay you decently. But there’s more to life than making money. After all, you spend 40+ hours a week working. How interesting is the work at big companies? Joel claimed that large companies don’t solve interesting problems and that Google is paying untenable salaries to kids with more ultimate frisbee experience than Python, whose main job will be to play foosball in the googleplex; Sam Altman said something similar (but much more measured) about Microsoft; every third Michael O. Church comment is about how Google tricks a huge number of overqualified programmers into taking jobs that no one wants; and basically every advice thread on HN or reddit aimed at new grads will have multiple people chime in on how the experience you get at startups is better than the experience you’ll get slaving away at a big company.

The claim that big companies have boring work is too broad and absolute to even possibly be true. It depends on what kind of work you want to do. When I look at conferences where I find a high percentage of the papers compelling, the stuff I find to be the most interesting is pretty evenly split between big companies and academia, with the (very) occasional paper by a startup. For example, looking at ISCA this year, there’s a 2:1 ratio of papers from academia to industry (and all of the industry papers are from big companies). But looking at the actual papers, a significant fraction of the academic papers are reproducing unpublished work that was done at big companies but not published, sometimes multiple years ago. If I only look at the new work that I’m personally interested in, it’s about a 1:1 ratio. There are some cases where a startup is working in the same area and not publishing, but that’s quite rare and large companies do much more research that they don’t publish. I’m just using papers as a proxy for having the kind of work I like. There are also plenty of areas where publishing isn’t the norm, but large companies do the bulk of the cutting edge work.

Of course YMMV here depending on what you want to do. I’m not really familiar with the landscape of front-end work, but it seems to me that big companies don’t do the vast majority of the cutting edge non-academic work, the way they do with large scale systems. IIRC, there’s an HN comment where Jonathan Tang describes how he created his own front-end work: he had the idea, told his manager about it, and got approval to make it happen. It’s possible to do that kind of thing at a large company, but people often seem to have an easier time pursuing that kind of idea at a small company. And if your interest is in product, small companies seem like the better bet (though, once again, I’m pretty far removed from that area, so my knowledge is secondhand).

But if you’re interested in large systems, at both of my last two jobs, I’ve seen speculative research projects with 9-figure pilot budgets approved. In the pitch for one of these projects, the argument wasn’t even that the project would make the company money. It was that a specific research area was important to the company, and that this infrastructure project would enable the company to move faster in that research area. Since the company makes $X billion a year, the project only needed to move the needle by a small percentage to be worth it. And so a research project whose goal was to speed up the progress of another research project was approved. Startups simply don’t have the resources to throw that much money at research problems that aren’t core to their business. And many problems that would be hard problems at startups are curiosities at large companies. Work at Google and have a question that requires running a query that takes 10k machines? No problem! But that’s basically impossible to do at a startup, not even considering the fact that you can run the query across data startups can’t possibly get.

The flip side of this is that there are experiments that startups have a very easy time doing that established companies can’t do. When I was at EC a number of years ago, back when Facebook was still relatively young, the Google ad auction folks remarked to the FB folks that FB was doing the sort of experiments they’d do if they were small enough to do them, but they couldn’t just change the structure of their ad auctions now that there was so much money going through their pipeline. As with everything else we’re discussing, there’s a tradeoff here and the real question is how to weight the various parts of the tradeoff, not which side is better in all ways.

The Michael O. Church claim is somewhat weaker: big companies have cool stuff to work on, but you won’t be allowed to work on them until you’ve paid your dues working on boring problems. A milder phrasing of this is that getting to do interesting work is a matter of getting lucky and landing on an initial project you’re interested in, but the key thing here is that most companies can give you a pretty good estimate about how lucky you’re going to be. Google is notorious for its blind allocation process, and I know multiple people who ended up at MS because they had the choice between a great project at MS and blind allocation at Google, but even Google has changed this to some extent and it’s not uncommon to be given multiple team options with an offer. In that sense, big companies aren’t much different from startups. It’s true that there are some startups that will basically only have jobs that are interesting to you (e.g., FaunaDB if you’re interested in building a distributed database). But at any startup that’s bigger and less specialized, there’s going to be work you’re interested in and work you’re not interested in, and it’s going to be up to you to figure out if your offer lets you work on stuff you’re interested in.

Something to note is that if, per “1”, you have the leverage to negotiate a good compensation package, you also have the leverage to negotiate for work that you want to do. We’re in what is probably the best job market for programmers ever. That might change tomorrow, but until it changes, you have a lot of power to get work that you want.

If this sounds completely foreign to you and you don’t have that kind of leverage, I understand. That was me a few years ago. Taking a job at a trendy big company is one way to get that leverage. Companies really want you to make it to “senior engineer” (where total comp starts at $250k to $350k, depending on the company); hiring is very expensive for them and they’re heavily incentivized to mentor the people they hire until they’re valuable and productive. Some companies are better at this than others, but the average big company that people want to work for has a lot more resources devoted to helping people learn than almost any startup. The goal at most big companies is to get everyone to the senior level. Of course they’ll keep hiring, which means there will always be non-senior people, but the definition of senior engineer is basically someone who can independently find and solve problems and doesn’t require any handholding, i.e., someone who’s easy to scale horizontally. Google even has an (unevenly enforced) policy that people who don’t “eventually” get to senior should be managed out, and you’ll notice that they’re not known for having a high involuntary attrition rate. They, and most other big companies, take teaching seriously.

3. Learning / Experience

What about the claim that experience at startups is more valuable? We don’t have the data to do a rigorous quantitative comparison, but qualitatively, everything’s on fire at startups, and you get a lot of breadth putting out fires, but you don’t have the time to explore problems as deeply.

I spent the first seven years of my career at a startup and I loved it. It was total chaos, which gave me the ability to work on a wide variety of different things and take on more responsibility than I would have gotten at a bigger company. I did everything from add fault tolerance to an in-house distributed system to owning a quarter of a project that added ARM instructions to an x86 chip, creating both the fastest ARM chip at the time, as well as the only chip capable of switching between ARM and x86 on the fly4. That was a great learning experience.

But I’ve had great learning experiences at big companies, too. At Google, my “starter” project was to join a previously one-person project, read the half finished design doc, provide feedback, and then start implementing. The impetus for the project was that people were worried that a certain class of applications would require Google to double the number of machines it owns if a somewhat unlikely but not impossible scenario happened. That wasn’t too much different from my startup experience, except for that bit about actually having a design doc, and that cutting infra costs could save billions a year instead of millions a year.

The next difference was that, at some point, people way above my pay grade made the decision to get serious about the project, and a lot of high-powered people ended up getting brought in to work on the project or at least provide input, folks like Norm Jouppi, Geoff Hinton, and Jeff Dean.

Was that project a better or worse learning experience than the equivalent project at a startup? At a startup, the project probably would have continued to be a two-person show, and I would have learned all the things you learn when you bang out a project with not enough time and resources and do half the thing yourself. Instead, I ended up owning a fraction of the project and merely provided feedback on the rest, and it was merely a matter of luck (timing) that I had significant say on fleshing out the architecture. I definitely didn’t get the same level of understanding I would have if I implemented half of it myself. On the other hand, the larger team meant that we actually had time to do things like design reviews and code reviews, and I got feedback from people who have way more experience and knowledge than me. My experience at MS is similar – I only own maybe a quarter of the project I’m working on, and there’s an architect above me who’s extremely well regarded and probably has veto power on architectural decisions. But when I had a question the other day, I emailed a Turing award winner and got a response back within an hour. It’s almost impossible to have access to the same breadth and depth of expertise at a startup. As a result, there are things I’ve learned in an hour long design review that it would have taken me months or years to learn if I was implementing things myself.

If you care about impact, it’s also easier to have a large absolute impact at a large company, due to the scale that big companies operate at. If I implemented what I’m doing now for a company the size of the startup I used to work for, it would have had an impact of maybe $10k/month. That’s nothing to sneeze at, but it wouldn’t have covered my salary. But the same thing at a big company is worth well over 1000x that. There are simply more opportunities to have high impact at large companies because they operate at a larger scale. The corollary to this is that startups are small enough that it’s easier to have an impact on the company itself, even when the impact on the world is smaller in absolute terms. Nothing I do is make or break for a large company, but when I worked at a startup, it felt like what we did could change the odds of the company surviving.

As far as having better options after having worked for a big company or having worked for a startup, if you want to work at startups, you’ll probably have better options with experience at startups. If you want to work on the sorts of problems that are dominated by large companies, you’re better off with more experience in those areas, at large companies. There’s no right answer here.

Conclusion

The compensation tradeoff has changed a lot over time. When Paul Graham was writing in 2004, he used $80k/yr as a reasonable baseline for what “a good hacker” might make. Adjusting for inflation, that’s about $100k/yr now. But the total comp for “a good hacker” is $250k+/yr, not even counting perks like free food and having really solid insurance. The tradeoff has heavily tilted in favor of large companies.

The interesting work tradeoff has also changed a lot over time, but the change has been… bimodal. The existence of AWS and Azure means that ideas that would have taken millions of dollars in servers and operational expertise can be done with almost no fixed cost and low marginal costs. The scope of things you can do at an early-stage startup that were previously the domain of well funded companies is large and still growing. But at the same time, if you look at the work Google and MS are publishing at top systems conferences, startups are farther from being able to reproduce the scale-dependent work than ever before (and a lot of the most interesting work doesn’t get published). Depending on what sort of work you’re interested in, things might look relatively better or relatively worse at big companies.

In any case, the reality is that the difference between types of companies is smaller than the differences between companies of the same type. That’s true whether we’re talking about startups vs. big companies or mobile gaming vs. biotech. This is recursive. The differences between different managers and teams at a company can easily be larger than the differences between companies. If someone tells you that you should work for a certain type of company, that advice is guaranteed to be wrong much of the time, whether that’s a VC advocating that you should work for a startup or a Turing award winner telling you that you should work in a research lab.

As for me, well, I don’t know you and it doesn’t matter to me whether you end up at a big company, a startup, or something in between. Whatever you decide, I hope you get to know your manager well enough to know that they have your back, your team well enough to know that you like working with them, and your project well enough to know that you find it interesting. Personally, I’m a bit tired of the sort of nonsense you see at big companies after two stints at big companies5, and I might want to trade that for the sort of nonsense you see at startups next time I look for work, but that’s just me. You should figure out what the relevant tradeoffs are for you.

Jocelyn Goldfein on big companies vs. small companies.

Patrick McKenzie on providing business value vs. technical value, with a response from Yossi Kreinin.

Yossi Kreinin on passion vs. money, with a rebuttal to this post on regret minimization.

Update: The responses on this post have been quite divided. Folks at big companies usually agree, except that the numbers seem low to them, especially for new grads. This is true even for people living in places like Madison and Austin, which have a cost of living similar to the U.S. median. On the other hand, a lot of people vehemently maintain that the numbers in this post are basically impossible. A lot of people are really invested in the idea that they’re making about as much as possible. If you’ve decided that making less money is the right tradeoff for you, that’s fine and I don’t have any problem with that. But if you really think that you can’t make that much money and you don’t believe me, I recommend talking to one of the hundreds of thousands of engineers at one of the many large companies that pays well.

Thanks to Kelly Eskridge, Leah Hanson, Julia Evans, Alex Clemmer, Ben Kuhn, Malcolm Matalka, Nick Bergson-Shilcock, Joe Wilder, Nat Welch, Darius Bacon, Lindsey Kuper, Prabhakar Ragde, Pierre-Yves Baccou, David Turner, Oskar Thoren, Katerina Barone-Adesi, Scott Feeney, Ralph Corderoy, Ezekiel Benjamin Smithburg, and Kyle Littler for comments/corrections/discussion.


  1. In particular, the glassdoor numbers seem low for an average. I suspect that’s because their average is weighed down by older numbers, while compensation has skyrocketed the past seven years. The average numbers on glassdoor don’t even match the average numbers I heard from other people in my midwestern satellite office in a large town two years ago, and the market has gone up sharply since then. More recently, on the upper end, I know someone fresh out of school who has a total comp of almost $250k/yr ($350k equity over four years, a $50k signing bonus, plus a generous salary). As is normal, they got a number of offers with varying compensation levels, and then Facebook came in and bid him up. The companies that are serious about competing for people matched the offers, and that was that. This included bids in Seattle and Austin that matched the bids in SV. If you’re negotiating an offer, the thing that’s critical isn’t to be some kind of super genius. It’s enough to be pretty good, know what the market is paying, and have multiple offers. This person was worth every penny, which is why he got his offers, but I know several people who are just as good who make half as much just because they only got a single offer and had no leverage.

    Anyway, the point of this footnote is just that the total comp for experienced engineers can go way above the numbers mentioned in the post. In the analysis that follows, keep in mind that I’m using conservative numbers and that an aggressive estimate for experienced engineers would be much higher. Just for example, at Google, senior is level 5 out of 11 on a scale that effectively starts at 3. At Microsoft, it’s 63 out of a weirdo scale that starts at 59 and goes to 70-something and then jumps up to 80 (or something like that, I always forget the details because the scale is so silly). Senior isn’t a particularly high band, and people at senior often have total comp substantially greater than $250k/yr. Note that these numbers also don’t include the above market rate of stock growth at trendy large companies in the past few years. If you’ve actually taken this deal, your RSUs have likely appreciated substantially.

    [return]
  2. This depends on the company. It’s true at places like Facebook and Google, which make a serious effort to retain people. It’s nearly completely untrue at places like IBM, National Instruments (NI), and Epic Systems, which don’t even try. And it’s mostly untrue at places like Microsoft, which tries, but in the most backwards way possible.

    Microsoft (and other mid-tier companies) will give you an ok offer and match good offers from other companies. That by itself is already problematic since it incentivizes people who are interviewing at Microsoft to also interview elsewhere. But the worse issue is that they do the same when retaining employees. If you stay at Microsoft for a long time and aren’t one of the few people on the fast track to “partner”, your pay is going to end up severely below market, sometimes by as much as a factor of two. When you realize that, and you interview elsewhere, Microsoft will match external offers, but after getting underpaid for years, by hundreds of thousands or millions of dollars (depending on how long you’ve been there), the promise of making market rate for a single year and then being underpaid for the foreseeable future doesn’t seem very compelling. The incentive structure appears as if it were designed to cause people who are between average and outstanding to leave. I’ve seen this happen with multiple people and I know multiple others who are planning to leave for this exact reason. Their managers are always surprised when this happens, but they shouldn’t be; it’s eminently predictable.

    The IBM strategy actually makes a lot more sense to me than the Microsoft strategy. You can save a lot of money by paying people poorly. That makes sense. But why bother paying a lot to get people in the door and then incentivizing them to leave? While it’s true that the very top people I work with are well compensated and seem happy about it, there aren’t enough of those people that you can rely on them for everything.

    [return]
  3. Some are better about this than others. Older companies, like MS, sometimes have yearly vesting, but a lot of younger companies, like Google, have much smoother vesting schedules once you get past the first year. And then there’s Amazon, which backloads its offers, knowing that they have a high attrition rate and won’t have to pay out much. [return]
  4. Sadly, we ended up not releasing this for business reasons that came up later. [return]
  5. My very first interaction with an employee at big company X orientation was having that employee tell me that I couldn’t get into orientation because I wasn’t on the list. I had to ask how I could get on the list, and I was told that I’d need an email from my manager to get on the list. This was at around 7:30am because orientation starts at 7:30 and then runs for half a day for reasons no one seems to know (I’ve asked a lot of people, all the way up to VPs in HR). When I asked if I could just come back later in the day, I was told that if I couldn’t get in within an hour I’d have to come back next week. I also asked if the fact that I was listed in some system as having a specific manager was evidence that I was supposed to be at orientation and was told that I had to be on the list. So I emailed my manager, but of course he didn’t respond because who checks their email at 7:30am? Luckily, my manager had previously given me his number and told me to call if I ever needed anything, and being able to get into orientation and not have to show up at 7:30am again next week seemed like anything, so I gave him a call. Naturally, he asked to talk to the orientation gatekeeper; when I relayed that to the orientation guy, he told me that he couldn’t talk on the phone – you see, he can only accept emails and can’t talk on the phone, not even just to clarify something. Five minutes into orientation, I was already flabbergasted. But, really, I should have considered myself lucky – the other person who “wasn’t on the list” didn’t have his manager’s phone number, and as far as I know, he had to come back the next week at 7:30am to get into orientation. I asked the orientation person how often this happens, and he told me “very rarely, only once or twice per week”.

    That experience was repeated approximately every half hour for the duration of orientation. I didn’t get dropped from any other orientation stations, but when I asked, I found that every station had errors that dropped people regularly. My favorite was the station where someone was standing at the input queue, handing out a piece of paper. The piece of paper informed you that the machine at the station was going to give you an error with some instructions about what to do. Instead of following those instructions, you had to follow the instructions on the piece of paper when the error occurred.

    These kinds of experiences occupied basically my entire first week. Now that I’m past onboarding and onto the regular day-to-day, I have a surreal Kafka-esque experience a few times a week. And I’ve mostly figured out how to navigate the system (usually, knowing the right person and asking them to intervene solves the problem). What gets me about this isn’t the actual experience, but that most people I talk to who’ve been here a while think that it literally cannot be any other way and that things could not possibly be improved; new hires from younger companies almost always agree that the company is bizarrely screwed up in ways that are incomprehensible. Curiously, people who have been here just as long but who are very senior tend to agree that the company is quite messed up. I wish I had enough data on that to tell which way the causation runs. Something that’s even curiouser is that the company invests a fair amount of effort to give people the impression that things are as good as they could possibly be. At orientation, we got a very strange version of history that made it sound as if the company had pioneered everything from the GUI to the web, with multiple claims that we have the best X in the world, even when X is not best in class but in fact worst in class, so bad that X is a running joke internally. It’s not clear to me what the company gets out of making sure that most employees don’t understand what the downsides are in our own products and processes.

    Whatever the reason, the attitude that things couldn’t possibly be improved isn’t just limited to administrative issues. A friend of mine needed to find a function to do something that’s a trivial one-liner on Linux, but that’s considerably more involved on our OS. His first attempt was to use boost, but it turns out that the documentation for doing this on our OS is complicated enough that boost got this wrong and has had a bug in it for years. A couple of days and 72 lines of code later, he managed to figure out how to create a function to do this trivial-on-Linux thing. Since he wasn’t sure if he was missing something, he forwarded the code review to two very senior engineers (one level below Distinguished Engineer). They weren’t sure and forwarded it on to the CTO, who said that he didn’t see a simpler way to accomplish the same thing in our OS with the APIs as they currently are.

    Later, my friend had a heated discussion with someone on the OS team, who maintained that the documentation on how to do this was very clear, and that it couldn’t be clearer, nor could the API be any easier. This is despite this being so hard to do that boost has been wrong for seven years, and that two very senior engineers didn’t feel confident enough to review the code and passed it up to a CTO.

    Another curious thing is how easy it is to see from the outside that things don’t have to be this way. A while back, I did a round of interviews at other local companies, and they all explicitly disavowed absorbing corporate culture from the company I’m describing; they insisted they weren’t like company X across the street, which is all screwed up from having hired too many employees from this company.

    I’m going to stop here. I’ve been writing down big company stories and saving them, but a mere half a year of big company stories is longer than my blog. Not just longer than this post or any individual post, but longer than everything else on my blog combined, which is a bit over 100k words. Typical estimates for words per page vary between 250 and 1000, putting my rate of surreal experiences at somewhere between 100 and 400 pages every six months. I’m not sure this rate is inherently different from the rate you’d get at startups, but there’s a different flavor to the stories and you should have an idea of the flavor by this point.

    [return]

Normalization of deviance in software: how broken practices become standard

Have you ever mentioned something that seems totally normal to you only to be greeted by surprise? Happens to me all the time, when I describe something everyone at work thinks is normal. For some reason, my conversation partner’s face morphs from pleasant smile to rictus of horror. Here are a few representative examples.

There’s the company that is perhaps the nicest place I’ve ever worked, combining the best parts of Valve and Netflix. The people are amazing and you’re given near total freedom to do whatever you want. But as a side effect of the culture, they lose perhaps half of new hires in the first year, some voluntarily and some involuntarily. Totally normal, right?

There’s the company that’s incredibly secretive about infrastructure. For example, there’s the team that was afraid that, if they reported bugs to their hardware vendor, the bugs would get fixed and their competitors would be able to use the fixes. Solution: request the firmware and fix bugs themselves! More recently, I know a group of folks outside the company who tried to reproduce the algorithm in the paper the company published earlier this year. The group found that they couldn’t reproduce the result, and that the algorithm in the paper resulted in an unusual level of instability; when asked about this, one of the authors responded “well, we have some tweaks that didn’t make it into the paper” and declined to share the tweaks, i.e., the company purposely published an unreproducible result to avoid giving away the details, as is normal. This company enforces secrecy by having a strict policy of firing leakers. This is introduced at orientation with examples of people who got fired for leaking (e.g., the guy who leaked that a concert was going to happen inside a particular office), and by announcing firings for leaks at the company all hands. The result of those policies is that I know multiple people who are afraid to forward emails about things like insurance updates for fear of forwarding the wrong email and getting fired; instead, they use another computer to retype the email and pass it along, or take photos of the email on their phone. Normal.

There’s the office where I asked one day about the fact that I almost never saw two particular people in the same room together. I was told that they had a feud going back a decade, and that things had actually improved – for years, they literally couldn’t be in the same room because one of the two would get too angry and do something regrettable, but things had now cooled to the point where the two could, occasionally, be found in the same wing of the office or even the same room. These weren’t just random people, either. They were the two managers of the only two teams in the office. Normal!

There’s the company whose culture is so odd that, when I sat down to write a post about it, I found that I’d not only written more than for any other single post, but more than all other posts combined (which is well over 100k words now, the length of a moderate book). This is the same company where someone recently explained to me how great it is that, instead of using data to make decisions, we use political connections, and that the idea of making decisions based on data is a myth anyway; no one does that. This is also the company where all four of the things they told me to get me to join were false, and the job ended up being the one thing I specifically said I didn’t want to do. When I joined this company, my team didn’t use version control for months and it was a real fight to get everyone to use version control. Although I won that fight, I haven’t won the fight to get people to run a build, let alone run tests, before checking in, so the build is broken multiple times per day. When I mentioned that I thought this was a problem for our productivity, I was told that it’s fine because it affects everyone equally because that kind of breakage is totally normal.

There’s the company that created multiple massive initiatives to recruit more women into engineering roles, where women still get rejected in recruiter screens for not being technical enough after being asked questions like “was your experience with algorithms or just coding?”, as is normal. I thought that my referral with a very strong recommendation would have prevented that, but I forgot how normal the company was.

There’s the company where I worked on a four person effort with a multi-hundred million dollar budget and a billion dollar a year impact, where requests for things that cost hundreds of dollars routinely took months or were denied.

You might wonder if I’ve just worked at places that are unusually screwed up. Sure, the companies are generally considered to be ok places to work, and two of them are considered to be among the best places to work, but maybe I’ve just ended up at places that are overrated. But I have the same experience when I hear stories about how other companies work, even places with stellar engineering reputations, except that it’s me that’s shocked and my conversation partner who thinks their story is normal.

There’s the company that adopted “move fast and break nothing” as its motto, and continues to regularly break everything while writing blog posts about how careful they are about breaking things. I said “the company”, but if you tweak the exact wording of the motto this actually applies to many normal Bay Area startups.

There are the companies that use @flaky, which include the vast majority of Python-using SF Bay Area unicorns. If you don’t know what this is, it’s a library that lets you add a Python decorator to those annoying flaky tests that sometimes pass and sometimes fail. When I asked multiple co-workers and former co-workers from three different companies what they thought this did, they all guessed that it re-runs the test multiple times and reports a failure if any of the runs fail. Close, but not quite. It’s technically possible to use @flaky for that, but in practice it’s used to re-run the test multiple times and report a pass if any of the runs pass. The company that created @flaky is effectively a storage infrastructure company, and the library is widely used at its major competitor. Marking tests that expose potential bugs as passing is totally normal; after all, that’s what ext2/ext3/ext4 do with write errors.
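
For concreteness, here’s roughly what that looks like with the flaky library; the test body below is just a stand-in for a real test with a race condition, and with min_passes=1 a test that fails twice and passes once is reported as passing.

    # Typical @flaky usage (flaky ships as a pytest/nose plugin).
    import random

    from flaky import flaky

    @flaky(max_runs=3, min_passes=1)
    def test_eventually_consistent_read():
        # Stand-in for a test that only fails some of the time because of a real bug.
        assert random.random() > 0.3

My understanding is that setting min_passes equal to max_runs would give the behavior my co-workers guessed at, but that’s not how the library is typically used.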

There’s the company with a reputation for having great engineering practices that had 2 9s of reliability last time I checked, for reasons that are entirely predictable from their engineering practices. This is the second thing in a row that can’t be deanonymized because multiple companies find it to be normal. Here, I’m not talking about companies trying to be the next reddit or twitter where it’s, apparently, totally fine to have 1 9. I’m talking about companies that sell platforms that other companies rely on, where an outage will cause dependent companies to pause operations for the duration of the outage. Multiple companies that build infrastructure find practices that lead to 2 9s of reliability to be completely and totally normal.
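
For a sense of what those numbers mean, here’s the quick conversion from “number of nines” to the downtime budget per year:

    # Downtime budget per year for N nines of availability.
    HOURS_PER_YEAR = 24 * 365

    for nines in (1, 2, 3, 4):
        availability = 1 - 10 ** -nines           # 90%, 99%, 99.9%, 99.99%
        downtime_hours = HOURS_PER_YEAR * (1 - availability)
        print(f"{nines} nine(s): ~{downtime_hours:.1f} hours of downtime per year")
    # 1 nine:  ~876 hours (over a month)
    # 2 nines: ~88 hours (more than three days)
    # 3 nines: ~8.8 hours
    # 4 nines: ~53 minutes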

As far as I can tell, what happens at these companies is that they started by concentrating almost totally on product growth. That’s completely and totally reasonable, because companies are worth approximately zero when they’re founded; they don’t bother with things that protect them from losses, like good ops practices or actually having security, because there’s nothing to lose (well, except for user data when the inevitable security breach happens, and if you talk to security folks at unicorns you’ll know that these happen).

The result is a culture where people are hyper-focused on growth and ignore risk. That culture tends to stick even after the company has grown to be worth well over a billion dollars and has something to lose. Anyone who comes into one of these companies from Google, Amazon, or another place with solid ops practices is shocked. Often, they try to fix things, and then leave when they can’t make a dent.

Google probably has the best ops and security practices of any tech company today. It’s easy to say that you should take these things as seriously as Google does, but it’s instructive to see how they got there. If you look at the codebase, you’ll see that various services have names ending in z, as do a curiously large number of variables. I’m told that’s because, once upon a time, someone wanted to add monitoring. It wouldn’t really be secure to have google.com/somename expose monitoring data, so they added a z. google.com/somenamez. For security. At the company that is now the best in the world at security.

Google didn’t go from adding z to the end of names to having the world’s best security because someone gave a rousing speech or wrote a convincing essay. They did it after getting embarrassed a few times, which gave people who wanted to do things “right” the leverage to fix fundamental process issues. It’s the same story at almost every company I know of that has good practices. Microsoft was a joke in the security world for years, until multiple disastrously bad exploits forced them to get serious about security. That makes it sound simple, but if you talk to people who were there at the time, the change was brutal. Despite a mandate from the top, there was vicious political pushback from people whose position was that the company got to where it was in 2003 without wasting time on practices like security. Why change what’s worked?

You can see this kind of thing in every industry. A classic example that tech folks often bring up is hand-washing by doctors and nurses. It’s well known that germs exist, and that washing hands properly very strongly reduces the odds of transmitting germs and thereby significantly reduces hospital mortality rates. Despite that, trained doctors and nurses still often don’t do it. Interventions are required. Signs reminding people to wash their hands save lives. But when people stand at hand-washing stations to require others walking by to wash their hands, even more lives are saved. People can ignore signs, but they can’t ignore being forced to wash their hands.

This mirrors a number of attempts at tech companies to introduce better practices. If you tell people they should do it, that helps a bit. If you enforce better practices via code review, that helps a lot.

The data are clear that humans are really bad at taking the time to do things that are well understood to incontrovertibly reduce the risk of rare but catastrophic events. We will rationalize that taking shortcuts is the right, reasonable thing to do. There’s a term for this: the normalization of deviance. It’s well studied in a number of other contexts including healthcare, aviation, mechanical engineering, aerospace engineering, and civil engineering, but we don’t see it discussed in the context of software. In fact, I’ve never seen the term used in the context of software.

Is it possible to learn from others’ mistakes instead of making every mistake ourselves? The state of the industry makes this sound unlikely, but let’s give it a shot. John Banja has a nice summary paper on the normalization of deviance in healthcare, with lessons we can attempt to apply to software development. One thing to note is that, because Banja is concerned with patient outcomes, there’s a close analogy to devops failure modes, but normalization of deviance also occurs in cultural contexts that are less directly analogous.

The first section of the paper details a number of disasters, both in healthcare and elsewhere. Here’s one typical example:

A catastrophic negligence case that the author participated in as an expert witness involved an anesthesiologist’s turning off a ventilator at the request of a surgeon who wanted to take an x-ray of the patient’s abdomen (Banja, 2005, pp. 87-101). The ventilator was to be off for only a few seconds, but the anesthesiologist forgot to turn it back on, or thought he turned it back on but had not. The patient was without oxygen for a long enough time to cause her to experience global anoxia, which plunged her into a vegetative state. She never recovered, was disconnected from artificial ventilation 9 days later, and then died 2 days after that. It was later discovered that the anesthesia alarms and monitoring equipment in the operating room had been deliberately programmed to a “suspend indefinite” mode such that the anesthesiologist was not alerted to the ventilator problem. Tragically, the very instrumentality that was in place to prevent such a horror was disabled, possibly because the operating room staff found the constant beeping irritating and annoying.

Turning off or ignoring notifications because there are too many of them and they’re too annoying? An erroneous manual operation? This could be straight out of the post-mortem of more than a few companies I can think of, except that the result was a tragic death instead of the loss of millions of dollars. If you read a lot of tech post-mortems, every example in Banja’s paper will feel familiar even though the details are different.

The section concludes,

What these disasters typically reveal is that the factors accounting for them usually had “long incubation periods, typified by rule violations, discrepant events that accumulated unnoticed, and cultural beliefs about hazards that together prevented interventions that might have staved off harmful outcomes”. Furthermore, it is especially striking how multiple rule violations and lapses can coalesce so as to enable a disaster’s occurrence.

Once again, this could be from an article about technical failures. That makes the next section, on why these failures happen, seem worth checking out. The reasons given are:

The rules are stupid and inefficient

The example in the paper is about delivering medication to newborns. To prevent “drug diversion,” nurses were required to enter their password onto the computer to access the medication drawer, get the medication, and administer the correct amount. In order to ensure that the first nurse wasn’t stealing drugs, if any drug remained, another nurse was supposed to observe the process, and then enter their password onto the computer to indicate they witnessed the drug being properly disposed of.

That sounds familiar. How many technical postmortems start off with “someone skipped some steps because they’re inefficient”, e.g., “the programmer force pushed a bad config or bad code because they were sure nothing could go wrong and skipped staging/testing”? The infamous November 2014 Azure outage happened for just that reason. At around the same time, a dev at one of Azure’s competitors overrode the rule that you shouldn’t push a config that fails tests because they knew that the config couldn’t possibly be bad. When that caused the canary deploy to start failing, they overrode the rule that you can’t deploy from canary into staging with a failure because they knew their config couldn’t possibly be bad and so the failure must be from something else. That postmortem revealed that the config was technically correct, but exposed a bug in the underlying software; it was pure luck that the latent bug the config revealed wasn’t as severe as the Azure bug.

Humans are bad at reasoning about how failures cascade, so we implement bright line rules about when it’s safe to deploy. But the same thing that makes it hard for us to reason about when it’s safe to deploy makes the rules seem stupid and inefficient!
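
As a sketch of what a bright-line rule can look like when it’s enforced by the deploy tooling rather than by willpower (all of the names here are hypothetical, not any particular company’s system), the important property is that there’s no “I’m sure it’s fine” override:

    # Minimal sketch of a staged-deploy gate; deploy_to and health_check are
    # hypothetical callables standing in for a real deploy system.
    STAGES = ["canary", "staging", "production"]

    def promote(config, deploy_to, health_check):
        """Push a config one stage at a time and stop at the first unhealthy stage."""
        for stage in STAGES:
            deploy_to(stage, config)
            if not health_check(stage):
                raise RuntimeError(
                    f"{stage} is unhealthy; refusing to promote further. "
                    "There is deliberately no manual override here."
                )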

Knowledge is imperfect and uneven

People don’t automatically know what should be normal, and when new people are onboarded, they can just as easily learn deviant processes that have become normalized as reasonable processes.

Julia Evans described to me how this happens:

new person joins
new person: WTF WTF WTF WTF WTF
old hands: yeah we know we’re concerned about it
new person: WTF WTF wTF wtf wtf w…
new person gets used to it
new person #2 joins
new person #2: WTF WTF WTF WTF
new person: yeah we know. we’re concerned about it.

The thing that’s really insidious here is that people will really buy into the WTF idea, and they can spread it elsewhere for the duration of their career. Once, after doing some work on an open source project that’s regularly broken and being told that it’s normal to have a broken build, and that they were doing better than average, I ran the numbers, found that the project was basically worst in class, and wrote something about the idea that it’s possible to have a build that nearly always passes with pretty much zero effort. The most common comment I got in response was, “Wow that guy must work with superstar programmers. But let’s get real. We all break the build at least a few times a week”, as if running tests (or for that matter, even attempting to compile) before checking code in requires superhuman abilities. But once people get convinced that some deviation is normal, they often get really invested in the idea.
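
For what it’s worth, “pretty much zero effort” can be as little as a hook that refuses to push when the tests fail. Here’s a minimal sketch of a git pre-push hook; the pytest command is a placeholder for whatever build and test command the project actually uses.

    #!/usr/bin/env python3
    # Minimal sketch of .git/hooks/pre-push: refuse to push if the tests fail.
    import subprocess
    import sys

    result = subprocess.run(["pytest", "-q"])  # placeholder test command
    if result.returncode != 0:
        print("Tests failed; aborting the push.")
        sys.exit(1)

A CI job that runs the same command on every change, plus a norm that a red build blocks merges, gets you most of the rest of the way.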

I’m breaking the rule for the good of my patient

The example in the paper is of someone who breaks the rule that you should wear gloves when finding a vein. Their reasoning is that wearing gloves makes it harder to find a vein, which may result in their having to stick a baby with a needle multiple times. It’s hard to argue against that. No one wants to cause a baby extra pain!

The second worst outage I can think of occurred when someone noticed that a database service was experiencing slowness. They pushed a fix to the service, and in order to prevent the service degradation from spreading, they ignored the rule that you should do a proper, slow, staged deploy. Instead, they pushed the fix to all machines. It’s hard to argue against that. No one wants their customers to have degraded service! Unfortunately, the fix exposed a bug that caused a global outage.

The rules don’t apply to me/You can trust me

most human beings perceive themselves as good and decent people, such that they can understand many of their rule violations as entirely rational and ethically acceptable responses to problematic situations. They understand themselves to be doing nothing wrong, and will be outraged and often fiercely defend themselves when confronted with evidence to the contrary.

As companies grow up, they eventually have to impose security that prevents every employee from being able to access basically everything. And at most companies, when that happens, some people get really upset. “Don’t you trust me? If you trust me, how come you’re revoking my access to X, Y, and Z?”

Facebook famously let all employees access everyone’s profile for a long time, and you can even find HN comments indicating that some recruiters would explicitly mention that as a perk of working for Facebook. And I can think of more than one well-regarded unicorn where everyone still has access to basically everything, even after their first or second bad security breach. It’s hard to get the political capital to restrict people’s access to what they believe they need, or are entitled, to know. A lot of trendy startups have core values like “trust” and “transparency” which make it difficult to argue against universal access.

Workers are afraid to speak up

There are people I simply don’t give feedback to because I can’t tell if they’d take it well or not, and once you say something, it’s impossible to un-say it. In the paper, the author gives an example of a doctor with poor handwriting who gets mean when people ask him to clarify what he’s written. As a result, people guess instead of asking.

In most company cultures, people feel weird about giving feedback. Everyone has stories about a project that lingered on for months after it should have been terminated because no one was willing to offer explicit feedback. This is a problem even when cultures discourage meanness and encourage feedback: cultures of niceness seem to have as many issues around speaking up as cultures of meanness, if not more. In some places, people are afraid to speak up because they’ll get attacked by someone mean. In others, they’re afraid because they’ll be branded as mean. It’s a hard problem.

Leadership withholding or diluting findings on problems

In the paper, this is characterized by flaws and weaknesses being diluted as information flows up the chain of command. One example is how a supervisor might take sub-optimal actions to avoid looking bad to superiors.

I was shocked the first time I saw this happen. I must have been half a year or a year out of school. I saw that we were doing something obviously non-optimal, and brought it up with the senior person in the group. He told me that he didn’t disagree, but that if we did it my way and there was a failure, it would be really embarrassing. He acknowledged that my way reduced the chance of failure without making the technical consequences of failure worse, but it was more important that we not be embarrassed. Now that I’ve been working for a decade, I have a better understanding of how and why people play this game, but I still find it absurd.

Solutions

Let’s say you notice that your company has a problem that I’ve heard people at most companies complain about: people get promoted for heroism and putting out fires, not for preventing fires; and people get promoted for shipping features, not for doing critical maintenance work and bug fixing. How do you change that?

The simplest option is to just do the right thing yourself and ignore what’s going on around you. That has some positive impact, but the scope of your impact is necessarily limited. Next, you can convince your team to do the right thing: I’ve done that a few times for practices I feel are really important and are sticky, so that I won’t have to continue to expend effort on convincing people once things get moving.

But if the incentives are aligned against you, it will require an ongoing and probably unsustainable effort to keep people doing the right thing. In that case, the problem becomes convincing someone to change the incentives, and then making sure the change works as designed. How to convince people is worth discussing, but long and messy enough that it’s beyond the scope of this post. As for making the change work, I’ve seen many “obvious” mistakes repeated, both in places I’ve worked and those whose internal politics I know a lot about.

Small companies have it easy. When I worked at a 100 person company, the hierarchy was individual contributor (IC) -> team lead (TL) -> CEO. That was it. The CEO had a very light touch, but if he wanted something to happen, it happened. Critically, he had a good idea of what everyone was up to and could basically adjust rewards in real-time. If you did something great for the company, there’s a good chance you’d get a raise. Not in nine months when the next performance review cycle came up, but basically immediately. Not all small companies do that effectively, but with the right leadership, they can. That’s impossible for large companies.

At large company A (LCA), they had the problem we’re discussing and a mandate came down to reward people better for doing critical but low-visibility grunt work. There were too many employees for the mandator to directly make all decisions about compensation and promotion, but the mandator could review survey data, spot check decisions, and provide feedback until things were normalized. My subjective perception is that the company never managed to achieve parity between boring maintenance work and shiny new projects, but got close enough that people who wanted to make sure things worked correctly didn’t have to significantly damage their careers to do it.

At large company B (LCB), ICs agreed that it’s problematic to reward creating new features more richly than doing critical grunt work. When I talked to managers, they often agreed, too. But nevertheless, the people who get promoted are disproportionately those who ship shiny new things. I saw management attempt a number of cultural and process changes at LCB. Mostly, those took the form of pronouncements from people with fancy titles. For really important things, they might produce a video, and enforce compliance by making people take a multiple-choice quiz after watching the video. The net effect I observed among other ICs was that people talked about how disconnected management was from the day-to-day life of ICs. But, for the same reasons that normalization of deviance occurs, that information seems to have no way to reach upper management.

It’s sort of funny that this ends up being a problem about incentives. As an industry, we spend a lot of time thinking about how to incentivize consumers into doing what we want. But then we set up incentive systems that are generally agreed upon as incentivizing us to do the wrong things, and we do so via a combination of a game of telephone and cargo cult diffusion. Back when Microsoft was ascendant, we copied their interview process and asked brain-teaser interview questions. Now that Google is ascendant, we copy their interview process and ask algorithms questions. If you look around at trendy companies that are younger than Google, most of them basically copy their ranking/leveling system, with some minor tweaks. The good news is that, unlike many companies people previously copied, Google has put a lot of thought into most of their processes and made data driven decisions. The bad news is that Google is unique in a number of ways, which means that their reasoning often doesn’t generalize, and that people often cargo cult practices long after they’ve become deprecated at Google.

This kind of diffusion happens for technical decisions, too. Stripe built a reliable message queue on top of Mongo, so we build reliable message queues on top of Mongo1. It’s cargo cults all the way down2.

The paper has specific sub-sections on how to prevent normalization of deviance, which I recommend reading in full.

  • Pay attention to weak signals
  • Resist the urge to be unreasonably optimistic
  • Teach employees how to conduct emotionally uncomfortable conversations
  • System operators need to feel safe in speaking up
  • Realize that oversight and monitoring are never-ending

Let’s look at how the first one of these, “pay attention to weak signals”, interacts with a single example, the “WTF WTF WTF” a new person gives off when they join the company.

If a VP decides something is screwed up, people usually listen. It’s a strong signal. And when people don’t listen, the VP knows what levers to pull to make things happen. But when someone new comes in, they don’t know what levers they can pull to make things happen or who they should talk to almost by definition. They give out weak signals that are easily ignored. By the time they learn enough about the system to give out strong signals, they’ve acclimated.

“Pay attention to weak signals” sure sounds like good advice, but how do we do it? Strong signals are few and far between, making them easy to pay attention to. Weak signals are abundant. How do we filter out the ones that aren’t important? And how do we get an entire team or org to actually do it? These kinds of questions can’t be answered in a generic way; this takes real thought. We mostly put this thought elsewhere. Startups spend a lot of time thinking about growth, and while they’ll all tell you that they care a lot about engineering culture, revealed preference shows that they don’t. With a few exceptions, big companies aren’t much different. At LCB, I looked through the competitive analysis slide decks and they’re amazing. They look at every last detail on hundreds of products to make sure that everything is as nice for users as possible, from onboarding to interop with competing products. If there’s any single screen where things are more complex or confusing than any competitor’s, people get upset and try to fix it. It’s quite impressive. And then when LCB onboards employees in my org, a third of them are missing at least one of an alias/account, an office, or a computer, a condition which can persist for weeks or months. The competitive analysis slide decks talk about how important onboarding is because you only get one chance to make a first impression, and then employees are onboarded with the impression that the company couldn’t care less about them and that it’s normal for quotidian processes to be pervasively broken. LCB can’t even get the basics of employee onboarding right, let alone really complex things like acculturation. This is understandable – external metrics like user growth or attrition are measurable, and targets like how to tell if you’re acculturating people so that they don’t ignore weak signals are softer and harder to determine, but that doesn’t mean they’re any less important. People write a lot about how things like using fancier languages or techniques like TDD or agile will make your teams more productive, but having a strong engineering culture is a much larger force multiplier.

Thanks to Ezekiel Benjamin Smithburg and Marc Brooker for introducing me to the term Normalization of Deviance, and Kelly Eskridge, Leah Hanson, Sophie Rapoport, Ezekiel Benjamin Smithburg, Julia Evans, Dmitri Kalintsev, Ralph Corderoy, Jamie Brandon, Egor Neliuba, and Victor Felder for comments/corrections/discussion.


  1. People seem to think I’m joking here. I can understand why, but try Googling mongodb message queue. You’ll find statements like “replica sets in MongoDB work extremely well to allow automatic failover and redundancy”. Basically every company I know of that’s done this and has anything resembling scale finds this to be non-optimal, to say the least, but you can’t actually find blog posts or talks that discuss that. All you see are the posts and talks from when they first tried it and are in the honeymoon period. This is common with many technologies. You’ll mostly find glowing recommendations in public even when, in private, people will tell you about all the problems. Today, if you do the search mentioned above, you’ll get a ton of posts talking about how amazing it is to build a message queue on top of Mongo, this footnote, and maybe a couple of blog posts by Kyle Kingsbury depending on your exact search terms.

    If there were an acute failure, you might see a postmortem, but while we’ll do postmortems for “the site was down for 30 seconds”, we rarely do postmortems for “this takes 10x as much ops effort as the alternative and it’s a death by a thousand papercuts”, “we architected this thing poorly and now it’s very difficult to make changes that ought to be trivial”, or “a competitor of ours was able to accomplish the same thing with an order of magnitude less effort”. I’ll sometimes do informal postmortems by asking everyone involved oblique questions about what happened, but more for my own benefit than anything else, because I’m not sure people really want to hear the whole truth. This is especially sensitive if the effort has generated a round of promotions, which seems to be more common the more screwed up the project. The larger the project, the more visibility and promotions, even if the project could have been done with much less effort.

    [return]
  2. I’ve spent a lot of time asking about why things are the way they are, both in areas where things are working well, and in areas where things are going badly. Where things are going badly, everyone has ideas. But where things are going well, as in the small company with the light-touch CEO mentioned above, almost no one has any idea why things work. It’s magic. If you ask, people will literally tell you that it seems really similar to some other place they’ve worked, except that things are magically good instead of being terrible for reasons they don’t understand. But it’s not magic. It’s hard work that very few people understand. Something I’ve seen multiple times is that, when a VP leaves, a company will become a substantially worse place to work, and it will slowly dawn on people that the VP was doing an amazing job at supporting not only their direct reports, but making sure that everyone under them was having a good time. It’s hard to see until it changes, but if you don’t see anything obviously wrong, either you’re not paying attention or someone or many someones have put a lot of work into making sure things run smoothly. [return]

We saw some really bad Intel CPU bugs in 2015, and we should expect to see more in the future

2015 was a pretty good year for Intel. Their quarterly earnings reports exceeded expectations every quarter. They continue to be the only game in town for the serious server market, which continues to grow exponentially; from the earnings reports of the two largest cloud vendors, we can see that AWS and Azure grew by 80% and 100%, respectively. That growth has effectively offset the damage Intel has seen from the continued decline of the desktop market. For a while, it looked like cloud vendors might be able to avoid the Intel tax by moving their computation onto FPGAs, but Intel bought one of the two serious FPGA vendors and, combined with their fab advantage, they look well positioned to dominate the high-end FPGA market the same way they’ve been dominating the high-end server CPU market. Also, their fine for anti-competitive practices turned out to be $1.45B, much less than the benefit they gained from their anti-competitive practices1.

Things haven’t looked so great on the engineering/bugs side of things, though. I don’t keep track of Intel bugs unless they’re so serious that people I know are scrambling to get a patch in because of the potential impact, and I still heard about two severe bugs this year in the last quarter of the year alone. First, there was the bug found by Ben Serebrin and Jan Beulich, which allowed a guest VM to fault in a way that would cause the CPU to hang in a microcode infinite loop, allowing any VM to DoS its host.

Major cloud vendors were quite lucky that this bug was found by a Google engineer, and that Google decided to share its knowledge of the bug with its competitors before publicly disclosing. Black hats spend a lot of time trying to take down major services. I’m actually really impressed by both the persistence and the cleverness of the people who spend their time attacking the companies I work for. If, buried deep in our infrastructure, we have a bit of code running at DPC that’s vulnerable to slowdown because of some kind of hash collision, someone will find and exploit that, even if it takes a long and obscure sequence of events to make it happen. And they’ll often wait until an inconvenient time to start the attack, such as Christmas, or one of the big online shopping days. If this CPU microcode hang had been found by one of these black hats, there would have been major carnage for most cloud hosted services at the most inconvenient possible time2.

Shortly after the Serebrin/Beulich bug was found, a group of people found that running prime95, a commonly used tool for benchmarking and burn-in, caused their entire system to lock up. Intel’s response to this was:

Intel has identified an issue that potentially affects the 6th Gen Intel® Core™ family of products. This issue only occurs under certain complex workload conditions, like those that may be encountered when running applications like Prime95. In those cases, the processor may hang or cause unpredictable system behavior.

which reveals almost nothing about what’s actually going on. If you look at their errata list, you’ll find that this is typical, except that they normally won’t even name the application that was used to trigger the bug. For example, one of the current errata lists has entries like

  • Certain Combinations of AVX Instructions May Cause Unpredictable System Behavior
  • AVX Gather Instruction That Should Result in #DF May Cause Unexpected System Behavior
  • Processor May Experience a Spurious LLC-Related Machine Check During Periods of High Activity
  • Page Fault May Report Incorrect Fault Information

As we’ve seen, “unexpected system behavior” can mean that we’re completely screwed. Machine checks aren’t great either – they cause Windows to blue screen and Linux to kernel panic. An incorrect address on a page fault is potentially even worse than a mere crash, and if you dig through the list you can find a lot of other scary sounding bugs.

And keep in mind that the Intel errata list has the following disclaimer:

Errata remain in the specification update throughout the product’s lifecycle, or until a particular stepping is no longer commercially available. Under these circumstances, errata removed from the specification update are archived and available upon request.

Once they stop manufacturing a stepping (the hardware equivalent of a point release), they reserve the right to remove the errata and you won’t be able to find out what errata your older stepping has unless you’re important enough to Intel.

Anyway, back to 2015. We’ve seen at least two serious bugs in Intel CPUs in the last quarter3, and it’s almost certain there are more bugs lurking. Back when I worked at a company that produced Intel compatible CPUs, we did a fair amount of testing and characterization of Intel CPUs; as someone fresh out of school who’d previously assumed that CPUs basically worked, I was surprised by how many bugs we were able to find. Even though I never worked on the characterization and competitive analysis side of things, I still personally found multiple Intel CPU bugs just in the normal course of doing my job, poking around to verify things that seemed non-obvious to me. Turns out things that seem non-obvious to me are sometimes also non-obvious to Intel engineers. As more services move to the cloud and the impact of system hang and reset vulnerabilities increases, we’ll see more black hats investing time in finding CPU bugs. We should expect to see a lot more of these when people realize that it’s much easier than it seems to find these bugs. There was a time when a CPU family might only have one bug per year, with serious bugs happening once every few years, or even once a decade, but we’ve moved past that. In part, that’s because “unpredictable system behavior” has moved from being an annoying class of bugs that forces you to restart your computation to an attack vector that lets anyone with an AWS account attack random cloud-hosted services, but it’s mostly because CPUs have gotten more complex, making them more difficult to test and audit effectively, while Intel appears to be cutting back on validation effort. Ironically, we have hardware virtualization that’s supposed to help us with security, but the virtualization is so complicated4 that the hardware virtualization implementation is likely to expose “unpredictable system behavior” bugs that wouldn’t otherwise have existed. This isn’t to say it’s hopeless – it’s possible, in principle, to design CPUs such that a hang bug on one core doesn’t crash the entire system. It’s just that it’s a fair amount of work to do that at every level (cache directories, the uncore, etc., would have to be modified to operate when a core is hung, as well as OS schedulers). No one’s done the work because it hasn’t previously seemed important.

Update

After writing this, an ex-Intel employee said “even with your privileged access, you have no idea” and a pseudo-anonymous commenter on reddit made this shocking comment:

As someone who worked in an Intel Validation group for SOCs until mid-2014 or so I can tell you, yes, you will see more CPU bugs from Intel than you have in the past from the post-FDIV-bug era until recently.

Why?

Let me set the scene: It’s late in 2013. Intel is frantic about losing the mobile CPU wars to ARM. Meetings with all the validation groups. Head honcho in charge of Validation says something to the effect of: “We need to move faster. Validation at Intel is taking much longer than it does for our competition. We need to do whatever we can to reduce those times… we can’t live forever in the shadow of the early 90’s FDIV bug, we need to move on. Our competition is moving much faster than we are” - I’m paraphrasing. Many of the engineers in the room could remember the FDIV bug and the ensuing problems caused for Intel 20 years prior. Many of us were aghast that someone highly placed would suggest we needed to cut corners in validation - that wasn’t explicitly said, of course, but that was the implicit message. That meeting there in late 2013 signalled a sea change at Intel to many of us who were there. And it didn’t seem like it was going to be a good kind of sea change. Some of us chose to get out while the getting was good. As someone who worked in an Intel Validation group for SOCs until mid-2014 or so I can tell you, yes, you will see more CPU bugs from Intel than you have in the past from the post-FDIV-bug era until recently.

I haven’t been able to confirm this story from another source I personally know, although another anonymous commenter said “I left INTC in mid 2013. From validation. This … is accurate compared with my experience.” Another anonymous person didn’t hear that speech, but found that at around that time, “velocity” became a buzzword and management spent a lot of time talking about how Intel needs more “velocity” to compete with ARM, which appears to confirm the sentiment, if not the actual speech.

I’ve also heard from formal methods people that, around that time, there was an exodus of formal verification folks. One story I’ve heard is that people left because they were worried about being made redundant. I’m told that, at the time, early retirement packages were being floated around and people strongly suspected layoffs. Another story I’ve heard is that things got really strange due to Intel’s focus on the mobile battle with ARM, and people wanted to leave before things got even worse. But it’s hard to say if this means anything, since Intel has been losing a lot of people to Apple because Apple offers better compensation packages and the promise of being less dysfunctional.

I also got anonymous stories about bugs. One person who works in HPC told me that when they were shopping for Haswell parts, a little bird told them that they’d see drastically reduced performance on variants with greater than 12 cores. When they tried building out both 12-core and 16-core systems, they found that they got noticeably better performance on their 12-core systems across a wide variety of workloads. That’s not better per-core performance – that’s better absolute performance. Adding 4 more cores reduced the performance on parallel workloads! That was true both in single-socket and two-socket benchmarks.

There’s also a mysterious hang during idle/low-activity bug that Intel doesn’t seem to have figured out yet.

And then there’s this Broadwell bug that hangs Linux if you don’t disable low-power states.

And of course Intel isn’t the only company with bugs – this AMD bug found by Robert Swiecki not only allows a VM to crash its host, it also allows a VM to take over the host.

I doubt I’ve even heard of all the recent bugs and stories about verification/validation. Feel free to send other reports my way.

Thanks to Leah Hanson, Jeff Ligouri, Derek Slager, Ralph Corderoy, Joe Wilder, Nate Martin, Hari Angepat, and a number of anonymous tipsters for comments/corrections/discussion.


  1. As with the Apple, Google, Adobe, etc., wage-fixing agreement, legal systems are sending the clear message that businesses should engage in illegal and unethical behavior since they’ll end up getting fined a small fraction of what they gain. This is the opposite of the Becker-ian policy that’s applied to individuals, where sentences have gotten jacked up on the theory that, since many criminals aren’t caught, the criminals that are caught should have severe punishments applied as a deterrence mechanism. The theory is that the criminals will rationally calculate the expected sentence from a crime, and weigh that against the expected value of a crime. If, for example, the odds of being caught are 1% and we increase the expected sentence from 6 months to 50 years, criminals will calculate that the expected sentence has changed from 2 days to 6 months, thereby reducing the effective value of the crime and causing a reduction in crime. We now have decades of evidence that the theory that long sentences will deter crime is either empirically false or that the effect is very small; turns out that people who commit crimes on impulse don’t deeply study sentencing guidelines before committing crimes. Ironically, for white-collar corporate crimes where Becker’s theory might more plausibly hold, Becker’s theory isn’t applied. [return]
  2. Something I find curious is how non-linear the level of effort of the attacks is. Google, Microsoft, and Amazon face regular, persistent, attacks, and if they couldn’t trivially mitigate the kind of unsophisticated attack that’s been severely affecting linode availability for weeks, they wouldn’t be able to stay in business. If you talk to people at various bay area unicorns, you’ll find that a lot of them have accidentally DoS’d themselves when they hit an external API too hard during testing. In the time that it takes a sophisticated attacker to find a hole in Azure that will cause an hour of disruption across 1% of VMs, that same attacker could probably completely take down ten unicorns for a much longer period of time. And yet, these attackers are hyper focused on the most hardened targets. Why is that? [return]
  3. The fault-into-microcode-infinite-loop bug also affects AMD processors, but basically no one runs a cloud on AMD chips. I’m pointing out Intel examples because Intel bugs have higher impact, not because Intel is buggier. Intel has a much better track record on bugs than AMD. IBM is the only major microprocessor company I know of that’s been more serious about hardware verification than Intel, but if you have an IBM system running AIX, I could tell you some stories that will make your hair stand on end. Moreover, it’s not clear how effective their verification groups can be since they’ve been losing experienced folks without being able to replace them for over a decade, but that’s a topic for another post. [return]
  4. See this code for a simple example of how to use Intel’s API for this. The example is simplified, so much so that it’s not really useful except as a learning aid, and it still turns out to be around 1000 lines of low-level code. [return]

The Nyquist theorem and limitations of sampling profilers today, with glimpses of tracing tools from the future

Perf is probably the most widely used general purpose performance debugging tool on Linux. There are multiple contenders for the #2 spot, and, like perf, they’re sampling profilers. Sampling profilers are great. They tend to be easy-to-use and low-overhead compared to most alternatives. However, there are large classes of performance problems sampling profilers can’t debug effectively, and those problems are becoming more important.

For example, consider a Google search query. Below, we have a diagram of how a query is carried out. Each of the black boxes is a rack of machines and each line shows a remote procedure call (RPC) from one machine to another.

The diagram shows a single search query coming in, which issues RPCs to over a hundred machines (shown in green), each of which delivers another set of requests to the next, lower level (shown in blue). Each request at that lower level also issues a set of RPCs, which aren’t shown because there’s too much going on to effectively visualize. At that last leaf level, the machines do 1ms-2ms of work, and respond with the result, which gets propagated and merged on the way back, until the search result is assembled. While that’s happening, 20-100 other search queries will touch any given leaf machine. A single query might touch a couple thousand machines to get its results. If we look at the latency distribution for RPCs, we’d expect that, with that many RPCs, any particular query will almost certainly see some RPCs hit their 99%-ile worst case (tail) latency – and much worse than mere 99%-ile, actually.
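
To make that concrete, here’s a back-of-the-envelope sketch. The per-RPC tail probability and the fan-out counts below are illustrative assumptions, not Google’s actual numbers; the point is just how quickly the odds of hitting at least one tail-latency RPC approach certainty as fan-out grows:

```c
/* tail.c - compile with: cc tail.c -lm */
#include <math.h>
#include <stdio.h>

/* Probability that a query fanning out to n leaf machines sees at least one
 * response slower than the per-RPC 99%-ile. Illustrative numbers only. */
int main(void) {
    double p_slow = 0.01;             /* chance a single RPC lands in its own tail */
    int fanouts[] = {10, 100, 2000};  /* hypothetical fan-out counts */
    for (int i = 0; i < 3; i++) {
        int n = fanouts[i];
        double p_any = 1.0 - pow(1.0 - p_slow, n);
        printf("fan-out %4d: P(at least one 99%%-ile RPC) = %.4f\n", n, p_any);
    }
    return 0;
}
```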

That latency translates directly into money. It’s now well established that adding user latency reduces ad clicks, reduces the odds that a user will complete a transaction and buy something, reduces the odds that a user will come back later and become a repeat customer, etc. Over the past ten to fifteen years, the understanding that tail latency is an important factor in determining user latency, and that user latency translates directly to money, has trickled out from large companies like Google into the general consciousness. But debugging tools haven’t kept up.

Sampling profilers, the most common performance debugging tool, are notoriously bad at debugging problems caused by tail latency because they aggregate events into averages. But tail latency is, by definition, not average.

For more on this, let’s look at this wide-ranging Dick Sites talk1 which covers, among other things, the performance tracing framework that Dick and others have created at Google. By capturing “every” event that happens, it lets us easily debug performance oddities that would otherwise be difficult to track down. We’ll take a look at three different bugs to get an idea about the kinds of problems Google’s tracing framework is useful for.

First, we can look at another view of the search query we just saw above: given a top-level query that issues some number of RPCs, how long does it take to get responses?

Time goes from left to right. Each row is one rpc, with the blue bar showing when the RPC was issued and when it finished. We can see that the first RPC is issued and returns before 93 other RPCs go out. When the last of those 93 RPCs is done, the search result is returned. We can see that two of the RPCs take substantially longer than the rest; the slowest RPC gates the result of the search query.

To debug this problem, we want a couple things. Because the vast majority of RPCs in a slow query are normal, and only a couple are slow, we need something that does more than just show aggregates, like a sampling profiler would. We need something that will show us specifically what’s going on in the slow RPCs. Furthermore, because weird performance events may be hard to reproduce, we want something that’s cheap enough that we can run it all the time, allowing us to look at any particular case of bad performance in retrospect. In the talk, Dick Sites mentions having a budget of about 1% of CPU for the tracing framework they have.

In addition, we want a tool that has time-granularity that’s much shorter than the granularity of the thing we’re debugging. Sampling profilers typically run at something like 1 kHz (1 ms between samples), which gives little insight into what happens in a one-time event, like a slow RPC that still executes in under 1ms. There are tools that will display what looks like a trace from the output of a sampling profiler, but the resolution is so poor that these tools provide no insight into most performance problems. While it’s possible to crank up the sampling rate on something like perf, you can’t get as much resolution as we need for the problems we’re going to look at.

Getting back to the framework, to debug something like this, we might want to look at a much more zoomed in view. Here’s an example with not much going on (just tcpdump and some packet processing with recvmsg), just to illustrate what we can see when we zoom in.

The horizontal axis is time, and each row shows what a CPU is executing. The different colors indicate that different things are running. The really tall slices are kernel mode execution, the thin black line is the idle process, and the medium height slices are user mode execution. We can see that CPU0 is mostly handling incoming network traffic in a user mode process, with 18 switches into kernel mode. CPU1 is maybe half idle, with a lot of jumps into kernel mode, doing interrupt processing for tcpdump. CPU2 is almost totally idle, except for a brief chunk when a timer interrupt fires.

What’s happening is that every time a packet comes in, an interrupt is triggered to notify tcpdump about the packet. The packet is then delivered to the process that called recvmsg on CPU0. Note that running tcpdump isn’t cheap, and it actually consumes 7% of a server if you turn it on when the server is running at full load. This only dumps network traffic, and it’s already at 7x the budget we have for tracing everything! If we were to look at this in detail, we’d see that Linux’s TCP/IP stack has a large instruction footprint, and workloads like tcpdump will consistently come in and wipe that out of the l1i and l2 caches.

Anyway, now that we’ve seen a simple example of what it looks like when we zoom in on a trace, let’s look at how we can debug the slow RPC we were looking at before.

We have two views of a trace of one machine here. At the top, there’s one row per CPU, and at the bottom there’s one row per RPC. Looking at the top set, we can see that there are some bits where individual CPUs are idle, but that the CPUs are mostly quite busy. Looking at the bottom set, we can see parts of 40 different searches, most of which take around 50us, with the exception of a few that take much longer, like the one pinned between the red arrows.

We can also look at a trace of the same timeframe showing which locks are being held and which threads are executing. The arcs between the threads and the locks show when a particular thread is blocked, waiting on a particular lock. If we look at this, we can see that the time spent waiting for locks is sometimes much longer than the time spent actually executing anything. The thread pinned between the arrows is the same thread that’s executing that slow RPC. It’s a little hard to see what’s going on here, so let’s focus on that single slow RPC.

We can see that this RPC spends very little time executing and a lot of time waiting. We can also see that we’d have a pretty hard time trying to find the cause of the waiting with traditional performance measurement tools. According to stackoverflow, you should use a sampling profiler! But tools like oprofile are useless since they’ll only tell us what’s going on when our RPC is actively executing. What we really care about is what our thread is blocked on and why.

Instead of following the advice from stackoverflow, let’s look at the second view of this again.

We can see that, not only is this RPC spending most of its time waiting for locks, it’s actually spending most of its time waiting for the same lock, with only a short chunk of execution time between the waiting. With this, we can look at the cause of the long wait for a lock. Additionally, if we zoom in on the period between waiting for the two locks, we can see something curious.

It takes 50us for the thread to start executing after it gets scheduled. Note that the wait time is substantially longer than the execution time. The waiting is because an affinity policy was set which will cause the scheduler to try to schedule the thread back to the same core so that any data that’s in the core’s cache will still be there, giving you the best possible cache locality, which means that the thread will have to wait until the previously scheduled thread finishes. That makes intuitive sense, but consider, for example, a 2.2GHz Skylake, where the latencies to the l2 and l3 caches are 6.4ns and 21.2ns, respectively. Is it worth changing the affinity policy to speed this kind of thing up? You can’t tell from this single trace, but with the tracing framework used to generate this data, you could do the math to figure out if you should change the policy.

In the talk, Dick notes that, given the actual working set size, it would be worth waiting up to 10us to schedule on another CPU sharing the same l2 cache, and 100us to schedule on another CPU sharing the same l3 cache2.
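
As a rough illustration of what “doing the math” might look like, here’s a sketch that estimates the break-even wait time from the working-set size, using the Skylake latency figures quoted above. The working-set sizes, and the simplifying assumption that the refill cost is roughly (cache lines in the working set) × (extra latency per line), are mine rather than from the talk, and they ignore bandwidth limits and prefetching:

```c
#include <stdio.h>

/* Rough break-even estimate: how long is it worth waiting for the original
 * (cache-warm) core rather than migrating to a core that only shares the l2
 * or l3 with it and refilling the local cache from there? Assumes refill
 * cost ~= (working-set cache lines) * (extra latency per line). Latency
 * figures are the ~2.2GHz Skylake numbers quoted above; working-set sizes
 * are hypothetical. */
int main(void) {
    const double line_bytes = 64.0;
    const double l2_extra_ns = 6.4;   /* per-line cost to pull data from a shared l2 */
    const double l3_extra_ns = 21.2;  /* per-line cost to pull data from a shared l3 */
    const double working_sets_bytes[] = {64e3, 512e3, 2e6};  /* hypothetical */

    for (int i = 0; i < 3; i++) {
        double lines = working_sets_bytes[i] / line_bytes;
        printf("working set %5.0f kB: worth waiting ~%6.1f us (vs shared l2), ~%6.1f us (vs shared l3)\n",
               working_sets_bytes[i] / 1e3,
               lines * l2_extra_ns / 1e3,
               lines * l3_extra_ns / 1e3);
    }
    return 0;
}
```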

Something else you can observe from this trace is that, if you care about a workload that resembles Google search, basically every standard benchmark out there is bad, and the standard technique of running N copies of spec is terrible. That’s not a straw man. People still do that in academic papers today, and some chip companies use SPEC to benchmark their mobile devices!

Anyway, that was one performance issue where we were able to see what was going on because of the ability to see a number of different things at the same time (CPU scheduling, thread scheduling, and locks). Let’s look at a simpler single-threaded example on a single machine where a tracing framework is still beneficial:

This is a trace from gmail, circa 2004. Each row shows the processing that it takes to handle one email. Well, except for the last 5 rows; the last email shown takes so long to process that displaying all of the processing takes 5 rows of space. If we look at each of the normal emails, they all look approximately the same in terms of what colors (i.e., what functions) are called and how much time they take. The last one is different. It starts the same as all the others, but then all this other junk appears that only happens in the slow email.

The email itself isn’t the problem – all of that extra junk is the processing that’s done to reindex the words from the emails that had just come in, which was batched up across multiple emails. This picture caused the Gmail devs to move that batch work to another thread, reducing tail latency from 1800ms to 100ms. This is another performance bug that it would be very difficult to track down with standard profiling tools. I’ve often wondered why email almost always appears quickly when I send to gmail from gmail, and it sometimes takes minutes when I send work email from outlook to outlook. My guess is that a major cause is that it’s much harder for the outlook devs to track down tail latency bugs like this than it is for the gmail devs to do the same thing.

Let’s look at one last performance bug before moving on to discussing what kind of visibility we need to track these down. This is a bit of a spoiler, but with this bug, it’s going to be critical to see what the entire machine is doing at any given time.

This is a histogram of disk latencies on storage machines for a 64kB read, in ms. There are two sets of peaks in this graph. The ones that make sense, on the left in blue, and the ones that don’t, on the right in red.

Going from left to right on the peaks that make sense, first there’s the peak at 0ms for things that are cached in RAM. Next, there’s a peak at 3ms. That’s way too fast for the 7200rpm disks we have to transfer 64kB; the time to get a random point under the head is already (1/(7200/60)) / 2 s = 4ms. That must be the time it takes to transfer something from the disk’s cache over PCIe. The next peak, at near 25ms, is the time it takes to seek to a point and then read 64kB off the disk.

Those numbers don’t look so bad, but the 99%-ile latency is a whopping 696ms, and there are peaks at 250ms, 500ms, 750ms, 1000ms, etc. And these are all unreproducible – if you go back and read a slow block again, or even replay the same sequence of reads, the slow reads are (usually) fast. That’s weird! What could possibly cause delays that long? In the talk, Dick Sites says “each of you think of a guess, and you’ll find you’re all wrong”.

That’s a trace of thirteen disks in a machine. The blue blocks are reads, and the red blocks are writes. The black lines show the time from the initiation of a transaction by the CPU until the transaction is completed. There are some black lines without blocks because some of the transactions hit in a cache and don’t require actual disk activity. If we wait for a period where we can see tail latency and zoom in a bit, we’ll see this:

We can see that there’s a period where things are normal, and then some kind of phase transition into a period where there are 250ms gaps (4) between periods of disk activity (5) on the machine for all disks. This goes on for nine minutes. And then there’s a phase transition and disk latencies go back to normal. That it’s machine wide and not disk specific is a huge clue.

Using that information, Dick pinged various folks about what could possibly cause periodic delays that are a multiple of 250ms on an entire machine, and found out that the cause was kernel throttling of the CPU for processes that went beyond their usage quota. To enforce the quota, the kernel puts all of the relevant threads to sleep until the next multiple of a quarter second. When the quarter-second hand of the clock rolls around, it wakes up all the threads, and if those threads are still using too much CPU, the threads get put back to sleep for another quarter second. The phase change out of this mode happens when, by happenstance, there aren’t too many requests in a quarter second interval and the kernel stops throttling the threads.
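
The throttling in this story was in Google’s kernel, but mainline Linux has a similar mechanism in CFS bandwidth control: a cgroup that exhausts its CPU quota has its threads put to sleep until the next period. As a rough sketch of how you might check whether this is happening to you, the following reads the throttling counters from a cgroup v1 cpu controller; the mount path is an assumption and varies by system (cgroup v2 exposes similar counters in its combined cpu.stat file):

```c
#include <stdio.h>
#include <string.h>

/* Print CFS bandwidth throttling counters for a cgroup. Assumes a cgroup v1
 * cpu controller mounted at /sys/fs/cgroup/cpu; adjust the path for your
 * system and cgroup of interest. */
int main(void) {
    const char *path = "/sys/fs/cgroup/cpu/cpu.stat";  /* assumed default path */
    FILE *f = fopen(path, "r");
    if (!f) {
        perror(path);
        return 1;
    }
    char key[64];
    long long value;
    while (fscanf(f, "%63s %lld", key, &value) == 2) {
        /* nr_throttled: periods in which the group hit its quota and was put to sleep.
         * throttled_time: total nanoseconds spent throttled. */
        if (!strcmp(key, "nr_periods") || !strcmp(key, "nr_throttled") ||
            !strcmp(key, "throttled_time")) {
            printf("%s = %lld\n", key, value);
        }
    }
    fclose(f);
    return 0;
}
```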

After finding the cause, an engineer found that this was happening on 25% of disk servers at Google, for an average of half an hour a day, with periods of high latency as long as 23 hours. This had been happening for three years3. Dick Sites says that fixing this bug paid for his salary for a decade. This is another bug where traditional sampling profilers would have had a hard time. The key insight was that the slowdowns were correlated and machine wide, which isn’t something you can see in a profile.

One question you might have is, is this because of some flaw in existing profilers, or can profilers provide enough information that you don’t need to use tracing tools to track down rare, long-tail, performance bugs? I’ve been talking to Xi Yang about this, who had an ISCA 2015 paper and talk describing some of his work. He and his collaborators have done a lot more since publishing the paper, but the paper still contains great information on how far a profiling tool can be pushed. As Xi explains in his talk, one of the fundamental limits of a sampling profiler is how often you can sample.

This is a graph of the number of executed instructions per clock (IPC) over time in Lucene, which is the core of Elasticsearch.

At 1kHz, which is the default sampling interval for perf, you basically can’t see that anything changes over time at all. At 100kHz, which is as fast as perf runs, you can tell something is going on, but not what. The 10MHz graph is labeled SHIM because that’s the name of the tool presented in the paper. At 10MHz, you get a much better picture of what’s going on (although it’s worth noting that 10MHz is substantially lower resolution than you can get out of some tracing frameworks).

If we look at the IPC in different methods, we can see that we’re losing a lot of information at the slower sampling rates:

These are the top 10 hottest methods in Lucene, ranked by execution time; these 10 methods account for 74% of the total execution time. With perf, it’s hard to tell which methods have low IPC, i.e., which methods are spending time stalled. But with SHIM, we can clearly see that there’s one method that spends a lot of time waiting, #4.

In retrospect, there’s nothing surprising about these graphs. We know from the Nyquist theorem that, to observe a signal with some frequency, X, we have to sample at a rate of at least 2X. A lot of factors that affect performance change at frequencies higher than 1kHz (e.g., CPU p-state changes), so we should expect to be unable to directly observe many of the things that affect performance with perf or other traditional sampling profilers. If we care about microbenchmarks, we can get around this by repeatedly sampling the same thing over and over again, but for rare or one-off events, it may be hard or impossible to do that.
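
As a toy illustration of the aliasing problem (synthetic numbers, not Lucene measurements), consider an “IPC” signal that flips between 0.5 and 2.5 every 100us. A 10MHz sampler resolves both levels; a 1kHz sampler, whose interval happens to be a multiple of the signal’s period, reports a perfectly flat and badly wrong picture:

```c
#include <stdio.h>

/* Toy aliasing demo: a synthetic "IPC" signal alternates between 0.5 and 2.5
 * every 100us (a 5kHz square wave). Sampling at 10MHz resolves the
 * alternation; sampling at 1kHz, whose 1000us interval is an exact multiple
 * of the 200us period, always lands on the same phase and sees a flat 0.5.
 * Purely illustrative, not a measurement of any real workload. */
static double ipc_at(double t_us) {
    return ((long long)(t_us / 100.0)) % 2 ? 2.5 : 0.5;
}

static void sample(const char *name, double interval_us, double duration_us) {
    double min = 1e9, max = -1e9, sum = 0;
    long long n = 0;
    for (double t = 0.3; t < duration_us; t += interval_us, n++) {  /* small phase offset */
        double v = ipc_at(t);
        if (v < min) min = v;
        if (v > max) max = v;
        sum += v;
    }
    printf("%-6s samples=%9lld  min=%.1f max=%.1f mean=%.2f\n",
           name, n, min, max, sum / n);
}

int main(void) {
    double duration_us = 1e6;              /* one second of simulated time */
    sample("1kHz", 1000.0, duration_us);   /* one sample per ms, like perf's default */
    sample("10MHz", 0.1, duration_us);     /* one sample per 100ns, like SHIM */
    return 0;
}
```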

This raises a few questions:

  1. Why does perf sample so infrequently?
  2. How does SHIM get around the limitations of perf?
  3. Why are sampling profilers dominant?

1. Why does perf sample so infrequently?

This comment from events/core.c in the linux kernel explains the limit:

perf samples are done in some very critical code paths (NMIs). If they get too much CPU time, the system can lock up and not get any real work done.

As we saw from the tcpdump trace in the Dick Sites talk, interrupts take a significant amount of time to get processed, which limits the rate at which you can sample with an interrupt based sampling mechanism.

2. How does SHIM get around the limitations of perf?

Instead of having an interrupt come in periodically, like perf, SHIM instruments the runtime so that it periodically runs a code snippet that can squirrel away relevant information. In particular, the authors instrumented the Jikes RVM, which injects yield points into every method prologue, method epilogue, and loop backedge. At a high level, injecting a code snippet into every function prologue and epilogue sounds similar to what Dick Sites describes in his talk.

The details are different, and I recommend both watching the Dick Sites talk and reading the Yang et al. paper if you’re interested in performance measurement, but the fundamental similarity is that both of them decided that it’s too expensive to have another thread break in and sample periodically, so they both ended up injecting some kind of tracing code into the normal execution stream.

It’s worth noting that sampling, at any frequency, is going to miss waiting on (for example) software locks. Dick Sites’s recommendation for this is to timestamp based on wall clock (not CPU clock), and then try to find the underlying causes of unusually long waits.

3. Why are sampling profilers dominant?

We’ve seen that Google’s tracing framework allows us to debug performance problems that we’d never be able to catch with traditional sampling profilers, while also collecting the data that sampling profilers collect. From the outside, SHIM looks like a high-frequency sampling profiler, but it does so by acting like a tracing tool. Even perf is getting support for low-overhead tracing. Intel added hardware support for certain types of tracing in Broadwell and Skylake, along with kernel support in 4.1 (with user mode support for perf coming in 4.3). If you’re wondering how much overhead these tools have, Andi Kleen claims that the Intel tracing support in Linux has about a 5% overhead, and Dick Sites mentions in the talk that they have a budget of about 1% overhead.

It’s clear that state-of-the-art profilers are going to look a lot like tracing tools in the future, but if we look at the state of things today, the easiest options are all classical profilers. You can fire up a profiler like perf and it will tell you approximately how much time various methods are taking. With other basic tooling, you can tell what’s consuming memory. Between those two numbers, you can solve the majority of performance issues. Building out something like Google’s performance tracing framework is non-trivial, and cobbling together existing publicly available tools to trace performance problems is a rough experience. You can see one example of this when Marek Majkowski debugged a tail latency issue using System Tap.

In Brendan Gregg’s page on Linux tracers, he says “[perf_events] can do many things, but if I had to recommend you learn just one [tool], it would be CPU profiling”. Tracing tools are cumbersome enough that his top recommendation on his page about tracing tools is to learn a profiling tool!

Now what?

If you want to use a tracing tool like the one we looked at today, your options are:

  1. Get a job at Google
  2. Build it yourself
  3. Cobble together what you need out of existing tools

1. Get a job at Google

I hear Steve Yegge has good advice on how to do this. If you go this route, try to attend orientation in Mountain View. They have the best orientation.

2. Build it yourself

If you look at the SHIM paper, there’s a lot of cleverness built in to get really fine-grained information while minimizing overhead. I think their approach is really neat, but considering the current state of things, you can get a pretty substantial improvement without much cleverness. Fundamentally, all you really need is some way to inject your tracing code at the appropriate points, some number of bits for a timestamp, plus a handful of bits to store the event.

Say you want to trace transitions between user mode and kernel mode. The transitions between waiting and running will tell you what the thread was waiting on (e.g., disk, timer, IPI, etc.). There are maybe 200k transitions per second per core on a busy node. 200k events with a 1% overhead is 50ns per event per core. A cache miss is well over 100 cycles, so our budget is less than one cache miss per event, meaning that each record must fit within a fraction of a cache line. If we have 20 bits of timestamp (RDTSC >> 8 bits, giving ~100ns resolution and 100ms range) and 12 bits of event, that’s 4 bytes, or 16 events per cache line. Each core has to have its own buffer to avoid cache contention. To map RDTSC times back to wall clock times, calling gettimeofday along with RDTSC at least every 100ms is sufficient.
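
Here’s a minimal sketch of what such a record and per-core buffer might look like, assuming x86 and GCC/Clang for the __rdtsc intrinsic. The event codes, the buffer size, and the absence of any per-CPU indexing or draining logic are placeholders, not a real design:

```c
#include <stdint.h>
#include <x86intrin.h>   /* __rdtsc() */

/* One 4-byte trace record: 20 bits of (RDTSC >> 8) timestamp and 12 bits of
 * event code, as sketched above. Event codes and sizes are hypothetical. */
enum {
    EV_SYSCALL_ENTER = 1,   /* user -> kernel */
    EV_SYSCALL_EXIT  = 2,   /* kernel -> user */
    EV_SCHED_IN      = 3,   /* thread starts running */
    EV_SCHED_OUT     = 4,   /* thread stops running; real code would record why */
};

#define NUM_CORES       40
#define EVENTS_PER_CORE (1 << 20)   /* 4MB per core here; scale toward the 1GB-4GB
                                       total below for a 30s-120s window */

struct core_buf {
    uint32_t events[EVENTS_PER_CORE];
    uint32_t next;
} __attribute__((aligned(64)));     /* keep each core's buffer off shared cache lines */

static struct core_buf bufs[NUM_CORES];  /* real code would index by the current CPU */

static inline void trace_event(struct core_buf *b, uint32_t event_code) {
    uint32_t ts20 = (uint32_t)(__rdtsc() >> 8) & 0xFFFFF;  /* ~100ns ticks, ~100ms range */
    b->events[b->next++ & (EVENTS_PER_CORE - 1)] = (ts20 << 12) | (event_code & 0xFFF);
}

int main(void) {
    /* Log a couple of fake transitions on "core 0" just to exercise the path. */
    trace_event(&bufs[0], EV_SYSCALL_ENTER);
    trace_event(&bufs[0], EV_SYSCALL_EXIT);
    return 0;
}
```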

Now, say the machine is serving 2000 QPS. That’s 20 99%-ile tail events per second and 2 99.9% tail events per second. Since those events are, by definition, unusually long, Dick Sites recommends a window of 30s to 120s to catch those events. If we have 4 bytes per event * 200k events per second * 40 cores, that’s about 32MB/s of data. Writing to disk while we’re logging is hopeless, so you’ll want to store the entire log while tracing, which will be in the range of 1GB to 4GB. That’s probably fine for a typical machine in a datacenter, which will have between 128GB and 256GB of RAM.

My not-so-secret secret hope for this post is that someone will take this idea and implement it. That’s already happened with at least one blog post idea I’ve thrown out there, and this seems at least as valuable.

3. Cobble together what you need out of existing tools

If you don’t have a magical framework that solves all your problems, the tool you want is going to depend on the problem you’re trying to solve.

For figuring out why things are waiting, Brendan Gregg’s writeup on off-CPU flamegraphs is a pretty good start if you don’t have access to internal Google tools. For that matter, his entire site is great if you’re doing any kind of Linux performance analysis. There’s info on Dtrace, ftrace, SystemTap, etc. Most tools you might use are covered, although PMCTrack is missing.

The problem with all of these is that they’re all much higher overhead than the things we’ve looked at today, so they can’t be run in the background to catch and effectively replay any bug that comes along if you operate at scale. Yes, that includes dtrace, which I’m calling out in particular because any time you have one of these discussions, a dtrace troll will come along to say that dtrace has supported that for years. It’s like the common lisp of trace tools, in terms of community trolling.

Anyway, if you’re on Windows, Bruce Dawson’s site seems to be the closest analogue to Brendan Gregg’s site. If that doesn’t have enough detail, there’s always the Windows Internals books.

This is a bit far afield, but for problems where you want an easy way to get CPU performance counters, likwid is nice. It has a much nicer interface than perf stat, lets you easily only get stats for selected functions, etc.

Thanks to Nathan Kurz, Xi Yang, Leah Hanson, John Gossman, Dick Sites, and Hari Angepat for comments/corrections/discussion.

P.S. Xi Yang, one of the authors of SHIM is finishing up his PhD soon and is going to be looking for work. If you want to hire a performance wizard, he has a CV and resume here.


  1. The talk is amazing and I recommend watching the talk instead of reading this post. I’m writing this up because I know if someone told me I should watch a talk instead of reading the summary, I wouldn’t do it. Ok, fine. If you’re like me, maybe you’d consider reading a couple of his papers instead of reading this post. I once heard someone say that it’s impossible to disagree with Dick’s reasoning. You can disagree with his premises, but if you accept his premises and follow his argument, you have to agree with his conclusions. His presentation is impeccable and his logic is implacable. [return]
  2. This oversimplifies things a bit since, if some level of cache is bandwidth limited, spending bandwidth to move data between cores could slow down other operations more than this operation is sped up by not having to wait. But even that’s oversimplified since it doesn’t take into account the extra power it takes to move data from a higher level cache as opposed to accessing the local cache. But that’s also oversimplified, as is everything in this post. Reality is really complicated, and the more detail we want the less effective sampling profilers are. [return]
  3. This sounds like a long time, but if you ask around you’ll hear other versions of this story at every company that creates systems complex beyond human understanding. I know of one chip project at Sun that was delayed for multiple years because they couldn’t track down some persistent bugs. At Microsoft, they famously spent two years tracking down a scrolling smoothness bug on Vista. The bug was hard enough to reproduce that they set up screens in the hallways so that they could casually see when the bug struck their test boxes. One clue was that the bug only struck high-end boxes with video cards, not low-end boxes with integrated graphics, but that clue wasn’t sufficient to find the bug.

    After quite a while, they called the Xbox team in to use their profiling expertise to set up a system that could capture the bug, and once they had the profiler set up it immediately became apparent what the cause was. This was back in the AGP days, where upstream bandwidth was something like 1/10th downstream bandwidth. When memory would fill up, textures would get ejected, and while doing so, the driver would lock the bus and prevent any other traffic from going through. That took long enough that the video card became unresponsive, resulting in janky scrolling.

    It’s really common to hear stories of bugs that can take an unbounded amount of time to debug if the proper tools aren’t available.

    [return]

We only hire the best means we only hire the trendiest

An acquaintance of mine, let’s call him Mike, is looking for work after getting laid off from a contract role at Microsoft, which has happened to a lot of people I know. Like me, Mike has 11 years in industry. Unlike me, he doesn’t know a lot of folks at trendy companies, so I passed his resume around to some engineers I know at companies that are desperately hiring. My engineering friends thought Mike’s resume was fine, but most recruiters rejected him in the resume screening phase.

When I asked why he was getting rejected, the typical response I got was:

  1. Tech experience is in irrelevant tech
  2. “Experience is too random, with payments, mobile, data analytics, and UX.”
  3. Contractors are generally not the strongest technically

This response is something from a recruiter that was relayed to me through an engineer; the engineer was incredulous at the response from the recruiter. Just so we have a name, let’s call this company TrendCo. It’s one of the thousands of companies that claims to have world class engineers, hire only the best, etc. This is one company in particular, but it’s representative of a large class of companies and the responses Mike has gotten.

Anyway, (1) is code for “Mike’s a .NET dev, and we don’t like people with Windows experience”.

I’m familiar with TrendCo’s tech stack, which multiple employees have told me is “a tire fire”. Their core systems top out under 1k QPS, which has caused them to go down under load. Mike has worked on systems that can handle multiple orders of magnitude more load, but his experience is, apparently, irrelevant.

(2) is hard to make sense of. I’ve interviewed at TrendCo and one of the selling points is that it’s a startup where you get to do a lot of different things. TrendCo almost exclusively hires generalists but Mike is, apparently, too general for them.

(3), combined with (1), gets at what TrendCo’s real complaint with Mike is. He’s not their type. TrendCo’s median employee is a recent graduate from one of maybe ten “top” schools with 0-2 years of experience. They have a few experienced hires, but not many, and most of their experienced hires have something trendy on their resume, not a boring old company like Microsoft.

Whether or not you think there’s anything wrong with having a type and rejecting people who aren’t your type, as Thomas Ptacek has observed, if your type is the same type everyone else is competing for, “you are competing for talent with the wealthiest (or most overfunded) tech companies in the market”.

If you look at new grad hiring data, it looks like FB is offering people with zero experience > $100k/yr salary, $100k signing bonus, and $150k in RSUs, for an amortized total comp > $160k/yr, including $240k in the first year. Google’s package has > $100k salary, a variable signing bonus in the $10k range, and $187k in RSUs. That comes in a bit lower than FB, but it’s much higher than most companies that claim to only hire the best are willing to pay for a new grad. Keep in mind that compensation can go much higher for contested candidates, and that compensation for experienced candidates is probably higher than you expect if you’re not a hiring manager who’s seen what competitive offers look like today.

By going after people with the most sought after qualifications, TrendCo has narrowed their options down to either paying out the nose for employees, or offering non-competitive compensation packages. TrendCo has chosen the latter option, which partially explains why they have, proportionally, so few senior devs – the compensation delta increases as you get more senior, and you have to make a really compelling pitch to someone to get them to choose TrendCo when you’re offering $150k/yr less than the competition. And as people get more experience, they’re less likely to believe the part of the pitch that explains how much the stock options are worth.

Just to be clear, I don’t have anything against people with trendy backgrounds. I know a lot of these people who have impeccable interviewing skills and got 5-10 strong offers last time they looked for work. I’ve worked with someone like that: he was just out of school, his total comp package was north of $200k/yr, and he was worth every penny. But think about that for a minute. He had strong offers from six different companies, of which he was going to accept at most one. Including lunch and phone screens, the companies put in an average of eight hours apiece interviewing him. And because they wanted to hire him so much, the companies that were really serious spent an average of another five hours apiece of engineer time trying to convince him to take their offer. Because these companies had, on average, a ⅙ chance of hiring this person, they have to spend at least an expected (8+5) * 6 = 78 hours of engineer time1. People with great backgrounds are, on average, pretty great, but they’re really hard to hire. It’s much easier to hire people who are underrated, especially if you’re not paying market rates.

I’ve seen this hyperfocus on hiring people with trendy backgrounds from both sides of the table, and it’s ridiculous from both sides.

On the referring side of hiring, I tried to get a startup I was at to hire the most interesting and creative programmer I’ve ever met, who was tragically underemployed for years because of his low GPA in college. We declined to hire him and I was told that his low GPA meant that he couldn’t be very smart. Years later, Google took a chance on him and he’s been killing it since then. He actually convinced me to join Google, and at Google, I tried to hire one of the most productive programmers I know, who was promptly rejected by a recruiter for not being technical enough.

On the candidate side of hiring, I’ve experienced both being in demand and being almost unhireable. Because I did my undergrad at Wisconsin, which is one of the 25 schools that claims to be a top 10 cs/engineering school, I had recruiters beating down my door when I graduated. But that’s silly – that I attended Wisconsin wasn’t anything about me; I just happened to grow up in the state of Wisconsin. If I grew up in Utah, I probably would have ended up going to school at Utah. When I’ve compared notes with folks who attended schools like Utah and Boise State, their education is basically the same as mine. Wisconsin’s rank as an engineering school comes from having professors who do great research which is, at best, weakly correlated to effectiveness at actually teaching undergrads. Despite getting the same engineering education you could get at hundreds of other schools, I had a very easy time getting interviews and finding a great job.

I spent 7.5 years in that great job, at Centaur. Centaur has a pretty strong reputation among hardware companies in Austin who’ve been around for a while, and I had an easy time shopping for local jobs at hardware companies. But I don’t know of any software folks who’ve heard of Centaur, and as a result I couldn’t get an interview at most software companies. There were even a couple of cases where I had really strong internal referrals and the recruiters still didn’t want to talk to me, which I found funny and my friends found frustrating.

When I could get interviews, they often went poorly. A typical rejection reason was something like “we process millions of transactions per day here and we really need someone with more relevant experience who can handle these things without ramping up”. And then Google took a chance on me and I was the second person on a project to get serious about deep learning performance, which was a 20%-time project until just before I joined. We built the fastest deep learning system in the world. From what I hear, they’re now on the Nth generation of that project, but even the first generation thing we built has better per-node performance and performance per dollar than any other production system I know of today, years later (excluding follow-ons to that project, of course).

While I was at Google I had recruiters pinging me about job opportunities all the time. And now that I’m at boring old Microsoft, I don’t get nearly as many recruiters reaching out to me. I’ve been considering looking for work2 and I wonder how trendy I’ll be if I do. Experience in irrelevant tech? Check! Random experience? Check! Contractor? Well, no. But two out of three ain’t bad.

My point here isn’t anything about me. It’s that here’s this person3 who has wildly different levels of attractiveness to employers at various times, mostly due to superficial factors that don’t have much to do with actual productivity. This is a really common story among people who end up at Google. If you hired them before they worked at Google, you might have gotten a great deal! But no one (except Google) was willing to take that chance. There’s something to be said for paying more to get a known quantity, but a company like TrendCo that isn’t willing to do that cripples its hiring pipeline by only going after people with trendy resumes.

I don’t mean to pick on startups like TrendCo in particular. Boring old companies have their version of what a trendy background is, too. A friend of mine who’s desperate to hire can’t do anything with some of the resumes I pass his way because his group isn’t allowed to hire anyone without a degree. Another person I know is in a similar situation because his group won’t talk to people who aren’t already employed.

Not only are these decisions non-optimal for companies, they create a path dependence in employment outcomes that causes individual good (or bad) events to follow people around for decades. You can see similar effects in the literature on career earnings in a variety of fields4.

Thomas Ptacek has this great line about how “we interview people whose only prior work experience is ‘Line of Business .NET Developer’, and they end up showing us how to write exploits for elliptic curve partial nonce bias attacks that involve Fourier transforms and BKZ lattice reduction steps that take 6 hours to run.” If you work at a company that doesn’t reject people out of hand for not being trendy, you’ll hear lots of stories like this. Some of the best people I’ve worked with went to schools you’ve never heard of and worked at companies you’ve never heard of until they ended up at Google. Some are still at companies you’ve never heard of.

If you read Zach Holman, you may recall that when he said that he was fired, someone responded with “If an employer has decided to fire you, then you’ve not only failed at your job, you’ve failed as a human being.” A lot of people treat employment status and credentials as measures of the inherent worth of individuals. But a large component of these markers of success, not to mention success itself, is luck.

Solutions?

I can understand why this happens. At an individual level, we’re prone to the fundamental attribution error. At an organizational level, fast growing organizations burn a large fraction of their time on interviews, and the obvious way to cut down on time spent interviewing is to only interview people with “good” qualifications. Unfortunately, that’s counterproductive when you’re chasing after the same tiny pool of people as everyone else.

Here are the beginnings of some ideas. I’m open to better suggestions!

Moneyball

Billy Beane and Paul DePodesta took the Oakland A’s, a baseball franchise with nowhere near the budget of top teams, and created what was arguably the best team in baseball by finding and “hiring” players who were statistically underrated for their price. The thing I find really amazing about this is that they publicly talked about doing this, and then Michael Lewis wrote a book, titled Moneyball, about them doing this. Despite the publicity, it took years for enough competitors to catch on that the A’s strategy stopped giving them a very large edge.

You can see the exact same thing in software hiring. Thomas Ptacek has been talking about how they hired unusually effective people at Matasano for at least half a decade, maybe more. Google bigwigs regularly talk about the hiring data they have and what hasn’t worked. I believe that, years ago, they talked about how focusing on top schools wasn’t effective and didn’t turn up better-performing employees, but that doesn’t stop TrendCo from focusing its hiring efforts on top schools.

Training / mentorship

You see a lot of talk about moneyball, but for some reason people are less excited about… trainingball? Practiceball? Whatever you want to call taking people who aren’t “the best” and teaching them how to be “the best”.

This is another one where it’s easy to see the impact through the lens of sports, because there is so much good performance data. Since it’s basketball season, if we look at college basketball, for example, we can identify a handful of programs that regularly take unremarkable inputs and produce good outputs. And that’s against a field of competitors where every team is expected to coach and train their players.

When it comes to tech companies, most of the competition isn’t even trying. At the median large company, you get a couple days of “orientation”, which is mostly legal mumbo jumbo and paperwork, and the occasional “training”, which is usually a set of videos and a set of multiple-choice questions that are offered up for compliance reasons, not to teach anyone anything. And you’ll be assigned a mentor who, more likely than not, won’t provide any actual mentorship. Startups tend to be even worse! It’s not hard to do better than that.

Considering how much money companies spend on hiring and retaining “the best”, you’d expect them to spend at least a (non-zero) fraction on training. It’s also quite strange that companies don’t focus more on training and mentorship when trying to recruit. Specific things I’ve learned in specific roles have been tremendously valuable to me, but it’s almost always either been a happy accident or something I went out of my way to do. Most companies don’t focus on this stuff. Sure, recruiters will tell you that “you’ll learn so much more here than at Google, which will make you more valuable”, implying that it’s worth the $150k/yr pay cut, but if you ask them what, specifically, they do to make a better learning environment than Google, they never have a good answer.

Process / tools / culture

I’ve worked at two companies that both have effectively infinite resources to spend on tooling. One of them, let’s call them ToolCo, is really serious about tooling and invests heavily in tools. People describe tooling there with phrases like “magical”, “the best I’ve ever seen”, and “I can’t believe this is even possible”. And I can see why. For example, if you want to build a project that’s millions of lines of code, their build system will make that take somewhere between 5s and 20s (assuming you don’t enable LTO or anything else that can’t be parallelized)5. In the course of a regular day at work you’ll use multiple tools that seem magical because they’re so far ahead of what’s available in the outside world.

The other company, let’s call them ProdCo, pays lip service to tooling but doesn’t really value it. People describing ProdCo tools use phrases like “world class bad software”, “I am 2x less productive than I’ve ever been anywhere else”, and “I can’t believe this is even possible”. ProdCo has a paper on a new build system; their claimed numbers for speedup from parallelization/caching, onboarding time, and reliability are at least two orders of magnitude worse than the equivalent at ToolCo. And, in my experience, the actual numbers are worse than the claims in the paper. In the course of a day of work at ProdCo, you’ll use multiple tools that are multiple orders of magnitude worse than the equivalent at ToolCo in multiple dimensions. These kinds of things add up and can easily make a larger difference than “hiring only the best”.

Processes and culture also matter. I once worked on a team that didn’t use version control or have a bug tracker. For every no-brainer item on the Joel test, there are teams out there that make the wrong choice.

Although I’ve only worked on one team that completely failed the Joel test, every team I’ve worked on has had glaring deficiencies that are technically trivial (but sometimes culturally difficult) to fix. When I was at Google, we had really bad communication problems between the two halves of our team that were in different locations. My fix was brain-dead simple: I started typing up meeting notes for all of our local meetings and discussions and taking questions from the remote team about things that surprised them in our notes. That’s something anyone could have done, and it was a huge productivity improvement for the entire team. I’ve literally never found an environment where you can’t massively improve productivity with something that trivial. Sometimes people don’t agree (e.g., it took months to get the non-version-control-using-team to use version control), but that’s a topic for another post.

Programmers are woefully underutilized at most companies. What’s the point of hiring “the best” and then crippling them? You can get better results by hiring undistinguished folks and setting them up for success, and it’s a lot cheaper.

Conclusion

When I started programming, I heard a lot about how programmers are down to earth, not like those elitist folks who have uniforms involving suits and ties. You can even wear t-shirts to work! But if you think programmers aren’t elitist, try wearing a suit and tie to an interview sometime. You’ll have to go above and beyond to prove that you’re not a bad cultural fit. We like to think that we’re different from all those industries that judge people based on appearance, but we do the same thing, only instead of saying that people are a bad fit because they don’t wear ties, we say they’re a bad fit because they do, and instead of saying people aren’t smart enough because they don’t have the right pedigree… wait, that’s exactly the same.

Thanks to Kelley Eskridge, Laura Lindzey, John Hergenroeder, Kamal Marhubi, Julia Evans, Steven McCarthy, Lindsey Kuper, Leah Hanson, Darius Bacon, Pierre-Yves Baccou, Kyle Littler, Jorge Montero, and Mark Dominus for discussion/comments/corrections.


  1. This estimate is conservative. The math only works out to 78 hours if you assume that you never incorrectly reject a trendy candidate and that you don’t have to interview candidates that you “correctly” fail to find good candidates. If you add in the extra time for those, the number becomes a lot larger. And if you’re TrendCo, and you won’t give senior ICs $200k/yr, let alone new grads, you probably need to multiply that number by at least a factor of 10 to account for the reduced probability that someone who’s in high demand is going to take a huge paycut to work for you.

    By the way, if you do some similar math you can see that the “no false positives” thing people talk about is bogus. The only way to reduce the risk of a false positive to zero is to not hire anyone. If you hire anyone, you’re trading off the cost of firing a bad hire vs. the cost of spending engineering hours interviewing.

    [return]
  2. I consider this to generally be a good practice, at least for folks like me who are relatively early in their careers. It’s good to know what your options are, even if you don’t exercise them. When I was at Centaur, I did a round of interviews about once a year and those interviews made it very clear that I was lucky to be at Centaur. I got a lot more responsibility and a wider variety of work than I could have gotten elsewhere, I didn’t have to deal with as much nonsense, and I was pretty well paid. I still did the occasional interview, though, and you should too! If you’re worried about wasting the time of the hiring company, when I was interviewing speculatively, I always made it very clear that I was happy in my job and unlikely to change jobs, and most companies are fine with that and still wanted to go through with interviewing. [return]
  3. It’s really not about me in particular. At the same time I couldn’t get any company to talk to me, a friend of mine who’s a much better programmer than me spent six months looking for work full time. He eventually got a job at Cloudflare, was half of the team that wrote their DNS, and is now one of the world’s experts on DDoS mitigation for companies that don’t have infinite resources. That guy wasn’t even a networking person before he joined Cloudflare. He’s a brilliant generalist who’s created everything from a widely used JavaScript library to one of the coolest toy systems projects I’ve ever seen. He probably could have picked up whatever problem domain you’re struggling with and knocked it out of the park. Oh, and between the blog posts he writes and the talks he gives, he’s one of Cloudflare’s most effective recruiters. [return]
  4. I’m not going to do a literature review because there are just so many studies that link career earnings to external shocks, but I’ll cite a result that I found to be interesting: Lisa Kahn’s 2010 Labour Economics paper.

    There have been a lot of studies showing that, for some particular negative shock (like a recession), graduating into the negative shock reduces lifetime earnings. But most of those studies show that, over time, the effect gets smaller. When Kahn looked at national unemployment as a proxy for the state of the economy, she found the same thing. But when Kahn looked at state-level unemployment, she found that the effect actually compounded over time.

    The overall evidence on what happens in the long run is equivocal. If you dig around, you’ll find studies where earnings normalize after “only” 15 years, causing a large but effectively one-off loss in earnings, and studies where the effect gets worse over time. The results are mostly technically not contradictory because they look at different causes of economic distress when people get their first job, and it’s possible that the differences in results are because the different circumstances don’t generalize. But the “good” result is that it takes 15 years for earnings to normalize after a single bad setback. Even a very optimistic reading of the literature reveals that external events can and do have very large effects on people’s careers. And if you want an estimate of the bound on the “bad” case, check out, for example, the Guiso, Sapienza, and Zingales paper that claims to link the productivity of a city today to whether or not that city had a bishop in the year 1000.

    [return]
  5. During orientation, the back end of the build system was down so I tried building one of the starter tutorials on my local machine. I gave up after an hour when the build was 2% complete. I know someone who tried to build a real, large scale, production codebase on their local machine over a long weekend, and it was nowhere near done when they got back. [return]

Notes on Google's Site Reliability Engineering book

The book starts with a story about a time [Margaret Hamilton](https://en.wikipedia.org/wiki/Margaret_Hamilton_(scientist)) brought her young daughter with her to NASA, back in the days of the Apollo program. During a simulation mission, her daughter caused the mission to crash by pressing some keys that caused a prelaunch program to run during the simulated mission. Hamilton submitted a change request to add error checking code to prevent the error from happening again, but the request was rejected because the error case should never happen.

On the next mission, Apollo 8, that exact error condition occurred and a potentially fatal problem that could have been prevented with a trivial check took NASA’s engineers 9 hours to resolve.

This sounds familiar – I’ve lost track of the number of dev post-mortems that have the same basic structure.

This is an experiment in note-taking for me in two ways. First, I normally take pen and paper notes and then scan them in for posterity. Second, I normally don’t post my notes online, but I’ve been inspired to try this by Jamie Brandon’s notes on books he’s read. My handwritten notes are a series of bullet points, which may not translate well into markdown. One issue is that my markdown renderer doesn’t handle more than one level of nesting, so things will get artificially flattened. There are probably more issues. Let’s find out what they are! In case it’s not obvious, asides from me are in italics.

Chapter 1: Introduction

Everything in this chapter is covered in much more detail later.

Two approaches to hiring people to manage system stability:

Traditional approach: sysadmins

  • Assemble existing components and deploy to produce a service
  • Respond to events and updates as they occur
  • Grow team to absorb increased work as service grows
  • Pros
    • Easy to implement because it’s standard
    • Large talent pool to hire from
    • Lots of available software
  • Cons
    • Manual intervention for change management and event handling causes size of team to scale with load on system
    • Ops is fundamentally at odds with dev, which can cause pathological resistance to changes, which causes a similarly pathological response from devs, who reclassify “launches” as “incremental updates”, “flag flips”, etc.

Google’s approach: SREs

  • Have software engineers do operations
  • Candidates should be able to pass or nearly pass normal dev hiring bar, and may have some additional skills that are rare among devs (e.g., L1 - L3 networking or UNIX system internals).
  • Career progress comparable to dev career track
  • Results
    • SREs would be bored by doing tasks by hand
    • Have the skillset necessary to automate tasks
    • Do the same work as an operations team, but with automation instead of manual labor
  • To avoid manual labor trap that causes team size to scale with service load, Google places a 50% cap on the amount of “ops” work for SREs
    • Upper bound. Actual amount of ops work is expected to be much lower
  • Pros
    • Cheaper to scale
    • Circumvents devs/ops split
  • Cons
    • Hard to hire for
    • May be unorthodox in ways that require management support (e.g., product team may push back against decision to stop releases for the quarter because the error budget is depleted)

I don’t really understand how this is an example of circumventing the dev/ops split. I can see how it’s true in one sense, but the example of stopping all releases because an error budget got hit doesn’t seem fundamentally different from the “sysadmin” example where teams push back against launches. It seems that SREs have more political capital to spend and that, in the specific examples given, the SREs might be more reasonable, but there’s no reason to think that sysadmins can’t be reasonable.

Tenets of SRE

  • SRE team responsible for latency, performance, efficiency, change management, monitoring, emergency response, and capacity planning

Ensuring a durable focus on engineering

  • 50% ops cap means that extra ops work is redirected to product teams on overflow
  • Provides feedback mechanism to product teams as well as keeps load down
  • Target max 2 events per 8-12 hour on-call shift
  • Postmortems for all serious incidents, even if they didn’t trigger a page
  • Blameless postmortems

2 events per shift is the max, but what’s the average? How many on-call events are expected to get sent from the SRE team to the dev team per week?

How do you get from a blameful postmortem culture to a blameless postmortem culture? Now that everyone knows that you should have blameless postmortems, everyone will claim to do them. Sort of like having good testing and deployment practices. I’ve been lucky to be on an on call rotation that’s never gotten paged, but when I talk to folks who joined recently and are on call, they have not so great stories of finger pointing, trash talk, and blame shifting. The fact that everyone knows you’re supposed to be blameless seems to make it harder to call out blamefulness, not easier.

Move fast without breaking SLO

  • Error budget. 100% is the wrong reliability target for basically everything
  • Going from 5 9s to 100% reliability isn’t noticeable to most users and requires tremendous effort
  • Set a goal that acknowledges the trade-off and leaves an error budget
  • Error budget can be spent on anything: launching features, etc.
  • Error budget allows for discussion about how phased rollouts and 1% experiments can maintain tolerable levels of errors
  • Goal of SRE team isn’t “zero outages” – SRE and product devs are incentive aligned to spend the error budget to get maximum feature velocity

It’s not explicitly stated, but for teams that need to “move fast”, consistently coming in way under the error budget could be taken as a sign that the team is spending too much effort on reliability.

I like this idea a lot, but when I discussed this with Jessica Kerr, she pushed back on this idea because maybe you’re just under your error budget because you got lucky and a single really bad event can wipe out your error budget for the next decade. Followup question: how can you be confident enough in your risk model that you can purposefully consume error budget to move faster without worrying that a downstream (in time) bad event will put you overbudget? Nat Welch (a former Google SRE) responded to this by saying that you can build confidence through simulated disasters and other testing.
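To make the bookkeeping concrete, here’s a minimal sketch of the error-budget arithmetic the chapter describes. The SLO target and request counts below are hypothetical, not anything Google-specific.

```python
# Minimal sketch of error-budget bookkeeping.  The SLO target and the
# request counts are made-up illustrative numbers.

def error_budget(slo_target: float, total_requests: int) -> int:
    """Failed requests allowed this period under the SLO."""
    return int((1 - slo_target) * total_requests)

def remaining_budget(slo_target: float, total_requests: int,
                     failed_requests: int) -> int:
    return error_budget(slo_target, total_requests) - failed_requests

if __name__ == "__main__":
    slo = 0.999            # three nines for the quarter (hypothetical)
    total = 2_500_000_000  # requests served this quarter
    failed = 1_800_000     # requests that violated the SLO

    budget = error_budget(slo, total)
    left = remaining_budget(slo, total, failed)
    print(f"budget={budget:,} consumed={failed:,} remaining={left:,}")
    # A team far under budget might choose to ship riskier changes;
    # a team over budget would slow or stop releases.
```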

Monitoring

  • Monitoring should never require a human to interpret any part of the alerting domain
  • Three valid kinds of monitoring output
    • Alerts: human needs to take action immediately
    • Tickets: human needs to take action eventually
    • Logging: no action needed
    • Note that, for example, graphs are a type of log

Emergency Response

  • Reliability is a function of MTTF (mean-time-to-failure) and MTTR (mean-time-to-recovery)
  • For evaluating responses, we care about MTTR
  • Humans add latency
  • Systems that don’t require humans to respond will have higher availability due to lower MTTR
  • Having a “playbook” produces 3x lower MTTR
    • Having hero generalists who can respond to everything works, but having playbooks works better

I personally agree, but boy do we like our on-call heroes. I wonder how we can foster a culture of documentation.
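As a back-of-the-envelope illustration of the MTTF/MTTR point, here’s a tiny calculation. The MTTF and MTTR figures are invented; the 3x factor is the one the chapter cites for playbooks.

```python
# Availability as a function of MTTF and MTTR; the hour figures below
# are invented for illustration.

def availability(mttf_h: float, mttr_h: float) -> float:
    return mttf_h / (mttf_h + mttr_h)

mttf = 700.0            # hours between failures (hypothetical)
mttr_no_playbook = 3.0  # hours to recover when responders improvise
mttr_playbook = 1.0     # ~3x lower MTTR with a playbook, per the chapter

print(f"without playbook: {availability(mttf, mttr_no_playbook):.5f}")
print(f"with playbook:    {availability(mttf, mttr_playbook):.5f}")
```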

Change management

  • 70% of outages due to changes in a live system. Mitigation:
    • Implement progressive rollouts
    • Monitoring
    • Rollback
  • Remove humans from the loop, avoid standard human problems on repetitive tasks

Demand forecasting and capacity planning

  • Straightforward, but a surprising number of teams/services don’t do it

Provisioning

  • Adding capacity riskier than load shifting, since it often involves spinning up new instances/locations, making significant changes to existing systems (config files, load balancers, etc.)
  • Expensive enough that it should be done only when necessary; must be done quickly
    • If you don’t know what you actually need and overprovision, that costs money

Efficiency and performance

  • Load slows down systems
  • SREs provision to meet capacity target with a specific response time goal
  • Efficiency == money

Chapter 2: The production environment at Google, from the viewpoint of an SRE

No notes on this chapter because I’m already pretty familiar with it. TODO: maybe go back and read this chapter in more detail.

Chapter 3: Embracing risk

  • Ex: if a user is on a smartphone with 99% reliability, they can’t tell the difference between 99.99% and 99.999% reliability

Managing risk

  • Reliability isn’t linear in cost. It can easily cost 100x more to get one additional increment of reliability
    • Cost associated with redundant equipment
    • Cost of building out features for reliability as opposed to “normal” features
    • Goal: make systems reliable enough, but not too reliable!

Measuring service risk

  • Standard practice: identify metric to represent property of system to optimize
  • Possible metric = uptime / (uptime + downtime)
    • Problematic for a globally distributed service. What does uptime really mean?
  • Aggregate availability = successful requests / total requests
    • Obv, not all requests are equal, but aggregate availability is an ok first order approximation
  • Usually set quarterly targets
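A tiny sketch of the two availability definitions above, with made-up numbers for one quarter:

```python
# Uptime-based vs. request-based ("aggregate") availability.
# All numbers are invented for illustration.

uptime_s, downtime_s = 7_770_000, 6_000          # ~90-day quarter
uptime_availability = uptime_s / (uptime_s + downtime_s)

successful, total = 2_499_000_000, 2_500_000_000  # requests this quarter
aggregate_availability = successful / total

print(f"uptime-based:  {uptime_availability:.5f}")
print(f"request-based: {aggregate_availability:.5f}")
# For a globally distributed service that is never entirely "down",
# the request-based number is the one that's actually measurable.
```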

Risk tolerance of services

  • Usually not objectively obvious
  • SREs work with product owners to translate business objectives into explicit objectives

Identifying risk tolerance of consumer services

TODO: maybe read this in detail on second pass

Identifying risk tolerance of infrastructure services

Target availability
  • Running ex: Bigtable
    • Some consumer services serve data directly from Bigtable – need low latency and high reliability
    • Some teams use Bigtable as a backing store for offline analysis – care more about throughput than reliability
  • Too expensive to meet all needs generically
    • Ex: Bigtable instance
    • Low-latency Bigtable user wants low queue depth
    • Throughput oriented Bigtable user wants moderate to high queue depth
    • Success and failure are diametrically opposed in these two cases!
Cost
  • Partition infra and offer different levels of service
  • In addition to obv. benefits, allows service to externalize the cost of providing different levels of service (e.g., expect latency oriented service to be more expensive than throughput oriented service)

Motivation for error budgets

No notes on this because I already believe all of this. Maybe go back and re-read this if involved in debate about this.

Chapter 4: Service level objectives

Note: skipping notes on terminology section.

  • Ex: Chubby planned outages
    • Google found that Chubby was consistently over its SLO, and that global Chubby outages would cause unusually bad outages at Google
    • Chubby was so reliable that teams were incorrectly assuming that it would never be down and failing to design systems that account for failures in Chubby
    • Solution: take Chubby down globally when it’s too far above its SLO for a quarter to “show” teams that Chubby can go down

What do you and your users care about?

  • Too many indicators: hard to pay attention
  • Too few indicators: might ignore important behavior
  • Different classes of services should have different indicators
    • User-facing: availability, latency, throughput
    • Storage: latency, availability, durability
    • Big data: throughput, end-to-end latency
  • All systems care about correctness

Collecting indicators

  • Can often do naturally from server, but client-side metrics sometimes needed.

Aggregation

  • Use distributions and not averages
  • User studies show that people usually prefer slower average with better tail latency
  • Standardize on common defs, e.g., average over 1 minute, average over tasks in cluster, etc.
    • Can have exceptions, but having reasonable defaults makes things easier
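Here’s a small sketch of why distributions beat averages, using a synthetic latency sample with a heavy tail. The numbers are invented; the point is that the mean hides exactly the behavior the tail percentiles expose.

```python
# Synthetic latency sample: mostly-fast requests with a slow tail.
import random
import statistics

random.seed(0)
latencies_ms = [random.gauss(50, 10) if random.random() < 0.98
                else random.gauss(2000, 300)
                for _ in range(100_000)]

def percentile(data, p):
    """Nearest-rank percentile; fine for a sketch."""
    data = sorted(data)
    k = max(0, min(len(data) - 1, int(round(p / 100 * len(data))) - 1))
    return data[k]

print(f"mean = {statistics.mean(latencies_ms):7.1f} ms")
print(f"p50  = {percentile(latencies_ms, 50):7.1f} ms")
print(f"p99  = {percentile(latencies_ms, 99):7.1f} ms")
# The mean looks comfortable (well under 100ms) while the p99 that
# users actually notice is around two seconds -- exactly the
# information an average throws away.
```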

Choosing targets

  • Don’t pick target based on current performance
    • Current performance may require heroic effort
  • Keep it simple
  • Avoid absolutes
    • Unreasonable to talk about “infinite” scale or “always” available
  • Minimize number of SLOs
  • Perfection can wait
    • Can always redefine SLOs over time
  • SLOs set expectations
    • Keep a safety margin (internal SLOs can be defined more loosely than external SLOs)
  • Don’t overachieve
    • See Chubby example, above
    • Another example is making sure that the system isn’t too fast under light loads

Chapter 5: Eliminating toil

Carla Geisser: “If a human operator needs to touch your system during normal operations, you have a bug. The definition of normal changes as your systems grow.”

  • Def: Toil
    • Not just “work I don’t want to do”
    • Manual
    • Repetitive
    • Automatable
    • Tactical
    • No enduring value
    • O(n) with service growth
  • In surveys, find 33% toil on average
    • Numbers can be as low as 0% and as high as 80%
    • Toil > 50% is a sign that the manager should spread toil load more evenly
  • Is toil always bad?
    • Predictable and repetitive tasks can be calming
    • Can produce a sense of accomplishment, can be low-risk / low-stress activities

Section on why toil is bad. Skipping notetaking for that section.

Chapter 6: Monitoring distributed systems

  • Why monitor?
    • Analyze long-term trends
    • Compare over time or do experiments
    • Alerting
    • Building dashboards
    • Debugging

As Alex Clemmer is wont to say, our problem isn’t that we move too slowly, it’s that we build the wrong thing. I wonder how we could get from where we are today to having enough instrumentation to be able to make informed decisions when building new systems.

Setting reasonable expectations

  • Monitoring is non-trivial
  • 10-12 person SRE team typically has 1-2 people building and maintaining monitoring
  • Number has decreased over time due to improvements in tooling/libs/centralized monitoring infra
  • General trend towards simpler/faster monitoring systems, with better tools for post hoc analysis
  • Avoid “magic” systems
  • Limited success with complex dependency hierarchies (e.g., “if DB slow, alert for DB, otherwise alert for website”).
    • Used mostly (only?) for very stable parts of system
  • Rules that generate alerts for humans should be simple to understand and represent a clear failure

Avoiding magic includes avoiding ML?

  • Lots of white-box monitoring
  • Some black-box monitoring for critical stuff
  • Four golden signals
    • Latency
    • Traffic
    • Errors
    • Saturation

Interesting examples from Bigtable and Gmail from chapter not transcribed. A lot of information on the importance of keeping alerts simple also not transcribed.

The long run

  • There’s often a tension between long-run and short-run availability
  • Can sometimes fix unreliable systems through heroic effort, but that’s a burnout risk and also a failure risk
  • Taking a controlled hit in short-term reliability is usually the better trade

Chapter 7: Evolution of automation at Google

  • “Automation is a force multiplier, not a panacea”
  • Value of automation
    • Consistency
    • Extensibility
    • MTTR
    • Faster non-repair actions
    • Time savings

Multiple interesting case studies and explanations skipped in notes.

Chapter 8: Release engineering

  • This is a specific job function at Google

Release engineer role

  • Release engineers work with SWEs and SREs to define how software is released
    • Allows dev teams to focus on dev work
  • Define best practices
    • Compiler flags, formats for build ID tags, etc.
  • Releases automated
  • Models vary between teams
    • Could be “push on green” and deploy every build
    • Could be hourly builds and deploys
    • etc.
  • Hermetic builds
    • Building same rev number should always give identical results
    • Self-contained – this includes versioning everything down to the compiler used
    • Can cherry-pick fixes against an old rev to fix production software
  • Virtually all changes require code review
  • Branching
    • All code in main branch
    • Releases are branched off
    • Fixes can go from master to branch
    • Branches never merged back
  • Testing
    • CI
    • Release process creates an audit trail that runs tests and shows that tests passed
  • Config management
  • Many possible schemes (all involve storing config in source control and having strict config review)
  • Use mainline for config – config maintained at head and applied immediately
    • Originally used for Borg (and pre-Borg systems)
    • Binary releases and config changes decoupled!
  • Include config files and binaries in same package
    • Simple
    • Tightly couples binary and config – ok for projects with few config files or where few configs change
  • Package config into “configuration packages”
    • Same hermetic principle as for code
  • Release engineering shouldn’t be an afterthought!
    • Budget resources at beginning of dev cycle

Chapter 9: Simplicity

  • Stability vs. agility
    • Can make things stable by freezing – need to balance the two
    • Reliable systems can increase agility
    • Reliable rollouts make it easier to link changes to bugs
  • Virtue of boring!
  • Essential vs. accidental complexity
    • SREs should push back when accidental complexity is introduced
  • Code is a liability
    • Remove dead code or other bloat
  • Minimal APIs
    • Smaller APIs easier to test, more reliable
  • Modularity
    • API versioning
    • Same as code, where you’d avoid misc/util classes
  • Releases
    • Small releases easier to measure
    • Can’t tell what happened if we released 100 changes together

Chapter 10: Alerting from time-series data

Borgmon

  • Similar-ish to Prometheus
  • Common data format for logging
  • Data used for both dashboards and alerts
  • Formalized a legacy data format, “varz”, which allowed metrics to be viewed via HTTP
  • Adding a metric only requires a single declaration in code
    • low user-cost to add new metric
  • Borgmon fetches /varz from each target periodically
    • Also includes synthetic data like health check, if name was resolved, etc.,
  • Time series arena
    • Data stored in-memory, with checkpointing to disk
    • Fixed sized allocation
    • GC expires oldest entries when full
    • conceptually a 2-d array with time on one axis and items on the other axis
    • 24 bytes for a data point -> 1M unique time series for 12 hours at 1-minute intervals = 17 GB
  • Borgmon rules
    • Algebraic expressions
    • Compute time-series from other time-series
    • Rules evaluated in parallel on a threadpool
  • Counters vs. gauges
    • Def: counters are non-decreasing
    • Def: gauges can take any value
    • Counters preferred to gauges because gauges can lose information depending on sampling interval (see the sketch after this list)
  • Alerting
    • Borgmon rules can trigger alerts
    • Have minimum duration to prevent “flapping”
    • Usually set to two duration cycles so that missed collections don’t trigger an alert
  • Scaling
    • Borgmon can take time-series data from other Borgmon (uses binary streaming protocol instead of the text-based varz protocol)
    • Can have multiple tiers of filters
  • Prober
    • Black-box monitoring that monitors what the user sees
    • Can be queried with varz or directly send alerts to Alertmanager
  • Configuration
    • Separation between definition of rules and targets being monitored
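The sketch below (referenced from the counters-vs-gauges item above) shows the kind of information loss the notes describe: a short burst between scrapes vanishes from a sampled gauge, but still shows up in a rate computed from a counter. This isn’t Borgmon code; the traffic pattern and the 60-second scrape interval are made up.

```python
# Counter vs. gauge under a burst shorter than the scrape interval.
SCRAPE_INTERVAL_S = 60

per_second_qps = [10] * 300            # five quiet minutes...
for s in range(149, 159):              # ...with a 10-second burst mid-minute-3
    per_second_qps[s] = 1000

counter_samples, gauge_samples, running_total = [], [], 0
for second, qps in enumerate(per_second_qps, start=1):
    running_total += qps
    if second % SCRAPE_INTERVAL_S == 0:
        counter_samples.append(running_total)  # monotonically increasing
        gauge_samples.append(qps)              # instantaneous value only

# A rule in the spirit of rate(): delta of the counter over the interval.
rates = [(b - a) / SCRAPE_INTERVAL_S
         for a, b in zip(counter_samples, counter_samples[1:])]

print("gauge samples (QPS at scrape time):", gauge_samples)
print("rate from counter (avg QPS per interval):",
      [round(r, 1) for r in rates])
# The counter-derived rate shows the burst no matter when scrapes land;
# the gauge only shows it if a scrape happens to hit it.
```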

Chapter 11: Being on-call

  • Typical response time
    • 5 min for user-facing or other time-critical tasks
    • 30 min for less time-sensitive stuff
  • Response times linked to SLOs
    • Ex: 99.99% for a quarter is 13 minutes of downtime; clearly can’t have response time above 13 minutes
    • Services with looser SLOs can have response times in the 10s of minutes (or more?)
  • Primary vs secondary on-call
    • Work distribution varies by team
    • In some, secondary can be backup for primary
    • In others, secondary handles non-urgent / non-paging events, primary handles pages
  • Balanced on-call
    • Def: quantity: percent of time on-call
    • Def: quality: number of incidents that occur while on call

This is great. We should do this. People sometimes get really rough on-call rotations a few times in a row and considering the infrequency of on-call rotations there’s no reason to expect that this should randomly balance out over the course of a year or two.

  • Balance in quantity
    • >= 50% of SRE time goes into engineering
    • Of remainder, no more than 25% spent on-call
  • Prefer multi-site teams
    • Night shifts are bad for health, multi-site teams allow elimination of night shifts
  • Balance in quality
    • On average, dealing with an incident (incl root-cause analysis, remediation, writing postmortem, fixing bug, etc.) takes 6 hours.
    • => shouldn’t have more than 2 incidents in a 12-hour on-call shift
    • To stay within upper bound, want very flat distribution of pages, with median value of 0
  • Compensation – extra pay for being on-call (time-off or cash)

Chapter 12: Effective troubleshooting

No notes for this chapter.

Chapter 13: Emergency response

  • Test-induced emergency
  • Ex: want to flush out hidden dependencies on a distributed MySQL database
    • Plan: block access to 1/100 of DBs
    • Response: dependent services report that they’re unable to access key systems
    • SRE response: SRE aborts exercise, tries to roll back permissions change
    • Rollback attempt fails
    • Attempt to restore access to replicas works
    • Normal operation restored in 1 hour
    • What went well: dependent teams escalated issues immediately, were able to restore access
    • What we learned: had an insufficient understanding of the system and its interaction with other systems, failed to follow incident response that would have informed customers of outage, hadn’t tested rollback procedures in test env
  • Change-induced emergency
    • Changes can cause failures!
  • Ex: config change to abuse prevention infra pushed on Friday triggered crash-loop bug
    • Almost all externally facing systems depend on this, become unavailable
    • Many internal systems also have dependency and become unavailable
    • Alerts start firing within seconds
    • Within 5 minutes of config push, engineer who pushed change rolled back change and services started recovering
    • What went well: monitoring fired immediately, incident management worked well, out-of-band communications systems kept people up to date even though many systems were down, luck (engineer who pushed change was following real-time comms channels, which isn’t part of the release procedure)
    • What we learned: push to canary didn’t trigger same issue because it didn’t hit a specific config keyword combination; push was considered low-risk and went through less stringent canary process, alerting was too noisy during outage
  • Process-induced emergency

No notes on process-induced example.

Chapter 14: Managing incidents

This is an area where we seem to actually be pretty good. No notes on this chapter.

Chapter 15: Postmortem culture: learning from failure

I’m in strong agreement with most of this chapter. No notes.

Chapter 16: Tracking outages

  • Escalator: centralized system that tracks ACKs to alerts, notifies other people if necessary, etc.
  • Outalator: gives time-interleaved view of notifications for multiple queues
    • Also saves related email and allows marking some messages as “important”, can collapse non-important messages, etc.

Our version of Escalator seems fine. We could really use something like Outalator, though.

Chapter 17: Testing for reliability

Preaching to the choir. No notes on this section. We could really do a lot better here, though.

Chapter 18: Software engineering in SRE

  • Ex: Auxon, capacity planning automation tool
  • Background: traditional capacity planning cycle
    • 1) collect demand forecasts (quarters to years in advance)
    • 2) Plan allocations
    • 3) Review plan
    • 4) Deploy and config resources
  • Traditional approach cons
    • Many things can affect plan: increase in efficiency, increase in adoption rate, cluster delivery date slips, etc.
    • Even small changes require rechecking allocation plan
    • Large changes may require total rewrite of plan
    • Labor intensive and error prone
  • Google solution: intent-based capacity planning
    • Specify requirements, not implementation
    • Encode requirements and autogenerate a capacity plan
    • In addition to saving labor, solvers can do better than human generated solutions => cost savings
  • Ladder of examples of increasingly intent based planning
    • 1) Want 50 cores in clusters X, Y, and Z – why those resources in those clusters?
    • 2) Want 50-core footprint in any 3 clusters in region – why that many resources and why 3?
    • 3) Want to meet demand with N+2 redundancy – why N+2?
    • 4) Want 5 9s of reliability. Could find, for example, that N+2 isn’t sufficient
  • Found that greatest gains are from going to (3)
    • Some sophisticated services may go for (4)
  • Putting constraints into tools allows tradeoffs to be consistent across fleet
    • As opposed to making individual ad hoc decisions
  • Auxon inputs
    • Requirements (e.g., “service must be N+2 per continent”, “frontend servers no more than 50ms away from backend servers”)
    • Dependencies
    • Budget priorities
    • Performance data (how a service scales)
    • Demand forecast data (note that services like Colossus have derived forecasts from dependent services)
    • Resource supply & pricing
  • Inputs go into solver (mixed-integer or linear programming solver)
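To make “intent-based” concrete, here’s a toy linear program in the spirit of the ladder above: meet demand even with any two clusters down, at minimum cost. It assumes a reasonably recent scipy is available; the clusters, per-core costs, and demand figure are invented, and this is nothing like Auxon itself.

```python
# Toy intent-based capacity plan: "meet demand with N+2 redundancy"
# expressed as linear constraints.  All inputs are invented.
from itertools import combinations
from scipy.optimize import linprog

clusters = ["us-east", "us-west", "eu", "asia", "sa"]
cost_per_core = [1.0, 1.1, 1.3, 1.2, 0.9]   # relative $ per core
demand = 900                                 # cores needed to serve load

# For every pair of clusters that could be down, the remaining capacity
# must still cover demand:  sum(rest) >= demand  ->  -sum(rest) <= -demand
n = len(clusters)
A_ub, b_ub = [], []
for down in combinations(range(n), 2):
    A_ub.append([0.0 if i in down else -1.0 for i in range(n)])
    b_ub.append(-demand)

res = linprog(c=cost_per_core, A_ub=A_ub, b_ub=b_ub,
              bounds=[(0, None)] * n, method="highs")

for name, cores in zip(clusters, res.x):
    print(f"{name:8s} {cores:7.1f} cores")
print(f"total cost: {res.fun:.1f}")
```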

No notes on why SRE software, how to spin up a group, etc. TODO: re-read back half of this chapter and take notes if it’s ever directly relevant for me.

Chapter 19: Load balancing at the frontend

No notes on this section. Seems pretty similar to what we have in terms of high-level goals, and the chapter doesn’t go into low-level details. It’s notable that they do [redacted] differently from us, though. For more info on lower-level details, there’s the Maglev paper.

Chapter 20: Load balancing in the datacenter

  • Flow control
  • Need to avoid unhealthy tasks
  • Naive flow control for unhealthy tasks
    • Track number of requests to a backend
    • Treat backend as unhealthy when threshold is reached
    • Cons: generally terrible
  • Health-based flow control
    • Backend task can be in one of three states: {healthy, refusing connections, lame duck}
    • Lame duck state can still take connections, but sends backpressure request to all clients
    • Lame duck state simplifies clean shutdown
  • Def: subsetting: limiting pool of backend tasks that a client task can interact with
    • Clients in RPC system maintain pool of connections to backends
    • Using pool reduces latency compared to doing setup/teardown when needed
    • Inactive connections are relatively cheap, but not free, even in “inactive” mode (reduced health checks, UDP instead of TCP, etc.)
  • Choosing the correct subset
    • Typ: 20-100, chosen based on workload
  • Subset selection: random
    • Bad utilization
  • Subset selection: round robin
    • Order is permuted; each round has its own permutation
  • Load balancing
    • Subset selection is for connection balancing, but we still need to balance load
  • Load balancing: round robin
    • In practice, observe 2x difference between most loaded and least loaded
    • In practice, most expensive request can be 1000x more expensive than cheapest request
    • In addition, there’s random unpredictable variation in requests
  • Load balancing: least-loaded round robin
    • Exactly what it sounds like: round-robin among least loaded backends
    • Load appears to be measured in terms of connection count; may not always be the best metric
    • This is per client, not globally, so it’s possible to send requests to a backend with many requests from other clients
    • In practice, for large services, find that most-loaded task uses twice as much CPU as least-loaded; similar to normal round robin
  • Load balancing: weighted round robin
    • Same as above, but weight with other factors
    • In practice, much better load distribution than least-loaded round robin
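Here’s a hedged sketch of the subset-selection idea from the list above: backends are shuffled once per “round” and each client in that round takes a consecutive slice, so connections spread evenly. It’s only an illustration of the scheme the notes describe, not Google’s actual algorithm.

```python
# Round-based subsetting: same permutation within a round, disjoint
# slices per client.  Backend names and sizes are invented.
import random

def pick_subset(backends, client_id, subset_size):
    """Return the subset of backends this client should connect to."""
    clients_per_round = len(backends) // subset_size
    round_id, slot = divmod(client_id, clients_per_round)

    shuffled = list(backends)
    random.Random(round_id).shuffle(shuffled)   # one permutation per round
    start = slot * subset_size
    return shuffled[start:start + subset_size]

backends = [f"backend-{i}" for i in range(300)]
# With subset_size=30, each round serves 10 clients, each connected to a
# disjoint 30-backend slice, so connection counts stay balanced.
for client in range(3):
    print(client, pick_subset(backends, client, subset_size=30)[:3], "...")
```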

I wonder what Heroku meant when they responded to Rap Genius by saying “after extensive research and experimentation, we have yet to find either a theoretical model or a practical implementation that beats the simplicity and robustness of random routing to web backends that can support multiple concurrent connections”.

Chapter 21: Handling overload

  • Even with “good” load balancing, systems will become overloaded
  • Typical strategy is to serve degraded responses, but under very high load that may not be possible
  • Modeling capacity as QPS or as a function of requests (e.g., how many keys the requests read) is failure prone
    • These generally change slowly, but can change rapidly (e.g., because of a single checkin)
  • Better solution: measure directly available resources
  • CPU utilization is usually a good signal for provisioning
    • With GC, memory pressure turns into CPU utilization
    • With other systems, can provision other resources such that CPU is likely to be limiting factor
    • In cases where over-provisioning CPU is too expensive, take other resources into account

How much does it cost to generally over-provision CPU like that?

  • Client-side throttling
    • Backends start rejecting requests when customer hits quota
    • Requests still use resources, even when rejected – without throttling, backends can spend most of their resources on rejecting requests
  • Criticality
    • Seems to be priority but with a different name?
    • First-class notion in RPC system
    • Client-side throttling keeps separate stats for each level of criticality
    • By default, criticality is propagated through subsequent RPCs
  • Handling overloaded errors
    • Shed load to other DCs if DC is overloaded
    • Shed load to other backends if DC is ok but some backends are overloaded
  • Clients retry when they get an overloaded response
    • Per-request retry budget (3)
    • Per-client retry budget (10%)
    • Failed retries from client cause “overloaded; don’t retry” response to be returned upstream

Having a “don’t retry” response is “obvious”, but relatively rare in practice. A lot of real systems have a problem with failed retries causing more retries up the stack. This is especially true when crossing a hardware/software boundary (e.g., filesystem read causes many retries on DVD/SSD/spinning disk, fails, and then gets retried at the filesystem level), but seems to be generally true in pure software too.
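A sketch of the retry behavior described in the list above: a per-request cap of 3 attempts, a per-client retry budget of roughly 10%, and an “overloaded; don’t retry” signal returned upstream once retries are exhausted. The numbers come from the notes; the code structure is invented.

```python
# Per-client retry budget plus "don't retry" propagation (illustrative).

class RetryBudget:
    def __init__(self, ratio=0.1):           # ~10% of requests may be retried
        self.ratio = ratio
        self.requests = 0
        self.retries = 0

    def record_request(self):
        self.requests += 1

    def record_retry(self):
        self.retries += 1

    def can_retry(self):
        return self.retries < self.ratio * max(self.requests, 1)

OVERLOADED = "overloaded"
OVERLOADED_DONT_RETRY = "overloaded; don't retry"

def call_with_retries(send, budget, max_attempts=3):
    budget.record_request()
    for attempt in range(max_attempts):
        status = send()
        if status not in (OVERLOADED, OVERLOADED_DONT_RETRY):
            return status
        last_attempt = attempt == max_attempts - 1
        if status == OVERLOADED_DONT_RETRY or last_attempt or not budget.can_retry():
            break
        budget.record_retry()                 # about to send another attempt
    # Exhausted local retries: tell our caller not to pile retries on top.
    return OVERLOADED_DONT_RETRY

budget = RetryBudget()
# send() would be the real RPC; here a stub that is always overloaded.
print(call_with_retries(lambda: OVERLOADED, budget))  # "overloaded; don't retry"
```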

Chapter 22: Addressing cascading failures

  • Typical failure scenarios?
  • Server overload
  • Ex: have two servers
    • One gets overloaded, failing
    • Other one now gets all traffic and also fails
  • Resource exhaustion
    • CPU/memory/threads/file descriptors/etc.
  • Ex: dependencies among resources
    • 1) Java frontend has poorly tuned GC params
    • 2) Frontend runs out of CPU due to GC
    • 3) CPU exhaustion slows down requests
    • 4) Increased queue depth uses more RAM
    • 5) Fixed memory allocation for entire frontend means that less memory is available for caching
    • 6) Lower hit rate
    • 7) More requests into backend
    • 8) Backend runs out of CPU or threads
    • 9) Health checks fail, starting cascading failure
    • Difficult to determine cause during outage
  • Note: policies that avoid servers that serve errors can make things worse
    • fewer backends available, which get too many requests, which then become unavailable
  • Preventing server overload
    • Load test! Must have realistic environment
    • Serve degraded results
    • Fail cheaply and early when overloaded
    • Have higher-level systems reject requests (at reverse proxy, load balancer, and on task level)
    • Perform capacity planning
  • Queue management
    • Queues do nothing in steady state
    • Queued reqs consume memory and increase latency
    • If traffic is steady-ish, better to keep small queue size (say, 50% or less of thread pool size)
    • Ex: Gmail uses queueless servers with failover when threads are full
    • For bursty workloads, queue size should be function of #threads, time per req, size/freq of bursts
    • See also, adaptive LIFO and CoDel
  • Graceful degradation
    • Note that it’s important to test graceful degradation path, maybe by running a small set of servers near overload regularly, since this path is rarely exercised under normal circumstances
    • Best to keep simple and easy to understand
  • Retries
    • Always use randomized exponential backoff
    • See previous chapter on only retrying at a single level
    • Consider having a server-wide retry budget
  • Deadlines
    • Don’t do work where deadline has been missed (common theme for cascading failure)
    • At each stage, check that deadline hasn’t been hit
    • Deadlines should be propagated (e.g., even through RPCs)
  • Bimodal latency
    • Ex: problem with long deadline
    • Say frontend has 10 servers, 100 threads each (1k threads of total cap)
    • Normal operation: 1k QPS, reqs take 100ms => 100 worker threads occupied (1k QPS * .1s)
    • Say 5% of operations don’t complete and there’s a 100s deadline
    • That consumes 5k threads (50 QPS * 100s)
    • Frontend oversubscribed by 5x. Success rate = 1k / (5k + 95) = 19.6% => 80.4% error rate

Using deadlines instead of timeouts is great. We should really be more systematic about this.

Not allowing systems to fill up with pointless zombie requests by setting reasonable deadlines is “obvious”, but a lot of real systems seem to have arbitrary timeouts at nice round human numbers (30s, 60s, 100s, etc.) instead of deadlines that are assigned with load/cascading failures in mind.
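Here’s a minimal sketch of the two items above, deadline propagation and randomized exponential backoff, combined in one helper. The API is invented; a real RPC system would carry the absolute deadline in request metadata.

```python
# Deadline propagation + jittered exponential backoff (illustrative only).
import random
import time

class DeadlineExceeded(Exception):
    pass

def remaining(deadline):
    return deadline - time.monotonic()

def check_deadline(deadline):
    """Each stage checks before doing work instead of relying on timeouts."""
    if remaining(deadline) <= 0:
        raise DeadlineExceeded()

def call_with_backoff(send, deadline, base_delay=0.05, max_delay=2.0):
    delay = base_delay
    while True:
        check_deadline(deadline)
        try:
            # The callee gets the same absolute deadline, not a fresh timeout.
            return send(deadline)
        except (IOError, TimeoutError):
            # Randomized exponential backoff, capped by both max_delay and
            # the time left before the deadline.
            sleep_for = min(random.uniform(0, delay), max_delay,
                            remaining(deadline))
            if sleep_for <= 0:
                raise DeadlineExceeded()
            time.sleep(sleep_for)
            delay = min(delay * 2, max_delay)

# Usage: give every hop the same absolute deadline, e.g.
#   deadline = time.monotonic() + 0.5
#   call_with_backoff(flaky_rpc, deadline)
```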

  • Try to avoid intra-layer communication
    • Simpler, avoids possible cascading failure paths
  • Testing for cascading failures
    • Load test components!
    • Load testing both reveals the breaking point and ferrets out components that will totally fall over under load
    • Make sure to test each component separately
    • Test non-critical backends (e.g., make sure that spelling suggestions for search don’t impede the critical path)
  • Immediate steps to address cascading failures
    • Increase resources
    • Temporarily stop health check failures/deaths
    • Restart servers (only if that would help – e.g., in GC death spiral or deadlock)
    • Drop traffic – drastic, last resort
    • Enter degraded mode – requires having built this into service previously
    • Eliminate batch load
    • Eliminate bad traffic

Chapter 23: Distributed consensus for reliability

  • How do we agree on questions like…
    • Which process is the leader of a group of processes?
    • What is the set of processes in a group?
    • Has a message been successfully committed to a distributed queue?
    • Does a process hold a particular lease?
    • What’s the value in a datastore for a particular key?
  • Ex1: split-brain
    • Service has replicated file servers in different racks
    • Must avoid writing simultaneously to both file servers in a set to avoid data corruption
    • Each pair of file servers has one leader & one follower
    • Servers monitor each other via heartbeats
    • If one server can’t contact the other, it sends a STONITH (shoot the other node in the head)
    • But what happens if the network is slow or packets get dropped?
    • What happens if both servers issue STONITH?

This reminds me of one of my favorite distributed database postmortems. The database is configured as a ring, where each node talks to and replicates data into a “neighborhood” of 5 servers. If some machines in the neighborhood go down, other servers join the neighborhood and data gets replicated appropriately.

Sounds good, but in the case where a server goes bad and decides that no data exists and all of its neighbors are bad, it can return results faster than any of its neighbors, as well as tell its neighbors that they’re all bad. Because the bad server has no data it’s very fast and can report that its neighbors are bad faster than its neighbors can report that it’s bad. Whoops!

  • Ex2: failover requires human intervention
    • A highly sharded DB has a primary for each shard, which replicates to a secondary in another DC
    • External health checks decide if the primary should failover to its secondary
    • If the primary can’t see the secondary, it makes itself unavailable to avoid the problems from “Ex1”
    • This increases operational load
    • Problems are correlated and this is relatively likely to run into problems when people are busy with other issues
    • If there’s a network issue, there’s no reason to think that a human will have a better view into the state of the world than machines in the system
  • Ex3: faulty group-membership algorithms
    • What it sounds like. No notes on this part
  • Impossibility results
    • CAP: you can’t give up partition tolerance in real networks, so the real choice is between C and A
    • FLP: async distributed consensus can’t guarantee progress with an unreliable network

Paxos

  • Sequence of proposals, which may or may not be accepted by the majority of processes
    • Not accepted => fails
    • Sequence number per proposal, must be unique across system
  • Proposal
    • Proposer sends seq number to acceptors
    • Acceptor agrees if it hasn’t seen a higher seq number
    • Proposers can try again with higher seq number
    • If proposer recvs agreement from majority, it commits by sending commit message with value
    • Acceptors must journal to persistent storage when they accept
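To make the proposal flow concrete, here’s a heavily simplified, in-memory sketch of a single-decree round: prepare, then accept on majority agreement. A real acceptor would journal its state to persistent storage before replying, and a real proposer would retry with a higher sequence number on failure.

```python
# Minimal single-decree Paxos sketch (illustrative, not production code).

class Acceptor:
    def __init__(self):
        self.promised_seq = -1     # highest sequence number promised
        self.accepted_seq = -1     # sequence number of accepted value, if any
        self.accepted_value = None

    def prepare(self, seq):
        """Phase 1: agree only if we haven't seen a higher sequence number."""
        if seq > self.promised_seq:
            self.promised_seq = seq
            return True, self.accepted_seq, self.accepted_value
        return False, self.accepted_seq, self.accepted_value

    def accept(self, seq, value):
        """Phase 2: the commit message sent after a majority agreed."""
        if seq >= self.promised_seq:
            self.promised_seq = seq
            self.accepted_seq = seq
            self.accepted_value = value   # journal before replying, in real life
            return True
        return False

def propose(acceptors, seq, value):
    """Proposer: succeed only if a majority agrees in both phases."""
    majority = len(acceptors) // 2 + 1
    promises = [a.prepare(seq) for a in acceptors]
    granted = [p for p in promises if p[0]]
    if len(granted) < majority:
        return False                      # caller retries with a higher seq
    # If some acceptor already accepted a value, we must propose that value.
    prior = max(granted, key=lambda p: p[1])
    chosen = prior[2] if prior[1] >= 0 else value
    acks = sum(a.accept(seq, chosen) for a in acceptors)
    return acks >= majority

acceptors = [Acceptor() for _ in range(5)]
print(propose(acceptors, seq=1, value="leader=replica-3"))  # True
print(propose(acceptors, seq=0, value="leader=replica-1"))  # False: stale seq
```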

Patterns

  • Distributed consensus algorithms are a low-level primitive
  • Reliable replicated state machines
  • Reliable replicated data and config stores
    • Non distributed-consensus-based systems often use timestamps: problematic because clock synchrony can’t be guaranteed
    • See Spanner paper for an example of using distributed consensus
  • Leader election
    • Equivalent to distributed consensus
    • Where the work of the leader can be performed by one process or sharded, the leader election pattern allows writing the distributed system as if it were a simple program
    • Used by, for example, GFS and Colossus
  • Distributed coordination and locking services
    • Barrier used, for example, in MapReduce to make sure that Map is finished before Reduce proceeds
  • Distributed queues and messaging
    • Queues: can tolerate failures from worker nodes, but system needs to ensure that claimed tasks are processed
    • Can use leases instead of removal from queue
    • Using RSM means that system can continue processing even when queue goes down
  • Performance
    • Conventional wisdom that consensus algorithms can’t be used for high-throughput low-latency systems is false
    • Distributed consensus at the core of many Google systems
    • Scale makes this worse for Google than most other companies, but it still works
  • Multi-Paxos
    • Strong leader process: unless a leader has not yet been elected or a failure occurs, only one round trip required to reach consensus
    • Note that another process in the group can propose at any time
    • Can ping pong back and forth and pseudo-livelock
    • Not unique to Multi-Paxos
    • Standard solutions are to elect a proposer process or use rotating proposer
  • Scaling read-heavy workloads
    • Ex: Photon allows reads from any replica
    • Reading from a stale replica requires extra work, but doesn’t produce incorrect results
    • To guarantee reads are up to date, do one of the following:
    • 1) Perform a read-only consensus operation
    • 2) Read data from replica that’s guaranteed to be most-up-to-date (stable leader can provide this guarantee)
    • 3) Use quorum leases
  • Quorum leases
    • Replicas can be granted lease over some (or all) data in the system
  • Fast Paxos
    • Designed to be faster over WAN
    • Each client can send Propose to each member of a group of acceptors directly, instead of through a leader
    • Not necessarily faster than classic Paxos – if RTT to acceptors is long, we’ve traded one message across a slow link plus N in parallel across fast links for N across slow links
  • Stable leaders
    • “Almost all distributed consensus systems that have been designed with performance in mind use either the single stable leader pattern or a system of rotating leadership”

TODO: finish this chapter?

Chapter 24: Distributed cron

TODO: go back and read in more detail, take notes.

Chapter 25: Data processing pipelines

  • Examples of this are MapReduce or Flume
  • Convenient and easy to reason about the happy case, but fragile
    • Initial install is usually ok because worker sizing, chunking, parameters are carefully tuned
    • Over time, load changes, causes problems

Chapter 26: Data integrity

  • Definition not necessarily obvious
    • If an interface bug causes Gmail to fail to display messages, that’s the same as the data being gone from the user’s standpoint
    • 99.99% uptime means 1 hour of downtime per year. Probably ok for most apps
    • 99.99% good bytes in a 2GB file means 200K corrupt. Probably not ok for most apps
  • Backup is non-trivial
    • May have mixture of transactional and non-transactional backup and restore
    • Different versions of business logic might be live at once
    • If services are independently versioned, maybe have many combinations of versions
    • Replicas aren’t sufficient – replicas may sync corruption
  • Study of 19 data recovery efforts at Google
    • Most common user-visible data loss caused by deletion or loss of referential integrity due to software bugs
    • Hardest cases were low-grade corruption discovered weeks to months later

Defense in depth

  • First layer: soft deletion
    • Users should be able to delete their data
    • But that means that users will be able to accidentally delete their data
    • Also, account hijacking, etc.
    • Accidentally deletion can also happen due to bugs
    • Soft deletion delays actual deletion for some period of time
  • Second layer: backups
    • Need to figure out how much data it’s ok to lose during recovery, how long recovery can take, and how far back backups need to go
    • Want backups to go back forever, since corruption can go unnoticed for months (or longer)
    • But changes to code and schema can make recovery of older backups expensive
    • Google usually has 30 to 90 day window, depending on the service
  • Third layer: early detection
    • Out-of-band integrity checks
    • Hard to do this right!
    • Correct changes can cause checkers to fail
    • But loosening checks can cause failures to get missed

No notes on the two interesting case studies covered.

Chapter 27: Reliable product launches at scale

No notes on this chapter in particular. A lot of this material is covered by or at least implied by material in other chapters. Probably worth at least looking at example checklist items and action items before thinking about launch strategy, though. Also see appendix E, launch coordination checklist.

Chapters 28-32: Various chapters on management

No notes on these.

Notes on the notes

I like this book a lot. If you care about building reliable systems, reading through this book and seeing what the teams around you don’t do seems like a good exercise. That being said, the book isn’t perfect. The two big downsides for me stem from the same issue: this is one of those books that’s a collection of chapters by different people. Some of the editors are better than others, so some chapters are clearer than others, and because the chapters seem designed to be readable as standalone chapters, there’s a fair amount of redundancy if you read the book straight through. Depending on how you plan to use the book, that can be a positive, but it’s a negative to me. But even including the downsides, I’d say that this is the most valuable technical book I’ve read in the past year, and I’ve covered probably 20% of the content in this set of notes. If you really like these notes, you’ll probably want to read the full book.

If you’re on the fence about the book, you can preview the first three chapters, plus half of the fourth (along with parts of other chapters) in Google books. If you want to buy a copy, you can get one on Amazon, of course, but the ebook is a lot cheaper through Google books than through Amazon (or at least that was true when I bought it last week).

If you found this set of notes way too dry, maybe try this much more entertaining set of notes on a totally different book. If you found this to only be slightly too dry, maybe try this set of notes on classes of errors commonly seen in postmortems. In any case, I’d appreciate feedback on these notes. Writing up notes is an experiment for me. If people find these useful, I’ll try to write up notes on books I read more often. If not, I might try a different approach to writing up notes or some other kind of post entirely.

Modest list of programming blogs


This is one of those “N technical things every programmer must read” lists, except that “programmer” is way too broad a term and the styles of writing people find helpful for them are too different for any such list to contain a non-zero number of items (if you want the entire list to be helpful to everyone). So here’s a list of some things you might want to read, and why you might (or might not) want to read them.

Alex Clemmer

This post on why making a competitor to Google search is harder than it sounds is a post in classic Alex Clemmer style. The post looks at a position that’s commonly believed (web search isn’t all that hard and someone should come up with a better Google) and explains why that’s not an obviously correct position. That’s also a common theme of his comments elsewhere, such as these comments on stack ranking at MS, implementing POSIX on Windows, the size of the Windows codebase, and Bing.

If you follow his online commenting, it’s mostly Microsoft-related rants; much more current than Mini-MSFT.

Allison Kaptur

Explorations of various areas, often Python related, such as this series on the Python interpreter and this series on the CPython peephole optimizer. Also, thoughts on broader topics like debugging and learning.

Often detailed, with inline code that’s meant to be read and understood (with the help of exposition that’s generally quite clear).

Chris Fenton

Computer related projects, by which I mean things like reconstructing the Cray-1A and building mechanical computers. Rarely updated, presumably due to the amount of work that goes into the creations, but almost always interesting.

The blog posts tend to be high-level, more like pitch decks than design docs, but there’s often source code available if you want more detail.

Code Words

This is a quarterly publication from RC. Posts vary from floating point implementations in various languages to how git works to image processing.

I wonder why web publications like this don’t get more press. There’s been a bit of a revival lately, and we’ve seen plenty of high quality publications, from high-profile efforts like The Macro to unpublicized gems like Snowsuit, but you don’t really see people talking about these much. Or I don’t, anyway.

Dan McKinley

A lot of great material on how engineering companies should be run. He has a lot of ideas that sound like common sense, e.g., choose boring technology, until you realize that it’s actually uncommon to find opinions that are so sensible.

Mostly distilled wisdom (as opposed to, say, detailed explanations of code).

David Dalrymple

A mix of things from writing a 64-bit kernel from scratch shortly after learning assembly to a high-level overview of computer systems. Rarely updated, with few posts, but each post has a lot to think about.

Eli Bendersky

I think of this as “the C++ blog”, but it’s much wider ranging than that. It’s too wide ranging for me to sum up, but if I had to commit to a description I might say that it’s a collection of deep dives into various topics, often (but not always) relatively low-level, along with short blurbs about books, often (but not always) technical.

The book reviews tend to be easy reading, but the programming blog posts are often a mix of code and exposition that really demands your attention; usually not a light read.

Evan Jones

A wide variety of technical tidbits, from how integer division behavior varies by language to data corruption that isn’t corrected by Ethernet or TCP checksums. Posts are usually bite-sized and easy reads.

EPITA Systems Lab

Low-level. A good example of a relatively high-level post from this blog is this post on the low fragmentation heap in Windows. Posts like how to hack a pinball machine and how to design a 386 compatible dev board are typical.

Posts are often quite detailed, with schematic/circuit diagrams. This is relatively heavy reading and I try to have pen and paper handy when I’m reading this blog.

Fabrice Bellard

Not exactly a blog, but every time a new project appears on the front page, it’s amazing. Some examples are QEMU, FFMPEG, a 4G LTE base station that runs on a PC, a javascript PC emulator that can boot Linux, etc.

Gary Bernhardt

Another “not exactly a blog”, but it’s more informative than most blogs, not to mention more entertaining. This is the best “blog” on the pervasive brokenness of modern software that I know of.

Greg Wilson

Writeups of papers that (should) have an impact on how people write software, like this paper on what causes failures in distributed systems or this paper on what makes people feel productive. Not updated much, but Greg still blogs on his personal site.

The posts tend to be extended abstracts that tease you into reading the paper, rather than detailed explanations of the methodology and results.

Gustavo Duarte

Explanations of how Linux works, as well as other low-level topics. This particular blog seems to be on hiatus, but “0xAX” seems to have picked up the slack with the linux-insides project.

If you’ve read Love’s book on Linux, Duarte’s explanations are similar, but tend to be more about the idea and less about the implementation. They’re also heavier on providing diagrams and context. “0xAX” is a lot more focused on walking through the code than either Love or Duarte.

Jessica Kerr

Jessica is probably better known for her talks than her blog? Her talks are great! My favorite is probably this talk, which explains different concurrency models in an easy to understand way, but the blog also has a lot of material I like.

As is the case with her talks, the diagrams often take a concept and clarify it, making something that wasn’t obvious seem very obvious in retrospect.

John Regehr

I think of this as the “C is harder than you think, even if you think C is really hard” blog, although the blog actually covers a lot more than that. Some commonly covered topics are fuzzing, compiler optimization, and testing in general.

Posts tend to be conceptual. There’s often code as examples, but the code tends to be light and easy to read, making Regehr’s blog a relatively smooth and easy read even though it covers a lot of important ideas.

Juho Snellman

A lot of posts about networking, generally written so that they make sense even with minimal networking background. I wish more people with this kind of knowledge (in depth knowledge of systems, not just networking knowledge in particular) would write up explanations for a general audience. Also has interesting non-networking content, like this post on Finnish elections.

Julia Evans

AFAICT, the theme is “things Julia has learned recently”, which can be anything from Huffman coding to how to be happy when working in a remote job. When the posts are on a topic I don’t already know, I learn something new. When they’re on a topic I know, they remind me that the topic is exciting and contains a lot of wonder and mystery.

Many posts have more questions than answers, and are more of a live-blogged exploration of a topic than an explanation of the topic.

Kamal Marhubi

Technical explorations of various topics, with a systems-y bent. Kubernetes. Git push. Syscalls in Rust. Also, some musings on programming in general.

The technical explorations often get into enough nitty gritty detail that this is something you probably want to sit down to read, as opposed to skim on your phone.

Kyle Kingsbury

90% of Kyle’s posts are explanations of distributed systems testing, which expose bugs in real systems that most of us rely on. The other 10% are musings on programming that are as rigorous as Kyle’s posts on distributed systems. Possibly the most educational programming blog of all time.

For those of us without a distributed systems background, understanding posts often requires a bit of Googling, despite the extensive explanations in the posts.

Marc Brooker

A mix of theory and wisdom from a distributed systems engineer on EBS at Amazon. The theory posts tend to be relatively short and easy to swallow; not at all intimidating, as theory sometimes is.

Marek Majkowski

This used to be a blog about random experiments Marek was doing, like this post on bitsliced SipHash. Since Marek joined Cloudflare, this has turned into a list of things Marek has learned while working in Cloudflare’s networking stack, like this story about debugging slow downloads.

Posts tend to be relatively short, but with enough technical specifics that they’re not light reads.

Mary Rose Cook

Lengthy and very-detailed explanations of technical topics, mixed in with a wide variety of other posts.

The selection of topics is eclectic, and explained at a level of detail such that you’ll come away with a solid understanding of the topic. The explanations are usually fine grained enough that it’s hard to miss what’s going on, even if you’re a beginner programmer.

Nitsan Wakart

More than you ever wanted to know about writing fast code for the JVM, from how GC affects data structures to the subtleties of volatile reads.

Posts tend to involve lots of Java code, but the takeaways are often language agnostic.

Oona Raisanen

Adventures in signal processing. Everything from deblurring barcodes to figuring out what those signals from helicopters mean. If I’d known that signals and systems could be this interesting, I would have paid more attention in class.

Paul Khuong

Some content on Lisp, and some on low-level optimizations, with a trend towards low-level optimizations.

Posts are usually relatively long and self-contained explanations of technical ideas with very little fluff.

Rachel Kroll

Years of debugging stories from a long-time SRE, along with stories about big company nonsense. The stories often have details that make them sound like they come from Google, but anyone who’s worked at Microsoft, IBM, Oracle, or another large company will find them familiar.

This reminds me a bit of Google’s SRE book, except that the content is ordered chronologically instead of by topic, and it’s conveyed through personal stories rather than impersonal case studies.

Russell Smith

Homemade electronics projects from vim on a mechanical typewriter to building an electrobalance to proof spirits.

Posts tend to have a fair bit of detail, down to diagrams explaining parts of circuits, but the posts aren’t as detailed as specs. But there are usually links to resources that will teach you enough to reproduce the project, if you want.

Steve Yegge

This is one of the few programming blogs where I regularly go back and re-read posts from the archive. I learn something new every time. Posts span the entire stack, from how individual programmers can improve at programming to how orgs can improve at recruiting. I re-read that last post before posting the link here and this bit jumped out:

Well, in case you hadn’t noticed, they’re kicking our butts at recruiting. Even in our own backyard. Professor Ed Lazowska at the University of Washington told us last year that Google’s getting about 3 times as many UW hires as we are. A candidate at last week’s recruiting trip told me that of the nine or ten students he considered to be the best programmers at the UW, about half of them went to Google; only two went to Amazon, and the rest went to “no-name” places.

Actually, his story had one more interesting tidbit: he said that although Microsoft is considered one of the top three places to work by the UW CS students (along with Google and Amazon), he claims that Microsoft is hiring lots of mediocre programmers. He said they gave offers to a whole bunch of programmers who he knows aren’t any good — and this guy was my strongest interviewee of the trip, so I was inclined to trust his judgement. He said that in his eyes, this disqualified Microsoft as a potential employer.

That’s not to say we don’t lose candidates to Microsoft. We do! Microsoft has determined that Amazon is very good at talent assessment, but crappy at selling the candidates and clinching the deal. So when Microsoft hears from a candidate that they’ve got a full-time offer from us, Microsoft doesn’t even interview the person. They take the candidate for a ride in the company hummer, have execs wine and dine them, let them spend the day with the team they’re going to join, show them the private office with a door they’ll get so they can concentrate on innovation… it’s a straight sell job after we’ve made an offer.

This is exactly what happened to me with Microsoft – I had a number of offers from other companies (though not Amazon), and someone from Microsoft called me up and sold me on Microsoft. I technically had an interview, but the interview was basically a sales job. Basically every time I re-read a Steve Yegge post, I notice that the post reflects some recent experience of mine.

Ted Unangst

A mix of technical posts about security and BSD and commentary on how broken software is. Some examples of the latter are this post on automation failure and this post on how Netflix handles CDs.

Even when there’s code, the posts tend to be about a high-level idea that just happens to be illustrated by the code, which makes this a lighter read than you’d expect from sheer amount of code.

Rebecca Frankel

As far as I know, Rebecca doesn’t have a programming blog, but if you look at her apparently off-the-cuff comments on other people’s posts as a blog, it’s one of the best written programming blogs out there. She used to be prolific on Piaw’s Buzz (and probably elsewhere, although I don’t know where), and you occasionally see comments elsewhere, like on this Steve Yegge blog post about brilliant engineers1. I wish I could write like that.

RWT

This isn’t updated anymore, but I find the archives to be fun reading for insight into what people were thinking about microprocessors and computer architecture over the past two decades. It can be a bit depressing to see that the same benchmarking controversies we had 15 years ago are being repeated today, sometimes with the same players. If anything, I’d say that the average benchmark you see passed around today is worse than what you would have seen 15 years ago, even though the industry as a whole has learned a lot about benchmarking since then.

Vyacheslav Egorov

In-depth explanations of how V8 works and how various constructs get optimized, by a compiler dev on the V8 team. If I’d known compilers were this interesting, I would have taken a compilers class back when I was in college.

Often takes topics that are considered hard and explains them in a way that makes them seem easy. Lots of diagrams, where appropriate, and detailed exposition on all the tricky bits.

Yossi Kreinin

Mostly dormant since the author started doing art, but the archives have a lot of great content about hardware, low-level software, and general programming-related topics that aren’t strictly programming.

90% of the time, when I get the desire to write a post about a common misconception software folks have about hardware, Yossi has already written the post and taken a lot of flak for it so I don’t have to :-).

I also really like Yossi’s career advice, like this response to Patrick McKenzie and this post on how managers get what they want and not what they ask for.

This blog?

Common themes include:

I still sort of can’t believe that anyone reads my writing on purpose. If I had to think of one flattering thing to say about my blog, it would be that even though my blog posts are often substantially longer than Steve Yegge’s, I have literally not seen a single person complain about the length in internet comments. I expect that’s a self-defeating prophecy, though :-).

The end

Note that this list is relatively tilted towards blogs I find to be underrated. So it doesn’t include, for example, the high scalability blog, mechanical sympathy, or Patrick McKenzie even though I think they’re great. In that case, you might say that it’s strange that I have folks like Steve Yegge and Kyle Kingsbury listed. What can I say? I still consider them underrated. This list also doesn’t include blogs that mostly aren’t about programming, so it doesn’t include, for example, Ben Kuhn’s excellent blog.

Anyway, that’s all for now, but this list is pretty much off the top of my head, so I’ll add more as more blogs come to mind. I’ll also keep this list updated with what I’m reading as I find new blogs. Please please please suggest other blogs I might like, and don’t assume that I already know about a blog because it’s popular. Just for example, I had no idea who either Jeff Atwood or Zed Shaw were until a few years ago, and they were (and still are) probably two of the most well known programming bloggers in existence. Even with centralized link aggregators like HN and reddit, blog discovery has become haphazard and random with the decline of blogrolls and blogging as a dialogue rather than a monologue. Also, please don’t assume that I don’t want to read something just because it’s different from the kind of blog I normally read. I’d love to read more from UX or front-end folks; I just don’t know where to find that kind of thing!

This post was inspired by the two posts Julia Evans has on blogs she reads and by the Chicago undergraduate mathematics bibliography, which I’ve found to be the most useful set of book reviews I’ve ever encountered.

Thanks to Bartłomiej Filipek and Sean Barrett for suggestions on what to add to the list. I haven’t had time to write them up, but I’ll probably add https://fgiesen.wordpress.com/, http://fabiensanglard.net/, http://preshing.com/, http://huonw.github.io/, and https://randomascii.wordpress.com/, among others. Also, thanks to Lindsey Kuper for discussion/corrections.


  1. Quote follows below, since I can see from my analytics data that relatively few people click any individual link, and people seem especially unlikely to click a link to read a comment on a blog, even if the comment is great:

    The key here is “principally,” and that I am describing motivation, not self-evaluation. The question is, what’s driving you? What gets you working? If its just trying to show that you’re good, then you won’t be. It has to be something else too, or it won’t get you through the concentrated decade of training it takes to get to that level.

    Look at the history of the person we’re all presuming Steve Yegge is talking about. He graduated (with honors) in 1990 and started at Google in 1999. So he worked a long time before he got to the level of Google’s star. When I was at Google I hung out on Sunday afternoons with a similar superstar. Nobody else was reliably there on Sunday; but he always was, so I could count on having someone to talk to. On some Sundays he came to work even when he had unquestionably legitimate reasons for not feeling well, but he still came to work. Why didn’t he go home like any normal person would? It wasn’t that he was trying to prove himself; he’d done that long ago. What was driving him?

    The only way I can describe it is one word: fury. What was he doing every Sunday? He was reviewing various APIs that were being proposed as standards by more junior programmers, and he was always finding things wrong with them. What he would talk about, or rather, rage about, on these Sunday afternoons was always about some idiocy or another that someone was trying make standard, and what was wrong with it, how it had to be fixed up, etc, etc. He was always in a high dudgeon over it all.

    What made him come to work when he was feeling sick and dizzy and nobody, not even Larry and Sergey with their legendary impatience, not even them, I mean nobody would have thought less of him if he had just gone home & gone to sleep? He seemed to be driven, not by ambition, but by fear that if he stopped paying attention, something idiotically wrong (in his eyes) might get past him, and become the standard, and that was just unbearable, the thought made him so incoherently angry at the sheer wrongness of it, that he had to stay awake and prevent it from happening no matter how legitimately bad he was feeling at the time.

    It made me think of Paul Graham’s comment: “What do I mean by good people? One of the best tricks I learned during our startup was a rule for deciding who to hire. Could you describe the person as an animal?… I mean someone who takes their work a little too seriously; someone who does what they do so well that they pass right through professional and cross over into obsessive.

    What it means specifically depends on the job: a salesperson who just won’t take no for an answer; a hacker who will stay up till 4:00 AM rather than go to bed leaving code with a bug in it; a PR person who will cold-call New York Times reporters on their cell phones; a graphic designer who feels physical pain when something is two millimeters out of place.”

    I think a corollary of this characterization is that if you really want to be “an animal,” what you have cultivate in yourself is partly ambition, but it is partly also self-knowledge. As Paul Graham says, there are different kinds of animals. The obsessive graphic designer might be unconcerned about an API that is less than it could be, while the programming superstar might pass by, or create, a terrible graphic design without the slightest twinge of misgiving.

    Therefore, key question is: are you working on the thing you care about most? If its wrong, is it unbearable to you? Nothing but deep seated fury will propel you to the level of a superstar. Getting there hurts too much; mere desire to be good is not enough. If its not in you, its not in you. You have to be propelled by elemental wrath. Nothing less will do.

    Or it might be in you, but just not in this domain. You have to find what you care about, and not just what you care about, but what you care about violently: you can’t fake it.

    (Also, if you do have it in you, you still have to choose your boss carefully. No matter how good you are, it may not be trivial to find someone you can work for. There’s more to say here; but I’ll have to leave it for another comment.)

    Another clarification of my assertion “if you’re wondering if you’re good, then you’re not” should perhaps be said “if you need reassurance from someone else that you’re good, then you’re not.” One characteristic of these “animals” is that they are such obsessive perfectionists that their own internal standards so far outstrip anything that anyone else could hold them to, that no ordinary person (i.e. ordinary boss) can evaluate them. As Steve Yegge said, they don’t go for interviews. They do evaluate each other – at Google the superstars all reviewed each other’s code, reportedly brutally – but I don’t think they cared about the judgments of anyone who wasn’t in their circle or at their level.

    I agree with Steve Yegge’s assertion that there are an enormously important (small) group of people who are just on another level, and ordinary smart hardworking people just aren’t the same. Here’s another way to explain why there should be a quantum jump – perhaps I’ve been using this discussion to build up this idea: its the difference between people who are still trying to do well on a test administered by someone else, and the people who have found in themselves the ability to grade their own test, more carefully, with more obsessive perfectionism, than anyone else could possibly impose on them.

    School, for all it teaches, may have one bad lasting effect on people: it gives them the idea that good people get A’s on tests, and better ones get A+’s on tests, and the very best get A++’s. Then you get the idea that you go out into the real world, and your boss is kind of super-professor, who takes over the grading of the test. Joel Spolsky is accepting that role, being boss as super-professor, grading his employees tests for them, telling them whether they are good.

    But the problem is that in the real world, the very most valuable, most effective people aren’t the ones who are trying to get A+++’s on the test you give them. The very best people are the ones who can make up their own test with harder problems on it than you could ever think of, and you’d have to have studied for the same ten years they have to be able even to know how to grade their answers.

    That’s a problem, incidentally, with the idea of a meritocracy. School gives you an idea of a ladder of merit that reaches to the top. But it can’t reach all the way to the top, because someone has to measure the rungs. At the top you’re not just being judged on how high you are on the ladder. You’re also being judged on your ability to “grade your own test”; that is to say, your trustworthiness. People start asking whether you will enforce your own standards even if no one is imposing them on you. They have to! because at the top people get given jobs with the kind of responsibility where no one can possibly correct you if you screw up. I’m giving you an image of someone who is working himself sick, literally, trying grade everyone else’s work. In the end there is only so much he can do, and he does want to go home and go to bed sometimes. That means he wants people under him who are not merely good, but can be trusted not to need to be graded. Somebody has to watch the watchers, and in the end, the watchers have to watch themselves.

    [return]

Notes on concurrency bugs


Do concurrency bugs matter? From the literature, we know that most reported bugs in distributed systems have really simple causes and can be caught by trivial tests, even when we only look at bugs that cause really bad failures, like loss of a cluster or data corruption. The filesystem literature echoes this result – a simple checker that looks for totally unimplemented error handling can find hundreds of serious data corruption bugs. Most bugs are simple, at least if you measure by bug count. But if you measure by debugging time, the story is a bit different.

Just from personal experience, I’ve spent more time debugging complex non-deterministic failures than all other types of bugs combined. In fact, I’ve spent more time debugging some individual non-deterministic bugs (weeks or months) than on all other bug types combined. Non-deterministic bugs are rare, but they can be extremely hard to debug and they’re a productivity killer. Bad non-deterministic bugs take so long to debug that relatively large investments in tools and prevention can be worth it1.

Let’s see what the academic literature has to say on non-deterministic bugs. There’s a lot of literature out there, so let’s narrow things down by looking at one relatively well studied area: concurrency bugs. We’ll start with the literature on single-machine concurrency bugs and then look at distributed concurrency bugs.

Fonseca et al. DSN ‘10

They studied MySQL concurrency bugs from 2003 to 2009 and found the following:

More non-deadlock bugs (63%) than deadlock bugs (40%)

Note that these numbers sum to more than 100% because some bugs are tagged with multiple causes. This is roughly in line with the Lu et al. ASPLOS ‘08 paper (which we’ll look at later), which found that 30% of the bugs they examined were deadlock bugs.

15% of examined failures were semantic

The paper defines a semantic failure as one “where the application provides the user with a result that violates the intended semantics of the application”. The authors also find that “the vast majority of semantic bugs (92%) generated subtle violations of application semantics”. By their nature, these failures are likely to be undercounted – it’s pretty hard to miss a deadlock, but it’s easy to miss subtle data corruption.

15% of examined failures were latent

The paper defines latent as bugs that “do not become immediately visible to users.”. Unsurprisingly, the paper finds that latent failures are closely related to semantic failures; 92% of latent failures are semantic and vice versa. The 92% number makes this finding sound more precise than it really is – it’s just that 11 out of the 12 semantic failures are latent and vice versa. That could have easily been 11 out of 11 (100%) or 10 out of 12 (83%).

That’s interesting, but it’s hard to tell from that if the results generalize to projects that aren’t databases, or even projects that aren’t MySQL.

Lu et al. ASPLOS ‘08

They looked at concurrency bugs in MySQL, Firefox, OpenOffice, and Apache. Some of their findings are:

97% of examined non-deadlock bugs were atomicity-violation or order-violation bugs

Of the 74 non-deadlock bugs studied, 51 were atomicity bugs, 24 were ordering bugs, and 2 were categorized as “other”.

An example of an atomicity violation is this bug from MySQL:

Thread 1:

if (thd->proc_info)
  fputs(thd->proc_info, ...)

Thread 2:

thd->proc_info = NULL;

For anyone who isn’t used to C or C++, thd is a pointer, and -> is the operator to access a field through a pointer. The first line in thread 1 checks if the field is null. The second line calls fputs, which writes the field. The intent is to call fputs only if proc_info isn’t NULL, but there’s nothing preventing another thread from setting proc_info to NULL “between” the first and second lines of thread 1.
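
A rough sketch of the usual fix pattern for this kind of check-then-use race is for the reader to snapshot the pointer exactly once while holding the same lock the writer takes. This is not MySQL’s actual fix; the struct, mutex name, and output stream below are assumptions for illustration.

// Sketch of the check-then-use fix pattern; not MySQL's actual fix.
// The struct, mutex, and output stream are assumptions for illustration.
#include <cstdio>
#include <mutex>

struct Thd { const char *proc_info = "copying to tmp table"; };

Thd thd;
std::mutex thd_mutex;

// Reader: snapshot the pointer exactly once, under the lock.
void report_proc_info() {
  std::lock_guard<std::mutex> guard(thd_mutex);
  const char *info = thd.proc_info;  // single read; the writer can't clear it mid-check
  if (info)
    std::fputs(info, stderr);
}

// Writer: clears the field under the same lock.
void clear_proc_info() {
  std::lock_guard<std::mutex> guard(thd_mutex);
  thd.proc_info = nullptr;
}

int main() {
  report_proc_info();
  clear_proc_info();
  report_proc_info();  // prints nothing; the snapshot is null
}

With both sides holding thd_mutex, the writer can no longer clear proc_info between the check and the write.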

Like most bugs, this bug is obvious in retrospect, but if we look at the original bug report, we can see that it wasn’t obvious at the time:

Description: I’ve just noticed with the latest bk tree than MySQL regularly crashes in InnoDB code … How to repeat: I’ve still no clues on why this crash occurs.

As is common with large codebases, fixing the bug once it was diagnosed was more complicated than it first seemed. This bug was partially fixed in 2004, resurfaced again and was fixed in 2008. A fix for another bug caused a regression in 2009, which was also fixed in 2009. That fix introduced a deadlock that was found in 2011.

An example ordering bug is the following bug from Firefox:

Thread 1:

mThread=PR_CreateThread(mMain, ...);

Thread 2:

void mMain(...) {
  mState = mThread->State;
  }

Thread 1 launches Thread 2 with PR_CreateThread. Thread 2 assumes that, because the line that launched it assigned to mThread, mThread is valid. But Thread 2 can start executing before Thread 1 has assigned to mThread! The authors note that they call this an ordering bug and not an atomicity bug even though the bug could have been prevented if the line in thread 1 were atomic because their “bug pattern categorization is based on root cause, regardless of possible fix strategies”.
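
One common fix pattern for this kind of ordering bug is to have the child thread wait until the parent has published the handle. Here’s a minimal sketch using C++11 primitives rather than NSPR’s PR_CreateThread, so the names and structure are assumptions, not Firefox’s actual fix:

// Sketch of one fix for the ordering bug: the child blocks until the parent
// has published mThread. Assumed example; not Firefox's actual fix.
#include <condition_variable>
#include <cstdio>
#include <mutex>
#include <thread>

std::mutex m;
std::condition_variable cv;
std::thread *mThread = nullptr;  // handle published by the parent
bool published = false;

void child_main() {
  std::unique_lock<std::mutex> lock(m);
  cv.wait(lock, [] { return published; });  // block until the parent publishes
  // mThread is now guaranteed to be set; the original bug read it too early.
  std::printf("handle published: %p\n", static_cast<void *>(mThread));
}

int main() {
  std::thread t(child_main);
  {
    std::lock_guard<std::mutex> lock(m);
    mThread = &t;       // publish the handle...
    published = true;   // ...and only then let the child proceed
  }
  cv.notify_one();
  t.join();
}

The other common fix is to not share the handle at all and instead pass whatever the child needs as an argument when the thread is created.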

An example of an “other” bug, one of only two studied, is this bug in MySQL:

Threads 1…n:

rw_lock(&lock);

Watchdog thread:

if (lock_wait_time[i] > fatal_timeout)
  assert(0);

This can cause a spurious crash when there’s more than the expected amount of work. Note that the study doesn’t look at performance bugs, so a bug where lock contention causes things to slow to a crawl but a watchdog doesn’t kill the program wouldn’t be considered.

An aside that’s probably a topic for another post is that hardware often has deadlock or livelock detection built in, and that when a lock condition is detected, hardware will often try to push things into a state where normal execution can continue. After detecting and breaking deadlock/livelock, an error will typically be logged in a way that it will be noticed if it’s caught in lab, but that external customers won’t see. For some reason, that strategy seems rare in the software world, although it seems like it should be easier in software than in hardware.

Deadlock occurs if and only if the following four conditions are true:

  1. Mutual exclusion: at least one resource must be held in a non-shareable mode. Only one process can use the resource at any given instant of time.
  2. Hold and wait or resource holding: a process is currently holding at least one resource and requesting additional resources which are being held by other processes.
  3. No preemption: a resource can be released only voluntarily by the process holding it.
  4. Circular wait: a process must be waiting for a resource which is being held by another process, which in turn is waiting for the first process to release the resource (see the sketch after this list).
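
As a minimal sketch of the circular wait condition (my own example, not from the study): two threads that take the same two locks in opposite orders can deadlock, and acquiring both locks in a fixed global order, or atomically with std::scoped_lock, breaks the cycle.

// Build with something like: clang++ -std=c++17 -pthread deadlock.cc
#include <mutex>
#include <thread>

std::mutex a, b;

void thread1() {
  std::lock_guard<std::mutex> la(a);
  std::lock_guard<std::mutex> lb(b);  // may block on a lock held by thread 2...
}

void thread2_deadlock_prone() {
  std::lock_guard<std::mutex> lb(b);
  std::lock_guard<std::mutex> la(a);  // ...which is blocked on a lock held by thread 1
}

// Fixed: acquire both locks with a deadlock-avoiding algorithm, or always
// acquire them in the same global order.
void thread2_fixed() {
  std::scoped_lock both(a, b);
}

int main() {
  std::thread t1(thread1);
  std::thread t2(thread2_fixed);  // run the fixed variant so the demo terminates
  t1.join();
  t2.join();
}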

There’s nothing about these conditions that is unique to either hardware or software, and it’s easier to build mechanisms that can back off and replay to relax (2) in software than in hardware. Anyway, back to the study findings.

96% of examined concurrency bugs could be reproduced by fixing the relative order of 2 specific threads

This sounds like great news for testing. Testing only orderings between thread pairs is much more tractable than testing all orderings between all threads. Similarly, 92% of examined bugs could be reproduced by fixing the order of four (or fewer) memory accesses. However, there’s a kind of sampling bias here – only bugs that could be reproduced could be analyzed for a root cause, and bugs that only require ordering between two threads or only a few memory accesses are easier to reproduce.

97% of examined deadlock bugs were caused by two threads waiting for at most two resources

Moreover, 22% of examined deadlock bugs were caused by a thread acquiring a resource held by the thread itself. The authors state that pairwise testing of acquisition and release sequences should be able to catch most deadlock bugs, and that pairwise testing of thread orderings should be able to catch most non-deadlock bugs. The claim seems plausibly true when read as written; the implication seems to be that virtually all bugs can be caught through some kind of pairwise testing, but I’m a bit skeptical of that due to the sample bias of the bugs studied.

I’ve seen bugs with many moving parts take months to track down. The worst bug I’ve seen consumed nearly a person-year’s worth of time. Bugs like that mostly don’t make it into studies like this because it’s rare that a job allows someone the time to chase bugs that elusive. How many bugs like that are out there is still an open question.

Caveats

Note that all of the programs studied were written in C or C++, and that this study predates C++11. Moving to C++11 and using atomics and scoped locks would probably change the numbers substantially, not to mention moving to an entirely different concurrency model. There’s some academic work on how different concurrency models affect bug rates, but it’s not really clear how that work generalizes to codebases as large and mature as the ones studied, and by their nature, large and mature codebases are hard to do randomized trials on when the trial involves changing the fundamental primitives used. The authors note that 39% of examined bugs could have been prevented by using transactional memory, but it’s not clear how many other bugs might have been introduced if transactional memory were used.

Tools

There are other papers on characterizing single-machine concurrency bugs, but in the interest of space, I’m going to skip those. There are also papers on distributed concurrency bugs, but before we get to that, let’s look at some of the tooling for finding single-machine concurrency bugs that’s in the literature. I find the papers to be pretty interesting, especially the model checking work, but realistically, I’m probably not going to build a tool from scratch if something is available, so let’s look at what’s out there.

HapSet

Uses run-time coverage to generate interleavings that haven’t been covered yet. This is out of NEC labs; googling NEC labs HapSet returns the paper and some patent listings, but no obvious download for the tool.

CHESS

Generates unique interleavings of threads for each run. They claim that, by not tracking state, the checker is much simpler than it would otherwise be, and that they’re able to avoid many of the disadvantages of tracking state via a detail that can’t properly be described in this tiny little paragraph; read the paper if you’re interested! Supports C# and C++. The page claims that it requires Visual Studio 2010 and that it’s only been tested with 32-bit code. I haven’t tried to run this on a modern *nix compiler, but IME requiring Visual Studio 2010 means that it would be a moderate effort to get it running on a modern version of Visual Studio, and a substantial effort to get it running on a modern version of gcc or clang. A quick Google search indicates that this might be patent encumbered2.

Maple

Uses coverage to generate interleavings that haven’t been covered yet. Instruments pthreads. The source is up on github. It’s possible this tool is still usable, and I’ll probably give it a shot at some point, but it depends on at least one old, apparently unmaintained tool (PIN, a binary instrumentation tool from Intel). Googling (Binging?) for either Maple or PIN gives a number of results where people can’t even get the tool to compile, let alone use the tool.

PACER

Samples using the FastTrack algorithm in order to keep overhead low enough “to consider in production software”. Ironically, this was implemented on top of the Jikes RVM, which is unlikely to be used in actual production software. The only reference I could find for an actually downloadable tool is a completely different pacer.

ConLock / MagicLock / MagicFuzzer

There’s a series of tools that are from one group which claims to get good results using various techniques, but AFAICT the source isn’t available for any of the tools. There’s a page that claims there’s a version of MagicFuzzer available, but it’s a link to a binary that doesn’t specify what platform the binary is for and the link 404s.

OMEN / WOLF

I couldn’t find a page for these tools (other than their papers), let alone a download link.

SherLock / AtomChase / Racageddon

Another series of tools that aren’t obviously available.

Tools you can actually easily use

Valgrind / DRD / Helgrind

Instruments pthreads and easy to use – just run valgrind with the appropriate tool option (--tool=drd or --tool=helgrind) on the binary. May require a couple of tweaks if using C++11 threading.

clang thread sanitizer (TSan)

Can find data races. Flags when happens-before is violated. Works with pthreads and C++11 threads. Easy to use (just pass -fsanitize=thread to clang).
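
For a concrete picture of what TSan flags, here’s a minimal racy program (an assumed example, not taken from the TSan docs). Compiling it with -fsanitize=thread and running it produces a data race report pointing at the two unsynchronized writes.

// race.cc -- a minimal data race that TSan reports at runtime.
// Build and run with something like:
//   clang++ -std=c++11 -g -fsanitize=thread -pthread race.cc && ./a.out
#include <thread>

int counter = 0;  // shared and unsynchronized on purpose

int main() {
  std::thread t([] { counter++; });  // racy write from a second thread
  counter++;                         // racy write from the main thread
  t.join();
  return 0;
}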

A side effect of being so easy to use and actually available is that tsan has had a very large impact in the real world:

One interesting incident occurred in the open source Chrome browser. Up to 15% of known crashes were attributed to just one bug [5], which proved difficult to understand - the Chrome engineers spent over 6 months tracking this bug without success. On the other hand, the TSAN V1 team found the reason for this bug in a 30 minute run, without even knowing about these crashes. The crashes were caused by data races on a couple of reference counters. Once this reason was found, a relatively trivial fix was quickly made and patched in, and subsequently the bug was closed.

clang -Wthread-safety

Static analysis that uses annotations on shared state to determine if state wasn’t correctly guarded.
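
Here’s a minimal sketch of what the annotations look like. The hand-rolled Mutex wrapper is an assumption for illustration; real code usually uses the macro header from the clang docs or an already-annotated mutex class.

// guarded.cc -- sketch of clang's thread safety annotations.
// Check with: clang++ -std=c++11 -Wthread-safety -fsyntax-only guarded.cc
struct __attribute__((capability("mutex"))) Mutex {
  void lock() __attribute__((acquire_capability()));
  void unlock() __attribute__((release_capability()));
};

Mutex mu;
int counter __attribute__((guarded_by(mu)));  // must hold mu to touch counter

void ok() {
  mu.lock();
  counter++;  // fine: mu is held
  mu.unlock();
}

void not_ok() {
  counter++;  // -Wthread-safety warns that this requires holding mu
}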

FindBugs

General static analysis for Java with many features. Has @GuardedBy annotations, similar to -Wthread-safety.

CheckerFramework

Java framework for writing checkers. Has many different checkers. For concurrency in particular, uses @GuardedBy, like FindBugs.

rr

Deterministic replay for debugging. Easy to get and use, and appears to be actively maintained. Adds support for time-travel debugging in gdb.

DrDebug/PinPlay

General toolkit that can give you deterministic replay for debugging. Also gives you “dynamic slicing”, which is watchpoint-like: it can tell you what statements affected a variable, as well as what statements are affected by a variable. Currently Linux only; claims Windows and Android support coming soon.

Other tools

This isn’t an exhaustive list – there’s a ton of literature on this, and this is an area where, frankly, I’m pretty unlikely to have the time to implement a tool myself, so there’s not much value for me in reading more papers to find out about techniques that I’d have to implement myself3. However, I’d be interested in hearing about other tools that are usable.

One thing I find interesting about this is that almost all of the papers for the academic tools claim to do something novel that lets them find bugs not found by other tools. They then run their tool on some codebase and show that the tool is capable of finding new bugs. But since almost no one goes and runs the older tools on any codebase, you’d never know if one of the newer tools only found a subset of the bugs that one of the older tools could catch.

Furthermore, you see cycles (livelock?) in how papers claim to be novel. Paper I will claim that it does X. Paper II will claim that it’s novel because it doesn’t need to do X, unlike Paper I. Then Paper III will claim that it’s novel because, unlike Paper II, it does X.

Distributed systems

Now that we’ve looked at some of the literature on single-machine concurrency bugs, what about distributed concurrency bugs?

Leesatapornwongsa et al. ASPLOS 2016

They looked at 104 bugs in Cassandra, MapReduce, HBase, and Zookeeper. Let’s look at some example bugs, which will clarify the terminology used in the study and make it easier to understand the main findings.

Message-message race

This diagram is just for reference, so that we have a high-level idea of how different parts fit together in MapReduce:

Block diagram of MapReduce

In MapReduce bug #3274, a resource manager sends a task-init message to a node manager. Shortly afterwards, an application master sends a task-kill preemption to the same node manager. The intent is for the task-kill message to kill the task that was started with the task-init message, but the task-kill can win the race and arrive before the task-init. This example happens to be a case where two messages from different nodes are racing to get to a single node.

Another example is MapReduce bug #5358, where an application master sends a kill message to a node manager running a speculative task because another copy of the task finished. However, before the message is received by the node manager, the node manager’s task completes, causing a complete message to be sent to the application master, which throws an exception because it receives a complete message for a task that has already completed.

Message-compute race

One example is MapReduce bug #4157, where the application master unregisters with the resource manager. The application master then cleans up, but that clean-up races against the resource manager sending kill messages to the application’s containers via node managers, causing the application master to get killed. Note that this is classified as a race and not an atomicity bug, which we’ll get to shortly.

Compute-compute races can happen, but they’re outside the scope of this study since this study only looks at distributed concurrency bugs.

Atomicity violation

For the purposes of this study, atomicity bugs are defined as “whenever a message comes in the middle of a set of events, which is a local computation or global communication, but not when the message comes either before or after the events”. According to this definition, the message-compute race we looked at above isn’t a atomicity bug because it would still be a bug if the message came in before the “computation” started. This definition also means that hardware failures that occur inside a block that must be atomic are not considered atomicity bugs.

I can see why you’d want to define those bugs as separate types of bugs, but I find this to be a bit counterintuitive, since I consider all of these to be different kinds of atomicity bugs because they’re different bugs that are caused by breaking up something that needs to be atomic.

In any case, by the definition of this study, MapReduce bug #5009 is an atomicity bug. A node manager is in the process of committing data to HDFS. The resource manager kills the task, which doesn’t cause the commit state to change. Any time the node tries to rerun the commit task, the task is killed by the application master because a commit is believed to already be in progress.

Fault timing

A fault is defined to be a “component failure”, such as a crash, timeout, or unexpected latency. At one point, the paper refers to “hardware faults such as machine crashes”, which seems to indicate that some faults that could be considered software faults are defined as hardware faults for the purposes of this study.

Anyway, for the purposes of this study, an example of a fault-timing issue is MapReduce bug #3858. A node manager crashes while committing results. When the task is re-run, later attempts to commit all fail.

Reboot timing

In this study, reboots are classified separately from other faults. MapReduce bug #3186 illustrates a reboot bug.

A resource manager sends a job to an application master. If the resource manager is rebooted before the application master sends a commit message back to the resource manager, the resource manager loses its state and throws an exception because it’s getting an unexpected complete message.

Some of their main findings are:

47% of examined bugs led to latent failures

That’s a pretty large difference when compared to the DSN ‘10 paper that found that 15% of examined multithreading bugs were latent failures. It’s plausible that this is a real difference and not just something due to a confounding variable, but it’s hard to tell from the data.

This is a large difference from what studies on “local” concurrency bugs found. I wonder how much of that is just because people mostly don’t even bother filing and fixing bugs on hardware faults in non-distributed software.

64% of examined bugs were triggered by a single message’s timing

44% were ordering violations, and 20% were atomicity violations. Furthermore, > 90% of bugs involved three messages (or fewer).

32% of examined bugs were due to fault or reboot timing. Note that, for the purposes of the study, a hardware fault or a reboot that breaks up a block that needed to be atomic isn’t considered an atomicity bug – here, atomicity bugs are bugs where a message arrives in the middle of a computation that needs to be atomic.

70% of bugs had simple fixes

30% were fixed by ignoring the badly timed message and 40% were fixed by delaying or ignoring the message.

Bug causes?

After reviewing the bugs, the authors propose common fallacies that lead to bugs:

  1. One hop is faster than two hops
  2. Zero hops are faster than one hop
  3. Atomic blocks can’t be broken

On (3), the authors note that it’s not just hardware faults or reboots that break up atomic blocks – systems can send kill or pre-emption messages that break up an atomic block. A fallacy I’ve commonly seen in post-mortems that’s not listed here goes something like “bad nodes are obviously bad”. A classic example of this is when a system starts “handling” queries by dropping them quickly, causing a load balancer to shift traffic to the bad node because it’s handling traffic so quickly.

One of my favorite bugs in this class from an actual system was in a ring-based storage system where nodes could do health checks on their neighbors and declare that their neighbors should be dropped if the health check fails. One node went bad, dropped all of its storage, and started reporting its neighbors as bad nodes. Its neighbors noticed that the bad node was bad, but because the bad node had dropped all of its storage, it was super fast and was able to report its good neighbors before the good neighbors could report the bad node. After ejecting its immediate neighbors, the bad node got new neighbors and raced the new neighbors, winning again for the same reason. This was repeated until the entire cluster died.

Tools

Mace

A set of language extensions (on C++) that helps you build distributed systems. Mace has a model checker that can check all possible event orderings of messages, interleaved with crashes, reboots, and timeouts. The Mace model checker is actually available, but AFAICT it requires using the Mace framework, and most distributed systems aren’t written in Mace.

Modist

Another model checker that checks different orderings. Runs only one interleaving of independent actions (partial order reduction) to avoid checking redundant states. Also interleaves timeouts. Unlike Mace, doesn’t inject reboots. Doesn’t appear to be available.

Demeter

Like Modist, in that it’s a model checker that injects the same types of faults. Uses a different technique to reduce the state space, which I don’t know how to summarize succinctly. See paper for details. Doesn’t appear to be available. Googling for Demeter returns some software used to model X-ray absorption?

SAMC

Another model checker. Can inject multiple crashes and reboots. Uses some understanding of the system to avoid redundant re-orderings (e.g., if a series of messages is invariant to when a reboot is injected, the system tries to avoid injecting the reboot between each message). Doesn’t appear to be available.

Jepsen

As was the case for non-distributed concurrency bugs, there’s a vast literature on academic tools, most of which appear to be grad-student code that hasn’t been made available.

And of course there’s Jepsen, which doesn’t have any attached academic papers, but has probably had more real-world impact than any of the other tools because it’s actually available and maintained. There’s also chaos monkey, but if I’m understanding it correctly, unlike the other tools listed, it doesn’t attempt to create reproducible failures.

Conclusion

Is this where you’re supposed to have a conclusion? I don’t have a conclusion. We’ve looked at some literature and found out some information about bugs that’s interesting, but not necessarily actionable. We’ve read about tools that are interesting, but not actually available. And then there are some tools based on old techniques that are available and useful.

For example, the idea inside clang’s TSan, using “happens-before” to find data races, goes back ages. There’s a 2003 paper that discusses “combining two previously known race detection techniques – lockset-based detection and happens-before-based detection – to obtain fewer false positives than lockset-based detection alone”. That’s actually what TSan v1 did, but with TSan v2 they realized the tool would be more impactful if they only used happens-before because that avoids false positives, which means that people will actually use the tool. That’s not something that’s likely to turn into a paper that gets cited zillions of times, though. For anyone who’s looked at how afl works, this story should sound familiar. AFL is eminently practical and has had a very large impact in the real world, mostly by eschewing fancy techniques from the recent literature.

If you must have a conclusion, maybe the conclusion is that individuals like Kyle Kingsbury or Michal Zalewski have had an outsized impact on industry, and that you too can probably pick an underserved area in testing and have a curiously large impact on an entire industry.

Unrelated miscellania

Rose Ames asked me to tell more “big company” stories, so here’s a set of stories that explains why I haven’t put a blog post up for a while. The proximate cause is that my VP has been getting negative comments about my writing. But the reasons for that are a bit of a long story. Part of it is the usual thing: the comments I receive personally skew very heavily positive, but the comments my manager gets run the other way, because it’s weird to email someone’s manager just because you like their writing, but you might send an email if their writing really strikes a nerve.

That explains why someone in my management chain was getting emailed about my writing, but it doesn’t explain why the emails went to my VP. That’s because I switched teams a few months ago, and the org that I was going to switch into overhired and didn’t have any headcount. I’ve heard conflicting numbers about how much they overhired, from 10 or 20 people to 10% or 20% (the org is quite large, and 10% would be much more than 20), as well as conflicting stories about why it happened (honest mistake vs. some group realizing that there was a hiring crunch coming and hiring as much as possible to take all of the reqs from the rest of the org). Anyway, for some reason, the org I would have worked in hired more than it was allowed to by at least one person and instituted a hiring freeze. Since my new manager couldn’t hire me into that org, he transferred into an org that had spare headcount and hired me into the new org. The new org happens to be a sales org, which means that I technically work in sales now; this has some impact on my day-to-day life since there are some resources and tech talks that are only accessible by people in product groups, but that’s another story. Anyway, for reasons that I don’t fully understand, I got hired into the org before my new manager, and during the months it took for the org chart to get updated I was shown as being parked under my VP, which meant that anyone who wanted to fire off an email to my manager would look me up in the directory and accidentally email my VP instead.

It didn’t seem like any individual email was a big deal, but since I don’t have much interaction with my VP and I don’t want to only be known as that guy who writes stuff which generates pushback from inside the company, I paused blogging for a while. I don’t exactly want to be known that way to my manager either, but I interact with my manager frequently enough that at least I won’t only be known for that.

I also wonder if these emails to my manager/VP are more likely at my current employer than at previous employers. I’ve never had this happen (that I know of) at another employer, but the total number of times it’s happened here is low enough that it might just be coincidence.

Then again, I was just reading the archives of a really insightful internal blog and ran across a note that mentioned that the series of blog posts was being published internally because the author got static from Sinofsky about publishing posts that contradicted the party line, which eventually resulted in the author agreeing to email Sinofsky comments related to anything under Sinofsky’s purview instead of publishing the comments publicly. But now that Sinofsky has moved on, the author wanted to share emails that would have otherwise been posts internally.

That kind of thing doesn’t seem to be a freak occurrence around here. Around the same time I saw that thing about Sinofsky, I ran across a discussion on whether or not a PM was within their rights to tell someone to take down a negative review from the app store. Apparently, a PM found out that someone had written a negative rating on the PM’s product in some app store and emailed the rater, telling them that they had to take the review down. It’s not clear how the PM found out that the rater worked for us (do they search the internal directory for every negative rating they find?), but they somehow found out and then issued their demand. Most people thought that the PM was out of line, but there were a non-zero number of people (in addition to the PM) who thought that employees should not say anything that could be construed as negative about the company in public.

I feel like I see more of this kind of thing now than I have at other companies, but the company’s really too big to tell if anyone’s personal experience generalizes. Anyway, I’ll probably start blogging again now that the org chart shows that I report to my actual manager, and maybe my manager will get some emails about that. Or maybe not.

Thanks to Leah Hanson, David Turner, Justin Mason, Joe Wilder, Matt Dziubinski, Alex Blewitt, Bruno Kim Medeiros Cesar, Luke Gilliam, Ben Karas, Julia Evans, Michael Ernst, and Stephen Tu for comments/corrections.


  1. If you’re going to debug bugs. I know some folks at startups who give up on bugs that look like they’ll take more than a few hours to debug because their todo list is long enough that they can’t afford the time. That might be the right decision given the tradeoffs they have, but it’s not the right decision for everyone. [return]
  2. Funny thing about US patent law: you owe treble damages for willfully infringing on a patent. A direct effect of this is that two out of three of my full-time employers have very strongly recommended that I don’t read patents, so I avoid reading patents that aren’t obviously frivolous. And by frivolous, I don’t mean patents for obvious things that any programmer might independently discover, because patents like that are often upheld as valid. I mean patents for things like how to swing on a swing. [return]
  3. I get the incentives that lead to this, and I don’t begrudge researchers for pursuing career success by responding to those incentives, but as a lowly practitioner, it sure would be nice if the incentives were different. [return]

How I learned to program


Tavish Armstrong has a great document where he describes how and when he learned the programming skills he has. I like this idea because I’ve found that the paths that people take to get into programming are much more varied than stereotypes give credit for, and I think it’s useful to see that there are many possible paths into programming.

Personally, I spent a decade working as an electrical engineer before taking a programming job. When I talk to people about this, they often want to take away a smooth narrative of my history. Maybe it’s that my math background gives me tools I can apply to a lot of problems, maybe it’s that my hardware background gives me a good understanding of performance and testing, or maybe it’s that the combination makes me a great fit for hardware/software co-design problems. People like a good narrative. One narrative people seem to like is that I’m a good problem solver, and that problem solving ability is generalizable. But reality is messy. Electrical engineering seemed like the most natural thing in the world, and I picked it up without trying very hard. Programming was unnatural for me, and didn’t make any sense at all for years. If you believe in the common “you either have it or you don’t” narrative about programmers, I definitely don’t have it. And yet, I now make a living programming, and people seem to be pretty happy with the work I do.

How’d that happen? Well, if we go back to the beginning, before becoming a hardware engineer, I spent a fair amount of time doing failed kid-projects (e.g., writing a tic-tac-toe game and AI) and not really “getting” programming. I do sometimes get a lot of value out of my math or hardware skills, but I suspect I could teach someone the actually applicable math and hardware skills I have in less than a year. Spending five years in school and a decade in industry to pick up those skills was a circuitous route to getting where I am. Amazingly, I’ve found that my path has been more direct than that of most of my co-workers, giving the lie to the narrative that most programmers are talented whiz kids who took to programming early.

And while I only use a small fraction of the technical skills I’ve learned on any given day, I find that I have a meta-skill set that I use all the time. There’s nothing profound about the meta-skill set, but because I often work in new (to me) problem domains, I find my meta-skill set to be more valuable than my actual skills. I don’t think that you can communicate the importance of meta-skills (like communication) by writing a blog post any more than you can explain what a monad is by saying that it’s like a burrito. That being said, I’m going to tell this story anyway.

Ineffective fumbling (1980s - 1996)

Many of my friends and I tried and failed multiple times to learn how to program. We tried BASIC, and could write some simple loops, use conditionals, and print to the screen, but never figured out how to do anything fun or useful.

We were exposed to some kind of lego-related programming, uhhh, thing in school, but none of us had any idea how to do anything beyond what was in the instructions. While it was fun, it was no more educational than a video game and had a similar impact.

One of us got a game programming book. We read it, tried to do a few things, and made no progress.

High school (1996 - 2000)

Our ineffective fumbling continued through high school. Due to an interest in gaming, I got interested in benchmarking, which eventually led to learning about CPUs and CPU microarchitecture. This was in the early days of Google, before Google Scholar, and before most CS/EE papers could be found online for free, so this was mostly material from enthusiast sites. Luckily, the internet was relatively young, as were the users on the sites I frequented. Much of the material on hardware was targeted at (and even written by) people like me, which made it accessible. Unfortunately, a lot of the material on programming was written by and targeted at professional programmers, things like Paul Hsieh’s optimization guide. There were some beginner-friendly guides to programming out there, but my friends and I didn’t stumble across them.

We had programming classes in high school: an introductory class that covered Visual Basic and an AP class that taught C++. Both classes were taught by someone who didn’t really know how to program or how to teach programming. My class had a couple of kids who already knew how to program and were making good money doing programming competitions on topcoder, but they failed to test out of the intro class because that test included things like a screenshot of the VB6 IDE, where you got a point for correctly identifying what each button did. The class taught about as much as you’d expect from a class where the pre-test involved identifying UI elements from an IDE.

The AP class the year after was similarly effective. About halfway through the class, a couple of students organized an independent study group which worked through an alternate textbook because the class was clearly not preparing us for the AP exam. I passed the AP exam because it was one of those multiple choice tests that’s possible to pass without knowing the material.

Although I didn’t learn much, I wouldn’t have graduated high school if not for AP classes. I failed enough individual classes that I almost didn’t have enough credits to graduate. I got the necessary credits for two reasons. First, a lot of the teachers had a deal where, if you scored well on the AP exam, they would give you a passing grade in the class (usually an A, but sometimes a B). Second, even that wouldn’t have been enough if my chemistry teacher hadn’t also changed my grade to a passing grade when he found out I did well on the AP chemistry test1.

Other than not failing out of high school, I’m not sure I got much out of my AP classes. My AP CS class actually had a net negative effect on my learning to program because the AP test let me opt out of the first two intro CS classes in college (an introduction to programming and a data structures course). In retrospect, I should have taken the intro classes, but I didn’t, which left me with huge holes in my knowledge that I didn’t really fill in for nearly a decade.

College (2000 - 2003)

Because I’d nearly failed out of high school, there was no reasonable way I could have gotten into a “good” college. Luckily, I grew up in Wisconsin, a state with a “good” school that used a formula to determine who would automatically get admitted: the GPA cutoff depended on standardized test scores, and anyone with standardized test scores above a certain mark was admitted regardless of GPA. During orientation, I talked to someone who did admissions and found out that my year was the last year they used the formula.

I majored in computer engineering and math for reasons that seem quite bad in retrospect. I had no idea what I really wanted to study. I settled on either computer engineering or engineering mechanics because both of those sounded “hard”.

I made a number of attempts to come up with better criteria for choosing a major. The most serious was when I spent a week talking to professors in an attempt to find out what day-to-day life in different fields was like. That approach had two key flaws. First, most professors don’t know what it’s like to work in industry; now that I work in industry and talk to folks in academia, I see that most academics who haven’t done stints in industry have a lot of misconceptions about what it’s like. Second, even if I managed to get accurate descriptions of different fields, it turns out that there’s a wide body of research that indicates that humans are basically hopeless at predicting which activities they’ll enjoy. Ultimately, I decided by coin flip.

Math

I wasn’t planning on majoring in math, but my freshman intro calculus course was so much fun that I ended up adding a math major. That only happened because a high-school friend of mine passed me the application form for the honors calculus sequence because he thought I might be interested in it (he’d already taken the entire calculus sequence as well as linear algebra). The professor for the class covered the material at an unusually fast pace: he finished what was supposed to be a year-long calculus textbook partway through the semester and then lectured on his research for the rest of the semester. The class was theorem-proof oriented and didn’t involve any of that yucky memorization that I’d previously associated with math. That was the first time I’d found school engaging in my entire life and it made me really look forward to going to math classes. I later found out that non-honors calculus involved a lot of memorization when the engineering school required me to go back and take calculus II, which I’d skipped because I’d already covered the material in the intro calculus course.

If I hadn’t had a friend drop the application for honors calculus in my lap, I probably wouldn’t have majored in math and it’s possible I never would have found any classes that seemed worth attending. Even as it was, all of the most engaging undergrad professors I had were math professors2 and I mostly skipped my other classes. I don’t know how much of that was because my math classes were much smaller, and therefore much more customized to the people in the class (computer engineering was very trendy at the time, and classes were overflowing), and how much was because these professors were really great teachers.

Although I occasionally get some use out of the math that I learned, most of the value was in becoming confident that I can learn and work through the math I need to solve any particular problem.

Engineering

In my engineering classes, I learned how to debug and how computers work down to the transistor level. I spent a fair amount of time skipping classes and reading about topics of interest in the library, which included things like computer arithmetic and circuit design. I still have fond memories of Koren’s Computer Arithmetic Algorithms and Chandrakasan et al.’s Design of High-Performance Microprocessor Circuits. I also started reading papers; I spent a lot of time in libraries reading physics and engineering papers that mostly didn’t make sense to me. The notable exception was systems papers, which I found to be easy reading. I distinctly remember reading the Dynamo paper (this was HP’s paper on JITs, not the more recent Amazon work of the same name), but I can’t recall any other papers I read back then.

Internships

I had two internships, one at Micron where I “worked on” flash memory, and another at IBM where I worked on the POWER6. The Micron internship was a textbook example of a bad internship. When I showed up, my manager was surprised that he was getting an intern and had nothing for me to do. After a while (perhaps a day), he found an assignment for me: press buttons on a phone. He’d managed to find a phone that used Micron flash chips; he handed it to me, told me to test it, and walked off.

After poking at the phone for an hour or two and not being able to find any obvious bugs, I walked around and found people who had tasks I could do. Most of them were only slightly less manual than “testing” a phone by mashing buttons, but I did one not-totally-uninteresting task, which was to verify that a flash chip’s controller behaved correctly. Unlike my other tasks, this was amenable to automation and I was able to write a perl script to do the testing for me.

I chose perl because someone had a perl book on their desk that I could borrow, which seemed like as good a reason as any at the time. I called up a friend of mine to tell him about this great “new” language and we implemented Age of Renaissance, a board game we’d played in high school. We didn’t finish, but perl was easy enough to use that we felt like we could write a program that actually did something interesting.

Besides learning perl, I learned that I could ask people for books and read them, and I spent most of the rest of my internship half keeping an eye on a manual task while reading the books people had lying around. Most of the books had to do with either analog circuit design or flash memory, so that’s what I learned. None of the specifics have really been useful to me in my career, but I learned two meta-items that were useful.

First, no one’s going to stop you from spending time reading at work or spending time learning (on most teams). Micron did its best to keep interns from learning by having a default policy of blocking interns from having internet access (managers could override the policy, but mine didn’t), but no one will go out of their way to prevent an intern from reading books when their other task is to randomly push buttons on a phone.

Second, I learned that there are a lot of engineering problems we can solve without anyone knowing why. One of the books I read was a survey of then-current research on flash memory. At the time, flash memory relied on some behaviors that were well characterized but not really understood. There were theories about how the underlying physical mechanisms might work, but determining which theory was correct was still an open question.

The next year, I had a much more educational internship at IBM. I was attached to a logic design team on the POWER6, and since they didn’t really know what to do with me, they had me do verification on the logic they were writing. They had a relatively new tool called SixthSense, which you can think of as a souped-up quickcheck. The obvious skill I learned was how to write tests using a fancy testing framework, but the meta-thing I learned which has been even more useful is the fact that writing a test-case generator and a checker is often much more productive than the manual test-case writing that passes for automated testing in most places.
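To make that concrete, here’s a minimal sketch of the generator-plus-checker idea in Python. Everything in it is hypothetical (the toy adder, the injected bug, the bias toward boundary values); SixthSense itself operated on hardware logic and was far more sophisticated, so treat this as an illustration of the approach rather than a description of that tool.

    import random

    def reference_add(a, b, width=8):
        """Reference model: what an 8-bit adder should do (wraps on overflow)."""
        return (a + b) % (1 << width)

    def buggy_add(a, b, width=8):
        """Hypothetical implementation under test, with a bug injected for illustration."""
        result = (a + b) % (1 << width)
        if a == 0x80 and b == 0x80:  # deliberately wrong on one corner case
            result ^= 1
        return result

    def generate_case(width=8):
        """Test-case generator: random inputs, biased toward boundary values."""
        boundary = [0, 1, (1 << width) - 1, 1 << (width - 1)]
        def pick():
            return random.choice(boundary) if random.random() < 0.5 else random.randrange(1 << width)
        return pick(), pick()

    def check(n_cases=10000):
        """Checker: compare the implementation against the reference on generated cases."""
        for _ in range(n_cases):
            a, b = generate_case()
            expected, actual = reference_add(a, b), buggy_add(a, b)
            if expected != actual:
                print("FAIL: %d + %d: expected %d, got %d" % (a, b, expected, actual))
                return False
        print("all cases passed")
        return True

    if __name__ == "__main__":
        random.seed(0)
        check()

The point isn’t the few lines of generator; it’s that once you have a generator and a checker, you get thousands of test cases for free instead of hand-writing each one.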

The other thing I encountered for the first time at IBM was version control (CVS, unfortunately). Looking back, I find it a bit surprising that not only did I never use version control in any of my classes, but I’d never met any other students who were using version control. My IBM internship was between undergrad and grad school, so I managed to get a B.S. degree without ever using or seeing anyone use version control.

Computer Science

I took a couple of CS classes. The first was algorithms, which was poorly taught and, as a result, so heavily curved that I got an A despite not learning anything at all. The course involved no programming, and while I could have done some implementation in my free time, I was much more interested in engineering and didn’t try to apply any of the material.

The second course was databases. There were a couple of programming projects, but they were all projects where you got some scaffolding and only had to implement a few key methods to make things work, so it was possible to do ok without having any idea how to program. I got involved in a competition to see who could attend the fewest possible classes, didn’t learn anything, and scraped by with a B.

Grad school (2003 - 2005)

After undergrad, I decided to go to grad school for a couple of silly reasons. One was a combination of “why not?” and the argument that most of my professors gave, which was that you’ll never go if you don’t go immediately after undergrad because it’s really hard to go back to school later. But the reason that people don’t go back later is that they have more information (they know what both school and work are like), and they almost always choose work! The other major reason was that I thought I’d get a more interesting job with a master’s degree. That’s not obviously wrong, but it appears to be untrue in general for people going into electrical engineering and programming.

I don’t know that I learned anything that I use today, either in the direct sense or in a meta sense. I had some great professors3 and I made some good friends, but I think that this wasn’t a good use of time because of two bad decisions I made at the age of 19 or 20. Rather than attend a school that had a lot of people working in an area I was interested in, I went with a school that gave me a fellowship but had only one person working in an area I was really interested in. That person left just before I started.

I ended up studying optics, and while learning a new field was a lot of fun, the experience was of no particular value to me, and I could have had fun studying something I had more of an interest in.

While I was officially studying optics, I still spent a lot of time learning unrelated things. At one point, I decided I should learn Lisp or Haskell, probably because of something Paul Graham wrote. I couldn’t find a Lisp textbook in the library, but I found a Haskell textbook. After I worked through the exercises, I had no idea how to accomplish anything practical. But I did learn about list comprehensions and got in the habit of using higher-order functions.

Based on internet comments and advice, I had the idea that learning more languages would teach me how to be a good programmer, so I worked through introductory books on Python and Ruby. As far as I can tell, this taught me basically nothing useful and I would have been much better off learning about a specific area (like algorithms or networking) than learning lots of languages.

First real job (2005 - 2013)

Towards the end of grad school, I mostly looked for, and found, electrical/computer engineering jobs. The one notable exception was Google, which called me up in order to fly me out to Mountain View for an interview. I told them that they probably had the wrong person because they hadn’t even done a phone screen, so they offered to do a phone interview instead. I took the phone interview expecting to fail because I didn’t have any CS background, and I failed as expected. In retrospect, I should have asked to interview for a hardware position, but at the time I didn’t know they had hardware positions, even though they’d been putting together their own servers and designing some of their own hardware for years.

Anyway, I ended up at a little chip company called Centaur. I was hesitant about taking the job because the interview was the easiest interview I had at any company4, which made me wonder if they had a low hiring bar, and therefore relatively weak engineers. It turns out that, on average, that’s the best group of people I’ve ever worked with. I didn’t realize it at the time, but this would later teach me that companies that claim to have brilliant engineers because they have super hard interviews are full of it, and that the interview difficulty one-upmanship a lot of companies promote is more of a prestige play than anything else.

But I’m getting ahead of myself – my first role was something they call “regression debug”, which included debugging test failures for both newly generated tests as well as regression tests. The main goal of this job was to teach new employees the ins-and-outs of the x86 architecture. At the time, Centaur’s testing was very heavily based on chip-level testing done by injecting real instructions, interrupts, etc., onto the bus, so debugging test failures taught new employees everything there is to know about x86.

The Intel x86 manual is thousands of pages long and it isn’t sufficient to implement a compatible x86 chip. When Centaur made its first x86 chip, they followed the Intel manual in perfect detail, and left all instances of undefined behavior up to individual implementors. When they got their first chip back and tried it, they found that some compilers produced code that relied on the behavior that’s technically undefined on x86, but happened to always be the same on Intel chips. While that’s technically a compiler bug, you can’t ship a chip that isn’t compatible with actually existing software, and ever since then, Centaur has implemented x86 chips by making sure that the chips match the exact behavior of Intel chips, down to matching officially undefined behavior5.

For years afterwards, I had encyclopedic knowledge of x86 and could set bits in control registers and MSRs from memory. I didn’t have a use for any of that knowledge at any future job, but the meta-skill of not being afraid of low-level hardware comes in handy pretty often, especially when I run into compiler or chip bugs. People look at you like you’re a crackpot if you say you’ve found a hardware bug, but because we were so careful about characterizing the exact behavior of Intel chips, we would regularly find bugs and then have discussions about whether we should match the bug or match the spec (the Intel manual).

The other thing I took away from the regression debug experience was a lifelong love of automation. Debugging often involves a large number of mechanical steps. After I learned enough about x86 that debugging became boring, I started automating debugging. At that point, I knew how to write simple scripts but didn’t really know how to program, so I wasn’t able to totally automate the process. However, I was able to automate enough that, for 99% of failures, I just had to glance at a quick summary to figure out what the bug was, rather than spend what might be hours debugging. That turned what was previously a full-time job into something that took maybe 30-60 minutes a day (excluding days when I’d hit a bug that involved some obscure corner of x86 I wasn’t already familiar with, or some bug that my script couldn’t give a useful summary of).
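I no longer have those scripts, but the flavor was something like the sketch below: pattern-match each failure log against known signatures and print a one-line summary, falling back to “debug by hand” for anything unrecognized. The log format and the signatures here are invented for illustration; the real x86 failure logs were much messier.

    import re
    import sys

    # Hypothetical failure signatures: a regex over the log maps to a one-line diagnosis.
    SIGNATURES = [
        (re.compile(r"mismatch .* reg=(\w+)"), "architectural register mismatch in %s"),
        (re.compile(r"timeout waiting for (\w+)"), "bus timeout waiting on %s"),
        (re.compile(r"unexpected exception (\w+)"), "spurious exception: %s"),
    ]

    def summarize(log_text):
        """Return a one-line summary for a failure log, or None if it needs a human."""
        for pattern, template in SIGNATURES:
            match = pattern.search(log_text)
            if match:
                return template % match.groups()
        return None

    if __name__ == "__main__":
        for path in sys.argv[1:]:
            with open(path) as f:
                summary = summarize(f.read())
            print("%s: %s" % (path, summary or "no known signature -- debug by hand"))

Glancing at one line per failure instead of reading every log is the difference between a full-time job and a half hour a day.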

At that point, I did two things that I’d previously learned in internships. First, I started reading at work. I began with online commentary about programming, but there wasn’t much of that, so I asked if I could expense books and read them at work. This seemed perfectly normal because a lot of other people did the same thing, and there were at least two people who averaged more than one technical book per week, including one person who averaged a technical book every 2 or 3 days.

I settled in at a pace of somewhere between a book a week and a book a month. I read a lot of engineering books that imparted some knowledge that I no longer use, now that I spend most of my time writing software; some “big idea” software engineering books like Design Patterns and Refactoring, which I didn’t really appreciate because I was just writing scripts; and a ton of books on different programming languages, which doesn’t seem to have had any impact on me.

The only book I read back then that changed how I write software in a way that’s obvious to me was The Design of Everyday Things. The core idea of the book is that while people beat themselves up for failing to use hard-to-understand interfaces, we should blame designers for designing poor interfaces, not users for failing to use them.

If you ever run into a door that you incorrectly try to pull instead of push (or vice versa) and have some spare time, try watching how other people use the door. Whenever I do this, I’ll see something like half the people who try the door use it incorrectly. That’s a design flaw!

The Design of Everyday Things has made me a lot more receptive to API and UX feedback, and a lot less tolerant of programmers who say things like “it’s fine – everyone knows that the arguments to foo and bar just have to be given in the opposite order” or “Duh! Everyone knows that you just need to click on the menu X, select Y, navigate to tab Z, open AA, go to tab AB, and then slide the setting to AC.”

I don’t think all of that reading was a waste of time, exactly, but I would have been better off picking a few sub-fields in CS or EE and learning about them, rather than reading the sorts of books O’Reilly and Manning produce.

It’s not that these books aren’t useful, it’s that almost all of them are written to make sense without any particular background beyond what any random programmer might have, and you can only get so much out of reading your 50th book targeted at random programmers. IMO, most non-academic conferences have the same problem. As a speaker, you want to give a talk that works for everyone in the audience, but a side effect of that is that many talks have relatively little educational value to experienced programmers who have been to a few conferences.

I think I got positive things out of all that reading as well, but I don’t know yet how to figure out what those things are.

As a result of my reading, I also did two things that were, in retrospect, quite harmful.

One was that I really got into functional programming and used a functional style everywhere I could. Immutability, higher-order X for any possible value of X, etc. The result was code that I could write and modify quickly that was incomprehensible to anyone but a couple of coworkers who were also into functional programming.

The second big negative was that I became convinced that perl was causing us a lot of problems. We had perl scripts that were hard to understand and modify. They’d often be thousands of lines of code with only one or two functions, no tests, and every obscure perl feature you could think of. Static! Magic sigils! Implicit everything! You name it, we used it. For me, the last straw was when I inserted a new function between two functions that didn’t explicitly pass any arguments or return values and broke the script: one of the functions was returning a value into an implicit variable that was getting read by the next function, so putting an unrelated function in between the two closely coupled functions broke the data flow.
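The perl specifics (implicit variables like $_) don’t translate directly into other languages, but the failure mode does. Here’s a rough Python analogy, with every name invented: two functions silently coupled through shared module-level state, so that calling an unrelated function between them breaks the second one.

    # Hypothetical analogy: hidden coupling through a module-level variable,
    # standing in for perl's implicit variables.
    _last_result = None

    def parse_config(path):
        global _last_result
        _last_result = {"source": path}  # leaves its result in shared state

    def log_progress(message):
        global _last_result
        print(message)
        _last_result = None  # innocently clobbers the shared state

    def apply_config():
        # Silently depends on whatever parse_config left behind.
        if _last_result is None:
            raise RuntimeError("no config parsed?")
        print("applying", _last_result)

    parse_config("build.cfg")
    log_progress("about to apply config")  # the new function "in between"
    apply_config()                         # now blows up

Nothing at the call sites hints at the dependency, which is exactly why inserting an innocent-looking function in the middle was enough to break the script.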

After that, I convinced a bunch of people to use Ruby and started using it myself. The problem was that I only managed to convince half of my team to do this. The other half kept using Perl, which resulted in language fragmentation. Worse yet, in another group, they also got fed up with Perl, but started using Python, resulting in the company having code in Perl, Python, and Ruby.

Centaur has an explicit policy of not telling people how to do anything, which precludes having team-wide or company-wide standards. Given the environment, using a “better” language seemed like a natural thing to do, but I didn’t recognize the cost of fragmentation until, later in my career, I saw a company that uses standardization to good effect.

Anyway, while I was causing horrific fragmentation, I also automated away most of my regression debug job. I got bored of spending 80% of my time at work reading and I started poking around for other things to do, which is something I continued for my entire time at Centaur. I like learning new things, so I did almost everything you can do related to chip design. The only things I didn’t do were circuit design (the TL of circuit design didn’t want a non-specialist interfering in his area) and a few roles where I was told “Dan, you can do that if you really want to, but we pay you too much to have you do it full-time.”

If I hadn’t interviewed regularly (about once a year, even though I was happy with my job), I probably would’ve wondered if I was stunting my career by doing so many different things, because the big chip companies produce specialists pretty much exclusively. But in interviews I found that my experience was valued because it was something they couldn’t get in-house. The irony is that every single role I was offered would have turned me into a specialist. Big chip companies talk about wanting their employees to move around and try different things, but when you dig into what that means, it’s that they like to have people work one very narrow role for two or three years before moving on to their next very narrow role.

For a while, I wondered if I was doomed to either eventually move to a big company and pick up a hyper-specialized role, or stay at Centaur for my entire career (not a bad fate – Centaur has, by far, the lowest attrition rate of any place I’ve worked because people like it so much). But I later found that software companies building hardware accelerators actually have generalist roles for hardware engineers, and that software companies have generalist roles for programmers, although that might be a moot point since most software folks would probably consider me an extremely niche specialist.

Regardless of whether spending a lot of time in different hardware-related roles makes you think of me as a generalist or a specialist, I picked up a lot of skills which came in handy when I worked on hardware accelerators, but that don’t really generalize to the pure software project I’m working on today. A lot of the meta-skills I learned transfer over pretty well, though.

If I had to pick the three most useful meta-skills I learned back then, I’d say they were debugging, bug tracking, and figuring out how to approach hard problems.

Debugging is a funny skill to claim to have because everyone thinks they know how to debug. For me, I wouldn’t even say that I learned how to debug at Centaur, but that I learned how to be persistent. Non-deterministic hardware bugs are so much worse than non-deterministic software bugs that I always believe I can track down software bugs. In the absolute worst case, when there’s a bug that isn’t caught in logs and can’t be caught in a debugger, I can always add tracing information until the bug becomes obvious. The same thing’s true in hardware, but “recompiling” to add tracing information takes 3 months per “recompile”; compared to that experience, tracking down a software bug that takes three months to figure out feels downright pleasant.

Bug tracking is another meta-skill that everyone thinks they have, but when I look at most projects I find that they literally don’t know what bugs they have and they lose bugs all the time due to a failure to triage bugs effectively. I didn’t even know that I’d developed this skill until after I left Centaur and saw teams that don’t know how to track bugs. At Centaur, depending on the phase of the project, we’d have between zero and a thousand open bugs. The people I worked with most closely kept a mental model of what bugs were open; this seemed totally normal at the time, and the fact that a bunch of people did this made it easy for people to be on the same page about the state of the project and which areas were ahead of schedule and which were behind.

Outside of Centaur, I find that I’m lucky to even find one person who’s tracking what the major outstanding bugs are. Until I’ve been on the team for a while, people are often uncomfortable with the idea of taking a major problem and putting it into a bug instead of fixing it immediately because they’re so used to bugs getting forgotten that they don’t trust bugs. But that’s what bug tracking is for! I view this as analogous to teams whose test coverage is so low and staging system is so flaky that they don’t trust themselves to make changes because they don’t have confidence that issues will be caught before hitting production. It’s a huge drag on productivity, but people don’t really see it until they’ve seen the alternative.

Perhaps the most important meta-skill I picked up was learning how to solve large problems. When I joined Centaur, I saw people solving problems I didn’t even know how to approach. There were folks like Glenn Henry, a fellow from IBM back when IBM was at the forefront of computing, and Terry Parks, who Glenn called the best engineer he knew at IBM. It wasn’t that they were 10x engineers; they didn’t just work faster. In fact, I can probably type 10x as quickly as Glenn (a hunt and peck typist) and could solve trivial problems that are limited by typing speed more quickly than him. But Glenn, Terry, and some of the other wizards knew how to approach problems that I couldn’t even get started on.

I can’t cite any particular a-ha moment. It was just eight years of work. When I went looking for problems to solve, Glenn would often hand me a problem that was slightly harder than I thought possible for me. I’d tell him that I didn’t think I could solve the problem, he’d tell me to try anyway, and maybe 80% of the time I’d solve the problem. We repeated that for maybe five or six years before I stopped telling Glenn that I didn’t think I could solve the problem. Even though I don’t know when it happened, I know that I eventually started thinking of myself as someone who could solve any open problem that we had.

Grad school, again (2008 - 2010)

At some point during my tenure at Centaur, I switched to being part-time and did a stint taking classes and doing a bit of research at the local university. For reasons which I can’t recall, I split my time between software engineering and CS theory.

I read a lot of software engineering papers and came to the conclusion that we know very little about what makes teams (or even individuals) productive, and that the field is unlikely to have actionable answers in the near future. I also got my name on a couple of papers that I don’t think made meaningful contributions to the state of human knowledge.

On the CS theory side of things, I took some graduate level theory classes. That was genuinely educational and I really “got” algorithms for the first time in my life, as well as complexity theory, etc. I could have gotten my name on a paper that I didn’t think made a meaningful contribution to the state of human knowledge, but my would-be co-author felt the same way and we didn’t write it up.

I originally tried grad school again because I was considering getting a PhD, but I didn’t find the work I was doing to be any more “interesting” than the work I had at Centaur, and after seeing the job outcomes of people in the program, I decided there was less than 1% chance that a PhD would provide any real value to me and went back to Centaur full time.

RC (Spring 2013)

After eight years at Centaur, I wanted to do something besides microprocessors. I had enough friends at other hardware companies to know that I’d be downgrading in basically every dimension except name recognition if I switched to another hardware company, so I started applying to software jobs.

While I was applying to jobs, I heard about RC. It sounded great, maybe even too great: when I showed my friends what people were saying about it, they thought the comments were fake. It was a great experience, and I can see why so many people raved about it, to the point where real comments sound impossibly positive. It was transformative for a lot of people; I heard a lot of exclamations like “I learned more here in 3 months here than in N years of school” or “I was totally burnt out and this was the first time I’ve been productive in a year”. It wasn’t transformative for me, but it was as fun a 3 month period as I’ve ever had, and I even learned a thing or two.

From a learning standpoint, the one major thing I got out of RC was feedback from Marek, whom I worked with for about two months. While the freedom and lack of oversight at Centaur was great for letting me develop my ability to work independently, I basically didn’t get any feedback on my work6 since they didn’t do code review while I was there, and I never really got any actionable feedback in performance reviews.

Marek is really great at giving feedback while pair programming, and working with him broke me of a number of bad habits as well as teaching me some new approaches for solving problems. At a meta level, RC is relatively more focused on pair programming than most places and it got me to pair program for the first time. I hadn’t realized how effective pair programming with someone is in terms of learning how they operate and what makes them effective. Since then, I’ve asked a number of super productive programmers to pair program and I’ve gotten something out of it every time.

Second real job (2013 - 2014)

I was in the right place at the right time to land on a project that was just transitioning from Andy Phelps’ pet 20% time project into what would later be called the Google TPU.

As far as I can tell, it was pure luck that I was the second engineer on the project as opposed to the fifth or the tenth. I got to see what it looks like to take a project from its conception and turn it into something real. There was a sense in which I got that at Centaur, but every project I worked on was either part of a CPU, or a tool whose goal was to make CPU development better. This was the first time I worked on a non-trivial project from its inception, where I wasn’t just working on part of the project but the whole thing.

That would have been educational regardless of the methodology used, but it was a particularly great learning experience because of how the design was done. We started with a lengthy discussion on what core algorithm we were going to use. After we figured out an algorithm that would give us acceptable performance, we wrote up design docs for every major module before getting serious about implementation.

Many people consider writing design docs to be a waste of time nowadays, but going through this process, which took months, had a couple big advantages. The first is that working through a design collaboratively teaches everyone on the team everyone else’s tricks. It’s a lot like the kind of skill transfer you get with pair programming, but applied to design. This was great for me, because as someone with only a decade of experience, I was one of the least experienced people in the room.

The second is that the iteration speed is much faster in the design phase, where throwing away a design just means erasing a whiteboard. Once you start coding, iterating on the design can mean throwing away code; for infrastructure projects, that can easily be person-years or even tens of person-years of work. Since working on the TPU project, I’ve seen a couple of teams on projects of similar scope insist on getting “working” code as soon as possible. In every single case, that resulted in massive delays as huge chunks of code had to be re-written, and in a few cases the project was fundamentally flawed in a way that required the team to start over from scratch.

I get that on product-y projects, where you can’t tell how much traction you’re going to get from something, you might want to get an MVP out the door and iterate, but for pure infrastructure, it’s often possible to predict how useful something will be in the design phase.

The other big thing I got out of the job was a better understanding of what’s possible when a company makes a real effort to make engineers productive. Something I’d seen repeatedly at Centaur was that someone would come in, take a look around, find the tooling to be a huge productivity sink, and then make a bunch of improvements. They’d then feel satisfied that they’d improved things a lot and then move on to other problems. Then the next new hire would come in, have the same reaction, and do the same thing. The result was tools that improved a lot while I was there, but not to the point where someone coming in would be satisfied with them. Google was the only place I’d worked where a lot of the tools seem like magic compared to what exists in the outside world7. Sure, people complain that a lot of the tooling is falling over, that there isn’t enough documentation, and that a lot of it is out of date. All true. But the situation is much better than it’s been at any other company I’ve worked at. That doesn’t seem to actually be a competitive advantage for Google’s business, but it makes the development experience really pleasant.

Third real job (2015 - Present)

It’s hard for me to tell what I’ve learned until I’ve had a chance to apply it elsewhere, so this section is a TODO until I move onto another role. I feel like I’m learning a lot right now, but I’ve noticed that feeling like I’m learning a lot at the time is weakly correlated to whether or not I learn skills that are useful in the long run. Unless I get re-org’d or someone makes me an offer I can’t refuse, it seems unlikely that I’d move on until my current project is finished, which seems likely to be at least another 6-12 months.

What about the bad stuff?

When I think about my career, it seems to me that it’s been one lucky event after the next. I’ve been unlucky a few times, but I don’t really know what to take away from the times I’ve been unlucky.

For example, I’d consider my upbringing to be mildly abusive. I remember having nights where I couldn’t sleep because I’d have nightmares about my father every time I fell asleep. Being awake during the day wasn’t a great experience, either. That’s obviously not good, and in retrospect it seems pretty directly related to the academic problems I had until I moved out, but I don’t know that I could give useful advice to a younger version of myself. Don’t be born into an abusive family? That’s something people would already do if they had any control over the matter.

Or to pick a more recent example, I once joined a team that scored a 1 on the Joel Test. The Joel Test is now considered to be obsolete because it awards points for things like “Do you have testers?” and “Do you fix bugs before writing new code?”, which aren’t considered best practices by most devs today. Of the items that aren’t controversial, many seem so obvious that they’re not worth asking about, things like:

  • Can you make a build in one step?
  • Do you make daily builds?
  • Do you have a bug database?
  • Do new candidates write code during their interview?

For anyone who cares about this kind of thing, it’s clearly not a great idea to join a team that does, at most, 1 item off of Joel’s checklist. Getting first-hand experience on a team that scored a 1 didn’t give me any new information that would make me reconsider my opinion.

You might say that I should have asked about those things. It’s true! I should have, and I probably will in the future. However, when I was hired, the TL who was against version control and other forms of automation hadn’t been hired yet, so I wouldn’t have found out about this if I’d asked. Furthermore, even if he’d already been hired, I’m still not sure I would have found out about it – this is the only time I’ve joined a team and then found that most of the factual statements made during the recruiting process were untrue. When I was on that team, every day featured a running joke between team members about how false the recruiting pitch was.

I could try to prevent similar problems in the future by asking for concrete evidence of factual claims (e.g., if someone claims the attrition rate is X, I could ask for access to the HR database to verify), but considering that I have a finite amount of time and the relatively low probability of being told outright falsehoods, I think I’m going to continue to prioritize finding out other information when I’m considering a job and just accept that there’s a tiny probability I’ll end up in a similar situation in the future.

When I look at the bad career-related stuff I’ve experienced, almost all of it falls into one of two categories: something obviously bad that was basically unavoidable, or something obviously bad that I don’t know how to reasonably avoid, given limited resources. I don’t see much to learn from that. That’s not to say that I haven’t made and learned from mistakes. I’ve made a lot of mistakes and do a lot of things differently as a result of mistakes! But my worst experiences have come out of things that I don’t know how to prevent in any reasonable way.

This also seems to be true for most people I know. For example, something I’ve seen a lot is that a friend of mine will end up with a manager whose view is that managers are people who dole out rewards and punishments (as opposed to someone who believes that managers should make the team as effective as possible, or someone who believes that managers should help people grow). When you have a manager like that, a common failure mode is that you’re given work that’s a bad fit, and then maybe you don’t do a great job because the work is a bad fit. If you ask for something that’s a better fit, that’s refused (why should you be rewarded with doing something you want when you’re not doing good work? Instead, you should be punished by having to do more of this thing you don’t like), which causes a spiral that ends in the person leaving or getting fired. In the most recent case I saw, the firing was a surprise to both the person getting fired and their closest co-workers: my friend had managed to find a role that was a good fit despite the best efforts of management; when management decided to fire my friend, they didn’t bother to consult the co-workers on the new project, who thought that my friend was doing great and had been doing great for months!

I hear a lot of stories like that, and I’m happy to listen because I like stories, but I don’t know that there’s anything actionable here. Avoid managers who prefer doling out punishments to helping their employees? Obvious but not actionable.

Conclusion

The most common sort of career advice I see is “you should do what I did because I’m successful”. It’s usually phrased differently, but that’s the gist of it. That basically never works. When I compare notes with friends and acquaintances, it’s pretty clear that my career has been unusual in a number of ways, but it’s not really clear why.

Just for example, I’ve almost always had a supportive manager who’s willing to not only let me learn whatever I want on my own, but who’s willing to expend substantial time and effort to help me improve as an engineer. Most folks I’ve talked to have never had that. Why the difference? I have no idea.

One story might be: the two times I had unsupportive managers, I quickly found other positions, whereas a lot of friends of mine will stay in roles that are a bad fit for years. Maybe I could spin it to make it sound like the moral of the story is that you should leave roles sooner than you think, but both of the bad situations I ended up in, I only ended up in because I left a role sooner than I should have, so the advice can’t be “prefer to leave roles sooner than you think”. Maybe the moral of the story should be “leave bad roles more quickly and stay in good roles longer”, but that’s so obvious that it’s not even worth stating. Every strategy that I can think of is either incorrect in the general case, or so obvious there’s no reason to talk about it.

Another story might be: I’ve learned a lot of meta-skills that are valuable, so you should learn these skills. But you probably shouldn’t. The particular set of meta-skills I’ve picked have been great for me because they’re skills I could easily pick up in places I worked (often because I had a great mentor) and because they’re things I really strongly believe in doing. Your circumstances and core beliefs are probably different from mine and you have to figure out for yourself what it makes sense to learn.

Yet another story might be: while a lot of opportunities come from serendipity, I’ve had a lot of opportunities because I spend a lot of time generating possible opportunities. When I passed around the draft of this post to some friends, basically everyone told me that I emphasized luck too much in my narrative and that all of my lucky breaks came from a combination of hard work and trying to create opportunities. While there’s a sense in which that’s true, many of my opportunities also came out of making outright bad decisions.

For example, I ended up at Centaur because I turned down the chance to work at IBM for a terrible reason! At the end of my internship, my manager made an attempt to convince me to stay on as a full-time employee, but I declined because I was going to grad school. But I was only going to grad school because I wanted to get a microprocessor logic design position, something I thought I couldn’t get with just a bachelor’s degree. But I could have gotten that position if I hadn’t turned my manager down! I’d just forgotten the reason that I’d decided to go to grad school and incorrectly used the cached decision as a reason to turn down the job. By sheer luck, that happened to work out well and I got better opportunities than anyone I know from my intern cohort who decided to take a job at IBM. Have I “mostly” been lucky or prepared? Hard to say; maybe even impossible.

Careers don’t have the logging infrastructure you’d need to determine the impact of individual decisions. Careers in programming, anyway. Many sports now track play-by-play data in a way that makes it possible to try to determine how much of success in any particular game or any particular season was luck and how much was skill.

Take baseball, which is one of the better understood sports. If we look at the statistical understanding we have of performance today, it’s clear that almost no one had a good idea about what factors made players successful 20 years ago. One thing I find particularly interesting is that we now have a much better understanding of which factors are fundamental and which factors come down to luck, and it’s not at all what almost anyone would have thought 20 years ago. We can now look at a pitcher and say something like “they’ve gotten unlucky this season, but their foo, bar, and baz rates are all great so it appears to be bad luck on balls in play as opposed to any sort of decline in skill”, and we can also make statements like “they’ve done well this season but their fundamental stats haven’t moved so it’s likely that their future performance will be no better than their past performance before this season”. We couldn’t have made a statement like that 20 years ago. And this is a sport that’s had play-by-play video available going back what seems like forever, where play-by-play stats have been kept for a century, etc.
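To make that kind of statement concrete: one standard stat of this flavor is BABIP (batting average on balls in play), which strips out home runs and strikeouts to isolate what happens once a ball is actually in the field of play. A pitcher whose opponents’ BABIP is far above the league-typical ~.300 while their strikeout and walk rates are unchanged is usually just getting unlucky. The numbers below are made up for illustration.

    # BABIP = (H - HR) / (AB - K - HR + SF): batting average on balls in play.
    def babip(hits, home_runs, at_bats, strikeouts, sac_flies):
        return (hits - home_runs) / (at_bats - strikeouts - home_runs + sac_flies)

    # Hypothetical opponents' season line: lots of hits allowed, strikeouts unchanged.
    print(round(babip(hits=180, home_runs=20, at_bats=600, strikeouts=150, sac_flies=5), 3))
    # -> 0.368, well above the league-typical ~0.300, which points to bad luck
    # on balls in play rather than a decline in skill.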

In this sport where everything is measured, it wasn’t until relatively recently that we could disambiguate between fluctuations in performance due to luck and fluctuations due to changes in skill. And then there’s programming, where it’s generally believed to be impossible to measure people’s performance and the state of the art in grading people’s performance is that you ask five people for their comments on someone and then aggregate the comments. If we’re only just now able to make comments on what’s attributable to luck and what’s attributable to skill in a sport where every last detail of someone’s work is available, how could we possibly be anywhere close to making claims about what comes down to luck vs. other factors in something as nebulous as a programming career?

In conclusion, life is messy and I don’t have any advice.

Appendix A: meta-skills I’d like to learn

Documentation

I once worked with Jared Davis, a documentation wizard whose documentation was so good that I’d go to him to understand how a module worked before I talked to the owner of the module. As far as I could tell, he wrote documentation on things he was trying to understand to make life easier for himself, but his documentation was so good that it was a force multiplier for the entire company.

Later, at Google, I noticed a curiously strong correlation between the quality of initial design docs and the success of projects. Since then, I’ve tried to write solid design docs and documentation for my projects, but I still have a ways to go.

Fixing totally broken situations

So far, I’ve only landed on teams where things are much better than average and on teams where things are much worse than average. You might think that, because there’s so much low hanging fruit on teams that are much worse than average, it should be easier to improve things on teams that are terrible, but it’s just the opposite. The places that have a lot of problems have problems because something makes it hard to fix the problems.

When I joined the team that scored a 1 on the Joel Test, it took months of campaigning just to get everyone to use version control.

I’ve never seen an environment go from “bad” to “good” and I’d be curious to know what that looks like and how it happens. Yossi Kreinin’s thesis is that only management can fix broken situations. That might be true, but I’m not quite ready to believe it just yet, even though I don’t have any evidence to the contrary.

Appendix B: other “how I became a programmer” stories

Kragen. Describes 27 years of learning to program. Heavy emphasis on conceptual phases of development (e.g., understanding how to use provided functions vs. understanding that you can write arbitrary functions)

Julia Evans. Started programming on a TI-83 in 2004. Dabbled in programming until college (2006-2011) and has been working as a professional programmer ever since. Some emphasis on the “journey” and how long it takes to improve.

Tavish Armstrong. 4th grade through college. Emphasis on particular technologies (e.g., LaTeX or Python).

Caitie McCaffrey. Started programming in AP computer science. Emphasis on how interests led to a career in programming.

Matt DeBoard. Spent 12 weeks learning Django with the help of a mentor. Emphasis on the fact that it’s possible to become a programmer without programming background.

Kristina Chodorow. Started in college. Emphasis on alternatives (math, grad school).

Michael Bernstein. Story of learning Haskell over the course of years. Emphasis on how long it took to become even minimally proficient.

Thanks to Leah Hanson, Lindsey Kuper, Kelley Eskridge, Jeshua Smith, Tejas Sapre, Joe Wilder, Adrien Lamarque, Maggie Zhou, Lisa Neigut, Steve McCarthy, Darius Bacon, Kaylyn Gibilterra, and Sarah Ransohoff for comments/criticism/discussion.


  1. If you happen to have contact information for Mr. Swanson, I’d love to be able to send a note saying thanks. [return]
  2. Wayne Dickey, Richard Brualdi, Andreas Seeger, and a visiting professor whose name escapes me. [return]
  3. I strongly recommend Andy Weiner for any class, as well as the guy who taught mathematical physics when I sat in on it, but I don’t remember who that was or if that’s even the exact name of the class. [return]
  4. with the exception of one government lab, which gave me an offer on the strength of a non-technical on-campus interview. I believe that was literally the first interview I did when I was looking for work, but they didn’t get back to me until well after interview season was over and I’d already accepted an offer. I wonder if that’s because they went down the list of candidates in some order and only got to me after N people turned them down or if they just had a six month latency on offers. [return]
  5. Because Intel sees no reason to keep its competitors informed about what it’s doing, this results in a substantial latency when matching new features. They usually announce enough information that you can implement the basic functionality, but behavior on edge cases may vary. We once had a bug (noticed and fixed well before we shipped, but still problematic) where we bought an engineering sample off of ebay and implemented some new features based on the engineering sample. This resulted in an MWAIT bug that caused Windows to hang; Intel had changed the behavior of MWAIT between shipping the engineering sample and shipping the final version.

    I recently saw a post that claims that you can get great performance per dollar by buying some engineering samples off of ebay. Don’t do this. Engineering samples regularly have bugs. Sometimes those bugs are actual bugs, and sometimes it’s just that Intel changed their minds. Either way, you really don’t want to run production systems off of engineering samples.

    [return]
  6. I occasionally got feedback by taking a problem I’d solved to someone and asking them if they had any better ideas, but that’s much less in depth than the kind of feedback I’m talking about here. [return]
  7. To pick one arbitrary concrete example, look at version control at Microsoft from someone who worked on Windows Vista:

    In small programming projects, there’s a central repository of code. Builds are produced, generally daily, from this central repository. Programmers add their changes to this central repository as they go, so the daily build is a pretty good snapshot of the current state of the product.

    In Windows, this model breaks down simply because there are far too many developers to access one central repository. So Windows has a tree of repositories: developers check in to the nodes, and periodically the changes in the nodes are integrated up one level in the hierarchy. At a different periodicity, changes are integrated down the tree from the root to the nodes. In Windows, the node I was working on was 4 levels removed from the root. The periodicity of integration decayed exponentially and unpredictably as you approached the root so it ended up that it took between 1 and 3 months for my code to get to the root node, and some multiple of that for it to reach the other nodes. It should be noted too that the only common ancestor that my team, the shell team, and the kernel team shared was the root.

    Google and Microsoft both maintained their own forks of perforce because that was the most scalable source control system available at the time. Google would go on to build piper, a distributed version control system (in the distributed systems sense, not in the git sense) that solved the scaling problem while having a dev experience that wasn’t nearly as painful. But that option wasn’t really on the table at Microsoft. In the comments to the post quoted above, a then-manager at Microsoft commented that the possible options were:

    1. federate out the source tree, and pay the forward and reverse integration taxes (primarily delay in finding build breaks), or…
    2. remove a large number of the unnecessary dependencies between the various parts of Windows, especially the circular dependencies.
    3. Both 1&2

    #1 was the winning solution in large part because it could be executed by a small team over a defined period of time. #2 would have required herding all the Windows developers (and PMs, managers, UI designers…), and is potentially an unbounded problem.

    Someone else commented, to me, that they were on an offshoot team that got the one-way latency down from months to weeks. That’s certainly an improvement, but why didn’t anyone build a system like piper? I asked that question of people who were at Microsoft at the time, and I got answers like “when we started using perforce, it was so much faster than what we’d previously had that it didn’t occur to people that we could do much better” and “perforce was so much faster than xcopy that it seemed like magic”.

    This general phenomenon, where people don’t attempt to make a major improvement because the current system is already such a huge improvement over the previous system, is something I’d seen before and even something I’d done before. This example happens to use Microsoft and Google, but please don’t read too much into that. There are systems where things are flipped around and the system at Google is curiously unwieldy compared to the same system at Microsoft.

    [return]

Is developer compensation becoming bimodal?

Developer compensation has skyrocketed since the demise of the Google et al. wage-suppressing no-hire agreement, to the point where compensation rivals and maybe even exceeds compensation in traditionally remunerative fields like law, consulting, etc.

Those fields have sharply bimodal income distributions. Are programmers in for the same fate? Let’s see what data we can find. First, let’s look at data from the National Association for Law Placement, which shows when legal salaries become bimodal.

Lawyers in 1991

First-year lawyer salaries in 1991. $40k median, trailing off with the upper end just under $90k

Median salary is $40k, with the numbers slowly trickling off until about $90k. According to the BLS $90k in 1991 is worth $160k in 2016 dollars. That’s a pretty generous starting salary.

Lawyers in 2000

First-year lawyer salaries in 2000. $50k median; bimodal with peaks at $40k and $125k

By 2000, the distribution had become bimodal. The lower peak is about the same in nominal (non-inflation-adjusted) terms, putting it substantially lower in real (inflation-adjusted) terms, and there’s an upper peak at around $125k, with almost everyone coming in under $130k. $130k in 2000 is $180k in 2016 dollars. The peak on the left has moved from roughly $30k in 1991 dollars to roughly $40k in 2000 dollars; both of those translate to roughly $55k in 2016 dollars. People in the right mode are doing better, while people in the left mode are doing about the same.

I won’t belabor the point with more graphs, but if you look at more recent data, the middle area between the two modes has hollowed out, increasing the level of inequality within the field. As a profession, lawyers have gotten hit hard by automation, and in real terms, 95%-ile offers today aren’t really better than they were in 2000. But 50%-ile and even 75%-ile offers are worse off due to the bimodal distribution.

Programmers in 2015

Enough about lawyers! What about programmers? Unfortunately, it’s hard to get good data on this. Anecdotally, it sure seems to me like we’re going down the same road. Unfortunately, almost all of the public data sources that are available, like H1B data, have salary numbers and not total compensation numbers. Since compensation at the upper end is disproportionately bonus and stock, most data sets I can find don’t capture what’s going on.

One notable exception is the new grad compensation data recorded by Dan Zhang and Jesse Collins:

First-year programmer compensation in 2016. Compensation ranges from $50k to $250k

There’s certainly a wide range here, and while it’s technically bimodal, there isn’t a huge gulf in the middle like you see in law and business. Note that this data is mostly bachelors grads with a few master’s grads. PhD numbers, which sometimes go much higher, aren’t included.

Do you know of a better (larger) source of data? This is from about 100 data points, members of the “Hackathon Hackers” Facebook group, in 2015. Dan and Jesse also have data from 2014, but it would be nice to get data over a wider timeframe and just plain more data. Also, this data is pretty clearly biased towards the high end – if you look at national averages for programmers at all levels of experience, the average comes in much lower than the average for new grads in this data set. The data here match the numbers I hear when we compete for people, but the population of “people negotiating offers at Microsoft” also isn’t representative.

If we had more representative data it’s possible that we’d see a lot more data points in the $40k to $60k range along with the data we have here, which would make the data look bimodal. It’s also possible that we’d see a lot more points in the $40k to $60k range, many more in the $70k to $80k range, some more in the $90k+ range, etc., and we’d see a smooth drop-off instead of two distinct modes.
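
To make that distinction concrete, here’s a rough sketch of how you might check whether a sample looks unimodal or bimodal. This is my own toy example, not part of the original data analysis: it assumes numpy and scipy are available and that there’s a hypothetical comp.csv with one total-compensation number per line.

    # Rough sketch: smooth the sample with a kernel density estimate and
    # count local maxima of the smoothed density. The file name and the
    # number of grid points are arbitrary choices, not from any real survey.
    import numpy as np
    from scipy.stats import gaussian_kde

    comp = np.loadtxt("comp.csv")  # one total-compensation number per line

    kde = gaussian_kde(comp)
    xs = np.linspace(comp.min(), comp.max(), 512)
    density = kde(xs)

    # A grid point is a mode if the smoothed density is higher there than
    # at either neighboring grid point.
    modes = [int(xs[i]) for i in range(1, len(xs) - 1)
             if density[i] > density[i - 1] and density[i] > density[i + 1]]
    print(f"{len(modes)} mode(s), at roughly {modes}")

Of course, with data that’s biased towards the high end, a check like this will happily report one mode even if the true distribution has two, so it’s no substitute for more representative data.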

Stepping back from the meager data we have and looking at the circumstances, “should” programmer compensation be bimodal? Most other fields that have bimodal compensation have a very different compensation structure than we see in programming. For example, top law and consulting firms have an up-or-out structure, which is effectively a tournament; that distorts compensation and makes a bimodal outcome more likely. Additionally, competitive firms pay the same rate to all first-year employees, which they determine by matching whoever appears to be paying the most. For example, this year, Cravath announced that it would pay first-year associates $180k, and many other firms followed suit. Like most high-end firms, Cravath has a salary schedule that’s based entirely on experience:

  • 0 years: $180k
  • 1 year: $190k
  • 2 years: $210k
  • 3 years: $235k
  • 4 years: $260k
  • 5 years: $280k
  • 6 years: $300k
  • 7 years: $315k

In software, compensation tends to be on a case-by-case basis, which makes it much less likely that we’ll see a sharp peak the way we do in law. If I had to guess, I’d say that while the dispersion in programmer compensation is increasing, it’s not bimodal, but I don’t really have the right data set to conclusively say anything. Please point me to any data you have that’s better.

Appendix A: please don’t send me these

  • H-1B: mostly salary only.
  • Stack Overflow survey: salary only. Also, data is skewed by the heavy web focus of the survey – I stopped doing the survey when none of their job descriptions matched anyone in my entire building, and I know other people who stopped for the same reason.
  • Glassdoor: weirdly inconsistent about whether or not it includes stock compensation. Numbers for some companies seem to, but numbers for other companies don’t.
  • O’Reilly survey: salary focused.
  • BLS: doesn’t make fine-grained distribution available.
  • IRS: they must have the data, but they’re not sharing.
  • IDG: only has averages.
  • internal company data: too narrow.
  • compensation survey companies like PayScale: when I’ve talked to people from these companies, they acknowledge that they have very poor visibility into large company compensation, but that’s what drives the upper end of the market (outside of finance).
  • #talkpay on twitter: numbers skew low1.

Appendix B: wtf?

Since we have both programmer and lawyer compensation handy, let’s compare the two. Programming pays so well that it seems a bit absurd. If you look at other careers with similar compensation, there are multiple factors that act as barriers or disincentives to entry.

If you look at law, you have to win the prestige lottery and get into a top school, which will cost hundreds of thousands of dollars. Then you have to win the grades lottery and get good enough grades to get into a top firm. And then you have to continue winning tournaments to avoid getting kicked out, which requires sacrificing any semblance of a personal life. Consulting, investment banking, etc., are similar. Compensation appears to be proportional to the level of sacrifice (e.g., investment bankers are paid better, but work even longer hours than lawyers).

Medicine seems to be a bit better from the sacrifice standpoint because there’s a cartel which limits entry into the field, but the combination of medical school and residency is still incredibly brutal compared to most jobs at places like Facebook and Google.

Programming also doesn’t have a licensing body limiting the number of programmers, nor is there the same prestige filter where you have to go to a top school to get a well-paying job. Sure, there are a lot of startups who basically only hire from MIT, Stanford, CMU, and a few other prestigious schools, and I see job ads like the following whenever I look at startups:

Our team of 14 includes 6 MIT alumni, 3 ex-Googlers, 1 Wharton MBA, 1 MIT Master in CS, 1 CMU CS alum, and 1 “20 under 20” Thiel fellow. Candidates often remark we’re the strongest team they’ve ever seen.

We’re not for everyone. We’re an enterprise SaaS company your mom will probably never hear of. We work really hard 6 days a week because we believe in the future of mobile and we want to win.

That happens. But, in programming, measuring people by markers of prestige seems to be a Silicon Valley startup thing and not a top-paying companies thing. Big companies, which pay a lot better than startups, don’t filter people out by prestige nearly as often. Not only do you not need the right degree from the right school, you also don’t need to have the right kind of degree, or any degree at all. Although it’s getting rarer to not have a degree, I still meet new hires with no experience and either no degree or a degree in an unrelated field (like sociology or philosophy).

How is it possible that programmers are paid so well without these other barriers to entry that similarly remunerative fields have? One possibility is that we have a shortage of programmers. If that’s the case, you’d expect more programmers to enter the field, bringing down compensation. CS enrollments have been at record levels recently, so this may already be happening. Another possibility is that programming is uniquely hard in some way, but that seems implausible to me. Programming doesn’t seem inherently harder than electrical engineering or chemical engineering and it certainly hasn’t gotten much harder over the past decade, but during that timeframe, programming has gone from having similar compensation to most engineering fields to paying much better. The last time I was negotiating with an EE company about offers, they remarked to me that their VPs don’t make as much as I do, and I work at a software company that pays relatively poorly compared to its peers. There’s no reason to believe that we won’t see a flow of people from engineering fields into programming until compensation is balanced.

Another possibility is that U.S. immigration laws act as a protectionist barrier to prop up programmer compensation. It seems impossible for this to last (why shouldn’t there be really valuable non-U.S. companies?), but it does appear to be somewhat true for now. When I was at Google, one thing that was remarkable to me was that they’d pay you approximately the same thing in a small midwestern town as in Silicon Valley, but they’d pay you much less in London. Whenever one of these discussions comes up, people always bring up the “fact” that SV salaries aren’t really as good as they sound because the cost of living is so high, but companies will not only match SV offers in Seattle, they’ll match them in places like Madison, Wisconsin. My best guess for why this happens is that someone in the midwest can credibly threaten to move to SV and take a job at any company there, whereas someone in London can’t2. While we seem unlikely to loosen current immigration restrictions, our immigration restrictions have caused and continue to cause people who would otherwise have founded companies in the U.S. to found companies elsewhere. Given that the U.S. doesn’t have a monopoly on people who found startups and that we do our best to keep people who want to found startups here out, it seems inevitable that there will eventually be Facebooks and Googles founded outside of the U.S. who compete for programmers the same way companies compete inside the U.S.

Another theory that I’ve heard a lot lately is that programmers at large companies get paid a lot because of the phenomenon described in Kremer’s O-ring model. This model assumes that productivity is multiplicative. If your co-workers are better, you’re more productive and produce more value. If that’s the case, you expect a kind of assortative matching where you end up with high-skill firms that pay better, and low-skill firms that pay worse. This model has a kind of intuitive appeal to it, but it can’t explain why programming compensation has higher dispersion than (for example) electrical engineering compensation. With the prevalence of open source, it’s much easier to utilize the work of productive people outside your firm than in most fields. This model should apply less to programming than to most other engineering fields, but the dispersion in compensation is higher.
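
To see why multiplicative productivity pushes towards that kind of sorting, here’s a toy sketch (my own made-up numbers, not Kremer’s) where team output is proportional to the product of worker skill levels:

    # Toy O-ring-style production: output is proportional to the product
    # of worker skill levels. All numbers are made up for illustration.
    def output(skills, value_per_unit=1_000_000):
        total = 1.0
        for q in skills:
            total *= q
        return value_per_unit * total

    high_skill_team = [0.9, 0.9, 0.9]
    low_skill_team = [0.5, 0.5, 0.5]

    # How much is upgrading one open slot from a 0.5 hire to a 0.9 hire worth?
    gain_high = output(high_skill_team + [0.9]) - output(high_skill_team + [0.5])
    gain_low = output(low_skill_team + [0.9]) - output(low_skill_team + [0.5])
    print(f"gain on the high-skill team: {gain_high:,.0f}")  # ~292k
    print(f"gain on the low-skill team:  {gain_low:,.0f}")   # ~50k

The high-skill team gains much more from the same hire, so it can outbid everyone else for skilled workers, which is the sorting (and the pay dispersion) the model predicts.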

I don’t understand this at all and would love to hear a compelling theory for why programming “should” pay more than other similar fields, or why it should pay as much as fields that have much higher barriers to entry.


  1. People often worry that comp surveys will skew high because people want to brag, but the reality seems to be that numbers skew low because people feel embarrassed about sounding like they’re bragging. I have a theory that you can see this reflected in the prices of other goods. For example, if you look at house prices, they’re generally predictable based on location, square footage, amenities, and so on. But there’s a significant penalty for having the largest house on the block, for what (I suspect) is the same reason people with the highest compensation disproportionately don’t participate in #talkpay: people don’t want to admit that they have the highest pay, have the biggest house, or drive the fanciest car. Well, some people do, but on average, bragging about that stuff is seen as quite gauche. [return]
  2. There’s a funny move some companies will do where they station the new employee in Canada for a year before importing them into the U.S., which gets them into a visa process that’s less competitive. But this is enough of a hassle that most employees balk at the idea. [return]

Why's that company so big? I could do that in a weekend

I can’t think of a single large software company that doesn’t regularly draw internet comments of the form “What do all the employees do? I could build their product myself.” Benjamin Pollack and Jeff Atwood called out people who do that with Stack Overflow. But Stack Overflow is relatively obviously lean, so the general response is something like “oh, sure maybe Stack Overflow is lean, but FooCorp must really be bloated”. And since most people have relatively little visibility into FooCorp, for any given value of FooCorp, that sounds like a plausible statement. After all, what product could possibly require hundreds, or even thousands, of engineers?

A few years ago, in the wake of the rapgenius SEO controversy, a number of folks called for someone to write a better Google. Alex Clemmer responded that maybe building a better Google is a non-trivial problem. Considering how much of Google’s $500B market cap comes from search, and how much money has been spent by tens (hundreds?) of competitors in an attempt to capture some of that value, it seems plausible to me that search isn’t a trivial problem. But in the comments on Alex’s posts, multiple people respond and say that Lucene basically does the same thing Google does and that Lucene is poised to surpass Google’s capabilities in the next few years.

What would Lucene at Google’s size look like? If we do a naive back-of-the-envelope calculation on what it would take to index a significant fraction of the internet (often estimated to be 1 trillion (T) or 10T documents), we might expect a 1T document index to cost something like $10B1. That’s not a feasible startup, so let’s say that instead of trying to index 1T documents, we want to maintain an artisanal search index of 1B documents. Then our cost comes down to $12M/yr. That’s not so bad – plenty of startups burn through more than that every year. While we’re in VC-funded hypergrowth mode, that’s fine, but once we have a real business, we’ll want to consider trying to save money. At $12M/yr for the index, a performance improvement that lets us trim our costs by 3% is worth $360k/yr. With those kinds of costs, it’s surely worth it to have at least one engineer working full-time on optimization, if not more.

Businesses that actually care about turning a profit will spend a lot of time (hence, a lot of engineers) working on optimizing systems, even if an MVP for the system could have been built in a weekend. There’s also a wide body of research that’s found that decreasing latency has a roughly linear effect on revenue over a pretty wide range of latencies and businesses. Businesses should keep adding engineers to work on optimization until the cost of adding an engineer equals the revenue gain plus the cost savings at the margin. This is often many more engineers than people realize.

And that’s just performance. Features also matter: when I talk to engineers working on basically any product at any company, they’ll often find that there are seemingly trivial individual features that can add integer percentage points to revenue. Just as with performance, people underestimate how many engineers you can add to a product before engineers stop paying for themselves.

Additionally, features are often much more complex than outsiders realize. If we look at search, how do we make sure that different forms of dates and phone numbers give the same results? How about internationalization? Each language has unique quirks that have to be accounted for. In French, “l’foo” should often match “un foo” and vice versa, but American search engines from the 90s didn’t actually handle that correctly. How about tokenizing Chinese queries, where words don’t have spaces between them, and sentences don’t have unique tokenizations? How about Japanese, where queries can easily contain four different alphabets? How about handling Arabic, which is mostly read right-to-left, except for the bits that are read left-to-right? And that’s not even the most complicated part of handling Arabic! It’s fine to ignore this stuff for a weekend-project MVP, but ignoring it in a real business means ignoring the majority of the market! Some of these are handled ok by open source projects, but many of the problems involve open research problems.
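
To make the tokenization point concrete, here’s a toy sketch (mine, not any real search engine’s tokenizer) of how far naive whitespace tokenization gets you outside of English:

    # Toy tokenizer: split on whitespace, the way a weekend-project MVP might.
    def naive_tokens(text):
        return text.lower().split()

    # French elision: a query for "un foo" should usually match "l'foo", but
    # the whitespace tokenizer produces completely different tokens.
    print(naive_tokens("l'foo"))     # ["l'foo"]
    print(naive_tokens("un foo"))    # ['un', 'foo']

    # Chinese has no spaces between words, so an entire sentence comes back
    # as one giant token and only an exact-sentence query would match it.
    print(naive_tokens("我喜欢看书"))  # ['我喜欢看书']

Real tokenization for these languages (let alone Japanese or Arabic) requires language-specific analysis, which is exactly the kind of work that never shows up in a weekend estimate.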

There’s also security! If you don’t “bloat” your company by hiring security people, you’ll end up like hotmail or yahoo, where your product is better known for how often it’s hacked than for any of its other features.

Everything we’ve looked at so far is a technical problem. Compared to organizational problems, technical problems are straightforward. Distributed systems are considered hard because real systems might drop something like 0.1% of messages, corrupt an even smaller percentage of messages, and see latencies in the microsecond to millisecond range. When I talk to higher-ups and compare what they think they’re saying to what my coworkers think they’re saying, I find that the rate of lost messages is well over 50%, every message gets corrupted, and latency can be months or years2. When people imagine how long it should take to build something, they’re often imagining a team that works perfectly and spends 100% of its time coding. But that’s impossible to scale up. The question isn’t whether or not there will be inefficiencies, but how much inefficiency. A company that could eliminate organizational inefficiency would be a larger innovation than any tech startup, ever. But when doing the math on how many employees a company “should” have, people usually assume that the company is an efficient organization.

This post happens to use search as an example because I ran across some people who claimed that Lucene was going to surpass Google’s capabilities any day now, but there’s nothing about this post that’s unique to search. If you talk to people in almost any field, you’ll hear stories about how people wildly underestimate the complexity of the problems in the field. The point here isn’t that it would be impossible for a small team to build something better than Google search. It’s entirely plausible that someone will have an innovation as great as PageRank, and that a small team could turn that into a viable company. But once that company is past the VC-funded hypergrowth phase and wants to maximize its profits, it will end up with a multi-thousand-person platforms org, just like Google’s, unless the company wants to leave hundreds of millions or billions of dollars a year on the table due to hardware and software inefficiency. And the company will want to handle languages like Thai, Arabic, Chinese, and Japanese, each of which is non-trivial. And the company will want to have relatively good security. And there are the hundreds of little features that users don’t even realize are there, each of which provides a noticeable increase in revenue. It’s “obvious” that companies should outsource their billing, except that when you talk to companies that handle their own billing, they can point to individual features that increase conversion by single or double digit percentages that they can’t get from Stripe or Braintree. That fifty-person billing team is totally worth it, beyond a certain size. And then there’s sales, which most engineers don’t even think of3, not to mention research (which, almost by definition, involves a lot of bets that don’t pan out).

It’s not that all of those things are necessary to run a service at all; it’s that almost every large service is leaving money on the table if they don’t seriously address those things. This reminds me of a common fallacy we see in unreliable systems, where people build the happy path with the idea that the happy path is the “real” work, and that error handling can be tacked on later. For reliable systems, error handling is more work than the happy path. The same thing is true for large services – all of this stuff that people don’t think of as “real” work is more work than the core service4.

I’m experimenting with writing blog posts stream-of-consciousness, without much editing. Both this post and my last post were written that way. Let me know what you think of these posts relative to my “normal” posts!

Thanks to Leah Hanson, Joel Wilder, Kay Rhodes, and Ivar Refsdal for corrections.


  1. In public benchmarks, Lucene appears to get something like 30 QPS - 40 QPS when indexing wikipedia on a single machine. See anandtech, Haque et al., ASPLOS 2015, etc. I’ve seen claims that Lucene can run 10x faster than that on wikipedia but I haven’t seen a reproducible benchmark setup showing that, so let’s say that we can expect to get something like 30 QPS - 300 QPS if we index a wikipedia-sized corpus on one machine.

    Those benchmarks appear to be indexing English Wikipedia, articles only. That’s roughly 50 GB and approximately 5m documents. Estimates of the size of the internet vary, but public estimates often fall into the range of 1 trillion (T) to 10T documents. Say we want to index 1T documents, and we can put 5m documents per machine: we need 1T/5m = 200k machines to handle all of the extra documents. None of the off-the-shelf sharding/distribution solutions that are commonly used with Lucene can scale to 200k machines, but let’s posit that we can solve that problem and can operate a search cluster with 200k machines. We’ll also need to have some replication so that queries don’t return bad results if a single machine goes down. If we replicate every machine once, that’s 400k machines. But that’s 400k machines for just one cluster. If we only have one cluster sitting in some location, users in other geographic regions will experience bad latency to our service, so maybe we want to have ten such clusters. If we have ten such clusters, that’s 4M machines.

    In the Anandtech wikipedia benchmark, they get 30 QPS out of a single-socket Broadwell Xeon D with 64 GB of RAM (enough to fit the index in memory). If we don’t want to employ the army of people necessary to build out and run 4M machines worth of datacenters, AFAICT the cheapest VM that’s plausibly at least as “good” as that machine is the GCE n1-highmem-8, which goes for $0.352/hr. If we multiply that out by 4M machines, that’s a little over $1.4M an hour, or a little more than $12B a year for a service that can’t even get within an order of magnitude of the query rate or latency necessary to run a service like Google or Bing. And that’s just for the index – even a minimal search engine also requires crawling. BTW, people will often claim that this is easy because they have much larger indices in Lucene, but with a posting-list based algorithm like Lucene, you can very roughly think of query rate as inversely related to the number of postings. When you ask these people with their giant indices what their query rate is, you’ll inevitably find that it’s glacial by internet standards. For reference, the core of twitter was a rails app that could handle something like 200 QPS until 2008. If you look at what most people handle with Lucene, it’s often well under 1 QPS, with documents that are much smaller than the average web document, using configurations that damage search relevance too much to be used in commercial search engines (e.g., using stop words). That’s fine, but the fact that people think that sort of experience is somehow relevant to web search is indicative of the problem this post is discussing.
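
    Putting that arithmetic into a few lines (same rough assumptions as above: ~5m wikipedia-sized documents per machine, 2x replication, ten clusters, $0.352/hr per VM; this is napkin math, not real pricing):

        # Back-of-the-envelope index cost using the assumptions above.
        docs_per_machine = 5_000_000
        vm_cost_per_hour = 0.352
        hours_per_year = 24 * 365

        def yearly_index_cost(total_docs, replication=2, clusters=10):
            machines = total_docs / docs_per_machine * replication * clusters
            return machines * vm_cost_per_hour * hours_per_year

        print(f"1B docs: ${yearly_index_cost(1e9):,.0f}/yr")   # roughly $12M/yr
        print(f"1T docs: ${yearly_index_cost(1e12):,.0f}/yr")  # roughly $12B/yr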

    That also assumes that we won’t hit any other scaling problem if we can make 400k VM clusters. But finding an open source index which will scale not only to the number of documents on the internet, but also the number of terms, is non-trivial. Before you read the next section, try guessing how many unique terms there are online. And then if we shard the internet so that we have 5m documents per machine, try guessing how many unique terms you expect to see per shard.

    When I ask this question, I often hear guesses like “ten million” or “ten billion”. But without even looking at the entire internet, just looking at one single document on github, we can find a document with fifty million unique terms:

    Crista Lopes: The largest C++ file we found in GitHub has 528MB, 57 lines of code. Contains the first 50,847,534 primes, all hard coded into an array.

    So there are definitely more than ten million unique terms on the entire internet! In fact, there’s a website out there that has all primes under one trillion. I believe there are something like thirty-seven billion of those. If that website falls into one shard of our index, we’d expect to see more than thirty-seven billion terms in a single shard; that’s more than most people guess we’ll see on the entire internet, and that’s just in one shard that happens to contain one somewhat pathological site. If we try to put the internet into any existing open source index that I know of, not only will it not be able to scale out enough horizontally, many shards will contain data weird enough to make the entire shard fall over if we run a query. That’s nothing against open source software; like any software, it’s designed to satisfy the needs of its users, and none of its users do anything like index the entire internet. As businesses scale up, they run into funny corner cases that people without exposure to the particular domain don’t anticipate.

    People often object that you don’t need to index all of this weird stuff. There have been efforts to build web search engines that only index the “important” stuff, but it turns out that if you ask people to evaluate search engines, some people will type in the weirdest queries they can think of and base their evaluation off of that. And others type in what they think of as normal queries for their day-to-day work even if they seem weird to you (e.g., a biologist might query for GTGACCTTGGGCAAGTTACTTAACCTCTCTGTGCCTCAGTTTCCTCATCTGTAAAATGGGGATAATA). If you want to be anything but a tiny niche player, you have to handle not only the weirdest stuff you can think of, but the weirdest stuff that many people can think of.

    [return]
  2. Recently, I was curious why an org that’s notorious for producing unreliable services produces so many unreliable services. When I asked around about why, I found that upper management was afraid of sending out any sort of positive message about reliability because they were afraid that people would use that as an excuse to slip schedules. Upper management changed their message to include reliability about a year ago, but if you talk to individual contributors, they still believe that the message is that features are the #1 priority and slowing down on features to make things more reliable is bad for your career (and, based on who’s getting promoted, the individual contributors appear to be right). Maybe in another year, the org will have really gotten the message through to the people who hand out promotions, and in another couple of years, enough software will have been written with reliability in mind that they’ll actually have reliable services. Maybe. That’s just the first-order effect. The second-order effect is that their policies have caused a lot of people who care about reliability to go to companies that care more about reliability and less about demo-ing shiny new features. They might be able to fix that in a decade. Maybe. That’s made harder by the fact that the org is in a company that’s well known for having PMs drive features above all else. If that reputation is possible to change, it will probably take multiple decades. [return]
  3. For a lot of products, the sales team is more important than the engineering team. If we build out something rivaling Google search, we’ll probably also end up with the infrastructure required to sell a competitive cloud offering. Google actually tried to do that without having a serious enterprise sales force and the result was that AWS and Azure basically split the enterprise market between them. [return]
  4. This isn’t to say that there isn’t waste or that different companies don’t have different levels of waste. I see waste everywhere I look, but it’s usually not what people on the outside think of as waste. Whenever I read outsider’s descriptions of what’s wasteful at the companies I’ve worked at, they’re almost inevitably wrong. Friends of mine who work at other places also describe the same dynamic. [return]

Developer hiring and the market for lemons

Joel Spolsky has a classic blog post on “Finding Great Developers” where he popularized the meme that great developers are impossible to find, a corollary of which is that if you can find someone, they’re not great. Joel writes,

The great software developers, indeed, the best people in every field, are quite simply never on the market.

The average great software developer will apply for, total, maybe, four jobs in their entire career.

If you’re lucky, if you’re really lucky, they show up on the open job market once, when, say, their spouse decides to accept a medical internship in Anchorage and they actually send their resume out to what they think are the few places they’d like to work at in Anchorage.

But for the most part, great developers (and this is almost a tautology) are, uh, great, (ok, it is a tautology), and, usually, prospective employers recognize their greatness quickly, which means, basically, they get to work wherever they want, so they honestly don’t send out a lot of resumes or apply for a lot of jobs.

Does this sound like the kind of person you want to hire? It should. The corollary of that rule–the rule that the great people are never on the market–is that the bad people–the seriously unqualified–are on the market quite a lot. They get fired all the time, because they can’t do their job. Their companies fail–sometimes because any company that would hire them would probably also hire a lot of unqualified programmers, so it all adds up to failure–but sometimes because they actually are so unqualified that they ruined the company. Yep, it happens.

These morbidly unqualified people rarely get jobs, thankfully, but they do keep applying, and when they apply, they go to Monster.com and check off 300 or 1000 jobs at once trying to win the lottery.

Astute readers, I expect, will point out that I’m leaving out the largest group yet, the solid, competent people. They’re on the market more than the great people, but less than the incompetent, and all in all they will show up in small numbers in your 1000 resume pile, but for the most part, almost every hiring manager in Palo Alto right now with 1000 resumes on their desk has the same exact set of 970 resumes from the same minority of 970 incompetent people that are applying for every job in Palo Alto, and probably will be for life, and only 30 resumes even worth considering, of which maybe, rarely, one is a great programmer. OK, maybe not even one.

Joel’s claim is basically that “great” developers won’t have that many jobs compared to “bad” developers because companies will try to keep “great” developers. Joel also posits that companies can recognize prospective “great” developers easily. But these two statements are hard to reconcile. If it’s so easy to identify prospective “great” developers, why not try to recruit them? You could just as easily make the case that “great” developers are overrepresented in the market because they have better opportunities and it’s the “bad” developers who will cling to their jobs. This kind of adverse selection is common in companies that are declining; I saw that in my intern cohort at IBM1, among other places.

Should “good” developers be overrepresented in the market or underrepresented? If we listen to the anecdotal griping about hiring, we might ask if the market for developers is a market for lemons. This idea goes back to Akerlof’s Nobel-prize-winning 1970 paper, “The Market for ‘Lemons’: Quality Uncertainty and the Market Mechanism”. Akerlof takes used car sales as an example, splitting the market into good used cars and bad used cars (bad cars are called “lemons”). If there’s no way to distinguish between good cars and lemons, good cars and lemons will sell for the same price. Since buyers can’t distinguish between good cars and bad cars, the price they’re willing to pay is based on the average quality of cars on the market. Since owners know if their car is a lemon or not, owners of non-lemons won’t sell because the average price is driven down by the existence of lemons. This results in a feedback loop which causes lemons to be the only thing available.
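
To make the feedback loop concrete, here’s a tiny simulation sketch with toy numbers of my own choosing (not Akerlof’s): quality is uniform on [0, 1], sellers value a car at its quality, and buyers will pay 1.5x the average quality they expect to find on the market.

    # Toy Akerlof-style unraveling: buyers pay 1.5x the expected (average)
    # quality of cars on the market; owners of cars worth more than that
    # price withdraw, which drags the average down further.
    import random

    random.seed(0)
    cars = [random.random() for _ in range(100_000)]  # quality of each car

    for _ in range(20):
        if not cars:
            break
        avg_quality = sum(cars) / len(cars)
        price = 1.5 * avg_quality
        cars = [q for q in cars if q <= price]  # better cars leave the market

    if cars:
        print(f"{len(cars)} of 100,000 cars still for sale; "
              f"best remaining quality: {max(cars):.3f}")
    else:
        print("no cars left on the market")

After a handful of rounds, essentially only the worst cars are left, which is the unraveling Akerlof describes.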

This model is certainly different from Joel’s model. Joel’s model assumes that “great” developers are sticky – that they stay at each job for a long time. This comes from two assumptions; first, that it’s easy for prospective employers to identify who’s “great”, and second, that once someone is identified as “great”, their current employer will do anything to keep them (as in the market for lemons). But the first assumption alone is enough to prevent the developer job market from being a market for lemons. If you can tell that a potential employee is great, you can simply go and offer them twice as much as they’re currently making (something that I’ve seen actually happen). You need an information asymmetry to create a market for lemons, and Joel posits that there’s no information asymmetry.

If we put aside Joel’s argument and look at the job market, there’s incomplete information, but both current and prospective employers have incomplete information, and whose information is better varies widely. It’s actually quite common for prospective employers to have better information than current employers!

Just for example, there’s someone I’ve worked with, let’s call him Bob, who’s saved two different projects by doing the grunt work necessary to keep the project from totally imploding. The projects were both declared successes, promotions went out, they did a big PR blitz which involves seeding articles in all the usual suspects, like Wired, and so on and so forth. That’s worked out great for the people who are good at taking credit for things, but it hasn’t worked out so well for Bob. In fact, someone else I’ve worked with recently mentioned to me that management keeps asking him why Bob takes so long to do simple tasks. The answer is that Bob’s busy making sure the services he works on don’t have global outages when they launch, but that’s not the kind of thing you get credit for in Bob’s org. The result of that is that Bob has a network who knows that he’s great, which makes it easy for him to get a job anywhere else at market rate. But his management chain has no idea, and based on what I’ve seen of offers today, they’re paying him about half what he could make elsewhere. There’s no shortage of cases where information transfer inside a company is so poor that external management has a better view of someone’s productivity than internal management. I have one particular example in mind, but if I just think of the Bob archetype, off the top of my head, I know of four people who are currently in similar situations. It helps that I currently work at a company that’s notorious for being dysfunctional in this exact way, but this happens everywhere. When I worked at a small company, we regularly hired great engineers from big companies that were too clueless to know what kind of talent they had.

Another problem with the idea that “great” developers are sticky is that this assumes that companies are capable of creating groups that developers want to work for on demand. This is usually not the case. Just for example, I once joined a team where the TL was pretty strongly against using version control or having tests. As a result of those (and other) practices, it took five devs one year to produce 10k lines of kinda-sorta working code for a straightforward problem. Additionally, it was a pressure cooker where people were expected to put in 80+ hour weeks, where the PM would shame people into putting in longer hours. Within a year, three of the seven people who were on the team when I joined had left; two of them went to different companies. The company didn’t want to lose those two people, but it wasn’t capable of creating an environment that would keep them.

Around when I joined that team, a friend of mine joined a really great team. They do work that materially impacts the world, they have room for freedom and creativity, a large component of their jobs involves learning new and interesting things, and so on and so forth. Whenever I heard about someone who was looking for work, I’d forward them that team. That team is now full for the foreseeable future because everyone whose network included that team forwarded people into that team. But if you look at the team that lost three out of seven people in a year, that team is hiring. A lot. The result of this dynamic is that, as a dev, if you join a random team, you’re overwhelmingly likely to join a team that has a lot of churn. Additionally, if you know of a good team, it’s likely to be full.

Joel’s model implicitly assumes that, proportionally, there are many more dysfunctional developers than dysfunctional work environments.

At the last conference I attended, I asked most people I met two questions:

  1. Do you know of any companies that aren’t highly dysfunctional?
  2. Do you know of any particular teams that are great and are hiring?

Not one single person told me that their company meets the criteria in (1). A few people suggested that, maybe, Dropbox is ok, or that, maybe, Jane Street is ok, but the answers were of the form “I know a few people there and I haven’t heard any terrible horror stories yet, plus I sometimes hear good stories”, not “that company is great and you should definitely work there”. Most people said that they didn’t know of any companies that weren’t a total mess.

A few people had suggestions for (2), but the most common answer was something like “LOL no, if I knew that I’d go work there”. The second most common answer was of the form “I know some people on the Google Brain team and it sounds great”. There are a few teams that are well known for being great places to work, but they’re so few and far between that it’s basically impossible to get a job on one of them. A few people knew of actual teams that they’d strongly recommend who were hiring, but that was rare. Much rarer than finding a developer who I’d want to work with who would consider moving. If I flipped the question around and asked if they knew of any good developers who were looking for work, the answer was usually “yes”2.

Another problem with the idea that “great” developers are impossible to find because they join companies and then stick around is that developers (and companies) aren’t immutable. Because I’ve been lucky enough to work in environments that allow people to really flourish, I’ve seen a lot of people go from unremarkable to amazing. Because most companies invest pretty much nothing in helping people, you can do really well in this area without investing much effort.

On the flip side, I’ve seen entire teams of devs go on the market because their environment changed. Just for example, I used to know a lot of people who worked at company X under Marc Yun. It was the kind of place that has low attrition because people really enjoy working there. And then Marc left. Over the next two years, literally everyone I knew who worked there left. This one change both created a lemon in the searching-for-a-team job market and put a bunch of good developers on the market. This kind of thing happens all the time, even more now than in the past because of today’s acquisition-heavy environment.

Is developer hiring a market for lemons? Well, it depends on what you mean by that. Both developers and hiring managers have incomplete information. It’s not obvious if having a market for lemons in one direction makes the other direction better or worse. The fact that joining a new team is uncertain makes developers less likely to leave existing teams, which makes it harder to hire developers. But the fact that developers often join teams which they dislike makes it easier to hire developers. What’s the net effect of that? I have no idea.

From where I’m standing, it seems really hard to find a good manager/team, and I don’t know of any replicable strategy for doing so; I have a lot of sympathy for people who can’t find a good fit because I get how hard that is. But I have seen replicable strategies for hiring, so I don’t have nearly as much sympathy for hiring managers who complain that hiring “great” developers is impossible.

When a hiring manager complains about hiring, in every single case I’ve seen so far, the hiring manager has one of the following problems:

  1. They pay too little. The last time I went looking for work, I found a 6x difference in compensation between companies who might hire me in the same geographic region. Basically all of the companies thought that they were competitive, even when they were at the bottom end of the range. I don’t know what it is, but companies always seem to think that they pay well, even when they’re not even close to being in the right range. Almost everyone I talk to tells me that they pay as much as any reasonable company. Sure, there are some companies out there that pay a bit more, but they’re overpaying! You can actually see this if you read Joel’s writing – back when he wrote the post I’m quoting above, he talked about how well Fog Creek paid. A couple years later, he complained that Google was overpaying for college kids with no experience, and more recently he’s pretty much said that you don’t want to work at companies that pay well.

  2. They pass on good or even “great” developers3. Earlier, I claimed that I knew lots of good developers who are looking for work. You might ask, if there are so many good developers looking for work, why’s it so hard to find them? Joel claims that out of a 1000 resumes, maybe 30 people will be “solid” and 970 will be “incompetent”. It seems to me it’s more like 200 will be solid and 20 will be really good. It’s just that almost everyone uses the same filters, so everyone ends up fighting over the 30 people who they think are solid.

    Matasano famously solved their hiring problem by using a different set of filters and getting a different set of people. Despite the resounding success of their strategy, pretty much everyone insists on sticking with the standard strategy of picking people with brand name pedigrees and running basically the same interview process as everyone else, bidding up the price of folks who are trendy and ignoring everyone else.

    If I look at developers I know who are in high-demand today, a large fraction of them went through a multi-year period where they were underemployed and practically begging for interesting work. These people are very easy to hire if you can find them.

  3. They’re trying to hire for some combination of rare skills. Right now, if you’re trying to hire for someone with experience in deep learning and, well, anything else, you’re going to have a bad time.

  4. They’re much more dysfunctional than they realize. I know one hiring manager who complains about how hard it is to hire. What he doesn’t realize is that literally everyone on his team is bitterly unhappy and a significant fraction of his team gives anti-referrals to friends and tells them to stay away.

    That’s an extreme case, but it’s quite common to see a VP or founder baffled by why hiring is so hard when employees consider the place to be mediocre or even bad.

Of these problems, (1), low pay, is both the most common and the simplest to fix.

In the past few years, Oracle and Alibaba have spun up new cloud computing groups in Seattle. This is a relatively competitive area, and both companies have reputations that work against them when hiring4. If you believe the complaints about how hard it is to hire, you wouldn’t think one company, let alone two, could spin up entire cloud teams in Seattle. Both companies solved the problem by paying substantially more than their competitors were offering for people with similar experience. Alibaba became known for such generous offers that when I was negotiating my offer from Microsoft, MS told me that they’d match an offer from any company except Alibaba. I believe Oracle and Alibaba have hired hundreds of engineers over the past few years.

Most companies don’t need to hire anywhere near hundreds of people; they can pay competitively without hiring so many developers that the entire market moves upwards, but they still refuse to do so, while complaining about how hard it is to hire.

(2), filtering out good potential employees, seems like the modern version of “no one ever got fired for hiring IBM”. If you hire someone with a trendy background who’s good at traditional coding interviews and they don’t work out, who could blame you? And no one’s going to notice all the people you missed out on. Like (1), this is something that almost everyone thinks they do well and they’ll say things like “we’d have to lower our bar to hire more people, and no one wants that”. But I’ve never worked at a place that doesn’t filter out a lot of people who end up doing great work elsewhere. I’ve tried to get underrated programmers5 hired at places I’ve worked, and I’ve literally never succeeded in getting one hired. Once, someone I failed to get hired managed to get a job at Google after something like four years being underemployed (and is a star there). That guy then got me hired at Google. Not hiring that guy didn’t only cost them my brilliant friend, it eventually cost them me!

BTW, this illustrates a problem with Joel’s idea that “great” devs never apply for jobs. There’s often a long time period where a “great” dev has an extremely hard time getting hired, even through their network who knows that they’re great, because they don’t look like what people think “great” developers look like. Additionally, Google, which has heavily studied which hiring channels give good results, has found that referrals and internal recommendations don’t actually generate much signal. While people will refer “great” devs, they’ll also refer terrible ones. The referral bonus scheme that most companies set up skews incentives in a way that makes referrals worse than you might expect. Because of this and other problems, many companies don’t weight referrals particularly heavily, and “great” developers still go through the normal hiring process, just like everyone else.

(3), needing a weird combination of skills, can be solved by hiring people with half or a third of the expertise you need and training people. People don’t seem to need much convincing on this one, and I see this happen all the time.

(4), dysfunction, seems hard to fix. If I knew how to do that, I’d be a manager.

As a dev, it seems to me that teams I know of that are actually good environments that pay well have no problems hiring, and that teams that have trouble hiring can pretty easily solve that problem. But I’m biased. I’m not a hiring manager. There’s probably some hiring manager out there thinking: “every developer I know who complains that it’s hard to find a good team has one of these four obvious problems; if only my problems were that easy to solve!”

Thanks to Leah Hanson, David Turner, Tim Abbott, Vaibhav Sagar, Victor Felder, Ezekiel Smithburg, Juliano Bortolozzo Solanho, Stephen Tu, Pierre-Yves Baccou, Jorge Montero, Ben Kuhn, and Lindsey Kuper for comments and corrections.

If you liked this post, you’d probably enjoy this other post on the bogosity of claims that there can’t possibly be discrimination in tech hiring.


  1. The folks who stayed describe an environment that’s mostly missing mid-level people they’d want to work with. There are lifers who’ve been there forever and will be there until retirement, and there are new grads who land there at random. But, compared to their competitors, there are relatively few people with 5-15 years of experience. The person I knew who lasted the longest stayed until the 8-year mark, but he started interviewing with an eye on leaving when he found out the other person on his team who was competent was interviewing; neither one wanted to be the only person on the team doing any work, so they raced to get out the door first. [return]
  2. This section kinda makes it sound like I’m looking for work. I’m not looking for work, although I may end up forced into it if my partner takes a job outside of Seattle. [return]
  3. Moishe Lettvin has a talk I really like, where he talks about a time when he was on a hiring committee and they rejected every candidate that came up, only to find that the “candidates” were actually anonymized versions of their own interviews!

    The bit about when he first started interviewing at Microsoft should sound familiar to MS folks. As is often the case, he got thrown into the interview with no warning and no preparation. He had no idea what to do and, as a result, wrote up interview feedback that wasn’t great. “In classic Microsoft style”, his manager forwarded the interview feedback to the entire team and said “don’t do this”. “In classic Microsoft style” is a quote from Moishe, but I’ve observed the same thing. I’d like to talk about how we have a tendency to do extremely blameful postmortems and how that warps incentives, but that probably deserves its own post.

    Well, I’ll tell one story, in remembrance of someone who recently left my former team for Google. Shortly after that guy joined, he was in the office on a weekend (a common occurrence on his team). A manager from another team pinged him on chat and asked him to sign off on some code from the other team. The new guy, wanting to be helpful, signed off on the code. On Monday, the new guy talked to his mentor and his mentor suggested that he not help out other teams like that. Later, there was an outage related to the code. In classic Microsoft style, the manager from the other team successfully pushed the blame for the outage from his team to the new guy.

    Note that this guy isn’t included in my 3/7 stat because he joined shortly after I did, and I’m not trying to cherry-pick a window with the highest possible attrition.

    [return]
  4. For a while, Oracle claimed that the culture of the Seattle office is totally different from mainline-Oracle culture, but from what I’ve heard, they couldn’t resist Oracle-ifying the Seattle group and that part of the pitch is no longer convincing. [return]
  5. This footnote is a response to Ben Kuhn, who asked me, what types of devs are underrated and how would you find them? I think this group is diverse enough that there’s no one easy way to find them. There are people like “Bob”, who do critical work that’s simply not noticed. There are also people who are just terrible at interviewing, like Jeshua Smith. I believe he’s only once gotten a performance review that wasn’t excellent (that semester, his manager said he could only give out one top rating, and it wouldn’t be fair to give it to only one of his two top performers, so he gave them both average ratings). In every place he’s worked, he’s been well known as someone who you can go to with hard problems or questions, and much higher ranking engineers often go to him for help. I tried to get him hired at two different companies I’ve worked at and he failed both interviews. He sucks at interviews. My understanding is that his interview performance almost kept him from getting his current job, but his references were so numerous and strong that his current company decided to take a chance on him anyway. But he only had those references because his old org has been disintegrating. His new company picked up a lot of people from his old company, so there were many people at the new company that knew him. He can’t get the time of day almost anywhere else. Another person I’ve tried and failed to get hired is someone I’ll call Ashley, who got rejected in the recruiter screening phase at Google for not being technical enough, despite my internal recommendation that she was one of the strongest programmers I knew. But she came from a “nontraditional” background that didn’t fit the recruiter’s idea of what a programmer looked like, so that was that. Nontraditional is a funny term because it seems like most programmers have a “nontraditional” background, but you know what I mean.

    There’s enough variety here that there isn’t one way to find all of these people. Having a filtering process that’s more like Matasano’s and less like Google, Microsoft, Facebook, almost any YC startup you can name, etc., is probably a good start.

    [return]

Programming books you might want to consider reading

There are a lot of “12 CS books every programmer must read” lists floating around out there. That’s nonsense. The field is too broad for almost any topic to be required reading for all programmers, and even if a topic is that important, people’s learning preferences differ too much for any book on that topic to be the best book on the topic for all people.

This is a list of topics and books where I’ve read the book, am familiar enough with the topic to say what you might get out of learning more about the topic, and have read other books and can say why you’d want to read one book over another.

Algorithmic game theory / auction theory / mechanism design

Why should you care? Some of the world’s biggest tech companies run on ad revenue, and those ads are sold through auctions. This field explains how and why they work. Additionally, this material is useful any time you’re trying to figure out how to design systems that allocate resources effectively.1

In particular, incentive compatible mechanism design (roughly, how to create systems that provide globally optimal outcomes when people behave in their own selfish best interest) should be required reading for anyone who designs internal incentive systems at companies. If you’ve ever worked at a large company that “gets” this and one that doesn’t, you’ll see that the company that doesn’t get it has giant piles of money that are basically being lit on fire because the people who set up incentives created systems that are hugely wasteful. This field gives you the background to understand what sorts of mechanisms give you what sorts of outcomes; reading case studies gives you a very long (and entertaining) list of mistakes that can cost millions or even billions of dollars.
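
To make the incentive compatibility idea concrete, here’s a minimal simulation sketch (mine, not from any of the books below): in a sealed-bid second-price auction, bidding your true value does at least as well on average as shading or inflating your bid, which is the classic dominant-strategy result. The uniform value distribution and the specific strategies are just assumptions for illustration.

  import random

  def second_price_auction(bids):
      """Return (winning bidder index, price paid); the winner pays the second-highest bid."""
      order = sorted(range(len(bids)), key=lambda i: bids[i], reverse=True)
      return order[0], bids[order[1]]

  def average_utility(bid_strategy, trials=200000, n_rivals=3):
      """Average surplus for bidder 0, whose private value is uniform on [0, 1].
      Rivals bid their values truthfully."""
      total = 0.0
      for _ in range(trials):
          value = random.random()
          bids = [bid_strategy(value)] + [random.random() for _ in range(n_rivals)]
          winner, price = second_price_auction(bids)
          if winner == 0:
              total += value - price
      return total / trials

  # Truthful bidding is (weakly) best: shading gives up profitable wins,
  # inflating sometimes wins at a price above the bidder's value.
  print("truthful:   ", average_utility(lambda v: v))
  print("shade 20%:  ", average_utility(lambda v: 0.8 * v))
  print("inflate 20%:", average_utility(lambda v: 1.2 * v))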

Krishna; Auction Theory

The last time I looked, this was the only game in town for a comprehensive, modern introduction to auction theory. Covers the classic second price auction result in the first chapter, and then moves on to cover risk aversion, bidding rings, interdependent values, multiple auctions, asymmetric information, and other real-world issues.

Relatively dry. Unlikely to be motivating unless you’re already interested in the topic. Requires an understanding of basic probability and calculus.

Steiglitz; Snipers, Shills, and Sharks: eBay and Human Behavior

Seems designed as an entertaining introduction to auction theory for the layperson. Requires no mathematical background and relegates math to the small print. Covers maybe 1/10th of the material of Krishna, if that. Fun read.

Cramton, Shoham, and Steinberg; Combinatorial Auctions

Discusses things like how FCC spectrum auctions got to be the way they are and how “bugs” in mechanism design can leave hundreds of millions or billions of dollars on the table. This is one of those books where each chapter is by a different author. Despite that, it still manages to be coherent and I didn’t mind reading it straight through. It’s self-contained enough that you could probably read this without reading Krishna first, but I wouldn’t recommend it.

Shoham and Leyton-Brown; Multiagent Systems: Algorithmic, Game-Theoretic, and Logical Foundations

The title is the worst thing about this book. Otherwise, it’s a nice introduction to algorithmic game theory. The book covers basic game theory, auction theory, and other classic topics that CS folks might not already know, and then covers the intersection of CS with these topics. Assumes no particular background in the topic.

Nisan, Roughgarden, Tardos, and Vazirani; Algorithmic Game Theory

A survey of various results in algorithmic game theory. Requires a fair amount of background (consider reading Shoham and Leyton-Brown first). For example, chapter five is basically Devanur, Papadimitriou, Saberi, and Vazirani’s JACM paper, Market Equilibrium via a Primal-Dual Algorithm for a Convex Program, with a bit more motivation and some related problems thrown in. The exposition is good and the result is interesting (if you’re into that kind of thing), but it’s not necessarily what you want if you want to read a book straight through and get an introduction to the field.

Algorithms / Data Structures / Complexity

Why should you care? Well, there’s the pragmatic argument: even if you never use this stuff in your job, most of the best paying companies will quiz you on this stuff in interviews. On the non-bullshit side of things, I find algorithms to be useful in the same way I find math to be useful. The probability of any particular algorithm being useful for any particular problem is low, but having a general picture of what kinds of problems are solved problems, what kinds of problems are intractable, and when approximations will be effective, is often useful.

McDowell; Cracking the Coding Interview

Some problems and solutions, with explanations, matching the level of questions you see in entry-level interviews at Google, Facebook, Microsoft, etc. I usually recommend this book to people who want to pass interviews but not really learn about algorithms. It has just enough to get by, but doesn’t really teach you the why behind anything. If you want to actually learn about algorithms and data structures, see below.

Dasgupta, Papadimitriou, and Vazirani; Algorithms

Everything about this book seems perfect to me. It breaks up algorithms into classes (e.g., divide and conquer or greedy), and teaches you how to recognize what kind of algorithm should be used to solve a particular problem. It has a good selection of topics for an intro book, it’s the right length to read over a few weekends, and it has exercises that are appropriate for an intro book. Additionally, it has sub-questions in the middle of chapters to make you reflect on non-obvious ideas to make sure you don’t miss anything.

I know some folks don’t like it because it’s relatively math-y/proof focused. If that’s you, you’ll probably prefer Skiena.

Skiena; The Algorithm Design Manual

The longer, more comprehensive, more practical, less math-y version of Dasgupta. It’s similar in that it attempts to teach you how to identify problems, use the correct algorithm, and give a clear explanation of the algorithm. The book is well motivated with “war stories” that show the impact of algorithms in real world programming.

CLRS; Introduction to Algorithms

This book somehow manages to make it into half of these “N books all programmers must read” lists despite being so comprehensive and rigorous that almost no practitioners actually read the entire thing. It’s great as a textbook for an algorithms class, where you get a selection of topics. As a class textbook, it’s a nice bonus that it has exercises that are hard enough that they can be used for graduate level classes (about half the exercises from my grad level algorithms class were pulled from CLRS, and the other half were from Kleinberg & Tardos), but this is wildly impractical as a standalone introduction for most people.

Just for example, there’s an entire chapter on Van Emde Boas trees. They’re really neat – it’s a little surprising that a balanced-tree-like structure with O(lg lg n) insert and delete, as well as find, successor, and predecessor, is possible, but a first introduction to algorithms shouldn’t include Van Emde Boas trees.

Kleinberg & Tardos; Algorithm Design

Same comments as for CLRS – it’s widely recommended as an introductory book even though it doesn’t make sense as an introductory book. Personally, I found the exposition in Kleinberg to be much easier to follow than in CLRS, but plenty of people find the opposite.

Demaine; Advanced Data Structures

This is a set of lectures and notes and not a book, but if you want a coherent (but not intractably comprehensive) set of material on data structures that you’re unlikely to see in most undergraduate courses, this is great. The notes aren’t designed to be standalone, so you’ll want to watch the videos if you haven’t already seen this material.

Okasaki; Purely Functional Data Structures

Fun to work through, but, unlike the other algorithms and data structures books, I’ve yet to be able to apply anything from this book to a problem domain where performance really matters.

For a couple years after I read this, when someone would tell me that it’s not that hard to reason about the performance of purely functional lazy data structures, I’d ask them about part of a proof that stumped me in this book. I’m not talking about some obscure super hard exercise, either. I’m talking about something in the main body of the text that the author considered too obvious to explain. No one could explain it. Reasoning about this kind of thing is harder than people often claim.

Dominus; Higher Order Perl

A gentle introduction to functional programming that happens to use Perl. You could probably work through this book just as easily in Python or Ruby.

If you keep up with what’s trendy, this book might seem a bit dated today, but only because so many of the ideas have become mainstream. If you’re wondering why you should care about this “functional programming” thing people keep talking about, and some of the slogans you hear don’t speak to you or are even off-putting (types are propositions, it’s great because it’s math, etc.), give this book a chance.

Levitin; Algorithms

I ordered this off Amazon after seeing these two blurbs: “Other learning-enhancement features include chapter summaries, hints to the exercises, and a detailed solution manual.” and “Student learning is further supported by exercise hints and chapter summaries.” One of these blurbs is even printed on the book itself, but after getting the book, the only self-study resources I could find were some Yahoo Answers posts asking where you could find hints or solutions.

I ended up picking up Dasgupta instead, which was available off an author’s website for free.

Mitzenmacher & Upfal; Probability and Computing: Randomized Algorithms and Probabilistic Analysis

I’ve probably gotten more mileage out of this than out of any other algorithms book. A lot of randomized algorithms are trivial to port to other applications and can simplify things a lot.

The text has enough of an intro to probability that you don’t need to have any probability background. Also, the material on tail bounds (e.g., Chernoff bounds) is useful for a lot of CS theory proofs and isn’t covered in the intro probability texts I’ve seen.
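
As a tiny illustration of why tail bounds are handy (my example, not the book’s): a Chernoff/Hoeffding-style bound tells you how many random samples you need before an empirical average is unlikely to be off by more than some epsilon, which is the kind of guarantee many randomized algorithms are built on.

  import math
  import random

  def samples_needed(eps, delta):
      """Samples so that the empirical mean of i.i.d. draws in [0, 1] is within eps
      of the true mean with probability at least 1 - delta (Hoeffding bound)."""
      return math.ceil(math.log(2.0 / delta) / (2.0 * eps * eps))

  def estimate_fraction(predicate, population, eps=0.01, delta=1e-6):
      """Estimate the fraction of population satisfying predicate, without a full scan."""
      n = samples_needed(eps, delta)
      hits = sum(predicate(random.choice(population)) for _ in range(n))
      return hits / n

  # Hypothetical usage: what fraction of these numbers are divisible by 7?
  data = range(10000000)
  print(estimate_fraction(lambda x: x % 7 == 0, data))  # ~0.1429, within 0.01 w.h.p.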

Sipser; Introduction to the Theory of Computation

Classic intro to theory of computation. Turing machines, etc. Proofs are often given at an intuitive, “proof sketch”, level of detail. A lot of important results (e.g., Rice’s Theorem) are pushed into the exercises, so you really have to do the key exercises. Unfortunately, most of the key exercises don’t have solutions, so you can’t check your work.

For something with a more modern topic selection, maybe see Arora & Barak.

Bernhardt; Computation

Covers a few theory of computation highlights. The explanations are delightful and I’ve watched some of the videos more than once just to watch Bernhardt explain things. Targeted at a general programmer audience with no background in CS.

Kearns & Vazirani; An Introduction to Computational Learning Theory

Classic, but dated and riddled with errors, with no errata available. When I wanted to learn this material, I ended up cobbling together notes from a couple of courses, one by Klivans and one by Blum.

Operating Systems

Why should you care? Having a bit of knowledge about operating systems can save days or weeks of debugging time. This is a regular theme on Julia Evans’s blog, and I’ve found the same thing to be true of my experience. I’m hard pressed to think of anyone who builds practical systems and knows a bit about operating systems who hasn’t found their operating systems knowledge to be a time saver. However, there’s a bias in who reads operating systems books – it tends to be people who do related work! It’s possible you won’t get the same thing out of reading these if you do really high-level stuff.

Silberschatz, Galvin, and Gagne; Operating System Concepts

This was what we used at Wisconsin before the comet book became standard. I guess it’s ok. It covers concepts at a high level and hits the major points, but it’s lacking in technical depth, details on how things work, advanced topics, and clear exposition.

BTW, I’ve heard very good things about the comet book. I just can’t say much about it since I haven’t read it.

Cox, Kaashoek, and Morris; xv6

This book is great! It explains how you can actually implement things in a real system, and it comes with its own implementation of an OS that you can play with. By design, the authors favor simple implementations over optimized ones, so the algorithms and data structures used are often quite different than what you see in production systems.

This book goes well when paired with a book that talks about how more modern operating systems work, like Love’s Linux Kernel Development or Russinovich’s Windows Internals.

Love; Linux Kernel Development

The title can be a bit misleading – this is basically a book about how the Linux kernel works: how things fit together, what algorithms and data structures are used, etc. I read the 2nd edition, which is now quite dated. The 3rd edition has some updates, but introduced some errors and inconsistencies, and is still dated (it was published in 2010, and covers 2.6.34). Even so, it’s a nice introduction to how a relatively modern operating system works.

The other downside of this book is that the author loses all objectivity any time Linux and Windows are compared. Basically every time they’re compared, the author says that Linux has clearly and incontrovertibly made the right choice and that Windows is doing something stupid. On balance, I prefer Linux to Windows, but there are a number of areas where Windows is superior, as well as areas where there’s parity but Windows was ahead for years. You’ll never find out what they are from this book, though.

Russinovich, Solomon, and Ionescu; Windows Internals

The most comprehensive book about how a modern operating system works. It just happens to be about Windows. Coming from a *nix background, I found this interesting to read just to see the differences.

This is definitely not an intro book, and you should have some knowledge of operating systems before reading this. If you’re going to buy a physical copy of this book, you might want to wait until the 7th edition is released (early in 2017).

Downey; The Little Book of Semaphores

Takes a topic that’s normally one or two sections in an operating systems textbook and turns it into its own 300 page book. The book is a series of exercises, a bit like The Little Schemer, but with more exposition. It starts by explaining what a semaphore is, and then has a series of exercises that build up higher-level concurrency primitives.

This book was very helpful when I first started to write threading/concurrency code. I subscribe to the Butler Lampson school of concurrency, which is to say that I prefer to have all the concurrency-related code stuffed into a black box that someone else writes. But sometimes you’re stuck writing the black box, and if so, this book has a nice introduction to the style of thinking required to write maybe possibly not totally wrong concurrent code.
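
For a taste of the style, here’s a sketch in Python of one of the classic constructions the exercises build up to: a reusable barrier made only of semaphores and a counter, in the preloaded two-turnstile style. Treat it as my sketch of the idea rather than code from the book.

  import threading

  class Barrier:
      """Reusable barrier for n threads, built from semaphores and a counter."""
      def __init__(self, n):
          self.n = n
          self.count = 0
          self.mutex = threading.Semaphore(1)
          self.turnstile = threading.Semaphore(0)   # gate into the barrier
          self.turnstile2 = threading.Semaphore(0)  # gate out of the barrier

      def wait(self):
          # Phase 1: the last thread to arrive preloads the first turnstile with n permits.
          with self.mutex:
              self.count += 1
              if self.count == self.n:
                  for _ in range(self.n):
                      self.turnstile.release()
          self.turnstile.acquire()

          # Phase 2: the last thread to leave preloads the second turnstile, which keeps
          # a fast thread from racing ahead into the next round's phase 1 too early.
          with self.mutex:
              self.count -= 1
              if self.count == 0:
                  for _ in range(self.n):
                      self.turnstile2.release()
          self.turnstile2.acquire()

  def worker(i, barrier):
      for round_num in range(3):
          barrier.wait()
          print(f"thread {i} passed the barrier in round {round_num}")

  if __name__ == "__main__":
      b = Barrier(4)
      threads = [threading.Thread(target=worker, args=(i, b)) for i in range(4)]
      for t in threads:
          t.start()
      for t in threads:
          t.join()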

I wish someone would write a book in this style, but both lower level and higher level. I’d love to see exercises like this, but starting with instruction-level primitives for a couple different architectures with different memory models (say, x86 and Alpha) instead of semaphores. If I’m writing grungy low-level threading code today, I’m overwhelmingly likely to be using C++11 threading primitives, so I’d like something that uses those instead of semaphores, which I might have used if I was writing threading code against the Win32 API. But since that book doesn’t exist, this seems like the next best thing.

I’ve heard that Doug Lea’s Concurrent Programming in Java is also quite good, but I’ve only taken a quick look at it.

Computer architecture

Why should you care? The specific facts and trivia you’ll learn will be useful when you’re doing low-level performance optimizations, but the real value is learning how to reason about tradeoffs between performance and other factors, whether that’s power, cost, size, weight, or something else.

In theory, that kind of reasoning should be taught regardless of specialization, but my experience is that comp arch folks are much more likely to “get” that kind of reasoning and do back of the envelope calculations that will save them from throwing away a 2x or 10x (or 100x) factor in performance for no reason. This sounds obvious, but I can think of multiple production systems at large companies that are giving up 10x to 100x in performance which are operating at a scale where even a 2x difference in performance could pay a VP’s salary – all because people didn’t think through the performance implications of their design.
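
To make that concrete, here’s the flavor of back-of-the-envelope calculation I mean. Every number below is made up for illustration (request rate, per-request CPU cost, machine cost), so treat it as a template, not a real estimate.

  # All numbers are hypothetical, purely to show the shape of the calculation.
  requests_per_sec = 1_000_000       # aggregate load on the service
  cpu_sec_per_request = 0.010        # CPU cost of the current design
  cores_per_machine = 32
  dollars_per_machine_year = 10_000  # fully loaded: hardware, power, ops

  cores = requests_per_sec * cpu_sec_per_request
  machines = cores / cores_per_machine
  cost = machines * dollars_per_machine_year
  print(f"current design:    ~{machines:,.0f} machines, ~${cost:,.0f}/year")

  # A design that wastes 10x the CPU per request needs ~10x the machines;
  # at this (made-up) scale, that difference is tens of millions of dollars a year.
  print(f"10x slower design: ~{10 * machines:,.0f} machines, ~${10 * cost:,.0f}/year")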

Hennessy & Patterson; Computer Architecture: A Quantitative Approach

This book teaches you how to do systems design with multiple constraints (e.g., performance, TCO, and power) and how to reason about tradeoffs. It happens to mostly do so using microprocessors and supercomputers as examples.

New editions of this book have substantive additions and you really want the latest version. For example, the latest version added, among other things, a chapter on data center design that answers questions like: how much opex/capex is spent on power, power distribution, and cooling vs. support staff and machines; what’s the effect of using lower power machines on tail latency and result quality (Bing search results are used as an example); and what other factors should you consider when designing a data center.

Assumes some background, but that background is presented in the appendices (which are available online for free).

Shen & Lipasti; Modern Processor Design

Presents most of what you need to know to architect a high performance Pentium Pro (1995) era microprocessor. That’s no mean feat, considering the complexity involved in such a processor. Additionally, presents some more advanced ideas and bounds on how much parallelism can be extracted from various workloads (and how you might go about doing such a calculation). Has an unusually large section on value prediction, because the authors invented the concept and it was still hot when the first edition was published.

For pure CPU architecture, this is probably the best book available.

Hill, Jouppi, and Sohi; Readings in Computer Architecture

Read for historical reasons and to see how much better we’ve gotten at explaining things. For example, compare Amdahl’s paper on Amdahl’s law (two pages, with a single non-obvious graph presented, and no formulas), vs. the presentation in a modern textbook (one paragraph, one formula, and maybe one graph to clarify, although it’s usually clear enough that no extra graph is needed).

This seems to be worse the further back you go; since comp arch is a relatively young field, nothing here is really hard to understand. If you want to see a dramatic example of how we’ve gotten better at explaining things, compare Maxwell’s original paper on Maxwell’s equations to a modern treatment of the same material. Fun if you like history, but a bit of a slog if you’re just trying to learn something.
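
For reference, the “one paragraph, one formula” modern presentation of Amdahl’s law really is this small: if a fraction p of the work is sped up by a factor s, the overall speedup is 1 / ((1 - p) + p / s). A trivial sketch:

  def amdahl_speedup(p, s):
      """Overall speedup when a fraction p of the work is sped up by a factor s."""
      return 1.0 / ((1.0 - p) + p / s)

  # Speeding up 90% of the work by 10x only yields ~5.3x overall, and even an
  # infinite speedup of that 90% is capped at 10x by the remaining serial 10%.
  print(amdahl_speedup(0.90, 10))    # ~5.26
  print(amdahl_speedup(0.90, 1e12))  # ~10.0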

Misc

Beyer, Jones, Petoff, and Murphy; Site Reliability Engineering

A description of how Google handles operations. Has the typical Google tone, which is off-putting to a lot of folks with a “traditional” ops background, and assumes that many things can only be done with the SRE model when they can, in fact, be done without going full SRE.

For a much longer description, see this 22 page set of notes on Google’s SRE book.

Fowler, Beck, Brant, Opdyke, and Roberts; Refactoring

At the time I read it, it was worth the price of admission for the section on code smells alone. But this book has been so successful that the ideas of refactoring and code smells have become mainstream.

Steve Yegge has a great pitch for this book:

When I read this book for the first time, in October 2003, I felt this horrid cold feeling, the way you might feel if you just realized you’ve been coming to work for 5 years with your pants down around your ankles. I asked around casually the next day: “Yeah, uh, you’ve read that, um, Refactoring book, of course, right? Ha, ha, I only ask because I read it a very long time ago, not just now, of course.” Only 1 person of 20 I surveyed had read it. Thank goodness all of us had our pants down, not just me.

If you’re a relatively experienced engineer, you’ll recognize 80% or more of the techniques in the book as things you’ve already figured out and started doing out of habit. But it gives them all names and discusses their pros and cons objectively, which I found very useful. And it debunked two or three practices that I had cherished since my earliest days as a programmer. Don’t comment your code? Local variables are the root of all evil? Is this guy a madman? Read it and decide for yourself!

DeMarco & Lister; Peopleware

This book seemed convincing when I read it in college. It even had all sorts of studies backing up what they said. No deadlines is better than having deadlines. Offices are better than cubicles. Basically all devs I talk to agree with this stuff.

But virtually every successful company is run the opposite way. Even Microsoft is remodeling buildings from individual offices to open plan layouts. Could it be that all of this stuff just doesn’t matter that much? If it really is that important, how come companies that are true believers, like Fog Creek, aren’t running roughshod over their competitors?

This book agrees with my biases and I’d love for this book to be right, but the meta evidence makes me want to re-read this with a critical eye and look up primary sources.

Drummond; Renegades of the Empire

This is the story of the development of DirectX. It also explains how Microsoft’s aggressive culture got to be the way it is today. The intro reads:

Microsoft didn’t necessarily hire clones of Gates (although there were plenty on the corporate campus) so much as recruit those who shared some of Gates’s more notable traits – arrogance, aggressiveness, and high intelligence.

Gates is infamous for ridiculing someone’s idea as “stupid”, or worse, “random”, just to see how he or she defends a position. This hostile managerial technique invariably spread through the chain of command and created a culture of conflict.

Microsoft nurtures a Darwinian order where resources are often plundered and hoarded for power, wealth, and prestige. A manager who leaves on vacation might return to find his turf raided by a rival and his project put under a different command or canceled altogether.

On interviewing at Microsoft:

“What do you like about Microsoft?” “Bill kicks ass”, St. John said. “I like kicking ass. I enjoy the feeling of killing competitors and dominating markets”.

St. John was hired, and he could do no wrong for years. This book tells the story of him and a few others like him. Read this book if you’re considering a job at Microsoft. I wish I’d read this before joining and not after!

Math

Why should you care? From a pure ROI perspective, I doubt learning math is “worth it” for 99% of jobs out there. AFAICT, I use math more often than most programmers, and I don’t use it all that often. But having the right math background sometimes comes in handy and I really enjoy learning math. YMMV.

Bertsekas; Introduction to Probability

Introductory undergrad text that tends towards intuitive explanations over epsilon-delta rigor. For anyone who cares to do more rigorous derivations, there are some exercises at the back of the book that go into more detail.

Has many exercises with available solutions, making this a good text for self-study.

Ross; A First Course in Probability

This is one of those books where they regularly crank out new editions to make students pay for new copies of the book (this is presently priced at a whopping $174 on Amazon)2. This was the standard text when I took probability at Wisconsin, and I literally cannot think of a single person who found it helpful. Avoid.

Brualdi; Introductory Combinatorics

Brualdi is a great lecturer, one of the best I had in undergrad, but this book was full of errors and not particularly clear. There have been two new editions since I used this book, but according to the Amazon reviews the book still has a lot of errors.

For an alternate introductory text, I’ve heard good things about Camina & Lewis’s book, but I haven’t read it myself. Also, Lovasz is a great book on combinatorics, but it’s not exactly introductory.

Apostol; Calculus

Volume 1 covers what you’d expect in a calculus I + calculus II book. Volume 2 covers linear algebra and multivariable calculus. It covers linear algebra before multivariable calculus, which makes multivariable calculus a lot easier to understand.

It also makes a lot of sense from a programming standpoint, since a lot of the value I get out of calculus is its applications to approximations, etc., and that’s a lot clearer when taught in this sequence.
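
As a trivial example of the kind of approximation I mean (mine, not Apostol’s): the first-order expansion f(x + h) ≈ f(x) + f'(x) * h, which for the square root around 1 gives sqrt(1 + h) ≈ 1 + h / 2.

  import math

  # First-order (linear) approximation: f(x + h) ~ f(x) + f'(x) * h.
  # For f(x) = sqrt(x) around x = 1, this gives sqrt(1 + h) ~ 1 + h / 2.
  for h in (0.1, 0.01, 0.001):
      approx = 1 + h / 2
      exact = math.sqrt(1 + h)
      print(f"h={h}: approx={approx:.6f} exact={exact:.6f} error={abs(approx - exact):.2e}")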

This book is probably a rough intro if you don’t have a professor or TA to help you along. The Springer SUMS series tends to be pretty good for self-study introductions to various areas, but I haven’t actually read their intro calculus book, so I can’t recommend it.

Stewart; Calculus

Another one of those books where they crank out new editions with trivial changes to make money. This was the standard text for non-honors calculus at Wisconsin, and the result was that I taught a lot of people to do complex integrals with the methods covered in Apostol, which are much more intuitive to many folks.

This book takes the approach that, for a type of problem, you should pattern match to one of many possible formulas and then apply the formula. Apostol is more about teaching you a few tricks and some intuition that you can apply to a wide variety of problems. I’m not sure why you’d buy this unless you were required to for some class.

Hardware basics

Why should you care? People often claim that, to be a good programmer, you have to understand every abstraction you use. That’s nonsense. Modern computing is too complicated for any human to have a real full-stack understanding of what’s going on. In fact, one reason modern computing can accomplish what it does is that it’s possible to be productive without having a deep understanding of much of the stack that sits below the level you’re operating at.

That being said, if you’re curious about what sits below software, here are a few books that will get you started.

Nisan & Schocken; nand2tetris

If you only want to read one single thing, this should probably be it. It’s a “101” level intro that goes down to gates and boolean logic. As implied by the name, it takes you from NAND gates to a working Tetris program.
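
As a taste of the early chapters, everything gets built up from a single NAND primitive. This is my sketch in Python; the book has you do the same thing in its own simple hardware description language.

  def nand(a, b):
      """The one primitive gate; everything below is derived from it."""
      return 0 if (a and b) else 1

  def not_(a):
      return nand(a, a)

  def and_(a, b):
      return not_(nand(a, b))

  def or_(a, b):
      return nand(not_(a), not_(b))

  def xor_(a, b):
      return and_(or_(a, b), nand(a, b))

  def half_adder(a, b):
      """Add two bits: returns (sum, carry)."""
      return xor_(a, b), and_(a, b)

  # Truth table for a half adder built entirely out of NAND gates.
  for a in (0, 1):
      for b in (0, 1):
          s, c = half_adder(a, b)
          print(f"{a} + {b} -> sum={s} carry={c}")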

Roth; Fundamentals of Logic Design

Much more detail on gates and logic design than you’ll see in nand2tetris. The book is full of exercises and appears to be designed to work for self-study. Note that the link above is to the 5th edition. There are newer editions, but they don’t seem to be much improved, have a lot of errors in the new material, and are much more expensive.

Weste, Harris, and Banerjee; CMOS VLSI Design

One level below boolean gates, you get to VLSI, a historical acronym (very large scale integration) that doesn’t really have any meaning today.

Broader and deeper than the alternatives, with clear exposition. Explores the design space (e.g., the section on adders doesn’t just mention a few different types in an ad hoc way, it explores all the tradeoffs you can make). Also, has both problems and solutions, which makes it great for self-study.

Kang & Leblebici; CMOS Digital Integrated Circuits

This was the standard text at Wisconsin way back in the day. It was hard enough to follow that the TA basically re-explained pretty much everything necessary for the projects and the exams. I find that it’s ok as a reference, but it wasn’t a great book to learn from.

Compared to this book, Weste et al. spend a lot more effort talking about tradeoffs in design (e.g., when creating a parallel prefix tree adder, what does it really mean to be at some particular point in the design space?).

Pierret; Semiconductor Device Fundamentals

One level below VLSI, you have how transistors actually work.

Really beautiful explanation of solid state devices. The text nails the fundamentals of what you need to know to really understand this stuff (e.g., band diagrams), and then uses those fundamentals along with clear explanations to give you a good mental model of how different types of junctions and devices work.

Streetman & Banerjee; Solid State Electronic Devices

Covers the same material as Pierret, but seems to substitute mathematical formulas for the intuitive understanding that Pierret goes for.

Ida; Engineering Electromagnetics

One level below transistors, you have electromagnetics.

Two to three times thicker than other intro texts because it has more worked examples and diagrams. Breaks things down into types of problems and subproblems, making things easy to follow. For self-study, a much gentler introduction than Griffiths or Purcell.

Shanley; Pentium Pro and Pentium II System Architecture

Unlike the other books in this section, this book is about practice instead of theory. It’s a bit like Windows Internals, in that it goes into the details of a real, working, system. Topics include hardware bus protocols, how I/O actually works (e.g., APIC), etc.

The problem with a practical introduction is that there’s been an exponential increase in complexity ever since the 8080. The further back you go, the easier it is to understand the most important moving parts in the system, and the more irrelevant the knowledge. This book seems like an ok compromise in that the bus and I/O protocols had to handle multiprocessors, and many of the elements that are in modern systems were in these systems, just in a simpler form.

Not covered

Of the books that I’ve liked, I’d say this captures at most 25% of the software books and 5% of the hardware books. On average, the books that have been left off the list are more specialized. This list is also missing many entire topic areas, like PL, practical books on how to learn languages, networking, etc.

The reasons for leaving off topic areas vary; I don’t have any PL books listed because I don’t read PL books. I don’t have any networking books because, although I’ve read a couple, I don’t know enough about the area to really say how useful the books are. The vast majority of hardware books aren’t included because they cover material that you wouldn’t care about unless you were a specialist (e.g., Skew-Tolerant Circuit Design or Ultrafast Optics). The same goes for areas like math and CS theory, where I left off a number of books that I think are great but have basically zero probability of being useful in my day-to-day programming life, e.g., Extremal Combinatorics. I also didn’t include books I didn’t read all or most of, unless I stopped because the book was atrocious. This means that I don’t list classics I haven’t finished like SICP and The Little Schemer, since those books seem fine and I just didn’t finish them for one reason or another.

This list also doesn’t include books on history and culture, like Inside Intel or Masters of Doom. I’ll probably add a category for those at some point, but I’ve been trying an experiment where I try to write more like Julia Evans (stream of consciousness, fewer or no drafts). I’d have to go back and re-read the books I read 10+ years ago to write meaningful comments, which doesn’t exactly fit with the experiment. On that note, since this list is from memory and I got rid of almost all of my books a couple years ago, I’m probably forgetting a lot of books that I meant to add.

If you liked this, you might also like this list of programming blogs, which is written in a similar style.


  1. Also, if you play boardgames, auction theory explains why fixing game imbalance via an auction mechanism is non-trivial and often makes the game worse. [return]
  2. I talked to the author of one of these books. He griped that the used book market destroys revenue from textbooks after a couple years, and that authors don’t get much in royalties, so you have to charge a lot of money and keep producing new editions every couple of years to make money. That griping goes double in cases where a new author picks up a classic book that someone else originally wrote, since the original author often has a much larger share of the royalties than the new author, despite doing no work on the later editions. [return]

HN comments are underrated


HN comments are terrible. On any topic I’m informed about, the vast majority of comments are pretty clearly wrong. Most of the time, there are zero comments from people who know anything about the topic and the top comment is reasonable sounding but totally incorrect. Additionally, many comments are gratuitously mean. You’ll often hear mean comments backed up with something like “this is better than the other possibility, where everyone just pats each other on the back with comments like ‘this is great’”, as if being an asshole is some sort of talisman against empty platitudes. I’ve seen people push back against that; when pressed, people often say that it’s either impossible or inefficient to teach someone without being mean, as if telling someone that they’re stupid somehow helps them learn. It’s as if people learned how to explain things by watching Simon Cowell and can’t comprehend the concept of an explanation that isn’t littered with personal insults. Paul Graham has said, “Oh, you should never read Hacker News comments about anything you write”. Most of the negative things you hear about HN comments are true.

And yet, I haven’t found a public internet forum with better technical commentary. On topics I’m familiar with, while it’s rare that a thread will have even a single comment that’s well-informed, when those comments appear, they usually float to the top. On other forums, well-informed comments are either non-existent or get buried by reasonable sounding but totally wrong comments when they appear, and they appear even more rarely than on HN.

By volume, there are probably more interesting technical “posts” in comments than in links. Well, that depends on what you find interesting, but that’s true for my interests. If I see a low-level optimization comment from nkurz, a comment on business from patio11, a comment on how companies operate by nostrademons, I almost certainly know that I’m going to read an interesting comment. There are maybe 20 to 30 people I can think of who don’t blog much, but write great comments on HN and I doubt I even know of half the people who are writing great comments on HN1.

I compiled a very abbreviated list of comments I like because comments seem to get lost. If you write a blog post, people will refer back to it years later, but comments mostly disappear. I think that’s sad – there’s a lot of great material on HN (and yes, even more not-so-great material).

What’s the deal with MS Word’s file format?

Basically, the Word file format is a binary dump of memory. I kid you not. They just took whatever was in memory and wrote it out to disk. We can try to reason why (maybe it was faster, maybe it made the code smaller), but I think the overriding reason is that the original developers didn’t know any better.

Later as they tried to add features they had to try to make it backward compatible. This is where a lot of the complexity lies. There are lots of crazy workarounds for things that would be simple if you allowed yourself to redesign the file format. It’s pretty clear that this was mandated by management, because no software developer would put themselves through that hell for no reason.

Later they added a fast-save feature (I forget what it is actually called). This appends changes to the file without changing the original file. The way they implemented this was really ingenious, but complicates the file structure a lot.

One thing I feel I must point out (I remember posting a huge thing on slashdot when this article was originally posted) is that 2 way file conversion is next to impossible for word processors. That’s because the file formats do not contain enough information to format the document. The most obvious place to see this is pagination. The file format does not say where to paginate a text flow (unless it is explicitly entered by the user). It relies on the formatter to do it. Each word processor formats text completely differently. Word, for example, famously paginates footnotes incorrectly. They can’t change it, though, because it will break backwards compatibility. This is one of the only reasons that Word Perfect survives today – it is the only word processor that paginates legal documents the way the US Department of Justice requires.

Just considering the pagination issue, you can see what the problem is. When reading a Word document, you have to paginate it like Word – only the file format doesn’t tell you what that is. Then if someone modifies the document and you need to resave it, you need to somehow mark that it should be paginated like Word (even though it might now have features that are not in Word). If it was only pagination, you might be able to do it, but practically everything is like that.

I recommend reading (a bit of) the XML Word file format for those who are interested. You will see large numbers of flags for things like “Format like Word 95”. The format doesn’t say what that is – because it’s pretty obvious that the authors of the file format don’t know. It’s lost in a hopeless mess of legacy code and nobody can figure out what it does now.

Fun with NULL

Here’s another example of this fine feature:

  #include <stdio.h>
  #include <string.h>
  #include <stdlib.h>
  #define LENGTH 128

  int main(int argc, char **argv) {
      char *string = NULL;
      int length = 0;
      if (argc > 1) {
          string = argv[1];
          length = strlen(string);
          if (length >= LENGTH) exit(1);
      }

      char buffer[LENGTH];
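      // Note: passing a NULL pointer to memcpy is undefined behavior even when
      // length is 0, which is what allows a compiler to assume string != NULL below.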
      memcpy(buffer, string, length);
      buffer[length] = 0;

      if (string == NULL) {
          printf("String is null, so cancel the launch.\n");
      } else {
          printf("String is not null, so launch the missiles!\n");
      }

      printf("string: %s\n", string);  // undefined for null but works in practice

      #if SEGFAULT_ON_NULL
      printf("%s\n", string);          // segfaults on null when bare "%s\n"
      #endif

      return 0;
  }

  nate@skylake:~/src$ clang-3.8 -Wall -O3 null_check.c -o null_check
  nate@skylake:~/src$ null_check
  String is null, so cancel the launch.
  string: (null)

  nate@skylake:~/src$ icc-17 -Wall -O3 null_check.c -o null_check
  nate@skylake:~/src$ null_check
  String is null, so cancel the launch.
  string: (null)

  nate@skylake:~/src$ gcc-5 -Wall -O3 null_check.c -o null_check
  nate@skylake:~/src$ null_check
  String is not null, so launch the missiles!
  string: (null)

It appears that Intel’s ICC and Clang still haven’t caught up with GCC’s optimizations. Ouch if you were depending on that optimization to get the performance you need! But before picking on GCC too much, consider that all three of those compilers segfault on printf("string: "); printf("%s\n", string) when string is NULL, despite having no problem with printf("string: %s\n", string) as a single statement. Can you see why using two separate statements would cause a segfault? If not, see here for a hint: https://gcc.gnu.org/bugzilla/show_bug.cgi?id=25609

How do you make sure the autopilot backup is paying attention?

Good engineering eliminates users being able to do the wrong thing as much as possible… . You don’t design a feature that invites misuse and then use instructions to try to prevent that misuse.

There was a derailment in Australia called the Waterfall derailment [1]. It occurred because the driver had a heart attack and was responsible for 7 deaths (a miracle it was so low, honestly). The root cause was the failure of the dead-man’s switch.

In the case of Waterfall, the driver had 2 dead-man switches he could use - 1) the throttle handle had to be held against a spring at a small rotation, or 2) a bar on the floor could be depressed. You had to do 1 of these things, the idea being that you prevent wrist or foot cramping by allowing the driver to alternate between the two. Failure to do either triggers an emergency brake.

It turns out that this driver was fat enough that when he had a heart attack, his leg was able to depress the pedal enough to hold the emergency system off. Thus, the dead-man’s system never triggered with a whole lot of dead man in the driver’s seat.

I can’t quite remember the specifics of the system at Waterfall, but one method to combat this is to require the pedal to be held halfway between released and fully depressed. The idea being that a dead leg would fully depress the pedal so that would trigger a brake, and a fully released pedal would also trigger a brake. I don’t know if they had that system but certainly that’s one approach used in rail.

Either way, the problem is equally possible in cars. If you lose consciousness and your foot goes limp, a heavy enough leg will be able to hold the pedal down a bit depending on where it’s positioned relative to the pedal and the leverage it has on the floor.

The other major system I’m familiar with for ensuring drivers are alive at the helm is called ‘vigilance’. The way it works is that periodically, a light starts flashing on the dash and the driver has to acknowledge that. If they do not, a buzzer alarm starts sounding. If they still don’t acknowledge it, the train brakes apply and the driver is assumed incapacitated. Let me tell you some stories of my involvement in it.

When we first started, we had a simple vigi system. Every 30 seconds or so (for example), the driver would press a button. Ok cool. Except that then drivers became so hard-wired to pressing the button every 30 seconds that we were having instances of drivers falling asleep/dozing off and still pressing the button right on every 30 seconds because it was so ingrained into them that it was literally a subconscious action.

So we introduced random-timing vigilance, where the time varies 30-60 seconds (for example) and you could only acknowledge it within a small period of time once the light started flashing. Again, drivers started falling asleep/semi asleep and would hit it as soon as the alarm buzzed, each and every time.

So we introduced random-timing, task-linked vigilance and that finally broke the back of the problem. Now, the driver has to press a button, or turn a knob, or do a number of different activities and they must do that randomly-chosen activity, at a randomly-chosen time, for them to acknowledge their consciousness. It was only at that point that we finally nailed down driver alertness.

See also.

Prestige

Curious why he would need to move to a more prestigious position? Most people realize by their 30s that prestige is a sucker’s game; it’s a way of inducing people to do things that aren’t much fun and they wouldn’t really want to do on their own, by lauding them with accolades from people they don’t really care about.

Why is FedEx based in Memphis?

… we noticed that we also needed:
(1) A suitable, existing airport at the hub location.
(2) Good weather at the hub location, e.g., relatively little snow, fog, or rain.
(3) Access to good ramp space, that is, where to park and service the airplanes and sort the packages.
(4) Good labor supply, e.g., for the sort center.
(5) Relatively low cost of living to keep down prices.
(6) Friendly regulatory environment.
(7) Candidate airport not too busy, e.g., don’t want arriving planes to have to circle a long time before being able to land.
(8) Airport with relatively little in cross winds and with more than one runway to pick from in case of winds.
(9) Runway altitude not too high, e.g., not high enough to restrict maximum total gross take off weight, e.g., rule out Denver.
(10) No tall obstacles, e.g., mountains, near the ends of the runways.
(11) Good supplies of jet fuel.
(12) Good access to roads for 18 wheel trucks for exchange of packages between trucks and planes, e.g., so that some parts could be trucked to the hub and stored there and shipped directly via the planes to customers that place orders, say, as late as 11 PM for delivery before 10 AM.
So, there were about three candidate locations, Memphis and, as I recall, Cincinnati and Kansas City.
The Memphis airport had some old WWII hangers next to the runway that FedEx could use for the sort center, aircraft maintenance, and HQ office space. Deal done – it was Memphis.

Why etherpad joined Wave, and why it didn’t work out as expected

The decision to sell to Google was one of the toughest decisions I and my cofounders ever had to wrestle with in our lives. We were excited by the Wave vision though we saw the flaws in the product. The Wave team told us about how they wanted our help making wave simpler and more like etherpad, and we thought we could help with that, though in the end we were unsuccessful at making wave simpler. We were scared of Google as a competitor: they had more engineers and more money behind this project, yet they were running it much more like an independent startup than a normal big-company department. The Wave office was in Australia and had almost total autonomy. And finally, after 1.5 years of being on the brink of failure with AppJet, it was tempting to be able to declare our endeavor a success and provide a decent return to all our investors who had risked their money on us.

In the end, our decision to join Wave did not work out as we had hoped. The biggest lessons learned were that having more engineers and money behind a project can actually be more harmful than helpful, so we were wrong to be scared of Wave as a competitor for this reason. It seems obvious in hindsight, but at the time it wasn’t. Second, I totally underestimated how hard it would be to iterate on the Wave codebase. I was used to rewriting major portions of software in a single all-nighter. Because of the software development process Wave was using, it was practically impossible to iterate on the product. I should have done more diligence on their specific software engineering processes, but instead I assumed because they seemed to be operating like a startup, that they would be able to iterate like a startup. A lot of the product problems were known to the whole Wave team, but we were crippled by a large complex codebase built on poor technical choices and a cumbersome engineering process that prevented fast iteration.

The accuracy of tech news

When I’ve had inside information about a story that later breaks in the tech press, I’m always shocked at how differently it’s perceived by readers of the article vs. how I experienced it. Among startups & major feature launches I’ve been party to, I’ve seen: executives that flat-out say that they’re not working on a product category when there’s been a whole department devoted to it for a year; startups that were founded 1.5 years before the dates listed in Crunchbase/Wikipedia; reporters that count the number of people they meet in a visit and report that as the “team size”, because the company refuses to release that info; funding rounds that never make it to the press; acquisitions that are reported as “for an undisclosed sum” but actually are less than the founders would’ve made if they’d taken a salaried job at the company; project start dates that are actually when the project was staffed up to its current size and ignore the year or so that a small team spent working on the problem (or the 3-4 years that other small teams spent working on the problem); and algorithms or other technologies that are widely reported as being the core of the company’s success, but actually aren’t even used by the company.

Self-destructing speakers from Dell

As the main developer of VLC, we know about this story since a long time, and this is just Dell putting crap components on their machine and blaming others. Any discussion was impossible with them. So let me explain a bit…

In this case, VLC just uses the Windows APIs (DirectSound), and sends signed integers of 16bits (s16) to the Windows Kernel.

VLC allows amplification of the INPUT above the sound that was decoded. This is just like replay gain, broken codecs, badly recorded files or post-amplification and can lead to saturation.

But this is exactly the same if you put your mp3 file through Audacity and increase it and play with WMP, or if you put a DirectShow filter that amplifies the volume after your codec output. For example, for a long time, VLC ac3 and mp3 codecs were too low (-6dB) compared to the reference output.

At worst, this will reduce the dynamics and saturate a lot, but this is not going to break your hardware.

VLC does not (and cannot) modify the OUTPUT volume to destroy the speakers. VLC is a Software using the OFFICIAL platforms APIs.

The issue here is that Dell sound cards output power (that can be approached by a factor of the quadratic of the amplitude) that Dell speakers cannot handle. Simply said, the sound card outputs at max 10W, and the speakers only can take 6W in, and neither their BIOS or drivers block this.

And as VLC is present on a lot of machines, it’s simple to blame VLC. “Correlation does not mean causation” is something that seems too complex for cheap Dell support…

Learning on the job, startups vs. big companies

Working for someone else’s startup, I learned how to quickly cobble solutions together. I learned about uncertainty and picking a direction regardless of whether you’re sure it’ll work. I learned that most startups fail, and that when they fail, the people who end up doing well are the ones who were looking out for their own interests all along. I learned a lot of basic technical skills, how to write code quickly and learn new APIs quickly and deploy software to multiple machines. I learned how quickly problems of scaling a development team crop up, and how early you should start investing in automation.

Working for Google, I learned how to fix problems once and for all and build that culture into the organization. I learned that even in successful companies, everything is temporary, and that great products are usually built through a lot of hard work by many people rather than great ah-ha insights. I learned how to architect systems for scale, and a lot of practices used for robust, high-availability, frequently-deployed systems. I learned the value of research and of spending a lot of time on a single important problem: many startups take a scattershot approach, trying one weekend hackathon after another and finding nobody wants any of them, while oftentimes there are opportunities that nobody has solved because nobody wants to put in the work. I learned how to work in teams and try to understand what other people want. I learned what problems are really painful for big organizations. I learned how to rigorously research the market and use data to make product decisions, rather than making decisions based on what seems best to one person.

We failed this person, what are we going to do differently?

Having been in on the company’s leadership meetings where departures were noted with a simple ‘regret yes/no’ flag, it was my experience that no single departure had any effect. Mass departures did, trends did, but one person never did, even when that person was a founder.

The rationalizations always put the issue back on the departing employee, “They were burned out”, “They had lost their ability to be effective”, “They have moved on”, “They just haven’t grown with the company” never was it “We failed this person, what are we going to do differently?”

AWS’s origin story

Anyway, the SOA effort was in full swing when I was there. It was a pain, and it was a mess because every team did things differently and every API was different and based on different assumptions and written in a different language.

But I want to correct the misperception that this led to AWS. It didn’t. S3 was written by its own team, from scratch. At the time I was at Amazon, working on the retail site, none of Amazon.com was running on AWS. I know, when AWS was announced, with great fanfare, they said “the services that power Amazon.com can now power your business!” or words to that effect. This was a flat out lie. The only thing they shared was data centers and a standard hardware configuration. Even by the time I left, when AWS was running full steam ahead (and probably running Reddit already), none of Amazon.com was running on AWS, except for a few, small, experimental and relatively new projects. I’m sure more of it has been adopted now, but AWS was always a separate team (and a better managed one, from what I could see.)

Why is Windows so slow?

I (and others) have put a lot of effort into making the Linux Chrome build fast. Some examples are multiple new implementations of the build system (http://neugierig.org/software/chromium/notes/2011/02/ninja.h... ), experimentation with the gold linker (e.g. measuring and adjusting the still off-by-default thread flags https://groups.google.com/a/chromium.org/group/chromium-dev/... ) as well as digging into bugs in it, and other underdocumented things like ‘thin’ ar archives.

But it’s also true that people who are more of Windows wizards than I am a Linux apprentice have worked on Chrome’s Windows build. If you asked me the original question, I’d say the underlying problem is that on Windows all you have is what Microsoft gives you and you can’t typically do better than that. For example, migrating the Chrome build off of Visual Studio would be a large undertaking, large enough that it’s rarely considered. (Another way of phrasing this is it’s the IDE problem: you get all of the IDE or you get nothing.)

When addressing the poor Windows performance, people first bought SSDs, something that never even occurred to me (“your system has enough RAM that the kernel cache of the file system should be in memory anyway!”). But for whatever reason on the Linux side some Googlers saw fit to rewrite the Linux linker to make it twice as fast (this effort predated Chrome), and all Linux developers now get to benefit from that. Perhaps the difference is that when people write awesome tools for Windows or Mac they try to sell them rather than give them away.

Why is Windows so slow, an insider view

I’m a developer in Windows and contribute to the NT kernel. (Proof: the SHA1 hash of revision #102 of [Edit: filename redacted] is [Edit: hash redacted].) I’m posting through Tor for obvious reasons.

Windows is indeed slower than other operating systems in many scenarios, and the gap is worsening. The cause of the problem is social. There’s almost none of the improvement for its own sake, for the sake of glory, that you see in the Linux world.

Granted, occasionally one sees naive people try to make things better. These people almost always fail. We can and do improve performance for specific scenarios that people with the ability to allocate resources believe impact business goals, but this work is Sisyphean. There’s no formal or informal program of systemic performance improvement. We started caring about security because pre-SP3 Windows XP was an existential threat to the business. Our low performance is not an existential threat to the business.

See, component owners are generally openly hostile to outside patches: if you’re a dev, accepting an outside patch makes your lead angry (due to the need to maintain this patch and to justify in shiproom the unplanned design change), makes test angry (because test is on the hook for making sure the change doesn’t break anything, and you just made work for them), and makes PM angry (due to the schedule implications of code churn). There’s just no incentive to accept changes from outside your own team. You can always find a reason to say “no”, and you have very little incentive to say “yes”.

What’s the probability of a successful exit by city?

See link for giant table :-).

The hiring crunch

Broken record: startups are also probably rejecting a lot of engineering candidates that would perform as well or better than anyone on their existing team, because tech industry hiring processes are folkloric and irrational.

Too long to excerpt. See the link!

Should you leave a bad job?

I am 42-year-old very successful programmer who has been through a lot of situations in my career so far, many of them highly demotivating. And the best advice I have for you is to get out of what you are doing. Really. Even though you state that you are not in a position to do that, you really are. It is okay. You are free. Okay, you are helping your boyfriend’s startup but what is the appropriate cost for this? Would he have you do it if he knew it was crushing your soul?

I don’t use the phrase “crushing your soul” lightly. When it happens slowly, as it does in these cases, it is hard to see the scale of what is happening. But this is a very serious situation and if left unchecked it may damage the potential for you to do good work for the rest of your life.

The commenters who are warning about burnout are right. Burnout is a very serious situation. If you burn yourself out hard, it will be difficult to be effective at any future job you go to, even if it is ostensibly a wonderful job. Treat burnout like a physical injury. I burned myself out once and it took at least 12 years to regain full productivity. Don’t do it.

  • More broadly, the best and most creative work comes from a root of joy and excitement. If you lose your ability to feel joy and excitement about programming-related things, you’ll be unable to do the best work. Note that this issue is separate from, and parallel to, burnout! If you are burned out, you might still be able to feel the joy and excitement briefly at the start of a project/idea, but they will fade quickly as the reality of day-to-day work sets in. Alternatively, if you are not burned out but also do not have a sense of wonder, it is likely you will never get yourself started on the good work.

  • The earlier in your career it is now, the more important this time is for your development. Programmers learn by doing. If you put yourself into an environment where you are constantly challenged and are working at the top threshold of your ability, then after a few years have gone by, your skills will have increased tremendously. It is like going to intensively learn kung fu for a few years, or going into Navy SEAL training or something. But this isn’t just a one-time constant increase. The faster you get things done, and the more thorough and error-free they are, the more ideas you can execute on, which means you will learn faster in the future too. Over the long term, programming skill is like compound interest. More now means a LOT more later. Less now means a LOT less later.

So if you are putting yourself into a position that is not really challenging, that is a bummer day in and day out, and you get things done slowly, you aren’t just having a slow time now. You are bringing down that compound interest curve for the rest of your career. It is a serious problem. If I could go back to my early career I would mercilessly cut out all the shitty jobs I did (and there were many of them).

Creating change when politically unpopular

A small anecdote. An acquaintance related a story of fixing the ‘drainage’ in their back yard. They were trying to grow some plants that were sensitive to excessive moisture, and the plants were dying. Not watering them or watering them only a little didn’t seem to change anything; they died. A professional gardener suggested that their problem was drainage. So they dug down about 3’ (where the soil was very very wet) and tried to build in better drainage. As they were on the side of a hill, water table issues were not considered. It turned out their “problem” was that the water main that fed their house and the houses up the hill was so pressurized at their property (because it had to maintain pressure at the top of the hill too) that the pipe seams were leaking and it was pumping gallons of water into the ground underneath their property. The problem wasn’t their garden, the problem was that the city water supply was poorly designed.

While I have never been asked if I was an engineer on the phone, I have experienced similar things to Rachel in meetings and with regard to suggestions. Co-workers will create an internal assessment of your value and then respond based on that assessment. If they have written you off they will ignore you, if you prove their assessment wrong in a public forum they will attack you. These are management issues, and something which was sorely lacking in the stories.

If you are the “owner” of a meeting and someone is trying to be heard and isn’t, it is incumbent on you to let them be heard. By your position power as “the boss” you can naturally interrupt a discussion to collect more data from other members. It’s also important to ask questions like “does anyone have any concerns?” to draw out people who have valid input but are too timid to share it.

In a highly political environment there are two ways to create change. One is through overt manipulation, which is to collect political power to yourself and then exert it to enact change. The other is covert manipulation, which is to enact change subtly enough that the political organism doesn’t react (i.e., without “triggering the antibodies”, as it’s sometimes called).

The problem with the latter is that if you help make positive change while keeping everyone not pissed off, no one attributes it to you (which is good for the change agent, because if they knew, the antibodies would react, but bad if your manager doesn’t recognize it). I asked my manager what change he wanted to be ‘true’ yet he (or others) had been unsuccessful making true. He gave me one, and 18 months later that change was in place. He didn’t believe that I was the one who had made the change. I suggested he pick a change he wanted to happen and not tell me, then in 18 months we could see if that one happened :-). But he also didn’t understand enough about organizational dynamics to know that making change without having the source of that change point back at you was even possible.

How to get tech support from Google

Heavily relying on Google product? ✓
Hitting a dead-end with Google’s customer service? ✓
Have an existing audience you can leverage to get some random Google employee’s attention? ✓
Reach front page of Hacker News? ✓
Good news! You should have your problem fixed in 2-5 business days. The rest of us suckers relying on Google services get to stare at our inboxes helplessly, waiting for a response to our support ticket (which will never come). I feel like it’s almost a rite of passage these days to rely heavily on a Google service, only to have something go wrong and be left out in the cold.

Taking funding

IIRC PayPal was very similar - it was sold for $1.5B, but Max Levchin’s share was only about $30M, and Elon Musk’s was only about $100M. By comparison, many early Web 2.0 darlings (Del.icio.us, Blogger, Flickr) sold for only $20-40M, but their founders had only taken small seed rounds, and so the vast majority of the purchase price went to the founders. 75% of a $40M acquisition = 3% of a $1B acquisition.

Something for founders to think about when they’re taking funding. If you look at the gigantic tech fortunes - Gates, Page/Brin, Omidyar, Bezos, Zuckerberg, Hewlett/Packard - they usually came from having a company that was already profitable or was already well down the hockey-stick user growth curve and had a clear path to monetization by the time they sought investment. Companies that fight tooth & nail for customers and need lots of outside capital to do it usually have much worse financial outcomes.

StackOverflow vs. Experts-Exchange

A lot of the people who were involved in some way in Experts-Exchange don’t understand Stack Overflow.

The basic value flow of EE is that “experts” provide valuable “answers” for novices with questions. In that equation there’s one person asking a question and one person writing an answer.

Stack Overflow recognizes that for every person who asks a question, 100 - 10,000 people will type that same question into Google and find an answer that has already been written. In our equation, we are a community of people writing answers that will be read by hundreds or thousands of people. Ours is a project more like wikipedia – collaboratively creating a resource for the Internet at large.

Because that resource is provided by the community, it belongs to the community. That’s why our data is freely available and licensed under creative commons. We did this specifically because of the negative experience we had with EE taking a community-generated resource and deciding to slap a paywall around it.

The attitude of many EE contributors, like Greg Young who calculates that he “worked” for half a year for free, is not shared by the 60,000 people who write answers on SO every month. When you talk to them you realize that on Stack Overflow, answering questions is about learning. It’s about creating a permanent artifact to make the Internet better. It’s about helping someone solve a problem in five minutes that would have taken them hours to solve on their own. It’s not about working for free.

As soon as EE introduced the concept of money they forced everybody to think of their work on EE as just that – work.

Making money from amazon bots

I saw that one of my old textbooks was selling for a nice price, so I listed it along with two other used copies. I priced it $1 cheaper than the lowest price offered, but within an hour both sellers had changed their prices to $.01 and $.02 cheaper than mine. I reduced it two times more by $1, and each time they beat my price by a cent or two. So what I did was reduce my price by a few dollars every hour for one day until everybody was priced under $5. Then I bought their books and changed my price back.

What running a business is like

While I like the sentiment here, I think the danger is that engineers might come to the mistaken conclusion that making pizzas is the primary limiting reagent to running a successful pizzeria. Running a successful pizzeria is more about schlepping to local hotels and leaving them 50 copies of your menu to put at the front desk, hiring drivers who will both deliver pizzas in a timely fashion and not embezzle your (razor-thin) profits while also costing next-to-nothing to employ, maintaining a kitchen in sufficient order to pass your local health inspector’s annual visit (and dealing with 47 different pieces of paper related to that), being able to juggle priorities like “Do I take out a bank loan to build a new brick-oven, which will make the pizza taste better, in the knowledge that this will commit $3,000 of my cash flow every month for the next 3 years, or do I hire an extra cook?”, sourcing ingredients such that they’re available in quantity and quality every day for a fairly consistent price, setting prices such that they’re locally competitive for your chosen clientele but generate a healthy gross margin for the business, understanding why a healthy gross margin really doesn’t imply a healthy net margin and that the rent still needs to get paid, keeping good-enough records such that you know whether your business is dying before you can’t make payroll and such that you can provide a reasonably accurate picture of accounts for the taxation authorities every year, balancing 50% off medium pizza promotions with the desire to not cannibalize the business of your regulars, etc etc, and by the way tomato sauce should be tangy but not sour and cheese should melt with just the faintest wisp of a crust on it.

Do you want to write software for a living? Google is hiring. Do you want to run a software business? Godspeed. Software is now 10% of your working life.

How to handle mismanagement?

The way I prefer to think of it is: it is not your job to protect people (particularly senior management) from the consequences of their decisions. Make your decisions in your own best interest; it is up to the organization to make sure that your interest aligns with theirs.

Google used to have a severe problem where code refactoring & maintenance was not rewarded in performance reviews while launches were highly regarded, which led to the effect of everybody trying to launch things as fast as possible and nobody cleaning up the messes left behind. Eventually launches started getting slowed down, Larry started asking “Why can’t we have nice things?”, and everybody responded “Because you’ve been paying us to rack up technical debt.” As a result, teams were formed with the express purpose of code health & maintenance, those teams that were already working on those goals got more visibility, and refactoring contributions started counting for something in perf. Moreover, many ex-Googlers who were fed up with the situation went to Facebook and, I’ve heard, instituted a culture there where grungy engineering maintenance is valued by your peers.

None of this would’ve happened if people had just heroically fallen on their own sword and burnt out doing work nobody cared about. Sometimes it takes highly visible consequences before people with decision-making power realize there’s a problem and start correcting it. If those consequences never happen, they’ll keep believing it’s not a problem and won’t pay much attention to it.

Some downsides of immutability

People who aren’t exactly lying

It took me too long to figure this out. There are some people who truly, and passionately, believe something they say to you, and realistically they personally can’t make it happen, so you can’t really bank on that ‘promise.’

I used to think those people were lying to take advantage, but as I’ve gotten older I have come to recognize that these ‘yes’ people get promoted a lot. And for some of them, they really do believe what they are saying.

As an engineer I’ve found that once I can ‘calibrate’ someone’s ‘yes-ness’ I can then work with them, understanding that they only make ‘wishful’ commitments rather than ‘reasoned’ commitments.

So when someone, like Steve Jobs, says “we’re going to make it an open standard!”, my first question then is “Great, I’ve got your support in making this an open standard, so I can count on you to wield your position influence to aid me when folks line up against that effort, right?” If the answer to that question is no, then they were lying.

The difference is subtle of course, but important. Steve clearly doesn’t go to standards meetings and vote, etc., but if Manager Bob gets pushback from accounting that he’s going to exceed his travel budget by sending 5 guys to the Open Video Chat Working Group, which is championing the Facetime protocol as an open standard, then Manager Bob goes to Steve and says “I need your help here, these 5 guys are needed to argue this standard and keep it from being turned into a turd by the 5 guys from Google who are going to attend,” and then Steve whips off a one-liner to accounting that says “Get off this guy’s back, we need this.” Then it’s all good. If, on the other hand, he says “We gotta save money, send one guy,” well, in that case I’m more sympathetic to the accusation of prevarication.

What makes engineers productive?

For those who work inside Google, it’s well worth it to look at Jeff & Sanjay’s commit history and code review dashboard. They aren’t actually all that much more productive in terms of code written than a decent SWE3 who knows his codebase.

The reason they have a reputation as rockstars is that they can apply this productivity to things that really matter; they’re able to pick out the really important parts of the problem and then focus their efforts there, so that the end result ends up being much more impactful than what the SWE3 wrote. The SWE3 may spend his time writing a bunch of unit tests that catch bugs that wouldn’t really have happened anyway, or migrating from one system to another that isn’t really a large improvement, or going down an architectural dead end that’ll just have to be rewritten later. Jeff or Sanjay (or any of the other folks operating at that level) will spend their time running a proposed API by clients to ensure it meets their needs, or measuring the performance of subsystems so they fully understand their building blocks, or mentally simulating the operation of the system before building it so they can rapidly test out alternatives. They don’t actually write more code than a junior developer (oftentimes, they write less), but the code they do write gives them more information, which helps ensure that they write the right code.

I feel like this point needs to be stressed a whole lot more than it is, as there’s a whole mythology that’s grown up around 10x developers that’s not all that helpful. In particular, people need to realize that these developers rapidly become 1x developers (or worse) if you don’t let them make their own architectural choices - the reason they’re excellent in the first place is that they know how to determine whether certain work is going to be useless and avoid doing it in the first place. If you dictate that they do it anyway, they’re going to be just as slow as any other developer.

Do the work, be a hero

I got the hero speech too, once. If anyone ever mentions the word “heroic” again and there isn’t a burning building involved, I will start looking for new employment immediately. It seems that in our industry it is universally a code word for “We’re about to exploit you because the project is understaffed and under-budgeted for time, and that is exactly as we planned it, so you’d better cowboy up.”

Maybe it is different if you’re writing Quake, but I guarantee you the 43rd best selling game that year also had programmers “encouraged onwards” by tales of the glory that awaited after the death march.

Learning English from watching movies

I was once speaking to a good friend of mine here, in English.
“Do you want to go out for yakitori?”
“Go fuck yourself!”
“[switches to Japanese] Have I recently done anything very major to offend you?”
“No, of course not.”
“Oh, OK, I was worried. So that phrase, that’s something you would only say under extreme distress when you had maximal desire to offend me, or I suppose you could use it jokingly between friends, but neither you nor I generally talk that way.”
“I learned it from a movie. I thought it meant ‘No.’”

Being smart and getting things done

True story: I went to a talk given by one of the ‘engineering elders’ (these were low Emp# engineers who were considered quite successful and were to be emulated by the workers :-)). This person stated that when they came to work at Google they were given the XYZ system to work on (sadly I’m prevented from disclosing the actual system). They remarked how they spent a couple of days looking over the system, which was complicated and creaky; they couldn’t figure it out, so they wrote a new system. Yup, and they committed that. This person is a coding God, are they not? (sarcasm) I asked what happened to the old system (I knew, but was interested in their perspective) and they said it was still around because a few things still used it, but (quite proudly) nearly everything else had moved to their new system.

So if you were reading carefully, this person created a new system to ‘replace’ an existing system which they didn’t understand and got nearly everyone to move to the new system. That made them uber because they got something big to put on their internal resume, and a whole crapload of folks had to write new code to adapt from the old system to this new system, which imperfectly recreated the old system (remember, they didn’t understand the original), such that those parts of the system that relied on the more obscure bits had yet to be converted (because nobody understood either the dependent code or the old system, apparently).

Was this person smart? Blindingly brilliant according to some of their peers. Did they get things done? Hell yes, they wrote the replacement for the XYZ system from scratch! One person? Can you imagine? Would I hire them? Not unless they were the last qualified person in my pool and I was out of time.

That anecdote encapsulates the dangerous side of smart people who get things done.

Public speaking tips

Some kids grow up on football. I grew up on public speaking (as behavioral therapy for a speech impediment, actually). If you want to get radically better in a hurry:

Too long to excerpt. See the link.

A reason a company can be a bad fit

I can relate to this, but I can also relate to the other side of the question. Sometimes it isn’t me, it’s you. Take someone who gets things done and suddenly in your organization they aren’t delivering. Could be them, but it could also be you.

I had this experience working at Google. I had a horrible time getting anything done there. Now, I spent a bit of time evaluating that, since it had never been the case in my career up to that point that I was unable to move the ball forward, and I really wanted to understand why. The short answer was that Google had developed a number of people who spent much, if not all, of their time preventing change. It took me a while to figure out what motivated someone to be anti-change.

The fear was about risk and safety. Folks moved around a lot, so you had people in charge of systems they didn’t build, didn’t understand all the moving parts of, and were apt to get a poor rating if they broke. When dealing with people in that situation, one could either educate them and bring them along, or steamroll over them. Education takes time, and during that time the ‘teacher’ doesn’t get anything done. This favors steamrolling evolutionarily :-)

So you can hire someone who gets stuff done, but if getting stuff done in your organization requires them to be an asshole, and they aren’t up for that, well they aren’t going to be nearly as successful as you would like them to be.

What working at Google is like

I can tell that this was written by an outsider, because it focuses on the perks and rehashes several cliches that have made their way into the popular media but aren’t all that accurate.

Most Googlers will tell you that the best thing about working there is having the ability to work on really hard problems, with really smart coworkers, and lots of resources at your disposal. I remember asking my interviewer whether I could use things like Google’s index if I had a cool 20% idea, and he was like “Sure. That’s encouraged. Oftentimes I’ll just grab 4000 or so machines and run a MapReduce to test out some hypothesis.” My phone screener, when I asked him what it was like to work there, said “It’s a place where really smart people go to be average,” which has turned out to be both true and honestly one of the best things that I’ve gained from working there.

NSA vs. Black Hat

This entire event was a staged press op. Keith Alexander is a ~30 year veteran of SIGINT, electronic warfare, and intelligence, and a Four-Star US Army General — which is a bigger deal than you probably think it is. He’s a spy chief in the truest sense and a master politician. Anyone who thinks he walked into that conference hall in Caesars without a near perfect forecast of the outcome of the speech is kidding themselves.

Heckling Alexander played right into the strategy. It gave him an opportunity to look reasonable compared to his detractors, and, more generally (and alarmingly), to have the NSA look more reasonable compared to opponents of NSA surveillance. It allowed him to “split the vote” with audience reactions, getting people who probably have serious misgivings about NSA programs to applaud his calm and graceful handling of shouted insults; many of those people probably applauded simply to protest the hecklers, who after all were making it harder for them to follow what Alexander was trying to say.

There was no serious Q&A on offer at the keynote. The questions were pre-screened; all attendees could do was vote on them. There was no possibility that anything would come of this speech other than an effectively unchallenged full-throated defense of the NSA’s programs.

Are deadlines necessary?

Interestingly one of the things that I found most amazing when I was working for Google was a nearly total inability to grasp the concept of ‘deadline.’ For so many years the company just shipped it by committing it to the release branch and having the code deploy over the course of a small number of weeks to the ‘fleet’.

Sure there were ‘processes’, like “Canary it in some cluster and watch the results for a few weeks before turning it loose on the world.” but being completely vertically integrated is a unique sort of situation.

Debugging on Windows vs. Linux

Being a very experienced game developer who tried to switch to Linux, I have posted about this before (and gotten flamed heavily by reactionary Linux people).

The main reason is that debugging is terrible on Linux. gdb is just bad to use, and all these IDEs that try to interface with gdb to “improve” it do it badly (mainly because gdb itself is not good at being interfaced with). Someone needs to nuke this site from orbit and build a new debugger from scratch, and provide a library-style API that IDEs can use to inspect executables in rich and subtle ways.

Productivity is crucial. If the lack of a reasonable debugging environment costs me even 5% of my productivity, that is too much, because games take so much work to make. At the end of a project, I just don’t have 5% effort left any more. It requires everything. (But the current Linux situation is way more than a 5% productivity drain. I don’t know exactly what it is, but if I were to guess, I would say it is something like 20%.)

What happens when you become rich?

What is interesting is that people don’t even know they have a complex about money until they get “rich.” I’ve watched many people, perhaps a hundred, go from “working to pay the bills” to “holy crap I can pay all my current and possibly my future bills with the money I now have.” That doesn’t include the guy who lived in our neighborhood and won the CA lottery one year.

It affects people in ways they don’t expect. If it’s sudden (like a lottery win or a sudden IPO surge) it can be difficult to process. But it is important to realize that one is processing an exceptional event, like having a loved one die or a spouse suddenly divorcing you.

Not everyone feels “guilty”, not everyone feels “smug.” A lot of millionaires and billionaires in the Bay Area are outwardly unchanged. But the bottom line is that the emotion comes from the cognitive dissonance between values and reality. What do you value? What is reality?

One woman I knew at Google was massively conflicted when she started work at Google. She always felt that she would help the homeless folks she saw, if she had more money than she needed. Upon becoming rich (on Google stock value), now she found that she wanted to save the money she had for her future kids education and needs. Was she a bad person? Before? After? Do your kids hate you if you give away their college education to the local foodbank? Do your peers hate you because you could close the current food gap at the foodbank and you don’t?

Microsoft’s Skype acquisition

This is Microsoft’s ICQ moment. Overpaying for a company at the moment when its core competency is becoming a commodity. Does anyone have the slightest bit of loyalty to Skype? Of course not. They’re going to use whichever video chat comes built into their smartphone, tablet, computer, etc. They’re going to use Facebook’s eventual video chat service or something Google offers. No one is going to actively seek out Skype when so many alternatives exist and are deeply integrated into the products/services they already use. Certainly no one is going to buy a Microsoft product simply because it has Skype integration. Who cares if it’s FaceTime, Facebook Video Chat, or Google Video Chat? It’s all the same to the user.

With $7B they should have just given away about 15 million Windows Mobile phones in the form of an epic PR stunt. It’s not a bad product – they just need to make people realize it exists. If they want to flush money down the toilet they might as well engage users in the process right?

What happened to Google Fiber?

I worked briefly on the Fiber team when it was very young (basically from 2 weeks before to 2 weeks after launch - I was on loan from Search specifically so that they could hit their launch goals). The bottleneck when I was there was local government regulations, and in fact Kansas City was chosen because it had a unified city/county/utility regulatory authority that was very favorable to Google. To lay fiber to the home, you either need rights-of-way on the utility poles (which are owned by Google’s competitors) or you need permission to dig up streets (which requires a mess of permitting from the city government). In either case, the cable & phone companies were in very tight with local regulators, and so you had hostile gatekeepers whose approval you absolutely needed.

The technology was awesome (1G Internet and HDTV!), the software all worked great, and the economics of hiring contractors to lay the fiber itself actually worked out. The big problem was regulatory capture.

With Uber & AirBnB’s success in hindsight, I’d say that the way to crack the ISP business is to provide your customers with the tools to break the law en masse. For example, you could imagine an ISP startup that basically says “Here’s a box, a wire, and a map of other customers’ locations. Plug into their jack, and if you can convince others to plug into yours, we’ll give you a discount on your monthly bill based on how many you sign up.” But Google in general is not willing to break laws - they’ll go right up to the boundary of what the law allows, but if a regulatory agency says “No, you can’t do that”, they won’t do it rather than fight the agency.

Indeed, Fiber is being phased out in favor of Google’s acquisition of WebPass, which does basically exactly that but with wireless instead of fiber. WebPass only requires the building owner’s consent, and leaves the city out of it.

???

How did HN get the commenter base that it has? If you read HN, on any given week, there are at least as many good, substantial comments as there are posts. This is different from every other modern public news aggregator I can find out there, and I don’t really know what the ingredients are that make HN successful.

For the last couple years (ish?), the moderation regime has been really active in trying to get a good mix of stories on the front page and in tamping down on gratuitously mean comments. But there was a period of years where the moderation could be described as sparse, arbitrary, and capricious, and while there are fewer “bad” comments now, it doesn’t seem like good moderation actually generates more “good” comments.

The ranking scheme seems to penalize posts that have a lot of comments on the theory that flamebait topics will draw a lot of comments. That sometimes prematurely buries stories with good discussion, but much more often, it buries stories that draw pointless flamewars. If you just read HN, it’s hard to see the effect, but if you look at forums that use comments as a positive factor in ranking, the difference is dramatic – those other forums that boost topics with many comments (presumably on the theory that vigorous discussion should be highlighted) often have content-free flame wars pinned at the top for long periods of time.

Something else that HN does that’s different from most forums is that user flags are weighted very heavily. On reddit, a downvote only cancels out an upvote, which means that flamebait topics that draw a lot of upvotes, like “platform X is cancer” or “Y is doing some horrible thing”, often get pinned to the top of r/programming for an entire day, since the number of people who don’t want to see that is drowned out by the number of people who upvote outrageous stories. If you read the comments for one of the “X is cancer” posts on r/programming, the top comment will almost inevitably be that the post has no content, that the author of the post is a troll who never posts anything with content, and that we’d be better off with less flamebait by the author at the top of r/programming. But the people who will upvote outrage porn outnumber the people who will downvote it, so that kind of stuff dominates aggregators that use raw votes for ranking. Having flamebait drop off the front page quickly is significant, but it doesn’t seem sufficient to explain why there are so many more well-informed comments on HN than on other forums with roughly similar traffic.

Maybe the answer is that people come to HN for the same reason people come to Silicon Valley – despite all the downsides, there’s a relatively large concentration of experts there across a wide variety of CS-related disciplines. If that’s true, and it’s a combination of path dependence and network effects, that’s pretty depressing, since that’s not replicable.

If you liked this curated list of comments, you’ll probably also like this list of books and this list of blogs.

This is part of an experiment where I write up thoughts quickly, without proofing or editing. Apologies if this is less clear than a normal post. This is probably going to be the last post like this, for now, since, by quickly writing up a post whenever I have something that can be written up quickly, I’m building up a backlog of post ideas that require re-reading the literature in an area or running experiments.

P.S. Please suggest other good comments! By their nature, HN comments are much less discoverable than stories, so there are a lot of great comments that I haven’t seen.


  1. if you’re one of those people, you’ve probably already thought of this, but maybe consider, at the margin, blogging more and commenting on HN less? As a result of writing this post, I looked through my old HN comments and noticed that I wrote this comment three years ago, which is another way of stating the second half of this post I wrote recently. Comparing the two, I think the HN comment is substantially better written. But, like most HN comments, it got some traffic while the story was still current and is now buried, and AFAICT, nothing really happened as a result of the comment. The blog post, despite being “worse”, has gotten some people to contact me personally, and I’ve had some good discussions about that and other topics as a result. Additionally, people occasionally contact me about older posts I’ve written; I continue to get interesting stuff in my inbox as a result of having written posts years ago. Writing your comment up as a blog post will almost certainly provide more value to you, and if it gets posted to HN, it will probably provide no less value to HN.

    Steve Yegge has a pretty good list of reasons why you should blog that I won’t recapitulate here. And if you’re writing substantial comments on HN, you’re already doing basically everything you’d need to do to write a blog, except that you’re putting the text into a little box on HN instead of into a static site generator or some hosted blogging service. BTW, I’m not just saying this for your benefit: my selfish reason for writing this appeal is that I really want to read the Nathan Kurz blog on low-level optimizations, the Jonathan Tang blog on what it’s like to work at startups vs. big companies, etc.


File crash consistency and filesystems are hard

I haven’t used a desktop email client in years. None of them could handle the volume of email I get without at least occasionally corrupting my mailbox. Pine, Eudora, and outlook have all corrupted my inbox, forcing me to restore from backup. How is it that desktop mail clients are less reliable than gmail, even though my gmail account not only handles more email than I ever had on desktop clients, but also allows simultaneous access from multiple locations across the globe? Distributed systems have an unfair advantage, in that they can be robust against total disk failure in a way that desktop clients can’t, but none of the file corruption issues I’ve had have been from total disk failure. Why has my experience with desktop applications been so bad?

Well, what sort of failures can occur? Crash consistency (maintaining consistent state even if there’s a crash) is probably the easiest property to consider, since we can assume that everything, from the filesystem to the disk, works correctly; let’s consider that first.

Crash Consistency

Pillai et al. had a paper and presentation at OSDI ‘14 on exactly how hard it is to save data without corruption or data loss.

Let’s look at a simple example of what it takes to save data in a way that’s robust against a crash. Say we have a file that contains the text a foo and we want to update the file to contain a bar. The pwrite function looks like it’s designed for this exact thing. It takes a file descriptor, what we want to write, a length, and an offset. So we might try

pwrite([file], "bar", 3, 2)  // write 3 bytes at offset 2

What happens? If nothing goes wrong, the file will contain a bar, but if there’s a crash during the write, we could get a boo, a far, or any other combination. Note that you may want to consider this an example over sectors or blocks and not chars/bytes.

If we want atomicity (so we either end up with a foo or a bar but nothing in between) one standard technique is to make a copy of the data we’re about to change in an undo log file, modify the “real” file, and then delete the log file. If a crash happens, we can recover from the log. We might write something like

creat(/dir/log);
write(/dir/log, "2,3,foo", 7);
pwrite(/dir/orig, "bar", 3, 2);
unlink(/dir/log);

This should allow recovery from a crash without data corruption via the undo log, at least if we’re using ext3 and we made sure to mount our drive with data=journal. But we’re out of luck if, like most people, we’re using the default [1] – with the default data=ordered, the write and pwrite syscalls can be reordered, causing the write to orig to happen before the write to the log, which defeats the purpose of having a log. We can fix that.

creat(/dir/log);
write(/dir/log, "2, 3, foo");
fsync(/dir/log);  // don't allow write to be reordered past pwrite
pwrite(/dir/orig, "bar", 3, 2);
fsync(/dir/orig);
unlink(/dir/log);

That should force things to occur in the correct order, at least if we’re using ext3 with data=journal or data=ordered. If we’re using data=writeback, a crash during the write or fsync to log can leave log in a state where the filesize has been adjusted for the write of “2, 3, foo”, but the data hasn’t been written, which means that the log will contain random garbage. This is because with data=writeback, metadata is journaled, but data operations aren’t, which means that data operations (like writing data to a file) aren’t ordered with respect to metadata operations (like adjusting the size of a file for a write).

We can fix that by adding a checksum to the log file when creating it. If the contents of log don’t contain a valid checksum, then we’ll know that we ran into the situation described above.

creat(/dir/log);
write(/dir/log, "2, 3, [checksum], foo");  // add checksum to log file
fsync(/dir/log);
pwrite(/dir/orig, "bar", 3, 2);
fsync(/dir/orig);
unlink(/dir/log);

That’s safe, at least on current configurations of ext3. But it’s legal for a filesystem to end up in a state where the log is never created unless we issue an fsync to the parent directory.

creat(/dir/log);
write(/dir/log, "2, 3, [checksum], foo");
fsync(/dir/log);
fsync(/dir);  // fsync parent directory of log file
pwrite(/dir/orig, "bar", 3, 2);
fsync(/dir/orig);
unlink(/dir/log);

That should prevent corruption on any Linux filesystem, but if we want to make sure that the file actually contains “bar”, we need another fsync at the end.

creat(/dir/log);
write(/dir/log, "2, 3, [checksum], foo");
fsync(/dir/log);
fsync(/dir);
pwrite(/dir/orig, "bar", 3, 2);
fsync(/dir/orig);
unlink(/dir/log);
fsync(/dir);

That results in consistent behavior and guarantees that our operation actually modifies the file after it’s completed, as long as we assume that fsync actually flushes to disk. OS X and some versions of ext3 have an fsync that doesn’t really flush to disk. OS X requires fcntl(F_FULLFSYNC) to flush to disk, and some versions of ext3 only flush to disk if the inode changed (which would only happen at most once a second on writes to the same file, since the inode mtime has one-second granularity), as an optimization.
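
As an aside, here’s a minimal sketch of what a “flush this file for real” helper might look like if you only care about Linux and OS X; the helper name full_fsync and the fallback behavior are choices made for illustration, not something taken from the post’s sources:

#include <fcntl.h>
#include <unistd.h>

// Sketch: flush a file descriptor to stable storage. On OS X, fsync() may not
// flush the drive's write cache, so try fcntl(F_FULLFSYNC) first and fall back
// to plain fsync() if that fails (e.g., on filesystems that don't support it).
int full_fsync(int fd) {
#ifdef __APPLE__
    if (fcntl(fd, F_FULLFSYNC) == 0)
        return 0;
#endif
    return fsync(fd);
}

Even that only covers what the OS promises; as noted below, some disks ignore the flush command entirely.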

Even if we assume fsync issues a flush command to the disk, some disks ignore flush directives for the same reason fsync is gimped on OS X and some versions of ext3 – to look better in benchmarks. Handling that is beyond the scope of this post, but the Rajimwale et al. DSN ‘11 paper and related work cover that issue.

Filesystem semantics

When the authors examined ext2, ext3, ext4, btrfs, and xfs, they found that there are substantial differences in how code has to be written to preserve consistency. They wrote a tool that collects block-level filesystem traces, and used that to determine which properties don’t hold for specific filesystems. The authors are careful to note that they can only determine when properties don’t hold – if they don’t find a violation of a property, that’s not a guarantee that the property holds.

Different filesystems have very different properties

Xs indicate that a property is violated. The atomicity properties are basically what you’d expect, e.g., no X for single sector overwrite means that writing a single sector is atomic. The authors note that the atomicity of single sector overwrite sometimes comes from a property of the disks they’re using, and that running these filesystems on some disks won’t give you single sector atomicity. The ordering properties are also pretty much what you’d expect from their names, e.g., an X in the “Overwrite -> Any op” row means that an overwrite can be reordered with some operation.

After they created a tool to test filesystem properties, they then created a tool to check if any applications rely on any potentially incorrect filesystem properties. Because invariants are application specific, the authors wrote checkers for each application tested.

Everything is broken

The authors find issues with most of the applications tested, including things you’d really hope would work, like LevelDB, HDFS, Zookeeper, and git. In a talk, one of the authors noted that the developers of SQLite have a very deep understanding of these issues, but even that wasn’t enough to prevent all bugs. That speaker also noted that version control systems were particularly bad about this, and that the developers had a pretty lax attitude that made it very easy for the authors to find a lot of issues in their tools. The most common class of error was incorrectly assuming ordering between syscalls. The next most common class of error was assuming that syscalls were atomic [2]. These are fundamentally the same issues people run into when doing multithreaded programming. Correctly reasoning about re-ordering behavior and inserting barriers correctly is hard. But even though shared memory concurrency is considered a hard problem that requires great care, writing to files isn’t treated the same way, even though it’s actually harder in a number of ways.
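
To make those two classes of error concrete, here’s a sketch, in the same pseudocode style as the examples above, of the common “write a temp file, then rename it over the original” pattern (this is a generic illustration, not code from any of the applications studied):

creat(/dir/config.tmp);
write(/dir/config.tmp, "new contents");
rename(/dir/config.tmp, /dir/config);  // assumes rename is atomic and ordered after the data write

On a filesystem that can reorder these operations, the rename can become durable before the data does, so a crash can leave an empty or partially written config file. Pinning the ordering down explicitly looks something like:

creat(/dir/config.tmp);
write(/dir/config.tmp, "new contents");
fsync(/dir/config.tmp);  // make the data durable before the rename
rename(/dir/config.tmp, /dir/config);
fsync(/dir);             // make the rename itself durable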

Something to note here is that while btrfs’s semantics aren’t inherently less reliable than ext3/ext4, many more applications corrupt data on top of btrfs because developers aren’t used to coding against filesystems that allow directory operations to be reordered (ext2 is perhaps the most recent widely used filesystem that allowed that reordering). We’ll probably see a similar level of bug exposure when people start using NVRAM drives that have byte-level atomicity. People almost always just run some tests to see if things work, rather than making sure they’re coding against what’s legal in a POSIX filesystem.

Hardware memory ordering semantics are usually well documented in a way that makes it simple to determine precisely which operations can be reordered with which other operations, and which operations are atomic. By contrast, here’s the ext manpage on its three data modes:

journal: All data is committed into the journal prior to being written into the main filesystem.

ordered: This is the default mode. All data is forced directly out to the main file system prior to its metadata being committed to the journal.

writeback: Data ordering is not preserved – data may be written into the main filesystem after its metadata has been committed to the journal. This is rumoured to be the highest-throughput option. It guarantees internal filesystem integrity, however it can allow old data to appear in files after a crash and journal recovery.

The manpage literally refers to rumor. This is the level of documentation we have. If we look back at our example where we had to add an fsync between the write(/dir/log, “2, 3, foo”) and pwrite(/dir/orig, “bar”, 3, 2) to prevent reordering, I don’t think the necessity of the fsync is obvious from the description in the manpage. If you look at the hardware memory ordering “manpage” above, it specifically defines the ordering semantics, and it certainly doesn’t rely on rumor.

This isn’t to say that filesystem semantics aren’t documented anywhere. Between lwn and LKML, it’s possible to get a good picture of how things work. But digging through all of that is hard enough that it’s still quite common for there to be long, uncertain discussions on how things work. A lot of the information out there is wrong, and even when information was right at the time it was posted, it often goes out of date.

When digging through archives, I’ve often seen a post from 2005 cited to back up the claim that OS X fsync is the same as Linux fsync, and that OS X fcntl(F_FULLFSYNC) is even safer than anything available on Linux. Even at the time, I don’t think that was true for the 2.4 kernel, although it was true for the 2.6 kernel. But since 2008 or so Linux 2.6 with ext3 will do a full flush to disk for each fsync (if the disk supports it, and the filesystem hasn’t been specially configured with barriers off).

Another issue is that you often also see exchanges like this one:

Dev 1: Personally, I care about metadata consistency, and ext3 documentation suggests that journal protects its integrity. Except that it does not on broken storage devices, and you still need to run fsck there.
Dev 2: as the ext3 authors have stated many times over the years, you still need to run fsck periodically anyway.
Dev 1: Where is that documented?
Dev 2: linux-kernel mailing list archives.
Dev 3: Probably from some 6-8 years ago, in e-mail postings that I made.

Where’s this documented? Oh, in some mailing list post 6-8 years ago (which makes it 12-14 years from today). I don’t mean to pick on filesystem devs. The fs devs whose posts I’ve read are quite polite compared to LKML’s reputation; they generously spend a lot of their time responding to basic questions and I’m impressed by how patient the expert fs devs are with askers, but it’s hard for outsiders to troll through a decade and a half of mailing list postings to figure out which ones are still valid and which ones have been obsoleted!

In their OSDI 2014 talk, the authors of the paper we’re discussing noted that when they reported bugs they’d found, developers would often respond “POSIX doesn’t let filesystems do that”, without being able to point to any specific POSIX documentation to support their statement. If you’ve followed Kyle Kingsbury’s Jepsen work, this may sound familiar, except devs respond with “filesystems don’t do that” instead of “networks don’t do that”. I think this is understandable, given how much misinformation is out there. Not being a filesystem dev myself, I’d be a bit surprised if I don’t have at least one bug in this post.

Filesystem correctness

We’ve already encountered a lot of complexity in saving data correctly, and this only scratches the surface of what’s involved. So far, we’ve assumed that the disk works properly, or at least that the filesystem is able to detect when the disk has an error via SMART or some other kind of monitoring. I’d always figured that was the case until I started looking into it, but that assumption turns out to be completely wrong.

The Prabhakaran et al. SOSP 05 paper examined how filesystems respond to disk errors in some detail. They created a fault injection layer that allowed them to inject disk faults and then ran things like chdir, chroot, stat, open, write, etc. to see what would happen.

Between ext3, reiserfs, and NTFS, reiserfs is the best at handling errors and it seems to be the only filesystem where errors were treated as first class citizens during design. It’s mostly consistent about propagating errors to the user on reads, and calling panic on write failures, which triggers a restart and recovery. This general policy allows the filesystem to gracefully handle read failure and avoid data corruption on write failures. However, the authors found a number of inconsistencies and bugs. For example, reiserfs doesn’t correctly handle read errors on indirect blocks and leaks space, and a specific type of write failure doesn’t prevent reiserfs from updating the journal and committing the transaction, which can result in data corruption.

Reiserfs is the good case. The authors found that ext3 ignored write failures in most cases, and rendered the filesystem read-only in most cases for read failures. This seems like pretty much the opposite of the policy you’d want. Ignoring write failures can easily result in data corruption, and remounting the filesystem as read-only is a drastic overreaction if the read error was a transient error (transient errors are common). Additionally, ext3 did the least consistency checking of the three filesystems and was the most likely to not detect an error. In one presentation, one of the authors remarked that the ext3 code had lots of comments like “I really hope a write error doesn’t happen here” in places where errors weren’t handled.

NTFS is somewhere in between. The authors found that it has many consistency checks built in, and is pretty good about propagating errors to the user. However, like ext3, it ignores write failures.

The paper has much more detail on the exact failure modes, but the details are mostly of historical interest as many of the bugs have been fixed.

It would be really great to see an updated version of the paper, and in one presentation someone in the audience asked if there was more up to date information. The presenter replied that they’d be interested in knowing what things look like now, but that it’s hard to do that kind of work in academia because grad students don’t want to repeat work that’s been done before, which is pretty reasonable given the incentives they face. Doing replications is a lot of work, often nearly as much work as the original paper, and replications usually give little to no academic credit. This is one of the many cases where the incentives align very poorly with producing real world impact.

The Gunawi et al. FAST ‘08 paper is another one it would be great to see replicated today. That paper follows up the paper we just looked at, and examines the error handling code in different file systems, using a simple static analysis tool to find cases where errors are being thrown away. Being thrown away is defined very loosely in the paper — code like the following

if (error) {
    printk("I have no idea how to handle this error\n");
}

is considered not throwing away the error. Errors are considered to be ignored if the execution flow of the program doesn’t depend on the error code returned from a function that returns an error code.
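
In other words (a hypothetical illustration using a made-up write_block helper, not code from any of the filesystems studied), the first call below counts as a dropped error because nothing ever examines the return value, while the second does not, even though it does nothing useful with the error:

err = write_block(dev, blocknr, buf);      // return value never examined: counted as dropped
err = write_block(dev, blocknr + 1, buf);
if (err)
    printk("write failed, continuing anyway\n");  // control flow depends on err: not counted as dropped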

With that tool, they find that most filesystems drop a lot of error codes:


Rank   FS (by % broken)   % broken   FS (by Viol/Kloc)   Viol/Kloc
1      IBM JFS            24.4       ext3                7.2
2      ext3               22.1       IBM JFS             5.6
3      JFFS v2            15.7       NFS Client          3.6
4      NFS Client         12.9       VFS                 2.9
5      CIFS               12.7       JFFS v2             2.2
6      MemMgmt            11.4       CIFS                2.1
7      ReiserFS           10.5       MemMgmt             2.0
8      VFS                8.4        ReiserFS            1.8
9      NTFS               8.1        XFS                 1.4
10     XFS                6.9        NFS Server          1.2


Comments they found next to ignored errors include: “Should we pass any errors back?”, “Error, skip block and hope for the best.”, “There’s no way of reporting error returned from ext3_mark_inode_dirty() to user space. So ignore it.“, “Note: todo: log error handler.“, “We can’t do anything about an error here.”, “Just ignore errors at this point. There is nothing we can do except to try to keep going.”, “Retval ignored?”, and “Todo: handle failure.”

One thing to note is that in a lot of cases, ignoring an error is more of a symptom of an architectural issue than a bug per se (e.g., ext3 ignored write errors during checkpointing because it didn’t have any kind of recovery mechanism). But even so, the authors of the papers found many real bugs.

Error recovery

Every widely used filesystem has bugs that will cause problems on error conditions, which brings up two questions: how well do the standard recovery tools handle the resulting corruption, and how often do errors actually occur? On the first question, the Gunawi et al. OSDI ‘08 paper finds that fsck, a standard utility for checking and repairing file systems, “checks and repairs certain pointers in an incorrect order … the file system can even be unmountable after”.

At this point, we know that it’s quite hard to write files in a way that ensures their robustness even when the underlying filesystem is correct, the underlying filesystem will have bugs, and that attempting to repair corruption to the filesystem may damage it further or destroy it. How often do errors happen?

Error frequency

The Bairavasundaram et al. SIGMETRICS ‘07 paper found that, depending on the exact model, between 5% and 20% of disks would have at least one error over a two year period. Interestingly, many of these were isolated errors – 38% of disks with errors had only a single error, and 80% had fewer than 50 errors. A follow-up study looked at corruption and found that silent data corruption that was only detected by checksumming happened on .5% of disks per year, with one extremely bad model showing corruption on 4% of disks in a year.

It’s also worth noting that they found very high locality in error rates between disks on some models of disk. For example, there was one model of disk that had a very high error rate in one specific sector, making many forms of RAID nearly useless for redundancy.

That’s another study it would be nice to see replicated. Most studies on disk focus on the failure rate of the entire disk, but if what you’re worried about is data corruption, errors in non-failed disks are more worrying than disk failure, which is easy to detect and mitigate.

Conclusion

Files are hard. Butler Lampson has remarked that when they came up with threads, locks, and condition variables at PARC, they thought that they were creating a programming model that anyone could use, but that there’s now decades of evidence that they were wrong. We’ve accumulated a lot of evidence that humans are very bad at reasoning about these kinds of problems, which are very similar to the problems you have when writing correct code to interact with current filesystems. Lampson suggests that the best known general purpose solution is to package up all of your parallelism into as small a box as possible and then have a wizard write the code in the box. Translated to filesystems, that’s equivalent to saying that as an application developer, writing to files safely is hard enough that it should be done via some kind of library and/or database, not by directly making syscalls.

SQLite is quite good in terms of reliability if you want a good default. However, some people find it to be too heavyweight if all they want is a file-based abstraction. What they really want is a sort of polyfill for the file abstraction that works on top of all filesystems without having to understand the differences between different configurations (and even different versions) of each filesystem. Since that doesn’t exist yet, when no existing library is sufficient, you need to checksum your data since you will get silent errors and corruption. The only questions are whether or not you detect the errors and whether or not your record format only destroys a single record when corruption happens, or if it destroys the entire database. As far as I can tell, most desktop email client developers have chosen to go the route of destroying all of your email if corruption happens.
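
For a sense of what per-record checksumming can look like, here’s a sketch where each record carries its own length and checksum, so a reader can discard a single corrupt record instead of the whole file. The record layout is an assumption made for illustration (it isn’t a format from any particular application), and crc32 here is zlib’s, so this needs to link against -lz:

#include <stdint.h>
#include <stdio.h>
#include <zlib.h>  // crc32(); link with -lz

// Each record on disk is [4-byte length][4-byte CRC32 of payload][payload],
// written in native byte order (fine for a sketch).
int write_record(FILE *f, const void *payload, uint32_t len) {
    uint32_t crc = (uint32_t)crc32(0L, (const unsigned char *)payload, len);
    if (fwrite(&len, sizeof len, 1, f) != 1) return -1;
    if (fwrite(&crc, sizeof crc, 1, f) != 1) return -1;
    if (fwrite(payload, 1, len, f) != len) return -1;
    return 0;
}

// Returns 1 if a record was read and its checksum matched; 0 on EOF or on a
// torn/corrupt record, which the caller can skip or treat as end-of-log.
int read_record(FILE *f, void *buf, uint32_t bufsize, uint32_t *out_len) {
    uint32_t len, crc;
    if (fread(&len, sizeof len, 1, f) != 1) return 0;
    if (fread(&crc, sizeof crc, 1, f) != 1) return 0;
    if (len > bufsize || fread(buf, 1, len, f) != len) return 0;
    if ((uint32_t)crc32(0L, (const unsigned char *)buf, len) != crc) return 0;  // corruption detected
    *out_len = len;
    return 1;
}

This only covers detecting corruption after the fact; making the writes durable and correctly ordered still requires the fsync dance described earlier in the post.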

These studies also hammer home the point that conventional testing isn’t sufficient. There were multiple cases where the authors of a paper wrote a relatively simple tool and found a huge number of bugs. You don’t need any deep computer science magic to write the tools. The error propagation checker from the paper that found a ton of bugs in filesystem error handling was 4k LOC. If you read the paper, you’ll see that the authors observed that the tool had a very large number of shortcomings because of its simplicity, but despite those shortcomings, it was able to find a lot of real bugs. I wrote a vaguely similar tool at my last job to enforce some invariants, and it was literally two pages of code. It didn’t even have a real parser (it just went line-by-line through files and did some regexp matching to detect the simple errors that it’s possible to detect with just a state machine and regexes), but it found enough bugs that it paid for itself in development time the first time I ran it.

Almost every software project I’ve seen has a lot of low hanging testing fruit. Really basic random testing, static analysis, and fault injection can pay for themselves in terms of dev time pretty much the first time you use them.
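
As one example of what “really basic” means, here’s a hedged fault-injection sketch in Python: a drop-in file object whose writes fail randomly, so a test can check that injected failures surface as handled errors rather than silent corruption. The class and failure rate are made up for illustration; the same idea can be applied at other layers (an LD_PRELOAD shim, a FUSE filesystem, and so on).

    import io
    import random

    class FlakyFile(io.FileIO):
        """A file object that randomly fails writes, for fault-injection tests."""

        def __init__(self, path, mode="w", failure_rate=0.01, seed=None):
            super().__init__(path, mode)
            self.failure_rate = failure_rate
            self.rng = random.Random(seed)

        def write(self, data):
            if self.rng.random() < self.failure_rate:
                raise OSError(5, "injected I/O error")  # errno 5 == EIO
            return super().write(data)

Run the code under test with FlakyFile in place of open(), using a fixed seed so failures are reproducible, and assert that every injected error is either handled or propagated instead of being swallowed.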

Appendix

I’ve probably covered less than 20% of the material in the papers I’ve referred to here. Here’s a bit about some other neat things you can find in those papers, and in others.

Pillai et al., OSDI ‘14: this paper goes into much more detail about what’s required for crash consistency than this post does. It also gives a fair amount of detail about how exactly applications fail, including diagrams of traces that indicate which false assumptions are embedded in each trace.

Chidambaram et al., FAST ‘12: the same filesystem primitives are responsible for both consistency and ordering. The authors propose alternative primitives that separate these concerns, allowing better performance while maintaining safety.

Rajimwale et al. DSN ‘11: you probably shouldn’t use disks that ignore flush directives, but in case you do, here’s a protocol that forces those disks to flush using normal filesystem operations. As you might expect, the performance of this is quite bad.

Prabhakaran et al. SOSP ‘05: This has a lot more detail on filesystem responses to errors than was covered in this post. The authors also discuss JFS, an IBM filesystem for AIX. Although it was designed for high-reliability systems, it isn’t particularly more reliable than the alternatives. Related material is covered further in DSN ‘08, StorageSS ‘06, DSN ‘06, FAST ‘08, and USENIX ‘09, among others.

Gunawi et al. FAST ‘08: Again, much more detail than is covered in this post on when errors get dropped, and how they wrote their tools. They also have some call graphs that give you one rough measure of the complexity involved in a filesystem. The XFS call graph is particularly messy, and one of the authors noted in a presentation that an XFS developer said that XFS was fun to work on since they took advantage of every possible optimization opportunity regardless of how messy it made things.

Bairavasundaram et al. SIGMETRICS ‘07: There’s a lot of information on disk error locality and disk error probability over time that isn’t covered in this post. A follow-up paper in FAST ‘08 has more details.

Gunawi et al. OSDI ‘08: This paper has a lot more detail about when fsck doesn’t work. In a presentation, one of the authors mentioned that fsck is the only program that’s ever insulted him. Apparently, if you have a corrupt pointer that points to a superblock, fsck destroys the superblock (possibly rendering the disk unmountable), tells you something like “you dummy, you must have run fsck on a mounted disk”, and then gives up. In the paper, the authors reimplement basically all of fsck using a declarative model, and find that the declarative version is shorter, easier to understand, and much easier to extend, at the cost of being somewhat slower.

Memory errors are beyond the scope of this post, but memory corruption can cause disk corruption. This is especially annoying because memory corruption can cause you to take a checksum of bad data and write a bad checksum. It’s also possible to corrupt in-memory pointers, which often results in something very bad happening. See the Zhang et al. FAST ‘10 paper for more on how ZFS is affected by that. There’s a meme going around that ZFS is safe against memory corruption because it checksums, but that paper found that critical things held in memory aren’t checksummed, and that memory errors can cause data corruption in real scenarios.

The sqlite devs are serious about both documentation and testing. If I wanted to write a reliable desktop application, I’d start by reading the sqlite docs and then talking to some of the core devs. If I wanted to write a reliable distributed application I’d start by getting a job at Google and then reading the design docs and postmortems for GFS, Colossus, Spanner, etc. J/k, but not really.

We haven’t looked at formal methods at all, but there have been a variety of attempts to formally verify properties of filesystems, such as SibylFS.

This list isn’t intended to be exhaustive. It’s just a list of things I’ve read that I think are interesting.

Update: many people have read this post and suggested that, in the first file example, you should use the much simpler protocol of copying the file to be modified to a temp file, modifying the temp file, and then renaming the temp file to overwrite the original file. In fact, that’s probably the most common comment I’ve gotten on this post. If you think this solves the problem, I’m going to ask you to pause for five seconds and consider the problems it might have. First, you still need to fsync in multiple places. Second, you will get very poor performance with large files. People have also suggested using many small files to work around that problem, but that will also give you very poor performance unless you do something fairly exotic. Third, if there’s a hardlink, you’ve now made the problem of crash consistency much more complicated than in the original example. Fourth, you’ll lose file metadata, sometimes in ways that can’t be fixed up after the fact. That problem can, on some filesystems, be worked around with ioctls, but that only sometimes fixes the issue, and now you’ve got filesystem-specific code just to preserve correctness even in the non-crash case. And that’s just the beginning. The fact that so many people thought this was a simple solution demonstrates that this is a problem people are prone to underestimating, even when they’re explicitly warned that people tend to underestimate it!
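
For reference, here’s roughly what the suggested copy-and-rename protocol looks like in Python once the fsyncs are in place, assuming a typical Linux filesystem. This is a sketch of what commenters are proposing, not a recommendation: it illustrates the first problem above (you need multiple fsyncs) and does nothing about the hardlink, metadata, or large-file performance problems.

    import os
    import shutil

    def rewrite_file(path, modify):
        """Copy-modify-rename sketch; assumes no hardlinks and ignores metadata."""
        tmp = path + ".tmp"          # a real version needs a unique name in the same directory
        shutil.copyfile(path, tmp)   # copies the whole file: slow for large files
        with open(tmp, "r+b") as f:
            modify(f)
            f.flush()
            os.fsync(f.fileno())     # fsync #1: the temp file's data
        os.rename(tmp, path)         # atomic on POSIX, but not yet durable
        dirfd = os.open(os.path.dirname(path) or ".", os.O_DIRECTORY)
        try:
            os.fsync(dirfd)          # fsync #2: the directory, so the rename survives a crash
        finally:
            os.close(dirfd)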

If you liked this, you’ll probably enjoy this post on CPU bugs.

Thanks to Leah Hanson, Katerina Barone-Adesi, Jamie Brandon, Kamal Marhubi, Joe Wilder, David Turner, Benjamin Gilbert, Tom Murphy, Chris Ball, Joe Doliner, Alexy Romanov, Mindy Preston, Paul McJones, and Evan Jones for comments/discussion.


  1. Turns out some commercially supported distros only support data=ordered. Oh, and when I said data=ordered was the default, that’s only the case for kernels before 2.6.30. After 2.6.30, there’s a config option, CONFIG_EXT3_DEFAULTS_TO_ORDERED; if that’s not set, the default becomes data=writeback. [return]
  2. Cases where overwrite atomicity is required were documented as known issues, and all such cases assumed single-block atomicity and not multi-block atomicity. By contrast, multiple applications (LevelDB, Mercurial, and HSQLDB) had bad data corruption bugs that came from assuming appends are atomic.

    That seems to be an indirect result of a commonly used update protocol, where modifications are logged via appends, and then logged data is written via overwrites. Application developers are careful to check for and handle errors in the actual data, but the errors in the log file are often overlooked.

    There are a number of other classes of errors discussed, and I recommend reading the paper for the details if you work on an application that writes files.

    [return]