It’s been forever since I updated the blog, and now that I finally am, it’s with a linkdump. :) But it’s also a good chance to roll out a new theme, as I was getting sick of the dark one.
Here are some of the interesting things I’ve been reading lately.
I found this article on failures in distributed systems at Facebook really interesting. It covers commonly observed failure modes, advice for preventing failures from cascading across a large system, improvements to monitoring dashboards, and a description of Facebook’s incident review methodology. There are a lot of good practical lessons here, and I highly recommend it.
One small section that stood out to me called out the fact that human-initiated changes are a major source of failure, an insight I’ve seen in a number of places lately:
These two data points seem to suggest that when Facebook employees are not actively making changes to infrastructure because they are busy with other things (weekends, holidays, or even performance reviews), the site experiences higher levels of reliability. We believe this is not a result of carelessness on the part of people making changes but rather evidence that our infrastructure is largely self-healing in the face of non-human causes of errors such as machine failure.
I’m actually sharing this link for two reasons: because it’s thought-provoking, and because I disagree with it so much! Cheney makes some interesting arguments for the idea that there are only two types of logs: debug logs that programmers care about, and info logs that users care about. Other log levels are unneeded.
My disagreement could probably be expanded into a whole blog post. But suffice it to say that I think having levels of logging is extremely useful from an operational perspective and makes monitoring and filtering those logs a lot easier. In my opinion, collapsing these into “debug” and “info” would only result in a lot of custom tags in the text of the logs, and a lot more work for regex parsers in the monitoring system. :)
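To make the operational argument concrete, here’s a minimal sketch (mine, not Cheney’s, and not from any production system) using Python’s standard logging module. The point is that with levels, routing “what the monitoring system cares about” is a single integer comparison on the handler; collapse everything to debug/info and you’re back to regexes over message text.

```python
import logging

# Collected (level, message) pairs, standing in for a monitoring pipeline.
records = []

class ListHandler(logging.Handler):
    """Capture records so we can inspect what the handler let through."""
    def emit(self, record):
        records.append((record.levelname, record.getMessage()))

logger = logging.getLogger("demo")
logger.setLevel(logging.DEBUG)   # the logger itself passes everything along
logger.propagate = False         # keep the root logger out of this example

handler = ListHandler()
handler.setLevel(logging.WARNING)  # monitoring only wants WARNING and above
logger.addHandler(handler)

logger.debug("cache miss for key=42")   # filtered out by the handler
logger.info("request served in 12ms")   # filtered out by the handler
logger.warning("disk 90% full")         # kept
logger.error("upstream timed out")      # kept

# Only the WARNING and ERROR records survive -- no message parsing needed.
print(records)
```

The same filtering with only debug/info logs would mean embedding severity as ad-hoc tags in the message body and matching on them downstream, which is exactly the extra regex work I’m complaining about above.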
A Case Study - Scaling Legacy Code on Next Generation Platforms, by William Roshan Quadros at SNL for IMR24
This paper is an interesting little case study about how to scale HPC codes to platforms with new processor technologies and a higher level of parallelism – i.e., the new Trinity system at LANL. I expect to have reason to re-read this a few more times.
This is a good read about how the difference in latency between different types of data transfers (disk, network, RDMA, etc.) helps determine what kind of system design will perform best. For example, storage can often be located on a different machine rather than locally, because network latency is often much lower than the latency of a disk seek. Very much worth the read.
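The remote-storage argument above can be sketched with rough, order-of-magnitude latency figures (these are the well-known “numbers every programmer should know” ballpark values, not figures from the article itself):

```python
# Approximate latencies in nanoseconds; illustrative orders of magnitude only.
LATENCY_NS = {
    "main_memory_ref": 100,            # read from local RAM
    "datacenter_round_trip": 500_000,  # ~0.5 ms network round trip
    "disk_seek": 10_000_000,           # ~10 ms spinning-disk seek
}

# A remote read costs a network round trip plus the remote medium's access
# time. If the remote data is in RAM, the total is still far below one
# local disk seek -- which is why "put storage on another machine" can win.
remote_ram_read = LATENCY_NS["datacenter_round_trip"] + LATENCY_NS["main_memory_ref"]
local_disk_read = LATENCY_NS["disk_seek"]

print(remote_ram_read < local_disk_read)
```

With these numbers the remote in-memory read is roughly twenty times faster than the local seek, which is the kind of gap that drives the design choices the article describes.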
A good short post on designing systems to reduce complexity and maximize the ability of a team to understand the system, not just optimizing each individual part of the system in isolation.
An interesting and disturbing little near-future science fiction story, on Vice’s Motherboard site. (That site has been impressing me a lot lately.)