After I started at Twitter, I spent one day building a system that immediately found a mid 7 figure optimization (which ended up shipping). In the first year, we shipped mid 8 figures per year worth of cost savings as a result. The main feature this system introduces is the ability to query metrics data across all hosts and all services and over any period of time (since inception), so we've called it LongTermMetrics (LTM) internally, since I like boring, descriptive names.

This got started when I was looking for a starter project that would both help me understand the Twitter infra stack and also have some easily quantifiable value. Andy Wilcox suggested looking at JVM survivor space utilization for some large services. If you're not familiar with survivor space, you can think of it as a configurable, fixed-size buffer in the JVM (at least if you use the GC algorithm that's default at Twitter). At the time, if you looked at a random large service, you'd usually find that either:
- The buffer was too small, resulting in poor performance, sometimes catastrophically poor when under high load.
- The buffer was too large, resulting in wasted memory, i.e., wasted money.
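To make the fixed-size-buffer framing concrete: with a non-adaptive configuration, each survivor space is a fixed slice of the young generation, controlled by the `-XX:SurvivorRatio` flag. A minimal sketch of the arithmetic (the 1 GiB young gen is a made-up example size, not a recommendation):

```python
def survivor_space_bytes(young_gen_bytes: int, survivor_ratio: int = 8) -> int:
    """Size of ONE survivor space.

    With -XX:SurvivorRatio=R, eden is R times the size of a single
    survivor space, and the young gen holds eden plus two survivor
    spaces, so young = (R + 2) * survivor.
    """
    return young_gen_bytes // (survivor_ratio + 2)

# A hypothetical 1 GiB young gen with the default ratio of 8 gives
# two ~102 MiB survivor spaces and an ~819 MiB eden.
young = 1024 * 1024 * 1024
print(survivor_space_bytes(young))  # 107374182
```

Raising the ratio shrinks the survivor spaces (risking premature promotion under load); lowering it grows them (risking wasted memory), which is exactly the under/over-provisioning trade-off above.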
But instead of looking at random services, there's no fundamental reason that we shouldn't be able to query all services and get a list of which services have room for improvement in their configuration, sorted by performance degradation or potential cost savings. And if we write that query for JVM survivor space, the same approach works for other configuration parameters (e.g., other JVM parameters, CPU quota, memory quota, etc.). Writing a query that worked across all services turned out to be a bit more complicated than I was hoping due to a combination of data consistency and performance issues. Data consistency issues included things like:
- Any given metric can have ~100 names, e.g., I found 94 different names for JVM survivor space
  - I suspect there are more; these were just the ones I could find via a simple search
- The same metric name can have a different meaning for different services
  - Could be a counter or a gauge
  - Could have different units, e.g., bytes vs. MB or microseconds vs. milliseconds
- Metrics are sometimes tagged with an incorrect service name
- Zombie shards can continue to operate and report metrics even though the cluster manager has started up a new instance of the shard, resulting in duplicate and inconsistent metrics for a particular shard name
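To make the consistency problem concrete, here's a sketch of the alias-and-unit normalization a job like this has to do; the names and mappings below are invented for illustration, not Twitter's actual metric names:

```python
# Map the many observed names for a metric onto one canonical name,
# plus a factor that converts the reported unit into a canonical unit.
# Entries here are hypothetical: alias -> (canonical_name, multiplier).
ALIASES = {
    "jvm_survivor_used":        ("jvmSurvivorUsed", 1),           # already bytes
    "jvm/mem/survivor/used_mb": ("jvmSurvivorUsed", 1024 * 1024), # MB -> bytes
}

def normalize(name: str, value: float):
    """Return (canonical_name, value_in_canonical_units),
    or None to drop metrics we don't recognize."""
    entry = ALIASES.get(name)
    if entry is None:
        return None
    canonical, factor = entry
    return canonical, value * factor

print(normalize("jvm/mem/survivor/used_mb", 3.0))  # ('jvmSurvivorUsed', 3145728.0)
```

The counter-vs-gauge and zombie-shard issues need per-service and per-shard logic on top of this, but the same shape applies: a curated mapping plus a filter for data we can't interpret.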
Our metrics database, MetricsDB, was specialized to handle monitoring and dashboards and didn't support general queries. That's completely reasonable, since monitoring and dashboards are lower on Maslow's hierarchy of observability needs than general metrics analytics, but it meant that we couldn't run arbitrary SQL queries against metrics in MetricsDB.

Another way to query the data is to use the copy that gets written to HDFS in Parquet format, which allows people to run arbitrary SQL queries (as well as write Scalding (MapReduce) jobs that consume the data).

Unfortunately, due to the number of metric names, the data on HDFS can't be stored in a columnar format with one column per name: Presto gets unhappy if you feed it too many columns, and we have enough distinct metrics that we're well past that limit. If you don't use a columnar format (and don't apply any other tricks), you end up reading all of the data for any non-trivial query. The result was that you couldn't run any non-trivial query (or even many trivial queries) across all services or all hosts without having it time out. We don't have similar timeouts for Scalding, but Scalding performance is much worse and a simple Scalding query against a day's worth of metrics will usually take between three and twenty hours, depending on cluster load, making it unreasonable to use Scalding for any kind of exploratory data analysis.

Given the data infrastructure that already existed, an easy way to solve both of these problems was to write a Scalding job that stores the 0.1% to 0.01% of metrics data we care about for performance or capacity related queries and re-writes it into a columnar format. A happy side effect of this is that, since such a small fraction of the data is relevant, it's cheap to store it indefinitely. The full metrics data dump is deleted after a few weeks because it's large enough that storing it indefinitely would be prohibitively expensive; a longer metrics memory is useful for capacity planning or other analyses that want historical data.
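The shape of that job, sketched here in Python rather than Scalding (the whitelist, metric names, and row format are all hypothetical): keep only the whitelisted metrics and pivot from rows of (instance, name, value) to one column per canonical metric, which is what makes the columnar side cheap to scan.

```python
from collections import defaultdict

# Tiny illustrative whitelist; the real list covers the ~0.01-0.1% we keep.
KEEP = {"jvmSurvivorUsed", "jvmSurvivorMax", "cpuUsage"}

def to_columnar(rows):
    """rows: iterable of (instance_id, metric_name, value).
    Returns {column_name: [values...]} with one entry per instance,
    None where an instance didn't report a metric."""
    by_instance = defaultdict(dict)
    for instance, name, value in rows:
        if name in KEEP:  # drop the vast majority of metrics we don't care about
            by_instance[instance][name] = value
    instances = sorted(by_instance)
    columns = {"instance": instances}
    for name in sorted(KEEP):
        columns[name] = [by_instance[i].get(name) for i in instances]
    return columns

rows = [("a", "jvmSurvivorUsed", 10.0), ("a", "jvmSurvivorMax", 100.0),
        ("a", "someUninterestingMetric", 7.0), ("b", "jvmSurvivorUsed", 50.0)]
print(to_columnar(rows))
```

The real job writes the pivoted result back out as Parquet, so Presto sees a table with a manageable number of columns and only reads the columns a query touches.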
The 0.01% to 0.1% of data we're saving includes (but isn't limited to) the following things for each instance of each service:
- utilizations and sizes of various buffers
- CPU, memory, and other utilization
- number of threads, context switches, core migrations
- various queue depths and network stats
- JVM version, feature flags, etc.
- GC stats
- Finagle metrics
And for each host:
- various things from procfs, like idle, etc.
- what cluster the machine is part of
- host-level info like NIC speed, number of cores on the host, memory
- host-level stats for "health" issues like thermal throttling, machine checks, etc.
- OS version, host-level software versions, host-level feature flags, etc.
- Rezolus metrics
Although the impetus for this project was figuring out which services were under or over provisioned for JVM survivor space, it started with GC and container metrics, since those were very obvious things to look at, and we've been incrementally adding other metrics since then. To get an idea of the kinds of things we can query for and how easy the queries are if you know a bit of SQL, here are some examples:
Very High p90 JVM Survivor Space
This is part of the original motivation for the project: finding under/over-provisioned services. Any service with a very high p90 JVM survivor space utilization is probably under provisioned on survivor space. Similarly, anything with a very low p99 or p999 JVM survivor space utilization when under peak load is probably overprovisioned (query not displayed here, but we can scope the query to times of high load).

A Presto query for very high p90 survivor space across all services is:
```sql
with results as (
  select
    servicename,
    approx_distinct(source, 0.1) as approx_sources, -- number of instances for the service
    -- real query uses [coalesce and nullif](https://prestodb.io/docs/current/functions/conditional.html) to handle edge cases, omitted for brevity
    approx_percentile(jvmSurvivorUsed / jvmSurvivorMax, 0.90) as p90_used,
    approx_percentile(jvmSurvivorUsed / jvmSurvivorMax, 0.50) as p50_used
  from ltm_service
  where ds >= '2020-02-01' and ds <= '2020-02-28'
  group by servicename)
select * from results
where approx_sources > 100
order by p90_used desc
```
Instead of having to look through a bunch of dashboards, we can just get a list and then send diffs with config changes to the appropriate teams, or write a script that takes the output of the query and automatically writes the diff. The above query provides a template for any general utilization numbers or rates; you can look at memory usage, new or old gen GC frequency, etc., with similar queries. In one case, we found a service that was wasting enough RAM to pay my salary for a decade.

I've been moving away from using thresholds against simple percentiles to find issues, as in this query, but I'm presenting this query because it's something people often want to do that's useful; what I prefer to do instead is out of scope of this post and probably deserves its own post.
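Such a script can be very simple; here's a hypothetical sketch in the simple-threshold style of the query above (the field names, thresholds, and suggestion strings are all made up for illustration):

```python
def suggest_changes(rows, high=0.9, low=0.2):
    """rows: output of the p90 survivor space query, as dicts.
    Returns (servicename, suggestion) pairs for services whose p90
    utilization looks too high (grow the space) or too low (shrink it)."""
    suggestions = []
    for r in rows:
        if r["p90_used"] > high:
            suggestions.append((r["servicename"], "increase survivor space"))
        elif r["p90_used"] < low:
            suggestions.append((r["servicename"], "decrease survivor space"))
    return suggestions

rows = [
    {"servicename": "svc-a", "p90_used": 0.97},
    {"servicename": "svc-b", "p90_used": 0.45},
    {"servicename": "svc-c", "p90_used": 0.05},
]
print(suggest_changes(rows))
# [('svc-a', 'increase survivor space'), ('svc-c', 'decrease survivor space')]
```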
The above query was over all services, but we can also query across hosts. In addition, we can do queries that join against properties of the host, feature flags, etc.

Using one set of queries, we were able to determine that we had a significant number of services running up against network limits even though host-level network utilization was low. The compute platform team then did a slow rollout of a change to network caps, which we monitored with queries like the one below to determine that we weren't seeing any performance degradation (theoretically conceivable if increasing network caps caused hosts or switches to hit network limits).

With the network change, we were able to observe smaller queue depths, smaller queue sizes (in bytes), fewer packet drops, etc.

The query below only shows queue depths for brevity; adding all of the quantities mentioned is just a matter of typing more names in.

The general thing we can do is, for any particular rollout of a platform or service-level feature, look at the impact on real services.
```sql
with rolled as (
  select
    -- the rollout was fixed for all hosts over the period, so we can pick an arbitrary element from the period
    arbitrary(element_at(misc, 'egress_rate_limit_increase')) as rollout,
    hostId
  from ltm_deploys
  where ds = '2019-10-10'
  and zone = 'foo'
  group by hostId
), host_info as (
  select
    arbitrary(nicSpeed) as nicSpeed,
    hostId
  from ltm_host
  where ds = '2019-10-10'
  and zone = 'foo'
  group by hostId
), host_rolled as (
  select
    rollout,
    nicSpeed,
    rolled.hostId
  from rolled
  join host_info on rolled.hostId = host_info.hostId
), container_metrics as (
  select
    service,
    netTxQlen,
    hostId
  from ltm_container
  where ds >= '2019-10-10' and ds <= '2019-10-14'
  and zone = 'foo'
)
select
  service,
  nicSpeed,
  approx_percentile(netTxQlen, 1, 0.999, 0.0001) as p999_qlen,
  approx_percentile(netTxQlen, 1, 0.99, 0.001) as p99_qlen,
  approx_percentile(netTxQlen, 0.9) as p90_qlen,
  approx_percentile(netTxQlen, 0.68) as p68_qlen,
  rollout,
  count(*) as cnt
from container_metrics
join host_rolled on host_rolled.hostId = container_metrics.hostId
group by service, nicSpeed, rollout
```
Other questions that became easy to answer
- What's the latency, CPU usage, CPI, or other performance impact of X?
  - Increasing or decreasing the number of performance counters we monitor per container
  - Tweaking kernel parameters
  - OS or other releases
  - Increasing or decreasing host-level oversubscription
  - General host-level load
- What hosts have unusually poor service-level performance for every service on the host, after controlling for load, etc.?
  - This has usually turned out to be due to a hardware misconfiguration or fault
- Which services don't play nicely with other services, aside from the general impact on host-level load?
- What's the latency impact of failover, or other high-load events?
- What level of load should we expect in the future given a future high-load event plus current growth?
- Which services see more load during failover, which services see unchanged load, and which fall somewhere in between?
- What config changes can we make for any fixed sized buffer or allocation that will improve performance without increasing cost or reduce cost without degrading performance?
- Etc.; there are a lot of questions that become easy to answer if you can write arbitrary queries against historical metrics data
LTM is about as boring a system as is possible. Every design decision falls out of taking the path of least resistance.
- Why use Scalding?
  - It's standard at Twitter and the integration made everything trivial. I tried Spark, which has some advantages. However, at the time, I would have had to do manual integration work that I got for free with Scalding.
- Why is the system not real-time (with delays of at least one hour)?
  - Twitter's batch job pipeline is easy to build on; all that was necessary was to read a tutorial on how it works and then write something similar, but with different business logic.
  - There was a nicely written proposal to build a real-time analytics pipeline for metrics data written a couple of years before I joined Twitter, but that never got built because (I estimate) it would have been one to four quarters of work to produce an MVP and it wasn't clear which team had the right mandate to work on it and also had four quarters of headcount available. But adding a batch job took one day; you don't need roadmap and planning meetings for a day of work, you can just do it and then do follow-on work incrementally.
- Why use Presto and not something that allows for live slice & dice queries like Druid?
  - Rebecca Isaacs and Jonathan Simms were doing related work on tracing and we knew that we'd want to do joins across LTM and whatever they created. That's trivial with Presto but would have required more planning and work with something like Druid, at least at the time.
- Why not use Postgres or something similar?
  - The amount of data we want to store makes this infeasible without a massive amount of effort; even though the cost of data storage is quite low, it's still a "big data" problem.
I think writing about systems like this, that are just boring work is really underrated. A disproportionate number of posts and talks I read are about systems using hot technologies. I don't have anything against hot new technologies, but a lot of useful work comes from plugging boring technologies together and doing the obvious thing. Since posts and talks about boring work are relatively rare, I think writing up something like this is more useful than it has any right to be.
For example, a couple years ago, at a local meetup that Matt Singer organizes for companies in our size class to discuss infrastructure (basically, companies that are smaller than FB/Amazon/Google) I asked if anyone was doing something similar to what we'd just done. No one who was there was (or not who'd admit to it, anyway), and engineers from two different companies expressed shock that we could store so much data. This work is too straightforward and obvious to be novel, I'm sure people have built analogous systems in many places. It's literally just storing metrics data on HDFS (or, if you prefer a more general term, a data lake) indefinitely in a format that allows interactive queries.
If you do the math on the cost of metrics data storage for a project like this in a company in our size class, the storage cost is basically a rounding error. We've shipped individual diffs that easily pay for the storage cost for decades. I don't think there's any reason storing a few years or even a decade worth of metrics should be shocking when people deploy analytics and observability tools that cost much more all the time. But it turns out this was surprising, in part because people don't write up work this boring.
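For a rough sense of the math, here's a back-of-envelope sketch; every number in it is a made-up placeholder, not a real Twitter figure:

```python
# Hypothetical inputs: raw metrics volume, the retained fraction,
# HDFS replication, and a notional storage cost per TB-month.
raw_tb_per_day = 100.0      # made-up raw metrics volume
retained_fraction = 0.001   # the ~0.1% discussed above
replication = 3             # a typical HDFS replication factor
cost_per_tb_month = 5.0     # made-up $/TB-month

# One year of retained data, including replication overhead.
retained_tb_per_year = raw_tb_per_day * retained_fraction * 365 * replication
# Cost of keeping that year of data stored for a full year.
cost_per_year_of_data = retained_tb_per_year * cost_per_tb_month * 12

print(round(retained_tb_per_year, 2), round(cost_per_year_of_data, 2))  # 109.5 6570.0
```

Under these placeholder assumptions, each additional year of retained metrics costs on the order of thousands of dollars per year, which is indeed a rounding error next to mid 8 figure savings.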
An unrelated example is that, a while back, I ran into someone at a similarly sized company who wanted to get similar insights out of their metrics data. Instead of starting with something that would take a day, like this project, they started with deep learning. While I think there's value in applying ML and/or stats to infra metrics, they turned a project that could return significant value to the company after a couple of person-days into a project that took person-years. And if you're only going to either apply simple heuristics guided by someone with infra experience and simple statistical models or naively apply deep learning, I think the former will actually get you better results. Applying both sophisticated stats/ML and practitioner guided heuristics together can get you better results than either alone, but I think it makes a lot more sense to start with the simple project that takes a day to build out and maybe another day or two to start to apply than to start with a project that takes months or years to build out and start to apply. But there are a lot of biases towards doing the larger project: it makes a better resume item (deep learning!), in many places, it makes a better promo case, and people are more likely to give a talk or write up a blog post on the cool system that uses deep learning.
The above discusses why writing up work is valuable for the industry in general. We covered why writing up work is valuable to the company doing the write-up in a previous post, so I'm not going to re-hash that here.
Appendix: stuff I screwed up
I think it's unfortunate that you don't get to hear about the downsides of systems without backchannel chatter, so here are things I did that are pretty obvious mistakes in retrospect. I'll add to this when something else becomes obvious in retrospect.
- Not using a double for almost everything
  - In an ideal world, some things aren't doubles, but everything in our metrics stack goes through a stage where basically every metric is converted to a double
  - I stored most things that "should" be an integral type as an integral type, but doing the conversion from `long -> double -> long` can only introduce more rounding error than just doing the conversion from `long -> double`
  - I stored some things that shouldn't be an integral type as an integral type, which causes small values to unnecessarily lose precision
    - Fortunately, this hasn't caused serious errors for any actionable analysis I've done, but there are analyses where it could cause problems
- Longterm vs. LongTerm in the code
  - I wasn't sure which way this should be capitalized when I was first writing the code and, once I decided, I didn't grep for and squash everything that was written the wrong way, so now this pointless inconsistency exists in various places
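The rounding issue from the first item is easy to demonstrate: a 64-bit double has a 53-bit significand, so integers above 2^53 can't round-trip through a double.

```python
x = 2**53 + 1                   # first integer a double cannot represent exactly
as_double = float(x)            # long -> double rounds to the nearest representable value
round_tripped = int(as_double)  # double -> long keeps the rounded value

print(x, round_tripped)         # 9007199254740993 9007199254740992
print(round_tripped == x)       # False: the round trip lost 1
```

Metric values rarely get this large, which is why this hasn't bitten any analysis yet, but counters that grow monotonically for long enough could.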
Both of these are the kind of thing you'd expect when you crank something out quickly and don't think it through enough. The latter is trivial to fix and not much of a problem, since the pervasive use of IDEs at Twitter means that almost anyone who would be impacted can have their IDE supply the correct capitalization for them.

The former is more problematic, both in that it could actually cause incorrect analyses and in that fixing it will require migrating all of the data we have. My guess is that, at this point, it would be half a week to a week of work, which I could have easily avoided by spending thirty more seconds thinking through what I was doing.
Thanks to Leah Hanson and Andy Wilcox for comments/corrections/discussion