Closed Bug 1408112 Opened 8 years ago Closed 5 months ago

Telemetry server throws 500s

Categories

(Data Platform and Tools :: General, defect, P3)

Tracking

(Not tracked)

RESOLVED WONTFIX

People

(Reporter: ekr, Unassigned)

References

Details

Under load the telemetry server seems to throw 500s, which cause telemetry.js to fail. For instance: Could not obtain evolution: status 500 (https://aggregates.telemetry.mozilla.org/aggregates_by/submission_date/channels/beta/?version=51&dates=20171010%2C20171009%2C20171008%2C20171007%2C20171006%2C20171005%2C20171004%2C20171003%2C20171002%2C20171001%2C20170930%2C20170929%2C20170928%2C20170927%2C20170926%2C20170925%2C20170924%2C20170923%2C20170922%2C20170921%2C20170920%2C20170919%2C20170918%2C20170917%2C20170916%2C20170915%2C20170914%2C20170913%2C20170912%2C20170911%2C20170910%2C20170909%2C20170908%2C20170907%2C20170906%2C20170905%2C20170904%2C20170903%2C20170902%2C20170901%2C20170831%2C20170830%2C20170829%2C20170828%2C20170827%2C20170826%2C20170825%2C20170824%2C20170823%2C20170822%2C20170821%2C20170820%2C20170819%2C20170818%2C20170817%2C20170816%2C20170815%2C20170814%2C20170813%2C20170812%2C20170811%2C20170810%2C20170809%2C20170808%2C20170807%2C20170806%2C20170805%2C20170804%2C20170803%2C20170802%2C20170801%2C20170731%2C20170730%2C20170729%2C20170728%2C20170727%2C20170726%2C20170725%2C20170724%2C20170723%2C20170722%2C20170721%2C20170720%2C20170719%2C20170718%2C20170717%2C20170716%2C20170715%2C20170714%2C20170713%2C20170712%2C20170711%2C20170710%2C20170709%2C20170708%2C20170707%2C20170706%2C20170705%2C20170704%2C20170703%2C20170702%2C20170701%2C20170630%2C20170629%2C20170628%2C20170627%2C20170626%2C20170625%2C20170624%2C20170623%2C20170622%2C20170621%2C20170620%2C20170619%2C20170618%2C20170617%2C20170616%2C20170615%2C20170614%2C20170613%2C20170612%2C20170611%2C20170610%2C20170609%2C20170608%2C20170607%2C20170606%2C20170605%2C20170604%2C20170603%2C20170602%2C20170601%2C20170531%2C20170530%2C20170529%2C20170528%2C20170527%2C20170526%2C20170525%2C20170524%2C20170523%2C20170522%2C20170521%2C20170520%2C20170519%2C20170518%2C20170517%2C20170516%2C20170515%2C20170514%2C20170511%2C20170510%2C20170509%2C20170508%2C20170507%2C20170506%2C20170505%2C20170504%2C20170503%2C2017050
2%2C20170501%2C20170430%2C20170429%2C20170428%2C20170427%2C20170426%2C20170425%2C20170424%2C20170423%2C20170422%2C20170421%2C20170420%2C20170419%2C20170418%2C20170417%2C20170416%2C20170415%2C20170414%2C20170413%2C20170412%2C20170411%2C20170410%2C20170409%2C20170408%2C20170407%2C20170406%2C20170405%2C20170404%2C20170403&metric=HTTP_PAGELOAD_IS_SSL)
Looks like this was just a temporary hiccup; the evolution request returns a 200 now with the data. EKR, have you considered using the main_summary table for HTTP_PAGELOAD_IS_SSL? It will have that data for all of release, while the aggregates dataset is prerelease-only.
Status: NEW → RESOLVED
Closed: 8 years ago
Component: Telemetry APIs for Analysis → Telemetry Aggregation Service
Resolution: --- → WORKSFORME
I somehow didn't notice that this got marked resolved. The problem isn't that it never works; it's that the server gets overloaded when you make a lot of requests and then falls over with 500s. Has something happened to fix that defect?
Status: RESOLVED → REOPENED
Resolution: WORKSFORME → ---
As part of our User Productivity initiative in 2018, we'll be looking at performance of telemetry.js and telemetry.mo. Can you paste your use-case so we can keep it in mind?
Flags: needinfo?(ekr)
See: https://github.com/ekr/moz-telemetry-dashboard Basically, this is a cache that sits in front of the API so that we can build fast dashboards.
Flags: needinfo?(ekr)
For each of three measures, for each channel+version they appear in between versions 44 and 54, get the evolution. Assuming saturation of four channels, we're talking about 120 requests, issued as fast as they can be sent, using telemetry.js v2 via telemetry-next-node. Okay. The only other mass user of telemetry-next-node that I know of is mozilla/cerberus, and it serializes requests two at a time (well, sometimes it requests up to three versions at once for each of those two, so up to six in flight). telemetry.js v2 has no built-in rate limiting or retry logic (as you've discovered). An updated library would have to take this into account.
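The two-at-a-time serialization cerberus does amounts to a small concurrency-limited task pool. A minimal sketch of that idea in Node (a hypothetical helper for illustration, not part of telemetry.js or telemetry-next-node):

```javascript
// Hypothetical helper: run the given request-starting functions with at
// most `limit` in flight at a time, preserving result order.
async function runWithConcurrency(tasks, limit) {
  const results = new Array(tasks.length);
  let next = 0;
  async function worker() {
    while (next < tasks.length) {
      const i = next++;              // claim the next task (single-threaded, so safe)
      results[i] = await tasks[i]();
    }
  }
  const workers = Array.from({ length: Math.min(limit, tasks.length) }, worker);
  await Promise.all(workers);
  return results;
}
```

For example, `runWithConcurrency(urls.map(u => () => fetch(u)), 2)` would mimic cerberus's two-at-a-time behavior instead of firing all 120 requests at once.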
Well, sort of. The semantics of how fast the HTTP requests are actually issued are complicated (it depends on whether it's HTTP/1.1 or HTTP/2, etc.). Anyway, it's not just 120: there should be a way to issue an arbitrary number of requests and have the system rate-limit them sensibly. I would argue that the server shouldn't be throwing 500s; it should be queueing on the server side.
Actually, the server should use the tools available to it (concurrent stream limits in h2, for example) to apply back pressure on clients. It seems unlikely that an HTTP/1.1 server would get so many requests from a single client as to become overloaded, because clients are naturally rate-limited. If the server is genuinely overloaded, it can send a 503 with Retry-After, and the client *should* respect that and retry automatically. 429 is also good as a preventative measure (see below), but again the client should be robust and retry. For reference, I made 120 requests to bugzilla.m.o with a fast loop: for (var i = 0; i < 120; ++i) { fetch('https://bugzilla.mozilla.org/show_bug.cgi?id=' + i.toString(), { method: 'HEAD' }).then(r => console.log(r.status)); } It took a second for many of the responses to arrive, but the first 100 worked. For the last 20 I got a nice, fast 429 response (too many requests, which is fair).
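A client that is "robust and retries" in the sense described here could be sketched like this. This is a hypothetical wrapper, not real telemetry.js API; `fetchFn` is injectable so the retry logic isn't tied to any particular fetch implementation, and the sketch only handles a Retry-After value given in seconds (the header can also carry an HTTP date):

```javascript
// Hypothetical sketch: retry on 429/503, honoring a Retry-After header.
async function fetchWithRetry(url, { fetchFn = fetch, maxRetries = 3 } = {}) {
  for (let attempt = 0; ; ++attempt) {
    const res = await fetchFn(url);
    const retryable = res.status === 429 || res.status === 503;
    if (!retryable || attempt >= maxRetries) {
      return res;  // success, non-retryable status, or out of retries
    }
    // Retry-After may be seconds or an HTTP date; this sketch handles seconds
    // only and falls back to 1s when the header is absent or unparseable.
    const header = res.headers.get('Retry-After');
    const seconds =
      header !== null && Number.isFinite(Number(header)) ? Number(header) : 1;
    await new Promise(resolve => setTimeout(resolve, seconds * 1000));
  }
}
```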
Blocks: 1430822
We talked through this in triage:
- Short-term, we could send a 429 for these scenarios if that helps, but we probably won't spend more time on it.
- Medium-term, we're looking into the underlying issue this quarter as part of the Telemetry aggregates architecture review in bug 1430822 (tracking it through this bug and bug 136950).
Priority: -- → P3
Can you elaborate on "short-term" versus "medium-term"?
Short-term: in the next few weeks. Medium-term: in this and next quarter. We'll review the performance & architecture of this service this quarter and will decide on next steps from there (for Q2+). Is anything blocked in the meantime that needs quicker solutions?
Yes. We have dashboards in the offices which use the cache linked above and don't work reliably.
OK, this is pulling histogram evolution data for custom dashboards. This is definitely a use-case we're tracking for the review. I see the dashboard pulls HTTPS fraction data through HTTP_PAGELOAD_IS_SSL [1]. For bug 1414839, we are providing a public dataset for the HTTPS adoption ratio on Firefox release for the Let's Encrypt stats, using this query: https://sql.telemetry.mozilla.org/queries/49323/source#table which can be accessed as CSV or JSON: https://sql.telemetry.mozilla.org/api/queries/49323/results.json?api_key=SywcgxNASAfDN6vXEfxCgKUQt7Jr2Vb4CIMiIta2 Is using that (or a modified version of it) a useful short-term solution? 1: https://github.com/ekr/moz-telemetry-dashboard/blob/master/static/catalog.js#L2
Well, we want to do arbitrary statistics, so it's not really a general solution.
Right. Does it cover the dashboard use-case for now though (https adoption ratio over time)?
FWIW, Gecko does not do anything special with 429/Retry-After; the response would show up in the dashboard code, which would have to handle rescheduling itself (if desired).
(In reply to Georg Fritzsche [:gfritzsche] from comment #14) > Right. Does it cover the dashboard use-case for now though (https adoption > ratio over time)? No. We want to add other dashboards now but we cannot because of this.
For a general solution, someone has to queue. You can turn down the h2 stream concurrency limit and Gecko will queue for you, but if the "overloaded" criterion is global (i.e., the sum of all requests from all clients), then you don't have much of a choice but to queue on the server side.
As near as I can tell, :ekr's code isn't running in Gecko; it's running in Node, which is why it's not hitting the "6 connections per server" pref.
That's a good point. But regardless, it's really the server's job to queue, for the reasons McManus notes above.
(In reply to Eric Rescorla (:ekr) from comment #13) > Well, we want to do arbitrary statistics, so it's not really a general > solution. Unfortunately, the aggregates dataset does not contain all of release data; as a holdover from FHR, it only has users who have opted in to data collection. Once we have all of release sending us the SSL data (bug 1340021), that data will only be available from the re:dash query; only pre-release will be available from the aggregates dataset*. As such, can we add the statistics you're interested in to the query? Just let us know which ones and we can try to make them available. *We do have plans to aggregate a sample of release, and will be handling that as part of the arch review.
I feel like we need to take a step back: people want to build dashboards and the like on some kind of API. That API needs to work, be supported, and have all the data you would want, without people being required to file a bug to get new statistics added. For obvious reasons they have assumed that was Telemetry.js. If that's not Telemetry.js, then it needs to be something else. So, what is the service offering that is intended for this purpose?
Component: Telemetry Aggregation Service → General

Hello,

The Mozilla Data Engineering organization is currently going through our extensive backlog, consisting of hundreds of issues stretching back for nearly 10 years. We've done a pass through all of the open bugzilla bugs and have identified and tagged the ones that we think are relevant enough to still need attention. The rest, including the bug with which this comment is associated, we are closing as "WONTFIX" in a single bulk operation.

If you feel we have closed this (or any) issue in error, please feel free to take the following actions:

  • Reopen the bug.
  • Edit the bug to add the string [dataplatform] (including the brackets) to the Whiteboard field. (Note that you must edit the Whiteboard, not the similarly named QA Whiteboard.)

Doing this will ensure that we see the bug in our weekly triage process, where we will decide how to proceed.

Thank you.

Status: REOPENED → RESOLVED
Closed: 8 years ago → 5 months ago
Resolution: --- → WONTFIX