Closed
Bug 1408112
Opened 8 years ago
Closed 5 months ago
Telemetry server throws 500s
Categories
(Data Platform and Tools :: General, defect, P3)
Tracking
(Not tracked)
RESOLVED
WONTFIX
People
(Reporter: ekr, Unassigned)
Comment 1 • 8 years ago
Looks like this was just a temporary hiccup, the evolution returns a 200 now with the data. EKR, have you considered using the main_summary table for HTTP_PAGELOAD_IS_SSL? It will have that data for all of release, while the aggregates dataset is just prerelease.
Status: NEW → RESOLVED
Closed: 8 years ago
Component: Telemetry APIs for Analysis → Telemetry Aggregation Service
Resolution: --- → WORKSFORME
Comment 2 (Reporter) • 8 years ago
I somehow didn't notice that this got marked resolved. The problem isn't that it never works; it's that it gets overloaded when you make a lot of requests and then falls over with 500s. Has something happened to fix that defect?
Status: RESOLVED → REOPENED
Resolution: WORKSFORME → ---
Comment 3 • 8 years ago
As part of our User Productivity initiative in 2018, we'll be looking at performance of telemetry.js and telemetry.mo. Can you paste your use-case so we can keep it in mind?
Flags: needinfo?(ekr)
Comment 4 (Reporter) • 8 years ago
See:
https://github.com/ekr/moz-telemetry-dashboard
Basically, this is a cache that sits in front of the API so that we can build fast dashboards.
Flags: needinfo?(ekr)
Comment 5 • 8 years ago
For each of three measures, for each channel+version they appear in between 44 and 54, get the evolution. Assuming saturation of four channels we're talking... 120 requests, as fast as can be sent, using telemetry.js v2 via telemetry-next-node.
Okay. The only other mass user of telemetry-next-node that I know of is mozilla/cerberus, and it serializes requests two at a time (well, sometimes it requests up to three versions at once for each of those two, so at most six).
telemetry.js v2 has no built-in rate-limiting or retry logic (as you've discovered). An updated library would have to take this into account.
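For illustration, the kind of client-side cap on in-flight requests that an updated library would need could look roughly like this. This is a sketch, not telemetry-next-node's actual API; `limitConcurrency`, `tasks`, and `maxInFlight` are hypothetical names.

```javascript
// Run an array of task functions (each returning a promise) with at most
// `maxInFlight` running at once. Resolves with results in input order.
// Error handling is simplified: the first rejection rejects the whole batch.
function limitConcurrency(tasks, maxInFlight) {
  return new Promise((resolve, reject) => {
    const results = new Array(tasks.length);
    let next = 0;   // index of the next task to start
    let active = 0; // tasks currently in flight
    let done = 0;   // tasks completed

    function launch() {
      while (active < maxInFlight && next < tasks.length) {
        const i = next++;
        active++;
        Promise.resolve()
          .then(tasks[i])
          .then((value) => {
            results[i] = value;
            active--;
            if (++done === tasks.length) resolve(results);
            else launch();
          }, reject);
      }
    }

    if (tasks.length === 0) resolve(results);
    else launch();
  });
}
```

With something like this, the 120 evolution requests above could be capped at, say, 4 in flight instead of being fired as fast as they can be sent.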
Comment 6 (Reporter) • 8 years ago
Well, sort of. The semantics of how fast the HTTP requests are issued are complicated (it depends on whether it's H1 or H2, etc.).
Anyway, it's not just 120... There should be a way to do an arbitrary number, and have the system rate limit them sensibly. I would argue that the server shouldn't be throwing a 500 but should be queueing on the server.
Comment 7 • 8 years ago
Actually, the server should use the tools it has available (concurrent stream limits in h2, for example) to apply back pressure on clients. It seems unlikely that an HTTP/1.1 server would get so many requests from a single client as to become overloaded, because clients are naturally rate limited.
If the server is genuinely overloaded, it can send a 503 with Retry-After and the client *should* respect that and retry automatically. 429 is also good as a preventative measure, see below, but again the client should be robust and retry.
For reference, I made 120 requests to bugzilla.m.o with a fast loop:
for (let i = 0; i < 120; ++i) { fetch('https://bugzilla.mozilla.org/show_bug.cgi?id=' + i, { method: 'HEAD' }).then(r => console.log(r.status)); }
It took a second for many of the responses to arrive, but the first 100 worked. For the last 20 I got a nice, fast 429 response (Too Many Requests, which is fair).
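The retry behavior described above (back off on 429/503, honoring Retry-After) could be sketched like this. This is illustrative only; `fetchWithRetry` is a hypothetical helper, not an existing telemetry.js or Gecko API, and the fetch-like function is injected so the sketch stays self-contained.

```javascript
// Retry a request when the server signals overload (429 or 503),
// waiting the number of seconds given in Retry-After between attempts.
// Retry-After may also be an HTTP-date; only the seconds form is handled here.
async function fetchWithRetry(doFetch, url, maxAttempts = 3) {
  let res;
  for (let attempt = 1; attempt <= maxAttempts; attempt++) {
    res = await doFetch(url);
    // Anything other than an "overloaded" status goes back to the caller.
    if (res.status !== 429 && res.status !== 503) return res;
    if (attempt === maxAttempts) break;
    const header = res.headers.get('Retry-After');
    const secs = header === null ? 1 : Number(header);
    const delayMs = (Number.isFinite(secs) && secs >= 0 ? secs : 1) * 1000;
    await new Promise((resolve) => setTimeout(resolve, delayMs));
  }
  return res; // still overloaded after maxAttempts; caller decides what to do
}
```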
Comment 8 • 7 years ago
We talked through this in triage:
- Short-term, we could send a 429 for these scenarios if that helps, but probably not spend more time on it.
- Medium-term, we're looking into the underlying issue this quarter as part of the Telemetry aggregates architecture review in bug 1430822 (tracking it through this bug and bug 136950).
Updated • 7 years ago
Priority: -- → P3
Comment 9 (Reporter) • 7 years ago
Can you elaborate on "short-term" versus "medium-term"?
Comment 10 • 7 years ago
Short-term: In the next weeks.
Medium-term: In this + next quarter.
We'll review the performance & architecture of this service this quarter and will decide on next steps from there (for Q2+).
Is anything blocked in the meantime that needs quicker solutions?
Comment 11 (Reporter) • 7 years ago
Yes. We have dashboards in the offices which use the cache linked above and don't work reliably.
Comment 12 • 7 years ago
Ok, this is pulling histogram evolution data for custom dashboards. This is definitely a use-case we're tracking for the review.
I see the dashboard pulls HTTPS fraction data through HTTP_PAGELOAD_IS_SSL [1].
For bug 1414839, we are providing a public dataset for HTTPS adoption ratio on Firefox release for the LetsEncrypt stats.
This is using this query:
https://sql.telemetry.mozilla.org/queries/49323/source#table
Which can be accessed as CSV or JSON:
https://sql.telemetry.mozilla.org/api/queries/49323/results.json?api_key=SywcgxNASAfDN6vXEfxCgKUQt7Jr2Vb4CIMiIta2
Is using that (or a modified version of it) a useful short-term solution?
1: https://github.com/ekr/moz-telemetry-dashboard/blob/master/static/catalog.js#L2
Comment 13 (Reporter) • 7 years ago
Well, we want to do arbitrary statistics, so it's not really a general solution.
Comment 14 • 7 years ago
Right. Does it cover the dashboard use-case for now though (https adoption ratio over time)?
Comment 15 • 7 years ago
FWIW, Gecko does not do anything special with 429/Retry-After; rescheduling (if desired) would have to be handled in the dashboard code.
Comment 16 (Reporter) • 7 years ago
(In reply to Georg Fritzsche [:gfritzsche] from comment #14)
> Right. Does it cover the dashboard use-case for now though (https adoption
> ratio over time)?
No. We want to add other dashboards now but we cannot because of this.
Comment 17 • 7 years ago
For a general solution, someone has to queue. You can turn down the h2 stream concurrency limit and Gecko will queue for you, but if the "overloaded" criterion is global (i.e. the sum of all requests from all clients) then you don't have much of a choice but to queue on the server side.
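The server-side queueing idea could be sketched as a bounded backlog: admit up to some number of in-flight requests, queue a limited number beyond that, and shed load (which is where a 429/503 with Retry-After would be returned) only once the backlog is full. This is a sketch with illustrative names (`BoundedQueue`, `maxActive`, `maxQueued`), not the aggregation service's actual code.

```javascript
// Run submitted jobs with at most `maxActive` in flight; hold up to
// `maxQueued` waiting jobs; reject everything past that as overloaded.
class BoundedQueue {
  constructor(maxActive, maxQueued) {
    this.maxActive = maxActive;
    this.maxQueued = maxQueued;
    this.active = 0;
    this.waiting = [];
  }

  submit(job) {
    if (this.active < this.maxActive) return this._run(job);
    if (this.waiting.length >= this.maxQueued) {
      // In an HTTP server this is where a 429/503 would be sent.
      return Promise.reject(new Error('overloaded'));
    }
    return new Promise((resolve, reject) => {
      this.waiting.push({ job, resolve, reject });
    });
  }

  _run(job) {
    this.active++;
    return Promise.resolve()
      .then(job)
      .finally(() => {
        this.active--;
        const next = this.waiting.shift();
        if (next) this._run(next.job).then(next.resolve, next.reject);
      });
  }
}
```

The key design point is that a full backlog is the overload signal, so clients see a fast, explicit rejection instead of a slow 500.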
Comment 18 • 7 years ago
Near as I can tell, :ekr's code isn't running in Gecko; it's running in Node, which is why it's not hitting the "6 connections per server" pref.
Comment 19 (Reporter) • 7 years ago
That's a good point. But regardless, it's really the server's job to queue, for the reasons McManus notes above.
Comment 20 • 7 years ago
(In reply to Eric Rescorla (:ekr) from comment #13)
> Well, we want to do arbitrary statistics, so it's not really a general
> solution.
Unfortunately, the aggregates dataset does not contain all of release data; as a holdover from FHR it only has users that have opted in to data collection. Once we have all of release sending us the SSL data (bug 1340021), that data will only be available from the re:dash query; only pre-release will be available from the aggregates dataset*.
As such, can we add the statistic you're interested in to the query? Just let us know which ones and we can try and make them available.
*We do have plans to aggregate a sample of release, and will be handling that as part of the arch review.
Comment 21 (Reporter) • 7 years ago
I feel like we need to take a step back:
People want to build dashboards and the like on some kind of API. That API needs to work, be supported, and have all the data you would want, without people being required to file a bug to get new statistics added. For obvious reasons they have assumed that was Telemetry.js. If that's not Telemetry.js, then it needs to be something else. So, what is the service offering that is intended for this purpose?
Updated (Assignee) • 3 years ago
Component: Telemetry Aggregation Service → General
Comment 22 • 5 months ago
Hello,
The Mozilla Data Engineering organization is currently going through our extensive backlog, consisting of hundreds of issues stretching back for nearly 10 years. We've done a pass through all of the open bugzilla bugs and have identified and tagged the ones that we think are relevant enough to still need attention. The rest, including the bug with which this comment is associated, we are closing as "WONTFIX" in a single bulk operation.
If you feel we have closed this (or any) issue in error, please feel free to take the following actions:
- Reopen the bug.
- Edit the bug to add the string [dataplatform] (including the brackets) to the Whiteboard field. (Note that you must edit the Whiteboard, not the similarly named QA Whiteboard.)
Doing this will ensure that we see the bug in our weekly triage process, where we will decide how to proceed.
Thank you.
Status: REOPENED → RESOLVED
Closed: 8 years ago → 5 months ago
Resolution: --- → WONTFIX