Closed
Bug 844325
Opened 12 years ago
Closed 8 years ago
Telemetry needs sanity checks to prevent submission spam
Categories
(Toolkit :: Telemetry, defect)
Toolkit
Telemetry
Tracking
()
RESOLVED
WORKSFORME
People
(Reporter: vladan, Assigned: froydnj)
References
(Depends on 1 open bug)
Details
(Whiteboard: [Telemetry:P1])
We need internal sanity checks to guard against Telemetry ping spam caused by broken system clocks, bugs in the nsIdleDaily service, test automation VMs that repeatedly roll back & replay Firefox state, etc.
We need multiple guards, each checking for specific error states. The guards should report to Telemetry that they have been triggered so we can get information about the prevalence & causes of the bugs. This means that some number of "spam" pings should be allowed to go through each day.
Irving also suggested disabling Telemetry altogether when we find it's failed in an unrecoverable way, but that might be a second bug.
Ideas for guards:
a) Compare the # of "idle-daily" and "idle" events received during a session with the session uptime
b) Record timestamp of last Telemetry send attempt in a variable & compare it during the next send attempt. Double-check elapsed time with some other source (e.g. number of elapsed clock cycles)
c) Report system epoch time in Telemetry. We can confirm if the ping spam is coming from a VM that is being rolled back by examining the timestamps in successive pings from the same spammy session after we receive them
| Reporter | ||
Updated•12 years ago
|
Whiteboard: [Telemetry:P1]
Comment 1•12 years ago
|
||
I would encourage us to take a long hard look at merging the data submission logic for FHR and Telemetry. 95% of what these components do is similar. If they can share the same code, that should be a net win.
| Assignee | ||
Comment 2•12 years ago
|
||
(In reply to Vladan Djeric (:vladan) from comment #0)
> We need multiple guards, each checking for specific error states.
It's not clear to me that we need multiple guards.
> a) Compare the # of "idle-daily" and "idle" events received during a session
> with the session uptime
I don't think this is going to be particularly reliable.
> b) Record timestamp of last Telemetry send attempt in a variable & compare
> it during the next send attempt. Double-check elapsed time with some other
> source (e.g. number of elapsed clock cycles)
This seems like the best bet. I get the idea behind multiple time sources, but I'm not sure we have that many to check from JS. (And I sure don't want to do Windows time calibrations in JS...)
> c) Report system epoch time in Telemetry. We can confirm if the ping spam is
> coming from a VM that is being rolled back by examining the timestamps in
> successive pings from the same spammy session after we receive them
So this is more of a diagnostic mechanism than anything that would prevent spam? (Sure, we could block the IP, but that would be post-event.)
Comment 3•12 years ago
|
||
(In reply to Nathan Froyd (:froydnj) from comment #2)
> So this is more of a diagnostic mechanism than anything that would prevent
> spam? (Sure, we could block the IP, but that would be post-event.)
I think we need both diagnostics and prevention, though it may be separate bugs.
For diagnostics to detect misbehaving telemetry and idle-daily services in the field, I suggest:
1) count of idle-daily firings this session (we can compare this to session duration to see if we're getting too many or too few)
2) timestamps of idle-daily firings: more data, we can derive 1) but also see if the timestamps are wonky
3) timestamp vs. interval timer comparisons - does the change in timestamps roughly match the elapsed session duration (watch for overflow at 2^31 and 2^32 milliseconds, roughly 3.6 and 7.1 weeks)
For reliability and problem prevention:
1) Check session duration when we get an idle-daily callback and make sure a reasonable time has passed; don't send (and record in the packet) if called back too soon
2) Record internal errors (file open/close/delete errors on saved pings, for example) in the telemetry packet so we can see if there's a pattern
3) sanity check the saved pings we read - make sure there aren't duplicate session IDs and report if there are
4) keep a hash of when we last sent a packet for a specific session ID and complain if we're sending the same session ID more than once in a single pass through the sending loop
Comment 4•12 years ago
|
||
I just asked rnewman where my "don't trust idle timer" opinion came from and he said it used to fire multiple times when a machine came out of hibernation. This is why Sync stopped using it and is why we don't rely on it for data upload in FHR.
Perhaps rnewman can perform a full brain dump (complete with referenced bugs).
Comment 5•12 years ago
|
||
(In reply to Gregory Szorc [:gps] from comment #4)
> Perhaps rnewman can perform a full brain dump (complete with referenced
> bugs).
Sync includes debouncing for return-from-idle notifications because at least on OS X we found that they were behaving very oddly.
Code here:
http://hg.mozilla.org/integration/mozilla-inbound/file/default/services/sync/modules/policies.js#l220
See Bug 691988.
| Assignee | ||
Comment 6•8 years ago
|
||
Georg, we've done this via other means, yes?
Flags: needinfo?(gfritzsche)
Comment 7•8 years ago
|
||
(In reply to Nathan Froyd [:froydnj] from comment #6)
> Georg, we've done this via other means, yes?
We have multiple guards in place now:
- minimum subsession length of 5min
- upload throttling to 10 pings per minute
- we have a few health metrics covering error scenarios and get alerts for regressions
From the incoming data we didn't see any spam problems (except one spam event with custom pings from test pilot).
Bug 1318297 will address closing the gaps in monitoring for these kind of scenarios.
Status: ASSIGNED → RESOLVED
Closed: 8 years ago
Flags: needinfo?(gfritzsche)
Resolution: --- → WORKSFORME
You need to log in
before you can comment on or make changes to this bug.
Description
•