844325 - Telemetry needs sanity checks to prevent submission spam

Reporter

Description

•

12 years ago

We need internal sanity checks to guard against Telemetry ping spam caused by broken system clocks, bugs in the nsIdleDaily service, test automation VMs that repeatedly roll back & replay Firefox state, etc. We need multiple guards, each checking for specific error states. The guards should report to Telemetry that they have been triggered so we can get information about the prevalence & causes of the bugs. This means that some number of "spam" pings should be allowed to go through each day. Irving also suggested disabling Telemetry altogether when we find it's failed in an unrecoverable way, but that might be a second bug. Ideas for guards: a) Compare the # of "idle-daily" and "idle" events received during a session with the session uptime b) Record timestamp of last Telemetry send attempt in a variable & compare it during the next send attempt. Double-check elapsed time with some other source (e.g. number of elapsed clock cycles) c) Report system epoch time in Telemetry. We can confirm if the ping spam is coming from a VM that is being rolled back by examining the timestamps in successive pings from the same spammy session after we receive them

Vladan Djeric (:vladan)

Reporter

Updated

•

12 years ago

Whiteboard: [Telemetry:P1]

Gregory Szorc [:gps]

Comment 1

•

12 years ago

I would encourage us to take a long hard look at merging the data submission logic for FHR and Telemetry. 95% of what these components do is similar. If they can share the same code, that should be a net win.

Nathan Froyd [:froydnj]

Assignee

Comment 2

•

12 years ago

(In reply to Vladan Djeric (:vladan) from comment #0) > We need multiple guards, each checking for specific error states. It's not clear to me that we need multiple guards. > a) Compare the # of "idle-daily" and "idle" events received during a session > with the session uptime I don't think this is going to be particularly reliable. > b) Record timestamp of last Telemetry send attempt in a variable & compare > it during the next send attempt. Double-check elapsed time with some other > source (e.g. number of elapsed clock cycles) This seems like the best bet. I get the idea behind multiple time sources, but I'm not sure we have that many to check from JS. (And I sure don't want to do Windows time calibrations in JS...) > c) Report system epoch time in Telemetry. We can confirm if the ping spam is > coming from a VM that is being rolled back by examining the timestamps in > successive pings from the same spammy session after we receive them So this is more of a diagnostic mechanism than anything that would prevent spam? (Sure, we could block the IP, but that would be post-event.)

:Irving Reid (No longer working on Firefox)

Comment 3

•

12 years ago

(In reply to Nathan Froyd (:froydnj) from comment #2) > So this is more of a diagnostic mechanism than anything that would prevent > spam? (Sure, we could block the IP, but that would be post-event.) I think we need both diagnostics and prevention, though it may be separate bugs. For diagnostics to detect misbehaving telemetry and idle-daily services in the field, I suggest: 1) count of idle-daily firings this session (we can compare this to session duration to see if we're getting too many or too few) 2) timestamps of idle-daily firings: more data, we can derive 1) but also see if the timestamps are wonky 3) timestamp vs. interval timer comparisons - does the change in timestamps roughly match the elapsed session duration (watch for overflow at 2^31 and 2^32 milliseconds, roughly 3.6 and 7.1 weeks) For reliability and problem prevention: 1) Check session duration when we get an idle-daily callback and make sure a reasonable time has passed; don't send (and record in the packet) if called back too soon 2) Record internal errors (file open/close/delete errors on saved pings, for example) in the telemetry packet so we can see if there's a pattern 3) sanity check the saved pings we read - make sure there aren't duplicate session IDs and report if there are 4) keep a hash of when we last sent a packet for a specific session ID and complain if we're sending the same session ID more than once in a single pass through the sending loop

:Irving Reid (No longer working on Firefox)

Updated

•

12 years ago

Depends on: 845558

Gregory Szorc [:gps]

Comment 4

•

12 years ago

I just asked rnewman where my "don't trust idle timer" opinion came from and he said it used to fire multiple times when a machine came out of hibernation. This is why Sync stopped using it and is why we don't rely on it for data upload in FHR. Perhaps rnewman can perform a full brain dump (complete with referenced bugs).

Richard Newman [:rnewman]

Comment 5

•

12 years ago

(In reply to Gregory Szorc [:gps] from comment #4) > Perhaps rnewman can perform a full brain dump (complete with referenced > bugs). Sync includes debouncing for return-from-idle notifications because at least on OS X we found that they were behaving very oddly. Code here: http://hg.mozilla.org/integration/mozilla-inbound/file/default/services/sync/modules/policies.js#l220 See Bug 691988.

Nathan Froyd [:froydnj]

Assignee

Comment 6

•

8 years ago

Georg, we've done this via other means, yes?

Flags: needinfo?(gfritzsche)

Georg Fritzsche [:gfritzsche]

Comment 7

•

8 years ago

(In reply to Nathan Froyd [:froydnj] from comment #6) > Georg, we've done this via other means, yes? We have multiple guards in place now: - minimum subsession length of 5min - upload throttling to 10 pings per minute - we have a few health metrics covering error scenarios and get alerts for regressions From the incoming data we didn't see any spam problems (except one spam event with custom pings from test pilot). Bug 1318297 will address closing the gaps in monitoring for these kind of scenarios.

Status: ASSIGNED → RESOLVED

Closed: 8 years ago

Flags: needinfo?(gfritzsche)

Resolution: --- → WORKSFORME

Bugzilla

Telemetry needs sanity checks to prevent submission spam

Categories

(Toolkit :: Telemetry, defect)

Tracking

()

People

(Reporter: vladan, Assigned: froydnj)

References

(Depends on 1 open bug)

Details

(Whiteboard: [Telemetry:P1])

Crash Data

Security

(public)

User Story

Description

Updated

Comment 1

Comment 2

Comment 3

Updated

Comment 4

Comment 5

Comment 6

Comment 7