Bug 1487277 (opened last year, closed last year)

Prevent telemetry from running during tests

Categories

(Testing :: Code Coverage, enhancement)

Version 3

Tracking

(firefox64 fixed)

RESOLVED FIXED
Target Milestone: mozilla64
Tracking Status: firefox64 --- fixed

People

(Reporter: sparky, Assigned: sparky, NeedInfo)

References

(Blocks 1 open bug)

Details

Attachments

(2 files)

This bug is to discuss preventing telemetry from running during tests, as it can be a source of intermittent failures and code coverage variability. See bug 1473626 for more details.
The attached JSON contains URLs for each of the lines that differ between various per-test coverage reports. It holds all the leftover variability after running test_coverage_specialpowers.html with a wait time of 3 minutes and with baseline correction. Using coco-tools, I converted a JSON produced with `--differences` in the `pertestcoverage_variability.py` tool into URLs with the following command:

`py pertestcoverage_json2urls.py ~/line_level_differences_1535772622.json -r 937a9f858a97 -b try --differences`
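The URL conversion this command performs can be sketched roughly as follows. The input shape here (a dict mapping file paths to lists of differing lines) is an assumption on my part; the real JSON format produced by `pertestcoverage_variability.py` may differ.

```python
# Sketch: map per-file line differences to hg.mozilla.org annotate URLs.
# The `differences` input shape is assumed, not the tool's actual format.

def differences_to_urls(differences, rev, repo="try"):
    """Build an annotate URL for each (file, line) difference."""
    base = "https://hg.mozilla.org/{repo}/annotate/{rev}/{path}#l{line}"
    urls = []
    for path, lines in differences.items():
        for line in lines:
            urls.append(base.format(repo=repo, rev=rev, path=path, line=line))
    return urls

# Example with a made-up differences dict:
diffs = {"js/src/gc/Statistics.cpp": [1124, 1132]}
for url in differences_to_urls(diffs, "937a9f858a97"):
    print(url)
```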

Here's an interesting entry I found:
https://hg.mozilla.org/try/annotate/937a9f858a97/js/src/gc/Statistics.cpp#l1124

It looks like it's related to telemetry as well, but I'm not sure if we could disable this one - it looks possible but maybe it's required for GC. Who could we ask about it?

Overview of attachment:
-----------------------
All `.h` files in the attached JSON should be ignored, as their coverage data is very buggy; the upgrade to GCC 7 might fix them.

There are lots of things left that are outside of our control:
1. https://hg.mozilla.org/try/annotate/937a9f858a97/js/src/gc/GC.cpp#l5939
2. https://hg.mozilla.org/try/annotate/937a9f858a97/js/src/vm/JSAtom.cpp#l544
3. https://hg.mozilla.org/try/annotate/937a9f858a97/js/src/vm/TypedArrayObject.cpp#l196
4. https://hg.mozilla.org/try/annotate/937a9f858a97/startupcache/StartupCache.cpp#l494
5. https://hg.mozilla.org/try/annotate/937a9f858a97/xpcom/base/nsCycleCollector.cpp#l3284
6. https://hg.mozilla.org/try/annotate/937a9f858a97/dom/workers/WorkerPrivate.cpp#l2743 (destructor call)
7. https://hg.mozilla.org/try/annotate/937a9f858a97/toolkit/components/telemetry/TelemetryScalar.cpp#l2855

Debug-only variability which cannot be controlled:
1. https://hg.mozilla.org/try/annotate/937a9f858a97/startupcache/StartupCache.cpp#l326
------------------------

The entries mentioned above are all that is left in the variability for this test. So, based on this, I think we've eliminated most, if not all, of the most problematic variability by simply disabling telemetry.

I've now started looking into file-level variability as well, because it seems to have increased; so far, though, it looks like most of the files are things we could ignore.

Here's the try run I am using for this analysis: https://treeherder.mozilla.org/#/jobs?repo=try&revision=02755346e2d6932972eba840416a5f569f962850
(In reply to Greg Mierzwinski [:sparky] from comment #1)
> Here's an interesting entry I found:
> https://hg.mozilla.org/try/annotate/937a9f858a97/js/src/gc/Statistics.cpp#l1124
> 
> It looks like it's related to telemetry as well, but I'm not sure if we
> could disable this one - it looks possible but maybe it's required for GC.
> Who could we ask about it?

Maybe :jonco knows if we can (and want) to disable the gathering of statistics during tests.
Flags: needinfo?(jcoppeard)
(In reply to Greg Mierzwinski [:sparky] from comment #0)
> This bug is to discuss preventing telemetry from running during tests as it
> can be a source of intermittent failures, and code coverage variability. See
> bug 1473626 for more details.

Out of interest, why does this cause failures?  

I think we need to make sure the telemetry gathering code is exercised during tests, but I don't see the point of actually recording telemetry from tests.

I suspect that what's happening here though is that GC is being triggered by a timer and then generating telemetry data.  Are you saying that we should disable timer-based GCs in coverage builds?  That *might* expose problems but should probably be OK.  Or are you saying we should not record GC telemetry in test builds?
Flags: needinfo?(jcoppeard)
:jonco, we are trying to collect code coverage that is accurate and relevant to a specific test, instead of getting random code showing up (or, worse, very different sets of code for the same test case), which makes comparison across revisions nearly impossible.
(In reply to Jon Coppeard (:jonco) from comment #3)
> (In reply to Greg Mierzwinski [:sparky] from comment #0)
> > This bug is to discuss preventing telemetry from running during tests as it
> > can be a source of intermittent failures, and code coverage variability. See
> > bug 1473626 for more details.
> 
> Out of interest, why does this cause failures?

The Telemetry code can be triggered randomly at any point during a test, causing timeouts or other sorts of issues.
E.g. while running a test for the opening of the awesomebar, we shouldn't attempt to collect Telemetry on WebGL and attempt to send it.

> I think we need to make sure the telemetry gathering code is exercised
> during tests, but I don't see the point of actually recording telemetry from
> tests.
> 
> I suspect that what's happening here though is that GC is being triggered by
> a timer and then generating telemetry data.  Are you saying that we should
> disable timer-based GCs in coverage builds?  That *might* expose problems
> but should probably be OK.  Or are you saying we should not record GC
> telemetry in test builds?

My thinking is that the Telemetry code should be tested explicitly with proper tests, and not by chance by other unrelated tests.
I agree the Telemetry gathering code should not be disabled as it is part of the code under test.
What we want to disable is periodic Telemetry gathering and reporting which is unrelated to the tests that are being executed.
(In reply to Jon Coppeard (:jonco) from comment #3)
> (In reply to Greg Mierzwinski [:sparky] from comment #0)
> > This bug is to discuss preventing telemetry from running during tests as it
> > can be a source of intermittent failures, and code coverage variability. See
> > bug 1473626 for more details.
> 
> Out of interest, why does this cause failures?  

Comment 5 by :marco explains why this can be problematic.

> I suspect that what's happening here though is that GC is being triggered by
> a timer and then generating telemetry data.  Are you saying that we should
> disable timer-based GCs in coverage builds?  That *might* expose problems
> but should probably be OK.  Or are you saying we should not record GC
> telemetry in test builds?

As :marco states in comment 5, we want to disable periodic Telemetry gathering and reporting (anything that gathers and/or reports telemetry or statistics using a timer). The line of code at [1] (that was hit in testing) is what we are wondering about as it seems like it could be related to telemetry but through a different timer than the ones we've tested.

Do you think we should disable gathering these GC statistics during coverage builds or would it cause problems if we try? And if we should, do you know if there is already a simple way of skipping these GC statistics/telemetry measurements?


[1]: https://hg.mozilla.org/try/annotate/937a9f858a97/js/src/gc/Statistics.cpp#l1124
[2]: https://hg.mozilla.org/try/annotate/937a9f858a97/js/src/gc/Statistics.cpp#l1132
Flags: needinfo?(jcoppeard)
I've spoken with jonco. GC is similar to Telemetry in the sense that it can run at any point during tests, making tests not entirely deterministic and possibly causing intermittent failures. Unlike Telemetry, though, the intermittent failures caused by GC are actually real problems in GC that have to be fixed.

We could in theory disable running GC on timers for coverage builds, but it would mean making the coverage build different from other builds. I'd like to keep them as similar as possible, so I would prefer not to disable anything that we can't disable in all builds.

Let's focus on disabling Telemetry, which we can do in all builds, for now, and then we'll re-evaluate GC later on if needed.
Flags: needinfo?(jcoppeard)
Can we identify which code is executed by GC, or is that sort of random based on which components GC affects in the current pass? Per-test coverage will not be useful if we are telling developers one day their test is covering 7000 lines of code and the next day only 4000 lines of code with no code changed.

We could do...
* start browser
* force gc/cc
* reset counters
* run test
* save counters
* shutdown browser

That would reduce the chance of a GC happening. And if we can query the results and look for specific areas of code that indicate a GC ran during the test, we could flag it as garbage (oh, the irony).
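One hedged sketch of the "flag it as garbage" idea: compare a run's covered lines against a clean baseline run, and mark the run as suspect if files known to belong to GC gained coverage. The file list, data shapes, and function names here are illustrative, not part of the real harness.

```python
# Sketch: flag a per-test coverage run as possibly polluted by an
# unscheduled GC. Coverage is modeled as a dict mapping file path to a
# set of covered line numbers; the GC file list is illustrative.

GC_FILES = {"js/src/gc/GC.cpp", "js/src/gc/Statistics.cpp"}

def looks_gc_polluted(run_coverage, baseline_coverage):
    """Return True if any known GC file gained coverage vs. the baseline."""
    for path in GC_FILES:
        extra = run_coverage.get(path, set()) - baseline_coverage.get(path, set())
        if extra:
            return True
    return False

baseline = {"js/src/gc/GC.cpp": {10, 11}}
noisy = {"js/src/gc/GC.cpp": {10, 11, 5939}}
print(looks_gc_polluted(noisy, baseline))      # True: line 5939 appeared
print(looks_gc_polluted(baseline, baseline))   # False: no extra GC lines
```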
(In reply to Joel Maher ( :jmaher ) (UTC-4) from comment #8)
> can we identify which code is executed by GC, or is that sort of random
> based on which components GC affects in the current pass?  per test coverage
> will not be useful if we are telling developers one day their test is
> covering 7000 lines of code and the next day only 4000 lines of code with no
> code changed.
> 
> We could do...
> * start browser
> * force gc/cc
> * reset counters
> * run test
> * save counters
> * shutdown browser
> 
> that would reduce the chance of a gc happening and if we can query the
> results and look for specific areas of code that indicate a GC happened
> during the test we could flag it as garbage (oh the irony)

Let's file a separate bug for investigating this, as it doesn't have to block landing the disablement of Telemetry.
If we actually find GC to be a real problem blocking per-test coverage, we can disable running it on timers for coverage builds.
Here's a try run for all linux and windows tests with telemetry disabled: https://treeherder.mozilla.org/#/jobs?repo=try&revision=ffd6c93dce436903500becd3eb43040404a5fca3

I am also testing ccov here (it doesn't look like the mac failure is caused by this patch): https://treeherder.mozilla.org/#/jobs?repo=try&revision=62c4c58af9cfaebd1d8820d7be3f97093436903f

In the linux and windows tests, I've retriggered those that look like they are new bugs:
1. https://treeherder.mozilla.org/#/jobs?repo=try&revision=ffd6c93dce436903500becd3eb43040404a5fca3&selectedJob=198195740&filter-searchStr=mch
2. https://treeherder.mozilla.org/#/jobs?repo=try&revision=ffd6c93dce436903500becd3eb43040404a5fca3&filter-searchStr=Wd
3. (some are new) https://treeherder.mozilla.org/#/jobs?repo=try&revision=ffd6c93dce436903500becd3eb43040404a5fca3&filter-searchStr=mda&selectedJob=197929106
4. https://treeherder.mozilla.org/#/jobs?repo=try&revision=ffd6c93dce436903500becd3eb43040404a5fca3&filter-searchStr=en-US
5. https://treeherder.mozilla.org/#/jobs?repo=try&revision=ffd6c93dce436903500becd3eb43040404a5fca3&selectedJob=197929913&filter-searchStr=windows7%20bc2
6. https://treeherder.mozilla.org/#/jobs?repo=try&revision=ffd6c93dce436903500becd3eb43040404a5fca3&filter-searchStr=dt3
7. https://treeherder.mozilla.org/#/jobs?repo=try&revision=ffd6c93dce436903500becd3eb43040404a5fca3&filter-searchStr=R4
8. https://treeherder.mozilla.org/#/jobs?repo=try&revision=ffd6c93dce436903500becd3eb43040404a5fca3&filter-searchStr=bc6

:marco, what do you think of these? Are those the failures you were referring to when we spoke on IRC? I think some of them may be related to the changes I made.
(In reply to Greg Mierzwinski [:sparky] from comment #10)
> :marco, what do you think of these? Are those the failures you were
> referring to when we spoke on IRC? I think some of them may be related to
> the changes I made.

I see they all passed when you retriggered, so maybe they are simply intermittents?
The Mac issue is bug 1490247.
(In reply to Marco Castelluccio [:marco] from comment #11)
> (In reply to Greg Mierzwinski [:sparky] from comment #10)
> > :marco, what do you think of these? Are those the failures you were
> > referring to when we spoke on IRC? I think some of them may be related to
> > the changes I made.
> 
> I see they all passed when you retriggered, so maybe they are simply
> intermittents?

Yes, they are mostly new intermittents though. Would they cause this patch to be backed out?
This patch sets the preference 'toolkit.telemetry.initDelay' to a very high value in an attempt to prevent telemetry from intermittently gathering and reporting data.
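For reference, the pref change amounts to something like the following user.js fragment. The delay value shown is illustrative only; the actual patch may use a different number.

```js
// Push Telemetry initialization far beyond any test's lifetime so that
// scheduled gathering and sending never kick in during the test.
// NOTE: the value below is illustrative, not necessarily the patch's value.
user_pref("toolkit.telemetry.initDelay", 460800);
```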
Comment on attachment 9008052 [details]
Bug 1487277 - Prevent intermittent telemetry gathering and reporting. r?marco

Marco Castelluccio [:marco] has approved the revision.
Attachment #9008052 - Flags: review+
Comment on attachment 9008052 [details]
Bug 1487277 - Prevent intermittent telemetry gathering and reporting. r?marco

Andrew Halberstadt [:ahal] has approved the revision.
Attachment #9008052 - Flags: review+
Hi :chutten, we spoke in the past about trying to find a way to prevent as much telemetry as possible from running while testing, and we found that by changing the 'toolkit.telemetry.initDelay' preference to a very high value we were able to stop most of it.

We'd like to add this preference change to all tests (everything but raptor). What are your thoughts on this? Is it fine with you to do the switch?

Please include anyone you think should be involved in this discussion.
Flags: needinfo?(chutten)
I don't know what raptor is, but I have only mild concerns about such a change. Did you find that tests which test data collection (like, for instance [1]) still pass successfully even without Telemetry init?

Does this satisfy the goal of reducing intermittent failures due to Telemetry deciding to perform work sometimes?

ni?gfritzsche and Dexter for historical knowledge. ni?janerik for modern knowledge.

[1]: https://searchfox.org/mozilla-central/rev/de7676288a78b70d2b9927c79493adbf294faad5/browser/base/content/test/about/browser_aboutHome_search_telemetry.js#30
Flags: needinfo?(jrediger)
Flags: needinfo?(gfritzsche)
Flags: needinfo?(chutten)
Flags: needinfo?(alessio.placitelli)
Yes, that test passes in this chunk on linux64-opt: https://treeherder.mozilla.org/#/jobs?repo=try&revision=ffd6c93dce436903500becd3eb43040404a5fca3&selectedJob=197923101&searchStr=linux64%2Fopt-mochitest-browser-chrome-e10s-6

Comment 10 has the try runs we did, along with all the failures that looked new to me - I didn't notice any telemetry specific failures there.

We have no evidence yet that it will definitely reduce intermittent failure rates. But we think it will: this code shows up intermittently in code coverage, so it's plausible that it also causes some intermittent failures.
(In reply to Greg Mierzwinski [:sparky] from comment #17)
> We'd like to add this preference change to all tests (everything but
> raptor), and we are wondering what your thoughts are on this? Is it fine
> with you to do the switch?

What do you mean by "everything but raptor"? Do you also plan to disable Telemetry in mochitests, Talos, etc.?

While Telemetry has unit tests and some integration tests, the rationale of having Telemetry enabled in all test suites (without sending pings) is to have tests running in an environment as close as possible to production, where Telemetry is always on (but not sending pings).

With that said, I think it's ok to disable/delay Telemetry in some specific test suites, if that's causing issues.
Flags: needinfo?(alessio.placitelli) → needinfo?(gmierz2)
(In reply to Greg Mierzwinski [:sparky] from comment #17)
> Hi :chutten, we spoke in the past about trying to find a way to prevent as
> much telemetry as possible from running while testing and we found that by
> changing the 'toolkit.telemetry.initDelay' preference to a very high value,
> we were able to stop most of it.
> 
> We'd like to add this preference change to all tests (everything but
> raptor), and we are wondering what your thoughts are on this? Is it fine
> with you to do the switch?

The criteria that were mentioned are ok to me:
- Telemetry instrumentation code is generally still active (as this is effectively part of the exercised code).
- But Telemetry sending and scheduled activity is disabled in most tests.

Adjusting that pref sounds like a good solution to me.
We definitely can't have that pref apply to everything inside toolkit/components/telemetry/tests.

There were some talks about generally collecting telemetry from test runs, but I'm not sure if that ever got anywhere. If it did, we could investigate separately where this is feasible.
Flags: needinfo?(gfritzsche)
Raphael is generally looking into Telemetry and other Data testing.
Raphael, does that sound reasonable to you here and align with your plans?
Flags: needinfo?(rpierzina)
(In reply to Alessio Placitelli [:Dexter] from comment #20)
> What do you mean by "everything but raptor"? Do you also plan to disable
> Telemetry in mochitests, Talos, etc.?

Yes, it would be for all test harnesses (mochitest, xpcshell, etc.), including Talos.

(In reply to Georg Fritzsche [:gfritzsche] from comment #21)
> The criteria that were mentioned are ok to me:
> - Telemetry instrumentation code is generally still active (as this is
> effectively part of the exercised code).
> - But Telemetry sending and scheduled activity is disabled in most tests.
> 
> Adjusting that pref sounds like a good solution to me.
> We definitely can't have that pref apply to everything inside
> toolkit/components/telemetry/tests.

Looking at this chunk, it seems like the tests from that folder pass with this preference on (gtest also has passing telemetry tests): https://treeherder.mozilla.org/#/jobs?repo=try&revision=ffd6c93dce436903500becd3eb43040404a5fca3&searchStr=linux64%2Fopt-xpcshell-8

Is this unexpected?

Comment 10 has all the failures which looked to me like they were new, none of them are explicitly related to telemetry or within the folder you mentioned.
Flags: needinfo?(gmierz2) → needinfo?(gfritzsche)
Assignee: nobody → gmierz2
Stuff in `toolkit/components/telemetry/tests` is probably fine, because they explicitly call `testSetup`, which ignores the `toolkit.telemetry.initDelay` pref (and instead uses a constant value of 1 for the delay).
Actual recording in other parts of the code should also be fine, as they happen anyway and don't depend on the controller to be initialized.

As this does not completely disable Telemetry, but only the "collecting external ping data & sending part", which we should already cover in our own tests, I'm fine with adding this for other tests to improve their coverage reports.
Flags: needinfo?(jrediger)
Pushed by gmierz2@outlook.com:
https://hg.mozilla.org/integration/autoland/rev/f54cd369acc0
Prevent intermittent telemetry gathering and reporting. r=ahal,marco
https://hg.mozilla.org/mozilla-central/rev/f54cd369acc0
Status: NEW → RESOLVED
Closed: Last year
Resolution: --- → FIXED
Target Milestone: --- → mozilla64
(In reply to Jan-Erik Rediger [:janerik] from comment #24)
> As this does not completely disable Telemetry, but only the "collecting
> external ping data & sending part", which we should already cover in our own
> tests, I'm fine with adding this for other tests to improve their coverage
> reports.

I second this!
Flags: needinfo?(rpierzina)