Open Bug 1473626 Opened 6 years ago Updated 2 years ago

[meta] Use a code coverage build to find things running on timers that should be disabled during tests

Categories

(Testing :: Code Coverage, enhancement)

enhancement

Tracking

(Not tracked)

People

(Reporter: marco, Unassigned)

References

(Depends on 2 open bugs, Blocks 1 open bug)

Details

(Keywords: meta)

Attachments

(7 files)

We could:
1) Run an empty test;
2) Reset coverage counters;
3) Wait 120 seconds;
4) Dump coverage counters.

After 4, we can generate a report and see what was executed.

By disabling things that are executed on timers that shouldn't be executed during tests, we could make per-test coverage data cleaner, and we might fix some intermittents due to things running on timers badly interacting with tests.
Blocks: 1471577
Good idea! The last timer was found by looking at the raw lines and files covered so doing this would be much easier. We could use the baseline tests as the empty tests, and reset counters and wait in the test harness before the browser is closed, but after the test finishes. I think that would also remove the timers we already know about.
I've generated a test report, there are several Telemetry functions covered (so I guess we are collecting some Telemetry data on a timer, we should likely disable it during tests). E.g. GetWebrtcStats.
:marco, could you post a link to the code you used to find that?
Flags: needinfo?(mcastelluccio)
I think we should try this with telemetry disabled in the moz-configs: https://hg.mozilla.org/try/rev/334dd0c80f0eea321c2eea1ce4667af66e7accf0

You'll see it was explicitly enabled for windows but I'm not sure why. Telemetry was enabled by default (in common-opt [1][2]) in mac and linux.

[1]: https://dxr.mozilla.org/mozilla-central/source/browser/config/mozconfigs/linux64/common-opt
[2]: https://dxr.mozilla.org/mozilla-central/source/browser/config/mozconfigs/macosx64/common-opt
(In reply to Greg Mierzwinski [:sparky] from comment #3)
> :marco, could you post a link to the code you used to find that?

I don't have it anymore unfortunately, but I think I recall all the steps I followed:
- set the GCOV_PREFIX and GCOV_RESULTS_DIR environment variables to two directories on the system;
- remove the requestResetCoverageCounters and requestDumpCoverageCounters calls from testing/mochitest/tests/SimpleTest/TestRunner.js, so that they don't conflict with the calls from inside the test;
- modify a bit tools/code-coverage/tests/mochitest/test_coverage_specialpowers.html (or just create a new test) to make it:
1) Wait a few seconds (so that any delayed initialization finishes)
2) Reset coverage counters
3) Wait 120 seconds
4) Dump coverage counters

Then, run the test, then parse the gcda files that are in GCOV_RESULTS_DIR.


(In reply to Greg Mierzwinski [:sparky] from comment #4)
> I think we should try this with telemetry disabled in the moz-configs:
> https://hg.mozilla.org/try/rev/334dd0c80f0eea321c2eea1ce4667af66e7accf0
> 
> You'll see it was explicitly enabled for windows but I'm not sure why.
> Telemetry was enabled by default (in common-opt [1][2]) in mac and linux.
> 
> [1]:
> https://dxr.mozilla.org/mozilla-central/source/browser/config/mozconfigs/
> linux64/common-opt
> [2]:
> https://dxr.mozilla.org/mozilla-central/source/browser/config/mozconfigs/
> macosx64/common-opt

I'm not sure we can disable Telemetry like this, we might stop collecting coverage for Telemetry if we do, or we might have differences in the tests between the coverage build and the other builds.
Flags: needinfo?(mcastelluccio)
Good point, and I don't think we can disable telemetry in any other way because it looks like we need to build with it. But I may be wrong since I'm not too familiar with how telemetry works.

What if we filter out these telemetry functions from the coverage data by using file and function names in the etl or with grcov? I could see some coverage for that code getting through. But, it might be better than adding something hacky to get telemetry data out of per-test data. Assuming it's hard to get telemetry data removed with other methods.
I think there might be a timer somewhere that when triggered calls some Telemetry function to collect data. We just need to find out how to disable this timer (probably through a preference), and set this preference during tests (we already do this for several other similar things).
Ok sounds goood, I'll start looking around for that timer.
There are at least some portions of Telemetry that still work when we set the pref 'initDelay' to a very high number: https://treeherder.mozilla.org/#/jobs?repo=try&revision=b5183a4169f6352c261f2ba045eba9c30d608f31&selectedJob=192884180

Here's a full linux64-ccov run with initDelay set to a very high number: https://treeherder.mozilla.org/#/jobs?repo=try&revision=094e5accd4569f41dd0dd8351803132c5a217984

I'm not sure if those failures are legitimately caused by the change to initDelay because I have yet to find the link between the test that failed in 'dt4' at [1], and the Telemetry functions. I've re-triggered those chunks to see if it's an intermittent.

[1]: https://dxr.mozilla.org/mozilla-central/source/devtools/client/inspector/animation/test/browser_animation_rewind-button.js
There's only one failure that seems to caused by setting this value very high, and that is [1] mentioned in comment 9. So, we could disable telemetry with this flag - and either skip or fix that one test.

I'm going to work on trying to recreate the initial test that found this difference first to see if we still find telemetry functions in there after this change.
if it is 1 failure, lets fix the test if possible or we can |skip-if = coverage|.  Good idea to revisit this to see if changing the initDelay will show telemetry functions in per-test code coverage.
I think I recreated it in this push: https://treeherder.mozilla.org/#/jobs?repo=try&revision=37022d37ae3a7cb3bce5740dc20b27809b95fb1c&selectedJob=194154218

I've attached a file containing all the coverage that I see in the test-coverage task for tools/code-coverage/tests/mochitest/test_coverage_specialpowers.html with the following changes (see comment 5): https://hg.mozilla.org/try/rev/0eaaf5290d5c683daa93603b326370e98c577a2b

The coverage was found with the command `py pertestcoverage_view.py ~\Downloads\per-test-coverage-reports(93) -t special -s tele` using: https://github.com/gmierz/coco-tools

Unfortunately, this change didn't disable telemetry completely. I'll attach the coverage that it has in this test: https://treeherder.mozilla.org/#/jobs?repo=try&revision=73253340f90877cc00ad48ea1e24d9ed4f5a50fc

I'm testing this with a different change here as well: https://treeherder.mozilla.org/#/jobs?repo=try&revision=7e02f39cabcc5ada10bd3901e61e7664b468a506
Coverage with telemetry disabled through the initDelay change.
These are the differences between the two. Each entry, (-1--2), means "what is in the coverage when telemetry is enabled but not in the coverage when it is disabled", and (-2--1) means the opposite.
Also, here is another preference which could cause false differences between tests, and variability in certain cases: https://dxr.mozilla.org/mozilla-central/source/testing/profiles/unittest/user.js#16

But I don't think we can't disable this one, maybe trying to capture it in the baseline would be better?
Here's a graph for the change in coverage across the files in the .txt attachments. Point 1 is where the initDelay was unchanged, and points 2, and 3 showed a stable drop in coverage from the change.

So the change does help but it doesn't stop everything from coming through. I'm going to look into how variable the coverage from this task is because I haven't seen a call to GetWebrtcStats yet [1] (around lines 687 to 689).

Luckily for us though, it seems like the baseline captures everything that's left over. This can be seen just by looking at the Telemetry.cpp coverage:

Baseline coverage:  toolkit/components/telemetry/Telemetry.cpp: [120, 293, 542, 548, 552, 553, 565, 604, 607, 608, 611, 613, 614, 618, 620, 648, 650, 651, 653, 655, 657, 660, 663, 667, 669, 1088, 1091, 1095, 1098, 1110, 1111, 1112, 1116, 1120, 1137, 1138, 1139, 1143, 1147, 1148, 1149, 1150, 1151, 1169, 1173, 1175, 1179, 1181, 1183, 1184, 1185, 1186, 1193, 1197, 1200, 1201, 1204, 1207, 1210, 1212, 1214, 1215, 1217, 1229, 1547, 1549, 1553, 1554, 1558, 1560, 1564, 1565, 1580, 1581, 1628, 1630, 1634, 1638, 1639, 1646, 1652, 1654, 1658, 1660, 1664, 1666, 1748, 1752, 1801, 1803, 1804, 1932, 1934, 1935, 1944, 1946, 1947, 1980, 1982, 1986, 1987, 1991, 1996, 2000, 2001, 2010, 2012, 2016, 2018, 2048, 2052, 2053, 2054, 2114, 2117, 2121, 2123, 2127, 2129, 2132, 2133, 2134, 2137, 2143, 2145, 2146, 2149, 2151, 2152, 2155, 2157, 2158, 2167, 2169, 2170, 2173, 2175, 2176, 2185, 2187, 2188, 2197]

Without baseline correction, with telemetry disabled: toolkit/components/telemetry/Telemetry.cpp: [1088, 1091, 1138, 1558, 1560, 1564, 1565, 1580, 1932, 1934, 1935, 1944, 1946, 1947, 1980, 1982, 1986, 1987, 2016, 2018, 2143, 2145, 2146, 2167, 2169, 2170]

With baseline correction, with telemetry disabled: toolkit/components/telemetry/Telemetry.cpp: []

Without baseline correction, with telemetry enabled: toolkit/components/telemetry/Telemetry.cpp: [443, 448, 453, 454, 458, 463, 468, 469, 473, 476, 482, 490, 491, 492, 493, 630, 634, 635, 639, 643, 644, 693, 695, 696, 1088, 1091, 1137, 1138, 1139, 1169, 1173, 1175, 1558, 1560, 1564, 1565, 1580, 1628, 1630, 1670, 1673, 1698, 1701, 1702, 1932, 1934, 1935, 1944, 1946, 1947, 1980, 1982, 1986, 1987, 2016, 2018, 2143, 2145, 2146, 2167, 2169, 2170]


With baseline correction, with telemetry enabled: toolkit/components/telemetry/Telemetry.cpp: [443, 448, 453, 454, 458, 463, 468, 469, 473, 476, 482, 490, 491, 492, 493, 630, 634, 635, 639, 643, 644, 693, 695, 696, 1670, 1673, 1698, 1701, 1702, 2167, 2169, 2170]

As mentioned above, I'm going to look into how variable these results are. There is at least line 2170 that seems to be have been an intermittent line before disabling.


[1]: https://dxr.mozilla.org/mozilla-central/source/toolkit/components/telemetry/Telemetry.cpp?q=Telemetry.cpp&redirect_type=direct#685
These are the differences between chunks left over after disabling telemetry but without baseline correction.
And these are the differences with telemetry disabled, and after baseline correction.

There are some other files which were used randomly (grouped into file-level variability): 'toolkit/components/telemetry/ipc/TelemetryComms.h', 'toolkit/components/telemetry/TelemetryHistogram.cpp', 'toolkit/components/telemetry/Telemetry.cpp'

Or:
toolkit/components/telemetry/TelemetryHistogram.cpp: [1410, 1416, 1417]
toolkit/components/telemetry/Telemetry.cpp: [2167, 2169, 2170]
toolkit/components/telemetry/ipc/TelemetryComms.h: [189, 190, 220, 222, 225, 226]

So, disabling it with this flag definitely helped - but there may be another setting we could change to fix this last bit of variability.
Complete variability of test_coverage_specialpowers after base correction and disabling telemetry.
I was looking through the data in this push [1] and at files with 'time' in their name, and I found this file, with intermittent coverage of certain lines, but only happening in the first time the file is seen across all the pushes:

"js/xpconnect/src/XPCRuntimeService.cpp": {

    "-1--2": [
        53,
        58,
        59,
        60,
        61
    ],
    "-2--1": [ ]

},


This is called by: https://dxr.mozilla.org/mozilla-central/source/js/xpconnect/src/XPCWrappedNativeJSOps.cpp#734
And these lines are called uniquely in the first run:
"js/xpconnect/src/XPCWrappedNativeJSOps.cpp": {

    "-1--2": [
        768,
        384,
        385,
        775,
        753,
        754,
        755,
        756,
        757,
        758,
        760,
        764,
        767
    ],
    "-2--1": [ ]

}


With this coverage afterwards:
js/xpconnect/src/XPCWrappedNativeJSOps.cpp: [99, 101, 103, 104, 106, 107, 108, 111, 114, 119, 120, 121, 123, 124, 125, 128, 222, 236, 237, 238, 242, 244, 245, 248, 252, 253, 255, 258, 261, 262, 270, 271, 272, 277, 278, 279, 280, 281, 282, 351, 373, 374, 375, 376, 377, 378, 379, 382, 383, 389, 390, 393, 394, 395, 396, 397, 402, 404, 405, 406, 407, 408, 409, 410, 413, 414, 415, 417, 424, 426, 427, 428, 431, 496, 498, 499, 502, 503, 506, 507, 509, 513, 515, 516, 519, 520, 521, 525, 527, 528, 539, 541, 542, 543, 545, 547, 548, 549, 550, 553, 555, 556, 559, 561, 562, 563, 565, 566, 570, 573, 579, 617, 619, 620, 621, 623, 664, 666, 668, 671, 672, 675, 676, 677, 681, 683, 684, 685, 689, 690, 693, 694, 695, 699, 702, 703, 704, 705, 715, 717, 718, 719, 720, 721, 722, 724, 726, 727, 729, 731, 732, 734, 736, 737, 740, 741, 743, 747, 748, 749, 779, 854, 856, 859, 860, 861, 871, 873, 874, 875, 877, 878, 881, 883, 884, 885, 887, 890, 892, 893, 897, 899, 900, 901, 903, 907, 909, 911, 912, 913, 915, 918, 921, 922, 923, 924, 925, 930, 931, 970, 973, 974, 975, 976, 979, 982, 983, 986, 987, 991, 995, 996, 997, 998]


Now, that function is called by either (from what I've found so far): 
  (1): https://dxr.mozilla.org/mozilla-central/search?q=%2Bcallers%3A%22nsIXPCScriptable%3A%3AResolve%28nsIXPConnectWrappedNative+%2A%2C+JSContext+%2A%2C+JSObject+%2A%2C+jsid%2C+bool+%2A%2C+bool+%2A%29%22
  (2): https://dxr.mozilla.org/mozilla-central/search?q=%2Bcallers%3A%22BackstagePass%3A%3AResolve%28nsIXPConnectWrappedNative+%2A%2C+JSContext+%2A%2C+JSObject+%2A%2C+jsid%2C+bool+%2A%2C+bool+%2A%29%22


But I found no references to (2) being used in the code and it's a '.h' file which is already known to have problems. The intermittent line here is 121 and it's only hit in the first run (everything else is always hit):

obj-firefox/dist/include/nsIXPCScriptable.h: [63, 108, 111, 118, 119, 120, 121]

Furthermore, the line which calls them (line 734 in XPCWrappedNativeJSOps.cpp) is always used in all 21 instances of the test runs. Which makes me think that this is a bug, or I can't find the other references to this call so I can't check if it has coverage, or it's always hit in (2) but since it's a '.h' file it's bugged. This might be a different issue we could get around with the baseline though.



(test_coverage_specialpowers without baseline correction, with telemetry disabled)
[1]: https://treeherder.mozilla.org/#/jobs?repo=try&revision=7e02f39cabcc5ada10bd3901e61e7664b468a506
I tested out changing the `apz.content_response_timeout` preference but I can't find a significant change in variability from this push: https://treeherder.mozilla.org/#/jobs?repo=try&revision=39bb578d8b5427aeeeaceab23a29ed221302ae2a

It might have helped, but it would be a small difference that is not yet visible until we find larger sources of variability.
In this bug, I would only consider things that we find using a code coverage build which we let stay idle for a while.
These things should be disabled for all build configurations, not only code coverage ones, as they are sources of intermittent failures (and because we want the code coverage configuration to be as close as possible to normal configurations).

Sparky, I'd make this a meta bug and file separate bugs for every thing we find so that we can fix one at a time, otherwise we get down a rabbit hole trying to fix all possible variability sources.
Summary: Use a code coverage build to find things running on timers that should be disabled during tests → [meta] Use a code coverage build to find things running on timers that should be disabled during tests
Keywords: meta
As an alternative approach, have you tried waiting for browser-delayed-startup-finished, or similar?
:gbrown, I haven't tried that, thanks for pointing it out! I'll try that out next.

What other similar approaches are you thinking off?
Depends on: 1487277
(In reply to Greg Mierzwinski [:sparky] from comment #24)
> What other similar approaches are you thinking off?

Nothing specific. Generally, "wait for some event after startup", so that most off-main-thread slightly-delayed-after-startup initialization is likely complete. Or maybe just wait for 60? seconds after startup before starting tests?
Ok, we currently wait for 30 seconds in the baseline tests. I'll check to see if waiting on that observer notification can change our results.

But I'll look into the notifications some more because there could be others we could wait for.
In the run I did I was waiting for way longer (120 s) before starting to collect the coverage data, as I just wanted to see what are the things running on timers that we might want to disable during tests.
Depends on: 1489082
Depends on: 1490064
Severity: normal → S3
You need to log in before you can comment on or make changes to this bug.

Attachment

General

Created:
Updated:
Size: