Open Bug 1473626 Opened 6 years ago Updated 2 years ago

[meta] Use a code coverage build to find things running on timers that should be disabled during tests

Tracking

(Not tracked)

Status:

NEW

People

(Reporter: marco, Unassigned)

References

(Depends on 2 open bugs, Blocks 1 open bug)

Details

(Keywords: meta)

Attachments

(7 files)

telemetry_coverage_after_3mins.txt 6 years ago Greg Mierzwinski [:sparky] 5.04 KB, text/plain		Details
telemetry_coverage_after_3mins_withdisablechange.txt 6 years ago Greg Mierzwinski [:sparky] 3.54 KB, text/plain		Details
line_level_differences_1534374907.json 6 years ago Greg Mierzwinski [:sparky] 9.00 KB, application/json		Details
telemetry_differences_1enabled_23disabled.png 6 years ago Greg Mierzwinski [:sparky] 33.77 KB, image/png		Details
line_level_differences_1534703677.json 6 years ago Greg Mierzwinski [:sparky] 39.40 KB, application/json		Details
line_level_differences_1534705365.json 6 years ago Greg Mierzwinski [:sparky] 8.54 KB, application/json		Details
Variability of test_coverage_specialpowers after base correction and disabled telemetry 6 years ago Greg Mierzwinski [:sparky] 29.89 KB, application/json		Details

Marco Castelluccio [:marco]

Reporter

Description

•

6 years ago

We could:
1) Run an empty test;
2) Reset coverage counters;
3) Wait 120 seconds;
4) Dump coverage counters.

After 4, we can generate a report and see what was executed.

By disabling things that are executed on timers that shouldn't be executed during tests, we could make per-test coverage data cleaner, and we might fix some intermittents due to things running on timers badly interacting with tests.

Marco Castelluccio [:marco]

Reporter

Updated

•

6 years ago

Blocks: 1471577

Greg Mierzwinski [:sparky]

Comment 1

•

6 years ago

Good idea! The last timer was found by looking at the raw lines and files covered so doing this would be much easier. We could use the baseline tests as the empty tests, and reset counters and wait in the test harness before the browser is closed, but after the test finishes. I think that would also remove the timers we already know about.

Marco Castelluccio [:marco]

Reporter

Comment 2

•

6 years ago

I've generated a test report, there are several Telemetry functions covered (so I guess we are collecting some Telemetry data on a timer, we should likely disable it during tests). E.g. GetWebrtcStats.

Greg Mierzwinski [:sparky]

Comment 3

•

6 years ago

:marco, could you post a link to the code you used to find that?

Greg Mierzwinski [:sparky]

Updated

•

6 years ago

Flags: needinfo?(mcastelluccio)

Greg Mierzwinski [:sparky]

Comment 4

•

6 years ago

I think we should try this with telemetry disabled in the moz-configs: https://hg.mozilla.org/try/rev/334dd0c80f0eea321c2eea1ce4667af66e7accf0

You'll see it was explicitly enabled for windows but I'm not sure why. Telemetry was enabled by default (in common-opt [1][2]) in mac and linux.

[1]: https://dxr.mozilla.org/mozilla-central/source/browser/config/mozconfigs/linux64/common-opt
[2]: https://dxr.mozilla.org/mozilla-central/source/browser/config/mozconfigs/macosx64/common-opt

Marco Castelluccio [:marco]

Reporter

Comment 5

•

6 years ago

(In reply to Greg Mierzwinski [:sparky] from comment #3)
> :marco, could you post a link to the code you used to find that?

I don't have it anymore unfortunately, but I think I recall all the steps I followed:
- set the GCOV_PREFIX and GCOV_RESULTS_DIR environment variables to two directories on the system;
- remove the requestResetCoverageCounters and requestDumpCoverageCounters calls from testing/mochitest/tests/SimpleTest/TestRunner.js, so that they don't conflict with the calls from inside the test;
- modify a bit tools/code-coverage/tests/mochitest/test_coverage_specialpowers.html (or just create a new test) to make it:
1) Wait a few seconds (so that any delayed initialization finishes)
2) Reset coverage counters
3) Wait 120 seconds
4) Dump coverage counters

Then, run the test, then parse the gcda files that are in GCOV_RESULTS_DIR.


(In reply to Greg Mierzwinski [:sparky] from comment #4)
> I think we should try this with telemetry disabled in the moz-configs:
> https://hg.mozilla.org/try/rev/334dd0c80f0eea321c2eea1ce4667af66e7accf0
> 
> You'll see it was explicitly enabled for windows but I'm not sure why.
> Telemetry was enabled by default (in common-opt [1][2]) in mac and linux.
> 
> [1]:
> https://dxr.mozilla.org/mozilla-central/source/browser/config/mozconfigs/
> linux64/common-opt
> [2]:
> https://dxr.mozilla.org/mozilla-central/source/browser/config/mozconfigs/
> macosx64/common-opt

I'm not sure we can disable Telemetry like this, we might stop collecting coverage for Telemetry if we do, or we might have differences in the tests between the coverage build and the other builds.

Flags: needinfo?(mcastelluccio)

Greg Mierzwinski [:sparky]

Comment 6

•

6 years ago

Good point, and I don't think we can disable telemetry in any other way because it looks like we need to build with it. But I may be wrong since I'm not too familiar with how telemetry works.

What if we filter out these telemetry functions from the coverage data by using file and function names in the etl or with grcov? I could see some coverage for that code getting through. But, it might be better than adding something hacky to get telemetry data out of per-test data. Assuming it's hard to get telemetry data removed with other methods.

Marco Castelluccio [:marco]

Reporter

Comment 7

•

6 years ago

I think there might be a timer somewhere that when triggered calls some Telemetry function to collect data. We just need to find out how to disable this timer (probably through a preference), and set this preference during tests (we already do this for several other similar things).

Greg Mierzwinski [:sparky]

Comment 8

•

6 years ago

Ok sounds goood, I'll start looking around for that timer.

Greg Mierzwinski [:sparky]

Comment 9

•

6 years ago

There are at least some portions of Telemetry that still work when we set the pref 'initDelay' to a very high number: https://treeherder.mozilla.org/#/jobs?repo=try&revision=b5183a4169f6352c261f2ba045eba9c30d608f31&selectedJob=192884180

Here's a full linux64-ccov run with initDelay set to a very high number: https://treeherder.mozilla.org/#/jobs?repo=try&revision=094e5accd4569f41dd0dd8351803132c5a217984

I'm not sure if those failures are legitimately caused by the change to initDelay because I have yet to find the link between the test that failed in 'dt4' at [1], and the Telemetry functions. I've re-triggered those chunks to see if it's an intermittent.

[1]: https://dxr.mozilla.org/mozilla-central/source/devtools/client/inspector/animation/test/browser_animation_rewind-button.js

Greg Mierzwinski [:sparky]

Comment 10

•

6 years ago

There's only one failure that seems to caused by setting this value very high, and that is [1] mentioned in comment 9. So, we could disable telemetry with this flag - and either skip or fix that one test.

I'm going to work on trying to recreate the initial test that found this difference first to see if we still find telemetry functions in there after this change.

Joel Maher ( :jmaher ) (UTC -8)

Comment 11

•

6 years ago

if it is 1 failure, lets fix the test if possible or we can |skip-if = coverage|.  Good idea to revisit this to see if changing the initDelay will show telemetry functions in per-test code coverage.

Greg Mierzwinski [:sparky]

Comment 12

•

6 years ago

Attached file telemetry_coverage_after_3mins.txt — Details

I think I recreated it in this push: https://treeherder.mozilla.org/#/jobs?repo=try&revision=37022d37ae3a7cb3bce5740dc20b27809b95fb1c&selectedJob=194154218

I've attached a file containing all the coverage that I see in the test-coverage task for tools/code-coverage/tests/mochitest/test_coverage_specialpowers.html with the following changes (see comment 5): https://hg.mozilla.org/try/rev/0eaaf5290d5c683daa93603b326370e98c577a2b

The coverage was found with the command `py pertestcoverage_view.py ~\Downloads\per-test-coverage-reports(93) -t special -s tele` using: https://github.com/gmierz/coco-tools

Unfortunately, this change didn't disable telemetry completely. I'll attach the coverage that it has in this test: https://treeherder.mozilla.org/#/jobs?repo=try&revision=73253340f90877cc00ad48ea1e24d9ed4f5a50fc

I'm testing this with a different change here as well: https://treeherder.mozilla.org/#/jobs?repo=try&revision=7e02f39cabcc5ada10bd3901e61e7664b468a506

Greg Mierzwinski [:sparky]

Comment 13

•

6 years ago

Attached file telemetry_coverage_after_3mins_withdisablechange.txt — Details

Coverage with telemetry disabled through the initDelay change.

Greg Mierzwinski [:sparky]

Comment 14

•

6 years ago

Attached file line_level_differences_1534374907.json — Details

These are the differences between the two. Each entry, (-1--2), means "what is in the coverage when telemetry is enabled but not in the coverage when it is disabled", and (-2--1) means the opposite.

Greg Mierzwinski [:sparky]

Comment 15

•

6 years ago

Also, here is another preference which could cause false differences between tests, and variability in certain cases: https://dxr.mozilla.org/mozilla-central/source/testing/profiles/unittest/user.js#16

But I don't think we can't disable this one, maybe trying to capture it in the baseline would be better?

Greg Mierzwinski [:sparky]

Comment 16

•

6 years ago

Attached image telemetry_differences_1enabled_23disabled.png — Details

Here's a graph for the change in coverage across the files in the .txt attachments. Point 1 is where the initDelay was unchanged, and points 2, and 3 showed a stable drop in coverage from the change.

So the change does help but it doesn't stop everything from coming through. I'm going to look into how variable the coverage from this task is because I haven't seen a call to GetWebrtcStats yet [1] (around lines 687 to 689).

Luckily for us though, it seems like the baseline captures everything that's left over. This can be seen just by looking at the Telemetry.cpp coverage:

Baseline coverage:  toolkit/components/telemetry/Telemetry.cpp: [120, 293, 542, 548, 552, 553, 565, 604, 607, 608, 611, 613, 614, 618, 620, 648, 650, 651, 653, 655, 657, 660, 663, 667, 669, 1088, 1091, 1095, 1098, 1110, 1111, 1112, 1116, 1120, 1137, 1138, 1139, 1143, 1147, 1148, 1149, 1150, 1151, 1169, 1173, 1175, 1179, 1181, 1183, 1184, 1185, 1186, 1193, 1197, 1200, 1201, 1204, 1207, 1210, 1212, 1214, 1215, 1217, 1229, 1547, 1549, 1553, 1554, 1558, 1560, 1564, 1565, 1580, 1581, 1628, 1630, 1634, 1638, 1639, 1646, 1652, 1654, 1658, 1660, 1664, 1666, 1748, 1752, 1801, 1803, 1804, 1932, 1934, 1935, 1944, 1946, 1947, 1980, 1982, 1986, 1987, 1991, 1996, 2000, 2001, 2010, 2012, 2016, 2018, 2048, 2052, 2053, 2054, 2114, 2117, 2121, 2123, 2127, 2129, 2132, 2133, 2134, 2137, 2143, 2145, 2146, 2149, 2151, 2152, 2155, 2157, 2158, 2167, 2169, 2170, 2173, 2175, 2176, 2185, 2187, 2188, 2197]

Without baseline correction, with telemetry disabled: toolkit/components/telemetry/Telemetry.cpp: [1088, 1091, 1138, 1558, 1560, 1564, 1565, 1580, 1932, 1934, 1935, 1944, 1946, 1947, 1980, 1982, 1986, 1987, 2016, 2018, 2143, 2145, 2146, 2167, 2169, 2170]

With baseline correction, with telemetry disabled: toolkit/components/telemetry/Telemetry.cpp: []

Without baseline correction, with telemetry enabled: toolkit/components/telemetry/Telemetry.cpp: [443, 448, 453, 454, 458, 463, 468, 469, 473, 476, 482, 490, 491, 492, 493, 630, 634, 635, 639, 643, 644, 693, 695, 696, 1088, 1091, 1137, 1138, 1139, 1169, 1173, 1175, 1558, 1560, 1564, 1565, 1580, 1628, 1630, 1670, 1673, 1698, 1701, 1702, 1932, 1934, 1935, 1944, 1946, 1947, 1980, 1982, 1986, 1987, 2016, 2018, 2143, 2145, 2146, 2167, 2169, 2170]


With baseline correction, with telemetry enabled: toolkit/components/telemetry/Telemetry.cpp: [443, 448, 453, 454, 458, 463, 468, 469, 473, 476, 482, 490, 491, 492, 493, 630, 634, 635, 639, 643, 644, 693, 695, 696, 1670, 1673, 1698, 1701, 1702, 2167, 2169, 2170]

As mentioned above, I'm going to look into how variable these results are. There is at least line 2170 that seems to be have been an intermittent line before disabling.


[1]: https://dxr.mozilla.org/mozilla-central/source/toolkit/components/telemetry/Telemetry.cpp?q=Telemetry.cpp&redirect_type=direct#685

Greg Mierzwinski [:sparky]

Comment 17

•

6 years ago

Attached file line_level_differences_1534703677.json — Details

These are the differences between chunks left over after disabling telemetry but without baseline correction.

Greg Mierzwinski [:sparky]

Comment 18

•

6 years ago

Attached file line_level_differences_1534705365.json — Details

And these are the differences with telemetry disabled, and after baseline correction.

There are some other files which were used randomly (grouped into file-level variability): 'toolkit/components/telemetry/ipc/TelemetryComms.h', 'toolkit/components/telemetry/TelemetryHistogram.cpp', 'toolkit/components/telemetry/Telemetry.cpp'

Or:
toolkit/components/telemetry/TelemetryHistogram.cpp: [1410, 1416, 1417]
toolkit/components/telemetry/Telemetry.cpp: [2167, 2169, 2170]
toolkit/components/telemetry/ipc/TelemetryComms.h: [189, 190, 220, 222, 225, 226]

So, disabling it with this flag definitely helped - but there may be another setting we could change to fix this last bit of variability.

Greg Mierzwinski [:sparky]

Comment 19

•

6 years ago

Attached file Variability of test_coverage_specialpowers after base correction and disabled telemetry — Details

Complete variability of test_coverage_specialpowers after base correction and disabling telemetry.

Greg Mierzwinski [:sparky]

Comment 20

•

6 years ago

I was looking through the data in this push [1] and at files with 'time' in their name, and I found this file, with intermittent coverage of certain lines, but only happening in the first time the file is seen across all the pushes:

"js/xpconnect/src/XPCRuntimeService.cpp": {

    "-1--2": [
        53,
        58,
        59,
        60,
        61
    ],
    "-2--1": [ ]

},


This is called by: https://dxr.mozilla.org/mozilla-central/source/js/xpconnect/src/XPCWrappedNativeJSOps.cpp#734
And these lines are called uniquely in the first run:
"js/xpconnect/src/XPCWrappedNativeJSOps.cpp": {

    "-1--2": [
        768,
        384,
        385,
        775,
        753,
        754,
        755,
        756,
        757,
        758,
        760,
        764,
        767
    ],
    "-2--1": [ ]

}


With this coverage afterwards:
js/xpconnect/src/XPCWrappedNativeJSOps.cpp: [99, 101, 103, 104, 106, 107, 108, 111, 114, 119, 120, 121, 123, 124, 125, 128, 222, 236, 237, 238, 242, 244, 245, 248, 252, 253, 255, 258, 261, 262, 270, 271, 272, 277, 278, 279, 280, 281, 282, 351, 373, 374, 375, 376, 377, 378, 379, 382, 383, 389, 390, 393, 394, 395, 396, 397, 402, 404, 405, 406, 407, 408, 409, 410, 413, 414, 415, 417, 424, 426, 427, 428, 431, 496, 498, 499, 502, 503, 506, 507, 509, 513, 515, 516, 519, 520, 521, 525, 527, 528, 539, 541, 542, 543, 545, 547, 548, 549, 550, 553, 555, 556, 559, 561, 562, 563, 565, 566, 570, 573, 579, 617, 619, 620, 621, 623, 664, 666, 668, 671, 672, 675, 676, 677, 681, 683, 684, 685, 689, 690, 693, 694, 695, 699, 702, 703, 704, 705, 715, 717, 718, 719, 720, 721, 722, 724, 726, 727, 729, 731, 732, 734, 736, 737, 740, 741, 743, 747, 748, 749, 779, 854, 856, 859, 860, 861, 871, 873, 874, 875, 877, 878, 881, 883, 884, 885, 887, 890, 892, 893, 897, 899, 900, 901, 903, 907, 909, 911, 912, 913, 915, 918, 921, 922, 923, 924, 925, 930, 931, 970, 973, 974, 975, 976, 979, 982, 983, 986, 987, 991, 995, 996, 997, 998]


Now, that function is called by either (from what I've found so far): 
  (1): https://dxr.mozilla.org/mozilla-central/search?q=%2Bcallers%3A%22nsIXPCScriptable%3A%3AResolve%28nsIXPConnectWrappedNative+%2A%2C+JSContext+%2A%2C+JSObject+%2A%2C+jsid%2C+bool+%2A%2C+bool+%2A%29%22
  (2): https://dxr.mozilla.org/mozilla-central/search?q=%2Bcallers%3A%22BackstagePass%3A%3AResolve%28nsIXPConnectWrappedNative+%2A%2C+JSContext+%2A%2C+JSObject+%2A%2C+jsid%2C+bool+%2A%2C+bool+%2A%29%22


But I found no references to (2) being used in the code and it's a '.h' file which is already known to have problems. The intermittent line here is 121 and it's only hit in the first run (everything else is always hit):

obj-firefox/dist/include/nsIXPCScriptable.h: [63, 108, 111, 118, 119, 120, 121]

Furthermore, the line which calls them (line 734 in XPCWrappedNativeJSOps.cpp) is always used in all 21 instances of the test runs. Which makes me think that this is a bug, or I can't find the other references to this call so I can't check if it has coverage, or it's always hit in (2) but since it's a '.h' file it's bugged. This might be a different issue we could get around with the baseline though.



(test_coverage_specialpowers without baseline correction, with telemetry disabled)
[1]: https://treeherder.mozilla.org/#/jobs?repo=try&revision=7e02f39cabcc5ada10bd3901e61e7664b468a506

Greg Mierzwinski [:sparky]

Comment 21

•

6 years ago

I tested out changing the `apz.content_response_timeout` preference but I can't find a significant change in variability from this push: https://treeherder.mozilla.org/#/jobs?repo=try&revision=39bb578d8b5427aeeeaceab23a29ed221302ae2a

It might have helped, but it would be a small difference that is not yet visible until we find larger sources of variability.

Marco Castelluccio [:marco]

Reporter

Comment 22

•

6 years ago

In this bug, I would only consider things that we find using a code coverage build which we let stay idle for a while.
These things should be disabled for all build configurations, not only code coverage ones, as they are sources of intermittent failures (and because we want the code coverage configuration to be as close as possible to normal configurations).

Sparky, I'd make this a meta bug and file separate bugs for every thing we find so that we can fix one at a time, otherwise we get down a rabbit hole trying to fix all possible variability sources.

Marco Castelluccio [:marco]

Reporter

Updated

•

6 years ago

Summary: Use a code coverage build to find things running on timers that should be disabled during tests → [meta] Use a code coverage build to find things running on timers that should be disabled during tests

Marco Castelluccio [:marco]

Reporter

Updated

•

6 years ago

Keywords: meta

Geoff Brown [:gbrown]

Comment 23

•

6 years ago

As an alternative approach, have you tried waiting for browser-delayed-startup-finished, or similar?

Greg Mierzwinski [:sparky]

Comment 24

•

6 years ago

:gbrown, I haven't tried that, thanks for pointing it out! I'll try that out next.

What other similar approaches are you thinking off?

Greg Mierzwinski [:sparky]

Updated

•

6 years ago

Depends on: 1487277

Geoff Brown [:gbrown]

Comment 25

•

6 years ago

(In reply to Greg Mierzwinski [:sparky] from comment #24)
> What other similar approaches are you thinking off?

Nothing specific. Generally, "wait for some event after startup", so that most off-main-thread slightly-delayed-after-startup initialization is likely complete. Or maybe just wait for 60? seconds after startup before starting tests?

Greg Mierzwinski [:sparky]

Comment 26

•

6 years ago

Ok, we currently wait for 30 seconds in the baseline tests. I'll check to see if waiting on that observer notification can change our results.

But I'll look into the notifications some more because there could be others we could wait for.

Marco Castelluccio [:marco]

Reporter

Comment 27

•

6 years ago

In the run I did I was waiting for way longer (120 s) before starting to collect the coverage data, as I just wanted to see what are the things running on timers that we might want to disable during tests.

Marco Castelluccio [:marco]

Reporter

Updated

•

6 years ago

Depends on: 1489082

Greg Mierzwinski [:sparky]

Updated

•

6 years ago

Depends on: 1490064

BMO Automation

Updated

•

2 years ago

Severity: normal → S3

You need to log in before you can comment on or make changes to this bug.