Closed Bug 1383070 Opened 7 years ago Closed 7 years ago

Intermittent browser_onboarding_notification_3,browser_onboarding_notification_4.js,browser_onboarding_tours{,et}.js | Uncaught exception - Should load onboarding overlay - timed out after 50 tries.

Categories

(Firefox :: New Tab Page, defect, P1)

defect

Tracking

()

RESOLVED FIXED
Firefox 56
Tracking Status
firefox56 --- fixed

People

(Reporter: intermittent-bug-filer, Assigned: Fischer)

References

Details

(Keywords: intermittent-failure, Whiteboard: [photon-onboarding][stockwell fixed:timing])

Attachments

(1 file)

The same issue as the bug 1381335. NI myself to keep tracking.
Component: Tours → New Tab Page
Flags: needinfo?(fliu)
Whiteboard: [photon-onboarding]
This looks to have gotten worse after bug 1159003 per investigation in bug 1382079 comment 28.
As noted in bug 1382079 comment 31, the Linux64 "load onboarding overlay" wait time increased from 296ms in the median case to 2861ms. And for Linux32, it increased from 378ms to 3675ms.

I pushed a try with significantly longer timeouts to find these numbers:
https://hg.mozilla.org/try/rev/852945be69dc4ef6bb7a1d81d988a5ab22755a13

Instead of waiting ~5 seconds for the onboarding overlay to load, it waits ~20 seconds.

For the 10 runs from that commit, onboarding_tours loads the overlay 6 times, and 21 of those loads took over 5 seconds, which roughly means 35% of those runs would have timed out with the current 5 second (100ms*50tries) waitForCondition.
baku, do you know how bug 1159003 would have caused the initial onboarding overlay load to take ~10x the time to appear compared to before this change:

Bug 1159003 - Remove Performance::GetAsISupports(), r=bz
https://hg.mozilla.org/mozilla-central/rev/f4378a5e6c7f

In bug 1382079 comment 31 I found that that specific commit increased loading times to ~2900ms whereas its immediate changeset parent on `inbound` had loading times of ~290ms.

Fischer can go into more details of what onboarding and its tests are doing. But from my glancing at it, onboarding:
https://searchfox.org/mozilla-central/source/browser/extensions/onboarding/content/onboarding.js#778

It waits for page load then requestIdleCallback before initializing Onboarding.
Blocks: 1159003
Flags: needinfo?(amarchesini)
Summary: Intermittent browser/extensions/onboarding/test/browser/browser_onboarding_notification_3.js | Uncaught exception - Should load onboarding overlay - timed out after 50 tries. → Intermittent browser_onboarding_notification_{3,4}.js,browser_onboarding_tours{,et}.js | Uncaught exception - Should load onboarding overlay - timed out after 50 tries.
(In reply to Fischer [:Fischer] from comment #7)
> Created attachment 8889150 [details]
> Bug 1383070 - Intermittent: Uncaught exception Should load onboarding
> overlay, .
> 
> Review commit: https://reviewboard.mozilla.org/r/160258/diff/#index_header
> See other reviews: https://reviewboard.mozilla.org/r/160258/
Hi Mossop,

The onboarding tests are experiencing Intermittent Uncaught exception - Should load onboarding overlay on Linux Debug.
The comment 5 (thanks for :Mardak's investigation) pointed out maybe some recent update affects timing so this intermittent happens. It would be better to make our tests more reliable.

This patch tries to make sure we go checking the onboarding element at a more reliable timing.
You will see `requestLongerTimeout(3)` in the patch.
This is because during testing few "This test exceeded the timeout threshold. It should be rewritten or split up...." issue occurred on the notification tests. And it occurred even with `requestLongerTimeout(2)`. However we've already put only 2~3 test cases in each file. It could be hard to manage the case of one test in one file only. Modifying the test cases is kind of not an option either because we have to reload page a few times to test notifications timing policy. So go for `requestLongerTimeout(3)`.

TRY with 20 rebuilds on Linux: https://treeherder.mozilla.org/#/jobs?repo=try&revision=eea1eb38b15e14f53e44f7d997cbba76b94e51df
TRY with 5 rebuilds on all platform: https://treeherder.mozilla.org/#/jobs?repo=try&revision=5ac4313a0131568e113fc93058d2c6d7e7ae2e99

Thanks
Blocks: 1383505
(In reply to Fischer [:Fischer] from comment #7)
> Created attachment 8889150 [details]
Just to measure some timing from the above attachment of the first idle, second idle, and waiting for loading:
https://treeherder.mozilla.org/#/jobs?repo=try&revision=590dd07e323423b73e6642d8b65b85c838cd328e&selectedJob=116833298

Most of the time is just waiting for the first idle:
1330ms median,  137ms minimum, 5296ms maximum

The second idle is usually pretty quick:
 163ms median,   73ms minimum,  736ms maximum

Waiting for onboarding overlay load condition:
 123ms median,  109ms minimum, 1475ms maximum

And adding up all 3 for a total `promiseOnboardingOverlayLoaded` time for a given check:
1687ms median, 1687ms minimum, 5483ms maximum

So overall these linux32 numbers are generally faster than the 3675ms noted in comment 4.
Assignee: nobody → fliu
Status: NEW → ASSIGNED
Flags: needinfo?(fliu)
Priority: -- → P1
Whiteboard: [photon-onboarding] → [photon-onboarding][stockwell needswork]
Can we just skip waiting for idle during test runs?
Flags: needinfo?(fliu)
(In reply to Fischer [:Fischer] from comment #14)
> Comment on attachment 8889150 [details]
> Bug 1383070 - Intermittent: Uncaught exception Should load onboarding
> overlay,
> 
> Review request updated; see interdiff:
> https://reviewboard.mozilla.org/r/160258/diff/1-2/
TRY with 20 rebuilds on Linux: https://treeherder.mozilla.org/#/jobs?repo=try&revision=4de9df7d531d865583daf777c8695247d5878650
TRY with 5 rebuilds on all platform: https://treeherder.mozilla.org/#/jobs?repo=try&revision=809993434e592ade714709b814aae86ac4704500
Comment on attachment 8889150 [details]
Bug 1383070 - Intermittent: Uncaught exception Should load onboarding overlay,

https://reviewboard.mozilla.org/r/160258/#review166242

::: browser/extensions/onboarding/test/browser/head.js:59
(Diff revision 2)
> -    return ContentTask.spawn(browser, {}, function() {
> -      return new Promise(resolve => {
> -        let doc = content && content.document;
> +    let doc = content && content.document;
> -        if (doc && doc.querySelector("#onboarding-overlay")) {
> -          resolve(true);
> -          return;
> +    if (doc.querySelector("#onboarding-overlay")) {
> +      ok(true, "Should load onboarding overlay");
> +      return Promise.resolve();

Why not just `return` here but return a resolved promise is because the lint complained that in the previous try: https://treeherder.mozilla.org/#/jobs?repo=try&revision=977d216a33f408a4cf19495f2c23818c776f4744&selectedJob=117316632

::: browser/extensions/onboarding/test/browser/head.js:117
(Diff revision 2)
> -  let condition = () => {
> -    return ContentTask.spawn(browser, {}, function() {
> +  function isOpened() {
> +    let doc = content && content.document;
> +    let notification = doc.querySelector("#onboarding-notification-bar");
> +    if (notification && notification.classList.contains("onboarding-opened")) {
> +      ok(true, "Should open tour notification");
> +      return Promise.resolve();

The same reason as `promiseOnboardingOverlayLoaded` for returing a resolved promise.

::: browser/extensions/onboarding/test/browser/head.js:122
(Diff revision 2)
> +      return Promise.resolve();
> +    }
> -      return new Promise(resolve => {
> +    return new Promise(resolve => {
> -        let bar = content.document.querySelector("#onboarding-notification-bar");
> +      let observer = new content.MutationObserver(mutations => {
> +        mutations.forEach(mutation => {
> +          let bar = Array.from(mutation.addedNodes)

Why `bar` here and `notification` above is because the lint complained in the previous try: https://treeherder.mozilla.org/#/jobs?repo=try&revision=977d216a33f408a4cf19495f2c23818c776f4744&selectedJob=117316632
> baku, do you know how bug 1159003 would have caused the initial onboarding
> overlay load to take ~10x the time to appear compared to before this change:

That patch produces a leak. There is a fix in bug 1382768, but it's not landed yet.
Flags: needinfo?(amarchesini)
(In reply to Dave Townsend [:mossop] from comment #11)
> Can we just skip waiting for idle during test runs?
Hi Mossop,
Try to go for the mutation observer. One thing for this approach is the timeout could be longer because of keeping waiting for the expected dom mutation in the failure case.
Please have a look, thanks.

(In reply to Fischer [:Fischer] from comment #18)
> Comment on attachment 8889150 [details]
> Bug 1383070 - Intermittent: Uncaught exception Should load onboarding
> overlay,
> 
> Review request updated; see interdiff:
> https://reviewboard.mozilla.org/r/160258/diff/2-3/
TRY with 20 rebuilds on Linux: https://treeherder.mozilla.org/#/jobs?repo=try&revision=7a723a1596c250293783e799c6c921d014600e72
TRY with 5 rebuilds on all platforms: https://treeherder.mozilla.org/#/jobs?repo=try&revision=c518c2b435077302a5a9a6099301190ab0e38a86
Flags: needinfo?(fliu) → needinfo?(dtownsend)
(In reply to Fischer [:Fischer] from comment #19)
> (In reply to Dave Townsend [:mossop] from comment #11)
> > Can we just skip waiting for idle during test runs?
> Hi Mossop,
> Try to go for the mutation observer. One thing for this approach is the
> timeout could be longer because of keeping waiting for the expected dom
> mutation in the failure case.
> Please have a look, thanks.

I still don't understand why we need requestLongerTimeout here. Can you explain what is so slow about the tests that it is needed?
Flags: needinfo?(dtownsend) → needinfo?(fliu)
Mossop: could it be the not-yet-fixed leak described in comment 17?
Flags: needinfo?(dtownsend)
I double checked the load onboarding timing with bug 1382768's fix and it's a little bit faster but still much slower. The median linux32 time to load drops from 3421.5ms to 2892ms. So still quite a bit slower than the 378ms from before.

with bug 1382768 patch: https://treeherder.mozilla.org/#/jobs?repo=try&revision=b7a2291e718fa026e04dea8cf20dc5e86d64dab4
without: https://treeherder.mozilla.org/#/jobs?repo=try&revision=c7ced9d51621dc8dee959c00bee3c1d6a0233eec
See Also: → 1381679
Summary: Intermittent browser_onboarding_notification_{3,4}.js,browser_onboarding_tours{,et}.js | Uncaught exception - Should load onboarding overlay - timed out after 50 tries. → Intermittent browser_onboarding_notification_3,browser_onboarding_notification_4.js,browser_onboarding_tours{,et}.js | Uncaught exception - Should load onboarding overlay - timed out after 50 tries.
(In reply to Dave Townsend [:mossop] from comment #20)
> (In reply to Fischer [:Fischer] from comment #19)
> > (In reply to Dave Townsend [:mossop] from comment #11)
> > > Can we just skip waiting for idle during test runs?
> > Hi Mossop,
> > Try to go for the mutation observer. One thing for this approach is the
> > timeout could be longer because of keeping waiting for the expected dom
> > mutation in the failure case.
> > Please have a look, thanks.
> 
> I still don't understand why we need requestLongerTimeout here. Can you explain what is so slow about the tests that it is needed?
Hi Mossop,

- browser_onboarding_notification.js: we tests tour notifications would be prompted in order and no more prompt after 12 times prompts (6 tours x 2 prompt chances) so have to reload 13 times.
- browser_onboarding_notification2.js: we tests mute on the 1st session, skip completed tours, not prompt completed tour, totally 6 times of reloading about:newtab.
- browser_onboarding_notification3.js: we tests each tour has only 8 impressions(require 8 reloads), tour notification's 5-day max life time, should disappear after clicking buttons, so around 14 times of reloading about:newtab.
- browser_onboarding_notification4.js: we tests no more notification after clicking close button(6 times) or clicking action button(6 times) on every tour notifications, totally 14 times of reloading about:newtab.
The tour notification tests kind of take many reloads to make sure the right prompting policy is implemented. So go for `requestLongerTimeout(3)`.

- browser_onboarding_tours.js: we load both about:home and about:newtab to test some cross tabs operations in 3 test cases.
- browser_onboarding_tourset.js: we have 3 tests but I reduce it to only load about:newtab because our tourset loading logic is independent to the underlying tabs.
These 2 tests should be less intense in terms of tab loading so I tried `requestLongerTimeout(2)`. The results are good[1]. Guess we could switch to `2` or try more times with `1` for these 2 tests before checking in.

The current base timeout is 45s. Consider the comment 10, it could take 2 ~ 5 sec until our onboarding overlay got the chance to be loaded on each loading of about:newtab (not sure if faster in the reload case). For the intense tests and the slowest case, it might take 30s(6 reloads x 5s) ~ 70s(14 reloads x 5s) and plus time for other test operations after loading the onboarding and overhead etc so initially went for `requestLongerTimeout(2)`. However, still saw 
test-run-too-long error during testing so switch to `requestLongerTimeout(3)`.

[1] https://treeherder.mozilla.org/#/jobs?repo=try&revision=15ec225779117ceaba0306a950ad0578cae4f502

Thanks.
Flags: needinfo?(fliu)
(In reply to Fischer [:Fischer] from comment #24)
Sorry for not clear enough:
The current base timeout is 45s. Consider the *comment 4*, it could take 2 ~ 5s(sometimes even more) 
(the comment 10 results are similar as well)
> until our onboarding overlay got the chance to be loaded on each
> loading of about:newtab (not sure if faster in the reload case). For the
> intense tests and the slowest case, it might take 30s(6 reloads x 5s) ~
> 70s(14 reloads x 5s) and plus time for other test operations after loading
> the onboarding and overhead etc so initially went for
> `requestLongerTimeout(2)`. However, still saw 
> test-run-too-long error during testing so switch to
> `requestLongerTimeout(3)`.
> 
> [1]
> https://treeherder.mozilla.org/#/
> jobs?repo=try&revision=15ec225779117ceaba0306a950ad0578cae4f502
> 
> Thanks.
Comment on attachment 8889150 [details]
Bug 1383070 - Intermittent: Uncaught exception Should load onboarding overlay,

https://reviewboard.mozilla.org/r/160258/#review166820

Comment 10 makes it clear that most of the time lost waiting for the onboarding stuff to show up is in waiting for idle. This isn't surprising for test runs. I'd like us to not wait for idle in test runs and see if we can reduce the timeouts that way before approving this. You can test if you're in automation with Cu.isInAutomation.
Attachment #8889150 - Flags: review?(dtownsend)
Flags: needinfo?(dtownsend)
(In reply to Dave Townsend [:mossop] from comment #26)
> Comment on attachment 8889150 [details]
> Bug 1383070 - Intermittent: Uncaught exception Should load onboarding
> overlay,
> 
> https://reviewboard.mozilla.org/r/160258/#review166820
> 
> Comment 10 makes it clear that most of the time lost waiting for the
> onboarding stuff to show up is in waiting for idle. This isn't surprising
> for test runs. I'd like us to not wait for idle in test runs and see if we
> can reduce the timeouts that way before approving this. You can test if
> you're in automation with Cu.isInAutomation.
Here are my test results:

Two types of time are measured:
- T1: The time to open about:newtab, which is the time from loading one about:newtab to the about:newtab load event fired. 
- T2: The time to load overlay, which is the time from the about:newtab load event fired to the overlay element found in the about:newtab

Some edge values are excluded.
On average, without `requestIdleCallback` T2 is around  200~300ms faster.
The bottleneck is T1 and could see the case of over 15000ms happens 3~6 times out of 20 tries. 

[ Linux Debug]
# No requestIdleCallback to load Onboarding if Cu.isInAutomation
- Max T1:     18765ms
- Median T1:  4246.5ms
- Min T1:     1786ms
- Avg T1:     5722ms
- Max T2:     3162ms
- Median T2:  836.5ms
- Min T2:     573ms
- Avg T2:     1232ms

# requestIdleCallback to load Onboarding always
- Max T1:     27371ms
- Median T1:  7644.5ms
- Min T1:     3119ms
- Avg T1:     9715.5ms
- Max T2:     3249ms
- Median T2:  1142.5ms
- Min T2:     355ms
- Avg T2:     1432ms

[ Linux64 Debug]
# No requestIdleCallback to load Onboarding if Cu.isInAutomation
- Max T1:     21785ms
- Median T1:  6428ms
- Min T1:     1574ms
- Avg T1:     7113ms
- Max T2:     3153ms
- Median T2:  602.5ms
- Min T2:     73ms
- Avg T2:     726.6ms

# requestIdleCallback to load Onboarding always
- Max T1:     16012ms
- Median T1:  6076ms
- Min T1:     1146ms
- Avg T1:     7657ms
- Max T2:     2951ms
- Median T2:  1017.5ms
- Min T2:     206ms
- Avg T2:     1170ms
See Also: → 1383134
Flags: needinfo?(dtownsend)
there are 3 intermittent test failure bugs totaling 100+ failures/day related to this failure- hopefully when this bug is resolved the other two bugs will be resoled/reduced.
(In reply to Fischer [:Fischer] from comment #28)
> (In reply to Dave Townsend [:mossop] from comment #26)
> > Comment on attachment 8889150 [details]
> > Bug 1383070 - Intermittent: Uncaught exception Should load onboarding
> > overlay,
> > 
> > https://reviewboard.mozilla.org/r/160258/#review166820
> > 
> > Comment 10 makes it clear that most of the time lost waiting for the
> > onboarding stuff to show up is in waiting for idle. This isn't surprising
> > for test runs. I'd like us to not wait for idle in test runs and see if we
> > can reduce the timeouts that way before approving this. You can test if
> > you're in automation with Cu.isInAutomation.
> Here are my test results:
> 
> Two types of time are measured:
> - T1: The time to open about:newtab, which is the time from loading one
> about:newtab to the about:newtab load event fired. 
> - T2: The time to load overlay, which is the time from the about:newtab load
> event fired to the overlay element found in the about:newtab
> 
> Some edge values are excluded.
> On average, without `requestIdleCallback` T2 is around  200~300ms faster.
> The bottleneck is T1 and could see the case of over 15000ms happens 3~6
> times out of 20 tries. 

Ok so sounds like switching away from requestIdleCallback will save us at least 200ms per load and since that happens so many times I'd still like us to do it to keep test speeds quick regardless of whether it means we can remove the longer timeouts or not.
Flags: needinfo?(dtownsend)
(In reply to Dave Townsend [:mossop] from comment #30)
> Ok so sounds like switching away from requestIdleCallback will save us at
> least 200ms per load and since that happens so many times I'd still like us
> to do it to keep test speeds quick regardless of whether it means we can
> remove the longer timeouts or not.
Definitely. Updated.
Comment on attachment 8889150 [details]
Bug 1383070 - Intermittent: Uncaught exception Should load onboarding overlay,

https://reviewboard.mozilla.org/r/160258/#review167862
Attachment #8889150 - Flags: review?(dtownsend) → review+
Pushed by edilee@gmail.com:
https://hg.mozilla.org/integration/autoland/rev/2948f11c75c8
Intermittent: Uncaught exception Should load onboarding overlay, r=mossop
https://hg.mozilla.org/mozilla-central/rev/2948f11c75c8
Status: ASSIGNED → RESOLVED
Closed: 7 years ago
Resolution: --- → FIXED
Target Milestone: --- → Firefox 56
Whiteboard: [photon-onboarding][stockwell needswork] → [photon-onboarding][stockwell fixed:timing]
Hey Fischer, this doesn't look like a bug I need to verify. If I do how would I verify it?
Flags: needinfo?(fliu)
we don't have to verify this bug since its only related to unit test, thanks
Flags: needinfo?(fliu) → qe-verify-
You need to log in before you can comment on or make changes to this bug.

Attachment

General

Created:
Updated:
Size: