Closed Bug 1340415 Opened 8 years ago Closed 8 years ago

Intermittent toolkit/content/tests/browser/browser_bug295977_autoscroll_overflow.js | Failed: e should have scrolled vertically

Categories

(Toolkit :: General, defect)

defect
Not set
normal

Tracking

()

RESOLVED FIXED
mozilla58
Tracking Status
firefox57 --- wontfix
firefox58 --- fixed

People

(Reporter: intermittent-bug-filer, Assigned: botond)

References

Details

(Keywords: intermittent-failure, Whiteboard: [stockwell fixed:product])

Attachments

(1 file)

the failure rate has increased in the last 2 week, so far this week we have 20 failures: https://brasstacks.mozilla.com/orangefactor/?display=Bug&bugid=1340415 primarily on osx (opt+debug), I will ask for people to look into this if the failures increase a bit more.
this test fails primarily on windows/debug, more recently primarily on stylo-disabled: https://brasstacks.mozilla.com/orangefactor/index.html?display=Bug&bugid=1340415 this week the failure rate increased and this is on our radar- the previous weeks it was slightly lower so no large spike on a specific day. here is a recent log: https://treeherder.mozilla.org/logviewer.html#?repo=autoland&job_id=133240752&lineNumber=17948 and related screenshot: https://public-artifacts.taskcluster.net/R5BbBjzaSJegkOO56S-Rhg/0/public/test_info/mozilla-test-fail-screenshot_dbilwh.png in the above log, I see this text: 02:21:07 INFO - 889 INFO timestamp=23980.896271999998 firstTimestamp=23902.013104 timeCompensation=3.944158399999833 02:21:07 INFO - 890 INFO timestamp=24002.24512 firstTimestamp=23902.013104 timeCompensation=5.011600799999906 02:21:07 INFO - 891 INFO TEST-PASS | toolkit/content/tests/browser/browser_bug295977_autoscroll_overflow.js | j should have scrolled vertically 02:21:07 INFO - j should not have scrolled horizontally - true == true - 02:21:07 INFO - 892 INFO TEST-PASS | toolkit/content/tests/browser/browser_bug295977_autoscroll_overflow.js | Browser still focused after autoscroll started - 02:21:07 INFO - 893 INFO timestamp=24142.727816000002 firstTimestamp=24142.727816000002 timeCompensation=0 02:21:07 INFO - 894 INFO timestamp=24158.352816000002 firstTimestamp=24142.727816000002 timeCompensation=0.78125 02:21:07 INFO - 895 INFO timestamp=24173.977816000002 firstTimestamp=24142.727816000002 timeCompensation=1.5625 02:21:07 INFO - 896 INFO timestamp=24194.6392 firstTimestamp=24142.727816000002 timeCompensation=2.5955691999999546 02:21:07 INFO - 897 INFO timestamp=24207.136896 firstTimestamp=24142.727816000002 timeCompensation=3.2204539999998816 02:21:07 INFO - 898 INFO timestamp=24222.761896 firstTimestamp=24142.727816000002 timeCompensation=4.001703999999881 02:21:07 INFO - 899 INFO timestamp=24238.386896 firstTimestamp=24142.727816000002 timeCompensation=4.782953999999881 02:21:07 INFO - 900 INFO timestamp=24256.225712 firstTimestamp=24142.727816000002 timeCompensation=5.674894799999857 02:21:07 INFO - Buffered messages finished 02:21:07 ERROR - 901 INFO TEST-UNEXPECTED-FAIL | toolkit/content/tests/browser/browser_bug295977_autoscroll_overflow.js | Failed: t should have scrolled vertically 02:21:07 INFO - t should have scrolled horizontally - false == true - 02:21:07 INFO - Stack trace: 02:21:07 INFO - resource://testing-common/content-task.js line 52 > eval:null:31 02:21:07 INFO - GECKO(12720) | ++DOMWINDOW == 21 (0000020910F07000) [pid = 10720] [serial = 42] [outer = 00000209082F3000] 02:21:07 INFO - GECKO(12720) | ++DOCSHELL 0000020910F17000 == 6 [pid = 10720] [id = {0972448f-2b2d-49cb-a68b-8018148fe345}] 02:21:07 INFO - GECKO(12720) | ++DOMWINDOW == 22 (0000020910F17800) [pid = 10720] [serial = 43] [outer = 0000000000000000] 02:21:07 INFO - GECKO(12720) | ++DOCSHELL 0000020910F18000 == 7 [pid = 10720] [id = {e4c27330-be26-4c96-9d08-d2e5146d27e6}] 02:21:07 INFO - GECKO(12720) | ++DOMWINDOW == 23 (0000020910F1A000) [pid = 10720] [serial = 44] [outer = 0000000000000000] 02:21:07 INFO - GECKO(12720) | ++DOCSHELL 0000020910F1A800 == 8 [pid = 10720] [id = {d6a9f16b-4bb4-41b4-afa3-0cda204c96c5}] 02:21:07 INFO - GECKO(12720) | ++DOMWINDOW == 24 (0000020910F1B000) [pid = 10720] [serial = 45] [outer = 0000000000000000] 02:21:07 INFO - GECKO(12720) | ++DOMWINDOW == 25 (0000020910F10800) [pid = 10720] [serial = 46] [outer = 0000020910F17800] 02:21:07 INFO - GECKO(12720) | ++DOMWINDOW == 26 (0000020910F22000) [pid = 10720] [serial = 47] [outer = 0000020910F1A000] 02:21:07 INFO - GECKO(12720) | ++DOMWINDOW == 27 (0000020911420800) [pid = 10720] [serial = 48] [outer = 0000020910F1B000] 02:21:07 INFO - GECKO(12720) | --DOCSHELL 00000213A12B8000 == 6 [pid = 12720] [id = {33ad6222-2054-42ed-8064-e8879f8362d0}] 02:21:07 INFO - GECKO(12720) | --DOCSHELL 000002139FDAD800 == 5 [pid = 12720] [id = {1c134d09-a94d-425c-9b78-6e95b9842853}] 02:21:07 INFO - 902 INFO TEST-PASS | toolkit/content/tests/browser/browser_bug295977_autoscroll_overflow.js | Browser still focused after autoscroll started - :past, I see you are the triage owner for toolkit::general, can you help set the priority of this and get it in front of the right developers?
Flags: needinfo?(past)
Whiteboard: [stockwell needswork]
Could this be failing more often on stylo-disabled because of bug 1390247?
Flags: needinfo?(past) → needinfo?(botond)
I don't expect any correlation with Stylo being enabled or disabled, but given the nature of the test and the timing of the spike in intermittent failures, it's quite likely that bug 1390247 caused the spike, yes. I'll investigate.
Assignee: nobody → botond
Flags: needinfo?(botond)
So, I'm trying to investigate this, but I need some help. Since this is a low-frequency intermittent, my strategy for investigating it is to add logging to the test, push to try, retrigger until it reproduces, and look at the logs. The first question is, what to push to try. Looking at the most recent week's OrangeFactor page [1], I see one of the platforms on which the failure occurs most commonly is "Windows 10 64-bit debug". However, the chunk is not consistent: sometimes it's "mochitest-e10s-browser-chrome-1", sometimes it's "mochitest-e10s-browser-chrome-3", sometimes it's "mochitest-e10s-browser-chrome-5". So, first question: ==> Why is the chunk number not consistent between runs? And given that it's not consistent: ==> What should I push to Try to run the test? For now, I used |mach try fuzzy| to select *all* mochitest-e10s-browser-chrome0* chunks on Windows 10 64-bit debug, and pushed to Try [2]. However, I still need to know which chunk contains my test so I know what to retrigger a bunch of times. ==> How should I find out which chunk contains my test? For now, I looked at the raw logs of each of the chunks bc1, bc3, and bc5, and searched for the test name, but didn't find it in either one. Joel, any advice on how to go about this? [1] https://brasstacks.mozilla.com/orangefactor/?display=Bug&bugid=1340415&startday=2017-09-18&endday=2017-09-24&tree=all [2] https://treeherder.mozilla.org/#/jobs?repo=try&revision=b30577cad1d9bc8f3598cffd3e2687cab12fe48c
Flags: needinfo?(jmaher)
(In reply to Botond Ballo [:botond] from comment #30) > For now, I looked at the raw logs of each of the chunks bc1, bc3, and bc5, > and searched for the test name, but didn't find it in either one. Ok, after continuing to textually search the logs of the other chunks, I found the test in "bc6". Seems like the chunk number is completely unpredictable... It would be reassuring to know that there's a better way to do this, and I just don't know it.
:botond, this is not ideal at best- for browser-chrome and many other suites we rebalance whenever a test is added or removed from the list. This means that a test can move between chunks on any random commit, likewise run in different chunks on different platforms. This is really problematic trying to backfill data as you don't know if you backfilled the correct chunk. It is really frustrating to figure out what you need to do in order to push to try. With that said, I typically look for the test in the log of each of the chunks- just make sure that if you rebase your code, that you repeat this exercise.
Flags: needinfo?(jmaher)
(In reply to Joel Maher ( :jmaher) (UTC-5) from comment #33) > :botond, this is not ideal at best- for browser-chrome and many other suites > we rebalance whenever a test is added or removed from the list. This means > that a test can move between chunks on any random commit, likewise run in > different chunks on different platforms. This is really problematic trying > to backfill data as you don't know if you backfilled the correct chunk. It > is really frustrating to figure out what you need to do in order to push to > try. I filed bug 1405023 with a suggestion for how to improve this.
(In reply to Botond Ballo [:botond] from comment #30) > Since this is a low-frequency intermittent, my strategy for investigating it > is to add logging to the test, push to try, retrigger until it reproduces, > and look at the logs. I'm attempting to investigate using this approach, but bug 1405762 is making this difficult.
Depends on: 1405762
:botond, thanks for working on this bug- it is still failing, primarily on windows/osx on both opt and debug. There are a few linux failures, but very rare. 2 weeks ago this was primarily on windows/debug- I don't know if the failure is changing, but the error message looks the same. Let me know if you need more help.
I'm investigating this using a series of Try pushes with logging. It's just been slow-going as the reproduction rate is so low. What I've discovered so far is that in the failing case, the autoscroll animation is being terminated prematurely by APZ, due to a main-thread scroll offset update it receives. I need to track down the source of that scroll offset update. Also, I see a number of main-thread scroll offset updates being sent over the course of every test run, including passing ones, so I'd like to understand why the other ones don't cause any trouble.
Meanwhile, over the course of my debugging I've discovered another way the test can fail: for a middle-click over an unlayerized scroll frame, the Autoscroll:Start message can beat the layer transaction that layerizes the scroll frame to APZ, causing APZ to drop the Autoscroll:Start message, and never set up APZ autoscrolling. That's basically the equivalent of bug 1326290 (which was about async scrollbar dragging), but for autoscrolling. Unfortunately, it's more annoying to fix for autoscrolling because the Autoscroll:Start message is not sent from C++ code. We'll either need to drop into C++ code to send it, or find another way to guarantee the correct ordering of messages (such as perhaps splitting the APZ and non-APZ messages, and sending the APZ message in the first requestAnimationFrame() call).
(In reply to Botond Ballo [:botond] from comment #38) > What I've discovered so far is that in the failing case, the autoscroll > animation is being terminated prematurely by APZ, due to a main-thread > scroll offset update it receives. I need to track down the source of that > scroll offset update. Also, I see a number of main-thread scroll offset > updates being sent over the course of every test run, including passing > ones, so I'd like to understand why the other ones don't cause any trouble. It looks like the main-thread scroll offset update is being triggered by the scrollBy() call in autoscrollLoop() [1], the main-thread autoscroll implementation. We have logic to skip that scrollBy() call if the autoscroll is handled by APZ, but it only kicks in once we receive a notification from APZ that it is handling the autoscroll. Sometimes the main thread can sneak in one or more calls to autoscrollLoop() before that notification arrives. If the timing is particularly unfortunate, the resulting scroll offset update can arrive to APZ after it has started the autoscroll animation, and cancel the animation. [1] http://searchfox.org/mozilla-central/rev/1a4a26905f923458679a59a4be1e455ebc53c333/toolkit/content/browser-content.js#258
(In reply to Botond Ballo [:botond] from comment #39) > Meanwhile, over the course of my debugging I've discovered another way the > test can fail: for a middle-click over an unlayerized scroll frame, the > Autoscroll:Start message can beat the layer transaction that layerizes the > scroll frame to APZ, causing APZ to drop the Autoscroll:Start message, and > never set up APZ autoscrolling. Although, I'm not sure why in such cases we wouldn't at least fall back to main-thread autoscrolling...
(In reply to Botond Ballo [:botond] from comment #40) > It looks like the main-thread scroll offset update is being triggered by the > scrollBy() call in autoscrollLoop() [1], the main-thread autoscroll > implementation. > > We have logic to skip that scrollBy() call if the autoscroll is handled by > APZ, but it only kicks in once we receive a notification from APZ that it is > handling the autoscroll. Sometimes the main thread can sneak in one or more > calls to autoscrollLoop() before that notification arrives. If the timing is > particularly unfortunate, the resulting scroll offset update can arrive to > APZ after it has started the autoscroll animation, and cancel the animation. I discussed this with Kats and we came up with a plan for fixing this. Rather than having the main-thread autoscroll machinery assume that we're not doing APZ autoscrolling until APZ tells us we are, it will assume we *are* doing APZ autoscrolling until/unless it tells us it's not. Basically, instead of sending an "autoscroll-handled-by-apz" message if APZ successfully started autoscrolling, we'll send an "autoscroll-rejected-by-apz" message if it did not. The autoscroll can be rejected both by the parent process (by browser.xml / TabParent) or the compositor process (by APZCTreeManager), so we have to be careful to send the message in both cases. (In the parent process, we'll probably use the message manager, while in the compositor we'll use the existing GeckoCC path.) Then, the content process would only register autoscrollLoop() as an rAF callback in the first place, if it receives such a "rejected" message. As an optimization, any condition that would cause the autoscroll to be rejected that can be checked by the content proess (such as the autoscroll prefs) should be checked there (and then it can fall back to main-thread autoscrolling without waiting for a round-trip to the parent/compositor processes).
This ended up being a bit more straightforward than I initially thought, in two ways: - The Autoscroll:Start message that the content process sends the parent process, is synchronous, so the parent process can communicate a failure to give the autoscroll to APZ via the return value, rather than needing to use a separate message-manager message. - Since the content process doesn't start autoscrolling until it receives the response to this sync message anyways, it doesn't need to try to check any conditions (like the relevant prefs) itself - it can rely on the parent process checking them and factoring them into the response.
Comment on attachment 8919927 [details] Bug 1340415 - Ensure main-thread and async autoscrolling do not happen at the same time. https://reviewboard.mozilla.org/r/190872/#review196080 Looks good to me, just the one question below. ::: toolkit/content/browser-content.js:192 (Diff revision 1) > + // and we need to take over. > + Services.obs.addObserver(this, "autoscroll-rejected-by-apz"); So in the case where we don't have a separate compositor process, APZCTreeManager will dispatch the rejected message via the observer service synchronously before it returns from the StartAutoscroll function, is that correct? In which case this observer registration will be too late to catch the message. Although maybe the observer message will delay because of the IPC involved in sending to browser-content.js? I'm not sure, but it's something to verify. ::: toolkit/content/widgets/browser.xml:1104 (Diff revision 1) > this.updateBlockedPopups(); > break; > } > case "Autoscroll:Start": { > if (!this.autoscrollEnabled) { > - return false; > + return {enabled: false, usingApz: false}; For clarity I'd prefer calling this apzEnabled instead of just "enabled"
Attachment #8919927 - Flags: review?(bugmail) → review+
(In reply to Kartikaya Gupta (email:kats@mozilla.com) from comment #46) > So in the case where we don't have a separate compositor process, > APZCTreeManager will dispatch the rejected message via the observer service > synchronously before it returns from the StartAutoscroll function, is that > correct? In which case this observer registration will be too late to catch > the message. > > Although maybe the observer message will delay because of the IPC involved > in sending to browser-content.js? I'm not sure, but it's something to verify. Hmm, good point. I guess I figured that the newly schedules observer notification will not make it back to the content process before the response to the already-in-flight Autoscroll:Start message manager message, but I guess I have no grounds to rely on that. Now that I think about it, though, what do you think about having APZCTreeManager::StartAutoscroll() return "false" if GetTargetAPZC() didn't find an APZC? That way, in the synchronous case that information would get back to browser-content.js via usingApz=false in the Autoscroll:Start response. Since that makes the observer message redundant in the synchronous case, we could have APZCTreeManager send it only if XRE_IsGPUProcess() is true. > ::: toolkit/content/widgets/browser.xml:1104 > (Diff revision 1) > > this.updateBlockedPopups(); > > break; > > } > > case "Autoscroll:Start": { > > if (!this.autoscrollEnabled) { > > - return false; > > + return {enabled: false, usingApz: false}; > > For clarity I'd prefer calling this apzEnabled instead of just "enabled" The meaning of the first property is "will we autoscroll at all" (while the second is "will we (try to) use APZ"). I'm happy to rename it, but "apzEnabled" would be wrong.
(In reply to Botond Ballo [:botond] from comment #47) > Hmm, good point. I guess I figured that the newly schedules observer > notification will not make it back to the content process before the > response to the already-in-flight Autoscroll:Start message manager message, > but I guess I have no grounds to rely on that. > > Now that I think about it, though, what do you think about having > APZCTreeManager::StartAutoscroll() return "false" if GetTargetAPZC() didn't > find an APZC? That way, in the synchronous case that information would get > back to browser-content.js via usingApz=false in the Autoscroll:Start > response. Since that makes the observer message redundant in the synchronous > case, we could have APZCTreeManager send it only if XRE_IsGPUProcess() is > true. That sounds reasonable enough. > > ::: toolkit/content/widgets/browser.xml:1104 > The meaning of the first property is "will we autoscroll at all" (while the > second is "will we (try to) use APZ"). I'm happy to rename it, but > "apzEnabled" would be wrong. Oh, I somehow misread that. But yeah please rename it, maybe to autoscrollEnabled or something else that's appropriate.
Pushed by bballo@mozilla.com: https://hg.mozilla.org/integration/autoland/rev/d6703dd12065 Ensure main-thread and async autoscrolling do not happen at the same time. r=kats
Status: NEW → RESOLVED
Closed: 8 years ago
Resolution: --- → FIXED
Target Milestone: --- → mozilla58
Whiteboard: [stockwell needswork] → [stockwell fixed:product]
You need to log in before you can comment on or make changes to this bug.

Attachment

General

Created:
Updated:
Size: