Closed Bug 1283947 Opened 5 years ago Closed 4 years ago

Intermittent docshell/test/navigation/test_sessionhistory.html | Test timed out.

Categories

(Core :: DOM: Navigation, defect, P3)

defect

Tracking

()

RESOLVED FIXED
mozilla55
Tracking Status
firefox52 --- wontfix
firefox-esr52 --- affected
firefox53 --- wontfix
firefox54 --- fixed
firefox55 --- fixed

People

(Reporter: intermittent-bug-filer, Assigned: freesamael)

Details

(Keywords: intermittent-failure, Whiteboard: [stockwell fixed])

Attachments

(3 files, 5 obsolete files)

This is easily hit on a Linux64 debug Try push with --disable-optimize, FWIW.
Bulk assigning P3 to all open intermittent bugs without a priority set in Firefox components per bug 1298978.
Priority: -- → P3
:smaug - This test has been failing intermittently for a long time. Can you make it more reliable? Or would you consider skipping it on some platforms? It looks like failures are all linux and all e10s, mostly debug.
Flags: needinfo?(bugs)
Attached patch guess fix (obsolete) — Splinter Review
Guess fix: since both hashchange event firing and animation frame callback firing are async, but don't depend on each others, try to enforce some particular order.
This is the place where the test timeouts.
Assignee: nobody → bugs
Flags: needinfo?(bugs)
Attachment #8830202 - Flags: review?(sawang)
(In reply to Olli Pettay [:smaug] from comment #23)
> Created attachment 8830202 [details] [diff] [review]
> guess fix
> 
> Guess fix: since both hashchange event firing and animation frame callback
> firing are async, but don't depend on each others, try to enforce some
> particular order.
> This is the place where the test timeouts.

I rr recorded a successful run and found in that recording nsGlobalWindow::FireHashchange also happens after nsIDocument::TakeFrameRequestCallbacks, so maybe the sequence doesn't really matter, I guess.

One thing interesting is that in all failure logs of comment 21 it reads
> INFO - Error: Unable to restore focus, expect failures and timeouts.

Could it be requestAnimationFrame taking too much time due to the whole browser window is somehow in the background?
Flags: needinfo?(bugs)
well those errors happen after the timeout.
Flags: needinfo?(bugs)
Comment on attachment 8830202 [details] [diff] [review]
guess fix

Review of attachment 8830202 [details] [diff] [review]:
-----------------------------------------------------------------

I can reproduce the timeout with mochitest --repeat 200 on my local Linux machine. I tried to reproduce the failure multiple times. With or without the patch, it always times out at around loop 100 ~ 150, so I think this patch doesn't fix the problem. (but I didn't try to reproduce it on try server)

Some observations here:

1. It does seem to be easier to reproduce the issue when compiling with --disable-optimize. If not building with --disable-optimize it can run up to 150 ~ 200 loops on my machine before first timeout happens.

2. The visible symptom is that the browser window of file_scrollRestoration.html becomes blank, nothing shows in that window. Resizing the window before test timeout fixes it -- the window recovers to show the page content and test continues.

3. I rr recorded the timeout with & without the patch. In both cases, there were some pending transactions (nsRefreshDriver::mPendingTransaction was 2 or 3 and nsRefreshDriver::mWaitingForTransaction was true) that never finish, so nsRefreshDriver::Tick always gets early return at [1] and the frame request callback never gets invoked. Through without sufficient knowledge I couldn't figure out what the root cause is...

BTW is there a faster way to rr replay to a given event? Replaying to the failure on my machine took 2.5 hrs...

[1] http://searchfox.org/mozilla-central/rev/e95e4ed8b5229a29dacc32c0b90968df5e7a6ff3/layout/base/nsRefreshDriver.cpp#1742
Attachment #8830202 - Flags: review?(sawang) → review-
Thanks for testing this, since I haven't managed to reproduce.

Sounds like the issue is somewhere in graphics.
Assignee: bugs → nobody
matt, does comment 27 ring any bells? Why would we waiting for DidComposite forever?
Flags: needinfo?(matt.woodrow)
Not really, I guess the compositor didn't send back the notification for some reason.

The rr recording is probably the best way to track this down.

Normally ShadowLayerForwarder::EndTransaction calls SendUpdate(NoSwap) with aScheduleComposite=true.

That calls through to LayerTransactionParent::RecvUpdate -> CrossProcessCompositorBridgeParent::ShadowLayersUpdated (for the content process) -> CompositorBridgeParent::NotifyShadowTreeTransaction and then that calls ScheduleComposition. That should result in a call to CompositorBridgeParent::CompositeToTarget and the DidComposite message being sent.


I'd suggest getting the rr replay to the nsRefreshDriver::GetTransactionId() call that increments mPendingTransaction to the value that it's stuck at, and then following through the code I described above to see which piece isn't happening.
Flags: needinfo?(matt.woodrow)
I may have found the reason. It seems the problematic sequence was 

1. The nsRefreshDriver(1) of "file_scrollRestoration.html" ticks, triggers SendUpdate(), and starts waiting for the transaction.

2. Test script sets window.location to "data:text/html,", causing nsRefreshDriver(1)::Freeze() as it's moving to bfcache.

3. The other nsRefreshDriver(2) of "data:text/html," becomes active. ClientLayerManager::mTransactionIdAllocator sets to nsRefreshDriver(2).

4. Chrome process notifies DidComposite, eventually calls nsRefreshDriver(2)::NotifyTransactionCompleted().

5. Test script performs history.back(). nsRefreshDriver(2) is frozen and nsRefreshDriver(1) becomes active again.

6. Test script registers a callback through requestAnimationFrame(). However the transaction nsRefreshDriver(1) is waiting has been delivered to nsRefreshDriver(2), so it waits forever and never invokes frame callbacks.
Attached patch v1 (obsolete) — Splinter Review
Comment on attachment 8833256 [details] [diff] [review]
v1

Hi Matt,

I'm super unfamiliar with gfx code. Does this look the right way to solve the problem?
Attachment #8833256 - Flags: review?(matt.woodrow)
Assignee: nobody → sawang
Comment on attachment 8833256 [details] [diff] [review]
v1

Review of attachment 8833256 [details] [diff] [review]:
-----------------------------------------------------------------

Can we instead clear the waiting for transaction state (set mCompletedTransaction to mPendingTransaction, and set mWaitingForTransaction to false) when we assign a different ClientLayerManager to the refresh driver?
Attachment #8833256 - Flags: review?(matt.woodrow)
Attachment #8830202 - Attachment is obsolete: true
Attachment #8833256 - Attachment is obsolete: true
Attached patch v2 (obsolete) — Splinter Review
Update to address review comment.
Comment on attachment 8836579 [details] [diff] [review]
v2

Hi Matt,

Would you take another review?
Attachment #8836579 - Flags: review?(matt.woodrow)
Attachment #8836579 - Flags: review?(matt.woodrow) → review+
Pushed by ihsiao@mozilla.com:
https://hg.mozilla.org/integration/mozilla-inbound/rev/69bc1e04bb2c
Clear pending transactions when ClientLayerManager is assigning to a different refresh driver. r=mattwoodrow
Keywords: checkin-needed
Backed out for build bustage:

https://hg.mozilla.org/integration/mozilla-inbound/rev/79cea5a0238246a9dfe4ed8cc57161b09376998d

Push with failures: https://treeherder.mozilla.org/#/jobs?repo=mozilla-inbound&revision=4c483e15fb91e4aaa6fb99e548d48516de194a78
Failure log: https://treeherder.mozilla.org/logviewer.html#?job_id=77520366&repo=mozilla-inbound

[task 2017-02-15T09:51:29.215067Z] 09:51:29     INFO -  /home/worker/workspace/build/src/obj-firefox/_virtualenv/bin/python /home/worker/workspace/build/src/config/expandlibs_gen.py -o libdom_file.a.desc Unified_cpp_dom_file0.o
[task 2017-02-15T09:51:29.388317Z] 09:51:29     INFO -  In file included from /home/worker/workspace/build/src/obj-firefox/gfx/layers/Unified_cpp_gfx_layers4.cpp:38:0:
[task 2017-02-15T09:51:29.388767Z] 09:51:29     INFO -  /home/worker/workspace/build/src/gfx/layers/client/ClientLayerManager.cpp:513:1: error: redefinition of 'void mozilla::layers::ClientLayerManager::SetTransactionIdAllocator(mozilla::layers::TransactionIdAllocator*)'
[task 2017-02-15T09:51:29.389094Z] 09:51:29     INFO -   ClientLayerManager::SetTransactionIdAllocator(TransactionIdAllocator* aAllocator)
[task 2017-02-15T09:51:29.389133Z] 09:51:29     INFO -   ^
[task 2017-02-15T09:51:29.389279Z] 09:51:29     INFO -  In file included from /home/worker/workspace/build/src/gfx/layers/client/ClientCanvasLayer.h:10:0,
[task 2017-02-15T09:51:29.389533Z] 09:51:29     INFO -                   from /home/worker/workspace/build/src/gfx/layers/client/ClientCanvasLayer.cpp:6,
[task 2017-02-15T09:51:29.389607Z] 09:51:29     INFO -                   from /home/worker/workspace/build/src/obj-firefox/gfx/layers/Unified_cpp_gfx_layers4.cpp:2:
[task 2017-02-15T09:51:29.389996Z] 09:51:29     INFO -  /home/worker/workspace/build/src/gfx/layers/client/ClientLayerManager.h:223:16: note: 'virtual void mozilla::layers::ClientLayerManager::SetTransactionIdAllocator(mozilla::layers::TransactionIdAllocator*)' previously defined here
[task 2017-02-15T09:51:29.390131Z] 09:51:29     INFO -     virtual void SetTransactionIdAllocator(TransactionIdAllocator* aAllocator) override
[task 2017-02-15T09:51:29.390332Z] 09:51:29     INFO -                  ^
[task 2017-02-15T09:51:29.390402Z] 09:51:29     INFO -  /home/worker/workspace/build/src/config/rules.mk:1016: recipe for target 'Unified_cpp_gfx_layers4.o' failed
[task 2017-02-15T09:51:29.390551Z] 09:51:29     INFO -  gmake[5]: *** [Unified_cpp_gfx_layers4.o] Error 1
Flags: needinfo?(sawang)
Attached patch v3 (obsolete) — Splinter Review
Made a silly mistake on rebasing before...
Attachment #8836579 - Attachment is obsolete: true
Flags: needinfo?(sawang)
Attached patch v4 (obsolete) — Splinter Review
rebase for landing
Attachment #8837537 - Attachment is obsolete: true
Pushed by ryanvm@gmail.com:
https://hg.mozilla.org/integration/mozilla-inbound/rev/3a8c78a6146f
Clear pending transactions when ClientLayerManager is assigning to a different refresh driver. r=mattwoodrow
Keywords: checkin-needed
Backed out for failing browser_animation_pseudo_elements.js on Linux opt without e10s:

https://hg.mozilla.org/integration/mozilla-inbound/rev/cc5a5a7e5331255b5dda147e09ea4d85a9bd8d56

Push with failures: https://treeherder.mozilla.org/#/jobs?repo=mozilla-inbound&revision=3a8c78a6146fa9fb3fa9f73d0547b6520e26a470&filter-resultStatus=testfailed&filter-resultStatus=busted&filter-resultStatus=exception&filter-resultStatus=retry&filter-resultStatus=usercancel&filter-resultStatus=runnable
Failure log: https://treeherder.mozilla.org/logviewer.html#?job_id=78876097&repo=mozilla-inbound

[task 2017-02-20T20:18:39.010986Z] 20:18:39     INFO - TEST-START | devtools/client/animationinspector/test/browser_animation_pseudo_elements.js
[task 2017-02-20T20:19:34.590530Z] 20:19:34     INFO - TEST-INFO | started process screentopng
[task 2017-02-20T20:19:35.113422Z] 20:19:35     INFO - TEST-INFO | screentopng: exit 0
[task 2017-02-20T20:19:35.113547Z] 20:19:35     INFO - Buffered messages logged at 20:18:39
[task 2017-02-20T20:19:35.114394Z] 20:19:35     INFO - Entering test bound 
[task 2017-02-20T20:19:35.115954Z] 20:19:35     INFO - Adding a new tab with URL: http://example.com/browser/devtools/client/animationinspector/test/doc_pseudo_elements.html
[task 2017-02-20T20:19:35.118351Z] 20:19:35     INFO - Tab added and finished loading
[task 2017-02-20T20:19:35.119839Z] 20:19:35     INFO - Loading the helper frame script chrome://mochitests/content/browser/devtools/client/animationinspector/test/doc_frame_script.js
[task 2017-02-20T20:19:35.121185Z] 20:19:35     INFO - Loading the helper frame script chrome://devtools/content/shared/frame-script-utils.js
[task 2017-02-20T20:19:35.122534Z] 20:19:35     INFO - Opening the inspector
[task 2017-02-20T20:19:35.123684Z] 20:19:35     INFO - Opening the toolbox
[task 2017-02-20T20:19:35.124880Z] 20:19:35     INFO - Buffered messages finished
[task 2017-02-20T20:19:35.126605Z] 20:19:35     INFO - TEST-UNEXPECTED-FAIL | devtools/client/animationinspector/test/browser_animation_pseudo_elements.js | Test timed out - 
[task 2017-02-20T20:19:35.127270Z] 20:19:35     INFO - Removing tab.
[task 2017-02-20T20:19:35.128132Z] 20:19:35     INFO - Waiting for event: 'TabClose' on [object XULElement].
[task 2017-02-20T20:19:35.130505Z] 20:19:35     INFO - Got event: 'TabClose' on [object XULElement].
[task 2017-02-20T20:19:35.131219Z] 20:19:35     INFO - Tab removed and finished closing
Flags: needinfo?(sawang)
Whiteboard: [stockwell unknown]
Attachment #8833256 - Attachment description: Clear mWaitingForTransaction on Thaw → v1
Attachment #8836579 - Attachment description: Clear pending transactions when ClientLayerManager is assigning to a different refresh driver → v2
Attachment #8837537 - Attachment description: Clear pending transactions when ClientLayerManager is assigning to a different refresh driver → v3
Attachment #8839065 - Attachment description: Clear pending transactions when ClientLayerManager is assigning to a different refresh driver → v4
Attachment #8839065 - Attachment is patch: false
Attachment #8839065 - Attachment is obsolete: true
Attachment #8839065 - Attachment is patch: true
Comment on attachment 8845335 [details] [diff] [review]
Clear pending transactions when ClientLayerManager is assigning to a different refresh driver

Looks I forgot to check aAllocator != mTransactionIdAllocator.

In the meantime, given that we already know DidComposite could happen after ClientLayerManager has switched to another refresh driver, I'm thinking we should also reset the transaction id of newly adopted refresh driver to a higher value so it doesn't get confused. Would like to know your opinion on this part. Does it sound reasonable to you?
Flags: needinfo?(sawang)
Attachment #8845335 - Flags: review?(matt.woodrow)
Comment on attachment 8845335 [details] [diff] [review]
Clear pending transactions when ClientLayerManager is assigning to a different refresh driver

Review of attachment 8845335 [details] [diff] [review]:
-----------------------------------------------------------------

Yeah, sounds like a good idea.
Attachment #8845335 - Flags: review?(matt.woodrow) → review+
Pushed by cbook@mozilla.com:
https://hg.mozilla.org/integration/mozilla-inbound/rev/0fa3d7b36a7b
Clear pending transactions when ClientLayerManager is assigning to a different refresh driver. r=mattwoodrow
Keywords: checkin-needed
Whiteboard: [stockwell unknown] → [stockwell fixed]
https://hg.mozilla.org/mozilla-central/rev/0fa3d7b36a7b
Status: NEW → RESOLVED
Closed: 4 years ago
Resolution: --- → FIXED
Target Milestone: --- → mozilla55
No occurrences in the last week. Please request Aurora/Beta/ESR52 approval on this when you get a chance.
Flags: needinfo?(sawang)
Attached patch patch for auroraSplinter Review
rebase on aurora
Attached patch patch for betaSplinter Review
rebase on beta
Attachment #8850863 - Attachment description: Clear pending transactions when ClientLayerManager is assigning to a different refresh driver → patch for aurora
Attachment #8850865 - Attachment description: Clear pending transactions when ClientLayerManager is assigning to a different refresh driver → patch for beta
Comment on attachment 8850863 [details] [diff] [review]
patch for aurora

Approval Request Comment

[Feature/Bug causing the regression]: I think it's since we introduced e10s.

[User impact if declined]: In very rare case history navigation could result in a blank page. It's recoverable if a subsequent composite request happens, such as resizing, minimizing the window and back, or just refreshing the page.

[Is this code covered by automated tests?]: Yes. It fixes automation timeout.

[Has the fix been verified in Nightly?]: Yes.

[Needs manual test from QE? If yes, steps to reproduce]: No.

[List of other uplifts needed for the feature/fix]: None.

[Is the change risky?]: Slightly.

[Why is the change risky/not risky?]: It's just a small change but still changes to graphics / refresh driver could be relatively risky to cause regression.

[String changes made/needed]: None.
Attachment #8850863 - Flags: approval-mozilla-aurora?
Comment on attachment 8850865 [details] [diff] [review]
patch for beta

Approval Request Comment

[Feature/Bug causing the regression]: I think it's since we introduced e10s.

[User impact if declined]: In very rare case history navigation could result in a blank page. It's recoverable if a subsequent composite request happens, such as resizing, minimizing the window and back, or just refreshing the page.

[Is this code covered by automated tests?]: Yes. It fixes automation timeout.

[Has the fix been verified in Nightly?]: Yes.

[Needs manual test from QE? If yes, steps to reproduce]: No.

[List of other uplifts needed for the feature/fix]: None.

[Is the change risky?]: Slightly.

[Why is the change risky/not risky?]: It's just a small change but still changes to graphics / refresh driver could be relatively risky to cause regression.

[String changes made/needed]: None.
Attachment #8850865 - Flags: approval-mozilla-beta?
Comment on attachment 8850863 [details] [diff] [review]
patch for aurora

Fix an intermittent-failure. Let's take it in aurora first. Aurora54+.
Attachment #8850863 - Flags: approval-mozilla-aurora? → approval-mozilla-aurora+
Comment on attachment 8850865 [details] [diff] [review]
patch for beta

I'd like to let this ride with aurora 54 since we are getting near to the end of the beta cycle. Nice to see this long standing test failure fixed!
Attachment #8850865 - Flags: approval-mozilla-beta? → approval-mozilla-beta-
You need to log in before you can comment on or make changes to this bug.