[meta] Investigate page load timeouts for Raptor tests (timed out loading test page)
Categories
(Testing :: Raptor, defect, P2)
Tracking
(Not tracked)
People
(Reporter: whimboo, Unassigned)
References
(Depends on 2 open bugs)
Details
(Keywords: meta)
A lot of tests are timing out in CI because the test page cannot be loaded. Here an example:
[task 2020-02-11T22:59:30.328Z] 22:59:10 ERROR - raptor-main Critical: TEST-UNEXPECTED-FAIL: test 'raptor-tp6m-microsoft-support-geckoview' timed out loading test page: https://support.microsoft.com/en-us pending metrics: fcp, fnb paint
[task 2020-02-11T22:59:30.328Z] 22:59:10 ERROR - Return code: 1
I assume those are all tests which get pages served via mitmproxy? I will add all the known bugs to the blocking field in a moment. Bebe, maybe you could have a look if that applies to mitmproxy? Maybe the new stats as added by Tarek recently could help us here with the investigation.
Reporter | ||
Updated•5 years ago
|
Reporter | ||
Updated•5 years ago
|
Reporter | ||
Updated•5 years ago
|
Reporter | ||
Updated•5 years ago
|
Reporter | ||
Updated•5 years ago
|
Reporter | ||
Comment 1•5 years ago
|
||
For the Microsoft case above this is the log file:
https://firefoxci.taskcluster-artifacts.net/N4nbFpROSyWKBATz2AMywg/0/public/test_info/mitmproxy.log
Here I can see some killed requests like:
server_playback: killed non-replay request https://web.vortex.data.microsoft.com/collect/v1/t.js?ver=%272.1%27&name=%27Ms.Webi.PageView%27&
I wonder if those also happen in passing tests, and if not that those could be the cause that the page doesn't finish loading.
Reporter | ||
Comment 2•5 years ago
|
||
Otherwise if the problem is related to the Raptor web extension it might be good to know if the timeouts also happen with browsertime. If not it would be questionable if we should spend time on investigating this problem, or moving faster to browsertime.
Reporter | ||
Comment 3•5 years ago
|
||
Bebe, could your patch on bug 1589141 also help here?
Comment 4•5 years ago
|
||
it will just confirm that the proxy is working at the beginning of the test
Reporter | ||
Comment 5•5 years ago
|
||
The page load timeout is added when the [control server receives a webext_raptor-page-timeout
message:
https://searchfox.org/mozilla-central/rev/96f1457323cc598a36f5701f8e67aedaf97acfcf/testing/raptor/raptor/control_server.py#207-209
Here an example:
[task 2020-02-07T20:57:16.612Z] 20:57:16 INFO - raptor-control-server Info: received webext_raptor-page-timeout: [u'raptor-tp6-imdb-firefox', u'https://www.imdb.com/title/tt0084967/?ref_=nv_sr_2', {u'fcp': False, u'hero': False, u'dcf': False, u'fnb paint': False, u'ttfi': False, u'load time': False}]
The manifest for imdb is:
https://searchfox.org/mozilla-central/rev/96f1457323cc598a36f5701f8e67aedaf97acfcf/testing/raptor/raptor/tests/tp6/desktop/raptor-tp6-3.ini#24
measure = fnbpaint, fcp, dcf, loadtime
So i assume we didn't hit any of those measure points? But why do we output an empty pending metrics list then?
raptor-main Critical: TEST-UNEXPECTED-FAIL: test 'raptor-tp6-imdb-firefox' timed out loading test page: https://www.imdb.com/title/tt0084967/?ref_=nv_sr_2 pending metrics:
I assume the problem is located here:
https://searchfox.org/mozilla-central/rev/96f1457323cc598a36f5701f8e67aedaf97acfcf/testing/raptor/raptor/results.py#69
And that this should be [key for key, value in pending_metrics.items() if not value]
?
Sparky, is that correct? If yes, I will file a new bug to get that part fixed first.
Reporter | ||
Comment 6•5 years ago
|
||
Also for the failing page load cycle the output looks like:
[task 2020-02-07T20:57:15.223Z] 20:57:15 INFO - PID 5884 | console.log: "[raptor-runnerjs] begin pagecycle 16"
[task 2020-02-07T20:57:15.223Z] 20:57:15 INFO - PID 5884 | console.log: "[raptor-runnerjs] posting to control server"
[task 2020-02-07T20:57:15.223Z] 20:57:15 INFO - PID 5884 | console.log: "[raptor-runnerjs] begin pagecycle 16"
[task 2020-02-07T20:57:16.419Z] 20:57:16 INFO - PID 5884 | console.log: "[raptor-runnerjs] update tab: 2"
[task 2020-02-07T20:57:16.419Z] 20:57:16 INFO - PID 5884 | console.log: "[raptor-runnerjs] posting to control server"
[task 2020-02-07T20:57:16.419Z] 20:57:16 INFO - PID 5884 | console.log: "[raptor-runnerjs] update tab: 2"
[task 2020-02-07T20:57:16.424Z] 20:57:16 INFO - PID 5884 | console.log: "[raptor-runnerjs] posting to control server"
[task 2020-02-07T20:57:16.429Z] 20:57:16 INFO - PID 5884 | console.log: "[raptor-runnerjs] closed tab 2"
[task 2020-02-07T20:57:16.429Z] 20:57:16 INFO - PID 5884 | console.log: "[raptor-runnerjs] pageload test finished"
[task 2020-02-07T20:57:16.430Z] 20:57:16 INFO - PID 5884 | console.log: "[raptor-runnerjs] posting to control server"
[task 2020-02-07T20:57:16.430Z] 20:57:16 INFO - PID 5884 | console.log: "[raptor-runnerjs] "
[task 2020-02-07T20:57:16.612Z] 20:57:16 INFO - raptor-control-server Info: received webext_status: results received
[task 2020-02-07T20:57:16.612Z] 20:57:16 INFO - raptor-control-server Info: received webext_raptor-page-timeout: [u'raptor-tp6-imdb-firefox', u'https://www.imdb.com/title/tt0084967/?ref_=nv_sr_2', {u'fcp': False, u'hero': False, u'dcf': False, u'fnb paint': False, u'ttfi': False, u'load time': False}]
A working case looks like:
[task 2020-02-07T20:56:28.715Z] 20:56:28 INFO - PID 5884 | console.log: "[raptor-runnerjs] begin pagecycle 15"
[task 2020-02-07T20:56:28.715Z] 20:56:28 INFO - PID 5884 | console.log: "[raptor-runnerjs] posting to control server"
[task 2020-02-07T20:56:28.715Z] 20:56:28 INFO - PID 5884 | console.log: "[raptor-runnerjs] begin pagecycle 15"
[task 2020-02-07T20:56:28.716Z] 20:56:28 INFO - raptor-control-server Info: received webext_status: begin pagecycle 15
[task 2020-02-07T20:56:28.718Z] 20:56:28 INFO - PID 5884 | console.log: "[raptor-runnerjs] post success"
[task 2020-02-07T20:56:30.844Z] 20:56:30 INFO - PID 5884 | console.log: "[raptor-runnerjs] update tab: 2"
[task 2020-02-07T20:56:30.844Z] 20:56:30 INFO - PID 5884 | console.log: "[raptor-runnerjs] posting to control server"
[task 2020-02-07T20:56:30.845Z] 20:56:30 INFO - PID 5884 | console.log: "[raptor-runnerjs] update tab: 2"
[task 2020-02-07T20:56:30.848Z] 20:56:30 INFO - raptor-control-server Info: received webext_status: update tab: 2
[task 2020-02-07T20:56:30.868Z] 20:56:30 INFO - PID 5884 | console.log: "[raptor-runnerjs] test tab updated: 2"
[task 2020-02-07T20:56:30.868Z] 20:56:30 INFO - PID 5884 | console.log: "[raptor-runnerjs] posting to control server"
[task 2020-02-07T20:56:30.868Z] 20:56:30 INFO - PID 5884 | console.log: "[raptor-runnerjs] test tab updated: 2"
[task 2020-02-07T20:56:30.869Z] 20:56:30 INFO - PID 5884 | console.log: "[raptor-runnerjs] post success"
[task 2020-02-07T20:56:30.869Z] 20:56:30 INFO - raptor-control-server Info: received webext_status: test tab updated: 2
[task 2020-02-07T20:56:30.871Z] 20:56:30 INFO - PID 5884 | console.log: "[raptor-runnerjs] post success"
[task 2020-02-07T20:56:34.982Z] 20:56:34 INFO - PID 5884 | console.log: "[raptor-pageloadjs] pageloadjs raptorContentHandler!"
So it looks like the control server hasn't received the webext status for begin pagecycle 16
. I would have to dig into those different raptor-runnerjs
log entries to understand what's going on here. It's irritating that we log some of those multiple times without further explanation.
Reporter | ||
Comment 7•5 years ago
|
||
Ok, so in runner.js we are creating a new XMLHTTPRequest
for each and every message to be send to the control server:
https://searchfox.org/mozilla-central/rev/96f1457323cc598a36f5701f8e67aedaf97acfcf/testing/raptor/webext/raptor/runner.js#612-641
Problem here is in line 629 where we only send the message if readyState
is 1. If it's not we won't send the message at all! This is clearly a dataloss situation here, which can cause the control server to stay in a wrong state. If it's actually causing the problem as covered by this bug I don't know yet. I will file a new bug for this particular issue and have to get it fixed sooner than later.
Comment 8•5 years ago
|
||
:whimboo, yup you found it. There's also this area which prevents pending metrics from being populated when the web-ext sends a message: https://searchfox.org/mozilla-central/source/testing/raptor/webext/raptor/runner.js#485
Then, there is also this area which appends an empty pending metrics section in none was sent: https://searchfox.org/mozilla-central/source/testing/raptor/raptor/control_server.py#204-207
So this would mean that all of the error messages we have had up to now were invalid, and that everything that isn't in the pending metrics: ...
list was missed. Or, in other words, it means that everything in the pending metrics list were fields that were gathered and not pending.
Reporter | ||
Comment 9•5 years ago
|
||
With the changes from bug 1617909 landed on central I will let it settle for a bit, and have a look at this bug again on Monday.
Reporter | ||
Updated•5 years ago
|
Reporter | ||
Comment 10•5 years ago
|
||
As discussed on bug 1558031 comment 16 we are going to remove the pending metrics from the actual failure message. But it would still be helpful to have all of the metrics listed as a log message before the failing line. I will do this update tomorrow.
Reporter | ||
Comment 11•5 years ago
•
|
||
Please note that for bug 1587640 comment 9 we failed to download the mitmproxy recording, but didn't bail out early enough. So a page load timeout happened.
Reporter | ||
Comment 12•5 years ago
|
||
Here is a recent example of a page load failure:
https://treeherder.mozilla.org/logviewer.html#/jobs?job_id=293210779&repo=mozilla-central&lineNumber=2625-2627
As it looks like the very first page load seems to not measure all the values, and we continue immediately with the next ones. Also the timing is strange. All page loads are done within a single second! Maybe we hit an error page instead? Do we already check for that?
It might be good to improve the failure message to include the cycle in which the measurement hasn't been taken.
Comment 13•5 years ago
|
||
(In reply to Henrik Skupin (:whimboo) [⌚️UTC+1] from comment #12)
Here is a recent example of a page load failure:
https://treeherder.mozilla.org/logviewer.html#/jobs?job_id=293210779&repo=mozilla-central&lineNumber=2625-2627As it looks like the very first page load seems to not measure all the values, and we continue immediately with the next ones. Also the timing is strange. All page loads are done within a single second! Maybe we hit an error page instead? Do we already check for that?
It might be good to improve the failure message to include the cycle in which the measurement hasn't been taken.
digging through the logs of the above error I see that we hit a "GDPR" notification page. Not the page we intended to test.
Also from the mitm log I find a 127.0.0.1:53061: ProtocolException("Error in HTTP connection: HttpException('Cannot assemble flow with missing content',)",)
error.
While testing mitm 5.0.2 these error where causing the pages to hang and not generate the page load event
Reporter | ||
Comment 14•5 years ago
|
||
For some tests I see the following Javascript error:
03-18 23:08:17.374 8127 8142 I Web Content: [JavaScript Error: "Cookie “__cfduid†has “sameSite†policy set to “lax†because it is missing a “sameSite†attribute, and “sameSite=lax†is the default value for this attribute." {file: "https://aframe.io/examples/showcase/animation/" line: 0}]
Maybe that is related to enabling cookie default behavior to Lax
instead of None
in Nightly (bug 1604212). Florin, could someone from you please push a try build with network.cookie.sameSite.laxByDefault
set to false
for Android G5?
Comment 15•5 years ago
|
||
I think this needs to go to the other Florin (aka Bebe). Moving ni.
Comment 16•5 years ago
|
||
Assigned to Marian
Comment 17•5 years ago
|
||
Henrik,
I did this push to try for network.cookie.sameSite.laxByDefault
https://treeherder.mozilla.org/#/jobs?repo=try&tier=1%2C2%2C3&revision=b173e3e4c535bc6d715349f7c43d8b5e2ed73f08&group_state=expanded
Please let me know if this works.
Thanks!
Reporter | ||
Comment 18•5 years ago
|
||
(In reply to Marian Raiciof [:marauder] from comment #17)
I did this push to try for network.cookie.sameSite.laxByDefault
https://treeherder.mozilla.org/#/jobs?repo=try&tier=1%2C2%2C3&revision=b173e3e4c535bc6d715349f7c43d8b5e2ed73f08&group_state=expanded
We didn't run too many jobs for this try build but there is no page load timeout visible at all. So I really wonder if the changes from bug 1604212 somewhat caused that. But then I would expect to have permanent failures and not intermittents for several page load tests.
Andrea, mind checking comment 14 and if that is something we should worry about? Note that all those pages have been recorded via mitmproxy so they can be run from localhost. Not sure if there might be something missing. If it's a red herring, we will check further. Thanks.
Comment 19•5 years ago
|
||
cookie sameSite=lax by default is going to be released soon-ish. Chrome has already enabled this feature.
The code itself doesn't introduce a performance regression but probably, websites behave differently with it.
Btw, the console log message is the correct one. We show that console message when a website sends a cookie without the sameSite attribute.
Reporter | ||
Comment 20•5 years ago
|
||
I noticed that even with a conditioned profile in use, which should be the default anyway, only a couple of files and the webext is pushed to the device's profile location. It means that it is a fresh profile, and hasn't been settled yet.
For conditioned profiles Raptor uses a post startup delay of 1s only, which means that the test starts immediately. Given that there are still background tasks going on, the first page load wouldn't reflect the real page load state as expected. Interestingly it's also always the first cycle which seem to miss various metrics. So it might explain the problems on the Android devices, especially the G5 which is slow.
I pushed a try build with the conditioned profiles turned off. Lets see what a post startup delay of 30s will change:
https://treeherder.mozilla.org/#/jobs?repo=try&revision=065e2014b787450d7b4d8b5784d3c82bab384f88
Reporter | ||
Comment 21•5 years ago
|
||
Note that my patches from bug 1627253 seem to help a lot! I haven't seen any page load timeout failure yet. Lets hope it will stay that way once it got landed!
Reporter | ||
Comment 22•5 years ago
|
||
The activation of the conditioned profile for desktop and geckoview_example via bug 1628982 seems to have made a huge improvement here. Way lesser page load timeouts are reported. Remaining ones we will have to check later this week.
Comment hidden (Intermittent Failures Robot) |
Updated•5 years ago
|
Comment hidden (Intermittent Failures Robot) |
Comment 25•2 months ago
|
||
closing this as these days if there is any intermittent pageload timeout issues we directly just work with the filed bug. This meta bug has not been used in some years. feel free to reopen if needed/you disagree
Description
•