Closed Bug 1241297 Opened 6 years ago Closed 5 years ago

Wpt5 fails on TC but not on Buildbot

Categories: Testing :: General, defect
Priority: Not set
Severity: normal

Tracking: e10s+, firefox46 fixed, firefox47 fixed
Status: RESOLVED FIXED
Target Milestone: mozilla47

People: (Reporter: armenzg, Assigned: armenzg)

References: (Blocks 1 open bug)

Attachments: (1 file)

The wpt-3 e10s job always times out on TC/docker no matter how much time we give it; it currently times out at 120 minutes [1][2]. Bumping the limit beyond 120 minutes might let it finish in time.

Buildbot appears to run that same chunk, but hidden [3].

jgraham, do you want us to try chunking this job further?
or try it on an m3.xlarge instance?

Eventually we want to disable the Buildbot jobs and use the ones running on TaskCluster.


[1]
https://public-artifacts.taskcluster.net/FVFCzYC3SCWrdYZ-D1suhA/1/public/logs/live_backing.log
[2]
https://treeherder.mozilla.org/#/jobs?repo=try&author=armenzg@mozilla.com&filter-searchStr=web-platform-tests%203%29&selectedJob=15573412
[3]
https://treeherder.mozilla.org/#/jobs?repo=mozilla-central&filter-searchStr=linux%20x64%20debug%20web-platform-tests-e10s&group_state=expanded
https://treeherder.mozilla.org/#/jobs?repo=mozilla-central&filter-searchStr=linux%20x64%20debug%20web-platform-tests-e10s&group_state=expanded&exclusion_profile=false
[4]
16:41:01     INFO - PROCESS | 1207 | JavaScript error: executormarionette.py, line 33: Error: Permission denied to access property "timeout"
16:41:05     INFO - TEST-UNEXPECTED-TIMEOUT | /html/dom/reflection-embedded.html | expected OK
[5]
17:56:16 CRITICAL - Loading initial page http://web-platform.test:8000/testharness_runner.html failed. Ensure that the there are no other programs bound to this port and that your firewall rules or network setup does not prevent access.
17:56:16 CRITICAL - Traceback (most recent call last):
17:56:16 CRITICAL -   File "/home/worker/workspace/build/tests/web-platform/harness/wptrunner/executors/executormarionette.py", line 124, in load_runner
17:56:16 CRITICAL -     self.marionette.navigate(url)
17:56:16 CRITICAL -   File "/home/worker/workspace/build/venv/local/lib/python2.7/site-packages/marionette_driver/marionette.py", line 1505, in navigate
17:56:16 CRITICAL -     self._send_message("get", {"url": url})
17:56:16 CRITICAL -   File "/home/worker/workspace/build/venv/local/lib/python2.7/site-packages/marionette_driver/decorators.py", line 36, in _
17:56:16 CRITICAL -     return func(*args, **kwargs)
17:56:16 CRITICAL -   File "/home/worker/workspace/build/venv/local/lib/python2.7/site-packages/marionette_driver/marionette.py", line 748, in _send_message
17:56:16 CRITICAL -     self._handle_error(err)
17:56:16 CRITICAL -   File "/home/worker/workspace/build/venv/local/lib/python2.7/site-packages/marionette_driver/marionette.py", line 809, in _handle_error
17:56:16 CRITICAL -     raise errors.lookup(error)(message, stacktrace=stacktrace)
17:56:16 CRITICAL - UnknownException: UnknownException: Error loading page
17:56:16 CRITICAL -
This changed in the last 2 weeks from <50 minutes to >90 minutes; we should identify the root cause of that.
There are generally unexplained problems with W3 on Linux; see https://bugzilla.mozilla.org/show_bug.cgi?id=1238435#c26 Let's revisit after that is fixed.
Depends on: 1238435
Judging by this range, W3 seems to have become much worse in runtime:
https://treeherder.mozilla.org/#/jobs?repo=mozilla-central&filter-searchStr=Linux%20x64%20debug%20W3C%20Web%20Platform%20Tests%20W3C%20Web%20Platform%20Tests%20W%283%29&group_state=expanded&fromchange=c33f30666b37&tochange=6020a4cb41a7

judging by the running jobs, I suspect our culprit lies in this set of changes:
https://hg.mozilla.org/mozilla-central/pushloghtml?changeset=359f86fecbc2

and we have new tests which showed up there:
https://hg.mozilla.org/mozilla-central/rev/31a86d5e5ffa

Working on filling in the gaps on inbound to prove that:
https://treeherder.mozilla.org/#/jobs?repo=mozilla-inbound&filter-str=web&filter-searchStr=Linux%20x64%20debug%20W3C%20Web%20Platform%20Tests%20W3C%20Web%20Platform%20Tests%20W%283%29&tochange=31a86d5e5ffa&fromchange=cd6b93ff2af7

If we added a handful of tests and our runtime went from ~45 to ~80 minutes, that is suspect: are these tests necessary, are there bugs, can we optimize, is this happening across all platforms, etc.?

At the very least we should chunk more, which we can do in TaskCluster land; I'm not sure about available builders in Buildbot land (linux64 is close to full).
Adding bkelly and yury since they landed/reviewed the new wpt tests. I know there are other issues with the stability of the wpt(3) tests; here I care about runtime :)
jmaher: Yeah, those tests are already implicated. But the mechanism is a mystery because they don't actually run in W3. Which suggests chunking changes. But why that would cause the browser to be unable to load pages I don't know. Possibly something is crashing the web server or similar, but I need to actually reproduce the issue to be sure.
(In reply to James Graham [:jgraham] from comment #5)
> jmaher: Yeah, those tests are already implicated. But the mechanism is a
> mystery because they don't actually run in W3. Which suggests chunking
> changes. But why that would cause the browser to be unable to load pages I
> don't know. Possibly something is crashing the web server or similar, but I
> need to actually reproduce the issue to be sure.

As noted at https://bugzilla.mozilla.org/show_bug.cgi?id=1238435#c11 , the reflection wpt tests are crashing on e10s. The rechunking of the tests moved the reflection tests into W3 on linux64 after the new tests were added. It's hard to tell in which chunk the reflection tests were located before, or in which chunks they are on other platforms (e.g. see the raw logs of W2 on Mac OS X: https://treeherder.mozilla.org/#/jobs?repo=mozilla-inbound&selectedJob=20210378). But the reflection tests definitely produce crashes on linux64 e10s due to DNS host name resolution.
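The chunk-shifting effect described above can be sketched as follows. This is a minimal illustration, not wptrunner's actual chunking algorithm, and the test names and chunk counts below are made up; it only shows that slicing a sorted test list into equal parts means newly landed tests can push existing tests into a different chunk.

```python
# Minimal sketch (NOT wptrunner's real algorithm) of why landing new
# tests can move existing tests between chunks: slicing a sorted list
# into equal parts shifts every boundary after an insertion point.

def chunk(tests, total_chunks, this_chunk):
    """Return the slice of `tests` assigned to chunk `this_chunk` (1-based)."""
    tests = sorted(tests)
    per_chunk = -(-len(tests) // total_chunks)  # ceiling division
    start = (this_chunk - 1) * per_chunk
    return tests[start:start + per_chunk]

def chunk_of(test, tests, total_chunks):
    """Find which chunk a given test lands in."""
    for n in range(1, total_chunks + 1):
        if test in chunk(tests, total_chunks, n):
            return n

before = ["t%02d" % i for i in range(24)]          # 24 hypothetical tests
after = sorted(before + ["t00b", "t05b", "t10b"])  # 3 new tests land

print(chunk_of("t21", before, 8))  # t21 sits in chunk 8 before the landing
print(chunk_of("t21", after, 8))   # afterwards it has moved to chunk 7
```

With 24 tests and 8 chunks each slice holds 3 tests; adding 3 tests bumps the slice size to 4 and every test after the insertion points gets reassigned, which matches the reflection tests suddenly showing up in W3.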
the runtime difference is in both e10s and non e10s.
Is disabling the test(s) causing the crash a possibility?
At least we would get results from other tests until the crash is ironed out.
(In reply to Armen Zambrano [:armenzg] - Engineering productivity from comment #8)
> Is disabling the test(s) causing the crash a possibility?
> At least we would get results from other tests until the crash is ironed out.

This solution was r- at https://bugzilla.mozilla.org/show_bug.cgi?id=1238435#c16
We run an entirely different set of tests now in chunk 3 (keep in mind this is debug chunk 3, where we have 8 chunks). I think this bug is not useful for anything other than splitting wpt into more chunks; the crashes/timeouts should be solved in bug 1238435.
Has anything changed?

It seems that going from 8 to 12 chunks has cleared this:
https://treeherder.mozilla.org/#/jobs?repo=try&author=armenzg@mozilla.com&filter-searchStr=web-platform-tests&group_state=expanded

It seems that the lengthy tests are now running across chunks 4, 5 & 6 (all between 40 & 50 minutes).
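The way a higher chunk count pulls the longest chunk down can be illustrated with a toy load-balancing sketch. This is not how the harness actually schedules tests, and the per-test runtimes below are made-up numbers, not measurements from these jobs.

```python
# Toy illustration (not the harness's real scheduler): greedily assign
# tests with known runtimes to the currently lightest chunk, then
# compare the longest chunk for 8 vs. 12 chunks.
import heapq

def longest_chunk(runtimes, total_chunks):
    """Greedy longest-processing-time assignment; returns the max chunk load."""
    heap = [(0, n) for n in range(total_chunks)]  # (load, chunk index)
    heapq.heapify(heap)
    for t in sorted(runtimes, reverse=True):
        load, n = heapq.heappop(heap)      # lightest chunk so far
        heapq.heappush(heap, (load + t, n))
    return max(load for load, _ in heap)

# Hypothetical per-test runtimes in minutes: a few long tests plus
# many short ones.
runtimes = [30, 25, 20, 15] + [5] * 40

print(longest_chunk(runtimes, 8))   # 40
print(longest_chunk(runtimes, 12))  # 30
```

The total work is fixed, so more chunks mainly helps when the short tests can be spread around the long ones; a single test longer than the per-chunk budget would still dominate no matter how many chunks we add.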
If chunking exposed this bug in the first place, it's not too surprising if rechunking hides it again.

I certainly don't object to increasing the number of chunks in general; do you want to do that on buildbot and TC?
I don't know if we will be able to do it for Buildbot, as the number of builders was already pretty high and close to the limit.

We can increase the TC chunking and at least have that visible.
Good luck with working around it by changing your chunk numbers: the bug 1242153 wpt update moved things around enough to get whatever two tests it is that don't like being in the same chunk into your chunk 4, both e10s and non-e10s, so I added them to the exclusion hiding Buildbot's e10s-3.
Assignee: nobody → armenzg
Summary: wpt-3 e10s always takes as long as the max runtime allows it to → Bump timeout for wpt tests (wpt4 times out)
Going back from 12 chunks to 8 chunks and increasing the timeout does not necessarily improve matters (AFAIK):
https://treeherder.mozilla.org/#/jobs?repo=try&revision=788f6950a3bd

I will have to see if the tests run on Buildbot or not and compare with the TC jobs.
Summary: Bump timeout for wpt tests (wpt4 times out) → Wpt5 fails on TC but not on Buildbot
Having the same chunks will hopefully make comparing Buildbot and TaskCluster easier (it might fix the problem).
Attachment #8715849 - Flags: review?(jmaher)
Comment on attachment 8715849 [details] [diff] [review]
wpt test jobs from 12 to 8 chunks to match Buildbot

Review of attachment 8715849 [details] [diff] [review]:
-----------------------------------------------------------------

I would like to go back to 12 chunks as soon as we are on taskcluster only.
Attachment #8715849 - Flags: review?(jmaher) → review+
https://hg.mozilla.org/integration/mozilla-inbound/rev/6391ae6db0ff732c8affd9143e8d080c22d1c4c6
Bug 1241297 - Bump timeout for TC Linux64 wpt tests and go from 12 chunks to 8 chunks. DONTBUILD. r=jmaher
https://hg.mozilla.org/mozilla-central/rev/6391ae6db0ff
Status: NEW → RESOLVED
Closed: 5 years ago
Resolution: --- → FIXED
Target Milestone: --- → mozilla47
Going back to 8 has cleared the issue.

(In reply to Joel Maher (:jmaher) from comment #22)
> I would like to go back to 12 chunks as soon as we are on taskcluster only.
Someone will have to work on bug 1238435 before we can go to any different chunking.