Closed Bug 1369083 Opened 4 years ago Closed 4 years ago

Android Debug marionette-4 job runs for too long

Categories

(Testing :: Marionette, defect)

defect
Not set
normal

Tracking

(Not tracked)

RESOLVED WORKSFORME

People

(Reporter: gbrown, Assigned: gbrown)

Details

(Keywords: regression)

Bug 1204281 records "Task timeout after 3600 seconds" failures; recently, there are intermittent failures of that type for test-android-4.3-arm7-api-15/debug-marionette-4 jobs.

The run-time of that job increased substantially with bug 1368101:

https://treeherder.mozilla.org/#/jobs?repo=autoland&filter-searchStr=android%20marionette&tochange=7b9687c90aea55f7893ecfb0ccd5f0c954e36eb0&fromchange=740d674779eb4dada7c7f47ef03fb3aaaa65d212

Before bug 1368101, Android Debug marionette-4 jobs completed in about 35 minutes; afterwards, they required 60 minutes to complete.
:whimboo - Do you know why the job run-time increased so much? What do you want to do to avoid timeouts?
Flags: needinfo?(hskupin)
Blocks: 1204281
The bad thing for Marionette tests on Android is that there is no gecko.log provided. :( As such, it is very hard to say anything about what's going on here. But from the standard log I can see that in both cases the job gets stalled in the following test:

test_window_handles_content.py TestWindowHandles.test_window_handles_after_opening_new_tab

Besides the bug you mentioned above, I also landed a change to those tests via bug 1368526. So the default wait timeout for page_load is set to 300s. Does the task timeout mean that this is for the whole job? I could imagine that we accumulate long delays in any of those tests and finally get killed.

The chance to hit this intermittent failure seems to be kinda low. Besides those two jobs, I cannot see another one with this type of failure.
Blocks: 1368526
Flags: needinfo?(hskupin)
Keywords: regression
(In reply to Henrik Skupin (:whimboo) from comment #2)
> Does
> the task timeout mean that this is for the whole job? I could imagine that
> we accumulate long delays in any of those tests and finally get killed.

Yes, the "Task timeout after 3600 seconds" is for the whole job. I think it happens specifically when the mozharness job run from taskcluster does not complete within 3600 seconds.
 
> The chance to hit this intermittent failure seems to be kinda low. Beside
> those two jobs I cannot see another one with this type of failure.

From https://brasstacks.mozilla.com/orangefactor/?display=Bug&bugid=1204281&endday=2017-05-31&startday=2017-05-30&tree=trunk, I believe there are 8 of these failures on May 30, which is concerning...but none so far on May 31 - encouraging!

If I browse current Android Debug marionette jobs on mozilla-central and look at the "Duration" reported by treeherder, chunk 4 seems to be running in 20 to 30 minutes again, but chunk 3 is now running closer to 55 minutes -- no errors, but not much room for change.
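
As a rough back-of-the-envelope check of how little room that leaves, here is a small Python sketch. It only combines numbers already mentioned in this bug (the 3600 second task timeout, the 300s page_load timeout, and the observed durations). The function name is made up for illustration, and it assumes, as a simplification, that every stalled page load costs the full timeout on top of the job's normal run time.

```python
# Figures taken from this bug.
TASK_TIMEOUT_S = 3600        # "Task timeout after 3600 seconds"
PAGE_LOAD_TIMEOUT_S = 300    # default page_load wait timeout mentioned above


def full_timeouts_in_headroom(normal_runtime_min):
    """Return how many full page_load timeouts fit into the remaining budget.

    Simplifying assumption: each stalled page load adds exactly
    PAGE_LOAD_TIMEOUT_S seconds to the job's normal run time.
    """
    headroom_s = TASK_TIMEOUT_S - normal_runtime_min * 60
    if headroom_s <= 0:
        return 0
    return int(headroom_s // PAGE_LOAD_TIMEOUT_S)


# Durations (in minutes) as reported in this bug.
for label, minutes in [("Mn4 before bug 1368101", 35), ("Mn3 now", 55)]:
    count = full_timeouts_in_headroom(minutes)
    print(f"{label}: {minutes} min, headroom fits {count} full page_load timeout(s)")
```

Under that assumption, a job that normally takes 35 minutes reaches the task limit after five stalled page loads, while a 55 minute job reaches it after a single one.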
Due to a Taskcluster issue with creating one-click loaners, I'm not able to check that live. Jonas put a fix in place, which will allow me to get a loaner tomorrow.
(In reply to Geoff Brown [:gbrown] from comment #3)
> If I browse current Android Debug marionette jobs on mozilla-central and
> look at the "Duration" reported by treeherder, chunk 4 seems to be running
> in 20 to 30 minutes again, but chunk 3 is now running closer to 55 minutes
> -- no errors, but not much room for change.

The chunk selection implementation is pretty bad. I would assume that some tests might have been executed in chunk 3 instead of 4, and as such chunk 4 doesn't fail.
Maybe this is somewhat related to bug 1368787. In some of the listed test jobs on OF I can see a lot of MessageChannel errors like `SendAccumulateChildKeyedHistograms`. Here is a non-Marionette test job:

https://treeherder.mozilla.org/logviewer.html#?repo=mozilla-inbound&job_id=103298490&lineNumber=42249

Not sure if this might affect non-e10s builds/tests like on Android.
(In reply to Henrik Skupin (:whimboo) from comment #6)
> Maybe this is somewhat related to bug 1368787. In some of the listed test
> jobs on OF I can see a lot of MessageChannel errors like
> `SendAccumulateChildKeyedHistograms`. Here a non Marionette test job:
> 
> https://treeherder.mozilla.org/logviewer.html#?repo=mozilla-inbound&job_id=103298490&lineNumber=42249

That failure is in mochitest-media-e10s and the MessageChannel errors are during shutdown, which is taking forever. That's a pattern I am aware of -- bug 1339568 -- but as far as I know, it only affects linux mochitest-media-e10s jobs and only affects shutdown.
Ok, so that should be unrelated for Android then, because on that platform none of our restart tests are getting run.

Given that the frequency of this failure is so low at the moment, I don't think it makes sense to dig into it. I can do so if it becomes more prominent.
The reason why we do not have the detailed gecko.log files could be that this task gets killed, so the artifacts that are usually created are not uploaded. This is sad, because it would definitely help us here. :/

Here is an example of another job which wasn't killed:
https://treeherder.mozilla.org/logviewer.html#?repo=autoland&job_id=103286935&lineNumber=2139
Also, the problem with Mn3 is that it contains all navigation tests, which take a long time, and all of them are in the same file. Chunking only picks whole files. So we would have to speed up the tests, or split the file into one or two more files.
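
To illustrate why a single long file limits what chunking can do, here is a minimal Python sketch. It is not the harness's actual chunk selection code: the greedy "longest file into the lightest chunk" strategy, the file names, and the run times are all hypothetical and only chosen to show the effect of assigning whole files.

```python
def chunk_whole_files(runtimes_min, total_chunks):
    """Return per-chunk durations when files can only be assigned whole.

    Greedy stand-in (NOT the harness's real algorithm): take files longest
    first and put each one into the chunk with the smallest total so far.
    """
    loads = [0.0] * total_chunks
    for minutes in sorted(runtimes_min.values(), reverse=True):
        lightest = loads.index(min(loads))
        loads[lightest] += minutes
    return loads


# Hypothetical file names and run times in minutes, for illustration only.
one_big_file = {
    "test_navigation.py": 40,
    "test_a.py": 6,
    "test_b.py": 6,
    "test_c.py": 6,
    "test_d.py": 6,
}

# Same total work, but the long file is split into two halves.
split_file = dict(one_big_file)
del split_file["test_navigation.py"]
split_file.update({"test_navigation_1.py": 20, "test_navigation_2.py": 20})

print(max(chunk_whole_files(one_big_file, 4)))  # 40.0
print(max(chunk_whole_files(split_file, 4)))    # 20.0
```

With these made-up numbers the slowest chunk can never be shorter than the longest file, so only splitting the file (or making its tests faster) reduces the worst-case chunk duration.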
Android Mn4 is running fine lately, often completing in under 30 minutes.
Assignee: nobody → gbrown
Status: NEW → RESOLVED
Closed: 4 years ago
Resolution: --- → WORKSFORME
Blocks: 1411358
No longer blocks: 1411358