1293090 - Marionette e10s is almost permaorange on OS X nightly

Reporter

Description

8 years ago

Nightly-only test failures are always a joy, since the only way you can tell a nightly test run from an opt test run is by guessing based on the time the run started relative to the times the opt and nightly builds finished, but...

https://treeherder.mozilla.org/#/jobs?repo=mozilla-central&noautoclassify&bugfiler&fromchange=6b65dd49d4f045c0a9753ce60bdb4b7b4aaedcf8&group_state=expanded&filter-searchStr=b37b720604651540cbac2070dff2e2e1ef027e75&tochange=d42aacfe34af25e2f5110e2ca3d24a210eabeb33

I believe that's 18 runs on two different nightlies with 1 successful run, and 22 runs on dep builds with 20 successful runs.
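To put those counts side by side, here is a rough sketch of the implied pass rates. The helper name `pass_rate` is hypothetical, and the run counts are only the estimates given in the description, not verified Treeherder data.

```python
def pass_rate(passed: int, total: int) -> float:
    """Return the fraction of runs that succeeded."""
    return passed / total


# Counts taken from the description above (estimates, not verified data).
nightly = pass_rate(1, 18)   # nightly builds: 1 green run out of 18
dep = pass_rate(20, 22)      # dep (on-push) builds: 20 green runs out of 22

print(f"nightly: {nightly:.1%} green, dep: {dep:.1%} green")
```

With these numbers the nightly runs are green roughly 6% of the time, against roughly 91% for dep builds.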

Henrik Skupin [:whimboo][⌚️UTC+2]

Comment 1

8 years ago

I wonder why this test has been retriggered that many times. Usually, even for a Nightly build, we should see both Mn and Mn-e10s only twice!

In general it looks like my changes on bug 1257476 have revealed a couple of unknown hangs in our Marionette tests, which happen more or less frequently and are hard to debug. For the profile_management case it's clear why when checking the timestamps:

05:44:18 INFO - TEST-START | test_profile_management.py TestLog.test_preferences_are_set
05:52:21 ERROR - TEST-UNEXPECTED-ERROR | test_profile_management.py TestLog.test_preferences_are_set | IOError: Process killed because the connection to Marionette server is lost. Check gecko.log for errors (Reason: Connection timed out after 360s)

Maybe I could try to run a lot of tests on my MBP against recent Nightly builds to check what's wrong.
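The gap between the TEST-START and TEST-UNEXPECTED-ERROR lines in this comment can be checked with a small sketch. The helper name `elapsed_seconds` is hypothetical, and it assumes each log line starts with an `HH:MM:SS` timestamp and that both lines fall on the same day.

```python
from datetime import datetime

# The two log lines quoted in the comment above.
LOG_LINES = [
    "05:44:18 INFO - TEST-START | test_profile_management.py "
    "TestLog.test_preferences_are_set",
    "05:52:21 ERROR - TEST-UNEXPECTED-ERROR | test_profile_management.py "
    "TestLog.test_preferences_are_set | IOError: Process killed because the "
    "connection to Marionette server is lost. Check gecko.log for errors "
    "(Reason: Connection timed out after 360s)",
]


def elapsed_seconds(start_line: str, end_line: str) -> int:
    """Seconds between the leading HH:MM:SS timestamps of two log lines."""
    fmt = "%H:%M:%S"
    start = datetime.strptime(start_line.split()[0], fmt)
    end = datetime.strptime(end_line.split()[0], fmt)
    return int((end - start).total_seconds())


gap = elapsed_seconds(LOG_LINES[0], LOG_LINES[1])
print(f"{gap} seconds between test start and the reported error")
```

For these two lines the gap is 483 seconds, a little over 8 minutes, which is longer than the 360s connection timeout named in the error message.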

Tracy Walker [:tracy]

Updated

8 years ago

Blocks: e10s-tests

tracking-e10s: --- → +

Henrik Skupin [:whimboo][⌚️UTC+2]

Comment 2

8 years ago

Currently I can see lots of crashes for some Mn-e10s jobs. Phil, maybe you can update the query for the last few days?

Flags: needinfo?(philringnalda)

Phil Ringnalda (:philor)

Reporter

Comment 3

8 years ago

https://treeherder.mozilla.org/#/jobs?repo=mozilla-central&fromchange=2c1f6bf57d21f2bb3dda621bfc21db36e6539fb0&filter-searchStr=b37b720604651540cbac2070dff2e2e1ef027e75&selectedJob=4638283

Green on Monday and Wednesday, failures on Tuesday, Thursday, Friday, Saturday, Sunday, Monday.

Flags: needinfo?(philringnalda)

Henrik Skupin [:whimboo][⌚️UTC+2]

Comment 4

8 years ago

Phil, I checked those jobs and all of them are related to crashes of Firefox. I searched Bugzilla for IPC-related crashes and found bug 1051567. I strongly believe that all those crashes are related to that. Should we file another bug for the specific test, to make starring on Treeherder easier? Or how is it usually done?

Flags: needinfo?(philringnalda)

Phil Ringnalda (:philor)

Reporter

Comment 5

8 years ago

Nothing will make it reasonable to star: "Absolutely any Marionette test will crash with a garbage signature" is only workable if you file a bug for every single testname.

But focusing on starring and on two-year-old crashes misses every bit of what makes this interesting enough that I bothered to file it: it only happens on nightly builds.

There are two acceptable differences between nightly builds and on-push builds: the update channel is 'nightly' rather than 'default', which is fine as long as nobody ever checks that and does foolish things like behaving differently for one rather than the other, and we pass a parameter in (at least) Google searches from the searchbar. Any difference other than those, either in the binary or in the behavior of tests, is a serious bug of a class that we spent a lot of effort to eliminate. Whether it's an infra bug causing us to build differently, code either building differently or behaving differently, or tests behaving differently, having a nightly-only test failure means that someone has done something very badly wrong.

Flags: needinfo?(philringnalda)

Henrik Skupin [:whimboo][⌚️UTC+2]

Comment 6

8 years ago

I don't actually know if there are some subtle differences in how we build nightlies on OS X. The only thing I can imagine is PGO builds, which we do not have for per-checkin builds. So do we support PGO on OS X at all? Maybe RelEng can shed some light here.

Besides that, there is also a new follow-up bug 1295272, filed after a crash fix in a similar area; see bug 1171307. Now we see hangs for drag and drop. That's what we might be facing, given that Marionette hangs for about 5 minutes. But when checking the logs I cannot find any drag and drop Marionette test that runs beforehand.

Flags: needinfo?(nthomas)

Flags: needinfo?(bhearsum)

Henrik Skupin [:whimboo][⌚️UTC+2]

Updated

8 years ago

Depends on: 1295492

bhearsum@mozilla.com (:bhearsum)

Comment 7

8 years ago

Sounds like a question for buildduty.

Flags: needinfo?(nthomas)

Flags: needinfo?(bhearsum)

Flags: needinfo?(aselagea)

Flags: needinfo?(aobreja)

Alin Selagea [:aselagea]

Comment 8

8 years ago

(In reply to Henrik Skupin (:whimboo) from comment #6)
> I don't actually know if there are some subtle differences in how we build
> nightlies on OS X. The only thing I could imagine are PGO builds which we do
> not have for per checkin builds. So do we support PGO on OS X at all? Maybe
> RelEng can shed some lights here.

No, we don't have PGO builds on OS X, see https://dxr.mozilla.org/build-central/source/buildbot-configs/mozilla/config.py#84

I also did a test on my master just to make sure that those builders are not among the available ones and they didn't show up.

Flags: needinfo?(aselagea)

Flags: needinfo?(aobreja)

Henrik Skupin [:whimboo][⌚️UTC+2]

Comment 9

8 years ago

Thank you, Alin, for this information. So I'm not sure why exactly Nightly builds are that affected. But as other investigation has shown, bug 1294456 is a highly likely candidate for the cause of all those IOErrors we see for the socket. Mike Conley will investigate this regression soon.

Depends on: 1294456

Henrik Skupin [:whimboo][⌚️UTC+2]

Comment 10

8 years ago

Phil, has the change to use large desktop-test instances changed anything for this bug?

Flags: needinfo?(philringnalda)

Phil Ringnalda (:philor)

Reporter

Comment 11

8 years ago

No.

Flags: needinfo?(philringnalda)

Henrik Skupin [:whimboo][⌚️UTC+2]

Comment 12

8 years ago

Well, that was a dumb question because we only changed that for Linux tasks in TC. So yes, please forget it.

Henrik Skupin [:whimboo][⌚️UTC+2]

Comment 13

8 years ago

I believe that with tomorrow's Nightly builds our situation should be way better. Phil, can you please re-check tomorrow? Thanks.

Depends on: 1051567

Flags: needinfo?(philringnalda)

Phil Ringnalda (:philor)

Reporter

Updated

8 years ago

Status: NEW → RESOLVED

Closed: 8 years ago

Flags: needinfo?(philringnalda)

Resolution: --- → WORKSFORME

Henrik Skupin [:whimboo][⌚️UTC+2]

Comment 14

8 years ago

Great to see. So bug 1051567 definitely fixed it then.

Resolution: WORKSFORME → FIXED

Whiteboard: [fixed by bug 1051567]

BMO Automation

Updated

2 years ago

Product: Testing → Remote Protocol

Categories

(Remote Protocol :: Marionette, defect)

Tracking

(e10s+)

People

(Reporter: philor, Unassigned)

References

(Blocks 1 open bug)

Details

(Whiteboard: [fixed by bug 1051567])

Crash Data

Security

(public)
