Closed Bug 1293090 Opened 3 years ago Closed 3 years ago

Marionette e10s is almost permaorange on OS X nightly

Categories

(Testing :: Marionette, defect)


Tracking

(e10s+)

RESOLVED FIXED

People

(Reporter: philor, Unassigned)

References

(Blocks 1 open bug)

Details

(Whiteboard: [fixed by bug 1051567])

Nightly-only test failures are always a joy, since the only way you can tell a nightly test run from an opt test run is by guessing based on the time the run started relative to the times the opt and nightly builds finished, but...

https://treeherder.mozilla.org/#/jobs?repo=mozilla-central&noautoclassify&bugfiler&fromchange=6b65dd49d4f045c0a9753ce60bdb4b7b4aaedcf8&group_state=expanded&filter-searchStr=b37b720604651540cbac2070dff2e2e1ef027e75&tochange=d42aacfe34af25e2f5110e2ca3d24a210eabeb33

I believe that's 18 runs on two different nightlies with 1 successful run, and 22 runs on dep builds with 20 successful runs.
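To put those counts side by side, here is a minimal sketch that turns them into failure rates. It takes the run and success numbers exactly as stated above; the helper name `failure_rate` is only illustrative.

```python
def failure_rate(runs: int, successes: int) -> float:
    """Fraction of runs that did not succeed."""
    return (runs - successes) / runs


# Counts as reported in the comment above.
nightly = failure_rate(18, 1)   # nightly builds: 18 runs, 1 success
dep = failure_rate(22, 20)      # dep builds: 22 runs, 20 successes

print(f"nightly: {nightly:.1%}, dep: {dep:.1%}")
# prints "nightly: 94.4%, dep: 9.1%"
```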
I wonder why this test has been retriggered that many times. Usually even in case of a Nightly build we should see both Mn and Mn-e10s only twice! 

In general it looks like my changes on bug 1257476 have revealed a couple of previously unknown hangs in our Marionette tests, which happen intermittently and are hard to debug. For the profile_management case it's clear to see why when checking the time information:

05:44:18     INFO -  TEST-START | test_profile_management.py TestLog.test_preferences_are_set
05:52:21    ERROR -  TEST-UNEXPECTED-ERROR | test_profile_management.py TestLog.test_preferences_are_set | IOError: Process killed because the connection to Marionette server is lost. Check gecko.log for errors (Reason: Connection timed out after 360s)
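As a rough check of the timing, the gap between the two log lines can be computed directly. This is a minimal sketch that assumes both timestamps fall on the same day; the helper name `elapsed_seconds` is hypothetical and not part of any Marionette or mozharness API.

```python
from datetime import datetime

TIME_FORMAT = "%H:%M:%S"


def elapsed_seconds(start: str, end: str) -> float:
    """Seconds between two HH:MM:SS log timestamps on the same day."""
    t0 = datetime.strptime(start, TIME_FORMAT)
    t1 = datetime.strptime(end, TIME_FORMAT)
    return (t1 - t0).total_seconds()


# Timestamps taken from the TEST-START and TEST-UNEXPECTED-ERROR lines above.
gap = elapsed_seconds("05:44:18", "05:52:21")
print(gap)        # 483.0
print(gap - 360)  # 123.0 seconds beyond the 360s connection timeout
```

So the test sat for about eight minutes before the error was reported, which is consistent with the 360s connection timeout having to expire first.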

Maybe I could try to run a lot of tests on my MBP for recent Nightly builds to check what's wrong.
Blocks: e10s-tests
tracking-e10s: --- → +
Currently I can see lots of crashes for some Mn-e10s jobs. Phil, maybe you can update the query for the last few days?
Flags: needinfo?(philringnalda)
https://treeherder.mozilla.org/#/jobs?repo=mozilla-central&fromchange=2c1f6bf57d21f2bb3dda621bfc21db36e6539fb0&filter-searchStr=b37b720604651540cbac2070dff2e2e1ef027e75&selectedJob=4638283

Green on Monday and Wednesday, failures on Tuesday, Thursday, Friday, Saturday, Sunday, Monday.
Flags: needinfo?(philringnalda)
Phil, I checked those jobs and all of them are related to crashes of Firefox. I searched Bugzilla for IPC-related crashes and found bug 1051567. I strongly believe that all those crashes are related to that.

Should we file another bug for the specific test, to make starring on Treeherder easier? Or how is it usually done?
Flags: needinfo?(philringnalda)
Nothing will make it reasonable to star: "Absolutely any Marionette test will crash with a garbage signature" is only workable if you file a bug for every single testname.

But focusing on starring and on two-year-old crashes misses every bit of what makes this interesting enough that I bothered to file it: it only happens on nightly builds.

There are two acceptable differences between nightly builds and on-push builds: the update channel is 'nightly' rather than 'default', which is fine as long as nobody ever checks that and does foolish things like behaving differently for one rather than the other, and we pass a parameter in (at least) Google searches from the searchbar.

Any difference other than those, either in the binary or in the behavior of tests, is a serious bug of a class that we spent a lot of effort to eliminate. Whether it's an infra bug causing us to build differently, code either building differently or behaving differently, or tests behaving differently, having a nightly-only test failure means that someone has done something very badly wrong.
Flags: needinfo?(philringnalda)
I don't actually know if there are some subtle differences in how we build nightlies on OS X. The only thing I could imagine is PGO builds, which we do not have for per-checkin builds. So do we support PGO on OS X at all? Maybe RelEng can shed some light here.

Besides that, there is also a new follow-up bug 1295272, filed after a crash fix in a similar area (see bug 1171307). Now we see hangs for drag and drop. That might be what we are facing, given that Marionette hangs for about 5 minutes. But when checking the logs I cannot find any drag-and-drop Marionette test that runs beforehand.
Flags: needinfo?(nthomas)
Flags: needinfo?(bhearsum)
Sounds like a question for buildduty.
Flags: needinfo?(nthomas)
Flags: needinfo?(bhearsum)
Flags: needinfo?(aselagea)
Flags: needinfo?(aobreja)
(In reply to Henrik Skupin (:whimboo) from comment #6)
> I don't actually know if there are some subtle differences in how we build
> nightlies on OS X. The only thing I could imagine are PGO builds which we do
> not have for per checkin builds. So do we support PGO on OS X at all? Maybe
> RelEng can shed some lights here. 

No, we don't have PGO builds on OS X, see https://dxr.mozilla.org/build-central/source/buildbot-configs/mozilla/config.py#84
I also did a test on my master just to make sure that those builders are not among the available ones and they didn't show up.
Flags: needinfo?(aselagea)
Flags: needinfo?(aobreja)
Thank you, Alin, for this information. So I'm not sure why exactly Nightly builds are that affected. But as other investigation has shown, bug 1294456 is a likely candidate for all those IOErrors we see for the socket. Mike Conley will investigate this regression soon.
Depends on: 1294456
Phil, has the change to use large desktop-test instances changed something for this bug?
Flags: needinfo?(philringnalda)
No.
Flags: needinfo?(philringnalda)
Well, that was a dumb question because we only changed that for Linux tasks in TC. So yes, please forget it.
I believe that with tomorrow's Nightly builds our situation should be way better. Phil, can you please re-check tomorrow? Thanks.
Depends on: 1051567
Flags: needinfo?(philringnalda)
Status: NEW → RESOLVED
Closed: 3 years ago
Flags: needinfo?(philringnalda)
Resolution: --- → WORKSFORME
Great to see. So bug 1051567 definitely fixed it then.
Resolution: WORKSFORME → FIXED
Whiteboard: [fixed by bug 1051567]