Closed Bug 1239552 Opened 4 years ago Closed 4 years ago

Listener IPC message handlers are added twice for Marionette, causing test_selected.py (and others) to fail on e10s

Categories

(Testing :: Marionette, defect, P1, critical)

Version 3
defect

Tracking

(e10s+, firefox45+ disabled, firefox46-, firefox47 wontfix, firefox48 wontfix, firefox49 fixed, firefox50 fixed, firefox51 fixed)

RESOLVED FIXED
mozilla51
Tracking Status
e10s + ---
firefox45 + disabled
firefox46 - ---
firefox47 --- wontfix
firefox48 --- wontfix
firefox49 --- fixed
firefox50 --- fixed
firefox51 --- fixed

People

(Reporter: RyanVM, Assigned: ato)

References

(Blocks 1 open bug)

Details

Attachments

(2 files)

[Tracking Requested - why for this release]: Merge day test bustage

This test is currently running green on Mn-e10s on Beta44, so something in 45 appears to have regressed this test. Not sure if the subsequent failures are fallout from the first one or something else.

https://treeherder.mozilla.org/logviewer.html#?job_id=15418548&repo=try

TEST-START | test_selected.py TestSelected.test_selected
TEST-UNEXPECTED-FAIL | test_selected.py TestSelected.test_selected | AssertionError: False is not true
Traceback (most recent call last):
  File "/builds/slave/test/build/venv/local/lib/python2.7/site-packages/marionette/marionette_test.py", line 344, in run
    testMethod()
  File "/builds/slave/test/build/tests/marionette/tests/testing/marionette/client/marionette/tests/unit/test_selected.py", line 15, in test_selected
    self.assertTrue(box.is_selected())
TEST-INFO took 724ms
Flags: needinfo?(dburns)
Tracking for 45+ since we want these tests to work and merge day is coming up in a week. 
Since we're now running experiments in beta the tests should probably be working on mozilla-beta as well.
I tried to disable this for my latest Try push, but apparently Marionette doesn't understand |skip-if = e10s| ?
(In reply to Ryan VanderMeulen [:RyanVM] from comment #2)
> I tried to disable this for my latest Try push, but apparently Marionette
> doesn't understand |skip-if = e10s| ?

I’m not sure precisely what code is responsible for implementing the skip-if directive, but I don’t think the Marionette harness does.

Instead, the tests in testing/marionette/client/marionette/tests/unit use the decorator skip_if_e10s from marionette.marionette_test.  An example can be seen here in https://dxr.mozilla.org/mozilla-central/source/testing/marionette/client/marionette/tests/unit/test_modal_dialogs.py?from=test_modal_dialogs.py#157.
I also don’t quite understand why test_selected.py is failing on e10s as it’s passing in CI.
From the looks of https://treeherder.mozilla.org/logviewer.html#?job_id=15681099&repo=try, we'll need to disable some others too.

(In reply to Andreas Tolfsen (:ato) from comment #4)
> I also don’t quite understand why test_selected.py is failing on e10s as
> it’s passing in CI.

The code we run changes depending on which branch we're on. i.e. NIGHTLY_BUILD vs. RELEASE_BUILD ifdefs can lead to behavior changes that appear on uplift. The Try push above simulates those changes on top of the current Aurora repo so we can anticipate them ahead of time and avoid multi-day tree closures after merge day (which has been the case for the last two).
Flags: needinfo?(ato)
19:34:38    ERROR -  TEST-UNEXPECTED-ERROR | test_emulator.py TestEmulatorCallbacks.test_emulator_cmd_content | MarionetteException: MarionetteException: Emulator callback still pending when finish() called

Can someone please explain to me how this test is even running? IIUC, it shouldn't be running on desktop builds since it's annotated "qemu = true". Though I also see it running (and passing) in non-e10s runs, so ¯\_(ツ)_/¯.
This is heavy-handed, but it gets Mn-e10s to green:
https://hg.mozilla.org/try/rev/0468b63cf2b9
(In reply to Ryan VanderMeulen [:RyanVM] from comment #6)
> 19:34:38    ERROR -  TEST-UNEXPECTED-ERROR | test_emulator.py
> TestEmulatorCallbacks.test_emulator_cmd_content | MarionetteException:
> MarionetteException: Emulator callback still pending when finish() called
> 
> Can someone please explain to me how this test is even running? IIUC, it
> shouldn't be running on desktop builds since it's annotated "qemu = true".
> Though I also see it running (and passing) in non-e10s runs, so ¯\_(ツ)_/¯.

If you look at the test code, it uses mocks to test the emulator interfaces.  It does not actually communicate with an emulator: https://dxr.mozilla.org/mozilla-central/source/testing/marionette/client/marionette/tests/unit/test_emulator.py#44
Flags: needinfo?(ato)
Any updates here?
jimm or blake, can you help us resolve this ? I'm not sure who should own this bug. Thanks.
Flags: needinfo?(mrbkap)
Flags: needinfo?(jmathies)
I don't know much about these tests. Do they run in the content process or do they use CPOWS?

Best guess is that the test harness helper to click the checkbox fails, or maybe it's not sync such that the checkbox isn't checked by the time the check routine executes. (Although I wonder how that check occurs, is it a CPOW or a ContentTask behind the scene?)

I have to defer to blake or someone who knows how we e10sified our marionette harness.
Flags: needinfo?(jmathies)
I also don't know enough about how these tests work or run. Are the python calls that I see in the test file synchronous or asynchronous? How do I run one of these tests on its own to debug? I don't even know who to ask (though dburns seems like he might be able to shed some light) :-/
Flags: needinfo?(mrbkap)
Looks like test_log.py started failing recently as well. Any recent Marionette harness changes (like in the last week or so) that might give some clues as to what's going on?

https://treeherder.mozilla.org/logviewer.html#?job_id=17679313&repo=try
https://treeherder.mozilla.org/logviewer.html#?job_id=17679225&repo=try
https://treeherder.mozilla.org/logviewer.html#?job_id=17679469&repo=try
Looks like the only recent change I can see was the removal of test_import_script_reuse_window.py from unit-tests.ini. Which is *very* interesting to me in that test_emulator.py (one of the previously-failing tests) is the one directly below test_log.py in the manifest, so removing a test from the manifest shifted the failure up one spot. Timing issue or something?
We have one CPOW left but the harness is preventing things from being thrown for that test atm.

Will try look at this ASAP.
This got caught in triage limbo waiting on an ni. Ryan, is this still an issue?
Flags: needinfo?(dburns) → needinfo?(ryanvm)
Yup, still happens.
Flags: needinfo?(ryanvm)
Keywords: leave-open
I have been trying to replicate this locally and can not so assuming that e10s has introduced some kind of race condition to the code
So I have tried to reproduce this permafail on beta and can't. See https://treeherder.mozilla.org/#/jobs?repo=try&revision=3566e35f3538&selectedJob=18605874

It has a number of crashes with it (which just indicates Mn is at worse tickling a crash in c++ code somewhere) but not the permafail on tests as described.

I can't see a reason to have these tests disabled anymore.
Flags: needinfo?(ryanvm)
Added a try push against m-c in comment 25
Try push in comment 25 is green. If I dont hear anything by Friday I will be landing and uplifting
Bad news, I'm still seeing these failures on my latest simulations (both 47 and 48) now that the tests have been re-enabled.
Trunk:  https://treeherder.mozilla.org/logviewer.html#?job_id=19232200&repo=try
Aurora: https://treeherder.mozilla.org/logviewer.html#?job_id=19232664&repo=try

If you want to wait until the next uplift to see what actually happens in the Real World when 47 hits Beta, I'd be OK with that. It's certainly easy enough to revert comment 29/30 at that point if need-be.
Flags: needinfo?(ryanvm)
This is happening on the uplift of 47 to beta. Filed bug 1267431 about this before remembering this bug. I'll probably back the patch from here out at least on beta47 sometime today.
Initial regression range:
https://hg.mozilla.org/mozilla-central/pushloghtml?fromchange=%207a91754ea083&tochange=324c50cbc40a

I'm sorta-suspecting bug 1236926. I wonder if the harness is getting thrown off by not having to dismiss a prompt that used to be there. That bug landed on 46 and was uplifted to 45, which fits the initial filing of this bug. Just a thought, anyway. I'll keep narrowing this down on inbound.
Nope, wasn't that. Down to https://hg.mozilla.org/integration/mozilla-inbound/pushloghtml?fromchange=a0f514352b09&tochange=cca3da96be3e. I've got a final round of Try pushes going now. Should know which commit it is tomorrow.
I've bisected this down to https://hg.mozilla.org/integration/mozilla-inbound/rev/1626afe51b11 (bug 1239520). However, I don't think that's the real cause here. Rather, that patch basically made it so Beta CI builds could actually run with e10s enabled (before that, Gecko was specifically disallowing it, so the green "Mn-e10s" runs didn't actually have e10s enabled!).

One of the simulation patches I've run with in the past manually sets the release channel to "beta" (rather than the "default" that CI builds usually have), which would explain why I was able to file this bug on the 13th before bug 1239520 landed. The good news is that it gives me reason to believe that the real culprit isn't too far off from there, I'll just need to include the rev above in the simulation patch and re-bisect.
Duplicate of this bug: 1267431
I can reproduce the issue locally and I'll give it a shot at figuring it out.  I don't know much about marionette yet though, so be prepared for a flood of dumb questions..
Assignee: nobody → felipc
Status: NEW → ASSIGNED
What's happening in test_selected.py is that, when running on --e10s, the box.click() code is ending up in two clicks instead of one. So it checks and then unchecks the checkbox, and the next test fails because it verifies that the click checked the checkbox.

I'm not sure yet if the entire function is running twice, or just that the mouse events are being duplicated.

Also, strangely this only happens if RELEASE_BUILD is defined.

I'll try to figure out the cause tomorrow. If anyone has any guess with this new info, I'd love to hear it.
Classy though it is that we've been starring permaorange as a bug marked as a duplicate, I've hidden m-e10s on beta.
What's happening under e10s here is that this listener:
http://mxr.mozilla.org/mozilla-beta/source/testing/marionette/proxy.js?rev=a95822fe264c#215

is only added once. However, its callback is called twice, causing the action to happen two times. At the second call, one can also see an error in the error console because the removeListener function fails to find the listener (on line 227, `let l = ...`).

Still fighting to figure out why..
Summary: test_selected.py (and some others after it) are going to permafail in e10s mode when Gecko 45 merges to Beta → Marionette listeners are called twice under e10s, causing test_selected.py (and others) to fail
Surprisingly, this is tied not to RELEASE_BUILD, but actually to --enable-official-branding. Yes..

It's reproducible with mozilla-central too. Just need to build using --enable-official-branding and then:

> mach marionette-test testing/marionette/harness/marionette/tests/unit/test_selected.py

(note: on beta, one needs to add --e10s because it's not the default option)
(In reply to :Felipe Gomes (needinfo me!) from comment #40)
> What's happening under e10s here is that this listener:
> http://mxr.mozilla.org/mozilla-beta/source/testing/marionette/proxy.
> js?rev=a95822fe264c#215
> 
> is only added once. However, its callback is called twice, causing the
> action to happen two times. At the second call, one can also see an error in
> the error console because the removeListener function fails to find the
> listener (on line 227, `let l = ...`).

This sounds entirely reasonable.

However, I’m at loss why --enable-official-branding would affect this.
Priority: -- → P1
I'm unassigning myself for now as I'm kinda stuck and someone with Marionette experience can probably debug this more easily than myself. I think it's not too far now from being figured out. Comment 40 explains what makes the test fail, and comment 42 what's necessary to reproduce it.
Assignee: felipc → nobody
Status: ASSIGNED → NEW
Dave, can someone on your team please take it from here?
Flags: needinfo?(dburns)
Andreas, can you look into this please.
Flags: needinfo?(dburns) → needinfo?(ato)
Any news here?
Flags: needinfo?(dburns)
(In reply to Ryan VanderMeulen [:RyanVM] from comment #46)
> Any news here?

Unfortunately due to an unplanned startup crash we have been fire fighting for the last few weeks and this has not been a priority. Hopefully we can look into it next week.
Flags: needinfo?(dburns)
Blake,

Can you help us debug a little further here? Not knowing enough about e10s we don't know where to start on this.
Flags: needinfo?(ato) → needinfo?(mrbkap)
It looks like we’re calling loadFrameScript on the same browser twice, causing the message listeners for the listener.js paths to be added again.  We are only sending the IPC message from chrome space through proxy.js once, but it is getting picked up multiple times by the same handler on the listener side.
Flags: needinfo?(mrbkap)
Summary: Marionette listeners are called twice under e10s, causing test_selected.py (and others) to fail → Listener IPC message handlers are added twice for Marionette, causing test_selected.py (and others) to fail on e10s
Setting the aAllowDelayedLoad argument to loadFrameScript to false, causing the loaded script to not be implicitly loaded into child frames, solves the issue.  We appear to be calling loadFrameScript unnecessarily whenever a new browser is detected; we probably only need to call it on top-level frames.

This means we end up loading the script twice: Once implicitly, because aAllowDelayedLoad, and once explicitly, through GeckoDriver#whenBrowserStarted.
The preference startup.homepage_welcome_url.additional causes problems
with official branded builds and breaks certain Marionette tests, such as
testing/marionette/harness/marionette/tests/unit/test_window_handles.py
and testing/marionette/harness/marionette/tests/unit/test_selected.py.

Many Marionette tests make assumptions about the number of open windows,
and this aligns the test environment of the officially branded builds
to be more similar to that of Nightly builds.

There is another issue in Marionette, described in bug 1239552, that we
register the listener IPC message handlers twice.  This patch does not
directly address that problem.

Review commit: https://reviewboard.mozilla.org/r/65200/diff/#index_header
See other reviews: https://reviewboard.mozilla.org/r/65200/
Attachment #8772375 - Flags: review?(dburns)
Comment on attachment 8772375 [details]
Bug 1239552 - Disable additional welcome URL in automation;

Review request updated; see interdiff: https://reviewboard.mozilla.org/r/65200/diff/1-2/
Attachment #8772375 - Attachment description: Bug 1239552 - Disable additional wecome URL; → Bug 1239552 - Disable additional welcome URL in automation;
Assignee: nobody → ato
That neatly explains why we were seeing this on Win10 unbranded trunk builds as well - there's a special first run page that only shows there. See bug 1248316 for a similar issue that hit mochitests in the past (as well as the pref to set to make it go away).
Comment on attachment 8772375 [details]
Bug 1239552 - Disable additional welcome URL in automation;

Review request updated; see interdiff: https://reviewboard.mozilla.org/r/65200/diff/2-3/
Comment on attachment 8772375 [details]
Bug 1239552 - Disable additional welcome URL in automation;

https://reviewboard.mozilla.org/r/65200/#review62328
Attachment #8772375 - Flags: review?(dburns) → review+
we need this uplifted too or ?
Flags: needinfo?(ato)
By default, frame scripts each have their own scope, so they can declare
global variables without causing conflicts with any other frame scripts.
The current code switches off that behaviour, meaning that the script's
scope is shared with any other frame scripts in the same frame.

This patch changes it so that each loaded frame script gets its own
scope.  This works around the problem that on Windows in officially
branded builds, IPC message handlers are registered twice causing a race
condition in the IPC communication between driver.js and listener.js.

This is not a solution for fixing the core of the problem.

Review commit: https://reviewboard.mozilla.org/r/68298/diff/#index_header
See other reviews: https://reviewboard.mozilla.org/r/68298/
Attachment #8776565 - Flags: review?(dburns)
Comment on attachment 8772375 [details]
Bug 1239552 - Disable additional welcome URL in automation;

Review request updated; see interdiff: https://reviewboard.mozilla.org/r/65200/diff/3-4/
(In reply to Carsten Book [:Tomcat] from comment #57)
> we need this uplifted too or ?

This has not landed yet.
Flags: needinfo?(ato)
Comment on attachment 8776565 [details]
Bug 1239552 - Disable frame script global scope;

https://reviewboard.mozilla.org/r/68298/#review65468
Attachment #8776565 - Flags: review?(dburns) → review+
Pushed by atolfsen@mozilla.com:
https://hg.mozilla.org/integration/autoland/rev/493d0d1db532
Disable additional welcome URL in automation; r=automatedtester
https://hg.mozilla.org/integration/autoland/rev/c53734d006b3
Disable frame script global scope; r=automatedtester
Would it make sense to uplift this fix?  I keep noticing it on beta.
Flags: needinfo?(ato)
Andreas, I'm thinking we should just close this bug out and open a new one for fixing the underlying issue that's being worked around in these patches.
Target Milestone: --- → mozilla51
(In reply to Ryan VanderMeulen [:RyanVM] from comment #67)
> Andreas, I'm thinking we should just close this bug out and open a new one
> for fixing the underlying issue that's being worked around in these patches.

I filed https://bugzilla.mozilla.org/show_bug.cgi?id=1296628 as a follow-up to fix the underlying problems described herein.
Status: NEW → RESOLVED
Closed: 4 years ago
Flags: needinfo?(ato)
Resolution: --- → FIXED
Keywords: leave-open
You need to log in before you can comment on or make changes to this bug.