Closed
Bug 1239552
Opened 8 years ago
Closed 8 years ago
Listener IPC message handlers are added twice for Marionette, causing test_selected.py (and others) to fail on e10s
Categories
(Remote Protocol :: Marionette, defect, P1)
Tracking
(e10s+, firefox45+ disabled, firefox46-, firefox47 wontfix, firefox48 wontfix, firefox49 fixed, firefox50 fixed, firefox51 fixed)
People
(Reporter: RyanVM, Assigned: ato)
References
(Blocks 1 open bug)
Details
Attachments
(2 files)
[Tracking Requested - why for this release]: Merge day test bustage This test is currently running green on Mn-e10s on Beta44, so something in 45 appears to have regressed this test. Not sure if the subsequent failures are fallout from the first one or something else. https://treeherder.mozilla.org/logviewer.html#?job_id=15418548&repo=try TEST-START | test_selected.py TestSelected.test_selected TEST-UNEXPECTED-FAIL | test_selected.py TestSelected.test_selected | AssertionError: False is not true Traceback (most recent call last): File "/builds/slave/test/build/venv/local/lib/python2.7/site-packages/marionette/marionette_test.py", line 344, in run testMethod() File "/builds/slave/test/build/tests/marionette/tests/testing/marionette/client/marionette/tests/unit/test_selected.py", line 15, in test_selected self.assertTrue(box.is_selected()) TEST-INFO took 724ms
Reporter | ||
Updated•8 years ago
|
Flags: needinfo?(dburns)
Comment 1•8 years ago
|
||
Tracking for 45+ since we want these tests to work and merge day is coming up in a week. Since we're now running experiments in beta the tests should probably be working on mozilla-beta as well.
Reporter | ||
Comment 2•8 years ago
|
||
I tried to disable this for my latest Try push, but apparently Marionette doesn't understand |skip-if = e10s| ?
Assignee | ||
Comment 3•8 years ago
|
||
(In reply to Ryan VanderMeulen [:RyanVM] from comment #2) > I tried to disable this for my latest Try push, but apparently Marionette > doesn't understand |skip-if = e10s| ? I’m not sure precisely what code is responsible for implementing the skip-if directive, but I don’t think the Marionette harness does. Instead, the tests in testing/marionette/client/marionette/tests/unit use the decorator skip_if_e10s from marionette.marionette_test. An example can be seen here in https://dxr.mozilla.org/mozilla-central/source/testing/marionette/client/marionette/tests/unit/test_modal_dialogs.py?from=test_modal_dialogs.py#157.
Assignee | ||
Comment 4•8 years ago
|
||
I also don’t quite understand why test_selected.py is failing on e10s as it’s passing in CI.
Reporter | ||
Comment 5•8 years ago
|
||
From the looks of https://treeherder.mozilla.org/logviewer.html#?job_id=15681099&repo=try, we'll need to disable some others too. (In reply to Andreas Tolfsen (:ato) from comment #4) > I also don’t quite understand why test_selected.py is failing on e10s as > it’s passing in CI. The code we run changes depending on which branch we're on. i.e. NIGHTLY_BUILD vs. RELEASE_BUILD ifdefs can lead to behavior changes that appear on uplift. The Try push above simulates those changes on top of the current Aurora repo so we can anticipate them ahead of time and avoid multi-day tree closures after merge day (which has been the case for the last two).
Flags: needinfo?(ato)
Reporter | ||
Comment 6•8 years ago
|
||
19:34:38 ERROR - TEST-UNEXPECTED-ERROR | test_emulator.py TestEmulatorCallbacks.test_emulator_cmd_content | MarionetteException: MarionetteException: Emulator callback still pending when finish() called Can someone please explain to me how this test is even running? IIUC, it shouldn't be running on desktop builds since it's annotated "qemu = true". Though I also see it running (and passing) in non-e10s runs, so ¯\_(ツ)_/¯.
Reporter | ||
Comment 7•8 years ago
|
||
This is heavy-handed, but it gets Mn-e10s to green: https://hg.mozilla.org/try/rev/0468b63cf2b9
Assignee | ||
Comment 8•8 years ago
|
||
(In reply to Ryan VanderMeulen [:RyanVM] from comment #6) > 19:34:38 ERROR - TEST-UNEXPECTED-ERROR | test_emulator.py > TestEmulatorCallbacks.test_emulator_cmd_content | MarionetteException: > MarionetteException: Emulator callback still pending when finish() called > > Can someone please explain to me how this test is even running? IIUC, it > shouldn't be running on desktop builds since it's annotated "qemu = true". > Though I also see it running (and passing) in non-e10s runs, so ¯\_(ツ)_/¯. If you look at the test code, it uses mocks to test the emulator interfaces. It does not actually communicate with an emulator: https://dxr.mozilla.org/mozilla-central/source/testing/marionette/client/marionette/tests/unit/test_emulator.py#44
Flags: needinfo?(ato)
Reporter | ||
Updated•8 years ago
|
Comment 11•8 years ago
|
||
jimm or blake, can you help us resolve this ? I'm not sure who should own this bug. Thanks.
Flags: needinfo?(mrbkap)
Flags: needinfo?(jmathies)
Comment 12•8 years ago
|
||
I don't know much about these tests. Do they run in the content process or do they use CPOWS? Best guess is that the test harness helper to click the checkbox fails, or maybe it's not sync such that the checkbox isn't checked by the time the check routine executes. (Although I wonder how that check occurs, is it a CPOW or a ContentTask behind the scene?) I have to defer to blake or someone who knows how we e10sified our marionette harness.
Flags: needinfo?(jmathies)
Comment 13•8 years ago
|
||
I also don't know enough about how these tests work or run. Are the python calls that I see in the test file synchronous or asynchronous? How do I run one of these tests on its own to debug? I don't even know who to ask (though dburns seems like he might be able to shed some light) :-/
Flags: needinfo?(mrbkap)
Reporter | ||
Comment 14•8 years ago
|
||
Looks like test_log.py started failing recently as well. Any recent Marionette harness changes (like in the last week or so) that might give some clues as to what's going on? https://treeherder.mozilla.org/logviewer.html#?job_id=17679313&repo=try https://treeherder.mozilla.org/logviewer.html#?job_id=17679225&repo=try https://treeherder.mozilla.org/logviewer.html#?job_id=17679469&repo=try
Reporter | ||
Comment 15•8 years ago
|
||
Looks like the only recent change I can see was the removal of test_import_script_reuse_window.py from unit-tests.ini. Which is *very* interesting to me in that test_emulator.py (one of the previously-failing tests) is the one directly below test_log.py in the manifest, so removing a test from the manifest shifted the failure up one spot. Timing issue or something?
Comment 16•8 years ago
|
||
We have one CPOW left but the harness is preventing things from being thrown for that test atm. Will try look at this ASAP.
Reporter | ||
Comment 17•8 years ago
|
||
https://hg.mozilla.org/releases/mozilla-beta/rev/e1ad601e3c33
Comment 18•8 years ago
|
||
This got caught in triage limbo waiting on an ni. Ryan, is this still an issue?
Flags: needinfo?(dburns) → needinfo?(ryanvm)
Reporter | ||
Updated•8 years ago
|
Keywords: leave-open
Reporter | ||
Comment 21•8 years ago
|
||
https://hg.mozilla.org/releases/mozilla-aurora/rev/86c9fed0d412 https://hg.mozilla.org/releases/mozilla-beta/rev/5c27ccdc982b
status-firefox48:
--- → disabled
Reporter | ||
Comment 22•8 years ago
|
||
bugherder |
https://hg.mozilla.org/mozilla-central/rev/4015564e7420
Updated•8 years ago
|
Blocks: e10s-tests
Comment 23•8 years ago
|
||
I have been trying to replicate this locally and can not so assuming that e10s has introduced some kind of race condition to the code
Comment 24•8 years ago
|
||
So I have tried to reproduce this permafail on beta and can't. See https://treeherder.mozilla.org/#/jobs?repo=try&revision=3566e35f3538&selectedJob=18605874 It has a number of crashes with it (which just indicates Mn is at worse tickling a crash in c++ code somewhere) but not the permafail on tests as described. I can't see a reason to have these tests disabled anymore.
Flags: needinfo?(ryanvm)
Comment 26•8 years ago
|
||
Added a try push against m-c in comment 25
Comment 27•8 years ago
|
||
Try push in comment 25 is green. If I dont hear anything by Friday I will be landing and uplifting
Comment 28•8 years ago
|
||
https://hg.mozilla.org/integration/mozilla-inbound/rev/edb61379f407ea49a6a5c77a2f85cf0d6566e246 Bug 1239552: Reenabling tests disabled for e10s r=me
Reporter | ||
Comment 29•8 years ago
|
||
bugherder |
https://hg.mozilla.org/mozilla-central/rev/edb61379f407
Comment 30•8 years ago
|
||
uplifted to Aurora https://hg.mozilla.org/releases/mozilla-aurora/rev/7e501ac32481
status-firefox46:
disabled → ---
status-firefox47:
disabled → ---
Reporter | ||
Comment 31•8 years ago
|
||
Bad news, I'm still seeing these failures on my latest simulations (both 47 and 48) now that the tests have been re-enabled. Trunk: https://treeherder.mozilla.org/logviewer.html#?job_id=19232200&repo=try Aurora: https://treeherder.mozilla.org/logviewer.html#?job_id=19232664&repo=try If you want to wait until the next uplift to see what actually happens in the Real World when 47 hits Beta, I'd be OK with that. It's certainly easy enough to revert comment 29/30 at that point if need-be.
Flags: needinfo?(ryanvm)
This is happening on the uplift of 47 to beta. Filed bug 1267431 about this before remembering this bug. I'll probably back the patch from here out at least on beta47 sometime today.
Reporter | ||
Comment 33•8 years ago
|
||
Initial regression range: https://hg.mozilla.org/mozilla-central/pushloghtml?fromchange=%207a91754ea083&tochange=324c50cbc40a I'm sorta-suspecting bug 1236926. I wonder if the harness is getting thrown off by not having to dismiss a prompt that used to be there. That bug landed on 46 and was uplifted to 45, which fits the initial filing of this bug. Just a thought, anyway. I'll keep narrowing this down on inbound.
Reporter | ||
Comment 34•8 years ago
|
||
Nope, wasn't that. Down to https://hg.mozilla.org/integration/mozilla-inbound/pushloghtml?fromchange=a0f514352b09&tochange=cca3da96be3e. I've got a final round of Try pushes going now. Should know which commit it is tomorrow.
Reporter | ||
Comment 35•8 years ago
|
||
I've bisected this down to https://hg.mozilla.org/integration/mozilla-inbound/rev/1626afe51b11 (bug 1239520). However, I don't think that's the real cause here. Rather, that patch basically made it so Beta CI builds could actually run with e10s enabled (before that, Gecko was specifically disallowing it, so the green "Mn-e10s" runs didn't actually have e10s enabled!). One of the simulation patches I've run with in the past manually sets the release channel to "beta" (rather than the "default" that CI builds usually have), which would explain why I was able to file this bug on the 13th before bug 1239520 landed. The good news is that it gives me reason to believe that the real culprit isn't too far off from there, I'll just need to include the rev above in the simulation patch and re-bisect.
Comment 37•8 years ago
|
||
I can reproduce the issue locally and I'll give it a shot at figuring it out. I don't know much about marionette yet though, so be prepared for a flood of dumb questions..
Assignee: nobody → felipc
Status: NEW → ASSIGNED
Comment 38•8 years ago
|
||
What's happening in test_selected.py is that, when running on --e10s, the box.click() code is ending up in two clicks instead of one. So it checks and then unchecks the checkbox, and the next test fails because it verifies that the click checked the checkbox. I'm not sure yet if the entire function is running twice, or just that the mouse events are being duplicated. Also, strangely this only happens if RELEASE_BUILD is defined. I'll try to figure out the cause tomorrow. If anyone has any guess with this new info, I'd love to hear it.
Comment 39•8 years ago
|
||
Classy though it is that we've been starring permaorange as a bug marked as a duplicate, I've hidden m-e10s on beta.
Comment 40•8 years ago
|
||
What's happening under e10s here is that this listener: http://mxr.mozilla.org/mozilla-beta/source/testing/marionette/proxy.js?rev=a95822fe264c#215 is only added once. However, its callback is called twice, causing the action to happen two times. At the second call, one can also see an error in the error console because the removeListener function fails to find the listener (on line 227, `let l = ...`). Still fighting to figure out why..
Summary: test_selected.py (and some others after it) are going to permafail in e10s mode when Gecko 45 merges to Beta → Marionette listeners are called twice under e10s, causing test_selected.py (and others) to fail
Comment 41•8 years ago
|
||
Surprisingly, this is tied not to RELEASE_BUILD, but actually to --enable-official-branding. Yes..
It's reproducible with mozilla-central too. Just need to build using --enable-official-branding and then:
> mach marionette-test testing/marionette/harness/marionette/tests/unit/test_selected.py
(note: on beta, one needs to add --e10s because it's not the default option)
Assignee | ||
Comment 42•8 years ago
|
||
(In reply to :Felipe Gomes (needinfo me!) from comment #40) > What's happening under e10s here is that this listener: > http://mxr.mozilla.org/mozilla-beta/source/testing/marionette/proxy. > js?rev=a95822fe264c#215 > > is only added once. However, its callback is called twice, causing the > action to happen two times. At the second call, one can also see an error in > the error console because the removeListener function fails to find the > listener (on line 227, `let l = ...`). This sounds entirely reasonable. However, I’m at loss why --enable-official-branding would affect this.
Updated•8 years ago
|
Priority: -- → P1
Comment 43•8 years ago
|
||
I'm unassigning myself for now as I'm kinda stuck and someone with Marionette experience can probably debug this more easily than myself. I think it's not too far now from being figured out. Comment 40 explains what makes the test fail, and comment 42 what's necessary to reproduce it.
Assignee: felipc → nobody
Status: ASSIGNED → NEW
Reporter | ||
Comment 44•8 years ago
|
||
Dave, can someone on your team please take it from here?
Flags: needinfo?(dburns)
Comment 45•8 years ago
|
||
Andreas, can you look into this please.
Flags: needinfo?(dburns) → needinfo?(ato)
Comment 47•8 years ago
|
||
(In reply to Ryan VanderMeulen [:RyanVM] from comment #46) > Any news here? Unfortunately due to an unplanned startup crash we have been fire fighting for the last few weeks and this has not been a priority. Hopefully we can look into it next week.
Flags: needinfo?(dburns)
Updated•8 years ago
|
Reporter | ||
Comment 48•8 years ago
|
||
Still firefighting?
Comment 49•8 years ago
|
||
Blake, Can you help us debug a little further here? Not knowing enough about e10s we don't know where to start on this.
Flags: needinfo?(ato) → needinfo?(mrbkap)
Assignee | ||
Comment 50•8 years ago
|
||
It looks like we’re calling loadFrameScript on the same browser twice, causing the message listeners for the listener.js paths to be added again. We are only sending the IPC message from chrome space through proxy.js once, but it is getting picked up multiple times by the same handler on the listener side.
Flags: needinfo?(mrbkap)
Summary: Marionette listeners are called twice under e10s, causing test_selected.py (and others) to fail → Listener IPC message handlers are added twice for Marionette, causing test_selected.py (and others) to fail on e10s
Assignee | ||
Comment 51•8 years ago
|
||
Setting the aAllowDelayedLoad argument to loadFrameScript to false, causing the loaded script to not be implicitly loaded into child frames, solves the issue. We appear to be calling loadFrameScript unnecessarily whenever a new browser is detected; we probably only need to call it on top-level frames. This means we end up loading the script twice: Once implicitly, because aAllowDelayedLoad, and once explicitly, through GeckoDriver#whenBrowserStarted.
Assignee | ||
Comment 52•8 years ago
|
||
The preference startup.homepage_welcome_url.additional causes problems with official branded builds and breaks certain Marionette tests, such as testing/marionette/harness/marionette/tests/unit/test_window_handles.py and testing/marionette/harness/marionette/tests/unit/test_selected.py. Many Marionette tests make assumptions about the number of open windows, and this aligns the test environment of the officially branded builds to be more similar to that of Nightly builds. There is another issue in Marionette, described in bug 1239552, that we register the listener IPC message handlers twice. This patch does not directly address that problem. Review commit: https://reviewboard.mozilla.org/r/65200/diff/#index_header See other reviews: https://reviewboard.mozilla.org/r/65200/
Attachment #8772375 -
Flags: review?(dburns)
Assignee | ||
Comment 53•8 years ago
|
||
Comment on attachment 8772375 [details] Bug 1239552 - Disable additional welcome URL in automation; Review request updated; see interdiff: https://reviewboard.mozilla.org/r/65200/diff/1-2/
Attachment #8772375 -
Attachment description: Bug 1239552 - Disable additional wecome URL; → Bug 1239552 - Disable additional welcome URL in automation;
Assignee | ||
Updated•8 years ago
|
Assignee: nobody → ato
Reporter | ||
Comment 54•8 years ago
|
||
That neatly explains why we were seeing this on Win10 unbranded trunk builds as well - there's a special first run page that only shows there. See bug 1248316 for a similar issue that hit mochitests in the past (as well as the pref to set to make it go away).
Assignee | ||
Comment 55•8 years ago
|
||
Comment on attachment 8772375 [details] Bug 1239552 - Disable additional welcome URL in automation; Review request updated; see interdiff: https://reviewboard.mozilla.org/r/65200/diff/2-3/
Comment 56•8 years ago
|
||
Comment on attachment 8772375 [details] Bug 1239552 - Disable additional welcome URL in automation; https://reviewboard.mozilla.org/r/65200/#review62328
Attachment #8772375 -
Flags: review?(dburns) → review+
Comment 57•8 years ago
|
||
we need this uplifted too or ?
Updated•8 years ago
|
Flags: needinfo?(ato)
Assignee | ||
Comment 58•8 years ago
|
||
By default, frame scripts each have their own scope, so they can declare global variables without causing conflicts with any other frame scripts. The current code switches off that behaviour, meaning that the script's scope is shared with any other frame scripts in the same frame. This patch changes it so that each loaded frame script gets its own scope. This works around the problem that on Windows in officially branded builds, IPC message handlers are registered twice causing a race condition in the IPC communication between driver.js and listener.js. This is not a solution for fixing the core of the problem. Review commit: https://reviewboard.mozilla.org/r/68298/diff/#index_header See other reviews: https://reviewboard.mozilla.org/r/68298/
Attachment #8776565 -
Flags: review?(dburns)
Assignee | ||
Comment 59•8 years ago
|
||
Comment on attachment 8772375 [details] Bug 1239552 - Disable additional welcome URL in automation; Review request updated; see interdiff: https://reviewboard.mozilla.org/r/65200/diff/3-4/
Assignee | ||
Comment 60•8 years ago
|
||
try: https://treeherder.mozilla.org/#/jobs?repo=try&revision=a50015a9b238ae179897ba95a00c6e1a19423341
Assignee | ||
Comment 61•8 years ago
|
||
(In reply to Carsten Book [:Tomcat] from comment #57) > we need this uplifted too or ? This has not landed yet.
Flags: needinfo?(ato)
Comment 62•8 years ago
|
||
Comment on attachment 8776565 [details] Bug 1239552 - Disable frame script global scope; https://reviewboard.mozilla.org/r/68298/#review65468
Attachment #8776565 -
Flags: review?(dburns) → review+
Comment 63•8 years ago
|
||
Pushed by atolfsen@mozilla.com: https://hg.mozilla.org/integration/autoland/rev/493d0d1db532 Disable additional welcome URL in automation; r=automatedtester https://hg.mozilla.org/integration/autoland/rev/c53734d006b3 Disable frame script global scope; r=automatedtester
Comment 64•8 years ago
|
||
bugherder |
https://hg.mozilla.org/mozilla-central/rev/493d0d1db532 https://hg.mozilla.org/mozilla-central/rev/c53734d006b3
Comment 65•8 years ago
|
||
Would it make sense to uplift this fix? I keep noticing it on beta.
Flags: needinfo?(ato)
Reporter | ||
Comment 66•8 years ago
|
||
bugherder uplift |
https://hg.mozilla.org/releases/mozilla-aurora/rev/71c3f5e108c8 https://hg.mozilla.org/releases/mozilla-aurora/rev/fc6e72b89218
Reporter | ||
Comment 67•8 years ago
|
||
Andreas, I'm thinking we should just close this bug out and open a new one for fixing the underlying issue that's being worked around in these patches.
status-firefox51:
--- → fixed
Target Milestone: --- → mozilla51
Reporter | ||
Comment 68•8 years ago
|
||
bugherder uplift |
https://hg.mozilla.org/releases/mozilla-beta/rev/27de1823b2f7 https://hg.mozilla.org/releases/mozilla-beta/rev/e3dbc263febf
Assignee | ||
Comment 69•8 years ago
|
||
(In reply to Ryan VanderMeulen [:RyanVM] from comment #67) > Andreas, I'm thinking we should just close this bug out and open a new one > for fixing the underlying issue that's being worked around in these patches. I filed https://bugzilla.mozilla.org/show_bug.cgi?id=1296628 as a follow-up to fix the underlying problems described herein.
Status: NEW → RESOLVED
Closed: 8 years ago
Flags: needinfo?(ato)
Resolution: --- → FIXED
Reporter | ||
Updated•8 years ago
|
Keywords: leave-open
Updated•1 year ago
|
Product: Testing → Remote Protocol
You need to log in
before you can comment on or make changes to this bug.
Description
•