Closed Bug 1337772 Opened 3 years ago Closed 3 years ago

Intermittent browser_context_menu_autocomplete_interaction.js | Autocomplete shouldn't appear - false == true - JS frame :: chrome://mochitests/content/browser/toolkit/components/passwordmgr/test/browser/browser_context_menu_autocomplete_interaction.js

Categories

(Toolkit :: Password Manager, defect)

defect
Not set

Tracking

()

RESOLVED FIXED
Tracking Status
firefox-esr52 --- unaffected
firefox53 --- unaffected
firefox54 --- wontfix
firefox55 --- fixed

People

(Reporter: intermittent-bug-filer, Assigned: johannh)

References

(Depends on 1 open bug)

Details

(Keywords: intermittent-failure, Whiteboard: [stockwell fixed])

Attachments

(3 files)

When someone goes looking for the reason why the frequency of this exploded, the answer is going to be that it moved to a new chunk where it isn't happy.
The test was also just updated, in bug 1337259.

:johannh - Are these failures related to your changes? Can you make this run more reliably?
Flags: needinfo?(jhofmann)
I'm looking into it. So far I can unfortunately not reproduce this locally, running for a long time in chaos mode. I'll try to get a loaner, I guess.
Assignee: nobody → jhofmann
Status: NEW → ASSIGNED
Flags: needinfo?(jhofmann)
Quick update on this:

The intermittent failure is pretty special and should normally not be observable by users. On the loaner I could only intermittently reproduce the problem on the very first right click into an input in the program lifetime. It's not reproducible until Firefox restarts. Presuming that this has to do something with formFillController initialization I pushed the following patch to try: 

https://hg.mozilla.org/try/rev/a6264cbe879275a26080056bb6787a8693e8842d

But it did not have the desired effect. Looking into this more thoroughly, it's unlikely that the primary changes in bug 1337259 caused the sudden spike in failures, since the username field was not touched in the patch and is also affected (https://public-artifacts.taskcluster.net/NnYUlzhYQzevBzBG3v2DJQ/0/public/test_info//mozilla-test-fail-screenshot_vuXQsH.png). It was more likely triggered by:

- bug 1337259 enabling the test case on the password field, doubling the chance of this happening
- changing the execution conditions by moving it do a different chunk, since this intermittent seems to not happen if formFillController is initialized already 

I can intermittently reproduce the bug on a Linux VM in Chaos Mode. Next I will try to debug the failure, maybe trying to catch one in rr. The weirdest part is that this is happening for both input fields although these two should have pretty different code paths for hiding the popup.

If I'm unable to find the cause until next week I'd say we disable the test on Linux since it's really an edge-case in production.
Thanks for the work so far on this Johann.  If you need any help with your try push or trying a few different things out, please ask for it.
Whiteboard: [stockwell needswork]
After applying the first patch I could not observe the only-on-first-click behavior anymore (I'm not sure if I fixed it or if I wasn't observing it correctly from the start), but the test still intermittently failed. Debugging on Linux revealed that the timeDiff value in these cases was always slightly above 250, which is the maximum threshold. In my observation, the less performant Firefox was (e.g. with rr or gdb attached), the higher the potential timeDiff (seems obvious in hindsight).

So my fix for now would be to raise the threshold to an even higher value, e.g. 500, in addition to correctly initializing the timestamp and checking for a password field before comparing timestamps. Matt, we can change that number to something lower if you like, in my testing I never saw anything above 400 and my browser was painfully slow.

The initial try run looks good and I did a bunch of retriggers to confirm. Let's wait and see.
Comment on attachment 8843002 [details]
Bug 1337772 - Part 1 - Use mousedown instead of contextmenu to avoid showing the password autocomplete.

https://reviewboard.mozilla.org/r/116748/#review118492

r=me with 400 and the ini change for debug/ASAN
Attachment #8843002 - Flags: review?(MattN+bmo) → review+
(In reply to Johann Hofmann [:johannh] from comment #15)
> After applying the first patch I could not observe the only-on-first-click
> behavior anymore (I'm not sure if I fixed it or if I wasn't observing it
> correctly from the start), but the test still intermittently failed.
> Debugging on Linux revealed that the timeDiff value in these cases was
> always slightly above 250, which is the maximum threshold. In my
> observation, the less performant Firefox was (e.g. with rr or gdb attached),
> the higher the potential timeDiff (seems obvious in hindsight).
> 
> So my fix for now would be to raise the threshold to an even higher value,
> e.g. 500, in addition to correctly initializing the timestamp and checking
> for a password field before comparing timestamps. Matt, we can change that
> number to something lower if you like, in my testing I never saw anything
> above 400 and my browser was painfully slow.

The timing issue makes a lot of sense since the failures are much more common on debug/asan builds. Because non-optimized builds are slower perhaps we should disable the test outside opt? How about a value of 400 plus no debug/asan?
> The timing issue makes a lot of sense since the failures are much more common on debug/asan builds. Because non-optimized builds are slower perhaps we should disable the test outside opt? How about a value of 400 plus no debug/asan?

Let's disable only on asan or linux 32 bit debug, since that seems to be the most common platform it fails on. I would prefer not to disable on too many platforms initially since this is still an indicator of end-user experience.
Let's leave this open to verify that we don't have to disable for 64 bit debug linux (and that the fix worked).
Keywords: leave-open
Pushed by jhofmann@mozilla.com:
https://hg.mozilla.org/integration/autoland/rev/ebf65a1af072
Fix intermittent browser_context_menu_autocomplete_interaction.js. r=MattN
Whiteboard: [stockwell needswork] → [stockwell fixed]
:johann, you landed on Autoland and after doing so I don't see failures there- I am find to say this is fixed, but it just got merged today to m-c, so another day or so and it will be clear if this is fixed.
(In reply to Joel Maher ( :jmaher) from comment #28)
> :johann, you landed on Autoland and after doing so I don't see failures
> there- I am find to say this is fixed, but it just got merged today to m-c,
> so another day or so and it will be clear if this is fixed.

Yes, in my experience I usually have to wait a couple of days until I can clearly perceive the effect of an intermittent fix, but feel free to close this whenever you feel confident that we fixed it :)
odd, we still have a high failure rate, I suspect this didn't make it to mozilla-inbound as almost all failures are on inbound.

:tomcat, early yesterday we merged to m-c, but I see a high failure rate on m-i, I don't see this landed on inbound- is there a reason we went to m-c but not to inbound yesterday?
Flags: needinfo?(cbook)
(In reply to Joel Maher ( :jmaher) from comment #31)
> odd, we still have a high failure rate, I suspect this didn't make it to
> mozilla-inbound as almost all failures are on inbound.
> 
> :tomcat, early yesterday we merged to m-c, but I see a high failure rate on
> m-i, I don't see this landed on inbound- is there a reason we went to m-c
> but not to inbound yesterday?

https://hg.mozilla.org/integration/mozilla-inbound/rev/ebf65a1af072 :)
Flags: needinfo?(cbook)
ok, this is on inbound, odd why we are only getting failures on inbound though, lets see what another day or two yields in terms of patterns.
I have no idea how my small patches which don't affect anything other than whether or not the paste command is enabled could be causing these test failures. The code is completely unrelated.

Is there a chance that my patch is just unlucky and just caused the test to move between chunks again to a less favorable position for performance?
Flags: needinfo?(michael) → needinfo?(jhofmann)
:kwierso, thanks for the retriggers and finding the root cause.

I verified the test in question has remained in the bc1 test chunk, so we didn't see anything move around.  This looks to be the same failure (logs are the same).  The difference is this is primarily linux opt/pgo, not linux debug.

Michael, could you maybe bisect this down with your changes to see what part of your changes are the root cause?  This looks to be a perma fail, so I would recommend backing out the original changes until we have an opportunity to find the root cause.

also are there interactions between the fix for this and the new code which landed?
(In reply to Michael Layzell [:mystor] from comment #35)
> The code is completely unrelated.

https://hg.mozilla.org/integration/mozilla-inbound/rev/ac2f3894113f831e5f2828185d8ac8c93ef8053b made changes to several other context menu tests, related to the paste command. I don't see anything in browser_context_menu_autocomplete_interaction.js specifically related to paste, but the context menu connection seems more than coincidental.
I've talked with ehsan, and I think I've figured out how my changes are regressing this test.

One of the effects which my changes have is reducing the number of sync IPC messages which we make when updating command state in e10s. Before my patch we had to perform multiple sync IPC messages to determine if there is data on the clipboard to determine whether or not to enable the paste command. After my changes this is no longer necessary, and we avoid those sync IPC messages.

This most likely affects the timing of opening the context menu (which requires updating command state), and thus throws off the fragile timing of this test.
I don't think it's related to performance or chunking/interaction with surrounding tests since I can easily reproduce the failure on my local VM running just this individual test with stellar performance on opt. There's something else going on here.

> This most likely affects the timing of opening the context menu (which requires updating command state), and thus throws off the fragile timing of this test.

This sounds like your change made the context menu faster, which wouldn't make the test fail since it only fails on spectacularly bad context menu opening times.

Note that we rely on the contextmenu event to arrive synchronously with the focus event (before our Runnable created in the focus event listener is run on the main thread). If your patch changed something around this it might cause the failure. It's probably best to just debug it.

It's too late in my timezone to further investigate this, but I'll try to follow up tomorrow.

Thanks!
Flags: needinfo?(jhofmann)
Michael, are you going to fix this bug?  I just want to update the assignee and get this resolved
Flags: needinfo?(michael)
Since I'm also looking into this right now (Michael's patch doesn't look wrong, our test is just too brittle against these kinds of changes) it's probably fine to leave it assigned to me.

Michael, it would still be great if you could validate that this is related to reducing the IPC messages.
As Johann said, I think he's going to handle it, and I'll try to help out if I can :).
Flags: needinfo?(michael)
Debugging this turned out to be a bit of a pain due to occasional crashes but logging finally gave me the answer I was looking for. Turns out that mystor's patch seems to have triggered a condition on Linux where the contextmenu event is not guaranteed to happen before we spin the event loop in the focus event handler. This causes our lastContextMenuTimeStamp variable to be set to an old value when the runnable checks it to figure out if it should show the popup.

smaug and mystor told me on IRC that we should not rely on these events happening in sync and smaug made the suggestion to instead listen for mousedown events and ignore the next focus event if the mousedown came from a right click. I'll look into that tomorrow.

Joel, I'd be up for disabling this test on all of Linux while we work on a solution. I can do that tomorrow but feel free to do it yourself if you think it's urgent.
how long do you think a fix might take?  If it could be landed by Monday, lets leave it, otherwise lets disable- I can disable tomorrow if this might take a few more days.
ni myself to follow up in 1 week- I know :johannh is working on this, I don't want to forget about it
Flags: needinfo?(jmaher)
Whiteboard: [stockwell fixed] → [stockwell disabled]
Attachment #8845534 - Flags: review?(gbrown) → review+
Pushed by jmaher@mozilla.com:
https://hg.mozilla.org/integration/mozilla-inbound/rev/b5a1f3455ad4
Intermittent browser_context_menu_autocomplete_interaction.js. disable on linux. r=gbrown
Comment on attachment 8843002 [details]
Bug 1337772 - Part 1 - Use mousedown instead of contextmenu to avoid showing the password autocomplete.

So, uh, no idea how this works with reviewboard now. It seems to have pushed on top of the existing commit. Fun.

Matt, let me know if that's ok for you to review or if I should attach a new patch.
Attachment #8843002 - Flags: review+ → review?(MattN+bmo)
thanks for getting a fix Johann!
Flags: needinfo?(jmaher)
Whiteboard: [stockwell disabled] → [stockwell needswork]
Depends on: 1192800
Comment on attachment 8843002 [details]
Bug 1337772 - Part 1 - Use mousedown instead of contextmenu to avoid showing the password autocomplete.

https://reviewboard.mozilla.org/r/116748/#review122720

r=me on this version but I guess there may be follow-up fixes for test failures.

::: toolkit/components/passwordmgr/LoginManagerContent.jsm:594
(Diff revision 4)
> -      } else {
> +    } else {
> -        log("maybeOpenAutocompleteAfterFocus: FormFillController has a different focused input");
> +      log("Not opening autocomplete after focus since a context menu was opened within",
> +          timeDiff, "ms");

Please keep this before an early return and get rid of the `else` as that is m-c style.
Attachment #8843002 - Flags: review?(MattN+bmo) → review+
I saw a few try pushes here a week ago, not sure if they amounted to anything.  This is slightly reduced, but still showing up daily.  it also looks like possible progress a week ago in bug 1192800 as well- it sounds like we are close, maybe next week this can be fixed up!
Yes, sorry for the lack of updates. The latest try push is with the patch from bug 1192800 applied, and there I'm pretty confident that it's fixed for good. So I'm just waiting on the review on that bug and then I'll push the small additional change for review here.
Since the latest try push is already looking good (will do a ton of retriggers now) here's the second part of the patch for review. I also set the timeout to 400 again, let's see if we can leave it enabled on debug & asan.
Comment on attachment 8852521 [details]
Bug 1337772 - Part 2 - Explicitly trigger a mousedown event before opening the context menu in browser_context_menu.js.

https://reviewboard.mozilla.org/r/124712/#review127482

Thanks a lot!

::: browser/components/extensions/test/browser/head.js:247
(Diff revision 1)
>    let popupShownPromise = BrowserTestUtils.waitForEvent(contentAreaContextMenu, "popupshown");
> +  await BrowserTestUtils.synthesizeMouseAtCenter(selector, {type: "mousedown", button: 2}, gBrowser.selectedBrowser);
>    await BrowserTestUtils.synthesizeMouseAtCenter(selector, {type: "contextmenu"}, gBrowser.selectedBrowser);

This is to unbreak some non-pwmgr tests?

::: toolkit/components/passwordmgr/test/browser/browser_context_menu.js:278
(Diff revision 1)
> +  let eventDetails = {type: "mousedown", button: 2};
> +  yield BrowserTestUtils.synthesizeMouseAtCenter(passwordInput, eventDetails, browser);
> +  // Synthesize a contextmenu event to actually open the context menu.
> +  eventDetails = {type: "contextmenu", button: 2};
> +  yield BrowserTestUtils.synthesizeMouseAtCenter(passwordInput, eventDetails, browser);

Ideally we would fix `synthesizeMouseAtCenter` to send the other mouse events (never understood why it doesn't in this case) but this is okay since we're uplifting and that may have some history behind it.
Attachment #8852521 - Flags: review?(MattN+bmo) → review+
Pushed by jhofmann@mozilla.com:
https://hg.mozilla.org/integration/autoland/rev/31159cce08ef
Part 1 - Use mousedown instead of contextmenu to avoid showing the password autocomplete. r=MattN
https://hg.mozilla.org/integration/autoland/rev/6d936a886de8
Part 2 - Explicitly trigger a mousedown event before opening the context menu in browser_context_menu.js. r=MattN
Whiteboard: [stockwell needswork] → [stockwell fixed]
Depends on: 1260593
Closing as fixed since there haven't been any failures since landing (except one in Aurora).
Status: ASSIGNED → RESOLVED
Closed: 3 years ago
Keywords: leave-open
Resolution: --- → FIXED
Should we consider this for uplift to Aurora then? Please nominate if so :)
Flags: needinfo?(jhofmann)
Not clearing needinfo, just wanted to mention that I'd like to assess and ideally fix bug 1260593 before nominating this for (now) beta.
Flags: needinfo?(jhofmann)
You need to log in before you can comment on or make changes to this bug.