Open Bug 1369440 Opened 2 years ago Updated Last year

Intermittent toolkit/components/extensions/test/mochitest/test_ext_contentscript_about_blank.html | Test timed out.

Categories

(WebExtensions :: General, defect, P3)

defect

Tracking

(firefox55 fixed)

REOPENED
mozilla55
Tracking Status
firefox55 --- fixed

People

(Reporter: RyanVM, Unassigned)

References

Details

(Keywords: intermittent-failure, Whiteboard: [stockwell disabled])

Attachments

(1 file)

doing some retriggers here to narrow this down:
https://treeherder.mozilla.org/#/jobs?repo=mozilla-inbound&filter-searchStr=android%20debug%2048&tochange=5600d485769bed18de1fc9e8ba8e3f7e66bc771b&fromchange=59fa62ee9d7537bf9dd5e75ef77a6561b396fd21

34 failures in a day is very high failure rate!  While waiting for the results, I would like to get the dev team aware of this and figure out who will work on this.  I will update the bug with information when the retriggers are done.

:andym- can you help find someone to look into this?
Flags: needinfo?(amckay)
Whiteboard: [stockwell needswork]
:bkelly- this appears to be related to your change from bug 1363829.  I know you worked hard to address issues prior to deployment and were expecting a few random things to show up, here you can see the difference:
https://treeherder.mozilla.org/#/jobs?repo=mozilla-inbound&filter-searchStr=android%20debug%2048&tochange=9bf2eafa8defa9b0d69bfe8c757ca69826543e9b&fromchange=83f3573ea26840208086aa5bd7d41aff37a0ddf1
Blocks: 1363829
Flags: needinfo?(amckay) → needinfo?(bkelly)
I'll take a look on Monday.  I saw this one with an earlier version of my patches, but it went away after some timing tweaks.  I guess I was incorrect about that.
Kris, it looks like you looked at this test before in bug 1312161.  Do you have any ideas?  I'm going to investigate on Monday, but I don't really know our extension internals or test framework.  Thanks!

It looks like its not seeing the about:blank window:

[task 2017-06-02T19:59:21.356964Z] 19:59:21     INFO -  8 INFO SpawnTask.js | Entering test test_contentscript_about_blank
[task 2017-06-02T19:59:21.357017Z] 19:59:21     INFO -  9 INFO Extension loaded
[task 2017-06-02T19:59:21.357065Z] 19:59:21     INFO -  Buffered messages logged at 19:54:30
[task 2017-06-02T19:59:21.357116Z] 19:59:21     INFO -  10 INFO script ran: all,about:srcdoc,about:srcdoc
[task 2017-06-02T19:59:21.357161Z] 19:59:21     INFO -  Buffered messages logged at 19:54:31
[task 2017-06-02T19:59:21.357230Z] 19:59:21     INFO -  11 INFO script ran: all,top,http://example.com/tests/toolkit/components/extensions/test/mochitest/file_with_about_blank.html
[task 2017-06-02T19:59:21.357279Z] 19:59:21     INFO -  Buffered messages finished
[task 2017-06-02T19:59:21.357359Z] 19:59:21     INFO -  12 INFO TEST-UNEXPECTED-FAIL | toolkit/components/extensions/test/mochitest/test_ext_contentscript_about_blank.html | Test timed out.

My patches changed the timing of setTimeout() slightly, but its unclear to me how that factors into web extensions or this test.
Flags: needinfo?(kmaglione+bmo)
Assignee: nobody → bkelly
Status: NEW → ASSIGNED
Flags: needinfo?(bkelly)
Looks like the extension-process-scripts.js code just changed a lot thanks to bug 1368102.  I wonder if this still triggers.
Ugh.  Android mochitests split logcat output from the test output so its really hard to line things up.  And these tests take ~90 minutes to run.
I see this in a failing log:

06-05 10:51:21.922   789   809 E GeckoConsole: [JavaScript Error: "TypeError: win is null" {file: "resource://gre/modules/ExtensionUtils.jsm" line: 143}]
06-05 10:51:22.412   789   809 E GeckoConsole: [JavaScript Error: "TypeError: win is null" {file: "resource://gre/modules/ExtensionUtils.jsm" line: 143}]
06-05 10:51:22.412   789   809 E GeckoConsole: [JavaScript Error: "TypeError: win is null" {file: "resource://gre/modules/ExtensionUtils.jsm" line: 143}]
It seems its coming from the Script.inject() code in ExtensionContent.jsm:

06-05 19:33:49.955   797   817 I Gecko   : ### ### ExtensionContent.jsm Script.inject([object Object]) window: null
06-05 19:33:49.965   797   817 I Gecko   : ### ### ExtensionUtils.jsm _winUtils(null)
It seems inject() already has a short-circuit for `!window` here.  Maybe we can just do that a bit earlier.

https://treeherder.mozilla.org/#/jobs?repo=try&revision=7a4bc7f575123e63465d3b0b986a6d4905b01cf6
The short circuit prevents the exceptions, but the timeout still happens.  I guess that makes sense if its preventing the extension from being injected into the page.

I could really use some help from the extension team to figure out what is going on.
(In reply to Ben Kelly [reviewing, but slowly][:bkelly] from comment #17)
> It seems its coming from the Script.inject() code in ExtensionContent.jsm:
> 
> 06-05 19:33:49.955   797   817 I Gecko   : ### ### ExtensionContent.jsm
> Script.inject([object Object]) window: null
> 06-05 19:33:49.965   797   817 I Gecko   : ### ### ExtensionUtils.jsm
> _winUtils(null)

That usually indicates a timing issue. The injection happens asynchronously, and we null out the contentWindow property if the window navigates or is destroyed after the context is created.

Unfortunately, the logic for injecting things into about:blank is pretty hairy, since about:blank documents don't behave like most other documents. For a start, they can be generated in two completely different ways, both of which are different from the way we generate normal documents, which means that things like document-element-inserted observers don't work. about:blank documents also tend to be created early for new docShells, and then reused to load the actual initial document, which means that knowing whether a new document is *actually* an about:blank document is now always easy.
Flags: needinfo?(kmaglione+bmo)
Can you point me to the code that calls inject() asynchronous for about:blank?
(In reply to Kris Maglione [:kmag] (busy; behind on reviews) from comment #24)
> (In reply to Ben Kelly [reviewing, but slowly][:bkelly] from comment #22)
> > Can you point me to the code that calls inject() asynchronous for
> > about:blank?
> 
> All of the asynchrony is at:
> 
> http://searchfox.org/mozilla-central/rev/
> 20963d7269b1b14d455f47bc0260d0653015bf84/toolkit/components/extensions/
> ExtensionContent.jsm#255-265,325-329

Does it make sense that:

    let context = this.extension.getContext(window);

Can return a context with null `context.contentWindow`?  Could we fix the contentWindow property there if we see its currently null?
Flags: needinfo?(kmaglione+bmo)
Also, since there are known issues with the about:blank handling, would you be ok if I disabled this test on android debug?  The android emulator is exceptionally slow compared to real devices (even low end ones).
(In reply to Ben Kelly [reviewing, but slowly][:bkelly] from comment #25)
> Does it make sense that:
> 
>     let context = this.extension.getContext(window);
> 
> Can return a context with null `context.contentWindow`?

It can't return a context with a null contentWindow, but the contentWindow property can become null if the window navigates after the context is created, which is what's happening here.

> Could we fix the contentWindow property there if we see its currently null?

No, the property is null intentionally, because when the inner window that the
context was created for is not current, there's no way to access it from
JavaScript, and accidentally accessing the incorrect inner window instead
would be a major security issue.
Flags: needinfo?(kmaglione+bmo)
(In reply to Kris Maglione [:kmag] (busy; behind on reviews) from comment #28)
> (In reply to Ben Kelly [reviewing, but slowly][:bkelly] from comment #25)
> > Does it make sense that:
> > 
> >     let context = this.extension.getContext(window);
> > 
> > Can return a context with null `context.contentWindow`?
> 
> It can't return a context with a null contentWindow, but the contentWindow
> property can become null if the window navigates after the context is
> created, which is what's happening here.

I don't understand how that could be happening in this test.
(In reply to Ben Kelly [reviewing, but slowly][:bkelly] from comment #29)
> (In reply to Kris Maglione [:kmag] (busy; behind on reviews) from comment
> #28)
> > (In reply to Ben Kelly [reviewing, but slowly][:bkelly] from comment #25)
> > > Does it make sense that:
> > > 
> > >     let context = this.extension.getContext(window);
> > > 
> > > Can return a context with null `context.contentWindow`?
> > 
> > It can't return a context with a null contentWindow, but the contentWindow
> > property can become null if the window navigates after the context is
> > created, which is what's happening here.
> 
> I don't understand how that could be happening in this test.

The context is created. We call promiseDocumentReady or promiseDocumentLoaded and await the result. The page navigates, fires the DOMContentLoaded and load events, which causes the load promises to resolve, and then immediately fires pagehide, which causes the contentWindow property to become null. The promise resolution handler triggers the async injectInto function to resume, with the contentWindow property now null, which then calls inject().

We should probably null check at that point, and do something sensible rather than calling inject(), but I don't think that would fix the failing test.
This test does not navigate the page after the initial load AFAICT.  So I don't understand where that navigate comes from.
Most likely it's a race from the initial about:blank document being replaced with something else.
The about:blank is a plain old iframe in the loaded document:

https://dxr.mozilla.org/mozilla-central/source/toolkit/components/extensions/test/mochitest/file_with_about_blank.html

I don't think that should be getting a replacement.  But even if it did, shouldn't the test pick of the new window that replaces it and then pass?
(In reply to Ben Kelly [reviewing, but slowly][:bkelly] from comment #34)
> The about:blank is a plain old iframe in the loaded document:
> 
> https://dxr.mozilla.org/mozilla-central/source/toolkit/components/extensions/
> test/mochitest/file_with_about_blank.html
> 
> I don't think that should be getting a replacement.

In a lot of circumstances, it should, if we're forced to create an about:blank
content viewer before the explicit about:blank load happens.

> But even if it did, shouldn't the test pick of the new window that replaces
> it and then pass?

If the new window were anything other than about:blank, yes, since we use
document-element-inserted observers to detect most loads. But about:blank
loads don't trigger document-element-inserted, and content-document-global-created
is not triggered when the original about:blank document is reused.
It looks like in the failure cases we are getting some additional pagehide events which are clearing the contentWindow properties on the context.

At this point I think I've gone as far as I have time.  I'm going to disable on android debug for now.  If the web extension team sorts out the races in play here they can re-enable.

Does that sound reasonable to everyone?
Yes, that sounds fine to me.
This disables the test on android.  I included opt since brasstacks suggests it fails there as well at a much lower rate.

https://treeherder.mozilla.org/#/jobs?repo=try&revision=1ac81f3ceb87c2cb94528f96375b86564e8b1edf
Attachment #8875881 - Flags: review?(kmaglione+bmo)
Attachment #8875881 - Flags: review?(kmaglione+bmo) → review+
Pushed by bkelly@mozilla.com:
https://hg.mozilla.org/integration/mozilla-inbound/rev/40a8be84033f
Disable test_ext_contentscript_about_blank.html on android due to intermittent failures. r=kmag
https://hg.mozilla.org/mozilla-central/rev/40a8be84033f
Status: ASSIGNED → RESOLVED
Closed: 2 years ago
Resolution: --- → FIXED
Target Milestone: --- → mozilla55
Whiteboard: [stockwell needswork] → [stockwell disabled]
This is still failed on Android, so I'm going to reopen it. It looks like there was a lot of investigation and work done on this already, so maybe someone can pick up where that left off. It's not a high priority, but we shouldn't just ignore the fact that it's disabled.
Assignee: bkelly → nobody
Status: RESOLVED → REOPENED
Component: WebExtensions: Untriaged → WebExtensions: General
Priority: -- → P3
Resolution: FIXED → ---
This was disabled on android in comment 41.  When was it re-enabled?  Perhaps it should block that bug.
Product: Toolkit → WebExtensions
You need to log in before you can comment on or make changes to this bug.