Closed Bug 1673954 Opened 5 years ago Closed 2 years ago

[gv-junit-fis] NavigationDelegate tests: `session.evaluateJS` times out due to no open ports available in GeckoSessionTestRule

Categories

(GeckoView Graveyard :: Sandboxing, defect, P3)

Unspecified
All

Tracking

(firefox120 wontfix, firefox121 fixed)

RESOLVED FIXED
121 Branch
Tracking Status
firefox120 --- wontfix
firefox121 --- fixed

People

(Reporter: agi, Assigned: m_kato)

References

(Blocks 1 open bug)

Details

(Whiteboard: [fission:android:m2] [geckoview:2022q3][geckoview:m115?][geckoview:m119?][fxdroid][foundation])

Attachments

(2 files)

This bug tracks enabling all NavigationDelegateTest to run on Fission.

Summary: Run NavigationDelegateTest on Fenix → Run NavigationDelegateTest on Fission
Severity: -- → N/A
Priority: -- → P2
Fission Milestone: --- → Future

Tracking this bug for Android Fission milestone M2 (pass tests with Fission enabled).

Whiteboard: [fission:android:m2]
Fission Milestone: Future → ---
Whiteboard: [fission:android:m2] → [fission:android:m2] [geckoview:2022q3?]
Whiteboard: [fission:android:m2] [geckoview:2022q3?] → [fission:android:m2] [geckoview:2022q3]

Moving Android Fission bugs to the new GeckoView::Sandboxing component.

Component: General → Sandboxing
Summary: Run NavigationDelegateTest on Fission → [gv-junit-fis] Run NavigationDelegateTest on Fission
Whiteboard: [fission:android:m2] [geckoview:2022q3] → [fission:android:m2] [geckoview:2022q3][geckoview:m115?]

I am converting this bug. We actually have three errors in NavigationDelegate tests, and let us have this bug documenting one of them.

This error lies in our test framework. To assert on some things, we use session.evaluateJS method, which relies on the test-helper web extension. That web extension essentially populates the list of open ports (mPorts) in GeckoSessionTestRule on pageshow event, but on Fission that doesn't work due to the process switch. I checked the production code in Web Extension framework - everything there works correctly. We might want to rethink the way we populate the list of open ports in GeckoSessionTestRule, or possibly find another way to do the assertions.

Currently the affected tests are:

  • NavigationDelegateTest#loadFileNotFound
  • NavigationDelegateTest#bypassHTTPSOnlyError
Summary: [gv-junit-fis] Run NavigationDelegateTest on Fission → [gv-junit-fis] NavigationDelegate tests: `session.evaluateJS` times out due to no open ports available in GeckoSessionTestRule
Type: task → defect
Severity: N/A → S3
Assignee: nobody → ohall
Whiteboard: [fission:android:m2] [geckoview:2022q3][geckoview:m115?] → [fission:android:m2] [geckoview:2022q3][geckoview:m115?][geckoview:m119?][fxdroid][foundation]
Priority: P2 → P1
Priority: P1 → P3

test-support.js attaches the event listener for pageshow, but it is attached on old window object... Is there a way to reload content script of web extension after switching process?

(Unassigning myself because I'm focusing on translations at the moment.)

I have in my notes that the event listener can catch a focus event in test-support.js. Not sure if that observation could be useful for finding the right window or reloading or anything, but thought I'd mention it on the off-chance it was useful.

Assignee: ohall → nobody

(In reply to Makoto Kato [:m_kato] from comment #4)

test-support.js attaches the event listener for pageshow, but it is attached on old window object... Is there a way to reload content script of web extension after switching process?

Can we consider some other events? Are there more global events, that attach to the parents of those window objects? Can we try the focus event that Olivia mentions?

Re: the way to reload the content script - I would consult Robwu on this, he mentioned something like that when I was debugging this. Rob, do you have ideas, can this be done?

Flags: needinfo?(rob)

Makoto, I'll assign the bug to you, thank you for your help!

Assignee: nobody → m_kato

(In reply to Makoto Kato [:m_kato] from comment #4)

test-support.js attaches the event listener for pageshow, but it is attached on old window object... Is there a way to reload content script of web extension after switching process?

This claim is dubious - test-support.js is a content script, whose lifetime is tied to the document it's running in. If there is a process switch, that would imply that there is a new document that replaces the previous one, which in turn implies that a new port should be opened (from the new content script).

If you're debugging, it is possible for the "old" window (inner window) to be around, if the document is stored in the bfcache (and before that happens, "pagehide" is fired). If the document is revived by navigating to it, then "pageshow" is fired in that original content script (with the "old" window being the current window again).

I took a quick look at GeckoSessionTestRule to see if anything stood out, and the logic seems to be correct, specifically at:

  • MessageDelegate to receive ports and process messages: lines 2323-2392
  • webExtensionApiCall: Connecting to the test-support extension: lines 2758-2799
  • waitForMessage to wait for the response: lines 2494-2521
    • The wait at line 2496 should eventually return because MessageDelegate has logic in detach to release the message. And the port is removed upon disconnect, so at worst line 2446 or line 2797 would be a NPE if somehow the port doesn't exist.

(In reply to [:owlish] 🦉 PST from comment #6)

Re: the way to reload the content script - I would consult Robwu on this, he mentioned something like that when I was debugging this. Rob, do you have ideas, can this be done?

Are you asking the right question? I think that it's not a question of how to reload the content script, but how to detect the load completion (or at least get the ports to connect).

Given the circumstances (content script loads without fission, and the apparent observation of the port not connecting), I wonder whether it is possible for the content script to have run too late (after the "pageshow"). The `"run_at": "document_start" declaration of test-support.js in manifest.json requests immediate execution at document_start. But if the extension loads after the document has been initialized, it's possible for the content script to be executed after that point, potentially even after "pageshow" has already fired. So I suggest to check whether this is happening, e.g. by adding logging.

Flags: needinfo?(rob)

Thank you so much, Rob!

But if the extension loads after the document has been initialized, it's possible for the content script to be executed after that point, potentially even after "pageshow" has already fired.

According to my notes, that is exactly what happens. That is why in the test itself we have two page loads - one is to load the conent script so that it sets the pageshow listeners, and the second one to get those listeners to fire. Speaking of which, Makoto, another idea for you: what if we changed that flow in the test to first load a normal page, and only then the error (instead of the two error loads), would that help?

after switching process

Sorry, I only now realized - there should be no process switch on the fresh rebase. I merged a patch to prevent that from happening specifically on Android when loading error pages as a temporary solution.

Since I don't know web extension code, I have a question. When setting content_scripts property, where is this script loaed in gecko's code? I should debug web extension code internally for GeckoView.

This log is example when this occurs. pageshow is fired, but content scripts doesn't receive it. I suspect GeckoView's web extensions implementation if this doesn't occur on Desktop.

09-19 14:40:57.468  7681  7696 I Isolated Web Content: loading test-support.js
...
09-19 14:40:57.529  7547  7566 D GeckoViewXUL: WillChangeBrowserRemoteness
09-19 14:40:57.530  7547  7566 D GeckoViewMediaControl: onDestroyBrowser
09-19 14:40:57.531  7681  7696 I Isolated Web Content: pagehide on test-support.js
09-19 14:40:57.531  7681  7696 E Isolated Web Content: [JavaScript Error: "TypeError: can't access property "disconnect", backgroundPort is null" {file: "moz-extension://ea2bd5a4-08f2-4bcf-a708-d3a7a2302198/test-support.js" line: 57}]
09-19 14:40:57.531  7681  7696 E Isolated Web Content: @moz-extension://ea2bd5a4-08f2-4bcf-a708-d3a7a2302198/test-support.js:57:3
09-19 14:40:57.532  7681  7696 D GeckoViewSelectionActionDelegate[C]: handleEvent: visibilitychange
09-19 14:40:57.532  7681  7696 D GeckoViewClipboardPermissionChild[C]: handleEvent: pagehide
09-19 14:40:57.533  7681  7696 D GeckoViewAutoFill[C]: handleEvent: pagehide
...
09-19 14:40:57.535  7547  7566 D GeckoViewXUL: DidChangeBrowserRemoteness
09-19 14:40:57.536  7547  7566 D GeckoViewNavigation: sessionContextId=null
09-19 14:40:57.536  7547  7566 D GeckoViewMediaControl: onInitBrowser
...
09-19 14:40:57.542  7547  7566 D GeckoViewXUL: receiveMessage GeckoView:ContentModuleLoaded {}
...
09-19 14:40:57.545  7547  7566 E GeckoConsole: [JavaScript Error: "HTTPS-Only Mode: Upgrading insecure request “https://expired.example.com/” failed. (M21-C8181)"]
09-19 14:40:57.547  7681  7696 D GeckoViewContent[C]: handleEvent: pageshow
09-19 14:40:57.547  7681  7696 D GeckoViewProgressDelegate[C]: handleEvent: pageshow
09-19 14:40:57.548  7681  7696 D GeckoViewAutoFill[C]: handleEvent: pageshow
09-19 14:40:57.549  7547  7566 D GeckoViewProgress: receiveMessage: pageshow
Flags: needinfo?(rob)

(Or I suspect GeckoView's navigation, so I have to make sense web extension's load logic with content script).

(In reply to [:owlish] 🦉 PST from comment #9)

Thank you so much, Rob!

But if the extension loads after the document has been initialized, it's possible for the content script to be executed after that point, potentially even after "pageshow" has already fired.

According to my notes, that is exactly what happens. That is why in the test itself we have two page loads - one is to load the conent script so that it sets the pageshow listeners, and the second one to get those listeners to fire.

If you are indeed encountering the issue of the content script executing after "pageshow" (which seems to be the case according to the log from comment 10), then I suggest to change the implementation as follows:

  • In test-support/manifest.json, replace document_start with document_end. The script only uses pageshow/pagehide, and these events do not require very early registration. The use of document_start has unwanted side effects such as blocking the DOM parser until the scripts are ready to run.
  • Ordinarily, the first pageshow fires after window.onload, which in turn fires after DOMContentLoaded. The latter event matches with document_end, which also why I'm suggesting to replace (unnecessarily early) document_start with document_end.
  • Use the following logic in test-support.js:
let backgroundPort = null;
let nativePort = null;

function connect() {
  if (nativePort) {
    return; // Already initialized
  }
  // Rest of logic from https://searchfox.org/mozilla-central/rev/c56cc57d547a5febcc34160e581f9d58c0326aa7/mobile/android/geckoview/src/androidTest/assets/web_extensions/test-support/test-support.js#9-42
}

window.addEventListener("pageshow", connect);
window.addEventListener("pagehide", () => {
  backgroundPort.disconnect();
  nativePort.disconnect();
  nativePort = backgroundPort = null;
});

// Connect ASAP.
connect();

Alternatively, if you don't want the port to connect at DOMContentLoaded, but at the original point (i.e. around window.onload), wrap the last call in a document.readyState === "complete" check OR use document_idle instead of document_end.

load a normal page, and only then the error (instead of the two error loads), would that help?

The content script is only run in documents that match <all_urls>, usually http(s) and file:. They don't run in error pages. If you want it to run in more contexts, you have to add the about:-URL explicitly to the matches section in manifest.json.

(In reply to Makoto Kato [:m_kato] from comment #10)

Since I don't know web extension code, I have a question. When setting content_scripts property, where is this script loaed in gecko's code? I should debug web extension code internally for GeckoView.

Not sure how much that would help - adding dump(); or console.log calls in the content script (test-support.js) should provide enough logs in the console. To directly answer your question, see https://searchfox.org/mozilla-central/rev/c56cc57d547a5febcc34160e581f9d58c0326aa7/toolkit/components/extensions/ExtensionContent.sys.mjs#561-564,580-584

Flags: needinfo?(rob)

Thanks for more information.

Best workaround is that loading content script after DidChangeBrowserRemoteness, but content process won't receive this notification. Even if document_end, this issue isn't resolved. After document element is created (initial-document-element-inserted) then onDOMContentLoaded is fired, WillChangeBrowserRemoteness is fired (It means that switching process occurs). After WillChangeBrowserRemoteness is fired, window object seems to be re-created with new window ID... pageshow is dispatched with new window id when this issue occurs.

And, when this problem doesn't occur (extremely low probability), content script is loaded after DidChangeBrowserRemoteness.

Reproduced test throws handleLoadError then switching process. GeckoView has delegation for handleLoadError, so I suspect that this causes something issues if this issue doesn't occur on Desktop... So I need to try to create a sample for desktop if possible.

Use the following logic in test-support.js:

When this occur, 'pagehide' is received at first. So since native port is closed, we need something trigger to open native port by something way. "pageshow" is dispatched with new window id, so no listener...

And this issue may not occur if bug 1849060 is fixed. But I am not sure.

(In reply to Makoto Kato [:m_kato] from comment #15)

And this issue may not occur if bug 1849060 is fixed. But I am not sure.

I doubt this issue is blocked on that bug. You can try replacing the current HTTP page with a resource type URL or even use regular Gecko's about: error page to see if that helps.

(In reply to [:owlish] 🦉 PST from comment #16)

(In reply to Makoto Kato [:m_kato] from comment #15)

And this issue may not occur if bug 1849060 is fixed. But I am not sure.

I doubt this issue is blocked on that bug. You can try replacing the current HTTP page with a resource type URL or even use regular Gecko's about: error page to see if that helps.

This seems to be related to error page handing. If we don't replace error page with something by onLoadError, I cannot reproduce this even if switching process. Both tests replace error page. Then in error page, since content script isn't loaded correctly, evaluateJS won't work. When loading replaced error URL, document's insertion event isn't fired (nsContentSink::NotifyDocElementCreate isn't called), so extension component doesn't insert content script... I need to debug DOM side and GV's LoadURIDelegate and this handling in docshell.

  • If onLoadError returns null, document is nothing. It means that document element isn't created then web extension won't work.
  • If we return about:* URL etc with "match_about_blank: true", content script is loaded correctly. But it need to modify content script like comment #12.

GeckoView can hook error page by NavigationDelegate.onLoadError. But
when this delegation returns null, test-support Web extensions won't
be loaded correctly.

A content script of Web extensions is loaded when first document element
is inserted. But when onLoadError returns null, there is no document
element. So content script won't be loaded on this situation.

If not fission, content script is already loaded. But if fission, content
script may not be loaded if switching process by loading new URL.

So we should return non-null string if we use evaluateJS.

Also, document.reloadWithHttpsOnlyException may causes that the document
will be unloaded immediately before native message isn't handled. So I
use setTimeout to call it.

Pushed by m_kato@ga2.so-net.ne.jp: https://hg.mozilla.org/integration/autoland/rev/38f82f76af47 onLoadError shouldn't return null if using test-support extension APIs. r=geckoview-reviewers,owlish

(I request to back out since gv-junit failure rate is incremented due to PortDisconnectException)

Backed out as requested by dev for causing gv-junit failures

Backout link

Push with failures

Failure log

Flags: needinfo?(m_kato)

Also org.mozilla.geckoview.test.NavigationDelegateTest#redirectLoadIframe is sometimes failure after landing this. I guess that this is another bug (we need to consdier HTTPS first?)

Flags: needinfo?(m_kato)

Do you think we need to block this one on some other bug, Makoto? What the next steps would be for this ticket?

Flags: needinfo?(m_kato)
  • redirectLoadIframe is failure high frequencies. This test uses another URL in automation, but I guess that it can fix by modifying iframe_redirect_automation.html. (http -> https). This was bug 1735786.
  • Timing issue of document.reloadWithHttpsOnlyException(). this might causes page navigation before addon replies response.
Flags: needinfo?(m_kato)

Timing issue of document.reloadWithHttpsOnlyException(). this might causes page navigation before addon replies response.

Hmm, when this occurs, content script isn't executed via extensions. No GeckoView:WebExtension:PortMessage message?

Makoto, must we use a web extension for evaluateJS (maybe not the whole of evaluateJS, but in these specific cases)? Is there any other way to achieve this?

Flags: needinfo?(m_kato)

Zak, since you just recently worked with web extensions - do you have any ideas?

Flags: needinfo?(zmckenney)

(In reply to [:owlish] 🦉 PST from comment #28)

Makoto, must we use a web extension for evaluateJS (maybe not the whole of evaluateJS, but in these specific cases)? Is there any other way to achieve this?

This failure is communication port is immediately closed by executing JS. Even if document.reloadWithHttpsOnlyException() is executed on content script, the navigation by it is stopped unfortunately. When I try many workarounds, I will add simple workaround that is better.

https://treeherder.mozilla.org/jobs?repo=try&revision=6d235179f652fc0ea07ab2c8d46da863b97674f0 seems to fix it.

Flags: needinfo?(m_kato)
Attachment #9355055 - Attachment description: Bug 1673954 - onLoadError shouldn't return null if using test-support extension APIs. r=#geckoview-reviewers → Bug 1673954 - Part 1. onLoadError shouldn't return null if using test-support extension APIs. r=#geckoview-reviewers

When landing previous fix, redirectLoadIframe's failure rate is
incremented. I think this is same as bug 1735786.

Depends on D189217

Pushed by m_kato@ga2.so-net.ne.jp: https://hg.mozilla.org/integration/autoland/rev/e10f0ce61cc0 Part 1. onLoadError shouldn't return null if using test-support extension APIs. r=geckoview-reviewers,owlish https://hg.mozilla.org/integration/autoland/rev/d0c4fdebcd36 Part 2. Use HTTPS to avoid https-first redirection. r=geckoview-reviewers,calu
Status: NEW → RESOLVED
Closed: 2 years ago
Resolution: --- → FIXED
Target Milestone: --- → 121 Branch
Flags: needinfo?(zmckenney)
Product: GeckoView → GeckoView Graveyard
You need to log in before you can comment on or make changes to this bug.

Attachment

General

Created:
Updated:
Size: