Closed Bug 1462121 Opened 2 years ago Closed Last year

onPageHide logic in BaseContext can be bypassed

Categories

(WebExtensions :: General, enhancement, P3)

60 Branch
enhancement

Tracking

(firefox64 fixed)

RESOLVED FIXED
mozilla64
Tracking Status
firefox64 --- fixed

People

(Reporter: robwu, Assigned: robwu)

References

Details

Attachments

(2 files, 5 obsolete files)

Bug 1292369 introduced a pagehide listener to clear the window reference.
If the listener is registered after a previous listener, then the listener can be suppressed, resulting in leakage of the contentWindow.

Test case, almost identical to bug 1292332 (added one step and re-ordered the initial steps to trigger the condition):

1. Visit example.com
2. Open the console, run:
addEventListener("pagehide", function(e) {
    console.log("pagehide at " + document.URL);
    e.stopImmediatePropagation();
});
3. Load the add-on from bug 1292332
   (Now a content style will be inserted that turns the page pink, and the onPageHide listener in BaseContext will be registered)
4. Navigate to example.com/somethingelse
   (Now the previous page will go to the bfcache)
   (The new page is also pink)
5. Visit about:memory
6. Click on the "Minimize memory" button - this will evict bfcache entries.
7. Look at the example.com tab

Expected:
- The page should still be pink.

Actual:
- The page is no longer pink because the extension stylesheet was removed, because the pagehide listener from step 2 suppressed the pagehide listener at https://searchfox.org/mozilla-central/rev/00dd116638001772fe354b081353b73f1cad405d/toolkit/components/extensions/ExtensionCommon.jsm#164-173,176


Other observations:
- The docShell property is no longer used.
- The "next tick" call in the referenced code snippet immediately happens when the "pagehide" listener has fired. The intention was to defer the cleanup logic until after all "pagehide" listeners were triggered, but since the "pagehide" event is triggered without JS stack, the microtask queue is immediately flushed.

This bug could be resolved by adding the listener to the system group. That might also fix the issue in the last observation.
We can probably just use a bubble phase listener on the chrome event handler, at this point. We should confirm that content listeners can't block the dispatch of the chrome phase of the event, though. I don't think they can.
(In reply to Kris Maglione [:kmag] (long backlog; ping on IRC if you're blocked) from comment #1)
> We can probably just use a bubble phase listener on the chrome event
> handler, at this point. We should confirm that content listeners can't block
> the dispatch of the chrome phase of the event, though. I don't think they
> can.

The pagehide event does not bubble (according to MDN), and according to the following test case:
data:text/html;charset=utf-8,<iframe srcdoc='<script>
console.log("load start");
addEventListener("pageshow",e=>{console.log("pageshow")});
addEventListener("pagehide",e=>console.log("pagehide 1 useCapture=false"),false);
addEventListener("pagehide",e=>console.log("pagehide 2 useCapture=true"),true);
addEventListener("pagehide",e=>console.log("pagehide 3 useCapture=false"),false);
addEventListener("pagehide",e=>console.log("pagehide 4 useCapture=true"),true);
location.href="data:,see console"
</script>'></iframe>

The output is this, which shows that the third flag (capture) is ignored, and that the events are dispatched in the order of registration:
load start
pageshow
pagehide 1 useCapture=false
pagehide 2 useCapture=true
pagehide 3 useCapture=false
pagehide 4 useCapture=true


By adding to the system group, I can confirm that the event handler is always run after any content handlers.
This also implies that the restore-contentWindow/active-on-pageshow-logic needs to be revised (because we want it to run it before anything else).

Another thing that is unaccounted for is that web pages can forge pagehide/pageshow events. I'll show a demo in bug 1459404.
Blocks: 1459404
(In reply to Rob Wu [:robwu] from comment #2)
> Another thing that is unaccounted for is that web pages can forge
> pagehide/pageshow events. I'll show a demo in bug 1459404.

My demo in https://bugzil.la/1459404#c4 did not demonstrate this thing (it uses this bug to simulate a work-around).
But bug 1459404 is one of the things that can be caused by forged pagehide/pageshow events.
Comment on attachment 8976599 [details]
Bug 1462121 - Ignore untrusted pagehide/pageshow events

https://reviewboard.mozilla.org/r/244696/#review250766

::: toolkit/components/extensions/ExtensionCommon.jsm:165
(Diff revision 1)
>      // Use an event target object (instead of a function) so that the closure
>      // can be released upon closing this context, even if the document is in the
>      // bfcache.
>      let handler = {
>        handleEvent: (event) => {
> -        if (event.target !== document) {
> +        if (event.target !== document || !event.isTrusted) {

I'll remove the `event.isTrusted` check here and pass `false` as the fourth parameter to `addEventListener` to ignore untrusted events.
Comment on attachment 8976598 [details]
Bug 1462121 - Properly unregister pagehide/pageshow from pages in bfcache

https://reviewboard.mozilla.org/r/244694/#review250776

::: toolkit/components/extensions/ExtensionCommon.jsm:186
(Diff revision 2)
> -    contentWindow.addEventListener("pageshow", onPageShow, {mozSystemGroup: true});
> +    contentWindow.addEventListener("pagehide", handler, {mozSystemGroup: true});
> +    contentWindow.addEventListener("pageshow", handler, {mozSystemGroup: true});
>      this.callOnClose({
>        close: () => {
> -        onPageHide();
> -        if (this.active) {
> +        if (isValidInnerWindow()) {
> +          contentWindow.removeEventListener("pagehide", handler, {mozSystemGroup: true});

I would really like to make this unconditional, but onfortunately it doesn't seem to be possible to remove an event listener for a specific inner window.

C++ has GetInnerWindowWithId, but that's not exposed to JS.
Priority: -- → P3
Attachment #8976596 - Flags: review?(kmaglione+bmo) → review?(tomica)
Attachment #8976597 - Flags: review?(kmaglione+bmo) → review?(tomica)
Attachment #8976598 - Flags: review?(kmaglione+bmo) → review?(tomica)
Attachment #8976599 - Flags: review?(kmaglione+bmo) → review?(tomica)
Attachment #8976600 - Flags: review?(kmaglione+bmo) → review?(tomica)
Product: Toolkit → WebExtensions
Comment on attachment 8976596 [details]
Bug 1462121 - Register pagehide/pageshow in system group

https://reviewboard.mozilla.org/r/244690/#review261372

::: toolkit/components/extensions/test/xpcshell/test_ext_contentscript_context_isolation.js:40
(Diff revision 2)
> +</script>`);
> +});
> +
> +add_task(async function test_contentscript_context_isolation() {
> +  function contentScript() {
> +    browser.test.sendMessage("content-script-ready");

This should go at the end.

::: toolkit/components/extensions/test/xpcshell/test_ext_contentscript_context_isolation.js:42
(Diff revision 2)
> +
> +add_task(async function test_contentscript_context_isolation() {
> +  function contentScript() {
> +    browser.test.sendMessage("content-script-ready");
> +
> +    exportFunction(browser.test.sendMessage, window, {defineAs: "browserTestSendMessage"});

I'm guessing we can't use `postMessage` instead because of race condition?

::: toolkit/components/extensions/test/xpcshell/test_ext_contentscript_context_isolation.js:66
(Diff revision 2)
> +      "content_script.js": contentScript,
> +    },
> +  });
> +
> +  let contentPage = await ExtensionTestUtils.loadContentPage("http://example.com/bfcachetestpage");
> +  await extension.startup();

Please add an assertion (or at least a comment) that the page content listeners are setup before the extension is loaded for this test.

::: toolkit/components/extensions/test/xpcshell/test_ext_contentscript_context_isolation.js:86
(Diff revision 2)
> +  });
> +
> +  await extension.awaitMessage("content-script-hide");
> +
> +  await contentPage.spawn(null, async () => {
> +    Assert.equal(this.context.contentWindow, null, "Context's contentWindow property is null");

It's not immediately obvious that this is the main point of the the whole patch/test.  Perhaps expand the message or add a line of comment?
Attachment #8976596 - Flags: review?(tomica) → review+
Comment on attachment 8976597 [details]
Bug 1462121 - Remove unused docShell member

https://reviewboard.mozilla.org/r/244692/#review261384
Attachment #8976597 - Flags: review?(tomica) → review+
Comment on attachment 8976598 [details]
Bug 1462121 - Properly unregister pagehide/pageshow from pages in bfcache

https://reviewboard.mozilla.org/r/244694/#review261424

::: toolkit/components/extensions/ExtensionCommon.jsm:160
(Diff revision 2)
> +    // Use an event target object (instead of a function) so that the closure
> +    // can be released upon closing this context, even if the document is in the
> +    // bfcache.

Reasoning about all the code below is making my head hurt.

"Can we just sprinkle a few `Cu.getWeakReference()`s around this code and call it done."

But to be serious, this could probably be solved in a simpler way by passing weak references instead.
Attachment #8976598 - Flags: review?(tomica)
Comment on attachment 8976599 [details]
Bug 1462121 - Ignore untrusted pagehide/pageshow events

https://reviewboard.mozilla.org/r/244696/#review261434
Attachment #8976599 - Flags: review?(tomica) → review+
Comment on attachment 8976600 [details]
Bug 1462121 - Improve reliability of context.contentWindow/active

https://reviewboard.mozilla.org/r/244698/#review261436

::: commit-message-70938:3
(Diff revision 2)
> +Previously, the context.contentWindow and context.active were unreliable
> +around navigations from/to the bfcache, as they were not set during the
> +pageshow event, and neither after pagehide (including the unload event).

Is this causing actual problems?
Attachment #8976600 - Flags: review?(tomica)
Comment on attachment 8976598 [details]
Bug 1462121 - Properly unregister pagehide/pageshow from pages in bfcache

https://reviewboard.mozilla.org/r/244694/#review261424

> Reasoning about all the code below is making my head hurt.
> 
> "Can we just sprinkle a few `Cu.getWeakReference()`s around this code and call it done."
> 
> But to be serious, this could probably be solved in a simpler way by passing weak references instead.

How can weak pointers help me here? I would need a weak reference to an inner window, but AFAIK I cannot get a direct reference to an inner window, can I? (there is `GetInnerWindowWithId` in C++, but that is not exposed to JS).
Comment on attachment 8976600 [details]
Bug 1462121 - Improve reliability of context.contentWindow/active

https://reviewboard.mozilla.org/r/244698/#review261436

> Is this causing actual problems?

Functionality-wise in our current code? No, except for unlikely races (=async extension event is scheduled somewhere between pagehide and unload).
Maintainability / readability? Yes: It is more difficult to reason about the value of `context.contentWindow` or `context.active` for logic that runs at (or before) pageshow, or at unload (=after pagehide).

For example, in the following patch, the callback that is responsible for triggering "port.onDisconnect" is called at pageshow: https://reviewboard.mozilla.org/r/246110/diff/1/#1
This callback first checks `context.active` before firing the event (see registerOnDisconnect).
Now, question for the code reader: Will the logic run as expected (at the pageshow event)?

The answer is yes: `context.active = true` is set in the pageshow event of in the `pageshow` handler in setContentWindow in ExtensionCommon.jsm, and the `pageshow` handler is registered as soon as `setContentWindow` is called (which is called when the context is created for an extension script or content script) - so very early. The `pageshow` event of that other patch is added at the `pagehide` event (so definitely after the `pageshow` handler of `setContentWindow`). Therefore, when the `pageshow` handler of that other patch is called, `context.active` is `true` as desired, and the logic runs correctly (provided that the `pageshow` handlers are added in the system group (mozSystemGroup); otherwise that handler would run before the handler of `setContentWindow`, and `context.active` would be false and the event listeners would not run).

The last explanatory paragraph has many conditions and references to different places in the code base. I don't want future code readers (including myself) to run that analysis whenever they want to know whether API methods or events work as expected at the pageshow/unload events (this analysis is relevant if there is a use of `context.active` or `context.contentWindow` anywhere in the call stack).

The alternative to using these getters is to add a comment to .active and .contentWindow, to document the lifetime and semantics. These properties will have an unexpected value under the following circumstances/events:
- pageshow without system group (=normal add-on code).
- pagehide with system group (=our code)
- unload with and without system group (=normal add-on code and our code).

Do you want to keep the patch in its current state, or remove the custom getter and have code comments instead?
(In reply to Rob Wu [:robwu] from comment #21)
> How can weak pointers help me here? I would need a weak reference to an
> inner window, but AFAIK I cannot get a direct reference to an inner window,
> can I? (there is `GetInnerWindowWithId` in C++, but that is not exposed to
> JS).

Sorry for not being more specific, or if I misunderstood the issue, I was going from your commit message:

> 1. Extension has a content script / style.
> 2. Page moves to the bfcache due to navigation.
> 3. Extension is unloaded.

> After the last step, the context should have been released.
> That did not happen, and consequently the extension context cannot be
> garbage-collected. This caused bug 1459404.

My understanding is that leak is caused by:
 - the content window in BFcache holding a (strong) reference to the onPageHide/Show listeners,
 - which in turn keep the extension BaseContext alive because of the closure reference to `this`.

If that is correct, swapping either of those references for a weak one would break the chain and allow the BaseContext to be GCed on unload.  If that's not the cause of the leaks, could you please explain what is?


> Do you want to keep the patch in its current state, or remove the custom
> getter and have code comments instead?

I didn't investigate into details of the patch because of that question I had and because the code was based on Part 3 that we are still discussing, so I'll revisit once we have part 3 figured out.
No longer blocks: 1459404
(In reply to Tomislav Jovanovic :zombie from comment #23)
> My understanding is that leak is caused by:
>  - the content window in BFcache holding a (strong) reference to the
> onPageHide/Show listeners,
>  - which in turn keep the extension BaseContext alive because of the closure
> reference to `this`.
> 
> If that is correct, swapping either of those references for a weak one would
> break the chain and allow the BaseContext to be GCed on unload.  If that's
> not the cause of the leaks, could you please explain what is?

The leak is caused by:
- BaseContext registers itself as a pagehide+pageshow handler for some (inner) window.
- When the user navigates elsewhere, the page moves into the bfcache (and the inner window changes).
- When the extension unloads, the pagehide/pageshow handler cannot be removed because contentWindow.removeEventListener attempts to remove the listener from a different inner window, unrelated to the original one.
- The inner window in the bfcache has a valid reason for staying in memory, so the pagehide+pageshow listeners keep the BaseContext instance alive (and all properties attached to it, such as jsonSandbox in bug 1459404).

There is no use in using a weak ref to the window, because that would be a weak ref to the outer window.
Just a weak ref would indeed solve the issue of GC, but not fix the other issue.

To be explicit, the issues that I aim to solve are:
1) Prevent memory leak when extension unloads while extension is in the page.
2) Prevent pages from blocking the context.active/contentWindow management (by using system event listeners).
3) Ensure that extension APIs work correctly when called during pageshow.

The third issue is introduced by the fix for the second issue:
- Currently, the "pageshow" listener is registered before the scripts of an extension run in the content script/extension script context (potentially after scripts in the page - hence the second issue).

- By using the system group, I fix the second issue, but the "pageshow" listener is only dispatched after the other pageshow listeners are dispatched in the extension content. Consequently, a dynamic getter for contentWindow/active is needed after pageshow is fired; otherwise extension APIs will not work when called during a pageshow event in an extension.
Attachment #8976596 - Attachment is obsolete: true
Attachment #8976597 - Attachment is obsolete: true
Attachment #8976598 - Attachment is obsolete: true
Attachment #8976599 - Attachment is obsolete: true
Comment on attachment 8976600 [details]
Bug 1462121 - Improve reliability of context.contentWindow/active

Going to move the review over to Phabricator.
Attachment #8976600 - Attachment is obsolete: true
Attachment #8976600 - Flags: review?(tomica)
- Register pagehide/pageshow events in the system group and ignore
  synthetic events to avoid interference from web pages.
- Remove unused docShell member.
- Fix memory leak in bfcached windows, by ensuring that BaseContext
  instances can be GC'd when an extension is unloaded, even if the
  context is associated with a page in the bfcache.
- Ensure that context.contentWindow and context.active always have an
  accurate value.

The latter is achieved by moving all contentWindow tracking logic in a
new helper class "InnerWindowReference".
Comment on attachment 9003442 [details]
Bug 1462121 - Improve reliability of context.contentWindow/active

Tomislav Jovanovic :zombie has approved the revision.
Attachment #9003442 - Flags: review+
Pushed by rob@robwu.nl:
https://hg.mozilla.org/integration/autoland/rev/e31d944d504b
Improve reliability of context.contentWindow/active r=zombie
Flags: needinfo?(rob)
Pushed by rob@robwu.nl:
https://hg.mozilla.org/integration/autoland/rev/8d6a57caa626
Improve reliability of context.contentWindow/active r=zombie
Backed out changeset 8d6a57caa626 (Bug 1462121) for bc16 failures in browser_ext_getViews.js

Failure: https://treeherder.mozilla.org/#/jobs?repo=autoland&fromchange=0d81f22a74cebcfa6cd68cb266534b54c0213604&filter-resultStatus=testfailed&filter-resultStatus=busted&filter-resultStatus=exception&filter-classifiedState=unclassified&selectedJob=197398112

Failure log: https://treeherder.mozilla.org/logviewer.html#?job_id=197398112&repo=autoland&lineNumber=14793

[task 2018-09-04T15:59:02.285Z] 15:59:02     INFO - TEST-UNEXPECTED-FAIL | browser/components/extensions/test/browser/test-oop-extensions/browser_ext_getViews.js | popup count correct - Got 1, expected 0
[task 2018-09-04T15:59:02.287Z] 15:59:02     INFO - Stack trace:
[task 2018-09-04T15:59:02.287Z] 15:59:02     INFO - chrome://mochikit/content/browser-test.js:test_is:1304
[task 2018-09-04T15:59:02.288Z] 15:59:02     INFO - chrome://mochitests/content/browser/browser/components/extensions/test/browser/test-oop-extensions/browser_ext_getViews.js:checkViews:123
[task 2018-09-04T15:59:02.289Z] 15:59:02     INFO - chrome://mochitests/content/browser/browser/components/extensions/test/browser/test-oop-extensions/browser_ext_getViews.js:null:179
[task 2018-09-04T15:59:02.290Z] 15:59:02     INFO - chrome://mochikit/content/browser-test.js:Tester_execTest/<:1102
[task 2018-09-04T15:59:02.291Z] 15:59:02     INFO - chrome://mochikit/content/browser-test.js:Tester_execTest:1093
[task 2018-09-04T15:59:02.292Z] 15:59:02     INFO - chrome://mochikit/content/browser-test.js:nextTest/<:995
[task 2018-09-04T15:59:02.293Z] 15:59:02     INFO - chrome://mochikit/content/tests/SimpleTest/SimpleTest.js:SimpleTest.waitForFocus/waitForFocusInner/focusedOrLoaded/<:795
[task 2018-09-04T15:59:02.294Z] 15:59:02     INFO - TEST-PASS | browser/components/extensions/test/browser/test-oop-extensions/browser_ext_getViews.js | count for type correct - 
[task 2018-09-04T15:59:02.295Z] 15:59:02     INFO - TEST-PASS | browser/components/extensions/test/browser/test-oop-extensions/browser_ext_getViews.js | count for {"windowId":765} correct - 
[task 2018-09-04T15:59:02.296Z] 15:59:02     INFO - TEST-PASS | browser/components/extensions/test/browser/test-oop-extensions/browser_ext_getViews.js | count for {"tabId":-1} correct - 
[task 2018-09-04T15:59:02.297Z] 15:59:02     INFO - closing one tab
[task 2018-09-04T15:59:02.298Z] 15:59:02     INFO - one tab closed, one remains
Flags: needinfo?(rob)
Backout by shindli@mozilla.com:
https://hg.mozilla.org/integration/autoland/rev/87071e1b38e1
Backed out changeset 8d6a57caa626 for bc16 failures in browser_ext_getViews.js
extension.getViews() returns all windows whose context.active property
is true. In the previous commit, this "active" property was no longer
set to false upon pagehide (which is a bit too early, since the window
has not unloaded yet), but set to false if the window is truly unloaded
(or frozen in the bfcache).

In the extension.getViews() test, the parts that close the popup or tab
should not immediately resume the test, but wait until the window in the
extension process has unloaded. Otherwise there is a rare chance that
extension.getViews() will return the window that was expected to be
closed, which results in a test failure.

Depends on D4072
Comment on attachment 9006407 [details]
Bug 1462121 - Fix intermittent in browser_ext_getViews.js

Tomislav Jovanovic :zombie has approved the revision.
Attachment #9006407 - Flags: review+
Pushed by rob@robwu.nl:
https://hg.mozilla.org/integration/autoland/rev/f92210c5643b
Fix intermittent in browser_ext_getViews.js r=zombie
Pushed by rob@robwu.nl:
https://hg.mozilla.org/integration/autoland/rev/9bb31b997461
Improve reliability of context.contentWindow/active r=zombie
https://hg.mozilla.org/mozilla-central/rev/f92210c5643b
https://hg.mozilla.org/mozilla-central/rev/9bb31b997461
Status: ASSIGNED → RESOLVED
Closed: Last year
Resolution: --- → FIXED
Target Milestone: --- → mozilla64
Flags: needinfo?(rob) → qe-verify-
You need to log in before you can comment on or make changes to this bug.