Closed Bug 1166351 Opened 9 years ago Closed 8 years ago

XBL parent bindings not properly attached - tabs suddenly do not switch or close properly

Categories

(Core :: XBL, defect)

40 Branch
defect
Not set
critical

Tracking

()

RESOLVED FIXED
mozilla49
Tracking Status
e10s - ---
firefox39 --- affected
firefox40 --- affected
firefox41 --- affected
firefox47 --- wontfix
firefox48 --- wontfix
firefox49 --- fixed

People

(Reporter: kbrosnan, Assigned: mconley)

References

Details

(Keywords: regression)

Attachments

(5 files, 4 obsolete files)

34.15 KB, image/png
Details
3.89 KB, patch
Felipe
: review+
mconley
: checkin+
Details | Diff | Splinter Review
58 bytes, text/x-review-board-request
bholley
: review+
Details
58 bytes, text/x-review-board-request
bholley
: review+
Details
256.61 KB, application/zip
Details
My DevEdition is in a state where new tabs are selectable but do not switch. All tabs are unclosable. E10S is enabled.
Application Basics
------------------

Name: Firefox
Version: 40.0a2
Build ID: 20150517004004
Update Channel: aurora
User Agent: Mozilla/5.0 (Windows NT 6.3; WOW64; rv:40.0) Gecko/20100101 Firefox/40.0
Multiprocess Windows: 1/1 (default: true)

Crash Reports for the Last 3 Days
---------------------------------

All Crash Reports

Extensions
----------

Name: Click to Play per-element
Version: 0.3.1
Enabled: true
ID: ClickToPlayPerElement@uaSad.addons.mozilla.org

Name: Garmin Communicator
Version: 4.1.0
Enabled: true
ID: {195A3098-0BD5-4e90-AE22-BA1C540AFD1E}

Name: InlineDisposition
Version: 1.0.2.4
Enabled: true
ID: {123647d5-da43-4344-bfe2-fc093bdf8f5e}

Name: PDF Viewer
Version: 1.1.157
Enabled: true
ID: uriloader@pdf.js

Name: ChatZilla
Version: 0.9.91.1
Enabled: false
ID: {59c81df5-4b7a-477b-912d-4e0fdf64e5f2}

Name: Cheevos
Version: 1.5.1-signed
Enabled: false
ID: jid1-bpzDizt9E1R7nw@jetpack

Name: Lightbeam
Version: 1.2.1
Enabled: false
ID: jid1-F9UJ2thwoAm5gQ@jetpack

Name: Live HTTP headers
Version: 0.17
Enabled: false
ID: {8f8fe09b-0bd3-4470-bc1b-8cad42b8203a}

Name: SQLite Manager
Version: 0.8.3
Enabled: false
ID: SQLiteManager@mrinalkant.blogspot.com

Name: SuperStop
Version: 0.1.1-signed
Enabled: false
ID: superstop@gavinsharp.com

Name: Tamper Data
Version: 11.0.1
Enabled: false
ID: {9c51bd27-6ed8-4000-a2bf-36cb95c0c947}

Name: User Agent Switcher
Version: 0.7.3
Enabled: false
ID: {e968fc70-8f95-4ab9-9e79-304de2a71ee1}

Name: Whimsy Pro
Version: 1.1.2
Enabled: false
ID: jid1-6mUPixNFCjAgkg@jetpack

Name: YSlow
Version: 3.1.8
Enabled: false
ID: yslow@yahoo-inc.com

Graphics
--------

Adapter Description: NVIDIA GeForce GTX 980
Adapter Drivers: nvd3dumx,nvwgf2umx,nvwgf2umx nvd3dum,nvwgf2um,nvwgf2um
Adapter RAM: 4095
Asynchronous Pan/Zoom: none
Device ID: 0x13c0
Direct2D Enabled: true
DirectWrite Enabled: true (6.3.9600.17795)
Driver Date: 4-8-2015
Driver Version: 9.18.13.5012
GPU #2 Active: false
GPU Accelerated Windows: 1/1 Direct3D 11 (OMTC)
Subsys ID: 111610de
Supports Hardware H264 Decoding: true
Vendor ID: 0x10de
WebGL Renderer: Google Inc. -- ANGLE (NVIDIA GeForce GTX 980 Direct3D11 vs_5_0 ps_5_0)
windowLayerManagerRemote: true
AzureCanvasBackend: direct2d 1.1
AzureContentBackend: direct2d 1.1
AzureFallbackCanvasBackend: cairo
AzureSkiaAccelerated: 0

Important Modified Preferences
------------------------------

accessibility.typeaheadfind.flashBar: 0
browser.cache.disk.capacity: 358400
browser.cache.disk.filesystem_reported: 1
browser.cache.disk.smart_size.first_run: false
browser.cache.disk.smart_size.use_old_max: false
browser.cache.frecency_experiment: 4
browser.download.importedFromSqlite: true
browser.download.manager.alertOnEXEOpen: true
browser.places.smartBookmarksVersion: 7
browser.search.useDBForOrder: true
browser.sessionstore.upgradeBackup.latestBuildID: 20150517004004
browser.startup.homepage: about:home
browser.startup.homepage_override.buildID: 20150517004004
browser.startup.homepage_override.mstone: 40.0a2
browser.tabs.remote.autostart: true
dom.apps.reset-permissions: true
dom.mozApps.used: true
extensions.lastAppVersion: 40.0a2
font.internaluseonly.changed: true
gfx.direct3d.last_used_feature_level_idx: 0
media.gmp-eme-adobe.lastUpdate: 1431096159
media.gmp-eme-adobe.version: 9
media.gmp-gmpopenh264.lastUpdate: 1430836958
media.gmp-gmpopenh264.version: 1.4
media.gmp-manager.buildID: 20150517004004
media.gmp-manager.lastCheck: 1432013779
network.cookie.prefsMigrated: true
network.predictor.cleaned-up: true
places.database.lastMaintenance: 1431614490
places.history.expiration.transient_current_max_pages: 104858
plugin.disable_full_page_plugin_for_types: application/pdf
plugin.importedState: true
plugin.state.npctrl: 0
plugin.state.npgarmin: 0
plugin.state.npintelwebapiipt: 0
plugin.state.npintelwebapiupdater: 0
plugin.state.npnv3dv: 0
plugin.state.npnv3dvstreaming: 0
privacy.clearOnShutdown.sessions: false
privacy.donottrackheader.enabled: true
privacy.sanitize.migrateFx3Prefs: true
security.tls.version.max: 2
storage.vacuum.last.index: 1
storage.vacuum.last.places.sqlite: 1430855217

Important Locked Preferences
----------------------------

JavaScript
----------

Incremental GC: true

Accessibility
-------------

Activated: false
Prevent Accessibility: 0

Library Versions
----------------

NSPR
Expected minimum version: 4.10.8
Version in use: 4.10.8

NSS
Expected minimum version: 3.19 Basic ECC
Version in use: 3.19 Basic ECC

NSSSMIME
Expected minimum version: 3.19 Basic ECC
Version in use: 3.19 Basic ECC

NSSSSL
Expected minimum version: 3.19 Basic ECC
Version in use: 3.19 Basic ECC

NSSUTIL
Expected minimum version: 3.19
Version in use: 3.19

Experimental Features
---------------------
Attached image 2015-05-19-939.png
When opening a new tab I get.
  TypeError: browser.attachFormFill is not a function tabbrowser.xml:1575:14
  TypeError: browser.loadURI is not a function tabbrowser.xml:1610:12

When clicking on customize in the menu
  TypeError: browser is undefined browser.js:18168:0

When clicking on settings
  [CustomizableUI] TypeError: browser is undefined
  Stack trace:
  switchIfURIInWindow@chrome://browser/content/browser.js:18167:1
  switchToTabHavingURI@chrome://browser/content/browser.js:18202:26
  openPreferences@chrome://browser/content/utilityOverlay.js:552:18
  CustomizableWidgets<.onCommand@resource:///modules/CustomizableWidgets.jsm:499:9
  CustomizableUIInternal.handleWidgetCommand@resource:///modules/CustomizableUI.jsm:1436:11

When clicking on add-ons
  [CustomizableUI] TypeError: browser is undefined
  Stack trace:
  switchIfURIInWindow@chrome://browser/content/browser.js:18168:1
  switchToTabHavingURI@chrome://browser/content/browser.js:18202:26
  BrowserOpenAddonsMgr@chrome://browser/content/browser.js:17227:18
  CustomizableWidgets<.onCommand@resource:///modules/CustomizableWidgets.jsm:484:9
  CustomizableUIInternal.handleWidgetCommand@resource:///modules/CustomizableUI.jsm:1436:11

When clicking on tabs
  TelemetryStopwatch: key "FX_TAB_CLICK_MS" was already initialized TelemetryStopwatch.jsm:52:0
  TypeError: tab.linkedBrowser is undefined tabbrowser.xml:3420:16

When closing a blank tab
  TypeError: browser is undefined tabbrowser.xml:2136:18
This sounds similar to something I was experiencing in bug 1141657.

Do you remember what you were doing before you got into this state? Are you able to get into it reliably?
tracking-e10s: --- → ?
Seems like a possible dupe of bug 1161290.
I think this is serious enough to lean on.
Assignee: nobody → mconley
11:22 < mconley> kbrosnan: hmm.... I'd be interested in knowing if you can get access to your currently selected browser's webNavigation
11:23 < mconley> via gBrowser.selectedBrowser.webNavigation in the Browser Console
Flags: needinfo?(kbrosnan)
gBrowser.selectedBrowser.webNavigation
Object { _browser: <browser>, _messageManager: ChromeMessageSender, _sessionHistory: XPCWrappedNative_NoHelper, _currentURI: XPCWrappedNative_NoHelper, canGoBack: true, canGoForward: true, document: null, currentURI: XPCWrappedNative_NoHelper, sessionHistory: XPCWrappedNative_NoHelper }
unsafe CPOW usage script.js:3319:0
unsafe CPOW usage script.js:3321:0
unsafe CPOW usage script.js:3322:0
unsafe CPOW usage script.js:3323:0
unsafe CPOW usage script.js:4140:0
unsafe CPOW usage script.js:4151:0
unsafe CPOW usage script.js:4162:0
unsafe CPOW usage script.js:4163:0
unsafe CPOW usage script.js:4164:0
unsafe CPOW usage script.js:4165:0
unsafe CPOW usage script.js:4190:0
unsafe CPOW usage script.js:4191:0
unsafe CPOW usage script.js:4192:0
unsafe CPOW usage script.js:4193:0
unsafe CPOW usage script.js:4194:0
unsafe CPOW usage script.js:4195:0
unsafe CPOW usage script.js:4196:0
unsafe CPOW usage script.js:4197:0
unsafe CPOW usage script.js:4226:0
unsafe CPOW usage script.js:4249:0
unsafe CPOW usage script.js:4300:0
unsafe CPOW usage script.js:4383:0
unsafe CPOW usage script.js:4405:0
unsafe CPOW usage script.js:3321:0
unsafe CPOW usage script.js:3322:0
unsafe CPOW usage script.js:3323:0
unsafe CPOW usage script.js:4140:0
unsafe CPOW usage script.js:4151:0
unsafe CPOW usage script.js:4162:0
unsafe CPOW usage script.js:4163:0
unsafe CPOW usage script.js:4164:0
unsafe CPOW usage script.js:4165:0
unsafe CPOW usage script.js:4190:0
unsafe CPOW usage script.js:4191:0
unsafe CPOW usage script.js:4192:0
unsafe CPOW usage script.js:4193:0
unsafe CPOW usage script.js:4194:0
unsafe CPOW usage script.js:4195:0
unsafe CPOW usage script.js:4196:0
unsafe CPOW usage script.js:4197:0
unsafe CPOW usage script.js:4226:0
unsafe CPOW usage script.js:4249:0
unsafe CPOW usage script.js:4300:0
unsafe CPOW usage script.js:4383:0
unsafe CPOW usage script.js:4405:0
The object cannot be linked to the inspector without a toolbox console-output.js:3145:0
Flags: needinfo?(kbrosnan)
I can't remember if you answered this in IRC - are you able to get into this state semi-reliably?

My current theory is that it has something to do with the preloaded newtab browser.

If you're able to reproduce it reliably, I'd be interested in knowing if setting browser.newtab.preload to false makes the problem go away.
Flags: needinfo?(kbrosnan)
This is the first time I saw the issue. I don't know how reproducible it is. I still have the browser in this state. A new window seems to largely function correctly.
Flags: needinfo?(kbrosnan)
Hrm.

This is going to be a pain. I'd ask you to use the Browser Toolbox to debug what goes on when you attempt to open a new tab in your buggy window, except bug 1163540 will make setting breakpoints unreliable / flat out not work in tabbrowser.xml. :(

Your browser window is in a busted state, and I need as many clues as I can get on how you got into that state.

What is your value for browser.newtab.preload? Do you remember what you were doing just before your browser got into this state? Had you just closed a tab?
Flags: needinfo?(kbrosnan)
I've just spent the past half hour trying a myriad of ways of reproducing this, with no luck.

I _think_ I've hit this bug periodically, so I guess my next step is to set a trap for it. I'll add a bunch of logging to my tabbrowser.xml so that next time I hit this (if I hit this), I'll have some more information about what was happening leading up to entering the bad state.

That's the best I can come up with, I'm afraid. :/
kbrosnan: if you could also supply me with your browser.newtab.url, that'd also be very helpful, thanks!
browser.newtab.preload; true
browser.newtab.url; about:newtab

I did have my new window in the already troublesome session break in a similar manner just now. I think I was middle clicking some links to open in a new tab. Is there anything else useful or should I just restart Firefox and hope it happens again? Could I patch the js file live Firefox (I assume it is in omni.ja)?
Flags: needinfo?(kbrosnan)
(In reply to Kevin Brosnan [:kbrosnan] from comment #14)
> browser.newtab.preload; true
> browser.newtab.url; about:newtab
> 
> I did have my new window in the already troublesome session break in a
> similar manner just now. I think I was middle clicking some links to open in
> a new tab. Is there anything else useful or should I just restart Firefox
> and hope it happens again? Could I patch the js file live Firefox (I assume
> it is in omni.ja)?

Patching omni.ja is theoretically possible? But I have absolutely no idea if it'd work.

Have you seen this at all again since you originally reported it? I've got an instrumented build here that I've been running for a number of days now, hoping to catch it, but no dice.
Flags: needinfo?(kbrosnan)
I left the browser in that state for several days. I restarted sometime mid-last week. Have not seen it yet. I suspect this will just end up as incomplete. :(
Flags: needinfo?(kbrosnan)
Alright, well, let's re-open this if it ever rears its head again. I'm pretty stumped. :/
Status: NEW → RESOLVED
Closed: 9 years ago
Resolution: --- → INCOMPLETE
Just had this happen again on Linux.
Status: RESOLVED → REOPENED
OS: Windows 8.1 → All
Hardware: x86 → All
Resolution: INCOMPLETE → ---
Status: REOPENED → NEW
http://people.mozilla.org/~kbrosnan/tmp/1166351/screencast.webm 

Saw similar errors to comment 3, in addtion I saw the following being spammed to the console.

TelemetryStopwatch: key "FX_SESSION_RESTORE_COLLECT_DATA_MS" was already initialized TelemetryStopwatch.jsm:52:0
TelemetryStopwatch: key "FX_SESSION_RESTORE_COLLECT_DATA_LONGEST_OP_MS" was already initialized TelemetryStopwatch.jsm:52:0
TelemetryStopwatch: key "FX_SESSION_RESTORE_COLLECT_ALL_WINDOWS_DATA_MS" was already initialized TelemetryStopwatch.jsm:52:0
TypeError: browser is undefined TabStateCache.jsm:57:4
TelemetryStopwatch: key "FX_SESSION_RESTORE_COLLECT_DATA_MS" was already initialized TelemetryStopwatch.jsm:52:0
TelemetryStopwatch: key "FX_SESSION_RESTORE_COLLECT_DATA_LONGEST_OP_MS" was already initialized TelemetryStopwatch.jsm:52:0
TelemetryStopwatch: key "FX_SESSION_RESTORE_COLLECT_ALL_WINDOWS_DATA_MS" was already initialized TelemetryStopwatch.jsm:52:0
TypeError: browser is undefined TabStateCache.jsm:57:4
TypeError: browser is undefined TabStateCache.jsm:57:4
TelemetryStopwatch: key "FX_SESSION_RESTORE_COLLECT_DATA_MS" was already initialized TelemetryStopwatch.jsm:52:0
TelemetryStopwatch: key "FX_SESSION_RESTORE_COLLECT_DATA_LONGEST_OP_MS" was already initialized TelemetryStopwatch.jsm:52:0
TelemetryStopwatch: key "FX_SESSION_RESTORE_COLLECT_ALL_WINDOWS_DATA_MS" was already initialized TelemetryStopwatch.jsm:52:0
TypeError: browser is undefined TabStateCache.jsm:57:4
TelemetryStopwatch: key "FX_SESSION_RESTORE_COLLECT_DATA_MS" was already initialized TelemetryStopwatch.jsm:52:0
TelemetryStopwatch: key "FX_SESSION_RESTORE_COLLECT_DATA_LONGEST_OP_MS" was already initialized TelemetryStopwatch.jsm:52:0
TelemetryStopwatch: key "FX_SESSION_RESTORE_COLLECT_ALL_WINDOWS_DATA_MS" was already initialized TelemetryStopwatch.jsm:52:0
TypeError: browser is undefined TabStateCache.jsm:57:4
TelemetryStopwatch: key "FX_SESSION_RESTORE_COLLECT_DATA_MS" was already initialized TelemetryStopwatch.jsm:52:0
TelemetryStopwatch: key "FX_SESSION_RESTORE_COLLECT_DATA_LONGEST_OP_MS" was already initialized TelemetryStopwatch.jsm:52:0
TelemetryStopwatch: key "FX_SESSION_RESTORE_COLLECT_ALL_WINDOWS_DATA_MS" was already initialized TelemetryStopwatch.jsm:52:0
TypeError: browser is undefined TabStateCache.jsm:57:4
TelemetryStopwatch: key "FX_SESSION_RESTORE_COLLECT_DATA_MS" was already initialized TelemetryStopwatch.jsm:52:0
TelemetryStopwatch: key "FX_SESSION_RESTORE_COLLECT_DATA_LONGEST_OP_MS" was already initialized TelemetryStopwatch.jsm:52:0
TelemetryStopwatch: key "FX_SESSION_RESTORE_COLLECT_ALL_WINDOWS_DATA_MS" was already initialized TelemetryStopwatch.jsm:52:0
TypeError: browser is undefined TabStateCache.jsm:57:4
TelemetryStopwatch: key "FX_SESSION_RESTORE_COLLECT_DATA_MS" was already initialized TelemetryStopwatch.jsm:52:0
TelemetryStopwatch: key "FX_SESSION_RESTORE_COLLECT_DATA_LONGEST_OP_MS" was already initialized TelemetryStopwatch.jsm:52:0
TelemetryStopwatch: key "FX_SESSION_RESTORE_COLLECT_ALL_WINDOWS_DATA_MS" was already initialized TelemetryStopwatch.jsm:52:0
TypeError: browser is undefined TabStateCache.jsm:57:4
Mike wanted to know the output of gBrowser._preloadedBrowser

gBrowser._preloadedBrowser
<browser type="content-targetable" message="true" messagemanagergroup="browsers" contextmenu="contentAreaContextMenu" tooltip="aHTMLTooltip" selectmenulist="ContentSelectDropdown" autoscrollpopup="autoscroller" nodefaultsrc="true" clickthrough="never">
Hey Kevin,

Assuming you've still got this browser window open (thanks for keeping it!), is it true that any subsequent newtabs that you create are broken?

If so, can you please open the Browser Toolbox and attach the debugger to break at addTab in tabbrowser.xml? And then another in _handleNewTab. Then try to add a new tab.

You should hit the first breakpoint, and then the second.

If it's possible for you to record your way stepping through those two functions, that'd be very interesting... not sure if it'd reveal much, but it'd certainly be a help.

Alternatively, we can wait until Whistler, and (assuming you'll be there), we can try to debug this in person.
Flags: needinfo?(kbrosnan)
(In reply to Kevin Brosnan [:kbrosnan] from comment #22)
> http://people.mozilla.org/~kbrosnan/tmp/1166351/stepping.webm

Looks like browser.xml's attachFormFill is throwing an exception when called in _getPreloadedBrowser. Can you tell me what that exception is, and where exactly it's being thrown from within attachFormFill?
Flags: needinfo?(kbrosnan)
browser.attachFormFill is not a function
"_getPreloadedBrowser@chrome://browser/content/tabbrowser.xml:1587:15
 addTab@chrome://browser/content/tabbrowser.xml:1784:19
 loadOneTab@chrome://browser/content/tabbrowser.xml:1413:23
 openLinkIn@chrome://browser/content/utilityOverlay.js:354:1
 openUILinkIn@chrome://browser/content/utilityOverlay.js:203:3
 BrowserOpenTab@chrome://browser/content/browser.js:13018:3
 BrowserOpenNewTabOrWindow@chrome://browser/content/browser.js:18295:5
 oncommand@chrome://browser/content/browser.xul:1:1"
Flags: needinfo?(kbrosnan)
Ok, how about this - can you add a breakpoint in the browser.xml destructor method and see if you're hitting that sometime after you open a new tab? Or the destroy method? That'd be very interesting.
Flags: needinfo?(kbrosnan)
See Also: → 1175023
So hopefully this will help us gather information when this bug rears its head.
Attachment #8624451 - Flags: review?(felipc)
Attachment #8624451 - Flags: review?(felipc) → review+
Keywords: leave-open
Attachment #8624451 - Flags: checkin+
An update:

:margaret hit this bug in Whistler, and I was able to dig in a bit using the DevTools.

It looks like the tabbrowser-browser binding, which extends browser.xml#browser, is applied - but the parent browser XBL binding is not being properly applied.

I... don't know how that can be. Because this is happening only intermittently, I suspect there's a race occurring when we attach the tabbrowser-browser binding as we create a non-remote browser in an e10s window... but I'm not certain why just yet.

We can probably remove the instrumentation at this point - we got the information we needed.
Flags: needinfo?(kbrosnan)
Bug 1166351 - Modify instrumentation to dump out applied binding URLs. r?felipe
Attachment #8626814 - Flags: review?(felipc)
Comment on attachment 8626814 [details]
MozReview Request: Bug 1166351 - Modify instrumentation to dump out applied binding URLs. r?felipe

https://reviewboard.mozilla.org/r/12151/#review10625

Ship It!
Attachment #8626814 - Flags: review?(felipc) → review+
Blocks: 1175023
I hit this again today. My instrumentation showed that the binding URLs are correct, but that one of the parent bindings just isn't applying properly. This is important, because it shows us that our selectors seem to be working correctly. The problem seems to be at a lower XUL/XBL level.

Enn pointed at http://mxr.mozilla.org/mozilla-central/source/dom/xbl/nsXBLProtoImpl.cpp#47 as a failure that could cause what I'm seeing - though I have no idea how we'd get into that condition.

Also, when I hit this bug, it wasn't the browser.xml#browser binding that wasn't applying this time. Instead, it was tabbox.xml#tab that wasn't applied properly. The end result is that none of my tabs seem to visually select properly, even though I seem to be able to view the tabs when clicking on them.

What's more disturbing is that dholbert hit this today with e10s disabled.

So my conclusion is that:

1) This is not an e10s bug.
2) This bug has been seen on OS X, Windows and Linux now, by multiple people.
3) This is a pretty serious bug probably somewhere in the XUL / XBL layer, and we need to put some attention on this.
This is not an e10s bug, so I'm minus-ing it.
Severity: normal → critical
Summary: Unable to access new tabs and unable to close any tab → XBL parent bindings not properly attached - tabs suddenly do not switch or close properly
bz suggested constantly running in a debug build with rr until we hit this again, and then doing time-travel debugging to find out wtf is going on.
I took this bug under the assumption that it was an e10s bug, which it doesn't appear to be anymore, so resetting assignee.
Assignee: mconley → nobody
Neil, can you look into this a.s.a.p.?
Assignee: nobody → enndeakin
See Also: → 1198218
Assignee: enndeakin → nobody
I happened to see something like this this morning - this time when opening a tab instead of closing it. I happened to have my Browser Console open at the time, and because I have GC / CC logging enabled, noticed that I had just done a CC right around when the problem occurred.

So I wonder if it's at all possible that we're incorrectly cycle collecting some XBL or XUL stuff...
The error console asked me to post this here:

tab.linkedBrowser.currentURI is undefined TabItems.jsm:847:0
TypeError: browser is undefined TabState.jsm:162:1
tab.linkedBrowser.currentURI is undefined TabItems.jsm:847:0
The browser has the following bindings: tabbrowser.xml:3609:0
chrome://browser/content/tabbrowser.xml#tabbrowser-browser,chrome://global/content/bindings/browser.xml#browser tabbrowser.xml:3610:0
MozBinding is currently: url("chrome://browser/content/tabbrowser.xml#tabbrowser-browser") tabbrowser.xml:3611:0
Parent is: <stack xmlns="http://www.mozilla.org/keymaster/gatekeeper/there.is.only.xul" class="browserStack" flex="1"><browser type="content-targetable" message="true" messagemanagergroup="browsers" contextmenu="contentAreaContextMenu" tooltip="aHTMLTooltip" autocompletepopup="PopupAutoComplete" selectmenulist="ContentSelectDropdown" autoscrollpopup="autoscroller" clickthrough="never" pending="true"/></stack> tabbrowser.xml:3616:0
Please report the above errors in bug 1166351. tabbrowser.xml:3464:0
TypeError: this.browsers[i] is undefined
_getTabForContentWindow()
 tabbrowser.xml:411
getBrowserForContentWindow()
 tabbrowser.xml:378
httpObserver.tabIdFromChannel()
 vapi-background.js:1994
httpObserver.synthesizePendingRequest()
 vapi-background.js:2006
httpObserver.observe()
 vapi-background.js:2171
adjustTabstrip()
 tabbrowser.xml:4786
showTab()
 tabbrowser.xml:2749
updateCurrentBrowser()
 tabbrowser.xml:1062
onselect()
 browser.xul:1
set_selectedIndex()
 tabbrowser.xml:6508
set_selectedPanel()
 tabbox.xml:683
set_selectedIndex()
 tabbox.xml:399
set_selectedItem()
 tabbox.xml:431
set_selectedTab()
 tabbox.xml:110
set_selectedTab()
 tabbrowser.xml:2811
this.UI.goToTab()
 UI.jsm:776
this.TabItem.prototype<.zoomIn/onZoomDone()
 TabItems.jsm:465
this.TabItem.prototype<.zoomIn/<.complete()
 TabItems.jsm:497
this.iQClass.prototype.animate/<()
 iQ.jsm:417
 tabbrowser.xml:411:1
aWebProgress is undefined browser.js:4444:0
location is undefined browser.js:4358:0

(Followed by further 'TypeError: ... is undefined' and TelemetryStopwatch errors).
Thank you so much Dmitry! Can you please tell me whether or not you have multi-process Firefox enabled? (If you go to about:support, it should be in Application Basics after "Multiprocess Windows").
Flags: needinfo?(dgutov)
I do:

Multiprocess Windows 	2/2 (default: true)

I'm also on Aurora channel, and I have the Tab Groups addon installed.
Flags: needinfo?(dgutov)
Hey Dmitry,

Reading the data that was posted, am I correct when I say that this tab that broke was a background tab that hadn't yet been restored yet?
Flags: needinfo?(dgutov)
Mike,

I'm not sure what you mean by "that tab". There's no particular tab I can point out.

After the problem arrived, I became unable to open new tabs (by clicking the button, or pressing Ctrl+T), or use the "Panorama" button. I could open new tab by clicking the "Home" button. I do not recall if I had a problem closing tabs.

I recall having this problem before. IIRC, that only started happening after the "let's remove Panorama" debacle, and after I installed the new Tab Groups addon. But this is the first time I've seen the "TelemetryStopwatch" error messages.
Flags: needinfo?(dgutov)
Been hitting this bug for over a year, probably less than once a month. Finally traced it last night to see this bug report in the code. Still no steps to reproduce.

tab.linkedBrowser.currentURI is undefined
https://dxr.mozilla.org/mozilla-central/source/browser/base/content/tabbrowser.xml#3754

Most likely e10s only. Been assuming it is down to dom.ipc.processCount but maybe not.
Flags: needinfo?(mconley)
(In reply to Jonathan Howard from comment #48)
> Most likely e10s only. Been assuming it is down to dom.ipc.processCount but
> maybe not.

(Per comment 35, I've hit this bug [or something like it] with e10s disabled, but it also seems possible I was hitting a different bug with similar symptoms.)
Unfortunately, this bug isn't really actionable without better STR. :(

Jonathan Howard - do you recall what you did just before the bug showed itself?
Flags: needinfo?(mconley) → needinfo?(jonathan)
Nothing stands out much from what usually I do many times with different sites.
Maybe me having a second window open helped trigger it.
I opened three bugzilla pages in background and switched window. Seemed the tab-title/icon loaded fine. When I went back found I couldn't switch to the tabs. Opening and closing new tabs worked but bad ones remained, (All prior tabs including pinned would not switch.) Browser Toolbox lead me (after multiple tries) to above spot. Passed over without error diagnostic that has been added.

Currently not sure what to try if I get it again. Hopefully won't be so late if I get it again. (Thinking of telling the tab to load another url just to see what happens next.)

Previously I've tried killing all child processes but it made no difference. kill -6 firefox but stacks seemed normal.
Flags: needinfo?(jonathan)
Just had a similar bug to comment 48/51.
This time new tab not working.
I quickly closed a tab just after switching to another. Then found out creating new tabs didn't work and had problems changing tabs.

Code was constantly going through this loop;
https://dxr.mozilla.org/mozilla-central/source/browser/base/content/tabbrowser.xml#3609

Think I made too much litter in browser console to find anything reported when it started. This one stood out due to multiple occurences;
TypeError: can't access dead object
callListeners()
 tabbrowser.xml:499
_callProgressListeners()
 tabbrowser.xml:522
mTabProgressListener/<._callProgressListeners()
 tabbrowser.xml:571
mTabProgressListener/<.onProgressChange()
 tabbrowser.xml:599
RemoteWebProgressManager.prototype._callProgressListeners()
 RemoteWebProgress.jsm:174
RemoteWebProgressManager.prototype.receiveMessage()

Popped the dead object from this.mTabsProgressListeners but still bugged.

addTab was throwing here with this.selectedItem null
https://dxr.mozilla.org/mozilla-central/source/browser/base/content/tabbrowser.xml#4850
I guess the tab I closed was the selected and nothing took its place.
Maybe I hit this timeout but will never know.
https://dxr.mozilla.org/mozilla-central/source/browser/base/content/tabbrowser.xml#3484

Hacked on selected = true to final tab of list and got partial functionality creating new tabs, was still breakage with throbber. Closed the window soon after. Second window I had open stayed functional throughout.
(In reply to Jonathan Howard from comment #52)
> Just had a similar bug to comment 48/51.
> This time new tab not working.
> I quickly closed a tab just after switching to another. Then found out
> creating new tabs didn't work and had problems changing tabs.
> 
> Code was constantly going through this loop;
> https://dxr.mozilla.org/mozilla-central/source/browser/base/content/
> tabbrowser.xml#3609
> 
> Think I made too much litter in browser console to find anything reported
> when it started. This one stood out due to multiple occurences;
> TypeError: can't access dead object
> callListeners()
>  tabbrowser.xml:499
> _callProgressListeners()
>  tabbrowser.xml:522
> mTabProgressListener/<._callProgressListeners()
>  tabbrowser.xml:571
> mTabProgressListener/<.onProgressChange()
>  tabbrowser.xml:599
> RemoteWebProgressManager.prototype._callProgressListeners()
>  RemoteWebProgress.jsm:174
> RemoteWebProgressManager.prototype.receiveMessage()
> 
> Popped the dead object from this.mTabsProgressListeners but still bugged.
> 
> addTab was throwing here with this.selectedItem null
> https://dxr.mozilla.org/mozilla-central/source/browser/base/content/
> tabbrowser.xml#4850
> I guess the tab I closed was the selected and nothing took its place.
> Maybe I hit this timeout but will never know.
> https://dxr.mozilla.org/mozilla-central/source/browser/base/content/
> tabbrowser.xml#3484
> 
> Hacked on selected = true to final tab of list and got partial functionality
> creating new tabs, was still breakage with throbber. Closed the window soon
> after. Second window I had open stayed functional throughout.

Hey Jonathan - thanks for this analysis. This is really interesting.

One problem is that the links to the source you provided have probably gone stale, since they weren't permalinks (the source has changed under them).

Example, https://dxr.mozilla.org/mozilla-central/source/browser/base/content/tabbrowser.xml#3609 doesn't point to a loop.

I'm particularly interested in the loop you're referring to. When you find the line of code in DXR, can you copy the permalink link (it's in the top right of DXR), and paste that?
Flags: needinfo?(jonathan)
Flags: needinfo?(mconley)
See Also: → 1263167
I'm pretty stoked - John-Galt has pointed me at a reliable reproduction of this bug. I'll have a patch posted shortly.
Flags: needinfo?(mconley)
(A patch that reproduces the bug reliably, not fixes it)
Run ./mach mochitest --flavor=plain toolkit/components/extensions/test/mochitest/ --jsdebugger
Run the tests
When the tests cease, focus the main browser showing the Mochitest results
Open the Browser Console
Try to access gBrowser.selectedBrowser.currentURI

Review commit: https://reviewboard.mozilla.org/r/48485/diff/#index_header
See other reviews: https://reviewboard.mozilla.org/r/48485/
Attachment #8626814 - Attachment is obsolete: true
Happy to report that Enn was able to reproduce with this test case. I'm going to set a needinfo on him to get his analysis on what's going on here, and what needs to be done to avoid this problem.
Flags: needinfo?(enndeakin)
To reproduce this bug, follow the steps in comment 57 with that patch applied. You don't need the debugger flag, but you can use --keep-open and try opening/closing/switching tabs. I also get errors printed about currentURI and/or webNavigation being undefined/null.

What's happening here is that there is some JS prototype-related objects being added to a hash in nsXBLBinding::GetOrCreateMapEntryForPrototype. This hash I think is used for all bindings in chrome scope.

The test here has a line which changes the user agent locale. This causes the startup cache to flush which wipes out any cached xbl documents. This means that any new elements created, for example, from opening a new tab, get fresh bindings and associated data structures, including new nsXBLPrototypeBinding objects. However, the hash mentioned above still contains old prototype related info so when GetOrCreateMapEntryForPrototype is run on the new tab, it gets some associated with some old objects and new objects.

When the tab is then closed, nsXBLBinding::ChangeDocument gets confused as its mPrototypeBindings (created post-startupcache-flush) members don't match and causes weird binding errors, such as happens here.

A cursory glance suggests that the startup cache can be flushed by some devtools and addon support code among a few other places. I haven't found any reliable way to reproduce this other than these test steps.

I'm not sure what the hash is used for exactly. Changing GetOrCreateMapEntryForPrototype to always return new values instead of looking in the hash "fixes" this bug. Is it sufficient to clear out these hashes when the startup cache is flushed?

Let's ask if someone more familiar with this code could take this bug or provide some pointers?
Flags: needinfo?(enndeakin) → needinfo?(bobbyholley)
Sorry for the lag here - there's a lot to page in and I have a lot on my plate. :\

(In reply to Neil Deakin from comment #59)
> To reproduce this bug, follow the steps in comment 57 with that patch
> applied. You don't need the debugger flag, but you can use --keep-open and
> try opening/closing/switching tabs. I also get errors printed about
> currentURI and/or webNavigation being undefined/null.
> 
> What's happening here is that there is some JS prototype-related objects
> being added to a hash in nsXBLBinding::GetOrCreateMapEntryForPrototype. This
> hash I think is used for all bindings in chrome scope.

To be clear, this hash is a JS WeakMap, i.e. it's a script-accessible Javascript object (we only ever really use it from C++, be having the JS GC manage lifetimes is very convenient). The WeakMap is keyed off of "grand-protos", which is to say the original prototype of a given object before we spliced the special XBL prototype in. The value is a plain JSObject that maps from { xblBindingName: xblPrototypeObject }.

So supposing we have some XBL binding foo, and we attach that binding to both <span> and <div> elements. We'll have a separate entry in the WeakMap for both HTMLSpanElement.prototype and HTMLDivElement.prototype, each of which will have an object { foo: fooProto }, where fooProto is an object with the XBL prototype methods, whose own prototype is HTMLSpanElement.prototype and HTMLDivElement.prototype respectively.

There's also the degenerate case where the grand-proto is null, in which case we just define the properties directly on the WeakMap, expando-style.

Finally, this is all extra tricky because, in the XBL-in-untrusted-content case, these WeakMaps live in the special pseudo-privileged XBL scope, and the keys and values are generally cross-compartment wrappers to the objects in the bonafide content scope.

All of this exists as an optimization so that, for example, if we bind N <div>s in a document, we can have one copy of the XBL <implementation> prototype stuff (and all of the JS methods that hang off of it), rather than N.

> The test here has a line which changes the user agent locale. This causes
> the startup cache to flush which wipes out any cached xbl documents. This
> means that any new elements created, for example, from opening a new tab,
> get fresh bindings and associated data structures, including new
> nsXBLPrototypeBinding objects. However, the hash mentioned above still
> contains old prototype related info so when GetOrCreateMapEntryForPrototype
> is run on the new tab, it gets some associated with some old objects and new
> objects.

In general this wouldn't be a problem, since all of the properties on the XBL prototype objects are just pure JS functions cloned out of the bindings. _However_, the XBL prototype object has a reserved slot which points to its prototype binding. This is used to find the right object during un-installation, as well as to enable the lazy installation of XBL <field>s.

> 
> When the tab is then closed, nsXBLBinding::ChangeDocument gets confused as
> its mPrototypeBindings (created post-startupcache-flush) members don't match
> and causes weird binding errors, such as happens here.

It seems to me that the correct thing to do here is to uninstall existing bindings before flushing the startup cache. Do I understand correctly that we do this already, and that the problem is merely that we find the stale prototype in the WeakMap when attaching _new_ bindings?

We should augment the assertion at [1] to check that the reserved slot on the cached binding is equal to aProtoBinding. Assuming your hypothesis is correct, that assertion should fire, which is much better than catching this issue at teardown time. Can you verify that?

> A cursory glance suggests that the startup cache can be flushed by some
> devtools and addon support code among a few other places. I haven't found
> any reliable way to reproduce this other than these test steps.
> 
> I'm not sure what the hash is used for exactly.

Hopefully this is explained sufficiently above. Let me know if any parts are unclear.

> Changing
> GetOrCreateMapEntryForPrototype to always return new values instead of
> looking in the hash "fixes" this bug. Is it sufficient to clear out these
> hashes when the startup cache is flushed?

Given that we also define properties directly on the WeakMap expando-style, we'd either need to replace the WeakMaps entirely, or manually clear the properties. This is a bit annoying because both the properties and the weakmaps themselves are defined JSPROP_PERMANENT (i.e. non-configurable). Blowing away the WeakMaps entirely (and making them non-JSPROP_PERMANENT) is probably the right approach. We'd want a ClearClassObjectMaps, analogous to GetOrCreateClassObjectMap (which will need the aforementioned removal of the JSPROP_PERMANENT property).

[1] http://searchfox.org/mozilla-central/commit/20598f44f63420c793201e5dbb1e1421c31cea34/dom/xbl/nsXBLBinding.cpp#1015
Flags: needinfo?(bobbyholley)
Mike, do you think we should re-prioritize this bug?
Flags: needinfo?(mconley)
(To be clear, the suggestions I'm making a pretty straightforward to implement)
(In reply to Neil Deakin from comment #61)
> Mike, do you think we should re-prioritize this bug?

This seems somewhat rare, but the user impact is particularly brutal. I'd love to get this fixed soon, personally - though technically this doesn't fall into the "e10s" category of bugs. Fixing this is probably just the Right Thing to Do.

Did you want to take it, or shall I?
Flags: needinfo?(mconley) → needinfo?(enndeakin)
> In general this wouldn't be a problem, since all of the properties on the
> XBL prototype objects are just pure JS functions cloned out of the bindings.
> _However_, the XBL prototype object has a reserved slot which points to its
> prototype binding. This is used to find the right object during
> un-installation

Yes, and it is this finding the right object during uninstallation that fails.

> It seems to me that the correct thing to do here is to uninstall existing
> bindings before flushing the startup cache. Do I understand correctly that
> we do this already, and that the problem is merely that we find the stale
> prototype in the WeakMap when attaching _new_ bindings?

I don't think we remove any bindings applying to existing elements when flushing the startup cache. Any newly created elements will use new binding and prototype objects though. The problem is indeed that we find the old prototype in the map, even though the new element is using new binding objects and prototypes.


> We should augment the assertion at [1] to check that the reserved slot on
> the cached binding is equal to aProtoBinding. Assuming your hypothesis is
> correct, that assertion should fire, which is much better than catching this
> issue at teardown time. Can you verify that?

Yes, this assertion occurs when running the test.
Flags: needinfo?(enndeakin)
(In reply to Neil Deakin from comment #64)
> > In general this wouldn't be a problem, since all of the properties on the
> > XBL prototype objects are just pure JS functions cloned out of the bindings.
> > _However_, the XBL prototype object has a reserved slot which points to its
> > prototype binding. This is used to find the right object during
> > un-installation
> 
> Yes, and it is this finding the right object during uninstallation that
> fails.
> 
> > It seems to me that the correct thing to do here is to uninstall existing
> > bindings before flushing the startup cache. Do I understand correctly that
> > we do this already, and that the problem is merely that we find the stale
> > prototype in the WeakMap when attaching _new_ bindings?
> 
> I don't think we remove any bindings applying to existing elements when
> flushing the startup cache.

Hm. I guess that's still ok, since nsXBLPrototypeBinding will match the one in the nsXBLBinding during teardown for those elements.

> Any newly created elements will use new binding
> and prototype objects though. The problem is indeed that we find the old
> prototype in the map, even though the new element is using new binding
> objects and prototypes.
> 
> 
> > We should augment the assertion at [1] to check that the reserved slot on
> > the cached binding is equal to aProtoBinding. Assuming your hypothesis is
> > correct, that assertion should fire, which is much better than catching this
> > issue at teardown time. Can you verify that?
> 
> Yes, this assertion occurs when running the test.

Ok. I think we should just be able to implement and call ClearClassObjectMaps (along with the JSPROP_PERMANENT changes), and that should fix this bug.
I'll take this.
Assignee: nobody → mconley
Attachment #8744361 - Attachment is obsolete: true
Hey bholley, can you think of a good place where we might call into ClearClassObjectMaps from? Also, am I correct in thinking that my design requires us to be running JS within the XBL scope in order to call them? Is this problematic, and if so, is there a better design you could suggest[1]?

[1]: Note that this is one of the very few times I've been doing JS on the other side of the fence, so my patches should probably be viewed with extreme skepticism.
Flags: needinfo?(enndeakin)
The rest of the xbl objects are cleared in nsXULPrototypeCache::Flush when the startup cache flushes. Or you could add another observer in nsXBLService.

But I'll needinfo the person you meant to just in case.
Flags: needinfo?(enndeakin) → needinfo?(bobbyholley)
Whoops, thanks. :)
Discussed on IRC.
Flags: needinfo?(bobbyholley)
Okay, I think I'm going to need a little bit more guidance here. The weak maps that we want to clear out are defined as properties on "XBL-hosting globals" according to the inline-documentation.

This suggests to me that I need a way of iterating all of the XBL-hosting globals at XUL cache flush time. How does one do that?

Is it sufficient to do the iteration inside XPCWrappedNativeScope.cpp, where I have access to "gScopes", looking for XBL scopes? Or is there an easier way?
Flags: needinfo?(bobbyholley)
Nevermind - I re-watched the compartments talk on airmo, and I think I'm starting to get a hang of what all of these pieces represent. Give me another day or so to crack this.
Flags: needinfo?(bobbyholley)
So I think what I need to do here is iterate all of the JS scopes where XBL bindings have been installed, and within those scopes clear the WeakMaps. There doesn't seem to be a great way for me to iterate the scopes where XBL bindings have been installed...

Enn, do you know if it's advisable to hold an iterable set of (weak?) scopes for the nsIContent that get passed to nsXBLService::LoadBindings? Is that a choke-point where I can assume _all_ XBL scopes will pass through?

If so, I guess at flush time, I can just iterate those scopes and I _think_ the clearing of the caches is pretty straight forward.

Or is there another way of getting all of the XBL binding scopes?
Flags: needinfo?(enndeakin)
> Enn, do you know if it's advisable to hold an iterable set of (weak?) scopes
> for the nsIContent that get passed to nsXBLService::LoadBindings? Is that a
> choke-point where I can assume _all_ XBL scopes will pass through?

I don't really know what that means, but bindings can also be loaded via nsXBLService::LoadBindingDocumentInfo. But that just preloads the file though so I don't think is an issue here. Otherwise, all attempts to apply a binding go through LoadBindings.
Flags: needinfo?(enndeakin)
Hey bholley,

So I've been poking around a bit in the XBL source, and I can't seem to find a good way of finding all of the globals where the class WeakMap's have been defined (or even all of the globals where XBL has been bound to some content).

The best plan I've got right now is that, every time we create a new ClassObjectMap in GetOrCreateClassObjectMap for some XBL scope / global, we then add that scope / global to a static, iterable (weak?) set, that I can then iterate to blow away these WeakMaps when we flush the caches.

Does that seem like a reasonable plan? If there's a better way of finding all of the scopes / globals with these WeakMaps, I'm all ears.
Flags: needinfo?(bobbyholley)
So, assuming XBL doesn't provide a way to find all the nsXBLBindings in the system (I haven't checked), we probably need to iterate over all the JS globals in the system, and then find all the possible places where we might have installed these maps (i.e. the things returned from the xpc::GetXBLScopeOrGlobal in GetOrCreateMapEntryForPrototype). This should generally allow us to avoid touching web-exposed globals, which is good, because there's no rule that says that web pages aren't allowed to have a global property  called __ContentClassObjectMap__.

If memory serves correctly (please double-check), there are basically two cases:
* System principaled windows (where we run XBL directly without a sandbox)
* XBL Sandbox scopes hanging off the XPCWrappedNativeScope.

I also don't know offhand the current state of the art for iterating all the globals in the system. XPConnect definitely has a linked list of XPCWrappedNativeScopes. The JS engine or CycleCollectedJSRuntime might provide a better API.
Flags: needinfo?(bobbyholley)
Another question - is the only case where we unregister bindings the case where we unregister _all_ the bindings? If it's not, we might have trouble here.

Boris, just to sanity-check: mconley is trying to fix coherency bugs caused by flushing the startup cache, whereby we end up with a new nsXBLPrototypeBinding in the binding manager, but find class objects for the old nsXBLPrototypeBinding in the WeakMap. He's currently looking at iterating all the scopes and clearing the WeakMaps if they exist. Does this seem reasonable to you?
Flags: needinfo?(bzbarsky)
Yeah, that seems reasonable.  This doesn't seem like a common operation or one that's expected to be super-cheap for some reason....
Flags: needinfo?(bzbarsky)
Comment on attachment 8747276 [details]
MozReview Request: Bug 1166351 - Add helper methods for clearing the XBL class object weakmap. r?bholley

Review request updated; see interdiff: https://reviewboard.mozilla.org/r/49793/diff/1-2/
So this appears to work, with one caveat, and that's the assertion that I put into the first commit. That assertion is still failing even with the other patches on top of it, so I'm not sure I'm doing it correctly. Pointers welcome.

If I remove the assertion, then we appear to clear the maps correctly, and the bug in the STR no longer appears (\o/).
Attachment #8747275 - Flags: review?(bobbyholley) → review+
Comment on attachment 8747275 [details]
MozReview Request: Bug 1166351 - Add an assertion to make sure that we use the correct cached XBL binding prototypes. r?bholley

https://reviewboard.mozilla.org/r/49791/#review47343

r=me with that fixed.

::: dom/xbl/nsXBLBinding.cpp:1015
(Diff revision 1)
>      MOZ_ASSERT(JS_GetClass(js::UncheckedUnwrap(proto)) == &gPrototypeJSClass);
> +
> +    nsXBLPrototypeBinding* cachedBinding =
> +      static_cast<nsXBLPrototypeBinding*>(::JS_GetReservedSlot(obj, 0).toPrivate());
> +    MOZ_ASSERT(cachedBinding == aProtoBinding);

So, I'm guessing that the existence of this UncheckedUnwrap before asserting the JSClass means that we can have wrappers here. And if that's the case, then getting the reserved slot on the wrapper will do the wrong thing.

What I think we should instead do is to create a helper called GetProtoBindingFromClassObject. These helper will assert gPrototypeJSClass and then access the slot.

Once you do that, you can replace both MOZ_ASSERTs here with:


MOZ_ASSERT(GetProtoBindingFromClassObject(js::UncheckedUnwrap(proto)) == aProtoBinding);

This may be the reason that the assert is firing.
Comment on attachment 8747276 [details]
MozReview Request: Bug 1166351 - Add helper methods for clearing the XBL class object weakmap. r?bholley

https://reviewboard.mozilla.org/r/49793/#review47349

::: dom/xbl/nsXBLBinding.h:112
(Diff revision 2)
>    void GenerateAnonymousContent();
>    void InstallAnonymousContent(nsIContent* aAnonParent, nsIContent* aElement,
>                                 bool aNativeAnon);
>    static void UninstallAnonymousContent(nsIDocument* aDocument,
>                                          nsIContent* aAnonParent);
> +  static void ClearClassObjectMaps(JSContext *cx, JS::Handle<JSObject*> obj);

Add some documentation, and call |obj| |scope|.

::: dom/xbl/nsXBLBinding.cpp:891
(Diff revision 2)
> -                                 JSPROP_PERMANENT | JSPROP_READONLY,
> -                                 JS_STUBGETTER, JS_STUBSETTER))
> +                                 JSPROP_READONLY, JS_STUBGETTER,
> +                                 JS_STUBSETTER))

Add a comment that this can't be permanent because we need to clear the maps.

::: dom/xbl/nsXBLBinding.cpp:902
(Diff revision 2)
> +  AssertSameCompartment(cx, scope);
> +  MOZ_ASSERT(JS_IsGlobalObject(scope));
> +  MOZ_ASSERT(scope == xpc::GetXBLScopeOrGlobal(cx, scope));

You can remove the middle assert.

::: dom/xbl/nsXBLBinding.cpp:906
(Diff revision 2)
> +  JS::Rooted<JS::PropertyDescriptor> desc(cx);
> +  if (JS_GetOwnPropertyDescriptor(cx, scope, name, &desc) &&
> +      desc.object() && desc.value().isObject() &&
> +      JS::IsWeakMapObject(&desc.value().toObject())) {

Let's MOZ_RELEASE_ASSERT that the result is either undefined or a WeakMapObject. These scopes are all self-controlled.

::: dom/xbl/nsXBLBinding.cpp:912
(Diff revision 2)
> +  if (JS_GetOwnPropertyDescriptor(cx, scope, name, &desc) &&
> +      desc.object() && desc.value().isObject() &&
> +      JS::IsWeakMapObject(&desc.value().toObject())) {
> +    JS::ObjectOpResult result;
> +    JS_DeleteProperty(cx, scope, name, result);
> +    MOZ_ASSERT(result.ok());

MOZ_RELEASE_ASSERT.

::: dom/xbl/nsXBLBinding.cpp:920
(Diff revision 2)
> +  MOZ_ASSERT(js::GetGlobalForObjectCrossCompartment(scope) == scope);
> +  AssertSameCompartment(cx, scope);
> +  MOZ_ASSERT(JS_IsGlobalObject(scope));
> +  MOZ_ASSERT(scope == xpc::GetXBLScopeOrGlobal(cx, scope));

The second and fourth assert should cover you here.
Attachment #8747276 - Flags: review?(bobbyholley) → review+
Comment on attachment 8748398 [details]
MozReview Request: Bug 1166351 - Clear XBL class object maps on XUL prototype cache flush. r?bholley

https://reviewboard.mozilla.org/r/50257/#review47351

::: js/xpconnect/src/XPCWrappedNativeScope.cpp:163
(Diff revision 1)
>          }
>      }
>  }
>  
> +void
> +XPCWrappedNativeScope::ClearClassMaps()

I'd call this ClearXBLClassObjectMaps.

::: js/xpconnect/src/XPCWrappedNativeScope.cpp:165
(Diff revision 1)
> +     AutoJSAPI jsapi;
> +     jsapi.Init(mGlobalJSObject);
> +     JSContext* cx = jsapi.cx();
> +     JS::Rooted<JSObject*> xblscope(cx, xpc::GetXBLScopeOrGlobal(cx, mGlobalJSObject));
> +     nsXBLBinding::ClearClassObjectMaps(cx, xblscope);

ClearClassObjectMaps does AssertSameCompartment(cx, scope) right?

It seems like this should give you a compartment mismatch assertion every time xblScope is different from mGlobalJSObject. Can you figure out what's going on here?

More generally, it seems better to design the API such that ClearClassMaps operates on mGlobalJSObject directly (i.e. it's up to the caller to figure out whether this scope might be used to host XBL).

::: js/xpconnect/src/XPCWrappedNativeScope.cpp:176
(Diff revision 1)
> +    for (XPCWrappedNativeScope* cur = gScopes; cur; cur = cur->mNext) {
> +       if (cur->mContentXBLScope) {
> +           cur->ClearClassMaps();
> +       } else if (cur->mGlobalJSObject) {
> +           nsIPrincipal* principal = cur->GetPrincipal();
> +           const js::Class* clasp = js::GetObjectClass(cur->mGlobalJSObject);
> +           if (principal && nsContentUtils::IsSystemPrincipal(principal) &&
> +               !(strcmp(clasp->name, "Window"))) {
> +               cur->ClearClassMaps();
> +           }
> +       }
> +    }

Hm, this is all a bit yucky actually.

Maybe instead we should stick a bit on the XPCWrappedNativeScopes called mHasXBLClassObjectMaps - we could set that bit (idempotently) whenever we install the maps, and then our filtering job here becomes really easy.
Attachment #8748398 - Flags: review?(bobbyholley)
https://reviewboard.mozilla.org/r/49791/#review47343

> So, I'm guessing that the existence of this UncheckedUnwrap before asserting the JSClass means that we can have wrappers here. And if that's the case, then getting the reserved slot on the wrapper will do the wrong thing.
> 
> What I think we should instead do is to create a helper called GetProtoBindingFromClassObject. These helper will assert gPrototypeJSClass and then access the slot.
> 
> Once you do that, you can replace both MOZ_ASSERTs here with:
> 
> 
> MOZ_ASSERT(GetProtoBindingFromClassObject(js::UncheckedUnwrap(proto)) == aProtoBinding);
> 
> This may be the reason that the assert is firing.

Yep, that fixed it - thanks.
https://reviewboard.mozilla.org/r/50257/#review47351

> ClearClassObjectMaps does AssertSameCompartment(cx, scope) right?
> 
> It seems like this should give you a compartment mismatch assertion every time xblScope is different from mGlobalJSObject. Can you figure out what's going on here?
> 
> More generally, it seems better to design the API such that ClearClassMaps operates on mGlobalJSObject directly (i.e. it's up to the caller to figure out whether this scope might be used to host XBL).

At least for this particular test case, mGlobalJSObject was always equal to xblscope, which is why I never hit an assertion.

Using the mHasXBLClassObjectMaps bit you suggested in the other commit, I think the caller has enough information to be confident in using the mGlobalJSObject on the scope. I've removed the GetXBLScopeOrGlobal.
Comment on attachment 8747275 [details]
MozReview Request: Bug 1166351 - Add an assertion to make sure that we use the correct cached XBL binding prototypes. r?bholley

Review request updated; see interdiff: https://reviewboard.mozilla.org/r/49791/diff/1-2/
Attachment #8748398 - Flags: review?(bobbyholley)
Comment on attachment 8747276 [details]
MozReview Request: Bug 1166351 - Add helper methods for clearing the XBL class object weakmap. r?bholley

Review request updated; see interdiff: https://reviewboard.mozilla.org/r/49793/diff/2-3/
Comment on attachment 8748398 [details]
MozReview Request: Bug 1166351 - Clear XBL class object maps on XUL prototype cache flush. r?bholley

Review request updated; see interdiff: https://reviewboard.mozilla.org/r/50257/diff/1-2/
Attachment #8748398 - Flags: review?(bobbyholley)
So I was able to get a trace using rr, and I'm still a bit puzzled. We're failing the assertion I add in my first commit - I guess there's some pre-existing XBL prototype that hasn't been cleared out by the flush somehow.

And that's odd, because thanks to rr, I'm able to replay and check, and sure enough, we iterate through all scopes including the one with the matching memory address as the one where we fail the assertion, and we do attempt to clear it out. And we don't ever seem to attempt to add a new WeakMap to that scope.

So I guess my #1 theory is that somehow, for some cases, my deletion of the WeakMap is just not working. Hey bholley, assuming that I'm iterating all of the proper scopes, is there anything wrong with my way of _deleting the WeakMap_ in each XBL scope that might be problematic in some cases? I say some cases, because these patches definitely work to clear the "minimal" test case I laid out in comment 57.
Flags: needinfo?(bobbyholley)
I'm afraid after an afternoon of going back and forth over my rr recording, I'm still not sure why I'm hitting this assertion.

I have confirmed that for the global / scope in question, the WeakMap _is_ being deleted when we flush the cache.

The problem appears to be that later on, we initialize a chrome://global/content/bindings/general.xml#basecontrol binding, which has its prototype then stored in the global for the one we later fail the assertion in.

Later on, we initialize yet another chrome://global/content/bindings/general.xml#basecontrol binding, but this one appears to have a different aProtoBinding memory address than the one we had cached earlier, which results in the assertion failure.

I'm really at a loss as to how to proceed here. :(
So, more fiddling with rr tonight, and I have a clearer picture of what's going on:

Before we clear the cache, we keep initting the same binding over and over again - the one at 0x42699b0 for chrome://global/content/bindings/general.xml#basecontrol. All is going well.

Then we flush the bindings.

After we flush the bindings, another attempt is made to init chrome://global/content/bindings/general.xml#basecontrol, using the out-of-date prototype at 0x42699b0. Tracing this back to its source, it looks like this prototype binding is the one that's a protected member of nsXBLBinding - mPrototypeBinding.

So, after we flush the bindings and clear out all of the maps, should we be somehow unlinking the nsXBLBinding's from their original mPrototypeBindings somehow?
I chatted with mconley over video today, and he's unblocked now.
Flags: needinfo?(bobbyholley)
Okay, I think I know what's happening.

There are a few players here:

- The global nsXULPrototypeCache, which caches lots of information, including references to all nsXBLDocumentInfo’s that get loaded.
- nsBindingManager - there’s one of these per document, I believe. This also holds references to the nsXBLDocumentInfo’s that have been loaded for bindings within the document that it belongs to.
- nsXBLService - this is a singleton that does the work of loading bindings off the disk. What’s important to know is that when the nsXBLService loads a binding for some document, it will first check the nsXULPrototypeCache for the appropriate nsXBLDocumentInfo for that binding. If it doesn’t find it in the nsXULPrototypeCache, it’ll then check the nsBindingManager for the document that the binding is being used within. Failing that, it’ll go load it fresh off of the disk.

Here’s why I think I’m hitting my assertion:

- An XBL binding B is loaded, and the nsXBLDocumentInfo for that binding is cached both in nsXULPrototypeCache, along with the nsBindingManager that the binding is being loaded into. Let’s call this nsBindingManager X.
- When the XBL binding is loaded, it also creates the JS prototype object that’s being spliced into the binding target’s prototype chain - this gets tossed into the WeakMap for that global via nsXBLBinding::DoInitJSClass
- The cache if flushed - so that nsXBLDocumentInfo is removed from the nsXULPrototypeCache. Due to my patches, this also clears out the WeakMap of JS prototype objects. It is _not_, however, removed from the nsBindingManager X for the documents for which XBL binding B has been loaded into.
- Later on, the document with nsBindingManager X tries loading another instance of XBL binding B. It can’t find the nsXBLDocumentInfo in the nsXULPrototypeCache, so it falls back to checking the nsBindingManager. It finds the pre-flush nsXBLDocumentInfo, and calls nsXBLBinding::DoInitJSClass, which puts the prototype binding from the pre-flush nsXBLDocumentInfo into the WeakMap!
- Later, some other document tries to load XBL binding B. It’s not in the nsXULPrototypeCache, and it’s not in the nsBindingManager for that document (since it’s never been loaded before), so it loads it from the disk.
- Once the load from disk is completed, the newly created nsXBLDocumentInfo for that binding document is put into the nsXULPrototypeCache.
- Back in the document with nsBindingManager X, another copy of XBL binding B is being used - so nsXBLService goes to load the nsXBLDocumentInfo, and finds the newly created nsXBLDocumentInfo that’s now in the nsXULPrototypeCache. This is what it gets the nsXBLPrototypeBinding from, and this is what eventually gets passed to nsXBLBinding::DoInitJSClass.
- And we assert, because this new nsXBLPrototypeBinding does not match what was cached in the WeakMap.
Now that I think I understand this, I think I see an alternative solution - instead of clearing the ClassObjectMap on nsXULPrototypeCache flush, what we could do instead is prioritize the nsBindingManager nsXBLDocumentInfo over the nsXULPrototypeCache nsXBLDocumentInfo. That means that if an nsBindingManager had seen a particular binding before, it'd use the one that it's familiar with instead of using a more recent one.
This is kind of a long story, stay with me on this.

In bug 990290, a WeakMap was added to any JS scope that loaded an XBL
binding. That WeakMap stored the JS prototypes that are injected into
a bound node's prototype chain.

When a binding is removed, we search the prototype chain for the
JS prototype that we'd added, and remove it.

The XUL prototype cache caches numerous things, including nsXBLDocumentInfo,
which we use to get at the nsXBLPrototypeBinding for a particular binding,
which is then used to generate the class object that's put into the WeakMap.

When the XUL prototype cache is flushed, that means that when a binding
is bound, its definition needs to be reloaded off of the disk. If, however,
there was a pre-existing instance of the binding already being used in a
document, now we were in a funny case.

We were in a funny case, because when attempting to remove a binding, we
would look up the nsXBLPrototypeBinding via the nsXBLDocumentInfo that's
being held within the nsXULPrototypeCache, find (or load off the disk) a
_new_ nsXBLDocumentInfo and generate a _new_ nsXBLPrototypeBinding that
did not match to the one that we'd already stored in the WeakMap. This
would mean that removal would go wrong, and things would break horribly.

This patch makes it so that we prioritize checking the nsBindingManager
for a document for the nsXBLDocumentInfo before checking the
global nsXULPrototypeCache. That way, even if the cache gets cleared,
if the binding was ever used in this document, it'll be in the
nsBindingManager, and we'll get the same nsXULProtoypeBinding that
we'd been using before, and sanity will prevail.

Review commit: https://reviewboard.mozilla.org/r/53608/diff/#index_header
See other reviews: https://reviewboard.mozilla.org/r/53608/
Comment on attachment 8747275 [details]
MozReview Request: Bug 1166351 - Add an assertion to make sure that we use the correct cached XBL binding prototypes. r?bholley

Review request updated; see interdiff: https://reviewboard.mozilla.org/r/49791/diff/2-3/
Attachment #8747276 - Attachment is obsolete: true
Attachment #8748398 - Attachment is obsolete: true
Attachment #8753965 - Flags: review?(bobbyholley)
Attachment #8753965 - Flags: review?(bobbyholley) → review+
Comment on attachment 8753965 [details]
MozReview Request: Bug 1166351 - Prioritize getting the nsXBLDocumentInfo out of the bound document's nsBindingManager instead of the nsXULPrototypeCache. r?bholley

https://reviewboard.mozilla.org/r/53608/#review50490

::: dom/xbl/nsXBLService.cpp:904
(Diff revision 1)
> -        return NS_OK;
> +      return NS_OK;
> -      }
> +    }
> -    }
> +  }
>  
>  #ifdef MOZ_XUL
> -    // Next, look in the startup cache
> +  // The second line of defence is the global nsXULPrototypeCache,

Nit: en-us prefers defense over defence.

::: dom/xbl/nsXBLService.cpp:971
(Diff revision 1)
> +    // we can ensure that if the document has loaded this binding
> +    // before, it can continue to use it even if the XUL prototype
> +    // cache gets flushed. That way, if a flush does occur, we
> +    // don't get into a weird state where we're using different
> +    // XBLDocumentInfo's for the same XBL document in a single
> +    // document that has loaded some bindings.

Nit: remove the apostrophe.

Also, do we have an automated test that triggers the assertion in the other patch?
https://reviewboard.mozilla.org/r/53608/#review50490

> Nit: remove the apostrophe.
> 
> Also, do we have an automated test that triggers the assertion in the other patch?

Yeah, a few of the automated tests hit it without intending to without this patch (which makes me wonder what other weird behaviours this might fix!).

I'll file a follow-up to get one written for this specific case.
Comment on attachment 8753965 [details]
MozReview Request: Bug 1166351 - Prioritize getting the nsXBLDocumentInfo out of the bound document's nsBindingManager instead of the nsXULPrototypeCache. r?bholley

Review request updated; see interdiff: https://reviewboard.mozilla.org/r/53608/diff/1-2/
Component: Tabbed Browser → XBL
Product: Firefox → Core
Depends on: 1274140
https://hg.mozilla.org/mozilla-central/rev/50b0296bbc46
https://hg.mozilla.org/mozilla-central/rev/6a70d84dd6dc
Status: NEW → RESOLVED
Closed: 9 years ago8 years ago
Resolution: --- → FIXED
Target Milestone: --- → mozilla49
Sounds like this fix should stay with 49; 48 is heading into its last couple of weeks.
You need to log in before you can comment on or make changes to this bug.

Attachment

General

Created:
Updated:
Size: