Closed Bug 1632544 Opened 5 years ago Closed 5 years ago

Startup and script cache invalidation broke - manifests as OOM ("uncaught exception: out of memory") errors with either no source trace or pointing to Cu.import statements

Categories

(Toolkit :: Startup and Profile System, defect)

defect

Tracking

()

VERIFIED FIXED
mozilla77
Tracking Status
firefox-esr68 --- unaffected
firefox75 --- unaffected
firefox76 --- unaffected
firefox77 + verified

People

(Reporter: Gijs, Assigned: Gijs)

References

(Depends on 1 open bug, Regression)

Details

(Keywords: regression)

Attachments

(1 file)

We're still in the process of tracking this down.

What's kind of remarkable is how usable the browser is despite all of these OOM exceptions being thrown.

One thing I've noticed is that I can easily get these exceptions thrown when I navigate to cnn.com in a Fission window.

Frustratingly, if I use mozregression (and clone my default profile), I don't get these exceptions, which is going to make this a pain to hunt down. Right now, I'm running that cloned profile for a few hours to see if heap fragmentation is related.

I updated Nightly for the second time today, and after the last update I started to see this error in the global JS console. Restarting the browser doesn't fix the problem.

Nightly 77.0a1 build ID 20200423145559 on macOS.

Just restarted my Nightly, and this shows up in the Browser Console right away:

15:19:59.544 uncaught exception: out of memory
15:19:59.545 uncaught exception: out of memory MainProcessSingleton.jsm:7:34
15:19:59.547 uncaught exception: out of memory Services.jsm:9:38
15:19:59.548 uncaught exception: out of memory Services.jsm:12:36
15:19:59.553 uncaught exception: out of memory MainProcessSingleton.jsm:22:21
15:19:59.555 uncaught exception: out of memory MainProcessSingleton.jsm:34:23
15:19:59.557 uncaught exception: out of memory MainProcessSingleton.jsm:38:23
15:19:59.558 uncaught exception: out of memory
15:19:59.567 uncaught exception: out of memory BrowserGlue.jsm:1110:5
15:19:59.572 uncaught exception: out of memory 3
15:19:59.579 uncaught exception: out of memory addonManager.js:62:56
15:19:59.584 uncaught exception: out of memory AddonManager.jsm:69:37
15:19:59.590 uncaught exception: out of memory AddonManager.jsm:105:29
15:19:59.593 uncaught exception: out of memory AddonManager.jsm:4765:13
15:19:59.594 uncaught exception: out of memory AddonManager.jsm:739:23
15:19:59.600 uncaught exception: out of memory AddonManager.jsm:772:23
15:19:59.602 uncaught exception: out of memory GMPProvider.jsm:16:28
15:20:00.891 uncaught exception: out of memory 2
15:20:00.891 uncaught exception: out of memory ExtensionProcessScript.jsm:15:40

From 20200423145559 (https://hg.mozilla.org/mozilla-central/rev/03626342f6e659ac6699a21e30423e2267c1971f)

So interestingly, some of those MainProcessSingleton exceptions are fired inside of a handler for app-startup, which I think only gets notified once via the nsAppStartupNotifier...

and yet we run line 22-25 in MainProcessSingleton, and apparently throw... and then line 33-37 and apparently throw... and then 38-41 and then apparently throw. How are we continuing on if we're throwing like this? This makes no sense, or my understanding of exceptions or app-startup is completely wrong.

Hey Iain... any idea what might be happening here?

Flags: needinfo?(iireland)

I can consistently reproduce after a new restart, without restarting again, by opening a private browsing window. It has something to do with the initialization inside a content process. E.g. at some point I was able to open 3 PB windows (and get 6 "uncaught exception: out of memory" without source). Then I opened a new tab in my normal browsing window, and now I only got logs after 2 more PB windows, probably due to reuse of processes after reaching the maximum number of content processes.

I was able to see a message attributed to this source:

https://searchfox.org/mozilla-central/rev/41c3ea3ee8eab9ce7b82932257cb80b703cbba67/devtools/server/startup/content-process.js#22

Since the Cu.import is directly inside a condition on process type being content, it's probably safe to assume that the error comes from the content process.

(In reply to Rob Wu [:robwu] from comment #6)

it's probably safe to assume that the error comes from the content process.

But the MainProcessSingleton only runs in the parent process, and I'm seeing it there too (see comment 5). So it might be across both content and the parent?

Can anyone reproduce this on a machine where they're able to use source level debugging (ie on Windows, https://developer.mozilla.org/en-US/docs/Mozilla/Using_the_Mozilla_source_server ), or reproduce it on a self-built nightly, where they can get more info on the source of the error by using a (C++) debugger to set breakpoints on the thing tripping these? (possibly at https://searchfox.org/mozilla-central/rev/41c3ea3ee8eab9ce7b82932257cb80b703cbba67/dom/script/ScriptSettings.cpp#506 or similar)

Reproduced on:
Nightly 77.0a1 build ID 20200423145559 on macOS Mojave (I commented before).
Nightly 77.0a1 build ID 20200423095248 on macOS Catalina (almost vanilla profile; no add-ons, changed only a couple prefs)

I can only reproduce with an existing profile, not a new one. I tried Linux and macOS.

When the profile is copied elsewhere, and Firefox is started with the copy, the issue does not occur.
When the original profile is renamed, and then copied back to the original location, the issue returned.
When Nightly was started upon removing files from the profile directory, the issue kept appearing, even when at the end only .parentlock and compatibility.ini were left.
When the last two files were deleted, the issue did not occur any more, even after replacing the test directory with the original directory.

Does Firefox store any other state outside of the profile directory that could trigger this bug?

Yes, there is a separate local profile folder (ProfLD vs. the regular ProfD) which could explain the problem… maybe this is related to startupcache or another cache?

See /Users/username/Library/Caches/Firefox/Profiles/{profile folder name} for ProfLD on macOS.

When you launch Firefox with --profile /tmp/foo I believe both directory service entries point to one folder so if you want to clone the profile you can merge both the profile folders into one temp. folder and point Firefox at that.

[Tracking Requested - why for this release]:
We broke cache invalidation.

(In reply to Matthew N. [:MattN] (PM me if request are blocking you) from comment #10)

Yes, there is a separate local profile folder (ProfLD vs. the regular ProfD) which could explain the problem… maybe this is related to startupcache or another cache?

See /Users/username/Library/Caches/Firefox/Profiles/{profile folder name} for ProfLD on macOS.

When you launch Firefox with --profile /tmp/foo I believe both directory service entries point to one folder so if you want to clone the profile you can merge both the profile folders into one temp. folder and point Firefox at that.

So in some way, my no-longer-reproducing profile went back to reproducing this bug (but not the autocomplete weirdness). So I followed these steps, and ran the cloned profile in my regular nightly, which showed the issue.

Then I ran the cloned profile against mozregression (builds from both April 1st and April 23rd, ie yesterday's nightly), and that didn't show the issue. Same results when trying to run the same cloned profile against a local build with mach run - no issue.

Then I tried to run the steps Rob ran in comment #9. I was able to remove everything down to compatibility.ini and the script cache (in the startupCache folder), and still reproduce. Once I remove the script cache files or invalidate compatibility.ini, the issue goes away.

So I think somehow, we're putting bogus things into the script cache, or not invalidating it when we should.

Next, I looked at what we've changed here. The script cache C++ files etc. haven't changed at all since 2018. I knew of bug 1588113 but didn't immediately make the connection, however...

bug 1588113 made it so we check only major versions for allowing downgrades. This is great, it means you can run the same profile in local builds and nightlies and not get this dialog. To do this, it removed a bunch of checks from CheckCompatibility, which returns a bool indicating whether the app version has changed. However, as a side-effect, it also updates an outparam bool cachesOK, which is used to decide whether to invalidate the startupcache, script cache, etc.

Previously, any nightly update invalidated caches, because the version comparison included build IDs and would always indicate the version had changed (esp. for upgrade rather than downgrade) and so we'd not reuse caches. Now, the same code checks the version but not the build ID, so we keep going, back in CheckCompatibility, to see if there are other mismatches.

If anything else mismatches, we exit while cachesOK is still false. This will happen if we're running with a different appdir (which explains why we all struggled to repro when running against local builds). But if we're in the same appdir and have just updated, we will get to the end and set aCachesOK to true when the compatibility.ini file doesn't explicitly say to clear it . Then we'll end up reusing the script cache, but that's from a previous build so contains outdated JS, and then Bad Things happen. I'm actually surprised how benign the effects have been so far.

Component: General → Startup and Profile System
Flags: needinfo?(iireland)
Product: Firefox → Toolkit
Regressed by: 1588113
Summary: People are seeing OOM ("uncaught exception: out of memory") errors with either no source trace or pointing to Cu.import statements → Startup and script cache invalidation broke - manifests as OOM ("uncaught exception: out of memory") errors with either no source trace or pointing to Cu.import statements
Has Regression Range: --- → yes
Assignee: nobody → gijskruitbosch+bugs
Status: NEW → ASSIGNED
Pushed by gijskruitbosch@gmail.com: https://hg.mozilla.org/integration/autoland/rev/1457e6a571aa reinstate startup and script cache invalidation for build ID and minor version changes, r=mossop
Pushed by ccoroiu@mozilla.com: https://hg.mozilla.org/mozilla-central/rev/db2e1d780268 reinstate startup and script cache invalidation for build ID and minor version changes, r=mossop a=Aryx
Status: ASSIGNED → RESOLVED
Closed: 5 years ago
Resolution: --- → FIXED
Target Milestone: --- → mozilla77

I'm not seeing the OOM errors anymore.

Yeah, the errors disappeared from my affected regular profile that I'd copied, too. Marking verified per comment #17.

Status: RESOLVED → VERIFIED
Depends on: 1633248
You need to log in before you can comment on or make changes to this bug.

Attachment

General

Creator:
Created:
Updated:
Size: