Closed Bug 1221974 Opened 5 years ago Closed 4 years ago

Firefox 42.0 OOM crash on startup

Categories

(Toolkit :: Places, defect)

42 Branch
x86_64
Windows 7
defect
Not set
critical

Tracking

()

RESOLVED INCOMPLETE
Tracking Status
firefox49 --- affected
firefox-esr45 --- affected
firefox50 --- affected
firefox51 --- affected
firefox52 --- affected

People

(Reporter: applecobbler.mb, Unassigned)

References

(Depends on 1 open bug)

Details

(Keywords: crash)

Crash Data

User Agent: Mozilla/5.0 (Windows NT 6.1; WOW64; rv:41.0) Gecko/20100101 Firefox/41.0
Build ID: 20151014143721

Steps to reproduce:

Executive summary: Firefox 42, when started, consumes increasing amounts of memory and eventually crashes, without ever actually starting up.

Steps to reproduce:
1. Update to Firefox 42.0 (from 41.0.2).
2. Start Firefox in Safe Mode. (firefox -safe-mode).



Actual results:

Actual results:
1. Firefox consumes increasing amounts of memory for about 2 minutes, and eventually hits the 4 GiB mark, at which point it crashes.
2. The UI is never shown.

Example crashes:
https://crash-stats.mozilla.com/report/index/69a93469-605d-4378-a798-db8002151103 (not in Safe Mode)
https://crash-stats.mozilla.com/report/index/4df1ecbd-7ded-4466-8edf-7804f2151105 (Safe Mode)
https://crash-stats.mozilla.com/report/index/2847d4a8-7dd5-4a68-9d27-a5f442151105 (Safe Mode)


Expected results:

Expected results:
1. Firefox starts normally.

Notes:
1. Firefox 41.0.2 works fine, as did all previous versions. Firefox 41.0.2 still works after downgrading to that version again.
2. The problem also happens in Safe Mode.
OS: Unspecified → Windows 7
Hardware: Unspecified → x86_64
Crash Signature: [@ `anonymous namespace''::UpgradeHostToOriginDBMigration::Insert ] [@ OOM | small ]
Related to bug 1215181 ?
Does not happen with a new, empty profile.
Keywords: crash
All three crashes are under UpgradeHostToOriginAndInsert.  So I'm guessing this is the "Version 4->5 is the merging of host, appId, and isInBrowserElement into origin" migration in nsPermissionManager::InitDB that was added in bug 1165263.
Blocks: 1165263
Flags: needinfo?(michael)
I'm not seeing obvious leaks in http://hg.mozilla.org/mozilla-central/diff/16644f45e4e6/extensions/cookie/nsPermissionManager.cpp at first glance.  I wonder why we end up eating more and more memory here...

Apple Cobbler, if you create a clean profile but then copy your places.sqlite from the broken profile to the clean one, does that cause the clean one to have the same problem?
Status: UNCONFIRMED → NEW
Ever confirmed: true
Flags: needinfo?(applecobbler.mb)
The Places API usage here doesn't look like it may cause such an oom. I also couldn't find infinite loops around.

I have some weak theories, but I'm not sure:

1. there may be a particular corruption in one of the 2 databases (places.sqlite or permissions.sqlite) that causes a large memory usage in sqlite itself, having a copy of the dbs (even privately) *may* help figuring that out.

2. migration is doing lots of updates in the middle of a pending large select on the same database. In case of wal journals this may be problematic, but this db is using legacy default options, so it should not happen. It's error-prone like updating an array while iterating it, but again this is selecting from table A and writing to table B, so it should not be a problem. It's easy solvable but I doubt it's our culprit.

being able to reproduce this in a clean profile by just putting the 2 dbs into it would greatly increase the chances to figure out what's up.
Thanks for looking into this, guys, it's much appreciated.

You're both right, it is related to places.sqlite and/or permissions.sqlite. Here's what I tried to isolate the problem, and what happened:

1. Copy places.sqlite from my usual profile to a fresh one, after saving the original. Result: Firefox starts normally with the fresh profile.
2. Additionally copy permissions.sqlite from my usual profile to the fresh one, after saving the original. Result: Firefox now leaks memory upon startup with that fresh profile as well.
3. Remove places.sqlite from the fresh profile, replacing it with the saved original version from that profile. Result: Firefox now starts normally again.
4. Re-copy permissions.sqlite from my usual profile to the fresh one. Result: although both files are now copied over from my usual profile, as in step 2, Firefox does NOT leak memory and instead starts normally.

Since you've mentioned that there's been a database schema change, I then tried copying the files from the fresh profile over to my usual one again, on the assumption that with Firefox having started up normally, the migration must have been successful. And indeed Firefox is now starting normally with my usual profile again.

I'm not sure if the bug should therefore be closed, given that others might also be affected. I'm afraid I wouldn't feel comfortable sharing those files given that they contain fairly detailed information on what websites I visit, but I've got saved copies of the unmigrated versions and will be happy to help with further debugging.
Flags: needinfo?(applecobbler.mb)
So, 2 of the crashes appear to be located in the opening process for places results. I'm thinking that this could be caused by the history query being too large, and it taking too much memory to lookup the history entries which were requested? This correlates with apple cobbler's results, as as soon as the places sqlite was removed, the migration happened without memory leaks (as the requests to the places API would return empty sets). I'm curious about how big the places database was to create this much of a memory leak (assuming that that is the cause), but it could definitely be a problem theoretically I would think.
Flags: needinfo?(michael)
(In reply to Michael Layzell [:mystor] from comment #7)
> I'm curious
> about how big the places database was to create this much of a memory leak
> (assuming that that is the cause), but it could definitely be a problem
> theoretically I would think.

Fairly large. The following figures are for the unmigrated file that triggered the crash:

$ sqlite3 places.sqlite
SQLite version 3.9.1 2015-10-16 17:31:12
Enter ".help" for usage hints.
sqlite> .tables
moz_anno_attributes  moz_favicons         moz_items_annos
moz_annos            moz_historyvisits    moz_keywords
moz_bookmarks        moz_hosts            moz_places
moz_bookmarks_roots  moz_inputhistory
sqlite> select count(*) from moz_anno_attributes;
26
sqlite>  select count(*) from moz_annos;
5368
sqlite>  select count(*) from moz_bookmarks;
529
sqlite>  select count(*) from moz_bookmarks_roots;
5
sqlite>  select count(*) from moz_favicons;
4356
sqlite>  select count(*) from moz_historyvisits;
137338
sqlite>  select count(*) from moz_hosts;
6345
sqlite>  select count(*) from moz_inputhistory;
1404
sqlite>  select count(*) from moz_items_annos;
964
sqlite>  select count(*) from moz_keywords;
62
sqlite>  select count(*) from moz_places;
92691
sqlite>

And for the sake of completeness, permissions.sqlite (the unmigrated version as well):

$ sqlite3 permissions.sqlite
SQLite version 3.9.1 2015-10-16 17:31:12
Enter ".help" for usage hints.
sqlite> .tables
moz_hosts
sqlite> select count(*) from moz_hosts;
6332
sqlite>
Are we reading the entire places database into memory instead of just stepping along it incrementally?
(In reply to Boris Zbarsky [:bz] from comment #9)
> Are we reading the entire places database into memory instead of just
> stepping along it incrementally?

I believe that that histResultContainer->SetContainerOpen(true); loads the results into memory, but I'd need to ask someone who knows more about the places API :S.
Marco, thoughts on comment 10?  And is there a better way of doing that?
Flags: needinfo?(mak77)
The query is only reading a part of the data, it's filtering on host, so it's not reading the whole database but just a small part of it.
Opening the result container surely reads the whole required data for that container, it's an API intended to be used for UI pieces, the same query can be issues through common Storage APIs if needed.
But even if we'd read it completely, I really don't see how it may cause an oom, in such a case we should also OOM by just opening the history sidebar that contains ALL of history.
Flags: needinfo?(mak77)
I think the oom happens there more often cause it's probably the piece that temporarily uses most memory, but still by itself should be unable to fill up memory.

A possibility I'm thinking is that history results are CC collected, so I wonder if we're hitting a case where they can't be collected quick enough (CC doesn't run that often) and memory sums up. the fact this happens in a strict loop would explain that better.
In such a case, moving to plain Storage API would probably be better memory-wise.
The puzzling thing about this latest theory is that closing the container should throw away most used memory and the code IS doing that. So it should not leak results memory.
I think the only way to figure what's up is to profile memory usage while the migration happens.

It may also be interesting to run PRAGMA integrity_check on both places.sqlite and permissions.sqlite that are causing the crash and check if both report OK.
(In reply to Marco Bonardo [::mak] from comment #14)
> It may also be interesting to run PRAGMA integrity_check on both
> places.sqlite and permissions.sqlite that are causing the crash and check if
> both report OK.

They do:

$ sqlite3.exe places.sqlite
SQLite version 3.9.1 2015-10-16 17:31:12
Enter ".help" for usage hints.
sqlite> PRAGMA integrity_check;
ok
sqlite>
$ sqlite3 permissions.sqlite
SQLite version 3.9.1 2015-10-16 17:31:12
Enter ".help" for usage hints.
sqlite> PRAGMA integrity_check;
ok
sqlite>
$
I don't suppose there's any chance that opening the history sidebar triggers an OOM (I know it won't - just fishing...)?

My current working theory is that cycle collection isn't being run, and something is being "leaked" to be cleaned up by cycle collection in the loop, and it is not being cleaned up before an allocation is made. I'm guessing it's something inside of the history API, because I don't think that URIs, Strings, or Principals will fail to be collected due to cycles, and I think that that is basically the only stuff being allocated other than the places API objects in the loop. It also re-uses the storage stmts etc. so I don't think that should be causing allocations either.
Since regardless I think using history result for this is not right (lots of info we don't need here, and it's not a performant API for strict loops), I'd suggest to convert the code to directly use the connection from nsPIPlacesDatabase and query places.sqlite through it. I'm available to review such patch.
I'd also be happy if this code would be wrapped in an #ifdef MOZ_PLACES since Android doesn't use Places and in general Places code out of Places and browser should be wrapped.
(In reply to Marco Bonardo [::mak] from comment #17)
> Since regardless I think using history result for this is not right (lots of
> info we don't need here, and it's not a performant API for strict loops),
> I'd suggest to convert the code to directly use the connection from
> nsPIPlacesDatabase and query places.sqlite through it. I'm available to
> review such patch.
> I'd also be happy if this code would be wrapped in an #ifdef MOZ_PLACES
> since Android doesn't use Places and in general Places code out of Places
> and browser should be wrapped.

I've made a bug for that - I'm not sure how high priority it is to fix, however, as this code should be dead for almost all users of firefox now (as the release population has already been migrated to the new permissions database). By the time the changes made it to release, I would expect very very few people to run the new code, as that would require an upgrade from 41 to (44?) without ever opening in the intervening versions.
(In reply to Michael Layzell [:mystor] from comment #16)
> I don't suppose there's any chance that opening the history sidebar triggers
> an OOM (I know it won't - just fishing...)?

No, opening the history works as expected and intended.
Duplicate of this bug: 1231716
Severity: normal → critical
Crash volume for signature 'OOM | small':
 - nightly (version 52): 182 crashes from 2016-09-19.
 - aurora  (version 51): 711 crashes from 2016-09-19.
 - beta    (version 50): 11700 crashes from 2016-09-20.
 - release (version 49): 45970 crashes from 2016-09-05.
 - esr     (version 45): 76340 crashes from 2016-06-01.

Crash volume on the last weeks (Week N is from 10-03 to 10-09):
            W. N-1  W. N-2
 - nightly     110      72
 - aurora      624      87
 - beta       9504    2196
 - release   36856    9110
 - esr        5944    6079

Affected platforms: Windows, Mac OS X, Linux

Crash rank on the last 7 days:
           Browser   Content     Plugin
 - nightly #19       #23
 - aurora  #9        #5
 - beta    #1        #2
 - release #1        #1
 - esr     #1
Component: General → Places
Product: Core → Toolkit
Apple Cobbler, do you still crash?
Flags: needinfo?(applecobbler.mb)
(In reply to Wayne Mery (:wsmwk, NI for questions) from comment #23)
> Apple Cobbler, do you still crash?

No, I haven't had any problems since stumbling over the workaround I mentioned in Comment #6.

Thanks for still being on the ball, BTW!
Flags: needinfo?(applecobbler.mb)
there's not enough information to figure out a solution here, and the migration happened far in the past, I'm not sure it's worth keeping this open.
The signature in comment 22 is very generic and catch-all, not directly related to this.
Status: NEW → RESOLVED
Closed: 4 years ago
Resolution: --- → INCOMPLETE
Crash Signature: [@ `anonymous namespace''::UpgradeHostToOriginDBMigration::Insert ] [@ OOM | small ] → [@ `anonymous namespace''::UpgradeHostToOriginDBMigration::Insert ]
You need to log in before you can comment on or make changes to this bug.