Closed Bug 1471982 Opened 6 years ago Closed 6 years ago

Firefox constantly >100% CPU, seemingly from Places migration

Categories

(Toolkit :: Places, defect, P1)


Tracking


RESOLVED INCOMPLETE
Tracking Status
firefox62 --- affected

People

(Reporter: sfink, Unassigned)

References

Details

(Whiteboard: [fxsearch])

Attachments

(2 files)

I'm seeing constant CPU usage of a little over 100%. The profiler wasn't showing it because it was on an unsampled thread. perf reported the time as being in some sqlite function, and strace showed lots of fcntl calls. I did manual sampling of fcntl stacks with gdb and got

          'socket' => 26,
          'commit' => 22,
          'async' => 155,
          'migrate' => 654,
          'other' => 23,
          'nssToken' => 312

where each of those tokens means that something identifiable was found in the 15 frames of the stack leading up to fcntl. The exact patterns are

    $stuff->{migrate} = 1 if /MigrateV48Frec/;
    $stuff->{commit} = 1 if /mozStorageTransaction::Commit/;
    $stuff->{async} = 1 if /AsyncExecuteStatements::Run/;
    $stuff->{nssToken} = 1 if /nssToken_/;
    $stuff->{socket} = 1 if /nsSocketTransport::/;

MigrateV48FrecenciesRunnable accounts for the majority. I restarted the profiler while sampling the mozStorage threads: https://perf-html.io/from-addon/calltree/?hiddenThreads=17-18&range=32.3456_50.7562&thread=4&threadOrder=0-2-3-4-5-6-7-8-9-10-11-12-13-14-15-16-17-18-1&v=3

That seems to show the time going to Connection::executeSql running:

    UPDATE moz_origins
    SET frecency = ( SELECT MAX(frecency) FROM moz_places
                     WHERE moz_places.origin_id = moz_origins.id )
    WHERE rowid IN ( SELECT rowid FROM moz_origins
                     WHERE frecency = -1 LIMIT 400 )

This is still going on after several hours. My disk has 6.5GB free.

% ls -hs1 places*
 51M places.sqlite
 32K places.sqlite-shm
4.0M places.sqlite-wal

If I copy places.sqlite to /tmp, run sqlite3 on it, and execute the above statement, it finishes immediately.
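
For reference, the manual check amounts to something like this in the sqlite3 shell (the UPDATE is copied verbatim from the profile above; .timer just confirms it finishes near-instantly on the copy):

    .timer on
    UPDATE moz_origins
    SET frecency = ( SELECT MAX(frecency) FROM moz_places
                     WHERE moz_places.origin_id = moz_origins.id )
    WHERE rowid IN ( SELECT rowid FROM moz_origins
                     WHERE frecency = -1 LIMIT 400 );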

I think this started the first time I ran Nightly 63 (previously I was on Nightly 62).
Gah, sorry: the *public* profiler link is https://perfht.ml/2IzwmMq
That migration is temporary; once it's done we shouldn't need to run it again, so it's somewhat expected to be expensive. It shouldn't last long though, certainly not 4 hours; it should be a matter of minutes at worst.

First, make a backup of the profile.

Try running the Places integrity check from about:support, and then restart the browser.

Also, we recently replaced part of this code, so the problem may have gone away in the meantime. I'm sorry I didn't notice this report sooner; please let us know how things are going now.
Flags: needinfo?(sphink)
Blocks: 1239708
P1 while waiting for info
Priority: -- → P1
Whiteboard: [fxsearch]
Some initial thoughts:

* As Marco said, we replaced this code recently (but the new code is similar), and it's been a while since this bug was filed.  I'm wondering whether you/we can reproduce these results now.  If not, this may be hard to investigate.

* Is all that time spent in a single execution of that SQL, or is the SQL being executed again and again?

* What's the size of your moz_places?

* What's your OS?  Are you using spinning disks?

* Maybe something is setting frecencies to -1 before the migration can finish.  We used to do that when invalidating frecencies, but I'm not sure that's true anymore.
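
A quick way to check that against a copy of places.sqlite would be something along these lines (moz_origins is what the migration query above operates on; that moz_places also uses -1 for invalidated frecencies is my assumption):

    SELECT COUNT(*) FROM moz_origins WHERE frecency = -1;  -- origins still pending migration
    SELECT COUNT(*) FROM moz_places  WHERE frecency = -1;  -- places flagged for recalculation (assumption)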
Is this something QA can help with? If you can give possible STR (steps to reproduce), we can ask them to help out.
Doesn't make sense to me to be tracking a one-off report.

Mike de Boer said there were some known perf issues fixed after landing, and he will check whether existing telemetry/slow-SQL data indicates any ongoing issue here.
(In reply to Marco Bonardo [::mak] from comment #2)
> That migration is temporary, once done we ideally don't need to run it
> anymore, so it's somehow expected to be expensive. It shouldn't last long
> though, surely not 4 hours, it should be matter of minutes, at worst.
> 
> First, start by doing a backup of the profile.

Oops, too late.

The problem disappeared at some point, I'm not sure when.

> * As Marco said, we replaced this code recently (but the new code is
> similar).  And it's been a while since this bug was filed.  I'm wondering
> whether you/we can repeat these results now.  If not, it may be hard to
> investigate this now.

Yeah, unfortunately I don't have any way to reproduce. I should have thought to at least back up the profile, sorry.

> * Is all that time spent in a single execution of that SQL, or is the SQL
> being executed again and again?

From the gdb profiling, I'd guess again and again. The profile seems to say that one execution took 16 seconds but then finished. I wish that profile had a little more at the end...

> * What's the size of your moz_places?

I don't know. places.sqlite was 51MB and is currently 50MB, and there are 79935 rows in my moz_places table right now.
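
(For the record, the row count came from something along the lines of the following, run with sqlite3 against a copy of places.sqlite:)

    SELECT COUNT(*) FROM moz_places;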

> * What's your OS?  Are you using spinning disks?

linux64 (Fedora 27) on an SSD.

> * Maybe something is setting frecencies to -1 before the migration can
> finish.  We used to do that when invalidating frecencies, but I'm not sure
> that's true anymore.

None of my frecencies are -1 at the moment, but then I'm also not having the problem anymore.

I guess I was confused by the "48" in MigrateV48FrecenciesRunnable, and assumed that something named 48 should not be running at all given that I had upgraded using the same profile through every version from 48 to 63. But maybe 48 means "migrate from the version first introduced in 48" or something?
Flags: needinfo?(sphink)
The Places schema versions are unrelated to the Firefox versions, because a single Firefox version can ship zero, one, or more schema migrations. So that 48 is just an internal version of the schema, unrelated to the Firefox version number.
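
For what it's worth, that internal schema version is stored in the database itself; if I remember correctly it's SQLite's user_version pragma, so something like this against a copy of places.sqlite should show it:

    PRAGMA user_version;   -- Places schema version (e.g. 48 or later), not a Firefox version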
Status: NEW → RESOLVED
Closed: 6 years ago
Resolution: --- → INCOMPLETE
Oh, I just remembered that I set up 'atop' not long ago, so I can at least say exactly when this started and stopped.

The first CPU usage spike, which seems to also be when I started Firefox, was at 2018/06/27 13:50:01 (plus or minus 10 minutes). It briefly dropped below 100% CPU usage 19 hours later (!), but even that was only for 20 minutes and it did not go below 65%. It gets a little confusing later on because I was running a bunch of CPU-hogging things myself, and it looks like I may have suspended the browser at about 2018/06/28 19:42:01, likely to speed things up; it disappears from the list of processes using CPU for a while, but later returns with the same pid. I allowed it to resume at 21:22.

Sometime between 2018/06/30 10:35 and 2018/07/01 22:51, I finally killed the process, which is a little surprising: I was on vacation starting 06/28 and took the laptop with me, so it was suspended. Probably I resumed it at some point during the vacation, saw it was still bogged down, and killed Firefox. The next time the machine was on was 2018/07/05, at which point the CPU usage was under control (varying, but never exceeding 20% for any 10-minute window).

I doubt any of this is useful, other than to say "atop is cool!", but I'll attach my updates.xml in case someone can make sense of that to see what versions were involved. Hm... or rather, I'll attach two updates.xml, because it looks like I had a startup crash that made me wipe out my firefox install directory on 2018-06-27 and reinstall Nightly. (Yeah, it hasn't been a good month for me & Firefox.)
Anyway, I'm not expecting anything to come of this, just wanted to report what I could in case it's helpful in the future.
Thank you for reporting; every single report can be useful.

The times this happened seem to be after our initial perf fixes landed, which is a little puzzling.
We landed some new code on July 4th in bug 1467627; it is also computationally expensive, but it should be a bit cheaper than the old code.

We'll keep monitoring the Beta feedback.


(In reply to Drew Willcoxon :adw from comment #4)
> * Maybe something is setting frecencies to -1 before the migration can
> finish.  We used to do that when invalidating frecencies, but I'm not sure
> that's true anymore.

We don't invalidate origin frecencies anymore, AFAICT.
Thanks Steve.
Steve, one more question -- do you use Sync?  Was it enabled when this happened?
Flags: needinfo?(sphink)
(In reply to Drew Willcoxon :adw from comment #15)
> Steve, one more question -- do you use Sync?  Was it enabled when this
> happened?

Yes and yes.
Flags: needinfo?(sphink)