Closed Bug 1387782 Opened 3 years ago Closed 3 years ago

places.sqlite work following Firefox Sync "Receive from Device" activity cause 100% CPU usage and crash

Categories

(Toolkit :: Places, defect)

57 Branch
x86_64
Linux
defect
Not set
major

Tracking

()

RESOLVED INVALID

People

(Reporter: ronan, Unassigned)

References

Details

(Keywords: crash, Whiteboard: [fxsearch])

Crash Data

Attachments

(1 file)

First, this bug is imprecise and I'm sorry for it.
Please help me help you and ask me the relevant debug info you need :)

I use Sync's awesome "Send to Device" feature to send tabs from
device `work` (Ubuntu 16.04.3 x64) to device `home` (Arch x64),
both running the latest Nightly.

Everyday back at home, I get a crash when receiving the tabs from work.
The tabs arrive and, after a few seconds, one of 4 CPU core gets topped
(GNOME System Monitor shows Firefox using 25%), Firefox becomes
unresponsive, and crashes.

- Problem arrives only when sending from work (Ubuntu) to home (Arch),
  not the contrary.

- I tried scrapping my home profile and starting with a fresh one,
  changing as little prefs as possible and just re-Syncing,
  that didn't help, the crashes keep happening regularly.

See attached logs for sample output (no debug flags set,
that's regular stdout, please point me to whatever config produces
more detailed debug information). Crashes #1 and #2 look similar,
and crashes #3 and #4 too, but not sure all are related.
One more thing: I submitted a bunch of these crashes
through Firefox's crash reporting tool.

They should be associated to my email address (firstname@lastname.fr)
Hi Ronan,

Thank you for the report. 

(In reply to Ronan Jouchet from comment #1)
> One more thing: I submitted a bunch of these crashes
> through Firefox's crash reporting tool.
> 
> They should be associated to my email address (firstname@lastname.fr)

Not everyone has access to search these by email, could you go to about:crashes and copy/paste the last few crash IDs here? It'll then be easier for us to take a look.
Flags: needinfo?(ronan)
(In reply to Mark Banner (:standard8) from comment #2)
> (In reply to Ronan Jouchet from comment #1)
> > One more thing: I submitted a bunch of these crashes
> > through Firefox's crash reporting tool.
> > 
> > They should be associated to my email address (firstname@lastname.fr)
> 
> Not everyone has access to search these by email, could you
> go to about:crashes and copy/paste the last few crash IDs here?
> It'll then be easier for us to take a look.

:standard8 sure thing,

https://crash-stats.mozilla.com/report/index/df071b9a-24ec-4397-9ce5-d60c50170725
https://crash-stats.mozilla.com/report/index/2ae046cd-65dd-438b-baa8-898c30170727
https://crash-stats.mozilla.com/report/index/792fffaa-d321-4705-9a74-09a230170801
https://crash-stats.mozilla.com/report/index/cfc97b04-a16c-4fd0-ad96-240210170802
https://crash-stats.mozilla.com/report/index/fcb79456-ce87-4404-a4ea-026810170804

And also lying in my about:crashes since my profile reboot,
but apparently unrelated and possibly already fixed:

https://crash-stats.mozilla.com/report/index/78cbc476-1e62-468a-80fb-9ba410170725
https://crash-stats.mozilla.com/report/index/52ef933d-e592-4c6b-a23b-43cb70170727

Thanks for the fast feedback!
Flags: needinfo?(ronan)
Thank you Ronan. I've taken a brief look and I've now set this up with the right flags.

Do you know if you only get the errors about timeout when you start restarting Firefox, or do they occur before then?

How large is your places database? (see under Places Database in about:support).
Crash Signature: [@ AsyncShutdownTimeout | places.sqlite#1: waiting for clients | places.sqlite#1: Bookmarks.jsm: fetchBookmark (158)]
Component: Bookmarks & History → Places
Keywords: crash
Product: Firefox → Toolkit
Whiteboard: [fxsearch]
(In reply to Mark Banner (:standard8) from comment #4)
> I've taken a brief look and I've now set this up with the right flags.

Out of curiosity, what do you mean by "set this up with the right flags"?
Some Nightly build config? Some config of the crash reporting tool?

> Do you know if you only get the errors about timeout when you
> start restarting Firefox, or do they occur before then?

I'm not sure, will look at it next time a crash happens and report back.

Oh, one more thing: not *all* crashes are caught by the Crash Reporter:

- Working Crash Reporter scenario:
  1. Tabs from work arrive
  2. 100% CPU
  3. Ctrl+Q
  4. After a few seconds (~= 30s; I guess a timeout), Firefox exits
  5. Crash reporter is shown.

- *Failing* Crash Reporter scenario:
  1. 2. 3. Identical
  4. Firefox doesn't exit even after minutes, I have to manually kill it.
  5. Crash reporter is not shown.

> How large is your places database?

Clicking 'Verify Integrity' in about:support -> Places Database yields:

    > Task: checkIntegrity
    + The database is sane
    > Task: checkCoherence
    + The database is coherent
    > Task: expire
    + Database cleaned up
    > Task: vacuum
    + Initial database size is 10240 KiB
    + The database has been vacuumed
    + Final database size is 10240 KiB
Mark, a new crash today: https://crash-stats.mozilla.com/report/index/999510ca-3ecc-492a-9854-d91070170808

This time, the only stdout I had was this:

> ~ firefox
> ATTENTION: default value of option force_s3tc_enable overridden by environment.
> [Parent 871] WARNING: waitpid failed pid:995 errno:10: file /home/worker/workspace/build/src/ipc/chromium/src/base/process_util_posix.cc, line 276
> 
> [Parent 871] ###!!! ABORT: file resource://gre/modules/Sqlite.jsm, line 656
> [Parent 871] ###!!! ABORT: file resource://gre/modules/Sqlite.jsm, line 656
> ExceptionHandler::GenerateDump cloned child 8830
> ExceptionHandler::WaitForContinueSignal waiting for continue signal...
> ExceptionHandler::SendContinueSignalToChild sent continue signal to child
Today's crash:
https://crash-stats.mozilla.com/report/index/0da390b8-35d8-4e57-822c-d100a0170809

> [Parent 13454] ###!!! ABORT: file resource://gre/modules/PlacesUtils.jsm, line 2067
> [Parent 13454] ###!!! ABORT: file resource://gre/modules/PlacesUtils.jsm, line 2067
> ExceptionHandler::GenerateDump cloned child 22117
> ExceptionHandler::SendContinueSignalToChild sent continue signal to child
> ExceptionHandler::WaitForContinueSignal waiting for continue signal...
Thanks Ronan, I don't have anything to suggest at the moment, they all seem to be more or less the same, so I don't think there's a need to post more.

Kit, any ideas here?
Flags: needinfo?(kit)
Depends on: 1389716
Kit, Mark, looks like the fix attempt at bug 1389716 didn't solve my bug 1387782.
Today's crash: https://crash-stats.mozilla.com/report/index/588463bd-7373-4b19-8cf7-5badd0170817

Feel free to ask for more info, or even a temporary remote access to my machine.
Marco, can you have a look at this as well?
Flags: needinfo?(mak77)
That's unfortunate. :-( Thanks for getting back to us, Ronan. Could you please enable `mozStorage` logging on your home machine (https://bugzilla.mozilla.org/show_bug.cgi?id=1388584#c23) and attach the log here? I'd like to see the queries we're trying to execute before the crash.
Flags: needinfo?(kit) → needinfo?(ronan)
note that the log will contain bound queries, and thus it may contains private information. IF you're not comfortable with sharing it here, you can send it through a private mail download link. I'd also expect it to be quite large, so limit the session to as short as possible.
(In reply to Kit Cambridge (he/him) [:kitcambridge] (UTC-7) from comment #11)
> Could you please enable `mozStorage` logging on your home machine
> (https://bugzilla.mozilla.org/show_bug.cgi?id=1388584#c23)

Kit, reading the aforelinked bug 1388584 comment 23, I'm not sure
which logging I should enable. To save us unnecessary round-trips,
can you precise the exact environment variable you want me to set?
`MOZ_LOG=mozStorage:5` ? `MOZ_LOG=timestamp,mozStorage:5` ? Something else?

(In reply to Marco Bonardo [::mak] from comment #12)
> note that the log will contain bound queries, and thus it may contains
> private information. IF you're not comfortable with sharing it here,
> you can send it through a private mail download link. I'd also expect
> it to be quite large, so limit the session to as short as possible.

Will do. Thanks for the followup, everyone.
`MOZ_LOG=timestamp,mozStorage:5`, please. Also, would you mind setting the following prefs in `about:config`? These should make Sync output more detailed logs to the console and log files. Thank you!

* services.sync.log.appender.file.logOnSuccess = true
* services.sync.log.appender.dump = Trace
* services.sync.log.appender.file.level = Trace
* services.sync.log.logger.engine.tabs = Trace
* services.sync.log.logger.engine.bookmarks = Trace
* services.sync.log.logger.engine.history = Trace
Gah, this will have to wait one week. But wait, something is weird:

I'm off from work and my work machine this week. So, to reproduce
the bug and provide mozStorage logs, I did what anyone would do:
run `home` machine with another fresh profile (called `test`),
let it Sync, and send tabs from `test` to `home`.

Which I did, and you guessed it, tabs arrive in `home` but no crash happens.
So, it looks like this issue happens *only* from one specific account
to one specific account. oÔ. Not sure that's an interesting data point,
feel free to ask anything more, and else I'll be back with logs around
Monday 28, when I get access to my `work` machine.
The crash is just an async shutdown crash (potentially harmless, if not for the annoyance of getting the crash reporter UI) stating that the Places Sqlite.jsm wrapped connection didn't close before Sqlite.jsm. 

The async timeout metadata tells us we are at:
  state = "1. Service has initiated shutdown";
that means this didn't resolve nor reject:
  await conn.close();

This is likely to be waiting on some operation that was invoked through executeBeforeShutdown (withConnectionWrapper). These take a name parameter that should be reported in the async shutdown state, but for some reason the state is missing that info. There is  surely a bug or some missing code that prevents async shutdown from properly chaining all the currently pending states.
So we know where we are blocked, but not by who.

I just found that async shutdown has its own logging! the boolean toolkit.asyncshutdown.log pref can be set to true, and it should print out to the console all of its logging. Unfortunately this means you should reproduce the bug in a debug build (that has a console) to be able to catch this logging, otherwise the dump() calls would be lost.
It could be very useful, maybe more than the mozStorage one.

I actually wonder if we could keep a circular buffer of the last N logged messages, and always report those in the crash metadata... Sounds like it would be useful.
Flags: needinfo?(mak77)
(In reply to Marco Bonardo [::mak] from comment #16)
> I just found that async shutdown has its own logging! the boolean
> toolkit.asyncshutdown.log pref can be set to true, and it should print out
> to the console all of its logging. Unfortunately this means you should
> reproduce the bug in a debug build (that has a console) to be able to catch
> this logging, otherwise the dump() calls would be lost.
> It could be very useful, maybe more than the mozStorage one.

FWIW, if you are at a command prompt and start Firefox with the "-attach-console" flag, you should find this output does go to that console. Note also that you must set the pref browser.dom.window.dump.enabled to true to see dump output in either case.
Well, I can no longer reproduce this bug. Not sure whether my testing
of Kit's fix described in bug 1389716 mentioned at comment 9 was flaky,
or if another change fixed the issue in the meantime :-/

This week I resumed testing with the same scenario as usual
(sending tabs from `work` to `home`). When receiving the tabs at `home`,
there is a noticeable CPU usage spike, but no 100% topping of one core,
and CPU goes back to idling after a few seconds once sync is done.

So, puzzled to not know what fixed but glad it's fixed! Will re-open the
bug if it shows up again; by then thanks everyone for the followup :)
Flags: needinfo?(ronan)
Status: UNCONFIRMED → RESOLVED
Closed: 3 years ago
Resolution: --- → INVALID
You need to log in before you can comment on or make changes to this bug.