Note: There are a few cases of duplicates in user autocompletion which are being worked on.

Cookie Service closes DB on main thread

RESOLVED FIXED in Firefox 55

Status

()

Core
Networking: Cookies
RESOLVED FIXED
4 years ago
a month ago

People

(Reporter: vladan, Assigned: junior)

Tracking

(Blocks: 2 bugs)

unspecified
mozilla55
Points:
---
Dependency tree / graph

Firefox Tracking Flags

(firefox55 fixed)

Details

(Whiteboard: [necko-active][qf:p1][necko-quantum])

Attachments

(1 attachment, 2 obsolete attachments)

(Reporter)

Description

4 years ago
We're closing the cookie DB on the main thread after reading it in. Can we do an async close to prevent main-thread jank?

Profile: http://people.mozilla.com/~bgirard/cleopatra/#report=3749141e55c007c281f874795d9d0cd6cd751e15&search=ReadCookieDBListener

Comment 1

4 years ago
mak, asuth, what can be done here.  Why is the db being closed at all during normal running of ff?
:vladan, the profile you link to confuses me.  Am I correct in reading it as saying that 100% of the time here is in the bottom call (NtClose)?  Just how long is the jank here, and when is it happening (shutdown)?

I think I've got the fix--patch coming...
Created attachment 744961 [details] [diff] [review]
v1: close syncConn asynchronously

bent: one-line patch (with long-winded explanation :)  

So what's going on here is that we do this stupid thing with the cookie code where we

1) at first use of the cookie service we start an async load of the entire DB into memory (I think we should do this much earlier in startup and skip step 2).

2) we then issue a sync DB request (on a different db connection) to get just the single cookie that's immediately needed.  You and I had a long IRC chat where we concluded that it's likely this sync request winds up actually blocking for the async query to complete :(

3) we then null out the syncConn connection pointer, and that's currently causing the connection's Release function to call synchronous DB close on the main thread (see stack in profile):

   http://mxr.mozilla.org/mozilla-central/source/storage/src/mozStorageConnection.cpp#436

   http://mxr.mozilla.org/mozilla-central/source/storage/src/mozStorageConnection.cpp#413

So this patch just calls AsyncClose on the syncConn before we release it, which should make the close happen off-main thread. (Or specifically, it should cause the connection to be flushed off main thread, which will make the actual main thread close() much faster.  Or so is my weak understanding of mozStorageConnection):

   http://mxr.mozilla.org/mozilla-central/source/storage/src/mozStorageConnection.cpp#347

I'm kind of surprised that the synchronous close of the sync conn is taking so much time.  I'd expect, for instance, that the blocking wait for the sync cookie SQL lookup would take longer. In any case, I guess this should help.

bent: is there any reason we can't make the connection::Release logic call AsyncClose(), so we don't hit sync Close() by default?  Seems like a jank footgun.
Assignee: nobody → jduell.mcbugs
Status: NEW → ASSIGNED
Attachment #744961 - Flags: review?(bent.mozilla)
bent: oh, and IIRC we open all sqlite connections (or at least the cookie ones) on the main thread, which does main thread I/O.  Could we move that to another thread?
(Reporter)

Comment 5

4 years ago
(In reply to Jason Duell (:jduell) from comment #2)
> :vladan, the profile you link to confuses me.  Am I correct in reading it as
> saying that 100% of the time here is in the bottom call (NtClose)? 

The profile I linked to is filtered on "ReadCookieDBListener". You can see the full profile with samples before & after NtClose by removing the "search" argument from the URL:

http://people.mozilla.com/~bgirard/cleopatra/#report=3749141e55c007c281f874795d9d0cd6cd751e15

> Just how long is the jank here, and when is it happening (shutdown)?

The sampling frequency is 50 ms (see info in panel on the left) and there are 58 samples with this stack, so 58 * 50 ms = 2.9 seconds to perform NtClose

Note that I captured this profile while running Firefox with its profile directory on an SD card:

https://blog.mozilla.org/vdjeric/2013/05/01/running-nightly-23-from-an-sd-card/
Comment on attachment 744961 [details] [diff] [review]
v1: close syncConn asynchronously

Review of attachment 744961 [details] [diff] [review]:
-----------------------------------------------------------------

bent suggests :mak for advice here.  I currently suspect that this patch won't actually do very much, since it's the actual NtClose that's blocking for 2.9 seconds, not anything that's flushing the db connection first.  I'm guessing the right fix here is to close the the connection on a different thread.  If that's right just r- this patch and I'll write a new one.
Attachment #744961 - Flags: review?(bent.mozilla) → review?(mak77)
FYI: opening the DB connection is also blocking I/O on main thread, but it only causes a single sample delay, so 50ms:

   http://people.mozilla.com/~bgirard/cleopatra/#report=3749141e55c007c281f874795d9d0cd6cd751e15&search=TryInitDB
(In reply to comment #7)
> FYI: opening the DB connection is also blocking I/O on main thread, but it only
> causes a single sample delay, so 50ms:
> 
>   
> http://people.mozilla.com/~bgirard/cleopatra/#report=3749141e55c007c281f874795d9d0cd6cd751e15&search=TryInitDB

But that could become a much worse jank on other systems if the I/O subsystem is busy doing other things, etc.  We should fix that too, but perhaps in a different bug.
(Reporter)

Comment 9

4 years ago
(In reply to :Ehsan Akhgari (needinfo? me!) from comment #8)
> (In reply to comment #7)
> > FYI: opening the DB connection is also blocking I/O on main thread, but it only
> > causes a single sample delay, so 50ms:
> But that could become a much worse jank on other systems if the I/O
> subsystem is busy doing other things, etc.  We should fix that too, but
> perhaps in a different bug.

Yoric is working on making this possible in bug 702559
Comment on attachment 744961 [details] [diff] [review]
v1: close syncConn asynchronously

Review of attachment 744961 [details] [diff] [review]:
-----------------------------------------------------------------

Calling AsyncClose, despite the name, won't help you here.
It just enqueues an event to the async thread that, when executed, enqueues another event to the main thread, that does the actual closing. So you would just be delaying the hang. Would be mandatory only if you had executed something async on that connection.
As said, Yoric is indeed working on a pure-async connection that can do open and close off main-thread, though again that won't help us here, since it doesn't do that for synchronous connections, that is our case.
May we evaluate closing synchronous connections off main-thread? well, it could be feasible I suppose.

What we can do for now is checking if there are obvious reasons for close to take this much.
Since cookies are using wal, they should not block each other. The wal should also not be merged when we close this sync connection, since according to the documentation it's merged when the last connection closes, and at the point this one is closed the async one is still open.  So doesn't look like an fsync.

One thing we may try is to make the synchronous connection read-only, while I'm not sure that will change much, it should conceptually reduce the needed synchronization work on closing it.
We should also ensure all prepared statements are finalized (or nullified, that is the same thing due to the destructor) before the connection is closed, this stmt is not http://mxr.mozilla.org/mozilla-central/source/netwerk/cookie/nsCookieService.cpp#2180 would be enough to scope it.  Again, I'm not sure this helps, but it's surely wrong atm.

I took a look at the code in cookies service and couldn't find other misuses, so far.
That said, looks like all of the time is spent by the OS in closing the handle, so I'm not sure we can anything else than evaluating moving all sqlite_close calls off main-thread.
Attachment #744961 - Flags: review?(mak77) → review-
marco, thanks for looking at this.

Since this is one-time jank that only occurs on really fairly crappy storage, I don't think it's a top priority.

I see two ways to fix this when the time is right.  

1) we can wait for bug 702559 and we'll get async open/close "for free".

2) we should move the cookie service init to an earlier part of startup, since even if we use async DB calls we'll still block the main thread during the 1st channel load.  I filed bug 870460 for that.

For now I'll mark this depending on 702559.  If we do #2 this would become INVALID.
Depends on: 702559
(In reply to Jason Duell (:jduell) from comment #11)
> marco, thanks for looking at this.
> 
> Since this is one-time jank that only occurs on really fairly crappy
> storage, I don't think it's a top priority.
> 
> I see two ways to fix this when the time is right.  
> 
> 1) we can wait for bug 702559 and we'll get async open/close "for free".

As I said, that bug won't help us, since it's only for async connections, that means you open a connection and can only run async operations on it. Here we have a synchronous connection on main-thread.

What you want is to make sync connections (that live on main-thread) call sqlite_close on another thread.  For this we need a separate bug, that may probably re-use some code from bug 702559, but it's not the same thing.
Depends on: 874814

Updated

4 years ago
No longer depends on: 702559

Updated

4 years ago
Blocks: 885732
Now we can use asyncClose on a synchronous connection, and it will close on the async thread.
the suggested patch should do, but we could also try to make syncConn a readonly clone of dbConn and ensure stmtReadDomain is finalized before asyncClose.
Whiteboard: [necko-backlog]
Whiteboard: [necko-backlog] → [necko-backlog][qf]
Potential for main-thread IO, qf:p1.
Whiteboard: [necko-backlog][qf] → [necko-backlog][qf:p1]
I suspect this bug only happens at startup/shutdown (ie IIRC we only open/close the cookie SQLite connection at startup/shutdown).  I wouldn't assume it's a major performance issue, but worth looking at.  We would want to make sure the patch actually eliminates some main thread I/O (i.e. it doesn't seem clear that asyncClose has always meant we get non-main I/O, but comment 13 looks promising).
Assignee: jduell.mcbugs → juhsu
Blocks: 1348289
No longer blocks: 1348289
Comment hidden (obsolete)
Whiteboard: [necko-backlog][qf:p1] → [necko-backlog][qf:p1][necko-quantum]
(In reply to Jason Duell [:jduell] (needinfo me) from comment #15)
> I suspect this bug only happens at startup/shutdown (ie IIRC we only
> open/close the cookie SQLite connection at startup/shutdown).  I wouldn't
> assume it's a major performance issue, but worth looking at.

From bug 1362361:
See this profile where it blocked the main thread for 238ms on the quantum reference hardware: https://perfht.ml/2pNcTA7
Duplicate of this bug: 1362361
(In reply to Jason Duell [:jduell] (needinfo me) from comment #15)
> I suspect this bug only happens at startup/shutdown (ie IIRC we only
> open/close the cookie SQLite connection at startup/shutdown).  I wouldn't
> assume it's a major performance issue, but worth looking at.  We would want
> to make sure the patch actually eliminates some main thread I/O (i.e. it
> doesn't seem clear that asyncClose has always meant we get non-main I/O, but
> comment 13 looks promising).

FWIW this is mostly a startup issue (we read the cookies into memory at startup) but it could also bite you after that if we detect that the cookie DB is corrupted and attempt to recreate it.
(In reply to :Ehsan Akhgari (super long backlog, slow to respond) from comment #19)
> FWIW this is mostly a startup issue (we read the cookies into memory at
> startup) but it could also bite you after that if we detect that the cookie
> DB is corrupted and attempt to recreate it.

Corruption should be a rare event, BUT the cookies service seems to handle any failure as a corruption through:
if (NS_FAILED(rv)) {... HandleCorruptDB()
instead of checking for NS_ERROR_FILE_CORRUPTED (bug to be filed, bug 1125157 would be useful), so any kind of other failure like SQLITE_BUSY or similar causes us to replace the db.
And, for unexplained perf reasons, it opts-in to corruption by using synchronous = off (bug 1158387).

Note that in case of corruption we may still need to close the database synchronously to be able to throw away the current file and re-create it. The only way around that would be for a component to provide only asynchronous APIs, but most components provide some synchronous APIs that expect the db to be available immediately (I didn't check if this is true for the cookies manager). For that reason Bug 1355561 is introducing an API to synchronously close even an async connection, because we can't do differently for some consumers now. Once the migration to WebExt is done we could be able to speed up refactoring to remove synchronous APIs.
If we needed to throw away the cookie database with any frequency, our problem would be mobs of angry users whose cookies we've lost, more than some jank on the main thread when we do it :)  Given that we don't have a lot of those bug reports I'm guessing we're OK on that front.
Whiteboard: [necko-backlog][qf:p1][necko-quantum] → [necko-active][qf:p1][necko-quantum]
(Assignee)

Comment 22

2 months ago
a quick experiment shows asyncOpen version (comment 6) make thing worse (3ms -> 4ms in my Windows)

The profile of comment 17 is not special (with many cookies), so is mine.
(In reply to Junior[:junior] from comment #22)
> a quick experiment shows asyncOpen version (comment 6) make thing worse (3ms
> -> 4ms in my Windows)

is not comment 6 about asyncClose, rather than asyncOpen?
It's also unclear which things get worse. If it takes 1ms more to close is not a big deal, if it causes 1ms of additional jank that's bad.

comment 10 also suggests other things to check:
1. The synchronous connection could be made read-only, that would make it nicely faster
2. Should check that all statements are finalized before the close call, otherwise Storage will spend time looping through all open statements to close them. A debug build should warn when this happens IIRC.

additionally, this doesn't say how you plan to handle a corruption requiring a synchronous close. But I assume for that single case for now you'll keep using Close() instead of AsyncClose().
(Assignee)

Comment 24

2 months ago
(In reply to Marco Bonardo [::mak] from comment #23)
> (In reply to Junior[:junior] from comment #22)
> > a quick experiment shows asyncOpen version (comment 6) make thing worse (3ms
> > -> 4ms in my Windows)
> 
> is not comment 6 about asyncClose, rather than asyncOpen?
> It's also unclear which things get worse. If it takes 1ms more to close is
> not a big deal, if it causes 1ms of additional jank that's bad.
It's |asyncClose|. Sorry for the typo.
and it's 1ms additional cost in the main thread, which is bad.
(In reply to Junior[:junior] from comment #24)
> It's |asyncClose|. Sorry for the typo.
> and it's 1ms additional cost in the main thread, which is bad.

You know what, now it all makes sense.
I think (could likely be verified with some profiling) the cost you see is due to the need of starting up the async thread. Since this is a synchronous connection the async thread hasn't been initialized, so when we invoke asyncClose() it is created and then the close runnable is enqueued on it.
This makes hard to measure the win here, because it will depend on what is faster, spawning the thread or the close() call?
Sync SyncConn is only running selects, I suspect close is quite fast since it won't have to fsync anything, so we may not gain anything. This bug may end up being a wontfix in the current shape.

It would still be interesting to make syncConn a read-only connection, that would remove some locking requirements from it and could also make close a bit faster.
Assuming the plan from bug 870460 comment 21 works, we'll be moving all cookie SQLite I/O onto a dedicated thread, which will also take care of closing the db.  If that works we should dupe this bug to 870460.  Keeping this bug separate only makes sense if we find we can't off-main our cookie I/O for some reason.
(In reply to Marco Bonardo [::mak] from comment #25)
> I think (could likely be verified with some profiling) the cost you see is
> due to the need of starting up the async thread. Since this is a synchronous
> connection the async thread hasn't been initialized, so when we invoke
> asyncClose() it is created and then the close runnable is enqueued on it.
> This makes hard to measure the win here, because it will depend on what is
> faster, spawning the thread or the close() call?

Note that after removing synchronous = OFF the close() call may become more expensive since it will require an fsync. Then it may be worth to still use AsyncClose().
(Assignee)

Comment 28

a month ago
(In reply to Marco Bonardo [::mak] from comment #27)
> (In reply to Marco Bonardo [::mak] from comment #25)
> > I think (could likely be verified with some profiling) the cost you see is
> > due to the need of starting up the async thread. Since this is a synchronous
> > connection the async thread hasn't been initialized, so when we invoke
> > asyncClose() it is created and then the close runnable is enqueued on it.
> > This makes hard to measure the win here, because it will depend on what is
> > faster, spawning the thread or the close() call?
> 
> Note that after removing synchronous = OFF the close() call may become more
> expensive since it will require an fsync. Then it may be worth to still use
> AsyncClose().

I have tried test_cookies_async_failure.js which contains lots of data operation.
Seems no difference between close() and asyncClose() for the whole test time.

However, even we remove synchronous = OFF,
Close() wins AsyncClose() if I measure the closing time as start-up read,
but not causing a serious worse jank (around 160ns->180ns in my Linux desktop with HDD).

Theoretically, asyncClose would work well with the heavy fasyc.
(Assignee)

Updated

a month ago
Blocks: 1158387
(Assignee)

Comment 29

a month ago
Created attachment 8875679 [details] [diff] [review]
close syncConn asynchronously- v2

Only change the comments
Attachment #744961 - Attachment is obsolete: true
Attachment #8875679 - Flags: review?(mak77)
Comment on attachment 8875679 [details] [diff] [review]
close syncConn asynchronously- v2

Review of attachment 8875679 [details] [diff] [review]:
-----------------------------------------------------------------

::: netwerk/cookie/nsCookieService.cpp
@@ +2765,5 @@
>    mDefaultDBState->readListener = nullptr;
> +
> +  // Close sync connection asynchronously: if we let destructor close, it does
> +  // on main thread, and after removing synchronous = OFF, the Close() call may
> +  // become more expensive since it will require an fsync.

This comment will require to be changed again with the synhronous change. I'd just say ": if we let the destructor close, it may cause an expensive fsync operation on the main-thread."
Attachment #8875679 - Flags: review?(mak77) → review+
(Assignee)

Comment 31

a month ago
Created attachment 8875976 [details] [diff] [review]
close syncConn asynchronously- v3

https://treeherder.mozilla.org/#/jobs?repo=try&revision=3a7eb9b093785d6bbf51652363d78bf9282e7b9a
Attachment #8875679 - Attachment is obsolete: true
Attachment #8875976 - Flags: review+
(Assignee)

Updated

a month ago
Keywords: checkin-needed

Comment 32

a month ago
Pushed by cbook@mozilla.com:
https://hg.mozilla.org/integration/mozilla-inbound/rev/94a5f5ddcf39
avoid blocking main thread when we closes nsCookieService syncConn db, r=mak
Keywords: checkin-needed
https://hg.mozilla.org/mozilla-central/rev/94a5f5ddcf39
Status: ASSIGNED → RESOLVED
Last Resolved: a month ago
status-firefox55: --- → fixed
Resolution: --- → FIXED
Target Milestone: --- → mozilla55
You need to log in before you can comment on or make changes to this bug.