Closed Bug 1232638 Opened 4 years ago Closed 2 years ago

IPDL::PCookieService::RecvGetCookieString causing janks

Categories

(Core :: Networking: Cookies, defect, P4)

defect

Tracking

()

RESOLVED DUPLICATE of bug 1331680
Tracking Status
e10s + ---

People

(Reporter: gkrizsanits, Assigned: gkrizsanits)

References

(Blocks 1 open bug)

Details

(Whiteboard: [necko-backlog])

See bug 1182637 comment 17. In short we should use push scheme for cookies to avoid these janks. This is responsible for 7% of the janks.
Blocks: 1182637
tracking-e10s: --- → ?
Component: IPC → Networking: Cookies
Assignee: nobody → mrbkap
Blocks: e10s-perf
Priority: -- → P1
Blake should I take this one?
I'd like to take a look at this first.
Assignee: mrbkap → gkrizsanits
OK, so I'm trying to find the possible foot-guns I can run into while turning the messaging for cookies into async. But before that there is something I don't get: this comment here [1] and the actual ipdl [2] are conflicting each other. 

[1]: http://mxr.mozilla.org/mozilla-central/source/netwerk/cookie/CookieServiceChild.cpp#165
[2]: http://mxr.mozilla.org/mozilla-central/source/netwerk/cookie/PCookieService.ipdl#98

I think only the comment is outdated, since IPC message dispatching should preserve message orders in this case (same prio) I don't see a reason for the 'set' part to be sync. About the 'get' part, there is a comment that says the reason is to avoid a racing: [3], but I don't get that example at all... I thought the reason is to return a value asap... Could you explain me that comment?

[3]: http://mxr.mozilla.org/mozilla-central/source/netwerk/cookie/PCookieService.ipdl#38

OK, so what I had in mind that we just keep a local copy of the cookie string and whenever something changes the cookie string for a given URI it notifies the parent process, which broadcasts the change to the child processes. Storing the local copy is the hard part though. We don't want to have a copy of the whole database ofc... One option is to store the cookie string on the document and on each httprequest, and then we still have to keep a (probably weak) reference to all the documents and httprequests that we need to update about the changes... but then there is still the case of plugins... This might be harder than I thought it is...
I think we should re-triage this. The reason why this was P1 perf, that we had a problem with chrome process janking, and this was responsible for a significant amount of those janks. As it turns out those measurements were invalid so I'm not sure if that's still the case, but we are almost certain that those janks are looking quite good now anyway. I still think that this bug should be fixed, but I'm not convinced that this is P1. I vote for P2, but we should ask Bill too.
mconley says this is still showing up as a top source of jank in current BHR
The bothering part is that this jank is happening on parent side. Which means even if I manage to turn this into an async operation it will still jank (as does the recvSetCookieString as well just with a lot less disturbing rate). 

Seems like recvGetCookieString is just taking too much time to finish... Bill, Blake mentioned that you had some plans to clean this up, I'm sure you've already realized this, so I was wondering what was your plan?

Is there a way to dispatch an async (or a sync) message in two steps? Like content process sends a GetCookieString request message and blocks, then we dispatch recvGetCookieString on parent side, but instead of doing all the work right there, firing up a worker thread to do all the database work and what not and message us back when ready and in the meanwhile we would not jank at least. When all the work is done we would finally reply back to the content process.

This would probably cause some more content process jank as a side effect but for that we could still figure out some caching optimization for cookies combined with a broadcasted updating message for setCookieString.
Flags: needinfo?(wmccloskey)
(In reply to Brad Lassey [:blassey] (use needinfo?) from comment #5)
> mconley says this is still showing up as a top source of jank in current BHR

Yes, but is that a top priority right now that we have pretty good figures for BHR janks? Last time I checked GetCookieString was responsible for about 10% of the janks, given that e10s parent process is already a lot less janky than the non-e10s process I'm not sure why is this still P1. Am I missing something? Mike, what were the numbers you've seen?
Flags: needinfo?(mconley)
(In reply to Gabor Krizsanits [:krizsa :gabor] from comment #7)
> (In reply to Brad Lassey [:blassey] (use needinfo?) from comment #5)
> > mconley says this is still showing up as a top source of jank in current BHR
> 
> Yes, but is that a top priority right now that we have pretty good figures
> for BHR janks? Last time I checked GetCookieString was responsible for about
> 10% of the janks, given that e10s parent process is already a lot less janky
> than the non-e10s process I'm not sure why is this still P1. Am I missing
> something? Mike, what were the numbers you've seen?

I remember seeing it on the BHR dashboard at https://telemetry.mozilla.org/hang/bhr/. I thought I'd seen it in the Top 10, but for "two weeks ago" it was at the 20th spot, and "four weeks ago" it was at the 18th spot.

So perhaps I was mistaken in how severe it was, but it's definitely showing up as a source of hangs.
Flags: needinfo?(mconley)
The sorting of that list is really weird. A lot of those hangs are actually Fennec-only. If I look at the number of hangs that have higher Firefox numbers than the RecvGetCookieString one, there are only 4. So it's basically #5.

However, I agree with Gabor that it's hard to avoid making this sync in the parent if we really need to do the I/O. The child call definitely needs to be sync. We can't spin a nested event loop since that would probably break things horribly.

There currently isn't any way in IPDL to send a blocking message and handle it asynchronously in the parent. That might be nice to have, but it would be some work. The one objection I can think of is that CPOWs are forbidden when we're handling a sync message from the child. If we handle the sync message asynchronously, then add-on code could run during that time and have it's CPOWs unexpectedly rejected.

An additional issue is that RecvGetCookieString really only handles the document.cookies case. As far as I can tell, we would still be reading cookies synchronously when opening network requests. That's probably a much more common case anyway.

Honza, does this analysis sound right to you, at least as far as the networking code works?
Flags: needinfo?(wmccloskey) → needinfo?(honzab.moz)
(In reply to Bill McCloskey (:billm) from comment #9)
> There currently isn't any way in IPDL to send a blocking message and handle
> it asynchronously in the parent. That might be nice to have, but it would be
> some work.

I expect this to be quite some work for sure, with questionable results... So
even if it's #5, if we want to improve the BHR hangs I would try to look into
1-4 if there is a lower hanging fruit there...

> An additional issue is that RecvGetCookieString really only handles the
> document.cookies case. As far as I can tell, we would still be reading
> cookies synchronously when opening network requests. That's probably a much
> more common case anyway.

I'm not sure which case are you referring to, but it handles http requests case
as well which I would expect to be somewhat frequent... http://mxr.mozilla.org/mozilla-central/source/netwerk/cookie/CookieServiceChild.cpp#127 and plugins but I wouldn't want to touch
that part...
(In reply to Mike Conley (:mconley) - Needinfo me! from comment #8)
> I remember seeing it on the BHR dashboard at
> https://telemetry.mozilla.org/hang/bhr/.

From 50k hangs we don't know where 23k goes (only Startup::XRE_Main on the pseudo stack) see related Bug 1232641 and bug 1224374 (landed but does not seem to improve the situation). So that is 46% of the hangs, I would focus on that instead. Timer::Fire 11.4k 22% this one could be interesting too. And for IPDL::PCookieService::RecvGetCookieString we have 1.5k which is about 3%. I don't think this worth the effort right now given the complexity this fix will involve.
(In reply to Gabor Krizsanits [:krizsa :gabor] from comment #11)
> (In reply to Mike Conley (:mconley) - Needinfo me! from comment #8)
> > I remember seeing it on the BHR dashboard at
> > https://telemetry.mozilla.org/hang/bhr/.
> 
> From 50k hangs we don't know where 23k goes (only Startup::XRE_Main on the
> pseudo stack) see related Bug 1232641 and bug 1224374 (landed but does not
> seem to improve the situation). So that is 46% of the hangs, I would focus
> on that instead. Timer::Fire 11.4k 22% this one could be interesting too.
> And for IPDL::PCookieService::RecvGetCookieString we have 1.5k which is
> about 3%. I don't think this worth the effort right now given the complexity
> this fix will involve.

Yep, that sounds reasonable.
(In reply to Bill McCloskey (:billm) from comment #9)
> The sorting of that list is really weird. A lot of those hangs are actually
> Fennec-only. If I look at the number of hangs that have higher Firefox
> numbers than the RecvGetCookieString one, there are only 4. So it's
> basically #5.
> 
> However, I agree with Gabor that it's hard to avoid making this sync in the
> parent if we really need to do the I/O. The child call definitely needs to
> be sync. We can't spin a nested event loop since that would probably break
> things horribly.
> 
> There currently isn't any way in IPDL to send a blocking message and handle
> it asynchronously in the parent. That might be nice to have, but it would be
> some work. The one objection I can think of is that CPOWs are forbidden when
> we're handling a sync message from the child. If we handle the sync message
> asynchronously, then add-on code could run during that time and have it's
> CPOWs unexpectedly rejected.
> 
> An additional issue is that RecvGetCookieString really only handles the
> document.cookies case. As far as I can tell, we would still be reading
> cookies synchronously when opening network requests.

I don't understand this statement.

> That's probably a much
> more common case anyway.
> 
> Honza, does this analysis sound right to you, at least as far as the
> networking code works?

AFAIK, necko reads cookies (means - adds them to the request) only on the parent process, no IPC involved.  It's been so for ages.

(In reply to Gabor Krizsanits [:krizsa :gabor] from comment #3)
> OK, so what I had in mind that we just keep a local copy of the cookie
> string and whenever something changes the cookie string for a given URI it
> notifies the parent process, which broadcasts the change to the child
> processes. Storing the local copy is the hard part though. We don't want to
> have a copy of the whole database ofc... One option is to store the cookie
> string on the document and on each httprequest, 

Why http requests?

> and then we still have to
> keep a (probably weak) reference to all the documents and httprequests 

Again, why http requests?

> that
> we need to update about the changes... but then there is still the case of
> plugins... This might be harder than I thought it is...

I kinda don't understand this comment fully:
http://hg.mozilla.org/mozilla-central/annotate/6ea654cad929/netwerk/cookie/PCookieService.ipdl#l37

37    * Get the complete cookie string associated with the URI. This is a sync
38    * call in order to avoid race conditions -- for instance, an HTTP response
39    * on the parent and script access on the child.

What kind of race condition is here being talked about?  I don't follow at all.  If a content script reads document.cookie string _after_ an http request (probably XHR) responsible for setting the cookie has been done (readystate==4) cookies will be stored at the moment, both sync or async request to the parent will see it.  If there is any OTHER http response overriding the cookies on the parent process just around the time when reading document.cookie on a child process then there is no help doing it synchronously anyway, the race will always be there.
Flags: needinfo?(honzab.moz)
(In reply to Honza Bambas (:mayhemer) from comment #13)
> AFAIK, necko reads cookies (means - adds them to the request) only on the
> parent process, no IPC involved.  It's been so for ages.

Sorry I think I attached the wrong link to Comment 10. The correct one is:
http://mxr.mozilla.org/mozilla-central/source/netwerk/protocol/http/HttpChannelChild.cpp#1607
and
http://mxr.mozilla.org/mozilla-central/source/netwerk/protocol/http/HttpChannelChild.cpp#2391

I assumed HttpChannelChild runs in the child, hence I assumed this code uses IPC for getting the cookies but then I guess I'm wrong?

> I kinda don't understand this comment fully:
> http://hg.mozilla.org/mozilla-central/annotate/6ea654cad929/netwerk/cookie/
> PCookieService.ipdl#l37
> 
> 37    * Get the complete cookie string associated with the URI. This is a
> sync
> 38    * call in order to avoid race conditions -- for instance, an HTTP
> response
> 39    * on the parent and script access on the child.
> 

You're not alone with that ;) I think why it has to be sync is because it can
be called from script, but I don't understand that race condition story at all.
(In reply to Gabor Krizsanits [:krizsa :gabor] from comment #14)
> (In reply to Honza Bambas (:mayhemer) from comment #13)
> > AFAIK, necko reads cookies (means - adds them to the request) only on the
> > parent process, no IPC involved.  It's been so for ages.
> 
> Sorry I think I attached the wrong link to Comment 10. The correct one is:
> http://mxr.mozilla.org/mozilla-central/source/netwerk/protocol/http/
> HttpChannelChild.cpp#1607
> and
> http://mxr.mozilla.org/mozilla-central/source/netwerk/protocol/http/
> HttpChannelChild.cpp#2391

(Please use annotated links to the source)

Hmm...  Looks like I was wrong!  I remember this has been removed once.  The child channel AFAIK doesn't need to have cookies in hands.  We don't notify any on-*-request for child channels and fill cookies again in parent.  HttpChannelParent (the wrapper for nsHttpChannel) doesn't work with cookies at all - i.e. is not passing any cookies from the child channel to the real channel.

Please file a bug to remove AddCookiesToRequest from the child part of http.  We will investigate there.

> 
> I assumed HttpChannelChild runs in the child, hence I assumed this code uses
> IPC for getting the cookies but then I guess I'm wrong?
> 
> > I kinda don't understand this comment fully:
> > http://hg.mozilla.org/mozilla-central/annotate/6ea654cad929/netwerk/cookie/
> > PCookieService.ipdl#l37
> > 
> > 37    * Get the complete cookie string associated with the URI. This is a
> > sync
> > 38    * call in order to avoid race conditions -- for instance, an HTTP
> > response
> > 39    * on the parent and script access on the child.
> > 
> 
> You're not alone with that ;) I think why it has to be sync is because it can
> be called from script, but I don't understand that race condition story at
> all.

Generally, sync getters are evil.  I don't know how to help here tho, cookies is not a code I'm taking care of.  I think Ehsan may help here more.
(In reply to Honza Bambas (:mayhemer) from comment #13)
> > Honza, does this analysis sound right to you, at least as far as the
> > networking code works?
> 
> AFAIK, necko reads cookies (means - adds them to the request) only on the
> parent process, no IPC involved.  It's been so for ages.

Yes, understood. Ignore e10s for the moment, or assume we're in the parent process. My question is whether Necko reads cookies on the main thread when you open an HTTP channel.
(In reply to Bill McCloskey (:billm) from comment #16)
> (In reply to Honza Bambas (:mayhemer) from comment #13)
> > > Honza, does this analysis sound right to you, at least as far as the
> > > networking code works?
> > 
> > AFAIK, necko reads cookies (means - adds them to the request) only on the
> > parent process, no IPC involved.  It's been so for ages.
> 
> Yes, understood. Ignore e10s for the moment, or assume we're in the parent
> process. My question is whether Necko reads cookies on the main thread when
> you open an HTTP channel.

We call cookieService->GetCookieStringFromHttp() from AsyncOpen.  Which is called on the main thread.
(In reply to Honza Bambas (:mayhemer) from comment #17)
> We call cookieService->GetCookieStringFromHttp() from AsyncOpen.  Which is
> called on the main thread.

Do you know if that has been a source of jank?
(In reply to Bill McCloskey (:billm) from comment #18)
> (In reply to Honza Bambas (:mayhemer) from comment #17)
> > We call cookieService->GetCookieStringFromHttp() from AsyncOpen.  Which is
> > called on the main thread.
> 
> Do you know if that has been a source of jank?

I don't recall, sorry.  but I think we have bigger offenders than this one.
(In reply to Honza Bambas (:mayhemer) from comment #15)
> (Please use annotated links to the source)

Ehh... thanks for reminding me, I totally should have done that.

> Please file a bug to remove AddCookiesToRequest from the child part of http.
> We will investigate there.

Sure, Bug 1248699.

> Generally, sync getters are evil.

Yeah :(

(In reply to Honza Bambas (:mayhemer) from comment #19)
> (In reply to Bill McCloskey (:billm) from comment #18)
> > Do you know if that has been a source of jank?
> 
> I don't recall, sorry.  but I think we have bigger offenders than this one.

If it were, did we see that based on the pseudo stack?
> (In reply to Honza Bambas (:mayhemer) from comment #19)
> > (In reply to Bill McCloskey (:billm) from comment #18)
> > > Do you know if that has been a source of jank?
> > 
> > I don't recall, sorry.  but I think we have bigger offenders than this one.
> 
> If it were, did we see that based on the pseudo stack?

Not sure what tool you refer to.  But probably yes?  I was doing some tests in the pasts with a slow media storage for profile and this didn't pop up to me.
(In reply to Honza Bambas (:mayhemer) from comment #21)
> Not sure what tool you refer to.  But probably yes?  I was doing some tests
> in the pasts with a slow media storage for profile and this didn't pop up to
> me.

We tag certain functions and telemetry builds a so called pseudo stack from these, and we can later on see what was on this pseudo stack when let's say a BHR jank occurred. Anyway, I just wanted to point it out that in the non-e10s case I don't think we would notice this kind of jank by looking at the pseudo stack (without the IPC part I'm not sure we tag any related function), but since you used other methods for profiling, just ignore my question.
Whiteboard: [necko-backlog]
After talking about this and the other PCookieService related bug with many people and we all agreed that this issue is not important for now I'm setting its priority back.
Priority: P1 → P4
Is this a dupe of bug 1331680?
I am interested to know if bug 1331680 also fix what mentioned in comment 6 as below.  Maybe we re-confirm it after bug 1331680 landed.
"The bothering part is that this jank is happening on parent side. Which means even if I manage to turn this into an async operation it will still jank (as does the recvSetCookieString as well just with a lot less disturbing rate). "
Flags: needinfo?(amchung)
Amy, can this bug be resolved as a duplicate of bug 1331680?
Depends on: 1331680
Hi,
In my view, this bug is similar to bug 1331680.
I had measured time of using async method to get cookie string, and the longest time is not over 0.1ms.
I think we don't have to off main thread on cookie service.
Flags: needinfo?(amchung)
I think this can be resolved as duplicate of bug 1331680 as the sync get cookie string method is only used when pref'ed off.
Status: NEW → RESOLVED
Closed: 2 years ago
Resolution: --- → DUPLICATE
Duplicate of bug: 1331680
You need to log in before you can comment on or make changes to this bug.