UrlClassifierDBServiceWorkerProxy::DoLocalLookup does sync dispatch to off-main-thread that may take 8 seconds

RESOLVED INVALID

Status

()

Toolkit
Safe Browsing
P1
normal
RESOLVED INVALID
a year ago
a year ago

People

(Reporter: cpearce, Assigned: hchang)

Tracking

(Blocks: 1 bug)

Firefox Tracking Flags

(Not tracked)

Details

(Whiteboard: [qf][bhr])

Attachments

(1 attachment)

(Reporter)

Description

a year ago
Created attachment 8863991 [details]
UrlClassifierDBServiceWorkerProxy.txt

Michael Layzell has added Background Hang Reporter telemetry which detects when the Gecko main thread is blocked for at least 8 seconds, and when that happens it reports a pseudo stack of the main thread.

We have 5 reports in the report from 2017-04-29 report [1] that show that we're  blocked in UrlClassifierDBServiceWorkerProxy::DoLocalLookup. This does a sync dispatch to another thread from the main thread.

I'd guess the sync thread dispatch is taking longer than the author of the original code realized.

I've attached the pseudo stacks grep'd out of [2] that show how we entered this function.

We should *not* be doing sync thread dispatch, it generally leads to trouble, and especially we should not be blocking the main thread while doing it. Blocking the main thread can jank rendering, JS execution, etc.

Note that synchronizing whatever structure is off-main-thread that you're blocking to access here via a lock can still result in a "hang" here, as whatever causes the 8 second hang here could equally be holding the lock while the main thread is blocked waiting to acquire the lock.

One possible way to fix this would be to maintain a "mirror" of the off-main-thread state on the main thread. We have existing infrastructure to make this easy:
http://searchfox.org/mozilla-central/source/xpcom/threads/StateMirroring.h



[1] https://people-mozilla.org/~mlayzell/bhr/20170429.html
[2] (all data) https://people-mozilla.org/~mlayzell/bhr/20170429/hangs.json
(Assignee)

Comment 1

a year ago
I am going to discuss this topic in two aspects

1) There is only one last use of UrlClassifierDBServiceWorkerProxy::DoLocalLookup 
   and I am working on removing that in bug 1345058. In fact, I have
   removed two other (indirect) uses of DoLocalLookup. The fix is essentially
   to replace the sync API with the async one.

2) Since Bug 1339050 has been fixed, even if we are still using the sync API,
   the main thread blocking shouldn't have been 8 seconds.

However, looking at the call stack [1], I am guessing the profiling result was
not come out based on nightly build since nsHttpChannel::BeginConnect() 
no longer uses nsUrlClassifierDBService::ClassifyLocalWithTables.


[1] https://people-mozilla.org/~mlayzell/bhr/20170429/585.html
[2] http://searchfox.org/mozilla-central/rev/abe68d5dad139e376d5521ca1d4b7892e1e7f1ba/netwerk/protocol/http/nsHttpChannel.cpp#6012
(Assignee)

Comment 2

a year ago
Hi Chris,

May I know what specific build the report came from? My fixes mentioned in comment 1
(at least bug 1339050 and nsHttpChannel::BeginConnect()) were landed not very long ago 
but should have been landed on nightly.

Thanks :)
Flags: needinfo?(cpearce)
Assignee: nobody → hchang
Status: NEW → ASSIGNED
(Reporter)

Comment 3

a year ago
(In reply to Henry Chang [:henry][:hchang] from comment #2)
> Hi Chris,
> 
> May I know what specific build the report came from? My fixes mentioned in
> comment 1
> (at least bug 1339050 and nsHttpChannel::BeginConnect()) were landed not
> very long ago 
> but should have been landed on nightly.
> 
> Thanks :)

Hi Henry. Sorry, the BHR report dump doesn't include that. We'll get that for future reports.


(In reply to Henry Chang [:henry][:hchang] from comment #1)
> I am going to discuss this topic in two aspects
> 
> 1) There is only one last use of
> UrlClassifierDBServiceWorkerProxy::DoLocalLookup 
>    and I am working on removing that in bug 1345058. In fact, I have
>    removed two other (indirect) uses of DoLocalLookup. The fix is essentially
>    to replace the sync API with the async one.


I recommend you proceed with that refactoring regardless of whether BeginConnect() calls ClassifyLocalWithTables, as based on my experience with all the multi-threaded code in the media playback stack, sync thread dispatch is a source of problems.


> 2) Since Bug 1339050 has been fixed, even if we are still using the sync API,
>    the main thread blocking shouldn't have been 8 seconds.
> 
> However, looking at the call stack [1], I am guessing the profiling result
> was
> not come out based on nightly build since nsHttpChannel::BeginConnect() 
> no longer uses nsUrlClassifierDBService::ClassifyLocalWithTables.


This makes sense. All of the call sites I have stacks for contain BeginConnect() calling nsUrlClassifierDBService::ClassifyLocalWithTables. I'll mark this as resolved for now, until we get more information.
Status: ASSIGNED → RESOLVED
Last Resolved: a year ago
Flags: needinfo?(cpearce)
Resolution: --- → INVALID
(Assignee)

Comment 4

a year ago
(In reply to Chris Pearce (:cpearce) from comment #3)
> (In reply to Henry Chang [:henry][:hchang] from comment #2)
> > Hi Chris,
> > 
> > May I know what specific build the report came from? My fixes mentioned in
> > comment 1
> > (at least bug 1339050 and nsHttpChannel::BeginConnect()) were landed not
> > very long ago 
> > but should have been landed on nightly.
> > 
> > Thanks :)
> 
> Hi Henry. Sorry, the BHR report dump doesn't include that. We'll get that
> for future reports.
> 
> 
> (In reply to Henry Chang [:henry][:hchang] from comment #1)
> > I am going to discuss this topic in two aspects
> > 
> > 1) There is only one last use of
> > UrlClassifierDBServiceWorkerProxy::DoLocalLookup 
> >    and I am working on removing that in bug 1345058. In fact, I have
> >    removed two other (indirect) uses of DoLocalLookup. The fix is essentially
> >    to replace the sync API with the async one.
> 
> 
> I recommend you proceed with that refactoring regardless of whether
> BeginConnect() calls ClassifyLocalWithTables, as based on my experience with
> all the multi-threaded code in the media playback stack, sync thread
> dispatch is a source of problems.
> 

UrlClassifierDBServiceWorkerProxy::DoLocalLookup is going to be removed entirely
once bug 1345058 is fixed so that it shouldn't be a problem in this case :)

Thanks for your suggestion and sharing your experience!

Updated

a year ago
Whiteboard: [qf] → [qf][bhr]
You need to log in before you can comment on or make changes to this bug.