Main thread IO in UIC_startSearch()

RESOLVED FIXED in mozilla25

Status

()

Toolkit
Places
--
critical
RESOLVED FIXED
6 years ago
4 years ago

People

(Reporter: jrmuizel, Assigned: mak)

Tracking

(Blocks: 1 bug, {hang, main-thread-io, perf})

unspecified
mozilla25
hang, main-thread-io, perf
Points:
---
Dependency tree / graph
Bug Flags:
in-testsuite +

Firefox Tracking Flags

(Not tracked)

Details

(Whiteboard: [snappy])

Attachments

(1 attachment, 3 obsolete attachments)

(Reporter)

Updated

6 years ago
Whiteboard: [snappy]
Likely culprit: http://mxr.mozilla.org/mozilla-central/source/toolkit/components/places/nsPlacesAutoComplete.js#1385

Cc-ing ventnor who wrote these lines and mak who has been fighting on this front for quite some time.
Keywords: main-thread-io, perf
OS: Mac OS X → All
Hardware: x86 → All
As far as I can tell, there is no good reason for this query to be synchronous. I will see if I can easily turn it to async.
Assignee: nobody → dteller
Ok, not as straightforward as I expected initially. Still, working on it.
Created attachment 676559 [details] [diff] [review]
Removing main thread IO from UIC_startSearch() - early prototype

After synchronization with mak, it seems that UIC_startSearch() does main thread IO by design. We need to discuss the future of this bug. Attaching a prototype that rewrites UIC to make it asynchronous.

I took the liberty of cleaning up some of the code of UIC along the way to understand how things work. Although I kind of like the new look of the code, the final patch might entail less code rewrite.
Inline autocomplete has been made synchronous on purpose. This is because it only runs while the user is typing and replaces what the user is typing on the fly. Making it async would cause you more headaches you may think, due to race conditions with what you put in the input field and what the user typed in the meanwhile. Basically we should buffer all the input and synchronize the async queries (start and stop) with the buffer contents. We may try though as an experiment, IIRC the controller has protection against replacing what's typed, but that protection may be weak.

That said it has been built to search as efficiently as possible, and uses a separate connection to avoid contention with other reads/writes.
That said, it's not supposed to take 500ms, unless the disk was really busy and ended up delaying the read operation.
It's even possible we missed something in the equation, like an unintended contention or similar, should be verified.

In the profile I clearly see a keydown event going down to UIC_Start (as Yoric pointed out), so while the page was loading Jeff started typing in the locationbar and a search began. Ideally (based on design) this search should just do a simple read on disk and not wait for anything.

My thinking is that before going head down on a rewrite we should figure if there's something we didn't consider blocking the read.
Jeff, do you remember if this was the first search in the session?
In such a case for the first search we have to do some additional work to clone the connection, and this work can cause contention. My idea regarding these clone connections contentions was to provide a pool of readonly connections from nsPIPlacesDatabase, I think I filed the bug about this but can't find it atm.
my other question is whether this was a debug or opt build
(Reporter)

Comment 9

6 years ago
(In reply to Marco Bonardo [:mak] from comment #7)
> Jeff, do you remember if this was the first search in the session?
> In such a case for the first search we have to do some additional work to
> clone the connection, and this work can cause contention. My idea regarding
> these clone connections contentions was to provide a pool of readonly
> connections from nsPIPlacesDatabase, I think I filed the bug about this but
> can't find it atm.

I don't think it was the first search in the session. The profile shows us blocking in unixRead() which shouldn't contend on anything more than disk io.

Since we can't make any guarantees about when disk reads will be serviced, intentionally blocking the display of keypresses seems like a recipe for typing lag. Even buffering the input while servicing other events will give the same lag.

Can't we track what the input was that triggered the disk io and compare with the current input to ensure that it's the same before doing any replacement? If it's not the same we would discard the results and try again.
(In reply to Jeff Muizelaar [:jrmuizel] from comment #9)
> Can't we track what the input was that triggered the disk io and compare
> with the current input to ensure that it's the same before doing any
> replacement? If it's not the same we would discard the results and try again.

There is already a similar code, that may be enough or not, Yoric has a trybuild above we may test in practice.
Created attachment 676605 [details] [diff] [review]
Removing main thread IO from UIC_startSearch() - early prototype, v2

Same one, without a typo that kills a number of tests.
Attachment #676559 - Attachment is obsolete: true
(In reply to Marco Bonardo [:mak] from comment #10)
> (In reply to Jeff Muizelaar [:jrmuizel] from comment #9)
> > Can't we track what the input was that triggered the disk io and compare
> > with the current input to ensure that it's the same before doing any
> > replacement? If it's not the same we would discard the results and try again.
> 
> There is already a similar code, that may be enough or not, Yoric has a
> trybuild above we may test in practice.

So, the trybuild passes all tests that it is expected to pass (i.e. my instrumentation trivially breaks one test and there is another test that seems to depend on the IO being synchronous).

What do you suggest as next step?
Flags: needinfo?(mak77)
we need a lot of manual testing, probably involving Nightly testers would be fine (by creating a "need help testing this build" thread on mozillazine).
And I need a somehow polished patch (no debug, reduced refactoring) to evaluate. We can do refactoring at a later time.
If you can fix tests to properly wait for something we may start by reviewing and checkin-in that part, this would be good regardless.
Flags: needinfo?(mak77)
an alternative could be to put the behavior under a temporary hidden pref, so the code may work both sync or async, we use async in Nightly and get testing in the wild, if things break we can always go back in Aurora by adding a browser.urlbar.autoFill.syncQuery=true pref. Once we are comfortable with the async behavior we just remove the pref and the related code.
Would this be complicated to support code-wise?

Also, as I suggested on IRC may be nice to add telemetry for this.
So, what about the following plan (I'm just taking things that have been written and ordering them)?

1. add telemetry for this – take no further action until telemetry confirms that there is a need;

2. tweak the API, if necessary, to make it possible to cancel a request from the UI (I seem to remember that the code is there, but that we need to expose some more stuff to the UI);

3. proceed to add buffering;

4. proceed with a minimal patch moving the request off the main thread;

5. refactor the code if necessary (I believe that it is).

Each item is one different bug.
Releasing the bug until we have interesting Telemetry information.
Assignee: dteller → nobody
I hit the same kind of freeze a couple of times in the past. Now by running with the profiler addon I'm finally able to report some useful information. This time I got a freeze of about 25s! The harddrive was kinda busy with copying large files over the network and compiling Firefox beside. As mentioned I see this issue regularly on my OS X machine and it is pretty annoying. This profile is now up for a couple of days and firefox has a memory consumption of 1100MB. Not sure how all that is connected to each other.

Here the profile:
http://people.mozilla.com/~bgirard/cleopatra/#report=3566ea6fc5d2bb8adae469f6b31bd0b0b399d647

Avg. Event Lag:
    18070.68 ms
Max Event Lag:
    25251.07 ms
Real Interval:
    1.20 ms ±5.06
Keywords: hang
Severity: normal → critical
I don't think we need additional Telemetry.

I hit this hang (lasting ~2 seconds) while typing a URL into the AwesomeBar. It might have been the first search of the session.

This is the profile: http://people.mozilla.com/~bgirard/cleopatra/#report=e80ae60e25e4904ef32390c67dcb2cb93743b41d

This is the SQL statement that was executed:

Hits	Avg. Time (ms)	Statement
1	2397	/* do not warn (bug no): could index on (typed,frecency) but not worth it */ SELECT host || :private, prefix || host || :private FROM moz_hosts WHERE host BETWEEN :search_string AND :search_string || X:private AND frecency <> 0 AND typed = 1 ORDER BY frecency DESC LIMIT 1 /* places.sqlite */

According to the Slow SQL dashboard, this is the #32 most common source of slow main-thread SQL on the release channel:

http://tinyurl.com/d8q7sbq

Search the above page for "do not warn" and hover mouse over the text
if we take the change very early in this cycle (like in a week) we may get whole testing cycles over it and see if it may create user interaction problems, backout may be easy in case of issues.
we just merged, I'll try to reduce David's patch for landing.
Assignee: nobody → mak77
Status: NEW → ASSIGNED
just for numbers, from the telemetry we added looks like 99,74% of users get the autofill result in less than 50ms. It's still worth to experiment doing this change cause jank is bad regardless when it hits and SLOW SQL clearly shows this.
Though, the situation is not critical.
Blocks: 699820
50ms of jank is pretty bad. I figure we have at best 5ms.
I said "less than", unfortunately we don't have more fine grained buckets here.
Created attachment 767990 [details] [diff] [review]
patch v3

This is not intended to cleanup anything, we can definitely cleanup code in autocomplete, but not here.
Attachment #676605 - Attachment is obsolete: true
Created attachment 768254 [details] [diff] [review]
patch v4

Fixes a failing test that was expecting a synchronous behavior
Attachment #767990 - Attachment is obsolete: true
Attachment #768254 - Flags: review?(mano)
Comment on attachment 768254 [details] [diff] [review]
patch v4

Looks good.


While you're at it:

>-    // Do a synchronous search on the table of domains.
>-    let query = this._syncQuery;
>+    // Hosts have no "/" in them.
>+    let lastSlashIndex = this._currentSearchString.lastIndexOf("/");
>+

String.contains

>+    let maybeSearchURL = () => {
>+      // Search URLs if there's a slash and it's not at the end of the search
>+      // string.

"Search for" or "Check if"
Attachment #768254 - Flags: review?(mano) → review+
(In reply to Mano from comment #27)
> >-    // Do a synchronous search on the table of domains.
> >-    let query = this._syncQuery;
> >+    // Hosts have no "/" in them.
> >+    let lastSlashIndex = this._currentSearchString.lastIndexOf("/");
> >+
> 
> String.contains

I can't use contains here, cause later I need the actual index where the slash is found.
With fixed comment:
https://hg.mozilla.org/integration/mozilla-inbound/rev/8f7ef2198996
Flags: in-testsuite+
Target Milestone: --- → mozilla25
https://hg.mozilla.org/mozilla-central/rev/8f7ef2198996
Status: ASSIGNED → RESOLVED
Last Resolved: 5 years ago
Resolution: --- → FIXED

Updated

4 years ago
Blocks: 950797
No longer blocks: 1053229
You need to log in before you can comment on or make changes to this bug.