Closed Bug 791776 Opened 12 years ago Closed 11 years ago

Main thread IO in UIC_startSearch()

Tracking

()

Status:

RESOLVED FIXED

Milestone:

mozilla25

People

(Reporter: jrmuizel, Assigned: mak)

References

(Blocks 1 open bug)

Details

(Keywords: hang, main-thread-io, perf, Whiteboard: [snappy])

Attachments

(1 file, 3 obsolete files)

Removing main thread IO from UIC_startSearch() - early prototype 12 years ago David Teller [:Yoric] - still alive but not very active 26.81 KB, patch		Details \| Diff \| Splinter Review
Removing main thread IO from UIC_startSearch() - early prototype, v2 12 years ago David Teller [:Yoric] - still alive but not very active 26.85 KB, patch		Details \| Diff \| Splinter Review
patch v3 11 years ago Marco Bonardo [:mak] (Away Apr 25 - May 5) 19.01 KB, patch		Details \| Diff \| Splinter Review
patch v4 11 years ago Marco Bonardo [:mak] (Away Apr 25 - May 5) 21.77 KB, patch	asaf : review+	Details \| Diff \| Splinter Review

Jeff Muizelaar [:jrmuizel]

Reporter

Description

•

12 years ago

http://people.mozilla.com/~bgirard/cleopatra/?report=d2e7374fe8a7dee6d746201ccabd3c227485370a

I saw this during NY times loading.

Jeff Muizelaar [:jrmuizel]

Reporter

Updated

•

12 years ago

Whiteboard: [snappy]

David Teller [:Yoric] - still alive but not very active

Comment 1

•

12 years ago

Likely culprit: http://mxr.mozilla.org/mozilla-central/source/toolkit/components/places/nsPlacesAutoComplete.js#1385

Cc-ing ventnor who wrote these lines and mak who has been fighting on this front for quite some time.

Keywords: main-thread-io, perf

David Teller [:Yoric] - still alive but not very active

Updated

•

12 years ago

OS: Mac OS X → All

Hardware: x86 → All

David Teller [:Yoric] - still alive but not very active

Comment 2

•

12 years ago

As far as I can tell, there is no good reason for this query to be synchronous. I will see if I can easily turn it to async.

David Teller [:Yoric] - still alive but not very active

Updated

•

12 years ago

Assignee: nobody → dteller

David Teller [:Yoric] - still alive but not very active

Comment 3

•

12 years ago

Ok, not as straightforward as I expected initially. Still, working on it.

David Teller [:Yoric] - still alive but not very active

Comment 4

•

12 years ago

Attached patch Removing main thread IO from UIC_startSearch() - early prototype (obsolete) — Details — Splinter Review

After synchronization with mak, it seems that UIC_startSearch() does main thread IO by design. We need to discuss the future of this bug. Attaching a prototype that rewrites UIC to make it asynchronous.

I took the liberty of cleaning up some of the code of UIC along the way to understand how things work. Although I kind of like the new look of the code, the final patch might entail less code rewrite.

Marco Bonardo [:mak] (Away Apr 25 - May 5)

Assignee

Comment 5

•

12 years ago

Inline autocomplete has been made synchronous on purpose. This is because it only runs while the user is typing and replaces what the user is typing on the fly. Making it async would cause you more headaches you may think, due to race conditions with what you put in the input field and what the user typed in the meanwhile. Basically we should buffer all the input and synchronize the async queries (start and stop) with the buffer contents. We may try though as an experiment, IIRC the controller has protection against replacing what's typed, but that protection may be weak.

That said it has been built to search as efficiently as possible, and uses a separate connection to avoid contention with other reads/writes.
That said, it's not supposed to take 500ms, unless the disk was really busy and ended up delaying the read operation.
It's even possible we missed something in the equation, like an unintended contention or similar, should be verified.

In the profile I clearly see a keydown event going down to UIC_Start (as Yoric pointed out), so while the page was loading Jeff started typing in the locationbar and a search began. Ideally (based on design) this search should just do a simple read on disk and not wait for anything.

My thinking is that before going head down on a rewrite we should figure if there's something we didn't consider blocking the read.

David Teller [:Yoric] - still alive but not very active

Comment 6

•

12 years ago

Investigating: https://tbpl.mozilla.org/?tree=Try&rev=d23794fc2539

Marco Bonardo [:mak] (Away Apr 25 - May 5)

Assignee

Comment 7

•

12 years ago

Jeff, do you remember if this was the first search in the session?
In such a case for the first search we have to do some additional work to clone the connection, and this work can cause contention. My idea regarding these clone connections contentions was to provide a pool of readonly connections from nsPIPlacesDatabase, I think I filed the bug about this but can't find it atm.

Marco Bonardo [:mak] (Away Apr 25 - May 5)

Assignee

Comment 8

•

12 years ago

my other question is whether this was a debug or opt build

Jeff Muizelaar [:jrmuizel]

Reporter

Comment 9

•

12 years ago

(In reply to Marco Bonardo [:mak] from comment #7)
> Jeff, do you remember if this was the first search in the session?
> In such a case for the first search we have to do some additional work to
> clone the connection, and this work can cause contention. My idea regarding
> these clone connections contentions was to provide a pool of readonly
> connections from nsPIPlacesDatabase, I think I filed the bug about this but
> can't find it atm.

I don't think it was the first search in the session. The profile shows us blocking in unixRead() which shouldn't contend on anything more than disk io.

Since we can't make any guarantees about when disk reads will be serviced, intentionally blocking the display of keypresses seems like a recipe for typing lag. Even buffering the input while servicing other events will give the same lag.

Can't we track what the input was that triggered the disk io and compare with the current input to ensure that it's the same before doing any replacement? If it's not the same we would discard the results and try again.

Marco Bonardo [:mak] (Away Apr 25 - May 5)

Assignee

Comment 10

•

12 years ago

(In reply to Jeff Muizelaar [:jrmuizel] from comment #9)
> Can't we track what the input was that triggered the disk io and compare
> with the current input to ensure that it's the same before doing any
> replacement? If it's not the same we would discard the results and try again.

There is already a similar code, that may be enough or not, Yoric has a trybuild above we may test in practice.

David Teller [:Yoric] - still alive but not very active

Comment 11

•

12 years ago

Attached patch Removing main thread IO from UIC_startSearch() - early prototype, v2 (obsolete) — Details — Splinter Review

Same one, without a typo that kills a number of tests.

Attachment #676559 - Attachment is obsolete: true

David Teller [:Yoric] - still alive but not very active

Comment 12

•

12 years ago

TryServer: https://tbpl.mozilla.org/?tree=Try&rev=f6edaf31e25d

David Teller [:Yoric] - still alive but not very active

Comment 13

•

12 years ago

(In reply to Marco Bonardo [:mak] from comment #10)
> (In reply to Jeff Muizelaar [:jrmuizel] from comment #9)
> > Can't we track what the input was that triggered the disk io and compare
> > with the current input to ensure that it's the same before doing any
> > replacement? If it's not the same we would discard the results and try again.
> 
> There is already a similar code, that may be enough or not, Yoric has a
> trybuild above we may test in practice.

So, the trybuild passes all tests that it is expected to pass (i.e. my instrumentation trivially breaks one test and there is another test that seems to depend on the IO being synchronous).

What do you suggest as next step?

Flags: needinfo?(mak77)

Marco Bonardo [:mak] (Away Apr 25 - May 5)

Assignee

Comment 14

•

12 years ago

we need a lot of manual testing, probably involving Nightly testers would be fine (by creating a "need help testing this build" thread on mozillazine).
And I need a somehow polished patch (no debug, reduced refactoring) to evaluate. We can do refactoring at a later time.
If you can fix tests to properly wait for something we may start by reviewing and checkin-in that part, this would be good regardless.

Flags: needinfo?(mak77)

Marco Bonardo [:mak] (Away Apr 25 - May 5)

Assignee

Comment 15

•

12 years ago

an alternative could be to put the behavior under a temporary hidden pref, so the code may work both sync or async, we use async in Nightly and get testing in the wild, if things break we can always go back in Aurora by adding a browser.urlbar.autoFill.syncQuery=true pref. Once we are comfortable with the async behavior we just remove the pref and the related code.
Would this be complicated to support code-wise?

Also, as I suggested on IRC may be nice to add telemetry for this.

David Teller [:Yoric] - still alive but not very active

Comment 16

•

12 years ago

So, what about the following plan (I'm just taking things that have been written and ordering them)?

1. add telemetry for this – take no further action until telemetry confirms that there is a need;

2. tweak the API, if necessary, to make it possible to cancel a request from the UI (I seem to remember that the code is there, but that we need to expose some more stuff to the UI);

3. proceed to add buffering;

4. proceed with a minimal patch moving the request off the main thread;

5. refactor the code if necessary (I believe that it is).

Each item is one different bug.

David Teller [:Yoric] - still alive but not very active

Updated

•

12 years ago

Depends on: 808557

David Teller [:Yoric] - still alive but not very active

Comment 17

•

12 years ago

Releasing the bug until we have interesting Telemetry information.

Assignee: dteller → nobody

Henrik Skupin [:whimboo][⌚️UTC+1]

Comment 18

•

11 years ago

I hit the same kind of freeze a couple of times in the past. Now by running with the profiler addon I'm finally able to report some useful information. This time I got a freeze of about 25s! The harddrive was kinda busy with copying large files over the network and compiling Firefox beside. As mentioned I see this issue regularly on my OS X machine and it is pretty annoying. This profile is now up for a couple of days and firefox has a memory consumption of 1100MB. Not sure how all that is connected to each other.

Here the profile:
http://people.mozilla.com/~bgirard/cleopatra/#report=3566ea6fc5d2bb8adae469f6b31bd0b0b399d647

Avg. Event Lag:
    18070.68 ms
Max Event Lag:
    25251.07 ms
Real Interval:
    1.20 ms ±5.06

Henrik Skupin [:whimboo][⌚️UTC+1]

Updated

•

11 years ago

Keywords: hang

Henrik Skupin [:whimboo][⌚️UTC+1]

Updated

•

11 years ago

Severity: normal → critical

Vladan Djeric (:vladan)

Comment 19

•

11 years ago

I don't think we need additional Telemetry.

I hit this hang (lasting ~2 seconds) while typing a URL into the AwesomeBar. It might have been the first search of the session.

This is the profile: http://people.mozilla.com/~bgirard/cleopatra/#report=e80ae60e25e4904ef32390c67dcb2cb93743b41d

This is the SQL statement that was executed:

Hits	Avg. Time (ms)	Statement
1	2397	/* do not warn (bug no): could index on (typed,frecency) but not worth it */ SELECT host || :private, prefix || host || :private FROM moz_hosts WHERE host BETWEEN :search_string AND :search_string || X:private AND frecency <> 0 AND typed = 1 ORDER BY frecency DESC LIMIT 1 /* places.sqlite */

According to the Slow SQL dashboard, this is the #32 most common source of slow main-thread SQL on the release channel:

http://tinyurl.com/d8q7sbq

Search the above page for "do not warn" and hover mouse over the text

Marco Bonardo [:mak] (Away Apr 25 - May 5)

Assignee

Comment 20

•

11 years ago

if we take the change very early in this cycle (like in a week) we may get whole testing cycles over it and see if it may create user interaction problems, backout may be easy in case of issues.

Marco Bonardo [:mak] (Away Apr 25 - May 5)

Assignee

Comment 21

•

11 years ago

we just merged, I'll try to reduce David's patch for landing.

Assignee: nobody → mak77

Status: NEW → ASSIGNED

Marco Bonardo [:mak] (Away Apr 25 - May 5)

Assignee

Comment 22

•

11 years ago

just for numbers, from the telemetry we added looks like 99,74% of users get the autofill result in less than 50ms. It's still worth to experiment doing this change cause jank is bad regardless when it hits and SLOW SQL clearly shows this.
Though, the situation is not critical.

Marco Bonardo [:mak] (Away Apr 25 - May 5)

Assignee

Updated

•

11 years ago

Blocks: StorageJank

David Teller [:Yoric] - still alive but not very active

Comment 23

•

11 years ago

50ms of jank is pretty bad. I figure we have at best 5ms.

Marco Bonardo [:mak] (Away Apr 25 - May 5)

Assignee

Comment 24

•

11 years ago

I said "less than", unfortunately we don't have more fine grained buckets here.

Marco Bonardo [:mak] (Away Apr 25 - May 5)

Assignee

Comment 25

•

11 years ago

Attached patch patch v3 (obsolete) — Details — Splinter Review

This is not intended to cleanup anything, we can definitely cleanup code in autocomplete, but not here.

Attachment #676605 - Attachment is obsolete: true

Marco Bonardo [:mak] (Away Apr 25 - May 5)

Assignee

Comment 26

•

11 years ago

Attached patch patch v4 — Details — Splinter Review

Fixes a failing test that was expecting a synchronous behavior

Attachment #767990 - Attachment is obsolete: true

Attachment #768254 - Flags: review?(mano)

Asaf Romano (gone)

Comment 27

•

11 years ago

Comment on attachment 768254 [details] [diff] [review]
patch v4

Looks good.


While you're at it:

>-    // Do a synchronous search on the table of domains.
>-    let query = this._syncQuery;
>+    // Hosts have no "/" in them.
>+    let lastSlashIndex = this._currentSearchString.lastIndexOf("/");
>+

String.contains

>+    let maybeSearchURL = () => {
>+      // Search URLs if there's a slash and it's not at the end of the search
>+      // string.

"Search for" or "Check if"

Attachment #768254 - Flags: review?(mano) → review+

Marco Bonardo [:mak] (Away Apr 25 - May 5)

Assignee

Comment 28

•

11 years ago

(In reply to Mano from comment #27)
> >-    // Do a synchronous search on the table of domains.
> >-    let query = this._syncQuery;
> >+    // Hosts have no "/" in them.
> >+    let lastSlashIndex = this._currentSearchString.lastIndexOf("/");
> >+
> 
> String.contains

I can't use contains here, cause later I need the actual index where the slash is found.

Marco Bonardo [:mak] (Away Apr 25 - May 5)

Assignee

Comment 29

•

11 years ago

With fixed comment:
https://hg.mozilla.org/integration/mozilla-inbound/rev/8f7ef2198996

Flags: in-testsuite+

Target Milestone: --- → mozilla25

Phil Ringnalda (:philor)

Comment 30

•

11 years ago

https://hg.mozilla.org/mozilla-central/rev/8f7ef2198996

Status: ASSIGNED → RESOLVED

Closed: 11 years ago

Resolution: --- → FIXED

Julian Viereck

Updated

•

11 years ago

Blocks: 950797

Jared Wein [:jaws] (please needinfo? me)

Updated

•

10 years ago

Blocks: 1053229

Marco Bonardo [:mak] (Away Apr 25 - May 5)

Assignee

Updated

•

10 years ago

No longer blocks: 1053229

You need to log in before you can comment on or make changes to this bug.