Closed Bug 1411226 Opened 2 years ago Closed 2 years ago

Entries from history are appearing with a large delay

Categories

(Firefox for Android :: Awesomescreen, defect, critical)

defect
Not set
critical

Tracking

()

RESOLVED FIXED
Firefox 59
Tracking Status
firefox57 + wontfix
firefox58 + fixed
firefox59 + fixed

People

(Reporter: marco, Assigned: Grisha)

References

Details

(Keywords: regression, regressionwindow-wanted, Whiteboard: [summary in comment 35])

Attachments

(3 files)

I believe this is a recent regression, I think I started seeing it in 57.

When entering text in the location bar, the history entries used to be shown very quickly, now there's a very large (usually several seconds) delay.
Moreover, the old suggestions from the history are shown, even if completely unrelated, until the new ones show up.

[Tracking Requested - why for this release]: very noticeable regression in performance of awesomescreen.
Joe, does it ring a bell?
Flags: needinfo?(jcheng)
Mihai - Have you seen this issue at any point during beta testing?
Flags: needinfo?(mihai.ninu)
Hi Nicole, Wesly, could you please triage this one? I tried this and I don't notice the delay.
Flags: needinfo?(wehuang)
Flags: needinfo?(nyee)
(In reply to Ritu Kothari (:ritu) from comment #3)
> Hi Nicole, Wesly, could you please triage this one? I tried this and I don't
> notice the delay.

Sylvestre could reproduce it on his phone too.
Can't repro on my device either (Fennec 56, 57, 58, with Samsung Galaxy Edge 7, Android 7.0), and don't think we changed things there recently.

Ioana, would you help find someone to quickly check this? If truly a 57 regression, then help us identify the window. Thanks!
Flags: needinfo?(wehuang) → needinfo?(ioana.chiorean)
ni related people
Flags: needinfo?(topwu.tw)
Flags: needinfo?(max)
Flags: needinfo?(gkruglov)
May I know the version you were using? is it a large profile?
Not sure if I am reproducing it on Pixel - Android 8.0 latest Beta 11
Is this what you mean? I see a bit of delay but not sure if this is what you experience or something worst. I used a large profile synced. 
https://photos.app.goo.gl/kXK6uCMukfWl5Rt02

Also asked the team to investigate more around different devices till we get the reporting devices.
Flags: needinfo?(ioana.chiorean)
I can reproduce it all the time with a samsung s8 (up to date) on fx nightly.
Now that I've seen this bug, I feel like I'm noticing something on Release as well, but to be honest I can't say for sure whether it's always been like that or got worse recently.
For reference, I'm on a Moto G4 Play (6.0.1) and my (synced) browser.db currently weighs in at 43 MB (~3,700 bookmarks, ~67,000 history entries, ~114,000 visits).
(In reply to Ioana Chiorean from comment #8)
> Not sure if I am reproducing it on Pixel - Android 8.0 latest Beta 11
> Is this what you mean? I see a bit of delay but not sure if this is what you
> experience or something worst. I used a large profile synced. 
> https://photos.app.goo.gl/kXK6uCMukfWl5Rt02

No, it's definitely worse than this. When I type something, the results appear after 3-4 seconds.
Your video is basically what was happening on my device before the regression.

(In reply to Jan Henning [:JanH] from comment #10)
> Now that I've seen this bug, I feel like I'm noticing something on Release
> as well, but to be honest I can't say for sure whether it's always been like
> that or got worse recently.
> For reference, I'm on a Moto G4 Play (6.0.1) and my (synced) browser.db
> currently weighs in at 43 MB (~3,700 bookmarks, ~67,000 history entries,
> ~114,000 visits).

I'm pretty sure I've never seen this before. I can't tell for sure when it regressed, but it did regress at some point.
Hi everyone,

Tried reproducing this on several different devices and could not reproduce the mentioned issue. I've tried by checking local history, and synced (from another device) history and it works fine, no delay. 

Devices used: 
Samsung galaxy S6 EDGE - Android 7.0
Pixel C - Android 8.0
Xiaomi mi i4 - Android 5.0.2
Build 57.0b11
Flags: needinfo?(mihai.ninu)
For the record, my device is a Nexus 5 with Android 6.0.1.
Hello,

I've also tested on a Nexus 6 (7.1.1) & Nexus 5 (6.0.1) and the issue was not reproducible on any of the devices.
Taking a quick look - luckily, we don't have to guess - there's a telemetry probe just for tracking this stuff.

https://telemetry.mozilla.org/new-pipeline/evo.html#!aggregates=95th-percentile&cumulative=0&end_date=null&keys=&max_channel_version=beta%252F57&measure=FENNEC_SEARCH_LOADER_TIME_MS&min_channel_version=beta%252F54&processType=*&product=Fennec&sanitize=1&sort_keys=submissions&start_date=null&trim=1&use_submission_date=0

Looking at the graph, we've certainly regressed the performance of BrowserDB's filter() in 57. It's particularly noticeable for the 95th percentile, but almost visible around the median. Perhaps that's indicative of the reports here?

Mike, you're probably the last person to have touched this stuff (albeit indirectly). Does any recent change stand-out as a possible culprit?
Flags: needinfo?(gkruglov) → needinfo?(michael.l.comella)
Thanks for flagging me and for the telemetry, Grisha.

(In reply to :Grisha Kruglov from comment #15)
> Looking at the graph, we've certainly regressed the performance of
> BrowserDB's filter() in 57.

filter() is what we call when the user types a letter into the awesomescreen.

> Mike, you're probably the last person to have touched this stuff (albeit
> indirectly). Does any recent change stand-out as a possible culprit?

Nothing obvious stands out: I don't think this is related to Activity Stream because I can still reproduce (on my personal device/profile) even when the Top Sites panel is disabled.

I'll continue digging around for a bit though.

---

If this were related to Activity Stream, I'd speculate it's because AS has increased the simultaneous disk reads we're making when loading about:home:
- Top Sites (a query)
- Top Stories (parsing a json file)
- Highlights (a query)

Then if the user starts typing, we add another – I'd guess we're slowing down because of all the disk reads we're doing. I'd expect that users on older devices with larger and more fragmented profiles (i.e. older) are more likely to see this.

Two possible solutions for AS:
- (if possible) Cancel in-progress DB queries when we start filtering options
- Cache the results of these queries and update them periodically (bug 1388187)

---

Anecdotally, I have seen this as well: the awesomescreen feels slow and clunky (including top sites) and it feels bad to use fennec. I'm using Nightly on a Galaxy S5 with a pretty old and large profile.
The most recent change to LocalBrowserDB was in 56 [1][2], which doesn't show problems, so we're likely looking at a front-end change or an indirect change.

Possible research:
- Were there any changes to the awesomescreen search UI? Could these be related?
- We introduced a lot of changes at once for 57: what are they? Could they be related? (see items below)
    - Photon UI: Can we repro without it (if we can still turn it off)?
    - Sync: can we reproduce without sync enabled? Without a sync account activated on the device?
    - Telemetry: can we reproduce without telemetry enabled?
    - (note: it isn't worth ruling out Sync or telemetry just because they're disabled - we have to make sure none of their code is running)
- Do we notice any contention over device resources?
    - GC churn: could we be forcing GC during `filter`? e.g. our memory usage is high (a memory leak?) that keeps us under memory pressure while we're creating Strings from the DB
    - Disk use: could there be any simultaneous disk reads?
    - CPU use: could there be any competing threads or background tasks?

[1]: https://hg.mozilla.org/mozilla-central/log/tip/mobile/android/base/java/org/mozilla/gecko/db/LocalBrowserDB.java
[2]: https://hg.mozilla.org/mozilla-central/rev/6503ba3ed9a4
> so we're likely looking at a front-end change or an indirect change.

Since Grisha noted that the disk query itself is taking longer (here's the probe in code [1]), this actually wouldn't be related to a front-end change unless that front-end change is creating contention over device resources.

Unfortunately, I haven't found any other obvious possibilities and since I don't think this is AS related, I'm going to stop looking into this. Please NI me (or irc) again if you need my help, want to discuss, think it's AS-related, etc.

A few observations:
- The appearance of the search screen has completely changed from 56 to 57: could this be related? (can we still reproduce without photon enabled?)
- The old search screen has no overdraw while the new search screen has a layer of overdraw: it looks like we used to hide about:home when the search screen is shown (we did this primarily for a11y reasons) but I wonder if keeping the other screens around is creating resource contention

[1]: http://searchfox.org/mozilla-central/rev/1c4901d060e3953e41088c038b62a0c026c1e1fb/mobile/android/base/java/org/mozilla/gecko/home/SearchLoader.java#105
Flags: needinfo?(michael.l.comella)
Discussed with the team here and don't think we have frontend things touched it in 57. Given the tasks in hand we are not able to further trace it recently (sorry) and believe to identify the regression window directly will be the most efficient next step.

Marco, as this is not reproducible by anyone, would you help our QA to find the regression window? (maybe sharing your profile with them, or even to locate the regression directly if possible) Thanks!
Flags: needinfo?(topwu.tw)
Flags: needinfo?(mcastelluccio)
Flags: needinfo?(max)
(In reply to Wesly Huang (Fennec Frontend/Firefox EPM) from comment #19)
> Discussed with the team here and don't think we have frontend things touched
> it in 57. Given the tasks in hand we are not able to further trace it
> recently (sorry) and believe to identify the regression window directly will
> be the most efficient next step.
> 
> Marco, as this is not reproducible by anyone, would you help our QA to find
> the regression window? (maybe sharing your profile with them, or even to
> locate the regression directly if possible) Thanks!

I don't have any experience with Android, so it would take me more time. Maybe Michael, who can reproduce as well, can help?
If not, I can do it with some hand holding.
Flags: needinfo?(mcastelluccio) → needinfo?(michael.l.comella)
(In reply to Marco Castelluccio [:marco] from comment #20)
> I don't have any experience with Android, so it would take me more time.
> Maybe Michael, who can reproduce as well, can help?
> If not, I can do it with some hand holding.

There are a few things I can try tomorrow. I think we have to try reproducing with a long-term, existing profile: I can reproduce on my daily driver, Nightly but I cannot reproduce after syncing a local dev build with my personal profile (the same one in Nightly). This latter build will not have the 1) same quantity of data as my personal profile and 2) the fragmentation of my personal profile. (I wonder if we should invest in being able to manually build these profiles for testing purposes...)

To find the regression range, I can try to copy my personal profile out of my install [1] and then running builds from various revisions with my profile [2].

That being said, we'll only be able to use existing profiles with the current DB version OR manually incrementing the DB version number in the browser database and hoping any changes using the DB we're missing don't cause false positives or crashes. It unfortunately may be a time consuming process. :(

The last DB migrations we've made are:
- bug 1364644 FF57: https://hg.mozilla.org/mozilla-central/rev/866bf0a5a6b6#l2.77
- bug 1386052 FF57: https://hg.mozilla.org/mozilla-central/rev/35b56190a1a9#l2.13
- bug 1351805 FF55: https://hg.mozilla.org/mozilla-central/rev/af459377995c#l2.31

[1]: Either with copy-profile add-on (adding the workaround to install legacy extensions) or adding custom code
[2]: Either with a rooted device or custom code
Attached video Movie of slowness
Here's a movie displaying the slowness; notice the delay when I enter "p" and the delay in displaying top sites. Sometimes it can be worse than this and this is probably the best case. I remember performance being better before.
> I can try to copy my personal profile out of my install

I was able to do this by repairing the copy-profile add-on. You can find my updated version at https://github.com/mcomella/copy-profile. fwiw, we should really create a non-deprecated solution before copy-profile stops working (bug 1396929).
> then running builds from various revisions with my profile

Here's a patch to open the profile from `/sdcard/mozilla_profile`.

Unfortunately, I don't think I'll have time to go regression hunting today and I'll be at a conference for the rest of the week so I don't think I'll have time to get back to this until Monday.
Assignee: nobody → michael.l.comella
Ioana, can you help reproduce this and maybe find a regression window? Thanks!
Flags: needinfo?(chiorean.ioana)
Please sign in to firefox account with more bookmark / history entries(>1000 entries).
I saw this today on Michael's phone. But it worked fine on my Pixel 32G
Flags: needinfo?(cnevinchen)
I can't reproduce this on my account/pixel/Android O :(
(In reply to Nevin Chen [:nechen] from comment #26)
> Please sign in to firefox account with more bookmark / history entries(>1000
> entries).

As per comment 21, I can't reproduce by syncing my profile is a fresh install:

> I can reproduce on my daily driver, Nightly but I cannot reproduce after syncing a local dev build
> with my personal profile (the same one in Nightly). This latter build will not have the 1) same quantity
> of data as my personal profile and 2) the fragmentation of my personal profile. (I wonder if we should
> invest in being able to manually build these profiles for testing purposes...)

---

I did some debugging this morning:
- Now that I can use my profile in debug builds, I can use the profiling tools. Yay!
- Not memory: there were no exceptional allocations or GC events
- The CPU runs to ~40% when typing. Notably, I have a quad core phone so that could mean one or two cores are really chugging.
- I've attached a CPU trace: it was maybe: type "apple", wait for completion, backspace to "app", wait for completion. I wouldn't rely on these steps though – I can't really remember what I did. I can upload another trace when I have more time.
- With light analysis on the trace, it stands out that we have *two* threads for SearchLoader$SearchCursorLoader (ModernAsyncTask #2 & #5) and BrowserSearch$SearchEngineSuggestionAsyncLoader (ModernAsyncTask #3 & #4) that have partial overlap in execution time. The SearchCursorLoader spends most of its time in SQLiteConnection.nativeExecuteForCursorWindow – this is actually blocking the thread even though I'd expect SQL calls to be primarily sleeping and waiting for the disk. The UI thread is also doing a lot of work, updating for each key press.
- SearchLoader has not been updated in 57 [1]
- The history panel also seems to take an unexpectedly long time to load

My next TODO would be to look at the nativeExecuteForCursorWindow code to see why it uses so much CPU.

Thanks Nevin for the team debugging/ideas yesterday!

[1]: https://hg.mozilla.org/mozilla-central/log/tip/mobile/android/base/java/org/mozilla/gecko/home/SearchLoader.java
Let's take a look at what's actually going from a prism of SQL and data, and what might have changed recently (it's also nice to refresh my memory of this stuff).

We're selecting data from the 'combined' view. That's what seems to have regressed in performance here - a query built against it by LocalBrowserProvider's filter().

Typing "old bug" into awesomebar will essentially perform something like `SELECT url,title,history_id,bookmark_id FROM combined WHERE url LIKE '%old%' OR title LIKE '%old%' AND url LIKE '%bug%' OR title LIKE '%bug%' LIMIT 100 ORDER BY combined_frecency_sort_order`.

The sort order is something like: (CASE WHEN bookmark_id > -1 THEN 100 ELSE 0 END) + remoteVisitCount * MAX(1, 100 * 110 / ((1509717433627000 - remoteDateLastVisited) / 86400000000 * (1509717433627000 - remoteDateLastVisited) / 86400000000 + 110)) + (localVisitCount + 2) * (localVisitCount + 2) * MAX(2, 100 * 225 / ((1509717433627000 - localDateLastVisited) / 86400000000 * (1509717433627000 - localDateLastVisited) / 86400000000 + 225)) DESC.

... which is essentially a sum of two ways ("remote" and "local") to calculate frecency for a history record, and some heavy bonus points if it's also a bookmark.

'combined' view is:

SELECT stuff FROM bookmarks WHERE type = 'bookmark' AND parent <> fixed_pinned_ID AND is_deleted = 0 AND url NOT IN (select url from history)
UNION ALL
SELECT largely same stuff FROM history
LEFT OUTER JOIN bookmarks ON bookmarks.url = history.url WHERE history.is_deleted = 0 AND (bookmarks.type is null OR bookmarks.type = 'bookmark').

Some of the more obvious things that will affect performance of a sql query are:
- query itself
- schemas of tables/views it operates on ("shape" of data)
- indexes
- amount of data

AFAIK, two things involved here changed in 57:
- 'Combined' view changed (unnecessarily, AFAIK) in Bug 1386052 (landed in 57) - `history_guid` was added to the projection. We have an index on history GUID, and so it's unlikely that the query plan would change significantly with the new projection. Actually checking, yes - it's the same query plan with and without.
- 'bookmarks' table _did_ change in 57 - 'local_version' and 'sync_version' columns were added (Bug 1364644). Similarly, I don't think this would have a negative impact by itself.

That's mostly it for 'query', 'schemas' and 'indexes' (they didn't change), which leaves us with the data itself.

In Bug 1291821 (landed in 54) we've removed limits on how much of history and bookmarks could be downloaded by Sync. Prior, we wouldn't attempt to sync bookmarks if there are more than 5000 bookmark records, and we wouldn't fetch more than 250 latest history items.

If we expand the analytics query to go from 50-57, we'll see a decrease in perf starting with 54, before the large jump in 57 which we're currently investigating. See https://telemetry.mozilla.org/new-pipeline/evo.html#!aggregates=95th-percentile&cumulative=0&end_date=null&keys=&max_channel_version=beta%252F57&measure=FENNEC_SEARCH_LOADER_TIME_MS&min_channel_version=beta%252F50&processType=*&product=Fennec&sanitize=1&sort_keys=submissions&start_date=null&trim=1&use_submission_date=0

Changes to Sync that would result in potentially much larger tables would certainly affect query pref (more data to sift through). My theory is that the 54 decrease could certainly be explained by larger volume of data in history and bookmark tables (but I haven't checked for schema/etc changes in 54+). In particular, in the combined view we're joining the two tables, so more data would certainly mean "more work to do". This is somewhat supported by the fact that we mostly see perf decrease for the 95th percentile - median and 75th percentile performance is about the same.

One way to interpret that: we mostly regressed performance for heavy users, which are likely to also be Sync users (but not necessary). This interpretation allows us to narrow down what subsets of users we're considering.

Now, the fact that in 54 we've lifted bookmark/history limits wouldn't by itself mean that all of a sudden sync users have more bookmarks and history records on their devices. For an existing Sync user, a "sync" is an incremental operation - download what's new/changed since last sync.
Bookmark limits _were_ removed in a way that heavy bookmark users would actually start syncing their bookmarks, and so we did increase number of bookmarks a heavy user will have starting from 54.

For history records, there are circumstances under which an existing Sync client might download _all_ of their data without a direct user intervention (and so, may actually account for a general trend). One is "node reassignment". A storage server to which Sync user is assigned is changed, and so all of user's clients perform a "first sync" to a brand new server - that is, download _all_ of their history (and all other record types).

I think that's the most obvious explanation here - a node re-assignment was performed, and our heavy sync users (who also have a desktop client, likely with tons of history) performed a "first sync", downloading tens of thousands of history records.

I've asked :bobm on IRC about the dates on which this might have happened recently for a representative subset of users. Will report back here. If there was a node-reassignment right around the 57 beta timeframe, and there haven't been other node re-assignments since 54 (or we'd see them in analytics!), I'd be inclined to say that's the cause of a big bump here.

Otherwise, keep digging.

But in either case, some possible remedies:
- see if we can substantially improve the combined view query
- FTS support!
- use a cache layer. That is, pre-compute the "combined" view. Would also be good to pre-compute the TOP SITES queries. I believe that's largely what iOS does. Perhaps we just can't escape the fact that queries get complicated, and there's a bunch of data.
Talking to bobm, we did do a limited migration of users right around the time the 95th percentile started to spike - nearly to the day. Roughly speaking, on the order of 1% of android sync users were migrated to a new storage server. Meaning that history tables for ~1% of Android sync users have likely grown quite a bit. Not for all affected users, but probably most of them.

While we probably can't say for certain given the low % of migrated users, given the timeframes, this seems like a likely explanation.

In a way that's unsatisfactory, in a sense that it just underscores that we need to focus on thinking how to build a caching/materialized view-like layer. E.g. it would be helpful if we didn't have to perform the expensive frecency sort every time user types anything into an awesome bar. It would also be helpful if we didn't have to recompute user's largely-stable top sites every time they open a new tab.
Since we can't really segment our analytics data in a useful way, let's have a very non-scientific poll: if you're experiencing this regression, are you also a Sync user? How many history records do you have in your local (android's browser.db) databases?

One thing we _could_ do is to take a closer look at the sync ping data reported by android, specifically at volumes of history/bookmark records being synced. We can't quite look at this in re:dash, so this would involve writing up some python jupyter notebooks.

Mike, if you're keen, I reckon it's still worth being thorough to see if we're not doing something obviously silly/unnecessary that may be avoided without more fundamental changes.
> if you're experiencing this regression, are you also a Sync user? How many history records do you have
> in your local (android's browser.db) databases?

Yes, I am a sync user.

- Personal profile (reproducible): 41256 from history and 475 from bookmarks
- Clean install, syncing with my personal account (unreproducible): 5829 history, 466 bookmarks

---

Regression hunting:
- (australis flavor) 33d0cd5c672a can still reproduce: it's unlikely photon is related
- (arbitrary 56 changeset) aa10d1423912 can still reproduce: it's unlikely anything specifically in 57 affected this

---

I'll read more thoroughly through the comments here on Monday but it looks like the sheer number of items I have could be related. Thanks for the research, Grisha.
(In reply to Liz Henry (:lizzard) (needinfo? me) from comment #25)
> Ioana, can you help reproduce this and maybe find a regression window?
> Thanks!

Hi Liz, as per comment #8, comment #12 and comment #14 we tried with several devices we have in the office and unfortunately we were not able to reproduce it. My profile synced is very big but still not much of a delay for me - see video in comment #8
Flags: needinfo?(chiorean.ioana)
We merge 57 to release in 6 days: this won't make it and probably doesn't warrant a dot release.
To summarize mine and Grisha's comments:
- There have been no obvious changes to the query code (query, indices, schema) to warrant this (comment 29)
- Photon doesn't seem to affect it (comment 32)
- 56 exhibits the same behavior and so nothing specifically in 57 is singly to blame (comment 32)
- The data has changed for some users: Sync removed overall history and bookmarking limits in 54. There was a node migration for ~1% of sync users in 57 which may explain the additional spike (comment 29, comment 30)
- Personally, I cannot reproduce on the same build with little data (6k history) but I can with a lot (41k history) (comment 32)
- My profiler says the CPU gets pegged at 40% when typing - likely because one or two cores have heavy work. A lot of time is spent in nativeExecuteForCursorWindow, which is presumably where the query happens (comment 28). There are no noticeable problems with memory.

Ways we can continue to investigate:
- What does nativeExecuteForCursorWindow actually do? Is it the CPU or disk that's blocking?
- "One thing we _could_ do is to take a closer look at the sync ping data reported by android, specifically at volumes of history/bookmark records being synced. We can't quite look at this in re:dash, so this would involve writing up some python jupyter notebooks."

And some possible, high effort remedies from Grisha:
- see if we can substantially improve the combined view query
- FTS support!
- use a cache layer. That is, pre-compute the "combined" view. Would also be good to pre-compute the TOP SITES queries. I believe that's largely what iOS does. Perhaps we just can't escape the fact that queries get complicated, and there's a bunch of data.

And me:
- Prune the amount of data we store so we can't hit these problems
- Reduce the amount of data we sync

Unfortunately, it sounds like this is definitely not an AS problem and I have my hands full with other work so I'll have to pass it off to my fennec and sync brethren. Please let me know if I can help you out with anything!
Assignee: michael.l.comella → nobody
Whiteboard: [summary in comment 35]
Flags: needinfo?(michael.l.comella)
Agree with comment 35
Please help prioritze this.
Maybe we can test this on a lower end device.
Flags: needinfo?(wehuang)
Flags: needinfo?(nyee)
Flags: needinfo?(cnevinchen)
(In reply to Nevin Chen [:nechen] from comment #36)
> Maybe we can test this on a lower end device.
I am experiencing this issue with a brand new Samsung S8.

I am bumping the severity as it is pretty serious; the UX is pretty bad.
Severity: normal → critical
We need more Samsung device to test. 
Mike's device is also Samsung at
Flags: needinfo?(max)
(In reply to Nevin Chen [:nechen] from comment #38)
> We need more Samsung device to test. 
> Mike's device is also Samsung at

This doesn't seem restricted to Samsung devices. I have a Nexus 5.
Other possible solutions: it appears desktop Firefox does history expiration based on how performant your places DB is [1] and, as far as I know (I haven't looked this up), we don't do any history expiration on mobile. fwiw, desktop currently has about ~4.6x the number of sites than I do on mobile.

[1]: https://superuser.com/a/1114650
(In reply to Michael Comella (:mcomella) from comment #40)
> Other possible solutions: it appears desktop Firefox does history expiration
> based on how performant your places DB is [1] and, as far as I know (I
> haven't looked this up), we don't do any history expiration on mobile.

I remember having come across some code that claimed to do this (keeping 2000 entries or at least a month of history, whichever leaves more, if I remember correctly), but in practice the oldest entry (sorted by DATE) in my main profile (synced with Desktop) claims to be from the end of June - and I certainly didn't visit that particular page again within the last few months.
So yeah, something seems amiss there...
Something about our history expiration code must definitively be broken, because according to the logcat we do attempt to expire history, however in practice nothing happens. My (synced) main profile is now at 66 MB/106,010 history entries and counting and doesn't show any signs of expiring any history and neither does my local build's development profile which isn't synced but has a copy of my main profile's browser.db which I copied over some time ago.

Our query for expiring history looks something like
> DELETE FROM history WHERE MAX(date, modified) < 1512612273999  AND _id IN ( SELECT _id FROM history ORDER BY date ASC LIMIT 103985)

and if I attempt to run that query on a copy of browser.db using SQLite Debugger, this indeed results in 0 rows being deleted.
As far as I can tell, this happens because for most history entries, the modified colummn is <null> (so max(date, modified) will return <null> as well) and for those entries where modified isn't null, it happens to lie within the last four weeks, i.e. the grace period where we won't expire history.

The last substantial change in our history expiration query (which only affected the sort order, though) was in 2016, otherwise it's basically unchanged since 2012 (https://hg.mozilla.org/mozilla-central/log/351c75ab74c9a83db5c0662ba271b49479adb1f1/mobile/android/base/java/org/mozilla/gecko/db/BrowserProvider.java?patch=&linerange=480:486).
Since I seem to remember that history expiration still worked up until approximatively last summer, something about the way we set the modified date must have changed at some point.

Looking a bit closer at my profile, I think the issue is that for all history entries that have been added via Sync, but never visited locally (date_local = 0), their created and modified columns are set to <null>, which then breaks our history expiration code.
Flags: needinfo?(gkruglov)
Thanks for digging in, Jan.

It does appear that the "fast path" that sync uses to transactionally insert history/visits foolishly ignores the 'modified' column.

We can trivially patch the expiration query (use 'coalesce' around modified) to get expiration logic to work again.

This leaves our end-to-end story around history in a somewhat unresolved state, however. There's a tension between browser's needs (gotta be fast) and sync's needs (gotta be consistent and resilient).

As of some time now, we specifically download all of user's history that is present on the server. That's a good thing from a few angles - we can ensure data consistency, we create another full copy of the data that will be used to re-populate the server in case of a node re-assignment. This also improves quality of the awesomebar search results, at least in theory (our frecency logic was written before this change, when we only had access to limited data set on the client, and so it might need revisiting).

As we're observing here, the larger local history table, coupled with the broken expiration logic, creates a serious performance bottleneck. We can trivially patch up the expiration logic, and get top-sites & awesomebar queries running fast(er). This, however, progressively negates benefits of having a full local copy of the data.

We should also approach this from a "avoid doing the unavoidable expensive work" angle - cache results, etc, but that's a longer term path.

Given how bad the current situation is - and the trend is quite negative as well as users are accruing more history, see the updated query [0] - I'm in favor of a short-term solution (fix the expiration query & bulk insertion logic) that restores our performance. We have to do that regardless! This is also a good time to review how aggressively we expire, and when.

I'll get those patches ready.

[0] https://telemetry.mozilla.org/new-pipeline/evo.html#!aggregates=75th-percentile!95th-percentile&cumulative=0&end_date=null&keys=&max_channel_version=beta%252F58&measure=FENNEC_SEARCH_LOADER_TIME_MS&min_channel_version=beta%252F50&processType=*&product=Fennec&sanitize=1&sort_keys=submissions&start_date=null&trim=1&use_submission_date=0
Flags: needinfo?(gkruglov)
Clearing outstanding NIs, I think we have enough information at this point.
Flags: needinfo?(wehuang)
Flags: needinfo?(max)
Flags: needinfo?(jcheng)
Depends on: 1428165
Can the large amount of history slow down page loading too? (asking because of bug 1429336)
Flags: needinfo?(gkruglov)
afaik, the awesomescreen queries continue to run in the background, even after you've entered a page to load so yes, I think they can. But I'll let grisha do more thorough research.
Looking at Bug 1420641, and some of the videos posted there (open fennec, tap on a top site, wait forever), I don't think think is related.
Flags: needinfo?(gkruglov)
JanH, Bug 1428165 landed and perf graphs are looking much better now; how's your anecdotal experience on Nightly?
Flags: needinfo?(jh+bugzilla)
Definitively snappier, like it used to be.
Flags: needinfo?(jh+bugzilla)
Anecdotally, my GS5 perf is also much better – thanks for the update!
Can we safely call this fixed now?
Flags: needinfo?(mcastelluccio)
Yes, it works for me!
Status: NEW → RESOLVED
Closed: 2 years ago
Flags: needinfo?(mcastelluccio)
Resolution: --- → FIXED
Assignee: nobody → gkruglov
Target Milestone: --- → Firefox 59
You need to log in before you can comment on or make changes to this bug.