Closed Bug 722188 Opened 12 years ago Closed 12 years ago

mozilla::storage::Connection::BeginTransactionAs hangs for > 20 seconds waiting for frecency update

Categories

(Toolkit :: Places, defect)

x86
macOS
defect
Not set
normal

Tracking

()

RESOLVED FIXED

People

(Reporter: BenWa, Unassigned)

References

Details

(Keywords: perf, regression, Whiteboard: [Snappy:p1])

Attachments

(2 files)

I was watching a stream (Flash) with a single other tab open and got a hang for over 20 seconds. Using the built-in profiler I was able to trace it back to 'mozilla::storage::Connection::BeginTransactionAs'.

See the full profile here (7MB, pauses the browser while parsing):
http://varium.fantasytalesonline.com/cleopatra/?report=AMIfv94ios_yEKbgE1nHw5Jf0laaiQv2Wrz7x3TlJ35E9wPE5kitc25XIfEYg_-EYA285BK32HQI9iAGvdA5Mv2o2sydR5DVGhYsFqQZBAbjD3kg75lPoLZEHH3K2dE1zWabEeHFmDaRzhgdBBiJC524v6LrUdN5Vg
Component: Profile: BackEnd → Places
Product: Core → Toolkit
QA Contact: profile-manager-backend → places
Right now I'm having some trouble fetching the profile; did you identify a specific stack starting from Places?
Were you "idle" (i.e. not using keyboard/mouse for 5 minutes or more) while watching this stream?
Next time, please also check any SQL reported in about:telemetry after the fact; it may help identify the conflicting queries.
We're thinking of adding dynamic samples to the profiler. If this area of the code isn't a hot loop it could be a great place to add this feature. We could tag the profiling sample with the SQL that was executing. What do you think?
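(A minimal sketch of how such dynamic samples might look, purely illustrative; none of these names are real Gecko Profiler or mozStorage APIs. The idea is that the storage layer publishes the SQL it is executing and the sampler copies it into the sample it takes.)

// Hypothetical sketch only: annotate profiler samples with the SQL that is
// currently executing; not actual Gecko code.
#include <atomic>
#include <string>

// Points at a string owned by the executing statement; cleared when done.
static std::atomic<const char*> gCurrentSQL{nullptr};

// RAII guard the storage layer would wrap around statement execution.
class AutoSQLAnnotation {
 public:
  explicit AutoSQLAnnotation(const char* aSQL) { gCurrentSQL.store(aSQL); }
  ~AutoSQLAnnotation() { gCurrentSQL.store(nullptr); }
};

// Called by the sampler when it takes a sample: copy the SQL (if any) so a
// hang shows up in the profile with its query attached.
std::string GetSQLAnnotationForSample() {
  const char* sql = gCurrentSQL.load();
  return sql ? std::string(sql) : std::string();
}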

Profiles range from 1MB to 10MB and render as a very heavy SVG page, so you'll need a fast machine with a good connection.

I was interacting with the page, going from fullscreen to windowed mode and checking the integrated chat on a http://www.twitch.tv/ live cast. My uptime wasn't high and I hadn't visited many pages. I have no reason to believe it was a fluke or that my machine was running poorly.
(In reply to Benoit Girard (:BenWa) from comment #3)
> We're thinking of adding dynamic samples to the profiler. If this area of
> the code isn't a hot loop it could be a great place to add this feature. We
> could tag the profiling sample with the SQL that was executing. What do you
> think?

Well, if there is a statement executing at that time, that would be useful. Telemetry just tracks slow queries, but I think they are not yet reported to us (users have to manually file bugs).

> Profiles range from 1MB to 10MB and render as a very heavy SVG page, so
> you'll need a fast machine with a good connection.

Yep, unfortunately I'm not at my usual location this week, no fast machine :(

> My
> uptime wasn't high and I hadn't visited many pages. I have no reason to
> believe it was a fluke or that my machine was running poorly.

Yes, my question is more about idle, since on idle we run maintenance work that may be slow. Plus, at the moment some of that work may cause contention; I'll file a bug for that shortly. So my assumption is that you were not touching the keyboard and mouse while watching the stream, and after a few minutes the maintenance work ran. That is, provided this was Places and not work in another database, which I can't tell without seeing the stack.
adding a dependency for now, to track that.
(In reply to Marco Bonardo [:mak] from comment #4)
> Yes, my question is more about idle, since on idle we run maintenance work
> that may be slow. Plus, at the moment some of that work may cause contention;
> I'll file a bug for that shortly. So my assumption is that you were not
> touching the keyboard and mouse while watching the stream, and after a few
> minutes the maintenance work ran.

I WAS interacting with the page. Please see Comment 3, last paragraph. I'll post the stack shortly.
ah sorry, I didn't pay enough attention to that. That's strange: unless it was some sort of work coming from Sync, Places doesn't execute anything synchronous while the user browses, and this sounds like thread contention.
Attached image Stack screenshot
It's being called from JS -> nsNavBookmarks -> nsNavHistory
Attachment #592781 - Attachment is patch: false
Attachment #592781 - Attachment mime type: text/plain → image/png
So this is a bookmarks runInBatchMode request; considering what you were doing at that time, I can only think of 2 reasons:
1. livemarks updates (bug 613588 will fix that)
2. Sync, or another add-on using bookmarks. We will shortly start working on an async bookmarking API for Sync.
I've hit this, too, while watching a stream (i.e. idle). Here's a profile that doesn't hang the browser and that also contains stacks of other threads (like the SQLite thread).

The only SQL I have in about:telemetry is "INSERT OR REPLACE INTO webappsstore2 SELECT * FROM webappsstore2_temp WHERE scope = :scope", but I haven't had the web app store extension installed for a long time.
By the way, I've hit this bug regularly during the last one or two weeks, but never before that.
webappsstore is DOMStorage, fwiw, not related to app store.

I wonder if this is a thread scheduling change due to the SQLite upgrade, because there haven't been other large changes in databases. Though DOMStorage can't cause thread contention. So I wonder what's up; if it were a SQLite hang it should be reported in telemetry.
(In reply to Marco Bonardo [:mak] from comment #12)
> webappsstore is DOMStorage, fwiw, not related to app store.

I see.

> I wonder if this is a thread scheduling change due to the SQLite upgrade,
> because there haven't been other large changes in databases. Though DOMStorage
> can't cause thread contention. So I wonder what's up; if it were a SQLite
> hang it should be reported in telemetry.

When does that reporting happen? After the query completes? The browser was hung and I had to kill it, maybe it didn't have a chance to report the query before it was terminated.
From the profile in comment 10 it looks like runInBatchMode is blocked on the mutex; it would be really useful to get the query that is running at that time on the async thread and holding the mutex.

(In reply to Markus Stange from comment #13)
> When does that reporting happen? After the query completes? The browser was
> hung and I had to kill it, maybe it didn't have a chance to report the query
> before it was terminated.

unfortunately yes :(
It's even possible some async query became slower than usual due to some query planner change, and we should just fix it, though hard to tell off-hand.
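(For readers following along, here is a self-contained sketch of the contention pattern described in comment 14: a plain std::mutex and a sleep stand in for the real connection mutex and the slow async query, so this only illustrates the shape of the hang, not the actual mozStorage code.)

// A long-running "async query" holds the shared connection mutex, so the main
// thread's synchronous BeginTransaction-style call blocks until it finishes.
#include <chrono>
#include <iostream>
#include <mutex>
#include <thread>

std::mutex connectionMutex;  // stands in for the SQLite connection's mutex

void AsyncThreadRunsSlowQuery() {
  std::lock_guard<std::mutex> lock(connectionMutex);
  // e.g. "UPDATE moz_places SET frecency = ROUND(frecency * .975) WHERE frecency > 0"
  std::this_thread::sleep_for(std::chrono::seconds(20));  // simulated slow query
}

void MainThreadBeginTransaction() {
  auto start = std::chrono::steady_clock::now();
  std::lock_guard<std::mutex> lock(connectionMutex);  // blocks here -> UI hang
  auto waited = std::chrono::duration_cast<std::chrono::seconds>(
      std::chrono::steady_clock::now() - start);
  std::cout << "BeginTransaction waited " << waited.count() << "s\n";
}

int main() {
  std::thread async(AsyncThreadRunsSlowQuery);
  std::this_thread::sleep_for(std::chrono::milliseconds(100));  // let it grab the lock
  MainThreadBeginTransaction();
  async.join();
}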
(In reply to Marco Bonardo [:mak] from comment #14)
> would be really useful to get the query that is running at that time

How can I do that? Can I attach a debugger and execute something that gives me the query?
Blocks: SQLite3.7.10
Keywords: regression
(In reply to Markus Stange from comment #15)
> (In reply to Marco Bonardo [:mak] from comment #14)
> > would be really useful to get the query that is running at that time
> 
> How can I do that? Can I attach a debugger and execute something that gives
> me the query?

sure, you should be able to find the Storage async thread waiting on the mutex and get the SQL query string from the statement object
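(If it helps, a small helper along these lines could be compiled into a debug build, or its body evaluated from the debugger, to dump what the connection is executing. sqlite3_next_stmt(), sqlite3_stmt_busy() and sqlite3_sql() are real SQLite C APIs (sqlite3_stmt_busy needs SQLite 3.7.10+), but the helper itself and how you obtain the raw sqlite3* handle are assumptions.)

// Debugging sketch: walk every prepared statement on a SQLite connection and
// print the SQL of the ones that are currently mid-execution.
#include <cstdio>
#include "sqlite3.h"

void DumpBusyStatements(sqlite3* db) {
  for (sqlite3_stmt* stmt = sqlite3_next_stmt(db, nullptr); stmt;
       stmt = sqlite3_next_stmt(db, stmt)) {
    if (sqlite3_stmt_busy(stmt)) {  // stepped but not yet reset or finished
      printf("busy statement: %s\n", sqlite3_sql(stmt));
    }
  }
}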
If this is in fact the same issue as my dupe, I've been hitting it on Mac/Win/Linux for at least a week. It's causing really terrible hangs in my daily browsing.
(In reply to Ted Mielczarek [:ted, :luser] from comment #18)
> If this is in fact the same issue as my dupe, I've been hitting it on
> Mac/Win/Linux for at least a week. It's causing really terrible hangs in my
> daily browsing.

You can confirm by switching to the nightly-profiling build for your daily browsing, installing the profiler extension, browsing normally, and dumping your profile on any hang. The performance impact is not noticeable.
I've just hit this hang again in a debug build with mozStorage logging enabled, and the whole terminal was full of thousands of lines that all said the same thing:
-1333661696[11c118e0]: sqlite3_trace on 11bb2cc0 for '-- TRIGGER moz_places_afterupdate_frecency_trigger'

Everything before that was lost, unfortunately. I'm now logging to a file.
Hit it again; the lines before it are:

-1333661696[116173e0]: Resetting statement: 'SELECT typed, hidden, visit_count, (SELECT count(*) FROM moz_historyvisits WHERE place_id = :page_id), EXISTS ( SELECT 1 FROM moz_bookmarks WHERE fk = :page_id AND NOT EXISTS( SELECT 1 FROM moz_items_annos a JOIN moz_anno_attributes n ON a.anno_attribute_id = n.id WHERE n.name = :anno_name AND a.item_id = parent ) ), (url > 'place:' AND url < 'place;') FROM moz_places WHERE id = :page_id '
-1333661696[116173e0]: sqlite3_trace on 11620a80 for '-- TRIGGER moz_places_afterupdate_frecency_trigger'
-1407036736[152ee0]: sqlite3_trace on 5094e0 for 'SELECT id FROM moz_downloads WHERE source = 'http://forums.macrumors.com/showthread.php?p=12485882''
-1333661696[116173e0]: Initialized statement 'UPDATE moz_places SET frecency = ROUND(frecency * .975) WHERE frecency > 0' (0x178ce730)
-1333661696[116173e0]: Initialized statement 'UPDATE moz_inputhistory SET use_count = use_count * .975' (0x178cded0)
-1333661696[116173e0]: sqlite3_trace on 11620a80 for 'BEGIN IMMEDIATE'
-1333661696[116173e0]: sqlite3_trace on 11620a80 for 'UPDATE moz_places SET frecency = ROUND(frecency * .975) WHERE frecency > 0'
-1407036736[152ee0]: Resetting statement: 'SELECT id FROM moz_downloads WHERE source = :source'
-1333661696[116173e0]: sqlite3_trace on 11620a80 for '-- TRIGGER moz_places_afterupdate_frecency_trigger'
-1333661696[116173e0]: sqlite3_trace on 11620a80 for '-- TRIGGER moz_places_afterupdate_frecency_trigger'
... (repeat 6787 times)
I can see a relation with 'UPDATE moz_places SET frecency = ROUND(frecency * .975) WHERE frecency > 0'.
This only happens on idle-daily though, so that's interesting. Were you idle, or had you just come back from idle?
I was idle. I was watching a flash video in full screen mode.
Depends on: 723044
Thanks, this was really useful. I think I figured out a possible cause for the increase in contention (bug 723044) and will work on it.
Cool!

Just for further confirmation, today I waited until the freeze was over, and now I have this row on about:telemetry under "Slow SQL Statements on Other Thread":
1	1481694	UPDATE moz_places SET frecency = ROUND(frecency * .975) WHERE frecency > 0

1481694 ms is 24.7 minutes. Maybe the SQLite upgrade just made this query slower?
We made it slower; I'm not sure the SQLite upgrade is involved. In any case it's something we can fix, and we should backport the fix to 12.
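(To illustrate why the decay statement can blow up like this, here is a self-contained sketch with a simplified schema, not the real moz_places one: an AFTER UPDATE trigger fires once per row the UPDATE touches, so a single table-wide decay statement turns into thousands of trigger invocations, matching the repeated trigger lines in the log above.)

// Simplified demo: one UPDATE over the whole table fires the trigger per row.
#include <cstdio>
#include "sqlite3.h"

int main() {
  sqlite3* db;
  sqlite3_open(":memory:", &db);
  sqlite3_exec(db,
      "CREATE TABLE places(id INTEGER PRIMARY KEY, frecency INTEGER);"
      "CREATE TABLE stats(fired INTEGER); INSERT INTO stats VALUES (0);"
      "CREATE TRIGGER afterupdate_frecency AFTER UPDATE OF frecency ON places "
      "BEGIN UPDATE stats SET fired = fired + 1; END;",
      nullptr, nullptr, nullptr);

  // Populate a few thousand rows, standing in for a real places database.
  sqlite3_exec(db, "BEGIN", nullptr, nullptr, nullptr);
  sqlite3_stmt* ins;
  sqlite3_prepare_v2(db, "INSERT INTO places(frecency) VALUES (?1)", -1, &ins, nullptr);
  for (int i = 1; i <= 5000; ++i) {
    sqlite3_bind_int(ins, 1, i);
    sqlite3_step(ins);
    sqlite3_reset(ins);
  }
  sqlite3_finalize(ins);
  sqlite3_exec(db, "COMMIT", nullptr, nullptr, nullptr);

  // The decay query from the log: one statement, but it visits every row with
  // frecency > 0 and fires the trigger for each of them.
  sqlite3_exec(db,
      "UPDATE places SET frecency = ROUND(frecency * .975) WHERE frecency > 0",
      nullptr, nullptr, nullptr);

  sqlite3_stmt* sel;
  sqlite3_prepare_v2(db, "SELECT fired FROM stats", -1, &sel, nullptr);
  if (sqlite3_step(sel) == SQLITE_ROW) {
    printf("trigger fired %d times\n", sqlite3_column_int(sel, 0));  // prints 5000
  }
  sqlite3_finalize(sel);
  sqlite3_close(db);
}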
Summary: mozilla::storage::Connection::BeginTransactionAs hangs for > 20 seconds → mozilla::storage::Connection::BeginTransactionAs hangs for > 20 seconds waiting for frecency update
Whiteboard: [Snappy] → [Snappy:p1]
Not related to the SQLite upgrade. The remaining work is in bug 723611, though the major issues are fixed on trunk.
No longer blocks: SQLite3.7.10
All the low-hanging fruit has been fixed; the remaining work is tracked by bug 699820.
That said, the major cause of this explosion of hangs is fixed on Aurora and Nightly, which are the only affected versions.
So I'm calling this fixed by dependencies. If you are ever able to reproduce it again, please wait for the freeze to finish, copy the SQL statements from about:telemetry, and file a specific bug for those.
Status: NEW → RESOLVED
Closed: 12 years ago
Resolution: --- → FIXED
I haven't noticed this jankiness in weeks, so it definitely looks fixed from my end. Thanks!