Closed Bug 722188 Opened 12 years ago Closed 12 years ago

mozilla::storage::Connection::BeginTransactionAs hangs for > 20 seconds waiting for frecency update

Categories

(Toolkit :: Places, defect)

x86
macOS
defect
Not set
normal

Tracking

()

RESOLVED FIXED

People

(Reporter: BenWa, Unassigned)

References

Details

(Keywords: perf, regression, Whiteboard: [Snappy:p1])

Attachments

(2 files)

I was watching a stream (Flash) with a single other tab open and got a hang for over 20 seconds. Using the built-in profiler I was able to trace it back to 'mozilla::storage::Connection::BeginTransactionAs'.

See the full profile here (7MB, pauses the browser while parsing):
http://varium.fantasytalesonline.com/cleopatra/?report=AMIfv94ios_yEKbgE1nHw5Jf0laaiQv2Wrz7x3TlJ35E9wPE5kitc25XIfEYg_-EYA285BK32HQI9iAGvdA5Mv2o2sydR5DVGhYsFqQZBAbjD3kg75lPoLZEHH3K2dE1zWabEeHFmDaRzhgdBBiJC524v6LrUdN5Vg
Component: Profile: BackEnd → Places
Product: Core → Toolkit
QA Contact: profile-manager-backend → places
Right now I'm having some trouble fetching the profile; did you identify a specific stack starting from Places?
Were you "idle" (i.e. not using keyboard/mouse for 5 minutes or more) while watching this stream?
Next time, please also check any SQL reported in about:telemetry after the fact; it may help identify the conflicting queries.
We're thinking of adding dynamic samples to the profiler. If this area of the code isn't a hot loop it could be a great place to add this feature. We could tag the profiling sample with the SQL that was executing. What do you think?
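(A minimal sketch of how such dynamic samples might look, purely illustrative; none of these names are real Gecko Profiler or mozStorage APIs. The idea is that the storage layer publishes the SQL it is executing and the sampler copies it into the sample it takes.)

// Hypothetical sketch only: annotate profiler samples with the SQL that is
// currently executing; not actual Gecko code.
#include <atomic>
#include <string>

// Points at a string owned by the executing statement; cleared when done.
static std::atomic<const char*> gCurrentSQL{nullptr};

// RAII guard the storage layer would wrap around statement execution.
class AutoSQLAnnotation {
 public:
  explicit AutoSQLAnnotation(const char* aSQL) { gCurrentSQL.store(aSQL); }
  ~AutoSQLAnnotation() { gCurrentSQL.store(nullptr); }
};

// Called by the sampler when it takes a sample: copy the SQL (if any) so a
// hang shows up in the profile with its query attached.
std::string GetSQLAnnotationForSample() {
  const char* sql = gCurrentSQL.load();
  return sql ? std::string(sql) : std::string();
}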

Profiles range from 1MB to 10MB and render as a very heavy SVG page, so you'll need a fast machine with a good connection.

I was interacting with the page, going from fullscreen to windowed mode and checking the integrated chat on a http://www.twitch.tv/ live cast. My uptime wasn't high and I hadn't visited many pages. I have no reason to believe it was a fluke or that my machine was running poorly.
(In reply to Benoit Girard (:BenWa) from comment #3)
> We're thinking of adding dynamic samples to the profiler. If this area of
> the code isn't a hot loop it could be a great place to add this feature. We
> could tag the profiling sample with the SQL that was executing. What do you
> think?

Well, if there is a statement executing at that time, that would be useful. Telemetry just tracks slow queries, but I think they are not yet reported to us (users have to manually file bugs).

> Profiles range from 1MB to 10MB and render as a very heavy SVG page, so
> you'll need a fast machine with a good connection.

Yep, unfortunately I'm not at my usual location this week, no fast machine :(

> My
> uptime wasn't high and I hadn't visited many pages. I have no reason to
> believe it was a fluke or that my machine was running poorly.

Yes, my question is more about idle, since on idle we run maintenance work that may be slow. Plus, at the moment some of that work may cause contention; I'll file a bug for that shortly. So my assumption is that you were not touching the keyboard and mouse while watching the stream, and after a few minutes the maintenance work ran. That is, provided this was Places and not work in another database, which I can't tell without seeing the stack.
adding a dependency for now, to track that.
(In reply to Marco Bonardo [:mak] from comment #4)
> Yes, my question is more about idle, since on idle we run maintenance work
> that may be slow. Plus, at the moment some of that work may cause contention;
> I'll file a bug for that shortly. So my assumption is that you were not
> touching the keyboard and mouse while watching the stream, and after a few
> minutes the maintenance work ran.

I WAS interacting with the page. Please see Comment 3, last paragraph. I'll post the stack shortly.
ah sorry, I didn't pay enough attention to that. That's strange: unless it was some sort of work coming from Sync, Places doesn't execute anything synchronous while the user browses, and this sounds like thread contention.
Attached image Stack screenshot
It's being called from JS -> nsNavBookmarks -> nsNavHistory
Attachment #592781 - Attachment is patch: false
Attachment #592781 - Attachment mime type: text/plain → image/png
So this is a bookmarks runInBatchMode request; considering what you were doing at that time, I can only think of 2 reasons:
1. livemarks updates (bug 613588 will fix that)
2. Sync, or another add-on using bookmarks. We will shortly start working on an async bookmarking API for Sync.
I've hit this, too, while watching a stream (i.e. idle). Here's a profile that doesn't hang the browser and that also contains stacks of other threads (like the SQLite thread).

The only SQL I have in about:telemetry is "INSERT OR REPLACE INTO webappsstore2 SELECT * FROM webappsstore2_temp WHERE scope = :scope", but I haven't had the web app store extension installed for a long time.
By the way, I've hit this bug regularly during the last one or two weeks, but never before that.
webappsstore is DOMStorage, fwiw, not related to app store.

I wonder if this is a thread scheduling change due to the SQLite upgrade, because there haven't been other large changes in databases. Though DOMStorage can't cause thread contention. So I wonder what's up; if it were a SQLite hang it should be reported in telemetry.
(In reply to Marco Bonardo [:mak] from comment #12)
> webappsstore is DOMStorage, fwiw, not related to app store.

I see.

> I wonder if this is a thread scheduling change due to the SQLite upgrade,
> because there haven't been other large changes in databases. Though DOMStorage
> can't cause thread contention. So I wonder what's up; if it were a SQLite
> hang it should be reported in telemetry.

When does that reporting happen? After the query completes? The browser was hung and I had to kill it, maybe it didn't have a chance to report the query before it was terminated.
From the profile in comment 10 it looks like runInBatchMode is blocked on the mutex; it would be really useful to get the query that is running at that time on the async thread and holding the mutex.

(In reply to Markus Stange from comment #13)
> When does that reporting happen? After the query completes? The browser was
> hung and I had to kill it, maybe it didn't have a chance to report the query
> before it was terminated.

unfortunately yes :(
It's even possible some async query became slower than usual due to some query planner change, and we should just fix it, though hard to tell off-hand.
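(For readers following along, here is a self-contained sketch of the contention pattern described in comment 14: a plain std::mutex and a sleep stand in for the real connection mutex and the slow async query, so this only illustrates the shape of the hang, not the actual mozStorage code.)

// A long-running "async query" holds the shared connection mutex, so the main
// thread's synchronous BeginTransaction-style call blocks until it finishes.
#include <chrono>
#include <iostream>
#include <mutex>
#include <thread>

std::mutex connectionMutex;  // stands in for the SQLite connection's mutex

void AsyncThreadRunsSlowQuery() {
  std::lock_guard<std::mutex> lock(connectionMutex);
  // e.g. "UPDATE moz_places SET frecency = ROUND(frecency * .975) WHERE frecency > 0"
  std::this_thread::sleep_for(std::chrono::seconds(20));  // simulated slow query
}

void MainThreadBeginTransaction() {
  auto start = std::chrono::steady_clock::now();
  std::lock_guard<std::mutex> lock(connectionMutex);  // blocks here -> UI hang
  auto waited = std::chrono::duration_cast<std::chrono::seconds>(
      std::chrono::steady_clock::now() - start);
  std::cout << "BeginTransaction waited " << waited.count() << "s\n";
}

int main() {
  std::thread async(AsyncThreadRunsSlowQuery);
  std::this_thread::sleep_for(std::chrono::milliseconds(100));  // let it grab the lock
  MainThreadBeginTransaction();
  async.join();
}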
(In reply to Marco Bonardo [:mak] from comment #14)
> would be really useful to get the query that is running at that time

How can I do that? Can I attach a debugger and execute something that gives me the query?
Blocks: SQLite3.7.10
Keywords: regression
(In reply to Markus Stange from comment #15)
> (In reply to Marco Bonardo [:mak] from comment #14)
> > would be really useful to get the query that is running at that time
> 
> How can I do that? Can I attach a debugger and execute something that gives
> me the query?

sure, you should be able to find the Storage async thread waiting on the mutex and get the SQL query string from the statement object
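(If it helps, a small helper along these lines could be compiled into a debug build, or its body evaluated from the debugger, to dump what the connection is executing. sqlite3_next_stmt(), sqlite3_stmt_busy() and sqlite3_sql() are real SQLite C APIs (sqlite3_stmt_busy needs SQLite 3.7.10+), but the helper itself and how you obtain the raw sqlite3* handle are assumptions.)

// Debugging sketch: walk every prepared statement on a SQLite connection and
// print the SQL of the ones that are currently mid-execution.
#include <cstdio>
#include "sqlite3.h"

void DumpBusyStatements(sqlite3* db) {
  for (sqlite3_stmt* stmt = sqlite3_next_stmt(db, nullptr); stmt;
       stmt = sqlite3_next_stmt(db, stmt)) {
    if (sqlite3_stmt_busy(stmt)) {  // stepped but not yet reset or finished
      printf("busy statement: %s\n", sqlite3_sql(stmt));
    }
  }
}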
If this is in fact the same issue as my dupe, I've been hitting it on Mac/Win/Linux for at least a week. It's causing really terrible hangs in my daily browsing.
(In reply to Ted Mielczarek [:ted, :luser] from comment #18)
> If this is in fact the same issue as my dupe, I've been hitting it on
> Mac/Win/Linux for at least a week. It's causing really terrible hangs in my
> daily browsing.

You can confirm by switching to the nightly-profiling build for your daily browsing, installing the profiler extension, browsing normally, and dumping your profile on any hang. The performance impact is not noticeable.
I've just hit this hang again in a debug build with mozStorage logging enabled, and the whole terminal was full of thousands of lines that all said the same thing:
-1333661696[11c118e0]: sqlite3_trace on 11bb2cc0 for '-- TRIGGER moz_places_afterupdate_frecency_trigger'

Everything before that was lost, unfortunately. I'm now logging to a file.
Hit it again; the lines before it are:

-1333661696[116173e0]: Resetting statement: 'SELECT typed, hidden, visit_count, (SELECT count(*) FROM moz_historyvisits WHERE place_id = :page_id), EXISTS ( SELECT 1 FROM moz_bookmarks WHERE fk = :page_id AND NOT EXISTS( SELECT 1 FROM moz_items_annos a JOIN moz_anno_attributes n ON a.anno_attribute_id = n.id WHERE n.name = :anno_name AND a.item_id = parent ) ), (url > 'place:' AND url < 'place;') FROM moz_places WHERE id = :page_id '
-1333661696[116173e0]: sqlite3_trace on 11620a80 for '-- TRIGGER moz_places_afterupdate_frecency_trigger'
-1407036736[152ee0]: sqlite3_trace on 5094e0 for 'SELECT id FROM moz_downloads WHERE source = 'http://forums.macrumors.com/showthread.php?p=12485882''
-1333661696[116173e0]: Initialized statement 'UPDATE moz_places SET frecency = ROUND(frecency * .975) WHERE frecency > 0' (0x178ce730)
-1333661696[116173e0]: Initialized statement 'UPDATE moz_inputhistory SET use_count = use_count * .975' (0x178cded0)
-1333661696[116173e0]: sqlite3_trace on 11620a80 for 'BEGIN IMMEDIATE'
-1333661696[116173e0]: sqlite3_trace on 11620a80 for 'UPDATE moz_places SET frecency = ROUND(frecency * .975) WHERE frecency > 0'
-1407036736[152ee0]: Resetting statement: 'SELECT id FROM moz_downloads WHERE source = :source'
-1333661696[116173e0]: sqlite3_trace on 11620a80 for '-- TRIGGER moz_places_afterupdate_frecency_trigger'
-1333661696[116173e0]: sqlite3_trace on 11620a80 for '-- TRIGGER moz_places_afterupdate_frecency_trigger'
... (repeat 6787 times)
I can see a relation with 'UPDATE moz_places SET frecency = ROUND(frecency * .975) WHERE frecency > 0'.
This only happens on idle-daily though, so that's interesting. Were you idle, or had you just come back from idle?
I was idle. I was watching a flash video in full screen mode.
Depends on: 723044
Thanks, this was really useful. I think I figured out a possible cause for the increase in contention (bug 723044) and will work on it.
Cool!

Just for further confirmation, today I waited until the freeze was over, and now I have this row on about:telemetry under "Slow SQL Statements on Other Thread":
1	1481694	UPDATE moz_places SET frecency = ROUND(frecency * .975) WHERE frecency > 0

1481694 ms is 24.7 minutes. Maybe the SQLite upgrade just made this query slower?
We made it slower; I'm not sure the SQLite upgrade is involved. In any case it's something we can fix, and we should backport the fix to 12.
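(To illustrate why the decay statement can blow up like this, here is a self-contained sketch with a simplified schema, not the real moz_places one: an AFTER UPDATE trigger fires once per row the UPDATE touches, so a single table-wide decay statement turns into thousands of trigger invocations, matching the repeated trigger lines in the log above.)

// Simplified demo: one UPDATE over the whole table fires the trigger per row.
#include <cstdio>
#include "sqlite3.h"

int main() {
  sqlite3* db;
  sqlite3_open(":memory:", &db);
  sqlite3_exec(db,
      "CREATE TABLE places(id INTEGER PRIMARY KEY, frecency INTEGER);"
      "CREATE TABLE stats(fired INTEGER); INSERT INTO stats VALUES (0);"
      "CREATE TRIGGER afterupdate_frecency AFTER UPDATE OF frecency ON places "
      "BEGIN UPDATE stats SET fired = fired + 1; END;",
      nullptr, nullptr, nullptr);

  // Populate a few thousand rows, standing in for a real places database.
  sqlite3_exec(db, "BEGIN", nullptr, nullptr, nullptr);
  sqlite3_stmt* ins;
  sqlite3_prepare_v2(db, "INSERT INTO places(frecency) VALUES (?1)", -1, &ins, nullptr);
  for (int i = 1; i <= 5000; ++i) {
    sqlite3_bind_int(ins, 1, i);
    sqlite3_step(ins);
    sqlite3_reset(ins);
  }
  sqlite3_finalize(ins);
  sqlite3_exec(db, "COMMIT", nullptr, nullptr, nullptr);

  // The decay query from the log: one statement, but it visits every row with
  // frecency > 0 and fires the trigger for each of them.
  sqlite3_exec(db,
      "UPDATE places SET frecency = ROUND(frecency * .975) WHERE frecency > 0",
      nullptr, nullptr, nullptr);

  sqlite3_stmt* sel;
  sqlite3_prepare_v2(db, "SELECT fired FROM stats", -1, &sel, nullptr);
  if (sqlite3_step(sel) == SQLITE_ROW) {
    printf("trigger fired %d times\n", sqlite3_column_int(sel, 0));  // prints 5000
  }
  sqlite3_finalize(sel);
  sqlite3_close(db);
}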
Summary: mozilla::storage::Connection::BeginTransactionAs hangs for > 20 seconds → mozilla::storage::Connection::BeginTransactionAs hangs for > 20 seconds waiting for frecency update
Whiteboard: [Snappy] → [Snappy:p1]
Not related to the SQLite upgrade. The remaining work is in bug 723611, though the major issues are fixed on trunk.
No longer blocks: SQLite3.7.10
All the low-hanging fruit has been fixed; the remaining work is tracked by bug 699820.
That said, the major cause of this explosion of hangs is fixed on Aurora and Nightly, which are the only affected versions.
So I'm calling this fixed by dependencies. If you are ever able to reproduce it again, please wait for the freeze to finish, copy the SQL statements from about:telemetry, and file a specific bug for those.
Status: NEW → RESOLVED
Closed: 12 years ago
Resolution: --- → FIXED
I haven't noticed this jankiness in weeks, so it definitely looks fixed from my end. Thanks!