Closed Bug 699051 Opened 13 years ago Closed 13 years ago

Track slow sql queries on main thread + send them in via telemetry

Categories

(Toolkit :: Storage, defect)

x86
Windows 7
defect
Not set
normal

Tracking

()

RESOLVED FIXED
mozilla12
Tracking Status
firefox11 - ---
firefox12 + verified

People

(Reporter: taras.mozilla, Assigned: vladan)

References

Details

(Keywords: main-thread-io, perf, Whiteboard: [Snappy:P1])

Attachments

(2 files, 8 obsolete files)

27.21 KB, text/plain
Details
28.71 KB, patch
vladan
: review+
Details | Diff | Splinter Review
We should record sql queries over 100ms, then add them to the telemetry ping as a custom histogram.
ie {sql2:200, sql3:32003} where sql[12] are "SELECT *"...
How should we build that array of sql queries? Can this be done dynamically by telemetry?
(In reply to David Rajchenbach Teller [:Yoric] from comment #1)
> How should we build that array of sql queries? Can this be done dynamically
> by telemetry?

telemetry ping is just a json object. we'd add another branch to it .sql or something.
I can take a look at this if you want and if it is a priority.
Blocks: 701872
Attachment #573953 - Flags: review?(tglek)
Comment on attachment 573953 [details] [diff] [review]
Report slow prepared statements to Telemetry

Move globals into Telemetry singleton. I think the rest is ok.
Attachment #573953 - Flags: review?(tglek) → review-
Comment on attachment 573953 [details] [diff] [review]
Report slow prepared statements to Telemetry

> NS_IMETHODIMP
> TelemetryImpl::GetHistogramSnapshots(JSContext *cx, jsval *ret)
> {
>   }
>+
>+  // Add informaton about slow SQL statements to the snapshot
>+
>+  JSObject *hitCountObj = JS_NewObject(cx, NULL, NULL, NULL);
>+  JSObject *statementsObj = JS_NewObject(cx, NULL, NULL, NULL);

This used to be a leak, not sure if we have stack scanning yet in js.

>+  if (!hitCountObj || !statementsObj)
>+    return NS_ERROR_FAILURE;
>+
>+  JSBool ok = JS_DefineProperty(cx, root_obj, "slowStatements",
>+                                OBJECT_TO_JSVAL(statementsObj), NULL, NULL,
>+                                JSPROP_ENUMERATE);
>+  if (!ok)
>+    return NS_ERROR_FAILURE;
>+
>+  ok = JS_DefineProperty(cx, root_obj, "slowStatementCounts",
>+                         OBJECT_TO_JSVAL(hitCountObj), NULL, NULL,
>+                         JSPROP_ENUMERATE);
>+  if (!ok)
>+    return NS_ERROR_FAILURE;
>+
>+  if (!gHashMutex) {
>+    gHashMutex = new Mutex("Telemetry::gHashMutex");
>+    if (!gHashMutex)
>+      return NS_ERROR_FAILURE;
>+  }
>+
>+  MutexAutoLock hashMutex(*gHashMutex);
>+  EnumeratorArgs args = { cx, hitCountObj, statementsObj, 0 };
>+  if (gSlowStatementMap.IsInitialized()) {
>+    PRUint32 num = gSlowStatementMap.EnumerateEntries(StatementEnumerator,
>+                                                      static_cast<void*>(&args));
>+    if (num != gSlowStatementMap.Count())
>+      return NS_ERROR_FAILURE;
>+  }
>+

Don't pirate histogramSnapshots(). Add another method like slowSqlSnapshot() or something.
please once ready get a review from Sdwilsh on this, it may have unexpected impact for the mutex or threading.
Component: Places → Storage
QA Contact: places → storage
Attachment #573953 - Attachment is obsolete: true
Attachment #573966 - Flags: review?(tglek)
Comment on attachment 573966 [details] [diff] [review]
Report slow prepared statements to Telemetry, v2


> TelemetryImpl::TelemetryImpl():
>-mCanRecord(XRE_GetProcessType() == GeckoProcessType_Default)
>+mCanRecord(XRE_GetProcessType() == GeckoProcessType_Default),
>+mHashMutex("Telemetry::mHashMutex")
> {
>   mHistogramMap.Init(Telemetry::HistogramCount);
>+  mSlowStatementMap.Init();
> }
> 
> TelemetryImpl::~TelemetryImpl() {
>   mHistogramMap.Clear();

mSlowStatementMap doesn't get cleared?
>+    entry = sTelemetry->mSlowStatementMap.PutEntry(statement);
>+    if (NS_UNLIKELY(!entry)) {
>+      // clear the hash if out of memory
>+      sTelemetry->mSlowStatementMap.Clear();
>+      return;

that's weird but ok.


Please break out JS-related bits into a separate patch.
Attachment #573966 - Flags: review?(tglek) → review-
Try run for 78219bf72376 is complete.
Detailed breakdown of the results available here:
    https://tbpl.mozilla.org/?tree=Try&rev=78219bf72376
Results (out of 275 total builds):
    exception: 2
    success: 234
    warnings: 39
Builds available at http://ftp.mozilla.org/pub/mozilla.org/firefox/try-builds/vdjeric@mozilla.com-78219bf72376
(In reply to Taras Glek (:taras) from comment #6)
> Comment on attachment 573953 [details] [diff] [review] [diff] [details] [review]
> Report slow prepared statements to Telemetry
> 
> > NS_IMETHODIMP
> > TelemetryImpl::GetHistogramSnapshots(JSContext *cx, jsval *ret)
> > {
> >   }
> >+
> >+  // Add informaton about slow SQL statements to the snapshot
> >+
> >+  JSObject *hitCountObj = JS_NewObject(cx, NULL, NULL, NULL);
> >+  JSObject *statementsObj = JS_NewObject(cx, NULL, NULL, NULL);
> 
> This used to be a leak, not sure if we have stack scanning yet in js.

We do. This is fine.
You may want to consider adding a normalization step to scrub "IN (...)" clauses or to let specific connections opt out of this.  Namely, Thunderbird's global database dynamically creates statements because (unless this has changed), you can't do something like "IN ?", so it has to create one-off statements to use in clauses.  Since Thunderbird generates telemetry data, Thunderbird may spam the heck out of your database, break your UI, etc.
Separates slow query reporting by main thread & other. Limits reporting to known DBs to prevent privacy leaks.
Attachment #573966 - Attachment is obsolete: true
Attachment #574666 - Flags: review?(tglek)
Now that you said privacy, if this sends the bound queries (with parameters replaced), it may likely send private data. Does it send bound or unbound statements?
> You may want to consider adding a normalization step to scrub "IN (...)"
> clauses or to let specific connections opt out of this.  Namely,
> Thunderbird's global database dynamically creates statements because (unless
> this has changed), you can't do something like "IN ?", so it has to create
> one-off statements to use in clauses.  Since Thunderbird generates telemetry
> data, Thunderbird may spam the heck out of your database, break your UI, etc.

I've chosen to implement a whitelist so that only the slow statements that are executed on known Firefox DBs are reported. Is this sufficient?

(In reply to Marco Bonardo [:mak] from comment #14)
> Now that you said privacy, if this sends the bound queries (with parameters
> replaced), it may likely send private data. Does it send bound or unbound
> statements?

It sends unbound statements, i.e. the strings used to generate the prepared statement objects (before any parameters get bound)
> (In reply to Marco Bonardo [:mak] from comment #14)
> It sends unbound statements, i.e. the strings used to generate the prepared
> statement objects (before any parameters get bound)

OK, nice!
(In reply to Vladan Djeric (:vladan) from comment #15)
> I've chosen to implement a whitelist so that only the slow statements that
> are executed on known Firefox DBs are reported. Is this sufficient?

That should be fine.

You may want to consider just altering the private stepStmt helper method to be able to get at the connection information which already  knows the database filename (and can pre-compute an enabled bit) rather than having to run a pragma to find out the filename and do a hash lookup.  Specifically, it seems like the helper could live on the Connection instance quite reasonably, as our async execution code already holds a reference to the connection.

Obviously, a long-running statement is already in the slow path, so it's not so much a performance concern as a desire to avoid establishing an idiom where we end up cramming more and more reverse-engineering of information into stepStmt because it has a simple signature.
Comment on attachment 574666 [details] [diff] [review]
Report slow prepared statements to Telemetry, v3

r- due to getDBFileName nastyness(as discussed)
Attachment #574666 - Flags: review?(tglek) → review-
Whiteboard: [Snappy:P1]
The getDBFileName nastiness is currently the only way to get the DB name from an sqlite3* handle. The SQLite authors will add a new API for this: http://www.sqlite.org/draft/c3ref/db_filename.html
Attachment #574666 - Attachment is obsolete: true
Attachment #575516 - Flags: review?(tglek)
(In reply to Vladan Djeric (:vladan) from comment #19)
> The getDBFileName nastiness is currently the only way to get the DB name
> from an sqlite3* handle. The SQLite authors will add a new API for this:
> http://www.sqlite.org/draft/c3ref/db_filename.html

But mozilla:storage:Connection already knows the filename; you don't need to ask SQLite:
http://mxr.mozilla.org/mozilla-central/source/storage/src/mozStorageConnection.h#158

and stepStmt is a method that can be modified.
Changed the stepStmt helper to use Connection* instead of reverse-engineering the sqlite3 prepared statement
Attachment #575516 - Attachment is obsolete: true
Attachment #575516 - Flags: review?(tglek)
Attachment #575599 - Flags: superreview?
Attachment #575599 - Flags: review?(tglek)
Attachment #575599 - Flags: review?(bugmail)
Comment on attachment 575599 [details] [diff] [review]
Report slow prepared statements to Telemetry, v5

Asuth's review should be sufficient here. I've reviewed the rest of the patch before
Attachment #575599 - Flags: review?(tglek)
Try run for daa30888a1f8 is complete.
Detailed breakdown of the results available here:
    https://tbpl.mozilla.org/?tree=Try&rev=daa30888a1f8
Results (out of 223 total builds):
    success: 202
    warnings: 17
    failure: 4
Builds available at http://ftp.mozilla.org/pub/mozilla.org/firefox/try-builds/vdjeric@mozilla.com-daa30888a1f8
Comment on attachment 575599 [details] [diff] [review]
Report slow prepared statements to Telemetry, v5

Review of attachment 575599 [details] [diff] [review]:
-----------------------------------------------------------------

Looks good.  The string life-cycle for the retrieved SQL works out fine.
Attachment #575599 - Flags: review?(bugmail) → review+
Comment on attachment 575599 [details] [diff] [review]
Report slow prepared statements to Telemetry, v5

Review of attachment 575599 [details] [diff] [review]:
-----------------------------------------------------------------

You don't need superreview for this.

r=sdwilsh with comments addressed.

::: storage/src/mozStoragePrivateHelpers.cpp
@@ +278,5 @@
>  
>  } // anonymous namespace
>  
>  int
> +stepStmt(Connection* aConnection, sqlite3_stmt* aStatement)

nit: http://hg.mozilla.org/mozilla-central/annotate/ac667309bea6/storage/style.txt#l38

@@ +286,2 @@
>  
> +  sqlite3* db = aConnection->GetNativeConnection();

We have a type operator for this; `GetNativeConnection` should be going away, not getting more usage: http://hg.mozilla.org/mozilla-central/annotate/ac667309bea6/storage/src/mozStorageConnection.h#l117

::: storage/src/mozStoragePrivateHelpers.h
@@ +47,5 @@
>  
>  #include "sqlite3.h"
>  #include "nsIVariant.h"
>  #include "mozStorage.h"
> +#include "mozStorageConnection.h"

Don't include the header here; forward declare the class.

@@ +124,5 @@
>   * @param aStatement
>   *        A pointer to a sqlite3_stmt object.
>   * @return the result from sqlite3_step.
>   */
> +int stepStmt(Connection *aConnection, sqlite3_stmt *aStatement);

Hrm.  This used to not depend on `Connection` so we could use it in places where we did not have a connection.  I'm not particularly happy about that since it makes it harder to use this with the planned asynchronous connection, but it's fine as a temporary solution until the SQLite changes are landed upstream.  Can you please file a bug to revert this and make it dependent on the upgrade bug for the version of SQLite that we'll need to get that API?
Attachment #575599 - Flags: superreview? → review+
(In reply to Shawn Wilsher :sdwilsh from comment #25)
> Hrm.  This used to not depend on `Connection` so we could use it in places
> where we did not have a connection.  I'm not particularly happy about that
> since it makes it harder to use this with the planned asynchronous
> connection, but it's fine as a temporary solution until the SQLite changes
> are landed upstream.  Can you please file a bug to revert this and make it
> dependent on the upgrade bug for the version of SQLite that we'll need to
> get that API?

Where would we have code that lacks a connection?  I did a brief audit of the call-sites, and they all have a reference hanging around, and even hypothetical call-sites are going to need a reference-count chain to keep their connection alive.

Assuming the new connection type lacks any type overlap with the current connection type, I believe a function template could easily "duck-type" the Connection class for its purposes.


My rationale for suggesting that we pass in the connection (or move the helper to the Connection) is that it seems like our telemetry needs are only going to get more sophisticated, and trying to reverse engineer the required meta-data in stepStmt is just going to limit our ability to do that or require ugliness.  For example, Places also exposes an API that extensions can use.  If Places was able to mark database connections as servicing extension calls, that might be a useful data point to eliminate noise from extensions for our own analysis, and help extension authors if we can break it out for them.
Comment on attachment 575599 [details] [diff] [review]
Report slow prepared statements to Telemetry, v5

Review of attachment 575599 [details] [diff] [review]:
-----------------------------------------------------------------

Would it we worth it to also report the actual time it took since we have it?

While this is a nice improvement, it will generate a bunch of SQL queries with no way to prioritize work, a query taking 2 seconds should be worked on before a query taking 101ms, but bot hwill be reported the same.
A side question regarding the impact of Telemetry on performances, did anyone measure how expensive is a Now() call in our various platforms?
I'm just asking this since in Places time ago was added code to limit the number of times it was invoked in hot code paths, since on Windows (iirc) it was expensive due to microseconds correction.
With telemetry we are adding a lot of calls like this one, when really we just need to measure an interval and not the real time, there may be cheaper alternatives to doing Now() differences for microseconds-precise intervals.
Regarding this privacy-review-needed flag, is there a list of the prepared statements or queries that will be collected?  We should document a sample result from this probe, and if possible the set of queries that will be measured.
Looks like the only thing interesting in this data (aside from gaining perf insight) is identifying which features get used: clear recent history, bookmarks, etc.  Looks like this falls well in scope via "feature usage" so not an unreasonable privacy risk.
(In reply to Sid Stamm [:geekboy] from comment #29)
> Regarding this privacy-review-needed flag, is there a list of the prepared
> statements or queries that will be collected?  We should document a sample
> result from this probe, and if possible the set of queries that will be
> measured.

There isn't a list of prepared SQL statements that will be collected. If needed, I could try grepping for SQL keywords to find all of the existing SQL strings in the codebase. Obviously, any prepared statements added after the launch of this feature would also be reported. Also, as Andrew pointed out in a comment above, extensions executing SQL on Firefox's own databases would also get reported.

I attached a list of SQL statements prepared during a 30-minute browsing session to give an idea of the kind of data that might be collected. Note that these are not statements that required more than 100ms to execute, they are merely potential candidates for reporting.
(In reply to Vladan Djeric (:vladan) from comment #32)
> There isn't a list of prepared SQL statements that will be collected. If
> needed, I could try grepping for SQL keywords to find all of the existing
> SQL strings in the codebase. 

Nah, was just looking for an easy win (perhaps something like TelemetryHistograms.h), but it's not a problem if it doesn't exist.

> I attached a list of SQL statements prepared during a 30-minute browsing
> session to give an idea of the kind of data that might be collected. Note
> that these are not statements that required more than 100ms to execute, they
> are merely potential candidates for reporting.

Thanks for the context.  It's good to have example data handy.
(In reply to Andrew Sutherland (:asuth) from comment #26)
> Where would we have code that lacks a connection?  I did a brief audit of
> the call-sites, and they all have a reference hanging around, and even
> hypothetical call-sites are going to need a reference-count chain to keep
> their connection alive.
We wouldn't I guess.  Just feels weird to have an API to step a statement that also needs a connection I guess.

> My rationale for suggesting that we pass in the connection (or move the
> helper to the Connection) is that it seems like our telemetry needs are only
> going to get more sophisticated, and trying to reverse engineer the required
> meta-data in stepStmt is just going to limit our ability to do that or
> require ugliness.  For example, Places also exposes an API that extensions
> can use.  If Places was able to mark database connections as servicing
> extension calls, that might be a useful data point to eliminate noise from
> extensions for our own analysis, and help extension authors if we can break
> it out for them.
Moving the helper may be a better idea at this point since all call sites have access to a `Connection` anyway.
1) Move stepStmt into Connection
2) Track amount of time spent on slow SQL (per statement) to help prioritization
Attachment #575599 - Attachment is obsolete: true
Attachment #576342 - Flags: review?(sdwilsh)
Attachment #576342 - Flags: review?(bugmail)
Try run for dae375cd07e4 is complete.
Detailed breakdown of the results available here:
    https://tbpl.mozilla.org/?tree=Try&rev=dae375cd07e4
Results (out of 247 total builds):
    success: 223
    warnings: 22
    failure: 2
Builds available at http://ftp.mozilla.org/pub/mozilla.org/firefox/try-builds/vdjeric@mozilla.com-dae375cd07e4
Comment on attachment 576342 [details] [diff] [review]
Report slow prepared statements to Telemetry, v5

Review of attachment 576342 [details] [diff] [review]:
-----------------------------------------------------------------

I was leaving this for sdwilsh, but since I am in non-Thanksgiving-observing Canada (for now...) and he is not, r=asuth and he can chime in if he does not feel his requested changes were made.  (They look addressed or mooted to me.)
Attachment #576342 - Flags: review?(sdwilsh)
Attachment #576342 - Flags: review?(bugmail)
Attachment #576342 - Flags: review+
Comment on attachment 576342 [details] [diff] [review]
Report slow prepared statements to Telemetry, v5

Review of attachment 576342 [details] [diff] [review]:
-----------------------------------------------------------------

::: storage/src/mozStoragePrivateHelpers.cpp
@@ +273,5 @@
>    if (srv == SQLITE_OK)
>      notification.Wait();
>  
>    return srv;
>  }

AFAICT, we only use this method for `stepStmt`, so we should move it with the method (and it should be in an anonymous namespace).
> AFAICT, we only use this method for `stepStmt`, so we should move it with
> the method (and it should be in an anonymous namespace).

Are you referring to WaitForUnlockNotify? It's called by both stepStmt and prepareStmt (same for UnlockNotification::Wait)
(In reply to Vladan Djeric (:vladan) from comment #39)
> Are you referring to WaitForUnlockNotify? It's called by both stepStmt and
> prepareStmt (same for UnlockNotification::Wait)
My mxring failed.  Can you move both `prepareStmt` and `stepStmt` then along with the helpers.  Makes sense for both of those to be in the same place, so might as well pull it onto the `Connection` as well.
Applied sdwilsh's suggestion: Moved prepareStmt and its WaitForUnlockNotify helpers to mozStorageConnection
Attachment #576342 - Attachment is obsolete: true
Attachment #578047 - Flags: review?(sdwilsh)
Keywords: checkin-needed
Patch is now based off of latest version of mozilla-central tree (helps with checkin)
Attachment #578047 - Attachment is obsolete: true
Attachment #578047 - Flags: review?(sdwilsh)
Attachment #578328 - Flags: review?(sdwilsh)
Removing checkin-needed as this is apparently waiting for another review...
Keywords: checkin-needed
Comment on attachment 578328 [details] [diff] [review]
Report slow prepared statements to Telemetry, v6 (merged)

Review of attachment 578328 [details] [diff] [review]:
-----------------------------------------------------------------

r=sdwilsh

Sorry this took so long.

::: storage/src/mozStorageConnection.cpp
@@ +332,5 @@
> +{
> +  UnlockNotification notification;
> +  int srv = ::sqlite3_unlock_notify(aDatabase, UnlockNotifyCallback,
> +                                    &notification);
> +  NS_ASSERTION(srv == SQLITE_LOCKED || srv == SQLITE_OK, "Bad result!");

Switch this to a MOZ_ASSERT please

@@ +875,5 @@
> +  int srv;
> +  while ((srv = ::sqlite3_step(aStatement)) == SQLITE_LOCKED_SHAREDCACHE) {
> +    if (!checkedMainThread) {
> +      checkedMainThread = true;
> +      if (NS_IsMainThread()) {

nit: `::NS_IsMainThread()`

@@ +883,5 @@
> +    }
> +
> +    srv = WaitForUnlockNotify(mDBConn);
> +    if (srv != SQLITE_OK)
> +      break;

nit: please brace all ifs.

@@ +914,5 @@
> +  while((srv = ::sqlite3_prepare_v2(mDBConn, aSQL.get(), -1, _stmt, NULL)) ==
> +        SQLITE_LOCKED_SHAREDCACHE) {
> +    if (!checkedMainThread) {
> +      checkedMainThread = true;
> +      if (NS_IsMainThread()) {

nit: `::NS_IsMainThread()`

@@ +922,5 @@
> +    }
> +
> +    srv = WaitForUnlockNotify(mDBConn);
> +    if (srv != SQLITE_OK)
> +      break;

nit: please brace all ifs.

::: storage/src/mozStorageConnection.h
@@ +175,5 @@
> +   * @param aStatement
> +   *        A pointer to a sqlite3_stmt object.
> +   * @return the result from sqlite3_step.
> +   */
> +  int stepStatement(sqlite3_stmt* aStatement);

nit: pointer style is wrong here
Attachment #578328 - Flags: review?(sdwilsh) → review+
Applied sdwilsh's latest comments. Ready for check-in.
Attachment #578328 - Attachment is obsolete: true
Keywords: checkin-needed
Attachment #579093 - Flags: review+
https://hg.mozilla.org/mozilla-central/rev/fd47a1e23039
Status: NEW → RESOLVED
Closed: 13 years ago
Keywords: checkin-needed
Resolution: --- → FIXED
Target Milestone: --- → mozilla11
this checkin, fd47a1e23039, regresses the about:telemetry addon for me (linux-64). I just get a blank page now. Backing out this patch fixes it.

Is there an updated addon somewhere other than addons.mozilla.org that deals with this?
(In reply to Patrick McManus from comment #48)
> this checkin, fd47a1e23039, regresses the about:telemetry addon for me
> (linux-64). I just get a blank page now. Backing out this patch fixes it.
> 
> Is there an updated addon somewhere other than addons.mozilla.org that deals
> with this?

You can find an updated XPI in bug 708925. The new version should be up on AMO soon.
Given that this patch causes intermittent assertions (and potentially crashes) and intermittent deadlocks, I think this needs to come out of Aurora.
(In reply to Kyle Huey [:khuey] (khuey@mozilla.com) from comment #50)
> Given that this patch causes intermittent assertions (and potentially
> crashes) and intermittent deadlocks, I think this needs to come out of
> Aurora.

Agreed - let's back this out of FF11 and wait to get this right in FF12, at which point we can consider uplifting.
This is working fine on ff12, verified by looking at slow sql data at http://people.mozilla.org/~xstevens/telemetry/712354/
You need to log in before you can comment on or make changes to this bug.