Closed Bug 959652 Opened 6 years ago Closed 6 years ago

BrowserHealthRecorder.ensureEnvironment can be expensive during startup

Categories

(Firefox for Android :: General, defect)

ARM
Android
defect
Not set

Tracking

()

RESOLVED FIXED
Firefox 30

People

(Reporter: mfinkle, Assigned: mcomella)

References

(Blocks 2 open bugs)

Details

(Keywords: perf)

Attachments

(1 file)

BrowserHealthRecorder.ensureEnvironment shows up in my GalaxyNexus startup profile, accounting for ~13% (~350ms).

I did not see it in the Nexus S startup profile though. I don't know offhand if the code takes different paths based on Android OS version.

profile: http://people.mozilla.org/~mfinkle/fennec/profiles/galaxynexus-home-startup.trace
To follow the trace further:

[350.922s] BrowserHealthRecorder.ensureEnvironment
 [350.892s] EnvironmentBuilder.registerCurrentEnvironment
  [339.447s] HealthReportDatabaseStorage$DatabaseEnvironment.register
   [318.359s] EnvironmentV1.getHash
    [309.051s] EnviromentV1.getAppender
     [309.051s] Class.newInstance

I'll take a look.
Assignee: nobody → michael.l.comella
Status: NEW → ASSIGNED
Oh. comment 1 continued:

[309.051s] Class.newInstance
 [308.991s] Class.newInstanceImpl
  [308.991s] EnvironmentV1$HashAppender.init
   [308.838s] MessageDigest.getInstance
This is unfortunate:

[308.838s] MessageDigest.getInstance
 [308.838s] fortress.Engine.getInstance
  [308.197s] fortress.Services.getCacheVersion
   [308.014s] forgtress.Services.initServiceInfo (called 6 times)
    [198.484s] String.toUpperCase (called 544 times) :(
Blocks: 959776
(In reply to Mark Finkle (:mfinkle) from comment #0)
> I did not see it in the Nexus S startup profile though. I don't know offhand
> if the code takes different paths based on Android OS version.

Skimming the code, it does not appear our code does but I'm curious whether the Android OS code/libraries has changed (see trace). I found the trace in [1]:

[29.971s] BHR.ensureEnvironment
 ...
  [1.074s] Fortress.Engine.getInstance
   (Notably, fortress.Services.getCacheVersion is missing...)

Finkle, what Android versions are you GNex and NexS? Also, are either of these traces taken on profiles that already contain data and if so, approximately how much (e.g. this is my primary profile, I just took these traces, etc.)?

[1]: https://people.mozilla.org/~mfinkle/fennec/profiles/nexuss-cnn-startup.trace
Flags: needinfo?(mark.finkle)
Nexus S has Android 2.3.7
Galaxy Nexus has Android 4.3

Both builds have small profiles. They are not my main phones. Just a week or two of pageloads and the standard bookmarks.
Flags: needinfo?(mark.finkle)
(In reply to Michael Comella (:mcomella) from comment #4)

> Skimming the code, it does not appear our code does but I'm curious whether
> the Android OS code/libraries has changed (see trace). I found the trace in
> [1]:
> 
> [29.971s] BHR.ensureEnvironment
>  ...
>   [1.074s] Fortress.Engine.getInstance
>    (Notably, fortress.Services.getCacheVersion is missing...)

It looks like the Engine.java and Services.java code will load the crypto providers statically (way before Firefox runs) in Gingerbread, but will force a reload of the providers in Jellybean.

Gingerbread:
http://androidxref.com/2.3.7/xref/libcore/luni/src/main/java/org/apache/harmony/security/fortress/Engine.java#82
http://androidxref.com/2.3.7/xref/libcore/luni/src/main/java/org/apache/harmony/security/fortress/Services.java#70

Jellybean:
http://androidxref.com/4.3_r2.1/xref/libcore/luni/src/main/java/org/apache/harmony/security/fortress/Engine.java#132
http://androidxref.com/4.3_r2.1/xref/libcore/luni/src/main/java/org/apache/harmony/security/fortress/Services.java#203
The dumb question: Do we need to hash this data?
(In reply to Mark Finkle (:mfinkle) from comment #7)
> The dumb question: Do we need to hash this data?

Strictly speaking, no, though for performance, we probaby want it.

Currently, the hash is used to uniquely identify a device environment. This helps with performance when recording the current environment where SQL will throw if we try to insert a non-unique hash value, telling us that environment already exists. SQLite does this performantly through an implicit index for unique values (i.e. the hash) [1]. If we remove the hash, we'd need to compare every data point (i.e. db column) of our environment (28 values?) in a table walk (we have no indices) in order to identify if the current environment already exists and we'd have to run this "exists?" query before every environment insertion. One instance of environment insertion occurs in ensureEnvironment, which will probably make our issue here worse (and it's likely called elsewhere).

Additionally, the hash is also used in the FHR document schema v3 [2], which may be implemented by others (though it wasn't yet when I checked several months ago) and thus it may be more resistant to change.

If we can figure out an alternative way of creating a short unique identifier without having to look it up in the DB, we could use that instead (though nothing quickly comes to mind).

...But whether or not we need to hash this data on startup? Still looking into that one. :)

Needinfo rnewman to double-check my conclusions.

[1]: http://stackoverflow.com/a/19358087
[2]: https://docs.services.mozilla.com/healthreport/index.html#version-3
Flags: needinfo?(rnewman)
History for this design: Bug 863440.

Hashing here serves two important purposes, and one less important but still valuable purpose:

* Collapsing down a large environment into a small identifier
* Ensuring that two identical environments on the same machine will result in the same identifier, even across crashes and data loss
* Ensuring that two identical environments on different machines will result in the same identifier.

Furthermore, yes, it's resistant to change. Tooling is built around this format.

(I could have sworn we already had a bug open for this…)
Flags: needinfo?(rnewman)
ckitching spotted this during his internship, btw. Chris, did you file a bug for this?

One thing we might want to do: use native crypto to do the hashing. We're shipping native crypto code anyway, and it'll totally avoid initializing the Java crypto providers.
(In reply to Richard Newman [:rnewman] from comment #10)
> ckitching spotted this during his internship, btw. Chris, did you file a bug
> for this?

Yup.
Bug 905296.

It's that one you needinfo'd me on approximately a thousand years ago and which I never got back to you about. I can't recall much of the context, and have no time to go into it at this point, so can't really give you a sensible answer to your question.
Duplicate of this bug: 905296
This is a great motivation for using native crypto for FxA, too.
Sudden thought.

Do you really need a *cryptographic* hash for this? Cryptographic hashes are many, many (read: thousands-of) times slower than non-cryptographic hashes (this is a desirable property of cryptographic hashes, actually)

Hashing sounds sensible for the reasons Richard gave in Comment 10 - but all of those things are satisfied by any hashing algorithm that provides collision resistance (such as xxHash - a very high performance non-cryptographic hash library for Java). Do you actually care about preimage resistance?

Seems like that might be an alternative solution to both this particular problem, with a shiny, general performance boost to boot. (Although you'd need to keep around extra logic to migrate stuff from one hashing scheme to the next if you did implement this. Still - you could initialise the expensive crypto library on demand in that case).
See Bug 905296 Comment 1.

Migration isn't much of a problem: we'd just need to re-hash each environment. Because FHR documents aren't persisted forever on the server, we can upload another with entirely different contents without fear.

That said, a change in scheme would have analytics consequences (not to mention dev costs), so it's not free.
In conclusion, I will attempt to write a patch using native crypto and benchmark the results. :)
See Bug 915312, or use NSS.
(In reply to Mark Finkle (:mfinkle) from comment #6)
> It looks like the Engine.java and Services.java code will load the crypto
> providers statically (way before Firefox runs) in Gingerbread, but will
> force a reload of the providers in Jellybean.

IIUC, this means that there isn't a regression here, but rather that Android has moved this initialization has moved from a place where we can't do anything about it to a place where we can probably do something to improve it. That's a good thing.

I am curious if avoiding the use of the crypto providers completely (so they would never be loaded on Jellybean) would significantly reduce memory usage. If so, perhaps that is worth pursuing.

If you want to use NSS, but you don't want to pay the full NSS initialization cost, you can try using the NSSLOWHASH API in nsslowhash.h.

Alternatively, there's mfbt/SHA1.[ch].
Blocks: 949054
Note: bug 967254 might mitigate this, but perhaps it's still worth pursuing for the issues in comment 18.
After discussions on IRC, it seems the consensus is to add SHA-1 to bug 915312 (native crypto) so I'll be taking a stab at moving that forward.
Depends on: 915312
Attached file Patch (github)
Part 1. Tests to come in part 2.

Additionally, I noticed we use MessageDigest in other places in the sync code that will require the same initialization (just not at startup), including sha1 routines - should I take care of that in this bug, or file a followup?

Traces were performed on Galaxy S 4 on a dirty profile (second run):
  * current impl: https://people.mozilla.org/~mcomella/trace/959652-01_old.trace
  * this patch: https://people.mozilla.org/~mcomella/trace/959652-02_new.trace

Current impl (times are CPU time):
 [1504.820] BHR.ensureEnvironment
  [1504.769] EnvironmentBuilder.registerCurrentEnvironment
   [1463.870] HRDatabaseStorage$DatabaseEnvironment.register
    [1400.879] EnvironmentV1.getHash()

New:
 [111.627] BHR.ensureEnvironment
  [111.591] EnvironmentBuilder.registerCurrentEnvironment
   [77.368] HRDatabaseStorage$DatabaseEnvironment.register
    [16.649] EnvironmentV1.getHash()

In HRDatabaseStorage$DatabaseEnvironment.register, SQLiteDatabase.insertOrThrow now takes up the most time (25.833), getHash is second most.

One last thought is that HashAppender's underlying implementation might be improved by keeping a LinkedList of String, and only creating the byte[] once we know the full size of the String, rather than having StringBuilder do this copying multiple times. However, this might neglect the CPU cache some more, so maybe not. I'll look into this a bit further when writing the tests.
Attachment #8387964 - Flags: review?(rnewman)
(In reply to Michael Comella (:mcomella) from comment #21)

> Additionally, I noticed we use MessageDigest in other places in the sync
> code that will require the same initialization (just not at startup),
> including sha1 routines - should I take care of that in this bug, or file a
> followup?

Follow-up.


> Current impl (times are CPU time):
>  [1504.820] BHR.ensureEnvironment
> 
> New:
>  [111.627] BHR.ensureEnvironment

Excellent!


> One last thought is that HashAppender's underlying implementation might be
> improved by keeping a LinkedList of String, and only creating the byte[]
> once we know the full size of the String, rather than having StringBuilder
> do this copying multiple times. However, this might neglect the CPU cache
> some more, so maybe not. I'll look into this a bit further when writing the
> tests.

Also potentially troublesome when considering UTF-8 encoding. You can't assume that string length = buffer length.
Attachment #8387964 - Flags: review?(rnewman) → review+
(In reply to Richard Newman [:rnewman] from comment #22)
> > Additionally, I noticed we use MessageDigest in other places in the sync
> > code that will require the same initialization (just not at startup),
> > including sha1 routines - should I take care of that in this bug, or file a
> > followup?
> 
> Follow-up.

bug 981872.
Blocks: 981872
Comment on attachment 8387964 [details] [review]
Patch (github)

Part 2 WIP for tests. Specifically, testSHA1Hashing throws an assertion that I can't figure out - see the comment on the PR.
Attachment #8387964 - Flags: feedback?(rnewman)
Comment on attachment 8387964 [details] [review]
Patch (github)

Part 1: r+
Part 2: r?

Improvements from comment 24:
  * Fixed the tests using what was discussed on IRC
  * Added a few more tests.

I did not research more performant implementations as mentioned in comment 21, nor do I think it's worth the time at this juncture.
Attachment #8387964 - Flags: review?(rnewman)
Attachment #8387964 - Flags: review+
Attachment #8387964 - Flags: feedback?(rnewman)
Your push did fine on the tests on its builds, all of which were, coincidentally, clobbers.

On the next push, https://tbpl.mozilla.org/?tree=Fx-Team&rev=bd0463063293, which did not get coincidental clobbers, everything failed.

nalexander told me that it was pure coincidence that clobbered builds did perfectly fine and unclobbered builds failed.

Backed out in https://hg.mozilla.org/integration/fx-team/rev/31f9a148b33c.
(In reply to Phil Ringnalda (:philor) from comment #28)
> Your push did fine on the tests on its builds, all of which were,
> coincidentally, clobbers.
> 
> On the next push, https://tbpl.mozilla.org/?tree=Fx-Team&rev=bd0463063293,
> which did not get coincidental clobbers, everything failed.
> 
> nalexander told me that it was pure coincidence that clobbered builds did
> perfectly fine and unclobbered builds failed.

Yeah, I have a hard time understanding how this could be clobber-related.  But everything else in my world right now appears to be, so why not weird native code + DB interactions?
So, this gets more interesting.  I downloaded the offending build, and running it locally, I find:

W dalvikvm(22386)             threadid=10: thread exiting with uncaught exception (group=0x4194e898)
E GeckoAppShell(22386)        >>> REPORTING UNCAUGHT EXCEPTION FROM THREAD 14630 ("GeckoBackgroundThread")
E GeckoAppShell(22386)        android.database.sqlite.SQLiteConstraintException: column url is not unique (code 19)
E GeckoAppShell(22386)        	at android.database.sqlite.SQLiteConnection.nativeExecuteForLastInsertedRowId(Native Method)
E GeckoAppShell(22386)        	at android.database.sqlite.SQLiteConnection.executeForLastInsertedRowId(SQLiteConnection.java:973)
E GeckoAppShell(22386)        	at android.database.sqlite.SQLiteSession.executeForLastInsertedRowId(SQLiteSession.java:788)
E GeckoAppShell(22386)        	at android.database.sqlite.SQLiteStatement.executeInsert(SQLiteStatement.java:86)
E GeckoAppShell(22386)        	at android.database.sqlite.SQLiteDatabase.insertWithOnConflict(SQLiteDatabase.java:1591)
E GeckoAppShell(22386)        	at android.database.sqlite.SQLiteDatabase.insertOrThrow(SQLiteDatabase.java:1487)
E GeckoAppShell(22386)        	at org.mozilla.gecko.db.BrowserDatabaseHelper.access$000$7aece89(BrowserDatabaseHelper.java:50)
E GeckoAppShell(22386)        	at org.mozilla.gecko.db.BrowserDatabaseHelper$2.run(BrowserDatabaseHelper.java:921)
E GeckoAppShell(22386)        	at android.os.Handler.handleCallback(Handler.java:730)
E GeckoAppShell(22386)        	at android.os.Handler.dispatchMessage(Handler.java:92)
E GeckoAppShell(22386)        	at android.os.Looper.loop(Looper.java:137)
E GeckoAppShell(22386)        	at org.mozilla.gecko.util.GeckoBackgroundThread.run(GeckoBackgroundThread.java:32)

I fingered that BrowserDatabaseHelper as the health recorder layer in Fennec, but now I see that's incorrect.  So this is even more mysterious!
Definitely nothing to do with mcomella's patch. That stack implies that we're inserting the default favicons in a DB that already has them, or the default favicons don't have unique URIs.
Lots of scary things in that log, including:

03-11 21:36:01.805 D/GeckoJarReader( 1868): No Entry for Firefox: About your browser

which implies that something other than a URL is being passed into GeckoJarReader.getStream.
(In reply to Richard Newman [:rnewman] from comment #31)
> Definitely nothing to do with mcomella's patch. That stack implies that
> we're inserting the default favicons in a DB that already has them, or the
> default favicons don't have unique URIs.

Something that comes to mind - the default favicons don't actually provide a value for the favicons.url field in the database (Indeed, in my FF-25 DB they have null for that field). When the ContentValues are passed to BrowserDatabaseHelper.insertFavicon, the value faviconUrl will always be null. I see that since I last looked at this file someone has added logic to detect this case and use Favicons.guessDefaultFaviconUrl to prevent the row ending up as null.

Might this potentially be quite problematic? I certainly wouldn't rely on guessDefaultFaviconUrl to be returning unique urls (Indeed, it *very much* is not going to under some conditions...)

On a less on-topic node, why is this method duplicated in BrowserProvider? I guess there's a long and complicated reason why this was done, but it seems potentially problematic... (Not least because, despite the comment, the bodies of these methods aren't identical...)
See also Bug 962025, which is perhaps related.

Duplication is something I had to deal with in Bug 947939. I blame Sola :P
Whatever it was, it was indeed not this bug but something in https://hg.mozilla.org/mozilla-central/pushloghtml?startID=26390&endID=26391 which did require a clobber, once that was merged to mozilla-inbound where it caused a multi-hour closure.
(In reply to Richard Newman [:rnewman] from comment #31)
> Definitely nothing to do with mcomella's patch. That stack implies that
> we're inserting the default favicons in a DB that already has them, or the
> default favicons don't have unique URIs.

Sorry for fingering the wrong thing, mcomella.
(In reply to Phil Ringnalda (:philor) from comment #35)
> Whatever it was, it was indeed not this bug but something in
> https://hg.mozilla.org/mozilla-central/pushloghtml?startID=26390&endID=26391
> which did require a clobber, once that was merged to mozilla-inbound where
> it caused a multi-hour closure.

I'm starting to suspect I did something bad to Android resource dependencies as part of Bug 978587.  My idle speculation is that liuche's strings only patch may have tickled something here.  I have open tickets with issues to address in this area, will try to get to it today.
(In reply to Richard Newman [:rnewman] from comment #32)
> Lots of scary things in that log, including:
> 
> 03-11 21:36:01.805 D/GeckoJarReader( 1868): No Entry for Firefox: About your
> browser
> 
> which implies that something other than a URL is being passed into
> GeckoJarReader.getStream.

I agree this looks frightening, but digging into the *unbelievable* code at http://mxr.mozilla.org/mozilla-central/source/mobile/android/base/db/BrowserDatabaseHelper.java#888 shows that this is a speculative load.  I'm digging in further.
Reland:

https://hg.mozilla.org/integration/fx-team/rev/21ef91fbc1ac

Let's hope the fixes in Bug 979388 (or fate) are sufficient to avoid the earlier issues!
https://hg.mozilla.org/mozilla-central/rev/21ef91fbc1ac
Status: ASSIGNED → RESOLVED
Closed: 6 years ago
Resolution: --- → FIXED
Target Milestone: --- → Firefox 30
You need to log in before you can comment on or make changes to this bug.