Intermittent failures in B2G ICS Emulator Debug XPCshell because of IndexedDB NS_ERROR_STORAGE_BUSY

RESOLVED WORKSFORME

Status

()

RESOLVED WORKSFORME
3 years ago
2 years ago

People

(Reporter: gerard-majax, Unassigned)

Tracking

Firefox Tracking Flags

(Not tracked)

Details

Attachments

(1 attachment)

(Reporter)

Description

3 years ago
Created attachment 8646848 [details]
try_ubuntu64_vm-b2g-emulator-debug_test-xpcshell-1-bm122-tests1-linux64-build404.txt.gz

https://treeherder.mozilla.org/#/jobs?repo=try&revision=048244ad1a83

23:47:51     INFO -  TEST-PASS | b2g/components/test/unit/test_killswitch_gonk.js | test_doEnable - [test_doEnable : 86] should have a Promise - {} == true
23:47:51     INFO -  (xpcshell/head.js) | test run_next_test 5 finished (2)
23:47:51     INFO -  PROCESS | b2g/components/test/unit/test_killswitch_gonk.js | [1469] WARNING: Received NS_ERROR_STORAGE_BUSY when attempting to start write transaction, retrying for up to 10 seconds: file ../../../gecko/dom/indexedDB/ActorsParent.cpp, line 8915
23:47:51     INFO -  PROCESS | b2g/components/test/unit/test_killswitch_gonk.js | [1469] WARNING: 'NS_FAILED(rv)', file ../../../gecko/dom/indexedDB/ActorsParent.cpp, line 8932
23:47:51     INFO -  PROCESS | b2g/components/test/unit/test_killswitch_gonk.js | [1469] WARNING: 'NS_FAILED(rv)', file ../../../gecko/dom/indexedDB/ActorsParent.cpp, line 12771
23:47:51     INFO -  "CONSOLE_MESSAGE: (error) [JavaScript Error: "IndexedDB UnknownErr: ActorsParent.cpp:12791"]"
23:47:51     INFO -  PROCESS | b2g/components/test/unit/test_killswitch_gonk.js | [1469] WARNING: Converting non-IndexedDB error code (0x80630001) to NS_ERROR_DOM_INDEXEDDB_UNKNOWN_ERR: file ../../../gecko/dom/indexedDB/ActorsParent.cpp, line 570
23:47:51     INFO -  PROCESS | b2g/components/test/unit/test_killswitch_gonk.js | [1469] WARNING: Received NS_ERROR_STORAGE_BUSY when attempting to start write transaction, retrying for up to 10 seconds: file ../../../gecko/dom/indexedDB/ActorsParent.cpp, line 8915
23:47:51     INFO -  PROCESS | b2g/components/test/unit/test_killswitch_gonk.js | -*- SettingsRequestManager: getRequest FAILED lockscreen.locked
23:47:51     INFO -  PROCESS | b2g/components/test/unit/test_killswitch_gonk.js | -*- SettingsService: error:AbortError
23:47:51     INFO -  PROCESS | b2g/components/test/unit/test_killswitch_gonk.js | --*-- KillSwitchMain: getCallback: handleError: m=undefined
23:47:51     INFO -  PROCESS | b2g/components/test/unit/test_killswitch_gonk.js | --*-- KillSwitchMain: doEnable: undefined
23:47:51     INFO -  PROCESS | b2g/components/test/unit/test_killswitch_gonk.js | --*-- KillSwitchMain: Did reject(false);
23:47:51  WARNING -  TEST-UNEXPECTED-FAIL | b2g/components/test/unit/test_killswitch_gonk.js | test_doEnable - [test_doEnable : 95] should have succeeded - false == true
Flags: needinfo?(Jan.Varga)
(Reporter)

Updated

3 years ago
Blocks: 1181561
(Reporter)

Comment 1

3 years ago
Failure rate does decrease a lot if the mozSettings API does readonly/readwrite transactions instead of constantly doing readwrite:
https://treeherder.mozilla.org/#/jobs?repo=try&revision=304c5b7f114b
I generally thought BUSY errors were supposed to be impossible for IndexedDB after bug 1131766.

Maybe the database is getting opened multiple times somehow?  Looking briefly at SettingsRequestManager.ensureConnection() and IndexedDBHelper.ensureDB, although it seems the code only wants to use one database, neither of those methods are safe against racing calls to those methods.  Since the emulator's I/O tends to be horrible, I would expect that to open the race windows.

In any event, if you make the tests run with NSPR_LOG_MODULES=IndexedDB:3 (either via the environment or programatically), I think you should get logging that explicitly spells out what's going on here.
(Reporter)

Comment 3

3 years ago
Right thanks. I had a look at the code you mention and it's just relying on the IndexedDBHelper to establish the connection, so I'm not sure how much we are racy risky here since I would expect the helper to be safe (since it checks this._db). But I might be missing something. I'll send a try with the NSPR defined and we will see! :)
(Reporter)

Comment 4

3 years ago
That should be setting the env proprely:
https://treeherder.mozilla.org/#/jobs?repo=try&revision=06d3ab0c8434
(In reply to Alexandre LISSY :gerard-majax from comment #3)
> Right thanks. I had a look at the code you mention and it's just relying on
> the IndexedDBHelper to establish the connection, so I'm not sure how much we
> are racy risky here since I would expect the helper to be safe (since it
> checks this._db). But I might be missing something. I'll send a try with the
> NSPR defined and we will see! :)

It checks this._db before calling open(), yes, but open() only sets this._db during the IDBRequest.onsuccess callback.  Every call to ensureDB/open prior to the onsuccess callback/event firing will result in an additional call to open.  In the emulator, this time period can be lengthy because of the sucky I/O.  The idiom I like to use in these cases is to create, intialize, and save off a Promise like _dbPromise at the start of open().  Subsequent calls to open() would reuse the existing promise and just do a .then(aSuccessCb, aFailureCb) on the existing promise.

Note that it's conceivable that the Settings implementation's state machine protects against this with its queue stuff.  I looked at it briefly, but it was complex enough that it seemed most practical to see what the IndexedDB log entries say rather than potentially chase down an otherwise safe implementation.

Thanks for pushing the log-enabling.  I've retriggered the X1 jobs several times since the first run turned out to be green.
The retriggers got some nice failures.  The NSPR log output shows up under the emulator-NNNN.log artifact under the "Job details" tab.  The IndexedDB logging doesn't have a smoking gun.  What I get from http://mozilla-releng-blobs.s3.amazonaws.com/blobs/try/sha512/7fd54d3adf13b87ad554b495a4e6e5417f2f91306a51fb12993d2951b5f2d67486c85683d05fd24da30c451c0ddbf184565659ec6cb57359b21c7e3368da12c6 is:
- There is only one "open" of the settings database.
- All the accesses to the database seem to be correctly occurring on the same thread.
- Something is wrong with the logging's use of mBackgroundChildLoggingId for the strings  "Parent Transaction[0]: Beginning database work", because that either really should not be zero or that explains the problem.
- There are a ton of "Settings queue head blocked at BLAH for BLAH seconds" log entries prior to the database being opened.
- The open request is issued on line 10,054 which is well into the log after a lot of stuff happening, including all those settings queue warnings.
- There are a number of "NextVersionChangeTransactionSerialNumber doesn't match" errors at startup.  I mention this because :bent had previously requested assistance in reproducing this situation because of people complaining about these errors.
- I'm not sure I understand what's going on with the process hierarchy in the log.

Since something is causing those busy results, I pushed a revised version of the logging patch to try to turn on mozStorage in order to shed some light.  My wild guess right now would be the vacuuming logic is somehow doing something, mainly because that's the best explanation for how another database connection could be open.  Otherwise I'd guess two processes.

The try push treeherder is:
https://treeherder.mozilla.org/#/jobs?repo=try&revision=27d82b0df0f4
There's a good log at http://mozilla-releng-blobs.s3.amazonaws.com/blobs/try/sha512/ff67285830023e2c14756d5a86c0420bb4b749dba73ebf4596cb8cb83c63bd418da209f6fe0d4c78e4fc41c8935279d9948432f6bbd34877ca96765cbc6e27a0 that I think shows that IndexedDB is racing itself.

== Log Analysis
The meaty/interesting bits seem to be:

- Thread 69928 opens a connection to the database at 23:55:18.391 and closes it again at 23:55:18.680.  Very little happens during its short life.  PRAGMAs and are few SELECTs are run.  No transaction is created, no 

- Thread 69992 then opens a connection to the database at 23:55:18.850 (170 ms later, which is very soon on a slow emulator)

- Thread 69992 then tries to lock the database, but encounters our famous error.  Note that somewhat confusingly "database is locked" is the error string for SQLITE_BUSY.  The error for SQLITE_LOCKED is "database table is locked".
08-15 23:55:19.070 I/PRLog   ( 1463): 69992[4674f800]: Initialized statement 'BEGIN IMMEDIATE;' (0x4679b528)
08-15 23:55:19.100 I/PRLog   ( 1463): 69992[4674f800]: sqlite3_trace on 460e9040 for 'BEGIN IMMEDIATE;'
08-15 23:55:19.100 I/PRLog   ( 1463): 69992[4674f800]: Statement::ExecuteStep error: database is locked
08-15 23:55:19.110 I/PRLog   ( 1463): 69992[4674f800]: Resetting statement: 'BEGIN IMMEDIATE;'
08-15 23:55:19.110 I/Gecko   ( 1463): [1463] WARNING: Received NS_ERROR_STORAGE_BUSY when attempting to start write transaction, retrying for up to 10 seconds: file ../../../gecko/dom/indexedDB/ActorsParent.cpp, line 8910

- Thread 69992 then spins for the full seconds (in this case at least), until it gives up:
08-15 23:55:29.160 I/PRLog   ( 1463): 69992[4674f800]: sqlite3_trace on 460e9040 for 'BEGIN IMMEDIATE;'
08-15 23:55:29.160 I/PRLog   ( 1463): 69992[4674f800]: Statement::ExecuteStep error: database is locked
08-15 23:55:29.160 I/PRLog   ( 1463): 69992[4674f800]: Resetting statement: 'BEGIN IMMEDIATE;'
08-15 23:55:29.160 I/Gecko   ( 1463): [1463] WARNING: 'NS_FAILED(rv)', file ../../../gecko/dom/indexedDB/ActorsParent.cpp, line 8927
08-15 23:55:29.160 I/PRLog   ( 1463): 69992[4674f800]: Resetting statement: 'BEGIN IMMEDIATE;'
08-15 23:55:29.170 I/PRLog   ( 1463): 69992[4674f800]: Resetting statement: 'ROLLBACK;'
08-15 23:55:29.170 I/Gecko   ( 1463): [1463] WARNING: 'NS_FAILED(rv)', file ../../../gecko/dom/indexedDB/ActorsParent.cpp, line 12765
08-15 23:55:29.180 E/GeckoConsole( 1463): [JavaScript Error: "IndexedDB UnknownErr: ActorsParent.cpp:12785"]

- The SQLITE_BUSY spin happens again with a failure after 10 seconds again.

- Much later, we get to test test_settingsrequestmanager_messages.js which has a much happier time...
- Thread 69832 does another quick open/close of the settings db at 00:14:56.491.  Which is rapidly followed by...
- Thread 69912 opening the settings database at 00:14:56.760.
- Thread 69912 successfully issues a "BEGIN IMMEDIATE" call without hitting a SQLITE_BUSY at 00:14:56.871.

== Thoughts

- Unless there's pathologically bad thread scheduling issues or I/O contention going on, it seems unlikely that the rapid open/close of the database should interfere with the second open's attempt to start a write transaction.  Specifically, the "Closing connection to" mozStorage logging happens in Connection::internalClose, and sqlite3_close is invoked almost immediately afterwards.

- It seems likely there really is another process on the phone that has the file open somehow.  The xpcshell_raw.log that matches the other log is at c1072a570df4dcb3f8b4ca7967fb1f93fa181e9aac30a9109491489ed46f9efbb5bcaafb6ba917318c7e5470e83031b86c9d8176818b1a1e12bd06ea3b7af82f and it invokes "adb ... shell ps" a bit and there does seem to be a b2g running.  I don't actually understand how these xpcshell tests in the emulator work, but it seems concerning to me that b2g could possibly be running.  Given how various code loves to spam the settings API with reads and writes, it does seem possible that an active b2g instance could maintain exclusive control of the database for the whole time.

In any event, it seems like there isn't actually a problem with IndexedDB itself, but there is a serious problem with contention on that file.  I'd suggesting needinfo-ing someone who knows how these tests work to find out what else could possibly be alive and trying to use the same profile.
(In reply to Andrew Sutherland [:asuth] from comment #7)
> that I think shows that IndexedDB is racing itself.

Whoops.  I mooted this conclusion as I wrote the other stuff; ignore it.  Apologies for any confusion by those tl;dr-ing ;).  Also, this ends my period of busybody analysis for now.
(Reporter)

Comment 9

3 years ago
(In reply to Andrew Sutherland [:asuth] from comment #8)
> (In reply to Andrew Sutherland [:asuth] from comment #7)
> > that I think shows that IndexedDB is racing itself.
> 
> Whoops.  I mooted this conclusion as I wrote the other stuff; ignore it. 
> Apologies for any confusion by those tl;dr-ing ;).  Also, this ends my
> period of busybody analysis for now.

Does the rest of you analysis is still valid? I think that we have Gaia really running at the same time we have XPCShell tests running. That may account for the explanation maybe.
Yeah, the rest of the analysis stands.  The "IndexedDB is racing itself" theory got discarded very early as I continued my analysis, but I had forgotten I had put that up in the intro.
(Reporter)

Comment 11

3 years ago
So it could just be because we have Gaia starting (hence using mozSettings database files) while we are running the XPCShell tests ?

Now I may sound stupid, but shouldn't we have the database file being opened just once, since everything gets multiplexed in SettingsDB.jsm ?
(In reply to Alexandre LISSY :gerard-majax from comment #11)
> So it could just be because we have Gaia starting (hence using mozSettings
> database files) while we are running the XPCShell tests ?

Yes, although see below.

> Now I may sound stupid, but shouldn't we have the database file being opened
> just once, since everything gets multiplexed in SettingsDB.jsm ?

xpcshell is its own executable that gets run as its own process.  Normally, just like different Firefox instances using different profiles don't interact, xpcshell using its own profile directory shouldn't interact with anything else.  My concern is that one of two things is happening:
1) Due to a glitch, xpcshell and a running normal b2g instance are using the same profile directory, so there are multiple processes with multiple SettingsDB.jsm instances trying to open and manipulate the database at the same time.
2) Due to a glitch, a prior xpcshell run is still hanging around, so, again two copies of SettingsDB.jsm.

The first case where it's the b2g process interfering with us seems more likely to explain a lengthy period of lock contention since so much gaia code just spams the settings API.


The issue of the profile directory is handled by nsIDirectoryServiceProvider implementations.  Here is the xpcshell one, that likes to get its value from XPCSHELL_TEST_PROFILE_DIR:
https://dxr.mozilla.org/mozilla-central/source/testing/xpcshell/head.js#1115

It seems like it is probably reliably getting its profile directory of "/data/local/tests/xpcshell/p" via that environment variable.  *However*, we have this wackiness in b2g where bug 889602 apparently implemented a "ProfD" provider "#ifdef MOZ_GONK" as well, and this could potentially be fighting with xpcshell:
https://dxr.mozilla.org/mozilla-central/source/b2g/components/DirectoryProvider.js#52

It seems plausible that this directory provider is fighting with/clobbering the xpcshell provider.  Especially since the xpcshell automation is invoking its "cleardir" script on "/data/local/tests/xpcshell/p" between every xpcshell run.  If that's not where the files live, that's a great explanation.  (Although SQLite WAL does do shared memory stuff, it's mmap'ed, not using a shared sys-v namespace or anything, so if the file is really being cleared, especially by just unlinking the root dir, that should make it impossible for a locking conflict.  Unfortunately I'm having real trouble finding the "cleardir" script's source to validate what it does.) 

To this end I pushed a try build that removes that added ProfD block at https://treeherder.mozilla.org/#/jobs?repo=try&revision=a1478a0e05b5 but arguably I really should have instead just added logging of the full SQLite path or the ProfD somewhere.  An exercise for the reader, perhaps? :)

I am needinfoing :ahal because I'm really not sure who nominally owns the b2g xpcshell stuff.  And I'm needinfoing :dhylands because he r+'d bug 889602 and the patch author :onecyrenus no longer seems to be meaningfully active in bugzilla.  (And I'm hoping they might have a better idea if this could be the problem, and if not, ideas for what else could be, or ideas of who might know better.)
Flags: needinfo?(dhylands)
Flags: needinfo?(ahalberstadt)
It's been a long time since anyone's looked at (much less touched) this code, but mozrunner does some wonky things with the profile and settings db so that tests can run with customized settings.

It finds settings.db here:
https://dxr.mozilla.org/mozilla-central/source/testing/mozbase/mozrunner/mozrunner/application.py?offset=600#116

Then it deletes to force a reload of the settings here:
https://dxr.mozilla.org/mozilla-central/source/testing/mozbase/mozrunner/mozrunner/application.py?offset=2300#146

I can't remember the exact reason it needed this, but I remember it took me awhile to figure out (with some help from someone working on indexeddb). Note this code path isn't unique to xpcshell, mochitest, reftest and marionette all use it too.
Flags: needinfo?(ahalberstadt)
Ah, I found historical context. Tl;dr it was needed to get Gij tests working. For more info read bug 994847 comment 2 onwards.
Hm, so yeah, that could be a potentially non-trivial correctness problem on its own since that logic is doing nothing about the "-shm" and "-wal" or the "*ssegtnti.files.files" subdirectory.  All of those want to be nuked at the same time.  However, I think I would have expected to see something from the mozStorage logging about that.
My push reverting the b2g directory service didn't do anything, clearing :dhylands needinfo.  I did another push with my bug 531122 mozStorage logging change that should give us the full path of the database, helping make it clear what's going on with the profile.
Flags: needinfo?(dhylands)
The mozStorage logging changes from bug 531122 push is/was: https://treeherder.mozilla.org/#/jobs?repo=try&revision=6ca8952d6edc

And it appears my fear was correct.  Although we didn't get a test failure, we did get logs showing that the path to the settings DB is at "/data/local/storage/permanent/chrome/idb/2588645841ssegtnti.sqlite", which is where the real b2g keeps its stuff.

At which point I feel dumb as I look at https://dxr.mozilla.org/mozilla-central/source/b2g/components/DirectoryProvider.js?offset=500#62 and notice that there's a "indexedDBPDir" directory entry, and that xpcshell does not mess with this at all.  Note that there's also a constant NS_APP_INDEXEDDB_PARENT_DIR, which is what QuoteManager.cpp uses.  If it can't NS_GetSpecialDirectory it, it falls back to the ProfD.

So the problems here are thus:
1) b2g is running at the same time as xpcshell in the emulator tests.
2) b2g has a custom DirectoryProvider that explicitly provided "indexedDBPDir" with a hardcoded path of "/data/local" on MOZ_GONK platforms.  Which means potentially serious correctness problems for all b2g emulator tests involving IndexedDB, but also bigger problems (and this specific problem) because of the first point.

It is at this point that we run to fabrice to save us.

=== The log excerpt from http://mozilla-releng-blobs.s3.amazonaws.com/blobs/try/sha512/2abffc6371ae43b93462e2bbb948e6859b963557fcfc62c2e6080c3d4f26bceb86d7ffe22e9cc9061922cd05ec0b060563df5e4270576acb1ea1cd17ab2e7564
08-19 03:14:56.878 I/Gecko   ( 1460): {"action":"test_status","time":1439954096857,"thread":null,"pid":null,"source":"xpcshell/head.js","test":"b2g/components/test/unit/test_killswitch_gonk.js","subtest":"test_doEnable","status":"PASS","message":"[test_doEnable : 86] should have a Promise - {} == true"}
08-19 03:14:56.908 I/Gecko   ( 1460):
08-19 03:14:56.908 I/Gecko   ( 1460): {"action":"log","time":1439954096913,"thread":null,"pid":null,"source":"xpcshell/head.js","level":"INFO","message":"(xpcshell/head.js) | test run_next_test 5 finished (2)"}
08-19 03:14:57.028 I/PRLog   ( 1460): 1076352108[4043b080]: IndexedDB {1c59727b-0ef3-433b-8a98-fc8411fa87e2}: Child  Request[1]: indexedDB.open("settings", 5)
08-19 03:14:57.428 I/PRLog   ( 1460): 69928[4674f580]: Opening connection to '/data/local/storage/permanent/chrome/idb/2588645841ssegtnti.sqlite' (0x460e8fb0)
===
Flags: needinfo?(fabrice)
(In reply to Andrew Sutherland [:asuth] from comment #17)
> 
> It is at this point that we run to fabrice to save us.

lol.

I filed bug 1009730 a while ago to not use /data/local/storage in b2g anymore, but didn't fixed all the test issues.

We used to need that back in the (very) old days when we were preloading databases created by gaia's build system. Since we had no idea where the profile would be created, we used /data/local.
We stopped preloading this way because these db could not survive a factory reset, and instead we are shipping a settings.json in /system/b2g/defaults that we use at first run to create the settings db.

One way to see if bug 1009730 that would help there is to push a change of b2g's DirectoryProvider.js that doesn't set indexedDBPDir by removing it from https://dxr.mozilla.org/mozilla-central/source/b2g/components/DirectoryProvider.js#62
Flags: needinfo?(fabrice)

Updated

3 years ago
Flags: needinfo?(Jan.Varga)
we do not support b2g tests in our CI system anymore, also this is 2+ years old, I assume it hasn't been a problem.
Status: NEW → RESOLVED
Last Resolved: 2 years ago
Resolution: --- → WORKSFORME
You need to log in before you can comment on or make changes to this bug.