Closed Bug 1101419 Opened 10 years ago Closed 9 years ago

Intermittent browser_alarms.js | application crashed [@ ntdll.dll + 0x19d00]

Categories

(Toolkit :: Storage, defect)

x86
Windows 8
defect
Not set
normal

Tracking

()

RESOLVED FIXED
mozilla38
Tracking Status
firefox35 --- unaffected
firefox36 --- wontfix
firefox37 --- fixed
firefox38 --- fixed
firefox-esr31 --- unaffected

People

(Reporter: cbook, Assigned: mak)

References

()

Details

(Keywords: crash, intermittent-failure)

Windows 8 64-bit mozilla-central pgo test mochitest-browser-chrome-2

https://treeherder.mozilla.org/ui/logviewer.html#?job_id=651981&repo=mozilla-central

23:15:19 WARNING - PROCESS-CRASH | chrome://mochitests/content/browser/hal/tests/browser_alarms.js | application crashed [@ ntdll.dll + 0x19d00]
23:15:19 INFO - Crash dump filename: c:\users\cltbld~1.t-w\appdata\local\temp\tmppaecqf.mozrunner\minidumps\77f451c5-cd53-4240-858f-55d863988949.dmp
23:15:19 INFO - Operating system: Windows NT
23:15:19 INFO - 6.2.9200
23:15:19 INFO - CPU: amd64
23:15:19 INFO - family 6 model 30 stepping 5
23:15:19 INFO - 8 CPUs
23:15:19 INFO - Crash reason: EXCEPTION_ACCESS_VIOLATION_READ
23:15:19 INFO - Crash address: 0xffffffffffffffff
23:15:19 INFO - Thread 0 (crashed)
23:15:19 INFO - 0 ntdll.dll + 0x19d00
23:15:19 INFO - rbx = 0x00000089c70f3760 r12 = 0x0000000000000000
23:15:19 INFO - r13 = 0x0000000000000000 r14 = 0x000007f73a42e000
23:15:19 INFO - r15 = 0x0000000000000000 rip = 0x000007fb474c9d00
23:15:19 INFO - rsp = 0x00000089bf696280 rbp = 0x0000000000000000
23:15:19 INFO - Found by: given as instruction pointer in context
23:15:19 INFO - 1 nss3.dll!sqlite3Prepare [sqlite3.c:64e7a6391916 : 103824 + 0x9d]
23:15:19 INFO - rip = 0x000007fb2e7c266a rsp = 0x00000089bf6962a0
23:15:19 INFO - Found by: stack scanning
23:15:19 INFO - 2 0x2
23:15:19 INFO - rip = 0x0000000000000003 rsp = 0x00000089bf696320
23:15:19 INFO - Found by: call frame info
23:15:19 INFO - 3 ntdll.dll + 0x103b3
23:15:19 INFO - rip = 0x000007fb474c03b4 rsp = 0x00000089bf696350
23:15:19 INFO - Found by: stack scanning
23:15:19 INFO - 4 nss3.dll!sqlite3_exec [sqlite3.c:64e7a6391916 : 99342 + 0xe]
23:15:19 INFO - rip = 0x000007fb2e7c1367 rsp = 0x00000089bf696390
23:15:19 INFO - Found by: stack scanning
23:15:19 INFO - 5 xul.dll!mozilla::storage::Connection::executeSql(sqlite3 *,char const *) [mozStorageConnection.cpp:64e7a6391916 : 1069 + 0x17]
23:15:19 INFO - rip = 0x000007fb2a468854 rsp = 0x00000089bf696440
23:15:19 INFO - Found by: call frame info
23:15:19 INFO - 6 xul.dll!mozilla::storage::Connection::ExecuteSimpleSQL(nsACString_internal const &) [mozStorageConnection.cpp:64e7a6391916 : 1423 + 0x1a]
23:15:19 INFO - rip = 0x000007fb2a467cd7 rsp = 0x00000089bf6965c0
23:15:19 INFO - Found by: call frame info
23:15:19 INFO - 7 xul.dll!mozilla::storage::Service::minimizeMemory() [mozStorageService.cpp:64e7a6391916 : 361 + 0x11]
23:15:19 INFO - rip = 0x000007fb2aec93b5 rsp = 0x00000089bf
Didn't we recently update SQLite?
Yep, timing matches perfectly.
Component: General → Storage
Flags: needinfo?(bent.mozilla)
Product: Core → Toolkit
this comes from minimizeMemory, could be related to bug 998330 and thus be a thread-safety problem. Not sure why the update made it worse though.
All of the crashes seems to be explainable as heap corruption.  In a few cases, the crash actually happens inside of jemalloc.c.  In others, it happens following a call to sqlite3_db_release_memory() (invoked by "PRAGMA shrink_memory") which does a lot of free()-ing.  It is unclear (to the SQLite developers) if SQLite is responsible for this corruption, or if SQLite is just stumbling across corruption that originates elsewhere.

The changes in SQLite 3.8.7.2 are designed to give a lot more thread concurrency to the IndexedDB implementation.  Perhaps the problem is not in 3.8.7.2 itself but rather the extra concurrency is triggering a preexisting problem, either in SQLite or elsewhere.
Is it possible to run some of the automated tests that are failing with SQLITE_DEBUG (which I believe is enabled when MOZ_DEBUG is turned on)?  That might give additional clues.  Even better would be if both SQLITE_DEBUG and SQLITE_MEMDEBUG could be turned on.
Service::minimizeMemory is just broken here I think.

It's accessing mDBConn without holding any locks, and another thread could be using it or closing it or anything.

Service::CollectReports mentions that getting memory stats requires holding the connection lock. Maybe minimizing memory should try that too?
Flags: needinfo?(bent.mozilla)
minimizeMemory is doing its work on the wrong thread (see comment 36).
bug 998330 is still open cause DOM Storage uses raw threads that don't get spinned at shutdown and thus we end up leaking the minimizeMemory runnable (bug 1010717 is filed to fix that, but so far I didn't find the time to work on it).
fwiw, minimizeMemory just executes a pragma, it doesn't use any memory APIs, ideally thread serialization would take care of safety, if we would do stuff on the right thread.
When SQLite is compiled with SQLITE_THREADSAFE=1 and assuming you have not otherwise taken SQLite out of "serialized" mode (see https://www.sqlite.org/threadsafe.html) then it should be perfectly safe from SQLite's perspective to invoke "PRAGMA shrink_memory" from any thread using any SQLite database conection regardless of what other threads are doing.  *Should* be.  But that is only speaking about the SQLite side of things - I don't know what threading constraints might be on application-level objects.
(In reply to D. Richard Hipp from comment #41)
> I don't know what threading constraints might be on
> application-level objects.

In this case I can't see anything preventing one thread from calling sqlite3_close() at the same time as another thread calls sqlite3_exec. I'm betting that this is not a case SQLite can protect against internally.

But as mak said we really shouldn't be doing this.
(In reply to ben turner [:bent] (use the needinfo? flag!) from comment #44)
> In this case I can't see anything preventing one thread from calling
> sqlite3_close() at the same time as another thread calls sqlite3_exec. I'm
> betting that this is not a case SQLite can protect against internally.
> 

Actually SQLite does guard against this, least with high probability.  

If the sqlite3_exec() call runs first, there is a mutex that prevents the sqlite3_close() from running until after the sqlite3_exec() completes.

If the sqlite3_close() runs first, it clears the 32-bit "magic" field in the database connection object which indicates to all other APIs that the database connection is no longer valid.  The heap memory used to implement the database connection might get reused for something else, and if you are unlucky, the "magic" field might get set back to the correct magic value that indicates a live database connection, but that is unlikely.  There is also a race condition that can cause this misuse detection to fail.  But if the misuse where happening, then it would usually be detected, and SQLite generates an error log message (https://www.sqlite.org/errlog.html) every time something untoward like this occurs, so you should be seeing lots of problems in the error logs.
A use-after-free bug has been found and fixed in SQLite.  The bug was present in all versions of SQLite between 3.8.2 and 3.8.7.2.  The bug comes up in cases where the application has set shared-cache mode and is holding one database connection open while at the same time closing and reopening another database connection on the same database.  The bug only appears if the underlying memory allocator assigns the same pointer to a newly created database connection that was used by another database connection previously closed, so hitting it requires a measure of bad luck.

The signature of the bug is very different from what we are seeing in FF in this report.  Nevertheless, I think FF does use shared-cache mode and does close and reopen database connections with some regularity.  So it seems worth-while to upgrade to SQLite 3.8.7.3, now on the SQLite website (https://www.sqlite.org/download.html).  Changes from 3.8.7.2 are minimal.
There is a development snapshot of SQLite 3.8.8 alpha at https://www.sqlite.org/snapshot/sqlite-amalgamation-201412120127.zip that contains changes that *might* address this crash.  Or it might not.  It's hard to say since we have been unable to reproduce the problem.
(In reply to D. Richard Hipp from comment #195)
> There is a development snapshot of SQLite 3.8.8 alpha at
> https://www.sqlite.org/snapshot/sqlite-amalgamation-201412120127.zip that
> contains changes that *might* address this crash.  Or it might not.  It's
> hard to say since we have been unable to reproduce the problem.

Doesn't appear to have helped.
https://treeherder.mozilla.org/ui/#/jobs?repo=try&revision=c16734096c14
bug 998330 just landed, let's see how this evolves.
(In reply to Marco Bonardo [::mak] (needinfo? me) from comment #407)
> bug 998330 just landed, let's see how this evolves.

Looks like we're in the clear. Can we please get some uplift requests? :)
Assignee: nobody → mak77
Status: NEW → RESOLVED
Closed: 9 years ago
Flags: needinfo?(mak77)
Resolution: --- → FIXED
Target Milestone: --- → mozilla38
(In reply to Ryan VanderMeulen [:RyanVM UTC-5] from comment #409)
> Looks like we're in the clear. Can we please get some uplift requests? :)

I'm not sure the DOM Storage change respects the uplift rules, I suspect it should ride the train naturally to be sure it's not going to cause issues (it's still a feature that face the Web after all)
Flags: needinfo?(mak77)
I think Aurora could be doable as a stability fix. Maybe too risky for beta at this point. I won't lie, pretty frustrating that it missed two cycles :-\

And hopefully this doesn't burn users when 36 ships.
do we have evidence of the same crash signature on crash-stats? that would be a good selling point.
I see crashes on crash-stats involving sqlite3Prepare, but I'm way out of my element trying to correlate them to the specific signatures we see in this thread.
You need to log in before you can comment on or make changes to this bug.