Closed Bug 1414737 Opened 7 years ago Closed 4 years ago

crash in mozilla::dom::quota::QuotaObject::DisableQuotaCheck

Categories

(Toolkit :: Storage, defect, P2)

defect

Tracking

()

RESOLVED FIXED
mozilla74
Tracking Status
firefox-esr60 --- wontfix
firefox-esr68 --- wontfix
firefox66 --- wontfix
firefox67 --- wontfix
firefox68 --- wontfix
firefox72 --- wontfix
firefox73 --- wontfix
firefox74 --- fixed

People

(Reporter: shawnjohnjr, Assigned: sg)

References

(Blocks 2 open bugs)

Details

Crash Data

Attachments

(2 files)

Copied from:
https://crash-stats.mozilla.com/report/index/83d18b0a-c1cf-49ea-a136-5b2920171106


0 	xul.dll 	mozilla::dom::quota::QuotaObject::DisableQuotaCheck() 	dom/quota/ActorsParent.cpp:2714
1 	xul.dll 	mozilla::dom::indexedDB::`anonymous namespace'::DatabaseConnection::DisableQuotaChecks 	dom/indexedDB/ActorsParent.cpp:11355
2 	xul.dll 	mozilla::dom::indexedDB::`anonymous namespace'::Database::StartTransactionOp::DoDatabaseWork 	dom/indexedDB/ActorsParent.cpp:14927
3 	xul.dll 	mozilla::dom::indexedDB::`anonymous namespace'::TransactionDatabaseOperationBase::RunOnConnectionThread 	dom/indexedDB/ActorsParent.cpp:23735
4 	xul.dll 	mozilla::dom::indexedDB::`anonymous namespace'::Database::StartTransactionOp::RunOnConnectionThread 	dom/indexedDB/ActorsParent.cpp:14914
5 	xul.dll 	mozilla::dom::indexedDB::`anonymous namespace'::TransactionDatabaseOperationBase::Run 	dom/indexedDB/ActorsParent.cpp:23906
6 	xul.dll 	nsThread::ProcessNextEvent(bool, bool*) 	xpcom/threads/nsThread.cpp:1216
7 	xul.dll 	NS_ProcessNextEvent(nsIThread*, bool) 	xpcom/glue/nsThreadUtils.cpp:361
Maybe the QuotaObject has been removed without being null'd? So that, firefox crash while accessing its member variables.
Ok. It's not a regression. Reports are showing the earliest reports since v48.
(In reply to Tom Tung [:tt] from comment #2)
> Maybe the QuotaObject has been removed without being null'd? So that,
> firefox crash while accessing its member variables.

https://hg.mozilla.org/mozilla-central/annotate/e64846245b00/dom/indexedDB/ActorsParent.cpp#l11193

But we do have check mQuotaObject there. So it's possible that before mQuotaObject->DisableQuotaCheck() called, mQuotaObject is cleared on different threads.
(In reply to Shawn Huang [:shawnjohnjr] from comment #4)
> (In reply to Tom Tung [:tt] from comment #2)
> > Maybe the QuotaObject has been removed without being null'd? So that,
> > firefox crash while accessing its member variables.
> 
> https://hg.mozilla.org/mozilla-central/annotate/e64846245b00/dom/indexedDB/
> ActorsParent.cpp#l11193
> 
> But we do have check mQuotaObject there. So it's possible that before
> mQuotaObject->DisableQuotaCheck() called, mQuotaObject is cleared on
> different threads.

Indeed. Besides, idb::DatabaseConnection holds it by a RefPtr [1]. It shouldn't become a dangling while DatabaseConnection is accessing it.

[1] http://searchfox.org/mozilla-central/source/dom/indexedDB/ActorsParent.cpp#5054
Priority: -- → P2
Crash Signature: [@ mozilla::dom::quota::QuotaObject::DisableQuotaCheck]
Assignee: nobody → shuang
Note: 
1. "cleanup" transaction is only accessible if "dom.indexeddb.experimental" is set to true (which is "false" by default).
2. "dom.indexeddb.experimental" was not changed according to the "userPrefs" from crash report.
At least one thing we can be sure that both mQuotaObject, mJournalQuotaObject are null.

https://searchfox.org/mozilla-central/rev/7a8c667bdd2a4a32746c9862356e199627c0896d/dom/indexedDB/ActorsParent.cpp#11196

The only thing I can say is that mStorageConnection->GetQuotaObjects() fail to get both mQuotaObject, mJournalQuotaObject. But return value is not NS_ERROR_FAILURE.
Maybe like Bug 1389561, when doing cleanup and origin is not initialized, then QuotaObject is null?
After reading through Bug 1182987, I have better understanding the problem. I'm working on test case now.
(In reply to Bevis Tseng[:bevis][:btseng] from comment #6)
> Note: 
> 1. "cleanup" transaction is only accessible if "dom.indexeddb.experimental"
> is set to true (which is "false" by default).
> 2. "dom.indexeddb.experimental" was not changed according to the "userPrefs"
> from crash report.

Note: For (2). userPrefs only are listed in crash report whilte list (telemetry environment).

https://searchfox.org/mozilla-central/rev/be78e6ea9b10b1f5b2b3b013f01d86e1062abb2b/toolkit/components/telemetry/TelemetryEnvironment.jsm#185

So this doesn't prove anything.
Bevis,
One thing I found that code might not call db.transaction with 'cleanup' option.
In https://searchfox.org/mozilla-central/rev/be78e6ea9b10b1f5b2b3b013f01d86e1062abb2b/dom/indexedDB/IDBDatabase.cpp#676.

It's possible that originally in IDBDatabase::Transaction(), aMode is IDBTransactionMode::Readwrite, which is very common usage and when mQuotaExceeded is true (user might have low disk easily) and then we create idb transaction with 'cleanup' mode.

Then we can explain why cleanup mode transaction is executed to disable quota check without setting idb experimental flag.
And that also explains why we can see crash on FennecAndroid, if Mobile browser users have no intention to set preferences.
We marked the database as "QuotaExceeded" and very next "readwrite" transaction would actually run as "cleanup" transaction. 
So add/put would be prohibited for that transaction, quota checks would be disabled and vacuuming would be done after commit.

But I still don't understand why we QuotaObject is not accessible. If somehow 'cleanup' transaction would disable quota check, and somehow we still hold DirectoryLock and it wouldn't be able to collect inactive origins, OriginInfo should not be cleared.
Current test_cleanup_transaction.js doesn't cover Comment 13 implementation, so I have to write tests for readwrite mode trasnfer to cleanup mode, I guess this is why I can't hit the problem using test_cleanup_transaction.js for a template.
Okay, test_quotaExceeded_recovery.js could test that execution path, but I still need to figure out how to make QuotaObject is not accessible to simulate the original crash.
Assignee: shawnjohnjr → nobody
Assignee: nobody → shawnjohnjr

the crash signature curiously started spiking mid-january on release during the 64 cycle, is it worth further looking into that? so far it appears that it's not correlated to particular urls...

Flags: needinfo?(overholt)

(In reply to [:philipp] from comment #16)

the crash signature curiously started spiking mid-january on release during the 64 cycle, is it worth further looking into that? so far it appears that it's not correlated to particular urls...

I'll redirect that question to Tom who's been working in this area recently.

Flags: needinfo?(overholt) → needinfo?(shes050117)

Frankly, I don't find any suspicious patch/bug in 64 make the crash increase dramatically.

Keep the ni flag and will look into it deeper.

i wouldn't expect it to be related to something in 64 in particular, it got released on december 11, whereas this spike happened over a month later.

Here is the further information how we can observe the problem on our project (well, it's the b2g project based on gecko 48).

  1. We have very low storage space device, and the device consumes storage space constantly (such as writing logs into disk).
  2. We have a mechanism to aggressively clean up temporary origin space when storage is under 50MB (via b2g disk space watcher not via QuotaManager, since QuotaManager won't actively update device space information, but only when Firefox is initialized).
  3. The crash happens when we place device in the idle situation. And we suspected that we constantly hit low storage and at the same time we cleanup storage.
  4. Simply adding nullptr-check guards for both mQuotaObject, mJournalQuotaObject is fine. And we did not see any other crash here after guards been added. At least from our point of view, it's safe to do and we did not see side effects.

We did not spend too much time here to investigate why mQuotaObject, mJournalQuotaObject are nullptr for StorageConnection due to limited manpower.

If you want to investigate further, you must try to make space very low and constantly write data into disk and at the same time try to clean up origin.

(In reply to [:philipp] from comment #19)

i wouldn't expect it to be related to something in 64 in particular, it got released on december 11, whereas this spike happened over a month later.

I think it's not 64 related. It depends on the low-disk-space pre-condition, at least we have seen this bug on 48.

I'm not sure Jan or Andrew would agree to add a trivial nullptr check for mQuotaObject, mJournalQuotaObject. So I guess Tom might want to investigate the root cause here?

(In reply to Shawn Huang [:shawnjohnjr][:shuang] (as a happy gecko contributor) from comment #20)

Here is the further information how we can observe the problem on our project (well, it's the b2g project based on gecko 48).

  1. We have very low storage space device, and the device consumes storage space constantly (such as writing logs into disk).
  2. We have a mechanism to aggressively clean up temporary origin space when storage is under 50MB (via b2g disk space watcher not via QuotaManager, since QuotaManager won't actively update device space information, but only when Firefox is initialized).
  3. The crash happens when we place device in the idle situation. And we suspected that we constantly hit low storage and at the same time we cleanup storage.
  4. Simply adding nullptr-check guards for both mQuotaObject, mJournalQuotaObject is fine. And we did not see any other crash here after guards been added. At least from our point of view, it's safe to do and we did not see side effects.

We did not spend too much time here to investigate why mQuotaObject, mJournalQuotaObject are nullptr for StorageConnection due to limited manpower.

If you want to investigate further, you must try to make space very low and constantly write data into disk and at the same time try to clean up origin.

In our project, here is what we found:

Here is the backtrace:

(gdb) bt
#0 0xb5850090 in mozilla::dom::quota::QuotaObject::DisableQuotaCheck (this=0x0) at /home/claire/KaiOS/gecko/dom/quota/ActorsParent.cpp:2627
#1 0xb58c479a in DisableQuotaChecks (this=0xa98a1cc0) at ../../../gecko/dom/indexedDB/ActorsParent.cpp:10520
#2 mozilla::dom::indexedDB::(anonymous namespace)::Database::StartTransactionOp::DoDatabaseWork (this=0xab974380, aConnection=0xa98a1cc0)
at ../../../gecko/dom/indexedDB/ActorsParent.cpp:14162
#3 0xb58c4322 in mozilla::dom::indexedDB::(anonymous namespace)::TransactionDatabaseOperationBase::RunOnConnectionThread (this=this@entry=0xab974380)
at ../../../gecko/dom/indexedDB/ActorsParent.cpp:22545
#4 0xb58c455a in mozilla::dom::indexedDB::(anonymous namespace)::Database::StartTransactionOp::RunOnConnectionThread (this=0xab974380)
at ../../../gecko/dom/indexedDB/ActorsParent.cpp:14149
#5 0xb58c2d22 in mozilla::dom::indexedDB::(anonymous namespace)::TransactionDatabaseOperationBase::Run (this=0xab974380)
at ../../../gecko/dom/indexedDB/ActorsParent.cpp:22632
#6 0xb5d057d6 in mozilla::tasktracer::TracedRunnable::Run (this=0xaba9d730) at /home/claire/KaiOS/gecko/tools/profiler/tasktracer/TracedTaskCommon.cpp:114
#7 0xb4c522ea in nsThread::ProcessNextEvent (this=0xac9fe840, aMayWait=<optimized out>, aResult=0xa680dc57) at /home/claire/KaiOS/gecko/xpcom/threads/nsThread.cpp:994
#8 0xb4c63c90 in NS_ProcessNextEvent (aThread=<optimized out>, aThread@entry=0xac9fe840, aMayWait=aMayWait@entry=true)
at /home/claire/KaiOS/gecko/xpcom/glue/nsThreadUtils.cpp:290
#9 0xb58bccae in mozilla::dom::indexedDB::(anonymous namespace)::ConnectionPool::ThreadRunnable::Run (this=0xac587e20) at ../../../gecko/dom/indexedDB/ActorsParent.cpp:12720
#10 0xb5d057d6 in mozilla::tasktracer::TracedRunnable::Run (this=0xaba9d700) at /home/claire/KaiOS/gecko/tools/profiler/tasktracer/TracedTaskCommon.cpp:114
#11 0xb4c522ea in nsThread::ProcessNextEvent (this=0xac9fe840, aMayWait=<optimized out>, aResult=0xa680dd0f) at /home/claire/KaiOS/gecko/xpcom/threads/nsThread.cpp:994
#12 0xb4c63c90 in NS_ProcessNextEvent (aThread=<optimized out>, aThread@entry=0xac9fe840, aMayWait=aMayWait@entry=false)
at /home/claire/KaiOS/gecko/xpcom/glue/nsThreadUtils.cpp:290
#13 0xb4e44648 in mozilla::ipc::MessagePumpForNonMainThreads::Run (this=0xaba9d370, aDelegate=0xb07ef900) at /home/claire/KaiOS/gecko/ipc/glue/MessagePump.cpp:340
#14 0xb4e34002 in MessageLoop::RunInternal (this=this@entry=0xb07ef900) at /home/claire/KaiOS/gecko/ipc/chromium/src/base/message_loop.cc:230
#15 0xb4e340b4 in RunHandler (this=0xb07ef900) at /home/claire/KaiOS/gecko/ipc/chromium/src/base/message_loop.cc:223
#16 MessageLoop::Run (this=this@entry=0xb07ef900) at /home/claire/KaiOS/gecko/ipc/chromium/src/base/message_loop.cc:203
#17 0xb4c53398 in nsThread::ThreadFunc (aArg=0xac9fe840) at /home/claire/KaiOS/gecko/xpcom/threads/nsThread.cpp:396
#18 0xb478abbc in _pt_root (arg=0xaca52c80) at ../../../../../gecko/nsprpub/pr/src/pthreads/ptthread.c:216
#19 0xb6e982b0 in ?? ()
Backtrace stopped: previous frame identical to this frame (corrupt stack?)
(gdb) frame 1
#1 0xb58c479a in DisableQuotaChecks (this=0xa98a1cc0) at ../../../gecko/dom/indexedDB/ActorsParent.cpp:10520
warning: Source file is more recent than executable.
10520 }
(gdb) p mQuotaObject
$1 = {mRawPtr = 0x0}
(gdb) p mJournalQuotaObject
$2 = {mRawPtr = 0x0}

mQuotaObject and mJournalQuotaObject are both null pointers.

Frame1:
#1 0xb58c479a in DisableQuotaChecks (this=0xa98a1cc0) at ../../../gecko/dom/indexedDB/ActorsParent.cpp:10520

(gdb) p ((mozilla::dom::indexedDB::FileManager*)mFileManager)->mOrigin
$21 = {<nsACString_internal> = {mData = 0xac14cc18 "1024+f+app://system.gaiamobile.org", mLength = 34, mFlags = 5}, <No data fields>}
(gdb) p ((mozilla::dom::indexedDB::FileManager*)mFileManager)->mDatabaseName
$22 = {<nsAString_internal> = {mData = 0xaccd4168 u"asyncStorage", mLength = 12, mFlags = 5}, <No data fields>}

Frame2:
#2 mozilla::dom::indexedDB::(anonymous namespace)::Database::StartTransactionOp::DoDatabaseWork (this=0xab974380, aConnection=0xa98a1cc0)
at ../../../gecko/dom/indexedDB/ActorsParent.cpp:14162

(gdb) p ((TransactionBase*)mTransaction).mMode
$16 = mozilla::dom::IDBTransaction::CLEANUP

Origin: "1024+f+app://system.gaiamobile.org"
Name: "asyncStorage"
Transaction Mode: CLEANUP

Tom, feel free to take this bug if you want to fix it properly. Since I'm not sure it's acceptable to introduce a trivial fix.

Thanks for the information! I believe low-disk condition is an important clue to reproduce the issue. I will look into this after implementing the bug 1521541.

Looking into this

Assignee: shawnjohnjr → shes050117
Status: NEW → ASSIGNED
Flags: needinfo?(shes050117)
Crash Signature: [@ mozilla::dom::quota::QuotaObject::DisableQuotaCheck] → [@ mozilla::dom::quota::QuotaObject::DisableQuotaCheck][@ mozilla::dom::indexedDB::(anonymous namespace)::Database::StartTransactionOp::DoDatabaseWork(mozilla::dom::indexedDB::(anonymous namespace)::DatabaseConnection*)]

I wonder if this is somehow an issue on re-escaping the char in the group string [1]. For instance, if I change the spec in [2] to spec = "http://foo'.com";. The test wouldn't stop once it's executed. The reason is that the quotaObject mismatching (create a groupInfo with "foo'.com" but trying to get the groupInfo object with "foo%27.com"). I suspect if we somehow run into the CLEANUP logic in the indexedDB, we would hit the crash.

[1] https://searchfox.org/mozilla-central/rev/f8b11433159cbc9cc80500b3e579d767473fa539/dom/quota/ActorsParent.cpp#3732-3735
[2] https://searchfox.org/mozilla-central/source/dom/indexedDB/test/unit/test_quotaExceeded_recovery.js

See Also: → 1516333

(In reply to Tom Tung [:tt, :ttung] from comment #27)

I wonder if this is somehow an issue on re-escaping the char in the group string [1].

I meant the group mismatching issue (could be introduced because of the group re-escaping)

Note that Bug 1576593 is fixed and I suppose this should help to reduce the issues here.

I suspect this could be caused by bug 1592404 as well and should be fixed by QuotaManager v3

Unassign myself because I'm not actively work on this and I believe QMv3 should fix most of issues for this bug.

Assignee: ttung → nobody
Status: ASSIGNED → NEW
See Also: → 1593365
Component: Storage: Quota Manager → Storage: IndexedDB
Flags: needinfo?(sgiesecke)

Based on crash report db9af6ba-607b-43b9-b7a8-c7f4f0200114, DatabaseConnection::mQuotaObject is nullptr here, meaning that GetQuotaObjectForFile returned a nullptr when called from Connection::GetQuotaObjects at https://searchfox.org/mozilla-central/rev/9e45d74b956be046e5021a746b0c8912f1c27318/storage/mozStorageConnection.cpp#2267 but Connection::GetQuotaObjects was still returning NS_OK.

(I am actually just repeating what https://bugzilla.mozilla.org/show_bug.cgi?id=1414737#c8 already said.)

Should this be checked in Connection::GetQuotaObjects and an error code returned instead?

Flags: needinfo?(sgiesecke) → needinfo?(jvarga)
Component: Storage: IndexedDB → Storage: Quota Manager

I wonder, how this is supposed to work:

already_AddRefed<QuotaObject> GetQuotaObjectForFile(sqlite3_file* pFile) {
  MOZ_ASSERT(pFile);

  telemetry_file* p = (telemetry_file*)pFile;
  RefPtr<QuotaObject> result = p->quotaObject;
  return result.forget();
}

assumes that pFile is of type telemetry_file. But in the calling function, we have:

  sqlite3_file* file;
  int srv = ::sqlite3_file_control(mDBConn, nullptr, SQLITE_FCNTL_FILE_POINTER,
                                   &file);
  if (srv != SQLITE_OK) {
    return convertResultCode(srv);
  }

  RefPtr<QuotaObject> databaseQuotaObject = GetQuotaObjectForFile(file);

Is there any magic callback inside sqlite3_file_control that allocates a telemetry_file object?

(In reply to Jens Stutte [:jstutte] from comment #33)

I wonder, how this is supposed to work:

telemetry_file is a pseudo-subclass of sqlite3_file: https://searchfox.org/mozilla-central/rev/9e45d74b956be046e5021a746b0c8912f1c27318/storage/TelemetryVFS.cpp#149

I wonder if this should also be changed to proper inheritance.

Yes, good old C-"Inheritance". And the initialization of TelemetryVFS accounts for the telemetry_file size, of course. So this explains why this works in the normal case (still it reads bad) :-).

Probably we see an operation on an already closed database (file) here (assuming that sqlite nulls its internal file pointer in that case) and should just return an error, rather than crash. Such that the caller can figure out, if he can still do something meaningful in this situation.

I don't think it's an already closed database. It's more like there's no quota object for the given database, but there should always be a quota object for an IDB database (except for persistent databases, but DisableQuotaChecks shouldn't be called for them).

Flags: needinfo?(jvarga)

Hmmm, I see:

int xClose(sqlite3_file* pFile) {
  telemetry_file* p = (telemetry_file*)pFile;
  int rc;
  {  // Scope for IOThreadAutoTimer
    IOThreadAutoTimer ioTimer(IOInterposeObserver::OpClose);
    rc = p->pReal->pMethods->xClose(p->pReal);
  }
  if (rc == SQLITE_OK) {
    delete p->base.pMethods;
    p->base.pMethods = nullptr;
    p->quotaObject = nullptr;
    delete[] p->location;
#ifdef DEBUG
    p->fileChunkSize = 0;
#endif
  }
  return rc;
}

where the quotaObject is explicitly set to null. Might not be the only case this happens, though.

But then Connection::GetQuotaObjects will return NS_ERROR_NOT_INITIALIZED because the connection is closed.

So the assumption/invariant is, that if we have a valid mDBConn we also must have a valid QuotaObject inside? This seems to be broken in this specific case (for whatever reason).

Well, we would definitely benefit from a recording of that rather than guessing from the code. Still we might want to consider to follow Shawn's suggestion https://bugzilla.mozilla.org/show_bug.cgi?id=1414737#c20 to check for a nullptr and fail more gently in case.

I'm ok with adding a check for nullptr if we also add something that won't hide this problem. I need more time to figure out how to do that or someone else can try.

Adding an NS_WARN_IF and an appropriate comment in the code and opening a new Bug here that covers analyzing the root cause seems to be sufficiently visible to me.

Another possibility would be to add telemetry so we can track the number of such failures.

New telemetry sounds good to me, while we are here, the telemetry should check that there are quota objects after opening the database (not just during the cleanup transaction).

(still in guess mode and obviously ok to not hide the cause)

I see in two reports I randomly clicked on parallel operations on IndexedDB on another thread: 8b0e7f1f-8d58-4e0f-828d-1f7f10200115 and 8b0e7f1f-8d58-4e0f-828d-1f7f10200115. We might see the effect of a race here?

Blocks: 1609640
Blocks: 1609641

To resolve the crashes, as part of this bug, the crashes should be changed into regular failures of Connection::GetQuotaObjects. Bugs 1609640 and 1609641 track adding telemetry and identifying/resolving the root cause of these failures.

Assignee: nobody → sgiesecke
Status: NEW → ASSIGNED
Component: Storage: Quota Manager → Storage
Product: Core → Toolkit
Pushed by sgiesecke@mozilla.com:
https://hg.mozilla.org/integration/autoland/rev/a90fb9414ef6
Ensure Connection::GetQuotaObjects only returns NS_OK if the quota objects are non-null. r=janv,dom-workers-and-storage-reviewers
Status: ASSIGNED → RESOLVED
Closed: 4 years ago
Resolution: --- → FIXED
Target Milestone: --- → mozilla74
You need to log in before you can comment on or make changes to this bug.

Attachment

General

Created:
Updated:
Size: