Open Bug 1682955 Opened 4 years ago Updated 3 days ago

Intermittent SUMMARY: ThreadSanitizer: data race /builds/worker/checkouts/gecko/storage/mozStorageConnection.h:239:42 in connectionReady

Categories

(Core :: SQLite and Embedded Database Bindings, defect, P5)

Tracking

REOPENED

People

(Reporter: intermittent-bug-filer, Unassigned)

References

(Depends on 1 open bug, Blocks 1 open bug)

Details

(Keywords: intermittent-failure, Whiteboard: [stockwell unknown])

Filed by: ncsoregi [at] mozilla.com
Parsed log: https://treeherder.mozilla.org/logviewer?job_id=324738801&repo=autoland
Full log: https://firefox-ci-tc.services.mozilla.com/api/queue/v1/task/Gq9V29__RniSdipJkvYsDg/runs/0/artifacts/public/logs/live_backing.log


[task 2020-12-16T19:37:15.285Z] 19:37:15     INFO - GECKO(2668) |   Thread T37 'IndexedDB #2' (tid=2728, running) created by thread T20 at:
[task 2020-12-16T19:37:15.285Z] 19:37:15     INFO - GECKO(2668) |     #0 pthread_create /builds/worker/fetches/llvm-project/llvm/projects/compiler-rt/lib/tsan/rtl/tsan_interceptors_posix.cpp:966:3 (firefox+0x5668b)
[task 2020-12-16T19:37:15.286Z] 19:37:15     INFO - GECKO(2668) |     #1 _PR_CreateThread /builds/worker/checkouts/gecko/nsprpub/pr/src/pthreads/ptthread.c:458:14 (libnspr4.so+0x3c203)
[task 2020-12-16T19:37:15.286Z] 19:37:15     INFO - GECKO(2668) |     #2 PR_CreateThread /builds/worker/checkouts/gecko/nsprpub/pr/src/pthreads/ptthread.c:533:12 (libnspr4.so+0x30b52)
[task 2020-12-16T19:37:15.286Z] 19:37:15     INFO - GECKO(2668) |     #3 nsThread::Init(nsTSubstring<char> const&) /builds/worker/checkouts/gecko/xpcom/threads/nsThread.cpp:658:8 (libxul.so+0xb85bbf)
[task 2020-12-16T19:37:15.287Z] 19:37:15     INFO - GECKO(2668) |     #4 nsThreadManager::NewNamedThread(nsTSubstring<char> const&, unsigned int, nsIThread**) /builds/worker/checkouts/gecko/xpcom/threads/nsThreadManager.cpp:641:12 (libxul.so+0xb8d0da)
[task 2020-12-16T19:37:15.287Z] 19:37:15     INFO - GECKO(2668) |     #5 NS_NewNamedThread(nsTSubstring<char> const&, nsIThread**, already_AddRefed<nsIRunnable>, unsigned int) /builds/worker/checkouts/gecko/xpcom/threads/nsThreadUtils.cpp:169:57 (libxul.so+0xb94507)
[task 2020-12-16T19:37:15.287Z] 19:37:15     INFO - GECKO(2668) |     #6 NS_NewNamedThread(nsTSubstring<char> const&, nsIThread**, nsIRunnable*, unsigned int) /builds/worker/checkouts/gecko/xpcom/threads/nsThreadUtils.cpp:160:10 (libxul.so+0xb8fa63)
[task 2020-12-16T19:37:15.288Z] 19:37:15     INFO - GECKO(2668) |     #7 mozilla::dom::indexedDB::(anonymous namespace)::ConnectionPool::ScheduleTransaction(mozilla::dom::indexedDB::(anonymous namespace)::ConnectionPool::TransactionInfo&, bool) /builds/worker/checkouts/gecko/dom/indexedDB/ActorsParent.cpp:8572:23 (libxul.so+0x44a0f14)
[task 2020-12-16T19:37:15.288Z] 19:37:15     INFO - GECKO(2668) |     #8 Start /builds/worker/checkouts/gecko/dom/indexedDB/ActorsParent.cpp:8270:15 (libxul.so+0x44a6367)
[task 2020-12-16T19:37:15.288Z] 19:37:15     INFO - GECKO(2668) |     #9 mozilla::dom::indexedDB::(anonymous namespace)::TransactionDatabaseOperationBase::StartOnConnectionPool(nsID const&, nsTSubstring<char> const&, long, nsTArray<nsTString<char16_t> > const&, bool) /builds/worker/checkouts/gecko/dom/indexedDB/ActorsParent.cpp:17797:27 (libxul.so+0x44a6367)
[task 2020-12-16T19:37:15.288Z] 19:37:15     INFO - GECKO(2668) |     #10 mozilla::dom::indexedDB::(anonymous namespace)::OpenDatabaseOp::DispatchToWorkThread() /builds/worker/checkouts/gecko/dom/indexedDB/ActorsParent.cpp:16915:45 (libxul.so+0x448fd28)
[task 2020-12-16T19:37:15.289Z] 19:37:15     INFO - GECKO(2668) |     #11 mozilla::dom::indexedDB::(anonymous namespace)::FactoryOp::Run() /builds/worker/checkouts/gecko/dom/indexedDB/ActorsParent.cpp:16159:7 (libxul.so+0x448a90c)
[task 2020-12-16T19:37:15.289Z] 19:37:15     INFO - GECKO(2668) |     #12 CallCallback /builds/worker/checkouts/gecko/dom/indexedDB/ActorsParent.cpp:9705:13 (libxul.so+0x449de91)
[task 2020-12-16T19:37:15.290Z] 19:37:15     INFO - GECKO(2668) |     #13 mozilla::dom::indexedDB::(anonymous namespace)::WaitForTransactionsHelper::MaybeWaitForFileHandles() /builds/worker/checkouts/gecko/dom/indexedDB/ActorsParent.cpp:9694:3 (libxul.so+0x449de91)
[task 2020-12-16T19:37:15.290Z] 19:37:15     INFO - GECKO(2668) |     #14 mozilla::dom::indexedDB::(anonymous namespace)::WaitForTransactionsHelper::Run() /builds/worker/fetches/clang/bin/../lib/gcc/x86_64-unknown-linux-gnu/7.4.0/../../../../include/c++/7.4.0/bits/atomic_base.h (libxul.so+0x449d7d7)
[task 2020-12-16T19:37:15.290Z] 19:37:15     INFO - GECKO(2668) |     #15 WaitForDatabasesToComplete /builds/worker/checkouts/gecko/dom/indexedDB/ActorsParent.cpp:8350:26 (libxul.so+0x449d940)
[task 2020-12-16T19:37:15.292Z] 19:37:15     INFO - GECKO(2668) |     #16 MaybeWaitForTransactions /builds/worker/checkouts/gecko/dom/indexedDB/ActorsParent.cpp:9671:21 (libxul.so+0x449d940)
[task 2020-12-16T19:37:15.292Z] 19:37:15     INFO - GECKO(2668) |     #17 mozilla::dom::indexedDB::(anonymous namespace)::WaitForTransactionsHelper::Run() /builds/worker/checkouts/gecko/dom/indexedDB/ActorsParent.cpp:9717:7 (libxul.so+0x449d940)
[task 2020-12-16T19:37:15.292Z] 19:37:15     INFO - GECKO(2668) |     #18 WaitForTransactions /builds/worker/checkouts/gecko/dom/indexedDB/ActorsParent.cpp:9660:19 (libxul.so+0x448f7b1)
[task 2020-12-16T19:37:15.292Z] 19:37:15     INFO - GECKO(2668) |     #19 WaitForTransactions /builds/worker/checkouts/gecko/dom/indexedDB/ActorsParent.cpp:15797:11 (libxul.so+0x448f7b1)
[task 2020-12-16T19:37:15.293Z] 19:37:15     INFO - GECKO(2668) |     #20 mozilla::dom::indexedDB::(anonymous namespace)::OpenDatabaseOp::BeginVersionChange() /builds/worker/checkouts/gecko/dom/indexedDB/ActorsParent.cpp:16853:5 (libxul.so+0x448f7b1)
[task 2020-12-16T19:37:15.293Z] 19:37:15     INFO - GECKO(2668) |     #21 mozilla::dom::indexedDB::(anonymous namespace)::FactoryOp::Run() /builds/worker/checkouts/gecko/dom/indexedDB/ActorsParent.cpp:16155:7 (libxul.so+0x448a819)
[task 2020-12-16T19:37:15.294Z] 19:37:15     INFO - GECKO(2668) |     #22 nsThread::ProcessNextEvent(bool, bool*) /builds/worker/checkouts/gecko/xpcom/threads/nsThread.cpp:1200:14 (libxul.so+0xb8821c)
[task 2020-12-16T19:37:15.295Z] 19:37:15     INFO - GECKO(2668) |     #23 NS_ProcessNextEvent(nsIThread*, bool) /builds/worker/checkouts/gecko/xpcom/threads/nsThreadUtils.cpp:548:10 (libxul.so+0xb8dcd2)
[task 2020-12-16T19:37:15.295Z] 19:37:15     INFO - GECKO(2668) |     #24 mozilla::ipc::MessagePumpForNonMainThreads::Run(base::MessagePump::Delegate*) /builds/worker/checkouts/gecko/ipc/glue/MessagePump.cpp:332:5 (libxul.so+0x14841e8)
[task 2020-12-16T19:37:15.295Z] 19:37:15     INFO - GECKO(2668) |     #25 RunInternal /builds/worker/checkouts/gecko/ipc/chromium/src/base/message_loop.cc:334:10 (libxul.so+0x140ec1c)
[task 2020-12-16T19:37:15.295Z] 19:37:15     INFO - GECKO(2668) |     #26 RunHandler /builds/worker/checkouts/gecko/ipc/chromium/src/base/message_loop.cc:327:3 (libxul.so+0x140ec1c)
[task 2020-12-16T19:37:15.295Z] 19:37:15     INFO - GECKO(2668) |     #27 MessageLoop::Run() /builds/worker/checkouts/gecko/ipc/chromium/src/base/message_loop.cc:309:3 (libxul.so+0x140ec1c)
[task 2020-12-16T19:37:15.296Z] 19:37:15     INFO - GECKO(2668) |     #28 nsThread::ThreadFunc(void*) /builds/worker/checkouts/gecko/xpcom/threads/nsThread.cpp:441:10 (libxul.so+0xb84670)
[task 2020-12-16T19:37:15.297Z] 19:37:15     INFO - GECKO(2668) |     #29 _pt_root /builds/worker/checkouts/gecko/nsprpub/pr/src/pthreads/ptthread.c:201:5 (libnspr4.so+0x4500b)
[task 2020-12-16T19:37:15.297Z] 19:37:15     INFO - GECKO(2668) | SUMMARY: ThreadSanitizer: data race /builds/worker/checkouts/gecko/storage/mozStorageConnection.h:239:42 in connectionReady
[task 2020-12-16T19:37:15.298Z] 19:37:15     INFO - GECKO(2668) | ==================
[task 2020-12-16T19:37:15.298Z] 19:37:15     INFO - GECKO(2668) | Crash Annotation GraphicsCriticalError: |[C0][GFX1-]: Receive IPC close with reason=AbnormalShutdown (t=2.9137) [GFX1-]: Receive IPC close with reason=AbnormalShutdown

This is part of the family of problems caused by the connection list and minimizeMemory; see https://bugzilla.mozilla.org/show_bug.cgi?id=1615725#c15 and the discussion there.

See Also: → 1615725
Status: NEW → RESOLVED
Closed: 4 years ago
Resolution: --- → INCOMPLETE
Status: RESOLVED → REOPENED
Resolution: INCOMPLETE → ---
Blocks: tsan

There have been 62 total failures in the last 7 days.
All the failures are on linux1804-64-tsan-qr platform, opt build type.

Recent failure log.

Marco, as the owner of this component, can you help us assign the bug to someone? Thank you.

Flags: needinfo?(mak)
Whiteboard: [stockwell needswork:owner]

(In reply to Cristina Horotan [:chorotan] from comment #69)

Marco, as the owner of this component, can you help us assign the bug to someone? Thank you.

Not easily. As explained in comment 1 (and the discussion in bug 1615725), this is not a trivial fix; it's more of a refactoring with multiple steps.

  1. We could easily remove the minimize memory usage, I guess. It saves a few MiBs in the best case, which is unlikely to make a difference. If we accept that compromise, it should be a trivial code removal, but it won't solve anything by itself. If we add a communication channel, this would be easy to keep too.
  2. We may have to partially undo Bug 703143, moving back to single-connection reporters. We'd want to report both the total memory and per-connection memory, to be able to report untracked memory (sqlite-other). Though for that we'd need a way in the memory reporter to say "this is already measured, and part of this other measurement"; otherwise we'd double-count connection memory. AFAIK there's no such thing. We may have connections communicate measurements to the service, though.
  3. The shutdown verification should also be refactored to use some kind of UUID per connection and track opening/closing of those (but we need a way for the connection to notify the service about close); a rough sketch follows just below this list.
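A rough sketch of the shape item 3 could take. The class and method names here (StorageServiceShutdownTracker, NoteConnectionOpened/NoteConnectionClosed) are hypothetical, not existing mozStorage code; the key point is that the service would hold only opaque UUIDs, never raw connection pointers:

```cpp
// Hypothetical sketch only: the service tracks per-connection UUIDs, and
// connections notify it on open/close over whatever channel ends up existing.
#include "nsHashKeys.h"    // nsIDHashKey
#include "nsID.h"
#include "nsTHashtable.h"

class StorageServiceShutdownTracker final {
 public:
  void NoteConnectionOpened(const nsID& aId) { mOpen.PutEntry(aId); }
  void NoteConnectionClosed(const nsID& aId) { mOpen.RemoveEntry(aId); }

  // DEBUG shutdown verification: no Connection pointers are shared with
  // the service, so there is no object state for threads to race on.
  bool AllConnectionsClosed() const { return mOpen.Count() == 0; }

 private:
  nsTHashtable<nsIDHashKey> mOpen;
};
```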

I wonder if we could use ipdl to keep a communication channel between the Service and the connections; at that point we'd not need a list, we'd just send/receive messages. It would be low-traffic messaging. Async messaging would work for minimize memory, and it may work for memory reporting (I assume there's already something in place to wait for async reporting); we may need sync IPC or to spin the event loop in debug builds to assert for unclosed connections.
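To illustrate the traffic pattern, a minimal sketch with plain runnables standing in for the IPDL actor pair (the function and the per-connection event-target bookkeeping are hypothetical): the service only dispatches messages to each connection's event target and never reads connection state from its own thread.

```cpp
#include "mozilla/RefPtr.h"
#include "nsISerialEventTarget.h"
#include "nsTArray.h"
#include "nsThreadUtils.h"  // NS_NewRunnableFunction

// Hypothetical: one event target per registered connection, owned and
// accessed only on the service's thread.
void MinimizeMemoryOverConnections(
    const nsTArray<RefPtr<nsISerialEventTarget>>& aConnectionTargets) {
  for (const auto& target : aConnectionTargets) {
    target->Dispatch(NS_NewRunnableFunction("storage::MinimizeMemory", [] {
      // Runs on the connection's owning thread, where touching the
      // sqlite3* is safe; the real work would be something like
      // executing "PRAGMA shrink_memory" there.
    }));
  }
}
```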

Andrew, wdyt of the possibility to use ipdl instead of the current list? The alternative of completely removing the list, without a communication channel, is either dropping features or precision.
We'd need someone with some expertise with that though, any suggestions?

Flags: needinfo?(mak) → needinfo?(bugmail)

(In reply to Marco Bonardo [:mak] from comment #72)

Andrew, wdyt of the possibility to use ipdl instead of the current list? The alternative of completely removing the list, without a communication channel, is either dropping features or precision.
We'd need someone with some expertise with that though, any suggestions?

The IPC approach definitely enables a cleaner decoupling for the Off Main Thread (OMT) storage connections, but I don't think it's worth the implementation complexity to mozStorage or to the OMT consumers (places, Quota Manager clients, rust code) given the limited benefits.

I think it could make sense to retain the connection list and its operation for storage connections created on the main thread (which would be async connections and maybe a weird places connection if it still exists). This eliminates multi-threading concerns and could allow elimination of the custom Release implementation in favor of just calling unregisterConnection from the destructor for main-thread connections.
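Something like this minimal sketch, assuming main-thread-only registration and a hypothetical mRegisteredWithService flag:

```cpp
// Sketch, not the current mozStorage code: with no off-main-thread readers
// of the connection list, the destructor can unregister directly and a
// plain assertion replaces the cross-thread bookkeeping in the custom
// Release() implementation.
Connection::~Connection() {
  if (mRegisteredWithService) {  // hypothetical flag set at registration
    MOZ_ASSERT(NS_IsMainThread(),
               "Main-thread-registered connection destroyed off main thread?");
    mStorageService->unregisterConnection(this);
  }
}
```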

  1. We could easily remove the minimize memory usage, I guess. It saves a few MiBs in the best case, which is unlikely to make a difference. If we accept that compromise, it should be a trivial code removal, but it won't solve anything by itself. If we add a communication channel, this would be easy to keep too.

I too don't think this is meaningfully doing much, especially since about:memory currently suggests that places is the only connection on my profile using a meaningful amount of cache. (Over [4] connections.)

  2. We may have to partially undo Bug 703143, moving back to single-connection reporters. We'd want to report both the total memory and per-connection memory, to be able to report untracked memory (sqlite-other). Though for that we'd need a way in the memory reporter to say "this is already measured, and part of this other measurement"; otherwise we'd double-count connection memory. AFAIK there's no such thing. We may have connections communicate measurements to the service, though.

I think the only real problem is the "other" metric which we derive by subtracting off all of the connection values from sqlite3_memory_used() (and which DMD cross-checks against its intercepted malloc tallies).

While I think the goal of about:memory continues to be noble and it fills an important niche, it's significantly less important for investigating and solving problems thanks to the amazing Firefox Profiler, which has cool memory capabilities. (As an aside, in my parent process, 50.82% of the reported memory is heap-unclassified, compared to the 2.36% attributed to explicit/storage.) I would definitely prefer to prioritize any implementation effort toward supporting about:processes, about:logging, and the Firefox Profiler, which both of those pages are able to trigger.

In order to minimize the reduction in fidelity:

  • Use sqlite3_memory_used() post sqlite3_init in our AutoSQLiteLifetime helper to capture the fixed memory costs of SQLite prior to us opening any connections. Save this off to report as its own explicit value like "explicit/storage/sqlite/config" in the existing reporter.
  • Also use sqlite3_memory_used() to report the aggregate SQLite memory usage but not under "explicit/" so it shows up under "Other Measurements" and doesn't need to sum perfectly. (Both of these bullets are sketched after this list.)
  • For main-thread connections we otherwise still use the existing reporter mechanism, as I propose above. Honestly, my greatest concern would be newly added databases via Sqlite.sys.mjs, and since this should still cover them, that seems good.
  • We leave it to OMT connections to be reported by their owners. This would let QuotaManager report memory usage on a per-principal basis which could ideally be lumped in with the in-flight IPC messages which would be the most interesting and time-varying data. (Although something that the Firefox profiler can already help with.)
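A minimal sketch of the first two bullets: the sqlite3_initialize()/sqlite3_memory_used() calls are the real SQLite API, while everything else (function and variable names, the reporter paths) is illustrative.

```cpp
#include "mozilla/Assertions.h"
#include "sqlite3.h"

static sqlite3_int64 sSqliteConfigBytes = 0;

// Hypothetical stand-in for the AutoSQLiteLifetime init point.
void InitSQLite() {
  MOZ_RELEASE_ASSERT(sqlite3_initialize() == SQLITE_OK);
  // Fixed SQLite overhead before any connection exists; reported as its
  // own explicit value, e.g. "explicit/storage/sqlite/config".
  sSqliteConfigBytes = sqlite3_memory_used();
}

sqlite3_int64 SqliteAggregateBytes() {
  // Reported outside "explicit/" (under "Other Measurements"), so it does
  // not need to sum exactly with the per-connection explicit values.
  return sqlite3_memory_used();
}
```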
  3. The shutdown verification should also be refactored to use some kind of UUID per connection and track opening/closing of those (but we need a way for the connection to notify the service about close).

It's not clear to me that the shutdown logic is really necessary at this point. There are two parts to this:

  1. Shutdown spinning.
  2. Verifying connections are closed.

Shutdown Spinning

I believe we added the shutdown spinning in bug 744294. I think the idea came from your https://bugzilla.mozilla.org/show_bug.cgi?id=744294#c2:

Since we now have a list of all the connections in Storage (used to track memory so far), wouldn't it be simpler to do this in Storage leveraging that list (i.e. on shutdown spin in one single point until all the connections in the list are properly closed) rather than in each single component?

I think we've now actually moved to each component handling its own shutdown, with most adopting use of the async shutdown service at this time (ex: SQLite.sys.mjs), or others still doing their own event loop spinning (QuotaManager uses a dedicated "profile-before-change-qm"). So this somewhat moots the original intent.

And if there is code depending on this xpcom-shutdown-threads logic... it probably shouldn't, and should explicitly move to using a shutdown blocker that waits for proper shutdown. Specifically, we have the fast shutdown mechanism controlled by the toolkit.shutdown.fastShutdownStage pref, which is currently mapped to CCPostLastCycleCollection. The next step (which admittedly has been stalled for quite some time) is xpcom-shutdown-threads, and that will terminate the process before firing the observer notification, which means anything depending on this spinning for correct/consistent behavior is going to have/cause problems.
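For reference, a rough sketch of what registering such a blocker from C++ could look like via the async shutdown service (the contract ID and interfaces are the existing async shutdown API; the helper function itself and its minimal error handling are illustrative):

```cpp
#include "nsCOMPtr.h"
#include "nsDebug.h"
#include "nsIAsyncShutdown.h"
#include "nsServiceManagerUtils.h"  // do_GetService
#include "nsString.h"

// Hypothetical helper: block profile-before-change until aBlocker's
// BlockShutdown() closes the connection and calls RemoveBlocker().
nsresult AddConnectionShutdownBlocker(nsIAsyncShutdownBlocker* aBlocker) {
  nsCOMPtr<nsIAsyncShutdownService> svc =
      do_GetService("@mozilla.org/async-shutdown-service;1");
  NS_ENSURE_TRUE(svc, NS_ERROR_UNEXPECTED);

  nsCOMPtr<nsIAsyncShutdownClient> barrier;
  nsresult rv = svc->GetProfileBeforeChange(getter_AddRefs(barrier));
  NS_ENSURE_SUCCESS(rv, rv);
  NS_ENSURE_TRUE(barrier, NS_ERROR_UNEXPECTED);

  return barrier->AddBlocker(aBlocker, NS_ConvertUTF8toUTF16(__FILE__),
                             __LINE__, u""_ns);
}
```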

Verifying connections are closed

It looks like we added the "make sure we closed all the connections" check in bug 711076, which was motivated by bug 673017, where AFAICT it turned out there was a failure to finalize some statements; that caused a failure in closing the connection, which then had some knock-on shutdown-related effects. So I think the net goal was to ensure that in our DEBUG builds we would catch logic errors that might cause problems for users in the wild because we would be unable to close our SQLite connections. But in bug 874814 we introduced logic to auto-finalize any non-finalized statements if we encounter a SQLITE_BUSY, so statement finalization behavior was no longer as much of a problem.

That would leave the logic as a hygiene check that we're actually shutting the connections down. But I think we would see one of two things happen if someone forgets to shut down an OMT connection:

  • Our leak checkers will report a leaked mozStorageConnection and friends.
  • Shutdown will hang because nsThreadManager shutdown failed to shut things down and there should be a backtrace identifying the thread that was being bad.

I say OMT connection because the reality is that an (idle) mozStorageConnection created via XPConnect would potentially be torn down as a result of GC. However, since callers would likely be using Sqlite.sys.mjs with its shutdown blocker, I'm not sure that's really a concern because that would happen prior to GC.

And also it would probably still be fine to retain the logic for main-thread connections if desired, although I think I've convinced myself above that we don't need this at all.

I'd definitely be interested if you think there are hygiene benefits to this logic that I'm missing. And if they exist, if you think requiring code to provide a proof-of-shutdown-blocker would be even better. I like approaches like that a lot because they make the people writing the code more immediately aware of the necessity of something rather than having bugs filed against infrastructure components like mozStorage if something appears and having to work backwards from there.

Flags: needinfo?(bugmail)

(In reply to Andrew Sutherland [:asuth] (he/him) from comment #74)

(In reply to Marco Bonardo [:mak] from comment #72)
The IPC approach definitely enables a cleaner decoupling for the Off Main Thread (OMT) storage connections, but I don't think it's worth the implementation complexity to mozStorage or to the OMT consumers (places, Quota Manager clients, rust code) given the limited benefits.

Yes, the benefits are a bit limited, I agree. The implementation complexity would all be contained in the storage service and connection anyway, as long as the connections are using its methods.

While I think the goal of about:memory continues to be noble and it fills an important niche, it's significantly less important for investigating and solving problems thanks to the amazing Firefox Profiler which has cool memory capabilities.

That's a fair point.

In order to minimize the reduction in fidelity

  • Use sqlite3_memory_used() post sqlite3_init in our AutoSQLiteLifetime helper to capture the fixed memory costs of SQLite prior to us opening any connections. Save this off to report as its own explicit value like "explicit/storage/sqlite/config" in the existing reporter.

That would be a static measurement though; we'd not know whether that memory was later released. A telemetry probe would be a better fit.

  • Also use sqlite3_memory_used() to report the aggregate SQLite memory usage but not under "explicit/" so it shows up under "Other Measurements" and doesn't need to sum perfectly.

I like this idea: report the whole memory as Other and single connections under explicit, and then one can subtract manually.

That's an interesting idea and a nice finding. Maybe we could keep it enabled only in certain channels, like Nightly and early beta.

  • For main-thread connections we otherwise still use the existing reporter mechanism, as I propose above. Honestly, my greatest concern would be newly added databases via Sqlite.sys.mjs, and since this should still cover them, that seems good.

I'd prefer not to differentiate the handling too much between different kinds of connections.

  • We leave it to OMT connections to be reported by their owners. This would let QuotaManager report memory usage on a per-principal basis which could ideally be lumped in with the in-flight IPC messages which would be the most interesting and time-varying data. (Although something that the Firefox profiler can already help with.)

I'd probably move the measurement as it is today to each connection as a first step; allowing the measurement to be customized could be a second step, also because we don't have a clear ETA on when that may happen.

Shutdown Spinning

I think we've now actually moved to each component handling its own shutdown, with most adopting use of the async shutdown service at this time (ex: SQLite.sys.mjs), or others still doing their own event loop spinning (QuotaManager uses a dedicated "profile-before-change-qm"). So this somewhat moots the original intent.

Yes, most consumers of async connections should be using blockers today.
There may be tests not doing that; those should be fixed, including cpp tests.

And if there is code depending on this xpcom-shutdown-threads logic... it probably shouldn't and should explicitly move to using a shutdown blocker that waits for proper shutdown?

Yes, that's part of the investigation about what will break.

Verifying connections are closed

I'd definitely be interested if you think there are hygiene benefits to this logic that I'm missing. And if they exist, if you think requiring code to provide a proof-of-shutdown-blocker would be even better. I like approaches like that a lot because they make the people writing the code more immediately aware of the necessity of something rather than having bugs filed against infrastructure components like mozStorage if something appears and having to work backwards from there.

My main concern is that closing the connection ensures a final merge to the database happens, leaving an empty journal; that avoids corruption if the user moves a .sqlite file around without its journal.
For connections created on the main thread, we could probably debug-assert in the connection itself that we don't receive "xpcom-shutdown-threads" (so we start observing it on open, and stop observing it on close); a sketch of that follows below.
For OMT connections maybe the proof of blocker is the way to go, if you have suggestions for an API, those would be welcome.
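A tiny sketch of that main-thread assertion, assuming the connection adds itself as an observer on open and removes itself on close ("xpcom-shutdown-threads" is the real notification topic; the Observe method shown is illustrative, not the current mozStorage code):

```cpp
#include <cstring>
#include "mozilla/Assertions.h"

// Sketch: the connection is subscribed to "xpcom-shutdown-threads" only
// between open and close, so still receiving it here means the connection
// was never closed.
NS_IMETHODIMP
Connection::Observe(nsISupports* /* aSubject */, const char* aTopic,
                    const char16_t* /* aData */) {
  if (!strcmp(aTopic, "xpcom-shutdown-threads")) {
    MOZ_ASSERT(false,
               "mozStorage connection still open at xpcom-shutdown-threads");
  }
  return NS_OK;
}
```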

It sounds like we have some kind of plan anyway, so I'll start filing bugs and hopefully find some spare time in the coming weeks to start doing some of it.

Depends on: 1834262

I split the proposals into dependencies of Bug 1834262; feel free to add additional details there where necessary.

Whiteboard: [stockwell disable-recommended] → [stockwell needswork:owner]

Update

There have been 35 failures within the last 7 days, all of them on Linux 18.04 x64 WebRender tsan opt.

Recent failure log: https://treeherder.mozilla.org/logviewer?job_id=418484042&repo=mozilla-central&lineNumber=2918

There have been 30 total failures in the last 7 days.
All the failures are on linux1804-64-tsan-qr platform, opt build type.

Recent failure log.

Update

There have been 30 failures within the last 7 days, all of them on Linux 18.04 x64 WebRender tsan opt.

Recent failure log: https://treeherder.mozilla.org/logviewer?job_id=422904703&repo=autoland&lineNumber=3310

Update

There have been 36 total failures within the last 7 days, all of them on Linux 18.04 x64 WebRender tsan opt.

Recent log: https://treeherder.mozilla.org/logviewer?job_id=430896226&repo=mozilla-central&lineNumber=4804

Marco, are there any updates on this? Checked Bug 1834262 as well.

Flags: needinfo?(mak)
Whiteboard: [stockwell disable-recommended] → [stockwell disable-recommended][stockwell needswork:owner]

(In reply to Natalia Csoregi [:nataliaCs] from comment #101)

Marco, are there any updates on this? Checked Bug 1834262 as well.

Sorry about that.
I'll try to do some work on that bug; there are a few dependencies to fix there.

Flags: needinfo?(mak)
Whiteboard: [stockwell disable-recommended][stockwell unknown] → [stockwell unknown]
Product: Toolkit → Core