Perma Windows M-condprof dom/* failures
Categories
(Core :: DOM: Service Workers, defect, P2)
People
(Reporter: noriszfay, Unassigned)
Details
(Keywords: intermittent-failure, leave-open, Whiteboard: [stockwell disable-recommended])
Attachments
(1 file)
Multiple failures are happening on different paths that all start with dom/*; the same tests also fail when retriggered on previous pushes that were originally green. Here are a few examples: Example 1, Example 2, Example 3, Example 4. This is happening on Windows only.
Comment hidden (Intermittent Failures Robot)
Updated•2 months ago
Comment 2•2 months ago
The conditioned profile used for this must be corrupted. The issue persists for backfills (which use the new conditioned profile) while the initial runs on the same pushes (with the old profile) were successful: https://treeherder.mozilla.org/jobs?repo=autoland&searchStr=condprof&fromchange=d268b3156cf56121c79d226275ae9d42d3b9170a&tochange=410b6740b34e5f7b19b473cc73ca89b0abe25595
Previous issues with conditioned profiles disappeared when the next profile snapshot got generated, but this time the issue has affected at least two profiles in a row.
Comment 3•2 months ago
I am not sure what to do here. The m-c push this morning generated a new profile, and my retriggers resulted in the same failures. This means the generated profile is bad; it isn't the data in the profile. We need to figure out a few things:
- why Windows takes so long (2x+) to generate a profile compared to Linux
- how to validate the profile before completing the generation job, or maybe how to validate it before using it (see the sketch below)
We run the same mochitest and xpcshell tests on Linux and Windows, but Windows seems to be the problem: not frequently, but twice in the last two weeks.
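For the validation idea above, one possible shape would be a privileged-JS check that runs PRAGMA integrity_check against the SQLite files in a freshly generated profile before publishing it, e.g. via Sqlite.sys.mjs from an xpcshell task. This is a sketch only, not existing condprof/harness code; the function name, profile path handling, and database list are illustrative.

  // Sketch: reject a freshly generated conditioned profile whose SQLite
  // databases fail an integrity check. Names and database list are
  // illustrative, not existing condprof code.
  const { Sqlite } = ChromeUtils.importESModule(
    "resource://gre/modules/Sqlite.sys.mjs"
  );

  async function validateProfileDatabases(profilePath) {
    const databases = ["places.sqlite", "permissions.sqlite", "cookies.sqlite"];
    for (const name of databases) {
      const conn = await Sqlite.openConnection({
        path: PathUtils.join(profilePath, name),
      });
      try {
        const rows = await conn.execute("PRAGMA integrity_check");
        const result = rows[0].getResultByIndex(0);
        if (result != "ok") {
          throw new Error(`${name} failed integrity_check: ${result}`);
        }
      } finally {
        await conn.close();
      }
    }
  }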
Comment hidden (Intermittent Failures Robot)
Updated•2 months ago
Comment 5•2 months ago
Reruns of mochitest-plain tasks on Windows for older commits also fail. The issue is located in the generated conditioned profile.
Comment 6•2 months ago
I took a look at the failure reports.
Not only ServiceWorker tests but also some non-ServiceWorker tests have timeout issues, for example, https://treeherder.mozilla.org/logviewer?job_id=464914601&repo=autoland&lineNumber=1801.
Almost all of the timed-out test cases complete the test and call SimpleTest.finish(); the hang seems to happen while closing content processes. Nika, do you have any idea about this? Could this be related to our recent improvements to process startup/shutdown?
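For context, the failing tests broadly follow the usual mochitest-plain pattern sketched below (illustrative only, not one of the actual failing tests); the point is that the timeout is reported after SimpleTest.finish(), i.e. during harness teardown rather than in the test body.

  // Illustrative mochitest-plain shape; not an actual failing test.
  // The hang under discussion happens after SimpleTest.finish() is called,
  // while the harness cleans up and moves to the next test.
  SimpleTest.waitForExplicitFinish();

  async function runTest() {
    const reg = await navigator.serviceWorker.register("sw.js");
    ok(reg, "service worker registered");
    await reg.unregister();
    SimpleTest.finish(); // test body is done; harness cleanup follows
  }

  runTest();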
Comment hidden (Intermittent Failures Robot)
Comment 8•2 months ago
With the latest profile used for the test execution, about:serviceworkers shows two registered workers on launch:
- https://www.youtube.com/sw.js
- https://s.0cf.io/sw.js?id=87c568b8-3ecd-3402-1602-e87ed95dcdfc
Unknown if these are related.
Comment 9•2 months ago
we have an exception in the mochitests for those:
https://searchfox.org/mozilla-central/source/testing/mochitest/tests/SimpleTest/SimpleTest.js#1498
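For anyone checking what a conditioned profile actually ships with, registrations can be listed from privileged JS roughly like this. A sketch only; it assumes nsIServiceWorkerManager's getAllRegistrations(), which appears to be what about:serviceworkers uses.

  // Sketch: enumerate service worker registrations in the current profile
  // from privileged JS. Error handling omitted.
  const swm = Cc["@mozilla.org/serviceworkers/manager;1"].getService(
    Ci.nsIServiceWorkerManager
  );
  const regs = swm.getAllRegistrations();
  for (let i = 0; i < regs.length; i++) {
    const reg = regs.queryElementAt(i, Ci.nsIServiceWorkerRegistrationInfo);
    console.log(reg.scope);
  }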
Comment 10•2 months ago
Pushed by archaeopteryx@coole-files.de: https://hg.mozilla.org/integration/autoland/rev/da1d0a58aa0b disable Windows mochitest-plain condprof tests temporarily for permanent failures. r=jmaher
Comment 11•2 months ago
bugherder
Comment hidden (Intermittent Failures Robot)
Comment hidden (Intermittent Failures Robot)
Comment 14•2 months ago
(In reply to Eden Chuang[:edenchuang] from comment #6)
> I took a look at the failure reports.
> Not only ServiceWorker tests but also some non-ServiceWorker tests have timeout issues, for example, https://treeherder.mozilla.org/logviewer?job_id=464914601&repo=autoland&lineNumber=1801.
> Almost all of the timed-out test cases complete the test and call SimpleTest.finish(); the hang seems to happen while closing content processes. Nika, do you have any idea about this? Could this be related to our recent improvements to process startup/shutdown?
This comment is mostly a dump of thoughts as I looked through the failures and such. Not sure I can come to any conclusion about what exactly has happened here unfortunately.
That is an interesting situation. It seems like the test is calling SimpleTest.finish(), which is in turn starting the process of switching to the next test, but then the test times out during that final clean-up phase.
The code for this is old async code and a bit tricky to follow, but it seems there are a number of async stages between SimpleTest.finish() and the actual test finishing. First any cleanup functions are executed (https://searchfox.org/mozilla-central/rev/6936c4c3fc9bee166912fce10104fbe0417d77d3/testing/mochitest/tests/SimpleTest/SimpleTest.js#1558-1582), then we query the set of registered service workers to see if we have extra registered service workers (https://searchfox.org/mozilla-central/rev/6936c4c3fc9bee166912fce10104fbe0417d77d3/testing/mochitest/tests/SimpleTest/SimpleTest.js#1487), then we purge all service worker data (https://searchfox.org/mozilla-central/rev/6936c4c3fc9bee166912fce10104fbe0417d77d3/testing/mochitest/tests/SimpleTest/SimpleTest.js#1527,1552), and then finally do a bunch of clean-up of things like preferences, permissions, crash reports, and flush the message queue (https://searchfox.org/mozilla-central/rev/6936c4c3fc9bee166912fce10104fbe0417d77d3/testing/mochitest/tests/SimpleTest/TestRunner.js#891-895).
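Paraphrasing that ordering very roughly (the function names below are made up for illustration; the real code is in SimpleTest.js and TestRunner.js and is considerably more involved):

  // Rough paraphrase of the post-finish() teardown ordering described above.
  // Function names are illustrative only.
  async function afterSimpleTestFinish() {
    await runRegisteredCleanupFunctions();               // SimpleTest cleanup hooks
    await checkForLeftoverServiceWorkerRegistrations();  // extra registrations fail the test
    await purgeServiceWorkerData();                      // clear SW state between tests
    await resetPrefsPermissionsCrashReportsAndFlush();   // TestRunner per-test cleanup
  }

A hang in any of these await points would surface as the kind of post-finish() timeout seen in the logs.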
The content process hang at shutdown seems like it might be a separate issue, as I don't think we actually wait for the content process to shut down at any point during this. The crashes appear to be happening as the test runner tries to shut down the application, which is after the timeouts have happened. This could definitely be related in some way to the service worker process lifecycle changes, but I'm not 100% sure how that would be happening.
I do see that the parent process main thread is in a multiply-nested event loop in the crash linked from comment 6 (https://treeherder.mozilla.org/logviewer?job_id=464914601&repo=autoland&lineNumber=14986). The main thread is in XPCOM shutdown and is advancing through shutdown phases, spinning a nested event loop from JS for that (presumably the async shutdown service). Within that code a runnable to invoke mozilla::storage::Connection::shutdownAsyncThread() is running, which is itself spinning a nested event loop waiting for (presumably) the storage async thread to finish shutting down. The thread which I presume is the storage thread (https://treeherder.mozilla.org/logviewer?job_id=464914601&repo=autoland&lineNumber=19924) has returned from XPCOM activity (as indicated by the name suffix ,SHDRCV,SHDACK and the stack), and is waiting for a PR_JoinThread() call.
There should have been an event dispatched to the main thread which will perform that PR_JoinThread() and exit the nested event loop, but it looks like the main thread has entered a buggy state where it's waiting on the TaskController condvar for an event to become available (https://searchfox.org/mozilla-central/rev/6936c4c3fc9bee166912fce10104fbe0417d77d3/xpcom/threads/TaskController.cpp#631). Theoretically this should be notified when the event is dispatched to the main thread via (simplified call path) nsThread::Dispatch -> EventQueueInternal::PutEvent -> TaskController::DispatchRunnable -> TaskController::AddTask -> MaybeInterruptTask -> EnsureHaveMainThreadTasksScheduled.
That being said, I think in the parent process on Windows we should never be waiting on that condvar, as we want to be processing native OS events when waiting for XPCOM events to become ready through OnProcessNextEvent, so it seems unfortunate that we're ending up waiting on that condvar in any way.
Theoretically the timeouts etc. could all be caused by the main thread in the parent process becoming unresponsive due to waiting on the wrong condvar or similar in this case, as it would delay SimpleTest cleanup to be a timeout if the parent process main thread didn't respond to requests to change prefs/permissions. That being said, I still don't know what would have changed to cause this to be different and start hanging, so this might be barking up the wrong tree.
With the crashes linked in comment 0, the behaviour appears a bit different. The crash appears to be an explicit .abort() call from JS code due to https://searchfox.org/mozilla-central/rev/6f8b44c8c7668f0ff032102e7493b46db505933b/toolkit/components/places/PlacesUtils.sys.mjs#2104-2110, presumably through the call to abort here: https://searchfox.org/mozilla-central/rev/6936c4c3fc9bee166912fce10104fbe0417d77d3/browser/components/BrowserGlue.sys.mjs#2241-2243.
I don't think we can blame a main thread hang in that case quite as much, as if we crashed immediately within this call, I don't think we'd be hanging and encountering a timeout. In this case, it looks like we're failing to shut down a sqlite connection when shutdown was requested. Perhaps there was some change to sqlite recently within the last week or so which could've caused this issue, as that seems to be a somewhat common thread here?
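To illustrate the failure mode (a minimal sketch, not the actual Places code; the blocker name and path are made up): Sqlite.sys.mjs connections are typically closed from an async shutdown blocker, and if close() never resolves, the blocker stays pending until AsyncShutdown eventually gives up, which roughly matches the crash signature here.

  // Minimal sketch of the pattern, not the Places implementation.
  const { Sqlite } = ChromeUtils.importESModule(
    "resource://gre/modules/Sqlite.sys.mjs"
  );
  const { AsyncShutdown } = ChromeUtils.importESModule(
    "resource://gre/modules/AsyncShutdown.sys.mjs"
  );

  async function openWithShutdownBlocker(path) {
    const conn = await Sqlite.openConnection({ path });
    AsyncShutdown.profileBeforeChange.addBlocker(
      "example: close " + path,
      // If this close() never resolves (e.g. the async storage thread is
      // stuck), the blocker stays pending until AsyncShutdown gives up.
      () => conn.close()
    );
    return conn;
  }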
Comment 15•2 months ago
Nika, thanks for the detailed reply.
I wondered whether my previous changes to Workers caused this bug; those changes could make the Worker thread hang and cause the test to time out. However, I tried to reproduce the error and looked at the logs, and unfortunately it doesn't seem to have much to do with my previous changes. According to the logs, the Worker completed shutdown as usual. I'll take a closer look at why the test times out. Thanks.
Updated•2 months ago
Comment 16•2 months ago
(In reply to Nika Layzell [:nika] (ni? for response) from comment #14)
> I don't think we can blame a main thread hang in that case quite as much, as if we crashed immediately within this call, I don't think we'd be hanging and encountering a timeout. In this case, it looks like we're failing to shut down a sqlite connection when shutdown was requested. Perhaps there was some change to sqlite recently within the last week or so which could've caused this issue, as that seems to be a somewhat common thread here?
- Bug 1900093 is the only thing that landed in mozStorage directly in Fx129 on June 11th. There are no in-tree users of the new flag I can see and it should not directly impact anything.
- Bug 1898529 was the most recent upgrade of SQLite to 3.46.0 but that was in Fx128 landing on May 29th.
Looking at the multiply-nested event loop log, the only async execution thread there seems to be the Places one. Oddly, the stack seems to show that the thread is idle, so I would not expect nsThread::Shutdown() to effectively hang, although maybe it was another thread that was being shut down.
:mak, are you aware of any recent changes to places or other SQLite consumers that could potentially cause interesting problems here?
Comment 17•2 months ago
(In reply to Andrew Sutherland [:asuth] (he/him) from comment #16)
> :mak, are you aware of any recent changes to places or other SQLite consumers that could potentially cause interesting problems here?
No.
Recently Bug 1904967 was reported; that is likely the same issue as this one?
The messages from FrecencyRecalculator indicate that the main Places database connection is missing and we're failing to wrap it, or alternatively we're trying to wrap it but shutdown has already started (these FrecencyRecalculator messages may effectively be a red herring).
- Are we copying over a corrupt database or corrupting it in the process?
- Is the filesystem locking the file somehow?
- Is something just trying to open a connection late on shutdown?
Since these are opt builds, we can't see the debug warnings that would tell us whether opening the database is failing.
I assume something is adding a shutdown blocker on places.sqlite. Andrew found a thread holding a handle to places.sqlite, so something was able to open a connection... A reader connection maybe?
a. I wonder if there's a problem in how we generate the condprofile: are we copying the database at a safe time (do we allow enough time for Firefox to create the database and fully close)?
b. Can we get a places.sqlite file from one of these failing profiles, to examine it and check if it's corrupt?
c. Can we get crash annotations? I cannot see them in the logs, but I'd like to see the async shutdown metadata.
d. I wonder if tests, in general, should run with toolkit.asyncshutdown.log set to true. It may generate some additional noise, but it may also help with these cases.
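Regarding (d), that would just be a pref in the test profile, e.g. a line like this in user.js (illustrative):

  // Illustrative: turn on verbose AsyncShutdown logging in the test profile.
  user_pref("toolkit.asyncshutdown.log", true);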
Comment 18•2 months ago
(In reply to Marco Bonardo [:mak] from comment #17)
> b. Can we get a places.sqlite file from one of these failing profiles, to examine it and check if it's corrupt?
The profile used for the latest Windows mochitest runs should have been this one.
Comment 19•2 months ago
(In reply to Sebastian Hengst [:aryx] (needinfo me if it's about an intermittent or backout) from comment #18)
> The profile used for the latest Windows mochitest runs should have been this one.
Hum, there's something strange here: places.sqlite-wal is 4MiB, while in normal conditions it should be a 0-byte file (it gets truncated on Firefox shutdown). That said, the database itself looks sane.
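A cheap sanity check along these lines (sketch only; the helper name is made up and it assumes privileged JS with IOUtils/PathUtils) would be to look at the -wal size when validating a generated profile, since a non-empty WAL after a supposedly clean shutdown hints that Firefox was killed rather than shut down:

  // Sketch: flag profiles whose places.sqlite-wal was not truncated,
  // i.e. where Firefox likely did not shut down cleanly.
  async function walLooksTruncated(profilePath) {
    const walPath = PathUtils.join(profilePath, "places.sqlite-wal");
    if (!(await IOUtils.exists(walPath))) {
      return true; // no WAL at all is fine
    }
    const { size } = await IOUtils.stat(walPath);
    return size == 0;
  }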
How do we generate these profiles?
Comment 20•2 months ago
We could add toolkit.asyncshutdown.log=true for condprof tests; that seems easy to do.
The process for creating these profiles is via the condprof tool, and it works reliably on Linux:
- https://searchfox.org/mozilla-central/source/testing/condprofile/condprof/main.py (code to run profile generation)
- https://searchfox.org/mozilla-central/source/taskcluster/kinds/condprof/kind.yml#57 (command given from task)
- https://treeherder.mozilla.org/logviewer?job_id=465894263&repo=mozilla-central (log of firefox-full task)
Linux takes 45 minutes to generate the profile, while Windows takes 120+ minutes, so something is probably pushing the limits of file I/O and time. I suspect that when we get a bad profile (one where all tests fail), something in the profile creation leaves a file corrupt or truncated in some way.
Comment 21•2 months ago
I can't find what would cause an unclean shutdown, unless those 120+ minutes are causing some kind of forced kill (maybe there's an async shutdown timeout involved, and that is what causes Firefox to forcibly terminate). I'm sorry, but I don't know the condprof tool well enough to easily find an issue in its code.
I posted a patch in bug 1907278 to move the frecency recalculator component's shutdown earlier. I suspect it won't make a big difference (though it would be nice to know if it does), but maybe it will remove some red-herring logging.