Open Bug 1905548 Opened 2 months ago Updated 2 months ago

Perma Windows M-condprof dom/* failures

Categories

(Core :: DOM: Service Workers, defect, P2)

People

(Reporter: noriszfay, Unassigned)

Details

(Keywords: intermittent-failure, leave-open, Whiteboard: [stockwell disable-recommended])

Attachments

(1 file)

Multiple failures are happening on different paths that all start with dom/*, and the same tests also fail when retriggered on previous pushes where the run was green. Here are a few examples: Example 1, Example 2, Example 3, Example 4. This is happening on Windows only.

Flags: needinfo?(echuang)

The conditioned profile used for this must be corrupted. The issue persists for backfills (which use the new conditioned profile) while the initial runs on the same pushes (with the old profile) were successful: https://treeherder.mozilla.org/jobs?repo=autoland&searchStr=condprof&fromchange=d268b3156cf56121c79d226275ae9d42d3b9170a&tochange=410b6740b34e5f7b19b473cc73ca89b0abe25595

Previous issues with conditioned profiles disappeared when the next profile snapshot got generated but this time the issue has affected at least two profiles in a row.

Flags: needinfo?(jmaher)
Summary: Perma M-condprof dom/* failures → Perma Windows M-condprof dom/* failures

I am not sure what to do here. The m-c push this morning generated a new profile, and retriggering resulted in the same failures. This means the generated profile is bad; it isn't just the data in the profile. We need to figure out a few things:

  1. why Windows takes so long (2x+) vs. Linux
  2. how to validate the profile before completing the job, or maybe how to validate the profile before using it (see the sketch below)

We run the same mochitest and xpcshell tests on Linux and Windows, but Windows seems to be the problem: not frequently, but twice in the last 2 weeks.
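A minimal sketch of such a validation step (hypothetical; it is not part of the current condprof tooling, and the file patterns are just the standard profile SQLite databases) could run a cheap corruption check on every SQLite database in the freshly generated profile and flag any non-empty -wal file, which would indicate Firefox did not shut down cleanly:

#!/usr/bin/env python3
"""Hypothetical sanity gate for a conditioned profile (not part of condprof).

Checks every *.sqlite database in the profile directory: a non-empty -wal
file suggests an unclean Firefox shutdown, and PRAGMA quick_check catches
most forms of corruption. Run this against a copy if the original must stay
untouched, since opening a database replays any leftover WAL."""

import sqlite3
import sys
from pathlib import Path


def validate_profile(profile_dir):
    problems = []
    for db in Path(profile_dir).glob("*.sqlite"):
        # A -wal file should be 0 bytes after a clean Firefox shutdown.
        wal = db.with_name(db.name + "-wal")
        if wal.exists() and wal.stat().st_size > 0:
            problems.append(f"{wal.name} is {wal.stat().st_size} bytes, expected 0")

        # Cheap corruption check; quick_check is faster than integrity_check.
        try:
            conn = sqlite3.connect(str(db))
            try:
                result = conn.execute("PRAGMA quick_check").fetchone()[0]
            finally:
                conn.close()
            if result != "ok":
                problems.append(f"{db.name}: quick_check returned {result!r}")
        except sqlite3.Error as e:
            problems.append(f"{db.name}: failed to open ({e})")
    return problems


if __name__ == "__main__":
    issues = validate_profile(sys.argv[1])
    for issue in issues:
        print("PROFILE PROBLEM:", issue)
    sys.exit(1 if issues else 0)

Running something along these lines at the end of profile generation, and again before a test job unpacks the profile, would at least turn "all dom/* tests perma-fail" into an explicit "profile is bad" signal.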

Flags: needinfo?(jmaher)

Reruns of mochitest-plain tasks on Windows for older commits also fail. The issue lies in the generated conditioned profile.

I took a look at the failure reports.

Not only ServiceWorker tests but also some non-ServiceWorker tests have timeout issues, for example, https://treeherder.mozilla.org/logviewer?job_id=464914601&repo=autoland&lineNumber=1801.

Almost all timed-out test cases complete the test and call SimpleTest.finish(); the hang seems to happen while closing content processes. Nika, do you have any idea about this? Could this be related to our recent improvements to process startup/shutdown?

Flags: needinfo?(echuang) → needinfo?(nika)

With the latest profile used for the test execution, about:serviceworkers shows two registered service workers on launch.

Pushed by archaeopteryx@coole-files.de:
https://hg.mozilla.org/integration/autoland/rev/da1d0a58aa0b
disable Windows mochitest-plain condprof tests temporarily for permanent failures. r=jmaher

(In reply to Eden Chuang[:edenchuang] from comment #6)

I took a look at the failure reports.

Not only ServiceWorker tests but also some non-ServiceWorker tests have timeout issues, for example, https://treeherder.mozilla.org/logviewer?job_id=464914601&repo=autoland&lineNumber=1801.

Almost all timed-out test cases complete the test and call SimpleTest.finish(); the hang seems to happen while closing content processes. Nika, do you have any idea about this? Could this be related to our recent improvements to process startup/shutdown?

This comment is mostly a dump of thoughts as I looked through the failures and such. Not sure I can come to any conclusion about what exactly has happened here, unfortunately.


That is an interesting situation. It seems like the test is calling SimpleTest.finish(), which is in turn starting the process of switching to the next test, but then the test times out during that final clean-up phase.

The code for this is old async code and a bit tricky to follow, but it seems there are a number of async stages between SimpleTest.finish() and the actual test finishing:

  1. Any cleanup functions are executed (https://searchfox.org/mozilla-central/rev/6936c4c3fc9bee166912fce10104fbe0417d77d3/testing/mochitest/tests/SimpleTest/SimpleTest.js#1558-1582).
  2. We query the set of registered service workers to see if we have extra registered service workers (https://searchfox.org/mozilla-central/rev/6936c4c3fc9bee166912fce10104fbe0417d77d3/testing/mochitest/tests/SimpleTest/SimpleTest.js#1487).
  3. We purge all service worker data (https://searchfox.org/mozilla-central/rev/6936c4c3fc9bee166912fce10104fbe0417d77d3/testing/mochitest/tests/SimpleTest/SimpleTest.js#1527,1552).
  4. Finally, we do a bunch of clean-up of things like preferences, permissions, and crash reports, and flush the message queue (https://searchfox.org/mozilla-central/rev/6936c4c3fc9bee166912fce10104fbe0417d77d3/testing/mochitest/tests/SimpleTest/TestRunner.js#891-895).


The content process hang at shutdown seems like it might be a separate issue, as I don't think we actually wait for the content processes to shut down at any point during this. The crashes appear to be happening as the test runner tries to shut down the application, which is after the timeouts have happened. This could definitely be in some way related to the service worker process lifecycle changes, but I'm not 100% sure how that would be happening.

I do see that the parent process main thread is in a multiply-nested event loop in the crash linked from comment 6 (https://treeherder.mozilla.org/logviewer?job_id=464914601&repo=autoland&lineNumber=14986). The main thread is in XPCOM shutdown and is advancing through shutdown phases, spinning a nested event loop from JS for that (presumably the async shutdown service). Within that code a runnable to invoke mozilla::storage::Connection::shutdownAsyncThread() is running, which is itself spinning a nested event loop waiting for (presumably) the storage async thread to finish shutting down. The thread which I presume is the storage thread (https://treeherder.mozilla.org/logviewer?job_id=464914601&repo=autoland&lineNumber=19924) has returned from XPCOM activity (as indicated by the name suffix ,SHDRCV,SDHACK and the stack), and is waiting for a PR_JoinThread() call.

There should have been an event dispatched to the main thread which will perform that PR_JoinThread() and exit the nested event loop, but it looks like the main thread has entered a buggy state where it's waiting on the TaskController condvar for an event to become available (https://searchfox.org/mozilla-central/rev/6936c4c3fc9bee166912fce10104fbe0417d77d3/xpcom/threads/TaskController.cpp#631). Theoretically this should be notified when the event is dispatched to the main thread via (simplified call path) nsThread::Dispatch -> EventQueueInternal::PutEvent -> TaskController::DispatchRunnable -> TaskController::AddTask -> MaybeInterruptTask -> EnsureHaveMainThreadTasksScheduled.

That being said, I think in the parent process on Windows we should never be waiting on that condvar, as we want to be processing native OS events while waiting for XPCOM events to become ready through OnProcessNextEvent, so it seems unfortunate that we're ending up waiting on that condvar at all.

Theoretically the timeouts etc. could all be caused by the main thread in the parent process becoming unresponsive due to waiting on the wrong condvar or similar; SimpleTest cleanup would be delayed into a timeout if the parent process main thread didn't respond to requests to change prefs/permissions. That being said, I still don't know what would have changed to make this start hanging, so this might be barking up the wrong tree.


With the crashes linked in comment 0, the behaviour appears a bit different. The crash appears to be an explicit .abort() call from JS code due to https://searchfox.org/mozilla-central/rev/6f8b44c8c7668f0ff032102e7493b46db505933b/toolkit/components/places/PlacesUtils.sys.mjs#2104-2110, presumably through the call to abort here: https://searchfox.org/mozilla-central/rev/6936c4c3fc9bee166912fce10104fbe0417d77d3/browser/components/BrowserGlue.sys.mjs#2241-2243.

I don't think we can blame a main thread hang in that case quite as much, as if we crashed immediately within this call, I don't think we'd be hanging and encountering a timeout. In this case, it looks like we're failing to shut down a sqlite connection when shutdown was requested. Perhaps there was some change to sqlite recently within the last week or so which could've caused this issue, as that seems to be a somewhat common thread here?

Flags: needinfo?(nika) → needinfo?(echuang)

Nika, thanks for the detailed reply.

I wondered whether my previous changes to the Worker caused this bug; those changes could make the Worker thread hang and the test time out. However, I tried to reproduce the error and looked at the log, and unfortunately it doesn't seem to have much to do with my previous changes. According to the log, the Worker completed shutdown as usual. I'll take a closer look at why the test times out. Thanks.

Flags: needinfo?(echuang)
Severity: -- → S3
Priority: -- → P2

(In reply to Nika Layzell [:nika] (ni? for response) from comment #14)

I don't think we can blame a main thread hang in that case quite as much, as if we crashed immediately within this call, I don't think we'd be hanging and encountering a timeout. In this case, it looks like we're failing to shut down a sqlite connection when shutdown was requested. Perhaps there was some change to sqlite recently within the last week or so which could've caused this issue, as that seems to be a somewhat common thread here?

  • Bug 1900093 is the only thing that landed in mozStorage directly in Fx129, on June 11th. There are no in-tree users of the new flag that I can see, and it should not directly impact anything.
  • Bug 1898529 was the most recent SQLite upgrade, to 3.46.0, but that was in Fx128, landing on May 29th.

Looking at the multiply-nested event loop log, the only async execution thread there seems to be the Places one. Oddly, the stack seems to show that the thread is idle, so I would not expect nsThread::Shutdown() to effectively hang, although maybe it was another thread that was being shut down.

:mak, are you aware of any recent changes to places or other SQLite consumers that could potentially cause interesting problems here?

Flags: needinfo?(mak)

(In reply to Andrew Sutherland [:asuth] (he/him) from comment #16)

:mak, are you aware of any recent changes to places or other SQLite consumers that could potentially cause interesting problems here?

No.
Recently Bug 1904967 was reported; that is likely the same as this one?

The messages from FrecencyRecalculator indicate the main Places database connection is missing and we're failing to wrap it; or alternatively we're trying to wrap it but shutdown has already started (these FrecencyRecalculator messages may effectively be a red herring).

  1. Are we copying over a corrupt database or corrupting it in the process?
  2. Is the filesystem locking the file somehow?
  3. Is something just trying to open a connection late during shutdown?

These being opt builds, we can't see the debug warnings that would tell us whether opening is failing.
I assume something is adding a shutdown blocker on places.sqlite. Andrew found a thread holding a handle to places.sqlite, so something was able to open a connection... a reader connection, maybe?

a. I wonder if there's a problem in how we generate the conditioned profile: are we copying the database at a safe time (do we allow enough time for Firefox to create the database and fully close)?
b. Can we get a places.sqlite file from one of these failing profiles, to examine it and check if it's corrupt? (A sketch of such a check follows this list.)
c. Can we get crash annotations? I cannot see them in the logs, but I'd like to see the async shutdown metadata.
d. I wonder if tests, in general, should run with toolkit.asyncshutdown.log set to true. It may generate some additional noise, but it may also help with cases like this.
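Regarding (b), once a places.sqlite (plus its -wal) is pulled out of a failing profile, a quick offline look could be done with a few lines of Python. This is only a sketch: it works on copies so the original files stay untouched, and it assumes the standard Places schema (moz_places) for a basic sanity query.

#!/usr/bin/env python3
"""Sketch: offline inspection of a places.sqlite copied out of a failing
conditioned profile. Works on copies of places.sqlite and places.sqlite-wal
so the originals are left untouched; opening the copy replays the WAL, which
is what Firefox would have done on startup."""

import shutil
import sqlite3
import sys
import tempfile
from pathlib import Path


def inspect_places(profile_dir):
    profile = Path(profile_dir)
    with tempfile.TemporaryDirectory() as tmp:
        # Copy the database and its WAL together.
        for name in ("places.sqlite", "places.sqlite-wal"):
            src = profile / name
            if src.exists():
                print(f"{name}: {src.stat().st_size} bytes")
                shutil.copy(src, Path(tmp) / name)

        conn = sqlite3.connect(str(Path(tmp) / "places.sqlite"))
        try:
            print("integrity_check:",
                  conn.execute("PRAGMA integrity_check").fetchone()[0])
            print("moz_places rows:",
                  conn.execute("SELECT count(*) FROM moz_places").fetchone()[0])
        finally:
            conn.close()


if __name__ == "__main__":
    inspect_places(sys.argv[1])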

Flags: needinfo?(mak) → needinfo?(jmaher)

(In reply to Marco Bonardo [:mak] from comment #17)

b. Can we get a places.sqlite file from one of these failing profiles, to examine it and check if it's corrupt?

The profile used for the latest Windows mochitest runs should have been this one.

Depends on: 1907278

(In reply to Sebastian Hengst [:aryx] (needinfo me if it's about an intermittent or backout) from comment #18)

The profile used for the latest Windows mochitest runs should have been this one.

Hmm, there's something strange here: places.sqlite-wal is 4 MiB, while under normal conditions it should be a 0-byte file (it gets truncated on Firefox shutdown). That said, the database looks sane.
How do we generate these profiles?

We could add toolkit.asyncshutdown.log=true for condprof tests; that seems easy to do (a rough sketch follows).
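One hypothetical way to do that, assuming we can touch the generated profile before it is packaged (the exact hook point in condprof or the mochitest harness is not something this sketch pins down), is to append the pref to the profile's user.js, which Firefox applies on startup:

from pathlib import Path


def enable_asyncshutdown_log(profile_dir):
    """Append toolkit.asyncshutdown.log=true to the profile's user.js.

    Firefox reads user.js at startup and applies any user_pref() lines in it.
    Where exactly to call this (condprof packaging vs. mochitest setup) is
    left open; this is only an illustration."""
    user_js = Path(profile_dir) / "user.js"
    line = 'user_pref("toolkit.asyncshutdown.log", true);\n'
    existing = user_js.read_text() if user_js.exists() else ""
    if line not in existing:
        with user_js.open("a") as f:
            f.write(line)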

The process for creating these is via the condprof tool, and it works reliably on Linux.

Linux takes 45 minutes to generate the profile while Windows takes 120+ minutes, so something is probably pushing the limits of file I/O and time. I suspect that when we get a bad profile (one where all tests fail), something in the profile creation leaves a file corrupt or truncated in some way.

Flags: needinfo?(jmaher)

I can't find what would cause an unclean shutdown, unless those 120+ minutes are causing some kind of forced kill (maybe there's an async shutdown timeout involved, and that is what causes Firefox to be forcibly terminated). I'm sorry, but I don't know the condprof tool well enough to easily find an issue in its code.
I posted a patch in bug 1907278 to shut down the frecency recalculator component earlier; I suspect it won't make a big difference (though it would be nice to know if it does), but maybe it will remove some red-herring logging.

See Also: → 1904967