Open Bug 1798904 Opened 2 years ago Updated 11 months ago

Shutdownhang in backgroundupdate task loading JS modules

Categories

(Toolkit :: Background Tasks, defect)

x86
Windows 10
defect

Tracking


People

(Reporter: jesup, Unassigned)

References

(Depends on 1 open bug, Blocks 1 open bug)

Details

(Keywords: crash)

Crash Data

We're getting a lot of shutdownhangs with nsZipArchive sitting in memcmp, comparing names.
The implication is that we're looping (infinitely?) in ::GetItem(), comparing names, and never returning (or at least not within the time shutdown will wait - 30s?).

This suggests a cycle in the item->next chain that we iterate over.
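For illustration, here is a minimal sketch (not the actual modules/libjar code; the struct and function names are made up) of the kind of bucket-chain lookup the stack points at. If the next chain is ever corrupted into a cycle, the memcmp loop never terminates and the shutdown terminator eventually fires:

  #include <cstdint>
  #include <cstring>

  // Hypothetical stand-in for an nsZipItem-style entry: each entry hangs
  // off a hash bucket through its next pointer.
  struct ZipItemSketch {
    const char* name;        // entry name inside the archive
    uint32_t nameLength;     // length of |name| in bytes
    ZipItemSketch* next;     // next entry in the same hash bucket
  };

  // Walk the bucket chain, comparing names with memcmp (the frame we see
  // on the crashing thread).  A cycle in |next| means this never returns.
  ZipItemSketch* GetItemSketch(ZipItemSketch* bucketHead,
                               const char* aEntryName, uint32_t aLen) {
    for (ZipItemSketch* item = bucketHead; item; item = item->next) {
      if (item->nameLength == aLen &&
          memcmp(item->name, aEntryName, aLen) == 0) {
        return item;  // found
      }
    }
    return nullptr;  // not found (never reached if |next| forms a cycle)
  }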

While we've had crashes there for a long time, the frequency appears to have increased considerably around late September.

Crash report: https://crash-stats.mozilla.org/report/index/1e831b3f-a087-4b6a-986d-ecfbc0221103

MOZ_CRASH Reason: Shutdown hanging at step AppShutdownConfirmed. Something is blocking the main-thread. Top 10 frames of crashing thread:

0  VCRUNTIME140.dll  memcmp  
1  xul.dll  nsZipArchive::GetItem  modules/libjar/nsZipArchive.cpp:428
2  xul.dll  nsJAR::GetInputStreamWithSpec  modules/libjar/nsJAR.cpp:312
3  xul.dll  nsJAR::GetInputStream  modules/libjar/nsJAR.cpp:291
4  xul.dll  nsJARInputThunk::Init  modules/libjar/nsJARChannel.cpp:125
5  xul.dll  nsJARChannel::CreateJarInput  modules/libjar/nsJARChannel.cpp:287
6  xul.dll  nsJARChannel::Open  modules/libjar/nsJARChannel.cpp:1065
7  xul.dll  mozJSSubScriptLoader::ReadStencil  js/xpconnect/loader/mozJSSubScriptLoader.cpp:227
7  xul.dll  mozJSSubScriptLoader::DoLoadSubScriptWithOptions  js/xpconnect/loader/mozJSSubScriptLoader.cpp:455
8  xul.dll  mozJSSubScriptLoader::LoadSubScript  js/xpconnect/loader/mozJSSubScriptLoader.cpp:297

Also: all the crashes in the last 6 months are on Windows.

The rise appears to correlate with Firefox 105 and later.

This hang also spiked at the same time: [@ shutdownhang | xtolong ] in nsZipArchive, also only on Windows.

Crash Signature: [@ shutdownhang | memcmp | nsZipArchive::GetItem] → [@ shutdownhang | memcmp | nsZipArchive::GetItem] [@ shutdownhang | xtolong ]

The xtolong crashes all appear to be in initializing zip archives (GetDataOffset()).
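For context, xtolong is the small byte-order helper libjar uses while parsing archive headers during archive init: roughly, it assembles a 32-bit little-endian value from four possibly-unaligned bytes of the mapped file. A hedged sketch of such a helper (illustrative only, not the exact Mozilla implementation):

  #include <cstdint>

  // Illustrative xtolong-style helper: read a 32-bit little-endian value
  // from four unaligned bytes of a memory-mapped archive.  Header parsing
  // calls this repeatedly, which is why it shows up on top of these stacks.
  static uint32_t XtolongSketch(const uint8_t* ll) {
    return static_cast<uint32_t>(ll[0]) |
           (static_cast<uint32_t>(ll[1]) << 8) |
           (static_cast<uint32_t>(ll[2]) << 16) |
           (static_cast<uint32_t>(ll[3]) << 24);
  }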

Severity: -- → S2
Priority: -- → P2
Whiteboard: [necko-triaged][necko-priority-review]
Assignee: nobody → rjesup
Whiteboard: [necko-triaged][necko-priority-review] → [necko-triaged][necko-priority-queue]

We took a look at a crash report: https://crash-stats.mozilla.org/report/index/1e831b3f-a087-4b6a-986d-ecfbc0221103#tab-rawdump

We found some weird values for aFilename:

-		aFilename	<Error reading characters of string.>	const nsTSubstring<char> &
-		Length	{mLength=0x5b5f5e6c }	mozilla::detail::nsTStringLengthStorage<char>
		mLength	0x5b5f5e6c	unsigned int
		Flags	0xc25d	mozilla::detail::StringDataFlags
-		[Raw View]	{...}	const nsTSubstring<char>

There are 3-5x spikes in the incidence rate on the day of, or the day after, each release (Apr 12, Mar 15, Feb 15, etc.). This implies it's related to updates.

Crash Signature: [@ shutdownhang | memcmp | nsZipArchive::GetItem] [@ shutdownhang | xtolong ] → [@ shutdownhang | memcmp | nsZipArchive::GetItem] [@ shutdownhang | xtolong ] [@ shutdownhang | memcpy_repmovs | nsJARInputStream::Read ]
Crash Signature: [@ shutdownhang | memcmp | nsZipArchive::GetItem] [@ shutdownhang | xtolong ] [@ shutdownhang | memcpy_repmovs | nsJARInputStream::Read ] → [@ shutdownhang | memcmp | nsZipArchive::GetItem] [@ shutdownhang | xtolong ] [@ shutdownhang | memcpy_repmovs | nsJARInputStream::Read ] [@ shutdownhang | mozilla::net::CacheIndexEntry::HashKey ]
Crash Signature: [@ shutdownhang | memcmp | nsZipArchive::GetItem] [@ shutdownhang | xtolong ] [@ shutdownhang | memcpy_repmovs | nsJARInputStream::Read ] [@ shutdownhang | mozilla::net::CacheIndexEntry::HashKey ] → [@ shutdownhang | memcmp | nsZipArchive::GetItem] [@ shutdownhang | xtolong ] [@ shutdownhang | memcpy_repmovs | nsJARInputStream::Read ] [@ shutdownhang | mozilla::net::CacheIndexEntry::HashKey ] [@ shutdownhang | memmove | std::_Copy_memmove ]\ [@ …
Crash Signature: [@ shutdownhang | memcmp | nsZipArchive::GetItem] [@ shutdownhang | xtolong ] [@ shutdownhang | memcpy_repmovs | nsJARInputStream::Read ] [@ shutdownhang | mozilla::net::CacheIndexEntry::HashKey ] [@ shutdownhang | memmove | std::_Copy_memmove ]\ [@ … → [@ shutdownhang | memcmp | nsZipArchive::GetItem] [@ shutdownhang | xtolong ] [@ shutdownhang | memcpy_repmovs | nsJARInputStream::Read ] [@ shutdownhang | mozilla::net::CacheIndexEntry::HashKey ] [@ shutdownhang | memmove | std::_Copy_memmove ] [@ s…
Crash Signature: shutdownhang | memcpy | nsJARInputStream::Read ] [@ shutdownhang | memcpy | std::_Copy_memmove ] → shutdownhang | memcpy | nsJARInputStream::Read ] [@ shutdownhang | memcpy | std::_Copy_memmove ] [@ shutdownhang | strlen | JS_AtomizeString ]

All, or virtually all, are Windows crashes. All appear to have BackgroundTaskName = backgroundupdate. About half are in AppShutdownConfirmed; the others are in various other phases.

The fact that they're all backgroundupdate (and all Windows) is very interesting...

Flags: needinfo?(jstutte)
Summary: Crash in [@ shutdownhang | memcmp | nsZipArchive::GetItem] → Shutdownhang in backgroundupdate task loading JS modules

Nick - There is a large set of shutdown hangs for the backgroundupdate task, all (or virtually all) on Windows. The ones on this bug all seem to be loading JS modules, but that may just reflect what I was searching for, since that's what this bug was originally about.

A wider search of backgroundupdate crashes shows a huge number of shutdown hangs:
https://crash-stats.mozilla.org/search/?background_task_name=%3Dbackgroundupdate&product=Firefox&date=%3E%3D2023-05-02T00%3A46%3A00.000Z&date=%3C2023-05-09T00%3A46%3A00.000Z&_facets=signature&_sort=-date&_columns=date&_columns=signature&_columns=product&_columns=version&_columns=build_id&_columns=platform#facet-signature

Something like 36-40K crashes in the last week alone (the signatures on this bug, with ModuleLoader/nsJAR on the stack, account for only around 500 of those).

I wonder if the problems reproducing them (such as bug 1768344) stem from the fact that it most often crashes when there's a release (see the graph above).

Flags: needinfo?(nalexander)

Interestingly, we did quite a good job of avoiding AsyncShutdownTimeout for background tasks; see for example bug 1820517, bug 1814104, and bug 1799421. The remaining hangs are probably "things that start and were never even designed to interact with shutdown", such that the terminator catches them.

I think we have several paths here (and might want all):

  1. Mitigate the impact by having an (arbitrary) timeout at the end of the background task before starting the shutdown sequence of the process. This is an easy hack, but obviously not really nice. And we should check the Shutdown Reason annotation to see whether there are many cases of OS shutdown (my gut feeling is no, but it is not yet aggregatable, so we should take a closer look), since delaying process exit could delay OS shutdown. But I assume backgroundupdate is something we run at any time during normal operation, IIUC.
  2. Redesign how startup actually works, making sure that we have async shutdown blockers for every piece that starts up.
  3. Have a closer look at what actually needs to start up for a background task. My current assumption is that many things are not needed at all, but there might be hidden dependencies that are hard to untangle. Especially on Windows I'd check whether we start GPU processes or similar, or accessibility features or other heavy COM objects. (Disclaimer: I have no idea whether any of these examples really happens.)

Given that 2. and 3. are desirable but hard, I'd start with 1.

Edit regarding 1.: I assume runBackgroundTaskNamed could be a good place to add a waiting promise.

Flags: needinfo?(jstutte)
Depends on: 1832089
Assignee: rjesup → nobody
Severity: S2 → --
Component: Networking: JAR → Background Tasks
Priority: P2 → --
Product: Core → Toolkit

I am not sure it is fair to move this to Background Tasks. The JS module loading could also be more robust about blocking shutdown correctly. The bug I made this depend on is only a mitigation.

(In reply to Randell Jesup [:jesup] (needinfo me) from comment #8)

Nick - There is a large set of shutdown hangs for the backgroundupdate task, all (or virtually all) on Windows.

backgroundupdate is only scheduled on Windows, so this is expected.

The ones on this bug all seem to be loading JS modules, but that may just reflect what I was searching for, since that's what this bug was originally about.

A wider search of backgroundupdate crashes shows a huge number of shutdown hangs:
https://crash-stats.mozilla.org/search/?background_task_name=%3Dbackgroundupdate&product=Firefox&date=%3E%3D2023-05-02T00%3A46%3A00.000Z&date=%3C2023-05-09T00%3A46%3A00.000Z&_facets=signature&_sort=-date&_columns=date&_columns=signature&_columns=product&_columns=version&_columns=build_id&_columns=platform#facet-signature

Something like 36-40K crashes in the last week alone (the signatures on this bug, with ModuleLoader/nsJAR on the stack, account for only around 500 of those).

I wonder if the problems reproducing them (such as bug 1768344) stem from the fact that it most often crashes when there's a release (see the graph above).

I think the spikes around releases are almost certainly because, when there are actually updates to download and apply, more happens in the background update task.

Everything here looks like the class of issue that Jens details in https://bugzilla.mozilla.org/show_bug.cgi?id=1798904#c9: large swathes of Gecko (and Firefox) don't handle shutdown cleanly. The background update task loads a subset of Firefox code and generally exits very quickly: we're just not robust in that situation.

I will counter the weekly stats a little bit: 40k crashes sounds like a lot, but the background update task talks to ~100M distinct background update clients a week (and many clients run the task many times; that number is just one I have in my head). So I'd certainly like to see progress on this, because these issues can bite browsing Firefox users too, but I don't think it's very high priority.

Flags: needinfo?(nalexander)

(In reply to Jens Stutte [:jstutte] from comment #9)

Interestingly, we did quite a good job of avoiding AsyncShutdownTimeout for background tasks; see for example bug 1820517, bug 1814104, and bug 1799421. The remaining hangs are probably "things that start and were never even designed to interact with shutdown", such that the terminator catches them.

I think we have several paths here (and might want all):

  1. Mitigate the impact by having an (arbitrary) timeout at the end of the background task before starting the shutdown sequence of the process. This is an easy hack, but obviously not really nice. And we should check the Shutdown Reason annotation to see whether there are many cases of OS shutdown (my gut feeling is no, but it is not yet aggregatable, so we should take a closer look), since delaying process exit could delay OS shutdown. But I assume backgroundupdate is something we run at any time during normal operation, IIUC.
    Edit regarding 1.: I assume runBackgroundTaskNamed could be a good place to add a waiting promise.

backgroundupdate has a short, arbitrary wait as it tries to let Glean uploads proceed: https://searchfox.org/mozilla-central/rev/373d05f4eabdb90a6480d5d36d983b8bff36c9d8/toolkit/mozapps/update/BackgroundTask_backgroundupdate.sys.mjs#414-416. Glean has improved greatly, so this is no longer necessary.

I'd be happy to lift this delay to a per-task-configured delay (like the per-task-configured timeout) in the tasks manager, and to extend it, if that might help. But I doubt it would help in a non-linear fashion: we have racing startup and shutdown, and racers gonna race. I'd like to see more labeling of activities at startup and shutdown to help us understand which ones are failing in the wild.

Whiteboard: [necko-triaged][necko-priority-queue]

(moved the discussion from comment 12 to bug 1832089)

Severity: -- → S3