Open Bug 1667481 Opened 4 years ago Updated 8 months ago

Crash in [@ mozilla::ipc::MessagePumpForNonMainThreads::Run]

Categories

(Core :: IPC, defect, P2)

x86
Windows
defect

Tracking

()

Tracking Status
firefox81 --- unaffected
firefox82 + wontfix
firefox83 --- wontfix
firefox84 --- wontfix

People

(Reporter: aryx, Unassigned)

Details

(Keywords: crash, regression, sec-moderate)

Crash Data

There is a frequency increase on Windows 7 x86 for this signature starting with the 82 betas. Each crashing installation now reports ~5-6 crashes, compared with ~1-2 for 81. 82.0b2 also had 12 installations reporting crashes, more than any of the other betas.

Crash report: https://crash-stats.mozilla.org/report/index/e394de81-a364-4a13-9aba-88b3b0200925

Top 10 frames of crashing thread:

0  @0xcedd74 
1 xul.dll nsThread::ProcessNextEvent xpcom/threads/nsThread.cpp:1234
2 xul.dll mozilla::ipc::MessagePumpForNonMainThreads::Run ipc/glue/MessagePump.cpp:302
3 xul.dll MessageLoop::RunHandler ipc/chromium/src/base/message_loop.cc:327
4 xul.dll MessageLoop::Run ipc/chromium/src/base/message_loop.cc:309
5 xul.dll static nsThread::ThreadFunc xpcom/threads/nsThread.cpp:442
6 nss3.dll _PR_NativeRunThread nsprpub/pr/src/threads/combined/pruthr.c:399
7 nss3.dll pr_root nsprpub/pr/src/md/windows/w95thred.c:139
8 ucrtbase.dll thread_start<unsigned int > 
9 kernel32.dll BaseThreadInitThunk 

These crashes are all happening while spinning the event loop on the socket thread, which I think indicates a networking issue.

Component: IPC → Networking
Summary: Crash in [@ mozilla::ipc::MessagePumpForNonMainThreads::Run] frequency increase with gecko 82 → Crash in [@ mozilla::ipc::MessagePumpForNonMainThreads::Run] frequency increase with gecko 82 on socket thread

Note that this is a startup crash, and it looks like it is happening quite early. Some of the crash reports were happening when we were setting up the JS context. Maybe there's some race with setting up the socket thread?

Summary: Crash in [@ mozilla::ipc::MessagePumpForNonMainThreads::Run] frequency increase with gecko 82 on socket thread → Startup crash in [@ mozilla::ipc::MessagePumpForNonMainThreads::Run] frequency increase with gecko 82 on socket thread
Flags: needinfo?(jstutte)
Keywords: regression

Aggregating by crash reason, we see several:

1 	EXCEPTION_ACCESS_VIOLATION_EXEC 	168 	65.62 %
2 	EXCEPTION_ACCESS_VIOLATION_READ 	56 	21.88 %
3 	EXCEPTION_ACCESS_VIOLATION_WRITE 	16 	6.25 %
4 	EXCEPTION_GUARD_PAGE 	5 	1.95 %
5 	EXCEPTION_ILLEGAL_INSTRUCTION 	3 	1.17 %
6 	EXCEPTION_BREAKPOINT 	2 	0.78 %
7 	EXCEPTION_STACK_BUFFER_OVERRUN 	2 	0.78 %
8 	SIGSEGV /SEGV_MAPERR 	2 	0.78 %
9 	EXCEPTION_PRIV_INSTRUCTION 	1 	0.39 %
10 	SIGSEGV /0x00000000 	1 	0.39 %
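
As a quick consistency check on the aggregation above, the shares can be recomputed from the raw counts. This is only a minimal sketch: the counts are copied from the table, while the names `kReasons`, `TotalReports` and `SharePercent` are made up for illustration and are not part of any crash-stats tooling.

```cpp
#include <array>

// One row of the aggregation table: crash reason and number of reports.
struct ReasonCount {
  const char* reason;
  unsigned count;
};

// Counts copied from the table in this comment.
static const std::array<ReasonCount, 10> kReasons = {{
    {"EXCEPTION_ACCESS_VIOLATION_EXEC", 168},
    {"EXCEPTION_ACCESS_VIOLATION_READ", 56},
    {"EXCEPTION_ACCESS_VIOLATION_WRITE", 16},
    {"EXCEPTION_GUARD_PAGE", 5},
    {"EXCEPTION_ILLEGAL_INSTRUCTION", 3},
    {"EXCEPTION_BREAKPOINT", 2},
    {"EXCEPTION_STACK_BUFFER_OVERRUN", 2},
    {"SIGSEGV /SEGV_MAPERR", 2},
    {"EXCEPTION_PRIV_INSTRUCTION", 1},
    {"SIGSEGV /0x00000000", 1},
}};

// Sum of all report counts in the table.
inline unsigned TotalReports() {
  unsigned total = 0;
  for (const ReasonCount& row : kReasons) {
    total += row.count;
  }
  return total;
}

// Share of the total, in percent, for a given report count.
inline double SharePercent(unsigned aCount) {
  return 100.0 * aCount / TotalReports();
}
```

The counts sum to 256 reports, and the three access-violation variants together account for 240 of them (93.75 %).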

FWIW, on versions older than 82.0b4 I see the same pattern on other threads too, not only the socket thread, with the main thread in various states but, AFAICS, always during initialization. On 82.0b4, however, it seems to happen on the socket thread only. This might just be a coincidence, though.

Flags: needinfo?(jstutte)

Keeping the ni to take a further look.

Flags: needinfo?(jstutte)

Just another detail: with 82.0b4 this seems to happen only on Windows 7 in 32-bit mode. With older builds there are also some occurrences on more modern OS versions.

Flags: needinfo?(jstutte)
Flags: needinfo?(jstutte)
Depends on: 1668795
Flags: needinfo?(jstutte)

I have now had the chance to look at a minidump. This is the stack it reports:

 	0c44dd74()	Unknown
 	[Frames below may be incorrect and/or missing.]	Unknown
 	xul.dll!nsThread::ProcessNextEvent(bool aMayWait, bool * aResult) Line 1239	C++
 	[Inline Frame] xul.dll!NS_ProcessNextEvent(nsIThread * aThread, bool aMayWait) Line 513	C++
 	xul.dll!mozilla::ipc::MessagePumpForNonMainThreads::Run(base::MessagePump::Delegate * aDelegate) Line 302	C++
 	[Inline Frame] xul.dll!MessageLoop::RunInternal() Line 334	C++
 	xul.dll!MessageLoop::RunHandler() Line 328	C++
 	xul.dll!MessageLoop::Run() Line 310	C++
 	xul.dll!nsThread::ThreadFunc(void * aArg) Line 444	C++
 	nss3.dll!_PR_NativeRunThread(void * arg) Line 399	C
 	nss3.dll!pr_root(void * arg) Line 139	C
 	ucrtbase.dll!thread_start<unsigned int (__stdcall*)(void *)>()	Unknown
 	kernel32.dll!@BaseThreadInitThunk@12()	Unknown
 	ntdll.dll!___RtlUserThreadStart@8()	Unknown
 	ntdll.dll!__RtlUserThreadStart@8()	Unknown
Group: core-security
No longer depends on: 1668795

Since at least the top of the stack is corrupted, we don't know for sure, but let's assume that the bottom of the stack is valid.

Then we are somewhere in event->Run() (called at https://searchfox.org/mozilla-central/rev/9c72508fcf2bba709a5b5b9eae9da35e0c707baa/xpcom/threads/nsThread.cpp#1197), and what actually went wrong depends on the type of the nsIRunnable, which we don't know.

Is it possible to (temporarily?) add a crash annotation that is filled before calling event->Run() with the type of the event (having no RTTI, maybe the address of the vtable?), so that if there is some crash within that, we can tell at least which type of event it was to somehow narrow this down?
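
To make the idea concrete, here is a rough, self-contained sketch of what such an annotation could look like. Everything in it is hypothetical: `Runnable` is only a stand-in for nsIRunnable, `gCurrentEventVTable` stands in for whatever slot the crash reporter would actually read, and `AutoEventAnnotation` is not an existing Gecko class. Reading the vtable pointer from the first word of the object is not guaranteed by the C++ standard; it relies on the layout that the common ABIs (Itanium, MSVC) use for a class with a single polymorphic base.

```cpp
#include <cstdint>
#include <cstring>

// Hypothetical stand-in for nsIRunnable.
struct Runnable {
  virtual ~Runnable() = default;
  virtual void Run() = 0;
};

// Hypothetical per-thread slot that a crash reporter could include in a
// report. Zero means "no event is currently running on this thread".
thread_local std::uintptr_t gCurrentEventVTable = 0;

// Reads the first pointer-sized word of the object. On common ABIs this is
// the vtable pointer, which identifies the dynamic type without RTTI.
// Assumption: implementation-specific, not guaranteed by the standard.
inline std::uintptr_t VTableAddressOf(const Runnable& aEvent) {
  std::uintptr_t vptr = 0;
  std::memcpy(&vptr, static_cast<const void*>(&aEvent), sizeof(vptr));
  return vptr;
}

// RAII guard: publishes the event's vtable address for the duration of a
// scope and restores the previous value on exit, so nested event loops
// report the innermost running event.
class AutoEventAnnotation {
 public:
  explicit AutoEventAnnotation(const Runnable& aEvent)
      : mPrevious(gCurrentEventVTable) {
    gCurrentEventVTable = VTableAddressOf(aEvent);
  }
  ~AutoEventAnnotation() { gCurrentEventVTable = mPrevious; }

  AutoEventAnnotation(const AutoEventAnnotation&) = delete;
  AutoEventAnnotation& operator=(const AutoEventAnnotation&) = delete;

 private:
  std::uintptr_t mPrevious;
};

// What the call site around event->Run() might look like.
inline void RunAnnotated(Runnable& aEvent) {
  AutoEventAnnotation annotation(aEvent);
  aEvent.Run();
}

// Two example event types, used only to exercise the sketch.
struct RecordingRunnable final : Runnable {
  std::uintptr_t mSeenDuringRun = 0;
  void Run() override { mSeenDuringRun = gCurrentEventVTable; }
};

struct NoopRunnable final : Runnable {
  void Run() override {}
};
```

A recorded vtable address would still have to be mapped back to a class by symbolicating it against the matching build, and the cost on a hot path is one pointer-sized load and two stores per event.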

Not sure who is knowledgeable here... Gabriele, do you have an idea, or could you redirect this?

Flags: needinfo?(gsvelto)
OS: Windows 7 → Windows

Nika introduced something similar for the main thread in bug 1608158. It gathers the current runnable name (if set) and adds it to the crash report. She also added a RAII class to make adding this kind of annotation simpler.

Flags: needinfo?(gsvelto)

One thing to keep in mind is that retrieving a runnable's name is a rather expensive operation, which is why it's enabled only on Nightly and only on the main thread.

I'll crack open a minidump for you in VS to see if I can find something interesting. One thing to note is that not all of the crashes are on the socket thread, but most of them are, so we might be dealing with an actual stability issue in the networking code plus some noise from unrelated crash reports.

The beta crash spike seems to be reverting back to more normal levels, but we have had a longstanding crash in this signature.

Group: core-security → dom-core-security
Keywords: sec-moderate

FYI I opened up a minidump with VS but couldn't find anything interesting in there.

Severity: -- → S3
Priority: -- → P2

EKR told me he's hit this twice this week using Nightly on an Intel MacBook Pro. Not reproducible, unfortunately.

FWIW, I took this as a reminder to look at another crash dump. This one shows that we trigger a MOZ_RELEASE_ASSERT(CorePS::Exists()) during AutoProfilerLabel::Push(...), which most likely means that we try to dispatch a Runnable before we ever created or after we already destroyed CorePS.

I have not seen anything related to networking in ~10 crashes I looked at. They all crash on 3 different threads.

Component: Networking → IPC

There are a bunch of different crashes with this signature that appear to have different causes, running on different dedicated threads and crashing at different places in the named function. Not sure it's useful to lump these all together. It's rarely a "startup" crash any more, and there is no sign of the socket thread.

One cluster worth looking into is a bunch of Linux crashes in 116/117 on the Compositor thread that crash on our UAF poison value +0x18, like bp-7d5f7415-0f63-4a7a-b0b2-372140230710.
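
For anyone triaging that cluster, a small helper along these lines can flag crash addresses that sit just past the poison pattern, which is what a member access through a freed object tends to produce. This is a sketch under assumptions: it assumes a 64-bit build in which freed memory is filled with 0xe5 bytes, and the 0x1000 cutoff and all names are made up for illustration. The address used in the usage note is constructed, not taken from the report above.

```cpp
#include <cstdint>

// Assumption: freed memory is filled with 0xe5 bytes, so a pointer read out
// of a freed object on a 64-bit build has this value.
constexpr std::uint64_t kPoisonPattern64 = 0xe5e5e5e5e5e5e5e5ULL;

// Hypothetical cutoff: treat offsets below this as plausible field offsets.
constexpr std::uint64_t kMaxFieldOffset = 0x1000;

// True if the crash address equals the poison pattern plus a small offset,
// i.e. it looks like a member access through a poisoned pointer.
constexpr bool LooksLikePoisonDeref(std::uint64_t aCrashAddress) {
  return aCrashAddress >= kPoisonPattern64 &&
         aCrashAddress - kPoisonPattern64 < kMaxFieldOffset;
}

// Offset of the accessed field relative to the poisoned pointer. Only
// meaningful when LooksLikePoisonDeref() returned true.
constexpr std::uint64_t PoisonOffset(std::uint64_t aCrashAddress) {
  return aCrashAddress - kPoisonPattern64;
}
```

Under these assumptions, an address of 0xe5e5e5e5e5e5e5fd is classified as a poison dereference at offset 0x18, while a near-null address such as 0x18 is not.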

Group: dom-core-security
Summary: Startup crash in [@ mozilla::ipc::MessagePumpForNonMainThreads::Run] frequency increase with gecko 82 on socket thread → Crash in [@ mozilla::ipc::MessagePumpForNonMainThreads::Run]