Closed Bug 1780094 Opened 2 years ago Closed 1 year ago

Crash in [@ js::NewStringCopyNDontDeflateNonStaticValidLength<T>] with ProxyAutoConfig

Categories

(Core :: Networking, defect, P1)

x86
Windows 10
defect

Tracking

()

RESOLVED FIXED
106 Branch
Tracking Status
firefox-esr102 108+ fixed
firefox102 --- wontfix
firefox103 --- wontfix
firefox104 --- wontfix
firefox106 + fixed

People

(Reporter: aryx, Assigned: valentin)

References

Details

(Keywords: crash, csectype-uaf, sec-moderate, Whiteboard: [necko-triaged][necko-priority-review])

Crash Data

Attachments

(1 file)

67 crashes from 30 installations of Firefox 102.0esr. There are isolated reports for the non-esr 102.0 and earlier versions.

Most reports are for bcb.gov.br sites (several subdomains).

Crash report: https://crash-stats.mozilla.org/report/index/f2c6e8be-558a-44fc-a5b3-49dd10220718

Reason: EXCEPTION_ACCESS_VIOLATION_READ

Top 10 frames of crashing thread:

0 xul.dll js::NewStringCopyNDontDeflateNonStaticValidLength<js::CanGC, unsigned char> js/src/vm/StringType.cpp:1651
1 xul.dll JS_NewStringCopyZ js/src/jsapi.cpp:2968
2 xul.dll mozilla::net::PACDnsResolve netwerk/base/ProxyAutoConfig.cpp:326
3 xul.dll Interpret js/src/vm/Interpreter.cpp:3314
4 xul.dll js::RunScript js/src/vm/Interpreter.cpp:389
5 xul.dll js::ExecuteKernel js/src/vm/Interpreter.cpp:781
6 xul.dll js::Execute js/src/vm/Interpreter.cpp:813
7 xul.dll ExecuteScript js/src/vm/CompilationAndEvaluation.cpp:509
8 xul.dll JS_ExecuteScript js/src/vm/CompilationAndEvaluation.cpp:533
9 xul.dll mozilla::net::ProxyAutoConfig::SetupJS netwerk/base/ProxyAutoConfig.cpp:584
Crash Signature: [@ js::NewStringCopyNDontDeflateNonStaticValidLength<T>] → [@ js::gc::GCRuntime::tryNewTenuredThing<T>] [@ js::NewStringCopyNDontDeflateNonStaticValidLength<T>]

These signatures might be both new variants of [@ mozilla::net::PACDnsResolve] which already affected Firefox 91esr.

Crash Signature: [@ js::gc::GCRuntime::tryNewTenuredThing<T>] [@ js::NewStringCopyNDontDeflateNonStaticValidLength<T>] → [@ js::gc::GCRuntime::tryNewTenuredThing<T>] [@ js::NewStringCopyNDontDeflateNonStaticValidLength<T>] [@ mozilla::net::PACDnsResolve]

Kershaw, is this something you can take a look at?

Flags: needinfo?(kershaw)

I found two crash reports that contains the poison value (0xe5e5)

Group: network-core-security
Flags: needinfo?(kershaw)
Priority: -- → P1
Whiteboard: [necko-priority-queue][necko-triaged]
Summary: Crash in [@ js::NewStringCopyNDontDeflateNonStaticValidLength<T>] → Crash in [@ js::NewStringCopyNDontDeflateNonStaticValidLength<T>] with ProxyAutoConfig

Calling this sec-moderate on the theory that most people don't use a PAC script, and those that do generally get them from trustworthy sources like their enterprise admins. But a UAF here is not great.

Looking at a dump in this report, in js::ErrorToException it seems errorNumber=JSMSG_BAD_BYTECODE
Not sure exactly what that means. Maybe Nicholas has an idea if this is really a UAF or not.

My first thought was that maybe we are destroying the context in ProxyAutoConfig::Shutdown while another thread is using it in ProxyAutoConfig::SetupJS but right now I don't see how that is actually possible.

Flags: needinfo?(nicolas.b.pierron)

From experience, I have never seen JSMSG_BAD_BYTECODE before.

This code is produced by the C++ interpreter, which is the first execution stage to read the bytecode. It would be reported if some unknown opcode appears in the bytecode. This can happen if the bytecode gets corrupted, or if the JS engine fails to produce jumps to the proper offset and that an immediate value gets interpreted as an opcode. The last case is unlikely as debug builds have plenty of checks to verify the coherency of our jump-targets.

Another option would be if the bytecode comes from a previous version of Firefox. Caching the bytecode without checking for the version number could cause some opcode to have shifted. All our deserialization path should include a version check, but this might still happen with twilight builds of Firefox (= building it your-self, testing, caching bytecode, rebasing, rebuilding, loading old bytecode from the cache).

Flags: needinfo?(nicolas.b.pierron)
Assignee: nobody → valentin.gosu

The most recent crash report https://crash-stats.mozilla.org/report/index/67ac0582-1540-4a06-b8c8-534520220809#allthreads doesn't have any networking signatures on the stack.
It's actually crashing when initializing the main thread JS context here

Crashing Thread (0), Name: MainThread
Frame 	Module 	Signature 	Source 	Trust
0 	xul.dll 	js::NewStringCopyNDontDeflateNonStaticValidLength<js::CanGC, unsigned char>(JSContext*, unsigned char const*, unsigned int, js::gc::InitialHeap) 	js/src/vm/StringType.cpp:1648 	context
1 	xul.dll 	js::NewStringCopyUTF8N(JSContext*, const JS::UTF8Chars, js::gc::InitialHeap) 	js/src/vm/StringType.cpp:1790 	cfi
2 	xul.dll 	js::NewStringCopyUTF8Z(JSContext*, const JS::ConstUTF8CharsZ, js::gc::InitialHeap) 	js/src/vm/StringType.h:1455 	cfi
3 	xul.dll 	js::ErrorToException(JSContext*, JSErrorReport*, JSErrorFormatString const* (*)(void*, const unsigned int), void*) 	js/src/jsexn.cpp:320 	cfi
4 	xul.dll 	js::ReportErrorNumberVA(JSContext*, js::IsWarning, JSErrorFormatString const* (*)(void*, const unsigned int), void*, const unsigned int, js::ErrorArgumentsType, char*) 	js/src/vm/ErrorReporting.cpp:484 	cfi
5 	xul.dll 	JS_ReportErrorNumberASCII(JSContext*, JSErrorFormatString const* (*)(void*, const unsigned int), void*, const unsigned int, <NoType>) 	js/src/jsapi.cpp:3550 	cfi
6 	xul.dll 	js::ReportAllocationOverflow(JSContext*) 	js/src/vm/JSContext.cpp:373 	cfi
7 	xul.dll 	js::TempAllocPolicy::reportAllocOverflow() const 	js/src/util/AllocPolicy.cpp:20 	cfi
8 	xul.dll 	mozilla::Vector<char16_t, 32, js::StringBufferAllocPolicy>::growStorageBy(unsigned int) 	mfbt/Vector.h:996 	cfi
9 	xul.dll 	js::StringBuffer::append(char16_t const*, unsigned int) 	js/src/util/StringBuffer.h:210 	cfi
10 	xul.dll 	js::frontend::ParserAtomsTable::appendTo(js::StringBuffer&, js::frontend::TaggedParserAtomIndex) const 	js/src/frontend/ParserAtom.cpp:1086 	cfi
11 	xul.dll 	`anonymous namespace'::NameResolver::nameExpression(js::frontend::ParseNode*, bool*) 	js/src/frontend/NameFunctions.cpp:137 	cfi
12 	xul.dll 	`anonymous namespace'::NameResolver::visit(js::frontend::ParseNode*) 	js/src/frontend/NameFunctions.cpp:482 	cfi
13 	xul.dll 	`anonymous namespace'::NameResolver::visit(js::frontend::ParseNode*) 	js/src/frontend/NameFunctions.cpp:482 	cfi
14 	xul.dll 	`anonymous namespace'::NameResolver::visit(js::frontend::ParseNode*) 	js/src/frontend/NameFunctions.cpp:482 	cfi
15 	xul.dll 	`anonymous namespace'::NameResolver::visit(js::frontend::ParseNode*) 	js/src/frontend/NameFunctions.cpp:482 	cfi
16 	xul.dll 	`anonymous namespace'::NameResolver::visit(js::frontend::ParseNode*) 	js/src/frontend/NameFunctions.cpp:482 	cfi
17 	xul.dll 	`anonymous namespace'::NameResolver::visit(js::frontend::ParseNode*) 	js/src/frontend/NameFunctions.cpp:482 	cfi
18 	xul.dll 	`anonymous namespace'::NameResolver::visit(js::frontend::ParseNode*) 	js/src/frontend/NameFunctions.cpp:482 	cfi
19 	xul.dll 	`anonymous namespace'::NameResolver::visit(js::frontend::ParseNode*) 	js/src/frontend/NameFunctions.cpp:482 	cfi
20 	xul.dll 	`anonymous namespace'::NameResolver::visit(js::frontend::ParseNode*) 	js/src/frontend/NameFunctions.cpp:482 	cfi
21 	xul.dll 	`anonymous namespace'::NameResolver::visit(js::frontend::ParseNode*) 	js/src/frontend/NameFunctions.cpp:482 	cfi
22 	xul.dll 	`anonymous namespace'::NameResolver::visit(js::frontend::ParseNode*) 	js/src/frontend/NameFunctions.cpp:482 	cfi
23 	xul.dll 	js::frontend::NameFunctions(JSContext*, js::frontend::ParserAtomsTable&, js::frontend::ParseNode*) 	js/src/frontend/NameFunctions.cpp:490 	cfi
24 	xul.dll 	js::frontend::BytecodeEmitter::emitScript(js::frontend::ParseNode*) 	js/src/frontend/BytecodeEmitter.cpp:2484 	cfi
25 	xul.dll 	CompileGlobalScriptToStencilAndMaybeInstantiate<mozilla::Utf8Unit>(JSContext*, js::LifoAlloc&, js::frontend::CompilationInput&, JS::SourceText<mozilla::Utf8Unit>&, js::ScopeKind, mozilla::Variant<mozilla::UniquePtr<js::frontend::ExtensibleCompilationStencil, JS::DeletePolicy<js::frontend::ExtensibleCompilationStencil> >, RefPtr<js::frontend::CompilationStencil>, js::frontend::CompilationGCOutput*>&) 	js/src/frontend/BytecodeCompiler.cpp:296 	cfi
26 	xul.dll 	js::frontend::CompileGlobalScriptToStencil(JSContext*, js::LifoAlloc&, js::frontend::CompilationInput&, JS::SourceText<mozilla::Utf8Unit>&, js::ScopeKind) 	js/src/frontend/BytecodeCompiler.cpp:376 	cfi
27 	xul.dll 	JSRuntime::initSelfHostingStencil(JSContext*, mozilla::Span<const unsigned char, 4294967295>, bool (*)(JSContext*, mozilla::Span<const unsigned char, 4294967295>)) 	js/src/vm/SelfHosting.cpp:2592 	cfi
28 	xul.dll 	JS::InitSelfHostedCode(JSContext*, mozilla::Span<const unsigned char, 4294967295>, bool (*)(JSContext*, mozilla::Span<const unsigned char, 4294967295>)) 	js/src/vm/Initialization.cpp:224 	cfi
29 	xul.dll 	static XPCJSContext::NewXPCJSContext() 	js/xpconnect/src/XPCJSContext.cpp:1420 	cfi
30 	xul.dll 	xpc::InitializeJSContext() 	js/xpconnect/src/nsXPConnect.cpp:103 	cfi
31 	xul.dll 	XREMain::XRE_mainRun() 	toolkit/xre/nsAppRunner.cpp:5415 	cfi

The error number here is JSMSG_ALLOC_OVERFLOW.

The locals at internalEnsureCapacity are:

Name Value Type
this 0x0537e680 {mBegin=0x11213400 <Error reading characters of string.> mLength=0x711e5cc8 mTail={mBytes=...} } mozilla::Vector<char16_t,32,js::StringBufferAllocPolicy> *
▶ js::StringBufferAllocPolicy {impl_={cx_=0x11213400 {runtime_={...} kind_={value=??? } nurserySuppressions_={...} ...} } arenaId_=...} js::StringBufferAllocPolicy
▶ mBegin 0x11213400 <Error reading characters of string.> char16_t *
mLength 0x711e5cc8 unsigned int
▶ mTail {mBytes=0x0537e694 "\x10" } mozilla::Vector<char16_t,32,js::StringBufferAllocPolicy>::CRAndStorage<32,0>
aNeeded 0x6d6f7250 unsigned int

I'd like the JS team to also take a closer look at this, as this specific crash seems independent from necko code.

Assignee: valentin.gosu → nobody
Component: Networking → JavaScript Engine
Whiteboard: [necko-priority-queue][necko-triaged]

It's possible there are two different issues here though. The crash in comment 7 is probably some sort of memory corruption when compiling self-hosted code (because an allocation overflow shouldn't happen here, it's always exactly the same code...). This is likely unrelated to the PAC crashes.

@arai we're trying to allocate JS strings for an allocation-overflow exception thrown when compiling self-hosted code. Is this something that will be fixed by your frontend exception handling changes?

Flags: needinfo?(arai.unmht)

yes, the frontend code will stop allocating GC strings during compilation. we'll have a separate step to convert frontend-specific error (similar to off-thread errors) to runtime error, immediately after the compilation, and we can have dedicate step for self-hosted JS case to somehow report the error without using GC strings.

Flags: needinfo?(arai.unmht)

(In reply to Sebastian Hengst [:aryx] (needinfo me if it's about an intermittent or backout) from comment #0)

67 crashes from 30 installations of Firefox 102.0esr. There are isolated reports for the non-esr 102.0 and earlier versions.

Most reports are for bcb.gov.br sites (several subdomains).

Knowing that this bug is about a government sub-domain and a proxy-auto-config, it sounds like this might be a recipe for targeted attacks, or a corrupted binary distributed to multiple computers.

One of the concerned I had was about Web Proxy Auto Discovery Protocol, but https://wpad.gov.br/wpad.dat does not seems to exists. This does not mean that an internal equivalent does not exists.

The fact that mozilla::net::PACDnsResolve crashes would suggest that there is a PAC (Proxy Auto Config) script. (right?)

For the context, a PAC script has the capability of by-passing the DNS resolution.
The over-representation of internal domain names with *.gov.br suggest something might be wrong, in the installations or configuration of PAC scripts.

On a side-note, the start-up crash mentioned in comment 7 would be in favor of a corrupted Firefox binary. However, if the binary were to be corrupted, we would expect all crashes to be start-up crashes. As these are manipulating self-hosted data, which is backed in the binary, but only 2.4% are reported as start-up crashes.

Everything seems to point towards memory corruptions, but the source of the corruption is unclear. The source of the memory corruption seems to happen at runtime, while parsing script content or while generating the bytecode for the script. These allocations are mostly managed by the LifoAlloc, which relies on the system allocator.

This crash-stat search, isolated to the sub-domain bcb.gov.br, reports many crashes which are related to SpiderMonkey's Interpreter under the context of the PAC script execution.

Thank you for that evaluation.

The WPAD URL is "http://wpad/wpad.dat"
When resolving wpad the OS resolver may append the DNS suffix to its name, resulting in wpad.gov.br.

It's entirely possible that either this specific organization in brazil has a PAC script that is triggering a bug/memory corruption, or maybe they have some specific software installed that is causing it?

(In reply to Valentin Gosu [:valentin] (he/him) from comment #5)

My first thought was that maybe we are destroying the context in ProxyAutoConfig::Shutdown while another thread is using it in ProxyAutoConfig::SetupJS but right now I don't see how that is actually possible.

This was my only approach, and we likely wouldn't be able to confirm a fix before it lands on ESR.

(In reply to Tooru Fujisawa [:arai] from comment #9)

yes, the frontend code will stop allocating GC strings during compilation. we'll have a separate step to convert frontend-specific error (similar to off-thread errors) to runtime error, immediately after the compilation, and we can have dedicate step for self-hosted JS case to somehow report the error without using GC strings.

I expect if we do this these crashes would either go away or turn into some other kind of crashes? Do we have a bug for this work?

Valentin, do you know if there's something that prevents us from calling Shutdown here while we're spinning the event loop here? There's a comment about nsPACMan being responsible for allowing one PAC execution at a time, I think that's what mInProgress is for? Maybe we can add some release asserts to the shutdown code?

I still think there are two separate issues here: a PAC issue that affects at least ESR 102 and the one in comment 7 is a one-off memory corruption.

Component: JavaScript Engine → Networking

In ProxyAutoConfig::SetupJS we should have called SetRunning(this), is there any chance that state is being clobbered by the SetRunning calls in ProxyAutoConfig::GetProxyForURI?

Those are great leads. I'll take a look.

Assignee: nobody → valentin.gosu
Whiteboard: [necko-triaged][necko-priority-queue]

(In reply to Valentin Gosu [:valentin] (he/him) from comment #12)

(In reply to Tooru Fujisawa [:arai] from comment #9)

yes, the frontend code will stop allocating GC strings during compilation. we'll have a separate step to convert frontend-specific error (similar to off-thread errors) to runtime error, immediately after the compilation, and we can have dedicate step for self-hosted JS case to somehow report the error without using GC strings.

I expect if we do this these crashes would either go away or turn into some other kind of crashes? Do we have a bug for this work?

it will happen in bug 1785762 or followup bug.

(In reply to Valentin Gosu [:valentin] (he/him) from comment #12)

(In reply to Tooru Fujisawa [:arai] from comment #9)

yes, the frontend code will stop allocating GC strings during compilation. we'll have a separate step to convert frontend-specific error (similar to off-thread errors) to runtime error, immediately after the compilation, and we can have dedicate step for self-hosted JS case to somehow report the error without using GC strings.

I expect if we do this these crashes would either go away or turn into some other kind of crashes? Do we have a bug for this work?

If this is indeed a memory corruption, then the other kind of crashes would be as unactionable as the current one.

The fact that we have multiple JS signatures showing the bytecode corruption, or corruption in the computation of the bytecode would suggest that the change of the frontend error reporting will just move the error to else-where.

Also, the error reporting is likely to be a consequence of a memory corruption, as it is more likely that a distributed PAC script does not contain any errors preventing its execution.

I think we should focus on the PAC crashes like the one in comment 0. There are hundreds of crashes with this signature that are similar to that one (crash address 0x420, under PACDnsResolve) so there's likely a real bug here.

The one in comment 7 is unrelated. The UAF crash in comment 5 might be a different symptom of the PAC crashes or could also be unrelated.

(In reply to Jan de Mooij [:jandem] from comment #14)

In ProxyAutoConfig::SetupJS we should have called SetRunning(this), is there any chance that state is being clobbered by the SetRunning calls in ProxyAutoConfig::GetProxyForURI?

As far as I can tell this should not be possible. SetupJS should run before GetProxyForURI, or be called by it. So by the time we get to here there shouldn't be any JS running. But you're right, I'll add an assertion anyway.

(In reply to Jan de Mooij [:jandem] from comment #13)

Valentin, do you know if there's something that prevents us from calling Shutdown here while we're spinning the event loop here? There's a comment about nsPACMan being responsible for allowing one PAC execution at a time, I think that's what mInProgress is for? Maybe we can add some release asserts to the shutdown code?

I've been looking at this question - if anything were to call Shutdown while we're running a PAC script we should hit this condition here, and it shouldn't crash.

I'll try to add some assertions to see if we can rule out these scenarios.

We need to wait for this patch to go to beta and see if any new crashes happen.

Keywords: leave-open
Whiteboard: [necko-triaged][necko-priority-queue] → [necko-triaged][necko-priority-review]

106 was just released. No assertions or crashes on Nightly or Beta.
Let's wait for 2-3 more weeks. If no crashes we could uplift to ESR.

I think this check might have fixed the issues. Based on the previous crash rate, I would have expected at least one report on 106 until now - since there are none, maybe it's time to uplift and confirm.

Comment on attachment 9294417 [details]
Bug 1780094 - Add diagnostic asserts that ProxyAutoConfig::SetRunning is called at the correct times r=#necko

ESR Uplift Approval Request

  • If this is not a sec:{high,crit} bug, please state case for ESR consideration: Potential UAF.
  • User impact if declined: Crash or security vulnerability
  • Fix Landed on Version: 106
  • Risk to taking this patch: Low
  • Why is the change risky/not risky? (and alternatives if risky): We now return an error if there's still a running script happening, instead of deleting the JS context.
Attachment #9294417 - Flags: approval-mozilla-esr102?

:valentin diagnostic asserts are only enabled nightly and early beta.
Wondering about this ESR uplift request?

Flags: needinfo?(valentin.gosu)

(In reply to Donal Meehan [:dmeehan] from comment #26)

:valentin diagnostic asserts are only enabled nightly and early beta.
Wondering about this ESR uplift request?

Apart from the diagnostic assert, it also adds an extra check:

  if (GetRunning()) {
    return NS_ERROR_ALREADY_INITIALIZED;
  }

This is the only line that is actually changing behaviour, and if we actually find ourselves in that condition we really want to return early, otherwise we crash.
Uplifting to ESR would confirm if this is the proper fix, or if the crash rate on release is simply too low. In either case it's worth doing I think.

Flags: needinfo?(valentin.gosu)

Comment on attachment 9294417 [details]
Bug 1780094 - Add diagnostic asserts that ProxyAutoConfig::SetRunning is called at the correct times r=#necko

Approved for 102.6esr.

Attachment #9294417 - Flags: approval-mozilla-esr102? → approval-mozilla-esr102+

Comment on attachment 9294417 [details]
Bug 1780094 - Add diagnostic asserts that ProxyAutoConfig::SetRunning is called at the correct times r=#necko

Landed on ESR102. Removing the approval flag to get this off the needs-uplift radar.
https://hg.mozilla.org/releases/mozilla-esr102/rev/44a89d647667

Attachment #9294417 - Flags: approval-mozilla-esr102+

Firefox 102.6 was released on 2022-12-13 and there seem to be no more PAC related crashes since then.
I think we can consider this FIXED.
A big thank you to everyone who helped resolve this one.


The one crash with this signature in 102.6 is https://crash-stats.mozilla.org/report/index/4bfc8c74-2671-4e02-9909-8f57d0221220 which does not seem PAC related at all. We might want to spin up a separate bug for that.

Status: NEW → RESOLVED
Closed: 1 year ago
Resolution: --- → FIXED
Group: network-core-security → core-security-release
Target Milestone: --- → 106 Branch
Group: core-security-release
See Also: → 1874444
You need to log in before you can comment on or make changes to this bug.

Attachment

General

Created:
Updated:
Size: