Closed Bug 1557732 Opened 5 years ago Closed 5 years ago

Assertion failure: js::GetNonCCWObjectRealm(wrapper) == js::GetContextRealm(cx), at /builds/worker/workspace/build/src/dom/workers/WorkerRunnable.cpp:345

Categories

(Core :: DOM: Workers, defect, P1)

defect

Tracking

()

VERIFIED FIXED
mozilla75
Tracking Status
firefox-esr68 --- unaffected
firefox69 --- wontfix
firefox70 --- wontfix
firefox71 --- wontfix
firefox73 - wontfix
firefox74 + verified
firefox75 + verified

People

(Reporter: jkratzer, Assigned: edenchuang)

References

(Blocks 2 open bugs)

Details

(Keywords: assertion, sec-high, testcase, Whiteboard: [post-critsmash-triage][adv-main74+r])

Attachments

(3 files, 1 obsolete file)

Attached file testcase.html

Testcase found while fuzzing mozilla-central rev debc7af30b7e (built with --enable-debug). Testcase must be served over HTTP in order to reproduce.

Assertion failure: js::GetNonCCWObjectRealm(wrapper) == js::GetContextRealm(cx), at /builds/worker/workspace/build/src/dom/workers/WorkerRunnable.cpp:345

rax = 0x000055ef733e1040   rdx = 0x0000000000000000
rcx = 0x00007fcc4f674484   rbx = 0x00007fcc25fe3940
rsi = 0x00007fcc5aebf8b0   rdi = 0x00007fcc5aebe680
rbp = 0x00007fffd5378330   rsp = 0x00007fffd5378190
r8 = 0x00007fcc5aebf8b0    r9 = 0x00007fcc5c029780
r10 = 0x0000000000000000   r11 = 0x0000000000000000
r12 = 0x00007fffd53781c0   r13 = 0x00007fcc37d2d000
r14 = 0x0000265588fc7190   r15 = 0x00007fffd5378240
rip = 0x00007fcc4b995047
OS|Linux|0.0.0 Linux 4.18.0-20-generic #21~18.04.1-Ubuntu SMP Wed May 8 08:43:37 UTC 2019 x86_64
CPU|amd64|family 6 model 94 stepping 3|1
GPU|||
Crash|SIGSEGV|0x0|0
0|0|libxul.so|mozilla::dom::WorkerRunnable::Run()|hg:hg.mozilla.org/mozilla-central:dom/workers/WorkerRunnable.cpp:debc7af30b7e0ee50653706c19a10511ceefff83|224|0x2d
0|1|libxul.so|mozilla::ThrottledEventQueue::Inner::ExecuteRunnable()|hg:hg.mozilla.org/mozilla-central:xpcom/threads/ThrottledEventQueue.cpp:debc7af30b7e0ee50653706c19a10511ceefff83|252|0x11
0|2|libxul.so|mozilla::ThrottledEventQueue::Inner::Executor::Run()|hg:hg.mozilla.org/mozilla-central:xpcom/threads/ThrottledEventQueue.cpp:debc7af30b7e0ee50653706c19a10511ceefff83|80|0xd
0|3|libxul.so|nsThread::ProcessNextEvent(bool, bool*)|hg:hg.mozilla.org/mozilla-central:xpcom/threads/nsThread.cpp:debc7af30b7e0ee50653706c19a10511ceefff83|1176|0x15
0|4|libxul.so|NS_ProcessNextEvent(nsIThread*, bool)|hg:hg.mozilla.org/mozilla-central:xpcom/threads/nsThreadUtils.cpp:debc7af30b7e0ee50653706c19a10511ceefff83|486|0x11
0|5|libxul.so|mozilla::ipc::MessagePump::Run(base::MessagePump::Delegate*)|hg:hg.mozilla.org/mozilla-central:ipc/glue/MessagePump.cpp:debc7af30b7e0ee50653706c19a10511ceefff83|88|0xa
0|6|libxul.so|MessageLoop::RunInternal()|hg:hg.mozilla.org/mozilla-central:ipc/chromium/src/base/message_loop.cc:debc7af30b7e0ee50653706c19a10511ceefff83|315|0x17
0|7|libxul.so|MessageLoop::Run()|hg:hg.mozilla.org/mozilla-central:ipc/chromium/src/base/message_loop.cc:debc7af30b7e0ee50653706c19a10511ceefff83|290|0x8
0|8|libxul.so|nsBaseAppShell::Run()|hg:hg.mozilla.org/mozilla-central:widget/nsBaseAppShell.cpp:debc7af30b7e0ee50653706c19a10511ceefff83|137|0xd
0|9|libxul.so|nsAppStartup::Run()|hg:hg.mozilla.org/mozilla-central:toolkit/components/startup/nsAppStartup.cpp:debc7af30b7e0ee50653706c19a10511ceefff83|276|0xe
0|10|libxul.so|XREMain::XRE_mainRun()|hg:hg.mozilla.org/mozilla-central:toolkit/xre/nsAppRunner.cpp:debc7af30b7e0ee50653706c19a10511ceefff83|4650|0x11
0|11|libxul.so|XREMain::XRE_main(int, char**, mozilla::BootstrapConfig const&)|hg:hg.mozilla.org/mozilla-central:toolkit/xre/nsAppRunner.cpp:debc7af30b7e0ee50653706c19a10511ceefff83|4789|0x8
0|12|libxul.so|XRE_main(int, char**, mozilla::BootstrapConfig const&)|hg:hg.mozilla.org/mozilla-central:toolkit/xre/nsAppRunner.cpp:debc7af30b7e0ee50653706c19a10511ceefff83|4870|0x5
0|13|firefox-bin|do_main|hg:hg.mozilla.org/mozilla-central:browser/app/nsBrowserApp.cpp:debc7af30b7e0ee50653706c19a10511ceefff83|213|0x22
0|14|firefox-bin|main|hg:hg.mozilla.org/mozilla-central:browser/app/nsBrowserApp.cpp:debc7af30b7e0ee50653706c19a10511ceefff83|295|0xf
0|15|libc-2.27.so||||0x21b97
0|16|firefox-bin|MOZ_ReportCrash|hg:hg.mozilla.org/mozilla-central:mfbt/Assertions.h:debc7af30b7e0ee50653706c19a10511ceefff83|184|0x5
Flags: in-testsuite?
Group: core-security → dom-core-security
Keywords: sec-high

Guessing sec-high because it sounds like a compartment mismatch.

  • :jkratzer, Can the fuzzer make patches so that this could be run as a test via mach following an hg import? Or is there some way to otherwise easily use mach to run this test under mach with "--debugger rr" in a quasi-automated fashion? If there's a link I should just read, that's fine. Thanks!
  • :bholley, related to that, I'm interested for things like this if we can get a repro under pernos.co, but I'm worried that to do that I need to push this test to try but that is somewhat explicitly frowned upon I think. Do we have a mechanism to make it less risky / a concern?

My general motivating concern is we're quite overwhelmed as it is with worker sec bugs and our worker sec bug person just went on medical leave for at least 6 weeks, so anything we can do to make it easier to process and analyze the sec-bugs would be handy. (And especially the ability to link someone else to a pernos.co trace versus needing them to reproduce locally under rr, etc., would be nice.) Thanks!

Flags: needinfo?(jkratzer)
Flags: needinfo?(bobbyholley)

secure try / secure pernosco are both things that could exist in the future but don't exist now.

Flags: needinfo?(bobbyholley)

(In reply to Andrew Sutherland [:asuth] (he/him) from comment #2)

  • :jkratzer, Can the fuzzer make patches so that this could be run as a test via mach following an hg import? Or is there some way to otherwise easily use mach to run this test under mach with "--debugger rr" in a quasi-automated fashion? If there's a link I should just read, that's fine. Thanks!

Andrew, I'm not sure if there's an easy way to import this as a test to be run under mach. However, if you're just looking for a simplified way to run this under rr, you could use ffpuppet (https://github.com/MozillaSecurity/ffpuppet/) with the -rr option.

python -m SimpleHTTPServer &
python -m ffpuppet -p prefs.js --xvfb -d -l log -rr ~/builds/firefox/firefox http://localhost:8000/testcase.html

I'll also attach the prefs I used to reproduce this.

Flags: needinfo?(jkratzer)
Attached file prefs-default-e10s.js

That is very helpful, thank you!

Given this is a sec-high issue and has an STR, mark this as P1.

Priority: -- → P1

Once we identify where the compartment is wrong we should see if there are any API design issues that made it easy to get wrong. See Bug 1565930 for discussions about compartment mismatch bugs related to Promise APIs.

Assignee: nobody → htsai
Assignee: htsai → ytausky
Assignee: ytausky → echuang

Is this something you might be able to work on for the 71 release?

Yes, I am working on it now. Hope can fix before moving 71 to beta.

Flags: needinfo?(echuang)

The worker sends jobs(CompileScript) to the main thread before the document/inner window(GlobalObject) is recreated and executing the job after recreated(when calling SpinEventLoop in the second sync XHR), but the worker/workerPrivate still keeps the original GlobalObject. I wrote some patch to try to cleanup the event queue, but it has side-effects to make other tests fail.

However, I think cleanup the jobs in the event queue might not be a correct solution, because there are not only worker jobs, and it might be the root cause making other regressions. I think these bugs 1546331, 1497633, 1557732 and 1545345 are related to maintaining worker's lifecycle, especially the delete worker timing. We should have an overall solution for these bugs, and bug 1539508 is for it.

In the next step, I will study from these cases and source code to figure out how to refactor the current architecture.
But it would take some time to do it.

I need to compose a deeper reply, but I think there isn't too much overlap on these.

See Also: CVE-2019-17008

Could not reproduce the bug after bug 1588248 landed. But it still could be reproduced in esr68.

Last week Eden told me that he's trying to figure out what fixed this so we can think about backporting. Is that still correct, Eden?

Flags: needinfo?(echuang)

Yes, it is correct. Writing a patch backporting.

Flags: needinfo?(echuang)

Taking while Eden's on PTO. I understand the current plan has been to backport the partial fix on bug 1588248 to ESR68.

Assignee: echuang → bugmail
Status: NEW → ASSIGNED
No longer blocks: fuzzing-workers

Hi Andrew, if it is okay, I'd like to take this bug back.

Following is a rough expected flow when loading the page.

  1. Loading and create DOM objects.
  2. Firing DOMContentLoaded event
  3. The DOMContentLoaded event handler executes a sync XHR then creates an empty worker after sync XHR return.
  4. DocShell creates the ContentViewer(nsDocumentViewer) and copy the document and set them into nsGlobalWindowOuter.
  5. Firing load event

In this expected flow, we should only fire the DOMContentLoaded event once.
In step 4, when DocShell creates a ContentViewer and copies the document, it also creates a new realm/GlobalObject, so you can expect the realm/compartment is different before and after DocShell creates the ContentViewer.

However, Sync XHR could make step 4 happens before we create the worker. And the following is the problem flow we meet.

  1. Loading and create DOMObjects
  2. Firing DOMContentLoaded event
  3. The DOMContentLoaded event handler executes a sync XHR, but not yet return.
  4. DocShell creates the ContentViewer(nsDocumentViewer) and copy the document and set them into nsGlobalWindowOuter.
  5. Creates an empty worker after Sync XHR returns with the GlobalObject from the event handler.
  6. DOMContentLoaded event could be fired again for the new document since it was copied before the handler finish.
  7. Some WorkerRunnable executes then hits the assertion since the realm from GlobalObject (from worker) is different from nsGlobalWindowOuter.

The possible solution could be making sure DocShell creates the ContentViewer and copies the document after DOMContentLoaded finish.
But I need some time to figure out how to achieve it without side-effects.

Flags: needinfo?(bugmail)

Hey, yes! Thanks for the detailed description!

Here's a pernosco trace from my local repro (still rebuilding): https://pernos.co/debug/f--h5b9H8IFFTzqKsZoVpA/index.html

Assignee: bugmail → echuang
Flags: needinfo?(bugmail)

Spend some time to analyze why seeing two DOMContentLoaded event dispatching when loading the attached testcase.html.
The interesting thing is a <meta> tag with a specified charset="UTF8" in the last line.

https://searchfox.org/mozilla-central/source/parser/html/nsHtml5TreeOperation.cpp#964

When parser parses the line of the file, it detects the document needs to reload because of charset changed, then nsDocShell::CharsetChangeStopDocumentLoad() is called to discard the current document.

https://searchfox.org/mozilla-central/source/parser/html/nsHtml5TreeOpExecutor.cpp#774

CharsetChangeStopDocumentLoad() discards the document through a normal stop loading flow, and DOMContentLoaded() is dispatched while executing the Document::EndLoad(). This is the first DOMContentLoaded event.

https://searchfox.org/mozilla-central/source/dom/base/Document.cpp#7354

After nsDocShell::CharsetChangeStopDocumentLoad() success, nsDocShell::CharsetChangeReloadDocument is called for reloading the document with specified charset. And this will create new ContentViewer, document, and WindowInner for reloading, then set them into the nsGlobalWindowOuter. After the document loading finish, the second DOMContentLoaded event is dispatched.

https://searchfox.org/mozilla-central/source/docshell/base/nsDocShell.cpp#12489

We meet the assertion fail since we are switching documents while handling the DOMContentLoaded event on the first document, and the worse is there is a sync XHR in the event handler.

I think we should not dispatch the DOMContentLoad event on the discarded document. And a patch is attached.

Flags: needinfo?(bugs)

I'm having great trouble to understand why the patch helps with the worker issue in general. One can always
spin event loop in DOMContentLoaded listener (or any time JS runs) and load some new page while that is happening.

Why does some workerrunnable try to run on wrong global? (assuming I understand the assertion correctly)
Why we haven't canceled it? Or if cancelling is for worker side only, why we we don't return early when we realize the global isn't correct anymore.
Or what am I missing? :)

Flags: needinfo?(bugs)

(In reply to Olli Pettay [:smaug] from comment #21)

I'm having great trouble to understand why the patch helps with the worker issue in general. One can always
spin event loop in DOMContentLoaded listener (or any time JS runs) and load some new page while that is happening.

Yes, the patch is specific for the charset changing case since it is from the attached case. Like you said, for other cases, one spin event loop in DOMContentLoaded listener (or any time JS runs) and load some new page while that is happening, the assertion might be hit again.
However, maybe the condition in the assertion doesn't fit all cases. I will try to update the assertion condition.

I am not sure if this patch is correct, but I think the patch still has its value for the charset changing case. At least, I didn't find that we should dispatch DOMContentLoaded on the document that discarded by charset changing in the spec. I also compared with other browsers' implementation, no matter the spinning event loop in DOMContentLoaded or not, no one dispatches two DOMContentLoaded events for loading the attached case, only Firefox.

Why does some workerrunnable try to run on wrong global? (assuming I understand the assertion correctly)

In the attached case, the global from DOMContentLoad event is used to create the worker, and runnables for the worker inherits the worker's gloabl and assumes the execution environment should fit with the global. So when executing the runnable the assertion checks the compartments between worker's global and current nsGlobalWindowOuter.

Why we haven't canceled it? Or if cancelling is for worker side only, why we we don't return early when we realize the global isn't correct anymore.
Or what am I missing? :)

The workerRunnable can be a control runnable or a normal runnable. If it is a control runnable, we use it to control workers, i.e. closing Worker. For the case, we cannot cancel it, since we expect to run the runnable to close workers normally.

Flags: needinfo?(bugs)

If we don't want to dispatch DOMContentLoaded for charset change, let's file a separate bug for that and fix that case there (with a testcase).
But I can't see how the patch would fix the worker related issue - it just hides the issue in one specific case.
This is a security bug (though, I'm not actually sure how critical), so the issue should be fixed properly.
This very much smells like an issue related to WorkerPrivate or WorkerRunnable.
Something there can't deal with the case when global is changing.

Flags: needinfo?(bugs)
Attachment #9125772 - Attachment is obsolete: true

According to comments 23, remove the patch for the specific case, reload with charset changing.

When constructing a dedicated worker, the new patch checks if the innerWindow for construction is the current innerWindow or not.
If it is not, we are constructing a worker on a discarded window(RemovedFromDocshell), and create it will cause compartment mismatching for the worker. In the case, the patch returns nullptr and throw InvalidStateError.

Comment on attachment 9126146 [details]
Bug 1557732 - Don't create worker for discarded document.

Security Approval Request

  • How easily could an exploit be constructed based on the patch?: Not sure. But it is easy to understand what situation the patch wants to fix.
  • Do comments in the patch, the check-in comment, or tests included in the patch paint a bulls-eye on the security problem?: No
  • Which older supported branches are affected by this flaw?: 69
  • If not all supported branches, which bug introduced the flaw?: None
  • Do you have backports for the affected branches?: Yes
  • If not, how different, hard to create, and risky will they be?:
  • How likely is this patch to cause regressions; how much testing does it need?: The patch should not cause regressions easy.
Attachment #9126146 - Flags: sec-approval?

Hello :tjr, could you help on this sec-approval? Thanks.

Flags: needinfo?(tom)

Comment on attachment 9126146 [details]
Bug 1557732 - Don't create worker for discarded document.

Approved to land.

Attachment #9126146 - Flags: sec-approval?
Attachment #9126146 - Flags: sec-approval+
Attachment #9126146 - Flags: approval-mozilla-beta+

(Actually, I don't know if this patch applies cleanly to Beta, Eden can you confirm? If not can you attach one?)

Flags: needinfo?(echuang)

https://hg.mozilla.org/integration/autoland/rev/448b5cd3eaa0

This grafts cleanly to Beta as-landed.

Flags: needinfo?(echuang)
Group: dom-core-security → core-security-release
Status: ASSIGNED → RESOLVED
Closed: 5 years ago
Resolution: --- → FIXED
Target Milestone: --- → mozilla75
Flags: qe-verify+
Whiteboard: [post-critsmash-triage]
QA Whiteboard: [qa-triaged]

Hi, I'm trying to reproduce this issue on our end but I'm getting this error :

svuser@P5287:~/Desktop/Test/ffpuppet-master$ python -m ffpuppet -p prefs.js --xvfb -d -l log --rr ~/home/svuser/Desktop/Foxes/Fuzz/firefox/firefox http://localhost:8000/testcase.html
usage: main.py [-h] [-a ABORT_TOKEN] [-d] [-e EXTENSION] [-l LOG]
[--log-limit LOG_LIMIT] [-m MEMORY]
[--poll-interval POLL_INTERVAL] [-p PREFS] [-P PROFILE]
[-t TIMEOUT] [-u URL] [-v] [--xvfb] [--gdb] [--rr]
[--valgrind]
binary
main.py: error: unrecognized arguments: http://localhost:8000/testcase.html

I'm not that familiar with python but I did start the simple HTTP server using : python -m SimpleHTTPServer in the same folder where I ran the other command.
Am I missing something ?

Jason can you please take a look at this and confirm the Fix in Beta and Nightly ?

Flags: needinfo?(jkratzer)

Ehsan, I will have jsbugmon check that the issue has been fixed but just FYI, you need to pass -u before the URL or file path.

Flags: needinfo?(jkratzer)

Testcase is verified as fixed on mozilla-central rev fb4f281c1c54a5199e6e713c1b8115f80d7faa37.

(In reply to Rares Doghi from comment #33)

Am I missing something ?

Yes, -u http://localhost:8000/testcase.html is what was missing.

After setting the -u before the URL i was getting this error :
usage: rr [-h] [--interval INTERVAL] [--exclude EXCLUDE]
[--loglevel {NOTSET,DEBUG,INFO,WARNING,ERROR,CRITICAL}]
...
rr: error: unrecognized arguments: --version
Traceback (most recent call last):
File "/usr/lib/python3.5/runpy.py", line 184, in _run_module_as_main
"main", mod_spec)
File "/usr/lib/python3.5/runpy.py", line 85, in _run_code
exec(code, run_globals)
File "/home/svuser/Desktop/Test/ffpuppet-master/ffpuppet/main.py", line 9, in <module>
main()
File "/home/svuser/Desktop/Test/ffpuppet-master/ffpuppet/main.py", line 189, in main
use_rr=getattr(args, "rr", False))
File "/home/svuser/Desktop/Test/ffpuppet-master/ffpuppet/core.py", line 89, in init
subprocess.check_output(["rr", "--version"])
File "/usr/lib/python3.5/subprocess.py", line 626, in check_output
**kwargs).stdout
File "/usr/lib/python3.5/subprocess.py", line 708, in run
output=stdout, stderr=stderr)
subprocess.CalledProcessError: Command '['rr', '--version']' returned non-zero exit status 2

Jason can you please try Beta as well ? In the mean time I will update the flag on Nightly 75 based on comment 35.

Flags: needinfo?(jkratzer)

Testcase is verified as fixed on mozilla-beta rev 609993945f74.

Flags: needinfo?(jkratzer)

(In reply to Rares Doghi from comment #37)

Also you don't need --rr to verify the bugs. As you saw here it can add unnecessary overhead sometimes.

Thank you For checking Jason , also thanks for the help Tyson after removing the --rr this worked, I will update the flags accordingly. This issue is Verified as fixed in Our latest Nightly build and Beta.

Status: RESOLVED → VERIFIED
QA Whiteboard: [qa-triaged]
Flags: qe-verify+
Whiteboard: [post-critsmash-triage] → [post-critsmash-triage][adv-main74+r]
Group: core-security-release
You need to log in before you can comment on or make changes to this bug.

Attachment

General

Created:
Updated:
Size: