Closed Bug 1331209 Opened 7 years ago Closed 7 years ago

Crash in mozilla::ipc::MessageChannel::CxxStackFrame::CxxStackFrame | mozilla::ipc::MessageChannel::Send | mozilla::dom::asmjscache::PAsmJSCacheEntryParent::SendOnOpenMetadataForRead

Categories

(Core :: JavaScript Engine, defect, P2)

48 Branch
x86
macOS

Tracking


VERIFIED FIXED
mozilla58
Tracking Status
firefox-esr52 58+ verified
firefox55 --- wontfix
firefox56 --- wontfix
firefox57 - wontfix
firefox58 --- fixed

People

(Reporter: Usul, Assigned: luke)

References

Details

(4 keywords, Whiteboard: [adv-main58+][adv-esr52.6+][post-critsmash-triage])

Crash Data

Attachments

(7 files, 2 obsolete files)

7.42 KB, patch
Details | Diff | Splinter Review
1.33 KB, patch
asuth
: review+
dveditz
: sec-approval+
Details | Diff | Splinter Review
845 bytes, patch
asuth
: review+
dveditz
: sec-approval+
Details | Diff | Splinter Review
11.46 KB, patch
asuth
: review+
dveditz
: sec-approval+
Details | Diff | Splinter Review
8.92 KB, patch
janv
: review+
dveditz
: sec-approval+
Details | Diff | Splinter Review
5.87 KB, patch
bbouvier
: review+
Details | Diff | Splinter Review
5.90 KB, patch
Details | Diff | Splinter Review
This bug was filed from the Socorro interface and is report bp-df744d0d-68c3-420d-afeb-7a42a2170114.
=============================================================
0 	XUL 	mozilla::ipc::MessageChannel::CxxStackFrame::CxxStackFrame(mozilla::ipc::MessageChannel&, mozilla::ipc::Direction, IPC::Message const*) 	obj-firefox/i386/dist/include/mozilla/ipc/MessageChannel.h:341
1 	XUL 	mozilla::ipc::MessageChannel::Send(IPC::Message*) 	ipc/glue/MessageChannel.cpp:228
2 	XUL 	mozilla::dom::asmjscache::PAsmJSCacheEntryParent::SendOnOpenMetadataForRead(mozilla::dom::asmjscache::Metadata const&) 	obj-firefox/i386/ipc/ipdl/PAsmJSCacheEntryParent.cpp:66
3 	XUL 	mozilla::dom::asmjscache::(anonymous namespace)::ParentRunnable::Run() 	dom/asmjscache/AsmJSCache.cpp:1003
4 	XUL 	nsThread::ProcessNextEvent(bool, bool*) 	xpcom/threads/nsThread.cpp:1076
5 	XUL 	NS_ProcessNextEvent(nsIThread*, bool) 	xpcom/glue/nsThreadUtils.cpp:290
6 	XUL 	mozilla::ipc::MessagePumpForNonMainThreads::Run(base::MessagePump::Delegate*) 	ipc/glue/MessagePump.cpp:384
7 	XUL 	MessageLoop::Run() 	ipc/chromium/src/base/message_loop.cc:232
8 	XUL 	nsThread::ThreadFunc(void*) 	xpcom/threads/nsThread.cpp:469
9 	libnss3.dylib 	_pt_root 	nsprpub/pr/src/pthreads/ptthread.c:216
Ø 10 	libsystem_pthread.dylib 	libsystem_pthread.dylib@0x15fa 	
Ø 11 	libsystem_pthread.dylib 	libsystem_pthread.dylib@0x1484 	
Ø 12 	libsystem_pthread.dylib 	libsystem_pthread.dylib@0x6cf1


On macOS 10.9.5 the STR are as follows:
1) Log in to Facebook
2) Write on your wall
3) Press Enter to submit your update

Crash.
Component: General → IPC
FWIW I've switched my wife to Nightly and she doesn't crash anymore with those STR.
Bill: by any chance would you know, as a matter of general IPC design, if this assert is necessary?
  https://hg.mozilla.org/releases/mozilla-release/annotate/90f18f9c15f7/ipc/glue/MessageChannel.cpp#l250
The runnable in question (ParentRunnable in dom/asmjscache/AsmJSCache.cpp) is dispatched to the main thread, the QuotaManager IO thread, and the background thread.  I don't know which of those violates the assumption that we're on a "worker thread".
Flags: needinfo?(wmccloskey)
An IPC channel is single-threaded (sort of like a JSRuntime used to be). Whatever thread you Open it on is the thread that it must be used on forever after. The assertion enforces that invariant. In this case it looks like a message is being sent on the wrong thread.

The channel used here is PBackground. The parent side of any PBackground channel must run on the PBackground thread in the chrome process.
Flags: needinfo?(wmccloskey)
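The invariant billm describes can be sketched as follows. This is a hypothetical toy model, not the real MessageChannel API: a channel records the thread that opened it and asserts that every later Send() happens on that same thread.

```cpp
#include <cassert>
#include <thread>

// ToyChannel is illustrative only: it models the single-threaded-channel
// invariant, where whatever thread calls Open() is the only thread allowed
// to call Send() afterwards.
class ToyChannel {
public:
  void Open() {
    mOwningThread = std::this_thread::get_id();
    mOpened = true;
  }
  bool Send() {
    // Analogous to the assertion this bug crashes on: sending from any
    // thread other than the opening thread is a bug.
    assert(mOpened && std::this_thread::get_id() == mOwningThread);
    return true;
  }
private:
  std::thread::id mOwningThread;
  bool mOpened = false;
};

bool SendOnOwningThread() {
  ToyChannel chan;
  chan.Open();
  return chan.Send();  // same thread that called Open(), so the assert holds
}
```

In this model, calling chan.Send() from a second std::thread would trip the assertion, which is the failure mode seen in the crash stack.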
Thanks!  Since the message channel must be called from the PBackground thread, does that mean that in the crash we must be calling from some non-PBackground thread?  Can you tell from the callstack in comment 0 what type of thread this is?
There might be a way to figure out the thread name by opening the crash in Visual Studio, but I don't know for sure. The stack is pretty generic, unfortunately.
Ok, thanks.  And to make sure I understand your above comment, you can tell from the crash stack that this MessageChannel *needs* to be called from the PBackground thread?

Re-reading the file, it seems everything is asserted to be on the right thread throughout.  The only way I can see this failing is if all the asserts are actually failing, it's just that they're not release asserts.  So one idea is to simply convert them to release asserts and see if that shows some interesting new crash signatures.
(In reply to Luke Wagner [:luke] from comment #6)
> Ok, thanks.  And to make sure I understand your above comment, you can tell
> from the crash stack that this MessageChannel *needs* to be called from the
> PBackground thread?

Not from the crash, no. In general, all PBackground communication is from (random thread in random process) to (PBackground thread in main process). Since the actor is a Parent actor, it should be on the (PBackground thread in main process) side.

In theory, it would be possible to create a PBackground channel that goes to some other random thread. But doing so would be pretty nutty.

> Re-reading the file, it seems everything is asserted to be on the right
> thread throughout.  The only way I can see this failing is if all the
> asserts are actually failing, it's just that they're not release asserts. 
> So one idea is to simply convert them to release asserts and see if that
> shows some interesting new crash signatures.

Yeah, it's kinda weird that this doesn't show up in automation. It's usually the sort of thing that always succeeds or always fails. I guess another possibility here is that the IPC channel has already been destroyed, and so the assertion is comparing the current thread ID to some bad memory. That sort of thing happens occasionally. Either way, moving the assertion up and out of the channel code would confirm.
(In reply to Bill McCloskey (:billm) from comment #7)
> Yeah, it's kinda weird that this doesn't show up in automation. It's usually
> the sort of thing that always succeeds or always fails. 

Yeah, that's my impression too.

> I guess another possibility here is that the IPC channel has already been destroyed, and so
> the assertion is comparing the current thread ID to some bad memory.

Ooh yeah, that seems quite possible given that this is a low-volume crash sort of thing.  Maybe I can repro locally putting in some sleep()s.

Q: in general, what's the pattern to detect channel shutdown and avoid these problems?  (Once upon a time I reasoned through all this with the original dom/asmjscache impl, but when it was rewritten to use PBackground everything got way simpler, and I'm no longer familiar with the invariants.)
Hah, yeah, I put a sleep before the dispatch to the failing state and we reliably assert if I close the tab while it's sleeping.

So I think this is just a plain race: ParentRunnable::ActorDestroy is called on the PBackground thread which calls Fail() which transitions to the eFinished state at the same time that the sleeping thread is on the QM's IO thread (in the eReadyToReadMetadata state).  So when the sleep() wakes up, the ParentRunnable is dispatched to the PBackground thread despite being in the eFinished state.  (This is a different kind of assert/failure than we're seeing here but I assume it's all the same underlying race condition causing both.)

I think the fix is just to give ParentRunnable a lock, taken by ParentRunnable::Run(), ActorDestroy(), and probably Recv__delete__().  I'll try that.
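The mutex idea above can be sketched like this. It is a hypothetical simplification, not the patch as landed: both ActorDestroy() (on the PBackground thread) and Run() (on whichever thread the runnable is dispatched to) take the same lock before touching the state, so a transition to eFinished can't race with a dispatch.

```cpp
#include <mutex>

// Names loosely mirror dom/asmjscache/AsmJSCache.cpp but ToyState and
// ToyParentRunnable are simplified stand-ins for illustration.
enum ToyState { eReadyToReadMetadata, eFinished };

class ToyParentRunnable {
public:
  // Called from ActorDestroy() on the PBackground thread.
  void Fail() {
    std::lock_guard<std::mutex> guard(mMutex);
    mState = eFinished;
  }
  // Called at the top of Run(): bail out instead of dispatching or sending
  // when the actor has already transitioned to eFinished.
  bool MayProceed() {
    std::lock_guard<std::mutex> guard(mMutex);
    return mState != eFinished;
  }
private:
  std::mutex mMutex;
  ToyState mState = eReadyToReadMetadata;
};
```

With the lock, the sleep()-induced interleaving described above would see MayProceed() return false and skip the dispatch rather than sending on a dead actor.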
Attached patch fix-race — Splinter Review
A fix, manually tested by sticking sleep(5)s at various points and closing the tab.  I also upgraded the thread asserts so that, should the crash persist, we have more clues.
Assignee: nobody → luke
Attachment #8890604 - Flags: review?(jvarga)
Priority: -- → P2
Just came across this in triage - 56 and 57 are still affected. Any luck here (maybe for 57?)
Flags: needinfo?(luke)
Keywords: regression
I expect this is low-enough volume that it's not worth the risk of taking for 57, but let me know if you think otherwise.  Jan: polite review ping.
Flags: needinfo?(luke) → needinfo?(jvarga)
Oh, I see. We have this problem also in IDB, but it's solved w/o a mutex.

Basically there is |bool mActorDestroyed| but also |Atomic<bool> mOperationMayProceed|.
Then we have two methods:

bool
IsActorDestroyed() const
{
  AssertIsOnOwningThread();

  return mActorDestroyed;
}

// May be called on any thread, but you should call IsActorDestroyed() if
// you know you're on the background thread because it is slightly faster.
bool
OperationMayProceed() const
{
  return mOperationMayProceed;
}

Each method that is invoked from Run() for a given state calls IsActorDestroyed() or OperationMayProceed() to bail out early if the actor was destroyed in the meantime (by calling Fail() or FailOnNonOwningThread()).

I think ActorDestroy() could just set the bool variables and not call Close() or Fail() directly.
Flags: needinfo?(jvarga)
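As a self-contained restatement of the IDB pattern above, with std::atomic standing in for mozilla::Atomic; ToyActor is illustrative, not Gecko code:

```cpp
#include <atomic>

class ToyActor {
public:
  // Runs on the owning (PBackground) thread only.
  void ActorDestroy() {
    mActorDestroyed = true;        // plain bool, owning thread only
    mOperationMayProceed = false;  // atomic, readable from any thread
  }

  // Owning-thread check; cheaper because no atomic load is needed.
  bool IsActorDestroyed() const { return mActorDestroyed; }

  // May be called on any thread; use IsActorDestroyed() when you know
  // you're on the owning thread because it is slightly faster.
  bool OperationMayProceed() const { return mOperationMayProceed; }

private:
  bool mActorDestroyed = false;
  std::atomic<bool> mOperationMayProceed{true};
};
```

The appeal over a mutex is that readers on the main or IO thread never block: they just observe the atomic flag and bail out.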
Attached patch wip patch (obsolete) — Splinter Review
Comment on attachment 8913102 [details] [diff] [review]
wip patch

Review of attachment 8913102 [details] [diff] [review]:
-----------------------------------------------------------------

Something like this; note it's not complete, it's just an example.
Comment on attachment 8890604 [details] [diff] [review]
fix-race

Review of attachment 8890604 [details] [diff] [review]:
-----------------------------------------------------------------

I would try to avoid a mutex if possible. If Run() acquired the lock and is doing I/O, then if we get ActorDestroy() on the PBackground thread trying to acquire the same lock, it means that PBackground is basically blocked by I/O. There are some exceptions when we can tolerate I/O on the PBackground thread or synchronous blocking of the thread, but I think we can avoid that in this case.
Attachment #8890604 - Flags: review?(jvarga)
Comment on attachment 8913102 [details] [diff] [review]
wip patch

Review of attachment 8913102 [details] [diff] [review]:
-----------------------------------------------------------------

::: dom/asmjscache/AsmJSCache.cpp
@@ +841,5 @@
>  
>      case eWaitingToFinishInit: {
>        AssertIsOnOwningThread();
>  
> +      if (!OperationMayProceed() || QuotaManager::IsShuttingDown()) {

err, I meant IsActorDestroyed() here
Sounds like you have a pretty specific idea in mind; care to write it up?
Alternatively, I don't think this path is hot (at all), so the overhead of a mutex would be insignificant and we could just review the patch as is.  This code will probably live for <2 years, so it might be good to economize effort spent here.  WDYT?
Flags: needinfo?(jvarga)
Oh hah, I didn't even notice you had a WIP.  Thanks!
Flags: needinfo?(jvarga)
Ok, I will see if I have cycles for this, if not, we can take the mutex.
This crash signature has been spiking on release/beta in the past few days.
Crash Signature: [@ mozilla::ipc::MessageChannel::CxxStackFrame::CxxStackFrame | mozilla::ipc::MessageChannel::Send | mozilla::dom::asmjscache::PAsmJSCacheEntryParent::SendOnOpenMetadataForRead] → [@ mozilla::ipc::MessageChannel::CxxStackFrame::CxxStackFrame | mozilla::ipc::MessageChannel::Send | mozilla::dom::asmjscache::PAsmJSCacheEntryParent::SendOnOpenMetadataForRead] [@ mozilla::ipc::MessageChannel::Send | mozilla::dom::asmjscache::PAsmJSCach…
Which route do we go here?
Flags: needinfo?(jvarga)
Luke - where do we stand on landing a fix?
Group: core-security
Flags: needinfo?(luke)
I'll resolve this today.
Flags: needinfo?(luke)
Crash Signature: mozilla::dom::asmjscache::PAsmJSCacheEntryParent::SendOnOpenMetadataForRead ] → mozilla::dom::asmjscache::PAsmJSCacheEntryParent::SendOnOpenMetadataForRead ] [@ mozilla::ipc::IProtocol::OtherPid ]
Ok, I now have cycles for this, I'm going to finish the WIP patch.
Flags: needinfo?(jvarga)
[Tracking Requested - why for this release]: Carrying over the nomination from the duped bug - Bug 1411721#c1.
Luke, I have patches ready.
You mentioned in comment 8 and comment 9 that you added some sleep(s) to reliably reproduce the race.
Would you have time to test it with my patches?
Flags: needinfo?(luke)
I'm unfortunately in a standards meeting all day and out of office tomorrow.  Also, I don't have the particular patch I used to test, but I think the details in comment 9 tell you where to put it.  As for a testing workload, you could use mega.nz (to repro the crash without the patches and then validate the fix).
Flags: needinfo?(luke)
Ok, thanks.
Ok, I added some sleeps and manually verified the fix.

It's now on try:
https://treeherder.mozilla.org/#/jobs?repo=try&revision=cc39c2f234fd0a1729463565b512ed330e46e3b2
Andrew, do you want to review this?
Patches are not yet attached, but you can check them on try, see previous comment.
If not, I guess Luke will take a look on Monday.
Flags: needinfo?(bugmail)
Fairly high crash volume on beta 57 (and release 56). Is this something you are hoping to land for the 57 release?  The next build will be the RC build next Monday.
(In reply to Liz Henry (:lizzard) (needinfo? me) from comment #36)
> Fairly high crash volume on beta 57 (and release 56). Is this something you
> are hoping to land for the 57 release?  The next build will be the RC build
> next Monday.

Yes, I'll try to land it before RC.
Attachment #8913102 - Attachment is obsolete: true
Attachment #8925009 - Flags: review?(bugmail)
Attachment #8925011 - Attachment is patch: true
Comment on attachment 8925009 [details] [diff] [review]
Part 1: Remove unused state

Review of attachment 8925009 [details] [diff] [review]:
-----------------------------------------------------------------

Restating: This is missed cleanup from Bug 1408333 which caused the state machine to skip over eBackgroundChildPending but didn't remove the enum state.
Attachment #8925009 - Flags: review?(bugmail) → review+
Comment on attachment 8925010 [details] [diff] [review]
Part 2: Fix incorrect FailOnNonOwningThread() calls

Review of attachment 8925010 [details] [diff] [review]:
-----------------------------------------------------------------

Restating: The FailOnNonOwningThread() calls were the wrong variant and should have been Fail() calls, consistent with the AssertIsOnOwningThread() above.
Attachment #8925010 - Flags: review?(bugmail) → review+
Comment on attachment 8925011 [details] [diff] [review]
Part 3: Allow sending of the __delete__ message in one direction only

Review of attachment 8925011 [details] [diff] [review]:
-----------------------------------------------------------------

Restating: This patch cleans up the destruction logic.

Before the patch:
- The PAsmJSCacheEntry protocol allows for deletion by either the parent or the child.
- The child asks the PBackground parent to create a PAsmJSCacheEntry actor in ChildRunnable::Run when in state eInitial.
- ParentRunnable::Fail triggers deletion in the event of failure to open and !mDeleteReceived (as set by Recv__delete__).
- ParentRunnable maintains an mDeleteReceived boolean independent of mActorDestroyed.
- ChildRunnable::RecvOnOpenMetadataForRead triggers deletion when !FindHashMatch
- ChildRunnable::Run triggers deletion when in state eClosing (as set by Close()).

After the patch:
- The PAsmJSCacheEntry protocol requires the parent to delete the child.
- PAsmJSCacheEntry gains a new child-to-parent Close() method.
- PAsmJSCacheEntry::SelectCacheFileToRead gains an overload via the union type OpenMetadataForReadResponse, which is used to signal an error when an AsmJSCacheResult is passed.
  - The AsmJSCacheResult type is only used to signal errors, as MOZ_ASSERTed in RecvSelectCacheFileToRead.
- Still: The child asks the PBackground parent to create a PAsmJSCacheEntry actor in ChildRunnable::Run when in state eInitial.
- ParentRunnable::Close triggers deletion if !mActorDestroyed (set by ActorDestroy) and sets state to eFinished.
- ParentRunnable::Fail triggers deletion if !mActorDestroyed (set by ActorDestroy) and sets state to eFinished.
- Control flow parity in the child:
  - ChildRunnable::RecvOnOpenMetadataForRead now sends an error overload to trigger Fail() in the parent to trigger protocol deletion.
  - ChildRunnable::Run's eClosing state now invokes SendClose() if !mActorDestroyed to trigger protocol deletion.
Attachment #8925011 - Flags: review?(bugmail) → review+
Bug 1411721 also seems to be a duplicate.
Is ESR affected?
Flags: needinfo?(luke)
I believe ESR is affected too.
Agreed.
Flags: needinfo?(luke)
Comment on attachment 8925012 [details] [diff] [review]
Part 4: Fix the actual race between ActorDestroy() on the PBackground thread and Run() on the I/O thread

Review of attachment 8925012 [details] [diff] [review]:
-----------------------------------------------------------------

Excellent block comment in ActorDestroy, thank you!

Restating:

ParentRunnable now understands two types of cancellation:
- Actor destruction, as Atomic<>ally tracked by mOperationMayProceed and exposed via OperationMayProceed().
  - By definition, set on the background/owning thread in ActorDestroy.
  - Really just for use by the main and IO threads, as it's just an atomic version of mActorDestroyed exposed by IsActorDestroyed().
  - Never accessed as part of any greater synchronization block, so effectively limited to being an optimization.
- QuotaManager shutdown, as indicated by its Client's IsShuttingDown* methods introduced in bug 1389561 part 5.
  - Set on the background/owning thread.
  - On the background threads, the (non-atomic) sInstance is safe and so is its (non-atomic) mShutdownRequested, falling back to QM::IsShuttingDown.
  - On non-background threads, it defers to QuotaManager::IsShuttingDown, which is thread-safe thanks to QM's Atomic gShutdown.

For the ActorDestroy path, the general idea is that either the state machine has already completed operation (it's in eOpened/mOpened==true) and Close() is appropriate, or the state-machine is still processing and it's up to the state machine to notice !OperationMayProceed/IsActorDestroyed() and divert to Fail() either directly (on owning thread) or indirectly via FailOnNonOwningThread() (on non-owning threads).  To this end, every state handler in ParentRunnable::Run other than eWaitingToOpenDirectory (I added a line comment) and eFailing (used by FailOnNonOwningThread()) has one of these stanzas.

This idiom explicitly corrects the "eSendingMetadataForRead" case which we're seeing so many of on crash-stats.

There's also a dispatch guard added and an additional shutdown check.

::: dom/asmjscache/AsmJSCache.cpp
@@ +996,5 @@
>  
>      case eWaitingToFinishInit: {
>        AssertIsOnOwningThread();
>  
> +      if (NS_WARN_IF(Client::IsShuttingDownOnBackgroundThread()) ||

Should this be added to eWaitingToOpenDirectory too?  There was no pre-existing QuotaManager::IsShuttingDown() check to transform in this manner, which makes sense because the QM was explicitly just created if we're in that state.  However, it's conceivable the ActorDestroy transition did/could occur in such a case.
Attachment #8925012 - Flags: review?(bugmail) → review+
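The per-state bail-out stanza described in the review above can be condensed into a sketch. This is hypothetical illustration, not the actual patch: the atomic globals stand in for the real Client/QuotaManager shutdown checks and the actor's OperationMayProceed() flag, and the outcomes stand in for the Fail()/FailOnNonOwningThread() diversions.

```cpp
#include <atomic>

std::atomic<bool> gShuttingDown{false};
std::atomic<bool> gOperationMayProceed{true};

enum ToyOutcome { kProceed, kFail, kFailOnNonOwningThread };

// Stanza for a state handled on the owning (PBackground) thread:
// divert to Fail() directly.
ToyOutcome OwningThreadState() {
  if (gShuttingDown || !gOperationMayProceed) {
    return kFail;
  }
  return kProceed;
}

// Stanza for a state handled on the main or QuotaManager I/O thread:
// divert indirectly, as FailOnNonOwningThread() does, by dispatching the
// failure back to the owning thread.
ToyOutcome NonOwningThreadState() {
  if (gShuttingDown || !gOperationMayProceed) {
    return kFailOnNonOwningThread;
  }
  return kProceed;
}
```

Each state handler in Run() opening with one of these checks is what closes the eSendingMetadataForRead window seen on crash-stats: the state machine notices the destroyed actor before it attempts to Send().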
Awesome, thanks for such a quick review and description of the changes. I'm going to request a sec-approval.
Flags: needinfo?(bugmail)
(In reply to Andrew Sutherland [:asuth] from comment #49)
> Should this be added to eWaitingToOpenDirectory too?  There was no
> pre-existing QuotaManager::IsShuttingDown() check to transform in this
> manner, which makes sense because the QM was explicitly just created if
> we're in that state.  However, it's conceivable the ActorDestroy transition
> did/could occur in such a case.

Yeah, I added it.
Attachment #8925228 - Attachment description: Part 4: Fix the actual race between ActorDestroy() on the PBackground thread and Run() on the I/O thread → Part 4: Prevent the state machine from continuing if QM is shutting down or the actor has been destroyed
Attachment #8925228 - Attachment filename: asmjs-race → s9
Comment on attachment 8925228 [details] [diff] [review]
Part 4: Prevent the state machine from continuing if QM is shutting down or the actor has been destroyed

[Security approval request comment]
How easily could an exploit be constructed based on the patch?
It would require quite thorough understanding of the code.

Do comments in the patch, the check-in comment, or tests included in the patch paint a bulls-eye on the security problem?
Check-in comments, no.
Comments in the code, slightly.

Which older supported branches are affected by this flaw?
All supported branches, including ESR52.

If not all supported branches, which bug introduced the flaw?

Do you have backports for the affected branches? If not, how different, hard to create, and risky will they be?
There are no backport patches yet, but this code hasn't been under heavy development, so there shouldn't be big conflicts; the backports should not be very risky.

How likely is this patch to cause regressions; how much testing does it need?
The asmjscache module contains mochitests, and these were run locally and on the try server; everything looks good. The risk of causing regressions is not very high.
Attachment #8925228 - Flags: sec-approval?
Comment on attachment 8925011 [details] [diff] [review]
Part 3: Allow sending of the __delete__ message in one direction only

[Security approval request comment]
See the sec-approval for Part 4
Attachment #8925011 - Flags: sec-approval?
Comment on attachment 8925010 [details] [diff] [review]
Part 2: Fix incorrect FailOnNonOwningThread() calls

[Security approval request comment]
See the sec-approval for Part 4
Attachment #8925010 - Flags: sec-approval?
Comment on attachment 8925009 [details] [diff] [review]
Part 1: Remove unused state

[Security approval request comment]
See the sec-approval for Part 4
Attachment #8925009 - Flags: sec-approval?
Attachment #8925009 - Flags: sec-approval? → sec-approval+
Attachment #8925010 - Flags: sec-approval? → sec-approval+
Attachment #8925011 - Flags: sec-approval? → sec-approval+
Comment on attachment 8925228 [details] [diff] [review]
Part 4: Prevent the state machine from continuing if QM is shutting down or the actor has been destroyed

sec-approval=dveditz
Attachment #8925228 - Flags: sec-approval? → sec-approval+
Group: core-security → dom-core-security
Comment on attachment 8925228 [details] [diff] [review]
Part 4: Prevent the state machine from continuing if QM is shutting down or the actor has been destroyed

Approval Request Comment
[Feature/Bug causing the regression]: The issue has existed for a long time; I think it just got more frequent with e10s.
[User impact if declined]: High volume of crashes.
[Is this code covered by automated tests?]: Yes, there are mochitests.
[Has the fix been verified in Nightly?]: It landed on m-c just 12 hours ago; we'll see in a day or two if the number of crashes decreases.
[Needs manual test from QE? If yes, steps to reproduce]: No
[List of other uplifts needed for the feature/fix]: None
[Is the change risky?]: Slightly risky.
[Why is the change risky/not risky?]: The fix is not trivial, but everything was double checked and heavily tested locally and on the try server.
[String changes made/needed]: None
Attachment #8925228 - Flags: approval-mozilla-beta?
Comment on attachment 8925011 [details] [diff] [review]
Part 3: Allow sending of the __delete__ message in one direction only

Approval Request Comment
See approval-mozilla-beta for Part 4
Attachment #8925011 - Flags: approval-mozilla-beta?
Comment on attachment 8925010 [details] [diff] [review]
Part 2: Fix incorrect FailOnNonOwningThread() calls

Approval Request Comment
See approval-mozilla-beta for Part 4
Attachment #8925010 - Flags: approval-mozilla-beta?
Comment on attachment 8925009 [details] [diff] [review]
Part 1: Remove unused state

Approval Request Comment
See approval-mozilla-beta for Part 4
Attachment #8925009 - Flags: approval-mozilla-beta?
Comment on attachment 8925228 [details] [diff] [review]
Part 4: Prevent the state machine from continuing if QM is shutting down or the actor has been destroyed

This is too late for 57. This will have to ride the train with 58.
Attachment #8925228 - Flags: approval-mozilla-beta? → approval-mozilla-beta-
Attachment #8925011 - Flags: approval-mozilla-beta? → approval-mozilla-beta-
Attachment #8925010 - Flags: approval-mozilla-beta? → approval-mozilla-beta-
Attachment #8925009 - Flags: approval-mozilla-beta? → approval-mozilla-beta-
This is currently the #3 UAF by crash volume in 57/58 (close behind the top 2)... can we consider it for a ride-along?  (And the fix would then get bake time on 58.)
Flags: needinfo?(sledru)
Flags: needinfo?(lhenry)
Redirecting to Ritu as she is the release owner.
Flags: needinfo?(sledru)
Flags: needinfo?(rkothari)
Flags: needinfo?(lhenry)
We decided not to uplift this into 57 RC. Depending on how this impacts us post 57 launch, we will consider taking this in a 57.x dot release after weighing in on the risks.
Group: dom-core-security → core-security-release
(In reply to Ritu Kothari (:ritu) from comment #67)
> We decided not to uplift this into 57 RC. Depending on how this impacts us
> post 57 launch, we will consider taking this in a 57.x dot release after
> weighing in on the risks.

This is currently in the range of 500 daily reports on release, accounting for 1.85% of browser crashes on 57.0.
Hi Luke, Jim, given that crash volume on this one, we'd like to consider uplifting this to 57.0.1. Could you please comment on the risk vs. reward here? If you also think it's a good idea, please nominate for uplift to moz-release. Thanks!
Flags: needinfo?(luke)
Flags: needinfo?(jmathies)
Jan wrote this patch and would be in a better position to comment.
Flags: needinfo?(luke) → needinfo?(jvarga)
Patches here depend on bug 1389561, so that one would have to be uplifted too.
Anyway, I think it's a good idea, to take them all.
Flags: needinfo?(jvarga)
Thanks Jan. I think the size of this uplift is really worrisome. Looking at the changes in this bug and bug 1389561, I am leaning towards wontfix'ing it for 57. Sylvestre, FYI.
ack
Flags: needinfo?(sledru)
Depends on: 1389561
Flags: needinfo?(jmathies)
If the crash is bad enough, another option we have for 57.0.1 is to simply disable the asm.js cache since it's not a critical optimization.  That's a one-line code change so very low risk.
The crash volume on 57 is high enough that a low-risk mitigation like disabling the asm.js cache sounds worth considering. Luke, can you please write that patch and put it up for release approval?
Approval Request Comment
[Feature/Bug causing the regression]: bug landed a long time ago, seems exacerbated by recent shift to e10s
[User impact if declined]: crashes
[Is this code covered by automated tests?]: yes
[Has the fix been verified in Nightly?]: no
[Needs manual test from QE? If yes, steps to reproduce]: no
[List of other uplifts needed for the feature/fix]: none
[Is the change risky?]: no
[Why is the change risky/not risky?]: trivial patch, just turns off a feature
Flags: needinfo?(luke)
Attachment #8934239 - Flags: review?(bbouvier)
Attachment #8934239 - Flags: approval-mozilla-release?
Comment on attachment 8934239 [details] [diff] [review]
disable-asmjscache

Review of attachment 8934239 [details] [diff] [review]:
-----------------------------------------------------------------

Ouch.
Attachment #8934239 - Flags: review?(bbouvier) → review+
The mitigation patch seems to have fallen through the cracks after being put up for m-r approval.  Are we making another 57 release?  Should this ride along?
Flags: needinfo?(ryanvm)
Flags: needinfo?(lhenry)
It did not really fall through any cracks - I was aware. The problem was other chemspill situations for both 57.0.3 and 57.0.4 during the holidays, so we could not risk taking other fixes. 

Ritu, over to you to either uplift or wontfix for 57. We could uplift it in case we end up with another dot release in the next week.  Since the first merge is this Thursday and we release 58 in 2 weeks it seems unlikely (but not impossible).
Flags: needinfo?(lhenry) → needinfo?(rkothari)
The likelihood of another dot release is slim. 

Hi Luke, Dan, Al, since this fix landed in 58, should we uplift the fix to ESR52.6? It was set to wontfix for some reason but I'd like to see it fixed in ESR52 if the risk is low. Thanks!
Flags: needinfo?(rkothari)
Flags: needinfo?(luke)
Flags: needinfo?(dveditz)
Flags: needinfo?(abillings)
(Forwarding ni? to patch author.)
Flags: needinfo?(luke) → needinfo?(jvarga)
If landing the 58 patch on ESR is hard/risky, we could also land my disable-asmjscache patch (comment 78) on ESR.
Flags: needinfo?(ryanvm)
I'd like it on ESR if possible but not in a dangerous sort of way.
Flags: needinfo?(abillings)
Whiteboard: [adv-main58+]
Component: IPC → JavaScript Engine
(In reply to Luke Wagner [:luke] from comment #83)
> If landing the 58 patch on ESR is hard/risky, we could also land my
> disable-asmjscache patch (comment 78) on ESR.

Given that AsmJSCache on ESR52 still uses QM's isApp flag and CacheMiss message, I think we should rather take Luke's patch for disabling the feature.
I'm not saying it's not possible to update these patches for ESR52, but it's not trivial, since the state machine differs a bit, so another round of testing would be required.
I'm busy with other stuff right now, but if there's someone who is able/willing to update and test original patches then I will be ok with that.
Flags: needinfo?(jvarga)
Comment on attachment 8934239 [details] [diff] [review]
disable-asmjscache

As m-r is 58 now, removing the approval-mozilla-release flag.
Attachment #8934239 - Flags: approval-mozilla-release?
This is topcrash #80 for ESR so it seems worthwhile to take the "disable asmjscache" patch there. The patch itself contains a log message like "disable cache for 57". Would we have to fix that? Does anything else in that patch need merging to the ESR branch?
Flags: needinfo?(dveditz) → needinfo?(luke)
Nothing else needed.  Here's an updated patch that displays "disabled in ESR 52".
Flags: needinfo?(luke)
Comment on attachment 8943414 [details] [diff] [review]
disable-asmjscache (ESR52)

[Approval Request Comment]
If this is not a sec:{high,crit} bug, please state case for ESR consideration: see comment 87
User impact if declined: see comment 87
Fix Landed on Version: would have landed in 57, but didn't
Risk to taking this patch (and alternatives if risky): low, turns off a feature
Attachment #8943414 - Flags: approval-mozilla-esr52?
Hi Dan, I've already initiated the ESR52.6 release build. Do you believe this is critical enough to start a build2? Please let me know.
Flags: needinfo?(dveditz)
The risk of an exploit is relatively low, but the enterprise folks will ask why this didn't land.
Flags: needinfo?(dveditz)
Flags: qe-verify-
Whiteboard: [adv-main58+] → [adv-main58+][post-critsmash-triage]
Comment on attachment 8943414 [details] [diff] [review]
disable-asmjscache (ESR52)

Low risk, sec-high, turns off a feature, ESR52+
Attachment #8943414 - Flags: approval-mozilla-esr52? → approval-mozilla-esr52+
And a follow-up Werror fix.
https://hg.mozilla.org/releases/mozilla-esr52/rev/a7c8e85285e2

As a QA note, Luke said we can verify this fix on ESR52 by doing the following:
* Visit http://kripken.github.io/ammo.js/examples/webgl_demo/ammo.html and confirm the "Disabled in ESR" message shows up in the console (it's a JS warning)
* Reload that page twice just to be sure
Flags: needinfo?(andrei.vaida)
Changing the qe-verify flag, based on Ryan's comment.
Flags: qe-verify- → qe-verify+
Flags: needinfo?(andrei.vaida)
This issue is verified fixed using 52.6.0 esr (BuildId:20180118122319) on Windows 10 64bit, macOS 10.12.6 and Ubuntu 16.04 64bit. The "Disabled in ESR" message shows up in the console.
Status: RESOLVED → VERIFIED
Flags: qe-verify+
Whiteboard: [adv-main58+][post-critsmash-triage] → [adv-main58+][adv-esr52.6+][post-critsmash-triage]
Group: core-security-release
You need to log in before you can comment on or make changes to this bug.