Closed Bug 1500805 Opened 11 months ago Closed 11 months ago

Improve middleman call robustness

Categories

(Core :: Web Replay, enhancement)

enhancement
Not set

Tracking

()

RESOLVED FIXED
mozilla65
Tracking Status
firefox65 --- fixed

People

(Reporter: bhackett, Assigned: bhackett)

References

Details

Attachments

(8 files)

Attached patch patchSplinter Review
The attached patch contains fixes for some issues with middleman calls and doing layout/painting after diverging from the recording.  This takes care of the two main sources of crashes related to these calls: dereferencing bogus pointers that originated from the recording (which can happen if there are missing middleman call hooks) and deadlocking when trying to take a lock held by an idle thread.  In both of these cases we should now report a repaint failure instead of hanging/crashing.  This patch also has some redirection fixes so that we can repaint successfully in more cases.
Keep track of whether threads should idle at a per-thread level, so that they can be selectively resumed.

Also, sorry about the review load in this bug, Nathan knows this code better but I can't r? him.  I suppose there should be a broader pool of people who are familiar with this stuff but I'm not sure who to ask.
Attachment #9018917 - Flags: review?(continuation)
After the main thread diverges from the recording, other threads continue executing until they get stuck somewhere: either hitting the end of their recording or trying to take a lock that the main thread is scheduled to take next.  If the main thread interacts with those other threads (posting runnables to them, for example) it could change their behavior in a way that causes them to have a mismatch with the recording.

This patch changes this situation so that all other threads enter their idle state when the main thread diverges from the recording, so that they won't execute anymore.  The compositor thread is allowed to diverge from the recording and resume execution if we get a repaint request.
Attachment #9018920 - Flags: review?(continuation)
If a thread which has diverged from the recording tries to take a lock held by an idle thread, treat this as an unhandled recording divergence and rewind to the last checkpoint instead of deadlocking.
Attachment #9018921 - Flags: review?(continuation)
When determining if an argument to a middleman call is a compile time constant string, improve checking so that we don't crash if the argument is a bogus pointer that originated from a call that is missing a middleman call hook.  We have to match against the internal layout of a compile time constant CFStringRef here, but this layout should remain stable in the future (these strings are baked into the executable itself).
Attachment #9018922 - Flags: review?(continuation)
Bug 1488808 changed the redirection for pthread_create so that it is supposed to tolerate being called after diverging from the recording, by not creating the thread and producing an error.  This change was ineffective, though, and we would trigger an unhandled recording divergence instead.  The attached patch makes a couple fixes so that we are able to correctly deal with this case.
Attachment #9018927 - Flags: review?(continuation)
Add middleman call hooks to several CoreGraphics APIs, so that they can be called after diverging from the recording.  These are mainly used for rendering custom fonts.
Attachment #9018928 - Flags: review?(continuation)
Fix a bug where when a call made in the middleman returned null, we wouldn't record the result in the right place and could crash later.  This also makes a couple minor code improvements.
Attachment #9018929 - Flags: review?(continuation)
Comment on attachment 9018917 [details] [diff] [review]
Part 1 - Allow idle threads to be selectively resumed.

Review of attachment 9018917 [details] [diff] [review]:
-----------------------------------------------------------------

erahm could probably also review thread stuff.

::: toolkit/recordreplay/Thread.h
@@ +295,5 @@
> +
> +  // Allow a single thread to resume execution.
> +  static void ResumeSingleIdleThread(size_t aId);
> +
> +  // Return whether this thread is the idle state entered after

Should this be "is in" instead of "is"?
Attachment #9018917 - Flags: review?(continuation) → review+
Comment on attachment 9018920 [details] [diff] [review]
Part 2 - Ensure non-main threads are idle when divergingn from the recording.

Review of attachment 9018920 [details] [diff] [review]:
-----------------------------------------------------------------

nit: "divergingn" in the summary.
Attachment #9018920 - Flags: review?(continuation) → review+
Attachment #9018921 - Flags: review?(continuation) → review+
Comment on attachment 9018922 [details] [diff] [review]
Part 4 - Watch for bogus pointers when checking for constant compile time strings.

Review of attachment 9018922 [details] [diff] [review]:
-----------------------------------------------------------------

::: toolkit/recordreplay/ProcessRedirectDarwin.cpp
@@ +702,5 @@
>    ConstantString,
>  };
>  
> +// Internal layout of a constant compile time CFStringRef.
> +struct CFConstantString {

I guess you need to do this because you need to access the raw char* to do the memory check, but the APIs for CFStringRef don't let you do that?

Is this code path only being hit when a call hook is missing, and thus we're not going to test it in automation? It would be good to have some kind of basic sanity testing that you can run this code and it will do something reasonable. I'm not sure how you'd right such a test.
Attachment #9018922 - Flags: review?(continuation) → review+
Attachment #9018927 - Flags: review?(continuation) → review+
Attachment #9018929 - Flags: review?(continuation) → review+
(In reply to Andrew McCreight [:mccr8] from comment #10)
> Comment on attachment 9018922 [details] [diff] [review]
> Part 4 - Watch for bogus pointers when checking for constant compile time
> strings.
> 
> Review of attachment 9018922 [details] [diff] [review]:
> -----------------------------------------------------------------
> 
> ::: toolkit/recordreplay/ProcessRedirectDarwin.cpp
> @@ +702,5 @@
> >    ConstantString,
> >  };
> >  
> > +// Internal layout of a constant compile time CFStringRef.
> > +struct CFConstantString {
> 
> I guess you need to do this because you need to access the raw char* to do
> the memory check, but the APIs for CFStringRef don't let you do that?
>
> Is this code path only being hit when a call hook is missing, and thus we're
> not going to test it in automation? It would be good to have some kind of
> basic sanity testing that you can run this code and it will do something
> reasonable. I'm not sure how you'd right such a test.

This code path is covered by tests: when a call which we need to perform in the middleman is being passed a compile time constant CFString (e.g. @"foo" in a .mm file) we will end up here.  The problem with missing call hooks is that we can't distinguish here between values that are compile time constant strings and values that were replayed when recording a call or objective C message that is missing a middleman call hook.  The latter values are unsafe to dereference, and the idea with this patch is to check if the value looks like a valid constant CFString, without crashing.  Before this patch we were using the normal APIs to test if the value is a CFString and extract its contents, but we don't have control over what memory those APIs will access and can't avoid crashing if the pointer is invalid.
I'm trying to understand what is going on in part 6, not being very familiar with these hooks. From the comment at the top of ProcessRedirect.h, it sounds like the default behavior is to save the data when recording, and return the saved data when replaying. What is the default behavior when these calls are made after we are diverged? Does it just error out (or whatever you call it when we give up and go back to the last checkpoint)? And that's why you need to add hooks to actually call these methods via some mechanism where it runs the call in the middleman?
(In reply to Andrew McCreight [:mccr8] from comment #12)
> I'm trying to understand what is going on in part 6, not being very familiar
> with these hooks. From the comment at the top of ProcessRedirect.h, it
> sounds like the default behavior is to save the data when recording, and
> return the saved data when replaying. What is the default behavior when
> these calls are made after we are diverged? Does it just error out (or
> whatever you call it when we give up and go back to the last checkpoint)?
> And that's why you need to add hooks to actually call these methods via some
> mechanism where it runs the call in the middleman?

Yes, this is correct.  A redirected function can only run in the middleman if it has a middleman call hook (redirection.mMiddlemanCall) specified under FOR_EACH_REDIRECTION.  By default, calling a function in the middleman will copy the values of all register arguments and return values, and the hook deals with any special handling needed for inputs and outputs that won't behave correctly with this naive copying --- buffer arguments, arguments or return values that are opaque pointers to system values (i.e. CFTypes), data stored on the stack, and so forth.
Comment on attachment 9018928 [details] [diff] [review]
Part 6 - Add middleman call hooks for some graphics APIs.

Review of attachment 9018928 [details] [diff] [review]:
-----------------------------------------------------------------

Thanks for the explanation.

::: toolkit/recordreplay/ProcessRedirectDarwin.cpp
@@ +388,5 @@
>    MACRO(CGContextScaleCTM, nullptr, nullptr, Middleman_UpdateCFTypeArg<0>) \
>    MACRO(CGContextTranslateCTM, nullptr, nullptr, Middleman_UpdateCFTypeArg<0>) \
> +  MACRO(CGDataProviderCreateWithData, RR_Compose<RR_ScalarRval, RR_CGDataProviderCreateWithData>, \
> +        nullptr, Middleman_CGDataProviderCreateWithData)         \
> +  MACRO(CGDataProviderRelease, nullptr, nullptr, nullptr, Preamble_Veto<0>) \

Does the veto here mean these objects are just leaked? Is this so they can be replayed again later without calling into the system?
Attachment #9018928 - Flags: review?(continuation) → review+
(In reply to Andrew McCreight [:mccr8] from comment #14)
> Comment on attachment 9018928 [details] [diff] [review]
> Part 6 - Add middleman call hooks for some graphics APIs.
> 
> Review of attachment 9018928 [details] [diff] [review]:
> -----------------------------------------------------------------
> 
> Thanks for the explanation.
> 
> ::: toolkit/recordreplay/ProcessRedirectDarwin.cpp
> @@ +388,5 @@
> >    MACRO(CGContextScaleCTM, nullptr, nullptr, Middleman_UpdateCFTypeArg<0>) \
> >    MACRO(CGContextTranslateCTM, nullptr, nullptr, Middleman_UpdateCFTypeArg<0>) \
> > +  MACRO(CGDataProviderCreateWithData, RR_Compose<RR_ScalarRval, RR_CGDataProviderCreateWithData>, \
> > +        nullptr, Middleman_CGDataProviderCreateWithData)         \
> > +  MACRO(CGDataProviderRelease, nullptr, nullptr, nullptr, Preamble_Veto<0>) \
> 
> Does the veto here mean these objects are just leaked? Is this so they can
> be replayed again later without calling into the system?

The veto means that no call will be sent to the middleman on this (or other release and retain calls).  The object will not be leaked, though: the next time the middleman calls are reset, references will be released on all system objects which were created by previous calls, via Middleman_CFTypeOutput.  The main reason we do things this way is to avoid leaking objects if the replaying process hits an unrecoverable failure and has to rewind before it finishes repainting.
Pushed by bhackett@mozilla.com:
https://hg.mozilla.org/integration/mozilla-inbound/rev/25e1e5dcacb2
Part 1 - Allow idle threads to be selectively resumed, r=mccr8.
https://hg.mozilla.org/integration/mozilla-inbound/rev/20691f6e5da8
Part 2 - Ensure non-main threads are idle when diverging from the recording, r=mccr8.
https://hg.mozilla.org/integration/mozilla-inbound/rev/ac631ceaa2b9
Part 3 - Rewind instead of deadlocking when taking a lock held by an idle thread, r=mccr8.
https://hg.mozilla.org/integration/mozilla-inbound/rev/9f6bb5ca3b37
Part 4 - Watch for bogus pointers when checking for constant compile time strings, r=mccr8.
https://hg.mozilla.org/integration/mozilla-inbound/rev/22e55baef1fc
Part 5 - Fix handling when creating threads after diverging from the recording, r=mccr8.
https://hg.mozilla.org/integration/mozilla-inbound/rev/0b8359215f5b
Part 6 - Add middleman call hooks for some graphics APIs, r=mccr8.
https://hg.mozilla.org/integration/mozilla-inbound/rev/31b1aec1f478
Part 7 - Avoid crashing after a middleman call returns a null value, r=mccr8.
Depends on: 1503055
Depends on: 1504285
You need to log in before you can comment on or make changes to this bug.