Allow all threads to diverge from the recording when repainting

RESOLVED FIXED in Firefox 65

Status

()

enhancement
RESOLVED FIXED
8 months ago
7 months ago

People

(Reporter: bhackett, Assigned: bhackett)

Tracking

unspecified
mozilla65
Points:
---
Dependency tree / graph

Firefox Tracking Flags

(firefox65 fixed)

Details

Attachments

(9 attachments)

Assignee

Description

8 months ago
Posted patch patchSplinter Review
When doing a repaint we currently only allow the main thread and compositor thread to diverge from the recording.  Other threads remain idle at wherever they are in the recording when the repaint occurred.  This is a problem when the main thread or compositor thread tries to take a lock which is held by one of the idle threads; bug 1488808 part 16 adds a couple of ugly bits of instrumentation to avoid some of these situations, and several parts of bug 1500805 are in place to avoid deadlocking the process when this situation occurs.  We will still fail to repaint in such situations, with the result for the user being more or less random failures as they are trying to use the tool.  With the understanding that failed repaints should not be tolerated and eventually be driven to zero, this situation can't continue.

I think the only real option to fix this is to allow all threads --- and not just the main thread and compositor thread --- to diverge from the recording.  Instead of letting an idle thread hold a lock necessary for the main/compositor thread to proceed, that thread will go about its business and eventually release the lock, and the repaint can finish without deadlocking.  There are a couple complications with this approach:

- We need to support additional functions being called after diverging from the recording, according to the behavior of the threads that will now be live.  From the testing I've done this set of additional functions is fairly small.

- All other threads can now perform unrecorded waits.  Currently we have to deal with such threads using the NotifyUnrecordedWait() public API.  Instead of requiring this API to be used for every thread, this patch removes the public API entirely and handles this issue internally within the redirections themselves.  I've tried to do this in the past but ran into issues around lock ordering when checking whether to idle before blocking on a condition variable.  There is a way around this, though, taking advantage of the fact that condition variables are allowed by their spec to have spurious wakeups.

16 files changed, 148 insertions(+), 273 deletions(-)
Assignee

Comment 1

8 months ago
Remove NotifyUnrecordedWait and MaybeWaitForCheckpointSave from the public API, and do some reshuffling so they can be used internally when a thread does a non-recorded wait on a condvar.
Attachment #9021644 - Flags: review?(nfroyd)
Assignee

Comment 2

8 months ago
Remove the now-unnecessary call to NotifyUnrecordedWait in MessagePumpDefault.
Attachment #9021647 - Flags: review?(nfroyd)
Assignee

Comment 3

8 months ago
JS helper threads don't need instrumentation anymore when blocking on condition variables.
Attachment #9021649 - Flags: review?(jdemooij)
Assignee

Comment 4

8 months ago
When repainting, instruct all other threads to diverge from the recording, and not just the compositor thread.  This also has a couple related fixes:

- Support contending threads simultaneously trying to send calls to the middleman.

- Remove the rather hackneyed technique used to keep the compositor thread from sending calls to the middleman after the repaint has finished.  Instead, the compositor thread and all other non-main threads are returned to the idle state before the main thread finishes the repaint.
Attachment #9021652 - Flags: review?(continuation)
Assignee

Comment 5

8 months ago
Support some redirections that might be called by non-main/compositor threads after diverging from the recording.  Several of these are blocking APIs where we instruct the thread to just wait forever (or rather, until the next rewind erases all changes that happened after diverging from the recording).
Attachment #9021655 - Flags: review?(nfroyd)
Comment on attachment 9021652 [details] [diff] [review]
Part 4 - Allow all threads to diverge from the recording when repainting.

Review of attachment 9021652 [details] [diff] [review]:
-----------------------------------------------------------------

::: toolkit/recordreplay/ipc/ChildIPC.cpp
@@ +586,5 @@
> +    gDidRepaint = true;
> +
> +    // Allow other threads to diverge from the recording so the compositor can
> +    // perform any paint we are about to trigger, or finish any in flight paint
> +    // that that existed at the point we are paused at.

nit: the rewrapping of the comment has revealed that there's a double "that".
Attachment #9021652 - Flags: review?(continuation) → review+
Assignee

Comment 7

8 months ago
Fix a spot where part 5 breaks a record/replay specific thread.

Part 5 handles read() calls that occur after diverging from the recording by erroring out with EAGAIN.  This is fine for non-blocking file descriptors, but file descriptors which block won't actually generate this error in practice.  The thread spawned to listen for checkpoint creation messages doesn't handle this case, and needs this special handling here.

An alternative that would eliminate the need for this patch would be to WaitForever() in read() calls on blocking file descriptors.  We can't do that for all read() calls because a thread might do a non-blocking read while holding a lock which other threads need, which would lead to a deadlock.  Keeping track of which file descriptors in the recording are non-blocking would require extra bookkeeping.  I'm hoping to avoid that, but if examples like this start turning up in Gecko instead of record/replay specific code, that bookkeeping might be necessary.
Attachment #9021660 - Flags: review?(nfroyd)
Assignee

Comment 8

8 months ago
Remove the instrumentation from Bug 1488808 Part 16, which is not necessary anymore now that all threads diverge from the recording when repainting.
Attachment #9021663 - Flags: review?(nfroyd)
Attachment #9021649 - Flags: review?(jdemooij) → review+
Assignee

Comment 9

8 months ago
Followup fix for a problem resulting from part 4 that showed up while testing.  When a child process is recovering we expect it to send the same messages to the middleman, in the same order.  Middleman call requests now break this property, since they could be sent in different orders by different threads, or different sets of call requests could be sent if threads begin diverging from the recording at different points.  Naively attempting to send the same call responses that were sent earlier can lead to the child getting responses that don't match up with the request it made.

The attached patch addresses this by handling middleman call requests in the normal fashion, even when recovering.
Attachment #9022019 - Flags: review?(continuation)
Attachment #9022019 - Flags: review?(continuation) → review+
Assignee

Comment 10

8 months ago
Comment on attachment 9021644 [details] [diff] [review]
Part 1 - Handle unrecorded waits internally.

I haven't been able to reach Nathan and he's probably busy with other stuff, so changing review to get this bug unblocked.
Attachment #9021644 - Flags: review?(nfroyd) → review?(lsmyth)
Assignee

Updated

8 months ago
Attachment #9021647 - Flags: review?(nfroyd) → review?(continuation)
Assignee

Updated

8 months ago
Attachment #9021655 - Flags: review?(nfroyd) → review?(lsmyth)
Assignee

Updated

8 months ago
Attachment #9021660 - Flags: review?(nfroyd) → review?(lsmyth)
Assignee

Updated

8 months ago
Attachment #9021663 - Flags: review?(nfroyd) → review?(continuation)
Attachment #9021647 - Flags: review?(continuation) → review+
Attachment #9021663 - Flags: review?(continuation) → review+
Attachment #9021644 - Flags: review?(lsmyth) → review+
Attachment #9021655 - Flags: review?(lsmyth) → review+
Attachment #9021660 - Flags: review?(lsmyth) → review+

Comment 11

7 months ago
Pushed by bhackett@mozilla.com:
https://hg.mozilla.org/integration/mozilla-inbound/rev/082a5c062dd0
Part 1 - Handle unrecorded waits internally, r=lsmyth.
https://hg.mozilla.org/integration/mozilla-inbound/rev/686f9f2d649c
Part 2 - Remove NotifyUnrecordedWait use in MessagePumpDefault, r=mccr8.
https://hg.mozilla.org/integration/mozilla-inbound/rev/9d56cb4ea09e
Part 3 - Remove NotifyUnrecordedWait use in JS helper threads, r=jandem.
https://hg.mozilla.org/integration/mozilla-inbound/rev/b7b13712a752
Part 4 - Allow all threads to diverge from the recording when repainting, r=mccr8.
https://hg.mozilla.org/integration/mozilla-inbound/rev/f3a5ebd83214
Part 5 - Support some new redirections after diverging from the recording, r=lsmyth.
https://hg.mozilla.org/integration/mozilla-inbound/rev/8efe543a747c
Part 6 - Handle read() failures in the checkpoint listening thread, r=lsmyth.
https://hg.mozilla.org/integration/mozilla-inbound/rev/4319d0b6443b
Part 7 - Remove instrumentation used to avoid taking locks after diverging from the recording, r=mccr8.
https://hg.mozilla.org/integration/mozilla-inbound/rev/c4b05c2c7eed
Part 8 - Handle middleman call messages directly when recovering, r=mccr8.
You need to log in before you can comment on or make changes to this bug.