Random crashes with vsync refresh driver enabled

RESOLVED WORKSFORME

Status

()

defect
--
major
RESOLVED WORKSFORME
4 years ago
4 years ago

People

(Reporter: mchang, Assigned: mchang)

Tracking

({crash})

Firefox Tracking Flags

(Not tracked)

Details

(crash signature)

Attachments

(6 attachments, 18 obsolete attachments)

1.61 KB, patch
kats
: review+
Details | Diff | Splinter Review
145.13 KB, text/plain
Details
143.80 KB, text/x-log
Details
1.54 KB, text/plain
Details
4.05 KB, text/plain
Details
2.47 KB, text/plain
Details
There are random, timing specific crashes when we enable the vsync refresh driver. These occur across platforms. These are bugs 1131080, 1133865, and https://crash-stats.mozilla.com/report/index/7c23e927-c2b9-4f80-b225-ec7302150302. They all probably have the same root cause, which is the vsync refresh driver crashes on startup.
Assignee

Updated

4 years ago
See Also: → 1131080, 1133865
Assignee

Updated

4 years ago
Duplicate of this bug: 1131080

Updated

4 years ago
Crash Signature: [@ mozilla::gfx::VsyncSource::GetRefreshTimerVsyncDispatcher() ]
From the crash report in https://bugzilla.mozilla.org/show_bug.cgi?id=1139090#c2. It's crashing from VsyncParent::Create in calling gfxPlatform. I can only imagine this would happen if somehow the content process loaded before the main thread in the parent process ticked the refresh driver. Also, I don't think it's valid to get the gfxPlatform off main thread in general. This patch posts a task to the main thread to get the vsync refresh driver timer during the VsyncParent creation.
Attachment #8572246 - Flags: review?(bent.mozilla)
I think we also need to add the thread checking for the first time we create the gfxPlatform. We might call GetPlatform() off-main thread, but we should call GetPlatform() after we init the platform at main thread. This checking can't catch all the race condition since we don't use mutex for "gPlatform".
We create the gPlatform at [1] and we access it at [2].

[1]
https://hg.mozilla.org/mozilla-central/annotate/c5b90c003be8/gfx/thebes/gfxPlatform.cpp#l480
[2]
https://hg.mozilla.org/mozilla-central/annotate/c5b90c003be8/gfx/thebes/gfxPlatform.cpp#l416
Comment on attachment 8572246 [details] [diff] [review]
Init VsyncParent on main thread

Review of attachment 8572246 [details] [diff] [review]:
-----------------------------------------------------------------

::: layout/ipc/VsyncParent.cpp
@@ +52,5 @@
> +  mVsyncDispatcher = vsyncSource->GetRefreshTimerVsyncDispatcher();
> +
> +  if (mObservingVsync) {
> +    mVsyncDispatcher->AddChildRefreshTimer(this);
> +  }

We might get race when we receive message during InitOnMainThread.

How about this?
VsyncParent::RecvObserve()
{
  mutex lock(mCreateLock);
  ...
}

VsyncParent::InitOnMainThread()
{
  //init mVsyncDispatcher
  .....

  mutex lock(mCreateLock);
  if (mObservingVsync) {
    mVsyncDispatcher->AddChildRefreshTimer(this);
  }
}
Stepping back for a moment, the problem is that we're crashing when trying to read the VsyncSource::RefreshTimerVsyncDispatcher during startup. This is actually only created during gfxPlatform::Init, which is supposed to be run on the main thread. It is OK to call gfxPlatform::GfxPlatform() Off main thread, but it is NOT ok to call gfxPlatform::Init off main thread. Thus, the first call to gfxPlatform::gfxPlatform must be on main thread. The ONLY other places where we call GetHardwareVsync() off main thread is in VsyncParent::Create and HwcComposer::Vsync on b2g. However, HwcComposer::Vsync is guaranteed to be called AFTER gfxPlatform::Init() is called on the main thread because we don't actually enable Vsync until gfxPlatform::init on b2g. The b2g vsync callback is disabled on startup and only turned on during gfxPlatform::init.

What this means then, is that these crashes can occur in the unlikely case that gfxPlatform::Init is called at the same time on two threads: main thread and PBackground thread. We could init incorrectly and not have the hardware vsync source. For example, at [1], we could set the flag to true on the main thread, not finish initializing when the background thread tries to read the vsync source and crash. I think the overall message is that we should move reading the gfxPlatform::GetPlatform() VsyncParent::Create() onto the main thread.

We could add a few locks, but then it is kind of ugly because the child could send a SendObserve vsync message before the main thread gets to init gfxPlatform, so we'd keep having these checks to see if the parent is initialized. I think it might be better to let the VsyncChild/VsyncParent KNOW it is working with a RefreshTimerVsyncObserver, rather than just a generic VsyncObserver. Then, once the ActorCreate occurs on the child process, we can send a notification to the parent process that the child is setup. Then the parent can init on the main thread and send a message back to the child. So the order would be:

1) ChildActor sets up, sends a message to Parent
2) Parent inits gfxPlatform on main thread, sends a message to child it finished init
3) Child finishes init and swaps refresh drivers.

If we only move the VsyncParent::Create just onto the main thread without some sync code, we have a race condition in the parent process between the background thread receiving an ObserveVsync message and the main thread initializing gfxPlatform. Then we have to keep a flag that says we should start sending messages once the main thread finishes initializing. The child process would also have to wait for the parent process' main thread to finish, which would hurt start up time. Instead, the child process with this proposal could keep ticking with software timers until the main thread parent process finishes. This would give a strict ordering on when the parent / child could start sending messages to each other.

Another nice benefit from letting the VsyncParent/Child know it is working with the refresh driver is that we can delete the inner VsyncObserver within VsyncRefreshDriverTimer[2] and instead have the VsyncRefreshDriverTimer inherit from a RefreshTimerVsyncObserver that isn't refcounted. There is still a little bit of ugliness because the vsync observer would not be refcounted since everything is not refcounted in the refresh driver. I think this would be ok since everything is alive throughout the whole process and is already cleaned up through ActorDestroy and nsRefreshDriver::Shutdown. I'm still not sure this is actually better than adding a lock, but conceptually it's cleaner.

The really weird thing though about this bug is that I would've expected one of the nsRuntimeAborts in gfxPlatform to trigger, but maybe both threads entered gfxPlatform::init at the same time. 

Attached are WIP to do three things:

1) Delete the RefreshDriverVsyncObserver
2) Create the IPC calls to initialize the parent and child and sync across both of them. Once the child and parent are initialized, the child sends a message to the parent. The parent inits gfxPlatform on the main thread, then sends a message back to the child. The child then swaps refresh drivers. Until then, the child continues to tick on software timers.
3) Create a new RefreshTimerVsyncObserver, which is not refcounted. Then the VsyncRefreshDriverTimer inherits from this, no longer needs an inner class to listen to vsync, and ticks on vsync messages. The lifetime is the same as it is destroyed in nsRefreshDriver::Shutdown.

I'm still not sure we want to do this, but still churning in my head.

[1] https://dxr.mozilla.org/mozilla-central/source/gfx/thebes/gfxPlatform.cpp?from=gfxPlatform.cpp#457
[2] https://dxr.mozilla.org/mozilla-central/source/layout/base/nsRefreshDriver.cpp?from=nsRefreshDriver.cpp#312
Attachment #8572246 - Attachment is obsolete: true
Attachment #8572583 - Attachment is obsolete: true
Attachment #8572246 - Flags: review?(bent.mozilla)
Assignee

Updated

4 years ago
Depends on: 1139726
Hmm, even artificially holding the main thread and yielding() at gfxPlatform::init doesn't crash. The VsyncParent is always created much later on OS X. Will try on other platforms later.
I can't reproduce this on b2g either, even with artificially holding up the main thread. Maybe there is a race condition between creating the content process and initializing the refresh driver. Still digging, but I do think we should still never read GetVsyncSource() off main thread in general, since all the enable/disable vsync stuff must occur on the main thread.
I was discussing this with Mason on IRC. Personally I think it's highly unlikely that this crash is caused by off-main-thread call of gfxPlatform::Init, because (1) there's a very narrow window in which it could happen, and it would have to be concurrent with a main-thread call of gfxPlatform::Init, (2) if that were the case I would expect to see a higher volume of crashes from [1], and (3) of the 8 crash reports on crash-stats within the last 28 days, 6 of them have uptimes of over a minute (in some cases significantly higher) so this isn't really a startup crash.

I think rather than jumping in and making changes we should wait until we have more data and/or can reproduce this to figure out a proper fix.

[1] http://mxr.mozilla.org/mozilla-central/source/gfx/thebes/gfxPlatform.cpp?rev=3da36cdb4198#456
Add some assertions to ensure that the vsync source is initialized. Just some checks in case to ensure we initialized the vsync source correctly.
Attachment #8572986 - Attachment is obsolete: true
Attachment #8574133 - Flags: review?(bugmail.mozilla)
Attachment #8574133 - Flags: review?(bugmail.mozilla) → review+
Assignee

Updated

4 years ago
See Also: → 1140978
I'd really like to revisit NUWA here and see how we're initializing and cloning PBackground correctly with NUWA.

Updated

4 years ago
Keywords: crash
Posted file Full Crash Stack
From https://crash-stats.mozilla.com/report/index/f8c4b6ae-b9ff-43ad-872c-768e52150309

Chatting with nhirata, having a sim card might make this occur, which would explain why I can't reproduce at all. Waiting on confirmation.
From irc, address 0x5a5a5a5a means jemalloc poison, so either memory mismanagement or memory corruption.
Just a summary, there seems to be some intermittent that occurs with or without the sim card + sd card. It doesn't seem to be carrier related at all. Crashes occur without any reference workload. It mostly looks like we create a new process in the background and then it crashes for some reason, so probably related to NUWA. I'd like to really dig into the NUWA process and see what's supposed to happen (https://bugzilla.mozilla.org/show_bug.cgi?id=1092978#c186). 

It also looks like this crash occurs either all the time or never. The intermittent seems to be about 2-3 minutes between crashes. From the crash signature, we're deallocating some memory somewhere and accessing it incorrectly. 

Naoki and Jayme Mercado tested a custom build with the patches attached to this bug where we make explicit the startup of the Vsync Child/Parent. Both were able to crash with these patches.
While running a debug build, I see this a lot:

I/Gecko   (  208): [208] WARNING: A runnable was posted to a worker that is already shutting down!: file ../../../dom/workers/WorkerPrivate.cpp, line 2520
I/Gecko   (  208): [208] WARNING: Failed to update worker languages!: file ../../../dom/workers/WorkerPrivate.cpp, line 3295
Assignee

Updated

4 years ago
Depends on: 1141018
Posted file First crash
Was finally able to reproduce, reflashed everything. STR:

1) Redo v18-D full flash
2) Reset to 319 MB flame
3) Build and flash Gecko: 6e5e93073a9e11ff531570b3571f1136fde04255
4) Use lots of memory - take a video, picture, browse the web
5) Keep using more memory, connect to wifi, etc. Wait for crash
The kicker was to change to 319mb flame. Seems like we should handle a process getting killed.
While connecting to wifi, lots of crashes in adb logcat due to NUWA crashing. We're probably sending vsync messages when we shouldn't be.
Per the discussion with Jerry offline, the report in comment #15 looks like a use-after-free crash. 0x5a5a5a6e is dereferencing file_descriptor_set_ of a dead IPC::Message. It's still unclear how come the IPC channel can get a deleted message from its output queue and use it. My suggestion is we log the address of 'this' and the stack in Message's dtor, and before we use it in ChannelImpl::ProcessOutgingMessages(). Then we can compare to see how this message got freed.
(In reply to Mason Chang [:mchang] from comment #22)
> Waiting on bug 1141316 to be able to download the crash dumps :/

I'm not sure how these dumps will help you, but I downloaded them and emailed them to you.
(In reply to Cervantes Yu from comment #23)
> Per the discussion with Jerry offline, the report in comment #15 looks like
> a use-after-free crash. 0x5a5a5a6e is dereferencing file_descriptor_set_ of
> a dead IPC::Message. It's still unclear how come the IPC channel can get a
> deleted message from its output queue and use it. My suggestion is we log
> the address of 'this' and the stack in Message's dtor, and before we use it
> in ChannelImpl::ProcessOutgingMessages(). Then we can compare to see how
> this message got freed.

The test patch is already at:
https://bugzilla.mozilla.org/show_bug.cgi?id=1140978#c29
https://bugzilla.mozilla.org/show_bug.cgi?id=1140978#c30
Thanks to Cervantes for going through the ipc message flow.
this patch is from https://bugzilla.mozilla.org/show_bug.cgi?id=1140978#c30
use this patch to check IPC::Message object use-after-free problem
this patch is from https://bugzilla.mozilla.org/show_bug.cgi?id=1140978#c31

It creates refresh timer in ctor instead of the first call of ChooseTimer().
Posted file Log so far (obsolete) —
From some initial investigation, it looks like we have the same message on the ChannelImpl::output_queue_. Then after we process the message, we delete it, but it gets popped again off the queue.
About as far as I can get is that during process creation, a message is somehow in the channel and is being deleted. The message gets deleted quite a few times, still not sure how. But this deleted message is left in one of the channel's output_queue_ [1]. When we finally get to ProcessingOutputMessages, i t's already been deleted, so we crash.

[1] https://dxr.mozilla.org/mozilla-central/source/ipc/chromium/src/chrome/common/ipc_channel_posix.h#105
Attachment #8575691 - Attachment is obsolete: true
Best steps to STR:

1) Set flame to 319 MB
2) Go through FTU, don't connect to wireless
3) Load every app from the top down. e.g. load contacts, then messages, then phone, then browser, etc. 
4) Keep loading apps until you get to settings. Then connect to wifi.
5) Keep loading new apps such as email, clock, etc. 
6) You will have probably crashed by now. If not, restart the phone and do it again. I can reproduce 4/5 times this way.
My device is 319m v18d-1 flame. Thanks to Mason, I can get the crash with the STR in comment 30. I have a very interesting log.
When we push a new message into the queue "output_queue_"(the type is std::queue<Message*>), the size of queue is not correct. The original queue is empty, but the size becomes 2 when we just push one data into it. In addition, the last data in queue is 0x5a5a5a5a. That's why we got 0x5a5a5a5a from queue.

log:
I/Gecko   ( 3841): bignose message queue size error, orig:0, cur:2, front data:0xb2e2f5e0, back data:0x5a5a5a5a
print the last message name before crash.
I got the same message Background::Msg_PVsyncConstructor for 3 time. It's used for vsync protocol creation.
Attachment #8575931 - Attachment is obsolete: true
Hi Ben,

There is a weird error in ipc outgoing message queue. The size is not correct when we push a new object(please check comment 31 and comment 32).
I create the BackgroundChild at [1]. If we create background at other thread before this call(e.g. if we call GetForCurrentThread() at other thread), do we still create an individual backgroundChild for refresh driver? All vsyncChild message and nsIIPCBackgroundChildCreateCallback have main thread checking. Does it have thread racing for the outgoing message queue?
Or do we have memory allocation problem for the std::queue?

[1]
https://hg.mozilla.org/mozilla-central/annotate/6686aacf006f/layout/base/nsRefreshDriver.cpp#l829
Flags: needinfo?(bent.mozilla)
Add more log in OutputQueuePush() function.

I have another weird log in OutputQueuePush(). We change the output_queue_length_ value only at io thread, but I got a mismatch value between output_queue_length_ and the output_queue_.size().

Log:
I/Gecko   (26099): bignose message queue size error, orig:0, cur:2, queue_length:2, front data:0xb2e2d460, back data:0x5a5a5a5a

The original queue size is 0. I think output_queue_length_ should be 1, but i got 2 here.
Attachment #8575973 - Attachment is obsolete: true
(In reply to Jerry Shih[:jerry] (UTC+8) from comment #33)
> Hi Ben,
> 
> There is a weird error in ipc outgoing message queue. The size is not
> correct when we push a new object(please check comment 31 and comment 32).
> I create the BackgroundChild at [1]. If we create background at other thread
> before this call(e.g. if we call GetForCurrentThread() at other thread), do
> we still create an individual backgroundChild for refresh driver? All
> vsyncChild message and nsIIPCBackgroundChildCreateCallback have main thread
> checking. Does it have thread racing for the outgoing message queue?
> Or do we have memory allocation problem for the std::queue?
> 
> [1]
> https://hg.mozilla.org/mozilla-central/annotate/6686aacf006f/layout/base/
> nsRefreshDriver.cpp#l829

Us breaking the queue sounds very weird. Us also breaking IPC sounds weird. I traced it to this [1], which sends the VsyncConstructor before the IPC channel gets a chance to reset all the file descriptors. If we also use the callback mechanism [2], I can't crash right now.

[1] https://dxr.mozilla.org/mozilla-central/source/layout/base/nsRefreshDriver.cpp?from=nsRefreshDriver.cpp&case=true#833
[2] https://dxr.mozilla.org/mozilla-central/source/layout/base/nsRefreshDriver.cpp?from=nsRefreshDriver.cpp&case=true#837
If we ALWAYS use the callback mechanism at [2] in comment 35, I can't crash.
There is a race condition during the process startup. The hello message[1] is ALWAYS sent on the main thread from what I can tell. All messages to the IPC thread in the content process do not start until AFTER the hello message is sent in the working cases. The hello message is also a UniquePtr. After the hello message is sent, all IPC messages are sent on the IPC thread. The output queue [3] is not thread safe.

In our case, the PVsyncConstructor sends an IPC message to construct the Vsync Child/Parent pair [2], which occurs on the IPC thread. This message is sometimes queued on the IPC channel's output queue at the exact same time the main thread is queuing up the hello message. After the IPC thread queues the vsync constructor message, the hello message from the main thread is deleted memory. I need to read up on unique_ptr and auto_ptr to understand why this gets deleted, but we have a race condition here.

In the working cases, the PVSync constructor IPC message doesn't get sent/queued until AFTER the ContentChild finishes initialization[4]. There is probably a switch somewhere during the initialization process that switches IPC messages on those channels to always send messages on the IPC thread rather than the main thread OR, only the hello message is ever expected to occur on the main thread. 

I also want to check to ensure that the background child is setup [5] and when is the IPC thread safe to start sending messages. Or is the IPC thread always cloned with NUWA?

Cervantes, does any of this make sense? Thanks!

[1] https://dxr.mozilla.org/mozilla-central/source/ipc/chromium/src/chrome/common/ipc_channel_posix.cc?from=EnqueueHelloMessage&case=true#373
[2] https://dxr.mozilla.org/mozilla-central/source/layout/base/nsRefreshDriver.cpp?from=nsrefreshDriver.cpp&case=true#833
[3] https://dxr.mozilla.org/mozilla-central/source/ipc/chromium/src/chrome/common/ipc_channel_posix.cc#886
[4] https://dxr.mozilla.org/mozilla-central/source/dom/ipc/ContentChild.cpp?from=ContentChild.cpp&case=true#533
[5] https://dxr.mozilla.org/mozilla-central/source/dom/ipc/ContentChild.cpp?from=ContentChild.cpp&case=true#737
Flags: needinfo?(cyu)
Instead of checking if the background actor is ready, use the async callback, which is called AFTER the background thread is initialized. Not sure this is the right fix versus it just bought us enough time for the hello messages in the ipc channel to finish executing.
I think the proper answer is to probably wait for the OnChannelConnected (https://dxr.mozilla.org/mozilla-central/source/ipc/chromium/src/chrome/common/ipc_channel.h#30), which is called at the hello message is sent.
So I think we have a couple of options here that I can think.

1) Have the BackgroundChild not actually report as started up until the OnChannelConnected message has been received for all cloned channels. We would have to extend the BackgroundChild to listen to OnChannelConnected in the IPC channel. 
2) Have NUWA delay firing NuwaAddFinalConstructor until all IPC channels have been opened and acknowledged. I'm not sure this is in the domain of NUWA yet though. 
3) Change ipc_channel_posix to be thread safe.
4) Send the Hello messages on the iothread instead of the main thread and don't allow non-hello thread messages until after the ack.
From the log in comment #37 it appears that it's a race pushing messages to the output queue. BackgroundChild reports that it's up, but the underlying IPC channel is not fully cloned. In this case, NuwaAddFinalConstructor() is not suitable for you because you will send IPC messages to post a task to the IPC thread, which will race with CallNuwaSpawn():

https://dxr.mozilla.org/mozilla-central/source/dom/ipc/ContentChild.cpp#2418

I think using the callback will work because it dispatches a runnable. By the time the callback runs, the hello messages is guaranteed in the queue.

Changing the output queue to be thread-safe just avoids the crash, but isn't the right fix because PVSync's constructor should be sent *after* the hello message. Sending the hello message on the IO thread has a similar problem. So I would suggest delaying calling CreateContentVsyncRefreshTimer() to fix the issue.
Flags: needinfo?(cyu)
Bill, since IPC channel's output queue is not thread safe and we could enqueue the hello message on the main thread or the IPC thread, do we plan to make it thread-safe? Thanks.
Flags: needinfo?(wmccloskey)
We call all NuwaAddFinalConstructor() registered callbacks in [1], and then we reset channel fd at [2][A] on the main thread.
In NuwaAddFinalConstructor callback, we try to init the PVsyncChld protocol. That will post a task to ipc thread to enqueue a PVsyncConstructor message[B]. [A] enqueues a hello message on the mian thread, and [B] enqueues a PVsyncConstructor on ipc thead. Thus, we have a memory corruption in the outgoing buffer queue as comment 31.

[1]
https://hg.mozilla.org/mozilla-central/annotate/5334d2bead3e/dom/ipc/ContentChild.cpp#l2402
[2]
https://hg.mozilla.org/mozilla-central/annotate/5334d2bead3e/dom/ipc/ContentChild.cpp#l2418
Flags: needinfo?(bent.mozilla)
Just clean the rd header. The PVsyncActorCreated is only used by vsync-base timer. I think we don't need to show that in header file.
Attachment #8575377 - Attachment is obsolete: true
Attachment #8575379 - Attachment is obsolete: true
Attachment #8576114 - Attachment is obsolete: true
Post a runnable to init PVsync on the main thread.
With this patch, I don't get crash with the STR in comment 30 with 10 times test.
Attachment #8576656 - Flags: review?(cyu)
Comment on attachment 8576655 [details] [diff] [review]
Fix ipc thread crash part1: change nsRefreshDriver::PVsyncActorCreated() to internal function. v1

Just remove PVsyncActorCreated() from header.
Attachment #8576655 - Flags: review?(bugmail.mozilla)
(In reply to Cervantes Yu from comment #42)
> Bill, since IPC channel's output queue is not thread safe and we could
> enqueue the hello message on the main thread or the IPC thread, do we plan
> to make it thread-safe? Thanks.

The hello message is normally enqueued only during construction of the channel. There's no risk that the channel can be used by any other threads at that time. However, ResetFileDescriptor also enqueues a hello message, and it seems to happen during Nuwa startup, when the channel can already be shared by other threads. So this is a Nuwa-specific problem.

I'm worried that the patches here might fix vsync, but we'll have the same problem in other code. I know very little about Nuwa, but it seems like the file descriptors are getting reset too late. We should do it before we unfreeze threads. Could ContentChild register a callback via NuwaAddConstructor and reset the file descriptors there?
Flags: needinfo?(wmccloskey)
(In reply to Bill McCloskey (:billm) from comment #47)
> (In reply to Cervantes Yu from comment #42)
> > Bill, since IPC channel's output queue is not thread safe and we could
> > enqueue the hello message on the main thread or the IPC thread, do we plan
> > to make it thread-safe? Thanks.
> 
> The hello message is normally enqueued only during construction of the
> channel. There's no risk that the channel can be used by any other threads
> at that time. However, ResetFileDescriptor also enqueues a hello message,
> and it seems to happen during Nuwa startup, when the channel can already be
> shared by other threads. So this is a Nuwa-specific problem.
> 
> I'm worried that the patches here might fix vsync, but we'll have the same
> problem in other code. I know very little about Nuwa, but it seems like the
> file descriptors are getting reset too late. We should do it before we
> unfreeze threads. Could ContentChild register a callback via
> NuwaAddConstructor and reset the file descriptors there?

I agree here, the proper fix should be not to delay the vsync constructor message, but either delay the Nuwa callback constructors until after the file descriptors are set or to block non hello messages in the IPC channel until the hello messages are acked.

Depending on how fast we can resolve the issue here, a temporary fix for vsync might be necessary just because we need to enable the vsync refresh driver for 2.2, with the trains switching in 2 weeks. A temp fix for vsync to post a task onto the main thread might be ok with a follow up to delete the hack once a proper fix is landed.
Like this one?
I register the reset fd callback in ContentChild::Init(). I think that's the first callback in list.
Or we need another new api for this type of callback registry(e.g. reset fd callback here)?
Attachment #8576747 - Flags: feedback?(wmccloskey)
(In reply to Jerry Shih[:jerry] (UTC+8) from comment #49)
> Created attachment 8576747 [details] [diff] [review]
> reset channel fd at the first nuwa final constructor callback.
> 
> Like this one?
> I register the reset fd callback in ContentChild::Init(). I think that's the
> first callback in list.
> Or we need another new api for this type of callback registry(e.g. reset fd
> callback here)?

I don't think we have a guaranteed order of when the Nuwa callbacks execute do we?
Comment on attachment 8576747 [details] [diff] [review]
reset channel fd at the first nuwa final constructor callback.

Review of attachment 8576747 [details] [diff] [review]:
-----------------------------------------------------------------

::: dom/ipc/ContentChild.cpp
@@ +673,5 @@
>  #endif
>  
> +#ifdef MOZ_NUWA_PROCESS
> +    if (IsNuwaProcess()) {
> +        NuwaAddFinalConstructor(ResetChannelFD, nullptr);

I was thinking we should use NuwaAddConstructor since it runs before threads unfreeze. That would avoid the ordering problem Mason mentioned.

However, somebody who understands Nuwa should review this. Maybe Cervantes?
Attachment #8576747 - Flags: feedback?(wmccloskey)
From comment 51 - Can we unfreeze the threads later in NUWA? Or ensure that we reset the FDs earlier? Thanks!
Flags: needinfo?(cyu)
Just a straightforward one. Add a new callback list and call them before NuwaAddFinalConstructor callback list.
Attachment #8576747 - Attachment is obsolete: true
Jerry, why not reset before the threads are unfrozen? That seems a lot safer to me. Isn't it possible that a thread could use a channel before it gets reset?
Flags: needinfo?(hshih)
(In reply to Mason Chang [:mchang] from comment #50)
> (In reply to Jerry Shih[:jerry] (UTC+8) from comment #49)
> > Created attachment 8576747 [details] [diff] [review]
> > reset channel fd at the first nuwa final constructor callback.
> > 
> > Like this one?
> > I register the reset fd callback in ContentChild::Init(). I think that's the
> > first callback in list.
> > Or we need another new api for this type of callback registry(e.g. reset fd
> > callback here)?
> 
> I don't think we have a guaranteed order of when the Nuwa callbacks execute
> do we?

No. There is no guaranteed order when the callbacks are executed.
Flags: needinfo?(cyu)
(In reply to Jerry Shih[:jerry] (UTC+8) from comment #53)
> Created attachment 8577024 [details] [diff] [review]
> add a new type of callback before NuwaAddFinalConstructor called.
> 
> Just a straightforward one. Add a new callback list and call them before
> NuwaAddFinalConstructor callback list.

Given the callback types we have:
* NuwaAddConstructor()
* NuwaAddFinalConstructor()
* NuwaAddThreadConstructor()

Adding the 4th type of callback NuwaAddThreadFinalConstructor() is making the interface too much complicated and confusing. I would prefer using the existing one instead of adding one.

As Bill commented, NuwaAddConstructor() sounds like feasible but with one caveat: the actions added to NuwaAddConstructor() mustn't perform any action that depends on another thread that is not yet recreated.
Thinker, do you have any concern with this?
Attachment #8577064 - Flags: feedback?(tlee)
(In reply to Cervantes Yu from comment #56)
> (In reply to Jerry Shih[:jerry] (UTC+8) from comment #53)
> > Created attachment 8577024 [details] [diff] [review]
> > add a new type of callback before NuwaAddFinalConstructor called.
> > 
> > Just a straightforward one. Add a new callback list and call them before
> > NuwaAddFinalConstructor callback list.
> 
> Given the callback types we have:
> * NuwaAddConstructor()
> * NuwaAddFinalConstructor()
> * NuwaAddThreadConstructor()
> 
> Adding the 4th type of callback NuwaAddThreadFinalConstructor() is making
> the interface too much complicated and confusing. I would prefer using the
> existing one instead of adding one.
> 
> As Bill commented, NuwaAddConstructor() sounds like feasible but with one
> caveat: the actions added to NuwaAddConstructor() mustn't perform any action
> that depends on another thread that is not yet recreated.

Is it guaranteed that the IPC thread will be created by the time NuwaAddConstructor callbacks fire?
(In reply to Mason Chang [:mchang] from comment #58)
> Is it guaranteed that the IPC thread will be created by the time
> NuwaAddConstructor callbacks fire?

ResetFileDescriptor just creates the hello message and enqueues it directly. I don't think it needs the IPC thread to exist.
(In reply to Bill McCloskey (:billm) from comment #59)
> (In reply to Mason Chang [:mchang] from comment #58)
> > Is it guaranteed that the IPC thread will be created by the time
> > NuwaAddConstructor callbacks fire?
> 
> ResetFileDescriptor just creates the hello message and enqueues it directly.
> I don't think it needs the IPC thread to exist.

Sure, I just want to make sure that a message sent on the IPC thread must occur AFTER the hello messages are enqueued. From my understanding, NuwaAddConstructor callbacks fire before the IPC thread is unfrozen? Also, do we only care that the hello messages have been queued, or must they also be acked?
It only has to be enqueued.
Comment on attachment 8576656 [details] [diff] [review]
Fix ipc thread crash part2: post vsync timer creation task to main thread. v1

Review of attachment 8576656 [details] [diff] [review]:
-----------------------------------------------------------------

f+ because in terms of fixing the race, delaying the creation of VSyncChild works. We need someone who knows better about the refresh driver to review.
Attachment #8576656 - Flags: review?(cyu) → feedback+
Comment on attachment 8576656 [details] [diff] [review]
Fix ipc thread crash part2: post vsync timer creation task to main thread. v1

We have a race problem when we send ipc message in NuwaAddFinalConstructor callback. Please check comment 43. The temporary fix here is that we don't create the PVsyncChild protocol in that callback. Instead, we post a new task to main thread to create it. With this, all ipc cloning and reset stuff are already done(comment 48 and comment 62). After we fix the nuwa cloning issue, we can remove this patch.
Flags: needinfo?(hshih)
Attachment #8576656 - Flags: review?(roc)
Attachment #8576313 - Attachment is obsolete: true
Attachment #8577024 - Attachment is obsolete: true
Comment on attachment 8577064 [details]
The counter proposal: reset transports with NuwaAddConstructor()

move to Bug 1142935
Attachment #8577064 - Attachment is obsolete: true
Attachment #8577064 - Flags: feedback?(tlee)
Comment on attachment 8576656 [details] [diff] [review]
Fix ipc thread crash part2: post vsync timer creation task to main thread. v1

Review of attachment 8576656 [details] [diff] [review]:
-----------------------------------------------------------------

::: layout/base/nsRefreshDriver.cpp
@@ +873,5 @@
> +};
> +#endif //MOZ_NUWA_PROCESS
> +
> +static void
> +CreateNuwaContentVsyncRefreshTimer(void*)

This function needs to go inside the ifdef MOZ_NUWA_PROCESS as well or this won't compile.
Comment on attachment 8576655 [details] [diff] [review]
Fix ipc thread crash part1: change nsRefreshDriver::PVsyncActorCreated() to internal function. v1

Review of attachment 8576655 [details] [diff] [review]:
-----------------------------------------------------------------

This looks like it's just moving some code around, and I don't see why it's needed to fix this bug. r+ anyway since it looks like a no-op but if there's no need to land this then please don't, as we will need to uplift the fix to 2.2 and the smaller the patch the better.
Attachment #8576655 - Flags: review?(bugmail.mozilla) → review+
(In reply to Bill McCloskey (:billm) from comment #47)
> (In reply to Cervantes Yu from comment #42)
> > Bill, since IPC channel's output queue is not thread safe and we could
> > enqueue the hello message on the main thread or the IPC thread, do we plan
> > to make it thread-safe? Thanks.
> 
> The hello message is normally enqueued only during construction of the
> channel. There's no risk that the channel can be used by any other threads
> at that time. However, ResetFileDescriptor also enqueues a hello message,
> and it seems to happen during Nuwa startup, when the channel can already be
> shared by other threads. So this is a Nuwa-specific problem.
> 
> I'm worried that the patches here might fix vsync, but we'll have the same
> problem in other code. I know very little about Nuwa, but it seems like the
> file descriptors are getting reset too late. We should do it before we
> unfreeze threads. Could ContentChild register a callback via
> NuwaAddConstructor and reset the file descriptors there?

Hmm it just occurred to me. Why does the ResetFileDescriptor need to enqueue another hello message if it's already enqueued during the construction of the channel? Do we need to have a second hello message?
Flags: needinfo?(wmccloskey)
(In reply to Mason Chang [:mchang] from comment #67)
> Hmm it just occurred to me. Why does the ResetFileDescriptor need to enqueue
> another hello message if it's already enqueued during the construction of
> the channel? Do we need to have a second hello message?

I think so. This is the code that receives the hello message:
http://mxr.mozilla.org/mozilla-central/source/ipc/chromium/src/chrome/common/ipc_channel_posix.cc#616

Basically, the hello message passes on the pid of the new process to the parent. Since we have a new pid after the fork, we need a new hello message.
Flags: needinfo?(wmccloskey)
Assignee

Updated

4 years ago
Attachment #8572978 - Attachment is obsolete: true
Assignee

Updated

4 years ago
Attachment #8572979 - Attachment is obsolete: true
Assignee

Updated

4 years ago
Attachment #8572980 - Attachment is obsolete: true
I hit the infinite crash loop on my mac today and I noticed it first occurred after I switched from nightly to the dev edition. The crash stacks that look like https://crash-stats.mozilla.com/report/index/7c23e927-c2b9-4f80-b225-ec7302150302 occur due to bug 1133526 not uplifted to the developer edition. I enabled the refresh driver without hardware vsync in about:config, which caused a crash on startup since we didn't check that hardware vsync was also enabled. Thus, we tried to get the vsync signal, but we never initialized the vsync signal on startup since the pref was off. So essentially, user error / I voided my warranty.
Assignee

Updated

4 years ago
Depends on: 1142935
Since bug 1142935 landed, we don't need to post the vsync timer creation task to main thread.
Attachment #8576656 - Attachment is obsolete: true
Attachment #8576655 - Attachment is obsolete: true
From the initial bug reports, we had two major crashes with this bug. The first was related to bug 1133526 not being uplifted to aurora. Bug 1142935 fixed the race condition in NUWA reported in bug 1140978. QA and smoketest blockers today confirmed these are not occurring. 

The OSX vsync source crashes were resolved in bug 1144638 and not dependent on the refresh driver. Since the new nightlies on b2g + OS X enable the refresh driver by default, I'll let this sit for another couple of days to verify that the refresh driver is stable. For now, the bug is unactionable!
I haven't seen any refresh driver crashes on bugzilla, crash-stats, or pinged for by QA other than bug 1147753, which so far looks like an L Bug. Resolving as WFM.
Status: ASSIGNED → RESOLVED
Closed: 4 years ago
Resolution: --- → WORKSFORME
You need to log in before you can comment on or make changes to this bug.