Closed Bug 1171209 Opened 4 years ago Closed 3 years ago

"ASSERTION: Wrong size for this Shmem!" appears over 3000 times in linux64 debug m-2 e10s run

Categories

(Core :: IPC, defect)

defect
Not set

Tracking

()

RESOLVED FIXED
Tracking Status
e10s + ---
firefox41 --- affected

People

(Reporter: erahm, Unassigned)

References

(Blocks 2 open bugs)

Details

The following assertion and it's associated stack trace shows up thousands of times in a recent linux64 debug e10s mochitest 2 runs [1] (it's possible it shows up elsewhere). This assertion is the result of a NS_ERROR call in |Shmem::OpenExisting| [2].

The assertion and it's stack trace account for 33% of the log file (5247893/15635917 bytes).

If this is a bug, it should probably be fixed. If this is expected behavior we should probably remove the NS_ERROR call or switch it to MOZ_LOG.

> 09:58:01     INFO -  ^G[Parent 5329] ###!!! ASSERTION: Wrong size for this Shmem!: 'Error', file /builds/slave/try-l64-d-00000000000000000000/build/src/ipc/glue/Shmem.cpp, line 459 
> 09:58:49     INFO -  #01: mozilla::layers::PImageBridgeParent::OnMessageReceived(IPC::Message const&) [mfbt/AlreadyAddRefed.h:116] 
> 09:58:49     INFO -  #02: mozilla::ipc::MessageChannel::DispatchAsyncMessage(IPC::Message const&) [ipc/glue/MessageChannel.h:507] 
> 09:58:49     INFO -  #03: mozilla::ipc::MessageChannel::DispatchMessage(IPC::Message const&) [mfbt/Maybe.h:372] 
> 09:58:49     INFO -  #04: mozilla::ipc::MessageChannel::OnMaybeDequeueOne() [ipc/glue/MessageChannel.cpp:1182] 
> 09:58:49     INFO -  #05: MessageLoop::RunTask(Task*) [ipc/chromium/src/base/message_loop.cc:362] 
> 09:58:49     INFO -  #06: MessageLoop::DeferOrRunPendingTask(MessageLoop::PendingTask const&) [ipc/chromium/src/base/message_loop.cc:372] 
> 09:58:49     INFO -  #07: MessageLoop::DoWork() [ipc/chromium/src/base/message_loop.cc:456] 
> 09:58:49     INFO -  #08: base::MessagePumpDefault::Run(base::MessagePump::Delegate*) [ipc/chromium/src/base/message_pump_default.cc:35] 
> 09:58:49     INFO -  #09: MessageLoop::RunInternal() [ipc/chromium/src/base/message_loop.cc:234] 
> 09:58:49     INFO -  #10: MessageLoop::Run() [ipc/chromium/src/base/message_loop.cc:517] 
> 09:58:49     INFO -  #11: base::Thread::ThreadMain() [ipc/chromium/src/base/thread.cc:173] 
> 09:58:49     INFO -  #12: ThreadFunc [ipc/chromium/src/base/platform_thread_posix.cc:41] 
> 09:58:49     INFO -  #13: libpthread.so.0 + 0x7e9a 
> 09:58:49     INFO -  #14: libc.so.6 + 0xf42ed 
> 09:58:49     INFO -  ###!!! [Parent][DispatchAsyncMessage] Error: (msgtype=0xFFFC,name=???) Payload error: message could not be deserialized 
> 09:58:49     INFO -  ###!!! [Parent][DispatchAsyncMessage] Error: (msgtype=0xFFFB,name=???) Value error: message was deserialized, but contained an illegal value

[1] http://ftp.mozilla.org/pub/mozilla.org/firefox/try-builds/erahm@mozilla.com-eebe5215c542/try-linux64-debug/try_ubuntu64_vm-debug_test-mochitest-e10s-2-bm120-tests1-linux64-build467.txt.gz
[2] https://hg.mozilla.org/mozilla-central/annotate/b0a507af2b4a/ipc/glue/Shmem.cpp#l459
Flags: needinfo?(gwright)
A did a little more digging:
  - This assertion is only hit during the e10s m-2 run
  - Frame #01 is always |mozilla::layers::PImageBridgeParent::OnMessageReceived(IPC::Message const&)|
More specifically as of today it shows up 2,942 times, I have confirmed it's limited to debug e10s mochitest 2 runs.

It shows up in the following tests:

> 451 - dom/media/test/test_loop.html
> 318 - dom/media/test/test_bug879717.html
> 291 - dom/media/test/test_bug495145.html
> 242 - dom/media/test/test_play_twice.html
> 224 - dom/media/test/test_replay_metadata.html
> 166 - dom/media/test/test_playback.html
> 147 - dom/media/test/test_reset_src.html
> 134 - dom/media/test/test_play_events.html
> 125 - dom/media/test/test_seek_out_of_range.html
> 110 - dom/media/test/test_reactivate.html
> 105 - dom/media/test/test_play_events_2.html
> 104 - dom/media/test/test_timeupdate_small_files.html
> 104 - dom/media/test/test_progress.html
> 99 - dom/media/test/test_streams_element_capture.html
> 81 - dom/media/test/test_playback_rate.html
> 67 - dom/media/test/test_seek-5.html
> 54 - dom/media/test/test_seek-2.html
> 41 - dom/media/test/test_seek-1.html
> 23 - dom/media/test/test_bug465498.html
> 21 - dom/media/test/test_standalone.html
> 16 - dom/media/test/test_bug1113600.html
> 10 - dom/media/test/test_played.html
> 4 - dom/media/test/test_eme_playback.html
> 2 - dom/media/test/test_buffered.html
> 1 - dom/media/test/test_seek-13.html
> 1 - dom/media/test/test_fastSeek.html
> 1 - dom/media/test/test_bug495300.html
Anthony -- See comment 2.  I believe the tests that are failing are all in streaming media.
Flags: needinfo?(ajones)
Peter - is this issue up your alley?
Flags: needinfo?(peterv)
This seems to be a gfx issue, unfortunately I don't know much about that.
Flags: needinfo?(peterv)
Matt - is this a graphics issue?
Flags: needinfo?(ajones) → needinfo?(matt.woodrow)
I believe it is, yeah. I did some logging on try to track this down.

The invalid size happens because Shmem::Dealloc [1] is setting the size field in the shmem header to 0.

So apparently we've tried to deallocate the memory before the parent side deserializes it.

ShmemTextureClient shouldn't ever deallocate the shmem if we've created a parent side actor [2], so I don't see how this could happen.

The only other thing I can think of is if we're reusing the same memory/file descriptors and things are getting confused that way.

Any ideas nical?


[1] http://mxr.mozilla.org/mozilla-central/source/ipc/glue/Shmem.cpp#488
[2] http://mxr.mozilla.org/mozilla-central/source/gfx/layers/client/TextureClient.cpp#689
Flags: needinfo?(matt.woodrow) → needinfo?(nical.bugzilla)
This seems to be specific to ImageBridge, but that might be because it's the main place we use Shmem.
Looks like TextureClient::ForceRemove(sync = false) destroys the texture asynchronously since we don't specify sync to be true in ImageBridgeChild. this means the TextureClient's destructor can race with the message that will create the TextureHost on the compositor side.
Flags: needinfo?(nical.bugzilla)
I forgot to say that this will only happen if the flag TEXTURE_DEALLOCATE_CLIENT is set (I am assuming this is the case).
I think that the best way to fix this is to deallocate the shmem from the TextureChild's destructor rather than the TextureClient. This way the deallocation is still async, but the TextureChild is guaranteed to be destroyed after the TextureParent so it can't race with creating the TextureParent/TextureHost.
(In reply to Nicolas Silva [:nical] from comment #10)
> I forgot to say that this will only happen if the flag
> TEXTURE_DEALLOCATE_CLIENT is set (I am assuming this is the case).

Actually, nevermind. If the flag is set the deallocation is always synchronous (sadly), so this race should not happen (not in the way I depicted).
removing needinfo to get back into triage.
Flags: needinfo?(gwright)
Blocks: e10s-tests
This has come up in other bugs and the conclusion is: This assertion is hit when a shmem is quickly created and destroyed on the content side, before the compositor process has had time to receive the internal IPDL message generated by the shmem creation. There isn't much we can do if not have an separate shmem type that isn't internally managed by ipdl (we have to re-do that management in gfx because of the other non-shmem types that sit under the same abstraction anyway). It's not exactly a low hanging fruit and the assertion hasn't been symptomatic of an actual crash or correctness issue sofar, so I am not in a rush to implement that. Also I think Lee turned the assertion into something that does not fail the tests in another bug, so this should not block e10s passing tests anymore.
I'm not seeing this assertion anymore, lets go ahead and close this.
Status: NEW → RESOLVED
Closed: 3 years ago
Resolution: --- → FIXED
You need to log in before you can comment on or make changes to this bug.