Hey :aosmond, I'm need-info'ing you on this because I'm almost certainly going to ask you on Monday about the TODO at the end of this comment. So feel free to ignore this or read it, but here's all the context for when that happens.
OK SO! Here's what I've figured out so far. First, context:
With the current design of document splitting, a (non-root) WRBP may be associated with one of two documents: the "content" document or the "popover" document (the thing that shows up when you go fullscreen on e.g. a video, aiui). Which document you're associated with is tracked by mRenderRoot. (Gecko calls documents RenderRoots.)
We started seeing this issue because of this change: https://phabricator.services.mozilla.com/D37078.
A significant part of that change was to make mRenderRoot into a Maybe. Essentially, in some situations a WRBP may exist and even have display lists associated with it, but not actually know what document it's associated with yet. (I need to remind myself exactly why this happens, but for now we will take it for granted.) If this happens, it doesn't know what WebrenderAPI it's supposed to be talking to. In such a case it defers any work it needs to do to the Root WRBP, which possibly buffers it up until It Is Known. Note that some things are ultimately "API agnostic", and for those things the Root WRBP can just do it right away. Also note: a WRBP's Api starts as Nothing, and once it becomes Some it never changes back.
These comments describe this situation a bit: https://searchfox.org/mozilla-central/source/gfx/layers/wr/WebRenderBridgeParent.h#517
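To make the deferral concrete, here's a toy model of it. This is NOT the real Gecko code: RootBridge, ChildBridge, DoWork, SetRenderRoot and the string "work items" are all made-up names, and std::optional stands in for Maybe. It only shows the shape of "defer to the root until the render root is known, then flush".

```cpp
#include <cassert>
#include <optional>
#include <string>
#include <vector>

// Hypothetical stand-ins; none of these are the real Gecko types.
enum class RenderRoot { Content, Popover };

struct RootBridge {
  std::vector<std::string> deferred;  // work waiting for a render root
  std::vector<std::string> sent;      // work that reached some API

  static std::string Tag(RenderRoot r) {
    return r == RenderRoot::Content ? "content" : "popover";
  }

  void Defer(const std::string& work) { deferred.push_back(work); }

  // Called once the child's render root is known: replay the buffered work.
  void Flush(RenderRoot r) {
    for (const auto& w : deferred) sent.push_back(Tag(r) + ":" + w);
    deferred.clear();
  }
};

struct ChildBridge {
  RootBridge& root;
  std::optional<RenderRoot> renderRoot;  // starts empty ("Nothing")

  void DoWork(const std::string& work) {
    if (!renderRoot) {
      root.Defer(work);  // we don't know which API to talk to yet
      return;
    }
    root.sent.push_back(RootBridge::Tag(*renderRoot) + ":" + work);
  }

  // Nothing -> Some happens once and never reverses.
  void SetRenderRoot(RenderRoot r) {
    assert(!renderRoot && "render root is only ever set once");
    renderRoot = r;
    root.Flush(r);
  }
};
```

In this model, a child that calls DoWork before SetRenderRoot leaves its work sitting in root.deferred, and it only shows up in root.sent after the render root arrives. The real code tracks this per-child and per-document; the toy version assumes a single child for brevity.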
So, our crash: there are several places in the code where we call Api(renderRoot) to resolve our WebrenderAPI. For non-root WRBPs, the argument is ignored and this just resolves mRenderRoot. So if we call Api and mRenderRoot is still Nothing, then we crash. Specifically, we crash on the second line of WRBP::ProcessWebRenderParentCommands where we grab our Api to stuff it in an AutoTransactionSender.
Why is ProcessWebRenderParentCommands being called? I wasn't able to verify this, but our hunch is that our child is executing EndClearCachedResources. This theory is supported by the fact that all of the commands that I saw when I caught the crash were ReleaseTextureOfImage. So a content process is getting convinced that the parent has some images, tries to tell it to clear them, and then we blow up trying to grab our non-existent API.
Note that WRBP::RecvClearCachedResources, which is called from WRBC::BeginClearCachedResources and is supposed to purge any display lists (scenes?) that might reference the resources we want to purge, is aware of the problem of missing renderRoots, and properly handles the situation: https://searchfox.org/mozilla-central/source/gfx/layers/wr/WebRenderBridgeParent.cpp#1867,1878-1880
So it's possible everything that's happening here is totally fine and correct, we just need to change ProcessWebRenderParentCommands to handle the missing renderRoot case like RecvClearCachedResources does.
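Roughly the shape I have in mind, as a standalone toy (again NOT the real code: Bridge, FakeApi, ApiIfKnown and ProcessCommands are invented names, and std::optional stands in for Maybe). The open question from the TODO below is what the "unknown render root" branch should actually do; buffering here is just one assumption picked for illustration, and dropping or forwarding to the root WRBP are equally plausible.

```cpp
#include <cassert>
#include <optional>
#include <string>
#include <vector>

// Hypothetical stand-ins; none of these are the real Gecko types.
enum class RenderRoot { Content, Popover };

struct FakeApi {
  std::vector<std::string> applied;  // commands this API has received
};

struct Bridge {
  std::optional<RenderRoot> renderRoot;  // empty until the document is known
  FakeApi contentApi;
  FakeApi popoverApi;
  std::vector<std::string> pending;  // commands held back (assumed policy)

  // Unlike the crashing path, report "unknown" instead of asserting.
  FakeApi* ApiIfKnown() {
    if (!renderRoot) return nullptr;
    return *renderRoot == RenderRoot::Content ? &contentApi : &popoverApi;
  }

  // Returns true if the commands were applied right away.
  bool ProcessCommands(const std::vector<std::string>& cmds) {
    FakeApi* api = ApiIfKnown();
    if (!api) {
      // Assumed handling: hold the commands rather than touching an API
      // we don't have yet.
      pending.insert(pending.end(), cmds.begin(), cmds.end());
      return false;
    }
    api->applied.insert(api->applied.end(), cmds.begin(), cmds.end());
    return true;
  }
};
```

The only point is that the API lookup is checked before anything is built on top of it, which is the same kind of guard RecvClearCachedResources already has.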
I'll look into this on Monday.
Big TODO: work out how exactly the ImageKeys we're being told to delete got created in the first place. This will probably tell me exactly how we're supposed to handle ReleaseTextureOfImage when mRenderRoot doesn't exist yet. (I expect that the root WRBP just did it for us, but I'm fuzzy on the whole image pipeline.)