Closed Bug 1492241 Opened 6 years ago Closed 5 years ago

Crash in core::option::expect_failed | core::iter::{{impl}}::next<T>

Categories

(Core :: Graphics: WebRender, defect, P2)

Unspecified
Windows 10
defect

Tracking

()

RESOLVED FIXED
mozilla66
Tracking Status
firefox-esr60 --- unaffected
firefox63 --- unaffected
firefox64 --- disabled
firefox65 --- fixed
firefox66 --- fixed

People

(Reporter: marcia, Assigned: nical)

References

(Blocks 1 open bug)

Details

(Keywords: crash, regression, topcrash)

Crash Data

This bug was filed from the Socorro interface and is
report bp-3b62a536-27d4-4f60-a236-ccf0e0180915.
=============================================================

Small volume Win 10 crash spotted during nightly crash triage: https://bit.ly/2NmprgM. Crashes started using 20180914220208. 

Possible regression range based on Build ID: https://hg.mozilla.org/mozilla-central/pushloghtml?fromchange=efccb758c78cc4c42170c886294bf335d90e4383&tochange=750e71a8f79b3b745a6653e555e60acb122f1241

There was a webrender update in that changeset.

Top 10 frames of crashing thread:

0 xul.dll static void std::panicking::rust_panic_with_hook src/libstd/panicking.rs:521
1 xul.dll static void std::panicking::continue_panic_fmt src/libstd/panicking.rs:426
2 xul.dll static void std::panicking::rust_begin_panic src/libstd/panicking.rs:337
3 xul.dll static void core::panicking::panic_fmt src/libcore/panicking.rs:92
4 xul.dll static void core::option::expect_failed src/libcore/option.rs:960
5 xul.dll static union core::option::Option<webrender_bindings::moz2d_renderer::{{impl}}::rasterize::Job> core::iter::{{impl}}::next<webrender_bindings::moz2d_renderer::{{impl}}::rasterize::Job, core::slice::Iter<webrender_api::image::BlobImageParams>, closure> src/libcore/iter/mod.rs:1326
6 xul.dll static struct alloc::vec::Vec< gfx/webrender_bindings/src/moz2d_renderer.rs:358
7 xul.dll static void webrender::resource_cache::ResourceCache::block_until_all_resources_added gfx/webrender/src/resource_cache.rs:1365
8 xul.dll static struct webrender::tiling::Frame webrender::frame_builder::FrameBuilder::build gfx/webrender/src/frame_builder.rs:368
9  @0x20f3fa83b87 

=============================================================
> no entry found for key
Priority: -- → P3
See Also: → 1480404
Priority: P3 → P4
Last crashes were in the 10-10 build. No crashes seen since then.
See Also: → 1508714
Crash Signature: [@ core::option::expect_failed | core::iter::{{impl}}::next<T>] → [@ core::option::expect_failed | core::iter::{{impl}}::next<T>] [@ core::option::expect_failed | <core::iter::Map<I, F> as core::iter::iterator::Iterator>::next]
Priority: P4 → P2
From bug 1508714:

(In reply to Jan Andre Ikenmeyer [:darkspirit] from comment #1)
> According to Socorro (Bugzilla Crash Stop) and Mozregression:
> last build without this crash signature: 20181119100448
> first build with this crash signature:   20181119220031
> https://hg.mozilla.org/mozilla-central/
> pushloghtml?fromchange=e44bb5b4bc79be613d29b3f95d7b508e68e3d128&tochange=bd4c
> ebdbed4bcd34d449f67d51f139f9bdf75edc
Increased the priority here since it looks like this has spiked to the #1 GPU process crasher now.
There are only 2 WR changes in that regression range, and one is just a version bump for a dependency (euclid).

Seems like that this is due to https://github.com/servo/webrender/pull/3323
Tried to untangle this, but the logic is pretty complicated. What I think is happening is:

We're crashing trying to access Moz2DBlobRasterizer::blob_commands[key] [1], because there's a key in the requests parameter that we don't have in our hashmap.

The callstack is coming from rasterize_missing_blob_images (inlined into block_until_all_resources_added), and that calls self.blob_image_handler.prepare_resources. prepare_resources checks the blob_commands HashMap on BlobImageHandler, so the relevant key must be in that one. 

Moz2DBlobRasterizer is created with a clone of the blob_commands from BlobImageHandler, so it seems like we must have added a new blob since we created the rasterizer, or the objects are actually unrelated.

It appears that the normal process is that we get ApiMsg::UpdateDocument on the render_backend, that updates the BlobImageHandler with new blobs, creates a Moz2DBlobRasterizer, and then sends it all to the scene builder. The scene builder rasterizes blobs, and then returns the Moz2DBlobRasterizer [2] to the render_backend's resource cache.

It also seems like ApiMsg::UpdateResources also adds new blobs to the BlobImageHandler, but doesn't do scene building, so it I think the Moz2DBlobRasterizer's internal map of blobs will now be out of date. 

Should we be recreating the blob rasterizing with a new clone of the blob_commands map? What if there's an existing scene build in-progress, how do we handle ordering of those and making sure we don't clobber our new rasterizer with an older one?

It's not super obvious why Bobby's patch regressed this, though I guess just eviction timing changes resulting in us needing to rasterize missing blobs in different situations.




[1] https://searchfox.org/mozilla-central/rev/f997bd6bbfc4773e774fdb6cd010142370d186f9/gfx/webrender_bindings/src/moz2d_renderer.rs#475
[2] https://searchfox.org/mozilla-central/rev/f997bd6bbfc4773e774fdb6cd010142370d186f9/gfx/wr/webrender/src/render_backend.rs#770
Flags: needinfo?(nical.bugzilla)
It might be simpler to just persist a blob rasterizer on each of the render_backend and scene_builders, instead of trying to pass them around. 

Then each scene builder request can pass a clone of the current blob commands to be integrated into the existing rasterizer, and when done we can just discard the commands instead of trying to return them (since the render_backend already has them, and they're guaranteed to be the same, or newer).
Keywords: topcrash
Hi Nical -- This is an important one.  Can you take this?
Assignee: nobody → nical.bugzilla
> Hi Nical -- This is an important one.  Can you take this?

Yep

> It's not super obvious why Bobby's patch regressed this, though I guess just eviction timing changes resulting in us needing to rasterize missing blobs in different situations.


Blobs should not be affected by the eviction timings (they are always manually evicted from the cache), or if they are it's another bug to look at.
Flags: needinfo?(nical.bugzilla)
(In reply to Matt Woodrow (:mattwoodrow) from comment #6)
> There are only 2 WR changes in that regression range, and one is just a
> version bump for a dependency (euclid).
> 
> Seems like that this is due to https://github.com/servo/webrender/pull/3323

Those are the only two changes to upstream WebRender, and as has been noted, neither is a particularly likely candidate.

However, it's quite a big range, and there's certainly some other interesting stuff in there that might have triggered this. Bug 1446309, bug 1477551, and bug 1486958 all seem like they could be related here.
So https://crash-stats.mozilla.com/report/index/fc9523ac-16a6-4b8a-84c3-3ec010181130 and https://crash-stats.mozilla.com/report/index/3762f238-2e31-4e9d-86f6-e64170181130 are crashes that happened after bug 1510882 / pull/3371 landed. However we're still crashing in the same place. This is very surprising as the crash should have moved earlier.
So it looks like the inconsistency probably comes from: https://searchfox.org/mozilla-central/rev/8f0db72fb6e35414fb9a6fc88af19c69f332425f/gfx/webrender_bindings/src/moz2d_renderer.rs#593

We clone the blob_commands table there and so it's believable that inconsistencies between the two arise.
I poked around the push logs for when this signature first appeared, but we enabled gfx.webrender.qualified.all on nightly around this time (20180913100107) so I suspect it was already present.

Earliest signatures for reference:
https://crash-stats.mozilla.com/report/index/3b62a536-27d4-4f60-a236-ccf0e0180915 [20180914220208, Windows]
https://crash-stats.mozilla.com/report/index/4e8df691-9d4f-4c9d-8bdb-cdbe20181123 [20181119220031, OSX]
https://crash-stats.mozilla.com/report/index/af5bad9d-70f0-4ed1-8d73-cf83b0181120 [20181119220031, Linux]
Here is a theory that kvark and I hashed out:

On the first pass of prepare_transaction:
1) Resource update contains only an AddBlobImage, but it is not visible according to viewport_tiles (set in SetBlobImageVisibleArea)
2) We call ResourceCache::create_blob_scene_builder_requests but since the image is not visible, it doesn't create any BlobImageParams entries (containing the blob image key) to be stored in txn.blob_requests
3) Since the transaction has no blob image updates, we can take the shortcut to call RenderBackend::update_document directly
4) We didn't call ResourceCache::set_blob_rasterizer with the updated txn.blob_rasterizer.

On the second pass of prepare_transaction:
1) Resource update contains only an SetBlobImageVisibleArea
2) We don't call ResourceCache::create_blob_scene_builder_requests since there is no Add/UpdateBlobImage entries.
3) Since the transaction has no blob image updates, we can take the shortcut to call RenderBackend::update_document directly
4) We are missing the blob we added in the previous step in ResourceCache::request_image, and it adds it to missing_blob_images.
5) We try to process missing_blob_images in ResourceCache::rasterize_missing_blob_images with the old blob_rasterizer.
6) Crash.
bp-25efa768-2eff-4c1a-b7fe-bb3390181130 I marked text in a bugzilla comment box and pressed Del.
Putting together a fix now, fingers crossed this is it.
(In reply to Andrew Osmond [:aosmond] from comment #18)
> Putting together a fix now, fingers crossed this is it.

Doesn't look like it. Sadness.
I reproduced locally readily outside of rr, and with some effort, in as well.

STR:
1) Comment out https://searchfox.org/mozilla-central/rev/3fdc51e66c9487b39220ad58dcee275fca070ccd/gfx/wr/webrender/src/scene_builder.rs#471-476 to force us to always create blobs on the RenderBackend thread.
2) MOZ_CHAOSMODE=1 ./mach run https://en.wikipedia.org/wiki/Fedora_(operating_system)#Releases
3) Ensure you are scrolled to the top of the page.
4) Resize vigorously until it crashes.

This confirmed nical's theory that the a low priority and high priority request raced. They both created their own rasterizer, the order got inverted, and when we tried to rasterize the missing blob, our snapshot was missing the key added in the other transaction.
Blocks: 1510328
I don't think what landed can solve the priority inversion I found. Trying to reproduce again... I have a patch to force the correct ordering by forcing high priority scene builder transactions follow the low priority route if it has blob images to rasterize. Probably not ideal however.
Linked to another solution (which is simple and appears to fix the issue).
Blocks: 1494164
\o/ Looks like bug 1512730 fixed the crash.
Indeed! Let's mark it fixed until proven otherwise.
Status: NEW → RESOLVED
Closed: 5 years ago
Resolution: --- → FIXED
Should bug 1512730 be uplifted to 65?
Yeah, that would be good. :aosmond or :kvark, do you want to request uplift on that WR update?
Target Milestone: --- → mozilla66
You need to log in before you can comment on or make changes to this bug.