Closed Bug 1442293 Opened 6 years ago Closed 6 years ago

crashtest failures with "fatal runtime error: allocator memory exhausted" (probably due to WebRender)

Categories

(Core :: Graphics: WebRender, defect, P1)

57 Branch
defect

Tracking


RESOLVED WORKSFORME

People

(Reporter: dholbert, Unassigned)

References

Details

Attachments

(2 files)

I noticed we've had three crashtest intermittents filed in the last ~week, all of which are really crashes that start with this error message:
"fatal runtime error: allocator memory exhausted"

From some websearching, it sounds like this is a Rust error, so presumably this is Stylo running out of memory (given that we're in layout, that's the heaviest Rust thing happening here).
Oh, these are all on the "Linux x64 QuantumRender opt" platform -- so the Rust code that's OOM'ing is probably WebRender code, not Stylo code.
Component: CSS Parsing and Computation → Graphics: WebRender
Summary: crashtest failures with "fatal runtime error: allocator memory exhausted" (probably due to stylo?) → crashtest failures with "fatal runtime error: allocator memory exhausted" (probably due to WebRender)
(In reply to Daniel Holbert [:dholbert] from comment #1)
> Oh, these are all in "Linux x64 QuantumRender opt" platform

(er, they're all QuantumRender -- not all opt, though -- at least one is debug)

Sample opt log:
https://treeherder.mozilla.org/logviewer.html#?job_id=163835604&repo=mozilla-inbound

Sample debug log:
https://treeherder.mozilla.org/logviewer.html#?job_id=164969819&repo=autoland
Possibly related to bug 1441667?
Priority: -- → P1
Depends on: 1442878
Depends on: 1443116
I grabbed the resource_usage.json files from a passing and a failing crashtest run on the same push and graphed the % memory usage. It looks like at some point the failing run decides it wants to grab a big whack of memory and OOMs. So it's not a slow leak, but an intermittently hit bad allocation.

The failure usually gets blamed on one of the layout/generic/crashtests/3* tests, the earliest (lexicographically) one I've seen so far being 334148-1.xhtml. If the bad allocation is consistently triggered by a particular test, it's likely to be one of the tests just before that one.
The failure reproduces on a smaller set of layout/generic/crashtests:
https://treeherder.mozilla.org/#/jobs?repo=try&revision=b78808c87cb476e0c389ba19334c8db545d9413d

This makes it easier to experiment with since we don't have to run the whole suite. I also did a try push which (hopefully) will log any allocations over 1M in size:

https://treeherder.mozilla.org/#/jobs?repo=try&revision=d24407c36a5afc51cdccf65cfb6aa53a72d28c8d

This should tell us if the problem is just one giant allocation or a series of large allocations, and give us the allocation amounts. With that in hand we might be able to trigger a crash or backtrace on the bad allocation(s).
It took a few try pushes to get it working right; here's a good one: https://treeherder.mozilla.org/#/jobs?repo=try&revision=0ebeeeda8e5798240d600a45c6b8ffc8b89751e1

The failing crashtest logs show a series of escalating reallocs, for example:

[task 2018-03-07T04:05:52.926Z] 04:05:52     INFO - 954 140198091421440 realloc(0x7f8268100000,2490368)=0x7f8268800000
[task 2018-03-07T04:05:52.942Z] 04:05:52     INFO - 954 140198091421440 realloc(0x7f8268300000,2129920)=0x7f8267f00000
[task 2018-03-07T04:05:52.944Z] 04:05:52     INFO - 954 140198091421440 realloc(0x7f8268500000,4980736)=0x7f8262700000
[task 2018-03-07T04:05:52.946Z] 04:05:52     INFO - 954 140198091421440 realloc(0x7f8268800000,4980736)=0x7f8268200000
[task 2018-03-07T04:05:52.967Z] 04:05:52     INFO - 954 140198091421440 realloc(0x7f8267f00000,4259840)=0x7f8268700000
[task 2018-03-07T04:05:52.976Z] 04:05:52     INFO - 954 140198091421440 realloc(0x7f8262700000,9961472)=0x7f8262c00000
[task 2018-03-07T04:05:52.984Z] 04:05:52     INFO - 954 140198091421440 realloc(0x7f8268200000,9961472)=0x7f8251c00000
[task 2018-03-07T04:05:53.002Z] 04:05:53     INFO - 954 140198091421440 realloc(0x7f8268700000,8519680)=0x7f8251300000
[task 2018-03-07T04:05:53.023Z] 04:05:53     INFO - 954 140198091421440 realloc(0x7f8262c00000,19922944)=0x7f8250000000
[task 2018-03-07T04:05:53.040Z] 04:05:53     INFO - 954 140198091421440 realloc(0x7f8251c00000,19922944)=0x7f8262700000
[task 2018-03-07T04:05:53.070Z] 04:05:53     INFO - 954 140198091421440 realloc(0x7f8251300000,17039360)=0x7f824ef00000
[task 2018-03-07T04:05:53.111Z] 04:05:53     INFO - 954 140198091421440 realloc(0x7f8250000000,39845888)=0x7f824c900000
[task 2018-03-07T04:05:53.140Z] 04:05:53     INFO - 954 140198091421440 realloc(0x7f8262700000,39845888)=0x7f8250000000
[task 2018-03-07T04:05:53.203Z] 04:05:53     INFO - 954 140198091421440 realloc(0x7f824ef00000,34078720)=0x7f824a800000
[task 2018-03-07T04:05:53.296Z] 04:05:53     INFO - 954 140198091421440 realloc(0x7f824c900000,79691776)=0x7f8245c00000
[task 2018-03-07T04:05:53.394Z] 04:05:53     INFO - 954 140198091421440 realloc(0x7f8250000000,79691776)=0x7f8241000000

which culminates in this:

[task 2018-03-07T04:06:07.352Z] 04:06:07     INFO - 954 140198091421440 realloc(0x7f8137200000,2550136832)=0x7f8053200000
[task 2018-03-07T04:06:07.352Z] 04:06:07     INFO - 954 140198091421440 realloc(0x7f80eb200000,2550136832)=0x0
[task 2018-03-07T04:06:07.352Z] 04:06:07     INFO - fatal runtime error: allocator memory exhausted

Presumably this is some sort of cache in WR that keeps growing as it tries to fit something, and eventually OOMs. I should be able to do another try push to get a backtrace, but in the meantime: Glenn, does this pattern look familiar at all? Any ideas on what might be causing it? I believe this started when we turned on blob images.
Flags: needinfo?(gwatson)
I can't think of anything immediately; it will hopefully be clear once we get a backtrace. Perhaps a Vec is being appended to without bound, doubling in size each time.
Flags: needinfo?(gwatson)
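For reference, this is the classic signature of a Rust Vec growing without bound: each push past capacity reallocates the backing buffer at roughly twice the previous size, matching the 2.4MB -> 4.9MB -> 9.9MB -> 19.9MB -> 39.8MB progression above, and the same-size pairs suggest two collections growing in lockstep. A standalone sketch (not WebRender code; doubling is an implementation detail of Vec, but it holds in practice):

// Standalone sketch: an unbounded Vec reallocates its backing buffer
// at roughly 2x on each capacity overflow, producing the escalating
// realloc sizes seen in the log above.
fn main() {
    let mut v: Vec<u64> = Vec::new();
    let mut last_cap = 0;
    for i in 0..10_000_000u64 {
        v.push(i);
        if v.capacity() != last_cap {
            // Each capacity change corresponds to a realloc of the
            // backing buffer.
            println!("len={:>9} capacity={:>9} ({} bytes)",
                     v.len(), v.capacity(),
                     v.capacity() * std::mem::size_of::<u64>());
            last_cap = v.capacity();
        }
    }
}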
Quick update: I tried getting a backtrace, but I'm not having much luck so far.

At first [1] I just triggered a segfault inside the realloc replacement, but that got handled much the same as the OOM: no crash dumps or traces. I then tried running some stack-walking code from Gecko [2], but I think that went into some sort of infinite recursion or otherwise exploded. This one at least caused the harness to time out and force-kill everything, which also logs stacks. I believe thread 58 in that log is the relevant one, but most of it is in libraries for which we don't have symbols, and a few libxul.so frames aren't symbolicated. I'll try hand-symbolicating them; maybe they'll be useful. Inspired by the timeout behaviour, I then tried just going into an infinite loop during the bad realloc [3], hoping the harness would again time out and print (hopefully more useful) stacks. But the infinite loop seems not to actually trigger the timeout, and it prevents the OOM, so instead everything is green.

[1] https://treeherder.mozilla.org/logviewer.html#?job_id=166499639&repo=try&lineNumber=2220
[2] https://treeherder.mozilla.org/logviewer.html#?job_id=166539677&repo=try&lineNumber=2140
[3] https://treeherder.mozilla.org/#/jobs?repo=try&revision=065e1e543f6651c597bf419f6f44a47e69ad0764
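Conceptually, the infinite-loop experiment in [3] amounts to the following (the real hook lives in the C++ LogAlloc replace-malloc library; this Rust rendering only illustrates the idea, and the threshold is an assumption):

// Conceptual sketch of the experiment in [3]: when a suspiciously
// large realloc request comes through, spin instead of allocating,
// hoping the test harness times out and dumps stacks on all threads.
const SUSPICIOUS_BYTES: usize = 1 << 30; // assumed threshold: 1 GiB

fn maybe_hang(requested_size: usize) {
    if requested_size >= SUSPICIOUS_BYTES {
        loop {
            std::thread::sleep(std::time::Duration::from_secs(1));
        }
    }
}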
Ah, emilio pointed out that my theory for why the infinite-loop runs were always green was wrong. So I retriggered more jobs and got a reproduction of the problem. Here's the stack of the thread doing the realloc, taken from [1].

Thread 61
 0  libc-2.23.so + 0xcc30d
    rax = 0xfffffffffffffdfc   rdx = 0x0000000000000000
    rcx = 0x00007f66ba52030d   rbx = 0xffffffffffffff70
    rsi = 0x00007f6683cf60c0   rdi = 0x00007f6683cf60c0
    rbp = 0x0000000000000000   rsp = 0x00007f6683cf60b0
     r8 = 0x00007f662b800000    r9 = 0x0000000026000000
    r10 = 0x0000000000000735   r11 = 0x0000000000000293
    r12 = 0x00007f65f2800000   r13 = 0x00007f6683cf6110
    r14 = 0x00007f662b800000   r15 = 0x00007f6683cff700
    rip = 0x00007f66ba52030d
    Found by: given as instruction pointer in context
 1  libc-2.23.so + 0xcc25a
    rsp = 0x00007f6683cf60c0   rip = 0x00007f66ba52025a
    Found by: stack scanning
 2  firefox!hang [LogAlloc.cpp:065e1e543f6651c597bf419f6f44a47e69ad0764 : 149 + 0xa]
    rsp = 0x00007f6683cf60f0   rip = 0x0000000000412174
    Found by: stack scanning
 3  firefox!replace_realloc [LogAlloc.cpp:065e1e543f6651c597bf419f6f44a47e69ad0764 : 166 + 0x5]
    rsp = 0x00007f6683cf6100   rip = 0x0000000000412705
    Found by: stack scanning
 4  libxul.so!std::heap::__default_lib_allocator::__rdl_realloc [lib.rs:4d90ac38c0b61bb69470b61ea2cccea0df48d9e5 : 183 + 0xb]
    rsp = 0x00007f6683cf6160   rip = 0x00007f66ad4d53a8
    Found by: stack scanning
 5  libxul.so!alloc::raw_vec::RawVec<webrender::prim_store::PrimitiveMetadata, alloc::heap::Heap>::double<webrender::prim_store::PrimitiveMetadata,alloc::heap::Heap> [heap.rs:4d90ac38c0b61bb69470b61ea2cccea0df48d9e5 : 127 + 0x10]
    rsp = 0x00007f6683cf61a0   rip = 0x00007f66ad6dee8c
    Found by: stack scanning
 6  libxul.so!webrender::prim_store::PrimitiveStore::add_primitive [vec.rs:4d90ac38c0b61bb69470b61ea2cccea0df48d9e5 : 971 + 0x15]
    rsp = 0x00007f6683cf61f0   rip = 0x00007f66ad35c3c1
    Found by: stack scanning
 7  firefox!_fini + 0x13c0
    rsp = 0x00007f6683cf6290   rip = 0x0000000000433780
    Found by: stack scanning
 8  firefox!arena_t::RallocHuge [mozjemalloc.cpp:065e1e543f6651c597bf419f6f44a47e69ad0764 : 3885 + 0xb]
    rsp = 0x00007f6683cf6340   rip = 0x00000000004111d4
    Found by: stack scanning
 9  libpthread-2.23.so + 0x9efe
    rsp = 0x00007f6683cf63a0   rip = 0x00007f66bb4d4efe
    Found by: stack scanning
10  libxul.so!std::heap::__default_lib_allocator::__rdl_realloc [lib.rs:4d90ac38c0b61bb69470b61ea2cccea0df48d9e5 : 183 + 0xb]
    rsp = 0x00007f6683cf6490   rip = 0x00007f66ad4d53a8
    Found by: stack scanning
11  libxul.so!alloc::raw_vec::RawVec<webrender::freelist::Slot<webrender::clip::ClipSources>, alloc::heap::Heap>::double<webrender::freelist::Slot<webrender::clip::ClipSources>,alloc::heap::Heap> [heap.rs:4d90ac38c0b61bb69470b61ea2cccea0df48d9e5 : 127 + 0x10]
    rsp = 0x00007f6683cf64d0   rip = 0x00007f66ad6debe4
    Found by: stack scanning
12  libxul.so!webrender::freelist::FreeList<webrender::clip::ClipSources>::insert<webrender::clip::ClipSources> [vec.rs:4d90ac38c0b61bb69470b61ea2cccea0df48d9e5 : 971 + 0x8]
    rsp = 0x00007f6683cf6520   rip = 0x00007f66ad6fe875
    Found by: stack scanning
13  libxul.so!webrender::display_list_flattener::DisplayListFlattener::create_primitive [display_list_flattener.rs:065e1e543f6651c597bf419f6f44a47e69ad0764 : 1168 + 0x24]
    rsp = 0x00007f6683cf6590   rip = 0x00007f66ad6fb00b
    Found by: stack scanning

[1] https://treeherder.mozilla.org/logviewer.html#?job_id=166575955&repo=try&lineNumber=2145
Flags: needinfo?(gwatson)
If I'm reading that stack trace correctly, is it saying that the entry point to the alloc is [12] (FreeList<ClipSources>::insert)? In which case, is it fair to assume that [5] and [6] are incorrect symbols in this trace, since they don't seem to match what I'd expect?

Perhaps something is resulting in a huge number of clips being generated? I wonder if we could log out the sizes of the prim_store::cpu_metadata and clip_store freelist arrays?
Flags: needinfo?(gwatson)
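The instrumentation in that push was roughly of this shape (a hypothetical reconstruction; the function name and the 5000 baseline are inferred from the log output quoted below):

// Hypothetical reconstruction of the logging patch: once the stores
// pass a baseline length, print their sizes on every insertion so
// runaway growth is visible in the test log.
fn log_store_sizes(clip_store_len: usize, cpu_metadata_len: usize) {
    if clip_store_len > 5000 {
        println!("create_primitive has clip_store.len() = {}", clip_store_len);
    }
    if cpu_metadata_len > 5000 {
        println!("prim_store: cpu_metadata.len() is {}", cpu_metadata_len);
    }
}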
Here's a try push with logging: https://treeherder.mozilla.org/#/jobs?repo=try&revision=cbeb17b7022dfd5a73baa37e1b130c1b516ba4e6 with some reproductions.

I grabbed one of the logs (attached, because it takes a long time to fetch from taskcluster); it's huge and mostly just a fun exercise in counting. It starts with:

951 139828237694720 realloc(0x7f2c39502000,1245184)=0x7f2c4ae00000
create_primitive has clip_store.len() = 5001
create_primitive has clip_store.len() = 5002
create_primitive has clip_store.len() = 5003
create_primitive has clip_store.len() = 5004
create_primitive has clip_store.len() = 5005
create_primitive has clip_store.len() = 5006
create_primitive has clip_store.len() = 5007
create_primitive has clip_store.len() = 5008
create_primitive has clip_store.len() = 5009
create_primitive has clip_store.len() = 5010
prim_store: cpu_metadata.len() is 5001

and goes to:

create_primitive has clip_store.len() = 500645
prim_store: cpu_metadata.len() is 500636
create_primitive has clip_store.len() = 500646
prim_store: cpu_metadata.len() is 500637
create_primitive has clip_store.len() = 500647
prim_store: cpu_metadata.len() is 500638
[GFX1-]: Receive IPC close with reason=AbnormalShutdown
[Child 1035, Chrome_ChildThread] WARNING: pipe error (3): Connection reset by peer: file /builds/worker/workspace/build/src/ipc/chromium/src/chrome/common/ipc_channel_posix.cc, line 353

before the harness shuts it down.

The last display list that is received just before this counting madness starts looks reasonably sane, except for this item:

GenericDisplayItem { item: Image(ImageDisplayItem { image_key: ImageKey(IdNamespace(10), 34), stretch_size: 10000000.0×10000000.0, tile_spacing: 10000000.0×10000000.0, image_rendering: Auto, alpha_type: PremultipliedAlpha }), clip_and_scroll: ClipAndScrollInfo { scroll_node_id: Clip(10, PipelineId(1, 5)), clip_node_id: None }, info: PrimitiveInfo { rect: TypedRect(10000000.0×10000000.0 at (8.0,8.0)), local_clip: Rect(TypedRect(10000000.0×10000000.0 at (8.0,8.0))), is_backface_visible: true, tag: None } }

which is quite large and presumably the source of the problem.
I should also note that the same display item appears in *successful* runs on the try push, so it's probably not the display list alone that's the problem; it must somehow be triggering an edge case in WR.
Flags: needinfo?(gwatson)
Yep, that definitely looks like it could be the cause of the problem. I'll see if I can reproduce this locally in wrench today.

Thanks for chasing down the cause!
Flags: needinfo?(gwatson)
I took some of the patches from the try run above and added a panic in the Rust code if the array length exceeds 10k. But I can't seem to trigger the behaviour - try run is https://treeherder.mozilla.org/#/jobs?repo=try&revision=870f7ffd104743d419afab45cc2577f6509f1488&selectedJob=166908535.

Is there something obvious I've done wrong that's stopping the bug from occurring here, or is it just very rare and I happen not to be seeing it so far?
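The guard was along these lines (a hypothetical sketch: PrimitiveMetadata is stubbed out here and the exact panic message differed):

// Hypothetical sketch of the debugging guard: panic once the
// primitive store grows past a sanity threshold, so the crash
// reporter captures a backtrace at the point of runaway growth.
struct PrimitiveStore {
    cpu_metadata: Vec<u32>, // stand-in for the real PrimitiveMetadata
}

impl PrimitiveStore {
    fn add_primitive(&mut self, meta: u32) -> usize {
        assert!(
            self.cpu_metadata.len() < 10_000,
            "primitive store grew past sanity threshold"
        );
        self.cpu_metadata.push(meta);
        self.cpu_metadata.len() - 1
    }
}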
Ah, I did eventually get a crash in that try run.
Relevant backtrace below. The key entry here is DisplayListFlattener::decompose_image_row. It's not clear to me why this test would think the image requires tiling. I'll try to add some more logging to determine when and why a tiled image is selected here, since I don't think it should be.


[task 2018-03-09T02:52:38.576Z] 02:52:38     INFO -
[task 2018-03-09T02:52:38.577Z] 02:52:38     INFO - Thread 57 (crashed)
[task 2018-03-09T02:52:38.578Z] 02:52:38     INFO -  0  firefox!mozalloc_abort [mozalloc_abort.cpp:870f7ffd104743d419afab45cc2577f6509f1488 : 33 + 0x5]
[task 2018-03-09T02:52:38.578Z] 02:52:38     INFO -     rax = 0x0000000000000000   rdx = 0x0000000000000000
[task 2018-03-09T02:52:38.579Z] 02:52:38     INFO -     rcx = 0x00007f9c7b37d2dd   rbx = 0x00007f9c7b64b700
[task 2018-03-09T02:52:38.579Z] 02:52:38     INFO -     rsi = 0x00007f9c7b64c770   rdi = 0x00007f9c7b64b540
[task 2018-03-09T02:52:38.580Z] 02:52:38     INFO -     rbp = 0x00007f9c453f61d0   rsp = 0x00007f9c453f61c0
[task 2018-03-09T02:52:38.581Z] 02:52:38     INFO -      r8 = 0x00007f9c7b64c770    r9 = 0x00007f9c453ff700
[task 2018-03-09T02:52:38.581Z] 02:52:38     INFO -     r10 = 0x0000000000000058   r11 = 0x00007f9c7b2f37a0
[task 2018-03-09T02:52:38.582Z] 02:52:38     INFO -     r12 = 0x00007f9c6eed5270   r13 = 0x000000000000001f
[task 2018-03-09T02:52:38.582Z] 02:52:38     INFO -     r14 = 0x0000000000000428   r15 = 0x000000000000000d
[task 2018-03-09T02:52:38.583Z] 02:52:38     INFO -     rip = 0x000000000041c790
[task 2018-03-09T02:52:38.583Z] 02:52:38     INFO -     Found by: given as instruction pointer in context
[task 2018-03-09T02:52:38.584Z] 02:52:38     INFO -  1  firefox!abort [mozalloc_abort.cpp:870f7ffd104743d419afab45cc2577f6509f1488 : 80 + 0x5]
[task 2018-03-09T02:52:38.584Z] 02:52:38     INFO -     rbx = 0x0000000000000001   rbp = 0x00007f9c453f61e0
[task 2018-03-09T02:52:38.585Z] 02:52:38     INFO -     rsp = 0x00007f9c453f61e0   r12 = 0x00007f9c6eed5270
[task 2018-03-09T02:52:38.585Z] 02:52:38     INFO -     r13 = 0x000000000000001f   r14 = 0x0000000000000428
[task 2018-03-09T02:52:38.586Z] 02:52:38     INFO -     r15 = 0x000000000000000d   rip = 0x000000000041c766
[task 2018-03-09T02:52:38.586Z] 02:52:38     INFO -     Found by: call frame info
[task 2018-03-09T02:52:38.587Z] 02:52:38     INFO -  2  libxul.so!panic_abort::__rust_start_panic [lib.rs:4d90ac38c0b61bb69470b61ea2cccea0df48d9e5 : 59 + 0x5]
[task 2018-03-09T02:52:38.587Z] 02:52:38     INFO -     rbx = 0x0000000000000001   rbp = 0x00007f9c453f61f0
[task 2018-03-09T02:52:38.588Z] 02:52:38     INFO -     rsp = 0x00007f9c453f61f0   r12 = 0x00007f9c6eed5270
[task 2018-03-09T02:52:38.588Z] 02:52:38     INFO -     r13 = 0x000000000000001f   r14 = 0x0000000000000428
[task 2018-03-09T02:52:38.589Z] 02:52:38     INFO -     r15 = 0x000000000000000d   rip = 0x00007f9c6e2d5ce9
[task 2018-03-09T02:52:38.589Z] 02:52:38     INFO -     Found by: call frame info
[task 2018-03-09T02:52:38.590Z] 02:52:38     INFO -  3  libxul.so!std::panicking::rust_panic [panicking.rs:4d90ac38c0b61bb69470b61ea2cccea0df48d9e5 : 608 + 0x9]
[task 2018-03-09T02:52:38.591Z] 02:52:38     INFO -     rbp = 0x00007f9c453f6250   rsp = 0x00007f9c453f6200
[task 2018-03-09T02:52:38.591Z] 02:52:38     INFO -     rip = 0x00007f9c6e29c3fd
[task 2018-03-09T02:52:38.591Z] 02:52:38     INFO -     Found by: previous frame's frame pointer
[task 2018-03-09T02:52:38.592Z] 02:52:38     INFO -  4  libxul.so!std::panicking::rust_panic_with_hook [panicking.rs:4d90ac38c0b61bb69470b61ea2cccea0df48d9e5 : 593 + 0xd]
[task 2018-03-09T02:52:38.593Z] 02:52:38     INFO -     rbp = 0x00007f9c453f62e0   rsp = 0x00007f9c453f6260
[task 2018-03-09T02:52:38.593Z] 02:52:38     INFO -     rip = 0x00007f9c6e29c368
[task 2018-03-09T02:52:38.593Z] 02:52:38     INFO -     Found by: call frame info
[task 2018-03-09T02:52:38.594Z] 02:52:38     INFO -  5  libxul.so!std::panicking::begin_panic<&str> [panicking.rs:4d90ac38c0b61bb69470b61ea2cccea0df48d9e5 : 538 + 0x12]
[task 2018-03-09T02:52:38.594Z] 02:52:38     INFO -     rbx = 0x00007f9c6eed5310   rbp = 0x00007f9c453f6330
[task 2018-03-09T02:52:38.602Z] 02:52:38     INFO -     rsp = 0x00007f9c453f62f0   r12 = 0x0000000000000101
[task 2018-03-09T02:52:38.603Z] 02:52:38     INFO -     r13 = 0x00007f9c453f6c01   r14 = 0x00007f9c70279370
[task 2018-03-09T02:52:38.603Z] 02:52:38     INFO -     r15 = 0x0000000000000012   rip = 0x00007f9c6e188c02
[task 2018-03-09T02:52:38.604Z] 02:52:38     INFO -     Found by: call frame info
[task 2018-03-09T02:52:38.604Z] 02:52:38     INFO -  6  libxul.so!webrender::prim_store::PrimitiveStore::add_primitive [prim_store.rs:870f7ffd104743d419afab45cc2577f6509f1488 : 1064 + 0x18]
[task 2018-03-09T02:52:38.605Z] 02:52:38     INFO -     rbp = 0x00007f9c453f6700   rsp = 0x00007f9c453f6340
[task 2018-03-09T02:52:38.606Z] 02:52:38     INFO -     rip = 0x00007f9c6e154946
[task 2018-03-09T02:52:38.606Z] 02:52:38     INFO -     Found by: previous frame's frame pointer
[task 2018-03-09T02:52:38.607Z] 02:52:38     INFO -  7  libxul.so!webrender::display_list_flattener::DisplayListFlattener::create_primitive [display_list_flattener.rs:870f7ffd104743d419afab45cc2577f6509f1488 : 1176 + 0x25]
[task 2018-03-09T02:52:38.608Z] 02:52:38     INFO -     rbp = 0x00007f9c453f6960   rsp = 0x00007f9c453f6710
[task 2018-03-09T02:52:38.609Z] 02:52:38     INFO -     rip = 0x00007f9c6e500e98
[task 2018-03-09T02:52:38.609Z] 02:52:38     INFO -     Found by: previous frame's frame pointer
[task 2018-03-09T02:52:38.610Z] 02:52:38     INFO -  8  libxul.so!webrender::display_list_flattener::DisplayListFlattener::add_image [display_list_flattener.rs:870f7ffd104743d419afab45cc2577f6509f1488 : 1239 + 0x122]
[task 2018-03-09T02:52:38.611Z] 02:52:38     INFO -     rbp = 0x00007f9c453f6bf0   rsp = 0x00007f9c453f6970
[task 2018-03-09T02:52:38.611Z] 02:52:38     INFO -     rip = 0x00007f9c6e5039ba
[task 2018-03-09T02:52:38.612Z] 02:52:38     INFO -     Found by: previous frame's frame pointer
[task 2018-03-09T02:52:38.613Z] 02:52:38     INFO -  9  libxul.so!webrender::display_list_flattener::DisplayListFlattener::add_tile_primitive [display_list_flattener.rs:870f7ffd104743d419afab45cc2577f6509f1488 : 1135 + 0x2a]
[task 2018-03-09T02:52:38.614Z] 02:52:38     INFO -     rbp = 0x00007f9c453f6d10   rsp = 0x00007f9c453f6c00
[task 2018-03-09T02:52:38.614Z] 02:52:38     INFO -     rip = 0x00007f9c6e5009aa
[task 2018-03-09T02:52:38.615Z] 02:52:38     INFO -     Found by: previous frame's frame pointer
[task 2018-03-09T02:52:38.616Z] 02:52:38     INFO - 10  libxul.so!webrender::display_list_flattener::DisplayListFlattener::decompose_tiled_image [display_list_flattener.rs:870f7ffd104743d419afab45cc2577f6509f1488 : 1025 + 0x28]
[task 2018-03-09T02:52:38.616Z] 02:52:38     INFO -     rbp = 0x00007f9c453f6dc0   rsp = 0x00007f9c453f6d20
[task 2018-03-09T02:52:38.617Z] 02:52:38     INFO -     rip = 0x00007f9c6e5005ff
[task 2018-03-09T02:52:38.618Z] 02:52:38     INFO -     Found by: previous frame's frame pointer
[task 2018-03-09T02:52:38.619Z] 02:52:38     INFO - 11  libxul.so!webrender::display_list_flattener::DisplayListFlattener::decompose_image_row [display_list_flattener.rs:870f7ffd104743d419afab45cc2577f6509f1488 : 939 + 0x3e]
[task 2018-03-09T02:52:38.619Z] 02:52:38     INFO -     rbp = 0x00007f9c453f6ef0   rsp = 0x00007f9c453f6dd0
[task 2018-03-09T02:52:38.620Z] 02:52:38     INFO -     rip = 0x00007f9c6e500420
[task 2018-03-09T02:52:38.620Z] 02:52:38     INFO -     Found by: previous frame's frame pointer
[task 2018-03-09T02:52:38.621Z] 02:52:38     INFO - 12  libxul.so!webrender::display_list_flattener::DisplayListFlattener::flatten_item [display_list_flattener.rs:870f7ffd104743d419afab45cc2577f6509f1488 : 893 + 0x4d]
[task 2018-03-09T02:52:38.622Z] 02:52:38     INFO -     rbp = 0x00007f9c453f7b90   rsp = 0x00007f9c453f6f00
[task 2018-03-09T02:52:38.623Z] 02:52:38     INFO -     rip = 0x00007f9c6e4fe517
[task 2018-03-09T02:52:38.623Z] 02:52:38     INFO -     Found by: previous frame's frame pointer
[task 2018-03-09T02:52:38.624Z] 02:52:38     INFO - 13  libxul.so!webrender::display_list_flattener::DisplayListFlattener::flatten_item [display_list_flattener.rs:870f7ffd104743d419afab45cc2577f6509f1488 : 392 + 0x16]
[task 2018-03-09T02:52:38.625Z] 02:52:38     INFO -     rbp = 0x00007f9c453f8820   rsp = 0x00007f9c453f7ba0
[task 2018-03-09T02:52:38.625Z] 02:52:38     INFO -     rip = 0x00007f9c6e4fc6e1
[task 2018-03-09T02:52:38.626Z] 02:52:38     INFO -     Found by: previous frame's frame pointer
[task 2018-03-09T02:52:38.627Z] 02:52:38     INFO - 14  libxul.so!webrender::display_list_flattener::DisplayListFlattener::flatten_root [display_list_flattener.rs:870f7ffd104743d419afab45cc2577f6509f1488 : 392 + 0x15]
[task 2018-03-09T02:52:38.627Z] 02:52:38     INFO -     rbp = 0x00007f9c453f8d00   rsp = 0x00007f9c453f8830
[task 2018-03-09T02:52:38.628Z] 02:52:38     INFO -     rip = 0x00007f9c6e4f707a
[task 2018-03-09T02:52:38.628Z] 02:52:38     INFO -     Found by: previous frame's frame pointer
[task 2018-03-09T02:52:38.629Z] 02:52:38     INFO - 15  libxul.so!webrender::display_list_flattener::DisplayListFlattener::flatten_item [display_list_flattener.rs:870f7ffd104743d419afab45cc2577f6509f1488 : 617 + 0xb]
[task 2018-03-09T02:52:38.630Z] 02:52:38     INFO -     rbp = 0x00007f9c453f9990   rsp = 0x00007f9c453f8d10
[task 2018-03-09T02:52:38.630Z] 02:52:38     INFO -     rip = 0x00007f9c6e4fe9dd
[task 2018-03-09T02:52:38.631Z] 02:52:38     INFO -     Found by: previous frame's frame pointer
[task 2018-03-09T02:52:38.632Z] 02:52:38     INFO - 16  libxul.so!webrender::display_list_flattener::DisplayListFlattener::flatten_item [display_list_flattener.rs:870f7ffd104743d419afab45cc2577f6509f1488 : 392 + 0x16]
[task 2018-03-09T02:52:38.632Z] 02:52:38     INFO -     rbp = 0x00007f9c453fa620   rsp = 0x00007f9c453f99a0
[task 2018-03-09T02:52:38.633Z] 02:52:38     INFO -     rip = 0x00007f9c6e4fc6e1
[task 2018-03-09T02:52:38.633Z] 02:52:38     INFO -     Found by: previous frame's frame pointer
[task 2018-03-09T02:52:38.634Z] 02:52:38     INFO - 17  libxul.so!webrender::display_list_flattener::DisplayListFlattener::flatten_root [display_list_flattener.rs:870f7ffd104743d419afab45cc2577f6509f1488 : 392 + 0x15]
[task 2018-03-09T02:52:38.635Z] 02:52:38     INFO -     rbp = 0x00007f9c453fab00   rsp = 0x00007f9c453fa630
[task 2018-03-09T02:52:38.636Z] 02:52:38     INFO -     rip = 0x00007f9c6e4f707a
[task 2018-03-09T02:52:38.636Z] 02:52:38     INFO -     Found by: previous frame's frame pointer
[task 2018-03-09T02:52:38.637Z] 02:52:38     INFO - 18  libxul.so!webrender::display_list_flattener::DisplayListFlattener::create_frame_builder [display_list_flattener.rs:870f7ffd104743d419afab45cc2577f6509f1488 : 260 + 0x10]
[task 2018-03-09T02:52:38.638Z] 02:52:38     INFO -     rbp = 0x00007f9c453fb550   rsp = 0x00007f9c453fab10
[task 2018-03-09T02:52:38.638Z] 02:52:38     INFO -     rip = 0x00007f9c6e4f6b5f
[task 2018-03-09T02:52:38.639Z] 02:52:38     INFO -     Found by: previous frame's frame pointer
[task 2018-03-09T02:52:38.640Z] 02:52:38     INFO - 19  libxul.so!webrender::render_backend::RenderBackend::update_document [render_backend.rs:870f7ffd104743d419afab45cc2577f6509f1488 : 195 + 0x20]
[task 2018-03-09T02:52:38.640Z] 02:52:38     INFO -     rbp = 0x00007f9c453fc350   rsp = 0x00007f9c453fb560
[task 2018-03-09T02:52:38.641Z] 02:52:38     INFO -     rip = 0x00007f9c6e1e9eb3
[task 2018-03-09T02:52:38.642Z] 02:52:38     INFO -     Found by: previous frame's frame pointer
[task 2018-03-09T02:52:38.643Z] 02:52:38     INFO - 20  libxul.so!webrender::render_backend::RenderBackend::run [render_backend.rs:870f7ffd104743d419afab45cc2577f6509f1488 : 885 + 0x1f]
[task 2018-03-09T02:52:38.644Z] 02:52:38     INFO -     rbp = 0x00007f9c453fd390   rsp = 0x00007f9c453fc360
[task 2018-03-09T02:52:38.645Z] 02:52:38     INFO -     rip = 0x00007f9c6e1e6347
[task 2018-03-09T02:52:38.645Z] 02:52:38     INFO -     Found by: previous frame's frame pointer
[task 2018-03-09T02:52:38.646Z] 02:52:38     INFO - 21  libxul.so!std::sys_common::backtrace::__rust_begin_short_backtrace<closure,()> [renderer.rs:870f7ffd104743d419afab45cc2577f6509f1488 : 2253 + 0x2a]
[task 2018-03-09T02:52:38.647Z] 02:52:38     INFO -     rbp = 0x00007f9c453fdfb0   rsp = 0x00007f9c453fd3a0
[task 2018-03-09T02:52:38.648Z] 02:52:38     INFO -     rip = 0x00007f9c6e12a113
[task 2018-03-09T02:52:38.649Z] 02:52:38     INFO -     Found by: previous frame's frame pointer
[task 2018-03-09T02:52:38.650Z] 02:52:38     INFO - 22  libxul.so!std::panicking::try::do_call<std::panic::AssertUnwindSafe<closure>,()> [mod.rs:4d90ac38c0b61bb69470b61ea2cccea0df48d9e5 : 406 + 0x18]
[task 2018-03-09T02:52:38.650Z] 02:52:38     INFO -     rbp = 0x00007f9c453fe4a0   rsp = 0x00007f9c453fdfc0
[task 2018-03-09T02:52:38.651Z] 02:52:38     INFO -     rip = 0x00007f9c6e18963e
[task 2018-03-09T02:52:38.652Z] 02:52:38     INFO -     Found by: previous frame's frame pointer
I confirmed locally that during that test run, no image templates are added which have tiling enabled.

Current theories:
 * Display list is occasionally being sent with bogus / corrupted value for image tiling parameters.
 * tiled_image_map is getting into a bad state, perhaps with the recent async scene changes.

Pending try here with more logging information related to image tiling params:

https://treeherder.mozilla.org/#/jobs?repo=try&revision=708b7da3439b2ae7b38981bc63b1efba85b3b7c8
Relevant log from latest test:

[task 2018-03-09T06:34:34.536Z] 06:34:34     INFO - !@# add_image_template ImageKey(IdNamespace(10), 33) None
[task 2018-03-09T06:34:34.536Z] 06:34:34     INFO - !@# add_image_template ImageKey(IdNamespace(10), 34) None
[task 2018-03-09T06:34:34.536Z] 06:34:34     INFO - !@# delete_image_template ImageKey(IdNamespace(10), 20)
[task 2018-03-09T06:34:34.536Z] 06:34:34     INFO - !@# delete_image_template ImageKey(IdNamespace(10), 25)
[task 2018-03-09T06:34:34.536Z] 06:34:34     INFO - !@# delete_image_template ImageKey(IdNamespace(10), 21)
[task 2018-03-09T06:34:34.536Z] 06:34:34     INFO - !@# delete_image_template ImageKey(IdNamespace(10), 26)
[task 2018-03-09T06:34:34.537Z] 06:34:34     INFO - !@# delete_image_template ImageKey(IdNamespace(10), 24)
[task 2018-03-09T06:34:34.537Z] 06:34:34     INFO - !@# delete_image_template ImageKey(IdNamespace(10), 27)
[task 2018-03-09T06:34:34.537Z] 06:34:34     INFO - !@# decompose_image ImageKey(IdNamespace(10), 34)

It looks like an image (10, 34) gets added with no tiling configuration, and then for some reason decompose_image is called on it anyway. I can keep looking into this next week. Nical, any idea if the tiled_image_map may be causing this? Could it possibly be related to the async scene building somehow?
Flags: needinfo?(nical.bugzilla)
> Could it possibly be related to the async scene building somehow?

Anything is possible, but that PR didn't touch any of the code that manipulates image parameters. Tiling is enabled automatically on any image that has a dimension above a certain threshold (the maximum texture size reported by the device).
Is it possible that a (blob?) image has a very large size?
Flags: needinfo?(nical.bugzilla)
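The decision Nical describes boils down to a threshold check of roughly this shape (hypothetical sketch; the real logic lives in WR's resource cache and uses its own types):

// Hypothetical sketch of the automatic tiling decision: any image
// with a dimension above the device's maximum texture size is tiled.
fn tiling_for(width: u32, height: u32,
              max_texture_size: u32, tile_size: u16) -> Option<u16> {
    if width > max_texture_size || height > max_texture_size {
        Some(tile_size) // decompose into tiles
    } else {
        None // upload as a single texture
    }
}

With a maximum texture size of 8192, a 10000000x10000000 descriptor always lands in the tiling path, which is what the "decided to tile mts=8192" panic later in this bug confirms.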
This looks very much like a case of a very large image being requested (likely a blob image, since they are procedural and the recording could request potentially large sizes with the expectation that tiles are allocated and painted lazily).

Since we already had a similar failure, I think it's time to move the tile decomposition from the scene building phase to frame building, as we talked about before. Probably in PrimStore::prepare_prim_for_render (if it's possible to insert new primitives at this stage), decomposing only within the visible rect.
It's possible that it's a large blob image, but I think it's a bit more subtle than just that. The panic I added for testing fires when the prim count goes above 5000, and on my local machine, and in 99% of CI test runs, it is never hit.

So whatever is causing the decompose to loop infinitely / create a massive number of tile primitives is not deterministic - it's hit extremely rarely, which suggests that, in this case at least, the problem is a bug or corruption rather than a generic problem with tiling large images.
https://treeherder.mozilla.org/#/jobs?repo=try&revision=ff5f15d605256efeae1ffb3519f0b049d9de1ad6

Latest try run with more logging, to see when / why the crashing case decides to invoke tiling on an image.
A better try run is https://treeherder.mozilla.org/#/jobs?repo=try&revision=f26fddb025f1824245fd6731c995596e6d35193c&selectedJob=167333375.

There was one failure out of the 139 crashtest runs, with log file https://taskcluster-artifacts.net/dNkb4GnXSNe14wyrDOfQdA/0/public/logs/live_backing.log.

The relevant portion is here:

[task 2018-03-12T01:06:27.616Z] 01:06:27     INFO - !@# add_image_template ImageKey(IdNamespace(10), 34) None ImageDescriptor { format: BGRA8, width: 10000000, height: 10000000, stride: None, offset: 0, is_opaque: false }
[task 2018-03-12T01:06:27.616Z] 01:06:27    ERROR - thread 'WRRenderBackend#3' panicked at 'decided to tile mts=8192', gfx/webrender/src/resource_cache.rs:466:13

So, in 138/139 runs, that panic and log do not appear - no image gets added with those dimensions.

From this we can say:
 * The immediate cause of the panic is that we eagerly tile images that big in WR. We can work around this (a) in the short term by rejecting images above an absurd size (a sketch follows this list), and (b) in the long term by applying tiling lazily (we have WR issues and a plan for this).
 * However, the above is a band-aid - there appears to be non-deterministic behaviour coming from Gecko that only requests an image of that size very occasionally. We should investigate and fix that, since it appears to be the root cause of the problem here.
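A minimal sketch of that short-term guard, assuming a sanity cap picked well above any legitimate image (the value here is illustrative, not a proposed threshold):

// Hypothetical sketch of the short-term band-aid: reject image
// descriptors with absurd dimensions before the tiling decision.
const MAX_SANE_DIMENSION: u32 = 1 << 16; // illustrative cap: 65536 px

fn accept_image(width: u32, height: u32) -> bool {
    width <= MAX_SANE_DIMENSION && height <= MAX_SANE_DIMENSION
}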

Milan, who can look into the second part of this (the Gecko non-determinism in this test run)?
Flags: needinfo?(milan)
For reference, here's the add_image_template output from a passing run:

REFTEST TEST-LOAD | file:///builds/worker/workspace/build/tests/reftest/tests/layout/generic/crashtests/289864-1.html | 13 / 183 (7%)
!@# add_image_template ImageKey(IdNamespace(10), 28) None ImageDescriptor { format: BGRA8, width: 790, height: 10, stride: Some(3160), offset: 0, is_opaque: false }
!@# add_image_template ImageKey(IdNamespace(10), 29) None ImageDescriptor { format: BGRA8, width: 790, height: 10, stride: Some(3160), offset: 0, is_opaque: false }
!@# add_image_template ImageKey(IdNamespace(10), 30) None ImageDescriptor { format: BGRA8, width: 33, height: 10, stride: Some(132), offset: 0, is_opaque: false }
!@# add_image_template ImageKey(IdNamespace(10), 31) None ImageDescriptor { format: BGRA8, width: 10, height: 990, stride: Some(40), offset: 0, is_opaque: false }
!@# add_image_template ImageKey(IdNamespace(10), 32) None ImageDescriptor { format: BGRA8, width: 10, height: 990, stride: Some(40), offset: 0, is_opaque: false }
!@# add_image_template ImageKey(IdNamespace(10), 33) None ImageDescriptor { format: BGRA8, width: 10, height: 33, stride: Some(40), offset: 0, is_opaque: false }
!@# add_image_template ImageKey(IdNamespace(10), 34) None ImageDescriptor { format: BGRA8, width: 717, height: 538, stride: Some(2868), offset: 0, is_opaque: true }
!@# delete_image_template ImageKey(IdNamespace(10), 24)
!@# delete_image_template ImageKey(IdNamespace(10), 20)
!@# delete_image_template ImageKey(IdNamespace(10), 27)
!@# delete_image_template ImageKey(IdNamespace(10), 25)
!@# delete_image_template ImageKey(IdNamespace(10), 21)
!@# delete_image_template ImageKey(IdNamespace(10), 26)
!@# add_image_template ImageKey(IdNamespace(10), 35) None ImageDescriptor { format: BGRA8, width: 717, height: 538, stride: Some(2868), offset: 0, is_opaque: true }
REFTEST TEST-PASS | file:///builds/worker/workspace/build/tests/reftest/tests/layout/generic/crashtests/289864-1.html | (LOAD ONLY)


and here's the corresponding output from the failing run:

REFTEST TEST-LOAD | file:///builds/worker/workspace/build/tests/reftest/tests/layout/generic/crashtests/289864-1.html | 13 / 183 (7%)
!@# add_image_template ImageKey(IdNamespace(10), 28) None ImageDescriptor { format: BGRA8, width: 790, height: 10, stride: Some(3160), offset: 0, is_opaque: false }
!@# add_image_template ImageKey(IdNamespace(10), 29) None ImageDescriptor { format: BGRA8, width: 790, height: 10, stride: Some(3160), offset: 0, is_opaque: false }
!@# add_image_template ImageKey(IdNamespace(10), 30) None ImageDescriptor { format: BGRA8, width: 33, height: 10, stride: Some(132), offset: 0, is_opaque: false }
!@# add_image_template ImageKey(IdNamespace(10), 31) None ImageDescriptor { format: BGRA8, width: 10, height: 990, stride: Some(40), offset: 0, is_opaque: false }
!@# add_image_template ImageKey(IdNamespace(10), 32) None ImageDescriptor { format: BGRA8, width: 10, height: 990, stride: Some(40), offset: 0, is_opaque: false }
!@# add_image_template ImageKey(IdNamespace(10), 33) None ImageDescriptor { format: BGRA8, width: 10, height: 33, stride: Some(40), offset: 0, is_opaque: false }
!@# add_image_template ImageKey(IdNamespace(10), 34) None ImageDescriptor { format: BGRA8, width: 10000000, height: 10000000, stride: None, offset: 0, is_opaque: false }
thread 'WRRenderBackend#3' panicked at 'decided to tile mts=8192', gfx/webrender/src/resource_cache.rs:466:13
stack backtrace:
REFTEST TEST-PASS | file:///builds/worker/workspace/build/tests/reftest/tests/layout/generic/crashtests/289864-1.html | (LOAD ONLY)


So in the good case the image is 717x538, and in the bad case it's 10000000x10000000. Since Milan is on PTO: Andrew, do you have any idea why this behaviour is intermittently different for this crashtest?

Also, I think it's now safe to say this behaviour is triggered by 289864-1.html.
Flags: needinfo?(milan) → needinfo?(aosmond)
We are falling back to blob images when we shouldn't. I suspect bug 1444387 resolved this; it corrected a very common case (for me, anyway) where we would temporarily use fallback to render images. The failure sequence goes something like:

- Metadata decode completes. Decode triggered at some point and display list generation happens.
- In nsDisplayImage::CreateWebRenderCommands, we get an image container which is empty because the decoder hasn't produced a frame yet (neither partial nor complete). This is a race between the image decoder threads and the main thread.
- Empty container causes CreateWebRenderCommands to eventually return false, and try fallback.
- The blob image recording produces an image at the upscaled size, hence the larger image in the resource cache.

In the normal case, we get a non-empty image container, and nsDisplayImage::CreateWebRenderCommands succeeds. It will add an image at its native size (since upscaling is required) and the dest rect for the display item will be much bigger.

I assume at this point WebRender is able to handle the necessary upscaling, as it wasn't blocked by allocating a massive buffer up front.

If bug 1444387 does not resolve this, then I would suspect we are somehow generating a display item before metadata decoding completes for the image, and using fallback for that. We would want to avoid that as well. It fails because we only get an image container once metadata decoding has completed, since the code cannot reason about the requested decode size without knowing the native image size.

Alternatively, and less likely, if decoding were to fail for some intermittent reason (e.g. could not allocate a buffer), it would try to use fallback to display the broken image icon / alt text if present. That may have a similar problem, and we haven't completed the work to avoid fallback for that case yet.
Flags: needinfo?(aosmond)
Thanks for the detailed explanation! We can wait and see whether the intermittent failures (mostly starred as bug 1436384 now) go away with bug 1444387. I haven't seen any since it landed, but it's only been a day, so we'll have to give it a bit more time.
Depends on: 1455206
Nothing left to do here; the intermittents have gone away. It seems that bug 1444387 fixed the majority of them, but there was a slow trickle that continued until a couple of months ago, and then that went away too.
Status: NEW → RESOLVED
Closed: 6 years ago
Resolution: --- → WORKSFORME