Crash in [@ mozilla::TimeStampValue::operator-]
Categories
(Core :: Graphics, defect)
Tracking
()
People
(Reporter: gsvelto, Assigned: tnikkel)
References
Details
(Keywords: crash, csectype-uaf, sec-high, Whiteboard: [adv-esr102.8+r])
Crash Data
Crash report: https://crash-stats.mozilla.org/report/index/8eae32f6-36f2-45c6-9f14-6a1ed0220219
Reason: EXCEPTION_ACCESS_VIOLATION_READ
Top 10 frames of crashing thread:
0 mozglue.dll mozilla::TimeStampValue::operator- const mozglue/misc/TimeStamp_windows.cpp:357
1 xul.dll nsRefreshDriver::Tick layout/base/nsRefreshDriver.cpp:2184
2 xul.dll mozilla::RefreshDriverTimer::TickRefreshDrivers layout/base/nsRefreshDriver.cpp:326
3 xul.dll mozilla::RefreshDriverTimer::Tick layout/base/nsRefreshDriver.cpp:342
4 xul.dll mozilla::VsyncRefreshDriverTimer::RefreshDriverVsyncObserver::TickRefreshDriver layout/base/nsRefreshDriver.cpp:703
5 xul.dll mozilla::VsyncRefreshDriverTimer::RefreshDriverVsyncObserver::NotifyParentProcessVsync layout/base/nsRefreshDriver.cpp:620
6 xul.dll mozilla::VsyncRefreshDriverTimer::RefreshDriverVsyncObserver::NotifyVsync layout/base/nsRefreshDriver.cpp:541
7 xul.dll mozilla::dom::VsyncMainChild::RecvNotify dom/ipc/VsyncMainChild.cpp:68
8 xul.dll mozilla::dom::PVsyncChild::OnMessageReceived ipc/ipdl/PVsyncChild.cpp:208
9 xul.dll mozilla::ipc::PBackgroundChild::OnMessageReceived ipc/ipdl/PBackgroundChild.cpp:6167
This is a use-after-free crash caught by PHC. The stack and signatures aren't really important because they just happen to be the first object that accessed the dead memory in this particular case. What's important are the allocation and free stacks. The memory was allocated at this stack:
#0 mozilla::layers::ClipManager::PushOverrideForASR(mozilla::ActiveScrolledRoot const*, mozilla::wr::WrSpatialId const&) (xul.dll)
#1 mozilla::layers::ClipManager::BeginList(mozilla::layers::StackingContextHelper const&) (xul.dll)
#2 mozilla::layers::WebRenderCommandBuilder::CreateWebRenderCommandsFromDisplayList(mozilla::nsDisplayList*, mozilla::nsDisplayItem*, mozilla::nsDisplayListBuilder*, mozilla::layers::StackingContextHelper const&, mozilla::wr::DisplayListBuilder&, mozilla::wr::IpcResourceUpdateQueue&, bool) (xul.dll)
#3 mozilla::nsDisplayTransform::CreateWebRenderCommands(mozilla::wr::DisplayListBuilder&, mozilla::wr::IpcResourceUpdateQueue&, mozilla::layers::StackingContextHelper const&, mozilla::layers::RenderRootStateManager*, mozilla::nsDisplayListBuilder*) (xul.dll)
#4 mozilla::layers::WebRenderCommandBuilder::CreateWebRenderCommandsFromDisplayList(mozilla::nsDisplayList*, mozilla::nsDisplayItem*, mozilla::nsDisplayListBuilder*, mozilla::layers::StackingContextHelper const&, mozilla::wr::DisplayListBuilder&, mozilla::wr::IpcResourceUpdateQueue&, bool) (xul.dll)
#5 mozilla::nsDisplayOpacity::CreateWebRenderCommands(mozilla::wr::DisplayListBuilder&, mozilla::wr::IpcResourceUpdateQueue&, mozilla::layers::StackingContextHelper const&, mozilla::layers::RenderRootStateManager*, mozilla::nsDisplayListBuilder*) (xul.dll)
#6 mozilla::layers::WebRenderCommandBuilder::CreateWebRenderCommandsFromDisplayList(mozilla::nsDisplayList*, mozilla::nsDisplayItem*, mozilla::nsDisplayListBuilder*, mozilla::layers::StackingContextHelper const&, mozilla::wr::DisplayListBuilder&, mozilla::wr::IpcResourceUpdateQueue&, bool) (xul.dll)
#7 mozilla::nsDisplayOwnLayer::CreateWebRenderCommands(mozilla::wr::DisplayListBuilder&, mozilla::wr::IpcResourceUpdateQueue&, mozilla::layers::StackingContextHelper const&, mozilla::layers::RenderRootStateManager*, mozilla::nsDisplayListBuilder*) (xul.dll)
#8 mozilla::layers::WebRenderCommandBuilder::CreateWebRenderCommandsFromDisplayList(mozilla::nsDisplayList*, mozilla::nsDisplayItem*, mozilla::nsDisplayListBuilder*, mozilla::layers::StackingContextHelper const&, mozilla::wr::DisplayListBuilder&, mozilla::wr::IpcResourceUpdateQueue&, bool) (xul.dll)
#9 mozilla::layers::WebRenderLayerManager::EndTransactionWithoutLayer(mozilla::nsDisplayList*, mozilla::nsDisplayListBuilder*, WrFiltersHolder&&, mozilla::layers::WebRenderBackgroundData*, const double) (xul.dll)
#10 mozilla::nsDisplayList::PaintRoot(mozilla::nsDisplayListBuilder*, gfxContext*, unsigned int, mozilla::Maybe<double>) (xul.dll)
#11 static nsLayoutUtils::PaintFrame(gfxContext*, nsIFrame*, nsRegion const&, unsigned int, mozilla::nsDisplayListBuilderMode, nsLayoutUtils::PaintFrameFlags) (xul.dll)
#12 mozilla::PresShell::PaintInternal(nsView*, mozilla::PaintInternalFlags) (xul.dll)
#13 nsViewManager::ProcessPendingUpdatesPaint(nsIWidget*) (xul.dll)
#14 nsViewManager::ProcessPendingUpdatesForView(nsView*, bool) (xul.dll)
#15 nsViewManager::ProcessPendingUpdates() (xul.dll)
And it was freed here:
#0 std::list<std::pair<const mozilla::wr::WrSpatialId,std::stack<mozilla::wr::WrSpatialId,std::deque<mozilla::wr::WrSpatialId,std::allocator<mozilla::wr::WrSpatialId> > > >,std::allocator<std::pair<const mozilla::wr::WrSpatialId,std::stack<mozilla::wr::WrSpatialId,std::deque<mozilla::wr::WrSpatialId,std::allocator<mozilla::wr::WrSpatialId> > > > > >::erase(std::_List_const_iterator<std::_List_val<std::_List_simple_types<std::pair<const mozilla::wr::WrSpatialId,std::stack<mozilla::wr::WrSpatialId,std::deque<mozilla::wr::WrSpatialId,std::allocator<mozilla::wr::WrSpatialId> > > > > > >) (xul.dll)
#1 mozilla::layers::ClipManager::PopOverrideForASR(mozilla::ActiveScrolledRoot const*) (xul.dll)
#2 mozilla::layers::WebRenderCommandBuilder::CreateWebRenderCommandsFromDisplayList(mozilla::nsDisplayList*, mozilla::nsDisplayItem*, mozilla::nsDisplayListBuilder*, mozilla::layers::StackingContextHelper const&, mozilla::wr::DisplayListBuilder&, mozilla::wr::IpcResourceUpdateQueue&, bool) (xul.dll)
#3 mozilla::nsDisplayTransform::CreateWebRenderCommands(mozilla::wr::DisplayListBuilder&, mozilla::wr::IpcResourceUpdateQueue&, mozilla::layers::StackingContextHelper const&, mozilla::layers::RenderRootStateManager*, mozilla::nsDisplayListBuilder*) (xul.dll)
#4 mozilla::layers::WebRenderCommandBuilder::CreateWebRenderCommandsFromDisplayList(mozilla::nsDisplayList*, mozilla::nsDisplayItem*, mozilla::nsDisplayListBuilder*, mozilla::layers::StackingContextHelper const&, mozilla::wr::DisplayListBuilder&, mozilla::wr::IpcResourceUpdateQueue&, bool) (xul.dll)
#5 mozilla::nsDisplayOpacity::CreateWebRenderCommands(mozilla::wr::DisplayListBuilder&, mozilla::wr::IpcResourceUpdateQueue&, mozilla::layers::StackingContextHelper const&, mozilla::layers::RenderRootStateManager*, mozilla::nsDisplayListBuilder*) (xul.dll)
#6 mozilla::layers::WebRenderCommandBuilder::CreateWebRenderCommandsFromDisplayList(mozilla::nsDisplayList*, mozilla::nsDisplayItem*, mozilla::nsDisplayListBuilder*, mozilla::layers::StackingContextHelper const&, mozilla::wr::DisplayListBuilder&, mozilla::wr::IpcResourceUpdateQueue&, bool) (xul.dll)
#7 mozilla::nsDisplayOwnLayer::CreateWebRenderCommands(mozilla::wr::DisplayListBuilder&, mozilla::wr::IpcResourceUpdateQueue&, mozilla::layers::StackingContextHelper const&, mozilla::layers::RenderRootStateManager*, mozilla::nsDisplayListBuilder*) (xul.dll)
#8 mozilla::layers::WebRenderCommandBuilder::CreateWebRenderCommandsFromDisplayList(mozilla::nsDisplayList*, mozilla::nsDisplayItem*, mozilla::nsDisplayListBuilder*, mozilla::layers::StackingContextHelper const&, mozilla::wr::DisplayListBuilder&, mozilla::wr::IpcResourceUpdateQueue&, bool) (xul.dll)
#9 mozilla::layers::WebRenderLayerManager::EndTransactionWithoutLayer(mozilla::nsDisplayList*, mozilla::nsDisplayListBuilder*, WrFiltersHolder&&, mozilla::layers::WebRenderBackgroundData*, const double) (xul.dll)
#10 mozilla::nsDisplayList::PaintRoot(mozilla::nsDisplayListBuilder*, gfxContext*, unsigned int, mozilla::Maybe<double>) (xul.dll)
#11 static nsLayoutUtils::PaintFrame(gfxContext*, nsIFrame*, nsRegion const&, unsigned int, mozilla::nsDisplayListBuilderMode, nsLayoutUtils::PaintFrameFlags) (xul.dll)
#12 mozilla::PresShell::PaintInternal(nsView*, mozilla::PaintInternalFlags) (xul.dll)
#13 nsViewManager::ProcessPendingUpdatesPaint(nsIWidget*) (xul.dll)
#14 nsViewManager::ProcessPendingUpdatesForView(nsView*, bool) (xul.dll)
#15 nsViewManager::ProcessPendingUpdates() (xul.dll)
Updated•2 years ago
|
Updated•2 years ago
|
Comment 1•2 years ago
|
||
Where do the allocation and free stacks come from?
Comment 2•2 years ago
|
||
I want to know more about why we should ignore the crash stack here, and where these other alloc/free stacks come from.
To me, this then looks like an unrelated use-before-alloc, that just so happens to touch memory that was properly freed in your freed stack.
But you're saying the opposite, that we should look to the alloc/free stacks and disregard the crash stack? Can you give me a better understanding here? I'm a little lost! :)
Reporter | ||
Comment 3•2 years ago
|
||
The stacks are detected by PHC. PHC randomly selects guard pages to detect accesses to dead objects. The allocation and free stacks for the objects allocated in the page are recorded, and once the elements are freed the pages are deliberately leaked so that UAF accesses to them can be detected. In this case the crash stack is the first access that happened to hit the dead object, the bug however lies where the dead pointer was leaked.
One thing that I find odd about those stacks is that the crash stack seem to involve a BaseTransactionId<TimeStamp>
object while the allocated & freed object appears to be a WrSpatialId
one and the two don't seem related.
Comment 4•2 years ago
|
||
Assigning to get sec bugs owned. Feel free to hand off to someone else as needed.
Comment 5•2 years ago
|
||
I think this is a wildpointer deref that's tripping in PHC. PHC would think that a wildpointer deref would be a UAF if the wildpointer happens to hit memory that was previously freed (and forgotten about) correctly.
I think I know where the problem is, but I'll grab the minidump and check the inline stack to make sure before I post a speculative fix.
Comment 6•2 years ago
|
||
- [Inline Frame] mozglue.dll!mozilla::TimeStampValue::CheckQPC(const mozilla::TimeStampValue & aOther) Line 278
at /builds/worker/checkouts/gecko/mozglue/misc/TimeStamp_windows.cpp(278) - mozglue.dll!mozilla::TimeStampValue::operator-(const mozilla::TimeStampValue & aOther) Line 357
at /builds/worker/checkouts/gecko/mozglue/misc/TimeStamp_windows.cpp(357) - [Inline Frame] xul.dll!mozilla::TimeStampValue::operator<=(const mozilla::TimeStampValue & aOther) Line 77
at /builds/worker/workspace/obj-build/dist/include/mozilla/TimeStamp_windows.h(77) - [Inline Frame] xul.dll!mozilla::TimeStamp::operator<=(const mozilla::TimeStamp & aOther) Line 502
at /builds/worker/workspace/obj-build/dist/include/mozilla/TimeStamp.h(502) - xul.dll!nsRefreshDriver::Tick(mozilla::layers::BaseTransactionId<mozilla::VsyncIdType> aId, mozilla::TimeStamp aNowTime, nsRefreshDriver::IsExtraTick aIsExtraTick) Line 2184
at /builds/worker/checkouts/gecko/layout/base/nsRefreshDriver.cpp(2184) - [Inline Frame] xul.dll!mozilla::RefreshDriverTimer::TickDriver(nsRefreshDriver * driver, mozilla::layers::BaseTransactionId<mozilla::VsyncIdType> aId, mozilla::TimeStamp) Line 348
at /builds/worker/checkouts/gecko/layout/base/nsRefreshDriver.cpp(348) - xul.dll!mozilla::RefreshDriverTimer::TickRefreshDrivers(mozilla::layers::BaseTransactionId<mozilla::VsyncIdType> aId, mozilla::TimeStamp aNow, nsTArray<RefPtr<nsRefreshDriver>> & aDrivers) Line 326
at /builds/worker/checkouts/gecko/layout/base/nsRefreshDriver.cpp(326) - xul.dll!mozilla::RefreshDriverTimer::Tick(mozilla::layers::BaseTransactionId<mozilla::VsyncIdType> aId, mozilla::TimeStamp now) Line 344
at /builds/worker/checkouts/gecko/layout/base/nsRefreshDriver.cpp(344) - [Inline Frame] xul.dll!mozilla::VsyncRefreshDriverTimer::RunRefreshDrivers(mozilla::layers::BaseTransactionId<mozilla::VsyncIdType> aId, mozilla::TimeStamp) Line 780
at /builds/worker/checkouts/gecko/layout/base/nsRefreshDriver.cpp(780) - xul.dll!mozilla::VsyncRefreshDriverTimer::RefreshDriverVsyncObserver::TickRefreshDriver(mozilla::layers::BaseTransactionId<mozilla::VsyncIdType> aId, mozilla::TimeStamp aVsyncTimestamp) Line 703
at /builds/worker/checkouts/gecko/layout/base/nsRefreshDriver.cpp(703) - xul.dll!mozilla::VsyncRefreshDriverTimer::RefreshDriverVsyncObserver::NotifyParentProcessVsync() Line 621
at /builds/worker/checkouts/gecko/layout/base/nsRefreshDriver.cpp(621) - xul.dll!mozilla::VsyncRefreshDriverTimer::RefreshDriverVsyncObserver::NotifyVsync(const mozilla::VsyncEvent & aVsync) Line 548
at /builds/worker/checkouts/gecko/layout/base/nsRefreshDriver.cpp(548) - xul.dll!mozilla::dom::VsyncMainChild::RecvNotify(const mozilla::VsyncEvent & aVsync, const float & aVsyncRate) Line 67
at /builds/worker/checkouts/gecko/dom/ipc/VsyncMainChild.cpp(67) - xul.dll!mozilla::dom::PVsyncChild::OnMessageReceived(const IPC::Message & msg__) Line 208
at /builds/worker/workspace/obj-build/ipc/ipdl/PVsyncChild.cpp(208) - xul.dll!mozilla::ipc::PBackgroundChild::OnMessageReceived(const IPC::Message & msg__) Line 6167
at /builds/worker/workspace/obj-build/ipc/ipdl/PBackgroundChild.cpp(6167) - [Inline Frame] xul.dll!mozilla::ipc::MessageChannel::DispatchAsyncMessage(mozilla::ipc::ActorLifecycleProxy * aProxy, const IPC::Message & aMsg) Line 1658
at /builds/worker/checkouts/gecko/ipc/glue/MessageChannel.cpp(1658)
Comment 7•2 years ago
|
||
Here are the suspicious parts:
- [Inline Frame] mozglue.dll!mozilla::TimeStampValue::CheckQPC(const mozilla::TimeStampValue & aOther) Line 278
at /builds/worker/checkouts/gecko/mozglue/misc/TimeStamp_windows.cpp(278)
MFBT_API uint64_t TimeStampValue::CheckQPC(const TimeStampValue& aOther) const {
> uint64_t deltaGTC = mGTC - aOther.mGTC; [1]
- mozglue.dll!mozilla::TimeStampValue::operator-(const mozilla::TimeStampValue & aOther) Line 357
at /builds/worker/checkouts/gecko/mozglue/misc/TimeStamp_windows.cpp(357)
MFBT_API uint64_t
TimeStampValue::operator-(const TimeStampValue& aOther) const {
if (IsNull() && aOther.IsNull()) { [2]
return uint64_t(0);
}
> return CheckQPC(aOther);
}
- xul.dll!mozilla::RefreshDriverTimer::TickRefreshDrivers(mozilla::layers::BaseTransactionId<mozilla::VsyncIdType> aId, mozilla::TimeStamp aNow, nsTArray<RefPtr<nsRefreshDriver>> & aDrivers) Line 326
at /builds/worker/checkouts/gecko/layout/base/nsRefreshDriver.cpp(326)
for (nsRefreshDriver* driver : aDrivers.Clone()) { // [3]
// don't poke this driver if it's in test mode
if (driver->IsTestControllingRefreshesEnabled()) {
continue;
}
> TickDriver(driver, aId, aNow);
}
[2]: It feels weird to check IsNull && other.IsNull
rather than ||
. There are a bunch of (non-release! :( ) asserts around for ensuring that generally !IsNull for math, but this seems like it's designed to say "yeah but let's allow comparing two nulls. It looks like "null" is really the term for zero-timestamp, which isn't really null per se, and you can theoretically do math on it, but yeah, it probably won't give what you want.
Because of short-circuiting, if this->IsNull, then we won't deref aOther
at [2], leaving it until we access it for our crash at [1].
[3]: I don't think the lifetime of the result from Clone() is lifetime extended. I think c++ only says that begin(X) and end(X) are called, so this would be a dangling pointer.
Thoughts, not necessarily related:
Our read violation address is 0x0000021215b77c98. That's the address the minidump crash session gives me for &aOther also.
The value of the nsRefreshDriver* driver
appears to be 0x0000021215af7c00. This is far from &aOther though: -0x80098.
It looks like the operator- we end up in comes to us via aNowTime <= mMostRecentRefresh
.
I do think [3] is the most suspicious.
Talking with gsvelto, there are two ways where we can end up with a type-mismatch from PHC between use-type and alloc/free-type:
- A wildpointer that by luck happens to hit PHC guard pages. This is kinda a use-before-alloc rather than use-after-free.
- A true UAF, but use-after-free-with-reuse:
- alloc A
- free A
- alloc B at &A
- free B, &A now marked as "B was here"
- uaf for A, but our records say that the most recent free at &A was B.
RefreshDriver tick calls tend to make changes to things, hence why we Clone the list before iterating over it.
However, it's likely that this too-tight alloc+free mean we quickly reuse and overwrite that memory, but we keep iterating over it with our loop reading.
Comment 8•2 years ago
|
||
Comment 9•2 years ago
|
||
So there's at least some lifetime extension:
{
auto && __range = (aDrivers.Clone()); // I think this is ok.
for ( auto __begin = begin-expr,
__end = end-expr;
__begin != __end;
++__begin ) {
for-range-declaration = *__begin;
statement
}
}
Comment 10•2 years ago
|
||
- [Inline Frame] xul.dll!mozilla::TimeStamp::operator<=(const mozilla::TimeStamp & aOther) Line 502
at /builds/worker/workspace/obj-build/dist/include/mozilla/TimeStamp.h(502)
bool operator<=(const TimeStamp& aOther) const {
MOZ_ASSERT(!IsNull(), "Cannot compute with a null value");
MOZ_ASSERT(!aOther.IsNull(), "Cannot compute with aOther null value");
> return mValue <= aOther.mValue;
}
Here are those non-release asserts.
Comment 11•2 years ago
|
||
template <class E>
class nsTArray : public nsTArray_Impl<E, nsTArrayInfallibleAllocator> {
[...]
using self_type = nsTArray<E>;
[...]
self_type Clone() const {
self_type result;
result.Assign(*this);
return result;
}
This should lifetime-extend fine, phew.
Updated•2 years ago
|
Updated•2 years ago
|
Comment 13•2 years ago
|
||
I looked at the minidump for https://crash-stats.mozilla.org/report/index/7c77161a-d7df-430c-b942-0ef070220513 and it looks like an ImageStartData from https://searchfox.org/mozilla-central/rev/97c902e8f92b15dc63eb584bfc594ecb041242a4/layout/base/nsRefreshDriver.cpp#2600 is the data that is not live during the TimeStamp subtraction.
Perhaps the hash table is being mutated during iteration?
Comment 14•2 years ago
|
||
I will pick this back up.
Updated•2 years ago
|
Comment 15•2 years ago
|
||
tn, this looks sort of image lib related. Do you mind taking a look sometime soonish?
Assignee | ||
Comment 16•2 years ago
|
||
(In reply to Jeff Muizelaar [:jrmuizel] from comment #13)
Perhaps the hash table is being mutated during iteration?
I went over that for loop with a fine tooth comb. I couldn't find any way the hash table could be mutated.
I looked at the minidump for https://crash-stats.mozilla.org/report/index/7c77161a-d7df-430c-b942-0ef070220513 and it looks like an ImageStartData from https://searchfox.org/mozilla-central/rev/97c902e8f92b15dc63eb584bfc594ecb041242a4/layout/base/nsRefreshDriver.cpp#2600 is the data that is not live during the TimeStamp subtraction.
Hmm, we access |start| on the line before but we don't crash there?
The crash report in comment 0 crashes on a completely different timestamp subtraction not related to the image start table near the start of Tick:
https://searchfox.org/mozilla-central/rev/4a10f85ebca1c22c803d0253a82fe39514ba0b53/layout/base/nsRefreshDriver.cpp#2344
This suggests it's more of a general refresh driver problem I think.
Assignee | ||
Comment 17•2 years ago
|
||
If you search crashstats for crashes in nsRefreshDriver::Tick and look at the read exceptions at addresses that look like that could have been valid addresses then you find crashes on a variety of lines in that function. Maybe someone with minidump access could look at a few of those to see if we could learn something about this bug?
I also noticed bug 1786818 while looking into this, but I don't think it could cause this.
Updated•2 years ago
|
Assignee | ||
Comment 18•2 years ago
|
||
I went through the last 100 crashreports in mozilla::TimeStampValue::operator-, none of them look the two stacks being analyzed here where it looks like something might be going wrong in nsRefreshDriver::Tick. Almost all of them don't go through Tick at all. Two of them went through Tick, but crashed a few levels deeper in the intersection observer code, or in the animated image advancing code. One crashed when executing a auto profiler label in Tick, looked like it was just trying to get the current timestamp (ie Now).
Updated•2 years ago
|
Comment 19•2 years ago
|
||
Theory I'm operating on is that a large chunk of memory was freed (_munmap in jemalloc) and then the hot code of another thread was hammering on mozilla::TimeStampValue::operator- and SEGV'd on the no longer mapped memory, which I believe matches the advice in comment #3.
Comment 20•2 years ago
|
||
I think I may have over-focused on TimeStampValue::operator- being the crash point, but reading again it sounds like that is the thing to ignore, and instead maybe we should be looking into ClipManager::PushOverrideForASR and ClipManager::PopOverrideForASR, with the UAF at:
std::list<std::pair<const WrSpatialId, std::stack<WrSpatialId> >::erase(std::_List_const_iterator<>)
.
I believe this must be the internal implementation of (one of the buckets in) std::unordered_map<wr::WrSpatialId, std::stack<wr::WrSpatialId>> mASROverride
.
Note: WrSpatialId
is a POD struct.
These functions have changed a little recently, but at the time they looked like:
https://searchfox.org/mozilla-central/rev/b5e1870820c09fa2f6cebcfe093cf00d2143ec1a/gfx/layers/wr/ClipManager.cpp#102-132
void ClipManager::PushOverrideForASR(const ActiveScrolledRoot* aASR,
const wr::WrSpatialId& aSpatialId) {
Maybe<wr::WrSpatialId> space = GetScrollLayer(aASR);
MOZ_ASSERT(space.isSome());
CLIP_LOG("Pushing %p override %zu -> %s\n", aASR, space->id,
ToString(aSpatialId.id).c_str());
auto it = mASROverride.insert({*space, std::stack<wr::WrSpatialId>()});
it.first->second.push(aSpatialId);
// Start a new cache
mCacheStack.emplace();
}
void ClipManager::PopOverrideForASR(const ActiveScrolledRoot* aASR) {
MOZ_ASSERT(!mCacheStack.empty());
mCacheStack.pop();
Maybe<wr::WrSpatialId> space = GetScrollLayer(aASR);
MOZ_ASSERT(space.isSome());
auto it = mASROverride.find(*space);
CLIP_LOG("Popping %p override %zu -> %s\n", aASR, space->id,
ToString(it->second.top().id).c_str());
it->second.pop();
if (it->second.empty()) {
mASROverride.erase(it);
}
}
UAF in unordered_map::erase()
like this would usually be from an iterator invalidation issue.
Because the contents of mASROverride are pods and containers of pods, there's no modification here of mASROverride between it = mASROverride.find()
and mASROverride.erase(it)
though.
However you can see that we don't even assert that find() doesn't return mASROverride.end()!
We rely on this line giving us the same WrSpatialId in both PushOverrideForASR and PopOverrideForASR, otherwise we would c.erase(c.end())
in Pop, which would is forbidden.
But, these are the alloc and free stacks, so I guess they're not necessarily problematic.
So what could possibly UAF here?
The only other reader of mASROverride is this safe and contained use:
wr::WrSpatialId ClipManager::SpatialIdAfterOverride(
const wr::WrSpatialId& aSpatialId) {
auto it = mASROverride.find(aSpatialId);
if (it == mASROverride.end()) {
return aSpatialId;
}
It looks like PushOverrideForASR and PopOverrideForASR are correct in the nested case too:
PushOverrideForASR(foo, bar);
PushOverrideForASR(foo, bar);
PopOverrideForASR(foo);
PopOverrideForASR(foo);
So I think it's down to not all calls to PopOverrideForASR(foo) matching a corresponding call to PushOverrideForASR(foo, _). Unfortunately the callers don't look trivially symmetric.
There are two calling files for these functions, ClipManager.cpp and WebRenderCommandBuilder.cpp.
The changelog for ClipManager.cpp is not too busy:
https://hg.mozilla.org/mozilla-central/log/tip/gfx/layers/wr/ClipManager.cpp
This bug was created 2022-03-31.
- (pushed 2022-08-23 21:47 +0000 Emilio Cobos Álvarez Emilio Cobos Álvarez - Bug 1661147 - When starting a new list, copy inputs from the stack as well. r=gfx-reviewers,bradwerth)
- 2022-03-31: This bug was filed
- pushed 2022-03-28 09:39 +0000 Hiroyuki Ikezoe Hiroyuki Ikezoe - Bug 1760222 - Detect scroll-linked effects only in the same refresh driver's timestamp. r=botond
- Suspect because of timing for being on Nightly when this bug was filed
- pushed 2022-01-31 16:51 +0000 Hiroyuki Ikezoe Hiroyuki Ikezoe - Bug 1571758 - Inform apz scroll generation to WebRender's ScrollFrame from the main-thread. r=botond
- Suspect because of timing for likely being on Release when this bug was filed
- (pushed 2021-12-09 09:32 +0000 Hiroyuki Ikezoe Hiroyuki Ikezoe - Bug 1744842 - Use LayoutVector2D instead of LayoutPoint to set scroll offset (the main-thread part). r=gw,botond)
WebRenderCommandBuilder.cpp is busier:
- 2022-03-31: This bug was filed
- pushed 2022-03-29 20:42 +0000 Nicolas Silva Nicolas Silva - Bug 1761770 - Dont make non-uniformly scaled items active. r=jrmuizel
- pushed 2022-03-29 20:42 +0000 Nicolas Silva Nicolas Silva - Bug 1761770 - Adjust item activity decisions. r=jrmuizel
- pushed 2022-03-16 09:52 +0000 Botond Ballo Botond Ballo - Bug 1753779 - Avoid emitting the same deferred transform item onto an ancestor and descendant WebRenderLayerScrollData node. r=tnikkel
- pushed 2022-03-15 05:30 +0000 Botond Ballo Botond Ballo - Bug 1749190 - Try harder to avoid cyclic scroll metadata annotations when building WRScrollData. r=tnikkel
- pushed 2022-03-15 05:30 +0000 Botond Ballo Botond Ballo - Bug 1749190 - Rename a variable to be more precise. r=tnikkel
- pushed 2022-03-06 09:45 +0000 Lee Salzman Lee Salzman - Bug 1511493 - Ensure PushGlyphs uses the current transaction's IpcResourceUpdateQueue. r=emilio
- (pushed 2022-02-24 03:51 +0000 Marian-Vasile Laza Marian-Vasile Laza - Merge autoland to mozilla-central. a=merge)
- pushed 2022-02-24 03:51 +0000 Nicolas Silva Nicolas Silva - Bug 1686654 - Ensure SwitchItem is called at the start of each blob group. r=jrmuizel
- pushed 2022-02-24 03:51 +0000 Nicolas Silva Nicolas Silva - Bug 1755747 - Add support for antialiased non-snapped rectangles. r=gfx-reviewers,aosmond
- pushed 2022-02-23 09:38 +0000 Miko Mynttinen Miko Mynttinen - Bug 1714584 - Part 1: Decouple nsDisplayList internal list from
- (pushed 2022-02-22 14:42 +0000 Norisz Fay Norisz Fay - Backed out 2 changesets (bug 1714584) per devs request for causing crashes )
- (pushed 2022-02-22 09:37 +0000 Miko Mynttinen Miko Mynttinen - Bug 1714584 - Part 1: Decouple nsDisplayList internal list from nsDisplayItems r=mstange)
- pushed 2022-02-22 09:37 +0000 Nicolas Silva Nicolas Silva - Bug 1753404 - Track hit tested bounds separately from painted bounds. r=jrmuizel
- pushed 2022-02-22 09:37 +0000 Nicolas Silva Nicolas Silva - Bug 1753404 - Avoid extra blob tiles from invisible items. r=jrmuizel
- pushed 2022-02-18 21:52 +0000 Andrew Osmond Andrew Osmond - Bug 1754978 - Part 1. Refactor CompositableHandle infrastructure to allow in-process driven handles. r=sotaro
- (pushed 2022-02-18 09:08 +0000 Iulian Moraru Iulian Moraru - Backed out 2 changesets (bug 1754978) for causing valgrind bustages.)
- (pushed 2022-02-17 05:15 +0000 Andrew Osmond Andrew Osmond - Bug 1754978 - Part 1. Refactor CompositableHandle infrastructure to allow in-process driven handles. r=sotaro)
- pushed 2022-02-11 09:42 +0000 Andrew Osmond Andrew Osmond - Bug 1754556 - Update WebGPU external image resource only after present is complete. r=kvark
- pushed 2022-01-27 09:46 +0000 Botond Ballo Botond Ballo - Bug 1751789 - Better handle the case where the deferred transform item's ASR is a descendant of the current item's ASR. r=tnikkel
It's not necessarily those, but I think we should pass this to hiro and/or botond to see if they have any ideas. (or maybe tnikkel sees something in those changelog commit lists?)
Also, in https://searchfox.org/mozilla-central/source/gfx/layers/wr/ClipManager.cpp#85-90 , the symmetry between push/pop use is not obvious, which is always suspect, but it does look ultimately correct:
void ClipManager::BeginList(const StackingContextHelper& aStackingContext) {
[...]
ItemClips clips(nullptr, nullptr, false);
if (!mItemClipStack.empty()) {
clips = mItemClipStack.top();
}
[...]
PushOverrideForASR(clips.mASR, *referenceFrameId);
[...]
mItemClipStack.push(clips);
}
void ClipManager::EndList(const StackingContextHelper& aStackingContext) {
[...]
mItemClipStack.pop();
[...]
PopOverrideForASR(mItemClipStack.empty() ? nullptr
: mItemClipStack.top().mASR);
[...]
}
However, when this bug was filed, the symmetry was more obvious I think:
https://searchfox.org/mozilla-central/rev/352d47e58598c90b7e7754f8fc1465be19ed77be/gfx/layers/wr/ClipManager.cpp#52-90
Comment 21•2 years ago
|
||
Bug 1760222 and bug 1571758 are totally unrelated to the ASR structure.
Comment 22•2 years ago
|
||
Probably this; https://hg.mozilla.org/integration/autoland/rev/3d623418ee05 . I am afraid I have almost zero knowledge, maybe ActiveScrolledRoot::PickDescendant picked a wrong one?
Comment 23•2 years ago
•
|
||
I'm fairly confident we can rule out {bug 1751789, bug 1749190, bug 1753779} as causing a regression in PushOverrideForASR/PopOverrideForASR calls being properly matched, too. While these bugs make changes to code that runs during WebRender display list building, this code should only impact the WebRenderScrollData that's produced as a secondary artifact, and not the WebRender display list itself (nor any intermediate state that affects the WebRender display list such as mASROverride
).
I'm less sure about the other changes in the changelogs from comment 20, but nothing obvious jumps out.
I also tried to see if I can spot a reason why PushOverrideForASR/PopOverrideForASR calls may be mismatched by code inspection. I'm not really familiar with this code, but I couldn't spot any issue:
- This pair of calls is matched: in the same function, both inside an
if (stickyScrollContainer)
block, condition does not change in between, no early returns in between. Also theaASR
argument is the same and does not change in between the calls. - This pair of calls is less obvious, but I believe it pans out:
- They're in different functions, but BeginList/EndList are matched in their respective call sites here and here.
- Inside BeginList and EndList, the calls are conditioned on
AffectsClipPositioning()
andReferenceFrameId()
. These are properties of the stacking context which do not change after the stacking context's construction. - The hardest part to reason about is that the
aASR
arguments of the pair of calls in BeginList/EndList will be the same.- In BeginList, the argument is the
mASR
of the top ofmItemClipStack
, or null ifmItemClipStack
is empty. - BeginList then pushes a new entry into
mItemClipStack
. - The only other place that modifies
mItemClipStack
isSwitchItem()
, which does a matched pair of pop() and then push(). So, if someSwitchItem()
calls occur in between BeginList and EndList, then the topmItemClipStack
may be different going into EndList then it was coming out of BeginList. - However, EndList starts by popping
mItemClipStack
, and then uses themASR
of the entry now at the top of the stack, or null if it's empty. It follows that this is the same value used by BeginList.
- In BeginList, the argument is the
That said, I do have a suggestion for how to make progress here. As Kelsey observed in comment 20, PopOverrideForASR()
does not check that this find()
call returns a valid iterator.
Proposal: Let's land a patch to add a check and early-return from PopOverrideForASR()
, as well as debug-asserting, if the call returns mASROverride.end()
.
This should avoid any UAF resulting from PushOverrideForASR/PopOverrideForASR being mismatched (for some reason we can't spot), and by turning it into a reliable debug-crash instead, possibly turn up affected testcases via fuzzing (or people running debug builds) which we can investigate further.
As far as I can tell, early-returning there shouldn't cause any other UB down the line that wouldn't have occurred otherwise. Proceeding with incorrect clips may cause rendering problems, but those would have happened anyways, with UB on top.
Looping in Markus (who may be more familiar with mItemClipStack
/mOverrideASR
, having reviewed bug 1377187) with a couple of questions:
- Any reason why my proposal above might be a bad idea?
- Anything you can spot that I missed in the analysis of PushOverrideForASR/PopOverrideForASR being matched, or any of the changes in the changelogs in comment 20 that look suspicious?
Assignee | ||
Comment 24•2 years ago
|
||
(In reply to Kelsey Gilbert [:jgilbert] from comment #20)
- pushed 2022-02-24 03:51 +0000 Nicolas Silva Nicolas Silva - Bug 1686654 - Ensure SwitchItem is called at the start of each blob group. r=jrmuizel
Maybe not the named patch, but bug 1686654 in general could maybe be a possibility for causing this, it enables active SVG images which I think would give the web render grouping code in these files are significantly harder workout.
Updated•2 years ago
|
Comment 25•2 years ago
|
||
The severity field for this bug is set to S3. However, the bug is flagged with the sec-high
keyword.
:ahale, could you consider increasing the severity of this security bug?
For more information, please visit auto_nag documentation.
Assignee | ||
Comment 26•2 years ago
|
||
(In reply to Botond Ballo [:botond] from comment #23)
Proposal: Let's land a patch to add a check and early-return from
PopOverrideForASR()
, as well as debug-asserting, if the call returnsmASROverride.end()
.This should avoid any UAF resulting from PushOverrideForASR/PopOverrideForASR being mismatched (for some reason we can't spot), and by turning it into a reliable debug-crash instead, possibly turn up affected testcases via fuzzing (or people running debug builds) which we can investigate further.
As far as I can tell, early-returning there shouldn't cause any other UB down the line that wouldn't have occurred otherwise. Proceeding with incorrect clips may cause rendering problems, but those would have happened anyways, with UB on top.
I filed bug 1797186 to land a patch for this.
Updated•2 years ago
|
Comment 27•2 years ago
|
||
I put up a socorro change to separate out the different signatures: https://github.com/mozilla-services/socorro/pull/6228
Comment 28•2 years ago
|
||
Thanks for all the good work on this! IIUC we're waiting to see if bug 1797186 resolves this UAF. I marked bug 1797186 as a dependency.
@ahale - If you can keep an eye on this and let us know after a few weeks if this appears to be resolved or if it's still around, I'd really appreciate it. Thanks!
Assignee | ||
Comment 29•2 years ago
|
||
Recording this info about how to monitor this bug here so we don't forget it:
Since this bug report is based on a PHC crash we'll likely need to check PHC crashes on crash stats in order to determine if we still hit it. The answer on #crashreporting for how to search these crashes was:
"They end up on crash-stats with specific annotations. If you have protected data access this query[1] should make them show up"
[1] https://crash-stats.mozilla.org/search/?phc_alloc_stack=!__null__
Comment 30•2 years ago
|
||
(In reply to Maire Reavy [:mreavy] from comment #28)
@ahale - If you can keep an eye on this and let us know after a few weeks if this appears to be resolved or if it's still around, I'd really appreciate it. Thanks!
Yep, I'll be keeping an eye on it.
Assignee | ||
Comment 31•2 years ago
|
||
Today there are two crashes on crash stats that look like they would be prevented by bug 1797186 (the builds of the crashes don't have bug 1797186)
https://crash-stats.mozilla.org/report/index/a0432e20-d915-4aa8-97a3-c0cb30221110
https://crash-stats.mozilla.org/report/index/29ee1ce0-190e-4de1-a85f-363da0221107
Which I think shows that bug 1797186 is indeed fixing a real problem. It's odd that none of these crashes showed up before this week though. Still an open question as to how we are hitting the unbalanced Pop/Push that bug 1797186 handles more gracefully.
Comment 32•2 years ago
|
||
There have been no crashes matching the intersection of those patterns after Firefox 105.1.0
Crash stats filter for a PHC with signature containing PopOverrideForASR:
https://crash-stats.mozilla.org/signature/?phc_alloc_stack=%21__null&signature=mozilla%3A%3Alayers%3A%3AClipManager%3A%3APopOverrideForASR&date=%3E%3D2022-05-22T00%3A30%3A00.000Z&date=%3C2022-11-22T00%3A30%3A00.000Z#reports
Updated•1 year ago
|
Comment 33•1 year ago
•
|
||
(In reply to Ashley Hale [:ahale] from comment #32)
There have been no crashes matching the intersection of those patterns after Firefox 105.1.0
Bug 1797186 landed in 108 IIUC. Do we know why this went away already in 106? I see that bug 1661147 touched the same code and landed in the 106 timeframe. Is that possibly related?
Assignee | ||
Comment 34•1 year ago
|
||
I think this was just a rare crash, we should keep monitor through at least the 108 release cycle.
Comment 35•1 year ago
|
||
I'm still trying to find the optimal crash stats criteria, like my search in comment #32 is for PopOverrideForASR but that's not the signature we identified and put in socorro configuration to disambiguate different ways to reach the TimeStampValue::operator- function. I'm not completely sure I understand how PHC is annotated here but I'll make some other queries and see if I find any patterns.
I agree this should not be closed until we have data showing it is addressed by the PopOverrideForASR fixes.
Comment 36•1 year ago
|
||
Here is a more useful crash stats query for specifically signatures containing mozilla::TimeStampValue::operator- with a PHC crash after build id 20221025 (corresponding to the date when Bug 1797186 landed, but not necessarily containing it), I see shutdownhang on many of the signatures which seems notable.
Of those crashes, I only see one on 108 or later https://crash-stats.mozilla.org/report/index/660d7eeb-8a6e-406e-b0f8-503720221127 which is on version 108.0b6, which has a buildid of 20221124185931 which is considerably newer than 20221025 when Bug 1797186 (mentioned in comment #31) changed the PopOverrideForASR behavior to have bounds checking, but this may still be a reduction in crash rate, there may be multiple causes of PHC crashes in this code.
Comment 37•1 year ago
|
||
Status: Waiting for 108 to deploy to a majority of users to confirm if this rare crash is fixed, due to the stochastic nature of PHC detection we need a sufficient amount of users running 108 to find out if this is actually fixed, and the fact it is a rare crash makes that even harder, so once 108 hits release channel I can draw conclusions from the crash stats.
My prediction is that we will still see some crash volume in this method but via different signatures as there appear to be multiple causes of underflow/overflow/PHC/UAF involved that just happen to hit this method on their path of doom, the method isn't the problem, it's just a symptom.
It would be nice to see shutdownhang signatures drop off the chart as well, which I think has been fixed by Bug 1797186 but we'll see.
Assignee | ||
Comment 38•1 year ago
|
||
I don't have phc access on crashstats, but comment 0 includes the alloc and free stacks in addition to the actual crashing stack. Do we have access to the alloc/free stacks on these crashes?
Reporter | ||
Comment 39•1 year ago
|
||
If you point me at the crashes you need inspected I can extract the stacks for you.
Assignee | ||
Comment 40•1 year ago
|
||
I can't even get a list of crashes to look at without phc data access. The question I want to answer is: when did the crashes happen and in what versions of phc crashes with alloc/free stacks that go through PushOverrideForASR/PopOverrideForASR?
Comment 41•1 year ago
•
|
||
I seem to no longer be able to see the phc annotations so I can't measure this anymore. I still have protected data access according to crash stats but this annotation no longer appears in the crash reports for me, so I can't confirm if my searches are working correctly. Though I may be misremembering and maybe I was never able to see phc annotations to determine if my searches work correctly. Currently I have limited confidence in the validity of the search condition 'phc_alloc_stack does not have terms __null' since I can't tell if it's selecting them correctly.
If I am understanding correctly, the phc alloc stack and phc free stack would be comma separated numbers, so I don't know what numbers would exist in those fields to indicate PopOverrideForASR and PushOverrideForASR are involved to verify if the fix in Bug 1797186 prevents some portion of these crashes, so I'm going just on crash volume of crashes meeting my criteria which is kind of a blind metric.
Reporter | ||
Comment 42•1 year ago
|
||
From a quick search there are have been no PHC crashes reported under this signature in the past 6 months. Truth be said there's been too few PHC crashes reported in general and we're trying to tweak the heuristics to improve the situation.
(In reply to Ashley Hale [:ahale] from comment #41)
I seem to no longer be able to see the phc annotations so I can't measure this anymore. I still have protected data access according to crash stats but this annotation no longer appears in the crash reports for me, so I can't confirm if my searches are working correctly. Though I may be misremembering and maybe I was never able to see phc annotations to determine if my searches work correctly. Currently I have limited confidence in the validity of the search condition 'phc_alloc_stack does not have terms __null' since I can't tell if it's selecting them correctly.
A query with phc_kind exists
should reveal all PHC crashes on file if you have protected data access.
Assignee | ||
Comment 43•1 year ago
|
||
(In reply to Gabriele Svelto [:gsvelto] from comment #42)
From a quick search there are have been no PHC crashes reported under this signature in the past 6 months. Truth be said there's been too few PHC crashes reported in general and we're trying to tweak the heuristics to improve the situation.
Wouldn't we need to search the phc alloc/free stacks, not the signature of the crash? The signature is pretty much random, just whoever happened to touch the memory after the problem?
Reporter | ||
Comment 44•1 year ago
|
||
(In reply to Timothy Nikkel (:tnikkel) from comment #43)
Wouldn't we need to search the phc alloc/free stacks, not the signature of the crash? The signature is pretty much random, just whoever happened to touch the memory after the problem?
Good point, and we neither process nor index the alloc/free stack, it's still a manual process so we can't do it directly.
Comment 45•1 year ago
|
||
(In reply to Timothy Nikkel (:tnikkel) from comment #43)
Wouldn't we need to search the phc alloc/free stacks, not the signature of the crash? The signature is pretty much random, just whoever happened to touch the memory after the problem?
It shouldn't be random, no. Once the memory is freed, the use must come from a dangling pointer and that should always be roughly the same location, so the "use" stack should be stable. It would still be nice if we could eventually process alloc/free stacks but that's probably going to be more work.
Comment 46•1 year ago
|
||
This is in wait and see mode, we believe the issue is addressed.
Comment 47•1 year ago
|
||
(In reply to Gabriele Svelto [:gsvelto] from comment #44)
(In reply to Timothy Nikkel (:tnikkel) from comment #43)
Wouldn't we need to search the phc alloc/free stacks, not the signature of the crash? The signature is pretty much random, just whoever happened to touch the memory after the problem?
Good point, and we neither process nor index the alloc/free stack, it's still a manual process so we can't do it directly.
Lack of index is why I have trouble searching for these then, some tooling around that would be helpful :)
Comment 48•1 year ago
|
||
Okay I found the problem with my queries for phc alloc stack, which is more or less a typo in reverse - I clicked the URL in comment #31 but bugzilla automatic markup of links truncates before the __ at the end of the !null condition, which broke the search and was subtle enough I wasn't sure if it was meant to be that way.
However the number of phc alloc exists crashes is extremely low, so low that it's hard to confirm anything in regards to the mozilla::TimeStampValue::operator- signature or PopOverrideForASR signature as searches like https://crash-stats.mozilla.org/api/SuperSearch/?signature=~mozilla%3A%3ATimeStampValue%3A%3Aoperator-&phc_alloc_stack=%21__null__&_facets=signature&_facets=version give no results even for a 6 month window. Even finding ones that involve WebRender is hard, but there are a couple that seem like driver crashes, not really of interest to this bug.
Updated•1 year ago
|
Comment 49•1 year ago
|
||
Will discuss this bug in gfx triage meeting again, as we have no clear indicator that the problem is fixed or not fixed, PHC crashes are too rare to even include a single sample one way or the other currently.
Comment 50•1 year ago
|
||
:gsvelto - is there something wrong with the PHC detection? I find no crashes in the past 2 weeks with my search for 'phc alloc stack' with exists condition, I can find only two at 1 month window. Currently I can't confirm this bug still exists, but I also have low confidence that it is fixed due to a lack of any samples to work with.
Updated•1 year ago
|
Reporter | ||
Comment 51•1 year ago
|
||
(In reply to Ashley Hale [:ahale] from comment #50)
:gsvelto - is there something wrong with the PHC detection? I find no crashes in the past 2 weeks with my search for 'phc alloc stack' with exists condition, I can find only two at 1 month window. Currently I can't confirm this bug still exists, but I also have low confidence that it is fixed due to a lack of any samples to work with.
It's because the population is small, Randell wrote a WIP to increase it in bug 1800010 but the work there is not complete yet.
Comment 52•1 year ago
|
||
Tentatively resolving this as fixed by bug 1797186. Feel free to reopen if needed.
Comment 53•1 year ago
|
||
Is this something we'd need to fix on ESR102 still? AFAICT, the code in question looks mostly the same (though not exactly).
Assignee | ||
Comment 54•1 year ago
|
||
Uplifting to esr102 seems reasonable to me. I can make a patch for esr102, it should be very easy.
Comment 55•1 year ago
|
||
A rebased patch would be great, thanks. We'll need it before next week's RC builds.
Assignee | ||
Comment 56•1 year ago
|
||
Done. Sorry was sick the past couple days.
Updated•1 year ago
|
Updated•1 year ago
|
Updated•1 year ago
|
Updated•10 months ago
|
Description
•