Closed Bug 1762368 Opened 2 years ago Closed 1 year ago

Crash in [@ mozilla::TimeStampValue::operator-]

Categories

(Core :: Graphics, defect)

Unspecified
Windows 10
defect

Tracking

()

RESOLVED FIXED
108 Branch
Tracking Status
firefox-esr102 110+ fixed
firefox108 --- fixed

People

(Reporter: gsvelto, Assigned: tnikkel)

References

Details

(Keywords: crash, csectype-uaf, sec-high, Whiteboard: [adv-esr102.8+r])

Crash Data

Crash report: https://crash-stats.mozilla.org/report/index/8eae32f6-36f2-45c6-9f14-6a1ed0220219

Reason: EXCEPTION_ACCESS_VIOLATION_READ

Top 10 frames of crashing thread:

0 mozglue.dll mozilla::TimeStampValue::operator- const mozglue/misc/TimeStamp_windows.cpp:357
1 xul.dll nsRefreshDriver::Tick layout/base/nsRefreshDriver.cpp:2184
2 xul.dll mozilla::RefreshDriverTimer::TickRefreshDrivers layout/base/nsRefreshDriver.cpp:326
3 xul.dll mozilla::RefreshDriverTimer::Tick layout/base/nsRefreshDriver.cpp:342
4 xul.dll mozilla::VsyncRefreshDriverTimer::RefreshDriverVsyncObserver::TickRefreshDriver layout/base/nsRefreshDriver.cpp:703
5 xul.dll mozilla::VsyncRefreshDriverTimer::RefreshDriverVsyncObserver::NotifyParentProcessVsync layout/base/nsRefreshDriver.cpp:620
6 xul.dll mozilla::VsyncRefreshDriverTimer::RefreshDriverVsyncObserver::NotifyVsync layout/base/nsRefreshDriver.cpp:541
7 xul.dll mozilla::dom::VsyncMainChild::RecvNotify dom/ipc/VsyncMainChild.cpp:68
8 xul.dll mozilla::dom::PVsyncChild::OnMessageReceived ipc/ipdl/PVsyncChild.cpp:208
9 xul.dll mozilla::ipc::PBackgroundChild::OnMessageReceived ipc/ipdl/PBackgroundChild.cpp:6167

This is a use-after-free crash caught by PHC. The stack and signatures aren't really important because they just happen to be the first object that accessed the dead memory in this particular case. What's important are the allocation and free stacks. The memory was allocated at this stack:

#0    mozilla::layers::ClipManager::PushOverrideForASR(mozilla::ActiveScrolledRoot const*, mozilla::wr::WrSpatialId const&) (xul.dll)
#1    mozilla::layers::ClipManager::BeginList(mozilla::layers::StackingContextHelper const&) (xul.dll)
#2    mozilla::layers::WebRenderCommandBuilder::CreateWebRenderCommandsFromDisplayList(mozilla::nsDisplayList*, mozilla::nsDisplayItem*, mozilla::nsDisplayListBuilder*, mozilla::layers::StackingContextHelper const&, mozilla::wr::DisplayListBuilder&, mozilla::wr::IpcResourceUpdateQueue&, bool) (xul.dll)
#3    mozilla::nsDisplayTransform::CreateWebRenderCommands(mozilla::wr::DisplayListBuilder&, mozilla::wr::IpcResourceUpdateQueue&, mozilla::layers::StackingContextHelper const&, mozilla::layers::RenderRootStateManager*, mozilla::nsDisplayListBuilder*) (xul.dll)
#4    mozilla::layers::WebRenderCommandBuilder::CreateWebRenderCommandsFromDisplayList(mozilla::nsDisplayList*, mozilla::nsDisplayItem*, mozilla::nsDisplayListBuilder*, mozilla::layers::StackingContextHelper const&, mozilla::wr::DisplayListBuilder&, mozilla::wr::IpcResourceUpdateQueue&, bool) (xul.dll)
#5    mozilla::nsDisplayOpacity::CreateWebRenderCommands(mozilla::wr::DisplayListBuilder&, mozilla::wr::IpcResourceUpdateQueue&, mozilla::layers::StackingContextHelper const&, mozilla::layers::RenderRootStateManager*, mozilla::nsDisplayListBuilder*) (xul.dll)
#6    mozilla::layers::WebRenderCommandBuilder::CreateWebRenderCommandsFromDisplayList(mozilla::nsDisplayList*, mozilla::nsDisplayItem*, mozilla::nsDisplayListBuilder*, mozilla::layers::StackingContextHelper const&, mozilla::wr::DisplayListBuilder&, mozilla::wr::IpcResourceUpdateQueue&, bool) (xul.dll)
#7    mozilla::nsDisplayOwnLayer::CreateWebRenderCommands(mozilla::wr::DisplayListBuilder&, mozilla::wr::IpcResourceUpdateQueue&, mozilla::layers::StackingContextHelper const&, mozilla::layers::RenderRootStateManager*, mozilla::nsDisplayListBuilder*) (xul.dll)
#8    mozilla::layers::WebRenderCommandBuilder::CreateWebRenderCommandsFromDisplayList(mozilla::nsDisplayList*, mozilla::nsDisplayItem*, mozilla::nsDisplayListBuilder*, mozilla::layers::StackingContextHelper const&, mozilla::wr::DisplayListBuilder&, mozilla::wr::IpcResourceUpdateQueue&, bool) (xul.dll)
#9    mozilla::layers::WebRenderLayerManager::EndTransactionWithoutLayer(mozilla::nsDisplayList*, mozilla::nsDisplayListBuilder*, WrFiltersHolder&&, mozilla::layers::WebRenderBackgroundData*, const double) (xul.dll)
#10    mozilla::nsDisplayList::PaintRoot(mozilla::nsDisplayListBuilder*, gfxContext*, unsigned int, mozilla::Maybe<double>) (xul.dll)
#11    static nsLayoutUtils::PaintFrame(gfxContext*, nsIFrame*, nsRegion const&, unsigned int, mozilla::nsDisplayListBuilderMode, nsLayoutUtils::PaintFrameFlags) (xul.dll)
#12    mozilla::PresShell::PaintInternal(nsView*, mozilla::PaintInternalFlags) (xul.dll)
#13    nsViewManager::ProcessPendingUpdatesPaint(nsIWidget*) (xul.dll)
#14    nsViewManager::ProcessPendingUpdatesForView(nsView*, bool) (xul.dll)
#15    nsViewManager::ProcessPendingUpdates() (xul.dll)

And it was freed here:

#0    std::list<std::pair<const mozilla::wr::WrSpatialId,std::stack<mozilla::wr::WrSpatialId,std::deque<mozilla::wr::WrSpatialId,std::allocator<mozilla::wr::WrSpatialId> > > >,std::allocator<std::pair<const mozilla::wr::WrSpatialId,std::stack<mozilla::wr::WrSpatialId,std::deque<mozilla::wr::WrSpatialId,std::allocator<mozilla::wr::WrSpatialId> > > > > >::erase(std::_List_const_iterator<std::_List_val<std::_List_simple_types<std::pair<const mozilla::wr::WrSpatialId,std::stack<mozilla::wr::WrSpatialId,std::deque<mozilla::wr::WrSpatialId,std::allocator<mozilla::wr::WrSpatialId> > > > > > >) (xul.dll)
#1    mozilla::layers::ClipManager::PopOverrideForASR(mozilla::ActiveScrolledRoot const*) (xul.dll)
#2    mozilla::layers::WebRenderCommandBuilder::CreateWebRenderCommandsFromDisplayList(mozilla::nsDisplayList*, mozilla::nsDisplayItem*, mozilla::nsDisplayListBuilder*, mozilla::layers::StackingContextHelper const&, mozilla::wr::DisplayListBuilder&, mozilla::wr::IpcResourceUpdateQueue&, bool) (xul.dll)
#3    mozilla::nsDisplayTransform::CreateWebRenderCommands(mozilla::wr::DisplayListBuilder&, mozilla::wr::IpcResourceUpdateQueue&, mozilla::layers::StackingContextHelper const&, mozilla::layers::RenderRootStateManager*, mozilla::nsDisplayListBuilder*) (xul.dll)
#4    mozilla::layers::WebRenderCommandBuilder::CreateWebRenderCommandsFromDisplayList(mozilla::nsDisplayList*, mozilla::nsDisplayItem*, mozilla::nsDisplayListBuilder*, mozilla::layers::StackingContextHelper const&, mozilla::wr::DisplayListBuilder&, mozilla::wr::IpcResourceUpdateQueue&, bool) (xul.dll)
#5    mozilla::nsDisplayOpacity::CreateWebRenderCommands(mozilla::wr::DisplayListBuilder&, mozilla::wr::IpcResourceUpdateQueue&, mozilla::layers::StackingContextHelper const&, mozilla::layers::RenderRootStateManager*, mozilla::nsDisplayListBuilder*) (xul.dll)
#6    mozilla::layers::WebRenderCommandBuilder::CreateWebRenderCommandsFromDisplayList(mozilla::nsDisplayList*, mozilla::nsDisplayItem*, mozilla::nsDisplayListBuilder*, mozilla::layers::StackingContextHelper const&, mozilla::wr::DisplayListBuilder&, mozilla::wr::IpcResourceUpdateQueue&, bool) (xul.dll)
#7    mozilla::nsDisplayOwnLayer::CreateWebRenderCommands(mozilla::wr::DisplayListBuilder&, mozilla::wr::IpcResourceUpdateQueue&, mozilla::layers::StackingContextHelper const&, mozilla::layers::RenderRootStateManager*, mozilla::nsDisplayListBuilder*) (xul.dll)
#8    mozilla::layers::WebRenderCommandBuilder::CreateWebRenderCommandsFromDisplayList(mozilla::nsDisplayList*, mozilla::nsDisplayItem*, mozilla::nsDisplayListBuilder*, mozilla::layers::StackingContextHelper const&, mozilla::wr::DisplayListBuilder&, mozilla::wr::IpcResourceUpdateQueue&, bool) (xul.dll)
#9    mozilla::layers::WebRenderLayerManager::EndTransactionWithoutLayer(mozilla::nsDisplayList*, mozilla::nsDisplayListBuilder*, WrFiltersHolder&&, mozilla::layers::WebRenderBackgroundData*, const double) (xul.dll)
#10    mozilla::nsDisplayList::PaintRoot(mozilla::nsDisplayListBuilder*, gfxContext*, unsigned int, mozilla::Maybe<double>) (xul.dll)
#11    static nsLayoutUtils::PaintFrame(gfxContext*, nsIFrame*, nsRegion const&, unsigned int, mozilla::nsDisplayListBuilderMode, nsLayoutUtils::PaintFrameFlags) (xul.dll)
#12    mozilla::PresShell::PaintInternal(nsView*, mozilla::PaintInternalFlags) (xul.dll)
#13    nsViewManager::ProcessPendingUpdatesPaint(nsIWidget*) (xul.dll)
#14    nsViewManager::ProcessPendingUpdatesForView(nsView*, bool) (xul.dll)
#15    nsViewManager::ProcessPendingUpdates() (xul.dll)
Blocks: gfx-triage

Where do the allocation and free stacks come from?

Flags: needinfo?(gsvelto)

I want to know more about why we should ignore the crash stack here, and where these other alloc/free stacks come from.
To me, this then looks like an unrelated use-before-alloc, that just so happens to touch memory that was properly freed in your freed stack.
But you're saying the opposite, that we should look to the alloc/free stacks and disregard the crash stack? Can you give me a better understanding here? I'm a little lost! :)

The stacks are detected by PHC. PHC randomly selects guard pages to detect accesses to dead objects. The allocation and free stacks for the objects allocated in the page are recorded, and once the elements are freed the pages are deliberately leaked so that UAF accesses to them can be detected. In this case the crash stack is the first access that happened to hit the dead object, the bug however lies where the dead pointer was leaked.

One thing that I find odd about those stacks is that the crash stack seem to involve a BaseTransactionId<TimeStamp> object while the allocated & freed object appears to be a WrSpatialId one and the two don't seem related.

Flags: needinfo?(gsvelto)

Assigning to get sec bugs owned. Feel free to hand off to someone else as needed.

Assignee: nobody → jgilbert

I think this is a wildpointer deref that's tripping in PHC. PHC would think that a wildpointer deref would be a UAF if the wildpointer happens to hit memory that was previously freed (and forgotten about) correctly.

I think I know where the problem is, but I'll grab the minidump and check the inline stack to make sure before I post a speculative fix.

Flags: needinfo?(jgilbert)
  1. [Inline Frame] mozglue.dll!mozilla::TimeStampValue::CheckQPC(const mozilla::TimeStampValue & aOther) Line 278
    at /builds/worker/checkouts/gecko/mozglue/misc/TimeStamp_windows.cpp(278)
  2. mozglue.dll!mozilla::TimeStampValue::operator-(const mozilla::TimeStampValue & aOther) Line 357
    at /builds/worker/checkouts/gecko/mozglue/misc/TimeStamp_windows.cpp(357)
  3. [Inline Frame] xul.dll!mozilla::TimeStampValue::operator<=(const mozilla::TimeStampValue & aOther) Line 77
    at /builds/worker/workspace/obj-build/dist/include/mozilla/TimeStamp_windows.h(77)
  4. [Inline Frame] xul.dll!mozilla::TimeStamp::operator<=(const mozilla::TimeStamp & aOther) Line 502
    at /builds/worker/workspace/obj-build/dist/include/mozilla/TimeStamp.h(502)
  5. xul.dll!nsRefreshDriver::Tick(mozilla::layers::BaseTransactionId<mozilla::VsyncIdType> aId, mozilla::TimeStamp aNowTime, nsRefreshDriver::IsExtraTick aIsExtraTick) Line 2184
    at /builds/worker/checkouts/gecko/layout/base/nsRefreshDriver.cpp(2184)
  6. [Inline Frame] xul.dll!mozilla::RefreshDriverTimer::TickDriver(nsRefreshDriver * driver, mozilla::layers::BaseTransactionId<mozilla::VsyncIdType> aId, mozilla::TimeStamp) Line 348
    at /builds/worker/checkouts/gecko/layout/base/nsRefreshDriver.cpp(348)
  7. xul.dll!mozilla::RefreshDriverTimer::TickRefreshDrivers(mozilla::layers::BaseTransactionId<mozilla::VsyncIdType> aId, mozilla::TimeStamp aNow, nsTArray<RefPtr<nsRefreshDriver>> & aDrivers) Line 326
    at /builds/worker/checkouts/gecko/layout/base/nsRefreshDriver.cpp(326)
  8. xul.dll!mozilla::RefreshDriverTimer::Tick(mozilla::layers::BaseTransactionId<mozilla::VsyncIdType> aId, mozilla::TimeStamp now) Line 344
    at /builds/worker/checkouts/gecko/layout/base/nsRefreshDriver.cpp(344)
  9. [Inline Frame] xul.dll!mozilla::VsyncRefreshDriverTimer::RunRefreshDrivers(mozilla::layers::BaseTransactionId<mozilla::VsyncIdType> aId, mozilla::TimeStamp) Line 780
    at /builds/worker/checkouts/gecko/layout/base/nsRefreshDriver.cpp(780)
  10. xul.dll!mozilla::VsyncRefreshDriverTimer::RefreshDriverVsyncObserver::TickRefreshDriver(mozilla::layers::BaseTransactionId<mozilla::VsyncIdType> aId, mozilla::TimeStamp aVsyncTimestamp) Line 703
    at /builds/worker/checkouts/gecko/layout/base/nsRefreshDriver.cpp(703)
  11. xul.dll!mozilla::VsyncRefreshDriverTimer::RefreshDriverVsyncObserver::NotifyParentProcessVsync() Line 621
    at /builds/worker/checkouts/gecko/layout/base/nsRefreshDriver.cpp(621)
  12. xul.dll!mozilla::VsyncRefreshDriverTimer::RefreshDriverVsyncObserver::NotifyVsync(const mozilla::VsyncEvent & aVsync) Line 548
    at /builds/worker/checkouts/gecko/layout/base/nsRefreshDriver.cpp(548)
  13. xul.dll!mozilla::dom::VsyncMainChild::RecvNotify(const mozilla::VsyncEvent & aVsync, const float & aVsyncRate) Line 67
    at /builds/worker/checkouts/gecko/dom/ipc/VsyncMainChild.cpp(67)
  14. xul.dll!mozilla::dom::PVsyncChild::OnMessageReceived(const IPC::Message & msg__) Line 208
    at /builds/worker/workspace/obj-build/ipc/ipdl/PVsyncChild.cpp(208)
  15. xul.dll!mozilla::ipc::PBackgroundChild::OnMessageReceived(const IPC::Message & msg__) Line 6167
    at /builds/worker/workspace/obj-build/ipc/ipdl/PBackgroundChild.cpp(6167)
  16. [Inline Frame] xul.dll!mozilla::ipc::MessageChannel::DispatchAsyncMessage(mozilla::ipc::ActorLifecycleProxy * aProxy, const IPC::Message & aMsg) Line 1658
    at /builds/worker/checkouts/gecko/ipc/glue/MessageChannel.cpp(1658)
Flags: needinfo?(jgilbert)

Here are the suspicious parts:

  1. [Inline Frame] mozglue.dll!mozilla::TimeStampValue::CheckQPC(const mozilla::TimeStampValue & aOther) Line 278
    at /builds/worker/checkouts/gecko/mozglue/misc/TimeStamp_windows.cpp(278)
MFBT_API uint64_t TimeStampValue::CheckQPC(const TimeStampValue& aOther) const {
> uint64_t deltaGTC = mGTC - aOther.mGTC; [1]
  1. mozglue.dll!mozilla::TimeStampValue::operator-(const mozilla::TimeStampValue & aOther) Line 357
    at /builds/worker/checkouts/gecko/mozglue/misc/TimeStamp_windows.cpp(357)
MFBT_API uint64_t
TimeStampValue::operator-(const TimeStampValue& aOther) const {
  if (IsNull() && aOther.IsNull()) { [2]
    return uint64_t(0);
  }

> return CheckQPC(aOther);
}
  1. xul.dll!mozilla::RefreshDriverTimer::TickRefreshDrivers(mozilla::layers::BaseTransactionId<mozilla::VsyncIdType> aId, mozilla::TimeStamp aNow, nsTArray<RefPtr<nsRefreshDriver>> & aDrivers) Line 326
    at /builds/worker/checkouts/gecko/layout/base/nsRefreshDriver.cpp(326)
    for (nsRefreshDriver* driver : aDrivers.Clone()) {  // [3]
      // don't poke this driver if it's in test mode
      if (driver->IsTestControllingRefreshesEnabled()) {
        continue;
      }

>     TickDriver(driver, aId, aNow);
    }

[2]: It feels weird to check IsNull && other.IsNull rather than ||. There are a bunch of (non-release! :( ) asserts around for ensuring that generally !IsNull for math, but this seems like it's designed to say "yeah but let's allow comparing two nulls. It looks like "null" is really the term for zero-timestamp, which isn't really null per se, and you can theoretically do math on it, but yeah, it probably won't give what you want.
Because of short-circuiting, if this->IsNull, then we won't deref aOther at [2], leaving it until we access it for our crash at [1].

[3]: I don't think the lifetime of the result from Clone() is lifetime extended. I think c++ only says that begin(X) and end(X) are called, so this would be a dangling pointer.

Thoughts, not necessarily related:
Our read violation address is 0x0000021215b77c98. That's the address the minidump crash session gives me for &aOther also.
The value of the nsRefreshDriver* driver appears to be 0x0000021215af7c00. This is far from &aOther though: -0x80098.
It looks like the operator- we end up in comes to us via aNowTime <= mMostRecentRefresh.

I do think [3] is the most suspicious.
Talking with gsvelto, there are two ways where we can end up with a type-mismatch from PHC between use-type and alloc/free-type:

  1. A wildpointer that by luck happens to hit PHC guard pages. This is kinda a use-before-alloc rather than use-after-free.
  2. A true UAF, but use-after-free-with-reuse:
  • alloc A
  • free A
  • alloc B at &A
  • free B, &A now marked as "B was here"
  • uaf for A, but our records say that the most recent free at &A was B.

RefreshDriver tick calls tend to make changes to things, hence why we Clone the list before iterating over it.
However, it's likely that this too-tight alloc+free mean we quickly reuse and overwrite that memory, but we keep iterating over it with our loop reading.

So there's at least some lifetime extension:

{
   auto && __range = (aDrivers.Clone()); // I think this is ok.
   for ( auto __begin = begin-expr,
              __end = end-expr;
         __begin != __end;
         ++__begin ) {
      for-range-declaration = *__begin;
      statement
   }
}
  1. [Inline Frame] xul.dll!mozilla::TimeStamp::operator<=(const mozilla::TimeStamp & aOther) Line 502
    at /builds/worker/workspace/obj-build/dist/include/mozilla/TimeStamp.h(502)
  bool operator<=(const TimeStamp& aOther) const {
    MOZ_ASSERT(!IsNull(), "Cannot compute with a null value");
    MOZ_ASSERT(!aOther.IsNull(), "Cannot compute with aOther null value");
>   return mValue <= aOther.mValue;
  }

Here are those non-release asserts.

template <class E>
class nsTArray : public nsTArray_Impl<E, nsTArrayInfallibleAllocator> {
  [...]
  using self_type = nsTArray<E>;
  [...]
  self_type Clone() const {
    self_type result;
    result.Assign(*this);
    return result;
  }

This should lifetime-extend fine, phew.

Sotaro, could you take a look at this?

Flags: needinfo?(sotaro.ikeda.g)
Flags: needinfo?(jmuizelaar)
Flags: needinfo?(sotaro.ikeda.g)
Flags: needinfo?(jmuizelaar)

I looked at the minidump for https://crash-stats.mozilla.org/report/index/7c77161a-d7df-430c-b942-0ef070220513 and it looks like an ImageStartData from https://searchfox.org/mozilla-central/rev/97c902e8f92b15dc63eb584bfc594ecb041242a4/layout/base/nsRefreshDriver.cpp#2600 is the data that is not live during the TimeStamp subtraction.

Perhaps the hash table is being mutated during iteration?

I will pick this back up.

No longer blocks: gfx-triage

tn, this looks sort of image lib related. Do you mind taking a look sometime soonish?

Assignee: jgilbert → tnikkel

(In reply to Jeff Muizelaar [:jrmuizel] from comment #13)

Perhaps the hash table is being mutated during iteration?

I went over that for loop with a fine tooth comb. I couldn't find any way the hash table could be mutated.

I looked at the minidump for https://crash-stats.mozilla.org/report/index/7c77161a-d7df-430c-b942-0ef070220513 and it looks like an ImageStartData from https://searchfox.org/mozilla-central/rev/97c902e8f92b15dc63eb584bfc594ecb041242a4/layout/base/nsRefreshDriver.cpp#2600 is the data that is not live during the TimeStamp subtraction.

Hmm, we access |start| on the line before but we don't crash there?

The crash report in comment 0 crashes on a completely different timestamp subtraction not related to the image start table near the start of Tick:
https://searchfox.org/mozilla-central/rev/4a10f85ebca1c22c803d0253a82fe39514ba0b53/layout/base/nsRefreshDriver.cpp#2344

This suggests it's more of a general refresh driver problem I think.

If you search crashstats for crashes in nsRefreshDriver::Tick and look at the read exceptions at addresses that look like that could have been valid addresses then you find crashes on a variety of lines in that function. Maybe someone with minidump access could look at a few of those to see if we could learn something about this bug?

I also noticed bug 1786818 while looking into this, but I don't think it could cause this.

Flags: needinfo?(jmuizelaar)

I went through the last 100 crashreports in mozilla::TimeStampValue::operator-, none of them look the two stacks being analyzed here where it looks like something might be going wrong in nsRefreshDriver::Tick. Almost all of them don't go through Tick at all. Two of them went through Tick, but crashed a few levels deeper in the intersection observer code, or in the animated image advancing code. One crashed when executing a auto profiler label in Tick, looked like it was just trying to get the current timestamp (ie Now).

Assignee: tnikkel → ahale

Theory I'm operating on is that a large chunk of memory was freed (_munmap in jemalloc) and then the hot code of another thread was hammering on mozilla::TimeStampValue::operator- and SEGV'd on the no longer mapped memory, which I believe matches the advice in comment #3.

I think I may have over-focused on TimeStampValue::operator- being the crash point, but reading again it sounds like that is the thing to ignore, and instead maybe we should be looking into ClipManager::PushOverrideForASR and ClipManager::PopOverrideForASR, with the UAF at:
std::list<std::pair<const WrSpatialId, std::stack<WrSpatialId> >::erase(std::_List_const_iterator<>).
I believe this must be the internal implementation of (one of the buckets in) std::unordered_map<wr::WrSpatialId, std::stack<wr::WrSpatialId>> mASROverride.

Note: WrSpatialId is a POD struct.

These functions have changed a little recently, but at the time they looked like:
https://searchfox.org/mozilla-central/rev/b5e1870820c09fa2f6cebcfe093cf00d2143ec1a/gfx/layers/wr/ClipManager.cpp#102-132

void ClipManager::PushOverrideForASR(const ActiveScrolledRoot* aASR,
                                     const wr::WrSpatialId& aSpatialId) {
  Maybe<wr::WrSpatialId> space = GetScrollLayer(aASR);
  MOZ_ASSERT(space.isSome());

  CLIP_LOG("Pushing %p override %zu -> %s\n", aASR, space->id,
           ToString(aSpatialId.id).c_str());

  auto it = mASROverride.insert({*space, std::stack<wr::WrSpatialId>()});
  it.first->second.push(aSpatialId);

  // Start a new cache
  mCacheStack.emplace();
}
void ClipManager::PopOverrideForASR(const ActiveScrolledRoot* aASR) {
  MOZ_ASSERT(!mCacheStack.empty());
  mCacheStack.pop();

  Maybe<wr::WrSpatialId> space = GetScrollLayer(aASR);
  MOZ_ASSERT(space.isSome());

  auto it = mASROverride.find(*space);
  CLIP_LOG("Popping %p override %zu -> %s\n", aASR, space->id,
           ToString(it->second.top().id).c_str());

  it->second.pop();
  if (it->second.empty()) {
    mASROverride.erase(it);
  }
}

UAF in unordered_map::erase() like this would usually be from an iterator invalidation issue.
Because the contents of mASROverride are pods and containers of pods, there's no modification here of mASROverride between it = mASROverride.find() and mASROverride.erase(it) though.
However you can see that we don't even assert that find() doesn't return mASROverride.end()!
We rely on this line giving us the same WrSpatialId in both PushOverrideForASR and PopOverrideForASR, otherwise we would c.erase(c.end()) in Pop, which would is forbidden.

But, these are the alloc and free stacks, so I guess they're not necessarily problematic.
So what could possibly UAF here?

The only other reader of mASROverride is this safe and contained use:

wr::WrSpatialId ClipManager::SpatialIdAfterOverride(
    const wr::WrSpatialId& aSpatialId) {
  auto it = mASROverride.find(aSpatialId);
  if (it == mASROverride.end()) {
    return aSpatialId;
  }

It looks like PushOverrideForASR and PopOverrideForASR are correct in the nested case too:

PushOverrideForASR(foo, bar);
PushOverrideForASR(foo, bar);
PopOverrideForASR(foo);
PopOverrideForASR(foo);

So I think it's down to not all calls to PopOverrideForASR(foo) matching a corresponding call to PushOverrideForASR(foo, _). Unfortunately the callers don't look trivially symmetric.


There are two calling files for these functions, ClipManager.cpp and WebRenderCommandBuilder.cpp.

The changelog for ClipManager.cpp is not too busy:
https://hg.mozilla.org/mozilla-central/log/tip/gfx/layers/wr/ClipManager.cpp

This bug was created 2022-03-31.

  • (pushed 2022-08-23 21:47 +0000 Emilio Cobos Álvarez Emilio Cobos Álvarez - Bug 1661147 - When starting a new list, copy inputs from the stack as well. r=gfx-reviewers,bradwerth)
  • 2022-03-31: This bug was filed
  • pushed 2022-03-28 09:39 +0000 Hiroyuki Ikezoe Hiroyuki Ikezoe - Bug 1760222 - Detect scroll-linked effects only in the same refresh driver's timestamp. r=botond
    • Suspect because of timing for being on Nightly when this bug was filed
  • pushed 2022-01-31 16:51 +0000 Hiroyuki Ikezoe Hiroyuki Ikezoe - Bug 1571758 - Inform apz scroll generation to WebRender's ScrollFrame from the main-thread. r=botond
    • Suspect because of timing for likely being on Release when this bug was filed
  • (pushed 2021-12-09 09:32 +0000 Hiroyuki Ikezoe Hiroyuki Ikezoe - Bug 1744842 - Use LayoutVector2D instead of LayoutPoint to set scroll offset (the main-thread part). r=gw,botond)

WebRenderCommandBuilder.cpp is busier:

  • 2022-03-31: This bug was filed
  • pushed 2022-03-29 20:42 +0000 Nicolas Silva Nicolas Silva - Bug 1761770 - Dont make non-uniformly scaled items active. r=jrmuizel
  • pushed 2022-03-29 20:42 +0000 Nicolas Silva Nicolas Silva - Bug 1761770 - Adjust item activity decisions. r=jrmuizel
  • pushed 2022-03-16 09:52 +0000 Botond Ballo Botond Ballo - Bug 1753779 - Avoid emitting the same deferred transform item onto an ancestor and descendant WebRenderLayerScrollData node. r=tnikkel
  • pushed 2022-03-15 05:30 +0000 Botond Ballo Botond Ballo - Bug 1749190 - Try harder to avoid cyclic scroll metadata annotations when building WRScrollData. r=tnikkel
  • pushed 2022-03-15 05:30 +0000 Botond Ballo Botond Ballo - Bug 1749190 - Rename a variable to be more precise. r=tnikkel
  • pushed 2022-03-06 09:45 +0000 Lee Salzman Lee Salzman - Bug 1511493 - Ensure PushGlyphs uses the current transaction's IpcResourceUpdateQueue. r=emilio
  • (pushed 2022-02-24 03:51 +0000 Marian-Vasile Laza Marian-Vasile Laza - Merge autoland to mozilla-central. a=merge)
  • pushed 2022-02-24 03:51 +0000 Nicolas Silva Nicolas Silva - Bug 1686654 - Ensure SwitchItem is called at the start of each blob group. r=jrmuizel
  • pushed 2022-02-24 03:51 +0000 Nicolas Silva Nicolas Silva - Bug 1755747 - Add support for antialiased non-snapped rectangles. r=gfx-reviewers,aosmond
  • pushed 2022-02-23 09:38 +0000 Miko Mynttinen Miko Mynttinen - Bug 1714584 - Part 1: Decouple nsDisplayList internal list from
  • (pushed 2022-02-22 14:42 +0000 Norisz Fay Norisz Fay - Backed out 2 changesets (bug 1714584) per devs request for causing crashes )
  • (pushed 2022-02-22 09:37 +0000 Miko Mynttinen Miko Mynttinen - Bug 1714584 - Part 1: Decouple nsDisplayList internal list from nsDisplayItems r=mstange)
  • pushed 2022-02-22 09:37 +0000 Nicolas Silva Nicolas Silva - Bug 1753404 - Track hit tested bounds separately from painted bounds. r=jrmuizel
  • pushed 2022-02-22 09:37 +0000 Nicolas Silva Nicolas Silva - Bug 1753404 - Avoid extra blob tiles from invisible items. r=jrmuizel
  • pushed 2022-02-18 21:52 +0000 Andrew Osmond Andrew Osmond - Bug 1754978 - Part 1. Refactor CompositableHandle infrastructure to allow in-process driven handles. r=sotaro
  • (pushed 2022-02-18 09:08 +0000 Iulian Moraru Iulian Moraru - Backed out 2 changesets (bug 1754978) for causing valgrind bustages.)
  • (pushed 2022-02-17 05:15 +0000 Andrew Osmond Andrew Osmond - Bug 1754978 - Part 1. Refactor CompositableHandle infrastructure to allow in-process driven handles. r=sotaro)
  • pushed 2022-02-11 09:42 +0000 Andrew Osmond Andrew Osmond - Bug 1754556 - Update WebGPU external image resource only after present is complete. r=kvark
  • pushed 2022-01-27 09:46 +0000 Botond Ballo Botond Ballo - Bug 1751789 - Better handle the case where the deferred transform item's ASR is a descendant of the current item's ASR. r=tnikkel

It's not necessarily those, but I think we should pass this to hiro and/or botond to see if they have any ideas. (or maybe tnikkel sees something in those changelog commit lists?)


Also, in https://searchfox.org/mozilla-central/source/gfx/layers/wr/ClipManager.cpp#85-90 , the symmetry between push/pop use is not obvious, which is always suspect, but it does look ultimately correct:

void ClipManager::BeginList(const StackingContextHelper& aStackingContext) {
[...]
  ItemClips clips(nullptr, nullptr, false);
  if (!mItemClipStack.empty()) {
    clips = mItemClipStack.top();
  }
[...]
      PushOverrideForASR(clips.mASR, *referenceFrameId);
[...]
  mItemClipStack.push(clips);
}

void ClipManager::EndList(const StackingContextHelper& aStackingContext) {
[...]
  mItemClipStack.pop();
[...]
      PopOverrideForASR(mItemClipStack.empty() ? nullptr
                                               : mItemClipStack.top().mASR);
[...]
}

However, when this bug was filed, the symmetry was more obvious I think:
https://searchfox.org/mozilla-central/rev/352d47e58598c90b7e7754f8fc1465be19ed77be/gfx/layers/wr/ClipManager.cpp#52-90

Flags: needinfo?(tnikkel)
Flags: needinfo?(hikezoe.birchill)
Flags: needinfo?(botond)

Bug 1760222 and bug 1571758 are totally unrelated to the ASR structure.

Flags: needinfo?(hikezoe.birchill)

Probably this; https://hg.mozilla.org/integration/autoland/rev/3d623418ee05 . I am afraid I have almost zero knowledge, maybe ActiveScrolledRoot::PickDescendant picked a wrong one?

I'm fairly confident we can rule out {bug 1751789, bug 1749190, bug 1753779} as causing a regression in PushOverrideForASR/PopOverrideForASR calls being properly matched, too. While these bugs make changes to code that runs during WebRender display list building, this code should only impact the WebRenderScrollData that's produced as a secondary artifact, and not the WebRender display list itself (nor any intermediate state that affects the WebRender display list such as mASROverride).

I'm less sure about the other changes in the changelogs from comment 20, but nothing obvious jumps out.

I also tried to see if I can spot a reason why PushOverrideForASR/PopOverrideForASR calls may be mismatched by code inspection. I'm not really familiar with this code, but I couldn't spot any issue:

  • This pair of calls is matched: in the same function, both inside an if (stickyScrollContainer) block, condition does not change in between, no early returns in between. Also the aASR argument is the same and does not change in between the calls.
  • This pair of calls is less obvious, but I believe it pans out:
    • They're in different functions, but BeginList/EndList are matched in their respective call sites here and here.
    • Inside BeginList and EndList, the calls are conditioned on AffectsClipPositioning() and ReferenceFrameId(). These are properties of the stacking context which do not change after the stacking context's construction.
    • The hardest part to reason about is that the aASR arguments of the pair of calls in BeginList/EndList will be the same.
      • In BeginList, the argument is the mASR of the top of mItemClipStack, or null if mItemClipStack is empty.
      • BeginList then pushes a new entry into mItemClipStack.
      • The only other place that modifies mItemClipStack is SwitchItem(), which does a matched pair of pop() and then push(). So, if some SwitchItem() calls occur in between BeginList and EndList, then the top mItemClipStack may be different going into EndList then it was coming out of BeginList.
      • However, EndList starts by popping mItemClipStack, and then uses the mASR of the entry now at the top of the stack, or null if it's empty. It follows that this is the same value used by BeginList.

That said, I do have a suggestion for how to make progress here. As Kelsey observed in comment 20, PopOverrideForASR() does not check that this find() call returns a valid iterator.

Proposal: Let's land a patch to add a check and early-return from PopOverrideForASR(), as well as debug-asserting, if the call returns mASROverride.end().

This should avoid any UAF resulting from PushOverrideForASR/PopOverrideForASR being mismatched (for some reason we can't spot), and by turning it into a reliable debug-crash instead, possibly turn up affected testcases via fuzzing (or people running debug builds) which we can investigate further.

As far as I can tell, early-returning there shouldn't cause any other UB down the line that wouldn't have occurred otherwise. Proceeding with incorrect clips may cause rendering problems, but those would have happened anyways, with UB on top.


Looping in Markus (who may be more familiar with mItemClipStack/mOverrideASR, having reviewed bug 1377187) with a couple of questions:

  • Any reason why my proposal above might be a bad idea?
  • Anything you can spot that I missed in the analysis of PushOverrideForASR/PopOverrideForASR being matched, or any of the changes in the changelogs in comment 20 that look suspicious?
Flags: needinfo?(botond) → needinfo?(mstange.moz)

(In reply to Kelsey Gilbert [:jgilbert] from comment #20)

  • pushed 2022-02-24 03:51 +0000 Nicolas Silva Nicolas Silva - Bug 1686654 - Ensure SwitchItem is called at the start of each blob group. r=jrmuizel

Maybe not the named patch, but bug 1686654 in general could maybe be a possibility for causing this, it enables active SVG images which I think would give the web render grouping code in these files are significantly harder workout.

Severity: S2 → S3
Flags: needinfo?(jmuizelaar)

The severity field for this bug is set to S3. However, the bug is flagged with the sec-high keyword.
:ahale, could you consider increasing the severity of this security bug?

For more information, please visit auto_nag documentation.

Flags: needinfo?(ahale)

(In reply to Botond Ballo [:botond] from comment #23)

Proposal: Let's land a patch to add a check and early-return from PopOverrideForASR(), as well as debug-asserting, if the call returns mASROverride.end().

This should avoid any UAF resulting from PushOverrideForASR/PopOverrideForASR being mismatched (for some reason we can't spot), and by turning it into a reliable debug-crash instead, possibly turn up affected testcases via fuzzing (or people running debug builds) which we can investigate further.

As far as I can tell, early-returning there shouldn't cause any other UB down the line that wouldn't have occurred otherwise. Proceeding with incorrect clips may cause rendering problems, but those would have happened anyways, with UB on top.

I filed bug 1797186 to land a patch for this.

See Also: → 1797186
Flags: needinfo?(ahale)

I put up a socorro change to separate out the different signatures: https://github.com/mozilla-services/socorro/pull/6228

Thanks for all the good work on this! IIUC we're waiting to see if bug 1797186 resolves this UAF. I marked bug 1797186 as a dependency.

@ahale - If you can keep an eye on this and let us know after a few weeks if this appears to be resolved or if it's still around, I'd really appreciate it. Thanks!

Depends on: 1797186
Flags: needinfo?(ahale)

Recording this info about how to monitor this bug here so we don't forget it:
Since this bug report is based on a PHC crash we'll likely need to check PHC crashes on crash stats in order to determine if we still hit it. The answer on #crashreporting for how to search these crashes was:

"They end up on crash-stats with specific annotations. If you have protected data access this query[1] should make them show up"
[1] https://crash-stats.mozilla.org/search/?phc_alloc_stack=!__null__

(In reply to Maire Reavy [:mreavy] from comment #28)

@ahale - If you can keep an eye on this and let us know after a few weeks if this appears to be resolved or if it's still around, I'd really appreciate it. Thanks!

Yep, I'll be keeping an eye on it.

Today there are two crashes on crash stats that look like they would be prevented by bug 1797186 (the builds of the crashes don't have bug 1797186)
https://crash-stats.mozilla.org/report/index/a0432e20-d915-4aa8-97a3-c0cb30221110
https://crash-stats.mozilla.org/report/index/29ee1ce0-190e-4de1-a85f-363da0221107
Which I think shows that bug 1797186 is indeed fixing a real problem. It's odd that none of these crashes showed up before this week though. Still an open question as to how we are hitting the unbalanced Pop/Push that bug 1797186 handles more gracefully.

There have been no crashes matching the intersection of those patterns after Firefox 105.1.0

Crash stats filter for a PHC with signature containing PopOverrideForASR:
https://crash-stats.mozilla.org/signature/?phc_alloc_stack=%21__null&signature=mozilla%3A%3Alayers%3A%3AClipManager%3A%3APopOverrideForASR&date=%3E%3D2022-05-22T00%3A30%3A00.000Z&date=%3C2022-11-22T00%3A30%3A00.000Z#reports

Flags: needinfo?(ahale)
Status: NEW → RESOLVED
Closed: 1 year ago
Resolution: --- → FIXED

(In reply to Ashley Hale [:ahale] from comment #32)

There have been no crashes matching the intersection of those patterns after Firefox 105.1.0

Bug 1797186 landed in 108 IIUC. Do we know why this went away already in 106? I see that bug 1661147 touched the same code and landed in the 106 timeframe. Is that possibly related?

Flags: needinfo?(tnikkel)
Flags: needinfo?(mstange.moz)
Flags: needinfo?(ahale)

I think this was just a rare crash, we should keep monitor through at least the 108 release cycle.

I'm still trying to find the optimal crash stats criteria, like my search in comment #32 is for PopOverrideForASR but that's not the signature we identified and put in socorro configuration to disambiguate different ways to reach the TimeStampValue::operator- function. I'm not completely sure I understand how PHC is annotated here but I'll make some other queries and see if I find any patterns.

I agree this should not be closed until we have data showing it is addressed by the PopOverrideForASR fixes.

Status: RESOLVED → REOPENED
Flags: needinfo?(ahale)
Resolution: FIXED → ---

Here is a more useful crash stats query for specifically signatures containing mozilla::TimeStampValue::operator- with a PHC crash after build id 20221025 (corresponding to the date when Bug 1797186 landed, but not necessarily containing it), I see shutdownhang on many of the signatures which seems notable.

Of those crashes, I only see one on 108 or later https://crash-stats.mozilla.org/report/index/660d7eeb-8a6e-406e-b0f8-503720221127 which is on version 108.0b6, which has a buildid of 20221124185931 which is considerably newer than 20221025 when Bug 1797186 (mentioned in comment #31) changed the PopOverrideForASR behavior to have bounds checking, but this may still be a reduction in crash rate, there may be multiple causes of PHC crashes in this code.

Status: Waiting for 108 to deploy to a majority of users to confirm if this rare crash is fixed, due to the stochastic nature of PHC detection we need a sufficient amount of users running 108 to find out if this is actually fixed, and the fact it is a rare crash makes that even harder, so once 108 hits release channel I can draw conclusions from the crash stats.

My prediction is that we will still see some crash volume in this method but via different signatures as there appear to be multiple causes of underflow/overflow/PHC/UAF involved that just happen to hit this method on their path of doom, the method isn't the problem, it's just a symptom.

It would be nice to see shutdownhang signatures drop off the chart as well, which I think has been fixed by Bug 1797186 but we'll see.

I don't have phc access on crashstats, but comment 0 includes the alloc and free stacks in addition to the actual crashing stack. Do we have access to the alloc/free stacks on these crashes?

If you point me at the crashes you need inspected I can extract the stacks for you.

Flags: needinfo?(tnikkel)

I can't even get a list of crashes to look at without phc data access. The question I want to answer is: when did the crashes happen and in what versions of phc crashes with alloc/free stacks that go through PushOverrideForASR/PopOverrideForASR?

Flags: needinfo?(tnikkel)

I seem to no longer be able to see the phc annotations so I can't measure this anymore. I still have protected data access according to crash stats but this annotation no longer appears in the crash reports for me, so I can't confirm if my searches are working correctly. Though I may be misremembering and maybe I was never able to see phc annotations to determine if my searches work correctly. Currently I have limited confidence in the validity of the search condition 'phc_alloc_stack does not have terms __null' since I can't tell if it's selecting them correctly.

If I am understanding correctly, the phc alloc stack and phc free stack would be comma separated numbers, so I don't know what numbers would exist in those fields to indicate PopOverrideForASR and PushOverrideForASR are involved to verify if the fix in Bug 1797186 prevents some portion of these crashes, so I'm going just on crash volume of crashes meeting my criteria which is kind of a blind metric.

From a quick search there are have been no PHC crashes reported under this signature in the past 6 months. Truth be said there's been too few PHC crashes reported in general and we're trying to tweak the heuristics to improve the situation.

(In reply to Ashley Hale [:ahale] from comment #41)

I seem to no longer be able to see the phc annotations so I can't measure this anymore. I still have protected data access according to crash stats but this annotation no longer appears in the crash reports for me, so I can't confirm if my searches are working correctly. Though I may be misremembering and maybe I was never able to see phc annotations to determine if my searches work correctly. Currently I have limited confidence in the validity of the search condition 'phc_alloc_stack does not have terms __null' since I can't tell if it's selecting them correctly.

A query with phc_kind exists should reveal all PHC crashes on file if you have protected data access.

(In reply to Gabriele Svelto [:gsvelto] from comment #42)

From a quick search there are have been no PHC crashes reported under this signature in the past 6 months. Truth be said there's been too few PHC crashes reported in general and we're trying to tweak the heuristics to improve the situation.

Wouldn't we need to search the phc alloc/free stacks, not the signature of the crash? The signature is pretty much random, just whoever happened to touch the memory after the problem?

(In reply to Timothy Nikkel (:tnikkel) from comment #43)

Wouldn't we need to search the phc alloc/free stacks, not the signature of the crash? The signature is pretty much random, just whoever happened to touch the memory after the problem?

Good point, and we neither process nor index the alloc/free stack, it's still a manual process so we can't do it directly.

(In reply to Timothy Nikkel (:tnikkel) from comment #43)

Wouldn't we need to search the phc alloc/free stacks, not the signature of the crash? The signature is pretty much random, just whoever happened to touch the memory after the problem?

It shouldn't be random, no. Once the memory is freed, the use must come from a dangling pointer and that should always be roughly the same location, so the "use" stack should be stable. It would still be nice if we could eventually process alloc/free stacks but that's probably going to be more work.

This is in wait and see mode, we believe the issue is addressed.

(In reply to Gabriele Svelto [:gsvelto] from comment #44)

(In reply to Timothy Nikkel (:tnikkel) from comment #43)

Wouldn't we need to search the phc alloc/free stacks, not the signature of the crash? The signature is pretty much random, just whoever happened to touch the memory after the problem?

Good point, and we neither process nor index the alloc/free stack, it's still a manual process so we can't do it directly.

Lack of index is why I have trouble searching for these then, some tooling around that would be helpful :)

Okay I found the problem with my queries for phc alloc stack, which is more or less a typo in reverse - I clicked the URL in comment #31 but bugzilla automatic markup of links truncates before the __ at the end of the !null condition, which broke the search and was subtle enough I wasn't sure if it was meant to be that way.

However the number of phc alloc exists crashes is extremely low, so low that it's hard to confirm anything in regards to the mozilla::TimeStampValue::operator- signature or PopOverrideForASR signature as searches like https://crash-stats.mozilla.org/api/SuperSearch/?signature=~mozilla%3A%3ATimeStampValue%3A%3Aoperator-&phc_alloc_stack=%21__null__&_facets=signature&_facets=version give no results even for a 6 month window. Even finding ones that involve WebRender is hard, but there are a couple that seem like driver crashes, not really of interest to this bug.

Will discuss this bug in gfx triage meeting again, as we have no clear indicator that the problem is fixed or not fixed, PHC crashes are too rare to even include a single sample one way or the other currently.

:gsvelto - is there something wrong with the PHC detection? I find no crashes in the past 2 weeks with my search for 'phc alloc stack' with exists condition, I can find only two at 1 month window. Currently I can't confirm this bug still exists, but I also have low confidence that it is fixed due to a lack of any samples to work with.

Flags: needinfo?(gsvelto)
No longer blocks: gfx-triage

(In reply to Ashley Hale [:ahale] from comment #50)

:gsvelto - is there something wrong with the PHC detection? I find no crashes in the past 2 weeks with my search for 'phc alloc stack' with exists condition, I can find only two at 1 month window. Currently I can't confirm this bug still exists, but I also have low confidence that it is fixed due to a lack of any samples to work with.

It's because the population is small, Randell wrote a WIP to increase it in bug 1800010 but the work there is not complete yet.

Flags: needinfo?(gsvelto)

Tentatively resolving this as fixed by bug 1797186. Feel free to reopen if needed.

Status: REOPENED → RESOLVED
Closed: 1 year ago1 year ago
Resolution: --- → FIXED

Is this something we'd need to fix on ESR102 still? AFAICT, the code in question looks mostly the same (though not exactly).

Assignee: ahale → tnikkel
Group: gfx-core-security → core-security-release
Flags: needinfo?(tnikkel)
Target Milestone: --- → 108 Branch

Uplifting to esr102 seems reasonable to me. I can make a patch for esr102, it should be very easy.

Flags: needinfo?(tnikkel)

A rebased patch would be great, thanks. We'll need it before next week's RC builds.

Flags: needinfo?(tnikkel)

Done. Sorry was sick the past couple days.

Flags: needinfo?(tnikkel)
Whiteboard: [adv-102.8+r]
Whiteboard: [adv-102.8+r] → [adv-esr102.8+r]
Group: core-security-release
You need to log in before you can comment on or make changes to this bug.