Closed Bug 1167557 Opened 10 years ago Closed 9 years ago

crash in mozilla::image::SurfaceCacheImpl::LookupBestMatch(mozilla::image::Image* const, mozilla::image::SurfaceKey const&, mozilla::Maybe<unsigned int> const&)

Categories

(Core :: Graphics: ImageLib, defect)

38 Branch
x86
Windows NT
defect
Not set
critical

Tracking

()

RESOLVED WORKSFORME
Tracking Status
firefox38.0.5 --- wontfix
firefox39 + wontfix
firefox40 + wontfix
firefox41 - wontfix
firefox42 - unaffected

People

(Reporter: MatsPalmgren_bugz, Assigned: vliu)

References

Details

(Keywords: crash, Whiteboard: [gfx-noted])

Crash Data

Attachments

(2 files, 3 obsolete files)

This bug was filed from the Socorro interface and is report bp-53fc21b1-5aa4-4c25-9220-482fd2150515. ============================================================= The volume for this signature spiked in v38 (~3400 crashes in 28 days). The user comments mentions "games, farmville, Flash" etc but the stack looks more like ImageLib. All crashes are on x86 Windows. 0 mozilla::image::SurfaceCacheImpl::LookupBestMatch(mozilla::image::Image* const, mozilla::image::SurfaceKey const&, mozilla::Maybe<unsigned int> const&) 1 mozilla::image::RasterImage::LookupFrame(unsigned int, nsIntSize const&, unsigned int) 2 mozilla::image::RasterImage::StartDecoding() 3 nsDocument::AddImage(imgIRequest*) 4 nsImageLoadingContent::TrackImage(imgIRequest*) 5 PresShell::UpdateImageVisibility() 6 nsRunnableMethodImpl<void ( mozilla::dom::XULDocument::*)(void), void, 1>::Run() 7 nsThread::ProcessNextEvent(bool, bool*) 8 mozilla::ipc::MessagePump::Run(base::MessagePump::Delegate*) 9 MessageLoop::RunHandler() 10 nsAString_internal::Replace(unsigned int, unsigned int, wchar_t const*, unsigned int, mozilla::fallible_t const&) 11 nsLocalFile::AppendInternal(nsString const&, bool) 12 nsLocalFile::Append(nsAString_internal const&) 13 GetProcessPriorityBoost 14 GetCurrentThreadStub 15 nsEditingSession::EndDocumentLoad(nsIWebProgress*, nsIChannel*, nsresult, bool) 16 std::basic_string<char, std::char_traits<char>, std::allocator<char> >::_Tidy(bool, unsigned int) 17 `anonymous namespace'::HistogramGet(char const*, char const*, unsigned int, unsigned int, unsigned int, unsigned int, bool, base::Histogram**) 18 wmain 19 firefox.exe@0x439f 20 LdrpAppendToForwarderList 21 _RtlUserThreadStart
Whiteboard: [gfx-noted]
If this is a simple fix, it's large enough to uplift if we can get to it fast enough. Adding Daniel as he was the reviewer.
Blocks: 1119774
Flags: needinfo?(seth)
Flags: needinfo?(dholbert)
Taking a look now.
Flags: needinfo?(seth)
This is probably *not* simple at all. We appear to be crashing on dereferencing a pointer, right after null-checking it. Even when considering inlining, I don't understand it.
Actually this entire stack looks insane! HistogramGet() from telemetry calls std::basic_string<char, std::char_traits<char>, std::allocator<char> >::_Tidy(bool, unsigned int) which calls nsEditingSession::EndDocumentLoad()? Is there any reason to think we can trust anything in this stack at all?
Flags: needinfo?(dholbert)
Yeah, the stack also seems to show string-manipulation code (stack level 10) calling into IPC code. I think the stack is just horked.
A really high proportion of the comments mention that flash crashed. I think we should take that seriously.
David, I know you had some dealings with stacks that don't look right - is there a good way to get decent stacks for those?
Flags: needinfo?(dmajor)
Here's a better stack... xul!mozilla::image::SurfaceCacheImpl::LookupBestMatch xul!mozilla::image::RasterImage::LookupFrame xul!mozilla::image::RasterImage::RequestDecodeForSize xul!mozilla::image::RasterImage::StartDecoding xul!imgRequestProxy::StartDecoding xul!nsDocument::AddImage xul!nsImageLoadingContent::TrackImage xul!PresShell::UpdateImageVisibility xul!nsRunnableMethodImpl<void (__thiscall mozilla::dom::XULDocument::*)(void),void,1>::Run xul!nsThread::ProcessNextEvent xul!mozilla::ipc::MessagePump::Run xul!MessageLoop::RunHandler xul!MessageLoop::Run xul!nsBaseAppShell::Run xul!nsAppShell::Run xul!nsAppStartup::Run xul!XREMain::XRE_mainRun
There's some inlining but the nearby disassembly looks a lot like the DrawableFrameRef ctor: explicit DrawableFrameRef(imgFrame* aFrame) : mFrame(aFrame) , mRef(aFrame->mVBuf) ; edx must be mSurface because of the offset 0x8 from CachedSurface 015a5ece 8b5608 mov edx,dword ptr [esi+8] 015a5ed1 89542428 mov dword ptr [esp+28h],edx 015a5ed5 89542430 mov dword ptr [esp+30h],edx ; mSurface is passed as aFrame to DrawableFrameRef ; Below is the addref for "mFrame(aFrame)" ; Spoiler alert, we skipped this block of code 015a5ed9 85d2 test edx,edx 015a5edb 7407 je xul!mozilla::image::SurfaceCacheImpl::LookupBestMatch+0xfd (015a5ee4) 015a5edd 33c0 xor eax,eax 015a5edf 40 inc eax 015a5ee0 f00fc102 lock xadd dword ptr [edx],eax ; The crash happens during the deref in "aFrame->mVBuf" ; I confirmed that mVBuf is offset 0x18 from imgFrame 015a5ee4 8b6a18 mov ebp,dword ptr [edx+18h] << CRASH HERE ; Below is the addref for "mRef(aFrame->mVBuf)" 015a5ee7 85ed test ebp,ebp 015a5ee9 7408 je xul!mozilla::image::SurfaceCacheImpl::LookupBestMatch+0x10c (015a5ef3) 015a5eeb 33c0 xor eax,eax 015a5eed 40 inc eax 015a5eee f00fc14500 lock xadd dword ptr [ebp],eax So the mSurface of "surface" is null.
Flags: needinfo?(dmajor)
Nice! Seth, just protect against it, or something going on further up the chain?
Flags: needinfo?(seth)
Thanks for the analysis, David! This is a pretty serious violation of the invariants of this code. (Which is why we have fatal assertions to check for it; apparently we've never managed to hit this in testing.) It doesn't make sense to check for it here. I need to look over the code and ensure that everywhere we can create a CachedSurface, we're actually passing it a valid imgFrame.
Flags: needinfo?(seth)
Thanks Seth. Given the spike, let's see what the solution is, we may even request an uplift to 38.0.5. Let me know if I can help have this go faster. [Tracking Requested - why for this release]: That's a big spike of crashes in 38.
Assignee: nobody → seth
Timothy, see if Seth could use help?
Flags: needinfo?(tnikkel)
Tracking for 39 and 40 because of Comment 12.
Adding flags to show affected versions. Currently there are only a handful of crashes in the last week for 39. It may be that we don't have a lot of farmville-playing beta users.
No crashes reported with this signature for any Firefox version for the last 7 days. That seems odd. Maybe this was fixed elsewhere.
Yeah, that's very odd. The most recent crashes are from 2015-06-12 and it seem to have stopped rather abruptly. The only explanation I can think of is that most crashes came from a single popular site, eg Facebook, and that they rolled out a new version of some part of their site that "fixed" it. Perhaps someone can post a summary of the URLs for the crashes we have?
Keywords: needURLs
I'm going to call this "wontfix" for 39+ in the tracking flags. We didn't fix it that we know of but there is no "worksforme" flag. Mats, not sure if you want to investigate further, leave open, but I leave that up to you.
I found a nearly identical signature that I'm confident is the same crash. The only difference is "Maybe<unsigned int>" became "Maybe<T>". Crashes with the new signature starts to appear 2015-06-09 in low numbers but the volume goes up around 2015-06-11 which is where the first signature starts to fade out. I suspect the signature change is due to a software upgrade or configuration change on the Socorro server around 2015-06-11. Something to look out for in other bugs...
Crash Signature: [@ mozilla::image::SurfaceCacheImpl::LookupBestMatch(mozilla::image::Image* const, mozilla::image::SurfaceKey const&, mozilla::Maybe<unsigned int> const&)] → [@ mozilla::image::SurfaceCacheImpl::LookupBestMatch(mozilla::image::Image* const, mozilla::image::SurfaceKey const&, mozilla::Maybe<unsigned int> const&)] [@ mozilla::image::SurfaceCacheImpl::LookupBestMatch(mozilla::image::Image* const, mozilla::image:…
Flags: needinfo?(mats)
Yeah, that was bug 1163831.
Crash Signature: , mozilla::image::SurfaceKey const&, mozilla::Maybe<T> const&) ] → , mozilla::image::SurfaceKey const&, mozilla::Maybe<T> const&) ] [@ mozilla::image::SurfaceCacheImpl::LookupBestMatch ]
Thanks for catching this! I emailed the stability list about bug 1163831 and will update at the channel meeting. Let's watch this as we release 39; I'll keep tracking it.
I don't think I can add anything to the excellent analysis already done in this bug.
Flags: needinfo?(tnikkel)
I'm sorry this took me a long time to get to. There have been a lot of fires to put out over the past month. At any rate, from my inspection of the code, the only possible origin of a null CachedSurface::mSurface is if a null surface is passed to SurfaceCacheImpl::Insert in the first place. We're asserting about this, but we really need to be doing a runtime check and rejecting the insert. This patch adds such a runtime check. It's marked MOZ_ASSERT_UNREACHABLE, so if we hit it in testing, it'll give us a stack that'll help track the issue down.
Attachment #8628001 - Flags: review?(dholbert)
(In reply to Seth Fowler [:seth] from comment #24) > At any rate, from my inspection of the code, the only possible origin of a > null > CachedSurface::mSurface is if a null surface is passed to > SurfaceCacheImpl::Insert in the first place. Note that this function only has one caller -- the static SurfaceCache::Insert method. And this method which calls some getters on the passed-in surface before calling out to the SurfaceCacheImpl. (I guess those getters might run just fine on a null pointer, though...) Anyway, seems very odd that we'd get a null pointer passed in, to say the least. > we really need to be doing a runtime check and rejecting the insert. This patch > adds such a runtime check. It's marked MOZ_ASSERT_UNREACHABLE, so if we hit > it in testing, it'll give us a stack that'll help track the issue down. I wonder if we should instead use NS_RUNTIMEABORT, or something else which we're sure will trigger the crash reporter? If we do that & watch for the updated crash reports, that seems like it'd help us get a better shot at actually finding & fixing the issue. (I'm skeptical that MOZ_ASSERT_UNREACHABLE will really get us any useful stacks, since it's debug-only. If this was the sort of thing we could hit easily in test runs with debug builds, we probably would have hit it as a null-deref. Though maybe we have & it's buried in a randomorange bug somewhere, who knows...) Seth, what do you think?
Flags: needinfo?(seth)
(My concern with the current patch is that we'll work around the crash but never figure out what causes this, and we'll be left with this mysterious warty early-return. And we may or may not actually be handling things correctly in whichever caller is doing something Very Bad and calling us in this way.)
(MOZ_CRASH might be better than NS_RUNTIMEABORT; not sure what's better for crashing in a crashreporter-detectable way.)
(In reply to Daniel Holbert [:dholbert] from comment #25) > Note that this function only has one caller -- the static > SurfaceCache::Insert method. And this method which calls some getters on the > passed-in surface before calling out to the SurfaceCacheImpl. (I guess those > getters might run just fine on a null pointer, though...) Yeah, we're clearly getting away with that right now, or else the stack would be different. But it's a good idea to do the check before we dereference aSurface at all, so I'll update the patch to move the check to SurfaceCache::Insert() itself. > Anyway, seems very odd that we'd get a null pointer passed in, to say the > least. Yeah. It's really not obvious to me how it's happening, because the call sites all either have null checks or, in the case of VectorImage, they dereference the imgFrame pointer themselves repeatedly before calling SurfaceCache::Insert(). > I wonder if we should instead use NS_RUNTIMEABORT, or something else which > we're sure will trigger the crash reporter? If we do that & watch for the > updated crash reports, that seems like it'd help us get a better shot at > actually finding & fixing the issue. Yeah, I agree. It's pretty likely that if we were going to hit this in testing, we would've already done so. I'll switch it over to MOZ_CRASH. Hopefully the volume isn't too high. =)
Flags: needinfo?(seth)
OK, here's the new version of the patch.
Attachment #8628049 - Flags: review?(dholbert)
Attachment #8628001 - Attachment is obsolete: true
Attachment #8628001 - Flags: review?(dholbert)
Comment on attachment 8628049 [details] [diff] [review] Reject null surfaces in SurfaceCacheImpl::Insert >Bug 1167557 - Reject null surfaces in SurfaceCacheImpl::Insert. r=dholbert This needs an update, because we aren't fixing the bug (yet), and we're not rejecting exactly, and this isn't in SurfaceCacheImpl anymore. Maybe: Bug 1167557 diagnostic: crash if null surface is passed to null surfaces in SurfaceCache::Insert. r=dholbert >diff --git a/image/SurfaceCache.cpp b/image/SurfaceCache.cpp >--- a/image/SurfaceCache.cpp >+++ b/image/SurfaceCache.cpp >+ // Refuse null surfaces. >+ if (!aSurface) { >+ MOZ_CRASH("Don't pass null surfaces to SurfaceCache::Insert"); >+ return InsertOutcome::FAILURE; >+ } Drop the return statement after MOZ_CRASH, since it's unreachable. r=me with that. (and remember to mark this as "leave-open" too) Thanks!
Attachment #8628049 - Flags: review?(dholbert) → review+
(In reply to Daniel Holbert [:dholbert] from comment #30) > Maybe: > Bug 1167557 diagnostic: crash if null surface is passed to null surfaces in > SurfaceCache::Insert. r=dholbert er "passed to SurfaceCache::Insert", of course. :)
(In reply to Daniel Holbert [:dholbert] from comment #30) > ... All good feedback, thanks. > r=me with that. (and remember to mark this as "leave-open" too) Since this change will change the crash signature I'm guessing we're going to end up with a new bug. =) We can wait till that happens and then connect the two, though.
Here's the updated version of the patch.
Attachment #8628049 - Attachment is obsolete: true
Hmm. So it seems like crash reports with this signature actually stopped before I even pushed the diagnostic patch above. I'm not seeing any new crashes with SurfaceCache::Insert in the signature, either. Unfortunately, I think this got accidentally fixed before we tracked down the source. =\ I'm resolving this for now. We can reopen if anyone actually hits the new SurfaceCache::Insert MOZ_CRASH.
Status: NEW → RESOLVED
Closed: 10 years ago
Keywords: leave-open, needURLs
Resolution: --- → FIXED
(In reply to Seth Fowler [:seth] from comment #36) > Hmm. So it seems like crash reports with this signature actually stopped > before I even pushed the diagnostic patch above. I actually see a few crashes on Nightly with build IDs from the 2nd or later: https://crash-stats.mozilla.com/search/?product=Firefox&signature=%3Dmozilla%3A%3Aimage%3A%3ASurfaceCacheImpl%3A%3ALookupBestMatch(mozilla%3A%3Aimage%3A%3AImage*+const%2C+mozilla%3A%3Aimage%3A%3ASurfaceKey+const%26%2C+mozilla%3A%3AMaybe<T>+const%26)&version=42.0a1&build_id=>20150702000000&_facets=signature&_columns=date&_columns=signature&_columns=product&_columns=version&_columns=build_id&_columns=platform#crash-reports
Status: RESOLVED → REOPENED
Resolution: FIXED → ---
This one gets linked correctly by Bugzilla as well: https://crash-stats.mozilla.com/search/?product=Firefox&signature=%24mozilla%3A%3Aimage%3A%3ASurfaceCacheImpl%3A%3ALookupBestMatch&version=42.0a1&build_id=>20150702000000&_facets=signature&_columns=date&_columns=signature&_columns=product&_columns=version&_columns=build_id&_columns=platform#crash-reports
grr, or not :(
Hmm, that is really puzzling. I don't know how anyone is hitting this without hitting the MOZ_CRASH we added in this bug...
Is the crash volume on this still high?
(In reply to Milan Sreckovic [:milan] from comment #41) > Is the crash volume on this still high? Here is the breakdown... > [@ mozilla::image::SurfaceCacheImpl::LookupBestMatch(mozilla::image::Image* const, mozilla::image::SurfaceKey const&, mozilla::Maybe<unsigned int> const&) ] There are 0 reports in the last week for this signature. > [@ mozilla::image::SurfaceCacheImpl::LookupBestMatch(mozilla::image::Image* const, mozilla::image::SurfaceKey const&, mozilla::Maybe<T> const&) ] Firefox 42.*: 5 crashes last week Firefox 41.*: 27 crashes last week Firefox 40.*: 9 crashes last week Firefox 39.*: 490 crashes last week Firefox 38.*: 385 crashes last week > [@ mozilla::image::SurfaceCacheImpl::LookupBestMatch ] There are 0 reports in the last week for this signature. Overall this sits around #125 in the topcrash report for Release which is where it has been for at least a month. I would not call this "high" but it hasn't shifted much either.
Hmm. Thanks, I may have been looking at the wrong signature. It's bizarre that the inconsistency between the first and second one (in how the template parameter is represented) exists.
Looks like we won't do anything about this in 40. Seth, any chance you could find a fix for 41? Thanks
[Tracking Requested - why for this release]: I looked at the 3 crash signature and was unable to find any hits during last 7 days for first and third signature. The middle signature does have a few hits but too few for FF41 [@ mozilla::image::SurfaceCacheImpl::LookupBestMatch(mozilla::image::Image* const, mozilla::image::SurfaceKey const&, mozilla::Maybe<T> const&) ] Untracking for FF41 and setting the status to wontfix. I do not see any 42 and 43 hits either. Still nominating 42 for tracking as there are recent hits (on 40.0.2 and 40.0.3) so this may still be a real issue in our code base.
No crash for 42a2 (only one with 42a1). Not tracking.
I don't know what's left to do here. We did not have any Aurora or Nightly crashes, and only 2 crashes in Beta over the last month. There were 1602 crashes reported in Firefox 41.0.2 over the same period. This might be "fixed" for all intents and purposes.
I'm closing this bug as wontfix since these crashes have nearly disappeared in supported Firefox versions. Over the last week there's been 3 reports against Firefox 45 and zero reports on any of the test branches.
Status: REOPENED → RESOLVED
Closed: 10 years ago9 years ago
Resolution: --- → WONTFIX
(In reply to Anthony Hughes (:ashughes) [GFX][QA][Mentor] from comment #48) > I'm closing this bug as wontfix since these crashes have nearly disappeared "WONTFIX" doesn't seem like the right resolution for a crash that's mostly-gone-away. As a bug resolution, WONTFIX means "even if we had a patch that fixed the reported bug, we would *not take that patch*". It's a pretty severe resolution & is only rarely applicable. WORKSFORME seems more appropriate here, in the sense that this has gone from pretty-bad to mostly-good for an unknown reason (as you noted in comment 47). --> updating resolution. Also, we should probably back out the explicit MOZ_CRASH that we added here (or convert it to an assertion), since that was just a diagnostic tool (and IIUC it didn't turn out to be triggered, anyway, per comment 40). Seth?
Resolution: WONTFIX → WORKSFORME
Peter, let's have somebody convert that MOZ_CRASH to gfxDevCrash as a sanity check and close this as works for me.
Assignee: seth → nobody
Status: RESOLVED → REOPENED
Flags: needinfo?(seth) → needinfo?(howareyou322)
Resolution: WORKSFORME → ---
Right - not quite. I don't really want to back off that check quite yet, even though haven't triggered it. I'd want to see this: if (!aSurface) { MOZ_CRASH("Don't pass null surfaces to SurfaceCache::Insert"); } become this: if (!aSurface) { gfxDevCrash(LogReason::InvalidCacheSurface) << "Null surface in SurfaceCache::Insert"; } with gfx/2d/Logging.h needing to define InvalidCacheSurface.
Gotcha; sorry for the mixed signals. (I tagged my comment as 'obsolete' to auto-hide it & not cause confusion.)
See Also: → 1263678
vincent, please help on this.
Assignee: nobody → vliu
Flags: needinfo?(howareyou322)
Hi Milan, As previous comments, the proposed patch to convert that MOZ_CRASH to gfxDevCrash as a sanity check was attached. Could you please have a review for me? Thanks.
Attachment #8750137 - Flags: review?(milan)
Comment on attachment 8750137 [details] [diff] [review] Log for Null surface in SurfaceCache::Insert. Review of attachment 8750137 [details] [diff] [review]: ----------------------------------------------------------------- ::: image/SurfaceCache.cpp @@ +1035,5 @@ > } > > // Refuse null surfaces. > if (!aSurface) { > + gfxDevCrash(LogReason::InvalidCacheSurface) << "Null surface in SurfaceCache::Insert"; Since gfxDevCrash() doesn't crash in beta and release, we need to take care of this situation. We should return InsertOutcome::FAILURE at this point.
Attachment #8750137 - Flags: review?(milan)
Hi Milan, Thanks for your suggestion. The proposed patch was attached.
Attachment #8750782 - Flags: review?(milan)
Attachment #8750137 - Attachment is obsolete: true
Attachment #8750782 - Flags: review?(milan) → review+
(In reply to Milan Sreckovic [:milan] from comment #50) > Peter, let's have somebody convert that MOZ_CRASH to gfxDevCrash as a sanity > check and close this as works for me. https://hg.mozilla.org/integration/mozilla-inbound/rev/19f163886e2f
Status: REOPENED → RESOLVED
Closed: 9 years ago9 years ago
Resolution: --- → WORKSFORME
You need to log in before you can comment on or make changes to this bug.

Attachment

General

Created:
Updated:
Size: