Closed Bug 1049138 Opened 6 years ago Closed 5 years ago

crash in VisitNextEdgeBetweenRect

Categories

(Core :: Graphics: Layers, defect, critical)

34 Branch
All
Android
defect
Not set
critical

Tracking

()

RESOLVED FIXED
mozilla37
blocking-b2g 2.0+
Tracking Status
firefox33 --- unaffected
firefox34 + wontfix
firefox35 + fixed
firefox36 + fixed
firefox37 --- fixed
b2g-v2.0 --- fixed
b2g-v2.0M --- fixed
b2g-v2.1 --- fixed
b2g-v2.2 --- fixed
fennec 34+ ---

People

(Reporter: aaronmt, Assigned: snorp)

References

Details

(Keywords: crash, regression, topcrash-android-armv7, Whiteboard: layer-tiles)

Crash Data

Attachments

(6 files, 2 obsolete files)

This bug was filed from the Socorro interface and is 
report bp-128afe4b-0b2a-484f-9527-a920e2140730.
=============================================================
First crash report is on 07/26/2014
Might be related to bug 1045122
[Tracking Requested - why for this release]: new crash in 34 beta and topcrash.

First appearance of this crash 2014-07-24.
Trying out the most reported URLS against a Nexus 5. Top device is the GT-I9300	(S3), which I don't have.
Whiteboard: tiles
tracking-fennec: ? → +
Top crash in 36, 35 and 34
tracking-fennec: + → 34+
Assignee: nobody → milan
Mark/Brad - Milan tells me that his team has no time to look at this bug in the 34 time frame. Do you have anyone who can help out with the investigation?
Assignee: milan → nobody
Flags: needinfo?(mark.finkle)
Flags: needinfo?(blassey.bugs)
Snorp, do you have someone that can look at this?
Flags: needinfo?(snorp)
Flags: needinfo?(mark.finkle)
Flags: needinfo?(blassey.bugs)
(In reply to Brad Lassey [:blassey] (use needinfo?) from comment #7)
> Snorp, do you have someone that can look at this?

Not really...it would help if we had STR, though.
Flags: needinfo?(snorp)
Whiteboard: tiles → layer-tiles
I've been loading the URLs on the most affected device (GT-I9300) without success.
This seems to be pretty well-distributed between devices and graphics adapters, not specific to any of them. That said, it remains the stability #1 issue left on beta at this time.
Eugen can you please see if you can figure out what's going on here?
Assignee: nobody → esawin
This crashes on calling LockedBits::visitor (same as bug 1045122) from TiledContentClient.cpp, which was introduced in bug 1023473.
At this point this is wontfix for 34.
Jeff & Botond - This is the #1 crasher on Fx34. Can you take a look at this?
Flags: needinfo?(milan)
Flags: needinfo?(jmuizelaar)
Flags: needinfo?(botond)
This crash is in code introduced in bug 1023473. I'm not familiar with that code, and Jeff (who wrote it) is on PTO.

BenWa, you reviewed the patch in bug 1023473. Do you think this problem is easy to fix, or alternately, do you know if it would be risky to back bug 1023473 out until Jeff can fix this problem?
Blocks: 1023473
Flags: needinfo?(botond) → needinfo?(bgirard)
Jeff, do we have devices that reproduce this?
Flags: needinfo?(milan)
This is not reproducible. According to crash stats the following devices account for 30% of crashes with a long tail of devices that have crashed only a few times. I have tried loading URLs from crash stats on a GT-I9300 without any success. 

samsung GT-I9300 18 (REL)
samsung GT-I9100 16 (REL)
asus 	Nexus 7  19 (REL)
samsung GT-I9305 18 (REL)
samsung GT-N8000 16 (REL)
samsung GT-N8010 16 (REL)
samsung GT-P3100 16 (REL)
samsung SM-T111  17 (REL)
samsung SM-T110  17 (REL)
LGE 	Nexus 4  19 (REL)
Is this the same as bug 1045122?
Flags: needinfo?(jmuizelaar)
Attachment #8529313 - Flags: review?(bgirard) → review+
FYI: I think blassey has started a backout of bug 1023473, which seems to have introduced the crash, and also fixes bug 1073554.
If bug 1023473 is being backed out, it was a 2.0/2.1/2.2 blocker, so let's be careful.
https://hg.mozilla.org/mozilla-central/rev/4e0b66fa178b
Status: NEW → RESOLVED
Closed: 5 years ago
Resolution: --- → FIXED
Target Milestone: --- → mozilla37
Flags: needinfo?(bgirard)
Status: RESOLVED → REOPENED
Resolution: FIXED → ---
I've been hitting the VisitAbove crashes several times a day since the landing in comment #22. Today I bit the bullet installed a debug build of mozilla-central f14dcd1c8c0b (from tbpl), loaded http://www.geekzone.co.nz/ (not the mobile version) and scrolled up and down until:

12-15 23:30:17.517 I/Gecko   (22408): [22408] WARNING: Subdocument container has no frame: file /builds/slave/m-cen-and-api-11-d-00000000000/build/layout/base/nsDocumentViewer.cpp, line 2505
12-15 23:30:17.517 I/Gecko   (22408): ++DOMWINDOW == 18 (0x6edfdc00) [pid = 22408] [serial = 18] [outer = 0x6e8bfa00]
12-15 23:30:17.540 W/GeckoConsole(22408): [JavaScript Warning: "This site makes use of a SHA-1 Certificate; it's recommended you use certificates with signature algorithms that use hash functions stronger than SHA-1." {file: "https://googleads.g.doubleclick.net/xbbe/pixel?d=CPcCENFCGKCAhwM&v=APEucNXi3KEXTPas95lo9lQg654KuOF3Z97TUysheW-A0SJkRa1sjqj_1-9_cEYyH6H2inG9Lox3" line: 0}]
12-15 23:30:17.782 I/Gecko   (22408): void mozilla::AndroidBridge::HandleGeckoMessage(JSContext*, JS::HandleObject)
12-15 23:30:17.837 I/Gecko   (22408): void mozilla::AndroidBridge::HandleGeckoMessage(JSContext*, JS::HandleObject)
12-15 23:30:18.173 D/dalvikvm(22408): GC_CONCURRENT freed 1448K, 13% free 10419K/11904K, paused 3ms+6ms, total 42ms
12-15 23:30:18.446 I/Gecko   (22408): void mozilla::AndroidBridge::HandleGeckoMessage(JSContext*, JS::HandleObject)
12-15 23:30:18.798 F/MOZ_CRASH(22408): Hit MOZ_CRASH(long src memcpy) at /builds/slave/m-cen-and-api-11-d-00000000000/build/gfx/layers/client/TiledContentClient.cpp:994
12-15 23:30:18.798 F/libc    (22408): Fatal signal 11 (SIGSEGV) at 0x00000000 (code=1), thread 22425 (Gecko)
12-15 23:30:18.899 I/DEBUG   (22389): unexpected waitpid response: n=22425, status=00000b00
12-15 23:30:18.899 I/DEBUG   (22389): ptrace detach from 22425 failed: No such process
12-15 23:30:18.899 I/DEBUG   (22389): debuggerd committing suicide to free the zombie!
12-15 23:30:18.899 I/ActivityManager(  372): Process org.mozilla.fennec (pid 22408) has died.
12-15 23:30:18.899 I/WindowState(  372): WIN DEATH: Window{4293de60 u0 org.mozilla.fennec/org.mozilla.fennec.App}

Repeated it a few times and it was a long src memcpy each time.
(In reply to Nick Thomas [:nthomas] from comment #23)
> Repeated it a few times and it was a long src memcpy each time.

If you can reproduce this with a local build, it would help if you printed out all the arguments in the visitor function when we crash. Use printf_stderr, as that always goes to logcat on Android.

Thanks a lot for looking into this! I still can't reproduce it locally at all. What device are you using?
Flags: needinfo?(nthomas)
Really we should just write better tests for this region visitor code. It's really easy to write gtests for this using differently shaped regions.
(In reply to Kartikaya Gupta (email:kats@mozilla.com) from comment #25)
> Really we should just write better tests for this region visitor code. It's
> really easy to write gtests for this using differently shaped regions.

Yeah, also that.
(In reply to Kartikaya Gupta (email:kats@mozilla.com) from comment #25)
> Really we should just write better tests for this region visitor code. It's
> really easy to write gtests for this using differently shaped regions.

There are already some tests. If you have suggestions for more please add.
Here you go. None of the existing tests exercise the boundary conditions and one of the new ones I added fails. (assuming I did the "update gtests" patch correctly)
We we're being too conservative in our clamping.

I don't think this will fix the crash.
Attached patch Clamp y1 more (obsolete) — Splinter Review
But perhaps insufficiently conservative in clamping height
Attachment #8536655 - Attachment is obsolete: true
(In reply to James Willcox (:snorp) (jwillcox@mozilla.com) from comment #24)
> Thanks a lot for looking into this! I still can't reproduce it locally at
> all. What device are you using?

Galaxy Nexus, so old.

(In reply to Jeff Muizelaar [:jrmuizel] from comment #35)
> https://tbpl.mozilla.org/?tree=Try&rev=b80e42c86578

Unfortunately this only has an x86 Android build. The arm builders got split recently into two builds for different API levels, and trychooser isn't quite updated yet (bug 1108058). I've used https://wiki.mozilla.org/ReleaseEngineering/How_To/Trigger_arbitrary_jobs to try to fill the gaps. We'll have to wait and see if the tests start.
Flags: needinfo?(nthomas)
(In reply to Nick Thomas [:nthomas] from comment #36)
> (In reply to James Willcox (:snorp) (jwillcox@mozilla.com) from comment #24)
> > Thanks a lot for looking into this! I still can't reproduce it locally at
> > all. What device are you using?
> 
> Galaxy Nexus, so old.
> 
> (In reply to Jeff Muizelaar [:jrmuizel] from comment #35)
> > https://tbpl.mozilla.org/?tree=Try&rev=b80e42c86578
> 
> Unfortunately this only has an x86 Android build. The arm builders got split
> recently into two builds for different API levels, and trychooser isn't
> quite updated yet (bug 1108058). I've used
> https://wiki.mozilla.org/ReleaseEngineering/How_To/Trigger_arbitrary_jobs to
> try to fill the gaps. We'll have to wait and see if the tests start.

Any luck with these builds?
Still hitting 

12-17 16:44:19.912 F/MOZ_CRASH(10818): Hit MOZ_CRASH(long src memcpy) at /builds/slave/try-and-api-11-d-0000000000000/build/gfx/layers/client/TiledContentClient.cpp:994

with the debug build from b80e42c86578. Happy to try other builds, eg with the suggestion in comment #24.
(In reply to Nick Thomas [:nthomas] from comment #38)
> Still hitting 
> 
> 12-17 16:44:19.912 F/MOZ_CRASH(10818): Hit MOZ_CRASH(long src memcpy) at
> /builds/slave/try-and-api-11-d-0000000000000/build/gfx/layers/client/
> TiledContentClient.cpp:994
> 
> with the debug build from b80e42c86578. Happy to try other builds, eg with
> the suggestion in comment #24.

Any chance you can get this in this in debugger and print all of the arguments to ensure_memcpy?
Otherwise, I've pushed a build that should print them to stderr.
https://tbpl.mozilla.org/?tree=Try&rev=64edf7b05d0d
(In reply to Jeff Muizelaar [:jrmuizel] from comment #39)
> Any chance you can get this in this in debugger and print all of the
> arguments to ensure_memcpy?

Sorry, I don't have that sort of dev environment handy or any experience with it.
Here's a build that should actually build:
https://tbpl.mozilla.org/?tree=Try&rev=cacc4c65ab2b
I have a Galaxy Nexus now that reproduces this bug. Jeff and Bas helped me determine what's going on on IRC. The cairo format (RGB16) does not agree with the SurfaceFormat of the DrawTarget (BGRX). This results in us having a bogus stride and overrunning things.
Assignee: esawin → snorp
I don't know if this fixes all of the possible issues with using the content as the surface format, but it does fix this bug.
Attachment #8537940 - Flags: review?(jmuizelaar)
Attachment #8537940 - Flags: review?(bas)
Some cleanup
Attachment #8537940 - Attachment is obsolete: true
Attachment #8537940 - Flags: review?(jmuizelaar)
Attachment #8537940 - Flags: review?(bas)
Attachment #8537942 - Flags: review?(jmuizelaar)
Attachment #8537942 - Flags: review?(bas)
Attachment #8537942 - Flags: review?(jmuizelaar) → review+
Comment on attachment 8537942 [details] [diff] [review]
Use the cairo surface format to determine gfx format when able

Approval Request Comment
[Feature/regressing bug #]: bug 1023473
[User impact if declined]: crashes
[Describe test coverage new/current, TBPL]: local testing and CI
[Risks and why]: low
[String/UUID change made/needed]: none

It looks like we already WONTFIXed this for 34, but thought I'd give it a shot.
Attachment #8537942 - Flags: approval-mozilla-release?
Attachment #8537942 - Flags: approval-mozilla-beta?
Attachment #8537942 - Flags: approval-mozilla-aurora?
The inbound build on cc7b14ff7bdb is working great for me.
https://hg.mozilla.org/mozilla-central/rev/cc7b14ff7bdb
Status: REOPENED → RESOLVED
Closed: 5 years ago5 years ago
Resolution: --- → FIXED
Attachment #8537942 - Flags: review?(bas) → review+
Attachment #8537942 - Flags: approval-mozilla-beta?
Attachment #8537942 - Flags: approval-mozilla-beta+
Attachment #8537942 - Flags: approval-mozilla-aurora?
Attachment #8537942 - Flags: approval-mozilla-aurora+
(In reply to James Willcox (:snorp) (jwillcox@mozilla.com) from comment #48)
> It looks like we already WONTFIXed this for 34, but thought I'd give it a
> shot.

If this is confirmed to fix the Visit* crashes on ICS-based B2G devices, we definitely should try to get it into FxOS 2.1 as the ICS-based Geeksphones are still the most reliable crash reporting devices for B2G next to the Flames.
(In reply to Robert Kaiser (:kairo@mozilla.com) from comment #51)
> (In reply to James Willcox (:snorp) (jwillcox@mozilla.com) from comment #48)
> > It looks like we already WONTFIXed this for 34, but thought I'd give it a
> > shot.
> 
> If this is confirmed to fix the Visit* crashes on ICS-based B2G devices, we
> definitely should try to get it into FxOS 2.1 as the ICS-based Geeksphones
> are still the most reliable crash reporting devices for B2G next to the
> Flames.

I think we probably use gralloc-backed tiles everywhere on B2G, so this cairo-related thing should not affect it. Sotaro can you confirm that?
Flags: needinfo?(sotaro.ikeda.g)
(In reply to James Willcox (:snorp) (jwillcox@mozilla.com) from comment #52)
> (In reply to Robert Kaiser (:kairo@mozilla.com) from comment #51)
> > (In reply to James Willcox (:snorp) (jwillcox@mozilla.com) from comment #48)
> > > It looks like we already WONTFIXed this for 34, but thought I'd give it a
> > > shot.
> > 
> > If this is confirmed to fix the Visit* crashes on ICS-based B2G devices, we
> > definitely should try to get it into FxOS 2.1 as the ICS-based Geeksphones
> > are still the most reliable crash reporting devices for B2G next to the
> > Flames.
> 
> I think we probably use gralloc-backed tiles everywhere on B2G, so this
> cairo-related thing should not affect it. Sotaro can you confirm that?

Whether we use gralloc or not should be unrelated to whether this crashes. The crashes can happen anywhere we use cairo and 565.
(In reply to James Willcox (:snorp) (jwillcox@mozilla.com) from comment #52)
> > 
> > If this is confirmed to fix the Visit* crashes on ICS-based B2G devices, we
> > definitely should try to get it into FxOS 2.1 as the ICS-based Geeksphones
> > are still the most reliable crash reporting devices for B2G next to the
> > Flames.
> 
> I think we probably use gralloc-backed tiles everywhere on B2G, so this
> cairo-related thing should not affect it. Sotaro can you confirm that?

For tiling buffer allocation, TextureClient::CreateForDrawing() is used. On b2g, it first try to allocate TextureClient, if the allocation failed, then it try to fallback to BufferTextureClient allocation(ShmemTextureClient/MemoryTextureClient).

Therefore, if the device is out of gralloc buffer situation, BufferTextureClient is used for it.

ICS device is very easy to become out of gralloc. Since JB, android supports ion memory and out of gralloc becomes less often. geekphone is most easily become out of gralloc situation even within b2g ics devices. geekphone uses very old codeaurora code(ics_chocolate) that is not provided for b2g support. All other b2g ics devices uses newer code (ics_strawberry) that has support for b2g by codeaurora. Therefore, geekphone has a lot of problems related to graphics and media.


Bug 1036905 is one example bug that is related to out of gralloc on geekphone.
(In reply to Jeff Muizelaar [:jrmuizel] from comment #53)
> (In reply to James Willcox (:snorp) (jwillcox@mozilla.com) from comment #52)
> > (In reply to Robert Kaiser (:kairo@mozilla.com) from comment #51)
> > > (In reply to James Willcox (:snorp) (jwillcox@mozilla.com) from comment #48)
> > > > It looks like we already WONTFIXed this for 34, but thought I'd give it a
> > > > shot.
> > > 
> > > If this is confirmed to fix the Visit* crashes on ICS-based B2G devices, we
> > > definitely should try to get it into FxOS 2.1 as the ICS-based Geeksphones
> > > are still the most reliable crash reporting devices for B2G next to the
> > > Flames.
> > 
> > I think we probably use gralloc-backed tiles everywhere on B2G, so this
> > cairo-related thing should not affect it. Sotaro can you confirm that?
> 
> Whether we use gralloc or not should be unrelated to whether this crashes.
> The crashes can happen anywhere we use cairo and 565.

Oops, that's exactly right. We probably do want this on B2G then. What determines whether or not we use 565? Screen depth, right?
Comment on attachment 8537942 [details] [diff] [review]
Use the cairo surface format to determine gfx format when able

NOTE: Please see https://wiki.mozilla.org/Release_Management/B2G_Landing to better understand the B2G approval process and landings.

[Approval Request Comment]
Bug caused by (feature/regressing bug #): bug 1023473
User impact if declined: crashes
Testing completed: m-c, try
Risk to taking this patch (and alternatives if risky): low
String or UUID changes made by this patch: none
Flags: needinfo?(sotaro.ikeda.g)
Attachment #8537942 - Flags: approval-mozilla-b2g34?
https://hg.mozilla.org/releases/mozilla-aurora/rev/1538e452e408
https://hg.mozilla.org/releases/mozilla-beta/rev/bfc940387e3f

[Blocking Requested - why for this release]:
Based on the recent comments, this sounds like a very-real possibility for causing crashes on v2.0 and v2.1.
Comment on attachment 8537942 [details] [diff] [review]
Use the cairo surface format to determine gfx format when able

We do indeed need this on b2g32 as well.
Attachment #8537942 - Flags: approval-mozilla-b2g32?
This may cut down on a number of large crashes involving Visit*; I'm not sure if there's a risk for any of our partners to take this patch.
Flags: needinfo?(bbajaj)
Removing the steps-wanted tag as the cause of this issue is known now.
Keywords: steps-wanted
Duplicate of this bug: 1107898
Duplicate of this bug: 1107931
Duplicate of this bug: 1108451
(In reply to Naoki Hirata :nhirata (please use needinfo instead of cc) from comment #59)
> This may cut down on a number of large crashes involving Visit*; I'm not
> sure if there's a risk for any of our partners to take this patch.

We can certainly backport given the low-risk and land on our trees. That way any future builds that are created by partners from this branch could avoid this. Can you help monitor the crash-reports that we are seeing or see if there is any way to reproduce and verify the fix here?
Flags: needinfo?(bbajaj)
We have regular very visible crash reports from Geeksphones on those signatures and GP creates new builds from those branches daily (I think even for 2.0 still, but surely for 2.1 and 2.2) so we will probably see just from stats within a week if the crashes still happen.

It's so nice when the affected devices actually have regular updates and working crash reporting, I'd so love to see that for more FxOS devices...
Flags: needinfo?(nhirata.bugzilla)
Comment on attachment 8537942 [details] [diff] [review]
Use the cairo surface format to determine gfx format when able

Approving as this is a low risk patch helping with stabiliy. QA will help monitor the crash data post landing.
Attachment #8537942 - Flags: approval-mozilla-b2g34?
Attachment #8537942 - Flags: approval-mozilla-b2g34+
Attachment #8537942 - Flags: approval-mozilla-b2g32?
Attachment #8537942 - Flags: approval-mozilla-b2g32+
blocking-b2g: 2.0? → 2.0+
Waiting for things to land on 2.1; visit crash doesn't seem to be happening on 2.2 any more.
Duplicate of this bug: 1097571
(In reply to Robert Kaiser (:kairo@mozilla.com) from comment #65)
> We have regular very visible crash reports from Geeksphones on those
> signatures and GP creates new builds from those branches daily (I think even
> for 2.0 still, but surely for 2.1 and 2.2) so we will probably see just from
> stats within a week if the crashes still happen.
> 
> It's so nice when the affected devices actually have regular updates and
> working crash reporting, I'd so love to see that for more FxOS devices...

Is there a bug on file for that? Having bought a commercially available device down here in Mexico that it's still stuck on 1.1 it's a real PITA...
Flags: needinfo?(kairo)
(In reply to alex_mayorga from comment #71)
> Is there a bug on file for that? Having bought a commercially available
> device down here in Mexico that it's still stuck on 1.1 it's a real PITA...

Not sure what you mean, you'd need to file bugs against the partners that deliver those phones. There is nothing Mozilla can do there unfortunately. What I said is that it's nice that we at least the Geeksphones that actually do have current versions available so that we do get data from that.
Flags: needinfo?(kairo)
Hmm, on Android, this signature went down from ~3.5% in 35.0b4 and 35.0b6 to ~2% in 35.0b8, but it's not gone completely. I guess we'll need to clone this bug to a new one for the remaining crashes.

On B2G, I haven't seen the Visit* signatures in any build after this landed (12/22 on 2.0 and 2.1) but given holiday time, we should monitor for a bit longer until we can be sure.
I just saw this crash twice in rapid succession on http://178.175.135.122/ (Pirate Bay countdown) with Beta 35 on Android 5 (Nexus 4).
(In reply to Dirkjan Ochtman (:djc) from comment #74)
> I just saw this crash twice in rapid succession on http://178.175.135.122/
> (Pirate Bay countdown) with Beta 35 on Android 5 (Nexus 4).

If that's the latest beta and this is reproducable, can you please file a new bug (ideally via the "Report this bug in ... Core" link of one of your crash reports) and refernece that new bug here? We seem to have fixed something here but it looks like there's more cases we need to look into.
Flags: needinfo?(dirkjan)
Done, see bug 1117338.
Flags: needinfo?(dirkjan)
Flags: needinfo?(nhirata.bugzilla)
Comment on attachment 8537942 [details] [diff] [review]
Use the cairo surface format to determine gfx format when able

Now that we've shipped 35.0 there's no need for this to be approved for mozilla-release.
Attachment #8537942 - Flags: approval-mozilla-release?
Duplicate of this bug: 1045122
Duplicate of this bug: 1043847
Hi James,
The problem still exist on 2.0M branch per bug 1126644. Could you please help to check again? Thanks!
Blocks: Woodduck
Flags: needinfo?(snorp)
See Also: → 1126644
Looks like some good work in bug 1126644 finding out what's going on here
Flags: needinfo?(snorp)
This is back as had 2 crashes today
Status: RESOLVED → REOPENED
Resolution: FIXED → ---
Please file a new bug with details. Tracking gecko versions for patches gets confusing if bugs are reopened like this.
Status: REOPENED → RESOLVED
Closed: 5 years ago5 years ago
Resolution: --- → FIXED
You need to log in before you can comment on or make changes to this bug.