Closed Bug 763166 Opened 8 years ago Closed 8 years ago

crash in mozilla::AndroidGeckoLayerClient::SetFirstPaintViewport

Categories

(Core :: Widget: Android, defect, critical)

15 Branch
ARM
Android
defect
Not set
critical

Tracking

()

RESOLVED FIXED
mozilla17
Tracking Status
firefox15 + fixed
firefox16 + verified
firefox17 --- fixed

People

(Reporter: scoobidiver, Assigned: kats)

References

Details

(4 keywords, Whiteboard: [native-crash][startupcrash])

Crash Data

Attachments

(2 files, 1 obsolete file)

There are only two crashes in 14.0b5 but there's a spike in startup crashes beginning from 16.0a1/20120606. The regression range for the spike is:
http://hg.mozilla.org/mozilla-central/pushloghtml?fromchange=a7a905fd70d5&tochange=6338a8988917
It's likely a regression from bug 758361 which was uplifted to Aurora.

Signature 	_JNIEnv::CallVoidMethod | mozilla::AndroidGeckoLayerClient::SetFirstPaintViewport More Reports Search
UUID	4a54c3c6-09ec-4376-b96b-87aa02120608
Date Processed	2012-06-08 21:40:28
Uptime	28
Last Crash	3.7 weeks before submission
Install Age	9.2 minutes since version was first installed.
Install Time	2012-06-08 21:31:13
Product	FennecAndroid
Version	16.0a1
Build ID	20120608030525
Release Channel	nightly
OS	Linux
OS Version	0.0.0 Linux 3.0.8+ #9 SMP PREEMPT Wed May 2 18:19:29 CEST 2012 armv7l
Build Architecture	arm
Build Architecture Info	
Crash Reason	SIGSEGV
Crash Address	0x0
App Notes 	
AdapterVendorID: archos, AdapterDeviceID: ARCHOS 80G9.
AdapterDescription: 'Model: 'ARCHOS 80G9', Product: 'G9A80', Manufacturer: 'archos', Hardware: 'archos''.
archos ARCHOS 80G9
archos/G9A80/A80:4.0.3/MR1/20120420.190002:user/release-keys
EMCheckCompatibility	True

Frame 	Module 	Signature 	Source
0 	libdvm.so 	libdvm.so@0x57a70 	
1 	libdvm.so 	libdvm.so@0x57a4f 	
2 	libxul.so 	_JNIEnv::CallVoidMethod 	jni.h:631
3 	libxul.so 	mozilla::AndroidGeckoLayerClient::SetFirstPaintViewport 	widget/android/AndroidJavaWrappers.cpp:664
4 	libxul.so 	mozilla::AndroidBridge::SetFirstPaintViewport 	widget/android/AndroidBridge.cpp:2047
5 	libxul.so 	mozilla::layers::CompositorParent::SetFirstPaintViewport 	gfx/layers/ipc/CompositorParent.cpp:429
6 	libxul.so 	mozilla::layers::CompositorParent::TransformShadowTree 	gfx/layers/ipc/CompositorParent.cpp:374
7 	libxul.so 	mozilla::layers::CompositorParent::Composite 	gfx/layers/ipc/CompositorParent.cpp:260
8 	libxul.so 	RunnableMethod<mozilla::layers::CompositorParent, void , Tuple0>::Run 	ipc/chromium/src/base/tuple.h:383
9 	libxul.so 	MessageLoop::RunTask 	ipc/chromium/src/base/message_loop.cc:318
10 	libxul.so 	MessageLoop::DeferOrRunPendingTask 	ipc/chromium/src/base/message_loop.cc:326
11 	libxul.so 	MessageLoop::DoWork 	ipc/chromium/src/base/message_loop.cc:426
12 	libxul.so 	base::MessagePumpDefault::Run 	ipc/chromium/src/base/message_pump_default.cc:23
13 	libxul.so 	MessageLoop::RunInternal 	ipc/chromium/src/base/message_loop.cc:208
14 	libxul.so 	MessageLoop::Run 	ipc/chromium/src/base/message_loop.cc:201
15 	libxul.so 	base::Thread::ThreadMain 	ipc/chromium/src/base/thread.cc:156
16 	libxul.so 	ThreadFunc 	ipc/chromium/src/base/platform_thread_posix.cc:31
17 	libc.so 	libc.so@0x12c4e 	
18 	libc.so 	libc.so@0x127a2 	
19 	libEGL.so 	libEGL.so@0x23e82 

More reports at:
https://crash-stats.mozilla.com/report/list?signature=_JNIEnv%3A%3ACallVoidMethod+|+mozilla%3A%3AAndroidGeckoLayerClient%3A%3ASetFirstPaintViewport
https://crash-stats.mozilla.com/report/list?signature=_JNIEnv%3A%3ACallVoidMethod
https://crash-stats.mozilla.com/report/list?signature=JNI_CreateJavaVM+|+_JNIEnv%3A%3ACallVoidMethod+|+mozilla%3A%3AAndroidGeckoLayerClient%3A%3ASetFirstPaintViewport
As 15.0a2 is unaffected, it's not caused by bug 758361.
No longer blocks: 758361
Version: 15 Branch → 16 Branch
Given that this crash seems like a race condition on startup, I would guess that this change was the one that caused the spike in 16.0:

d945ae198516	Mike Hommey — Bug 760152 - Start library decompression earlier. r=blassey

Just a guess, though.
Setting as top crash : 6th on nightly, does not seem to appear in aurora, or beta
CC'ing :glandium. Does bug 760152 only do library decompression earlier? Does it do any other gecko startup earlier?
(In reply to Kartikaya Gupta (:kats) from comment #4)
> CC'ing :glandium. Does bug 760152 only do library decompression earlier?
> Does it do any other gecko startup earlier?

Gecko's actual initialization has not moved. Only decompression did.
https://tbpl.mozilla.org/php/getParsedLog.php?id=12818266&tree=Mozilla-Inbound
Whiteboard: [native-crash][startupcrash] → [native-crash][startupcrash][orange]
Blocks: 438871
Crash Signature: [@ _JNIEnv::CallVoidMethod | mozilla::AndroidGeckoLayerClient::SetFirstPaintViewport] [@ _JNIEnv::CallVoidMethod] [@ JNI_CreateJavaVM | _JNIEnv::CallVoidMethod | mozilla::AndroidGeckoLayerClient::SetFirstPaintViewport] → [@ _JNIEnv::CallVoidMethod | mozilla::AndroidGeckoLayerClient::SetFirstPaintViewport ] [@ _JNIEnv::CallVoidMethod ] [@ JNI_CreateJavaVM | _JNIEnv::CallVoidMethod | mozilla::AndroidGeckoLayerClient::SetFirstPaintViewport ]
tracking-fennec: --- → ?
Assignee: nobody → bugmail.mozilla
This should be sufficient to fix this crash, assuming the problem is in fact the GetJNIForThread call. I can kill the GetJNIForThread() more in a separate bug.
Attachment #641598 - Flags: review?(blassey.bugs)
Comment on attachment 641598 [details] [diff] [review]
Add a GetJNIForCompositorThread and make the compositor JNI invocations use it

Review of attachment 641598 [details] [diff] [review]:
-----------------------------------------------------------------

::: widget/android/AndroidBridge.h
@@ +132,5 @@
> +    static JNIEnv* GetJNIForCompositorThread() {
> +        if (NS_LIKELY(sBridge)) {
> +            if (sBridge->mCompositorThread) {
> +                if ((void*)pthread_self() != sBridge->mCompositorThread) {
> +                    __android_log_print(ANDROID_LOG_ERROR, "AndroidBridge", "Non-compositor thread calling GetJNIForCompositorThread!");

this should probably be a NS_ABORT_IF_FALSE()

@@ +145,5 @@
> +            MutexAutoLock lock(sBridge->mCompositorJNICreationMutex);
> +
> +            if (sBridge->mCompositorThread) {
> +                // this means that another thread executed this function between the time we started executing
> +                // it and the time we acquired the mutex. fail.

this also means we have two threads that claim to be the compositor thread, which sounds like a disaster. Let's abort here too
Attachment #641598 - Flags: review?(blassey.bugs) → review+
Backed out on suspicion of making all the android talos tests go red. Nothing useful in the log though.

https://hg.mozilla.org/integration/mozilla-inbound/rev/f3215d2dc286
(In reply to comment #27)
> Backed out on suspicion of making all the android talos tests go red. Nothing
> useful in the log though.
> 
> https://hg.mozilla.org/integration/mozilla-inbound/rev/f3215d2dc286

The backout seems to have fixed it.
Assignee: bugmail.mozilla → blassey.bugs
Attachment #641598 - Attachment is obsolete: true
Attachment #642940 - Flags: review?(bugmail.mozilla)
Comment on attachment 642940 [details] [diff] [review]
patch to actually use tls

Review of attachment 642940 [details] [diff] [review]:
-----------------------------------------------------------------

r+ with changing the if (status < 0) to if (status) as well.
Attachment #642940 - Flags: review?(bugmail.mozilla) → review+
https://hg.mozilla.org/mozilla-central/rev/2ee313f65ccf
Status: NEW → RESOLVED
Closed: 8 years ago
Resolution: --- → FIXED
14.0.1 and 15.0 Beta are affected.
Let's restrict this bug to top crashers.
It's #3 top crasher in 15.0b2. The Beta regression range is:
http://hg.mozilla.org/releases/mozilla-beta/pushloghtml?fromchange=695042299ec9&tochange=166ba24e4239
Bug 760152 is the only bug that belongs to the two regression windows.
(In reply to Mike Hommey [:glandium] from comment #5)
> (In reply to Kartikaya Gupta (:kats) from comment #4)
> > CC'ing :glandium. Does bug 760152 only do library decompression earlier?
> > Does it do any other gecko startup earlier?
> 
> Gecko's actual initialization has not moved. Only decompression did.

actually, one thing moved: static initialization. Is there any static initialization that calls back into java?
Blocks: 778954
(In reply to Mike Hommey [:glandium] from comment #37)
> actually, one thing moved: static initialization. Is there any static
> initialization that calls back into java?

I'm not aware of any offhand, but it's quite possible that there is. Looking at the code again, I found something that might be the problem here. Patch coming in a sec.
I think I'd rather we just backed out bug 760152 on branches at this point, since it didn't have any obvious perf gains. Is that riskier?
Doing the backout is probably the better option. I don't know if my patch will actually fix this or not yet. If it does we can re-land it with the fix.
(In reply to Kartikaya Gupta (:kats) from comment #43)
> Doing the backout is probably the better option. I don't know if my patch
> will actually fix this or not yet. If it does we can re-land it with the fix.

Thanks, sending over to glandium in that case.
Assignee: blassey.bugs → mh+mozilla
Comment on attachment 647566 [details] [diff] [review]
Remove race condition from AndroidBridge initialization

New try run is at https://tbpl.mozilla.org/?tree=Try&rev=e3e0e2f7ec4f (old one failed to build because i used a busted inbound changeset as a base). It looks about as green as android usually does, so requesting review.
Attachment #647566 - Flags: review?(snorp)
Attachment #647566 - Flags: review?(snorp) → review+
Attachment #642940 - Flags: checkin+
Attachment #647566 - Flags: checkin+
https://hg.mozilla.org/mozilla-central/rev/87736e458d15
Status: REOPENED → RESOLVED
Closed: 8 years ago8 years ago
Resolution: --- → FIXED
There are still crashes in the trunk.
Status: RESOLVED → REOPENED
Resolution: FIXED → ---
(In reply to Scoobidiver from comment #49)
> There are still crashes in the trunk.

Do you have evidence of that? There are no crashes I can see with a build id from a build after the landing:
https://crash-stats.mozilla.com/report/list?range_value=7&range_unit=days&date=2012-08-02&signature=_JNIEnv%3A%3ACallVoidMethod%20|%20mozilla%3A%3AAndroidGeckoLayerClient%3A%3ASetFirstPaintViewport&version=FennecAndroid%3A17.0a1
There are no crashes after 16.0a2/20120801, i.e. the backout of bug 760152.
https://crash-stats.mozilla.com/report/index/1fd05d7b-0f24-4063-9a84-ee7782120806
https://crash-stats.mozilla.com/report/index/e4673839-1a8f-4f07-ab22-956902120806

Note to clarify.. that's only 3 report that shows on/after 8/1 out of 2,838+634+1 Crash Reports.  Only 1 report after 8/1.

Should we close this out?  Or keep this and mark as a lower priority?  People just need to update it seems.
(In reply to Naoki Hirata :nhirata from comment #54)
> Should we close this out?  Or keep this and mark as a lower priority? 
No. When 17 Branch will go to Aurora, it will be #3 top crasher in 17.0a2 like it was in 16.0a2.
Bug 760152 was backed out of Beta 15 yesterday.
(In reply to Naoki Hirata :nhirata from comment #54)
> Should we close this out?  Or keep this and mark as a lower priority? 
> People just need to update it seems.

No. This is still occurring on trunk.
Assigning to kats after speaking to glandium.
Assignee: mh+mozilla → bugmail.mozilla
It looks like all of the recent logs are from mochitest runs -- why would that be?
Kats, do you have any ideas for this?
Unfortunately, no. I've looked through the code multiple times and don't see any other problems that could cause this.
(In reply to Ed Morley [:edmorley] from comment #101)
> Kats, do you have any ideas for this?

Actually, we should back out bug 760152 on 17 as well. Backing that out seems to have fixed the error completely on 15 and 16, so it should help significantly on 17. Note that this error was happening even before bug 760152 landed (albeit at a very low crash volume) so I don't expect it to disappear completely, unless one of the other patches on this bug did actually fix something. (i.e. bug 760152 might have masked the original problem, which was fixed by one of the patches on this bug).

I posted a comment on bug 760152 requesting it be backed out on 17 as well.
Thank you :-)
I pushed the backout of bug 760152 to inbound so hopefully this should stop happening as that propagates to the various trees. *fingers crossed*
This has stopped after the build from the 16th, so the backout worked. Should we mark this bug fixed?
Sounds good to me.
Status: REOPENED → RESOLVED
Closed: 8 years ago8 years ago
Resolution: --- → FIXED
Whiteboard: [native-crash][startupcrash][orange] → [native-crash][startupcrash]
tracking-fennec: ? → ---
You need to log in before you can comment on or make changes to this bug.