Closed Bug 1098962 Opened 10 years ago Closed 9 years ago

Android 4.0 robocop-1 has an unacceptable retry rate

Categories

(Firefox for Android Graveyard :: Testing, defect)

defect
Not set
normal

Tracking

(Not tracked)

RESOLVED FIXED

People

(Reporter: philor, Assigned: gbrown)

References

Details

Assignee: nobody → gbrown
Summary: robocop-1 has an unacceptable retry rate → Android 4.0 robocop-1 has an unacceptable retry rate
It looks like this has been happening all week. The first obvious rev on inbound is 6c2fb5df16ba (Fri Nov 7 11:17:34 2014 PST) but there may be earlier failures too: 9f896cb289a9, 33d0247e5290.

Is this another instance of "Pandas reboot when test job runs for too long"?
See Also: → 869030
(In reply to Geoff Brown [:gbrown] from comment #1)
> Is this another instance of "Pandas reboot when test job runs for too long"?

Yes, I think so.

Logcats for the failures often show something like:

11-14 09:56:10.968 I/Process ( 1401): Sending signal. PID: 1401 SIG: 3
11-14 09:56:10.968 I/dalvikvm( 1401): threadid=3: reacting to signal 3
11-14 09:56:11.007 I/dalvikvm( 1401): Wrote stack traces to '/data/anr/traces.txt'
11-14 09:56:20.742 W/WindowManager( 1401): Input event injection timed out.
11-14 09:56:41.015 I/Process ( 1401): Sending signal. PID: 1401 SIG: 3
11-14 09:56:41.015 I/dalvikvm( 1401): threadid=3: reacting to signal 3
11-14 09:56:41.046 I/dalvikvm( 1401): Wrote stack traces to '/data/anr/traces.txt'
11-14 09:56:41.046 I/Process ( 1401): Sending signal. PID: 1579 SIG: 3
11-14 09:56:41.046 I/dalvikvm( 1579): threadid=3: reacting to signal 3
11-14 09:56:41.054 I/dalvikvm( 1579): Wrote stack traces to '/data/anr/traces.txt'
11-14 09:56:43.132 I/Watchdog_N( 1401): dumpKernelStacks
11-14 09:56:43.210 D/dalvikvm( 1401): GC_FOR_ALLOC freed 299K, 28% free 9239K/12679K, paused 46ms
11-14 09:56:43.281 D/dalvikvm( 1401): GC_CONCURRENT freed 31K, 28% free 9256K/12679K, paused 4ms+20ms
11-14 09:56:43.343 D/dalvikvm( 1401): GC_FOR_ALLOC freed 97K, 28% free 9158K/12679K, paused 56ms
11-14 09:56:43.343 I/dalvikvm-heap( 1401): Grow heap (frag case) to 9.211MB for 197928-byte allocation
11-14 09:56:43.375 D/dalvikvm( 1401): GC_FOR_ALLOC freed 0K, 28% free 9351K/12935K, paused 54ms
11-14 09:56:43.468 D/dalvikvm( 1401): GC_FOR_ALLOC freed 266K, 30% free 9182K/12935K, paused 55ms
11-14 09:56:45.171 I/Process ( 1401): Sending signal. PID: 1401 SIG: 9
11-14 09:56:45.171 W/Watchdog( 1401): *** WATCHDOG KILLING SYSTEM PROCESS: com.android.server.PowerManagerService

approximately 30 minutes after the start of the test.

Here's a try run which simulates running Android 4.0 Robocop as 7 chunks instead of 5:

https://treeherder.mozilla.org/ui/#/jobs?repo=try&revision=df79bab7f533

There are only a few retries, and they only happen in cleanupDevice -- not during the test job.

Compare that to the same revision run as 5 chunks: https://treeherder.mozilla.org/ui/#/jobs?repo=try&revision=dd048eae7f2a
Depends on: 1099306
btw, in the above log pid 1401 is "system" and 1579 is "com.android.phone" -- that pattern appears to hold for all the logs I have examined.
The ANR report in the log for one of the retries[0] shows a deadlock in the PowerManagerService thread.

10:39:04     INFO -  "PowerManagerService" prio=5 tid=20 MONITOR
10:39:04     INFO -    | group="main" sCount=1 dsCount=0 obj=0x41398f60 self=0x199f668
10:39:04     INFO -    | sysTid=1428 nice=0 sched=0/0 cgrp=[fopen-error:2] handle=26764208
10:39:04     INFO -    | schedstat=( 0 0 0 ) utm=3 stm=3 core=1
10:39:04     INFO -    at com.android.server.PowerManagerService$TimeoutTask.run(PowerManagerService.java:~1298)
10:39:04     INFO -    - waiting to lock <0x41209160> (a com.android.server.PowerManagerService$LockList) held by tid=26 (InputDispatcher)
10:39:04     INFO -    at android.os.Handler.handleCallback(Handler.java:605)
10:39:04     INFO -    at android.os.Handler.dispatchMessage(Handler.java:92)
10:39:04     INFO -    at android.os.Looper.loop(Looper.java:137)
10:39:04     INFO -    at android.os.HandlerThread.run(HandlerThread.java:60)

It says the lock is held by InputDispatcher, whose stack looks like this:

10:39:04     INFO -  "InputDispatcher" prio=10 tid=26 MONITOR
10:39:04     INFO -    | group="main" sCount=1 dsCount=0 obj=0x4132fa70 self=0x19c3890
10:39:04     INFO -    | sysTid=1434 nice=-8 sched=0/0 cgrp=[fopen-error:2] handle=26378376
10:39:04     INFO -    | schedstat=( 0 0 0 ) utm=125 stm=171 core=1
10:39:04     INFO -    at com.android.server.am.BatteryStatsService.noteUserActivity(BatteryStatsService.java:~189)
10:39:04     INFO -    - waiting to lock <0x4120ee28> (a com.android.internal.os.BatteryStatsImpl) held by tid=13 (ProcessStats)
10:39:04     INFO -    at com.android.server.PowerManagerService.userActivity(PowerManagerService.java:2419)
10:39:04     INFO -    at com.android.server.PowerManagerService.userActivity(PowerManagerService.java:2361)
10:39:04     INFO -    at dalvik.system.NativeStart.run(Native Method)

So that one is waiting on some lock held by ProcessStats:

0:39:04     INFO -  "ProcessStats" prio=5 tid=13 MONITOR
10:39:04     INFO -    | group="main" sCount=1 dsCount=0 obj=0x4127d4b8 self=0x199d230
10:39:04     INFO -    | sysTid=1419 nice=0 sched=0/0 cgrp=[fopen-error:2] handle=26431952
10:39:04     INFO -    | schedstat=( 0 0 0 ) utm=1 stm=0 core=1
10:39:04     INFO -    at com.android.server.am.ActivityManagerService.broadcastIntent(ActivityManagerService.java:~12504)
10:39:04     INFO -    - waiting to lock <0x4120b2a0> (a com.android.server.am.ActivityManagerService) held by tid=21 (AlarmManager)
10:39:04     INFO -    at android.app.ContextImpl.sendBroadcast(ContextImpl.java:918)
10:39:04     INFO -    at com.android.server.DropBoxManagerService.add(DropBoxManagerService.java:236)
10:39:04     INFO -    at android.os.DropBoxManager.addText(DropBoxManager.java:272)
10:39:04     INFO -    at com.android.server.am.ActivityManagerService$11.run(ActivityManagerService.java:7704)
10:39:04     INFO -    at com.android.server.am.ActivityManagerService.addErrorToDropBox(ActivityManagerService.java:7709)
10:39:04     INFO -    at com.android.server.am.ActivityManagerService.handleApplicationWtf(ActivityManagerService.java:7522)
10:39:04     INFO -    at com.android.internal.os.RuntimeInit.wtf(RuntimeInit.java:345)
10:39:04     INFO -    at android.util.Log$1.onTerribleFailure(Log.java:103)
10:39:04     INFO -    at android.util.Log.wtf(Log.java:278)
10:39:04     INFO -    at com.android.internal.os.BatteryStatsImpl.getNetworkStatsDetailGroupedByUid(BatteryStatsImpl.java:5738)
10:39:04     INFO -    at com.android.internal.os.BatteryStatsImpl.access$100(BatteryStatsImpl.java:76)
10:39:04     INFO -    at com.android.internal.os.BatteryStatsImpl$Uid.computeCurrentTcpBytesReceived(BatteryStatsImpl.java:2457)
10:39:04     INFO -    at com.android.internal.os.BatteryStatsImpl$Uid.getTcpBytesReceived(BatteryStatsImpl.java:2446)
10:39:04     INFO -    at com.android.internal.os.BatteryStatsImpl.writeSummaryToParcel(BatteryStatsImpl.java:5437)
10:39:04     INFO -    at com.android.internal.os.BatteryStatsImpl.writeLocked(BatteryStatsImpl.java:4836)
10:39:04     INFO -    at com.android.internal.os.BatteryStatsImpl.writeAsyncLocked(BatteryStatsImpl.java:4818)
10:39:04     INFO -    at com.android.server.am.ActivityManagerService.updateCpuStatsNow(ActivityManagerService.java:1652)
10:39:04     INFO -    at com.android.server.am.ActivityManagerService$3.run(ActivityManagerService.java:1534)

That's waiting on a lock held by AlarmManager:

10:39:04     INFO -  "AlarmManager" prio=5 tid=21 MONITOR
10:39:04     INFO -    | group="main" sCount=1 dsCount=0 obj=0x413271f8 self=0x196ac88
10:39:04     INFO -    | sysTid=1429 nice=0 sched=0/0 cgrp=[fopen-error:2] handle=26742432
10:39:04     INFO -    | schedstat=( 0 0 0 ) utm=3 stm=1 core=1
10:39:04     INFO -    at com.android.server.am.ActivityManagerService.retrieveServiceLocked(ActivityManagerService.java:~10563)
10:39:04     INFO -    - waiting to lock <0x4120ee28> (a com.android.internal.os.BatteryStatsImpl) held by tid=13 (ProcessStats)
10:39:04     INFO -    at com.android.server.am.ActivityManagerService.startServiceLocked(ActivityManagerService.java:11096)
10:39:04     INFO -    at com.android.server.am.ActivityManagerService.startServiceInPackage(ActivityManagerService.java:11149)
10:39:04     INFO -    at com.android.server.am.PendingIntentRecord.sendInner(PendingIntentRecord.java:260)
10:39:04     INFO -    at com.android.server.am.PendingIntentRecord.send(PendingIntentRecord.java:182)
10:39:04     INFO -    at android.app.PendingIntent.send(PendingIntent.java:509)
10:39:04     INFO -    at android.app.PendingIntent.send(PendingIntent.java:463)
10:39:04     INFO -    at com.android.server.AlarmManagerService$AlarmThread.run(AlarmManagerService.java:665)

AlarmManager is waiting on ProcessStats. And now we're deadlocked.

[0] http://ftp.mozilla.org/pub/mozilla.org/mobile/try-builds/gbrown@mozilla.com-dd048eae7f2a/try-android/try_panda_android_test-robocop-1-bm100-tests1-panda-build2884.txt.gz
It looks like we could work around this specific instance by avoiding the 'wtf' error in BatteryStatsImpl (ProcessStats thread above)
If we're able to make changes to the Android build, commenting out this line should fix the deadlock described above: http://androidxref.com/4.0.4/xref/frameworks/base/core/java/com/android/internal/os/BatteryStatsImpl.java#5738
See Also: → 883966
Android 4.0 Robocop is running as 7 chunks now and the frequent retries on rc1 are resolved.

Let's keep this bug open for now while we sort out a permanent solution (follow-up on Comments 5 - 8).
We have a new instance of this problem, and a solution on the way, in bug 1142553.

Even so, given the declining importance and anticipated obsolescence of the pandaboards, I am reluctant to follow-up on Comments 5 - 8.
Status: NEW → RESOLVED
Closed: 9 years ago
Resolution: --- → FIXED
Product: Firefox for Android → Firefox for Android Graveyard
You need to log in before you can comment on or make changes to this bug.