Persona is no longer an option for authentication on BMO. For more details see Persona Deprecated.
Last Comment Bug 811444 - android panda boards magically reboot in the middle of the test
: android panda boards magically reboot in the middle of the test
Product: Testing
Classification: Components
Component: General (show other bugs)
: unspecified
: ARM Android
: -- normal (vote)
: mozilla20
Assigned To: Joel Maher ( :jmaher)
Depends on:
Blocks: android_4.0_testing 815726 850756
  Show dependency treegraph
Reported: 2012-11-13 12:52 PST by Joel Maher ( :jmaher)
Modified: 2013-04-05 06:00 PDT (History)
15 users (show)
See Also:
Crash Signature:
QA Whiteboard:
Iteration: ---
Points: ---

add a --run-slower to mochitest options (1.0) (3.05 KB, patch)
2012-11-27 11:14 PST, Joel Maher ( :jmaher)
ted: review+
Details | Diff | Splinter Review
[buildbotcustom] v1 (5.17 KB, patch)
2012-11-28 02:37 PST, Justin Wood (:Callek)
kmoir: review+
jmaher: feedback+
Details | Diff | Splinter Review
patch (4.61 KB, patch)
2013-04-02 11:22 PDT, Kim Moir [:kmoir]
bhearsum: review-
Details | Diff | Splinter Review
patch that uses SetBuildProperty instead (4.17 KB, patch)
2013-04-03 06:39 PDT, Kim Moir [:kmoir]
bhearsum: review+
Details | Diff | Splinter Review

Description Joel Maher ( :jmaher) 2012-11-13 12:52:04 PST
oh, receurdo de Tegra, but I have seen it in the log files, unable to connect, then we fail, then we get the logcat and a startup sequence is shown.

We need to make sure we are setting proper watcher.ini stuff and that the latest version of the watcher (my custom version) will not reboot unless told to specifically

If all that is good, then we have other problems, maybe a common test case, or a history on the board.
Comment 1 Joel Maher ( :jmaher) 2012-11-16 10:57:05 PST
I have verified we are running a legit version of watcher which by default doesn't reboot.  I have also verified that we are not setting a watcher.ini file which will tell watcher to reboot.

Now the chance of watcher being the cause of reboots is much less a concern for me.

I am seeing this a lot today while testing a fix to the sutagent hanging on shelling out to system commands.  There is no common spot in a test suite, test suite or panda board.

This leads me to question the script and the power strips.  I would really like to have a central place to look at all requests that get sent to the power strip so I can determine if we are maybe sending the data.  I would also like to get a log from the power strip to see what it thinks it rebooted.
Comment 2 Amy Rich [:arr] [:arich] 2012-11-16 12:19:09 PST
joel, are you talking about pandas or tegras (comment 1 and the subject don't agree).  No pandas are attached to power strips.  They're all attached to relay boards inside chassis.  All tegras are attached to power strips.
Comment 3 Dustin J. Mitchell [:dustin] 2012-11-16 12:21:46 PST
I'm certain that the relays don't log.  There are no pandas on power strips (by which I assume you mean PDUs), and anyway I'm fairly certain the PDUs don't log either.

Mozpool will be that central place - you can look at the device_logs table, or aggregate /var/log/mozpool.log across all imaging servers if you want the relayhost/bank/relay breakdown.  But since sut_lib is doing reboots itself, rather than using mozpool, that won't show you what you need to know today.

If you can give me some examples of pandas that have failed this way, I can check for mozpool having power-cycled them.  But I haven't had any problems with misfires, so that's unlikely to be it.

I've been using chassis 3 (panda-{0034..0045}) to test mozpool for the last two weeks, so this may simply be a matter of two groups using the same chassis.
Comment 4 Joel Maher ( :jmaher) 2012-11-16 12:23:50 PST

I am only talking about pandas.  I really have no idea what technology lies behind the scenes to cycle the power, but whatever it is (I guess relay board in this case), I would like to get a log from it.
Comment 5 Joel Maher ( :jmaher) 2012-11-16 12:27:31 PST
Dustin, I believe we are using chassis 2 (panda-{0022-0033}).  Does that mean chassis 1 is panda-0000 -> panda-0021?

It sounds like we are not toggling the same boards.  If I wanted to check which host relay:bank mozpool was calling, how could I do that?
Comment 6 Joel Maher ( :jmaher) 2012-11-16 12:30:07 PST
I know that is the host we use for the relay, if that is being called from mozpool, we have an answer, if not, we are still at the drawing board.

Speaking of mozpool, is there an easy api to call or a python module we can import to initiate a device reboot?
Comment 7 Dustin J. Mitchell [:dustin] 2012-11-16 12:34:44 PST
They pandas are all listed here
chassis numbers match relay hostnames, so chassis 1 is just panda-0010 and panda-0021 (dunno why)

I don't see panda-relay-002 in the mozpool logs.

And yes, of course there's an API -- that's what we've been working on for two weeks!
Comment 8 Joel Maher ( :jmaher) 2012-11-27 11:14:11 PST
Created attachment 685737 [details] [diff] [review]
add a --run-slower to mochitest options (1.0)
Comment 9 Justin Wood (:Callek) 2012-11-28 02:37:27 PST
Created attachment 686041 [details] [diff] [review]
[buildbotcustom] v1

Ok, this does the needed work for buildbot and makes a few assumtions I want to call out:

* attachment 685737 [details] [diff] [review] will get an r+ (if it doesn't we need to peek at this again with whatever revised patch does)
* That we are ok with the hack being hidden at this low a level (I couldn't come up with a less hacky idea for passing the var down to here)
* That I am correct and attachment 685737 [details] [diff] [review] only affects mochi and is not wanted/needed for reftests
* That if we are to turn on any panda mochitests on other train-branches we would backport the --run-slower arg.
* b2g pandas won't use this buildbot factory.
* This won't land until *after* the mozilla-central code patch lands and gets carried forward to cedar.
Comment 10 Ted Mielczarek [:ted.mielczarek] 2012-11-28 06:50:33 PST
Comment on attachment 685737 [details] [diff] [review]
add a --run-slower to mochitest options (1.0)

Review of attachment 685737 [details] [diff] [review]:

This sucks. :-( I can only assume we're overloading the device and that's causing it to reboot. It would be better to find a root cause here, but as a band-aid we can deal with this.

::: testing/mochitest/tests/SimpleTest/TestRunner.js
@@ +441,5 @@
>          TestRunner.updateUI(tests);
>          TestRunner._currentTest++;
> +        if (TestRunner.runSlower) {
> +            setTimeout(TestRunner.runNextTest, 1000);

Have you done any testing to see if we can get away with a lower value? It'd be nice to make this as low as possible without causing issues.
Comment 11 Joel Maher ( :jmaher) 2012-11-28 07:27:17 PST

I would like to leave this bug open to experiment with lower values instead of 1000.
Comment 12 Joel Maher ( :jmaher) 2012-11-28 07:32:05 PST
Comment on attachment 686041 [details] [diff] [review]
[buildbotcustom] v1

Review of attachment 686041 [details] [diff] [review]:

looks good.
Comment 13 Dustin J. Mitchell [:dustin] 2012-11-28 08:14:57 PST
Can one of you briefly summarize the problem and fix?  I'm mostly curious.
Comment 14 Joel Maher ( :jmaher) 2012-11-28 08:25:25 PST
The problem is described in bug 815726, the solution is a hack to slow down the tests which cause fennec to have lower cpu and memory usage.  All signs point to overheating, but it could easily be something else.
Comment 15 Dustin J. Mitchell [:dustin] 2012-11-28 08:32:37 PST
OK, thanks!  An alternative approach might be to look at the kernel CPU governor configuration, but since this is critical-path, don't let me distract you :)
Comment 16 Joel Maher ( :jmaher) 2012-11-28 09:46:05 PST
do you have a pointer to how that can be configured?
Comment 17 Ed Morley [:emorley] 2012-11-29 06:55:52 PST
Comment 18 Justin Wood (:Callek) 2012-12-11 13:01:29 PST
We forget the "leave open" whiteboard for this bug, tehrefore got resolved with the m-c landing.
Comment 19 Justin Wood (:Callek) 2012-12-11 13:04:45 PST
Comment on attachment 686041 [details] [diff] [review]
[buildbotcustom] v1

(no checked-in flag here)
Comment 20 Aki Sasaki [:aki] 2012-12-12 13:22:03 PST
This is in production.
Comment 21 Ben Hearsum (:bhearsum) 2012-12-13 08:30:43 PST
in production
Comment 22 Geoff Brown [:gbrown] 2013-01-10 13:43:08 PST
Several recent logs in bug 722166 show pandaboards rebooting: comments 1439, 1438, and 1436.
Comment 23 Joel Maher ( :jmaher) 2013-03-08 11:17:17 PST
In looking at the test logs on mozilla-central and mozilla-inbound we are not running --run-slower anymore for the pandaboard mochitests.  I have confirmed that we continue to have problems with rebooting and adding --run-slower appears to help remediate this problem.  

For some data, a regular run yields just >50C from the kernel board reading, but with --run-slower, I always stay <46C.  This is data collected from 10 idle boards and 1 board running smoketests.
Comment 24 Kim Moir [:kmoir] 2013-03-08 12:58:26 PST
The problem is here

if 'panda' in self.platform:

is never true because platform has a value of android 

We don't want to run these on tegras. Not sure if we should add a value to  RemoteUnittestFactory so we have a another value that describes the platform better.

It did work before, not sure if recent changes broke something and the value passed used to be panda_android.
Comment 25 Joel Maher ( :jmaher) 2013-03-08 13:06:49 PST
can we go with slaveName instead of platform?

is it possible to distinguish between android 4.0 and android 2.2 ?
Comment 26 Kim Moir [:kmoir] 2013-03-11 07:13:40 PDT
I don't think slaveName is available to RemoteUnittestFactory now either.  I have to think about how to refactor it.
Comment 27 Ed Morley [:emorley] 2013-03-14 03:03:17 PDT
(In reply to Joel Maher (:jmaher) from comment #23)
> In looking at the test logs on mozilla-central and mozilla-inbound we are
> not running --run-slower anymore for the pandaboard mochitests.

Good spot - thank you for chasing this Joel/Kim :-)
Comment 28 Kim Moir [:kmoir] 2013-03-19 11:51:41 PDT
I looked this again today and I don't know how it ever worked. (Although I recall looking at the logs and seeing the parameter in the log when it was landed). In our buildbot-configs, 

PLATFORMS['android']['slave_platforms'] = ['tegra_android', 'panda_android']

platform is always android, it's the slave platform that is panda_android

If you look at the build properties on a page from a recent panda android build, the platform is panda_android.  I know how to capture that at build time,  but I don't see how to use this in an if statement and add the slowTests parameter for only panda_android

because this doesn't appear to be supported by buildbot (Using Properties in Steps section).  

I'm not sure how to implement this, any ideas would be appreciated.
Comment 29 Kim Moir [:kmoir] 2013-03-22 13:09:39 PDT
I wrote a patch yesterday to address this issue with doStepIf in and setting properties to determine if the device was a panda by doing a find on the foopy for the panda related file. I tested it this morning and it's not pretty at all, very hacky.  Given that Jake and Joel are testing a new power supply next week for the pandas chassis as they have determined that there are issues with the existing ones, I think perhaps we should not sink any more time into trying to make the tests run slower.  I'll focus on getting the test infrastructure set up in bug 853947 so they can assess the power issues.
Comment 30 Joel Maher ( :jmaher) 2013-03-22 13:53:42 PDT
Thanks kmoir!  If our power supply solution works, then life is good, otherwise this is a good fallback solution which we can use if needed.
Comment 31 Kim Moir [:kmoir] 2013-04-02 11:22:49 PDT
Created attachment 732434 [details] [diff] [review]

Patch to check if the foopy has attached pandas if running mochitests. If so, add --run-slower.  I've tested this on my dev-master and it works.
Comment 32 Chris AtLee [:catlee] 2013-04-02 11:59:14 PDT
Comment on attachment 732434 [details] [diff] [review]

Review of attachment 732434 [details] [diff] [review]:

::: process/
@@ +5396,5 @@
> +                             extract_fn=ifAPanda,
> +                             flunkOnFailure=False,
> +                             haltOnFailure=False,
> +                             warnOnFailure=False
> +                             ))

sorry, drive-by review.

could you use a SetBuildProperty step instead? then you could use a property function that looks at build.slavename to determine if it's a panda or not?

the find command as written will traverse all of /builds, only to return the first matching entry with panda-*[0-9] in the name. this could be pretty expensive!
Comment 33 Ben Hearsum (:bhearsum) 2013-04-02 12:33:35 PDT
Comment on attachment 732434 [details] [diff] [review]

Review of attachment 732434 [details] [diff] [review]:

::: process/
@@ +5396,5 @@
> +                             extract_fn=ifAPanda,
> +                             flunkOnFailure=False,
> +                             haltOnFailure=False,
> +                             warnOnFailure=False
> +                             ))

I second catlee's comments. Let me know if you want help with the SetBuildProperty part. might be a decent example.
Comment 34 Kim Moir [:kmoir] 2013-04-02 14:44:45 PDT
I changed it like this

     def ifAPanda(build):
                    slavename = build.slavename
                    if re.match(r'panda-[0-9]{4}\+?', slavename):
                        return "True"
                        return "False"

so on the build page slowTests is now a Step instead of a SetProperty and has the correct value

Like here

However, I'm not sure how to parse this in anymore since it's not a property associated with build but rather a step.  I tried to find some examples and hacked around a bit but was unable to find a solution - catlee or bhearsum - suggestions?
Comment 35 Kim Moir [:kmoir] 2013-04-03 06:39:57 PDT
Created attachment 732808 [details] [diff] [review]
patch that uses SetBuildProperty instead

tested on my dev-master and it works.
Comment 36 Ben Hearsum (:bhearsum) 2013-04-03 07:21:58 PDT
Comment on attachment 732808 [details] [diff] [review]
patch that uses SetBuildProperty instead

Review of attachment 732808 [details] [diff] [review]:

r=me, but can you get rid of the excess whitespace when you land?
Comment 37 Kim Moir [:kmoir] 2013-04-03 07:52:59 PDT
Comment on attachment 732808 [details] [diff] [review]
patch that uses SetBuildProperty instead

Checked in and fixed whitespace
Comment 38 Ben Hearsum (:bhearsum) 2013-04-04 12:32:25 PDT
latest patch is in production
Comment 39 Kim Moir [:kmoir] 2013-04-05 06:00:41 PDT
Verified that this is working by looking at recent test runs in tbpl.

Note You need to log in before you can comment on or make changes to this bug.