Closed Bug 811444 Opened 9 years ago Closed 9 years ago

android panda boards magically reboot in the middle of the test


(Testing :: General, defect)

Not set


(Not tracked)



(Reporter: jmaher, Assigned: jmaher)




(3 files, 1 obsolete file)

oh, receurdo de Tegra, but I have seen it in the log files, unable to connect, then we fail, then we get the logcat and a startup sequence is shown.

We need to make sure we are setting proper watcher.ini stuff and that the latest version of the watcher (my custom version) will not reboot unless told to specifically

If all that is good, then we have other problems, maybe a common test case, or a history on the board.
I have verified we are running a legit version of watcher which by default doesn't reboot.  I have also verified that we are not setting a watcher.ini file which will tell watcher to reboot.

Now the chance of watcher being the cause of reboots is much less a concern for me.

I am seeing this a lot today while testing a fix to the sutagent hanging on shelling out to system commands.  There is no common spot in a test suite, test suite or panda board.

This leads me to question the script and the power strips.  I would really like to have a central place to look at all requests that get sent to the power strip so I can determine if we are maybe sending the data.  I would also like to get a log from the power strip to see what it thinks it rebooted.
joel, are you talking about pandas or tegras (comment 1 and the subject don't agree).  No pandas are attached to power strips.  They're all attached to relay boards inside chassis.  All tegras are attached to power strips.
I'm certain that the relays don't log.  There are no pandas on power strips (by which I assume you mean PDUs), and anyway I'm fairly certain the PDUs don't log either.

Mozpool will be that central place - you can look at the device_logs table, or aggregate /var/log/mozpool.log across all imaging servers if you want the relayhost/bank/relay breakdown.  But since sut_lib is doing reboots itself, rather than using mozpool, that won't show you what you need to know today.

If you can give me some examples of pandas that have failed this way, I can check for mozpool having power-cycled them.  But I haven't had any problems with misfires, so that's unlikely to be it.

I've been using chassis 3 (panda-{0034..0045}) to test mozpool for the last two weeks, so this may simply be a matter of two groups using the same chassis.

I am only talking about pandas.  I really have no idea what technology lies behind the scenes to cycle the power, but whatever it is (I guess relay board in this case), I would like to get a log from it.
Dustin, I believe we are using chassis 2 (panda-{0022-0033}).  Does that mean chassis 1 is panda-0000 -> panda-0021?

It sounds like we are not toggling the same boards.  If I wanted to check which host relay:bank mozpool was calling, how could I do that?
I know that is the host we use for the relay, if that is being called from mozpool, we have an answer, if not, we are still at the drawing board.

Speaking of mozpool, is there an easy api to call or a python module we can import to initiate a device reboot?
They pandas are all listed here
chassis numbers match relay hostnames, so chassis 1 is just panda-0010 and panda-0021 (dunno why)

I don't see panda-relay-002 in the mozpool logs.

And yes, of course there's an API -- that's what we've been working on for two weeks!
Assignee: nobody → jmaher
Attachment #685737 - Flags: review?(ted)
Ok, this does the needed work for buildbot and makes a few assumtions I want to call out:

* attachment 685737 [details] [diff] [review] will get an r+ (if it doesn't we need to peek at this again with whatever revised patch does)
* That we are ok with the hack being hidden at this low a level (I couldn't come up with a less hacky idea for passing the var down to here)
* That I am correct and attachment 685737 [details] [diff] [review] only affects mochi and is not wanted/needed for reftests
* That if we are to turn on any panda mochitests on other train-branches we would backport the --run-slower arg.
* b2g pandas won't use this buildbot factory.
* This won't land until *after* the mozilla-central code patch lands and gets carried forward to cedar.
Attachment #686041 - Flags: review?(kmoir)
Attachment #686041 - Flags: feedback?(jmaher)
Attachment #686041 - Flags: review?(kmoir) → review+
Comment on attachment 685737 [details] [diff] [review]
add a --run-slower to mochitest options (1.0)

Review of attachment 685737 [details] [diff] [review]:

This sucks. :-( I can only assume we're overloading the device and that's causing it to reboot. It would be better to find a root cause here, but as a band-aid we can deal with this.

::: testing/mochitest/tests/SimpleTest/TestRunner.js
@@ +441,5 @@
>          TestRunner.updateUI(tests);
>          TestRunner._currentTest++;
> +        if (TestRunner.runSlower) {
> +            setTimeout(TestRunner.runNextTest, 1000);

Have you done any testing to see if we can get away with a lower value? It'd be nice to make this as low as possible without causing issues.
Attachment #685737 - Flags: review?(ted) → review+

I would like to leave this bug open to experiment with lower values instead of 1000.
Comment on attachment 686041 [details] [diff] [review]
[buildbotcustom] v1

Review of attachment 686041 [details] [diff] [review]:

looks good.
Attachment #686041 - Flags: feedback?(jmaher) → feedback+
Blocks: 815726
Can one of you briefly summarize the problem and fix?  I'm mostly curious.
The problem is described in bug 815726, the solution is a hack to slow down the tests which cause fennec to have lower cpu and memory usage.  All signs point to overheating, but it could easily be something else.
OK, thanks!  An alternative approach might be to look at the kernel CPU governor configuration, but since this is critical-path, don't let me distract you :)
do you have a pointer to how that can be configured?
Closed: 9 years ago
Resolution: --- → FIXED
Target Milestone: --- → mozilla20
We forget the "leave open" whiteboard for this bug, tehrefore got resolved with the m-c landing.
Resolution: FIXED → ---
This is in production.
in production
Summary: panda boards magically reboot in the middle of the test → android panda boards magically reboot in the middle of the test
Several recent logs in bug 722166 show pandaboards rebooting: comments 1439, 1438, and 1436.
In looking at the test logs on mozilla-central and mozilla-inbound we are not running --run-slower anymore for the pandaboard mochitests.  I have confirmed that we continue to have problems with rebooting and adding --run-slower appears to help remediate this problem.  

For some data, a regular run yields just >50C from the kernel board reading, but with --run-slower, I always stay <46C.  This is data collected from 10 idle boards and 1 board running smoketests.
The problem is here

if 'panda' in self.platform:

is never true because platform has a value of android 

We don't want to run these on tegras. Not sure if we should add a value to  RemoteUnittestFactory so we have a another value that describes the platform better.

It did work before, not sure if recent changes broke something and the value passed used to be panda_android.
can we go with slaveName instead of platform?

is it possible to distinguish between android 4.0 and android 2.2 ?
I don't think slaveName is available to RemoteUnittestFactory now either.  I have to think about how to refactor it.
Blocks: 850756
(In reply to Joel Maher (:jmaher) from comment #23)
> In looking at the test logs on mozilla-central and mozilla-inbound we are
> not running --run-slower anymore for the pandaboard mochitests.

Good spot - thank you for chasing this Joel/Kim :-)
I looked this again today and I don't know how it ever worked. (Although I recall looking at the logs and seeing the parameter in the log when it was landed). In our buildbot-configs, 

PLATFORMS['android']['slave_platforms'] = ['tegra_android', 'panda_android']

platform is always android, it's the slave platform that is panda_android

If you look at the build properties on a page from a recent panda android build, the platform is panda_android.  I know how to capture that at build time,  but I don't see how to use this in an if statement and add the slowTests parameter for only panda_android

because this doesn't appear to be supported by buildbot (Using Properties in Steps section).  

I'm not sure how to implement this, any ideas would be appreciated.
I wrote a patch yesterday to address this issue with doStepIf in and setting properties to determine if the device was a panda by doing a find on the foopy for the panda related file. I tested it this morning and it's not pretty at all, very hacky.  Given that Jake and Joel are testing a new power supply next week for the pandas chassis as they have determined that there are issues with the existing ones, I think perhaps we should not sink any more time into trying to make the tests run slower.  I'll focus on getting the test infrastructure set up in bug 853947 so they can assess the power issues.
Thanks kmoir!  If our power supply solution works, then life is good, otherwise this is a good fallback solution which we can use if needed.
Attached patch patch (obsolete) — Splinter Review
Patch to check if the foopy has attached pandas if running mochitests. If so, add --run-slower.  I've tested this on my dev-master and it works.
Attachment #732434 - Flags: review?(bhearsum)
Comment on attachment 732434 [details] [diff] [review]

Review of attachment 732434 [details] [diff] [review]:

::: process/
@@ +5396,5 @@
> +                             extract_fn=ifAPanda,
> +                             flunkOnFailure=False,
> +                             haltOnFailure=False,
> +                             warnOnFailure=False
> +                             ))

sorry, drive-by review.

could you use a SetBuildProperty step instead? then you could use a property function that looks at build.slavename to determine if it's a panda or not?

the find command as written will traverse all of /builds, only to return the first matching entry with panda-*[0-9] in the name. this could be pretty expensive!
Comment on attachment 732434 [details] [diff] [review]

Review of attachment 732434 [details] [diff] [review]:

::: process/
@@ +5396,5 @@
> +                             extract_fn=ifAPanda,
> +                             flunkOnFailure=False,
> +                             haltOnFailure=False,
> +                             warnOnFailure=False
> +                             ))

I second catlee's comments. Let me know if you want help with the SetBuildProperty part. might be a decent example.
Attachment #732434 - Flags: review?(bhearsum) → review-
I changed it like this

     def ifAPanda(build):
                    slavename = build.slavename
                    if re.match(r'panda-[0-9]{4}\+?', slavename):
                        return "True"
                        return "False"

so on the build page slowTests is now a Step instead of a SetProperty and has the correct value

Like here

However, I'm not sure how to parse this in anymore since it's not a property associated with build but rather a step.  I tried to find some examples and hacked around a bit but was unable to find a solution - catlee or bhearsum - suggestions?
tested on my dev-master and it works.
Attachment #732434 - Attachment is obsolete: true
Attachment #732808 - Flags: review?(bhearsum)
Comment on attachment 732808 [details] [diff] [review]
patch that uses SetBuildProperty instead

Review of attachment 732808 [details] [diff] [review]:

r=me, but can you get rid of the excess whitespace when you land?
Attachment #732808 - Flags: review?(bhearsum) → review+
Comment on attachment 732808 [details] [diff] [review]
patch that uses SetBuildProperty instead

Checked in and fixed whitespace
latest patch is in production
Verified that this is working by looking at recent test runs in tbpl.
Closed: 9 years ago9 years ago
Resolution: --- → FIXED
You need to log in before you can comment on or make changes to this bug.