811444 - android panda boards magically reboot in the middle of the test

Description

•

12 years ago

Oh, memories of Tegra. I have seen it in the log files: we are unable to connect, then we fail, then we get the logcat and a startup sequence is shown.

We need to make sure we are setting the proper watcher.ini values and that the latest version of the watcher (my custom version) will not reboot unless specifically told to.

If all that is good, then we have other problems, maybe a common test case, or a history on the board.

Joel Maher ( :jmaher ) (UTC -8)

Assignee

Comment 1

•

12 years ago

I have verified we are running a legit version of watcher which by default doesn't reboot.  I have also verified that we are not setting a watcher.ini file which will tell watcher to reboot.

Now the chance of watcher being the cause of reboots is much less a concern for me.

I am seeing this a lot today while testing a fix to the sutagent hanging on shelling out to system commands.  There is no common spot within a test suite, no common test suite, and no common panda board.

This leads me to question the relay.py script and the power strips.  I would really like to have a central place to look at all requests that get sent to the power strip so I can determine whether we are the ones sending those requests.  I would also like to get a log from the power strip to see what it thinks it rebooted.

Amy Rich [:arr] [:arich]

Comment 2

•

12 years ago

Joel, are you talking about pandas or tegras?  (Comment 1 and the subject don't agree.)  No pandas are attached to power strips.  They're all attached to relay boards inside chassis.  All tegras are attached to power strips.

Dustin J. Mitchell [:dustin] (he/him)

Comment 3

•

12 years ago

I'm certain that the relays don't log.  There are no pandas on power strips (by which I assume you mean PDUs), and anyway I'm fairly certain the PDUs don't log either.

Mozpool will be that central place - you can look at the device_logs table, or aggregate /var/log/mozpool.log across all imaging servers if you want the relayhost/bank/relay breakdown.  But since sut_lib is doing reboots itself, rather than using mozpool, that won't show you what you need to know today.

If you can give me some examples of pandas that have failed this way, I can check for mozpool having power-cycled them.  But I haven't had any problems with misfires, so that's unlikely to be it.

I've been using chassis 3 (panda-{0034..0045}) to test mozpool for the last two weeks, so this may simply be a matter of two groups using the same chassis.

Joel Maher ( :jmaher ) (UTC -8)

Assignee

Comment 4

•

12 years ago

:arr,

I am only talking about pandas.  I really have no idea what technology lies behind the scenes to cycle the power, but whatever it is (I guess relay board in this case), I would like to get a log from it.

Joel Maher ( :jmaher ) (UTC -8)

Assignee

Comment 5

•

12 years ago

Dustin, I believe we are using chassis 2 (panda-{0022-0033}).  Does that mean chassis 1 is panda-0000 -> panda-0021?

It sounds like we are not toggling the same boards.  If I wanted to check which host relay:bank mozpool was calling, how could I do that?

Joel Maher ( :jmaher ) (UTC -8)

Assignee

Comment 6

•

12 years ago

I know that panda-relay-002.build.scl1.mozilla.com is the host we use for the relay.  If that is being called from mozpool, we have an answer; if not, we are back at the drawing board.

Speaking of mozpool, is there an easy api to call or a python module we can import to initiate a device reboot?

Dustin J. Mitchell [:dustin] (he/him)

Comment 7

•

12 years ago

The pandas are all listed here
  http://mobile-services.build.scl1.mozilla.com/ui/lifeguard.html
chassis numbers match relay hostnames, so chassis 1 is just panda-0010 and panda-0021 (dunno why)

I don't see panda-relay-002 in the mozpool logs.

And yes, of course there's an API -- that's what we've been working on for two weeks!
  https://wiki.mozilla.org/ReleaseEngineering/Mozpool

Joel Maher ( :jmaher ) (UTC -8)

Assignee

Comment 8

•

12 years ago

Attached patch: add a --run-slower to mochitest options (1.0)

Assignee: nobody → jmaher

Status: NEW → ASSIGNED

Attachment #685737 - Flags: review?(ted)

Joel Maher ( :jmaher ) (UTC -8)

Assignee

Updated

•

12 years ago

Blocks: android_4.0_testing

Justin Wood (:Callek)

Comment 9

•

12 years ago

Attached patch: [buildbotcustom] v1

OK, this does the needed work for buildbot and makes a few assumptions I want to call out:

* attachment 685737 [details] [diff] [review] will get an r+ (if it doesn't, we need to look at this again with whatever the revised patch does)
* That we are ok with the hack being hidden at this low a level (I couldn't come up with a less hacky idea for passing the var down to here)
* That I am correct and attachment 685737 [details] [diff] [review] only affects mochi and is not wanted/needed for reftests
* That if we are to turn on any panda mochitests on other train-branches we would backport the --run-slower arg.
* b2g pandas won't use this buildbot factory.
* This won't land until *after* the mozilla-central code patch lands and gets carried forward to cedar.

Attachment #686041 - Flags: review?(kmoir)

Attachment #686041 - Flags: feedback?(jmaher)

Kim Moir [:kmoir] ET

Updated

•

12 years ago

Attachment #686041 - Flags: review?(kmoir) → review+

(not currently active) Ted Mielczarek

Comment 10

•

12 years ago

Comment on attachment 685737 [details] [diff] [review]
add a --run-slower to mochitest options (1.0)

Review of attachment 685737 [details] [diff] [review]:
-----------------------------------------------------------------

This sucks. :-( I can only assume we're overloading the device and that's causing it to reboot. It would be better to find a root cause here, but as a band-aid we can deal with this.

::: testing/mochitest/tests/SimpleTest/TestRunner.js
@@ +441,5 @@
>  
>          TestRunner.updateUI(tests);
>          TestRunner._currentTest++;
> +        if (TestRunner.runSlower) {
> +            setTimeout(TestRunner.runNextTest, 1000);

Have you done any testing to see if we can get away with a lower value? It'd be nice to make this as low as possible without causing issues.

Attachment #685737 - Flags: review?(ted) → review+

Joel Maher ( :jmaher ) (UTC -8)

Assignee

Comment 11

•

12 years ago

inbound:
https://hg.mozilla.org/integration/mozilla-inbound/rev/a8c28e8d114a

I would like to leave this bug open to experiment with lower values instead of 1000.

Joel Maher ( :jmaher ) (UTC -8)

Assignee

Comment 12

•

12 years ago

Comment on attachment 686041 [details] [diff] [review]
[buildbotcustom] v1

Review of attachment 686041 [details] [diff] [review]:
-----------------------------------------------------------------

looks good.

Attachment #686041 - Flags: feedback?(jmaher) → feedback+

Joel Maher ( :jmaher ) (UTC -8)

Assignee

Updated

•

12 years ago

Blocks: 815726

Dustin J. Mitchell [:dustin] (he/him)

Comment 13

•

12 years ago

Can one of you briefly summarize the problem and fix?  I'm mostly curious.

Joel Maher ( :jmaher ) (UTC -8)

Assignee

Comment 14

•

12 years ago

The problem is described in bug 815726.  The solution is a hack to slow down the tests, which causes Fennec to have lower CPU and memory usage.  All signs point to overheating, but it could easily be something else.

Dustin J. Mitchell [:dustin] (he/him)

Comment 15

•

12 years ago

OK, thanks!  An alternative approach might be to look at the kernel CPU governor configuration, but since this is critical-path, don't let me distract you :)

Joel Maher ( :jmaher ) (UTC -8)

Assignee

Comment 16

•

12 years ago

do you have a pointer to how that can be configured?
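
For reference, on Linux kernels built with cpufreq the active governor for each core is exposed through sysfs as `cpuN/cpufreq/scaling_governor` under `/sys/devices/system/cpu`. The following is only a minimal sketch of reading those files, assuming the panda kernel exposes the standard cpufreq layout (not verified in this bug) and that the code runs where that sysfs tree is visible. On the boards themselves the equivalent read would have to go through the agent or a shell on the device. The function name `read_governors` is hypothetical.

```python
import os


def read_governors(sysfs_cpu_root="/sys/devices/system/cpu"):
    """Return {cpu_dir_name: governor} for every CPU that exposes cpufreq.

    Assumes the standard Linux cpufreq sysfs layout; returns an empty
    dict when the root directory does not exist on this machine.
    """
    governors = {}
    if not os.path.isdir(sysfs_cpu_root):
        return governors
    for entry in sorted(os.listdir(sysfs_cpu_root)):
        path = os.path.join(sysfs_cpu_root, entry, "cpufreq", "scaling_governor")
        if os.path.isfile(path):
            with open(path) as handle:
                governors[entry] = handle.read().strip()
    return governors
```

Calling `read_governors()` with no arguments inspects the local machine. Passing a different root makes the function easy to point at a copied or mocked sysfs tree.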

Ed Morley [:emorley]

Comment 17

•

12 years ago

https://hg.mozilla.org/mozilla-central/rev/a8c28e8d114a

Status: ASSIGNED → RESOLVED

Closed: 12 years ago

Resolution: --- → FIXED

Target Milestone: --- → mozilla20

Justin Wood (:Callek)

Comment 18

•

12 years ago

We forgot the "leave open" whiteboard for this bug, therefore it got resolved with the m-c landing.

Status: RESOLVED → REOPENED

Resolution: FIXED → ---

Justin Wood (:Callek)

Comment 19

•

12 years ago

Comment on attachment 686041 [details] [diff] [review]
[buildbotcustom] v1

http://hg.mozilla.org/build/buildbotcustom/rev/1075c14a159d

(no checked-in flag here)

Aki Sasaki (not active)

Comment 20

•

12 years ago

This is in production.

bhearsum@mozilla.com (:bhearsum)

Comment 21

•

12 years ago

in production

Aki Sasaki (not active)

Updated

•

12 years ago

Summary: panda boards magically reboot in the middle of the test → android panda boards magically reboot in the middle of the test

Geoff Brown [:gbrown]

Comment 22

•

11 years ago

Several recent logs in bug 722166 show pandaboards rebooting: comments 1439, 1438, and 1436.

Joel Maher ( :jmaher ) (UTC -8)

Assignee

Comment 23

•

11 years ago

In looking at the test logs on mozilla-central and mozilla-inbound we are not running --run-slower anymore for the pandaboard mochitests.  I have confirmed that we continue to have problems with rebooting and adding --run-slower appears to help remediate this problem.  

For some data, a regular run yields just >50C from the kernel board reading, but with --run-slower, I always stay <46C.  This is data collected from 10 idle boards and 1 board running smoketests.

Kim Moir [:kmoir] ET

Comment 24

•

11 years ago

The problem is here

http://mxr.mozilla.org/build/source/buildbotcustom/process/factory.py#5369

if 'panda' in self.platform:

is never true because platform has a value of android.

We don't want to run these on tegras. Not sure if we should add a value to RemoteUnittestFactory so we have another value that describes the platform better.

It did work before, not sure if recent changes broke something and the value passed used to be panda_android.

Joel Maher ( :jmaher ) (UTC -8)

Assignee

Comment 25

•

11 years ago

can we go with slaveName instead of platform?

is it possible to distinguish between android 4.0 and android 2.2 ?

Kim Moir [:kmoir] ET

Comment 26

•

11 years ago

I don't think slaveName is available to RemoteUnittestFactory now either.  I have to think about how to refactor it.

Joel Maher ( :jmaher ) (UTC -8)

Assignee

Updated

•

11 years ago

Blocks: 850756

Ed Morley [:emorley]

Comment 27

•

11 years ago

(In reply to Joel Maher (:jmaher) from comment #23)
> In looking at the test logs on mozilla-central and mozilla-inbound we are
> not running --run-slower anymore for the pandaboard mochitests.

Good spot - thank you for chasing this Joel/Kim :-)

Kim Moir [:kmoir] ET

Comment 28

•

11 years ago

I looked at this again today and I don't know how it ever worked. (Although I recall looking at the logs and seeing the parameter in the log when it was landed.) In our buildbot-configs,

PLATFORMS['android']['slave_platforms'] = ['tegra_android', 'panda_android']

platform is always android, it's the slave platform that is panda_android
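
A small sketch of why the original check never fires. The `PLATFORMS` dict below is a reduced stand-in built only from the line quoted above, not the real buildbot-configs structure, and `panda_checks` is a hypothetical helper used for illustration.

```python
# Reduced stand-in for the buildbot-configs entry quoted above.
PLATFORMS = {
    'android': {'slave_platforms': ['tegra_android', 'panda_android']},
}


def panda_checks(platforms):
    """Return (platform, slave_platform, platform_hit, slave_platform_hit) rows."""
    rows = []
    for platform, config in platforms.items():
        for slave_platform in config['slave_platforms']:
            rows.append((
                platform,
                slave_platform,
                'panda' in platform,        # the check factory.py performs
                'panda' in slave_platform,  # the value that actually says "panda"
            ))
    return rows


for row in panda_checks(PLATFORMS):
    print(row)
```

The third column is False for both rows, which matches the behaviour seen in the logs: testing the platform string can never select the pandas, only the slave platform can.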

If you look at the build properties on a page from a recent panda android build, the platform is panda_android.  I know how to capture that at build time,  but I don't see how to use this in an if statement and add the slowTests parameter for only panda_android

http://buildbot.net/buildbot/docs/latest/manual/cfg-properties.html

because this doesn't appear to be supported by buildbot (Using Properties in Steps section).  

I'm not sure how to implement this, any ideas would be appreciated.

Kim Moir [:kmoir] ET

Comment 29

•

11 years ago

I wrote a patch yesterday to address this issue with doStepIf in factory.py and setting properties to determine if the device was a panda by doing a find on the foopy for the panda related file. I tested it this morning and it's not pretty at all, very hacky.  Given that Jake and Joel are testing a new power supply next week for the pandas chassis as they have determined that there are issues with the existing ones, I think perhaps we should not sink any more time into trying to make the tests run slower.  I'll focus on getting the test infrastructure set up in bug 853947 so they can assess the power issues.

Joel Maher ( :jmaher ) (UTC -8)

Assignee

Comment 30

•

11 years ago

Thanks kmoir!  If our power supply solution works, then life is good, otherwise this is a good fallback solution which we can use if needed.

Kim Moir [:kmoir] ET

Comment 31

•

11 years ago

Attached patch: patch (obsolete)

Patch to check if the foopy has attached pandas if running mochitests. If so, add --run-slower.  I've tested this on my dev-master and it works.

Attachment #732434 - Flags: review?(bhearsum)

Chris AtLee [:catlee]

Comment 32

•

11 years ago

Comment on attachment 732434 [details] [diff] [review]
patch

Review of attachment 732434 [details] [diff] [review]:
-----------------------------------------------------------------

::: process/factory.py
@@ +5396,5 @@
> +                             extract_fn=ifAPanda,
> +                             flunkOnFailure=False,
> +                             haltOnFailure=False,
> +                             warnOnFailure=False
> +                             ))

sorry, drive-by review.

could you use a SetBuildProperty step instead? then you could use a property function that looks at build.slavename to determine if it's a panda or not?

the find command as written will traverse all of /builds, only to return the first matching entry with panda-*[0-9] in the name. this could be pretty expensive!

bhearsum@mozilla.com (:bhearsum)

Comment 33

•

11 years ago

Comment on attachment 732434 [details] [diff] [review]
patch

Review of attachment 732434 [details] [diff] [review]:
-----------------------------------------------------------------

::: process/factory.py
@@ +5396,5 @@
> +                             extract_fn=ifAPanda,
> +                             flunkOnFailure=False,
> +                             haltOnFailure=False,
> +                             warnOnFailure=False
> +                             ))

I second catlee's comments. Let me know if you want help with the SetBuildProperty part. https://mxr.mozilla.org/build-central/source/buildbotcustom/process/factory.py#4729 might be a decent example.

Attachment #732434 - Flags: review?(bhearsum) → review-

Kim Moir [:kmoir] ET

Comment 34

•

11 years ago

I changed it like this

    def ifAPanda(build):
        slavename = build.slavename
        if re.match(r'panda-[0-9]{4}\+?', slavename):
            return "True"
        else:
            return "False"
    self.addStep(SetBuildProperty(
        property_name="slowTests",
        value=ifAPanda,
    ))


So on the build page, slowTests is now a Step instead of a SetProperty and has the correct value.

Like here 
http://dev-master01.build.scl1.mozilla.com:8036/builders/Android%20Tegra%20250%20mozilla-central%20opt%20test%20mochitest-1/builds/30
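
The matching logic can be exercised outside buildbot. This is a standalone sketch, assuming only the regular expression shown in this comment. `if_a_panda` is a hypothetical stand-in that takes the slave name directly instead of a build object, and every sample name except panda-0034 is made up for illustration.

```python
import re

# Same pattern as in the property function in this comment.
PANDA_SLAVE_RE = r'panda-[0-9]{4}\+?'


def if_a_panda(slavename):
    """Return the string "True" or "False", mirroring the property function."""
    if re.match(PANDA_SLAVE_RE, slavename):
        return "True"
    return "False"


# Hypothetical slave names, apart from panda-0034 which appears in this bug.
for name in ("panda-0034", "tegra-123", "panda-12", "panda-00345"):
    print(name, if_a_panda(name))
```

Note that `re.match` only anchors at the start of the string, so a longer name such as panda-00345 still matches on its first four digits. The return values are strings rather than booleans, as in the original.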

However, I'm not sure how to parse this in unittest.py anymore since it's not a property associated with build but rather a step.  I tried to find some examples and hacked around a bit but was unable to find a solution - catlee or bhearsum - suggestions?

Kim Moir [:kmoir] ET

Comment 35

•

11 years ago

Attached patch: patch that uses SetBuildProperty instead

tested on my dev-master and it works.

Attachment #732434 - Attachment is obsolete: true

Attachment #732808 - Flags: review?(bhearsum)

bhearsum@mozilla.com (:bhearsum)

Comment 36

•

11 years ago

Comment on attachment 732808 [details] [diff] [review]
patch that uses SetBuildProperty instead

Review of attachment 732808 [details] [diff] [review]:
-----------------------------------------------------------------

r=me, but can you get rid of the excess whitespace when you land?

Attachment #732808 - Flags: review?(bhearsum) → review+

Kim Moir [:kmoir] ET

Comment 37

•

11 years ago

Comment on attachment 732808 [details] [diff] [review]
patch that uses SetBuildProperty instead

Checked in and fixed whitespace
http://hg.mozilla.org/build/buildbotcustom/rev/e854634ca5bb

bhearsum@mozilla.com (:bhearsum)

Comment 38

•

11 years ago

latest patch is in production

Kim Moir [:kmoir] ET

Comment 39

•

11 years ago

Verified that this is working by looking at recent test runs in tbpl.

Status: REOPENED → RESOLVED

Closed: 12 years ago → 11 years ago

Resolution: --- → FIXED

add a --run-slower to mochitest options (1.0); 12 years ago; Joel Maher ( :jmaher ) (UTC -8); 3.05 KB, patch; ted: review+
[buildbotcustom] v1; 12 years ago; Justin Wood (:Callek); 5.17 KB, patch; kmoir: review+, jmaher: feedback+
patch; 11 years ago; Kim Moir [:kmoir] ET; 4.61 KB, patch; bhearsum: review-
patch that uses SetBuildProperty instead; 11 years ago; Kim Moir [:kmoir] ET; 4.17 KB, patch; bhearsum: review+