Closed Bug 610600 Opened 9 years ago Closed 9 years ago

production tegra stability improvements

Categories

(Release Engineering :: General, defect, P3)

ARM
Android
defect

Tracking

(Not tracked)

RESOLVED FIXED

People

(Reporter: aki, Assigned: bear)

References

Details

Attachments

(7 files, 1 obsolete file)

Something is different between production and staging.
We need to figure it out and fix it.

Staging:
http://bm-foopy.build.mozilla.org:8012/one_line_per_build?numbuilds=1000 (purple, orange, red)

Production:
http://test-master01.build.mozilla.org:8012/one_line_per_build
(orange, green)
Depends on: 610573
Assignee: nobody → aki
Depends on: 611346, 611348
Production was falsely green because it's never installed a new version of Fennec.
Severity: critical → blocker
Summary: All staging tegras are perma-orange → Tegra remote installApp steps are broken
Depends on: 611823
Depends on: 612789
installApp is now fixed, but we're all orange still on builds that aren't oct24.
Assignee: aki → bear
Summary: Tegra remote installApp steps are broken → Tegras are perma-orange
Depends on: 618360
No longer perma orange, but very, very unreliable.
Summary: Tegras are perma-orange → bring tegras back into production
Depends on: 617815
Depends on: 618363
Depends on: 614173
Depends on: 612792
Depends on: 618376
Depends on: 618384
Depends on: 619764
These are now in production, so we can either morph this bug (get them stable + stay up, and/or track landing the blocking bugs), or close.
Lowering priority due to comment 4.
Severity: blocker → major
Depends on: 623299
Depends on: 629266
Attachment #507736 - Flags: review?(aki)
Attachment #507737 - Flags: review?(aki) → review+
Comment on attachment 507736 [details] [diff] [review]
add haltonfailure to install step for Android

Asking Bear to add haltOnFailure to more steps.
Attachment #507736 - Flags: review?(aki)
(In reply to comment #7)
> Created attachment 507737 [details] [diff] [review]
> turn off tsvg, tdhtml and tspider until we can figure out orange

committed changeset 3531:b51b429fc8fb
Comment on attachment 507738 [details] [diff] [review]
add haltonfailure to install Android talos steps that talk to the tegras

The haltOnFailure=True in addRunTestStep() needs to be changed back to False, otherwise you will prevent desktop Talos from running multiple talos suites in one run.

r=me with that change.
Attachment #507738 - Flags: review?(aki) → review+
(In reply to comment #11)
> Comment on attachment 507738 [details] [diff] [review]
> add haltonfailure to install Android talos steps that talk to the tegras
> 
> The haltOnFailure=True in addRunTestStep() needs to be changed back to False,
> otherwise you will prevent desktop Talos from running multiple talos suites in
> one run.
> 
> r=me with that change.

that tweak applied and pushed
committed changeset 1252:48f9d397d28c
adjusting priority to reflect that this is now a tracking bug for the various ateam/releng tegra projects
Severity: major → normal
Priority: -- → P3
Depends on: 631306
adjusting summary to make this a tracking bug for rest of talos and reftests
Summary: bring tegras back into production → bring remaining talos suites and reftests to production tegras
Attachment #509908 - Flags: review?(aki) → review+
Comment on attachment 509908 [details] [diff] [review]
turn on tdhtml, tsvg and tsspider tests for m-c

http://hg.mozilla.org/build/buildbot-configs/rev/98e7507b7171
Attachment #509908 - Flags: checked-in+
Depends on: 632499
Summary: bring remaining talos suites and reftests to production tegras → production tegra stability improvements
Depends on: 636091
Depends on: 636231
Depends on: 637412
Comment on attachment 518763 [details] [diff] [review]
bear's latest stability patch

>             if options.reset:
>                 s += ' - resetting'
>+                hbSocket.send('rebt\n')
>+                # not interested in result
>+                # taking advantage of the pause
>+                d = hbSocket.recv(4096)
>                 os.remove(errorFile)

Clever.  I'm guessing 4096 is 4 heartbeats?
If so, a comment might be nice. You also have 4096 elsewhere; this might be a good thing to pull out into a var.

>     def handlesigterm(self, signum, frame):
>         if self.pidfile is not None:
>             try:
>+                eventQueue.put(('terminate',))
>                 os.remove(self.pidfile)
>             except (KeyboardInterrupt, SystemExit):
>                 raise
>             except Exception:
>                 pass
>         sys.exit(0)

Hm, is this to make sure the queue doesn't keep going?
Is there a possibility that the queue will have other items in it that will be dealt with before the terminate?

I'm wondering if this should be eventQueue.queue = ['terminate'] or something, unless we want it to keep processing.
Maybe logging eventQueue.queue at the time of terminate might be good enough for now.

>-def killPID(pidFile, signal='2'):
>+def killPID(pid, signal='2'):

Not relying on the pidfile is probably a good move.

>+def sendReboot(ip, port):
>+    log.warning('sending rebt to tegra')
>+    try:
>+        hbSocket = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
>+        hbSocket.settimeout(float(120))
>+        hbSocket.connect((ip, port))
>+        hbSocket.send('rebt\n')
>+        hbSocket.recv(4096)
>+    except:
>+        log.debug('Error sending reboot')

Cool.  Guessing this also reboots and waits for 4 heartbeats?

>                 tegraActive = True
>                 if not bbActive:
>                     if os.path.isfile(errorFile):
>-                        log.warning('Tegra active but error flag set')
>+                        log.warning('Tegra active but error flag set [%d/%d]' % (softCount, softResets))
>+                        softCount += 1
>+                        if softCount > softCountMax:
>+                            softCount = 0
>+                            if softResets < softResetMax:
>+                                softResets += 1
>+                                log.info('removing error flag to see if tegra comes back')
>+                                os.remove(errorFile)
>+                            else:
>+                                hardResets += 1
>+                                log.info('hard reset reboot check [%d/%d]' % (hardResets, hardResetsMax))
>+                                if hardResets < hardResetsMax:
>+                                    sendReboot(options.tegraIP, sutDataPort)

I'd love a message here in an else:...
"Tegra over hardResetsMax; requires manual intervention" or something.
We may want it to otherwise notify us, rather than waiting for us to parse logs, but having it in the log message would be a good start.

Should we also terminate if hardResets > hardResetsMax ?

Also, you're incrementing softResets and hardResets in different places (softResets after the softResetMax compare; hardResets before) but you're using < for both, so we're going to have an off-by-one.

>+            elif state == 'dialback':
>+                softCount  = 0
>+                softResets = 0

This only goes to dialback when clientproxy first starts?

>+            elif state == 'terminate':
>+                break

The break doesn't seem final enough to me, but if it falls off the end of the __main__ function I suppose that works.

>-    if daemon is not None:
>-        daemon.stop()
>+    # killPID(db.pid)
>+    # killPID(mt.pid)
>+    # 
>+    # db.join()
>+    # mt.join()

Testing different exits?

I've got lots of comments above, some of which are questions and some of which I think need fixing.
But I don't think I saw anything that prevents me from wanting to see this landed/run as long as the fixes happen.
Attachment #518763 - Flags: review?(aki) → review+
Depends on: 641721
Depends on: 642369
Depends on: 643029
This patch contains the changes worked out with the a-team for latest patched devicemanager.py and also the v1.0 watcher/agent apk's.  It also has the resolution check and small cleanups.

It may not be the last thing changed for this final stability push, but I wanted to get the current code in the source tree.
Attachment #521260 - Flags: review?(aki)
Attachment #521260 - Attachment is patch: true
Attachment #521260 - Attachment mime type: application/octet-stream → text/plain
Attachment #521260 - Flags: review?(aki) → review+
Comment on attachment 521260 [details] [diff] [review]
changes to installApp to work with v1.0 watcher/agent apk

committed changeset 1304:6637f555ff47
Attachment #521260 - Flags: checked-in+
MobileTest is currently looking better than Mobile, test-wise, right now =P
Then again, we don't seem to be running any unit tests there?

If things look good in the morning, I suggest we:

* verify unit tests work in staging
* point foopy02's tegras (031-050; please create 045-050) to test-master01:9012
* potentially also point foopy03's 051-054 at test-master01:9012
* file a bug to image 055-094, and possibly 001-030.
* when the tegras are imaged, decide which we keep in staging and which we point at production.  I think at least 3 foopy's worth of tegras (60 tegras) should be in production.
config.py added to tools/sut_tools/
Attachment #521694 - Flags: review?(aki)
Attachment #521694 - Flags: review?(aki) → review+
add new config step and also change reboot to be a disconnect step for remote reftests
Attachment #521695 - Flags: review?(aki)
Comment on attachment 521695 [details] [diff] [review]
update to buildbotcustom process/factory.py to add new config step for reftests

I'm not a huge fan of the generic name "config" for a script that only changes screen resolutions for reftests, but I want this stuff green first, nitpick later.
Attachment #521695 - Flags: review?(aki) → review+
Comment on attachment 521694 [details] [diff] [review]
new sut_tools helper script to set screen resolution

committed changeset 1306:309b378527a2
Attachment #521694 - Flags: checked-in+
Comment on attachment 521695 [details] [diff] [review]
update to buildbotcustom process/factory.py to add new config step for reftests

committed changeset 1433:377379e0b359
Attachment #521695 - Flags: checked-in+
Blocks: 645168
Tracking bug that's tracking 1 low priority bug. -> RESO FIXED
Status: NEW → RESOLVED
Closed: 9 years ago
Resolution: --- → FIXED
Product: mozilla.org → Release Engineering
You need to log in before you can comment on or make changes to this bug.