Closed
Bug 610600
Opened 14 years ago
Closed 13 years ago
production tegra stability improvements
Categories
(Release Engineering :: General, defect, P3)
Tracking
(Not tracked)
RESOLVED
FIXED
People
(Reporter: mozilla, Assigned: bear)
References
Details
Attachments
(7 files, 1 obsolete file)
2.09 KB,
patch
|
mozilla
:
review+
|
Details | Diff | Splinter Review |
2.59 KB,
patch
|
mozilla
:
review+
|
Details | Diff | Splinter Review |
2.09 KB,
patch
|
mozilla
:
review+
mozilla
:
checked-in+
|
Details | Diff | Splinter Review |
13.53 KB,
patch
|
mozilla
:
review+
mozilla
:
checked-in+
|
Details | Diff | Splinter Review |
3.44 KB,
patch
|
mozilla
:
review+
bear
:
checked-in+
|
Details | Diff | Splinter Review |
3.20 KB,
patch
|
mozilla
:
review+
bear
:
checked-in+
|
Details | Diff | Splinter Review |
2.64 KB,
patch
|
mozilla
:
review+
bear
:
checked-in+
|
Details | Diff | Splinter Review |
Something is different between production and staging. We need to figure it out and fix it. Staging: http://bm-foopy.build.mozilla.org:8012/one_line_per_build?numbuilds=1000 (purple, orange, red) Production: http://test-master01.build.mozilla.org:8012/one_line_per_build (orange, green)
Updated•14 years ago
|
Blocks: releng-android
Reporter | ||
Updated•14 years ago
|
Assignee: nobody → aki
Reporter | ||
Comment 1•14 years ago
|
||
Production was falsely green because it's never installed a new version of Fennec.
Severity: critical → blocker
Summary: All staging tegras are perma-orange → Tegra remote installApp steps are broken
Reporter | ||
Comment 2•14 years ago
|
||
installApp is now fixed, but we're all orange still on builds that aren't oct24.
Assignee: aki → bear
Summary: Tegra remote installApp steps are broken → Tegras are perma-orange
Reporter | ||
Comment 3•14 years ago
|
||
No longer perma orange, but very, very unreliable.
Summary: Tegras are perma-orange → bring tegras back into production
Reporter | ||
Comment 4•14 years ago
|
||
These are now in production, so we can either morph this bug (get them stable + stay up, and/or track landing the blocking bugs), or close.
Assignee | ||
Comment 6•13 years ago
|
||
Attachment #507736 -
Flags: review?(aki)
Assignee | ||
Comment 7•13 years ago
|
||
Attachment #507737 -
Flags: review?(aki)
Reporter | ||
Updated•13 years ago
|
Attachment #507737 -
Flags: review?(aki) → review+
Reporter | ||
Comment 8•13 years ago
|
||
Comment on attachment 507736 [details] [diff] [review] add haltonfailure to install step for Android Asking Bear to add haltOnFailure to more steps.
Attachment #507736 -
Flags: review?(aki)
Assignee | ||
Comment 9•13 years ago
|
||
Attachment #507736 -
Attachment is obsolete: true
Attachment #507738 -
Flags: review?(aki)
Assignee | ||
Comment 10•13 years ago
|
||
(In reply to comment #7) > Created attachment 507737 [details] [diff] [review] > turn off tsvg, tdhtml and tspider until we can figure out orange committed changeset 3531:b51b429fc8fb
Reporter | ||
Comment 11•13 years ago
|
||
Comment on attachment 507738 [details] [diff] [review] add haltonfailure to install Android talos steps that talk to the tegras The haltOnFailure=True in addRunTestStep() needs to be changed back to False, otherwise you will prevent desktop Talos from running multiple talos suites in one run. r=me with that change.
Attachment #507738 -
Flags: review?(aki) → review+
Assignee | ||
Comment 12•13 years ago
|
||
(In reply to comment #11) > Comment on attachment 507738 [details] [diff] [review] > add haltonfailure to install Android talos steps that talk to the tegras > > The haltOnFailure=True in addRunTestStep() needs to be changed back to False, > otherwise you will prevent desktop Talos from running multiple talos suites in > one run. > > r=me with that change. that tweak applied and pushed committed changeset 1252:48f9d397d28c
Assignee | ||
Comment 13•13 years ago
|
||
adjusting priority to reflect that this is now a tracking bug for the various ateam/releng tegra projects
Severity: major → normal
Priority: -- → P3
Assignee | ||
Comment 14•13 years ago
|
||
adjusting summary to make this a tracking bug for rest of talos and reftests
Summary: bring tegras back into production → bring remaining talos suites and reftests to production tegras
Assignee | ||
Comment 15•13 years ago
|
||
Attachment #509908 -
Flags: review?(aki)
Reporter | ||
Updated•13 years ago
|
Attachment #509908 -
Flags: review?(aki) → review+
Reporter | ||
Comment 16•13 years ago
|
||
Comment on attachment 509908 [details] [diff] [review] turn on tdhtml, tsvg and tsspider tests for m-c http://hg.mozilla.org/build/buildbot-configs/rev/98e7507b7171
Attachment #509908 -
Flags: checked-in+
Reporter | ||
Updated•13 years ago
|
Summary: bring remaining talos suites and reftests to production tegras → production tegra stability improvements
Reporter | ||
Comment 17•13 years ago
|
||
Attachment #518763 -
Flags: review?(aki)
Reporter | ||
Comment 18•13 years ago
|
||
Comment on attachment 518763 [details] [diff] [review] bear's latest stability patch > if options.reset: > s += ' - resetting' >+ hbSocket.send('rebt\n') >+ # not interested in result >+ # taking advantage of the pause >+ d = hbSocket.recv(4096) > os.remove(errorFile) Clever. I'm guessing 4096 is 4 heartbeats? If so, a comment might be nice. You also have 4096 elsewhere; this might be a good thing to pull out into a var. > def handlesigterm(self, signum, frame): > if self.pidfile is not None: > try: >+ eventQueue.put(('terminate',)) > os.remove(self.pidfile) > except (KeyboardInterrupt, SystemExit): > raise > except Exception: > pass > sys.exit(0) Hm, is this to make sure the queue doesn't keep going? Is there a possibility that the queue will have other items in it that will be dealt with before the terminate? I'm wondering if this should be eventQueue.queue = ['terminate'] or something, unless we want it to keep processing. Maybe logging eventQueue.queue at the time of terminate might be good enough for now. >-def killPID(pidFile, signal='2'): >+def killPID(pid, signal='2'): Not relying on the pidfile is probably a good move. >+def sendReboot(ip, port): >+ log.warning('sending rebt to tegra') >+ try: >+ hbSocket = socket.socket(socket.AF_INET, socket.SOCK_STREAM) >+ hbSocket.settimeout(float(120)) >+ hbSocket.connect((ip, port)) >+ hbSocket.send('rebt\n') >+ hbSocket.recv(4096) >+ except: >+ log.debug('Error sending reboot') Cool. Guessing this also reboots and waits for 4 heartbeats? > tegraActive = True > if not bbActive: > if os.path.isfile(errorFile): >- log.warning('Tegra active but error flag set') >+ log.warning('Tegra active but error flag set [%d/%d]' % (softCount, softResets)) >+ softCount += 1 >+ if softCount > softCountMax: >+ softCount = 0 >+ if softResets < softResetMax: >+ softResets += 1 >+ log.info('removing error flag to see if tegra comes back') >+ os.remove(errorFile) >+ else: >+ hardResets += 1 >+ log.info('hard reset reboot check [%d/%d]' % (hardResets, hardResetsMax)) >+ if hardResets < hardResetsMax: >+ sendReboot(options.tegraIP, sutDataPort) I'd love a message here in an else:... "Tegra over hardResetsMax; requires manual intervention" or something. We may want it to otherwise notify us, rather than waiting for us to parse logs, but having it in the log message would be a good start. Should we also terminate if hardResets > hardResetsMax ? Also, you're incrementing softResets and hardResets in different places (softResets after the softResetMax compare; hardResets before) but you're using < for both, so we're going to have an off-by-one. >+ elif state == 'dialback': >+ softCount = 0 >+ softResets = 0 This only goes to dialback when clientproxy first starts? >+ elif state == 'terminate': >+ break The break doesn't seem final enough to me, but if it falls off the end of the __main__ function I suppose that works. >- if daemon is not None: >- daemon.stop() >+ # killPID(db.pid) >+ # killPID(mt.pid) >+ # >+ # db.join() >+ # mt.join() Testing different exits? I've got lots of comments above, some of which are questions and some of which I think need fixing. But I don't think I saw anything that prevents me from wanting to see this landed/run as long as the fixes happen.
Attachment #518763 -
Flags: review?(aki) → review+
Reporter | ||
Comment 19•13 years ago
|
||
Comment on attachment 518763 [details] [diff] [review] bear's latest stability patch http://hg.mozilla.org/build/tools/rev/e1f9fac67b62
Attachment #518763 -
Flags: checked-in+
Assignee | ||
Comment 20•13 years ago
|
||
This patch contains the changes worked out with the a-team for latest patched devicemanager.py and also the v1.0 watcher/agent apk's. It also has the resolution check and small cleanups. It may not be the last thing changed for this final stability push, but I wanted to get the current code in the source tree.
Attachment #521260 -
Flags: review?(aki)
Reporter | ||
Updated•13 years ago
|
Attachment #521260 -
Attachment is patch: true
Attachment #521260 -
Attachment mime type: application/octet-stream → text/plain
Reporter | ||
Updated•13 years ago
|
Attachment #521260 -
Flags: review?(aki) → review+
Assignee | ||
Comment 21•13 years ago
|
||
Comment on attachment 521260 [details] [diff] [review] changes to installApp to work with v1.0 watcher/agent apk committed changeset 1304:6637f555ff47
Attachment #521260 -
Flags: checked-in+
Reporter | ||
Comment 22•13 years ago
|
||
MobileTest is currently looking better than Mobile, test-wise, right now =P Then again, we don't seem to be running any unit tests there? If things look good in the morning, I suggest we: * verify unit tests work in staging * point foopy02's tegras (031-050; please create 045-050) to test-master01:9012 * potentially also point foopy03's 051-054 at test-master01:9012 * file a bug to image 055-094, and possibly 001-030. * when the tegras are imaged, decide which we keep in staging and which we point at production. I think at least 3 foopy's worth of tegras (60 tegras) should be in production.
Assignee | ||
Comment 23•13 years ago
|
||
config.py added to tools/sut_tools/
Attachment #521694 -
Flags: review?(aki)
Reporter | ||
Updated•13 years ago
|
Attachment #521694 -
Flags: review?(aki) → review+
Assignee | ||
Comment 24•13 years ago
|
||
add new config step and also change reboot to be a disconnect step for remote reftests
Attachment #521695 -
Flags: review?(aki)
Reporter | ||
Comment 25•13 years ago
|
||
Comment on attachment 521695 [details] [diff] [review] update to buildbotcustom process/factory.py to add new config step for reftests I'm not a huge fan of the generic name "config" for a script that only changes screen resolutions for reftests, but I want this stuff green first, nitpick later.
Attachment #521695 -
Flags: review?(aki) → review+
Assignee | ||
Comment 26•13 years ago
|
||
Comment on attachment 521694 [details] [diff] [review] new sut_tools helper script to set screen resolution committed changeset 1306:309b378527a2
Attachment #521694 -
Flags: checked-in+
Assignee | ||
Comment 27•13 years ago
|
||
Comment on attachment 521695 [details] [diff] [review] update to buildbotcustom process/factory.py to add new config step for reftests committed changeset 1433:377379e0b359
Attachment #521695 -
Flags: checked-in+
Reporter | ||
Comment 28•13 years ago
|
||
Tracking bug that's tracking 1 low priority bug. -> RESO FIXED
Status: NEW → RESOLVED
Closed: 13 years ago
Resolution: --- → FIXED
Updated•11 years ago
|
Product: mozilla.org → Release Engineering
You need to log in
before you can comment on or make changes to this bug.
Description
•