Bug 739384 (Closed)
Deploy to production foopies verification scripts (changes from Bug 690311 Part 1)
Opened 13 years ago • Closed 13 years ago
Categories: Infrastructure & Operations Graveyard :: CIDuty, task, P1
Tracking: Not tracked
Status: RESOLVED FIXED
People: Reporter: Callek; Assigned: armenzg
Description
Armen asked me to file a bug against him for this.
We need to deploy Bug 690311 Part 1 to the various foopies in a bit of a staged rollout (so that if something breaks, we don't break everything at once).
He said he will tackle the first prod foopy of this in his AM.
Assignee
Comment 1•13 years ago
I am going to do foopy07
https://build.mozilla.org/buildapi/recent/tegra-031
https://build.mozilla.org/buildapi/recent/tegra-032
https://build.mozilla.org/buildapi/recent/tegra-035
https://build.mozilla.org/buildapi/recent/tegra-036
https://build.mozilla.org/buildapi/recent/tegra-037
https://build.mozilla.org/buildapi/recent/tegra-038
https://build.mozilla.org/buildapi/recent/tegra-039
https://build.mozilla.org/buildapi/recent/tegra-199
https://build.mozilla.org/buildapi/recent/tegra-200
Assignee
Comment 2•13 years ago
Steps to follow:
* load up bm19 and each of the slaves for that foopy
* hit graceful shutdown for each
* once *all* of them are done, run /builds/stop_cp.sh (it stops all tegras on that foopy)
** you can modify stop_cp.sh to run each stop.py call in the background (see the sketch below)
* once *all* have stopped, update the /builds/tools checkout
* run /builds/start_cp.sh (it starts all tegras on that foopy)
I added the --debug flag to start_cp.sh.
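For anyone following along, a minimal sketch of that backgrounding tweak, assuming stop_cp.sh just loops over the tegra directories under /builds and calls stop.py once per device; the real script and the exact stop.py invocation may differ:

#!/bin/bash
# Hypothetical sketch of stop_cp.sh with each stop.py call backgrounded so all
# tegras on this foopy stop in parallel; the device loop and the stop.py
# invocation are assumptions, not the real script contents.
cd /builds
for tegra in tegra-*; do
    [ -d "$tegra" ] || continue                          # skip non-tegra entries
    python /builds/tools/sut_tools/stop.py "$tegra" &    # send each stop to the background
done
wait   # only return once every backgrounded stop.py has finished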
Assignee
Comment 3•13 years ago
Callek, would you mind having a look at tegra-036, tegra-037 and tegra-200?
I can add your pub key to foopy07.
It seems that all of them started, but these three have failed on some jobs since starting.
I assume this is fine but I would prefer to have an extra pair of eyes.
I wish we had a way to gracefully shut down a tegra when it fails and make it run through the verify steps, or something like that.
An interesting case is this one:
http://buildbot-master19.build.mtv1.mozilla.com:8201/builders/Android%20Tegra%20250%20mozilla-inbound%20talos%20remote-twinopen/builds/1545
There, the job goes well until it hits "Run Performance tests", which fails, and then everything after that fails.
Assignee
Comment 4•13 years ago
I guess we can get a little bit of insight by looking at these:
http://mobile-dashboard1.build.mtv1.mozilla.com/tegras/tegra-037_status.log
http://mobile-dashboard1.build.mtv1.mozilla.com/tegras/tegra-036_status.log
I really don't like the message "SUTAgent not present;" but I bet that is completely unrelated to our changes.
I think tegra-037 was not in good shape even prior to our changes.
Assignee
Comment 5•13 years ago
I realized that I ran ./start_cp.sh without "screen -x". Why have the jobs not failed?
Comment 6•13 years ago
(In reply to Armen Zambrano G. [:armenzg] - Release Engineer from comment #5)
> I realized that I run ./start_cp.sh without "screen -x". Why have the jobs
> not failed?
They don't fail to start; what happens is that they will fail at some point in the future. How long that takes is random (or seems to be).
Comment 7•13 years ago
(In reply to Armen Zambrano G. [:armenzg] - Release Engineer from comment #4)
> I guess we have a little bit of insight by looking at this:
> http://mobile-dashboard1.build.mtv1.mozilla.com/tegras/tegra-037_status.log
> http://mobile-dashboard1.build.mtv1.mozilla.com/tegras/tegra-036_status.log
>
> I really don't like the message "SUTAgent not present;" but I bet that is
> completely unrelated to our changes.
>
> I think tegra-037 is not in a good shape even prior to our changes.
"SUTAgent not present" just means that clientproxy could not connect to the SUTAgent daemon running on the Tegra. Who knows why the SUTAgent isn't running - that is beyond clientproxies ability to determine.
Reporter
Comment 8•13 years ago
(In reply to Armen Zambrano G. [:armenzg] - Release Engineer from comment #5)
> I realized that I run ./start_cp.sh without "screen -x". Why have the jobs
> not failed?
So we have many reds on m-c and inbound from runs with the tegras attached to foopy07. What I can see so far indicates it's not because of my code, but is from this deploy. I think we should stop_cp and re-run start_cp with the proper "screen -x".
Symptoms are |hg clone| failing because python cannot |import site|, with errors like the following:
hg clone http://hg.mozilla.org/build/tools tools
in dir /builds/tegra-035/test/. (timeout 1320 secs)
watching logfiles {}
argv: ['hg', 'clone', 'http://hg.mozilla.org/build/tools', 'tools']
environment:
PATH=/opt/local/bin:/opt/local/sbin:/opt/local/Library/Frameworks/Python.framework/Versions/2.6/bin:/usr/bin:/bin:/usr/sbin:/sbin:/usr/local/bin:/usr/X11/bin
PWD=/builds/tegra-035/test
SUT_IP=10.250.49.22
SUT_NAME=tegra-035
__CF_USER_TEXT_ENCODING=0x1F6:0:0
closing stdin
using PTY: False
'import site' failed; use -v for traceback
Traceback (most recent call last):
File "/opt/local/bin/hg", line 38, in <module>
mercurial.dispatch.run()
File "/opt/local/Library/Frameworks/Python.framework/Versions/2.6/lib/python2.6/site-packages/mercurial/dispatch.py", line 16, in run
sys.exit(dispatch(sys.argv[1:]))
File "/opt/local/Library/Frameworks/Python.framework/Versions/2.6/lib/python2.6/site-packages/mercurial/dispatch.py", line 21, in dispatch
u = uimod.ui()
File "/opt/local/Library/Frameworks/Python.framework/Versions/2.6/lib/python2.6/site-packages/mercurial/ui.py", line 35, in __init__
for f in util.rcpath():
File "/opt/local/Library/Frameworks/Python.framework/Versions/2.6/lib/python2.6/site-packages/mercurial/util.py", line 1346, in rcpath
_rcpath = os_rcpath()
File "/opt/local/Library/Frameworks/Python.framework/Versions/2.6/lib/python2.6/site-packages/mercurial/util.py", line 1321, in os_rcpath
path.extend(user_rcpath())
File "/opt/local/Library/Frameworks/Python.framework/Versions/2.6/lib/python2.6/site-packages/mercurial/posix.py", line 53, in user_rcpath
return [os.path.expanduser('~/.hgrc')]
File "/opt/local/Library/Frameworks/Python.framework/Versions/2.6/lib/python2.6/posixpath.py", line 259, in expanduser
userhome = pwd.getpwuid(os.getuid()).pw_dir
KeyError: 'getpwuid(): uid not found: 502'
program finished with exit code 1
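The KeyError at the bottom is the tell: the uid running hg (502) has no passwd entry visible to the process, so expanding ~/.hgrc blows up. A quick, hedged way to check for that state from a shell on the foopy (pure diagnostics, not part of the deploy):

# Hypothetical diagnostics: does the current uid resolve to a passwd entry?
# This is the same lookup mercurial hits via os.path.expanduser('~/.hgrc').
id -u
python -c "import os, pwd; print(pwd.getpwuid(os.getuid()))"
# In the broken (non-screen) session the second command raises
# KeyError: 'getpwuid(): uid not found: 502', matching the log above.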
There is a small code error, though, that affects the reporting (not the outcome) of verify.py. I'll attach it to bug 690311 momentarily and land it as soon as it gets review.
Comment 9•13 years ago
I'm restarting clientproxy via screen for all the tegras connected to foopy07. Will update when done.
Comment 10•13 years ago
(In reply to Chris Cooper [:coop] from comment #9)
> I'm restarting clientproxy via screen for all the tegras connected to
> foopy07. WIll update when done.
This is done now.
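For anyone repeating this later, the restart presumably looks something like the sequence below; the hostname and the session handling are assumptions, and the scripts are the ones described in comment 2:

# Hypothetical restart-under-screen sequence on the foopy:
ssh foopy07
screen -x          # reattach to the existing screen session first (do not skip this)
cd /builds
./stop_cp.sh       # stop clientproxy for every tegra on this foopy
./start_cp.sh      # start them again from inside the screen session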
Assignee
Comment 11•13 years ago
Thanks coop.
Callek, the jobs since the restart seem good.
Assignee
Comment 12•13 years ago
Callek, anything left before I take care of the remaining foopies?
I would like to do this tomorrow morning.
Summary: Deploy changes from Bug 690311 Part 1 → Deploy to production foopies verification scripts (changes from Bug 690311 Part 1)
Assignee
Updated•13 years ago
Priority: -- → P1
Reporter
Comment 13•13 years ago
We are good to go forward with the rest of these. Though in IRC you suggested possibly waiting until Friday. If you do delay, it will not hold me up, so whichever.
Assignee
Comment 14•13 years ago
I will be doing this tomorrow as today was busy for me.
Assignee
Comment 15•13 years ago
bm19 has been done (using screen).
I will now work on bm20.
Assignee
Comment 16•13 years ago
For the record, the original revision of tools was dace6c4e8902, in case we need to revert to it (I really hope we don't).
bm20 is down and I am now stopping all foopies/tegras for it.
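If we ever do need that revert, it would presumably be a matter of pinning the tools checkout back to that revision on each foopy, with clientproxy stopped around it as in comment 2. Sketch only, not something we ran:

# Hypothetical revert of the /builds/tools checkout on a foopy:
cd /builds/tools
hg pull                      # make sure all revisions are present locally
hg update -r dace6c4e8902    # pin back to the pre-deploy revision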
Assignee
Comment 17•13 years ago
This is done. Now, let's watch that nothing breaks.
Status: NEW → RESOLVED
Closed: 13 years ago
Resolution: --- → FIXED
Reporter
Comment 18•13 years ago
Sooooo, looks like we were not done here: bm20 wasn't updated; bear accomplished most of that for us.
bm19 has at _least_ one foopy we never updated, so we need to check all that as well.
I wrote up an etherpad for the relevant info we have/need here: https://etherpad.mozilla.org/usN39EKFDB
Status: RESOLVED → REOPENED
Resolution: FIXED → ---
Reporter
Comment 19•13 years ago
(In reply to Justin Wood (:Callek) from comment #18)
> bm19 has at _least_ one foopy we never updated, so we need to check all that
> as well.
Besides that one foopy, there was only one other to update (foopy09 and foopy19 for history). Those are done now, thanks bear!
Status: REOPENED → RESOLVED
Closed: 13 years ago → 13 years ago
Resolution: --- → FIXED
Updated•11 years ago
Product: mozilla.org → Release Engineering
Updated•7 years ago
Product: Release Engineering → Infrastructure & Operations
Updated•5 years ago
Product: Infrastructure & Operations → Infrastructure & Operations Graveyard