Closed
Bug 772531
Opened 13 years ago
Closed 10 years ago
detect when a tegra reboots via log file parsing, mark test as red and allow tbpl to get accurate information
Categories
(Release Engineering :: General, defect)
Tracking
(Not tracked)
RESOLVED
WONTFIX
People
(Reporter: jmaher, Unassigned)
References
Details
Attachments
(4 files, 1 obsolete file)
929 bytes,
patch
|
wlach
:
review+
|
Details | Diff | Splinter Review |
2.79 KB,
patch
|
armenzg
:
review+
jmaher
:
feedback+
armenzg
:
checked-in+
|
Details | Diff | Splinter Review |
954 bytes,
patch
|
wlach
:
review+
|
Details | Diff | Splinter Review |
498 bytes,
patch
|
armenzg
:
review-
|
Details | Diff | Splinter Review |
I have detected a pattern in our log files that show up when the tegra reboots:
Error receiving data from socket. cmd=ps
; err=[Errno 54] Connection reset by peer
reconnecting socket
This makes sense and causes us to report orange although no tests have failed.
If we detect this pattern in the log output, I would like to mark the test red, and display a message to TBPL that would incdicate "Android device rebooted (; err=[Errno 54] Connection reset by peer)"
Comment 1•13 years ago
|
||
We would get this for free with mozharness.
There's also buildbotcustom parsing for unit tests.
Reporter | ||
Comment 2•13 years ago
|
||
well, the real problem is why are we getting a high rate of reboots...bad tegras, bad sdcards, or tests doing troublesome things. Mozharness would be awesome, and in the meantime, lets get this added to our infrastructure.
Comment 3•13 years ago
|
||
The colour is determined by the status code that buildbot provides.
What TBPL shows in the annotated summary box is determined by:
http://hg.mozilla.org/users/mstange_themasta.com/tinderboxpushlog/file/e2e344885c80/php/inc/GeneralErrorFilter.php#l24
Alternatively, the log can just prefix the line with "TinderboxPrint: " and we'll have the message shown in the list of bullets below the annotated summary box.
Reporter | ||
Comment 4•13 years ago
|
||
maybe the best method is to return a proper error from the harness and output "Automation error:" at the beginning of one of the lines.
Comment 5•13 years ago
|
||
Detecting a line in the log and taking an action can be accomplished by:
1) by adding a log_eval_func to the specific step. [1]
This can be done by either adding it in-step [2] or on the errors.py [3] file and importing it.
The steps to be modified are in RemoteUnittestFactory [4]
To add "TinderboxPrint:" messages we can modify createSummary IIUC. jlund would know as he has been looking at that recently.
"Automation error:" can be added to errors.py.
[1]
log_eval_func=lambda c,s: regex_log_evaluator(c, s, talos_hgweb_errors),
[2]
http://hg.mozilla.org/build/buildbotcustom/file/default/misc.py#l419
420 log_eval_func=lambda c,s: regex_log_evaluator(c, s, (
420 (re.compile('# TBPL WARNING #'), WARNINGS),
421 (re.compile('# TBPL FAILURE #'), FAILURE),
422 (re.compile('# TBPL EXCEPTION #'), EXCEPTION),
423 (re.compile('# TBPL RETRY #'), RETRY),
424 ))
[3]
http://hg.mozilla.org/build/buildbotcustom/file/281fc2c2fa9e/status/errors.py
[4] http://hg.mozilla.org/build/buildbotcustom/file/default/process/factory.py#l6793
Comment 6•13 years ago
|
||
(In reply to Joel Maher (:jmaher) from comment #4)
> maybe the best method is to return a proper error from the harness and
> output "Automation error:" at the beginning of one of the lines.
This would be my preference (over TinderboxPrint) :-)
Reporter | ||
Comment 7•13 years ago
|
||
this patch looks for a known error and delivers a tbpl parsable message. This doesn't change the job red, but we can build from this.
Attachment #640728 -
Flags: review?(wlachance)
Comment 8•13 years ago
|
||
Comment on attachment 640728 [details] [diff] [review]
report a real error from devicemanager when we receive a known failure state (1.0)
Idea of patch is good but could be improved in a few ways to be more robust and clear. I'll give an r+ assuming the below is implemented (or given a good reason why not). ;)
># HG changeset patch
># Parent 2e702f2948b447e46e28cb3094308c6846f254f6
>Bug 772531 - detect when a tegra reboots via log file parsing, mark test as red and allow tbpl to get accurate information. r=wlach
>
>diff -r 2e702f2948b4 build/mobile/devicemanagerSUT.py
>--- a/build/mobile/devicemanagerSUT.py Tue Jul 10 14:56:47 2012 -0400
>+++ b/build/mobile/devicemanagerSUT.py Tue Jul 10 15:46:43 2012 -0400
>@@ -213,6 +213,9 @@
> except socket.error, msg:
Change this to:
except socket.error, err:
(you'll see why below)
> self._sock.close()
> self._sock = None
>+ # This error shows up with we have our tegra rebooted.
>+ if str(msg.strip()) == "[Errno 54] Connection reset by peer":
I think it's better to just look for the errno, rather than examining a string whose format could change.
I'd suggest replacing this with:
> if err[0] == errno.ECONNRESET:
(you'll also need to import errno on top)
>+ raise AgentError("Automation error: Error receiving data from socket (possible reboot). cmd="+str(cmd)+"; err="+str(msg), True)
I find this kind of concatenation less clear than the standard method, where we substitute in variables via '%'. Also, there's no need to cast to string explicitly. So try:
raise AgentError("Automation error: Error receiving data from socket (possible reboot). cmd=%s; err=%s" % (cmd, msg), True)
Attachment #640728 -
Flags: review?(wlachance) → review+
Comment 9•13 years ago
|
||
I'd be glad to *try* add the glue needed for this from the buildbot end.
I have noticed that one of buildbots error checking for 'upload_errors' would have caught this if it was considered an upload step. However it would have set this to RETRY AKA Blue not red: https://github.com/mozilla/buildbotcustom/blob/master/status/errors.py#L28
I am wondering if the Tegra is rebooting, should we be going for the same(blue) and signalling for a Retry?
Are there any logs of this?
I can take a look into parsing Joel's recent append: 'Automation error: Error receiving data from socket' and change the color accordingly in buildbot.
Comment 10•13 years ago
|
||
I wonder if adding something like this to global_errors would work without any further changes needed.
This would set buildbots return status to RETRY and in return blue in tbpl.
I think for the tegra's unittests we are already passing global errors here:
http://mxr.mozilla.org/build/source/buildbotcustom/steps/base.py#17
as I dont think we are setting log_eval_func but I will certainly need confirmation from someone who knows more
Comment 11•13 years ago
|
||
I suppose this could have dangerous effects since its 'global' and lots of other builds would pass through it. I am not sure if 'connection reset by peer' appears in other logs and should be ignored AKA stay green/success
Reporter | ||
Comment 12•13 years ago
|
||
I am not sure either. That is something we should test in staging with desktop builds. If we see that on an Android build I would say we are safe to throw an error. I have been testing this for a while today and about half the reboots fall into that category.
Updated•13 years ago
|
Component: Release Engineering: Automation (Release Automation) → Release Engineering: Automation (General)
QA Contact: bhearsum → catlee
Comment 13•13 years ago
|
||
it would probably be safer to add an argument to the 'step' its rebooting and check for jmahers: 'Automation Error' via something among the lines of:
log_eval_func=lambda c,s: regex_log_evaluator(c, s, (
global_errors + ((re.compile("Automation error: Error receiving data from socket (possible reboot)"), FAILURE),)
Updated•13 years ago
|
Assignee: nobody → jlund
Comment 14•13 years ago
|
||
this patch should be good for testing in staging *once* joel's patch lands as this new patch will not check for 'connection reset by peer' but instead looks for 'Automation error: Error'.
I think this approach is cleaner and safer as it only targets tegra machines specifically for unittests.
Attachment #640768 -
Attachment is obsolete: true
Reporter | ||
Comment 15•13 years ago
|
||
landed on inbound:
https://hg.mozilla.org/integration/mozilla-inbound/rev/a4b0d896e9ca
Comment 16•12 years ago
|
||
Status: NEW → RESOLVED
Closed: 12 years ago
Resolution: --- → FIXED
Comment 17•12 years ago
|
||
We forgot to add a whiteboard to prevent devs from closing this bug.
jlund this is still an issue.
Can I review your bug?
Status: RESOLVED → REOPENED
Resolution: FIXED → ---
Updated•12 years ago
|
Attachment #641174 -
Flags: review?(armenzg)
Comment 18•12 years ago
|
||
Comment on attachment 641174 [details] [diff] [review]
checks for tegra reboot error and changes buildbot to red / failure
Review of attachment 641174 [details] [diff] [review]:
-----------------------------------------------------------------
This looks great.
jmaher, do you want the jobs to turn red/FAILURE? or RETRY?
Attachment #641174 -
Flags: review?(armenzg)
Attachment #641174 -
Flags: review+
Attachment #641174 -
Flags: feedback?(jmaher)
Reporter | ||
Comment 19•12 years ago
|
||
Ideally retry, but I would really like a way to track this. If they turn red and we can star them easily to track the frequency that would be fine as well.
Reporter | ||
Comment 20•12 years ago
|
||
Attachment #643496 -
Flags: review?(wlachance)
Reporter | ||
Comment 21•12 years ago
|
||
Comment on attachment 641174 [details] [diff] [review]
checks for tegra reboot error and changes buildbot to red / failure
Review of attachment 641174 [details] [diff] [review]:
-----------------------------------------------------------------
looks good to me
Attachment #641174 -
Flags: feedback?(jmaher) → feedback+
Comment 22•12 years ago
|
||
I will land and reconfigure the masters in the morning.
I don't want to have any moving pieces while Callek is doing any investigations.
Comment 23•12 years ago
|
||
Comment on attachment 643496 [details] [diff] [review]
print out the agent error message (1.0)
Yep, this is the right thing.
Attachment #643496 -
Flags: review?(wlachance) → review+
Reporter | ||
Comment 24•12 years ago
|
||
landed fix for printing the error message:
https://hg.mozilla.org/integration/mozilla-inbound/rev/27884b05391b
Comment 25•12 years ago
|
||
Status: REOPENED → RESOLVED
Closed: 12 years ago → 12 years ago
Resolution: --- → FIXED
Comment 26•12 years ago
|
||
Comment on attachment 641174 [details] [diff] [review]
checks for tegra reboot error and changes buildbot to red / failure
I reconfigured the masters with jlund's flagging reboots code.
http://hg.mozilla.org/build/buildbotcustom/rev/def40730fbbe
Let's now hope that we catch *only* the right thing.
Attachment #641174 -
Flags: checked-in+
Comment 27•12 years ago
|
||
In case we have to land any other patches to m-i (adding [leave open]).
https://wiki.mozilla.org/Tree_Rules/Inbound#Please_do_the_following_after_pushing_to_inbound
Status: RESOLVED → REOPENED
Resolution: FIXED → ---
Whiteboard: [leave open]
Comment 28•12 years ago
|
||
armen. I think this is the problem.
I am wondering is there a way I can debug what has been landed in buildbot or test this on dev-master. After a silly mess up, I would like to make sure I am reading and implementing this right. What do you think is best to do?
Attachment #643982 -
Flags: review?(armenzg)
Comment 29•12 years ago
|
||
Comment on attachment 643982 [details] [diff] [review]
changes talos_error to RETRY instead of FAILURE
I have thought about it a little more.
We don't have more capacity to RETRY and we don't know yet how often this happens.
We should RETRY but for now let's try to spot occurences.
jmaher, could you please get a patch when you can return an error code (2) rather than a WARNING code (1)?
I want to prevent to re-queue even more jobs.
This is in accordance to:
https://github.com/buildbot/buildbot/blob/master/master/buildbot/status/results.py#L16
Attachment #643982 -
Flags: review?(armenzg) → review-
Updated•12 years ago
|
Whiteboard: [leave open] → [leave open][sheriff-want]
Reporter | ||
Comment 30•12 years ago
|
||
I don't think this is critical anymore. My patch to get the harnesses to bubble the information from devicemanager to sys.exit() in the harness is horribly hacky and didn't quite work so well.
I can leave this out there as a bug and tackle it some other day.
Comment 31•12 years ago
|
||
Much of this bug has been covered by bug 793091, bug 790613, bug 790595, bug 794017, bug 793678 & bug 790618; to the extent that this bug doesn't affect sheriffs much now so removing bug 778688 from dependency chain.
No longer blocks: 778688
Whiteboard: [leave open][sheriff-want]
Assignee | ||
Updated•11 years ago
|
Product: mozilla.org → Release Engineering
Comment 32•11 years ago
|
||
I am not actively working on this.
my focus/priority is primarily here:
- Bug 858797 - fx desktop builds in mozharness
Assignee: jlund → nobody
Comment 34•10 years ago
|
||
Ew, tegra. Get it away from me!
Status: REOPENED → RESOLVED
Closed: 12 years ago → 10 years ago
Flags: needinfo?(coop)
Resolution: --- → WONTFIX
Assignee | ||
Updated•7 years ago
|
Component: General Automation → General
You need to log in
before you can comment on or make changes to this bug.
Description
•