Closed
Bug 1018531
Opened 11 years ago
Closed 11 years ago
Panda tests retrying more than necessary
Categories
(Release Engineering :: General, defect)
Release Engineering
General
Tracking
(Not tracked)
RESOLVED
FIXED
People
(Reporter: coop, Assigned: pmoore)
References
Details
(Whiteboard: [capacity][mobile][panda])
Attachments
(1 file, 1 obsolete file)
2.24 KB,
patch
|
rail
:
review+
pmoore
:
checked-in+
|
Details | Diff | Splinter Review |
We started updating the tools checkout on the foopies during reconfigs as part of bug 1018248 today. This is absolutely the right thing to do and should help us not get out of sync there again, but it's had some unintended consequences this first time around.
Looking at the results for panda jobs after the tools checkout was updated on the foopies today, we see an immediate increase in the number of retries:
Before update:
https://tbpl.mozilla.org/?tree=Mozilla-Inbound&rev=ef6fc5965a4c
After update:
https://tbpl.mozilla.org/?tree=Mozilla-Inbound&rev=9b01df568861
Diving into the "after" tests for Android 4.0 mochist-3 opt (the first example where we have multiple failures followed by a successful test), the "failed" examples all contain a log line like this:
08:49:33 INFO - Caught Exception: Remote Device Error: unable to connect to panda-0434 after 5 attempts
From: https://tbpl.mozilla.org/php/getParsedLog.php?id=40735272&tree=Mozilla-Inbound&full=1#error0
This error doesn't seem to be fatal at all. Indeed, the rest of the mochitest runs to completion afterwards but the error parser flags it as needing a retry:
https://hg.mozilla.org/build/buildbotcustom/file/b047f08ca8b4/status/errors.py#l9
In this particular case, we "failed" 4 times at 23 minutes a piece before finally succeeding. Every one of those tests actually succeeded. :(
I know we're surfacing more exceptions now in an effort to get them fixed. This seems like one of the first cases we need to fix.
We could change the error parser while we work on a proper fix. Testing in my python console, I think a regexp like this should work in status/errors.py#l9:
re.compile("(?<!INFO - Caught Exception: )Remote Device Error")
Pete: assigning this to you because one way of another you'll need to deal with this as buildduty on Monday. You also understand the problem space better than most.
Assignee | ||
Comment 1•11 years ago
|
||
Tomcat|sheriffduty
01:42:25 PDT: see https://tbpl.mozilla.org/?tree=Mozilla-Inbound&rev=fd9518344a23
01:42:27 PDT: the blue ones
01:42:47 PDT: all different pandas with Caught Exception: Remote Device Error: unable to connect to panda-0627 after 5 attempts etc
01:44:12 PDT: and seems a general problem when i load https://tbpl.mozilla.org/?tree=Mozilla-Inbound
01:44:17 PDT: lots of blue :)
Assignee | ||
Comment 2•11 years ago
|
||
So the tools version deployed before was 774756f24bef. The current deployed version is 9fdad0dcf1bc. Just checking the delta for obvious changes that could affect.
Assignee | ||
Comment 3•11 years ago
|
||
Nothing obviously suspect at first glance. E.g for sut tools, the only change is:
$ git diff 2b990522b8fa9714ff0afb925fb725f190408c36..28e463a044aaea3f650ca8a07f10a969acb3ea1b -- sut_tools
diff --git a/sut_tools/mozdevice/devicemanagerSUT.py b/sut_tools/mozdevice/devicemanagerSUT.py
index 1733e49..00685b9 100644
--- a/sut_tools/mozdevice/devicemanagerSUT.py
+++ b/sut_tools/mozdevice/devicemanagerSUT.py
@@ -991,7 +991,8 @@ class DeviceManagerSUT(DeviceManager):
else:
try:
data = self._runCmds([{'cmd': 'testroot'}])
- except:
+ except BaseException as e:
+ print "Caught Exception:", e
return None
deviceRoot = data.strip() + '/tests'
$
And this should be harmless. In any case, I'll check the logs to see if we hit this "Caught Exception" line...
Assignee | ||
Comment 4•11 years ago
|
||
So I have a first suspicion!!
I *think* what might be the problem, is that previously, we were swallowing this exception, and not reporting on it. Now we still don't process it (we have the except clause) but we do mention that we caught this error in the logs. We didn't used to do that.
I *think* the system that decides this was a "blue build" and needs retrying is actually parsing our logs, and seeing the word "Exception" and therefore thinking that this build has failed, when in fact it has passed.
This is just a hypothesis at this point... Will investigate further.
Assignee | ||
Comment 5•11 years ago
|
||
Test output below. You see *with* the "Caught Exception:" no SRE_Match object is returned (i.e. the line is not matched) but without the "Caught Exception:" the SRE_Match object is returned, as required.
Python 2.7.6rc1 (v2.7.6rc1:4913d0e9be30+, Oct 27 2013, 20:52:11)
[GCC 4.2.1 (Apple Inc. build 5666) (dot 3)] on darwin
Type "help", "copyright", "credits" or "license" for more information.
>>> import re
>>> print re.search("(?<!INFO - Caught Exception: )Remote Device Error", "09:26:32 INFO - Caught Exception: Remote Device Error: unable to connect to panda-0361 after 5 attempts")
None
>>> print re.search("(?<!INFO - Caught Exception: )Remote Device Error", "09:26:32 INFO - Remote Device Error: unable to connect to panda-0361 after 5 attempts")
<_sre.SRE_Match object at 0x10047db28>
>>> quit()
Attachment #8432414 -
Flags: review?(sbruno)
Assignee | ||
Comment 6•11 years ago
|
||
Actually, let me add a comment too - to explain this strange line :D
Attachment #8432414 -
Attachment is obsolete: true
Attachment #8432414 -
Flags: review?(sbruno)
Attachment #8432418 -
Flags: review?(sbruno)
Updated•11 years ago
|
Attachment #8432418 -
Flags: review?(sbruno) → review+
Assignee | ||
Comment 7•11 years ago
|
||
Comment on attachment 8432418 [details] [diff] [review]
bug1018531_buildbotcustom_v2.patch
https://hg.mozilla.org/build/buildbotcustom/rev/199ec58139d8
Attachment #8432418 -
Flags: checked-in+
Assignee | ||
Updated•11 years ago
|
Status: NEW → RESOLVED
Closed: 11 years ago
Resolution: --- → FIXED
Assignee | ||
Comment 8•11 years ago
|
||
In production
Updated•8 years ago
|
Component: Tools → General
You need to log in
before you can comment on or make changes to this bug.
Description
•