Closed Bug 876159 Opened 11 years ago Closed 11 years ago

Test runs which fail and then exceed the maximum log size are marked as passing

Categories

(Release Engineering :: General, defect)

defect
Not set
normal

Tracking

(Not tracked)

RESOLVED FIXED

People

(Reporter: philor, Unassigned)

References

Details

Attachments

(5 files, 1 obsolete file)

https://tbpl.mozilla.org/php/getParsedLog.php?id=23400763&tree=Mozilla-Inbound is a debug Ubuntu crashtest run, which exceeded the maximum log size, but is marked as green.

Every tree is closed.
Those crashtests may well be passing, the summary comes from https://tbpl.mozilla.org/php/getParsedLog.php?id=23402698&tree=Try where Ms2ger spotted it, because his try push did break mochitest-1, but the debug ones which exceeded the maximum log size were marked as green.
And we can easily get the linux crashtests down below the log limit, should just be a matter of 'hg backout -r 6115fce2399d'. I'm only holding off on that because we might need the permafailgreen to test fixing detection of it.

Doesn't solve all our undetected problems, though, because https://tbpl.mozilla.org/php/getParsedLog.php?id=23401583&tree=Mozilla-Inbound is a Win7 debug mochitest-2 that OOMs and overflows the log (intermittently).
r=me to comment out that line with a reference to this bug
(xpcom/base/nsCycleCollector.cpp, line 1890)
Or you can just delete that warning, rather than commenting it out.  I was planning on removing it next week anyways.
My possibly-flawed understand of how we used to catch these is that http://mxr.mozilla.org/build/source/buildbotcustom/steps/unittest.py#230 insisted on seeing the "Failed: 0" line, and if it didn't see it, then the run failed.

http://mxr.mozilla.org/build/source/mozharness/mozharness/mozilla/testing/unittest.py#120 looks to be the mozharness equivalent, where it thinks that in that case fail_count will still be the -1 it was inited as, but apparently that doesn't work: both truncated logs which have an error in the non-truncated part and should have set fail_count to > 0, and truncated logs which don't have an error in the non-truncated part and should have left it at -1 wind up with SUCCESS.
Well, there are other issues here.  We have one case where even in the truncated log, there are test failures yet it is green so it seems that the log being truncated bypasses running through the parser at all and sets the build green.  I think bypassing the parser altogether is fine, but if the log is so much bigger than expected that it has to be truncated, that condition alone should be sufficient to turn the build red.
Not sure if this will work, but it's worth a shot.
Attachment #754171 - Flags: review?(philringnalda)
Comment on attachment 754171 [details] [diff] [review]
detect truncation -> burn

I'm pretty sure it won't work, because the message is a buildbot HEADER rather than an ANYTHINGYOUCANACTUALLYSEE, but Callek thought it was possible that it was visible to log parsing despite that, so if you're willing to do a reconfig I'm more than willing to trigger a crashtest that will overflow to see.
Attachment #754171 - Flags: review?(philringnalda) → review+
The other approach we can take is to use rc_eval_func instead of regex_eval_func, and change the scripts to set a return status as opposed to spitting out a string.

That approach would take a lot more changes, so I'm going to reconfig to see if this works first.
Also, we've been talking about uploading logs, configs, etc. from test jobs for a while now.  That would allow for uploading the un-truncated info log, the raw log and other log level logs, the localconfig.json, screenshots, or anything else we put in the upload dir for uploading.  There would still need to be something telling buildbot what the status is, though.
(In reply to Andrew McCreight [:mccr8] from comment #3)
> r=me to comment out that line with a reference to this bug

Not enough, https://tbpl.mozilla.org/php/getParsedLog.php?id=23409164&full=1&branch=try is the same OOM without the CC warning, still overflowing. Kill the nsDeque OOM warning? Kill imptests? I'll kill imptests if you tell me to, just tell me to, I don't mind, I'll kill them, say the word.
Untested, but should work in theory.
We'd need to change the regex_log_evaluator() to rc_eval_func() in buildbotcustom.
Attachment #754181 - Flags: review?
Comment on attachment 754184 [details] [diff] [review]
[custom] rc_eval_func instead of regex_log_evaluator

5.. manually canceled?
Attachment #754184 - Flags: review?
(In reply to Justin Wood (:Callek) from comment #17)
> Comment on attachment 754184 [details] [diff] [review]
> [custom] rc_eval_func instead of regex_log_evaluator
> 
> 5.. manually canceled?

How would we detect that in mozharness?
Comment on attachment 754181 [details] [diff] [review]
[mozharness] set self.return_code, lower touch version

You need to assign a reviewer.
Attachment #754181 - Flags: review? → review+
Comment on attachment 754184 [details] [diff] [review]
[custom] rc_eval_func instead of regex_log_evaluator

"Dunno": RETRY always scares me because there are so many things I dunno, but there's no better time than now to risk infinite retries :)
Attachment #754184 - Flags: review? → review+
Comment on attachment 754181 [details] [diff] [review]
[mozharness] set self.return_code, lower touch version

https://hg.mozilla.org/build/mozharness/rev/d90aad39ac1f

Landed on default.
I'll merge to production once the reconfig's done.
Attachment #754181 - Flags: checked-in+
Comment on attachment 754184 [details] [diff] [review]
[custom] rc_eval_func instead of regex_log_evaluator

http://hg.mozilla.org/build/buildbotcustom/rev/117b78891fc5

Merged to production-0.8, starting reconfig.
Attachment #754184 - Flags: checked-in+
The Win debug OOM looks like it's maybe 8% based on my try run, so it doesn't block reopening like the crashtest does, but I'll be coming after it with a machete before long.
Reconfig done, mozharness patch transplanted to production.
with unit tests that pass.
Attachment #754189 - Flags: review?(philringnalda)
Comment on attachment 754189 [details] [diff] [review]
[mozharness] set failure if we surpass max log size

>+def cleanup():
>+    gc.collect()
>+    # I'm using MercurialVCS here because that gives me access to
>+    #
>+    c = CleanupObj()

This is not going to be a surprise, but I've got no idea what that comment is saying.
Attachment #754189 - Flags: review?(philringnalda) → review+
Comment on attachment 754189 [details] [diff] [review]
[mozharness] set failure if we surpass max log size

Landed to default, transplanted to production.  And yeah, that comment was left over from a different file I copied from (cleaned up now).
http://hg.mozilla.org/build/mozharness/rev/7416a68e544b
Attachment #754189 - Flags: checked-in+
(In reply to Phil Ringnalda (:philor) from comment #13)
> Not enough,
> https://tbpl.mozilla.org/php/getParsedLog.php?id=23409164&full=1&branch=try
> is the same OOM without the CC warning, still overflowing. Kill the nsDeque
> OOM warning? Kill imptests? I'll kill imptests if you tell me to, just tell
> me to, I don't mind, I'll kill them, say the word.

Yeah, I think killing the nsDeque OOM warning is good as a short term measure to prevent log badness.  I should be able to add it back later this week, when I make it so the CC doesn't blast it out a million times when things fall apart.

It is just papering over extreme badness in that test (bug 875585), but you are going to need to get somebody else's approval for disabling it. ;)  Ms2ger should have a better idea who might have an opinion.
Depends on: 875585
Never been so happy to see a red job; trees reopened, *thank you Aki*.
Severity: blocker → normal
Filed bug 876545 on dealing with the way that debug mochitest-1 is so close to 50MB that a single timeout and screenshot puts it over.
Comment on attachment 756301 [details] [diff] [review]
move worst_level() to LogMixin, use it in buildbot_status()

Review of attachment 756301 [details] [diff] [review]:
-----------------------------------------------------------------

This looks ok to me

::: mozharness/mozilla/buildbot.py
@@ +83,5 @@
>              if not level:
>                  level = TBPL_STATUS_DICT[tbpl_status]
> +            self.worst_buildbot_status = self.worst_level(tbpl_status, self.worst_buildbot_status, TBPL_STATUS_DICT.keys())
> +            if self.worst_buildbot_status != tbpl_status:
> +                self.info("Current worst status %s is worse; keeping it." % self.worst_buildbot_status)

not a fan of how this english reads, but its a very minor bikeshed nit with no new shed color in mind, so no worry.
Attachment #756301 - Flags: review?(bugspam.Callek) → review+
Comment on attachment 756301 [details] [diff] [review]
move worst_level() to LogMixin, use it in buildbot_status()

http://hg.mozilla.org/build/mozharness/rev/43e1d7a78e96 , transplanted to production.
Attachment #756301 - Flags: checked-in+
Status: NEW → RESOLVED
Closed: 11 years ago
Resolution: --- → FIXED
Product: mozilla.org → Release Engineering
Component: General Automation → General
You need to log in before you can comment on or make changes to this bug.

Attachment

General

Created:
Updated:
Size: