Closed Bug 1059992 Opened 10 years ago Closed 9 years ago

ALL b2g builds exceeding buildbot output limits

Categories

(Release Engineering :: General, defect)

defect
Not set
blocker

Tracking

(b2g-v1.4 fixed, b2g-v2.0 fixed, b2g-v2.1 fixed, b2g-v2.2 fixed)

RESOLVED FIXED
Tracking Status
b2g-v1.4 --- fixed
b2g-v2.0 --- fixed
b2g-v2.1 --- fixed
b2g-v2.2 --- fixed

People

(Reporter: mshal, Assigned: jhford)

References

Details

Attachments

(3 files, 1 obsolete file)

Flame builds sometimes hit the following:

Output exceeded 52428800 bytes, remaining output has been truncated (output was 52446216 bytes)

(eg: https://tbpl.mozilla.org/php/getParsedLog.php?id=46947281&tree=Mozilla-Central )

This doesn't actually cause the build to fail - it just stops logging output after that. But if there is a failure after the output limit is reached, the actual failure message is lost.

I think it is probably simplest to just up the logMaxSize in buildbot-configs/mozilla/master_common.py. I don't know how feasible it is to trim the actual build output.
Attached file logmaxsize
The fact that the logMaxSize hasn't changed since it was first created makes me somewhat uncertain that this is the right thing to do. I'm open to suggestions.
Attachment #8480816 - Flags: review?(catlee)
We've talked about this amongst catlee's group and there are good reasons to not bump the log limit, since it adds more load to the buildbot masters to handle the information streaming back, and because 50MB really should be enough for anyone. Someone suggested that there is a lot of repeated error output in the log which could be fixed or shown once or suppressed to get back under the limit.
Blocks: 907890
Ahh, ok. Feel free to r- then :).

The problem I ran into is my flame-kk builds were failing in staging, but no error messages were in the log since the output was truncated. I upped the log size on my dev-master to see the error, but then when I wanted to compare to a real good build on tbpl, that log was truncated too.
Comment on attachment 8480816 [details]
logmaxsize

Yeah, I'd really really like to not have to increase this. We've held the line at 50MB for quite a while.

Are there any obvious candidates for reducing build output?
Attachment #8480816 - Flags: review?(catlee) → review-
Bug 907890, which caused every single b2g build on every tree >= 32 to have a truncated log? Seems like a fairly obvious candidate to me.

Of course there are other things, there's half a meg of spammy traces from 13 Uglify2 errors because they want to use fat arrows and it doesn't know how to parse them and thinks it's useful to say so with a huge block of text, but who are you actually going to persuade to fix them?

The only reason I didn't close b2g32 and every trunk tree this morning when I discovered this was true of not just some Flame builds, but every single b2g device and emulator build since the end of July, was because people would lose their minds on mergeday eve. I think the right thing to do is close Tuesday morning, trunk aurora and 32, and not reopen until we can actually see b2g build logs.
Severity: normal → blocker
Summary: Flame builds exceeding buildbot output limits → ALL b2g builds exceeding buildbot output limits
John, as you can see it's a big issue to sheriffs to have the logs truncated.
Flags: needinfo?(jhford)
I don't think we should be turning off showcommands as it is still very useful information for debugging non-gecko changes.  If there is no option to increase the size of the logs we can stop having Android build system output, I guess we need to turn off the extended information for non-gecko parts.

To get us into a useful state where we don't loose gecko logging, I think we should do the following steps in order:

1) change gonk-misc/Android.mk to read an environment variable to force not setting -s without breaking the showcommands target support
2) set environment variable in buildbot/mozharness
3) revert https://bug907890.bugzilla.mozilla.org/attachment.cgi?id=8464166

I can take 1 and 3, but I need someone from Releng for 2.  Nick, do you know who on Releng can take number 2 on?
Flags: needinfo?(jhford) → needinfo?(nthomas)
Using a fake rule of

testingjohn:
	echo 'value for -s flag:' $(SHOW_COMMAND_GECKO)

I've tested that this works as follows:

$ ./build.sh testingjohn
...
value for -s flag: -s

$ ./build.sh testingjohn showcommands
...
value for -s flag:

$ ./build.sh testingjohn FORCE_GECKO_BUILD_OUTPUT=1
...
value for -s flag:

$ FORCE_GECKO_BUILD_OUTPUT=1 ./build.sh testingjohn 
...
value for -s flag:
Attachment #8482699 - Flags: review?(dhylands)
This patch removes the showcommands target from Mozharness.  It can't be landed until after attachment 8482699 [details] [diff] [review] lands.  If you'd prefer to specify the environment variable here instead of in buildbot, I guess that works too.
Attachment #8482701 - Flags: review?(catlee)
Attachment #8482701 - Flags: review?(catlee) → review+
Comment on attachment 8482699 [details] [diff] [review]
Environment override to force only gecko to show full logs

Review of attachment 8482699 [details] [diff] [review]:
-----------------------------------------------------------------

r=me with issue addressed (either in this bug or a followup filed)

::: Android.mk
@@ +268,5 @@
> +ifneq ($(strip $(FORCE_GECKO_BUILD_OUTPUT)),)
> +SKIP_DASH_S = 1
> +endif
> +
> +ifdef SKIP_DASH_S

My only concern here is that if you decide to set SKIP_DASH_S = 0 it would still take the if.

I'd rather see:
ifneq ($(strip($(SKIP_DASH_S)),1)

but this can be fixed in a followup bug as well.
Attachment #8482699 - Flags: review?(dhylands) → review+
Comment on attachment 8482699 [details] [diff] [review]
Environment override to force only gecko to show full logs

gonk-misc commit 6969df171e5295f855f12d12db0382048e6892e7
Attachment #8482699 - Flags: checked-in+
set the environment variable.  I'm not sure how spidermonkey gets these values as it does not have an equivalent query env function I think.
Attachment #8482774 - Flags: review?(catlee)
Comment on attachment 8482774 [details] [diff] [review]
Set the environment as well

Review of attachment 8482774 [details] [diff] [review]:
-----------------------------------------------------------------

::: scripts/b2g_build.py
@@ +292,5 @@
>              )
>          # Force B2G_UPDATER so that eng builds (like the emulator) will get
>          # the updater included. Otherwise the xpcshell updater tests won't run.
>          env['B2G_UPDATER'] = '1'
> +        env['FORCE_GECKO_BUILD_OUTPUT'] = '1'

please add a comment here about referencing this bug, and pointing to where this environment variable is referenced by the build system.
Attachment #8482774 - Flags: review?(catlee) → review+
Attachment #8482701 - Attachment is obsolete: true
Flags: needinfo?(nthomas)
Comment on attachment 8482774 [details] [diff] [review]
Set the environment as well

I transplanted this to production to get into use asap:
https://hg.mozilla.org/build/mozharness/rev/ea52d732ef2f
Does the gonk-misc patch need uplifting to the v2.1, v2.0, etc branches in that repo ?
(In reply to Nick Thomas [:nthomas] from comment #16)
> Does the gonk-misc patch need uplifting to the v2.1, v2.0, etc branches in
> that repo ?

Only if we're truncating there as well
(In reply to John Ford [:jhford] -- please use 'needinfo?' instead of a CC from comment #17)
> (In reply to Nick Thomas [:nthomas] from comment #16)
> > Does the gonk-misc patch need uplifting to the v2.1, v2.0, etc branches in
> > that repo ?
> 
> Only if we're truncating there as well

Scratch that, yes, we do need to uplift it because we're no longer using showcommands and thus we lose out on the Gecko build system output.
Comment on attachment 8482699 [details] [diff] [review]
Environment override to force only gecko to show full logs

NOTE: This flag is now for security issues only. Please see https://wiki.mozilla.org/Release_Management/B2G_Landing to better understand the B2G approval process and landings.

[Approval Request Comment]
Bug caused by (feature/regressing bug #): this bug
User impact if declined: developers loose Gecko build logging information
Testing completed: it works on master branch
Risk to taking this patch (and alternatives if risky): if we have a build failure or otherwise need to refer to a full gecko build, we won't have the required information for branches without this patch.
String or UUID changes made by this patch: none
Attachment #8482699 - Flags: approval-mozilla-b2g32?
Attachment #8482699 - Flags: approval-mozilla-b2g30?
Attachment #8482699 - Flags: approval-mozilla-b2g28?
This is really NPOTB, isn't it? i.e. doesn't affect shipping code at all?
Assignee: nobody → jhford
Comment on attachment 8482699 [details] [diff] [review]
Environment override to force only gecko to show full logs

Confirmed with John on IRC that this doesn't affect any compiled code, just changes some of the output in the logs. Uplifting a=NPOTB.

v2.1: https://github.com/mozilla-b2g/gonk-misc/commit/fd2ace405952a06d16eb8a1334129007c3f179fa
v2.0: https://github.com/mozilla-b2g/gonk-misc/commit/c4765e55870c241a5d25499d65323f3e87d26143
v1.4: https://github.com/mozilla-b2g/gonk-misc/commit/c5fc772a061a6fa46750e457954837c0d0df64dc
Attachment #8482699 - Flags: approval-mozilla-b2g32?
Attachment #8482699 - Flags: approval-mozilla-b2g30?
Attachment #8482699 - Flags: approval-mozilla-b2g28?
Merged to production, and deployed.
I checked some logs to confirm we've resolved all branches & builds:
 # stage
 cd /pub/mozilla.org/b2g/tinderbox/builds
 zgrep -c 'output has been truncated' */1410*/*-bm*-build1-build*.txt.gz > file

There are 3578 jobs in that covering 2.75 days, of which 239 (6%) end up truncated, mostly not by much.

There is consistent truncation for mozilla-b2g32_v2_0 emulator opt/nightly/debug.

And intermittent truncation for
* m-c & peers: emulator & emulator-kk, non-unified, opt & debug, most builds (unified is ok)
* same for m-a, except it's dep only (we don't unify there, so no non-unified)
* also on m-a: flame nightlies for opt, flame nightly & dep for eng (but flame-kk is fine); hamachi opt & eng nightlies

I'm wondering if there is some other low hanging fruit, like some the l10n output or buildsymbols warnings.
$ curl -L "https://tbpl.mozilla.org/php/getParsedLog.php?id=47640747&tree=Mozilla-Central&full=1" | gzip -d > log
$ egrep 'INFO|ERROR|WARNING' log | wc -l
   86674

That's 90,000 "17:11:59  WARNING - " 21 char strings.  That's 1,845KB of output.

I would argue that we could compress that format down to something much smaller and still get useful information:

171159E for error, 171159W for warning and 171159I for info.  That's only 615KB.  If we got rid of the seconds, that's only 439K of output.

Also, the build directory is "/builds/slave/b2g_m-cen_emu_dep-000000000000".  That gets repeated a bunch of times because of how we invoke the compiler.

jhford:~ $ sed s/b2g_m-cen_emu_dep-000000000000//g < log > smalllog
jhford:~ $ du log smalllog
58528	log
48288	smalllog

There's another 10MB.  Just getting rid of the 0s, we reduce by 4MB

jhford:~ $ sed s/b2g_m-cen_emu_dep-000000000000/b2g_m-cen_emu_dep/g < log > smalllog
jhford:~ $ du log smalllog
58528	log
54088	smalllog

If we hash the name of that directory and take only the first 7 chars, we get a huge improvement!

jhford:~ $ sed s/b2g_m-cen_emu_dep-000000000000/c1c1a17/g < log > smalllog
jhford:~ $ du log smalllog
58528	log
50672	smalllog

At the time when the hashes are generated, we could have a safety check that we aren't assigning the same shortened hash to two different build directories in the same master process.

jhford:~ $ cat safety.py
import sys
import hashlib

usedHashes = []

def getDir(name):
  h = hashlib.new('sha1')
  h.update(name)
  h = h.hexdigest()
  if h in usedHashes:
    sys.stderr.write('ERROR: There is a hash collision!')
    raise Exception('Short hash collision!')
  else:
    usedHashes.append(h)
    return h

for i in sys.argv[1:]:
  print getDir(i);
jhford:~ $ python safety.py john ford
a51dda7c7ff50b61eaea0444371f4a6a9301e501
d6edd79fd69252cd1adb811d1e99a7398dd6f53a
jhford:~ $ python safety.py john ford john
a51dda7c7ff50b61eaea0444371f4a6a9301e501
d6edd79fd69252cd1adb811d1e99a7398dd6f53a
ERROR: There is a hash collision!Traceback (most recent call last):
  File "safety.py", line 18, in <module>
    print getDir(i);
  File "safety.py", line 12, in getDir
    raise Exception('Short hash collision!')
Exception: Short hash collision!


Here's a bunch of low hanging fruit that I think could give pretty good wins.
Depends on: 1072026
Thanks for the suggestions John. I found something else which accounts for a lot of bits which is a less intrusive thing to change:

eg 15:55:15 INFO - /builds/slave/b2g_m-aurora_flm-kk_ntly-00000/build/prebuilts/gcc/linux-x86/arm/arm-eabi-4.7/bin/../lib/gcc/arm-eabi/4.7/../../../../arm-eabi/bin/as: /lib64/libz.so.1: no version information available (required by /builds/slave/b2g_m-aurora_flm-kk_ntly-00000/build/prebuilts/gcc/linux-x86/arm/arm-eabi-4.7/bin/../lib/gcc/arm-eabi/4.7/../../../../arm-eabi/bin/as)

Filed bug 1072026 to suppress this.
I reran the check in comment #23, and we're only truncating on emulator builds for mozilla-b2g34_v2_1, mozilla-b2g34_v2_1s, and mozilla-b2g32_v2_0. Uplifting bug 1072026 to those branches should fix that up.
Status: NEW → RESOLVED
Closed: 9 years ago
Resolution: --- → FIXED
Component: General Automation → General
You need to log in before you can comment on or make changes to this bug.

Attachment

General

Created:
Updated:
Size: