Closed Bug 1226276 Opened 9 years ago Closed 8 years ago

l10n tagging fails when running 'hg branches' against clean checkout of certain locale repos

Categories

(Release Engineering :: Release Automation: Other, defect)

defect
Not set
normal

Tracking

(Not tracked)

RESOLVED FIXED

People

(Reporter: jlund, Unassigned)

Attachments

(1 file)

three retriggers, different slaves, same failure: http://buildbot-master73.bb.releng.usw2.mozilla.com:8001/builders/release-comm-esr38-thunderbird_tag_l10n/builds/1/steps/run_script/logs/stdio

Running the last few commands by hand on one of the failing machines succeeds cleanly:



[cltbld@bld-linux64-spot-1096.build.releng.usw2.mozilla.com nb-NO]$ mock_mozilla -r mozilla-centos6-x86_64 --cwd /builds/slave/rel-c-esr38-tb_tag_l10n-000000/nb-NO/. --unpriv --shell '/usr/bin/env HG_SHARE_BASE_DIR="/builds/hg-shared" BUILDBOT_CONFIGS="https://hg.mozilla.org/build/buildbot-configs" CLOBBERER_URL="https://api.pub.build.mozilla.org/clobberer/forceclobber" BUILDBOTCUSTOM="https://hg.mozilla.org/build/buildbotcustom" PROPERTIES_FILE="/builds/slave/rel-c-esr38-tb_tag_l10n-000000/buildprops.json" PATH="/tools/buildbot/bin:/usr/local/bin:/usr/lib/ccache:/bin:/usr/bin:/usr/local/sbin:/usr/sbin:/sbin:/tools/git/bin:/tools/python27/bin:/tools/python27-mercurial/bin:/home/cltbld/bin" EXTRA_DATA="/builds/slave/rel-c-esr38-tb_tag_l10n-000000/data.json" hg branches -c'
INFO: mock_mozilla.py version 1.0.3 starting...
State Changed: init plugins
INFO: selinux disabled
State Changed: start
State Changed: lock buildroot
State Changed: shell
MOBILE4201_2015111905_RELBRANCH 2818:2c65acd17612
THUNDERBIRD3840_2015111302_RELBRANCH 2817:f9901dbca796
THUNDERBIRD3840_2015110918_RELBRANCH 2816:7997911655a3
SEA_COMM420_20151103_RELBRANCH 2815:1526960facbe
GECKO420_2015102918_RELBRANCH 2813:4af100eb71bf
... etc




[cltbld@bld-linux64-spot-1096.build.releng.usw2.mozilla.com nb-NO]$ mock_mozilla -r mozilla-centos6-x86_64 --cwd /builds/slave/rel-c-esr38-tb_tag_l10n-000000/nb-NO/. --unpriv --shell '/usr/bin/env HG_SHARE_BASE_DIR="/builds/hg-shared" BUILDBOT_CONFIGS="https://hg.mozilla.org/build/buildbot-configs" CLOBBERER_URL="https://api.pub.build.mozilla.org/clobberer/forceclobber" BUILDBOTCUSTOM="https://hg.mozilla.org/build/buildbotcustom" PROPERTIES_FILE="/builds/slave/rel-c-esr38-tb_tag_l10n-000000/buildprops.json" PATH="/tools/buildbot/bin:/usr/local/bin:/usr/lib/ccache:/bin:/usr/bin:/usr/local/sbin:/usr/sbin:/sbin:/tools/git/bin:/tools/python27/bin:/tools/python27-mercurial/bin:/home/cltbld/bin" EXTRA_DATA="/builds/slave/rel-c-esr38-tb_tag_l10n-000000/data.json" hg update -C -r cd8316ad9001'
INFO: mock_mozilla.py version 1.0.3 starting...
State Changed: init plugins
INFO: selinux disabled
State Changed: start
State Changed: lock buildroot
State Changed: shell
149 files updated, 0 files merged, 20 files removed, 0 files unresolved
State Changed: unlock buildroot





At this point I'm not sure what's wrong.
Blocks: 1223879
I can't see anything obvious either. It may be worth trying to use strace while it's hung.
re-triggered again and followed with strace:

[root@bld-linux64-spot-381.build.releng.usw2.mozilla.com ~]# ps auxwf | grep 10382
root     11873  0.0  0.0 103240   864 pts/4    S+   07:57   0:00          \_ grep 10382
cltbld   10382  0.1  0.0 187512 10436 ?        S    07:29   0:01          \_ /usr/bin/python /builds/slave/rel-c-esr38-tb_tag_l10n-000000/scripts/scripts/release/tag-release.py -c mozilla/release-thunderbird-comm-esr38.py -b https://hg.mozilla.org/build/buildbot-configs -t THUNDERBIRD_38_4_0_RELEASE --tag-l10n

[root@bld-linux64-spot-381.build.releng.usw2.mozilla.com ~]# strace -fF -p 10382 -o tagging_sub.log


The job eventually hung, as expected, on nb-NO while running 'hg branches -c', so I attached strace to that child process:

Process 11838 attached
[root@bld-linux64-spot-381.build.releng.usw2.mozilla.com ~]# strace -p 11838
Process 11838 attached - interrupt to quit
write(1, "nactive)\n", 9)               = 9



Then, just reading fd 1 (stdout) of that process, oddly (very oddly!), kicked it back into action:
[root@bld-linux64-spot-381.build.releng.usw2.mozilla.com ~]# cd /proc/11838/fd
[root@bld-linux64-spot-381.build.releng.usw2.mozilla.com fd]# cat 1
MOBILE4201_2015111905_RELBRANCH 2818:2c65acd17612
THUNDERBIRD3840_2015111302_RELBRANCH 2817:f9901dbca796
.... branch output

as soon as I cat'd that fd, proc 11838 came back to life!
rt_sigaction(SIGINT, {SIG_DFL, [], SA_RESTORER, 0x7f6e04e504a0}, {0x7f6e0518fe50, [], SA_RESTORER, 0x7f6e04e504a0}, 8) = 0
rt_sigaction(SIGHUP, {SIG_DFL, [], SA_RESTORER, 0x7f6e04e504a0}, {0x7f6e0518fe50, [], SA_RESTORER, 0x7f6e04e504a0}, 8) = 0
rt_sigaction(SIGTERM, {SIG_DFL, [], SA_RESTORER, 0x7f6e04e504a0}, {0x7f6e0518fe50, [], SA_RESTORER, 0x7f6e04e504a0}, 8) = 0
exit_group(0)                           = ?
Process 11838 detached

and the log continued. After talking to rail: he thinks there could be some weird flushing/buffering bug in hg 3.1.2.

at any rate, the job seems to be progressing now :\
this is happening again. same repo. different product (fennec) release. investigating..
Summary: tb 38.4.0 is failing to tag nb-NO locale repo with release tag → l10n tagging fails when running 'hg branches' against clean checkout of nb-NO locale repo
Depends on: 1229494
this is happening on 'ca' locale now: http://buildbot-master72.bb.releng.usw2.mozilla.com:8001/builders/release-mozilla-release-firefox_tag_l10n/builds/6
Summary: l10n tagging fails when running 'hg branches' against clean checkout of nb-NO locale repo → l10n tagging fails when running 'hg branches' against clean checkout of certain locale repos
We hit this on about 10 more locale repos. I suppose nb-NO was just a bit more bloated with branches, tags, etc. Until this is fixed, someone has to watch the job manually until it completes.
This is actually not caused by hg or the OS. It is our implementation of running the command through subprocess: https://dxr.mozilla.org/build-central/source/tools/lib/python/util/commands.py#186

I will upload a patch that replaces the poll(); stdout.read() logic with something that behaves like communicate() but still lets us keep a timeout.
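
For anyone following along, here is a minimal sketch of the failure mode, assuming the cause is a full stdout pipe that the parent never drains while it polls. That assumption is consistent with the blocked write(1, ...) seen in strace and with the process resuming once fd 1 was read. The child command and the function name are hypothetical stand-ins for 'hg branches -c' and the real helper in commands.py.

```python
import subprocess
import sys
import time

# Hypothetical child: writes 1 MiB to stdout, far more than a pipe buffer
# usually holds (commonly 64 KiB on Linux).
CHILD = "import sys; sys.stdout.write('x' * (1 << 20)); sys.stdout.flush()"


def poll_without_reading(wait_seconds=1.0):
    """Poll the child without draining its stdout pipe.

    Returns (stuck, returncode, output_size), where stuck is True if the
    child was still running when we stopped polling.
    """
    proc = subprocess.Popen([sys.executable, "-c", CHILD],
                            stdout=subprocess.PIPE)
    deadline = time.time() + wait_seconds
    # Same shape as a poll()-then-read() loop: nothing reads the pipe here,
    # so the child blocks in write() once the pipe buffer is full.
    while proc.poll() is None and time.time() < deadline:
        time.sleep(0.1)
    stuck = proc.poll() is None
    # Draining the pipe releases the child, much like reading fd 1 did.
    output, _ = proc.communicate()
    return stuck, proc.returncode, len(output)


stuck, returncode, size = poll_without_reading()
print(stuck, returncode, size)
```

A child with small output never fills the pipe, which would explain why only locale repos with many branches hung.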
played with this today.

this patch passes tests defined in: https://dxr.mozilla.org/build-central/source/tools/lib/python/mozilla_buildtools/test/test_util_commands.py#74

The patch will look weird and big, but only because I added a try/except, which bumped the indent of the whole block. Here is the same patch without the try/except: http://people.mozilla.org/~jlund/151215_bug_1226276_tools_get_output-no-try-catch.diff

this patch

1) uses a tempfile for storing output.
2) stops calling communicate() when the timeout is reached. I don't think it's needed, and besides, I believe there was a second bug in get_output: stderrdata[1] was always going to be empty, because when we do include stderr output we just mix it in with stdout[2]

[1] https://dxr.mozilla.org/build-central/source/tools/lib/python/util/commands.py#213
[2] https://dxr.mozilla.org/build-central/source/tools/lib/python/util/commands.py#177
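
The approach described above can be sketched roughly as follows. This is not the landed patch: get_output_with_timeout, its parameters, and the exception types are hypothetical names chosen for illustration, and the real get_output in commands.py may differ in signature and behaviour.

```python
import subprocess
import sys
import tempfile
import time


def get_output_with_timeout(cmd, timeout=60, poll_interval=0.1,
                            include_stderr=False):
    """Run cmd and return its output as bytes (hypothetical helper).

    Output goes to a temporary file instead of a pipe, so the child can
    never block on a full pipe buffer while the parent is polling.
    """
    with tempfile.TemporaryFile() as out:
        # When stderr is wanted it is simply mixed into the same file.
        stderr = subprocess.STDOUT if include_stderr else None
        proc = subprocess.Popen(cmd, stdout=out, stderr=stderr)
        deadline = time.time() + timeout
        while proc.poll() is None:
            if time.time() > deadline:
                proc.kill()
                proc.wait()
                raise RuntimeError("timed out after %s seconds" % timeout)
            time.sleep(poll_interval)
        # The child shared our file offset, so rewind before reading.
        out.seek(0)
        output = out.read()
    if proc.returncode != 0:
        raise subprocess.CalledProcessError(proc.returncode, cmd, output)
    return output


# Usage: 1 MiB of output, which would stall a poll-only loop over a pipe.
big = get_output_with_timeout(
    [sys.executable, "-c", "import sys; sys.stdout.write('x' * (1 << 20))"],
    timeout=30)
print(len(big))
```

Because the file has no size limit comparable to a pipe buffer, the polling loop can keep its timeout without having to read output concurrently.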
Attachment #8698715 - Flags: review?(catlee)
Comment on attachment 8698715 [details] [diff] [review]
151215_bug_1226276_tools_get_output_buffer_fix.diff

Let's land this, lgtm. TemporaryFile has a nice feature to remove the file, whenever it's closed.
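
A quick illustration of that cleanup behaviour, using NamedTemporaryFile only so the path can be checked (on POSIX a plain TemporaryFile typically has no visible directory entry at all):

```python
import os
import tempfile

tmp = tempfile.NamedTemporaryFile()  # delete-on-close is the default
tmp.write(b"branch output\n")
tmp.flush()
path = tmp.name

existed_while_open = os.path.exists(path)
tmp.close()  # closing the file also removes it from disk
exists_after_close = os.path.exists(path)

print(existed_while_open, exists_after_close)
```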
Attachment #8698715 - Flags: review?(catlee) → review+
Comment on attachment 8698715 [details] [diff] [review]
151215_bug_1226276_tools_get_output_buffer_fix.diff

thanks

https://hg.mozilla.org/build/tools/rev/fc9b6d055d7e
Attachment #8698715 - Flags: checked-in+
Status: NEW → RESOLVED
Closed: 8 years ago
Resolution: --- → FIXED