See bug https://bugzilla.mozilla.org/show_bug.cgi?id=1025800 about a problem with a slave with builds.

01:21 < pmoore|projectduty> Tomcat|sheriffduty: in those logs with the corrupted zips, i also see lines like:
01:22 < pmoore|projectduty> /usr/bin/objcopy:dist/bin/stSlDtG3: cannot create debug link section `dist/bin/test_unlock_notify.dbg': Invalid operation
01:23 < pmoore|projectduty> i wonder if we need to update our version of objcopy - i see a lot of e.g. debian bugs with this problem, and the solution was to upgrade binutils package
01:23 < Tomcat|sheriffduty> pmoore|projectduty: yeah i will file a bug for this slave
01:24 < pmoore|projectduty> i wonder if we are just using a buggy version of code, and the problem is occasionally exhibited on a slave, but maybe is not a slave problem but a tools problem that happens sporadically
01:24 < pmoore|projectduty> or maybe once it happens, it leaves the machine in a bad state, so it looks like a slave problem
01:24 < pmoore|projectduty> just a thought, could be way off base
01:24 < Tomcat|sheriffduty> yeah
01:25 < pmoore|projectduty> might be worth raising a bug for the slave and also a separate bug for the root cause? or maybe we already have one…
pmoore's guess looks plausible. Looking at an ec2 instance (non-spot, but it should be the same), I am getting:
[email@example.com ~]$ /usr/bin/objcopy --version
GNU objcopy version 188.8.131.52.2-5.28.el6 20091009
and 2.20 was reported to have a bug that may be related to this situation. Here is that bug: https://sourceware.org/bugzilla/show_bug.cgi?id=11072
Based on the comments in the sourceware bug, an updated version resolved that issue. I'm guessing we need to play with puppet to do this. /me dives in deeper.
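If we want to compare the installed objcopy against some minimum binutils version across slaves, a rough sketch could look like the following. The helper names and the idea of a "minimum" version are assumptions for illustration only; the sourceware bug does not give us a confirmed cutoff, so the threshold is left as a parameter.

```python
import re
import subprocess


def parse_version(banner):
    """Return the first dotted version on the banner's first line as a tuple of ints."""
    first_line = banner.splitlines()[0] if banner else ""
    match = re.search(r"(\d+(?:\.\d+)+)", first_line)
    if match is None:
        raise ValueError("no version number found in: %r" % first_line)
    return tuple(int(part) for part in match.group(1).split("."))


def objcopy_version(path="/usr/bin/objcopy"):
    """Run `objcopy --version` and parse the banner it prints."""
    banner = subprocess.check_output([path, "--version"], universal_newlines=True)
    return parse_version(banner)


def is_older_than(version, minimum):
    """Tuple comparison: (2, 20, 51) is older than (2, 21)."""
    return version < minimum
```

On a slave this could be used as `is_older_than(objcopy_version(), minimum)`, where `minimum` is whatever binutils version we decide is known good.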
So I am not sure if these add up. The issue reported here - https://sourceware.org/bugzilla/show_bug.cgi?id=11072 and here - https://bugs.debian.org/cgi-bin/bugreport.cgi?bug=556951 refers to using --add-gnu-debuglink. That option doesn't seem to be in the output of 'make buildsymbols'.
We do use the 'gold' binutils, and that seems to be a culprit for output like:
"""
/usr/bin/objcopy:dist/bin/stBmHhWA: cannot create debug link section `dist/bin/libxul.so.dbg': Invalid operation
"""
It looks like we install binutils via: http://mxr.mozilla.org/build/source/puppet-manifests/modules/packages/manifests/devtools.pp#48
If objcopy is not a red herring, we can just set OBJCOPY in build/unix/mozconfig.linux. That said, why are we using gold? We shouldn't be.
(In reply to Mike Hommey [:glandium] from comment #3)
> If objcopy is not a red herring, we can just set OBJCOPY in
> build/unix/mozconfig.linux

To use the one that comes alongside gcc, which is from a recent binutils.
(In reply to Mike Hommey [:glandium] from comment #3)
> If objcopy is not a red herring, we can just set OBJCOPY in
> build/unix/mozconfig.linux. That said, why are we using gold? we shouldn't
> be.

/me finds `Bug 633269 - Use gold for linking on linux`. Looks like you were against this too back then :) Not sure if this is how things are today; I may be reading things incorrectly from the puppet repo.
So far bld-linux64-spot-136 is the only slave I could find that was having this issue. It is possible that the zip was actually bad, or that the instance created was an anomaly.
Maybe related: it seems that https://secure.pub.build.mozilla.org/builddata/reports/slave_health/slave.html?name=bld-linux64-spot-494 was also having issues unpacking, not with zip but with tar. It had four failures in a row:
"""
ERROR: Command failed. See logs for output.
 # ['tar', '--use-compress-program', 'pigz', '-xf', '/builds/mock_mozilla/cache/mozilla-centos6-x86_64/root_cache/cache.tar.gz', '-C', '/builds/mock_mozilla/mozilla-centos6-x86_64/root/']
program finished with exit code 2
"""
It appears that bld-linux64-spot-494 got 'un-stuck' and is running green again.
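To tell a genuinely bad archive apart from a bad slave, it might help to verify the zip or tarball independently of the build step. Below is a minimal sketch using only the Python standard library; the function names are made up for illustration, and this is not what the build automation actually runs (that uses tar with pigz, as in the log above).

```python
import tarfile
import zipfile
import zlib


def zip_is_intact(path):
    """True if every member of the zip can be read and passes its CRC check."""
    try:
        with zipfile.ZipFile(path) as archive:
            # testzip() returns the name of the first bad member, or None.
            return archive.testzip() is None
    except (zipfile.BadZipFile, OSError, zlib.error):
        return False


def tar_is_intact(path):
    """True if the (optionally compressed) tarball can be read end to end."""
    try:
        with tarfile.open(path, "r:*") as archive:
            for member in archive:
                if member.isfile():
                    handle = archive.extractfile(member)
                    # Read in 1 MiB chunks so decompression errors surface.
                    while handle.read(1 << 20):
                        pass
        return True
    except (tarfile.TarError, OSError, EOFError, zlib.error):
        return False
```

If the same archive passes on a healthy machine but fails on the suspect slave, that would point at the slave (disk, memory, instance) rather than at the archive itself.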
So there are a few ideas here:
1) The slave is to blame
  a) this was due to bad instances (bad AMI) -> Bug 1025842 certainly suggests that to be the case for bld-linux64-spot-494
  b) maybe it ran out of disk space; this would explain how bld-linux64-spot-136 got stuck in a rut on the same builder
2) The zip was corrupt: unlikely, as other slaves built fine for the rev at the incident and the surrounding revs.
3) There is a bug with binutils: I want to say that this is a red herring, as other slaves seem to be able to build just fine.

I am tempted to suggest this is number (1). Let's see what the result of Bug 1025842 is first, as it might be related.
I am going to re-enable bld-linux64-spot-136. It looks like it has been terminated since being disabled, so its state is lost and it should have a fresh start. That won't help with debugging what actually happened, though. If this is a 'disk space' issue, we will have to act quickly on the spot(s) in question.
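For the 'disk space' theory, a quick check we could run on a suspect spot instance before it gets terminated might look like this. It is only a sketch: the helper names are hypothetical, and both the path to check and the threshold are assumptions that would need to be chosen for our slaves.

```python
import shutil


def free_gib(path):
    """Free space on the filesystem holding `path`, in GiB."""
    usage = shutil.disk_usage(path)
    return usage.free / float(1 << 30)


def low_on_space(path, min_free_gib):
    """True if the filesystem holding `path` has less than `min_free_gib` free."""
    return free_gib(path) < min_free_gib
```

For example, `low_on_space("/builds", 5.0)` would flag a slave with under 5 GiB free on the filesystem that holds the mock cache seen in the tar failure; the 5 GiB figure is arbitrary.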
This *looks* solved, or if it's not, it is not a buildduty issue anymore. Please either file new bugs for follow-up issues, or re-open and move to a different component if my assessment is wrong.