Closed Bug 1028816 Opened 10 years ago Closed 10 years ago

Segmentation fault on gaia try server with pull request from bug 1017490

Categories

(Infrastructure & Operations Graveyard :: CIDuty, task)

x86_64
Linux
task
Not set
normal

Tracking

(Not tracked)

RESOLVED FIXED

People

(Reporter: yurenju, Assigned: jgriffin)

I got a segmentation fault when pushing a pull request to the Gaia try server.

https://tbpl.mozilla.org/?tree=Gaia-Try&rev=6c6b3548ffe0f9b0950c055ed657e91264607907

log for G (gaia unit test):
> 20:28:27     INFO -  /builds/slave/test/gaia/xulrunner-sdk/bin/run-mozilla.sh /builds/slave/test/gaia/xulrunner-sdk/bin/xpcshell build/make_gaia_shared.js
> 20:28:30     INFO -  make[1]: Leaving directory `/builds/slave/test/gaia/apps/email'
> 20:28:30     INFO -  copy verticalhome to build_stage/
> 20:28:30     INFO -  execute verticalhome/build/build.js
> 20:28:30     INFO -  run-js-command verticalhome/app/build
> 20:28:31     INFO -  copy system to build_stage/
> 20:28:31     INFO -  execute system/build/build.js
> 20:28:31     INFO -  run-js-command system/app/build
> 20:28:31     INFO -  copy gallery to build_stage/
> 20:28:31     INFO -  execute gallery/build/build.js
> 20:28:31     INFO -  run-js-command gallery/app/build
> 20:28:32     INFO -  copy clock to build_stage/
> 20:28:32     INFO -  execute clock/build/build.js
> 20:28:32     INFO -  run-js-command clock/app/build
> 20:28:32     INFO -  /bin/bash: line 1:  2686 Segmentation fault      /builds/slave/test/gaia/xulrunner-sdk/bin/run-mozilla.sh /builds/slave/test/gaia/xulrunner-sdk/bin/xpcshell -f "/builds/slave/test/gaia/build/xpcshell-commonjs.js" -e "run('app/build');"
> 20:28:32     INFO -  make: *** [clock] Error 139
> 20:28:32    ERROR - Return code: 2
> 20:28:32    ERROR - 2 not in success codes: [0]
> 20:28:32    FATAL - Halting on failure while running ['make']
> 20:28:32    FATAL - Running post_fatal callback...
> 20:28:32    FATAL - Exiting 2
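
For readers decoding the log above: POSIX shells report a child killed by a signal as 128 plus the signal number, so make's "Error 139" corresponds to signal 11 (SIGSEGV). A minimal sketch of that arithmetic follows; the helper name describe_exit_status is hypothetical and is not part of mozharness.

```python
import signal


def describe_exit_status(status):
    """Describe a shell-style exit status such as the 139 reported by make.

    Hypothetical helper for illustration only.
    """
    # POSIX shells use "128 + signal number" when a child dies from a signal.
    if status > 128:
        signum = status - 128
        try:
            name = signal.Signals(signum).name
        except ValueError:
            name = "unknown signal"
        return "killed by signal %d (%s)" % (signum, name)
    return "exited with status %d" % status


if __name__ == "__main__":
    print(describe_exit_status(139))
```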

After investigating, I believe this is a mozharness issue. I extracted a snippet [1] from run_command in mozharness/base/script.py [2], using the same arguments.

You can download that gist, change "cwd" to your Gaia path, and run it; you will then get a segmentation fault on a linux64 box.

[1] https://gist.github.com/yurenju/4abf50117c48478288d4
[2] http://hg.mozilla.org/build/mozharness/file/3347b848256c/mozharness/base/script.py#l688
It's hard to say where the actual bug is; I don't think it's a mozharness bug, although it's possible we may be able to work around it in mozharness.  It's likely a 'make' bug, but it could also be an xpcshell bug or a Python bug.

The symptoms:  On linux64 at least, using subprocess.Popen to invoke a make call which in turn invokes xpcshell to run some JS causes a segfault.  This doesn't occur on OSX.  Switching subprocess.Popen to subprocess.call avoids the segfault, but if we switched that in mozharness, we'd lose timeout handling.
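
To illustrate why timeout handling is tied to Popen here: in Python 2, subprocess.call blocks until the child exits and takes no timeout argument, whereas Popen returns a handle that can be polled and killed. The following is only a rough sketch of that pattern, not mozharness's actual run_command; the function name and parameters are assumptions made for the example.

```python
import subprocess
import sys
import time


def run_with_timeout(cmd, timeout, poll_interval=0.1):
    """Run cmd and kill it if it runs longer than `timeout` seconds.

    Returns (returncode, timed_out). Hypothetical helper for illustration.
    """
    proc = subprocess.Popen(cmd)
    deadline = time.monotonic() + timeout
    # Poll instead of blocking, so the deadline can be enforced.
    while proc.poll() is None:
        if time.monotonic() > deadline:
            proc.kill()
            proc.wait()  # reap the killed child
            return proc.returncode, True
        time.sleep(poll_interval)
    return proc.returncode, False


if __name__ == "__main__":
    print(run_with_timeout([sys.executable, "-c", "pass"], timeout=10))
```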

Cc'ing a few people in case they have any ideas about what's going on here.
I can reproduce the segfault on Fedora 20.  The segfault is definitely happening outside of python, but it's being triggered by how subprocess.py invokes the command when there's an env param to Popen().

The relevant section of subprocess.py:

  if env is None:
    os.execvp(executable, args)
  else:
    os.execvpe(executable, args, env)

If you modify that script to completely omit the env= param but set those environment variables on the command line, the script works (e.g. DESKTOP=0 DESKTOP_SHIMS=1 DEBUG=1 NOFTU=1 python gaia-make.py).  The difference in code path here is that without env=, we're using execvp instead of execvpe.  If we switch both cases to use execvpe, we still segfault.

  if env is None:
    os.execvpe(executable, args, {})
  else:
    os.execvpe(executable, args, env)

You'll still get the segfaults regardless of whether or not you specify an env kwarg.
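
One detail that may be relevant: the env argument to Popen() (and to execvpe) replaces the child's whole environment rather than adding to it, so a dict holding only a few variables, or {}, leaves the child without PATH, HOME and everything else it would normally inherit. Whether that is what upsets xpcshell is not established in this thread. As a sketch of the alternative, assuming the goal is "the parent's environment plus a few overrides", one could build the dict as below; build_child_env is a hypothetical name.

```python
import os


def build_child_env(overrides, base=None):
    """Return a copy of `base` (default: os.environ) with `overrides` applied.

    Hypothetical helper; the original mapping is left untouched.
    """
    env = dict(os.environ if base is None else base)
    env.update(overrides)
    return env


# Possible use (untested against the Gaia build, shown only as an idea):
#   subprocess.Popen(['make'], cwd=gaia_dir,
#                    env=build_child_env({'DEBUG': '1', 'NOFTU': '1'}))
```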

Switching to .call is *not* fixing this.  I just verified that there is still a coredump happening.  Using .call just doesn't print the segfault notice because that segfault is in a subprocess, not python itself.  On Fedora, you need to run 'ulimit -c unlimited' to get core dumps created in the cwd.

I've also verified that the process that's getting a coredump is Xulrunner:

jhford-w520:~/b2g/gaia $ file core.18553 
core.18553: ELF 64-bit LSB core file x86-64, version 1 (SYSV), SVR4-style, from '/home/jhford/b2g/gaia/xulrunner-sdk-30/xulrunner-sdk/bin/xpcshell -f /home/jhfo'
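
If you are driving the reproduction from a Python script rather than a shell, the effect of 'ulimit -c' can be approximated by raising the core-size limit in the child before it execs. This is a POSIX-only sketch with hypothetical function names; it raises the soft limit only as far as the existing hard limit, so it will not help if the hard limit is already 0.

```python
import resource
import subprocess
import sys


def allow_core_dumps():
    """Raise the soft core-size limit to the hard limit (runs in the child)."""
    soft, hard = resource.getrlimit(resource.RLIMIT_CORE)
    resource.setrlimit(resource.RLIMIT_CORE, (hard, hard))


def run_with_core_dumps(cmd, cwd=None):
    """Run cmd with core dumps enabled as far as the hard limit allows."""
    # preexec_fn is called in the child between fork() and exec().
    return subprocess.call(cmd, cwd=cwd, preexec_fn=allow_core_dumps)


if __name__ == "__main__":
    print(run_with_core_dumps([sys.executable, "-c", "pass"]))
```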
Reduced test case:

  import os

  # Uncomment one call at a time; only the variants that take an explicit
  # environment argument (here an empty dict) trigger the segfault.
  #os.execvp('/usr/bin/make', ['make', 'preferences'])
  os.execvpe('/usr/bin/make', ['make', 'preferences'], {})
  #os.execlp('/usr/bin/make', 'make', 'preferences')
  #os.execlpe('/usr/bin/make', 'make', 'preferences', {})

execvp and execlp both avoid segfaulting; both execvpe and execlpe segfault.
I'll change the way we invoke the tests to pass the env on the command-line and see if this goes away.
Assignee: nobody → jgriffin
I think this should have the desired effect.
Attachment #8445565 - Flags: review?(jhford)
Comment on attachment 8445565 [details] [diff] [review]
Pass env variables on the command-line,

Looks good to me!  Can we add:

self.info('Sending environment as make vars because of bug 1028816')

Because I'm sure that eventually, someone will wonder why on earth the environment they are setting is not being set as the actual environment.

Other than that, the only issue I see is that the copy and paste argv output from mozharness could be broken by an environment var with a space.
Attachment #8445565 - Flags: review?(jhford) → review+
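
For illustration, the approach discussed above (passing the values as make variables instead of as the process environment) and the quoting concern raised in review might look roughly like this. It is a sketch under assumptions, not the attached patch: both function names are hypothetical, and shlex.quote is used only to produce a copy-and-paste-safe string for logging.

```python
import shlex


def make_command_with_vars(targets, variables):
    """Build a make argv that passes `variables` as VAR=value arguments.

    Hypothetical helper; not the code from the attached patch.
    """
    cmd = ["make"]
    for name in sorted(variables):
        cmd.append("%s=%s" % (name, variables[name]))
    cmd.extend(targets)
    return cmd


def printable(cmd):
    """Render an argv list so it can be pasted into a POSIX shell."""
    return " ".join(shlex.quote(arg) for arg in cmd)


if __name__ == "__main__":
    argv = make_command_with_vars([], {"DEBUG": "1", "MSG": "hello world"})
    print(printable(argv))
```

Because the command is passed to Popen as a list, a value containing a space reaches make intact; only the logged, human-readable form needs the quoting.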
Comment on attachment 8445565 [details] [diff] [review]
Pass env variables on the command-line,

Addressed review comments:  https://hg.mozilla.org/build/mozharness/rev/c4e0534d7b2e
Attachment #8445565 - Flags: checked-in+
:jgriffin, I saw that the commit has been checked in to the mozharness repository [1], but we still get the segfault on the try server. Does the try server use the same version as the repository?

[1] https://hg.mozilla.org/build/mozharness/summary
Flags: needinfo?(jgriffin)
So there are two issues at play here:

1) we use the production branch of mozharness on Gaia-Try most of the time
2) we are using hg.m.o/users/jford_mozilla.com/mozharness on the default branch because there is no staging environment and I need to test a mozharness+gaia-try change

Can you link to a log of the new failure?  I'm curious if something lower level than the patch jgriffin worked on is doing something like  "def func(env={})" or "if not env: env = {}" or some sort of similar magic.
(In reply to Yuren [:yurenju] from comment #8)
> :jgriffin, I saw that the commit has been checked in to the mozharness
> repository [1], but we still get the segfault on the try server. Does the try
> server use the same version as the repository?
> 
> [1] https://hg.mozilla.org/build/mozharness/summary

The patch has been merged to jford_mozilla.com/mozharness, so should be used for Gaia-Try runs.  If it's not working, there may be some other problem involved.  As jhford said, can you provide a link to a failing log?
Flags: needinfo?(jgriffin)
:jgriffin and :jhford, thanks for your help, the pull request looks good now! :D

https://tbpl.mozilla.org/?rev=632529bd820393ff1a16ec29cfd78d5478803ce6&tree=Gaia-Try
Closing this bug, thanks all!
Status: NEW → RESOLVED
Closed: 10 years ago
Resolution: --- → FIXED
Component: Platform Support → Buildduty
Product: Release Engineering → Infrastructure & Operations
Product: Infrastructure & Operations → Infrastructure & Operations Graveyard