Marionette crashes aren't being properly symbolicated

RESOLVED FIXED in Firefox 49

Status

Testing
Marionette
--
critical
RESOLVED FIXED
2 years ago
2 years ago

People

(Reporter: RyanVM, Assigned: maja_zf)

Tracking

({intermittent-failure, regression})

Version 3
mozilla50
intermittent-failure, regression
Points:
---
Dependency tree / graph

Firefox Tracking Flags

(firefox47 unaffected, firefox48 unaffected, firefox49 fixed, firefox-esr45 unaffected, firefox50 fixed)

Details

MozReview Requests

Submitter Diff Changes Open Issues Last Updated
Loading...
Error loading review requests:

Attachments

(1 attachment)

(Reporter)

Description

2 years ago
I'm pretty sure this used to work, so that's swell. I wanted to throw Taskcluster under the bus for it, but then OSX had to go and ruin that theory.

https://treeherder.mozilla.org/logviewer.html#?job_id=3950116&repo=mozilla-central#L1312
https://treeherder.mozilla.org/logviewer.html#?job_id=28754506&repo=mozilla-inbound#L1314
Flags: needinfo?(dburns)
Ted,

I don't even know where to start to fix this, some guidance please.
Flags: needinfo?(dburns) → needinfo?(ted)
(Reporter)

Updated

2 years ago
Summary: Mn-e10s crashes aren't being properly symbolicated → Marionette crashes aren't being properly symbolicated
(Reporter)

Updated

2 years ago
Duplicate of this bug: 1277730
So for the mozilla-inbound case as Ryan filed above I cannot see that we unzip the downloaded symbols in the marionette harness script. Instead we pass the zip file path to mozcrash. But mozcrash does only unzip archives in case it downloads those itself:

https://dxr.mozilla.org/mozilla-central/source/testing/mozbase/mozcrash/mozcrash/mozcrash.py#174
mozharness looks like it's downloading and unzipping the symbols properly:
```
01:14:31     INFO - Running command: ['unzip', '-q', '-o', '/home/worker/workspace/build/target.crashreporter-symbols.zip', '-d', '/home/worker/workspace/build/symbols']
01:14:31     INFO - Copy/paste: unzip -q -o /home/worker/workspace/build/target.crashreporter-symbols.zip -d /home/worker/workspace/build/symbols
```

And the marionette runtests.py is being invoked with `--symbols-path=/home/worker/workspace/build/symbols`, which looks right. However! Something is not passing that symbols path down to mozcrash properly, if you look at the minidump_stackwalk invocation it puts in the log:
```
01:19:39     INFO -  mozcrash Copy/paste: /usr/local/bin/linux64-minidump_stackwalk /tmp/tmpEAKbwF.mozrunner/minidumps/2ee0ace2-b7d1-8fab-65952a72-05b2565e.dmp /tmp/tmpIc4Mhu
```

It's passing a temp directory there, not the symbols path. mozcrash will do that if you pass `None` down as the symbols path:
https://dxr.mozilla.org/mozilla-central/rev/d1102663db10b3d4b9358f3cf4e16b7c56902352/testing/mozbase/mozcrash/mozcrash/mozcrash.py#167

I tried tracing through the Marionette code, and it *looks* like it's persisting the symbols_path all the way down through mozrunner, but it's pretty complicated and I'm not at all confident of that.
Flags: needinfo?(ted)
whimboo: this is the kind of thing I was worried about with making `symbols_path` optional. :-/
Maja, can you have a look that we are doing things properly based on what ted has documented in comment 4 when you have a moment please.
Flags: needinfo?(mjzffr)
So the builds as pasted earlier on this bug are from Taskcluster and Buildbot.

Here the taskcluster one:

01:13:59     INFO - Run as /home/worker/workspace/mozharness/scripts/marionette.py --config-file mozharness/configs/marionette/prod_config.py --config-file mozharness/configs/remove_executables.py --no-read-buildbot-config --installer-url=https://queue.taskcluster.net/v1/task/KnQrzy0rRVONLRnC82z6OA/artifacts/public/build/target.tar.bz2 --test-packages-url=https://queue.taskcluster.net/v1/task/KnQrzy0rRVONLRnC82z6OA/artifacts/public/build/target.test_packages.json --e10s --download-symbols=true

We use the following symbols-url setting from the prod_config.py:
https://dxr.mozilla.org/mozilla-central/source/testing/mozharness/configs/marionette/prod_config.py#80

Because of download-symbols=True we download and extract. Both work fine and symbols should be at the location of symbols_path. Also all is fine for the run command call for Marionette:

01:15:55     INFO - Copy/paste: /home/worker/workspace/build/venv/bin/python -u /home/worker/workspace/build/tests/marionette/marionette/runtests.py --gecko-log=/home/worker/workspace/build/blobber_upload_dir/gecko.log --log-raw=/home/worker/workspace/build/blobber_upload_dir/marionette_raw.log --log-errorsummary=/home/worker/workspace/build/blobber_upload_dir/marionette_errorsummary.log --binary=/home/worker/workspace/build/application/firefox/firefox --address=localhost:2828 --symbols-path=/home/worker/workspace/build/symbols /home/worker/workspace/build/tests/marionette/tests/testing/marionette/harness/marionette/tests/unit-tests.ini

Maybe that is an issue in mozcrash here?


Here for the buildbot command:

/tools/buildbot/bin/python scripts/scripts/marionette.py --cfg marionette/prod_config.py --e10s --blob-upload-branch mozilla-inbound
 in dir /builds/slave/test/. (timeout 1800 secs) (maxTime 7200 secs)

The same config file is used, but with ondemand for download. As result we do not download and extract the archive via mozharness but pass it directly to mozcrash:

22:18:56     INFO - Calling ['/builds/slave/test/build/venv/bin/python', '-u', '/builds/slave/test/build/tests/marionette/marionette/runtests.py', '--gecko-log=/builds/slave/test/build/blobber_upload_dir/gecko.log', '--log-raw=/builds/slave/test/build/blobber_upload_dir/marionette_raw.log', '--log-errorsummary=/builds/slave/test/build/blobber_upload_dir/marionette_errorsummary.log', '--binary=/builds/slave/test/build/application/NightlyDebug.app/Contents/MacOS/firefox', '--address=localhost:2828', '--symbols-path=https://queue.taskcluster.net/v1/task/faovEIEtTvyN6xaRWYI_xw/artifacts/public/build/firefox-49.0a1.en-US.mac64.crashreporter-symbols.zip', '/builds/slave/test/build/tests/marionette/tests/testing/marionette/harness/marionette/tests/unit-tests.ini'] with output_timeout 1000

So in both cases we have a symbols-path specified, so none of my changes should have an impact here:
https://hg.mozilla.org/mozilla-central/rev/7ba4a5c7bc0e#l1.30

Maybe it might be good to get more logging into mozcrash? Also in case of symbols being not found or such.
Btw I can see some similar things for our firefox-ui-tests (bug 1279490). Given that crashes were properly symbolicated before, I wonder if that could be related a build system change?
Maybe the task definition changes for --download-symbols on bug 1267825 caused this problem? We have ondemand now for opt and pgo builds. I thought that minidump-stackwalk can only fetch nightly and release builds but not inbound. Or am I wrong, Ted?
Flags: needinfo?(ted)
I think the following comment in mozharness answers it:

https://dxr.mozilla.org/mozilla-central/source/testing/mozharness/mozharness/mozilla/testing/testbase.py#206

> # As of now this only works for Nightly and release builds.

Maja, you may want to check mozcrash how it handles the ondemand case with a known symbols url.
Flags: needinfo?(ted)
Blocks: 1281998
(In reply to Henrik Skupin (:whimboo) from comment #9)
> Maybe the task definition changes for --download-symbols on bug 1267825
> caused this problem? We have ondemand now for opt and pgo builds. I thought
> that minidump-stackwalk can only fetch nightly and release builds but not
> inbound. Or am I wrong, Ted?

Per my investigation in comment 4, I don't think mozharness is to blame here. It seems to be downloading and unzipping the symbols correctly, and passing that path to the Marionette harness correctly. Somewhere inside the harness that symbols_path is getting lost and mozcrash winds up using a temp directory.
Blocks: 1284159
Created attachment 8768219 [details]
Bug 1276011 - Fix symbols_path arg not being passed on to runner class.

Review commit: https://reviewboard.mozilla.org/r/62548/diff/#index_header
See other reviews: https://reviewboard.mozilla.org/r/62548/
Attachment #8768219 - Flags: review?(dburns)
I traced the runner code and it seems that the `symbols_path` arg was not being passed on to the Marionette constructor, and therefore the runner class, etc.

With the fix I get the following locally when testing with the crashme addon:

0:09.80 LOG: MainThread mozcrash INFO Copy/paste: /Users/mozilla/dev/temp/macosx64-minidump_stackwalk /var/folders/5k/xmn_fndx0qs2jcpcwhzl86wm0000gn/T/tmpRXUgV7/minidumps/2853752E-97E4-4027-B3FA-0B4491596CFE.dmp /Users/mozilla/dev/temp/firefox-49.0a1.en-US.mac64.crashreporter-symbols.zip
 0:11.67 CRASH: MainThread pid:4447. Test:runner.py. Minidump anaylsed:False. Signature:[@ libcrashme.dylib + 0xe50]
Crash dump filename: /var/folders/5k/xmn_fndx0qs2jcpcwhzl86wm0000gn/T/tmpRXUgV7/minidumps/2853752E-97E4-4027-B3FA-0B4491596CFE.dmp
Operating system: Mac OS X
                  10.11.5 15F34
CPU: amd64
     family 6 model 69 stepping 1
     4 CPUs

Crash reason:  EXC_BAD_ACCESS / KERN_INVALID_ADDRESS
Crash address: 0x0
Process uptime: 9 seconds

Thread 0 (crashed)
 0  libcrashme.dylib + 0xe50
    rax = 0x000000010e494e50   rdx = 0x000000010e494e34
    rcx = 0x000000012e1e88c0   rbx = 0x0000000000000001
    rsi = 0x00000001093d59b0   rdi = 0x0000000000000000
    rbp = 0x00007fff5b7aff60   rsp = 0x00007fff5b7aff50
     r8 = 0x00007fff5b7aff00    r9 = 0x00000001098893c8
    r10 = 0x00007fff5b7afec0   r11 = 0x000000010e494e10
    r12 = 0x0000000126009c18   r13 = 0x0000000109911644
    r14 = 0x000000010e494e10   r15 = 0xb700817702326750
    rip = 0x000000010e494e50
    Found by: given as instruction pointer in context
 1  XUL + 0x4bbbc44
    rbp = 0x00007fff5b7aff70   rsp = 0x00007fff5b7aff70
    rip = 0x0000000109910c44
    Found by: previous frame's frame pointer
 2  XUL + 0x4bbc620
    rsp = 0x00007fff5b7aff90   rip = 0x0000000109911620
    Found by: stack scanning
 3  libcrashme.dylib + 0xe10
    rsp = 0x00007fff5b7affb8   rip = 0x000000010e494e10
    Found by: stack scanning
 4  XUL + 0x468bf57
    rsp = 0x00007fff5b7b0010   rip = 0x00000001093e0f57
    Found by: stack scanning
 5  libcrashme.dylib + 0xe10
    rsp = 0x00007fff5b7b0020   rip = 0x000000010e494e10
    Found by: stack scanning
 6  XUL + 0x4132fe2
    rsp = 0x00007fff5b7b0050   rip = 0x0000000108e87fe2
    Found by: stack scanning
...
Flags: needinfo?(mjzffr)
(Assignee)

Updated

2 years ago
Assignee: nobody → mjzffr
I don't often look at crash dumps, but the above in Comment 13 doesn't look any better to me than Comment 0 (aside from the crash symbols being found correctly by minidump_stackwalk). 

Ryan or anyone -- what do you think?
Flags: needinfo?(ryanvm)
It looks like you passed a path to a symbols.zip--mozcrash is written to either take a path to an *unzipped* set of symbols, or a URL to a symbols.zip (which it will download and unzip). It doesn't have any code to handle a path to a local zip file.
Ah! Thanks for clearing that up.

Better:
0:10.95 LOG: MainThread mozcrash INFO Copy/paste: /Users/mozilla/dev/temp/macosx64-minidump_stackwalk /var/folders/5k/xmn_fndx0qs2jcpcwhzl86wm0000gn/T/tmpfK8KHv/minidumps/B4113599-5462-4790-B2E3-D281FDFC8F81.dmp /Users/mozilla/dev/temp/firefox-49.0a1.en-US.mac64.crashreporter-symbols
 0:20.39 CRASH: MainThread pid:6293. Test:runner.py. Minidump anaylsed:False. Signature:[@ Crash]
Crash dump filename: /var/folders/5k/xmn_fndx0qs2jcpcwhzl86wm0000gn/T/tmpfK8KHv/minidumps/B4113599-5462-4790-B2E3-D281FDFC8F81.dmp
Operating system: Mac OS X
                  10.11.5 15F34
CPU: amd64
     family 6 model 69 stepping 1
     4 CPUs

Crash reason:  EXC_BAD_ACCESS / KERN_INVALID_ADDRESS
Crash address: 0x0
Process uptime: 10 seconds

Thread 0 (crashed)
 0  libcrashme.dylib!Crash [nsCrasher.cpp : 56 + 0x0]
    rax = 0x0000000110870e50   rdx = 0x0000000110870e34
    rcx = 0x0000000163911500   rbx = 0x0000000000000001
    rsi = 0x000000010b8d59b0   rdi = 0x0000000000000000
    rbp = 0x00007fff592b9f60   rsp = 0x00007fff592b9f50
     r8 = 0x00007fff592b9f00    r9 = 0x000000010bd893c8
    r10 = 0x00007fff592b9ec0   r11 = 0x0000000110870e10
    r12 = 0x0000000128381240   r13 = 0x000000010be11644
    r14 = 0x0000000110870e10   r15 = 0x42005606fb6299f4
    rip = 0x0000000110870e50
    Found by: given as instruction pointer in context
 1  XUL!ffi_call_unix64 + 0x4c
    rbx = 0x0000000000000001   rbp = 0x00007fff592b9f70
    rsp = 0x00007fff592b9f70   r12 = 0x0000000128381240
    r13 = 0x000000010be11644   r14 = 0x0000000110870e10
    r15 = 0x42005606fb6299f4   rip = 0x000000010be10c44
    Found by: call frame info
 2  XUL!ffi_call + 0x420
    rbx = 0x0000000000000001   rbp = 0x00007fff592ba040
    rsp = 0x00007fff592b9f90   r12 = 0x0000000128381240
    r13 = 0x000000010be11644   r14 = 0x0000000110870e10
    r15 = 0x42005606fb6299f4   rip = 0x000000010be11620
    Found by: call frame info
Flags: needinfo?(ryanvm)
(Assignee)

Updated

2 years ago
See Also: → bug 1284847
Comment on attachment 8768219 [details]
Bug 1276011 - Fix symbols_path arg not being passed on to runner class.

https://reviewboard.mozilla.org/r/62548/#review59392
Attachment #8768219 - Flags: review?(dburns) → review+

Comment 18

2 years ago
Pushed by mjzffr@gmail.com:
https://hg.mozilla.org/integration/autoland/rev/f564cb1eb300
Fix symbols_path arg not being passed on to runner class. r=automatedtester

Comment 19

2 years ago
bugherder
https://hg.mozilla.org/mozilla-central/rev/f564cb1eb300
Status: NEW → RESOLVED
Last Resolved: 2 years ago
status-firefox50: --- → fixed
Resolution: --- → FIXED
Target Milestone: --- → mozilla50
Maja, I think it would be good to check also aurora and at least beta if those branches are affected to.
status-firefox48: --- → ?
status-firefox49: --- → ?
Flags: needinfo?(mjzffr)
David, looks like we have an aurora uplift for you.
status-firefox47: --- → unaffected
status-firefox48: ? → unaffected
status-firefox49: ? → affected
status-firefox-esr45: --- → unaffected
Flags: needinfo?(mjzffr) → needinfo?(dburns)
Thanks Maja. So the sheriffs can land this test framework only change when we add the correct whiteboard entry.
Flags: needinfo?(dburns)
Whiteboard: [checkin-needed-aurora]
(Reporter)

Updated

2 years ago
Blocks: 1274408
(Reporter)

Updated

2 years ago
No longer blocks: 1281998
Actually this is a regression from bug 1274408.
Keywords: regression
(Reporter)

Comment 24

2 years ago
bugherderuplift
https://hg.mozilla.org/releases/mozilla-aurora/rev/1986988c99dd
status-firefox49: affected → fixed
Whiteboard: [checkin-needed-aurora]
You need to log in before you can comment on or make changes to this bug.