[tier-3] Permafail - ERROR - The following files failed: 'macosx64-minidump_stackwalk'



a year ago


(Reporter: intermittent-bug-filer, Unassigned)


Version 3

Firefox Tracking Flags

(Not tracked)


(Whiteboard: [stockwell unknown])



a year ago
Filed by: fmezei [at] mozilla.com



After re-enabling the ondemand update tests in bug 1386628, all OSX update jobs fail on Nightly with this error.
The error is not seen on Beta 56. David, is this something that you could help with?
It seems that we cannot find the appropriate version of the minidump-stackwalk program on tooltool:


Ted, do you know what's wrong with this? Are we missing some specific version of the tools?
Flags: needinfo?(ted)

04:31:15     INFO - Calling ['/Library/Frameworks/Python.framework/Versions/2.7/Resources/Python.app/Contents/MacOS/Python', '/Users/mozauto/jenkins/workspace/mozilla-central_update/build/tooltool.py', '--url', 'https://tooltool.mozilla-releng.net/', 'fetch', '-m', '/Users/mozauto/jenkins/workspace/mozilla-central_update/build/tests/config/tooltool-manifests/macosx64/releng.manifest', '-o'] with output_timeout 600
04:31:15     INFO -  INFO - Attempting to fetch from 'https://tooltool.mozilla-releng.net/'...
04:31:16     INFO -  INFO - ...failed to fetch 'macosx64-minidump_stackwalk' from https://tooltool.mozilla-releng.net/
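For context on what tooltool is looking up here: a tooltool manifest (like the releng.manifest in the command above) is a JSON list of file records, and the server only knows files by their content digest. A sketch of what such an entry looks like, with the size and digest invented for illustration (they are not the real values from the failing manifest):

```python
import json

# Hypothetical releng.manifest entry; size and digest are placeholders.
# tooltool matches purely on the digest, so a stale or wrong digest means
# the file effectively does not exist on the server.
manifest = json.loads("""
[
  {
    "filename": "macosx64-minidump_stackwalk",
    "size": 1234567,
    "algorithm": "sha512",
    "digest": "placeholder-sha512-digest",
    "unpack": false,
    "visibility": "public"
  }
]
""")

for record in manifest:
    print(record["filename"], record["algorithm"], record["visibility"])
```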

vs (success on osx 10.10 firefox-ui):


16:02:22     INFO - Calling ['/tools/tooltool.py', '--url', 'https://tooltool.mozilla-releng.net/', '--authentication-file', '/builds/relengapi.tok', 'fetch', '-m', '/Users/cltbld/tasks/task_1505083747/build/tests/config/tooltool-manifests/macosx64/releng.manifest', '-o', '-c', '/builds/tooltool_cache'] with output_timeout 600
16:02:22     INFO -  INFO - File macosx64-minidump_stackwalk retrieved from local cache /builds/tooltool_cache

I notice these differences:
 - no --authentication-file used for failure case
 - no tooltool cache used for failure case
 - different location for tooltool.py (different version possible?)

I recall setting up the OS X tooltool cache in bug 1385629


but that's probably not the most important issue.
(In reply to Geoff Brown [:gbrown] from comment #3)
> I notice these differences:
>  - no --authentication-file used for failure case

I don't think this should matter, the file has `visibility: public`.

>  - no tooltool cache used for failure case

It's possible that not having a tooltool cache tickles a tooltool bug.

>  - different location for tooltool.py (different version possible?)

This is pretty plausible. The first log from comment 3 shows:

04:31:15     INFO - Downloading https://raw.githubusercontent.com/mozilla/build-tooltool/master/tooltool.py to /Users/mozauto/jenkins/workspace/mozilla-central_update/build/tooltool.py

the second log shows:

16:01:35     INFO -  'exes': {'tooltool.py': '/tools/tooltool.py',

...so it's using some baked-in copy of tooltool.py?

It might help to stick a `-v` in the tooltool commandline to get more info out of tooltool. Functionally it's not very complicated, it just builds a URL from the digest in the manifest and tries to download it.
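To illustrate that last point, the fetch step can be sketched roughly like this. This is a simplified approximation, not tooltool's actual code, and the exact URL scheme is an assumption for illustration:

```python
import hashlib

def build_fetch_url(base_url, algorithm, digest):
    # tooltool addresses files by content digest, roughly as
    # <base_url>/<algorithm>/<digest> (scheme assumed for illustration).
    return "%s/%s/%s" % (base_url.rstrip("/"), algorithm, digest)

def verify(content, algorithm, digest):
    # A downloaded file is only accepted if its digest matches the
    # digest recorded in the manifest entry.
    return hashlib.new(algorithm, content).hexdigest() == digest

url = build_fetch_url("https://tooltool.mozilla-releng.net/", "sha512", "abc123")
print(url)  # https://tooltool.mozilla-releng.net/sha512/abc123
```

So a "failed to fetch" error means either the URL built from the manifest digest returned an error, or the downloaded bytes did not match the digest.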
Flags: needinfo?(ted)
The tests which are failing here are not run via Taskcluster but via Mozmill CI. They do not use a tooltool cache, that's correct. But they also do not use a baked-in local copy of tooltool.py. For each job we run, a fresh copy of mozharness is downloaded:


Then we use the common.tests.zip archive and run the firefox-ui-update script, which is based on the script used by the fx-ui tests as executed in TC. The only difference is the mozharness config file, which is qa_jenkins.py:


And it sets `download_tooltool: True`, which always forces a fresh copy of the files. I assume Ted is right that we are unmasking a problem here which currently only happens on OS X. Linux and Windows are both fine.
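For illustration, the relevant piece of that mozharness config would look something like the fragment below. Only the `download_tooltool` key is taken from this bug; treat the shape as a paraphrase rather than the exact contents of qa_jenkins.py:

```python
# Sketch of a mozharness config fragment (hypothetical; only
# download_tooltool is confirmed by this bug report).
config = {
    # Always fetch manifest files fresh from the server,
    # bypassing any local tooltool cache.
    "download_tooltool": True,
}

print(config["download_tooltool"])
```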
If the cause is unclear, Florin could trigger some Nightly update tests in CI based on older Nightlies. The regression is definitely in the range of August 14th to September 8th, so about 5 test runs should be enough to nail this down.
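The "about 5 test runs" estimate follows from bisecting the regression window: each run halves the remaining range of nightlies, so the number of runs needed is roughly ceil(log2(window size)). A quick check, assuming the year is 2017 (the comment gives only month and day):

```python
import math
from datetime import date

# Regression window from the comment: Aug 14 to Sep 8 (year assumed 2017).
window_days = (date(2017, 9, 8) - date(2017, 8, 14)).days
runs_needed = math.ceil(math.log2(window_days))  # bisection halves the range each run
print(window_days, runs_needed)  # 25 5
```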
Comment hidden (Intermittent Failures Robot)
This continues to block update tests on OS X; all tests failed last week. Once we merge to Beta (57 Beta 1 should happen tomorrow), this will probably block all update tests on OS X. Does anyone have an update on a potential fix here?
I'm not familiar with Mozmill CI so I'm not actively trying to debug this.

To follow up on some of the speculation in earlier comments, it would be nice if someone (Florin?, Henrik?) added -v to the tooltool command line, temporarily, to get better diagnostics.

Also, retriggering tests as suggested in comment 6 might be useful.
Whiteboard: [stockwell needswork]
Who is working on this? I see this will cross our 'need to disable' threshold really soon, and we will disable the offending tests; without a specific set of offending tests, the entire suite.

:whimboo, I see you as the QA contact; are these tests ones that you are responsible for?
Flags: needinfo?(hskupin)
This is tier-3, and I no longer maintain those tests. Geoff and I made proposals, but so far nothing has happened on this bug.
Flags: needinfo?(hskupin)
Summary: Permafail - ERROR - The following files failed: 'macosx64-minidump_stackwalk' → [tier-3] Permafail - ERROR - The following files failed: 'macosx64-minidump_stackwalk'
Florin, who is responsible for developing these tests and ensuring they are green? I see in the title this is 'tier-3', but we are getting stars; I would really like to avoid getting data in orangefactor for tests that are tier-3.
Flags: needinfo?(florin.mezei)
(In reply to Joel Maher ( :jmaher) (UTC-5) from comment #12)
> Florin, who is responsible for developing these tests and ensuring they are
> green.  I see in the title this is 'tier-3', but we are getting stars, I
> would really like to avoid getting data in orangefactor for tests that are
> tier-3.

Joel, I don't know of anyone responsible for maintaining these anymore. The tests themselves had been disabled, but were then re-enabled temporarily until we do a complete analysis of existing coverage and gaps in automated update testing. They were re-enabled because in their absence we need to do manual update testing, which adds a lot of extra effort for the manual QA teams.

Is there anything else that we can do to avoid getting this in orangefactor? If we don't star the failing tests, would that help?
Flags: needinfo?(florin.mezei) → needinfo?(jmaher)
Shouldn't we start by adding '-v' to the tooltool invocation to figure out what's going on? If this can be reproduced on try I'll be glad to help out with that.
If the failures are not starred, then they will not get into orangefactor; that would help! I assume the team responsible for writing code for installers and application update would be able to help fix issues with the tests.
Flags: needinfo?(jmaher)
Gabriele is correct in #c14 -- we need that to figure out what the issue is.
I won't star these anymore.

Also I've tried re-running some tests with older Nightlies here: https://treeherder.mozilla.org/#/jobs?repo=mozilla-central&revision=42151fcd6cfc216d147730d0f2c6a2acd52d22fd&filter-searchStr=fxup&filter-tier=3&selectedJob=131969510.

The jobs still failed the same way, with this error. For example this job [1] was run with the same source build (--installer-url parameter) and same test package (--test-packages-url parameter) as when it originally passed on August 13.

I'm thinking that either I'm doing something wrong, or the problem is not in the Firefox code.

[1] https://firefox-ui-tests.s3.amazonaws.com/71281e3c-9476-42de-9225-55175388bde6/log_info.log
We are now also seeing this error in Firefox 57 Beta 3 - https://firefox-ui-tests.s3.amazonaws.com/27c140aa-37e5-488f-948e-787d96ba2937/log_info.log.

Also, while I hadn't noticed this before, this also affects Windows with a similar error: ERROR - The following files failed: 'win32-minidump_stackwalk.exe' - https://firefox-ui-tests.s3.amazonaws.com/ddd76aa7-4a6b-4d6f-bf33-7bf842e357a7/log_info.log.
Whiteboard: [stockwell needswork] → [stockwell unknown]
Last Resolved: a year ago
Resolution: --- → INCOMPLETE