Closed Bug 1176358 Opened 4 years ago Closed 4 years ago

Determine why the Linux Firefox UI update verify tests are crashing

Categories

(Release Engineering :: General, defect)

Tracking

(firefox40 fixed, firefox41 fixed, firefox42 fixed)

RESOLVED INVALID

People

(Reporter: armenzg, Assigned: armenzg)

Attachments

(8 files, 1 obsolete file)

Let's add --symbols-url and --gecko-log - to see more.

> CRASH: MainThread pid:17272. Test:runner.py. Minidump anaylsed:False. Signature:[None]
> MINIDUMP_STACKWALK not set, can't process dump.

http://ftp.mozilla.org/pub/mozilla.org/firefox/candidates/39.0b7-candidates/build1/logs/release-mozilla-beta-linux_beta_update_tests_1-bm77-build1-build1.txt.gz
Blocks: 1148546
To reiterate discussion from irc: The test runs well locally with the same build, and we had these running well before, there's only a few things that might have changed. Logging on to a test machine would be a good next step.

There's a very small chance this is a very intermittent crash that poisoned the remainder of the locales. I've mentioned this before a few times, but we should pass a new port to each locale so problems like this don't mean loss of all useful information from later locales.
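A minimal sketch of the "new port per locale" idea, assuming the harness takes an address of the form host:port (as in the --address=localhost:2828 flag seen in the logs). The helper name and the choice of consecutive ports are hypothetical, not what the landed patch does.

```python
BASE_PORT = 2828  # port that appears in the logged --address=localhost:2828 flag


def addresses_for_locales(locales, base_port=BASE_PORT, host="localhost"):
    """Give every locale its own host:port string (hypothetical helper).

    Ports are handed out consecutively starting at base_port, so a process
    left behind by one locale cannot block the port used by the next one.
    """
    return {
        locale: "%s:%d" % (host, base_port + offset)
        for offset, locale in enumerate(locales)
    }


if __name__ == "__main__":
    for locale, address in addresses_for_locales(["ach", "de", "en-US"]).items():
        print(locale, address)
```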
Well, I'm more puzzled by this output:

21:11:57     INFO -   0:12.24 LOG: MainThread INFO 	Fallback update test ran and PASSED
21:11:57     INFO -   0:12.24 LOG: MainThread INFO 	Direct update test ran and PASSED
21:11:57     INFO -   0:12.24 LOG: MainThread INFO Uninstalling application at "/tmp/tmpvdl2to/firefox"
21:11:57     INFO -   0:12.26 LOG: MainThread ERROR Failure during execution of the update test.

Both tests are passing, so why do we fail? We somehow have wrong states here.
self.failed doesn't get set when the tests don't run. That summary should be fixed to call "if self.failed or not self.passed" a failure, but there's really no ambiguity about the overall health of the run.
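For illustration, the proposed summary condition could look roughly like this. It assumes passed and failed are counts (or lists) of test results; the function name is made up and this is not the harness's actual code.

```python
def run_is_failure(passed, failed):
    """Treat a run as failed if anything failed OR nothing passed at all.

    The second half covers the case where the tests never ran, so neither
    value was ever set.
    """
    return bool(failed) or not passed


if __name__ == "__main__":
    print(run_is_failure(passed=0, failed=0))  # tests never ran
    print(run_is_failure(passed=2, failed=0))  # both update tests passed
```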
Bug 1176358 - Log gecko output to stdout and use a different port for each locale when running firefox update tests. r=armenzg
Attachment #8626819 - Flags: review?(armenzg)
Attachment #8626819 - Flags: review?(armenzg) → review+
Comment on attachment 8626819 [details]
MozReview Request: Bug 1176358 - Log gecko output to stdout and use a different port for each locale when running firefox update tests. r=armenzg

https://reviewboard.mozilla.org/r/12155/#review10623

Ship It!
(In reply to Chris Manchester [:chmanchester] from comment #4)
> MozReview Request: Bug 1176358 - Log gecko output to stdout and use a
> different port for each locale when running firefox update tests. r=armenzg

I just want to say that in general it is not a good idea to use different ports! The real underlying problem will be focus issues! If multiple Marionette processes really are running side by side, you can expect a lot of test failures.

So I'm not a fan of this workaround, and I would really appreciate getting the underlying problem fixed in Marionette.
We have not yet had any betas since our workaround landed.
My concern is log sizes might be too big.

(In reply to Henrik Skupin (:whimboo) [away 07/01 - 07/31] from comment #174)
> (In reply to Armen Zambrano G. (:armenzg - Toronto) from comment #173)
> > * Linux [1]
> > ** We get a crash
> > ** CRASH: MainThread pid:17272. Test:runner.py. Minidump anaylsed:False.
> > Signature:[None]
> > ** MINIDUMP_STACKWALK not set, can't process dump.
> 
> Do we have symbols for release and beta builds? If yes then you really
> should pass the symbols url into our update script via --symbols-url %URL%.
> 

I've found the symbols:
https://ftp.mozilla.org/pub/mozilla.org/firefox/candidates/39.0b3-candidates/build1/mac/en-US/Firefox%2039.0b3.crashreporter-symbols.zip
https://ftp.mozilla.org/pub/mozilla.org/firefox/candidates/38.0.1-candidates/build1/mac/en-US/Firefox%2038.0.1.crashreporter-symbols.zip
We now have more output.
chmanchester: what do you suggest we try now?

We should also fail the job rather than make it sound as if we succeeded.

wget https://ftp.mozilla.org/pub/mozilla.org/firefox/candidates/40.0b2-candidates/build1/logs/release-mozilla-beta-linux64_beta_update_tests_1-bm74-build1-build0.txt.gz
gzip -d release-mozilla-beta-linux64_beta_update_tests_1-bm74-build1-build0.txt.gz

MacAir ci_tools git:[master] $ grep "Downloading" release-mozilla-beta-linux64_beta_update_tests_1-bm74-build1-build0.txt | wc -l
     348
MacAir ci_tools git:[master] $ grep "firefox-ui-update has failed" release-mozilla-beta-linux64_beta_update_tests_1-bm74-build1-build0.txt | wc -l
     312
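The same counts can be taken straight from the .gz file without unpacking it first. This is only a sketch of the grep | wc -l pipeline above; the function names are hypothetical.

```python
import gzip


def count_lines_containing(lines, needle):
    """Count the lines that contain needle (same idea as grep ... | wc -l)."""
    return sum(1 for line in lines if needle in line)


def count_in_gzip_log(path, needle):
    """Count matching lines in a gzip-compressed log without unpacking it."""
    with gzip.open(path, "rt", encoding="utf-8", errors="replace") as log:
        return count_lines_containing(log, needle)


if __name__ == "__main__":
    sample = [
        "INFO - Downloading build for ach",
        "ERROR - firefox-ui-update has failed",
        "INFO - Downloading build for de",
    ]
    print(count_lines_containing(sample, "Downloading"))
    print(count_lines_containing(sample, "firefox-ui-update has failed"))
```

If each "Downloading" line corresponds to one run (an assumption), the numbers above mean roughly 90% of the runs in that log failed.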
The gecko output isn't particularly illuminating to me (it might be to someone else though). Getting MINIDUMP_STACKWALK on these machines and having someone log on to attempt to observe the failure like I mentioned before would be my next steps. Is there some obvious difference between these machines and all those where we've seen these tests run and pass?
It's the same machine type.
I could ask someone from releng to change VNC for me while the job is running to access it.

Should we use the --symbols-url approach? or set the MINIDUMP_STACKWALK?
I will be trying this tomorrow.
> Should we use the --symbols-url approach? or set the MINIDUMP_STACKWALK?
> I will be trying this tomorrow.

I'm not sure how this usually works, but I'd expect we need both.
The last thing gecko logs before each crash comes from http://hg.mozilla.org/mozilla-central/annotate/9340658848d1/toolkit/mozapps/extensions/DeferredSave.jsm#l221 -- the next thing that should happen is a call to OS.File.writeAtomic in the profile directory.
Let's make sure that we're not running out of space.
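A quick way to check free space before a run, as a sketch only. The helper name and the threshold are made up for illustration, and this is not what the purge patch does.

```python
import shutil


def has_free_space(path, min_free_gb):
    """Return True if the filesystem holding path has at least min_free_gb free."""
    free_bytes = shutil.disk_usage(path).free
    return free_bytes >= min_free_gb * 1024 ** 3


if __name__ == "__main__":
    # /tmp is where the binary and profile end up in these runs.
    print(has_free_space("/tmp", 5))
```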
Assignee: nobody → armenzg
Status: NEW → ASSIGNED
Attachment #8631030 - Flags: review?(bhearsum)
Comment on attachment 8631030 [details] [diff] [review]
[mozharness] purge

Review of attachment 8631030 [details] [diff] [review]:
-----------------------------------------------------------------

I hope it's as simple as this! Given Chris' comment, it seems like it might be.
Attachment #8631030 - Flags: review?(bhearsum) → review+
It was not the case. We purged and we had over 30GB yet we still failed.

I will try adding the minidump and force a crash.
This worked locally.

This patch relies on the patch on bug 1182271.

python scripts/firefox_ui_updates.py --cfg developer_config.py --cfg generic_releng_macosx64.py  --tools-tag default --firefox-ui-branch mozilla-beta --update-verify-config mozBeta-firefox-mac64.cfg --run-tests --installer-path `pwd`/build/Firefox%2039.0b3.dmg --symbols-path https://ftp.mozilla.org/pub/mozilla.org/firefox/candidates/39.0b3-candidates/build1/mac/en-US/Firefox%2039.0b3.crashreporter-symbols.zip
Attachment #8631799 - Flags: review?(bhearsum)
Attachment #8631799 - Flags: review?(bhearsum) → review+
Blocks: 1182796
Could you please land this if satisfied?
Attachment #8632468 - Flags: review?(bhearsum)
I know this is a hack; however, I'm hoping we can shed some light on the Linux issues.

It is a hack because we look for the symbols file in:
* ftp/candidates instead of stage
* regardless of the locale, we grab from en-US
* regardless of build#, we look under build1 since I have no way of knowing which build# it is

This is a safe hack in that it does not fail if there is no symbols file (e.g. beta releases from 38.0 don't have a symbols file in the candidates dir anymore). You can see the warning [1] we throw but we continue without any problems.

If a symbols file is available, we will pass --symbols-path to the fx-ui-updates binary.

bhearsum: if you're satisfied with this hack to be able to get the crashdump so chmanchester can look into it please land it on my behalf.

If you could log in to a machine to try a few things while a job is running, that would be even more awesome, but I know that you probably have higher priorities.

[1]
13:27:01  WARNING - HTTP Error 404: Not Found - https://ftp.mozilla.org/pub/mozilla.org/firefox/candidates/40.0b1-candidates/build1/mac/en-US/Firefox%2040.0b1.crashreporter-symbols.zip
Attachment #8632469 - Flags: review?(bhearsum)
Attachment #8632468 - Flags: review?(bhearsum)
Attachment #8632468 - Flags: review+
Attachment #8632468 - Flags: checked-in+
Comment on attachment 8632469 [details] [diff] [review]
[mozharness] If there is a symbols file under en-US build1 use it

Review of attachment 8632469 [details] [diff] [review]:
-----------------------------------------------------------------

(In reply to Armen (back on Monday July 20th) from comment #20)
> Created attachment 8632469 [details] [diff] [review]
> [mozharness] If there is a symbols file under en-US build1 use it
> 
> I know this is a hack, however, I'm hoping we can get some light into the
> Linux issues.
> 
> It is a hack because we look for the symbols file in:
> * in ftp/candidates instead of stage
> * regardless of the locale, we grab from en-US
> * regardless of build#, we look under build1 since I have no way of knowing
> which build# it is

Guessing the build# seems like a very bad idea. If you use build1's crashsymbols against build2 or later, you're probably going to end up with a broken, or even misleading stack. The simple workaround for this is to pass --build-number to this script. You should also pass stageServer, since that's readily available in the release config as well.

> This is a safe hack in that it does not fail if there is no symbols file
> (e.g. beta releases from 38.0 don't have a symbols file in the candidates
> dir anymore). You can see the warning [1] we throw but we continue without
> any problems.
> 
> If there is a symbols available we will pass --symbols-path to the
> fx-ui-updates binary.
> 
> bhearsum: if you're satisfied with this hack to be able to get the crashdump
> so chmanchester can look into it please land it on my behalf.
> 
> If you could log in into a machine to try few things while a job is running
> it would be even more awesome but I know that you probably have higher
> priorities.
> 
> [1]
> 13:27:01  WARNING - HTTP Error 404: Not Found -
> https://ftp.mozilla.org/pub/mozilla.org/firefox/candidates/40.0b1-candidates/
> build1/mac/en-US/Firefox%2040.0b1.crashreporter-symbols.zip

I don't think it's true that we don't have crashsymbols; this is a case where guessing and grabbing from build1 is wrong, we didn't get far enough into that build attempt to have symbols.

I can't r+ this patch for these reasons. I know you're out for a week, I'll try to throw something together in your stead.
Attachment #8632469 - Flags: review?(bhearsum) → review-
This patch builds on Armen's, and improves symbol location guessing by:
* Requiring build number to be passed in.
* Continuing to look on https://ftp rather than http://stage (there's no reason to switch hosts - they serve the same contents).
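As a rough sketch of what building the symbols location from an explicit build number could look like: the URL layout is copied from the symbols links quoted in this bug, while the function name and parameters are hypothetical and not taken from the actual patch.

```python
CANDIDATES_ROOT = "https://ftp.mozilla.org/pub/mozilla.org/firefox/candidates"


def symbols_url(version, build_number, platform, locale="en-US"):
    """Build a crashreporter-symbols URL for a candidates build.

    platform is the directory name used on ftp, e.g. "mac", "linux-i686"
    or "linux-x86_64". The file name pattern differs between mac and linux
    in the URLs quoted in this bug; other platforms are not covered here.
    """
    if platform == "mac":
        filename = f"Firefox%20{version}.crashreporter-symbols.zip"
    else:
        filename = f"firefox-{version}.crashreporter-symbols.zip"
    return (
        f"{CANDIDATES_ROOT}/{version}-candidates/build{build_number}"
        f"/{platform}/{locale}/{filename}"
    )


if __name__ == "__main__":
    print(symbols_url("40.0b2", 1, "linux-i686"))
```

Passing the build number in, rather than assuming build1, avoids pairing one build's symbols with a later build's binary.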
Attachment #8632469 - Attachment is obsolete: true
Attachment #8632765 - Flags: review?(rail)
Attachment #8632766 - Flags: review?(rail) → review+
The first log, for linux64, seems to indicate that localhost:xxxx is not available. Possibly a Marionette issue?

the second one for linux has:
OSError: [Errno 2] No such file or directory

could this be a wrong path?


overall I am thinking this might be harness related, etc.  Chris, can you weigh in on the first issue on linux64?
Flags: needinfo?(cmanchester)
The "No such file or directory" when launching the browser as well as my observation in comment 14 sort of make me think these machines are set up in a way that doesn't let us do what we're trying to do in /tmp (the binary and profile are put there). If this were inherent to the harness I would expect to be able to reproduce it locally (or Armen would have encountered it during other testing).

:bhearsum, can you help me get logged on to an instance that might reproduce this so I can test the above?
Flags: needinfo?(cmanchester) → needinfo?(bhearsum)
(In reply to Chris Manchester [:chmanchester] from comment #27)
> The "No such file or directory" when launching the browser as well as my
> observation in comment 14 sort of make me think these machines are set up in
> a way that doesn't let us do what we're trying to do in /tmp (the binary and
> profile are put there). If this were inherent to the harness I would expect
> to be able to reproduce it locally (or Armen would have encountered it
> during other testing).
> 
> :bhearsum, can you help me get logged on to an instance that might reproduce
> this so I can test the above?

Armen's already got a machine loaned out for this, I think I've granted you access to it as well. You should have forwarded mail with connection details - ping me in #releng if you have any issues!
Flags: needinfo?(bhearsum)
I wasn't able to access the loaner, and trying this locally deleted most of my home directory. I'll check back in tomorrow.
Attached file crash stack
With a bunch of hacks I was able to get a stack out of the crash running the tests through ssh (the patch to get symbols is not effective).
Running

yum install dejavu-lgc-sans-fonts

on my loaner fixes the start up crash, and the update tests run and pass.
The problem with the linux runs is that we are attempting to run a 32 bit firefox on a 64 bit os; we need 32 bit versions of various libraries (libc, libstdc++, etc) installed for this to work.
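One way to confirm which kind of binary a job is about to launch is to read the ELF header: byte 4 (EI_CLASS) is 1 for 32 bit and 2 for 64 bit. A minimal sketch follows; the helper names are made up, and it only applies to Linux ELF binaries.

```python
ELF_MAGIC = b"\x7fELF"


def elf_bits(header):
    """Return 32 or 64 based on the EI_CLASS byte of an ELF header."""
    if len(header) < 5 or header[:4] != ELF_MAGIC:
        raise ValueError("not an ELF file")
    ei_class = header[4]
    if ei_class == 1:
        return 32
    if ei_class == 2:
        return 64
    raise ValueError("unknown ELF class: %d" % ei_class)


def binary_bits(path):
    """Read the first five bytes of the file at path and classify it."""
    with open(path, "rb") as binary:
        return elf_bits(binary.read(5))


if __name__ == "__main__":
    print(elf_bits(b"\x7fELF\x01"))
    print(elf_bits(b"\x7fELF\x02"))
```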
I backed out the purging of the builds for now in bug 1183858, since if run on a user's machine it can purge their home directory
https://hg.mozilla.org/build/mozharness/rev/0dc76e63c4a6
(In reply to Armen Zambrano Gasparnian [:armenzg] from comment #38)
> I backed out the purging of the builds for now in bug 1183858 since if run
> on a users machine it can purge their home directory
> https://hg.mozilla.org/build/mozharness/rev/0dc76e63c4a6

Perhaps removing it from the default list of actions would be a good idea instead? We're probably going to have lots of failures if we're not purging...
Perhaps. That would require adding a config specifically for that.
I would like this to be fixed properly since accidents can happen and I would prefer a user not to lose anything locally because they copy/pasted something from a log.

I believe runner has recently landed for Windows machines so I feel it will be less of an issue.

I've also landed in inbound to make sure there are no discrepancies:
https://hg.mozilla.org/integration/mozilla-inbound/rev/2e380fbc1e3b
No longer depends on: 1183858
(In reply to Chris Manchester [:chmanchester] from comment #37)
> The problem with the linux runs is attempting to run a 32 bit firefox on a
> 64 bit os, we need 32 bit versions of various libraries (libc, libstdc++,
> etc) installed for this to work.

I wonder if I did not see this on my loaner because I assumed that the issue was happening on both archs and I was only running the L64 jobs.

######################
...
21:43:28     INFO -      raise IOError("process has died with return code %d" % poll)
21:43:28     INFO -  IOError: process has died with return code 11

######################
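If the return code reported above is a signal number (an assumption, since the harness may report it differently), it can be mapped to a name like this. The helper name is hypothetical.

```python
import signal


def signal_name(return_code):
    """Map a return code to a signal name, or None if it is not a signal.

    subprocess reports death-by-signal as a negative number, while the log
    above shows a positive 11, so the sign is ignored here.
    """
    try:
        return signal.Signals(abs(return_code)).name
    except ValueError:
        return None


if __name__ == "__main__":
    print(signal_name(11))
```

On Linux, signal 11 is SIGSEGV, which would be consistent with a crash at start-up.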

I don't think we're yet using the minidump code:
21:43:28     INFO -  Crash dump filename: /tmp/tmp8d55Y6.mozrunner/minidumps/4f24827f-86f6-ded8-15e44ccf-7e496bd7.dmp
21:43:28     INFO -  No symbols path given, can't process dump.
21:43:28     INFO -  MINIDUMP_STACKWALK not set, can't process dump.

The logs say that we're using:
https://hg.mozilla.org/build/mozharness/file/eb48289f9701/scripts/firefox_ui_updates.py
instead of bhearsum's landed code:
https://hg.mozilla.org/build/mozharness/diff/a34e226a5e61/scripts/firefox_ui_updates.py

I think we still have to wait for the next beta to reach the update testing jobs, which will then execute the code that landed last week:
http://ftp.mozilla.org/pub/mozilla.org/firefox/candidates/40.0b5-candidates/build1/logs/
Depends on: 1185623
https://hg.mozilla.org/mozilla-central/rev/2e380fbc1e3b
Status: ASSIGNED → RESOLVED
Closed: 4 years ago
Resolution: --- → FIXED
Status: RESOLVED → REOPENED
Keywords: leave-open
Resolution: FIXED → ---
The next beta tried to use in-tree mozharness which is missing Fx UI changes.
bug 1186083 to solve that.
I've landed the code on beta in bug 1186083.
Once we have a new beta later this week (or early next week) we will be able to use the new code.
Attachment #8638526 - Flags: review?(bhearsum) → review+
We have a new set of crash dumps (this is for the Linux 64-bit jobs, not the 32-bit jobs) [1].
I guess there's nothing really to be done in here until someone can be allocated for bug 1185623.

After the crash dump, the main error is:
> 03:58:22     INFO -  IOError: process has died with return code 11

Running it on my loaner with [2] works perfectly.

For reference's sake, this is the command which is run internally [3]

[1]
http://ftp.mozilla.org/pub/mozilla.org/firefox/candidates/40.0b8-candidates/build1/logs/release-mozilla-beta-linux64_beta_update_tests_3-bm70-build1-build2.txt.gz

[2]
python scripts/firefox_ui_updates.py --cfg generic_releng_config.py --cfg generic_releng_linux.py --firefox-ui-branch mozilla-beta --update-verify-config mozBeta-firefox-linux.cfg --tools-tag FIREFOX_40_0b8_RELEASE_RUNTIME --installer-url http://stage.mozilla.org/pub/mozilla.org//firefox/releases/40.0b2/linux-i686/ach/firefox-40.0b2.tar.bz2 --symbols-path http://stage.mozilla.org/pub/mozilla.org/firefox/candidates/40.0b2-candidates/build1/linux-i686/en-US/firefox-40.0b2.crashreporter-symbols.zip

[3]
/builds/slave/rel-m-beta-l64_beta_u_t_3-0000/build/venv/bin/firefox-ui-update --installer /builds/slave/rel-m-beta-l64_beta_u_t_3-0000/build/firefox-39.0b1.tar.bz2 --gecko-log=- --address=localhost:2828 --symbols-path http://stage.mozilla.org/pub/mozilla.org/firefox/candidates/39.0b1-candidates/build1/linux-x86_64/en-US/firefox-39.0b1.crashreporter-symbols.zip --update-channel beta-localtest
Did you consider running these jobs on test slaves? We really don't run Firefox on the build slaves all that much, maybe just during PGO.
(In reply to Nick Thomas [:nthomas] from comment #57)
> Did you consider running these jobs on test slaves ? We really don't run
> Firefox on the build slaves all that much, maybe just during PGO.

I've considered it. It seems that's the place where we should have started.
We probably chose the builders since the updates metadata logic was already in place under process/release.py [1]

It sounds like hacking buildbotcustom/configs to make such a switch would require a fair bit of work to re-test everything on the testers.

[1] http://hg.mozilla.org/build/buildbotcustom/file/default/process/release.py#l1333
FYI ultimate hack used here. Don't do this at home without parental supervision.

I'm going to get us the answer on how these run on the test machines.
I will check in tomorrow.

Firefox UI tests for L64 on try (experiment 1): https://treeherder.mozilla.org/#/jobs?repo=try&revision=8d25a1dfde15
Patch: https://hg.mozilla.org/try/rev/8d25a1dfde15

Steps:
* ln firefox_ui_updates.py desktop_unittest.py
* Patched sys.argv [1] to run Linux64
* Use this commit message (try: -b o -p linux64 -u mochitest-1 -t none)
** Probably un-needed since I will be using mozci
* Cancel running jobs
* Use mozci to trigger builder (hacky --file usage):
** mozci-trigger -b "Ubuntu VM 12.04 x64 try opt test mochitest-1" -r 8d25a1dfde15 --file http://who-cares.com


[1]
+            # Hacky flags to work on the try server
+            [['--mochitest-suite'], {'dest': 'foo1'}],
+            [['--blob-upload-branch'], {'dest': 'foo2'}],
+            [['--download-symbols'], {'dest': 'foo2'}],
+        ] + copy.deepcopy(self.harness_extra_args)
+
+        print "Old sys.argv:"
+        print sys.argv
+        command = 'scripts/scripts/firefox_ui_updates.py --cfg generic_releng_config.py --cfg generic_releng_linux64.py --firefox-ui-branch mozilla-beta --update-verify-config mozBeta-firefox-linux64.cfg --tools-tag FIREFOX_40_0b6_RELEASE_RUNTIME --total-chunks 6 --this-chunk 1 --build-number 1'
+        sys.argv = command.split(' ')
+        print "New sys.argv:"
+        print sys.argv
I created steps on how to run jobs on the try server:
https://wiki.mozilla.org/Auto-tools/Projects/Marionette_update_tests#Test_your_changes_on_the_Try_server

I will work today on getting the jobs running on Ubuntu 64-bit to completion. There are some mozharness adjustments needed.
We're moving to testers.
Status: REOPENED → RESOLVED
Closed: 4 years ago
Resolution: --- → INVALID
Removing leave-open keyword from resolved bugs, per :sylvestre.
Keywords: leave-open
Component: General Automation → General