Closed Bug 1303834 Opened 5 years ago Closed 4 years ago

[tier-3] Intermittent test_fallback_update.py TestFallbackUpdate.test_update (partial MAR) | IOError: Process killed because the connection to Marionette server is lost. Check gecko.log for errors (Reason: Timed out waiting for port 2828!)

Categories

(Toolkit :: Application Update, defect)

defect
Not set
critical

Tracking


RESOLVED FIXED
mozilla55
Tracking Status
firefox50 --- wontfix
firefox51 --- wontfix
firefox52 --- wontfix
firefox-esr52 --- fixed
firefox53 --- wontfix
firefox54 --- fixed
firefox55 --- fixed

People

(Reporter: intermittent-bug-filer, Assigned: mhowell)

References

Details

(Keywords: intermittent-failure, regression, Whiteboard: [stockwell unknown])

Attachments

(2 files)

Filed by: hskupin [at] gmail.com

https://treeherder.mozilla.org/logviewer.html#?job_id=5036232&repo=mozilla-central

https://firefox-ui-tests.s3.amazonaws.com/1cf613c4-045e-4935-a0c7-649120b5b75f/log_info.log

This is a follow-up bug for bug 1293404, as it looks like the issue we initially saw has not been fully fixed yet.
From the logs:

10:37:16     INFO -  *** AUS:UI gFinishedPage:not elevationRequired
10:37:16     INFO -  (firefox:5620): GLib-GObject-CRITICAL **: g_type_add_interface_static: assertion 'g_type_parent (interface_type) == G_TYPE_INTERFACE' failed
10:37:16     INFO -  (firefox:5620): GLib-GObject-CRITICAL **: g_type_add_interface_static: assertion 'g_type_parent (interface_type) == G_TYPE_INTERFACE' failed
10:37:16     INFO -  (firefox:5620): GLib-GObject-CRITICAL **: g_type_add_interface_static: assertion 'g_type_parent (interface_type) == G_TYPE_INTERFACE' failed
10:37:21     INFO -  (firefox:5620): GLib-GObject-CRITICAL **: g_object_ref: assertion 'object->ref_count > 0' failed
10:37:21     INFO -  (firefox:5620): GLib-GObject-CRITICAL **: g_object_unref: assertion 'object->ref_count > 0' failed
10:37:21     INFO -  (firefox:5620): GLib-GObject-CRITICAL **: g_object_ref: assertion 'object->ref_count > 0' failed
10:37:21     INFO -  (firefox:5620): GLib-GObject-CRITICAL **: g_object_unref: assertion 'object->ref_count > 0' failed
10:37:22     INFO -  *** UTM:SVC TimerManager:registerTimer - id: xpi-signature-verification
10:37:22     INFO -  ATTENTION: default value of option force_s3tc_enable overridden by environment.
10:37:22     INFO -  *** AUS:SVC Creating UpdateService
10:37:22     INFO -  *** AUS:SVC readStatusFile - status: applying, path: /home/mozauto/jenkins/workspace/mozilla-central_update/build/application.copy/updates/0/update.status
10:37:23     INFO -  1474306643425	Marionette	ERROR	Error on starting server: [Exception... "Component returned failure code: 0x804b0036 (NS_ERROR_SOCKET_ADDRESS_IN_USE) [nsIServerSocket.initSpecialConnection]"  nsresult: "0x804b0036 (NS_ERROR_SOCKET_ADDRESS_IN_USE)"  location: "JS frame :: chrome://marionette/content/server.js :: MarionetteServer.prototype.start :: line 85"  data: no]
10:37:23     INFO -  [Exception... "Component returned failure code: 0x804b0036 (NS_ERROR_SOCKET_ADDRESS_IN_USE) [nsIServerSocket.initSpecialConnection]"  nsresult: "0x804b0036 (NS_ERROR_SOCKET_ADDRESS_IN_USE)"  location: "JS frame :: chrome://marionette/content/server.js :: MarionetteServer.prototype.start :: line 85"  data: no]
10:37:23     INFO -  MarionetteServer.prototype.start@chrome://marionette/content/server.js:85:19
10:37:23     INFO -  MarionetteComponent.prototype.init@resource://gre/components/marionettecomponent.js:218:5
10:37:23     INFO -  MarionetteComponent.prototype.observe@resource://gre/components/marionettecomponent.js:142:7
10:37:23     INFO -  *** UTM:SVC TimerManager:registerTimer - id: browser-cleanup-thumbnails
10:37:29     INFO -  *** AUS:SVC gCanCheckForUpdates - able to check for updates
The readStatusFile call in the above snippet should be reporting that the status is "applied", not "applying". That's what's shown in the logs for passing runs. That means that the failure is caused by the browser process deciding that the updater process has exited before it actually has.

So the question is how that's happening. I don't think waitpid() is throwing any errors because the bug 1293404 patch added a log message for that case, and I don't see that message in the logs (it would have been in the above snippet). I also don't think waitpid() is reporting some state change in the updater other than exiting normally, because there also should have been a log message on that branch even in the original bug 1272614 patch. That doesn't really leave me any ideas.
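For readers following along, the wait logic under discussion is essentially a non-blocking waitpid() poll. Here is a minimal Python sketch of that pattern (the helper name and polling interval are illustrative; this is not the actual nsUpdateDriver code):

```python
import os
import subprocess
import sys
import time

def process_has_terminated(pid):
    """Non-blocking check, analogous to the waitpid() call under discussion.

    os.waitpid() with WNOHANG returns (0, 0) while the child is still
    running, and (pid, status) once it has exited.
    """
    try:
        waited_pid, _status = os.waitpid(pid, os.WNOHANG)
    except ChildProcessError:
        # No such child: it was already reaped, so treat it as terminated.
        return True
    return waited_pid == pid

# Spawn a stand-in "updater" process that runs briefly, then poll it.
child = subprocess.Popen([sys.executable, "-c", "import time; time.sleep(1)"])
while not process_has_terminated(child.pid):
    time.sleep(0.1)  # the real code waits 1 second between checks
print("updater exited")
```

The failure mode in this bug would correspond to the loop above deciding the child has exited while waitpid() is still returning 0.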

whimboo, these jobs don't have TaskCluster IDs (I guess they don't run in TaskCluster?) so I can't get loaners from them. I'd really like to point strace at one of these runs so I can see what waitpid() is doing. Do I need to use the "manual" loaner request process?
Flags: needinfo?(hskupin)
(In reply to Matt Howell [:mhowell] from comment #2)
> whimboo, these jobs don't have TaskCluster ID's (I guess they don't run in
> TaskCluster?) so I can't get loaners from them. I'd really like to point
> strace at one of these runs so I can see what waitpid() is doing. Do I need
> to use the "manual" loaner request process?

Sorry, but I had to revert mozmill-ci from using TaskCluster yesterday because some incompatible changes landed at the end of last week. Given how many changes are happening right now in terms of task definitions, I'm not able to keep up with all of them, and I don't want to risk more breakage. So we run all of our tests via our own slaves again.

That shouldn't actually be bad! The benefit is that we always have them available, and connections won't drop as they do on AWS. I will see whether I can prepare a machine for your use. Then you can do whatever you want.
Flags: needinfo?(hskupin)
So the machine I wanted to set up doesn't show this particular problem. Therefore I prepared one of our slave nodes from the production environment. I would appreciate it if you could limit your activity to the local environment and the folders listed below, to avoid global pollution of the VM. Thanks.

Steps to run the tests:
1. Connect to the Mozilla VPN
2. Connect via ssh to mm-ub-1404-64-4.qa.scl3.mozilla.com (ping me on IRC to get the user/pass)
3. Run `screen -x`
4. Run the last commands from bash history

It always reproduces this failure for me with a source build from Sep 19th.
I've not been able to reach that server through the VPN. I suspected a VPN configuration issue at first, but I've tried from a Mac, Windows, and a Linux system, so I don't think I'm doing the same thing wrong on all three.

On IRC, whimboo suggested I could use an old TaskCluster run but with a newer binary, and I was able to do that. But I'm not any less confused now, unfortunately. strace doesn't show any waitpid calls for the updater that return anything other than 0 (meaning "process still running") so I have no better idea how that loop is getting broken out of prematurely. I may have to push some patches to oak that add some more logging for... something.
https://hg.mozilla.org/projects/oak/rev/e86e0aeb2dc0ca4c5a3b1281b6be856751655839
Added some more logging for investigating bug 1303834. Don't land this anywhere else.
https://hg.mozilla.org/projects/oak/rev/664d9b090718
Merge m-c to oak, so that we have a second build to test bug 1303834 against.
I requested new nightly builds for both changesets. They should appear soon.
FYI, currently we restart Firefox via Services.startup.quit(), but with bug 1304656 we now want to change it to use the restart button. I'm not sure whether Firefox would behave differently between those two methods.
Matt, you can get a Nightly here for testing:
https://archive.mozilla.org/pub/firefox/nightly/2016/09/2016-09-21-23-36-38-oak/
Flags: needinfo?(mhowell)
Our tradition of logs making no sense whatsoever and leaving me utterly confused remains unbroken:

[...]
1928328960[7f9766bba300]: ProcessHasTerminated: Checking state of updater process
1928328960[7f9766bba300]: ProcessHasTerminated: Updater process is still running; waiting 1 second before trying again
1928328960[7f9766bba300]: WaitForProcess: process still running, dispatching myself
*** UTM:SVC TimerManager:registerTimer - id: xpi-signature-verification
ATTENTION: default value of option force_s3tc_enable overridden by environment.
*** AUS:SVC Creating UpdateService
*** AUS:SVC readStatusFile - status: applying, path: /tmp/tmpDYP2gp.application.copy/updates/0/update.status
[...]

So, my newly added logging tells us that WaitForProcess() correctly gets a false return from ProcessHasTerminated() and attempts to dispatch itself so it can check again, but then what actually happens is that UpdateDone() runs. I do not have the slightest idea how that could be possible. The dispatching works correctly a bunch of times before this happens, so it's not like it's just always broken.
Flags: needinfo?(mhowell)
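The dispatch pattern described above can be modeled in a few lines. This is a simplified single-threaded Python sketch, not the actual nsUpdateProcessor code: WaitForProcess re-queues itself until the terminated check returns true, and only then should UpdateDone run (the names and the three-poll count are illustrative):

```python
import queue

events = queue.Queue()  # stand-in for the watcher thread's event loop
checks_remaining = 3    # pretend the updater exits after three polls

def process_has_terminated():
    """Fake non-blocking check: returns True on the third poll."""
    global checks_remaining
    checks_remaining -= 1
    return checks_remaining <= 0

def update_done():
    # In the real code, this is what reads update.status and must only
    # run after the updater has actually exited.
    print("UpdateDone: reading update.status")

def wait_for_process():
    if process_has_terminated():
        update_done()
    else:
        events.put(wait_for_process)  # "dispatching myself" to check again

events.put(wait_for_process)
while not events.empty():
    events.get()()
```

The bug's symptom corresponds to update_done() running even though the most recent check said the process was still alive, which this model cannot reproduce; that's what makes the logs so confusing.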
Could it be that RefreshUpdateStatus() is getting called from somewhere else, independently of the code which currently runs in nsUpdateDriver.cpp? Btw, where can I find the implementation of RefreshUpdateStatus()?
Ok, so it's here:
https://dxr.mozilla.org/mozilla-central/rev/f0e6cc6360213ba21fd98c887b55fce5c680df68/toolkit/mozapps/update/nsUpdateService.js#3111

Something which might work is to use the --js-debugger option, which lets you set breakpoints. Maybe it's really related to how we instruct Firefox to restart the browser (Services.startup.quit())?
After seeing the above run, I tried again with nsThread logging enabled, and I've got this:

[... lots of copies of the next six lines here ...]
-1988102400[7fbc8e9941e0]: WaitForProcess: process still running, dispatching myself
[Unnamed thread 0x7fbc8e9941e0]: D/nsThread THRD(7fbc7bfbb120) Dispatch [0 0]
[Unnamed thread 0x7fbc8e9941e0]: D/nsThread THRD(7fbc7bfbb120) ProcessNextEvent [0 0]
[Unnamed thread 0x7fbc8e9941e0]: D/nsThread THRD(7fbc7bfbb120) running [7fbc78e62130]
-1988102400[7fbc8e9941e0]: ProcessHasTerminated: Checking state of updater process
-1988102400[7fbc8e9941e0]: ProcessHasTerminated: Updater process is still running; waiting 1 second before trying again
[Main Thread]: D/nsThread THRD(7fbcbd95cd10) running [7fbc7903ac20]
[Main Thread]: D/nsThread THRD(7fbc9ca65aa0) sync shutdown
[Main Thread]: D/nsThread THRD(7fbcbd95cd10) ProcessNextEvent [1 0]
[GMPThread]: D/nsThread THRD(7fbc9ca65aa0) running [7fbc7903ac20]
[GMPThread]: D/nsThread THRD(7fbc9ca65aa0) ProcessNextEvent [0 0]
[GMPThread]: D/nsThread THRD(7fbcbd95cd10) Dispatch [0 0]
[Main Thread]: D/nsThread THRD(7fbcbd95cd10) running [7fbc7903ac20]
[Main Thread]: D/nsThread THRD(7fbcbd95cf80) sync shutdown
[Main Thread]: D/nsThread THRD(7fbcbd95cd10) ProcessNextEvent [1 0]
[Link Monitor]: D/nsThread THRD(7fbcbd95cf80) ProcessNextEvent [0 0]
[Link Monitor]: D/nsThread THRD(7fbcbd95cf80) running [7fbc7903ac20]
[Link Monitor]: D/nsThread THRD(7fbcbd95cf80) ProcessNextEvent [0 0]
[Link Monitor]: D/nsThread THRD(7fbcbd95cd10) Dispatch [0 0]
[Main Thread]: D/nsThread THRD(7fbcbd95cd10) running [7fbc7903ac20]
[Main Thread]: D/nsThread THRD(7fbcbd95cd10) ProcessNextEvent [0 0]
[Timer]: D/nsThread THRD(7fbcbd95ceb0) ProcessNextEvent [0 0]
[Timer]: D/nsThread THRD(7fbcbd95ceb0) ProcessNextEvent [1 0]
[Main Thread]: D/nsThread THRD(7fbcbd95ceb0) sync shutdown
[Main Thread]: D/nsThread THRD(7fbcbd95cd10) ProcessNextEvent [1 0]
[Timer]: D/nsThread THRD(7fbcbd95ceb0) running [7fbc7903ac20]
[Timer]: D/nsThread THRD(7fbcbd95ceb0) ProcessNextEvent [0 0]
[Timer]: D/nsThread THRD(7fbcbd95cd10) Dispatch [0 0]
[Main Thread]: D/nsThread THRD(7fbcbd95cd10) running [7fbc7903ac20]
[Main Thread]: D/nsThread THRD(7fbcbd95cd10) ProcessNextEvent [0 0]
[Main Thread]: D/nsThread THRD(7fbcbd95cd10) ProcessNextEvent [0 0]
[Main Thread]: D/nsThread THRD(7fbc7bfbb120) sync shutdown
[Main Thread]: D/nsThread THRD(7fbcbd95cd10) ProcessNextEvent [1 0]
-1988102400[7fbc8e9941e0]: WaitForProcess: process still running, dispatching myself
[Unnamed thread 0x7fbc8e9941e0]: D/nsThread THRD(7fbc7bfbb120) Dispatch [0 0]
[Unnamed thread 0x7fbc8e9941e0]: D/nsThread THRD(7fbc7bfbb120) ProcessNextEvent [0 0]
[Unnamed thread 0x7fbc8e9941e0]: D/nsThread THRD(7fbc7bfbb120) running [7fbc7903ac20]
[Unnamed thread 0x7fbc8e9941e0]: D/nsThread THRD(7fbc7bfbb120) ProcessNextEvent [0 0]
[Unnamed thread 0x7fbc8e9941e0]: D/nsThread THRD(7fbcbd95cd10) Dispatch [0 0]
[Main Thread]: D/nsThread THRD(7fbcbd95cd10) running [7fbc7903ac20]
[Main Thread]: D/nsThread THRD(7fbcbd95cd10) ProcessNextEvent [0 0]
[Main Thread]: D/nsThread THRD(7fbcbd95cd10) ProcessNextEvent [0 0]
[Main Thread]: D/nsThread THRD(7f830ef5cd10) Dispatch [0 0]
[Main Thread]: D/nsThread THRD(7f830ef5cd10) Dispatch [0 0]
[Main Thread]: D/nsThread THRD(7f830ef5ceb0) Dispatch [0 0]
[... test proceeds to finish failing ...]

This doesn't make any sense to me; it looks like everything is fine until suddenly the absolute wrong function gets dispatched to the nsUpdateProcessor thread, but then somehow also ends up running on the main thread after that? I am not any less confused.
(In reply to Henrik Skupin (:whimboo) from comment #13)
> Could it be that RefreshUpdateStatus() is getting called from somewhere
> else? Independently from the code which currently runs in
> nsUpdateDriver.cpp?

There are no other calls to RefreshUpdateStatus() anywhere except for the one in nsUpdateProcessor::UpdateDone(). And UpdateDone() isn't invoked from anywhere else either.

(In reply to Henrik Skupin (:whimboo) from comment #14)
> Something which might work is to use the --js-debugger option, which let you
> set breakpoints.

Hmm. Not sure how that would help, since I know there's only the one call site.

> Maybe it's really related in how we instruct Firefox to
> restart the browser (Services.startup.quit())?

Don't know how that would work, but I suppose stranger things have happened. At what points in the test does that get called?
I'd really like to get an rr recording, but as far as I know rr wouldn't work in any of our various test runners, and I'm still without a local repro.
(In reply to Matt Howell [:mhowell] from comment #17)
> I'd really like to get an rr recording, but as far as I know rr wouldn't
> work in any of our various test runners, and I'm still without a local repro.

Dustin, do you know something about rr on our testers? Is it something we could get working there, even if its on a one click loaner for now?

If that is not possible we might have to request temporary VPN-QA access for Matt, so that he could connect to one of our slave nodes.
Flags: needinfo?(dustin)
To my understanding, rr does not work on EC2 instances.
Flags: needinfo?(dustin)
(In reply to Matt Howell [:mhowell] from comment #16)
> > Maybe it's really related in how we instruct Firefox to
> > restart the browser (Services.startup.quit())?
> 
> Don't know how that would work, but I suppose stranger things have happened.
> At what points in the test does that get called?

Each time Firefox has to be restarted. We haven't yet been able to click the restart button, but we could implement this soon given that bug 1298800 is fixed now.
(In reply to Henrik Skupin (:whimboo) from comment #21)
> Each time when Firefox has to be restarted. We weren't able yet to click on
> the restart button, but could implement this soon given that bug 1298800 is
> fixed now.

In that case, I don't think that's related, because the code that's acting up here is on the path that determines when the restart button should appear; the trouble happens before it's time for a restart, not during or after.
Some thinking here... so far we have only been able to see this behavior in Linux VMs or on TaskCluster, which uses Docker images in virtual machines (AWS). Maybe it's some kind of trouble with virtualization?
So something strange is happening here. Since I updated all the machines, including the Linux ones, with the latest OS and Java updates, the failure as laid out in this bug is gone. At least for mozilla-aurora:

https://treeherder.mozilla.org/#/jobs?repo=mozilla-aurora&filter-searchStr=fxup&filter-tier=1&filter-tier=2&filter-tier=3

So I'm not sure whether there was some process still around which caused such behavior.

I do not have answers to the following two topics yet:

1. mozilla-central: Our tests are currently busted due to a change in how we use virtualenv with mozharness. Once that has been fixed (should be with the next set of nightly builds), we can re-verify.

2. I don't know whether those problems would still exist in TaskCluster. We currently don't use those workers but our own ones in mozmill-ci.

So let's wait a couple of days and revisit it then. FYI I will be on PTO starting Friday this week through Thursday next week.
Ok, so this is still an ongoing problem with our update tests. I still see failures on mozilla-aurora, and not just a few.

Matt, would you have time to look into this again? It would be great if we could make some progress.
After discovering bug 1309556 yesterday, I get the feeling that this remaining issue could somehow be related. It means we are trying to re-connect to the still-shutting-down instance of Firefox, which then gets killed. The failure summary doesn't really match, but the fix for the race might still help.
Depends on: 1309556
(In reply to Henrik Skupin (:whimboo) from comment #29)
> After discovering bug 1309556 yesterday, I get the feeling this this
> remaining issue here could somehow be related. Means we are trying to
> re-connect to the still shutting down instance of Firefox, which then gets
> killed. The failure summary doesn't really match, but the fix for the race
> might still help.

Are you suggesting that, with the patch for 1309556 landed (2 days ago), we should just wait and see if it recurs?
Flags: needinfo?(hskupin)
As of now we do not have results for mozilla-central. The reason was bug 1306421, which busted all of our tests on Linux and Windows. Starting tomorrow we will have update test results again.
Flags: needinfo?(hskupin)
The failure is still happening on mozilla-central, so the mentioned patch didn't contribute to fixing this bug. One other option we should try now is to actually restart Firefox via the restart button when applying the update. But I'm not sure whether that would change anything.
Depends on: 1304656
All attempts to see this fixed via in_app restarts have failed, so I assume this really has to do with how we respawn the process after an update. Matt, this failure can still be seen multiple times each day. I wonder whether another attempt should be made to investigate this.
Flags: needinfo?(mhowell)
In all honesty, I ran completely out of ideas for how to investigate this, much less solve it. It needs somebody with experience working on weird low-level Linux-specific issues. I think I left enough information in the comments here for someone else to pick this up, and if not I'm happy to explain the situation, but I'm afraid it's gotten beyond me.
Flags: needinfo?(mhowell)
(In reply to Matt Howell [:mhowell] from comment #37)
> In all honestly, I ran completely out of ideas for how to investigate this,
> much less to solve it. It needs somebody with experience working on weird
> low-level Linux-specific issues. I think I left enough information in
> comments here for someone else to pick this up, and if not I'm happy to
> explain the situation, but I'm afraid it's gotten beyond me.

The only person I can think of off the top of my head is Karl. Karl, I'm not sure whether you have time to help us fix this process-related issue when updating Firefox. If you can, we would really appreciate it. Thank you in advance.
Flags: needinfo?(karlt)
I don't see Fxup-auroratest on
https://treeherder.mozilla.org/#/jobs?repo=oak&filter-tier=1&filter-tier=2&filter-tier=3&exclusion_profile=false
What do I need to do to see the logs?

Can you verify that changes for bug 1272614 triggered this by reverting those
changes on oak?

(In reply to Matt Howell [:mhowell] from comment #12)
> [...]
> 1928328960[7f9766bba300]: ProcessHasTerminated: Checking state of updater
> process
> 1928328960[7f9766bba300]: ProcessHasTerminated: Updater process is still
> running; waiting 1 second before trying again
> 1928328960[7f9766bba300]: WaitForProcess: process still running, dispatching
> myself
> *** UTM:SVC TimerManager:registerTimer - id: xpi-signature-verification
> ATTENTION: default value of option force_s3tc_enable overridden by
> environment.
> *** AUS:SVC Creating UpdateService
> *** AUS:SVC readStatusFile - status: applying, path:
> /tmp/tmpDYP2gp.application.copy/updates/0/update.status
> [...]
> 
> So, my newly added logging tells us that WaitForProcess() correctly gets a
> false return from ProcessHasTerminated() and attempts to dispatch itself so
> it can check again, but then what actually happens is that UpdateDone()
> runs.

How do you know that UpdateDone() happens?

It sounds like some logging may be getting truncated?

Perhaps this may happen if the process exits abnormally for some reason and so
buffers are not flushed.

Can you add logging to UpdateDone() to confirm this theory?

If that confirms the theory, then I'd be inclined to use fprintf(stderr, ...) to
be sure MOZ_LOG() isn't writing to some other kind of stream, but I guess stderr
is not necessarily always flushed either.
Blocks: 1272614
Flags: needinfo?(karlt) → needinfo?(mhowell)
Keywords: regression
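Karl's truncation theory is easy to demonstrate in isolation. Here is a hedged stand-alone example (not the Gecko logging code) showing that buffered writes are lost when a process exits abnormally without flushing, which could explain log lines that were emitted but never appear in the captured log:

```python
import subprocess
import sys

# A child that writes two lines: one explicitly flushed, one left in the
# stdio buffer, then exits via os._exit(), which skips atexit/flush handling.
child_code = """
import os, sys
sys.stdout.write("flushed line\\n")
sys.stdout.flush()
sys.stdout.write("buffered line\\n")  # never flushed
os._exit(1)  # abnormal exit: buffered output is discarded
"""

result = subprocess.run(
    [sys.executable, "-c", child_code],
    capture_output=True, text=True,
)
print(repr(result.stdout))  # only the flushed line survives
```

Because stdout is a pipe here (block-buffered rather than line-buffered), the second line stays in the buffer and is lost, just as MOZ_LOG output could be if the browser process dies before its stream is flushed.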
Matt, when you extend the logging, could you also add some timing information? It would be good to know how long it actually takes before we leave the loop waiting for the update to be applied. Maybe in some cases it takes longer than 60s?
Just want to mention that this is still our most frequently occurring test failure for update tests on Linux! This applies to mozilla-central and mozilla-aurora.
I promise I haven't forgotten about this! I do plan to try Karl's suggestions, but I've had other stuff going on recently.
Is this something we need to worry about for 50?
Flags: needinfo?(hskupin)
(In reply to Ryan VanderMeulen [:RyanVM] from comment #45)
> Is this something we need to worry about for 50?

I can't speak to update test results for betas and releases. Florin can give you an answer to this question.
Flags: needinfo?(hskupin) → needinfo?(florin.mezei)
Beta and Release have not been affected by this so far (I've also just run the update tests for Fx 50 on release-cdntest and did not run into this).
Flags: needinfo?(florin.mezei)
Interesting. I wonder if there is special code which only runs when updating non-release builds.
Not much status in the bug in the last 2 weeks; checking in as this bug is quite frequent. Is there anything we are stuck on?
We haven't prioritized this, because of the intermittent nature and comments 45-47. In addition, I know that mhowell has not been able to reproduce this well locally. Happy to take patches on this (if people have time) until mhowell is cleared of his current priorities.
I have some interesting news. On bug 1322199 I'm currently working on an update for the proxy settings of our boxes. It looks like with those applied, or due to some other unknown changes, the problem is gone:

https://treeherder.allizom.org/#/jobs?repo=mozilla-central&exclusion_profile=false&filter-searchStr=firefox%20ui%20update%20linux&filter-tier=1&filter-tier=2&filter-tier=3&selectedJob=5569697

We should observe that for a couple of days. Maybe it will return.
During the last week this test failure only showed up on the 32-bit Linux machines; the 64-bit ones work perfectly fine. I'm not sure what could have changed. Maybe it was the proxy updates I did on bug 1322199 and the related restart of the machines? But I restarted them all, so why do the 32-bit machines still show this problem?
Should we just leave this open in perpetuity in case it recurs more severely? You've got the most insight into this right now, and no one has a cause or solution.
Flags: needinfo?(mhowell) → needinfo?(hskupin)
As long as it isn't fixed we cannot close this bug.
Flags: needinfo?(hskupin)
FYI we haven't had any update test results for Linux in the last 14 days because we now build Nightlies on Linux via TaskCluster, and funsize jobs didn't send out the notifications we were listening for.

It looks like the failure rate is again somewhat high.
:whimboo, can you look into fixing this or help find someone who can?
Flags: needinfo?(hskupin)
Whiteboard: [stockwell needswork]
Joel, please see comment 44 for the last update. Someone would have to try out what karlt suggested there; it's not something I can do myself. Matt helped out here, but he seems to have other priorities at the moment.
Flags: needinfo?(hskupin)
This picked up in frequency in the last week. Assuming it stays at this rate, we would like to see it fixed or disabled within 2 weeks. :mhowell, I assume you own this test case and can help find the right people to work on this?
Flags: needinfo?(mhowell)
Joel, please take into account that those tests report at the Tier-3 level. Sadly there is no way to distinguish that from Tier-1/2.
Oh, then they shouldn't be starred and in my intermittent dashboard. I see what you mean; can we annotate this as a tier-3 test so it doesn't affect orange factor, or ask the sheriffs not to annotate this?
Flags: needinfo?(mhowell)
I annotate those failures to get a feeling for our intermittent failure rate for update tests. I have no idea how to track those failures otherwise. Maybe it would be wise to bump those jobs to Tier-2, especially because they are so important for release work. We should discuss this outside of this bug, to be honest.
The failures are primarily on beta/aurora. :whimboo, is this something you can figure out and fix?
Flags: needinfo?(hskupin)
(In reply to Joel Maher ( :jmaher) from comment #79)
> the failures are primarily beta/aurora, :whimboo is this something you can
> figure out and fix?

It happens more often on Aurora and Beta because we run update tests for multiple locales on those branches. As such, we get roughly 4 x locale_count failures on Linux 32/64 per day.

I cannot work on this at this time while I have to finish up WebDriver P1 bugs, sorry.
It looks like the recent failures stopped on March 14th and newer failures are on mozilla-central, but the rate has slowed down considerably.

It would be nice to resolve any easy wins here when the WebDriver work slows down.
There are also other issues with updates lately, e.g. bug 1285340. That might be the reason why we saw a temporary decrease in failures.
Flags: needinfo?(hskupin)
This keeps showing up on my radar, and almost all the issues are on mozilla-aurora; possibly we just need to live with this?
While this has a lot of failures, it is primarily mozilla-beta, followed by esr and aurora. :whimboo, are you aware of this high failure rate?
Flags: needinfo?(hskupin)
Joel, as mentioned at least twice in this bug, this is a product issue and needs a developer to sort it out and get it fixed. It's not something I have expertise in. Matt would be the most likely person to work on it, but he may have other priorities, so release-drivers should figure that out. It's not something I can do.
Flags: needinfo?(hskupin) → needinfo?(lhenry)
I'm going to be spending a bit of time on this to implement karlt's suggestions from comment 39, since that's the last actionable idea I have available. Starting with pushing a backout of bug 1272614 to oak to see what happens.
https://hg.mozilla.org/projects/oak/rev/fd7e1ee7273113ff5254ad467cd98b9f407a5278
Bug 1303834 - Backed out changeset d5fb267d0946 to see if this failure is affected
Thanks Matt. Does that mean we should test updates on oak, or will this be merged to mozilla-central eventually?
I'd rather not merge this to central until we know if it makes a difference (and preferably understand why if so), so getting the tests run on oak would be ideal.
Matt, I don't see that we build nightlies on oak at all. The problem could be that those are done via TaskCluster now and the job might not have been set up.

https://treeherder.mozilla.org/#/jobs?repo=oak&filter-searchStr=nightly

Could you check that and get those builds created? Florin and I are happy to test, but we would need usable builds. Thanks.
Flags: needinfo?(lhenry) → needinfo?(mhowell)
Flags: needinfo?(mhowell)
(In reply to Robert Strong [:rstrong] (use needinfo to contact me) from comment #99)
> oak nightly builds
> https://archive.mozilla.org/pub/firefox/nightly/latest-oak/

Those are not Linux builds; it's only Mac and Windows, as my Treeherder link also shows.
Flags: needinfo?(robert.strong.bugs)
Filed bug 1353819
Flags: needinfo?(robert.strong.bugs)
Can we disable these tests or mark them as tier-3? We are at 3+ weeks of a very high failure rate.
Flags: needinfo?(hskupin)
Those tests are Tier-3 level tests.
Flags: needinfo?(hskupin)
How did we get 174 orangefactor stars if they are tier-3? I would not expect to get any orangefactor data for tier-3 on our integration and release branches.
(In reply to Joel Maher ( :jmaher) from comment #107)
> how did we get 174 orangefactor stars if they are tier-3, I would not expect
> to get any orangefactor data for tier-3 on our integration and release
> branches.

We are going in circles. :) Please see comment 74. If this is something I should not do, let me know and I will save my time and no longer star the failures. With that in mind, I won't see a reason to further check for failures on integration branches, which means failures might catch us by surprise on the merge from aurora to beta.
Adding a [tier-3] tag to the subject, which should help reduce confusion. Tier-3 is great for greening something up, but once you star something it gets into orangefactor and other failure dashboards, and we are constantly reminded of it.
Summary: Intermittent test_fallback_update.py TestFallbackUpdate.test_update | IOError: Process killed because the connection to Marionette server is lost. Check gecko.log for errors (Reason: Timed out waiting for port 2828!) → [tier-3] Intermittent test_fallback_update.py TestFallbackUpdate.test_update | IOError: Process killed because the connection to Marionette server is lost. Check gecko.log for errors (Reason: Timed out waiting for port 2828!)
Whiteboard: [stockwell needswork] → [stockwell unknown]
(In reply to Matt Howell [:mhowell] from comment #97)
> I'd rather not merge this to central until we know if it makes a difference
> (and preferably understand why if so), so getting the tests run on oak would
> be ideal.

Looks like the nightly builds are available now. For example those could be used:

source: https://treeherder.mozilla.org/#/jobs?repo=oak&filter-searchStr=nightly&selectedJob=90480988
target: https://treeherder.mozilla.org/#/jobs?repo=oak&filter-searchStr=nightly&selectedJob=90235931

Florin, could you trigger such an update test on oak, and let it repeat a dozen times if it's not failing? Thanks.
Flags: needinfo?(florin.mezei)
(In reply to Henrik Skupin (:whimboo) from comment #110)
> Looks like the nightly builds are available now. For example those could be
> used:
> 
> source:
> https://treeherder.mozilla.org/#/jobs?repo=oak&filter-
> searchStr=nightly&selectedJob=90480988
> target:
> https://treeherder.mozilla.org/#/jobs?repo=oak&filter-
> searchStr=nightly&selectedJob=90235931
> 
> Florin, could you trigger such an update test on oak, and let it repeat a
> dozen of times if it's not failing? Thanks.


I can't seem to figure out the parameters for running this - http://mm-ci-production.qa.scl3.mozilla.com:8080/job/ondemand_update/62742/

Henrik can you advise?
Flags: needinfo?(florin.mezei) → needinfo?(hskupin)
Checking the logs, I can see that there is a problem downloading the mozharness archive. The reason is that no archives are getting created on oak: https://hg.mozilla.org/integration/oak/archive/

So basically this is not solvable in mozmill-ci as long as we use the archiver script from RelEng to fetch the mozharness archive.

What we would have to do is trigger the automated tests manually on the machines. It's similar to what I explained to you recently when we had issues with beta builds. After downloading and extracting the oak nightly, the tests should work fine. Would you mind doing that? If so, I would appreciate it. It would also make sure that you are able to do it when I'm not around.
Flags: needinfo?(hskupin)
So with Henrik's help I managed to test this manually on a Linux machine. However, the tests failed because no update was offered for: https://aus5.mozilla.org/update/6/Firefox/53.0a1/20170410200459/Linux_x86-gcc3/en-US/nightly-oak/Linux%203.13.0-106-generic%20(GTK%203.10.8%2Clibpulse%204.0.0)/NA/default/default/update.xml?force=1.

Given that there are more recent builds here [1], I was expecting to get an update.

[1] - https://archive.mozilla.org/pub/firefox/nightly/2017/04/2017-04-11-13-53-21-oak/
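For anyone reproducing this check by hand, the update URL above decomposes into Balrog's path fields. A minimal sketch that rebuilds such a URL follows; the field names are read off the URL itself and are assumptions, not an official specification:

```python
# Sketch: rebuild a Balrog "update/6" check URL from its path fields.
# The field order is inferred from the URL quoted above; treat the
# parameter names as assumptions rather than an official specification.
from urllib.parse import quote

def build_update_url(product, version, build_id, build_target, locale,
                     channel, os_version, system_caps="NA",
                     distribution="default", dist_version="default"):
    # Encode spaces and commas in the OS version string, but keep the
    # parentheses literal, matching the URL seen in the failing test.
    fields = [product, version, build_id, build_target, locale, channel,
              quote(os_version, safe="()"), system_caps, distribution,
              dist_version]
    return ("https://aus5.mozilla.org/update/6/"
            + "/".join(fields) + "/update.xml?force=1")

print(build_update_url(
    "Firefox", "53.0a1", "20170410200459", "Linux_x86-gcc3", "en-US",
    "nightly-oak", "Linux 3.13.0-106-generic (GTK 3.10.8,libpulse 4.0.0)"))
```

Appending `?force=1` bypasses Balrog's throttling, which is why it appears on the manually checked URLs in this bug.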
Ben, any idea why no updates are offered for the above URL?
Flags: needinfo?(bhearsum)
(In reply to Florin Mezei, QA (:FlorinMezei) from comment #115)
> Ben, any idea why no updates are offered for the above URL?

I had locked nightlies to an earlier revision that didn't have Linux ones while testing something. I just reverted that - should be fixed now.
Flags: needinfo?(bhearsum)
Thanks Ben!

I've re-tested on the same machine, and this time an update was indeed served, but I've hit bug 1260383 - JavascriptException: TypeError: ums.activeUpdate is null, for both Direct and Fallback updates (tried twice), with this sort of error showing in the logs:

1492013159680   Marionette      TRACE   6 -> [0,962,"executeScript",{"scriptTimeout":null,"newSandbox":true,"args":["16"],"filename":"windows.py","script":"\n              Components.utils.import(\"resource://gre/modules/Services.jsm\");\n\n              let win = Services.wm.getOuterWindowWithId(Number(arguments[0]));\n              return win.document.readyState == 'complete';\n            ","sandbox":"default","line":157}]
1492013159685   Marionette      TRACE   6 <- [1,962,null,{"value":true}]
ERROR: Error verifying signature.
(In reply to Florin Mezei, QA (:FlorinMezei) from comment #117)
> ERROR: Error verifying signature.

This error comes from:
https://dxr.mozilla.org/mozilla-central/rev/f40e24f40b4c4556944c762d4764eace261297f5/modules/libmar/verify/mar_verify.c#453

Looks like the downloaded MAR files could not be verified. Ben, not sure if you need the output from the updater log to get this investigated and fixed. Florin could most likely provide this tomorrow.
Flags: needinfo?(bhearsum)
(In reply to Henrik Skupin (:whimboo) from comment #118)
> (In reply to Florin Mezei, QA (:FlorinMezei) from comment #117)
> > ERROR: Error verifying signature.
> 
> This error comes from:
> https://dxr.mozilla.org/mozilla-central/rev/
> f40e24f40b4c4556944c762d4764eace261297f5/modules/libmar/verify/mar_verify.
> c#453
> 
> Looks like the downloaded mar files could not be verified. Ben, not sure if
> you need the output from the updater log to get this investigated and fixed.
> Florin most likely could provide this tomorrow.

This is probably happening because we changed the branding on oak at one point. If your starting build is https://hg.mozilla.org/projects/oak/rev/b5d2520dc1ddcce8a6a02f823226f91eaa461683 or later, everything should be fine now.
Flags: needinfo?(bhearsum)
Florin, can you please have another look for more recent oak nightlies? Thanks.
Flags: needinfo?(florin.mezei)
(In reply to Henrik Skupin (:whimboo) from comment #124)
> Florin, can you please have another look for more recent oak nightlies?
> Thanks.

I've tried this today on http://mm-ci-production.qa.scl3.mozilla.com:8080/computer/mm-ub-1404-32-3/ but encountered the following failure (build used was https://archive.mozilla.org/pub/firefox/nightly/2017/04/2017-04-16-11-02-36-oak/):

TEST-UNEXPECTED-ERROR | test_direct_update.py TestDirectUpdate.test_update | IOError: Process killed because the connection to Marionette server is lost. Check gecko.log for errors (Reason: Connection timed out after 360.0s)
Traceback (most recent call last):
  File "/home/mozauto/jenkins/workspace/ondemand_update/build/venv/local/lib/python2.7/site-packages/marionette_harness/marionette_test/testcases.py", line 166, in run
    testMethod()
  File "/home/mozauto/jenkins/workspace/ondemand_update/build/tests/firefox-ui/tests/update/direct/test_direct_update.py", line 20, in test_update
    self.download_and_apply_available_update(force_fallback=False)
  File "/home/mozauto/jenkins/workspace/ondemand_update/build/venv/local/lib/python2.7/site-packages/firefox_ui_harness/testcases.py", line 288, in download_and_apply_available_update
    self.restart()
  File "/home/mozauto/jenkins/workspace/ondemand_update/build/venv/local/lib/python2.7/site-packages/firefox_ui_harness/testcases.py", line 353, in restart
    super(UpdateTestCase, self).restart(*args, **kwargs)
  File "/home/mozauto/jenkins/workspace/ondemand_update/build/venv/local/lib/python2.7/site-packages/firefox_puppeteer/mixins.py", line 71, in restart
    self.marionette.restart(in_app=True)
  File "/home/mozauto/jenkins/workspace/ondemand_update/build/venv/local/lib/python2.7/site-packages/marionette_driver/decorators.py", line 23, in _
    return func(*args, **kwargs)
  File "/home/mozauto/jenkins/workspace/ondemand_update/build/venv/local/lib/python2.7/site-packages/marionette_driver/marionette.py", line 1222, in restart
    self._request_in_app_shutdown("eRestart")
  File "/home/mozauto/jenkins/workspace/ondemand_update/build/venv/local/lib/python2.7/site-packages/marionette_driver/marionette.py", line 1156, in _request_in_app_shutdown
    self._send_message("quitApplication", {"flags": list(flags)})
  File "/home/mozauto/jenkins/workspace/ondemand_update/build/venv/local/lib/python2.7/site-packages/marionette_driver/decorators.py", line 28, in _
    m._handle_socket_failure()
  File "/home/mozauto/jenkins/workspace/ondemand_update/build/venv/local/lib/python2.7/site-packages/marionette_driver/decorators.py", line 23, in _
    return func(*args, **kwargs)
  File "/home/mozauto/jenkins/workspace/ondemand_update/build/venv/local/lib/python2.7/site-packages/marionette_driver/marionette.py", line 726, in _send_message
    msg = self.client.request(name, params)
  File "/home/mozauto/jenkins/workspace/ondemand_update/build/venv/local/lib/python2.7/site-packages/marionette_driver/transport.py", line 284, in request
    return self.receive()
  File "/home/mozauto/jenkins/workspace/ondemand_update/build/venv/local/lib/python2.7/site-packages/marionette_driver/transport.py", line 211, in receive
    raise socket.timeout("Connection timed out after {}s".format(self.socket_timeout))

I tried some other machines but got a different failure (so I probably missed something on those):
TEST-UNEXPECTED-ERROR | test_fallback_update.py TestFallbackUpdate.test_update | InvalidArgumentException: Unrecognised timeout: page load
stacktrace:
        WebDriverError@chrome://marionette/content/error.js:211:5
        InvalidArgumentError@chrome://marionette/content/error.js:301:5
        fromJSON@chrome://marionette/content/session.js:70:17
        GeckoDriver.prototype.setTimeouts@chrome://marionette/content/driver.js:1658:19
        execute/req<@chrome://marionette/content/server.js:510:22
        TaskImpl_run@resource://gre/modules/Task.jsm:319:42
        TaskImpl@resource://gre/modules/Task.jsm:277:3
        asyncFunction@resource://gre/modules/Task.jsm:252:14
        Task_spawn@resource://gre/modules/Task.jsm:166:12
        execute@chrome://marionette/content/server.js:500:15
        onPacket@chrome://marionette/content/server.js:471:7
        _onJSONObjectReady/<@chrome://marionette/content/server.js -> resource://devtools/shared/transport/transport.js:483:11
        exports.makeInfallible/<@resource://gre/modules/commonjs/toolkit/loader.js -> resource://devtools/shared/ThreadSafeDevToolsUtils.js:101:14
        exports.makeInfallible/<@resource://gre/modules/commonjs/toolkit/loader.js -> resource://devtools/shared/ThreadSafeDevToolsUtils.js:101:14
Traceback (most recent call last):
  File "/home/mozauto/jenkins/workspace/ondemand_update/build/venv/local/lib/python2.7/site-packages/marionette_harness/marionette_test/testcases.py", line 147, in run
    self.setUp()
  File "/home/mozauto/jenkins/workspace/ondemand_update/build/tests/firefox-ui/tests/update/fallback/test_fallback_update.py", line 11, in setUp
    UpdateTestCase.setUp(self, is_fallback=True)
  File "/home/mozauto/jenkins/workspace/ondemand_update/build/venv/local/lib/python2.7/site-packages/firefox_ui_harness/testcases.py", line 50, in setUp
    super(UpdateTestCase, self).setUp()
  File "/home/mozauto/jenkins/workspace/ondemand_update/build/venv/local/lib/python2.7/site-packages/firefox_puppeteer/mixins.py", line 77, in setUp
    super(PuppeteerMixin, self).setUp(*args, **kwargs)
  File "/home/mozauto/jenkins/workspace/ondemand_update/build/venv/local/lib/python2.7/site-packages/marionette_harness/marionette_test/testcases.py", line 468, in setUp
    super(MarionetteTestCase, self).setUp()
  File "/home/mozauto/jenkins/workspace/ondemand_update/build/venv/local/lib/python2.7/site-packages/marionette_harness/marionette_test/testcases.py", line 261, in setUp
    self.marionette.timeout.reset()
  File "/home/mozauto/jenkins/workspace/ondemand_update/build/venv/local/lib/python2.7/site-packages/marionette_driver/timeout.py", line 97, in reset
    self.page_load = DEFAULT_PAGE_LOAD_TIMEOUT
  File "/home/mozauto/jenkins/workspace/ondemand_update/build/venv/local/lib/python2.7/site-packages/marionette_driver/timeout.py", line 74, in page_load
    self._set("page load", sec)
  File "/home/mozauto/jenkins/workspace/ondemand_update/build/venv/local/lib/python2.7/site-packages/marionette_driver/timeout.py", line 33, in _set
    self._marionette._send_message("setTimeouts", {name: ms})
  File "/home/mozauto/jenkins/workspace/ondemand_update/build/venv/local/lib/python2.7/site-packages/marionette_driver/decorators.py", line 23, in _
    return func(*args, **kwargs)
  File "/home/mozauto/jenkins/workspace/ondemand_update/build/venv/local/lib/python2.7/site-packages/marionette_driver/marionette.py", line 729, in _send_message
    self._handle_error(err)
  File "/home/mozauto/jenkins/workspace/ondemand_update/build/venv/local/lib/python2.7/site-packages/marionette_driver/marionette.py", line 762, in _handle_error
    raise errors.lookup(error)(message, stacktrace=stacktrace)
Flags: needinfo?(florin.mezei)
The above issues Florin pointed out were caused by running the tests with the ondemand_update job, which certainly will not work for nightly builds. So after mentioning it to him, we ran update tests for oak together on a Linux machine.

The results we got are not that promising, but at least we are getting closer...

As noticed with the fallback updates for an oak build from April 17th, we do not offer any partial update because those do not seem to get built. Only complete MAR patches are available:

https://aus5.mozilla.org/update/6/Firefox/55.0a1/20170417110320/Linux_x86-gcc3/en-US/nightly-oak/Linux%203.13.0-106-generic%20(GTK%203.10.8%2Clibpulse%204.0.0)/NA/default/default/update.xml?force=1
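The difference is visible in the update.xml served from such a URL: it only advertises `complete` patches. A quick way to check this is to inspect the `<patch>` elements; a minimal sketch follows, where the sample XML is illustrative, not the actual oak response:

```python
# Sketch: list the patch types advertised by an AUS update.xml document.
# SAMPLE_XML below is illustrative only; fetch the real document from the
# update URL above to inspect an actual response.
import xml.etree.ElementTree as ET

SAMPLE_XML = """\
<updates>
  <update type="minor" displayVersion="55.0a1" appVersion="55.0a1"
          buildID="20170417110320">
    <patch type="complete" URL="https://example.invalid/firefox.complete.mar"
           hashFunction="sha512" hashValue="..." size="12345"/>
  </update>
</updates>
"""

def advertised_patch_types(update_xml):
    """Return the type attribute of every <patch> element, in order."""
    root = ET.fromstring(update_xml)
    return [patch.get("type") for patch in root.iter("patch")]

print(advertised_patch_types(SAMPLE_XML))  # no "partial" entry here
```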

Running the update tests with those builds, the issue in this bug didn't surface at all. So I also had a look at various update tests for recent beta and RC candidate builds and noticed an interesting fact:

https://treeherder.mozilla.org/#/jobs?repo=mozilla-release&revision=d345b657d381ade5195f1521313ac651618f54a2&filter-searchStr=firefox%20ui%20linux%20update&filter-tier=1&filter-tier=2&filter-tier=3

All the failures happen during a fallback update when a partial update was initially served! If the initial patch is a complete one (which is the case from 53.0b9 downwards), no connection issues with Marionette occur after the final restart!

So I believe this is strongly related to partial patches and fallback updates.

Simon and Ben, I wonder if there is a way to enable funsize partial patch generation for the oak branch, at least for Linux, where we have to investigate this problem.
Severity: normal → critical
Flags: needinfo?(sfraser)
Flags: needinfo?(bhearsum)
Summary: [tier-3] Intermittent test_fallback_update.py TestFallbackUpdate.test_update | IOError: Process killed because the connection to Marionette server is lost. Check gecko.log for errors (Reason: Timed out waiting for port 2828!) → [tier-3] Intermittent test_fallback_update.py TestFallbackUpdate.test_update (partial MAR) | IOError: Process killed because the connection to Marionette server is lost. Check gecko.log for errors (Reason: Timed out waiting for port 2828!)
I've submitted https://github.com/mozilla-releng/funsize/pull/58 to add the routes for oak; awaiting review.
Flags: needinfo?(sfraser)
I don't have time to look at this anytime soon, I'm swamped with Dawn work.
Flags: needinfo?(bhearsum)
(In reply to Simon Fraser [:sfraser] ⌚️GMT from comment #128)
> Have submitted https://github.com/mozilla-releng/funsize/pull/58 to add the
> routes for oak, awaiting review.

This is now in place. The workers for this are a bit overloaded at times, so if it's possible not to trigger nightlies at similar times to the existing ones, that would help a great deal.

Simon.
I've re-tested this with https://archive.mozilla.org/pub/firefox/nightly/2017/04/2017-04-20-11-02-08-oak/. 

I tested 6 times - Passed 2 times, and failed 4 times with: IOError: Process killed because the connection to Marionette server is lost. Check gecko.log for errors (Reason: Timed out waiting for connection on localhost:2828!)

I'm also attaching below two logs: one for a passed test, and one for a failed test. It seems to me from the logs that we fetched the partial.mar.
Thank you, Florin. Matt, it looks like reverting that one change didn't have any effect. Could this somehow also be related to what we saw today in bug 1358402 and monkey-patched to make it work?
Flags: needinfo?(mhowell)
Hmm, I'm not sure. What we had in bug 1358402 was the Marionette server just never trying to start, right? Here it is trying, but it can't because of NS_ERROR_SOCKET_ADDRESS_IN_USE.

Also, that error doesn't happen here until the second restart, the one to finish applying the fallback complete update, not the one during which we overwrite the status file.
Flags: needinfo?(mhowell)
(In reply to Matt Howell [:mhowell] from comment #136)
> Also, that error doesn't happen here until the second restart, the one to
> finish applying the fallback complete update, not the one during which we
> overwrite the status file.

Oh, that's correct. Sorry.

Matt, given that comment 93 didn't work out, could you add Karl's suggestions so that we can see if we can get some more information? Thanks.
Flags: needinfo?(mhowell)
(In reply to OrangeFactor Robot from comment #140)
> 105 failures in 183 pushes (0.574 failures/push) were associated with this
> bug yesterday.   
> 
> Repository breakdown:
> * mozilla-beta: 91

Due to the number of tests we have to run on beta (including all the locales), this is becoming unmanageable. Matt, we would really appreciate it if you could find some time to implement Karl's suggestions. Thanks.
(In reply to Karl Tomlinson (back Apr 26 :karlt) from comment #39)
> Can you verify that changes for bug 1272614 triggered this by reverting those
> changes on oak?

We have tried this, and it didn't help.

> How do you know that UpdateDone() happens?

Because that's what triggers the AUS:SVC lines that appear next.

> It sounds like some logging may be getting truncated?
> 
> Perhaps this may happen if the process exits abnormally for some reason and
> so buffers are not flushed.

The log described here agrees with the one in comment 15, so that would have to mean both of those logs (from different runs) got truncated in exactly the same place. And saying these are getting "truncated" doesn't seem to make sense, because the lines that appear to be missing would be in the middle of the logging we have, not at the end.

> Can you add logging to UpdateDone() to confirm this theory?
> 
> If that confirms the theory, then I'd be inclined to use fprintf(stderr,) to
> be sure MOZ_LOG() isn't writing to some other kind of stream, but I guess
> stderr is not necessarily always flushed either.

I could push a patch to oak that adds this logging, but in light of the above I don't think it would tell us anything we do not already know.
Flags: needinfo?(mhowell)
Similar to bug 1355818, this bug seems to show up much more often on beta than on beta-cdntest. In fact, for the past 3 builds there have been no failed jobs on the beta-cdntest channel on Linux, but multiple failures on the beta channel (e.g. for 54b4 and 54b5).

Results for 54b5:
- beta-cdntest [1] - 0 failed jobs
- beta [2] - 8 failed jobs (some locales + en-US) - not re-run, as we've decided not to do that anymore because it takes a lot of time and effort

[1] - https://treeherder.mozilla.org/#/jobs?repo=mozilla-beta&revision=06bf49fb5795&group_state=expanded&filter-tier=3&filter-searchStr=Fxup-beta-cdntest(
[2] - https://treeherder.mozilla.org/#/jobs?repo=mozilla-beta&revision=06bf49fb5795&group_state=expanded&filter-tier=3&filter-searchStr=Fxup-beta(
Following up on the findings in comment 145 (and same as for bug 1355818), on the same day I also ran tests for the dot release 53.0.2 and ESR 52.1.1. The results of these tests seem to confirm the findings on beta: the more we move towards the official channels, the more failures we see. Oddly enough, I got zero failures of this kind on the localtest and cdntest channels, while the official channels were quite a bit trickier (basically the same thing I saw for beta). See the detailed results below:

1. Results for 53.0.2:
   a) release-localtest [1] - 0 failures (tests were 100% green)
   b) release-cdntest [2] - 0 failures (tests were 100% green)
   c) release [3] - 10 jobs failed - passed after multiple re-runs (71 in total)

2. Results for ESR 52.1.1:
   a) esr-localtest [4] - 0 failures (tests were 100% green)
   b) esr-cdntest [5] - 0 failures (tests were 100% green)
   c) esr [6] - 7 jobs failed - passed after multiple re-runs (40 in total)

[1] - https://treeherder.mozilla.org/#/jobs?repo=mozilla-release&revision=f87a819106bd&group_state=expanded&filter-tier=3&filter-searchStr=Fxup-release-localtest(
[2] - https://treeherder.mozilla.org/#/jobs?repo=mozilla-release&revision=f87a819106bd&group_state=expanded&filter-tier=3&filter-searchStr=Fxup-release-cdntest(
[3] - https://treeherder.mozilla.org/#/jobs?repo=mozilla-release&revision=f87a819106bd&group_state=expanded&filter-tier=3&filter-searchStr=Fxup-release(

[4] - https://treeherder.mozilla.org/#/jobs?repo=mozilla-esr52&revision=120111e65bc4&group_state=expanded&filter-tier=3&filter-searchStr=Fxup-esr-localtest(
[5] - https://treeherder.mozilla.org/#/jobs?repo=mozilla-esr52&revision=120111e65bc4&group_state=expanded&filter-tier=3&filter-searchStr=Fxup-esr-cdntest(
[6] - https://treeherder.mozilla.org/#/jobs?repo=mozilla-esr52&revision=120111e65bc4&group_state=expanded&filter-tier=3&filter-searchStr=Fxup-esr(
I haven't tested different channels around the same time on CI machines so far. Maybe that would be a good idea, Florin: that way we could see whether we can exclude time-related failures. You would only have to add `--update-channel %name` as an option to the `firefox-ui-update` command. Could you do that?
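As a sketch, the per-channel invocations could be generated like this; only `--update-channel` is taken from the comment above, and the rest of the command list is a placeholder, not the harness's full option set:

```python
# Sketch: build firefox-ui-update command lines for the channels compared
# in this bug. Only the --update-channel option comes from the comment
# above; everything else is a placeholder.
CHANNELS = ["release-localtest", "release-cdntest", "release"]

def build_command(channel):
    return ["firefox-ui-update", "--update-channel", channel]

for channel in CHANNELS:
    print(" ".join(build_command(channel)))
```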
Flags: needinfo?(florin.mezei)
(In reply to Henrik Skupin (:whimboo) from comment #149)
> I haven't tested different channels around the same time on CI machines so
> far. Maybe that would be a good idea to do, Florin. That way we could see if
> we can exclude time related failures. Florin, could you do that? You would
> only have to add `--update-channel %name` as option to the
> `firefox-ui-update` command.

I'll do this after we publish 54b6 to beta - so I should have some results tomorrow.
Flags: needinfo?(florin.mezei)
I've run 9 jobs today on http://mm-ci-production.qa.scl3.mozilla.com:8080/computer/mm-ub-1404-32-3/, for the update 53.0 -> 53.0.2 - 3 updates on release-localtest, 3 on release-cdntest, and 3 on release. All jobs passed. 

I'm leaving the needinfo on so I can try this again on another day.
Good news! There were no more such failures on Sunday! That means all 4 partial updates tested on Linux 32 and 64 passed, which is something we haven't had in a very long time. So I assume that the patch on bug 1355818 also fixed this issue, though I'm not sure how it correlates. Anyway, I will keep an eye on it over the next days.
Depends on: 1355818
The same result today: not a single failure on the Linux machines! That's something we haven't had in months! So I call this bug fixed by the patch on bug 1355818. Thanks, Matt!
Assignee: nobody → mhowell
Status: NEW → RESOLVED
Closed: 4 years ago
Flags: needinfo?(florin.mezei)
Resolution: --- → FIXED
Target Milestone: --- → mozilla55
This bug wasn't in our tests but in the application updater.
Component: Firefox UI Tests → Application Update
Product: Testing → Toolkit
QA Contact: hskupin
Version: Version 3 → unspecified