Closed Bug 1249458 Opened 9 years ago Closed 9 years ago

TEST-UNEXPECTED-FAIL | test_direct_update.py TestDirectUpdate.test_update, test_fallback_update.py TestFallbackUpdate.test_update | AssertionError: Available update has been found (Bad caching on Balrog?)

Categories: Testing :: Firefox UI Tests (defect)
Version: 47 Branch
Type: defect
Priority: Not set
Severity: normal
Tracking: Not tracked
Status: RESOLVED FIXED
People: Reporter: impossibus; Assigned: whimboo
Keywords: intermittent-failure

09:30:25 INFO - Traceback (most recent call last):
09:30:25 INFO -   File "/home/mozauto/jenkins/workspace/mozilla-central_update/build/venv/local/lib/python2.7/site-packages/marionette/marionette_test.py", line 344, in run
09:30:25 INFO -     testMethod()
09:30:25 INFO -   File "/home/mozauto/jenkins/workspace/mozilla-central_update/build/venv/local/lib/python2.7/site-packages/firefox_ui_tests/update/fallback/test_fallback_update.py", line 20, in test_update
09:30:25 INFO -     self.download_and_apply_available_update(force_fallback=True)
09:30:25 INFO -   File "/home/mozauto/jenkins/workspace/mozilla-central_update/build/venv/local/lib/python2.7/site-packages/firefox_puppeteer/testcases/update.py", line 298, in download_and_apply_available_update
09:30:25 INFO -     "Available update has been found")
09:30:25 INFO - TEST-INFO took 7514ms
09:30:25 INFO - *** AUS:SVC UpdateService:removeDownloadListener - no downloader!
09:30:25 INFO - Update test results:
09:30:25 INFO - [{'build_post': None,
09:30:25 INFO -   'build_pre': {'buildid': u'20160216030245',
09:30:25 INFO -                 'channel': u'nightly',
09:30:25 INFO -                 'disabled_addons': None,
09:30:25 INFO -                 'locale': u'en-US',
09:30:25 INFO -                 'mar_channels': set([u'firefox-mozilla-central']),
09:30:25 INFO -                 'url_aus': u'https://aus5.mozilla.org/update/3/Firefox/47.0a1/20160216030245/Linux_x86_64-gcc3/en-US/nightly/Linux%203.13.0-77-generic%20(GTK%203.10.8%2Clibpulse%204.0.0)/default/default/update.xml?force=1',
09:30:25 INFO -                 'user_agent': u'Mozilla/5.0 (X11; Linux x86_64; rv:47.0) Gecko/20100101 Firefox/47.0',
09:30:25 INFO -                 'version': u'47.0a1'},
09:30:25 INFO -   'fallback': True,
09:30:25 INFO -   'patch': {'channel': u'nightly', 'download_duration': None},
09:30:25 INFO -   'success': False}]
09:30:26 INFO - MARIONETTE LOG: INFO: TEST-END: /home/mozauto/jenkins/workspace/mozilla-central_update/build/venv/local/lib/python2.7/site-packages/firefox_ui_tests/update/fallback/test_fallback_update.py:test_update
09:30:26 INFO - Restoring channel defaults for: /home/mozauto/jenkins/workspace/mozilla-central_update/build/application.copy/defaults/pref/channel-prefs.js
I don't think I have filed a bug about this kind of update test failure yet, but I talked with Rail about it roughly two weeks ago. This might be a caching issue on Balrog which results in a delayed advertisement of the update. By the time we get the funsize Pulse message the updates should already be available, but as the log above shows they aren't always there.
Summary: TEST-UNEXPECTED-FAIL | test_fallback_update.py TestFallbackUpdate.test_update | AssertionError: Available update has been found → TEST-UNEXPECTED-FAIL | test_fallback_update.py TestFallbackUpdate.test_update | AssertionError: Available update has been found (Bad caching on Balrog?)
Summary: TEST-UNEXPECTED-FAIL | test_fallback_update.py TestFallbackUpdate.test_update | AssertionError: Available update has been found (Bad caching on Balrog?) → TEST-UNEXPECTED-FAIL | test_direct_update.py TestDirectUpdate.test_update, test_fallback_update.py TestFallbackUpdate.test_update | AssertionError: Available update has been found (Bad caching on Balrog?)
I have seen the same problem with yesterday's Aurora ru builds. No ru-1 and ru-4 updates were available for the Win64 builds. By now the update URL lists an update. https://treeherder.mozilla.org/#/jobs?repo=mozilla-aurora&revision=633cd9d7ba59&filter-searchStr=fxup%20x64%20ru&filter-tier=1&filter-tier=2&filter-tier=3
This is becoming more and more of a major problem for us. Lots of tests failed yesterday because the expected version was not yet available via Balrog. Rail, we should really get this fixed and send out the funsize Pulse messages only if Balrog already serves those updates.
Flags: needinfo?(rail)
(In reply to Henrik Skupin (:whimboo) from comment #3)
> Rail, we should really get this fixed and send out the funsize Pulse
> messages only if Balrog already serves those updates.

I don't think this should be done by funsize, and I'm not sure it's even possible: the update channel may be throttled (partially or fully), there may be multiple update channels consuming the same metadata, and I don't think you can tell TC to stop sending the Pulse message once the job is finished. Ben, maybe you have better ideas on how to deal with Balrog caching?
Flags: needinfo?(rail) → needinfo?(bhearsum)
(In reply to Rail Aliiev [:rail] from comment #4)
> The update channel may be throttled (partially or fully), there may be
> multiple update channels consuming the same metadata, and I don't think you
> can tell TC to stop sending the Pulse message once the job is finished.

Sure, but here it is clearly about a non-throttled, open channel.
So there's a couple of levels of caching. The load balancers have a short cache of something between 30 and 120 seconds, which caches based on the incoming URI. Each webhead also has a short cache of 60s, which caches based on the release name. I think it's unlikely that you're regularly hitting the load balancer cache because your URIs are not something that should be terribly common in the wild. You could definitely be hitting the one on the webheads, though. It's hard to say for sure. Do you know how long after receiving the Pulse message the failures are hit? It shouldn't be possible for there to be caching longer than 2min.
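To make the point about the webhead cache concrete, here is a minimal Python sketch of a TTL cache keyed by release name. This is not Balrog's actual code, and the names lookup_release and fetch_from_db are made up for the example; it only illustrates why a blob cached just before the nightly was published keeps being served until the TTL runs out, which is exactly the window in which the tests would still see "no update available".

    import time

    CACHE_TTL = 60  # seconds, matching the webhead cache described above
    _cache = {}     # release name -> (timestamp, cached blob)

    def lookup_release(name, fetch_from_db):
        """Return the release blob, serving a cached copy for up to CACHE_TTL."""
        now = time.time()
        hit = _cache.get(name)
        if hit and now - hit[0] < CACHE_TTL:
            # A blob cached just before the new nightly landed keeps being
            # returned until the TTL expires, so clients briefly see "no update".
            return hit[1]
        blob = fetch_from_db(name)
        _cache[name] = (now, blob)
        return blob

Stacked behind the load-balancer cache (30 to 120 seconds, keyed by URI), this lines up with the "no caching longer than 2 min" bound mentioned above.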
Flags: needinfo?(bhearsum)
(In reply to Ben Hearsum (:bhearsum) from comment #6)
> Do you know how long after receiving the Pulse message the failures are hit?
> It shouldn't be possible for there to be caching longer than 2min.

I don't know that detail, given that we run the update test only once and don't retrigger it when it fails. I also did not observe the funsize tasks in Taskcluster, so I cannot check that manually. My plate is a bit too full for that at the moment.

Something which might be possible is to instruct Jenkins to delay the update jobs a bit. If there really is a maximum 2 min delay, we could make the update tests for nightly builds wait that amount of time.

I could play around with that next week if that would help. That way we could figure out a sane delay.
(In reply to Henrik Skupin (:whimboo) from comment #7)
> (In reply to Ben Hearsum (:bhearsum) from comment #6)
> > Do you know how long after receiving the Pulse message the failures are hit?
> > It shouldn't be possible for there to be caching longer than 2min.
>
> I don't know that detail, given that we run the update test only once and
> don't retrigger it when it fails. I also did not observe the funsize tasks
> in Taskcluster, so I cannot check that manually. My plate is a bit too full
> for that at the moment.
>
> Something which might be possible is to instruct Jenkins to delay the update
> jobs a bit. If there really is a maximum 2 min delay, we could make the
> update tests for nightly builds wait that amount of time.
>
> I could play around with that next week if that would help. That way we
> could figure out a sane delay.

This is certainly worth a try. I'd say give it 5 minutes to be extra safe, but if you're still hitting these failures after waiting 5 minutes, something else is definitely wrong.
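For illustration only, the "delay the update jobs" idea could also be expressed as a pre-flight poll against the AUS URL from the log above, instead of a fixed sleep. This is a Python 2 sketch with a hypothetical wait_for_update helper; it is not what mozmill-ci implements (the actual change below simply adds a Jenkins quiet period).

    import time
    import urllib2
    import xml.dom.minidom

    def wait_for_update(aus_url, timeout=300, interval=30):
        """Poll the AUS/Balrog URL until it advertises an <update>, or give up."""
        end = time.time() + timeout
        while time.time() < end:
            xml_data = urllib2.urlopen(aus_url).read()
            doc = xml.dom.minidom.parseString(xml_data)
            if doc.getElementsByTagName('update'):
                return True   # Balrog serves the update, safe to start the test
            time.sleep(interval)
        return False          # still not advertised after `timeout` seconds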
Ok, so I temporarily added a quiet period of 300s to the aurora and central update jobs. Let's observe over the next days how that works.
With a 300s quiet time it works pretty well. Not a single test failure was seen yesterday. I will try to reduce the time so we can find the minimum quiet time before failures occur.
Assignee: nobody → hskupin
Status: NEW → ASSIGNED
No test failures over the last few days, even with a quiet period of 150s. I will try 60s now, and if that fails, 120s might be a good choice.
60s was also working well until we updated production and the value got reset. I will now set the quiet period permanently via https://github.com/mozilla/mozmill-ci/issues/778.
The change has been landed, so this failure shouldn't appear anymore.
Status: ASSIGNED → RESOLVED
Closed: 9 years ago
Resolution: --- → FIXED
It still happens. So I would say we bump up the delay from 120s to 300s just to be on the safe side.
Status: RESOLVED → REOPENED
Resolution: FIXED → ---
The fix has landed, so no more failures should occur with that. If they do, we might want to check what we can do about Balrog caching.
Status: REOPENED → RESOLVED
Closed: 9 years ago9 years ago
Resolution: --- → FIXED
(In reply to Ben Hearsum (:bhearsum) from comment #9)
> This is certainly worth a try. I'd say give it 5 minutes to be extra safe,
> but if you're still hitting these failures after waiting 5 minutes,
> something else is definitely wrong.

Now that we have a quiet period of 5 minutes set, the failure is still present:
https://treeherder.mozilla.org/logviewer.html#?job_id=2620736&repo=mozilla-aurora#L799

So what else could we do to further investigate this problem? Ben, any idea?
Status: RESOLVED → REOPENED
Flags: needinfo?(bhearsum)
Resolution: FIXED → ---
[Tracking Requested - why for this release]:
I had a conversation with Ben on IRC and he pointed out some lines in the log file which made me think more about the situation. What happens here is a side-effect of bug 1216738. We start Firefox to set the update logging pref and restart right afterward (maybe that's not even necessary anymore to make the logger active). Right before the restart, Firefox finishes downloading an update it had already started to download on its own. That means the restart applies that update, and our first check for an available update fails because the update has already been applied. So I don't think anything here is related to Balrog anymore. Maybe the whole story in this bug even existed because of that. I will check that later today.
Flags: needinfo?(bhearsum) → needinfo?(hskupin)
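As a sketch of the kind of guard that would avoid this race: before failing the "available update" assertion, the test could check whether the previous session already applied the update. The helper names (check_for_updates, get_build_id) are hypothetical and not part of the firefox_puppeteer API; this is only an illustration of the idea.

    def assert_update_available(check_for_updates, get_build_id, expected_buildid):
        """Fail only if no update is advertised and none was already applied."""
        if check_for_updates():
            return  # normal case: Balrog advertises the update
        # Side-effect of bug 1216738: the pre-test session may have already
        # downloaded the update, so the restart applied it silently. In that
        # case the running build is already the target build and failing the
        # "available update" assertion would be a false alarm.
        if get_build_id() == expected_buildid:
            return
        raise AssertionError("Available update has been found")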
Ok, those are definitely different issues. I'm going to file a new bug for the current intermittent issue.
Status: REOPENED → RESOLVED
Closed: 9 years ago9 years ago
Flags: needinfo?(hskupin)
Resolution: --- → FIXED