These can be seen on https://errormill.mozilla.org/Releng/balrog-prod/group/170698/

For example, a request to https://aus3.mozilla.org/update/3/Firefox/27.0/20131216183647/Darwin_x86_64-gcc3/en-US/beta/Darwin%2011.4.2/default/default/update.xml ends up with this traceback:

KeyError: u'Darwin_x86_64-gcc3'

Stacktrace (most recent call last):

  File "flask/app.py", line 1060, in full_dispatch_request
    rv = self.dispatch_request()
  File "flask/app.py", line 1047, in dispatch_request
    return self.view_functions[rule.endpoint](**req.view_args)
  File "flask/views.py", line 56, in view
    return self.dispatch_request(*args, **kwargs)
  File "flask/views.py", line 112, in dispatch_request
    return meth(*args, **kwargs)
  File "auslib/web/views/client.py", line 38, in get
    xml = release.createXML(query, update_type, app.config["WHITELISTED_DOMAINS"], app.config["SPECIAL_FORCE_HOSTS"])
  File "auslib/blob.py", line 222, in createXML
    localeData = self.getPlatformData(buildTarget)["locales"][locale]
  File "auslib/blob.py", line 118, in getPlatformData
    platform = self.getResolvedPlatform(platform)
  File "auslib/blob.py", line 115, in getResolvedPlatform
    return self['platforms'][platform].get('alias', platform)

This doesn't surprise me - Darwin_x86_64-gcc3 is not a build target we use for releases that we ship anymore - and the buildid isn't one that's known to us as a release build either. However, we've already had thousands of tracebacks for this, so there's clearly something in the wild that's using this. I'm guessing that there's some 3rd party build that still points at our update server. For Balrog, we should probably catch these exceptions rather than having them bubble up.
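For anyone unfamiliar with the code, here's a minimal, self-contained repro of why that lookup blows up. The platform names and data shapes below are illustrative, not copied from a real release blob; the function just mirrors the one-line body of getResolvedPlatform shown in the traceback.

```python
# Hypothetical platforms section of a release blob. Real blobs have more
# data; only the alias mechanism matters here.
platforms = {
    "Darwin_x86-gcc3": {"locales": {"en-US": {}}},
    "Darwin_x86-gcc3-u-i386-x86_64": {"alias": "Darwin_x86-gcc3"},
}


def get_resolved_platform(platform):
    # Mirrors auslib/blob.py getResolvedPlatform: index the platforms
    # dict directly, so an unknown build target raises KeyError instead
    # of being handled gracefully.
    return platforms[platform].get("alias", platform)


# A known (aliased) build target resolves fine...
get_resolved_platform("Darwin_x86-gcc3-u-i386-x86_64")

# ...but an unknown one raises, which is exactly the traceback above.
try:
    get_resolved_platform("Darwin_x86_64-gcc3")
except KeyError as e:
    print("unresolved build target:", e)
```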
Whatever we do for this might fix https://errormill.mozilla.org/Releng/balrog-prod/group/172691/ too...
On the theory that maybe debug dep builds somehow had "beta" as an update channel, I tried to find a build that matched https://aus3.mozilla.org/update/3/Firefox/32.0/20140822024446/Darwin_x86_64-gcc3/en-US/beta/Darwin 13.3.0/default/default/update.xml. I couldn't find anything in http://ftp.mozilla.org/pub/mozilla.org/firefox/tinderbox-builds/mozilla-beta-macosx64-debug/ with that buildid. I also inspected one of the builds, and its channel was set to "default". So, I don't know where it's coming from, but we're now up to 8k exceptions of this type - clearly there's some custom build or builds out there, built exclusively for 64-bit Mac, trying to update from us.
More or less at random, I noticed that one URL here was https://aus3.mozilla.org/update/3/Firefox/31.0/20140610163407/Darwin_x86_64-gcc3/en-US/beta/Darwin%2013.4.0/default/default/update.xml. I googled the buildid and it turned out to be 31.0b1. The one in the previous comment is a valid beta too, as it turns out. This makes me _strongly_ suspect there's some sort of automated process pinging these URLs. AFAICT, update verify and final verify are innocent, and we had some requests in the past 7min (when no releases were running). I'm wondering if there's some sort of CI test or QA test that does this... Whimboo, do you know of any automation that would use the buildid of a proper release build, but a debug-style build target (like Darwin_x86_64-gcc3)?
Then again, maybe not...I just noticed that we get the original IPs on Sentry, and they seem to come from all over the world.
Found http://superuser.com/questions/581334/automatic-updates-of-firefox-stable-and-beta-dont-work-on-os-x randomly, which appears to have a user in the wild running a beta with Darwin_x86_64-gcc3...
I don't think the needinfo on me is needed anymore. I'm not familiar with such builds either. Everything we test in our CI is a build we get notified about via Pulse. So if those requests were coming from qa.scl3.mozilla.com, something would be wrong with the build machines.
Reading more on that superuser page, I see that the user has stripped the x86 parts out of the binary, which results in BUILD_TARGET changing to Darwin_x86_64-gcc3. I now highly suspect that this is why we're getting so many of these requests. Sadly, this also means that tons of users in the wild are _not_ getting updates. This bug is scoped to having Balrog deal with these requests more gracefully, but I've filed other bugs to try and help our userbase: https://bugzilla.mozilla.org/show_bug.cgi?id=1071576 https://bugzilla.mozilla.org/show_bug.cgi?id=1071580
We talked about this a bit today. Nick suggested raising specific tracebacks when we find "bad data" like this. We can then add a blacklist of tracebacks to NOT send to Sentry. That way we won't clog it up with tracebacks for things that aren't Balrog's fault. This will probably fix bug 1069508, too.
(In reply to Ben Hearsum [:bhearsum] from comment #9)
> We talked about this a bit today. Nick suggested raising specific tracebacks
> when we find "bad data" like this. We can then add a blacklist of tracebacks
> to NOT send to Sentry. That way we won't clog it up with tracebacks for
> things that aren't Balrog's fault. This will probably fix bug 1069508, too.

This idea seems to work well, based on my quick tests:

diff --git a/auslib/web/base.py b/auslib/web/base.py
index a86f709..8a3a363 100644
--- a/auslib/web/base.py
+++ b/auslib/web/base.py
@@ -8,28 +8,39 @@ from raven.contrib.flask import Sentry
 from auslib.AUS import AUS
 
 app = Flask(__name__)
 AUS = AUS()
 sentry = Sentry()
 
 from auslib.web.views.client import ClientRequestView
 
+
+class BadDataError(Exception):
+    """Raised when a client sends data that appears to be invalid."""
+    pass
+
+
 @app.errorhandler(404)
 def fourohfour(error):
     """We don't return 404s in AUS. Instead, we return empty XML files"""
     response = make_response('<?xml version="1.0"?>\n<updates>\n</updates>')
     response.mimetype = 'text/xml'
     return response
 
 @app.errorhandler(Exception)
 def generic(error):
-    # Log the error with Sentry before eating it (see bug 885173 for background)
-    if sentry.client:
-        sentry.captureException()
+    """Deals with any unhandled exceptions. If the exception is not a
+    BadDataError, it will be sent to Sentry. Regardless of the exception,
+    a 200 response with no updates is returned, because that's what the client
+    # expects. See bugs 885173 and 1069454 for additional background."""
+
+    if not isinstance(error, BadDataError):
+        if sentry.client:
+            sentry.captureException()
     log.debug('Hit exception, sending an empty response')
     response = make_response('<?xml version="1.0"?>\n<updates>\n</updates>')
     response.mimetype = 'text/xml'
     return response
 
 @app.route('/robots.txt')
 def robots():
     return send_from_directory(app.static_folder, "robots.txt")

So it's just a matter of raising BadDataError in the right places now...
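To spell out the filtering behaviour without dragging in Flask or Sentry: here's a sketch of the same logic in plain Python. The `handle_request` wrapper and `reported` list are hypothetical stand-ins for the errorhandler and `sentry.captureException()`, not anything in the actual codebase.

```python
# The empty-updates document Balrog returns for anything it can't answer.
EMPTY_XML = '<?xml version="1.0"?>\n<updates>\n</updates>'


class BadDataError(Exception):
    """Raised when a client sends data that appears to be invalid."""


reported = []  # stand-in for Sentry; real code calls sentry.captureException()


def handle_request(view):
    # Sketch of the patched errorhandler: BadDataError is swallowed
    # silently (it's the client's fault), any other exception is
    # reported, and in every failure case the client still gets the
    # empty-updates document.
    try:
        return view()
    except Exception as e:
        if not isinstance(e, BadDataError):
            reported.append(e)
        return EMPTY_XML
```

With this, a request carrying a bogus build target produces an empty update response and no Sentry noise, while a genuine bug still gets reported.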
Created attachment 8503328 [details] [diff] [review] deal with bad data better I think this covers all the cases identified in this bug and bug 1069508. I'm wondering if we should use it in bug 1069512 (for truly bad version numbers - not just things like w.x.y.z that we don't support yet) as well, but maybe that can be addressed there? I think the patch is pretty straightforward, but I'm always a bit unsure of whether to do exception handling for things like getResolvedPlatform/getPlatformData/getLocaleData that depend on each other. It seems safest to do it at every level, so that's what I've done.
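To illustrate the "do it at every level" approach: below is a self-contained sketch, not the attached patch. The class name and simplified method bodies are hypothetical, modeled on the lookup chain in the comment 0 traceback (getResolvedPlatform -> getPlatformData -> locale lookup); each level guards its own dict access so a bogus platform, a bogus alias, or a bogus locale all surface as BadDataError rather than KeyError.

```python
class BadDataError(Exception):
    """Raised when a client sends data that appears to be invalid."""


class SketchBlob(dict):
    # Hypothetical, simplified blob. Each method catches the KeyError
    # from its own lookup, so bad data is rejected at whichever level
    # first notices it.

    def getResolvedPlatform(self, platform):
        try:
            return self["platforms"][platform].get("alias", platform)
        except KeyError:
            raise BadDataError("Can't find platform '%s'" % platform)

    def getPlatformData(self, platform):
        platform = self.getResolvedPlatform(platform)
        try:
            return self["platforms"][platform]
        except KeyError:
            # Catches a bogus alias: resolution succeeded, but the
            # alias target doesn't exist.
            raise BadDataError("Can't find platform '%s'" % platform)

    def getLocaleData(self, platform, locale):
        try:
            return self.getPlatformData(platform)["locales"][locale]
        except KeyError:
            raise BadDataError("Can't find locale '%s'" % locale)
```

Note that a BadDataError raised by an inner level passes straight through the outer `except KeyError` clauses, so each error is raised exactly once, at the level that detected it.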
Comment on attachment 8503328 [details] [diff] [review]
deal with bad data better

Review of attachment 8503328 [details] [diff] [review]:
-----------------------------------------------------------------

r+ with suggestions for improvements.

::: auslib/blobs/apprelease.py
@@ +370,1 @@
>             localeData = self.getPlatformData(buildTarget)["locales"][locale]

Another chance to use getLocaleData() ?

::: auslib/blobs/gmp.py
@@ +43,4 @@
> 
>     def getPlatformData(self, vendor, platform):
>         platform = self.getResolvedPlatform(vendor, platform)
>         return self['vendors'][vendor]['platforms'][platform]

You have a try/except in this situation for apprelease, which catches a bogus alias. Worth having here too.

::: auslib/test/blobs/test_apprelease.py
@@ +25,5 @@
>         self.assertEquals(blob.getPlatformData('a'), dict(foo=1))
> 
> +    def testGetPlatformDataRaisesBadDataError(self):
> +        blob = SimpleBlob(platforms=dict(a=dict(foo=1)))
> +        self.assertRaises(BadDataError, blob.getPlatformData, "c")

Could also test that BadDataError is raised when an alias points to a non-existent platform.

::: auslib/web/base.py
@@ +26,5 @@
>  def generic(error):
> +    """Deals with any unhandled exceptions. If the exception is not a
> +    BadDataError, it will be sent to Sentry. Regardless of the exception,
> +    a 200 response with no updates is returned, because that's what the client
> +    # expects. See bugs 885173 and 1069454 for additional background."""

Nit, stray #.
Created attachment 8506347 [details] [diff] [review] add more tests; catch more errors; use getLocaleData more I'm assuming it's okay to carry forward the r+, but correct me if I'm wrong!
Looks good to me.
Commit pushed to master at https://github.com/mozilla/balrog https://github.com/mozilla/balrog/commit/8d1f8341d878c09c6367e2284c1f2a50cacffaed bug 1069454: deal with bad data better. r=nthomas
Comment on attachment 8506347 [details] [diff] [review] add more tests; catch more errors; use getLocaleData more Pushed this patch, will have it deployed to production later today.
We stopped getting https://errormill.mozilla.org/Releng/balrog-prod/group/172899/ and some other Sentry events after this made it into production. I've marked them all as resolved, so we're all done here now!