Closed Bug 1069454 Opened 10 years ago Closed 10 years ago

deal with bad data better

Categories

(Release Engineering Graveyard :: Applications: Balrog (backend), defect)

defect
Not set
normal

Tracking

(Not tracked)

RESOLVED FIXED

People

(Reporter: bhearsum, Assigned: bhearsum)

References

Details

Attachments

(1 file, 1 obsolete file)

      No description provided.
These can be seen on https://errormill.mozilla.org/Releng/balrog-prod/group/170698/

For example:
A request to https://aus3.mozilla.org/update/3/Firefox/27.0/20131216183647/Darwin_x86_64-gcc3/en-US/beta/Darwin%2011.4.2/default/default/update.xml

Ends up with this traceback:
KeyError: u'Darwin_x86_64-gcc3'

Stacktrace (most recent call last):

  File "flask/app.py", line 1060, in full_dispatch_request
    rv = self.dispatch_request()
  File "flask/app.py", line 1047, in dispatch_request
    return self.view_functions[rule.endpoint](**req.view_args)
  File "flask/views.py", line 56, in view
    return self.dispatch_request(*args, **kwargs)
  File "flask/views.py", line 112, in dispatch_request
    return meth(*args, **kwargs)
  File "auslib/web/views/client.py", line 38, in get
    xml = release.createXML(query, update_type, app.config["WHITELISTED_DOMAINS"], app.config["SPECIAL_FORCE_HOSTS"])
  File "auslib/blob.py", line 222, in createXML
    localeData = self.getPlatformData(buildTarget)["locales"][locale]
  File "auslib/blob.py", line 118, in getPlatformData
    platform = self.getResolvedPlatform(platform)
  File "auslib/blob.py", line 115, in getResolvedPlatform
    return self['platforms'][platform].get('alias', platform)


This doesn't surprise me - Darwin_x86_64-gcc3 is not a build target we use for releases that we ship anymore - and the buildid isn't one that's known to us as a release build either. However, we've already had thousands of tracebacks for this, so there's clearly something in the wild that's using this. I'm guessing that there's some 3rd party build that still points at our update server.

For Balrog, we should probably catch these exceptions rather than having them bubble up.
Summary: lots of tracebacks from Darwin → lots of tracebacks from Darwin_x86_64-gcc3 after pushing balrog to beta
On the theory that maybe debug dep builds somehow had "beta" as an update channel, I tried to find a build that matched https://aus3.mozilla.org/update/3/Firefox/32.0/20140822024446/Darwin_x86_64-gcc3/en-US/beta/Darwin 13.3.0/default/default/update.xml

I couldn't find anything in http://ftp.mozilla.org/pub/mozilla.org/firefox/tinderbox-builds/mozilla-beta-macosx64-debug/ with that buildid.

I also inspected one of the builds and the channel was set to "default". So, I don't know where it's coming from, but we're now up to 8k exceptions of this type - so there's some custom build or builds out there that are built exclusively for 64-bit Mac, trying to update from us.
At more or less random, I noticed that one URL here was https://aus3.mozilla.org/update/3/Firefox/31.0/20140610163407/Darwin_x86_64-gcc3/en-US/beta/Darwin%2013.4.0/default/default/update.xml. I googled the buildid and it turned out to be 31.0b1. The one in the previous comment is a valid beta too, as it turns out.

This makes me _strongly_ suspect there's some sort of automated process pinging these URLs. AFAICT, update verify and final verify are innocent, and we had some in the past 7min (when no releases or running). I'm wondering if there's some sort of CI test or QA test that does this...

Whimboo, do you know of any automation that would use the buildid of a proper release build, but a debug-style build target (like Darwin_x86_64-gcc3)?
Flags: needinfo?(hskupin)
Then again, maybe not...I just noticed that we get the original IPs on Sentry, and they seem to come from all over the world.
Found http://superuser.com/questions/581334/automatic-updates-of-firefox-stable-and-beta-dont-work-on-os-x randomly, which appears to have a user in the wild running a beta with Darwin_x86_64-gcc3...
I think needinfo is not needed anymore on me. I'm also not familiar with such builds. All what we test in our CI are builds we get notifications via Pulse. So if those builds would appear from qa.scl3.mozilla.com something would be wrong with the build machines.
Flags: needinfo?(hskupin)
Reading more on that stackoverflow page, I see that the user has stripped away the x86 parts of the binary, which results in BUILD_TARGET changing to Darwin_x86_64-gcc3. I now highly suspect that this is why we're getting so many of these requests. Sadly, these also means that tons of users in the wild are _not_ getting updates.

This bug is scoped to having Balrog deal with these requests more gracefully, but I've filed others to try and help our userbase:
https://bugzilla.mozilla.org/show_bug.cgi?id=1071576
https://bugzilla.mozilla.org/show_bug.cgi?id=1071580
We talked about this a bit today. Nick suggested raising specific tracebacks when we find "bad data" like this. We can then add a blacklist of tracebacks to NOT send to Sentry. That way we won't clog it up with tracebacks for things that aren't Balrog's fault. This will probably fix bug 1069508, too.
Assignee: nobody → bhearsum
Summary: lots of tracebacks from Darwin_x86_64-gcc3 after pushing balrog to beta → deal with unknown build targets being sent to balrog
(In reply to Ben Hearsum [:bhearsum] from comment #9)
> We talked about this a bit today. Nick suggested raising specific tracebacks
> when we find "bad data" like this. We can then add a blacklist of tracebacks
> to NOT send to Sentry. That way we won't clog it up with tracebacks for
> things that aren't Balrog's fault. This will probably fix bug 1069508, too.

This idea seems to work well, based on my quick tests:
diff --git a/auslib/web/base.py b/auslib/web/base.py
index a86f709..8a3a363 100644
--- a/auslib/web/base.py
+++ b/auslib/web/base.py
@@ -8,28 +8,39 @@ from raven.contrib.flask import Sentry
 from auslib.AUS import AUS
 
 app = Flask(__name__)
 AUS = AUS()
 sentry = Sentry()
 
 from auslib.web.views.client import ClientRequestView
 
+
+class BadDataError(Exception):
+    """Raised when a client sends data that appears to be invalid."""
+    pass
+
+
 @app.errorhandler(404)
 def fourohfour(error):
     """We don't return 404s in AUS. Instead, we return empty XML files"""
     response = make_response('<?xml version="1.0"?>\n<updates>\n</updates>')
     response.mimetype = 'text/xml'
     return response
 
 @app.errorhandler(Exception)
 def generic(error):
-    # Log the error with Sentry before eating it (see bug 885173 for background)
-    if sentry.client:
-        sentry.captureException()
+    """Deals with any unhandled exceptions. If the exception is not a
+    BadDataError, it will be sent to Sentry. Regardless of the exception,
+    a 200 response with no updates is returned, because that's what the client
+    # expects. See bugs 885173 and 1069454 for additional background."""
+
+    if not isinstance(error, BadDataError):
+        if sentry.client:
+            sentry.captureException()
     log.debug('Hit exception, sending an empty response')
     response = make_response('<?xml version="1.0"?>\n<updates>\n</updates>')
     response.mimetype = 'text/xml'
     return response
 
 @app.route('/robots.txt')
 def robots():
     return send_from_directory(app.static_folder, "robots.txt")


So it's just a matter of raising BadDataError in the right places now...
Attached patch deal with bad data better (obsolete) — Splinter Review
I think this covers all the cases identified in this bug and bug 1069508. I'm wondering if we should use it in bug 1069512 (for truly bad version numbers - not just things like w.x.y.z that we don't support yet) as well, but maybe that can be addressed there?

I think the patch is pretty straightforward, but I'm always a bit unsure of whether to do exception handling for things like getResolvedPlatform/getPlatformData/getLocaleData that depend on each other. It seems safest to do it at every level, so that's what I've done.
Attachment #8503328 - Flags: review?(nthomas)
Summary: deal with unknown build targets being sent to balrog → deal with bad data better
Comment on attachment 8503328 [details] [diff] [review]
deal with bad data better

Review of attachment 8503328 [details] [diff] [review]:
-----------------------------------------------------------------

r+ with suggestions for improvements.

::: auslib/blobs/apprelease.py
@@ +370,1 @@
>          localeData = self.getPlatformData(buildTarget)["locales"][locale]

Another chance to use getLocaleData() ?

::: auslib/blobs/gmp.py
@@ +43,4 @@
>  
>      def getPlatformData(self, vendor, platform):
>          platform = self.getResolvedPlatform(vendor, platform)
>          return self['vendors'][vendor]['platforms'][platform]

You have a try/except in this situation for apprelease, which catches a bogus alias. Worth having here too.

::: auslib/test/blobs/test_apprelease.py
@@ +25,5 @@
>          self.assertEquals(blob.getPlatformData('a'), dict(foo=1))
>  
> +    def testGetPlatformDataRaisesBadDataError(self):
> +        blob = SimpleBlob(platforms=dict(a=dict(foo=1)))
> +        self.assertRaises(BadDataError, blob.getPlatformData, "c")

Could also test that BadDataError is raised when an alias points to a non-existent platform.

::: auslib/web/base.py
@@ +26,5 @@
>  def generic(error):
> +    """Deals with any unhandled exceptions. If the exception is not a
> +    BadDataError, it will be sent to Sentry. Regardless of the exception,
> +    a 200 response with no updates is returned, because that's what the client
> +    # expects. See bugs 885173 and 1069454 for additional background."""

Nit, stray #.
Attachment #8503328 - Flags: review?(nthomas) → review+
I'm assuming it's okay to carry forward the r+, but correct me if I'm wrong!
Attachment #8506347 - Flags: review+
Looks good to me.
Attachment #8503328 - Attachment is obsolete: true
Comment on attachment 8506347 [details] [diff] [review]
add more tests; catch more errors; use getLocaleData more

Pushed this patch, will have it deployed to production later today.
Attachment #8506347 - Flags: checked-in+
Depends on: 1084393
We stopped getting https://errormill.mozilla.org/Releng/balrog-prod/group/172899/ and some other Sentry events after this made it into production. I've marked them all as resolved, so we're all done here now!
Status: NEW → RESOLVED
Closed: 10 years ago
Resolution: --- → FIXED
Product: Release Engineering → Release Engineering Graveyard
You need to log in before you can comment on or make changes to this bug.

Attachment

General

Created:
Updated:
Size: