Closed Bug 671488 Opened 13 years ago Closed 9 years ago

Add blob caching to balrog

Categories

(Release Engineering Graveyard :: Applications: Balrog (backend), defect, P3)

defect

Tracking

(Not tracked)

RESOLVED FIXED

People

(Reporter: nthomas, Assigned: bhearsum)

References

Details

(Whiteboard: [updates][balrog])

Attachments

(2 files, 3 obsolete files)

There is lots of caching we might implement, but the obvious one is the xml returned for a request. Possibly keyed off updateQuery, probably cached for at least a few minutes. We should also build in a mechanism to empty the cache when the rules or json have changed.
Add caching to what?
Summary: Add caching → Add caching to AUS3
Component: Release Engineering → Release Engineering: Automation
Priority: P3 → --
QA Contact: release → catlee
Hardware: x86 → All
Whiteboard: [updates]
Priority: -- → P3
Whiteboard: [updates] → [updates][balrog]
Blocks: balrog-nightly
No longer blocks: balrog
Catlee and I were talking the other day and the idea of caching JSON decoding came up. It's (probably) going to be one of the more expensive things we do.
It'll be easier to do this once we have perf tests for Balrog, because we'll have a better idea of where our bottlenecks are.
Depends on: 832464
Will also be informed by testing against production nodes once those are up.
Depends on: 832461
Product: mozilla.org → Release Engineering
No longer blocks: balrog-nightly
mass component change
Component: General Automation → Balrog: Backend
Summary: Add caching to AUS3 → Add caching to balrog
Blocks: balrog-beta
We decided to add zeus caching in bug 982618, which negates the need for this (for now at least). We may come back to it at some point in the future.
No longer blocks: balrog-beta
Catlee and I talked about some ideas today:
10:45 <@catlee> caching!
10:45 < bhearsum> yeah, a blob cache would help with that too
10:45 < bhearsum> zeus helps a lot already
10:45 <@catlee> yeah, I bet
10:45 < bhearsum> but it doesn't help in cases where you have slightly differently urls that end up with the same blob
10:45 <@catlee> how long can you keep something in the blob cache?
10:46 <@catlee> is there a quick way to check if it's up to date?
10:46 < bhearsum> i was thinking it could be defined with a maximum size and use it as an LRU
10:46 < bhearsum> and then some cap on cache length for items too
10:46 < bhearsum> 30 or 60s or something
10:46 <@catlee> you could cache data version
10:47 < bhearsum> ah, that's a good idea!
10:47 <@catlee> that should be fast to check
10:47 < bhearsum> look up data version each time, and then only get the blob if data version is changed?
10:47 <@catlee> well, you can probably even cache that for short periods
10:47 <@catlee> so you check the data version at most once a minute
10:47 <@catlee> or maybe 30s
10:47 < bhearsum> yeah, possibly...
10:47 < bhearsum> we need to figure out what our maximum livable TTL is though
10:48 < bhearsum> eg, if we cache 1min in app and 30s at zeus, that's 1m30s
10:48 <@catlee> yeah
10:48 <@catlee> I wouldn't do it per request
10:48 < bhearsum> check data version?
10:48 <@catlee> yeah
10:49 <@catlee> even 10s would be fine
We decided today this didn't block using Balrog for release, unless the gradual turnup of Balrog revealed load issues on the apache or db hosts.
After doing a live test as part of bug 1075542, this has become necessary for shipping to the release channel. https://bugzilla.mozilla.org/show_bug.cgi?id=1075542#c1 has details.
Attached patch first stab at lru for blobs (obsolete) — Splinter Review
This needs more extensive manual testing (and maybe more unit tests), but it's gone through a couple of iterations and passes all tests. It ended up being more complicated than anticipated for a couple of reasons:
1) We need to support caching and not caching with the same object. This is because both apps share a single global namespace. Unsurprisingly, a non-caching cache library isn't a thing, so I had to write the MaybeCacher wrapper.
2) I couldn't figure out a way to magically deal with data_version, so it's being dealt with by hand in getReleaseBlob for now. At some point in the future I think it might be good to look at moving caching all the way down to AUSTable.select(), or maybe to have a better defined interface that expects data_version to be returned from any function that we could decorate.
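(For illustration, a minimal sketch of the MaybeCacher idea, assuming repoze.lru's LRUCache and named caches; the names here are only roughly aligned with the actual patch:)

# Sketch only, not the attached patch. Assumes repoze.lru's LRUCache.
from repoze.lru import LRUCache

class MaybeCacher(object):
    """Holds named caches ("blob", "blob_version", ...). If a cache was
    never configured (maxsize < 1), every lookup is a miss and puts are
    dropped, so the admin and public apps can share one global instance
    without the admin app caching anything."""
    def __init__(self):
        self.caches = {}

    def make_cache(self, name, maxsize):
        if maxsize >= 1:
            self.caches[name] = LRUCache(maxsize)

    def get(self, name, key):
        if name not in self.caches:
            return None
        return self.caches[name].get(key)

    def put(self, name, key, value):
        if name in self.caches:
            self.caches[name].put(key, value)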
Assignee: nobody → bhearsum
Status: NEW → ASSIGNED
Attachment #8533957 - Flags: feedback?(nthomas)
Comment on attachment 8533957 [details] [diff] [review]
first stab at lru for blobs

Review of attachment 8533957 [details] [diff] [review]:
-----------------------------------------------------------------

A great start. Is repoze.lru abandoned now, but still the best fit for us?

::: auslib/db.py
@@ +875,5 @@
> +        # We need to get the current data version of the blob no matter what,
> +        # because we need to invalidate the cache of the cached version is
> +        # of a different version.
> +        try:
> +            data_version = self.select(where=[self.name==name], columns=[self.data_version], limit=1, transaction=transaction)[0]

We could end up doing ~3500 selects/second here, in a world with geo-ip rules that mean we can't cache in Zeus (assumed 150M ADI, always running Firefox, checking twice a day, likely on the high side).

What if we build the cache like this:
* data_version is cached with a TTL of 60 seconds or so
* if we're over that: check data_version in the db
* if data_version has changed: invalidate blob cache

Moving the select inside the 'if cached:' block would be a start on that, but the cache timeout may be different.
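
(Roughly this shape, as a sketch only -- blob_cache and the helper names below are made up for illustration, not taken from the patch:)

import time

def get_release_blob(name, blob_cache, get_data_version, get_blob, ttl=60):
    # blob_cache is any dict-like store; get_data_version/get_blob stand in
    # for the cheap and expensive db queries respectively.
    now = time.time()
    cached = blob_cache.get(name)
    if cached and now - cached["checked_at"] < ttl:
        # data_version was checked recently enough; trust the cached blob.
        return cached["blob"]

    data_version = get_data_version(name)  # cheap: one column, one row
    if cached and cached["data_version"] == data_version:
        # Blob hasn't changed; just refresh the timestamp.
        cached["checked_at"] = now
        return cached["blob"]

    # Blob is new or changed; do the expensive read + JSON decode.
    blob = get_blob(name)
    blob_cache[name] = {"data_version": data_version, "blob": blob,
                        "checked_at": now}
    return blob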
Attachment #8533957 - Flags: feedback?(nthomas) → feedback+
(In reply to Nick Thomas [:nthomas] from comment #11)
> Comment on attachment 8533957 [details] [diff] [review]
> first stab at lru for blobs
> 
> Review of attachment 8533957 [details] [diff] [review]:
> -----------------------------------------------------------------
> 
> A great start. Is repoze.lru abandoned now, but still the best fit for us?
> 
> ::: auslib/db.py
> @@ +875,5 @@
> > +        # We need to get the current data version of the blob no matter what,
> > +        # because we need to invalidate the cache of the cached version is
> > +        # of a different version.
> > +        try:
> > +            data_version = self.select(where=[self.name==name], columns=[self.data_version], limit=1, transaction=transaction)[0]
> 
> We could end up doing ~3500 selects/second here, in a world with geo-ip
> rules that mean we can't cache in Zeus (assumed 150M ADI, always running
> Firefox, checking twice a day, likely on the high side).
> 
> What if we build the cache like this:
> * data_version is cached with a TTL of 60 seconds or so
> * if we're over that: check data_version in the db
> * if data_version has changed: invalidate blob cache
> 
> Moving the select inside the 'if cached:' block would be a start on that, but
> the cache timeout may be different.

This sounds good to me. After fixing my original patch to use the blob cache in getReleases() the top hotspots were:
* evaluateRules
* getReleaseBlob
* getRulesMatchingQuery
(plus things underneath these, such as select() and things underneath _it_, such as convertRows).

I'll look at getting both of these cached a bit - I don't see any reason not to do that now.
(In reply to Nick Thomas [:nthomas] from comment #11)
> Comment on attachment 8533957 [details] [diff] [review]
> first stab at lru for blobs
> 
> Review of attachment 8533957 [details] [diff] [review]:
> -----------------------------------------------------------------
> 
> A great start. Is repoze.lru abandoned now, but still the best fit for us?
> 
> ::: auslib/db.py
> @@ +875,5 @@
> > +        # We need to get the current data version of the blob no matter what,
> > +        # because we need to invalidate the cache of the cached version is
> > +        # of a different version.
> > +        try:
> > +            data_version = self.select(where=[self.name==name], columns=[self.data_version], limit=1, transaction=transaction)[0]
> 
> We could end up doing ~3500 selects/second here, in a world with geo-ip
> rules that mean we can't cache in Zeus (assumed 150M ADI, always running
> Firefox, checking twice a day, likely on the high side).
> 
> What if we build the cache like this:
> * data_version is cached with a TTL of 60 seconds or so
> * if we're over that: check data_version in the db
> * if data_version has changed: invalidate blob cache

Actually, one follow-up question...does this mean that we would never reload the blob unless data_version has changed? I recall talking about that, and I have it in my head that we weren't comfortable having no TTL on the blob itself (yet) - but I could be misremembering.
Blob caching is going to be the biggest win, and we decided to focus on that first. We'll want to do some caching for the rules table too, but that will likely happen in another bug.
Summary: Add caching to balrog → Add blob caching to balrog
Attached patch blob + blob version caching (obsolete) — Splinter Review
Functionality-wise, I think this is ready to go. My limited tests with cProfile show very little database usage. I couldn't figure out a way to profile across a large set of requests, but other metrics showed great improvement with caching across 40,000 requests. The wall clock time without caching was 20min, and mysql constantly used 20% CPU. With caching, wall clock went to 10min and mysql sat at 5% cpu (with rule caching this would probably sit pretty close to 0%). Wall clock time may be even lower in production. I suspect my local VM has some weird overhead and I/O inefficiencies that we wouldn't see there. The fact that I'm blasting it non-stop with requests (instead of spreading them over a longer time) probably reduces efficiency too.

I couldn't figure out a great way to handle the caching and non-caching cases, so I just added a big ol' toggle to MaybeCacher. It works, but it's a bit ugly. getReleaseBlob could probably be improved in some way too, but I haven't been able to figure out how. That might become clearer when we start caching rules -- maybe a common pattern will emerge.

One fun thing I found when profiling is that we call getReleaseBlob for name="*" for almost every request. That's very silly, so I got rid of it. I think the rest of the patch is pretty well documented but let me know if anything needs more explanation.

I spoke with cturra and he's up for doing another release channel traffic test next week, so I'm planning to do that, probably the same day that this goes to production.
Attachment #8533957 - Attachment is obsolete: true
Attachment #8535679 - Flags: review?(nthomas)
Attachment #8535679 - Flags: review?(catlee)
Comment on attachment 8535679 [details] [diff] [review]
blob + blob version caching

Review of attachment 8535679 [details] [diff] [review]:
-----------------------------------------------------------------

::: admin.py
@@ +47,4 @@
>      from auslib.admin.base import app
>      from migrate.exceptions import DatabaseAlreadyControlledError
>  
> +    cache.enabled = False

this doesn't seem to do anything in MaybeCacher

::: auslib/blobs/apprelease.py
@@ +73,5 @@
> +        # even attempt to look it up.
> +        if patch["from"] != "*":
> +            return dbo.releases.getReleaseBlob(name=patch["from"])
> +        else:
> +            return None

good catch!

::: auslib/db.py
@@ +901,5 @@
> +                blob = createBlob(row['data'])
> +                cache.put("blob", name, {"data_version": data_version, "blob": blob})
> +            except IndexError:
> +                raise KeyError("Couldn't find release with name '%s'" % name)
> +

we spoke in IRC about possibly using a pattern of

foo = cache.get(cache_name, key, calc_foo)

instead of

foo = cache.get(cache_name, key)
if not foo:
    foo = calc_foo()
    cache.put(cache_name, key, foo)

::: auslib/test/test_db.py
@@ +937,5 @@
> +                self.releases.getReleaseBlob(name="a")
> +                t.return_value += 1
> +
> +            self._checkCacheStats(cache.caches["blob"], 5, 4, 1)
> +            self._checkCacheStats(cache.caches["blob_version"], 5, 3, 2)

some comments here as to why we expect these numbers of hits/misses would be great

(and below too)

::: auslib/util/cache.py
@@ +11,5 @@
> +
> +    If maxsize is less than 1, this class' methods will always return nothing.
> +    If maxsize is 1 or more, caching will be performed. It is up to callers to
> +    cope with both cases. In a world where bug 1109295 is fixed, we may not
> +    need this anymore."""

I think this docstring needs updating re: how to disable the cache.
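
(A minimal sketch of that get-with-getter pattern, assuming the named caches are repoze.lru-style objects with get()/put(); this is the suggested shape, not the real auslib/util/cache.py:)

def cache_get(caches, name, key, value_getter=None):
    cache = caches.get(name)
    if cache is None:
        # Caching disabled for this name: always recompute.
        return value_getter() if value_getter else None

    value = cache.get(key)
    if value is None and value_getter:
        # Miss: compute once and store for the next lookup.
        value = value_getter()
        cache.put(key, value)
    return value

# Usage would then collapse to something like:
#   blob_info = cache_get(cache.caches, "blob", name, getBlob)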
I think this addresses all the review comments. Thanks for the help factoring out the getters.
Attachment #8535679 - Attachment is obsolete: true
Attachment #8535679 - Flags: review?(nthomas)
Attachment #8535679 - Flags: review?(catlee)
Attachment #8535831 - Flags: review?(nthomas)
Attachment #8535831 - Flags: review?(catlee)
Comment on attachment 8535831 [details] [diff] [review]
fix up comments/docstrings, factor out getters in getReleaseBlob

Review of attachment 8535831 [details] [diff] [review]:
-----------------------------------------------------------------

::: auslib/web/base.py
@@ +48,5 @@
> +Hit Percentage: %s%%<br/>
> +Requests: %s<br/>
> +Hits: %s<br/>
> +Misses: %s<br/>
> +""" % (float(cache.caches["blob"].hits) / float(cache.caches["blob"].lookups) *100, cache.caches["blob"].lookups, cache.caches["blob"].hits, cache.caches["blob"].misses)

oooh, I missed this before!

extra bonus points if you add json formatted output too.
Attachment #8535831 - Flags: review?(catlee) → review+
I filed bug 1111032 for rule caching.
(In reply to Chris AtLee [:catlee] from comment #18)
> Comment on attachment 8535831 [details] [diff] [review]
> fix up comments/docstrings, factor out getters in getReleaseBlob
> 
> Review of attachment 8535831 [details] [diff] [review]:
> -----------------------------------------------------------------
> 
> ::: auslib/web/base.py
> @@ +48,5 @@
> > +Hit Percentage: %s%%<br/>
> > +Requests: %s<br/>
> > +Hits: %s<br/>
> > +Misses: %s<br/>
> > +""" % (float(cache.caches["blob"].hits) / float(cache.caches["blob"].lookups) *100, cache.caches["blob"].lookups, cache.caches["blob"].hits, cache.caches["blob"].misses)
> 
> oooh, I missed this before!
> 
> extra bonus points if you add json formatted output too.

Oops, I actually meant to remove that and file a bug on getting cache metrics sent somewhere. This was my very hacky way to know if the cache was working. It's not great for production because you get different stats per wsgi process, and they reset after we hit max requests (5000 per process in production IIRC).
Comment on attachment 8535831 [details] [diff] [review]
fix up comments/docstrings, factor out getters in getReleaseBlob

Review of attachment 8535831 [details] [diff] [review]:
-----------------------------------------------------------------

Great stuff!

::: auslib/blobs/apprelease.py
@@ +73,5 @@
> +        # even attempt to look it up.
> +        if patch["from"] != "*":
> +            return dbo.releases.getReleaseBlob(name=patch["from"])
> +        else:
> +            return None

A very nice catch. Let's watch out for new exceptions in sentry, in case the previous try/except was catching cases we don't handle now.

::: auslib/db.py
@@ +903,5 @@
> +        # of the cached blob and the latest data version don't match, we need
> +        # to update the cache with the latest blob.
> +        if cached_blob["data_version"] != data_version:
> +            blob_info = getBlob()
> +            cache.put("blob", name, blob_info)

Don't you need to update the blob_version cache too ? Otherwise they're out of sync for a while, and the blob cache will keep being updated. You could test for this by checking how many times getBlob() is called.

::: auslib/test/test_db.py
@@ +937,5 @@
> +                self.releases.getReleaseBlob(name="a")
> +                t.return_value += 1
> +
> +            # We've retrieved the blob and blob version 5 times.
> +            # The blob cache has a ttl of 5, so we're expecting the first one

Nit, s/5/10/

::: auslib/util/cache.py
@@ +23,5 @@
> +    def reset(self):
> +        self.caches.clear()
> +
> +    def get(self, name, key, value_getter=None):
> +        """Returns the value of the specified key from the named cached.

Nit s/cached/cache/

@@ +55,5 @@
> +        if not name:
> +            for c in self.caches.values():
> +                c.clear()
> +
> +        self.caches[name].clear()

Maybe an else ahead of the last line.

::: balrog-server.py
@@ +64,5 @@
>          pass
>  
> +    if options.profile_dir:
> +        from werkzeug.contrib.profiler import ProfilerMiddleware
> +        if not path.exists(options.profile_dir):

Great!

::: balrog.ini-dist
@@ +17,5 @@
>  ; domains that are allowed in update data, comma separated
>  domain_whitelist=download.mozilla.org,stage.mozilla.org,ftp.mozilla.org
> +
> +[caches]
> +; Caches to set-up. Keys are cache names, values are size and ttl

Nit, values are number of cache entries and ttl (seconds)

::: puppet/templates/balrog.ini.erb
@@ +9,5 @@
>  ; domains that are allowed in update data, comma separated
>  domain_whitelist=download.mozilla.org,stage.mozilla.org,ftp.mozilla.org
> +
> +[caches]
> +; Caches to set-up. Keys are cache names, values are size and ttl

See other nit.
Attachment #8535831 - Flags: review?(nthomas) → review+
(In reply to Nick Thomas [:nthomas] from comment #21)
> ::: auslib/db.py
> @@ +903,5 @@
> > +        # of the cached blob and the latest data version don't match, we need
> > +        # to update the cache with the latest blob.
> > +        if cached_blob["data_version"] != data_version:
> > +            blob_info = getBlob()
> > +            cache.put("blob", name, blob_info)
> 
> Don't you need to update the blob_version cache too ? Otherwise they're out
> of sync for a while, and the blob cache will keep being updated. You could
> test for this by checking how many times getBlob() is called.

Oops, good catch! I'll fix this and get some test coverage for it.
(In reply to Ben Hearsum [:bhearsum] from comment #22)
> (In reply to Nick Thomas [:nthomas] from comment #21)
> > ::: auslib/db.py
> > @@ +903,5 @@
> > > +        # of the cached blob and the latest data version don't match, we need
> > > +        # to update the cache with the latest blob.
> > > +        if cached_blob["data_version"] != data_version:
> > > +            blob_info = getBlob()
> > > +            cache.put("blob", name, blob_info)
> > 
> > Don't you need to update the blob_version cache too ? Otherwise they're out
> > of sync for a while, and the blob cache will keep being updated. You could
> > test for this by checking how many times getBlob() is called.
> 
> Oops, good catch! I'll fix this and get some test coverage for it.

Hmm, diving into this a bit more I'm not sure I see how this could happen. "data_version" will always be newer than the one in cached_blob because the blob_version ttl is lower than blob's. This could be a problem if the ttls of blob and blob_version get out of sync (eg, blob_version refreshes right before blob expires), but I don't think this can happen -- anytime blob_version gets invalidated we'll end up updating the blob cache.

I suppose that there might be a very small window for this to happen if the blob ttl expires after we retrieve data_version (I don't think it could happen before that, otherwise data_version would've expired already) - I'm not sure if this is what you were getting at though. Nevertheless, it seems like it would be a good idea to check whether cached_blob["data_version"] or data_version is newer, and update both caches appropriately.
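
(As a sketch of what "check which is newer and update both caches" could look like -- the helper name is hypothetical and the details follow the review quotes above, not the pushed commit:)

def _reconcile_blob_caches(cache, name, cached_blob, data_version, getBlob):
    # Sketch only; getBlob is the inner helper quoted in the review, and
    # "blob"/"blob_version" are the named caches.
    if cached_blob["data_version"] > data_version:
        # The blob cache is newer than the cached data_version, so the
        # blob_version entry is stale: refresh it from the blob cache.
        cache.put("blob_version", name, cached_blob["data_version"])
        return cached_blob["blob"]
    elif cached_blob["data_version"] < data_version:
        # The database has moved on; re-read the blob and update both caches.
        blob_info = getBlob()
        cache.put("blob", name, blob_info)
        cache.put("blob_version", name, data_version)
        return blob_info["blob"]
    else:
        return cached_blob["blob"]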
I pushed https://github.com/bhearsum/balrog/commit/96c3c9d7b9d1d1c6e935a103df2d3b9b87b8b884 which I _think_ gets rid of any possibility of extra blob retrievals through more specific checking of the data versions. Nick, does this address what you were talking about? I haven't found a way to test this case, because inner methods like getBlob are unmockable.
I think I just got myself confused. If the blob changes we won't notice until the blob_version ttl expires, at which point both caches are updated very close together, with just the time to do getBlob() between them. Let's not worry about the blob changing again in that small gap. If we ever see load spikes that last as long as the blob_version ttl we can revisit.
It looks like I'm OK to carry forward the r+ here -- the only code changes in this patch are the fix to MaybeCacher.clear() and the better handling of data_version differences. I've gone ahead and landed this so I can do some testing on dev.

Before landing I ran the same set of requests that I was testing with locally, which took 15min14s to complete. Once dev picks up this change I'll run it again and see how it performs.
Attachment #8535831 - Attachment is obsolete: true
Attachment #8537212 - Flags: checked-in+
Oops, I forgot to remove the /cache_stats endpoint in my original push. I just pushed that separately.
Depends on: 1112187
This popped up when I tried to test in dev:


KeyError: "Couldn't find release with name 'Firefox-34.0b9-build1'"

Stacktrace (most recent call last):

  File "flask/app.py", line 1475, in full_dispatch_request
    rv = self.dispatch_request()
  File "flask/app.py", line 1461, in dispatch_request
    return self.view_functions[rule.endpoint](**req.view_args)
  File "flask/views.py", line 84, in view
    return self.dispatch_request(*args, **kwargs)
  File "flask/views.py", line 149, in dispatch_request
    return meth(*args, **kwargs)
  File "auslib/web/views/client.py", line 38, in get
    xml = release.createXML(query, update_type, app.config["WHITELISTED_DOMAINS"], app.config["SPECIAL_FORCE_HOSTS"])
  File "auslib/blobs/apprelease.py", line 131, in createXML
    patches = self._getPatchesXML(localeData, updateQuery, whitelistedDomains, specialForceHosts)
  File "auslib/blobs/apprelease.py", line 528, in _getPatchesXML
    xml = self._getSpecificPatchXML(patchKey, patchType, patch, updateQuery, whitelistedDomains, specialForceHosts)
  File "auslib/blobs/apprelease.py", line 80, in _getSpecificPatchXML
    fromRelease = self._getFromRelease(patch)
  File "auslib/blobs/apprelease.py", line 75, in _getFromRelease
    return dbo.releases.getReleaseBlob(name=patch["from"])
  File "auslib/db.py", line 889, in getReleaseBlob
    data_version = cache.get("blob_version", name, getDataVersion)
  File "auslib/util/cache.py", line 40, in get
    value = value_getter()
  File "auslib/db.py", line 887, in getDataVersion
    raise KeyError("Couldn't find release with name '%s'" % name)



Which is probably because dev doesn't have a good data set (I uploaded the Firefox-35.0b3-build1 blob and some other ones, but not all of the blobs for the partials they have). Nonetheless, the code should probably handle this better.

There's also some other tracebacks that may be rooted in the same issue. I'll look into them as well.
Looks like this is partly (and maybe fully) a silly mistake in apprelease.py. When I factored out the fromRelease logic I didn't include the KeyError that it catches: https://github.com/mozilla/balrog/commit/d51724def8e71ec9b9c9ec97efe1d89a011bcf08#diff-1a7ea9e25a2f1a8c15c06fd95243142aL290

I pushed https://github.com/mozilla/balrog/commit/e34f70bb3c16a3b1c94033bd55f74327fb157078 as a bustage fix for this.
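
(Roughly, the bustage fix reinstates that handling -- this is a sketch of the shape, not the actual commit:)

def _getFromRelease(self, patch):
    # If no partial is specified, don't even attempt to look it up.
    if patch["from"] == "*":
        return None
    try:
        return dbo.releases.getReleaseBlob(name=patch["from"])
    except KeyError:
        # The "from" release doesn't exist (eg, bad data on dev);
        # treat it as if no partial was available.
        return None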
Another one came up, but it's a bit less serious. It happens when we have a rule that points at a non-existent release. I think this is happening a lot on dev because the database cleanup script is deleting old releases without checking to see if a rule points at them...

IndexError: list index out of range

Stacktrace (most recent call last):

  File "flask/app.py", line 1475, in full_dispatch_request
    rv = self.dispatch_request()
  File "flask/app.py", line 1461, in dispatch_request
    return self.view_functions[rule.endpoint](**req.view_args)
  File "flask/views.py", line 84, in view
    return self.dispatch_request(*args, **kwargs)
  File "flask/views.py", line 149, in dispatch_request
    return meth(*args, **kwargs)
  File "auslib/web/views/client.py", line 35, in get
    release, update_type = AUS.evaluateRules(query)
  File "auslib/AUS.py", line 84, in evaluateRules
    release = dbo.releases.getReleases(name=rule['mapping'], limit=1)[0]
Also found:
ValueError: Couldn't find fileUrl

Stacktrace (most recent call last):

  File "flask/app.py", line 1475, in full_dispatch_request
    rv = self.dispatch_request()
  File "flask/app.py", line 1461, in dispatch_request
    return self.view_functions[rule.endpoint](**req.view_args)
  File "flask/views.py", line 84, in view
    return self.dispatch_request(*args, **kwargs)
  File "flask/views.py", line 149, in dispatch_request
    return meth(*args, **kwargs)
  File "auslib/web/views/client.py", line 38, in get
    xml = release.createXML(query, update_type, app.config["WHITELISTED_DOMAINS"], app.config["SPECIAL_FORCE_HOSTS"])
  File "auslib/blobs/apprelease.py", line 135, in createXML
    patches = self._getPatchesXML(localeData, updateQuery, whitelistedDomains, specialForceHosts)
  File "auslib/blobs/apprelease.py", line 532, in _getPatchesXML
    xml = self._getSpecificPatchXML(patchKey, patchType, patch, updateQuery, whitelistedDomains, specialForceHosts)
  File "auslib/blobs/apprelease.py", line 88, in _getSpecificPatchXML
    url = self._getUrl(updateQuery, patchKey, patch, specialForceHosts)
  File "auslib/blobs/apprelease.py", line 677, in _getUrl
    raise ValueError("Couldn't find fileUrl")
My cached run still took 14min. I'm not really sure what to make of this. It's a bit faster than the non-caching run, but not as fast as I expected. The fact that I'm replaying requests with 0 delay between them may be a big factor here - we could be tickling a different bottleneck because of this.

The other curious thing is that I started running a second non-caching test earlier (by accident - I had thought that the patch was landed already, but it wasn't) and it took significantly longer than 15min and did not even complete.

Given the above I'm inclined to go ahead with this in production and try a _real_ test - I'm really not confident in being able to replicate actual production load. I still need to set up the latest release data for that, and I'm going to try to fix one or both of the IndexError and ValueError from the previous comments - I don't think either of them is a blocker though - they're both related to bad data.
Here's a follow-up to fix the other two exceptions. The IndexError is kind of a silly one - we're indexing an array that could potentially be empty. The ValueError is one that the blob raises itself - and I think it's better classified as a BadDataError. The times we hit it right now are because the Thunderbird-34.0b1-build1 blob is just not useful (https://aus4-admin.mozilla.org/releases/Thunderbird-34.0b1-build1/data).
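
(For the IndexError part, the fix is roughly a guard like this around the line from the traceback above -- illustrative only, not the attachment:)

releases = dbo.releases.getReleases(name=rule['mapping'], limit=1)
if not releases:
    # The rule points at a release that no longer exists; offer no update
    # instead of blowing up with an IndexError.
    return None, None
release = releases[0]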
Attachment #8537435 - Flags: review?(nthomas)
(In reply to Ben Hearsum [:bhearsum] from comment #27)
> Created attachment 8537212 [details] [diff] [review]
> fix review nits & better handling of data_version inequality

Looks like all of 96c3c9d7b9d1d1c6e935a103df2d3b9b87b8b884 is included, did you mean to revert some of it ? eg the commented out test, the cache.put on blob_version we decided we didn't need
Comment on attachment 8537435 [details] [diff] [review]
fix IndexError and ValueError

Review of attachment 8537435 [details] [diff] [review]:
-----------------------------------------------------------------

The patch looks fine, but just a couple of queries ...

(In reply to Ben Hearsum [:bhearsum] from comment #34)
> The ValueError is one that the blob raises itself - and I think it's better
> classified as a BadDataError. The times we hit it right now are because the
> Thunderbird-34.0b1-build1 blob is just not useful
> (https://aus4-admin.mozilla.org/releases/Thunderbird-34.0b1-build1/data).

Did you mean to point to dev here ? We're using Thunderbird-34.0b1-build3 in production, and that seems fine. 

More importantly, BadDataError has been for odd queries up to now, and we'll lose the sentry reports by using it here. So we rely on QE to catch 'no update offered' errors ?
Attachment #8537435 - Flags: review?(nthomas) → review+
(In reply to Nick Thomas [:nthomas] from comment #35)
> (In reply to Ben Hearsum [:bhearsum] from comment #27)
> > Created attachment 8537212 [details] [diff] [review]
> > fix review nits & better handling of data_version inequality
> 
> Looks like all of 96c3c9d7b9d1d1c6e935a103df2d3b9b87b8b884 is included, did
> you mean to revert some of it ? eg the commented out test, the cache.put on
> blob_version we decided we didn't need

I wasn't sure whether or not to remove the test - it's a framework for one that would be useful, if I could figure out a way to verify it properly. I'm happy to remove it if you think it's just noise.

I left the cache.put to handle the off chance that the blob data version is higher than the blob version one (I think there's still a remote possibility of this). I see that that's broken anyways though, as it's re-caching the already-cached blob version. I can fix this up or just plain remove it.
(In reply to Nick Thomas [:nthomas] from comment #36)
> Comment on attachment 8537435 [details] [diff] [review]
> fix IndexError and ValueError
> 
> Review of attachment 8537435 [details] [diff] [review]:
> -----------------------------------------------------------------
> 
> The patch looks fine, but just a couple of queries ...
> 
> (In reply to Ben Hearsum [:bhearsum] from comment #34)
> > The ValueError is one that the blob raises itself - and I think it's better
> > classified as a BadDataError. The times we hit it right now are because the
> > Thunderbird-34.0b1-build1 blob is just not useful
> > (https://aus4-admin.mozilla.org/releases/Thunderbird-34.0b1-build1/data).
> 
> Did you mean to point to dev here ? We're using Thunderbird-34.0b1-build3 in
> production, and that seems fine. 

I did mean to point to this one (dev is just a copy of it). This blob is actually not OK as far as I can tell -- it's missing all of the top level stuff like fileUrls - so you can't construct a useful response from it.

> More importantly, BadDataError has been for odd queries up to now, and we'll
> lose the sentry reports by using it here. So we rely on QE to catch 'no
> update offered' errors ?

Good point, let's not overload this. Do you think it's useful for exceptions caused by bad blobs to bubble up to Sentry? If so, it could be argued that this patch isn't wanted at all (or that we should raise an exception if mapping doesn't resolve to a release, at least).
Depends on: 1113123
When this went live in production around 7:30am pacific this morning we saw a huge dip in network traffic between the database and webheads. They were all pulling ~30MB/sec before the patch went live, and are now only pulling ~1MB/sec. Average CPU utilization was around 15-20% (varies by webhead) and dropped to 10-15% -- so about 5% less on each webhead. That's less than I expected but it's something.

We'll proceed with the release channel test in bug 1112662 in a bit and go from there.
So the main patch here is in production now. There's still the IndexError and ValueError to fix, so I'll leave this bug open until we deal with them.

Nick, any thoughts on:
> > More importantly, BadDataError has been for odd queries up to now, and we'll
> > lose the sentry reports by using it here. So we rely on QE to catch 'no
> > update offered' errors ?
> 
> Good point, let's not overload this. Do you think it's useful for exceptions
> caused by bad blobs to bubble up to Sentry? If so, it could be argued that
> this patch isn't wanted at all (or that we should raise an exception if
> mapping doesn't resolve to a release, at least).
Flags: needinfo?(nthomas)
No longer depends on: 832464
I don't have a good answer here. We need some way to know if we screwed up, and errormill may be the best we have right now.
Flags: needinfo?(nthomas)
(In reply to Nick Thomas [:nthomas] (Away until Jan 12th) from comment #41)
> I don't have a good answer here. We need some way to know if we screwed up,
> and errormill may be the best we have right now.

OK, I'm not going to do anything about these for now then. We can re-evaluate this later.
Status: ASSIGNED → RESOLVED
Closed: 9 years ago
Resolution: --- → FIXED
Product: Release Engineering → Release Engineering Graveyard