Closed Bug 1111032 Opened 9 years ago Closed 8 years ago

add caching to balrog rule evaluation

Categories

(Release Engineering Graveyard :: Applications: Balrog (backend), defect)

defect
Not set
normal

Tracking

(Not tracked)

RESOLVED FIXED

People

(Reporter: bhearsum, Assigned: bhearsum, Mentored)

References

Details

(Whiteboard: [lang=python])

Attachments

(1 file, 1 obsolete file)

Attached patch rule-caching.diff (obsolete) — Splinter Review
I started looking at this a bit in bug 671488, but it's less important than blob caching, and different enough to deserve its own bug. Attached is my really, really brain-dead rule caching. It's probably not what we want, but it's a starting point.

Catlee, Nick, and I talked about this yesterday too. A few notes from that:
* The attached patch has too many axes to be useful. OS Version in particular multiplies into thousands (if not tens of thousands) of values in the real world. We already only query by product, buildTarget, locale, headerArchitecture, and distribution/distVersion (https://github.com/mozilla/balrog/blob/4338af9338012201e4fc355c778305bc09e905b7/auslib/db.py#L709) - creating the cache with those values as keys might be good enough.
* Another approach could be to have a short cache on the entire set of rules, and go through the work of evaluating them for every query.

We definitely want to cache the rules we get back from the database in some way. What's still up in the air is whether we should also cache the evaluated rules (i.e., an incoming request with product, version, branch = foo, bar, blah gets release X). I think it's worth doing some profiling to see whether that's a hotspot before we go down that road.
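
For illustration, here's a minimal Python sketch of the keyed approach (not the merged patch): the cache key is the tuple of fields we already filter on, with a short TTL so rule changes still propagate quickly. query_rules_from_db is a hypothetical stand-in for the existing database query, and repoze.lru (mentioned later in this bug) is assumed for the expiring cache.

from repoze.lru import ExpiringLRUCache

# Short TTL so edits to rules show up within a minute.
rules_cache = ExpiringLRUCache(2000, default_timeout=60)

def get_matching_rules(product, buildTarget, locale, headerArchitecture,
                       distribution, distVersion):
    # Key on the same fields the existing query filters by.
    key = (product, buildTarget, locale, headerArchitecture,
           distribution, distVersion)
    rules = rules_cache.get(key)
    if rules is None:
        # Cache miss: hit the database and remember the result.
        rules = query_rules_from_db(product, buildTarget, locale,
                                    headerArchitecture, distribution,
                                    distVersion)
        rules_cache.put(key, rules)
    return rules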
Mentor: bhearsum
Whiteboard: [lang=python]
I'm going to look at doing this ASAP, because the database instance size we'd like to use in CloudOps is smaller than the scl3 one. Even with just 50% of the traffic pointed at it, the database is seeing significant load from rules table queries.
Assignee: nobody → bhearsum
Blocks: 1248741
Attachment #8768524 - Flags: feedback?(nthomas)
Attachment #8768524 - Flags: feedback?(nthomas) → feedback+
Commit pushed to master at https://github.com/mozilla/balrog

https://github.com/mozilla/balrog/commit/8d5ad72fba921a54b2a9c9a7fb36299821b0f2a6
bug 1111032: Implement rule caching with a few parts of the update query as the key (#97). r=nthomas,rail,mostlygeek
Comment on attachment 8768524 [details] [review]
simple full-table rule caching

This got some feedback and updates in the PR, and has since been merged.
Attachment #8768524 - Flags: checked-in+
We did a bunch of verification of this in CloudOps stage and on a canary node in prod. In prod, the canary node received about half the number of bytes that the other nodes did, and its load average was also about half of the others', so it's definitely working. There was a drop in CPU usage as well, which may be from decreased overhead (fewer requests to RDS), or it may be due to :mostlygeek turning off uwsgi per-request logging at the same time.

I think we can do better here still, but this is good enough for now. Possible ideas for the future:
* Fully cache the rules table instead of querying it for each unique combination of certain parts of the updateQuery (see the sketch after this list).
* Ensure that we aren't doing a bunch of simultaneous updates of the same key in the same proc. I'm not certain if one proc handles more than one request at a time or not, so this may not be the case at all. repoze.lru may also give us this locking for free if we are.
* Use a local memcache on each node. This would reduce memory usage by having a shared cache across all procs, and would also reduce the frequency of blob cache rebuilding (we currently kill procs after 5000 requests, which is every few minutes, while the blob cache will normally last for up to an hour).
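
For illustration, a rough Python sketch of the first idea above: cache the entire rules table under a single short-lived entry and evaluate every rule in-process per request. query_all_rules and rule_matches_query are hypothetical stand-ins, and repoze.lru is again assumed for the expiring cache.

from repoze.lru import ExpiringLRUCache

# One entry holding the whole table; short TTL keeps it reasonably fresh.
full_table_cache = ExpiringLRUCache(1, default_timeout=30)

def get_matching_rules(update_query):
    all_rules = full_table_cache.get("rules")
    if all_rules is None:
        all_rules = query_all_rules()  # single SELECT for the entire table
        full_table_cache.put("rules", all_rules)
    # Evaluate every cached rule against the incoming query in-process.
    return [r for r in all_rules if rule_matches_query(r, update_query)]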
> * Ensure that we aren't doing a bunch of simultaneous updates of the same key in the same proc. I'm not certain if one proc handles more than one request at a time or not, so this may not be the case at all. repoze.lru may also give us this locking for free if we are.

One request/uwsgi process. No locking required. 

On the database side, the changes have resulted in a huge reduction in CPU/outgoing traffic.

CPU     : ~30% down to ~10%
Outgoing: ~25MB/sec down to ~5MB/sec
Attachment #8535856 - Attachment is obsolete: true
I'm calling this fixed now. I filed https://bugzilla.mozilla.org/show_bug.cgi?id=1285953 as a follow-up to improve the caching.
Status: NEW → RESOLVED
Closed: 8 years ago
Resolution: --- → FIXED
See Also: → 1285953
Product: Release Engineering → Release Engineering Graveyard
