As documented in Bug#1010382 Comment#4 We should rate-limit pre-generating APKs for apps as they are approved. We should make no more than 1 request every 10 seconds.
can the APK service just maintain a queue so that it rate limits itself? That way any consumer (not just Marketplace) will not cause trouble
Yes, that is part of the plan. But I've detailed why we need both, Marketplace to rate limit and APK factory to have a maximum work load.
where is it documented? All I see in your link above is a comment about rate limiting Marketplace's requests but that sounds like a poor solution to me (since other API consumers will likely not rate limit themselves)
Sorry, in Bug#1010382#c2 I walk through why adding a SQS to the APK Factory wouldn't be the best solution and why we should a) use a queue on the Marketplace side to space out pre-generation and b) add a max worker thing to the APK Factory. Happy to jump on Vidyo or IRC to discuss.
ok, I read through that and I understand the complications with adding a queue to the APK factory now. Instead of having Marketplace maintain the logic for rate-limiting, what if the APK factory returned a 503 when it reached capacity? That way the Marketplace and any other client would know to implement a backoff/retry pattern when it gets 503s. This would be easy to implement since we already have background tasks for processing apps. This would address my concern that putting rate-limiting directly in Marketplace is the wrong place to introduce the logic. The Marketplace would have to guess at the rate and hope that it's sufficient enough. A blind rate-limit would also slow down the Marketplace in cases where it made the wrong guess.
Working on this today, sorry for the delay. > Instead of having Marketplace maintain the logic for rate-limiting, what if the APK factory returned a 503 when it reached capacity? A 503 is a good idea, but I'm wondering if 60 requests come in within a one second period, how does one decide which get 503s? One solution discussed in capping the maximum number of concurrent builds. If we get a stampede of requests, the result is that they will have to wait with an open connection for a long time. So an alternative is adding a request timeout and retry pattern on the marketplace side. Another alternative is another maximum limit on the backlog of requests, otherwise you get a 503. > That way the Marketplace and any other client would know to implement a backoff/retry pattern when it gets 503s. myk - how will Fennec handle a 503? Do we need to add re-try logic to Fennec? > The Marketplace would have to guess at the rate and hope that it's sufficient enough. Marketplace does not have to guess. Many web services are rate-limited per client. What makes for a bit of a grey area is that this API is overloaded. A) It gives you an APK for an open webapp you are installing B) It pre-generates the APK for a manifest that will definitely go all the way to a build step In scaling this service, we planned for A type workloads but not B type workloads. Maybe we should break out a Marketplace specific API and this one is documented to be rate limited to 1 request every 10 seconds. Is that helpful? We have not documented exactly how the service fails, during this nightly reviewer workflow. I'll also look at that while doing this work.
What I've found so far: * Neither the controller nor the generator are falling over * Marketplace is seeing 504s, because these requests pile up and take longer than 60 seconds to complete * Nginx closes the connection and responds with a 504, but the APK generation probably does complete, which is the point of this setup * 504 errors make monitoring, troubleshooting, etc difficult
Taking a "yes and" approach to all of the back and forth we've had, what if we... In addition to work being discussed, we also expose a new API for pre-generation: * controller will queue pre-generation using SQS * response will not timeout as this fire-and-forget will be very quick * generators will listen on this SQS queue and as capacity allows, build the apks Benefits * Avoids changing the semantics of /application.apk (which is much of my earlier feedback) * Improves monitoring / troubleshooting of reviewer apk pre-generation Downsides * Requires marketplace change to enable
Hmm, I don't really see how adding a new URL for pre-generation really helps much. What's nice about having one URL for both generation and fetching is that the APK won't be generated if it already exists. That's a nice property. Anyway, this bug got hijacked for tangential APK issues, my bad. To address your original concern: if you just want Marketplace to rate-limit its requests we can do that easily. We can define a celery task limited to one run per every 10 seconds. Even with that limit in place, the APK generator can still get DOS'ed (by other clients). However, if we start with the rate limit then maybe it will buy you some time to prevent DOS'ing.
I've added config.maximumNumberOfConcurrentBuilds which by default is set to 10. The 11th concurrent build request will trigger a 503. https://github.com/mozilla/apk-factory-service/commit/a42d8b45bf14be4b2bacbbc06553249a4703a908 > Hmm, I don't really see how adding a new URL for pre-generation really helps much. I was thinking: /application.apk - we don't want to add any latency, we care about the .apk that comes back /pre-generation - we can add queues, etc as latency isn't a concern. We don't care about the actual .apk. > To address your original concern: if you just want Marketplace to rate-limit its requests we can do that > easily. We can define a celery task limited to one run per every 10 seconds. That would be great!
Target Milestone: --- → 2014-07-22
503s shipped yesterday. generators are configured to limit concurrent generation to 10. We can monitor and tweak configs. Graph of Concurrent Builds over the last 2 hours: https://graphite.shared.us-west-2.prod.mozaws.net/render/?3=&width=588&height=310&_salt=1406074303.799&from=-2hours&target=stats.gauges.apk-generator-release.apk-build-active.count&target=stats.gauges.apk-generator-review.apk-build-active.count&title=Concurrent%20Builds
not sure who can take this on the Marketplace side yet so I'm removing the milestone for now
Target Milestone: 2014-07-22 → ---
pretty sure I can take it though
Assignee: nobody → kumar.mcmillan
Ever since bug 1060473 we haven't seen the APK factory get overloaded so this is less of a priority.
Assignee: kumar.mcmillan → nobody
Status: NEW → RESOLVED
Last Resolved: 2 years ago
Resolution: --- → WONTFIX
Whiteboard: qa- → qa-[marketplace-transition]
You need to log in before you can comment on or make changes to this bug.