Closed Bug 1060473 Opened 10 years ago Closed 10 years ago

Fix concurrent builds counter

Categories

(Marketplace Graveyard :: Integration, defect, P1)

2014-Q3
x86
macOS
defect

Tracking

(Not tracked)

RESOLVED FIXED

People

(Reporter: andy+bugzilla, Assigned: dcoates)

Attachments

(2 files)

We believe the concurrent build counter in the APK Factory may not be functioning as intended. This is a copy and paste from bug 1058166 comment 8:

"I believe this is related to a bug in feature bug 1031027 comment 11. Concurrent builds counter in apk factory generator is not decrementing correctly which causes generator not to accept new connections. Since there is no time stamp in the log it's hard to say when this is actually happening. There are also instances in logs where Concurrent Builds: -1 of 10 which doesn't seem right.

For now I have restarted the generator workers and generator is now accepting connections. Bug 1045282 for monitoring the state of concurrent builds."

We think this is causing the service to start returning 503 errors to the client.
Priority: -- → P1
Assignee: nobody → kumar.mcmillan
We're going to cut a release Monday with this timestamp in the log and see if that sheds any light on it. https://github.com/mozilla/apk-factory-service/commit/db0d97f7d2326fcb7b15d3419adba5954f856e1a
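For illustration only, a minimal sketch of what a timestamped counter log line could look like. This is not the code from the commit linked above; `formatBuildCount` and its parameters are hypothetical names.

```javascript
// Hypothetical helper, not taken from apk-factory-service.
// Prefixes the "Concurrent Builds" line with an ISO 8601 timestamp so
// that a stuck or negative counter can be matched to a point in time.
function formatBuildCount(active, max, now) {
  // `now` is optional; it exists so the output can be made deterministic.
  const when = (now || new Date()).toISOString();
  return when + ' Concurrent Builds: ' + active + ' of ' + max;
}

console.log(formatBuildCount(3, 10));
```

With a timestamp on every line, a reading such as "-1 of 10" can be lined up against the request logs from the same moment.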
Attached patch: apk.patch
This is a monkey script I used to try and break things. So far, no luck. I ran the server with:

JAVA_HOME=/usr/ ANDROID_HOME=~/src/adt-bundle-mac/sdk/ npm start

then ran the script with:

node monkey.js
I'm happy to take a look at this
Thanks Danny! :jason can also help from the ops side to see if the timestamp from comment #1 has shed any light on a possible pattern.
Assignee: kumar.mcmillan → nobody
Unfortunately, the most likely cause is a broken callback chain, which might be difficult to track down. I've reviewed the stats in graphite, logs in kibana, and code but haven't found an obvious cause. I've got a couple semi-promising leads to track down.
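To make the suspected failure mode concrete, here is a rough sketch of a counter that is released exactly once per build. It is not the generator's actual code: `tryBuild`, `once`, `activeBuilds` and `MAX_BUILDS` are hypothetical names, and the limit of 10 is taken only from the "x of 10" log lines.

```javascript
// Hypothetical sketch, not the apk-factory-service implementation.
const MAX_BUILDS = 10; // assumed from the "Concurrent Builds: x of 10" logs

let activeBuilds = 0;

// Wrap a callback so that only its first invocation has any effect.
function once(fn) {
  let called = false;
  return function (...args) {
    if (called) return;
    called = true;
    fn(...args);
  };
}

// Starts buildFn(done) if a slot is free. Returns false when the limit
// is reached, which is where a service would answer with a 503.
function tryBuild(buildFn, cb) {
  if (activeBuilds >= MAX_BUILDS) return false;
  activeBuilds += 1;
  // The decrement lives in one place and runs at most once, so a
  // callback fired twice cannot push the counter below zero.
  const release = once(function (err, result) {
    activeBuilds -= 1;
    cb(err, result);
  });
  try {
    buildFn(release);
  } catch (err) {
    // A synchronous throw still frees the slot.
    release(err);
  }
  return true;
}
```

This guards against double decrements and synchronous throws. A chain that never calls `done` would still leak a slot, so a timeout around each build would be needed to cover that case.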

:jason it might be helpful to see the logs around some of the lines like "Concurrent Builds: x of 10" but I don't see them in kibana. Are those logged persistently anywhere, and if so can I have access to them?

Thanks :)
Flags: needinfo?(jthomas)
Assignee: nobody → dcoates
Generator Console logs.
Flags: needinfo?(jthomas)
'Concurrent Builds' is logged to console. We haven't configured heka to push console logs to kibana. Since 'Concurrent Builds' is the only log writing to console I think we should fix this to use the same location as the other generator logs.

I've attached one of the generators' console logs in comment 6. I also found the concurrent build counts in graphite:

https://graphite.shared.us-west-2.prod.mozaws.net/render/?3=&width=588&height=310&_salt=1406074303.799&from=-20days&target=stats.gauges.apk-generator-release.apk-build-active.count&target=stats.gauges.apk-generator-review.apk-build-active.count&title=Concurrent%20Builds
Thanks :jason for those logs!

https://github.com/mozilla/apk-factory-service/pull/82
Is there any way to tell how many users this is impacting per day or per week?
Flags: needinfo?(dbialer+1)
Flags: needinfo?(amckay)
The link to the kibana logs in mana seems to be broken. Jason, where are the up-to-date kibana logs?
Flags: needinfo?(amckay) → needinfo?(jthomas)
The links in mana work for me. Were you able to log in using persona?
https://kibana.shared.us-west-2.prod.mozaws.net/#/dashboard/elasticsearch/PROD%20-%20APK%20HTTP%20Status
Flags: needinfo?(jthomas)
Sorry Jason, I'll blame it on some dodgy routing at my end. Caitlin, the number of 500s over the last 90 days looks pretty small, below 1%. You can see the logs at the URL provided in the last comment. I'm not sure how to translate that into numbers of users at this point though.
Flags: needinfo?(dbialer+1)
Just curious, has my PR been deployed? I ask because the graph of concurrent builds still shows an upward slope. If it was deployed then the bug isn't fixed yet.
Sorry, been absorbed in other things. We plan to push this to prod on Tuesday 18th.
Changes were pushed, graph looks like it dropped down again, let's see how it goes over the next couple of days.
Looking at the graph, I'm calling this resolved fixed. Thanks dcoates! You may claim your prize in Portland.
Status: NEW → RESOLVED
Closed: 10 years ago
Resolution: --- → FIXED
(In reply to Andy McKay [:andym] from comment #16)
> Looking at the graph, I'm calling this resolved fixed. Thanks dcoates! You
> may claim your prize in Portland.

AWESOME!!!