Closed Bug 902260 Opened 11 years ago Closed 10 years ago

Find RFO for bouncer failure to update during FF23 release

Categories

(Infrastructure & Operations Graveyard :: WebOps: Product Delivery, task)

task
Not set
normal

Tracking

(Not tracked)

RESOLVED FIXED

People

(Reporter: hwine, Unassigned)

Details

(Whiteboard: [reit-ops])

bug 901986 has the details of the issue, and what steps were taken.

However, it doesn't appear that we have a RFO yet. The bouncer app is a key part of the process for both beta and GA releases. We now perform 2 beta releases most weeks which require bouncer updates. Because of that, setting severity to major.

Based on past behavior, changes made in bouncer admin are active within about 5 minutes. As :jakem notes in bug 901986 comment 3, the cronjob can't be directly responsible for updating the database.

The way :jd ran the cron, any existing cron jobs that were blocked by the lock would still be hung (one theory). Can we check for that?

Based on the bug 901986 attachment 786343 [details], it appears a new version of the bouncer app was deployed by the manual run of the cronjob. Any clues there?
(In reply to Hal Wine [:hwine] from comment #0)
> bug 901986 has the details of the issue, and what steps were taken.
> 
> However, it doesn't appear that we have a RFO yet. The bouncer app is a key
> part of the process for both beta and GA releases. We now perform 2 beta
> releases most weeks which require bouncer updates. Because of that, setting
> severity to major.
> 
> Based on past behavior, changes made in bouncer admin are active within
> about 5 minutes. As :jakem notes in bug 901986 comment 3, the cronjob can't
> be directly responsible for updating the database.

I don’t know what you are talking about here. I do not see any reference to a database in that comment. Am I reading this wrong?

> The way :jd ran the cron, any existing cron jobs that were blocked by the
> lock would still be hung (one theory). Can we check for that?

The cron job is not hung now, nor was it yesterday; that is to say, there is nothing in the process list. Further, '# lsof /var/lock/bouncer-prod' gives no output (i.e. the lock file is not locked or held by any process). I also ran '# /usr/bin/flock -w 10 /var/lock/bouncer-prod sleep 1' to be sure, and this works as expected.
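
For reference, the same kind of lock check can be sketched in Python. This is an illustrative sketch only, not the commands that were run: it assumes the cron wrapper takes its lock with flock(2), as /usr/bin/flock does, and the helper name `lock_is_held` is hypothetical.

```python
import fcntl
import os


def lock_is_held(path):
    """Return True if some other open file description holds a flock on path."""
    fd = os.open(path, os.O_RDONLY)
    try:
        try:
            # Non-blocking attempt: fails immediately if the lock is taken.
            fcntl.flock(fd, fcntl.LOCK_EX | fcntl.LOCK_NB)
        except BlockingIOError:
            return True
        # We got the lock, so nobody else had it; release it again.
        fcntl.flock(fd, fcntl.LOCK_UN)
        return False
    finally:
        os.close(fd)
```

Under those assumptions, `lock_is_held("/var/lock/bouncer-prod")` returning False would correspond to the flock test above succeeding. Unlike `flock -w 10`, this sketch does not wait for a held lock to be released.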

> 
> Based on the bug 901986 attachment 786343 [details], it appears a new
> version of the bouncer app was deployed by the manual run of the cronjob.
> Any clues there?

Can you please clarify what you are referring to here? I do not think any new code was pulled in from the code repository. All I see is it pulling the new JSON blob and reworking the corresponding files to update the links on the site. Am I missing something?

For completeness I will say that the cron job does only two things. Thing the first is to call 'manage.py update_product_details' (This is the JSON blob and file regeneration bit). Thing the second is to call 'deploy' which transmits these freshly minted files to the web servers.

Perhaps bug 901986 comment 3 point 3) has some credence? TBH I am out of ideas from my end and it will take someone more intimately familiar with the code to delve beyond this point.
:jd -- obviously I don't know the app either -- my use of the terms "deploy" and "database" was loose.

Functionally - changes we make in the app are usually live within 5 minutes. Manually running a cron job that only runs once a day isn't likely to be the "real" fix; it probably just unwedged something somehow.

Agreed we need deeper app knowledge to sort this out -- :laura can your team investigate and/or move bug as appropriate, please?
Flags: needinfo?(laura)
So I chatted with jakem about this for a bit and we have a new idea. As this issue is specifically about the latest links, it concerns only the links generated by this cron job. Since this cron job is set to run only once a day, it does not look like anything is actually broken. The theory is that, since this behaviour has only been in place for a short while (used for stub installer), it was simply not noticed before, or the timing worked out a bit differently (e.g. the JSON svn updates occurred prior to midnight server time).

At any rate, we do not think that anything is or was broken. Since the behaviour is not what is desired, we are proposing to simply alter the cron job to run, say, every 30 min instead of once nightly. How does this sound, and does this answer all of the necessary questions?
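
To make the difference between the two schedules concrete, here is a minimal sketch of the wait between a product-details change and the next cron run. The function name and the 08:44 change time are hypothetical; the 00:17 nightly run time is the one noted later in this bug, and the 30-minute schedule is the proposal above.

```python
def minutes_until_next_run(change_minute, run_minutes):
    """Minutes from a change until the next scheduled run.

    change_minute: minute of the day (0-1439) at which the change landed.
    run_minutes: minutes of the day at which the cron job starts.
    """
    day = 24 * 60
    return min((run - change_minute) % day for run in run_minutes)


nightly = [17]                          # one run per day at 00:17
every_30_min = range(0, 24 * 60, 30)    # proposed: a run every 30 minutes

change = 8 * 60 + 44                    # hypothetical change at 08:44 server time
print(minutes_until_next_run(change, nightly))        # 933
print(minutes_until_next_run(change, every_30_min))   # 16
```

With a single nightly run the wait can approach a full day, while a 30-minute schedule bounds it at under 30 minutes.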
Severity: major → normal
I would like to have a postmortem discussion on this issue early next week. Does that work for everybody?
Flags: needinfo?(laura)
WFM :)
Additional reference info for post mortem:
 - RelEng's docs on dealing with bouncer are at: https://wiki.mozilla.org/Release:Release_Automation_on_Mercurial:Updates_through_Shipping#Update_Bouncer

 - while this was my first release use, I've done several betas with ~5 min latency to live update and I was not working anywhere near midnight (local or utc) :)

My zimbra is up to date.
Additional pre post mortem info:
14:04 <@nthomas> hwine: depends what you mean. If you're changing locations that should
                 take up to 5 mins, if you're talking product-details then idk the
                 timetable
14:04 < hwine> right - changing location :)
14:04 <@nthomas> sentry should come along and check the location every 5 mins
14:04 <@nthomas> all that stuff is logged, jakem can spelunk
14:05 < hwine> ah -- is sentry a separate program?
14:05 <@nthomas> it's a perl based script that runs on a cron, part of the suite of stuff
                 we call 'bouncer'

'sentry' isn't mentioned anywhere in bug 901986
The RelEng doc Hal linked to in comment #6 talks about updating the firefox-latest and firefox-stub products, and I think there are two things going on here.

For firefox-latest, bug 398366 means there is a redirect based on the version in product-details. In practice this works out as:
$ curl -sIL  "https://download.mozilla.org/?product=firefox-latest&os=win&lang=en-US" | egrep '^HTTP|^Location'
HTTP/1.1 302 Found
Location: ?product=firefox-23.0&os=win&lang=en-US
HTTP/1.1 302 Found
Location: http://download.cdn.mozilla.net/pub/mozilla.org/firefox/releases/23.0/win32/en-US/Firefox%20Setup%2023.0.exe
HTTP/1.1 200 OK
That first redirect depends on product-details, and the location configured in bouncer shouldn't actually matter. The once-a-day cron to update product-details is too slow on release day. We need to shorten this up to something similar to what www.mozilla.org has, or just pick a value like 15 mins. The releng docs need updating as we don't need to update the location for firefox-latest (in fact it might not even need to exist any more).
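
The redirect chain above can also be picked apart programmatically. The following is a minimal sketch that parses the header output already shown in this comment; the function name `redirect_chain` is hypothetical, and it makes no network request.

```python
from urllib.parse import parse_qs


def redirect_chain(header_text):
    """Return (status, location) pairs from `curl -sIL` header output."""
    hops = []
    for line in header_text.splitlines():
        line = line.strip()
        if line.startswith("HTTP/"):
            # Status line, e.g. "HTTP/1.1 302 Found"
            hops.append([int(line.split()[1]), None])
        elif line.lower().startswith("location:") and hops:
            hops[-1][1] = line.split(":", 1)[1].strip()
    return [tuple(hop) for hop in hops]


# Header lines copied from the curl output in this comment.
sample = """\
HTTP/1.1 302 Found
Location: ?product=firefox-23.0&os=win&lang=en-US
HTTP/1.1 302 Found
Location: http://download.cdn.mozilla.net/pub/mozilla.org/firefox/releases/23.0/win32/en-US/Firefox%20Setup%2023.0.exe
HTTP/1.1 200 OK
"""

hops = redirect_chain(sample)
# The first hop is the one driven by product-details.
first_product = parse_qs(hops[0][1].lstrip("?"))["product"][0]
print(first_product)  # firefox-23.0
```

Checking `first_product` after a release is one way to see whether the product-details update has reached bouncer, independent of the location configured for firefox-latest.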

For firefox-stub there is no equivalent redirect (although bug 869662 offers an idea of how we might do something similar), so the location does matter. We can look in the sentry logs for more information on this. I'm expecting lines like this would have disappeared:
[timestamp] /firefox/releases/22.0/win32/zh-TW/Firefox%20Setup%20Stub%2022.0.exe...
and been replaced with:
[timestamp] /firefox/releases/23.0/win32/zh-TW/Firefox%20Setup%20Stub%2023.0.exe...
and the status may not have been 'okay' initially. I don't know the timestamps, other than the ping on IRC at 8:44am Pacific on Aug 6th. There may be a timestamp in the db for Hal's change too.
RFO was determined during postmortem - summary:
 - bouncer functionality differs between release & beta runs
 - bouncer was functioning as designed
 - releng & relman users weren't aware of these differences

A full list of recommendations is in the postmortem notes - summary:
 - documentation will be updated to reflect expected behavior
 - program changes were already underway to make the operation more transparent

Further change requests may come from the larger Firefox 23 postmortem
FF Desktop 23.0.1 provided a chance to verify Bouncer is working as expected for full releases:
 - firefox-stub updates immediately (no sentry delay)
 - firefox-latest updates only after the 00:17 cronjob runs.
Looks to me like we're more or less done here... closing this out.
Status: NEW → RESOLVED
Closed: 10 years ago
Resolution: --- → FIXED
Product: Infrastructure & Operations → Infrastructure & Operations Graveyard