Closed Bug 1083185 Opened 10 years ago Closed 10 years ago

Deployment issues with fireplace

Categories

(Marketplace Graveyard :: General, defect, P1)

Tracking

(Not tracked)

RESOLVED FIXED
2014-12-22

People

(Reporter: mat, Assigned: kngo)

References

Details

We have a problem with the way fireplace (possibly other repos) is deployed. I've noticed this on two different pushes now, but it may have happened more.

Currently, when you go to https://marketplace.firefox.com/, the include.js you get is : https://marketplace.cdn.mozilla.net/media/fireplace/js/include.js?b=1413309785513

When you replace the build id with another random string and compare the two:
% md5sum include.js*
c00d143ca63a5e783cd30625817de9f2  include.js?b=1413309785513
c96e5b199f6a185d7eebd53e72a790fb  include.js?b=11122233344455667787979


The include.js?b=1413309785513 version does *not* match the current tag. So every fireplace bug we fixed for this week is still present in production today (bug 1066774 or bug 1038936 are good examples - search for specific strings introduced in the commits for those bugs in include.js, you won't find them). We're using last week JavaScript (and probably CSS as well).
This sounds like it could be deployment order related? If the cache-busting build-id is published and live before the new static resources are pushed and a request comes in, edge CDN servers could be primed with the old static resource with the new build-id.
Perhaps this is a good time to start putting a hash in the filename instead of using query strings. If we've got deployment issues it will 404 instead of serving the wrong file. Not great but we'll get the right code eventually.
Jason - thoughts on improving this?
Flags: needinfo?(jthomas)
(In reply to Mark Striemer [:mstriemer] from comment #2)
> Perhaps this is a good time to start putting a hash in the filename instead
> of using query strings. If we've got deployment issues it will 404 instead
> of serving the wrong file. Not great but we'll get the right code eventually.

This seems like a disastrous strategy. Currently, we don't really look at 404 logs - so if anything, this will just frustrate users when no CSS loads; yes, we'll probably notice it sooner and repush, but I'm not sure we should be punishing users because we don't know the cause of this bug.

Bug 985291 is to use a real hash instead of a timestamp. A bit unrelated to this, but it's somethin we should do.

We should investigate what Mat's describing a bit more.

I went to verify that Commonplace's `build_id.txt` is being generated correctly - and that Zamboni is consuming it correctly.

So Zamboni's code is here, which seems the same and correct:

    https://github.com/mozilla/zamboni/blob/1b10d49/mkt/commonplace/views.py#L74-L77
985804

And then in Commonplace, it looks like all that code has been removed!

    https://github.com/mozilla/commonplace/commit/58802b87#diff-aa0e5ceac674c2a54b83b7626f4d97aaL271

And it's not clear to me why.

But Fireplace is on Commonplace v0.4.22:

    https://github.com/mozilla/fireplace/blob/358fe459/fabfile.py#L38
    https://github.com/mozilla/commonplace/commit/72079b5

So, all I know is Commonplace master regresses this functionality. But if the correct version of Commonplace is actually being used in our deploy, then I don't have a explanation for this.
I saw that happen with 2014-09-16 push (we had to do it twice because of this) so it's not because of recent additions to commonplace.

I am not familiar with that commonplace code but I think it's something along the lines of what Stuart is suggesting above. Maybe we generate the build id file too soon, or there is a fallback somewhere that makes it use the current date before the build is done, something like that.

In any case, we should push 2014-10-14 again since it's broken.
One more thing I forgot to say:

I don't think we should instead 404 these files (per comment 2). I do think it would probably be wise if we starting monitoring this stuff (checking that the CSS/JS paths in `index.html` refer to the latest versions) in nagios or somewhere. And an additional check that checks that the build ID actually changes during after each Fireplace push. (Again related: bug 985291.)

Two more bugs of interest:

Bug 985804 is to have Fireplace serve its `index.html` instead of the current `mkt/commonplace` dance inside Zamboni.
Bug 1082289 will help us ensure that the correct Commonplace is installed locally (and not globally).
(In reply to Stuart Colville [:scolville] [:muffinresearch] from comment #1)
> This sounds like it could be deployment order related? If the cache-busting
> build-id is published and live before the new static resources are pushed
> and a request comes in, edge CDN servers could be primed with the old static
> resource with the new build-id.

This is related to how quickly we distribute code to hosts during deployment time. Actual code deployment for fireplace takes about 10-20 seconds but we are in a inconsistent state until the deployment is complete. Even if one host has a old code then it is possible for the CDN to cache the old assets with the new build-id.

I don't think serving index.html from 'mkt/commonplace' will help this since there will still be a chance that will be in a inconsistent state before a request comes in.

I think the solution here would be to change the build_id only after the code deployment is completed. We could try to cache get_build_id and flush cache after deployment is completed to pick up the new build-id.
Flags: needinfo?(jthomas)
Assignee: nobody → kngo
After talking to Jason and Mat, the proposed solution is to:

- After deployment to all the webheads, run a management command that reads the build_id.txt from each project, and stores the build IDs in the database (with each row being a different project, keyed by repo).
- When we're serving Commonplace's index.html, read the build ID from the database rather than reading directly from the build_id.txt file. Probably build in a fallback to the build_id.txt file if it's not in the DB.

Other solutions talked about:

- Store the build ID in memcached - not as reliable in cases where the build ID might become invalidated if the bucket becomes full. Though would probably do the job anyways. Spartacus and Webpay do this.
- memoize the get_build_id function for some amount of time (60 seconds) - not reliable since it might invalidate while we are deploying.

I like Mat's explanation of the issue:

"webhead x has everything up to date, it serves a index.html with include.js?build_id=<build_id_from_x> but then that include.js?build_id=<...> is served by webhead y, that doesn't have the most recent includes yet"
..."and then that version gets cached by the CDN and we lose"
https://github.com/mozilla/zamboni/pull/2718

python manage.py deploy_build_id <REPO>

Jason, think you whip something up with that? Calling it only once a project has completely deployed.
Flags: needinfo?(jthomas)
Flags: needinfo?(jthomas)
https://github.com/mozilla/fireplace/pull/846
Status: NEW → RESOLVED
Closed: 10 years ago
Resolution: --- → FIXED
Target Milestone: --- → 2014-12-23
You need to log in before you can comment on or make changes to this bug.