Closed Bug 799662 Opened 12 years ago Closed 11 years ago

mdn: stage & prod: add cron `manage.py cron build_sitemaps`

Categories

(Infrastructure & Operations Graveyard :: WebOps: Other, task)

x86
macOS
task
Not set
normal

Tracking

(Not tracked)

RESOLVED FIXED

People

(Reporter: groovecoder, Assigned: cturra)


Need to add a cron job to build sitemaps for MDN once per day:

manage.py cron build_sitemaps

Note: need to run as a user with permission to write files to developer.mozilla.org/kuma/media/
Blocks: 780740
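A pre-flight check along these lines could confirm that the cron user is able to write to the media directory before the job is scheduled. This is only a sketch: `can_write_to` is a hypothetical helper, not part of kuma, and it only checks the directory itself, not the files already inside it.

```python
import os


def can_write_to(directory):
    """Return True if the current user may create files in `directory`."""
    # Creating a file needs both write and search (execute) permission
    # on the directory itself.
    return os.path.isdir(directory) and os.access(directory, os.W_OK | os.X_OK)
```

Run it as the same user the cron job will use, passing the media directory named in the note above.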
:groovecoder - i have added this cron to dev and stage. it is set to run 2 minutes after midnight each evening. once it has run successfully this evening, we can push this to prod.
Assignee: server-ops-webops → cturra
Status: NEW → ASSIGNED
Noting for the record:

In order to do this without very strange side-effects, we also changed the deploy process for stage and prod slightly.

Kumascript is now restarted by Chief directly, explicitly. Previously, it was restarted during a particular step in the "update-www.sh" script (specifically sync-extras.sh) on the nodes, that just updates the content on the web nodes via rsync to the admin node. That was set up before Chief, and was a trick to make sure kumascript would get restarted if/when a deploy happened. With Chief, we can move this step right into Chief, and simplify the overall process.

Without doing this, the side-effect would have been that when this cron that simply makes a file in /media/ runs, it would have also caused a restart of kumascript. Not horrible, but definitely unexpected and non-ideal.
Thanks :jakem.

:cturra - did this run last night? I don't see the media/sitemap* files on stage.
:groovecoder - unfortunately there were unrelated puppet errors on the admin node, so these cron updates never made it onto the server :( i have resolved this now and the crons are in place as expected. so that we don't need to wait until tomorrow for this to run, i have changed the execution time to 11:02, so we should be able to check this shortly thereafter.
OSError: [Errno 13] Permission denied: '/data/developer-stage/www/developer.allizom.org/kuma/media/sitemaps'
:groovecoder - sorry for the delay here. these weren't showing up as expected because of some missing ssh keys on the developer admin node in the web cluster.

to make the urls a little clearer, i have set up an apache alias to make the sitemaps available at these locations for each environment (dev/stage for now):

  https://developer-dev.allizom.org/sitemaps/

  https://developer.allizom.org/sitemaps/
Okay this is weird. We use https://github.com/mozilla/kuma/blob/master/configs/htaccess-without-mindtouch#L17 to publish a robots.txt, sitemap index file, and the individual sitemaps.

But https://developer-dev.allizom.org/robots.txt doesn't work and I don't see a webroot/.htaccess symlink to it on dev, stage, or prod.
:groovecoder - how do you expect these files to be published and served on the web nodes? is this supposed to be done when running the `build_sitemaps` cron, or are there others required?

looking at the .htaccess file in kuma/media/ i don't see any directives in there that would match your htaccess-without-mindtouch file.
Right. https://github.com/mozilla/kuma/blob/master/configs/htaccess-without-mindtouch#L17 contains the directives to publish/enable the sitemap files. 

There's *supposed* to be a webroot/.htaccess -> htaccess-without-mindtouch symlink. But it seems like that's missing.

The missing symlink would also explain why:

https://developer-dev.allizom.org/contests/
https://developer-dev.allizom.org/es4

are broken, as well as missing CORS headers (bug 720068).

I'm not sure why that symlink is missing but we need to restore it on dev and stage and then a whole bunch of stuff should work. Then we'll want to do the same on prod.
i have added this symlink per your request, but we appear to be getting the same results (http 404). as an fyi -- in other projects, such as bedrock (www.mozilla.org), we manage all the rewrites/etc in the apache configs directly. not saying what you're trying to do is incorrect or won't work however. 


[cturra@developer1.dev.webapp.scl3 webroot]$ grep -i "documentroot" /etc/httpd/mozilla/domains/developer-dev.allizom.org.conf 
  DocumentRoot "/data/www/developer-dev.allizom.org/kuma/webroot"

[cturra@developer1.dev.webapp.scl3 webroot]$ pwd
/data/www/developer-dev.allizom.org/kuma/webroot

[cturra@developer1.dev.webapp.scl3 webroot]$ ls -la .htaccess
lrwxrwxrwx 1 root root 37 Oct 17 14:24 .htaccess -> ../configs/htaccess-without-mindtouch
Probably just need a directory block with an "AllowOverride all" directive in it in the Apache config. Like this:

<Directory /data/www/developer-dev.allizom.org/kuma/webroot>
    AllowOverride all
</Directory>

In general I prefer doing rewrites/redirects in a .htaccess file, because I prefer that webdevs be able to manage them without IT/webops involvement. This also goes for cache-control headers, and maybe a few other things. Apache .htaccess files are tailor-made for allowing the users (webdevs) to nudge the webserver in the right direction without having to touch the main config.
(In reply to Jake Maul [:jakem] from comment #11)
> Probably just need a directory block with an "AllowOverride all" directive
> in it in the Apache config. Like this:
> 
> <Directory /data/www/developer-dev.allizom.org/kuma/webroot>
>     AllowOverride all
> </Directory>

it's already present...

  <Directory /data/www/developer-dev.allizom.org/kuma/webroot>
    Options +FollowSymLinks
    AllowOverride All
  </Directory>



> In general I prefer doing rewrites/redirects in a .htaccess file, because I
> prefer that webdevs be able to manage them without IT/webops involvement.
> This also goes for cache-control headers, and maybe a few other things.
> Apache .htaccess files are tailor-made for allowing the users (webdevs) to
> nudge the webserver in the right direction without having to touch the main
> config.

i agree completely!
I know what's wrong here... we probably have the same problem in other sites and just haven't noticed it, and/or are working around it without realizing.

WSGIScriptAlias / /data/www/developer-dev.allizom.org/kuma/wsgi/kuma.wsgi

This conflicts with the DocumentRoot. We are essentially remapping the DocumentRoot to go to a mod_wsgi application. That means anything *in* the DocumentRoot is inaccessible.

/media/ and such still work, because they're different URL paths... they don't overlap the exact same path like WSGIScriptAlias and DocumentRoot currently do. Apache never reads the .htaccess file because it's instead following the alias to the mod_wsgi app.


The solution (or at least *a* solution) is to put the wsgi app somewhere else. Judging by the contents of that .htaccess file, it looks like it used to be at /mwsgi. We can move it there again easily. However, in so doing we risk breaking anything that relies on the root being the django app. There will need to be a RewriteRule that sends anything that is not a static file over to the mwsgi app.

The .htaccess file has such a RewriteRule in it already. At a glance it seems to be okay, but I haven't tested exhaustively. Once we change this over, likely some things will be broken until either the Apache config or the .htaccess file are updated to do the right thing. Let's do this tomorrow (on -dev), when more people will be around to notice and help out.
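The routing decision such a RewriteRule makes can be sketched roughly as follows. This is an illustration under stated assumptions, not the actual contents of htaccess-without-mindtouch: `route` is a hypothetical function, and "serve the file if it exists under the document root, otherwise hand the request to the app" is the common pattern assumed here.

```python
import os


def route(url_path, docroot, app_mount="/mwsgi"):
    """Decide whether a request is served as a static file or by the app.

    Returns ("static", filesystem_path) when a regular file exists under
    `docroot` for this URL, otherwise ("app", rewritten_url).
    """
    candidate = os.path.join(docroot, url_path.lstrip("/"))
    if os.path.isfile(candidate):
        return ("static", candidate)
    # Anything that is not a static file is rewritten onto the app mount.
    return ("app", app_mount + url_path)
```

With a `robots.txt` present in the document root, `/robots.txt` is served statically, while a path with no matching file is rewritten under `/mwsgi`.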

The new line would look like this:
WSGIScriptAlias /mwsgi /data/www/developer-dev.allizom.org/kuma/wsgi/kuma.wsgi
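
The effect of moving the alias can be illustrated with a toy model. It is a deliberate simplification and not how Apache is implemented: URL prefixes are checked in order, the first match wins, and anything unmatched falls through to the DocumentRoot (where .htaccess is read). The `/media` entry and the target labels are assumptions made for the example.

```python
def matches(prefix, url_path):
    """True if url_path falls under prefix on a path-segment boundary."""
    if prefix == "/":
        return True  # "/" covers every URL
    prefix = prefix.rstrip("/")
    return url_path == prefix or url_path.startswith(prefix + "/")


def resolve(url_path, mappings, fallback="DocumentRoot"):
    """Return the target of the first mapping that covers url_path."""
    for prefix, target in mappings:
        if matches(prefix, url_path):
            return target
    return fallback


# Current config: the wsgi app is mounted on "/", shadowing the DocumentRoot.
BEFORE = [("/media", "static media"), ("/", "kuma.wsgi")]
# Proposed config: the wsgi app is mounted on "/mwsgi" instead.
AFTER = [("/media", "static media"), ("/mwsgi", "kuma.wsgi")]
```

In this model `resolve("/robots.txt", BEFORE)` lands on the wsgi app, while `resolve("/robots.txt", AFTER)` falls through to the DocumentRoot, which is the behaviour the change is meant to restore.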


I think we will have a problem with the "RewriteBase /" line in the .htaccess file... we can experiment after the change is made.

I'm somewhat uncomfortable with the CORS logic/pattern here too... another thing we can play with once the change is made.
Thanks Jake. I'm around today. Let's do this whenever Raymond is available to test -dev.
Also still getting:

IOError: [Errno 13] Permission denied: '/data/developer-stage/www/developer.allizom.org/kuma/media/sitemap.xml'
The change in comment 13 is in place. Let us know of any breakage and/or incorrect functionality.

It seems the sitemap.xml file works now, but I suspect other things will be broken.
(In reply to Jake Maul [:jakem] from comment #16)
> The change in comment 13 is in place. Let us know of any breakage and/or
> incorrect functionality.
> 
> It seems the sitemap.xml file works now, but I suspect other things will be
> broken.

I'm still looking at the sitemap.xml files on dev. I'll update the bug if I find anything out of the ordinary.
I think it's doing well on -dev. Let's move on to -stage and then maybe prod today?
Blocks: 720068
:groovecoder - looks like :jakem applied these changes to stage at the same time as he did dev. can you please test and confirm both of these environments function as expected?
https://developer.allizom.org/sitemap.xml is an empty file but I didn't get any errors from the cron job on stage?

I got a permission denied error when I tried to run it myself - presumably because my account on the stage server can't write files.
:groovecoder - we were getting the following error from the cron because user `apache` was trying to do a deploy, but doesn't have the access to do this...

[localhost] err: rsync: failed to set times on "/data/developer-stage/www/developer.allizom.org/kuma": Operation not permitted (1)
[localhost] err: rsync: mkstemp "/data/developer-stage/www/developer.allizom.org/kuma/media/.humans.txt.6Yl497" failed: Permission denied (13)
[localhost] err: rsync: mkstemp "/data/developer-stage/www/developer.allizom.org/kuma/media/.sitemap.xml.iGOiZk" failed: Permission denied (13)
[localhost] err: rsync error: some files/attrs were not transferred (see previous errors) (code 23) at main.c(1039) [sender=3.0.6]


as a result i have changed the way we do the `build_sitemaps` cron on the web cluster admin node for stage to store all this in a netapp mount. each web head in the cluster has access to the same netapp volume, so will immediately receive these updated files.

this will be the same route we will want to go in prod.
Stage sitemaps look good now. Ready for prod when :retornam can help us test.
I'm still getting cron error emails with this:

From: root@developeradm.private.scl3.mozilla.com
Subject: Cron <apache@developeradm> cd /data/developer-stage/src/developer.allizom.org/kuma; python2.6 manage.py cron build_sitemaps

IOError: [Errno 13] Permission denied: '/data/developer-stage/src/developer.allizom.org/kuma/media/sitemaps/en-US/sitemap.xml'

Do we need to make a change to the cron job itself?
:groovecoder - looks like there were still some directories and files owned by root:root in media/sitemaps/. i have updated these ownerships to be correct now. 

i will schedule this sitemaps update to prod for monday (11/05).
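
Leftover root-owned entries like these could be spotted with a small ownership audit. The sketch below is hypothetical (`entries_not_owned_by` is not an existing tool on the admin node) and assumes a POSIX system; it takes a numeric uid so that it does not depend on any particular account name.

```python
import os


def entries_not_owned_by(root, expected_uid):
    """Return paths under `root` (inclusive) not owned by `expected_uid`."""
    paths = [root]
    for dirpath, dirnames, filenames in os.walk(root):
        paths.extend(os.path.join(dirpath, name) for name in dirnames + filenames)
    # lstat() reports the owner of a symlink itself rather than its target.
    return sorted(p for p in paths if os.lstat(p).st_uid != expected_uid)
```

Running it against media/sitemaps/ with the uid of the cron user should return an empty list once the ownerships are correct.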
sorry to overload this bug, but got some permission errors on another stage cron job:

Cron <apache@developeradm> cd /data/developer-stage/src/developer.allizom.org/kuma; python2.6 manage.py cron humans_txt

IOError: [Errno 13] Permission denied: '/data/developer-stage/src/developer.allizom.org/kuma/media/humans.txt'
:groovecoder - while i was changing the way we do the `build_sitemaps` i also applied this same change to the way `humans.txt` is updated. as it turns out, i hadn't set the permissions correctly on that file -- this has been corrected now.

  $ whoami
  apache

  $ cd /data/developer-stage/src/developer.allizom.org/kuma; python2.6 manage.py cron humans_txt
  $ echo $?
  0
:groovecoder - this has all now been pushed to prod. 

  https://developer.mozilla.org/sitemap.xml
Status: ASSIGNED → RESOLVED
Closed: 12 years ago
Resolution: --- → FIXED
Looks like the cron job might have an error? the sitemap.xml files aren't updating?
Blocks: 823110
Status: RESOLVED → REOPENED
Resolution: FIXED → ---
i had configured the prod cron to update the stage sitemaps :( oops! this has been fixed now.

  https://developer.mozilla.org/sitemap.xml


$ svn diff 
Index: modules/webapp/files/developer/admin/etc-cron.d/developer
===================================================================
--- modules/webapp/files/developer/admin/etc-cron.d/developer	(revision 55880)
+++ modules/webapp/files/developer/admin/etc-cron.d/developer	(working copy)
@@ -7,7 +7,7 @@
 11 6,18 * * * apache cd /data/developer/src/developer.mozilla.org/kuma; python2.6 manage.py update_product_details
 
 # bug 799662
-0 5 * * *     apache cd /data/developer-stage/src/developer.allizom.org/kuma; python2.6 manage.py cron build_sitemaps
+0 5 * * *     apache cd /data/developer/src/developer.mozilla.org/kuma; python2.6 manage.py cron build_sitemaps
 0 0 * * *     apache cd /data/developer/src/developer.mozilla.org/kuma; python2.6 manage.py cron humans_txt
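
A mix-up like this could be caught by checking that every job in an environment's cron file changes into that environment's source tree. The following is a minimal sketch, assuming each job starts with `cd <path>;` as in the diff above; `misdirected_jobs` is a hypothetical helper, not part of the puppet module.

```python
import re

# Captures the directory a job changes into, e.g. "cd /data/.../kuma;"
CD_PATTERN = re.compile(r"\bcd\s+(\S+?);")


def misdirected_jobs(cron_text, expected_root):
    """Return cron lines whose `cd` target lies outside `expected_root`."""
    root = expected_root.rstrip("/") + "/"
    bad = []
    for line in cron_text.splitlines():
        stripped = line.strip()
        if not stripped or stripped.startswith("#"):
            continue  # skip blank lines and comments
        match = CD_PATTERN.search(stripped)
        if match and not match.group(1).startswith(root):
            bad.append(stripped)
    return bad
```

Called with the prod cron file and `/data/developer` as the expected root, the old `build_sitemaps` line would be reported because it changes into `/data/developer-stage/...`.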
Status: REOPENED → RESOLVED
Closed: 12 years ago → 11 years ago
Resolution: --- → FIXED
Component: Server Operations: Web Operations → WebOps: Other
Product: mozilla.org → Infrastructure & Operations
Product: Infrastructure & Operations → Infrastructure & Operations Graveyard