Closed Bug 753566 (opened 12 years ago, closed 11 years ago)

Set up databases for bedrock

Categories

(Data & BI Services Team :: DB: MySQL, task)

Hardware: x86
OS: macOS
Type: task
Priority: Not set
Severity: normal

Tracking

(Not tracked)

RESOLVED FIXED

People

(Reporter: jlong, Assigned: nmaul)

References

Details

(Whiteboard: [triaged 20120904])

We need a cron job which updates all the feeds on mozilla.org (caches them in memcache). It can run every hour, and should execute this:

cd bedrock && ./manage.py cron update_feeds

The dev site is the only one that has this cron job right now, and is the only one with memcache, so just set it up there first and we'll test it. If all is good, we will set up memcache in prod, roll out the updates, and then set up the cron job.
Checked into puppet yesterday. I see it in crontab, and it hasn't emailed me any errors...
Assignee: server-ops → nmaul
Status: NEW → RESOLVED
Closed: 12 years ago
Resolution: --- → FIXED
Hm, there *should* be posts listed under "in the news"...

http://www-dev.allizom.org/en-US/

Can you check to see when cron was last run? Or check that memcache is set up correctly and the dev server is pointing to it?
Status: RESOLVED → REOPENED
Resolution: FIXED → ---
Checking in on this: I still don't see the feeds on the home page. Any chance you could poke around it some more?
The cron is:

0       * * * * root cd /data/bedrock-dev/src/www-dev.allizom.org-django/bedrock; python manage.py cron update_feeds 2>&1 1> /dev/null | grep -v 'old-style Playdoh layout'

You, me, and Fred are in the MAILTO for it, so it'd be emailing us if there were any output. I just ran it by hand, and the only output is that old-style warning that gets filtered out.

What file(s) is this supposed to write to, and/or what DB/table does it put data in? I can check to see if there's any data actually present.
It should be dumping a pickled Python object to memcache under the key "feeds-mozilla".

Make sure this is in settings/base.py:

FEEDS = {
    'mozilla': 'http://blog.mozilla.org/feed/'
}

Also make sure the cache backend in settings/local.py is set to memcache and all those settings are correct.

Thanks for looking into this!
I see that FEEDS block in settings/base.py.

In settings/local.py, I have this CACHES block:
CACHES = {
    'default': {
        'BACKEND': 'django.core.cache.backends.memcached.MemcachedCache',
        'LOCATION': [
            'bedrock-memcache1.webapp.phx1.mozilla.com:11211',
        ],
        'KEY_PREFIX': 'bedrock_dev'
    }
}

This is similar (but not identical) to what SUMO has in their settings_local.py.

I looked into memcache itself a bit:

> stats items
STAT items:2:number 4
STAT items:2:age 1665586
STAT items:2:evicted 0
STAT items:2:evicted_nonzero 0
STAT items:2:evicted_time 0
STAT items:2:outofmemory 0
STAT items:2:tailrepairs 0
STAT items:2:reclaimed 3255

> stats cachedump 2 100
ITEM bedrock_dev:1:dotlang-en-US-research/emscripten [6 b; 1337035506 s]
ITEM bedrock_dev:1:dotlang-en-US-newsletter [6 b; 1337035506 s]
ITEM bedrock_dev:1:dotlang-en-US-base [6 b; 1337035506 s]
ITEM bedrock_dev:1:dotlang-en-US-main [6 b; 1337035506 s]

So there are only 4 keys in memcache, and none of them are "feeds-mozilla".


Is there a suitable way to test memcache connectivity with manage.py somehow? Perhaps I can try that locally and see what happens.
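One minimal way to check connectivity from the bedrock checkout, sketched here on the assumption that ./manage.py shell loads the same settings/local.py the cron uses (the key name is just a throwaway):

from django.core.cache import cache

# Round-trip a throwaway key through the configured cache backend.
cache.set('connectivity-test', 'ok', 60)  # hypothetical key, 60-second TTL
print(cache.get('connectivity-test'))     # 'ok' on success, None on failure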
You can run the cronjob yourself:

./manage.py cron update_feeds

I'm not sure how to specifically test memcache. If it helps, I can modify the cron job to print output, or you can do that in /apps/mozorg/cron.py. The job is very simple: it just uses feedparser to load a feed and then sets it in the cache (which should set it in memcache).
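For reference, the job described above amounts to roughly this (a sketch only; the real code lives in /apps/mozorg/cron.py, and the exact names here are assumptions):

import feedparser
from django.conf import settings
from django.core.cache import cache

def update_feeds():
    # For each configured feed, fetch/parse it and pickle the result into
    # the cache under "feeds-<name>" (e.g. "feeds-mozilla").
    for name, url in settings.FEEDS.items():
        parsed = feedparser.parse(url)
        cache.set('feeds-%s' % name, parsed)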
It doesn't throw any errors, but I've discovered the problem. We never opened an ACL between bedrockadm and the memcache node for port 11211 (memcache)... just SSH for controlling it. Bug 756179 has been opened to rectify this. Once it's fixed, this should magically start working on the next cron interval (hourly, on the hour).
Unfortunately I don't see the feed appearing yet: http://www-dev.allizom.org/en-US/

Can you list the memcache keys again to see if it's at least in there?
I see feeds there now.

stats cachedump 29 100
ITEM bedrock_dev:1:feeds-mozilla [50924 b; 1337717112 s]
END

Perhaps the TTL on the memcache entry is shorter than the cron interval, and the cache entry gets expired? The cron runs hourly, on the hour. I don't know what the key's TTL is.

This would be another reason I think this might ultimately be better as a more normal memcache usage pattern (apart from the whole "how to keep scl3 and phx1 in sync" issue)... in a normal memcache pattern, the code would realize the key doesn't exist and would go fetch/generate it, and then store the results.
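For comparison, that "normal" pattern would look roughly like this (get_feed is a hypothetical helper, not existing bedrock code):

import feedparser
from django.core.cache import cache

def get_feed(name, url, ttl=3600):
    # Cache-aside: serve from memcache when the key exists; on a miss,
    # fetch/parse the feed and repopulate the cache for `ttl` seconds.
    feed = cache.get('feeds-%s' % name)
    if feed is None:
        feed = feedparser.parse(url)
        cache.set('feeds-%s' % name, feed, ttl)
    return feed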
> Perhaps the TTL on the memcache entry is shorter than the cron interval, and
> the cache entry gets expired? The cron runs hourly, on the hour. I don't
> know what the key's TTL is.

Ah, that's got to be it. It looks like the key TTL is only 5 minutes. I'll change that and hope to see feeds appearing. Thanks.
Unfortunately I still don't see feeds under "In The News" on http://www-dev.allizom.org/en-US/

If it's in memcache though, it might be something on the Python side. I'll try to look into it soon.
Jake, I just ssh'ed into the dev box and did this:

[jlong@node273.seamicro.phx1 ~]$ telnet bedrock-memcache1.webapp.phx1.mozilla.com 11211
Trying 10.8.81.90...
Connected to bedrock-memcache1.webapp.phx1.mozilla.com.
Escape character is '^]'.
stats items
END

It appears that there are no items? I may have done this wrong.
Nope, you did that right. You can do just 'stats' to get a more general overview. One of the things there is "curr_items", which *should* be fairly accurate... although I don't think it's an actual honest count of items, so it might not be perfect.

I ran the cron by hand, and now I get this:

[root@bedrockadm.private.phx1 bedrock]# telnet bedrock-memcache1.webapp.phx1.mozilla.com 11211
Trying 10.8.81.90...
Connected to bedrock-memcache1.webapp.phx1.mozilla.com.
Escape character is '^]'.
stats
<snip>
STAT curr_items 1
<snip>
END
stats items
STAT items:29:number 1
STAT items:29:age 2955192
STAT items:29:evicted 0
STAT items:29:evicted_nonzero 0
STAT items:29:evicted_time 0
STAT items:29:outofmemory 0
STAT items:29:tailrepairs 0
STAT items:29:reclaimed 66
END
stats cachedump 29 100
ITEM bedrock_dev:1:feeds-mozilla [50892 b; 1338325352 s]
END
I just checked it myself and saw some data in there, but after a little while it disappeared. I think we're still dealing with the same problem: the TTL on the key is too short, so it gets dropped before the next cron run.

I just made it set the key for 1 year, since the cron job has its own interval. This also means that if the cron job fails, the content will still appear on the site.
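In Django cache terms, the change amounts to passing an explicit timeout instead of relying on the backend default (a sketch; the actual edit is in the cron code):

from django.core.cache import cache

ONE_YEAR = 60 * 60 * 24 * 365  # seconds

def cache_feed(name, parsed_feed):
    # Keep the key for a year; the hourly cron overwrites it long before
    # then, and the content survives even if a cron run fails.
    cache.set('feeds-%s' % name, parsed_feed, ONE_YEAR)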
It looks like this is all working now. The dev site is displaying the feed.

We'll need to set this up in production and get memcache ready for production now.
Jake, how much more work is it to make sure memcache is set up for production usage?
Jakem: ping. If we can get this set up sooner rather than later, it would save IT from having to do a push every time a news item link on the homepage needs updating. Same thing for bug 738381. Thanks!
Depends on: 738381
No longer depends on: 738381
Depends on: 738381
I have rolled out this cron for www.allizom.org:

0 * * * * root echo "cd /data/www/www.allizom.org-django/bedrock; python manage.py cron update_feeds" | /usr/bin/issue-multi-command bedrock-stage

I'm not entirely confident it will work, but we shall see. I suspect it may be a bit noisy.

Please let me know if it seems to work properly.
So I think we need to change our approach here. I don't like how we use memcache as a database.

What about a cron that copies the feeds onto the hard drives of the webheads? Our code would then parse the local feed and put it into memcache.
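Sketched, that alternative could look like this (the path and helper name are hypothetical):

import feedparser
from django.core.cache import cache

def load_local_feed(name, path):
    # feedparser accepts local file paths as well as URLs, so the web code
    # can parse the cron-fetched copy and cache the parsed result.
    parsed = feedparser.parse(path)
    cache.set('feeds-%s' % name, parsed)
    return parsed

load_local_feed('mozilla', '/data/feeds/mozilla.xml')  # hypothetical path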
Jake: Question in comment 20 was for you. (in case I was not clear)
With Chief in place, I'm wondering if we still need this at all, or if you're comfortable just doing a deploy whenever this needs updating.

I'm also wondering if we can just start using an actual database. What do you think about that? It's a little heavyweight for just this, but having a database is (or at least was) a planned dependency for Bedrock anyway, and we did lay a bit of infrastructure for it some time ago.
Chief does reduce the number of people that need to be involved by one (no IT), but it still involves webdev. In the end, we want only the blogger to be involved.

A database sure sounds nice. I was not involved in the original discussions for Bedrock infra. Was there any issue with a database?
The only significant concern is making the Bedrock database work in both locations, but I don't think that's likely to be a problem. CC'ing :sheeri to see if she'd like to comment. I believe the PHX1 side is already set up, and we'd need to do the SCL3 side.

Apart from that I believe it was just a general desire not to add too many new dependencies too quickly. Adding a DB at some point has always been part of the plan. :jlongster and :wenzel might know more on the dev side of this.
Actually, we have a database cluster for bedrock in scl3: bedrock1.db.scl3.mozilla.org and bedrock2.db.scl3.mozilla.org. Right now there are no DBs and no users (other than nagios, root, and replication users), but it's ready to be in production (it's already being backed up and monitored) whenever you're ready for content.

I'm not sure where in phx1 we have bedrock set up, but I don't believe we have a separate cluster like we do in scl3, with 24 GB RAM, 200+ GB disk space, and a 12-core 2.27 GHz CPU. Jake - let me know what you were thinking regarding the phx1 side of things.
@sheeri: this should be pretty simple, I think, as the needs are pretty small and not likely to become massive in the near future. 2 servers on each end, replicating with each other, should do fine. I don't really know how hard that would be. The specs you've listed should be more than adequate in all regards... feel free to halve them if it helps speed things up any. :)


The dev env exists only in PHX1... we can use the existing dev db cluster in PHX1 for that. In fact Rik, if you want I can probably fork off a separate bug to get that set up soon-ish... I know you also want 3 extra environments for feature presentations, we can probably do that in dev as well, also on the dev db cluster...

Stage exists in PHX1 and SCL3. However the usage is very light/sporadic, and I'm okay with simply piggybacking that on the prod cluster, with different DB names and grants. Someday we may want a separate stage db cluster that exists in both locations, but I don't foresee that being a concern in the near future.
Whiteboard: [waiting][dba][webdev]
When you say "2 servers on each end" do you mean 2 in phx and 2 in scl3?

If so, we have generic clusters in both phx and scl3, do you want to just use those? Or did we need the separate bedrock cluster in scl3? (if we don't need it, is it something we can/should reclaim? or keep around in case?)
Also, we don't piggyback stage onto prod. We have a stage setup in scl3, and in phx folks use the dev machine for both dev and stage.
(In reply to Sheeri Cabral [:sheeri] from comment #27)
> When you say "2 servers on each end" do you mean 2 in phx and 2 in scl3?

Yes. The interesting part is that all 4 machines should always have the same data (excepting replication delay, of course). To me this implies a master/slave pair in each DC, with the masters replicating with each other... but I don't presume to dictate implementation... you certainly know better than I. :)

> If so, we have generic clusters in both phx and scl3, do you want to just
> use those? Or did we need the separate bedrock cluster in scl3? (if we don't
> need it, is it something we can/should reclaim? or keep around in case?)

It should be separate hardware. Bedrock (www.mozilla.org) is one of the few properties that is supposed to have dedicated resources, to avoid it being impacted by other properties.


(In reply to Sheeri Cabral [:sheeri] from comment #28)
> Also, we don't piggyback stage onto prod. We have a stage setup in scl3, and
> in phx folks use the dev machine for both dev and stage.

We can definitely use the dev cluster in PHX1 for dev stuff.

However, I don't know that it's feasible to use the stage cluster in SCL3... stage is active/active in PHX1 and SCL3, so we need the data replicated to both. We could possibly host the stage DB in only one place or the other, and just live with cross-datacenter DB accesses from the remote side.
Depends on: 779947
Found an existing bug (linked to this one now) to order hardware for bedrock in phx1. 

I'll think about dev tomorrow, it's time for dinner :D
Jake - let's talk today in IRC about stage options in phx1.
Sorry for the delay, lots of other fires since last week. Thanks for working on this.
Renaming because the purpose of this bug changed.

Jake: Sure I definitely would like to see that set up on dev :)

Notes for when we start using this in the codebase (so not a blocker for setting up the databases):
- Update our chief script to include south migrations.
- Stop auto-updating stage.
Summary: Set up a cron job to update feeds on bedrock → Set up databases for bedrock
Sheeri and I discussed comment 29 in IRC. The consensus is as follows:

Dev DB will be in PHX1. There is already a suitable cluster, so it's just a matter of some network ACLs, making a DB, and GRANTing a user. The Dev site lives only in PHX1, so there's no concerns here. Easy peasy.

Stage DB will be in SCL3. Same scenario, except we'll need some cross-DC ACLs for this, so that the PHX1 stage node can reach it.

Prod DB will be in both locations, as discussed. Puppet will push out an appropriate /etc/hosts entry to point to the "correct" DB node for the web nodes in each datacenter. This is how we've dealt with this sort of thing in the past, so this is not a one-off / unique config situation.

I will open the appropriate NetOps ACL bugs and have them block this bug. I'll also open a bug for DBA's to create the databases and users in all the relevant places.

Once those are all resolved, it's a simple matter of adding the appropriate config stuff to settings/local.py.

Thanks all!
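For reference, the settings/local.py piece would look roughly like this (a sketch; the DB name, user, and host alias are placeholders, not the real values):

# The HOST is a per-datacenter alias: Puppet pushes an /etc/hosts entry so
# each web node resolves it to the "correct" DB node in its own datacenter.
DATABASES = {
    'default': {
        'ENGINE': 'django.db.backends.mysql',
        'NAME': 'bedrock',             # placeholder
        'USER': 'bedrock',             # placeholder
        'PASSWORD': '<from-config>',   # placeholder
        'HOST': 'bedrock-db',          # placeholder alias via /etc/hosts
        'PORT': '3306',
    }
}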
No longer depends on: 738381
Whiteboard: [waiting][dba][webdev] → [triaged 20120904][waiting][webops][dba]
Blocks: 750912
Whiteboard: [triaged 20120904][waiting][webops][dba] → [triaged 20120904][waiting][webops][dba hardware]
Rik / Mike / Chris: How should we handle databases for the 3 www-demoX.allizom.org sites? I suppose the most complete solution is to have 3 separate DBs.

Sheeri: any word on the DB hardware for prod? I think you had some nodes already, or had a bug for them...
Component: Server Operations: Web Operations → Server Operations: Database
Whiteboard: [triaged 20120904][waiting][webops][dba hardware] → [triaged 20120904][waiting][dba hardware]
According to :csheilds, hardware has been ordered. We do have the old sfx01 hardware, but it might already be appropriated (datazilla stage/dev)... plus, for something like bedrock, we probably want the latest and greatest anyway.
Waiting for confirmation for the dev and demoX sites

DB and GRANTs completed for staging on stage1.db.scl3... will use SCL3 Zeus VIP

Config in place for stage, but left commented out in case it actually breaks anything at the moment.


Will open ACLs all at once... easier on NetOps/OpSec if they can see the whole picture.
Sorry for the late answer, I was focused on other tasks and I'm now not working on Bedrock anymore.

CCing devs who are currently discussing the issue.
(In reply to Jake Maul [:jakem] from comment #36)
> Waiting for confirmation for the dev and demoX sites

We'd like separate databases for each, but a single server will be fine. For the demo servers, we plan to wipe and rebuild the database on each push to ensure that we have good sample data included.

We'd like to do this with a commander script in the github repo, and we'd need some way of determining if the server is a demo, dev, stage, or prod server in the script. 

Once the database is in place for demo I can start work on adding the commander script in the repo. Is there anything special I need to do besides mimicking what other sites do? Can I get a copy of whatever script/commands chief runs when it pushes out bedrock?
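A rough sketch of what such a commander task might look like (hypothetical: it assumes the oremj/commander deploy library used by other Mozilla sites, core Django management commands, and South):

from commander.deploy import task

BEDROCK_DIR = '/data/bedrock/src'  # placeholder path

@task
def reset_demo_db(ctx):
    """Wipe and rebuild the demo DB so each push ships fresh sample data."""
    with ctx.lcd(BEDROCK_DIR):
        ctx.local('python manage.py flush --noinput')     # wipe existing rows
        ctx.local('python manage.py syncdb --noinput')    # create missing tables
        ctx.local('python manage.py migrate')             # run South migrations
        ctx.local('python manage.py loaddata demo_data')  # hypothetical fixture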
Whiteboard: [triaged 20120904][waiting][dba hardware] → [triaged 20120904][waiting][dba build]
Whiteboard: [triaged 20120904][waiting][dba build] → [triaged 20120904][waiting][sre kickstart]
Whiteboard: [triaged 20120904][waiting][sre kickstart] → [triaged 20120904]
Whiteboard: [triaged 20120904] → [triaged 20120904][waiting][netflows]
I have set up the databases in phx to slave off scl3, so there is data flowing and the data is usable.

The architecture is:

bedrock1.db.scl3 -> bedrock1.db.phx1 -> bedrock2.db.phx1
      |  ^
      v  |
bedrock2.db.scl3

That is, the failover for bedrock1 in scl3 is bedrock2 in scl3, right now. We can change this to bedrock1.db.scl3 <-> bedrock1.db.phx if desired.


I have set up bug 823252 for monitoring and bug 823254 for backups.
Blocks: 823252, 823254
Whiteboard: [triaged 20120904][waiting][netflows] → [triaged 20120904]
Resolving in favor of bug 901078.
Status: REOPENED → RESOLVED
Closed: 12 years ago → 11 years ago
Resolution: --- → FIXED
Product: mozilla.org → Data & BI Services Team