Closed Bug 794161 Opened 12 years ago Closed 12 years ago

Server review of source-dev1.vm.labs.scl3.mozilla.com

Categories

(Infrastructure & Operations Graveyard :: WebOps: Labs, task)

x86
macOS
task
Not set
normal

Tracking

(Not tracked)

RESOLVED FIXED

People

(Reporter: boozeniges, Unassigned)

References

()

Details

Heya guys,

So we've been set the launch date of October 16th to launch the source site to the world. As I mentioned during our all-hands, the likely plan is to make what is currently the dev box live, to avoid having to set up the machine again and to avoid a troublesome data migration.

I'm wondering if you could check over the setup on the box, just to make sure that I've not done anything stupid, or whether there are better ways to do what we've done. I'm more than happy to provide any further info or jump on Skype/Vidyo with you if needed regarding what we have done and what we were planning to do.

Any problems, please let me know; happy to help out as needed. Additionally, I asked this of non-labs IT, but I'm not sure if their feedback is valid or not: https://bugzilla.mozilla.org/show_bug.cgi?id=769270

Ross
Changes:
- increased memory allocated to the VM from 756MB to 1.5GB and restarted the VM
- set up the backup client so we can get a backup of the ES indexes (/var/lib/elasticsearch)
- increased the number of open file descriptors for the elasticsearch user (64K)
- took out the manual overrides of ES_MIN_MEM and ES_MAX_MEM from /etc/init.d/elasticsearch
- put new ES_MIN_MEM and ES_MAX_MEM values in /etc/sysconfig/elasticsearch (set both to 512m, based on limiting ES to 1/2 of the available memory)
- on the off-chance that it was never made into a default, added "-Des.monitor.jvm.enabled=false" to the command-line invocation in /etc/init.d/elasticsearch (https://github.com/elasticsearch/elasticsearch/issues/1075)

Elasticsearch was restarted.

Thoughts: Mostly, I'm concerned about ES running out of memory and falling over ungracefully. >_< If it's okay, I was thinking of installing this ES plugin: https://github.com/lukas-vlcek/bigdesk ... to try to get some visibility into heap versus non-heap memory utilization.

There are a lot of tweaks people do with regard to caching for queries. I didn't see all that many queries to this server, and the ES logs are fairly empty.

I know there has been discussion in the past about whether using the Sun JDK versus the OpenJDK makes a significant difference in memory utilization. There is a new version of the OpenJDK available (1:1.6.0.0-1.49.1.11.4.el6_3, versus the installed 1.6.0.0-1.48.1.11.3.el6_2), but I don't know what the ramifications for the application would be if we updated.
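For reference, the heap and file-descriptor changes described above would look roughly like this (a sketch of the settings, not the exact file contents from the box):

```shell
# /etc/sysconfig/elasticsearch -- heap sized to half of the VM's 1.5GB RAM
ES_MIN_MEM=512m
ES_MAX_MEM=512m

# Raised open-file limit for the elasticsearch user (64K), e.g. in
# /etc/security/limits.conf:
#   elasticsearch  soft  nofile  65536
#   elasticsearch  hard  nofile  65536

# And the flag added to the java invocation in /etc/init.d/elasticsearch:
#   -Des.monitor.jvm.enabled=false
```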
Certainly installing that plugin sounds awesome to me, and thanks very much for looking into things so thoroughly! Am adding in the main developer on the project (Ryan Pitts) to see if he has any feelings on the best time to do it and potential ramifications. For me, having a stable, long-running search is the ultimate goal - wondering what people will feel if it's potentially broken for part of the QA stage of the project.
I'm all for doing this, and doing it sooner rather than later. We have a call scheduled for 12:30 EST tomorrow, so there will be a few people looking at the feature between now and then -- it would be fantastic if it's possible to install sometime after that.
I'll aim for installing the plugin sometime after 2 PM EDT / 6:30 PM BST today (Wednesday, October 3rd). (My timing is a little rough today, since I've got a doctor's appointment during my mid-day.)
Thanks :cyliang, hope everything's well!
I installed the plugin, which is currently accessible at: http://source-dev1.vm.labs.scl3.mozilla.com:9200/_plugin/bigdesk/#nodes (You will need to be on the VPN for this URL to work.) Select the nodes link near the top of the toolbar. This interface displays quite a bit of information: I suspect it would be very useful to see what things look like after a few runs / searches. To make it easier for multiple people to access the plugin, I had to reconfigure Elasticsearch to bind on all interfaces. (Previously, it had just been listening on 127.0.0.1.) If this is a problem, I can reset it to 127.0.0.1, and access to the plugin will then require an SSH tunnel.
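The binding change above amounts to something like this (a sketch; the exact config file layout on the box may differ):

```shell
# /etc/elasticsearch/elasticsearch.yml -- bind address
# Listen on all interfaces so anyone on the VPN can reach the bigdesk UI:
#   network.host: 0.0.0.0
# To revert to loopback-only, set:
#   network.host: 127.0.0.1
# ...in which case plugin access needs an SSH tunnel, e.g.:
#   ssh -L 9200:127.0.0.1:9200 source-dev1.vm.labs.scl3.mozilla.com
# and then browse to http://localhost:9200/_plugin/bigdesk/
```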
I've got a plan in place for a large chunk of people (around 15) to jump onto the site and run a bunch of searches later today so hopefully that should help us get a better bunch of data.
:cyliang - did our hacking provide any useful results yesterday?
Two things came to mind: 1) The heap usage didn't climb sharply. I'm leaning towards rejiggering things so that there's more non-heap memory than heap memory allocated, but I want to make sure I understand what the potential downsides are if I do so. =) 2) Is there a quick writeup somewhere about this app? I'm wondering if it's worth trying to find and/or write up a brain dead load generator that makes sustained search queries to see what happens RE: garbage collection time.
1. sounds like a winner to me 2. yeah, I was thinking a load generator might be useful (am trying to find out if something already exists) too. What do you mean by a writeup of the app? (It's probably something that I could do.)
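A "brain dead" load generator along the lines discussed could be as simple as a curl loop (a sketch; the search URL and query terms here are hypothetical, and the actual testing later used a JMeter script):

```shell
#!/bin/sh
# Fire sustained search queries at the site and report status/latency.
# HOST and the /search/?q= path are assumptions for illustration.
HOST="http://source-dev1.vm.labs.scl3.mozilla.com"
while true; do
  for q in django mozilla elasticsearch haystack; do
    curl -s -o /dev/null -w "%{http_code} %{time_total}s ${q}\n" \
      "$HOST/search/?q=$q"
  done
  sleep 1
done
```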
RE: writeup Right now, I know what some of the "cogs" look like (elasticsearch, apache, mysql), but I'm not sure how they work together. I guess, maybe a rough flowchart of what some of the typical interactions with the site look like and what pieces of software those interactions "trigger".
Ah OK - so let me know how helpful this is. We've got:

* a standard django/playdoh site, running off Apache with mod_wsgi
* database is MySQL, just a single DB
* search is provided by Elasticsearch and accessed through python/django via haystack (http://haystacksearch.org/)
* image generation using PIL (and a django library called sorl-thumbnail - http://sorl-thumbnail.readthedocs.org/en/latest/index.html)

Most of the content is held in MySQL and served to the site via django (we don't have user-generated content yet, though this is coming in a future release), accessed through the standard model queries, not via ES. We create new Elasticsearch indexes hourly via a cron job - from my understanding of haystack, these are what get queried (via Elasticsearch) when a user performs a search, as opposed to accessing MySQL directly. The only trigger for an ES search is via the search form, and it is entirely user-triggered.

Sorry if this is totally useless - please let me know, Ross
Ross: the writeup is quite helpful. =)
Awesome!
So, it looks like the non-heap memory tuning parameters either allocate memory for permgen (metadata about classes) or increase the code cache. In most cases, people seem to set these variables AFTER seeing errors logged for this: most people seem to have issues with running out of heap as the index grows. Someone mentioned testing their ES using JUnit; I'll try to see if I can find more details about their testing. Is the data in the MySQL tables something that should be regularly dumped for nightly backups as well?
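For concreteness, the non-heap knobs in question are JVM flags along these lines (values are illustrative assumptions; as noted above, people usually only set them after seeing permgen or code-cache errors in the logs):

```shell
# Passed through to the Elasticsearch JVM, e.g. via /etc/sysconfig/elasticsearch
# (flag names are standard HotSpot options for this JVM generation):
#   -XX:MaxPermSize=256m            # room for class metadata (permgen)
#   -XX:ReservedCodeCacheSize=64m   # room for JIT-compiled code
ES_JAVA_OPTS="-XX:MaxPermSize=256m -XX:ReservedCodeCacheSize=64m"
```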
Someone in the webQA team was super nice and made us a Jmeter script that would chuck searches at the box - have forwarded the email I got from them containing it. Nightly back-ups certainly would be ace :)
As a side note: Some of the Elasticsearch and (further under the hood) Lucene tuning guides strongly suggest allowing ES to lock memory. This is probably something to try for servers dedicated to ES: it's probably NOT safe to do for a server handling multiple services. At some point after the launch, I'd like to see if it's possible to set up a parallel source-dev1 server and see what work is involved on the application and operations end to move indexes and/or what work it takes to rebuild them. That way, we'll be prepared to shift ES off the box if things DO get cramped on that server.
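The memory-locking suggestion above would look roughly like this (a sketch; as noted, probably only appropriate on a box dedicated to ES):

```shell
# /etc/elasticsearch/elasticsearch.yml -- lock the heap in RAM so it
# can't be swapped out:
#   bootstrap.mlockall: true
#
# The elasticsearch user also needs permission to lock that much memory,
# e.g. in /etc/security/limits.conf:
#   elasticsearch  -  memlock  unlimited
```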
That sounds like a plan!
Looks like there's a password set for the root mysql user. At some point, can you either IM that to me or create a user called "backup" that has select, reload, and lock_tables privs?
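Creating that backup user would be along these lines (a sketch; the host and password are placeholders, privileges as requested above):

```shell
# Create a restricted MySQL user for backups only.
mysql -u root -p <<'SQL'
CREATE USER 'backup'@'localhost' IDENTIFIED BY 'CHANGEME';
GRANT SELECT, RELOAD, LOCK TABLES ON *.* TO 'backup'@'localhost';
FLUSH PRIVILEGES;
SQL
```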
Also: Was there an intention to turn up another instance of ES on this server, or to add another ES node into the mix? I ask because two indexes (twitter and haystack) are set to "index.number_of_replicas: 1". My understanding is that this means you want 1 replica built of each shard (shard + 1 copy of shard), which will never be satisfied since there is only one server in the cluster. See http://elasticsearch-users.115913.n3.nabble.com/Health-always-yellow-status-td4023579.html We don't have to set index.number_of_replicas to 0; leaving it at 1 just means the monitoring system shouldn't perform the usual ES health check, since that check will always report a warning due to the yellow cluster status.
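If we did want the cluster to go green on a single node, dropping the replica count would be something like (a sketch; index names from the comment above, run on the box itself):

```shell
# Set replicas to 0 on each index so a one-node cluster reports green
# instead of yellow (no unassigned replica shards).
curl -XPUT 'http://127.0.0.1:9200/haystack/_settings' \
  -d '{"index": {"number_of_replicas": 0}}'
curl -XPUT 'http://127.0.0.1:9200/twitter/_settings' \
  -d '{"index": {"number_of_replicas": 0}}'
```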
Hmm - I'm not sure what that twitter index is all about, my understanding was that we were accessing all those indexes through haystack... From my understanding there were no plans in place to spin up other indexes. Ryan - do you know what that twitter index is all about?
Dumps of mysql are successfully going to a local directory and are slated to be picked up by the backup server starting this evening.
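The nightly dump arrangement described above would look something like this (a sketch; the schedule, user, and paths are illustrative assumptions, not the actual config on the box):

```shell
# e.g. /etc/cron.d/mysql-backup -- nightly local dump, which the backup
# server then picks up from /var/backups/mysql/
#   0 2 * * * backup mysqldump --all-databases --single-transaction \
#     | gzip > /var/backups/mysql/dump-$(date +\%F).sql.gz
```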
I sure don't know what the twitter index is doing there. I'm pretty positive that it's nothing to do with anything I've written, though. The only thing we're doing with Twitter is storing some usernames with profiles so that we can make a javascript call to their API on the template side. No indexing whatsoever.
The site is live and has been now for rather a while - good to say this is closed I think. Thanks for all the work put into it :)
Status: NEW → RESOLVED
Closed: 12 years ago
Resolution: --- → FIXED
Product: mozilla.org → Infrastructure & Operations
Product: Infrastructure & Operations → Infrastructure & Operations Graveyard