Add ElasticSearch to Socorro's dev and stage environments

RESOLVED FIXED

Status

Infrastructure & Operations
WebOps: Other
RESOLVED FIXED
6 years ago
4 years ago

People

(Reporter: adrian, Assigned: cshields)

(Reporter)

Description

6 years ago
As we are getting closer to having an ElasticSearch environment in production for Socorro, we will also need a dev and stage environment for our tests and QA. 

This bug is to start the discussion around that and identify the needs or problems we may have. 

In terms of data size, dev doesn't need to be big at all (100 to 1,000 crashes a day for the last 14 days will be enough). Stage could be bigger though as we would like to be able to perform some performance tests. 

Indexing data into those ES instances is not a problem; we already have a few ways of doing it.
(Reporter)

Comment 1

6 years ago
Anurag and Daniel may have opinions on that matter, as well as Rob.
(Assignee)

Comment 2

6 years ago
our dev ES environment hit a snag..  we are working around it but it will take some time.
(Assignee)

Updated

6 years ago
Depends on: 697984
(Assignee)

Comment 3

6 years ago
(In reply to Corey Shields [:cshields] from comment #2)
> our dev ES environment hit a snag..  we are working around it but it will
> take some time.

scratch that, I forgot that we already re-did this in VMs..

You guys can use it-elasticsearch-dev-zlb.vlan81.phx.mozilla.com:9200

(there may need to be a net flow opened up, if so we can file a bug for that)
Assignee: server-ops → cshields
(Reporter)

Comment 4

6 years ago
Is anyone else using that instance? Can we do stuff like deleting everything in there without causing anyone else trouble?
(Assignee)

Comment 5

6 years ago
(In reply to Adrian Gaudebert [:adrian] from comment #4)
> Is anyone else using that instance? Can we do stuff like deleting everything
> in there without causing anyone else trouble?

It is a shared environment, but you should be creating your own index anyway.  And you can delete all of the data you want within that.
(Reporter)

Comment 6

6 years ago
For performance reasons we use several indexes (one per day), but we will make sure not to impact anything else in there. Thanks for sharing that box!
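
To illustrate the one-index-per-day layout on a shared cluster, here is a minimal Python sketch. The `socorro_` prefix, the date format, the helper names, and the sample dates are all hypothetical, not Socorro's actual naming; the 14-day window is the dev retention mentioned in the description. The only point is that a distinct prefix lets us pick out and delete our own indexes without touching anyone else's.

```python
from datetime import date, timedelta

# Hypothetical prefix: any name unique to Socorro on the shared cluster would do.
INDEX_PREFIX = "socorro_"


def index_name(day):
    """Name of the index holding crashes for one day, e.g. socorro_20120131."""
    return INDEX_PREFIX + day.strftime("%Y%m%d")


def expired_indexes(existing, today, keep_days=14):
    """Return our own indexes older than the retention window.

    Indexes without our prefix are never returned, so other users of the
    shared instance are left alone.
    """
    cutoff = index_name(today - timedelta(days=keep_days))
    return sorted(
        name for name in existing
        if name.startswith(INDEX_PREFIX) and name < cutoff
    )


# Made-up sample data for illustration only.
existing = ["socorro_20120101", "socorro_20120130", "other_team_logs"]
print(expired_indexes(existing, today=date(2012, 1, 31)))
```

Each returned name could then be removed with ElasticSearch's delete-index call (an HTTP DELETE on the index URL), which leaves every other index on the box untouched.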
(Assignee)

Comment 7

6 years ago
(In reply to Adrian Gaudebert [:adrian] from comment #6)
> For performance reasons we use several indexes (one per day), but we will
> make sure not to impact anything else in there. Thanks for sharing that box!

yeah, this will be the quickest way to get you guys up and running..  We don't have any spare hardware to do this baremetal right now so a full dedicated environment would take time.

Comment 8

6 years ago
We have several dev and staging environments for socorro right now:

1) local (VM generated by vagrant using puppet+virtualbox)
2) crash-stats-dev (follows "master" branch, self-contained mozilla-hosted box)
3) crash-stats.allizom.org (stage, several physical boxes)

For #1 and #2, we just need to be able to install ElasticSearch and the other search service dependencies. I've done this experimentally in a local VM, but bagheera wasn't ready for release at the time. We should figure this out: Anurag and I discussed it briefly on IRC, and I think we just need a way to pull the latest bagheera release, build it, and make sure we have the right version of ES and any other deps installed.

Having a dev environment such as #3 is helpful and appreciated (since we can get started Right Now), but it'd be great to have some dedicated environments that we can use as well. We'll also test and document the install process as a side effect, which is crucial for contributors and Socorro admins outside Mozilla.
(Assignee)

Comment 9

6 years ago
(In reply to Robert Helmer [:rhelmer] from comment #8)
> Having a dev environment such as #3 is helpful and appreciated (since we can
> get started Right Now), but it'd be great to have some dedicated
> environments that we can use as well. We'll also test and document the
> install process as a side effect, which is crucial for contributors and
> Socorro admins outside Mozilla.

Are you asking for a dedicated dev elasticsearch environment for socorro here?
(Reporter)

Comment 10

6 years ago
(In reply to Corey Shields [:cshields] from comment #9)
> (In reply to Robert Helmer [:rhelmer] from comment #8)
> > Having a dev environment such as #3 is helpful and appreciated (since we can
> > get started Right Now), but it'd be great to have some dedicated
> > environments that we can use as well. We'll also test and document the
> > install process as a side effect, which is crucial for contributors and
> > Socorro admins outside Mozilla.
> 
> Are you asking for a dedicated dev elasticsearch environment for socorro
> here?

I believe he is. And if he is not, I am! :D

As I wrote in comment 1, we would need to store the same amount of data as in our staging Postgres instance: about 3 to 4 weeks of data. That data is processed JSON, meaning it includes more data (especially full dumps), so it's going to be heavier than what we have in Postgres.
(Assignee)

Comment 11

6 years ago
define "heavier" - our current test environment is in VMs, and as you guys are painfully aware - getting real hardware for your production ES has taken us way too long.
(Reporter)

Comment 12

6 years ago
One processed JSON is about 2 KB, and we have between 300,000 and 400,000 crash reports per day. I would say that one dump of 4 weeks will be close to 20 GB. Loaded into ElasticSearch, I don't know how much space it will take, but it sure won't be more than 3 or 4 times that. I suppose it's not a problem, is it?

Anurag, can you please confirm what I just said? You've been doing a lot more tests loading data into ES than I have.
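
As a rough check of the estimate above, the arithmetic can be sketched in Python. The 2 KB document size, the 300,000 to 400,000 daily crash count, the 4-week window, and the "3 or 4 times" ceiling all come from this comment; the helper name and the decimal units (1 KB = 1,000 bytes) are assumptions made for illustration.

```python
# Back-of-the-envelope sizing for 4 weeks of processed crash JSON.
# Figures come from the comment above; decimal units are an assumption.

KB = 1_000
GB = 1_000_000_000


def raw_size_gb(docs_per_day, days, doc_size_bytes=2 * KB):
    """Return the raw JSON volume in GB for the given retention window."""
    return docs_per_day * days * doc_size_bytes / GB


low = raw_size_gb(300_000, days=28)
high = raw_size_gb(400_000, days=28)
print(f"raw JSON: {low:.1f} to {high:.1f} GB")

# Ceiling guessed in the comment: no more than 3 to 4 times the raw size.
print(f"indexed, worst case: {low * 3:.1f} to {high * 4:.1f} GB")
```

This gives roughly 17 to 22 GB of raw JSON, which is in line with "close to 20 GB". The indexed figure is only the guessed ceiling, not a measurement.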

Comment 13

6 years ago
:adrian's estimates are close to our previous test run.

Data from previous load test performed in July/August.

Cluster size: 5 nodes
# of documents: 9m
avg. JSON size/document: 2kb
replication factor: 1
disk space/node: 30GB
(Assignee)

Comment 14

6 years ago
You have 8 nodes being turned up for production ES.  Should we take one of these for dev ES and one for stage ES, leaving 6 for production?
(Reporter)

Comment 15

6 years ago
Maybe two nodes for staging instead of one would be better, in case we need to test things like replication, shards, etc.?

One node for dev is awesome. We'll want to open access from khan and if possible from VPN as well.
(Assignee)

Comment 16

6 years ago
It's a matter of resources - what I was suggesting would take resources away from your production pool (not ideal) versus waiting on new hardware to be ordered.  I can probably have 1 dev, 2 stage in about 2-3 weeks.

Comment 17

6 years ago
I thought hp-node6* were earmarked for stage/dev ES?
(Assignee)

Comment 18

6 years ago
(In reply to Justin Dow [:jabba] from comment #17)
> I thought hp-node6* were earmarked for stage/dev ES?

possibly - it's news to me though.  :)

If you are sure of this do we need to re-kick them or just puppetize them?

Otherwise we can block this bug on 728219 and do it with the new xeon nodes coming.

Comment 19

6 years ago
I'm pretty sure that was the master plan. Daniel and C. should fill in the gaps here though.

Comment 20

6 years ago
We used the hp-node6* nodes to implement the first version of the ES search.  It was decided not to use those for production and order new machines instead.

We discussed the possibility of using them as staging, but I seem to remember that IT was a little worried about using a different class of machine for staging.  I would be fine with using them for staging.  Otherwise, I'd probably say we should just split them up and add them into the HBase pools to have a bit more room.

Comment 21

6 years ago
Sorry, I added a bunch of confusion here because I wasn't remembering previous discussions fully.

We were using them for the first ES, we decided to buy new prod hardware for ES.  We briefly discussed using them for ES staging, IT didn't like that because they weren't the right class of machines.

During the discussion to move all of Socorro staging into PHX, we decided to use these nodes as the new Socorro HBase staging cluster.  That is what they are currently set up to support.

We don't have nodes available for ES staging in the Socorro hardware in PHX at the moment.

Comment 22

6 years ago
After discussion with Daniel, it sounds like:

   - The production pool should be cannibalized for 1 dev, 2 staging servers. Replacements for production should be ordered.

   - Metrics (Anurag) will initially manage the new ES cluster more directly because of a custom queue component.  The intent is to do the work necessary to make this manageable by IT (puppet management of the config, monitoring of the queue, updated documentation for service support).

Comment 23

6 years ago
(In reply to Daniel Einspanjer :dre [:deinspanjer] from comment #21)
> Sorry, I added a bunch of confusion here because I wasn't remembering
> previous discussions fully.
> 
> We were using them for the first ES, we decided to buy new prod hardware for
> ES.  We briefly discussed using them for ES staging, IT didn't like that
> because they weren't the right class of machines.
> 
> During the discussion to move all of Socorro staging into PHX, we decided to
> use these nodes as the new Socorro HBase staging cluster.  That is what they
> are currently set up to support.
> 
> We don't have nodes available for ES staging in the Socorro hardware in PHX
> at the moment.

Ah, yes! That is how I remember it too now. Sorry for adding confusion earlier.
(Reporter)

Comment 24

6 years ago
Per discussion with the team today, we confirm the decision of having one server for dev, two servers for stage and the five remaining servers for production. As stated by C. Liang, we will want to order new boxes to replace those used for dev/stage. Does that require me to file a new bug?

Corey, is there anything you need from us about this?
(Assignee)

Comment 25

5 years ago
this has been done for dev and stage (using bagheera) for weeks now..
Status: NEW → RESOLVED
Last Resolved: 5 years ago
Resolution: --- → FIXED

Comment 26

5 years ago
cshields: adrian is able to submit data but it's not getting indexed. Can I get sudo on stage? If not, can someone from ops babysit with us for a bit as we narrow down the issue?
Status: RESOLVED → REOPENED
Resolution: FIXED → ---
(Assignee)

Comment 27

5 years ago
it is my understanding that the new es stage for socorro has been up for a week now. let's close this out and open new bugs to avoid confusing the environments.
Status: REOPENED → RESOLVED
Last Resolved: 5 years ago
Resolution: --- → FIXED
Component: Server Operations: Web Operations → WebOps: Other
Product: mozilla.org → Infrastructure & Operations