Closed Bug 872363 Opened 9 years ago Closed 9 years ago

Create new ElasticSearch cluster for public bugs

Categories

(Infrastructure & Operations Graveyard :: WebOps: Other, task, P4)

x86_64
Linux

Tracking

(Not tracked)

RESOLVED FIXED

People

(Reporter: mcote, Assigned: dmaher)

References

Details

(Whiteboard: [triaged 20130516])

metrics has an ES cluster that contains not only the current state of Bugzilla but its past history as well.  ElasticSearch is better suited to certain types of analyses, and querying it doesn't load the main Bugzilla database.  However, the cluster contains *some* information on confidential bugs, so it is behind LDAP, which means that any tools that use it generally must be behind LDAP as well, which limits their usefulness.

It would thus be really great to have a publicly accessible ES cluster which contains *only* data on public bugs (or to word it another way, *no* data on confidential bugs of any type).

mcoates has already signed off on this idea from a security standpoint, provided that bugs with any limitations on visibility are excluded, and that all data concerning bugs that are made confidential some time after creation (i.e. bugs that were at one point public but no longer are) is removed in a timely fashion.

To do this, we'll need to

1. Find some hardware.  I am not sure what capacity we'll need.
2. Set up an ES cluster.
3. Create a script/app to sync data from the Bugzilla db.  Hopefully we can adapt whatever metrics uses, bearing in mind that it has to grab only public bugs and remove information on bugs that are later switched to confidential.
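The sync in step 3 could look something like the sketch below, using an in-memory dict in place of a real Elasticsearch index.  All names here are hypothetical, not the actual metrics tooling; the real script would read from the Bugzilla MySQL database and write via the ES bulk API.  The key property is that a bug with any group restriction is not only skipped but actively purged, covering the "later switched to confidential" case.

```python
# Hypothetical sketch of the public-bugs sync logic (step 3 above).
# `index` stands in for the ES index; a non-empty `groups` list on a
# Bugzilla bug record means the bug has some visibility restriction.

def sync_public_bugs(bugs, index):
    """Index public bugs; purge any bug that has a visibility restriction."""
    for bug in bugs:
        if bug.get("groups"):
            # Bug is (or has become) confidential: remove any trace of it.
            index.pop(bug["id"], None)
        else:
            index[bug["id"]] = bug
    return index
```

Run on the periodic sync (every 30 or 60 minutes, per the plan below), this removes newly restricted bugs within one sync interval.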

I'm sure I'm missing some details, but that's the gist.

What kind of timeline can we expect for steps 1 and 2?  Let me know if you need me to figure out hardware requirements.
Hello,

Before we can spec hardware, we need to know more about the resource requirements and expected usage-levels of the service:
* Expected (non-replicated) size of the initial index / indices.
* Expected growth pattern of said indices.
* Expected usage patterns (largely read-only, batch indexing, constant r/w, etc..).
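The answers to these questions feed directly into a disk estimate.  A rough helper, where the replica count and headroom factor are my own assumptions rather than any WebOps sizing policy:

```python
# Rough disk-sizing helper for the questions above. Replica count and
# headroom factor are illustrative assumptions, not a stated policy.

def required_disk_gb(primary_gb, growth_gb_per_month, months,
                     replicas=1, headroom=1.5):
    """Total cluster disk: primaries plus replicas, with free-space headroom."""
    data = (primary_gb + growth_gb_per_month * months) * (1 + replicas)
    return data * headroom
```

For example, with no headroom, 75 GB of primaries growing 3 GB/month for 36 months with one replica needs (75 + 108) * 2 = 366 GB across the cluster.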

Note that we have production-ready Elasticsearch clusters in both SCL3 and PHX1 right now - one of these may be appropriate for this project, in which case, steps 1 and 2 would already be complete. :)
Assignee: server-ops-webops → dmaher
Flags: needinfo?(mcote)
Priority: -- → P4
Whiteboard: [triaged 20130516]
Excellent questions. :)  I've asked metrics for some info on their clusters, which should largely be applicable to the new system.  At the moment, I believe that the index is about 26 GB.  It will be mostly read-only; the only writing will be the periodic syncing of data from the Bugzilla MySQL database (maybe every 30 or 60 minutes).  Also here's a link to the metrics cluster's Ganglia, which might provide a bit more info:

http://app1.metrics.scl3.mozilla.com/ganglia/ganglia2/?c=scl3_prd_cluster02&m=load_one&r=hour&s=by%20name&hc=4&mc=2

I'll comment again here once I've talked to metrics.
Flags: needinfo?(mcote)
I have just been reminded that this cluster will also contain the full text of (public) comments, which the metrics cluster does not.  I'll have to crunch some numbers to get an estimate of how much bigger the index will be, but it won't be insignificant.
Comments are about 4 GB right now, so it's actually not much more.

However after more discussion today, we'd like to have attachment data too. *That* is a lot more data, around 45 GB right now.
One quick note on the storage requirements for comments - the current metrics bugzilla data in ES keeps a separate document for each revision of a bug.  That is, each change to a bug creates a new document showing the state of that bug at that point in time.

I experimented with adding comment info using the same approach, but had to abandon it because it would have caused the index to become too large.  This was several months back, but if I recall correctly it was looking like it would use about 10x more space than the current index (and the current index is ~26GB). 

Storing comments in a separate index would be much more efficient, but would require separate queries to fetch comment info vs. bug info.
Yes, the plan is to have the comments (and all attachments) in a separate index, where each comment document is placed in the same shard as the bug it belongs to.  ES can now do primitive joins, and they are not too expensive when joining happens only within a single machine.
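Shard co-location can be achieved by routing comment documents on the bug id, so a comment always hashes to the same shard as its bug.  The sketch below just builds the bulk-API action metadata; the index name and field names are hypothetical, and the exact metadata key for routing varies by ES version (older releases such as 0.20 used `_routing`).

```python
# Hypothetical sketch: route each comment document by its bug id so it
# lands on the same shard as the bug document. Field and index names are
# illustrative; the routing metadata key varies by Elasticsearch version.

def comment_index_action(comment):
    """Build a bulk-API index action for one comment, routed by bug id."""
    return {
        "index": {
            "_index": "bug_comments",
            "_id": comment["comment_id"],
            # Same routing key as the bug document => same shard.
            "_routing": str(comment["bug_id"]),
        }
    }
```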
I'm not sure if I'm parsing this information correctly: is the size of the comments 4 GB or 26 GB?  As for adding the attachment data, what about this makes it weigh ~45 GB?  What sort of data is it, exactly?

Assuming the largest values, the indices in question would be about 71 GB in size *initially*, and would grow from there - is that correct?  If so, what is the expected growth pattern?
Current index = 26GB
Raw Comments (not indexed) = 4GB
All attachments (not indexed) = 45GB

The attachments data is large because it includes many file contents.  Only the metadata will be indexed, not all the file contents.
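Indexing only the metadata might amount to stripping each attachment row down before it is sent to ES.  A minimal sketch, where the field names are hypothetical stand-ins for the Bugzilla attachments schema rather than its actual column names:

```python
# Illustrative sketch: keep only attachment metadata for indexing and
# drop the (large) file contents. Field names are hypothetical stand-ins
# for the Bugzilla attachments schema.

ATTACHMENT_METADATA_FIELDS = ("id", "bug_id", "filename", "mimetype",
                              "size", "created", "flags")

def attachment_doc(row):
    """Project a raw attachment row down to its metadata fields."""
    return {field: row[field] for field in ATTACHMENT_METADATA_FIELDS
            if field in row}
```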

*ASSUMING* the 75GB (26+4+45) was produced in just the past two years, the data will grow by 3GB/month.  This is a good over-estimate.
Well, given the disk usage requirements (initial and growth), it would not make sense to use either of the existing "generic" clusters.  I'll bring it up with my manager and we'll look into the hardware situation.  I will update this bug / file dependent bug(s) as appropriate.
What is the criticality of this proposed service?  What would be the impact of a temporary service outage?  How about (permanent) data loss - could it be re-indexed easily?

The larger question here is whether a single machine with lots of available disk space would suffice, or whether the "standard" three-node-minimum applies.
Flags: needinfo?(mcote)
Daniel:  Can we be less demanding to fit on one of the "existing generic clusters"?

Service Criticality - The service is not critical to Mozilla business at this time.  I imagine only minor annoyance if the service is down.  This data may become more important in the future, but we will revisit then.

Data Redundancy - The data we plan to put on the cluster is NOT original: We will be able to re-create it if the cluster is lost.  

Single Machine? - This may be adequate.  I have had no problems using a single machine in development.
(In reply to Kyle Lahnakoski from comment #11)
> Daniel:  Can we be less demanding to fit on one of the "existing generic
> clusters"?

Yes, certainly.  As I mentioned, the disk requirements (75GB + 3GB per month) are too steep for one of the generic clusters.  If those requirements were to be reduced - by removing the attachments and/or reducing the growth pattern, for example - then there would be no problem.
 
> Single Machine? - This may be adequate.  I have had no problems using a
> single machine in development.

It would appear, then, that the choices are:
* Reduce the disk space requirements and use a generic cluster right away.
* Stick with the largest estimate and wait for dedicated hardware.

I leave it up to you - I'm good either way. :)
Can the "generic cluster" handle 30GB now + 1GB per month?
(In reply to Kyle Lahnakoski from comment #8)
> The attachments data is large because it includes many file contents.  Only
> the metadata will be indexed, not all the file contents.

The current bugzilla data we have in ElasticSearch contains attachment metadata already (attachment fields and flags), but not the attachment contents.  I'm not sure it even makes sense to put the attachment contents into ES.
(In reply to Kyle Lahnakoski from comment #13)
> Can the "generic cluster" handle 30GB now + 1GB per month?

Not indefinitely - I mean, if you need a hundred years of data retention we'll need to look at something else, but a year or two should be fine. :)
We have decided to go with option 2: "Stick with the largest estimate and wait for dedicated hardware."  This assumes 75GB to begin with, +3GB/month, and the "standard" three-node minimum.

Please do not plan for 100 years of growth :)  3 years of growth is fine (183GB = 75GB + 3GB/month * 36months)
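The figures above can be sanity-checked with simple arithmetic:

```python
# Sanity check of the capacity estimate quoted in this bug.
initial_gb = 26 + 4 + 45                # index + comments + attachments
growth_gb_per_month = 3                  # stated over-estimate
three_years_gb = initial_gb + growth_gb_per_month * 36
```

This reproduces the stated 75 GB initial footprint and 183 GB after three years of growth.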
Depends on: 879173
Based on existing infra placement, I've asked for the new nodes to be placed in the existing bugs VLAN, which will allow for smooth interaction between the Bugzilla database and the new ES cluster.

I'm going to go ahead and assume that you have some sort of application in mind for actually interacting with the new cluster.  Does this application exist already ?  If so, where is it ?  If not, where will it be ?  I ask only so as to ensure that we can get the appropriate network flows opened up in a timely fashion.
The ES cluster is up and ready for you to use:

$ curl http://elasticsearch-zlb.bugs.scl3.mozilla.com:9300/
{
  "ok" : true,
  "status" : 200,
  "name" : "elasticsearch2_bugs_scl3",
  "version" : {
    "number" : "0.20.5",
    "snapshot_build" : false
  },
  "tagline" : "You Know, for Search"
}

The standard triplet of visualisation plugins is available as well: http://elasticsearch-zlb.bugs.scl3.mozilla.com:9300/_plugin/<head|paramedic|bigdesk>/

When you've determined where your app is going to live, feel free to open a netflow request (if necessary) with those details.

If there's anything else I can help you with, please let me know.
Status: NEW → RESOLVED
Closed: 9 years ago
Resolution: --- → FIXED
Thank you!  That was fast!
Thanks! Now to get the software written and working. :)
Flags: needinfo?(mcote)
(In reply to Mark Côté ( :mcote ) from comment #20)
> Thanks! Now to get the software written and working. :)

Is there another bug for that part?
Blocks: 879822
I have created bug 879822 as a tracking bug, since there are a few discrete tasks that need to be done.
Component: Server Operations: Web Operations → WebOps: Other
Product: mozilla.org → Infrastructure & Operations
Depends on: 943087
Product: Infrastructure & Operations → Infrastructure & Operations Graveyard