Find a cluster for Datazilla Alerts ES data

Status: RESOLVED WONTFIX (opened 5 years ago, closed 4 years ago)

People: (Reporter: ekyle, Unassigned)

Whiteboard: [kanban:https://webops.kanbanize.com/ctrl_board/2/213]

Description (Reporter, 5 years ago)
Datazilla Alerts is a cron job responsible for detecting regressions in the Talos and B2G performance data (https://github.com/klahnakoski/datazilla-alerts). It is currently in a pre-alpha state. It uses an ES index to query the various combinations of test results quickly.

We (the Automation and Tools Engineering team) require an ES instance to hold an index for each of the Talos and B2G performance datasets: http://elasticsearch-zlb.webapp.scl3.mozilla.com/ seems most appropriate.

The Talos index is:
size: 51.7G (154G)
docs: 16,483,564 (20,892,329)

The B2G index is much smaller:
size: 602M (1.76G)
docs: 232,286 (275,220)
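For reference, these figures can be pulled from the cluster itself; a minimal sketch, assuming the cluster is reachable and runs an ES version with the cat API (1.0+). _cat/indices reports both the replica-included store size and the primary-only size, which may be what the parenthesized numbers above distinguish:

    # Minimal sketch: list doc counts and store sizes for every index on
    # the proposed cluster (URL taken from this bug; reachability assumed).
    import urllib.request

    ES = "http://elasticsearch-zlb.webapp.scl3.mozilla.com"

    # "?v" adds a header row; columns include docs.count, store.size
    # (including replicas), and pri.store.size (primaries only).
    with urllib.request.urlopen(ES + "/_cat/indices?v") as resp:
        print(resp.read().decode("utf-8"))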
Updated (Reporter, 5 years ago)
Blocks: 1023372
Whiteboard: [kanban:https://kanbanize.com/ctrl_board/4/234]

Comment 1 (4 years ago)
I think this is the best long-term solution to avoid problems like bug 1033145. I think the cluster should be on bare metal and should have a very sizable disk. Given that every developer has a laptop with at least 250 GB of space, I hope it is possible to get a terabyte disk or more. We will only be adding more alert data over time, and this system will become more and more important for detecting regressions in various products.

Comment 2 (4 years ago)
Updating the summary to reflect the fact that we just need a place for this data; the webapp cluster is a possibility, but we're not married to it.

Kyle, have the sizes of these indices increased significantly since this bug was filed? Also, what is the estimated growth for them, and what indices are we planning to add in the future (and how big/what growth)?
Summary: Permission to add index to http://elasticsearch-zlb.webapp.scl3.mozilla.com/ → Find a cluster for Datazilla Alerts ES data

Updated (4 years ago)
Flags: needinfo?(klahnakoski)
Comment 3 hidden (obsolete)

Comment 4 (4 years ago)

Do we need to keep this data forever? In just two years we'll be way past 1 TB...
Comment 5 (Reporter, 4 years ago)
A big mistake above: I forgot about the ES replication (3x the drive space). To answer Mark: I do not think we need two years. One year is enough, and 6 months is tolerable. Furthermore, the data is divided among all machines.

I will repeat my previous comment:

Talos currently grows at about 30GB/month; let's say 40GB/month for safety. Assume we need no more than 12 months of data, plus another 3/7 of headroom (so we do not trigger the 70% disk warning**), and multiply by 3 for ES replication. This gives us 3 * (1 + 3/7) * 12 * 40GB ≈ 2057GB. B2G is 3 * 2.4GB (it already holds a year), and Eideticker is 3 * 0.7GB for 9 months, so no more than 3 * 20GB for those two indices. Note this space requirement is spread over all N nodes (assuming 7 nodes, we have 2117GB / 7 ≈ 302GB per machine).

There are no plans for more indices, but as long as we have the option to add nodes we can handle unexpected growth.

1GB = 2^30 bytes

** ES will require more drive space when it compacts shards.
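Restating that arithmetic as a small sketch (every constant here is an estimate from this comment, not a measured value; the 7-node cluster in particular is assumed):

    # Capacity estimate: 12 months of Talos growth at an assumed
    # 40GB/month, 3/7 extra headroom to stay under the 70% disk
    # warning, and 3x ES replication, spread over an assumed 7 nodes.
    GROWTH_GB_PER_MONTH = 40   # ~30GB/month observed, rounded up for safety
    MONTHS = 12                # retention target; 6 months is tolerable
    REPLICATION = 3            # primary shard plus two replicas
    HEADROOM = 3.0 / 7.0       # keeps usage below the 70% disk watermark
    NODES = 7                  # assumed cluster size

    talos_gb = REPLICATION * (1 + HEADROOM) * MONTHS * GROWTH_GB_PER_MONTH
    other_gb = REPLICATION * 20    # generous bound for B2G + Eideticker
    total_gb = talos_gb + other_gb

    print("Talos: %dGB  total: %dGB  per node: %dGB"
          % (talos_gb, total_gb, total_gb / NODES))
    # -> Talos: 2057GB  total: 2117GB  per node: 302GB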
Comment 6 (Reporter, 4 years ago)
If performance data is to be made public, then it cannot reside on the private cluster in the long term. Let's keep this bug open.
Updated (Reporter, 4 years ago)
See Also: → bug 1063324

Updated (4 years ago)
Whiteboard: [kanban:https://kanbanize.com/ctrl_board/4/234] → [kanban:https://webops.kanbanize.com/ctrl_board/2/213]

Comment 7 (4 years ago)
I spoke with :ekyle (reporter). The architecture is still being designed upstream and may or may not involve The Cloud, so until this is more actionable we decided to WONTFIX this for now.
Status: NEW → RESOLVED
Last Resolved: 4 years ago
Resolution: --- → WONTFIX