Bug 1125903 (Closed) - Opened 10 years ago - Closed 10 years ago

Increase the disk size of the treeherder DB VMs

Categories

(Infrastructure & Operations :: Virtualization)

Type: task
Priority: Not set
Severity: normal

Tracking

(Not tracked)

RESOLVED FIXED

People

(Reporter: mdoglio, Unassigned)

References

Details

(Keywords: treeherder, Whiteboard: [data: serveropt])

Can we please increase the disk size of the treeherder DB VMs? Something like 1TB would be great. Thank you!
Assignee: nobody → server-ops-database
Group: mozilla-employee-confidential
Component: Treeherder: Infrastructure → Database Operations
Product: Tree Management → Data & BI Services Team
QA Contact: laura → scabral
Version: --- → other
Blocks: 1120019
Keywords: treeherder
Summary: Increase the size of the treeherder DB VMs → Increase the disk size of the treeherder DB VMs
Hosts in question:
treeherder1.db.scl3.mozilla.com
treeherder2.db.scl3.mozilla.com
treeherder1.stage.db.scl3.mozilla.com
treeherder2.stage.db.scl3.mozilla.com

Current disk allocation for each is 700GB, which was increased in bug 1076740. We've since reduced the data lifecycle (from 6 months to 4 months), which has partly helped; however, Treeherder now also has to ingest performance result data (since it is due to replace Datazilla too), which has increased our storage requirements. Datazilla currently has 1TB allocated, of which ~400GB is used (https://rpm.newrelic.com/accounts/263620/servers/5242354/disks#id=219726737) - perhaps we could start reducing the data lifecycle there, and shift the allocation to Treeherder instead? (Presuming they're both on the same virtualised infra.)
Group: mozilla-employee-confidential
Whiteboard: [data: serveropt]
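(Not part of the original bug, but for context: a minimal sketch of how one might check where the space is going on a MySQL host like these, by summing data + index size per schema from information_schema. Host and credentials are placeholders, and these figures understate real file-system usage such as fragmentation and binlogs.)

```python
import pymysql

# Connect to one of the treeherder DB hosts (credentials are placeholders).
conn = pymysql.connect(host="treeherder1.db.scl3.mozilla.com",
                       user="readonly", password="...")
with conn.cursor() as cur:
    # Sum on-disk size (data + indexes) per schema, largest first.
    cur.execute("""
        SELECT table_schema,
               ROUND(SUM(data_length + index_length) / POW(1024, 3), 1) AS size_gb
        FROM information_schema.tables
        GROUP BY table_schema
        ORDER BY size_gb DESC
    """)
    for schema, size_gb in cur.fetchall():
        print(f"{schema}: {size_gb} GB")
conn.close()
```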
Let's add 100G if that's OK. Treeherder2 is ready to be done; just coordinate with me.
Assignee: server-ops-database → server-ops-virtualization
Group: infra
Component: Database Operations → Virtualization
Product: Data & BI Services Team → Infrastructure & Operations
QA Contact: scabral → cshields
Datazilla isn't on a VM, so there's no 'transfer' win to be had. Treeherder was already pretty huge in bug 1076740, when 700G was thought to be big enough. It's quite a large percentage of whichever datastore this VM rests on. These are growing to cover new requirements, and the original request in comment 0 was for 1T, so I'm very worried that this 100G isn't sufficient and we'll be back here again in a month. So the questions here are basically (knowing I'm asking you to tell the future): is 100G/node enough, and how confident are you in that? If not, what's the right number? And if this keeps growing, what's the plan, and at what point do we start talking about going physical?
As far as I can tell from New Relic [1], the data growth rate is not tremendous at the moment, and I don't think we will see a huge increase in the next 3 months. In the longer term we have a plan to offload a big chunk of data to S3, which will probably free up 200GB+. I thought the company plan was to move to AWS, making "going to physical" not an option. Am I missing something? [1] https://rpm.newrelic.com/accounts/263620/servers/5241973/disks#id=654553771
Nothing is set in stone on where things end up. There will always be use cases that favor physical, on-site VMs, and the cloud. Anyone who says otherwise is selling something. https://rpm.newrelic.com/accounts/263620/servers/5242253/disks#id=654553771 (treeherder2) shows a much faster growth rate, with 28 days to fill; 100G would make it around 50 days. So, I'm still worrying about the questions in comment 3.
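(A back-of-the-envelope consistency check on those two New Relic figures; the derived growth rate below is an inference from the quoted numbers, not a measured value.)

```python
# Figures quoted above: ~28 days to fill at the current allocation,
# ~50 days if 100G were added.
days_now, days_with_100g, extra_gb = 28, 50, 100

# The extra 100G buys (50 - 28) days, so the implied growth rate is:
daily_growth_gb = extra_gb / (days_with_100g - days_now)   # ~4.5 GB/day
free_gb_now = daily_growth_gb * days_now                   # ~127 GB free today

print(f"implied growth: {daily_growth_gb:.1f} GB/day, "
      f"implied free space: {free_gb_now:.0f} GB")
```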
Thanks for the clarification :-) We need enough space to cover Q1 and Q2 so that we can plan to move some data to S3 in Q2. Given the estimate NR gives for treeherder2, 1TB will cover that. Please let me know how that sounds to you. I have no idea what the maximum disk size allowed for on-site VMs is - do you know where I can find some documentation about that?
It's not a documentation limit, it's me asking questions ahead and trying to protect the environment. We only have so much disk to give, and it's a shared resource among the 500+ guests we have. I have to look out for organic growth (VMs getting larger, VMs being added) and p2v conversions coming from a lot of different directions. We're also under a bit of a space crunch as we shuffle some things around this quarter, so I'm being really protective right now.

Large databases are the worst, because they use a lot of IOPS with queries, and a lot of space that doesn't deduplicate well. They also tend to unbalance our datastores: you get a large single VM that can become difficult to move in maintenance cases and can squeeze out a good number of smaller VMs, leading to further imbalance. It so happens that the treeherder DBs are the biggest VMs we have; they've already expanded hugely once (with a line in the sand of "this far, no further"), and now they're looking to go even bigger.

What I'm looking for is a best guess at "this is what we'll need, and not more than this, and if we're wrong we've got a contingency plan for avoiding having to ask for more space", because 1T VM disks are just getting unwieldy. Whereas I was trying a more gentle line-in-the-sand approach with the 700G, I'm going to have to be way more firm this time unless we get some miraculous changes on the supply side. 1T for prod, okay, but that's it.
We can live with 1T for production; if we start running into that limit, we'll fast-track our work to offload some of the artifacts to S3. What is the ETA for expanding the VM to 1TB?
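(For illustration, a minimal sketch of the kind of S3 offload being proposed: move a large artifact blob out of the database and keep only its key in the row. The bucket name, key layout, and function are hypothetical, not taken from this bug.)

```python
import boto3

s3 = boto3.client("s3")

def offload_artifact(artifact_id: int, blob: bytes) -> str:
    """Upload one artifact blob to S3 and return the key to store in its DB row."""
    key = f"treeherder/artifacts/{artifact_id}.json"  # hypothetical key layout
    s3.put_object(Bucket="treeherder-artifacts", Key=key, Body=blob)
    return key  # store this pointer in the DB instead of the blob itself
```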
:gcox, can we coordinate on this? I can ping you on IRC if you prefer. We can start with the stage machines and treeherder2.db.scl3. We will need CAB approval for a failover in order to do the treeherder1.db.scl3 host.
Oh crap, I forgot you wanted stage. How much more do they need? (Currently 300G.) And yeah, we can IRC it. I'm east-coast.
Coordinated with :mpressman in #data. This afternoon: treeherder2.db.scl3 upped to 1000G; treeherder2.stage.db.scl3 upped to 400G. :mpressman to coordinate a stage failover and ping me to do stage1, and to get the CAB approval for the prod failover.
Depends on: 1131863
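(For context, a hedged sketch of the guest-side steps that typically follow a hypervisor-level disk grow, assuming the data volume is LVM-backed ext4; the device and volume names are placeholders, not taken from this bug.)

```python
import subprocess

def grow_data_volume(pv="/dev/sdb", lv="/dev/vg_data/lv_data"):
    # Tell LVM the underlying physical volume got bigger.
    subprocess.run(["pvresize", pv], check=True)
    # Grow the logical volume into all newly freed extents.
    subprocess.run(["lvextend", "-l", "+100%FREE", lv], check=True)
    # Grow the ext4 filesystem online to match the logical volume.
    subprocess.run(["resize2fs", lv], check=True)
```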
We have increased the disk on treeherder2.db.scl3 and treeherder2.stage.db.scl3. In order to increase treeherder1.db.scl3, we'll need CAB approval, since it'll require a failover. I have created bug 1131863 to get that approval. Please feel free to add to it as you see fit.
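(A minimal sketch of the pre-failover sanity check implied here, assuming a standard MySQL primary/replica pair; the host and credentials are placeholders, and SHOW SLAVE STATUS was the relevant statement on MySQL of this era.)

```python
import pymysql

# Connect to the replica that is about to be promoted (placeholders).
conn = pymysql.connect(host="treeherder2.db.scl3.mozilla.com",
                       user="repl_check", password="...",
                       cursorclass=pymysql.cursors.DictCursor)
with conn.cursor() as cur:
    cur.execute("SHOW SLAVE STATUS")
    status = cur.fetchone()

# Only promote once the replica has fully caught up to the primary.
lag = status["Seconds_Behind_Master"]
assert lag == 0, f"replica is {lag}s behind; hold the failover"
```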
Treeherder cluster failed over; treeherder1 will have its disk increased now.
treeherder1.stage.db.scl3 upped to 400G yesterday. treeherder1.db.scl3 upped to 1000G today.
Thank you :-)
Keeping this open; while the main task is done, it'd be great to fail back the cluster. I know next week is a bad time, so I'll see if I can coordinate tomorrow at 6 am.
Thank you everyone! :-)
This was failed back last week.
Status: NEW → RESOLVED
Closed: 10 years ago
Resolution: --- → FIXED
Group: infra