Bug 1125903 (Closed) - Opened 10 years ago - Closed 10 years ago

Increase the disk size of the treeherder DB VMs

Categories

(Infrastructure & Operations :: Virtualization)

Type: task
Priority: Not set
Severity: normal

Tracking

(Not tracked)

RESOLVED FIXED

People

(Reporter: mdoglio, Unassigned)

References

Details

(Keywords: treeherder, Whiteboard: [data: serveropt])

Can we please increase the disk size of the treeherder DB VMs? Something like 1TB would be great. Thank you!
Assignee: nobody → server-ops-database
Group: mozilla-employee-confidential
Component: Treeherder: Infrastructure → Database Operations
Product: Tree Management → Data & BI Services Team
QA Contact: laura → scabral
Version: --- → other
Blocks: 1120019
Keywords: treeherder
Summary: Increase the size of the treeherder DB VMs → Increase the disk size of the treeherder DB VMs
Hosts in question:
treeherder1.db.scl3.mozilla.com
treeherder2.db.scl3.mozilla.com
treeherder1.stage.db.scl3.mozilla.com
treeherder2.stage.db.scl3.mozilla.com

Current disk allocation for each is 700GB, which was increased in bug 1076740. We've since reduced the data lifecycle (from 6 months to 4 months), which has partly helped; however, Treeherder now also has to ingest performance result data (since it is due to replace Datazilla too), which has increased our storage requirements. Datazilla currently has 1TB allocated, of which ~400GB is used (https://rpm.newrelic.com/accounts/263620/servers/5242354/disks#id=219726737) - perhaps we could start reducing the data lifecycle there, and shift the allocation to Treeherder instead? (Presuming they're both on the same virtualised infra.)
Group: mozilla-employee-confidential
Whiteboard: [data: serveropt]
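(Not part of the original bug, but for context: a minimal sketch of how one might check where the space is going on a MySQL host like these, by summing data + index size per schema from information_schema. Host and credentials are placeholders, and these figures understate real file-system usage such as fragmentation and binlogs.)

```python
import pymysql

# Connect to one of the treeherder DB hosts (credentials are placeholders).
conn = pymysql.connect(host="treeherder1.db.scl3.mozilla.com",
                       user="readonly", password="...")
with conn.cursor() as cur:
    # Sum on-disk size (data + indexes) per schema, largest first.
    cur.execute("""
        SELECT table_schema,
               ROUND(SUM(data_length + index_length) / POW(1024, 3), 1) AS size_gb
        FROM information_schema.tables
        GROUP BY table_schema
        ORDER BY size_gb DESC
    """)
    for schema, size_gb in cur.fetchall():
        print(f"{schema}: {size_gb} GB")
conn.close()
```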
Let's add 100G if that's OK. Treeherder2 is ready to be done; just coordinate with me.
Assignee: server-ops-database → server-ops-virtualization
Group: infra
Component: Database Operations → Virtualization
Product: Data & BI Services Team → Infrastructure & Operations
QA Contact: scabral → cshields
Datazilla isn't on a VM, so there's no 'transfer' win to be had. Treeherder was already pretty huge in bug 1076740, when 700G was thought to be big enough. It's quite a large percentage of whichever datastore this VM rests on. These are growing to cover new requirements, and the original request in comment 0 was for 1T, so I'm very worried that this 100G isn't sufficient and we'll be back here again in a month. So the questions here are basically (knowing I'm asking you to tell the future): is 100G/node enough, and how confident are you in that? If not, what's the right number? And if this keeps growing, what's the plan, and at what point do we start talking about going physical?
As far as I can tell from New Relic [1], the data growth rate is not tremendous at the moment, and I don't think we will see a huge increase in the next 3 months. In the longer term we have a plan to offload a big chunk of data to S3, which will probably free up 200GB+. I thought the company plan was to move to AWS, making "going to physical" not an option. Am I missing something? [1] https://rpm.newrelic.com/accounts/263620/servers/5241973/disks#id=654553771
Nothing is set in stone on where things end up. There will always be use cases that favor physical, on-site VMs, and the cloud. Anyone who says otherwise is selling something. https://rpm.newrelic.com/accounts/263620/servers/5242253/disks#id=654553771 (treeherder2) shows a much faster growth rate, with 28 days to fill; 100G would make it around 50 days. So, I'm still worrying about the questions in comment 3.
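(A back-of-the-envelope consistency check on those two New Relic figures; the derived growth rate below is an inference from the quoted numbers, not a measured value.)

```python
# Figures quoted above: ~28 days to fill at the current allocation,
# ~50 days if 100G were added.
days_now, days_with_100g, extra_gb = 28, 50, 100

# The extra 100G buys (50 - 28) days, so the implied growth rate is:
daily_growth_gb = extra_gb / (days_with_100g - days_now)   # ~4.5 GB/day
free_gb_now = daily_growth_gb * days_now                   # ~127 GB free today

print(f"implied growth: {daily_growth_gb:.1f} GB/day, "
      f"implied free space: {free_gb_now:.0f} GB")
```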
Thanks for the clarification :-) We need enough space to cover Q1 and Q2 so that we can plan to move some data to S3 in Q2. Given the estimate NR gives for treeherder2, 1TB will cover that. Please let me know how that sounds to you. I have no idea what the maximum disk size allowed for on-site VMs is - do you know where I can find some documentation about that?
It's not a documentation limit, it's me asking questions ahead and trying to protect the environment. We only have so much disk to give, and it's a shared resource among the 500+ guests we have. I have to look out for organic growth (VMs getting larger, VMs being added) and p2v conversions coming from a lot of different directions. We're also under a bit of a space crunch as we shuffle some things around this quarter, so I'm being really protective right now.

Large databases are the worst, because they use a lot of IOPS with queries, and a lot of space that doesn't deduplicate well. They also tend to unbalance our datastores: you get a large single VM that can become difficult to move in maintenance cases and can squeeze out a good number of smaller VMs, leading to further imbalance. It so happens that the treeherder DBs are the biggest VMs we have; they've already expanded hugely once (with a line in the sand of "this far, no further"), and now they're looking to go even bigger.

What I'm looking for is a best guess at "this is what we'll need, and not more than this, and if we're wrong we've got a contingency plan for avoiding having to ask for more space", because 1T VM disks are just getting unwieldy. Whereas I was trying a more gentle line-in-the-sand approach with the 700G, I'm going to have to be way more firm this time unless we get some miraculous changes on the supply side. 1T for prod, okay, but that's it.
We can live with 1T for production; if we start running into that limit, we'll fast-track our work to offload some of the artifacts to S3. What is the ETA for expanding the VM to 1TB?
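(For illustration, a minimal sketch of the kind of S3 offload being proposed: move a large artifact blob out of the database and keep only its key in the row. The bucket name, key layout, and function are hypothetical, not taken from this bug.)

```python
import boto3

s3 = boto3.client("s3")

def offload_artifact(artifact_id: int, blob: bytes) -> str:
    """Upload one artifact blob to S3 and return the key to store in its DB row."""
    key = f"treeherder/artifacts/{artifact_id}.json"  # hypothetical key layout
    s3.put_object(Bucket="treeherder-artifacts", Key=key, Body=blob)
    return key  # store this pointer in the DB instead of the blob itself
```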
:gcox, can we coordinate on this? I can ping you on IRC if you prefer. We can start with the stage machines and treeherder2.db.scl3. We will need CAB approval for a failover in order to do the treeherder1.db.scl3 host.
Oh crap, I forgot you wanted stage. How much more do they need? (Currently 300G.) And yeah, we can IRC it. I'm east-coast.
Coordinated with :mpressman in #data. This afternoon: treeherder2.db.scl3 upped to 1000G; treeherder2.stage.db.scl3 upped to 400G. :mpressman to coordinate a stage failover and ping me to do stage1, and to get the CAB approval for the prod failover.
Depends on: 1131863
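(For context, a hedged sketch of the guest-side steps that typically follow a hypervisor-level disk grow, assuming the data volume is LVM-backed ext4; the device and volume names are placeholders, not taken from this bug.)

```python
import subprocess

def grow_data_volume(pv="/dev/sdb", lv="/dev/vg_data/lv_data"):
    # Tell LVM the underlying physical volume got bigger.
    subprocess.run(["pvresize", pv], check=True)
    # Grow the logical volume into all newly freed extents.
    subprocess.run(["lvextend", "-l", "+100%FREE", lv], check=True)
    # Grow the ext4 filesystem online to match the logical volume.
    subprocess.run(["resize2fs", lv], check=True)
```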
We have increased the disk on treeherder2.db.scl3 and treeherder2.stage.db.scl3. In order to increase treeherder1.db.scl3, we'll need CAB approval, since it'll require a failover. I have created bug 1131863 to get that approval. Please feel free to add to it as you see fit.
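(A minimal sketch of the pre-failover sanity check implied here, assuming a standard MySQL primary/replica pair; the host and credentials are placeholders, and SHOW SLAVE STATUS was the relevant statement on MySQL of this era.)

```python
import pymysql

# Connect to the replica that is about to be promoted (placeholders).
conn = pymysql.connect(host="treeherder2.db.scl3.mozilla.com",
                       user="repl_check", password="...",
                       cursorclass=pymysql.cursors.DictCursor)
with conn.cursor() as cur:
    cur.execute("SHOW SLAVE STATUS")
    status = cur.fetchone()

# Only promote once the replica has fully caught up to the primary.
lag = status["Seconds_Behind_Master"]
assert lag == 0, f"replica is {lag}s behind; hold the failover"
```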
Treeherder cluster failed over; treeherder1 will have its disk increased now.
treeherder1.stage.db.scl3 upped to 400G yesterday. treeherder1.db.scl3 upped to 1000G today.
Thank you :-)
Keeping this open; while the main task is done, it'd be great to fail back the cluster. I know next week is a bad time, so I'll see if I can coordinate tomorrow at 6 am.
Thank you everyone! :-)
This was failed back last week.
Status: NEW → RESOLVED
Closed: 10 years ago
Resolution: --- → FIXED
Group: infra