Closed
Bug 1125903
Opened 10 years ago
Closed 10 years ago
Increase the disk size of the treeherder DB VMs
Categories
(Infrastructure & Operations :: Virtualization, task)
Infrastructure & Operations
Virtualization
Tracking
(Not tracked)
RESOLVED
FIXED
People
(Reporter: mdoglio, Unassigned)
References
Details
(Keywords: treeherder, Whiteboard: [data: serveropt])
Can we please increase the size of those VMs? Something like 1TB would be great.
Thank you!
Reporter
Updated•10 years ago
Assignee: nobody → server-ops-database
Group: mozilla-employee-confidential
Component: Treeherder: Infrastructure → Database Operations
Product: Tree Management → Data & BI Services Team
QA Contact: laura → scabral
Version: --- → other
Updated•10 years ago
Blocks: 1120019
Keywords: treeherder
Summary: Increase the size of the treeherder DB VMs → Increase the disk size of the treeherder DB VMs
Comment 1•10 years ago
Hosts in question:
treeherder1.db.scl3.mozilla.com
treeherder2.db.scl3.mozilla.com
treeherder1.stage.db.scl3.mozilla.com
treeherder2.stage.db.scl3.mozilla.com
Current disk allocation for each is 700GB, which was increased in bug 1076740.
We've since reduced the data lifecycle (from 6 months to 4 months), which has helped somewhat; however, Treeherder now also has to ingest performance result data (since it is due to replace Datazilla too), which has increased our storage requirements.
Datazilla currently has 1TB allocated, of which ~400GB is used (https://rpm.newrelic.com/accounts/263620/servers/5242354/disks#id=219726737) - perhaps we could start reducing the data lifecycle there and shift the allocation to Treeherder instead? (Presuming they're both on the same virtualised infra.)
Group: mozilla-employee-confidential
Updated•10 years ago
Whiteboard: [data: serveropt]
Comment 2•10 years ago
Let's add 100G if that's OK.
Treeherder2 is ready to be done, just coordinate with me.
Assignee: server-ops-database → server-ops-virtualization
Group: infra
Component: Database Operations → Virtualization
Product: Data & BI Services Team → Infrastructure & Operations
QA Contact: scabral → cshields
Comment 3•10 years ago
Datazilla isn't on a VM, so there's no 'transfer' win to be had.
Treeherder was already pretty huge in bug 1076740, when 700G was thought to be big enough. It's quite a large percentage of whichever datastore this VM rests on. These are growing to cover new requirements, and the original request in comment 0 was for 1T, so I'm very worried that this 100G isn't sufficient and we'll be back here again in a month.
So the questions here are basically (knowing I'm asking you to tell the future): is 100G/node enough, and how confident are you in that? If not, what's the right number? And, if this keeps growing, what's the plan, and at what point do we start talking about going to physical?
Reporter
Comment 4•10 years ago
As far as I can tell from New Relic [1], the data growth rate is not tremendous at the moment, and I don't think we will see a huge increase in the next 3 months.
In the longer term we have a plan to offload a big chunk of data to S3. That will probably free up 200GB or more.
I thought the company plan was to move to AWS, making "going to physical" not an option. Am I missing something?
[1] https://rpm.newrelic.com/accounts/263620/servers/5241973/disks#id=654553771
Comment 5•10 years ago
Nothing is set in stone on where things end up. There will always be use cases that favor physical, on-site VMs, and the cloud. Anyone who says otherwise is selling something.
https://rpm.newrelic.com/accounts/263620/servers/5242253/disks#id=654553771 (treeherder2) shows a much faster growth rate, with 28 days to fill; 100G would make it around 50 days. So, I'm still worrying about the questions in comment 3.
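The "days to fill" figures above come from simple linear extrapolation of the current growth rate. A minimal sketch of that arithmetic (the numbers here are illustrative, not the actual treeherder2 figures from New Relic):

```python
def days_to_fill(capacity_gb: float, used_gb: float, growth_gb_per_day: float) -> float:
    """Days until the disk is full, assuming linear growth at the current rate."""
    if growth_gb_per_day <= 0:
        return float("inf")  # not growing: never fills
    return (capacity_gb - used_gb) / growth_gb_per_day

# Hypothetical numbers: at ~4.5 GB/day, a disk 126 GB from full lasts ~28 days;
# adding 100 GB of headroom stretches that to roughly 50 days.
print(days_to_fill(700, 574, 4.5))  # → 28.0
print(days_to_fill(800, 574, 4.5))  # roughly 50
```

This only shows why a 100G bump buys weeks rather than quarters at a steady growth rate; the real projection in the comment comes from New Relic's monitoring, not this formula.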
Reporter
Comment 6•10 years ago
Thanks for the clarification :-)
We need enough space to cover Q1 and Q2, so that we can plan to move some data to S3 in Q2.
Given the estimate NR gives for treeherder2, 1TB will cover that.
Please let me know how that sounds to you. I have no idea what the maximum disk size allowed for on-site VMs is; do you know where I can find some documentation about that?
Comment 7•10 years ago
It's not a documentation limit, it's me asking questions ahead and trying to protect the environment. We only have so much disk to give, and it's a shared resource among the 500+ guests we have. I have to look out for the organic growth (VMs getting larger, VMs being added) and p2v conversions coming from a lot of different directions. We're also under a little bit of space crunch as we shuffle some things around this quarter, so I'm being really protective right now.
Large databases are the worst, because they use a lot of IOPS with queries, and a lot of space that doesn't deduplicate well. They also tend to unbalance our datastores: you get a large single VM that can become difficult to move in maintenance cases and can squeeze out a good number of smaller VMs, leading to further imbalance. It so happens that the treeherder DBs are the biggest VMs we have; they've already expanded hugely once (with a line in the sand of "this far, no further"), and now they're looking to go even bigger.
What I'm looking for is a best guess at "this is what we'll need, and not more than this, and if we're wrong we've got a contingency plan for avoiding having to ask for more space" because 1T VM disks are just getting unwieldy. Whereas I was trying a more gentle line-in-the-sand approach with the 700G, I'm going to have to be way more firm this time unless we get some miraculous changes on the supply side.
1T for prod, okay, but that's it.
Comment 8•10 years ago
We can live with 1T for production; if we start running into that limit we'll fast track our work to offload some of the artifacts to S3.
What is the ETA for expanding the VM to 1TB?
Comment 9•10 years ago
:gcox, can we coordinate on this? I can ping you on IRC if you prefer. We can start with the stage machines and treeherder2.db.scl3. We will need CAB approval for a failover in order to do the treeherder1.db.scl3 host.
Comment 10•10 years ago
Oh crap, I forgot you wanted stage. How much more do they need? (Currently 300G.)
And yeah, we can IRC it. I'm east-coast.
Comment 11•10 years ago
Coordinated with :mpressman in #data. This afternoon:
treeherder2.db.scl3 upped to 1000G.
treeherder2.stage.db.scl3 upped to 400G.
:mpressman to coordinate a stage failover and ping me to do stage1, and to get CAB approval for the prod failover.
Comment 12•10 years ago
We have increased the disk on treeherder2.db.scl3 and treeherder2.stage.db.scl3. In order to increase treeherder1.db.scl3, we'll need CAB approval, since it'll require a failover. I have created bug 1131863 to get that approval. Please feel free to add to it as you see fit.
Comment 13•10 years ago
treeherder cluster failed over, treeherder1 will have its disk increased now.
Comment 14•10 years ago
treeherder1.stage.db.scl3 upped to 400G yesterday.
treeherder1.db.scl3 upped to 1000G today.
Reporter
Comment 15•10 years ago
Thank you :-)
Comment 16•10 years ago
Keeping this open; while the main task is done, it'd be great to fail back the cluster. I know next week is a bad time, so I'll see if I can coordinate tomorrow at 6 am.
Comment 17•10 years ago
Thank you everyone! :-)
Comment 18•10 years ago
This was failed back last week.
Status: NEW → RESOLVED
Closed: 10 years ago
Resolution: --- → FIXED
Updated•9 years ago
Group: infra