virtualize datazilla1.webapp.scl3

RESOLVED FIXED

Status

Infrastructure & Operations
WebOps: Other
RESOLVED FIXED
3 years ago
2 years ago

People

(Reporter: gcox, Assigned: gcox)

Tracking

({p2v})

Details

(Whiteboard: [kanban:webops:https://kanbanize.com/ctrl_board/4/1973] [vm-p2v:1])

(Assignee)

Description

3 years ago
datazilla1.webapp.scl3 is on hardware that will be going out-of-warranty soon.  It appears to be lightly utilized and a good candidate for a p2v conversion.

Assuming it's still needed, we'd like to get a window to take it down for a little while to convert it.

Comment 1

3 years ago
AFAIK, dev svcs has had no part in maintaining datazilla. I don't know who has been responsible for it, though.
(Assignee)

Comment 2

3 years ago
Took a stab at filing it based on the creation bugs from many moons ago.  :jeads is listed as the admin in mana.
I'm totally open to it being shuffled around.
(Assignee)

Updated

3 years ago
Keywords: p2v
(Assignee)

Updated

3 years ago
Flags: needinfo?(jeads)

Comment 3

3 years ago
I'm the primary developer on datazilla. Not entirely sure what "p2v conversion" implies. Is that moving the webservice+database to internal virtualized nodes? The webservice should be fine; the database, especially talos_objectstore_1/talos_perftest_1, might be a bit large for that. We are migrating it to treeherder, but that will carry over into Q4. I think virtualizing the non-talos databases in datazilla should be fine.
 
What is the timeline for this?
Flags: needinfo?(jeads)
(Assignee)

Comment 4

3 years ago
p2v is "this box is a physical piece of hardware, let's take a snapshot of it and turn it into a VM that runs in ESX."

The timeline is "now-ish/soon as you'll let us": warranty on the webapp box ran out on 2014-08-23, so the sooner we can do it, the better we protect it from hardware failure.  Downtime is about an hour where we'll turn off apache and stuff, do the conversion, and then bring it back up as a VM.

The database boxes are on the 'spring cleaning' list: they're out of warranty, but their impending cutover to treeherder means we're leaving them alone.

This is only focusing on the webapp box, datazilla1.webapp.scl3.  What we need is to know when we can take the downtime to convert it and who to notify.  We've got decent coverage, pretty easily 0400-1600 PDT weekdays, and any other time with some coordination.
(Assignee)

Updated

3 years ago
See Also: → bug 1079102

Updated

3 years ago
Assignee: server-ops-devservices → server-ops-webops
Component: Server Operations: Developer Services → WebOps: Other
Product: mozilla.org → Infrastructure & Operations

Updated

3 years ago
Whiteboard: [kanban:https://kanbanize.com/ctrl_board/4/1560] [kanban:https://kanbanize.com/ctrl_board/4/1561]
Status: NEW → RESOLVED
Last Resolved: 3 years ago
Resolution: --- → DUPLICATE
Duplicate of bug: 1079102
(Assignee)

Comment 6

3 years ago
Not a dupe.  This one's prod, the other is stage.
Following sheeri's lead and ni'ing :wlach about timing of an outage.
Status: RESOLVED → REOPENED
Flags: needinfo?(wlachance)
Resolution: DUPLICATE → ---

Comment 7

3 years ago
How long a maintenance window are we talking about? There are three primary users of datazilla currently:

(1) dzAlerts -- alerts sent out based on regressions detected in talos data (Kyle L is working on this). This is soon to be taken down (hopefully by end of Q4)
(2) b2g performance numbers -- probably the most important use of datazilla right now, we use this to track startup times as commits are made to gaia, etc. (Dave Hunt maintains this, I think?)
(3) Benchmarking Firefox vs. other browsers (Dan Minor maintains this, I think?)

If we can avoid it, I'd prefer if we could postpone this migration until (1) is completed and we can turn datazilla submission for talos off (otherwise we'll need to close the trees so our jobs don't turn orange). For (2) and (3) CC'ing the owners of these components to figure out what the implications of turning datazilla off would be.
Flags: needinfo?(wlachance)
Flags: needinfo?(dminor)
Flags: needinfo?(dave.hunt)
(Assignee)

Comment 8

3 years ago
Ballpark downtime for something this small is quoted at 'an hour'; reality is usually about 30 minutes.

We missed the official cutoff for the tree-closing window on 12/20, but we could probably get an exception for this if we ask by Wednesday, 9am PT.

Keep in mind, given that you're on an HP blade that's now out of warranty, you're gambling on there being no hardware failure before this migrates, so we'd like to do it ASAP, but: your box, your call.

Comment 9

3 years ago
(In reply to William Lachance (:wlach) from comment #7)
> How long a maintenance window are we talking about? There are three primary
> users of datazilla currently:
> 
> (1) dzAlerts -- alerts sent out based on regressions detected in talos data
> (Kyle L is working on this). This is soon to be taken down (hopefully by end
> of Q4)
> (2) b2g performance numbers -- probably the most important use of datazilla
> right now, we use this to track startup times as commits are made to gaia,
> etc. (Dave Hunt maintains this, I think?)
> (3) Benchmarking Firefox vs. other browsers (Dan Minor maintains this, I
> think?)
> 
> If we can avoid it, I'd prefer if we could postpone this migration until (1)
> is completed and we can turn datazilla submission for talos off (otherwise
> we'll need to close the trees so our jobs don't turn orange). For (2) and
> (3) CC'ing the owners of these components to figure out what the
> implications of turning datazilla off would be.

Bug 1110270 tracks moving Mozbench from datazilla. By end of day today I should no longer be reporting anything to datazilla.
Flags: needinfo?(dminor)

Comment 10

3 years ago
Deferring to Eli, who is actively working on Firefox OS performance, although I suspect we can tolerate the downtime that's predicted.
Flags: needinfo?(dave.hunt) → needinfo?(eperelman)

Comment 11

3 years ago
We can certainly tolerate that downtime. +1 from me.
Flags: needinfo?(eperelman)
(Assignee)

Comment 12

3 years ago
Based on checking with :hwine and the other signoffs I've seen, I'm proposing we do this during this Saturday's treeclosing window.

If there's a reason not to, let me know, otherwise I'll take it to the change board tomorrow.
Status: REOPENED → NEW

Comment 13

3 years ago
As long as the trees are closed, I have no objections.

Updated

3 years ago
Whiteboard: [kanban:https://kanbanize.com/ctrl_board/4/1561]

Updated

3 years ago
Whiteboard: [kanban:webops:https://kanbanize.com/ctrl_board/4/1973]
(Assignee)

Updated

3 years ago
Flags: cab-review?
(Assignee)

Comment 14

3 years ago
CAB approved, aiming for really early in the window (~0900 PT, 20 Dec).
Blocks: 1111702
Flags: cab-review? → cab-review+
(Assignee)

Comment 15

3 years ago
p2v completed.  Downtime was 0900-~0930 PT.

Based on usage over the last week, downsized to 1 core, 8G of RAM, and 40G disk (it was using almost no CPU and about 5G of RAM).

Nagios shows green.
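
The downsizing in the comment above can be illustrated with a small sketch. The comment does not say how the new sizes were chosen, so the rule here (observed peak plus 25% headroom, RAM rounded up to a power of two, never fewer than one core) is only an assumption that happens to reproduce the 8G and 1-core figures. The function names and the headroom value are hypothetical.

```python
import math


def next_power_of_two(n: int) -> int:
    """Return the smallest power of two that is >= n (for n >= 1)."""
    return 1 << (n - 1).bit_length()


def suggest_ram_gib(peak_used_gib: float, headroom: float = 0.25) -> int:
    """Assumed rule: peak usage plus headroom, rounded up to a power of two."""
    target = math.ceil(peak_used_gib * (1 + headroom))
    return next_power_of_two(max(target, 1))


def suggest_cores(peak_cpu_fraction: float, current_cores: int,
                  headroom: float = 0.25) -> int:
    """Assumed rule: cores needed for the observed peak plus headroom, minimum 1.

    peak_cpu_fraction is the peak utilization of the whole box (0.0 to 1.0).
    """
    needed = peak_cpu_fraction * current_cores * (1 + headroom)
    return max(1, math.ceil(needed))


if __name__ == "__main__":
    # 5G of RAM in use, per the comment; the CPU inputs are made-up placeholders.
    print(suggest_ram_gib(5))
    print(suggest_cores(0.0, 8))
```

With the 5G figure from the comment, `suggest_ram_gib(5)` returns 8, and a box with negligible CPU load comes out at the 1-core minimum.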
Status: NEW → RESOLVED
Last Resolved: 3 years ago
Resolution: --- → FIXED
(Assignee)

Updated

3 years ago
Assignee: server-ops-webops → gcox
Whiteboard: [kanban:webops:https://kanbanize.com/ctrl_board/4/1973] → [kanban:webops:https://kanbanize.com/ctrl_board/4/1973] [vm-p2v:1]

Updated

2 years ago
Cab Review: --- → approved
Flags: cab-review+