can't access l10n dashboard, l10n-dashboard1.webapp.scl3.mozilla.com

RESOLVED FIXED

Status

Infrastructure & Operations
MOC: Service Requests
RESOLVED FIXED
3 years ago
3 years ago

People

(Reporter: Pike, Assigned: w0ts0n)

(Reporter)

Description

3 years ago
kochbuch:a10n ahecht$ ssh root@bm-l10n-dashboard01
ssh: connect to host bm-l10n-dashboard01 port 22: Connection refused
kochbuch:a10n ahecht$ host bm-l10n-dashboard01
bm-l10n-dashboard01.mozilla.org is an alias for l10n-dashboard1.webapp.scl3.mozilla.com.
l10n-dashboard1.webapp.scl3.mozilla.com has address 10.22.81.129
Assignee: nobody → ludovic
Can't log on to the console with the current root password.
Also, the login prompt says bm-l10n.
Restarted the VM
Fri 04:25:13 PST [5033] l10n-dashboard1.webapp.scl3.mozilla.com (10.22.81.129) is DOWN :PING CRITICAL - Packet loss = 100%
I now have the Ubuntu X login, which doesn't let me log in with the current root password.
Can't get my hands on the machine. Can you guys have a look?
Component: MOC: Service Requests → Virtualization
QA Contact: lypulong → cshields
(Assignee)

Comment 6

3 years ago
Tried booting into single-user mode; it's password-locked even from there.

I think the next step is to try via a recovery disc.

Comment 7

3 years ago
Root password found after a brief brute force through the old passwords. Reset it to the current infra password.
Since that's not something you'd want to hand outside IT, passing this back for figuring out where to go from here.
Component: Virtualization → MOC: Service Requests
QA Contact: cshields → lypulong
We need to see if Axel can log on to the box.
(Reporter)

Comment 9

3 years ago
I'm timing out when trying to ssh in:

kochbuch:a10n ahecht$ ssh -vvv dashboard@bm-l10n-dashboard01
OpenSSH_6.2p2, OSSLShim 0.9.8r 8 Dec 2011
debug1: Reading configuration data /Users/ahecht/.ssh/config
debug1: Reading configuration data /etc/ssh_config
debug1: /etc/ssh_config line 20: Applying options for *
debug1: /etc/ssh_config line 53: Applying options for *
debug2: ssh_connect: needpriv 0
debug1: Connecting to bm-l10n-dashboard01 [10.22.81.129] port 22.

(seems that that's the step that doesn't make progress)

A bit of context: right now we can't access any data for shipping our localized builds, and I'll be traveling for community meetups in India tomorrow morning.

The web server doesn't seem to be reachable either (that'd go through the zeus proxy)
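
For anyone triaging a similar outage: the two symptoms in this bug (first "Connection refused", later a connect that never completes) point at different layers, so it can help to classify them explicitly. Below is a minimal sketch using only Python's standard `socket` module. The function name `classify_tcp` and the result labels are made up for illustration and are not part of any tooling mentioned in this bug.

```python
import errno
import socket


def classify_tcp(host: str, port: int, timeout: float = 5.0) -> str:
    """Return 'open', 'refused', 'timeout', 'dns-failure', or 'error:<name>'."""
    try:
        # create_connection resolves the name and attempts a TCP handshake.
        with socket.create_connection((host, port), timeout=timeout):
            return "open"
    except socket.gaierror:
        # The name could not be resolved at all.
        return "dns-failure"
    except socket.timeout:
        # No answer within the timeout: host down, or packets dropped en route.
        return "timeout"
    except ConnectionRefusedError:
        # The host answered, but nothing is listening on that port.
        return "refused"
    except OSError as exc:
        # Anything else, e.g. EHOSTUNREACH or ENETUNREACH.
        return "error:" + errno.errorcode.get(exc.errno, "unknown")
```

Calling `classify_tcp("bm-l10n-dashboard01", 22)` and then the same for port 80 would show whether ssh and the web server fail the same way. A "refused" result generally means the host is up but the service is not listening, while "timeout" is more consistent with the host being down or its network configuration being wrong.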
(Reporter)

Updated

3 years ago
Duplicate of this bug: 1128272
(Reporter)

Comment 11

3 years ago
This machine is central to our localization of Firefox; as long as it's gone, we can't take updates.

Given the severity of the impact on the product we ship, I'd appreciate daily updates on the progress here. (I'm in India, right now, my day is over, I know yours is not)
This is blocking beta builds.
Severity: normal → blocker
(Assignee)

Updated

3 years ago
Assignee: ludovic → rwatson
Severity: blocker → normal
(Assignee)

Comment 13

3 years ago
This should be resolved. I'll update the bug tomorrow.
(Reporter)

Comment 14

3 years ago
I can confirm that I can access the machine again, and that it's responding to web requests again, too.

The bots are all started and seem to be working as expected, within the context of this bug. Filed bug 1128833 on the (unrelated and predating) ES problems, for curious folks looking for links.
(Assignee)

Comment 15

3 years ago
Thanks for checking Axel.

It appears someone had made some changes to the networking configuration on the machine.

It was set back to what it should be (some Debian-specific stuff) and this resolved the issue.

Since this isn't a standard machine with our standard revision control, I can't tell you who made the changes or when they happened :/
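
As a rough illustration of what revision control would have provided here, the sketch below records SHA-256 digests of configuration files so that a later run can at least show that a file changed. It is an assumption-laden example, not what IT actually runs: the helper names `snapshot` and `changed` are hypothetical, and the bug does not say which file was edited (on Debian-style systems of that era, `/etc/network/interfaces` would be a typical candidate). It also cannot say who made a change or when, only that the contents differ between two snapshots.

```python
import hashlib
from pathlib import Path


def snapshot(paths):
    """Map each existing file path to the SHA-256 hex digest of its contents."""
    result = {}
    for p in map(Path, paths):
        if p.is_file():
            result[str(p)] = hashlib.sha256(p.read_bytes()).hexdigest()
    return result


def changed(old, new):
    """Return sorted paths whose digest differs, or that were added or removed."""
    return sorted(k for k in old.keys() | new.keys() if old.get(k) != new.get(k))
```

Storing the result of `snapshot([...])` somewhere off the machine and comparing it periodically with `changed(old, new)` would flag silent edits. Keeping the files themselves under version control is the more complete answer.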
Status: NEW → RESOLVED
Last Resolved: 3 years ago
Resolution: --- → FIXED
(In reply to Ryan Watson [:w0ts0n] from comment #15)
> Since this isn't a standard machine with our standard revision control. I
> can't tell you who made the changes or when they happened :/
Any reason why this is not a standard machine?
l10n.m.o is a key element for the Firefox releases. Having this machine down for 3 days can hurt the release process...
(Assignee)

Comment 17

3 years ago
I'm not sure either. Typically we use RHEL, and Webops manages our systems.

This is an Ubuntu 12.04 machine that has local user accounts and seems to have been set up by someone outside of IT.

Maybe Axel or Laura have more information.
Flags: needinfo?(laura)
Flags: needinfo?(l10n)
(Reporter)

Comment 18

3 years ago
https://bugzilla.mozilla.org/showdependencytree.cgi?id=652792&hide_resolved=0 is the lengthy story of our attempt to get this onto a better-maintained setup.

The key part was storage for data, which I considered to be resolved, but as you can see via bug 1128833, that also regressed (without triggering any errors, too). So there's more to learn here.

Plus, now the new thing is to get things out of our datacenters, which adds another dimension of design mistakes to the picture.

Historically, this setup is a migration from what we had on the l10n community server, and that was an ubuntu, too.
Flags: needinfo?(laura)
Flags: needinfo?(l10n)
(In reply to Axel Hecht [:Pike] from comment #18)

> Plus, now the new thing is to get things out of our datacenters, which adds
> another dimension of design mistakes to the picture.
> 
> Historically, this setup is a migration from what we had on the l10n
> community server, and that was an ubuntu, too.

I think it would make sense to spend some time making this machine more infra-like, so we never run into an issue at release time. Seeing how painful the previous migration was, it would be nice if this had dev, stage, and production instances. I know setting these up might take a lot of unplanned time from Pike's team, but in the long run releasing will be less prone to this machine going down.

Corey, how do we spin up a project like this?
Flags: needinfo?(cshields)
(Reporter)

Comment 20

3 years ago
Please go through the complete array of dependencies of when we tried this last time before going for another cycle of that.

Note, dev/stage/prod instances are out of the question due to the file storage problem.
I agree that supporting this on one of our standardized platforms is something that should be done; however, we won't have the resources in the next couple of quarters.
Flags: needinfo?(cshields)