OVH servers are down (e.g. http://l10n.mozilla-community.org/, mozilla.si)

Status: NEW
Product: Infrastructure & Operations
Component: Community IT: Hosting
Priority: P1
Severity: critical
Opened: a month ago
Updated: 2 hours ago
Reporter: flod
Assignee: Unassigned
Blocks: 2 bugs

(Reporter)

Description

a month ago
The server is currently unreachable. This time it shouldn't be a problem with OVH renewals, since those happen mid-month.

It was up and disappeared about 5 minutes ago, and it doesn't respond to pings either.

@reed
Any idea?
(Reporter)

Updated

a month ago
Flags: needinfo?(reed)
(Reporter)

Comment 1

a month ago
Other OVH servers are down, so it seems to be that problem once again :-\
Flags: needinfo?(reed) → needinfo?(hmitsch)
(Reporter)

Updated

a month ago
Assignee: reed → nobody
Component: Localization Server → Community IT: Hosting
Product: mozilla.org → Infrastructure & Operations
(Reporter)

Updated

a month ago
Summary: http://l10n.mozilla-community.org/ is down → OVH servers are down (e.g. http://l10n.mozilla-community.org/, mozilla.si)
Hi Tom,

do you have a renewal link to get the payment done?

Best regards,
   Henrik
Flags: needinfo?(hmitsch) → needinfo?(tom)
(Reporter)

Comment 3

29 days ago
According to IRC, the problem should be fixed, but neither of the servers mentioned in the summary is online.

If possible, it would really help to give me access to the OVH panel for l10n.mozilla-community.org (I already have access for another VPS).
Context about the "blocker" status: none of Fennec Nightly, Beta or Release can be published on the Google Play Store. This happened with yesterday's nightly[1]. Today, Fennec 55.0b12 needs to be released.

The reason for the failure is the following: [2] is used by our publishing script[3] to fetch the latest strings to display on Google Play. If the strings can't be fetched, the script fails without publishing anything (just like [1]).

We need this service to be up if we want to continue to release Fennec automatically (and securely).


[1] https://tools.taskcluster.net/groups/EjVa6X1GROiMNA3sBZoG7Q/tasks/fcS5fKeLRx6GFiBFDyxDuA/runs/0/logs/public%2Flogs%2Flive_backing.log
[2] https://l10n.mozilla-community.org/stores_l10n/
[3] https://github.com/mozilla-releng/mozapkpublisher
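
For readers unfamiliar with the failure mode described above (the publishing script aborts when the store strings cannot be fetched), here is a rough sketch. It is illustrative only and is not code from mozapkpublisher: the names `fetch_store_strings`, `publish` and `StringsUnavailable`, the injectable `opener` and `upload` parameters, and the assumption that the endpoint returns JSON are all hypothetical. Only the URL [2] comes from this bug.

```python
import json
import urllib.request
from urllib.error import URLError

# URL [2] from this bug; the response format (JSON) is an assumption.
STORES_L10N_URL = "https://l10n.mozilla-community.org/stores_l10n/"


class StringsUnavailable(Exception):
    """Hypothetical error: store strings could not be fetched."""


def fetch_store_strings(url=STORES_L10N_URL, opener=urllib.request.urlopen, timeout=10):
    """Fetch and decode the store strings, or raise StringsUnavailable."""
    try:
        with opener(url, timeout=timeout) as response:
            return json.loads(response.read().decode("utf-8"))
    except (URLError, OSError, ValueError) as exc:
        # Covers unreachable hosts, timeouts, and undecodable responses.
        raise StringsUnavailable(f"could not fetch {url}: {exc}") from exc


def publish(fetch=fetch_store_strings, upload=print):
    """Fail closed: fetch first, and only upload if the fetch succeeded."""
    strings = fetch()  # raises before anything is uploaded
    upload(strings)
    return True
```

With the server unreachable, `publish()` raises before `upload` is ever called, so nothing is published, which matches what was seen in [1].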
Severity: normal → blocker
Priority: -- → P1
(Reporter)

Comment 5

29 days ago
Henrik, sorry but we need to find a solution here, and I have no idea who else can help.
Flags: needinfo?(hmitsch)
We paid for the servers not too long after this bug was opened, and all of the other servers appear to be back online. Tad is working on kicking this server but is having trouble with the panel.
We've contacted OVH support to try and get the instance back online.

Comment 8

29 days ago
FTR, mozilla.si seems to be down still, too.
Hi jlorenzo and flod,

let's stop the bleeding first. I will make sure that anybody at ParSys who can help to get the OVH servers back online will be working on that.

Following the immediate fix, I will make it a priority to get to a sustainable solution. We did not know that the Firefox deployment toolchain depends on this server. This is of course unacceptable and we are happy to support the L10N people in moving this to sustainable ParSys AWS infrastructure.

Let's focus on getting the server back first.

Best regards,
   Henrik
Flags: needinfo?(hmitsch)
Yousef and I just spoke to OVH phone support: there is a data center malfunction on OVH's side.
We explained the case and got buy-in that OVH will fix this as quickly as possible.

Fingers crossed!

@Tad: thanks for providing us all the necessary details in the backchannel! :-)
Flags: needinfo?(tom)
This is the link to track the OVH outage:

http://travaux.ovh.net/?do=details&id=26171&edit=yep
Hi, Henrik, Yousef, I have a few questions to try and help us gauge the immediate impact. Is it possible for one of you to get an ETA for return of service from the hosting provider? Is there any indication of what the underlying issue is? Are there backups of the data in case we need to restore it elsewhere?

Thanks for your help resolving this issue.
Flags: needinfo?(yousef)
Flags: needinfo?(hmitsch)
See Also: → bug 1384083
Hi arr,

On the phone, OVH support told us that things should be back in service 'in a few minutes'. Obviously this has not been the case. We do not have an ETA. The ticket is also very sparse on details. Not sure about backups either.

I know this is not very helpful but this is about as much as we have got.

Best regards,
   Henrik
Flags: needinfo?(yousef)
Flags: needinfo?(hmitsch)
(Reporter)

Comment 14

29 days ago
Clarified on IRC: the app that releng depends on doesn't have any data of its own; it uses data stored in VCS (GitHub) and exposes it via an API.

Most other l10n tools are in the same situation: there's no backup, but also no backup needed.

Having said that, that server contains other things that are out of my control, so I can't really answer for that part.
We're chasing up OVH again, this time to get an ETA.
Per the link in comment 11, the incident was closed yesterday at 7:22PM UTC (~15 minutes after comment 15). I don't manage to connect to https://l10n.mozilla-community.org/. What's the next step?
(In reply to Johan Lorenzo [:jlorenzo] from comment #16)
> Per the link in comment 11, the incident was closed yesterday at 7:22PM UTC
> (~15 minutes after comment 15). I don't manage to connect to
> https://l10n.mozilla-community.org/. What's the next step?

If my understanding was right in the channel this morning, sounds like the infrastructure is okay now but we might have problems rebooting the service.
I pinged a few people at OVH and here is the answer from support:
https://twitter.com/ovh_support_fr/status/890135664293011456
What does this mean? Can you translate?
That is very likely on our side now.
(In reply to Henrik Mitsch [:hmitsch] from comment #19)
> What does this mean? Can you translate?

My French is a bit rusty, but it sounds like everything is okay on their side and they have communicated that to us. They ask Sylvestre to have the administrator of the website contact them directly for more information.
Yes, I had another call with OVH Support about 3.5 hours ago. We can access the KVM but we can't get any further. Still trying to understand what's happening as I personally don't have access to the KVM.

-Henrik
(Reporter)

Comment 23

28 days ago
Partially related: https://bugzilla.mozilla.org/show_bug.cgi?id=1347863#c3

I can't help noticing two VPSes marked as L10N:
vps28311.ovh.net: IP is the same as l10n.mozilla-community.org, no replies to pings

vps28312.ovh.net: what's this? It answers pings and returns an Internal Server Error when accessed via HTTP, but I have no clue what's in it or who has access.
:hmitsch: who does have access to the kvm? Can we escalate to them to get status and/or help?
Flags: needinfo?(hmitsch)
:arr, :Tad has access and is looking into this as we speak. If you want, we can invite you to our multi-people Slack channel.
Flags: needinfo?(hmitsch)
Latest status: 
https://l10n.mozilla-community.org/stores_l10n/ is up again. 

Currently hosted on a temporary virtual server on ParSys AWS infrastructure. We aim to have a post mortem on Friday. Thanks to :arr for scheduling that.

Keeping the bug open because we need a permanent, sustainable solution.

I guess we can downgrade this bug now? Who has the authority and understanding to do so?

-Henrik
Severity: blocker → normal
(In reply to Henrik Mitsch [:hmitsch] from comment #26)
> Latest status: 
> https://l10n.mozilla-community.org/stores_l10n/ is up again. 
> 
> Currently hosted on a temporary virtual server on ParSys AWS infrastructure.
> We aim to have a post mortem on Friday. Thanks to :arr for scheduling that.
> 
> Keeping the bug open because we need a permanent, sustainable solution.

This bug is not yet fixed, as most of the stuff that was on that server is still not there, e.g. my folder is still missing: https://l10n.mozilla-community.org/~akalla/

Also: when I want to SSH to the server, I get "Permission denied (publickey)." - this was working before...
Severity: normal → critical
mozilla.si is also still not working...
(Reporter)

Comment 29

27 days ago
(In reply to Adrian Kalla [:adriank] from comment #27)
> This bug is not yet fixed, as most of the stuff that was on that server is
> still not there, e.g. my folder is still missing:
> https://l10n.mozilla-community.org/~akalla/

That's why the bug is still open. We only reinstalled critical pieces on a different temporary VM.

> Also: when I want to SSH to the server, I get "Permission denied
> (publickey)." - this was working before...

For the second part: the server is booting in rescue mode, but not properly, so it can't be fixed.
:adriank, can you please provide reasons for upgrading this issue to critical? The deployment toolchain for Fennec is working. Is there anything else that's critical on this server?

We are aware that not everything is back to normal and will take care of this in collaboration with MCWS in the next few days.

-Henrik
Severity: critical → normal
(In reply to Henrik Mitsch [:hmitsch] from comment #30)
> :adriank, can you please provide reasons for upgrading this issue to
> critical? The deployment toolchain for Fennec is working. Is there anything
> else that's critical on this server?

The deployment toolchain for Fennec was a blocker.

Regarding the rest: if an outage of a number of services, like mozilla.si, is not a critical issue, then I truly don't know what is... "normal" severity implies that there is no urgent issue here, as if this were something like "hey, let's move the server to a different location" while it is still working...

What is really important for me on this server: my unofficial SeaMonkey releases reside there (see: https://unofficialseamonkeynews.wordpress.com/2017/07/26/adrian-kallas-download-page-currently-not-available-%E2%9A%A0%EF%B8%8F/ ) - and even more important: they look on this server for the update.xml files, so I cannot even move them elsewhere without having this server back online...

:hmitsch: when can I expect the server to be back online with all its content?
Flags: needinfo?(hmitsch)
Blocks: 1385245

Comment 32

23 days ago
wrong-bug
1 failures in 1008 pushes (0.001 failures/push) were associated with this bug in the last 7 days.   

Repository breakdown:
* mozilla-central: 1

Platform breakdown:
* Android: 1

For more details, see:
https://brasstacks.mozilla.com/orangefactor/?display=Bug&bugid=1383642&startday=2017-07-24&endday=2017-07-30&tree=all
Hi Tom,

Different bug, same topic (as in Bug 1347753). Could you get around to providing an update anytime soon?

Best regards,
   Henrik
Flags: needinfo?(hmitsch) → needinfo?(tom)
(In reply to Adrian Kalla [:adriank] from comment #27)
> (In reply to Henrik Mitsch [:hmitsch] from comment #26)
> > Latest status: 
> > https://l10n.mozilla-community.org/stores_l10n/ is up again. 
> > 
> > Currently hosted on a temporary virtual server on ParSys AWS infrastructure.
> > We aim to have a post mortem on Friday. Thanks to :arr for scheduling that.
> > 
> > Keeping the bug open because we need a permanent, sustainable solution.
> 
> This bug is not yet fixed, as most of the stuff that was on that server is
> still not there, e.g. my folder is still missing:
> https://l10n.mozilla-community.org/~akalla/
> 
> Also: when I want to SSH to the server, I get "Permission denied
> (publickey)." - this was working before...

I am in the same situation, can't get my data at https://l10n.mozilla-community.org/~pascalc/
For the record, Mozilla Slovenija gave up on this and moved mozilla.si to a different hosting provider.
:hmitsch: it's been over two weeks since the outage began - what's the progress here? What have you guys done to fix this?

Moving back to critical for the reasons explained above, and also because I cannot wait much longer to update SeaMonkey users with the latest security fixes - and for that I need access to my personal folder yesterday...
Severity: normal → critical
Flags: needinfo?(hmitsch)
:mathjazz sorry to hear about your frustration. Can you tell us which hosting you use for that site?

:adriank I spoke to :tad earlier today to ask about his progress with server restore. He will get back to us soon, I hope. Again, I am sorry that this takes so long. We are paying for all the debt accrued by previous generations. I know that "Not your fault, doesn't mean not your problem" so we are doing our best to get the service back.

-Henrik
I have called up OVH again. As usual they have opened another ticket for the network issue, so let's hope it goes differently this time.
Flags: needinfo?(hmitsch)
(In reply to Henrik Mitsch [:hmitsch] from comment #37)
> :mathjazz sorry to hear about your frustration. Can you tell us which
> hosting you use for that site?

:hmitsch Hey, one of our volunteers stepped up and moved the site to his private hosting provider.

Comment 40

10 days ago
This is a rather incredible issue. 

@[:mathjazz]
Where can the private hosting provider be reached?
(In reply to Yousef Alam [:yalam96] from comment #38)
> I have called up OVH again. As usual they have opened another ticket for the
> network issue, so let's hope it goes differently this time.

Nearly a week has passed since this - what's the outcome?

Please remember: this outage has now lasted 24 continuous days - yes, days and not hours. I haven't seen such a long server outage in my whole life - until now...
Flags: needinfo?(yousef)

Comment 42

5 days ago
Adrian,

OVH are still being hopelessly useless in helping us with this issue. We are still unable to access the server.

We have followed up with OVH several times, but they seem unable to provide any sort of satisfactory response.
Flags: needinfo?(tom)
(Reporter)

Comment 43

5 days ago
@Adrian
Have you been able to set up the build system on a different server? If so, can you provide me with a file (and expected path) to redirect users to this server?

@Tom
Then, what are the next steps? We need to at least get access to that data.

Were you able to determine if other VPSes were affected (see Mozilla Slovenia above)?

Updated

5 days ago
Blocks: 1391525
 
> OVH are still being hopelessly useless in helping us with this issue. We are
> still unable to access the server.
> 
> We have followed up with OVH several times, but they seem unable to provide
> any sort of satisfactory response.
Can you give me the ticket number? Thanks
Flags: needinfo?(tom)
We've now just asked them for an image so we can move the data elsewhere.

Current ticket: 59045037 
Previous:
5398173553
1404253792 
1845196749
1550649700
2636967518
5391570667
52074495
Flags: needinfo?(yousef)

Comment 46

2 hours ago
Any update on this? Many of us use Adrian's SeaMonkey builds & since this outage we cannot update. We get:
Update XML file not found (404)