1383642 - OVH servers are down (e.g. http://l10n.mozilla-community.org/, mozilla.si)

Reporter

Description

•

9 years ago

Server is currently unreachable. This time it shouldn't be a problem with OVH renewals, since those happen mid-month. It was up and disappeared about 5 minutes ago, and it doesn't answer pings either. @reed Any idea?

Francesco Lodolo [:flod]

Reporter

Updated

•

9 years ago

Flags: needinfo?(reed)

Francesco Lodolo [:flod]

Reporter

Comment 1

•

9 years ago

Other OVH servers are down, so it seems to be that problem once again :-\

Flags: needinfo?(reed) → needinfo?(hmitsch)

Francesco Lodolo [:flod]

Reporter

Updated

•

9 years ago

Assignee: reed → nobody

Component: Localization Server → Community IT: Hosting

Product: mozilla.org → Infrastructure & Operations

Francesco Lodolo [:flod]

Reporter

Updated

•

9 years ago

Summary: http://l10n.mozilla-community.org/ is down → OVH servers are down (e.g. http://l10n.mozilla-community.org/, mozilla.si)

Henrik Mitsch [:hmitsch]

Comment 2

•

9 years ago

Hi Tom, do you have a renewal link to get the payment done? Best regards, Henrik

Flags: needinfo?(hmitsch) → needinfo?(tom)

Francesco Lodolo [:flod]

Reporter

Comment 3

•

9 years ago

According to IRC, the problem should be fixed, but both the servers mentioned in the subject are not online. If possible, it would really help to give me access to the OVH panel for l10n.mozilla-community.org (I already have for another VPS).

Johan Lorenzo [:jlorenzo] - PTO - Back August 24th

Comment 4

•

9 years ago

Context about the "blocker" status: None of Fennec Nightly, Beta or Release can be published on Google Play Store. This happened with yesterday's nightly[1]. Today, Fennec 55.0b12 needs to be released. The reason for the failure is the following: [2] is used by our publishing script[3] to fetch the latest strings to display on Google Play. If the strings can't be fetched, the script fails without publishing anything (just like [1]). We need this service to be up if we want to continue to release Fennec automatically (and securely). [1] https://tools.taskcluster.net/groups/EjVa6X1GROiMNA3sBZoG7Q/tasks/fcS5fKeLRx6GFiBFDyxDuA/runs/0/logs/public%2Flogs%2Flive_backing.log [2] https://l10n.mozilla-community.org/stores_l10n/ [3] https://github.com/mozilla-releng/mozapkpublisher

Severity: normal → blocker

Priority: -- → P1
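The failure mode described in comment 4 (fetch the strings first, publish nothing if the fetch fails) can be sketched roughly as follows. This is only an illustration, not the actual mozapkpublisher code: the names `publish`, `http_fetch`, `StringsUnavailable` and the `upload` callback are hypothetical, and the response is assumed to be JSON.

```python
import json
from typing import Callable
from urllib.request import urlopen

# Base URL taken from comment 4; the exact API path used by the real script is not shown there.
STORES_L10N_BASE = "https://l10n.mozilla-community.org/stores_l10n/"


class StringsUnavailable(RuntimeError):
    """Raised when the localized store strings cannot be fetched (hypothetical name)."""


def http_fetch(url: str, timeout: float = 10.0) -> bytes:
    """Download the raw response body from url."""
    with urlopen(url, timeout=timeout) as response:
        return response.read()


def publish(fetch: Callable[[str], bytes], upload: Callable[[dict], None]) -> None:
    """Fetch the strings first; call upload only if the fetch succeeded."""
    try:
        strings = json.loads(fetch(STORES_L10N_BASE))
    except (OSError, ValueError) as error:
        # Network errors are OSError subclasses; malformed JSON is a ValueError.
        # Failing here means the store is never touched, so nothing is half-published.
        raise StringsUnavailable(f"cannot fetch store strings: {error}") from error
    upload(strings)
```

With the server down, `publish(http_fetch, upload)` would raise before `upload` is ever called, which matches the "fails without publishing anything" behaviour reported for the nightly.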

Francesco Lodolo [:flod]

Reporter

Comment 5

•

9 years ago

Henrik, sorry but we need to find a solution here, and I have no idea who else can help.

Flags: needinfo?(hmitsch)

Yousef Alam [:yalam96] (use NEEDINFO)

Comment 6

•

9 years ago

We paid for the servers not too long after this bug was opened, and all of the other servers appear to be back online. Tad is working on kicking this server but is having trouble with the panel.

Yousef Alam [:yalam96] (use NEEDINFO)

Comment 7

•

9 years ago

We've contacted OVH support to try and get the instance back online.

Axel Hecht [:Pike]

Comment 8

•

9 years ago

FTR, mozilla.si seems to be down still, too.

Henrik Mitsch [:hmitsch]

Comment 9

•

9 years ago

Hi jlorenzo and flod, let's stop the bleeding first. I will make sure that anybody at ParSys who can help to get the OVH servers back online will be working on that. Following the immediate fix, I will make it a priority to get to a sustainable solution. We did not know that the Firefox deployment toolchain depends on this server. This is of course unacceptable and we are happy to support the L10N people in moving this to sustainable ParSys AWS infrastructure. Let's focus on getting the server back first. Best regards, Henrik

Flags: needinfo?(hmitsch)

Henrik Mitsch [:hmitsch]

Comment 10

•

9 years ago

Yousef and I just spoke to the OVH phone support: There is a data center malfunction on OVH side. We explained the case and got buy in that OVH will fix this as quickly as possible. Fingers crossed! @Tad: thanks for providing us all the necessary details in the backchannel! :-)

Henrik Mitsch [:hmitsch]

Updated

•

9 years ago

Flags: needinfo?(tom)

Yousef Alam [:yalam96] (use NEEDINFO)

Comment 11

•

9 years ago

This is the link to track the OVH outage: http://travaux.ovh.net/?do=details&id=26171&edit=yep

Amy Rich [:arr] [:arich]

Comment 12

•

9 years ago

Hi, Henrik, Yousef, I have a few questions to try and help us gauge the immediate impact. Is it possible for one of you to get an ETA for return of service from the hosting provider? Is there any indication of what the underlying issue is? Are there backups of the data in case we need to restore it elsewhere? Thanks for your help resolving this issue.

Flags: needinfo?(yousef)

Flags: needinfo?(hmitsch)

Mihai Tabara [:mtabara]⌚️GMT

Updated

•

9 years ago

Comment 13

•

9 years ago

Hi arr, on the phone support OVH told us that things should be back in service 'in a few minutes'. Obviously this has not been the case. We do not have an ETA. The ticket is also very sparse on details. Not sure about backups either. I know this is not very helpful but this is about as much as we have got. Best regards, Henrik

Flags: needinfo?(yousef)

Flags: needinfo?(hmitsch)

Francesco Lodolo [:flod]

Reporter

Comment 14

•

9 years ago

Clarified on IRC: the app that releng depends on doesn't have any data of its own; it uses data stored on VCS (GitHub) and exposes it via API. Most other l10n tools are in the same situation: there's no backup, but also no backup needed. Having said that, that server contains other things that are out of my control, so I can't really answer for that part.

Yousef Alam [:yalam96] (use NEEDINFO)

Comment 15

•

9 years ago

We're chasing up OVH again, this time to get an ETA.

Johan Lorenzo [:jlorenzo] - PTO - Back August 24th

Comment 16

•

9 years ago

Per the link in comment 11, the incident was closed yesterday at 7:22PM UTC (~15 minutes after comment 15). I don't manage to connect to https://l10n.mozilla-community.org/. What's the next step?

Mihai Tabara [:mtabara]⌚️GMT

Comment 17

•

9 years ago

(In reply to Johan Lorenzo [:jlorenzo] from comment #16) > Per the link in comment 11, the incident was closed yesterday at 7:22PM UTC > (~15 minutes after comment 15). I don't manage to connect to > https://l10n.mozilla-community.org/. What's the next step? If my understanding was right in the channel this morning, sounds like the infrastructure is okay now but we might have problems rebooting the service.

Sylvestre Ledru [:Sylvestre]

Comment 18

•

9 years ago

I pinged a few people at OVH, and here is the answer from support: https://twitter.com/ovh_support_fr/status/890135664293011456

Henrik Mitsch [:hmitsch]

Comment 19

•

9 years ago

What does this mean? Can you translate?

Sylvestre Ledru [:Sylvestre]

Comment 20

•

9 years ago

That is very likely on our side now.

Mihai Tabara [:mtabara]⌚️GMT

Comment 21

•

9 years ago

(In reply to Henrik Mitsch [:hmitsch] from comment #19) > What does this mean? Can you translate? My French is a bit rusty, but it sounds like everything is okay on their side and they have communicated that to us. They ask Sylvestre to have the administrator of the website contact them directly for more information.

Henrik Mitsch [:hmitsch]

Comment 22

•

9 years ago

Yes, I had another call with OVH Support about 3.5 hours ago. We can access the KVM but we can't get any further. Still trying to understand what's happening as I personally don't have access to the KVM. -Henrik

Francesco Lodolo [:flod]

Reporter

Comment 23

•

9 years ago

Partially related: https://bugzilla.mozilla.org/show_bug.cgi?id=1347863#c3 I can't help noticing two VPS marked as L10N:
vps28311.ovh.net: IP is the same as l10n.mozilla-community.org, no replies to pings.
vps28312.ovh.net: what's this? It's answering pings and returns an Internal Server Error when accessed via HTTP, but I have no clue what's in it or who has access.
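The two states described in comment 23 (a host that does not respond at all versus a host that is up but serves an HTTP error) can be told apart with a small probe like the sketch below. It is an assumption-laden illustration: a TCP connection attempt stands in for ping, since ICMP normally needs elevated privileges, and the function names are made up for this example.

```python
import socket
from typing import Optional
from urllib.error import HTTPError, URLError
from urllib.request import urlopen


def tcp_reachable(host: str, port: int = 80, timeout: float = 5.0) -> bool:
    """Return True if a TCP connection to host:port can be opened."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False


def http_status(url: str, timeout: float = 5.0) -> Optional[int]:
    """Return the HTTP status code for url, or None if no response arrived."""
    try:
        with urlopen(url, timeout=timeout) as response:
            return response.status
    except HTTPError as error:
        return error.code  # the server answered, but with an error status
    except (URLError, OSError):
        return None


def classify(reachable: bool, status: Optional[int]) -> str:
    """Summarize a probe result in the terms used in this bug."""
    if not reachable:
        return "unreachable"
    if status is None:
        return "host up, no HTTP response"
    if status >= 500:
        return "host up, server error"
    return "host up, HTTP responding"
```

For example, `classify(tcp_reachable(host), http_status("http://" + host + "/"))` could be run against each VPS hostname to record which of the two situations applies.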

Amy Rich [:arr] [:arich]

Comment 24

•

9 years ago

:hmitsch: who does have access to the kvm? Can we escalate to them to get status and/or help?

Flags: needinfo?(hmitsch)

Henrik Mitsch [:hmitsch]

Comment 25

•

9 years ago

:arr, :Tad has access and is looking into this as we speak. If you want, we can invite you to our multi-people Slack channel.

Flags: needinfo?(hmitsch)

Henrik Mitsch [:hmitsch]

Comment 26

•

9 years ago

Latest status: https://l10n.mozilla-community.org/stores_l10n/ is up again. Currently hosted on a temporary virtual server on ParSys AWS infrastructure. We aim to have a post mortem on Friday. Thanks to :arr for scheduling that. Keeping the bug open because we need a permanent, sustainable solution. I guess we can downgrade this bug now? Who has the authority and understanding to do so? -Henrik

Amy Rich [:arr] [:arich]

Updated

•

9 years ago

Severity: blocker → normal

Adrian Kalla [:adriank]

Comment 27

•

9 years ago

(In reply to Henrik Mitsch [:hmitsch] from comment #26) > Latest status: > https://l10n.mozilla-community.org/stores_l10n/ is up again. > > Currently hosted on a temporary virtual server on ParSys AWS infrastructure. > We aim to have a post mortem on Friday. Thanks to :arr for scheduling that. > > Keeping the bug open because we need a permanent, sustainable solution. This bug is not yet fixed, as most of the stuff that was on that server is still not there, e.g. my folder is still missing: https://l10n.mozilla-community.org/~akalla/ Also: when I want to SSH to the server, I get "Permission denied (publickey)." - this was working before...

Adrian Kalla [:adriank]

Updated

•

9 years ago

Severity: normal → critical

Adrian Kalla [:adriank]

Comment 28

•

9 years ago

mozilla.si is also still not working...

Francesco Lodolo [:flod]

Reporter

Comment 29

•

9 years ago

(In reply to Adrian Kalla [:adriank] from comment #27) > This bug is not yet fixed, as most of the stuff that was on that server is > still not there, e.g. my folder is still missing: > https://l10n.mozilla-community.org/~akalla/ That's why the bug is still open. We only reinstalled critical pieces on a different temporary VM. > Also: when I want to SSH to the server, I get "Permission denied > (publickey)." - this was working before... For the second part: the server is booting in rescue mode, but not properly, so it can't be fixed.

Henrik Mitsch [:hmitsch]

Comment 30

•

9 years ago

:adriank, can you please provide reasons for upgrading this issue to critical? The deployment toolchain for Fennec is working. Is there anything else that's critical on this server? We are aware that not everything is back to normal and will take care of this in collaboration with MCWS in the next days. -Henrik

Severity: critical → normal

Adrian Kalla [:adriank]

Comment 31

•

9 years ago

(In reply to Henrik Mitsch [:hmitsch] from comment #30) > :adriank, can you please provide reasons for upgrading this issue to > critical? The deployment toolchain for Fennec is working. Is there anything > else that's critical on this server? The deployment toolchain for Fennec was a blocker. Regarding the rest: if an outage of a number of services, like mozilla.si, is not a critical issue, then I truly don't know what is... "normal" severity implies that there is no urgent issue here, as if this were something of the sort "hey, let's move the server to a different location" - while it is still working... What is really important for me on this server: my unofficial SeaMonkey releases reside there (see: https://unofficialseamonkeynews.wordpress.com/2017/07/26/adrian-kallas-download-page-currently-not-available-%E2%9A%A0%EF%B8%8F/ ) - and even more important: they look on this server for the update.xml files, so I cannot even move them elsewhere without having this server back online... :hmitsch: when can I expect the server to be back online with all its content?

Flags: needinfo?(hmitsch)

Mihai Tabara [:mtabara]⌚️GMT

Updated

•

9 years ago

Blocks: 1385245

Comment hidden (Intermittent Failures Robot)

Henrik Mitsch [:hmitsch]

Comment 33

•

9 years ago

Hi Tom, different bug, same topic (as in Bug 1347753). Could you get around to providing an update soon? Best regards, Henrik

Flags: needinfo?(hmitsch) → needinfo?(tom)

Pascal Chevrel (relman team) -> :pascalc

Comment 34

•

8 years ago

(In reply to Adrian Kalla [:adriank] from comment #27) > (In reply to Henrik Mitsch [:hmitsch] from comment #26) > > Latest status: > > https://l10n.mozilla-community.org/stores_l10n/ is up again. > > > > Currently hosted on a temporary virtual server on ParSys AWS infrastructure. > > We aim to have a post mortem on Friday. Thanks to :arr for scheduling that. > > > > Keeping the bug open because we need a permanent, sustainable solution. > > This bug is not yet fixed, as most of the stuff that was on that server is > still not there, e.g. my folder is still missing: > https://l10n.mozilla-community.org/~akalla/ > > Also: when I want to SSH to the server, I get "Permission denied > (publickey)." - this was working before... I am in the same situation, can't get my data at https://l10n.mozilla-community.org/~pascalc/

Matjaz Horvat [:mathjazz]

Comment 35

•

8 years ago

For the record, Mozilla Slovenija gave up on this and moved mozilla.si to a different hosting.

Adrian Kalla [:adriank]

Comment 36

•

8 years ago

:hmitsch: it's been over two weeks since the outage began - what's the progress here? What have you guys done to fix this? Moving back to critical for the reasons explained above and also, as I cannot wait much longer to update SeaMonkey users with the latest security fixes - and for that I need access to my personal folder yesterday...

Severity: normal → critical

Flags: needinfo?(hmitsch)

Henrik Mitsch [:hmitsch]

Comment 37

•

8 years ago

:mathjazz sorry to hear about your frustration. Can you tell us which hosting you use for that site? :adriank I spoke to :tad earlier today to ask about his progress with server restore. He will get back to us soon, I hope. Again, I am sorry that this takes so long. We are paying for all the debt accrued by previous generations. I know that "Not your fault, doesn't mean not your problem" so we are doing our best to get the service back. -Henrik

Yousef Alam [:yalam96] (use NEEDINFO)

Comment 38

•

8 years ago

I have called up OVH again. As usual they have opened another ticket for the network issue, so let's hope it goes differently this time.

Flags: needinfo?(hmitsch)

Matjaz Horvat [:mathjazz]

Comment 39

•

8 years ago

(In reply to Henrik Mitsch [:hmitsch] from comment #37) > :mathjazz sorry to hear about your frustration. Can you tell us which > hosting you use for that site? :hmitsch Hey, one of our volunteers stepped up and moved the site to his private hosting provider.

Rainer Bielefeld

Comment 40

•

8 years ago

This is a rather incredible issue. @[:mathjazz] where can the private hosting provider be reached?

Adrian Kalla [:adriank]

Comment 41

•

8 years ago

(In reply to Yousef Alam [:yalam96] from comment #38) > I have called up OVH again. As usual they have opened another ticket for the > network issue, so let's hope it goes differently this time. Nearly a week has passed since this - what's the outcome? Please remember: this outage has now lasted 24 continuous days - yes, days and not hours. I haven't seen such a long server outage in my whole life - until now...

Flags: needinfo?(yousef)

tad

Comment 42

•

8 years ago

Adrian, OVH are still being hopelessly useless in helping us with this issue. We are still unable to access the server. We have followed up with OVH several times, but they seem unable to provide any sort of satisfactory response.

Flags: needinfo?(tom)

Francesco Lodolo [:flod]

Reporter

Comment 43

•

8 years ago

@Adrian Have you been able to set up the build system on a different server? If so, can you provide me a file (and expected path) to redirect users to this server? @Tom Then, what are the next steps? We need to at least get access to that data. Were you able to determine if other VPS were affected (see Mozilla Slovenia above)?
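The redirect requested in comment 43 amounts to mapping paths under the old personal folder to the same relative path on a new host. A minimal sketch of that mapping, assuming a placeholder destination (`https://example.org/akalla/` is hypothetical; no new location has been named in this bug):

```python
from typing import Optional
from urllib.parse import urlsplit

OLD_PREFIX = "/~akalla/"                   # personal folder path from comment 27
NEW_BASE = "https://example.org/akalla/"   # hypothetical placeholder for the new host


def redirect_target(old_url: str) -> Optional[str]:
    """Return the new URL for an old personal-folder URL, or None if it is not covered."""
    parts = urlsplit(old_url)
    if not parts.path.startswith(OLD_PREFIX):
        return None
    remainder = parts.path[len(OLD_PREFIX):]
    query = "?" + parts.query if parts.query else ""
    # Keep the relative path and query string so update checks resolve to the same file.
    return NEW_BASE + remainder + query
```

Preserving the relative path is the point of the design: clients that look for update.xml at a fixed location under the old folder would be sent to the matching file on the new server.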

Rainer Bielefeld

Updated

•

8 years ago

Blocks: 1391525

Sylvestre Ledru [:Sylvestre]

Comment 44

•

8 years ago

> OVH are still being hopelessly useless in helping us with this issue. We are > still unable to access the server. > > We have followed up with OVH several times, but they seem unable to provide > any sort of satisfactory response. Can you give me the ticket number? Thanks

Flags: needinfo?(tom)

Yousef Alam [:yalam96] (use NEEDINFO)

Comment 45

•

8 years ago

We've now just asked them for an image so we can move the data elsewhere. Current ticket: 59045037 Previous: 5398173553 1404253792 1845196749 1550649700 2636967518 5391570667 52074495

Flags: needinfo?(yousef)

NoOp

Comment 46

•

8 years ago

Any update on this? Many of us use Adrian's SeaMonkey builds & since this outage we cannot update. We get: Update XML file not found (404)

Francesco Lodolo [:flod]

Reporter

Comment 47

•

8 years ago

And vps13662.ovh.net just got suspended for lack of payment. This is getting ridiculous.

Yousef Alam [:yalam96] (use NEEDINFO)

Comment 48

•

8 years ago

That server was paid for earlier today, no idea why it's out of cycle from the rest. I'm currently downloading the contents of the l10n server, :flod, can you provide me with your gpg key? Since it's ~70GB I'll have to host it somewhere publicly.

Yousef Alam [:yalam96] (use NEEDINFO)

Comment 49

•

8 years ago

The files have been put on the server and flod is currently cleaning them up. I believe he will start an internal discussion about what should be on the server, and we will have a discussion about the next steps for where this stuff will live.

Flags: needinfo?(tom)

NoOp

Comment 50

•

8 years ago

Will these ever be restored? Not Found The requested URL /~akalla/unofficial/seamonkey/ was not found on this server. Apache/2.4.18 (Ubuntu) Server at l10n.mozilla-community.org Port 443

Francesco Lodolo [:flod]

Reporter

Comment 51

•

8 years ago

(In reply to NoOp from comment #50) > Will these ever be restored? > > Not Found > > The requested URL /~akalla/unofficial/seamonkey/ was not found on this > server. > > Apache/2.4.18 (Ubuntu) Server at l10n.mozilla-community.org Port 443 Not as it was. Adrian is working on setting up a different server, I'll add the redirect as soon as he's ready.

Arthur K. (he/him)

Comment 52

•

8 years ago

(In reply to Francesco Lodolo [:flod] from comment #51) > (In reply to NoOp from comment #50) > > Will these ever be restored? > > > > Not Found > > > > The requested URL /~akalla/unofficial/seamonkey/ was not found on this > > server. > > > > Apache/2.4.18 (Ubuntu) Server at l10n.mozilla-community.org Port 443 > > Not as it was. Adrian is working on setting up a different server, I'll add > the redirect as soon as he's ready. Thanks to everyone involved in trying to bring it back up elsewhere.

erpman1

Comment 53

•

8 years ago

(In reply to Arthur K. from comment #52) > (In reply to Francesco Lodolo [:flod] from comment #51) > > (In reply to NoOp from comment #50) > > > Will these ever be restored? > > > > > > Not Found > > > > > > The requested URL /~akalla/unofficial/seamonkey/ was not found on this > > > server. > > > > > > Apache/2.4.18 (Ubuntu) Server at l10n.mozilla-community.org Port 443 > > > > Not as it was. Adrian is working on setting up a different server, I'll add > > the redirect as soon as he's ready. > > Thanks to everyone involved in trying to bring it back up elsewhere. How long will it take for Adrian to set up a new server? It's been a little more than 3 weeks since the last comment on here and the month of September is nearly over. will the new server be ready in October?

erpman1

Comment 54

•

8 years ago

(In reply to erpman1 from comment #53) > > How long will it take for Adrian to set up a new server? It's been a little > more than 3 weeks since the last comment on here and the month of September > is nearly over. will the new server be ready in October? or by the end of 2017? and what has happened to Adrian? there has not been any substantial progress on this matter (or Adrian's whereabouts) since my last comment a few months ago.

Johan Lorenzo [:jlorenzo] - PTO - Back August 24th

Updated

•

7 years ago

Comment 55

•

7 years ago

Bulk move of bugs

Component: Community IT: Hosting → MCWS

Product: Infrastructure & Operations → Participation Infrastructure

Francesco Lodolo [:flod]

Reporter

Updated

•

3 years ago

Status: NEW → RESOLVED

Closed: 3 years ago

Resolution: --- → FIXED