Closed Bug 1383642 Opened 7 years ago Closed 2 years ago

OVH servers are down (e.g. http://l10n.mozilla-community.org/, mozilla.si)

Categories

(Participation Infrastructure :: MCWS, task, P1)

Tracking

(Not tracked)

RESOLVED FIXED

People

(Reporter: flod, Unassigned)

Server is currently unreachable. This time it shouldn't be a problem with OVH renewals, since that happens mid month. It was up and disappeared about 5 minutes ago, and it doesn't answer pings either. @reed Any idea?
Flags: needinfo?(reed)
Other OVH servers are down, so it seems to be that problem once again :-\
Flags: needinfo?(reed) → needinfo?(hmitsch)
Assignee: reed → nobody
Component: Localization Server → Community IT: Hosting
Product: mozilla.org → Infrastructure & Operations
Summary: http://l10n.mozilla-community.org/ is down → OVH servers are down (e.g. http://l10n.mozilla-community.org/, mozilla.si)
Hi Tom, do you have a renewal link to get the payment done? Best regards, Henrik
Flags: needinfo?(hmitsch) → needinfo?(tom)
According to IRC, the problem should be fixed, but both the servers mentioned in the subject are still offline. If possible, it would really help to give me access to the OVH panel for l10n.mozilla-community.org (I already have it for another VPS).
Context about the "blocker" status: None of Fennec Nightly, Beta or Release can be published on Google Play Store. This happened with yesterday's nightly[1]. Today, Fennec 55.0b12 needs to be released.
The reason for the failure is the following: [2] is used by our publishing script[3] to fetch the latest strings to display on Google Play. If the strings can't be fetched, the script fails, without publishing anything (just like [1]). We need this service to be up, if we want to continue to release Fennec automatically (and securely).
[1] https://tools.taskcluster.net/groups/EjVa6X1GROiMNA3sBZoG7Q/tasks/fcS5fKeLRx6GFiBFDyxDuA/runs/0/logs/public%2Flogs%2Flive_backing.log
[2] https://l10n.mozilla-community.org/stores_l10n/
[3] https://github.com/mozilla-releng/mozapkpublisher
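To illustrate the failure mode described in this comment, here is a minimal Python sketch of a "fetch strings or publish nothing" step. It is not the actual mozapkpublisher code: the function names, the `StringsUnavailable` exception, the injectable `get` and `upload` callables, and the assumption that the service returns JSON are all hypothetical and used only for illustration.

```python
import json
import urllib.error
import urllib.request

# URL taken from the comment above; everything else here is hypothetical.
STORES_L10N_URL = "https://l10n.mozilla-community.org/stores_l10n/"


class StringsUnavailable(RuntimeError):
    """Raised when the store strings cannot be fetched or parsed."""


def http_get(url, timeout=10.0):
    """Default fetcher: return the raw response body as bytes."""
    with urllib.request.urlopen(url, timeout=timeout) as resp:
        return resp.read()


def fetch_store_strings(url, get=http_get):
    """Fetch and decode the strings, turning any failure into one error type."""
    try:
        raw = get(url)
    except (urllib.error.URLError, OSError) as err:
        raise StringsUnavailable(f"could not fetch {url}: {err}") from err
    try:
        # Assumption: the service answers with JSON.
        return json.loads(raw)
    except ValueError as err:
        raise StringsUnavailable(f"unreadable response from {url}") from err


def publish(url, get=http_get, upload=print):
    """Upload only if the strings were fetched; otherwise raise before uploading."""
    strings = fetch_store_strings(url, get)  # raises StringsUnavailable on failure
    upload(strings)
    return True
```

With the server down, `publish(STORES_L10N_URL)` would raise `StringsUnavailable` before `upload` is ever called, which matches the "fails without publishing anything" behavior described above. Passing `get` and `upload` as parameters is only a convenience so the logic can be exercised without network access.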
Severity: normal → blocker
Priority: -- → P1
Henrik, sorry but we need to find a solution here, and I have no idea who else can help.
Flags: needinfo?(hmitsch)
We paid for the servers not too long after this bug was opened, and all of the other servers appear to be back online. Tad is working on kicking this server but is having trouble with the panel.
We've contacted OVH support to try and get the instance back online.
FTR, mozilla.si seems to be down still, too.
Hi jlorenzo and flod, let's stop the bleeding first. I will make sure that anybody at ParSys who can help to get the OVH servers back online will be working on that. Following the immediate fix, I will make it a priority to get to a sustainable solution. We did not know that the Firefox deployment toolchain depends on this server. This is of course unacceptable and we are happy to support the L10N people in moving this to sustainable ParSys AWS infrastructure. Let's focus on getting the server back first. Best regards, Henrik
Flags: needinfo?(hmitsch)
Yousef and I just spoke to the OVH phone support: There is a data center malfunction on OVH side. We explained the case and got buy in that OVH will fix this as quickly as possible. Fingers crossed! @Tad: thanks for providing us all the necessary details in the backchannel! :-)
Flags: needinfo?(tom)
Hi, Henrik, Yousef, I have a few questions to try and help us gauge the immediate impact. Is it possible for one of you to get an ETA for return of service from the hosting provider? Is there any indication of what the underlying issue is? Are there backups of the data in case we need to restore it elsewhere? Thanks for your help resolving this issue.
Flags: needinfo?(yousef)
Flags: needinfo?(hmitsch)
See Also: → 1384083
Hi arr, on the phone, OVH support told us that things should be back in service 'in a few minutes'. Obviously this has not been the case. We do not have an ETA. The ticket is also very sparse on details. Not sure about backups either. I know this is not very helpful but this is about as much as we have got. Best regards, Henrik
Flags: needinfo?(yousef)
Flags: needinfo?(hmitsch)
Clarified on IRC: the app that releng depends on doesn't have any data of its own; it uses data stored on VCS (GitHub) and exposes it via API. Most other l10n tools are in the same situation: there's no backup, but also no backup needed. Having said that, that server contains other things that are out of my control, so I can't really answer for that part.
We're chasing up OVH again, this time to get an ETA.
Per the link in comment 11, the incident was closed yesterday at 7:22PM UTC (~15 minutes after comment 15). I don't manage to connect to https://l10n.mozilla-community.org/. What's the next step?
(In reply to Johan Lorenzo [:jlorenzo] from comment #16)
> Per the link in comment 11, the incident was closed yesterday at 7:22PM UTC
> (~15 minutes after comment 15). I don't manage to connect to
> https://l10n.mozilla-community.org/. What's the next step?
If my understanding was right in the channel this morning, sounds like the infrastructure is okay now but we might have problems rebooting the service.
I pinged a few people at OVH and here is the answer from support: https://twitter.com/ovh_support_fr/status/890135664293011456
What does this mean? Can you translate?
That is very likely on our side now.
(In reply to Henrik Mitsch [:hmitsch] from comment #19)
> What does this mean? Can you translate?
My French is a bit rusty, but it sounds like everything is okay on their side and they have communicated that to us. He asks Sylvestre to have the administrator of the website contact them directly for more information.
Yes, I had another call with OVH Support about 3.5 hours ago. We can access the KVM but we can't get any further. Still trying to understand what's happening as I personally don't have access to the KVM. -Henrik
Partially related: https://bugzilla.mozilla.org/show_bug.cgi?id=1347863#c3
I can't help noticing two VPS marked as L10N:
vps28311.ovh.net: IP is the same as l10n.mozilla-community.org, no replies to pings
vps28312.ovh.net: what's this? It's answering pings and returns an Internal Server Error when accessed via HTTP, but I have no clue what's in it or who has access.
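The kind of check described in this comment (does the host respond, and what does HTTP return?) can be sketched in Python as below. This is only an illustration under stated assumptions: it uses a TCP connection attempt instead of an ICMP ping, because real pings need raw sockets or an external `ping` binary, and the function names `tcp_reachable`, `http_status`, `describe` and `probe` are hypothetical.

```python
import socket
import urllib.error
import urllib.request


def tcp_reachable(host, port=80, timeout=5.0):
    """Return True if a TCP connection to host:port can be opened.

    Note: this stands in for a ping; it is not an ICMP echo request.
    """
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False


def http_status(url, timeout=5.0):
    """Return the HTTP status code, or None if no HTTP response arrived."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as resp:
            return resp.status
    except urllib.error.HTTPError as err:
        # 4xx/5xx responses still prove that a web server answered.
        return err.code
    except (urllib.error.URLError, OSError):
        return None


def describe(reachable, status):
    """Summarize a probe result in one line."""
    if not reachable:
        return "unreachable"
    if status is None:
        return "reachable, no HTTP response"
    if status >= 500:
        return f"reachable, server error ({status})"
    return f"reachable, HTTP {status}"


def probe(host):
    """Probe a host over TCP port 80 and plain HTTP, and summarize the result."""
    reachable = tcp_reachable(host)
    status = http_status(f"http://{host}/") if reachable else None
    return describe(reachable, status)
```

Running `probe` against each of the two host names listed above would give a quick one-line summary per server; the result naturally depends on the state of the hosts at the time.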
:hmitsch: who does have access to the kvm? Can we escalate to them to get status and/or help?
Flags: needinfo?(hmitsch)
:arr, :Tad has access and is looking into this as we speak. If you want, we can invite you to our multi-people Slack channel.
Flags: needinfo?(hmitsch)
Latest status: https://l10n.mozilla-community.org/stores_l10n/ is up again. Currently hosted on a temporary virtual server on ParSys AWS infrastructure. We aim to have a post mortem on Friday. Thanks to :arr for scheduling that. Keeping the bug open because we need a permanent, sustainable solution. I guess we can downgrade this bug now? Who has the authority and understanding to do so? -Henrik
Severity: blocker → normal
(In reply to Henrik Mitsch [:hmitsch] from comment #26)
> Latest status:
> https://l10n.mozilla-community.org/stores_l10n/ is up again.
>
> Currently hosted on a temporary virtual server on ParSys AWS infrastructure.
> We aim to have a post mortem on Friday. Thanks to :arr for scheduling that.
>
> Keeping the bug open because we need a permanent, sustainable solution.
This bug is not yet fixed, as most of the stuff that was on that server is still not there, e.g. my folder is still missing: https://l10n.mozilla-community.org/~akalla/
Also: when I want to SSH to the server, I get "Permission denied (publickey)." - this was working before...
Severity: normal → critical
mozilla.si is also still not working...
(In reply to Adrian Kalla [:adriank] from comment #27)
> This bug is not yet fixed, as most of the stuff that was on that server is
> still not there, e.g. my folder is still missing:
> https://l10n.mozilla-community.org/~akalla/
That's why the bug is still open. We only reinstalled critical pieces on a different temporary VM.
> Also: when I want to SSH to the server, I get "Permission denied
> (publickey)." - this was working before...
For the second part: the server is booting in rescue mode, but not properly, so it can't be fixed.
:adriank, can you please provide reasons for upgrading this issue to critical? The deployment toolchain for Fennec is working. Is there anything else that's critical on this server? We are aware that not everything is back to normal and will take care of this in collaboration with MCWS in the next days. -Henrik
Severity: critical → normal
(In reply to Henrik Mitsch [:hmitsch] from comment #30)
> :adriank, can you please provide reasons for upgrading this issue to
> critical? The deployment toolchain for Fennec is working. Is there anything
> else that's critical on this server?
The deployment toolchain for Fennec was a blocker. Regarding the rest: if an outage of a number of services, like mozilla.si, is not a critical issue, then I truly don't know what is... "normal" severity implies that there is no urgent issue here, as if this were something of the sort "hey, let's move the server to a different location" - while it is still working...
What is really important for me on this server: my unofficial SeaMonkey releases reside there (see: https://unofficialseamonkeynews.wordpress.com/2017/07/26/adrian-kallas-download-page-currently-not-available-%E2%9A%A0%EF%B8%8F/ ) - and even more important: they look on this server for the update.xml files, so I cannot even move them elsewhere without having this server back online...
:hmitsch: when can I expect the server to be back online with all its content?
Flags: needinfo?(hmitsch)
Hi Tom, different bug, same topic (as in Bug 1347753). Could you get around to providing an update soon? Best regards, Henrik
Flags: needinfo?(hmitsch) → needinfo?(tom)
(In reply to Adrian Kalla [:adriank] from comment #27)
> (In reply to Henrik Mitsch [:hmitsch] from comment #26)
> > Latest status:
> > https://l10n.mozilla-community.org/stores_l10n/ is up again.
> >
> > Currently hosted on a temporary virtual server on ParSys AWS infrastructure.
> > We aim to have a post mortem on Friday. Thanks to :arr for scheduling that.
> >
> > Keeping the bug open because we need a permanent, sustainable solution.
>
> This bug is not yet fixed, as most of the stuff that was on that server is
> still not there, e.g. my folder is still missing:
> https://l10n.mozilla-community.org/~akalla/
>
> Also: when I want to SSH to the server, I get "Permission denied
> (publickey)." - this was working before...
I am in the same situation, can't get my data at https://l10n.mozilla-community.org/~pascalc/
For the record, Mozilla Slovenija gave up on this and moved mozilla.si to a different hosting.
:hmitsch: it's been over two weeks since the outage began - what's the progress here? What have you guys done to fix this? Moving back to critical for the reasons explained above and also, as I cannot wait much longer to update SeaMonkey users with the latest security fixes - and for that I need access to my personal folder yesterday...
Severity: normal → critical
Flags: needinfo?(hmitsch)
:mathjazz sorry to hear about your frustration. Can you tell us which hosting you use for that site? :adriank I spoke to :tad earlier today to ask about his progress with server restore. He will get back to us soon, I hope. Again, I am sorry that this takes so long. We are paying for all the debt accrued by previous generations. I know that "Not your fault, doesn't mean not your problem" so we are doing our best to get the service back. -Henrik
I have called up OVH again. As usual they have opened another ticket for the network issue, so let's hope it goes differently this time.
Flags: needinfo?(hmitsch)
(In reply to Henrik Mitsch [:hmitsch] from comment #37)
> :mathjazz sorry to hear about your frustration. Can you tell us which
> hosting you use for that site?
:hmitsch Hey, one of our volunteers stepped up and moved the site to his private hosting provider.
This is a rather incredible issue. @[:mathjazz] where can the private hosting provider be reached?
(In reply to Yousef Alam [:yalam96] from comment #38)
> I have called up OVH again. As usual they have opened another ticket for the
> network issue, so let's hope it goes differently this time.
Nearly a week has passed since this - what's the outcome? Please remember: we have now had an outage here for 24 continuous days - yes, days and not hours. I haven't seen such a long server outage in my whole life - until now...
Flags: needinfo?(yousef)
Adrian, OVH are still being hopelessly useless in helping us with this issue. We are still unable to access the server. We have followed up with OVH several times, but they seem unable to provide any sort of satisfactory response.
Flags: needinfo?(tom)
@Adrian Have you been able to set up the build system on a different server? If so, can you provide me a file (and expected path) to redirect users to this server? @Tom Then, what are the next steps? We need to at least get access to that data. Were you able to determine if other VPS were affected (see Mozilla Slovenia above)?
Blocks: 1391525
> OVH are still being hopelessly useless in helping us with this issue. We are > still unable to access the server. > > We have followed up with OVH several times, but they seem unable to provide > any sort of satisfactory response. Can you give me the ticket number? Thanks
Flags: needinfo?(tom)
We've now just asked them for an image so we can move the data elsewhere. Current ticket: 59045037. Previous: 5398173553, 1404253792, 1845196749, 1550649700, 2636967518, 5391570667, 52074495.
Flags: needinfo?(yousef)
Any update on this? Many of us use Adrian's SeaMonkey builds & since this outage we cannot update. We get: Update XML file not found (404)
And vps13662.ovh.net just got suspended for lack of payment. This is getting ridiculous.
That server was paid for earlier today, no idea why it's out of cycle from the rest. I'm currently downloading the contents of the l10n server, :flod, can you provide me with your gpg key? Since it's ~70GB I'll have to host it somewhere publicly.
The files have been put on the server and flod is currently cleaning them up. I believe he will start an internal discussion about what should be on the server, and we will have a discussion about the next steps for where this stuff will live.
Flags: needinfo?(tom)
Will these ever be restored?
Not Found
The requested URL /~akalla/unofficial/seamonkey/ was not found on this server.
Apache/2.4.18 (Ubuntu) Server at l10n.mozilla-community.org Port 443
(In reply to NoOp from comment #50)
> Will these ever be restored?
>
> Not Found
>
> The requested URL /~akalla/unofficial/seamonkey/ was not found on this
> server.
>
> Apache/2.4.18 (Ubuntu) Server at l10n.mozilla-community.org Port 443
Not as it was. Adrian is working on setting up a different server, I'll add the redirect as soon as he's ready.
(In reply to Francesco Lodolo [:flod] from comment #51)
> (In reply to NoOp from comment #50)
> > Will these ever be restored?
> >
> > Not Found
> >
> > The requested URL /~akalla/unofficial/seamonkey/ was not found on this
> > server.
> >
> > Apache/2.4.18 (Ubuntu) Server at l10n.mozilla-community.org Port 443
>
> Not as it was. Adrian is working on setting up a different server, I'll add
> the redirect as soon as he's ready.
Thanks to everyone involved in trying to bring it back up elsewhere.
(In reply to Arthur K. from comment #52)
> (In reply to Francesco Lodolo [:flod] from comment #51)
> > (In reply to NoOp from comment #50)
> > > Will these ever be restored?
> > >
> > > Not Found
> > >
> > > The requested URL /~akalla/unofficial/seamonkey/ was not found on this
> > > server.
> > >
> > > Apache/2.4.18 (Ubuntu) Server at l10n.mozilla-community.org Port 443
> >
> > Not as it was. Adrian is working on setting up a different server, I'll add
> > the redirect as soon as he's ready.
>
> Thanks to everyone involved in trying to bring it back up elsewhere.
How long will it take for Adrian to set up a new server? It's been a little more than 3 weeks since the last comment on here and the month of September is nearly over. Will the new server be ready in October?
(In reply to erpman1 from comment #53)
>
> How long will it take for Adrian to set up a new server? It's been a little
> more than 3 weeks since the last comment on here and the month of September
> is nearly over. will the new server be ready in October?
or by the end of 2017? and what has happened to Adrian? there has not been any substantial progress on this matter (or Adrian's whereabouts) since my last comment a few months ago.
See Also: → 1492617

Bulk move of bugs

Component: Community IT: Hosting → MCWS
Product: Infrastructure & Operations → Participation Infrastructure
Status: NEW → RESOLVED
Closed: 2 years ago
Resolution: --- → FIXED