Closed
Bug 1383642
Opened 7 years ago
Closed 2 years ago
OVH servers are down (e.g. http://l10n.mozilla-community.org/ , mozilla.si)
Categories
(Participation Infrastructure :: MCWS, task, P1)
Tracking
(Not tracked)
RESOLVED
FIXED
People
(Reporter: flod, Unassigned)
Server is currently unreachable. This time it shouldn't be a problem with OVH renewals, since that happens mid month.
It was up and disappeared about 5 minutes ago; it doesn't respond to pings either.
@reed
Any idea?
Reporter | ||
Updated•7 years ago
|
Flags: needinfo?(reed)
Reporter | ||
Comment 1•7 years ago
|
||
Other OVH servers are down, so it seems to be that problem once again :-\
Flags: needinfo?(reed) → needinfo?(hmitsch)
Reporter | ||
Updated•7 years ago
|
Assignee: reed → nobody
Component: Localization Server → Community IT: Hosting
Product: mozilla.org → Infrastructure & Operations
Reporter | ||
Updated•7 years ago
|
Summary: http://l10n.mozilla-community.org/ is down → OVH servers are down (e.g. http://l10n.mozilla-community.org/, mozilla.si)
Comment 2•7 years ago
|
||
Hi Tom,
do you have a renewal link to get the payment done?
Best regards,
Henrik
Flags: needinfo?(hmitsch) → needinfo?(tom)
Reporter | ||
Comment 3•7 years ago
|
||
According to IRC, the problem should be fixed, but neither of the servers mentioned in the subject is online.
If possible, it would really help to give me access to the OVH panel for l10n.mozilla-community.org (I already have it for another VPS).
Comment 4•7 years ago
|
||
Context about the "blocker" status: None of Fennec Nightly, Beta or Release can be published on Google Play Store. This happened with yesterday's nightly[1]. Today, Fennec 55.0b12 needs to be released.
The reason for the failure is the following: [2] is used by our publishing script[3] to fetch the latest strings to display on Google Play. If the strings can't be fetched, the script fails without publishing anything (just like [1]).
We need this service to be up, if we want to continue to release Fennec automatically (and securely).
[1] https://tools.taskcluster.net/groups/EjVa6X1GROiMNA3sBZoG7Q/tasks/fcS5fKeLRx6GFiBFDyxDuA/runs/0/logs/public%2Flogs%2Flive_backing.log
[2] https://l10n.mozilla-community.org/stores_l10n/
[3] https://github.com/mozilla-releng/mozapkpublisher
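A minimal sketch of the fail-fast behavior described above, assuming the strings are served as JSON keyed by locale. This is not mozapkpublisher's actual code: the endpoint path, `StringsUnavailable`, `fetch_store_strings`, and `publish` are hypothetical names used only for illustration.

```python
import json
from urllib.error import URLError
from urllib.request import urlopen

# Hypothetical endpoint path, for illustration only; the real API layout may differ.
STRINGS_URL = "https://l10n.mozilla-community.org/stores_l10n/api/example"


class StringsUnavailable(Exception):
    """Raised when the localized store strings cannot be fetched."""


def fetch_store_strings(url=STRINGS_URL, opener=urlopen, timeout=10):
    """Download and decode the store strings, or raise StringsUnavailable."""
    try:
        with opener(url, timeout=timeout) as response:
            return json.loads(response.read().decode("utf-8"))
    except (URLError, OSError, ValueError) as error:
        # Network failures and malformed payloads are treated the same way:
        # without valid strings there is nothing safe to publish.
        raise StringsUnavailable(f"cannot fetch {url}: {error}") from error


def publish(apk_name, fetch=fetch_store_strings, upload=print):
    """Fetch the strings first; upload only if that step succeeds."""
    strings = fetch()  # raises before anything is uploaded
    upload(f"publishing {apk_name} with {len(strings)} locales")
    return True
```

Because the fetch happens before the upload step, an unreachable strings server aborts the whole run, which matches the failure seen in [1]. The `opener` and `upload` parameters exist only so the sketch can be exercised without network access.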
Severity: normal → blocker
Priority: -- → P1
Reporter | ||
Comment 5•7 years ago
|
||
Henrik, sorry but we need to find a solution here, and I have no idea who else can help.
Flags: needinfo?(hmitsch)
Comment 6•7 years ago
|
||
We paid for the servers not too long after this bug was opened, and all of the other servers appear to be back online. Tad is working on kicking this server but is having trouble with the panel.
Comment 7•7 years ago
|
||
We've contacted OVH support to try and get the instance back online.
Comment 8•7 years ago
|
||
FTR, mozilla.si seems to be down still, too.
Comment 9•7 years ago
|
||
Hi jlorenzo and flod,
let's stop the bleeding first. I will make sure that anybody at ParSys who can help to get the OVH servers back online will be working on that.
Following the immediate fix, I will make it a priority to get to a sustainable solution. We did not know that the Firefox deployment toolchain depends on this server. This is of course unacceptable and we are happy to support the L10N people in moving this to sustainable ParSys AWS infrastructure.
Let's focus on getting the server back first.
Best regards,
Henrik
Flags: needinfo?(hmitsch)
Comment 10•7 years ago
|
||
Yousef and I just spoke to the OVH phone support: There is a data center malfunction on OVH's side.
We explained the case and got buy-in that OVH will fix this as quickly as possible.
Fingers crossed!
@Tad: thanks for providing us all the necessary details in the backchannel! :-)
Updated•7 years ago
|
Flags: needinfo?(tom)
Comment 11•7 years ago
|
||
This is the link to track the OVH outage:
http://travaux.ovh.net/?do=details&id=26171&edit=yep
Comment 12•7 years ago
|
||
Hi, Henrik, Yousef, I have a few questions to try and help us gauge the immediate impact. Is it possible for one of you to get an ETA for return of service from the hosting provider? Is there any indication of what the underlying issue is? Are there backups of the data in case we need to restore it elsewhere?
Thanks for your help resolving this issue.
Flags: needinfo?(yousef)
Flags: needinfo?(hmitsch)
Comment 13•7 years ago
|
||
Hi arr,
On the phone, OVH support told us that things should be back in service 'in a few minutes'. Obviously this has not been the case. We do not have an ETA. The ticket is also very sparse on details. Not sure about backups either.
I know this is not very helpful but this is about as much as we have got.
Best regards,
Henrik
Flags: needinfo?(yousef)
Flags: needinfo?(hmitsch)
Reporter | ||
Comment 14•7 years ago
|
||
Clarified on IRC: the app that releng depends on doesn't have any data of its own; it uses data stored in VCS (GitHub) and exposes it via an API.
Most other l10n tools are in the same situation: there's no backup, but also no backup needed.
Having said that, that server contains other things that are out of my control, so I can't really answer for that part.
Comment 15•7 years ago
|
||
We're chasing up OVH again, this time to get an ETA.
Comment 16•7 years ago
|
||
Per the link in comment 11, the incident was closed yesterday at 7:22PM UTC (~15 minutes after comment 15). I still can't connect to https://l10n.mozilla-community.org/. What's the next step?
Comment 17•7 years ago
|
||
(In reply to Johan Lorenzo [:jlorenzo] from comment #16)
> Per the link in comment 11, the incident was closed yesterday at 7:22PM UTC
> (~15 minutes after comment 15). I still can't connect to
> https://l10n.mozilla-community.org/. What's the next step?
If my understanding was right in the channel this morning, sounds like the infrastructure is okay now but we might have problems rebooting the service.
Comment 18•7 years ago
|
||
I pinged a few people at OVH and here is the answer from support:
https://twitter.com/ovh_support_fr/status/890135664293011456
Comment 19•7 years ago
|
||
What does this mean? Can you translate?
Comment 20•7 years ago
|
||
That is very likely on our side now.
Comment 21•7 years ago
|
||
(In reply to Henrik Mitsch [:hmitsch] from comment #19)
> What does this mean? Can you translate?
My French is a bit rusty, but it sounds like everything is okay on their side and they have communicated that to us. They are asking Sylvestre to have the administrator of the website contact them directly for more information.
Comment 22•7 years ago
|
||
Yes, I had another call with OVH Support about 3.5 hours ago. We can access the KVM but we can't get any further. Still trying to understand what's happening as I personally don't have access to the KVM.
-Henrik
Reporter | ||
Comment 23•7 years ago
|
||
Partially related: https://bugzilla.mozilla.org/show_bug.cgi?id=1347863#c3
I can't help noticing two VPS marked as L10N
vps28311.ovh.net: IP is the same as l10n.mozilla-community.org, no replies to pings
vps28312.ovh.net: what's this? It's answering pings and returns an Internal Server Error when accessed via HTTP, but I have no clue what's in it or who has access.
Comment 24•7 years ago
|
||
:hmitsch: who does have access to the kvm? Can we escalate to them to get status and/or help?
Flags: needinfo?(hmitsch)
Comment 25•7 years ago
|
||
:arr, :Tad has access and is looking into this as we speak. If you want, we can invite you to our multi-person Slack channel.
Flags: needinfo?(hmitsch)
Comment 26•7 years ago
|
||
Latest status:
https://l10n.mozilla-community.org/stores_l10n/ is up again.
Currently hosted on a temporary virtual server on ParSys AWS infrastructure. We aim to have a post mortem on Friday. Thanks to :arr for scheduling that.
Keeping the bug open because we need a permanent, sustainable solution.
I guess we can downgrade this bug now? Who has the authority and understanding to do so?
-Henrik
Updated•7 years ago
|
Severity: blocker → normal
Comment 27•7 years ago
|
||
(In reply to Henrik Mitsch [:hmitsch] from comment #26)
> Latest status:
> https://l10n.mozilla-community.org/stores_l10n/ is up again.
>
> Currently hosted on a temporary virtual server on ParSys AWS infrastructure.
> We aim to have a post mortem on Friday. Thanks to :arr for scheduling that.
>
> Keeping the bug open because we need a permanent, sustainable solution.
This bug is not yet fixed, as most of the stuff that was on that server is still not there, e.g. my folder is still missing: https://l10n.mozilla-community.org/~akalla/
Also: when I want to SSH to the server, I get "Permission denied (publickey)." - this was working before...
Updated•7 years ago
|
Severity: normal → critical
Comment 28•7 years ago
|
||
mozilla.si is also still not working...
Reporter | ||
Comment 29•7 years ago
|
||
(In reply to Adrian Kalla [:adriank] from comment #27)
> This bug is not yet fixed, as most of the stuff that was on that server is
> still not there, e.g. my folder is still missing:
> https://l10n.mozilla-community.org/~akalla/
That's why the bug is still open. We only reinstalled critical pieces on a different temporary VM.
> Also: when I want to SSH to the server, I get "Permission denied
> (publickey)." - this was working before...
For the second part: the server is booting in rescue mode, but not properly, so it can't be fixed.
Comment 30•7 years ago
|
||
:adriank, can you please provide reasons for upgrading this issue to critical? The deployment toolchain for Fennec is working. Is there anything else that's critical on this server?
We are aware that not everything is back to normal and will take care of this in collaboration with MCWS in the next days.
-Henrik
Severity: critical → normal
Comment 31•7 years ago
|
||
(In reply to Henrik Mitsch [:hmitsch] from comment #30)
> :adriank, can you please provide reasons for upgrading this issue to
> critical? The deployment toolchain for Fennec is working. Is there anything
> else that's critical on this server?
The deployment toolchain for Fennec was a blocker.
Regarding the rest: if an outage of a number of services, like mozilla.si, is not a critical issue, then I truly don't know what is... "normal" severity implies that there is no urgent issue here, as if this were something of the sort "hey, let's move the server to a different location" - while it is still working...
What is really important for me on this server: my unofficial SeaMonkey releases reside there (see: https://unofficialseamonkeynews.wordpress.com/2017/07/26/adrian-kallas-download-page-currently-not-available-%E2%9A%A0%EF%B8%8F/ ) - and even more important: they look on this server for the update.xml files, so I cannot even move them elsewhere without having this server back online...
:hmitsch: when can I expect the server to be back online with all its content?
Flags: needinfo?(hmitsch)
Comment hidden (Intermittent Failures Robot) |
Comment 33•7 years ago
|
||
Hi Tom,
different bug, same topic (as in Bug 1347753). Could you get around to providing an update sometime soon?
Best regards,
Henrik
Flags: needinfo?(hmitsch) → needinfo?(tom)
Comment 34•7 years ago
|
||
(In reply to Adrian Kalla [:adriank] from comment #27)
> (In reply to Henrik Mitsch [:hmitsch] from comment #26)
> > Latest status:
> > https://l10n.mozilla-community.org/stores_l10n/ is up again.
> >
> > Currently hosted on a temporary virtual server on ParSys AWS infrastructure.
> > We aim to have a post mortem on Friday. Thanks to :arr for scheduling that.
> >
> > Keeping the bug open because we need a permanent, sustainable solution.
>
> This bug is not yet fixed, as most of the stuff that was on that server is
> still not there, e.g. my folder is still missing:
> https://l10n.mozilla-community.org/~akalla/
>
> Also: when I want to SSH to the server, I get "Permission denied
> (publickey)." - this was working before...
I am in the same situation, can't get my data at https://l10n.mozilla-community.org/~pascalc/
Comment 35•7 years ago
|
||
For the record, Mozilla Slovenija gave up on this and moved mozilla.si to different hosting.
Comment 36•7 years ago
|
||
:hmitsch: it's been over two weeks since the outage began - what's the progress here? What have you guys done to fix this?
Moving back to critical for the reasons explained above, and also because I cannot wait much longer to update SeaMonkey users with the latest security fixes - and for that I need access to my personal folder yesterday...
Severity: normal → critical
Flags: needinfo?(hmitsch)
Comment 37•7 years ago
|
||
:mathjazz sorry to hear about your frustration. Can you tell us which hosting you use for that site?
:adriank I spoke to :tad earlier today to ask about his progress with server restore. He will get back to us soon, I hope. Again, I am sorry that this takes so long. We are paying for all the debt accrued by previous generations. I know that "Not your fault, doesn't mean not your problem" so we are doing our best to get the service back.
-Henrik
Comment 38•7 years ago
|
||
I have called up OVH again. As usual they have opened another ticket for the network issue, so let's hope it goes differently this time.
Flags: needinfo?(hmitsch)
Comment 39•7 years ago
|
||
(In reply to Henrik Mitsch [:hmitsch] from comment #37)
> :mathjazz sorry to hear about your frustration. Can you tell us which
> hosting you use for that site?
:hmitsch Hey, one of our volunteers stepped up and moved the site to his private hosting provider.
Comment 40•7 years ago
|
||
This is a rather incredible issue.
@[:mathjazz]
private hosting provider can be reached where?
Comment 41•7 years ago
|
||
(In reply to Yousef Alam [:yalam96] from comment #38)
> I have called up OVH again. As usual they have opened another ticket for the
> network issue, so let's hope it goes differently this time.
Nearly a week has passed since this - what's the outcome?
Please remember: we have now had an outage here for 24 consecutive days - yes, days and not hours. I haven't seen such a long server outage in my whole life - until now...
Flags: needinfo?(yousef)
Comment 42•7 years ago
|
||
Adrian,
OVH are still being hopelessly useless in helping us with this issue. We are still unable to access the server.
We have followed up with OVH several times, but they seem unable to provide any sort of satisfactory response.
Flags: needinfo?(tom)
Reporter | ||
Comment 43•7 years ago
|
||
@Adrian
Have you been able to set up the build system on a different server? If so, can you provide me with a file (and expected path) to redirect users to this server?
@Tom
Then, what are the next steps? We need to at least get access to that data.
Were you able to determine if other VPS were affected (see Mozilla Slovenia above)?
Comment 44•7 years ago
|
||
> OVH are still being hopelessly useless in helping us with this issue. We are
> still unable to access the server.
>
> We have followed up with OVH several times, but they seem unable to provide
> any sort of satisfactory response.
Can you give me the ticket number? Thanks
Flags: needinfo?(tom)
Comment 45•7 years ago
|
||
We've now just asked them for an image so we can move the data elsewhere.
Current ticket: 59045037
Previous:
5398173553
1404253792
1845196749
1550649700
2636967518
5391570667
52074495
Flags: needinfo?(yousef)
Comment 46•7 years ago
|
||
Any update on this? Many of us use Adrian's SeaMonkey builds & since this outage we cannot update. We get:
Update XML file not found (404)
Reporter | ||
Comment 47•7 years ago
|
||
And vps13662.ovh.net just got suspended for lack of payment. This is getting ridiculous.
Comment 48•7 years ago
|
||
That server was paid for earlier today, no idea why it's out of cycle from the rest.
I'm currently downloading the contents of the l10n server. :flod, can you provide me with your GPG key? Since it's ~70GB I'll have to host it somewhere publicly.
Comment 49•7 years ago
|
||
The files have been put on the server and flod is currently cleaning them up. I believe he will start an internal discussion about what should be on the server, and we will have a discussion about the next steps for where this stuff will live.
Flags: needinfo?(tom)
Comment 50•7 years ago
|
||
Will these ever be restored?
Not Found
The requested URL /~akalla/unofficial/seamonkey/ was not found on this server.
Apache/2.4.18 (Ubuntu) Server at l10n.mozilla-community.org Port 443
Reporter | ||
Comment 51•7 years ago
|
||
(In reply to NoOp from comment #50)
> Will these ever be restored?
>
> Not Found
>
> The requested URL /~akalla/unofficial/seamonkey/ was not found on this
> server.
>
> Apache/2.4.18 (Ubuntu) Server at l10n.mozilla-community.org Port 443
Not as it was. Adrian is working on setting up a different server, I'll add the redirect as soon as he's ready.
Comment 52•7 years ago
|
||
(In reply to Francesco Lodolo [:flod] from comment #51)
> (In reply to NoOp from comment #50)
> > Will these ever be restored?
> >
> > Not Found
> >
> > The requested URL /~akalla/unofficial/seamonkey/ was not found on this
> > server.
> >
> > Apache/2.4.18 (Ubuntu) Server at l10n.mozilla-community.org Port 443
>
> Not as it was. Adrian is working on setting up a different server, I'll add
> the redirect as soon as he's ready.
Thanks to everyone involved in trying to bring it back up elsewhere.
Comment 53•7 years ago
|
||
(In reply to Arthur K. from comment #52)
> (In reply to Francesco Lodolo [:flod] from comment #51)
> > (In reply to NoOp from comment #50)
> > > Will these ever be restored?
> > >
> > > Not Found
> > >
> > > The requested URL /~akalla/unofficial/seamonkey/ was not found on this
> > > server.
> > >
> > > Apache/2.4.18 (Ubuntu) Server at l10n.mozilla-community.org Port 443
> >
> > Not as it was. Adrian is working on setting up a different server, I'll add
> > the redirect as soon as he's ready.
>
> Thanks to everyone involved in trying to bring it back up elsewhere.
How long will it take for Adrian to set up a new server? It's been a little more than 3 weeks since the last comment on here and the month of September is nearly over. Will the new server be ready in October?
Comment 54•7 years ago
|
||
(In reply to erpman1 from comment #53)
>
> How long will it take for Adrian to set up a new server? It's been a little
> more than 3 weeks since the last comment on here and the month of September
> is nearly over. Will the new server be ready in October?
Or by the end of 2017? And what has happened to Adrian? There has not been any
substantial progress on this matter (or news of Adrian's whereabouts) since my last comment
a few months ago.
Comment 55•6 years ago
|
||
Bulk move of bugs
Component: Community IT: Hosting → MCWS
Product: Infrastructure & Operations → Participation Infrastructure
Reporter | ||
Updated•2 years ago
|
Status: NEW → RESOLVED
Closed: 2 years ago
Resolution: --- → FIXED