Closed Bug 894913 Opened 11 years ago Closed 11 years ago

Specific Etherpads becoming unavailable

Categories

(Infrastructure & Operations :: IT-Managed Tools, task)

x86
macOS
task
Not set
normal

Tracking

(Not tracked)

RESOLVED FIXED

People

(Reporter: davida, Assigned: bjohnson)

Attachments

(1 file)

Discussed in #it, but:

 https://teach.etherpad.mozilla.org/party-calll
 https://etherpad.mozilla.org/opennews-2014-fellowships-develop-capture

are at least two that are still down.

https://etherpad.mozilla.org/toronto-standing-desk-tryout was down and lost data; it is back up now, but without some of the data.
etherpad3.webapp.phx1.mozilla.com:/root/etherpad-2013-07-17.txt has the recent contents of the screen session.

Instances of:

Exception in thread "1700983583@qtp-554167265-5100" net.appjet.ajstdlib.SocketManager$HandlerException: An error occurred while handling a request: 500 - You like apples? An error occurred in the error handler while handling an error. How do you like <i>them</i> apples?<br>
net.appjet.bodylock.JSRuntimeException: Error while executing: TypeError: Can't use instanceof on a non-object. (module etherpad/log.js#121)<br>
This one too if it helps to identify the issue. 
https://etherpad.mozilla.org/5iHD1O6XeK
If anyone would like a specific bad pad removed because they have the data elsewhere and/or can recreate the data this can be done by the current oncall SRE via:
  https://etherpad.mozilla.org/ep/admin/delete-pad
Pad that seems to fit this description: https://etherpad.mozilla.org/devtools-firstweek
Can we recover the data?
We are working on a way to recover the lost text, but etherpad is tricky. We can't make any promises, but we're developing what we can. I will update you on or before 2 pm Pacific.
All etherpads reported so far have been fixed. Please re-open if any additional issues occur with these etherpads or open a new bug if a new etherpad break is found.

Thanks!
Assignee: server-ops-webops → bjohnson
Status: NEW → RESOLVED
Closed: 11 years ago
Resolution: --- → FIXED
https://etherpad.mozilla.org/weekly-addons-mtg
Status: RESOLVED → REOPENED
Resolution: FIXED → ---
Oh, sorry, I misread this. I'll file a new bug.
Status: REOPENED → RESOLVED
Closed: 11 years ago
Resolution: --- → FIXED
Actually, please use this bug as a centralized place for all the broken etherpads. We're actually working on a way to proactively find broken ones too.
Status: RESOLVED → REOPENED
Resolution: FIXED → ---
Blocks: 894992
Blocks: 894984
Hi, @sheeri: any ETA on that broken-etherpad-finding-and-repairing script?
Sheeri - do we have an idea of what has caused all these pads to break and, in some cases, lose data?
Sylvie - yes, we did a routine database failover and any pad that was being written to at the time seems to have been affected. 

Different etherpads have different authentications so we've been trying to find an internal way to find the broken pads. We researched a few ways to find broken pads by comparing fields in the database but haven't come up with a 100% correlation yet between differences and broken etherpads. We may have to resort to spidering the public etherpads to find broken ones, but we have not come up with a way to find all the broken pads among all the different private etherpads.
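
The spidering idea mentioned above could be sketched roughly as follows. This is only an illustration, not the script the team was writing: the names `ERROR_MARKER`, `fetch_page`, `is_broken`, and `find_broken_pads` are hypothetical, and it assumes that a broken pad can be recognized by the "server error" text appearing in the returned page, which may not hold for every broken pad. It also does nothing about private team etherpads that require authentication.

```python
import urllib.error
import urllib.request
from typing import Callable, Iterable, List

# Error text as quoted in this bug (the misspelling "occured" is part of the message).
# Assumption: this text appears verbatim in the page returned for a broken pad.
ERROR_MARKER = "A server error occured"


def fetch_page(url: str, timeout: float = 10.0) -> str:
    """Return the response body as text, including the body of HTTP error responses."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as resp:
            return resp.read().decode("utf-8", errors="replace")
    except urllib.error.HTTPError as err:
        # A broken pad may answer with a 500; the error page body is still readable.
        return err.read().decode("utf-8", errors="replace")


def is_broken(body: str) -> bool:
    """Assumption: a pad counts as broken if its page contains the error marker."""
    return ERROR_MARKER in body


def find_broken_pads(
    urls: Iterable[str], fetch: Callable[[str], str] = fetch_page
) -> List[str]:
    """Fetch each URL and collect the ones whose page shows the server error."""
    return [url for url in urls if is_broken(fetch(url))]
```

Passing `fetch` as a parameter keeps the check testable without network access; a real run would also need rate limiting and handling for connection failures, which this sketch leaves out.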
(In reply to Sheeri Cabral [:sheeri] from comment #22)
> Sylvie - yes, we did a routine database failover and any pad that was being
> written to at the time seems to have been affected. 
> 
> Different etherpads have different authentications so we've been trying to
> find an internal way to find the broken pads. We researched a few ways to
> find broken pads by comparing fields in the database but haven't come up
> with a 100% correlation yet between differences and broken etherpads. We may
> have to resort to spidering the public etherpads to find broken ones, but we
> have not come up with a way to find all the broken pads among all the
> different private etherpads.

Thanks. If we do not have a solution soon, is there a way to revert to the prior database, or another option to try?
https://taiwan.etherpad.mozilla.org/30 fixed

Fixing the other pads now.
Sylvie - we have backups, but then we'll end up losing the information that has been put into all etherpads since the incident. It's not easy to find all the etherpads that were changed in a time period and export them all, unfortunately.
(In reply to Sheeri Cabral [:sheeri] from comment #22)
> Sylvie - yes, we did a routine database failover and any pad that was being
> written to at the time seems to have been affected. 

'routine' sounds like we expect this to happen again. I lost a fair bit of work with this failover. Is this really routine?

Thanks
(In reply to Sheeri Cabral [:sheeri] from comment #25)
> Sylvie - we have backups, but then we'll end up losing the information that
> has been put into all etherpads since the incident. It's not easy to find
> all the etherpads that were changed in a time period and export them all,
> unfortunately.

OK, I assumed the DBs were active/passive and replicating, so that data loss would be minimal and less impactful than the situation today?
https://taiwan.etherpad.mozilla.org/255 was empty. Free'd up the URL.
They are active/passive replicating. But etherpad is not a traditional client-server application. The data loss wasn't from the database losing information: the information is there (if it was able to be saved), which is how we can recover the etherpads. Etherpads depend on javascript (node.js in particular), and folks can make changes to a doc when the database isn't available. Those changes are supposed to be saved when the database comes back online, but etherpad is not always good about that.

Etherpad corruption happens frequently; Jake reports that their team usually fixes a few a week.
(In reply to Joe Walker [:jwalker] from comment #27)
> (In reply to Sheeri Cabral [:sheeri] from comment #22)
> > Sylvie - yes, we did a routine database failover and any pad that was being
> > written to at the time seems to have been affected. 
> 
> 'routine' sounds like we expect this to happen again. I lost a fair bit of
> work with this failover. Is this really routine?
> 
> Thanks

Yes, the database failover is really routine. We have done this on average once every 3 months for the past 18 months, and have not had corruption like this before.
https://teach.etherpad.mozilla.org/DogDays was empty. Free'd up the URL.
(In reply to SylvieV from comment #28)
> (In reply to Sheeri Cabral [:sheeri] from comment #25)
> > Sylvie - we have backups, but then we'll end up losing the information that
> > has been put into all etherpads since the incident. It's not easy to find
> > all the etherpads that were changed in a time period and export them all,
> > unfortunately.
> 
> OK I assumed they DB's were active/passive and replicating and data loss
> would be at a minimum and less impactful then the situation today?

OK - Jake, for another time: let's see how else etherpad-like services can be delivered, maybe some SaaS offering, while we work out the new Communication and Collaboration tools for Mozillians.
https://etherpad.mozilla.org/summit-peopleandprocess fixed. 

As for now, all etherpads that are reported broken are fixed. Please let me know if we find any others and I'll fix them. 

I'm still actively working on a script to proactively identify pads that broke during this event.
(In reply to Brandon Johnson [:cyborgshadow] from comment #32)
definitely wasn't empty. Entire team was hacking on it yesterday...
Lost https://etherpad.mozilla.org/swc-data --- would appreciate recovery of data.
https://etherpad.mozilla.org/devtools-meeting

Any edits I try to make here don't go through and aren't visible after reload.
(In reply to Laura Hilliger [:epilepticrabbit] from comment #36)
> (In reply to Brandon Johnson [:cyborgshadow] from comment #32)
> definitely wasn't empty. Entire team was hacking on it yesterday...

Hi Laura,

Unfortunately there was no data in the database for this pad. I'm really sorry. :( New pads are definitely stable. This incident was isolated to that specific timeframe.
https://etherpad.mozilla.org/swc-data Fixed.
https://etherpad.mozilla.org/FirefoxWalkthrough Fixed.

fitzgen:
Your pad (https://etherpad.mozilla.org/devtools-meeting
) exists and opens properly. The issue is not related to this incident. Please file a new bug with webops.
Went through all the public etherpads and found 36 out of 3600 touched in the past few days since the incident. Attaching them here.
All etherpads from comment 42's attachment are fixed. Note that many of these appear to be normal anomalies from foreign utf8 characters and unrelated to the incident yesterday.
When we have resolved the etherpad issues, can we send an incident management note on the resolution, with any insight into the root cause, please?
Many of us are getting periodically disconnected from https://etherpad.mozilla.org/qa-staff-meeting - we've tried different browsers, and restarting Firefox, etc.
Stephen - the database is stable and has been since yesterday. This bug is for the unavailable etherpads, that do not load up at all. The server has an error, there is a red box with the text:

Oops!  A server error occured.  It's been logged.

Any other problems with etherpad are unrelated to this bug.
(In reply to Sheeri Cabral [:sheeri] from comment #46)
> Stephen - the database is stable and has been since yesterday. This bug is
> for the unavailable etherpads, that do not load up at all. The server has an
> error, there is a red box with the text:
> 
> Oops!  A server error occured.  It's been logged.
> 
> Any other problems with etherpad are unrelated to this bug.

Thanks, will file a separate bug, then.
Hello - this etherpad disconnects every few seconds after activity and won't save any changes - https://webmakersupport.etherpad.mozilla.org/FAQ

thanks for any help!
Jacob - please see comment 46.
(In reply to PTO until 6 Aug 2013 Sheeri Cabral [:sheeri] from comment #13)
> Actually, please use this bug as a centralized place for all the broken
> etherpads. We're actually working on a way to proactively find broken ones
> too.

Well, if this is to be a tracking bug for all etherpad issues, then the other etherpad bugs should be dependencies.
Seems like it might be related to the core bug : bug 887753
Having said that I am experiencing this issue on safari, chrome, and firefox / aurora.
Hi Naoki,

None of the etherpads you listed are affected by this issue. Please see comment 46.

Bill,

This is not for all etherpad issues, only a single specific issue where pads have a red box that says the below and do not load at all:

"Oops!  A server error occured.  It's been logged.

Any other problems with etherpad are unrelated to this bug."

We also fixed all the public etherpads that were affected by this issue. Only team etherpads should still have this issue, although I believe we may have fixed them all so far.
Another broken one that needs immediate attention for Maya/Debbie Cohen:
https://etherpad.mozilla.org/LEAD-20Etherpad
Fixed https://etherpad.mozilla.org/LEAD-20Etherpad

It contained some top bit set characters: 0xe2 0x80 0xa8
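
For reference, those three bytes can be decoded to see which character they are. A small sketch (the helper name `describe_utf8` is made up for illustration), assuming the bytes are UTF-8, which the byte pattern is consistent with:

```python
import unicodedata
from typing import List, Tuple


def describe_utf8(raw: bytes) -> List[Tuple[str, str]]:
    """Decode raw as UTF-8 and return (code point, Unicode name) per character."""
    return [
        (f"U+{ord(ch):04X}", unicodedata.name(ch, "<unnamed>"))
        for ch in raw.decode("utf-8")
    ]


# The three bytes reported above form a single character.
print(describe_utf8(bytes([0xE2, 0x80, 0xA8])))  # [('U+2028', 'LINE SEPARATOR')]
```

So the sequence is one character, U+2028 LINE SEPARATOR, rather than three separate high-bit characters.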
Since no bugs have come in for over 12 days now that have been affected by this issue, I'm closing it out.

Top-bit-set characters are an unrelated issue.
Status: REOPENED → RESOLVED
Closed: 11 years ago
Resolution: --- → FIXED