Closed Bug 1219321 Opened 9 years ago Closed 9 years ago

Etherpad data integrity issues

Categories

(Infrastructure & Operations :: Change Requests, task)

task
Not set
normal

Tracking

(Not tracked)

RESOLVED FIXED

People

(Reporter: scabral, Unassigned)

Details

Etherpad has data integrity issues, and we'd like to fail over the db to the slave to try to mitigate them.
The Etherpad db was filling up disk space in /, so I moved the db to /data.

In the process, the MyISAM table "store" was marked as crashed and needed to be repaired. The repair restored 120 records:

2015-10-28 16:27:59 13292 [Note] Found 4143174 of 4143054 rows when repairing './etherpad/store'
+----------------+--------+----------+---------------------------------------------------+
| Table          | Op     | Msg_type | Msg_text                                          |
+----------------+--------+----------+---------------------------------------------------+
| etherpad.store | repair | info     | Wrong bytesec: 108-108- 44 at 1116783308; Skipped |
| etherpad.store | repair | info     | Wrong bytesec: 115-115-105 at 1116778176; Skipped |
| etherpad.store | repair | info     | Wrong bytesec:  52- 71-116 at 1116785704; Skipped |
| etherpad.store | repair | info     | Wrong bytesec:  58-106-105 at 1116785548; Skipped |
| etherpad.store | repair | info     | Wrong bytesec:  58- 49-110 at 1116794512; Skipped |
| etherpad.store | repair | warning  | Number of rows changed from 4143054 to 4143174    |
| etherpad.store | repair | status   | OK                                                |
+----------------+--------+----------+---------------------------------------------------+
7 rows in set (35.71 sec)
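
The "120 records" figure is just the difference between the two row counts in the [Note] line. A minimal sketch of pulling it out of the log, assuming the message reads "Found <rows found by repair> of <rows the table expected>"; the pattern and the function name rows_recovered are illustrative, not part of any MySQL tooling:

```python
import re

# Log line copied verbatim from the repair output in this bug.
LOG_LINE = (
    "2015-10-28 16:27:59 13292 [Note] Found 4143174 of 4143054 rows "
    "when repairing './etherpad/store'"
)

# Assumed message shape: "Found <found> of <expected> rows when repairing '<table path>'".
PATTERN = re.compile(r"Found (\d+) of (\d+) rows when repairing '([^']+)'")


def rows_recovered(line):
    """Return (table_path, found - expected), or None if the line does not match."""
    match = PATTERN.search(line)
    if match is None:
        return None
    found = int(match.group(1))
    expected = int(match.group(2))
    return match.group(3), found - expected


print(rows_recovered(LOG_LINE))  # ('./etherpad/store', 120)
```

A positive difference means the repair found more rows than the table previously reported, which matches the "Number of rows changed from 4143054 to 4143174" warning.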

This resulted in a 4-minute outage from 16:23 to 16:27 UTC (9:23 - 9:27 Pacific).
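
For anyone cross-checking the outage window against other logs, a small sketch of the arithmetic. It assumes the date from the repair log line (2015-10-28) and a fixed UTC-7 offset for Pacific Daylight Time, which was in effect on that date; the variable names are illustrative:

```python
from datetime import datetime, timedelta, timezone

UTC = timezone.utc
# Assumption: fixed UTC-7 offset (PDT), valid for 2015-10-28.
PDT = timezone(timedelta(hours=-7), "PDT")

start = datetime(2015, 10, 28, 16, 23, tzinfo=UTC)
end = datetime(2015, 10, 28, 16, 27, tzinfo=UTC)

# Outage length in whole minutes.
outage_minutes = int((end - start).total_seconds() // 60)

# The same window expressed in Pacific time.
local_window = (
    start.astimezone(PDT).strftime("%H:%M"),
    end.astimezone(PDT).strftime("%H:%M"),
)

print(outage_minutes, local_window)  # 4 ('09:23', '09:27')
```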
We received a complaint that https://public.etherpad-mozilla.org/p/measurement-team-meeting-notes has been blank since the outage.

Etherpad's database is not set up in a way that lets us extract text or history without using the API. There aren't any command-line tools available, so we'd like to fail over to the redundant slave (which didn't crash) in the hope that the history/text is still there.
Assignee: team73 → server-ops
Component: DB: MySQL → Change Requests
Product: Data & BI Services Team → Infrastructure & Operations
QA Contact: scabral → lypulong
Change Request: --- → ?
needinfo'ing jakem, as this needs updating: https://mana.mozilla.org/wiki/display/websites/etherpad.mozilla.org#etherpad.mozilla.org-RestartEtherpad
Flags: needinfo?(nmaul)
Committed r109826 to the svn sysadmins repo to change the config of the etherpad dbs, swapping master and slave (configs only; nothing changes until the lb changes).
:atoll stopped etherpad, I updated the load balancer, and :atoll restarted etherpad. Functionality is good. Unfortunately, the pad that lost data had been overwritten with an empty pad, so the slave did not have any history either.

There may be other pads that lost data, but due to the nature of how etherpad stores data in the db, it's not possible to sleuth out how many pads were affected.
Change Request: ? → emergency
Seems closable.
Status: NEW → RESOLVED
Closed: 9 years ago
Flags: needinfo?(nmaul)
Resolution: --- → FIXED