Bug 995139 - Data loss in production ElasticSearch logs index used by OrangeFactor on 9th April
Status: Closed (RESOLVED WONTFIX)
Opened: 11 years ago · Closed: 11 years ago
Product/Component: Tree Management Graveyard :: OrangeFactor (defect)
Tracking: Not tracked
Reporter: emorley · Assignee: Unassigned
Keywords: sheriffing-P1
(Transposing content from email)
On 10/04/2014 03:11, Mark Côté wrote:
> Hi Celia, you mentioned today that there may have been some data loss in
> the logs index on the production ES cluster. I just noticed that the
> size of the logs index on production
> (elasticsearch-zlb.webapp.scl3.mozilla.com:9200) is waaaay smaller than
> that on dev (elasticsearch-zlb.dev.vlan81.phx1.mozilla.com:9200): 8562
> documents versus 3152488 documents, respectively. Production Orange
> Factor is accordingly generating useless data. This seems (a) like quite
> a lot of data loss and (b) very strange that dev has much more. Did
> something happen while the index was copied over from prod to dev?
>
> Btw the bugs and tbpl indices appear to be fine.
On 10/04/2014 14:54, C. Liang wrote:
> [Reiterating some of the things I posted in the IRC channel this morning.]
>
> What happened was that, after the cluster was rebooted, two shards were
> persistently stuck in an unallocated state and ES refused to initialize
> them. These shards were the primary and secondary copy of shard #3 (of
> a total of four shards) of the logs index.
>
> I found some files on disk that looked like they might be the remnants
> of the shard on one of the ES servers. I copied them to another
> directory and then told ES to try to reallocate that shard onto that
> same server. It took the command and, since the data size of the files
> after the reallocation command were larger than before the reallocation
> command, I was hoping that we hadn't had any data loss. =\
>
> Right now, a script is used to copy the indexes over from one cluster to
> another. That script does a scrolled search for anything belonging to
> the index in question and then does a full reindex of the search results
> into the destination cluster. If the dev cluster results look more
> reasonable than the prod cluster, a reindex of the prod indexes might help.
>
>
> In future, I'd like to try to get us to version 1.x because this would
> finally give us the ability to take snapshots of ES cluster, which would
> provide us backups without having to maintain 2x the infra or forcing a
> shutdown of ES (to get a clean lock on the index files).
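For illustration, a minimal sketch (in Python, using requests against the ES REST API) of the scroll-and-reindex copy approach described above. This is not the actual script used here; the hostnames, index name, batch size, and the exact scroll syntax (which varies across old ES versions) are assumptions.

    # Sketch: copy an index between clusters via scan/scroll + bulk reindex.
    # Hostnames, index name, and scroll details are placeholders/assumptions.
    import json
    import requests

    SRC = "http://source-es:9200"        # e.g. the production cluster
    DST = "http://destination-es:9200"   # e.g. the dev cluster
    INDEX = "logs"

    # Start a scan/scroll over everything in the source index.
    r = requests.get("%s/%s/_search" % (SRC, INDEX),
                     params={"search_type": "scan", "scroll": "10m", "size": 500},
                     data=json.dumps({"query": {"match_all": {}}}))
    scroll_id = r.json()["_scroll_id"]

    while True:
        # Fetch the next batch; older ES versions accept the scroll id as the body.
        r = requests.get("%s/_search/scroll" % SRC,
                         params={"scroll": "10m"}, data=scroll_id)
        body = r.json()
        hits = body["hits"]["hits"]
        if not hits:
            break
        scroll_id = body["_scroll_id"]

        # Bulk-index the batch into the destination cluster.
        lines = []
        for hit in hits:
            lines.append(json.dumps({"index": {"_index": INDEX,
                                               "_type": hit["_type"],
                                               "_id": hit["_id"]}}))
            lines.append(json.dumps(hit["_source"]))
        requests.post("%s/_bulk" % DST, data="\n".join(lines) + "\n")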
On 10/04/2014 17:20, Mark Côté wrote:
> As I mentioned, the dev instance appears to have much more data,
> although of course nothing since it was copied over yesterday. If we can
> copy the old data from dev over to prod without losing the new data on
> prod, then I think we should go ahead with that.
On 10/04/2014 21:34, C. Liang wrote:
> On 4/10/14 11:20 AM, Mark Côté wrote:
> I don't know that there is a good way for me to copy over the data from
> dev onto prod with the same index name without losing new data. I can
> try copying over the logs index onto a new / different index name on the
> same cluster, delete the old index, and then alias the old index name to
> the new index name.
>
> (The desire would be to maintain the alias long enough for y'all to
> update OrangeFactor to refer to the new index name. I'd like to name it
> to something like "orangefactor_logs2" or "of_logs" so it's easier to
> identify what service that index belongs to.)
>
> If you'd like me to do this, file a bug and ping me with the number in IRC.
>
> Otherwise, I know that some of the other devs have triggered a reindex
> of their indices on their own, using their own scripts. For logs, this
> will probably take a while. (Mea culpa: I neglected to time yesterday's
> copy so I can't provide timing information from yesterday's copy.)
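The alias swap proposed above can be done with the ES aliases endpoint; a minimal sketch follows (the hostname is a placeholder, and the index/alias names follow the "of_logs" suggestion):

    # Sketch: after the new index is populated and the old one deleted,
    # point the old name at the new index so existing clients keep working.
    import json
    import requests

    ES = "http://prod-es:9200"  # placeholder for the production cluster
    actions = {"actions": [{"add": {"index": "of_logs", "alias": "logs"}}]}
    requests.post("%s/_aliases" % ES, data=json.dumps(actions))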
On 10/04/2014 21:55, Mark Côté wrote:
> Yeah, I think we should write a simple script to copy the missing data
> over from dev to prod. It might take a while but I think it's the
> easiest solution. But I'm at PyCon all day tomorrow so I don't have the
> time to do this. David, do any of the sheriffs have the time & know-how
> to write a script (using the pyes package, probably) to do this?
Comment 1 • 11 years ago (Reporter)
So as far as I can follow (not knowing where/when the IRC conversation mentioned in the emails took place), the data loss was a result of the steps being performed in bug 963824 comment 9.
Comment 2 • 11 years ago (Reporter)
I don't suppose you could fill in the gaps from the IRC conversation that we missed, about what happened and how? It's just that this is the second time in ~1 year that we've lost data from production ES (see bug 848092 comment 6 onwards), and it seems like once again it's up to us to try and fix it, which as you can imagine is a little frustrating :-(
Flags: needinfo?(cliang)
Comment 3 • 11 years ago (Reporter)
(For people following along: bug 848092's summary is "Migrate elasticsearch[1-8].metrics.scl3 to PDU B"; it's marked as 'infra', ask for CC if needed)
Comment 4 • 11 years ago
The critical thing to make clear is that the data loss was not a result of the steps performed in https://bugzilla.mozilla.org/show_bug.cgi?id=963824#c9. It was instead a result of operational issues caused by memory allocation and garbage collection in ES 0.20.x. These issues forced a service restart, and both copies of the index shard were left unallocated (not assigned to any node) afterwards.
The upgrade of ES called for in Bug #963824 will address the problem because the memory handling in 0.90.x is better than in 0.20.x. This is the advice given to us by ElasticSearch, Inc.
This should reduce the number of times that we need to reboot the cluster and risk data loss. Given that the cluster has had to be restarted at least three times this past week (and at least six times in the past 30 days), the odds are not in our favor.
Flags: needinfo?(cliang)
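As a rough illustration of how the condition described above shows up (shards left unallocated after a restart), the cluster health API reports unassigned shard counts. A sketch, with a placeholder hostname:

    # Sketch: check for unassigned shards after a cluster restart.
    import requests

    ES = "http://prod-es:9200"  # placeholder
    health = requests.get("%s/_cluster/health" % ES,
                          params={"level": "indices"}).json()
    print("cluster status:", health["status"])
    print("unassigned shards:", health["unassigned_shards"])
    for name, idx in health.get("indices", {}).items():
        if idx["unassigned_shards"]:
            print(name, "has", idx["unassigned_shards"], "unassigned shard(s)")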
Comment 5 • 11 years ago
I can understand your frustration with lost index data. I can't address the data loss you mention from https://bugzilla.mozilla.org/show_bug.cgi?id=848092#c6 -- the ES cluster in that bug is the one that was run by Metrics. Unfortunately, ES was not designed with easy operational backup in mind, which makes it harder to recover from this. (I'd like to get the cluster to 1.x when we can because that version finally allows us to take snapshots of the index without having to shut down the server.)
As far as I know, the only way that we can try to recover lost data is to force a reindex of logs. The options, as I understand them, are:
1. Have me copy over the logs index to a new index on the production cluster, which will force a reindex of the new index.
2. Have you / someone on your team trigger a reindex of the logs index.
3. Have you / someone on your team work on a script to look for differences between the two log indexes (dev and prod) and inject missing data from dev to prod.
For option #1, most of the work is in my court. Putting the old index name as an alias on the new index name should mean that you don't need to make any immediate changes to your code. However, as I mentioned, it would be nice to have OrangeFactor work in a change to refer directly to the new index, since "logs" is a pretty generic index name.
With respect to option #2: the main reason to have you / someone on your team trigger the reindex is that you folks are the most familiar with how your application works and its audience. Some apps have toggles built in (e.g. enable / disable search functions); other apps seem to have known "dead times" where it's safe to kick off a reindex.
To me, option #3 sounds like more work, but I could see it being generally useful (from a QA standpoint), so I can't speak to whether or not it makes sense to invest the time and effort to do this now.
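For option #3, a rough sketch of what such a diff-and-backfill script might look like (hostnames are placeholders, and a real script would use scan/scroll rather than one large ID query):

    # Sketch: copy documents present on dev but missing on prod.
    import json
    import requests

    DEV = "http://dev-es:9200"    # placeholder
    PROD = "http://prod-es:9200"  # placeholder
    INDEX = "logs"

    def all_ids(host):
        # Simplified: fetch only document ids; use scan/scroll for real data sizes.
        r = requests.get("%s/%s/_search" % (host, INDEX),
                         data=json.dumps({"query": {"match_all": {}},
                                          "fields": [], "size": 1000000}))
        return dict((h["_id"], h["_type"]) for h in r.json()["hits"]["hits"])

    dev_docs = all_ids(DEV)
    prod_docs = all_ids(PROD)

    for _id in set(dev_docs) - set(prod_docs):
        doc = requests.get("%s/%s/%s/%s" % (DEV, INDEX, dev_docs[_id], _id)).json()
        requests.put("%s/%s/%s/%s" % (PROD, INDEX, dev_docs[_id], _id),
                     data=json.dumps(doc["_source"]))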
Comment 6 • 11 years ago (Reporter)
Mark, do you have a preference as to what we do here?
Flags: needinfo?(mcote)
Comment 7 • 11 years ago
#1 sounds good to me, but the logs index on the dev cluster does not contain newer entries. Can those be added easily after the index is copied over? That is, can you easily merge the two into one index?
Flags: needinfo?(mcote) → needinfo?(cliang)
Comment 8 • 11 years ago
Perhaps I'm not being clear:
What I will be doing is copying over the logs index from the production cluster to a new index (let's say, 'of_logs') on the production cluster. This copy process will force a reindex of the new index, which should mean that of_logs will have more data in it.
When I copied the logs index from the production cluster to the dev cluster for testing, there was a similar reindex of the index on the dev cluster.
The logs index on the dev cluster will only contain entries that were present at the time that I copied the logs index from the production cluster to the dev cluster. If you want the dev cluster to contain newer entries, I will need to copy the index from production to dev again.
If option #1 sounds good, I'll need help with what to name the new index. There is already an "orangefactor_logs" index. "orangefactor_logs_XXX?"
Flags: needinfo?(cliang)
Comment 9 • 11 years ago
Okay, I think that makes sense. :) As long as the new index has everything!
Hm actually, originally logs was an alias of orangefactor_logs, precisely because of the naming issue. The logparser still writes to logs, though, and the Orange Factor site still reads from it. So I think you can blow it away and use that.
Hm this raises a complication, though... the logparser will continue to write to logs while the reindex is going on. Will *these* entries not be in the new index then? Sorry for the newb questions; I'm still very much in the dark as to how all this works.
Comment 10 • 11 years ago
Is there a time when the logparser is not writing / has very few writes? Or can the parsing be put on "pause" for a certain length of time / re-queued for later?
Comment 11 • 11 years ago
Yes, the parsing can be put on hold for a number of hours and will recover afterwards with no data loss.
Comment 12 • 11 years ago
Ah right, of course. I could do this tomorrow or next week.
I'd prefer to have the final name stay as "logs", at least as an alias (like in the original setup). I realize it's a terrible name, but we're actually going to be moving Orange Factor to a different data source in a month or two, after which we won't need the ES indexes at all. I'd rather not have to update several code bases just for that period.
Comment 13 • 11 years ago
Does doing this on Tuesday, April 29th, sound good? If so, what time?
(This is a relatively meeting free day for me, so I have greater flexibility.)
Comment 14 • 11 years ago
Does 1 pm PDT work for you? I'm on the west coast that week.
Comment 15 • 11 years ago
Works for me and blocked off in my calendar.
Comment 16 • 11 years ago
The work we did today did not address the issue. We'll need to arrange for another window to pause log parsing and attempt to fix this again.
Apparently, "logs" used to be an alias to orangefactor_logs and that, when things broke, what happened is that logs became a separate index. Since reads and writes were done against "logs", only that index has been updated since April 8th. So, what will need to happen is that the production contents of orangefactor_logs and logs need to be merged into a third index (of_logs).
I attempted to do this today, but ran into an issue when copying orangefactor_logs into of_logs:
[ERROR] ** ElasticSearch::Error::Timeout at /home/esadmin/.perlbrew/perls/perl-5.19.0/lib/site_perl/5.19.0/ElasticSearch/Transport/HTTP.pm line 67 :
read timeout (500)
Since orangefactor_logs is not being updated, I can work on getting a copy of this index into of_logs without having to pause log parsing. Once that is done, a new downtime window can be scheduled so that the logs index can be copied into of_logs.
Comment 17 • 11 years ago (Reporter)
Thank you :-)
Comment 18 • 11 years ago
Okay. The goal of this bug has morphed.
I'm going to try to merge various indexes to get an index that should at *least* have all the data that was in the logs index as of May 5th (plus updates since then).
On elasticsearch-zlb.dev.vlan81.phx1.mozilla.com, the index of_logs_20140506 clocks in at a hair under 4GB, with 3,465,215 documents. This is a merge of:
1) all of logs (SCL3 prod), as of April 29th
2) some of orangefactor_logs (SCL3), from the copy attempts on April 29th that failed partway through
3) all of logs (SCL3 prod), as of May 6th (starting around 2 PM PDT)
[I think some of #2 is there because I cannot otherwise account for why of_logs_20140506 is so big. It has more documents than the sum of what's listed for #1 and #3.]
If one of you has time, I'd like to know if:
1) of_logs_20140506 has data in it that is missing from the SCL3 prod logs index from the outage this weekend
2) any of the data in of_logs_20140506 is useless (since I tried merging two different copies of logs)
If the data is sound and there's no weird "duplication", I can try merging in the partial copy of orangefactor_logs from May 5th to see if we can get even more data back.
Background:
In the ES outages this past weekend (May 5th), we lost shards of both logs and orangefactor_logs on the production SCL3 cluster. Both lost a primary shard AND the replica of the primary shard. The ES cluster would not recover until those were somehow resolved.
I was able to copy the remaining contents of logs into logs-bck and reindex logs-bck.
I was not so lucky with orangefactor_logs. I tried to force reallocation of the missing primary shard of orangefactor_logs, but was unable to get it to reroute. A copy of orangefactor_logs got about 1/4 of the way done after about 4 and a half hours; given the size of the index combined with stability issues with the 0.20 cluster, it was unlikely that it would complete before another failure. After discussion with Mark Cote, we ended up needing to delete orangefactor_logs to get the cluster functional again. So, it's not possible to recover all of the data that was in that index.
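The forced reallocation mentioned above goes through the cluster reroute API; a sketch, with placeholder node/shard values (note that allow_primary can itself discard data, and the exact command syntax varies across ES versions):

    # Sketch: try to allocate a stuck shard onto a specific node.
    import json
    import requests

    ES = "http://prod-es:9200"  # placeholder
    command = {"commands": [{"allocate": {"index": "orangefactor_logs",
                                          "shard": 3,           # placeholder shard number
                                          "node": "es-node-1",  # placeholder node name
                                          "allow_primary": True}}]}
    requests.post("%s/_cluster/reroute" % ES, data=json.dumps(command))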
I'll be filing another bug to do a regular copy of the logs index on the production SCL3 cluster. This should tide us over until we get the ability to do snapshots with ES1.0.
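The ES 1.x snapshot workflow referred to here is, roughly: register a repository once, then snapshot the index without any downtime. A sketch (repository name and filesystem path are hypothetical):

    # Sketch: ES 1.x snapshot of the logs index to a shared-filesystem repository.
    import json
    import requests

    ES = "http://prod-es:9200"  # placeholder

    # One-time: register a filesystem repository (path is hypothetical).
    repo = {"type": "fs", "settings": {"location": "/backups/orangefactor"}}
    requests.put("%s/_snapshot/of_backup" % ES, data=json.dumps(repo))

    # Take a snapshot of just the logs index, without shutting anything down.
    snap = {"indices": "logs", "include_global_state": False}
    requests.put("%s/_snapshot/of_backup/snapshot_1" % ES,
                 params={"wait_for_completion": "true"}, data=json.dumps(snap))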
Updated • 11 years ago
Flags: needinfo?(emorley)
Comment 19 • 11 years ago (Reporter)
(In reply to C. Liang [:cyliang] from comment #18)
> If one of you has time, I'd like to know if:
> 1) of_logs_20140506 has data in it that is missing from the SCL3 prod
> logs index from the outage this weekend
> 2) any of the data in of_logs_20140506 is useless (since I tried merging
> two different copies of logs)
>
> If the data is sound and there's no weird "duplication", I can try merging
> in the partial copy of orangefactor_logs from May 5th to see if we can get
> even more data back.
I don't have a quick way to verify this - and I don't really have the cycles to dig into it right now sadly.
Flags: needinfo?(emorley)
Comment 20 • 11 years ago (Reporter)
Seeing as it's been 5 weeks since the outage, and the older the OF data gets the less useful it is, I think we should just WONTFIX this for now.
Status: NEW → RESOLVED
Closed: 11 years ago
Resolution: --- → WONTFIX
Updated • 10 years ago
Product: Testing → Tree Management
Updated • 4 years ago
Product: Tree Management → Tree Management Graveyard