Closed Bug 1071044 Opened 10 years ago Closed 9 years ago

Upgrade ElasticSearch PHX1 Development cluster to 1.2.x

Categories

(Infrastructure & Operations :: IT-Managed Tools, task)

x86
macOS
task
Not set
normal

Tracking

(Not tracked)

RESOLVED FIXED

People

(Reporter: cliang, Assigned: cliang)

References

Details

(Whiteboard: [kanban:https://webops.kanbanize.com/ctrl_board/2/118] )

The version of Elasticsearch running on the development cluster in PHX1 is due for another upgrade: from 0.90.x to 1.1.x.  This will involve a full shutdown of the cluster.  I'd like to do this on October 2nd, starting at 10 AM Pacific Time.  

Please let me know if you have issues / concerns with doing this upgrade.

If you have been CC'ed on this bug, I believe that you either have an index on the ES development cluster or can CC the correct person / people to this bug:

       autolog : jgriffin
         logs* : emorley, mcote
    mozillians : hoosteeno
   *inputindex : willkg
         sumo* : ricky
         tbpl* : mcote
          xtag : aschaar

As before, there have been a number of breaking changes in the intermediate versions[1], so some application changes may be needed as a result of this upgrade.  In particular, as of 1.2, dynamic scripting is disallowed by default.[2]  If you know that your application makes use of this feature, please let me know!

[1]  Elasticsearch release notes: http://www.elasticsearch.org/downloads/
[2]  http://www.elasticsearch.org/blog/scripting-security/
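
If you're not sure whether your application uses dynamic scripting, the giveaway is an inline script string in the search body (script fields, script filters, custom-score scripts).  Here is a minimal sketch of the kind of request that the 1.2 default rejects; the host, index, and field names below are placeholders for illustration, not anything on the dev cluster:

import json
import requests

ES_URL = "http://localhost:9200"  # placeholder host

body = {
    "query": {"match_all": {}},
    "script_fields": {
        # Inline script: blocked by default as of ES 1.2 unless the cluster
        # sets script.disable_dynamic: false in elasticsearch.yml.
        "doubled": {"script": "doc['some_numeric_field'].value * 2"},
    },
}

resp = requests.post(ES_URL + "/example-index/_search", data=json.dumps(body))
print(resp.status_code, resp.text)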
Whiteboard: [kanban:https://kanbanize.com/ctrl_board/4/1312]
I want to clarify two things:

1. This is an upgrade to Elasticsearch 1.2, correct? Figured I'd clarify since the first sentence in the description says 1.1.x.

2. Any site that is using ElasticUtils needs to upgrade to 0.10.1 *before* having their ES cluster updated. The wry twist about that is that I haven't released 0.10.1, yet. I'll do that today. I wrote up some rough migration instructions to make life easier (http://elasticutils.readthedocs.org/en/latest/migrating_0.90_to_1.0.html) though they need to be updated. I'll do that today, too.
(In reply to C. Liang [:cyliang] from comment #0)
>          tbpl* : mcote

+ emorley

(In reply to C. Liang [:cyliang] from comment #0)
> Please let me know if you have issues / concerns with doing this upgrade.

What measures do we have in place to avoid the data loss that occurred during the last upgrade? (bug 995139)
Assignee: server-ops-webops → cliang
RE: Comment #1 --

Yes, my bad: this is an upgrade for ES 1.2.x.  I started prepping this bug back when it wasn't clear if ElasticUtils would be compatible with the newer version and did not remember to edit the description.


RE: Comment #2 --

1) The data loss that happened in bug 995139 was the result of a reboot of the cluster to fix a problem with some of the nodes.  Since I am doing a full shutdown of the cluster before taking a file-level copy of the indexes, I'm hoping it will come back up cleanly. 

2) Since we are upgrading the dev server, if there is data loss, it should be possible to get a fresh copy of data from production.  If there's an index on the dev cluster that can't be replicated from production, please let me know and I can try to arrange for a temporary copy of that index to another cluster for a backup.
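
For what it's worth, the post-restart check I plan to run is just polling cluster health until it reports green.  A rough sketch (host, timeout, and interval are placeholders, not the dev cluster's real values):

import time
import requests

ES_URL = "http://localhost:9200"  # placeholder

def wait_for_green(timeout=600, interval=10):
    # Poll _cluster/health until the cluster reports green or we give up.
    deadline = time.time() + timeout
    while time.time() < deadline:
        try:
            health = requests.get(ES_URL + "/_cluster/health").json()
        except requests.RequestException:
            health = {}
        print("status=%s unassigned_shards=%s"
              % (health.get("status"), health.get("unassigned_shards")))
        if health.get("status") == "green":
            return True
        time.sleep(interval)
    return False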
Sorry, I missed that this was about the dev instance; that sounds fine :-)
Ok! I pushed out a better version of the migration document for ElasticUtils users:

http://elasticutils.readthedocs.org/en/latest/migrating_0.90_to_1.0.html

I also pushed out ElasticUtils 0.10.1 which supports Elasticsearch 0.90 up through 1.2. Anyone using ElasticUtils should update it to 0.10.1 now so that your stuff will work with Elasticsearch 1.2.

If you bump into any issues, I'm on IRC on #elasticutils and there's also the ElasticUtils issue tracker:

https://github.com/mozilla/elasticutils/issues
Justin: Has Mozillians updated to ElasticUtils 0.10.1, yet?
Flags: needinfo?(hoosteeno)
Talked to Justin on IRC. cc:ing :williamr, :tasos, and :nemo to cover Mozillians.
Flags: needinfo?(hoosteeno)
Unfortunately, we are using ElasticUtils 0.5 on mozillians.org and we are going to need some time in order to upgrade to the required version. Maybe :williamr has more insights about this.
Flags: needinfo?(williamr)
Sorry to chime in late here. Our team just found out about this upgrade yesterday and we've been discussing it. Would it be possible to postpone the upgrade a couple weeks so that the Mozillians.org team can first upgrade ElasticUtils in our codebase?

We are making the ElasticUtils code changes a high priority for our team. Tasos is currently researching what's involved and will then start making code changes. We'll have an estimate early next week of when Mozillians.org will be upgraded to the required version.

While this bug only impacts the development server, it would be helpful to our team to first upgrade our codebase before ElasticSearch is updated on the server. Otherwise, our development environment will be broken.

Thanks!
Flags: needinfo?(willkg)
Flags: needinfo?(williamr)
Flags: needinfo?(cliang)
I'm holding off on the upgrade.  Please let me know ASAP RE: a reasonable time-frame for scheduling the next attempt.  =)

@williamr - in future, are there more folks from the Mozillians team that should be contacted with respect to impending Elasticsearch updates?
Flags: needinfo?(cliang)
(In reply to C. Liang [:cyliang] from comment #10)
> I'm holding off on the upgrade.  Please let me know ASAP RE: a reasonable
> time-frame for scheduling the next attempt.  =)

Thanks. We'll let you know early next week.

> @williamr - in future, are there more folks from the Mozillians team that
> should be contacted with respect to impending Elasticsearch updates?

Yes. :williamr, :tasos and :nemo.

:hoosteeno was previously the contact person, but he's now contributing on other projects.
Flags: needinfo?(willkg)
Rather than individuals, who move between projects, perhaps you have a mailing list for a team instead?
(In reply to Peter Radcliffe [:pir] from comment #12)
> Rather than individuals, who move between projects, perhaps you have a
> mailing list for a team instead?

Good idea - you can post on our team's discussion forum

https://www.mozilla.org/about/forums/#dev-community-tools
@williamr, @tasos: Any updates on what it might take to update Mozillians code to handle a newer version of Elasticsearch? =)
Flags: needinfo?(williamr)
Flags: needinfo?(tasos)
(In reply to C. Liang [:cyliang] from comment #14)
> @williamr, @tasos: Any updates on what it might take to update Mozillians
> code to handle a newer version of Elasticsearch? =)

We are almost there. Most of the work is done and we are in the process of reviewing a PR for this (bug 907933). Hopefully we will be ready on our end by the end of this week. Thanks for holding off on the upgrade.
Flags: needinfo?(williamr)
Flags: needinfo?(tasos)
Quick update - we are currently testing on our stage environment. We plan to be ready on our end by the end of next week.
Whiteboard: [kanban:https://kanbanize.com/ctrl_board/4/1312] → [kanban:engops:https://mozilla.kanbanize.com/ctrl_board/6/2387] [kanban:https://kanbanize.com/ctrl_board/4/1312]
Whiteboard: [kanban:engops:https://mozilla.kanbanize.com/ctrl_board/6/2387] [kanban:https://kanbanize.com/ctrl_board/4/1312] → [kanban:engops:https://mozilla.kanbanize.com/ctrl_board/6/2392] [kanban:https://kanbanize.com/ctrl_board/4/1312]
Quick update - Mozillians is ready for the upgrade. Thanks for holding off on it.
Whiteboard: [kanban:engops:https://mozilla.kanbanize.com/ctrl_board/6/2392] [kanban:https://kanbanize.com/ctrl_board/4/1312] → [kanban:https://kanbanize.com/ctrl_board/4/1312]
Can we get this scheduled (the sooner, the better)?
Flags: needinfo?(cliang)
Back from PTO. =)

I'm not sure that there are enough hands around to do re-indexes, etc. after the update.  If I am wrong, please let me know and I can schedule the update for Friday, December 19th, at 7 AM Pacific Time.

Otherwise, given the holidays (and my need to provide coverage), I think the next window would be January 7th at 10 AM Pacific Time.
Flags: needinfo?(cliang)
(In reply to C. Liang [:cyliang] from comment #19)
> Otherwise, given the holidays (and my need to provide coverage), I think the
> next window would be January 7th at 10 AM Pacific Time.

January 7 sounds great to me!
The dev ES cluster in PHX1 is now running ES version 1.2.4.  You may need to re-index in order to regain functionality.  

This upgrade took a while to work out because one of the plugins did not appear to update correctly (and complained that the version of ES was not supported).

Please let me know if you run into any issues.  (I'm seeing errors in the logs like: "org.elasticsearch.index.query.QueryParsingException: [logs] request does not support [bool]")
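
For whoever owns the [logs] queries: I can't diagnose it from the log line alone, but in my experience one common cause of "request does not support [bool]" on 1.x is sending a bare query clause as the body to APIs like _count that accepted one in 0.90; wrapping the clause in a top-level "query" element usually clears it.  A sketch of that guess, with made-up host, index, and field names:

import json
import requests

ES_URL = "http://localhost:9200"  # placeholder

bool_query = {
    "bool": {
        "must": [{"term": {"example_field": "example_value"}}]
    }
}

# 0.90-era callers could send the bare clause as the body of _count and
# similar APIs; 1.x rejects that with "request does not support [bool]".
# body = json.dumps(bool_query)

# 1.x wants the clause wrapped in a top-level "query" element instead:
body = json.dumps({"query": bool_query})

resp = requests.post(ES_URL + "/example-index/_count", data=body)
print(resp.status_code, resp.text)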
The dev ES cluster is used by both Input -dev and -stage environments. I reindexed -dev and -stage and everything looks fine. Input's good to go.
On SUMO, we are getting the following error when bulk indexing:
https://errormill.mozilla.org/support/sumo-stage/group/175654/ (('48 document(s) failed to index.', [{u'index': {u'status': 503, u'_type': u'users_profile', u'_id': u'294066', u'error': u'EsRejectedExecutionException[rejected execution (queue capacity 50) on org.elasticsearch.action.support.replication.TransportShardReplicationOperationAction$AsyncShardOperationAction$1@528ec562]',....)

More info about the exception here: http://stackoverflow.com/questions/20683440/elasticsearch-gives-error-about-queue-size


I am not sure if we need to tweak some settings or change our app.
Whiteboard: [kanban:https://kanbanize.com/ctrl_board/4/1312] → [kanban:https://webops.kanbanize.com/ctrl_board/2/118]
@r1cky:

  - What is the batch size for the bulk indexing task?
  - Do you temporarily increase the refresh_interval for your bulk indexing task?


It looks like this is tied to "the total number of shards that will be updated on a given node by the bulk calls" [1]; making the batch size smaller should help with that.   

If you aren't increasing the refresh_interval (the default is 1s), you might be able to offset the smaller batch size with more throughput by increasing the refresh_interval before indexing and decreasing it afterwards.  

The queue size has been left at the default; this is usually tied to the number of cores in the cluster and (in dev) there aren't that many.  If decreasing the batch size doesn't help, I can look at temporarily increasing the queue size.  (I'm seeing contradictory information about whether or not changing the queue size requires a cluster reboot.)


[1] Last comment on https://github.com/elasticsearch/elasticsearch-net/issues/427.
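
To make that concrete, here's a rough sketch of the batch-size / refresh_interval combination against the plain REST API.  The host, index, type, and batch size are placeholders (and SUMO presumably goes through ElasticUtils rather than raw HTTP), so treat it as an illustration of the idea, not a drop-in fix:

import json
import requests

ES_URL = "http://localhost:9200"   # placeholder
INDEX = "example-index"            # placeholder
DOC_TYPE = "example-type"          # placeholder

def set_refresh_interval(value):
    # "-1" pauses refreshes during a bulk load; "1s" restores the default.
    settings = {"index": {"refresh_interval": value}}
    requests.put("%s/%s/_settings" % (ES_URL, INDEX), data=json.dumps(settings))

def bulk_index(docs, batch_size=100):
    # Smaller batches mean fewer shard operations queued per node at once,
    # which is what the rejected-execution error is complaining about.
    for start in range(0, len(docs), batch_size):
        lines = []
        for doc in docs[start:start + batch_size]:
            lines.append(json.dumps({"index": {"_id": doc["id"]}}))
            lines.append(json.dumps(doc))
        payload = "\n".join(lines) + "\n"
        requests.post("%s/%s/%s/_bulk" % (ES_URL, INDEX, DOC_TYPE), data=payload)

# set_refresh_interval("-1")
# bulk_index(all_docs, batch_size=100)
# set_refresh_interval("1s")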
Elasticsearch on mozillians.org seems to be working as expected. I reindexed all deployments (dev, stage, prod) without any issues.
@jgriffin, @edmorley, @mcote:  Do you have a rough ETA for testing the use of the autolog, logs, and tbpl indexes under the new version of Elasticsearch?


@r1cky: If SUMO is still encountering issues doing bulk indexing, is there someone I can specifically work with / tasks that I can kick off myself to try to do some troubleshooting?
Flags: needinfo?(rrosario)
Flags: needinfo?(mcote)
Flags: needinfo?(jgriffin)
Flags: needinfo?(emorley)
(In reply to C. Liang [:cyliang] from comment #26)
> @jgriffin, @edmorley, @mcote:  Do you have a rough ETA for testing the use
> of the autolog, logs, and tbpl indexes under the new version of
> Elasticsearch?

We're not using dev at the moment, so this doesn't affect us. Or do you mean the intention is to roll out the new version to prod too? In which case we're going to have issues, as found by our recent inadvertent (partial) usage of dev in bug 1121507.
Flags: needinfo?(emorley)
Oops. :mythmon wasn't CC'd here

(In reply to C. Liang [:cyliang] from comment #26)

> @r1cky: If SUMO is still encountering issues doing bulk indexing, is there
> someone I can specifically work with / tasks that I can kick off myself to
> try to do some troubleshooting?

We read over your link about bulk size but it didn't look related to the error we are seeing. We still need to do further testing and troubleshooting.
Flags: needinfo?(rrosario)
:edmorley - The intent is to roll this upgrade out to the production cluster in PHX (prompted by bug 1062342).  The upgrade to dev is meant as a testbed to work out any issues caused by the impending upgrade in production.

If this upgrade is highly problematic for you and you don't think that you can make the changes needed to accommodate an ES upgrade, please let me know so I can figure out what options there might be for folks who *do* need it. =)
This is one of those chicken-and-egg problems, since we will be rewriting Orange Factor from scratch (we don't use autolog at all anymore).  I don't want to hold up upgrading ES, but at the same time I don't want to put a lot of effort into updating Orange Factor.

Ed, could you summarize the issues you saw?  Combing through that other bug is difficult.
Flags: needinfo?(mcote) → needinfo?(emorley)
This probably just requires some (hopefully minor) changes to mozeslib and related code.  I can take a whack at it; we probably won't have an OrangeFactor replacement for another quarter, at least, and I'm guessing IT would like to roll out this upgrade before then.
Flags: needinfo?(jgriffin)
Depends on: 1131365
(In reply to Mark Côté [:mcote] from comment #30)
> Ed, could you summarize the issues you saw?  Combing through that other bug
> is difficult.

I've filed bug 1131365 with a summary of what I found, though note that the only issues we saw were those caused by logparser submitting to both prod and dev (with dev being on ES v1.2). So any other interactions with ES (e.g. in response to OrangeFactor API calls) haven't been tested at all.
Flags: needinfo?(emorley)
I'm going to close out this bug (since the upgrade has been completed).  Going forward, Orange Factor will be moved to its own cluster (which will stay at 0.90.x).  Once that has been completed, the production clusters can be upgraded to 1.2.x.
Status: NEW → RESOLVED
Closed: 9 years ago
Resolution: --- → FIXED
No longer blocks: 1145671
Depends on: 1145671