Open Bug 1322630 Opened 6 years ago Updated 2 years ago

[tracker] Update Socorro's ElasticSearch cluster to 7.4.0

Categories

(Socorro :: General, task, P3)

Tracking

(Not tracked)

REOPENED

People

(Reporter: miles, Unassigned)

References

(Depends on 3 open bugs, Blocks 3 open bugs)

Details

Attachments

(3 files)

relud mentioned that the current version in EPEL is 2.3, but that if we wait until January there will likely be a newer version.

We need to decide which version of ES to move to. Likely candidates are 2.3.x and 5.0.x. 5.0.0 went GA in October: https://www.elastic.co/blog/elasticsearch-5-0-0-released.

We may also need to upgrade our JVM version, among other things, to support this upgrade.

This depends on the developers updating the ES-related code in Socorro to match whichever version of ES we end up settling on.
Blocks: 1322629
Information on the upgrade process from 1.x to 2.x and from 1.x to 5.x: https://www.elastic.co/guide/en/elasticsearch/reference/current/reindex-upgrade.html.
We need to settle on a target version before upgrading our code dependencies. Maybe we can choose 2.3 as a goal for now? I believe upgrading from 2.x to 5.x or whatever future version will be easier. 

Also, I think it would be good for us to practice upgrading our ES cluster and code, because keeping up with recent versions is always a good thing.
Miles, can we choose a version? I see that 2.4.3 was released in December, as well as 5.1.1. I don't know much about either, but I don't think they will impact us much feature-wise, so I think it's really a matter of what's best infra-wise.
Flags: needinfo?(miles)
I did a little research into these and the types of upgrades that they would entail. Since we will be spinning up a new cluster in a new environment regardless, I think it makes sense to bump up to the latest stable release, 5.1.1.

Is there any reason not to be current?
Flags: needinfo?(miles)
Some notes while I read the documentation: 

> Elasticsearch 5.0 can read indices created in version 2.0 or above. An Elasticsearch 5.0 node
> will not start in the presence of indices created in a version of Elasticsearch before 2.0.
 -- https://www.elastic.co/guide/en/elasticsearch/reference/current/breaking-changes-5.0.html

More interesting documentation: https://www.elastic.co/guide/en/elasticsearch/reference/current/reindex-upgrade.html

Doing a step upgrade through 2.x won't help, since all of our indices were created in 1.x, so we would have to reindex them in 2.x anyway. We can use reindex-from-remote, I guess. If we use the optimizations they show, it might not take too long.
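For reference, a reindex-from-remote request is just a JSON body POSTed to the new cluster's `_reindex` endpoint. Here is a minimal sketch of building that body; the host and index names are made-up placeholders, not our actual cluster names:

```python
import json

# Hedged sketch of a reindex-from-remote request body (ES 5.x _reindex API).
# Host and index names below are hypothetical placeholders.
def reindex_from_remote_body(remote_host, source_index, dest_index):
    return {
        "source": {
            "remote": {"host": remote_host},  # the old 1.x cluster
            "index": source_index,
            "size": 1000,  # larger batch size, one of the documented optimizations
        },
        "dest": {"index": dest_index},
    }

# The resulting body would be POSTed to http://new-cluster:9200/_reindex.
body = reindex_from_remote_body("http://old-es:9200", "socorro201652", "socorro201652")
print(json.dumps(body, indent=2))
```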

That makes me wonder what our migration plan can be. If we have to reindex, that means we'll need to have two clusters living side-by-side, and we might need to send data to both at the same time. However, doing that will require that we have two different ES modules in our code base, one for 1.x and one for 5.x. That is something I would like to avoid.
Another possibility would be to go to 2.x, take some time to reindex all our 1.x indices, then do another upgrade to 5.x once the 1.x-created indices have been reindexed or have rolled off. But that means we do an upgrade now and another one in a few weeks, and maybe that's too much work?
(In reply to Adrian Gaudebert [:adrian] from comment #5)
> However, doing that will require
> that we have two different ES modules in our code base, one for 1.x and one
> for 5.x. That is something I would like to avoid.

Actually, that would be very difficult, since it would require having different versions of the same Python library installed at the same time, and as far as I know that's not trivial to do. Let's assume it's not possible to use two clusters running different versions of ES at the same time.

That means when we switch from one cluster to the other, we have to do it in one go, doing something like: stop all processors, upgrade code, update configuration to use the new cluster, restart processors. The question of how to make sure all data from cluster 1 made it to cluster 2 still has to be answered, though.

I am more and more inclined to say that we can't really jump straight to 5.x, but let's keep talking about it. Maybe I'm missing something?
Okay, I see a couple options.

1. 1.x => 5.x
Create a fresh 5.x cluster. Take down the processors. Use reindex-from-remote (https://www.elastic.co/guide/en/elasticsearch/reference/current/reindex-upgrade.html#reindex-upgrade-remote) to get the new cluster up to date. Upgrade the processors to a 5.x compatible client. Switch the processors over to sending data to the 5.x cluster.

2. 1.x => 1.x => 2.x (?) => 5.x
Create a fresh 1.x cluster. Take down the processors for the cloning process. Clone our existing 1.x cluster. We can keep these in sync by issuing requests to both clusters simultaneously once the processors are brought back up.

To upgrade, take down the processors. Upgrade the new cluster to 2.x (full cluster restart). Bring up the processors with a 2.x compatible client, only sending data to the new cluster. Start a new index.

If we choose to, reindex the old indices in place using the Elasticsearch Migration Plugin (https://www.elastic.co/guide/en/elasticsearch/reference/current/reindex-upgrade.html#reindex-upgrade-inplace) in preparation for the upgrade to 5.x (another full cluster restart).

3. 1.x => 5.x #2
Create a fresh 5.x cluster. Take down the processors. Create a second RabbitMQ queue. Reconfigure the collectors to put crash ids to process into both queues. Clone the existing queue to the new queue. Use reindex-from-remote (https://www.elastic.co/guide/en/elasticsearch/reference/current/reindex-upgrade.html#reindex-upgrade-remote) to get the new cluster up to date. Bring up a set of processors that support the 5.x elasticsearch client for the 5.x cluster, consuming from the second queue. Bring up the existing processors that support the 1.x elasticsearch client for the existing 1.x cluster, consuming from the original queue. Now there should be two queues, two sets of processors, and two elasticsearch clusters. The elasticsearch clusters should become eventually consistent through the queueing. At this point, we can switch to the 5.x cluster safely.

Some key points:

1. In the first two scenarios we can leverage RabbitMQ to build up a queue of data during cluster downtime and then consume it after switching over the processors.

2. We can leverage RabbitMQ to keep two clusters in sync simultaneously, with the collectors putting crashes into two queues for processing and two separate sets of processors consuming from them: one queue for the processors with one version of the elasticsearch client library, and one queue for the processors with the other version.

3. In all scenarios the 1.x cluster is alive and handling incoming requests to crash-stats. This means that there is no effective downtime for our users, even if new incoming data is not indexed by elasticsearch.

An example of upgrading without downtime: https://signalfx.com/blog/upgraded-elasticsearch-1-x-2-x-zero-downtime-handling-upgrade-practice/. This blog post explains that they had to have two different versions of the client libraries in their applications at once.
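The dual-queue fan-out in option 3 can be sketched in miniature. This is a simulation only: the real system would use RabbitMQ (e.g. via pika), and in-process queues stand in for it here just to show the eventual-consistency idea.

```python
from queue import Queue

# Simulation of the dual-queue idea: the collector fans each crash id out
# to two queues, and two independent processor sets consume them.
queue_1x = Queue()  # consumed by processors with the 1.x ES client
queue_5x = Queue()  # consumed by processors with the 5.x ES client

def collect(crash_id):
    """Fan a crash id out to both processing queues."""
    for q in (queue_1x, queue_5x):
        q.put(crash_id)

collect("crash-0001")
collect("crash-0002")

# Each processor set drains its own queue; both see the same crash ids,
# so the two ES clusters become eventually consistent.
drained_1x = [queue_1x.get() for _ in range(queue_1x.qsize())]
drained_5x = [queue_5x.get() for _ in range(queue_5x.qsize())]
```

The catch, as discussed below, is everything around this: reprocessing has to enqueue to both queues too, and both processor sets touch the non-ES crashstores.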
I don't think we can do two groups of processors for a few reasons.

First, I'm concerned about the non-ES crashstores--they'd both be saving to them. If they're not, then they have different save pipelines and that might result in different data (we hope it doesn't, but that's what that whole "let's clone the crash for every crash store" thing was about a couple of quarters ago).

Second, we'd have to have the crashmover add crash ids to two rabbitmq queues. I think that's doable. We'd also need to put crash ids into two queues for reprocessing. I'm pretty sure the way we do reprocessing makes this difficult or impossible.

Third, setting up a second group of processors seems hard. (Hand-waving here.) We have to make sure we push to them during deploys, they have to have a separate configuration in consul, we'd need to monitor them separately in datadog and other systems, ...
I'd like to know which version of es-py we're using and what versions of ES it supports and then what version we'd have to upgrade to for the possible ES versions we're targeting. Ditto for es-dsl.

I don't think I've seen that specified anywhere. Adrian's out until Monday. I'll look into this either later today or tomorrow.
(In reply to Will Kahn-Greene [:willkg] from comment #9)
> I don't think we can do two groups of processors for a few reasons.
> 
> First, I'm concerned about the non-ES crashstores--they'd both be saving to
> them. If they're not, then they have different save pipelines and that might
> result in different data (we hope it doesn't, but that's what that whole
> "let's clone the crash for every crash store" thing was about a couple of
> quarters ago).

I'm confused about the "different data" that you're referring to. Does this refer to the possibility for different data in elasticsearch vs. in S3 with multiple processor versions processing crashes?

How large of a concern is that?

> Second, we'd have to have the crashmover add crash ids to two rabbitmq
> queues. I think that's doable. We'd also need to put crash ids into two
> queues for reprocessing. I'm pretty sure the way we do reprocessing makes
> this difficult or impossible.

What is the reprocessing process, and what about it makes this difficult or impossible? My assumption is that there has to be some removal of the incorrectly processed crash, and then a re-processing of that crash. Is that correct? If so, the removal sounds like the trickiest bit, but we would in effect be duplicating the processing "pipeline", so the event should be able to be sent to both pipelines (if it is a queued event). If only one pipeline deals with S3, that pipeline could be responsible for the removal.

I'm making a lot of assumptions above. I don't know how this actually works under the hood.

> Third, setting up a second group of processors seems hard. (Hand-waving
> here.) We have to make sure we push to them during deploys, they have to
> have a separate configuration in consul, we'd need to monitor them
> separately in datadog and other systems, ...

From an operations standpoint, it is not necessarily difficult. I know how to create a separate group of processors with separate roles (for monitoring, etc.); having to have separate configuration in consul is definitely troublesome but could be overcome. There are a couple of methods around that: since the processors take config from their environment, we could ditch consul entirely for the second set and use Puppet to supply environment configuration. Alternatively, since we would be forking the processor code to account for the elasticsearch version changes, we could change the configuration names (feels bad) and add those to consul.
(In reply to Miles Crabill [:miles] from comment #11)
> (In reply to Will Kahn-Greene [:willkg] from comment #9)
> > I don't think we can do two groups of processors for a few reasons.
> > 
> > First, I'm concerned about the non-ES crashstores--they'd both be saving to
> > them. If they're not, then they have different save pipelines and that might
> > result in different data (we hope it doesn't, but that's what that whole
> > "let's clone the crash for every crash store" thing was about a couple of
> > quarters ago).
> 
> I'm confused about the "different data" that you're referring to. Does this
> refer to the possibility for different data in elasticsearch vs. in S3 with
> multiple processor versions processing crashes?
> 
> How large of a concern is that?

We either have the two sets of processors have the same list of crashstores which leads to them stomping on one another for non-ES crashstores, OR we have them have two different lists of crashstores. A couple of months ago, we discovered that the Elasticsearch crashstore modifies the crash before saving it thus affecting all downstream crashstores. We theorize the other crashstores don't do this, but I don't think we've proven it.

So if the crashstore sets are different, the data we're saving to the two ES environments could be different.


> > Second, we'd have to have the crashmover add crash ids to two rabbitmq
> > queues. I think that's doable. We'd also need to put crash ids into two
> > queues for reprocessing. I'm pretty sure the way we do reprocessing makes
> > this difficult or impossible.
> 
> What is the reprocessing process, and what about it makes this difficult or
> impossible? My assumption is that there has to be some removal of the
> incorrectly processed crash, and then a re-processing of that crash. Is that
> correct? If so, the removal sounds like the trickiest bit, but we would in
> effect be duplicating the processing "pipeline", so the event should be able
> to be sent to both pipelines (if it is a queued event). If only one pipeline
> deals with S3, that pipeline could be responsible for the removal.
> 
> I'm making a lot of assumptions above. I don't know how this actually works
> under the hood.

I'm concerned about the part that kicks off reprocessing by adding crash ids to the rabbitmq queues. If we have two sets of processors, we need to have two sets of queues, so the things that kick off reprocessing have to add crash ids to *both* sets of queues.

I haven't touched that code, but I think some of it is in the webapp where we've been writing things without the configman architecture. If that's correct, then I think this requires a non-trivial code change.


> > Third, setting up a second group of processors seems hard. (Hand-waving
> > here.) We have to make sure we push to them during deploys, they have to
> > have a separate configuration in consul, we'd need to monitor them
> > separately in datadog and other systems, ...
> 
> From an operations standpoint, it is not necessarily difficult. I know how
> to create a separate group of processors with separate roles (for
> monitoring, etc.), having to have separate configuration in consul is
> definitely troublesome but could be overcome. There are a couple methods
> around that: since the processors take config from their environment, we
> could ditch consul entirely for the second set and use Puppet to supply
> environment configuration. Alternatively since we would be forking the
> processor code to account for the elasticsearch version changes, we could
> change the configuration names (feels bad) and add those to consul.

I don't think these issues are insurmountable. I think this is a substantial amount of work to architect, build and maintain until we're on the new ES cluster. I'm not convinced this option is worth that amount of work, but if it is, I'd like to know how much work is involved here compared with other options (assuming there are other options (I hope there are)).

This thread is getting pretty involved and is, I contend, outgrowing the Bugzilla-style discussion forum format. I think we should figure out the list of "things that affect our decisions" in the bug comments so we're all on the same page about what's important to us and how we're deciding. Then flesh out this option into a spec in a Google doc. Ditto for other options (assuming there are other options (I hope there are)).
elasticsearch==1.2
elasticsearch-dsl==0.0.8

According to the elasticsearch documentation, the library's major version number matches the major version number of the elasticsearch cluster. elasticsearch-dsl is similarly set up, but uses version 0 for ES 1.x and 2 for ES 2.x, with no apparent support for 5.
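The elasticsearch-py convention above (client major version == cluster major version) suggests a trivial sanity check we could run before deploying. This helper is hypothetical, not part of Socorro; it just encodes the rule, which notably does not apply to es-dsl's 0.x/2.x scheme:

```python
# Hedged sketch: elasticsearch-py's major version is documented to track
# the cluster's major version. This hypothetical helper checks that rule.
def client_matches_cluster(client_version, cluster_version):
    """Return True if the client's major version matches the cluster's."""
    return client_version.split(".")[0] == cluster_version.split(".")[0]

# elasticsearch==1.2 (what we run) against a 1.x cluster: fine.
# The same client against a 5.1.1 cluster: not fine.
print(client_matches_cluster("1.2", "1.4.2"))
print(client_matches_cluster("1.2", "5.1.1"))
```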
Something to weigh against the complexity of zero-downtime gymnastics:

With some communication and planning around release windows, we can accept limited downtime for the reporter. We'd still need to do collection, but could let the queue build up. A few hours during off-peak usage would be ok, as long as we had a plan to back out if things go haywire.
es-py changelog is here: https://elasticsearch-py.readthedocs.io/en/master/Changelog.html

I think in parallel with figuring out how to upgrade our ES cluster, we should upgrade es-py to 1.8.

es-dsl changelog is here: http://elasticsearch-dsl.readthedocs.io/en/latest/Changelog.html

I think in parallel with figuring out how to upgrade our ES cluster, we should upgrade to es-dsl 0.0.10.

The language they have regarding Python lib versions and what ES version they're "compatible with" is confusing. There are no details about what changed in the library to support that compatibility, or whether those versions are no longer compatible with older clusters. We might be able to get by with upgrading to a higher version now with what we've got and then switching our ES cluster. That'd require some analysis of what actually changed and what we actually use.
Depends on: 1331659
Note to self: we need to investigate mapping conflicts if we migrate our current indices. It is possible that the mappings we currently use won't be compatible with ES 2+ and thus will cause indices to not be accepted.
Here's a proposal of a migration plan along with a rollback plan: https://public.etherpad-mozilla.org/p/socorro-es5-migration-plan

Please comment or add more details where needed!
Depends on: 1332364
(In reply to Adrian Gaudebert [:adrian] from comment #17)
> Here's a proposal of a migration plan along with a rollback plan:
> https://public.etherpad-mozilla.org/p/socorro-es5-migration-plan
> 
> Please comment or add more details where needed!

Left some comments, overall LGTM.
Depends on: 1342081
AWS hosted elasticsearch now supports ES 5.1. (https://aws.amazon.com/about-aws/whats-new/2017/01/elasticsearch-5-now-available-on-amazon-elasticsearch-service/)

It will be significantly less overhead to use hosted ES, and it comes with the full ELK stack out of the box.

Developer documentation: http://docs.aws.amazon.com/elasticsearch-service/latest/developerguide/what-is-amazon-elasticsearch-service.html

Limits: http://docs.aws.amazon.com/elasticsearch-service/latest/developerguide/aes-limits.html (they offer instance types with more than 32GB of RAM but only use up to 32GB, the rest is used for disk cache).

My current recommendation upon seeing this is that we use AWS hosted elasticsearch. I still need to look into how we would do the data migration, i.e. whether reindex-from-remote is still viable in the same way.
Back when JP tried to move us to hosted ES, we had trouble connecting from our hosts to the ES cluster because of how AWS manages permissions. As far as I know he tried a lot of things and never solved the issue, and we came to the conclusion that we would need some heavy code changes to adapt to that new way of connecting, which we never did. I can't find the context, sadly.

Miles, are you aware of any changes there? Can we perhaps try to connect stage to a hosted ES cluster, just to see if that can work? (Without worrying about compatibility or anything, just to see if we can reach it from our infra.)
(In reply to Adrian Gaudebert [:adrian] from comment #20)
> Heavy code changes to adapt to that new way of connecting, which we never did.
> I can't find the context sadly.
I'm not sure what those issues might be, but in the worst case we could set up some sort of reverse proxy to bridge the gap if authentication is the difference (as Will suggested it might be).

> Can we perhaps try to connect stage to a hosted ES cluster, just to see if that can work?
This is doable, but probably something we want to plan carefully or do together once your 5.1 code is finished.

Let's talk about how to go about this during tomorrow's meeting.
Scratch that, I sent the file before testing it enough, and it turns out I need some more time to make sure it actually works correctly.
Depends on: 1342083
See bug 1342083 for the schema.
Assignee: nobody → dthorn
Assignee: dthorn → miles
I've begun experimenting with the snapshot/restore/upgrade plan that we have for upgrading our elasticsearch from 1.x => 2.x => 5.x.

Current status:
* Snapshotted Socorro stage elasticsearch to s3 successfully
* Created an AWS ES domain in cloudops prod (because Socorro stage ES basically contains prod sensitive data, and we need to make a security decision about the implications of the data sharing that we are doing)

Todo:
* Restore the 1.x snapshot into 2.x
* Everything else
Current status:

* working WIP ES 5.3.3 cluster in ops stage, reliably deployable using standard cloudops methodologies
  * behind an ELB, same as previous ES cluster
  * caveats
    * logging 2.0 is not configured
    * still need to figure out how we will do healthchecking
      * currently uses 'HTTP:9200/' - this should probably be changed to 'TCP:9200' because we don't want a cluster problem to cause the ELB to terminate _all_ of the instances if/when this returns non-200. Could even switch to using EC2 healthcheck instead of ELB.

* ES 5.3.3 cluster configured behind a NAT instance so that we can whitelist a single IP for use with reindex-from-remote from the old cluster
  * caveats
    * due to EC2 networking limitations the cluster and the NAT all have to be in a single AZ / subnet, so that is the case right now. Once the data from the old cluster is all imported we can add new nodes from different AZs to the cluster, shift the data off of the single-AZ nodes, and then eliminate the single-AZ nodes.
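The 'TCP:9200' healthcheck idea above (succeed if the port accepts a connection at all, so a non-200 cluster status doesn't make the ELB cycle every instance) can be sketched as follows; this is an illustration of the semantics, not how the ELB actually implements it:

```python
import socket

# Sketch of a TCP-level healthcheck: the check passes if the port accepts
# a connection, regardless of what HTTP status the service would return.
def tcp_healthcheck(host, port, timeout=2.0):
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False
```

The trade-off is that a node whose ES process is wedged but still listening would pass this check; the EC2-healthcheck alternative mentioned above sidesteps the ELB entirely.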

Todo this week:

* Get the ES 5.3.3 cluster deployed in ops prod
* Whitelist the prod cluster NAT instance for reindex-from-remote from the Socorro stage ES cluster
* Test reindex-from-remote Socorro stage => ops prod cluster

The Socorro stage cluster contains prod/prodlike data, hence the plan of only bringing that data into ops prod for now.
Miles: What does this mean?:

> The Socorro stage cluster contains prod/prodlike data, hence plan of only bringing that data into ops prod for now.

Does that mean that the resulting Socorro -stage environment will have an ES cluster that doesn't have any data in it?
(In reply to Will Kahn-Greene [:willkg] ET needinfo? me from comment #28)
> Miles: What does this mean?:
> 
> > The Socorro stage cluster contains prod/prodlike data, hence plan of only bringing that data into ops prod for now.
> 
> Does that mean that the resulting Socorro -stage environment will have an ES
> cluster that doesn't have any data in it?

No, it means for now I'm importing the data into a cluster in ops prod. I'll have a chat with cloudsec about security around the data in the stage ES cluster. We can point Socorro stage at the ops prod cluster for now.
The ES 5.3.3 cluster is deployed in ops prod, the NAT is configured correctly, and I can successfully connect to the old cluster.

The mapping provided earlier in this thread is not working for me on 5.3.3. I'm trying to create a fresh index into which I can then reindex-from-remote the data from the old cluster:

$ curl -XPUT 'localhost:9200/socorro201711?pretty' -H 'Content-Type: application/json' -d "@mappings.txt"
{
  "error" : {
    "root_cause" : [
      {
        "type" : "not_x_content_exception",
        "reason" : "Compressor detection can only be called on some xcontent bytes or compressed xcontent bytes"
      }
    ],
    "type" : "not_x_content_exception",
    "reason" : "Compressor detection can only be called on some xcontent bytes or compressed xcontent bytes"
  },
  "status" : 500
}

For now I will deploy w/ 5.1.1 and see if the mapping works for that version.
Adrian, could you update the mapping for 5.3.3?
Flags: needinfo?(adrian)
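For what it's worth, a not_x_content_exception usually means ES couldn't parse the request body as JSON at all (wrong file, empty file, shell quoting eating the payload). A cheap pre-flight check before the curl -XPUT would be to parse the file locally; this helper is just an illustration, and "mappings.txt" matches the filename used in the curl command above:

```python
import json

# Hypothetical pre-flight check: make sure the mapping file is valid JSON
# and has a top-level "mappings" key before PUTting it to the cluster.
def validate_mapping_file(path):
    with open(path) as f:
        doc = json.load(f)  # raises ValueError on malformed JSON
    if "mappings" not in doc:
        raise ValueError("top-level 'mappings' key missing")
    return doc
```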
Pretty sure the ES 5.3-friendly mapping is here:

    https://raw.githubusercontent.com/adngdb/socorro/6dfdcbd7a67cb3bb7328e49b648b54c966056cfd/scripts/data/supersearch_fields.json

It's the /scripts/data/supersearch_fields.json file in Adrian's branch here:

    https://github.com/adngdb/socorro/tree/1342081-upgrade-to-es-5.1

I think that even though the branch says that it's for ES 5.1, it covers the changes for 5.3, too.

Adrian: Does that sound right?
I tried using the 5.3 mappings, but I'm still having issues. This is using this file: https://raw.githubusercontent.com/adngdb/socorro/6dfdcbd7a67cb3bb7328e49b648b54c966056cfd/scripts/data/supersearch_fields.json (wrapped in a mappings object).

$ curl -XPUT 'localhost:9200/socorro201711?pretty' -H 'Content-Type: application/json' -d "@mappings.txt"
{
  "error" : {
    "root_cause" : [
      {
        "type" : "mapper_parsing_exception",
        "reason" : "Root mapping definition has unsupported parameters:  [is_returned : true] [in_database_name : reason] [permissions_needed : []] [storage_mapping : {type=text, fields={full={type=keyword}}}] [form_field_choices : []] [form_field_type : StringField] [description : The crash's exception kind. Different OSes have different exception kinds. Example values: 'EXCEPTION_ACCESS_VIOLATION_READ', 'EXCEPTION_BREAKPOINT', 'SIGSEGV'.] [default_value : null] [is_mandatory : false] [query_type : string] [is_exposed : true] [namespace : processed_crash] [name : reason] [data_validation_type : str] [has_full_version : true]"
      }
    ],
    "type" : "mapper_parsing_exception",
    "reason" : "Failed to parse mapping [reason]: Root mapping definition has unsupported parameters:  [is_returned : true] [in_database_name : reason] [permissions_needed : []] [storage_mapping : {type=text, fields={full={type=keyword}}}] [form_field_choices : []] [form_field_type : StringField] [description : The crash's exception kind. Different OSes have different exception kinds. Example values: 'EXCEPTION_ACCESS_VIOLATION_READ', 'EXCEPTION_BREAKPOINT', 'SIGSEGV'.] [default_value : null] [is_mandatory : false] [query_type : string] [is_exposed : true] [namespace : processed_crash] [name : reason] [data_validation_type : str] [has_full_version : true]",
    "caused_by" : {
      "type" : "mapper_parsing_exception",
      "reason" : "Root mapping definition has unsupported parameters:  [is_returned : true] [in_database_name : reason] [permissions_needed : []] [storage_mapping : {type=text, fields={full={type=keyword}}}] [form_field_choices : []] [form_field_type : StringField] [description : The crash's exception kind. Different OSes have different exception kinds. Example values: 'EXCEPTION_ACCESS_VIOLATION_READ', 'EXCEPTION_BREAKPOINT', 'SIGSEGV'.] [default_value : null] [is_mandatory : false] [query_type : string] [is_exposed : true] [namespace : processed_crash] [name : reason] [data_validation_type : str] [has_full_version : true]"
    }
  },
  "status" : 400
}
Tagging Adrian with a needinfo to help figure stuff out.
Miles, this is my bad. I don't know what happened in my head, but the file I sent you is not the mapping file; it's the file that we should use to populate our Super Search Fields list (which we also need to do during cluster creation, as it is required for the processors and Super Search to work).
Flags: needinfo?(adrian)
Attached file es_mapping_v5.3.3.json
Please try this file instead. I have tested it locally with ES 5.3.3 and it worked.
We now have an elasticsearch 5.3.3 cluster in ops prod that has a recent copy of all of the data from the webeng stage elasticsearch cluster. I have set up a proxy in webeng that allows ingress from the stage processor and webapp and forwards traffic to the new cluster.

The way to access the elasticsearch proxy is via: new-socorro-es-proxy.mocotoolsstaging.net:9200 from the stage processor or webapp instances.

Now what remains is to test the updated elasticsearch libraries with the new cluster!
I ran the following commands, which should change the processor and webapp to point at the new cluster.

consulate kv set socorro/web-django/resource.elasticsearch.elasticsearch_urls new-socorro-es-proxy.mocotoolsstaging.net:9200
consulate kv set socorro/webapp-django/ELASTICSEARCH_URLS new-socorro-es-proxy.mocotoolsstaging.net:9200
consulate kv set socorro/common/resource.elasticsearch.elasticsearch_urls new-socorro-es-proxy.mocotoolsstaging.net:9200
consulate kv set socorro/common/resource.elasticsearch.elasticSearchHostname new-socorro-es-proxy.mocotoolsstaging.net:9200
Stage has been mucked with. One caveat of the migration: we didn't have a mapping file for the `socorro` index, which stores misc. stuff including (probably) supersearch fields. So I just migrated it without providing a mapping. This seemed to work OK except that the homepage of crash-stats is broken: https://crash-stats.allizom.org/home/product/Firefox.

These are the mappings that were generated by reindex-from-remote into 5.3.3:
"mappings": {
      "supersearch_fields": {
        "properties": {
          "data_validation_type": {
            "type": "text",
            "fields": {
              "keyword": {
                "type": "keyword",
                "ignore_above": 256
              }
            }
          },
          "description": {
            "type": "text",
            "fields": {
              "keyword": {
                "type": "keyword",
                "ignore_above": 256
              }
            }
          },
          "form_field_choices": {
            "type": "text",
            "fields": {
              "keyword": {
                "type": "keyword",
                "ignore_above": 256
              }
            }
          },
          "form_field_type": {
            "type": "text",
            "fields": {
              "keyword": {
                "type": "keyword",
                "ignore_above": 256
              }
            }
          },
          "has_full_version": {
            "type": "boolean"
          },
          "in_database_name": {
            "type": "text",
            "fields": {
              "keyword": {
                "type": "keyword",
                "ignore_above": 256
              }
            }
          },
          "is_exposed": {
            "type": "boolean"
          },
          "is_mandatory": {
            "type": "boolean"
          },
          "is_returned": {
            "type": "boolean"
          },
          "name": {
            "type": "text",
            "fields": {
              "keyword": {
                "type": "keyword",
                "ignore_above": 256
              }
            }
          },
          "namespace": {
            "type": "text",
            "fields": {
              "keyword": {
                "type": "keyword",
                "ignore_above": 256
              }
            }
          },
          "permissions_needed": {
            "type": "text",
            "fields": {
              "keyword": {
                "type": "keyword",
                "ignore_above": 256
              }
            }
          },
          "query_type": {
            "type": "text",
            "fields": {
              "keyword": {
                "type": "keyword",
                "ignore_above": 256
              }
            }
          },
          "storage_mapping": {
            "properties": {
              "analyzer": {
                "type": "text",
                "fields": {
                  "keyword": {
                    "type": "keyword",
                    "ignore_above": 256
                  }
                }
              },
              "dynamic": {
                "type": "text",
                "fields": {
                  "keyword": {
                    "type": "keyword",
                    "ignore_above": 256
                  }
                }
              },
              "fields": {
                "properties": {
                  "Android_Model": {
                    "properties": {
                      "type": {
                        "type": "text",
                        "fields": {
                          "keyword": {
                            "type": "keyword",
                            "ignore_above": 256
                          }
                        }
                      }
                    }
                  },
                  "AsyncShutdownTimeout": {
                    "properties": {
                      "analyzer": {
                        "type": "text",
                        "fields": {
                          "keyword": {
                            "type": "keyword",
                            "ignore_above": 256
                          }
                        }
                      },
                      "index": {
                        "type": "text",
                        "fields": {
                          "keyword": {
                            "type": "keyword",
                            "ignore_above": 256
                          }
                        }
                      },
                      "type": {
                        "type": "text",
                        "fields": {
                          "keyword": {
                            "type": "keyword",
                            "ignore_above": 256
                          }
                        }
                      }
                    }
                  },
                  "PluginFilename": {
                    "properties": {
                      "index": {
                        "type": "text",
                        "fields": {
                          "keyword": {
                            "type": "keyword",
                            "ignore_above": 256
                          }
                        }
                      },
                      "type": {
                        "type": "text",
                        "fields": {
                          "keyword": {
                            "type": "keyword",
                            "ignore_above": 256
                          }
                        }
                      }
                    }
                  },
                  "PluginName": {
                    "properties": {
                      "index": {
                        "type": "text",
                        "fields": {
                          "keyword": {
                            "type": "keyword",
                            "ignore_above": 256
                          }
                        }
                      },
                      "type": {
                        "type": "text",
                        "fields": {
                          "keyword": {
                            "type": "keyword",
                            "ignore_above": 256
                          }
                        }
                      }
                    }
                  },
                  "PluginVersion": {
                    "properties": {
                      "index": {
                        "type": "text",
                        "fields": {
                          "keyword": {
                            "type": "keyword",
                            "ignore_above": 256
                          }
                        }
                      },
                      "type": {
                        "type": "text",
                        "fields": {
                          "keyword": {
                            "type": "keyword",
                            "ignore_above": 256
                          }
                        }
                      }
                    }
                  },
                  "cpu_info": {
                    "properties": {
                      "analyzer": {
                        "type": "text",
                        "fields": {
                          "keyword": {
                            "type": "keyword",
                            "ignore_above": 256
                          }
                        }
                      },
                      "index": {
                        "type": "text",
                        "fields": {
                          "keyword": {
                            "type": "keyword",
                            "ignore_above": 256
                          }
                        }
                      },
                      "type": {
                        "type": "text",
                        "fields": {
                          "keyword": {
                            "type": "keyword",
                            "ignore_above": 256
                          }
                        }
                      }
                    }
                  },
                  "full": {
                    "properties": {
                      "index": {
                        "type": "text",
                        "fields": {
                          "keyword": {
                            "type": "keyword",
                            "ignore_above": 256
                          }
                        }
                      },
                      "type": {
                        "type": "text",
                        "fields": {
                          "keyword": {
                            "type": "keyword",
                            "ignore_above": 256
                          }
                        }
                      }
                    }
                  },
                  "os_name": {
                    "properties": {
                      "type": {
                        "type": "text",
                        "fields": {
                          "keyword": {
                            "type": "keyword",
                            "ignore_above": 256
                          }
                        }
                      }
                    }
                  },
                  "product": {
                    "properties": {
                      "index": {
                        "type": "text",
                        "fields": {
                          "keyword": {
                            "type": "keyword",
                            "ignore_above": 256
                          }
                        }
                      },
                      "type": {
                        "type": "text",
                        "fields": {
                          "keyword": {
                            "type": "keyword",
                            "ignore_above": 256
                          }
                        }
                      }
                    }
                  },
                  "reason": {
                    "properties": {
                      "analyzer": {
                        "type": "text",
                        "fields": {
                          "keyword": {
                            "type": "keyword",
                            "ignore_above": 256
                          }
                        }
                      },
                      "type": {
                        "type": "text",
                        "fields": {
                          "keyword": {
                            "type": "keyword",
                            "ignore_above": 256
                          }
                        }
                      }
                    }
                  },
                  "signature": {
                    "properties": {
                      "type": {
                        "type": "text",
                        "fields": {
                          "keyword": {
                            "type": "keyword",
                            "ignore_above": 256
                          }
                        }
                      }
                    }
                  },
                  "user_comments": {
                    "properties": {
                      "type": {
                        "type": "text",
                        "fields": {
                          "keyword": {
                            "type": "keyword",
                            "ignore_above": 256
                          }
                        }
                      }
                    }
                  }
                }
              },
              "format": {
                "type": "text",
                "fields": {
                  "keyword": {
                    "type": "keyword",
                    "ignore_above": 256
                  }
                }
              },
              "index": {
                "type": "text",
                "fields": {
                  "keyword": {
                    "type": "keyword",
                    "ignore_above": 256
                  }
                }
              },
              "null_value": {
                "type": "long"
              },
              "properties": {
                "properties": {
                  "skunk_works": {
                    "properties": {
                      "dynamic": {
                        "type": "text",
                        "fields": {
                          "keyword": {
                            "type": "keyword",
                            "ignore_above": 256
                          }
                        }
                      },
                      "properties": {
                        "properties": {
                          "classification": {
                            "properties": {
                              "type": {
                                "type": "text",
                                "fields": {
                                  "keyword": {
                                    "type": "keyword",
                                    "ignore_above": 256
                                  }
                                }
                              }
                            }
                          },
                          "classification_data": {
                            "properties": {
                              "type": {
                                "type": "text",
                                "fields": {
                                  "keyword": {
                                    "type": "keyword",
                                    "ignore_above": 256
                                  }
                                }
                              }
                            }
                          },
                          "classification_version": {
                            "properties": {
                              "analyzer": {
                                "type": "text",
                                "fields": {
                                  "keyword": {
                                    "type": "keyword",
                                    "ignore_above": 256
                                  }
                                }
                              },
                              "type": {
                                "type": "text",
                                "fields": {
                                  "keyword": {
                                    "type": "keyword",
                                    "ignore_above": 256
                                  }
                                }
                              }
                            }
                          }
                        }
                      }
                    }
                  },
                  "support": {
                    "properties": {
                      "dynamic": {
                        "type": "text",
                        "fields": {
                          "keyword": {
                            "type": "keyword",
                            "ignore_above": 256
                          }
                        }
                      },
                      "properties": {
                        "properties": {
                          "classification": {
                            "properties": {
                              "type": {
                                "type": "text",
                                "fields": {
                                  "keyword": {
                                    "type": "keyword",
                                    "ignore_above": 256
                                  }
                                }
                              }
                            }
                          },
                          "classification_data": {
                            "properties": {
                              "type": {
                                "type": "text",
                                "fields": {
                                  "keyword": {
                                    "type": "keyword",
                                    "ignore_above": 256
                                  }
                                }
                              }
                            }
                          },
                          "classification_version": {
                            "properties": {
                              "analyzer": {
                                "type": "text",
                                "fields": {
                                  "keyword": {
                                    "type": "keyword",
                                    "ignore_above": 256
                                  }
                                }
                              },
                              "type": {
                                "type": "text",
                                "fields": {
                                  "keyword": {
                                    "type": "keyword",
                                    "ignore_above": 256
                                  }
                                }
                              }
                            }
                          }
                        }
                      }
                    }
                  }
                }
              },
              "type": {
                "type": "text",
                "fields": {
                  "keyword": {
                    "type": "keyword",
                    "ignore_above": 256
                  }
                }
              }
            }
          }
        }
      }
    }
Note: the stage admin node needs to be able to access the ES cluster as well (for some cron jobs).

The mapping file you show there, Miles, doesn't look wrong, but it could be much better optimized. I am not sure that is worth our time, especially since it would require code changes (if, for example, we decided to store storage_mapping as an actual string instead of a JSON blob, to avoid having that huge mapping tree).
< forgot to post this yesterday >

The stage admin node has access to the ES cluster, and afaict crontabber is working correctly. Agreed that the mapping file is gross; it is what Elasticsearch generated upon reindexing the index.

Last week when Peter and I were debugging reindexing we ran into this error [0] repeatedly. Through trial and error we were able to see that some keys in the source JSON for some crashes can be malformed. Of course, a single error interrupts the reindexing process for an index, so this is a costly issue to continually run into. You can see in the script section here [1] how we have been handling this.

[0] https://pastebin.mozilla.org/9026358
[1] https://gist.github.com/milescrabill/442e3f48bd84b2d0f0363d3d0db26fbd
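To make the malformed-key problem concrete, here is a minimal Python sketch of the kind of cleanup the script section in [1] performs. This is an illustration, not the actual gist contents: the function name and the allowed-character pattern (`[a-zA-Z0-9_-]`, matching the Painless script later in this bug) are assumptions.

```python
import re

# Hypothetical sketch: drop raw_crash keys that are empty or contain
# characters outside [a-zA-Z0-9_-], since such keys break reindexing.
VALID_KEY = re.compile(r"^[a-zA-Z0-9_-]+$")

def clean_raw_crash(raw_crash):
    """Return a copy of raw_crash without malformed keys, plus a count
    of the keys that were dropped."""
    cleaned = {}
    issues = 0
    for key, value in raw_crash.items():
        if key and VALID_KEY.match(key):
            cleaned[key] = value
        else:
            issues += 1
    return cleaned, issues
```

Running this over a crash with an empty-string key keeps the good fields and reports one dropped key, which is the same bookkeeping the Painless script below does server-side.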
Miles, 
Can you please distill the active actionable tasks into something more formal? Even though I'm involved, I still struggle to understand exactly where we stand. Last week in SF we made some progress, but it still feels like a blur to me.
I wouldn't mind having a single bug for each and every challenge/action. For example, wasn't the last problem related to ES refusing some documents because they contained keys that appeared to be empty strings?

Also, the pastebin is now empty.
Peter: there is some summary to be had in this document: https://docs.google.com/document/d/1dtlZW2xaKy_jHFg9fg0VX2ddpH1eTC3OoWgoRP4WdfM.

I'll create a bug for the current major problem, which is the errors that prevent us from reindexing stage indices into the new cluster.
(In reply to Miles Crabill [:miles] from comment #37)
> I ran the following commands to presumably change the processor and webapp
> to point at the new cluster.
> 
> consulate kv set
> socorro/web-django/resource.elasticsearch.elasticsearch_urls
> new-socorro-es-proxy.mocotoolsstaging.net:9200
> consulate kv set socorro/webapp-django/ELASTICSEARCH_URLS
> new-socorro-es-proxy.mocotoolsstaging.net:9200
> consulate kv set socorro/common/resource.elasticsearch.elasticsearch_urls
> new-socorro-es-proxy.mocotoolsstaging.net:9200
> consulate kv set socorro/common/resource.elasticsearch.elasticSearchHostname
> new-socorro-es-proxy.mocotoolsstaging.net:9200

We're reverting the -stage environment back to Elasticsearch 1.x.

I undid these configuration changes just now. Plus I wrote a bash script so we can switch between the two modes more easily:

"""
#!/bin/bash

# This shell script makes it easier to change all the Elasticsearch host configuration
# keys at the same time to a single host.
#
# To use:
#
# 1. backup configuration
# 2. comment out the appropriate HOST line
# 3. uncomment the other, and then
# 4. run the script

# ES 5.x cluster
# HOST="new-socorro-es-proxy.mocotoolsstaging.net:9200"

# ES 1.x cluster
HOST="http://socorro-es.mocotoolsstaging.net:9200"

consulate kv set socorro/web-django/resource.elasticsearch.elasticsearch_urls "${HOST}"
consulate kv set socorro/webapp-django/ELASTICSEARCH_URLS "${HOST}"
consulate kv set socorro/common/resource.elasticsearch.elasticsearch_urls "${HOST}"
consulate kv set socorro/common/resource.elasticsearch.elasticSearchHostname "${HOST}"
"""

It's called "change_es.sh" on the -stage admin node.

I waited 10 minutes, sentry errors died down, and it looks like -stage is set up right and unblocked. I'll send an email out next.
I added the requisite ignore_above for UTF-8 and included the settings that I am using for index creation in the new ES 5.3.3 cluster.
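For reference, the text + keyword multi-field with `ignore_above: 256` recurs throughout the mapping above. Here is a minimal sketch (not the actual attachment contents) of building that pattern programmatically; the settings values are placeholders, and note that on ES 5.x the `properties` block would be nested under a mapping type rather than directly under `mappings` as in 7.x.

```python
# Sketch of the recurring text + keyword multi-field: full-text search
# on the "text" field, exact-match aggregations on the "keyword"
# sub-field, with ignore_above capping how long a value can be before
# the keyword copy is skipped.
def text_with_keyword(ignore_above=256):
    return {
        "type": "text",
        "fields": {
            "keyword": {
                "type": "keyword",
                "ignore_above": ignore_above,
            }
        },
    }

# Hypothetical index-creation body (shard/replica counts are placeholders).
index_body = {
    "settings": {"number_of_shards": 1, "number_of_replicas": 1},
    "mappings": {
        "properties": {
            "name": text_with_keyword(),
            "namespace": text_with_keyword(),
        }
    },
}
```

Generating the repeated structure from one helper like this is also one way to avoid the hand-maintained "huge mapping tree" complained about earlier in the bug.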
Here is the Painless script we're using for the current prod reindexing effort:

(string escaping removed, line breaks added, semicolons left in)

def issues = 0;
def itr = ctx._source.raw_crash.keySet().iterator();
while (itr.hasNext()) {
  def key = itr.next();
  if (key == "" || key =~ /[^a-zA-Z0-9_-]/) {
    itr.remove();
    issues += 1;
  }
}
ctx._source.put("issues", issues)
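For context, a script like that rides along inside a `_reindex` request body. The sketch below shows how it might be wired up; the host and index names are placeholders, not the real cluster endpoints, and the script field is `inline` here as on ES 5.x (later versions renamed it to `source`).

```python
import json

# Hypothetical _reindex request body carrying the Painless script above.
# Hosts and index names are placeholders.
reindex_body = {
    "source": {
        "remote": {"host": "http://old-es.example.net:9200"},
        "index": "example-index",
    },
    "dest": {"index": "example-index"},
    "script": {
        "lang": "painless",
        "inline": (
            "def issues = 0;"
            "def itr = ctx._source.raw_crash.keySet().iterator();"
            "while (itr.hasNext()) {"
            "  def key = itr.next();"
            "  if (key == '' || key =~ /[^a-zA-Z0-9_-]/) {"
            "    itr.remove();"
            "    issues += 1;"
            "  }"
            "}"
            "ctx._source.put('issues', issues)"
        ),
    },
}

# This body would be POSTed to /_reindex on the new cluster.
body_json = json.dumps(reindex_body, indent=2)
```

The `issues` counter written into each document makes it easy to query afterwards for how many crashes had keys stripped during the migration.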
Blocks: 1382727
No longer depends on: 1382727
No longer blocks: 1382727
Depends on: 1382727
Summary: Update Socorro's ElasticSearch cluster to a more recent version → [tracker] Update Socorro's ElasticSearch cluster to a more recent version
Depends on: 1393442
Attaching the new index creation payload.
Depends on: 1393650
Depends on: 1393652
Depends on: 1393653
(In reply to Miles Crabill [:miles] from comment #46)
> Created attachment 8900956 [details]
> create-index-20170824.txt
> 
> Attaching the new index creation payload.

How is this different? 
Has this one been used for anything?
(In reply to Peter Bengtsson [:peterbe] from comment #47)
> (In reply to Miles Crabill [:miles] from comment #46)
> > Created attachment 8900956 [details]
> > create-index-20170824.txt
> > 
> > Attaching the new index creation payload.
> 
> How is this different? 
> Has this one been used for anything?

I see. The difference is the latter has all the sprinkled `"dynamic": "false"` removed. 
And this is, presumably, what was used when creating the index "testindex-no-dynamic".
I wrote a terrible script that compares JSON files. In particular I focused on the _mapping from `testindex-no-dynamic` and `testindex-unsoiled`. Here's the output:
https://gist.github.com/peterbe/14b4c2b578ebcf757d693d75fc2eec75

Sorry if that's hard to read, but basically, these two started out as identical mappings. 
Then Miles started reindexing (a.k.a. migrating, index-from-remote) from ES1 into testindex-no-dynamic. From there, the mapping *changed*! But it only changed in that *new keys were added*. 

For example, look at line 5 https://gist.github.com/peterbe/14b4c2b578ebcf757d693d75fc2eec75#file-gistfile1-txt-L5
It means that the key processed_crash.properties.ReleaseChannel was added into the index(!). 

Important Question: How come that could happen when the index had `index.mapper.dynamic` set to false?

Important Question: Who cares? Does it matter to have a bit of excess just after migration?

The next thing to note is that I can't find any other DIFFERENCES between the two mappings, other than that the one we reindexed into has new stuff ADDED.

Important Observation: Precious things like processed_crash.version are still {"type": "keyword"}. Yay!

This *could* be a distraction to talk about right now, but we need to re-establish which file we use to create the index, because there's something odd. When I look at the mapping for `testindex-unsoiled` I see a bunch of keys MISSING. The file we use to create the index is supposed to be aligned with the Super Search Fields. But if you look at https://github.com/mozilla-services/socorro/blob/master/socorro/external/es/data/super_search_fields.json there are a bunch of things this doesn't mention. 
For example, if you look at processed_crash.json_dump.system_info in https://bug1322630.bmoattachments.org/attachment.cgi?id=8888052 you'll see there's only one key (cpu_count), but in https://github.com/mozilla-services/socorro/blob/master/socorro/external/es/data/super_search_fields.json there are 6(!) occurrences where the namespace is "processed_crash.json_dump.system_info", namely (dump_cpu_arch, dump_cpu_info, os, os_ver, number_of_processors, cpu_count).

Important Question: Why is super_search_fields.json so different from the indexing file?
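The comparison script mentioned above isn't attached; a minimal sketch of the kind of recursive diff that surfaces keys added on one side (the function name is made up for illustration) might look like:

```python
def added_keys(old, new, path=""):
    """Recursively walk two mapping dicts and yield the dotted paths of
    keys present in `new` but missing from `old`."""
    for key, value in new.items():
        dotted = f"{path}.{key}" if path else key
        if key not in old:
            yield dotted
        elif isinstance(value, dict) and isinstance(old[key], dict):
            yield from added_keys(old[key], value, dotted)
```

Run against the `testindex-unsoiled` and `testindex-no-dynamic` mappings, this would surface paths like processed_crash.properties.ReleaseChannel, matching what the gist output shows.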
No longer depends on: 1407655
Priority: -- → P2

It's been a while since we did our last upgrade attempt, so I'm retargeting this for 7.1.0. If we can hit that--that's great. If we hit problems going from 1.4 to 7.1, then we can rethink things then.

Also, grabbing this since this became the tracker for the whole migration project and I'm working on it now.

Assignee: miles → willkg
Status: NEW → ASSIGNED
Summary: [tracker] Update Socorro's ElasticSearch cluster to a more recent version → [tracker] Update Socorro's ElasticSearch cluster to 7.1.0

Bumping down to P3 until after we look into bug #1568601.

Priority: P2 → P3

Unassigning myself since I'm not going to get to this any time soon.

Assignee: willkg → nobody
Status: ASSIGNED → NEW

We're no longer planning to move to GCP, but upgrading ElasticSearch is still important. Version 7.4.0 was released October 1, 2019.

No longer blocks: 1512641
Summary: [tracker] Update Socorro's ElasticSearch cluster to 7.1.0 → [tracker] Update Socorro's ElasticSearch cluster to 7.4.0
Status: NEW → RESOLVED
Closed: 2 years ago
Resolution: --- → INACTIVE

This is the top of the Elasticsearch update tree, so I'm going to reopen this as a General bug and not an infra bug. We'll spin off specific infra bugs as we need to.

Status: RESOLVED → REOPENED
Component: Infra → General
Resolution: INACTIVE → ---