Open Bug 1905772 Opened 1 year ago Updated 7 months ago

android_model and other fields are indexed wrong in Elasticsearch

Categories

(Socorro :: Processor, defect, P2)

Tracking

(Not tracked)

People

(Reporter: willkg, Unassigned)

References

Details

(Whiteboard: [cringe])

android_model is defined in super_search_fields.py like this:

    "android_model": {
        "data_validation_type": "str",
        "form_field_choices": [],
        "has_full_version": True,
        "in_database_name": "android_model",
        "is_exposed": True,
        "is_returned": True,
        "name": "android_model",
        "namespace": "processed_crash",
        "query_type": "string",
        "storage_mapping": {
            "fields": {
                "Android_Model": {"type": "string"},
                "full": {"index": "not_analyzed", "type": "string"},
            },
            "type": "multi_field",
        },
    },

This works with facets/aggs, but the only query operator that works is "is exactly", because that operator makes the code look at the "full" field. That's not great.

It should be something like this:

    "android_model": {
        "data_validation_type": "str",
        "form_field_choices": [],
        "has_full_version": True,
        "in_database_name": "android_model",
        "is_exposed": True,
        "is_returned": True,
        "name": "android_model",
        "namespace": "processed_crash",
        "query_type": "string",
        "storage_mapping": {
            "fields": {"full": {"index": "not_analyzed", "type": "string"}},
            "index": "analyzed",
            "type": "string",
        },
    },
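
To make the difference between the two definitions concrete, here is a rough sketch of the change as a transformation on the storage_mapping dict. `fix_multi_field` is a hypothetical helper written for this comment, not something in Socorro, and it only handles the exact shape shown above.

```python
def fix_multi_field(storage_mapping):
    """Turn a legacy multi_field mapping into an analyzed string with subfields.

    Hypothetical helper for illustration only.
    """
    if storage_mapping.get("type") != "multi_field":
        return storage_mapping
    # Keep the non-analyzed "full" subfield and drop the other subfield,
    # as the proposed definition does.
    subfields = {
        name: value
        for name, value in storage_mapping.get("fields", {}).items()
        if name == "full"
    }
    return {"fields": subfields, "index": "analyzed", "type": "string"}


old = {
    "fields": {
        "Android_Model": {"type": "string"},
        "full": {"index": "not_analyzed", "type": "string"},
    },
    "type": "multi_field",
}
new = fix_multi_field(old)
```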

I'm pretty sure there are some other fields that are similarly indexed wrong.

This bug covers identifying them and fixing them.

The underlying problem evolved over the last 6 months.

With Elasticsearch 8, we have two kinds of fields: text and keyword. text fields are analyzed and keyword fields are not.
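
As a rough illustration of that distinction, an Elasticsearch 8 mapping fragment with one field of each kind might look like the dict below. The field names are placeholders I made up, and this is not the mapping Socorro generates.

```python
# Illustrative Elasticsearch 8 mapping fragment (placeholder field names).
mapping = {
    "properties": {
        # text: the value goes through an analyzer and is matched by token
        "some_text_field": {"type": "text"},
        # keyword: the value is indexed as-is and matched exactly
        "some_keyword_field": {"type": "keyword"},
    }
}

# Names of the fields that get analyzed.
analyzed = [
    name for name, spec in mapping["properties"].items() if spec["type"] == "text"
]
```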

There are a bunch of android_* fields that are currently text fields but shouldn't be. Further, a bunch of fields are of type text but have the enum query operators, which makes no sense.

I think we want to adjust all the fields so we have:

  • text fields with string operators
  • and keyword fields with enum operators

I'll look into whether that makes sense now.

Assignee: nobody → willkg
Status: NEW → ASSIGNED

I went through all the fields looking for ones that are query_type=string with storage_mapping=keyword, or query_type=enum with storage_mapping=text. I think those combinations are wrong.

Most of the time, if an item is indexed as a keyword, we only need the "matches", "does not match", "exists", and "does not exist" operators, with a couple of minor exceptions.

If an item is indexed as text, it's because we want it analyzed so we can match different kinds of substrings, so it should have all the string operators.
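
The audit described above boils down to something like this sketch. The `(query_type, storage type)` tuples are a simplified stand-in for the real field definitions, and `find_mismatches` is a hypothetical name; the sample rows are copied from the listing in this comment.

```python
def find_mismatches(fields):
    """Return names whose query_type / storage type pair looks suspect."""
    suspect = {("string", "keyword"), ("enum", "text")}
    return [
        name
        for name, (query_type, storage_type) in fields.items()
        if (query_type, storage_type) in suspect
    ]


# A few rows from the listing, as name -> (query_type, storage type).
sample = {
    "android_board": ("enum", "text"),
    "process_type": ("string", "keyword"),
    "android_model": ("string", "text"),
}
mismatches = find_mismatches(sample)
```

Note that a field can pass this check and still be one we want to change (android_model is string / text here), so this only finds the suspect combinations, not everything that needs review.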

NAME                              QUERY     MAPPING    EXAMPLE

# These are string / keyword, but should continue to be string / keyword.
accessibility_client              string    keyword    ShowMsg.exe|6.2.0.6068
addons                            string    keyword    formautofill@mozilla.org:1.0.1
url                               string    keyword    http://example.com
crash_inconsistencies             string    keyword    crashing_access_not_found_in_memory_accesses
crash_report_keys                 string    keyword    upload_file_minidump
crashing_thread_name              string    keyword    Shutdown Hang Terminator
gmp_library_path                  string    keyword    c:\blah\blah\blah
stackwalk_version                 string    keyword    minidump-stackwalk 0.22.2 (2024-10-10 v0.22.2)
xpcom_spin_event_loop_stack       string    keyword    default: nsThread::Shutdown: ImageIO

# This is indexed wrong for complicated reasons. We should leave it.
address                           string    keyword    0x0000000600000000

# We should change these to enum / keyword.
accessibility_in_proc_client      string    keyword    0x400
adapter_device_id                 string    keyword    0x9a49
adapter_driver_version            string    text       opengl
adapter_subsys_id                 string    text       86941043
adapter_vendor_id                 string    keyword    0x8086
android_board                     enum      text       universal7870
android_brand                     enum      text       samsung
android_display                   enum      text       231005.007
android_fingerprint               enum      text       samsung
android_hardware                  enum      text       samsungexynos7870
android_model                     string    text       SM-T580
android_packagename               string    keyword    org.mozilla.firefox
android_version                   string    keyword    34 (REL)
background_task_name              string    keyword    backgroundupdate
co_marshal_interface_failure      enum      text       **no examples of this**
cpu_microcode_version             enum      text       0xb4
distribution_id                   string    keyword    mozilla-win-eol-esr115
hang                              string    keyword    ui
js_large_allocation_failure       string    keyword    Recovered
mac_memory_pressure               string    keyword    Normal
phc_kind                          string    keyword    GuardPage
platform                          enum      text       Windows NT
plugin_filename                   enum      text       clearkey
plugin_name                       enum      text       name
plugin_version                    enum      text       4.10.2830.0
process_type                      string    keyword    parent
product                           enum      text       Firefox
report_type                       string    keyword    crash
shutdown_reason                   string    keyword    AppClose
submitted_from                    string    keyword    Client
useragent_locale                  enum      text       en-us
utility_actors_name               string    keyword    mf-media-engine
vendor                            enum      text       mozilla

I'll make those changes now. It should be pretty straightforward.

Relud: ^^^ Does that give you any pause? I think the work involves switching text_field to keyword_field and vice versa and using the default data_validation_type and query_type for the field. And maybe updating some tests.
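
Roughly what I mean, as a sketch. The two functions below are simplified stand-ins for the text_field and keyword_field helpers, with signatures and defaults assumed for illustration; the real helpers may take other arguments and set more keys.

```python
def keyword_field(name):
    # Assumed defaults for a keyword field: enum operators, exact matching.
    return {
        "name": name,
        "data_validation_type": "str",
        "query_type": "enum",
        "storage_mapping": {"type": "keyword"},
    }


def text_field(name):
    # Assumed defaults for a text field: string operators, analyzed value.
    return {
        "name": name,
        "data_validation_type": "str",
        "query_type": "string",
        "storage_mapping": {"type": "text"},
    }


# Switching a field is then a one-line change in its definition.
android_model_before = text_field("android_model")
android_model_after = keyword_field("android_model")
```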

Flags: needinfo?(dthorn)

That all sounds like a good plan to me.

Flags: needinfo?(dthorn)

I hit a snag. When we convert a field to query_type=enum, data_validation_type=str, storage_mapping=keyword, you can no longer search against it case-insensitively. For example, if we switch platform to that, then platform=Linux works but platform=linux returns nothing.
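
Here is a toy model of the difference in plain Python. It doesn't talk to Elasticsearch; `text_match` assumes an analyzer that lowercases and splits on whitespace, which is a simplification of what a real analyzer does.

```python
def keyword_match(indexed_value, query):
    # keyword: the stored value is compared exactly, case included
    return indexed_value == query


def text_match(indexed_value, query):
    # text: simplified analyzer that lowercases and splits on whitespace
    tokens = indexed_value.lower().split()
    return query.lower() in tokens


exact_hit = keyword_match("Linux", "Linux")
exact_miss = keyword_match("Linux", "linux")
analyzed_hit = text_match("Linux", "linux")
```

With the keyword version, only the query with the same case as the stored value matches, which is the platform=Linux vs. platform=linux behavior described above.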

I think we need to toss the plan in comment #2. I'm sure we could figure out which set of fields we don't need to search against case-insensitively, but I don't think I have time to do that carefully now, so I'm going to bail on doing this before the Elasticsearch 8 migration.

I'm going to unassign myself.

Assignee: willkg → nobody
Status: ASSIGNED → NEW
Whiteboard: [cringe]
Summary: android_model is indexed wrong → android_model and other fields are indexed wrong in Elasticsearch