android_model and other fields are indexed wrong in Elasticsearch
Categories: Socorro :: Processor, defect, P2
Tracking: Not tracked
People: Reporter: willkg, Assignee: Unassigned
Whiteboard: [cringe]
android_model is defined in super_search_fields.py like this:
    "android_model": {
        "data_validation_type": "str",
        "form_field_choices": [],
        "has_full_version": True,
        "in_database_name": "android_model",
        "is_exposed": True,
        "is_returned": True,
        "name": "android_model",
        "namespace": "processed_crash",
        "query_type": "string",
        "storage_mapping": {
            "fields": {
                "Android_Model": {"type": "string"},
                "full": {"index": "not_analyzed", "type": "string"},
            },
            "type": "multi_field",
        },
    },
This works with facets/aggs, but it doesn't work with any of the query operators except "is exactly", which makes the code look at the "full" field. That's not great.
It should be something like this:
    "android_model": {
        "data_validation_type": "str",
        "form_field_choices": [],
        "has_full_version": True,
        "in_database_name": "android_model",
        "is_exposed": True,
        "is_returned": True,
        "name": "android_model",
        "namespace": "processed_crash",
        "query_type": "string",
        "storage_mapping": {
            "fields": {"full": {"index": "not_analyzed", "type": "string"}},
            "index": "analyzed",
            "type": "string",
        },
    },
I'm pretty sure there are some other fields that are similarly indexed wrong.
This bug covers identifying them and fixing them.
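One way to start identifying similar fields is a small scan over the field definitions. This is only a sketch: it assumes the definitions are available as a dict shaped like the snippets above, and it assumes the problem is a legacy multi_field mapping that has no sub-field named after the field itself (that is my reading of the android_model mapping, not something confirmed here). The function name and the "example_ok" entry are hypothetical.

```python
def find_suspect_multi_fields(fields):
    """Return names of fields whose multi_field mapping lacks a sub-field
    matching the field's own in_database_name (a heuristic, not a proof)."""
    suspects = []
    for name, field in fields.items():
        mapping = field.get("storage_mapping") or {}
        if mapping.get("type") != "multi_field":
            continue
        subfields = mapping.get("fields", {})
        default_name = field.get("in_database_name", name)
        if default_name not in subfields:
            suspects.append(name)
    return sorted(suspects)


# Sample data: android_model is copied from this bug; example_ok is a
# made-up entry using the proposed mapping shape.
FIELDS = {
    "android_model": {
        "in_database_name": "android_model",
        "storage_mapping": {
            "fields": {
                "Android_Model": {"type": "string"},
                "full": {"index": "not_analyzed", "type": "string"},
            },
            "type": "multi_field",
        },
    },
    "example_ok": {
        "in_database_name": "example_ok",
        "storage_mapping": {
            "fields": {"full": {"index": "not_analyzed", "type": "string"}},
            "index": "analyzed",
            "type": "string",
        },
    },
}

print(find_suspect_multi_fields(FIELDS))
```

Anything the scan reports would still need a manual look before changing its mapping.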
Comment 1 • Reporter • 1 year ago
The underlying problem evolved over the last 6 months.
With Elasticsearch 8, we have two kinds of fields: text and keyword. text fields are analyzed and keyword fields are not.
There are a bunch of android_* fields that are currently text fields but shouldn't be. Further, there are a bunch of fields that are of type text but have the enum query operators, which makes no sense.
I think we want to adjust all the fields so we have:
- text fields with string operators
- and keyword fields with enum operators
I'll look into whether that makes sense now.
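For reference, here is roughly what the two kinds of field look like as Elasticsearch 8 mappings, written as Python dicts. The "full" sub-field name follows the older mapping in this bug; the constant and helper names are hypothetical, and this is a sketch rather than the mappings Socorro actually generates.

```python
# An analyzed text field with an un-analyzed keyword sub-field, the
# Elasticsearch 8 counterpart of the "string" + "full" mapping above.
TEXT_WITH_KEYWORD = {
    "type": "text",
    "fields": {"full": {"type": "keyword"}},
}

# A plain keyword field: the value is indexed as a single exact term.
KEYWORD_ONLY = {"type": "keyword"}


def is_analyzed(mapping):
    """True if the top-level field goes through an analyzer at index time."""
    return mapping.get("type") == "text"


def expected_query_type(mapping):
    """Pairing proposed in this comment: text -> string operators,
    keyword -> enum operators."""
    return "string" if is_analyzed(mapping) else "enum"


print(expected_query_type(TEXT_WITH_KEYWORD))
print(expected_query_type(KEYWORD_ONLY))
```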
Comment 2 • Reporter • 1 year ago
I went through all the fields looking for ones that were query_type=string, storage_mapping=keyword or query_type=enum, storage_mapping=text. I think those combinations are wrong.
Most of the time, if an item is indexed as a keyword, we only need the "matches", "does not match", "exists" and "does not exist" operators, with a couple of minor exceptions.
If an item is indexed as text, it's because we want it analyzed and want to match different kinds of substrings, so we should have all the string operators.
NAME                          QUERY_TYPE  STORAGE_MAPPING  EXAMPLE

# These are string / keyword, but should continue to be string / keyword.
accessibility_client          string      keyword          ShowMsg.exe|6.2.0.6068
addons                        string      keyword          formautofill@mozilla.org:1.0.1
url                           string      keyword          http://example.com
crash_inconsistencies         string      keyword          crashing_access_not_found_in_memory_accesses
crash_report_keys             string      keyword          upload_file_minidump
crashing_thread_name          string      keyword          Shutdown Hang Terminator
gmp_library_path              string      keyword          c:\blah\blah\blah
stackwalk_version             string      keyword          minidump-stackwalk 0.22.2 (2024-10-10 v0.22.2)
xpcom_spin_event_loop_stack   string      keyword          default: nsThread::Shutdown: ImageIO

# This is indexed wrong for complicated reasons. We should leave it.
address                       string      keyword          0x0000000600000000

# We should change these to enum / keyword.
accessibility_in_proc_client  string      keyword          0x400
adapter_device_id             string      keyword          0x9a49
adapter_driver_version        string      text             opengl
adapter_subsys_id             string      text             86941043
adapter_vendor_id             string      keyword          0x8086
android_board                 enum        text             universal7870
android_brand                 enum        text             samsung
android_display               enum        text             231005.007
android_fingerprint           enum        text             samsung
android_hardware              enum        text             samsungexynos7870
android_model                 string      text             SM-T580
android_packagename           string      keyword          org.mozilla.firefox
android_version               string      keyword          34 (REL)
background_task_name          string      keyword          backgroundupdate
co_marshal_interface_failure  enum        text             **no examples of this**
cpu_microcode_version         enum        text             0xb4
distribution_id               string      keyword          mozilla-win-eol-esr115
hang                          string      keyword          ui
js_large_allocation_failure   string      keyword          Recovered
mac_memory_pressure           string      keyword          Normal
phc_kind                      string      keyword          GuardPage
platform                      enum        text             Windows NT
plugin_filename               enum        text             clearkey
plugin_name                   enum        text             name
plugin_version                enum        text             4.10.2830.0
process_type                  string      keyword          parent
product                       enum        text             Firefox
report_type                   string      keyword          crash
shutdown_reason               string      keyword          AppClose
submitted_from                string      keyword          Client
useragent_locale              enum        text             en-us
utility_actors_name           string      keyword          mf-media-engine
vendor                        enum        text             mozilla
I'll make those changes now. It should be pretty straightforward.
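The check behind the table can be sketched as a filter over (name, query_type, storage type) rows. The three sample rows are taken from the table; the function name and the row format are assumptions made for illustration, not how super_search_fields.py is actually structured.

```python
# Combinations this comment treats as wrong.
SUSPECT_COMBINATIONS = {("string", "keyword"), ("enum", "text")}


def find_mismatched(rows):
    """Return the names of rows whose (query_type, storage) pair is suspect."""
    return [
        name
        for name, query_type, storage in rows
        if (query_type, storage) in SUSPECT_COMBINATIONS
    ]


ROWS = [
    ("android_board", "enum", "text"),
    ("hang", "string", "keyword"),
    ("android_model", "string", "text"),
]

print(find_mismatched(ROWS))
```

A filter like this only flags candidates. Fields such as address, or the string / keyword fields that should stay as they are, still need the case-by-case judgment shown in the table.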
Comment 3 • Reporter • 1 year ago
Relud: ^^^ Does that give you any pause? I think the work involves switching text_field to keyword_field and vice versa and using the default data_validation_type and query_type for the field. And maybe updating some tests.
Comment 5 • Reporter • 1 year ago
I hit a snag. When we convert a field to query_type=enum, data_validation_type=str, storage_mapping=keyword, you can no longer search against it case-insensitively. For example, if we switch platform to that, then platform=Linux works but platform=linux brings up nothing.
I think we need to toss the plan in comment #2. I'm sure we could figure out which set of fields we don't need to search against case-insensitively, but I don't think I have time to do that carefully now, so I'm going to bail on doing this before the Elasticsearch 8 migration.
I'm going to unassign myself.
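For anyone picking this up, the snag can be illustrated in plain Python, along with one Elasticsearch option that might address it: a keyword field with a lowercase normalizer. This is a sketch under assumptions, not something tried in this bug. The normalizer name and the bare "platform" field path are made up, and a normalizer also changes the stored terms, so facets/aggs would return lowercased values.

```python
def keyword_match(indexed_value, query):
    """A keyword field stores the value verbatim, so a term lookup is exact."""
    return indexed_value == query


def normalized_keyword_match(indexed_value, query):
    """With a lowercase normalizer, both the indexed term and the query
    term are lowercased before they are compared."""
    return indexed_value.lower() == query.lower()


# Hypothetical index body: a keyword field using a custom lowercase normalizer.
LOWERCASE_KEYWORD_INDEX = {
    "settings": {
        "analysis": {
            "normalizer": {
                "lowercase_normalizer": {"type": "custom", "filter": ["lowercase"]}
            }
        }
    },
    "mappings": {
        "properties": {
            "platform": {"type": "keyword", "normalizer": "lowercase_normalizer"}
        }
    },
}

print(keyword_match("Linux", "Linux"))
print(keyword_match("Linux", "linux"))
print(normalized_keyword_match("Linux", "linux"))
```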