Open Bug 1407212 Opened 5 years ago Updated 3 years ago

[Super Search] super search fields need refactoring

Categories

(Socorro :: General, task)

task
Not set
normal

Tracking

(Not tracked)

People

(Reporter: willkg, Unassigned)

References

(Blocks 1 open bug)

Details

Attachments

(1 file)

We moved super search fields from being stored in Elasticsearch to being stored in one massive Python file in socorro/external/es/super_search_fields.py. That file is 4400 lines long and mostly consists of repeated sections defining fields.

Many fields share the bulk of their definition. For example, most of the "keyword" type fields have the following parts in common:

    'data_validation_type': 'str',
    'default_value': None,
    'form_field_choices': [],
    'form_field_type': 'StringField',
    'has_full_version': False,
    'is_exposed': True,
    'is_mandatory': False,
    'is_returned': True,
    'query_type': 'string',
    'storage_mapping': {
        'analyzer': 'keyword',
        'type': 'string'
    }

So now we've got a file that's got a lot of redundancy and copying/pasting.

It'd be nice to have a tiny layer of abstraction here that makes it easier to maintain fields and deal with this file.
I think I want the following out of an abstraction layer:

1. when looking at a field, I want to know how it differs from the defaults
2. when looking at a field, it should be easy to determine what "type" it is
3. adding a field with defaults should be easier
4. it'd be nice if the abstraction layer reduced possibility of typos


I'm thinking maybe creating a few functions that return dicts. For example, here's keyword:

PROCESSED_CRASH = 'processed_crash'

VIEW_PII = 'crashstats.view_pii'


def verify(data):
    # validate the data against some schema here and return False if invalid
    return True


def keyword(**overrides):
    defaults = {
        'form_field_type': 'StringField',
        'has_full_version': False,
        'is_exposed': True,
        'is_mandatory': False,
        'is_returned': True,
        ...
    }
    defaults.update(overrides)
    verify(defaults)
    return defaults


Then we could do this:

    'url': keyword(
        description='The website the user ...',
        in_database_name='url',
        name='url',
        namespace=PROCESSED_CRASH,
        permission_needed=[VIEW_PII],
    ),


All the work is done at module-load time. The likelihood of typos is reduced. After the module has been loaded, the value of FIELDS would be the same as it is now, so we don't have to change any other code--just super_search_fields.py. It reduces a lot of redundant defaults and makes it clearer when something is different. We can combine this with comments, too.

This feels easier to write, easier to read, and easier to review changes on.

What do we think?

I like it, this pattern has worked well for me for similar problems.

Another option is moving the definitions to an external file (JSON, YAML, CSV) and loading them at configuration time. It doesn't help with the duplicate data, but gets them out of code, and may make it easier for a non-staff dev to process the structure. I prefer the functions-returning-dicts, however.

We originally had a JSON file, but I ditched that for the Python file because JSON sucks for stuff like this where you need to document why things are different. Those were dark days.

Another option is to use attrs data classes which get us a lot of what we're doing above with a lot less code. This will require us to adjust super_search_fields using code, but I think that's probably a good thing all things considered.

I'll spend a couple of hours fiddling with these options to see what's easier to deal with and report back.

I want to do this before embarking on the Elasticsearch migration because simplifying super search fields will make that significantly easier.

Assignee: nobody → willkg
Blocks: 1322630
Status: NEW → ASSIGNED

Here's an example using attrs:

@attr.s
class Keyword:
    """Defines a keyword field."""
    name = attr.ib()
    namespace = attr.ib()

    in_database_name = attr.ib(
        default=attr.Factory(
            lambda self: self.name,
            takes_self=True
        )
    )
    default_value = attr.ib(default=None)
    data_validation_type = attr.ib(default='str')
    description = attr.ib(default='Add description')
    has_full_version = attr.ib(default=False)
    is_exposed = attr.ib(default=True)
    is_mandatory = attr.ib(default=False)
    is_returned = attr.ib(default=True)
    query_type = attr.ib(default='string')
    form_field_choices = attr.ib(default=attr.Factory(list))
    permissions_needed = attr.ib(default=attr.Factory(lambda: ['crashstats.view_pii']))
    storage_mapping = attr.ib(
        default=attr.Factory(
            lambda: {'analyzer': 'keyword', 'type': 'string'}
        )
    )


FIELDS = [
    Keyword(
        name='ActiveExperiment',
        namespace='raw_crash',
        description='The telemetry experiment active at the time of the crash, if any.',
    )
]

That's pretty succinct. Using attrs allows us to type-check and validate values, too, though I didn't test that out much here. I think I want to go with this.

A while back, I documented the fields in super_search_fields.py and noticed that is_mandatory and default_value have the same value for all the fields. Adrian and I talked about that briefly:

https://github.com/mozilla-services/socorro/pull/4417#discussion_r184945744

We can remove those.

I'll do this before I go further since it reduces the code.

willkg merged PR #4947: "bug 1407212: remove unused super_search_fields code" in 10af200.

That'll make it easier to diff old fields with new fields to make sure they're the same.

Unassigning myself since I'm not going to get to this any time soon.

Assignee: willkg → nobody
Status: ASSIGNED → NEW
You need to log in before you can comment on or make changes to this bug.