Closed Bug 1626277 Opened 2 years ago Closed 2 years ago

[research] usage of email address field

Categories

(Socorro :: General, task, P2)

Tracking

(Not tracked)

RESOLVED FIXED

People

(Reporter: willkg, Assigned: willkg)

Some of the crash report dialogs have an email field. Bug #1619955 covers adding the email field to the content process crash reporter dialog.

This research bug covers figuring out usage of the email address information in crash reports.

Making this a P2 and grabbing it for now since I've been exploring the issue.

I looked at Firefox 74.0 crash reports from the last week (2020-03-24 through 2020-03-31).

label             count     has email
main process      142,432   32,830
content process   155,772   0
other types       1,563     --

The main process crash reporter dialog has an email address field. The content process crash reporter dialog does not.

Super search for breakdown:

https://crash-stats.mozilla.org/search/?product=Firefox&version=74.0&date=%3E%3D2020-03-24T14%3A22%3A00.000Z&date=%3C2020-03-31T14%3A22%3A00.000Z&_facets=signature&_facets=process_type&_sort=-date&_columns=date&_columns=signature&_columns=product&_columns=version&_columns=build_id&_columns=platform#facet-process_type

Super search for main process with email address:

https://crash-stats.mozilla.org/search/?process_type=__null__&email=%40.%2B&product=Firefox&version=74.0&date=%3E%3D2020-03-24T14%3A22%3A00.000Z&date=%3C2020-03-31T14%3A22%3A00.000Z&_facets=signature&_facets=process_type&_sort=-date&_columns=date&_columns=signature&_columns=product&_columns=version&_columns=build_id&_columns=platform&_columns=email#crash-reports

Super search for content process with email address:

https://crash-stats.mozilla.org/search/?process_type=%3Dcontent&email=%40.%2B&product=Firefox&version=74.0&date=%3E%3D2020-03-24T14%3A22%3A00.000Z&date=%3C2020-03-31T14%3A22%3A00.000Z&_facets=signature&_facets=process_type&page=1&_sort=-date&_columns=date&_columns=signature&_columns=product&_columns=version&_columns=build_id&_columns=platform&_columns=email#crash-reports
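
The same queries can also be built programmatically. The following is a minimal sketch that only constructs the query URL and does not send a request. It assumes the SuperSearch API endpoint at `/api/SuperSearch/` accepts the same parameters as the web UI; `build_search_url` is a hypothetical helper, not part of Socorro. Getting the `email` field back from the API would also require an API token with PII permissions.

```python
from urllib.parse import urlencode

# Assumption: the API mirrors the query parameters used by the web UI.
API_URL = "https://crash-stats.mozilla.org/api/SuperSearch/"


def build_search_url(process_type, with_email=True):
    """Build a SuperSearch URL for Firefox 74.0 crashes in the studied week.

    process_type: "__null__" for the main process, "=content" for content.
    """
    params = [
        ("product", "Firefox"),
        ("version", "74.0"),
        ("date", ">=2020-03-24T14:22:00.000Z"),
        ("date", "<2020-03-31T14:22:00.000Z"),
        ("process_type", process_type),
        ("_facets", "process_type"),
    ]
    if with_email:
        # "@.+" is the regex filter used in the searches above:
        # the email field must contain at least one character.
        params.append(("email", "@.+"))
    return API_URL + "?" + urlencode(params)
```

Calling `build_search_url("__null__")` gives the main-process query and `build_search_url("=content")` gives the content-process one.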

I didn't look at how many email addresses were valid, but my spot check suggested at least some are.

There are no content process crash reports with an email address because that crash reporter dialog doesn't have an email address field. Bug #1619955 covers adding one.

Assignee: nobody → willkg
Status: NEW → ASSIGNED
Priority: -- → P2

To figure out how Mozilla uses the email address, we have a few options:

  1. we could send a survey to a mailing list
    • pros: easy to set up and run and doesn't require engineering work
    • cons:
      • Travis and I don't think this is likely to be conclusively helpful since it's easily ignored; we could preface the survey with "We're considering removing this field ..." to increase the likelihood the survey isn't ignored
      • this only surveys people at a specific point in time
  2. we could require an additional permission to see the email field
    • pros:
      • this increases the security of that data and reduces the unintentional usage of it
      • this tells us who uses it since they'll have to request permission
    • cons:
      • requires engineering work that we've avoided doing in the past (adding new permissions, adding new group)
  3. we could change the webapp to hide the field by default and require a click to reveal
    • pros:
      • we get metrics on who's using it and how often
    • cons:
      • requires engineering work
  4. we can add metrics gathering to supersearch for when the user requests the email field

I think item 2 is the option most likely to get us the data we want, but there's a bunch of risk around making changes to permissions and groups in the current codebase.

Item 3 seems doable. It catches the case where someone is looking at individual crash reports and goes to email the user, but doesn't catch other cases.

Item 4 is doable, but the supersearch code is used both by users and heavily by Socorro itself. We'd need to distinguish the two cases somehow.

I looked at user comments for the same week as in comment 1:

process type      total count   has email   has user comment
main process      142,432       32,830      4,908
content process   155,772       0           5,105

The main process crash report dialog has an email address field.

The content process crash report dialog does not have an email address field, so we'll never capture email address data for content process crashes.

However, there's a user comments field in both crash report dialogs, so I was curious to see how many comments we had. Note that "has a user comment" is a pretty basic test of whether the comment field has at least one a-zA-Z character--this doesn't look at how many are not junk.
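
A minimal sketch of that kind of "has a user comment" test might look like the following. The function name is hypothetical, and the check actually used for the counts above may have differed in details.

```python
import re

# Matches any single ASCII letter, mirroring the
# "at least one a-zA-Z character" test described above.
HAS_LETTER = re.compile(r"[a-zA-Z]")


def has_user_comment(comment):
    """Return True if the comment field contains at least one ASCII letter."""
    return bool(comment) and HAS_LETTER.search(comment) is not None
```

With this check, a comment like "crashed on startup" counts, while an empty field or one containing only digits and punctuation does not. Junk such as "asdf" still counts, which is the limitation noted above.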

Joe suggested I talk to Randell. I talked to him late last week:

  • He thought most crash reports don't have email addresses and did not know that content process crash reports can't have email addresses.
  • He said Marcia did most of the work contacting users using the email address field to find more information.
  • He said he could live without it.
  • He suggested I talk to Philipp and dveditz for their experiences.

Travis said I should do a qualitative analysis of the email data:

  • break down by domain--How many people are Mozilla or Softvision people?
  • get a finer analysis of how many are probably junk email addresses

Then I'll scope out the work involved in measuring email address usage.

Then throw all this into a Google doc defining the problem, proposed solution, and possible further work and send it to Travis.

I did some analysis on email addresses during 3/24-3/31 for main process crashes for Firefox:

Total email addresses: 32,830

Junk email: 541 (1.65%)
   Junk: no host/tld: 287
   Junk: no @: 243
   Junk: no data: 11

Probably valid email: 32,289 (98.35%)
10 most common domains:
   gmail.com: 11,344
   yahoo.com: 2,217
   hotmail.com: 1,953
   orange.fr: 752
   aol.com: 634
   gmx.de: 536
   web.de: 534
   t-online.de: 432
   hotmail.fr: 358
   free.fr: 343

Unique email addresses: 14,837
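
A breakdown like the one above could be produced with something along these lines. This is a sketch under assumptions, not the script that generated these numbers: the junk rules are guesses based on the category names, and both the case-insensitive comparison and the exclusion of junk values from the unique count are assumptions.

```python
from collections import Counter


def classify_email(value):
    """Return (category, domain); domain is None for junk values."""
    value = (value or "").strip()
    if not value:
        return "junk: no data", None
    if "@" not in value:
        return "junk: no @", None
    _local, _, domain = value.rpartition("@")
    domain = domain.lower()
    # Require a host and a TLD separated by a dot, e.g. "example.com".
    host, dot, tld = domain.rpartition(".")
    if not (host and dot and tld):
        return "junk: no host/tld", None
    return "probably valid", domain


def summarize(values):
    """Count categories and domains, plus unique probably-valid addresses."""
    categories = Counter()
    domains = Counter()
    unique = set()
    for value in values:
        category, domain = classify_email(value)
        categories[category] += 1
        if domain is not None:
            domains[domain] += 1
            # Assumption: addresses are compared case-insensitively.
            unique.add(value.strip().lower())
    return categories, domains, len(unique)
```

`domains.most_common(10)` would then give the ten most common domains.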

Tomorrow, I'll scope out roughly what work would be involved in measuring usage of the email address data.

I haven't been able to get real data on email field usage by asking around, and it's been time-intensive. I don't think it's worth continuing down that path; instead, we should try to measure usage programmatically.

There are several points where the email address data shows up:

  1. the report view
  2. supersearch
  3. the supersearch API
  4. the raw crash API

In all of these cases, the user needs to be logged in and have PII permissions, so we always know who's accessing the field.

The first two are in the webapp. Travis suggested hiding the email address data behind a click-wall. Clicking on the click-wall logs the event, then shows the email address to the user. We have a very limited set of users, so we could (ab)use Grafana for raw counts (who looked, when). We could create a db table and toss an entry in there and get more detailed information (who looked, which crashid, when).
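
The db-table variant could be sketched roughly as follows. It uses an in-memory SQLite database only to keep the example self-contained; the table name, column names, and helper functions are all hypothetical, and the real webapp would use its own models and database.

```python
import sqlite3
from datetime import datetime, timezone


def create_table(conn):
    """Create a table holding one row per click-through (hypothetical schema)."""
    conn.execute(
        """
        CREATE TABLE IF NOT EXISTS email_view_event (
            id INTEGER PRIMARY KEY,
            viewer TEXT NOT NULL,
            crash_id TEXT NOT NULL,
            viewed_at TEXT NOT NULL
        )
        """
    )


def record_email_view(conn, viewer, crash_id, when=None):
    """Record who looked at which crash report's email address, and when."""
    when = when or datetime.now(timezone.utc)
    conn.execute(
        "INSERT INTO email_view_event (viewer, crash_id, viewed_at) "
        "VALUES (?, ?, ?)",
        (viewer, crash_id, when.isoformat()),
    )
    conn.commit()


def views_per_viewer(conn):
    """Return a dict mapping each viewer to how many times they clicked through."""
    rows = conn.execute(
        "SELECT viewer, COUNT(*) FROM email_view_event GROUP BY viewer"
    )
    return dict(rows.fetchall())
```

The click-wall handler would call `record_email_view` before revealing the address, and `views_per_viewer` gives the raw "who looked, how often" counts.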

The last two are in the API. With the supersearch API, the user has to explicitly include the email field in the list of columns/facets to get back. We can record an event based on whether "email" is in the requested columns/facets. Earlier, I was concerned that the Crash Stats site itself uses the supersearch API a lot, but I think recording those cases as well is the right thing to do. Do we want to log an event that the user requested the email field? Do we want to log an event for each crash id the user saw?

The raw crash API always returns the email address. Since the user isn't explicitly signaling they plan to look at the email address, I think we should record a different maybe-used event so we can differentiate.
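
The distinction between the two API cases could be sketched like this. The function and event names are hypothetical, and the `Email` key for the raw crash is an assumption about the field name.

```python
def supersearch_email_event(params):
    """Return "email_requested" if the query explicitly asks for the email field.

    params maps parameter names to lists of values, as in a parsed query string.
    """
    requested = set(params.get("_columns", [])) | set(params.get("_facets", []))
    return "email_requested" if "email" in requested else None


def raw_crash_email_event(raw_crash):
    """Return "email_maybe_used" if the returned raw crash carries an email value.

    The caller did not ask for the email specifically, so this is recorded as
    a weaker signal than an explicit supersearch request.
    """
    # Assumption: the raw crash stores the address under the "Email" key.
    return "email_maybe_used" if raw_crash.get("Email") else None
```

Keeping the two event names separate makes it possible to report explicit requests and incidental exposure as different counts.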

So rough scope of work:

  • half-week: decide how we want to capture events, then implement recording them and getting the data back out for analysis
  • 2 days: implement the click-wall in the report view
  • 1 day (may not need to do this): implement the click-wall in supersearch
  • 1 day: implement event recording in the supersearch API (need to make sure using supersearch doesn't double-count)
  • 1 day: implement event recording in raw crash API

My rough estimate is 1-2 weeks of work to implement email data usage tracking.

I started a brief covering the entire issue, data gathered, and recommendation:

https://docs.google.com/document/d/1PhtZzhrnuk8NeQq4YLczvobeLgVwmYKlRKIOmI3nx88/edit#

I've done everything I'm going to do here. The doc contains a summary of findings and there's another doc with more details.

Marking as FIXED.

Status: ASSIGNED → RESOLVED
Closed: 2 years ago
Resolution: --- → FIXED