Closed Bug 1639838 Opened 4 years ago Closed 4 years ago

Many send-tab telemetry events are missing metadata about the peer device

Categories

(Firefox :: Sync, defect, P1)

defect

Tracking

()

RESOLVED FIXED
Firefox 79
Tracking Status
firefox79 --- fixed

People

(Reporter: rfkelly, Assigned: eoger)

References

(Blocks 1 open bug)

Details

Attachments

(2 files)

The sync ping contains "command-sent" and "command-received" events designed to measure the success rate of send-tab. Ideally, for each such event we would be able to locate the peer device (that is, the device to which our device sent a tab, or from which our device received a tab) as an item in the devices list from the sync ping, and thereby find out its basic device metadata like type and version.

This redash query explores what percentage of events do not have info about the peer device in their sync ping devices list. Key points:

  • Around 20% of events contain the id of the peer device, but do not have it as an entry in the devices list of the containing sync ping.
  • A further 15% to 20% of events do not have any info about the peer device at all, not even its id.

The second part is like due to Bug 1639835, but that first 20% is a bit of a mystery.

I was able to find an example of such a ping in my local archive, which I'm happy to share for reference:

15/05/2020 14:15
{
  "os": {
    "installYear": 2019,
    "hasSuperfetch": true,
    "hasPrefetch": true,
    "name": "Windows_NT",
    "version": "10.0",
    "locale": "en-AU",
    "servicePackMajor": 0,
    "servicePackMinor": 0,
    "windowsBuildNumber": 19041,
    "windowsUBR": 264
  },
  "why": "shutdown",
  "devices": [
    {
      "id": "04dca4827ac13596a49ce634d0a32f55e2ef4c42962680705be09dad8bbd1df5",
      "type": "desktop"
    }
  ],
  "version": 1,
  "syncs": [
    // omitted 
  ],
  "uid": "333412aa09dc23fd3a671315e275578b",
  "syncNodeType": "spanner",
  "deviceID": "04dca4827ac13596a49ce634d0a32f55e2ef4c42962680705be09dad8bbd1df5",
  "sessionStartDate": "2020-05-15T11:00:00.0+10:00",
  "events": [
    [
      9968860,
      "sync",
      "open-uri",
      "command-sent",
      "83499eeabbcc046a7c3c948d0d5753b133b2bc80a24edc8138bf7228e9335fd9",
      {
        "flowID": "d50437b4-8fa5-402d-9c19-5d8bf98ddebb",
        "serverTime": "1589515308.06"
      }
    ],
    [
      10026313,
      "sync",
      "open-uri",
      "command-received",
      "aabea7d669d1f567a60536ab7e2f800b08b56ae58853f575405f87aa7bc5c71a"
    ],
    [
      10168188,
      "sync",
      "open-uri",
      "command-sent",
      "aabea7d669d1f567a60536ab7e2f800b08b56ae58853f575405f87aa7bc5c71a",
      {
        "flowID": "808a2d1e-f576-4553-a9f0-56761b0bce44",
        "serverTime": "1589515910.14"
      }
    ]
  ],
  "histograms": {
      // omitted
  }
}

This ping shows me sending a tab to device 83499eeabbcc046a7c3c948d0d5753b133b2bc80a24edc8138bf7228e9335fd9, and both sending to and receiving from device aabea7d669d1f567a60536ab7e2f800b08b56ae58853f575405f87aa7bc5c71a. Neither of those devices appear in the devices list, which includes only the current device 04dca4827ac13596a49ce634d0a32f55e2ef4c42962680705be09dad8bbd1df5.

I understand that we can't necessarily include the full list of peer devices in every sync ping, because sometimes it's very large and has lots of duplicates. But is there something we can do to ensure that it at least includes all the devices that are referenced elsewhere in the ping?

The severity field is not set for this bug.
:markh, could you have a look please?

For more information, please visit auto_nag documentation.

Flags: needinfo?(markh)
Assignee: nobody → eoger
Status: NEW → ASSIGNED

I made another finding today: for OAuth clients such as Fenix and iOS, the device lastAccessTime property never ever gets updated, most likely because we never call /certificate/sign when using OAuth.
The lastAccessTime property is critical when we send a telemetry ping: we ditch "stale" devices based on that.
I've noticed that the /account/attached_clients FxA endpoint reports a proper last access time for iOS/Fenix devices, most likely because the server merges information from multiple sources (including the OAuth db) to report a more precise time.
What we should probably do is update the /devices endpoint with the same mechanism, however :rfkelly noted that "a bunch of technical debt in the fxa-auth-server might be about to come due" and the work is do is not going to be a 1 line fix here.

Flags: needinfo?(rfkelly)

:rfkelly noted that "a bunch of technical debt in the fxa-auth-server might be about to come due"

Things might be slightly better than I feared, which is always a nice discovery!

The core issue here is the way that we bolted OAuth clients on to the existing devices API:

  • Depending on the type of client, a device record may be associated with a sessionToken, an OAuth refreshToken, or both.
  • The devices table in the FxA auth database contains the device id, core device metadata such as supported commands, and nullable sessionTokenId and refreshTokenId columns.
    • The sessionTokenId references rows in the sessionTokens table which is in the same database.
    • The refreshTokenId references rows in the refreshTokens table, which for historical reasons is in a separate database (the oauth-server db).
  • The correct lastAccessTime for a device is the later of the last-access time for its sessionToken and the last-access time for its refresh token (if any).
  • When handling the /attached_clients endpoint, the FxA server does a query to the auth db for devices and sessions, a separate query to the oauth db for refresh tokens, and merges the two together. Thus, it reports a correct lastAccessTime for all clients.
  • At the time this was implemented, we couldn't update the /devices endpoint to do the same. Thus, it does not report a correct lastAccessTime for OAuth clients, which may not have a sessionToken or may not use it regularly.

The reason why /devices could not be updated at the time are explained in this comment:

// XXX: Ideally, we would call oauthdb.listAuthorizedClients here, and perform a similar merging
// to what we do in /account/attached_clients. That's awkward to do because this route can be called
// with a refreshToken, but oauthdb currently requires a sessionToken. Let's defer that to followup.

The good news is, those reasons no longer hold! Now that the oauth-server codebase has been more fully merged into the auth-server codebase, we don't have to worry about "the oauthdb currently requires a sessionToken" and we can directly query the contents of the oauth db as needed.

Here's what I think we need to do:

It's probably worth factoring out "list all devices, merging data from auth and oauth dbs" into some sort of shared helper that can be used by both of these routes, perhaps in ./lib/devices.js. But I haven't investigated what that might look like in any detail.

Flags: needinfo?(rfkelly)
Attached file GitHub Pull Request

The above pull-request was merged, but is not yet live in production. It should go out with the next deploy, FxA train-176.

(Triage: in current spring => P1)

Severity: -- → S4
Flags: needinfo?(markh)
Priority: -- → P1

I think we also still want something like that phabricator patch too? Ed, you are going to persist with that still?

Flags: needinfo?(eoger)
Attachment #9154657 - Attachment description: Bug 1639838 - Ensure devices associated to events get included in devices ping. → Bug 1639838 - Do not filter the devices list. , markh
Pushed by eoger@mozilla.com: https://hg.mozilla.org/integration/autoland/rev/0f4918b726fe Do not filter the devices list. , markh r=rfkelly
Status: ASSIGNED → RESOLVED
Closed: 4 years ago
Resolution: --- → FIXED
Target Milestone: --- → Firefox 79
Flags: needinfo?(eoger)

Unfortunately, I don't think this patch helped. We now include a (very large -- every one of my pings has over 50 devices in them now...) list of devices... but the only devices with usable information in them are the devices that we were previously including.

Specifically, prior to this patch, my list looked like:

[{
  "id": "90f50e807e35fc9f12673673dc2ebe1d05e888b9f600579036660199ae452472",
  "syncID": "95c2c6978380fd9a0b04c094e71faa565f6cce84aaefe5ebe82c5aebaef08a8c",
  "type": "mobile", "os": "iOS", "version": "27.0"
}, {
  "id": "0374965ba8f65a01eaeafadcc43026925dbba47e35a766105b2dc2a624b2f65d",
  "syncID": "0fe877fe7298246b45f3fddbce9547ea16bb9e63855cbae6c1984f0708ca5725",
  "type": "desktop", "os": "Darwin", "version": "80.0a1"
}, {
  "id": "d94b178af788cef9447bff331a1bed3bde2e221793ddbf566698e10def68bfd9",
  "syncID": "5db36368b081fe2c8205f2cc0ab784cf0968864228a709ab83bed136da39f66f",
  "type": "desktop", "os": "WINNT", "version": "79.0"
}]

And afterwards it looks like:

[{
  "id": "90f50e807e35fc9f12673673dc2ebe1d05e888b9f600579036660199ae452472",
  "syncID": "95c2c6978380fd9a0b04c094e71faa565f6cce84aaefe5ebe82c5aebaef08a8c",
  "type": "mobile", "os": "iOS", "version": "27.0"
}, {
  "id": "0374965ba8f65a01eaeafadcc43026925dbba47e35a766105b2dc2a624b2f65d",
  "syncID": "0fe877fe7298246b45f3fddbce9547ea16bb9e63855cbae6c1984f0708ca5725",
  "type": "desktop", "os": "Darwin", "version": "80.0a1"
}, {
  "id": "d94b178af788cef9447bff331a1bed3bde2e221793ddbf566698e10def68bfd9",
  "syncID": "5db36368b081fe2c8205f2cc0ab784cf0968864228a709ab83bed136da39f66f",
  "type": "desktop", "os": "WINNT", "version": "79.0"
},
{ "id": "33aa63c10da1c90f5c8d48167e6efac49ae805933b68789e1743dbf641c49b1d", "type": "desktop" },
{ "id": "964c1567462c445e958400e80650370271e88b7120b17de05b2379610a350e71", "type": "mobile" },
{ "id": "e4ea71d9a55bb56ba4fcee89900923e2cfba564a849763050134ed5a46c808b5", "type": "desktop" },
{ "id": "57cfe7ca0a869d1bbd1142b69692c31864b6137f29dcfbe496c19b238ac19528", "type": "mobile" },
{ "id": "1c72e830cb937c67f663dfa0c1a35e6bbf3b26dc4507157b8eb26475bf483833", "type": "mobile" },
{ "id": "b19410ff122129245ac82aa756ed5ae0cd3793019ce51a5f2ea5657dacef9502", "type": "desktop" },
{ "id": "759461aadaac986767e4354bc9fdb3722a922b5c0b78878ece90eb03b6b66c97", "type": "mobile" },
{ "id": "fb06c88441f33b8d8e756bb258a7a65b5b65e3391ba768dff3110bcc04a97ad0", "type": "desktop" },
{ "id": "39fe4fe1ccc531000488fead7db54069b81ee289b50768b288358cbffba6029f", "type": "mobile" },
{ "id": "706a9f3155252950094e140123da702b811c7f730bba2acde0465fd4be1602e0", "type": "desktop" },
{ "id": "5ee866da368fadcd3dc26449ef3f3a5eae966d969995bc9157d16f4aac7127cb", "type": "desktop" },
{ "id": "40c78192e0a2d67e843df1da17b1fd8f6fc3f0dd0376f6de3518e5f3eac7f044", "type": "mobile" },
{ "id": "8d6375bd1cef89d0a1a6eb7ccf862425c56ada2f8f9328a33bf5dc3ec3938931", "type": "desktop" },
{ "id": "c4937e0d5b9d5437a75c409305107e7c3302ba9b5e4ebfaf2478d3dd4ec4160a", "type": "desktop" },
// ... literally like 40 more along the same lines

Unfortunately, for the purposes of the send-tab telemetry, this data is not useful without the os and version information. So I think we still need to fix this issue.

(In reply to Thom Chiovoloni [:tcsc] (please ni?) from comment #10)

Unfortunately, I don't think this patch helped. We now include a (very large -- every one of my pings has over 50 devices in them now...) list of devices... but the only devices with usable information in them are the devices that we were previously including.

It did help - knowing the device is desktop vs mobile is helpful, especially because we know only desktop records the telemetry. While it would be great if the other devices has os and version, there's no regression here.

No longer regressions: 1651497
You need to log in before you can comment on or make changes to this bug.

Attachment

General

Created:
Updated:
Size: