Closed Bug 1540822 (bmo-stmo-setup) Opened 3 years ago Closed 3 years ago

Next steps for sending BMO data to STMO

Categories

(bugzilla.mozilla.org :: Administration, task)

Version: Production
Type: task
Priority: Not set
Severity: normal

Tracking

RESOLVED FIXED

People

(Reporter: dylan, Unassigned)

References

Details

As of today's deployment, the "report ping" code is in place.

The next steps involve working with the data team to get the generic ingestion endpoint to accept this data. Presumably they need the JSON Schema (given at the end of this bug description), plus the namespace, name, and version of this ping.

From the Bugzilla end, we'll be running a Docker container periodically:

docker run mozillabteam/bmo ./bugzilla.pl report_ping \
    --report-type Simple \
    --url http://incoming.telemetry.mozilla.org/submit

This will send a copy of every bug to http://incoming.telemetry.mozilla.org/submit/bugzilla/simple/1/BUG_ID. It will do this using 10 concurrent requests and will probably take a few hours.
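For concreteness, here is a minimal sketch of what each submission amounts to (Python with the requests library; submit_bug and submit_all are illustrative names, not the actual report_ping implementation):

import requests
from concurrent.futures import ThreadPoolExecutor

# Sketch only: each bug becomes one HTTP PUT of a schema-conforming
# JSON document to /submit/<namespace>/<doctype>/<version>/<document id>,
# i.e. /submit/bugzilla/simple/1/BUG_ID as described above.
BASE = "http://incoming.telemetry.mozilla.org/submit/bugzilla/simple/1"

def submit_bug(bug):
    resp = requests.put(f"{BASE}/{bug['bug_id']}", json=bug)
    resp.raise_for_status()

def submit_all(bugs):
    # Mirror the ~10 concurrent requests mentioned above with a thread pool.
    with ThreadPoolExecutor(max_workers=10) as pool:
        list(pool.map(submit_bug, bugs))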

The JSON payload of every request is guaranteed to match the schema below.

JSON Schema

{
   "additionalProperties" : false,
   "properties" : {
      "assigned_to" : {
         "type" : "integer"
      },
      "blocked_by" : {
         "items" : {
            "type" : "integer"
         },
         "type" : "array"
      },
      "bug_id" : {
         "minimum" : 1,
         "type" : "integer"
      },
      "bug_severity" : {
         "type" : "string"
      },
      "bug_status" : {
         "type" : "string"
      },
      "component" : {
         "type" : "string"
      },
      "depends_on" : {
         "items" : {
            "type" : "integer"
         },
         "type" : "array"
      },
      "duplicate_of" : {
         "type" : [
            "null",
            "integer"
         ]
      },
      "duplicates" : {
         "items" : {
            "type" : "integer"
         },
         "type" : "array"
      },
      "flags" : {
         "items" : {
            "additionalProperties" : false,
            "properties" : {
               "name" : {
                  "type" : "string"
               },
               "requestee_id" : {
                  "type" : [
                     "null",
                     "integer"
                  ]
               },
               "setter_id" : {
                  "type" : "integer"
               },
               "status" : {
                  "enum" : [
                     "?",
                     "+",
                     "-"
                  ],
                  "type" : "string"
               }
            },
            "required" : [
               "status",
               "name",
               "setter_id",
               "requestee_id"
            ],
            "type" : "object"
         },
         "type" : "array"
      },
      "groups" : {
         "items" : {
            "type" : "string"
         },
         "type" : "array"
      },
      "keywords" : {
         "items" : {
            "type" : "string"
         },
         "type" : "array"
      },
      "priority" : {
         "type" : "string"
      },
      "product" : {
         "type" : "string"
      },
      "qa_contact" : {
         "type" : [
            "null",
            "integer"
         ]
      },
      "reporter" : {
         "type" : "integer"
      },
      "resolution" : {
         "type" : "string"
      },
      "target_milestone" : {
         "type" : "string"
      },
      "version" : {
         "type" : "string"
      }
   },
   "required" : [
      "priority",
      "blocked_by",
      "duplicate_of",
      "bug_id",
      "reporter",
      "keywords",
      "duplicates",
      "groups",
      "assigned_to",
      "qa_contact",
      "bug_severity",
      "depends_on",
      "bug_status",
      "flags",
      "version",
      "component",
      "product",
      "target_milestone"
   ],
   "type" : "object"
}
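As a sanity check, here is a hypothetical document that satisfies the schema, validated with the Python jsonschema package. All field values are made up, and the schema is assumed to be saved locally as simple_bug.schema.json:

import json
from jsonschema import validate  # pip install jsonschema

# Hypothetical example document; the set of keys satisfies the
# "required" list in the schema above.
doc = {
    "bug_id": 1540822,
    "assigned_to": 1,
    "reporter": 2,
    "qa_contact": None,
    "priority": "P1",
    "bug_severity": "normal",
    "bug_status": "RESOLVED",
    "resolution": "FIXED",
    "product": "bugzilla.mozilla.org",
    "component": "Administration",
    "version": "Production",
    "target_milestone": "---",
    "keywords": [],
    "groups": [],
    "blocked_by": [],
    "depends_on": [1540857, 1540860],
    "duplicates": [],
    "duplicate_of": None,
    "flags": [
        {"name": "needinfo", "status": "?", "setter_id": 2, "requestee_id": 3},
    ],
}

with open("simple_bug.schema.json") as f:
    schema = json.load(f)

validate(instance=doc, schema=schema)  # raises ValidationError on mismatch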

Hey :alm, I'd like to move forward with getting the data team to accept a new data ping -- outlined in comment #0 of this bug. My systems are ready to send them all the records, but I don't have a clear picture of the next steps.

Pointers would be good -- but I'm interested in having someone else drive this process if possible.

Flags: needinfo?(ameihm)

Un-ping for now; found more things.

Depends on: 1540857
Depends on: 1540859
Depends on: 1540860

Mark, are you the right contact to help us get this data ingested into the telemetry pipeline?

Flags: needinfo?(ameihm) → needinfo?(mreid)

I can help! I think there may be an easier way to integrate this data rather than sending things through the telemetry ingestion endpoint, so I'd like to do a bit of research first. I will update next week, but if this is urgent / blocking please let me know.

Leaving the needinfo so I get nagged accordingly :)

(In reply to Mark Reid [:mreid] from comment #4)

> I can help! I think there may be an easier way to integrate this data rather than sending things through the telemetry ingestion endpoint, so I'd like to do a bit of research first. I will update next week, but if this is urgent / blocking please let me know.
>
> Leaving the needinfo so I get nagged accordingly :)

Well, I don't care what the endpoint is, as long as it can handle 1.5 million HTTP PUT requests containing JSON payloads that match the schema I posted above.

I'm going to proceed with submitting a schema and test data following the process that :mars followed back in October, unless I hear this is not the way to go.

Flags: needinfo?(mreid)
Depends on: 1541918
No longer depends on: 1540859

(In reply to Dylan Hardison [:dylan] (he/him) from comment #6)

> unless I hear this is not the way to go.

For the record, my concern is that sending records through the ingestion endpoint may not have the characteristics you want. In particular, our deduplication of IDs uses "first seen record wins", which means that the oldest and most stale copy of each bug would be the one we keep.

We are also moving towards requiring that the document ID be UUID-shaped (so a bare Bug ID wouldn't fit).

Per our offline discussion, the code has moved on a bit to use "Bug ID + timestamp" as the identifier, and the ability to store each revision of the bug is desirable.

That takes care of my main concern, and if we can (SHA?) hash the Bug ID + timestamp to produce a unique ID for each bug revision, that should take care of the second concern.

We also discussed that it would be a good idea to add the timestamp itself as a field inside the JSON so that it's easy to identify the sequence of revisions when querying later.
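A sketch of how that could look (the function and field names here are illustrative, not what shipped): derive a UUID-shaped document ID from a SHA-256 hash of the Bug ID and timestamp, and carry the timestamp inside the payload as well.

import hashlib
import uuid
from datetime import datetime, timezone

def revision_doc_id(bug_id, timestamp):
    # Fold the first 16 bytes of SHA-256("bug_id:timestamp") into a
    # UUID, giving a stable, UUID-shaped ID per bug revision.
    digest = hashlib.sha256(f"{bug_id}:{timestamp}".encode()).digest()
    return str(uuid.UUID(bytes=digest[:16]))

timestamp = datetime.now(timezone.utc).isoformat()
doc_id = revision_doc_id(1540822, timestamp)

# "snapshot_ts" is a hypothetical field name; the posted schema sets
# additionalProperties to false, so it would need a matching property added.
payload = {"bug_id": 1540822, "snapshot_ts": timestamp}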

It may still be a useful optimization to directly load this data into BigQuery or similar, but we can tackle that later if need be.

Anything I can do to help move this along?

The data is there now.

Status: NEW → RESOLVED
Closed: 3 years ago
Resolution: --- → FIXED

I don't see a data source on https://sql.telemetry.mozilla.org/queries/new - is this behind a permission or is there additional work needed?

Thanks Dylan!
Freddy: sync up with Simon; he's writing queries using the data and can help you get access.
