980385 - [Baloo] Provide payload samples for SUMO contribution activity

Reporter

Description

11 years ago

Based on the already identified contribution activity types (https://wiki.mozilla.org/Wormhole/Schema) for SUMO, plus a new one (creation of a profile), we need to create payload (JSON) samples to be reviewed by the BIDW team. Thanks

Pierros Papadeas [:pierros]

Reporter

Updated

11 years ago

Group: metrics-private

Ricky Rosario [:rrosario, :r1cky]

Assignee

Comment 1

11 years ago

Let's start with one example, using the schema described in https://wiki.mozilla.org/Wormhole/Schema

For a SUMO answer in the Support Forum, the schema is as follows:

guid: Globally Unique Identifier of the contribution (string)
email: email address of contributor (string)
datetime: date and time of contribution (datetime)
canonical: permanent URL to contribution (string)
volunteer: boolean value indicating paid employee status (boolean) OPTIONAL
type: a codified string that describes the contribution (string)
source: source of contribution, one of the following: (string)
    bugzilla
    hg
    sumo
extra: per-contribution data, following a source-dependent schema (dictionary)
    type: type of SUMO contribution, one of: (string)
    locale: locale of contribution (string)
    question: id of answered question (integer)
    id: id of answer (integer)
    product: associated product slug (string)
    topic: associated topic slug (string)

So, the JSON for this answer here: https://support.mozilla.org/en-US/questions/986343#answer-533604 looks like:

{
    "guid": "<SEE QUESTIONS BELOW>",
    "email": "rrosario@mozilla.com",
    "datetime": "<SEE QUESTIONS BELOW>",
    "canonical": "https://support.mozilla.org/en-US/questions/986343#answer-533604",
    /* "volunteer": "<SKIPPING THIS BECAUSE OPTIONAL>", */
    "type": "sumo-answer",
    "source": "sumo",
    "extra": {
        "type": "answer",
        "locale": "en-US",
        "question": 986343,
        "answer": 533604,
        "product": "firefox",
        "topic": "download-and-install"
    }
}

Questions:
* Do I make up any string for the GUID? Or is that generated in Baloo?
* What format should datetimes be provided in?
* There are two redundant types in this case. Is that ok?
* The docs should be updated based on these questions and simple examples should be provided.
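A minimal sketch of how a payload like the one above could be assembled in Python. The function name build_answer_payload and its parameters are hypothetical, the guid is left out because its origin is still an open question, and the ISO 8601 text used for datetime is an assumption at this point in the thread. The extra block follows the field list in the schema (id for the answer id).

```python
import json
from datetime import datetime


def build_answer_payload(email, when, question_id, answer_id,
                         product, topic, locale="en-US"):
    """Assemble a SUMO answer payload using the field names listed above."""
    # Canonical URL pattern copied from the example answer link.
    canonical = "https://support.mozilla.org/%s/questions/%d#answer-%d" % (
        locale, question_id, answer_id)
    return {
        # "guid" is omitted: who generates it is still an open question.
        "email": email,
        # Assumption: ISO 8601 text; the datetime format is an open question.
        "datetime": when.isoformat(),
        "canonical": canonical,
        "type": "sumo-answer",
        "source": "sumo",
        "extra": {
            "type": "answer",
            "locale": locale,
            "question": question_id,
            "id": answer_id,
            "product": product,
            "topic": topic,
        },
    }


payload = build_answer_payload(
    "rrosario@mozilla.com",
    datetime(2014, 2, 13, 9, 22, 54),  # illustrative timestamp only
    986343, 533604, "firefox", "download-and-install")
print(json.dumps(payload, indent=2, sort_keys=True))
```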

Ricky Rosario [:rrosario, :r1cky]

Assignee

Updated

11 years ago

OS: Linux → All

Hardware: x86_64 → All

Pierros Papadeas [:pierros]

Reporter

Comment 2

11 years ago

Some quick answers:
* GUID: we can generate this randomly (like git commits basically) as long as it is really unique.
* Time and date: There is only http://en.wikipedia.org/wiki/ISO_8601 :)
* Two types: I don't know. Would we have a case where those differ?
* Is the last one a question?

Thanks!
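As a rough illustration of the first two answers, here is a sketch in Python. The helper names are hypothetical, and the 40-character hex form (the same length as a git commit id) is only an assumption about what a random identifier could look like; nothing here is confirmed as the Baloo format.

```python
import secrets
from datetime import datetime, timezone


def random_guid(nbytes=20):
    """Return a random hex string; 20 random bytes give 40 hex characters."""
    return secrets.token_hex(nbytes)


def to_iso8601(dt):
    """Serialise a datetime as ISO 8601 text."""
    return dt.isoformat()


def from_iso8601(text):
    """Parse ISO 8601 text produced by to_iso8601 back into a datetime."""
    return datetime.fromisoformat(text)


naive = datetime(2014, 2, 13, 9, 22, 54)
aware = naive.replace(tzinfo=timezone.utc)
print(random_guid())
print(to_iso8601(naive))
print(to_iso8601(aware))
```

A timezone-aware datetime carries its UTC offset in the output, which avoids ambiguity when several systems submit contributions.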

Ricky Rosario [:rrosario, :r1cky]

Assignee

Comment 3

11 years ago

(In reply to Pierros Papadeas [:pierros] from comment #2)
> * GUID: we can generate this randomly (like git commits basically) as long
> as it is really unique.

OK, I guess the question is: what is it for? Will I ever need to use it from my end?

Pierros Papadeas [:pierros]

Reporter

Comment 4

11 years ago

(In reply to Ricky Rosario [:rrosario, :r1cky] from comment #3)
> (In reply to Pierros Papadeas [:pierros] from comment #2)
> > * GUID: we can generate this randomly (like git commits basically) as long
> > as it is really unique.
>
> OK, I guess the question is: what is it for? Will I ever need to use it from
> my end?

If we can generate this on the Baloo side then we don't need it on the systems side. @Anurag would that be possible on submission?

Flags: needinfo?(aphadke)

Anurag Phadke[:aphadke@mozilla.com]

Comment 5

11 years ago

(+harsha)

Flags: needinfo?(aphadke) → needinfo?(schintalapani)

Harsha [:harsha]

Comment 6

11 years ago

If the client does not provide a UUID, bagheera generates one, but we prefer to have UUID generation on the client side. The client can submit the JSON data along with a UUID to the following URL:

https://data.mozilla.com/submit/baloo/{uuid}

If the client sends the data to

https://data.mozilla.com/submit/baloo

then bagheera generates a new UUID for that request. We use the UUID as the key and the JSON data as the value to store data in Hadoop (HBase, HDFS, etc.).
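The two submission URLs can be sketched as a small helper. The function name submit_url is hypothetical, and because the thread does not state the HTTP method or headers, this sketch only builds the URL and does not send anything.

```python
from urllib.parse import quote

BASE_URL = "https://data.mozilla.com/submit/baloo"


def submit_url(doc_id=None):
    """Return the submission URL, with the client-chosen id when given."""
    if doc_id is None:
        # No id supplied: per the comment above, bagheera generates one.
        return BASE_URL
    # Percent-encode the id so that it stays a single path segment.
    return "%s/%s" % (BASE_URL, quote(str(doc_id), safe=""))


print(submit_url())
print(submit_url("0f8fad5b-d9cb-469f-a165-70867728950e"))
```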

Flags: needinfo?(schintalapani)

Ricky Rosario [:rrosario, :r1cky]

Assignee

Comment 7

11 years ago

(In reply to Harsha [:harsha] from comment #6)
> Client can submit the json data along with uuid to the following url
>
> https://data.mozilla.com/submit/baloo/{uuid}

Can the uuid be any string I make up or does it have some constraints?

Pierros Papadeas [:pierros]

Reporter

Comment 8

11 years ago

So if we stick to submitting without it, then we are OK? What is the UUID generated from? (SHA something?) Also, is the preference for the client system to create it because of the load that generating it puts on the bagheera side?

Harsha [:harsha]

Comment 9

11 years ago

ricky, the uuid can be anything; we don't have any constraints per se, but to get a better distribution of data in hadoop (we use the uuid as the key) we use uuid version 4.

pierros, we use java.util.UUID: https://github.com/mozilla-metrics/bagheera/blob/master/src/main/java/com/mozilla/bagheera/http/BagheeraHttpRequest.java line 72. randomUUID() generates a version 4 UUID.
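For a client that wants to produce the same kind of key itself, Python's standard uuid module offers uuid4(), which also yields a random version 4 UUID. This is a sketch of a client-side analogue, not the bagheera code; the helper names are hypothetical.

```python
import uuid


def new_document_key():
    """Return a random version 4 UUID in its canonical text form."""
    return str(uuid.uuid4())


def is_version4(text):
    """Check whether a string parses as a version 4 UUID."""
    try:
        return uuid.UUID(text).version == 4
    except ValueError:
        return False


key = new_document_key()
print(key, is_version4(key))
```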

Pierros Papadeas [:pierros]

Reporter

Updated

11 years ago

Flags: needinfo?(rrosario)

Ricky Rosario [:rrosario, :r1cky]

Assignee

Comment 10

11 years ago

If I was going to generate the string to be used as an ID, I'd use a scheme such as sumo-answer-<answerid>, sumo-kbedit-<revisionid>, etc. That way, I know the corresponding id in baloo because it follows a predictable pattern. I am not sure if that will be useful or not at this time. Does the system allow editing/updating contributions after they have been initially sent to baloo? These are things we should flesh out now so we can come up with the Best Practices for other services.
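The predictable naming scheme described above could look roughly like this. Both helpers are hypothetical, and the kind strings ("answer", "kbedit") are taken from the examples in the comment rather than from any agreed list.

```python
def contribution_key(kind, pk):
    """Build a predictable id such as 'sumo-answer-533604'."""
    return "sumo-%s-%d" % (kind, pk)


def parse_contribution_key(key):
    """Split a predictable id back into its kind and numeric id."""
    # The numeric id is everything after the last hyphen, so kinds that
    # themselves contain hyphens still round-trip.
    head, _, pk = key.rpartition("-")
    if not head.startswith("sumo-") or not pk.isdigit():
        raise ValueError("not a SUMO contribution key: %r" % key)
    return head[len("sumo-"):], int(pk)


print(contribution_key("answer", 533604))
print(parse_contribution_key("sumo-kbedit-29398"))
```

Because the key is derived from data SUMO already stores, no extra identifier has to be kept on the SUMO side to find the matching record later.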

Flags: needinfo?(rrosario)

Ricky Rosario [:rrosario, :r1cky]

Assignee

Comment 11

11 years ago

BTW, I do know that I don't want to store an extra random UUID on my end.

Pierros Papadeas [:pierros]

Reporter

Comment 12

11 years ago

Based on Comment 6 we can continue without UUID. Ricky can you go ahead and create the json files?

Flags: needinfo?(rrosario)

Ricky Rosario [:rrosario, :r1cky]

Assignee

Comment 13

11 years ago

Here are the samples for the SUMO contributions specified in the Wormhole schema...

Answer:
{"extra": {"product": "Firefox", "locale": "en-US", "question": 986343, "topic": "Download, install and migration", "type": "answer", "id": 533604}, "datetime": "2014-02-13T09:22:54", "source": "sumo", "type": "sumo-answer", "email": "user179845@example.com", "canonical": "https://support.mozilla.org/questions/986343#answer-533604"}

Forum post:
{"extra": {"locale": "en-US", "type": "forum-post", "id": 45251, "thread": 708160, "slug": "contributors"}, "datetime": "2012-02-14T06:38:45", "source": "sumo", "type": "sumo-forum-post", "email": "user179845@example.com", "canonical": "https://support.mozilla.org/forums/contributors/708160#post-45251"}

KB Forum (Discuss Article) post:
{"extra": {"locale": "en-US", "type": "kbforum-post", "id": 583, "thread": 435, "slug": "how-do-i-restore-my-tabs-last-time"}, "datetime": "2011-03-17T10:39:30", "source": "sumo", "type": "sumo-kbforum-post", "email": "user345199@example.com", "canonical": "https://support.mozilla.org/en-US/kb/how-do-i-restore-my-tabs-last-time/discuss/435#post-583"}

KB Revision:
{"extra": {"locale": "es", "article": 10945, "type": "kb-revision", "id": 29398, "slug": "Soluci\u00f3n b\u00e1sica de problemas"}, "datetime": "2012-08-07T03:54:10", "source": "sumo", "type": "sumo-kb-revision", "email": "user179845@example.com", "canonical": "https://support.mozilla.org/es/kb/Soluci%C3%B3n%20b%C3%A1sica%20de%20problemas/revision/29398"}

Verify those and tell me what you want next :)
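One way to verify samples like these is a small checker. This is a sketch only: the list of required fields is an assumption drawn from the schema discussion in this bug (with guid left out, as agreed above), and check_payload is a hypothetical helper, not part of Baloo.

```python
import json
from datetime import datetime

# Assumption: top-level fields every sample should carry.
REQUIRED_FIELDS = ("email", "datetime", "canonical", "type", "source", "extra")
DATETIME_FORMAT = "%Y-%m-%dT%H:%M:%S"

# The forum post sample from the comment above.
FORUM_POST = (
    '{"extra": {"locale": "en-US", "type": "forum-post", "id": 45251, '
    '"thread": 708160, "slug": "contributors"}, '
    '"datetime": "2012-02-14T06:38:45", "source": "sumo", '
    '"type": "sumo-forum-post", "email": "user179845@example.com", '
    '"canonical": "https://support.mozilla.org/forums/contributors/708160#post-45251"}'
)


def check_payload(raw):
    """Return a list of problems found in one JSON payload string."""
    payload = json.loads(raw)
    problems = ["missing field: %s" % name
                for name in REQUIRED_FIELDS if name not in payload]
    if "datetime" in payload:
        try:
            datetime.strptime(payload["datetime"], DATETIME_FORMAT)
        except (TypeError, ValueError):
            problems.append("datetime is not in YYYY-MM-DDTHH:MM:SS form")
    if "extra" in payload and not isinstance(payload["extra"], dict):
        problems.append("extra is not a JSON object")
    return problems


print(check_payload(FORUM_POST))
```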

Flags: needinfo?(rrosario)

Pierros Papadeas [:pierros]

Reporter

Comment 14

11 years ago

This looks fantastic dude! @Anurag is that OK with you too? They seem valid to me. Once we have confirmation, Ricky you should go ahead and produce the actual ones (for all historical data). I will be filing a new bug for that.

Josh Matthews [:jdm]

Comment 15

11 years ago

Could someone clarify the purpose of the type field? I always envisioned being able to query based on the source and knowledge of the extra data - someone looking for stats on all SUMO activity could just look for source='sumo' without having to know anything special about the types of activities, while more specialized queries could be performed looking for specific kinds of data (based on extra.type, for example) that matched known formats that are understood. The 'standard' type field that's been added seems redundant to me.

Ricky Rosario [:rrosario, :r1cky]

Assignee

Comment 16

11 years ago

(In reply to Josh Matthews [:jdm] from comment #15)
> Could someone clarify the purpose of the type field?

That's a great question. I mentioned in Comment 1: '* There are two redundant types in this case. Is that ok?'

I think I agree with you that the top level type is useless EXCEPT if we can standardize on a set of types that can be reused across sources. For example, type="l10n" could be shared across sumo, mdn, verbatim, etc. Then you could query for localization activity only. Coming up with this "standard" set of types might be difficult though. Naming things is hard :)
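The three kinds of query under discussion (by source, by a shared top-level type, and by source-specific extra.type) can be sketched over a list of payload dicts. The "l10n" type and the "mdn" records are hypothetical, following the example in the comment above; no such standard types have been agreed.

```python
def by_source(payloads, source):
    """All activity from one system, with no knowledge of its extra data."""
    return [p for p in payloads if p.get("source") == source]


def by_type(payloads, type_):
    """Activity of one shared top-level type across all sources."""
    return [p for p in payloads if p.get("type") == type_]


def by_extra_type(payloads, source, extra_type):
    """Source-specific query that relies on knowing the extra schema."""
    return [p for p in by_source(payloads, source)
            if p.get("extra", {}).get("type") == extra_type]


payloads = [
    {"source": "sumo", "type": "sumo-answer", "extra": {"type": "answer"}},
    {"source": "sumo", "type": "l10n", "extra": {"type": "kb-revision"}},  # hypothetical type
    {"source": "mdn", "type": "l10n", "extra": {"type": "translation"}},   # hypothetical record
]
print(len(by_source(payloads, "sumo")))
print(len(by_type(payloads, "l10n")))
print(len(by_extra_type(payloads, "sumo", "kb-revision")))
```

The top-level type only adds something over source plus extra.type when its values are shared across sources, as the second query shows.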

Ricky Rosario [:rrosario, :r1cky]

Assignee

Comment 17

11 years ago

Oh, I forgot the JSON for registering on SUMO as a contributor:

Register as SUMO contributor:
{"extra": {"type": "register", "id": 179845}, "datetime": "2012-08-07T03:54:10", "source": "sumo", "type": "sumo-register", "email": "user179845@example.com", "canonical": "https://support.mozilla.org/en-US/user/179845"}

Unlike the other sumo contributions, I think this one shouldn't pass a locale.

Anurag Phadke[:aphadke@mozilla.com]

Comment 18

11 years ago

:rrosario - any valid JSON works with Baloo as long as it has all the elements needed to answer the analytic questions that you want to be answered.

Pierros Papadeas [:pierros]

Reporter

Comment 19

11 years ago

Reply to Comment 16: There is no easy way to do this. I still believe, though, that having the type in the core part of the payload will make our lives easier, since we won't have to dig into the extra to find out what a contribution is. Let's move ahead with it even though it might look like duplication. I would add just one thing: the version of the spec used. Let's add a new field for that (this would be version 0.1). Ricky, if you are OK with it, please mass-produce the historical data for SUMO and provide a link so we can test-import and use it. Thanks!
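Adding the spec version could be as small as the sketch below. The field name "version" is an assumption, since the comment only says that a new field should be added; the value 0.1 comes from the comment.

```python
SPEC_VERSION = "0.1"


def with_spec_version(payload, version=SPEC_VERSION):
    """Return a copy of the payload stamped with the spec version."""
    stamped = dict(payload)  # shallow copy, the caller's dict is untouched
    # Assumption: the new field is called "version".
    stamped["version"] = version
    return stamped


original = {"source": "sumo", "type": "sumo-register"}
stamped = with_spec_version(original)
print(stamped)
```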

Flags: needinfo?(rrosario)

Anurag Phadke[:aphadke@mozilla.com]

Comment 20

11 years ago

:pierros, :rrosario - The baloo_kbforum.json contains a ton of JSON fields. Which of these fields do you want to store in columns? Or do you want to store each JSON field as a separate column?

Flags: needinfo?(pierros)

Anurag Phadke[:aphadke@mozilla.com]

Comment 21

11 years ago

:pierros - as per our latest IRC conversation, I am planning to load this data inside HDFS, each entry as a new line. We can then run aggregations using MR jobs. I'll have the data imported today. -anurag
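The one-entry-per-line layout can be sketched as follows. It assumes the payloads are already available as a list of dicts in memory (for example, parsed from a JSON array); the actual structure of the export file is not described in this bug, and to_json_lines is a hypothetical helper.

```python
import json


def to_json_lines(payloads):
    """Serialise each payload as one line of JSON text."""
    # json.dumps without indent never emits a raw newline, so every
    # payload stays on exactly one line.
    return "".join(json.dumps(p, sort_keys=True) + "\n" for p in payloads)


text = to_json_lines([
    {"source": "sumo", "type": "sumo-answer"},
    {"source": "sumo", "type": "sumo-register"},
])
print(text, end="")
```

Line-oriented input lets each record be processed independently, which suits the kind of MR aggregation mentioned above.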

Anurag Phadke[:aphadke@mozilla.com]

Comment 22

11 years ago

:pierros - I wrote a sample script to parse the baloo_kbforum.json file. Each individual payload is now written on its own line. This allows us to generate various aggregation metrics:
# of unique emails
# of submissions within the last 6 months
etc.

pierros, ricky - can we sync sometime tomorrow over Vidyo to discuss which aggregation metrics need to be pushed to Vertica?
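The two metrics named above can be sketched in plain Python over line-delimited payloads. This is an in-memory illustration, not the actual script or an MR job; the function names are hypothetical, and "6 months" is approximated as 183 days, which is an assumption.

```python
import json
from datetime import datetime, timedelta

DATETIME_FORMAT = "%Y-%m-%dT%H:%M:%S"


def read_payloads(lines):
    """Parse one JSON payload per non-empty line."""
    for line in lines:
        line = line.strip()
        if line:
            yield json.loads(line)


def count_unique_emails(payloads):
    """Number of distinct contributor emails."""
    return len({p["email"] for p in payloads})


def count_recent(payloads, now, days=183):
    """Number of submissions dated within the last `days` days."""
    cutoff = now - timedelta(days=days)
    return sum(1 for p in payloads
               if datetime.strptime(p["datetime"], DATETIME_FORMAT) >= cutoff)


lines = [
    '{"email": "user179845@example.com", "datetime": "2014-02-13T09:22:54"}',
    '{"email": "user179845@example.com", "datetime": "2012-02-14T06:38:45"}',
    '{"email": "user345199@example.com", "datetime": "2011-03-17T10:39:30"}',
]
records = list(read_payloads(lines))
print(count_unique_emails(records))
print(count_recent(records, now=datetime(2014, 3, 1)))
```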

Ricky Rosario [:rrosario, :r1cky]

Assignee

Comment 23

11 years ago

I think I provided the info requested in the other bug.

Flags: needinfo?(rrosario)

Pierros Papadeas [:pierros]

Reporter

Comment 24

11 years ago

This is now resolved.

Status: NEW → RESOLVED

Closed: 11 years ago

Flags: needinfo?(pierros)

Resolution: --- → FIXED