Bug 980385 - [Baloo] Provide payload samples for SUMO contribution activity
Status: RESOLVED FIXED
Product: Mozilla Metrics
Classification: Other
Component: Data/Backend Reports
Assigned To: Ricky Rosario [:rrosario, :r1cky]
Depends on: 987239
Blocks: 1024000
Reported: 2014-03-06 08:52 PST by Pierros Papadeas [:pierros]
Modified: 2014-06-11 10:47 PDT

Description Pierros Papadeas [:pierros] 2014-03-06 08:52:11 PST
Based on the already identified contribution activity types ( https://wiki.mozilla.org/Wormhole/Schema ) for SUMO, plus a new one (creation of a profile), we need to create payload (JSON) samples to be reviewed by the BIDW team.

Thanks
Comment 1 Ricky Rosario [:rrosario, :r1cky] 2014-03-12 11:52:51 PDT
Let's start with one example, using the schema described in https://wiki.mozilla.org/Wormhole/Schema

For a SUMO answer in the Support Forum, the schema is as follows:

guid: Globally Unique Identifier of the contribution (string)
email: email address of contributor (string)
datetime: date and time of contribution (datetime)
canonical: permanent URL to contribution (string)
volunteer: boolean value indicating paid employee status (boolean) OPTIONAL
type: a codified string that describes the contribution (string)
source: source of contribution, one of the following: (string)
    bugzilla
    hg
    sumo
extra: per-contribution data, following a source-dependent schema (dictionary)
    type: type of SUMO contribution, one of: (string)
    locale: locale of contribution (string)
    question: id of answered question (integer)
    id: id of answer (integer)
    product: associated product slug (string)
    topic: associated topic slug (string)


So the JSON for this answer here:
https://support.mozilla.org/en-US/questions/986343#answer-533604

looks like:
{
    "guid": "<SEE QUESTIONS BELOW>",
    "email": "rrosario@mozilla.com",
    "datetime": "<SEE QUESTIONS BELOW>",
    "canonical": "https://support.mozilla.org/en-US/questions/986343#answer-533604",
    /* "volunteer": "<SKIPPING THIS BECAUSE OPTIONAL>", */
    "type": "sumo-answer",
    "source": "sumo",
    "extra": {
        "type": "answer",
        "locale": "en-US",
        "question": 986343,
        "id": 533604,
        "product": "firefox",
        "topic": "download-and-install"
    }
}


Questions:
* Do I make up any string for the GUID? Or is that generated in Baloo?
* What format should datetimes be provided in?
* There are two redundant types in this case. Is that ok?
* The docs should be updated based on these questions and simple examples should be provided.
Comment 2 Pierros Papadeas [:pierros] 2014-03-12 12:47:32 PDT
Some quick answers:

* GUID: we can generate this randomly (like git commits basically) as long as it is really unique.
* Time and date: There is only http://en.wikipedia.org/wiki/ISO_8601 :)
* Two types: I don't know. Would there be a case where those differ?
* Is the last one a question?

Thanks!
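
As an illustration of the ISO 8601 answer above, here is a minimal Python sketch that renders a timestamp in the same shape as the "datetime" values used in the samples later in this bug. The helper name to_iso8601 is hypothetical, and dropping microseconds is an assumption; the thread does not say whether a timezone offset is expected.

```python
from datetime import datetime, timezone


def to_iso8601(moment: datetime) -> str:
    """Return an ISO 8601 string with second precision (microseconds dropped)."""
    return moment.replace(microsecond=0).isoformat()


# A naive datetime carries no offset; an aware one appends its UTC offset.
naive = datetime(2014, 2, 13, 9, 22, 54)
aware = datetime(2014, 2, 13, 9, 22, 54, tzinfo=timezone.utc)

print(to_iso8601(naive))  # 2014-02-13T09:22:54
print(to_iso8601(aware))  # 2014-02-13T09:22:54+00:00
```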
Comment 3 Ricky Rosario [:rrosario, :r1cky] 2014-03-12 13:00:31 PDT
(In reply to Pierros Papadeas [:pierros] from comment #2)
> * GUID: we can generate this randomly (like git commits basically) as long
> as it is really unique.

OK, I guess the question is: what is it for? Will I ever need to use it from my end?
Comment 4 Pierros Papadeas [:pierros] 2014-03-12 13:12:56 PDT
(In reply to Ricky Rosario [:rrosario, :r1cky] from comment #3)
> (In reply to Pierros Papadeas [:pierros] from comment #2)
> > * GUID: we can generate this randomly (like git commits basically) as long
> > as it is really unique.
> 
> OK, I guess the question is: what is it for? Will I ever need to use it from
> my end?

If we can generate this on the Baloo side, then we don't need it on the source system's side.

@Anurag would that be possible on submission?
Comment 5 Anurag Phadke[:aphadke@mozilla.com] 2014-03-12 13:27:35 PDT
(+harsha)
Comment 6 Harsha [:harsha] 2014-03-14 08:53:39 PDT
If the client does not provide a UUID, Bagheera generates one, but we prefer to have the UUID generated on the client side.
The client can submit the JSON data along with the UUID to the following URL:

https://data.mozilla.com/submit/baloo/{uuid}

If the client instead sends the data to
https://data.mozilla.com/submit/baloo then Bagheera generates a new UUID for that request.
We use the UUID as the key and the JSON data as the value when storing data in Hadoop (HBase, HDFS, etc.).
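
The two submission URLs described above could be built on the client roughly as follows. This is only a sketch: the HTTP method (POST) and the Content-Type header are assumptions not stated in this bug, build_submission is a hypothetical helper, and the request object is constructed but never sent.

```python
import json
from typing import Optional
from urllib.request import Request

# Endpoint as quoted in the comment above.
BASE_URL = "https://data.mozilla.com/submit/baloo"


def build_submission(payload: dict, doc_id: Optional[str] = None) -> Request:
    """Build (but do not send) a submission request.

    With doc_id the client chooses the key; without it, the server is
    expected to generate one, as described in the comment above.
    POST and the JSON content type are assumptions.
    """
    url = f"{BASE_URL}/{doc_id}" if doc_id else BASE_URL
    body = json.dumps(payload).encode("utf-8")
    return Request(
        url,
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


request = build_submission({"source": "sumo"}, doc_id="example-id")
print(request.full_url)
```

Passing the finished request to urllib.request.urlopen would perform the actual submission.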
Comment 7 Ricky Rosario [:rrosario, :r1cky] 2014-03-14 09:54:27 PDT
(In reply to Harsha [:harsha] from comment #6)
> Client can submit the json data along with uuid to the following url
> 
> https://data.mozilla.com/submit/baloo/{uuid} 

Can the uuid be any string I make up, or does it have some constraints?
Comment 8 Pierros Papadeas [:pierros] 2014-03-14 09:56:53 PDT
So if we stick to submitting without it, then we are OK?
What is the UUID generated from? (SHA something?)

Also, is the preference for the source system to create it because of the load of generating it on the Bagheera side?
Comment 9 Harsha [:harsha] 2014-03-14 10:13:40 PDT
Ricky,
  the UUID can be anything; we don't have any constraints per se, but to get a better distribution of data in Hadoop (we use the UUID as the key) we use UUID version 4.
Pierros,
   we use java.util.UUID; see https://github.com/mozilla-metrics/bagheera/blob/master/src/main/java/com/mozilla/bagheera/http/BagheeraHttpRequest.java line 72.
randomUUID() generates a version 4 UUID.
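
For a client that does want to generate the key itself, the Python standard library offers a counterpart to Java's randomUUID(). A minimal sketch follows; new_contribution_key is a hypothetical name, not something Bagheera or Baloo defines.

```python
import uuid


def new_contribution_key() -> str:
    """Return a random (version 4) UUID as a string."""
    return str(uuid.uuid4())


key = new_contribution_key()
# Parsing the string back confirms it is a version 4 UUID.
print(key, uuid.UUID(key).version)
```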
Comment 10 Ricky Rosario [:rrosario, :r1cky] 2014-03-17 06:40:39 PDT
If I was going to generate the string to be used as an ID, I'd use a scheme such as sumo-answer-<answerid>, sumo-kbedit-<revisionid>, etc. That way, I know the corresponding id in baloo because it follows a predictable pattern. I am not sure if that will be useful or not at this time. Does the system allow editing/updating contributions after they have been initially sent to baloo?

These are things we should flesh out now so we can come up with the Best Practices for other services.
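
The predictable naming scheme proposed above can be sketched in a few lines. The function name contribution_id is hypothetical, and only the "answer" and "kbedit" kinds come from the comment; nothing in this bug confirms that Baloo adopted such a scheme.

```python
def contribution_id(kind: str, object_id: int) -> str:
    """Derive a deterministic key such as 'sumo-answer-533604'.

    The source system can recompute the key from data it already stores,
    so it never has to persist an extra random UUID.
    """
    return f"sumo-{kind}-{object_id}"


print(contribution_id("answer", 533604))  # sumo-answer-533604
print(contribution_id("kbedit", 29398))   # sumo-kbedit-29398
```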
Comment 11 Ricky Rosario [:rrosario, :r1cky] 2014-03-17 06:41:27 PDT
BTW, I do know that I don't want to store an extra random UUID on my end.
Comment 12 Pierros Papadeas [:pierros] 2014-03-19 12:16:09 PDT
Based on Comment 6 we can continue without UUID. Ricky, can you go ahead and create the JSON files?
Comment 13 Ricky Rosario [:rrosario, :r1cky] 2014-03-20 12:20:21 PDT
Here are the samples for the SUMO contributions specified in the Wormhole schema...

Answer:
{"extra": {"product": "Firefox", "locale": "en-US", "question": 986343, "topic": "Download, install and migration", "type": "answer", "id": 533604}, "datetime": "2014-02-13T09:22:54", "source": "sumo", "type": "sumo-answer", "email": "user179845@example.com", "canonical": "https://support.mozilla.org/questions/986343#answer-533604"}

Forum post:
{"extra": {"locale": "en-US", "type": "forum-post", "id": 45251, "thread": 708160, "slug": "contributors"}, "datetime": "2012-02-14T06:38:45", "source": "sumo", "type": "sumo-forum-post", "email": "user179845@example.com", "canonical": "https://support.mozilla.org/forums/contributors/708160#post-45251"}

KB Forum (Discuss Article) post:
{"extra": {"locale": "en-US", "type": "kbforum-post", "id": 583, "thread": 435, "slug": "how-do-i-restore-my-tabs-last-time"}, "datetime": "2011-03-17T10:39:30", "source": "sumo", "type": "sumo-kbforum-post", "email": "user345199@example.com", "canonical": "https://support.mozilla.org/en-US/kb/how-do-i-restore-my-tabs-last-time/discuss/435#post-583"}

KB Revision:
{"extra": {"locale": "es", "article": 10945, "type": "kb-revision", "id": 29398, "slug": "Soluci\u00f3n b\u00e1sica de problemas"}, "datetime": "2012-08-07T03:54:10", "source": "sumo", "type": "sumo-kb-revision", "email": "user179845@example.com", "canonical": "https://support.mozilla.org/es/kb/Soluci%C3%B3n%20b%C3%A1sica%20de%20problemas/revision/29398"}


Verify those and tell me what you want next :)
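
One way to verify samples like these is a small structural check. The sketch below assumes the top-level fields used in the samples are all required (guid is left out, following Comment 12) and that datetimes have second precision with no offset; validate and REQUIRED_FIELDS are hypothetical names, not part of Baloo.

```python
import json
from datetime import datetime

# Top-level keys present in every sample above (an assumption about
# what is required; "guid" and "volunteer" are deliberately omitted).
REQUIRED_FIELDS = ("email", "datetime", "canonical", "type", "source", "extra")


def validate(raw: str) -> dict:
    """Parse one payload and raise ValueError if it looks malformed."""
    payload = json.loads(raw)
    missing = [name for name in REQUIRED_FIELDS if name not in payload]
    if missing:
        raise ValueError(f"missing fields: {missing}")
    # strptime raises ValueError when the timestamp has another shape.
    datetime.strptime(payload["datetime"], "%Y-%m-%dT%H:%M:%S")
    if not isinstance(payload["extra"], dict):
        raise ValueError("extra must be a dictionary")
    return payload


sample = (
    '{"extra": {"locale": "en-US", "type": "forum-post", "id": 45251, '
    '"thread": 708160, "slug": "contributors"}, '
    '"datetime": "2012-02-14T06:38:45", "source": "sumo", '
    '"type": "sumo-forum-post", "email": "user179845@example.com", '
    '"canonical": "https://support.mozilla.org/forums/contributors/708160#post-45251"}'
)
print(validate(sample)["type"])
```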
Comment 14 Pierros Papadeas [:pierros] 2014-03-20 12:24:07 PDT
This looks fantastic dude!

@Anurag is that OK with you too? They seem valid to me.

Once we have confirmation, Ricky, you should go ahead and produce the actual ones (for all historical data). I will be filing a new bug for that.
Comment 15 Josh Matthews [:jdm] 2014-03-20 19:43:03 PDT
Could someone clarify the purpose of the type field? I always envisioned being able to query based on the source and knowledge of the extra data - someone looking for stats on all SUMO activity could just look for source='sumo' without having to know anything special about the types of activities, while more specialized queries could be performed looking for specific kinds of data (based on extra.type, for example) that matched known formats that are understood. The 'standard' type field that's been added seems redundant to me.
Comment 16 Ricky Rosario [:rrosario, :r1cky] 2014-03-21 06:01:49 PDT
(In reply to Josh Matthews [:jdm] from comment #15)
> Could someone clarify the purpose of the type field?

That's a great question. I mentioned in Comment 1: '* There are two redundant types in this case. Is that ok?'

I think I agree with you that the top-level type is useless EXCEPT if we can standardize on a set of types that can be reused across sources. For example, type="l10n" could be shared across sumo, mdn, verbatim, etc. Then you could query for localization activity only. Coming up with this "standard" set of types might be difficult though. Naming things is hard :)
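
To make the trade-off concrete, here is a sketch of the three kinds of query discussed in Comments 15 and 16, run over in-memory dictionaries. The records and the shared "l10n" type are purely illustrative (no such standard type was agreed in this bug), and the function names are hypothetical.

```python
from typing import Dict, Iterable, List


def by_source(payloads: Iterable[Dict], source: str) -> List[Dict]:
    """All activity from one source, ignoring the kind of contribution."""
    return [p for p in payloads if p.get("source") == source]


def by_extra_type(payloads: Iterable[Dict], source: str, extra_type: str) -> List[Dict]:
    """A source-specific query that relies on knowing the extra schema."""
    return [p for p in by_source(payloads, source)
            if p.get("extra", {}).get("type") == extra_type]


def by_shared_type(payloads: Iterable[Dict], shared_type: str) -> List[Dict]:
    """A cross-source query on the top-level type field."""
    return [p for p in payloads if p.get("type") == shared_type]


# Illustrative records only; "l10n" as a shared type is hypothetical.
records = [
    {"source": "sumo", "type": "l10n", "extra": {"type": "kb-revision"}},
    {"source": "mdn", "type": "l10n", "extra": {}},
    {"source": "sumo", "type": "sumo-answer", "extra": {"type": "answer"}},
]

print(len(by_source(records, "sumo")))                # 2
print(len(by_extra_type(records, "sumo", "answer")))  # 1
print(len(by_shared_type(records, "l10n")))           # 2
```

The cross-source query only pays off if every source agrees on the same top-level values, which is the naming problem raised above.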
Comment 17 Ricky Rosario [:rrosario, :r1cky] 2014-03-21 07:20:58 PDT
Oh, I forgot the JSON for registering on SUMO as a contributor:

Register as SUMO contributor:
{"extra": {"type": "register", "id": 179845}, "datetime": "2012-08-07T03:54:10", "source": "sumo", "type": "sumo-register", "email": "user179845@example.com", "canonical": "https://support.mozilla.org/en-US/user/179845"}

Unlike the other sumo contributions, I think this one shouldn't pass a locale.
Comment 18 Anurag Phadke[:aphadke@mozilla.com] 2014-03-21 11:23:02 PDT
:rrosario - any valid JSON works with Baloo as long as it has all the elements needed to answer the analytic questions that you want to be answered.
Comment 19 Pierros Papadeas [:pierros] 2014-03-24 06:00:23 PDT
Reply to Comment 16:

There is no easy way to do this. I still believe, though, that having the type in the core part of the payload will make our life easier, since we won't have to dig into the extra to find out what a contribution is. Let's move ahead with it even though it might seem like duplication.

I would add just one thing: the version of the spec used. Let's add a new field for that. (This would be version 0.1.)

Ricky, if you are OK with it, please produce the historical data for SUMO in bulk and provide a link so we can test-import and use it.

Thanks!
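
A sketch of what stamping the spec version onto a payload might look like. The field name "version" is an assumption, since the comment above does not name the field; only the value 0.1 comes from this thread.

```python
SPEC_VERSION = "0.1"  # value given in the comment above


def with_version(payload: dict) -> dict:
    """Return a copy of the payload with the spec version added.

    The key name "version" is hypothetical; the thread only says that
    a new field should carry the version of the spec.
    """
    return {**payload, "version": SPEC_VERSION}


original = {"source": "sumo", "type": "sumo-register"}
stamped = with_version(original)
print(stamped["version"])  # 0.1
```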
Comment 20 Anurag Phadke[:aphadke@mozilla.com] 2014-03-25 11:16:57 PDT
:pierros, :rrosario - The baloo_kbforum.json file contains a ton of JSON fields. Which of these fields do you want to store in columns? Or do you want to store each JSON field as a separate column?
Comment 21 Anurag Phadke[:aphadke@mozilla.com] 2014-03-25 12:08:10 PDT
:pierros - as per our latest IRC conversation, I am planning to load this data into HDFS, with each entry on its own line. We can then run aggregations using MR jobs. 
I'll have the data imported today.

-anurag
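
The one-entry-per-line layout described above is often called JSON Lines. A minimal sketch of producing it in Python follows; it assumes the payloads are already available as a list of dictionaries, since the actual structure of baloo_kbforum.json is not shown in this bug.

```python
import json
from typing import Dict, Iterable


def to_json_lines(payloads: Iterable[Dict]) -> str:
    """Serialize each payload as compact JSON on its own line."""
    # json.dumps never emits raw newlines, so one payload maps to one line.
    return "".join(json.dumps(p, sort_keys=True) + "\n" for p in payloads)


entries = [
    {"source": "sumo", "type": "sumo-answer", "email": "user179845@example.com"},
    {"source": "sumo", "type": "sumo-register", "email": "user345199@example.com"},
]
text = to_json_lines(entries)
print(text, end="")
```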
Comment 22 Anurag Phadke[:aphadke@mozilla.com] 2014-03-25 14:11:33 PDT
:pierros - I wrote a sample script to parse the baloo_kbforum.json file. The payload for each individual contribution is now written on its own line.

This allows us to generate various aggregation metrics:
# of unique emails
# of submissions within last 6 months 

etc.
pierros, ricky - can we sync sometime tomorrow over Vidyo to discuss what aggregation metrics need to be pushed to Vertica?
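
The two metrics listed above can be computed from line-delimited payloads in a single pass. This is a plain Python sketch rather than the MR job mentioned earlier; aggregate is a hypothetical name, and "6 months" is approximated as 183 days.

```python
import json
from datetime import datetime, timedelta
from typing import Dict, Iterable

DATETIME_FORMAT = "%Y-%m-%dT%H:%M:%S"  # shape used by the samples in this bug


def aggregate(lines: Iterable[str], now: datetime, window_days: int = 183) -> Dict[str, int]:
    """Count unique emails and submissions newer than the cutoff."""
    cutoff = now - timedelta(days=window_days)
    emails = set()
    recent = 0
    for line in lines:
        line = line.strip()
        if not line:
            continue  # tolerate blank lines between entries
        payload = json.loads(line)
        emails.add(payload["email"])
        if datetime.strptime(payload["datetime"], DATETIME_FORMAT) >= cutoff:
            recent += 1
    return {"unique_emails": len(emails), "recent_submissions": recent}


data = [
    '{"email": "user179845@example.com", "datetime": "2014-02-13T09:22:54"}',
    '{"email": "user179845@example.com", "datetime": "2012-02-14T06:38:45"}',
    '{"email": "user345199@example.com", "datetime": "2011-03-17T10:39:30"}',
]
print(aggregate(data, now=datetime(2014, 3, 25)))
```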
Comment 23 Ricky Rosario [:rrosario, :r1cky] 2014-04-01 06:46:11 PDT
I think I provided the info requested in the other bug.
Comment 24 Pierros Papadeas [:pierros] 2014-05-13 03:25:33 PDT
This is now resolved.
