Closed
Bug 980385
Opened 11 years ago
Closed 11 years ago
[Baloo] Provide payload samples for SUMO contribution activity
Categories
(Mozilla Metrics :: Data/Backend Reports, defect)
Tracking
(Not tracked)
RESOLVED
FIXED
Unreviewed
People
(Reporter: pierros, Assigned: rrosario)
Details
Based on the already identified contribution activity types (https://wiki.mozilla.org/Wormhole/Schema) for SUMO, plus a new one (creation of a profile), we need to create JSON payload samples to be reviewed by the BIDW team.
Thanks
Reporter
Updated•11 years ago
Group: metrics-private
Assignee
Comment 1•11 years ago
Let's start with one example, using the schema described in https://wiki.mozilla.org/Wormhole/Schema
For a SUMO answer in the Support Forum, the schema is as follows:
guid: Globally Unique Identifier of the contribution (string)
email: email address of contributor (string)
datetime: date and time of contribution (datetime)
canonical: permanent URL to contribution (string)
volunteer: boolean value indicating paid employee status (boolean) OPTIONAL
type: a codified string that describes the contribution (string)
source: source of contribution, one of the following: (string)
    bugzilla
    hg
    sumo
extra: per-contribution data, following a source-dependent schema (dictionary)
    type: type of SUMO contribution, one of: (string)
    locale: locale of contribution (string)
    question: id of answered question (integer)
    id: id of answer (integer)
    product: associated product slug (string)
    topic: associated topic slug (string)
So the JSON for this answer here:
https://support.mozilla.org/en-US/questions/986343#answer-533604
looks like:
{
    "guid": "<SEE QUESTIONS BELOW>",
    "email": "rrosario@mozilla.com",
    "datetime": "<SEE QUESTIONS BELOW>",
    "canonical": "https://support.mozilla.org/en-US/questions/986343#answer-533604",
    /* "volunteer": "<SKIPPING THIS BECAUSE OPTIONAL>", */
    "type": "sumo-answer",
    "source": "sumo",
    "extra": {
        "type": "answer",
        "locale": "en-US",
        "question": 986343,
        "id": 533604,
        "product": "firefox",
        "topic": "download-and-install"
    }
}
Questions:
* Do I make up any string for the GUID? Or is that generated in Baloo?
* What format should datetimes be provided in?
* There are two redundant types in this case. Is that ok?
* The docs should be updated based on these questions and simple examples should be provided.
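The schema and sample above can be checked mechanically. The following is a minimal Python sketch, assuming the top-level field names listed in this comment. `REQUIRED_FIELDS` and `missing_fields` are hypothetical names for illustration, not part of Baloo, and `guid` is treated as optional only because the questions above leave it open.

```python
import json

# Top-level fields from the schema above. "volunteer" is optional, and "guid"
# is left out because who generates it is still an open question (assumption).
REQUIRED_FIELDS = {"email", "datetime", "canonical", "type", "source", "extra"}


def missing_fields(payload):
    """Return the required top-level fields absent from a payload, sorted."""
    return sorted(REQUIRED_FIELDS - set(payload))


# The answer sample from this comment, minus the two placeholder fields.
sample = json.loads("""
{
  "email": "rrosario@mozilla.com",
  "canonical": "https://support.mozilla.org/en-US/questions/986343#answer-533604",
  "type": "sumo-answer",
  "source": "sumo",
  "extra": {"type": "answer", "locale": "en-US", "question": 986343, "id": 533604}
}
""")

# Lists whatever still has to be filled in before the payload is complete.
print(missing_fields(sample))
```

A check like this only covers the presence of fields; it says nothing about whether the values follow the formats asked about in the questions above.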
Assignee
Updated•11 years ago
OS: Linux → All
Hardware: x86_64 → All
Reporter
Comment 2•11 years ago
Some quick answers:
* GUID: we can generate this randomly (like git commits basically) as long as it is really unique.
* Time and date: There is only http://en.wikipedia.org/wiki/ISO_8601 :)
* Two types: I don't know. Would we ever have a case where those differ?
* Is the last one a question?
Thanks!
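For the date and time answer, here is a small Python sketch of producing ISO 8601 strings with the standard library. The timestamp is the one used in the answer sample in comment 13; treating it as UTC is an assumption, since the thread does not state a timezone.

```python
from datetime import datetime, timezone

# Timestamp of the sample answer; the UTC timezone is an assumption.
created = datetime(2014, 2, 13, 9, 22, 54, tzinfo=timezone.utc)

# ISO 8601 without an offset (the reader has to know the timezone).
naive_stamp = created.replace(tzinfo=None).isoformat()

# ISO 8601 with an explicit UTC offset, which removes that ambiguity.
aware_stamp = created.isoformat()

print(naive_stamp)  # 2014-02-13T09:22:54
print(aware_stamp)  # 2014-02-13T09:22:54+00:00
```

Both forms are valid ISO 8601; which one Baloo expects is not settled in this bug.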
Assignee
Comment 3•11 years ago
(In reply to Pierros Papadeas [:pierros] from comment #2)
> * GUID: we can generate this randomly (like git commits basically) as long
> as it is really unique.
OK, I guess the question is: what is it for? Will I ever need to use it from my end?
Reporter
Comment 4•11 years ago
(In reply to Ricky Rosario [:rrosario, :r1cky] from comment #3)
> (In reply to Pierros Papadeas [:pierros] from comment #2)
> > * GUID: we can generate this randomly (like git commits basically) as long
> > as it is really unique.
>
> OK, I guess the question is: what is it for? Will I ever need to use it from
> my end?
If we can generate this on Baloo side then we don't need it on the systems side.
@Anurag would that be possible on submission?
Flags: needinfo?(aphadke)
Comment 6•11 years ago
If the client does not provide a UUID, Bagheera generates one, but we prefer to have UUID generation on the client side.
The client can submit the JSON data along with a UUID to the following URL:
https://data.mozilla.com/submit/baloo/{uuid}
If the client sends the data to
https://data.mozilla.com/submit/baloo then Bagheera generates a new UUID for that request.
We use the UUID as the key and the JSON data as the value when storing data in Hadoop (HBase, HDFS, etc.).
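The two submission URLs described above can be sketched in Python as follows. Only the URLs come from this comment; the `build_request` helper is a hypothetical name, and the POST method and JSON content type are assumptions. The request is built but deliberately not sent.

```python
import json
import urllib.request
import uuid

BASE_URL = "https://data.mozilla.com/submit/baloo"


def build_request(payload, doc_id=None):
    """Build (but do not send) a submission request for one payload."""
    # With an id the client chooses the key; without one the server does.
    url = f"{BASE_URL}/{doc_id}" if doc_id else BASE_URL
    body = json.dumps(payload).encode("utf-8")
    # POST and the Content-Type header are assumptions, not confirmed here.
    return urllib.request.Request(
        url,
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


payload = {"source": "sumo", "type": "sumo-answer"}
with_id = build_request(payload, doc_id=str(uuid.uuid4()))
without_id = build_request(payload)

print(with_id.full_url)
print(without_id.full_url)
```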
Flags: needinfo?(schintalapani)
Assignee
Comment 7•11 years ago
(In reply to Harsha [:harsha] from comment #6)
> Client can submit the json data along with uuid to the following url
>
> https://data.mozilla.com/submit/baloo/{uuid}
Can the uuid be any string I make up, or does it have some constraints?
Reporter
Comment 8•11 years ago
So if we stick to submitting without it then we are OK?
What is the UUID generated from? (SHA something?)
Also, is the preference for the client system to create it because of the load that generating it would put on the Bagheera side?
Comment 9•11 years ago
ricky,
The UUID can be anything; we don't have any constraints per se, but to get a better distribution of data in Hadoop (we use the UUID as the key) we use UUID version 4.
pierros,
We use java.util.UUID; see line 72 of
https://github.com/mozilla-metrics/bagheera/blob/master/src/main/java/com/mozilla/bagheera/http/BagheeraHttpRequest.java
randomUUID() generates a version 4 UUID.
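For clients not written in Java, the same kind of identifier is available elsewhere. As an illustration, Python's standard `uuid.uuid4()` also produces a random version 4 UUID. The `is_uuid4` helper below is a hypothetical name, shown only to make the difference between a version 4 UUID and an arbitrary string concrete.

```python
import uuid


def is_uuid4(value):
    """Return True if value parses as a UUID whose version field is 4."""
    try:
        parsed = uuid.UUID(value)
    except ValueError:
        # Not a 32-hex-digit UUID string at all.
        return False
    return parsed.version == 4


generated = str(uuid.uuid4())
print(is_uuid4(generated))             # True
print(is_uuid4("sumo-answer-533604"))  # False
```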
Reporter
Updated•11 years ago
Flags: needinfo?(rrosario)
Assignee
Comment 10•11 years ago
If I was going to generate the string to be used as an ID, I'd use a scheme such as sumo-answer-<answerid>, sumo-kbedit-<revisionid>, etc. That way, I know the corresponding id in baloo because it follows a predictable pattern. I am not sure if that will be useful or not at this time. Does the system allow editing/updating contributions after they have been initially sent to baloo?
These are things we should flesh out now so we can come up with the Best Practices for other services.
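The predictable-id scheme proposed above could look like this in Python. `contribution_id` is a hypothetical helper name; the `sumo-answer-<answerid>` and `sumo-kbedit-<revisionid>` patterns are the ones from this comment.

```python
def contribution_id(kind, object_id):
    """Build a predictable identifier such as 'sumo-answer-533604'."""
    return f"sumo-{kind}-{object_id}"


print(contribution_id("answer", 533604))  # sumo-answer-533604
print(contribution_id("kbedit", 29398))   # sumo-kbedit-29398
```

The appeal of this design is that the sending system can recompute the id from data it already stores, so nothing extra has to be persisted on the SUMO side.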
Flags: needinfo?(rrosario)
Assignee
Comment 11•11 years ago
BTW, I do know that I don't want to store an extra random UUID on my end.
Reporter
Comment 12•11 years ago
Based on Comment 6 we can continue without UUID. Ricky can you go ahead and create the json files?
Flags: needinfo?(rrosario)
Assignee
Comment 13•11 years ago
Here are the samples for the SUMO contributions specified in the Wormhole schema...
Answer:
{"extra": {"product": "Firefox", "locale": "en-US", "question": 986343, "topic": "Download, install and migration", "type": "answer", "id": 533604}, "datetime": "2014-02-13T09:22:54", "source": "sumo", "type": "sumo-answer", "email": "user179845@example.com", "canonical": "https://support.mozilla.org/questions/986343#answer-533604"}
Forum post:
{"extra": {"locale": "en-US", "type": "forum-post", "id": 45251, "thread": 708160, "slug": "contributors"}, "datetime": "2012-02-14T06:38:45", "source": "sumo", "type": "sumo-forum-post", "email": "user179845@example.com", "canonical": "https://support.mozilla.org/forums/contributors/708160#post-45251"}
KB Forum (Discuss Article) post:
{"extra": {"locale": "en-US", "type": "kbforum-post", "id": 583, "thread": 435, "slug": "how-do-i-restore-my-tabs-last-time"}, "datetime": "2011-03-17T10:39:30", "source": "sumo", "type": "sumo-kbforum-post", "email": "user345199@example.com", "canonical": "https://support.mozilla.org/en-US/kb/how-do-i-restore-my-tabs-last-time/discuss/435#post-583"}
KB Revision:
{"extra": {"locale": "es", "article": 10945, "type": "kb-revision", "id": 29398, "slug": "Soluci\u00f3n b\u00e1sica de problemas"}, "datetime": "2012-08-07T03:54:10", "source": "sumo", "type": "sumo-kb-revision", "email": "user179845@example.com", "canonical": "https://support.mozilla.org/es/kb/Soluci%C3%B3n%20b%C3%A1sica%20de%20problemas/revision/29398"}
Verify those and tell me what you want next :)
Flags: needinfo?(rrosario)
Reporter
Comment 14•11 years ago
This looks fantastic dude!
@Anurag is that OK with you too? They seem valid for me here.
Once we have confirmation, Ricky, you should go ahead and produce the actual ones (for all historical data). I will be filing a new bug for that.
Comment 15•11 years ago
Could someone clarify the purpose of the type field? I always envisioned being able to query based on the source and knowledge of the extra data - someone looking for stats on all SUMO activity could just look for source='sumo' without having to know anything special about the types of activities, while more specialized queries could be performed looking for specific kinds of data (based on extra.type, for example) that matched known formats that are understood. The 'standard' type field that's been added seems redundant to me.
Assignee
Comment 16•11 years ago
(In reply to Josh Matthews [:jdm] from comment #15)
> Could someone clarify the purpose of the type field?
That's a great question. I mentioned in Comment 1: '* There are two redundant types in this case. Is that ok?'
I think I agree with you that the top-level type is useless EXCEPT if we can standardize on a set of types that can be reused across sources. For example, type="l10n" could be shared across sumo, mdn, verbatim, etc. Then you could query for localization activity only. Coming up with this "standard" set of types might be difficult though. Naming things is hard :)
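The two query styles discussed in comments 15 and 16 can be sketched in Python over a few in-memory records. The SUMO records reuse type values from the samples in this bug; the bugzilla record and both helper names are hypothetical, added only to show filtering across sources.

```python
contributions = [
    {"source": "sumo", "type": "sumo-answer", "extra": {"type": "answer"}},
    {"source": "sumo", "type": "sumo-kb-revision", "extra": {"type": "kb-revision"}},
    # Hypothetical record from another source, for contrast.
    {"source": "bugzilla", "type": "bugzilla-comment", "extra": {"type": "comment"}},
]


def by_source(items, source):
    """All activity from one source, with no knowledge of its extra schema."""
    return [c for c in items if c["source"] == source]


def by_extra_type(items, source, extra_type):
    """Specialized query that relies on the source-dependent extra.type."""
    return [c for c in by_source(items, source)
            if c["extra"].get("type") == extra_type]


print(len(by_source(contributions, "sumo")))                # 2
print(len(by_extra_type(contributions, "sumo", "answer")))  # 1
```

In this sketch the top-level `type` is never consulted, which is the redundancy being questioned; it would only earn its place if its values were shared across sources.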
Assignee
Comment 17•11 years ago
Oh, I forgot the JSON for registering on SUMO as a contributor:
Register as SUMO contributor:
{"extra": {"type": "register", "id": 179845}, "datetime": "2012-08-07T03:54:10", "source": "sumo", "type": "sumo-register", "email": "user179845@example.com", "canonical": "https://support.mozilla.org/en-US/user/179845"}
Unlike the other sumo contributions, I think this one shouldn't pass a locale.
Comment 18•11 years ago
:rrosario - any valid JSON works with Baloo as long as it has all the elements needed to answer the analytic questions that you want to be answered.
Reporter
Comment 19•11 years ago
Reply on Comment 16:
There is no easy way to do this. I still believe, though, that having the type in the core part of the payload will make our lives easier, since we won't have to dig into the extra data to find out what a contribution is. Let's move ahead with it even though it might seem like duplication.
I would add just one thing: the version of the spec used. Let's add a new field for that (this would be version 0.1).
Ricky, if you are OK with it, please bulk-produce the historical data for SUMO and provide a link so we can test-import and use it.
Thanks!
Flags: needinfo?(rrosario)
Comment 20•11 years ago
:pierros, :rrosario - The baloo_kbforum.json contains a ton of JSON fields. Which of these fields do you want to store in columns? Or do you want to store each column in JSON as separate field?
Flags: needinfo?(pierros)
Comment 21•11 years ago
:pierros - as per our latest IRC conversation, I am planning to load this data into HDFS, with each entry on its own line. We can then run aggregations using MR jobs.
I'll have the data imported today.
-anurag
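Storing each entry on its own line amounts to converting the file to newline-delimited JSON. Here is a minimal Python sketch, assuming the input is a single JSON array of payload objects; the actual layout of baloo_kbforum.json is not shown in this bug, and `to_json_lines` is a hypothetical helper name.

```python
import json


def to_json_lines(array_text):
    """Convert the text of a JSON array into one compact JSON object per line."""
    entries = json.loads(array_text)
    return "\n".join(json.dumps(entry, sort_keys=True) for entry in entries)


# Two tiny stand-in entries; real payloads carry the full set of fields.
raw = (
    '[{"source": "sumo", "type": "sumo-kbforum-post"},'
    ' {"source": "sumo", "type": "sumo-answer"}]'
)
lines = to_json_lines(raw)
print(lines)
```

Because `json.dumps` escapes newlines inside strings, each output line is always exactly one complete payload.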
Comment 22•11 years ago
:pierros - I wrote a sample script to parse the baloo_kbforum.json file. The payload for each individual entry is now written on its own line.
This allows us to generate various aggregation metrics:
# of unique emails
# of submissions within last 6 months
etc.
pierros, ricky - can we sync sometime tomorrow over vidyo to discuss what aggregation metrics need to be pushed to vertica?
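The two aggregation metrics listed above can be computed from newline-delimited payloads along these lines. This is a single-process Python sketch standing in for the MR jobs mentioned earlier, not the actual script; `aggregate` is a hypothetical name, and "6 months" is approximated as 183 days. The records reuse emails and timestamps from the samples in comment 13.

```python
import json
from datetime import datetime, timedelta


def aggregate(lines, now, window_days=183):
    """Count unique emails and submissions newer than roughly six months."""
    cutoff = now - timedelta(days=window_days)
    emails = set()
    recent = 0
    for line in lines:
        if not line.strip():
            continue  # tolerate blank lines in the input
        entry = json.loads(line)
        emails.add(entry["email"])
        # Timestamps are ISO 8601 without an offset, as in the samples.
        if datetime.fromisoformat(entry["datetime"]) >= cutoff:
            recent += 1
    return {"unique_emails": len(emails), "recent_submissions": recent}


records = [
    '{"email": "user179845@example.com", "datetime": "2014-02-13T09:22:54"}',
    '{"email": "user179845@example.com", "datetime": "2012-02-14T06:38:45"}',
    '{"email": "user345199@example.com", "datetime": "2011-03-17T10:39:30"}',
]
result = aggregate(records, now=datetime(2014, 3, 1))
print(result)  # {'unique_emails': 2, 'recent_submissions': 1}
```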
Assignee
Comment 23•11 years ago
I think I provided the info requested in the other bug.
Flags: needinfo?(rrosario)
Reporter
Comment 24•11 years ago
This is now resolved.
Status: NEW → RESOLVED
Closed: 11 years ago
Flags: needinfo?(pierros)
Resolution: --- → FIXED