Closed Bug 980385 Opened 9 years ago Closed 9 years ago

[Baloo] Provide payload samples for SUMO contribution activity


(Mozilla Metrics :: Data/Backend Reports, defect)

(Reporter: pierros, Assigned: rrosario)


Based on the already identified contribution activity types for SUMO, plus a new one (creation of a profile), we need to create payload (JSON) samples to be reviewed by the BIDW team.

Group: metrics-private
Let's start with one example, using the schema described in

For a SUMO answer in the Support Forum, the schema is as follows:

guid: Globally Unique Identifier of the contribution (string)
email: email address of contributor (string)
datetime: date and time of contribution (datetime)
canonical: permanent URL to contribution (string)
volunteer: boolean value indicating paid employee status (boolean) OPTIONAL
type: a codified string that describes the contribution (string)
source: source of contribution, one of the following: (string)
extra: per-contribution data, following a source-dependent schema (dictionary)
    type: type of SUMO contribution, one of: (string)
    locale: locale of contribution (string)
    question: id of answered question (integer)
    id: id of answer (integer)
    product: associated product slug (string)
    topic: associated topic slug (string)
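
The schema above could be assembled and checked in Python roughly as follows. This is only a sketch: `build_answer_payload`, `missing_fields` and `REQUIRED_FIELDS` are hypothetical names, the email and URL are placeholders, and the datetime string format is an assumption. `guid` is left out of the check because it is not yet settled who generates it.

```python
import json

# Top-level fields the schema lists without OPTIONAL, minus "guid"
# (left out here because it is unclear who generates it).
REQUIRED_FIELDS = {"email", "datetime", "canonical", "type", "source", "extra"}


def build_answer_payload(email, when, canonical, question_id, answer_id,
                         locale, product, topic):
    """Assemble a payload dict for a Support Forum answer (hypothetical helper)."""
    return {
        "email": email,
        "datetime": when,  # string format is an assumption
        "canonical": canonical,
        "type": "sumo-answer",
        "source": "sumo",
        "extra": {
            "type": "answer",
            "locale": locale,
            "question": question_id,
            "id": answer_id,
            "product": product,
            "topic": topic,
        },
    }


def missing_fields(payload):
    """Return the required top-level fields absent from the payload, sorted."""
    return sorted(REQUIRED_FIELDS - set(payload))


payload = build_answer_payload(
    email="contributor@example.com",         # placeholder address
    when="2014-02-13T09:22:54",
    canonical="https://example.com/answer",  # placeholder URL
    question_id=986343,
    answer_id=533604,
    locale="en-US",
    product="firefox",
    topic="download-and-install",
)
print(json.dumps(payload, indent=4))
print(missing_fields(payload))
```

Keeping the required-field check separate from the builder means the same check can be reused for the other contribution types.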

So, the JSON for this answer here:

looks like:
{
    "guid": "<SEE QUESTIONS BELOW>",
    "email": "",
    "datetime": "<SEE QUESTIONS BELOW>",
    "canonical": "",
    /* "volunteer": "<SKIPPING THIS BECAUSE OPTIONAL>", */
    "type": "sumo-answer",
    "source": "sumo",
    "extra": {
        "type": "answer",
        "locale": "en-US",
        "question": 986343,
        "id": 533604,
        "product": "firefox",
        "topic": "download-and-install"
    }
}

* Do I make up any string for the GUID? Or is that generated in Baloo?
* What format should datetimes be provided in?
* There are two redundant types in this case. Is that ok?
* The docs should be updated based on these questions and simple examples should be provided.
OS: Linux → All
Hardware: x86_64 → All
Some quick answers:

* GUID: we can generate this randomly (like git commits basically) as long as it is really unique.
* Time and date: There is only :)
* Two types: I don't know. Would we ever have a case where those differ?
* Is the last one a question? 

(In reply to Pierros Papadeas [:pierros] from comment #2)
> * GUID: we can generate this randomly (like git commits basically) as long
> as it is really unique.

OK, I guess the question is: what is it for? Will I ever need to use it from my end?
(In reply to Ricky Rosario [:rrosario, :r1cky] from comment #3)
> (In reply to Pierros Papadeas [:pierros] from comment #2)
> > * GUID: we can generate this randomly (like git commits basically) as long
> > as it is really unique.
> OK, I guess the question is: what is it for? Will I ever need to use it from
> my end?

If we can generate this on the Baloo side then we don't need it on the systems side.

@Anurag would that be possible on submission?
Flags: needinfo?(aphadke)
Flags: needinfo?(aphadke) → needinfo?(schintalapani)
If the client does not provide a UUID, Bagheera generates one, but we prefer to have UUID generation on the client side.
The client can submit the JSON data along with the UUID to the following URL {uuid}
If the client sends the data to the following, then Bagheera generates a new UUID for that request.
We use the UUID as the key and the JSON data as the value to store data in Hadoop (HBase, HDFS, etc.).
Flags: needinfo?(schintalapani)
(In reply to Harsha [:harsha] from comment #6)
> Client can submit the json data along with uuid to the following url

Can the uuid be any string I make up or does it have some constraints?
So if we stick to submitting without it then we are OK?
What is the UUID generated from? (SHA something?)

Also, is the preference for the client system to create it because of the load of generating it on the Bagheera side?
The UUID can be anything; we don't have any constraints per se, but to get a better distribution of data in Hadoop (we use the UUID as the key) we use UUID version 4.
We use java.util.UUID
line 72.
randomUUID() generates a version 4 UUID.
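
For a client written in Python, the standard library has a counterpart to `java.util.UUID.randomUUID()`. A small sketch, assuming the client generates the key itself; how the key is attached to the submission URL is not shown because it depends on the Bagheera endpoint.

```python
import uuid

# uuid.uuid4() builds a random (version 4) UUID,
# the same kind that java.util.UUID.randomUUID() returns.
key = uuid.uuid4()

print(key)          # a different random value on every run
print(key.version)  # 4
```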
Flags: needinfo?(rrosario)
If I was going to generate the string to be used as an ID, I'd use a scheme such as sumo-answer-<answerid>, sumo-kbedit-<revisionid>, etc. That way, I know the corresponding id in baloo because it follows a predictable pattern. I am not sure if that will be useful or not at this time. Does the system allow editing/updating contributions after they have been initially sent to baloo?

These are things we should flesh out now so we can come up with the Best Practices for other services.
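
The predictable naming scheme described above could look like this in Python. The helper name `contribution_id` is hypothetical, and the second example value is only illustrative.

```python
def contribution_id(kind, object_id):
    """Build a predictable ID such as "sumo-answer-533604" (hypothetical helper)."""
    return "sumo-{}-{}".format(kind, object_id)


print(contribution_id("answer", 533604))  # sumo-answer-533604
print(contribution_id("kbedit", 29398))   # sumo-kbedit-29398
```

Because the ID is derived from data SUMO already stores, nothing extra has to be saved on the SUMO side to find the matching record later.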
Flags: needinfo?(rrosario)
BTW, I do know that I don't want to store an extra random UUID on my end.
Based on Comment 6 we can continue without UUID. Ricky, can you go ahead and create the JSON files?
Flags: needinfo?(rrosario)
Here are the samples for the SUMO contributions specified in the Wormhole schema...

Support Forum answer:
{"extra": {"product": "Firefox", "locale": "en-US", "question": 986343, "topic": "Download, install and migration", "type": "answer", "id": 533604}, "datetime": "2014-02-13T09:22:54", "source": "sumo", "type": "sumo-answer", "email": "", "canonical": ""}

Forum post:
{"extra": {"locale": "en-US", "type": "forum-post", "id": 45251, "thread": 708160, "slug": "contributors"}, "datetime": "2012-02-14T06:38:45", "source": "sumo", "type": "sumo-forum-post", "email": "", "canonical": ""}

KB Forum (Discuss Article) post:
{"extra": {"locale": "en-US", "type": "kbforum-post", "id": 583, "thread": 435, "slug": "how-do-i-restore-my-tabs-last-time"}, "datetime": "2011-03-17T10:39:30", "source": "sumo", "type": "sumo-kbforum-post", "email": "", "canonical": ""}

KB Revision:
{"extra": {"locale": "es", "article": 10945, "type": "kb-revision", "id": 29398, "slug": "Soluci\u00f3n b\u00e1sica de problemas"}, "datetime": "2012-08-07T03:54:10", "source": "sumo", "type": "sumo-kb-revision", "email": "", "canonical": ""}

Verify those and tell me what you want next :)
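
One way to verify samples like these is a quick Python sanity check. The sketch below parses the forum post sample and confirms that its datetime uses the `YYYY-MM-DDTHH:MM:SS` form seen in all the samples. The check that the top-level type equals the source plus the extra type is just an observation about these samples, not a documented rule.

```python
import json
from datetime import datetime

sample = (
    '{"extra": {"locale": "en-US", "type": "forum-post", "id": 45251, '
    '"thread": 708160, "slug": "contributors"}, '
    '"datetime": "2012-02-14T06:38:45", "source": "sumo", '
    '"type": "sumo-forum-post", "email": "", "canonical": ""}'
)

record = json.loads(sample)

# fromisoformat() accepts the "YYYY-MM-DDTHH:MM:SS" form used in the samples
# and raises ValueError for strings it cannot parse.
when = datetime.fromisoformat(record["datetime"])

# In the samples, the top-level type is "<source>-<extra type>".
expected_type = "{}-{}".format(record["source"], record["extra"]["type"])

print(when.isoformat())
print(record["type"] == expected_type)
```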
Flags: needinfo?(rrosario)
This looks fantastic dude!

@Anurag is that OK with you too? They seem valid for me here.

Once we have confirmation, Ricky you should go ahead and produce the actual ones (for all historical data). I will be filing a new bug for that.
Could someone clarify the purpose of the type field? I always envisioned being able to query based on the source and knowledge of the extra data - someone looking for stats on all SUMO activity could just look for source='sumo' without having to know anything special about the types of activities, while more specialized queries could be performed looking for specific kinds of data (based on extra.type, for example) that matched known formats that are understood. The 'standard' type field that's been added seems redundant to me.
(In reply to Josh Matthews [:jdm] from comment #15)
> Could someone clarify the purpose of the type field?

That's a great question. I mentioned in Comment 2: '* There are two redundant types in this case. Is that ok?'

I think I agree with you that the top-level type is useless EXCEPT if we can standardize on a set of types that can be reused across sources. For example, type="l10n" could be shared across sumo, mdn, verbatim, etc. Then you could query for localization activity only. Coming up with this "standard" set of types might be difficult though. Naming things is hard :)
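
To make the trade-off concrete, here is a rough Python sketch of the three kinds of query being discussed. The records are made up, and values such as "support" and "translation" are invented for illustration. Only the field layout follows the samples in this bug.

```python
# Made-up records; only the field layout follows the samples in this bug.
records = [
    {"source": "sumo", "type": "l10n", "extra": {"type": "kb-revision", "locale": "es"}},
    {"source": "mdn", "type": "l10n", "extra": {"type": "translation", "locale": "fr"}},
    {"source": "sumo", "type": "support", "extra": {"type": "answer", "locale": "en-US"}},
]


def by_source(records, source):
    """All activity from one source, with no knowledge of its extra data."""
    return [r for r in records if r["source"] == source]


def by_type(records, type_):
    """Activity sharing a top-level type, across every source."""
    return [r for r in records if r["type"] == type_]


def by_extra_type(records, source, extra_type):
    """Source-specific query that relies on knowing the extra schema."""
    return [r for r in by_source(records, source)
            if r["extra"].get("type") == extra_type]


print(len(by_source(records, "sumo")))                # 2
print(len(by_type(records, "l10n")))                  # 2
print(len(by_extra_type(records, "sumo", "answer")))  # 1
```

A shared top-level type only pays off in the second query. The first and third work with source and extra alone.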
Oh, I forgot the JSON for registering on SUMO as a contributor:

Register as SUMO contributor:
{"extra": {"type": "register", "id": 179845}, "datetime": "2012-08-07T03:54:10", "source": "sumo", "type": "sumo-register", "email": "", "canonical": ""}

Unlike the other sumo contributions, I think this one shouldn't pass a locale.
:rrosario - any valid JSON works with Baloo as long as it has all the elements needed to answer the analytic questions that you want to be answered.
Reply on Comment 16:

There is no easy way to do this. I still believe, though, that having the type in the core part of the payload will make our lives easier, without having to dig into the extra to find out what this is. Let's move ahead with it even though it might seem like duplication.

I would just add one thing: the version of the spec used. Let's add a new field for that (this would be version 0.1).
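
A possible shape for that, sketched in Python. The field name `version` and the helper `with_version` are assumptions, since the name of the new field has not been decided here.

```python
SPEC_VERSION = "0.1"


def with_version(payload, version=SPEC_VERSION):
    """Return a copy of the payload with a spec version field added."""
    versioned = dict(payload)       # shallow copy; the input dict is left as-is
    versioned["version"] = version  # field name "version" is an assumption
    return versioned


register = {"extra": {"type": "register", "id": 179845}, "source": "sumo",
            "type": "sumo-register"}
print(with_version(register)["version"])  # 0.1
```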

Ricky, if you are OK with it, please bulk-produce the historical data for SUMO and provide a link so we can test-import and use it.

Flags: needinfo?(rrosario)
:pierros, :rrosario - The baloo_kbforum.json contains a ton of JSON fields. Which of these fields do you want to store in columns? Or do you want to store every field in the JSON as a separate column?
Flags: needinfo?(pierros)
:pierros - as per our latest IRC conversation, I am planning to load this data into HDFS, with each entry on a new line. We can then run aggregations using MR jobs.
I'll have the data imported today.

:pierros - I wrote a sample script to parse the baloo_kbforum.json file. The payload for each individual contribution is now written on its own line.

This allows us to generate various aggregation metrics:
# of unique emails
# of submissions within last 6 months 
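
As a rough illustration of these two metrics, here is a single-process Python sketch over newline-delimited JSON. It stands in for the MR jobs mentioned above and is not the script referred to in this comment. The function name `aggregate`, the 183-day approximation of six months, the case-insensitive email matching and the sample lines are all assumptions.

```python
import json
from datetime import datetime, timedelta


def aggregate(lines, now, window_days=183):
    """Count unique emails and submissions newer than the cutoff."""
    cutoff = now - timedelta(days=window_days)
    emails = set()
    recent = 0
    for line in lines:
        line = line.strip()
        if not line:
            continue  # skip blank lines between entries
        record = json.loads(line)
        if record.get("email"):
            emails.add(record["email"].lower())  # treat case variants as one
        if datetime.fromisoformat(record["datetime"]) >= cutoff:
            recent += 1
    return {"unique_emails": len(emails), "recent_submissions": recent}


# Made-up entries, one JSON payload per line.
lines = [
    '{"email": "a@example.com", "datetime": "2014-02-13T09:22:54"}',
    '{"email": "A@example.com", "datetime": "2013-01-01T00:00:00"}',
    '{"email": "b@example.com", "datetime": "2014-03-01T12:00:00"}',
]
print(aggregate(lines, now=datetime(2014, 4, 1)))
# {'unique_emails': 2, 'recent_submissions': 2}
```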

pierros, ricky - can we sync sometime tomorrow over Vidyo to discuss which aggregation metrics need to be pushed to Vertica?
I think I provided the info requested in the other bug.
Flags: needinfo?(rrosario)
This is now resolved.
Closed: 9 years ago
Flags: needinfo?(pierros)
Resolution: --- → FIXED