Closed Bug 677311 Opened 13 years ago Closed 13 years ago

Sandboxed access to Vertica for addons download data

Categories

(Mozilla Metrics :: Data/Backend Reports, defect)

x86
macOS
defect
Not set
normal

Tracking

(Not tracked)

RESOLVED INCOMPLETE
Unreviewed

People

(Reporter: jbalogh, Unassigned)

References

Details

We're thinking about using Vertica to power the stats pages on amo. Please hook us up with some read-only accounts and a little bit of schema guidance so we can see how it feels.
Any timeline/ETAs for this?
I'll try to put together something to give you guys access by the end of this week.  We need to work out example queries and point you at the Python library for querying, so I need time to sort that stuff out.
Depends on: 677606
I did some investigation, and the only way to get raw SQL access to Vertica from Python would be through iODBC or somesuch.  I don't think you guys would want to mess with that, so I spoke with Jeff and we are going to instead provide an HTTP API that you guys can query to retrieve any particular dataset you need.

I'm thinking something along the lines of the following input:

{
  addon_ids: [...],
  dates: [
    {from: "yyyy-mm-dd", to: "yyyy-mm-dd"},
    {date: "yyyy-mm-dd"},
    {date: "yyyy-mm-dd"}
  ],
  filters: [
    {dimension: "country_code", values: ["US","PT"]},
    {dimension: "addon_status", values: ["userDisabled,incompatible"]},
  ],
  breakdowns: [
    "product_name",
    "country_code"
  ]
}


It might very well be that we could do this with olap4js or something. CCing pedro for feedback on that part.

Jeff and Wil, if you had the ability to post a query like this and get back a result set for the data you want, would that work for AMO?
(In reply to Daniel Einspanjer :dre [:deinspanjer] from comment #3)
> Jeff and Wil, if you had the ability to post a query like this and get back
> a result set for the data you want, would that work for AMO?

Yes, we could work with that. I'm looking at the stats pages this week to see what parts we actually pull out.

Another problem I'd like to solve in this space is bad data. We allow drilling down on things like application and locale, but the data is really noisy. If you click the "Change" button on the top of the graph on https://addons.mozilla.org/z/en-US/firefox/addon/adblock-plus/statistics/usage/applications/ you'll see a ton of bad app/version combos like Firefox 4.0b6-pre4.x, Firefox 5.0-MCP.x, undefined_Invalid.x, etc. It would be nice to clean this up before the data gets to our app so we don't have to unpack the JSON just to clean it all up. However, that can wait until we get the basic flow connected.
Our current data looks like this (I took out a bunch of noise in the middle):
[
    {
        "count": "13324488",
        "date": "2011-05-13",
        "end": "2011-05-13",
        "locales": {
            "ar": "21281",
            "ca": "6222",
            "de": "2628722"
        },
        "row_count": 1
    },
    {
        "count": "13932791",
        "date": "2011-05-12",
        "end": "2011-05-12",
        "locales": {
            "ar": "21899",
            "ca": "6813",
            "de": "2851485",
            "zu-ZA": "1"
        },
        "row_count": 1
    }
]

As for the API, I think we can start with something like:

{
  id: NN,
  table: "updates",  # or "downloads"
  date: {from: XX, to: YY},
  facet: "locales",  # optional
} 

Right now we support faceting usage counts by locale, addon version, application, and OS; we have requests for raw download and usage counts that don't facet at all.

And now that I'm looking at this closer, I'm wondering if elasticsearch would be good enough. What's easier on your end?
Daniel: What's the status of this?  It's blocking some app store development
Sorry, I didn't realize that the timeline was that compressed.

Jeff's last question was whether ElasticSearch might be a good fit.  I think it is very possible, if you guys are planning on using ES in this current app store development, then I'd suggest that you take the existing data that is being exported into your DB, munge it however you like to make it fit in an ES index, and then share that format with us and we can develop a new export mechanism that will insert the data into your ES index directly.  The big advantage of trying the ES route is that it requires no initial interaction from us since you already have the data you want and can make sure that the new schema is the most useful to you.

If it turns out that ES isn't a good option for you guys, then we are going to need at least a week to put a test version of this API together for you.  Nelson is back from vacation on Tuesday, and I need to talk with him to sort out the best approach for putting an API into place that you can query.

There was also comment #4 where Jeff mentioned that there is a bunch of noise in certain dimensions.  Unfortunately, this is a fact that Metrics has always had to live with because I've never found a really reliable way to cleanse the data without risking the possibility that some perfectly legitimate value might appear in the future and be rejected.  That said, I would be happy to implement whatever filters, validations, or mappings that you guys would like in the data we export.  This is much less of a problem because if it turns out that the filter drops something, it is always possible for us to go back to the datawarehouse and re-export once the filter is fixed.
Thanks, I didn't mention the timeline to you earlier so that's my bad.  We don't need to drop everything and work on this, but making steady progress is definitely something I'm hoping for.

/me will let Jeff comment on the ES idea
Daniel, one very quick way to implement this would be to define a few queries on cda and oass the url to them, what do you think?
Daniel, any thoughts on that?
I think overall the ES thing might be easier, but yes, if we did an API, my first approach would be to see whether we could just do v1 of it through CDA URLs.
I'll play with ES this week; I like having that flexibility vs. figuring out our own limited API.
I'll open a new bug once I figure out ES. Thanks guys.
Status: NEW → RESOLVED
Closed: 13 years ago
Resolution: --- → INCOMPLETE
You need to log in before you can comment on or make changes to this bug.