Closed Bug 1489506 Opened 7 years ago Closed 7 years ago

main summary uptake returns an error

Categories

(Cloud Services :: Operations: Miscellaneous, task)

task
Not set
normal

Tracking

(Not tracked)

RESOLVED FIXED

People

(Reporter: magopian, Assigned: peterbe)

Details

I've filed https://github.com/mozilla/PollBot/issues/226 against the Pollbot project, which queries STMO with requests looking like the following one: https://sql.telemetry.mozilla.org/api/query_results/<query id> It seems it's now receiving bad data from STMO. Has the STMO API changed lately? Or is it an authentication issue because some tokens/credentials have been changed?
:wezhou are you in charge of the Pollbot service deployment? Do you see anything strange in the logs?
Flags: needinfo?(wezhou)
STMO switched over to Auth0 a couple months ago but it should not have affected querying using an API key. What is the response you are receiving from the STMO API?
Hi Jason! I have no idea, never had access to this service, so I'm just trying to investigate using the source code (https://github.com/mozilla/PollBot/blob/master/pollbot/tasks/telemetry.py#L65-L173). I'm guessing the STMO API is returning something that's not the same as before, otherwise we wouldn't have this issue (maybe we're now receiving an error message instead of a proper JSON object?). If there's a way to check what's the API response, I'd be able to investigate further. I'm not sure it makes sense for me to have the credentials to be able to run the queries myself, maybe there's someone in charge of STMO who has access to a log of the queries, and able to see if there's anything odd?
All Mozilla users have access to https://sql.telemetry.mozilla.org/, so you can just try to log in with Auth0. Once authenticated, you can hit https://sql.telemetry.mozilla.org/users/me to get your user API key and test against the STMO API.
I see the following error in the app when accessing [1].

Sep 07 18:45:22 ip-172-31-11-84 docker-pollbot[2453]: string indices must be integers
Sep 07 18:45:22 ip-172-31-11-84 docker-pollbot[2453]: Traceback (most recent call last):
Sep 07 18:45:22 ip-172-31-11-84 docker-pollbot[2453]:   File "/app/pollbot/views/release.py", line 21, in wrapped
Sep 07 18:45:22 ip-172-31-11-84 docker-pollbot[2453]:     response = await task(product, version)
Sep 07 18:45:22 ip-172-31-11-84 docker-pollbot[2453]:   File "/app/pollbot/tasks/telemetry.py", line 122, in main_summary_uptake
Sep 07 18:45:22 ip-172-31-11-84 docker-pollbot[2453]:     query_info = await get_query_info_from_title(session, query_title)
Sep 07 18:45:22 ip-172-31-11-84 docker-pollbot[2453]:   File "/app/pollbot/tasks/telemetry.py", line 74, in get_query_info_from_title
Sep 07 18:45:22 ip-172-31-11-84 docker-pollbot[2453]:     body = [query for query in body
Sep 07 18:45:22 ip-172-31-11-84 docker-pollbot[2453]:   File "/app/pollbot/tasks/telemetry.py", line 75, in <listcomp>
Sep 07 18:45:22 ip-172-31-11-84 docker-pollbot[2453]:     if not query['name'].startswith('Copy of') and
Sep 07 18:45:22 ip-172-31-11-84 docker-pollbot[2453]: TypeError: string indices must be integers

[1] https://pollbot.services.mozilla.com/v1/devedition/63.0b2/telemetry/main-summary-uptake
Flags: needinfo?(wezhou)
Thanks :jason for the STMO tip, and thanks :wezhou for the logs!

So I did some archeology, and found out that what Pollbot expects from a query to:

https://sql.telemetry.mozilla.org/api/queries?q=Uptake+Devedition+AURORA+63.0b2+%2820180830123124%29&org_slug=default&drafts=false

is a JSON response looking like:

[{
    "latest_query_data_id": 5678,
    "id": 40197,
    "name": "Uptake Firefox RELEASE 57.0 (20171009192146)",
    "user": {
        "id": int(TELEMETRY_USER_ID)
    }
}]

But what we're currently getting back from the API is this:

{"count": 0, "page": 1, "page_size": 25, "results": []}

So it seems the API did change, I'm just not sure when nor why. I guess we're now supposed to get the data we're looking for in the `results` item of the new response? Any idea who I could ping about that?
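For illustration, here is a minimal sketch of how a client could accept both response shapes. Iterating over the new dictionary yields its string keys, which is why `query['name']` raised "string indices must be integers". The function names `extract_queries` and `pick_query` are hypothetical (they are not PollBot's), and only the `Copy of` part of the filter is reproduced, since the rest of the condition is cut off in the traceback.

```python
def extract_queries(body):
    """Return the list of query dicts from either response shape."""
    if isinstance(body, dict):
        # Newer, paginated shape: {"count": ..., "page": ..., "results": [...]}
        return body.get("results", [])
    # Older shape: a bare list of query dicts.
    return list(body)


def pick_query(body):
    """Return the first query that is not a 'Copy of ...' duplicate, or None."""
    queries = [
        query for query in extract_queries(body)
        if not query["name"].startswith("Copy of")
    ]
    return queries[0] if queries else None
```

With the empty paginated response quoted above, `pick_query` returns `None` instead of raising, so the caller can report that no matching query was found.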
Flags: needinfo?(jthomas)
Hey Jannis is this something you can assist with?
Flags: needinfo?(jthomas) → needinfo?(jezdez)
Yeah, the API response for the queries API endpoint has changed in Redash v5 (which we've deployed for a bit) and the old /api/queries/search endpoint is being redirected to the new one that supports both listing and searching the queries. In other words, PollBot will need to be updated to use the new API format. The error pasted above is just related to the fact that the API response JSON is formatted differently.

I'm surprised a service was written against the STMO API without us knowing about it, given the fact it's not considered to be a stable API and only caters to the Redash frontend client. We've recently been working on porting our custom changes to the upstream repo and keeping our fork rebased on top of the main repo. As such, I would recommend following not only our fork development, but Redash in general.

FWIW, Marina has written a client library in the past to support her work on experiments but it's also not up-to-date with the new API changes from upstream Redash: https://github.com/mozilla/redash_client

I would recommend updating the client library with the recent changes on the API side and then updating PollBot to use that library as much as possible. That way we'd be able to abstract away API changes and PollBot would only need to update that client library. That also fits the hope of the upstream Redash authors to officially support a client library eventually (and maybe our library can be that).

I hope that answers your question, Jason?
Flags: needinfo?(jezdez)
Thanks a lot Jannis, this did indeed answer the question ;) I'll NI Benson, to prioritize and assign to whoever will take the lead on this project in the future.
Flags: needinfo?(bwong)
Put this on our project board. Will see who has time to update.
Flags: needinfo?(bwong)
PR https://github.com/mozilla/PollBot/pull/227 Note, this doesn't solve the problem by switching to using redash_client. It's just a bandaid solution which'll hopefully work.
Jason, I reproduced the bug locally, fixed it, and made a new git tag and a new git release with this fix. I don't know how to take it from there. It doesn't appear to be documented.
This probably needs to be released, needinfo'ing wezhou.
Flags: needinfo?(wezhou)
s/released/deployed/
It looks like our pipeline has pushed v1.2.1 to -stage automatically. Please verify it works there. The -prod deployment will happen next week once QA has passed.
What is the front-end URL for the delivery dashboard? (if there is one)
What's the delivery dashboard?

Pollbot -stage url is this: https://pollbot.stage.mozaws.net/v1/
-prod is this: https://pollbot.services.mozilla.com/v1/
(In reply to :wezhou from comment #17)
> What's the delivery dashboard?
>
> Pollbot -stage url is this: https://pollbot.stage.mozaws.net/v1/
> -prod is this: https://pollbot.services.mozilla.com/v1/

Delivery dashboard is the front-end for pollbot. The Prod front-end is https://mozilla.github.io/delivery-dashboard/ but that points to the Prod pollbot. I was wondering if there's a Stage front-end that points to the Stage pollbot.

Either way, if you take the Prod URL:
https://pollbot.services.mozilla.com/v1/firefox/64.0a1/telemetry/main-summary-uptake
and replace the domain so that it becomes:
https://pollbot.stage.mozaws.net/v1/firefox/64.0a1/telemetry/main-summary-uptake

Then it fails with a different error! Sigh.
Not sure how to proceed on this, because if I take that failing URL to my localhost version:

curl http://localhost:8000/v1/firefox/64.0a1/telemetry/main-summary-uptake

then it works.

▶ curl http://localhost:8000/v1/firefox/64.0a1/telemetry/main-summary-uptake
{"status": "incomplete", "message": "Telemetry uptake calculation for version 64.0a1 (20181005102516, 20181004224156, 20181004100222) is in progress", "link": "https://sql.telemetry.mozilla.org/queries/59340"}

But, just to be extra weird, when I run it a second time I get a different output:

▶ curl http://localhost:8000/v1/firefox/64.0a1/telemetry/main-summary-uptake
{"status": "incomplete", "message": "Query still processing.", "link": "https://sql.telemetry.mozilla.org/queries/59340"}
I don't think cloudops helped set up https://mozilla.github.io/delivery-dashboard/. I think it is up to the project owners of https://github.com/mozilla/delivery-dashboard to decide which pollbot environment the "delivery dashboard" points to (and it sounds like the dashboard is pointing to pollbot -prod at the moment).

The pollbot -stage logs show the following error:

Oct 05 18:34:19 ip-172-31-57-96 docker-pollbot[2622]: '<' not supported between instances of 'NoneType' and 'float'
Oct 05 18:34:19 ip-172-31-57-96 docker-pollbot[2622]: Traceback (most recent call last):
Oct 05 18:34:19 ip-172-31-57-96 docker-pollbot[2622]:   File "/app/pollbot/views/release.py", line 21, in wrapped
Oct 05 18:34:19 ip-172-31-57-96 docker-pollbot[2622]:     response = await task(product, version)
Oct 05 18:34:19 ip-172-31-57-96 docker-pollbot[2622]:   File "/app/pollbot/tasks/telemetry.py", line 160, in main_summary_uptake
Oct 05 18:34:19 ip-172-31-57-96 docker-pollbot[2622]:     if ratio < 0.5:
Oct 05 18:34:19 ip-172-31-57-96 docker-pollbot[2622]: TypeError: '<' not supported between instances of 'NoneType' and 'float'
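The traceback shows `ratio` being `None` when it is compared to `0.5`. As a rough sketch (not the actual PollBot fix), the comparison could be guarded so that a missing ratio yields an "incomplete" response rather than an exception. The function name `describe_uptake`, the `"ok"` status, and the uptake message wording are all assumptions made for this example; only the `0.5` threshold, the `"incomplete"` status, and the "Query still processing." message appear in this bug.

```python
def describe_uptake(ratio, threshold=0.5):
    """Map an uptake ratio (or None when no result is available yet) to a status dict."""
    if ratio is None:
        # No computed value yet: report it instead of comparing None to a float.
        return {"status": "incomplete", "message": "Query still processing."}
    message = f"Telemetry uptake is {ratio:.1%}"  # hypothetical wording
    if ratio < threshold:
        return {"status": "incomplete", "message": message}
    return {"status": "ok", "message": message}  # "ok" is a placeholder status
```

Checking `is None` explicitly (rather than `if not ratio`) keeps a legitimate ratio of `0.0` on the normal comparison path.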
I had a chat with sunah about this. This is the recommendation:

- use the clients_daily table instead of the main_summary table
  - it will still be slow, but at least the data table size is smaller
  - the clients_daily table is updated once a day
- when we have the correct query then we will look at making it faster
  - caching on the pollbot side (redis?)
  - automatically generating the results into a JSON file w/ telemetry's airflow (workflow.telemetry.mozilla.org)
Anything needed for ops at this point? I was ni'd recently for deploying v1.2.1, which is on -stage now. Let us know if we need to proceed with deploying v1.2.1 (or a new version) to -prod.
Flags: needinfo?(wezhou)
Nothing needed from ops at this point. We will need to deploy like a 1.3 or something in the future after we figure out the right way to do this.
(In reply to Benson Wong [:mostlygeek] from comment #21)
> I had a chat with sunah about this. This is the recommendation:
>
> - use the clients_daily table instead of the main_summary table
> - it will still be slow, but at least the data table size is smaller

Why is it not cached in Telemetry?

What about taking Jannis's idea of using redash_client instead of requests.get(url)? Would that make it better?

Also, does the PollBot code not support returning a response like "Warning. Still waiting", which would make it an orange/yellow warning in delivery-dashboard?
Assignee: nobody → peterbe
tl;dr: If you use https://mozilla.github.io/delivery-dashboard/?server=https://pollbot.stage.mozaws.net/v1 the "Telemetry Main Summary Uptake" works again. PollBot Stage has been upgraded. PollBot Prod has not. Hoping to get some help doing verifications first.

Longer version...

The Telemetry Main Summary Uptake query started to fail because of two distinct problems.

First, the result format from Redash changed. Instead of returning a list of dictionaries, it returned a paginated dictionary whose `results` item holds that list. We first thought it was an easy fix: just read the first element from that *list* and you'd get the same stuff.

Second, the queries PollBot ran were probably faster back in the day when Remy built this (don't know who helped him). These days, even with the pagination fix, they are just too slow, so the result was never there and the Python code in PollBot failed to deal with this gracefully.

So we (Sunah and I) dug in. We decided to switch from creating a new query (with a refresh schedule!!) each time to preparing a specific PollBot saved query (https://sql.telemetry.mozilla.org/queries/59383) that is run every day and tries to track the arrival of the daily main summary query. Having a **saved query** means, from the perspective of PollBot, that we just need to go to Redash and grab the results already computed. (See how the query ID is hardcoded here: https://github.com/mozilla/PollBot/pull/235/files#diff-241270e325ddb6cf2b758c67e4ce075dR11)

Now the Telemetry Main Summary Uptake query works. At least in Stage. I haven't pushed to upgrade PollBot Prod yet as I'm waiting for the dust to settle and for more people to get a chance to shoot it down.

HOWEVER, a couple of interesting things have fallen out of this...

* We've discovered that Buildhub is sporadically slow to pick up the latest buildIDs. No official plan yet but we are pursuing the idea of solving this by switching to the almost-finished Buildhub2, which would work for PollBot.

* Sometimes the Telemetry Main Summary is late. It just sometimes takes a really long time to come in, or there might be errors which force manual intervention. They are working on adding a status banner to the top of Redash (at https://sql.telemetry.mozilla.org/) so you can see if there's a reason there isn't any data there. "Fortunately", this happened on one of the days I tried to use the new saved query, so now there's a new possible outcome from PollBot which simply says "Query results contained no rows." (https://github.com/mozilla/PollBot/pull/235/files#diff-241270e325ddb6cf2b758c67e4ce075dR71)

* We also have a potential gap between when the main summary is ready and our saved query's result. So for some (small) periods of time every day we might have main summary data in Telemetry but no data for the PollBot query. What we *could* do is make *our* query rely on the "max" date rather than "today's" date. But if that's the case, we'd need to carry that information through so it says "Uptake for Firefox 64.0b1 is 15.5% (BASED ON YESTERDAY'S DATE)".
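To make the last two points concrete, here is a hedged sketch of how a saved-query result could be turned into a message, covering both the "no rows" outcome and the "max date" idea. It assumes the payload nests rows under `query_result` / `data` / `rows`, and the column names `date` and `uptake` are invented for this example; the real saved query's columns may differ, and `summarize_saved_query` is a hypothetical name rather than PollBot's code.

```python
from datetime import date


def summarize_saved_query(payload, today):
    """Build a human-readable uptake message from a saved-query result payload."""
    # Assumed payload layout: {"query_result": {"data": {"rows": [...]}}}
    rows = payload.get("query_result", {}).get("data", {}).get("rows", [])
    if not rows:
        return "Query results contained no rows."
    # Hypothetical columns: "date" (ISO string) and "uptake" (fraction 0..1).
    # ISO date strings sort chronologically, so max() picks the newest row.
    latest = max(rows, key=lambda row: row["date"])
    message = f"Uptake is {latest['uptake']:.1%}"
    if latest["date"] != today.isoformat():
        # Carry the staleness through so readers know the data is not from today.
        message += f" (based on data from {latest['date']})"
    return message
```

Passing `today` in as an argument, rather than calling `date.today()` inside the function, keeps the staleness check easy to test.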
:jason or :wezhou can you please release v1.4.0 on PollBot Prod? Once that is done I believe we are done here.
Pollbot v1.4.0 has been pushed to -prod.
Yay!
Status: NEW → RESOLVED
Closed: 7 years ago
Resolution: --- → FIXED