Closed Bug 1328839 Opened 7 years ago Closed 7 years ago

[Research] How to reliably load JSON sources from MDN GitHub repositories into KumaScript?

Categories

(developer.mozilla.org Graveyard :: KumaScript, enhancement)

All
Other
enhancement
Not set
normal

Tracking

(Not tracked)

RESOLVED FIXED

People

(Reporter: fs, Unassigned)

Details

(Keywords: in-triage, Whiteboard: [specification][type:feature])

What problem would this feature solve?
======================================
Currently, data about Open Web technologies lives in KumaScript macros, because MDN pages consume this data to generate documentation, navigation, and more.

Ideally, Open Web data would live separately from KumaScript and the wiki pages, because it is useful to consumers other than MDN. This separation has already started: some data now lives in JSON files in the mdn/data and mdn/browser-compat-data repositories.

To use this data in MDN, KumaScript still needs to have fast and reliable access to the JSON files. Live-loading from GitHub is not a good approach. Alternatives have to be researched.

Who has this problem?
=====================
Core contributors to MDN

How do you know that the users identified above have this problem?
==================================================================
Our Site Reliability Engineer noticed that the current way of loading from the GitHub repository is not a good approach. https://github.com/mozilla/kumascript/pull/78

From 2012: https://www.quora.com/GitHub/What-is-the-recommended-way-to-use-raw-github-com

"You can use it as a way to link to (or download) the raw contents of a file.  Git is not designed as an efficient file serving system, so we rate limit it pretty heavily to protect against sites hotlinking (or doing anything that generates a lot of traffic)."

How are the users identified above solving this problem now?
============================================================
a) Live load from raw.github (don't do this anymore!)

b) Use a KumaScript macro (EJS format) that duplicates the JSON from the external repo. This works for a handful of JSON files, but not for large numbers of them (such as the browser compat JSON files).

Do you have any suggestions for solving the problem? Please explain in detail.
==============================================================================
Josh says:
"Once MDN (including kumascript) is being continuously deployed to Kubernetes later this year we'll have much better options for automatically updating data from external repos and APIs on the server side, either as part of an automated deployment, for example as part of the kumascript Docker image build phase, or as a sidecar container like https://github.com/kubernetes/git-sync, ideally with some sort of schema validation."


I have no expertise with any of the things Josh mentioned. My first thought was to somehow integrate mdn/data and mdn/browser-compat-data as submodules into kumascript. But then submodules are making people cry, so I am eager to hear better ideas.

Is there anything else we should know?
======================================
I filed this as a research/preparation bug, so I would consider it fixed once there is agreement on an actual solution (that is, the summary question is answered) and one or more implementation bugs are filed.

I think this is critical for MDN to proceed with offering structured data to external collaborators. I will put this on my wish list.
Keywords: in-triage
Currently, the data is only requested from GitHub when a logged-in user reloads a page via Ctrl+F5, which purges the cache. This doesn't happen that often, so the issue shouldn't be that big at the moment. Also, the quoted comment is already five years old, so this information may have changed in the meantime.

Nonetheless, a strategy for caching that data on the MDN side should be established.
One solution that comes to mind is to use the cacheFn() function[1] within KumaScript macros to cache the data once it's fetched from GitHub.
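
As a rough illustration of the caching idea above, here is a minimal sketch in plain JavaScript. It does not use the real cacheFn() signature, which is not shown in this bug. createCachedLoader, fetchFn, ttlMs, and now are hypothetical names, and the fetch function is injected so that the sketch runs without network access. It is synchronous for brevity.

```javascript
// Hypothetical helper, not part of KumaScript: wraps a fetch function
// with a time-limited in-memory cache keyed by URL.
function createCachedLoader(fetchFn, ttlMs, now = Date.now) {
  const cache = new Map(); // url -> { value, expires }

  return function load(url) {
    const hit = cache.get(url);
    if (hit && hit.expires > now()) {
      return hit.value; // still fresh, so GitHub is not contacted
    }
    const value = fetchFn(url); // cache miss or expired entry
    cache.set(url, { value, expires: now() + ttlMs });
    return value;
  };
}
```

With a loader like this, repeated page renders within the time-to-live would reuse the cached JSON instead of hitting GitHub again. The trade-off is that changes upstream only become visible after the entry expires.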

Alternatively, the data could be fetched from GitHub on deployment and made accessible somewhere on MDN. The disadvantage is that a contributor has to wait for the next release to see the changes live.

Sebastian

[1] https://developer.mozilla.org/en-US/docs/MDN/Contribute/Tools/KumaScript#Built-in_methods
I contacted GitHub support, and here's their answer:

----

You shouldn't really be using that raw endpoint for programmatic access. For programmatic access -- the recommended approach is to use the API:

http://developer.github.com/v3/

The raw endpoint isn't documented, as you noticed, which means it doesn't have defined caching or rate limiting behavior. In other words, you might get limited at any time and without any warning. The API has documented rate limits which you can rely on.

---

My ideal solution would be that the data is present on each KumaScript host, like the macros (as of December 2016), and loading the data is a local file read.  This ensures the data is fast, consistent, and takes the network out of the loop.  However, I understand the desire to update this data more frequently than once or twice a week.  I think rapid deployment (5-30 min after merge to mdn/data's master) could be possible after the AWS migration, so we're talking Q3 2017 before that is feasible.

Some other options:

* Publish as GitHub pages and load from there
* Add as a git submodule to kumascript, deploy from there
* Deploy a [kinto](https://github.com/Kinto/kinto) server, load the JSON data on merge to master with a web hook or similar

I think a good first step would be to create an internal API through KumaScript. For example, mdn.LoadJsonData('api/groups') could initially just call mdn.fetchJSONResource('https://raw.githubusercontent.com/mdn/data/master/api/groups.json'), but could later load from disk, from a document storage API, or by some other method, without any macro having to change.
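
To make the indirection concrete, here is a minimal sketch of such an internal API. It is only an illustration under assumptions: createJsonDataLoader, rawGitHubBackend, localDiskBackend, and the injected fetchJson and readFile functions are hypothetical names, and only the raw.githubusercontent.com URL pattern comes from the discussion above.

```javascript
// URL pattern taken from the example above.
const RAW_BASE = 'https://raw.githubusercontent.com/mdn/data/master/';

// Macros would call loadJsonData('api/groups') and never see where the
// data actually comes from; the backend can be swapped later.
function createJsonDataLoader(backend) {
  return function loadJsonData(name) {
    // Accept only simple slash-separated names, so a macro cannot
    // escape the data directory with something like '../secret'.
    if (!/^[\w-]+(\/[\w-]+)*$/.test(name)) {
      throw new Error('Invalid data name: ' + name);
    }
    return backend(name);
  };
}

// Backend 1: delegate to an injected fetcher (stands in for
// mdn.fetchJSONResource).
function rawGitHubBackend(fetchJson) {
  return (name) => fetchJson(RAW_BASE + name + '.json');
}

// Backend 2: read from a local copy of the repository. readFile is
// injected; in Node.js it could be fs.readFileSync.
function localDiskBackend(rootDir, readFile) {
  return (name) => JSON.parse(readFile(rootDir + '/' + name + '.json', 'utf8'));
}
```

Switching from live loading to a local file read would then be a one-line change where the loader is created, for example from createJsonDataLoader(rawGitHubBackend(fetchJson)) to createJsonDataLoader(localDiskBackend(dataDir, readFile)).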

And then I'd support quietly using raw.githubusercontent.com under current usage patterns until we're in AWS.  We really don't have the dev resources to implement a better solution in SCL3.
Another option:

* Use npm packages (mdn-data & mdn-compat-data), which we plan to publish soon (for comparison, https://www.npmjs.com/package/caniuse-db seems to release daily or so)
Blocks: 1367174
Commits pushed to master at https://github.com/mozilla/kumascript

https://github.com/mozilla/kumascript/commit/db40f8fa45bb1cac95ba50337049631b0c6eef84
bug 1328839: enable "require" of npm packages

* add "mdn-browser-compat-data" to package.json
  and update "npm-shrinkwrap.json"
* change name of "require" method on "APIContext"
  class to "require_macro", and update all macros
  and tests to reflect the change
* add "require" method on "APIContext" class that
  maps to the nodejs "require" and add test

https://github.com/mozilla/kumascript/commit/f2383aa2618e6a9819fee0e5b1976dfd63a20c9f
Merge pull request #183 from escattone/use-npm-bcd-1328839

bug 1328839: enable "require" of npm packages
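
The API change described in the commit message above can be pictured roughly as follows. This is a simplified stand-in, not the actual KumaScript source: ApiContextSketch and the injectable loadModule and loadMacro options are hypothetical, and only the method names require and require_macro come from the commit message.

```javascript
// Simplified stand-in for KumaScript's APIContext class.
class ApiContextSketch {
  constructor(options = {}) {
    // Module loader: Node's require unless a replacement is injected
    // (injection keeps the sketch easy to test).
    this.loadModule = options.loadModule || require;
    // Macro loader: hypothetical callback that resolves a macro by name.
    this.loadMacro = options.loadMacro || ((name) => {
      throw new Error('No macro loader configured for ' + name);
    });
  }

  // Previously exposed as "require"; renamed so that the name is free
  // for loading npm packages.
  require_macro(name) {
    return this.loadMacro(name);
  }

  // Maps to the Node.js require, so a macro could load a package such
  // as 'mdn-browser-compat-data' that is listed in package.json.
  require(name) {
    return this.loadModule(name);
  }
}
```

Because the data ships as a versioned npm dependency, loading it becomes a local module load at render time, and updating the data means updating the package version.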
Commits pushed to master at https://github.com/mozilla/kuma

https://github.com/mozilla/kuma/commit/439728c54e8723be1c55792bcc3c7bfe8ccdccea
bug 1328839: improve kumascript npm package updates

https://github.com/mozilla/kuma/commit/a9f27ae22035e7fb26aa50cb56a8c3817743fc07
Merge pull request #4255 from escattone/npm-update-1328839

bug 1328839: improve kumascript npm package updates
I think we have proven that:

1. npm is a good way to package versioned JSON data, and to share it with external projects
2. The GitHub raw endpoint is a decent way to access quickly changing JSON data, but it will fail for a few hours every other month.
3. Other options may be available after the AWS transition (bug 1110799, jgmize's comments)

Since this bug is about investigating the issue, I think we can resolve it as fixed.
Status: NEW → RESOLVED
Closed: 7 years ago
Resolution: --- → FIXED
Product: developer.mozilla.org → developer.mozilla.org Graveyard