Closed Bug 862832 Opened 11 years ago Closed 11 years ago

Extract metadata when publishing in order to populate MakeAPI for a Thimble project

Categories

(Webmaker Graveyard :: Thimble, defect)

x86
macOS
defect
Not set
normal

Tracking

(Not tracked)

RESOLVED FIXED

People

(Reporter: michiel, Assigned: thecount)

References

Details

(Whiteboard: [thimble:node] u=dev p=1 s=2013w17)

Attachments

(1 file)

      No description provided.
Assignee: nobody → scott
Status: NEW → ASSIGNED
Morphing this bug slightly based on our meeting.

We want to enable users to tag, describe and otherwise decorate web pages they make in Thimble such that the MakeAPI (and other services, e.g., Open Graph) can get what it needs.  Also, we want to do this using markup vs. adding new UI to Thimble.

This means that we need to be able to parse out metadata in a Thimble-created document.  We shouldn't invent new ways to do this, since many already exist, for example:

* Meta Properties, like Facebook does http://developers.facebook.com/docs/opengraph/property-types/

<meta property="property-name" content="property-value">

* Meta Content, like everyone else does, for example twitter: https://dev.twitter.com/docs/cards/types/summary-card

<meta name="twitter:card" content="summary">

We need to specify the set of properties/names we are going to look for, and map them to the MakeAPI.

CC'ing Matt/Chris for ideas too.
Summary: Publish MakeAPI metadata (somewhere) for all makes → Extract metadata when publishing in order to populate MakeAPI for a Thimble project
Whiteboard: [thimble:node] → [thimble:node] u=dev p=1 s=2013w17
If I understand open graph, it is designed to be read from inside a webpage.

I am missing the use case for exposing it to the MakeAPI.

Is there a point where we need the data so we can regenerate the meta tags?

Can we add meta tags to thimble or do they get sanitized? Is that why we need this?
Attached file Need info
Flags: needinfo?(david.humphrey)
Flags: needinfo?(pomax)
Ah, one more piece of info I need that I thought I put in the above comment.

I am having issues finding out how to parse the html. I can only seem to obtain a stringified version.
Some examples of things I heard today in a meeting with the Learning/Badge folks about the kinds of things they'd want to be able to tag in a project:

* That a Make leads to a badge
* That a Make is somehow related to the Web Literacy Standard, perhaps a section therein
* That a Make is an Activity Kit
* That a Make was created by, or is related to, Partner X (e.g., Telefonica)
* That a Make is related to a Hive Network
* That a Make is something with Peer Assessment or Automatic Assessment
* That a Make is something you can do offline (e.g., low-bandwidth users)

Scott: to answer your questions.  Yes, things like Open Graph are metadata that follows a given form within a web page.  Indexing services like Google, Twitter, Facebook, etc. know how to crawl and extract such data, and then use it in their index.  It's a well understood and common practice, and I want to see us use it vs. inventing a new system.

The easiest way for you to get this data is probably to extract it on the browser-side and pass it to the server along with the data.  A browser can easily pick <meta> tags out of a DOM, but trying to do this by hand with strings is going to be error-prone.

So imagine I make the following Thimble document:

<html>
<head>
  <title>Yo Hive!</title>
  <meta name="webmaker-tag" content="HiveNYC">
  <meta name="webmaker-tag" content="MadeByMozilla">
</head>
<body>
I love Hive NYC!
</body>
</html>

I now have two things: an HTML document, and some metadata about that document, which happens to be included within the document.  When I send this to the server, I might send it as a JSON blob like so:

{
  "html": "<html>\n<head>....",
  "metadata": {
    "tags": [
      "HiveNYC",
      "MadeByMozilla"
    ],
    "title": "Yo Hive!"
  }
}

Now I can publish the HTML to the DB/S3, and the metadata to the MakeAPI.
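
A minimal sketch of assembling that blob, assuming the title and the <meta> name/content pairs have already been read from the DOM on the browser side. The function name buildPublishPayload and the shape of the metas argument are hypothetical, chosen only for illustration; they are not part of Thimble or the MakeAPI.

```javascript
// Hypothetical helper: combine the raw HTML with metadata already read
// from the DOM. `metas` is an array of { name, content } pairs.
function buildPublishPayload(html, title, metas) {
  // Collect every <meta name="webmaker-tag"> value, in document order.
  const tags = metas
    .filter((m) => m.name === 'webmaker-tag')
    .map((m) => m.content);

  return {
    html: html,
    metadata: { tags: tags, title: title }
  };
}

const payload = buildPublishPayload(
  '<html>\n<head>....',
  'Yo Hive!',
  [
    { name: 'webmaker-tag', content: 'HiveNYC' },
    { name: 'webmaker-tag', content: 'MadeByMozilla' }
  ]
);
console.log(JSON.stringify(payload, null, 2));
```

The server can then store payload.html and forward payload.metadata separately.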

We need to sort out the actual naming we want to use (I used webmaker-tag above, but we might choose to use webmaker:tag, wmtag, or just tag).

Does this make sense?
Flags: needinfo?(david.humphrey)
Follow-up to call Scott and I had.  I'd suggest, for a first run at this, we do the following:

* Take the string we get from friendlycode, which has the HTML, and re-inflate it into a full DOM in an iframe or document fragment, then use DOM selectors to pull out <title> and <meta>.

* We then look at the <meta> tags we get, and focus on the following name=foo, where foo is one of:

-description
-author
-webmaker:*

The last one is a namespaced name we reserve for any webmaker type metadata.  Examples include:

-webmaker:tag

So a more complete example of my doc above might look like this:

<html>
<head>
  <title>Yo Hive!</title>
  <meta name="author" content="David Humphrey">
  <meta name="description" content="A project I made about Hive NYC">
  <meta name="webmaker:tag" content="HiveNYC,MadeByMozilla">
</head>
<body>
I love Hive NYC!
</body>
</html>

Here I've combined the webmaker:tag into a single element, which is probably fine (Pomax, any issue with this?).  I'm not sure which other webmaker:* we need at this stage, but anything that's a field in the MakeAPI is a potential.
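
The selection rules above could be sketched roughly as follows. This assumes the HTML string has already been re-inflated into a DOM; in a browser that could be done with new DOMParser().parseFromString(html, 'text/html'). The name extractMetadata and the shape of the returned object are hypothetical, and storing other webmaker:* names under their suffix is an assumption, not something the thread has settled.

```javascript
// Sketch only: `extractMetadata` is a hypothetical name. `doc` is any object
// with querySelector/querySelectorAll, such as a browser Document.
function extractMetadata(doc) {
  const titleEl = doc.querySelector('title');
  const metadata = { title: titleEl ? titleEl.textContent : '' };

  const metas = doc.querySelectorAll('meta[name]');
  for (let i = 0; i < metas.length; i++) {
    const name = metas[i].getAttribute('name');
    const content = metas[i].getAttribute('content') || '';

    if (name === 'description' || name === 'author') {
      metadata[name] = content;
    } else if (name === 'webmaker:tag') {
      // A single element carries a comma-separated list of tags.
      metadata.tags = content
        .split(',')
        .map((t) => t.trim())
        .filter((t) => t.length > 0);
    } else if (name.indexOf('webmaker:') === 0) {
      // Assumption: any other reserved webmaker:* name is kept under its suffix.
      metadata[name.slice('webmaker:'.length)] = content;
    }
    // Every other <meta name> is ignored.
  }
  return metadata;
}
```

For the example document this would yield the title, author, description, and a two-item tags array, and nothing else.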
I don't think you need to reinflate, but if you do, where would you do this? We would be extracting server-side, and node.js has no good DOM library, let alone an actual element implementation. Title and meta elements are pretty easy to grab with a regexp, too, though. There's no unpredictable (legally, anyway) content there.
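
For comparison, a rough sketch of the regexp route on the server. It assumes double-quoted attributes with name written before content, which hand-written HTML does not guarantee, and it does not decode HTML entities. The function name extractWithRegex is hypothetical.

```javascript
// Sketch only: assumes double-quoted attributes with name before content.
// Anything written differently is silently skipped.
function extractWithRegex(html) {
  const result = { title: null, meta: {} };

  // First <title>...</title> in the string, if any.
  const titleMatch = /<title[^>]*>([\s\S]*?)<\/title>/i.exec(html);
  if (titleMatch) {
    result.title = titleMatch[1].trim();
  }

  // Every <meta name="..." content="..."> in the string.
  const metaRe = /<meta\s+name="([^"]*)"\s+content="([^"]*)"\s*\/?>/gi;
  let match;
  while ((match = metaRe.exec(html)) !== null) {
    result.meta[match[1]] = match[2];
  }
  return result;
}
```

Single-quoted or unquoted attributes, or content placed before name, would need extra patterns, which is the trade-off against reading the tags from a DOM.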

As for meta tags, I'd go with opengraph tags where they already exist so we don't invent our own for roles that search engines etc. already look for. Ideally none of our machine tags are "our own" but simply rely on already established machine tags.
Flags: needinfo?(pomax)
I got a working pull request here: https://github.com/mozilla/thimble.webmaker.org/pull/26

This uses the makeAPI commit in bug 861816 to actually connect to the makeAPI and start publishing.

The commit that matters for this is here: https://github.com/ScottDowne/ThimbleOnNode/commit/00937dc650887530bc5b070b7cb34b558f5469c7

So for now, think of it as something to look at and not a review. I'll likely have to rebase it, but I'm cool with that.

What/how I did it.

I pull the data, client-side from index.html, out of the iframe using querySelectors.

It works and I am open to other methods.

I then pass it over to Node, which builds MakeAPI data out of it.
I ended up putting all the data under a webmaker- name prefix, even description. I can go either way regarding webmaker-description vs description.

Also used a dash, not a colon, but I can change that.

So:

  <meta name="webmaker-title" content="David Humphrey">
  <meta name="webmaker-description" content="A project I made about Hive NYC">
  <meta name="webmaker-tag" content="HiveNYC,MadeByMozilla">

I didn't think of putting author here.

I also accept title from the title element, which is taken in favour of a meta title. Kinda weird. Thoughts?
I really like this patch, nice work Scott.  I made a bunch of suggestions in the pull request.  This method is really simple and clean.  

I agree with Pomax that we should seriously consider Open Graph (http://ogp.me/) here where it makes sense vs. creating our own.  If you read their spec, it won't get us everything we need, but it's what Facebook, Google, WordPress (not Twitter), and others use, and by extension what others will support by virtue of these first sites blessing it.

The HTML5 spec (http://www.whatwg.org/specs/web-apps/current-work/multipage/semantics.html#the-meta-element) says that you can use the following for <meta>: name=author, name=description, name=keywords or one of the approved extensions, see http://wiki.whatwg.org/wiki/MetaExtensions.  Truth be told you can put anything you want in the name, but it means your document won't validate (if we care about that, I'm unclear if we do).

For sure we care about at least the following:

* title - we should just use <title>
* author - <meta name=author content="humphd">
* description - <meta name=description content="This is what I made">
* MakeAPI tags - either <meta name=keywords content="tag1,tag2,tag3"> or <meta name="webmaker:tags" content="tag1,tag2,tag3">.
* MakeAPI thumbnail - either <meta name="og:image" content="http://ia.media-imdb.com/images/rock.jpg"> or <meta name="webmaker:thumbnail" content="http://ia.media-imdb.com/images/rock.jpg">  
* MakeAPI locale - either <meta name="og:locale" content="en_US"> or <meta name="webmaker:locale" content="en_US">

We have another bug to think about injecting script and other branding into published Makes, and I would suggest that we probably want to also automatically inject generated Open Graph and Twitter Card metadata.  That is, we probably don't want to force our users to do it, since most of it is easily generated by us (we can do checks to see if they have it, and prefer theirs to our generated content).

So my proposal is that we use HTML5 meta standards where we can, and webmaker:* where we can't, and not also mix in Open Graph, such that we can generate it later, or allow users to define it without mixing with our needs for the MakeAPI.

Given that, my document above would look like this:

<html>
<head>
  <title>Yo Hive!</title>
  <meta name="author" content="David Humphrey">
  <meta name="description" content="A project I made about Hive NYC">
  <meta name="webmaker:tags" content="HiveNYC,MadeByMozilla">
  <meta name="webmaker:thumbnail" content="http://ia.media-imdb.com/images/rock.jpg">
  <meta name="webmaker:locale" content="en_US">
</head>
<body>
I love Hive NYC!
</body>
</html>

When we publish and remix, we can inject the OG and Twitter Card metadata necessary to have this show up properly in social media, and use the data we have from webmaker:* to get it, possibly stripping the webmaker:* stuff as we do.
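
A minimal sketch of that injection step, assuming the metadata has already been extracted into an object with title, description, thumbnail and locale fields. The function names and the existing argument are hypothetical. The og:* and twitter:* names are the standard Open Graph and Twitter Card ones. Tags the user already wrote are skipped so theirs win over generated ones.

```javascript
// Escape text for use inside a double-quoted HTML attribute.
function escapeAttr(value) {
  return String(value)
    .replace(/&/g, '&amp;')
    .replace(/"/g, '&quot;')
    .replace(/</g, '&lt;')
    .replace(/>/g, '&gt;');
}

// Hypothetical helper: `metadata` uses the field names from the proposal;
// `existing` lists meta names/properties the user already wrote.
function renderSocialMeta(metadata, existing) {
  const have = new Set(existing || []);
  const candidates = [
    ['property', 'og:title', metadata.title],
    ['property', 'og:description', metadata.description],
    ['property', 'og:image', metadata.thumbnail],
    ['property', 'og:locale', metadata.locale],
    ['name', 'twitter:card', 'summary'],
    ['name', 'twitter:title', metadata.title],
    ['name', 'twitter:description', metadata.description],
    ['name', 'twitter:image', metadata.thumbnail]
  ];
  return candidates
    // Drop entries with no value, and entries the user already supplied.
    .filter(([, key, value]) => value && !have.has(key))
    .map(([attr, key, value]) =>
      '<meta ' + attr + '="' + key + '" content="' + escapeAttr(value) + '">')
    .join('\n');
}
```

The returned string could be spliced into <head> at publish time. Whether the webmaker:* elements are stripped at the same point is left open here, as in the comment above.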
Driveby here:

I noticed Pomax comment that "hopefully we aren't coming up with our own machine tags".

If we have a category of content that we want to surface with webmaker's navigation (i.e., at /learn we are surfacing everything tagged as a "challenge") - isn't this best done as a category of tags (which I assumed were machine tags) that we control and limit, rather than those which I'm seeing above which seem more like folksonomy tags? (HiveNYC,MadeByMozilla).
Pomax is talking about the `name` vs. the `content` part of the tag.  We have to standardize on the thing we try to parse out of the web page; however, the content can be anything.

To your point about the tag contents being random, two thoughts.  First, it won't be totally random--we'll privilege certain things (i.e., "hacktivity" or "template" can have special meaning for us in the system).  Second, we want to allow random stuff like this so that the value of the metadata can grow to accommodate new uses.

So we also need to come up with some of the privileged tag names we'll expect--that should be part of this ticket.
Thinking more about this, one other approach we could take for privileged tags.  The HTML5 spec allows for you to not provide `content` if it's the empty string, so we could do:

<meta name="webmaker:template">

In the end, this is going to go in the MakeAPI tags, but if you think it's better to single it out like this, we can do that too.
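
One way the two forms could be folded together, as a sketch: a content-less privileged name is treated as if it were one more entry in the tag list. The PRIVILEGED list is hypothetical (it reuses "template" and "hacktivity", the examples mentioned earlier in this bug), as is the function name collectTags.

```javascript
// Hypothetical list of privileged names.
const PRIVILEGED = ['template', 'hacktivity'];

// `metas` is an array of { name, content } pairs read from the document.
function collectTags(metas) {
  const tags = [];
  metas.forEach((m) => {
    if (m.name === 'webmaker:tags') {
      // Comma-separated folksonomy tags, de-duplicated.
      (m.content || '').split(',').forEach((t) => {
        const tag = t.trim();
        if (tag && tags.indexOf(tag) === -1) tags.push(tag);
      });
    } else if (m.name.indexOf('webmaker:') === 0) {
      const flag = m.name.slice('webmaker:'.length);
      // A content-less privileged name becomes a plain tag.
      if (!m.content && PRIVILEGED.indexOf(flag) !== -1 && tags.indexOf(flag) === -1) {
        tags.push(flag);
      }
    }
  });
  return tags;
}
```

Names that carry content, such as webmaker:locale, are left out of the tag list.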
Scott, where did we land on this?
I fixed all your review comments. Sadly it has blockers but I am going to be in good shape to land it once those blockers are clear. I'll add the blockers and more info to this ticket asap. On my phone atm.

The pain we're feeling right now with all these blockers is because of how heavily it uses the MakeAPI, which is new and untested. Chris, Matt and I have been working together on unblocking. This also gets us closer to the MakeAPI being deployed.
Thanks for the update, Scott.  Awesome to see this stuff coming together.

One thing helpful for Beltzner and I - if you could mark the "depends on" fields with what this is blocked by.
Depends on: 865439, 861816
Yeah, simply forgot to add blockers to this one in the bug shuffle. I'll be more mindful about adding blockers.

Also, a note about this and bug 861816.

Most of this ticket is actually going to land in bug 861816, to take a fairly large load off.

That would leave this ticket to just be adding new fields to make API.

This also means a chunk of humph's review comments are landing in bug 861816, which I and Pomax have noted.

I can have a demoable webmaker tutorial together for Friday if that's something we want to make a priority. (I think we should)
Thanks, Scott - just to clarify here: by tutorial you mean one of these hacktivity kits, correct?  We're using "tutorial" to be a separate thing, which is an overlay of instructions in Thimble and Popcorn Maker.  Different. I can explain tomorrow in IRC if you like.

What would be useful for Friday is to have some of the actual content that the mentor team is creating tagged as an "activity" - see bug #62862 on the actual content they're creating.

This link for example would be a good piece of real content they want to use: https://thimble.webmaker.org/p/lzsz/
d'oh, that's bug #862862 for mentor team content.
Hm, no. By tutorial I meant bug 864828, and not a hacktivity kit. I just connected them because this blocks the other. I could do the hacktivity kit instead. It doesn't seem too different in terms of metadata.
Blocks: 865709
Staged: https://github.com/mozilla/thimble.webmaker.org/commit/25b7351e661625b6bf79ff40793ab29b909e28d0
Status: ASSIGNED → RESOLVED
Closed: 11 years ago
Resolution: --- → FIXED