Markups, including HTML5 and XHTML

A Loose Set of Notes on RDFa, XHTML, and HTML5

Shelley Sat, 05/23/2009 - 11:33

There's been a great deal of discussion about RDFa, HTML5, and microdata the last few days, on email lists and elsewhere. I wanted to write down notes of the discussions here, for future reference. Those working issues with RDFa in Drupal 7 should pay particular attention, but the material is relevant to anyone incorporating RDFa.

Shane McCarron released a proposal for RDFa in HTML4, which is based on creating a DTD that extends support for RDFa in HTML4. He does address some issues related to the differences in how certain data is handled in HTML4 and XHTML, but for the most part, his document refers processing issues to the original RDFaSyntax document.

Philip Taylor responded with some questions, specifically about how xml:lang is handled by HTML5 parsers, as compared to XML parsers. His second concern was how to handle XMLLiteral in HTML5, because the assumption is that RDFa extractors in JavaScript would be getting their data from the DOM, not processing the characters in the page.

"If the object of a triple would be an XMLLiteral, and the input to the processor is not well-formed [XML]" - I don't understand what that means in an HTML context. Is it meant to mean something like "the bytes in the HTML file that correspond to the contents of the relevant element could be parsed as well-formed XML (modulo various namespace declaration issues)"? If so, that seems impossible to implement. The input to the RDFa processor will most likely be a DOM, possibly manipulated by the DOM APIs rather than coming straight from an HTML parser, so it may never have had a byte representation at all.
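Philip's point about a DOM that never had a byte representation can be sketched in a few lines. This is only an illustration, assuming Python's standard `xml.dom.minidom` as a stand-in for a browser DOM; the `xml_literal` helper is a hypothetical name, and it ignores the namespace and canonicalization rules a real XMLLiteral would need.

```python
from xml.dom import minidom

# Build the content of an <h2 property="dc:title"> element through DOM calls
# only. No markup string is ever parsed, so there are no "bytes in the file".
doc = minidom.Document()
h2 = doc.createElement("h2")
h2.setAttribute("property", "dc:title")
h2.appendChild(doc.createTextNode("E = mc"))
sup = doc.createElement("sup")
sup.appendChild(doc.createTextNode("2"))
h2.appendChild(sup)
h2.appendChild(doc.createTextNode(": The Most Urgent Problem of Our Time"))


def xml_literal(element):
    """Serialize an element's children (hypothetical stand-in for an XMLLiteral)."""
    return "".join(child.toxml() for child in element.childNodes)


literal = xml_literal(h2)
print(literal)
```

The only markup that exists here is whatever the serializer produces from the tree, which is why a rule phrased in terms of the input being well-formed is hard to apply to a DOM-based processor.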

There's a lively little sub-thread related to this one issue, but the one response I'll focus on is Shane's, who replied that RDFa does not pre-suppose a processing model in which there is a DOM. The issue of xml:lang is also still under discussion, but I want to move on to new issues.

While the discussion related to Shane's document was ongoing, Philip released his own first look at RDFa in HTML5. Concern was immediately expressed about Philip's copying of some of Shane's material in order to create a new processing rule section. The concern had nothing to do with copyright; it was about the problems that can occur when you have two sets of processing rules for the same data and the same underlying data model. No matter how careful you are, at some point the two are likely to diverge, and the underlying data model becomes corrupted.

Rather than spend time on Philip's specification directly at this time, I want to focus, instead, on a note he attached to the email entry providing the link to the spec proposal. In it he wrote:

There are several unresolved design issues (e.g. handling of case-sensitivity, use of xmlns:* vs other mechanisms that cause fewer problems, etc) - I haven't intended to make any decisions on such issues, I've just attempted to define the behaviour with sufficient detail that it should make those issues visible.

More on case sensitivity in a moment.

Discussion started a little more slowly for Philip's document, but is ongoing. In addition, both Philip and Manu Sporny released test suites. Philip's is focused on highlighting problems when parsing RDFa in HTML as compared to XHTML; the one that Manu posted, created by Shane, focused on a basic set of test cases for RDFa, generally, but migrated into the RDFa in HTML4 document space.

Returning to Philip's issue with case sensitivity, I took one of Shane's RDFa in HTML test cases, and the rdfquery JavaScript from Philip's test suite, and created pages demonstrating the case sensitivity issue. One such is the following:

<!DOCTYPE HTML PUBLIC "-//ApTest//DTD HTML4+RDFa 1.0//EN" "http://www3.aptest.com/standards/DTD/html4-rdfa-1.dtd">
<html
xmlns:t="http://test1.org/something/"
xmlns:T="http://test2.org/something/"
xmlns:dc="http://purl.org/dc/elements/1.1/">
<head>
<title>Test 0011</title>
</head>
<body>
<div about="">
Author: <span property="dc:creator t:apple T:banana">Albert Einstein</span>
<h2 property="dc:title">E = mc<sup>2</sup>: The Most Urgent Problem of Our Time</h2>
</div>
</body>
</html>

Notice the two namespace declarations, one for "t" and one for "T". Both are used to provide properties for the object being described in the document: t:apple and T:banana. An RDFa application that applies XML rules treats the prefixes "t" and "T" as two different namespaces. It has no problem with the RDFa annotation.

However, the rdfquery JavaScript library, which treats "t" and "T" the same because of HTML case insensitivity, throws an exception: Malformed CURIE: No namespace binding for T in CURIE T:banana. Stripping away the RDFa aspects, and focusing on the namespaces, you can see how browsers handle namespace case in an HTML document and in a document served up as XHTML. To make matters more interesting, check out the two pages using Opera 10, Firefox 3.5, and the latest Safari. Opera preserves the case, while both Safari and Firefox lowercase the prefix. Even within the HTML world, the browsers handle namespace case differently. However, all of them handle the prefixes the same way, and correctly, in XHTML. So does the rdfquery JavaScript library, as this test page demonstrates.
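To make the failure concrete, here is a rough sketch of prefix resolution under the two sets of rules. It is not rdfquery's code. The function names are hypothetical, and the HTML branch assumes a parser that lowercases attribute names and keeps only the first of two attributes that end up with the same name, while the CURIE inside the property value keeps its case. The error text is copied from the exception quoted above.

```python
def prefix_map(attributes, lowercase_names):
    """Collect xmlns:* declarations from (name, value) attribute pairs.

    lowercase_names=True imitates an HTML parser that lowercases attribute
    names and keeps the first of two same-named attributes (an assumption
    about the browsers described above, not something measured here).
    """
    mapping = {}
    for name, value in attributes:
        if lowercase_names:
            name = name.lower()
        if name.startswith("xmlns:"):
            prefix = name[len("xmlns:"):]
            mapping.setdefault(prefix, value)
    return mapping


def expand_curie(curie, mapping):
    """Expand prefix:reference against the map; the lookup is case-sensitive."""
    prefix, _, reference = curie.partition(":")
    if prefix not in mapping:
        raise ValueError(
            f"Malformed CURIE: No namespace binding for {prefix} in CURIE {curie}"
        )
    return mapping[prefix] + reference


# The declarations from the test page's <html> element.
HTML_ATTRIBUTES = [
    ("xmlns:t", "http://test1.org/something/"),
    ("xmlns:T", "http://test2.org/something/"),
    ("xmlns:dc", "http://purl.org/dc/elements/1.1/"),
]

xml_map = prefix_map(HTML_ATTRIBUTES, lowercase_names=False)
html_map = prefix_map(HTML_ATTRIBUTES, lowercase_names=True)

print(expand_curie("T:banana", xml_map))  # resolves under XML rules
try:
    expand_curie("T:banana", html_map)
except ValueError as error:
    print(error)  # the "T" binding was folded into "t" and is gone
```

Under XML rules both prefixes survive; under the assumed HTML behavior the uppercase declaration disappears before the processor ever sees it.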

Returning to the discussion, there is some back and forth on how to handle case sensitivity issues related to HTML, with suggestions varying widely: tossing the RDFa in XHTML spec out and creating a new one; tossing RDFa out in favor of Microdata; creating a best practices document that details the problem and provides appropriate warnings; or creating a new RDFa in HTML document (or modifying the existing profile document) specifying that all conforming applications must treat prefix names as case insensitive in HTML (possibly cross-referencing the RDFa in XHTML document, which allows case sensitive prefixes). I am not in favor of the first two options. I do favor the latter two, though I think the best practices document should strongly recommend using lowercase prefix names, and definitely not using two prefixes that differ only by case. During the discussion, a new conforming RDFa test case was proposed that tests based on case. This has now started its own discussion.

I think the problem of case and namespace prefixes (not to mention xmlns as compared to XMLNS) is very much an edge issue, not a show stopper. However, until a solution is formalized, be aware that xmlns prefix case is handled differently in XHTML and HTML. All else being equal, consider using only lowercase prefixes when embedding RDFa (or any other namespace-based functionality). In addition, do not use XMLNS. Ever. If not for yourself, do it for the kittens.
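The advice above is easy to check mechanically. The following is a minimal sketch of such a check, assuming you already have the attribute names exactly as written in the source; `prefix_warnings` is a hypothetical helper, not part of any RDFa tool.

```python
def prefix_warnings(attribute_names):
    """Return warnings for xmlns declarations that are fragile in HTML."""
    warnings = []
    seen = {}  # lowercased prefix -> prefix as first written
    for name in attribute_names:
        head, _, prefix = name.partition(":")
        if head.lower() != "xmlns" or not prefix:
            continue  # not a prefixed namespace declaration
        if head != "xmlns":
            warnings.append(f"{name}: write 'xmlns' in lowercase")
        if prefix != prefix.lower():
            warnings.append(f"{name}: prefix '{prefix}' is not lowercase")
        folded = prefix.lower()
        if folded in seen and seen[folded] != prefix:
            warnings.append(
                f"{name}: prefix '{prefix}' differs from '{seen[folded]}' only by case"
            )
        seen.setdefault(folded, prefix)
    return warnings


for warning in prefix_warnings(["xmlns:t", "xmlns:T", "xmlns:dc", "XMLNS:foaf"]):
    print(warning)
```

A page that declares only lowercase prefixes produces no warnings, and that is the state to aim for until the specifications settle the question.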

Speaking of RDFa in HTML issues, there is now a new RDFa in HTML issues wiki page. Knock yourselves out.

Update: A new version of the RDFa in HTML4 profile has been released. It addresses some of the concerns expressed earlier, including the issue of case and XMLLiteral. Though HTML5 doesn't support DTDs, as HTML4 does, the conformance rules should still be good for HTML5.

A battle of Beliefs: RDF, Natural Language Processing, and the future of the web

Shelley Sun, 02/15/2009 - 10:15

Update: I rest my case regarding my assertion that underlying biases are influencing perceptions regarding RDFa and HTML5.


Last Week in HTML has been practicing its wicked ways, and pulled a quote from a comment I made to a post at Sam Ruby's:

Ian is wrong. Absolutely, completely, and dead wrong.

...

rather than Ian shouting out “Hurrah!”, he says we must have five different solutions to the five problems, because to do otherwise is to...what? Give up control? Fail to meet the Guinness Book of World Records for largest, most pedantic specification ever derived by man?

At first glance, this seems a repetition of an argument that is growing thin with overuse, but the recent discussions in the RDFa mailing list, about RDFa in HTML5, provide a clear demonstration of the basic disconnect between the parties. Enough so to make it of value to revisit the discussion.

On the one hand, you have RDFa, which is a serialization of RDF, which is a formal data model providing support for a universal form of structured data. On the other hand, you have those whose ideology for the future of the web is based on natural language processing. This is an old, old battle and one we've been fighting since RDF was first proposed, and really even before that: I remember working through ideological differences between natural language processing and structured data techniques in various projects at Boeing in the 1980s.

One would think, then, considering the age of the debate, that we wouldn't fight this old battle in the lists for HTML5. Why? Because it exists above and beyond just HTML5. It is a debate about the fundamental nature of the web, at its most general and profound level, while HTML5 is really nothing more than the next generation of HTML. However, we are fighting this macro battle out in the micro lists of HTML5, but deceptively so.

Those who support RDFa have been continuously asked to provide use cases for RDFa, and have created a wiki page to record these use cases. But each time the use cases are proposed, we're told that they are inadequate, and handed a different set of criteria for how they can be "improved". It is frustrating to the RDFa adherents, stumbling about in the dark hoping to hit exactly the right "fit" in order to satisfy these never ending requests.

In the new thread, though, the underlying ideological differences are peering out through the fabric of technical obfuscation, and we see the real purpose behind the demands for RDFa to justify its existence in HTML5. We're not being asked to justify RDFa in HTML5; we're being asked to justify RDF, and beyond that, we're being asked to justify the concept of structured data. Not just once, but for every instance of a use case.

Ian Hickson writes in one comment in the mailing list thread:

I wouldn't worry too much about the various solutions in each case -- a list of solutions can never be complete, and people will never agree on what consists a pro and a con. What would be useful, though, is an example of how RDFa is expected to solve the problem, e.g. with sample markup showing how the relevant data might be encoded and code snippets showing how the data would then be processed; and a discussion of ways to deal with the likely problems (e.g., for this particular use case: how to deal with authors screwing up and encoding bad data, how to deal with apathy from sites that you want to scrape data from, how to deal with malicious authors encoding misleading data, how to deal with spammers, how to deal with requirements like Amazon's desire to track per-developer usage, how to enable monetization for producers who are intentionally obfuscating the data today, etc. I expect other use cases will have different problems).

The first set of requests is reasonable, and has been answered with demonstrations. I use RDFa in my site to document each post with a formal title, author, date, and set of topics, each of which can be extracted using a PHP API that I've installed at my site. I plan on using this data in order to generate my front page eventually. This same data can be extracted with a Firefox toolbar, too, if I'm so inclined, and used to output an RDF document for others to consume. The data has also been extracted as part of Yahoo's SearchMonkey effort, I do believe.
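For anyone who wants the "sample markup and code snippet" shape of that answer, here is a minimal sketch. It is not the PHP API mentioned above, and the markup is hypothetical rather than copied from my pages. It uses Python's standard `html.parser` and handles only a tiny subset of RDFa: literal text of elements carrying a property attribute, with no CURIE expansion and no @content, @rel, or @datatype support.

```python
from html.parser import HTMLParser

# Hypothetical markup for one post, using Dublin Core style property names.
SAMPLE = """
<div about="/posts/example">
  <h2 property="dc:title">An Example Post</h2>
  <span property="dc:creator">Shelley</span>
  <span property="dc:date">2009-02-15</span>
</div>
"""


class PropertyCollector(HTMLParser):
    """Collect (subject, property, text) for elements with a property attribute.

    Limitations of this sketch: property elements may not nest, and void
    elements such as <br> inside a property element would confuse the depth
    counter.
    """

    def __init__(self):
        super().__init__()
        self.subject = ""
        self.current = None  # property attribute being read, if any
        self.depth = 0       # open child elements inside that element
        self.text = []
        self.statements = []

    def handle_starttag(self, tag, attrs):
        attributes = dict(attrs)
        if self.current is not None:
            self.depth += 1
            return
        if "about" in attributes:
            self.subject = attributes["about"]
        if "property" in attributes:
            self.current = attributes["property"]
            self.depth = 0
            self.text = []

    def handle_data(self, data):
        if self.current is not None:
            self.text.append(data)

    def handle_endtag(self, tag):
        if self.current is None:
            return
        if self.depth > 0:
            self.depth -= 1
            return
        # One statement per name listed in the property attribute.
        for name in self.current.split():
            self.statements.append((self.subject, name, "".join(self.text)))
        self.current = None


collector = PropertyCollector()
collector.feed(SAMPLE)
collector.close()
for statement in collector.statements:
    print(statement)
```

A real extractor has far more rules to follow, but the basic loop of reading attributes, collecting literals, and emitting statements is no more exotic than any other markup processing.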

Others have provided examples of the Creative Commons licenses, and FOAF, and other uses of RDF/RDFa. Not only the purpose behind the use, but even demonstrations of how the data can be combined across pages. These seem to meet the requests for demonstrating code to both incorporate the RDFa in HTML5, as well as code to pull such data out.

As for authors screwing up and providing bad data, well I have to assume the same mechanisms in place, in the browser, when a person inputs bad data into an alt attribute (if it survives in HTML5) would be in place for bad data in a property attribute. And if the data is coded incorrectly, applications expecting valid RDFa wouldn't be able to process the data, but that's little different than applications not being able to process a bad script, or malformed piece of SVG, or even a crappy video file, embedded in the page.

The questions I just responded to are legitimate questions. They serve a purpose, and a person can determine by looking at these questions what needs to be provided to ensure success of the use case. But then we start getting into murkier territory. Ian asks, how to deal with apathy from sites that you want to scrape data from, how to deal with malicious authors encoding misleading data, how to deal with spammers, how to deal with requirements like Amazon's desire to track per-developer usage, how to enable monetization for producers who are intentionally obfuscating the data today, ...

My god, how do we deal with these on the web today? HTML, itself, fails badly with all of these, so do we give up on HTML? If not, then why are we demanding a state of rigor from RDFa that we're not willing to apply to HTML5, itself?

If you think this latter set of questions was tongue-in-cheek, perhaps a bit of markup levity, Ian repeats them, later, in the same thread:

Do we have reason to believe that it is more likely that we will get authors to widely and reliably include such relations than it is that we will get high quality natural language processing? Why?

How would an RDF/RDFa system deal with people gaming the system?

How would an RDF/RDFa system deal with the problem of the _questions_ being unstructured natural language?

How would an RDF/RDFa system deal with data provided by companies that have no interest in providing the data in RDF or RDFa? (e.g. companies providing data dumps in XML or JSON.)

How would an RDF/RDFa system deal with companies that do not want to provide the data free of charge?

How would an RDF/RDFa system deal with companies that want to track per-developer usage of their data?

One could ask all but the first question about HTML, and not find satisfactory answers. Yet we're being asked to provide sufficient answers to these questions for a small subset of attributes in HTML5, which would form the basis of support for RDFa. As for the first question, Do we have reason to believe that it is more likely that we will get authors to widely and reliably include such relations than it is that we will get high quality natural language processing?, this, again, brings us back to a fundamental difference in ideology, natural language processing as compared to structured data. How can one deal with such a profound difference in something like a use case?

To repeat what I said earlier, the issue isn't about RDFa in HTML5. It is about the existence of structured data on the web. It is about the underlying purpose behind RDF. It calls into question a decade's worth of work, based on the input of hundreds if not thousands of developers and designers. It is questioning the fundamental separation of ideology between the web of the future based on natural language processing and the web of the future based on structured data. But where the structured data folks, those who support RDF, and RDFa, welcome natural language processing as a complementary process, the natural language processing folks seem to see the very existence of structured data woven into web documents as anathema.

Now, someone tell me how we can break through this wall with use cases?

Dan Brickley chastises those on the RDFa group who see this as a battle, writing:

This is not a battle. Battles kill people. It is a dispute amongst technologists who have varying assumptions, backgrounds, collaboration networks and agendas, and who are slowly learning to see each other's perspective.

Please (and I am very serious here) stop using such bloody metaphors to describe what should be a civil and mutually respectful collaborative process. You will not improve anything if you foster this kind of perspective on our shared problems. Battle talk results in a battle mindset. I do not want to hear any RDFa advocates talking in such terms.

Really, enough with the battle stuff. Go find someone who works on HTML5 and be nice to them, find common ground, try out their tools.

Play nice...try out their tools.

I have tried the tools, and in fact just tried the HTML5 validator with the SVG, MathML, and RDFa (minus Curie) preset, and aside from the fact that it tossed my DOCTYPE, didn't like my profile attribute, some of my meta elements, and the use of "none" as a value for preserveAspectRatio in my SVG, the validator had no problems with any of my RDFa. I would have to assume, then, that we have seen a demonstration of RDFa in HTML5...and found it good? And lo and behold, the RDFa extractors have also found the same page, and the same use of RDFa, to be good. Hands across the water.

But evidently, not sufficient. What else must we do to play nice? Well, Sam has laid out the "nice filter" in comments to his post that began this particular thread:

What would it take for inclusion of the RDFa attributes in HTML 5 to be tracked in the W3C HTML Working Group issues list? Given the links I provided at the top of this post, I’d say that pretty much all of the pieces are in place except for a discussion on the public-html mailing list.

What work would be helpful in getting this to be resolved successfully? Fleshing out the use cases addressing as much of these concerns as are relevant.

How can you help? Join the WG and/or contribute to the wiki.

Just so that it is clear, as we move towards summer I plan to become ruthless in clearing out issues which have been raised but don’t appear to have any substantive proposals or support. There is much good work in HTML5 and it would be positively criminal for it not to advance due to procedural maneuverings. I don’t intend to let that happen either.

And this then leads us back to the questions posed by Ian, above. For each use case, must I then justify RDF? Structured data? Must I give details about how spammers will be vanquished, and evil corporations not allowed to monetize such effort? Must I provide a 12-step program in how to lure the reluctant microformat user into the fold? Does the fact that Virgin Mobile misused the Creative Commons license to publish photos of people without getting model releases, mean that the use of RDF/RDFa to document a Creative Commons license can never be a valid use case? After all, it fails the evil corporate use case requirement being demanded of RDFa.

There seems to exist a gentleman's agreement in these specification email lists, whereby the participants humor absurd questions such as those proposed by Ian. Well, thank goodness I'm no gentleman.

If the RDFa in HTML5 adherents will be required to provide not only justification for RDFa, but also justification for RDF, as a whole, in addition to a dialog and debate about the fundamental differences between natural language processing and structured data with each and every use case, then I fail to see the "niceness" supposedly in play here. It's difficult, too, to see exactly what we're supposed to do to bring about this so-called "common ground". Ultimately, structured data people see natural language processing as complementary, and believe that there's room on the web for both ideologies. The natural language processing folks see structured data as competitive, and believe that the web of the future will be based on one or the other, but not both. How do you work through that kind of difference?

8-track

Shelley Sun, 01/25/2009 - 17:20

Respect

Shelley Fri, 01/16/2009 - 14:35

I have spent too much time worrying about specifications managed by people who, frankly, don't have a lot of respect for what I have to say. I am not a browser developer, specification author, nor do I fit within the narrow parameters of "people who are seen to be contributors".

Years ago, I defined the term Coders-Only-Club, to designate the feeling of being an outsider, unless one acts a certain way, or does a certain thing. I can say unequivocally that writing books or weblog posts does not ensure entry into the Coders-Only-Club, or perhaps I should term it, "Contributors-Only-Club". To be honest, writing simple tutorials or examples, helping people, or answering questions doesn't gain one entry, either.

What's absurd about the whole thing is I'm fighting for something I don't really need, because I do have viable alternatives I can use with my own work. I deliver every page at my web sites as application/xhtml+xml, which gives me singular power to accomplish wonderful things. I doubt, very much, that any browser is going to drop XHTML support for many, many years to come, so I can continue to incorporate SVG, or RDFa, or any number of new vocabularies that haven't even been invented yet.

Frankly, I'm just wasting my time worrying about things I can't change.

HTML5: Put Up or Shut Up

Shelley Fri, 01/16/2009 - 12:34

Sam Ruby writes:

I question the presumption implicit in the notions of “the” editor, and “the” spec. I reluctantly accept the notion that any individual spec development process need not employ processes requiring consensus or voting, but I reject any implication, however subtle, of inevitability or entitlement.

Simply put, there needs to be a recourse if a person or a group disagrees with a decision made by the editor of the WHATWG document. That recourse is forking.

I realize that that is a very high bar, and will say that is intentionally so. Simply put, specs don't write themselves... I don't care how good you think your idea is, either you need to step up and directly write the spec text yourself, or accept that you need to be persuasive.

Quite simply, that is the most absurd set of statements I have ever read. What Sam is saying is: if you don't like it, fork, or shut up.

Have to be persuasive? How can one be persuasive when there are underlying biases and prejudices in play that make it impossible to ever...ever persuade the gatekeepers to change their minds? Or even open their minds?

So the alternative that Sam allows us is to fork the entire HTML specification. Unlike some people involved in this discussion, most of us are not employed by large corporations, and cannot spend all of our time reading mailing lists or participating in specification work. Most of us have to do other things in order to pay the rent, or buy food.

But we are still dependent on the same specifications, still concerned that what comes out of a group such as the HTML5 working group is the best specification for as many people as possible—not just representatives from one or two companies who control the HTML5 specification development with a fist clad in an arrogance as dense as the thickest iron.

As for contributing to the group, the HTML5 editor did put something out, recently, on the mailing list about other editors. The requirements demanded of these volunteers were such that few of us could even consider applying. I can't guarantee I have 20+ hours to devote every week. I can't guarantee that I can fly to meetings with other editors, no, not even once a year. The most I, and others like me, can guarantee is that we would try our best, but keeping the roofs over our heads has to be our first priority. When was the last time the powers that be behind the HTML5 effort opened their windows and got a good whiff of our troubled times?

I also resent the assumption that those of us not directly contributing to the editing of a specification are not contributing. Contrary to what Sam seems to believe, we don't need to be a member of a specification group, or an editor of a specification, to contribute to the overall success of the specification. People who write about the specifications, in books or articles, or who provide tutorials, example applications, libraries, help others—we contribute just as much as those who formally create the specs. The only difference is that our names don't get listed, we rarely get credit, and evidently, according to Sam, we shouldn't express any concerns, or frustrations, either.

Well, perhaps that is the way of the world for HTML5, but thankfully it hasn't been that way for any other web specification I use, including XHTML, CSS, RDF, SVG, and so on. Oh, we still may not be able to influence these specifications, but I've not seen any of these groups give so much power over the direction of the specifications to so few. I've not once been told, by any of the people behind those specifications, to either put up or shut up.