March 13th, 2008

Great idea on the part of Yahoo to begin incorporating semantic web information into its open search platform. How deep the semantics will go, and in how many directions, is still TBA, but I'm pleased to see interest in microformats and more structured semantic data via RDF. I'll be even more pleased when we start to see working examples.

Marshall Kirkpatrick believes that Google will follow suit. I just don't see it. Google might embrace microformats, but the company has long pitted its algorithms against human annotation of data, and the semantic web is based on some human annotation–even if the annotation is based, indirectly, on checking an option in a page.

My biggest concern about all of this was that we would limit semantics to microformats. It's with relief that I see that Yahoo is going beyond just microformats into the broader scope of structured semantics based on RDF and its various serializations. Paul Miller also brings up other needed caveats:

The tools to create and embed that structure need to follow, of course. And issues that efforts like Dublin Core struggled with over a decade ago need to be thrashed out in some more detail, as the malicious, the malevolent, the careless and the mischievous rush to ‘game’ the rich structured data with which their web pages will soon be filled.

Putting pressure on the tool makers is essential, though probably not as essential as it once was because most tools provide a plug-in infrastructure that enables expansion. Still, there's a lot more that tools can do, which is one reason why I've been so interested in Drupal: this tool is definitely ahead of the curve.

What's key to all of this is showing people what they can get if they go that little extra step. I read people who write reviews on books. If we start showing more intelligent search results based on a little additional information in their writings, information reflecting that the work is a review of a certain book by a certain author, etc., they will, most likely, be willing to spend a little time adding it.

Someday when I'm looking for a new book to download from the web, I'll be able to pull up a browser in my Kindle ebook reader and see all the reviews written about this book, online. Everywhere. We are so close to making this work, and I'm not normally the type to tap dance every time someone comes along, breathing the words "semantic web" through lips moist with anticipation.

Yahoo should have received a hostile takeover bid a long time ago. Lately, the company has been galvanized.

February 28th, 2008

Mathew Ingram has decided that the problem with the semantic web is that it’s as boring as dry toast. Of course, by Mathew's standard, all the stuff that makes the web work is also boring as hell. It's probably a good thing, then, that some people looked beyond the need for immediate titillation when it comes to the tech underlying this environment, or Mathew's audience for his opinions would be his immediate family members, and perhaps those neighbors not quick enough to run away when seeing him approach.

He also writes:

It’s all about plumbing and widgets and data standards, all of which have names like FOAF and TOTP and SIOC and whatnot. It’s right off the dork-o-meter. The Lone Gunmen from The X-Files would have a hard time getting interested in this stuff, let alone anyone who isn’t married to their slide rule or their pocket protector.

Now, taking Mathew's complaints of No glitter! No glitter! Mama, Mama, where's my glitter! seriously, I decided to put my slide rule down for a sec and see if I couldn't respond to his one statement about no one knowing what this all means.

First, there was the web. The web was dumb, but it was hyperlinked.

Then, there was search. Search followed hyperlinks, scraped pages, massaged keywords and tested the strength of the links. The web was still dumb, but number crunching helped generate some smarts. Think of your favorite dog. Yeah, that smart.

Next, there was the semantic web. The semantic web says, You and I can derive understanding from this blob of text on this page, but applications can't. Applications can pull keywords and run algorithms, but can only approximate what this blob of text is all about. What if we add a little information to this blob of text so that applications don't have to crunch numbers or make guesses as to what we mean?

How do we add a little information? A hundred different ways. We can use microformats, or RDFa, or RDF, or whatever the HTML5 people cook up for us. With this little bit of extra information, applications can access a web page list that's created with UL/LI elements, but instead of having to look at the text in the list and try to guess what the list is all about, they can read that little bit of data and know that the list consists of recommended books. Perhaps they can take that little list of books and use another application to look up these books at Amazon. Or at the library. Or better yet, let us click a button and load all the books into our Kindle. (Assuming that Mathew doesn't subscribe to the Steve Jobs "We don't read, we ain't got no books, gimme the vids" school of thought.)
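For those who haven't yet fled from the slide rule, here is a rough sketch of what "reading that little bit of data" could look like. It uses only Python's standard `html.parser`. The class names `recommended-books` and `book-title` are invented for this example and don't come from any published microformat, and the sample page is made up too.

```python
from html.parser import HTMLParser

# Hypothetical markup: "recommended-books" and "book-title" are invented
# class names for this sketch, not part of any published microformat.
PAGE = """
<ul class="recommended-books">
  <li><span class="book-title">Unix Power Tools</span></li>
  <li><span class="book-title">Moby-Dick</span></li>
</ul>
<ul>
  <li>Milk</li>
  <li>Eggs</li>
</ul>
"""


class BookListParser(HTMLParser):
    """Collect titles only from lists annotated as recommended books."""

    def __init__(self):
        super().__init__()
        self.in_book_list = False
        self.in_title = False
        self.titles = []

    def handle_starttag(self, tag, attrs):
        classes = (dict(attrs).get("class") or "").split()
        if tag == "ul" and "recommended-books" in classes:
            self.in_book_list = True
        elif self.in_book_list and "book-title" in classes:
            self.in_title = True
            self.titles.append("")

    def handle_endtag(self, tag):
        # Nested lists are not handled; this is only a sketch.
        if tag == "ul":
            self.in_book_list = False
        elif tag == "span":
            self.in_title = False

    def handle_data(self, data):
        if self.in_title:
            self.titles[-1] += data


def recommended_books(html):
    parser = BookListParser()
    parser.feed(html)
    return [title.strip() for title in parser.titles]
```

The unannotated grocery list is ignored without any guessing about what "Milk" might mean. The application never has to work out which list is about books, because the page says so.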

The little bit of information might, instead, be an address for an event, triggering the browser to add that event information to a desktop calendar application.

It could be information about people we know and how we know them, so that when we move from Facebook, which is today's darling, to MyPowerBase, we can tell MyPowerBase to add all people who we have defined as friends, but not those defined as just contacts.
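One way this might work, assuming the relationships are published as XFN-style `rel` values on ordinary links (`friend`, `contact` and `met` are real XFN values; the people and URLs below are made up):

```python
from html.parser import HTMLParser

# Hypothetical exported contact page. The names and URLs are invented.
CONTACTS = """
<a href="http://example.com/alice" rel="friend met">Alice</a>
<a href="http://example.com/bob" rel="contact">Bob</a>
<a href="http://example.com/carol" rel="friend">Carol</a>
<a href="http://example.com/about">About this site</a>
"""


class RelParser(HTMLParser):
    """Record every link that carries at least one rel value."""

    def __init__(self):
        super().__init__()
        self.people = []  # list of (url, set of rel values)

    def handle_starttag(self, tag, attrs):
        if tag != "a":
            return
        attrs = dict(attrs)
        rels = set((attrs.get("rel") or "").split())
        if rels and attrs.get("href"):
            self.people.append((attrs["href"], rels))


def urls_with_rel(html, wanted):
    """Return the URLs whose rel attribute includes the wanted value."""
    parser = RelParser()
    parser.feed(html)
    return [url for url, rels in parser.people if wanted in rels]
```

Calling `urls_with_rel(CONTACTS, "friend")` picks out Alice and Carol and leaves Bob, the mere contact, behind. A new service would need nothing more than that to import friends only.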

If the information is embedded in a photo–wow, information embedded in a photo, how dull–when we upload the photo to a site like Flickr, it could automatically be added to a map, with all the other photos from the same location. It can be pulled up on a search someday, when we ask the web to show us all photos for St. Louis, or for a certain block in St. Louis. Perhaps it can even help us find photos that are licensed Creative Commons so we can steal them.

I might write about a product or company, and the little bit of information I add to my post might help others who are thinking of doing business with the company, or buying that product. Sure, search engines can scrape the content and try to glean useful bits based on keywords such as the product or company name, but we've all had enough really strange search results to know how far search can go, no matter how brainy the algorithm.

Someday, I'll be able to write about books and add just a little bit of extra information, and we can do the same for movies. Or music. Or cooking recipes ("give me all recipes on the web that use apricot jam and bourbon, but I don't want chicken"). Or even poetry, though don't mention poetry around Sir Tim–it makes him peevish.

Mathew is very addicted to FriendFeed, which allows him to pull in all the activities of his friends in various places. I bet if we scratched the surface of this application, a lot of the data that makes the application tick comes courtesy of the semantic web dorks.

I could go on and on, but I've already been away from my slide rule too long. Instead, I'll save the best for last: because all of these different ways of adding that tiny little bit of useful information to blocks of text or photos or video files or what have you are based on agreed-upon specifications, we can use applications to merge this data and use it for something new; something we haven't thought of yet. See, now that's when it really gets exciting because rather than coming up with an idea and then taking five years to get enough data to test it, we'll already have the data, at no extra effort or cost.
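To make the merging a bit more concrete, here is a toy sketch in plain Python. It stands in for a real RDF library and assumes each source has already been boiled down to (subject, predicate, object) triples. Every identifier and predicate name is hypothetical.

```python
# Two independent sources, each a set of (subject, predicate, object) triples.
# All identifiers and predicate names are hypothetical.
reviews = {
    ("post:1", "reviews", "book:upt"),
    ("post:2", "reviews", "book:upt"),
}
catalog = {
    ("book:upt", "title", "Unix Power Tools"),
}


def merge(*sources):
    """Merging triples that share a vocabulary is just a set union."""
    merged = set()
    for source in sources:
        merged |= source
    return merged


def subjects(graph, predicate, obj):
    """Return every subject that has the given predicate and object."""
    return sorted(s for s, p, o in graph if p == predicate and o == obj)


def reviews_by_title(graph, title):
    """Answer a question that neither source can answer alone."""
    books = subjects(graph, "title", title)
    return sorted(
        post for book in books for post in subjects(graph, "reviews", book)
    )
```

Neither set knows which posts review "Unix Power Tools" by name. The merged graph does, and nobody had to plan for that question in advance.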

Maybe I've been cooped up in my cube with my computers and code for too long, but that strikes me as kind of interesting. In a dorky sort of way.

February 10th, 2008

On today's tenth anniversary of the birth of XML, Norm Walsh writes:

I joined O'Reilly on the very first day of an unprecedented two-week period during which the production department, the folks who actually turn finished manuscripts into books, was closed. The department was undergoing a two-week training period during which they would learn SGML and, henceforth, all books would be done in SGML…My job, I learned on that first day, would be to write the publishing system that would turn SGML into Troff so that sqtroff could turn it into PostScript. “SGML”, I recall thinking, “well, at least I know how to spell it.”

Ah yes. "Unix Power Tools" was formatted as SGML, the one and only book at O'Reilly I worked on that wasn't in a Word format. I must express a partiality to my NeoOffice, though the SGML system was ideal for cross-referencing and indexing. OpenOffice ODT, or OpenDocument text, will be the most likely format for the next UPT. Just another example of the permanence/impermanence of web trends.

Norm also mentions HTML5 possibly being the nail in the coffin of this child of SGML, but as I wrote recently, the folks behind HTML5 have solemnly assured us this specification also includes XHTML5. I'd hate to think we're giving up on the benefits of XHTML just when they're finally being realized by a more general audience.

Of course, I'm also fond of RDF/XML, which seems to cause others a great deal of pain, the pansies. And I've never hidden my SVG fandom, and SVG is based on XML. I must also confess to preferring XML over JSON–you know, good enough for granddad, good enough for me. Atom rules. Or is that, Atom rocks? I'm also sure XML has squeezed between the joints of many of my other applications, and I just don't know it.