Black Hole Revelations: Understanding Flash

This week Google and Yahoo announced that, over 10 years after web users were first haunted by Flash intro splash screens, their search engines will finally be able to index the content of SWF files. Adobe Flash is the most prevalent web platform today, available on 98% of desktop browsers, yet content locked up in binary SWF files has been a big black hole in the web that search engines and other services have not been able to read and understand.

The solution Adobe has offered to both Google and Yahoo (and probably to that other search-providing company) is a special ‘Flash player’ that allows the search engine to dive into existing SWF files. It might be akin to a decompiler, in that the raw objects are extracted and the text is then parsed out (decompiling Flash 9 is very possible).
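For a sense of what ‘diving into’ a SWF involves, here is a minimal sketch – emphatically not Adobe's actual search player, and using a hypothetical filename – that walks the tag structure of a plain or zlib-compressed SWF and pulls the initial text out of dynamic text fields (DefineEditText tags). Static text tags are only noted, because they store glyph indices rather than characters and would need the font tables as well.

```python
# Illustrative sketch of SWF tag walking; not Adobe's indexing player.
import struct
import zlib

TEXT_TAGS = {11: "DefineText", 33: "DefineText2", 37: "DefineEditText"}

def swf_body(path):
    """Return the uncompressed SWF body (everything after the 8-byte header)."""
    with open(path, "rb") as f:
        data = f.read()
    if data[:3] == b"CWS":                   # zlib-compressed body
        return zlib.decompress(data[8:])
    if data[:3] == b"FWS":                   # uncompressed body
        return data[8:]
    raise ValueError("not a plain or zlib-compressed SWF file")

def skip_rect(body, pos):
    """RECTs are bit-packed: 5 bits give the field width, then 4 fields follow."""
    nbits = body[pos] >> 3
    return pos + (5 + 4 * nbits + 7) // 8

def read_string(body, pos):
    end = body.index(b"\x00", pos)
    return body[pos:end].decode("utf-8", "replace"), end + 1

def extract_edit_text(tag_body):
    """Pull the variable name and initial text out of a DefineEditText tag."""
    pos = skip_rect(tag_body, 2)             # skip CharacterID, then Bounds
    f1, f2 = tag_body[pos], tag_body[pos + 1]
    pos += 2
    has_text = bool(f1 & 0x80)
    if f1 & 0x01: pos += 2                               # HasFont -> FontID
    if f2 & 0x80: _, pos = read_string(tag_body, pos)    # HasFontClass
    if f1 & 0x01: pos += 2                               # HasFont -> FontHeight
    if f1 & 0x04: pos += 4                               # HasTextColor -> RGBA
    if f1 & 0x02: pos += 2                               # HasMaxLength
    if f2 & 0x20: pos += 9                               # HasLayout
    var_name, pos = read_string(tag_body, pos)
    text = read_string(tag_body, pos)[0] if has_text else ""
    return var_name, text

def walk_tags(path):
    body = swf_body(path)
    pos = skip_rect(body, 0) + 4              # frame-size RECT, rate, count
    while pos + 2 <= len(body):
        header = struct.unpack_from("<H", body, pos)[0]
        pos += 2
        tag_type, length = header >> 6, header & 0x3F
        if length == 0x3F:                    # long tag: 32-bit length follows
            length = struct.unpack_from("<I", body, pos)[0]
            pos += 4
        if tag_type == 0:                     # End tag
            break
        if tag_type in TEXT_TAGS:
            print(TEXT_TAGS[tag_type], end=" ")
            if tag_type == 37:
                print(extract_edit_text(body[pos:pos + length]))
            else:
                print("(glyph-indexed text, needs font tables)")
        pos += length

if __name__ == "__main__":
    walk_tags("example.swf")                  # hypothetical file
```

Even a walker like this makes the core point: what comes out is strings, not pages.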

What Google and Yahoo have now is simply access to the text-based content within Flash applets – it does not guarantee that the search engines will treat it on par with well-formed text-based markup. While text can be extracted, it still lacks the structure and context of a text-based page: metadata, inbound links, headings, other markup tags and everything else. Further, if your SWF files render text as graphics, the search engines still won't be able to see it.
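To illustrate the kind of structure a flat text dump loses, here is a toy sketch – assuming nothing about any crawler's real pipeline – that reads headings and link targets straight out of HTML markup, context a crawler gets for free from a text-based page but not from a bag of words extracted out of a SWF.

```python
# Toy sketch: the structure a text-based page gives away for free.
from html.parser import HTMLParser

class StructureSniffer(HTMLParser):
    def __init__(self):
        super().__init__()
        self.headings = []          # [tag, text] pairs, e.g. ["h1", "Pricing plans"]
        self.links = []             # href targets, the raw material of link graphs
        self._open_heading = None

    def handle_starttag(self, tag, attrs):
        if tag in ("h1", "h2", "h3"):
            self._open_heading = tag
            self.headings.append([tag, ""])
        elif tag == "a":
            href = dict(attrs).get("href")
            if href:
                self.links.append(href)

    def handle_endtag(self, tag):
        if tag == self._open_heading:
            self._open_heading = None

    def handle_data(self, data):
        if self._open_heading and self.headings:
            self.headings[-1][1] += data

page = """<html><body>
<h1>Pricing plans</h1>
<p>Choose the plan that fits your team.</p>
<a href="/signup">Sign up now</a>
</body></html>"""

sniffer = StructureSniffer()
sniffer.feed(page)
print(sniffer.headings)   # [['h1', 'Pricing plans']]
print(sniffer.links)      # ['/signup']
```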

There seems to be a lot of misunderstanding about just what this means and how important it is. First of all, in the context of web applications, search engine optimization does not matter for private application views: with an email application, for example, there is no public search or index. What matters is public-facing Flash applications (or websites) where the main site content is locked up in a binary container running on a proprietary runtime/virtual machine. In those cases, up until now most site owners have replicated that same content with a proper URI structure in HTML to get the most out of search engine indexes and referrals. That remains the better solution, as it gives sites and content structure that the crawlers from Google and Yahoo readily understand and can interpret: being able to grep the text components out of a SWF file adds little by way of structure or organization to the web.

The next issue arises when comparing different RIA technologies: the argument is often made that they are all equally poor at representing content and data in context for search engines to easily understand. That may be very true of SWF, but it is untrue of XHTML + Javascript applications, or even applications using XUL or XAML as part of Silverlight, because these are text-based formats with clear markup rules that signify to an interpreter the context and significance of content. For Google or Yahoo, understanding another text type is an elementary step, and one they have taken numerous times (e.g. parsing RSS, Atom, Microformats) – so there is no reason the same steps can't be taken for these engines to grok XAML or any other format.
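As a toy illustration of how elementary that step is – and assuming nothing about Google's or Yahoo's actual pipelines – a few lines of standard XML parsing are enough to recover entry titles, links and dates from an Atom feed with their meaning intact:

```python
# Toy illustration: taking on a new text-based format is cheap.
import xml.etree.ElementTree as ET

ATOM = "{http://www.w3.org/2005/Atom}"

feed_xml = """<?xml version="1.0"?>
<feed xmlns="http://www.w3.org/2005/Atom">
  <title>Example Blog</title>
  <entry>
    <title>Understanding Flash</title>
    <link href="http://example.com/flash"/>
    <updated>2008-07-01T12:00:00Z</updated>
  </entry>
</feed>"""

root = ET.fromstring(feed_xml)
for entry in root.findall(ATOM + "entry"):
    title = entry.findtext(ATOM + "title")
    link = entry.find(ATOM + "link").get("href")
    updated = entry.findtext(ATOM + "updated")
    print(title, link, updated)
# -> Understanding Flash http://example.com/flash 2008-07-01T12:00:00Z
```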

In RIAs, logic is often encapsulated in code – either as Javascript in AJAX applications or as bytecode in the case of Silverlight. That matters less, though: application logic is irrelevant to a search engine's task of understanding content and context, both of which can be conveyed through markup and presentation. So while the announcement from Adobe, Google and Yahoo does shine a light into a black hole, the resulting output is nothing more than a stream of bits whose importance the search engines themselves must determine.

I strongly believe that it is almost impossible to build a true semantic web within binary file formats and proprietary virtual machines. We can hack some way towards it, but it will never be close to what plain text markup can offer.