1. Jul23

    Writing Flash for search engines

    On June 30 Google and Adobe announced a new indexer optimized for Flash (SWF) files discovered by web crawlers. The new partnership takes advantage of a server-side Flash player optimized for a search engine indexing environment and for unidirectional text (e.g. no Hebrew or Arabic). Search engines previously discovered the location of a SWF file on the Web and perhaps indexed its metadata but did not take a deep look inside its binary content. Last month's announcement was a big change for both Adobe and the major search engines: it is now possible to run a heavily GUI-based Flash file at the command line and interpret both its text content and its interaction opportunities. In this post I will walk through what we currently know about the search engine Flash runtime and how it affects search engine optimization in Flash.

    Build for a blind, deaf user

    Search engine indexers are blind and deaf. They open a file, examine its contents, and try to deduce meaning through your page structure and its content. A web page designed for screen readers will also expose more content to search engines not evaluating your page's full render state of content, layout, and interactions.

    Search engines utilizing Flash player indexing are still restricted to this screen reader approach. Accessible Flash applications complete with names, labels, reading order and XMP should continue to be more search engine friendly than other SWF files on the Web. Google's tips for creating accessible, crawlable sites still apply, but in a new Flash context.

    The server-side Flash Player

    If we want to understand how search engines such as Google might interpret Flash content we'll first need to take a look at the Flash Player itself. Adobe provides few details in its official SWF searchability FAQ but we can infer a few implementation details. How would you rewrite Flash Player for server-side indexing of SWF content?

    The search engine Flash Player is likely a scaled-down, secure version optimized for machine readers. Strip out video, audio, fonts, and file system access. The server-side Flash player should open a binary SWF file, pull out the functionality it understands, and create a data tree of all possible actions. These features are actually quite similar to a screen reader interface, but Adobe is instead targeting a Linux-based headless runtime. I believe the guts of the Flash Player for servers are built using the same accessibility abstraction layer Adobe currently uses for Windows, Mac, and Linux desktops.

    The Adobe Flash Player creates a list of objects on the screen at each render and records this list into an accessible data tree (according to a 2005 white paper by Bob Regan). This data tree is updated with each change in the application state, allowing any application listening in to update an object model of clickable buttons, labels, and links.

    Adobe interfaces with OS-level accessibility frameworks on every major desktop platform. The Windows version of Flash Player binds to Microsoft Active Accessibility. Mac versions of Flash bind to Universal Access. On GNOME the player binds to the Assistive Technology Service Provider Interface (at-spi). A server-side version of Flash likely builds upon this same abstracted accessibility object model, passing screen objects to the search engine indexer for further interpretation or interaction.

    Windows Live Search was noticeably missing from the server-side Flash player announcement for search engines. It's possible Adobe has developed a server-side Flash Player for Linux that is not yet compatible with the Windows Server environment of Microsoft's Windows Live Search.

    Accessing deep content

    Googlebot can fill out forms, click buttons, and navigate deep within your site. Clickable Flash objects will likely behave the same way, exposing new content paths for Googlebot within your larger SWF. Flash websites can help ensure deep indexing of SWF content by adding individual SWF fragments to their sitemap. Reading order will likely play a role in selecting important content on your page, and I expect Googlebot may follow the first item in your reading order sooner than the last.

    Googlebot still throws out references to an anchor name fragment in the URL (e.g. #section=menu) and this announcement does not change the general behavior of Google's URL storage and analysis.
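
    A minimal sketch of how a publisher might list individual SWF files in a sitemap while dropping the URL fragments a crawler would ignore anyway. The example.com URLs and the build_sitemap helper are hypothetical and used only for illustration; the namespace is the one defined by the sitemaps.org protocol.

```python
from urllib.parse import urldefrag
from xml.etree import ElementTree as ET

SITEMAP_NS = "http://www.sitemaps.org/schemas/sitemap/0.9"


def build_sitemap(urls):
    """Return sitemap XML listing each URL once, with any #fragment removed."""
    urlset = ET.Element("urlset", xmlns=SITEMAP_NS)
    seen = set()
    for url in urls:
        # Fragments are thrown away by the crawler, so keep only the base URL.
        base, _fragment = urldefrag(url)
        if base in seen:
            continue
        seen.add(base)
        ET.SubElement(ET.SubElement(urlset, "url"), "loc").text = base
    return ET.tostring(urlset, encoding="unicode")


# Hypothetical SWF locations used only for illustration.
swf_urls = [
    "http://example.com/flash/menu.swf",
    "http://example.com/flash/site.swf#section=menu",
    "http://example.com/flash/site.swf#section=contact",
]
sitemap_xml = build_sitemap(swf_urls)
print(sitemap_xml)
```

    Two of the three inputs differ only by fragment, so they collapse into a single entry. Content that should be indexed separately needs its own SWF file and URL.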

    Do Flash versions matter?

    The official announcement from Google and Adobe makes it seem like all Flash is now universally indexed regardless of your Flash version but I think that's bogus. If a search engine wanted to index JavaScript they might run Rhino on the server and interpret results. If you wanted to build an advanced interpreter of Flash content you might use Tamarin or its derivatives, an AVM2 (Flash 9+) virtual machine. I believe AVM2-compatible SWF files will enjoy better search exposure than binaries built for the older AVM. I can't prove it; just a hunch.

    Dynamic object insertion

    Googlebot will detect common JavaScript libraries such as SWFObject used to dynamically insert Flash content at page load. Publishers can back up the dynamic insertion JavaScript with a noscript element just in case Google doesn't discover your dynamic insertion. Sticking with standard dynamic insertion libraries will help ensure your content is discovered through expected behaviors.

    Summary

    The new search version of Flash Player opens the binary SWF format to interpretation by text-focused search engines. Flash developers can take additional steps to package SWF content for accessibility and search discoverability: developing for modern virtual machines, adding accessibility hooks, and wrapping your SWF in XMP.

  2. Jul08

    Google App Engine optimizations

    I have developed a few web applications powered by Google App Engine since its launch in May. It has been a fairly easy transition from my traditional programming in Python and Django backed by MySQL to the distributed App Engine environment, Bigtable, and the limitations of each. Over the past month I have learned a few App Engine best practices, mostly through trial and error, and would like to share them. In this post I will share data optimization tips for Google's hosted Bigtable instance, show how to reduce the errors and resource usage of your application, and add a few steps to your deployment checklist.

    Key-based lookups

    I program Django applications referenced by a set of short unique object labels named slugs. A slug column is uniquely queried across a model and easily indexed for fast scans. In the Bigtable world of Google App Engine slugs are optimally stored as a model's key name. Key names are limited to 500 bytes and must be unique across your defined entity. This unique key lookup directly copies the entity into memory without needing to scan an entire distributed hashtable.

    Entity key names provide very fast lookups for developers who like to plan ahead. You cannot alter a key name once it is set, and it cannot start with a number or underscores. If you can accept these limitations within your code you'll experience even snappier reads from your data store.
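
    A minimal sketch of a check you might run on a slug before using it as a key name. It only encodes the limits described above; is_valid_key_name is a hypothetical helper, not part of the App Engine SDK, and it assumes byte length is measured in UTF-8 and that any leading underscore should be rejected.

```python
MAX_KEY_NAME_BYTES = 500


def is_valid_key_name(name):
    """Check a candidate key name against the limits described in the text."""
    if not name:
        return False  # assumption: an empty name is never useful as a key
    if len(name.encode("utf-8")) > MAX_KEY_NAME_BYTES:
        return False  # longer than the 500-byte limit
    if name[0].isdigit():
        return False  # cannot start with a number
    if name.startswith("_"):
        return False  # conservative reading of the underscore rule
    return True


for slug in ("my-first-post", "2008-recap", "__reserved__"):
    print(slug, is_valid_key_name(slug))
```

    Because a key name cannot be changed later, it is cheaper to reject a bad slug at creation time than to migrate entities afterwards.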

    Reduce indexed columns

    It's tempting to choose a Datastore property by its input helper or based on names similar to a SQL equivalent. So what's the difference between a short String and Text? An index.

    According to Guido, a 300 byte string stored as Text is the same size as a String but without an index. If you have a short string you never query or sort, storing it as Text spares the Datastore from maintaining an index you will never use.

    Define a favicon

    App Engine developers should define favicon.ico, robots.txt, and other frequently requested file paths. Google App Engine logs frequent errors inside your administrative console if it has to hunt for your icon with every browser request.

    Define the location of your static favicon file directly from app.yaml for fast response times:

    - url: /favicon.ico
      static_files: static/favicon.ico
      upload: static/favicon.ico
    

    You should follow a similar pattern for robots.txt and optionally the verification files from Google Webmaster Tools, Yahoo! Site Explorer, and Windows Live Search.

    Define default 404 and 500 response templates

    Your site is not perfect. Visitors will inevitably request pages that do not exist or generate an internal server error. Your site should define default templates for 404 and 500 status codes or risk displaying whatever is sitting on Google's NetScaler.

    Google App Engine default 500 page

    The screenshot above shows an error page of an App Engine application without a defined 500 handler. A link on the page suggests a visit to Google's support website where your visitors will find no support options of interest.

    Django developers should define 404.html and 500.html in their app's templates directory. Django will load and render each file for the default page_not_found and server_error views respectively.

    Deploy and request

    Developers should prime Google's distributed server network by issuing requests for key URLs a few minutes after deploy. These automated requests populate your memcache storage and distribute your app instance across Google's servers. The first request requires more CPU cycles and memory than subsequent requests as Google tries to prioritize active application instances and their versions. You can speed things up by always issuing one or more requests after a successful deploy.

    This process is not unlike flushing and re-populating CDN PoPs with new content from your origin server or propagating dynamic handlers across your front-end cluster. It's best to kick off the process early and have the latest version of your content waiting for new visitors on subsequent requests.
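
    A minimal sketch of a post-deploy warm-up script using only the Python standard library. The warm_up helper, its attempts parameter, and the URLs in the usage note are hypothetical; the fetch argument exists only so the function can be exercised without touching the network.

```python
from urllib.request import urlopen


def warm_up(urls, fetch=urlopen, attempts=2):
    """Request each URL `attempts` times; return {url: last status or error}."""
    results = {}
    for url in urls:
        for _ in range(attempts):
            try:
                with fetch(url) as response:
                    results[url] = response.status
            except OSError as error:  # URLError and HTTPError subclass OSError
                results[url] = repr(error)
    return results
```

    After a successful deploy you might call warm_up(["http://yourapp.example/", "http://yourapp.example/feed/"]) and check that every value in the returned dictionary is 200 before announcing the release.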

    Summary

    Google App Engine simplifies the scaling process but is not a magic cloud that will erase all latency and resource usage issues in your app. App Engine requires new approaches to data storage, data latency, and resource requirements in a metered and opaque environment. Hopefully my trials and experience will speed up your App Engine web apps as you create new services in the cloud.

  3. Jul01

    Announcing Widget Summit 2008

    I am hosting my third annual Widget Summit conference November 3rd and 4th at Hotel Nikko in San Francisco. The two-day widget event will once again educate and connect a widget ecosystem of publishers, toolmakers, developers, and service providers across a variety of platforms including desktop, mobile, web, and social networks. I enjoy taking a look beyond the hype with a sold-out audience interested in building better syndicated content experiences through distributed widgets.

    The widget industry is constantly evolving as publishers extend their reach beyond their web address and into remote locations already bustling with activity. The popularity of a single site pales in comparison to the aggregate crowds gathered in front of their Windows Vista desktops, iPhones, or My Yahoo! homepages. In the past year we've seen new context added to our widget environments connecting us to the location, friend list, or shared application of our widget community wherever they may interact with our content. Today's smartest widgets enjoy a close bind with their parent platform's features, regularly poll their home base for relevant updates, and reach new audiences through targeted and integrated content interactions.

    At my first widget conference in 2006 we struggled with the name "widget" and this new distribution network most people interpreted as a Flash badge on MySpace. Last year iPhone web applications and the social canvas of Facebook were all the rage, with new opportunities in the enterprise slowly emerging through the rollout of Windows Vista and personal information dashboards powered by software-as-a-service offerings from established consumer brands such as Google and Netvibes.

    A lot has changed in the widget space in the 8 months since the last Widget Summit. Widgets are going mainstream, with the startup valuations and press coverage to match. Somewhere among the fog of hype are useful opportunities to reach targeted audiences on their platform of choice. Let's take a look at some of the big changes we've seen since October 2007.

    • New collaborative technologies such as OpenSocial and its open-source reference container Apache Shindig are quickly creating new widget environments at companies that could not afford to create their own implementations from scratch. MySpace, Orkut, Hi5, LinkedIn, and Yahoo! have all committed to a standard set of widget APIs.
    • The Facebook platform is in the middle of its first big changes since its 2.0 release in May 2007. Shifting concepts of profile display, authoring, and member interaction will require new upgrades or fresh opportunities for completely new applications.
    • The iPhone continues to spark interest in mobile web app development based on single-browser environments. iPhone 2.0 will put smartphones in the hands of a worldwide audience for about the price of a ubiquitous iPod and hopefully expand mobile data opportunities.
    • Advertising networks have created separate product offerings specifically focused on widgets. DoubleClick syndicates and tracks widgets through its DART platform. AOL's Platform-A recently announced widget-specific advertising and sponsorship powered by TACODA's trail of cookie bounties.
    • The enterprise continues to adopt software as a service and widgets are no exception. Google, IBM, and Microsoft are extending their hosted software into large companies and bundling the latest widget technologies inside an integrated package.
    • Consumer electronics ship with widgets built-in. Your next car, GPS unit, television, or alarm clock may contain customized widget content.

    These are just a few of the large trends creating new opportunities for publishers extending the reach of their content through widgets. We'll cover all the major widget platforms and opportunities at this year's Widget Summit, providing the business sense and development basics to kick off your new widget initiatives in 2009.

    You may have noticed this blog grow quiet over the past few months as I rebuilt the conference software behind Widget Summit and aligned the many business details needed to create the best possible experience. In the next week I'll share some of the technical details behind my new sites and services.

    Registration for Widget Summit is now open with early bird pricing of $795 for the two-day conference in downtown San Francisco on November 3rd and 4th (the Monday and Tuesday before Web 2.0 Summit). I hope you can join us for what should be our best conference yet!

  4. Apr14

    Customizing conference speeches for your audience

    Speaking at a conference can be a hit-or-miss event. Next week I will take the stage at Web 2.0 Expo for a three-hour workshop on Web 2.0 Best Practices: expressive HTML, feed syndication, and widgets. Delivering technical content longer than The Godfather is an intimidating yet worthy challenge. I like to tailor my talks for each audience, dive deep when given the opportunity, and connect with new smart people.

    Over the past few weeks new conversations have emerged regarding how conferences must change to better suit their audience. As a conference producer, conference speaker, and attendee I have many opinions on running a great show but today's post will focus on speakers. In this post I will share three speaking tips that keep coming up in my conversations with other speakers in the industry.

    1. Gauge audience skill levels
    2. Prepare more content than needed
    3. Avoid card collectors

    Gauge audience skill levels

    I like to address audiences with an intermediate to advanced knowledge of web development, content syndication, and widget platforms. I am never quite sure how much my audience already knows and how quickly I can move past the basic bits of knowledge about a particular product or technology. I typically begin a longer presentation with a few technical questions for the audience to set the pace and depth of my talk.

    At last year's Web 2.0 Expo I decided to gauge my audience's experience with XML and syndication basics by a show of hands. I exposed the following bullet points one-by-one with rising levels of difficulty.

    Does this scare you?

    1. &amp; vs. &
    2. 2007-04-17T16:50:00-07:00
    3. HTTP status codes: 200, 304, 410

    I was pleasantly surprised by my audience's reaction to these questions. Only a few people in the audience admitted to not knowing the difference between an escaped and an unescaped character such as the ampersand and its entity reference. A few more were unable to decipher an ISO 8601 date and time. Approximately 10% of the room knew the difference between the OK, Not Modified, and Gone HTTP status codes.

    Prepare more content than needed

    I typically throw out 20% of my presentation based on the skill level of my audience and unforeseen time limitations. Throwing out my carefully-prepared slides was a big mental leap but it allows me to refocus my message on-the-fly to better match the conference, its topics, and its attendees.

    Armed with my on-the-fly audience demographics from my earlier questions I may quickly skim over basics on my way to more advanced content. I may skip a topic already over-covered during previous sessions. Quickly flashing more advanced slides on screen on my way to my final presentation slide may also prompt conversations after my talk with more advanced members of the audience curious to hear even more.

    I prepared a 10-minute talk for last year's Web 2.0 Summit. I did not realize the organizers start the timer for your talk when the conference chair takes the stage for introductions, not when you reach the podium. John Battelle provided a nice introduction but my presentation was suddenly cut to 8.5 minutes instead of the prepared 10. I stuck to the basics for the CxO crowd and threw out the final 20% of my presentation.

    Avoid card collectors

    Some conference attendees are business card collectors. They don't actually engage in conversation or ask questions on site but will come up to the stage to collect a new slip of paper from every session, perhaps for a more itemized expense report or a vast spam database.

    After my presentation I like to stick around and answer 1-on-1 questions with session attendees. I place a small stack of business cards on one end of the stage for easy self-service while I continue to engage members of the audience 1-on-1. The conversationalist crowd is a bit thinner and may invite new participants.

    Summary

    Speaker content can and should adapt to the audience. Conference organizers should help their speakers better understand audience composition, but it's also possible for a speaker to step up and deliver a stellar individual performance.

  5. Apr10

    Google App Engine for developers

    On Monday Google launched Google App Engine, a hosted dynamic runtime environment for Python web applications inside Google's geo-distributed architecture. Google App Engine is the latest in a series of Google-hosted application environments and the first publicly available dynamic runtime and storage environment based on large-scale proprietary computing systems.

    Google App Engine lets any Python developer execute CGI-driven Web applications, store their results, and serve static content from a fault-tolerant geo-distributed computing grid built exclusively for modern Web applications. I met with the App Engine team leads on Monday morning for an in-depth overview of the product, its features, and its limitations. Google has been working on Google App Engine since at least March 2006 and has only just begun revealing some of its features. In this post I will summarize Google App Engine from a developer's point of view, outline its major features, and examine pitfalls for developers and startups interested in deploying web applications on Google's servers.

    What is Google App Engine?

    Google App Engine is a proprietary virtualized computing suite covering the major common components of a modern web application: dynamic runtime, persistent storage, static file serving, user management, external web requests, e-mail communication, service monitoring, and log analysis. The Google App Engine product offers a single production web server stack hosted on Google's custom-designed computers and datacenters distributed around the world.

    Google App Engine is a managed hosting environment with a tightly managed stack running in a machine-independent environment. It simplifies the deployment and management of your web application software stack while constraining you to a specific stack. When I start a new web development project today I first have to set up a tiered system to effectively handle site growth:

    1. Purchase dedicated servers or virtualized slices. Estimate necessary CPU, memory, disk space, etc. at each tier.
    2. Configure a web server for dynamic content. Install Python and its eggs, Apache HTTPd and extra modules such as mod_wsgi. Configure and tweak each. Open appropriate ports. Listen.
    3. Set up a MySQL database server and choose the appropriate storage engine. Configure MySQL, add users, add permissions. Tweak and optimize.
    4. Add an in-memory caching layer for frequently accessed dynamic content.
    5. Monitor your uptime and resource utilization with Ganglia and/or other tools on each machine.
    6. Serve static files such as JavaScript, CSS, and images from a specialized serving environment such as Amazon's Simple Storage Service.
    7. Turn your static server into an origin server for a CDN with points of presence close to your website's users.
    8. Connect each piece of the stack, keep its software updated to avoid security vulnerabilities, and hopefully respond to all website requests in less than a second.
    9. Dedicate work hours and expertise to all the above. Hire outside assistance if needed.
    10. Don't go broke trying.

    Your tiers will expand as your new web application gains popularity. Your single-server tiers become load-balanced services, message bus broadcasts and listeners, and distributed cache arrays at scale. You'll probably spend time rearchitecting your application at each stage of growth to account for these new resource demands if you can afford the time, expertise, and effort.

    Google App Engine is a new and interesting solution for Python developers interested in adding features, not servers. Google spends hundreds of millions of dollars developing its custom infrastructure with 12-volt power supplies tapped into a hydro-electric dam next door and fat fiber pipes owned by local governments carrying requests and responses to their proper home. Google's physical infrastructure is a vast array of highly optimized web machines, and we'll now be able to see how such infrastructure performs across more generic applications on App Engine.

    Freemium hosting model

    Google App Engine follows a "freemium" business model, offering basic features for free with paid upsells available for application developers exceeding approximately 5 million pageviews a month. This resource quota approximately matches the Google Analytics 5 million pageview limit. Google Analytics customers may currently exceed this limit if they maintain an active AdWords account with a daily advertising budget of $1 or more. The Google App Engine team plans to introduce pricing and service level agreements for additional resources, priced in a pay-as-you-go marginal resource structure, once the product leaves its limited 10,000-person preview period later this year.

    Quota Type               Limit / day
    HTTP requests            650,000
    Bandwidth In             9.77 GB
    Bandwidth Out            9.77 GB
    CPU megacycles           200 million
    E-mails                  2,000
    Datastore calls          2.5 million
    External URL requests    160,000

    Google publishes these quotas and provides administrative monitoring tools. The quotas are just a guideline as Google may cut off access to your application if you receive a traffic spike of an unspecified duration. The Google App Engine quota page specifies:

    If your application sustains very heavy traffic for too long, it is possible to see quota denials even though your 24-hour limit has not yet been reached.
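
    A rough back-of-the-envelope sketch of what the request quota means in practice. It assumes a 30-day month and traffic spread evenly across the day; both are my assumptions for a round estimate, not figures from Google.

```python
SECONDS_PER_DAY = 24 * 60 * 60

daily_request_quota = 650_000   # HTTP requests per day, from the quota table
monthly_pageviews = 5_000_000   # approximate free ceiling quoted in the text
days_per_month = 30             # assumption for a round estimate

# Average request rate the daily quota allows if traffic were perfectly even.
average_requests_per_second = daily_request_quota / SECONDS_PER_DAY

# Requests available per pageview if you serve the full 5 million pageviews.
requests_per_pageview = daily_request_quota * days_per_month / monthly_pageviews

print(f"{average_requests_per_second:.1f} requests/second sustained")
print(f"{requests_per_pageview:.1f} requests per pageview")
```

    Real traffic is rarely even, which is why a short burst can hit the throttling described in the quote well before the 24-hour total is exhausted.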

    Google App Engine over quota

    Google App Engine has already failed the TechCrunch effect: it appears the platform is currently unable to handle the referral traffic from a popular blog or news site typically associated with a product launch. The traffic spike cutoffs make me think twice about hosting anything of value on App Engine.

    The team

    The Google team behind App Engine has a long history in developer services. Team members include some of the top Python experts in the world, financial transaction specialists, and developer tool builders.

    • Python creator Guido van Rossum wrote the App Engine SDK and ported the Python runtime and Django framework for the new environment. Google App Engine is Guido's first full-time project at Google after his Noogler project Mondrian.
    • Technical lead Kevin Gibbs previously worked on the SashXB Linux development toolset and multiple RPC projects at IBM before he created Google Suggest in 2004.
    • Developer Ryan Barrett wrote the Bigtable datastore implementation and related APIs. Previously Ryan was tech lead on Moneta, Google's transaction processing platform and customer data store.
    • Product lead Paul McDonald has worked on Google Checkout, AdWords, and a Web-based IDE named Mashup Editor (all strong candidates for App Engine inclusion).
    • Product manager Peter Koomen has previously authored papers on natural language search and semantic analysis.

    The list above is just a sampling of the full team behind App Engine.

    Feature limitations

    Google App Engine is not without its faults. Applications cannot currently expand beyond the quota's ceiling. It's still unclear how an application will dynamically scale on App Engine once it leaves the farm leagues, and at what cost.

    A few major issues include:

    1. Static files are limited to 1 MB. App Engine does not support partial content requests (Accept-Ranges).
    2. Cron jobs and other long-life processes are not permitted.
    3. Applications are not uniquely identifiable by IP address, leading to a lack of identification for external communications. Applications may suffer from bad neighbor penalties from API providers upset at another app on the service.
    4. No SSL support. No IP address complicates signing, but port 443 is open for requests. You can rely on Google services (and branding) for trusted login and possibly future payments.
    5. No image processing. Python Imaging Library relies on C, and is therefore not a possible App Engine module.
    6. Google user accounts. Site visitors are very aware of your choice in web hosts each time they attempt to log on to your application. I feel like this flow makes your application seem less professional, but may be a reasonable trade-off. Google will store your user data and potentially mine its data for better ad targeting.
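
    The first limitation in the list above is easy to check before you deploy. A minimal sketch, assuming "1 MB" means 1,048,576 bytes and that your static files live under a single directory; oversized_static_files is a hypothetical helper, not part of the App Engine SDK.

```python
import os

MAX_STATIC_BYTES = 1024 * 1024  # assumes "1 MB" means 1,048,576 bytes


def oversized_static_files(root, limit=MAX_STATIC_BYTES):
    """Return sorted (relative_path, size) pairs for files larger than `limit`."""
    too_big = []
    for folder, _subfolders, filenames in os.walk(root):
        for filename in filenames:
            path = os.path.join(folder, filename)
            size = os.path.getsize(path)
            if size > limit:
                too_big.append((os.path.relpath(path, root), size))
    return sorted(too_big)
```

    Calling oversized_static_files("static") from your project root returns an empty list when every file fits, so it can act as a simple gate in a deploy script.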

    Summary

    Overall I am quite impressed with Google App Engine and its potential to remove operations management and systems administration from my task list. I am not confident in Google App Engine as a hosting solution for any real business while the host is in preview stage but those concerns may be alleviated once the product is ready for real customers and real service-level agreements.

    Python developers have just been granted a few superpowers for future projects. As an existing Python and Django developer I know how difficult it can be to find a managed hosting provider with modern Python support. Many hosts are years behind, running Python 2.3. I am excited App Engine already features the programming tools I use every day, with a few modifications for their proprietary systems. App Engine should introduce more developers to Python and the Django framework and hopefully cause other web hosts to provide better Python support as well.

  6. Feb06

    iPhone web app performance

    The Exceptional Performance group at Yahoo! just released a detailed performance analysis of web applications on the iPhone. Yahoo! analyzed the full capabilities of the iPhone's Safari browser including browser cache and transfer speeds.

    Cache persistence

    The Safari browser on iPhone allocates memory from the shared system memory but does not save web content into persistent storage. Any cached objects (CSS, JavaScript, images, etc.) are removed from memory on reboot.

    Optimal component size

    Safari for iPhone will only cache files of 25 KB or smaller that are served with an explicit expiration time in the Expires header or a max-age directive in the Cache-Control header. Safari decodes the file before saving it to cache, meaning your total unzipped file size must squeeze under the 25 KB ceiling to hit the cache. Components already in cache are only replaced by new cacheable components using the least recently used algorithm.

    Safari for iPhone is able to cache a maximum of 19 external components, placing a maximum cache limit at around 475 KB.
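
    A minimal sketch of the size rule, assuming 1 KB means 1,024 bytes and that the component is served gzip-compressed. The fits_iphone_cache helper and the sample stylesheet are hypothetical; the point is that the decoded size, not the transfer size, is what counts.

```python
import gzip

MAX_COMPONENT_BYTES = 25 * 1024   # assumes 1 KB = 1,024 bytes
MAX_CACHED_COMPONENTS = 19


def fits_iphone_cache(body_as_served):
    """True if a gzip-compressed response body decodes to 25 KB or less."""
    return len(gzip.decompress(body_as_served)) <= MAX_COMPONENT_BYTES


# 19 components of 25 KB each gives the overall ceiling quoted in the text.
total_cache_kb = MAX_CACHED_COMPONENTS * MAX_COMPONENT_BYTES // 1024

# A repetitive stylesheet of roughly 40 KB compresses well on the wire,
# but it still decodes to its full size before Safari considers caching it.
stylesheet = b"a { color: green }\n" * 2156
compressed = gzip.compress(stylesheet)

print(len(compressed), len(stylesheet), fits_iphone_cache(compressed))
print(total_cache_kb)
```

    A small transfer size is therefore no guarantee of a cache hit; splitting the stylesheet into pieces under the ceiling is what helps.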

    Download speed

    Yahoo! found typical iPhone download speeds vary from 82 kbps to 150 kbps when connected to a GSM cellular data network. Wi-Fi connections over 802.11b/g networks obviously speed up the experience, but you should assume cellular data load times when designing for a compelling user experience.
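
    To put those speeds in perspective, here is a rough estimate of the transfer time for a single 25 KB component. It is an idealized sketch that ignores latency and protocol overhead, and it assumes 1 KB means 1,024 bytes and 1 kbps means 1,000 bits per second.

```python
def download_seconds(size_bytes, kilobits_per_second):
    """Ideal transfer time, ignoring latency and protocol overhead."""
    return size_bytes * 8 / (kilobits_per_second * 1000)


component = 25 * 1024  # one 25 KB component
slow = download_seconds(component, 82)
fast = download_seconds(component, 150)
print(f"{fast:.1f} to {slow:.1f} seconds per 25 KB component")
```

    Even in the ideal case each uncached component costs more than a second on a cellular connection, so every component you can cache or remove matters.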

    Summary

    Web applications built for the iPhone's Safari browser need to specifically target web performance for these small devices and their special cache rules. Desktop browser best practices such as zipped components and combined files for CSS and JavaScript may be too bloated for the Safari mobile browser. A few tips:

    • Limit cacheable components to a decompressed size of 25 KB or less.
    • Limit yourself to 19 or fewer cached components.
    • Minify CSS and JavaScript for slimmer file weights.
    • Use CSS sprites to combine multiple small images into a shared image under 25 KB.
  7. Feb05

    Sniff browser history for improved user experience

    The social web has filled our websites with too much third-party clutter as we figure out the best way to integrate content with the favorite sites and preferences of our visitors. Intelligent websites should tune in to the content preferences of their visitors, tailoring a specific experience based on each visitor's favorite sites and services across the social web. In this post I will teach you how to mine the rich treasure trove of personalization data sitting inside your visitor's browser history for deep personalization experiences.

    I first blogged about this technique almost two years ago but I will now provide even more details and example implementations.

    1. Evaluate links on a page
    2. Test a known set of links
    3. Live demos and examples
      1. Online aggregators
      2. Social bookmarks
      3. OpenID providers
      4. Mapping services
    4. Summary

    Web browsers store a list of web pages in local history for about a week by default. Your browsing history improves your browsing experience by autocompleting a URL in your address bar, helping you search for previously viewed content, or coloring previously visited links on a page. Link coloring, or more generally applying special CSS properties to a :visited link, is a DOM-accessible page state and a useful method of comparing a known set of links against a visitor's browser history for improved user experience.

    • New Site
    • Visited site

    A web browser such as Firefox or Internet Explorer will load the current user's browser history into memory and compare each link (anchor) on the page against the user's previous history. Previously visited links receive a special CSS pseudo-class distinction of :visited and may receive special styling.

    <style type="text/css">
    ul#test li a:visited{color:green !important}
    </style>
    <ul id="test">
      <li><a href="http://example.com/">Example</a></li>
    </ul>
    

    The example above defines a list of test links and applies custom CSS to any visited link within the set. Your site's JavaScript code can request each link within the test unordered list and evaluate its visited state.

    Test a known set of links

    Any website can test a known set of links against the current visitor's browser history using standard JavaScript.

    1. Place your set of links on the page at load or dynamically using the DOM access methods.
    2. Attach a special color to each visited link in your test set using finely scoped CSS.
    3. Walk the evaluated DOM for each link in your test set, comparing the link's color style against your previously defined value.
    4. Record each link that matches the expected value.
    5. Customize content based on this new information (optional).
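
    The steps above can be sketched in a few lines of JavaScript. This is only an illustration: findVisitedLinks and its parameters are names I made up, and the color lookup is passed in as a function so the logic can be exercised outside a browser. It assumes a browser that exposes :visited styling through computed style, as described in this post; browsers released since about 2010 deliberately report the unvisited style to scripts.

    ```javascript
    // links: array of anchor-like objects with an href property.
    // getColor: function returning the computed color string for one link.
    // visitedColor: the color your scoped :visited rule assigns, as the browser reports it.
    function findVisitedLinks(links, getColor, visitedColor) {
      const visited = [];
      for (const link of links) {
        // Steps 3 and 4: compare each link's color and record the matches.
        if (getColor(link) === visitedColor) {
          visited.push(link.href);
        }
      }
      return visited;
    }

    // In a browser, with the ul#test markup and green :visited rule shown earlier:
    //   const links = Array.from(document.querySelectorAll('ul#test li a'));
    //   const hits = findVisitedLinks(links, function (a) {
    //     return window.getComputedStyle(a).color;
    //   }, 'rgb(0, 128, 0)'); // computed form of "green"
    ```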

    Each link needs to be explicitly specified and evaluated. The standard rules of URL structure still apply, which means we are evaluating a distinct combination of scheme, host, and path. We do not have access to wildcard or regex definitions of a linked resource.

    In less geeky terms we need to take into account all the different ways a particular resource might be referenced. We might need to check the http and https versions of the page, with and without a www. prefix to more thoroughly evaluate active use of a particular website and its pages.
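
    A small helper can generate those combinations for you. The sketch below is a hypothetical urlVariants function, not part of any library, and it only covers the scheme and www. variations mentioned here.

    ```javascript
    // Build the http/https and www./bare combinations for one host and path.
    function urlVariants(host, path) {
      // Normalize so callers can pass either "example.com" or "www.example.com".
      const bare = host.replace(/^www\./, '');
      const variants = [];
      for (const scheme of ['http', 'https']) {
        for (const prefix of ['www.', '']) {
          variants.push(scheme + '://' + prefix + bare + path);
        }
      }
      return variants;
    }
    ```

    Each returned URL still has to be placed on the page and tested individually, since the browser only matches exact URLs.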

    I group my tests into sets of URLs with the most likely matches placed at the beginning of the set. I evaluate each link in the set until I find a match thereby exhausting positive indicators of site activity while prioritizing the data scan.
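
    Roughly, that scan looks like the sketch below. The detectActiveSites name and the shape of the site objects are assumptions for illustration, and isVisited stands in for whatever per-URL check you use, such as the :visited color comparison.

    ```javascript
    // sites: [{ name: string, urls: [most likely URL first, ...] }]
    // isVisited: function (url) returning true when the URL is in the visitor's history.
    function detectActiveSites(sites, isVisited) {
      const active = [];
      for (const site of sites) {
        // some() stops at the first matching URL, so likely matches listed
        // first keep the scan short.
        const matched = site.urls.some(function (url) { return isVisited(url); });
        if (matched) {
          active.push(site.name);
        }
      }
      return active;
    }
    ```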

    Live demos and examples

    Sniffing a visitor's browser history has good and evil implications. An advertiser can determine if you visited Audi's website lately, drill down on exact Audi models, and offer related information without ever placing code on the Audi website. I have been scanning the browser history of my site visitors for the past few months and I have coded a few examples to show benevolent uses for improved user experience.

    Online aggregators

    Feed aggregator button grid

    Clusters of feed subscription buttons clutter our websites, displaying tiny banner ads for online aggregators of little use to most of our site visitors. My blog checks a known list of online aggregators against the current visitor's browser history and adds a targeted feed subscription button for increased conversion. A Google Reader user will see an "Add to Google button" and a Netvibes user will see an "Add to Netvibes" button without cluttering up the interface. I insert direct links to each site's feed handlers to help convert the current visitor into a long-term subscriber.
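
    Once a scan reports which aggregators a visitor uses, building the buttons is a simple lookup. The sketch below uses a made-up aggregator table: the names, hosts, and URL patterns are placeholders, not the real feed handlers of any service.

    ```javascript
    // Hypothetical aggregator table; hosts and URL patterns are placeholders.
    const AGGREGATORS = [
      { name: 'Reader A', label: 'Add to Reader A', subscribe: 'https://reader-a.example/subscribe?url=' },
      { name: 'Reader B', label: 'Add to Reader B', subscribe: 'https://reader-b.example/add?feed=' }
    ];

    // activeNames: aggregator names matched in the visitor's browser history.
    function subscriptionButtons(aggregators, activeNames, feedUrl) {
      return aggregators
        .filter(function (a) { return activeNames.indexOf(a.name) !== -1; })
        .map(function (a) {
          // Encode the feed URL so it survives as a query string value.
          return { label: a.label, href: a.subscribe + encodeURIComponent(feedUrl) };
        });
    }
    ```

    A visitor with no matches gets an empty list, which is your cue to fall back to a plain feed link.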

    Once I match a particular service I could also check to see if the current visitor is already subscribed to my feed. I would simply need to run a second test against the data retrieval URL, such as feedid=1234, to match web traffic with subscriber numbers.

    Visit my live example of link scanning popular online feed aggregators for a demo and the applicable code.

    Social Bookmarks

    Social bookmark button sample

    I like to see my latest blog posts spread all over the web thanks to social bookmarking sites and other methods of content filtering and annotation. Most sites spray a group of tiny service icons near their blog posts and hope a visitor recognizes the 16 pixel square and takes action. Suck. There has to be a better way.

    I can scan a current visitor's browser history to determine an active presence on one or more bookmarking sites. Once I determine the current visitor is also a Digg user I can show live data from Digg.com to prompt a specific action such as submitting a story or voting for content. I can create a much better user experience for 3 services I know my visitor actively uses instead of spraying 50 sites across the page.

    Visit my live example of link scanning popular social bookmarking sites for a demo and the applicable code.

    OpenID providers

    Pibb OpenID signin

    OpenID is an increasingly popular single sign-on method and centralized identity service. OpenID lets a member of your site sign-on using a username and password from a growing list of OpenID providers including your instant messenger, web portal, blog host, or telephone company account. Visitors signing up for your site or service shouldn't have to know anything about OpenID, federated identities, or other geeky things, but should be able to easily discover they can sign-in with a service they already use and trust every day.

    I can scan a list of sign-in endpoints for a list of OpenID providers and only present my site visitor with options actually relevant to their everyday web usage. Prompting a user to sign-in to your service with their WordPress.com account should be much more effective than an input field sporting an OpenID icon. Link scanning for active usage should increase new member sign-ups, reduce support costs due to yet another username and password, and make your members happy.

    Visit my live example of link scanning current OpenID providers for a demo and applicable code.

    Mapping services

    Facebook map drop-down

    Online mapping services have changed the way we interact with location data. Need to get to 123 Main Street? Not a problem, I'll just send that data over to your favorite mapping service to help you find your way.

    I can scan a visitor's browser history to determine their favorite mapping service. Perhaps she is most comfortable with MapQuest, Google Maps, or Yahoo. Or maybe she uses a Garmin GPS unit and would prefer a direct sync with that specialized service. Determining my visitors' favorite mapping tool helps me deliver a valuable visualization or link I know they prefer.

    Visit my live example of link scanning map API providers for a demo and applicable code.

    Summary

    Websites should take advantage of the full capabilities of modern browsers to deliver a compelling user experience. Built-in capabilities such as XMLHttpRequest took years of implementation before finding their asynchronous groove in data-heavy websites. I hope we can similarly probe other latent useful features to improve the social web through more personalized and responsive experiences.

    I have been scanning the browser history of my website visitors for the past few months to gracefully enhance adding my Atom feed to their favorite feed reader. Easily recognized branding such as "Add to My Yahoo" has yielded much higher conversion rates than a simple Atom link with a minimal effect on page load performance. Dynamically checking for active usage of 50 or so aggregators allows me to extend my total test list and promote an obscure tool that might never make the cut for permanent on-screen real estate.

    How will your site utilize your visitor's browser history for a more custom user experience? How will you connect data in new ways once you have concrete knowledge of the new feature developments that will be most useful to your visitors' online lifestyle?

  8. Jan29

    Data interchange for the social web

    Data portability is only useful if outside systems can comprehend the exported data. Well-described and interoperable data sets open new possibilities for context-aware social applications, importing your friends, photos, or genetic markup from an existing system into your current tool of choice. In this post I will discuss website best practices for exporting portable, descriptive data sets in the name of data portability. This post builds upon user authorization concepts covered in my last post.

    Expressing data between two unrelated systems is difficult at best. You need a shared set of vocabulary to explain even the basic data points (time, person, etc.). Good data exports will want to represent as much data as possible with the least probable data loss.

    Voyager golden record cover

    NASA launched the Voyager 1 spacecraft into space in September 1977 with a set of golden records onboard. These records communicate small pieces of human knowledge to any intelligent life that may discover our small explorer. The graphic above is humanity's attempt at data interoperability, teaching alien explorers the proper positioning of an included stylus over a record rotating once every 3.6 seconds (time is expressed as the fundamental transition of the hydrogen atom). Thankfully web developers do not have to worry about interoperability with so many unknown measures, but your data could just as easily be lost and never played back for other worlds to hear.

    Identify exportable data

    The first step in data export is identifying the unique pieces of information you would like to package and ship outside your walls. What information might be useful to a user seeking to backup or otherwise export his or her data? How would you like to import such data back into your own website?

    Google Mail message listing sample

    Pictured above is a list of messages stored in Gmail. One message is part of a continuing conversation or thread, another message is flagged, and two messages have custom labels. A typical e-mail system might just export a list of raw messages but could possibly lose key data such as a flagged state or labels/tags.
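
    To make the difference concrete, here is a minimal sketch of an export record that keeps that state alongside the raw message. The field names are hypothetical and this is not the export format of Gmail or any other mail system.

    ```javascript
    // message: { id, threadId, flagged, labels, raw } (hypothetical internal shape)
    function exportMessage(message) {
      return {
        id: message.id,
        // Keep the conversation link so threads can be rebuilt on import.
        threadId: message.threadId || null,
        flagged: Boolean(message.flagged),
        // Copy the labels so the export does not share state with the source.
        labels: (message.labels || []).slice(),
        raw: message.raw
      };
    }
    ```

    Serializing the result with JSON.stringify gives an importer everything it needs to restore the flag, labels, and thread instead of just the message body.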

    Research existing data standards

    Data interoperability is not a new concept and your current challenges may be easily solved by existing certified and de facto standards. Standards increase the chances your data will be consumed, processed, and understood by others. You could invent an entirely new dialect and vocabulary to describe your information but you will be much more successful at disseminating data if you are easily interpreted.

    Standards organizations have spent years analyzing the essential elements and interoperability requirements of many common forms of data. Below are just a few standard data formats for elements of the social web.

    People, Places, and Things
      • vCard
      • xNAL
      • KML
      • LDAP
    Events
      • iCalendar
    News articles
      • Atom Syndication Format
      • News Industry Text Format
    Human DNA
      • NCBI Homo sapiens genome build 36.2, FASTA

    Each data markup has a specific set of required data intended for a specific audience or interpreter. Google Maps, for example, prefers a feed of business listings and locations in xNAL while Google Earth prefers KML. Bloggers output news articles in Atom for consumption by a specific set of tools, while mainstream publications mark up their stories in a news industry format for increased granularity. Some formats may not be applicable if your product does not store all the required types of data (e.g. you know a person's name but not their hometown). Your company will need to select a target output format based on expected external use and how your information might map onto a format's required elements.
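
    A quick way to evaluate a candidate format is to list its required elements and see which ones your records cannot fill. The sketch below does that for two formats. The REQUIRED_FIELDS table is a simplified assumption covering only a subset of what each specification demands, and the record keys are hypothetical.

    ```javascript
    // Simplified required-element lists (assumption: a subset of each specification).
    const REQUIRED_FIELDS = {
      'vCard 3.0': ['FN', 'N'],
      'Atom entry': ['id', 'title', 'updated']
    };

    // Returns the required fields that the record leaves empty for the given format.
    function missingFields(record, format) {
      return REQUIRED_FIELDS[format].filter(function (field) {
        const value = record[field];
        return value === undefined || value === null || value === '';
      });
    }
    ```

    An empty result means the record can at least satisfy the basics of that format; anything else is data you would have to collect or leave out.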

    Extend where appropriate

    Each format supports extended namespaces for custom data not covered by the base vocabulary. A member's favorite food or soccer club is not an essential concern of an international standards body, but you can easily add such fields through your own custom namespace where appropriate.

    The same rules of data loss apply to custom namespaces: custom definitions are more likely to be missed while common namespaces are more easily understood. Extended namespaces may already be in active use by a big company or a coalition, increasing your chances of data visibility. An AOL Instant Messenger screenname is defined as "X-AIM" in a vCard context for example, where the X- represents an extension element.
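
    Here is a minimal sketch of a vCard 3.0 export that carries the X-AIM extension mentioned above. The toVCard and escapeText helpers and the person field names are my own inventions for illustration, and the output covers only a name and a screenname.

    ```javascript
    // Escape backslashes, semicolons, commas, and newlines in a vCard text value.
    function escapeText(value) {
      return String(value)
        .replace(/\\/g, '\\\\')
        .replace(/;/g, '\\;')
        .replace(/,/g, '\\,')
        .replace(/\r?\n/g, '\\n');
    }

    // person: { given, family, aim } (hypothetical field names)
    function toVCard(person) {
      const lines = [
        'BEGIN:VCARD',
        'VERSION:3.0',
        // N is structured as family;given;additional;prefix;suffix.
        'N:' + escapeText(person.family) + ';' + escapeText(person.given) + ';;;',
        'FN:' + escapeText(person.given + ' ' + person.family)
      ];
      if (person.aim) {
        // The X- prefix marks an extension property outside the base vocabulary.
        lines.push('X-AIM:' + escapeText(person.aim));
      }
      lines.push('END:VCARD');
      // vCard lines are separated by CRLF.
      return lines.join('\r\n') + '\r\n';
    }
    ```

    A consumer that does not understand X-AIM can skip that one line and still read the name, which is exactly the graceful data loss you want from an extension.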

    Summary

    Data portability and interoperability on the social web continue to be a hot topic. While there are PR benefits for first-movers I expect there will not be widespread adoption until portable data has a remote consumer. Startups with limited resources will need to see a possible consuming service for their exported data before carving out part of their product cycle for the new feature. I think data portability is a great project for this summer's interns, providing deep exposure to data complexity and the industry as a whole while balancing proper authentication and privacy concerns.

  9. Jan21

    Data Portability, Authentication, and Authorization

    The social web is booming, signing up new users and generating new pieces of unique content at a steady clip. A recurring theme of the social web is "data portability," the ability to change providers without leaving behind accumulated contacts and content. Most nodes of the social web agree data portability is a good thing, but the exact process of authentication, authorization, and transport of a given user and his or her data is still up in the air. In this post I will take a deeper look at the current best practices of the social Web from the point of view of its major data hubs. We will take a detailed look at the right and wrong ways to request user data from social hubs large and small, and outline some action items for developers and business people interested in data portability and interoperability done right.

    General issues

    Friends, photographs, and other objects of meaning are essential parts of the social web. We're much more inclined to physically move from one city to the next if our friends, furniture, and clothes come along with us. The interconnectedness of the digitized social web makes the moving process much simpler: we can lift friends from one location into another, clone our digital photographs, and match our blog or diary entries to the structure of our new social home. Each of these digital movers represents what we generally call "social network portability" or, more generically, "data portability."

    Social networks accelerate interactions and your general sense of happiness in your new home through automated pieces of software designed to help you move data, or simply mine its content, from some of the most popular sites and services on the Web. These access paths are roughly equivalent to a new physical location setting up easy transit routes between some of the largest cities to help fuel new growth.

    Facebook Friend Finder e-mail import

    Your e-mail inbox is currently the most popular way to construct social context in an entirely new location. Sites such as Facebook request your login credentials for a large online hub such as Google, Yahoo!, or Microsoft to impersonate you on each network and read all data which may be relevant to the social network such as a list of e-mail correspondents. Every day social network users hand over working user names and passwords for other websites and hope the new service does the right thing with such sensitive information. Trusted brands don't like external sites collecting sensitive login information from their users and want to prevent a repeat of the phishing scams faced by PayPal and others. There is a better way to request sensitive data on behalf of a user, limited to a specific task, and with established forms of trust and identity.

    1. Use the front door
    2. Identify yourself
    3. State your intentions
    4. Provide secure transport

    Use the front door

    Google, Yahoo!, and Microsoft all support web-based authentication by third parties requesting data on behalf of an active user. The Google Authentication Proxy interface (AuthSub), Yahoo! Browser-Based Authentication, and Microsoft's Windows Live ID Web Authentication issue a security token to third-party requesters once a user has approved data access. This token can allow one-time or repeated access and is the preferred method of interaction for today's large data hubs. The OAuth project is a similar concept to web-based third-party authentication systems of the large Internet portals, and may be a common form of third-party access in the future.

    Google Accounts Access example

    Supporting websites provide limited account access to a registered entity after receiving authorization from a specific user. The user can typically view a list of previously authorized third parties and revoke access at any time. The third party retains access to a particular account even after the user changes his or her password.

    Imagine if you could give your local grocery store access to just your kitchen, but not hand over the keys to your entire house. A delivery person would be automatically scanned upon arrival, compared against a registry, and granted access to the kitchen if you previously assigned them access. You could revoke their access to your kitchen at any time, but they never have access to your jewelry box or other non-essential functions within your house.
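
    The registry in that analogy can be modeled in a few lines. This is an in-memory sketch of the idea only: createGrantRegistry and its methods are hypothetical names, and a real data hub would persist its grants and issue signed tokens rather than trust plain strings.

    ```javascript
    // In-memory sketch of a hub's list of user-approved third parties.
    function createGrantRegistry() {
      const grants = new Map(); // "user|app" -> Set of approved scopes

      function key(user, app) {
        return user + '|' + app;
      }

      return {
        // The user approves one app for one scope, e.g. "kitchen".
        grant: function (user, app, scope) {
          const k = key(user, app);
          if (!grants.has(k)) {
            grants.set(k, new Set());
          }
          grants.get(k).add(scope);
        },
        // The user can withdraw an app's access at any time.
        revoke: function (user, app) {
          grants.delete(key(user, app));
        },
        // Access is checked per scope, so an approved app never reaches other rooms.
        isAllowed: function (user, app, scope) {
          const scopes = grants.get(key(user, app));
          return Boolean(scopes && scopes.has(scope));
        }
      };
    }
    ```

    Because the grant is stored separately from the user's password, changing the password leaves it in place, and only an explicit revoke removes it.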

    Identify yourself

    Third-party applications requesting access should first register with the target service for accurate identification and tracking. Applications receive an identification key for future communications connected to a base set of permissions required to accomplish your task (e.g. read only or read/write). A registered application can complete a few extra steps for added user trust and fewer user-facing warning messages.

    State your intentions

    Your application or web service should focus on a specific task such as retrieving a list of contacts from an online address book. Your authentication requests should specify this scope and required permissions (e.g. read only) when you request a user's permission to access his or her data.

    Google services with Gmail highlighted

    An application declaring scope lets users know you are only interested in a single scan of their e-mail and you will not have access to their credit card preferences, stored home address, or the ability to send e-mails from their account. Not requesting full account access in the form of a username and a password creates better trust from the user and the user's existing service(s).
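
    As a rough illustration, the sketch below builds a scoped authorization request in the style of Google's AuthSub. The endpoint and the next, scope, session, and secure parameter names are written from my memory of the AuthSub documentation, so treat them as assumptions to verify; buildAuthSubUrl and its option names are my own.

    ```javascript
    // Build the URL a user visits to approve scoped access for your application.
    function buildAuthSubUrl(options) {
      const url = new URL('https://www.google.com/accounts/AuthSubRequest');
      // Where the hub sends the user, with a token, after approval.
      url.searchParams.set('next', options.next);
      // The single data feed you are asking for, not the whole account.
      url.searchParams.set('scope', options.scope);
      // One-time token (0) or a token exchangeable for repeated access (1).
      url.searchParams.set('session', options.repeatAccess ? '1' : '0');
      // Whether your registered application will sign its requests.
      url.searchParams.set('secure', options.signedRequests ? '1' : '0');
      return url.toString();
    }

    // Example call; the contacts feed scope and callback URL are assumptions:
    //   buildAuthSubUrl({
    //     next: 'https://app.example/callback',
    //     scope: 'http://www.google.com/m8/feeds/',
    //     repeatAccess: false,
    //     signedRequests: true
    //   });
    ```

    The user never types a password into your site. They approve the named scope on the hub's own page and you receive only a token for that task.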

    Provide secure transport

    Armored Truck

    How will you transport my user's data back to your servers? Did you bring an armored car with your company's logo prominently displayed on the side or will my data sit in the back of your borrowed pick-up truck? Requesting applications should transport user data over secure communications channels to prevent eavesdropping and forged messages. Registered and verified secured communications will result in fewer user-facing warning messages of mistrust, and secure certificates are relatively inexpensive. Large portals such as Google or Microsoft will bump your communications (and privileges) to mutual authentication if you are capable.

    Twitter SSL certificate Firefox view

    Register an SSL/TLS certificate for your website to enable secure transport and further identify yourself. Certificates vary in cost and complexity from a free self-signed cert to paid certificates from a major provider with extended validation and server-gated cryptography. Google and Yahoo! use 256-bit keys. Windows Live and Facebook use 128-bit keys.

    Summary

    Data authorization is the first step in data portability. Emerging standards such as OAuth combined with established access methods from Internet giants provide specialized access for third parties acting on behalf of another user. Sites interested in importing data from other services should take note of these best practices and prepare their services for intelligent interchange.

  10. Jan17

    Upgrade your Google Analytics tracker

    Google Analytics logo

    Google released a new version of its Google Analytics tracking code in December after a two-month limited beta. The new Google Analytics tracker is a complete rewrite of JavaScript inherited from the Urchin acquisition in 2005 and the first time the two products have been officially decoupled. The existing version of Google Analytics tracker, urchin.js, has been deprecated but should continue to function until the end of 2008. Google will only roll out new features on the new ga.js tracker. If you currently track website statistics using Google Analytics you should upgrade your templates to take advantage of the new libraries.

    What changed?

    The new Google Analytics tracker supports proper JavaScript namespacing and more intuitive configuration methods (e.g. _setDomainName instead of _udn). My tests show about a 100 ms faster execution even with a 24% increase (1514 bytes) in file size (ga.js is also minified).

    The new tracking code makes advanced features a lot more accessible. You can now track a page on multiple Google Analytics accounts, which should help user generated content sites integrate their authors' Google Analytics IDs alongside the company's own tracking account. The new event tracker lets you group a set of on-page related actions such as clicking a drop-down menu or typing a search query (very useful for widgets). Ecommerce tracking is now a lot more readable. You can read about all the tracker changes in the Google Analytics migration guide PDF.

    Implementation

    Switching your site tracker is pretty simple. Trackers are now created as objects and configured before the page is tracked.

    <script type="text/javascript" src="http://www.google-analytics.com/ga.js"></script>
    <script type="text/javascript">
    var pageTracker=_gat._getTracker('UA-XXXXXX-X');
    pageTracker._initData();
    pageTracker._trackPageview();
    </script>
    

    That's it. You are now running the new Google Analytics tracker. You'll need to swap in your Analytics account and profile IDs, which should be pretty easy to spot in your existing code.
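
    Tracking one page on several accounts follows the same pattern, one tracker object per account. The sketch below uses only the calls shown in the snippet above. The trackOnAccounts helper is a hypothetical name of mine, and the _gat object is passed in as an argument so the function can be exercised without loading ga.js.

    ```javascript
    // gat: the global _gat object defined by ga.js.
    // accountIds: one Google Analytics ID per account that should record the pageview.
    function trackOnAccounts(gat, accountIds) {
      return accountIds.map(function (id) {
        // Same three calls as the single-account snippet, repeated per account.
        const tracker = gat._getTracker(id);
        tracker._initData();
        tracker._trackPageview();
        return tracker;
      });
    }

    // In a page that has already loaded ga.js (placeholder IDs):
    //   trackOnAccounts(_gat, ['UA-XXXXXX-X', 'UA-YYYYYY-Y']);
    ```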

    Summary

    Google Analytics tracking code is completely rewritten for faster on-page behavior that plays well with others. The old tracker is already deprecated and should only keep working through the end of 2008, and new features are only available to users running the new code. Existing Google Analytics users should swap out their tracking code to take full advantage of this free stats tool.

Niall Kennedy is a web technologist in San Francisco, California in the United States. I am very interested in the world of...
