XMPP PubSub: a great complement to Atom/RSS
PubSub 101: Push vs Pull
PubSub is short for "publish-subscribe," a common design pattern for distributing information to interested parties. The publisher tells a server about new information, and the server fans the information out to everybody who has registered interest in that topic or channel. Data consumers find out about the new information very quickly, with relatively little load on the whole system, since the pubsub mechanism provides a means to "push" data to them.
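To make the fan-out concrete, here is a minimal in-process sketch of the pattern. The Broker class and its method names are made up for illustration; a real pubsub server does the same bookkeeping across network connections rather than Python callbacks.

```python
from collections import defaultdict
from typing import Callable, DefaultDict, List

Callback = Callable[[str], None]


class Broker:
    """Toy stand-in for a pubsub server (hypothetical, in-process, not XMPP)."""

    def __init__(self) -> None:
        # topic name -> callbacks of everyone who registered interest
        self._subscribers: DefaultDict[str, List[Callback]] = defaultdict(list)

    def subscribe(self, topic: str, callback: Callback) -> None:
        """Register interest in a topic."""
        self._subscribers[topic].append(callback)

    def publish(self, topic: str, item: str) -> int:
        """Push item to every subscriber of topic; return how many were notified."""
        callbacks = self._subscribers.get(topic, [])
        for callback in callbacks:
            callback(item)
        return len(callbacks)


broker = Broker()
alice_inbox: List[str] = []
bob_inbox: List[str] = []
broker.subscribe("news", alice_inbox.append)
broker.subscribe("news", bob_inbox.append)
broker.subscribe("weather", bob_inbox.append)

# One publish call reaches both "news" subscribers; nobody had to ask.
notified = broker.publish("news", "new post available")
```

The publisher makes a single call, and the broker carries the cost of remembering who cares about what. That bookkeeping is exactly what the pull model avoids.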
By contrast, almost all of the web today uses a "pull model" where a data consumer only finds out about new information when it gets around to checking if there is something new. This data distribution model is significantly simpler because the server only needs to keep track of the content, not who is interested in knowing about it. Modern networks are optimized for this kind of query-based traffic, where data consumers (clients, web browsers) initiate connections to servers, to the point that it’s often impossible for servers to initiate connections to clients because of firewalls or NAT.
The downside of the pull model is that the only way a data consumer can find out if anything is new on the server is to "check back frequently" or "poll" the server for changes. If you want to know within 15 minutes if anything new has been posted, you have to ask the server at least every 15 minutes "anything new?" No. "How about now?" No. "Got anything yet?" No. Multiply this by potentially millions of interested data consumers and you end up spending a lot of network bandwidth and server resources doing very little. Even worse, the problem scales horribly. If clients want to know about changes within 5 minutes instead of 15, that puts 3 times the load on the server. Want to know within a few seconds? Forget it: the servers would crash. There’s an intrinsic delay in distributing information in this model, and reducing that delay is very expensive.
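A quick back-of-the-envelope calculation shows how the load grows as the acceptable delay shrinks. The audience of one million consumers is a made-up number, and the sketch assumes every consumer polls exactly once per interval.

```python
SECONDS_PER_DAY = 24 * 60 * 60


def polls_per_day(consumers: int, max_delay_seconds: int) -> int:
    """Requests per day if every consumer polls once per max_delay_seconds."""
    return consumers * (SECONDS_PER_DAY // max_delay_seconds)


consumers = 1_000_000  # hypothetical audience size

every_15_min = polls_per_day(consumers, 15 * 60)  # 96 polls per consumer per day
every_5_min = polls_per_day(consumers, 5 * 60)    # 288 polls per consumer per day
every_5_sec = polls_per_day(consumers, 5)         # 17,280 polls per consumer per day

print(every_5_min // every_15_min)  # 3
print(every_5_sec // every_15_min)  # 180
```

The load is inversely proportional to the delay you are willing to tolerate, and nearly all of those requests come back with "nothing new."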
XMPP as an alternative to polling
XMPP is the protocol used for Instant Messaging by Google Talk and Jabber and a large number of small servers. In order to deliver instant messages, XMPP systems maintain persistent connections between all machines that allow packets of data to be pushed with very low latency: IM messages are typically delivered within a second of sending them. So it’s natural to want to use this infrastructure to deliver other data more efficiently than through HTTP polling.
The XMPP PubSub spec, known as XEP-0060, describes how to do exactly this at the protocol level. But for a variety of reasons, this standard has not gained wide adoption. IMHO the biggest reason is that there isn’t a very pressing need. The current system is horribly inefficient, but it works. Moreover, it puts the burden of inefficiency squarely on the shoulders of the information publishers. Popular publishers are just expected to shell out for the hardware needed to meet the demands of their readers, and with advertising they can typically recoup the investment.
Another way to state that is that pubsub hasn’t found its niche yet. IMHO this is partly because the mechanism is so useful it can be applied to almost anything. Not just breaking news, but everything from e-mail mailing lists to doorbell chimes gets used as an example of how XMPP pubsub technology could be applied. Not wanting to exclude any of these potentially interesting uses, the protocol remains very generic.
Micro-blogging, Atom and Yesterday’s Realization
One place where the current HTTP model breaks down is micro-blogging, which is the generic term for services like Twitter or Facebook’s updates. Here, the payload of actual content is very small, so the overhead of checking far outweighs the "useful data" which is delivered. Also, because the information is social (e.g. "Heading to Broadway for a bite, wanna come?") consumers demand it be delivered quickly. Nonetheless, current micro-blogging services still rely on polling clients, and their servers suffer as a result.
Yesterday, a group of us including Blaine Cook, Anders Conbere, Evan Prodromou, and XEP-0060 co-author Ralph Meijer were discussing how to apply XMPP PubSub to micro-blogging. This was likely obvious to many there already, but during the discussion I had a realization. We aren’t solving this problem from whole cloth. RSS and Atom feeds already describe all the information we need. We just need to find a way to substitute XMPP for the assumed transport HTTP.
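As a rough sketch of what "substituting the transport" could mean, the snippet below wraps a bare-bones Atom entry in a pubsub publish request. The two namespaces and the iq/pubsub/publish/item nesting follow XEP-0060 and the Atom spec as I understand them. The function name, the stanza id, and the service and node values are placeholders, and a real client would send this over an authenticated XMPP stream (which also supplies the stanza's default namespace).

```python
import xml.etree.ElementTree as ET

PUBSUB_NS = "http://jabber.org/protocol/pubsub"
ATOM_NS = "http://www.w3.org/2005/Atom"


def build_publish_iq(service: str, node: str, item_id: str, title: str,
                     stanza_id: str = "publish1") -> ET.Element:
    """Build a publish request whose payload is a minimal Atom entry.

    Hypothetical helper: it only builds the XML and does not talk to a server.
    """
    iq = ET.Element("iq", {"type": "set", "to": service, "id": stanza_id})
    pubsub = ET.SubElement(iq, f"{{{PUBSUB_NS}}}pubsub")
    publish = ET.SubElement(pubsub, f"{{{PUBSUB_NS}}}publish", {"node": node})
    item = ET.SubElement(publish, f"{{{PUBSUB_NS}}}item", {"id": item_id})
    # The payload is the same Atom entry that would appear in the HTTP feed.
    entry = ET.SubElement(item, f"{{{ATOM_NS}}}entry")
    ET.SubElement(entry, f"{{{ATOM_NS}}}title").text = title
    return iq


stanza = build_publish_iq("twitter.com", "users/leopd", "1",
                          "Heading to Broadway for a bite")
xml_text = ET.tostring(stanza, encoding="unicode")
print(xml_text)
```

The point is that the entry itself does not change. Only the envelope around it does.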
So we discussed mechanisms for mapping an Atom URL to an XMPP PubSub Node. (We pretty much ignored RSS because RSS isn’t as cool for reasons I really don’t understand.) We talked about putting a link-rel tag in the feed to point to the XMPP PubSub node. This would look something like:
<link rel="alternate" type="xmpp/pubsub" href="xmpp:twitter.com?;node=users/leopd" />
Even better, the URL for these nodes should be guessable from the URL for the HTTP feed. The above node would be the normal place to look for the pubsub version of http://twitter.com/leopd. Even though it’s not as generic and robust to have a standard mapping like this, I think it’s an important way to speed adoption of a new standard. The code to do a bit of string manipulation is vastly easier than fetching the URL and looking for a link-rel tag. And developers are intrinsically lazy (for good reasons!) so making things easier for them means they’ll succeed a lot more.
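Here is roughly how little string manipulation that would take. The "users/" prefix and the use of the URL path as the user name are assumptions drawn from the single twitter.com example above, not an agreed convention, and guess_pubsub_uri is a made-up name.

```python
from urllib.parse import urlparse


def guess_pubsub_uri(feed_url: str) -> str:
    """Guess the pubsub node URI for an HTTP feed URL (hypothetical mapping)."""
    parsed = urlparse(feed_url)
    user = parsed.path.strip("/")
    if not parsed.hostname or not user:
        raise ValueError(f"cannot guess a pubsub node for {feed_url!r}")
    # Assumed convention: same host, node named "users/<path of the HTTP URL>".
    return f"xmpp:{parsed.hostname}?;node=users/{user}"


uri = guess_pubsub_uri("http://twitter.com/leopd")
print(uri)  # xmpp:twitter.com?;node=users/leopd
```

No network round trip and no XML parsing: a client can try the guessed node first and fall back to looking for the link-rel tag if that fails.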
Ever pragmatic, Blaine pointed out that we should use HTTP for things it is good at, and not re-invent them in XMPP. I wholeheartedly agree. Re-transmission is a key example. What happens if a client is offline when a new post happens, and so never hears about it? Answer: The clients should fetch the historic archive of the feed over HTTP. These feeds exist today — no need to improve on them. If all the posts have sequence numbers on them, then it’s easy to figure out if you’ve missed one. So all the posts from a user should have sequence numbers. I don’t think this is standard in Atom feeds today.
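The gap check could be as simple as the sketch below. It assumes a hypothetical per-user sequence number that goes up by exactly one with each post, which, as noted, is not something Atom feeds carry as far as I know. The function name is made up too.

```python
from typing import Iterable, List


def missing_sequence_numbers(last_seen: int, received: Iterable[int]) -> List[int]:
    """Return the sequence numbers skipped between last_seen and the newest one received."""
    seen = set(received)
    if not seen:
        return []
    newest = max(seen)
    # Anything after last_seen and before the newest post that never arrived.
    return [n for n in range(last_seen + 1, newest) if n not in seen]


# The client saw post 41, went offline, then got 42, 45 and 46 pushed over XMPP.
gaps = missing_sequence_numbers(41, [42, 45, 46])
print(gaps)  # [43, 44]
```

Anything that shows up in the gap list is what the client would go and fetch from the regular HTTP feed.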
The story unfolds…
There’s a lot more to be worked out and standardized here. And clearly many more people need to voice their opinions before we can reach consensus. Sadly I can’t be down in Portland today to continue the discussion, so this post will have to take my place as I return to my regular daily commitments. If you’d like to stay tuned as the story unfolds, you’ll have to poll this site, as I can’t yet give you a PubSub node to subscribe to for updates. If I could it would probably be something like xmpp:embracingchaos.com?;node=xmpp — try it. By the time you read this, it might be working!