By BoLOBOOLNE payday loans

Rhapsody Greasemonkey Script: Optimizing Text Manipulation in Javascript with Regular Expressions

After many months of talking and thinking about it, I finally wrote a greasemonkey script to annotate web pages with Rhaplinks.  The script scans web pages looking for the names of musicians and when it finds them, links them to Rhapsody.com so you can listen to music by the named artist.

This simple idea is actually tricky to implement properly.  Rhapsody has a lot of music and a lot of artists.  So many that keeping the entire list in a javascript program is impractical, as is downloading the entire list from the server.  So I took the most popular 50-100 artists in each primary genre and combined them into a single manageable list of about 1,000 names.

This idea is made practical by one of my favorite features of Rhapsody.com — human-writable URLs.  Assuming your browser is set up properly (install plugin, enable pop-ups), opening http://play.rhapsody.com/Morcheeba causes Morcheeba to start playing.   This API (can URL’s be API’s?  I think so!) accepts punctuation too — http://play.rhapsody.com/R.E.M. will play R.E.M.  And thanks to a generous interpretation of the HTTP spec by just about everybody, http://play.rhapsody.com/The Postal Service actually works too.  (Note the technically illegal spaces in the URL.)  What this means is that my script just needs a list of the names of the artists, and doesn’t need corresponding ID values to generate the playback URL.  In fact, you can browse www.rhapsody.com for quite a while before ever seeing a database ID in your address bar.  Which brings me to the interesting part of this post.

Computer Science Interlude

Javascript is a slow, interpreted language.  The straightforward way to write this script would be to loop through a list of artist names, replacing each one in the document.  Something like this:

var artists = ['The Postal Service', 'Morcheeba', 'Massive Attack', 'Madonna', 'Tosca', 'Underworld' ];  // The actual list is much
longer...

for(var i=0; i<artists.length; i++) {
   document.body.innerHTML = document.body.innerHTML.replace( //... some regular expression
   );
}

This script would run very slowly.  To scan an HTML document with N characters for M artist names this way would take O(N*M) time.   Instead I wrote the script in just 2 lines as follows:

var regex = /\b(The Postal Service|Morcheeba|Massive Attack|Madonna|Tosca|Underworld)\b/gi;  // The actual list is much longer...

document.body.innerHTML= document.body.innerHTML.replace(regex,"<a href=\"http://play.rhapsody.com/$1\" title=\"Play $1 on Rhapsody\" >$1<img src='http://www.rhapsody.com/favicon.ico' alt=\"Play $1 on Rhapsody\"/></a>");

This might look like a cop-out — a cheezy easy way to do this.  But it’s actually much faster.  This will run in about O(N) time (assuming N>>M).  The single giant regular expression looks for any of the artist-name-keywords and applies it to the whole HTML document at once.  Firefox’s highly-optimized C++ regular expression engine compiles the big artist list into a single state-machine which is applied to the HTML much faster than anything I could possibly write in javascript.  Regular expression interpreters are brilliantly efficient.  Check out Jeffrey Friedl’s excellent Regular Expressions book if you want to know more about this highly practical topic.  The result is that the script can parse a document for a large number of artist names in a totally tolerable amount of time.  There’s a short delay when the page loads, but it’s still faster than browsing in IE.

Enough Theory.  Let’s get down to practice!

The script isn’t perfect, but it’s pretty neat to use it to browse Myspace or Facebook and have a lot of the music people mention be instantly playable

If you’d like to play with it, install Greasemonkey, and then install the Rhapsody Artist Linker script here.

[Update 5/4/07: a new and improved script is available here.  Read about the changes.]

  1. leodirac says:

    Thanks for pointing that our Eric, and for the detailed off-line explanation. I should really run some performance tests on FF to check how they've implemented it.

  2. Eric Lippert says:

    I don't doubt that for typical data the regexp algorithm is O(N), or that it is typically faster than the O(MN) naive solution.

    However I am surprised to learn that Firefox compiles regular expressions into state machines. (As someone who used to develop a competing product and is still tangentially involved with it, I am probably still legally prevented from looking at the firefox implementation, so I cannot check myself.)

    The JScript implementation of regular expressions compiles into a bytecode language which is then run through a custom bytecode interpreter. This interpreter is allowed to do backtracking. This is the usual approach these days for so-called "regular" expressions, because regular expressions are not actually regular anymore. There is not a 1-1 mapping between a finite state machine and a modern regular expression. Modern regular expressions require at least a stack, which makes them equivalent to something as strong or stronger than push-down automata, not state machines.

    This means that (at least in our implementation, I cannot speak for anyone else's) you cannot rely on matching being O(N), because of the potential backtracking. It is certainly possible to build regular expressions which take polynomial time to match and I believe it is possible to build a regular expression which takes exponential time to match.

    Yours is probably not an example of such though.

  1. There are no trackbacks for this post yet.