So I decided to try to build a full-blown web app using the various tools Dojo provides. I wanted to try out the feed reading data store, the grid control, some of the form dijits (Dojo widgets), and the tab control. After some thought, I ended up deciding to write an application that would allow you to compare the writing styles of different blogs and see if they had the same author. This idea had been on my “backburner list” since around 2003 when I read an interesting article on the subject called “Bookish Math”.
The field of study that deals with examining artistic works to determine authorship is called stylometry. The theory is that each of us has a unique and identifiable way of doing something. For writing, many of the newer techniques involve examining how often we use common words. From “Bookish Math”:
“People’s unconscious use of everyday words comes out with a certain stamp,” says David Holmes, a stylometrist at the College of New Jersey in Ewing. Precisely because writers use these function words without thinking about them, they may offer more reliable fingerprints of a writer’s style than unusual words do.
“Rare words are noticeable words, which someone else might pick up or echo unconsciously,” Burrows says. “It’s much harder for someone to imitate my frequency pattern of ‘but’ and ‘in’.”
The article goes on to talk about how frequency analysis of certain words in the “Federalist Papers” supported the idea that Madison, rather than Hamilton, wrote the disputed essays, how an analysis of the 15th Wizard of Oz book (billed as L. Frank Baum’s last book) revealed that it wasn’t really written by him, and how various other works can be clearly distinguished from an analysis of common words.
Since counting up common words is rather trivial, I decided to see if I could read in some blog feeds, find the frequencies of their common words, and then compare these frequencies to other blogs to see if I could determine authorship. Unfortunately, this rather naive approach didn’t come out as well as I hoped. After the app was tested, none of the numbers seemed to really stand out.
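To make the counting step concrete, here is a minimal Python sketch (not the app's actual Dojo/JavaScript code). The function name `word_frequencies`, the tokenizing regex, and the `top_n` parameter are assumptions made for illustration; the app may split words differently.

```python
import re
from collections import Counter


def word_frequencies(text, top_n=50):
    """Return the top_n most common words as {word: percent of all words}."""
    # Assumed tokenizer: lowercase runs of letters, with an optional
    # straight-apostrophe contraction such as "don't".
    words = re.findall(r"[a-z]+(?:'[a-z]+)?", text.lower())
    if not words:
        return {}
    total = len(words)
    counts = Counter(words)
    # A frequency of 1.0 means the word makes up 1% of all words.
    return {word: 100.0 * count / total
            for word, count in counts.most_common(top_n)}


sample = "the cat and the dog and the bird"
freqs = word_frequencies(sample, top_n=2)
print(freqs)
```

Feeding it the text of a whole feed instead of `sample` would give the kind of per-blog frequency table the app produces.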
For blogs that should be similar (like this one and my livejournal), I found the common word frequencies to vary quite a bit. I only had overlap on around 10-20% of the words, and I wasn’t sure if that was a statistical coincidence. I also used one other person’s professional and personal blog and found similar results. I then tried to do a little original research and implemented the following algorithm:
- Find the frequencies of the 50 most common words in the blog’s first 1,200 words.
- Find the frequencies of the 50 most common words in the whole document.
- Compare the two lists and dub the words that have similar frequencies “pattern words” – words that the person seems to use with a consistent frequency.
- Compare the “pattern words” in different blogs and see how well they overlap.
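The steps above could be sketched roughly as follows in Python. This is only an illustration under stated assumptions: the post does not define what counts as a "similar" frequency or how overlap is scored, so the relative `tolerance` and the Jaccard-style `overlap` score are my own stand-ins, and all function names are hypothetical.

```python
import re
from collections import Counter


def tokenize(text):
    # Assumed tokenizer: lowercase words with optional contractions.
    return re.findall(r"[a-z]+(?:'[a-z]+)?", text.lower())


def top_frequencies(words, top_n=50):
    """Map each of the top_n most common words to its percent of all words."""
    if not words:
        return {}
    total = len(words)
    return {w: 100.0 * c / total for w, c in Counter(words).most_common(top_n)}


def pattern_words(text, sample_size=1200, top_n=50, tolerance=0.25):
    """Words whose frequency in the first sample_size words is close to
    their frequency in the whole text."""
    words = tokenize(text)
    sample = top_frequencies(words[:sample_size], top_n)
    whole = top_frequencies(words, top_n)
    patterns = {}
    for word, whole_freq in whole.items():
        sample_freq = sample.get(word)
        if sample_freq is None:
            continue  # not in both top lists
        # Assumption: "similar" means within a relative tolerance.
        if abs(sample_freq - whole_freq) <= tolerance * whole_freq:
            patterns[word] = whole_freq
    return patterns


def overlap(patterns_a, patterns_b):
    """Assumed score: shared pattern words divided by all pattern words."""
    a, b = set(patterns_a), set(patterns_b)
    if not a or not b:
        return 0.0
    return len(a & b) / len(a | b)


blog_a = "the cat and the dog " * 10
blog_b = "the bird and the fish " * 10
score = overlap(pattern_words(blog_a, sample_size=25, top_n=4),
                pattern_words(blog_b, sample_size=25, top_n=4))
print(score)
```

A score near 1 would suggest the two blogs share most of their pattern words, and a score near 0 that they share few.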
That worked a bit better, but I still couldn’t get completely accurate results, so the algorithm still needs a lot of work. Below you can see a small sampling of the frequency results from this blog vs my old livejournal. A frequency of 1% would mean that word makes up 1% of all of the words that were typed.
As for the Dojo side of things, I ended up really liking the slick look of the dijits. I also liked how I didn’t have to host any of the Dojo files myself; I could simply use the ones posted at the AOL Developer Network.
However, I wasn’t too happy that Dojo caused the page to take 3-4 seconds to load. This might be because I’m using a lot of Dojo tools and the Dojo library is 1.6MB gzipped. Not everything is downloaded, only what you use, but I ended up using quite a few of its tools. The sudden change from normal widgets into dijits in front of the user was also kind of odd, and I’m not sure if there’s a way to avoid that.
Other issues I ran into were:
- There’s a bug in the grid control that affects IE7 users. The grid text doesn’t appear in IE7 if the div containing the grid has anything other than “left” for the “text-align” style property.
- You can’t create Dojo grids in divs that have their “display” property set to “none”. This bothered me because I originally wanted the grid containing the frequencies to “fade-in” after the user hit the “Process Data!” button.
Despite the shortcomings of my algorithm and some of the app’s bloat, I decided to post it up anyway. You can view it here: Blog Stylometry Tool (note: I commented out the analysis side of things, so all it does is spit back a table of word frequencies)
I’ll most likely end up slimming down the amount of Dojo that it uses to cut the load time. Either that, or I’ll try to figure out a way to defer some of the loading. The majority of the loading time comes from setting up the grid.