Ok, this one may need some explanation, so here I go:
As some of you may be aware, I have been a bit obsessed over the last few days with finding a good way to measure the information content of individual posts. After some dead ends I finally came up with the following scheme to approximate originality and information content in the OTT.
I count the number of new words per post, and the number of new word sequences of length 2, 3, 4, and 5. Then I sum those values, but in doing so I divide the counts for the sequences by 2, 4, 8 and 16 respectively, to counter inflation. (This step btw is highly dubious in terms of information theory, but I couldn't come up with anything better... so sue me.) As a last step I multiply by 16 (just to get rid of the fractions).
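For anyone who wants to play along, here is a rough Python sketch of that scoring scheme. It is not the actual script behind the diagram, just one way to read the description. The tokenizer, the choice to count each new word or sequence only once per post, and the names `tokenize`, `ngrams` and `originality_scores` are all assumptions made for illustration.

```python
import re

# Dividing by 1, 2, 4, 8, 16 and then multiplying by 16 gives these
# integer weights per sequence length.
WEIGHTS = {1: 16, 2: 8, 3: 4, 4: 2, 5: 1}


def tokenize(text):
    # Assumption: case-insensitive, a word is a run of letters,
    # digits or apostrophes.
    return re.findall(r"[a-z0-9']+", text.lower())


def ngrams(tokens, n):
    # Set of distinct word sequences of length n in one post.
    return {tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)}


def originality_scores(posts):
    # posts: post texts in thread order. Returns one score per post.
    seen = {n: set() for n in WEIGHTS}  # everything used so far, per length
    scores = []
    for text in posts:
        tokens = tokenize(text)
        score = 0
        for n, weight in WEIGHTS.items():
            new = ngrams(tokens, n) - seen[n]  # not used in any earlier post
            score += weight * len(new)
            seen[n] |= new
        scores.append(score)
    return scores
```

Feeding it the posts in thread order gives one number per post. An exact repeat of an earlier post scores 0, since none of its words or sequences are new.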
The diagram below is the first result of these calculations. What is pretty amazing about it is the rise in original content. The graph starts with a peak, which is just to be expected: The first post naturally is all new content and so are the following ones. The metric quickly declines as most words and combinations thereof are already used. But then something amazing happens: The counts rise. And keep rising. And still do so. Which means that... Boy, aren't we a creative bunch!
Ok, it's coma time here, but I've got a few new ideas for combining this data with other things (eras come to mind, but also user rankings, new users over time, information by post length and so on).
- Added original content