A tumbleblog by Jacob Harris (harrisj)

May 25


Like a large swath of the Internet, I have become increasingly obsessed with the nonsensical tweets of the @horse_ebooks spambot twitter account (if you don’t know what I’m talking about, here are some links: Know Your Meme; The Ballad of @Horse_ebooks; The Human Being Behind @Horse_ebooks; and the Web Comic). I also love the @nytimes twitter account, which is now 5 years old and has over 5 million followers. I felt like it would be fun to do something silly to commemorate how far @nytimes has come.

And thus, @nytimes_ebooks was born.

Before I explain how it works, I must restate my usual caveat: I am a developer at the New York Times who works on twitter (among other things), but this is NOT an official project of the NY Times or part of our twitter strategy. It’s silly to have to declare this, but I don’t want to see any blog posts leaping to conclusions about such a silly topic.

Anyhow, if you are curious, this is how it works:

  • On a cron job, I grab the RSS feed for the New York Times homepage and look for new articles.
  • I have some code to grab the article text whether it’s an article or a blog post.
  • I then extract the quotes I find in the text; these are usually a lot more colorful than the rest of the article (better for ebookification).
  • These are fed into a Markov chainer, which then spits out a sentence. I set the order to 1 to make it more nonsensical.
  • Then I do a few stylistic tweaks to match the @horse_ebooks style: replace apostrophes with spaces, and truncate the last word of a sentence if it is preceded by a preposition. It’s the little things that make the difference.
  • Of course, I also append a shortened URL so you can see where the text came from.
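
The middle steps of that list (pull out quotes, run an order-1 Markov chain, apply the style tweaks) can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the bot's actual code: the function names, the preposition list, and the reading of "truncate" as "drop the final word" are all hypothetical, and fetching the RSS feed and shortening URLs are left out.

```python
import random
import re
from collections import defaultdict

# Hypothetical preposition list; the post does not say which words are checked.
PREPOSITIONS = {"of", "in", "on", "at", "to", "for", "with", "from", "by"}


def extract_quotes(text):
    """Return the passages found between double quotes (straight or curly)."""
    return re.findall(r'["“]([^"“”]+)["”]', text)


def build_chain(sentences):
    """Order-1 chain: map each word to the list of words seen right after it."""
    chain = defaultdict(list)
    for sentence in sentences:
        words = sentence.split()
        for current, following in zip(words, words[1:]):
            chain[current].append(following)
    return chain


def generate(chain, start, max_words=12, rng=random):
    """Walk the chain from `start` until a dead end or `max_words` words."""
    words = [start]
    while len(words) < max_words and chain.get(words[-1]):
        words.append(rng.choice(chain[words[-1]]))
    return " ".join(words)


def ebookify(sentence):
    """Replace apostrophes with spaces, then drop a final word that follows
    a preposition (one possible reading of the tweak described above)."""
    words = re.sub(r"['’]", " ", sentence).split()
    if len(words) >= 2 and words[-2].lower() in PREPOSITIONS:
        words = words[:-1]
    return " ".join(words)


if __name__ == "__main__":
    # Made-up article text, purely for demonstration.
    article = 'He said "we can\'t wait for the future of news" and left.'
    quotes = extract_quotes(article)
    chain = build_chain(quotes)
    print(ebookify(generate(chain, "we", rng=random.Random(0))))
```

Because every repeated word is stored once per occurrence in the successor list, `rng.choice` picks the next word in proportion to how often it followed the current one in the source quotes.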

I’m still tweaking things, but it’s remarkable how compelling the text generated from this approach can be, especially since low-order Markov generation is more prone to looping and nonsense. All of which makes for a fun hack.
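
One rough way to see why a low order wanders more is to count how many different words can follow each state. The sketch below does that for a made-up toy sentence (not real @nytimes text); the helper names are hypothetical. With order 1, a common word like "the" has many possible successors and can be reached again and again, which is where the looping and the non sequiturs come from.

```python
from collections import defaultdict


def successors(words, order):
    """Map each run of `order` consecutive words to the set of words that follow it."""
    table = defaultdict(set)
    for i in range(len(words) - order):
        table[tuple(words[i:i + order])].add(words[i + order])
    return table


def average_branching(words, order):
    """Average number of distinct next words per state."""
    table = successors(words, order)
    return sum(len(nexts) for nexts in table.values()) / len(table)


# Toy input, invented for this example.
words = "the cat sat on the mat and the dog sat on the rug".split()

print(sorted(successors(words, 1)[("the",)]))  # ['cat', 'dog', 'mat', 'rug']
print(average_branching(words, 1) > average_branching(words, 2))  # True
```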