Tweets & Readability

As I've mentioned previously on this site, this past summer I had the great fortune to intern with MSR in Redmond, WA. Much of the summer was spent discussing, imagining, and thinking about data and science. Additionally, I spent the last few weeks there writing a short paper on one of the data explorations we undertook with Twitter! (I also made a poster about pataphysics, video games, and pandas... but that's another story)

I found writing a paper in another field delightfully challenging. It's like going back to kindergarten...  you have to learn the language, the structure, the pacing and voice. Most importantly, you have to stumble through their literature, trying to appear competent enough to contribute to the scientific process! (Mostly you try to quack like the right breed of duck without looking like a total fraud!) My mentors at MSR helped in this last step as much as they could.

The publication process in CS is quite different from the one in astronomy: work is published in conferences instead of journals, refereeing is two-way anonymous, and acceptance rates are low. I enjoy the sheer number of places you can submit your work. In astronomy we have a fairly small number of respected journals to cite literature from, while CS seems to have an endless number of specialized conferences for every sub-discipline. Pros and cons to both models abound.

I submitted my paper to a well-regarded conference, but in the end it was not selected (though the reviews were quite positive!). I'll probably make another set of changes and submit it again. In the meantime I wanted to give it a stable online home, so I did what any astronomer would do: submit it to the arXiv.

"The Readability of Tweets and their Geographic Correlation with Education"

A major part of this project was based on the US Census, which (as ever) was a fascinating data source. Here's a figure not included in the paper, but made from Census data. It shows the relation between median household income and the fraction of college degree holders within a given ZIP code.

Remember kids: correlation != causation... but stay in school.

The paper outlines how we gathered a large sample of tweets and measured their readability (reading ease). Here's a cute figure for tweets with geo-data (lat, lon), grouped into ZIP code areas and showing the average readability in each (higher reading ease scores = simpler sentences). No large-scale coherent trend is present, but there does appear to be sub-structure. This is something I'd love to follow up on, using some actual statistical/spatial analysis.
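For the curious, here is a minimal sketch of how a reading ease score can be computed for a single tweet. The post doesn't name the formula, so this assumes the standard Flesch Reading Ease score; the function names and the vowel-group syllable counter are illustrative stand-ins, not the (non-public) analysis code.

```python
import re


def count_syllables(word):
    """Crude syllable estimate: count runs of consecutive vowels."""
    lowered = word.lower()
    count = len(re.findall(r"[aeiouy]+", lowered))
    # Assumption: a trailing "e" is usually silent ("make"), except in "-le" ("table").
    if lowered.endswith("e") and not lowered.endswith("le") and count > 1:
        count -= 1
    return max(count, 1)  # every word gets at least one syllable


def flesch_reading_ease(text):
    """Flesch Reading Ease: higher scores mean simpler text.

    Returns None when the text has no words or no sentences.
    """
    sentences = [s for s in re.split(r"[.!?]+", text) if s.strip()]
    words = re.findall(r"[A-Za-z']+", text)
    if not sentences or not words:
        return None
    syllables = sum(count_syllables(w) for w in words)
    words_per_sentence = len(words) / len(sentences)
    syllables_per_word = syllables / len(words)
    return 206.835 - 1.015 * words_per_sentence - 84.6 * syllables_per_word


# Made-up example sentences, only to show the direction of the score.
simple = flesch_reading_ease("The cat sat on the mat.")
dense = flesch_reading_ease(
    "Institutional heterogeneity complicates interdisciplinary collaboration."
)
```

Short words in short sentences push the score up, which matches the "high reading ease = simpler sentences" reading of the map. Real tweets would also need hashtags, @-mentions, and URLs handled before scoring, which this sketch ignores.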

Finally, this is the "money graph" for the paper. Here we've shown the average reading ease score in each ZIP code (actually a ZCTA) compared with the % of college graduates. There is a significant anti-correlation present, which I think is very interesting! More intriguingly, we didn't find a strong correlation with median income or with the high school graduation rate.

Average Readability score as a function of college graduate rate. Lower scores indicate more complex text.
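As a rough illustration of the comparison behind this figure, the sketch below averages per-tweet scores within each ZCTA and then computes a correlation coefficient against the college graduate percentage. The post doesn't say which statistic was used, so the Pearson coefficient is an assumption, and the ZCTA codes and numbers are invented toy values, not data from the paper.

```python
from collections import defaultdict
from math import sqrt


def mean_by_zcta(scored_tweets):
    """Average reading ease per ZCTA from (zcta, score) pairs."""
    totals = defaultdict(lambda: [0.0, 0])  # zcta -> [sum of scores, tweet count]
    for zcta, score in scored_tweets:
        totals[zcta][0] += score
        totals[zcta][1] += 1
    return {zcta: total / n for zcta, (total, n) in totals.items()}


def pearson_r(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two sequences of equal length >= 2")
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for constant input")
    return cov / sqrt(var_x * var_y)


# Toy inputs (hypothetical): per-tweet scores and % college graduates per ZCTA.
tweets = [("00001", 80.0), ("00001", 90.0), ("00002", 70.0),
          ("00003", 60.0), ("00003", 50.0)]
college_pct = {"00001": 10.0, "00002": 30.0, "00003": 50.0}

zcta_means = mean_by_zcta(tweets)
zctas = sorted(zcta_means)
r = pearson_r([college_pct[z] for z in zctas], [zcta_means[z] for z in zctas])
```

A negative `r` corresponds to the anti-correlation described above: a higher graduate percentage going with a lower (more complex) average reading ease. The same function could be pointed at median income or high school graduation rate for the other comparisons.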

A few things could be the underlying cause of this apparent relation:

  1. There are significant demographic differences between ZIP codes with very high #'s of college grads and those without. These more highly educated people may use more complex language in their tweets, but this seems too speculative to be convincing to me.
  2. The content type of tweets may be different in these higher education ZIP codes. For example, promotion of news/events versus personal status updates. Content-tagging a massive number of tweets is needed to understand how linguistic complexity depends on content.

To my knowledge only a handful of (very interesting) studies have investigated linguistic complexity within Twitter, and none that I'm aware of have looked at its geographic or regional dependence. The neat thing about Twitter is that it is a (massive!) living data set, and you can repeat these experiments every day.

Just for fun, here are a few neat projects/studies being done with data gathered from or derived from Twitter:


  1. Neat analysis! Which software did you use? Can you share your code?

    1. These figures were generated with matplotlib. The code to do the analysis is not public.

