Graphs of Thrones

or

A Chart of Ice and Fire

or some other cheesy pun....

That's right, I'm jumping on the Game of Thrones bandwagon, a mere day after the series concluded its broadcast run on HBO. Many have already examined the "downfall" of GoT in ratings over this past season (there are NINE such posts on r/dataisbeautiful at time of writing!). I'll spare you my takes on creativity versus expectations; instead, I thought it would be fun to look at another (relatively) simple way we can analyze 'Thrones: Natural Language Processing!

So I'm dusting off Python's NLTK package, which I've used many times in the past (my favorite Star Trek example is here). All the code for this project can be found on my GitHub, of course!

First, as ever, we need data! Shout-out to this blog post (in R) from 2017, whose author linked their source: here! I was able to easily step through all 8 seasons of episodes and scrape the HTML for each script (well, at least the dialog). There is one gap in the data: Season 7 Ep 1 was missing. Not a big deal, it just has to be left out of the analysis throughout!

Go look through the code for the details - it's a straightforward use (i.e. I spent 2 hours on StackOverflow....) of urllib, BeautifulSoup, and nltk.
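
If you just want the flavor of the scraping step without opening the repo, here is a minimal sketch. It is not my actual code: it swaps BeautifulSoup for the standard library's html.parser so it runs with no extra installs, and the names TextExtractor, html_to_text and fetch_script are made up for illustration, as is the sample HTML.

```python
from html.parser import HTMLParser
from urllib.request import urlopen


class TextExtractor(HTMLParser):
    """Collect visible text, skipping the contents of <script> and <style>."""

    def __init__(self):
        super().__init__()
        self._skip_depth = 0
        self.chunks = []

    def handle_starttag(self, tag, attrs):
        if tag in ("script", "style"):
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in ("script", "style") and self._skip_depth > 0:
            self._skip_depth -= 1

    def handle_data(self, data):
        # Keep only non-empty text that is outside script/style blocks.
        if self._skip_depth == 0 and data.strip():
            self.chunks.append(data.strip())


def html_to_text(html):
    """Return the visible text of an HTML string, one chunk per line."""
    parser = TextExtractor()
    parser.feed(html)
    parser.close()
    return "\n".join(parser.chunks)


def fetch_script(url):
    """Download one episode page (the URL is an assumption) and return its text."""
    with urlopen(url) as response:
        html = response.read().decode("utf-8", errors="replace")
    return html_to_text(html)


# Made-up sample page, so the sketch runs without touching the network.
sample = "<html><body><p>JON: Winter is coming.</p><script>var x = 1;</script></body></html>"
print(html_to_text(sample))
```

Looping fetch_script over every season and episode URL gives one text blob per episode, which is all the later graphs need.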

Let's get to some graphs!

Reading Grade Level



Here's a graph that shows basically nothing. (A graph has no meaning? I'm trying here...) The "readability", or approximate grade level the text is written at, stays roughly constant over the whole show (seasons marked by grey boxes). The specific grade number doesn't really matter; what's interesting is that despite the show passing the books starting in Season 6, the grade doesn't really change. In other words, the language used doesn't get any simpler or more complex. (Note: that doesn't mean the writing is the same or as good. This is not a forensic or quality analysis.)
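
For the curious, here is roughly what a grade-level score looks like in code. This is a sketch that assumes the Flesch-Kincaid grade formula, which is one common choice and not necessarily the metric behind the graph; the syllable counter is a crude vowel-group heuristic, and both function names are made up for illustration.

```python
import re


def count_syllables(word):
    """Rough heuristic: count groups of consecutive vowels, minimum one."""
    groups = re.findall(r"[aeiouy]+", word.lower())
    return max(1, len(groups))


def flesch_kincaid_grade(text):
    """Flesch-Kincaid grade level (assumed metric, see the note above)."""
    sentences = [s for s in re.split(r"[.!?]+", text) if s.strip()]
    words = re.findall(r"[A-Za-z']+", text)
    if not sentences or not words:
        return 0.0
    syllables = sum(count_syllables(w) for w in words)
    words_per_sentence = len(words) / len(sentences)
    syllables_per_word = syllables / len(words)
    return 0.39 * words_per_sentence + 11.8 * syllables_per_word - 15.59


# Made-up sample lines, not taken from the scripts.
simple = "The king is dead. The night is dark."
fancy = ("Consequently, the illegitimate pretender's extraordinary "
         "proclamation destabilized the entire continental aristocracy.")
print(flesch_kincaid_grade(simple), flesch_kincaid_grade(fancy))
```

Short sentences of one-syllable words score low (even negative), long sentences of long words score high, which is why the absolute number matters less than how it moves from episode to episode.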

What's it about?

We can use this library of scripts to look at the occurrence rates of words - like the old "ngram" viewer. For example, here's the occurrence of a few GoT-brand words:

Despite the amazing women in the cast who arguably hold most of the power throughout the show, GoT is apparently still all about Kings (spoiler: I think "King's Landing" is skewing this graph). Also, the show is usually a Song of Mostly Ice, and a little Fire.



It's also a show about mothers, some of whom are mothers to dragons, some to just plain monsters...
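
The word graphs in this section boil down to counting tokens per episode. Here is a minimal sketch, assuming a simple regex tokenizer in place of nltk and a rate expressed per 1,000 words; the function name occurrence_rates and the two toy "episodes" are made up for illustration.

```python
import re
from collections import Counter


def occurrence_rates(episodes, targets):
    """Return {word: [occurrences per 1,000 words in each episode]}."""
    rates = {w: [] for w in targets}
    for text in episodes:
        # Lowercase and split into word-like tokens (apostrophes kept).
        tokens = re.findall(r"[a-z']+", text.lower())
        counts = Counter(tokens)
        total = len(tokens) or 1  # avoid dividing by zero on empty text
        for w in targets:
            rates[w].append(1000.0 * counts[w.lower()] / total)
    return rates


# Toy stand-ins for two episode scripts.
episodes = ["The king is the king.", "Fire and ice and fire."]
rates = occurrence_rates(episodes, ["king", "fire"])
print(rates)
```

Tokenizer choices matter here: this regex keeps "king's" as its own token, so "King's Landing" would not count towards "king", while a tokenizer that splits off the possessive would count it.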

What happens?

Here we can see that Summer in Westeros apparently ended in mid Season 4. Does that make Season 5 autumn?


There was a big wedding in late summer. Maybe you remember that? At least the weather was still nice...



Where does it happen? 

Despite everybody gunning for the Iron Throne down in King's Landing, it's really a show about Winterfell (duh)



Who?

Let's look at how often the characters show up. Note this includes both dialog prompts (e.g. "Jon says:") and people being referenced (e.g. "You know nothing, Jon..."). I'd argue both count towards a character's impact on a story.
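
Counting a character this way can be sketched with a whole-word, case-insensitive match, so that speaker prompts and in-dialog references both land in the same bucket. The alias lists, the helper name count_mentions and the sample scene below are assumptions for illustration, not the lists used for the graphs.

```python
import re


def count_mentions(text, aliases):
    """Count whole-word, case-insensitive matches of any alias."""
    pattern = r"\b(?:" + "|".join(re.escape(a) for a in aliases) + r")\b"
    return len(re.findall(pattern, text, flags=re.IGNORECASE))


# Hypothetical alias lists: every spelling that should count for a character.
cast = {"Jon": ["Jon"], "Dany": ["Dany", "Daenerys"]}

# Made-up scene with speaker prompts and an in-dialog reference.
scene = "JON: I know. YGRITTE: You know nothing, Jon Snow. JON: Daenerys will come."
mentions = {name: count_mentions(scene, names) for name, names in cast.items()}
print(mentions)
```

The word boundaries keep "Jon" from matching inside longer words, and grouping aliases means "Dany" and "Daenerys" add up to one character.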

OK, so the Starks are obviously the main characters, usually. Interestingly (and somewhat disappointingly to me), Dany is never the most mentioned character (!)

Here we see the various threads of the story being told as characters rise in importance and are (usually) killed. This is classic GoT storytelling... Hodor

I love this graph, because it shows a dramatic shift in the last 2-3 seasons! Starting in Season 6 it really becomes the Jon Snow and Friends show. This upward bend for all the main characters in the last 3 seasons might also represent a shift in the writing style: the scripts become more explicit, telling us lots of things rather than showing.... I'd love to see more on this.

Who matters? or "It Ends like it Begins"

Looking at the rise/fall of the various cast is a neat way to view the show, but I started wondering if there was a way to examine the broad shifts in how characters were represented.

In this final graph I show the total occurrence rate per episode of all "Main" characters divided by all the "Supporting" characters. (cast lists defined here). Episode 1 is all about exposition, telling us who the important players are. Very quickly we're thrown into the world of Westeros, and the lead cast usually has about 60% more lines/mentions than the secondary cast. Seems reasonable.
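
The ratio itself is simple arithmetic once you have per-episode mention counts. Here is a minimal sketch; the cast lists and the counts are invented for illustration (the real lists are the ones linked above), and the function name is hypothetical.

```python
def main_to_supporting_ratio(mentions, main_cast, supporting_cast):
    """Total mentions of the main cast divided by those of the supporting cast."""
    main_total = sum(mentions.get(name, 0) for name in main_cast)
    supporting_total = sum(mentions.get(name, 0) for name in supporting_cast)
    if supporting_total == 0:
        return float("inf")  # no supporting mentions at all in this episode
    return main_total / supporting_total


# Invented counts for a single episode, and hypothetical cast lists.
episode = {"Jon": 40, "Tyrion": 40, "Bronn": 30, "Hodor": 20}
ratio = main_to_supporting_ratio(episode, ["Jon", "Tyrion"], ["Bronn", "Hodor"])
print(ratio)
```

A ratio of 1.6 corresponds to the "about 60% more" level described above; computing it for every episode and plotting the values in order gives the shape of the final graph.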

One big outlier is present, Season 2 Ep 9: the battle of Blackwater, which includes lots of big moments for the supporting cast!

But the ending of the show really stands out. The last 2 entire seasons become utterly dominated by the principal cast. I certainly felt this happening at the end of Season 6, when a TON of secondary (and main) characters have their story lines... concluded. Many people feel this was the last great episode of GoT. While the data can't prove whether it's "great" or not, Season 6 Ep 10 is clearly an inflection point where the structure of the show changes.

With only 2 partial seasons remaining, it makes sense they had to shift their storytelling style a bit (so many wars to fight!). I can think of two possible interpretations of this graph:

  1. over the final 2 seasons the show distilled the story down to just the principal cast to wrap up story lines more directly; or,
  2. the storytelling/dialog style really did shift, and characters became more explicit about discussing each other, and giving exposition.

With only a couple hours of playing with this data, I can't tell these 2 scenarios apart. But perhaps somebody with a better script library, or who wants to include some more complex text analysis, will take the ball and run... there's a ton more graphs and useful code on my GitHub repo, or check out the full imgur album!
