IMDB D3 (Dimple.js) Visualizations

By Andrey Kurenkov (hi!)

And there it is! IMDB data collected with this code, visualized with D3, or more precisely with the D3-powered Dimple.js(http://dimplejs.org/). The data is minimally cleaned up by filtering for movies that have at least one vote and associated length information, and info on TV episodes or shows is also not included, but the data is otherwise directly (after parsing) from IMDB. The legend is interactive (try clicking the rectangles!).

As you can see this chart visualizes the number of genre movie releases between 1915 and 2013, as well as the total number of movies in those years. A single movie may be associated with zero, one, or multiple genres and so the 'Total Movies' line corresponds to actual movie counts and every colored-in area represents the number of movies tagged with that genre for that year. The clear conclusion is that there has been an explosion in film production from the 90s onward, for which I have some theories but no definitive explanation. Beyond the big takeaway there are a multitude of possible smaller conclusions regarding the relative popularity of genres and movies overall, which was really my intent in making such an open-ended visualization.

There is a ton more that can be done with the data. The direction I decided to go with it was to explore various aspects of more recent data rather than more aspects related to change over time. I would love to eventually add controls to view any year range for all the following charts, but they still reveal some interesting aspects about modern movie production and IMDB metrics.

An obvious place to start is with looking at how rating data is distributed, and the answer is delightfully normal:
Yep, a bell curve-ish kinda shape! Not overly suprising to see that most movies are rated as mediocre/good and the frequency flattens out at either extreme. Next, a slightly more fun shape from the length distribution:
Ah, what a nice regularly spiky shape. It's logical that most movies hit the 90-minute mark, though it seems likely that simplified data entry also brings about the periodicity here. The chart is a bit of a mess as a line graph, so it makes sense to clean it up by binning the data quite a bit more:
And there it is, hiding in that data was another sort-of bell curve. Except of course for that first bar - IMDB apparently has a large amount of shorter 0-20 minute film entries as well. No doubt short films are part of this, though it's unclear why there are quite so many. As with many aspects of the data, it could be explored more deeply and filtered more thouroughly to focus on a specific subset of films. But, that's for another day. For now I continued my visualization quest by looking into the vote distribution:
Yes astute reader, that is indeed a log-scale on the x axis. Unsurprisingly, the number of votes for any given film declines exponentially - very few of those thousands of movies in the first graph are blockbusters[^again]. As with the histogram above the continuous data is in fact binned for counting, but in this case there are enough bins that it makes sense to smooth out into a line. Once again the data can also be shown via a histogram with fewer bins:
Lastly, I explored the distribution of budgets within the data. I was originally inspired to look into movie data based on an article that discussed the death of mid-budget-cinema, and of course I wanted to look into the data and see the phenomenon myself. The result once again demands a log-scale and reveals a certain periodicity:
The data does not seem to back the notion of mid-budget movies dying, since one peak is at about 1m, but then again as said before the data is not particularly carefully filtered. There being a ton of less-than one million budget movies certainly explains how such an explosion in movie production might have been possible in the past twenty years. That guess shall hopefully be further explored in future posts, but for now I will finish with a final simplified histogram: