Andrey Kurenkov's Web World Jekyll 2017-02-24T22:17:46-08:00 / Andrey Kurenkov / contact@andreykurenkov.com <![CDATA[ObjectCropBot]]> /projects/hacks/objectcropbot 2017-02-18T00:00:00-08:00 2017-02-18T00:00:00-08:00 www.andreykurenkov.com contact@andreykurenkov.com <p>I already had experience with DeepMask when starting on this, since my project for Stanford’s AI class was to modify DeepMask to see if it could be used to crop objects with a single click. My conclusion from that was that I was still better off using Facebook’s approach, but my experience with AWS from that project came in handy. Specifically, I reused my previous AWS DeepMask computing solution: paying for an AWS EC2 instance with a GPU, installing the relevant dependencies on it, and making it host a REST server. I had to modify the DeepMask code slightly, but for the most part I just set it up to run on the cloud and used Facebook’s pretrained models.</p> <p>The bulk of the remaining work was in implementing the http://objectcropbot.com/ web demo. I made this with basic HTML/CSS/JS parts, except for the excellent open source visual cropping library Cropper.js. After an all-nighter tweaking the demo to look and feel right, I set it up to be hosted on GitHub and arranged the domain to be what it is after buying it from NameCheap.</p> <p><a href="/projects/hacks/objectcropbot/">ObjectCropBot</a> was originally published by Andrey Kurenkov at <a href="">Andrey Kurenkov's Web World</a> on February 18, 2017.</p> <![CDATA[DeepCrop]]> /projects/major_projects/deepcrop 2016-12-16T01:26:22-08:00 2016-12-16T01:26:22-08:00 www.andreykurenkov.com contact@andreykurenkov.com <p>See the quite in-depth attached documents. In the end the outcome was not that impressive, but it was quite fun to do and a good chance to play with Deep Learning.</p> <p><a href="/projects/major_projects/deepcrop/">DeepCrop</a> was originally published by Andrey Kurenkov at <a href="">Andrey Kurenkov's Web World</a> on December 16, 2016.</p> <![CDATA[IMDB Data Visualizations with D3 + Dimple.js]]> /writing/visualizing-imdb-data-with-d3 2016-08-10T16:19:34-07:00 2016-08-10T16:19:34-07:00 www.andreykurenkov.com contact@andreykurenkov.com <p><em>Notes: not optimized for mobile (or much else). Full page version <strong><a href="/writing/files/2016-08-10-visualizing-imdb-data-with-d3/standalone_page.html">here</a></strong>, visualization code <strong><a href="https://github.com/andreykurenkov/imdb-data-viz">here</a></strong>. I don’t get into the technical aspects here, but feel free to take a look.</em></p> <div id="genreChartContainer" class="chartContainer"> <script type="text/javascript"> /*Start on 1915 because prior to that too few movies are listed to make them a fair comparison to modern times*/ var start_year = 1915; /*End on 2013 due to a strange dive towards zero in 2014 and 2015 I cannot explain or guarantee is not due to flawed data.
At first I included the dip, but I received feedback that it was best to remove it to avoid confusion, and so I did.*/ var end_year = 2013; //Get from localhost, perhaps change to github later var data_source = "/writing/files/2016-08-10-visualizing-imdb-data-with-d3/data/yearly_data.tsv"; var name = "IMDB Yearly Movie And Genre Counts (1915-2013)"; createGenreChart("#genreChartContainer", data_source, name, start_year, end_year); </script> </div> <form class="form" id="genreToggleForm"> <div class="switch-field"> <!-- <div class="switch-title">Display Type</div> --> <input type="radio" id="switch_left" name="switch" value="yes" checked="" /> <label for="switch_left">Counts</label> <input type="radio" id="switch_right" name="switch" value="no" /> <label for="switch_right">Percents</label> </div> </form> <p>And there it is! IMDB data<sup id="fnref:gotten_with"><a href="#fn:gotten_with" class="footnote">1</a></sup> visualized with <a href="https://d3js.org/">D3</a>, or more precisely with the D3-powered <a href="http://dimplejs.org/">Dimple.js</a>. The data is minimally cleaned up by filtering for movies that have at least one vote and associated length information, and info on TV episodes or shows is not included, but the data is otherwise directly (after parsing) from IMDB. The legend is interactive (try clicking the rectangles!).</p> <p>As you can see, this chart visualizes the number of genre movie releases between 1915 and 2013<sup id="fnref:why_years"><a href="#fn:why_years" class="footnote">2</a></sup>, as well as the total number of movies in those years. A single movie may be associated with zero, one, or multiple genres, so the ‘Total Movies’ line corresponds to actual movie counts and every colored-in area represents the number of movies tagged with that genre for that year. The clear conclusion is that there has been an explosion in film production from the 90s onward, for which I have some theories<sup id="fnref:theories"><a href="#fn:theories" class="footnote">3</a></sup> but no definitive explanation. Beyond the big takeaway there are a multitude of possible smaller conclusions regarding the relative popularity of genres and movies overall, which was really my intent in making such an open-ended visualization.</p> <p>There is a ton more that can be done with the data. The direction I decided to go was to explore various aspects of more recent data rather than more aspects of change over time. I would love to eventually add controls to view any year range for all the following charts<sup id="fnref:nontrivial"><a href="#fn:nontrivial" class="footnote">4</a></sup>, but they still reveal some interesting aspects of modern movie production and IMDB metrics.</p> <p>An obvious place to start is with looking at how rating data is distributed, and the answer is delightfully normal:</p> <div id="ratingChartContainer" class="chartContainer"> <script type="text/javascript"> createLineChart("#ratingChartContainer", "/writing/files/2016-08-10-visualizing-imdb-data-with-d3/data/rating_data.tsv", false, "IMDB Average Movie Rating Distribution (2003-2013) ", "rating", false, "Average IMDB User Rating"); </script> </div> <p>Yep, a bell curve-ish<sup id="fnref:bell_curve"><a href="#fn:bell_curve" class="footnote">5</a></sup> kind of shape! Not overly surprising to see that most movies are rated as mediocre/good and the frequency flattens out at either extreme.
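<p>As an aside on how the underlying data was prepared: the raw IMDB lists were parsed into per-movie rows and then grouped by year and genre before being written out as a TSV that Dimple.js can load directly. A minimal sketch of that aggregation step in Python with Pandas (the column names <code>year</code>, <code>genre</code>, and the file names here are assumptions, not the actual script) might look like the following:</p> <div class="highlight"><pre><code class="language-python" data-lang="python">import pandas as pd

# Assumed input: one row per (movie, genre) pair with a release year column.
movies = pd.read_csv("movies_with_genres.csv")

# Count movies per year and genre; a movie with N genres contributes to N genre counts,
# which is why the 'Total Movies' line is computed separately from unique titles.
yearly_genre_counts = (
    movies.groupby(["year", "genre"]).size().unstack(fill_value=0)
)
yearly_totals = movies.drop_duplicates("title").groupby("year").size()
yearly_genre_counts["Total Movies"] = yearly_totals

# Restrict to the chart's year range and write a TSV for the D3/Dimple chart to load.
yearly_genre_counts.loc[1915:2013].to_csv("yearly_data.tsv", sep="\t")
</code></pre></div>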
Next, a slightly more fun shape from the length distribution:</p> <div id="lengthChartContainer" class="chartContainer"> <script type="text/javascript"> createLineChart("#lengthChartContainer", "/writing/files/2016-08-10-visualizing-imdb-data-with-d3/data/length_data.tsv", true, "IMDB Movie Length Distribution (2003-2013)", "length", false, "Length (minutes)", "max"); </script> </div> <p>Ah, what a nice regularly spiky shape<sup id="fnref:fourier"><a href="#fn:fourier" class="footnote">6</a></sup>. It’s logical that most movies hit the 90-minute mark, though it seems likely that simplified data entry also contributes to the periodicity here. The chart is a bit of a mess as a line graph, so it makes sense to clean it up by binning the data quite a bit more:</p> <div id="lengthBinChartContainer" class="chartContainer"> <script type="text/javascript"> createHistChart("#lengthBinChartContainer", "/writing/files/2016-08-10-visualizing-imdb-data-with-d3/data/length_data_hist.tsv", true, "IMDB Movie Length Distribution (2003-2013) ", "length", false, "Length (minutes)"); </script> </div> <p>And there it is: hiding in that data was another sort-of bell curve. Except of course for that first bar - IMDB apparently has a large number of shorter 0-20 minute film entries as well. No doubt short films are part of this, though it’s unclear why there are quite so many. As with many aspects of the data, it could be explored more deeply and filtered more thoroughly to focus on a specific subset of films. But that’s for another day. For now I continued my visualization quest by looking into the vote distribution:</p> <div id="votesChartContainer" class="chartContainer"> <script type="text/javascript"> createLineChart("#votesChartContainer", "/writing/files/2016-08-10-visualizing-imdb-data-with-d3/data/votes_data.tsv", true, "IMDB Movie Vote Count Distribution (2003-2013) ", "votes", true, "IMDB User Vote Count"); </script> </div> <p>Yes, astute reader<sup id="fnref:corny"><a href="#fn:corny" class="footnote">7</a></sup>, that is indeed a log scale on the x axis. Unsurprisingly, the number of votes for any given film declines exponentially - very few of those thousands of movies in the first graph are blockbusters<sup id="fnref:again"><a href="#fn:again" class="footnote">8</a></sup>. As with the histogram above, the continuous data is in fact binned for counting, but in this case there are enough bins that it makes sense to smooth it out into a line. Once again the data can also be shown via a histogram with fewer bins:</p> <div id="votesBinChartContainer" class="chartContainer"> <script type="text/javascript"> createHistChart("#votesBinChartContainer", "/writing/files/2016-08-10-visualizing-imdb-data-with-d3/data/votes_data_hist.tsv", true, "IMDB Movie Vote # Distribution (2003-2013) ", "votes", true, "IMDB User Vote Count"); </script> </div> <p>Lastly, I explored the distribution of budgets within the data<sup id="fnref:budgets"><a href="#fn:budgets" class="footnote">9</a></sup>. I was originally inspired to look into movie data by <a href="http://flavorwire.com/492985/how-the-death-of-mid-budget-cinema-left-a-generation-of-iconic-filmmakers-mia">an article</a> that discussed the death of mid-budget cinema, and of course I wanted to look into the data and see the phenomenon myself.
The result once again demands a log scale and reveals a certain periodicity:</p> <div id="budgetChartContainer" class="chartContainer"> <script type="text/javascript"> createLineChart("#budgetChartContainer", "/writing/files/2016-08-10-visualizing-imdb-data-with-d3/data/budget_data.tsv", true, "IMDB Movie Budget Distribution (2003-2013) ", "budget", true, "Budget (USD)", "average"); </script> </div> <p>The data does<sup id="fnref:plural"><a href="#fn:plural" class="footnote">10</a></sup> not seem to back the notion of mid-budget movies dying, since one peak is at about 1m, but then again as said before the data is not particularly carefully filtered. There being a ton of less-than-one-million-budget movies certainly explains how such an explosion in movie production might have been possible in the past twenty years. That guess shall hopefully be further explored in future posts, but for now I will finish with a final simplified histogram:</p> <div id="budgetBinChartContainer" class="chartContainer"> <script type="text/javascript"> createHistChart("#budgetBinChartContainer", "/writing/files/2016-08-10-visualizing-imdb-data-with-d3/data/budget_data_hist.tsv", true, "IMDB Movie Budget Distribution (2003-2013) ", "budget", true, "Budget (USD)"); </script> </div> <h2 id="what-i-learned">What I Learned</h2> <p>And now time for everyone’s favorite part of the book report. In truth I prepared the genre chart for an online Udacity class, <a href="https://www.udacity.com/course/data-visualization-and-d3js--ud507">Data Visualization with D3.JS</a>. I completed the class as part of Udacity’s Data Analyst Nanodegree, and as with my <a href="http://www.andreykurenkov.com/writing/fun-visualizations-of-stackoverflow/">previous</a> <a href="http://www.andreykurenkov.com/writing/power-of-ipython-pandas-scikilearn/">posts</a> based on projects for the nanodegree, I felt that I learned<sup id="fnref:learned"><a href="#fn:learned" class="footnote">11</a></sup> a useful technology and got the chance to complete a fun project with it worthy of cataloguing. I have a few key takeaways from having now completed the project:</p> <ul> <li>It’s way easier to do data exploration and visualization via RStudio or IPython than D3. Perhaps this is not true for others, but I was surprised by how low-level and unstreamlined D3 is for typical visualization tasks. Of course this is part of its power and the reason that higher-abstraction libraries like Dimple.js got built, but on balance I still felt that using JavaScript, HTML, and a browser was not nearly as elegant as RStudio. As someone only mildly experienced with web dev, the prior classes on R and Pandas+IPython made me feel much more empowered to play with data easily.</li> <li>D3 allows for interactivity, but interactivity is not always needed. This one is rather obvious, but why not still spell it out. All but the genre chart here could have comfortably been PNG files (as in my previous visualization posts) and not lost much. Still, allowing for interactivity does open up a considerable number of possibilities and in particular is good for open-ended data visualization without a single particularly concrete point.</li> <li>Aggregation over hundreds of thousands of data points in JS is probably a bad idea. The year-grouped data for the genre graph was originally computed entirely in JS when I submitted my project to Udacity. This took painful seconds, which was not helped by my meek laptop; see the sketch after this list for the pre-processing approach I switched to.
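<p>To make the last point concrete, here is a minimal sketch of the kind of pre-processing script referred to above: do the grouping once offline in Python and let the browser only read small, already-aggregated files. The file names and column names are illustrative assumptions, not the actual script.</p> <div class="highlight"><pre><code class="language-python" data-lang="python">import pandas as pd

# Assumed input: one row per movie with 'year' and 'budget' columns parsed from IMDB.
movies = pd.read_csv("parsed_imdb_movies.csv")

# Pre-bin the budget data offline instead of aggregating hundreds of thousands
# of rows in the browser; the chart then just reads a tiny TSV of bin counts.
budget_bins = pd.cut(movies["budget"].dropna(), bins=50)
budget_hist = budget_bins.value_counts().sort_index().reset_index()
budget_hist.columns = ["budget_bin", "count"]
budget_hist.to_csv("budget_data_hist.tsv", sep="\t", index=False)
</code></pre></div>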
Again unsurprising, but I did feel a tinge of annoyance at realizing I would be best off writing a Python script to pre-process my data into multiple CSV files ready for charting without much manipulation.</li> </ul> <h2 id="bloopers">Bloopers</h2> <p>You did not ask for them, and I delivered. Here are a couple of silly moments from the creation of this<sup id="fnref:high_five"><a href="#fn:high_five" class="footnote">12</a></sup>. Hope you enjoyed!</p> <figure class="figure"><div class="figure__main"> <p><a href="/writing/images/2016-08-10-visualizing-imdb-data-with-d3/oops.png"><img class="postimageactual" src="/writing/images/2016-08-10-visualizing-imdb-data-with-d3/oops.png" alt="oops" /></a></p> </div><figcaption class="figure__caption"><p>Making small ordering mistakes unsurprisingly had major glitchy implications…</p> </figcaption></figure> <figure class="figure"><div class="figure__main"> <p><a href="/writing/images/2016-08-10-visualizing-imdb-data-with-d3/great.png"><img class="postimageactual" src="/writing/images/2016-08-10-visualizing-imdb-data-with-d3/great.png" alt="great" /></a></p> </div><figcaption class="figure__caption"><p>… which were worse in some cases than others.</p> </figcaption></figure> <figure class="figure"><div class="figure__main"> <p><a href="/writing/images/2016-08-10-visualizing-imdb-data-with-d3/step1.png"><img class="postimageactual" src="/writing/images/2016-08-10-visualizing-imdb-data-with-d3/step1.png" alt="step1" /></a></p> </div><figcaption class="figure__caption"><p>For a while I thought to plot the binned data with tiny little cute bins via step interpolation…</p> </figcaption></figure> <figure class="figure"><div class="figure__main"> <p><a href="/writing/images/2016-08-10-visualizing-imdb-data-with-d3/step2.png"><img class="postimageactual" src="/writing/images/2016-08-10-visualizing-imdb-data-with-d3/step2.png" alt="step2" /></a></p> </div><figcaption class="figure__caption"><p>… but evidently changed my mind.</p> </figcaption></figure> <h2 id="notes">Notes</h2> <div class="footnotes"> <ol> <li id="fn:gotten_with"> <p>(gotten with <a href="https://github.com/andreykurenkov/data-movies">this code</a>) <a href="#fnref:gotten_with" class="reversefootnote">&#8617;</a></p> </li> <li id="fn:why_years"> <p>The cutoff years are chosen due to there being very few movies comparable to modern films prior to 1915, and the data possibly being incomplete post 2013 <a href="#fnref:why_years" class="reversefootnote">&#8617;</a></p> </li> <li id="fn:theories"> <p>Primarily that films are cheaper to make due to digital technology and that IMDB tracks more modern movies better <a href="#fnref:theories" class="reversefootnote">&#8617;</a></p> </li> <li id="fn:nontrivial"> <p>This is nontrivial for various boring reasons and I have too many side-projects as it is… <a href="#fnref:nontrivial" class="reversefootnote">&#8617;</a></p> </li> <li id="fn:bell_curve"> <p>Fine, a somewhat offset and wobbly bell curve, but still looks pretty good. <a href="#fnref:bell_curve" class="reversefootnote">&#8617;</a></p> </li> <li id="fn:fourier"> <p>Don’t you just want to take the Fourier transform of it? <a href="#fnref:fourier" class="reversefootnote">&#8617;</a></p> </li> <li id="fn:corny"> <p>Too corny? I shan’t apologize, this is my site! <a href="#fnref:corny" class="reversefootnote">&#8617;</a></p> </li> <li id="fn:again"> <p>Again, something warranting a deeper dive someday.
<a href="#fnref:again" class="reversefootnote">&#8617;</a></p> </li> <li id="fn:budgets"> <p>Many movies did not have associated budget data, but that still left thousands that did <a href="#fnref:budgets" class="reversefootnote">&#8617;</a></p> </li> <li id="fn:plural"> <p>I know, ‘do’ is grammatically correct here, but then natural speech is largely nonsensical so who cares <a href="#fnref:plural" class="reversefootnote">&#8617;</a></p> </li> <li id="fn:learned"> <p>Well, learned a little… <a href="#fnref:learned" class="reversefootnote">&#8617;</a></p> </li> <li id="fn:high_five"> <p>Wow, did you actually read all the text and are still reading it? High five. <a href="#fnref:high_five" class="reversefootnote">&#8617;</a></p> </li> </ol> </div> <p><a href="/writing/visualizing-imdb-data-with-d3/">IMDB Data Visualizations with D3 + Dimple.js</a> was originally published by Andrey Kurenkov at <a href="">Andrey Kurenkov's Web World</a> on August 10, 2016.</p> <![CDATA[VividVr]]> /projects/hacks/vividvr 2016-06-17T00:00:00-07:00 2016-06-17T00:00:00-07:00 www.andreykurenkov.com contact@andreykurenkov.com <p>The presentation slides or photos below sum it up; basically we tried to glue together some cutting edge open source projects to get something quite novel. Trying to do so in 24 hours proved tough, and we only go as far as running the separate programs but not to the full pipeline. Something like this may still be worth exploring, though I don’t think this is the best approach to go about it.</p> <p><a href="/projects/hacks/vividvr/">VividVr</a> was originally published by Andrey Kurenkov at <a href="">Andrey Kurenkov's Web World</a> on June 17, 2016.</p> <![CDATA[The Power of IPython Notebook + Pandas + and Scikit-learn]]> /writing/power-of-ipython-pandas-scikilearn 2016-06-10T19:19:34-07:00 2016-06-10T19:19:34-07:00 www.andreykurenkov.com contact@andreykurenkov.com <p>IPython Notebook, Numpy, Pandas, MongoDB, R — for the better part of a year now, I have been trying out these technologies as part of Udacity’s <a href="https://www.udacity.com/course/data-analyst-nanodegree--nd002">Data Analyst Nanodegree</a>. My undergrad education barely touched on data visualization or more broadly data science, and so I figured being exposed to the aforementioned technologies would be fun. And fun it has been, with R’s powerful IDE-powered data mundging and visualization techniques having been particularly revelatory. I learned enough of R to create <a href="/writing/fun-visualizations-of-stackoverflow/">some complex visualizations</a>, and was impressed by how easy is to import data into its Dataframe representations and then transform and visualize that data. I also thought RStudio’s paradigm of continuously intermixed code editing and execution was superior to my habitual workflow of just endlessly cycling between tweaking and executing of Python scripts.</p> <figure class="figure"><div class="figure__main"> <p><a href="/writing/images/2016-06-10-power-of-ipython-pandas-scikitlearn/rstudio.png"><img class="postimageactual" src="/writing/images/2016-06-10-power-of-ipython-pandas-scikitlearn/rstudio.png" alt="History" /></a></p> </div><figcaption class="figure__caption"><p>The RStudio IDE</p> </figcaption></figure> <p>Still, R is a not-quite-general-purpose-language and I hit upon multiple instances in which simple things were hard to do. In such times, I could not help but miss the powers of Python, a language I have tons of experience with and which is about as general purpose as it gets. 
Luckily, the courses also covered the Python equivalent of R: the Python Data Analysis Library, Pandas. This let me use the features of R I had come to like — dataframes, powerful plotting methods, elegant methods for transforming data — with Python’s lovely syntax and libraries I already knew and loved. And soon I got to do just that, using both Pandas and the supremely good machine learning package Scikit-learn for the final project of <a href="https://www.udacity.com/course/intro-to-machine-learning--ud120">Udacity’s Intro to Machine Learning Course</a>. Not only that, but I also used IPython Notebook for RStudio-esque intermixed code editing and execution and nice PDF output.</p> <p>I had such a nice experience with this combination of tools that I decided to dedicate a post to it, and what follows is mostly a summation of that experience. Reading it should be sufficient to get a general idea of why these tools are useful, whereas a much more detailed introduction and tutorial for Pandas can be found elsewhere (for instance <a href="http://nbviewer.jupyter.org/github/fonnesbeck/pytenn2014_tutorial/blob/master/Part%201.%20Data%20Wrangling%20with%20Pandas.ipynb">here</a>). Incidentally, this whole post was written in IPython Notebook; the source <a href="http://www.andreykurenkov.com/writing/files/2016-06-10-power-of-ipython-pandas-scikilearn/post.ipynb">can be found here</a> with the produced HTML <a href="http://www.andreykurenkov.com/writing/files/2016-06-10-power-of-ipython-pandas-scikilearn/post.html">here</a>.</p> <h2 id="data-summarization">Data Summarization</h2> <p>First, a bit about the project. The task was to first explore and clean a given dataset, and then train classification models using it. The dataset contained dozens of features about roughly 150 important employees from the <a href="https://en.wikipedia.org/wiki/Enron_scandal">notoriously corrupt</a> company Enron, who were classified as either a “Person of Interest” or not based on the outcome of investigations into Enron’s corruption. It’s a tiny dataset and not what I would have chosen, but such were the instructions. The data was provided as a bunch of Python dictionaries, and at first I just used a Python script to change it into a CSV and started exploring it in RStudio. But it soon dawned on me that I would be much better off working entirely in Python, and the following code is taken verbatim from my final project submission.</p> <p>And so, the code.
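<p>For readers unfamiliar with the dataset's shape, a minimal sketch of the dictionary-to-CSV conversion mentioned above (the file names here are assumptions; the original script is not shown) could look like this:</p> <div class="highlight"><pre><code class="language-python" data-lang="python">import csv
import pickle

# The dataset is a dict of {person_name: {feature_name: value, ...}}.
with open("final_project_dataset.pkl", "rb") as f:
    enron_data = pickle.load(f)

# Flatten each person's feature dict into one CSV row for exploration in RStudio.
fieldnames = ["name"] + sorted(next(iter(enron_data.values())).keys())
with open("enron_data.csv", "w", newline="") as out:
    writer = csv.DictWriter(out, fieldnames=fieldnames)
    writer.writeheader()
    for name, features in enron_data.items():
        writer.writerow({"name": name, **features})
</code></pre></div>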
Following some imports and a ‘%matplotlib notebook’ directive to allow plotting within IPython, I loaded the data using pickle and printed out some basic things about it (not yet using Pandas):</p> <div class="highlight"><pre><code class="language-python" data-lang="python"><span class="kn">import</span> <span class="nn">matplotlib.pyplot</span> <span class="kn">as</span> <span class="nn">plt</span> <span class="kn">import</span> <span class="nn">matplotlib</span> <span class="kn">import</span> <span class="nn">pickle</span> <span class="kn">import</span> <span class="nn">pandas</span> <span class="kn">as</span> <span class="nn">pd</span> <span class="kn">import</span> <span class="nn">numpy</span> <span class="kn">as</span> <span class="nn">np</span> <span class="kn">from</span> <span class="nn">IPython.display</span> <span class="kn">import</span> <span class="n">display</span> <span class="o">%</span><span class="n">matplotlib</span> <span class="n">notebook</span></code></pre></div> <div class="highlight"><pre><code class="language-python" data-lang="python"><span class="n">enron_data</span> <span class="o">=</span> <span class="n">pickle</span><span class="o">.</span><span class="n">load</span><span class="p">(</span><span class="nb">open</span><span class="p">(</span><span class="s">&quot;./ud120-projects/final_project/final_project_dataset.pkl&quot;</span><span class="p">,</span> <span class="s">&quot;rb&quot;</span><span class="p">))</span> <span class="k">print</span><span class="p">(</span><span class="s">&quot;Number of people: </span><span class="si">%d</span><span class="s">&quot;</span><span class="o">%</span><span class="nb">len</span><span class="p">(</span><span class="n">enron_data</span><span class="o">.</span><span class="n">keys</span><span class="p">()))</span> <span class="k">print</span><span class="p">(</span><span class="s">&quot;Number of features per person: </span><span class="si">%d</span><span class="s">&quot;</span><span class="o">%</span><span class="nb">len</span><span class="p">(</span><span class="nb">list</span><span class="p">(</span><span class="n">enron_data</span><span class="o">.</span><span class="n">values</span><span class="p">())[</span><span class="mi">0</span><span class="p">]))</span> <span class="k">print</span><span class="p">(</span><span class="s">&quot;Number of POI: </span><span class="si">%d</span><span class="s">&quot;</span><span class="o">%</span><span class="nb">sum</span><span class="p">([</span><span class="mi">1</span> <span class="k">if</span> <span class="n">x</span><span class="p">[</span><span class="s">&#39;poi&#39;</span><span class="p">]</span> <span class="k">else</span> <span class="mi">0</span> <span class="k">for</span> <span class="n">x</span> <span class="ow">in</span> <span class="n">enron_data</span><span class="o">.</span><span class="n">values</span><span class="p">()]))</span></code></pre></div> <pre><code>Number of people: 146 Number of features per person: 21 Number of POI: 18 </code></pre> <p>But working with this set of dictionaries would not be nearly as fast or easy as a Pandas dataframe, so I soon converted it to that and went ahead and summarized all the features with a single method call:</p> <div class="highlight"><pre><code class="language-python" data-lang="python"><span class="n">df</span> <span class="o">=</span> <span class="n">pd</span><span class="o">.</span><span class="n">DataFrame</span><span class="o">.</span><span class="n">from_dict</span><span class="p">(</span><span
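<p>The conversion above is doing three small but important things: building the DataFrame from the dict of dicts, dropping the spreadsheet's 'TOTAL' row, and transposing so that people become rows and features become columns. Stripped of the syntax-highlighting markup, the same steps read as the following minimal sketch:</p> <div class="highlight"><pre><code class="language-python" data-lang="python">import pandas as pd

# Build a DataFrame from the dict of dicts; keys become columns, so transpose
# afterwards to get one row per person and one column per feature.
df = pd.DataFrame.from_dict(enron_data)
del df['TOTAL']          # drop the spreadsheet aggregate row, which is not a person
df = df.transpose()

# Coerce the string 'NaN' placeholders to real numeric NaNs and drop the one
# non-numeric column before summarizing.
numeric_df = df.apply(pd.to_numeric, errors='coerce')
del numeric_df['email_address']
numeric_df.describe()    # count/mean/std/quartiles for every feature in one call
</code></pre></div>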
class="n">enron_data</span><span class="p">)</span> <span class="k">del</span> <span class="n">df</span><span class="p">[</span><span class="s">&#39;TOTAL&#39;</span><span class="p">]</span> <span class="n">df</span> <span class="o">=</span> <span class="n">df</span><span class="o">.</span><span class="n">transpose</span><span class="p">()</span> <span class="n">numeric_df</span> <span class="o">=</span> <span class="n">df</span><span class="o">.</span><span class="n">apply</span><span class="p">(</span><span class="n">pd</span><span class="o">.</span><span class="n">to_numeric</span><span class="p">,</span> <span class="n">errors</span><span class="o">=</span><span class="s">&#39;coerce&#39;</span><span class="p">)</span> <span class="k">del</span> <span class="n">numeric_df</span><span class="p">[</span><span class="s">&#39;email_address&#39;</span><span class="p">]</span> <span class="n">numeric_df</span><span class="o">.</span><span class="n">describe</span><span class="p">()</span></code></pre></div> <div class="post_table_div"> <table border="1" class="dataframe"> <thead> <tr style="text-align: right;"> <th></th> <th>bonus</th> <th>deferral_payments</th> <th>deferred_income</th> <th>director_fees</th> <th>exercised_stock_options</th> <th>expenses</th> <th>from_messages</th> <th>from_poi_to_this_person</th> <th>from_this_person_to_poi</th> <th>loan_advances</th> <th>long_term_incentive</th> <th>other</th> <th>poi</th> <th>restricted_stock</th> <th>restricted_stock_deferred</th> <th>salary</th> <th>shared_receipt_with_poi</th> <th>to_messages</th> <th>total_payments</th> <th>total_stock_value</th> </tr> </thead> <tbody> <tr> <th>count</th> <td>81.000000</td> <td>38.000000</td> <td>48.000000</td> <td>16.000000</td> <td>101.000000</td> <td>94.000000</td> <td>86.000000</td> <td>86.000000</td> <td>86.000000</td> <td>3.000000</td> <td>65.000000</td> <td>92.000000</td> <td>145</td> <td>109.000000</td> <td>17.000000</td> <td>94.000000</td> <td>86.000000</td> <td>86.000000</td> <td>1.240000e+02</td> <td>125.000000</td> </tr> <tr> <th>mean</th> <td>1201773.074074</td> <td>841602.526316</td> <td>-581049.812500</td> <td>89822.875000</td> <td>2959559.257426</td> <td>54192.010638</td> <td>608.790698</td> <td>64.895349</td> <td>41.232558</td> <td>27975000.000000</td> <td>746491.200000</td> <td>465276.663043</td> <td>0.124138</td> <td>1147424.091743</td> <td>621892.823529</td> <td>284087.542553</td> <td>1176.465116</td> <td>2073.860465</td> <td>2.623421e+06</td> <td>3352073.024000</td> </tr> <tr> <th>std</th> <td>1441679.438330</td> <td>1289322.626180</td> <td>942076.402972</td> <td>41112.700735</td> <td>5499449.598994</td> <td>46108.377454</td> <td>1841.033949</td> <td>86.979244</td> <td>100.073111</td> <td>46382560.030684</td> <td>862917.421568</td> <td>1389719.064851</td> <td>0.330882</td> <td>2249770.356903</td> <td>3845528.349509</td> <td>177131.115377</td> <td>1178.317641</td> <td>2582.700981</td> <td>9.488106e+06</td> <td>6532883.097201</td> </tr> <tr> <th>min</th> <td>70000.000000</td> <td>-102500.000000</td> <td>-3504386.000000</td> <td>3285.000000</td> <td>3285.000000</td> <td>148.000000</td> <td>12.000000</td> <td>0.000000</td> <td>0.000000</td> <td>400000.000000</td> <td>69223.000000</td> <td>2.000000</td> <td>False</td> <td>-2604490.000000</td> <td>-1787380.000000</td> <td>477.000000</td> <td>2.000000</td> <td>57.000000</td> <td>1.480000e+02</td> <td>-44093.000000</td> </tr> <tr> <th>25%</th> <td>425000.000000</td> <td>79644.500000</td> <td>-611209.250000</td> <td>83674.500000</td> 
<td>506765.000000</td> <td>22479.000000</td> <td>22.750000</td> <td>10.000000</td> <td>1.000000</td> <td>1200000.000000</td> <td>275000.000000</td> <td>1209.000000</td> <td>0</td> <td>252055.000000</td> <td>-329825.000000</td> <td>211802.000000</td> <td>249.750000</td> <td>541.250000</td> <td>3.863802e+05</td> <td>494136.000000</td> </tr> <tr> <th>50%</th> <td>750000.000000</td> <td>221063.500000</td> <td>-151927.000000</td> <td>106164.500000</td> <td>1297049.000000</td> <td>46547.500000</td> <td>41.000000</td> <td>35.000000</td> <td>8.000000</td> <td>2000000.000000</td> <td>422158.000000</td> <td>51984.500000</td> <td>0</td> <td>441096.000000</td> <td>-140264.000000</td> <td>258741.000000</td> <td>740.500000</td> <td>1211.000000</td> <td>1.100246e+06</td> <td>1095040.000000</td> </tr> <tr> <th>75%</th> <td>1200000.000000</td> <td>867211.250000</td> <td>-37926.000000</td> <td>112815.000000</td> <td>2542813.000000</td> <td>78408.500000</td> <td>145.500000</td> <td>72.250000</td> <td>24.750000</td> <td>41762500.000000</td> <td>831809.000000</td> <td>357577.250000</td> <td>0</td> <td>985032.000000</td> <td>-72419.000000</td> <td>308606.500000</td> <td>1888.250000</td> <td>2634.750000</td> <td>2.084663e+06</td> <td>2606763.000000</td> </tr> <tr> <th>max</th> <td>8000000.000000</td> <td>6426990.000000</td> <td>-833.000000</td> <td>137864.000000</td> <td>34348384.000000</td> <td>228763.000000</td> <td>14368.000000</td> <td>528.000000</td> <td>609.000000</td> <td>81525000.000000</td> <td>5145434.000000</td> <td>10359729.000000</td> <td>True</td> <td>14761694.000000</td> <td>15456290.000000</td> <td>1111258.000000</td> <td>5521.000000</td> <td>15149.000000</td> <td>1.035598e+08</td> <td>49110078.000000</td> </tr> </tbody> </table> </div> <p>This high-level summarization of data is one example of what Pandas can do for you. But the main strength is in how easy it is to manipulate the data and derive new things from it. The project instructed me to first summarize some things about the data, and then handle outliers. The summary indicated a large standard deviation for many of the features, and also a lot of missing values for various features. First I dropped features with almost no non-null values, such as loan_advances and restricted_stock_deferred. Then, in order to investigate whether any features are particularly bad in terms of outliers, I went ahead and computed how many standard deviations from the mean each feature is for each entry in the data, and easily got summary statistics for this derived data as well:</p> <div class="highlight"><pre><code class="language-python" data-lang="python"><span class="k">del</span> <span class="n">numeric_df</span><span class="p">[</span><span class="s">&#39;loan_advances&#39;</span><span class="p">]</span> <span class="k">del</span> <span class="n">numeric_df</span><span class="p">[</span><span class="s">&#39;restricted_stock_deferred&#39;</span><span class="p">]</span> <span class="k">del</span> <span class="n">numeric_df</span><span class="p">[</span><span class="s">&#39;director_fees&#39;</span><span class="p">]</span> <span class="n">std</span> <span class="o">=</span> <span class="n">numeric_df</span><span class="o">.</span><span class="n">apply</span><span class="p">(</span><span class="k">lambda</span> <span class="n">x</span><span class="p">:</span> <span class="n">np</span><span class="o">.</span><span class="n">abs</span><span class="p">(</span><span class="n">x</span> <span class="o">-</span> <span class="n">x</span><span class="o">.</span><span class="n">mean</span><span class="p">())</span> <span class="o">/</span> <span class="n">x</span><span class="o">.</span><span class="n">std</span><span class="p">())</span> <span class="n">std</span> <span class="o">=</span> <span class="n">std</span><span class="o">.</span><span class="n">fillna</span><span class="p">(</span><span class="n">std</span><span class="o">.</span><span class="n">mean</span><span class="p">())</span> <span class="n">std</span><span class="o">.</span><span class="n">describe</span><span class="p">()</span></code></pre></div> <div class="post_table_div"> <table border="1" class="dataframe"> <thead> <tr style="text-align: right;"> <th></th> <th>bonus</th> <th>deferral_payments</th> <th>deferred_income</th> <th>exercised_stock_options</th> <th>expenses</th> <th>from_messages</th> <th>from_poi_to_this_person</th> <th>from_this_person_to_poi</th> <th>long_term_incentive</th> <th>other</th> <th>poi</th> <th>restricted_stock</th> <th>salary</th> <th>shared_receipt_with_poi</th> <th>to_messages</th> <th>total_payments</th> <th>total_stock_value</th> </tr> </thead> <tbody> <tr> <th>count</th> <td>145.000000</td> <td>145.000000</td> <td>145.000000</td> <td>145.000000</td> <td>145.000000</td> <td>145.000000</td> <td>145.000000</td> <td>145.000000</td> <td>145.000000</td> <td>145.000000</td> <td>145.000000</td> <td>145.000000</td> <td>145.000000</td> <td>145.000000</td> <td>145.000000</td> <td>145.000000</td> <td>145.000000</td> </tr> <tr> <th>mean</th> <td>0.612134</td> <td>0.670659</td> <td>0.690552</td> <td>0.558364</td> <td>0.739307</td> <td>0.487468</td> <td>0.694769</td> <td>0.532234</td> <td>0.670577</td> <td>0.444004</td> <td>0.657200</td> <td>0.525893</td> <td>0.568830</td> <td>0.794256</td> <td>0.648079</td> <td>0.287221</td> <td>0.547885</td> </tr> <tr> <th>std</th> <td>0.587181</td> <td>0.371822</td> <td>0.409188</td> <td>0.689763</td> <td>0.537626</td> <td>0.669599</td> <td>0.549542</td> <td>0.648923</td> <td>0.491393</td> <td>0.711333</td> <td>0.751724</td> <td>0.735294</td> <td>0.659254</td> <td>0.462087</td> <td>0.582615</td> <td>0.884946</td> <td>0.774945</td> </tr> <tr> <th>min</th> <td>0.001230</td> <td>0.001025</td> <td>0.002415</td> <td>0.040311</td> <td>0.005314</td> <td>0.028674</td> <td>0.010294</td> <td>0.032302</td> <td>0.027083</td>
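<p>In plainer form, the idea of the snippet above is a per-cell z-score: for each feature column, replace every value with its absolute distance from the column mean measured in standard deviations, then fill the NaNs so that missing values don't show up as extreme. A minimal sketch of the same computation without the highlighting markup:</p> <div class="highlight"><pre><code class="language-python" data-lang="python">import numpy as np

# Drop features that are almost entirely missing before looking for outliers.
for col in ['loan_advances', 'restricted_stock_deferred', 'director_fees']:
    del numeric_df[col]

# Per-cell absolute z-score: |x - mean| / std, computed column by column.
std = numeric_df.apply(lambda x: np.abs(x - x.mean()) / x.std())

# Missing values get the column's mean deviation so they don't register as outliers.
std = std.fillna(std.mean())
std.describe()
</code></pre></div>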
<td>0.000058</td> <td>0.375173</td> <td>0.044846</td> <td>0.025148</td> <td>0.037736</td> <td>0.041484</td> <td>0.003077</td> <td>0.014143</td> </tr> <tr> <th>25%</th> <td>0.380270</td> <td>0.670659</td> <td>0.611358</td> <td>0.346078</td> <td>0.510059</td> <td>0.310038</td> <td>0.481671</td> <td>0.342075</td> <td>0.546392</td> <td>0.297679</td> <td>0.375173</td> <td>0.302841</td> <td>0.250755</td> <td>0.605495</td> <td>0.455283</td> <td>0.130231</td> <td>0.296228</td> </tr> <tr> <th>50%</th> <td>0.612134</td> <td>0.670659</td> <td>0.690552</td> <td>0.470558</td> <td>0.739307</td> <td>0.324161</td> <td>0.694769</td> <td>0.412024</td> <td>0.670577</td> <td>0.334411</td> <td>0.375173</td> <td>0.417338</td> <td>0.568830</td> <td>0.794256</td> <td>0.648079</td> <td>0.196170</td> <td>0.423551</td> </tr> <tr> <th>75%</th> <td>0.612134</td> <td>0.670659</td> <td>0.690552</td> <td>0.558364</td> <td>0.817162</td> <td>0.487468</td> <td>0.694769</td> <td>0.532234</td> <td>0.670577</td> <td>0.444004</td> <td>0.375173</td> <td>0.525893</td> <td>0.568830</td> <td>0.847365</td> <td>0.648079</td> <td>0.271301</td> <td>0.508700</td> </tr> <tr> <th>max</th> <td>4.715491</td> <td>4.332032</td> <td>3.103078</td> <td>5.707630</td> <td>3.786101</td> <td>7.473631</td> <td>5.324312</td> <td>5.673526</td> <td>5.097756</td> <td>7.119750</td> <td>2.647054</td> <td>6.051404</td> <td>4.669820</td> <td>3.687066</td> <td>5.062584</td> <td>10.638201</td> <td>7.004259</td> </tr> </tbody> </table> </div> <p>This result suggested that most features have large outliers (larger than 3 standard deviations). In order to be careful not to remove any useful data, I manually inspected all rows with large outliers to see which values seemed appropriate for removal:</p> <div class="highlight"><pre><code class="language-python" data-lang="python"><span class="n">outliers</span> <span class="o">=</span> <span class="n">std</span><span class="o">.</span><span class="n">apply</span><span class="p">(</span><span class="k">lambda</span> <span class="n">x</span><span class="p">:</span> <span class="n">x</span> <span class="o">&gt;</span> <span class="mi">5</span><span class="p">)</span><span class="o">.</span><span class="n">any</span><span class="p">(</span><span class="n">axis</span><span class="o">=</span><span class="mi">1</span><span class="p">)</span> <span class="n">outlier_df</span> <span class="o">=</span> <span class="n">pd</span><span class="o">.</span><span class="n">DataFrame</span><span class="p">(</span><span class="n">index</span><span class="o">=</span><span class="n">numeric_df</span><span class="p">[</span><span class="n">outliers</span><span class="p">]</span><span class="o">.</span><span class="n">index</span><span class="p">)</span> <span class="k">for</span> <span class="n">col</span> <span class="ow">in</span> <span class="n">numeric_df</span><span class="o">.</span><span class="n">columns</span><span class="p">:</span> <span class="n">outlier_df</span><span class="p">[</span><span class="nb">str</span><span class="p">((</span><span class="n">col</span><span class="p">,</span><span class="n">col</span><span class="o">+</span><span class="s">&#39;_std&#39;</span><span class="p">))]</span> <span class="o">=</span> <span class="nb">list</span><span class="p">(</span><span class="nb">zip</span><span class="p">(</span><span class="n">numeric_df</span><span class="p">[</span><span class="n">outliers</span><span class="p">][</span><span class="n">col</span><span class="p">],</span><span class="n">std</span><span class="p">[</span><span class="n">outliers</span><span class="p">][</span><span class="n">col</span><span class="p">]))</span> <span class="n">display</span><span class="p">(</span><span class="n">outlier_df</span><span class="p">)</span> <span class="n">numeric_df</span><span class="o">.</span><span class="n">drop</span><span class="p">(</span><span class="s">&#39;FREVERT MARK A&#39;</span><span class="p">,</span><span class="n">inplace</span><span class="o">=</span><span class="bp">True</span><span class="p">)</span> <span class="n">df</span><span class="o">.</span><span class="n">drop</span><span class="p">(</span><span class="s">&#39;FREVERT MARK A&#39;</span><span class="p">,</span><span class="n">inplace</span><span class="o">=</span><span class="bp">True</span><span class="p">)</span></code></pre></div> <div class="post_table_div"> <table border="1" class="dataframe"> <thead> <tr style="text-align: right;"> <th></th> <th>('bonus', 'bonus_std')</th> <th>('deferral_payments', 'deferral_payments_std')</th> <th>('deferred_income', 'deferred_income_std')</th> <th>('exercised_stock_options', 'exercised_stock_options_std')</th> <th>('expenses', 'expenses_std')</th> <th>('from_messages', 'from_messages_std')</th> <th>('from_poi_to_this_person', 'from_poi_to_this_person_std')</th> <th>('from_this_person_to_poi', 'from_this_person_to_poi_std')</th> <th>('long_term_incentive', 'long_term_incentive_std')</th> <th>('other', 'other_std')</th> <th>('poi', 'poi_std')</th> <th>('restricted_stock', 'restricted_stock_std')</th> <th>('salary', 'salary_std')</th> <th>('shared_receipt_with_poi', 'shared_receipt_with_poi_std')</th> <th>('to_messages', 'to_messages_std')</th> <th>('total_payments', 'total_payments_std')</th> <th>('total_stock_value', 'total_stock_value_std')</th> </tr> </thead> <tbody> <tr> <th>DELAINEY DAVID W</th> <td>(3000000.0, 1.24731398542)</td> <td>(nan, 0.67065886001)</td> <td>(nan, 0.690552246623)</td> <td>(2291113.0, 0.121547846815)</td> <td>(86174.0, 0.6936264325)</td> <td>(3069.0, 1.3363193564)</td> <td>(66.0, 0.0127001697143)</td> <td>(609.0, 5.67352642171)</td> <td>(1294981.0, 0.635622582522)</td> <td>(1661.0, 0.333603873451)</td> <td>(True, 2.64705431598)</td> <td>(1323148.0, 0.078107486712)</td> <td>(365163.0, 0.457714373186)</td> <td>(2097.0, 0.781228126919)</td> <td>(3093.0, 0.394602217763)</td> <td>(4747979.0, 0.22391802188)</td> <td>(3614261.0, 0.0401335784062)</td> </tr> <tr> <th>FREVERT MARK A</th> <td>(2000000.0, 0.553678511813)</td> <td>(6426990.0, 4.33203246439)</td> <td>(-3367011.0, 2.95725609803)</td> <td>(10433518.0, 1.35903759241)</td> <td>(86987.0, 0.711258803121)</td> <td>(21.0, 0.319272057897)</td> <td>(242.0, 2.03617142019)</td> <td>(6.0, 0.352068179278)</td> <td>(1617011.0, 1.00881008801)</td> <td>(7427621.0, 5.00989337561)</td> <td>(False, 0.375173052658)</td> <td>(4188667.0, 1.3518014845)</td> <td>(1060932.0, 4.38570296241)</td> <td>(2979.0, 1.5297529467)</td> <td>(3275.0, 0.465071080146)</td> <td>(17252530.0, 1.54183664695)</td> <td>(14622185.0, 1.72513602468)</td> </tr> <tr> <th>HIRKO JOSEPH</th> <td>(nan, 0.612134343218)</td> <td>(10259.0, 0.644790923106)</td> <td>(nan, 0.690552246623)</td> <td>(30766064.0, 5.05623412708)</td> <td>(77978.0, 0.515871316129)</td> <td>(nan, 0.487467982744)</td> <td>(nan, 0.694769235346)</td> <td>(nan, 0.532233915598)</td> <td>(nan, 0.670576589457)</td> <td>(2856.0, 0.332743987428)</td> <td>(True, 2.64705431598)</td> <td>(nan, 0.52589323995)</td> <td>(nan, 0.568830375372)</td> <td>(nan, 0.794256482633)</td> <td>(nan,
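<p>The inspection step above boils down to: take every row where any feature is more than 5 standard deviations out, and display each raw value next to its deviation so it can be judged by eye. A minimal sketch of the same logic, stripped of the highlighting markup:</p> <div class="highlight"><pre><code class="language-python" data-lang="python"># Boolean mask of rows where at least one feature deviates by more than 5 std.
outliers = std.apply(lambda x: x > 5).any(axis=1)

# Pair each raw value with its deviation so the extreme rows can be judged by eye.
outlier_df = pd.DataFrame(index=numeric_df[outliers].index)
for col in numeric_df.columns:
    outlier_df[str((col, col + '_std'))] = list(
        zip(numeric_df[outliers][col], std[outliers][col]))
display(outlier_df)

# The only row that looked like a genuinely distorting outlier was then dropped.
numeric_df.drop('FREVERT MARK A', inplace=True)
df.drop('FREVERT MARK A', inplace=True)
</code></pre></div>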
0.648079292459)</td> <td>(91093.0, 0.266895026444)</td> <td>(30766064.0, 4.19630821004)</td> </tr> <tr> <th>KAMINSKI WINCENTY J</th> <td>(400000.0, 0.556138245963)</td> <td>(nan, 0.67065886001)</td> <td>(nan, 0.690552246623)</td> <td>(850010.0, 0.383592797689)</td> <td>(83585.0, 0.637476115725)</td> <td>(14368.0, 7.47363149225)</td> <td>(41.0, 0.274724723819)</td> <td>(171.0, 1.29672636328)</td> <td>(323466.0, 0.490226746415)</td> <td>(4669.0, 0.331439407211)</td> <td>(False, 0.375173052658)</td> <td>(126027.0, 0.454000599932)</td> <td>(275101.0, 0.0507338450054)</td> <td>(583.0, 0.503654613618)</td> <td>(4607.0, 0.980810226817)</td> <td>(1086821.0, 0.161950156636)</td> <td>(976037.0, 0.363704047455)</td> </tr> <tr> <th>LAVORATO JOHN J</th> <td>(8000000.0, 4.71549135347)</td> <td>(nan, 0.67065886001)</td> <td>(nan, 0.690552246623)</td> <td>(4158995.0, 0.21810105193)</td> <td>(49537.0, 0.100958023148)</td> <td>(2585.0, 1.07342360688)</td> <td>(528.0, 5.32431220222)</td> <td>(411.0, 3.69497297064)</td> <td>(2035380.0, 1.4936409531)</td> <td>(1552.0, 0.33368230657)</td> <td>(False, 0.375173052658)</td> <td>(1008149.0, 0.0619063591605)</td> <td>(339288.0, 0.311636142127)</td> <td>(3962.0, 2.3639931937)</td> <td>(7259.0, 2.00764222154)</td> <td>(10425757.0, 0.822328102755)</td> <td>(5167144.0, 0.277836132837)</td> </tr> <tr> <th>LAY KENNETH L</th> <td>(7000000.0, 4.02185587986)</td> <td>(202911.0, 0.495369827029)</td> <td>(-300000.0, 0.29833016899)</td> <td>(34348384.0, 5.70763022327)</td> <td>(99832.0, 0.98984158372)</td> <td>(36.0, 0.311124462355)</td> <td>(123.0, 0.668028926971)</td> <td>(16.0, 0.252141237305)</td> <td>(3600000.0, 3.30681561025)</td> <td>(10359729.0, 7.11975001798)</td> <td>(True, 2.64705431598)</td> <td>(14761694.0, 6.05140425399)</td> <td>(1072321.0, 4.44999996622)</td> <td>(2411.0, 1.0477097521)</td> <td>(4273.0, 0.851488248598)</td> <td>(103559793.0, 10.6382007936)</td> <td>(49110078.0, 7.00425896119)</td> </tr> <tr> <th>MARTIN AMANDA K</th> <td>(nan, 0.612134343218)</td> <td>(85430.0, 0.586488215565)</td> <td>(nan, 0.690552246623)</td> <td>(2070306.0, 0.16169859209)</td> <td>(8211.0, 0.997237664333)</td> <td>(230.0, 0.205748893335)</td> <td>(8.0, 0.654125583284)</td> <td>(0.0, 0.412024344462)</td> <td>(5145434.0, 5.09775639018)</td> <td>(2818454.0, 1.6932755666)</td> <td>(False, 0.375173052658)</td> <td>(nan, 0.52589323995)</td> <td>(349487.0, 0.36921495869)</td> <td>(477.0, 0.593613378808)</td> <td>(1522.0, 0.21367570973)</td> <td>(8407016.0, 0.609562657351)</td> <td>(2070306.0, 0.196202351233)</td> </tr> <tr> <th>SHAPIRO RICHARD S</th> <td>(650000.0, 0.382729377561)</td> <td>(nan, 0.67065886001)</td> <td>(nan, 0.690552246623)</td> <td>(607837.0, 0.427628659031)</td> <td>(137767.0, 1.81257710587)</td> <td>(1215.0, 0.329276547308)</td> <td>(74.0, 0.104676135645)</td> <td>(65.0, 0.237500778364)</td> <td>(nan, 0.670576589457)</td> <td>(705.0, 0.33429178227)</td> <td>(False, 0.375173052658)</td> <td>(379164.0, 0.341483782727)</td> <td>(269076.0, 0.0847481963923)</td> <td>(4527.0, 2.84349038551)</td> <td>(15149.0, 5.06258356331)</td> <td>(1057548.0, 0.165035387918)</td> <td>(987001.0, 0.362025768533)</td> </tr> <tr> <th>WHITE JR THOMAS E</th> <td>(450000.0, 0.521456472283)</td> <td>(nan, 0.67065886001)</td> <td>(nan, 0.690552246623)</td> <td>(1297049.0, 0.302304844785)</td> <td>(81353.0, 0.58906842664)</td> <td>(nan, 0.487467982744)</td> <td>(nan, 0.694769235346)</td> <td>(nan, 0.532233915598)</td> <td>(nan, 0.670576589457)</td> <td>(1085463.0, 0.446267416662)</td> 
<td>(False, 0.375173052658)</td> <td>(13847074.0, 5.64486498335)</td> <td>(317543.0, 0.188873972681)</td> <td>(nan, 0.794256482633)</td> <td>(nan, 0.648079292459)</td> <td>(1934359.0, 0.072623789327)</td> <td>(15144123.0, 1.80502999986)</td> </tr> </tbody> </table> </div> <p>Looking through these, I found one instance of a valid outlier - Mark A. Frevert (CEO of Enron) - and removed him from the dataset.</p> <p>I should emphasize the benefits of doing all this in IPython Notebook. Being able to tweak parts of the code without re-executing all of it and reloading all the data made iterating on ideas much faster, and iterating on ideas fast is essential for exploratory data analysis and development of machine-learned models. It’s no accident that the Matlab IDE and RStudio, both tools commonly used in the sciences for data processing and analysis, have essentially the same structure. I did not understand the benefits of IPython Notebook when I was first made to use it for class assignments in college, but now it has finally dawned on me that it fills the same role as those IDEs and became popular because it is similarly well suited for working with data.</p> <h2 id="feature-visualization-engineering-and-selection">Feature Visualization, Engineering and Selection</h2> <p>The project also instructed me to choose a set of features, and to engineer some of my own. In order to get an initial idea of possible promising features and how I could use them to create new features, I computed the correlation of each feature to the Person of Interest classification:</p> <div class="highlight"><pre><code class="language-python" data-lang="python"><span class="n">corr</span> <span class="o">=</span> <span class="n">numeric_df</span><span class="o">.</span><span class="n">corr</span><span class="p">()</span> <span class="k">print</span><span class="p">(</span><span class="s">&#39;</span><span class="se">\n</span><span class="s">Correlations between features to POI:</span><span class="se">\n</span><span class="s"> &#39;</span> <span class="o">+</span><span class="nb">str</span><span class="p">(</span><span class="n">corr</span><span class="p">[</span><span class="s">&#39;poi&#39;</span><span class="p">]))</span></code></pre></div> <pre><code>Correlations between features to POI: bonus 0.306907 deferral_payments -0.075632 deferred_income -0.334810 exercised_stock_options 0.513724 expenses 0.064293 from_messages -0.076108 from_poi_to_this_person 0.183128 from_this_person_to_poi 0.111313 long_term_incentive 0.264894 other 0.174291 poi 1.000000 restricted_stock 0.232410 salary 0.323374 shared_receipt_with_poi 0.239932 to_messages 0.061531 total_payments 0.238375 total_stock_value 0.377033 Name: poi, dtype: float64 </code></pre> <p>The results indicated that ‘exercised_stock_options’, ‘total_stock_value’, and ‘bonus’ are the most promising features.
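<p>On the feature engineering mentioned above: one natural way to extend the correlation ranking is to combine promising raw features into derived ones and check whether the new column correlates more strongly with the POI label. This sketch is an illustration of that general approach, not the specific features engineered in the project; the ratio feature shown here is a hypothetical example:</p> <div class="highlight"><pre><code class="language-python" data-lang="python"># Hypothetical derived feature: fraction of a person's messages sent to POIs.
# (Illustrative only; not necessarily the feature used in the actual project.)
numeric_df['to_poi_ratio'] = (
    numeric_df['from_this_person_to_poi'] / numeric_df['from_messages']
)

# Re-check how the new column correlates with the POI label to judge its promise.
print(numeric_df.corr()['poi'].sort_values(ascending=False).head(10))
</code></pre></div>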
Just for fun, I went ahead and plotted these features to see if I could visually verify their significance:</p> <div class="highlight"><pre><code class="language-python" data-lang="python"><span class="n">numeric_df</span><span class="o">.</span><span class="n">hist</span><span class="p">(</span><span class="n">column</span><span class="o">=</span><span class="s">&#39;exercised_stock_options&#39;</span><span class="p">,</span><span class="n">by</span><span class="o">=</span><span class="s">&#39;poi&#39;</span><span class="p">,</span><span class="n">bins</span><span class="o">=</span><span class="mi">25</span><span class="p">,</span><span class="n">sharex</span><span class="o">=</span><span class="bp">True</span><span class="p">,</span><span class="n">sharey</span><span class="o">=</span><span class="bp">True</span><span class="p">)</span> <span class="n">plt</span><span class="o">.</span><span class="n">suptitle</span><span class="p">(</span><span class="s">&quot;exercised_stock_options by POI&quot;</span><span class="p">)</span></code></pre></div> <p>[Figure: histograms of exercised_stock_options split by POI status.]</p>
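<p>The same by-group histogram pattern applies directly to the other two promising features; a minimal sketch of repeating it for them (a straightforward extension, not code from the original notebook):</p> <div class="highlight"><pre><code class="language-python" data-lang="python"># Repeat the POI-vs-non-POI histogram comparison for the other promising features.
for feature in ['total_stock_value', 'bonus']:
    numeric_df.hist(column=feature, by='poi', bins=25, sharex=True, sharey=True)
    plt.suptitle("%s by POI" % feature)
</code></pre></div>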
fBHAHsqKQ0JgYgJkEAKIAE48XdgRAt0RIHwRwO6qiSNBoD0BMggBRADbzxs+AYEVEyB8EcAVFxEHgMAKCJBBeQrgoyWdI+l4SYdJWiNpb6MOmv/ff32bpHtL+uSIWkEAVzCB+CgEpiVA+CKA09YOn4NAFwTIoDwF8HGSjgrSd/ESAvjTkq5oFMn1Q5I4+CcEsIuZxDEg0JIA4YsAtiwZdodApwTIoDwFcFAED5T0/iUE8CHh38YVDAI4jhD/DoEeCBC+CGAPZcUhITAxATKoXAH0C/wOkXS1pJdIetcSVYEATjxd2BEC3REgfBHA7qqJI0GgPQEyqEwBPE/S+yT9r6QzJJ0v6aFLrAgigO3nDZ+AwIoJEL4I4IqLiANAYAUEyKAyBXC4JC6VdCdJp4+oFQRwBROIj0JgWgKELwI4be3wOQh0QYAMqkMAny7pKZKOW1oAN0vapNWrr9Levb5qPPjmj8FXwfFNIF1MOI5RN4Ft27bJ/+3cuVM7duzQ9u3bDWRd+IqdWuEs/hK6efNmbdq0SRs3btSWLVsW/2ODAAS6JUAGHcxzVbd4Z3q0pR4CGe7E6yQdLuk0VgBnen5oDAJLEuC3b1YAmR4QiEmADMpzBdCXc/0amBMkvSa8D/Dbkj4n6UFB9j4a7gF8jKSLJD1S0nsQwJjTjbYhcIAA4YsAMh8gEJMAGZSnAD5J0uvDC56b9XOKpEMlvXTxeu6+l0N/RpLfFfiOJQqNewBjzkDarpYA4YsAVlv8DDwJAmRQngLYZfEggF3S5FgQmJAA4YsATlgq7AaBXgiQQQggAtjL1OKgEFieAOGLADJHIBCTABmEACKAMWcgbVdLgPBFAKstfgaeBAEyCAFEAJOYinSiNgKELwJYW80z3rQIkEEIIAKY1pykN5UQIHwRwEpKnWEmSoAMQgARwEQnJ90qmwDhiwCWXeGMLnUCZBACiACmPkvpX5EECF8EsMjCZlDZECCDEEAEMJvpSkdLIkD4IoAl1TNjyY8AGYQAIoD5zVt6XAABwhcBLKCMGULGBMggBBABzHgC0/V8CRC+CGC+1UvPSyBABiGACGAJM5kxZEeA8EUAsytaOlwUATIIAUQAi5rSDCYXAoQvAphLrdLPMgmQQQggAljm3GZUiRMgfBHAxEuU7hVOgAxCABHAwic5w0uTAOGLAKZZmfSqFgJkEAKIANYy2xlnUgQIXwQwqYKkM9URIIMQQASwumnPgFMgQPgigCnUIX2olwAZhAAigPXOf0YekQDhiwBGLD+ahoDIIAQQASQIIBCBAOGLAEYoO5qEwH4CZBACiAASCBCIQIDwRQAjlB1NQgABbNTAqsrrAQGsvAAYfhwCCCACGKfyaBUC+wiQQawAIoCkAQQiECB8EcAIZUeTEGAFkBXAg8NXWpA0rzVrztStt16mwZ+layVtaPx5j6R1WlhY0Py83ZENAhCYhgACiABOUzd8BgJdESCDWAFkBbCr2cRxINCCAOGLALYoF3aFQOcEyCAEEAHsfFpxQAiMJ0D4IoDjq4Q9INAfATIIAUQA+5tfHBkCSxIgfBFApgcEYhIggxBABDDmDKTtagkQvghgtcXPwJMgQAYhgAhgElORTtRGgPBFAGurecabFgEyCAFEANOak/SmEgKELwJYSakzzEQJkEEIIAKY6OSkW2UTIHwRwLIrnNGlToAMQgARwNRnKf0rkgDhiwAWWdgMKhsCZBACiABmM13paEkECF8EsKR6Ziz5ESCDEEAEML95S48LIED4IoAFlDFDyJgAGYQAIoAZT2C6ni8BwhcBzLd66XkJBMggBBABLGEmM4bsCBC+CGB2RUuHiyJABiGACGBRU5rB5EKA8EUAc6lV+lkmATIIAUQAy5zbjCpxAoQvAph4idK9wgmQQQggAlj4JGd4aRIgfBHANCuTXtVCgAxCABHAWmY740yKAOGLACZVkHSmOgJkEAKIAFY37RlwCgQIXwQwhTqkD/USIIMQQASw3vnPyCMSIHwRwIjlR9MQEBmEACKABAEEIhAgfBHACGVHkxDYT4AMQgARQAIBAhEIEL4IYISyo0kIIICNGlhVeT0ggJUXAMOPQwABRADjVB6tQmAfATKIFUAEkDSAQAQChC8CGKHsaBICrACyAnhw+EoLkua1Zs2ZuvXWyzT4s3StpA2NP++RtE4LCwuan7c7skEAAtMQQAARwGnqhs9AoCsCZBArgKwAdjWbOA4EWhAgfBHAFuXCrhDonAAZhAAigJ1PKw4IgfEECF8EcHyVsAcE+iNABiGACGB/84sjQ2BJAoQvAsj0gEBMAmQQAogAxpyBtF0tAcIXAay2+Bl4EgTIIAQQAUxiKtKJ2ggQvghgbTXPeNMiQAblKYCPlnSOpOMlHSZpjaS9jdK6u6RXS7q/pOskXSTp9UuUHgKY1pykN5UQIHwRwEpKnWEmSoAMylMAHyfpqCB9Fw8J4B0kXSXpE5JeECTwVZK2SPrAiDpEABOdnHSrbAKELwJYdoUzutQJkEF5CuCgrh4o6f1DAni6pDdLuoukm8KOl4aVwjMQwNSnJP2rhQDhiwDWUuuMM00CZFB5AujLvSdLshwOtidKerGkIxHANCcivaqPAOGLANZX9Yw4JQJkUHkC6Hv/1kt6bKPQHi7pcklzCGBK04++1EyA8EUAa65/xh6fABmEAHIPYPx5SA8qJED4IoAVlj1DTogAGVSeAE55CXizpE1avfoq7d17Nd8FnNAkpSvlENi2bZv8386dO7Vjxw5t377dg1snyV+yXeu2+Evo5s2btWnTJm3cuFFbtmxZ/I8NAhDolgAZdDDPVd3inenRRj0Eclp4COTwxkMgb5DkkOUhkJmeHhqDwNIE+O2bFUDmBwRiEiCD8lwBvFN4DcwJkl4T3gf4bUmflXSrpE9JurLxGphXSjpV0gdHFBuXgGPOQNqulgDhiwBWW/wMPAkCZFCeAvik8GLn24aq6BRJH5J0TBDDwYug/T5ArwKO2hDAJKYinaiNAOGLANZW84w3LQJkUJ4C2GUVIYBd0uRYEJiQAOGLAE5YKuwGgV4IkEEIIALYy9TioBBYngDhiwAyRyAQkwAZhAAigDFnIG1XS4DwRQCrLX4GngQBMggBRACTmIp0ojYChC8CWFvNM960CJBBCCACmNacpDeVECB8EcBKSp1hJkqADEIAEcBEJyfdKpsA4YsAll3hjC51AmQQAogApj5L6V+RBAhfBLDIwmZQ2RAggxBABDCb6UpHSyJA+CKAJdUzY8mPABmEACKA+c1belwAAcIXASygjBlCxgTIIAQQAcx4AtP1fAkQvghgvtVLz0sgQAYhgAhgCTOZMWRHgPBFALMrWjpcFAEyCAFEAIua0gwmFwKELwKYS63SzzIJkEEIIAJY5txmVIkTIHwRwMRLlO4VToAMQgARwMInOcNLkwDhiwCmWZn0qhYCZBACiADWMtsZZ1IECF8EMKmCpDPVESCDEEAEsLppz4BTIED4IoAp1CF9qJcAGYQAIoD1zn9GHpEA4YsARiw/moaAyCAEEAEkCCAQgQDhiwB
GKDuahMB+AmQQAogAEggQiECA8EUAI5QdTUIAAWzUwKrK6wEBrLwAGH4cAgggAhin8mgVAvsIkEGsACKApAEEIhAgfBHACGVHkxBgBZAVwIPDV1qQNK81a87UrbdepsGfpWslbWj8eY+kdVpYWND8vN2RDQIQmIYAAogATlM3fAYCXREgg1gBZAWwq9nEcSDQggDhiwC2KBd2hUDnBMggBBAB7HxacUAIjCdA+CKA46uEPSDQHwEyCAFEAPubXxwZAksSIHwRQKYHBGISIIMQQAQw5gyk7WoJEL4IYLXFz8CTIEAGIYAIYBJTkU7URoDwRQBrq3nGmxYBMggBRADTmpP0phIChC8CWEmpM8xECZBBCCACmOjkpFtlEyB8EcCyK5zRpU6ADEIAEcDUZyn9K5IA4YsAFlnYDCobAmQQAogAZjNd6WhJBAhfBLCkemYs+REggxBABDC/eUuPCyBA+CKABZQxQ8iYABmEACKAGU9gup4vAcIXAcy3eul5CQTIIAQQASxhJjOG7AgQvghgdkVLh4siQAYhgAhgUVOaweRCgPBFAHOpVfpZJgEyCAFEAMuc24wqcQKELwKYeInSvcIJkEEIIAJY+CRneGkSIHwRwDQrk17VQoAMQgARwFpmO+NMigDhiwAmVZB0pjoCZBACiABWN+0ZcAoECF8EMIU6pA/1EiCDEEAEsN75z8gjEiB8EcCI5UfTEBAZhAAigAQBBCIQIHwRwAhlR5MQ2E+ADEIAEUACAQIRCBC+CGCEsqNJCCCAjRpYVXk9IICVFwDDj0MAAUQA41QerUJgHwEyiBXAqQRw165dmp/3R6W5uTmtXbuWOQUBCLQgQPgigC3KhV0h0DkBMggBbCmAX5a0QdIt+4tx/fojtHv3NUhg59OTA5ZMgPBFAEuub8aWPgEyCAFsKYDXBgHcJckf3bP454WFhf0rgumXPT2EQHwChC8CGL8K6UHNBMggBHBKAVxoCOA6BLDmFGHsUxEgfBHAqQqHD0GgIwJkEAKIAHY0mTgMBNoQIHwRwDb1wr4Q6JoAGYQAIoBdzyqOB4EJCBC+COAEZcIuEOiNABmEACKAvU0vDgyBpQkQvggg8wMCMQmQQWUK4PMk+b/BdpukyyWdMaLYEMCYM5C2qyVA+CKA1RY/A0+CABlUrgCeKul0HRjfzeGR3eHCQwCTmIp0ojYChC8CWFvNM960CJBB5QrggyWdPEG5IYATQGIXCHRNgPBFALuuKY4HgTYEyKByBfBcSTeFVb+/l7RV0te4BNxmerAvBPojQPgigP1VF0eGwHgCZFCZAvgwSYdK+pykjZJeIul6SacggOMnBXtAYBYECF8EcBZ1RhsQWIoAGVSmAA6f701BBo+X9Imhf+QSMPkAgQgECN/bC+CVV16pyy/382pLb5s2bdI555yjVatWRThrNAmBcgiQQXUIoCv2BklPkfTW0QK4WdImrV59lfbuvVrS4Js+Bl/9ttSf/VVwfBNIOZHASPoksG3bNvm/nTt3aseOHdq+fbubW7fEA1p9diWlYy/+Erp582bt2vUl3XDDWkn3DRcvhrv5TUmv0k033aRDD/VFDjYIQKANATLoYFo1/Bp5lKSdkk6Q9HFWANtMF/aFQD8E+O379iuAj3zkz+iKKx4Tflcdxd13shyOAPZTkhy1MgJkUJkrgC+V9HZJXr7z5d/fkuTXwPz4iPrmEnBlk57hpkGA8EUA06hEelErATKoTAF8c5C99ZK+JOndki6Q9FUEsNapzrhTI0D4IoCp1ST9qYsAGVSmALapYlYA29BiXwh0RIDwRQA7KiUOA4GpCJBBCCACONXU4UMQWBkBwhcBXFkF8WkIrIwAGYQAIoArm0N8GgJTESB8EcCpCocPQaAjAmQQAogAdjSZOAwE2hAgfBHANvXCvhDomgAZhAAigF3PKo4HgQkIEL4I4ARlwi4Q6I0AGYQAIoC9TS8ODIGlCRC+CCDzAwIxCZBBCCACGHMG0na1BAhfBLDa4mfgSRAggxBABDCJqUgnaiNA+CKAtdU8402LABmEACKAac1JelMJAcIXAayk1BlmogTIIAQQAUx0ctKtsgkQvghg2RXO6FInQAYhgAhg6rOU/hVJgPBFAIssbAaVDQEyCAFEALOZrnS0JAKELwJYUj0zlvwIkEEIIAKY37ylxwUQIHwRwALKmCFkTIAMQgARwIwnMF3PlwDhiwDmW730vAQCZBACiACWMJMZQ3YECF8EMLuipcNFESCDEEAEsKgpzWByIUD4IoC51Cr9LJMAGYQAIoBlzm1GlTgBwhcBTLxE6V7hBMggBBABLHySM7w0CRC+CGCalUmvaiFABiGACGAts51xJkWA8EUAkypIOlMdATIIAexEAHft2qX5eR9Kmpub09q1axf//80336xbbrll/8Tau3evVq9eveSfm5+tbjYy4KoIEL4IYFUFz2CTI0AGIYArFMAvS9og6YDkrV9/hHbvvmax2I888mjdcMN1+wt/1apDdNtt31ryz4PPDgQyuRlDhyDQEQHCFwHsqJQ4DASmIkAGIYArFMBrgwDukuRD7Vn888LCwmJBrlu3TtLg33ZLOnaZPx/47GA1caqq5kMQyIAA4YsAZlCmdLFgAmQQAtiRAFr4BgK4bkgAB/82kMWl/mwB3PdZBLDg1GFoiwQIXwSQqQCBmATIIAQQAYw5A2m7WgKELwJYbfEz8CQIkEEIIAKYxFSkE7URIHwRwNpqnvGmRYAMQgARwLTmJL2phADhiwBWUuoMM1ECZBACiAAmOjnpVtkECF8EsOwKZ3SpEyCDEEAEMPVZSv+KJED4IoBFFjaDyoYAGYQAIoDZTFc6WhIBwhcBLKmeGUt+BMggBBABzG/e0uMCCBC+CGABZcwQMiZABiGACGDGE5iu50uA8EUA861eel4CATIIAexFAP3dwN42bPDXxPEi6BLCgjF0S4DwRQC7rSiOBoF2BMggBLBjAbz9dwMjgO0mJXvXQYDwRQDrqHRGmSoBMggB7FgAm98NfGP47l9WAFMNAPoVjwDhiwDGqz5ahgBfR+kaWFV5IfQkgJY+f7cvl4Arry+GvwQBBBABZHJAICYBMggBRABjzkDarpYA4YsAVlv8DDwJAmQQAogAJjEV6URtBAhfBLC2mme8aREggxBABDCtOUlvKiFA+CKAlZQ6w0yUABmEACKAiU5OulU2AcIXASy7whld6gTIIAQQAUx9ltK/IgkQvghgkYXNoLIhQAYhgAhgNtOVjpZEgPBFAEuqZ8aSHwEyCAFEAPObt/S4AAKELwJYQBkzhIwJkEEIIAKY8QSm6/kSIHwRwHyrl56XQIAMQgCTFsCbb75Zt9xyy/65Njc3p7Vr1+7/8/C/7927V6tXr97/78N/Hv58zEk8bmwx+0bb/RMgfKcXwOuuu06HHnrokidpknk+PP+GD9bFMXzMSY7Tf7XNroVxXN2T4Vyehv3sRjSblibhNq6W2h6DDEIAkxVAF/ORRx6tG264bv8MXL/+CO3efc2iBI7691WrDtFtt31r//7Df25+fjbTen
Qr48YWs2+0PRsChO80AuivmvwBSQd+KRx1tsbN81Hzb/g4XRzDxxx3nNlU22xamYSrezKcy23Zz2Y0s2tlUm7L1dI0xyCDEMBkBfBAce6S5G7u+2q5hYUFzc/P6/b/vjt89/Bg/+E/H/z52U3v27c0bmwx+0bbsyFA+E4jgNdI2iRpMMdHnavx8/z282/4OF0cw8ccf5zZVNtsWhnP1f0YzuX27Gczmtm1Mhm35WtpmmOQQQhgBgLo7xUeCOC6EQI4+HevDiz33cOeQAc+P7vpvZwAjh5bzL7R9mwIEL4rEcDBvFlKAJef5wfYL3Wc8Vkx/hgDAUwjc2ZR1ZMxGc7pUQJYD7PFKtmzR+vWrZM0fV1PcwwyCAFEAGeRjENt3H6yjv+BE6GbNNkjAcIXAeyxvKIcejIJQQBvp7wIYJR6daOrorWcRsMIYITzgABGgJ5YkwggAphYSa64OwjgdAgn47b8IsE0xyCDEEAEcLo5u6JPIYArwlfEhwlfBLCIQm4MYjIJYQWQFcB0Kr/kFcDzJP2qpDtKeq+kp0r67yH0CGCEWkQAI0BPrEkEEAFMrCRX3B0EcDqEk3FjBXA6ust/qlQBfLKk35P0BEl+dM7/32N9EAK47ynimBsCGJN+Gm0jgAhgGpXYXS8mExlWAFkB7K7mVnqkUgXw45L+VtJzA6CjJX1e0r0kfbIBLYEVwLdKOiO8MmG5p3wP/g3o9mEz3VPA27Zt05YtW1ZaR60+f6Dvo8fe6mAr2DnG2Jvdrbl9BDAlAdwmqZkB4x/Kmkx2xh/HFEqZB5MxGSWATf6TMVtB7B300RTYn3jiiTwF3NUJbXmcEgVwTtJNkh4q6QMNHjskvVjSa9MSwHMkvSKaAJ577rm65JJLWpbNynY/EJSjx76yo0/+6Rhjb/au5vYRwJQE8FxJzQwYLyGTyc7445hCKfNgMiajBLDJfzJmk6fc8numwP75z38+AtjVCW15nBIF8PvC2zaPk/TpBo+PSnq7pBchgAcuAccIAARwXwXGYJ+KgCKACOCAQCnzAAFsaR8hAxHA9ty6+gQCuPj2yV/0t1b6YkS4UnyVpMMkfUnS/SRN8+evT/jZJ0l6oaQbF7/J46qrrtJhhx2mG2+8Uccee2yj7XH/Pq6vB39+UEAXXnihnve853VVTxMd58DYDh77rl27Znp/4vnnn6+LL754oj73sVPN7fuH5YYNfnG5/AZYL3vUui3ehuLaf+xjz9JHPvLt8I0+o3AY0581MmHUPqPneXPP22fLhZKaGTDNMabriz8VI4Oave2q/dtzHcVkOKcXCTT4j2ff5UTpauzT9snt+xeAg3/Wta+lydjvYzv4OUMGlfkamDaXgI+U5DV5NghAIA6B7w8r9nFaj98qGRT/HNCDuglUm0ElrgC6lCd9CMTjv2tYfqt7CjB6CMyegJfZvyjpttk3nUyLZFAyp4KOVEig6gwqVQDPCq9+8TVGvwbGdzivlnRKhQXOkCEAAQhAAAIQgMBBBEoVQA/yOZKeIcn32Py9pKdJ+jLnHwIQgAAEIAABCNROoGQBrP3cMn4IQAACEIAABCAwkgACSGFAAAIQgAAEIACByggggJWdcIYLAQhAAAIQgAAEahRAj/kYSd8TTr/vC/xcxCcR7yzpq5TiTAhw7meCmUbGEKAO6y6RlM4/P38qrsWaM8Du5wAAC2hJREFUBNBPAfu7gc+W5KJvjv16SX8o6SJJe3uqh6dLeo2km8MTyX7z6jMlfbekb4Tvg7tAkt8E2+eWUvh4nLMIIM79voqq8dz3OZfaHps6rLsOY55/fv6Mnq2z+PnTNidmtn9NAuhXwfyspK2S3ifpK4Hy4ZIeHOTvLZKe1RN9i52/ps4rjv7yx9+U9OvhnYX3kvSy8N/Le2o/Zvh4SDEDiHPPLz89TatWh6UO667DmOefnz9pLMC0Coy+d65JAC18j5X0wSWg+h2BfynpLj1B98riEUEArwwrfn/SaOvnJHkF8Id7aj9m+HhIMQOIc1/3Lz89TanWh6UO667DmOefnz9xF2Bah8UsPlCTAC6Elb5/XQLsCZLeG76btA/2noDfG1YeHQQPkvTpRkMbw5+/q4/GQ7u1CjDnvu5ffnqaUq0PSx3WXYcxzz8/f+IuwLQOi1l8oCYBfL2k+0n6VUn/0LjX7jskPTB8c4jl8Mk9gfcE9CrcTZKeKskrfs3VyHsHAfU9CX1sMcPH44kZQJz7fbc51PrLTx/zaZpjUod112HM88/Pn7gLMNPkRe+fqUkAvbL2SklnBqpfC/97x/AE8J9LOic8kNEHeMte8ztP/1TS6xoNnS9pS5DRPtqPGT4DAYwlwJz7un/56WM+TXNM6rDuOox5/vn5E3cBZpq86P0zNQngAOadJN1n6DUwn5D0P73TXr6BTZJukXRtT/2IGT4eUuwAch849/uKq7ZffnqaUlMfljqsuw5TPP/8/Ol3AWbqsOjzgzUKYJ88czh2iuFjbn0HUA7npu8+cu77JszxJyFAHU5Cqcx9OPcJnVcE8MDJOF7Sd0r6UKTzE7v9SMNOotnY7GO3n8RJoBOKXQex26+9BGLyj9l27ec92vgRwAPoPyPpHpL8UEiMLXb7sQMgZvux2cduPyZ7z7XY7ceY76PajF0HsduPXQex24/JP2bbKWRA7HMfJYMQwAPY7yppjaT/inImpNjtxw6AmO3HZh+7/ZjsPd1itx9pyt+u2dh1ELv92HUQu/2Y/GO2nUIGxD73UTKoRgG8p6QfHXoI5KOSPjmjMxC7/aWGGTsAYrc/o9OfZDOx2cduf9YnJXYGxG6fDJp1xaXfXuwMiN1+lDNUkwD6Jcz+qreTJe0c+io4v4TZ7wb0V8X5q9r62GK3P25Mfkr4vhHvgZxF+4eF1/wMf9+zvybv+yV9YRykFf577Pa/p1Hfc5JOlfQtSR+W9PUVjm2Sj8duf5I+9rlP7AyI3f44trPIgOX6MKv2Y+ZAzLbNPnYGxG5/3ByY6b/XJICXS5qX9AuSrhmifLSkP5a0R9KjezoDsdsfN6zNkvw6nFj3QPbZvr/e782S/HV/N0t6g6TfCC/lNhf/YPxij2OP3f7dJf2dJNf5v0t6lKR3SbqbJGeAf+l5iKTPjyuSKf89dvtTdrvzj8XOgNjtjwPaZwaMa9v/3nf7MXMgZttmGzsDYrc/Sf3NfJ+aBPAbkk5c5lKvJ79XQr67p7MQu32vci233UvSx3qUoJjt+4Xbx0p6Zvgl4IXhpdwPk+RvSLEAfknSuD5OWxqx2/cPfn8X83PDN92cFkTQ30bjDLAcf1PS46Yd4JjPxW6/p2G1PmzsDIjd/rj5VXIGuVhi5kDMtj322BkQu/3WYTGLD9QkgLslnR0KcRTbnwrfFOJ7AfrYYrdvAVhucy34m0r6WgGM2b5X905vfBWaL3++KTz17ZUv/2DqcwUwdvvXh9VPr/4dGi73niTpn0NB+CsS/0rSUX0UvqTY7fc0rNaHjZ0BsduPmQE+WbHbj5kDMds2+9gZELv91mExiw/UJIDnSfLXrf2upA807oXyPQG+N
PgMSRdLemlP4GO3729/eLGkjywxPi+Rv7pHAYzZ/o3h/sb/bIzd0ndp+H5or3z5QaC+5Dd2+1758Qro4Al398cr3jsCD4vf1UEO+yj/2O33MaZpjhk7A2K3HzMDfL5itx8zB2K2bfaxMyB2+9PkRe+fqUkADfPJkn4l/PAbXI7wAwHbJf1BuDesT+gx23+vpPcFCRw1RgvBlT1eBo3Z/sclvVzSZUMDd/3/iaSflHTnHgUwdvt+xYG/5/r9YfyPDP/fl3293V/SX/S4Ahi7/T7ndNtjx8yA2BkYMwM89tjtx8yBmG2bfewMiN1+25yYyf61CeAAqi8B+ge+t6+G7+CdCfDQSIz2/XCLv+nElz5Hbf6KHl8m9apYH1vM9r3y+2NB9EaNzSufT+lRfmO3/xxJn5P01iVO7EuC/J3Zx4mXFLv9noa1osPGyIBmh2O0HzMDPPbY7cfMgZhtm33sDIjd/orCoq8P1yqAffHkuBCAAAQgAAEIQCB5Aghg8qeIDkIAAhCAAAQgAIFuCSCA3fLkaBCAAAQgAAEIQCB5Aghg8qeIDkIgGgHfM+WHR/xF6f4GAX9X9vC3qCzVOb9k2i9c96uFmjnjP/u9i34tAxsEIACB5QiQQT3WBwLYI1wODYHMCfj1OH5FjKXPr0hqI4DOlsOHxv97ko4MX8eYORq6DwEIzIAAGdQjZASwR7gcGgKFEHhgeG3MsADeR9JvS/pRSddJeqOkFyyxSrg27HNu+EaEQtAwDAhAYAYEyKAeICOAPUDlkBAojMCo8F0vyS/W9svF/1rSBkmvDd+p/bIR4/dv8n7dzhHhm0gKQ8RwIACBHgmQQT3ARQB7gMohIVAYgVHhe4Gk4yT9TGOsPx9WAP2tMsPbtvDtO08ojA3DgQAE+idABvXAGAHsASqHhEBhBEaFr7855FFDL1H3V+n5v0OGxu/7/vw1dFvCt9EUhofhQAACPRMgg3oAjAD2AJVDQqAwAqPC911hRe/Coad8PfTBdwwPMPhbCJ4myU8Gs0EAAhBoS4AMaktsgv0RwAkgsQsEKicwKnx979+DJd1vAjb/IekvJfmyMRsEIACBtgTIoLbEJtgfAZwAErtAoFIC/n5ovwbmBEmvCe8D/Hb4XmH/2/bw/cKvkHSzpM2S7iHpRQ1eJ0q6QtIPSfpspRwZNgQgMB0BMmg6bhN9CgGcCBM7QaBKAk+S9PrwMucmgFMkfUjSD0vyE78nhVe/eKXPMvhnjZ3/KDws4n3YIAABCLQhQAa1odVyXwSwJTB2hwAEIAABCEAAArkTQABzP4P0HwIQgAAEIAABCLQkgAC2BMbuEIAABCAAAQhAIHcCCGDuZ5D+QwACEIAABCAAgZYEEMCWwNgdAhCAAAQgAAEI5E4AAcz9DNJ/CEAAAhCAAAQg0JIAAtgSGLtDAAIQgAAEIACB3AkggLmfQfoPAQhAAAIQgAAEWhJAAFsCY3cIQAACEIAABCCQOwEEMPczSP8hAAEIQAACEIBASwIIYEtg7A4BCEAAAhCAAARyJ4AA5n4G6T8EIAABCEAAAhBoSQABbAmM3SEAAQhAAAIQgEDuBBDA3M8g/YcABCAAAQhAAAItCSCALYGxOwQgAAEIQAACEMidAAKY+xmk/xCAAAQgAAEIQKAlAQSwJTB2hwAEIAABCEAAArkTQABzP4P0HwIQgAAEIAABCLQkgAC2BMbuEIAABCAAAQhAIHcCCGDuZ5D+QwACEIAABCAAgZYEEMCWwNgdAhCAAAQgAAEI5E4AAcz9DNJ/CEAAAhCAAAQg0JIAAtgSGLtDAAIQgAAEIACB3AkggLmfQfoPAQhAAAIQgAAEWhJAAFsCY3cIQAACEIAABCCQOwEEMPczSP8hAAEIQAACEIBASwIIYEtg7A4BCEAAAhCAAARyJ4AA5n4G6T8EIAABCEAAAhBoSQABbAmM3SEAAQhAAAIQgEDuBBDA3M8g/YcABCAAAQhAAAItCSCALYGxOwQgAAEIQAACEMidAAKY+xmk/xCAAAQgAAEIQKAlgf8PfVlzoslRdXMAAAAASUVORK5CYII=" /></p> <div class="highlight"><pre><code class="language-python" data-lang="python"><span class="n">numeric_df</span><span class="o">.</span><span class="n">hist</span><span class="p">(</span><span class="n">column</span><span class="o">=</span><span class="s">&#39;total_stock_value&#39;</span><span class="p">,</span><span class="n">by</span><span class="o">=</span><span class="s">&#39;poi&#39;</span><span class="p">,</span><span class="n">bins</span><span class="o">=</span><span class="mi">25</span><span class="p">,</span><span class="n">sharex</span><span class="o">=</span><span class="bp">True</span><span class="p">,</span><span class="n">sharey</span><span class="o">=</span><span class="bp">True</span><span class="p">)</span> <span class="n">plt</span><span class="o">.</span><span class="n">suptitle</span><span class="p">(</span><span class="s">&quot;total_stock_value by POI&quot;</span><span class="p">)</span></code></pre></div> <p><img 
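<p>The <code>hist</code> call above assumes a DataFrame named <code>numeric_df</code> and matplotlib's <code>pyplot</code> imported as <code>plt</code>, both set up earlier in the post. As a minimal, self-contained sketch of the same grouped-histogram pattern, a toy <code>df</code> with made-up numbers stands in for <code>numeric_df</code> below:</p> <div class="highlight"><pre><code class="language-python" data-lang="python"># Minimal sketch: one histogram panel per value of the grouping column.
# `df` and its numbers are placeholders for illustration, not the post's data.
import pandas as pd
import matplotlib.pyplot as plt

df = pd.DataFrame({
    "total_stock_value": [1.1e6, 2.5e5, 3.0e6, 4.0e5, 7.5e5, 2.2e6],
    "poi": [True, False, True, False, False, True],  # the boolean label the post groups by
})

# `by` splits the data into one subplot per group; shared axes keep the
# two panels directly comparable.
df.hist(column="total_stock_value", by="poi", bins=25, sharex=True, sharey=True)
plt.suptitle("total_stock_value by POI")
plt.show()</code></pre></div>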
<div class="highlight"><pre><code class="language-python" data-lang="python"><span class="n">numeric_df</span><span class="o">.</span><span class="n">hist</span><span class="p">(</span><span class="n">column</span><span class="o">=</span><span class="s">&#39;bonus&#39;</span><span class="p">,</span><span class="n">by</span><span class="o">=</span><span class="s">&#39;poi&#39;</span><span class="p">,</span><span class="n">bins</span><span class="o">=</span><span class="mi">25</span><span class="p">,</span><span class="n">sharex</span><span class="o">=</span><span
class="bp">True</span><span class="p">,</span><span class="n">sharey</span><span class="o">=</span><span class="bp">True</span><span class="p">)</span> <span class="n">plt</span><span class="o">.</span><span class="n">suptitle</span><span class="p">(</span><span class="s">&quot;bonus by POI&quot;</span><span class="p">)</span></code></pre></div> <p><img src="data:image/png;base64,iVBORw0KGgoAAAANSUhEUgAAAoAAAAHgCAYAAAA10dzkAAAgAElEQVR4Xu2dC7idV1nnfykQCrEnJFBASiCNXLQqQQxKgSly0aNFkIIDUabc5OJwEaegYHGAAlPA0YKAzIBAC4oEtQo6jAShXIaLAnUIYBXBNJqEq7RzUlJLoe08b7J2u7O79z5nn/Pt9X1rfb/vec4Dzfn2Wuv9r/f9n99e67usw0MFVEAFVEAFVEAFVKBXCqzrVbQGqwIqoAIqoAIqoAIqgABoEqiACqiACqiACqhAzxQQAHs24YarAiqgAiqgAiqgAgKgOaACKqACKqACKqACPVNAAOzZhBuuCqiACqiACqiACgiA5oAKqIAKqIAKqIAK9EwBAbBnE264KqACKqACKqACKiAAmgMqoAIqoAIqoAIq0DMFBMCeTbjhqoAKqIAKqIAKqIAAaA6ogAo0rcAHgf8DvLDphufY3rXAQ4CL5tiHTauACqhAZxQQADszFQ5EBapRQACExwPnA9cBAZf7gT8CXgRck2Z6PfAbwC8CdwYuA94D/Ffgq0PZcCnwUuAt1WSIgaiACrSugADY+hQ4ABWoTgEB8CgA/jfgR4CbAvcF3gqcm37Ce/8KuBtwFvBpYAvwMuD7gB3Av6XMEACrKxEDUoH2FRAA258DR6ACtSkQABhAcyLw88A3gF8H/mQo0J8BXpkA6F/TCtcfpN/HalhAzxnA2cAPAp8CzgQOpHPGQeYwKN0ceB3wcOAEYB/wfOAvJogdq3RPBR4H3Bv4PPDE9L8/DnwIuD2wNPT52Ob+a+AlY9oMAIxVuzsN/e5/AvdIMBj9vDn99z8MnRPj/idgdxpP/EoArK1CjEcFOqCAANiBSXAIKlCZAgFn90qAF9D3H4EXA98P7AW2Av8IvAJ4O/BTwKuB04BPpO3QgJ7PAL+aAPICIEAx2opjOQAM4Hw08GTg8tT3txPIjZM7APArwLOASxLUxepdrNDFNu7fpzH+fvrwycAX02rdv6wQACPG+6fVvQDRWBk8fcxnA1Rj/JvT7wTAygrEcFSgCwoIgF2YBcegAnUpEHAWK1mx7Tk4YrUs4C7AJsDvJ4D7DP3+HcBxwGOGADBg78J0zk7gNcBtVwiAce4tgKesUNoAwJcDL0jnLwBfThD5v4HnAQ9LABenxA0ucdNIQOu4Y3QFMGAyVvViG/jXEmS+N23/jn7+ESnu26VtYAFwhZPoaSqgAitXQABcuVaeqQIqsDIFAgD3pNW7wSd+B7gL8HPAn6XVvFjdGxy/AjwJuOcQAP4A8IV0QgDjB9KqWazILbcCGNu4sT0bq3TvS0D1d1OGHwD4SOBdQ+dcDPwh8CrgDmkb+e5pSzbG9VtpG3cSAMZNG4fTmGO1708TkMa/xSqjALiyfPIsFVCBOSggAM5BVJtUgZ4rEHAW27f/ZUiHWQEwtorvmraMo5kHpEe03CzdVRsw+PF0x+ygm7g+MFbmBnfL3gp4KBDXGwbc/SZw3oS5CQCMaw7fPQEA459jJfBvElBG/98LHJrQXqwAxg0fseUbd/3GauJ3h879ywSGMbbRI+4MjlVCt4B7XkiGrwLzVEAAnKe6tq0C/VQgADAecXK/ofA/kuAptoBjq/WBY7aAw49iqzduAlkOAHcBVwxt8cYNJ3ENX9zIMe5xKQFUAWU/NAUAR7eAD6Yt6QC/OOKawrizN1YWNwG/MGV6x90EMnx6/P5NwPa0Gjj43fFp1TNWB5+W/tEt4H7WkVGrwFwVEADnKq+Nq0AvFRjcBBLX+sW256PSTRWxpfvPCfDiJpDYQo0t1sW0MvcfgL8d2gKOLeMAwThGVwCfnu6yjbuM43EpccdtXJMXW8kBgLG9HCuCsRIZ1wLGyl/cwRvnjztiBTBW6Z6dgCxuWokbWWLLN34XR0BtPJ/vlmm1MB7jMulYDgDjesf4fKxyPifd5Rx3DMcdxduAH/MxML2sHYNWgWwKCIDZpLYjFeiNAvE2jbh+Lh6bEtuqAWixAjf8GJiAvgDAAKy4uzfAJ2AwjpWsAAaM/V6Cy9iGHdxcEs/RCwCMmz+eme7S/Xfg/Qnuvj5hFmKbNlbc4tEvAX5x12/8/8+NnP8/0nWMdxwCw3FNLgeAA6CM7d7HpsfFxN3KsTUcD4L+2lCjAcGDuHqTRAaqAiowXwUEwPnqa+sqoAJ1KRDXCMaNJc+tKyyjUQEV6JsCAmDfZtx4VUAFVqNAPBYmHggdK3Q/nCBwNe34GRVQARXohAICYCemwUGogAp0XIG4rvFH000g8QYTDxVQARUoWgEBsOjpc/AqoAIqoAIqoAIqMLsCAuDsmvkJFVABFVABFVABFShaAQGw6Olz8CqgAiqgAiqgAiowuwIC4Oya+QkVUAEVUAEVUAEVKFoBAbDo6XPwKqACKqACKqACKjC7AgLg7Jr5CRVQARVQARVQARUoWgEBsOjpK3Lw8YaE88eM/DrgJ4F4i8RKjicD8VaGm63kZM9RARVQgSEFBq/3mybKvvRaPoVTgSoVEACrnNZOBxUAGK/qineyHhwZ6SXAt1Y4+l9KABivBPNQARVQgVkUiHctDx/vSu+NfhEw+Lv4bWDPLI16rgqUpIAAWNJs1THWAQDeFYh3nK72EABXq5yfUwEVGFXgUuD/AI9boTTHA1et8FxPU4FOKiAAdnJaqh7UcgC4Mb1t4UHAnYD/B3wS+DXgn4eUGQeAZwGxNbw1mfOXgJcA/2voc7HyGO9xjdd5XQ28D4jPja5GVj0JBqcCKnCMAtMAcFd6C0x4zn8H7gHESuHvAv8O7AT+eKi1uwP/ANwnedfgV49JXvNDQKwu7gaeA3zZuVCBNhQQANtQvd99DgDw+0dWAOMawLguZwvwfCBevfU14FbAE4DTgPjMN5N8owAY7f5+MuaPA7cAtgNfBd6aPvPMZNpvAv4MCNgMQDwOuCdwZb+nxuhVoLcKTAPAdwA/DVwOvDy9B/oy4AtTADAuZzl1CAB/Ffgd4A3Au5OvvTR5XniPq4m9Tb32AhcA29O+rz1PugnkownyRnUJOIuffwTOA14/AQDjhpAfSd+6x2l7Qlrlezvwn4dOODkZ+X8Bfq+vk2LcKtBzBZYDwEcDD0lfTAdS3XyFK4DxRTN2GOLmt2cN6XyXtFIYX0wDDD1UIKsCAmBWue0MGKwA
PmJk2/WK9M06RAqzDSCLrZRYAYwjVggD0H5lAgA+KZno64C4oDu2jWN7ZnDEN/j3AD8BxArh4IgaiAu9P5e2cpwkFVCB/imwHAD+LBBfIoePlQJgfDZW/e4/siUc3vP3wKeA/9Q/yY24bQUEwLZnoH/9L3cNYIBhbM/G9TXvTVu+sTUc27ifAJ46AQDjn58GPCVt58b1fXHtX1xjsz9d3H3BBLkDLuPxM/EYGg8VUIH+KbAcAO4A4sa11QBgfDmNy07GHeE9cR3yz/RPciNuWwEBsO0Z6F//ywFgXG+zGVgckSYulA6gmwaAg48sAA9N19zEjSP/If33XwKPTVu+o8oPr0D2b1aMWAX6rcByAPijwN1GJIpLU+JmjgC8Pxj63f2AjwxdA/hzwJ+nHYa4MW30OASM+/d+z4jRz10BAXDuEtvBiALLAWBslURePnzoc7GF8hfpW/RKAHDw0VhFjMc6bEo3fBwAfguIi689VEAFVGCgwGoAMD4bD4uOO4B/fUjKV6SnFgxuAgn/Ce95WbqJRNVVoBMKCICdmIZeDWI5APxlIK7jOweIG0Pulcw1tnT/asoKYGyxxF16sU38jXTH8LnAB4au7Xs68Crgzamt+OZ9EvDA9N+x9eyhAirQPwVWC4ABe+Er8eSCuCs4bhSJR01tG7kLOG7++O30pIJ4/EvsOAy8J3Ym4ouvhwpkVUAAzCq3nQ3dBDLpQdCRk/GMrdhWuXV6Ov/zgIC5uBN40gpggGU8LuaUdLF2bBkH0AVIHh5S/vT0HMDY0rlJuhHlw+n5Xl90hlRABXqpQDyUPh4EHT4yesRlKfFFNG5KGz3icVPxbMBHArdMOxXxRIL48jr8GJj43MPScwCjrYH3fCjtSgw/47SXE2DQ+RUoEQDPAJ4BxEW5cVdWvAt2+L2O8fsXpus1YkXoT9IK0nfyy2uPKqACKqACKqACKtA9BUoEwLiIP94QEdAXq0LDABirSnFb/dkJ/OKNEHFxbtz9GVDooQIqoAIqoAIqoAK9V6BEABxM2gPSozuGATCuvXhjuot0cF4sz/8gEFt/HiqgAiqgAiqgAirQewVqA8C4qDZewfPEdP1XvFYsbhyIV4TFHaEeKqACKqACKqACKtB7BWoDwJjQBwMXpgty40LbeMVO3KXloQIqoAIqoAIqoAIqkJ63VqoQ47aAvw/423S7fTw0+M7Aa4G4Kyu2gkePAOA7pFvyS9XBcatAqQrETVxxt3a8DaGvhx7U15k37i4o0GsPqm0FMJ7JFHcHx7OYBscvAK8BThyTbbFlHA/o9FABFWhHgTuOvBO6nVG016se1J729qwCoUBvPag2AIy3PGwfeY3YzvRg4duMyfV4ZdjS/v37WViI/zv5OHToEFu2xCWFcYnh6DvBB5/7CvBjU86JZ3+ewkr6a6ouzz77bM49N26Wnv+Rs6+Ixv6andOcet5QT2wE4oHcfT1W7EFNCZRznq3TpmbthnZyzl/OvnLnih509JVbpR3xWp14DMy90x2/seJ3TXqXYvz/9wPPAeLp6vEYmNj+jbdDxI0ho8cR811aWloRAG7cGH+rloBJsBiLiQGJk86Jv3Mbo8Nl+2tqUs466yzOO++8ppqb2k7OvmIg9tfstObUM8z3aD0JgCv1oKZmO+c8W6dNzdoN7eScv5x95c4VPahMAIwntZ8/5rqheJ1XvID7F9ODn+8CXJaezP4bE67zEwAb9KeazSK3OdXen+Z7feGt2IOaKlXrtCklj7ZTs541x6YHlQmATVbvis33hmQpawVw9+7dLC4uNqnZxLZy9hWDsL9mpzWnnppvewCYc56t02ZrNLeeNeeKHiQAVg+AzduPLarA2hXQfNsDwLXPni2oQPkK6EECoABYfh0bQYEKaL4CYIFp65ArUkAPEgAFwIoK2lDKUUDzFQDLyVZHWqMCepAAKADWWNnG1HkFNF8BsPNJ6gCrVkAPEgAFwKpL3OC6qoDmKwB2NTcdVz8U0IMEQAGwH7VulB1TQPMVADuWkg6nZwroQQKgANizojfcbiig+QqA3chER9FXBfQgAVAA7Gv1G3erCmi+AmCrCWjnvVdADxIABcDe24ACtKGA5isAtpF39qkCAwX0IAFQANQPVKAFBTRfAbCFtLNLFbheAT1IABQANQQVaEEBzVcAbCHt7FIFBMChHFjX83wQAHueAIbfjgICoADYTubZqwocVUAPcgVQANQNVKAFBTRfAbCFtLNLFXAF0BXA2c33hj9YS0Bw47jjALAFmHTOIWAjS0tLLCxMasMKVYH6FRAAZ/eg+rPCCFUgnwJ6kCuArgDmqzd7UgG/fd84B1bsQaaPCqhAcwoIgALgis3XFcDmCs+WVEDzdQXQKlCBNhXQgwRAAbDNCrTv3iqg+QqAvU1+A++EAnqQACgAdqIUHUTfFNB8BcC+5bzxdksBPUgAFAC7VZOOpicKaL4CYE9S3TA7qoAeJAAKgB0tTodVtwKarwBYd4YbXdcV0IMEQAGw61Xq+KpUQPMVAKtMbIMqRgE9SAAUAIspVwdakwKarwBYUz4bS3kK6EECoABYXt064goU0HwFwArS2BAKVkAPEgAFwIIL2KGXq4DmKwCWm72OvAYF9CABUACsoZKNoTgFNF8BsLikdcBVKaAHCYACYFUlbTClKKD5CoCl5KrjrFMBPUgAFADrrG2j6rgCmq8A2PEUdXiVK6AHCYACYOVFbnjdVEDzFQC7mZmOqi8K6EECoADYl2o3zk4poPkKgJ1KSAfTOwX0oDIB8AzgGcAO4ATgZsC1Q9l7E+CFwBOA2wH70vkfGJPhAmDvyt6Au6CA5isAdiEPHUN/FdCDygTAxwJ3StB37hgAfDPwo8BzgC+mcy8DLhEA+1vsRt4tBTRfAbBbGelo+qaAHlQmAA7y9AHARSMA+MPAp4G7p5W/5XLaFcDlFPL3KjAHBTRfAXAOaWWTKrBiBfSg+gDwecDjgbcBTweuBP4IeNnINvEgSQTAFZeLJ6pAcwpovgJgc9lkSyowuwJ6UH0A+D+AJwIfAwIGTwLeCLwKeIVbwLMXiZ9QgXkooPkKgPPIK9tUgZUqoAfVB4BvAJ4M3Bk4kBLhV4BnAXcVAFdaGp6nAvNVQPMVAOebYbauAtMV0IPqA8CXAM8Fbjk09YvAu4BbTALA7du3s23bNrZu3cri4uKRn9HjhmRZAmLneNwRzLkFmHTOIWAjS0tLLCxMasOyVYE6Fdi9ezfxs2/fPvbu3cuePXsi0I1AFEZfjyOXoazEg/oqkHGrQFMK6EHHKrmuKWFbaGfcTSCnA3+ZKOzLaUzPBGIV8G6TAHAlQCYAtjDDdlmtAn77vn5qV3wdcrXJYGAq0IICelCZK4Cb0qNd7p2u74vnAV6THvnybSCWFf4F+DXgDsAF6RrA3xEAW6gyu1SBMQpovgKghaECbSqgB5UJgHGX7/nAdSPJ80DgI+n6v7gZ5DTg60A8FzCeFzh6fnx8xd++XQFss1TtuzYFNF8BsLacNp6yFNCDygTAJrNMAGxSTdtSgRUqoPkKgCtMFU9Tgbk
ooAcJgALgXErLRlVgugKarwBojahAmwroQQKgANhmBdp3bxXQfAXA3ia/gXdCAT1IABQAO1GKDqJvCmi+AmDfct54u6WAHiQACoDdqklH0xMFNF8BsCepbpgdVUAPEgAFwI4Wp8OqWwHNVwCsO8ONrusK6EECoADY9Sp1fFUqoPkKgFUmtkEVo4AeJAAKgMWUqwOtSQHNVwCsKZ+NpTwF9CABUAAsr24dcQUKaL4CYAVpbAgFK6AHCYACYMEF7NDLVUDzFQDLzV5HXoMCepAAKADWUMnGUJwCmq8AWFzSOuCqFNCDBEABsKqSNphSFNB8BcBSctVx1qmAHiQACoB11rZRdVwBzVcA7HiKOrzKFdCDBEABsPIiN7xuKqD5CoDdzExH1RcF9CABUADsS7UbZ6cU0HwFwE4lpIPpnQJ6kAB4BAA/+tGPcuGFF04tgOuuu45Xv/rVcToQHxt3HAC2TDnnELCRpaUlFhYmtdG7OjTgHiqg+QqAPUx7Q+6QAnqQAHgEAHfuPJNdu/4JuO/E9LzpTf+Q7373GwJghwrYoZSrgOYrAJabvY68BgX0IAFwCABPBs6ZmNcbNuzg8OGLBcAaKt8YWldA8xUAW09CB9BrBfQgAVAA7LUFGHxbCmi+AmBbuWe/KhAK6EECoACoF6hACwpovgJgC2lnlypwvQJ6kAAoAGoIKtCCApqvANhC2tmlCgiAQzmwruf5IAD2PAEMvx0FBEABsJ3Ms1cVOKqAHuQKoACoG6hACwpovgJgC2lnlyrgCqArgMea79HHwHgXsN6gArkUEAAFwFy5Zj8qME4BPcgVQFcA9QYVaEEBzVcAbCHt7FIFXAF0BdAVQH1ABdpUQAAUANvMP/tWAT3IFUBXAPUBFWhBAc1XAGwh7exSBVwBdAXQFUB9QAXaVEAAFADbzD/7VgE9yBVAVwD1ARVoQQHNVwBsIe3sUgVcAXQF0BVAfUAF2lRAABQA28w/+1YBPajMFcAzgGcAO4ATgJsB145J5/j9x4G/AU6bkO6uAOoDKtCCApqvANhC2tmlCrgCWPgK4GOBOyXoO3cCAB4PfBo4CNxCALTqVaBbCgiAAmC3MtLR9E0BPajMFcBBnj4AuGgCAP4ucDXwLeDBAmDfStt4u66A5isAdj1HHV/dCuhBdQJgAF8A4L2A3xAA6y5ioytTAc1XACwzcx11LQroQfUBYFzT9xng0WkL+EUCYC3lahw1KaD5CoA15bOxlKeAHlQfAL41Xfd3dkpHAbC8unTEPVBA8xUAe5DmhthhBfSg+gDwUuAkbojruPT/vwv8IPDFkXw8chfwpk2bufzyE4HTgcX0c+yZGzbs4PDhi+N0ID427jgAbJlyziFgI0tLSywsTGqjwxXj0FRgDQrs3r2b+Nm3bx979+5lz5490dpGIAqjr8cRD9q+fTvbtm1j69atLC4uHvnxUAEVaFYBPehYPdc1K2/W1sbdBHIXYP3QKOJxMT8OnJngL0Bw+PAxMFmnzM5U4KgCfvu+PhOOeJBfCq0MFcirgB5U5grgpvQYmHsDb0zPA7wG+BJweCSF3ALOW1P2pgIrUkDzFQBXlCiepAJzUkAPKhMAHw+cD1w3khcPBD4iAM6pWmxWBRpUQPMVABtMJ5tSgZkV0IPKBMCZJ3rKB9wCblJN21KBFSqg+QqAK0wVT1OBuSigBwmAAuBcSstGVWC6ApqvAGiNqECbCuhBAqAA2GYF2ndvFdB8BcDeJr+Bd0IBPUgAFAA7UYoOom8KaL4CYN9y3ni7pYAeJAAKgN2qSUfTEwU0XwGwJ6lumB1VQA8SAAXAjhanw6pbAc1XAKw7w42u6wroQQKgANj1KnV8VSqg+QqAVSa2QRWjgB4kAAqAxZSrA61JAc1XAKwpn42lPAX0IAFQACyvbh1xBQpovgJgBWlsCAUroAcJgAJgwQXs0MtVQPMVAMvNXkdegwJ6kAAoANZQycZQnAKarwBYXNI64KoU0IMEQAGwqpI2mFIU0HwFwFJy1XHWqYAeJAAKgHXWtlF1XAHNVwDseIo6vMoV0IMEQAGw8iI3vG4qoPkKgN3MTEfVFwX0IAFQAOxLtRtnpxTQfAXATiWkg+mdAnqQACgA9q7sDbgLCmi+AmAX8tAx9FcBPUgAFAD7W/9G3qICmq8A2GL62bUKoAcJgAKgRqACLSig+QqALaSdXarA9QroQQKgAKghqEALCmi+AmALaWeXKiAADuXAup7ngwDY8wQw/HYUEAAFwHYyz15V4KgCepArgAKgbqACLSig+QqALaSdXaqAK4CuAB5rvjt3nsmuXScD50wsjw0bdnD48MXAEhDcOO44AGyZcs4hYCNLS0ssLExqwwpVgfoVEAAFwPqz3Ai7rIAe5AqgK4BdrlDHVq0Cmq8AWG1yG1gRCuhBAqAAWESpOsjaFNB8BcDactp4ylJADxIABcCyatbRVqKA5isAVpLKhlGoAnqQACgAFlq8DrtsBTRfAbDsDHb0pSugBwmAAmDpVez4i1RA8xUAi0xcB12NAnqQACgAVlPOBlKSApqvAFhSvjrW+hTQgwRAAbC+ujaiAhTQfAXAAtLUIVasgB4kAAqAFRe4oXVXAc1XAOxudjqyPiigB5UJgGcAzwB2ACcANwOuTQm7HTgbuP+RJy7DF4BzgQsnJLQA2IdKN8bOKaD5CoCdS0oH1CsF9KAyAfCxwJ0S9AXcDQPgE4AfAv4cOAg8DDgPeDDwkTHZLQD2quQNtisKaL4CYFdy0XH0UwE9qEwAHGTrA4CLRgBwXCa/F/g88FwBsJ+FbtTdU0DzFQC7l5WOqE8K6EH9AMCPAe9JW8Gj+e0KYJ8q3lg7o4DmKwB2JhkdSC8V0IPqB8BHAW8FfhD4F1cAe1nnBt1BBTRfAbCDaemQeqSAHlQ3AN4X+CvgqcA7J+S1K4A9KnhD7Y4Cmq8A2J1sdCR9VEAPqhcA7w28L1339+YpyX0EADdt2szll58InA4spp9jP7Vhww4OH744TgfiY+OOA8CWKeccOnJz8tLSEgsLk9roYykacx8U2L17N/Gzb98+9u7dy549eyLsuFs/CqOvxxEP2r59O9u2bWPr1q0sLi4e+fFQARVoVgE96Fg91zUrb9bWJt0E8iPA+4GXAa9aZkSuAGadMjtTgaMK+O37+kw44kF+KbQyVCCvAnpQmSuAm9JjYGKV743peYDXAF8CTgY+COxKADjIqH+fsMogAOatOXtTAQHw2BwQAK0JFWhBAQGwTAB8PHA+cN1IzjwQiJ8XjsmluBHkSWP+XQBsofDsUgU0X1cArQIVaFMBPahMAGwyZwTAJtW0LRVYoQKarwC4wlTxNBWYiwJ6kAAoAM6ltGxUBaYroPkKgNaICrSpgB4kAAqAbVagffdWAc1XAOxt8ht4JxTQgwRAAbATpegg+qaA5isA9i3njbdbCuhBAqAA2K2adDQ9UUDzFQB7kuqG2VEF9CABUADsaHE6rLoV0HwFwLoz3Oi6roAeJAAKgF2vUsdXpQKarwBYZWIbVDEK6EECoABYTLk60JoU0HwFwJry2V
jKU0APEgAFwPLq1hFXoIDmKwBWkMaGULACepAAKAAWXMAOvVwFNF8BsNzsdeQ1KKAHCYACYA2VbAzFKaD5CoDFJa0DrkoBPUgAFACrKmmDKUUBzVcALCVXHWedCuhBAqAAWGdtG1XHFdB8BcCOp6jDq1wBPUgAFAArL3LD66YCmq8A2M3MdFR9UUAPEgAFwL5Uu3F2SgHNVwDsVEI6mN4poAcJgAJg78regLuggOYrAHYhDx1DfxXQgwRAAbC/9W/kLSqg+QqALaafXasAepAAKABqBCrQggKarwDYQtrZpQpcr4AeJAAKgBqCCrSggOYrALaQdnapAgLgUA6s63k+CIA9TwDDb0cBAVAAbCfz7FUFjiqgB7kCKADqBirQggKarwDYQtrZpQq4AugK4LHmu3PnmezadTJwzsTy2LBhB4cPXwwsAcGN444DwJYp5xwCNrK0tMTCwqQ2rFAVqF8BAVAArD/LjbDLCuhBrgC6AtjlCnVs1Sqg+QqA1Sa3gRWhgB4kAAqARZSqg6xNAc1XAKwtp42nLAX0IAFQACyrZh1tJQpovgJgJalsGIUqoAcJgAJgocXrsMtWQPMVAMvOYEdfugJ6kAAoAJZexY6/SAU0XwGwyMR10NUooAcJgAJgNeVsICUpoPkKgCXlq2OtTwE9SAAUAOurayMqQAHNVwAsIE0dYsUK6H9asBYAACAASURBVEECoABYcYEbWncV0HwFwO5mpyPrgwJ6UJkAeAbwDGAHcAJwM+DaoYS9K/AG4D7AV4GXAudPSGgBsA+VboydU0DzFQA7l5QOqFcK6EFlAuBjgTsl6Dt3BABvClwC/B3wkgSB/xNYBD44JrsFwF6VvMF2RQHNVwDsSi46jn4qoAeVCYCDbH0AcNEIAD4c2AXcBrgynfjWtFL4SAGwn4Vu1N1TQPMVALuXlY6oTwroQfUBYGz3ngYEHA6OxwEvB04SAPtU3sbaZQU0XwGwy/np2OpXQA+qDwDj2r/NwH8cSt+fAd4NrBcA6y9qIyxDAc1XACwjUx1lrQroQQKg1wDWWt3G1WkFNF8BsNMJ6uCqV0APqg8AV7UFvGnTZi6//ETg9HS/SNwzcuyxYcMODh++GFgCghvHHQeALVPOOQRsZGlpiYWFSW1UX3cG2FMFdu/eTfzs27ePvXv3smfPnlBiIxCF0dfjyJfQ7du3s23bNrZu3cri4uKRHw8VUIFmFdCDjtVzXbPyZm1t3E0gD0s3gQTNDW4CuSARmzeBZJ0eO1OByQr47dsVQOtDBdpUQA8qcwVwU3oMzL2BN6bnAV4DfBH4DvB54P8OPQbm9cBPAx8ak2xuAbdZgfbdWwU0XwGwt8lv4J1QQA8qEwAfnx7sfN1IFj0Q+AhwlwSGgwdBx/MAYxVw3CEAdqIUHUTfFNB8BcC+5bzxdksBPahMAGwyiwTAJtW0LRVYoQKarwC4wlTxNBWYiwJ6kAAoAM6ltGxUBaYroPkKgNaICrSpgB4kAAqAbVagffdWAc1XAOxt8ht4JxTQgwRAAbATpegg+qaA5isA9i3njbdbCuhBAqAA2K2adDQ9UUDzFQB7kuqG2VEF9CABUADsaHE6rLoV0HwFwLoz3Oi6roAeJAAKgF2vUsdXpQKarwBYZWIbVDEK6EECoABYTLk60JoU0HwFwJry2VjKU0APEgAFwPLq1hFXoIDmKwBWkMaGULACepAAKAAWXMAOvVwFNF8BsNzsdeQ1KKAHCYACYA2VbAzFKaD5CoDFJa0DrkoBPUgAFACrKmmDKUUBzVcALCVXHWedCuhBAqAAWGdtG1XHFdB8BcCOp6jDq1wBPUgAFAArL3LD66YCmq8A2M3MdFR9UUAPEgAFwL5Uu3F2SgHNVwDsVEI6mN4poAcJgAJg78regLuggOYrAHYhDx1DfxXQgwRAAbC/9W/kLSqg+QqALaafXasAepAAKABqBCrQggKarwDYQtrZpQpcr4AeJAAKgBqCCrSggOYrALaQdnapAgLgUA6s63k+CIA9TwDDb0cBAVAAbCfz7FUFjiqgB7kCKADqBirQggKarwDYQtrZpQq4AugK4LHmu3PnmezadTJwzsTy2LBhB4cPXwwsAcGN444DwJYp5xwCNrK0tMTCwqQ2rFAVqF8BAVAArD/LjbDLCuhBrgC6AtjlCnVs1Sqg+QqA1Sa3gRWhgB4kAAqARZSqg6xNAc1XAKwtp42nLAX0IAFQACyrZh1tJQpovgJgJalsGIUqoAcJgAJgocXrsMtWQPMVAMvOYEdfugJ6kAAoAJZexY6/SAU0XwGwyMR10NUooAcJgAJgNeVsICUpoPkKgCXlq2OtTwE9SAAUAOurayMqQAHNVwAsIE0dYsUK6EECoABYcYEbWncV0HwFwO5mpyPrgwJ6UL0AeCvgVcBPA98DfA54PvCRkcQWAPtQ6cbYOQU0XwGwc0npgHqlgB5ULwC+DbgH8BTg34BfAZ4E3Cm9pmOQ6AJgr0reYLuigOYrAHYlFx1HPxXQg+oFwM8DbwRek1J7A3AFcB/gk0PpLgD2s/aNumUFNF8BsOUUtPueK6AH1QuA5wHbgUcDlwPPAp4L3B24UgDseeUbfusKaL4CYOtJ6AB6rYAeVC8A3gS4EHg4cA3wjXQ94GdHMt4VwF5bgMG3pYDmKwC2lXv2qwKhgB5ULwDG1u/9gecAlwGPA34euGdaERxUgACoF6hACwpovgJgC2lnlypwvQJ6UJ0AGNf7LQEPAD42lO9fSNcE/t7oFvCmTZu5/PITgdOBxfRzbKVs2LCDw4cvTk0HN447DgBbppxzCNjI0tISCwuT2rBCVaBOBXbv3k387Nu3j71797Jnz54IdGN8Ga8z4hVFdeRL6Pbt29m2bRtbt25lcXHxyI+HCqhAswroQcfqua5ZeTvRWgBg/EGJFcBPDI3oH4DXA68dBcCdO89k166TgXMmBiAAdmJuHUQlCvjt2xXASlLZMApVQA+qcwUw0vEi4BbAs4FvAk9M28GxBRwrgYPDLeBCi9dhl62A5isAlp3Bjr50BfSgegHwdsBvAw9OD4K+BHgRsHskaQXA0qvY8RepgOYrABaZuA66GgX0oHoBcKVJKgCuVCnPU4EGFdB8BcAG08mmVGBmBfQgAVAAnLls/IAKrF0BzVcAXHsW2YIKrF4BPUgAFABXXz9+UgVWrYDmKwCuOnn8oAo0oIAeJAAKgA0Ukk2owKwKaL4C4Kw54/kq0KQCepAAKAA2WVG2pQIrVEDzFQBXmCqepgJzUUAPEgAFwLmUlo2qwHQFNF8B0BpRgTYV0IMEQAGwzQq0794qoPkKgL1NfgPvhAJ6kAAoAHaiFB1E3xTQfAXAvuW88XZLAT1IABQAu1WTjqYnCmi+AmBPUt0wO6qAHiQACoAdLU6HVbcCmq8AWHeGG13XFdCDBEABsOtV6viqVEDzFQCrTGyDKkYBPUgAFACLKVcHWpMCmq8AWFM+G0t5CuhBAqAAWF7dOuIKFNB8BcAK0tgQClZADxIABcCCC9ihl6uA5isAlpu9jrwGBfQgAVAArKGSjaE4B
nAucArGwi55tjaMIucc5c7vtyxNZDeVTWRs1Zzz3XNsdVepznnLrTMnZtFmIgAOL9paiPh4v2KzwRiu/e4FNq1wB7gtcAFDYVbc2xtmUWuuWsjvpyxNZTi1TSTu1ZzznXNsdVep7nnLvTMmZtFGIgAON9paivh1gO3TqF9E7h6DmHWHFubZpFj7tqKL1dsc0j3optso1ZzzXXNsdVep23MXWiaKzc7bxoCYJ4pqjnhao6tD2ZR+/zlqfAyeql5rmuOrXYfqn3uOusOAmBnp2ZVA4tr/Z6VHgR9YmrhG0A8CPo1wIdW1Wo3PlRzbKFwzfHVHFs3qqM7o6h5rmuOrXYP6kN8M7uAADizZDN9IKdhxHL6G4B3AB8YeRD0g4CdwC83eB1gzbHlNovcc5czvjZim6lIe3ByrlptY65rjq32Os01d4Ot9Jx/H4uwFQFwftOU2wz3Ai8D3jIhpHhA9G8C2xoIuebY2jCLnHOXO77csTWQ3lU1kbNWc891zbHVXqc55y60zJ2bRZiIADi/acqdcFcCO4BLJoR0CvBp4JYNhFxzbG2YRc65yx1f7tgaSO+qmshZq7nnuubYaq/TnHMXWubOzSJMRACc3zTlTriL0rP/YqUv+h4+AvpiZfC2QGwHr/WoObY2zCLn3OWOL3dsa83t2j6fs1Zzz3XNsdVepznnLrTMnZtF+IgAOL9pyp1wdwXeA9wR+NTIg6BjZfAA8FDgSw2EXHNsbZhFzrnLHV/E9r+BkzLkZQOpXV0TOWu19jzO5a+DJHTumlmwCD31oTHWJgDOz+9zm2FEclPgdODeabUv/u3rwCeB9wLfaSjcmmMbmEVus881d22YYc7YGkrxaprJXas557rm2Nrwodxzl/uLYc74ijAQAXC+01RzwnUhtj8Angd8eQ7T2GZ8JwAPB7YA/wr8BfCthmPMGd9G4O7A59PlCZuBX0hvq9kN/FPDsdncsQrknOvc2rcd2zw9KOeX+tF5q82DIj59aGSWBcDcdnVDf3FdXmzNfqThIUThxou24xVww0e8Gi62hwMomjrimsJYYYzj5sAi8G3gYw0Dy6TrFgMengr8SxpDbJnM67gXcDKwD7i44U52AfHzLuCHgb9O7V8KbAWuAx4y5QafWYfz88BfpTyZ9bOznn8a8JdA5GVchvCz6b+vSjka8T0CiLn0yK/APHwopweFYjl8qAseFLHOy4dq9qDQTR8a4y0CYH7DHfQY7+v9O+AmDQ3hNgki4tlK8cc13vv7a0M3hNwurZQ10V9svQRABBB9Dvi5dJ3XnYHIqYDCAJZ/bii2gNmAoGn5Gr9vIrYY8p8BTwAOAbcC/jwZyFL67wDcgJZ4zV4Tx2XAqcAXgPcB+4GnAd9NMf0u8APAg5voLIFXXIQdcZ2fLpBuqOkbNfPx9CDyeATRk9OKbTyr8jnpzHh0UeTKfeY1ANudqkCTPpTTgyKonD6U24Ny+1DNHhRa6kMCYNY/BbHiNu24Z7ooviloibt841EvvwospGcCBhT9FBDgEgD4lbTttlYh3g1cA7wwvWD7YQkE42HTAWnxbfLfgceutaP0+VgZixXGXxrZ7o1rGuMP2KRH36y2+4jtexPIvg54QILceHTBnYAL09w9fbUdjHwuVmx/FPhH4KvATwOfGTon/tDFI3xiC6OJI/6YPRuIeYuVjYPA24C3NnST0PAYrwDuAcRqZtRErBDHyveedNL3Af835WwTsdnGsQrk9KGcHhRR5vSh3B4U8eX0oZo9KLTUhwTArH8boninHQFKTa5axXVwcd1YgEIc8X7FtwN3Syss8YcgzmkCOP8tvbosVv9ukbZ77w98IvX9Y8CfJlhqSvR4cOiLEnQGrMQxLwAMQLp9AsBYlTsr3WE9iCVWWd/c0EO1o833p0sBXgJ8MM3bm4aE+0Xg3LQd3ISew/HFZQGPSz8BmjGHsXr8x2kFdK39xZeOyMu4M/3W6Q01sZIZccYRuRJ/XO+w1o78/FgFcvpQTg+KYHP7UE4Pivhy+lDNHhRa6kMCYNY/Ef8PeDnwNxN6jT+28WqaJoAsuohvOLGKNHxBfUBfrOrEH9lYjYt3AjfRX3xbjNXGwbV30XesxMUKWRyxShbgFHDY5BFbzAEn0d9T0jVl81gBDOONFdN4j3IYR2xR/v1QIDGOWK1rKr64ride3xdbz3GN4XOBP0lzeZc0d88HXtuQmMN/WIabvG8CwcekLxAbGugvtph/JD2HMi4ViLmL/Ih3Vsc4zkurjmc20JdN3FiBnD6U04Mi0jZ8KJcHDQAwlw/V7EGhpT4kAGb9+xDfqOKPekDguCPAJba+ltuiWemg48aE307vAh7+TKw0xmpVXHwfKzBNAOA/AM8YunYsni8YN2DEtm8ccT1XrCDFH/qmj4gnVuTiJ8wxtheb3gIOMAlYHqwwxgpZbDcNjvul+OLZdk0dcY3fS9P27+BtLbF6EznyqjHzupZ+JwHgoM3Ybo9rHN+5lk7SZ+Mayteki9fj+sa4LjXiie3zmMsPpzuCv9ZAXzZxYwVy+lBOD4pI2/KhHB4U8eX2oVo9KLTUh8a4ozeBzO9PxhnptWuxDTvu2JS2xmKFronjbCBWcAL0xh2x2hirZk0AZzx6JR4oHdfCjTtekeAvti7ndXw/8OPpRoa4WaPJI7aah4/YFg14GRyvBGIlIK55bPqI+TkxzVNscTX17Mbhccb1eHEdXlM3saxGg8HqYqzieMxPgZw+lNODQrG2fWieHhTxteVDffGg0LjXPiQAzs94bVkFVEAFVEAFVEAFOqmAANjJaXFQKqACKqACKqACKjA/BQTA+WlryyqgAiqgAiqgAirQSQUEwE5Oi4NSARVQARVQARVQgfkpIADOT1tbVgEVUAEVUAEVUIFOKiAAdnJaHJQKqIAKqIAKqIAKzE8BAXB+2tqyCqiACqiACqiACnRSgf8P4KSDoeuGYPEAAAAASUVORK5CYII=" /></p> <p>As well as one that is not strongly correlated:</p> <div class="highlight"><pre><code class="language-python" data-lang="python"><span class="n">numeric_df</span><span class="o">.</span><span class="n">hist</span><span class="p">(</span><span class="n">column</span><span class="o">=</span><span class="s">&#39;to_messages&#39;</span><span class="p">,</span><span class="n">by</span><span class="o">=</span><span class="s">&#39;poi&#39;</span><span class="p">,</span><span class="n">bins</span><span class="o">=</span><span class="mi">25</span><span 
class="p">,</span><span class="n">sharex</span><span class="o">=</span><span class="bp">True</span><span class="p">,</span><span class="n">sharey</span><span class="o">=</span><span class="bp">True</span><span class="p">)</span> <span class="n">plt</span><span class="o">.</span><span class="n">suptitle</span><span class="p">(</span><span class="s">&quot;to_messages by POI&quot;</span><span class="p">)</span></code></pre></div> <p><img src="data:image/png;base64,iVBORw0KGgoAAAANSUhEUgAAAoAAAAHgCAYAAAA10dzkAAAgAElEQVR4Xu2dC7hdV1muv7Q0hGoTiFQrITXNAdQeIUCDgmKhCISLVFpUUBSQSqsgqFWBU1EslwJeKggiLaVcKlL0UeRqA0ILAgoRMBwseoQ0sBsoBSo71RoCtOf5krHSmZW19pp77TnHHJd3Pk+eNnvPNcb43/n/X741xhxzrhIHBCAAAQhAAAIQgEBVBFZVFS3BQgACEIAABCAAAQgIA0gSQAACEIAABCAAgcoIYAAru+CECwEIQAACEIAABDCA5AAEIAABCEAAAhCojAAGsLILTrgQgAAEIAABCEAAA0gOQAACEIAABCAAgcoIYAAru+CECwEIQAACEIAABDCA5AAEIAABCEAAAhCojAAGsLILTrgQgAAEIAABCEAAA0gOQAACEOiXwGslHS3pCf12Q+sQgAAE2hPAALZnxZkQKInAP0h6j6TnlRRUorH0YQBvbsT6VUn/JOlcSf/R+Pk2Sf9H0inhZ/8s6YJw3UenPVfSgyX9aKLsGBYEINATAQxgT2BpFgKJE8AAxrtAfRnAn5T0QUl3kvQKSesl/YAkm8Ofl/QaSS+U9Kbws5+V9JzwuzeH8G0Af0zSqfFw0BMEIJACAQxgCleBMUAgLgEbkidKukU68DrI3ZI2hyH8lqSnS/pOSZ+Q9AxJO1oM78pw3u0lPU6SZ6V+SdKnJV0q6b6S/kXSz0haaLTn9n9V0gmSPiXpmZLeH35/kqRXSvrh8PerJT1e0i5J95L0J+G/+yXtlPQTkvZKepSk8yT9b0k3SvpbSb8p6X8a/f6+pKdI+rqkl0g6c2xG9I6SXibpEZK+KWl7YHFDaMNx/K6kTZK+JumtId5JqMz7GEnXSfrFMA7PxL28EZdjeVXjw2dJ+u3GdRlv1ybPM3fvC78w3w9J+n5JXwiM/yxwaH7WsXoMGyXdJAkD2CK5OQUCJRLAAJZ4VYkJAksTWCvp3ZI+IOkPJX0rGDbPEF0cDMLHJf1GMEY2Yv81A6oN4D0l/Y6kv5P0a5IeG0zfSyV9RpKNkM3JT4W2nhyWKJ8ali4fGcyYTYxN4jslXR9msfyR+0j6x2BY/zWYroskHSvpRyT9ZTCAnhmz2bOhtNHxOW+X9OzQrw3QH0j6BUn/FpbBHyrpwsaS+FWSPi/pxeEzNk6+j8+G0Gb1mjCT9hFJx4dl1ldPYeS4HyPpDcG0PiDM2D0kXINnSTo9xDBqwjw9hvOntDluAG2IvcS7RdL3BhbfHfg1m/DYfQ1seG2MMYCoBQQqJYABrPTCE3b1BCYtAdtc2XT4vjEfNjw2Oi+S5NmkpQ4blv+W9OPhpO+S9MVgIv84/MyG0EuVNkw+PhtmG9/VaNgzbZ4B9AzZJ4MhfOOEjheDGfOs16zD/Xop9C7hxI8Gk2rz42NdMEU2eb4n0suhl0u6c1g69TleZr02/Mwm6r2SNoRZtFn92wB6mdWzhaN79y6TdFtJPx3a9iysja+ZfE/4793CbOek9psG0ON3Hz8oyWb914PZ9ZLwpMMzli8I5h8DOOvq8XsIFEoAA1johSUsCMwgMMkAennTs3KeGRodfxNmwjyjt9RhA+iNCCPz6HNtUh4elk/99wcG43SbMGvn5VmbxuaxWpLNkWfpflmSzaNNqTes+L41mzAfNogekw3j6HdedvZhI2XD580Pd5Dk/mxmbbh8/KekJ4UZxFHfXkL+62AAPSPpJdnmkrHP80yjZ+9slB2vZ9psXj3j+RZJ35gCyObM43h04/deZneMnrHz4TY8m/h7YenX3O6/BHCz9fj8X4/LM5luz2PzMrpnO6cZQMdvPp79xQDOSGx+DYFSCWAAS72yxAWBpQn0YQDdpu+LGx3jy5Q2T75nzffD+R473xPnpU/f29c8bAy/HH7gWTPf0+fzfJ/bw8K9bv71yWHG8QxJd5X0Q2HmzDthbehsHr2E7H69DGwT2MYA2kB5edjLveMauSfcN+ife6bQ4/EM41fCvYq+X3D8sAH0vZEe5+gYN4DNWUrfN/lHki6ZYQC90ePDYfne9z6ODi83ezncs5ZfGmtjtATssfi+RQwgSgGBSglgACu98IRdPQEbMRu20TKogXj2yMuvo3vlRkvAnm1rblCYBM8zYssxgDaHvhfNM1F/2vJq+J5AL0n/yoTzfb/f68JSqM2j70f0ErIP35fombWRAZy1BOzHp3jm00uxNnazDi9p22h5Cdb34U0ygA8KS8DeeOPD9wOuCUvA/rtnJ83DrJ8f7jNsmrrxNsfNdfP3x4V7KH3NRtdy9HtvfvEGEzaBzLqq/B4ChRPAABZ+gQkPAlMIeFbKM0Se6fJuUN8X5t273gRyjqTRJhAvW3qHcJtNIMs1gDZyvufOz6/zhhQvWfpeOS+FetnXmzJs+ryB5ERJfxHuCfTMmO/X+6uwPO1Hn/hRJ94h7CVZG0Cf68/blHmp07GODKANkH822gRic+iZPM+62XxZF22GvaTrJW3P+vn+QW8uMRu3eVrYSONlc29qcRw2jKOZyyZ2s/ami9eHeyD9zD2bXm88cdyjwzuePTYvJ/taLHUsZQD9Oe/y9qYUG0qzsPH0Jh/vjvYDqXkMDNIAgcoJYAArTwDCr5aAH5FiQ2Lz5Jmn0WNgvPPXj2XxrJYfA+Olyo+1oOQZRT+TrrkE7N3F3uk6elRJcwl4tBnC963ZALp/38Nn42XT5WVcPybF98LZvHkmzptBbGB8T59n0Lzz1+O0QbN5suHz4Rk838PnTRxu78/DM/FGBtC65929Z0vyI2Q8K+bHunjZ1P/vw0u23inspedvl/Q5SW8LM2rfFx4R4523vv/OS7Z+vp7vR5x02AB6zDaHvsdyX5j5HD0GZvQZL2F7SddL3s2NMZPaHGc76ZzxB0H7OtrgegPL6GAJuEVycwoESiSAASzxqhITBCCwHAI2cTbBnn3zRpChDs8SembQu4ubb/oYajz0CwEIFEwAA1jwxSU0CEBgIgE/B9EbKLx7+HZh9s4bOjwLOb4rOQZC73z2M/s8A+nnM/qeRQ4IQAACvRLAAPaKl8YhUAwBL0lOe
vC2skvSQspT8pGLK+DKBnEj1z653aj5P0EElPDDPUXi57dTDyj+j6Ilbe3tAaZPzo0PxJGFODPMqSdQgNwgDOX4l8sjoCMd8y4Hv9vBnCS+njh5e+Pevp5x52aQBH/fjmaBswP9bH/d+zhxljj9tLvD8n6bNhqfeYsJnHm5e86cRvchk9rqi6ZCNgCEwgEFOD3H3JOoQGYQARGQgsi0Cstwz4Pk0bPS/LTjq8POz7Aru6J2+8D+8y9v2H3vTh2bkuX6nX7MvPjPQsn5/b6JljL9n/U3h+o28f4IAABA4nEEuD3GsNOoQGNfKLTSDIDQTmI+D7Hn0v3bQdtfO1OvlTXur2hqIYfXkEsfuLybLL60JbEBiSQOy6ia0LMfuLzXLIvDnUNwYwicvAIDIk4G/LfmRKH8uy4zhi9jWaCYgV2xD9ZZhuDBkCRxBAF7pLitgsuxv5ClrCAK4AHh8tmsCszTm+V87PJezCAMbsyxet9P6KTkyCq4ZA6XUaM76YfWWToBjAbC4VA41MYNY9aa6drl6vF7MvYyy9v8ipQncQ6IVA6XUaM76YffWSDH00igHsgyptlkDga5JeFDYpTIrHO/T8KJ8uZgBj9uVYSu+vhPwjBgiUXqcx44vZVzaZiwHM5lIx0MgE/Lo7P7jYJnDS4XtG/P7hWUsLbYYdsy+Pp/T+2jDnHAikTqD0Oo0ZX8y+Us+rQ+PDAGZzqRhoZAJnSPLOML+/edJxh/B+Yr/ab6VHzL481tL7W+n14PMQSIFA6XUaM76YfaWQO63GgAFshYmTIAABCEAAAhCAQDkEMIDlXEsigQAEIAABCEAAAq0IYABbYeIkCEAAAhCAAAQgUA4BDGA515JIIAABCEAAAhCAQCsCGMBWmDgJAhCAAAQgAAEIlEMAA1jOtSQSCEAAAhCAAAQg0IoABrAVJk6CAAQgAAEIQAAC5RDAAJZzLYkEAhCAAAQgAAEItCKAAWyFiZMgAAEIQAACEIBAOQQwgOVcSyKBAAQgAAEIQAACrQhgAFth4iQIQAACEIAABCBQDgEMYDnXkkggAAEIQAACEIBAKwIYwFaYOAkCEIAABCAAAQiUQwADWM61JBIIQAACEIAABCDQigAGsBUmToIABCAAAQhAAALlEMAAlnMtiQQCEIAABCAAAQi0IoABbIWJkyAAAQhAAAIQgEA5BDCA5VxLIoEABCAAAQhAAAKtCGAAW2HiJAhAAAIQgAAEIFAOAQxgOdeSSCAAAQhAAAIQgEArAhjAVpg4CQIQgAAEIAABCJRDAANYzrUkEghAAAIQgAAEINCKAAawFSZOggAEIAABCEAAAuUQwACWcy2JBAIQgAAEIAABCLQi8P8BPqTNwKKzU8EAAAAASUVORK5CYII=" /></p> <p>The data and plots above indicated that the exercised_stock_options, total_stock_value, and restricted_stock, and to a lesser extent to payment related information (total_payments, salary, bonus, and expenses), are all correlated to Persons of Interest. Therefore, I created new features as sums and ratios of these ones. Working with Pandas made this incredibely easy due to vectorized operations, and though Numpy could similarly make this easy I think Pandas’ Dataframe construct makes it especially easy.</p> <p>It was also easy to fix any problems with the data before starting to train machine learning models. In order to use the data for evaluation and training, I replaced null values with the mean of each feature so as to be able to use the dataset with Scikit-learn. 
I also scaled all features to a range of 0 to 1, to work better with Support Vector Machines:</p> <div class="highlight"><pre><code class="language-python" data-lang="python"># Get rid of the label
del numeric_df['poi']
poi = df['poi']

# Create new features
numeric_df['stock_sum'] = numeric_df['exercised_stock_options'] +\
                          numeric_df['total_stock_value'] +\
                          numeric_df['restricted_stock']
numeric_df['stock_ratio'] = numeric_df['exercised_stock_options']/numeric_df['total_stock_value']
numeric_df['money_total'] = numeric_df['salary'] +\
                            numeric_df['bonus'] -\
                            numeric_df['expenses']
numeric_df['money_ratio'] = numeric_df['bonus']/numeric_df['salary']
numeric_df['email_ratio'] = numeric_df['from_messages']/(numeric_df['to_messages']+numeric_df['from_messages'])
numeric_df['poi_email_ratio_from'] = numeric_df['from_poi_to_this_person']/numeric_df['to_messages']
numeric_df['poi_email_ratio_to'] = numeric_df['from_this_person_to_poi']/numeric_df['from_messages']

# Fill in NA values with the mean of each feature
numeric_df = numeric_df.fillna(numeric_df.mean())

# Scale all features to the 0-1 range
numeric_df = (numeric_df-numeric_df.min())/(numeric_df.max()-numeric_df.min())</code></pre></div> <p>Then, I scored features using Scikit-learn’s SelectKBest to get an ordering of them to test with multiple algorithms afterward. Pandas DataFrames can be used directly with Scikit-learn, which is another of its great benefits:</p> <div class="highlight"><pre><code class="language-python" data-lang="python">from sklearn.feature_selection import SelectKBest

selector = SelectKBest()
selector.fit(numeric_df, poi.tolist())
scores = {numeric_df.columns[i]: selector.scores_[i] for i in range(len(numeric_df.columns))}
sorted_features = sorted(scores, key=scores.get, reverse=True)
for feature in sorted_features:
    print('Feature %s has value %f'%(feature, scores[feature]))</code></pre></div> <pre><code>Feature exercised_stock_options has value 30.528310
Feature total_stock_value has value 22.901164
Feature stock_sum has value 16.090150
Feature salary has value 14.428640
Feature poi_email_ratio_to has value 13.619580
Feature bonus has value 11.771121
Feature money_total has value 11.005135
Feature deferred_income has value 9.058555
Feature total_payments has value 8.334006
Feature restricted_stock has value 7.335986
Feature long_term_incentive has value 6.448285
Feature shared_receipt_with_poi has value 6.340473
Feature other has value 4.067974
Feature money_ratio has value 3.781568
Feature from_poi_to_this_person has value 3.626045
Feature email_ratio has value 2.176411
Feature from_this_person_to_poi has value 1.318493
Feature poi_email_ratio_from has value 1.279491
Feature from_messages has value 0.613342
Feature expenses has value 0.543049
Feature to_messages has value 0.400295
Feature deferral_payments has value 0.223368
Feature stock_ratio has value 0.013109
</code></pre> <p>It appeared that several of my new features are among the most useful: ‘poi_email_ratio_to’, ‘stock_sum’, and ‘money_total’ are all ranked highly. But since the data is so small, I had no need to get rid of any features, and went ahead with testing several classifiers on several sets of features.</p> <h1 id="training-and-evaluating-models">Training and Evaluating Models</h1> <p>Proceeding with the project, I selected three algorithms to test and compare: Naive Bayes, Decision Trees, and Support Vector Machines. Naive Bayes is a good baseline for any ML task, and the other two fit the task of binary classification with many features well and can both be tuned automatically using sklearn classes. A word on SkLearn: it is simply a very well designed Machine Learning toolkit, with great compatibility with Numpy (and therefore also Pandas) and an elegant, smart API structure that makes trying out different models, evaluating features, and just about anything else one might want short of Deep Learning easy.</p> <p>I think the code that follows will attest to that. I tested those three algorithms with a variable number of features, from one up to all of them, ordered by the SelectKBest scoring. Because the data is so small, I could afford an extensive validation scheme and did multiple random splits of the data into training and testing sets to get an average that best indicates the strength of each algorithm. I also evaluated precision and recall in addition to accuracy, since those were to be the metrics of performance.
And all it took to do all that is maybe 50 lines of code:</p> <div class="highlight"><pre><code class="language-python" data-lang="python"><span class="kn">from</span> <span class="nn">sklearn.naive_bayes</span> <span class="kn">import</span> <span class="n">GaussianNB</span> <span class="kn">from</span> <span class="nn">sklearn.svm</span> <span class="kn">import</span> <span class="n">SVC</span> <span class="kn">from</span> <span class="nn">sklearn.grid_search</span> <span class="kn">import</span> <span class="n">RandomizedSearchCV</span><span class="p">,</span> <span class="n">GridSearchCV</span> <span class="kn">from</span> <span class="nn">sklearn.tree</span> <span class="kn">import</span> <span class="n">DecisionTreeClassifier</span> <span class="kn">from</span> <span class="nn">sklearn.metrics</span> <span class="kn">import</span> <span class="n">precision_score</span><span class="p">,</span> <span class="n">recall_score</span><span class="p">,</span> <span class="n">accuracy_score</span> <span class="kn">from</span> <span class="nn">sklearn.cross_validation</span> <span class="kn">import</span> <span class="n">StratifiedShuffleSplit</span> <span class="kn">import</span> <span class="nn">scipy</span> <span class="kn">import</span> <span class="nn">warnings</span> <span class="n">warnings</span><span class="o">.</span><span class="n">filterwarnings</span><span class="p">(</span><span class="s">&#39;ignore&#39;</span><span class="p">)</span> <span class="n">gnb_clf</span> <span class="o">=</span> <span class="n">GridSearchCV</span><span class="p">(</span><span class="n">GaussianNB</span><span class="p">(),{})</span> <span class="c">#No params to tune for for linear bayes, use for convenience</span> <span class="n">svc_clf</span> <span class="o">=</span> <span class="n">SVC</span><span class="p">()</span> <span class="n">svc_search_params</span> <span class="o">=</span> <span class="p">{</span><span class="s">&#39;C&#39;</span><span class="p">:</span> <span class="n">scipy</span><span class="o">.</span><span class="n">stats</span><span class="o">.</span><span class="n">expon</span><span class="p">(</span><span class="n">scale</span><span class="o">=</span><span class="mi">1</span><span class="p">),</span> <span class="s">&#39;gamma&#39;</span><span class="p">:</span> <span class="n">scipy</span><span class="o">.</span><span class="n">stats</span><span class="o">.</span><span class="n">expon</span><span class="p">(</span><span class="n">scale</span><span class="o">=.</span><span class="mi">1</span><span class="p">),</span> <span class="s">&#39;kernel&#39;</span><span class="p">:</span> <span class="p">[</span><span class="s">&#39;linear&#39;</span><span class="p">,</span><span class="s">&#39;poly&#39;</span><span class="p">,</span><span class="s">&#39;rbf&#39;</span><span class="p">],</span> <span class="s">&#39;class_weight&#39;</span><span class="p">:[</span><span class="s">&#39;balanced&#39;</span><span class="p">,</span><span class="bp">None</span><span class="p">]}</span> <span class="n">svc_search</span> <span class="o">=</span> <span class="n">RandomizedSearchCV</span><span class="p">(</span><span class="n">svc_clf</span><span class="p">,</span> <span class="n">param_distributions</span><span class="o">=</span><span class="n">svc_search_params</span><span class="p">,</span> <span class="n">n_iter</span><span class="o">=</span><span class="mi">25</span><span class="p">)</span> <span class="n">tree_clf</span> <span class="o">=</span> <span 
class="n">DecisionTreeClassifier</span><span class="p">()</span> <span class="n">tree_search_params</span> <span class="o">=</span> <span class="p">{</span><span class="s">&#39;criterion&#39;</span><span class="p">:[</span><span class="s">&#39;gini&#39;</span><span class="p">,</span><span class="s">&#39;entropy&#39;</span><span class="p">],</span> <span class="s">&#39;max_leaf_nodes&#39;</span><span class="p">:[</span><span class="bp">None</span><span class="p">,</span><span class="mi">25</span><span class="p">,</span><span class="mi">50</span><span class="p">,</span><span class="mi">100</span><span class="p">,</span><span class="mi">1000</span><span class="p">],</span> <span class="s">&#39;min_samples_split&#39;</span><span class="p">:[</span><span class="mi">2</span><span class="p">,</span><span class="mi">3</span><span class="p">,</span><span class="mi">4</span><span class="p">],</span> <span class="s">&#39;max_features&#39;</span><span class="p">:[</span><span class="mf">0.25</span><span class="p">,</span><span class="mf">0.5</span><span class="p">,</span><span class="mf">0.75</span><span class="p">,</span><span class="mf">1.0</span><span class="p">]}</span> <span class="n">tree_search</span> <span class="o">=</span> <span class="n">GridSearchCV</span><span class="p">(</span><span class="n">tree_clf</span><span class="p">,</span> <span class="n">tree_search_params</span><span class="p">,</span> <span class="n">scoring</span><span class="o">=</span><span class="s">&#39;recall&#39;</span><span class="p">)</span> <span class="n">search_methods</span> <span class="o">=</span> <span class="p">[</span><span class="n">gnb_clf</span><span class="p">,</span><span class="n">svc_search</span><span class="p">,</span><span class="n">tree_search</span><span class="p">]</span> <span class="n">average_accuracies</span> <span class="o">=</span> <span class="p">[[</span><span class="mi">0</span><span class="p">],[</span><span class="mi">0</span><span class="p">],[</span><span class="mi">0</span><span class="p">]]</span> <span class="n">average_precision</span> <span class="o">=</span> <span class="p">[[</span><span class="mi">0</span><span class="p">],[</span><span class="mi">0</span><span class="p">],[</span><span class="mi">0</span><span class="p">]]</span> <span class="n">average_recall</span> <span class="o">=</span> <span class="p">[[</span><span class="mi">0</span><span class="p">],[</span><span class="mi">0</span><span class="p">],[</span><span class="mi">0</span><span class="p">]]</span> <span class="n">num_splits</span> <span class="o">=</span> <span class="mi">10</span> <span class="n">train_split</span> <span class="o">=</span> <span class="mf">0.9</span> <span class="n">indices</span> <span class="o">=</span> <span class="nb">list</span><span class="p">(</span><span class="n">StratifiedShuffleSplit</span><span class="p">(</span><span class="n">poi</span><span class="o">.</span><span class="n">tolist</span><span class="p">(),</span> <span class="n">num_splits</span><span class="p">,</span> <span class="n">test_size</span><span class="o">=</span><span class="mi">1</span><span class="o">-</span><span class="n">train_split</span><span class="p">,</span> <span class="n">random_state</span><span class="o">=</span><span class="mi">0</span><span class="p">))</span> <span class="n">best_features</span> <span class="o">=</span> <span class="bp">None</span> <span class="n">max_score</span> <span class="o">=</span> <span class="mi">0</span> <span class="n">best_classifier</span> <span class="o">=</span> 
best_features = None
max_score = 0
best_classifier = None
num_features = 0

for num_features in range(1, len(sorted_features) + 1):
    features = sorted_features[:num_features]
    feature_df = numeric_df[features]
    for classifier_idx in range(3):
        sum_values = [0, 0, 0]
        # Only do parameter search once, too wasteful to do a ton
        search_methods[classifier_idx].fit(feature_df.iloc[indices[0][0], :],
                                           poi[indices[0][0]].tolist())
        classifier = search_methods[classifier_idx].best_estimator_
        for split_idx in range(num_splits):
            train_indices, test_indices = indices[split_idx]
            train_data = (feature_df.iloc[train_indices, :], poi[train_indices].tolist())
            test_data = (feature_df.iloc[test_indices, :], poi[test_indices].tolist())
            classifier.fit(train_data[0], train_data[1])
            predicted = classifier.predict(test_data[0])
            sum_values[0] += accuracy_score(predicted, test_data[1])
            sum_values[1] += precision_score(predicted, test_data[1])
            sum_values[2] += recall_score(predicted, test_data[1])
        avg_acc, avg_prs, avg_recall = [val / num_splits for val in sum_values]
        average_accuracies[classifier_idx].append(avg_acc)
        average_precision[classifier_idx].append(avg_prs)
        average_recall[classifier_idx].append(avg_recall)
        score = (avg_prs + avg_recall) / 2
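        # Selection rule: keep whichever classifier/feature-count pair has the
        # best average of precision and recall, and only consider a candidate
        # at all if both stay above 0.3, since with POIs being rare a high
        # accuracy can be had by always predicting "not POI".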
        if score > max_score and avg_prs > 0.3 and avg_recall > 0.3:
            max_score = score
            best_features = features
            best_classifier = search_methods[classifier_idx].best_estimator_

print('Best classifier found is %s \n\
 with score (recall+precision)/2 of %f\n\
 and feature set %s' % (str(best_classifier), max_score, best_features))</code></pre></div> 

<pre><code>Best classifier found is DecisionTreeClassifier(class_weight=None, criterion='gini',
            max_depth=None, max_features=0.25, max_leaf_nodes=25,
            min_samples_leaf=1, min_samples_split=2,
            min_weight_fraction_leaf=0.0, presort=False, random_state=None,
            splitter='best')
 with score (recall+precision)/2 of 0.370000
 and feature set ['exercised_stock_options', 'total_stock_value', 'stock_sum', 'salary', 'poi_email_ratio_to', 'bonus']
</code></pre> 
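<p>One quick sanity check on a result like this is to refit the winning estimator on each stratified split and look at the full per-class report rather than a single blended number. This is only a minimal sketch: it reuses the <code>numeric_df</code>, <code>poi</code>, <code>indices</code>, <code>best_features</code> and <code>best_classifier</code> objects defined above, and the per-split report is an extra check rather than part of the original pipeline.</p> 

<div class="highlight"><pre><code class="language-python" data-lang="python">from sklearn.metrics import classification_report

best_df = numeric_df[best_features]
for split_idx, (train_indices, test_indices) in enumerate(indices):
    # Refit the tuned estimator on this split's training fold...
    best_classifier.fit(best_df.iloc[train_indices, :], poi[train_indices].tolist())
    # ...then report per-class precision and recall on the held-out fold.
    predicted = best_classifier.predict(best_df.iloc[test_indices, :])
    print('Split %d' % split_idx)
    print(classification_report(poi[test_indices].tolist(), predicted))</code></pre></div> 

<p>If precision or recall collapses on several of the splits, the 0.37 blended score is being carried by a few lucky folds rather than by consistent performance.</p> 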
<p>Then, I could go right back to Pandas to plot the results. Sure, I could do this with matplotlib just as well, but the flexibility and simplicity of the 'plot' function call on a DataFrame is more elegant in my opinion.</p> 

<div class="highlight"><pre><code class="language-python" data-lang="python">results = pd.DataFrame.from_dict({'Naive Bayes': average_accuracies[0],
                                  'SVC': average_accuracies[1],
                                  'Decision Tree': average_accuracies[2]})
results.plot(xlim=(1, len(sorted_features) - 1), ylim=(0, 1))
plt.suptitle("Classifier accuracy by # of features")</code></pre></div> 

<p><img src="" alt="Classifier accuracy by # of features (inline base64 plot data omitted)" /></p> 

<div class="highlight"><pre><code class="language-python" data-lang="python">results = pd.DataFrame.from_dict({'Naive Bayes': average_precision[0],
                                  'SVC': average_precision[1],
                                  'Decision Tree': average_precision[2]})
results.plot(xlim=(1, len(sorted_features) - 1), ylim=(0, 1))
plt.suptitle("Classifier precision by # of features")</code></pre></div> 

<p><img 
alt="Classifier precision by # of features (inline base64 plot data truncated)" src="data:image/png;base64,
c/hDB0w5AKsAlADwWzgBuHv3buTPTy2oJgIiIAIiIAIikOoE5AHM+B5A2mB7p/TL7U4ZGJZ4yQqgFoCzAMx1SsQscwx2CIA6Tj++FbzmvB00DGPQCgGn+qdc4xMBERABERCBEAISgJlDAPK29wDQFQAFG0O8nRxPXykAzAimGFzo2AdDviz8zISQwwA+BsCMYCaChDYJQH2tiIAIiIAIiIDPCEgAZh4BmCjTlABMFFmdVwREQAREQAQSREACUALQ1LQkAE0Jqr8IiIAIiIAIeExAAlAC0NTkJABNCaq/CIiACIiACHhMQAJQAtDU5CQATQmqvwiIgAiIgAh4TEACUALQ1OQkAE0Jqr8IiIAIiIAIeExAAlAC0NTkJABNCaq/CIiACIiACHhMQAJQAtDU5CQATQmqvwiIgAiIgAh4TEACUALQ1OQkAE0Jqr8IiIAIiIAIeExAAlAC0NTkJABNCaq/CIiACIiACHhMQAJQAtDU5CQATQmqvwiIgAiIgAh4TEACUALQ1OQkAE0Jqr8IiIAIiIAIeExAAlAC0NTkJABNCaq/CIiACIiACHhMQAJQAtDU5CQATQmqvwiIgAiIgAh4TEACUALQ1OQkAE0Jqr8IiIAIiIAIeExAAlAC0NTkJABNCaq/CIiACIiACHhMQAJQAtDU5CQATQmqvwiIgAiIgAh4TEACUALQ1OQkAE0Jqr8IiIAIiIAIeExAAlAC0NTkJABNCaq/CIiACIiACHhMQAJQAtDU5CQATQmqvwiIgAiIgAh4TEACUALQ1OQkAE0Jqr8IiIAIiIAIeExAAlAC0NTkJABNCaq/CIiACIiACHhMQAJQAtDU5CQATQmqvwiIgAiIgAh4TEACUALQ1OQkAE0Jqr8IiIAIiIAIeExAAlAC0NTkJABNCaq/CIiACIiACHhMQAJQAtDU5CQATQmqvwiIgAiIgAh4TEACUALQ1OQkAE0Jqr8IiIAIiIAIeExAAlAC0NTkJABNCaq/CIiACIiACHhMQAJQAtDU5CQATQmqvwiIgAiIgAh4TEACUALQ1OQkAE0Jqr8IiIAIiIAIeExAAlAC0NTkJABNCaq/CIiACIiACHhMQAJQAtDU5CQATQmqvwiIgAiIgAh4TEACUALQ1OQkAE0Jqr8IiIAIiIAIeExAAlAC0NTkJABNCaq/CIiACIiACHhMQAJQAtDU5CQATQmqvwiIgAiIgAh4TEACUALQ1OQkAE0Jqr8IiIAIiIAIeExAAlAC0NTkJABNCaq/CIiACIiACHhMQAJQAtDU5CQATQmqvwiIgAiIgAh4TEACUALQ1OQkAE0Jqr8IiIAIiIAIeExAAlAC0NTkJABNCaq/CIiACIiACHhMQAJQAtDU5CQATQmqvwiIgAiIgAh4TEACUALQ1OQkAE0Jqr8IiIAIiIAIeExAAlAC0NTkJABNCaq/CIiACIiACHhMQAJQAtDU5CQATQmqvwiIgAiIgAh4TEACUALQ1OQkAE0Jqr8IiIAIiIAIeExAAlAC0NTkJABNCaq/CIiACIiACHhMQAJQAtDU5CQATQmqvwiIgAiIgAh4TEACUALQ1OQkAE0Jqr8IiIAIiIAIeExAAlAC0NTkJABNCaq/CIiACIiACHhMQAJQAtDU5CQATQmqvwiIgAiIgAh4TEACUALQ1OQkAE0Jqr8IiIAIiIAIeExAAlAC0NTkJABNCaq/CIiACIiACHhMQAJQAtDU5CQATQmqvwiIgAiIgAh4TEACUALQ1OQkAE0Jqr8IiIAIiIAIeExAAlAC0NTkJABNCaq/CIiACIiACHhMQAJQAtDU5CQATQmqvwiIgAiIgAh4TEACUALQ1OQkAE0Jqr8IiIAIiIAIeExAAlAC0NTkJABNCaq/CIiACIiACHhMQAJQAtDU5CQATQmqvwiIgAiIgAh4TEACUALQ1OQkAE0Jqr8IiIAIiIAIeExAAlAC0NTkJABNCaq/CIiACIiACHhMQAJQAtDU5CQATQmqvwiIgAiIgAh4TEACUALQ1OQkAE0Jqr8IiIAIiIAIeExAAlAC0NTkJABNCaq/CIiACIiACHhMQAJQAtDU5CQATQmqvwiIgAiIgAh4TEACUALQ1OQkAE0Jqr8IiIAIiIAIeExAAlAC0NTkJABNCaq/CIiACIiACHhMQAIw8wjAngC6ADgNwBwAHQFsd7G3fAC+BVAcQA4AR8McLwHo8YdWlxMBERABERABUwISgJlDAHYAMAhAOwAbnX9nAXCtiwG9A+BMAPUkAE0/auovAiIgAiIgAqlDQAIwcwjA5QBmAHjKMb3SADYAqARgdRrm2BTAIwB6OR5DeQBT53OrkYiACIiACIiAEQEJwIwvAHMC+AfA9QDmB1nLTwBeADAsjAUVAbDU6VMMwDx5AI0+Z+osAiIgAiIgAilFQAIw4wtACrgtAMoDWBNkfV8BmAagTxiLnArgMwADAFwjAZhSn1kNRgREQAREQASMCUgASgCGCkCuF7wDwNUAjjnrBOcCoCfxSBiLUxKI8ccw9U5w9Chw6BCQO3fqjU0jEgEREAERMCcgAZjxBWC0IeCRAG4NMi0mi2QFcBjAfQCGh5idJQA7d+6MnDl5KaBevXrWHzX/EujZE1i5Epg1y79z0MhFQAREQAROJDB79mzwD9uhQ4cwePBg/rMAgD2ZkRUFTkZv0SSBMGRcMAhINQBvA6gIYDPFXjgBuHv3buTPTy2o5ncC+/cDZ58N7N4NfPcdUKaM32ek8YuACIiACIQSkAcw43sAec/bO6VfbnfKwPR3vHq1AJwFgCFelohZFuYjojWAmex74913gZdfBipUAIoUAQZwJaiaCIiACIhAhiIgAZg5BCCNtgeArgDopvsUQCcAOwCUAsCMYIrBhRKAGerzHdNkrrgCuPVWoGJFoFEjYMsW4JRTYjqVOomACIiACKQoAQnAzCMAE2WCSgJJFNkknHfFCqBmTeC33wBG9MuXBx5+GOjA1CA1ERABERCBDENAAlAC0NSYJQBNCaZQ/47cIBDA0KH231wfzJDwUlaFVBMBERABEcgwBCQAJQBNjVkC0JRgivRn0sdZZwGLFgGVK9uD2rPH/tmCBUDVqikyUA1DBERABETAmIAEoASgqRFJAJoSTJH+r78OfPABsGTJiQPq1Ak4cgQYHloAKEXGrWGIgAiIgAhET0ACUAIweqs5sYcEoCnBFOh/7BhQrhzw6KPA7cwVD2qsB1ijhp0MctppKTBYDUEEREAERMCYgASgBKCpEUkAmhJMgf4LFwKNG9siL0+ekwd05ZVAmzZAly4pMFgNQQREQAREwJiABKAEoKkRSQCaEkyB/rfcYq/1e/XV8IN57z2gXz9gzRogS2YonZ4C90RDEAEREIFEEpAAlAA0tS8JQFOCSe6/YwdQogTw7bfABReEH8yBA/buIJMmAdewNLiaCIiACIiArwlIAEoAmhqwBKApwST3p2dv7lzgU5YHT6d17w78+iswZkySB6zLi4AIiIAIGBOQAJQANDUiCUBTgknsz+ze88+3Q79Nm6Y/kB9/tBNFfvnF3iJOTQR
EQAREwL8EJAAlAE2tVwLQlGAS+8+cCdx9N7BpE5A9u/tA6tYFrr0W6NXL/VgdIQIiIAIikLoEJAAlAE2tUwLQlGAS+zdsaBd47t07skFMngw8+CDw009AtmyR9dFRIiACIiACqUdAAlAC0NQqJQBNCSapP71+TPqgmCtePLJBHD4MlCoFDBkCNGgQWR8dJQIiIAIikHoEJAAlAE2tUgLQlGCS+j/+OPDddwC9etE0egu//hqYMSOaXjpWBERABEQglQhIAEoAmtqjBKApwST0P3TILv3y/vsA1/VF01gsunRpYO1a+281ERABERAB/xGQAJQANLVaCUBTgknoP3YsQA/gunVA1qzRD4AZw2XKAC+8EH1fv/fo3NnOgn7qKb/PxF/jX73aTlhasADIndtfY9doRSAVCUgASgCa2qUEoCnBJPRnJi/X8LG2XyyNNQPbtgU2bwZy5YrlDP7sM28ecPPNAMvnfPUVUL68P+fht1EfPQrUrAl8+aW99OCmm/w2A41XBFKPgASgBKCpVUoAmhL0uD/X/THzl0WdzzgjtovzgUwP4LPPAtxGLjO0gweBChWAe+8Ftm4FFi0CPv88Ng9qZuAVzzkOGwb0ZFnkcwAAIABJREFU7QvUrg3kyAG89VY8z65ziUDmJCABKAFoavkSgKYEPe7/wAPArl0A9/c1aSwePXUqsHChyVn80/f554GJE+0EGIrBiy+2w8B33umfOfhxpL//br9s0F4Z+r3tNvvlJZalC36cv8YsAokiIAEoAWhqWxKApgQ97L9vH3DWWcCsWcCVV5pd+M8/7fIxFETcISQjtw0bbO8fQ8CXX27PdNo0oEMHOxmmUKGMPPvkzq19e2DPHnsf6n//BQoXBj75BKhWLbnj0tVFwO8EJAAlAE1tWALQlKCH/YcPB15/HVixAsiSxfzCt98O5MtnnzOjtmPH7DVn55wDvPnmibNs3Bg4/XRgxIiMOvvkzosJH1yrymULzFpna9MGOPdcgB5ZNREQgdgJSABKAMZuPXZPCUBTgh71p5C59FKgY0egU6f4XHTJEruMzG+/AaeeGp9zptpZxo8H7r8f+OEHoGDBE0fHfZEZCv74Y+Cqq1Jt5P4eD0sVVapkh9gffvi/uYwZA/TpA3zzjb/np9GLQLIJSABKAJraoASgKUGP+i9dClx3XXzFGkVllSp2YgSFZUZrDD2WLQu8+CJw663hZ/fyy8C77wL/+x+QM2dGI5C8+fTrB3z4IbB8uZ34EWi7d9thYApyegLVREAEYiMgASgBGJvl/NdLAtCUoEf9uV7tlFOAwYPje8GhQ+3QKAVQPMLK8R2d2dm47zHrz82dm/bcuC6NIpgCsUcPs+upt01g40Z7XSnLDVWvfjIVep1vvBF46CEREwERiJWABKAEYKy2E+gnAWhK0IP+O3cCZ5+dmISNv/+2E0u4MP+KKzyYjEeX4DrJGjVsYXvRRelflCVhbrgBWLPG3itZLXYC9Co3bAgUKwaw/Eu4xpeYCROA+fNjv456ikBmJyABKAFo+hmQADQl6EH/AQPsLErWrUtE4xq5vXvtUGhGaCz0TM/T9ddHnmzAtWp//GGXxlGLnQD3puZyAoZ406pTybWX550HbN9uJ+GoiYAIRE9AAlACMHqrObGHBKApwQT3p0eFHqynn7YzKBPR6PligolJcelEjCvWc7LQMNf2ffstkCdPZGeh+GO9OmYEc7cQtegJ0JvMNZfPPQew/Et6jWF3hoDbtYv+OuohAiLA8kp7UKBAAaLgf/ZkRiZxKIaRGbEdn7MEYIrffq5f424did627eqrbeETnLGZ4mjCDo9eJQpmJiBwnVk0jeKvd2+7bElGzYqOhke0x3JrQiYrsfyL23rSZ56xBTqztNVEQASiJyABKA9g9FZzYg8JQFOCCe7fvDlw/vkAsyoT2UaPtnfGYGFkP+/SQI/SgQOxCQtukUchzPDxSy8lknbGOzeTbbiGlIXFL7nEfX5co0nW9Lxmpv2o3cnoCBGIjIAEoARgZJaS9lESgKYEE9if9flKl7bXU/HvRDZuj8ZivaNG2Wvn/NiYVEAv5vff20kzsTTWp+NuIV99BZQvH8sZMl8fCueaNW1BF+mLCpc2sDg3w/XRemozH2HNWAROJiABKAFo+rmQADQlmMD+zz4LsFjzzJkJvEjQqR97zPYAMuHEb40CtmJF4J57AJZ/MWmPPgp88YW9T7KfvaEmDKLpyx1quLMH15LmzRt5zy5d7O3hKALVREAEoiMgASgBGJ3FnHy0BKApwQT1P3z4v+3LWFbDi/bzz8CFF9p13GL1oHkxznDX4O4SLC3CEGT27Gaj4J7L3CGEiTd33GF2roze+/ff7eQZZpBHa6dz5gDcjpDrWyW0M7qlaH7xJiABKAFoalMSgKYEE9R/yhSga1fgp5+AbNkSdJEwp61fH6hWzRY/fmlkxHDtvHl2+DYejeVgKP7oES1UKB5nzJjnYIHyXbsAln+JtnG7OO4KQiF42WXR9tbxIpC5CUgASgCafgIkAE0JJqh/vXr2mqrHH0/QBdI47fTpdhiV3kBTT5oXI+dasptusgs4xzuUyPWEFH9vv+3FTPx3DYbI+cLA0G/JkrGNnxnurAnIELKaCIhA5AQkACUAI7eW8EdKAJoSTED/9evtTMpNm4CiRRNwgXROySLKfCCz+HSTJt5em1ebt3EeRq4ciVeufwVFTi3iOgCGfe+7z/bUFSzoenhUB5A/78OsWXaSg1vb+NdGPDT7Ifx14C+3Q2P6/Y7twO499h66ifAKZ0EW3F3lbrSt0NZ1fPTeVa4M0API8i+xtjFjAIbvmXyjlhgCQ4bYSztq1UrM+f16Vr48PvKInfXftKn/ZiEBKAFoarUSgKYEE9CfX0rcLWHs2AScPIJT9u0LfPaZvT2cV+3YsWN4belreGzuY7j87Muxfud6TGk9BVWKVUlzCNy9hDX/mHmaqILCL74IfPCBvaVcjhxp06BwbTm+JZqVbYarS10dV2xHjgJjx9j3pNhZwL+HgIe6AYXjHJredWAXnpj/BO6qfBf6XdcP2bKmvfYgUi5uIHbvtsPAzHSnsFWLL4HAVo+MJtC7r/YfgYkT7V1rmIjExDHWAPXTWlQJQAlA08+zBKApwTj3Zw274sXthIZrr43zySM8HYspM6THQr0XXBBhJ4PDDh4+iHtn3IuP13+Mya0mWwLwpcUv4bmFz2FYw2G4pfwtYc/OnSRWrrTX/rkVHo51eHw40NPFZAUK89BG4fr60tfRc25PDLphEO6qcleslwrb76+/gNat7UQJrktk6RTOmy8HtJFrronr5bDuz3W4eczNKFmgJMY0G4OCeU52q3J5QLlywOzZ9n7Lpo1lhxhKNs3eNh1HRuw/dCjAagKst8g/KnBu32W+PHLXGr7sclcaLvfgOuL33wfy5fOHJUgASgCaWqoEoCnBOPfnF9ALL9jrqhIlaiIZMkUHhegrr0RydOzHbN27FU3HNcXRY0cxqeUknJ3/vwJ+M3+ciTYT2+CeS+9Bn9p9TvBIUfgxdLN8uf
1FnsjGPZhZq473hGsNA43CtfPMzpi+bjomtZqE6iWqx3UY3JGEDybOj17I/Py0Om3YMFswccu7e++Nr63sObgHbSe1xQ9//IBpraehbOH/ADNs1qgRUKQIwPIv8Wivvw7QG8M6jmrxI8B7VbWqbR+0E3ptk7GsI34zit+ZunWzvfq0OX7P7twJtGoFbN1qv2hxGUyqNwlACUBTG5UANCUY5/4UNRRfDzwQ5xNHeTpu58V1MdwfONL9dKO8BJZuWYomY5vgunOvw5AGQ5A7e+6TTrH2j7VoNKYRzj/9fHzY9EMUyF0ALDxMTnXq2OvHvGjMCOZDgtnZbNv+3oamY5vi36P/Wl7L4vmLx3UY06YBt95qZ4Jz27RwoanFi+17RJFIEZUzZ/yGQEH+1PynrLD8B00+QMMydi0izv+uu+w1l2ecEZ/rcbkDw787dgCnnx6fc+osdjHzunUBFpRneJMle955R2TSenlk6S3WACUjethTvSC+BKAEoOmnWQLQlGAc+69aZQubLVuA006L44ljOBW9Bwzz9egB3HZbDCdw6fLeqvdw34z78Hzt59H18q7Iko67k2vT6JHasHMDpraeis8mlrG8GQxRn3JK/McW7owMn7He3ciRQLGqX1vCtVbpWhjaYCjy5MgTt0GQO0Ut58e9iVu0SP/UDA3Tq5M7t+1Fo2cunm3cmnG4c9qdeKzmY+hS6TFcckkWS5Ay+SOejWF27kNN0asWHwLt29vhzNdeAxYtAho3BrZt80d2f3wInHwWvjxy2ULt2mm/PLKmJRPLmJlOL3syIzHpcZAAlAA0/ZxIAJoSjGN/ll/hW2i8QmumQ+OD48MPgS+/ND3Tf/0PHz2MHp/2sDJ9xzYfi+vPi2zfuSNHj+CJeU/gja/fxJFxH2Jcn5us8i9eNpaD6THqAxy47h48c+0z6HZlt3SFa7RjYwFqCqulS+0wFHc2iaTt32975ViWhR46hv3i2VZuW4nGYxoj1x/VcMbnI7Foft64L5anh4oh9vHj4znyzHsueqtZzH3ZMjuTndn9rCjAXX6uuirzcuGayMASm/ReHuk95YsVvYDMouYLVqo1CUAJQFOblAA0JRin/nv2AGedBTD0Gu8HeKxDZIYmx8RQY6VKsZ7lv3479+9E6wmtsWXvFsuTx7ButO2q+0bjy0J3o0/dJ/FojUfjKsDSGwsFaI85PTFw4XA0PTIG416oF+3Q0z2eiRUM5TIEOm6cnRkbTaPnkOs16Z3jQ65Nm2h6ux+7YNnvqP1mC1xQfhdmtZ+Cc047x71TFEesWGHXvaSnNVeuKDrq0LAE+ve3XyL4fRJofLlg2D7R63pT9ZZwiQG9+NzvPJKXR4bOucSCny0K51TbHUkCUALQ9LMmAWhKME7933zTDi/S+5NK7e677fVnfAs2aWt2rLGyS8udWQ7vN3kf+XJFn2rHBdtMQBj3+f/QaX5j1ChZA283ehun5EhsHPiv/X+h9cTW2Lx7M16qMhUt61xg3SeGyOPRWN6leXOARZH54E6v3Izb9VizkOdheQtmOMajXiDDZvQaXVnzXxy45iGMXTMWE1pMwDXnxC8FmQ9ZJthQvN5wg9ss9fv0CPB+sTwSs3+5njjQ6B1mJvu6dakb1kzknWUmP8vicKlEpI1VGZhEw4x3isArroi0Z+KPkwCUADS1MglAU4Jx6M+HH8N9LO8R77VVpsNjphw9M3wbDs5Cjea8U3+YinaT2+HBKx5E72t7I2uWrNF0t45l4WEyorAhpx37dqDZuGbYd2ifVS+QZUsS0b77/TtLuJYtVBYfNP0A+XPltx6iS5bY3hWTumG872+8YS88HzTIDuPGo/EBT28iS8aMHm2+npShb4oJZiXnzQsMWz4MD85+EC9f/zLuvfTeuHlh77/fDlXyZUgtdgLcWo8eYK4PDfamcokBd7Zh5jz3us5MjS9Z3KuaNlyiRHQz5+eUy2Eee8xOtkqV72gJQAnA6Cz55KMlAE0JxqE/Q6wNGtjJH14lNUQzbO6vy0SQzp2j6QWrtEufhX3w0hcvYeTNI9H84ubRnSDoaHqzGBrlmqbAFnWHjhzCAx8/gEnfT8LElhNxVan4Lm76aO1HuHXyrXig2gN4ptYzx4UrvQh8gJokQxw8CFDwfPSR7ZGIRz29YLgM37dta3t7mFFMj1AsLTj5hd7XQFv8y2KrfM/NZW7G6ze9jpzZzFOQP/0UYOIChYuJsI5lnhmpT7Nm9s4fXOsW2iiCaGs9e2akGac/l8DLI6MZLP8Sa5s7F2jZ0i46zzB6srfKlACUAIzVlgP9JABNCcahPzMfueaL26+lYmNZBH7hcbuuSDPi/j70N9pPaY/lW5db6/0qFKkQ89R++sku0sov4HAhmLeWvYWHP3kY/ev2R6dLO8V8nUBHFnfu+3lf9FvcDyMajUCLS05OxTUph8JMTD6kKQInT47eIxHpBOlNe+op22vBdU98yYi2hZa/Ce7PkDizoVm+hwI8kq370rs+H9T8HNCDddll0Y5Ux5MAXyJZUodleugBDm1MMKNHN56JXalOni+PLOtCz6epaON3Eb3rzLbnOeNVCikWhhKAEoCx2E1wHwlAU4KG/VmbiyEJloDhAuVUbMwy5QJoLiqPJIOQe+IybFrolEIY12Kc9XesjeEX7hJBRumtQ1y4aSGaj2tubcU26MZBMXukGFLuMLWDVaOQwrVi0fCpuLEWRKYHk+U4uMsLizknqsZiMG8+qBhe7tXL9vxEKuIDBbAZNuPOMOHa/n/3466P7gL5T2k1BVXPMktB5vrF888HnnsuVovJ3P3oleb61BkzwnPgLj/8LLO+p9f7jCfjzgReHvlSceWV8RkBIwCMiPA7m9+J8VoLHO3oJAAlAKO1mdDjJQBNCRr2f+kle4ExvVup3Bg6oeeKZWHSa4E9cduUb4NX676KHNnS2UA3ggkzPMpF2Nwr1q1I8KZdm9B4bGNrnd74FuNxZt4zI7jCf4f8vOtnq9wJtz8b13wcCudNPxU32i3RuJsHS/2w5Alr3kUqxKKaRBoHs/gtPRd8CNIDxLV86TV647hFVlpb4AX3pcf0lS9ewTMLnsHQhkPBex9r45pFhi5Xr471DJm3H0tI0ev31lvpe3tZa5Tr2BgSzciNL2n0elPwMrkono2JNqwTyMgI6wYmY4cVCUAJQFOblgA0JWjQn18i3GuXRX+ZBZrKjSGlChXs9VlnhtFVwXvi/t8N/4c7q9xpPJ3g/TojLUZND94d0+7Akl+XWB6pysUqRzSOBT8vQPPxzdH6ktboX69/xMK1Xz87vMpkmbSydxmKpeeNHr8xY5KX5coyGLQzcmUIO3hbu1BItEkK1vTmFdpn1vpZuGXiLehYpSP61ul7wtZ9Ed0EALt22fZFeytdOtJeOo4EuJyACVIbNqSf/U2b5bpjrj/NyI1Zu5062bbk9vIYKwd+jvjdxBe6J5/0du2qBKAEYKx2G+gnAWhK0KA/S3bwTZxbYZmU/jAYQlRdufUaC6OGLiDnnrjc1WPGjzPiuidu6H6dkQ6WYrTfon7o8
3kfq0xMq3Kt0uzKY99c9iYe+fQRDKw3EHdXjc4tQk8Zd7Hgfeze/eTL/PWXXZZl0yY7XMTF+clsHC93N2DBZXpXmeEd2gKeTdpnzZrRjXbdn+vQaHQjlC5YGqObjcZpuaPf0oY2Rs8Nt8FTi5wAt33j0gKG+tNr339ve3eZ4OPmCY786ql1ZODlkbvq0IudyMYdiZggxVqp9AZy9xUvmgSgBKCpnUkAmhI06M+QHL1qflnvNGGCXUds/fr/PAyJ2hOX62sYruTC7bJlY4M8fd10awu5zpd1xnO1njvJI8Us4vtn3o+pa6diUstJVl3BWBp34GBh2dC1cnzQ8sEQKD5boEAsZ09MH66npMBmCIsh9uDGMTMZg6HiWNruA7st7hSD026ZhosKRZeCzKQVem/mzYvl6pmzz48/2mvR+DLpth0gQ6O0SXp5kxG69OIO0SPH9bYs/+LFUos//wRatQK4xpIvekzESXSTAJQANLUxCUBTgjH25xc1F7szXBNtXaoYL2nc7d9/7bAhMwkpeL7ekpg9cQP7ddaqZRczNmnf//69lZBy4RkXYlTTUSiQ21Zh2//ebtURPHD4ACa3mowSBaIsDhYyKJYvYekVhuHYGF5jdjdLvbCGXjwKMptwCNeXSR7MRuZuB//3f0DOnPbDi5m/DJuxZlysjTunPDn/SQz+ejA+bPoh6l9YP+JT0VvKzwYfpokK3UU8GJ8cSO8zEzu4xCCSxuPpAWSGf0ZrgZdHCkAv6x1yDSZfkN97zy5ZxYhJIpsEoASgqX1JAJoSjLE/14twoTsfuH5qLCvCbbta9fkA90xPzJ64ke7XGSm3XQd2WWvTmJ1Mj9Seg3us8iVXl7oawxsOR54ceSI9VZrHMZubXhU+UBkSonClB41egVRufBGhF4j1Jxm+YgiRSSoUgfFoY78dizun3Yknrn4CPWr0iLhoNMPq9OJQRKulT4BZ+sWL217TayLcnIXin/edIjsVX05ived8eeSyBXIIVwcx1vNG04/fAayZyut36ZI4D6QEoARgNHYZ7lgJQFOCMfTnRu3coJ1bvyVi26vPN32OyT84rqgYxpdeF66teXvcVuQsOws37R+Dcw7Hd09cXptfoO+/b5d/iVejR6rX3F4YsnwIDh89jKeveRrdq3ePWJBEMg4mefCLv1gxW9jHY//kSK5resw//9hlYhjir1YNYEg7noWYV2xdYWVnX1L4kojDwaxTx7BaLLULTXmE61/9o1Uo9NuuRJzaOudv5xXGsuti256DSw++/tpORog03Hn0mF1WqVHD1Nvj1gQya5WyDA5ZRLKumrsSdajUAZeceYnJZU/qy52C2jbeh25FRmFDnY5xPXfgZAcP7sEbb1gRDf5nT0IukuInzZLi40v14UkAenyH1qyxy3GwsDEX4cfzQcuEBobceszpgdsq3BYXz1Y4PD98nw1Ffu2IgkcvSAg9hv/uuy8hp8bk7ydbZWLqnBv/+Ay9D1y/xqQPrqHzU+O6MApvJoWcd178R86t+wYuGWiF3CNpzFhmGI3rE5Ptocr1zyE83/wNfFm/Ag5nzxbJ8KM6JuvRo6g+YzX6jLwDuwtHn0EwegxwURk7GSmaxvJTrEMZLhEomvOkyrF8kaEN86U60jV4f/zzB6b8MAXvNXkPjS9qHNep7Bw+Caff3QxPdtqBfafE/wtBAlAeQFODlQA0JRhFf27JxW2EmN3IMFs8xR8zcTvP7AwmPkxqNQnVS1SPYmQ6VARSiwAFKdebcjlAIrzkUc2WSolKlFWFE9Xo6qxa1d5fMIrG+o7c2o17dUebZMT1qtyHmtsFRuo5jGJonh/Kdbh79tih8Ggat5LkrkWMCHCpQix7lYe9Hr/oubiWrnUuto1zUwhYAtDUpCQATQlG0J8PM5YjYNYdw77xrvmXqEzcCKamQ0QgYQSYQEOv6htvJOwSkZ2YdVW2brU/vIlq3LqDlZmZARNJ7NIZB+vckRGXH0Tb9u2zE31Y6zHWTPtor5mo4xcssJcLMBweS1LdN9u/sZLFWDf03cbv4tScp5oPlWtAmGkTyLIyP+MJZ5AAlAA0NSkJQFOCLv25bRBrxDEjjevCWPYlni1RmbjxHKPOJQKxEPj0U/uzw+LjSfVQ0cVGcUYXU6Iaq4Uz9t6/vy0YImj0dp11lr1mk3X9YmkNG9oexNDanrGcK1l9WNuSWouJS+FqcUY6LoaDW45vCSss3HoKzi1oUMuFBUC5UfCgQXbZBKYmx7lJAEoAmpqUBKApwXT6b9xo7/vK7wGuZzIpqxHuMh+sTlwmbgKx6NQiEBEBPti5lpLbJF56aURd4n8QF5addpq9F2GkC8tiHQXTRln8kMo3gjZ4sF1y5KuvIjg4jUOoTZitzqQbvzZi4xaCrBkahfM07HT/PfIvun/SHaO+GWXtY167dO3YsHC9D2vCUJ0zK4yewDjXNJIAlACMzTj/6yUBaEowjf7z5wMtWgBt2gCvvmr+xRR8GWa09pzTE8NXDMeYZmNQ7/z4Z+ImCItOKwJREWjd2t4uMWnF0qk+6fljvZxEuyGZ+cL4JesIcdLpNC4rYSIZPV4mjknu780SMqwhWLRoVLcmJQ7mSzYLYH/yie3JjFcbsWIEunzcBf3q9MP91e6PvloAaxixZAIXsTK+zv33mP0XxyYBKAFoak4SgKYEQ/rzi5lv5j162Ot/7zTfEveEK/y1/y+0ntgam3dvxtTWU3HBGYnJxI0zFp1OBGIiQM8On50JiKBFNh4WvmTyBzdG9qIxhZxxXb41ptNYx496YssWO5PXpHHHHX5PsRSQnxq/axnCpoMtljWQbnP9cvOXaDquKW46/ya8Uf8N5Mqey63Lf7+ny5obM7dtC9xzj11ok+H9ODYJQAlAU3OSADQlGNT/4EG7Dtz06XYmWvU4J+J+9/t31kLlsoXK4oOmH1jlTNREICMT2LXLDgMzU7V06STMlBWFWY2aawC9aAwZct2Ii7JjZIEeu3hoCgrsxYvt3Wv81JjF3LGjHZ3nMptEtC17tlhF47NnzY6JLSeiWL5i7pfhlkAM9zKhh+5VvsVwz0XGqOPYJAAlAE3NSQLQlKDTn6EUrt3mdmn8YuLnPp7to7Uf4dbJt+KBag/gmVrPxK9UQTwHqXOJQAIIXHed7elhVQ1P24ED9vo/btlz4YXeXDoQ2+X6sdtvD3tNRopLlozfsLhnNZNIuEwtb15vpml6FSbXMbLKpQEmIfBIxsHalR0/6oh5G+dZ20ZedvZl6XebOdPeAoT7fLKxRg9D+9wBINpaPelcSQJQAjAS+03vGAlAU4Kwq/BzWyXuXcslH6YhmeAhsbhz38/7ot/ifhjRaARaXNIiDiPWKUTAPwRee81+qWJ+hKeNtUW4CJEP8ESv/wueGKuJM+TM7STCNHrs5syx/8SjUXNS3778su189EPj2kfu+MFb5MWt4ffwgCUD8NT8p/Bm/TfRrmK7tDFx/Q/3hhwx4r9juKZz4MC4bm8kASgBaPpZlQA0JMjvaS7xYP3Wbt3i+2W079A+dJjaAUu3LLXKElQq
WslwtOouAv4jwEgad4eh56tgQQ/H/+yzdmG5MWM8vCgAhhC5DnDRopO292C1GLJgRDGetYUpqLj1XiJLHcYLIh2yV1xhv3hzS00v2ycbPkGrCa1wZ+U70e+6flZo+KTGwbFweLAHlwssGRZ+6aW4DVcCUALQ1JgkAGMkyC9i1s5iGQU+H+rFORH3510/W+v9CuYuiPEtxqNw3vhvJRTj1NVNBDwnwDpvjIpyTb1nrU4du2o7H+ZeNy5uY2NIIagxukgtEWW9aNfRM6mES1i4lCXZW++lN1gWva5Z096+jp7QZLT1O9ej0ehGKFGghFWFoWCeoLcSxqa5bGD9euCcc/4bHjc3p2fXpGZPyGQlACUATe1fAjAGgqzxycgQC9SyuLNLxYaor/DZz5+h+bjmaF2uNQbUG4Ac2XJEfQ51EIGMRODppwGuVWM9TU8aixDyQc4K7hdf7MklT7jIihW20gnZ441rIbleL8od41zHf/iwnVQyZYp92VRtfOF+/nmAe6onc73inoN70G5yOzAxj9UYLi7s2Ajr0QR2dAmGyDJCrCPJh0e+6Pd7Dnc/JAAzjwDsCaALgNMAcOUHXw+3hzEKvoo8B4D+qLMBbAHwLoC+AI6GOV4CMMpvOj6EGjWyFyAz/Js/jom4XGfyxtdv4NE5j2JgvYG4u6pHmYdRMtDhIuA1AW5Xdu219tKqXFFU44h5nEyL5cLe7dvju64jmgExlMgMZO6JB9vrx5dNVqWJd5IZz89kCmZccy1gKjbe+zJlgHfftZOCkt2OHjuK3p/1xqCvBuH9Ju+jUZlGwOOP254BVugObUxjf+utuIWLJAAzhwDsAGAQAK463ej8OwuAa8N8ALgi4kkA3LRyHQC+lnAl6msAnpfu5XTeAAAgAElEQVQANPvKYJkEfh8/8ID9Bp41q9n5gnsfPHwQ98+8H9PWTcOklpNQo2Qcq5rGb5g6kwgkhQATFZj5Sg9QvJdbhJ1Q3772JrkTJiRlvtZFqXS4ZoyFobNksbQFPV/00iWiMdGG+Qtr1yZP86Y3L24LSAdaouYfK9MJ302w1mr3qNEDj/eahSwcaLgCsFTYXNtJ24pDkwDMHAKQxYNmAHjKsRlWw2J+OTMCVkdgR/QeNgcQbjMleQAjAMiHD7cb4h8mdnGHj3i2bX9vQ7NxzUARyDIDXFuiJgIicCIB1tjkZ/GNNzwgQ5XZoIFdziNZbf9+4OyzLcVz6IqrrUoidCwlSgDv22fX02P0mRGOVGosj3jTTXZODl8EUq2t2rYKrd9vhG8e24yDq1cg78UVTx4iM2z4BkPvchyaBGDGF4A5AfwD4HoA84Ns5icALwAYFoEd9QHAwkV1wxwrAegCkF+KfKFjyQGu96sY5nMdwT1I85Blvy1D4zGNcc0512B4w+HIk8OwrL/JYNRXBFKYAJdX3XGHHWFLaOkPFvNkuvEXXwAVKiSXCLcU++03jG082vIAsiB2PCMPoZOj5r3qKtsTmCqNyzErV7ZD1EwEStW2e+ZkHGrbGtf1LYMpraeidMGQyuWM3V90EcDq5twZxLBJAGZ8Aciy41zHVx7AmiB74fbf0wBQ3KXXzgXwPwBcTDZeAjC6T9zPP9t1sfgs4OJzro+JZxu1ehQ6Tu+I3tf0Rvfq3aPfbzKeg9G5RCDFCVAI8DPIeoBVqyZwsMzUpLuJi84SqbYimcKPP1qb3Tap+guqNymScAHELdXoqKL2TZXGbN9Ro+yIfI5Uzod7+mkcXbcWXdsVxuhvR1vVG2qVrvUfxsA6Bob2a9c2xisBKAGYngA8E8ACACyf2jkNa7M8gEMXD8Upp5q/kRhbdAqd4Ie1dt1O7pPJ0hPZs8V3cF9t+QrvrXoPo5uNxo0X3Bjfk+tsIpBBCTD7nkWLWaIvYY3r7qiAghabsUwKRWccHDdRD/vv6tfjxaW10XXbYyhUKOruUXVgGRgmmHAnuiJFouqakIP5El6uHDB7NlAj1ZdFM0uJezl36oTh/xuOrrO6WusCzyt43nE21R97A3tLFME39zUz5vXP3/+gYw2rXFABAHuMT+jDEzAZIiO3WEPA3BmRwm8lgPD7CdnULAFYrHYxZM1uZzQULFcQp5c/PSMzdZ3b4X+BpV8DpUrZS3AS0fLlzIdX676KMoXKJOL0OqcIZEgCH34IvPgisGpVAqdXvz7A/eceesja2pFRWJZwYxiSiRJer0EbftMk3LywGwrv3uBJkT6+9DKHgfUGk9noMGPFhTPPBN5+O5kjieDa3Aie27ytXGmHeQEs/mUxei/ojUNHDh0/QYMFW3HdlzvwYM/Y1hLt/GYn/vr2L+t8Rw8fxdZ5WyUAI7g9fj4k2iQQloqZ62QMt0yj/EuAh9YAhrGMTp2AX38Fpk9P8FojP1ulxi4CSSDA5VMMA3MtHKtqxL2xwjt3bJg/H3+UrIKWLe09cpkM3L+/LQAnTvSuVh7XIJcodhhbc5VCrneGxnUrsbTYMdntyy+BaVxklMRGByxF6A8/IOGeT+Np0kXM7MCtW9N/aNBwua6Uhpw7t9FlFQLO+CFgGkh7p/QLPXksA9MfAN11XFxwliP2WCJmGQBWmKT4OwKAuar/OhbG//8jjLVJAIZA4RcfX/6/+cau26kmAiKQWgT4+aRniOWY4t6WLwfq1MHq+X/i5qbZrLDvO+8Ap55qX4ll3OgRHDAACGzWEfcxBJ2QSaPcC3ll497IsnyZ/Vaa4MZMW86bwjdZxZa5oQYzkVlui4k/Kd9YnZp71LlVKqdbk6Vgxo61tzMxaBKAmUMA0kSYk9XVCdl+CqATgB0ASgFgRjDF4EIA1zih31Cz2gQgnJyRAAwixWr4/OJr1Qro1cvgk6muIiACCSNAQUTv0Fy+6sa79e+PraPm4cJ10/Hoo3Zd39A8EJYk4Q5x/MN1wjm5UCcBjVrh0kvtjSXuafCr/UZKD1LwFmMJui7XWbIgNJPgktGY7btkCbBgQfLzcCKa//XX27BYq8itcSErNzF+kiV7Y28SgJlHAMZuJen3lAAM4sMQD7PguL4oUV/qibqROq8IZBYC3BHj/POBHTvsDP14Ne4zu7bszRi1qSYuG/sIbr457TNzDHzecyeg8ePtdWrxbiw9xe2IuRuctXsYdyahWyxOhYTTG2/37sCff9oZwV43Rl+qVQO+/tpOAEn5xoWi3DaQ2eORDPjNN+11BHO4qVfsTQJQAjB267F7SgA6BFlbjFt+ct1LraDMfVPA6i8CIhB/ApUqwfLQtWkTn3Pv2QPcdutRvDOjEHaOmoVzW1dzPfE//9jhSS4boUeSSSLxbKw/midPUOFrFkJs184uhJjgN1R6OZs1A5gVnC3OFRDSY0QRzjqE3I+YyT6+aDQA7k3HN5JIygYxxk7XLtcBGtxHCUAJQNPPhwSgQ5BfdlzvEm4LR1PI6i8CIhBfAk8/bScHcCmVaVu/Hpa3r2b+1XjrmxrIsusvIHv2iE7LMC2FSp8+9iYPXD4Sj7Zzp12BgF7A8qwCy0Z1xNgs15sxjJjAxuUwRYvawpZizKvGbF+W+KFGStb6w6j
nykKFvFGTJkXW9f/bOxOgO6oyDX8yrCoJEmFgMCgqKqUkyA4TTRAhIKhlUEocwQyFIluADLIE2YxGoSDIIiCyKQyCgBTDDhpQoASjsihDjYQlwDjDIDD5wxYBZ+r5z71yc/+7dPc5p7vPve9XlYlD+pw+/Xbf7vd8y/vx0KCxA7jbbpttTIejRABFAAs/PI2BIoBmdsMNTuuPHpgxQjm+N0njhYAQWB4BajXw1KPVvMoqxdHBqQaXwtt20sQz7O9uut7spptyT3j99c4bSQrY3Ln+XjOKTIgS3nln21JOPtmMpuQkx0U2Om9QcU0uYBlG0cn73+/CzhT5JGM772y2005mB5Omn9FIICXh/KijMg4Ye5gIoAhg4YdHBNAhQBiHtA1aHyH/IhMCQqD+CDSbKuB1K9Ibl/EUcHz962akZO21Fx3T/T7KDz3kPIk46ehcgSxcEWNtSMkde6zbmC5nJOah1Pyb37hCgoiGQwt+wsa4DCOczuXRcjMZw1VKIioyMOQlZDUqmdg1FNhsNE8hAigCmPVx63bc0HsAqfKjmhDh/yzpG76Aa7wQEAJhEGgWXH7ve/nme+UVt9kjBx9dPwoOLFBYjrQumkE89pjLJ4YM5jXeR3gl0SLt6N2ErcIuIRERDSkWOo+0aBtHOxv8CUfagw86Af5kjEqVHXd0mjl5kiWRjKG1yfPZ0w3aMREBFAH0/Z0MNQFkx44XHvKXZ/PmC7rGCwEh4I8A4Vs6VjzxRHbBdlqczZjhNnt4uNal2zoWKDGfqdCSRkbq+983+/GPHbHJYzgi3/OeHkUQFB0QcqQ8OHKi3K67uqIMIiSxjB7Pm27qvLAU9iRlhOSpmMmrmk0+J/H1G29s7EDyX7UIoAhg/qdm+RFDSwDZ8NOPG+JHvo1MCAiBtBCg+xbf0Ntucxu5foauHOSPkDFh3+UaMQSS5mhdA23rEIsmlIuu3ZsyNC6F09HhhM1pVyF6Xl6UHB94YPR+bchikZPHJjmWUURz8cVm995rttJKsc4SaV6qf+kBjDp4XkNHCC8gD0cBEwEUASzw2Cw3ZGgJIC8c8lt40Y5qbMmEgBBIDgGqbikcoHK0l0Fi4Eu0OTvooA5kLJA4b/saKFbhO0/TB/IVkXXpZVwHDj4cQz0N9+K557pcwCzMsuCdpbPZxIlmeE4pXA1t6CmSykgqXJnVxkGuA1fvhAkulwBZl7yG54F4f8HuLiKAIoB5H7n244eSACKxQJL1WWe5vG+ZEBACaSKAl+2kk1yeWicjRx9RYzZ8dOlCWHmMBWzP1WkNTz/tNPXIPSTnEELVba00+eC91LcKlgQ9WopBPkaTGOPZ1lu7biSE20MbRTPkGSL/kpzhspw61YwPSkbZoOWu8Xe/c6XsjM+TP9iYRARQBND3NzOUBPCrX3V5QxRhRdw8+94bjRcCQqAPAuTQI9308MNjO6RRUbr77k6fl8rSriFV2qtNmuSEeZeLC4eDn3A1HkhSxcg9JPLXbqwR7yQFJJn4AFUwyBhEbteB1xSvZN40t37ocb1U/lJlDAlMzigjJxEVHbEi1vQgLljgkiBzmgigCGDOR2bM4UNHAMkDwgtAu6GuHwRfVDVeCAiB0hDg94wnadasN07J75v/Ro4v4u5vfWuP5ZDohm7L7bdHXTOORlINSfmCO+BVazXqOgiDIk2Tyf7wB7MttnDx2TXXzDSkyEHUx5BjSaFrqJoTHJh0Xjr+eEcCkzRa8+Ee9amQocqGB/jQQ3NDIAIoApj7oWkbMFQEkHAQqRqf+5xr8i4TAkIgfQROP915p5qtVfGwIWJM6Bcy1Vfe6YtfdGW3J5xQChjwTN5B5C+SBkbhwyOPOEJEThwdODIbJbpUthQgEFnPAXHdcEOzU05xpDqEUe1LYQkFtH3vT4gThp6jWcVLGAkSWNRQ2b7rLtcVJKeJAIoA5nxkxhw+VASQly250/ff79c9wBd0jRcCQiAcAo8/7ggKuXaQwfnzndeP4ou+1lSU/uEPnSxAScaaWd8aa5hdcYXrtsF/I08xl5EECXGlL17EfBaKXAm3X3BBrtV1PBjv7FZbmd1zT0ubO/9py52Bi4D4kTbgU7pMCzl0gmhpk5MJiwCKAPo+9I4ALlxo43rGSHxPU/14mprvsosZorGZNmy8TFFxjfhSrR4VrUAIDAYCkye7Igu8/OSW0d0nkz36qKsI40P+5jdnGhLqoBdfdC3o4ABLl5pdeaWrCchlJBdSVYLgYMcKl1yzdT0YTx0Fc1QFZ8pP7DITjjMqommBS/FOsnbmme5Bu/VWv0t49VXXSQR3KHmoOUwEUAQwx+PS8dBRAvj8qqvZGitkEKnyPVuF419Z5h6WzH1DUSclQUWx4grvmk4tBLIhcNppLgR80UVOmSOzUUBBCeqYpruZZ/A6EAfkvHlODYQ/hfabRx5ptmiRY5CRDGJNaBrO06mAJetp8SDisCSvMFQ+YdZzBz2OGD67jswJmz3OjjAluYBUAOUwEUARwByPS3cCuPrqS+zSS8eNPoODaBRp0VOTKElmLSu0megXRVzGx8U/iIDqmoTAoCBAsuB665l961vpXhFlw3gx+RtpmEj2pS+592dRzx1FJGg2QgJD5RJGutTe08LaYcMQbnIwfY0dAJIwOQm8CKAIoO+jN+oBPP/8JTZr1rhRZxebyUK7UN+VRBqPSgLhIJKOkX/JbJTokxhOQhFJ1jIhIAQGDwHabpAYTD/XlO0Tn3C5LbQdiWQU1yCej2xLEUNHEBKIFzFpo3sAsi2kDWQOKfW4YopAqCgmiTXHx1cEUATQ93f0tyKQRYvGjSYlk5tBRCRp93wLKnjoSdNAxypnjq2LzdBnyjfPw/cuabwQEALhEUAMlE0e1Q2p50Bfe63Zfvu5iEURUeIM6CLdgl4fots4HPMYEXZkbh580Oyd78wzsobHnnOO2eWXu29DCCPdiGogurpQCp7RRABFADM+Kl0PW64KGMFUEn1JSKYqPfUfKhs19KvYYNE6M7exI1t/fTP0tigzlAkBITA4CNAehKowxEFTNyIWCJuSDJmp/LnYBZMmRBEHEZWsRp0DDrM998w3Luv8pR+3xx4ulk2OeCijgIePLyQ+o4kAigBmfFSyEUCOYjNy8MEuHeGqq9yPPUUjTQNVB/J0EV0tbPQIJUcIESyZEBACg4PAPvu4ipETTxyMayKPkXLdm2+Odj20H0Yxh011VkPihjF0Tks+nZoPC98DhMNzl2z3QIwm0FTGXHZZVlhNBFAEMPPD0uXArjqAeLnRfjr55FybEt/1BBvP5p58RryA47jKovaLX7j8DNT2+3VyL3oOjRMCQqB8BPDq4zEjf24QDK0rwjbEWd/73ihXhAwMqjO8DrMU1CFs/cEPmt14Y5h6iSgXlWdSeg6SVE7+X8jvAd8ZnA1/+lPmPEARQBHAPI9up2N7CkHfcYdrYk4NBAKrK6/se7pyxpPSg4ee6A7V+l7Gjq9ZRUIZnEwICIH0EeBDC5PhZeG1Q6wZFLQXIW0Ft1sko9aENnYUdfQzotF0qQshIN3vXKX8+3
nnOXcmH8eQhogleYAPPOD0ZzOYCKAIYIbHpOchfTuBkCfNj5iiEMLCWXZ9vovyHd/MhUb+JUdRVffTnnGGc/kPQq6QL7gaLwQGAQGEk0nrIPF+kIw+c+SSPfWU2aqrRrkyauPo4tGvmpf2fAhdUzVM8chAGImMeFm/+c3wlzN1qhltCdubRHc5kwigCKDvQ9iXAHICpFTY7TVbFpLQW1fjxUTuHxspCvyC2JIlTl+LUrZC1SRBVqFJhIAQCIUAmlDsagctt5eIBTFX9FogKxGMCPMWWzhJl27NU+hyQkErqjRZPIURlhl+SrCF/CGTscMO4ecHLDrTXHJJprlFAEUAMz0oPQ7KRAAZz7OPAOjcuWY/+IEZhVB1M9TqeTERtg4h0L7c9bErQ0cGzTCZEBACaSOw0Uau+ONTn0r7OjqtnnwdigloLxbB+BaQPgl37ibofMQRbr9MpDS3/FaENQeZEqFtwrPk/8XQSaMVDMLkhN0yhK5EAEUAfZ/rzASweSLCql/4ghNVpujMpy+k7+Lbx1PtS/HK/feH0edcbn6U2imJJndokHKGQt8EzScE6o4A8k7rrmv27LOuD+ugGQSFiAXip8ggRDAKBEmf7JTbh2rWllu6MPHGG0c4eVVT0mcQBwC4xjBCbeQB0rIKSZ8+JgIoAtjvGen377kJIBOS08HGmUKzSy81Gz++32ni/ztVaWzq0S8kBBzFttrKbK+9zA44IMr0mlQICIESELjiCpfDxU5xUI24K5or7IgjGEWrFNhRFdzqBPjrX81IZaNQJGIdSoQryjAlCY0kwX/nOxkOLngIjZaRJ+JcIoD9ILA39T1CB/RCoBABZEI2mfTXfeQRlwxM1W2VxsuIrjwZ0yeKLZUdILo4v/99Jhd9sZNolBAQAlEROPBAF5ckVDqoRnHLtGnRIhak28CFKPSAszTtwgvNjjvOSdql3lxlzKOBVw5piZ13jvfUzJnj7hnfGhHAfhCIAPZFqPcBhQkg0yI+T67d2Wc7T2BVclpoTJGTiGcyapXyyy87EVAYb4gm4J43T8OFgBAogABxSbo4kCw8yEZCNJ6k/fePcpWoYvG+JTccI6KOIwCllIjNSKJcS99Jn3zS7F3vii8bhIg3+VXkG4oA9oNABLAvQhEJYHNqFBWokTjmGNfqJ0P+quey3xgOJ0Om77DDShKsnj3bDMFVGK9MCAiBtBCgdHXttc3IA1xrrbTWnne1JOideqqTRIjwUv7pT81wWJGyhhG5pJ0o++MIp8t79WGPJ7SEaPjChWHnbZ+NPqzkpVINjJ5jD1MOoHIAfR9GLw9g68mpkWDXRziAKvlu8gC+C24fD+lk00RebikFKX/8o9mkSa5Siw+JTAgIgXQQuPpqF7ZAy2TQjaICIhbXXms2ZUrwq33hBafvRyolvHr6dNc2HUfZwBkeDpLdSQGKbeSak6bQR8ZHBFAE0PdRDEYAWQi7P6IqaEBRjNFnA+O79tGdJ5qEyA2Uqk348Y+b8YdeczIhIATSQeCQQ1zD87POSmfNPivlemFnkZKjd9nFbfqJAqFhjPzLQBryL+jefPKT8S+PMNpzz7lYeg8TARQB9H0YgxJAFsO7ddYsM8IDV10VL1UOLartt3cyA3jmSzUujJjzokUluR1LvTqdTAgMLgIIubNxo2XaMBi75E02MSOHLULI+9xz3fse0f17702nXWiuW0+pM55USBkyLbHt+uvNIO70HRYB7ImBqoD9HsbgBLC5HNQHDj7YbMUV/RbYbTQEkB6TVJuVLsv36qtOEZ4dWlWVL71gJSv7lltc+7qoVTFx7m2ys5JXhgYRBCNSF4Zo2JBzhMAnfRQHtec1wnXELNGMWmedaFDWbmKiFRMmuB62gdvDwY2QA6MQD5nUoIY3ASLE74o+pFUlFiKqjWg4DLcMo/MUHzdIO3qOXUweQHkAfR/HaASQhVErgVxMLOMdXsaGrOP60Tog8ZH8mjoZu0bEX5GAQK6GWPxmm9VphYO7FkgfRIpdCTps6IXF2gGFRHHBArPdd3fibXSPwGs0iPmt/Fbx3CMXMEwGSyM3B+0WciDxZgU0Ug2D53xD+uhpzOTIosyfX137KTZFaIzRaaAs453Ns9qj5ZYIoAig7+MYlQD6Lq7W49mdEfeAcOENrIPhFt1xR6fFcMYZLmH5hBPMiNPg3ZHFQwASRRXUQw+ZUZqOUvrEia4lV127TfC8nHmm81jyN5IhEAUE3PAWDZrxQR0Zcb+HYbNly5wcDK2cyM/ZZpv6ItCsKERqi76jiA3iCWRjUsWOn6bGtL36zGfKwwy1CchvDyFvEUARQN8HUgTQB0E++DRe5+VQB4NstL8ob7rJ7SK/8hWzefOUsxjjPvFxxeuKfhf4YxANPIJ4A9HF4CNSJ2slBHiF8P5hbGxYK94yvMiDZOjicX9QsB9G60T464ZDU1OMCAuEnbBv68aWjUqZRmUjoSb+Jn2gLOOdwcaMDWUXEwEUAfR9HEUAfRBEf4bWcHw0V17ZZyb/seSNfOADnUMlSNfQtR19Bl6wVeyi/a+wvjPQVozCIDTCWkO+9MVCcJgqpYsvdl7BOhghwRkzzFgf3qD2kCDhNvJb77uv+uc6FF4QcjyxixebveMdoWZNcx681bROomyXKAEt46o2ugocfbTzePGOau+20UxtoQcdRL4sI/eQKArpNGUaBScQTn6rXfK4RQBFAH0fSRFAHwT5gCIPgAew6qrCgw5yu8Vbb+2cLA1BxPMBGSSkAlmU+SNAL0R0Ifmoot/VyfiIEF5FI4OPXFXJ7Kzt1792oSwKA2hs36kogFwxcpA+/3mzo47yx6gOM1ClgLYa90vmclWJYFAZ/JOfuCKRqoxEcVJUWBPvJt6pnYzNFJ5pnuFSRF/NjPdq03NaNj5EFdCshKx3MBFAEUDfR1IE0BdBOp5Ttn/77b4zFR9P309K8PDYdHt5Mju7bJSz6WdJhfCuuxY/p0a6DwNV4OSA9sjVGYUKtVw+uJtvbkbD1Coapf7oRy4PDI8lJfq9iCjK6pBElH032CD9u004jTAe3TFkDgGUnKn4prqVkCOaWmUbeX1EJyglprsSYsvd7JVX3BrRnYGYlWFs7nhndiFhUZfAdeJk6BL2FgEUAfR9/kQAfRFEZJWQEonLVeR5QerwPKHISqgii11+uatSxRvFh7FKj1SW9db1GDx7ECqqSrMUevCs8CEhvEN1dlnECo8e3keIJ/d+hx2yIUreKJIp112X/jNC0QM5moMqcZPtjo49CoJBBAPpqIsuKrc/MhtnohL8hubOzebVI8JBdTDEcd11i151tnE0N8ZD2iMMm22igkeRVoLXs0v4WQRQBLDgk/W3YSKAvggynmR/CMDpp4eYLd8c7A6RJ8BTk0fji10/Hik+jHhFgus45LuM5I4mp2yjjZzUSx7NPzQkqfAjz+mKK8y22y7upUM2CeVC5PDy4GnJaoylopxQMTmDqRqtich7JY9sIPuUBbgxbEjIZz70UDMKMFZYIcCkXabAc46uHp5ock15PvNYUxqF31BMAxNSIHoUY
sQ8vT3zjMv/4+8OIXoRQBFA3+dPBNAXQcajnUbSMnpVb3lLiBmzzcHOlFw+iATyL3mNkBi7aUJBZfTuy7u+Oh/Ph5KQO7l/RTyofPgIw/IhPOCAYnP0w4d+t4TXPvQhV4Sy+ur9Roz9d+RgyEOimrnI+PxnDD8Cr9E++7gCEFl3BNhE8rwQ9iRdIMb9RtqE6MNdd7l3TpEens33Hh74rN7sIved3ziSTv3SO4rMnXVMU2WCzXqbiQCKAGZ9jLodJwLoiyDj2dHSbom8FD40ZRk7YUI4hPWKWlNtn5cpf4LL+RddWI3H4T2lASphf59iGjYOeNboL4onF7HZUEYyPZ5JZE98PDo828jBkLtIL9QUDQL7xBOO1Mh6I4DXl4I2SBYeY7ROQxn3ACJD/ivvGh+xcXROibgQHs0T+chzLZDTr32tOgFq1kp4nPfCqaeKAHa4d2oFl+eBHnusCKAffm+MZpeIaCkFGUU8QnnX0cyFITzRo11Q5mkJ8xGa5CNPrpSsMwIQ7m23dQUShLB87amn3EeRlzw5P74tyiBs5HThWQyV04X3DwJ4993OO5SaISg8c6bzPMn6I0DO6OGHu+eHCmGedV+74w6XX8gfZJF8ZbOauc8UspEnF9qoTCbs2qcdW+jTjpmPzT2/ZTabbSYPoDyAvs+fCKAvgs3xS5c6Ivbzn5ttuWWoWTvPQzUcH2JkLagUC2XNlzReKXbXvi/pUOuq0zwQZV7IhFdXWy3MyggzffnLrpKcsBhkq4gRykduhk0I3puQZI1cKDTY7rwzbn5YkevuNQZsyf/jfuXJfwy9jhTnI/yPB4oNRb+q8V7XF2tz2VQ/oMJ+ww3DIkzhEyFg8karNPqpotNJQUqbfqsIoAig76MpAuiLYOt4XpZ8cKi2jGlU+xLiu+ee8L1mCdOgE0dRCB4pnzBNTAyqmJv+pIR8kdBB/iWk4bnD+4o3A28yIr157LHHnCcRrwVem9BdC8jdIh9pzhxHVlOx225zWOJpLcMznwouWdfJO4b3ATnGPJd5wq2kl0AceY/wB09saGMTTBX+LbeEvb+Efp9/3hWpVG0UYvFuaJPtEgEUAfR9NEUAfRFsHU8+Ct4/Ki7XXDPkzG/MVYYiPh97chnx9hRN1I5z9dXOSpUk2JC/FMuarcRzmt4AABQPSURBVPvAnwrjLIK3kBzkZRDT5UMRq7MDXhEw4IOLPEYKBqFG/ByNOVkxBChuIyrAJoW2gVlSTsoqMCNUy6YMJYS81cS90OA9Tk53ngr/Yuj2H4UcE/qIaM62mAigCGD/h6f3ESKAvgi2j58yxeW5ED4IbbyAp0934Q7EnGMa50IbDH2uIlINMddWxdyQLFq5kXMZu5VYs3UfAtNIXXTTGOQe8Ryg8UfIvowcN4jAuHEuPywFQ2YHYrDvvimstr5rJO2E3GA8bbQPbPaO7rRiiqSoJiZXtgyJKX4j5C+jDdhLSDoruqTz8JujM8n662cdFe84Ig7kTdIBRQRwOQxUBOL32IkA+uE3djQ/1m98w72MQoecSAgmpMLcZfXzveEG51nabz9X9JDFIxUa06rnW7bMjLZMkIgYxL7T9aEziEgu95pwP5qDrcaakI/BK8cHmY9tGUZSPILnnHfq1DLOWPwcYMTvBELiU61dfAWDNZINBxsN0gDYeFBY026XXeaiB1ReszEJ/Q7shCjrQg6G3wjVwb5Gj3fILmkVdTDSF9CvJCTdIs0jD6A8gL6PpwigL4Lt4/no4CHiRbj99uFmp5cvHzFCfBCyMo2QH96vLO2aylxXWeciCR6tRZLOV1yxrLM6iZ9jj3UftUsucXIxGInheOIQlSYkF9sj2X7FPIPnn+90EOtcKERRE6FxJE3KICLlPRnVnulnP3NSMYRHTz7Z/SaoyoX0nX22y5GlM1GZhtccKa5f/rJ4EVVzvRBcwt518nIjxwPp3mmnv6EqAigC6PsTEwH0RbDTeHa+NJ0PmStGtS9yHMi/VPExI9cGjxTXRYUpicnDYISB6D9KdXevsFdMLCjqILxLJS6eDpLyCW2ee264SuQ864d4braZ24jQSrCuhsf6gQdcUYwsLAK8BwjzIluE/BX5cuQn826oytuK3iXt5Shc8YlUoPGJF5OK+rrY3nu7riDf/rYIYMs9UQjY7wEVAfTDr/NoSAPhCEIIWRKm+63ht791FXSEsqokXs1d/llnud12LKPHJzkvvPCqNEJLeDImTnTt0Ko0vG1U+VJghAwNoegqNgJNDBCxhozSPaKsnsZ58Wd9YEaoXBYeAXLl6K1MoRieKQptykpN6XQ15CnS9YbfRtF7TpEX10DqxbvfHR6zojMiycM7iN9dw+QBlAew6OPUHCcC6Itgt/G0hiMv65hj/M4A6cLzxHzkFtbBCK0hFxPLyHn71a+qr0BGuoLcRz4Gsaq682CIFtjjjzvvWx0MORjCq9deWy0Z7YQFXko+5HiDIAWyOAiQpoB+JfmgPl63UKujSIWwP79ZNpJ5DU8/uY2836rcYLWvm989xX9EYhrtRkUARQDzPt7tx4sA+iLYbTwkhl0oXkCfvDHyPmgDhMRMKOHhWNccat7WCmTCS83m76HmzzIP3g28uPPmOekT2VgEIKSE+whFE5auk7GBIGcSOZIVVqjTyrSW2AhQ9c09LyL9Q84tERxybutmqAKQe9vozCICKALo+4iKAPoi2G08njtCYyTwkytTxEj25wNL9S/yL8NmzQpkKvIoxCjTw4CsBO2XkH+pkyegbs8AifJ4uZHHocdrXQwNRWQzqJCWDRcCFHCweSMHmzSAPIYnE+HwOoqdsxGFBCLNZWYigCKAeR7tTseKAPoi2Gs8SeiIKSPuW8RIsodIQgCH1ZoVyFTBlZVjRL4d4XtyL9vlV4b1PnS7bry106Y5AfQ2odpKoSJlgrw0ZJNkw4cAG2/+UASUtXsJOYSkDTDmfe+rH2Z4/8gFpNJZBHAUAxWB+D2mIoB++PUejQePHRvVuxCYPIbUAlIf5LKEKCTJc+66HdusQF60yGnixSyEIaeJKkCqbAn/yvojQJ/dLbYwu/vusP2H+5+58xGvveaEfMlVjVmsVHR9GhcfATbObEqQr6I6OIvR65rwMR7EOnr9ef/RjpH34WqryQMoApjlqe55jAigN4R9Jth9dyfiSVeNrMZOdNIkl0MoD4ZDrVVnDE9g6F68zXtDPhtSC5Aa+iHLsiGAHAyEiz9V59wtXOh61/75z+WmDWRDSkeVhQDPASHd++93BRT9jCI7NutouNbR8Laj+YnO4rRpIoAigN5PqQigN4R9Jmj2aUXNPWsoghcR0grkMPkUkMS+tirmp+0T+TnknR1+eNidOgUDeBdJAC9byLYKLEOe88UXnXcCMWA01Ko0xIkJk+Etlg03Amyi0Seku0c/rx7C/Z/9rKv8r6uRFsQ76rjjRABFAL2fUhFAbwj7TMCujdZZRx/tkov7GW5+vH9IKxDCkI1FgOIM9N0I1ZIXE8pTh6bZCy+YIf8iy48AcjBIaJC2sNZa+ceHGkH1L3mJ//IvoWbUPKki
QLiUQjp0Rele0s3+8heX/0e3H97XdTW0AMkJX7BABFAE0PspFQH0hjDDBLx86EZw1129D4YskrhOviBiy7LuCOCt2203MzxPeEt9m7ZDuCEOhIAQfpYVQwA5GD6kF15YbLzvKFIFJkwwI4d28819Z9P4QUCAlJHDDnOV6uPHd74i3s08u08/3d9TWCUmbK4+/OHRPMCRZctsvLse/s9Ilcuq6twqAvFDXgTQD79so9mFUsjRL0kekkhLJSpfq1TUz3ZV1R/Frh286IeL145uKUWMeSgWIHSJ/IusOAII6BIKpiXXRz9afJ6iI+mWQ97Xc88pfaIohoM2jo01cjB49k4/vfPVUfBFZCFk+84YOHItCFxfeaWNTJokAhgD4yGaUwSwrJtNL9eVV3bN0jvZyIgLVSClQc9dWXYEzjnHhfvmzzfbd9/s45pHUvRB4jeyL8q5zI9f+why8PAAQsZ45su0737XjG4QaEjKhEATATbVeM4oUurUSQed1V13dRvKuhuFhZMn28hBB4kA1v1e1Xx9IoBl3SByS5AWQWJg9dXHnpVqX/qqErrql6xc1ppTOg9J/4SESeIm5J6VeNCphVZht97qtP9k/gjQhm3TTV3O6xFH+M+XZwbCeLROLPu8edaoY6tBgC4fN97oIjGtovI8r8gG0WeX/Ou6G92hrrnGRq68UgSw7veq5usTASzzBqGVtvfeY6vMCD1MmeI8JjE17sq81irOtXixKw6BYBPKWXvt3qsgnMKuf731XDszWTgEyKnCq8KmBhmkMgwNR4pPCD9DAmVCoBWBl18223hjl+ax//5v/Av9opGVeuaZ6iWMstwxflNbb20jixfb+Le/nRHKAcyCW8LHHGlm+KbXMLOfmdlXzOzpLtfzFjM708xmmNlfzOxiMzvMzP7a4XgRwDIfigsuMCNEhS5V08tH0vo227iPZaPFT5lLGrhzURQCyaYX7DXXuLBPN6NNGCFjwkNrrjlwUFR+QeRUklSPHEsZXm36ZfNbev55s5VWqvzytYAaIoAcDNXAFFOss45bIBqteP8oJkvB2OisvbaNXH65jXd9gUUAU7hvBdf4z2Z2mpntaWaPNf43xS/Tusz3QzOj/O1LZkZzzn81sx+Y2fEigAXvQKhhL73kvE3XXeckTDCqfU85xXlKVlst1JmGex48e/SCpX/weec5df92W7rUJYXTrg/5F1l4BJ591nm0uQd4ZmNbIzQ2mgMoEwLdEIAAkuuLoDKG5idE6tBD08FsxgwbmTzZxh8/+lkXAUznzuVe6W/N7HozO7YxcgMze8TMNjGzB9pme5uZ/Y+ZTTezBY1/g0CeaGZ/b2b/13a8PIC5b4fngEMOGe1QcPOee9r0yZNd4QcFCMi/yMIiANGmoIZwD0SvNe+HohHyMpF/Ceyduvnmm206Hl2ZKwahFRfyOm9lPxrRGsnxo5qbEUz3NQKoVUxJHjbvXSIA221nN48bZ9MpDiFvNRU77TQbue46G0/OuAhgKnct9zopoXvJzHYws9taRj9qZt9uePZaJ93ezG4ys1VpntX4h/UbnsMPmNnDIoC570HYAYQeNtnEZs+cafPxQpGAjPyLLA4CaH99+tOuFRR6YOhmEYInVBhJ9HX27Nk2n4pkmRnhKkSZycnL0w4xL3Z4fQnpIQdEPm0E032NAGpVUyIHg8f4wgtt9nbb2XyiM60bxKrWlfW8991nIx/5iI1HuF4EMCtqyR23rpn9p5ltbGYPtqz+HjOjz9G32q5oDzND6KhVhp+44otmhkhauxKxPIBVPBIf+5jNXrzY5pN0DEEhLCyLhwA6jHvsYUbFL3k+5AiiFYf8SwQTUWgDlb7KFECRbE8SfgxrEci1VVaJcQbTfY0CazWTvvaa67S0dKnNfv11m/8oPpWE7PXXbWTCBBu/ZIkIYEK3Le9SSyGATz75pI0bBxeUlYLA1VfbnJkzbR4EpLUarZSTD+lJKLY54QSXc4mniD7LoVrItUE6Z84cm4ewrOwNBAgDk4CPFFIMo4UiVZ6E/SOZ7mskYKualgjA9tvbnG22sXk3EThLy0Z2280mKgSc1k3LudpQIeDHzez9HULAuJ6eyrkmHS4EhIAQEAJCQAjUA4F3NCKF9VhNiasYhlZweYtAkIehoiBLEQj4/YOZLS3xnulUQkAICAEhIASEgD8CdBX4U4cCT/+ZE5hhGAjgzIb0C1oVyMCQXb6CmRFLgbz9vCER85vG/brIzDYzs70bMjDoACIDc0IC91NLFAJCQAgIASEgBIRAXwSGgQACAv2UDjYzEvVuNTManiL38k4zI3sVMvjLBlqtQtCvmhm6gF/rIgTdF2AdIASEgBAQAkJACAiBuiEwLASwbrhrPUJACAgBISAEhIAQqAwBEcDKoNeJCyBwnJnxp2kIc1/TaNtXYDoNqRCBz5jZAY2uO+Th0Hustd3ihmb2fTOjKe1/m9lcM7uwwvXq1NkQ6Hdf21tq8hum32C7KH+2s+moMhCYY2a7mdn7GvnulPwebmZ/bjm5fq9l3InA5xABDAyopouKAOSPAp1PmVnz2X3FzEainlWTx0Dgn8wMkXUIAZovrQRwRTP7dzP7nZl9o0ECz2l06GkVdI+xLs3ph0Cv+8rM3O/PmtmdLaeBSHTqte63Eo0OhQDaQPR9o6CSNKozzQwF5dFGumam32sopEueRwSwZMB1Oi8EIIB0a/mo1ywaXCcEpjYq7lsJIAT/MjN7e6OTD+slFxdP4Yw6LV5r6YpAp/vaJIAQh6bKgiBMDwG88jRFWKPhEdTvNb17OLpiEcBEb9yQLhsCOLtBCvD6UdDzdTP73yHFYxAuuxNRINwLyeffmrZXo32j2r6kcdd7EUC0U2k38h9m9h0zuyGNS9IqGwjsaGZXNzZkeG71e0300RABTPTGDemyefHQmm+Rmb2r8fEgfBSpPcKQolzuZXciCuT+rWlmn2tZys6NfE/E3WX1R6AbATyyIb31WsObS34ZvdrlEaz/PWWF/P4I3y9s5PDy3/R7TePejVmlCGCiN07LHkXg3Q0yuHkjX0ywpIeACGB69yzLirsRwPaxhPbf1sjrzTKvjqkOAfRzL2/k7rLpfqmxFBHA6u6J15lFAL3g0+AaIPCcmX3ZzK6qwVq0hPwIKAScH7MURmQlgLMav9+NU7ioIV4jXOFHZjapkZ6xpAULhYATfTBEABO9cVr2KAJUkdKneYtGhZpgSQ+BTkThk40ikLVavAx06KECUUUgadzjrATwAjPjPnPPZfVFgPv0j2Y2xcyeaVumfq/1vW89VyYCmOiNG9Jln2hm/2ZmJJET/j3JzJCB+ciQ4pHyZRP2g8BD3s9t6AG+bmYPmxkdeP5gZve2yMCc1ZAAuj3lix6CtXe7r+TtTmuQvXvMjBxAtOXwHu1iZrcMATapXiIhXvQdP2FmT7ZcBESQIhAq+PV7TfDuigAmeNOGeMlIg0D2KBD4LzNDkPQYM3t2iDFJ9dLpzY2wM0LArdZsy/jeBjFsCkGjB4gXUFZvBHrdVwq42MSxeYM4PNTQgLy23pc09Kv
jXrX+TuEN/P8bmNkTDXT0e03wMREBTPCmaclCQAgIASEgBISAEPBBQATQBz2NFQJCQAgIASEgBIRAggiIACZ407RkISAEhIAQEAJCQAj4ICAC6IOexgoBISAEhIAQEAJCIEEERAATvGlashAQAkJACAgBISAEfBAQAfRBT2OFgBAQAkJACAgBIZAgAiKACd40LVkICAEhIASEgBAQAj4IiAD6oKexQkAICAEhIASEgBBIEAERwARvmpYsBISAEBACQkAICAEfBEQAfdDTWCEgBISAEBACQkAIJIiACGCCN01LFgJCQAgIASEgBISADwIigD7oaawQEAJCQAgIASEgBBJEQAQwwZumJQsBISAEhIAQEAJCwAcBEUAf9DRWCAgBISAEhIAQEAIJIiACmOBN05KFgBAQAkJACAgBIeCDgAigD3oaKwSEgBAQAkJACAiBBBEQAUzwpmnJQkAICAEhIASEgBDwQUAE0Ac9jRUCQkAICAEhIASEQIIIiAAmeNO0ZCEgBISAEBACQkAI+CAgAuiDnsYKASEgBISAEBACQiBBBEQAE7xpWrIQEAJCQAgIASEgBHwQEAH0QU9jhYAQEAJCQAgIASGQIAIigAneNC1ZCAgBISAEhIAQEAI+CIgA+qCnsUJACAgBISAEhIAQSBABEcAEb5qWLASEgBAQAkJACAgBHwREAH3Q01ghIASEgBAQAkJACCSIgAhggjdNSxYCQkAICAEhIASEgA8CIoA+6GmsEBACQkAICAEhIAQSREAEMMGbpiULASEgBISAEBACQsAHARFAH/Q0VggIASEgBISAEBACCSIgApjgTdOShYAQEAJCQAgIASHgg4AIoA96GisEhIAQEAJCQAgIgQQREAFM8KZpyUJACAgBISAEhIAQ8EHg/wHEym0LzKHALgAAAABJRU5ErkJggg==" /></p> <div class="highlight"><pre><code class="language-python" data-lang="python"><span class="n">results</span> <span class="o">=</span> <span class="n">pd</span><span class="o">.</span><span class="n">DataFrame</span><span class="o">.</span><span class="n">from_dict</span><span class="p">({</span><span class="s">&#39;Naive Bayes&#39;</span><span class="p">:</span> <span class="n">average_recall</span><span class="p">[</span><span class="mi">0</span><span class="p">],</span> <span class="s">&#39;SVC&#39;</span><span class="p">:</span><span class="n">average_recall</span><span class="p">[</span><span class="mi">1</span><span class="p">],</span> <span class="s">&#39;Decision Tree&#39;</span><span class="p">:</span><span class="n">average_recall</span><span class="p">[</span><span class="mi">2</span><span class="p">]})</span> <span class="n">results</span><span class="o">.</span><span class="n">plot</span><span class="p">(</span><span class="n">xlim</span><span class="o">=</span><span class="p">(</span><span class="mi">1</span><span class="p">,</span><span class="nb">len</span><span class="p">(</span><span class="n">sorted_features</span><span class="p">)</span><span class="o">-</span><span class="mi">1</span><span class="p">),</span><span class="n">ylim</span><span class="o">=</span><span class="p">(</span><span class="mi">0</span><span class="p">,</span><span class="mi">1</span><span class="p">))</span> <span class="n">plt</span><span class="o">.</span><span class="n">suptitle</span><span class="p">(</span><span class="s">&quot;Classifier recall by # of features&quot;</span><span class="p">)</span></code></pre></div> <p><img 
src="data:image/png;base64,iVBORw0KGgoAAAANSUhEUgAAAoAAAAHgCAYAAAA10dzkAAAgAElEQVR4XuydBbgUVRvH/6BgAnZjdyN2Iyoqtp+dICYmFna3YoKB3d2Fhd3dXYiBCRiIot/zm9nFy3Xvnd09u3N39v7P89wPv3vnzJzzm3d3/vO+531PG7mZgAmYgAmYgAmYgAm0KgJtWtVsPVkTMAETMAETMAETMAFZANoITMAETMAETMAETKCVEbAAbGU33NM1ARMwARMwARMwAQtA24AJmIAJmIAJmIAJtDICFoCt7IZ7uiZgAiZgAiZgAiZgAWgbMAETMAETMAETMIFWRsACsJXdcE/XBEzABEzABEzABCwAbQMmYAImYAImYAIm0MoIWAC2shvu6ZqACZiACZiACZiABaBtwARajsDRktaUtEqVh/CppOMlXZa7znqSzpE0Z+73fA90l7RqlceRxulXkzRU0sSS/paUxDjp72mMudRrHCZpX0nTSeom6YkCJxgoaStJU0maS9IXpV7Ex5uACdQ3AQvA+r6/nl3LEphD0jGS1pY0jaTPJQ2RdJqk4TlxkobwmlbSL5L+yOH4WNL1ks6TNDr3u/aSfm5ZXBW5OgLwUUntGgjA5hgjAKt5Dw6RNEtOsCHC20o6PGCms0tC0G8s6TlJP0n6q9H5eKF4OPdi8Zmk7yT9E3BNujL2lXOCM/BU7m4CJlALBCwAa+EueAz1SGABSU9JekbSqTkPzKyStpf0u6SDUhSADfkiQP6UtIakxysEHvE4tohzFXtcEadq8pBaE4B3SLoh94MoO1vSPQETXF3SI5ImauYcO+VsC89fpRoCcKWc3YScMw0bCBmf+5pAqyFgAdhqbrUnmjIBHvaTS1qxwHU7ShpVQAD2lrS3pPkkfS/pqpwHkVAmba2cmFxQ0q85D9AGub9tLemoXFgXT96dknbP/S0fAkY48N94g/js8y8hRH4ahqIRicdKQkh0kvRSzoP1Zu58+bDprZIOlPSDpCUKzPPynCdumKQ+OTG8US50SQiaUDTeK7yi+0j6MXcOxA3X2FHSDJI+kXSwpHslIawHSFo2F+ZFYPeVhKeLVq4AvCsnyhEogxp46e6X9Lqk/g3mh8cQYTejpN8S7OrrnA3g/cVbN2/OI9dcN14OsAPm/mqOzYs5HjBteP8aC0G48ZM/Bi5zS5pM0hmSNs/dE15O9sp5pRkLdkRoeZGcV5j5cW95WeE+NL4u4hK7OUFS5waTaRxSJxyP/WBHW+RsmnvNmBDDnGOkpNtz95jr0fbLzRvv6QhJl0g6LuXPsC9nAnVNwAKwrm+vJ9dCBAi58tDaUtItzYyhcfhx55xAI0SLyGPNHp6XC3MeH4TWETlxxwMVL965kmbK9cO7+Lyk6SV1lTS4kQDkIT6zpC8lbSLp2Zwo4cHfMAzKg3bdnOj7RlIvSbvkxAuhZMaNOLg7Nz4E6vtNCMD/5R7eiCqOY26P5Tyip+T64CFFyCAIaSfmxOeeOfGFaEKUPpib12KSns4JGQQIImH5AAF4QO7cR0paKCd2EGAIcO7h6ZIIveYbHPnuRCAXaohx5ktD7CNwGH+HnPBHnLEkoFDbRtLFOcH8iiTGtmluHR+eW4Tajbl7zhiws4aNl47dcgJq6Rxz7Ia5ICgPzS0HIDSNiEa4Mx7uE+LrrZyguyh3fxG+k0o6KXc8dsN1CSvvkLv/Ddk0tmkE4FK5e8pnYZykryS9nbNjroMtny+J+SLmGRf3mjG9J2m2nGC8roU+z76sCdQlAQvAurytnlQLE+ABhrjqIumNZsaStP6MhzReP7xzCAYeujxsWT/YsPGAxbtHiLmQR6phEghCCyFBKDGfPNBwHJPkPHHLSHqnwUUQeHgFeQhzfL+cCMl7bApNE6HEujE8mvnG+jQEDA/1vGcTAYco5Xd4AfGUIYTwCiU1xC+CAi6coxwPIB43xoBXlobo7pETIvDg/HivYIwn7VtJeDIRN4UaIowEje0kLZfztO2a884itvnebSopA7tBICPUaNwv7t/Jki7ICXXEUXMhYF4kWGeIl43GWlTuHx5LxCiNNZJwZn0qXtTGDeGLEEd855k0DgHjGYRVkgDkPvOCkW+8qCCw+ZzkG55y1m7CF5HJtfFG5m2kCdT+tQmYQLkELADLJed+JtA0gXIFIA9BxBUPPrwiZLIiFAh70hBfeMkISxI2vTkXCkYMIEY47r7c3xFPCD1aKQJw4ZwXCE9fw+8HvEAIQDxujBFPFB6m5hoCEK8Xnpx8w6uH17KxcEQ0Id4IXxNyxYuZDwk3vAYeNcQQwhhBg2eNvghNxFM5AhDBsWSDizC3a3L3gF8jvBAmePy2zXnDEFVJjfvD/YAD6/6uSPAIcz7mzFIAQrD5dlvODgiLIqRKFYDYDN7axi8H+TkxV+479xbP8dQ528OuEMC0QmsAixWAL+c8xvn5kATFXPJJSfweW8PGyExHiOPJZnzYOuwI/7uZgAlUkIAFYAVh+lQmkCNACBhvHV4UREBTraHnbcrcQ56EAYQCQgCxwUM278nhPHjmeKAjqqbIPbDx5PBZpozLOrnrEvZbIbfGrhQBmBeviLu8tyg/fsaEQEvyXOaPR/ggIggV5htr+QgpM4fG3z94NvEWNicAWQvGGDkP69vwZHF83qNZjgAkoxZvbb41FoB48R7KeTwRYwiapjJ5WQ+X95wiYMbkQqyIVP6bECjhdULYhVo1BCDeS7gR7m3MnBAyYv/DHMezcmFlOBKezXsaCwlAPHmEhhuuAUREYof5kkK8mDyZW5+any/hce4zXtHG42GtJIx4+cHznbdnBCH3yc0ETKBCBCwAKwTSpzGBRgRIAkEAEDZr3AolgeB5eSHnfcmHIi/NLZJvKADz50L48PBmbV7jdYZ4zwhTIpRYgF+KAMRjx3kRmE15XUIEIKFVRBQeNBJdGje8QIggwqcc17iRiMLaRryINELKhEzz9fDKEYDNhYDz12fNGsKc0CRrEAuteeRYPJJ4sRA4CCjGxb1lXWXPnOBB6Db0fjWcI15MsrPzSSf5EDDX5XzleADxDCNKF8+tvWvMNP/Cghc0v2SB9ZCUMMoLQBKMuDZ88w1xxj3CZhBtNJKP8CA2JwBZo8ha1nmKzB7npQcByBrGQjbjLx8TMIEyCFgAlgHNXUygCAIkcZBp+Vouc/eDXMgSYcPDv3EZGEQb2bKEWfECIpTwpiAGEYCICjJpyVYlMQPhg4eNhyNhOoQGoUHEE5meCA5EFp7IUgQgUyNbFK8RYToW5pM4sn4uLPpuoAeQ7xxEDuFp1rkhhlhnhuBEGNCYN+FW1omRBYuYoh/zIzTKekiORbiQoIHQDRGAJFo8kJsX9w2uZKqSOJFveBzxgjGefMJJc2bAOjzGxP3ePxeez2dlN9eP4s0kgTC/fBIIni9sAE9dOQKQ6+GJZm7MFVtk3R7MEfN4erETlhjkM6yxAdZF5gUgnmiSdRCAiDA8zPDHY0efK3MCF/tFpDcnAPF247XlOO41HmxC0PThc4FQxnZZ
o0rYmsQQOLLeM7SeYREfXR9iAq2DgAVg67jPnmXLEEC08YBloT1eEQQe3iqEBAkLjT1phEp5IHIsa/wQSjz8ePjj/UAYIHbY3YHSKKyFuzb3YKesCmFMQo2INDwsnIPGsZyXrGIe6NTsa7iDRCGPHuIMwUliCd5Exs3DGe9giAeQ8TB+hNuGkhADiAiEbUOvF94nQsWIPTKHEWCsB8NrhCeOxBeELSKVeYYIQEQV68yYH3xY89c4xJvPnkaU5jN8m7MqxkihZjK4KZfDmkzW2hXTEGns9MFLAYKTaxJ2ppUrAClvgxcRgUmCCsIbQY045YWEFw68qiTiYHeMFQ90XgCy3IDi4YTa+e/87iK8bJDNzVhJ7sE+eDnJC0ASO3gRwoPYsBE2RmSylhNv9kc5wU0IGq85oWU8rYSCYQATvNluJmACFSJgAVghkD6NCZhAXRNAcLJuD0FcKDmlrifvyZmACdQfAQvA+runnpEJmEDlCOCBwvuK149t80h8cDMBEzCBzBOwAMz8LfQETMAEqkggn1TCmjVC1oTu3UzABEwg8wQsADN/Cz0BEzABEzABEzABEyiNgAVgabx8tAmYgAmYgAmYgAlknoAFYOZvoSdgAiZgAiZgAiZgAqURsAAsjZePNgETMAETMAETMIHME7AAzPwt9ARMwARMwARMwARMoDQCFoCl8fLRJmACJmACJmACJpB5AhaAmb+FnoAJmIAJmIAJmIAJlEbAArA0Xj7aBEzABEzABEzABDJPwAIw87fQEzABEzABEzABEzCB0ghYAJbGy0ebgAmYgAmYgAmYQOYJWABm/hZ6AiZgAiZgAiZgAiZQGgELwNJ4+WgTMAETMAETMAETyDwBC8DM30JPwARMwARMwARMwARKI2ABWBovH20CJmACJmACJmACmSdgAZj5W+gJmIAJmIAJmIAJmEBpBCwAS+Plo03ABEzABEzABEwg8wQsADN/Cz0BEzABEzABEzABEyiNgAVgabx8tAmYgAmYgAmYgAlknkBrEICbSOoraWlJHSS1k/R3M3duCknnS9pU0lhJV0s6MKFP5g3BEzABEzABEzABE2g9BFqDANxW0uw5AXdSEQLwypxY3FHSlJKulTRY0jGtxyw8UxMwARMwARMwgXom0BoEYP7+rSbp0QQBOLWkEZJ65I6lby9Jp0qaUdI/9WwMnpsJmIAJmIAJmEDrIGABOOF97i7pAUmTShqX+xPew08lLSjpw9ZhFp6lCZiACZiACZhAPROwAJzw7m4t6VxJ0zf49WSSfpW0iqSn69kYPDcTMAETMAETMIHWQcACMEwAwm8WSaNbh7l4liZgAiZgAiZQNwRIDP2qtS7vsgAsLgT8maQFCoSAZ5X0Zd18FDwREzABEzABE2hdBGaTNLx1TTmerQXghHedJJBvJa1TZBJIR0kjhw0bpo4d+U+3tAgcdthhOukkkrrd6o2A72293dF4Pr6vvq+1RGDUqFHq3LkzQ+okaVQtjS2tsbQGAYioI5FjGUkX50q8kODxUe7GPyJpe0kv5aBfIamrpN65MjDUAaQMzLEFbkokAEeOHGkBmJbF5q7Tr18/DRgwIOWr+nJpEPC9TYNy+tfwfU2feRpXzOp9RQB26oT2swBMw05a6hrU87u8QIy/m6TPJX0iif9+IjfAhoWg/5REXcCDmigEbQHYQnc1q186LYQrU5f1vc3U7Sp6sL6vRaPK1IFZva8WgK0rBFyND5UFYDWoFnHOIUOGqEcPyjW61RsB39t6u6PxfHxffV9riYAFoAVgqD1aAIYSdH8TMAETMAETSJmABaAFYKjJWQCGEnR/EzABEzABE0iZgAWgBWCoyVkAhhJ0fxMwgUwTGDNmjMaOHZvpOXjw9Umgffv2mnRSNvb6b7MAtAAMtXoLwFCC7m8CJpBZAoi/ueaaS998801m5+CB1y+BmWaaSZ9++mlBEWgBaAEYavkWgKEE3d8ETCCzBPIPUddCzewtrNuB5+v8NVWmzQLQAjDU+C0AQwm6vwmYQGYJ5B+iroWa2VtYtwNPsk0LQAvAUOO3AAwl6P4mYAKZJZD0kM3sxDzwzBNIsk0LQAvAUCO3AAwl6P4mYAKZJZD0kM3sxDzwzBNIsk0LQAvAUCO3AAwl6P4mYAKZJZD0kM3sxDzwzBNIsk0LQAvAUCO3AAwl6P4mYAKZJZD0kM3ixK688kr16tVr/NAnn3xyTT/99OrSpYu23nprbb755lWb1rHHHiuu/8kn7FBaXCMLm/EeddRRxXUIPKpt27aJZ5hzzjlLmkPiCcs4IMk2LQAtAMswqwm6WACGEnR/EzCBzBJIeshmcWIIsN69e+uWW27RrLPOqj/++ENffPGF7r33Xt10003q1q2b7rnnHk0yySQVn95XX32l7777TksssUTR53799dcjgTrLLLMU3SfkwBdeeGGC7htvvLGWXHJJIV7/+eef6G+wKWUOIeNpqm+SbVoAWgCG2p0FYChB9zcBE8gsgaSHbBYnlheAH374oeaee+4JpnD77bfrf//7n/baay+dc845WZxexceMB3KVVVbRVVddVdS5qR3ZVHHmok5Q5EFJtmkBaAFYpCk1eZgFYChB9zcBE8gsgaSHbBYn1pwAZD6bbbaZ7r//fv3444/jhczvv/+uY445RjfffLOGDx8eeQ779OmjQw89VG3atBmP4fvvv49CtXgQR4wYoRlmmEGrr766Lr30UrVr1y46B9eneDFt3Lhx0e+uv/766LxTTjmlFlpoIZ1yyilaccUVo2MKhYAffPDByCP32muvid0w8FqedtppmnfeecePhety/iOPPFIHH3ywELwLLLCAzjzzzOj4YltzAnCrrbbSyy+/HM3voIMO0ptvvhnNh+vRBg4cqIsuuii6dseOHbXJJptE4+S/8+2vv/7SySefrGuuuUaff/555O3cdtttdfzxx0fMmmpJtmkBaAFYrI03dZwFYChB9zcBE8gsgaSHbBYnliQAETO77rqrHn/8ca288sqRiEJMvffee5G4W3TRRfXcc8/puOOOizyFp59+eoSBWolLL720YHbEEUdExyEC77zzTg0ePFhTTDFFJNoargE88cQTI7GHACKkSt+XXnpJXbt21frrr19QACL+evbsqbXWWkt77rmnfvnll0jkjR49WoSLZ5xxxqgfIu/999/XtNNOq8MOOyz694wzzhAh3s8++0xTTTVVUbevOQHImskHHnhAU089dSSG55tvPk0zzTRafPHFtd9+++nCCy9Uv379tMYaa4hi4oxj/vnnj9jmG6LwkUce0eGHH65llllGb731VjSfDTfcUFdffbUFYFF3qfBB/76aBJykFXe1AGzFN99TN4HWTqA1CkAE1jrrrKMbb7wxSghBhOy000564okntNJKK403iZNOOikSgV9++aWmm246HX300eJ3r776aiT+CrXGAnCDDTaI1tOxHrGp1tgDiEhC7L377rvjvY8IOoQVogsPW14APvPMM9Fx+VA36w9nnnnmaE6It2JakgBk3eTDDz88gVfxgw8+iDyZjOWAAw4
Yf5mhQ4eqe/fukYe1R48eeuihh6J/8aziec23yy67TLvssoveeeedyGtZqCXZpj2A9gAWY9/NHWMBGErQ/U3ABDJLIOkh23Bi5AeMHl3dqXboIDWIuJZ1sSQP4JAhQ7TuuutGCSGsB9xuu+2EkCKM2bC98sorWm655XTXXXdF3jpCtoSDn3766SbH1VgAIiDxACKS1ltvvciD2Djs2VAA/vbbb+rQoUPkYeRcDRsev19//TXy8OUFIIIPj1rDRjIJQjEfpk2CmCQACXcjSBu2888/X/vuu28kjgmD59vff/8dhX+Z7wknnBD9i3eU0PlEE000/jiSZeaYYw5dfPHFUajdAjDpLhX+uz2A5XHL97IADOPn3iZgAhkmUIoAHDVK6tSpupMdOVJqsHysrIslCUAEye677z7e47f22mtHHq5CDcGHt2rHHXeMPHBkyyIcm2qNBSDhZdbkkWCBp44wMV5HwsqEUmkNBSDrBDt37hytrdtjjz0muAwePcTfxx9/PF4Acn48lw1bqWVlkgQgIevG4phQOQKvKWaE2C+44ALtsMMO0dq/po7Do3rIIYdYAJZl6fYAloltfDcLwFCC7m8CJpBZAqUIwHrxAFL2hNDkDz/8ECWBIKxefPHFKEyZL4PS8IZSEw+xVo4HsOF58ILdcccd472BJIY0FoClegDTEIAkgRDybdjIoGbtH2v9qLPYuJHogZDdf//9dfnll4vQcCG2JNvk1zQ2PkeSbToEbAEY+sVrARhK0P1NwAQySyDpIZvFiTXnAbz11lu1xRZbROHLAQMGRNPjeDxWZLji5Wuqkf1KUgeh4cUWW6zgYcUUgmYtHIKK6zUWgPx/ws6EXN9+++3xawDJns2vATz11FOjfoSEW0oA4s2EQdJaQ8LthL6feuoprbDCCiWZU5JtWgBaAJZkUAUOtgAMJej+JmACmSWQ9JDN4sTyO4Hg0Zttttk0duzYqBA0a9n4HSFfPHGUV6FRpoSMW8KcrFkjW5c+H330ke6+++6oH+v2yAImQYN/yWhFALEGjzWClEIplAWMt5HzLbXUUlEmLeKRvoR3ydgtJADxTiKa8lnAiEHEJ/eKsjANs4BbSgAyblixhm/vvfeO6giS7IJQJckG7yCsaKyzxFPI71gDSWOnlPvuuy8KEzdVADvJNi0ALQBDv58sAEMJur8JmEBmCSQ9ZLM4sbwHMD92wrwkKiDCqD+36aab/mdaCD6SNW644Yaohh9ijpp7JH+QkJFvhHH5/whDQsiIMbJeEYCIxMYewLPOOisSnYhLwruzzz67ttlmm6hcSj4pggxespAbbgWHCET05esAUmaFjNt55pln/FjwAJJ00bDkCn8sdL7m7iPHI+Dg1rgRHke0Um6mULviiisiEYe3kvkwvzXXXDOaH2FgGqHfs88+WxwLB+4H6w7JxIblZJNNVvDcSbZpAWgBGPr9ZAEYStD9TcAEMksg6SGb2Yl54JknkGSbFoAWgKFGbgEYStD9TcAEMksg6SGb2Yl54JknkGSbFoAWgKFGbgEYStD9TcAEMksg6SGb2Yl54JknkGSbFoAWgKFGbgEYStD9TcAEMksg6SGb2Yl54JknkGSbFoAWgKFGbgEYStD9TcAEMksg6SGb2Yl54JknkGSbFoAWgKFGbgEYStD9TcAEMksg6SGb2Yl54JknkGSbFoAWgKFGbgEYStD9TcAEMksg6SGb2Yl54JknkGSbFoAWgKFGbgEYStD9TcAEMksg6SGb2Yl54JknkGSbFoAWgKFGbgEYStD9TcAEMksg6SGb2Yl54JknkGSbFoAWgKFGbgEYStD9TcAEMksg6SGb2Yl54JknkGSbFoAWgKFGbgEYStD9TcAEMksg6SGb2Yl54JknkGSbFoAWgKFGbgEYStD9TcAEMksg6SGbxYmxp22vXr00zTTTRPv6dujQYfw0Pv/882gf2ksuuUS9e/cuaXqck/6PPvpoSf1CDuaaDffobd++vTp37qxNNtlERx555ARzC7lOLfZNsk0LQAvAULu1AAwl6P4mYAKZJZD0kM3ixPICsE2bNjr66KN11FFHVUQAIib/+OMPLbjggqlhQQDef//9uvvuu/XPP/9E13/ppZeiOa255pq6/fbbUxtL2hdKsk0LQAvAUJu0AAwl6P4mYAKZJZD0kM3ixPICcO2119bzzz8feQGnmmqqaCohHsCWYIEAfOSRR/TFF19McHm8f6eccoq4f5NNNllLDK3q10yyTQtAC8BQI7QADCXo/iZgApklkPSQzeLEEICEdx977DEhAg844ACdcMIJTQrAt99+WyeeeKKeeeYZjRgxQrPOOqvWWWed6HcdO/KIiBti7LPPPtPQoUP1zTffRKHYs846S3vttdcEmE477bTIQ/f1119r6qmnjv5222236fTTT9cbb7whwrhrrbWWzjzzzOgczbWmBCDXOPTQQ/Xzzz+PDwPfddddGjRokF5//XWNHj1ac889t/r06aO9995beENpG264oYYPH66XX355gssyr3nmmUeDBw8eHxrnd4cffrgeeuihSGgutNBCkUd14403Ht/3ww8/1MEHHxyx45gZZphByy+/vK6//nq1bds2yHySbNMC0AIwyMAkWQCGEnR/EzCBzBJIeshmcWJ5AYg4Oe+883TZZZfpk08+0bTTTlvQA/jAAw9EAmaZZZaJBNuwYcMicTbppJPqqaeemkAANlwDiEhEgD333HMTYFp88cU1//zz65Zbbol+f+GFF2rPPffUzjvvrM022ywSZwgpwrmItSmnnLJJzHkBiBeTlg8Bb7/99lpiiSWE6Ms3xChCjxD15JNPHolNhC8iMC+ACSevv/76euGFF9S1a9fxfQ877DBdcMEF+uqrryKP4pdffqmlllpKM800UyQ0p5tuOt144426/PLLdeedd0bnoM0333wR1/79+0f/Ii7vu+++iPnEE08cZD5JtmkBaAEYZGAWgKH43N8ETCDLBJIeslmcW0MBiLjCE7bHHntEHrhiQsDjxo2LhMycc86p1157TQg6WuMkkGuvvVY77LCD3nvvvUgI0Tge4cTavI022ki//vpr5FHcfPPNI+9avjEORCKevH333bdZAdgwCSR/4IorrhitC8x7GAudgHngieP8P/zwQ3QI6wgZ6xprrKGLL744+t1ff/0VeSK32GILnXPOOdHvEKv33HOP3n///fHhc36PR/W7777Tq6++Gp1z+umnj0RoXhBW0l6SbNMC0AIw1N7sAQwl6P4mYAKZJZD0kG04McTD6LGjqzrXDu07jA9XlnuhhgIQ8XfIIYdo4MCB+vjjjzVmzJj/ZAEjgM444wxdffXVkUD87bffokvjTUNAIYwKCUCOm3HGGbX//vvruOOOi44h3Mz1CRHjAXv44YfVo0eP6N9VV111/JRgiQcOMZb3FBaaL6ITDyVeNfow1g8++CC6Husan3zySU0yySRR12+//VbHHntslDSCgOXY/DwIRxOepSGEjz/++Mjbh0Dm+ltuuWXkMVxkkUWiY2abbbYoTE22dL5x/bPPPjviOX
LkyKjvvPPOG3lK+/Xrp9VXXz0S25VqSbZpAWgBGGprFoChBN3fBEwgswSSHrINJzbqj1HqdEqnqs51ZP+R6jjJv+vuyrlYYwH4448/RqJvp512ioRK4zIwBx54oC666KJIPOG9o2zM33//reWWW05XXHFF5OUrJAD53Y477hiFiRGX9MmXaDn//POjPtddd5222267gtNAYCIKWVPYVGtqDSCZwMsuu2wUtt1tt92i7vx/hBnhXIQZoVySYFijSAh59tlnj47Dc4fAI2S8++67R0IPYYyYzDfWKeJBRPQ1bqztY75zzDFHtCaScPa9994rOCMADzrooPFjKuf+5btJ34wAACAASURBVPsk2aYFoAVgiH3R1wIwlKD7m4AJZJZA0kO24cSy6gFkDiRl4OV78MEHI9HVsA4gog0RdcQRR4yfLmsGEVFJAjDv4XviiSeicO+6664brSdEPNKGDBkS/e6qq67Swgsv/B87QWzmw8eFjKgpAfj7779riimmiELbee8m52EcK6+88vhTXXrppdp1110nEID8EVH75ptv6tZbb42uz/i23Xbb8f1mnnnmiBNr+wqJwMUWW0zt2rWbYMicDw8h6wTxQuL5DGlJtmkBaAEYYl8WgKH03N8ETCDTBJIeslmcXGMPIHPAM4bnDy8ZWa0Ns11JXiCsSTZrviF8CJUiZprzACKOEJA9e/YUooyEEEK0+UbCB962/fbbL/IwltqaEoBch3WAeN/4QXyRFMLvmWO+cUy+FE7eA8jf8v27desWhX4JGeP1yzeuyzGsacyHmIsZO/Pt1KlTxI5weEhLsk0LQAvAEPuyAAyl5/4mYAKZJpD0kM3i5AoJQOZBJiyeQEKvDQUgIVo8daeeemoUJmW9HSHNjz76KFEAcl6EIx7FP//8MxKSDT2J/J1kC8KwJFbgDUQgIbgoU0P4lQSRplq+EDSZtzTW9ZGYcdJJJ0WhXJIxSFbh2gsssEB07rzQJKxNJjTh2oYh4Py1WIOIwCMsjmBr2MiExouJeGXsXOOnn36KhCZ/49z8NwkmrB/EW0rIGMFMyRvEY5cuXYLMJ8k2LQAtAIMMzCHgUHzubwImkGUCSQ/ZLM6tKQH4yy+/RGvUWKuGKMtvBYewQcgQtkRIde/ePQplInqSPIDwQQgtueSSkbBEcOFpbNxI5EBkUX8PEUdm8CqrrBKJR7KBmxOAhGfzbaKJJor6rrTSSpGYbdiXeoaINdYHElreeuutI4GJd7KQADz55JMjsdowi7nhOEgSOeaYYyIuZP7iKV100UWjbOitttoq+h3jf/bZZ6OyMSSDEBqmdiC7lIS2JNu0ALQADLUxrwEMJej+JmACmSWQ9JDN7MQ88EQCK6ywQlQvkJ1GarEl2aYFoAVgqN1aAIYSdH8TMIHMEkh6yGZ2Yh54QQJ4OCk+TTIM28kR7g5N1qgW6iTbtAC0AAy1PQvAUILubwImkFkCSQ/ZzE7MAy9IIF8Ie5pppom2iCOBpFZbkm1aAFoAhtquBWAoQfc3ARPILIGkh2xmJ+aBZ55Akm1aAFoAhhq5BWAoQfc3ARPILIGkh2xmJ+aBZ55Akm1aAFoAhhq5BWAoQfc3ARPILIGkh2xmJ+aBZ55Akm1aAFoAhhq5BWAoQfc3ARPILIGkh2xmJ+aBZ55Akm1aAFoAhhq5BWAoQfc3ARPILIGkh2xmJ+aBZ55Akm1aAFoAhhq5BWAoQfc3ARPILIGkh2xmJ+aBZ55Akm1aAFoAhhq5BWAoQfc3ARPILIGkh2xmJ+aBZ55Akm1aAFoAhhq5BWAoQfc3ARPILIGkh2xmJ+aBZ55Akm1aAFoAhhq5BWAoQfc3ARPILIGkh2xWJ3bHHXforLPOiva5HT16tGaYYQZ16dJFe+yxh9Zee23tv//+GjhwoNjvdrrppis4zTnmmEPzzDOPHn300fF/Z99f9vR97LHH9O2330ZbqbEPMPvusrfwxBNPnFVkNTfuJNu0ALQADDVaC8BQgu5vAiaQWQJJD9ksTuycc86JBF6fPn200UYbaYopptDHH3+se++9V/PPP79OOeUUvfbaa1pqqaV09tlna5999vnPNIcOHaru3bvriiuu0A477BD9/YYbbtCOO+4YCcnddttN8803n+CHQLzkkkt03nnnafvtt88ispocc5JtWgBaAIYargVgKEH3NwETyCyBpIdsFic2++yza9lll9Utt9zS7PCXWGIJtWvXTi+99NJ/juvVq1fU/5tvvokE5AcffBB5+nr27KmbbrpJbdq0maDP119/HXkTu3btmkVkNTnmJNu0ALQADDVcC8BQgu5vAiaQWQJJD9ksTqxDhw6R144Qb3NtwIABOuigg/TWW29poYUWGn/o77//rplmmkmbbLJJ5AGkETq+/PLLNWzYME0//fRZxJK5MSfZpgWgBWCoUVsAhhJ0fxMwgcwSSHrIZnFihG6fe+45HX/88ZGIm2uuuQpOgzV8s802mw488ECdfPLJ44+57rrrolDuww8/rG7dukW/X2CBBSLh99RTT2URSSbHnGSbFoAWgKGGbQEYStD9TcAEMksg6SGbxYl9+OGH2nzzzUXCxj///BMleZD4QZLGGmusMcGU1l9/fb3xxhv64osvxv9+nXXW0bvvvqvPP/98/O9I9kBMXnvttVlEkskxJ9mmBaAFYKhhWwCGEnR/EzCBzBJIeshOMLF//pFGj67uXDt0kBqtryvnggi/p59+Wg8++GDkDcRzN2bMmMjTd8ghh4w/5c0336ytttoqOg7PIWv+OnfurP79+0cexHyzACznLoT1SbJNC0ALwDALkywAQwm6vwmYQGYJJD1kJ5jYqFFSp07VnevIkVJHvpYr2xB2ePYoC0Pot1NuHn/88YdmnnlmbbDBBrryyit1xhlnRAKR48jyzTdCwJSSefLJJys7MJ+tSQJJtmkBaAEY+vGxAAwl6P4mYAKZJZD0kM2qB7DQDaFMy3777afnn39eSy+99PhDSPAgtItIXHHFFaPafs8888wEp9hzzz112WWXOQkkRUtPsk0LQAvAUHO0AAwl6P4mYAKZJZD0kM3ixPDwzTjjjP8ZOiLuoosuitb2kfyRb4SIEX79+vUTmcEXXnihdt111wn6UwaGsjF4CqkH2LZt2wn+ThmY4cOHTyAss8iulsacZJsWgBaAofZqARhK0P1NwAQySyDpIZvFiZGt26NHjyjky24ezJEi0Ii/LbfcUmT5Nm4LLrigSB6ZZJJJhJjLh4gbHofw22mnncYXgmaXkF9++UUUjb744otdCLrCxpJkmxaArUcA9pe0t6SpJD0sidezb5uwt0UlnSlpWUnjJD0haV9JwwocbwFY4Q+tT2cCJpAdAkkP2ezM5N+RIsbuu+8+vf7669F6v4kmmijaAWSbbbbRvvvuW3C7tpNOOklHHnlklD2M0GuqUTOQreAQfSNGjBi/FRxlY9glpLFnMIv8amXMSbZpAdg6BGAvSedIYo+dT3P/TRn21Zsw1M8ksYDjaEmTSjpLUntJq1oA1spH2+MwAROoBQJJD9laGKPH0
DoJJNmmBWDrEIAvS7pX0lG5jwFVPT+WtKSkNxp9NNjVe0Sjv/WUdJOkKSwAW+cXiWdtAiZQmEDSQ9bcTKClCCTZpgVg/QtAPHe/SVpL0tAGhviJJEq3Dy5gnAjGRyUdnvP8XSRpckmbWAC21EfZ1zUBE6hFAkkP2Vocs8fUOggk2aYFYP0LwJklDZe0mKS3G5j985LuknRigY8C6V0PSaKIE6HiFyStLalQBVOvAWwd3yWepQmYQAECSQ9ZQzOBliKQZJsWgBaAjQXgxJIQhx9IOl3SJJKOlTRW0vr2ALbUR9nXNQETqEUCSQ/ZWhyzx9Q6CCTZpgVg/QvAUkPA60i6VRLl6v/KfUya8iLy58gD2LdvX7Vvz6UUlQ/gx80ETMAE6p1A0kO23ufv+dUugUK2OWTIEPFDGzt2rAYOHMh/8rwfVbszqd7ICHHWeyslCWS9XMIHwu7vHJiZJH0laQlJbzaC5RBwvVuP52cCJtAkAQtAG0etEkiyTXsA698DiG3ulCv9smOuDMwASZRh7yZpFkmP5ErEvCRpaknvS7pH0qm5EDBh4vklLdLAK5i3eQvAWv30e1wmYAJVJ5D0kK36AHwBE2iCQJJtWgC2DgGIeRySK+aMYCPBY7dcuZc5JJERjBik4DONTR4Rf11yhaCflXRgbl1gY1OzAPTXjwmYQKslkPSQbbVgPPEWJ5BkmxaArUcAVssYLQCrRdbnNQETqHkC+YfosGHD1LEjX4duJlAbBLDNzp07a+TIkQVt0wLQAjDUUi0AQwm6vwmYQGYJjBkzRnPNNZe++eabzM7BA69fAjPNNJM+/fRTTTopm3pN2CwALQBDLd8CMJSg+5uACWSaACKQjEo3E6g1AlTnKCT+GKcFoAVgqL1aAIYSdH8TMAETMAETSJmABaAFYKjJWQCGEnR/EzABEzABE0iZgAWgBWCoyVkAhhJ0fxMwARMwARNImYAFoAVgqMlZAIYSdH8TMAETMAETSJmABaAFYKjJWQCGEnR/EzABEzABE0iZgAWgBWCoyVkAhhJ0fxMwARMwARNImYAFoAVgqMlZAIYSdH8TMAETMAETSJmABaAFYKjJWQCGEnR/EzABEzABE0iZgAWgBWCoyVkAhhJ0fxMwARMwARNImYAFoAVgqMlZAIYSdH8TMAETMAETSJmABaAFYKjJWQCGEnR/EzABEzABE0iZgAWgBWCoyVkAhhJ0fxMwARMwARNImYAFoAVgqMlZAIYSdH8TMAETMAETSJmABaAFYKjJWQCGEnR/EzABEzABE0iZgAWgBWCoyVkAhhJ0fxMwARMwARNImYAFoAVgqMlZAIYSdH8TMAETMAETSJmABaAFYKjJWQCGEnR/EzABEzABE0iZgAWgBWCoyVkAhhJ0fxMwARMwARNImYAFoAVgqMlZAIYSdH8TMAETMAETSJmABaAFYKjJWQCGEnR/EzABEzABE0iZgAWgBWCoyVkAhhJ0fxMwARMwARNImYAFoAVgqMlZAIYSdH8TMAETMAETSJmABaAFYKjJWQCGEnR/EzABEzABE0iZgAWgBWCoyVkAhhJ0fxMwARMwARNImYAFoAVgqMlZAIYSdH8TMAETMAETSJmABaAFYKjJWQCGEnR/EzABEzABE0iZgAWgBWCoyVkAhhJ0fxMwARMwARNImYAFoAVgqMlZAIYSdH8TMAETMAETSJmABaAFYKjJWQCGEnR/EzABEzABE0iZgAWgBWCoyVkAhhJ0fxMwARMwARNImYAFoAVgqMlZAIYSdH8TMAETMAETSJmABaAFYKjJWQCGEnR/EzABEzABE0iZgAWgBWCoyVkAhhJ0fxMwARMwARNImYAFoAVgqMlZAIYSdH8TMAETMAETSJmABaAFYKjJWQCGEnR/EzABEzABE0iZgAWgBWCoyVkAhhJ0fxMwARMwARNImYAFoAVgqMlZAIYSdH8TMAETMAETSJmABaAFYKjJWQCGEnR/EzABEzABE0iZgAWgBWCoyVkAhhJ0fxMwARMwARNImYAFoAVgqMlZAIYSdH8TMAETMAETSJmABaAFYKjJWQCGEnR/EzABEzABE0iZgAWgBWCoyVkAhhJ0fxMwARMwARNImYAFoAVgqMlZAIYSdH8TMAETMAETSJmABaAFYKjJWQCGEnR/EzABEzABE0iZgAWgBWCoyVkAhhJ0fxMwARMwARNImYAFoAVgqMlZAIYSdH8TMAETMAETSJmABaAFYKjJWQCGEnR/EzABEzABE0iZgAWgBWCoyVkAhhJ0fxMwARMwARNImYAFoAVgqMlZAIYSdH8TMAETMAETSJmABaAFYKjJWQCGEnR/EzABEzABE0iZgAWgBWCoyVkAhhJ0fxMwARMwARNImYAFoAVgqMlZAIYSdH8TMAETMAETSJmABaAFYKjJWQCGEnR/EzABEzABE0iZgAWgBWCoyVkAhhJ0fxMwARMwARNImYAFoAVgqMlZAIYSdH8TMAETMAETSJmABaAFYKjJWQCGEnR/EzABEzABE0iZgAWgBWCoyVkAhhJ0fxMwARMwARNImYAFoAVgqMlZAIYSdH8TMAETMAETSJmABaAFYKjJWQCGEnR/EzABEzABE0iZgAWgBWCoyVkAhhJ0fxMwARMwARNImYAFoAVgqMlZAIYSdH8TMAETMAETSJmABaAFYKjJWQCGEnR/EzABEzABE0iZgAWgBWCoyVkAhhJ0fxMwARMwARNImYAFoAVgqMlZAIYSdH8TMAETMAETSJmABWDrEYD9Je0taSpJD0vaVdK3zdjb1pLos4Ck7yWdJenMAsdbAKb8ofXlTMAETMAETCCUgAVg6xCAvSSdI2l7SZ/m/ruNpNWbMCCOGyBpf0lPS+qU+3ncAjD0I+f+JmACJmACJtDyBCwAW4cAfFnSvZKOypncXJI+lrSkpDcameHEkoZJOkjSNUWYqD2ARUDyISZgAiZgAiZQSwQsAOtfALaX9JuktSQNbWB8n0g6WdLgRga5nKRnciHi/XKev0cl9ZP0oz2AtfTx9VhMwARMwARMoDwCFoD1LwBnljRc0mKS3m5gJs9LukvSiY1MZ0tJ10v6KLdmcKSksyX9LGkdC8DyPmjuZQImYAImYAK1RMAC0AKwsQAk+eNaSTs0CAEvLulVSbPnxGRDG3YIuJY+0R6LCZiACZiACRRBwAKw/gVgqSHgNSQ9JGkFSS/kbGgSSb9LWlHSc43sKhKAffv2Vfv2XErq0aNH9ONmAiZgAiZgAiZQOwSGDBkifmhjx47VwIED+U8SPUfVzijTGwnZsPXeSkkCwRBGSCJz+LocmEUlvS6ps6SvCgnAkSNHqmNHtKCbCZiACZiACZhArROwB7D+PYDY4E650i875srAUOKlraRukmaR9EiuRMxLOYO9SFL3XD/eCs7LvR1sUMCgHQKu9U+5x2cCJmACJmACjQhYALYOAchtP0TSvpIQbIR4d8t5+uaQREYwYvCJnH0Q8qXwMwkhf0m6XxIZwSSCNG4WgP5aMQETMAETMIGMEbAAbD0CsFqmaQFYLbI+rwmYgAmYgAlU
" /></p> <p>As output by my code, the best algorithm was consistently found to be Decision Trees, and so I could finally finish up the project by submitting that as my model.</p> <h2 id="conclusion">Conclusion</h2> <p>I did not much care for the project’s dataset and overall structure, but I still greatly enjoyed completing it because of how fun it was to combine Pandas data processing with Scikit-learn model training, with IPython Notebook making the whole process even more fluid. While not at all a well-written introduction or tutorial for these packages, I do hope that this write-up about a single project I finished using them might inspire some readers to give them a try as well.</p> <p><a href="/writing/power-of-ipython-pandas-scikilearn/">The Power of IPython Notebook + Pandas + and Scikit-learn</a> was originally published by Andrey Kurenkov at <a href="">Andrey Kurenkov's Web World</a> on June 10, 2016.</p> <![CDATA[Verbosify]]> /projects/hacks/verbosify 2016-05-07T00:00:00-07:00 2016-05-07T00:00:00-07:00 www.andreykurenkov.com contact@andreykurenkov.com <p>We made a quick Flask app, used NLTK to swap some words with their definitions, and accessed it using jQuery in a Chrome app. Nothing too tricky, but fun.</p> <p><a href="/projects/hacks/verbosify/">Verbosify</a> was originally published by Andrey Kurenkov at <a href="">Andrey Kurenkov's Web World</a> on May 07, 2016.</p> <![CDATA[The Distinct Artistry of Trailers]]> /writing/the-distinct-artistry-of-trailers 2016-05-05T19:19:34-07:00 2016-05-05T19:19:34-07:00 www.andreykurenkov.com contact@andreykurenkov.com <figure> <iframe width="560" height="315" src="https://www.youtube.com/embed/d-S9nKByu5w" frameborder="0" allowfullscreen=""></iframe> <figcaption>The ultimate example of the conventional Hollywood trailer, which the trailers below will stand in contrast to. </figcaption> </figure> <p>Some time ago, I wrote <a href="http://www.andreykurenkov.com/writing/in-my-head-music-videos-i-dont-forget/">a post</a> that was little more than a list of music videos I have long remembered for being great. What compelled me to write such a thing, rather than the sort of essay or technical post I usually peddle, is that there really was a discrete number of music videos that stuck with me and seemed to represent the best of the medium. And so it is with another medium not so dissimilar from the music video - the trailer.</p> <p>The similarities are obvious - like music videos, trailers tend to be short, are free to not represent reality but bound to reflect another piece of art, and tend to make a strong aesthetic impression with imagery and sound. But then again, trailers also have a practical purpose music videos do not have - to inform you about the nature of some movie and make you want to watch it. As a result, a great majority of trailers take on a fairly predictable form, like the one above, and work well as an advertisement for the movie but not as its own piece of art apart from that movie.</p> <p>Still, I have come across many trailers in my life that did not just make me want to watch the movie, but also excited, entertained, and impressed me wholly on their own.
This is not so surprising, since (as stated in <a href="http://filmmakermagazine.com/37093-first-impressions/#.VypVA14oBC0">“The Art of First Impressions: How to Cut a Movie Trailer”</a>) a trailer is really its own film - it has rhythm, appeal, and impact that are utterly distinct from the movie it is derived from. So here I am, posting another list of things I like, and little more. But here’s a twist - thinking back on the trailers that made the greatest lasting impression on me, I noticed they tended to fit into one of a few styles that differed from the above ‘conventional’ trailer. And so, read on for those greatest trailers stuck in my head, excitingly grouped by the styles it occurred to me they exemplify.</p> <h2 id="the-music-over-montage-trailer">The Music Over Montage Trailer</h2> <figure> <iframe width="560" height="315" src="https://www.youtube.com/embed/JFbo9uj-TrY" frameborder="0" allowfullscreen=""></iframe> <figcaption>The Double Teaser Trailer. This song is not in the final movie.</figcaption> </figure> <p>This is my favorite style of trailer, perhaps because it tends to least resemble advertising and most function as its own work of art. The cutting together of footage without any continuity, and the muting of dialogue in favor of a single musical track, together almost completely disassociate the images from the movie they are from. Instead, the audio track and spliced footage make for a new composition that can be quite different from the movie while hinting at its overall mood and spirit.</p> <p>I think this is particularly true of the above trailer; The Double is a slow, dark comedy about heady themes concerning our need to be acknowledged, but you could hardly guess that from the trailer alone. The normal exposition about themes and characters is totally absent, and about the only thing that is conveyed is the creepy surreal aspect of the doppelganger and the fantastic visual style. It is not telling us what the movie <em>will</em> be about, who these characters <em>will</em> be, but just <em>is</em> its own little film representing the style and spirit of the whole movie. Of course, this is also why this style is reserved for teasers and not ‘real’ trailers that actually need to explain what the movie is about. Still, I think this style deserves recognition since in my humble opinion it is the basis for some of the best movie trailers of all time:</p> <figure> <iframe width="560" height="315" src="https://www.youtube.com/embed/WVLvMg62RPA" frameborder="0" allowfullscreen=""></iframe> <figcaption>The Girl With The Dragon Tattoo Trailer #1. Again, we at most see Fincher's visual style and that this will be some sort of gritty thriller.</figcaption> </figure> <figure> <iframe width="560" height="315" src="https://www.youtube.com/embed/SPRzm8ibDQ8" frameborder="0" allowfullscreen=""></iframe> <figcaption>The Clockwork Orange trailer, an especially extreme example of this style. Kubrick, ever the auteur, was among the first to do these sorts of crazy things. </figcaption> </figure> <figure> <iframe width="560" height="315" src="https://www.youtube.com/embed/jQ5lPt9edzQ" frameborder="0" allowfullscreen=""></iframe> <figcaption>The trailer for Alien. This is even more its own thing than the others, as it includes footage not at all in the movie.
</figcaption> </figure> <h2 id="the-single-track-trailer">The Single Track Trailer</h2> <iframe width="560" height="315" src="https://www.youtube.com/embed/7iggyFPls4w" frameborder="0" allowfullscreen=""></iframe> <p>A similar but distinct style, in which a single track of dialogue or beat is repeated and intensified over a montage of unrelated footage. It is not quite as divorced from the movie as the previous style, since it is typically the main character monologuing and there may be some cuts to dialogue, but it is still far less expository than the likes of the Independence Day trailer. And, crucially, these trailers can also include an element entirely not in the movie. In the above trailer, the continuous blackboard beat and repeated scenes are elements unique to the trailer. And then there is Inception’s <a href="https://www.youtube.com/watch?v=830I9w7I7wM">now ubiquitous BRAAAM</a>, which was <a href="http://blogs.indiewire.com/theplaylist/who-really-created-the-inception-braaam-composer-mike-zarin-sets-the-record-straight-20131113">in fact conceived</a> for the teaser trailer rather than for the movie itself.</p> <iframe width="560" height="315" src="https://www.youtube.com/embed/Z564VzbQ9Hc" frameborder="0" allowfullscreen=""></iframe> <h2 id="the-condensed-movie-trailer">The Condensed Movie Trailer</h2> <figure> <iframe width="560" height="315" src="https://www.youtube.com/embed/XG8qATRtNuU" frameborder="0" allowfullscreen=""></iframe> <figcaption>The first full trailer for The Double, notably different from the teaser. </figcaption> </figure> <p>Sometimes, just taking all the best elements and plot beats of a movie and succinctly showing them off works wonders. In Hopes&amp;Fears’ <a href="http://www.hopesandfears.com/hopes/culture/film/214473-epic-history-movie-trailers-mad-max-independence-day">“An epic history of the movie trailer”</a> this is referred to as the mini-movie, and said to come about in the last few decades:</p> <blockquote> <p>“By the end of the 1980s, movies were making more money than ever. Year after year total grosses and movie budgets were higher. As such, studios took fewer risks on trailers. They would make multiple trailers, premiere them at different times, test them on different markets, and find just the right way to sell their product.</p> </blockquote> <blockquote> <p>More importantly, they started honing in on an abridged version of the movie, advertising with a short version of the film’s three acts: setup, confrontation, and climax—essentially everything but the resolution.”</p> </blockquote> <p>This may sound like pretty much every trailer out there nowadays, but it takes skill to do well, and is doomed to fail if the movie has nothing interesting going on to begin with. These trailers, at least the ones I consider great, do more than just introduce the characters and plot of their movie - they simultaneously weave together the strongest elements of the movie, leave out just enough to keep the viewer interested, and have a crazy amount of editing flourish that is entertaining even apart from the footage it is executed with.</p> <p>The above trailer for The Double is wholly different from the teaser in that it introduces the characters and plot at length. But it does more than that - it combines multiple songs from the movie’s memorable score, quickly executes multiple character arcs, and builds to an incredibly energetic conclusion that combines moments from multiple distinct scenes in the source.
The following trailers, too, combine exposition with a melding of multiple scenes that is not found in the movie but represents its unique strengths:</p> <iframe width="560" height="315" src="https://www.youtube.com/embed/JJkPLYmUyzg" frameborder="0" allowfullscreen=""></iframe> <iframe width="560" height="315" src="https://www.youtube.com/embed/0nU7dC9bIDg" frameborder="0" allowfullscreen=""></iframe> <iframe width="560" height="315" src="https://www.youtube.com/embed/FJuaAWrgoUY" frameborder="0" allowfullscreen=""></iframe> <iframe width="560" height="315" src="https://www.youtube.com/embed/hEJnMQG9ev8" frameborder="0" allowfullscreen=""></iframe> <h2 id="the-cinematic-video-game-trailer">The Cinematic Video Game Trailer</h2> <iframe width="560" height="315" src="https://www.youtube.com/embed/Kq5KWLqUewc" frameborder="0" allowfullscreen=""></iframe> <p>It should be clear that the above three styles represent increasing levels of exposition about and representation of their source materials. But trailers don’t only exist for movies. They exist for video games as well, and for video games there is a unique style of trailer that is literally a wholly distinct work of art - the cinematic trailer. Usually these short films are not much more than eye candy, but in some cases - as in these cases - they are quite good short films in their own right. I have seen the above Deus Ex trailer many times now, purely because I think it is marvelously conceived and executed.</p> <iframe width="560" height="315" src="https://www.youtube.com/embed/xt_65k-gv1U" frameborder="0" allowfullscreen=""></iframe> <p>Advertisements are often seen as something to instinctively avoid these days, as something tasteless that intrudes into our lives only to cause annoyance. But it should be acknowledged that ads can also be this good-looking, this well-produced, this distinctly artistic. I can only hope to see more like them.</p> <p><a href="/writing/the-distinct-artistry-of-trailers/">The Distinct Artistry of Trailers</a> was originally published by Andrey Kurenkov at <a href="">Andrey Kurenkov's Web World</a> on May 05, 2016.</p> <![CDATA[Planr]]> /projects/major_projects/planr 2016-05-05T02:26:22-07:00 2016-05-05T02:26:22-07:00 www.andreykurenkov.com contact@andreykurenkov.com <p>An app and a server, no big deal, right? We used Flask for the server (some basic REST stuff), and a whole lot of Android libraries to make the app. Sadly we did not win (probably because we did not quite finish), but we were close!</p> <p><a href="/projects/major_projects/planr/">Planr</a> was originally published by Andrey Kurenkov at <a href="">Andrey Kurenkov's Web World</a> on May 05, 2016.</p> <![CDATA[A 'Brief' History of Game AI Up To AlphaGo, Part 3]]> /writing/a-brief-history-of-game-ai-part-3 2016-04-18T19:19:34-07:00 2016-04-18T19:19:34-07:00 www.andreykurenkov.com contact@andreykurenkov.com <p>This is the third and final part of ‘A Brief History of Game AI Up to AlphaGo’. Part 1 is <a href="/writing/a-brief-history-of-game-ai">here</a> and part 2 is <a href="/writing/a-brief-history-of-game-ai-part-2">here</a>. In this part, we shall cover the intellectual innovations necessary to finally achieve strong Go programs, and the final steps beyond those that get us all the way to the present and to AlphaGo.</p> <h1 id="and-then-there-was-go">And Then There Was Go</h1> <p>Although Go playing programs by the late 90s were still far from impressive, this was not due to a lack of people trying.
Following Bruce Wilcox’s 70s work on Go programs (mentioned in <a href="http://www.andreykurenkov.com/writing/a-brief-history-of-game-ai-up-to-alphago-2">part 2</a>), numerous people continued to dedicate their time and CPU cycles to implementing better Go programs throughout the 80s and 90s. One of the best programs of the 90s, The Many Faces of Go, achieved 13-kyu (good non-professional) performance. It took 30 thousand lines of code written over a decade by its developer, David Fotland, to implement the many Go-specific components it used on top of traditional alpha-beta search<sup id="fnref:ManyFaces"><a href="#fn:ManyFaces" class="footnote">1</a></sup>. The combination of a larger branching factor, longer games, and trickier-to-evaluate board positions rendered Go resistant to the techniques that by this time were achieving master-level Chess play. The so-called “knowledge-based” approach of Many Faces of Go (encoding many human-designed strategies into the program) worked, but by the 2000s hit a point of diminishing returns. Traditional techniques were simply not enough.</p> <figure> <img class="postimageactual" src="/writing/images/2016-4-15-a-brief-history-of-game-ai/16-go90s.png" alt="Go90s" /> <figcaption>Go AIs in the 90s were at the level of good non-professionals; the 'master' ranks are actually obtained by relatively few people who typically play in tournaments, and the 'professional' rank is reserved for incredibly good players. <a href="http://www0.cs.ucl.ac.uk/staff/D.Silver/web/Applications_files/grand-challenge.pdf"><b>(Source)</b></a></figcaption> </figure> <p>Fortunately, another family of techniques did exist — <a href="https://simple.wikipedia.org/wiki/Monte_Carlo_algorithm">Monte Carlo algorithms</a>. The general idea of Monte Carlo algorithms is to simulate a process, usually with some randomness thrown in, in order to statistically get a good approximation of some value that is very hard to calculate directly. For instance, <a href="https://en.wikipedia.org/wiki/Simulated_annealing">simulated annealing</a> is a classic optimization technique for finding the optimal parameters for some function. It boils down to starting with some random values and semi-randomly changing them to gradually get more optimal values, as shown in the graphic below. If this is done for long enough, it turns out a solution close to the optimal one can most likely be found with much less computation than the actually optimal one would have taken to find.</p> <figure> <img class="postimageactual" src="/writing/images/2016-4-15-a-brief-history-of-game-ai/17-simulated_annealing.gif" alt="SA" /> <figcaption> An example of simulated annealing. Just 'hill climbing' in one direction from any starting point does not work to find the optimal value since locally many places look like a peak, so at first big random jumps in location are allowed. Over time, less randomness is allowed as the algorithm converges on the globally highest peak. <a href="https://en.wikipedia.org/wiki/File:Hill_Climbing_with_Simulated_Annealing.gif"><b>(Source)</b></a></figcaption> </figure> <p class="sidenoteleftlarge">1993</p> <p>Monte Carlo techniques had been applied to games of chance such as Blackjack since the 70s, but they were not suggested for Go until 1993 with Bernd Brügmann’s <a href="http://www.ideanest.com/vegos/MonteCarloGo.pdf">“Monte Carlo Go”</a> <sup id="fnref:MonteCarloGo"><a href="#fn:MonteCarloGo" class="footnote">2</a></sup>. 
The idea he presented was suspiciously simple: Instead of implementing a complex evaluation function and executing typical tree search, have a program just simulate playing many random games (using a version of simulated annealing) and pick the move that leads to the best outcome on average. Remember, the whole reason for evaluation functions is to search only a few moves ahead and not until the end of the game, since the number of possible ways to get to the end is far too immense to thoroughly explore. But, this does not work well for Go since evaluation functions are harder to write for it and typically take much longer to compute than for Chess. So, it turns out that a great alternative is to use that computation time to randomly play out only a subset of possible games (typically described as doing many <strong>rollouts</strong>), and then evaluate moves based on just counting victories and losses. Despite being radically simple (having little more than the rules of Go), Brügmann’s implementation of the idea showed decent beginner-level play not too far from the vastly more complex Many Faces of Go.</p> <figure> <img class="postimageactual" src="/writing/images/2016-4-15-a-brief-history-of-game-ai/18-hugetrees.png" /> <figcaption>Monte Carlo approach compared to classical tree search. <a href="http://www.remi-coulom.fr/JFFoS/JFFoS.pdf"><b>(Source)</b></a></figcaption> </figure> <p>Although it was a novel approach that showed promise for overcoming the problems that stumped traditional Go AI, Brügmann’s work was initially not taken seriously. It was not until the 2000s that a group in Paris made up of Bruno Bouzy, Tristan Cazenave, and Bernard Helmstetter started seriously exploring the potential of Monte Carlo techniques for Go through multiple papers and programs<sup id="fnref:ParisSchool1"><a href="#fn:ParisSchool1" class="footnote">3</a></sup><sup id="fnref:ParisSchool2"><a href="#fn:ParisSchool2" class="footnote">4</a></sup><sup id="fnref:ParisSchool3"><a href="#fn:ParisSchool3" class="footnote">5</a></sup><sup id="fnref:MonteCarloRevolution"><a href="#fn:MonteCarloRevolution" class="footnote">6</a></sup>. Though they simplified and expanded upon the Brügmann approach (stripping out simulated annealing and adding some heuristics and pruning), their programs were still inferior to the strongest standard tree search ones such as GNU Go. However this lasted only a few years, until several milestone achievements in 2006:</p> <ul> <li>Rémi Coulom further improved on the use of Monte Carlo evaluation with tree search, and coined the term <strong>Monte-Carlo Tree Search</strong> (MCTS)<sup id="fnref:MCTS"><a href="#fn:MCTS" class="footnote">7</a></sup>. His program CrazyStone won that year’s KGS computer-Go tournament for the small 9x9 variant of Go, beating other programs such as NeuroGo and GNU Go and thus proving the potential of MCTS.</li> <li>Levente Kocsis and Csaba Szepesvári developed the <strong>UCT</strong> (Upper Confidence Bounds for Trees) algorithm <sup id="fnref:BanditMonte"><a href="#fn:BanditMonte" class="footnote">8</a></sup>. This algorithm solves the choice between exploitation (simulating already good-looking moves to get a better estimate of how good they are) and exploration (trying new or bad-looking moves to try to get something better) in MCTS. 
The same tradeoff has long been studied in Computer Science in the form of the <a href="https://en.wikipedia.org/wiki/Multi-armed_bandit">multi-arm bandit problem</a>, and UCT was a modification of a theoretically-sound Upper Confidence Bounds formula developed for that problem a few years prior.</li> <li>Lastly, Sylvain Gelly et al. combined MCTS with UCT as well as the older ideas of local pattern matching and tree pruning<sup id="fnref:MoGo"><a href="#fn:MoGo" class="footnote">9</a></sup>. Their program, MoGo, quickly surpassed CrazyStone and became the best computer Go AI.</li> </ul> <p class="sidenoteleftlarge">2006</p> <figure> <img class="postimageactual" src="/writing/images/2016-4-15-a-brief-history-of-game-ai/19-monte_carlo.png" /> <figcaption>A nice visualization of (basic) Monte Carlo Tree Search. <a href="https://commons.wikimedia.org/w/index.php?curid=25382061"><b>(Source)</b></a>, By <a href="//commons.wikimedia.org/w/index.php?title=User:Mciura&amp;action=edit&amp;redlink=1" class="new">Mciura</a> - <span class="int-own-work" lang="en">Own work</span>, <a title="Creative Commons Attribution-Share Alike 3.0" href="http://creativecommons.org/licenses/by-sa/3.0">CC BY-SA 3.0</a></figcaption> </figure> <p>The rapid success of CrazyStone and MoGo was impressive enough, but what happened next was truly groundbreaking. In 2008, MoGo beat professional 8 dan player Kim Myungwan in 19x19 Go. Granted, Myungwan was playing with a large handicap and MoGo was running on an 800-node supercomputer, but the feat was nevertheless historic, given that previous Go programs could at best beat highly-ranked amateurs with large handicaps. In the same year, CrazyStone defeated professional Japanese 4 dan player Kaori Aoba with a smaller handicap while running on a normal PC<sup id="fnref:GoHistory"><a href="#fn:GoHistory" class="footnote">10</a></sup>. And so on it went, with faster computers and various tweaks to MCTS making Go programs rapidly become much, much better.</p> <p class="sidenoteleftlarge">2012</p> <p>By 2012, Go programs got good enough for an amateur player to write a post titled <a href="http://blog.printf.net/articles/2012/02/23/computers-are-very-good-at-the-game-of-go/">“Computers are very good at the game of Go”</a>. The post highlighted the fact that the standout program at the time, Zen19, had improved from a 1 dan ranking (entry-level master) to 5 dan (higher-level master) in the span of 4 years. This is a big deal:</p> <blockquote> <p>“To put the 5-dan rank in perspective: amongst the players who played American Go Association rated games in 2011, there were only 105 players that are 6-dan and above. This suggests that there are only around 100 active tournament players in the US who are significantly stronger than Zen19. I’m sure I’ll never become that strong myself.”</p> </blockquote> <figure class="sidefigureright"> <img class="postimageactual" src="/writing/images/2016-4-15-a-brief-history-of-game-ai/20-go_ratings.png" /> <figcaption>The pace of betterment for Go programs after the introduction of MCTS <a href="https://www.usgo.org/files/bh_library/Supercomputer%20Go.pdf"><b>(Source)</b></a></figcaption> </figure> <figure class="sidefigureleft"> <img class="postimageactual" src="/writing/images/2016-4-15-a-brief-history-of-game-ai/21-crazystone.jpeg" /> <figcaption>Progress of Go AIs compared to Chess - despite being better than that vast majority of Go players, the best Go AIs were still nowhere near the best humans by this point. 
But they were getting better and better, very fast. From 2014's <a href="http://spectrum.ieee.org/robotics/artificial-intelligence/ais-have-mastered-chess-will-go-be-next"><b>"AIs Have Mastered Chess. Will Go Be Next?" by Jonathan Schaeffer, Martin Müller &amp; Akihiro Kishimoto</b></a></figcaption> </figure> <figure> <img class="postimageactual" src="/writing/images/2016-4-15-a-brief-history-of-game-ai/22.5-go2012.png" /> <img class="postimageactual" src="/writing/images/2016-4-15-a-brief-history-of-game-ai/22-progress.jpeg" /> <figcaption>An image from 2013 of Rémi Coulom (left) placing stones for Crazy Stone, against professional player Ishida Yoshio. Crazy Stone won. <b><a href="https://gogameguru.com/crazy-stone-computer-go-ishida-yoshio-4-stones/">(Source)</a></b></figcaption> </figure> <p>Still, Go programs were only beating lower-level professionals, and only with significant handicaps. A revolutionary idea — Monte Carlo Tree Search — made it possible to brute-force good Go play, and solved the ‘type A’ (smart use of brute-force) strategy part of the problem. It accomplished roughly the same feat as alpha-beta search for Chess programs, but worked better for Go because it replaced the human-implemented evaluation function with many random rollouts of the game, which were easy to run quickly and parallelize across multiple processing cores. But in order to go up against the best humans at Go, ‘type B’ intelligence (emulation of human-like learned instincts) would also be needed. That would soon be developed, but it would require a second and wholly different revolution - deep learning.</p> <h1 id="go-ais-ascend-to-divinity-with-deep-learning">Go AIs Ascend to Divinity with Deep Learning</h1> <p class="sidenoteleftlarge">1994</p> <p>To understand the revolution of deep learning, we must revisit the 90s. We previously covered the success of Neurogammon, a backgammon program powered by neural nets and supervised learning, as well as TD-Gammon, its successor that was also powered by neural nets but based on reinforcement learning. These programs demonstrated that machine learning was a viable alternative to the knowledge-based approach of hand-coding complex strategies. Indeed, there was an attempt to apply the approach behind TD-Gammon to Go with 1994's <a href="http://www.gatsby.ucl.ac.uk/~dayan/papers/sds94.pdf">"Temporal difference learning of position evaluation in the game of Go"</a><sup id="fnref:TDGo"><a href="#fn:TDGo" class="footnote">11</a></sup> by Nicol N. Schraudolph, Peter Dayan, and Terrence J. Sejnowski.</p> <p>Schraudolph et al. noted that efforts to make a Go AI with supervised learning (like Neurogammon) were hindered by the difficulty of generating enough examples of scored Go boards. So they suggested an approach similar to TD-Gammon — training a neural net to evaluate a given board position through playing itself and other programs. However, the team found that using a plain neural net as in TD-Gammon was inefficient, since it did not capture the fact that many patterns on Go boards can be rotated or moved around and still hold the same significance. So they used a <strong>convolutional neural net</strong> (CNN), a type of neural net constrained to perform the same computation for different parts of the input.
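</p> <p>To make that constraint concrete, here is a minimal sketch (my own illustration, not code from the paper) of a single convolutional filter being slid over a Go-board-sized grid, with stones encoded as +1, -1, and 0:</p> <pre><code class="language-python">import numpy as np

# A 19x19 Go board: +1 for black stones, -1 for white, 0 for empty points.
board = np.zeros((19, 19))
board[3, 3], board[3, 4], board[15, 16] = 1, 1, -1

# One 3x3 filter: the same small set of weights is applied at every location.
weights = np.random.randn(3, 3)

def convolve(board, weights):
    """Slide the 3x3 filter over the board, recording its response everywhere."""
    out = np.zeros((17, 17))  # 19 - 3 + 1 = 17 valid positions per axis
    for i in range(17):
        for j in range(17):
            patch = board[i:i + 3, j:j + 3]
            out[i, j] = np.sum(patch * weights)
    return out

feature_map = convolve(board, weights)  # one 'feature map' of local pattern matches
</code></pre> <p>The important property is that the same handful of weights gets reused everywhere on the board.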
In a typical CNN, the same processing is applied to small patches all over the input, followed by some number of additional <strong>layers</strong> that look for specific combinations of patterns, which are eventually used to compute the overall output. The CNN in Schraudolph et al.’s work was simpler, but helped their results by exploiting rotational, reflectional, and color-inversion symmetries in Go.</p> <figure> <img class="postimagesmall" src="/writing/images/2016-4-15-a-brief-history-of-game-ai/23-CNN.jpg" alt="CNN" /> <figcaption>A nice visualization of how multi-layer CNNs work. They are typically much better than plain neural nets for computer vision tasks such as number recognition, since it helps to find small features such as circles first and then use those to compute the output. A Go board is somewhat like an image, in that it is a grid of 'pixels' (Go stones) that are not in themselves significant but the combination of which forms the 'image' (game position). <a href="http://image.slidesharecdn.com/bp2slides-090922011749-phpapp02/95/the-back-propagation-learning-algorithm-10-728.jpg?cb=1253582278">(Source)</a></figcaption> </figure> <p>Though this approach could potentially learn to intuitively ‘see’ the value of a Go position (as Zobrist sought to achieve in the 60s), neither the computing power nor the understanding of neural nets was sufficient to make this effective in 1994. In fact, a large wave of hype for neural nets in general died out in the mid 90s as limited computing power and missing algorithmic insights led to underwhelming results. So, the CNN-based program could only beat Many Faces of Go at a low level, and was therefore not even close to the level of good human players.</p> <p class="sidenoteleftlarge">2006</p> <p>Neural nets were viewed unfavorably well into the 2000s, while other machine learning methods gained favor and Go programs failed to get much better. The history here is not exact, but a paper titled <a href="https://www.cs.toronto.edu/~hinton/absps/fastnc.pdf">“A fast learning algorithm for deep belief nets”</a><sup id="fnref:DBN"><a href="#fn:DBN" class="footnote">12</a></sup> is often credited with rekindling interest in neural nets by suggesting an approach for successfully training “deep” neural nets with many layers of computing units. This was the start of what would become the huge phenomenon of deep learning (which is just a term for large neural nets with many layers). In a wonderful historic coincidence, this paper was published in 2006 — the same year as all those MCTS papers!</p> <p>Deep neural nets continued to gain attention in the following years, and by 2009 achieved record-setting results in speech recognition. Besides algorithmic improvements, their resurgence was in large part due to the availability of large amounts of training data and the use of the massively parallel computational capabilities of GPUs - both things that did not really exist in the 90s. Larger neural nets were trained with more data and more layers, more quickly than before thanks to modern hardware, and one broken benchmark after another showed this to be a powerful methodology. But the event that really set off a huge wave of research and investment in deep learning was the ILSVRC (ImageNet Large Scale Visual Recognition Challenge) 2012 computer vision competition. A CNN-based submission performed far, far better than the next-best entry, surpassing the record on that competition’s ImageNet benchmark problem.
This, the first and only CNN entry in that competition, was an undisputed sign that deep learning was a big deal. Now, almost all entries to the competition use CNNs.</p> <p class="sidenoteleftlarge">2012</p> <figure> <img class="postimage" src="/writing/images/2016-4-15-a-brief-history-of-game-ai/24-cnn2.png" alt="CNN2" /> <figcaption>Another good visualization of a basic deep CNN. The ImageNet benchmark is precisely this: a set of images with thousands of categories of objects. <a href="https://www.clarifai.com/technology">(Source)</a></figcaption> </figure> <p>Up to this point, deep learning was largely applied to supervised learning tasks unrelated to game-playing, such as outputting the right category for a given image or recognizing human speech. But in 2013, a company called DeepMind made a big splash with the publication of <a href="http://arxiv.org/abs/1312.5602">“Playing Atari with Deep Reinforcement Learning”</a><sup id="fnref:Atari"><a href="#fn:Atari" class="footnote">13</a></sup>. Yep, they trained a neural net to play Atari games. More specifically, they presented a new approach to doing reinforcement learning with deep neural nets, a research direction that had been largely abandoned since the failure of TD-Gammon’s approach to work for other games. With just the input of the pixels you or I would see on screen, and the game’s score, their Deep Q-Networks (named after Q-learning, the basis for their algorithm) learned to play Breakout, Pong, and more. It bested all other reinforcement learning schemes, and in some cases humans as well!</p> <figure> <img class="postimageactual" src="/writing/images/2016-4-15-a-brief-history-of-game-ai/25-breakout.gif" alt="breakout" /> <figcaption>DeepMind's Atari player playing Breakout. <a href="https://github.com/kuz/DeepMind-Atari-Deep-Q-Learner">(Source)</a></figcaption> </figure> <p class="sidenoteleftlarge">2014</p> <p>This Atari work was a great innovation in AI research, so much so that in 2014 Google paid $400 million to acquire DeepMind. But I digress - as with IBM’s 2011 win in Jeopardy with Watson, this Atari feat is not directly relevant to the history of AI for chess-like board games, so let’s get back to that. Since the 90s, there had been several more papers on learning to predict moves or to evaluate positions in Go with machine learning<sup id="fnref:Go2000sA"><a href="#fn:Go2000sA" class="footnote">14</a></sup><sup id="fnref:Go2000sB"><a href="#fn:Go2000sB" class="footnote">15</a></sup><sup id="fnref:Go2000sC"><a href="#fn:Go2000sC" class="footnote">16</a></sup>, but up to this point none really used large and deep neural nets. That changed in 2014, when two groups independently trained large CNNs to predict, with great accuracy, what move expert Go players would make in a given position.</p> <p>The first to publish were Christopher Clark and Amos Storkey at the University of Edinburgh, on December 10th of 2014 with <a href="https://arxiv.org/abs/1412.3409">“Teaching Deep Convolutional Neural Networks to Play Go”</a><sup id="fnref:Go2014A"><a href="#fn:Go2014A" class="footnote">17</a></sup>. Unlike the DeepMind Atari AI, here the researchers did purely supervised learning: using two datasets of 16.5 million move-position pairs from human-played games, they trained a neural net to produce the probability of a human Go player making each possible move from a given position. Their deep CNN surpassed all prior results on move prediction for both datasets, and even defeated the conventionally-implemented GNU Go 85% of the time.
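</p> <p>The training setup is conceptually just image classification: the board is the 'image' and the expert's move is one of 361 classes, with the cross-entropy loss pushing the network's predicted move probabilities toward the moves the humans actually played. The sketch below is only a schematic of that idea (the stand-in network and its sizes are mine, not the authors'):</p>

<pre><code class="language-python">import torch
import torch.nn as nn

# Hypothetical stand-in policy network: board planes in, 361 move logits out.
policy_net = nn.Sequential(
    nn.Conv2d(2, 32, kernel_size=3, padding=1), nn.ReLU(),
    nn.Conv2d(32, 32, kernel_size=3, padding=1), nn.ReLU(),
    nn.Flatten(),
    nn.Linear(32 * 19 * 19, 19 * 19),
)
optimizer = torch.optim.SGD(policy_net.parameters(), lr=0.01)
loss_fn = nn.CrossEntropyLoss()

def training_step(boards, expert_moves):
    """One supervised step on a batch of (position, expert move) pairs.

    boards:       (batch, 2, 19, 19) float tensor of stone planes
    expert_moves: (batch,) long tensor of move indices in [0, 361)
    """
    logits = policy_net(boards)
    loss = loss_fn(logits, expert_moves)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()

# e.g. training_step(torch.zeros(8, 2, 19, 19), torch.randint(0, 361, (8,)))
</code></pre> <p>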
Prior research on move prediction with weaker accuracy could never achieve better play than GNU Go, but this research showed that it could be done by just playing the moves deemed likely to be the choice of a skilled player by a neural net trained on lots of data!</p> <p>It’s important to understand that the neural net had no prior knowledge of Go before being trained with the data - it did not know the rules of the game and did not simulate playing Go for its move selection. So, you could say it was playing purely by ‘intuition’ for good moves, based on the data it had seen in training. Therefore it is hugely impressive that Clark et al’s neural net was able to beat a program specifically written to play Go well, but also not surprising it alone could not beat the stronger existing MCTS-based Go programs (which by that point played as well as very skilled people). The researchers concluded by suggesting the ‘intuition’ of machine-learned move prediction could be combined with brute-force MCTS move evaluation to make a much stronger Go AI than had ever existed:</p> <blockquote> <p>“Our networks are state of the art at move prediction, despite not using the previous moves as input, and can play with an impressive amount of skill even though future positions are not explicitly examined. … The most obvious next step is to integrate a DCNN into a full fledged Go playing system. For example, a DCNN could be run on a GPU in parallel with a MCTS Go program and be used to provide high quality priors for what the strongest moves to consider are. Such a system would both be the first to bring sophisticated pattern recognitions abilities to playing Go, and have a strong potential ability to surpass current computer Go programs.”</p> </blockquote> <p>Within a mere two weeks of the publication of Clark et al’s work, a second paper was released describing research with precisely the same ambitions, by Chris J. Maddison (University of Toronto), Ilya Sutskever (Google), and Aja Huang and David Silver (DeepMind, now also part of Google)<sup id="fnref:Go2014B"><a href="#fn:Go2014B" class="footnote">18</a></sup>. They trained a larger, deeper CNN and achieved even more impressive prediction results plus the ability to beat GNU Go a whopping 97% of the time. Still, when faced with MCTS programs operating at full brute force capacity — 100,000 rollouts per move — the deep neural net won only 10% of the time. As with the first paper, it was suggested that neural nets could be used together with MCTS, that the former could serve as ‘intuition’ to quickly identify potentially good moves, and the latter could serve as ‘reasoning’ to more accurately evaluate how good those moves really are. This team went further by implementing a prototype program that did just that, and showed it could beat their lone neural net 87% of the time. They concluded:</p> <blockquote> <p>“We have provided a preliminary proof-of-concept that MCTS and deep neural networks may be combined effectively. It appears that we now have two core elements that scale effectively with increased computational resource: scalable planning, using Monte-Carlo search; and scalable evaluation functions, using deep neural networks. 
In the future, as parallel computation units such as GPUs continue to increase in performance, we believe that this trajectory of research will lead to considerably stronger programs than are currently possible.”</p> </blockquote> <p class="sidenoteleftlarge">2016</p> <p>They did not just believe in that trajectory of research — they pursued it. And just shy of a year later, their belief was proven right. In January of 2016, a <a href="http://www.nature.com/nature/journal/v529/n7587/full/nature16961.html">paper in the prestigious publication Nature</a> by these four authors and some sixteen more (all from DeepMind or Google) announced the development of AlphaGo, the first Go AI to beat a high-ranked professional player without any handicaps<sup id="fnref:AlphaGo"><a href="#fn:AlphaGo" class="footnote">19</a></sup>. Specifically, they reported having beaten Fan Hui, a European Go champion, in 5 out of 5 no-handicap games.</p> <figure> <img class="postimageactual" src="/writing/images/2016-4-15-a-brief-history-of-game-ai/26-fanhui.jpg" alt="CNN2" /> <figcaption>An image of a game between DeepMind's AlphaGo and Fan Hui. <b><a href="http://gardinerchess.com.au/gm-rogers-chess-ego-boost-from-alphago/">(Source)</a></b></figcaption> </figure> <p>How did they build a Go AI this good? With a small battalion of incredibly intelligent people working through quite a list of innovative ideas:</p> <ol> <li>First, they created a much better neural net for predicting the best move for a given position. They did this by starting with supervised learning from a dataset of 30 million position-move pairs from games played by people (aided by some simple Go features), and then improving this neural net with reinforcement learning (making it play against older versions of itself to learn to get better, TD-Gammon style). This singular neural net — the <strong>policy network</strong> — could by itself beat the best MCTS Go programs 85% of the time, a huge leap from the 10% that was achieved before. But they did not stop there.</li> <li>A natural next step to get even better performance was to add MCTS into the mix, but the policy neural net is far too slow to be evaluated continually for each move of the tens of thousands of game rollouts typically done with MCTS. So, the supervised learning data was also used to train a second network that is much faster to evaluate - the <strong>rollout network</strong>. The full policy network is only ever used once to get an initial estimate of how good a move is, and then the much faster rollout policy is used in choosing the many more moves needed to get to the end of the game in an MCTS rollout. This makes the move selections in simulation better than random but fast enough to have the benefits of MCTS. These two components together with MCTS already made for a far better Go AI than had ever been achieved, but there was a third trick that really pushed AlphaGo into the highest ranks of human skill.</li> <li>A huge part of why MCTS works so well for Go is that it removes the need to write evaluation functions for positions in Go, which is hard. Well, if we can train a neural net to predict what good moves are, it’s not hard to imagine we could also train a neural net to evaluate a Go position.
And that is precisely what this group did: They used the already-trained high-quality policy network to generate a dataset of positions and the final outcomes of the games they came from, and trained a <strong>value network</strong> that evaluated a position based on the overall probability of winning the game from that position. So the policy net suggests promising moves to evaluate, which is then done through a combination of MCTS rollouts (using the rollout net) and the value network’s prediction, which together turn out to work significantly better than either by itself.</li> <li>To top everything off, all of this was implemented in a hugely scalable manner that can and did leverage hardware that easily surpassed anything that was used for Go play in the past. AlphaGo used 40 search threads running on 48 CPUs, with 8 GPUs for neural net computations being done in parallel. And that’s just one computer! AlphaGo was also implemented in a distributed version which could run on multiple machines, and scale to more than a thousand CPUs and close to 200 GPUs.</li> </ol> <figure> <img class="postimage" src="/writing/images/2016-4-15-a-brief-history-of-game-ai/27-alphagonets.jpg" alt="AlphaGoA" /> <img class="postimage" src="/writing/images/2016-4-15-a-brief-history-of-game-ai/28-alphagosearch.jpg" alt="AlphaGoB" /> <figcaption>(A) A great visual breakdown of all the neural nets involved <a href="http://www.nature.com/nature/journal/v529/n7587/fig_tab/nature16961_F1.html"><b>(From the AlphaGo Nature Paper)</b></a><br /> (B) And a breakdown of how these networks are used in conjunction with MCTS. I recommend <a href="https://xcorr.net/2016/02/03/5-easy-pieces-how-deepmind-mastered-go/"><b>this technical summary</b></a> for more detail. <b>(Source: also from the Nature Paper, with modified annotations by <a href="http://www.slideshare.net/ShaneSeungwhanMoon/how-alphago-works">Shane (Seungwhan) Moon)</a></b></figcaption> </figure> <p>Besides training a state-of-the-art move predictor through a combination of supervised and reinforcement learning, probably the most novel aspect of AlphaGo is the strong integration of machine-learned intelligence with MCTS. Whereas the 2014 paper only contained a very preliminary combination of MCTS with a CNN, and before that MCTS programs contained at most modest integrations with supervised learning, AlphaGo gets its strength from being a well-designed hybrid AI approach that is run on modern-day supercomputer hardware. Put simply, AlphaGo is a hugely successful synthesis of multiple prior effective approaches that is far better than any of those individual elements alone. At least one other team, Yuandong Tian and Yan Zhu from Facebook, explored the same MCTS+deep learning idea in the same timeframe and likewise achieved impressive results <sup id="fnref:DarkForest"><a href="#fn:DarkForest" class="footnote">20</a></sup>. But the Google team definitely invested much more effort and resources into AlphaGo, and that ultimately paid off when it achieved a historic milestone for AI.</p> <blockquote> <p>AlphaGo = Deep Supervised Learning (with domain-specific features) + Deep Reinforcement Learning + Monte Carlo Tree Search + Hugely parallel processing</p> </blockquote> <figure> <img class="postimageactual" src="/writing/images/2016-4-15-a-brief-history-of-game-ai/29-go2016.png" alt="AlphaGoRankA" /> <img class="postimage" src="/writing/images/2016-4-15-a-brief-history-of-game-ai/30-alphagoranking.jpg" alt="AlphaGoRankB" /> <figcaption>AlphaGo, now at the 'divine' top rank of Go play.
<b>(From the AlphaGo Nature Paper)</b> </figcaption> </figure> <p>And now, finally, we are back to where we began: Just a few months after the announcement of AlphaGo’s existence, it faced off against one of the best living Go players and decidedly won. This will perhaps go down as a historic moment not just for the field of AI, but for the whole history of human invention and ingenuity. I think it is fascinating to see how we can trace the intellectual efforts to accomplish this decades back, with every idea used to build AlphaGo being a refinement on one that came before it, and see that (as always in science and engineering) credit here is due to both a large team at Google and the dozens of people who made their work possible. Now, what of the future?</p> <figure> <a href="/writing/images/2016-4-15-a-brief-history-of-game-ai/0-history.png"> <img class="postimageactual" src="/writing/images/2016-4-15-a-brief-history-of-game-ai/0-history.png" alt="History" /> </a> <figcaption>The past - just about the scope of this series of posts, as promised. Created with <a href="http://www.readwritethink.org/classroom-resources/student-interactives/timeline-30007.html">Timeline</a>.</figcaption> </figure> <h1 id="epilogue-ai-after-alphago">Epilogue: AI After AlphaGo</h1> <figure> <a href="/writing/images/2016-4-15-a-brief-history-of-game-ai/31-venn.png"> <img class="postimagesmall" src="/writing/images/2016-4-15-a-brief-history-of-game-ai/31-venn.png" alt="Games" /> </a> <figcaption>Go and Chess in the spheres of AI problems. I recommend <a href="https://xcorr.net/2016/02/03/5-easy-pieces-how-deepmind-mastered-go/"><b>this for more quality discussion of how significant AlphaGo is for AI as a research field.</b></a></figcaption> </figure> <p>Congratulations, dear reader, you made it. I am impressed and appreciative that you got to this point. With this full history well trodden, I hope that my secret motivation in writing this is clear — to show that AlphaGo is not some scary incomprehensible AI program, but really a quite reasonable feat of human research and engineering. And also to note that Go, ultimately, is still a strategic board game. Yes, it is a game for which it is hugely challenging to make good computer programs. But, in being a game, it also possesses multiple nice properties that make AlphaGo possible, as shown in the above Venn Diagram.</p> <p>There are many, many problems still in the realms of AI that are outside the intersection of those nice properties. To name just a few: playing Starcraft or Soccer, human-like conversation, (fully) autonomous driving. This is an incredible time for AI research, as we are starting to see these problems being tackled by well-funded researchers like the ones at Google. In fact, I think it is appropriate to end with this quote <a href="https://googleblog.blogspot.com/2016/03/what-we-learned-in-seoul-with-alphago.html?m=1">from the AlphaGo team itself</a>:</p> <blockquote> <p>“But as they say about Go in Korean: “Don’t be arrogant when you win or you’ll lose your luck.” This is just one small, albeit significant, step along the way to making machines smart. We’ve demonstrated that our cutting edge deep reinforcement learning techniques can be used to make strong Go and Atari players. Deep neural networks are already used at Google for specific tasks — like image recognition, speech recognition, and Search ranking.
However, we’re still a long way from a machine that can learn to flexibly perform the full range of intellectual tasks a human can — the hallmark of true artificial general intelligence.<br /><br /> With this tournament, we wanted to test the limits of AlphaGo. The genius of Lee Sedol did that brilliantly — and we’ll spend the next few weeks studying the games he and AlphaGo played in detail. And because the machine learning methods we’ve used in AlphaGo are general purpose, we hope to apply some of these techniques to other challenges in the future. Game on!”</p> </blockquote> <figure> <img class="postimageactual" src="/writing/images/2016-4-15-a-brief-history-of-game-ai/32-leesedol.jpg" alt="LeeSedol" /> <figcaption>"Demis and Lee Sedol hold up the signed Go board from the Google DeepMind Challenge Match" <a href="https://googleblog.blogspot.com/2016/03/what-we-learned-in-seoul-with-alphago.html?m=1"><b>(From the AlphaGo team's post quoted)</b></a></figcaption> </figure> <h2 id="acknowledgements">Acknowledgements</h2> <p>Big thanks to <a href="http://cs.stanford.edu/people/abisee/">Abi See</a> and <a href="https://www.linkedin.com/in/pavel-komarov-a2834048">Pavel Komarov</a> for helping to edit this.</p> <h2 id="references">References</h2> <div class="footnotes"> <ol> <li id="fn:ManyFaces"> <p>David Fotland (1993). Knowledge representation in the Many Faces of Go. <a href="#fnref:ManyFaces" class="reversefootnote">&#8617;</a></p> </li> <li id="fn:MonteCarloGo"> <p>Brügmann, B. (1993). Monte carlo go (Vol. 44). Syracuse, NY: Technical report, Physics Department, Syracuse University. <a href="#fnref:MonteCarloGo" class="reversefootnote">&#8617;</a></p> </li> <li id="fn:ParisSchool1"> <p>Bouzy, B., &amp; Helmstetter, B. (2004). <a href="http://www.ai.univ-paris8.fr/~bh/articles/acg10-mcgo.pdf">Monte-carlo go developments</a>. In Advances in computer games (pp. 159-174). Springer US. <a href="#fnref:ParisSchool1" class="reversefootnote">&#8617;</a></p> </li> <li id="fn:ParisSchool2"> <p>Cazenave, T., &amp; Helmstetter, B. (2005). <a href="http://www.ai.univ-paris8.fr/~bh/articles/searchmcgo.pdf">Combining Tactical Search and Monte-Carlo in the Game of Go</a>. CIG, 5, 171-175. <a href="#fnref:ParisSchool2" class="reversefootnote">&#8617;</a></p> </li> <li id="fn:ParisSchool3"> <p>Bouzy, B. (2005). <a href="http://web.mi.parisdescartes.fr/~bouzy/publications/Bouzy-JCIS03.pdf">Associating domain-dependent knowledge and Monte Carlo approaches within a Go program</a>. Information Sciences, 175(4), 247-257. <a href="#fnref:ParisSchool3" class="reversefootnote">&#8617;</a></p> </li> <li id="fn:MonteCarloRevolution"> <p>Coulom, R. (2009, January). <a href="http://www.remi-coulom.fr/JFFoS/JFFoS.pdf">The Monte-Carlo Revolution in Go</a>. In The Japanese-French Frontiers of Science Symposium (JFFoS 2008), Roscoff, France. <a href="#fnref:MonteCarloRevolution" class="reversefootnote">&#8617;</a></p> </li> <li id="fn:MCTS"> <p>Coulom, R. (2006). Efficient selectivity and backup operators in Monte-Carlo tree search. In Computers and games (pp. 72-83). Springer Berlin Heidelberg. <a href="#fnref:MCTS" class="reversefootnote">&#8617;</a></p> </li> <li id="fn:BanditMonte"> <p>Kocsis, L., &amp; Szepesvári, C. (2006). <a href="http://www.sztaki.hu/~szcsaba/papers/ecml06.pdf">Bandit based monte-carlo planning</a>. In Machine Learning: ECML 2006 (pp. 282-293). Springer Berlin Heidelberg. 
<a href="#fnref:BanditMonte" class="reversefootnote">&#8617;</a></p> </li> <li id="fn:MoGo"> <p>Gelly, S., Wang, Y., Munos, R., &amp; Teytaud, O. (2006). <a href="https://hal.inria.fr/inria-00117266/document">Modification of UCT with patterns in Monte-Carlo Go</a>. Technical Report RR-6062, 32, 30-56. <a href="#fnref:MoGo" class="reversefootnote">&#8617;</a></p> </li> <li id="fn:GoHistory"> <p><a href="http://www.britgo.org/computergo/history">History of Go-playing programs.</a> http://www.britgo.org/computergo/history <a href="#fnref:GoHistory" class="reversefootnote">&#8617;</a></p> </li> <li id="fn:TDGo"> <p>Schraudolph, N. N., Dayan, P., &amp; Sejnowski, T. J. (1994). <a href="http://www.gatsby.ucl.ac.uk/~dayan/papers/sds94.pdf">Temporal difference learning of position evaluation in the game of Go</a>. Advances in Neural Information Processing Systems, 817-817. <a href="#fnref:TDGo" class="reversefootnote">&#8617;</a></p> </li> <li id="fn:DBN"> <p>Hinton, G. E., Osindero, S., &amp; Teh, Y. W. (2006). A fast learning algorithm for deep belief nets. Neural computation, 18(7), 1527-1554. <a href="#fnref:DBN" class="reversefootnote">&#8617;</a></p> </li> <li id="fn:Atari"> <p>Mnih, V., Kavukcuoglu, K., Silver, D., Graves, A., Antonoglou, I., Wierstra, D., &amp; Riedmiller, M. (2013). <a href="http://arxiv.org/abs/1312.5602">Playing atari with deep reinforcement learning</a>. arXiv preprint arXiv:1312.5602. <a href="#fnref:Atari" class="reversefootnote">&#8617;</a></p> </li> <li id="fn:Go2000sA"> <p>Sutskever, I., &amp; Nair, V. (2008). <a href="http://www.cs.utoronto.ca/~ilya/pubs/2008/go_paper.pdf">Mimicking Go experts with convolutional neural networks</a>. In Artificial Neural Networks-ICANN 2008 (pp. 101-110). Springer Berlin Heidelberg. <a href="#fnref:Go2000sA" class="reversefootnote">&#8617;</a></p> </li> <li id="fn:Go2000sB"> <p>Stern, D., Herbrich, R., &amp; Graepel, T. (2006, June). <a href="http://www.autonlab.org/icml_documents/camera-ready/110_Bayesian_Pattern_Ran.pdf">Bayesian pattern ranking for move prediction in the game of Go</a>. In Proceedings of the 23rd international conference on Machine learning (pp. 873-880). ACM. <a href="#fnref:Go2000sB" class="reversefootnote">&#8617;</a></p> </li> <li id="fn:Go2000sC"> <p>Wu, L., &amp; Baldi, P. F. (2006). <a href="https://papers.nips.cc/paper/3094-a-scalable-machine-learning-approach-to-go.pdf">A scalable machine learning approach to go</a>. In Advances in Neural Information Processing Systems (pp. 1521-1528). <a href="#fnref:Go2000sC" class="reversefootnote">&#8617;</a></p> </li> <li id="fn:Go2014A"> <p>Clark, C., &amp; Storkey, A. (2014). <a href="https://arxiv.org/abs/1412.3409">Teaching deep convolutional neural networks to play go</a>. arXiv preprint arXiv:1412.3409. <a href="#fnref:Go2014A" class="reversefootnote">&#8617;</a></p> </li> <li id="fn:Go2014B"> <p>Maddison, C. J., Huang, A., Sutskever, I., &amp; Silver, D. (2014). <a href="https://arxiv.org/abs/1412.6564">Move evaluation in go using deep convolutional neural networks</a>. arXiv preprint arXiv:1412.6564. <a href="#fnref:Go2014B" class="reversefootnote">&#8617;</a></p> </li> <li id="fn:AlphaGo"> <p>David Silver, Aja Huang, Christopher J.
Maddison, Arthur Guez, Laurent Sifre, George van den Driessche, Julian Schrittwieser, Ioannis Antonoglou, Veda Panneershelvam, Marc Lanctot, Sander Dieleman, Dominik Grewe, John Nham, Nal Kalchbrenner, Ilya Sutskever, Timothy Lillicrap, Madeleine Leach, Koray Kavukcuoglu, Thore Graepel &amp; Demis Hassabis (2016). Mastering the game of Go with deep neural networks and tree search. Nature, 529, 484-503. <a href="#fnref:AlphaGo" class="reversefootnote">&#8617;</a></p> </li> <li id="fn:DarkForest"> <p>Tian, Y., &amp; Zhu, Y. (2015). Better Computer Go Player with Neural Network and Long-term Prediction. arXiv preprint arXiv:1511.06410. <a href="#fnref:DarkForest" class="reversefootnote">&#8617;</a></p> </li> </ol> </div> <p><a href="/writing/a-brief-history-of-game-ai-part-3/">A 'Brief' History of Game AI Up To AlphaGo, Part 3</a> was originally published by Andrey Kurenkov at <a href="">Andrey Kurenkov's Web World</a> on April 18, 2016.</p> <![CDATA[A 'Brief' History of Game AI Up To AlphaGo, Part 2]]> /writing/a-brief-history-of-game-ai-part-2 2016-04-18T19:19:34-07:00 2016-04-18T19:19:34-07:00 www.andreykurenkov.com contact@andreykurenkov.com <p>This is the second part of ‘A Brief History of Game AI Up to AlphaGo’. Part 1 is <a href="/writing/a-brief-history-of-game-ai">here</a> and part 3 is <a href="/writing/a-brief-history-of-game-ai-part-3">here</a>. In this part, we shall cover just about four decades of progress, from the first victories of computers against people at Checkers and Chess all the way up to DeepBlue’s victory against humanity’s then-best living Chess player.</p> <h1 id="computers-start-to-win">Computers Start To Win</h1> <p class="sidenoteleftlarge">1958</p> <p>By the late 1950s, the industrious engineers at IBM were far from the only ones working on AI — excitement for the new field filled research groups in universities from the US to the Soviet Union. One such group was made up of Allen Newell and Herbert Simon (both attendants of the Dartmouth Conference) from Carnegie Mellon University, and Cliff Shaw from RAND Corporation. They collaborated on Chess AI from 1955 to 1958, culminating in <a href="http://aitopics.org/sites/default/files/classic/Feigenbaum_Feldman/C&amp;T-Newll-Shaw-Simon.pdf">“Chess Playing Programs and the Problem of Complexity”</a><sup id="fnref:NSS"><a href="#fn:NSS" class="footnote">1</a></sup> which both summarized existing Chess AI research and contributed new ideas that they tested with the NSS (Newell, Shaw, and Simon) Chess program.</p> <figure> <img class="postimage" src="/writing/images/2016-4-15-a-brief-history-of-game-ai/7-NSS.png" alt="bernstein_chess" /> <figcaption> A summary of 50s work on Chess AI from the NSS group. <a href="http://aitopics.org/sites/default/files/classic/Feigenbaum_Feldman/C&amp;T-Newll-Shaw-Simon.pdf"><b>(Source)</b></a></figcaption> </figure> <p>Just as Shannon noted that master players use intuition to think selectively about moves, Newell, Shaw and Simon considered heuristics to be an important aspect of human Chess-playing. Like Bernstein’s program, the NSS algorithm used a type of simple “intelligence” to choose which moves to explore. The group’s most significant addition to Minimax was an approximation of something that became an essential part of future Chess playing programs: <strong>alpha-beta pruning</strong>. 
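</p> <p>In code, the change from plain Minimax is small. Below is a generic sketch of the technique as it is usually presented today, not the NSS program itself; the game interface (legal_moves, apply, evaluate, is_terminal) is a hypothetical stand-in:</p>

<pre><code class="language-python">def alphabeta(state, depth, alpha, beta, maximizing):
    """Minimax with alpha-beta pruning over a hypothetical game interface."""
    if depth == 0 or state.is_terminal():
        return state.evaluate()  # heuristic score from the maximizer's point of view
    if maximizing:
        best = float("-inf")
        for move in state.legal_moves():
            best = max(best, alphabeta(state.apply(move), depth - 1, alpha, beta, False))
            alpha = max(alpha, best)
            if alpha >= beta:    # the opponent already has a better option elsewhere,
                break            # so the remaining moves at this node are never simulated
        return best
    else:
        best = float("inf")
        for move in state.legal_moves():
            best = min(best, alphabeta(state.apply(move), depth - 1, alpha, beta, True))
            beta = min(beta, best)
            if alpha >= beta:
                break
        return best
</code></pre> <p>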
This is a small modification to Minimax that makes the algorithm avoid simulating moves that are clearly bad (‘pruning’ branches of the tree that need not be simulated), thus saving those precious computing resources for more promising moves. The efficiency gains can be huge, and alpha-beta pruning became as standard for future Chess programs as Minimax itself.</p> <figure> <img class="postimage" src="/writing/images/2016-4-15-a-brief-history-of-game-ai/8A-alphabeta.jpg" /> <figcaption>A tiny example of alpha-beta pruning. You are currently at position A and have three move options: B, C and D. You want to maximize your end-game score. For any move you make, the opponent will choose a move so as to minimize your score. The worst score you might get with option B is 3, so as soon as you see your opponent has a response in option C that only nets you a score of 2, you can cease to explore option C because option B is already definitely better. <a href="http://cs.stackexchange.com/questions/1069/what-use-are-the-minimum-values-on-minimax-trees"><b>(Source)</b></a></figcaption> </figure> <p>In emphasizing the need for such heuristics, the NSS group also argued that implementing them would be much easier with higher-level “interpreted” programming languages — again, this is in 1958! Back then programmers worked in the binary language of the computer, so another notable aspect of the NSS group’s work is their use of a symbolic compiled programming language to implement a more complex program. As with Bernstein’s program, the limitations of the hardware and of the code resulted in a rather shoddy Chess player. Still, it <a href="https://chessprogramming.wikispaces.com/NSS">has been said</a> to be the first chess program to beat (an almost humorously inexperienced) human player:</p> <blockquote> <p>“In 1958, a chess program (NSS) beat a human player for the first time. The human player was a secretary who was taught how to play chess one hour before her game with the computer. The computer program was played on an IBM 704. The computer displayed a level of chess-playing expertise greater than an adult human could gain from one hour of chess instruction.”</p> </blockquote> <p class="sidenoteleftlarge">1962</p> <p>Meanwhile, Arthur Samuel’s Checkers program played Checkers well already, and continued to get better. In 1962, Samuel and IBM had enough faith in the program to publicly pit it against a good human player. As described in a wonderful <a href="https://webdocs.cs.ualberta.ca/~chinook/project/legacy.html">retrospective about the event</a>, they strangely chose the human opponent to be Robert Nealy, who considered himself a master but was not ranked highly as a tournament player. Partially because of this, and partially because the program was good at Checkers, Nealy lost. Though it would soon be clear Samuel’s program was no match for the best human players at the game — it was easily beaten by two of them at the 1966 world championship — the public and media reaction to its win in 1962 was not unlike the current media frenzy around AlphaGo:</p> <blockquote> <p>“Wait! Hold the presses! A computer defeated a master checkers player! This was a major news story. Computers could solve the game of checkers. Mankind’s intellectual superiority was being challenged by electronic monsters. To the technology-illiterate public of 1962, this was a major event. It was a precursor to machines doing other intelligent things better than man. 
How long could it possibly be before computers would be smarter than man? After all, comput­ers have only been around for a few years, and already rapid progress was being made in the fledgling computer field of artificial intelligence. Paranoia.” <a href="https://webdocs.cs.ualberta.ca/~chinook/project/legacy.html">Source</a></p> </blockquote> <figure class="sidefigureleft"> <img class="postimageactual" src="/writing/images/2016-4-15-a-brief-history-of-game-ai/8B-GoChess.jpg" alt="Chess vs Go" /> <figcaption>A comparison of Chess vs Go. Go has a much larger branching factor and a set of rules for which it is much harder to write an evaluation function (and here's <a href="https://en.wikipedia.org/wiki/Go_%28game%29">a handy link to Wikipedia</a> for those). <a href="http://spectrum.ieee.org/computing/software/cracking-go/chess-vs-go"><b>(Source)</b></a></figcaption> </figure> <p>AlphaGo’s victory is of course in a different league - Lee Sedol is unquestionably among the best players in the world and our computer-acclimated culture is less shocked by such a feat — but it is interesting to note the similarities between the two highly publicized events. Despite the fact Samuel’s program was nowhere near as good as the best humans, its win gave the lasting impression Checkers was a ‘simple’ game that computers had already conquered and that Chess was the real challenge, much as Go was seen after Deep Blue’s success with Chess. Speaking of which, 1962 was the year the first computer Go program was attempted with <a href="http://www.britgo.org/files/computergo/remus.pdf">“Simulation of a Learning Machine For Playing Go”</a><sup id="fnref:RemusGo"><a href="#fn:RemusGo" class="footnote">2</a></sup> by H. Remus (also at IBM!), though the resulting program was incomplete and never played a full game of Go. It would be half a decade more until a true Go program akin to Bernstein’s Chess program or Samuel’s Checkers program would play human players.</p> <p>Meanwhile, yet more research teams in the Soviet Union and in the US were working on implementing Chess AIs. Notably, a group of students at MIT led by AI legend John McCarthy developed a Chess-playing program based on Minimax with alpha-beta pruning, and in 1966 faced it off against a program developed at the Moscow Institute of Theoretical and Experimental Physics (ITEP) by telegram. The Kotok-McCarthy program lost 3-1, and was in general very weak due to being limited to searching very few positions (fewer than Bernstein’s program, even). But, another student named Richard Greenblatt saw the program and, being a skilled chess player, was inspired to write his own - the <a href="https://en.wikipedia.org/wiki/Mac_Hack">Mac Hack</a>. This program searched through many more positions and had other refinements, to the point that it could beat a ranked human player in a tournament in 1967 and win or draw multiple times more in succeeding tournaments. But it was still nowhere near as good as the best players.</p> <div><button class="btn" data-toggle="collapse" data-target="#greenblatt"> Aside: more on Richard Greenblatt's Chess Program &raquo; </button></div> <blockquote class="aside"><p id="greenblatt" class="collapse" style="height: 0px;"> There is <a href="http://archive.computerhistory.org/resources/text/Oral_History/Greenblatt_Richard/greenblatt.oral_history_transcript.2005.102657935.pdf">a fun oral history of Richard Greenblatt's</a> that is quite worth looking over if you are curious. 
Here are some choice excerpts:<br /><br /> "Anyway, I looked at this thing and I could see that the quality of the analysis was not good. And I said, gee, I can do better than that. And so I immediately set to it, and within just a few weeks, after I got back, I had the thing playing chess.<br /> ...<br /> And so then as word got around — Well, there was a guy a MIT in those days named Hubert Dreyfuss, who was a prominent critic of artificial intelligence, and made some statements of the form, you know, computers will never be any good for chess, and so forth. And, of course, he was, again, very romanticized. He was not a strong chess player. However, he thought he was, or I guess he knew he wasn’t world class, but he thought he was a lot better than he was. So anyway, I had this chess program and basically Jerry Sussman, who’s a professor at MIT now, and who was also a member of our group, had played. It was around and it was available on the machine. People played it, and so forth. And basically Sussman brought over Dreyfuss and said, well, how would you like to have a friendly game or something. Dreyfuss said, oh, sure. And sure enough, Dreyfuss sat down and got beat. So this immediately got quite a bit of publicity. " </p></blockquote> <figure class="sidefigureright"> <img class="postimageactual" src="/writing/images/2016-4-15-a-brief-history-of-game-ai/8C-1968go.jpg" /> <figcaption>A figure showing Zobrist's visual representation from <a href="http://www.computer.org/csdl/proceedings/afips/1969/5073/00/50730103.pdf">the paper</a>.</figcaption> </figure> <p>Then, in 1968 a Go playing program reached the milestone that was conquered for Chess a whole decade earlier: beating a wholly inexperienced amateur. The program did not rely on tree search, but was rather based on emulating the way a human player “sees” an internal representation of a game position in Go so as to recognize patterns that matter for choosing the correct move. Interestingly, much of the power of AlphaGo is based on creating powerful internal representations of the board with Machine Learning techniques commonly applied to visual tasks, so the intuition here was in a way quite right. This feat was achieved by Albert Zobrist, as described in <a href="http://www.computer.org/csdl/proceedings/afips/1969/5073/00/50730103.pdf">“A model of visual organization for the game of GO”</a><sup id="fnref:ZobristGo"><a href="#fn:ZobristGo" class="footnote">3</a></sup>:</p> <blockquote> <p>“Given that a player “sees” a fairly stable and uniform internal representation, it follows that familiar and meaningful configurations may be recognized in terms of it. The result of visual organization is to classify a tremendous number of possible board situations into a much smaller number of recognizable or familiar board situations. Thus a player can respond to a board position he has never encountered, because it has been mapped into a familiar internal representation. This report will describe a simulation model for visual organization. … The program now has a record of two wins and two losses against human opponents. The opponents can best be described as intelligent adults who know how to play GO, have played from two to twenty games but have not studied the game. The program appears to have reached the bottom rung of the ladder of human GO players.
“</p> </blockquote> <p>Because tricky ideas like this were necessary in order to cope with the huge branching factor and hard-to-codify heuristics of the game, progress for Go playing programs was much slower than for Chess or Checkers. It would be another decade until Bruce Wilcox developed a stronger program, again without reliance on traditional game AI techniques but with some limited tree search (as covered in <a href="http://www.wired.com/2014/05/the-world-of-computer-go/">this great Wired story</a>). The approach there was to subdivide the bigger board into smaller regions that were easier to reason about, which would continue to be a hallmark of Go AIs. But even then, it was nowhere near even decent human play.</p> <p>The same could not be said of Chess programs. Throughout the 70s, Chess AI progressed mostly by refining previously successful approaches. For instance, in the early 70s the Chess AI group at ITEP refined their program into a better version they named Kaissa, which went on to become the first computer Chess champion of the world in 1974 after squaring off against US programs. The program significantly benefited from faster computers and an efficient implementation that included alpha-beta pruning and some other tricks, for the first time showing the strength of the Shannon ‘type A’ AI strategy that relied more on fast search than smart heuristics or position evaluation.</p> <p>But also by this point, it was becoming typical to use extra ‘type B’ ideas such as <a href="https://chessprogramming.wikispaces.com/Quiescence+Search">quiescence</a> (basically searching further only after moves involving captures or checks, to not mistake trades for piece captures). It turned out to be very beneficial to selectively search further in certain tree paths than in others, so as to not miss critical turns in the game. As we’ll see, these techniques ultimately proved sufficient to write Chess AIs that can beat all of humanity — though it would take a while longer to get there…</p> <h1 id="humans-stop-winning-except-for-go">Humans Stop Winning… except for Go</h1> <p class="sidenoteleftlarge">1989</p> <p>The first computer program to completely dominate humans at a complex game was not developed until about 3 decades after Samuel’s Checkers program won that one game against Robert Nealy, and it was the Checkers program CHINOOK. The program was developed by a team at the University of Alberta led by Jonathan Schaeffer, starting in 1989. By 1994 the best Checkers player on the planet only managed to play CHINOOK to a draw <sup id="fnref:Chinook"><a href="#fn:Chinook" class="footnote">4</a></sup>.</p> <figure> <img class="postimage" src="/writing/images/2016-4-15-a-brief-history-of-game-ai/9-chinook.jpg" alt="games_history" /> <figcaption>Chinook being put to the test against the world champion in Checkers. <a href="http://afflictor.com/2015/12/16/within-the-decade-the-computer-will-know-how-the-game-will-turn-out-even-before-it-begins/"><b>(Source)</b></a></figcaption> </figure> <p>By the 90s computers got orders of magnitude faster, and computer memory orders of magnitude larger, compared even to the computers of the 70s. This both enabled and enhanced the several techniques that powered CHINOOK: (A) a database of opening moves from games played by grandmasters, (B) alpha-beta tree search with an evaluation function based on a linear combination of many handcrafted features, and (C) an end-game database for all positions with fewer than eight pieces. And that’s it! 
That’s the recipe to a world-class Checkers playing program.</p> <p>A similar recipe also powered a world-class Chess program developed around that time - Deep Thought. Developed by a team headed by Feng-hsiung Hsu, it incorporated all these ideas and had two notable extra strengths: custom hardware and smart selective extensions. According to a <a href="http://www.aaai.org/ojs/index.php/aimagazine/article/viewFile/753/671">retrospective about its success</a>, it was the fastest Chess program up to that point in terms of how many positions it could consider per second. This was achieved by performing move simulation and evaluation with custom circuit boards, which worked in tandem with software running on a powerful computer. In addition to being fast, Deep Thought was also smart: it had <em>singular extensions</em>, a nice type of <a href="https://chessprogramming.wikispaces.com/Extensions">selective extension</a> of search past the default depth at promising positions. This allowed search depth to be extended considerably: “The result is that on the average, an N ply search penetrates along the principal variation to a depth of 1.5N and reaches a depth of 3N about once in a game”<sup id="fnref:DeepThoughtWins"><a href="#fn:DeepThoughtWins" class="footnote">5</a></sup><sup id="fnref:DeepThoughtExtensions"><a href="#fn:DeepThoughtExtensions" class="footnote">6</a></sup>.</p> <p>So, Deep Thought was successful precisely because it was a combination of ‘type A’ brute force AI (searching all positions up to a certain depth) and ‘type B’ selective search (searching past that depth in certain cases). By 1988, Deep Thought became the computer Chess champion of the world and, more impressively, beat Chess grandmaster Bent Larsen.</p> <figure> <img class="postimage" src="/writing/images/2016-4-15-a-brief-history-of-game-ai/10-deepthought.jpg" alt="deep_thought" /> <figcaption>The Deep Thought team showing off their custom hardware when they won the Fredkin Intermediate Prize: "In 1988 Deep Thought and Grandmaster Tony Miles shared first place in the Software Toolworks Open in Los Angeles. Deep Thought had a 2745 performance rating, and moved its U.S. Chess Federation (USCF) rating up to 2551, and qualified for the $10,000 Fredkin Intermediate Prize as the first computer to achieve a USCF performance rating of 2500 over a set of 25 contiguous games in human tournaments." <a href="https://chessprogramming.wikispaces.com/Deep+Thought"><b>(Source)</b></a></figcaption> </figure> <p>Another interesting aspect of Deep Thought is that its evaluation function was automatically tuned using a database of games between master chess players, rather than having all the function’s parameters hardcoded by its programmers. In this respect it harkened all the way back to Arthur Samuel’s Checkers program, which also had the ability to ‘learn’ by tuning its evaluation function from experience. Though Chess programs improved over the decades due to increased computer speeds and ideas such as alpha-beta pruning and selective extensions, almost all programs still had no learning component and ultimately derived all their intelligence fully from their human creators. Deep Thought was a notable break from this trend.</p> <p>Still, the structure of Deep Thought’s evaluation function encoded a lot of human intuition and knowledge about the game of chess, as was the norm. 
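</p> <p>To see what that looks like in practice, here is a toy evaluation function in the classic handcrafted style. The features, weights, and helper functions below are invented for illustration; they are not Deep Thought's:</p>

<pre><code class="language-python"># Toy handcrafted chess evaluation: a weighted sum of human-chosen features.
# Positive scores favor White. The feature extractors are hypothetical helpers
# that a programmer would have to write (and tune) by hand.
PIECE_VALUES = {"P": 1.0, "N": 3.0, "B": 3.2, "R": 5.0, "Q": 9.0}

WEIGHTS = {
    "material": 1.0,     # difference in total piece value
    "mobility": 0.1,     # difference in number of legal moves
    "king_safety": 0.5,  # difference in pawn shelter around the kings
}

def evaluate(position):
    features = {
        "material": material_balance(position, PIECE_VALUES),
        "mobility": mobility_difference(position),
        "king_safety": king_safety_difference(position),
    }
    return sum(WEIGHTS[name] * value for name, value in features.items())
</code></pre> <p>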
This is problematic, because the tough part in playing a game (how to evaluate positions and select moves) is essentially still solved by the programmer and not the program itself. It should be possible to write AI programs that could just learn this stuff by themselves, right? Right, and this was soon done for the first time using a technique that was later essential to AlphaGo: <strong>neural networks</strong>. Neural networks are a technique for <strong>supervised machine learning</strong>, which is just a category of algorithms that can learn to produce a desired output for some type of input by viewing many training examples of known input/output pairs of the same type. (For an in-depth explanation feel free to look at <a href="http://www.andreykurenkov.com/writing/a-brief-history-of-neural-nets-and-deep-learning/">my little neural net writeup</a>). Following a major 1986 paper describing how larger neural nets can be trained for tougher problems, they were all the rage in the late 80s and were being applied to many sorts of problems. One such application was the backgammon AI dubbed Neurogammon.</p> <figure> <img class="postimageactual" src="/writing/images/2016-4-15-a-brief-history-of-game-ai/11.5-supervised.png" alt="Supervised Learning" /> <figcaption>Visualization of supervised learning. The inputs are size and domestication, and the output is a classification of 'dog' or 'cat'. The dots already on the graph are the <b>training set</b> for learning, and the lines are the learned functions for getting an output for inputs not in the training set. <a href="https://en.wikipedia.org/wiki/Perceptron#/media/File:Perceptron_example.svgl"><b>(Source)</b></a>, By <a href="//commons.wikimedia.org/w/index.php?title=User:Elizabeth_goodspeed&amp;action=edit&amp;redlink=1" class="new" title="User:Elizabeth goodspeed (page does not exist)">Elizabeth Goodspeed</a> - <span class="int-own-work" lang="en">Own work</span>, <a title="Creative Commons Attribution-Share Alike 4.0" href="http://creativecommons.org/licenses/by-sa/4.0">CC BY-SA 4.0</a></figcaption> </figure> <p>Like Go, Backgammon has a huge branching factor and the traditional tree-search-with-handcrafted-evaluation-function approach does not work well. A large branching factor makes it impossible to search many moves ahead, and it is very difficult to write a great evaluation function to compensate. Gerald Tesauro, a researcher at the University of Illinois and later IBM (surprise!), and renowned Machine Learning researcher Terrence Sejnowski explored an approach based on learning a good evaluation function (a goal that had been abandoned since Arthur Samuel’s work). As explained in their 1989 paper <a href="http://papers.cnl.salk.edu/PDFs/A%20Parallel%20Network%20That%20Learns%20to%20Play%20Backgammon%201989-2965.pdf">“A parallel network that learns to play backgammon”</a>, they trained a neural net to accept as input a backgammon game position and a potential move, and to output a score measuring the quality of that move <sup id="fnref:NeuroGammon"><a href="#fn:NeuroGammon" class="footnote">7</a></sup>. This approach removes the need for engineers to attempt to encode human intuition when writing the program, which is ideal.
However, to make the approach work well some human intuition was still encoded in the system in the form of <strong>features</strong> — derived aspects of the game position, for example piece counts in Chess — also used as input in addition to the raw game position.</p> <figure class="sidefigureleft"> <div><button class="btn" data-toggle="collapse" data-target="#features"> Aside: why use features? &raquo; </button></div> <blockquote class="aside"><p id="features" class="collapse" style="height: 0px;"> When building machine learning systems with large 'raw data' (images, audio, or game states), it is typical to use informative <b>features</b> extracted from the data with human-written code. These features are then used as the input instead of the raw data. Intuitively, this makes the learning problem easier by giving the machine learning algorithm only the useful information from the input and not forcing it to figure that bit out by itself. So-called <b>feature-engineering</b> used to be a standard step in building machine learning systems, and was possibly one of the most time-consuming steps since (as with evaluation functions) coming up with and implementing good features is not always easy. Nowadays 'deep learning' (which we shall get to soon) has made it more typical to learn directly from the raw data. Indeed deep learning seems to derive much of its power from its ability to learn useful features better than the ones humans can implement. </p></blockquote> </figure> <figure> <img class="postimagesmall" src="/writing/images/2016-4-15-a-brief-history-of-game-ai/11-nntraining.png" alt="Backprop" /> <figcaption>Supervised learning with neural nets. Basically, neural nets are made up of a bunch of units that each just output a weighted sum of their input, and the correct weights for a given application are learned from training data. Neurogammon worked exactly like this, except that the inputs were backgammon game positions, as well as derived features of the game positions, and the outputs were scores for the game position. <a href="http://devblogs.nvidia.com/parallelforall/inference-next-step-gpu-accelerated-deep-learning/">(Source)</a></figcaption> </figure> <p class="sidenoteleftlarge">1992</p> <p>With further improvements, the program was dubbed Neurogammon 1.0 and went on to win against all other programs at the 1989 First Computer Olympiad <sup id="fnref:NeuroGammonWins"><a href="#fn:NeuroGammonWins" class="footnote">8</a></sup>. However, it was still not as strong as the best human players, a feat that would soon be achieved by another neural net based program by Gerald Tesauro: TD-Gammon. First unveiled to the world in 1992, TD-Gammon was a hugely successful application of <strong>reinforcement learning</strong>. Unlike supervised learning, which approximates some function with particular types of input and outputs, reinforcement learning deals with finding optimal choices in different situations. More specifically, we think in terms of states (situations), in which an agent (the program) can take actions that change the agent’s state in a known way (choices). Every transition between states results in a numeric ‘reward’, and figuring out the right action to take in a given state in order to get the highest reward in the long term is what reinforcement learning is broadly about.
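</p> <p>The core of the simplest temporal-difference method, the family of techniques TD-Gammon belongs to, fits in a few lines: after each observed transition, nudge the value estimate of the state you were just in toward the reward you received plus the (discounted) value estimate of the state you ended up in. A minimal tabular sketch of that update, not TD-Gammon itself (which used a neural net rather than a table):</p>

<pre><code class="language-python">from collections import defaultdict

values = defaultdict(float)  # estimated long-term reward for each state
alpha, gamma = 0.1, 0.99     # learning rate and discount factor

def td_update(state, reward, next_state):
    """TD(0): move the value of state toward reward + discounted next value."""
    target = reward + gamma * values[next_state]
    values[state] += alpha * (target - values[state])

# While playing, call td_update for every (state, reward, next_state) transition;
# over many games the table of values converges toward the true expected rewards.
</code></pre> <p>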
Whereas supervised learning learns to approximate a function via examples of inputs and outputs, reinforcement learning generally learns from ‘experience’ of receiving rewards after trying actions in different states.</p> <figure> <img class="postimageactual" src="/writing/images/2016-4-15-a-brief-history-of-game-ai/12-rl.png" alt="RL" /> <figcaption>A visualization of the general idea of reinforcement learning. Rather than learning to compute a correct output given some input, as in supervised learning, the goal is to learn to choose a correct action in any state in order to obtain the maximum reward in the long term. <a href="http://www2.hawaii.edu/~chenx/ics699rl/grid/rl.html"><b>(Source)</b></a></figcaption> </figure> <p>So, TD-Gammon learned by just playing games of backgammon against prior versions of itself, observing which player won, and using that experience to tune a neural net to produce a probability of winning from any given position. This is fundamentally different from Neurogammon, which required compiling a dataset of hundreds of moves with human-assigned scores and was thus much more cumbersome than just letting the program play games against itself for a few hours. Note this is very similar to what Arthur Samuel was trying to do all the way back in 1957 with his Checkers program that learned from self-play. In fact, the type of reinforcement learning TD-Gammon is based on, Temporal Difference Learning, was developed in 1986 by Richard Sutton as a formalization of the learning in Samuel’s work<sup id="fnref:TDSutton"><a href="#fn:TDSutton" class="footnote">9</a></sup>.</p> <p class="sidenoteleftlarge">1994</p> <p>Besides learning through self-play, there was nothing fancy in the approach - instead of using tuned tree search the program just exhaustively looked at all positions two steps ahead and used the move that led to the largest probability of winning. With just the raw board positions as input — essentially no human intuition engineered into it — TD-Gammon achieved a level of play comparable to Neurogammon. And with the addition of Neurogammon’s features, it became comparable to the best human players in the world<sup id="fnref:TDGammon"><a href="#fn:TDGammon" class="footnote">10</a></sup>.</p> <figure> <img class="postimage" src="/writing/images/2016-4-15-a-brief-history-of-game-ai/13-tdgammon.png" alt="TDGammon" /> <figcaption>The TD-Gammon neural net that learned to play expert-level Backgammon. The input later included features in addition to the raw board positions. <a href="https://webdocs.cs.ualberta.ca/~sutton/book/ebook/node108.html"><b>(Source)</b></a></figcaption> </figure> <p>TD-Gammon is to this day a milestone in the history of AI. But, when researchers naturally tried to use the same approach for other games, the results were not quite as impressive. Sebastian Thrun’s NeuroChess<sup id="fnref:NeuroChess"><a href="#fn:NeuroChess" class="footnote">11</a></sup> (1995) was only comparable to commercial Chess programs on a low difficulty setting, and Markus Enzenberger’s NeuroGo<sup id="fnref:NeuroGo"><a href="#fn:NeuroGo" class="footnote">12</a></sup> (1996) likewise did not match the skill of existing (poor) Go AIs. 
In the case of NeuroChess, the discrepancy was surmised to be due in large part to how long it took to compute the evaluation function (“Computing a large neural network function takes two orders of magnitude longer than evaluating an optimized linear evaluation function (like that of GNU-Chess)”), making NeuroChess unable to explore nearly as many moves ahead as the commercial Chess program. The benefit of a better evaluation function just did not win out over a simpler one that allowed for many more positions to be explored during search.</p> <p class="sidenoteleftlarge">1997</p> <p>Which brings us back to Deep Thought. After the success of that program, some of the same team were hired by IBM and set out to create Deep Thought II, which was later renamed Deep Blue (Deep Thought x Big Blue = Deep Blue). By and large, Deep Blue was conceptually the same as Deep Thought but much, much beefier in terms of computing power — it was a custom-built supercomputer! Still, when it played Kasparov in 1996, Deep Blue lost with a score of 2-4. The team then spent a year making Deep Blue yet more powerful and tuning its evaluation function, and it was this version that historically beat Kasparov with a score of 3.5-2.5 on May 11th of 1997<sup id="fnref:DeepBlue"><a href="#fn:DeepBlue" class="footnote">13</a></sup>.</p> <figure class="sidefigureright"> <img class="postimageactual" src="/writing/images/2016-4-15-a-brief-history-of-game-ai/15-deepblue.jpg" alt="Kasparov" /> <figcaption>The supercomputer that powered Deep Blue <a href="https://www-03.ibm.com/ibm/history/exhibits/vintage/vintage_4506VV1001.html"><b>(Source)</b></a></figcaption> </figure> <figure class="sidefigureleft"> <img class="postimage" src="/writing/images/2016-4-15-a-brief-history-of-game-ai/14-kasparov.jpg" alt="Kasparov" /> <figcaption>Kasparov vs Deep Blue <a href="http://stanford.edu/~cpiech/cs221/apps/deepBlue.html"><b>(Source)</b></a></figcaption> </figure> <figure> <figure> <iframe src="https://www.youtube.com/embed/NJarxpYyoFI" frameborder="0" allowfullscreen=""></iframe> </figure> <figcaption>A short documentary about Kasparov vs Deep Blue.</figcaption> </figure> <p>The team credited many things with getting Deep Blue to the point that it could score this victory<sup id="fnref:DeepBlue:1"><a href="#fn:DeepBlue" class="footnote">13</a></sup>:</p> <blockquote> <p>“There were a number of factors that contributed to this success, including:<br /> 1. a single-chip chess search engine,<br /> 2. a massively parallel system with multiple levels of parallelism,<br /> 3. a strong emphasis on search extensions,<br /> 4. a complex evaluation function, and<br /> 5. effective use of a Grandmaster game database”<br /></p> </blockquote> <p>So, it would be wrong to claim Deep Blue won purely through “brute-force”, since it included decades of ideas about how to tackle the AI problem of Chess. But brute-force surely was hugely important - Deep Blue was run with thirty processors inside a supercomputer working jointly with 480 single-chip chess search engines (16 per processor). When playing Kasparov it observed 126 million positions per second on average, and typically searched to a depth of between 6 and 12 plies and to a maximum of forty plies. All this allowed it to barely win, arguably due to uncharacteristic blunders on Kasparov’s part.
But, all that hardly matters; since then computers have continued to become exponentially faster, and today humanity’s best Chess players are likely no match for programs you can run on your smartphone.</p> <p>So, Checkers, Chess, and Backgammon had all been mastered by AI programs by the late 90s - what about Go? Even the best computer programs were poor matches for amateurs with some experience. The techniques we’ve seen so far — supervised learning, reinforcement learning, and well-tuned tree search — were all attempted and found insufficient to make a Go program that could challenge serious human players. To see why these approaches failed, and how their defects were addressed over the span of two decades culminating in the creation of AlphaGo, go on ahead to <a href="/writing/a-brief-history-of-game-ai-part-3">the final part of this history</a>.</p> <h2 id="acknowledgements">Acknowledgements</h2> <p>Big thanks to <a href="http://cs.stanford.edu/people/abisee/">Abi See</a> and <a href="https://www.linkedin.com/in/pavel-komarov-a2834048">Pavel Komarov</a> for helping to edit this.</p> <h2 id="references">References</h2> <div class="footnotes"> <ol> <li id="fn:NSS"> <p>Allen Newell, Cliff Shaw, Herbert Simon (1958). Chess Playing Programs and the Problem of Complexity. IBM Journal of Research and Development, Vol. 4, No. 2, pp. 320-335. Reprinted (1963) in Computers and Thought (eds. Edward Feigenbaum and Julian Feldman), pp. 39-70. McGraw-Hill, New York, N.Y. pdf <a href="#fnref:NSS" class="reversefootnote">&#8617;</a></p> </li> <li id="fn:RemusGo"> <p>Remus, H. (1962, January). Simulation of a learning machine for playing Go. In COMMUNICATIONS OF THE ACM (Vol. 5, No. 6, pp. 320-320). 1515 BROADWAY, NEW YORK, NY 10036: ASSOC COMPUTING MACHINERY. <a href="#fnref:RemusGo" class="reversefootnote">&#8617;</a></p> </li> <li id="fn:ZobristGo"> <p>Zobrist, A. L. (1969, May). A model of visual organization for the game of Go. In Proceedings of the May 14-16, 1969, spring joint computer conference (pp. 103-112). ACM. <a href="#fnref:ZobristGo" class="reversefootnote">&#8617;</a></p> </li> <li id="fn:Chinook"> <p>Schaeffer, J., Lake, R., Lu, P., &amp; Bryant, M. (1996). <a href="https://www.aaai.org/ojs/index.php/aimagazine/article/viewFile/1208/1109">CHINOOK, The World Man-Machine Checkers Champion</a>. AI Magazine, 17(1), 21. <a href="#fnref:Chinook" class="reversefootnote">&#8617;</a></p> </li> <li id="fn:DeepThoughtWins"> <p>Berliner, H. J. (1989). <a href="https://pdfs.semanticscholar.org/bf2d/10d4bc292762f8ca5e648a0668baafd2e551.pdf">Deep Thought Wins Fredkin Intermediate Prize</a>. AI Magazine, 10(2), 89. <a href="#fnref:DeepThoughtWins" class="reversefootnote">&#8617;</a></p> </li> <li id="fn:DeepThoughtExtensions"> <p>Anantharaman, T., Campbell, M. S., &amp; Hsu, F. H. (1990). Singular extensions: Adding selectivity to brute-force searching. Artificial Intelligence, 43(1), 99-109. <a href="#fnref:DeepThoughtExtensions" class="reversefootnote">&#8617;</a></p> </li> <li id="fn:NeuroGammon"> <p>Tesauro, G., &amp; Sejnowski, T. J. (1989). <a href="http://papers.cnl.salk.edu/PDFs/A%20Parallel%20Network%20That%20Learns%20to%20Play%20Backgammon%201989-2965.pdf">A parallel network that learns to play backgammon. Artificial Intelligence</a>, 39(3), 357-390. <a href="#fnref:NeuroGammon" class="reversefootnote">&#8617;</a></p> </li> <li id="fn:NeuroGammonWins"> <p>Tesauro, G. (1989). Neurogammon wins computer olympiad. Neural Computation, 1(3), 321-323. 
<a href="#fnref:NeuroGammonWins" class="reversefootnote">&#8617;</a></p> </li> <li id="fn:TDSutton"> <p>Sutton, R. S. (1988). <a href="https://webdocs.cs.ualberta.ca/~sutton/papers/sutton-88-with-erratum.pdf">Learning to predict by the methods of temporal differences</a>. Machine learning, 3(1), 9-44. <a href="#fnref:TDSutton" class="reversefootnote">&#8617;</a></p> </li> <li id="fn:TDGammon"> <p>Tesauro, G. (1994). <a href="http://www.aaai.org/Papers/Symposia/Fall/1993/FS-93-02/FS93-02-003.pdf">TD-Gammon, a self-teaching backgammon program, achieves master-level play</a>. Neural computation, 6(2), 215-219. <a href="#fnref:TDGammon" class="reversefootnote">&#8617;</a></p> </li> <li id="fn:NeuroChess"> <p>Thrun, S. (1995). <a href="http://www-preview.ri.cmu.edu/pub_files/pub1/thrun_sebastian_1995_8/thrun_sebastian_1995_8.pdf">Learning to play the game of chess</a>. Advances in neural information processing systems, 7. <a href="#fnref:NeuroChess" class="reversefootnote">&#8617;</a></p> </li> <li id="fn:NeuroGo"> <p>Enzenberger, Markus. <a href="http://www.cgl.ucsf.edu/go/Programs/neurogo-html/neurogo.html">“The integration of a priori knowledge into a Go playing neural network.”</a> (1996). <a href="#fnref:NeuroGo" class="reversefootnote">&#8617;</a></p> </li> <li id="fn:DeepBlue"> <p>Campbell, M., Hoane, A. J., &amp; Hsu, F. H. (2002). <a href="http://www.mimuw.edu.pl/~ewama/zsi/deepBlue.pdf">Deep blue</a>. Artificial intelligence, 134(1), 57-83. <a href="#fnref:DeepBlue" class="reversefootnote">&#8617;</a> <a href="#fnref:DeepBlue:1" class="reversefootnote">&#8617;<sup>2</sup></a></p> </li> </ol> </div> <p><a href="/writing/a-brief-history-of-game-ai-part-2/">A 'Brief' History of Game AI Up To AlphaGo, Part 2</a> was originally published by Andrey Kurenkov at <a href="">Andrey Kurenkov's Web World</a> on April 18, 2016.</p> <![CDATA[A 'Brief' History of Game AI Up To AlphaGo, Part 1]]> /writing/a-brief-history-of-game-ai 2016-04-18T19:19:34-07:00 2016-04-18T19:19:34-07:00 www.andreykurenkov.com contact@andreykurenkov.com <figure class="figure"><div class="figure__main"> <p><a href="/writing/images/2016-4-15-a-brief-history-of-game-ai/0-history.png"><img class="postimageactual" src="/writing/images/2016-4-15-a-brief-history-of-game-ai/0-history.png" alt="History" /></a></p> </div><figcaption class="figure__caption"><p>Just about the scope of this series of posts. Created with <a href="http://www.readwritethink.org/classroom-resources/student-interactives/timeline-30007.html">Timeline</a>.</p> </figcaption></figure> <p>This is the first part of ‘A Brief History of Game AI Up to AlphaGo’. Part 2 is <a href="/writing/a-brief-history-of-game-ai-part-2">here</a> and part 3 is <a href="/writing/a-brief-history-of-game-ai-part-3">here</a>. In this part, we shall cover the birth of AI and the very first game-playing AI programs to run on digital computers.</p> <h1 id="prologue-at-long-last-algorithms-triumph-over-humans-at-go">Prologue: At Long Last, Algorithms Triumph Over Humans At Go</h1> <p class="sidenoteleftlarge">2016</p> <p>On March 9th of 2016, a historic milestone for AI was reached when the Google-engineered program AlphaGo defeated the world-class Go champion Lee Sedol. Go is a two-player strategy board game like Chess, but the larger number of possible moves and difficulty of evaluation make Go the harder problem for AI. So it was a big deal when, a week and four more games against Lee Sedol later, AlphaGo was crowned the undisputed winner of their match having lost only one game. 
How big a deal? Media coverage accurately described AlphaGo as a <a href="http://www.theguardian.com/technology/2016/mar/09/google-deepmind-alphago-ai-defeats-human-lee-sedol-first-game-go-contest">“major breakthrough for AI”</a> that achieved <a href="http://www.theverge.com/2016/3/12/11210650/alphago-deepmind-go-match-3-result">“one of the most sought-after milestones in the field of AI research”</a>. Comment boards were less reserved, with many describing AlphaGo’s victory as scary or even a sign that superhuman AI was now imminent.</p> <p>Months before that day, I was excitedly skimming <a href="https://research.google.com/pubs/pub44806.html">the paper on AlphaGo</a> after Google <a href="http://googleresearch.blogspot.com/2016/01/alphago-mastering-ancient-game-of-go.html">first announced its development</a> <sup id="fnref:AlphaGo"><a href="#fn:AlphaGo" class="footnote">1</a></sup>. It struck me as a hugely impressive leap over state-of-the-art Go play, achieved through a cool combination of already successful techniques. So, when the media frenzy over AlphaGo broke out I thought to write a little post explaining why it is cool from an AI perspective, but also why it is not some scary baby-Skynet AI. In doing so, I stumbled on so many noteworthy developments and details in the 60-year history of game AI that I could not stop at writing that short little blog post. So, I hope you enjoy this ‘brief’ summary of all those exciting ideas and historic milestones that preceded AlphaGo and led to this latest marvel of human ingenuity.</p> <div><button class="btn" data-toggle="collapse" data-target="#sources"> Disclaimer: not an expert, more in depth sources, corrections &raquo; </button></div> <blockquote class="aside"> <p id="sources" class="collapse" style="height: 0px;"> As with my previous 'brief' history, I should emphasize I am not an expert on the topic and just wrote it out of personal interest. I have not covered all periods or aspects of this history, so some other good resources are <a href="http://academicworks.cuny.edu/cgi/viewcontent.cgi?article=1181&amp;context=gc_pubs">"The History of Computer Games"</a>, <a href="https://www.chess.com/article/view/a-short-history-of-computer-chess">"A Short History of Computer Chess"</a>, and <a href="http://www.britgo.org/computergo/history">"History of Go-playing Programs"</a>. I am also not a professional writer, and consulted some good pieces written on the topic by professional writers such as <a href="http://www.wired.com/2014/05/the-world-of-computer-go/">"The Mystery of Go, the Ancient Game That Computers Still Can’t Win"</a> by Alan Levinovitz. I will also stay away from getting too technical here, but there is a plethora of tutorials on the internet on all the major topics I only cover in brief. <br /> Any corrections would be greatly appreciated, though I will note some omissions are intentional since I want to try and keep this 'brief' through a good mix of simple technical explanations and storytelling. </p> </blockquote> <p><br /></p> <h1 id="humble-beginnings">Humble Beginnings</h1> <p class="sidenoteleftlarge">1949</p> <p>Since the inception of the modern computer, there have been people pondering whether it could match — or supersede — human intelligence. And since measuring human intelligence is difficult, many of those people reasoned they could tackle the question by first making computers good at certain tasks that challenged the human intellect. So, strategy games.
As early as 1949, no less than <a href="https://en.wikipedia.org/wiki/Claude_Shannon">Claude Shannon</a> published his thoughts on the topic of how a computer might be made to play Chess <sup id="fnref:ShannonChess"><a href="#fn:ShannonChess" class="footnote">2</a></sup>. He both justified the usefulness of solving such a problem and defined its scope:</p> <blockquote> <p>“The chess machine is an ideal one to start with, since: (1) the problem is sharply defined both in allowed operations (the moves) and in the ultimate goal (checkmate); (2) it is neither so simple as to be trivial nor too difficult for satisfactory solution; (3) chess is generally considered to require ‘thinking’ for skillful play; a solution of this problem will force us either to admit the possibility of a mechanized thinking or to further restrict our concept of ‘thinking’; (4) the discrete structure of chess fits well into the digital nature of modern computers. … It is clear then that the problem is not that of designing a machine to play perfect chess (which is quite impractical) nor one which merely plays legal chess (which is trivial). We would like to play a skillful game, perhaps comparable to that of a good human player.”</p> </blockquote> <p>The approach Shannon suggested is today called <strong>Minimax</strong> (named after John von Neumann’s minimax theorem, proven by him in 1928) and would be hugely influential for future game-playing AIs. It is perhaps the most obvious approach one can take to making a game-playing AI. The idea is to assume both players will consider all future moves of the whole game, and so play optimally. In other words, you should always choose a move such that, even if the opponent chooses the absolute best response to that move and to every future move of yours, you will still get the highest score possible at the end of the game.</p> <p>It’s easy to make a computer do this. With a representation of the positions (or <strong>states</strong>) and the rules of the game, all one needs to do is write a program to generate all possible next game states from the current state, and the possible states for those states, and so on. By doing this, the program can simulate the game past the current point and build a <strong>tree</strong> of possible paths toward the end of the game. Then, it just needs to follow the best-case path of moves to get to the best end game. This has the flaw of not capitalizing on potential mistakes the opponent might make, but on the whole is a very safe and sensible strategy.</p> <figure class="sidefigureleft"> <div><button class="btn" data-toggle="collapse" data-target="#minimax"> Aside: minimax step by step &raquo; </button></div> <blockquote class="aside"><p id="minimax" class="collapse" style="height: 0px;"> Broken down, minimax works as follows:<br /><br /> 1. At the start of your move, consider all the possible moves you can take.<br /> 2. For each of your possible moves, consider each of your opponent's response moves.<br /> 3. Now consider every possible response to the opponent's responses, and keep thinking into the future until you reach the end of the game in your head and can get a score.<br /> 4. Assume that just like you, the opponent can think through all the moves right through the end of the game, and so will never make a mistake — they will play optimally, and assume you will play optimally.
<br /> 5. Choose the current move to maximize your final score, assuming your opponent will always choose all future moves to minimize your score in response to whatever moves you make. </p></blockquote> </figure> <figure> <img class="postimageactual" src="/writing/images/2016-4-15-a-brief-history-of-game-ai/1-TicTacToe.gif" alt="MinimaxTree" /> <figcaption>Example minimax game tree on the simple Tic-Tac-Toe game. Each successive layer in this tree represents possible game states some number of moves ahead of the current one and is traditionally called a <b>ply</b>. <a href="https://www.cs.cmu.edu/~adamchik/15-121/lectures/Game%20Trees/Game%20Trees.html"><b>(Source)</b></a></figcaption> </figure> <p>One more detail: though in theory Minimax search involves finding all paths to the end of the game, in practice this is impossible due to the combinatorial explosion of game states to keep track of with each additional move simulated into the future. That is, for every move there is some number of options (known as the <strong>branching factor</strong>) and so every additional ply (i.e. layer) in the tree roughly multiplies its size by that branching factor. If the branching factor is 10, looking one move ahead would require considering 10 game states, two moves ahead requires 10+10*10=110, three moves gets us to 1110, and so on. So, we use an <strong>evaluation function</strong> to evaluate positions that are not yet the end of the game. An evaluation function can be as simple as counting the number of pieces each player has, or much more complicated, making it possible to only search 3 or 6 moves ahead (or, a <strong>depth</strong> of 3 or 6 <strong>plies</strong>) rather than the 40-ish moves involved in the average Chess game.</p> <figure> <img class="postimage" src="/writing/images/2016-4-15-a-brief-history-of-game-ai/2-evalfunc.png" alt="chess_eval func" /> <figcaption>Evaluating non-end-game positions in a search tree. <a href="http://stanford.edu/~cpiech/cs221/apps/deepBlue.html"><b>(Source)</b></a></figcaption> </figure> <p>Shannon’s paper defined how one could use Minimax with an evaluation function, and set the course for future work on Chess AI by proposing two possible strategies to go about it: (A) doing brute-force Minimax tree search with an evaluation function, OR (B) using a ‘plausible move generator’ rather than just the rules of the game to look at a small subset of next moves at each ply during tree search. Future Chess-playing programs would often be categorized as ‘type A’ or ‘type B’ depending on which strategy they mainly relied on. Shannon specifically noted the first strategy was simple but not practical since the number of states grows exponentially with each additional ply and the overall number of possible positions (the <strong>state-space</strong>) is huge. For the second strategy, Shannon took inspiration from master Chess players, who selectively consider only promising moves. However, a good ‘plausible move generator’ is not at all trivial to write, so massive-scale search as in strategy (A) is still useful. As we shall see later, Deep Blue (the program that beat Chess world champion Garry Kasparov) was basically a combination of both approaches.</p> <figure> <img class="postimageactual" src="/writing/images/2016-4-15-a-brief-history-of-game-ai/2-Shannon.jpg" alt="shannon_machine" /> <figcaption>Shannon demonstrating a machine he built to try programming rules for a limited version of Chess.
<a href="https://videogamehistorian.wordpress.com/tag/computer-game/"><b>(Source)</b></a></figcaption> </figure> <p class="sidenoteleftlarge">1951</p> <p>But, the supercomputer that powered Deep Blue was still decades away from existence, at most a dream in the minds of the early computer pioneers of the early 50s (in fact, the famous <a href="https://en.wikipedia.org/wiki/Moore's_law">Moore’s law</a> would not be defined until a decade later). In fact, the first Chess program was run not with silicon or vacuum tubes, nor any sort of digital computer, but rather by the gooey fleshy neurons of the human brain — that of Alan Turing, to be precise. Turing, a mathematician and pioneering AI thinker, spent years working on a Chess algorithm he completed in 1951 and called TurboChamp<sup id="fnref:short_chess_history"><a href="#fn:short_chess_history" class="footnote">3</a></sup>.</p> <p>TurboChamp was not as extensive as Shannon’s proposed systems, and very basic by future standards, but still, it could play Chess. In 1952, Turing manually executed the algorithm in what must have been an excruciatingly slow game, which the program ultimately lost. Still, Turing also published his thoughts on Chess AI and posited that <em>in principle</em> a program that could learn from experience and play at the level of humans ought to be completely possible<sup id="fnref:TuringChess"><a href="#fn:TuringChess" class="footnote">4</a></sup>. Just a few years later, the first ever computer Chess program would be executed…</p> <h1 id="theory-becomes-code">Theory Becomes Code</h1> <p class="sidenoteleftlarge">1956</p> <p>All of this happened before AI — the field of Artificial Intelligence — was really born. This can be said to have happened at the 1956 Dartmouth Conference, a sort of month-long brainstorming session among future AI luminaries where the term “Artificial Intelligence” was coined (or so claims <a href="https://en.wikipedia.org/wiki/Dartmouth_Conferences">Wikipedia</a>). Besides the University mathematicians and researchers in attendance (Claude Shannon among them), there were also two engineers from IBM: Nathaniel Rochester and Arthur Samuel. Nathaniel Rochester headed a small group that began a long tradition of people at IBM achieving breakthroughs in AI, with Arthur Samuel being the first.</p> <p>Samuel had been thinking about Machine Learning (algorithms that enable computers to solve problems through learning rather than through hand-coded human solutions) since 1949, and was particularly focused on developing an AI that could learn to play the game of Checkers. Checkers, which has 10<sup>20</sup> possible board positions, is simpler than Chess (10<sup>47</sup>) or Go (10<sup>250</sup> ! We’ll get to that one in a bit…) but still complicated enough that it is not easy to master. With the slow and unwieldy computers of the time, Checkers was a good first target. Working with the resources he had at IBM, and particularly their first commercial computer (the IBM 701), Samuel developed a program that could play the game of Checkers well, the first such game-playing AI to run on a computer. He summarized his accomplishments in the seminal <a href="https://www.cs.virginia.edu/~evans/greatworks/samuel1959.pdf">“Some studies in machine learning using the game of Checkers”</a><sup id="fnref:SamulCheckers"><a href="#fn:SamulCheckers" class="footnote">5</a></sup>:</p> <blockquote> <p>“Two machine-learning procedures have been investigated in some detail using the game of checkers. 
Enough work has been done to verify the fact that a computer can be programmed so that it will learn to play a better game of checkers than can be played by the person who wrote the program. Furthermore, it can learn to do this in a remarkably short period of time (8 or 10 hours of machine-playing time) when given only the rules of the game, a sense of direction, and a redundant and incomplete list of parameters which are thought to have something to do with the game, but whose correct signs and relative weights are unknown and unspecified. The principles of machine learning verified by these experiments are, of course, applicable to many other situations.”</p> </blockquote> <figure> <img class="postimage" src="/writing/images/2016-4-15-a-brief-history-of-game-ai/3-minimax.png" alt="samuel_minimax" /> <figcaption>A great visual from Samuel's paper explaining Minimax. <a href="https://www.cs.virginia.edu/~evans/greatworks/samuel1959.pdf"><b>(Source)</b></a></figcaption> </figure> <p>Fundamentally, the program was based on Minimax, but had an additional hugely important aspect: <strong>learning</strong>. The program became better over time without human intervention, through two novel methods: (A) “rote-learning”,— meaning it could store the values of certain positions as previously evaluated with Minimax, and so not need to spend computational resources considering moves further down those branches — and (B) “learning-by-generalization”, i.e. modifying the multipliers for different parameters (thus modifying the evaluation function) based on previous games played by the program. The multipliers were changed so as to lower the difference between the calculated goodness of a given board position (according to the evaluation function) and its actual goodness (found through playing out the game to completion).</p> <p>Rote learning was a fairly obvious way to make the program more efficient and capable over time, and it worked well. But it was learning-by-generalization that was particularly groundbreaking, as it showed that a program could learn to ‘intuitively’ know how good a game position was without tons of simulation of future moves. Not only that, but the program was made to learn by playing past versions of itself, which would one day be a key component of AlphaGo! But let’s not get ahead of ourselves…</p> <figure class="sidefigureright"> <div><button class="btn" data-toggle="collapse" data-target="#samuel_learning"> Aside: Quote from Samuel about learning procedures &raquo; </button></div> <blockquote class="aside"><p id="samuel_learning" class="collapse" style="height: 0px;"> Here is a fun excerpt from the paper about the advantages of each learning strategy:<br /> "Some interesting comparisons can be made between the playing style developed by the learning-by-generalization program and that developed by the earlier rote-learning procedure. The program with rote learning soon learned to imitate master play during the opening moves. It was always quite poor during the middle game, but it easily learned how to avoid most of the obvious traps during end-game play and could usually drive on toward a win when left with a piece advantage. The program with the generalization procedure has never learned to play in a conventional manner and its openings are apt to be weak. 
On the other hand, it soon learned to play a good middle game, and with a piece advantage it usually polishes off its opponent in short order.<br /> Apparently, rote learning is of the greatest help, either under conditions when the result of any specific action are long delayed, or in those situations where highly specialized techniques are required. Contrasting with this, the generalization procedure is most helpful in situations in which the available permutations of conditions are large in number and when the consequences of any specific action are not long delayed." </p></blockquote> </figure> <figure> <img class="postimage" src="/writing/images/2016-4-15-a-brief-history-of-game-ai/4-rote_learning.png" /> <figcaption>Another great figure from Samuel's paper showing the use of rote learning. <a href="https://www.cs.virginia.edu/~evans/greatworks/samuel1959.pdf"><b>(Source)</b></a></figcaption> </figure> <p>Not only were these ideas groundbreaking, but they also worked in practice: The program could play a respectable game of Checkers, which was no small feat given the limited computing power at the time. And so, as this <a href="https://webdocs.cs.ualberta.ca/~chinook/project/legacy.html">great retrospective</a> details, when Samuel’s program was first demonstrated in the very early days of AI (in the same year as the Dartmouth Conference, in fact) it made a strong impression:</p> <blockquote> <p>“It didn’t take long before Samuel had a program that played a respectable game of checkers, capable of easily defeating novice players. It was first publicly demonstrated on television on February 24, 1956. Thomas Watson, President of IBM, arranged for the program to be exhibited to shareholders. He predicted that it would result in a fifteen-point rise in the price of IBM stock. It did.”</p> </blockquote> <figure> <img class="postimage" src="/writing/images/2016-4-15-a-brief-history-of-game-ai/5-samuel.jpg" alt="samuel_checkers" /> <figcaption>"On February 24, 1956, Arthur Samuel’s Checkers program, which was developed for play on the IBM 701, was demonstrated to the public on television." <a href="http://www-03.ibm.com/ibm/history/ibm100/us/en/icons/ibm700series/impacts/"><b>(Source)</b></a></figcaption> </figure> <p class="sidenoteleftlarge">1957</p> <p>But, this is Checkers — what of the game everyone really cared about, Chess? Well, once again it was employees at IBM who pioneered the first Chess AI, and as with Samuel those employees were supervised by Nathaniel Rochester. The work was chiefly led by Alex Bernstein, a mathematician and experienced Chess player. Like Samuel, he decided to explore the problem out of personal interest and ultimately led the implementation of a fully functional Chess playing AI on the IBM 701, which was completed in 1957 <sup id="fnref:BernsteinChess"><a href="#fn:BernsteinChess" class="footnote">6</a></sup>. The program also used Minimax, but lacked any learning capability and was constrained to look at most 4 moves ahead, and consider only 7 options per move. Until the 70s, most Chess-playing programs would be similarly constrained, plus perhaps some extra logic to choose which moves to simulate, rather like the type (B) strategy outlined by Shannon in 1949. Bernstein’s program had some <b>heuristics</b> (cheap to compute ‘rules of thumb’) to select the best 7 moves to simulate, which in itself was a new contribution. 
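</p> <p>For the curious, depth-limited Minimax of roughly this kind is straightforward to express in code. Below is a toy Python sketch purely for illustration (Bernstein's program of course looked nothing like this, and <code>legal_moves</code>, <code>apply_move</code>, <code>game_over</code>, <code>piece_count</code>, and <code>opponent</code> are hypothetical stand-ins for real game logic):</p> <pre><code># Toy sketch of depth-limited Minimax with a simple piece-count evaluation (illustrative only).
def evaluate(state, player):
    # simplest possible evaluation function: material difference
    return piece_count(state, player) - piece_count(state, opponent(player))

def minimax(state, depth, player, maximizing):
    if depth == 0 or game_over(state):
        return evaluate(state, player)
    moves = legal_moves(state)   # a 'type B' program would keep only a few plausible moves here
    if not moves:
        return evaluate(state, player)
    scores = [minimax(apply_move(state, m), depth - 1, player, not maximizing) for m in moves]
    return max(scores) if maximizing else min(scores)

def best_move(state, player, depth=4):   # search only a few plies ahead
    return max(legal_moves(state),
               key=lambda m: minimax(apply_move(state, m), depth - 1, player, False))
</code></pre> <p>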
Still, these limitations meant the program only achieved very basic Chess play.</p> <figure class="sidefigureleft"> <img class="postimage" src="/writing/images/2016-4-15-a-brief-history-of-game-ai/6-chessai.png" alt="bernstein_chess" /> <figcaption>Illustration of the limited Minimax search the Bernstein program did, from the article on it. <a href="http://archive.computerhistory.org/projects/chess/related_materials/text/2-2.Computer_V_ChessPlayer.Bernstein_Roberts.Scientific_American.June-1958/Computer_V_ChessPlayer.Bernstein_Roberts.Scientific_American.June-1958.062303059.sm.pdf"><b>(Source)</b></a></figcaption> </figure> <figure class="sidefigureright"> <img class="postimageactual" src="/writing/images/2016-4-15-a-brief-history-of-game-ai/6A-Bernstein.jpg" alt="bernstein_chessB" /> <figcaption>Bernstein playing his program. <a href="https://chessprogramming.wikispaces.com/The+Bernstein+Chess+Program"><b>(Source)</b></a></figcaption> </figure> <figure> <figure> <iframe src="https://www.youtube.com/embed/iT_Un3xo1qE" frameborder="0" allowfullscreen=""></iframe> </figure> <figcaption>Bernstein's Chess program starring in its very own TV report!</figcaption> </figure> <p>Still, it was the first fully functional Chess-playing program and demonstrated that even extremely limited Minimax search with a simple evaluation function and no learning can yield passable novice Chess play. And this was in 1957! So much more is yet to come in <a href="/writing/a-brief-history-of-game-ai-part-2">the coming decades</a>…</p> <h2 id="acknowledgements">Acknowledgements</h2> <p>Big thanks to <a href="http://cs.stanford.edu/people/abisee/">Abi See</a>, <a href="https://www.linkedin.com/in/pavel-komarov-a2834048">Pavel Komarov</a>, and Stefeno Fenu for helping to edit this.</p> <h2 id="references">References</h2> <div class="footnotes"> <ol> <li id="fn:AlphaGo"> <p>David Silver, Aja Huang, Christopher J. Maddison, Arthur Guez, Laurent Sifre, George van den Driessche, Julian Schrittwieser, Ioannis Antonoglou, Veda Panneershelvam, Marc Lanctot, Sander Dieleman, Dominik Grewe, John Nham, Nal Kalchbrenner, Ilya Sutskever, Timothy Lillicrap, Madeleine Leach, Koray Kavukcuoglu, Thore Graepel &amp; Demis Hassabis (2016). Mastering the game of Go with deep neural networks and tree search. Nature, 529, 484-503. <a href="#fnref:AlphaGo" class="reversefootnote">&#8617;</a></p> </li> <li id="fn:ShannonChess"> <p>Shannon, C. E. (1988). <a href="http://vision.unipv.it/IA1/aa2009-2010/ProgrammingaComputerforPlayingChess.pdf">Programming a computer for playing chess</a> (pp. 2-13). Springer New York. <a href="#fnref:ShannonChess" class="reversefootnote">&#8617;</a></p> </li> <li id="fn:short_chess_history"> <p>A Jenery (2008). A Short History of Computer Chess. chess.com <a href="https://www.chess.com/article/view/a-short-history-of-computer-chess">link</a> <a href="#fnref:short_chess_history" class="reversefootnote">&#8617;</a></p> </li> <li id="fn:TuringChess"> <p>Alan Turing (1953). Chess. part of the collection Digital Computers Applied to Games. in Bertram Vivian Bowden (editor), Faster Than Thought, a symposium on digital computing machines, reprinted 1988 in Computer Chess Compendium <a href="#fnref:TuringChess" class="reversefootnote">&#8617;</a></p> </li> <li id="fn:SamulCheckers"> <p>Samuel, A. L. (1959). <a href="https://www.cs.virginia.edu/~evans/greatworks/samuel1959.pdf">Some studies in machine learning using the game of checkers</a>. IBM Journal of research and development, 3(3), 210-229.
<a href="#fnref:SamulCheckers" class="reversefootnote">&#8617;</a></p> </li> <li id="fn:BernsteinChess"> <p>Bernstein, A., &amp; Roberts, M. D. V. (1958). <a href="http://archive.computerhistory.org/projects/chess/related_materials/text/2-2.Computer_V_ChessPlayer.Bernstein_Roberts.Scientific_American.June-1958/Computer_V_ChessPlayer.Bernstein_Roberts.Scientific_American.June-1958.062303059.sm.pdf">Computer v chess-player</a>. Scientific American, 198(6), 96-105. <a href="#fnref:BernsteinChess" class="reversefootnote">&#8617;</a></p> </li> </ol> </div> <p><a href="/writing/a-brief-history-of-game-ai/">A 'Brief' History of Game AI Up To AlphaGo, Part 1</a> was originally published by Andrey Kurenkov at <a href="">Andrey Kurenkov's Web World</a> on April 18, 2016.</p> <![CDATA[Fun Visualizations of the 2015 StackOverflow Developer Survey]]> /writing/fun-visualizations-of-stackoverflow 2016-02-12T19:19:34-07:00 2016-02-12T19:19:34-07:00 www.andreykurenkov.com contact@andreykurenkov.com <figure> <a href="/writing/images/2016-2-12-fun-visualizations-of-stackoverflow/14-occ_lang_rB.png"> <img class="postimage" src="/writing/images/2016-2-12-fun-visualizations-of-stackoverflow/14-occ_lang_rB.png" alt="Langs hmat" /> </a> <figcaption>Where this post is heading. The entirety of the code used for this post <b><a href="https://github.com/andreykurenkov/stackoverflow-R-dataviz">can be found here</a></b>. </figcaption> </figure> <h1 id="what">What?</h1> <p>What will follow are a selection of what I think are cool visualizations of data from the <a href="http://stackoverflow.com/research/developer-survey-2015">StackOverflow 2015 Developer Survey</a>. The survey asked software developers a bunch of questions concerning their background and work, and got an impressive 26086 responses. Due to there being multiple ‘Select up to’ questions, the data contains 222 observations for every response - lots of data to play with and visualize! The visualizations were made with R, as part of a project for the <a href="https://www.udacity.com/courses/ud651">Data Analysis with R</a> class.</p> <h1 id="fun-with-developers-personalprofessional-metrics">Fun With Developers’ Personal+Professional Metrics</h1> <p>Having loaded and cleaned up the data, the natural place to start is looking at who responded to the survey:</p> <figure> <a href="/writing/images/2016-2-12-fun-visualizations-of-stackoverflow/1-gender_age.png"> <img class="postimage" src="/writing/images/2016-2-12-fun-visualizations-of-stackoverflow/1-gender_age.png" alt="gender_age" /> </a> </figure> <p>Yep, big surprise, most respondents are male and younger than 35. Though, interestingly, females in the 20-24 range outnumber 25-29 range, which is not true for males. Also predictably, the graph shows that people with more experience tend to be older, though there are developers past their 40s with less than 10 years of experience which shows that it is possible to become a developer even at a later age. There is also the strong suggestion some respondents are jokesters, since there exist some people who are not yet 20 but have 11+ years of experience.</p> <p>The fact that there is a strong correlation between age and experience (0.6424202, in fact) is hardly surprising, but it begs a question: does the increased experience translate to better compensation? 
With this huge spreadsheet in hand, that question is easy to answer:</p> <figure> <a href="/writing/images/2016-2-12-fun-visualizations-of-stackoverflow/2-comp_exp.png"> <img class="postimage" src="/writing/images/2016-2-12-fun-visualizations-of-stackoverflow/2-comp_exp.png" alt="comp_exp" /> </a> </figure> <p>Yep, the world is (at least to some degree and in this context) just. It is also very unequal, though those who, like me, have made it to the US can at least feel good about rather cushy average compensations. Besides being in the US, I also happen to be the sort of person to work on these sorts of blog posts in my free time, so a follow-up question that makes sense is whether hours spent programming as a hobby also correlate positively with compensation. Perhaps lots of hobbyist programming indicates a more skilled programmer who might get better positions, or perhaps increased compensation translates to less time for hobby programming. Once again, let’s have a look at the data:</p> <figure> <a href="/writing/images/2016-2-12-fun-visualizations-of-stackoverflow/3-comp_hours.png"> <img class="postimage" src="/writing/images/2016-2-12-fun-visualizations-of-stackoverflow/3-comp_hours.png" alt="comp_hours" /> </a> </figure> <p>It seems the second is correct - more hours programming for fun does not equate to better compensation, though we can of course be optimistic and hope those with bigger compensations and less hobby programming relish their jobs and are in no need of extra projects. In fact, we can go ahead and look at the job satisfaction data and see:</p> <figure> <a href="/writing/images/2016-2-12-fun-visualizations-of-stackoverflow/3-comp_sat.png"> <img class="postimage" src="/writing/images/2016-2-12-fun-visualizations-of-stackoverflow/3-comp_sat.png" alt="comp_sat" /> </a> </figure> <p>So, it seems having a higher-paid job does somewhat correlate with having a more satisfying job - all those making big cash are not secretly miserable, good to know. Now then, all these line graphs are fun, but not very efficient, so using R magic we can make a graph that communicates quite a bit more about the variables we have been exploring thus far (click for larger version):</p> <figure> <a href="/writing/images/2016-2-12-fun-visualizations-of-stackoverflow/3-comp_satB.png"> <img class="postimage" src="/writing/images/2016-2-12-fun-visualizations-of-stackoverflow/3-comp_satB.png" alt="comp_satB" /> </a> <figcaption>R makes it incredibly easy to do stuff like this! (not that you necessarily should make visualizations this packed with variables...)</figcaption> </figure> <p>Fancy graphs, right?
The fun is just getting started, trust me, the best is yet ahead…</p> <h1 id="exploring-countries-industries-occupations">Exploring Countries, Industries, Occupations</h1> <p>Now, the ‘in US’/’not in US’ dichotomy is rather simplistic (and perhaps stereotypically American), so it’d be interesting to see which other countries programmers fare well in:</p> <figure> <a href="/writing/images/2016-2-12-fun-visualizations-of-stackoverflow/4-country_comp.png"> <img class="postimage" src="/writing/images/2016-2-12-fun-visualizations-of-stackoverflow/4-country_comp.png" alt="comp_country" /> </a> <figcaption>One could argue the use of (logarithmic) coloring is not needed here, but I for one like to know the relative number of instances from which averages are computed.</figcaption> </figure> <p>The results, after filtering to require at least twenty observations per country, are as America- and Europe-heavy as this post’s likely readership - no surprise there. It is somewhat surprising that high-tech countries such as South Korea and Germany are ranked relatively low in the list, though. But, as we said before, money is not everything, so let’s go ahead and see how the countries fare by job satisfaction:</p> <figure> <a href="/writing/images/2016-2-12-fun-visualizations-of-stackoverflow/4-country_sat.png"> <img class="postimage" src="/writing/images/2016-2-12-fun-visualizations-of-stackoverflow/4-country_sat.png" alt="sat_country" /> </a> </figure> <p>Quite a different result, though seemingly one with less disparity than compensation averages. The US is merely 20th! It looks like Denmark and Israel really hit the sweet spot in terms of balance on both measures. Alas, I live in the US, and that will not change anytime soon. And as a software developer in the US, the next question for me to ask is what industries and occupations correlate with high compensations here:</p> <figure> <a href="/writing/images/2016-2-12-fun-visualizations-of-stackoverflow/5-occ_comp.png"> <img class="postimage" src="/writing/images/2016-2-12-fun-visualizations-of-stackoverflow/5-occ_comp.png" alt="occ_comp" /> </a> </figure> <figure> <a href="/writing/images/2016-2-12-fun-visualizations-of-stackoverflow/6-ind_comp.png"> <img class="postimage" src="/writing/images/2016-2-12-fun-visualizations-of-stackoverflow/6-ind_comp.png" alt="ind_comp" /> </a> </figure> <p>Again, a few unexpected details here - data scientists are apparently not making as much as I thought they might - but most of this seems plausible. It seems it is a good thing I have no particular desire to work in the gaming and non-profit industries or as a mobile dev - but then these are just averages, and money is much less important than job fulfillment anyway. So, once again, let’s have a look at that too:</p> <figure> <a href="/writing/images/2016-2-12-fun-visualizations-of-stackoverflow/5-occ_sat.png"> <img class="postimage" src="/writing/images/2016-2-12-fun-visualizations-of-stackoverflow/5-occ_sat.png" alt="occ_sat" /> </a> </figure> <figure> <a href="/writing/images/2016-2-12-fun-visualizations-of-stackoverflow/6-ind_sat.png"> <img class="postimage" src="/writing/images/2016-2-12-fun-visualizations-of-stackoverflow/6-ind_sat.png" alt="ind_sat" /> </a> </figure> <p>Well, good news for the gaming developer - you may statistically be likely to earn less than those in most other occupations, but you also top the charts in terms of job satisfaction.
As we saw before, it is possible to display the count and variance of such data fairly nicely with a combination of boxplot and translucent points, so we may as well do just that:</p> <figure> <a href="/writing/images/2016-2-12-fun-visualizations-of-stackoverflow/7-occ_comp_exp.png"> <img class="postimage" src="/writing/images/2016-2-12-fun-visualizations-of-stackoverflow/7-occ_comp_exp.png" alt="occ_comp_exp" /> </a> </figure> <figure> <a href="/writing/images/2016-2-12-fun-visualizations-of-stackoverflow/8-ind_comp_exp.png"> <img class="postimage" src="/writing/images/2016-2-12-fun-visualizations-of-stackoverflow/8-ind_comp_exp.png" alt="ind_comp_exp" /> </a> </figure> <p>The data is discrete, but R makes it very easy to introduce jitter to be able to roughly get a sense of the distributions underlying the means we’ve been looking at. There is quite a lot going on in these graphs, but on the whole I think they do a good job of combining and conveying all the involved information. Let’s skip the whole tradition of doing the same for satisfaction, as I am sure you are getting tired of all this satisfaction and money talk and boring bar graphs, and change things up.</p> <h1 id="turning-up-the-heat">Turning Up the Heat</h1> <p>Having played this much with occupations and industries, I naturally started to wonder which industries have the most of each occupation. The only sane way I could envision tackling the question is with heatmaps, and so after several fervent hours of massaging the data and plotting I got this:</p> <figure> <a href="/writing/images/2016-2-12-fun-visualizations-of-stackoverflow/9-occ_ind.png"> <img class="postimage" src="/writing/images/2016-2-12-fun-visualizations-of-stackoverflow/9-occ_ind.png" alt="occ_ind" /> </a> </figure> <p>I know, not fun - the full-stack web dev occupation and Software Products industry dwarf all the others in terms of counts and so make most of the heatmap a bland monotonous red. One option is to try to log-scale the coloring, but I think that’d be sort of cheating in this case (my mind is built to assume heatmaps are linear), so instead we can produce a heatmap where the coloring is scaled within each row:</p> <figure> <a href="/writing/images/2016-2-12-fun-visualizations-of-stackoverflow/10-occ_ind_r.png"> <img class="postimage" src="/writing/images/2016-2-12-fun-visualizations-of-stackoverflow/10-occ_ind_r.png" alt="ind_comp_exp_r" /> </a> <figcaption>Notice that this in effect shows the breakdown of industries for each occupation separately from the rest</figcaption> </figure> <p>Not bad! The intersection of ‘Student’ and ‘I’m a student’ is bright yellow, so this is at least somewhat correct. There are lots of little neat nuggets of info here, such as the number of mobile devs and designers in consulting or the prevalence of embedded developers in telecommunications. Admittedly, I produced this heatmap only after another one I knew I would want to make from the start - a heatmap showing the technologies developers in different occupations use:</p> <figure> <a href="/writing/images/2016-2-12-fun-visualizations-of-stackoverflow/11-occ_lang.png"> <img class="postimage" src="/writing/images/2016-2-12-fun-visualizations-of-stackoverflow/11-occ_lang.png" alt="occ_lang" /> </a> </figure> <p>Again with the huge majority of web devs!
Well, no matter, we can do the same thing as before and view the data with per-row scaling:</p> <figure> <a href="/writing/images/2016-2-12-fun-visualizations-of-stackoverflow/12-occ_lang_r.png"> <img class="postimage" src="/writing/images/2016-2-12-fun-visualizations-of-stackoverflow/12-occ_lang_r.png" alt="occ_lang_r" /> </a> <figcaption>Now we can see the technologies used by developers in each occupation</figcaption> </figure> <p>Turns out executives rely mostly on JavaScript and the Cloud, who knew. As these technologies are sorted by overall usage and in fact only the top 20 are shown, I was surprised the niche ones such as LAMP and Redis still have more users than favorites of mine like Django or R. Overall though, this heatmap is nice for simply confirming common-sense expectations about technology usage for each occupation. Still, I was not too big a fan of this look, so I also generated the same heatmap with a different look:</p> <figure> <a href="/writing/images/2016-2-12-fun-visualizations-of-stackoverflow/13-occ_langB.png"> <img class="postimage" src="/writing/images/2016-2-12-fun-visualizations-of-stackoverflow/13-occ_langB.png" alt="occ_langB" /> </a> </figure> <figure> <a href="/writing/images/2016-2-12-fun-visualizations-of-stackoverflow/14-occ_lang_rB.png"> <img class="postimage" src="/writing/images/2016-2-12-fun-visualizations-of-stackoverflow/14-occ_lang_rB.png" alt="occ_lang_rB" /> </a> <figcaption>Oh yeahhh you know those colors represent percentages per row now. I think this may be my favorite result from this project, to be honest.</figcaption> </figure> <h1 id="lastly---how-do-you-become-a-dev-anyway">Lastly - How Do You Become A Dev Anyway</h1> <p>Let’s finish up by looking at a topic this post is implicitly really about - how people learn. The survey helpfully asked what sort of training (education, online classes, even mentorships) each respondent had. And so, again, we can make use of the information-dense wonder of a heatmap:</p> <figure> <a href="/writing/images/2016-2-12-fun-visualizations-of-stackoverflow/15-occ_train.png"> <img class="postimage" src="/writing/images/2016-2-12-fun-visualizations-of-stackoverflow/15-occ_train.png" alt="occ_train" /> </a> </figure> <p>So, most devs get at most a Bachelor’s degree, learn on the job, or have no formal training at all. However, online classes are also popular, no doubt in addition to these other forms of training. There are also exceptions among Machine Learning devs and data scientists, two occupations I have some interest in. Just to take a break from all these heatmaps, we can take a closer look at these two with good ol’ bar graphs:</p> <figure> <a href="/writing/images/2016-2-12-fun-visualizations-of-stackoverflow/16-occ_trainB.png"> <img class="postimage" src="/writing/images/2016-2-12-fun-visualizations-of-stackoverflow/16-occ_trainB.png" alt="occ_trainB" /> </a> </figure> <p>Based on the low average salary of data scientists and the large numbers of them with non-formal or on-the-job training, I think it is likely the term has just gotten to be used very loosely to encompass many data-oriented positions. Machine learning developers, on the other hand, are still in a more exclusive club of more educated types who no doubt meddle with more math and algorithms.
Regardless, the size of both of these occupations (which were not even really around a decade back) is as good a sign of our times as it gets.</p> <p>It would be tiresome to do this sort of visualization for all the job types, but why not go ahead and cap off this spate of visuals with something truly over the top:</p> <figure> <a href="/writing/images/2016-2-12-fun-visualizations-of-stackoverflow/17-occ_trainC.png"> <img class="postimage" src="/writing/images/2016-2-12-fun-visualizations-of-stackoverflow/17-occ_trainC.png" alt="occ_trainC" /> </a> <figcaption>Just because you can do something does not mean you should.</figcaption> </figure> <h1 id="why-do-all-this">Why Do All This?</h1> <p>For fun! But also, for learning. One of the great things about being a person who writes software for fun (and a living) is the lack of barriers to self-teaching new skills - all one needs is a laptop, an internet connection, and large quantities of time and perseverance to go ahead and learn something new. Though in theory just about anything can be self-taught with a textbook and time, learning this way can be very difficult if the knowledge is not applied and tested along the way. Computer Science allows for the best of both worlds here in that there are massive amounts of tutorials and information for free online, and using that information requires no breadboards, no tools, no chemicals - just a computer and a mediocre internet connection.</p> <p>What barriers do exist have been further diminished by the rise of online classes in the vein of those on Udacity and Coursera, which are particularly well suited to teaching software skills - there are now dozens of such classes about algorithms, mobile dev, web dev, machine learning, and so on. Having taken and completed several of these classes, I think they can very effectively help with learning through briskly paced lessons and high-quality assignments. So, when I found out about the <a href="https://www.udacity.com/course/data-analyst-nanodegree--nd002">Udacity Data Analyst Nanodegree</a> I was naturally intrigued given my existing fondness for machine learning and lack of experience with data visualization - intrigued enough to sign up for the program (after confirming my company was willing to foot the bill here).</p> <p>‘Data Analysis with R’ is the fourth class in the program so far, and was my introduction to the language (a sort of mix between Python and SQL, known for being good for data analysis and visualization). The final project for the class called for using the R skills learned so far for self-led exploration of some data, with the option to choose among a few supplied datasets or find your own - precisely the sort of freeform class assignment I can get excited about.
It took me some time to settle on a fun dataset to work with, but when I remembered the StackOverflow survey results I became sure there had to be many more opportunities for neat data analysis there - and I’d say I was right!</p> <p><a href="/writing/fun-visualizations-of-stackoverflow/">Fun Visualizations of the 2015 StackOverflow Developer Survey</a> was originally published by Andrey Kurenkov at <a href="">Andrey Kurenkov's Web World</a> on February 12, 2016.</p> <![CDATA[What Brief Hacker News Fame Looks Like]]> /writing/what-brief-hacker-news-fame-looks-like 2016-01-22T19:19:34-07:00 2016-01-22T19:19:34-07:00 www.andreykurenkov.com contact@andreykurenkov.com <p>The most traffic this site has ever received in one hour is precisely 1,814 pageviews, at 11:00 AM on January 15th, 2016, when <a href="http://www.andreykurenkov.com/writing/a-brief-history-of-neural-nets-and-deep-learning/">A ‘Brief’ History of Neural Nets and Deep Learning</a> hit the front page on Hacker News (a site very popular among programmers, researchers and basically all manner of technical people who can slack off at work by browsing the internet). As an extremely humble person (and fan of wry ironic wording), I debated whether to write about this for a while due to it perhaps seeming self-congratulatory. But there are some fun graphs to share and it would be nice to write something short for a change, so why not.</p> <p>Let’s start back when I first released this ‘Brief’ history writeup.</p> <figure> <a href="/writing/images/2016-1-21-what-brief-hacker-news-fame-looks-like/1-traffic.png"> <img class="postimage" src="/writing/images/2016-1-21-what-brief-hacker-news-fame-looks-like/1-traffic.png" alt="Traffic 1" /> </a> </figure> <p>I finally finished an acceptable draft of the history on December 24th, after roughly a month and a half of working on it. Being quite ready to be done with this prolonged writing project and go on vacation, I went ahead and posted it, and shared the result on Facebook. As you can see, 7 people actually took a look at it on Christmas, surprisingly. But, as was typical then, the highest session counts per day did not crack the double digits. Fast forward two weeks, and the picture looks quite different:</p> <figure> <a href="/writing/images/2016-1-21-what-brief-hacker-news-fame-looks-like/2-traffic.png"><img class="postimage" src="/writing/images/2016-1-21-what-brief-hacker-news-fame-looks-like/2-traffic.png" alt="Traffic 2" /></a> </figure> <p>The reason for the views uptick? Reddit. On January 5th the good people over at /r/machinelearning graced my little history with 13 upvotes and a good deal of traffic, which for me was very exciting. I’d never shared my writing outside of Facebook and LinkedIn before, but since this one was by far my largest writing effort I figured it was worth seeing if others would like it. And worth it it was - the sessions were now in the triple digits. Not only that, but I was getting a bit of positive feedback, which was very exciting - I mostly worked on the history for my own gratification, so seeing others read and compliment it felt great.
And, well, can you guess what happened next?</p> <figure> <a href="/writing/images/2016-1-21-what-brief-hacker-news-fame-looks-like/3-traffic.png"><img class="postimage" src="/writing/images/2016-1-21-what-brief-hacker-news-fame-looks-like/3-traffic.png" alt="Traffic 3" /></a> </figure> <p>Yep, on Friday, January the 15th, my history made it to the front page of Hacker News after I resubmitted it (it having failed to make much of a splash on my first submission). Knowing the sort of traffic the site receives, I was admittedly a bit stunned - I even took a screenshot to commemorate the event:</p> <figure> <img class="postimagesmall" src="/writing/images/2016-1-21-what-brief-hacker-news-fame-looks-like/4-famous.png" alt="Traffic 4" /> <figcaption>This still sits in my Pictures folder, and is named famous.png</figcaption> </figure> <p>The traffic spike was huge. For a better perspective, have a look at the hourly breakdown of the sessions:</p> <figure> <a href="/writing/images/2016-1-21-what-brief-hacker-news-fame-looks-like/5-traffic.png"><img class="postimage" src="/writing/images/2016-1-21-what-brief-hacker-news-fame-looks-like/5-traffic.png" alt="Traffic 5" /></a> <a href="/writing/images/2016-1-21-what-brief-hacker-news-fame-looks-like/6-traffic.png"><img class="postimage" src="/writing/images/2016-1-21-what-brief-hacker-news-fame-looks-like/6-traffic.png" alt="Traffic 6" /></a> </figure> <p>Suffice it to say, I was happy. Predictably, the pageviews diminished very quickly, but not as quickly as I expected - the views to this day exceed my expectations. To be fair, I had to go back and clean up tons of typos and things, but I’d say that’s a fair deal. And for once, I actually had some reason to dig into the Google Analytics of the traffic. Especially neat is the behavior flow graph:</p> <figure> <a href="/writing/images/2016-1-21-what-brief-hacker-news-fame-looks-like/7-traffic.png"><img class="postimage" src="/writing/images/2016-1-21-what-brief-hacker-news-fame-looks-like/7-traffic.png" alt="Traffic 7" /></a> </figure> <p>Understandably, most people start at Part 1 and drop off there - only 12% make it to Part 2. Still, hundreds if not more than a thousand people have made it all the way through the whole thing - exciting! It is a fairly long read at ~15000 words, so it’s good to see a fair number of people enjoy these sorts of mini-treatises.</p> <p>Who knows if I’ll ever write something so ambitious, or so widely read, again. Either way, I expect to keep on writing, and I’ll always have famous.png.</p> <p><a href="/writing/what-brief-hacker-news-fame-looks-like/">What Brief Hacker News Fame Looks Like</a> was originally published by Andrey Kurenkov at <a href="">Andrey Kurenkov's Web World</a> on January 22, 2016.</p> <![CDATA[Organizing My Emails With A Neural Net]]> /writing/organizing-my-emails-with-a-neural-net 2016-01-13T19:19:34-07:00 2016-01-13T19:19:34-07:00 www.andreykurenkov.com contact@andreykurenkov.com <figure> <img class="postimagesmaller" src="/writing/images/2016-1-13-neural-net-categorize-my-email/18-conf_normalized2.png" alt="Conf mat 0" /> <figcaption>Or, how to make this happen with your gmail data. The entirety of the code used for this post <b><a href="https://github.com/andreykurenkov/emailinsight/tree/master/pyScripts">can be found here</a></b>.
</figcaption> </figure> <h1 id="emailfiler-v1">EmailFiler V1</h1> <p>One of my favorite small projects, <a href="http://www.andreykurenkov.com/projects/hacks/email-filer/">EmailFiler</a>, was motivated by a school assignment for Georgia Tech’s Intro to Machine Learning class. Basically, the assignment was to pick some datasets, throw a bunch of supervised learning algorithms at them, and analyze the results. But here’s the thing: we could make our own datasets if we so chose. And so choose I did - to export my gmail data and explore the feasibility of machine-learned email categorization.</p> <p>See, I learned long ago that it’s often best to keep emails around in case there is randomly some need to refer back to them in the future. But, I also learned that I can’t help but strive for the ideal of the empty inbox (hopeless as that may be). So, years ago I started categorizing my emails into about a dozen folders within gmail, and by the time I took the ML class I had many thousands of emails spread across these categories. It seemed like a great project to make a classifier that could suggest a single category for each email in the inbox, so there could be a button by each email for quickly putting it into the correct category.</p> <figure> <img class="postimageactual" src="/writing/images/2016-1-13-neural-net-categorize-my-email/1-emailscategories.png" alt="Emails categories" /> <figcaption>The set of categories and email counts I worked with at the time</figcaption> </figure> <p>Well, I had my inputs, the emails, and my outputs, the categories, and even a nice button to easily export all that data in a nice format - easy right? Not so fast. Though I was not exactly striving for full text comprehension, I still wanted to learn using email text and metadata, and at first did not really know how to convert this data into a nice machine-learnable dataset. As any person who has studied Natural Language Processing can quickly point out, one easy approach is to use Bag of Words features. This is about as simple an approach as you can take with text classification - just find what the most common N words in all the text instances are, and then create binary features for each word (meaning a feature that has a value of 1 for an instance of text if it contains the word, and a 0 otherwise).</p> <p>I did this for a bunch of words found in all my emails, and also for the top 20 senders of the emails (since in some cases the sender should correlate strongly with the category, such as the sender being my research adviser and the category ‘research’), and for the top 5 domains the email was sent from (since a few domains like @gatech.edu would be strongly indicative for categories like ‘TA’ and ‘academic’). So, after an hour or so of writing <a href="https://github.com/andreykurenkov/emailinsight/blob/master/pyScripts/mboxConvert.py">mbox parsing code</a> I ended up with a function that output my actual dataset as a CSV.</p> <p>So, how well did it work? Well, but not as well as I hoped. At the time I was fond of the Orange Python ML framework, and so as per the assignment <a href="https://github.com/andreykurenkov/emailinsight/blob/master/pyScripts/orangeClassify.py">tested</a> how well a bunch of algorithms did against my dataset.</p>
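<p>To make that kind of Bag of Words feature extraction concrete, here is a minimal sketch of the idea in plain Python (just an illustration of the approach described above, not the code from the repo - the example emails and the choice of 500 words are placeholders):</p> <pre><code>from collections import Counter

def top_words(texts, n=500):
    # Count word frequencies across all emails and keep the n most common words
    counts = Counter(word for text in texts for word in text.lower().split())
    return [word for word, _ in counts.most_common(n)]

def binary_features(text, vocabulary):
    # 1 if the email contains the word, 0 otherwise - the binary Bag of Words features
    words = set(text.lower().split())
    return [1 if word in words else 0 for word in vocabulary]

# Hypothetical usage: 'emails' stands in for the parsed email bodies
emails = ["Please submit your homework by Friday", "Motor controller firmware update"]
vocab = top_words(emails, n=500)
features = [binary_features(email, vocab) for email in emails]
</code></pre> <p>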
The standouts were decision trees, as the best algorithm, and neural nets, as the worst:</p> <figure> <img class="postimagesmaller" src="/writing/images/2016-1-13-neural-net-categorize-my-email/2-emailsdtree.png" alt="DTrees emails" /> <figcaption>How Decision Trees fared on my little email dataset</figcaption> </figure> <figure> <img class="postimagesmaller" src="/writing/images/2016-1-13-neural-net-categorize-my-email/3-emailsnn.png" alt="NN emails" /> <figcaption>... And now neural nets </figcaption> </figure> <p>If you take a close look at those beautiful OpenOffice Calc plots, you will see that the best the Decision Trees managed to achieve on the test set is roughly 72%, and that neural nets could only get to a measly 65% - an F! Way better than random, considering there are 11 categories, but far from great.</p> <p>Why the disappointing result? Well, as we saw the features created for the dataset are very simple - just selecting the 500 most frequent words will yield a few good indicators, but also many generic terms that just appear a lot in English such as ‘that’ or ‘is’. I understood this at the time and tried a few things - removing 3-character words entirely, and writing some annoying code to select the most frequent words in each category specifically rather than in all the emails - but ultimately did not manage to figure out how to get better results.</p> <h1 id="if-at-first-you-dont-succeed">If At First You Don’t Succeed…</h1> <p>So, why am I writing this, if I did this years ago and got fairly lame results (albeit a good grade) then? In short, to try again. With a couple more years of experience, and having just completed a <a href="http://www.andreykurenkov.com/writing/a-brief-history-of-neural-nets-and-deep-learning/">giant 4-part history of neural networks and Deep Learning</a>, it seemed only appropriate to dive into a modern machine learning framework and see what I could do.</p> <p>But, where to start? By picking the tools, of course! The framework I decided to try is <a href="http://keras.io/">Keras</a>, both because it is in Python (which seems to be a favorite for data science and machine learning nowadays, and plays nice with the wonderful <a href="http://www.numpy.org/">numpy</a>, <a href="http://pandas.pydata.org/">pandas</a>, and <a href="http://scikit-learn.org/">scikit-learn</a>) and because it is backed by the well-regarded Theano library.</p> <p>It also just so happens that Keras has several easy to copy-paste examples to get started with, including one with a <a href="https://github.com/fchollet/keras/blob/master/examples/reuters_mlp.py">multi-category text classification problem</a>. And, here’s the interesting thing - the example uses just about the same features as I did for my class project. It finds the 1000 most frequent words in the documents, makes those into binary features, and trains a neural net with one hidden layer and dropout to predict the category of input text based solely on those features.</p> <p>So, the obvious first thing to try is exactly this, but with my own data - see if doing feature extraction with Keras will work better. Luckily, I can still use my old mbox parsing code, and Keras has a handy Tokenizer class for text feature extraction. So, it is easy to create a dataset in the same format as in the Keras example, and get an update on my current email counts while we’re at it:</p> <pre><code>Using Theano backend.
Label email count breakdown: Personal:440 Group work:150 Financial:118 Academic:1088 Professional:388 Group work/SolarJackets:1535 Personal/Programming:229 Professional/Research:1066 Professional/TA:1801 Sent:513 Unread:146 Professional/EPFL:234 Important:142 Professional/RISS:173 Total emails: 8023 </code></pre> <p>Eight thousand emails - not a giant dataset by any stretch, but nevertheless enough to do some serious machine learning. Having converted the data to the correct format, now it is just a matter of seeing if training a neural net with it works. The Keras example makes it very easy to go ahead and do just that:</p> <pre><code>7221 train sequences 802 test sequences Building model... Train on 6498 samples, validate on 723 samples Epoch 1/5 6498/6498 [==============================] - 2s - loss: 1.3182 - acc: 0.6320 - val_loss: 0.8166 - val_acc: 0.7718 Epoch 2/5 6498/6498 [==============================] - 2s - loss: 0.6201 - acc: 0.8316 - val_loss: 0.6598 - val_acc: 0.8285 Epoch 3/5 6498/6498 [==============================] - 2s - loss: 0.4102 - acc: 0.8883 - val_loss: 0.6214 - val_acc: 0.8216 Epoch 4/5 6498/6498 [==============================] - 2s - loss: 0.2960 - acc: 0.9214 - val_loss: 0.6178 - val_acc: 0.8202 Epoch 5/5 6498/6498 [==============================] - 2s - loss: 0.2294 - acc: 0.9372 - val_loss: 0.6031 - val_acc: 0.8326 802/802 [==============================] - 0s Test score: 0.585222780162 </code></pre> <p><strong>Test accuracy: 0.847880299252</strong></p> <p>Hell yeah 85% test accuracy! That handily beats the measly 65% score of my old neural net. Awesome.</p> <p>Except… why?</p> <p>I mean, my old code was doing basically this - finding the most frequent words, creating a binary matrix of features, and training a neural net with one hidden layer to be the classifier. Perhaps, it is because of this fancy new ‘relu’ neuron, and dropout, and using a non-sgd optimizer? Let’s find out! Since my old features were indeed binary and in a matrix, it takes very little work to make those be the dataset this neural net is trained with. And so, the results:</p> <pre><code>Epoch 1/5 6546/6546 [==============================] - 1s - loss: 1.8417 - acc: 0.4551 - val_loss: 1.4071 - val_acc: 0.5659 Epoch 2/5 6546/6546 [==============================] - 1s - loss: 1.2317 - acc: 0.6150 - val_loss: 1.1837 - val_acc: 0.6291 Epoch 3/5 6546/6546 [==============================] - 1s - loss: 1.0417 - acc: 0.6661 - val_loss: 1.1216 - val_acc: 0.6360 Epoch 4/5 6546/6546 [==============================] - 1s - loss: 0.9372 - acc: 0.6968 - val_loss: 1.0689 - val_acc: 0.6635 Epoch 5/5 6546/6546 [==============================] - 2s - loss: 0.8547 - acc: 0.7215 - val_loss: 1.0564 - val_acc: 0.6690 808/808 [==============================] - 0s Test score: 1.03195088158 </code></pre> <p><strong>Test accuracy: 0.64603960396</strong></p> <p>Ouch. So yes, my old email-categorizing solution was fairly flawed. I can’t say for sure, but I think it is a mix of overconstraining the features (forcing the top senders, domains, and words from each category to be there) and having too few words. The Keras example just throws the top 1000 words into a big matrix without any more intelligent filtering, and lets the neural net have at it. Not limiting what the features can be lets better ones be discovered, and so the overall accuracy is better.</p> <p>Well, that, or my code just sucks and has mistakes in it - modifying it to be less restrictive still only nets a 70% accuracy. 
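</p> <p>For reference, that whole Keras pipeline boils down to something like the sketch below, in the spirit of the Reuters example (this is not a verbatim copy of either my code or the example - the placeholder emails, layer sizes, and epoch count are just for illustration, and some argument names differ between Keras versions, e.g. nb_epoch became epochs):</p> <pre><code>from keras.models import Sequential
from keras.layers.core import Dense, Dropout, Activation
from keras.preprocessing.text import Tokenizer
from keras.utils.np_utils import to_categorical

# Placeholder data standing in for the parsed emails and their category indices
train_texts = ["homework due friday", "motor controller firmware", "paper review comments"]
train_labels = [0, 1, 2]
num_categories = 3
max_words = 1000

# Binary word-presence matrix over the max_words most frequent words
tokenizer = Tokenizer(nb_words=max_words)  # num_words= in newer Keras versions
tokenizer.fit_on_texts(train_texts)
X_train = tokenizer.texts_to_matrix(train_texts, mode='binary')
y_train = to_categorical(train_labels, num_categories)

# One hidden relu layer with dropout, then a softmax over the email categories
model = Sequential()
model.add(Dense(512, input_shape=(max_words,)))
model.add(Activation('relu'))
model.add(Dropout(0.5))
model.add(Dense(num_categories))
model.add(Activation('softmax'))
model.compile(loss='categorical_crossentropy', optimizer='adam')

model.fit(X_train, y_train, nb_epoch=5, batch_size=32)
</code></pre> <p>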
In any case, it’s clear that I was able to beat my old result by leveraging a newer ML library, so the question now clearly is - can I do better?</p> <h1 id="deep-learning-is-no-good-here">Deep Learning Is No Good Here</h1> <p>When I first started looking at the Keras code, I was briefly excited by the mistaken notion that it would use the actual sequence of text, with the words in their original order. It turned out that this was not the case, but that does not mean that it can’t be. Indeed, a very cool recent phenomenon in machine learning is the resurgence of recurrent neural nets, which are well suited for dealing with long sequences of data. Additionally, when dealing with words it is common to perform an ‘embedding’ step in which each word is converted into a vector of numbers, so that similar words are converted into similar vectors.</p> <p>So, instead of changing the emails into matrices of binary features it’s possible to just change the words into numbers using the words’ frequency ranking, and the numbers themselves will be converted into vectors which represent the ‘idea’ of each word. Then, we can use the sequence to train a recurrent neural net with Long Short Term Memory or Gated Recurrent units to do the classification. And, guess what? There is also a <a href="https://github.com/fchollet/keras/blob/master/examples/imdb_lstm.py">nice Keras example</a> that does just this, so it is easy to fire up and see what happens:</p> <pre><code>Epoch 1/15 7264/7264 [===========================] - 1330s - loss: 2.3454 - acc: 0.2411 - val_loss: 2.0348 - val_acc: 0.3594 Epoch 2/15 7264/7264 [===========================] - 1333s - loss: 1.9242 - acc: 0.4062 - val_loss: 1.5605 - val_acc: 0.5502 Epoch 3/15 7264/7264 [===========================] - 1337s - loss: 1.3903 - acc: 0.6039 - val_loss: 1.1995 - val_acc: 0.6568 ... Epoch 14/15 7264/7264 [===========================] - 1350s - loss: 0.3547 - acc: 0.9031 - val_loss: 0.8497 - val_acc: 0.7980 Epoch 15/15 7264/7264 [===========================] - 1352s - loss: 0.3190 - acc: 0.9126 - val_loss: 0.8617 - val_acc: 0.7869 Test score: 0.861739277323 </code></pre> <p><strong>Test accuracy: 0.786864931846</strong></p> <p>Darn it. Not only did the LSTM take FOREVER, but the results at the end were not that good. Presumably the reason for this is that my emails are just not that much data, and in general sequences are not that useful for categorizing them. That is, the benefit of seeing the text in the correct order does not overcome the added complexity of learning on sequences, since the sender and individual words in the email are good indicators of which category the email should be in as it is.</p> <p>But, the extra embedding step still seems like it should be useful, since it creates a richer representation of the word. So it seems worthwhile to still try to use it, and also include the important Deep Learning tool of convolution on the text to find important local features. Once again, <a href="https://github.com/fchollet/keras/blob/master/examples/imdb_cnn.py">there is a Keras example</a> that still does embedding but feeds those vectors into convolution and pooling layers instead of LSTM layers.</p>
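<p>Schematically, that sort of model looks roughly like this (only a sketch in the spirit of the imdb_cnn example - the vocabulary size, sequence length, embedding size, and filter counts are made-up numbers, and the layer names follow the Keras versions of that era):</p> <pre><code>from keras.models import Sequential
from keras.layers.core import Dense, Dropout, Activation, Flatten
from keras.layers.embeddings import Embedding
from keras.layers.convolutional import Convolution1D, MaxPooling1D

max_features = 5000   # vocabulary size: each word is represented by its frequency rank
maxlen = 400          # emails padded/truncated to this many words
num_categories = 11   # the email folders being predicted

model = Sequential()
# Each word index becomes a dense 50-dimensional vector
model.add(Embedding(max_features, 50, input_length=maxlen))
# Convolution looks for informative local patterns a few words long
model.add(Convolution1D(nb_filter=250, filter_length=3, activation='relu'))
model.add(MaxPooling1D(pool_length=2))
model.add(Flatten())
model.add(Dropout(0.5))
model.add(Dense(num_categories))
model.add(Activation('softmax'))
model.compile(loss='categorical_crossentropy', optimizer='adam')
</code></pre> <p>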
But, the results once again are not that impressive:</p> <pre><code>Epoch 1/3 5849/5849 [===========================] - 127s - loss: 1.3299 - acc: 0.5403 - val_loss: 0.8268 - val_acc: 0.7492 Epoch 2/3 5849/5849 [===========================] - 127s - loss: 0.4977 - acc: 0.8470 - val_loss: 0.6076 - val_acc: 0.8415 Epoch 3/3 5849/5849 [===========================] - 127s - loss: 0.1520 - acc: 0.9571 - val_loss: 0.6473 - val_acc: 0.8554 Test score: 0.556200767488 </code></pre> <p><strong>Test accuracy: 0.858725761773</strong></p> <p>I really hoped learning with sequences and embeddings could be better than learning with basic n-gram features, since in theory the former contains more information about the original emails. But, the folk knowledge that Deep Learning is not very useful for small datasets appears to be true here.</p> <h1 id="its-the-features-dummy">It’s The Features, Dummy</h1> <p>Well, hmm, that did not get me that coveted 90% test accuracy… Time to try being a little smarter about this. See, the current approach of making features out of the top 2500 frequent words is rather silly, in that it includes common English words such as ‘i’ or ‘that’ along with useful category-specific words such as ‘homework’ or ‘due’. But, it’s tricky to just guess a cutoff of most frequent words, or blacklist some number of words - you never know what turns out to be useful for features, since it is possible I happen to use one plain word more in one category than the others (such as the category ‘Personal’).</p> <p>So, let’s avoid the guesswork and instead rely on good ol’ feature selection to pick out features that are actually good and filter out silly ones like ‘i’. As with baseline testing, this is easy using scikit and its <a href="http://scikit-learn.org/stable/modules/generated/sklearn.feature_selection.SelectKBest.html">SelectKBest</a> class, and is fast enough that it barely takes any time compared to running the neural net. So, does this work?</p> <figure> <img class="postimagesmaller" src="/writing/images/2016-1-13-neural-net-categorize-my-email/11-select_accs_zoomed_512.png" alt="Select accs" /> <figcaption>Yes it works, 90%! </figcaption> </figure> <p>Very nice! Though there is still variance in the performance, more words to start with is clearly better, but this set of words can be cut down rather heavily with feature selection without reducing performance. Apparently the neural net has no problem with overfitting if all the words are kept around. Inspecting the best and worst features according to the feature selector confirms it selects sensible-seeming words as good and bad:</p> <figure> <img class="postimagehalf" src="/writing/images/2016-1-13-neural-net-categorize-my-email/15-scores_best.png" alt="Select times" /> <img class="postimagehalf" src="/writing/images/2016-1-13-neural-net-categorize-my-email/16-scores_worst.png" alt="Select times GPU" /> <figcaption>Best and worst words according to chi squared feature selection <a href="http://scikit-learn.org/stable/auto_examples/ensemble/plot_forest_importances.html">(loosely based on Scikit sample code)</a></figcaption> </figure> <p>A lot of the best ones are names or refer to specific things (the ‘controller’ is from ‘motor controller’), as could be expected, though a few such as ‘remember’ or ‘total’ would not strike me as very good features.
The worst ones, on the other hand, are fairly predictable, being either overly generic or overly specific words.</p> <p>So, the end conclusion is that more words=better, and feature selection can help out by keeping the runtime lower. Well, this helps, but perhaps there is something else to be done to improve performance. To see what, we can look at what mistakes the neural net makes, with a <a href="http://scikit-learn.org/stable/auto_examples/model_selection/plot_confusion_matrix.html#example-model-selection-plot-confusion-matrix-py">confusion matrix again from scikit-learn</a>:</p> <figure> <img class="postimagesmaller" src="/writing/images/2016-1-13-neural-net-categorize-my-email/17-conf_normalized.png" alt="Conf mat" /> <figcaption>The confusion matrix for the neural net results</figcaption> </figure> <p>Okay, nice, most of the color is along the diagonal, but there are still some annoying blotches elsewhere. In particular, the visualization implies the ‘Unread’ and ‘Important’ categories are problem makers. But wait! I did not even create those, I don’t really care about things working correctly with them, nor with ‘Sent’. Clearly I should take those out and see if the neural net can do a good job specifically with the categories I created for myself.</p> <p>So, let’s wrap up with a final experiment in which those irrelevant categories are removed and we use the most features of any run so far - 10000 words with selection of the 4000 best:</p> <pre><code>Epoch 1/5 5850/5850 [==============================] - 2s - loss: 0.8013 - acc: 0.7879 - val_loss: 0.2976 - val_acc: 0.9369 Epoch 2/5 5850/5850 [==============================] - 1s - loss: 0.1953 - acc: 0.9557 - val_loss: 0.2322 - val_acc: 0.9508 Epoch 3/5 5850/5850 [==============================] - 1s - loss: 0.0988 - acc: 0.9795 - val_loss: 0.2418 - val_acc: 0.9338 Epoch 4/5 5850/5850 [==============================] - 1s - loss: 0.0609 - acc: 0.9865 - val_loss: 0.2275 - val_acc: 0.9462 Epoch 5/5 5850/5850 [==============================] - 1s - loss: 0.0406 - acc: 0.9925 - val_loss: 0.2326 - val_acc: 0.9462 722/722 [==============================] - 0s Test score: 0.243211859068 </code></pre> <p><strong>Test accuracy: 0.940443213296</strong></p> <figure> <img class="postimagesmaller" src="/writing/images/2016-1-13-neural-net-categorize-my-email/18-conf_normalized2.png" alt="Conf mat 2" /> <figcaption>The confusion matrix for the updated neural net results</figcaption> </figure> <p>How about that! The neural net can predict categories with 94% accuracy. Though, most of that is due to the large feature set - a good comparison classifier (scikit-learn’s <a href="http://scikit-learn.org/stable/modules/generated/sklearn.linear_model.PassiveAggressiveClassifier.html#sklearn.linear_model.PassiveAggressiveClassifier.fit">Passive Aggressive classifier</a>) itself gets 91% on the same exact data. In fact following <a href="https://www.reddit.com/r/MachineLearning/comments/41posw/organizing_my_emails_with_a_neural_net/cz45hdg">someone else’s suggestion</a> to train a Support Vector Machine classifier (scikit’s LinearSVC) in a particular way also resulted in roughly 94% accuracy.</p> <p>So, the conclusion is fairly straightforward - the fancier methods of Deep Learning do not seem that useful for a small dataset such as my emails, and older approaches such as n-grams + tf-idf + SVM can work as well as the most modern neural nets.</p>
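<p>That older-school pipeline is only a few lines with scikit-learn - roughly something like this sketch (the toy emails and labels are placeholders, and this is not the exact setup from the Reddit suggestion):</p> <pre><code>from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.feature_selection import SelectKBest, chi2
from sklearn.svm import LinearSVC

# Placeholder emails and category indices standing in for the parsed mbox data
texts = ["homework due friday", "motor controller firmware", "paper review comments", "grade posted online"]
labels = [0, 1, 2, 0]

# n-gram tf-idf features, trimmed down to the k best by a chi-squared test
X = TfidfVectorizer(ngram_range=(1, 2)).fit_transform(texts)
X = SelectKBest(chi2, k=3).fit_transform(X, labels)

# A linear SVM as the classifier
clf = LinearSVC().fit(X, labels)
print(clf.score(X, labels))
</code></pre> <p>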
More generally, just working with a Bag of Words approach works rather well if the data is as small and neatly categorized as this.</p> <p>I don’t know how many people use categories in gmail, but if it really is this easy to make a classifier that is right most of the time, I would really like it if gmail indeed had such a machine-learned approach to suggesting a category for each email for one-click email organizing. But, for now, I can just feel nice knowing I managed to get a 20% improvement over my last attempt at this, and played with Keras while I was at it.</p> <p><br /> <br /> <br /></p> <h1 id="epilogue-extra-experiments">Epilogue: Extra Experiments</h1> <p>I did a bunch of other things while working on this, and some are worth highlighting. A problem I had was that all this stuff took forever to run, in large part because I had yet to make use of the now standard trick of doing machine learning with a GPU. So, following a <a href="http://deeplearning.net/software/theano/install_ubuntu.html">very nice tutorial</a> I did just that, and got nice results:</p> <figure> <img class="postimagehalf" src="/writing/images/2016-1-13-neural-net-categorize-my-email/13-select_times_512.png" alt="Select times" /> <img class="postimagehalf" src="/writing/images/2016-1-13-neural-net-categorize-my-email/14-select_times_512_gpu.png" alt="Select times GPU" /> <figcaption>The times taken to achieve that 90% plot above, without vs with GPU; what a nice speedup!</figcaption> </figure> <p>It should be noted that the Keras neural net with 94% was significantly faster to train and use than the SVM, so it was still ultimately the best approach I have found so far.</p> <p>I also wanted to do more visualizations besides confusion matrices. There was not much I could do with Keras for this, though I did find an <a href="https://github.com/fchollet/keras/issues/254">ongoing discussion</a> concerning visualization. That led me to <a href="https://github.com/aleju/keras">a fork of Keras</a> with at least a nice way to graph the training progress. Not very useful, but fun. After hacking it a bit to plot batches instead of epochs, it generated very nice training graphs:</p> <figure> <img class="postimage" src="/writing/images/2016-1-13-neural-net-categorize-my-email/4-graph.png" alt="NN training" /> <figcaption>The progression of neural net training for a slightly modified version of the example (with more words included) </figcaption> </figure> <p>Interesting - the cross validation between epochs results in big jumps in training accuracy, not something I’d expect. But, more pertinently, it’s easy to see the training accuracy just about reaches 1.0 and definitely plateaus.</p> <p>Okay, good, but the harder problem was increasing the test accuracy. As before, the first question is whether I can quickly alter the feature representation to help the neural net out. The Keras module that converts the text into matrices has several options besides making a binary matrix: matrices with word counts, frequencies, or tfidf values. It is also very easy to alter the number of words kept in the matrices as features, and so being the amazing programmer that I am I managed to write a few loops to evaluate how varying the feature type and word count affects the test accuracy.</p>
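<p>In spirit, those loops were not much more than the following sketch (the evaluate_model helper here is hypothetical and stands in for the actual training and scoring code in the linked repo):</p> <pre><code>from keras.preprocessing.text import Tokenizer

# Placeholder data standing in for the parsed emails and their category indices
email_texts = ["homework due friday", "motor controller firmware", "paper review comments"]
email_labels = [0, 1, 2]

def evaluate_model(X, labels):
    # Hypothetical stand-in: train the neural net on a train/test split of
    # (X, labels) and return the test accuracy
    return 0.0

results = {}
for max_words in [500, 1000, 2500]:
    tokenizer = Tokenizer(nb_words=max_words)  # num_words= in newer Keras versions
    tokenizer.fit_on_texts(email_texts)
    for mode in ['binary', 'count', 'freq', 'tfidf']:
        X = tokenizer.texts_to_matrix(email_texts, mode=mode)
        results[(max_words, mode)] = evaluate_model(X, email_labels)
</code></pre> <p>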
Not only that, but I even made a pretty plot of the results with Python:</p> <figure> <img class="postimagehalf" src="/writing/images/2016-1-13-neural-net-categorize-my-email/5-word_accs.png" alt="Word accs" /> <img class="postimagehalf" src="/writing/images/2016-1-13-neural-net-categorize-my-email/6-word_baseline_accs.png" alt="Baseline accs" /> <figcaption>Test accuracy based on feature type and how many words are kept as features (baseline being k nearest neighbors)</figcaption> </figure> <p>This is where I first saw that I should definitely increase the maximum words to more than 1000. It was also interesting to see that the most basic and least information-dense feature type, binary 1s or 0s indicating word presence, is about as good as or better than the other features that convey more about the original data. This is not too unexpected, though - most likely more interesting words like ‘code’ or ‘grade’ are helpful for categorization, and having a single occurrence in an email is likely almost as informative as more than one. No doubt the more exact features help somewhat, but also lead to worse performance due to more potential for overfitting.</p> <p>All in all, what we see is that the binary feature type is clearly the best one, and that increasing the number of words helps out quite a bit to get accuracies of about 87%-88%.</p> <p>I also looked at baseline algorithms while working on this, to ensure something simple like k nearest neighbors (<a href="http://scikit-learn.org/stable/modules/neighbors.html">from scikit</a>) was not equivalent to neural nets, which indeed proved to be true. Linear regression performed even worse, so it seems my use of neural nets is justified.</p> <p>By the way, all this word increasing is not cheap. Even with cached versions of the dataset such that I did not have to parse the emails and extract features each time, running all these tests took a hefty amount of time:</p> <figure> <img class="postimagesmaller" src="/writing/images/2016-1-13-neural-net-categorize-my-email/8-word_times.png" alt="Word times" /> <figcaption>Linear increase in time as word count is increased. Not bad, really; linear regression was far worse</figcaption> </figure> <p>So, increasing the number of words helped, but I was still not cracking the 90% mark - the coveted A threshold! So the next logical thing was to stick with 2500 words and look at varying the neural net size. Also, the example Keras model happened to have 50% dropout on the hidden layer and it was interesting to see if this actually meaningfully helps the performance. So, time to spin up another set of loops and get another pretty graph:</p> <figure> <img class="postimagesmaller" src="/writing/images/2016-1-13-neural-net-categorize-my-email/9-hidden_accs_zoomed.png" alt="Hidden accs" /> <figcaption>Zoomed in view of accuracies for different dropouts and hidden layer sizes</figcaption> </figure> <p>Well, this is somewhat surprising - we don’t need very many hidden layer units at all to do well! With lower dropout (less regularization), as few as 64 and 124 hidden layer units can do just about as well as the default of 512. These results are averaged across five runs, by the way, so mere variation in the outcomes does not account for the ability of small hidden layers to do well. This suggests that the large word counts are good for just including the helpful features, but that there are not really that many helpful features to pick up on - otherwise more neurons would be necessary to do better.
This is good to know, since we can save quite a bit of time by using the smaller hidden layers:</p> <figure> <img class="postimagesmaller" src="/writing/images/2016-1-13-neural-net-categorize-my-email/10-hidden_times.png" alt="Hidden times" /> <figcaption>Again, linear growth as we increase the hidden layer size (as we'd hope, since they are independent of each other) </figcaption> </figure> <p>But, this is not entirely accurate. More runs with a large number of features show that the default hidden layer size of 512 does perform significantly better than a much smaller hidden layer:</p> <figure> <img class="postimagehalf" src="/writing/images/2016-1-13-neural-net-categorize-my-email/11-select_accs_zoomed_512.png" alt="Select accs" /> <img class="postimagehalf" src="/writing/images/2016-1-13-neural-net-categorize-my-email/12-select_accs_zoomed_32.png" alt="Select accs 32" /> <figcaption>Comparison of performance with 512 and 32 hidden layer units. </figcaption> </figure> <p>So, in the end we find what we already knew - more words=better.</p> <p><a href="/writing/organizing-my-emails-with-a-neural-net/">Organizing My Emails With A Neural Net</a> was originally published by Andrey Kurenkov at <a href="">Andrey Kurenkov's Web World</a> on January 13, 2016.</p> <![CDATA[A 'Brief' History of Neural Nets and Deep Learning, Part 4]]> /writing/a-brief-history-of-neural-nets-and-deep-learning-part-4 2015-12-24T18:19:34-08:00 2015-12-24T18:19:34-08:00 www.andreykurenkov.com contact@andreykurenkov.com <p>This is the fourth part in ‘A Brief History of Neural Nets and Deep Learning’. Parts 1-3 <a href="http://www.andreykurenkov.com/writing/a-brief-history-of-neural-nets-and-deep-learning">here</a>, <a href="http://www.andreykurenkov.com/writing/a-brief-history-of-neural-nets-and-deep-learning-part-2">here</a>, and <a href="http://www.andreykurenkov.com/writing/a-brief-history-of-neural-nets-and-deep-learning-part-3">here</a>. In this part, we will get to the end of our story and see how deep learning emerged from the slump neural nets found themselves in by the late 90s, and the amazing state of the art results it has achieved since.</p> <blockquote> <p>“Ask anyone in machine learning what kept neural network research alive and they will probably mention one or all of these three names: Geoffrey Hinton, fellow Canadian Yoshua Bengio and Yann LeCun, of Facebook and New York University.”<sup id="fnref:1"><a href="#fn:1" class="footnote">1</a></sup></p> </blockquote> <h1 id="the-deep-learning-conspiracy">The Deep Learning Conspiracy</h1> <p>When you want a revolution, start with a conspiracy. With the ascent of Support Vector Machines and the failure of backpropagation, the early 2000s were a dark time for neural net research. LeCun and Hinton variously mention how in this period their papers or the papers of their students were routinely rejected from being published due to their subject being Neural Nets. The above quote is probably an exaggeration - certainly research in Machine Learning and AI was still very active, and other people were also still working with neural nets - but citation counts from the time make it clear that the excitement had leveled off, even if it did not completely evaporate. Still, they persevered. And they found a strong ally outside the research realm: The Canadian government.
Funding from the Canadian Institute for Advanced Research (CIFAR), which encourages basic research without direct application, was what motivated Hinton to move to Canada in 1987, and funded his work afterward. But, the funding was ended in the mid-90s just as sentiment towards neural nets was becoming negative again. Rather than relenting and switching his focus, Hinton fought to continue work on neural nets, and managed to secure more funding from CIFAR as told well in <a href="http://www.thestar.com/news/world/2015/04/17/how-a-toronto-professors-research-revolutionized-artificial-intelligence.html">this exemplary piece</a><sup id="fnref:1:1"><a href="#fn:1" class="footnote">1</a></sup>:</p> <blockquote> <p>“But in 2004, Hinton asked to lead a new program on neural computation. The mainstream machine learning community could not have been less interested in neural nets.</p> </blockquote> <blockquote> <p>“It was the worst possible time,” says Bengio, a professor at the Université de Montréal and co-director of the CIFAR program since it was renewed last year. “Everyone else was doing something different. Somehow, Geoff convinced them.”</p> </blockquote> <blockquote> <p>“We should give (CIFAR) a lot of credit for making that gamble.”</p> </blockquote> <blockquote> <p>CIFAR “had a huge impact in forming a community around deep learning,” adds LeCun, the CIFAR program’s other co-director. “We were outcast a little bit in the broader machine learning community: we couldn’t get our papers published. This gave us a place where we could exchange ideas.””</p> </blockquote> <p>The funding was modest, but sufficient to enable a small group of researchers to keep working on the topic. As Hinton tells it, they hatched a conspiracy: “rebrand” the frowned-upon field of neural nets with the moniker “Deep Learning” <sup id="fnref:1:2"><a href="#fn:1" class="footnote">1</a></sup>. Then, what every researcher must dream of actually happened: Hinton, Simon Osindero, and Yee-Whye Teh published a paper in 2006 that was seen as a breakthrough, a breakthrough significant enough to rekindle interest in neural nets: <a href="https://www.cs.toronto.edu/~hinton/absps/fastnc.pdf"><strong>A fast learning algorithm for deep belief nets</strong></a><sup id="fnref:2"><a href="#fn:2" class="footnote">2</a></sup>. Though, as we’ll see, the approaches used in the paper have been superseded by newer work, the movement that is ‘Deep Learning’ can very persuasively be said to have started precisely with this paper. But, more important than the name was the idea - that neural networks with many layers really could be trained well, if the weights are initialized in a clever way rather than randomly. Hinton <a href="https://youtu.be/vShMxxqtDDs?t=6m59s">once expressed</a> the need for such an advance at the time:</p> <blockquote> <p>“Historically, this was very important in overcoming the belief that these deep neural networks were no good and could never be trained. And that was a very strong belief. A friend of mine sent a paper to ICML [International Conference on Machine Learning], not that long ago, and the referee said it should not [be] accepted by ICML, because it was about neural networks and it was not appropriate for ICML. In fact if you look at ICML last year, there were no papers with ‘neural’ in the title accepted, so ICML should not accept papers about neural networks. That was only a few years ago. And one of the IEEE journals actually had an official policy of [not accepting your papers].
So, it was a strong belief.”</p> </blockquote> <figure> <img class="postimagesmall" src="https://draftin.com:443/images/34953?token=zHc69el3bU7qKD0PwPBDHVLSVQsc0nyufUrpCEm9164_1Alk9kBBHkl5ymeS-6xNRoN3bPMo-J1VV5llJxc5k3M" alt="RBM" /> <figcaption>A Restricted Boltzmann Machine. <a href="http://deeplearning.net/tutorial/rbm.html">(Source)</a></figcaption> </figure> <p>So what was the clever way of initializing weights? The basic idea is to train each layer one by one with unsupervised training, which starts off the weights much better than just giving them random values, and then finish with a round of supervised learning just as is normal for neural nets. Each layer starts out as a Restricted Boltzmann Machine (RBM), which is just a Boltzmann Machine without connections among the hidden units or among the visible units - only the connections between the hidden and visible layers remain, as illustrated above - and is taught a generative model of data in an unsupervised fashion. It turns out that this form of Boltzmann machine can be trained in an efficient manner introduced by Hinton in the 2002 <a href="http://www.cs.toronto.edu/~fritz/absps/tr00-004.pdf">“Training Products of Experts by Minimizing Contrastive Divergence”</a><sup id="fnref:3"><a href="#fn:3" class="footnote">3</a></sup>. Basically, this algorithm maximizes something other than the probability of the units generating the training data, which allows for a nice approximation and turns out to still work well. So, using this method, the algorithm is as follows:</p> <ol> <li>Train an RBM on the training data using contrastive-divergence. This is the first layer of the belief net.</li> <li>Generate the hidden values of the trained RBM for the data, and train another RBM using those hidden values. This is the second layer - ‘stack’ it on top of the first and keep weights in just one direction to form a belief net.</li> <li>Keep doing step 2 for as many layers as are desired for the belief net.</li> <li>If classification is desired, add a small set of hidden units that correspond to the classification labels and do a variation on the wake-sleep algorithm to ‘fine-tune’ the weights. Such combinations of unsupervised and supervised learning are often called <strong>semi-supervised</strong> learning.</li> </ol> <figure> <img class="postimagesmall" src="https://draftin.com:443/images/34836?token=eme-iIQe7-0La4-L2TE4ho9Fmj3Hlx4z8dP0khGimjihi31RDYtsPPjHTB5TpCYPlH-I8xFqce5jrbln-lMwokE" alt="From http://www.iro.umontreal.ca/~lisa/twiki/bin/view.cgi/Public/DeepVsShallowComparisonICML2007" /> <figcaption>The layerwise pre-training that Hinton introduced. <a href="http://deeplearning.net/tutorial/rbm.html">(Source)</a></figcaption> </figure> <p>The paper concluded by showing that deep belief networks (DBNs) had state of the art performance on the standard MNIST character recognition dataset, significantly outperforming normal neural nets with only a few layers. Yoshua Bengio et al.
followed up on this work in 2007 with <a href="http://papers.nips.cc/paper/3048-greedy-layer-wise-training-of-deep-networks.pdf">“Greedy Layer-Wise Training of Deep Networks”</a><sup id="fnref:4"><a href="#fn:4" class="footnote">4</a></sup>, in which they present a strong argument that deep machine learning methods (that is, methods with many processing steps, or equivalently with hierarchical feature representations of the data) are more efficient for difficult problems than shallow methods (which two-layer ANNs or support vector machines are examples of).</p> <figure> <img class="postimagesmall" src="https://draftin.com:443/images/34966?token=ylFbLB-4cILErX48_1I24s32Oz1uTA-Kr0HzMcF4MKvfhfTS-IG4ybsb8PGouGYxE5uZvVRoHjG_3W2AHs6xMWI" alt="Autoencoder unsupervised pre-training" /> <figcaption>Another view of unsupervised pre-training, using autoencoders instead of RBMs. <a href="https://commons.wikimedia.org/wiki/File:Stacked_Autoencoders.png?uselang=ru">(Source)</a></figcaption> </figure> <p>They also present reasons for why the addition of unsupervised pre-training works, and conclude that this not only initializes the weights in a more optimal way, but perhaps more importantly leads to more useful learned representations of the data. In fact, using RBMs is not that important - unsupervised pre-training of normal neural net layers using backpropagation with plain Autoencoder layers proved to also work well. Likewise, at the same time another approach called Sparse Coding also showed that unsupervised feature learning was a powerful approach for improving supervised learning performance.</p> <p>So, the key really was having many layers of computing units so that good high-level representations of data could be learned - in complete disagreement with the traditional approach of hand-designing some nice feature extraction steps and only then doing learning using those features. Hinton and Bengio’s work had empirically demonstrated that fact, but more importantly, showed the premise that deep neural nets could not be trained well to be false. This, LeCun had already demonstrated with CNNs throughout the 90s, but neural nets still went out of favor. Bengio, in collaboration with Yann LeCun, reiterated this in <a href="http://yann.lecun.com/exdb/publis/pdf/bengio-lecun-07.pdf">“Scaling Algorithms Towards AI”</a><sup id="fnref:5"><a href="#fn:5" class="footnote">5</a></sup>:</p> <blockquote> <p>“Until recently, many believed that training deep architectures was too difficult an optimization problem. However, at least two different approaches have worked well in training such architectures: simple gradient descent applied to convolutional networks [LeCun et al., 1989, LeCun et al., 1998] (for signals and images), and more recently, layer-by-layer unsupervised learning followed by gradient descent [Hinton et al., 2006, Bengio et al., 2007, Ranzato et al., 2006]. Research on deep architectures is in its infancy, and better learning algorithms for deep architectures remain to be discovered. Taking a larger perspective on the objective of discovering learning principles that can lead to AI has been a guiding perspective of this work. We hope to have helped inspire others to seek a solution to the problem of scaling machine learning towards AI.”</p> </blockquote> <p>And inspire they did. Or at least, they started; though deep learning had not yet gained the tsunami momentum that it has today, the wave had unmistakably begun.</p>
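<p>For readers who like code more than prose, the layer-wise recipe listed earlier can be sketched in a few lines with scikit-learn’s BernoulliRBM - a loose modern illustration rather than Hinton’s original setup, using toy random data and with the wake-sleep fine-tuning step replaced by a plain supervised classifier on top:</p> <pre><code>import numpy as np
from sklearn.neural_network import BernoulliRBM
from sklearn.linear_model import LogisticRegression

# Toy binary data standing in for something like MNIST pixels
rng = np.random.RandomState(0)
X = rng.randint(0, 2, size=(200, 64)).astype(float)
y = rng.randint(0, 2, 200)

# Step 1: train an RBM on the raw data (unsupervised, contrastive-divergence style)
rbm1 = BernoulliRBM(n_components=32, learning_rate=0.05, n_iter=10, random_state=0).fit(X)
H1 = rbm1.transform(X)

# Step 2: train a second RBM on the first layer's hidden activations - the 'stacking'
rbm2 = BernoulliRBM(n_components=16, learning_rate=0.05, n_iter=10, random_state=0).fit(H1)
H2 = rbm2.transform(H1)

# Step 3 (simplified): supervised fine-tuning replaced by a classifier on the top-layer features
clf = LogisticRegression().fit(H2, y)
print(clf.score(H2, y))
</code></pre> <p>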
Still, the results at that point were not that impressive - most of the demonstrated performance in the papers up to this point was for the MNIST dataset, a classic machine learning task that had been the standard benchmark for algorithms for about a decade. Hinton’s 2006 publication demonstrated a very impressive error rate of only 1.25% on the test set, but SVMs had already gotten an error rate of 1.4%, and even simple algorithms could get error rates in the low single digits. And, as was pointed out in the paper, Yann LeCun had already demonstrated error rates of 0.95% in 1998 using CNNs.</p> <p>So, doing well on MNIST was not necessarily that big a deal. Aware of this and confident that it was time for deep learning to take the stage, Hinton and two of his graduate students, Abdel-rahman Mohamed and George Dahl, demonstrated its effectiveness at a far more challenging AI task: <a href="http://www.cs.toronto.edu/~gdahl/papers/dbnPhoneRec.pdf">Speech Recognition</a><sup id="fnref:6"><a href="#fn:6" class="footnote">6</a></sup>. Using DBNs, the two students and Hinton managed to improve on a decade-old performance record on a standard speech recognition dataset. This was an impressive achievement, but in retrospect seems like only a hint at what was coming - in short, many more broken records.</p> <h1 id="the-importance-of-brute-force">The Importance of Brute Force</h1> <p>The algorithmic advances described above were undoubtedly important to the emergence of deep learning, but there was another essential component that had emerged in the decade since the 1990s: pure computational power. Following Moore’s law, computers had gotten dozens of times faster since the slow days of the 90s, making learning with large datasets and many layers much more tractable. But even this was not enough - CPUs were starting to hit a ceiling in terms of speed growth, and computer power was starting to increase mainly through weakly parallel computations with several CPUs. To learn the millions of weights typical in deep models, the limitations of weak CPU parallelism had to be left behind and replaced with the massively parallel computing powers of GPUs. Realizing this is, in part, how Abdel-rahman Mohamed, George Dahl, and Geoff Hinton accomplished their record-breaking speech recognition performance<sup id="fnref:7"><a href="#fn:7" class="footnote">7</a></sup>:</p> <blockquote> <p>“Inspired by one of Hinton’s lectures on deep neural networks, Mohamed began applying them to speech - but deep neural networks required too much computing power for conventional computers – so Hinton and Mohamed enlisted Dahl. A student in Hinton’s lab, Dahl had discovered how to train and simulate neural networks efficiently using the same high-end graphics cards which make vivid computer games feasible on personal computers.</p> </blockquote> <blockquote> <p>They applied the same method to the problem of recognizing fragments of phonemes in very short windows of speech,” said Hinton. “They got significantly better results than previous methods on a standard three-hour benchmark.””</p> </blockquote> <p>It’s hard to say just how much faster using GPUs over CPUs was in this case, but the paper <a href="http://www.machinelearning.org/archive/icml2009/papers/218.pdf">“Large-scale Deep Unsupervised Learning using Graphics Processors”</a><sup id="fnref:8"><a href="#fn:8" class="footnote">8</a></sup> of the same year suggests a number: 70 times faster. Yes, 70 times - reducing weeks of work into days, even a single day.
The authors, who had previously developed Sparse Coding, included the prolific Machine Learning researcher Andrew Ng, who increasingly realized that making use of lots of training data and of fast computation had been greatly undervalued by researchers in favor of incremental changes in learning algorithms. This idea was strongly supported by 2010’s <a href="http://arxiv.org/pdf/1003.0358.pdf">“Deep Big Simple Neural Nets Excel on Handwritten Digit Recognition”</a><sup id="fnref:9"><a href="#fn:9" class="footnote">9</a></sup> (notably co-written by J. Schmidhuber, one of the inventors of the recurrent LSTM networks), which showed a whopping 0.35% error rate could be achieved on the MNIST dataset without anything more special than really big neural nets, a lot of variations on the input, and efficient GPU implementations of backpropagation. These ideas had existed for decades, so although it could not be said that algorithmic advancements did not matter, this result did strongly support the notion that the brute force approach of big training sets and fast parallelized computations was also crucial.</p> <p>Dahl and Mohamed’s use of a GPU to get record-breaking results was an early and relatively modest success, but it was sufficient to incite excitement and for the two to be invited to intern at Microsoft Research<sup id="fnref:1:3"><a href="#fn:1" class="footnote">1</a></sup>. Here, they would benefit from another trend in computing that had emerged by then: Big Data. That loosest of terms, which in the context of machine learning is easy to understand - lots of training data. And lots of training data is important, because without it neural nets still did not do great - they tended to overfit (work perfectly on the training data, but not generalize to new test data). This makes sense - the complexity of what large neural nets can compute is such that a lot of data is needed to avoid them learning every little unimportant aspect of the training set - but was a major challenge for researchers in the past. So now, the computing and data gathering powers of large companies proved invaluable. The two students handily proved the power of deep learning during their three-month internship, and Microsoft Research has been at the forefront of deep learning speech recognition ever since.</p> <p>Microsoft was not the only big company to recognize the power of deep learning (though it was handily the first). Navdeep Jaitly, another student of Hinton’s, went off to a summer internship at Google in 2011. There, he worked on Google’s speech recognition, and showed their existing setup could be much improved by incorporating deep learning. The revised approach soon powered Android’s speech recognition, replacing much of Google’s carefully crafted prior solution <sup id="fnref:1:4"><a href="#fn:1" class="footnote">1</a></sup>.</p> <p>Besides the impressive effects of humble PhD interns on these gigantic companies’ products, what is notable here is that both companies were making use of the same ideas - ideas that were out in the open for anyone to work with. And in fact, the work by Microsoft and Google, as well as IBM and Hinton’s lab, resulted in the impressively titled <a href="http://static.googleusercontent.com/media/research.google.com/en//pubs/archive/38131.pdf">“Deep Neural Networks for Acoustic Modeling in Speech Recognition: The Shared Views of Four Research Groups”</a><sup id="fnref:10"><a href="#fn:10" class="footnote">10</a></sup> in 2012.
Four research groups - three from companies that could certainly benefit from a briefcase full of patents on the emerging wonder technology of deep learning, and the university research group that popularized that technology - working together and publishing their results to the broader research community. If there was ever an ideal scenario for industry adopting an idea from research, this seems like it.</p> <p>Not to say the companies were doing this for charity. This was the beginning of all of them exploring how to commercialize the technology, and most of all Google. But it was perhaps not Hinton, but Andrew Ng who incited the company to become likely the world’s biggest commercial adopter and proponent of the technology. In 2011, Ng <a href="https://medium.com/backchannel/google-search-will-be-your-next-brain-5207c26e4523#.b3x9b7ods">incidentally met</a> with the legendary Googler Jeff Dean while visiting the company, and chatted about his efforts to train neural nets with Google’s fantastic computational resources. This intrigued Dean, and together with Ng they formed Google Brain - an effort to build truly giant neural nets and explore what they could do. The work resulted in unsupervised neural net learning of an unprecedented scale - 16,000 CPU cores powering the learning of a whopping 1 billion weights (for comparison, Hinton’s breakthrough 2006 DBN had about 1 million weights). The neural net was trained on YouTube videos, entirely without labels, and learned to recognize the most common objects in those videos - leading of course to the internet’s collective glee over the net’s discovery of cats:</p> <figure> <img class="postimagesmall" src="https://draftin.com:443/images/34978?token=5YYsfXB4l7NfgnQSV-PHB3ctZ5NVQU0oc5W8MnG1LIDiO6wBW_f9C_hUixutvqV7N4fRH_P0W3ALuOdZbIGf_L4" alt="cat" /> <figcaption>Google's famous neural-net learned cat. This is the optimal input to one of the neurons. <a href="https://googleblog.blogspot.com/2012/06/using-large-scale-brain-simulations-for.html">(Source)</a></figcaption> </figure> <p>Cute as that was, it was also useful. As they reported in a regularly published paper, the features learned by the model could be used for record-setting performance on a standard computer vision benchmark<sup id="fnref:11"><a href="#fn:11" class="footnote">11</a></sup>. With that, Google’s internal tools for training massive neural nets were born, and they have only continued to evolve since. The wave of deep learning research that began in 2006 had now undeniably made it into industry.</p> <h1 id="the-ascendance-of-deep-learning">The Ascendance of Deep Learning</h1> <p>While deep learning was making it into industry, the research community was hardly keeping still. The discovery that efficient use of GPUs and computing power in general was so important made people examine long-held assumptions and ask questions that should have perhaps been asked long ago - namely, why exactly does backpropagation not work well? The insight to ask why the old approaches did not work, rather than why the new approaches did, led Xavier Glorot and Yoshua Bengio to write <a href="http://jmlr.org/proceedings/papers/v9/glorot10a/glorot10a.pdf">“Understanding the difficulty of training deep feedforward neural networks”</a> in 2010<sup id="fnref:12"><a href="#fn:12" class="footnote">12</a></sup>.
In it, they discussed two very meaningful findings:</p> <ol> <li>The particular non-linear activation function chosen for neurons in a neural net makes a big impact on performance, and the one often used by default is not a good choice.</li> <li>It was not so much choosing random weights that was problematic, as choosing random weights without consideration for which layer the weights are for. The old vanishing gradient problem happens, basically, because backpropagation involves a sequence of multiplications that invariably result in smaller derivatives for earlier layers. That is, unless weights are chosen with different scales according to the layer they are in - making this simple change results in significant improvements.</li> </ol> <figure> <img class="postimagesmall" src="https://draftin.com:443/images/34990?token=EsjkMeAXQXbIbaniAeVd0nqCOnwpl2PLo2SLnmkePeWbCSSBB0CG1lTOu_JDP2aDIqCWbB_rfGPuUm_A7MXj2wg" alt="ReLU" /> <figcaption>Different activation functions. The ReLU is the <strong>rectified linear unit</strong>. <a href="https://imiloainf.wordpress.com/2013/11/06/rectifier-nonlinearities/">(Source)</a></figcaption> </figure> <p>The second point is quite clear, but the first opens the question: ‘what, then, is the best activation function’? Three different groups explored the question (a group with LeCun, with <a href="http://yann.lecun.com/exdb/publis/pdf/jarrett-iccv-09.pdf">“What is the best multi-stage architecture for object recognition?”</a><sup id="fnref:13"><a href="#fn:13" class="footnote">13</a></sup>, a group with Hinton, in <a href="http://www.cs.toronto.edu/~fritz/absps/reluICML.pdf">“Rectified linear units improve restricted boltzmann machines”</a><sup id="fnref:14"><a href="#fn:14" class="footnote">14</a></sup>, and a group with Bengio - <a href="https://www.utc.fr/~bordesan/dokuwiki/_media/en/glorot10nipsworkshop.pdf">“Deep Sparse Rectifier Neural Networks”</a><sup id="fnref:15"><a href="#fn:15" class="footnote">15</a></sup>), and they all found the same surprising answer: the very much non-differentiable and very simple function f(x)=max(0,x) tends to be the best. Surprising, because the function is kind of weird - it is not strictly differentiable, or rather is not differentiable precisely at zero, so on paper as far as math goes it looks pretty ugly. But, clearly the zero case is a pretty small mathematical quibble - a bigger question is why such a simple function, with constant derivatives on either side of 0, is so good. The answer is not precisely known, but a few ideas seem pretty well established:</p> <ol> <li>Rectified activation leads to <strong>sparse</strong> representations, meaning not many neurons actually end up needing to output non-zero values for any given input. In the years leading up to this point sparsity was shown to be beneficial for deep learning, both because it represents information in a more robust manner and because it leads to significant computational efficiency (if most of your neurons are outputting zero, you can in effect ignore most of them and compute things much faster). Incidentally, researchers in computational neuroscience first introduced the importance of sparse computation in the context of the brain’s visual system, a decade before it was explored in the context of machine learning.</li> <li>The simplicity of the function, and its derivatives, makes it much faster to work with than the exponential sigmoid or the hyperbolic tanh.
As with the use of GPUs, this turns out to not just be a small boost but really important for being able to scale neural nets to the point where they perform well on challenging problems.</li> <li>A later analysis titled <a href="http://ai.stanford.edu/~amaas/papers/relu_hybrid_icml2013_final.pdf">“Rectifier Nonlinearities Improve Neural Network Acoustic Models”</a><sup id="fnref:16"><a href="#fn:16" class="footnote">16</a></sup>, co-written by Andrew Ng, also showed the constant 0 or 1 derivative of the ReLU to be not too detrimental to learning. In fact, it helps avoid the vanishing gradient problem that was the bane of backpropagation. Furthermore, besides producing more sparse representations, it also produces more distributed representations - meaning is derived from the combination of multiple values of different neurons, rather than being localized to individual neurons.</li> </ol> <p>At this point, with all these discoveries since 2006, it had become clear that unsupervised pre-training is not essential to deep learning. It was helpful, no doubt, but it was also shown that in some cases well-done, purely supervised training (with the correct starting weight scales and activation function) could outperform training that included the unsupervised step. So, why indeed, did purely supervised learning with backpropagation not work well in the past? Geoffrey Hinton <a href="https://youtu.be/IcOMKXAw5VA?t=21m29s">summarized the findings up to today in these four points</a>:</p> <ol> <li>Our labeled datasets were thousands of times too small.</li> <li>Our computers were millions of times too slow.</li> <li>We initialized the weights in a stupid way.</li> <li>We used the wrong type of non-linearity.</li> </ol> <p>So here we are. Deep learning. The culmination of decades of research, all leading to this:</p> <blockquote> <p><strong>Deep Learning =<br /> Lots of training data + Parallel Computation + Scalable, smart algorithms</strong></p> </blockquote> <figure> <img class="postimagesmall" src="https://draftin.com:443/images/34968?token=mNMoQEOZRbbIXdXVDFZry4exMhFX-S9L8PH5MM1ADWxHdSgOcgwn-zp89AaoIoAT3_BqeE4V2XiXD7haXwqklP8" alt="Equation" /> <figcaption>I wish I was first to come up with this delightful equation, but it seems others came up with it before me. <a href="http://www.computervisionblog.com/2015/05/deep-learning-vs-big-data-who-owns-what.html">(Source)</a></figcaption> </figure> <p>Not to say all there was to figure out was figured out by this point. Far from it. What had been figured out is exactly the opposite: that people’s intuition was often wrong, and in particular unquestioned decisions and assumptions were often very unfounded. Asking simple questions, trying simple things - these had the power to greatly improve state of the art techniques. And precisely that has been happening, with many more ideas and approaches being explored and shared in deep learning since. An example: <a href="http://arxiv.org/pdf/1207.0580.pdf">“Improving neural networks by preventing co-adaptation of feature detectors”</a><sup id="fnref:17"><a href="#fn:17" class="footnote">17</a></sup> by G. E. Hinton et al. The idea is very simple: to prevent overfitting, randomly pretend some neurons are not there while training. This straightforward idea - called <strong>Dropout</strong> - is a very efficient means of implementing the hugely powerful approach of ensemble learning, which just means learning in many different ways from the training data.
<p>Random Forests, a dominating technique in machine learning to this day, is chiefly effective due to being a form of ensemble learning. Training many different neural nets is possible but is far too computationally expensive, yet this simple idea in essence achieves the same thing and indeed significantly improves performance.</p> <p>Still, having all these research discoveries since 2006 is not what made the computer vision or other research communities respect neural nets again. What did do it was something somewhat less noble: completely destroying non-deep learning methods on a modern competitive benchmark. Geoffrey Hinton enlisted two of his Dropout co-writers, Alex Krizhevsky and Ilya Sutskever, to apply the ideas discovered to create an entry to the ILSVRC-2012 computer vision competition. To me, it is very striking to now understand that their work, described in <a href="http://www.cs.toronto.edu/~fritz/absps/imagenet.pdf">“ImageNet Classification with deep convolutional neural networks”</a><sup id="fnref:18"><a href="#fn:18" class="footnote">18</a></sup>, is the combination of very old concepts (a CNN with pooling and convolution layers, variations on the input data) with several new key insights (very efficient GPU implementation, ReLU neurons, dropout), and that this, precisely this, is what modern deep learning is. So, how did they do? Far, far better than the next closest entry: their error rate was 15.3%, whereas the second closest was 26.2%. This, the first and only CNN entry in that competition, was an undisputed sign that CNNs, and deep learning in general, had to be taken seriously for computer vision. Now, almost all entries to the competition are CNNs - a neural net model Yann LeCun had been working with since 1989. And, remember LSTM recurrent neural nets, devised in the 90s by Sepp Hochreiter and Jürgen Schmidhuber to solve the backpropagation problem? Those, too, are now state of the art for sequential tasks such as speech processing.</p> <p>This was the turning point. A mounting wave of excitement about possible progress had culminated in undeniable achievements that far surpassed what other known techniques could manage. The tsunami metaphor that we started with in part 1 - this is where it began, and it has been growing and intensifying to this day. Deep learning is here, and no winter is in sight.</p> <figure> <img class="postimagesmall" src="https://draftin.com:443/images/34991?token=qzrUAAJqtenu9pNK9CtttT7LRBHwKlOr7udYaUSBRoKNhJjnTJJhUYBGmkUcZVqVvm_D3UwK23d-yTj1ni-clhU" alt="From Google Scholar" /> <figcaption>The citation counts for some of the key people we have seen develop deep learning. I believe I don't need to point out the exponential trends since 2012. From Google Scholar. </figcaption> </figure> <h1 id="epilogue-state-of-the-art">Epilogue: state of the art</h1> <p>If this were a movie, the 2012 ImageNet competition would likely have been the climax, and now we would have a progression of text describing ‘where are they now’. Yann LeCun - Facebook. Geoffrey Hinton - Google. Andrew Ng - Coursera, Google, Baidu. Bengio, Schmidhuber, and Hochreiter are actually still in academia - but presumably with many more citations and/or grad students<sup id="fnref:19"><a href="#fn:19" class="footnote">19</a></sup>.
Though the ideas and achievements of deep learning are definitely exciting, while writing this I was inevitably also moved by the fact that these people, who worked in this field for decades (even as most abandoned it), are now rich, successful, and most of all better situated to do research than ever. All these people’s ideas are still very much out in the open, and in fact basically all these companies are open sourcing their deep learning frameworks, like some sort of utopian vision of industry-led research. What a story.</p> <p>I was foolish enough to hope I could fit a summary of the most impressive results of the past several years in this part, but at this point it is clear I will not have the space to do so. Perhaps one day there will be a part five of this that can finish out the tale by describing these things, but for now let me provide a brief list:</p> <p>1 - The resurgence of LSTM RNNs + representing ‘ideas’ with <a href="http://machinelearning.wustl.edu/mlpapers/paper_files/BengioDVJ03.pdf">distributed representations</a></p> <figure> <img class="postimagesmall" src="https://draftin.com:443/images/34992?token=BfZWKt2mBMH5H3j82QmADSH7i1sKJemavojs6daR5fbgqsxTIpTXn47ji7ChiCqCrkp8jJS7nPpRZhRKlNh9L2E" alt="From Google, taken at https://gigaom.com/2014/11/18/google-stanford-build-hybrid-neural-networks-that-can-explain-photos/" /> <figcaption>A result from last year. Just look at that! <a href="https://gigaom.com/2014/11/18/google-stanford-build-hybrid-neural-networks-that-can-explain-photos/">(Source)</a></figcaption> </figure> <p><a href="http://blogs.microsoft.com/blog/2014/05/27/microsoft-demos-breakthrough-in-real-time-translated-conversations/">Skype real time translation</a></p> <p>2 - Using deep learning for reinforcement learning (again, but better)</p> <figure> <iframe width="420" height="315" src="https://www.youtube.com/embed/V1eYniJ0Rnk" frameborder="0" allowfullscreen=""></iframe> </figure> <p><a href="http://arxiv.org/abs/1509.01549">Chess playing!</a></p> <p>3 - Adding external memory that the neural net can write to and read from</p> <figure> <iframe width="560" height="315" src="https://www.youtube.com/embed/U_Wgc1JOsBk" frameborder="0" allowfullscreen=""></iframe> </figure> <div class="footnotes"> <ol> <li id="fn:1"> <p>Kate Allen, Science and Technology reporter. How a Toronto professor’s research revolutionized artificial intelligence. The Toronto Star, Apr 17 2015. http://www.thestar.com/news/world/2015/04/17/how-a-toronto-professors-research-revolutionized-artificial-intelligence.html <a href="#fnref:1" class="reversefootnote">&#8617;</a> <a href="#fnref:1:1" class="reversefootnote">&#8617;<sup>2</sup></a> <a href="#fnref:1:2" class="reversefootnote">&#8617;<sup>3</sup></a> <a href="#fnref:1:3" class="reversefootnote">&#8617;<sup>4</sup></a> <a href="#fnref:1:4" class="reversefootnote">&#8617;<sup>5</sup></a></p> </li> <li id="fn:2"> <p>Hinton, G. E., Osindero, S., &amp; Teh, Y. W. (2006). A fast learning algorithm for deep belief nets. Neural computation, 18(7), 1527-1554. <a href="#fnref:2" class="reversefootnote">&#8617;</a></p> </li> <li id="fn:3"> <p>Hinton, G. E. (2002). Training products of experts by minimizing contrastive divergence. Neural computation, 14(8), 1771-1800. <a href="#fnref:3" class="reversefootnote">&#8617;</a></p> </li> <li id="fn:4"> <p>Bengio, Y., Lamblin, P., Popovici, D., &amp; Larochelle, H. (2007). Greedy layer-wise training of deep networks. Advances in neural information processing systems, 19, 153.
<a href="#fnref:4" class="reversefootnote">&#8617;</a></p> </li> <li id="fn:5"> <p>Bengio, Y., &amp; LeCun, Y. (2007). Scaling learning algorithms towards AI. Large-scale kernel machines, 34(5). <a href="#fnref:5" class="reversefootnote">&#8617;</a></p> </li> <li id="fn:6"> <p>Mohamed, A. R., Sainath, T. N., Dahl, G., Ramabhadran, B., Hinton, G. E., &amp; Picheny, M. (2011, May). Deep belief networks using discriminative features for phone recognition. In Acoustics, Speech and Signal Processing (ICASSP), 2011 IEEE International Conference on (pp. 5060-5063). IEEE. <a href="#fnref:6" class="reversefootnote">&#8617;</a></p> </li> <li id="fn:7"> <p>November 26, 2012. Leading breakthroughs in speech recognition software at Microsoft, Google, IBM Source: http://news.utoronto.ca/leading-breakthroughs-speech-recognition-software-microsoft-google-ibm <a href="#fnref:7" class="reversefootnote">&#8617;</a></p> </li> <li id="fn:8"> <p>Raina, R., Madhavan, A., &amp; Ng, A. Y. (2009, June). Large-scale deep unsupervised learning using graphics processors. In Proceedings of the 26th annual international conference on machine learning (pp. 873-880). ACM. <a href="#fnref:8" class="reversefootnote">&#8617;</a></p> </li> <li id="fn:9"> <p>Claudiu Ciresan, D., Meier, U., Gambardella, L. M., &amp; Schmidhuber, J. (2010). Deep big simple neural nets excel on handwritten digit recognition. arXiv preprint arXiv:1003.0358. <a href="#fnref:9" class="reversefootnote">&#8617;</a></p> </li> <li id="fn:10"> <p>Hinton, G., Deng, L., Yu, D., Dahl, G. E., Mohamed, A. R., Jaitly, N., … &amp; Kingsbury, B. (2012). Deep neural networks for acoustic modeling in speech recognition: The shared views of four research groups. Signal Processing Magazine, IEEE, 29(6), 82-97. <a href="#fnref:10" class="reversefootnote">&#8617;</a></p> </li> <li id="fn:11"> <p>Le, Q. V. (2013, May). Building high-level features using large scale unsupervised learning. In Acoustics, Speech and Signal Processing (ICASSP), 2013 IEEE International Conference on (pp. 8595-8598). IEEE. <a href="#fnref:11" class="reversefootnote">&#8617;</a></p> </li> <li id="fn:12"> <p>Glorot, X., &amp; Bengio, Y. (2010). Understanding the difficulty of training deep feedforward neural networks. In International conference on artificial intelligence and statistics (pp. 249-256). <a href="#fnref:12" class="reversefootnote">&#8617;</a></p> </li> <li id="fn:13"> <p>Jarrett, K., Kavukcuoglu, K., Ranzato, M. A., &amp; LeCun, Y. (2009, September). What is the best multi-stage architecture for object recognition?. In Computer Vision, 2009 IEEE 12th International Conference on (pp. 2146-2153). IEEE. <a href="#fnref:13" class="reversefootnote">&#8617;</a></p> </li> <li id="fn:14"> <p>Nair, V., &amp; Hinton, G. E. (2010). Rectified linear units improve restricted boltzmann machines. In Proceedings of the 27th International Conference on Machine Learning (ICML-10) (pp. 807-814). <a href="#fnref:14" class="reversefootnote">&#8617;</a></p> </li> <li id="fn:15"> <p>Glorot, X., Bordes, A., &amp; Bengio, Y. (2011). Deep sparse rectifier neural networks. In International Conference on Artificial Intelligence and Statistics (pp. 315-323). <a href="#fnref:15" class="reversefootnote">&#8617;</a></p> </li> <li id="fn:16"> <p>Maas, A. L., Hannun, A. Y., &amp; Ng, A. Y. (2013, June). Rectifier nonlinearities improve neural network acoustic models. In Proc. ICML (Vol. 30). <a href="#fnref:16" class="reversefootnote">&#8617;</a></p> </li> <li id="fn:17"> <p>Hinton, G. 
E., Srivastava, N., Krizhevsky, A., Sutskever, I., &amp; Salakhutdinov, R. R. (2012). Improving neural networks by preventing co-adaptation of feature detectors. arXiv preprint arXiv:1207.0580. <a href="#fnref:17" class="reversefootnote">&#8617;</a></p> </li> <li id="fn:18"> <p>Krizhevsky, A., Sutskever, I., &amp; Hinton, G. E. (2012). Imagenet classification with deep convolutional neural networks. In Advances in neural information processing systems (pp. 1097-1105). <a href="#fnref:18" class="reversefootnote">&#8617;</a></p> </li> <li id="fn:19"> <p>http://www.technologyreview.com/news/524026/is-google-cornering-the-market-on-deep-learning/ <a href="#fnref:19" class="reversefootnote">&#8617;</a></p> </li> </ol> </div> <p><a href="/writing/a-brief-history-of-neural-nets-and-deep-learning-part-4/">A 'Brief' History of Neural Nets and Deep Learning, Part 4</a> was originally published by Andrey Kurenkov at <a href="">Andrey Kurenkov's Web World</a> on December 24, 2015.</p> <![CDATA[A 'Brief' History of Neural Nets and Deep Learning, Part 3]]> /writing/a-brief-history-of-neural-nets-and-deep-learning-part-3 2015-12-24T17:19:34-08:00 2015-12-24T17:19:34-08:00 www.andreykurenkov.com contact@andreykurenkov.com <p>This is the third part of ‘A Brief History of Neural Nets and Deep Learning’. Parts 1 and 2 are <a href="http://www.andreykurenkov.com/writing/a-brief-history-of-neural-nets-and-deep-learning">here</a> and <a href="http://www.andreykurenkov.com/writing/a-brief-history-of-neural-nets-and-deep-learning-part-2">here</a>, and part 4 is <a href="http://www.andreykurenkov.com/writing/a-brief-history-of-neural-nets-and-deep-learning-part-4">here</a>. In this part, we will continue to see the swift pace of research in the 90s, and see why neural nets ultimately lost favor much as they did in the late 60s.</p> <h1 id="neural-nets-make-decisions">Neural Nets Make Decisions</h1> <p>Having discovered the application of neural nets to unsupervised learning, let us also quickly see how they were used in the third branch of machine learning: <strong>reinforcement learning</strong>. This one requires the most mathy notation to explain formally, but also has a goal that is very easy to describe informally: learn to make good decisions. Given some theoretical agent (a little software program, for instance), the idea is to make that agent able to decide on an <strong>action</strong> based on its current <strong>state</strong>, with the reception of some <strong>reward</strong> for each action and the intent of getting the maximum <strong>utility</strong> in the long term. So, whereas supervised learning tells the learning algorithm exactly what it should learn to output, reinforcement learning provides ‘rewards’ as a by-product of making good decisions over time, and does not directly tell the algorithm the correct decisions to choose. From the outset it was a very abstracted decision making model - there were a finite number of states, and a known set of actions with known rewards for each state. This made it easy to write very elegant equations for finding the optimal set of actions, but hard to apply to real problems - problems with continuous states or hard-to-define rewards.</p> <figure> <img class="postimagesmall" src="https://draftin.com:443/images/34985?token=8S98i7brY2iTusq7B68-OHsvSS-ND9Kc5F_-XnppdoNFt6hyAbhhxRZ5W4ipFEaF-N4XX9yjAMyDdKx0QKL4--Q" alt="RL" /> <figcaption>Reinforcement learning. 
<a href="http://www2.hawaii.edu/~chenx/ics699rl/grid/rl.html">(Source)</a></figcaption> </figure> <p>This is where neural nets come in. Machine learning in general, and neural nets in particular, are good at dealing with messy continuous data or dealing with hard to define functions by learning them from examples. Although classification is the bread and butter of neural nets, they are general enough to be useful for many types of problems - the descendants of Bernard Widrow’s and Ted Hoff’s Adaline were used for adaptive filters in the context of electrical circuits, for instance. And so, following the resurgence of research caused by backpropagation, people soon devised ways of leveraging the power of neural nets to perform reinforcement learning. One of the early examples of this was solving a simple yet classic problem: the balancing of a stick on a moving platform, known to students in control classes everywhere as the inverted pendulum problem <sup id="fnref:1"><a href="#fn:1" class="footnote">1</a></sup>.</p> <figure> <img class="postimagesmall" src="https://draftin.com:443/images/34982?token=k1WhSbXvaWX6oxEe7C_ChtP_n-ypQHY9JsSZc1Q8gwFlTKGjUaW0wuou46Um2KbDryXEKXnZqcThjIJX2MyDXmY" alt="pendulum " /> <figcaption>The double pendulum control problem - a step up from the single pendulum version, which is a classic control and reinforcement learning task. <a href="hhttp://www.pdx.edu/biomedical-signal-processing-lab/inverted-double-pendulum">(Source)</a></figcaption> </figure> <p>As with adaptive filtering, this research was strongly relevant to the field of Electrical Engineering, where control theory had been a major subfield for many decades prior to neural nets’ arrival. Though the field had devised ways to deal with many problems through direct analysis, having a means to deal with more complex situations through learning proved useful as evidenced by the hefty 7000 (!) citations of the 1990 “Identification and control of dynamical systems using neural networks”<sup id="fnref:2"><a href="#fn:2" class="footnote">2</a></sup>. Perhaps predictably, there was another field separate from Machine Learning where neural nets were useful - robotics. A major example of early neural net use for robotics came from CMU’s NavLab with 1989’s <a href="http://www.dtic.mil/dtic/tr/fulltext/u2/a218975.pdf">“Alvinn: An autonomous land vehicle in a neural network”</a><sup id="fnref:2b"><a href="#fn:2b" class="footnote">3</a></sup>:</p> <figure> <iframe width="420" height="315" src="https://www.youtube.com/embed/5-acCtyKf7E" frameborder="0" allowfullscreen=""></iframe> </figure> <p>As discussed in the paper, the neural net in this system learned to control the vehicle through plain supervised learning using sensor and steering data recorded while a human drove. There was also research into teaching robots using reinforcement learning specifically, as exemplified by the 1993 PhD thesis <a href="http://www.dtic.mil/dtic/tr/fulltext/u2/a261434.pdf">“Reinforcement learning for robots using neural networks”</a><sup id="fnref:3"><a href="#fn:3" class="footnote">4</a></sup>. The thesis showed that robots could be taught behaviors such as wall following and door passing in reasonable amounts of time, which was a good thing considering the prior inverted pendulum work requires impractical lengths of training.</p> <p>These disparate applications in other fields are certainly cool, but of course the most research on reinforcement learning and neural nets was happening within AI and Machine Learning. 
And here, one of the most significant results in the history of reinforcement learning was achieved: a neural net that learned to be a world class backgammon player. Dubbed <a href="http://courses.cs.washington.edu/courses/cse590hk/01sp/Readings/tesauro95cacm.pdf">TD-Gammon</a>, the neural net was trained using a standard reinforcement learning algorithm and was one of the first demonstrations of reinforcement learning being able to outperform humans on relatively complicated tasks <sup id="fnref:4"><a href="#fn:4" class="footnote">5</a></sup>. And it was specifically a reinforcement learning approach that worked here, as the same research showed just using a neural net without reinforcement learning did not work nearly as well.</p> <figure> <img class="postimagesmall" src="https://draftin.com:443/images/34983?token=iTl1pbpNKoeqgLWOC7YNBJYTYokPCrYeH8WhMh6Pn7a2Ie9y3zigQjMDiD55r_ZQLzmxgaf_NxWmls9cNMkAw50" alt="TDGammon" /> <figcaption>The neural net that learned to play expert-level Backgammon. <a href="https://webdocs.cs.ualberta.ca/~sutton/book/ebook/node108.html">(Source)</a></figcaption> </figure> <p>But, as we have seen happen before and will see happen again in AI, research hit a dead end. The predictable next problem to tackle using the TD-Gammon approach was investigated by Sebastian Thrun in the 1995 <a href="http://www-preview.ri.cmu.edu/pub_files/pub1/thrun_sebastian_1995_8/thrun_sebastian_1995_8.pdf">“Learning To Play the Game of Chess”</a>, and the results were not good <sup id="fnref:5"><a href="#fn:5" class="footnote">6</a></sup>. Though the neural net learned decent play, certainly better than a complete novice at the game, it was still far worse than a standard computer program (GNU-Chess) implemented long before. The same was true for the other perennial challenge of AI, Go <sup id="fnref:6"><a href="#fn:6" class="footnote">7</a></sup>. See, TD-Gammon sort of cheated - it learned to evaluate positions quite well, and so could get away with not doing any ‘search’ over multiple future moves and instead just picking the one that led to the best next position. But the same is simply not possible in chess or Go, games which are a challenge to AI precisely because of needing to look many moves ahead and having so many possible move combinations. Besides, even if the algorithm were smarter, the hardware of the time just was not up to the task - Thrun reported that “NeuroChess does a poor job, because it spends most of its time computing board evaluations. Computing a large neural network function takes two orders of magnitude longer than evaluating an optimized linear evaluation function (like that of GNU-Chess).” The weakness of computers of the time relative to the needs of the neural nets was a very real issue, and as we shall see not the only one…</p> <h1 id="neural-nets-get-loopy">Neural Nets Get Loopy</h1> <p>As neat as unsupervised and reinforcement learning are, I think supervised learning is still my favorite use case for neural nets. Sure, learning probabilistic models of data is cool, but it’s simply much easier to get excited for the sorts of concrete problems solved by backpropagation. 
We already saw how Yann LeCun achieved quite good recognition of handwritten text (a technology which went on to be nationally deployed for check-reading, and much more a while later…), but there was another obvious and greatly important task being worked on at the same time: understanding human speech.</p> <p>As with writing, understanding human speech is quite difficult due to the practically infinite variation in how the same word can be spoken. But, here there is an extra challenge: long sequences of input. See, for images it’s fairly simple to crop out a single letter from an image and have a neural net tell you which letter that is, input-&gt;output style. But with audio it’s not so simple - separating out speech into characters is completely impractical, and even finding individual words within speech is less simple. Plus, if you think about human speech, generally hearing words in context makes them easier to understand than hearing them in isolation. While the standard neural net structure works quite well for processing things such as images one at a time, input-&gt;output style, it is not at all well suited to long streams of information such as audio or text. The neural net has no ‘memory’ with which an input can affect another input processed afterward, but this is precisely how we humans process audio or text - a string of word or sound inputs, rather than a single large input. Point being: to tackle the problem of understanding speech, researchers sought to modify neural nets to process input as a stream, as in speech, rather than as one batch, as with an image.</p> <p>One approach to this, by Alexander Waibel et al. (including Hinton), was introduced in the 1989 <a href="http://www.cs.toronto.edu/~fritz/absps/waibelTDNN.pdf">“Phoneme recognition using <strong>time-delay neural networks</strong>”</a><sup id="fnref:7"><a href="#fn:7" class="footnote">8</a></sup>. These time-delay neural networks (TDNNs) were very similar to normal neural networks, except each neuron processed only a subset of the input and had several sets of weights for different delays of the input data. In other words, for a sequence of audio input, a ‘moving window’ of the audio is input into the network and as the window moves the same bits of audio are processed by each neuron with different sets of weights based on where in the window the bit of audio is. This is best understood with a quick illustration:</p> <figure> <img class="postimagesmall" src="https://draftin.com:443/images/34902?token=I-gRSza-SJchfi0jbeWtZxR7YaGXvDHCdJ7YyQx6h_hFxJotY8-jgChVwcwbP9hGCl-YKm36PUgvsCtS9pDhDZI" alt="TDNN" /> <figcaption>Time delay neural networks. <a href="https://electroviees.wordpress.com/tag/chacha/">(Source)</a></figcaption> </figure> <p>In a sense, this is quite similar to what CNNs do - instead of looking at the whole input at once, each unit looks at just a subset of the input at a time and does the same computation for each small subset. The main difference here is that there is no idea of time in a CNN, and the ‘window’ of input for each neuron is always moved across the whole input image to compute a result, whereas in a TDNN there actually is sequential input and output of data. Fun fact: <a href="https://youtu.be/vShMxxqtDDs?t=26m4s">according to Hinton</a>, the idea of TDNNs is what inspired LeCun to develop convolutional neural nets. But, funnily enough CNNs became essential for image processing, whereas in speech recognition TDNNs have been surpassed by another approach - <strong>recurrent neural nets</strong> (RNNs).</p>
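<p>To make the ‘moving window with shared weights’ idea concrete, here is a minimal single-unit time-delay layer sketched in Python with numpy. It is only an illustrative toy, not the architecture from the paper: one small set of weights spanning a few time delays is slid over the whole sequence, producing one output per window position.</p> <pre><code>import numpy as np

def time_delay_unit(sequence, weights, bias):
    """Slide one small set of weights over a sequence of feature frames.

    sequence: array of shape (T, d) - T time steps of d features each
    weights:  array of shape (k, d) - one weight per (delay, feature)
    bias:     scalar
    Returns one output per window of k consecutive frames.
    """
    T, d = sequence.shape
    k = weights.shape[0]                    # window size, i.e. number of delays
    outputs = []
    for t in range(T - k + 1):
        window = sequence[t:t + k]          # k consecutive frames of audio features
        outputs.append(np.tanh(np.sum(window * weights) + bias))
    return np.array(outputs)

# toy usage: 8 frames of 3 spectral features, a window of 3 frames
np.random.seed(0)
frames = np.random.randn(8, 3)
w = np.random.randn(3, 3) * 0.1
print(time_delay_unit(frames, w, 0.0).shape)   # 6 outputs, one per window position
</code></pre> <p>A full TDNN stacks many such units (and further layers on top of their outputs), but the key point is that the same weights are reused at every position in time - exactly the weight-sharing trick that convolutional nets apply over space.</p>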
<p>See, all the networks that have been discussed so far have been <strong>feedforward</strong> networks, meaning that the output of neurons in a given layer acts as input only to neurons in the next layer. But, it does not have to be so - there is nothing prohibiting us brave computer scientists from letting the output of the last layer act as an input to the first layer, or just connecting the output of a neuron to itself. By having the output of the network ‘loop’ back into the network, the problem of giving the network memory of past inputs is solved so elegantly!</p> <div><button class="btn" data-toggle="collapse" data-target="#rnnvs"> Aside: more on RNNs vs TDNNs &raquo; </button></div> <blockquote class="aside"><p id="rnnvs" class="boltzmann" style="height: 0px;"> Again, those seeking greater insight into the distinctions between different neural nets would do well to just go back to the actual papers. Here is a nice summation of why RNNs are cooler than TDNNs for sequential data: "A recurrent network has cycles in its graph that allow it to store information about past inputs for an amount of time that is not fixed a priori but rather depends on its weights and on the input data. The type of recurrent networks considered here can be used either for sequence recognition, production or prediction. Units are not clamped and we are not interested in convergence to a fixed point. Instead the recurrent network is used to transform an input sequence, e.g. speech spectra, into an output sequence, e.g. degrees of evidence for phonemes. The main advantage of such recurrent networks is that the relevant past context can be represented in the activity of the hidden units and then used to compute the output at each time step. In theory the network can learn how to extract the relevant context information from the input sequence. In contrast, in networks with time delays such as TDNNs the designer of the network must decide a priori by the choice of delay connections which part of the past input sequence should be used to predict the next output. According to the terminology introduced in [] the memory is static in the case of TDNNs but it is adaptive in the case of recurrent networks." </p></blockquote> <figure> <img class="postimagesmall" src="https://draftin.com:443/images/34568?token=osHGQ5vZmlKI8wvUQDodyNnTzHvIIucFK6U0Z1ynSkEKrMZy1FEdoBrizZ7fujKpEWiYaC1-1fm8lMLh7GKKVuc" alt="RNN" /> <figcaption>Diagram of a Recurrent Neural Net. Recall Boltzmann Machines from before? Surprise! Those were recurrent neural nets. <a href="http://www.wolframalpha.com/docs/timeline/computable-knowledge-history-6.html">(Source)</a></figcaption> </figure> <p>Well, it’s not quite so simple. Notice the problem - if backpropagation relies on ‘propagating’ the error from the output layer backward, how do things work if the output loops back around to become input again? The error would propagate from the first layer back to the output layer, and could just keep looping through the network, infinitely. The solution, independently derived by multiple groups, is <strong>backpropagation through time</strong>.
Basically, the idea is to ‘unroll’ the recurrent neural network by treating each loop through the neural network as providing the input to another copy of the neural network, and looping only a limited number of times.</p> <figure> <img class="postimagesmall" src="https://draftin.com:443/images/35004?token=GxHCevxXTmakxvF9U8WazhEAPyVJK-uWNzaDfqra756dhTkvCVM2ElBUQNhmf6pL7U4_9boMOFGQ58mZRmm_jCo" alt="The wonders of public domain images from Wikipedia!" /> <figcaption>The wonderfully intuitive backpropagation through time concept. <a href="https://upload.wikimedia.org/wikipedia/en/e/ee/Unfold_through_time.png">(Source)</a></figcaption> </figure> <p>This fairly simple idea actually worked - it was possible to train recurrent neural nets. And indeed, multiple people explored the application of RNNs to speech recognition. But, here is a twist you should now be able to predict: this approach did not work very well. To find out why, let’s meet another modern giant of Deep Learning: Yoshua Bengio. Starting work on speech recognition with neural nets around 1986, he co-wrote many papers on using ANNs and RNNs for speech recognition, and ended up working at the AT&amp;T Bell Labs on the problem just as Yann LeCun was working with CNNs there. In fact, in 1995 they co-wrote the summary paper <a href="http://yann.lecun.com/exdb/publis/pdf/lecun-bengio-95a.pdf">“Convolutional Networks for Images, Speech, and Time-Series”</a><sup id="fnref:8"><a href="#fn:8" class="footnote">9</a></sup>, the first of many collaborations between them. But, before then Bengio wrote the 1993 <a href="http://www.iro.umontreal.ca/~lisa/publications2/index.php/attachments/single/161">“A Connectionist Approach to Speech Recognition”</a><sup id="fnref:9"><a href="#fn:9" class="footnote">10</a></sup>. Here, he summarized the general failure of effectively teaching RNNs:</p> <blockquote> <p>“Although recurrent networks can in many instances outperform static networks, they appear more difficult to train optimally. Our experiments tended to indicate that their parameters settle in a suboptimal solution which takes into account short term dependencies but not long term dependencies. For example in experiments described in (citation) we found that simple duration constraints on phonemes had not at all been captured by the recurrent network. … Although this is a negative result, a better understanding of this problem could help in designing alternative systems for learning to map input sequences to output sequences with long term dependencies, e.g. for learning finite state machines, grammars, and other language related tasks. Since gradient based methods appear inadequate for this kind of problem we want to consider alternative optimization methods that give acceptable results even when the criterion function is not smooth.”</p> </blockquote> <h1 id="a-new-winter-dawns">A New Winter Dawns</h1> <p>So, there was a problem. A big problem. And the problem, basically, was what so recently was a huge advance: backpropagation. See, convolutional neural nets were important in part because backpropagation just did not work well for normal neural nets with many layers. And that’s the real key to deep learning - having many layers, in today’s systems as many as 20 or more. But already by the late 1980s, it was known that deep neural nets trained with backpropagation just did not work very well, and particularly did not work as well as nets with fewer layers.
The reason, in basic terms, is that backpropagation relies on finding the error at the output layer and successively splitting up blame for it among the prior layers. Well, with many layers this calculus-based splitting of blame ends up with either huge or tiny numbers and the resulting neural net just does not work very well - the ‘vanishing or exploding gradient problem’. Jürgen Schmidhuber, another Deep Learning luminary, summarizes the more formal explanation well<sup id="fnref:10"><a href="#fn:10" class="footnote">11</a></sup>:</p> <blockquote> <p>“A diploma thesis (Hochreiter, 1991) represented a milestone of explicit DL research. As mentioned in Sec. 5.6, by the late 1980s, experiments had indicated that traditional deep feedforward or recurrent networks are hard to train by backpropagation (BP) (Sec. 5.5). Hochreiter’s work formally identified a major reason: Typical deep NNs suffer from the now famous problem of vanishing or exploding gradients. With standard activation functions (Sec. 1), cumulative backpropagated error signals (Sec. 5.5.1) either shrink rapidly, or grow out of bounds. In fact, they decay exponentially in the number of layers or CAP depth (Sec. 3), or they explode.”</p> </blockquote> <p>Backpropagation through time is essentially equivalent to a neural net with a whole lot of layers, so RNNs were particularly difficult to train with backpropagation. Both Sepp Hochreiter, advised by Schmidhuber, and Yoshua Bengio published papers on the inability to learn long-term information due to limitations of backpropagation <sup id="fnref:11"><a href="#fn:11" class="footnote">12</a></sup><sup id="fnref:12"><a href="#fn:12" class="footnote">13</a></sup>. The analysis of the problem did reveal a solution - Schmidhuber and Hochreiter introduced a very important concept in 1997 that essentially solved the problem of how to train recurrent neural nets, much as CNNs did for feedforward neural nets - <a href="http://deeplearning.cs.cmu.edu/pdfs/Hochreiter97_lstm.pdf"><strong>Long Short Term Memory</strong></a> (LSTM)<sup id="fnref:13"><a href="#fn:13" class="footnote">14</a></sup>. In simple terms, as with CNNs the LSTM breakthrough ended up being a small alteration to the normal neural net model <sup id="fnref:10:1"><a href="#fn:10" class="footnote">11</a></sup>:</p> <blockquote> <p>“The basic LSTM idea is very simple. Some of the units are called Constant Error Carousels (CECs). Each CEC uses as an activation function f, the identity function, and has a connection to itself with fixed weight of 1.0. Due to f’s constant derivative of 1.0, errors backpropagated through a CEC cannot vanish or explode (Sec. 5.9) but stay as they are (unless they “flow out” of the CEC to other, typically adaptive parts of the NN). CECs are connected to several nonlinear adaptive units (some with multiplicative activation functions) needed for learning nonlinear behavior. Weight changes of these units often profit from error signals propagated far back in time through CECs. CECs are the main reason why LSTM nets can learn to discover the importance of (and memorize) events that happened thousands of discrete time steps ago, while previous RNNs already failed in case of minimal time lags of 10 steps.”</p> </blockquote> <p>But, this did little to fix the larger perception problem that neural nets were janky and did not work very well. They were seen as a hassle to work with - the computers were not fast enough, the algorithms were not smart enough, and people were not happy.</p>
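<p>To see the vanishing gradient problem described above in the plainest possible terms, here is a back-of-the-envelope calculation. The numbers are illustrative only (a sigmoid’s derivative is at most 0.25), but they capture both why the backpropagated error disappears over many layers or time steps, and why a constant error carousel sidesteps the issue.</p> <pre><code># Toy illustration: the error backpropagated through many layers (or time
# steps) is a product of per-layer factors (weight times activation
# derivative). With sigmoid units that factor is often well below 1.
factor = 0.25          # the largest possible derivative of a sigmoid
error = 1.0
for layer in range(20):
    error *= factor
print(error)           # roughly 9e-13 after 20 layers: the gradient has vanished

# An LSTM constant error carousel avoids this: an identity activation
# (derivative 1.0) plus a self-connection of fixed weight 1.0 means the
# error is multiplied by exactly 1.0 at every step it travels back.
error = 1.0
for step in range(1000):
    error *= 1.0 * 1.0
print(error)           # still 1.0 after 1000 time steps
</code></pre> <p>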
So, around the mid 90s, a new AI Winter for neural nets began to emerge - the community once again lost faith in them. A new method called Support Vector Machines, which in the very simplest terms could be described as a mathematically optimal way of training an equivalent to a two layer neural net, was developed and started to be seen as superior to the difficult to work with neural nets. In fact, the 1995 <a href="http://yann.lecun.com/exdb/publis/pdf/lecun-95b.pdf">“Comparison of Learning Algorithms For Handwritten Digit Recognition”</a><sup id="fnref:14"><a href="#fn:14" class="footnote">15</a></sup> by LeCun et al. found that this new approach worked better or the same as all but the best designed neural nets:</p> <blockquote> <p>“The [support vector machine] classifier has excellent accuracy, which is most remarkable, because unlike the other high performance classifiers, it does not include <em>a priori</em> knowledge about the problem. In fact, this classifier would do just as well if the image pixels were permuted with a fixed mapping. It is still much slower and memory hungry than the convolutional nets. However, improvements are expected as the technique is relatively new.”</p> </blockquote> <p>Other new methods, notably Random Forests, also proved to be very effective and with lovely mathematical theory behind them. So, despite the fact that CNNs consistently had state of the art performance, enthusiasm for neural nets dissipated and the machine learning community at large once again disavowed them. Winter was back. In <a href="http://www.andreykurenkov.com/writing/a-brief-history-of-neural-nets-and-deep-learning-part-4">part 4</a>, we shall see how a small group of researchers persevered in this research climate and ultimately made Deep Learning what it is today.</p> <div class="footnotes"> <ol> <li id="fn:1"> <p>Anderson, C. W. (1989). Learning to control an inverted pendulum using neural networks. Control Systems Magazine, IEEE, 9(3), 31-37. <a href="#fnref:1" class="reversefootnote">&#8617;</a></p> </li> <li id="fn:2"> <p>Narendra, K. S., &amp; Parthasarathy, K. (1990). Identification and control of dynamical systems using neural networks. Neural Networks, IEEE Transactions on, 1(1), 4-27. <a href="#fnref:2" class="reversefootnote">&#8617;</a></p> </li> <li id="fn:2b"> <p>Pomerleau, D. A. (1989). Alvinn: An autonomous land vehicle in a neural network (No. AIP-77). Carnegie-Mellon Univ Pittsburgh Pa Artificial Intelligence And Psychology Project. <a href="#fnref:2b" class="reversefootnote">&#8617;</a></p> </li> <li id="fn:3"> <p>Lin, L. J. (1993). Reinforcement learning for robots using neural networks (No. CMU-CS-93-103). Carnegie-Mellon Univ Pittsburgh PA School of Computer Science. <a href="#fnref:3" class="reversefootnote">&#8617;</a></p> </li> <li id="fn:4"> <p>Tesauro, G. (1995). Temporal difference learning and TD-Gammon. Communications of the ACM, 38(3), 58-68. <a href="#fnref:4" class="reversefootnote">&#8617;</a></p> </li> <li id="fn:5"> <p>Thrun, S. (1995). Learning to play the game of chess. Advances in neural information processing systems, 7. <a href="#fnref:5" class="reversefootnote">&#8617;</a></p> </li> <li id="fn:6"> <p>Schraudolph, N. N., Dayan, P., &amp; Sejnowski, T. J. (1994). Temporal difference learning of position evaluation in the game of Go. Advances in Neural Information Processing Systems, 817-817. <a href="#fnref:6" class="reversefootnote">&#8617;</a></p> </li> <li id="fn:7"> <p>Waibel, A., Hanazawa, T., Hinton, G., Shikano, K., &amp; Lang, K. 
J. (1989). Phoneme recognition using time-delay neural networks. Acoustics, Speech and Signal Processing, IEEE Transactions on, 37(3), 328-339. <a href="#fnref:7" class="reversefootnote">&#8617;</a></p> </li> <li id="fn:8"> <p>Yann LeCun and Yoshua Bengio. 1998. Convolutional networks for images, speech, and time series. In The handbook of brain theory and neural networks, Michael A. Arbib (Ed.). MIT Press, Cambridge, MA, USA, 255-258. <a href="#fnref:8" class="reversefootnote">&#8617;</a></p> </li> <li id="fn:9"> <p>Yoshua Bengio. A Connectionist Approach To Speech Recognition. Int. J. Patt. Recogn. Artif. Intell., 07, 647 (1993). <a href="#fnref:9" class="reversefootnote">&#8617;</a></p> </li> <li id="fn:10"> <p>J. Schmidhuber. Deep Learning in Neural Networks: An Overview. Neural Networks, 61, 85-117. http://arxiv.org/abs/1404.7828 <a href="#fnref:10" class="reversefootnote">&#8617;</a> <a href="#fnref:10:1" class="reversefootnote">&#8617;<sup>2</sup></a></p> </li> <li id="fn:11"> <p>Hochreiter, S. (1991). Untersuchungen zu dynamischen neuronalen Netzen. Diploma thesis, Institut für Informatik, Lehrstuhl Prof. Brauer, Technische Universität München. Advisor: J. Schmidhuber. <a href="#fnref:11" class="reversefootnote">&#8617;</a></p> </li> <li id="fn:12"> <p>Bengio, Y.; Simard, P.; Frasconi, P., “Learning long-term dependencies with gradient descent is difficult,” in Neural Networks, IEEE Transactions on, vol.5, no.2, pp.157-166, Mar 1994 <a href="#fnref:12" class="reversefootnote">&#8617;</a></p> </li> <li id="fn:13"> <p>Sepp Hochreiter and Jürgen Schmidhuber. 1997. Long Short-Term Memory. Neural Comput. 9, 8 (November 1997), 1735-1780. DOI=http://dx.doi.org/10.1162/neco.1997.9.8.1735. <a href="#fnref:13" class="reversefootnote">&#8617;</a></p> </li> <li id="fn:14"> <p>Y. LeCun, L. D. Jackel, L. Bottou, A. Brunot, C. Cortes, J. S. Denker, H. Drucker, I. Guyon, U. A. Muller, E. Sackinger, P. Simard and V. Vapnik: Comparison of learning algorithms for handwritten digit recognition, in Fogelman, F. and Gallinari, P. (Eds), International Conference on Artificial Neural Networks, 53-60, EC2 &amp; Cie, Paris, 1995 <a href="#fnref:14" class="reversefootnote">&#8617;</a></p> </li> </ol> </div> <p><a href="/writing/a-brief-history-of-neural-nets-and-deep-learning-part-3/">A 'Brief' History of Neural Nets and Deep Learning, Part 3</a> was originally published by Andrey Kurenkov at <a href="">Andrey Kurenkov's Web World</a> on December 24, 2015.</p> <![CDATA[A 'Brief' History of Neural Nets and Deep Learning, Part 2]]> /writing/a-brief-history-of-neural-nets-and-deep-learning-part-2 2015-12-24T16:19:34-08:00 2015-12-24T16:19:34-08:00 www.andreykurenkov.com contact@andreykurenkov.com <p>This is the second part of ‘A Brief History of Neural Nets and Deep Learning’. Part 1 is <a href="http://www.andreykurenkov.com/writing/a-brief-history-of-neural-nets-and-deep-learning">here</a>, and Parts 3 and 4 are <a href="http://www.andreykurenkov.com/writing/a-brief-history-of-neural-nets-and-deep-learning-part-3">here</a> and <a href="http://www.andreykurenkov.com/writing/a-brief-history-of-neural-nets-and-deep-learning-part-4">here</a>.
In this part, we will look into several strains of research that made rapid progress from the development of backpropagation until the late 90s, which we shall see later are the essential foundations of Deep Learning.</p> <h1 id="neural-nets-gain-vision">Neural Nets Gain Vision</h1> <figure> <img class="postimagesmall" src="http://yann.lecun.com/exdb/lenet/gifs/asamples.gif" alt="LeNet" /> <figcaption>Yann LeCun's LeNet demonstrated.</figcaption> </figure> <p>With the secret to training multilayer neural nets uncovered, the topic was once again ember-hot and the lofty ambitions of Rosenblatt seemed to perhaps be within reach. It took only until 1989 for another key finding now universally cited in textbooks and lectures to be <a href="http://www.sciencedirect.com/science/article/pii/0893608089900208">published</a><sup id="fnref:1"><a href="#fn:1" class="footnote">1</a></sup>: “Multilayer feedforward networks are universal approximators”. Essentially, it mathematically proved that multiple layers allow neural nets to theoretically implement any function, and certainly XOR.</p> <p>But, this is mathematics, where you could imagine having endless memory and computation power should it be needed - did backpropagation allow neural nets to be used for anything in the real world? Oh yes. Also in 1989, Yann LeCun et al. at the AT&amp;T Bell Labs demonstrated a very significant real-world application of backpropagation in <a href="http://yann.lecun.com/exdb/publis/pdf/lecun-89e.pdf">“Backpropagation Applied to Handwritten Zip Code Recognition”</a> <sup id="fnref:2"><a href="#fn:2" class="footnote">2</a></sup>. You may think it fairly unimpressive for a computer to be able to correctly understand handwritten digits, and these days it is indeed quite quaint, but prior to the publication the messy and inconsistent scrawls of us humans proved a major challenge to the much more tidy minds of computers. The publication, working with a large dataset from the US Postal Service, showed neural nets were entirely capable of this task. And much more importantly, it was the first to highlight the practical need for key modifications of neural nets beyond plain backpropagation toward modern deep learning:</p> <blockquote> <p>“Classical work in visual pattern recognition has demonstrated the advantage of extracting local features and combining them to form higher order features. Such knowledge can be easily built into the network by forcing the hidden units to combine only local sources of information. Distinctive features of an object can appear at various locations on the input image. Therefore it seems judicious to have a set of feature detectors that can detect a particular instance of a feature anywhere on the input plane. Since the <em>precise</em> location of a feature is not relevant to the classification, we can afford to lose some position information in the process. Nevertheless, <em>approximate</em> position information must be preserved, to allow the next levels to detect higher order, more complex features (Fukushima 1980; Mozer 1987).”</p> </blockquote> <figure> <img class="postimagesmall" src="https://draftin.com:443/images/35003?token=pRZiRNO5tZB3uHSXV0bjIdzsP2tAUr9jpXUUChNI20Dwk0Y9JOMcGwtmmlgHGVzgJRGhXsr998Ogpxbl3K1Vn_8" alt="CNN" /> <figcaption>A visualization of how this neural net works.
<a href="http://image.slidesharecdn.com/bp2slides-090922011749-phpapp02/95/the-back-propagation-learning-algorithm-10-728.jpg?cb=1253582278">(Source)</a></figcaption> </figure> <p>Or, more concretely: the first hidden layer of the neural net was <strong>convolutional</strong> - instead of each neuron having a different weight for each pixel of the input image (40x60=2400 weights), the neurons only have a small set of weights (5x5=25) that were applied a whole bunch of small subsets of the image of the same size. So, for instance instead of having 4 different neurons learn to detect 45 degree lines in each of the 4 corners of the input image, a single neuron could learn to detect 45 degree lines on subsets of the image and do that everywhere within it. Layers past the first work in a similar way, but take in the ‘local’ features found in the previous hidden layer rather than pixel images, and so ‘see’ successively larger portions of the image since they are combining information about increasingly larger subsets of the image. Finally, the last two layers are just plain normal neural net layers that use the higher-order larger features generated by the convolutional layers to determine which digit the input image corresponds to. The method proposed in this 1989 paper went on to be the basis of nationally deployed check-reading systems, as demonstrated by LeCun in this gem of a video:</p> <figure> <iframe width="420" height="315" src="https://www.youtube.com/embed/FwFduRA_L6Q" frameborder="0" allowfullscreen=""></iframe> </figure> <p>The reason for why this is helpful is intuitively if not mathematically clear: without such constraints the network would have to learn the same simple things (such as detecting 45 degree lines, small circles, etc) a whole bunch of times for each portion of the image. But with the constraint there, only one neuron would need to learn each simple feature - and with far fewer weights overall, it could do so much faster! Moreover, since the pixel-exact locations of such features do not matter the neuron could basically skip neighboring subsets of the image - <strong>subsampling</strong>, now known as a type of <strong>pooling</strong> - when applying the weights, further reducing the training time. The addition of these two types of layers - convolutional and pooling layers - are the primary distinctions of <strong>Convolutional Neural Nets</strong> (<strong>CNNs/ConvNets</strong>) from plain old neural nets.</p> <figure> <img class="postimagesmall" src="https://draftin.com:443/images/34967?token=cmXwbZkJ53nKUhEFA3zCrtdFDF1cVgfhGFBv1lD8Z7TPCqZpKRwR0Ht-vE-894hZyaWbYxUX8wak0QjMXvNq8P4" alt="CNN 2" /> <figcaption>A nice visualization of CNN operation <a href="https://sites.google.com/site/5kk73gpu2013/assignment/cnn">(Source)</a></figcaption> </figure> <p>At that time, the convolution idea was called ‘weight sharing’, and it was actually discussed in the 1986 extended analysis of backpropagation by Rumelhart, Hinton, and Williams<sup id="fnref:3"><a href="#fn:3" class="footnote">3</a></sup>. Actually, the credit goes even further back - Minsky and Papert’s 1969 analysis of Perceptrons was thorough enough to pose a problem that motivated this idea. But, as before, others have also independently explored the concept - namely, Kunihiko Fukushima in 1980 with his notion of the <a href="http://www.cs.princeton.edu/courses/archive/spr08/cos598B/Readings/Fukushima1980.pdf">Neurocognitron</a><sup id="fnref:4"><a href="#fn:4" class="footnote">4</a></sup>. 
And, as before, the ideas behind it drew inspiration from studies of the brain:</p> <blockquote> <p>“According to the hierarchy model by Hubel and Wiesel, the neural network in the visual cortex has a hierarchy structure: LGB (lateral geniculate body)-&gt;simple cells-&gt;complex cells-&gt;lower order hypercomplex cells-&gt;higher order hypercomplex cells. It is also suggested that the neural network between lower order hypercomplex cells and higher order hypercomplex cells has a structure similar to the network between simple cells and complex cells. In this hierarchy, a cell in a higher stage generally has a tendency to respond selectively to a more complicated feature of the stimulus pattern, and, at the same time, has a larger receptive field, and is more insensitive to the shift in position of the stimulus pattern. … Hence, a structure similar to the hierarchy model is introduced in our model.”</p> </blockquote> <p>LeCun continued to be a major proponent of CNNs at Bell Labs, and his work on them resulted in major commercial use for check-reading in the mid 90s - his talks and interviews often include <a href="http://www.kdnuggets.com/2014/02/exclusive-yann-lecun-deep-learning-facebook-ai-lab.html">the fact that</a> “At some point in the late 1990s, one of these systems was reading 10 to 20% of all the checks in the US.”<sup id="fnref:5"><a href="#fn:5" class="footnote">5</a></sup>.</p> <h1 id="neural-nets-go-unsupervised">Neural Nets Go Unsupervised</h1> <p>Automating the rote and utterly uninteresting task of reading checks is a great instance of what Machine Learning can be used for. Perhaps a less predictable application? Compression. Meaning, of course, finding a smaller representation of some data from which the original data can be reconstructed. Learned compression may very well outperform stock compression schemes, when the learning algorithm can find features within the data stock methods would miss. And it is very easy to do - just train a neural net with a small hidden layer to just output the input:</p> <figure> <img class="postimagesmall" src="https://draftin.com:443/images/34875?token=N8kgwOTY2SLYiUmyWgp6q_SUr2lq1VZRCsqjuEcUzhSyxukW8SaukGh2U-PdFABd3WIkZlgtOr9pbVX_kGGUfnM" alt="Autoencode" /> <figcaption>An autoencoder neural net. <a href="http://research.chtsai.org/papers/iml-bkp.html">(Source)</a></figcaption> </figure> <p>This is an <strong>autoencoder</strong> neural net, and is a method for learning compression - efficiently translating (encoding) data to a compact format and back to itself (auto). See, the output layer computes its outputs, which ideally are the same as the input to the neural net, using only the hidden layer’s outputs. Since the hidden layer has fewer outputs than does the input layer, the output of the hidden layer is the compressed representation of the input data, which can be reconstructed with the output layer.</p> <figure> <img class="postimagesmall" src="https://draftin.com:443/images/34939?token=mIbhFk3rVIyx-Byzt6TXV1hGzMH7_w5sjy5OzeYM0qex33WDiI1PhspANLICVpp53PZyysX8yR9YahhXtBVkV6M" alt="RBM" /> <figcaption>A more explicit view of an autoencoder compression. <a href="http://stats.stackexchange.com/questions/114385/what-is-the-difference-between-convolutional-neural-networks-restricted-boltzma">(Source)</a></figcaption> </figure> <p>Notice a neat thing here: the only thing we need for training is some input data. 
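</p> <p>Since the recipe really is that simple, it is worth seeing in miniature. Below is an illustrative sketch in Python with numpy - toy sizes and random data, not taken from any of the papers discussed - in which a tiny network is trained with plain backpropagation to reproduce its 8-number input through a 3-unit hidden layer, so that those 3 hidden activations become the learned compressed code.</p> <pre><code>import numpy as np

np.random.seed(0)
X = np.random.rand(200, 8)            # 200 examples of 8-dimensional data
W1 = np.random.randn(8, 3) * 0.1      # encoder weights: 8 inputs to 3 hidden units
W2 = np.random.randn(3, 8) * 0.1      # decoder weights: 3 hidden units to 8 outputs
lr = 0.5

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

for epoch in range(2000):
    H = sigmoid(X @ W1)               # the 3-number code for each example
    Y = sigmoid(H @ W2)               # the attempted reconstruction of the input
    err = Y - X                       # the 'label' is simply the input itself
    # backpropagate the reconstruction error through both layers
    dY = err * Y * (1 - Y)
    dH = (dY @ W2.T) * H * (1 - H)
    W2 -= lr * (H.T @ dY) / len(X)
    W1 -= lr * (X.T @ dH) / len(X)

reconstruction = sigmoid(sigmoid(X @ W1) @ W2)
print(np.mean((reconstruction - X) ** 2))   # mean squared error falls as it trains
</code></pre>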
<p>This is in contrast to the requirement of supervised machine learning, which needs a training set of input-output pairs (<strong>labeled data</strong>) in order to approximate a function that can compute such outputs from such inputs. And indeed, autoencoders are not a form of supervised learning; they are a form of <strong>unsupervised learning</strong>, which only needs a set of input data (<strong>unlabeled data</strong>) in order to find some hidden structure within that data. In other words, unsupervised learning does not approximate a provided function so much as it derives one that maps the input data to another useful representation of that data. In this case, this representation is just a smaller one from which the original data can still be reconstructed, but it can also be used for finding groups of similar data (<strong>clustering</strong>) or other inference of <strong>latent variables</strong> (some aspect that is known to exist for the data but the value of which is not known).</p> <figure> <img class="postimagesmall" src="https://draftin.com:443/images/34886?token=iEm6_7c5iGDJPTSNZ-amHkQyb3_f4G3657WBTHcyJJL87dQx8xTMxiJES68Mj0YhbbLx5YWkkaOR4QDkJHiG3-0" alt="Clustering, from good ol' public domain wikipedia" /> <figcaption>Clustering, a very common unsupervised learning application. <a href="https://en.wikipedia.org/wiki/K-means_clustering">(Source)</a></figcaption> </figure> <p>There were other unsupervised applications of neural networks explored prior to and after the discovery of backpropagation, most notably Self Organizing Maps <sup id="fnref:6"><a href="#fn:6" class="footnote">6</a></sup>, which produce a low-dimensional representation of data good for visualization, and Adaptive Resonance Theory<sup id="fnref:7"><a href="#fn:7" class="footnote">7</a></sup>, which can learn to classify arbitrary input data without being told correct classifications. If you think about it, it is intuitive that quite a lot can be learned from unlabeled data. Say you have a dataset of a bunch of images of handwritten digits, without labels of which digit each image corresponds to. Well, an image with some digit in that dataset most likely looks most like all the other images with that same digit, and so though a computer may not know which digit all those images correspond to, it should still be able to find that they all correspond to the same one. This, <strong>pattern recognition</strong>, is really what most of machine learning is all about, and arguably also is the basis for the great powers of the human brain. But, let us not digress from our exciting deep learning journey, and get back to autoencoders.</p> <figure> <img class="postimagesmall" src="https://draftin.com:443/images/34887?token=rWIBfCstMS8Y3OsonIyWAOPNHnc9NZIGHs5JesWlo01UCpYKKcMbhJOCj-AvZuDS8VeENeNo35Z1BxQhkOexbHM" alt="SOM" /> <figcaption>Self Organizing Maps - mapping a large vector of inputs into a grid of neuron outputs, where each output is a cluster. Nearby neurons represent similar clusters.
<a href="http://lcdm.astro.illinois.edu/static/code/mlz/MLZ-1.0/doc/html/somz.html">(Source)</a></figcaption> </figure> <p>As with weight-sharing, the idea of autoencoders was first discussed in the aforementioned extensive 1986 analysis of backpropagation <sup id="fnref:3:1"><a href="#fn:3" class="footnote">3</a></sup>, and as with weight-sharing it resurfaced in more research in the following years<sup id="fnref:8"><a href="#fn:8" class="footnote">8</a></sup><sup id="fnref:9"><a href="#fn:9" class="footnote">9</a></sup>, including by Hinton himself <sup id="fnref:10"><a href="#fn:10" class="footnote">10</a></sup>. This paper, with the fun title <a href="http://www.cs.toronto.edu/~fritz/absps/cvq.pdf">“Autoencoders, Minimum Description Length, and Helmholts Free Energy”</a>, posits that “A natural approach to unsupervised learning is to use a model that defines probability distribution over observable vectors” and uses a neural net to learn such a model. So here’s another neat thing you can do with neural nets: approximate probability distributions.</p> <h1 id="neural-nets-gain-beliefs">Neural Nets Gain Beliefs</h1> <p>In fact, before being co-author of the seminal 1986 paper on backpropagation learning algorithm, Hinton worked on a neural net approach for learning probability distributions in the 1985 <a href="http://www.cs.toronto.edu/~fritz/absps/cogscibm.pdf">“A Learning Algorithm for Boltzmann Machines”</a> <sup id="fnref:11"><a href="#fn:11" class="footnote">11</a></sup>. Boltzmann Machines are networks just like neural nets and have units that are very similar to Perceptrons, but instead of computing an output based on inputs and weights, each unit in the network can compute a probability of it having a value of 1 or 0 given the values of connected units and weights. The units are therefore <strong>stochastic</strong> - they behave according to a probability distribution, rather than in a known deterministic way. The Boltzmann part refers <a href="https://en.wikipedia.org/wiki/Boltzmann_distribution">to a probability distribution</a> that has to do with the states of particles in a system based the particles’ energy and on the thermodynamic temperature of that system. This distribution defines not only the mathematics of the Boltzmann machines, but also the interpretation - the units in the network themselves have energies and states, and learning is done by minimizing the energy of the system and with direct inspirartion from thermodynamics. Though a bit unintuitive, this energy-based interpretation is actually just one example of an <strong>energy-based model</strong>, and fits in the <strong>energy-based learning</strong> theoretical framework with which many learning algorithms can be expressed<sup id="fnref:ebm"><a href="#fn:ebm" class="footnote">12</a></sup>.</p> <div><button class="btn" data-toggle="collapse" data-target="#ebm"> Aside: a bit more Energy Based Models &raquo; </button></div> <blockquote class="aside"><p id="ebm" class="collapse" style="height: 0px;"> That there is a common theoretical framework for a bunch of learning methods is not too surprising, since at the end of the day all of learning boils down to optimization. Quoting from the above cited tutorial: <br /><br /> "Training an EBM consists in finding an energy function that produces the best Y for any X ... The architecture of the EBM is the internal structure of the parameterized energy function E(W, Y, X) ... This quality measure is called the loss functional (i.e. 
a function of function) and denoted L(E,S). ... In order to find the best energy function [] we need a way to assess the quality of any particular energy function, based solely on two elements: the training set, and our prior knowledge about the task. For simplicity, we often denote it L(W,S) and simply call it the loss function. The learning problem is simply to find the W that minimizes the loss." <br /><br /> So, the key to energy based models is recognizing all these algorithms are essentially different ways to optimize a pair of functions, that can be called the energy function E and loss function L, by finding a set of good values to a bunch of variables that can be denoted W using data denoted X for input and Y for the output. It's really a very broad definition for a framework, but still nicely encapsulates what a lot of algorithms fundamentally do. </p></blockquote> <figure> <img class="postimagesmall" src="https://draftin.com:443/images/34928?token=uZt9tR3PMJ7XcI0pscNEF0hgpiGBmAWdxlT-mXi88-6jI1VKnv5eRXDeX2soiwQ2MJJuq1QeKvSOb1JiviyiZl8" alt="Public domain from wikipedia" /> <figcaption>A simple belief, or bayesian, network - a Boltzmann machine is basically this but with undirected/symmetric connections and trainable weights to learn the probabilities in a particular fashion. <a href="https://commons.wikimedia.org/wiki/File:SimpleBayesNet.svg">(Source)</a> </figcaption> </figure> <p>Back to Boltzmann Machines. When such units are put together into a network, they form a graph, and so are a <strong>graphical model</strong> of data. Essentially, they can do something very similar to normal neural nets: some <strong>hidden units</strong> compute the probability of some <strong>hidden variables</strong> (the outputs - classifications or features for data) given known values of <strong>visible units</strong> that represent <strong>visible variables</strong> (the inputs - pixels of images, characters in text, etc.). In our classic example of classifying images of digits, the hidden variables are the actual digit values, and the visible variables are the pixels of the image; given an image of the digit ‘1’ as input, the value of visible units is known and the hidden unit modeling the probability of the image representing a ‘1’ should have a high output probability.</p> <figure> <img class="postimagesmall" src="https://draftin.com:443/images/34944?token=wt8jYAmcmFL7nUvwusO-SYCwcXyM0_jECFgyhTNKc5OI7gyImufruQFh98267EgUTNKXFRmZqqPP9ia4OdaOhrQ" alt="BM" /> <figcaption>An example Boltzmann machine. Each line has an associated weight, as with a neural net. Notice there are no layers here - everything can sort of be connected to everything. We'll talk about this variation on neural net in a little bit... <a href="https://en.wikipedia.org/wiki/File:Boltzmannexamplev1.png">(Source)</a> </figcaption> </figure> <p>So, for the classification task, there is now a nice way of computing the probability of each category. This is very analogous to actually computing the output values of a normal classification neural net, but these nets have another neat trick: they can generate plausible looking input data. This follows from the probability equations involved - not only does the net learn the probabilities of values for the hidden variables given known values for the visible variables, but also the inverse of that - visible probabilities given known hidden values. 
So, if we want to generate a ‘1’ digit image, the units corresponding to the pixel variables have known probabilities of outputting a 1 and an image can be probabilistically generated; these networks are <strong>generative graphical models</strong>. Though it is possible to do supervised learning with very similar goals as normal neural nets, the unsupervised learning task of learning a good generative model - probabilistically learning the hidden structure of some data - is commonly what these nets are used for. Most of this was not really that novel, but the learning algorithm presented and the particular formulation that enabled it were, as stated in the paper itself:</p> <blockquote> <p>“Perhaps the most interesting aspect of the Boltzmann Machine formulation is that it leads to a domain-independent learning algorithm that modifies the connection strengths between units in such a way that the whole network develops an internal model which captures the underlying structure of its environment. There has been a long history of failure in the search for such algorithms (Newell, 1982), and many people (particularly in Artificial Intelligence) now believe that no such algorithms exist.”</p> </blockquote> <div><button class="btn" data-toggle="collapse" data-target="#boltzmann"> Aside: more explanation of Boltzmann Machines &raquo; </button></div> <blockquote class="aside"><p id="boltzmann" class="aside" style="height: 0px;"> Having learned the classical neural net models first, it took me a while to understand the notion behind these probabilistic nets. To elaborate, let me present a quote from the paper itself that restates all that I have said above quite well: <br /><br /> "The network modifies the strengths of its connections so as to construct an internal generative model that produces examples with the same probability distribution as the examples it is shown. Then, when shown any particular example, the network can “interpret” it by finding values of the variables in the internal model that would generate the example. <br /> ... <br /> The machine is composed of primitive computing elements called units that are connected to each other by bidirectional links. A unit is always in one of two states, on or off, and it adopts these states as a probabilistic function of the states of its neighboring units and the weights on its links to them. The weights can take on real values of either sign. A unit being on or off is taken to mean that the system currently accepts or rejects some elemental hypothesis about the domain. The weight on a link represents a weak pairwise constraint between two hypotheses. A positive weight indicates that the two hypotheses tend to support one another; if one is currently accepted, accepting the other should be more likely. Conversely, a negative weight suggests, other things being equal, that the two hypotheses should not both be accepted. Link weights are symmetric, having the same strength in both directions (Hinton &amp; Sejnowski, 1983)."</p> </blockquote> <p>Without delving into the full details of the algorithm, here are some highlights: it is a variant on <strong>maximum-likelihood</strong> algorithms, which simply means that it seeks to maximize the probability of the net’s visible unit values matching with their known correct values. 
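</p> <p>To make the stochastic units themselves a bit more concrete, here is a minimal sketch in Python of how one unit’s state can be sampled given the states of the units it is connected to and the weights on those connections. The toy numbers and the restriction of connections to visible-to-hidden links are made up purely for illustration - this is not code from any of these papers, just the general idea:</p> <pre><code>import math
import random

def sample_unit(neighbor_states, weights, bias=0.0, temperature=1.0):
    """Sample a binary unit's state given the states of connected units and the weights."""
    # The probability of the unit being 'on' is a logistic function of its total
    # input, which is where the Boltzmann distribution enters the picture.
    total_input = bias + sum(w * s for w, s in zip(weights, neighbor_states))
    p_on = 1.0 / (1.0 + math.exp(-total_input / temperature))
    return 1 if p_on > random.random() else 0

# A few sweeps of sampling: the visible units are clamped to a (toy) training
# example while the hidden units are repeatedly resampled given their neighbors.
visible = [1, 0, 1]
hidden = [random.randint(0, 1) for _ in range(2)]
weights = [[0.5, -0.3, 0.8],   # weights between the visible units and hidden unit 0
           [-0.2, 0.6, 0.1]]   # weights between the visible units and hidden unit 1
for _ in range(10):
    for j in range(len(hidden)):
        hidden[j] = sample_unit(visible, weights[j])
</code></pre> <p>The loop at the bottom - repeatedly resampling each unit given the current values of its neighbors - is exactly the Gibbs Sampling step described next.</p> <p>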
Computing the actual most likely value for each unit all at the same time turns out to be much too computationally expensive, so in training <strong>Gibbs Sampling</strong> - starting the net with random unit values and iteratively reassigning values to units given their connections’ values - is used to give some actual known values. When learning using a training set, the visible units are just set to have the value of the current training example, so sampling is done to get values for the hidden units. Once some ‘real’ values are sampled, we can do something similar to backpropagation - take a derivative for each weight to see how we can change it so as to increase the probability of the net doing the right thing.</p> <p>As with neural nets, the algorithm can be done either in a supervised fashion (with known values for the hidden units) or in an unsupervised fashion. Though the algorithm was demonstrated to work (notably, with the same ‘encoding’ problem that autoencoder neural nets solve), it was soon apparent that it just did not work very well - Radford M. Neal’s 1992 <a href="http://www.zabaras.com/Courses/BayesianComputing/Papers/1-s2.0-0004370292900656-main.pdf">“Connectionist learning of belief networks”</a><sup id="fnref:12"><a href="#fn:12" class="footnote">13</a></sup> justified a need for a faster approach by stating that: “These capabilities would make the Boltzmann machine attractive in many applications, were it not that its learning procedure is generally seen as being painfully slow”. And so Neal introduced a similar idea in the <strong>belief net</strong>, which is essentially like a Boltzmann machine with directed, forward connections (so that there are again layers, as with the neural nets we have seen before, and unlike the Boltzmann machine image above). Without getting into mucky probability math, this change allowed the nets to be trained with a faster learning algorithm. We actually saw a ‘belief net’ just above with the sprinkler and rain variables, and the term was chosen precisely because this sort of probability-based modeling has a close relationship to ideas from the mathematical field of probability, in addition to its link to the field of Machine Learning.</p> <figure> <img class="postimagesmall" src="https://draftin.com:443/images/34893?token=vvO-2350CpV8LNjOLn8Tmcd2EFeZYCmgNj1GsYxzisrm0tqe2AF_FfynWcppZQdQ9823HTw9E2i8SC7XposnH0w" alt="belief nets" /> <figcaption>An explanation of belief nets. <a href="http://www.slideserve.com/Leo/restricted-boltzmann-machines-and-deep-belief-networks">(Source)</a></figcaption> </figure> <p>Though this approach was an advance upon Boltzmann machines, it was still just too slow - the math for correctly deriving probabilistic relations between variables is such that a ton of computation is typically required without some simplifying tricks. And so Hinton, along with Neal and two other co-authors, soon came up with extra tricks in the 1995 <a href="http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.82.804&amp;rep=rep1&amp;type=pdf">“The <strong>wake-sleep algorithm</strong> for unsupervised neural networks”</a><sup id="fnref:13"><a href="#fn:13" class="footnote">14</a></sup>. These tricks called for a slightly different belief net setup, which was now deemed <a href="http://www.gatsby.ucl.ac.uk/~dayan/papers/hm95.pdf">“The Helmholtz Machine”</a><sup id="fnref:14"><a href="#fn:14" class="footnote">15</a></sup>. 
Skirting the details once again, the key idea was to have separate sets of weights for inferring hidden variables from visible variables (<strong>recognition weights</strong>) and vice versa (<strong>generative weights</strong>), and to keep the directed aspect of Neal’s belief nets. This allows the training to be done much faster, while being applicable to the unsupervised and supervised learning problems of Boltzmann Machines.</p> <div><button class="btn" data-toggle="collapse" data-target="#wakesleep"> Aside: the gross simplifying assumption of the wake-sleep algorithm &raquo; </button></div> <blockquote class="aside"><p id="wakesleep" class="aside" style="height: 0px;"> In videos of Hinton talking about the Wake Sleep algorithm, he often notes how gross the simplifying assumption being made is, and that it turns out the algorithm just works regardless. Again I will quote as the paper itself explains the assumption well: <br /><br /> "The key simplifying assumption is that the recognition distribution for a particular example d, Q is factorial (separable) in each layer. If there are h stochastic binary units in a layer l, the portion of the distribution P(l, d) due to that layer is determined by 2^h - 1 probabilities. However, Q makes the assumption that the actual activity of any one unit in layer l is independent of the activities of all the other units in that layer, given the activities of all the units in the lower layer, l - 1, so the recognition model needs only specify h probabilities rather than 2^h - 1. The independence assumption allows F(d; 8.4) to be evaluated efficiently, but this computational tractability is bought at a price, since the true posterior is unlikely to be factorial <br /> ... <br /> The generative model is taken to be factorial in the same way, although one should note that factorial generative models rarely have recognition distributions that are themselves exactly factorial." <br /><br /> Note that Neal's belief nets also implicitly made the probabilities factorize by having layers of units with only forward-facing directed connections. </p></blockquote> <p>Finally, belief nets could be trained somewhat fast! Though not quite as influential, this algorithmic advance was a significant enough forward step for unsupervised training of belief nets that it could be seen as a companion to the now almost decade-old publication on backpropagation. But, by this point new machine learning methods had begun to also emerge, and people were again beginning to be skeptical of neural nets since they seemed so intuition-based and since computers were still barely able to meet their computational needs. As we’ll see in <a href="http://www.andreykurenkov.com/writing/a-brief-history-of-neural-nets-and-deep-learning-part-3">part 3</a>, a new AI Winter for neural nets began just a few years later…</p> <div class="footnotes"> <ol> <li id="fn:1"> <p>Kurt Hornik, Maxwell Stinchcombe, Halbert White, Multilayer feedforward networks are universal approximators, Neural Networks, Volume 2, Issue 5, 1989, Pages 359-366, ISSN 0893-6080, http://dx.doi.org/10.1016/0893-6080(89)90020-8. <a href="#fnref:1" class="reversefootnote">&#8617;</a></p> </li> <li id="fn:2"> <p>LeCun, Y; Boser, B; Denker, J; Henderson, D; Howard, R; Hubbard, W; Jackel, L, “Backpropagation Applied to Handwritten Zip Code Recognition,” in Neural Computation, vol.1, no.4, pp.541-551, Dec. 1989 <a href="#fnref:2" class="reversefootnote">&#8617;</a></p> </li> <li id="fn:3"> <p>D. E. Rumelhart, G. E. Hinton, and R. J. 
Williams. 1986. Learning internal representations by error propagation. In Parallel distributed processing: explorations in the microstructure of cognition, vol. 1, David E. Rumelhart, James L. McClelland, and CORPORATE PDP Research Group (Eds.). MIT Press, Cambridge, MA, USA 318-362 <a href="#fnref:3" class="reversefootnote">&#8617;</a> <a href="#fnref:3:1" class="reversefootnote">&#8617;<sup>2</sup></a></p> </li> <li id="fn:4"> <p>Fukushima, K. (1980), ‘Neocognitron: A Self-Organizing Neural Network Model for a Mechanism of Pattern Recognition Unaffected by Shift in Position’, Biological Cybernetics 36 , 193–202 . <a href="#fnref:4" class="reversefootnote">&#8617;</a></p> </li> <li id="fn:5"> <p>Gregory Piatetsky, ‘KDnuggets Exclusive: Interview with Yann LeCun, Deep Learning Expert, Director of Facebook AI Lab’ Feb 20, 2014. http://www.kdnuggets.com/2014/02/exclusive-yann-lecun-deep-learning-facebook-ai-lab.html <a href="#fnref:5" class="reversefootnote">&#8617;</a></p> </li> <li id="fn:6"> <p>Teuvo Kohonen. 1988. Self-organized formation of topologically correct feature maps. In Neurocomputing: foundations of research, James A. Anderson and Edward Rosenfeld (Eds.). MIT Press, Cambridge, MA, USA 509-521. <a href="#fnref:6" class="reversefootnote">&#8617;</a></p> </li> <li id="fn:7"> <p>Gail A. Carpenter and Stephen Grossberg. 1988. The ART of Adaptive Pattern Recognition by a Self-Organizing Neural Network. Computer 21, 3 (March 1988), 77-88. <a href="#fnref:7" class="reversefootnote">&#8617;</a></p> </li> <li id="fn:8"> <p>H. Bourlard and Y. Kamp. 1988. Auto-association by multilayer perceptrons and singular value decomposition. Biol. Cybern. 59, 4-5 (September 1988), 291-294. <a href="#fnref:8" class="reversefootnote">&#8617;</a></p> </li> <li id="fn:9"> <p>P. Baldi and K. Hornik. 1989. Neural networks and principal component analysis: learning from examples without local minima. Neural Netw. 2, 1 (January 1989), 53-58. <a href="#fnref:9" class="reversefootnote">&#8617;</a></p> </li> <li id="fn:10"> <p>Hinton, G. E. &amp; Zemel, R. S. (1993), Autoencoders, Minimum Description Length and Helmholtz Free Energy., in Jack D. Cowan; Gerald Tesauro &amp; Joshua Alspector, ed., ‘NIPS’ , Morgan Kaufmann, , pp. 3-10 . <a href="#fnref:10" class="reversefootnote">&#8617;</a></p> </li> <li id="fn:11"> <p>Ackley, D. H., Hinton, G. E., &amp; Sejnowski, T. J. (1985). A learning algorithm for boltzmann machines*. Cognitive science, 9(1), 147-169. <a href="#fnref:11" class="reversefootnote">&#8617;</a></p> </li> <li id="fn:ebm"> <p>LeCun, Y., Chopra, S., Hadsell, R., Ranzato, M., &amp; Huang, F. (2006). A tutorial on energy-based learning. Predicting structured data, 1, 0. <a href="#fnref:ebm" class="reversefootnote">&#8617;</a></p> </li> <li id="fn:12"> <p>Neal, R. M. (1992). Connectionist learning of belief networks. Artificial intelligence, 56(1), 71-113. <a href="#fnref:12" class="reversefootnote">&#8617;</a></p> </li> <li id="fn:13"> <p>Hinton, G. E., Dayan, P., Frey, B. J., &amp; Neal, R. M. (1995). The” wake-sleep” algorithm for unsupervised neural networks. Science, 268(5214), 1158-1161. <a href="#fnref:13" class="reversefootnote">&#8617;</a></p> </li> <li id="fn:14"> <p>Dayan, P., Hinton, G. E., Neal, R. M., &amp; Zemel, R. S. (1995). The helmholtz machine. Neural computation, 7(5), 889-904. 
<a href="#fnref:14" class="reversefootnote">&#8617;</a></p> </li> </ol> </div> <p><a href="/writing/a-brief-history-of-neural-nets-and-deep-learning-part-2/">A 'Brief' History of Neural Nets and Deep Learning, Part 2</a> was originally published by Andrey Kurenkov at <a href="">Andrey Kurenkov's Web World</a> on December 24, 2015.</p> <![CDATA[A 'Brief' History of Neural Nets and Deep Learning, Part 1]]> /writing/a-brief-history-of-neural-nets-and-deep-learning 2015-12-24T15:19:34-08:00 2015-12-24T15:19:34-08:00 www.andreykurenkov.com contact@andreykurenkov.com <p>This is the first part of ‘A Brief History of Neural Nets and Deep Learning’. Part 2 is <a href="http://www.andreykurenkov.com/writing/a-brief-history-of-neural-nets-and-deep-learning-part-2">here</a>, and parts 3 and 4 are <a href="http://www.andreykurenkov.com/writing/a-brief-history-of-neural-nets-and-deep-learning-part-3">here</a> and <a href="http://www.andreykurenkov.com/writing/a-brief-history-of-neural-nets-and-deep-learning-part-4">here</a>. In this part, we shall cover the birth of neural nets with the Perceptron in 1958, the AI Winter of the 70s, and neural nets’ return to popularity with backpropagation in 1986.</p> <h1 id="prologue-the-deep-learning-tsunami">Prologue: The Deep Learning Tsunami</h1> <blockquote> <p>“Deep Learning waves have lapped at the shores of computational linguistics for several years now, but 2015 seems like the year when the full force of the tsunami hit the major Natural Language Processing (NLP) conferences.” -<a href="http://www.mitpressjournals.org/doi/pdf/10.1162/COLI_a_00239">Dr. Christopher D. Manning, Dec 2015</a> <sup id="fnref:1"><a href="#fn:1" class="footnote">1</a></sup></p> </blockquote> <p>This may sound hyperbolic - to say the established methods of an entire field of research are quickly being superseded by a new discovery, as if hit by a research ‘tsunami’. But, this catastrophic language is appropriate for describing the meteoric rise of Deep Learning over the last several years - a rise characterized by drastic improvements over reigning approaches towards the hardest problems in AI, massive investments from industry giants such as Google, and exponential growth in research publications (and Machine Learning graduate students). Having taken several classes on Machine Learning, and even used it in undergraduate research, I could not help but wonder if this new ‘Deep Learning’ was anything fancy or just a scaled up version of the ‘artificial neural nets’ that were already developed by the late 80s. And let me tell you, the answer is quite a story - the story of not just neural nets, not just of a sequence of research breakthroughs that make Deep Learning somewhat more interesting than ‘big neural nets’ (that I will attempt to explain in a way that just about anyone can understand), but most of all of how several unyielding researchers made it through dark decades of banishment to finally redeem neural nets and achieve the dream of Deep Learning.</p> <div><button class="btn" data-toggle="collapse" data-target="#sources"> Disclaimer: not an expert, more in depth sources, corrections &raquo; </button></div> <blockquote class="aside"><p id="sources" class="collapse" style="height: 0px;"> I am in no capacity an expert on this topic. 
In-depth technical overviews with long lists of references written by those who actually made the field what it is include Yoshua Bengio's <a href="http://www.iro.umontreal.ca/~lisa/pointeurs/TR1312.pdf">"Learning Deep Architectures for AI"</a>, Jürgen Schmidhuber's <a href="http://arxiv.org/pdf/1404.7828v4.pdf">"Deep Learning in Neural Networks: An Overview"</a> and LeCun et al.'s <a href="http://www.cs.toronto.edu/~hinton/absps/NatureDeepReview.pdf">"Deep learning"</a>. In particular, this is mostly a history of research in the US/Canada AI community, and even there it will not mention many researchers; a particularly in-depth history of the field that covers these omissions is Jürgen Schmidhuber's <a href="http://people.idsia.ch/~juergen/deep-learning-overview.html">"Deep Learning in Neural Networks: An Overview"</a>. I am also most certainly not a professional writer, and will cop to there being shorter and much less technical overviews written by professional writers such as Paul Voosen's <a href="http://chronicle.com/article/The-Believers/190147">"The Believers"</a>, John Markoff's <a href="http://www.nytimes.com/2012/11/24/science/scientists-see-advances-in-deep-learning-a-part-of-artificial-intelligence.html">"Scientists See Promise in Deep-Learning Programs"</a> and Gary Marcus's <a href="http://www.newyorker.com/news/news-desk/is-deep-learning-a-revolution-in-artificial-intelligence">"Is “Deep Learning” a Revolution in Artificial Intelligence?"</a>. I also will stay away from getting too technical here, but there is a plethora of tutorials on the internet on all the major topics covered in brief by me. <br /> Any corrections would be greatly appreciated, though I will note some omissions are intentional since I want to try and keep this 'brief' and a good mix of simple technical explanations and storytelling. </p></blockquote> <p><br /></p> <h1 id="the-centuries-old-machine-learning-algorithm">The Centuries Old Machine Learning Algorithm</h1> <figure> <img class="postimagesmall" src="https://upload.wikimedia.org/wikipedia/commons/3/3a/Linear_regression.svg" alt="Linear Regression" /> <figcaption>Linear regression <a href="https://upload.wikimedia.org/wikipedia/commons/3/3a/Linear_regression.svg">(Source)</a></figcaption> </figure> <p>Let’s start with a brief primer on what Machine Learning is. Take some points on a 2D graph, and draw a line that fits them as well as possible. What you have just done is generalized from a few examples of pairs of input values (x) and output values (y) to a general function that can map any input value to an output value. This is known as linear regression, and it is a wonderful little <a href="https://en.wikipedia.org/wiki/Linear_regression#cite_note-4">200 year old</a> technique for extrapolating a general function from some set of input-output pairs. 
And here’s why having such a technique is wonderful: there is an incalculable number of functions that are hard to develop equations for directly, but are easy to collect examples of input and output pairs for in the real world - for instance, the function mapping an input of recorded audio of a spoken word to an output of what that spoken word is.</p> <p>Linear regression is a bit too wimpy a technique to solve the problem of speech recognition, but what it does is essentially what <strong>supervised Machine Learning</strong> is all about: ‘learning’ a function given a <strong>training set</strong> of <strong>examples</strong>, where each example is a pair of an input and output from the function (we shall touch on the unsupervised flavor in a little while). In particular, machine learning methods should derive a function that can generalize well to inputs not in the training set, since then we can actually apply it to inputs for which we do not have an output. For instance, Google’s current speech recognition technology is powered by Machine Learning with a massive training set, but not nearly as big a training set as all the possible speech inputs you might task your phone with understanding.</p> <p>This generalization principle is so important that there is almost always a <strong>test set</strong> of data (more examples of inputs and outputs) that is not part of the training set. The separate set can be used to evaluate the effectiveness of the machine learning technique by seeing how many of the examples the method correctly computes outputs for given the inputs. The nemesis of generalization is <strong>overfitting</strong> - learning a function that works really well for the training set but badly on the test set. Since machine learning researchers needed means to compare the effectiveness of their methods, over time there appeared standard <strong>datasets</strong> of training and testing sets that could be used to evaluate machine learning algorithms.</p> <p>Okay okay, enough definitions. Point is - our line drawing exercise is a very simple example of supervised machine learning: the points are the training set (X is input and Y is output), the line is the approximated function, and we can use the line to find Y values for X values that don’t match any of the points we started with. Don’t worry, the rest of this history will not be nearly so dry as all this. Here we go.</p> <h1 id="the-folly-of-false-promises">The Folly of False Promises</h1> <p>Why have all this prologue with linear regression, since the topic here is ostensibly neural nets? Well, in fact linear regression bears some resemblance to the first idea conceived specifically as a method to make machines learn: <a href="http://psycnet.apa.org/index.cfm?fa=buy.optionToBuy&amp;id=1959-09865-001">Frank Rosenblatt’s <strong>Perceptron</strong></a><sup id="fnref:2"><a href="#fn:2" class="footnote">2</a></sup>.</p> <figure> <img class="postimagesmall" src="https://draftin.com:443/images/34998?token=eP9x7J-SbQvC30KjZ23yt38XOWoU6_d0JVo72rZF-EHtWcLW-zNPyZU2ZYJu4VPm7Fxs20Gd7mPoRylRtNFigXs" alt="Perceptron" /> <figcaption>A diagram showing how the Perceptron works. 
<a href="http://cse-wiki.unl.edu/wiki/images/0/0f/Perceptron.jpg">(Source)</a></figcaption> </figure> <p>A psychologist, Rosenblatt conceived of the Percetron as a simplified mathematical model of how the neurons in our brains operate: it takes a set of binary inputs (nearby neurons), multiplies each input by a continuous valued weight (the synapse strength to each nearby neuron), and thresholds the sum of these weighted inputs to output a 1 if the sum is big enough and otherwise a 0 (in the same way neurons either fire or do not). Most of the inputs to a Perceptron are either some data or the output of another Perceptron, but an extra detail is that Perceptrons also have one special ‘bias’ input, which just has a value of 1 and basically ensures that more functions are computable with the same input by being able to offset the summed value. This model of the neuron built on the work of Warren McCulloch and Walter Pitts <a href="http://www.minicomplexity.org/pubs/1943-mcculloch-pitts-bmb.pdf">Mcculoch-Pitts</a><sup id="fnref:3"><a href="#fn:3" class="footnote">3</a></sup>, who showed that a neuron model that sums binary inputs and outputs a 1 if the sum exceeds a certain threshold value, and otherwise outputs a 0, can model the basic OR/AND/NOT functions. This, in the early days of AI, was a big deal - the predominant thought at the time was that making computers able to perform formal logical reasoning would essentially solve AI.</p> <figure> <img class="postimagesmall" src="https://draftin.com:443/images/34832?token=vQiHNdPnUSPiPJcJcgobMGedDJRvgguccVapCN76gZnxqVQIKczfq4BqUQ06bWdVXnabb3tScv_04nigKqMZjS4" alt="Perceptron 2" /> <figcaption>Another diagram, showing the biological inspiration. The <b>activation function</b> is what people now call the non-linear function applied to the weighted input sum to produce the output of the artificial neuron - in the case of Rosenblatt's Perceptron, the function just a thresholding operation. <a href="http://cs231n.github.io/neural-networks-1/">(Source)</a> </figcaption> </figure> <p>However, the Mcculoch-Pitts model lacked a mechanism for learning, which was crucial for it to be usable for AI. This is where the Perceptron excelled - Rosenblatt came up with a way to make such artificial neurons learn, inspired by the <a href="http://onlinelibrary.wiley.com/doi/10.1002/cne.900930310/abstract">foundational work</a><sup id="fnref:4"><a href="#fn:4" class="footnote">4</a></sup> of Donald Hebb. Hebb put forth the unexpected and hugely influential idea that knowledge and learning occurs in the brain primarily through the formation and change of synapses between neurons - concisely stated as Hebb’s Rule:</p> <blockquote> <p>“When an axon of cell A is near enough to excite a cell B and repeatedly or persistently takes part in firing it, some growth process or metabolic change takes place in one or both cells such that A’s efficiency, as one of the cells firing B, is increased.”</p> </blockquote> <p>The Perceptron did not follow this idea exactly, but having weights on the inputs allowed for a very simple and intuitive learning scheme: given a <strong>training set</strong> of input-output examples the Perceptron should ‘learn’ a function from, for each example increase the weights if the Perceptron output for that example’s input is too low compared to the example, and otherwise decrease the weights if the output is too high. 
Stated ever so slightly more formally, the algorithm is as follows:</p> <ol> <li>Start off with a Perceptron having random weights and a training set</li> <li>For the inputs of an example in the training set, compute the Perceptron’s output</li> <li>If the output of the Perceptron does not match the output that is known to be correct for the example: <ul> <li>If the output should have been 0 but was 1, decrease the weights that had an input of 1.</li> <li>If the output should have been 1 but was 0, increase the weights that had an input of 1.</li> </ul> </li> <li>Go to the next example in the training set and repeat steps 2-4 until the Perceptron makes no more mistakes</li> </ol> <p>This procedure is simple, and produces a simple result: an input linear function (the weighted sum), just as with linear regression, ‘squashed’ by a non-linear <strong>activation function</strong> (the thresholding of the sum). It’s fine to threshold the sum when the function can only have a finite set of output values (as with logical functions, in which case there are only two - True/1 and False/0), and so the problem is not so much to generate a continuous-numbered output for any set of inputs - regression - as to categorize the inputs with a correct label - <strong>classification</strong>.</p> <figure> <img class="postimagesmall" src="https://upload.wikimedia.org/wikipedia/en/5/52/Mark_I_perceptron.jpeg" alt="." /> <figcaption>'Mark I Perceptron at the Cornell Aeronautical Laboratory', hardware implementation of the first Perceptron (Source: Wikipedia / Cornell Library)</figcaption> </figure> <p>Rosenblatt implemented the idea of the Perceptron in custom hardware (this being before fancy programming languages were in common use), and showed it could be used to learn to classify simple shapes correctly with 20x20 pixel-like inputs. And so, machine learning was born - a computer was built that could approximate a function given known input and output pairs from it. In this case it learned a little toy function, but it was not difficult to envision useful applications such as converting the mess that is human handwriting into machine-readable text.</p> <p>But wait, so far we’ve only seen how one Perceptron is able to learn to output a one or a zero - how can this be extended to work for classification tasks with many categories, such as human handwriting (in which there are many letters and digits as the categories)? This is impossible for one Perceptron, since it has only one output, but functions with multiple outputs can be learned by having multiple Perceptrons in a <strong>layer</strong>, such that all these Perceptrons receive the same input and each one is responsible for one output of the function. Indeed, neural nets (or, formally, ‘Artificial Neural Networks’ - ANNs) are nothing more than layers of Perceptrons - or neurons, or units, as they are usually called today - and at this stage there was just one layer - the <strong>output layer</strong>. So, a prototypical example of neural net use is to classify an image of a handwritten digit. The inputs are the pixels of the image , and there are 10 output neurons with each one corresponding to one of the 10 possible digit values. 
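</p> <p>In code, the learning procedure outlined in the numbered steps above really is just a handful of lines. Here is a minimal sketch in Python - the AND training set, the epoch cap, and the treatment of the bias as just another weight are made up for illustration; this is the rule as described above, not Rosenblatt's hardware or original notation:</p> <pre><code>def perceptron_output(weights, bias, inputs):
    """Threshold the weighted sum of the inputs: output 1 if it is big enough, else 0."""
    total = bias + sum(w * x for w, x in zip(weights, inputs))
    return 1 if total > 0 else 0

def train_perceptron(examples, n_inputs, max_epochs=20):
    """Apply the update rule from the steps above until no mistakes are made (or we give up)."""
    weights, bias = [0.0] * n_inputs, 0.0
    for _ in range(max_epochs):
        mistakes = 0
        for inputs, target in examples:
            output = perceptron_output(weights, bias, inputs)
            if output != target:
                mistakes += 1
                # Output too low: increase the weights that had an input of 1.
                # Output too high: decrease them. The bias acts as a weight on a constant 1.
                delta = 1 if target == 1 else -1
                weights = [w + delta * x for w, x in zip(weights, inputs)]
                bias += delta
        if mistakes == 0:
            break
    return weights, bias

# AND is learnable by a single Perceptron, so this converges after a few passes.
and_examples = [([0, 0], 0), ([0, 1], 0), ([1, 0], 0), ([1, 1], 1)]
learned_weights, learned_bias = train_perceptron(and_examples, n_inputs=2)
</code></pre> <p>Since AND is linearly separable, the loop stops after a few passes over the examples; for a function like XOR, which we will meet shortly, no set of weights would ever make it stop.</p> <p>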
In this case, ideally only the neuron corresponding to the correct digit outputs 1 and the rest output 0; in practice, the output with the highest weighted sum is taken to be the answer.</p> <figure> <img class="postimagesmall" src="https://draftin.com:443/images/34466?token=YFsmpDuQfD3DDylinRD8F4sLOgjCFm4Aow1gIWoCY5KED3bnQKs17RaTja95OIQQWdr25dqS2fxq_6mDwwdcs9Y" alt="Neural net with an output layer." /> <figcaption>A neural net with multiple outputs.</figcaption> </figure> <p>It is also possible to conceive of neural nets with artificial neurons different from the Perceptron. For instance, the thresholding activation function is not strictly necessary; Bernard Widrow and Ted Hoff soon explored the option of just outputting the weighted input in 1960 with <a href="http://www-isl.stanford.edu/~widrow/papers/t1960anadaptive.pdf">“An adaptive “ADALINE” neuron using chemical “memistors”</a><sup id="fnref:5"><a href="#fn:5" class="footnote">5</a></sup>, and showed how these ‘Adaptive Linear Neurons’ could be incorporated into electrical circuits with ‘memistors’ - resistors with memory. They also showed that not having the threshold activation function is mathematically nice, because the neuron’s learning mechanism can be formally based on minimizing the error through good ol’ calculus. See, with the neuron’s function not being made weird by this sharp thresholding jump from 0 to 1, a measure of how much the error changes when each weight is changed (the derivative) can be used to drive the error down and find the optimal weight values. As we shall see, finding the right weights using the derivatives of the training error with respect to each weight is exactly how neural nets are typically trained to this day.</p> <div><button class="btn" data-toggle="collapse" data-target="#maths"> Aside: a bit more on the math &raquo; </button></div> <blockquote class="aside"><p id="maths" class="collapse" style="height: 0px;"> In short a function is differentiable if it is a nice smooth line - Rosenblatt's Perceptron computed the output in such a way that the output abruptly jumped from 0 to 1 if the input exceeded some number, whereas Adaline simply output the input which was a nice non-jumpy line. For a much more in depth explanation of all this math you can read <a href="http://sebastianraschka.com/Articles/2015_singlelayer_neurons.html">this tutorial</a>, or any resource from Google - let us focus on the fun high-level concepts and story here. </p></blockquote> <p>If we think about ADALINE a bit more we will come up with a further insight: finding a set of weights for a number of inputs is really just a form of linear regression. And again, as with linear regression, this would not be enough to solve the complex AI problems of Speech Recognition or Computer Vision. What McCulloch and Pitts and Rosenblatt were really excited about is the broad idea of Connectionism: that networks of such simple computational units can be vastly more powerful and solve the hard problems of AI. And, Rosenblatt said as much, as in this frankly ridiculous New York Times quote <a href="http://query.nytimes.com/gst/abstract.html?res=9D01E4D8173DE53BBC4053DFB1668383649EDE">from the time</a><sup id="fnref:6"><a href="#fn:6" class="footnote">6</a></sup>:</p> <blockquote> <p>“The Navy revealed the embryo of an electronic computer today that it expects will be able to walk, talk, see, write, reproduce itself and be conscious of its existence … Dr. 
Frank Rosenblatt, a research psychologist at the Cornell Aeronautical Laboratory, Buffalo, said Perceptrons might be fired to the planets as mechanical space explorers”</p> </blockquote> <p>Or, have a look at this TV segment from the time:</p> <figure> <iframe width="420" height="315" src="https://www.youtube.com/embed/aygSMgK3BEM" frameborder="0" allowfullscreen=""></iframe> <figcaption>The stuff promised in this video - still not really around.</figcaption> </figure> <p>This sort of talk no doubt irked other researchers in AI, many of whom were focusing on approaches based on manipulation of symbols with concrete rules that followed from the mathematical laws of logic. Marvin Minsky, founder of the MIT AI Lab, and Seymour Papert, director of the lab at the time, were some of the researchers who were skeptical of the hype and in 1969 published their skepticism in the form of a rigorous analysis of the limitations of Perceptrons in a seminal book aptly named <a href="https://mitpress.mit.edu/books/perceptrons">Perceptrons</a><sup id="fnref:7"><a href="#fn:7" class="footnote">7</a></sup>. Interestingly, Minsky may have actually been the first researcher to implement a hardware neural net with 1951’s <a href="https://en.wikipedia.org/wiki/Stochastic_neural_analog_reinforcement_calculator">SNARC</a> (Stochastic Neural Analog Reinforcement Calculator) <sup id="fnref:SNARC"><a href="#fn:SNARC" class="footnote">8</a></sup>, which preceded Rosenblatt’s work by many years. But the lack of any trace of his work on this system and the critical nature of the analysis in <em>Perceptrons</em> suggests that he concluded this approach to AI was a dead end. The most widely discussed element of this analysis is the elucidation of the limits of a Perceptron - they could not, for instance, learn the simple boolean function XOR because it is not <strong>linearly separable</strong>. Though the history here is vague, this publication is widely believed to have helped usher in the first of the <strong>AI Winters</strong> - a period following a massive wave of hype for AI characterized by disillusionment that causes a freeze in funding and publications.</p> <figure> <img class="postimagesmall" src="https://draftin.com:443/images/35002?token=NfMQ9LV1TFNi0v0QmbwfNTxYgbYjG7kjHqR9DaATySwhqdIE8Xc3Lfrcwa6JA2CAwSvQFYQL7fj_fwFT6o_Zelw" /> <figcaption>Visualization of the limitations of Perceptrons. Finding a linear function on the inputs X,Y to correctly output + or - is equivalent to drawing a line on this 2D graph separating all + cases from - cases; clearly, for the third case this is impossible. </figcaption> </figure> <h1 id="the-thaw-of-the-ai-winter">The Thaw of the AI Winter</h1> <p>So, things were not good for neural nets. But why? The idea, after all, was to combine a bunch of simple mathematical neurons to do complicated things, not to use a single one. In other terms, the idea was to send the input not just to one <strong>output layer</strong>, but first to arbitrarily many neurons that form a <strong>hidden layer</strong>, so called because their output acts as input to another hidden layer or to the output layer of neurons. 
Only the output layer’s output is ‘seen’ - it is the answer of the neural net - but all the intermediate computations done by the hidden layer(s) can tackle vastly more complicated problems than just a single layer.</p> <figure> <img class="postimagesmall" src="https://draftin.com:443/images/34833?token=w5fXrfJQKt_CECpofd0YJUaAvNZEe2Qc0hgwlCj-x-48i8Bs5CtQDe52yns49AzIgK3NnSCwJvcmqQ1kp5mVwB0" alt="Hidden layers" /> <figcaption>Neural net with two hidden layers <a href="http://cs231n.github.io/neural-networks-1/">(Excellent Source)</a></figcaption> </figure> <p>The reason hidden layers are good, in basic terms, is that the hidden layers can find <strong>features</strong> within the data and allow following layers to operate on those features rather than the noisy and large raw data. For example, in the very common neural net task of finding human faces in an image, the first hidden layer could take in the raw pixel values and find lines, circles, ovals, and so on within the image. The next layer would receive the position of these lines, circles, ovals, and so on within the image and use those to find the location of human faces - much easier! And people, basically, understood this. In fact, until recently machine learning techniques were commonly not applied directly to raw data inputs such as images or audio. Instead, machine learning was done on data after it had passed through <strong>feature extraction</strong> - that is, to make learning easier machine learning was done on preprocessed data from which more useful features such as angles or shapes had been already extracted.</p> <div><button class="btn" data-toggle="collapse" data-target="#nonlinearwhy"> Aside: why have non-linear activation functions &raquo; </button></div> <blockquote class="aside"><p id="nonlinearwhy" class="collapse" style="height: 0px;"> Earlier, we saw that the weighted sum computed by the Perceptron is usually put through a non-linear activation function. Now we can get around to fully answering an implicit question: why bother? Two reasons: 1. Without the activation function, the learned functions could only be linear, and most 'interesting' functions are not linear (for instance, logic functions that only output 1 or 0 or classification functions that output the category). 2. Several layers of linear Perceptrons can always be collapsed into only one layer due to the linearity of all the computations - the same cannot be done with non-linear activation functions. <br /> So, in intuitive speak a network can massage the data better with activation functions than without. </p></blockquote> <figure> <img class="postimagesmall" src="https://draftin.com:443/images/35001?token=qG2LYkSR81r2DESqTceIWvJdDBoXVK-PRShf3s-57FZbvr81A9xC9-4x46XNiD33WgEy9Kh3A95i09dYQnHdDrE" alt="Feature extraction" /> <figcaption>Visualization of traditional handcrafted feature extraction. <a href="http://lear.inrialpes.fr/people/vandeweijer/color_descriptors.html">(Source)</a></figcaption> </figure> <p>So, it is important to note Minsky and Papert’s analysis of Perceptrons did not merely show the impossibility of computing XOR with a single Perceptron, but specifically argued that it had to be done with multiple layers of Perceptrons - what we now call multilayer neural nets - and that Rosenblatt’s learning algorithm did not work for multiple layers. And that was the real problem: the simple learning rule previously outlined for the Perceptron does not work for multiple layers. 
To see why, let’s reiterate how a single layer of Perceptrons would learn to compute some function:</p> <ol> <li>A number of Perceptrons equal to the number of the function’s outputs would be started off with small initial weights</li> <li>For the inputs of an example in the training set, compute the Perceptrons’ output</li> <li>For each Perceptron, if the output does not match the example’s output, adjust the weights accordingly</li> <li>Go to the next example in the training set and repeat steps 2-4 until the Perceptrons no longer make mistakes</li> </ol> <p>The reason why this does not work for multiple layers should be intuitively clear: the example only specifies the correct output for the final output layer, so how in the world should we know how to adjust the weights of Perceptrons in layers before that? The answer, despite taking some time to derive, proved to be once again based on age-old calculus: the chain rule. The key realization was that if the neural net neurons were not quite Perceptrons, but were made to compute the output with an activation function that was still non-linear but also differentiable, as with Adaline, not only could the derivative be used to adjust the weight to minimize error, but the chain rule could also be used to compute the derivative for all the neurons in a prior layer and thus the way to adjust their weights would also be known. Or, more simply: we can use calculus to assign some of the blame for any training set mistakes in the output layer to each neuron in the previous hidden layer, and then we can further split up blame if there is another hidden layer, and so on - we <strong>backpropagate</strong> the error. And so, we can find how much the error changes if we change any weight in the neural net, including those in the hidden layers, and use an optimization technique (for a long time, typically <strong>stochastic gradient descent</strong>) to find the optimal weights to minimize the error.</p> <figure> <img class="postimagesmall" src="https://draftin.com:443/images/34948?token=LF6pwbG4bKjYLDLj4GDemUWKiFy8SQC5tluQgSGnxKxiaoFUlJ9FaYoC_Syh6t4fvzOT8rwz1fnBQ0xInJ_tuO0" alt="Backprop" /> <figcaption>The basic idea of backpropagation. <a href="http://devblogs.nvidia.com/parallelforall/inference-next-step-gpu-accelerated-deep-learning/">(Source)</a></figcaption> </figure> <p><strong>Backpropagation</strong> was derived by multiple researchers in the early 60’s and implemented to run on computers much as it is today as early as 1970 by Seppo Linnainmaa<sup id="fnref:8"><a href="#fn:8" class="footnote">9</a></sup>, but Paul Werbos was first in the US to propose that it could be used for neural nets after analyzing it in depth in his 1974 PhD Thesis<sup id="fnref:9"><a href="#fn:9" class="footnote">10</a></sup>. 
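</p> <p>To make this blame-assignment idea concrete, here is a minimal sketch of backpropagation for a tiny net with one hidden layer of sigmoid neurons, trained by stochastic gradient descent. The network size, learning rate, and the choice of XOR as the training set are made up for illustration - this is the general chain-rule recipe described above, not the exact formulation from any of these papers:</p> <pre><code>import math
import random

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

# A tiny net: 2 inputs (plus a constant 'bias' input of 1) -> 2 hidden sigmoid units -> 1 output.
w_hidden = [[random.uniform(-1, 1) for _ in range(3)] for _ in range(2)]
w_output = [random.uniform(-1, 1) for _ in range(3)]
rate = 0.5

def train_step(inputs, target):
    global w_output
    x = list(inputs) + [1.0]  # append the bias input
    # Forward pass: compute every unit's output.
    h = [sigmoid(sum(w * xi for w, xi in zip(ws, x))) for ws in w_hidden] + [1.0]
    out = sigmoid(sum(w * hi for w, hi in zip(w_output, h)))
    # Backward pass: the chain rule assigns each unit its share of the blame for the error.
    out_blame = (out - target) * out * (1 - out)
    hidden_blame = [out_blame * w_output[j] * h[j] * (1 - h[j]) for j in range(2)]
    # Gradient descent: nudge every weight against its share of the blame.
    w_output = [w - rate * out_blame * h[j] for j, w in enumerate(w_output)]
    for j in range(2):
        w_hidden[j] = [w - rate * hidden_blame[j] * x[i] for i, w in enumerate(w_hidden[j])]

# XOR, the function a single Perceptron cannot learn, as the training set.
examples = [([0.0, 0.0], 0), ([0.0, 1.0], 1), ([1.0, 0.0], 1), ([1.0, 1.0], 0)]
for _ in range(20000):
    train_step(*random.choice(examples))
</code></pre> <p>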
Interestingly, as with Perceptrons, Werbos was loosely inspired by work modeling the human mind, in this case the psychological theories of Freud as <a href="http://www.die.uchile.cl/ieee-cis/evic2005/files/AD2004Werbosv2.pdf">he himself recounts</a><sup id="fnref:10"><a href="#fn:10" class="footnote">11</a></sup>:</p> <blockquote> <p>“In 1968, I proposed that we somehow imitate Freud’s concept of a backwards flow of credit assignment, flowing back from neuron to neuron … I explained the reverse calculations using a combination of intuition and examples and the ordinary chain rule, though it was also exactly a translation into mathematics of things that Freud had previously proposed in his theory of psychodynamics!”</p> </blockquote> <p>Despite solving the question of how multilayer neural nets could be trained, and seeing it as such while working on his PhD thesis, Werbos did not publish on the application of backprop to neural nets until 1982 due to the chilling effects of the AI Winter. In fact, Werbos thought the approach would make sense for solving the problems pointed out in <em>Perceptrons</em>, but the community at large lost any faith in tackling those problems:</p> <blockquote> <p>“Minsky’s book was best known for arguing that (1) we need to use MLPs [multilayer perceptrons, another term for multilayer neural nets] even to represent simple nonlinear functions such as the XOR mapping; and (2) no one on earth had found a viable way to <em>train</em> MLPs good enough to learn such simple functions. Minsky’s book convinced most of the world that neural networks were a discredited dead-end – the worst kind of heresy. Widrow has stressed that this pessimism, which squashed the early “perceptron” school of AI, should not really be blamed on Minsky. Minsky was merely summarizing the experience of hundreds of sincere researchers who had tried to find good ways to train MLPs, to no avail. There had been islands of hope, such as the algorithm which Rosenblatt called “backpropagation” (not at all the same as what we now call backpropagation!), and Amari’s brief suggestion that we might consider least squares [which is the basis of simple linear regression] as a way to train neural networks (without discussion of how to get the derivatives, and with a warning that he did not expect much from the approach). But the pessimism at that time became terminal. In the early 1970s, I did in fact visit Minsky at MIT. I proposed that we do a joint paper showing that MLPs can in fact overcome the earlier problems … But Minsky was not interested(14). In fact, no one at MIT or Harvard or any place I could find was interested at the time.”</p> </blockquote> <p>It seems that it was because of this lack of academic interest that it was not until more than a decade later, in 1986, that this approach was popularized in <a href="http://www.iro.umontreal.ca/~vincentp/ift3395/lectures/backprop_old.pdf">“Learning representations by back-propagating errors”</a> by David Rumelhart, Geoffrey Hinton, and Ronald Williams <sup id="fnref:11"><a href="#fn:11" class="footnote">12</a></sup>. Despite the numerous discoveries of the method (the paper even explicitly mentions David Parker and Yann LeCun as two people who discovered it beforehand) the 1986 publication stands out for how concisely and clearly the idea is stated. In fact, as a student of Machine Learning it is easy to see that the description in their paper is essentially identical to the way the concept is still explained in textbooks and AI classes. 
A <a href="http://www-isl.stanford.edu/~widrow/papers/j199030years.pdf">retrospective in IEEE</a><sup id="fnref:12"><a href="#fn:12" class="footnote">13</a></sup> echoes this notion:</p> <blockquote> <p>“Unfortunately, Werbos’s work remained almost unknown in the scientific community. In 1982, Parker rediscovered the technique [39] and in 1985, published a report on it at M.I.T. [40]. Not long after Parker published his findings, Rumelhart, Hinton, and Williams [41], [42] also rediscovered the techniques and, largely as a result of the clear framework within which they presented their ideas, they finally succeeded in making it widely known.”</p> </blockquote> <p>But the three authors went much further than just present this new learning algorithm. In the same year they published the much more in-depth <a href="http://psych.stanford.edu/~jlm/papers/PDP/Volume%201/Chap8_PDP86.pdf">“Learning internal representations by error propagation”</a><sup id="fnref:13"><a href="#fn:13" class="footnote">14</a></sup>, which specifically addressed the problems discussed by Minsky in <em>Perceptrons</em>. Though the idea was conceived by people in the past, it was precisely this formulation in 1986 that made it widely understood how multilayer neural nets could be trained to tackle complex learning problems. And so, neural nets were back! In <a href="http://www.andreykurenkov.com/writing/a-brief-history-of-neural-nets-and-deep-learning-part-2">part 2</a>, we shall see how just a few years later backpropagation and some other tricks discussed in “Learning internal representations by error propagation” were applied to a very significant problem: enabling computers to read human handwriting.</p> <div class="footnotes"> <ol> <li id="fn:1"> <p>Christopher D. Manning. (2015). Computational Linguistics and Deep Learning Computational Linguistics, 41(4), 701–707. <a href="#fnref:1" class="reversefootnote">&#8617;</a></p> </li> <li id="fn:2"> <p>F. Rosenblatt. The perceptron, a perceiving and recognizing automaton Project Para. Cornell Aeronautical Laboratory, 1957. <a href="#fnref:2" class="reversefootnote">&#8617;</a></p> </li> <li id="fn:3"> <p>W. S. McCulloch and W. Pitts. A logical calculus of the ideas immanent in nervous activity. The bulletin of mathematical biophysics, 5(4):115–133, 1943. <a href="#fnref:3" class="reversefootnote">&#8617;</a></p> </li> <li id="fn:4"> <p>The organization of behavior: A neuropsychological theory. D. O. Hebb. John Wiley And Sons, Inc., New York, 1949 <a href="#fnref:4" class="reversefootnote">&#8617;</a></p> </li> <li id="fn:5"> <p>B. Widrow et al. Adaptive ”Adaline” neuron using chemical ”memistors”. Number Technical Report 1553-2. Stanford Electron. Labs., Stanford, CA, October 1960. <a href="#fnref:5" class="reversefootnote">&#8617;</a></p> </li> <li id="fn:6"> <p>“New Navy Device Learns By Doing”, New York Times, July 8, 1958. <a href="#fnref:6" class="reversefootnote">&#8617;</a></p> </li> <li id="fn:7"> <p>Perceptrons. An Introduction to Computational Geometry. MARVIN MINSKY and SEYMOUR PAPERT. M.I.T. Press, Cambridge, Mass., 1969. <a href="#fnref:7" class="reversefootnote">&#8617;</a></p> </li> <li id="fn:SNARC"> <p>Minsky, M. (1952). A neural-analogue calculator based upon a probability model of reinforcement. Harvard University Pychological Laboratories internal report. <a href="#fnref:SNARC" class="reversefootnote">&#8617;</a></p> </li> <li id="fn:8"> <p>Linnainmaa, S. (1970). 
The representation of the cumulative rounding error of an algorithm as a Taylor expansion of the local rounding errors. Master’s thesis, Univ. Helsinki. <a href="#fnref:8" class="reversefootnote">&#8617;</a></p> </li> <li id="fn:9"> <p>P. Werbos. Beyond Regression: New Tools for Prediction and Analysis in the Behavioral Sciences. PhD thesis, Harvard University, Cambridge, MA, 1974. <a href="#fnref:9" class="reversefootnote">&#8617;</a></p> </li> <li id="fn:10"> <p>Werbos, P.J. (2006). Backwards differentiation in AD and neural nets: Past links and new opportunities. In <em>Automatic Differentiation: Applications, Theory, and Implementations,</em> pages 15-34. Springer. <a href="#fnref:10" class="reversefootnote">&#8617;</a></p> </li> <li id="fn:11"> <p>Rumelhart, D. E., Hinton, G. E., and Williams, R. J. (1986). Learning representations by back-propagating errors. Nature, 323, 533–536. <a href="#fnref:11" class="reversefootnote">&#8617;</a></p> </li> <li id="fn:12"> <p>Widrow, B., &amp; Lehr, M. (1990). 30 years of adaptive neural networks: perceptron, madaline, and backpropagation. Proceedings of the IEEE, 78(9), 1415-1442. <a href="#fnref:12" class="reversefootnote">&#8617;</a></p> </li> <li id="fn:13"> <p>D. E. Rumelhart, G. E. Hinton, and R. J. Williams. 1986. Learning internal representations by error propagation. In Parallel distributed processing: explorations in the microstructure of cognition, vol. 1, David E. Rumelhart, James L. McClelland, and CORPORATE PDP Research Group (Eds.). MIT Press, Cambridge, MA, USA 318-362 <a href="#fnref:13" class="reversefootnote">&#8617;</a></p> </li> </ol> </div> <p><a href="/writing/a-brief-history-of-neural-nets-and-deep-learning/">A 'Brief' History of Neural Nets and Deep Learning, Part 1</a> was originally published by Andrey Kurenkov at <a href="">Andrey Kurenkov's Web World</a> on December 24, 2015.</p> <![CDATA[Movie Recommendations For The Aspiring Eclectic Intellectual]]> /writing/movie-recommendations-for-the-aspiring-eclectic-intellectual 2015-10-22T16:19:34-07:00 2015-10-22T16:19:34-07:00 www.andreykurenkov.com contact@andreykurenkov.com <p><em>Revised from an email I wrote to a friend some years ago.</em></p> <p>I should warn you this will get overlong. In my 10 or so years of watching Good movies I have come across many that have imprinted themselves onto my memory - hovering images and sounds that in a curious way make me happy to be alive. What I will now do is try to put all those most striking experiences into a flowing set of paragraphs and hope it all coheres. Perhaps these shall have the same effect for you, perhaps not - nonetheless let’s go!</p> <p>Any conversation with me about quality art movies has got to start with Andrei Tarkovsky, the poetically melancholic depths of any Russian’s soul incarnate. Very slow. Very poetic. Beautiful, disturbing, strange, spiritual. An acquired taste, and to some a complete nonsensical bore. <strong>Solaris</strong>, a trippy 3-hour sci fi meditation on love and human consciousness, is a good starting point as the most accessible of his films. But I prefer his latter yet more poetically inclined works, such as <strong>The Sacrifice</strong> - a tale of a man averting a nuclear war through a spiritual journey. Better yet is <strong>Stalker</strong>, a 3-hour meditation on faith and the struggles inside our souls, and one of my favorite films. 
These movies have characters and stories, but describing them is almost beside the point - Tarkovsky’s work is visual poetry, and can only really be understood as such.</p> <figure> <img class="postimage" src="https://draftin.com:443/images/33122?token=7iV9cR_SsFlaUhBUMaA1ig5xTuE344ic99SQ6ElGChkr7zx7bvDTIKyVNc0qro4_AncYv-VBGF6f5oi-24n4yKQ" alt="Stalker" /> </figure> <p>Next, Andrei Zvyagintsev, the recent successor to Tarkovsky’s crown - a sort of enlightened monarch who retains many of the customs but changes the underlying frame of mind. I will without reserve recommend three of his films: <strong>The Return</strong>, <strong>Elena</strong>, and <strong>Leviathan</strong>. The first is his first movie, and is as difficult as it is rewarding to watch, the sort of movie that sinks into you, the sort of movie that you have not seen before, the sort of movie that when you realize its subtle story you experience a moment of wonder and appreciation for its elegance. Or, the sort of movie you call pretentious and stop watching midway through, much as with Tarkovsky. <strong>Elena</strong> is both more and less complicated, a stylish slow neo-noir that many critics amazingly did not comprehend to be a devastating critique of modern Russian society as well as its history. Luckily critics had no such difficulty with <strong>Leviathan</strong>, a transparent and <a href="https://www.youtube.com/watch?v=2oo7H25kirk">devastatingly beautiful</a> critique of the corruption of modern Russian government and spirituality.</p> <figure> <img class="postimage" src="https://draftin.com:443/images/33164?token=g5dWtasZaV7cc9NwSJ2RyPSJU_79js7SZk9Lr1Wi-U9V5JK8reBPidTzQwwdiHU6XfY61OV3ggPcFRMmh9n0S_k" alt="Leviathan" /> </figure> <p>Another quite good slow Russian movie is <strong>The Island</strong>, but let’s not get fixated on those. After Tarkovsky, my first GOTO masterclass director has got to be Ingmar Bergman, the director to top every director’s list of favorite directors (or at least Woody Allen’s… and Tarkovsky’s… and Kubrick’s…). His movies are about… many things, the subconscious depths and murks of our half-understood psyches most of all perhaps. Slow, quiet, amazingly talented at creating minimalist black and white scenes. <strong>The Seventh Seal</strong>, a classic film about a medieval knight who plays chess with Death, is a good starting point. <strong>The Hour of the Wolf</strong>, a sort of slowly creeping surreal film about guilt, derangement, and ultimately horror is at the other end of the spectrum of accessibility but is supremely good. And this more or less is the range of Bergman’s large body of work - dark explorations of humans’ shattered souls and psyches rendered with unbelievably striking black and white imagery.</p> <figure> <img class="postimage" src="https://draftin.com:443/images/33166?token=Sq9TWbrnPcO9MCegAOWl9YM92HCe3FreRsHn26-Kfye7NFswz6vPCESgkjvH8sUZI5E-gytfNHgAKacWDDsseMc" alt="The Seventh Seal" /> </figure> <p>But all this is so recent, pretty much 60s onwards, what kind of self-respecting movie person would I be if I don’t have some directors from the olden days to show some love for? A bad one, clearly. That’s where Fritz Lang, the mad genius from the days when sound had yet to ground cinema to dull reality, enters the picture. Have I seen any of his crazy epics besides Metropolis? No, but I can tell you <strong>Metropolis</strong> is quite the crazy epic and perhaps the first sci-fi film (robots!). 
It’s hard to beat <a href="http://www.ebertfest.com/thirteen/metropolis.html">Ebert</a> on this one: “Lang’s film is the summit of German Expressionism, with its combination of stylized sets, dramatic camera angles, bold shadows and frankly artificial theatrics.” But! Believe it or not, the much less grand <strong>M</strong> - a film about the merciless retribution of a city against one who has sinned against it - may be the better film. Either way, you can’t go wrong with Lang.</p> <figure> <img class="postimage" src="https://draftin.com:443/images/33168?token=dIfqMhRME5us-_B68taUPNXJZyUeNyvRsztjl9AeLOOZx5-2WNIpKO_NMNL3X8jqJ9WuIhOgMHH4YD0g-U7h3lI" alt="Metropolis" /> </figure> <p>And then of course there’s the other director any person who has taken a film history class is overjoyed to bring up in normal conversation (because damn it, the world needs to know your knowledge), Sergei Eisenstein. But <strong>Battleship Potemkin</strong> really is quite Radical, the prototypical example of the alternative form of cinema that bloomed in the strange days of Soviet youth. If you want to get truly crazy, give Vertov’s <strong>Man with a Movie Camera</strong> a watch - it’s not just capital-R Radical, it’s all-caps RADICAL, but in the sort of intellectually and theoretically backed way that is actually quite striking - it at least sort of makes sense, and does not seem like a student project by a guy who has seen too many artsy short films.</p> <figure> <img class="postimage" src="https://draftin.com:443/images/33169?token=6AMiChfeW2UxJZdhflEGkD8uf7FiqEOJM5NT6eBjpiAFNzYkYGNflo6yf3QD-EulMU-3S-s5QXa6rLK9sf4niOs" alt="Man With A Movie Camera" /> </figure> <p>Across the pond Chaplin and Keaton are making far less weird masterpieces around this time (<strong>Modern Times</strong>, <strong>The General</strong>), but let’s stick with European strangeness, specifically that hailing from Spain. Here Luis Bunuel has managed to hang out with Dali a bit too long and accidentally produce <strong>Un Chien Andalou</strong>, a surreal movie to surmount all surreal movies. This does seem like a student project by a guy who has seen too many artsy short films, but hey, it was the first, and it’s got some hella weird imagery. Bunuel then continues to make weird surreal cinema for decades, eventually getting around to laugh-out-loud family favorites such as <strong>The Discreet Charm of the Bourgeoisie</strong>. Basically, if you are in the mood for surrealism, then this guy can hook you up.</p> <figure> <img class="postimage" src="https://draftin.com:443/images/33171?token=XcSoMc2F9qs09dcuN08Q6edZsHiKgcXTml2mJCDsFz5OCvNdgn3LC-kJjzYlOrd3uBeBgmBQN-I5Muy0hyC2UIA" alt="Un Chien Andalou" /> </figure> <p>As fun as Bunuel’s surrealism can be, sometimes you just want to stick to the weight and tragedies of the real world. Well then, fine, just go be morose in the reality of post-WWII Italy. Yes indeed, Italian Neorealism is where it’s at. <strong>Umberto D</strong> and <strong>Bicycle Thieves</strong> are the ones that I know should be seen, and I can’t help but squeeze in a mention of <strong>Ikiru</strong> by Kurosawa here as well (ah, but you must see Kurosawa! Too many of his movies deserve mention, at the very least the grand truth-mocking treatise that is <strong>Rashomon</strong>). Still, there is only so much hard real-world reality you can take - luckily Fellini comes in later and slowly merges Italian Neorealism with very personal surrealism.
He is another director more of whose movies I should watch, but at least I can say <strong>8 1/2</strong> is wholly enjoyable.</p> <figure> <img class="postimage" src="https://draftin.com:443/images/33185?token=PIkh2isQSk6RP6MWHBkPmRKOXn6J1WcBu-ZOq4QSSupEhJKjTN9MND9uHsWSHAJQ4t7t_TrNpLcd6pHGUsDWMME" alt="Bicycle Thieves" /> </figure> <p>But let’s say you want your surrealism with a dose of psychoanalysis and strangely menacing Americana. Two words: David Lynch. Let’s get this out of the way: Lynch’s hair is amazing. Did he start my appreciation for nonconformist hair? It’s possible. <strong>Eraserhead</strong>, his incredibly weird movie about city and family oppression and fear in the mind of a young artist, featuring a weird worm baby and an actual eraserhead, is also notable for crazy hair. But for real, <strong>Mulholland Drive</strong> is another one of my favorite movies - a story that makes perfect sense in dream logic, the true realization of the promise of surrealism in cinema, a modern film so vivid with color and character your eyes feel drunk. Its predecessor, <strong>Lost Highway</strong>, is also in that vein and also has Nick Cage, so it’s pretty much a must-see. But if the vivid color and subversion of quintessentially American naiveté is more exciting to you than Nick Cage, then <strong>Blue Velvet</strong> is definitely where you will get the most thrills.</p> <figure> <img class="postimage" src="https://draftin.com:443/images/33186?token=o0Ce8LgS7xqEflwfFfe_LidlB6nTT8qIUwMiWhDi5374dGpr3mgCEpU5Wisn42j77S4b15rLrTYEmnB4D8BjH8E" alt="Mulholland Drive" /> </figure> <p>Speaking of America, it turns out some fine directors lived there. Too many for me to cover, in fact. So let’s get some famous ones out of the way: Kubrick (<strong>Dr. Strangelove</strong>), Francis Ford Coppola (<strong>Apocalypse Now</strong>), Scorsese (<strong>Taxi Driver</strong>), Allen (<strong>Annie Hall</strong>), the Coens (<strong>Fargo</strong>), Fincher (<strong>Fight Club</strong>). Yeah, I know, you’re not impressed. Let’s move on to the hip ones. Darren Aronofsky, a fairly recent addition to the canon of distinguished American filmmakers. His first movie was the low-budget <strong>Pi</strong>, a film about a gifted mathematician increasingly going insane in search of the deep fundamental mathematical truth. That’s Aronofsky in a nutshell - people struggling and increasingly coming apart, as shown with unsubtle but very potent vision. In the magnificent <strong>Requiem for a Dream</strong>, due to drugs, in the overtly ambitious <strong>The Fountain</strong>, due to death, and so on. Aronofsky’s films pack a strong stylistic punch, and a good counterpoint to that are the naturalistic films of Richard Linklater. If nothing else, consider <strong>Boyhood</strong> - a film that took 12 years to film yet managed to pack all that time into a 2.5-hour movie most considered successful.
Ah, this paragraph is getting long, but I’ll sneak in Paul Thomas Anderson and tell you that <strong>There Will Be Blood</strong> is great and all, and enjoying <strong>The Master</strong> will get you intellectual street cred, but if you have not seen <strong>Punch Drunk Love</strong> you are missing out on a completely original, amazing experience.</p> <figure> <img class="postimage" src="https://draftin.com:443/images/33187?token=vLdVZwnULxs7ZtMMgZiVWvGm4gnk2LeEpdivEYS-7XM9TbrfiONSBBNpuWna5hZIpRCLqYcwc9mLbzRy0KynS0M" alt="Punch Drunk Love" /> </figure> <p>Hmm, about now the next wacky transition is starting to get harder to think up, so I’ll pull the nice trendy post-modernist trick of self-awareness. Speaking of post-modernism (BAM), it’s about time I brought up Charlie Kaufman. He is distinguished on this list for being the only one who is primarily an author of screenplays, yet it is fair to say the films are unmistakably his. <strong>Synecdoche NY</strong>, <strong>Eternal Sunshine of the Spotless Mind</strong>, <strong>Adaptation</strong>, <strong>Being John Malkovich</strong> - fantastic movies about art, love, and life. They speak for themselves.</p> <figure> <img class="postimage" src="https://draftin.com:443/images/33243?token=ra_Hy_oz8XdXfQ2OmIMxeiHYlrraFDoWFPOB75ac8RWwXAamoyV-yU7AKJ7Bm7YF2PV-aKZL4dM6Hdgszfgqp_E" alt="Synecdoche New York" /> </figure> <p>Kaufman basically exudes talent. In addition to writing all the above, the people who did direct his films are themselves rockstars of the art film. Spike Jonze (<strong>Adaptation</strong> and <strong>Being John Malkovich</strong>) also directed the magnificent <strong>Her</strong>, a film that managed to understand that science fiction is not solely a vehicle for action or horror, and a host of fantastic short films: <a href="http://www.youtube.com/watch?v=e-0siK1w3eM"><strong>Kanye West Meets His Demon</strong></a>, <a href="http://www.youtube.com/watch?v=6OY1EXZt4ok"><strong>Robots Can Love Too</strong></a>, and <a href="http://www.youtube.com/watch?v=5Euj9f3gdyM"><strong>Damn Arcade Fire Is So Fucking Good</strong></a>. It just so happens that Jonze was also married to Sofia Coppola, whose <strong>Lost in Translation</strong> is about as good as you can hope for from a movie premised on an aged and lonely Bill Murray befriending an existentially wrought Scarlett Johansson in Japan. Back on topic: Michel Gondry directed the Kaufman-written <strong>Eternal Sunshine of the Spotless Mind</strong>, a movie perhaps unparalleled in its presentation of dreams and how our inner lives are reflected in them, a joy of a movie if only for its inventiveness and emotional resonance. He also directed the enjoyably quirky but perhaps forgettable <strong>The Science of Sleep</strong> and a host of awesome Bjork music videos - this may well be my <a href="http://www.youtube.com/watch?v=4z7NN4n8CTY">favorite music video ever</a>.</p> <figure> <img class="postimage" src="https://draftin.com:443/images/33189?token=PgQT7iBEszRhPobAfIVpi0MhbFFyXJzPZb6_ccAE3OSgl-QLcoD-zng2LgsK_dNAiMyVVDsdiu07iUFUYaPX17g" alt="Her" /> </figure> <p>Finally, George Clooney! George Clooney started his directorial work with <strong>Confessions of a Dangerous Mind</strong>, yet another awesome movie written by Kaufman. Clooney’s later work is not that artsy but he should get way more cred for how good it is. Speaking of which, <strong>Michael Clayton</strong>.
Clayton is weird only in how damn realistic it is, how fully it acknowledges the way humans are, how non-indie and not-artsy it is in its exploration of those themes. One of my favorite films - get through to the last scene and sit speechless at the genius of it.</p> <figure> <img class="postimage" src="https://draftin.com:443/images/33188?token=QVHSTbQt9CQjDuFB9ksPQMtljyhytFPppxoiXTY4BOu-_AjGPE7yS7AKrfL6y0hJpGRIxht5T7OwXcGTIOh0wfY" alt="Michael Clayton" /> </figure> <p>But now the tone is getting too heavy - this can be addressed with some exploding heads. David Cronenberg is a peculiar auteur, the true master of body horror through all the eighties and nineties. Just watch the trailer to <strong>Videodrome</strong> or <strong>Naked Lunch</strong>. Just do it. His later work (<strong>Eastern Promises</strong>, <strong>A History of Violence</strong>) is strangely normal but still fantastically well made. So if you want to see a TV that has something resembling a vagina, a typewriter that has something resembling a vagina, cars that don’t have anything resembling vaginas but are distinctly erotic (I am not making this up), well… It’s fair to say you maybe don’t really want to see those things, but the general weirdness of Cronenberg’s early work and its legit thematic underpinnings should be appreciated, as should his basically excellent if more traditional work.</p> <figure> <img class="postimage" src="https://draftin.com:443/images/33244?token=i40M6GIXvMVU8g9GXre0s2Sc0nsmSC7ZaA-jjYa5vb868IpeWbBXFl33n1szTAYyCSyUpY5RWrwHM_N7RKASZjo" alt="Naked Lunch" /> </figure> <p>Not interested in seeing inanimate matter being quite so animate, or just want to watch some nice classy art movies? Shane Carruth. The director whom I quite literally proselytize to everyone I meet, a software engineer turned auteur filmmaker. <strong>Primer</strong>. A purportedly $7000 movie about the mind fuck of time travel from the perspective of an engineer, and the corrupting power of power. <strong>Upstream Color</strong>. A hypnotizing movie that merges sci-fi and artfulness in a way I have never seen done - I can think of few dialogue-free scenes as striking as those found in this movie. In this case a quote from <a href="http://www.rottentomatoes.com/m/upstream_color/">Rotten Tomatoes</a> feels most appropriate: “As technically brilliant as it is narratively abstract, Upstream Color represents experimental American cinema at its finest – and reaffirms Shane Carruth as a talent to watch.”</p> <figure> <img class="postimage" src="https://draftin.com:443/images/33245?token=kg1Nlhxmbunc_ReJHx6HwqWV30xzh9BOcZNmpspTQ8DqBgWWGkSDvYwmdDIfV1z4CuMTIK4dwTDPZc-pZl0KHPU" alt="Upstream Color" /> </figure> <p>So many words, yet so little ground covered. Let’s do a quick sweep of the globe - fast now, over in Japan we forgot to mention Ozu, whose excellent <strong>Tokyo Story</strong> you will surely love, and the more recent Takeshi Kitano with his existentialist gangster films in the vein of <strong>Sonatine</strong>, and lastly the counterpoint of Hirokazu Koreeda whose <strong>Nobody Knows</strong> is the best minimalist film I can think of. But let’s not linger - over in Hong Kong there is the distinctly weird yet unquestionably interesting Wong Kar-Wai with bizarre dreams like <strong>Chungking Express</strong> and <strong>Fallen Angels</strong>, though if you are not into that there is also Johnnie To with fantastic Shakespearean gangster tragedies like <strong>The Triad Election</strong>.
Then it just so happens that in Korea Park Chan-wook is both violent and bizarre, most notably in the already-classic <strong>Oldboy</strong> (and personal favorite blood-dark romantic comedy <strong>Thirst</strong>), and alongside his work there is a host of recent fairly normal yet fantastic violent thrillers of which Lee Jeong-beom’s <strong>The Man From Nowhere</strong> is surely the best. Not to be all about Asia here, over from Mexico there is Guillermo del Toro with the unforgettable WWII fairy tale <strong>Pan’s Labyrinth</strong>, and Alfonso Cuarón has the poignant life tale <strong>Y Tu Mama Tambien</strong> and the just-please-watch-it-you’ll-be-happy-you-did-masterpiece <strong>Children of Men</strong>. Let’s mention Britain too, where Richard Ayoade has recently made the excellent coming-of-age tale <strong>Submarine</strong> and the somehow yet more delightful adaptation of Dostoevsky, <strong>The Double</strong>. Finally, we wind up in Israel due to the sheer beauty of Ari Folman’s <strong>Waltz with Bashir</strong>, of which you may be convinced by nothing more than the following image.</p> <figure> <img class="postimage" src="https://draftin.com:443/images/33248?token=T20B0DzRAvKHdSt9L2sk_5FRkcwsWKN9dOB5KhoBqLbu3SzSPAsRv4NkJIVN8tME3eVg7N7X-W-TJM83GWbHwMY" alt="Waltz with Bashir" /> </figure> <p>Tired yet? I’m impressed you made it this far. Listen, I wonder how to close this thing - I have journeyed through most of my memory and mind and found the scattered pieces of film impressions that are clearly somewhat important to me, yet can I be expected to find some sequence of sentences to sum up all this? No, that would hardly be a reasonable expectation. Rather, I shall close on one last filmmaker, perhaps my favorite, and most certainly one not sufficiently recognized as of yet. That filmmaker is Satoshi Kon, and I must admit his work is not strictly film - as with the above it is animation, anime (speaking of which, surely I can take for granted you have heard of Miyazaki and <strong>Princess Mononoke</strong> and <strong>Spirited Away</strong>). Yet, his work is about film - <strong>Perfect Blue</strong> with its obvious Hitchcockian inspiration, <strong>Millennium Actress</strong> with an overt and beautifully rendered and just so damn wonderful recollection of the history of Japanese cinema, and finally <strong>Paprika</strong> - a movie about the power of dreaming that may as well equate cinema itself to that act.
Vivid, creative, exuberant, surprising, challenging, comedic, touching - Kon’s work is a testament to the power of film, and the fantastic dreams we may inhabit through it.</p> <figure> <img class="postimage" src="https://draftin.com:443/images/33249?token=IyoXEt59YrNKHsBba9nzrRhsj0BS3Oy4kFsqlkaLT4kbX0YsMleiBE_QB3hNHgTgjFJNrkWF93CcVbtU27r0qF0" alt="Paprika" /> </figure> <p><a href="/writing/movie-recommendations-for-the-aspiring-eclectic-intellectual/">Movie Recommendations For The Aspiring Eclectic Intellectual</a> was originally published by Andrey Kurenkov at <a href="">Andrey Kurenkov's Web World</a> on October 22, 2015.</p> <![CDATA[Exceptional People]]> /writing/exceptional-people 2015-10-11T16:19:34-07:00 2015-10-11T16:19:34-07:00 www.andreykurenkov.com contact@andreykurenkov.com <h3 id="or-people-disappoint-you-an-essay-poem">Or, People Disappoint You, an essay-poem</h3> <div><button class="btn" data-toggle="collapse" data-target="#foreword"> Foreword &raquo; </button></div> <blockquote class="aside"><p id="foreword" class="collapse" style="height: 0px;"> As with most things I write about, this has been rolling about in my head for many years. It is the sort of thing I keep hitting up against and wondering at, until I eventually crystallize some notion of it that calls to be solidified in text. In starting to solidify this into text, I realized that it would be short, and that it comes from a place particularly likely to call upon the impactful literary techniques I typically use, whose names I do not remember. So, I figured, if I am to write a short essay with impassioned sections, why not go all the way and give it the form of a poem? I may not be a poet, but by posting these digital pages of text I do inevitably claim the identity of a writer, of some sort. So, here's an essay-poem. P.S. Shout out: similar thoughts were echoed well in <a href="http://colorfulcortex.co/2015/10/24/the-youre-not-as-cool-in-person-phenomenon/">'The “You’re Not as Cool in Person” Phenomenon'</a>. </p></blockquote> <p>I have seen the shine of exceptional people,<br /> their radiance of confidence, skill, power,<br /> their lit up eyes in the midst of inspiration,<br /> their bright excitement in the flow of creation.</p> <p>When? When I have sat with them in cruel extra credit classes,<br /> and worked with them during brutal nocturnal work slogs,<br /> and argued with them over specifics in hastily scrawled sketches,<br /> and drank with them past sundown unfazed by the impending morning.</p> <p>Then, I have sensed in them burning hearts,<br /> or perhaps perpetually caffeinated brains,<br /> and in those brains always a thought, a plan, an idea,<br /> and in those hearts always the energy to think, imagine, strive.</p> <p>And in me? Often, mired in melancholy, I felt jealousy -<br /> felt just short of that ascendancy beyond normalcy,<br /> different but not outstanding, smart but not exceptional,<br /> the angst of a restless below-average overachiever.</p> <p>But! Don’t worry. I grew up, slew that silly impulse, <br /> long since realized the emptiness of numbers,<br /> pierced the perceived glow of outstanding people,<br /> and through it saw, simply, friends, peers, people.</p> <p>Just people.
Impressive people.<br /> People who have bad days and tough times.<br /> People who have dark sides and hidden fears.<br /> People who make mistakes.<br /> People who make bad jokes.<br /> People who struggle.</p> <p>And at first it was distinctly disappointing to see the death of that ideal, <br /> the impossibility of escaping difficulty dealing with all of life’s damned dimensions. <br /> Such a typically tepid truth - us mere mortals incapable of that heroic Herculean halo,<br /> and left to grow to see humans do not live in fairy tales, as I did with those around me.</p> <p>But you know, funny thing - in time I liked them all the more for it.</p> <p><a href="/writing/exceptional-people/">Exceptional People</a> was originally published by Andrey Kurenkov at <a href="">Andrey Kurenkov's Web World</a> on October 11, 2015.</p>