{
 "cells": [
  {
   "cell_type": "markdown",
   "metadata": {
    "slideshow": {
     "slide_type": "slide"
    }
   },
   "source": [
    "# Cleaning Data\n",
    "\n",
    "**Outcomes**\n",
    "\n",
    "- Be able to use string methods to clean data that comes as a string  \n",
    "- Be able to drop missing data  \n",
    "- Use cleaning methods to prepare and analyze a real dataset  \n",
    "\n",
    "**Data**\n",
    "\n",
    "- Item information from about 3,000 Chipotle meals from about 1,800\n",
    "  Grubhub orders  "
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 1,
   "metadata": {
    "hide-output": false,
    "slideshow": {
     "slide_type": "-"
    }
   },
   "outputs": [],
   "source": [
    "import numpy as np\n",
    "import pandas as pd\n",
    "import qeds"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "slideshow": {
     "slide_type": "slide"
    }
   },
   "source": [
    "## Cleaning Data\n",
    "\n",
    "For many data projects, a [significant proportion of\n",
    "time](https://www.forbes.com/sites/gilpress/2016/03/23/data-preparation-most-time-consuming-least-enjoyable-data-science-task-survey-says/#74d447456f63)\n",
    "is spent collecting and cleaning the data — not performing the analysis.\n",
    "\n",
    "This non-analysis work is often called “data cleaning”.\n",
    "\n",
    "pandas provides very powerful data cleaning tools, which we\n",
    "will demonstrate using the following dataset."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 2,
   "metadata": {
    "hide-output": false,
    "slideshow": {
     "slide_type": "-"
    }
   },
   "outputs": [
    {
     "data": {
      "text/html": [
       "<div>\n",
       "<style scoped>\n",
       "    .dataframe tbody tr th:only-of-type {\n",
       "        vertical-align: middle;\n",
       "    }\n",
       "\n",
       "    .dataframe tbody tr th {\n",
       "        vertical-align: top;\n",
       "    }\n",
       "\n",
       "    .dataframe thead th {\n",
       "        text-align: right;\n",
       "    }\n",
       "</style>\n",
       "<table border=\"1\" class=\"dataframe\">\n",
       "  <thead>\n",
       "    <tr style=\"text-align: right;\">\n",
       "      <th></th>\n",
       "      <th>numbers</th>\n",
       "      <th>nums</th>\n",
       "      <th>colors</th>\n",
       "      <th>other_column</th>\n",
       "    </tr>\n",
       "  </thead>\n",
       "  <tbody>\n",
       "    <tr>\n",
       "      <th>0</th>\n",
       "      <td>#23</td>\n",
       "      <td>23</td>\n",
       "      <td>green</td>\n",
       "      <td>0</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>1</th>\n",
       "      <td>#24</td>\n",
       "      <td>24</td>\n",
       "      <td>red</td>\n",
       "      <td>1</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>2</th>\n",
       "      <td>#18</td>\n",
       "      <td>18</td>\n",
       "      <td>yellow</td>\n",
       "      <td>0</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>3</th>\n",
       "      <td>#14</td>\n",
       "      <td>14</td>\n",
       "      <td>orange</td>\n",
       "      <td>2</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>4</th>\n",
       "      <td>#12</td>\n",
       "      <td>NaN</td>\n",
       "      <td>purple</td>\n",
       "      <td>1</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>5</th>\n",
       "      <td>#10</td>\n",
       "      <td>XYZ</td>\n",
       "      <td>blue</td>\n",
       "      <td>0</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>6</th>\n",
       "      <td>#35</td>\n",
       "      <td>35</td>\n",
       "      <td>pink</td>\n",
       "      <td>2</td>\n",
       "    </tr>\n",
       "  </tbody>\n",
       "</table>\n",
       "</div>"
      ],
      "text/plain": [
       "  numbers nums  colors  other_column\n",
       "0     #23   23   green             0\n",
       "1     #24   24     red             1\n",
       "2     #18   18  yellow             0\n",
       "3     #14   14  orange             2\n",
       "4     #12  NaN  purple             1\n",
       "5     #10  XYZ    blue             0\n",
       "6     #35   35    pink             2"
      ]
     },
     "execution_count": 2,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "df = pd.DataFrame({\"numbers\": [\"#23\", \"#24\", \"#18\", \"#14\", \"#12\", \"#10\", \"#35\"],\n",
    "                   \"nums\": [\"23\", \"24\", \"18\", \"14\", np.nan, \"XYZ\", \"35\"],\n",
    "                   \"colors\": [\"green\", \"red\", \"yellow\", \"orange\", \"purple\", \"blue\", \"pink\"],\n",
    "                   \"other_column\": [0, 1, 0, 2, 1, 0, 2]})\n",
    "df"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "slideshow": {
     "slide_type": "subslide"
    }
   },
   "source": [
    "What would happen if we wanted to try and compute the mean of\n",
    "`numbers`?"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 3,
   "metadata": {
    "hide-output": false,
    "slideshow": {
     "slide_type": "-"
    }
   },
   "outputs": [
    {
     "ename": "TypeError",
     "evalue": "Could not convert #23#24#18#14#12#10#35 to numeric",
     "output_type": "error",
     "traceback": [
      "\u001b[0;31m---------------------------------------------------------------------------\u001b[0m",
      "\u001b[0;31mValueError\u001b[0m                                Traceback (most recent call last)",
      "\u001b[0;32m~/Programming/anaconda3/envs/rsit/lib/python3.8/site-packages/pandas/core/nanops.py\u001b[0m in \u001b[0;36m_ensure_numeric\u001b[0;34m(x)\u001b[0m\n\u001b[1;32m   1534\u001b[0m         \u001b[0;32mtry\u001b[0m\u001b[0;34m:\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[0;32m-> 1535\u001b[0;31m             \u001b[0mx\u001b[0m \u001b[0;34m=\u001b[0m \u001b[0mfloat\u001b[0m\u001b[0;34m(\u001b[0m\u001b[0mx\u001b[0m\u001b[0;34m)\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[0m\u001b[1;32m   1536\u001b[0m         \u001b[0;32mexcept\u001b[0m \u001b[0mValueError\u001b[0m\u001b[0;34m:\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n",
      "\u001b[0;31mValueError\u001b[0m: could not convert string to float: '#23#24#18#14#12#10#35'",
      "\nDuring handling of the above exception, another exception occurred:\n",
      "\u001b[0;31mValueError\u001b[0m                                Traceback (most recent call last)",
      "\u001b[0;32m~/Programming/anaconda3/envs/rsit/lib/python3.8/site-packages/pandas/core/nanops.py\u001b[0m in \u001b[0;36m_ensure_numeric\u001b[0;34m(x)\u001b[0m\n\u001b[1;32m   1538\u001b[0m             \u001b[0;32mtry\u001b[0m\u001b[0;34m:\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[0;32m-> 1539\u001b[0;31m                 \u001b[0mx\u001b[0m \u001b[0;34m=\u001b[0m \u001b[0mcomplex\u001b[0m\u001b[0;34m(\u001b[0m\u001b[0mx\u001b[0m\u001b[0;34m)\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[0m\u001b[1;32m   1540\u001b[0m             \u001b[0;32mexcept\u001b[0m \u001b[0mValueError\u001b[0m \u001b[0;32mas\u001b[0m \u001b[0merr\u001b[0m\u001b[0;34m:\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n",
      "\u001b[0;31mValueError\u001b[0m: complex() arg is a malformed string",
      "\nThe above exception was the direct cause of the following exception:\n",
      "\u001b[0;31mTypeError\u001b[0m                                 Traceback (most recent call last)",
      "\u001b[0;32m<ipython-input-3-c62df3911a80>\u001b[0m in \u001b[0;36m<module>\u001b[0;34m\u001b[0m\n\u001b[0;32m----> 1\u001b[0;31m \u001b[0mdf\u001b[0m\u001b[0;34m[\u001b[0m\u001b[0;34m\"numbers\"\u001b[0m\u001b[0;34m]\u001b[0m\u001b[0;34m.\u001b[0m\u001b[0mmean\u001b[0m\u001b[0;34m(\u001b[0m\u001b[0;34m)\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[0m",
      "\u001b[0;32m~/Programming/anaconda3/envs/rsit/lib/python3.8/site-packages/pandas/core/generic.py\u001b[0m in \u001b[0;36mmean\u001b[0;34m(self, axis, skipna, level, numeric_only, **kwargs)\u001b[0m\n\u001b[1;32m  11116\u001b[0m         )\n\u001b[1;32m  11117\u001b[0m         \u001b[0;32mdef\u001b[0m \u001b[0mmean\u001b[0m\u001b[0;34m(\u001b[0m\u001b[0mself\u001b[0m\u001b[0;34m,\u001b[0m \u001b[0maxis\u001b[0m\u001b[0;34m=\u001b[0m\u001b[0;32mNone\u001b[0m\u001b[0;34m,\u001b[0m \u001b[0mskipna\u001b[0m\u001b[0;34m=\u001b[0m\u001b[0;32mNone\u001b[0m\u001b[0;34m,\u001b[0m \u001b[0mlevel\u001b[0m\u001b[0;34m=\u001b[0m\u001b[0;32mNone\u001b[0m\u001b[0;34m,\u001b[0m \u001b[0mnumeric_only\u001b[0m\u001b[0;34m=\u001b[0m\u001b[0;32mNone\u001b[0m\u001b[0;34m,\u001b[0m \u001b[0;34m**\u001b[0m\u001b[0mkwargs\u001b[0m\u001b[0;34m)\u001b[0m\u001b[0;34m:\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[0;32m> 11118\u001b[0;31m             \u001b[0;32mreturn\u001b[0m \u001b[0mNDFrame\u001b[0m\u001b[0;34m.\u001b[0m\u001b[0mmean\u001b[0m\u001b[0;34m(\u001b[0m\u001b[0mself\u001b[0m\u001b[0;34m,\u001b[0m \u001b[0maxis\u001b[0m\u001b[0;34m,\u001b[0m \u001b[0mskipna\u001b[0m\u001b[0;34m,\u001b[0m \u001b[0mlevel\u001b[0m\u001b[0;34m,\u001b[0m \u001b[0mnumeric_only\u001b[0m\u001b[0;34m,\u001b[0m \u001b[0;34m**\u001b[0m\u001b[0mkwargs\u001b[0m\u001b[0;34m)\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[0m\u001b[1;32m  11119\u001b[0m \u001b[0;34m\u001b[0m\u001b[0m\n\u001b[1;32m  11120\u001b[0m         \u001b[0;31m# pandas\\core\\generic.py:10924: error: Cannot assign to a method\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n",
      "\u001b[0;32m~/Programming/anaconda3/envs/rsit/lib/python3.8/site-packages/pandas/core/generic.py\u001b[0m in \u001b[0;36mmean\u001b[0;34m(self, axis, skipna, level, numeric_only, **kwargs)\u001b[0m\n\u001b[1;32m  10724\u001b[0m \u001b[0;34m\u001b[0m\u001b[0m\n\u001b[1;32m  10725\u001b[0m     \u001b[0;32mdef\u001b[0m \u001b[0mmean\u001b[0m\u001b[0;34m(\u001b[0m\u001b[0mself\u001b[0m\u001b[0;34m,\u001b[0m \u001b[0maxis\u001b[0m\u001b[0;34m=\u001b[0m\u001b[0;32mNone\u001b[0m\u001b[0;34m,\u001b[0m \u001b[0mskipna\u001b[0m\u001b[0;34m=\u001b[0m\u001b[0;32mNone\u001b[0m\u001b[0;34m,\u001b[0m \u001b[0mlevel\u001b[0m\u001b[0;34m=\u001b[0m\u001b[0;32mNone\u001b[0m\u001b[0;34m,\u001b[0m \u001b[0mnumeric_only\u001b[0m\u001b[0;34m=\u001b[0m\u001b[0;32mNone\u001b[0m\u001b[0;34m,\u001b[0m \u001b[0;34m**\u001b[0m\u001b[0mkwargs\u001b[0m\u001b[0;34m)\u001b[0m\u001b[0;34m:\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[0;32m> 10726\u001b[0;31m         return self._stat_function(\n\u001b[0m\u001b[1;32m  10727\u001b[0m             \u001b[0;34m\"mean\"\u001b[0m\u001b[0;34m,\u001b[0m \u001b[0mnanops\u001b[0m\u001b[0;34m.\u001b[0m\u001b[0mnanmean\u001b[0m\u001b[0;34m,\u001b[0m \u001b[0maxis\u001b[0m\u001b[0;34m,\u001b[0m \u001b[0mskipna\u001b[0m\u001b[0;34m,\u001b[0m \u001b[0mlevel\u001b[0m\u001b[0;34m,\u001b[0m \u001b[0mnumeric_only\u001b[0m\u001b[0;34m,\u001b[0m \u001b[0;34m**\u001b[0m\u001b[0mkwargs\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[1;32m  10728\u001b[0m         )\n",
      "\u001b[0;32m~/Programming/anaconda3/envs/rsit/lib/python3.8/site-packages/pandas/core/generic.py\u001b[0m in \u001b[0;36m_stat_function\u001b[0;34m(self, name, func, axis, skipna, level, numeric_only, **kwargs)\u001b[0m\n\u001b[1;32m  10709\u001b[0m         \u001b[0;32mif\u001b[0m \u001b[0mlevel\u001b[0m \u001b[0;32mis\u001b[0m \u001b[0;32mnot\u001b[0m \u001b[0;32mNone\u001b[0m\u001b[0;34m:\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[1;32m  10710\u001b[0m             \u001b[0;32mreturn\u001b[0m \u001b[0mself\u001b[0m\u001b[0;34m.\u001b[0m\u001b[0m_agg_by_level\u001b[0m\u001b[0;34m(\u001b[0m\u001b[0mname\u001b[0m\u001b[0;34m,\u001b[0m \u001b[0maxis\u001b[0m\u001b[0;34m=\u001b[0m\u001b[0maxis\u001b[0m\u001b[0;34m,\u001b[0m \u001b[0mlevel\u001b[0m\u001b[0;34m=\u001b[0m\u001b[0mlevel\u001b[0m\u001b[0;34m,\u001b[0m \u001b[0mskipna\u001b[0m\u001b[0;34m=\u001b[0m\u001b[0mskipna\u001b[0m\u001b[0;34m)\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[0;32m> 10711\u001b[0;31m         return self._reduce(\n\u001b[0m\u001b[1;32m  10712\u001b[0m             \u001b[0mfunc\u001b[0m\u001b[0;34m,\u001b[0m \u001b[0mname\u001b[0m\u001b[0;34m=\u001b[0m\u001b[0mname\u001b[0m\u001b[0;34m,\u001b[0m \u001b[0maxis\u001b[0m\u001b[0;34m=\u001b[0m\u001b[0maxis\u001b[0m\u001b[0;34m,\u001b[0m \u001b[0mskipna\u001b[0m\u001b[0;34m=\u001b[0m\u001b[0mskipna\u001b[0m\u001b[0;34m,\u001b[0m \u001b[0mnumeric_only\u001b[0m\u001b[0;34m=\u001b[0m\u001b[0mnumeric_only\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[1;32m  10713\u001b[0m         )\n",
      "\u001b[0;32m~/Programming/anaconda3/envs/rsit/lib/python3.8/site-packages/pandas/core/series.py\u001b[0m in \u001b[0;36m_reduce\u001b[0;34m(self, op, name, axis, skipna, numeric_only, filter_type, **kwds)\u001b[0m\n\u001b[1;32m   4180\u001b[0m                 )\n\u001b[1;32m   4181\u001b[0m             \u001b[0;32mwith\u001b[0m \u001b[0mnp\u001b[0m\u001b[0;34m.\u001b[0m\u001b[0merrstate\u001b[0m\u001b[0;34m(\u001b[0m\u001b[0mall\u001b[0m\u001b[0;34m=\u001b[0m\u001b[0;34m\"ignore\"\u001b[0m\u001b[0;34m)\u001b[0m\u001b[0;34m:\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[0;32m-> 4182\u001b[0;31m                 \u001b[0;32mreturn\u001b[0m \u001b[0mop\u001b[0m\u001b[0;34m(\u001b[0m\u001b[0mdelegate\u001b[0m\u001b[0;34m,\u001b[0m \u001b[0mskipna\u001b[0m\u001b[0;34m=\u001b[0m\u001b[0mskipna\u001b[0m\u001b[0;34m,\u001b[0m \u001b[0;34m**\u001b[0m\u001b[0mkwds\u001b[0m\u001b[0;34m)\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[0m\u001b[1;32m   4183\u001b[0m \u001b[0;34m\u001b[0m\u001b[0m\n\u001b[1;32m   4184\u001b[0m     \u001b[0;32mdef\u001b[0m \u001b[0m_reindex_indexer\u001b[0m\u001b[0;34m(\u001b[0m\u001b[0mself\u001b[0m\u001b[0;34m,\u001b[0m \u001b[0mnew_index\u001b[0m\u001b[0;34m,\u001b[0m \u001b[0mindexer\u001b[0m\u001b[0;34m,\u001b[0m \u001b[0mcopy\u001b[0m\u001b[0;34m)\u001b[0m\u001b[0;34m:\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n",
      "\u001b[0;32m~/Programming/anaconda3/envs/rsit/lib/python3.8/site-packages/pandas/core/nanops.py\u001b[0m in \u001b[0;36m_f\u001b[0;34m(*args, **kwargs)\u001b[0m\n\u001b[1;32m     69\u001b[0m             \u001b[0;32mtry\u001b[0m\u001b[0;34m:\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[1;32m     70\u001b[0m                 \u001b[0;32mwith\u001b[0m \u001b[0mnp\u001b[0m\u001b[0;34m.\u001b[0m\u001b[0merrstate\u001b[0m\u001b[0;34m(\u001b[0m\u001b[0minvalid\u001b[0m\u001b[0;34m=\u001b[0m\u001b[0;34m\"ignore\"\u001b[0m\u001b[0;34m)\u001b[0m\u001b[0;34m:\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[0;32m---> 71\u001b[0;31m                     \u001b[0;32mreturn\u001b[0m \u001b[0mf\u001b[0m\u001b[0;34m(\u001b[0m\u001b[0;34m*\u001b[0m\u001b[0margs\u001b[0m\u001b[0;34m,\u001b[0m \u001b[0;34m**\u001b[0m\u001b[0mkwargs\u001b[0m\u001b[0;34m)\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[0m\u001b[1;32m     72\u001b[0m             \u001b[0;32mexcept\u001b[0m \u001b[0mValueError\u001b[0m \u001b[0;32mas\u001b[0m \u001b[0me\u001b[0m\u001b[0;34m:\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[1;32m     73\u001b[0m                 \u001b[0;31m# we want to transform an object array\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n",
      "\u001b[0;32m~/Programming/anaconda3/envs/rsit/lib/python3.8/site-packages/pandas/core/nanops.py\u001b[0m in \u001b[0;36mf\u001b[0;34m(values, axis, skipna, **kwds)\u001b[0m\n\u001b[1;32m    131\u001b[0m                     \u001b[0mresult\u001b[0m \u001b[0;34m=\u001b[0m \u001b[0malt\u001b[0m\u001b[0;34m(\u001b[0m\u001b[0mvalues\u001b[0m\u001b[0;34m,\u001b[0m \u001b[0maxis\u001b[0m\u001b[0;34m=\u001b[0m\u001b[0maxis\u001b[0m\u001b[0;34m,\u001b[0m \u001b[0mskipna\u001b[0m\u001b[0;34m=\u001b[0m\u001b[0mskipna\u001b[0m\u001b[0;34m,\u001b[0m \u001b[0;34m**\u001b[0m\u001b[0mkwds\u001b[0m\u001b[0;34m)\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[1;32m    132\u001b[0m             \u001b[0;32melse\u001b[0m\u001b[0;34m:\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[0;32m--> 133\u001b[0;31m                 \u001b[0mresult\u001b[0m \u001b[0;34m=\u001b[0m \u001b[0malt\u001b[0m\u001b[0;34m(\u001b[0m\u001b[0mvalues\u001b[0m\u001b[0;34m,\u001b[0m \u001b[0maxis\u001b[0m\u001b[0;34m=\u001b[0m\u001b[0maxis\u001b[0m\u001b[0;34m,\u001b[0m \u001b[0mskipna\u001b[0m\u001b[0;34m=\u001b[0m\u001b[0mskipna\u001b[0m\u001b[0;34m,\u001b[0m \u001b[0;34m**\u001b[0m\u001b[0mkwds\u001b[0m\u001b[0;34m)\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[0m\u001b[1;32m    134\u001b[0m \u001b[0;34m\u001b[0m\u001b[0m\n\u001b[1;32m    135\u001b[0m             \u001b[0;32mreturn\u001b[0m \u001b[0mresult\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n",
      "\u001b[0;32m~/Programming/anaconda3/envs/rsit/lib/python3.8/site-packages/pandas/core/nanops.py\u001b[0m in \u001b[0;36mnew_func\u001b[0;34m(values, axis, skipna, mask, **kwargs)\u001b[0m\n\u001b[1;32m    390\u001b[0m             \u001b[0mmask\u001b[0m \u001b[0;34m=\u001b[0m \u001b[0misna\u001b[0m\u001b[0;34m(\u001b[0m\u001b[0mvalues\u001b[0m\u001b[0;34m)\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[1;32m    391\u001b[0m \u001b[0;34m\u001b[0m\u001b[0m\n\u001b[0;32m--> 392\u001b[0;31m         \u001b[0mresult\u001b[0m \u001b[0;34m=\u001b[0m \u001b[0mfunc\u001b[0m\u001b[0;34m(\u001b[0m\u001b[0mvalues\u001b[0m\u001b[0;34m,\u001b[0m \u001b[0maxis\u001b[0m\u001b[0;34m=\u001b[0m\u001b[0maxis\u001b[0m\u001b[0;34m,\u001b[0m \u001b[0mskipna\u001b[0m\u001b[0;34m=\u001b[0m\u001b[0mskipna\u001b[0m\u001b[0;34m,\u001b[0m \u001b[0mmask\u001b[0m\u001b[0;34m=\u001b[0m\u001b[0mmask\u001b[0m\u001b[0;34m,\u001b[0m \u001b[0;34m**\u001b[0m\u001b[0mkwargs\u001b[0m\u001b[0;34m)\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[0m\u001b[1;32m    393\u001b[0m \u001b[0;34m\u001b[0m\u001b[0m\n\u001b[1;32m    394\u001b[0m         \u001b[0;32mif\u001b[0m \u001b[0mdatetimelike\u001b[0m\u001b[0;34m:\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n",
      "\u001b[0;32m~/Programming/anaconda3/envs/rsit/lib/python3.8/site-packages/pandas/core/nanops.py\u001b[0m in \u001b[0;36mnanmean\u001b[0;34m(values, axis, skipna, mask)\u001b[0m\n\u001b[1;32m    629\u001b[0m \u001b[0;34m\u001b[0m\u001b[0m\n\u001b[1;32m    630\u001b[0m     \u001b[0mcount\u001b[0m \u001b[0;34m=\u001b[0m \u001b[0m_get_counts\u001b[0m\u001b[0;34m(\u001b[0m\u001b[0mvalues\u001b[0m\u001b[0;34m.\u001b[0m\u001b[0mshape\u001b[0m\u001b[0;34m,\u001b[0m \u001b[0mmask\u001b[0m\u001b[0;34m,\u001b[0m \u001b[0maxis\u001b[0m\u001b[0;34m,\u001b[0m \u001b[0mdtype\u001b[0m\u001b[0;34m=\u001b[0m\u001b[0mdtype_count\u001b[0m\u001b[0;34m)\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[0;32m--> 631\u001b[0;31m     \u001b[0mthe_sum\u001b[0m \u001b[0;34m=\u001b[0m \u001b[0m_ensure_numeric\u001b[0m\u001b[0;34m(\u001b[0m\u001b[0mvalues\u001b[0m\u001b[0;34m.\u001b[0m\u001b[0msum\u001b[0m\u001b[0;34m(\u001b[0m\u001b[0maxis\u001b[0m\u001b[0;34m,\u001b[0m \u001b[0mdtype\u001b[0m\u001b[0;34m=\u001b[0m\u001b[0mdtype_sum\u001b[0m\u001b[0;34m)\u001b[0m\u001b[0;34m)\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[0m\u001b[1;32m    632\u001b[0m \u001b[0;34m\u001b[0m\u001b[0m\n\u001b[1;32m    633\u001b[0m     \u001b[0;32mif\u001b[0m \u001b[0maxis\u001b[0m \u001b[0;32mis\u001b[0m \u001b[0;32mnot\u001b[0m \u001b[0;32mNone\u001b[0m \u001b[0;32mand\u001b[0m \u001b[0mgetattr\u001b[0m\u001b[0;34m(\u001b[0m\u001b[0mthe_sum\u001b[0m\u001b[0;34m,\u001b[0m \u001b[0;34m\"ndim\"\u001b[0m\u001b[0;34m,\u001b[0m \u001b[0;32mFalse\u001b[0m\u001b[0;34m)\u001b[0m\u001b[0;34m:\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n",
      "\u001b[0;32m~/Programming/anaconda3/envs/rsit/lib/python3.8/site-packages/pandas/core/nanops.py\u001b[0m in \u001b[0;36m_ensure_numeric\u001b[0;34m(x)\u001b[0m\n\u001b[1;32m   1540\u001b[0m             \u001b[0;32mexcept\u001b[0m \u001b[0mValueError\u001b[0m \u001b[0;32mas\u001b[0m \u001b[0merr\u001b[0m\u001b[0;34m:\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[1;32m   1541\u001b[0m                 \u001b[0;31m# e.g. \"foo\"\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[0;32m-> 1542\u001b[0;31m                 \u001b[0;32mraise\u001b[0m \u001b[0mTypeError\u001b[0m\u001b[0;34m(\u001b[0m\u001b[0;34mf\"Could not convert {x} to numeric\"\u001b[0m\u001b[0;34m)\u001b[0m \u001b[0;32mfrom\u001b[0m \u001b[0merr\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[0m\u001b[1;32m   1543\u001b[0m     \u001b[0;32mreturn\u001b[0m \u001b[0mx\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[1;32m   1544\u001b[0m \u001b[0;34m\u001b[0m\u001b[0m\n",
      "\u001b[0;31mTypeError\u001b[0m: Could not convert #23#24#18#14#12#10#35 to numeric"
     ]
    }
   ],
   "source": [
    "df[\"numbers\"].mean()"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 4,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "numbers         object\n",
       "nums            object\n",
       "colors          object\n",
       "other_column     int64\n",
       "dtype: object"
      ]
     },
     "execution_count": 4,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "df.dtypes"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "slideshow": {
     "slide_type": "-"
    }
   },
   "source": [
    "It throws an error!\n",
    "\n",
    "Can you figure out why?\n",
    "\n",
    "Hint: When looking at error messages, start at the very\n",
    "bottom.\n",
    "\n",
    "The final error says, `TypeError: Could not convert #23#24... to numeric`.\n"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "slideshow": {
     "slide_type": "-"
    }
   },
   "outputs": [],
   "source": []
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "slideshow": {
     "slide_type": "slide"
    }
   },
   "source": [
    "## String Methods\n",
    "\n",
    "Our solution to the previous exercise was to remove the `#` by using\n",
    "the `replace` string method: `int(c2n.replace(\"#\", \"\"))`.\n",
    "\n",
    "One way to make this change to every element of a column would be to\n",
    "loop through all elements of the column and apply the desired string\n",
    "methods…"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 5,
   "metadata": {
    "hide-output": false,
    "slideshow": {
     "slide_type": "-"
    }
   },
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "CPU times: user 1.77 ms, sys: 0 ns, total: 1.77 ms\n",
      "Wall time: 1.69 ms\n"
     ]
    }
   ],
   "source": [
    "%%time\n",
    "\n",
    "# Iterate over all rows\n",
    "for row in df.iterrows():\n",
    "\n",
    "    # `iterrows` method produces a tuple with two elements...\n",
    "    # The first element is an index and the second is a Series with the data from that row\n",
    "    index_value, column_values = row\n",
    "\n",
    "    # Apply string method\n",
    "    clean_number = int(column_values[\"numbers\"].replace(\"#\", \"\"))\n",
    "\n",
    "    # The `at` method is very similar to the `loc` method, but it is specialized\n",
    "    # for accessing single elements at a time... We wanted to use it here to give\n",
    "    # the loop the best chance to beat a faster method which we show you next.\n",
    "    df.at[index_value, \"numbers_loop\"] = clean_number"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "slideshow": {
     "slide_type": "fragment"
    }
   },
   "source": [
    "While this is fast for a small dataset like this, this method slows for larger datasets.\n",
    "\n",
    "One *significantly* faster (and easier) method is to apply a string\n",
    "method to an entire column of data.\n",
    "\n",
    "Most methods that are available to a Python string (we learned a\n",
    "few of them in the [strings lecture](../python_fundamentals/basics.ipynb)) are\n",
    "also available to a pandas Series that has `dtype` object.\n",
    "\n",
    "We access them by doing `s.str.method_name` where `method_name` is\n",
    "the name of the method.\n",
    "\n",
    "When we apply the method to a Series, it is applied to all rows in the\n",
    "Series in one shot!\n",
    "\n",
    "Let’s redo our previous example using a pandas `.str` method."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 6,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "0    #23\n",
       "1    #24\n",
       "2    #18\n",
       "3    #14\n",
       "4    #12\n",
       "5    #10\n",
       "6    #35\n",
       "Name: numbers, dtype: object"
      ]
     },
     "execution_count": 6,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "df[\"numbers\"]"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 7,
   "metadata": {
    "hide-output": false,
    "slideshow": {
     "slide_type": "-"
    }
   },
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "CPU times: user 898 µs, sys: 77 µs, total: 975 µs\n",
      "Wall time: 942 µs\n"
     ]
    }
   ],
   "source": [
    "%%time\n",
    "\n",
    "# ~2x faster than loop... However, speed gain increases with size of DataFrame. The\n",
    "# speedup can be in the ballpark of ~100-500x faster for big DataFrames.\n",
    "# See appendix at the end of the lecture for an application on a larger DataFrame\n",
    "df[\"numbers_str\"] = df[\"numbers\"].str.replace(\"#\", \"\")"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 12,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "dtype('O')"
      ]
     },
     "execution_count": 12,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "df[\"numbers_str\"].dtype"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "slideshow": {
     "slide_type": "fragment"
    }
   },
   "source": [
    "We can use `.str` to access almost any string method that works on\n",
    "normal strings. (See the [official\n",
    "documentation](https://pandas.pydata.org/pandas-docs/stable/text.html)\n",
    "for more information.)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 8,
   "metadata": {
    "hide-output": false,
    "slideshow": {
     "slide_type": "-"
    }
   },
   "outputs": [
    {
     "data": {
      "text/plain": [
       "0    False\n",
       "1    False\n",
       "2    False\n",
       "3    False\n",
       "4     True\n",
       "5    False\n",
       "6     True\n",
       "Name: colors, dtype: bool"
      ]
     },
     "execution_count": 8,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "df[\"colors\"].str.contains(\"p\")"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 9,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/html": [
       "<div>\n",
       "<style scoped>\n",
       "    .dataframe tbody tr th:only-of-type {\n",
       "        vertical-align: middle;\n",
       "    }\n",
       "\n",
       "    .dataframe tbody tr th {\n",
       "        vertical-align: top;\n",
       "    }\n",
       "\n",
       "    .dataframe thead th {\n",
       "        text-align: right;\n",
       "    }\n",
       "</style>\n",
       "<table border=\"1\" class=\"dataframe\">\n",
       "  <thead>\n",
       "    <tr style=\"text-align: right;\">\n",
       "      <th></th>\n",
       "      <th>numbers</th>\n",
       "      <th>nums</th>\n",
       "      <th>colors</th>\n",
       "      <th>other_column</th>\n",
       "      <th>numbers_loop</th>\n",
       "      <th>numbers_str</th>\n",
       "    </tr>\n",
       "  </thead>\n",
       "  <tbody>\n",
       "    <tr>\n",
       "      <th>4</th>\n",
       "      <td>#12</td>\n",
       "      <td>NaN</td>\n",
       "      <td>purple</td>\n",
       "      <td>1</td>\n",
       "      <td>12.0</td>\n",
       "      <td>12</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>6</th>\n",
       "      <td>#35</td>\n",
       "      <td>35</td>\n",
       "      <td>pink</td>\n",
       "      <td>2</td>\n",
       "      <td>35.0</td>\n",
       "      <td>35</td>\n",
       "    </tr>\n",
       "  </tbody>\n",
       "</table>\n",
       "</div>"
      ],
      "text/plain": [
       "  numbers nums  colors  other_column  numbers_loop numbers_str\n",
       "4     #12  NaN  purple             1          12.0          12\n",
       "6     #35   35    pink             2          35.0          35"
      ]
     },
     "execution_count": 9,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "df.loc[df[\"colors\"].str.contains(\"p\")]"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 11,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "0     GREEN\n",
       "1       RED\n",
       "2    YELLOW\n",
       "3    ORANGE\n",
       "4    PURPLE\n",
       "5      BLUE\n",
       "6      PINK\n",
       "Name: colors, dtype: object"
      ]
     },
     "execution_count": 11,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "df[\"colors\"].str.upper()"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 10,
   "metadata": {
    "hide-output": false,
    "slideshow": {
     "slide_type": "-"
    }
   },
   "outputs": [
    {
     "data": {
      "text/plain": [
       "0     Green\n",
       "1       Red\n",
       "2    Yellow\n",
       "3    Orange\n",
       "4    Purple\n",
       "5      Blue\n",
       "6      Pink\n",
       "Name: colors, dtype: object"
      ]
     },
     "execution_count": 10,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "df[\"colors\"].str.capitalize()"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "slideshow": {
     "slide_type": "slide"
    }
   },
   "source": [
    "## Type Conversions\n",
    "\n",
    "In our example above, the `dtype` of the `numbers_str` column shows that pandas still treats\n",
    "it as a string even after we have removed the `\"#\"`.\n",
    "\n",
    "We need to convert this column to numbers.\n",
    "\n",
    "The best way to do this is using the `pd.to_numeric` function.\n",
    "\n",
    "This method attempts to convert whatever is stored in a Series into\n",
    "numeric values\n",
    "\n",
    "For example, after the `\"#\"` removed, the numbers of column\n",
    "`\"numbers\"` are ready to be converted to actual numbers."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 13,
   "metadata": {
    "hide-output": false,
    "slideshow": {
     "slide_type": "-"
    }
   },
   "outputs": [],
   "source": [
    "df[\"numbers_numeric\"] = pd.to_numeric(df[\"numbers_str\"])"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 14,
   "metadata": {
    "hide-output": false,
    "slideshow": {
     "slide_type": "-"
    }
   },
   "outputs": [
    {
     "data": {
      "text/plain": [
       "numbers             object\n",
       "nums                object\n",
       "colors              object\n",
       "other_column         int64\n",
       "numbers_loop       float64\n",
       "numbers_str         object\n",
       "numbers_numeric      int64\n",
       "dtype: object"
      ]
     },
     "execution_count": 14,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "df.dtypes"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 15,
   "metadata": {
    "hide-output": false,
    "slideshow": {
     "slide_type": "-"
    }
   },
   "outputs": [
    {
     "data": {
      "text/html": [
       "<div>\n",
       "<style scoped>\n",
       "    .dataframe tbody tr th:only-of-type {\n",
       "        vertical-align: middle;\n",
       "    }\n",
       "\n",
       "    .dataframe tbody tr th {\n",
       "        vertical-align: top;\n",
       "    }\n",
       "\n",
       "    .dataframe thead th {\n",
       "        text-align: right;\n",
       "    }\n",
       "</style>\n",
       "<table border=\"1\" class=\"dataframe\">\n",
       "  <thead>\n",
       "    <tr style=\"text-align: right;\">\n",
       "      <th></th>\n",
       "      <th>numbers</th>\n",
       "      <th>nums</th>\n",
       "      <th>colors</th>\n",
       "      <th>other_column</th>\n",
       "      <th>numbers_loop</th>\n",
       "      <th>numbers_str</th>\n",
       "      <th>numbers_numeric</th>\n",
       "    </tr>\n",
       "  </thead>\n",
       "  <tbody>\n",
       "    <tr>\n",
       "      <th>0</th>\n",
       "      <td>#23</td>\n",
       "      <td>23</td>\n",
       "      <td>green</td>\n",
       "      <td>0</td>\n",
       "      <td>23.0</td>\n",
       "      <td>23</td>\n",
       "      <td>23</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>1</th>\n",
       "      <td>#24</td>\n",
       "      <td>24</td>\n",
       "      <td>red</td>\n",
       "      <td>1</td>\n",
       "      <td>24.0</td>\n",
       "      <td>24</td>\n",
       "      <td>24</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>2</th>\n",
       "      <td>#18</td>\n",
       "      <td>18</td>\n",
       "      <td>yellow</td>\n",
       "      <td>0</td>\n",
       "      <td>18.0</td>\n",
       "      <td>18</td>\n",
       "      <td>18</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>3</th>\n",
       "      <td>#14</td>\n",
       "      <td>14</td>\n",
       "      <td>orange</td>\n",
       "      <td>2</td>\n",
       "      <td>14.0</td>\n",
       "      <td>14</td>\n",
       "      <td>14</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>4</th>\n",
       "      <td>#12</td>\n",
       "      <td>NaN</td>\n",
       "      <td>purple</td>\n",
       "      <td>1</td>\n",
       "      <td>12.0</td>\n",
       "      <td>12</td>\n",
       "      <td>12</td>\n",
       "    </tr>\n",
       "  </tbody>\n",
       "</table>\n",
       "</div>"
      ],
      "text/plain": [
       "  numbers nums  colors  other_column  numbers_loop numbers_str  \\\n",
       "0     #23   23   green             0          23.0          23   \n",
       "1     #24   24     red             1          24.0          24   \n",
       "2     #18   18  yellow             0          18.0          18   \n",
       "3     #14   14  orange             2          14.0          14   \n",
       "4     #12  NaN  purple             1          12.0          12   \n",
       "\n",
       "   numbers_numeric  \n",
       "0               23  \n",
       "1               24  \n",
       "2               18  \n",
       "3               14  \n",
       "4               12  "
      ]
     },
     "execution_count": 15,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "df.head()"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "slideshow": {
     "slide_type": "fragment"
    }
   },
   "source": [
    "We can convert to other types well.\n",
    "\n",
    "Using the `astype` method, we can convert to any of the supported\n",
    "pandas `dtypes`.\n",
    "\n",
    "Below are some examples. (Pay attention to the reported `dtype`)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 16,
   "metadata": {
    "hide-output": false,
    "slideshow": {
     "slide_type": "-"
    }
   },
   "outputs": [
    {
     "data": {
      "text/plain": [
       "0    23\n",
       "1    24\n",
       "2    18\n",
       "3    14\n",
       "4    12\n",
       "5    10\n",
       "6    35\n",
       "Name: numbers_numeric, dtype: object"
      ]
     },
     "execution_count": 16,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "df[\"numbers_numeric\"].astype(str)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 17,
   "metadata": {
    "hide-output": false,
    "slideshow": {
     "slide_type": "-"
    }
   },
   "outputs": [
    {
     "data": {
      "text/plain": [
       "0    23.0\n",
       "1    24.0\n",
       "2    18.0\n",
       "3    14.0\n",
       "4    12.0\n",
       "5    10.0\n",
       "6    35.0\n",
       "Name: numbers_numeric, dtype: float64"
      ]
     },
     "execution_count": 17,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "df[\"numbers_numeric\"].astype(float)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "slideshow": {
     "slide_type": "subslide"
    }
   },
   "source": [
    "**Exercise**\n",
    "\n",
    "Convert the column `\"nums\"` to a numeric type using `pd.to_numeric` and\n",
    "save it to the DataFrame as `\"nums_tonumeric\"`.\n",
    "\n",
    "Notice that there is a missing value, and a value that is not a number.\n",
    "\n",
    "Look at the documentation for `pd.to_numeric` and think about how to\n",
    "overcome this.\n",
    "\n",
    "Think about why this could be a bad idea of used without\n",
    "knowing what your data looks like. (Think about what happens when you\n",
    "apply it to the `\"numbers\"` column before replacing the `\"#\"`.)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 30,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "0   NaN\n",
       "1   NaN\n",
       "2   NaN\n",
       "3   NaN\n",
       "4   NaN\n",
       "5   NaN\n",
       "6   NaN\n",
       "Name: numbers, dtype: float64"
      ]
     },
     "execution_count": 30,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "pd.to_numeric(df[\"numbers\"], errors=\"coerce\")"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 27,
   "metadata": {
    "slideshow": {
     "slide_type": "-"
    }
   },
   "outputs": [
    {
     "data": {
      "text/plain": [
       "0    23.0\n",
       "1    24.0\n",
       "2    18.0\n",
       "3    14.0\n",
       "4     NaN\n",
       "5     NaN\n",
       "6    35.0\n",
       "Name: nums, dtype: float64"
      ]
     },
     "execution_count": 27,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "df[\"nums\"]\n",
    "\n",
    "pd.to_numeric(df[\"nums\"], errors=\"coerce\")\n",
    "# pd.to_numeric(df[\"nums\"], errors=\"raise\")\n",
    "# pd.to_numeric(df[\"nums\"], errors=\"ignore\")"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": []
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "slideshow": {
     "slide_type": "slide"
    }
   },
   "source": [
    "## Missing Data\n",
    "\n",
    "Many datasets have missing data.\n",
    "\n",
    "In our example, we are missing an element from the `\"nums\"` column."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 31,
   "metadata": {
    "hide-output": false,
    "slideshow": {
     "slide_type": "-"
    }
   },
   "outputs": [
    {
     "data": {
      "text/html": [
       "<div>\n",
       "<style scoped>\n",
       "    .dataframe tbody tr th:only-of-type {\n",
       "        vertical-align: middle;\n",
       "    }\n",
       "\n",
       "    .dataframe tbody tr th {\n",
       "        vertical-align: top;\n",
       "    }\n",
       "\n",
       "    .dataframe thead th {\n",
       "        text-align: right;\n",
       "    }\n",
       "</style>\n",
       "<table border=\"1\" class=\"dataframe\">\n",
       "  <thead>\n",
       "    <tr style=\"text-align: right;\">\n",
       "      <th></th>\n",
       "      <th>numbers</th>\n",
       "      <th>nums</th>\n",
       "      <th>colors</th>\n",
       "      <th>other_column</th>\n",
       "      <th>numbers_loop</th>\n",
       "      <th>numbers_str</th>\n",
       "      <th>numbers_numeric</th>\n",
       "    </tr>\n",
       "  </thead>\n",
       "  <tbody>\n",
       "    <tr>\n",
       "      <th>0</th>\n",
       "      <td>#23</td>\n",
       "      <td>23</td>\n",
       "      <td>green</td>\n",
       "      <td>0</td>\n",
       "      <td>23.0</td>\n",
       "      <td>23</td>\n",
       "      <td>23</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>1</th>\n",
       "      <td>#24</td>\n",
       "      <td>24</td>\n",
       "      <td>red</td>\n",
       "      <td>1</td>\n",
       "      <td>24.0</td>\n",
       "      <td>24</td>\n",
       "      <td>24</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>2</th>\n",
       "      <td>#18</td>\n",
       "      <td>18</td>\n",
       "      <td>yellow</td>\n",
       "      <td>0</td>\n",
       "      <td>18.0</td>\n",
       "      <td>18</td>\n",
       "      <td>18</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>3</th>\n",
       "      <td>#14</td>\n",
       "      <td>14</td>\n",
       "      <td>orange</td>\n",
       "      <td>2</td>\n",
       "      <td>14.0</td>\n",
       "      <td>14</td>\n",
       "      <td>14</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>4</th>\n",
       "      <td>#12</td>\n",
       "      <td>NaN</td>\n",
       "      <td>purple</td>\n",
       "      <td>1</td>\n",
       "      <td>12.0</td>\n",
       "      <td>12</td>\n",
       "      <td>12</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>5</th>\n",
       "      <td>#10</td>\n",
       "      <td>XYZ</td>\n",
       "      <td>blue</td>\n",
       "      <td>0</td>\n",
       "      <td>10.0</td>\n",
       "      <td>10</td>\n",
       "      <td>10</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>6</th>\n",
       "      <td>#35</td>\n",
       "      <td>35</td>\n",
       "      <td>pink</td>\n",
       "      <td>2</td>\n",
       "      <td>35.0</td>\n",
       "      <td>35</td>\n",
       "      <td>35</td>\n",
       "    </tr>\n",
       "  </tbody>\n",
       "</table>\n",
       "</div>"
      ],
      "text/plain": [
       "  numbers nums  colors  other_column  numbers_loop numbers_str  \\\n",
       "0     #23   23   green             0          23.0          23   \n",
       "1     #24   24     red             1          24.0          24   \n",
       "2     #18   18  yellow             0          18.0          18   \n",
       "3     #14   14  orange             2          14.0          14   \n",
       "4     #12  NaN  purple             1          12.0          12   \n",
       "5     #10  XYZ    blue             0          10.0          10   \n",
       "6     #35   35    pink             2          35.0          35   \n",
       "\n",
       "   numbers_numeric  \n",
       "0               23  \n",
       "1               24  \n",
       "2               18  \n",
       "3               14  \n",
       "4               12  \n",
       "5               10  \n",
       "6               35  "
      ]
     },
     "execution_count": 31,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "df"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "slideshow": {
     "slide_type": "fragment"
    }
   },
   "source": [
    "We can find missing data by using the `isnull` method."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 32,
   "metadata": {
    "hide-output": false,
    "slideshow": {
     "slide_type": "-"
    }
   },
   "outputs": [
    {
     "data": {
      "text/html": [
       "<div>\n",
       "<style scoped>\n",
       "    .dataframe tbody tr th:only-of-type {\n",
       "        vertical-align: middle;\n",
       "    }\n",
       "\n",
       "    .dataframe tbody tr th {\n",
       "        vertical-align: top;\n",
       "    }\n",
       "\n",
       "    .dataframe thead th {\n",
       "        text-align: right;\n",
       "    }\n",
       "</style>\n",
       "<table border=\"1\" class=\"dataframe\">\n",
       "  <thead>\n",
       "    <tr style=\"text-align: right;\">\n",
       "      <th></th>\n",
       "      <th>numbers</th>\n",
       "      <th>nums</th>\n",
       "      <th>colors</th>\n",
       "      <th>other_column</th>\n",
       "      <th>numbers_loop</th>\n",
       "      <th>numbers_str</th>\n",
       "      <th>numbers_numeric</th>\n",
       "    </tr>\n",
       "  </thead>\n",
       "  <tbody>\n",
       "    <tr>\n",
       "      <th>0</th>\n",
       "      <td>False</td>\n",
       "      <td>False</td>\n",
       "      <td>False</td>\n",
       "      <td>False</td>\n",
       "      <td>False</td>\n",
       "      <td>False</td>\n",
       "      <td>False</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>1</th>\n",
       "      <td>False</td>\n",
       "      <td>False</td>\n",
       "      <td>False</td>\n",
       "      <td>False</td>\n",
       "      <td>False</td>\n",
       "      <td>False</td>\n",
       "      <td>False</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>2</th>\n",
       "      <td>False</td>\n",
       "      <td>False</td>\n",
       "      <td>False</td>\n",
       "      <td>False</td>\n",
       "      <td>False</td>\n",
       "      <td>False</td>\n",
       "      <td>False</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>3</th>\n",
       "      <td>False</td>\n",
       "      <td>False</td>\n",
       "      <td>False</td>\n",
       "      <td>False</td>\n",
       "      <td>False</td>\n",
       "      <td>False</td>\n",
       "      <td>False</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>4</th>\n",
       "      <td>False</td>\n",
       "      <td>True</td>\n",
       "      <td>False</td>\n",
       "      <td>False</td>\n",
       "      <td>False</td>\n",
       "      <td>False</td>\n",
       "      <td>False</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>5</th>\n",
       "      <td>False</td>\n",
       "      <td>False</td>\n",
       "      <td>False</td>\n",
       "      <td>False</td>\n",
       "      <td>False</td>\n",
       "      <td>False</td>\n",
       "      <td>False</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>6</th>\n",
       "      <td>False</td>\n",
       "      <td>False</td>\n",
       "      <td>False</td>\n",
       "      <td>False</td>\n",
       "      <td>False</td>\n",
       "      <td>False</td>\n",
       "      <td>False</td>\n",
       "    </tr>\n",
       "  </tbody>\n",
       "</table>\n",
       "</div>"
      ],
      "text/plain": [
       "   numbers   nums  colors  other_column  numbers_loop  numbers_str  \\\n",
       "0    False  False   False         False         False        False   \n",
       "1    False  False   False         False         False        False   \n",
       "2    False  False   False         False         False        False   \n",
       "3    False  False   False         False         False        False   \n",
       "4    False   True   False         False         False        False   \n",
       "5    False  False   False         False         False        False   \n",
       "6    False  False   False         False         False        False   \n",
       "\n",
       "   numbers_numeric  \n",
       "0            False  \n",
       "1            False  \n",
       "2            False  \n",
       "3            False  \n",
       "4            False  \n",
       "5            False  \n",
       "6            False  "
      ]
     },
     "execution_count": 32,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "df.isnull()"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "slideshow": {
     "slide_type": "fragment"
    }
   },
   "source": [
    "We might want to know whether particular rows or columns have any\n",
    "missing data.\n",
    "\n",
    "To do this we can use the `.any` method on the boolean DataFrame\n",
    "`df.isnull()`."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 33,
   "metadata": {
    "hide-output": false,
    "slideshow": {
     "slide_type": "-"
    }
   },
   "outputs": [
    {
     "data": {
      "text/plain": [
       "numbers            False\n",
       "nums                True\n",
       "colors             False\n",
       "other_column       False\n",
       "numbers_loop       False\n",
       "numbers_str        False\n",
       "numbers_numeric    False\n",
       "dtype: bool"
      ]
     },
     "execution_count": 33,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "df.isnull().any(axis=0)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 35,
   "metadata": {
    "hide-output": false,
    "slideshow": {
     "slide_type": "-"
    }
   },
   "outputs": [
    {
     "data": {
      "text/plain": [
       "0    False\n",
       "1    False\n",
       "2    False\n",
       "3    False\n",
       "4     True\n",
       "5    False\n",
       "6    False\n",
       "dtype: bool"
      ]
     },
     "execution_count": 35,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "df.isnull().any(axis=1)\n",
    "# df.isnull().any(axis=\"columns\")"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "slideshow": {
     "slide_type": "subslide"
    }
   },
   "source": [
    "Many approaches have been developed to deal with missing data, but the two most commonly used (and the corresponding DataFrame method) are:\n",
    "\n",
    "- Exclusion: Ignore any data that is missing (`.dropna`).  \n",
    "- Imputation: Compute “predicted” values for the data that is missing\n",
    "  (`.fillna`).  \n",
    "\n",
    "\n",
    "For the advantages and disadvantages of these (and other) approaches,\n",
    "consider reading the [Wikipedia\n",
    "article](https://en.wikipedia.org/wiki/Missing_data).\n",
    "\n",
    "For now, let’s see some examples."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 36,
   "metadata": {
    "hide-output": false,
    "slideshow": {
     "slide_type": "-"
    }
   },
   "outputs": [
    {
     "data": {
      "text/html": [
       "<div>\n",
       "<style scoped>\n",
       "    .dataframe tbody tr th:only-of-type {\n",
       "        vertical-align: middle;\n",
       "    }\n",
       "\n",
       "    .dataframe tbody tr th {\n",
       "        vertical-align: top;\n",
       "    }\n",
       "\n",
       "    .dataframe thead th {\n",
       "        text-align: right;\n",
       "    }\n",
       "</style>\n",
       "<table border=\"1\" class=\"dataframe\">\n",
       "  <thead>\n",
       "    <tr style=\"text-align: right;\">\n",
       "      <th></th>\n",
       "      <th>numbers</th>\n",
       "      <th>nums</th>\n",
       "      <th>colors</th>\n",
       "      <th>other_column</th>\n",
       "      <th>numbers_loop</th>\n",
       "      <th>numbers_str</th>\n",
       "      <th>numbers_numeric</th>\n",
       "    </tr>\n",
       "  </thead>\n",
       "  <tbody>\n",
       "    <tr>\n",
       "      <th>0</th>\n",
       "      <td>#23</td>\n",
       "      <td>23</td>\n",
       "      <td>green</td>\n",
       "      <td>0</td>\n",
       "      <td>23.0</td>\n",
       "      <td>23</td>\n",
       "      <td>23</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>1</th>\n",
       "      <td>#24</td>\n",
       "      <td>24</td>\n",
       "      <td>red</td>\n",
       "      <td>1</td>\n",
       "      <td>24.0</td>\n",
       "      <td>24</td>\n",
       "      <td>24</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>2</th>\n",
       "      <td>#18</td>\n",
       "      <td>18</td>\n",
       "      <td>yellow</td>\n",
       "      <td>0</td>\n",
       "      <td>18.0</td>\n",
       "      <td>18</td>\n",
       "      <td>18</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>3</th>\n",
       "      <td>#14</td>\n",
       "      <td>14</td>\n",
       "      <td>orange</td>\n",
       "      <td>2</td>\n",
       "      <td>14.0</td>\n",
       "      <td>14</td>\n",
       "      <td>14</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>5</th>\n",
       "      <td>#10</td>\n",
       "      <td>XYZ</td>\n",
       "      <td>blue</td>\n",
       "      <td>0</td>\n",
       "      <td>10.0</td>\n",
       "      <td>10</td>\n",
       "      <td>10</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>6</th>\n",
       "      <td>#35</td>\n",
       "      <td>35</td>\n",
       "      <td>pink</td>\n",
       "      <td>2</td>\n",
       "      <td>35.0</td>\n",
       "      <td>35</td>\n",
       "      <td>35</td>\n",
       "    </tr>\n",
       "  </tbody>\n",
       "</table>\n",
       "</div>"
      ],
      "text/plain": [
       "  numbers nums  colors  other_column  numbers_loop numbers_str  \\\n",
       "0     #23   23   green             0          23.0          23   \n",
       "1     #24   24     red             1          24.0          24   \n",
       "2     #18   18  yellow             0          18.0          18   \n",
       "3     #14   14  orange             2          14.0          14   \n",
       "5     #10  XYZ    blue             0          10.0          10   \n",
       "6     #35   35    pink             2          35.0          35   \n",
       "\n",
       "   numbers_numeric  \n",
       "0               23  \n",
       "1               24  \n",
       "2               18  \n",
       "3               14  \n",
       "5               10  \n",
       "6               35  "
      ]
     },
     "execution_count": 36,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "# drop all rows containing a missing observation\n",
    "df.dropna()"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 37,
   "metadata": {},
   "outputs": [],
   "source": [
    "df.dropna?"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 38,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/html": [
       "<div>\n",
       "<style scoped>\n",
       "    .dataframe tbody tr th:only-of-type {\n",
       "        vertical-align: middle;\n",
       "    }\n",
       "\n",
       "    .dataframe tbody tr th {\n",
       "        vertical-align: top;\n",
       "    }\n",
       "\n",
       "    .dataframe thead th {\n",
       "        text-align: right;\n",
       "    }\n",
       "</style>\n",
       "<table border=\"1\" class=\"dataframe\">\n",
       "  <thead>\n",
       "    <tr style=\"text-align: right;\">\n",
       "      <th></th>\n",
       "      <th>numbers</th>\n",
       "      <th>colors</th>\n",
       "      <th>other_column</th>\n",
       "      <th>numbers_loop</th>\n",
       "      <th>numbers_str</th>\n",
       "      <th>numbers_numeric</th>\n",
       "    </tr>\n",
       "  </thead>\n",
       "  <tbody>\n",
       "    <tr>\n",
       "      <th>0</th>\n",
       "      <td>#23</td>\n",
       "      <td>green</td>\n",
       "      <td>0</td>\n",
       "      <td>23.0</td>\n",
       "      <td>23</td>\n",
       "      <td>23</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>1</th>\n",
       "      <td>#24</td>\n",
       "      <td>red</td>\n",
       "      <td>1</td>\n",
       "      <td>24.0</td>\n",
       "      <td>24</td>\n",
       "      <td>24</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>2</th>\n",
       "      <td>#18</td>\n",
       "      <td>yellow</td>\n",
       "      <td>0</td>\n",
       "      <td>18.0</td>\n",
       "      <td>18</td>\n",
       "      <td>18</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>3</th>\n",
       "      <td>#14</td>\n",
       "      <td>orange</td>\n",
       "      <td>2</td>\n",
       "      <td>14.0</td>\n",
       "      <td>14</td>\n",
       "      <td>14</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>4</th>\n",
       "      <td>#12</td>\n",
       "      <td>purple</td>\n",
       "      <td>1</td>\n",
       "      <td>12.0</td>\n",
       "      <td>12</td>\n",
       "      <td>12</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>5</th>\n",
       "      <td>#10</td>\n",
       "      <td>blue</td>\n",
       "      <td>0</td>\n",
       "      <td>10.0</td>\n",
       "      <td>10</td>\n",
       "      <td>10</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>6</th>\n",
       "      <td>#35</td>\n",
       "      <td>pink</td>\n",
       "      <td>2</td>\n",
       "      <td>35.0</td>\n",
       "      <td>35</td>\n",
       "      <td>35</td>\n",
       "    </tr>\n",
       "  </tbody>\n",
       "</table>\n",
       "</div>"
      ],
      "text/plain": [
       "  numbers  colors  other_column  numbers_loop numbers_str  numbers_numeric\n",
       "0     #23   green             0          23.0          23               23\n",
       "1     #24     red             1          24.0          24               24\n",
       "2     #18  yellow             0          18.0          18               18\n",
       "3     #14  orange             2          14.0          14               14\n",
       "4     #12  purple             1          12.0          12               12\n",
       "5     #10    blue             0          10.0          10               10\n",
       "6     #35    pink             2          35.0          35               35"
      ]
     },
     "execution_count": 38,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "df.dropna(axis=\"columns\")"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 39,
   "metadata": {
    "hide-output": false,
    "slideshow": {
     "slide_type": "-"
    }
   },
   "outputs": [
    {
     "data": {
      "text/html": [
       "<div>\n",
       "<style scoped>\n",
       "    .dataframe tbody tr th:only-of-type {\n",
       "        vertical-align: middle;\n",
       "    }\n",
       "\n",
       "    .dataframe tbody tr th {\n",
       "        vertical-align: top;\n",
       "    }\n",
       "\n",
       "    .dataframe thead th {\n",
       "        text-align: right;\n",
       "    }\n",
       "</style>\n",
       "<table border=\"1\" class=\"dataframe\">\n",
       "  <thead>\n",
       "    <tr style=\"text-align: right;\">\n",
       "      <th></th>\n",
       "      <th>numbers</th>\n",
       "      <th>nums</th>\n",
       "      <th>colors</th>\n",
       "      <th>other_column</th>\n",
       "      <th>numbers_loop</th>\n",
       "      <th>numbers_str</th>\n",
       "      <th>numbers_numeric</th>\n",
       "    </tr>\n",
       "  </thead>\n",
       "  <tbody>\n",
       "    <tr>\n",
       "      <th>0</th>\n",
       "      <td>#23</td>\n",
       "      <td>23</td>\n",
       "      <td>green</td>\n",
       "      <td>0</td>\n",
       "      <td>23.0</td>\n",
       "      <td>23</td>\n",
       "      <td>23</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>1</th>\n",
       "      <td>#24</td>\n",
       "      <td>24</td>\n",
       "      <td>red</td>\n",
       "      <td>1</td>\n",
       "      <td>24.0</td>\n",
       "      <td>24</td>\n",
       "      <td>24</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>2</th>\n",
       "      <td>#18</td>\n",
       "      <td>18</td>\n",
       "      <td>yellow</td>\n",
       "      <td>0</td>\n",
       "      <td>18.0</td>\n",
       "      <td>18</td>\n",
       "      <td>18</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>3</th>\n",
       "      <td>#14</td>\n",
       "      <td>14</td>\n",
       "      <td>orange</td>\n",
       "      <td>2</td>\n",
       "      <td>14.0</td>\n",
       "      <td>14</td>\n",
       "      <td>14</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>4</th>\n",
       "      <td>#12</td>\n",
       "      <td>100</td>\n",
       "      <td>purple</td>\n",
       "      <td>1</td>\n",
       "      <td>12.0</td>\n",
       "      <td>12</td>\n",
       "      <td>12</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>5</th>\n",
       "      <td>#10</td>\n",
       "      <td>XYZ</td>\n",
       "      <td>blue</td>\n",
       "      <td>0</td>\n",
       "      <td>10.0</td>\n",
       "      <td>10</td>\n",
       "      <td>10</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>6</th>\n",
       "      <td>#35</td>\n",
       "      <td>35</td>\n",
       "      <td>pink</td>\n",
       "      <td>2</td>\n",
       "      <td>35.0</td>\n",
       "      <td>35</td>\n",
       "      <td>35</td>\n",
       "    </tr>\n",
       "  </tbody>\n",
       "</table>\n",
       "</div>"
      ],
      "text/plain": [
       "  numbers nums  colors  other_column  numbers_loop numbers_str  \\\n",
       "0     #23   23   green             0          23.0          23   \n",
       "1     #24   24     red             1          24.0          24   \n",
       "2     #18   18  yellow             0          18.0          18   \n",
       "3     #14   14  orange             2          14.0          14   \n",
       "4     #12  100  purple             1          12.0          12   \n",
       "5     #10  XYZ    blue             0          10.0          10   \n",
       "6     #35   35    pink             2          35.0          35   \n",
       "\n",
       "   numbers_numeric  \n",
       "0               23  \n",
       "1               24  \n",
       "2               18  \n",
       "3               14  \n",
       "4               12  \n",
       "5               10  \n",
       "6               35  "
      ]
     },
     "execution_count": 39,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "# fill the missing values with a specific value\n",
    "df.fillna(value=100)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 40,
   "metadata": {
    "hide-output": false,
    "slideshow": {
     "slide_type": "-"
    }
   },
   "outputs": [
    {
     "data": {
      "text/html": [
       "<div>\n",
       "<style scoped>\n",
       "    .dataframe tbody tr th:only-of-type {\n",
       "        vertical-align: middle;\n",
       "    }\n",
       "\n",
       "    .dataframe tbody tr th {\n",
       "        vertical-align: top;\n",
       "    }\n",
       "\n",
       "    .dataframe thead th {\n",
       "        text-align: right;\n",
       "    }\n",
       "</style>\n",
       "<table border=\"1\" class=\"dataframe\">\n",
       "  <thead>\n",
       "    <tr style=\"text-align: right;\">\n",
       "      <th></th>\n",
       "      <th>numbers</th>\n",
       "      <th>nums</th>\n",
       "      <th>colors</th>\n",
       "      <th>other_column</th>\n",
       "      <th>numbers_loop</th>\n",
       "      <th>numbers_str</th>\n",
       "      <th>numbers_numeric</th>\n",
       "    </tr>\n",
       "  </thead>\n",
       "  <tbody>\n",
       "    <tr>\n",
       "      <th>0</th>\n",
       "      <td>#23</td>\n",
       "      <td>23</td>\n",
       "      <td>green</td>\n",
       "      <td>0</td>\n",
       "      <td>23.0</td>\n",
       "      <td>23</td>\n",
       "      <td>23</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>1</th>\n",
       "      <td>#24</td>\n",
       "      <td>24</td>\n",
       "      <td>red</td>\n",
       "      <td>1</td>\n",
       "      <td>24.0</td>\n",
       "      <td>24</td>\n",
       "      <td>24</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>2</th>\n",
       "      <td>#18</td>\n",
       "      <td>18</td>\n",
       "      <td>yellow</td>\n",
       "      <td>0</td>\n",
       "      <td>18.0</td>\n",
       "      <td>18</td>\n",
       "      <td>18</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>3</th>\n",
       "      <td>#14</td>\n",
       "      <td>14</td>\n",
       "      <td>orange</td>\n",
       "      <td>2</td>\n",
       "      <td>14.0</td>\n",
       "      <td>14</td>\n",
       "      <td>14</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>4</th>\n",
       "      <td>#12</td>\n",
       "      <td>XYZ</td>\n",
       "      <td>purple</td>\n",
       "      <td>1</td>\n",
       "      <td>12.0</td>\n",
       "      <td>12</td>\n",
       "      <td>12</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>5</th>\n",
       "      <td>#10</td>\n",
       "      <td>XYZ</td>\n",
       "      <td>blue</td>\n",
       "      <td>0</td>\n",
       "      <td>10.0</td>\n",
       "      <td>10</td>\n",
       "      <td>10</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>6</th>\n",
       "      <td>#35</td>\n",
       "      <td>35</td>\n",
       "      <td>pink</td>\n",
       "      <td>2</td>\n",
       "      <td>35.0</td>\n",
       "      <td>35</td>\n",
       "      <td>35</td>\n",
       "    </tr>\n",
       "  </tbody>\n",
       "</table>\n",
       "</div>"
      ],
      "text/plain": [
       "  numbers nums  colors  other_column  numbers_loop numbers_str  \\\n",
       "0     #23   23   green             0          23.0          23   \n",
       "1     #24   24     red             1          24.0          24   \n",
       "2     #18   18  yellow             0          18.0          18   \n",
       "3     #14   14  orange             2          14.0          14   \n",
       "4     #12  XYZ  purple             1          12.0          12   \n",
       "5     #10  XYZ    blue             0          10.0          10   \n",
       "6     #35   35    pink             2          35.0          35   \n",
       "\n",
       "   numbers_numeric  \n",
       "0               23  \n",
       "1               24  \n",
       "2               18  \n",
       "3               14  \n",
       "4               12  \n",
       "5               10  \n",
       "6               35  "
      ]
     },
     "execution_count": 40,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "# use the _next_ valid observation to fill the missing data\n",
    "df.fillna(method=\"bfill\")"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 41,
   "metadata": {
    "hide-output": false,
    "slideshow": {
     "slide_type": "-"
    }
   },
   "outputs": [
    {
     "data": {
      "text/html": [
       "<div>\n",
       "<style scoped>\n",
       "    .dataframe tbody tr th:only-of-type {\n",
       "        vertical-align: middle;\n",
       "    }\n",
       "\n",
       "    .dataframe tbody tr th {\n",
       "        vertical-align: top;\n",
       "    }\n",
       "\n",
       "    .dataframe thead th {\n",
       "        text-align: right;\n",
       "    }\n",
       "</style>\n",
       "<table border=\"1\" class=\"dataframe\">\n",
       "  <thead>\n",
       "    <tr style=\"text-align: right;\">\n",
       "      <th></th>\n",
       "      <th>numbers</th>\n",
       "      <th>nums</th>\n",
       "      <th>colors</th>\n",
       "      <th>other_column</th>\n",
       "      <th>numbers_loop</th>\n",
       "      <th>numbers_str</th>\n",
       "      <th>numbers_numeric</th>\n",
       "    </tr>\n",
       "  </thead>\n",
       "  <tbody>\n",
       "    <tr>\n",
       "      <th>0</th>\n",
       "      <td>#23</td>\n",
       "      <td>23</td>\n",
       "      <td>green</td>\n",
       "      <td>0</td>\n",
       "      <td>23.0</td>\n",
       "      <td>23</td>\n",
       "      <td>23</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>1</th>\n",
       "      <td>#24</td>\n",
       "      <td>24</td>\n",
       "      <td>red</td>\n",
       "      <td>1</td>\n",
       "      <td>24.0</td>\n",
       "      <td>24</td>\n",
       "      <td>24</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>2</th>\n",
       "      <td>#18</td>\n",
       "      <td>18</td>\n",
       "      <td>yellow</td>\n",
       "      <td>0</td>\n",
       "      <td>18.0</td>\n",
       "      <td>18</td>\n",
       "      <td>18</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>3</th>\n",
       "      <td>#14</td>\n",
       "      <td>14</td>\n",
       "      <td>orange</td>\n",
       "      <td>2</td>\n",
       "      <td>14.0</td>\n",
       "      <td>14</td>\n",
       "      <td>14</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>4</th>\n",
       "      <td>#12</td>\n",
       "      <td>14</td>\n",
       "      <td>purple</td>\n",
       "      <td>1</td>\n",
       "      <td>12.0</td>\n",
       "      <td>12</td>\n",
       "      <td>12</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>5</th>\n",
       "      <td>#10</td>\n",
       "      <td>XYZ</td>\n",
       "      <td>blue</td>\n",
       "      <td>0</td>\n",
       "      <td>10.0</td>\n",
       "      <td>10</td>\n",
       "      <td>10</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>6</th>\n",
       "      <td>#35</td>\n",
       "      <td>35</td>\n",
       "      <td>pink</td>\n",
       "      <td>2</td>\n",
       "      <td>35.0</td>\n",
       "      <td>35</td>\n",
       "      <td>35</td>\n",
       "    </tr>\n",
       "  </tbody>\n",
       "</table>\n",
       "</div>"
      ],
      "text/plain": [
       "  numbers nums  colors  other_column  numbers_loop numbers_str  \\\n",
       "0     #23   23   green             0          23.0          23   \n",
       "1     #24   24     red             1          24.0          24   \n",
       "2     #18   18  yellow             0          18.0          18   \n",
       "3     #14   14  orange             2          14.0          14   \n",
       "4     #12   14  purple             1          12.0          12   \n",
       "5     #10  XYZ    blue             0          10.0          10   \n",
       "6     #35   35    pink             2          35.0          35   \n",
       "\n",
       "   numbers_numeric  \n",
       "0               23  \n",
       "1               24  \n",
       "2               18  \n",
       "3               14  \n",
       "4               12  \n",
       "5               10  \n",
       "6               35  "
      ]
     },
     "execution_count": 41,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "# use the _previous_ valid observation to fill missing data\n",
    "df.fillna(method=\"ffill\")"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "slideshow": {
     "slide_type": "-"
    }
   },
   "source": [
    "We will see more examples of dealing with missing data in future\n",
    "chapters."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "slideshow": {
     "slide_type": "slide"
    }
   },
   "source": [
    "## Case Study\n",
    "\n",
    "We will now use data from an\n",
    "[article](https://www.nytimes.com/interactive/2015/02/17/upshot/what-do-people-actually-order-at-chipotle.html)\n",
    "written by The Upshot at the NYTimes.\n",
    "\n",
    "This data has order information from almost 2,000 Chipotle orders and\n",
    "includes information on what was ordered and how much it cost."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 42,
   "metadata": {
    "hide-output": false,
    "slideshow": {
     "slide_type": "-"
    }
   },
   "outputs": [
    {
     "data": {
      "text/html": [
       "<div>\n",
       "<style scoped>\n",
       "    .dataframe tbody tr th:only-of-type {\n",
       "        vertical-align: middle;\n",
       "    }\n",
       "\n",
       "    .dataframe tbody tr th {\n",
       "        vertical-align: top;\n",
       "    }\n",
       "\n",
       "    .dataframe thead th {\n",
       "        text-align: right;\n",
       "    }\n",
       "</style>\n",
       "<table border=\"1\" class=\"dataframe\">\n",
       "  <thead>\n",
       "    <tr style=\"text-align: right;\">\n",
       "      <th></th>\n",
       "      <th>order_id</th>\n",
       "      <th>quantity</th>\n",
       "      <th>item_name</th>\n",
       "      <th>choice_description</th>\n",
       "      <th>item_price</th>\n",
       "    </tr>\n",
       "  </thead>\n",
       "  <tbody>\n",
       "    <tr>\n",
       "      <th>0</th>\n",
       "      <td>1</td>\n",
       "      <td>1</td>\n",
       "      <td>Chips and Fresh Tomato Salsa</td>\n",
       "      <td>NaN</td>\n",
       "      <td>$2.39</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>1</th>\n",
       "      <td>1</td>\n",
       "      <td>1</td>\n",
       "      <td>Izze</td>\n",
       "      <td>[Clementine]</td>\n",
       "      <td>$3.39</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>2</th>\n",
       "      <td>1</td>\n",
       "      <td>1</td>\n",
       "      <td>Nantucket Nectar</td>\n",
       "      <td>[Apple]</td>\n",
       "      <td>$3.39</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>3</th>\n",
       "      <td>1</td>\n",
       "      <td>1</td>\n",
       "      <td>Chips and Tomatillo-Green Chili Salsa</td>\n",
       "      <td>NaN</td>\n",
       "      <td>$2.39</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>4</th>\n",
       "      <td>2</td>\n",
       "      <td>2</td>\n",
       "      <td>Chicken Bowl</td>\n",
       "      <td>[Tomatillo-Red Chili Salsa (Hot), [Black Beans...</td>\n",
       "      <td>$16.98</td>\n",
       "    </tr>\n",
       "  </tbody>\n",
       "</table>\n",
       "</div>"
      ],
      "text/plain": [
       "   order_id  quantity                              item_name  \\\n",
       "0         1         1           Chips and Fresh Tomato Salsa   \n",
       "1         1         1                                   Izze   \n",
       "2         1         1                       Nantucket Nectar   \n",
       "3         1         1  Chips and Tomatillo-Green Chili Salsa   \n",
       "4         2         2                           Chicken Bowl   \n",
       "\n",
       "                                  choice_description item_price  \n",
       "0                                                NaN     $2.39   \n",
       "1                                       [Clementine]     $3.39   \n",
       "2                                            [Apple]     $3.39   \n",
       "3                                                NaN     $2.39   \n",
       "4  [Tomatillo-Red Chili Salsa (Hot), [Black Beans...    $16.98   "
      ]
     },
     "execution_count": 42,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "chipotle = qeds.data.load(\"chipotle_raw\")\n",
    "chipotle.head()"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 45,
   "metadata": {},
   "outputs": [
    {
     "name": "stderr",
     "output_type": "stream",
     "text": [
      "<ipython-input-45-181862fd6196>:1: FutureWarning: The default value of regex will change from True to False in a future version. In addition, single character regular expressions will*not* be treated as literal strings when regex=True.\n",
      "  chipotle[\"price\"] = pd.to_numeric(chipotle[\"item_price\"].str.replace(\"$\", \"\"))\n"
     ]
    }
   ],
   "source": [
    "chipotle[\"price\"] = pd.to_numeric(chipotle[\"item_price\"].str.replace(\"$\", \"\"))"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 49,
   "metadata": {},
   "outputs": [],
   "source": [
    "chipotle[\"per_unit_price\"] = chipotle[\"price\"] / chipotle[\"quantity\"]"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 50,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/html": [
       "<div>\n",
       "<style scoped>\n",
       "    .dataframe tbody tr th:only-of-type {\n",
       "        vertical-align: middle;\n",
       "    }\n",
       "\n",
       "    .dataframe tbody tr th {\n",
       "        vertical-align: top;\n",
       "    }\n",
       "\n",
       "    .dataframe thead th {\n",
       "        text-align: right;\n",
       "    }\n",
       "</style>\n",
       "<table border=\"1\" class=\"dataframe\">\n",
       "  <thead>\n",
       "    <tr style=\"text-align: right;\">\n",
       "      <th></th>\n",
       "      <th>order_id</th>\n",
       "      <th>quantity</th>\n",
       "      <th>item_name</th>\n",
       "      <th>choice_description</th>\n",
       "      <th>item_price</th>\n",
       "      <th>price</th>\n",
       "      <th>per_unit_price</th>\n",
       "    </tr>\n",
       "  </thead>\n",
       "  <tbody>\n",
       "    <tr>\n",
       "      <th>0</th>\n",
       "      <td>1</td>\n",
       "      <td>1</td>\n",
       "      <td>Chips and Fresh Tomato Salsa</td>\n",
       "      <td>NaN</td>\n",
       "      <td>$2.39</td>\n",
       "      <td>2.39</td>\n",
       "      <td>2.39</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>1</th>\n",
       "      <td>1</td>\n",
       "      <td>1</td>\n",
       "      <td>Izze</td>\n",
       "      <td>[Clementine]</td>\n",
       "      <td>$3.39</td>\n",
       "      <td>3.39</td>\n",
       "      <td>3.39</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>2</th>\n",
       "      <td>1</td>\n",
       "      <td>1</td>\n",
       "      <td>Nantucket Nectar</td>\n",
       "      <td>[Apple]</td>\n",
       "      <td>$3.39</td>\n",
       "      <td>3.39</td>\n",
       "      <td>3.39</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>3</th>\n",
       "      <td>1</td>\n",
       "      <td>1</td>\n",
       "      <td>Chips and Tomatillo-Green Chili Salsa</td>\n",
       "      <td>NaN</td>\n",
       "      <td>$2.39</td>\n",
       "      <td>2.39</td>\n",
       "      <td>2.39</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>4</th>\n",
       "      <td>2</td>\n",
       "      <td>2</td>\n",
       "      <td>Chicken Bowl</td>\n",
       "      <td>[Tomatillo-Red Chili Salsa (Hot), [Black Beans...</td>\n",
       "      <td>$16.98</td>\n",
       "      <td>16.98</td>\n",
       "      <td>8.49</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>...</th>\n",
       "      <td>...</td>\n",
       "      <td>...</td>\n",
       "      <td>...</td>\n",
       "      <td>...</td>\n",
       "      <td>...</td>\n",
       "      <td>...</td>\n",
       "      <td>...</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>4617</th>\n",
       "      <td>1833</td>\n",
       "      <td>1</td>\n",
       "      <td>Steak Burrito</td>\n",
       "      <td>[Fresh Tomato Salsa, [Rice, Black Beans, Sour ...</td>\n",
       "      <td>$11.75</td>\n",
       "      <td>11.75</td>\n",
       "      <td>11.75</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>4618</th>\n",
       "      <td>1833</td>\n",
       "      <td>1</td>\n",
       "      <td>Steak Burrito</td>\n",
       "      <td>[Fresh Tomato Salsa, [Rice, Sour Cream, Cheese...</td>\n",
       "      <td>$11.75</td>\n",
       "      <td>11.75</td>\n",
       "      <td>11.75</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>4619</th>\n",
       "      <td>1834</td>\n",
       "      <td>1</td>\n",
       "      <td>Chicken Salad Bowl</td>\n",
       "      <td>[Fresh Tomato Salsa, [Fajita Vegetables, Pinto...</td>\n",
       "      <td>$11.25</td>\n",
       "      <td>11.25</td>\n",
       "      <td>11.25</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>4620</th>\n",
       "      <td>1834</td>\n",
       "      <td>1</td>\n",
       "      <td>Chicken Salad Bowl</td>\n",
       "      <td>[Fresh Tomato Salsa, [Fajita Vegetables, Lettu...</td>\n",
       "      <td>$8.75</td>\n",
       "      <td>8.75</td>\n",
       "      <td>8.75</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>4621</th>\n",
       "      <td>1834</td>\n",
       "      <td>1</td>\n",
       "      <td>Chicken Salad Bowl</td>\n",
       "      <td>[Fresh Tomato Salsa, [Fajita Vegetables, Pinto...</td>\n",
       "      <td>$8.75</td>\n",
       "      <td>8.75</td>\n",
       "      <td>8.75</td>\n",
       "    </tr>\n",
       "  </tbody>\n",
       "</table>\n",
       "<p>4622 rows × 7 columns</p>\n",
       "</div>"
      ],
      "text/plain": [
       "      order_id  quantity                              item_name  \\\n",
       "0            1         1           Chips and Fresh Tomato Salsa   \n",
       "1            1         1                                   Izze   \n",
       "2            1         1                       Nantucket Nectar   \n",
       "3            1         1  Chips and Tomatillo-Green Chili Salsa   \n",
       "4            2         2                           Chicken Bowl   \n",
       "...        ...       ...                                    ...   \n",
       "4617      1833         1                          Steak Burrito   \n",
       "4618      1833         1                          Steak Burrito   \n",
       "4619      1834         1                     Chicken Salad Bowl   \n",
       "4620      1834         1                     Chicken Salad Bowl   \n",
       "4621      1834         1                     Chicken Salad Bowl   \n",
       "\n",
       "                                     choice_description item_price  price  \\\n",
       "0                                                   NaN     $2.39    2.39   \n",
       "1                                          [Clementine]     $3.39    3.39   \n",
       "2                                               [Apple]     $3.39    3.39   \n",
       "3                                                   NaN     $2.39    2.39   \n",
       "4     [Tomatillo-Red Chili Salsa (Hot), [Black Beans...    $16.98   16.98   \n",
       "...                                                 ...        ...    ...   \n",
       "4617  [Fresh Tomato Salsa, [Rice, Black Beans, Sour ...    $11.75   11.75   \n",
       "4618  [Fresh Tomato Salsa, [Rice, Sour Cream, Cheese...    $11.75   11.75   \n",
       "4619  [Fresh Tomato Salsa, [Fajita Vegetables, Pinto...    $11.25   11.25   \n",
       "4620  [Fresh Tomato Salsa, [Fajita Vegetables, Lettu...     $8.75    8.75   \n",
       "4621  [Fresh Tomato Salsa, [Fajita Vegetables, Pinto...     $8.75    8.75   \n",
       "\n",
       "      per_unit_price  \n",
       "0               2.39  \n",
       "1               3.39  \n",
       "2               3.39  \n",
       "3               2.39  \n",
       "4               8.49  \n",
       "...              ...  \n",
       "4617           11.75  \n",
       "4618           11.75  \n",
       "4619           11.25  \n",
       "4620            8.75  \n",
       "4621            8.75  \n",
       "\n",
       "[4622 rows x 7 columns]"
      ]
     },
     "execution_count": 50,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "chipotle"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 47,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "order_id                int64\n",
       "quantity                int64\n",
       "item_name              object\n",
       "choice_description     object\n",
       "item_price             object\n",
       "price                 float64\n",
       "dtype: object"
      ]
     },
     "execution_count": 47,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "chipotle.dtypes"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 43,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "'[Tomatillo-Red Chili Salsa (Hot), [Black Beans, Rice, Cheese, Sour Cream]]'"
      ]
     },
     "execution_count": 43,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "chipotle.loc[4, \"choice_description\"]"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "slideshow": {
     "slide_type": "subslide"
    }
   },
   "source": [
    "**Exercise**\n",
    "\n",
    "We'd like you to use this data to answer the following questions.\n",
    "\n",
    "- What is the average price of an item with chicken?  \n",
    "- What is the average price of an item with steak?  \n",
    "- Did chicken or steak produce more revenue (total)?  \n",
    "- How many missing items are there in this dataset? How many missing\n",
    "  items in each column?  \n",
    "\n",
    "\n",
    "Hint: before you will be able to do any of these things you will need to\n",
    "make sure the `item_price` column has a numeric `dtype` (probably\n",
    "float)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 56,
   "metadata": {},
   "outputs": [],
   "source": [
    "chicken_idx = chipotle[\"item_name\"].str.contains(\"Chicken\")\n",
    "chicken_price = chipotle.loc[chicken_idx, \"per_unit_price\"].mean()\n",
    "\n",
    "steak_idx = chipotle[\"item_name\"].str.contains(\"Steak\")\n",
    "steak_price = chipotle.loc[steak_idx, \"per_unit_price\"].mean()"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 57,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "Average price for item with chicken is 9.571519230769185\n",
      "Average price for item with steak is 10.067336182336144\n"
     ]
    }
   ],
   "source": [
    "print(f\"Average price for item with chicken is {chicken_price}\")\n",
    "print(f\"Average price for item with steak is {steak_price}\")"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 58,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "True"
      ]
     },
     "execution_count": 58,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "chicken_price < steak_price"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 62,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/html": [
       "<div>\n",
       "<style scoped>\n",
       "    .dataframe tbody tr th:only-of-type {\n",
       "        vertical-align: middle;\n",
       "    }\n",
       "\n",
       "    .dataframe tbody tr th {\n",
       "        vertical-align: top;\n",
       "    }\n",
       "\n",
       "    .dataframe thead th {\n",
       "        text-align: right;\n",
       "    }\n",
       "</style>\n",
       "<table border=\"1\" class=\"dataframe\">\n",
       "  <thead>\n",
       "    <tr style=\"text-align: right;\">\n",
       "      <th></th>\n",
       "      <th>order_id</th>\n",
       "      <th>quantity</th>\n",
       "      <th>item_name</th>\n",
       "      <th>choice_description</th>\n",
       "      <th>item_price</th>\n",
       "      <th>price</th>\n",
       "      <th>per_unit_price</th>\n",
       "    </tr>\n",
       "  </thead>\n",
       "  <tbody>\n",
       "    <tr>\n",
       "      <th>0</th>\n",
       "      <td>1</td>\n",
       "      <td>1</td>\n",
       "      <td>Chips and Fresh Tomato Salsa</td>\n",
       "      <td>NaN</td>\n",
       "      <td>$2.39</td>\n",
       "      <td>2.39</td>\n",
       "      <td>2.39</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>1</th>\n",
       "      <td>1</td>\n",
       "      <td>1</td>\n",
       "      <td>Izze</td>\n",
       "      <td>[Clementine]</td>\n",
       "      <td>$3.39</td>\n",
       "      <td>3.39</td>\n",
       "      <td>3.39</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>2</th>\n",
       "      <td>1</td>\n",
       "      <td>1</td>\n",
       "      <td>Nantucket Nectar</td>\n",
       "      <td>[Apple]</td>\n",
       "      <td>$3.39</td>\n",
       "      <td>3.39</td>\n",
       "      <td>3.39</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>3</th>\n",
       "      <td>1</td>\n",
       "      <td>1</td>\n",
       "      <td>Chips and Tomatillo-Green Chili Salsa</td>\n",
       "      <td>NaN</td>\n",
       "      <td>$2.39</td>\n",
       "      <td>2.39</td>\n",
       "      <td>2.39</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>4</th>\n",
       "      <td>2</td>\n",
       "      <td>2</td>\n",
       "      <td>Chicken Bowl</td>\n",
       "      <td>[Tomatillo-Red Chili Salsa (Hot), [Black Beans...</td>\n",
       "      <td>$16.98</td>\n",
       "      <td>16.98</td>\n",
       "      <td>8.49</td>\n",
       "    </tr>\n",
       "  </tbody>\n",
       "</table>\n",
       "</div>"
      ],
      "text/plain": [
       "   order_id  quantity                              item_name  \\\n",
       "0         1         1           Chips and Fresh Tomato Salsa   \n",
       "1         1         1                                   Izze   \n",
       "2         1         1                       Nantucket Nectar   \n",
       "3         1         1  Chips and Tomatillo-Green Chili Salsa   \n",
       "4         2         2                           Chicken Bowl   \n",
       "\n",
       "                                  choice_description item_price  price  \\\n",
       "0                                                NaN     $2.39    2.39   \n",
       "1                                       [Clementine]     $3.39    3.39   \n",
       "2                                            [Apple]     $3.39    3.39   \n",
       "3                                                NaN     $2.39    2.39   \n",
       "4  [Tomatillo-Red Chili Salsa (Hot), [Black Beans...    $16.98   16.98   \n",
       "\n",
       "   per_unit_price  \n",
       "0            2.39  \n",
       "1            3.39  \n",
       "2            3.39  \n",
       "3            2.39  \n",
       "4            8.49  "
      ]
     },
     "execution_count": 62,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "chipotle.head()"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 72,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "0        2.39\n",
       "1        3.39\n",
       "2        3.39\n",
       "3        2.39\n",
       "4       16.98\n",
       "        ...  \n",
       "4617    11.75\n",
       "4618    11.75\n",
       "4619    11.25\n",
       "4620     8.75\n",
       "4621     8.75\n",
       "Length: 4622, dtype: float64"
      ]
     },
     "execution_count": 72,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "chipotle[\"quantity\"] * chipotle[\"per_unit_price\"]"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 74,
   "metadata": {},
   "outputs": [],
   "source": [
    "# Only works with column names from chipotle\n",
    "# https://pandas.pydata.org/docs/reference/api/pandas.eval.html\n",
    "chipotle[\"revenue\"] = chipotle.eval(\"quantity * per_unit_price\")"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 64,
   "metadata": {},
   "outputs": [],
   "source": [
    "chicken_revenue = chipotle.loc[chicken_idx, \"revenue\"].sum()\n",
    "steak_revenue = chipotle.loc[steak_idx, \"revenue\"].sum()"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 65,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "Total revenue from items with chicken is 15808.61\n",
      "Total revenue from items with steak is 7384.26\n"
     ]
    }
   ],
   "source": [
    "print(f\"Total revenue from items with chicken is {chicken_revenue}\")\n",
    "print(f\"Total revenue from items with steak is {steak_revenue}\")"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 67,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "order_id                 0\n",
       "quantity                 0\n",
       "item_name                0\n",
       "choice_description    1246\n",
       "item_price               0\n",
       "price                    0\n",
       "per_unit_price           0\n",
       "revenue                  0\n",
       "dtype: int64"
      ]
     },
     "execution_count": 67,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "chipotle.isnull().sum(axis=\"index\")"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "slideshow": {
     "slide_type": "slide"
    }
   },
   "source": [
    "## Appendix: Performance of `.str` Methods\n",
    "\n",
    "Let’s repeat the “remove the `#`” example from above, but this time on\n",
    "a much larger dataset."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 68,
   "metadata": {
    "hide-output": false,
    "slideshow": {
     "slide_type": "-"
    }
   },
   "outputs": [
    {
     "data": {
      "text/html": [
       "<div>\n",
       "<style scoped>\n",
       "    .dataframe tbody tr th:only-of-type {\n",
       "        vertical-align: middle;\n",
       "    }\n",
       "\n",
       "    .dataframe tbody tr th {\n",
       "        vertical-align: top;\n",
       "    }\n",
       "\n",
       "    .dataframe thead th {\n",
       "        text-align: right;\n",
       "    }\n",
       "</style>\n",
       "<table border=\"1\" class=\"dataframe\">\n",
       "  <thead>\n",
       "    <tr style=\"text-align: right;\">\n",
       "      <th></th>\n",
       "      <th>floats</th>\n",
       "      <th>strings</th>\n",
       "    </tr>\n",
       "  </thead>\n",
       "  <tbody>\n",
       "    <tr>\n",
       "      <th>0</th>\n",
       "      <td>71.15</td>\n",
       "      <td>71.15%</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>1</th>\n",
       "      <td>23.21</td>\n",
       "      <td>23.21%</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>2</th>\n",
       "      <td>49.59</td>\n",
       "      <td>49.59%</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>3</th>\n",
       "      <td>4.13</td>\n",
       "      <td>4.13%</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>4</th>\n",
       "      <td>76.52</td>\n",
       "      <td>76.52%</td>\n",
       "    </tr>\n",
       "  </tbody>\n",
       "</table>\n",
       "</div>"
      ],
      "text/plain": [
       "   floats strings\n",
       "0   71.15  71.15%\n",
       "1   23.21  23.21%\n",
       "2   49.59  49.59%\n",
       "3    4.13   4.13%\n",
       "4   76.52  76.52%"
      ]
     },
     "execution_count": 68,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "import numpy as np\n",
    "\n",
    "test = pd.DataFrame({\"floats\": np.round(100*np.random.rand(100_000), 2)})\n",
    "test[\"strings\"] = test[\"floats\"].astype(str) + \"%\"\n",
    "test.head()"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 69,
   "metadata": {
    "hide-output": false,
    "slideshow": {
     "slide_type": "-"
    }
   },
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "CPU times: user 6.27 s, sys: 0 ns, total: 6.27 s\n",
      "Wall time: 6.28 s\n"
     ]
    }
   ],
   "source": [
    "%%time\n",
    "\n",
    "for row in test.iterrows():\n",
    "    index_value, column_values = row\n",
    "    clean_number = column_values[\"strings\"].replace(\"%\", \"\")\n",
    "    test.at[index_value, \"numbers_loop\"] = clean_number"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 70,
   "metadata": {
    "hide-output": false,
    "slideshow": {
     "slide_type": "-"
    }
   },
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "CPU times: user 37.8 ms, sys: 1.17 ms, total: 39 ms\n",
      "Wall time: 38.3 ms\n"
     ]
    }
   ],
   "source": [
    "%%time\n",
    "test[\"numbers_str_method\"] = test[\"strings\"].str.replace(\"%\", \"\")"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 71,
   "metadata": {
    "hide-output": false,
    "slideshow": {
     "slide_type": "-"
    }
   },
   "outputs": [
    {
     "data": {
      "text/plain": [
       "True"
      ]
     },
     "execution_count": 71,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "test[\"numbers_str_method\"].equals(test[\"numbers_loop\"])"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "slideshow": {
     "slide_type": "-"
    }
   },
   "source": [
    "We got the exact same result in a fraction of the time!"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 75,
   "metadata": {},
   "outputs": [],
   "source": [
    "import sys"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 80,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "True"
      ]
     },
     "execution_count": 80,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "sys.maxsize > 2**32"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 77,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "'3.8.8 (default, Feb 24 2021, 21:46:12) \\n[GCC 7.3.0]'"
      ]
     },
     "execution_count": 77,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "sys.version"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 76,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "sys.version_info(major=3, minor=8, micro=8, releaselevel='final', serial=0)"
      ]
     },
     "execution_count": 76,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "sys.version_info"
   ]
  }
 ],
 "metadata": {
  "celltoolbar": "Slideshow",
  "date": 1595352471.7839751,
  "download_nb": false,
  "filename": "data_clean.rst",
  "filename_with_path": "pandas/data_clean",
  "kernelspec": {
   "display_name": "Python 3",
   "language": "python",
   "name": "python3"
  },
  "language_info": {
   "codemirror_mode": {
    "name": "ipython",
    "version": 3
   },
   "file_extension": ".py",
   "mimetype": "text/x-python",
   "name": "python",
   "nbconvert_exporter": "python",
   "pygments_lexer": "ipython3",
   "version": "3.8.8"
  },
  "title": "Cleaning Data"
 },
 "nbformat": 4,
 "nbformat_minor": 2
}