## Merge

In [None]:
import random

import numpy as np
import pandas as pd
import qeds

**Data: Airline Delays**

This time, we will use a dataset from the [Bureau of Transportation
Statistics](https://www.transtats.bts.gov/DL_SelectFields.asp?Table_ID=236&DB_Short_Name=On-Time)
that describes the cause of all US domestic flight delays
in December 2016:

In [None]:
# This will take a minute or two to download -- It's medium sized data (15 MB)... After
# doing it once, it will be faster because it will be cached on your computer
air_perf = qeds.load("airline_performance_dec16")[["CRSDepTime", "Carrier", "CarrierDelay", "ArrDelay"]]
air_perf.info()
air_perf.head()

In [None]:
carrier_code = qeds.load("airline_carrier_codes")
carrier_code.tail()

The `Carrier` column identifies the airline, and the `CarrierDelay`
column reports how many minutes of the total delay were attributed to
the carrier (the “carrier’s fault”).

**Exercise**

Determine which airlines, on average, contribute most to
delays.

**Exercise**

Merge the carrier code data to find the names of the 5 airlines that have the most delays that are the carrier's fault.
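As a rough illustration of the mechanics involved (not a solution on the real data), the sketch below averages a delay column by carrier and then merges in airline names. The toy values are made up, and the `Code` and `Description` column names are assumptions, since the layout of `carrier_code` is not shown here. It ranks by the mean; ranking by a sum or a count would be an equally reasonable reading of "most delays".

```python
import pandas as pd

# Toy stand-in for air_perf (values are made up for illustration)
toy_perf = pd.DataFrame({
    "Carrier": ["AA", "AA", "DL", "DL", "UA"],
    "CarrierDelay": [10.0, 30.0, 0.0, 20.0, 50.0],
})

# Toy stand-in for carrier_code; these column names are assumptions
toy_codes = pd.DataFrame({
    "Code": ["AA", "DL", "UA"],
    "Description": ["American Airlines", "Delta Air Lines", "United Airlines"],
})

# Average carrier-attributed delay for each airline code
avg_delay = (
    toy_perf.groupby("Carrier")["CarrierDelay"]
    .mean()
    .rename("AvgCarrierDelay")
    .reset_index()
)

# Attach airline names by matching the two code columns
named = avg_delay.merge(toy_codes, left_on="Carrier", right_on="Code", how="left")

# Largest average delays first; keep the top two rows of the toy data
top = named.sort_values("AvgCarrierDelay", ascending=False).head(2)
print(top[["Description", "AvgCarrierDelay"]])
```

Using `how="left"` keeps every carrier from the delay table even if its code has no match in the lookup table, which makes missing names easy to spot.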

## Exercise: Cohort Analysis using Shopify Data

The `qeds` library includes routines to simulate data sets in the
format of common sources.

One of these sources is [Shopify](https://www.shopify.com/), an
e-commerce platform used by many retail companies for online sales.

The code below will simulate a fairly large data set that has the
properties of an order-detail report from Shopify.

We’ll first look at the data, and then describe the exercise.

In [None]:
# Set the "randomness" seeds
random.seed(42)
np.random.seed(42)

orders = qeds.data.shopify.simulate_orders(500000)
orders.info()

orders.head()

We define a customer’s cohort as the month in which a customer placed
their first order and the customer type as an indicator of whether this
was their first order or a returning order.
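To make these two definitions concrete, here is a minimal sketch on a made-up order log. The `customer` column name and the data are hypothetical, and the simulated `orders` data already carries its own `customer_type` column, so this only illustrates what the labels mean.

```python
import pandas as pd

# Made-up order log; "customer" is an illustrative column name
toy = pd.DataFrame({
    "customer": ["a", "a", "b", "b", "b"],
    "Day": pd.to_datetime(
        ["2015-01-05", "2015-03-10", "2015-02-01", "2015-02-20", "2015-04-02"]
    ),
})

# Date of each customer's earliest order, repeated on every row for that customer
first_day = toy.groupby("customer")["Day"].transform("min")

# Cohort: the calendar month of that earliest order
toy["cohort"] = first_day.dt.to_period("M")

# Customer type: is this row the customer's first order or a later one?
toy["customer_type"] = (toy["Day"] == first_day).map(
    {True: "First-time", False: "Returning"}
)
print(toy)
```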

We now describe the *want* for the exercise, which we ask you to
complete.

**Want**: Compute the monthly total number of orders, total sales, and
total quantity separated by customer cohort and customer type.

Read that carefully one more time...

### Extended Exercise

Using the reshape and `groupby` tools you have learned, apply the want
operator described above.

See below for advice on how to proceed.

When you are finished, you should have something that looks like this:

<img src="https://datascience.quantecon.org/assets/_static/groupby_files/groupby_cohort_analysis_exercise_output.png" alt="groupby_cohort_analysis_exercise_output.png" style="">

  
Two notes on the table above:

1. Your actual output will be much bigger. This is just to give you an
  idea of what it might look like.  
1. The numbers you produce should be the same as those
  included in this table. Index into your answer and compare what you
  have with this table to verify your progress.  


Now, how to do it?

There is more than one way to code this, but here are some suggested
steps.

1. Convert the `Day` column to have a `datetime` `dtype` instead
  of object (Hint: use the `pd.to_datetime` function).  
1. Add a new column that specifies the date associated with each
  customer’s `"First-time"` order.  
  - Hint 1: You can do this with a combination of `groupby` and
    `join`.  
  - Hint 2: `customer_type` is always either `Returning` or
    `First-time`.  
  - Hint 3: Some customers don’t have a
    `customer_type == "First-time"` entry. You will need to set the
    value for these customers to some date that precedes the dates in the
    sample. After adding the valid data back into the `orders` DataFrame,
    you can identify which customers don’t have a `"First-time"`
    entry by checking for missing data in the new column.  
1. You’ll need to group by 3 things.  
1. You can apply one of the built-in aggregation functions to the GroupBy.  
1. After doing the aggregation, you’ll need to use your reshaping skills to
  move things to the right place in rows and columns.  
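The steps above can be sketched on a tiny made-up table. This is one possible approach under stated assumptions, not the reference solution: the `customer_id` and `total_sales` column names, the data, and the placeholder date `2000-01-01` are all hypothetical, and only one of the three requested totals is aggregated.

```python
import pandas as pd

# Made-up stand-in for `orders`; column names are assumptions for illustration
toy = pd.DataFrame({
    "Day": ["2015-01-05", "2015-02-10", "2015-02-01", "2015-02-20", "2015-02-03"],
    "customer_id": [1, 1, 2, 2, 3],
    "customer_type": ["First-time", "Returning", "First-time", "Returning", "Returning"],
    "total_sales": [10.0, 20.0, 5.0, 15.0, 8.0],
})

# Step 1: parse the dates
toy["Day"] = pd.to_datetime(toy["Day"])

# Step 2: date of each customer's "First-time" order, joined back by customer
first = (
    toy[toy["customer_type"] == "First-time"]
    .groupby("customer_id")["Day"]
    .min()
    .rename("first_order")
)
toy = toy.join(first, on="customer_id")

# Customers with no "First-time" row get a date that precedes the sample
toy["first_order"] = toy["first_order"].fillna(pd.Timestamp("2000-01-01"))

# Steps 3 and 4: group by order month, cohort month, and customer type, then sum
toy["month"] = toy["Day"].dt.to_period("M")
toy["cohort"] = toy["first_order"].dt.to_period("M")
agg = toy.groupby(["month", "cohort", "customer_type"])["total_sales"].sum()

# Step 5: move cohort and customer type into the columns
table = agg.unstack(["cohort", "customer_type"])
print(table)
```

Combinations that never occur in the data show up as missing values after `unstack`, which is expected: a cohort cannot place orders in months before it exists.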


Good luck!