Tabular Data and Variable Summaries

Data “on disk”

Data are stored in plain text files

name,pclass,survived,sex,age,sibsp,parch,ticket,fare,cabin,embarked,boat,body,home.dest
"Allen, Miss. Elisabeth Walton",1,1,female,29,0,0,24160,211.3375,B5,S,2,,"St Louis, MO"
"Allison, Master. Hudson Trevor",1,1,male,0.9167,1,2,113781,151.5500,C22 C26,S,11,,"Montreal, PQ / Chesterville, ON"
"Allison, Miss. Helen Loraine",1,0,female,2,1,2,113781,151.5500,C22 C26,S,,,"Montreal, PQ / Chesterville, ON"
"Allison, Mr. Hudson Joshua Creighton",1,0,male,30,1,2,113781,151.5500,C22 C26,S,,135,"Montreal, PQ / Chesterville, ON"
"Allison, Mrs. Hudson J C (Bessie Waldo Daniels)",1,0,female,25,1,2,113781,151.5500,C22 C26,S,,,"Montreal, PQ / Chesterville, ON"
"Anderson, Mr. Harry",1,1,male,48,0,0,19952,26.5500,E12,S,3,,"New York, NY"
"Andrews, Miss. Kornelia Theodosia",1,1,female,63,1,0,13502,77.9583,D7,S,10,,"Hudson, NY"
"Andrews, Mr. Thomas Jr",1,0,male,39,0,0,112050,0.0000,A36,S,,,"Belfast, NI"
"Appleton, Mrs. Edward Dale (Charlotte Lamson)",1,1,female,53,2,0,11769,51.4792,C101,S,D,,"Bayside, Queens, NY"
"Artagaveytia, Mr. Ramon",1,0,male,71,0,0,PC 17609,49.5042,,C,,22,"Montevideo, Uruguay"
"Astor, Col. John Jacob",1,0,male,47,1,0,PC 17757,227.5250,C62 C64,C,,124,"New York, NY"
  • This is called a csv (comma-separated) file.

  • You might see it stored as something.csv or something.txt

  • .txt files might have different delimiters (separators)

Reading data

We read the data into a program like Python by specifying:

  • what type of file it is (e.g., .csv, .txt, .xlsx)

  • where the csv file is located (the “path”)

  • if the file has a header

  • … and other information in special cases!

Example using pandas data frame:

df = pd.read_csv("data/titanic.csv")

read_csv() lives in the pandas library

df.head()
   PassengerId  Survived  Pclass  ...     Fare Cabin  Embarked
0            1         0       3  ...   7.2500   NaN         S
1            2         1       1  ...  71.2833   C85         C
2            3         1       3  ...   7.9250   NaN         S
3            4         1       1  ...  53.1000  C123         S
4            5         0       3  ...   8.0500   NaN         S

[5 rows x 12 columns]

Lecture 1.1 Check in

df = pd.read_csv("data/titanic.csv")

Question 1: What if this file lived online instead of locally?

Question 2: Why didn’t we have to specify that this dataset has a header?

Data Frames are Comprised of Rows & Columns

Looking at the Rows

df.loc[1, :]
PassengerId                                                    2
Survived                                                       1
Pclass                                                         1
Name           Cumings, Mrs. John Bradley (Florence Briggs Th...
Sex                                                       female
Age                                                         38.0
SibSp                                                          1
Parch                                                          0
Ticket                                                  PC 17599
Fare                                                     71.2833
Cabin                                                        C85
Embarked                                                       C
Name: 1, dtype: object
df.iloc[1, :]
PassengerId                                                    2
Survived                                                       1
Pclass                                                         1
Name           Cumings, Mrs. John Bradley (Florence Briggs Th...
Sex                                                       female
Age                                                         38.0
SibSp                                                          1
Parch                                                          0
Ticket                                                  PC 17599
Fare                                                     71.2833
Cabin                                                        C85
Embarked                                                       C
Name: 1, dtype: object

What is the difference between .loc and .iloc?

loc, iloc, and index

df2 = df.set_index('Name')
                                                    PassengerId  ...  Embarked
Name                                                             ...          
Braund, Mr. Owen Harris                                       1  ...         S
Cumings, Mrs. John Bradley (Florence Briggs Tha...            2  ...         C
Heikkinen, Miss. Laina                                        3  ...         S
Futrelle, Mrs. Jacques Heath (Lily May Peel)                  4  ...         S
Allen, Mr. William Henry                                      5  ...         S

[5 rows x 11 columns]


Why are there 11 columns now? (There were 12 before!)

loc, iloc, and index

df2.loc[1, :]
KeyError: 1
df2.iloc[1, :]
PassengerId           2
Survived              1
Pclass                1
Sex              female
Age                38.0
SibSp                 1
Parch                 0
Ticket         PC 17599
Fare            71.2833
Cabin               C85
Embarked              C
Name: Cumings, Mrs. John Bradley (Florence Briggs Thayer), dtype: object

Why does .loc return an error but .iloc doesn’t?

loc, iloc, and index

.loc – label-based location

  • Uses labels from rows (rownames) to select data
df2.loc["Allison, Master. Hudson Trevor", :]
PassengerId        306
Survived             1
Pclass               1
Sex               male
Age               0.92
SibSp                1
Parch                2
Ticket          113781
Fare            151.55
Cabin          C22 C26
Embarked             S
Name: Allison, Master. Hudson Trevor, dtype: object

.iloc – integer location

  • Uses indices (positions) from rows to select data
df2.iloc[1, :]
PassengerId           2
Survived              1
Pclass                1
Sex              female
Age                38.0
SibSp                 1
Parch                 0
Ticket         PC 17599
Fare            71.2833
Cabin               C85
Embarked              C
Name: Cumings, Mrs. John Bradley (Florence Briggs Thayer), dtype: object

Looking at the Columns

df.columns
Index(['PassengerId', 'Survived', 'Pclass', 'Name', 'Sex', 'Age', 'SibSp',
       'Parch', 'Ticket', 'Fare', 'Cabin', 'Embarked'],
      dtype='object')
df['Age']
0      22.0
1      38.0
2      26.0
3      35.0
4      35.0
       ... 
886    27.0
887    19.0
888     NaN
889    26.0
890    32.0
Name: Age, Length: 891, dtype: float64

NaN (Not a Number) represents missing or null data

Caution: Object types

type(df)
<class 'pandas.core.frame.DataFrame'>


type(df.iloc[1, :])
<class 'pandas.core.series.Series'>


type(df['Name'])
<class 'pandas.core.series.Series'>

A Series is a one-dimensional labeled array (a vector with labels)

Summarizing a Data Frame

Questions to Ask

  • Which variables (columns) are categorical?

  • Which variables are quantitative?

  • Which variables are labels (e.g. names or ID numbers)?

  • Which variables are text?

A quick look at the data

df.describe()
       PassengerId    Survived      Pclass  ...       SibSp       Parch        Fare
count   891.000000  891.000000  891.000000  ...  891.000000  891.000000  891.000000
mean    446.000000    0.383838    2.308642  ...    0.523008    0.381594   32.204208
std     257.353842    0.486592    0.836071  ...    1.102743    0.806057   49.693429
min       1.000000    0.000000    1.000000  ...    0.000000    0.000000    0.000000
25%     223.500000    0.000000    2.000000  ...    0.000000    0.000000    7.910400
50%     446.000000    0.000000    3.000000  ...    0.000000    0.000000   14.454200
75%     668.500000    1.000000    3.000000  ...    1.000000    0.000000   31.000000
max     891.000000    1.000000    3.000000  ...    8.000000    6.000000  512.329200

[8 rows x 7 columns]


Lecture 1.1 Check in

Question 3: What percent of Titanic passengers survived?

Question 4: What was the average (mean) fare paid for a ticket?

Changing Variable Types

  • The variable Pclass is categorical, but Python assumed it was quantitative because it is comprised of numbers.

  • It’s our job to check and fix data!

df["Pclass"] = df["Pclass"].astype("category")


Why choose to store pclass as a "category" instead of a "string"?

Summary of categorical variable

df["Pclass"].value_counts()
Pclass
3    491
1    216
2    184
Name: count, dtype: int64


df["Pclass"].value_counts(normalize = True)
Pclass
3    0.551066
1    0.242424
2    0.206510
Name: proportion, dtype: float64

Lecture 1.1 Check in

Question 5: What percent of Titanic passengers were in First Class?

Question 6: Which is the correct way to change a numeric column to a categorical variable?