Tabular Data and Variable Summaries

Data “on disk”

Data are stored in plain text files

name,pclass,survived,sex,age,sibsp,parch,ticket,fare,cabin,embarked,boat,body,home.dest
"Allen, Miss. Elisabeth Walton",1,1,female,29,0,0,24160,211.3375,B5,S,2,,"St Louis, MO"
"Allison, Master. Hudson Trevor",1,1,male,0.9167,1,2,113781,151.5500,C22 C26,S,11,,"Montreal, PQ / Chesterville, ON"
"Allison, Miss. Helen Loraine",1,0,female,2,1,2,113781,151.5500,C22 C26,S,,,"Montreal, PQ / Chesterville, ON"
"Allison, Mr. Hudson Joshua Creighton",1,0,male,30,1,2,113781,151.5500,C22 C26,S,,135,"Montreal, PQ / Chesterville, ON"
"Allison, Mrs. Hudson J C (Bessie Waldo Daniels)",1,0,female,25,1,2,113781,151.5500,C22 C26,S,,,"Montreal, PQ / Chesterville, ON"
"Anderson, Mr. Harry",1,1,male,48,0,0,19952,26.5500,E12,S,3,,"New York, NY"
"Andrews, Miss. Kornelia Theodosia",1,1,female,63,1,0,13502,77.9583,D7,S,10,,"Hudson, NY"
"Andrews, Mr. Thomas Jr",1,0,male,39,0,0,112050,0.0000,A36,S,,,"Belfast, NI"
"Appleton, Mrs. Edward Dale (Charlotte Lamson)",1,1,female,53,2,0,11769,51.4792,C101,S,D,,"Bayside, Queens, NY"
"Artagaveytia, Mr. Ramon",1,0,male,71,0,0,PC 17609,49.5042,,C,,22,"Montevideo, Uruguay"
"Astor, Col. John Jacob",1,0,male,47,1,0,PC 17757,227.5250,C62 C64,C,,124,"New York, NY"
  • This is called a csv (comma-separated) file.

  • You might see it stored as something.csv or something.txt

  • .txt files might have different delimiters (separators)

Reading data

We read the data into a program like Python by specifying:

  • what type of file it is (e.g., .csv, .txt, .xlsx)

  • where the csv file is located (the “path”)

  • if the file has a header

  • … and other information in special cases!

Example using pandas data frame:

df = pd.read_csv("https://datasci112.stanford.edu/data/titanic.csv")

read_csv() lives in pandas

df.head()
                                              name  ...                        home.dest
0                    Allen, Miss. Elisabeth Walton  ...                     St Louis, MO
1                   Allison, Master. Hudson Trevor  ...  Montreal, PQ / Chesterville, ON
2                     Allison, Miss. Helen Loraine  ...  Montreal, PQ / Chesterville, ON
3             Allison, Mr. Hudson Joshua Creighton  ...  Montreal, PQ / Chesterville, ON
4  Allison, Mrs. Hudson J C (Bessie Waldo Daniels)  ...  Montreal, PQ / Chesterville, ON

[5 rows x 14 columns]

Lecture 1.1 Check in

df = pd.read_csv("https://datasci112.stanford.edu/data/titanic.csv")

Question 1: What if this file lived on a computer instead of online?

Question 2: Why didn’t we have to specify that this dataset has a header?

Looking at the rows

df.loc[1, :]
name          Allison, Master. Hudson Trevor
pclass                                     1
survived                                   1
sex                                     male
age                                   0.9167
sibsp                                      1
parch                                      2
ticket                                113781
fare                                  151.55
cabin                                C22 C26
embarked                                   S
boat                                      11
body                                     NaN
home.dest    Montreal, PQ / Chesterville, ON
Name: 1, dtype: object
df.iloc[1, :]
name          Allison, Master. Hudson Trevor
pclass                                     1
survived                                   1
sex                                     male
age                                   0.9167
sibsp                                      1
parch                                      2
ticket                                113781
fare                                  151.55
cabin                                C22 C26
embarked                                   S
boat                                      11
body                                     NaN
home.dest    Montreal, PQ / Chesterville, ON
Name: 1, dtype: object
  • What is the difference between .loc and .iloc?

  • What type of object is returned?

loc, iloc, and index

df2 = df.set_index('name')
                                                 pclass  ...                        home.dest
name                                                     ...                                 
Allen, Miss. Elisabeth Walton                         1  ...                     St Louis, MO
Allison, Master. Hudson Trevor                        1  ...  Montreal, PQ / Chesterville, ON
Allison, Miss. Helen Loraine                          1  ...  Montreal, PQ / Chesterville, ON
Allison, Mr. Hudson Joshua Creighton                  1  ...  Montreal, PQ / Chesterville, ON
Allison, Mrs. Hudson J C (Bessie Waldo Daniels)       1  ...  Montreal, PQ / Chesterville, ON

[5 rows x 13 columns]


Why are there 13 columns now? (There were 14 before!)

loc, iloc, and index

df2.loc[1, :]
KeyError: 1
df2.iloc[1, :]
pclass                                     1
survived                                   1
sex                                     male
age                                   0.9167
sibsp                                      1
parch                                      2
ticket                                113781
fare                                  151.55
cabin                                C22 C26
embarked                                   S
boat                                      11
body                                     NaN
home.dest    Montreal, PQ / Chesterville, ON
Name: Allison, Master. Hudson Trevor, dtype: object

Why is .loc returning an error?

Why is .iloc not returning an error?

loc, iloc, and index

.loc – label-based location

  • Uses labels from rows (rownames) to select data
df2.loc["Allison, Master. Hudson Trevor", :]
pclass                                     1
survived                                   1
sex                                     male
age                                   0.9167
sibsp                                      1
parch                                      2
ticket                                113781
fare                                  151.55
cabin                                C22 C26
embarked                                   S
boat                                      11
body                                     NaN
home.dest    Montreal, PQ / Chesterville, ON
Name: Allison, Master. Hudson Trevor, dtype: object

.iloc – integer location

  • Uses indices (positions) from rows to select data
df2.iloc[1, :]
pclass                                     1
survived                                   1
sex                                     male
age                                   0.9167
sibsp                                      1
parch                                      2
ticket                                113781
fare                                  151.55
cabin                                C22 C26
embarked                                   S
boat                                      11
body                                     NaN
home.dest    Montreal, PQ / Chesterville, ON
Name: Allison, Master. Hudson Trevor, dtype: object

Looking at columns

df.columns
Index(['name', 'pclass', 'survived', 'sex', 'age', 'sibsp', 'parch', 'ticket',
       'fare', 'cabin', 'embarked', 'boat', 'body', 'home.dest'],
      dtype='object')
df['home.dest']
0                          St Louis, MO
1       Montreal, PQ / Chesterville, ON
2       Montreal, PQ / Chesterville, ON
3       Montreal, PQ / Chesterville, ON
4       Montreal, PQ / Chesterville, ON
                     ...               
1304                                NaN
1305                                NaN
1306                                NaN
1307                                NaN
1308                                NaN
Name: home.dest, Length: 1309, dtype: object

NaN (Not a Number) represents missing or null data

Caution: Object types

type(df)
<class 'pandas.core.frame.DataFrame'>


type(df.iloc[1, :])
<class 'pandas.core.series.Series'>


type(df['name'])
<class 'pandas.core.series.Series'>

A Series is a one-dimensional labeled array (a vector with labels)

Summarizing a Data Frame

Questions to ask

  • Which variables (columns) are categorical?

  • Which variables are quantitative?

  • Which variables are labels (e.g. names or ID numbers)?

  • Which variables are text?

A quick look at the data

df.describe()
            pclass     survived  ...         fare        body
count  1309.000000  1309.000000  ...  1308.000000  121.000000
mean      2.294882     0.381971  ...    33.295479  160.809917
std       0.837836     0.486055  ...    51.758668   97.696922
min       1.000000     0.000000  ...     0.000000    1.000000
25%       2.000000     0.000000  ...     7.895800   72.000000
50%       3.000000     0.000000  ...    14.454200  155.000000
75%       3.000000     1.000000  ...    31.275000  256.000000
max       3.000000     1.000000  ...   512.329200  328.000000

[8 rows x 7 columns]


Lecture 1.1 Check in

Question 3: What percent of Titanic passengers survived?

Question 4: What was the average (mean) fare paid for a ticket?

Changing Variable Types

  • The variable pclass was categorical, but Python assumed it was quantitative.

  • It’s our job to check and fix data!

df["pclass"] = df["pclass"].astype("category")


Why choose to store pclass as a "category" instead of a "string"?

Summary of categorical variable

df["pclass"].value_counts()
pclass
3    709
1    323
2    277
Name: count, dtype: int64


df["pclass"].value_counts(normalize = True)
pclass
3    0.541635
1    0.246753
2    0.211612
Name: proportion, dtype: float64

Lecture 1.1 Check in

Question 5: What percent of Titanic passengers were in First Class?

Question 6: Which is the correct way to change a numeric column to a categorical variable?