JSON Data & APIs
Tabular Data
It’s likely that a huge portion of the datasets you’ve worked with in your life thus far are very well structured and organized to the point of being tabular. That is, the dataset consists of a single table with rows and columns. If you’re not familiar with the concept of tidy data I suggest you check out this tidyr vignette on what it means for a dataset to be “tidy.” Most of your datasets have likely even been tidy:
- Each variable forms a column
- Each observation forms a row
Almost everyone’s experiences beyond this probably involved taking multiple tidy datasets and combining them in some way, but the overall dataset still fit into this tabular structure. These often came in the form of TXT, CSV, Excel, or TSV files.
Our world is rich with data and a great deal of it doesn’t come in nice, neat tables…BUT a lot of it is still extremely structured and well organized.
Structured, but Non-Tabular Data
There is another very popular data format called “JSON”. JSON is short for JavaScript Object Notation and is a syntax for storing and exchanging data, but data in a much broader sense than we’re used to thinking about.
In the middle of a statistics-related course when I hear the word “data” my mind instinctively envisions something tabular with which we could do some sort of visualization or analysis with. However, in the broadest sense, the word “data” really just describes pieces of information which encapsulates everything from movie theater schedules when you look up showtimes on a website to what your cell phone uses as part of your “data plan.”
While the website for your favorite movie theater may not have had your visualization and analysis plans in mind, their data may still be of interest to us. Many companies and organizations have built tools that allow access to their data in a more streamlined way, and JSON has become a ubiquitous format for many of these data sources to use.
Hierarchical Data
One of the reasons the JSON format is so important and useful is because it can accommodate very complex types and shapes of data. So while it may not always be workable into a tabular form, it still maintains a high degree of structure that we can exploit!
R for Data Science, Chapter 23
Only read Sections 1-4 for now!
- What happens when you use
unnest_wider()with unnamed list-columns likedf2?
- What argument is now necessary to correctly unnest
df2? Specifically, the column names should bey_1,y_2, andy_3?
unnest_wider(df2, col = y)Error in `unnest_wider()`:
ℹ In column: `y`.
ℹ In row: 1.
Caused by error:
! Can't unnest elements with missing names.
ℹ Supply `names_sep` to generate automatic names.
- How does
unnest_wider()reconcile columns with different lengths?
JSON Data
R for Data Science, Chapter 23
Now read Sections 5 & 6!
- JSON Objects are held in _____.
- parentheses
- double quotes
- square brackets
- curly braces
- While it may seem tedious compared to our usual tabular forms of data, JSON data is easy to read because they _____.
- come in name / value pairs
- are separated by “—–”
- come in table / name / value triplets
- Create rectangular versions of
df_colanddf_row(shown below).
json_col <- parse_json('
{
"x": ["a", "x", "z"],
"y": [10, null, 3]
}
')
df_col <- tibble(json = list(json_col))
df_col# A tibble: 1 × 1
json
<list>
1 <named list [2]>
json_row <- parse_json('
[
{"x": "a", "y": 10},
{"x": "x", "y": null},
{"x": "z", "y": 3}
]
')
df_row <- tibble(json = json_row)
df_row# A tibble: 3 × 1
json
<list>
1 <named list [2]>
2 <named list [2]>
3 <named list [2]>
In both cases, your unnested data should look like this:
| x | y |
|---|---|
| a | 10 |
| x | NA |
| z | 3 |
Working with JSON Data
The chapter covers a few functions from the jsonlite package, but there are several other packages for working with JSON data in R (e.g., rjson, rjsonio). These packages automatically transform JSON data into nested R lists or complex data frames. However, working with these complex objects can be difficult.
The tidyjson package takes a different approach to structuring JSON data into tidy data frames. Similar to tidyr, tidyjson builds a grammar for manipulating JSON into a tidy table structure.
tidyjson is based on the following principles:
- Work on a single JSON document, or on a collection of related documents
- Create pipelines with the pipe operator (
%>%), producing code that can be read from left to right - Guarantee the structure of the data produced, even if the input JSON structure changes (with the exception of
spread_all()) - Work with arbitrarily nested arrays or objects
- Handle ‘ragged’ arrays and / or objects (varying lengths by document)
- Ensure edge cases are handled correctly (especially empty data)
- Integrate seamlessly with
dplyr, allowingtbl_jsonobjects to pipe in and out ofdplyrverbs where reasonable
Let’s use the companies JSON dataset from the tidyjson package to explore tools from the tidyjson package. This exploration parallels the worldbank examples from the tidyjson vignette.
chr [1:1000] "{ \"_id\" : { \"$oid\" : \"52cdef7e4bab8bd67529bd0c\" }, \"name\" : \"OutSmart Power Systems\", \"permalink\" :"| __truncated__ ...
Step 1 – Use spread_all() to spread the values of a JSON object
companies_tidy <- spread_all(companies)How does
spread_all()know how many columns to make?Is the JSON data retained after you have spread the data?
Step 2 – Find the values that were not able to be spread
To identify the names of an object, their type, and frequency we use the following pipeline:
name_of_data %>%
gather_object() %>%
json_types() %>%
count(name, type)The gather_object() function collapses a JSON object into name-value pairs, producing a tbl_json object. The json_types() function then inspects the JSON associated with each row of the tbl_json object, and adds a new column ("type") that identifies the type according to the JSON standard. We can then count() the number of occurrences of each name, type combination to understand how many rows / observations contained these values.
- Fill in the code below to discover how many different values are included in the
companiesdataset?
companies_values <- companies %>%
__________ %>%
json_types() %>%
__________Now that we know the name of every value in the companies dataset, we can inspect how many of these columns were not captured when we used spread-all().
- Fill in the code below to uncover how many arrays from the
companiesdataset were not spread into new columns.
companies_values %>%
filter(
________ %in% ________
) %>%
filter(name != ____) Step 3 – Manually spread the problem values
Now that we know the names of the values that were not spread into columns, we can wrangle specific values to capture their values. Let’s do this with the acquisitions value.
We do this with the following pipeline:
name_of_dataset %>%
enter_object(name_of_value) %>%
gather_array() %>%
spread_all()The enter_object() function allows you to navigate a specific value of the JSON data by referencing its name. gather_array() then collapses a JSON array into index-value pairs, creating a new column array.index to store the index of the array, and storing values in the ‘JSON’ attribute for further tidyjson manipulation.
- Fill in the code below to spread the
acquisitionsarray of thecompaniesdataset.
companies %>%
enter_object(____) %>%
gather_array() %>%
________ - What are the names of the new columns that appear in the spread dataset?
Step 4 – Combining everything together
By combining with our initial top-level spread_all(), we can aggregate the sale price of companies (’price_amount) bycategory_codeand the currency it was sold in (price_currency_code`).
company_max_sales <- companies %>%
spread_all() %>%
# Manually extracting information from acquisitions
enter_object(acquisitions) %>%
gather_array() %>%
spread_all() %>%
# Removing missing values from relevant variables
drop_na(category_code, price_currency_code, price_amount) %>%
# Finding max sale price for different company categories
group_by(category_code, price_currency_code) %>%
summarize(
max_price = max(price_amount),
.groups = "drop"
)
company_max_sales# A tibble: 14 × 3
category_code price_currency_code max_price
<chr> <chr> <dbl>
1 biotech USD 587000000
2 consulting USD 4500000
3 enterprise USD 4900000000
4 hardware USD 710000000
5 mobile USD 168000000
6 network_hosting USD 45340000
7 news USD 1850000000
8 other USD 23700000
9 public_relations USD 25000000
10 search USD 225000000
11 security USD 242000000
12 semiconductor USD 342500000
13 software USD 155000000
14 web USD 260000000
To get a professional looking table, we can reformat the columns and use the gt package:
library(gt)
company_max_sales |>
mutate(
# Convert to dollar format
max_price = scales::dollar(max_price),
# Remove _ from company sector names
category_code = str_replace_all(category_code,
pattern = "_",
replacement = " "),
# Capitalize the first letter of each word
category_code = str_to_title(category_code)
) %>%
gt() %>%
cols_label(category_code = md("**Business Sector**"),
price_currency_code = md("**Currency of Sale**"),
max_price = md("**Max Company Sale Price**")
)| Business Sector | Currency of Sale | Max Company Sale Price |
|---|---|---|
| Biotech | USD | $587,000,000 |
| Consulting | USD | $4,500,000 |
| Enterprise | USD | $4,900,000,000 |
| Hardware | USD | $710,000,000 |
| Mobile | USD | $168,000,000 |
| Network Hosting | USD | $45,340,000 |
| News | USD | $1,850,000,000 |
| Other | USD | $23,700,000 |
| Public Relations | USD | $25,000,000 |
| Search | USD | $225,000,000 |
| Security | USD | $242,000,000 |
| Semiconductor | USD | $342,500,000 |
| Software | USD | $155,000,000 |
| Web | USD | $260,000,000 |
APIs
Now that we’ve talked about the JSON format for data files, it’s time we talked about one of the primary vehicles for them: APIs.
This is the tip of the iceberg when it comes to APIs, but it will suffice to enable a lot of cool things for us. The video above used some nice examples like the restaurant and travel services, but there are APIs for accessing all kinds of data out there, like cat facts](https://alexwohlbruck.github.io/cat-facts/), movie data, and NASA data.
For our class we’re not interested in creating APIs, but mostly using them to access data that we otherwise couldn’t get our hands on…at least not as easily.
Warning! The tutorial references the wrong link to access the API on the number of people in space. The correct link is: http://api.open-notify.org/astros.json
- Which package does the tutorial use to make requests to the API server?
- rvest
- jsonlite
- httr
- tidyjson
- XML
- What type of request will we likely start with when getting data from an API?
PULL()GIMME()GET()SUMMON()
If your function call successfully requested information from the API, what
statusshould you see in the response object?What
statuscode would you get if your call to the API returned no content? Hint: The http status codes might be helpful