JSON Data & APIs

Tabular Data

It’s likely that most of the datasets you’ve worked with in your life thus far have been so well structured and organized that they were tabular. That is, the dataset consists of a single table with rows and columns. If you’re not familiar with the concept of tidy data, I suggest you check out the tidyr vignette on what it means for a dataset to be “tidy.” Most of your datasets have likely even been tidy:

  • Each variable forms a column
  • Each observation forms a row
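
As a quick illustration, a tidy dataset might look like the following (a made-up example; the islands and counts are invented):

```r
library(tibble)

# Each variable (island, year, n_penguins) is a column;
# each observation (one island in one year) is a row.
penguin_counts <- tribble(
  ~island,  ~year, ~n_penguins,
  "Biscoe",  2007,          44,
  "Biscoe",  2008,          64,
  "Dream",   2007,          46
)

penguin_counts
```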

Even when your work went beyond a single table, it probably involved combining multiple tidy datasets in some way, and the overall dataset still fit this tabular structure. These datasets often came in the form of TXT, CSV, TSV, or Excel files.

Our world is rich with data, and a great deal of it doesn’t come in nice, neat tables… but a lot of it is still extremely structured and well organized.

Structured, but Non-Tabular Data

There is another very popular data format called JSON, short for JavaScript Object Notation. It is a syntax for storing and exchanging data, but “data” in a much broader sense than we’re used to thinking about.

In the middle of a statistics-related course, when I hear the word “data” my mind instinctively envisions something tabular with which we could do some sort of visualization or analysis. In the broadest sense, however, “data” really just describes pieces of information, encapsulating everything from movie theater schedules when you look up showtimes on a website to whatever your cell phone counts against your “data plan.”

While the website for your favorite movie theater may not have had your visualization and analysis plans in mind, their data may still be of interest to us. Many companies and organizations have built tools that allow access to their data in a more streamlined way, and JSON has become a ubiquitous format for many of these data sources to use.

Hierarchical Data

One of the reasons the JSON format is so important and useful is because it can accommodate very complex types and shapes of data. So while it may not always be workable into a tabular form, it still maintains a high degree of structure that we can exploit!
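
For example, here is a small, made-up JSON document (parsed with jsonlite’s parse_json()) whose values are themselves arrays and objects; no single table naturally holds this shape:

```r
library(jsonlite)

# A hypothetical movie theater's schedule: an object whose
# "showings" value is an array of objects, each of which
# contains its own array of times.
theater <- parse_json('
  {
    "name": "Downtown Cinema",
    "showings": [
      {"film": "Arrival",    "times": ["7:00", "9:30"]},
      {"film": "Paddington", "times": ["5:15"]}
    ]
  }
')

str(theater, max.level = 2)
```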

Required Reading

R for Data Science, Chapter 23

Only read Sections 1-4 for now!

Check In
  1. What happens when you use unnest_wider() with unnamed list-columns like df2?
df2 <- tribble(
  ~x, ~y,
  1, list(11, 12, 13),
  2, list(21),
  3, list(31, 32),
)
  2. What argument is now necessary to correctly unnest df2? Specifically, the column names should be y_1, y_2, and y_3.
unnest_wider(df2, col = y)
Error in `unnest_wider()`:
ℹ In column: `y`.
ℹ In row: 1.
Caused by error:
! Can't unnest elements with missing names.
ℹ Supply `names_sep` to generate automatic names.
  3. How does unnest_wider() reconcile columns with different lengths?
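
If you want to experiment with these questions yourself, here is a self-contained version of the df2 example (assuming the tibble and tidyr packages are installed); the names_sep call simply follows the hint in the error message above:

```r
library(tibble)
library(tidyr)

# The unnamed list-column example from the check-in
df2 <- tribble(
  ~x, ~y,
  1, list(11, 12, 13),
  2, list(21),
  3, list(31, 32),
)

# unnest_wider(df2, y)  # errors: the list elements have no names
# Per the error's hint, names_sep builds names from the column
# name plus each element's position: y_1, y_2, y_3
df2_wide <- unnest_wider(df2, y, names_sep = "_")
df2_wide
```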

JSON Data

Required Reading

R for Data Science, Chapter 23

Now read Sections 5 & 6!

Check In
  1. JSON Objects are held in _____.
  • parentheses
  • double quotes
  • square brackets
  • curly braces
  2. While it may seem tedious compared to our usual tabular forms of data, JSON data is easy to read because they _____.
  • come in name / value pairs
  • are separated by “—–”
  • come in table / name / value triplets
  3. Create rectangular versions of df_col and df_row (shown below).
json_col <- parse_json('
  {
    "x": ["a", "x", "z"],
    "y": [10, null, 3]
  }
')

df_col <- tibble(json = list(json_col)) 

df_col
# A tibble: 1 × 1
  json            
  <list>          
1 <named list [2]>
json_row <- parse_json('
  [
    {"x": "a", "y": 10},
    {"x": "x", "y": null},
    {"x": "z", "y": 3}
  ]
')

df_row <- tibble(json = json_row)

df_row
# A tibble: 3 × 1
  json            
  <list>          
1 <named list [2]>
2 <named list [2]>
3 <named list [2]>

In both cases, your unnested data should look like this:

x y
a 10
x NA
z 3

Working with JSON Data

The chapter covers a few functions from the jsonlite package, but there are several other packages for working with JSON data in R (e.g., rjson, RJSONIO). These packages automatically transform JSON data into nested R lists or complex data frames, but working with these complex objects can be difficult.
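
To see the difference, here is a small sketch with invented data: jsonlite’s fromJSON() simplifies where it can, but with simplification turned off you get the nested lists these packages otherwise hide.

```r
library(jsonlite)

json_txt <- '[{"x": "a", "y": 10}, {"x": "b", "y": 20}]'

# By default, fromJSON() simplifies this array of objects
# into a two-row data frame
simplified <- fromJSON(json_txt)
simplified

# Without simplification, we get a nested list of lists,
# which is harder to work with directly
nested <- fromJSON(json_txt, simplifyVector = FALSE)
str(nested)
```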

The tidyjson package takes a different approach to structuring JSON data into tidy data frames. Similar to tidyr, tidyjson builds a grammar for manipulating JSON into a tidy table structure.

tidyjson is based on the following principles:

  • Work on a single JSON document, or on a collection of related documents
  • Create pipelines with the pipe operator (%>%), producing code that can be read from left to right
  • Guarantee the structure of the data produced, even if the input JSON structure changes (with the exception of spread_all())
  • Work with arbitrarily nested arrays or objects
  • Handle ‘ragged’ arrays and / or objects (varying lengths by document)
  • Ensure edge cases are handled correctly (especially empty data)
  • Integrate seamlessly with dplyr, allowing tbl_json objects to pipe in and out of dplyr verbs where reasonable

Let’s use the companies JSON dataset included in the tidyjson package to explore its tools. This exploration parallels the worldbank examples from the tidyjson vignette.

library(tidyjson)

# Companies is a long character vector
companies %>% 
  str()
 chr [1:1000] "{ \"_id\" : { \"$oid\" : \"52cdef7e4bab8bd67529bd0c\" }, \"name\" : \"OutSmart Power Systems\", \"permalink\" :"| __truncated__ ...

Step 1 – Use spread_all() to spread the values of a JSON object

companies_tidy <- spread_all(companies)
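
Before answering the check-in below, it may help to see spread_all() on a tiny, invented pair of JSON documents (assuming the tidyjson package is installed):

```r
library(tidyjson)

# Two made-up documents; the second is missing "pet"
people <- c(
  '{"name": "Ada",   "age": 36, "pet": {"kind": "cat"}}',
  '{"name": "Grace", "age": 45}'
)

# spread_all() makes one column per scalar value it finds,
# recursing into nested objects (pet.kind) and filling NA
# where a document lacks that value
people_wide <- spread_all(people)
people_wide
```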
Check In
  1. How does spread_all() know how many columns to make?

  2. Is the JSON data retained after you have spread the data?

Step 2 – Find the values that were not able to be spread

To identify the names within an object, their types, and their frequencies, we use the following pipeline:

name_of_data %>% 
  gather_object() %>% 
  json_types() %>% 
  count(name, type)

The gather_object() function collapses a JSON object into name-value pairs, producing a tbl_json object. The json_types() function then inspects the JSON associated with each row of the tbl_json object and adds a new column ("type") that classifies each value according to the JSON standard. We can then count() the occurrences of each name / type combination to understand how many rows / observations contained these values.
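
Here is that pipeline run on a single, made-up document (assuming tidyjson and dplyr are loaded):

```r
library(tidyjson)
library(dplyr)

doc <- '{"name": "Ada", "tags": ["r", "json"], "address": {"city": "SLO"}}'

doc %>% 
  gather_object() %>%  # one row per name-value pair
  json_types() %>%     # classify each value (string, array, object, ...)
  count(name, type)
```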

Check In
  1. Fill in the code below to discover how many different values are included in the companies dataset.
companies_values <- companies %>% 
  __________ %>% 
  json_types() %>% 
  __________

Now that we know the name of every value in the companies dataset, we can inspect how many of these values were not captured when we used spread_all().

Check In
  1. Fill in the code below to uncover how many arrays from the companies dataset were not spread into new columns.
companies_values %>% 
  filter(
    ________ %in% ________
      ) %>% 
  filter(name != ____) 

Step 3 – Manually spread the problem values

Now that we know the names of the values that were not spread into columns, we can wrangle those values manually. Let’s do this with the acquisitions value.

We do this with the following pipeline:

name_of_dataset %>% 
  enter_object(name_of_value) %>% 
  gather_array() %>% 
  spread_all()

The enter_object() function lets you navigate to a specific value of the JSON data by referencing its name. gather_array() then collapses a JSON array into index-value pairs, creating a new column, array.index, to store each element’s position in the array, and storing the values in the ‘JSON’ attribute for further tidyjson manipulation.
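
To make the pieces concrete, here is the same pipeline on a small, invented document with one nested array:

```r
library(tidyjson)

doc <- '{
  "company": "Acme",
  "acquisitions": [
    {"price_amount": 100, "year": 2019},
    {"price_amount": 250, "year": 2021}
  ]
}'

acq <- doc %>% 
  enter_object(acquisitions) %>%  # descend into the "acquisitions" value
  gather_array() %>%              # one row per element, indexed by array.index
  spread_all()                    # spread each element's scalar values

acq
```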

Check In
  1. Fill in the code below to spread the acquisitions array of the companies dataset.
companies %>% 
  enter_object(____) %>% 
  gather_array() %>% 
  ________ 
  2. What are the names of the new columns that appear in the spread dataset?

Step 4 – Combining everything together

By combining this with our initial top-level spread_all(), we can aggregate the sale price of companies (price_amount) by category_code and the currency the company was sold in (price_currency_code).

company_max_sales <- companies %>% 
  spread_all() %>% 
  # Manually extracting information from acquisitions
  enter_object(acquisitions) %>% 
  gather_array() %>% 
  spread_all() %>% 
  # Removing missing values from relevant variables
  drop_na(category_code, price_currency_code, price_amount) %>% 
  # Finding max sale price for different company categories
  group_by(category_code, price_currency_code) %>%
  summarize(
    max_price = max(price_amount), 
    .groups = "drop"
            )

company_max_sales
# A tibble: 14 × 3
   category_code    price_currency_code  max_price
   <chr>            <chr>                    <dbl>
 1 biotech          USD                  587000000
 2 consulting       USD                    4500000
 3 enterprise       USD                 4900000000
 4 hardware         USD                  710000000
 5 mobile           USD                  168000000
 6 network_hosting  USD                   45340000
 7 news             USD                 1850000000
 8 other            USD                   23700000
 9 public_relations USD                   25000000
10 search           USD                  225000000
11 security         USD                  242000000
12 semiconductor    USD                  342500000
13 software         USD                  155000000
14 web              USD                  260000000

To get a professional-looking table, we can reformat the columns and use the gt package:

library(gt)

company_max_sales |> 
  mutate(
    # Convert to dollar format
    max_price = scales::dollar(max_price), 
    # Remove _ from company sector names
    category_code = str_replace_all(category_code, 
                                    pattern = "_",
                                    replacement = " "), 
    # Capitalize the first letter of each word
    category_code = str_to_title(category_code)
    ) %>% 
  gt() %>% 
  cols_label(category_code = md("**Business Sector**"), 
             price_currency_code = md("**Currency of Sale**"), 
             max_price = md("**Max Company Sale Price**")
             )
Business Sector Currency of Sale Max Company Sale Price
Biotech USD $587,000,000
Consulting USD $4,500,000
Enterprise USD $4,900,000,000
Hardware USD $710,000,000
Mobile USD $168,000,000
Network Hosting USD $45,340,000
News USD $1,850,000,000
Other USD $23,700,000
Public Relations USD $25,000,000
Search USD $225,000,000
Security USD $242,000,000
Semiconductor USD $342,500,000
Software USD $155,000,000
Web USD $260,000,000

APIs

Now that we’ve talked about the JSON format for data files, it’s time we talked about one of the primary vehicles for them: APIs.

Required Video

This is the tip of the iceberg when it comes to APIs, but it will suffice to enable a lot of cool things for us. The video above used some nice examples like the restaurant and travel services, but there are APIs for accessing all kinds of data out there, like cat facts (https://alexwohlbruck.github.io/cat-facts/), movie data, and NASA data.

For our class we’re not interested in creating APIs, but rather in using them to access data that we otherwise couldn’t get our hands on… at least not as easily.

Required Reading

Dataquest: R API Tutorial

Warning! The tutorial references the wrong link to access the API on the number of people in space. The correct link is: http://api.open-notify.org/astros.json
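
As a preview, here is a sketch of the kind of request the tutorial builds up. It needs an internet connection and the httr and jsonlite packages, and uses the corrected URL above:

```r
library(httr)
library(jsonlite)

# Ask the Open Notify API how many people are in space right now
res <- GET("http://api.open-notify.org/astros.json")

status_code(res)  # a status of 200 indicates the request succeeded

# The body comes back as raw bytes of JSON; convert and parse it
astros <- fromJSON(rawToChar(res$content))
astros$number
```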

Check In
  1. Which package does the tutorial use to make requests to the API server?
  • rvest
  • jsonlite
  • httr
  • tidyjson
  • XML
  2. What type of request will we likely start with when getting data from an API?
  • PULL()
  • GIMME()
  • GET()
  • SUMMON()
  3. If your function call successfully requested information from the API, what status should you see in the response object?

  4. What status code would you get if your call to the API returned no content? Hint: A list of HTTP status codes might be helpful.