Hierarchical Data
Tabular Data
It’s likely that a huge portion of the datasets you’ve worked with in your life thus far are very well structured and organized, to the point of being tabular. That is, the dataset consists of a single table with rows and columns. If you’re not familiar with the concept of tidy data, I suggest you check out this tidyr vignette on what it means for a dataset to be “tidy.” Most of your datasets have likely even been tidy:
- Each variable forms a column
- Each observation forms a row
Any experiences beyond this probably involved taking multiple tidy datasets and combining them in some way, but the overall dataset still fit into this tabular structure. These datasets often came in the form of TXT, CSV, TSV, or Excel files.
Our world is rich with data, and a great deal of it doesn’t come in nice, neat tables…BUT a lot of it is still extremely structured and well organized.
Structured, but Non-Tabular Data
You’ve likely already noticed, via your work with leaflet, that there is another very popular data format called “JSON”. JSON is short for JavaScript Object Notation and is a syntax for storing and exchanging data…but data in a much broader sense than we’re used to thinking about.
In the middle of a statistics-related course, when I hear the word “data” my mind instinctively envisions something tabular with which we could do some sort of visualization or analysis. However, in the broadest sense, the word “data” really just describes pieces of information…which encapsulates everything from movie theater schedules when you look up showtimes on a website to what your cell phone uses as part of your “data plan.”
While the website for your favorite movie theater may not have had your visualization and analysis plans in mind, its data may still be of interest to us. Many companies and organizations have built tools that allow access to their data in a more streamlined way, and JSON has become a ubiquitous format for many of these data sources to use.
Hierarchical Data
One of the reasons the JSON format is so important and useful is because it can accommodate very complex types and shapes of data. So while it may not always be workable into a tabular form, it still maintains a high degree of structure that we can exploit!
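For instance, a single (made-up) JSON document can nest arrays inside objects inside arrays, a shape no single table can represent directly:

```json
{
  "theater": "Downtown Cinema",
  "showtimes": [
    {"movie": "Example Movie", "times": ["1:00", "4:30", "8:00"]},
    {"movie": "Another Film", "times": ["2:15"]}
  ]
}
```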
📖 Required Reading: R for Data Science, Chapter 23
Check-in: Unnesting
- What happens when you use `unnest_wider()` with unnamed list-columns like `df2`?
- What argument is now necessary?
- How does `unnest_wider()` reconcile columns with different lengths?
- From time-to-time you encounter data frames with multiple list-columns with aligned values. For example, in the following data frame, the values of `y` and `z` are aligned (i.e., `y` and `z` will always have the same length within a row, and the first value of `y` corresponds to the first value of `z`). What happens if you apply two `unnest_longer()` calls to this data frame? How can you preserve the relationship between `x` and `y`? (Hint: Carefully read the documentation.)
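The data frame in question isn’t reproduced here; for reference, a hypothetical data frame with aligned list-columns might look like:

```r
library(tibble)

# y and z are aligned: equal lengths within each row, and the i-th
# value of y pairs with the i-th value of z
df <- tibble(
  x = 1:2,
  y = list(1:2, 3:4),
  z = list(c("a", "b"), c("c", "d"))
)
```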
JSON Data
📖 Required Reading: R for Data Science, Chapter 23
Check-in: JSON, pt. 1
- Objects are held in _____.
  - parentheses
  - double quotes
  - square brackets
  - curly braces
- While it may seem tedious compared to our usual tabular forms of data, JSON data is easy to read because it _____.
  - comes in name / value pairs
  - is separated by “—–”
  - comes in table / name / value triplets
- Create rectangular versions of `df_col` and `df_row` (shown below).
```r
json_col <- parse_json('
  {
    "x": ["a", "x", "z"],
    "y": [10, null, 3]
  }
')
json_row <- parse_json('
  [
    {"x": "a", "y": 10},
    {"x": "x", "y": null},
    {"x": "z", "y": 3}
  ]
')

df_col <- tibble(json = list(json_col))
df_row <- tibble(json = json_row)
```
Working with JSON Data
The chapter covers a few functions from the `jsonlite` package, but there are several other packages for working with JSON data in R (e.g., `rjson`, `RJSONIO`). These packages automatically transform JSON data into nested R lists or complex data frames. However, working with these complex objects can be difficult.
The `tidyjson` package takes a different approach to structuring JSON data into tidy data frames. Similar to `tidyr`, `tidyjson` builds a grammar for manipulating JSON into a tidy table structure. `tidyjson` is based on the following principles:
- Work on a single JSON document, or on a collection of related documents
- Create pipelines with the pipe operator (`%>%`), producing code that can be read from left to right
- Guarantee the structure of the data produced, even if the input JSON structure changes (with the exception of `spread_all()`)
- Work with arbitrarily nested arrays or objects
- Handle ‘ragged’ arrays and / or objects (varying lengths by document)
- Ensure edge cases are handled correctly (especially empty data)
- Integrate seamlessly with `dplyr`, allowing `tbl_json` objects to pipe in and out of `dplyr` verbs where reasonable
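As a tiny illustration of these principles, here is a sketch that pipes a made-up JSON document through `tidyjson` and `dplyr` verbs (assumes both packages are installed):

```r
library(tidyjson)
library(dplyr)

# A single JSON document flowing through a tidyjson + dplyr pipeline
'{"name": "Acme", "founded": 1999}' %>%
  as.tbl_json() %>%
  spread_all() %>%
  filter(founded > 1990)
```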
Check-in: JSON, pt. 2
Let’s use the `companies` JSON dataset from the tidyjson package to explore its tools. This exploration parallels the `worldbank` examples from the tidyjson vignette.
Step 1 – Use `spread_all()` to spread the values of a JSON object
- When you use the `spread_all()` function on the `companies` dataset, how many columns are created?
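One way to answer this is a minimal sketch building on the setup above:

```r
# spread_all() creates one column per scalar (string, number,
# logical) value; ncol() then counts the resulting columns
companies %>%
  spread_all() %>%
  ncol()
```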
Step 2 – Find the values that were not able to be spread
To identify the names of each object, their type, and frequency, we use the following pipeline:

```r
name_of_data %>%
  gather_object() %>%
  json_types() %>%
  count(name, type)
```
The `gather_object()` function collapses a JSON object into name-value pairs, producing a `tbl_json` object. The `json_types()` function then inspects the JSON associated with each row of the `tbl_json` object, and adds a new column (`type`) that identifies the type according to the JSON standard. We can then `count()` the number of occurrences of each `name`, `type` combination to understand how many rows / observations contained these values.
- How many different values are included in the `companies` dataset?
Now that we know the name of every value in the `companies` dataset, we can inspect how many of these columns were not captured when we used `spread_all()`.
- How many arrays from the `companies` dataset were not spread into new columns?
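Because `spread_all()` only spreads scalar values, one way to spot what it skipped is to filter the name / type counts down to the non-scalar types (a sketch building on the pipeline above):

```r
# Arrays and objects are the values spread_all() cannot flatten
companies %>%
  gather_object() %>%
  json_types() %>%
  count(name, type) %>%
  filter(type %in% c("array", "object"))
```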
Step 3 – Manually spread the problem values
Now that we know the names of the values that were not spread into columns, we can wrangle those values individually. Let’s do this with the `acquisitions` value.
We do this with the following pipeline:

```r
name_of_dataset %>%
  enter_object(name_of_value) %>%
  gather_array() %>%
  spread_all()
```
The `enter_object()` function allows you to navigate to a specific value of the JSON data by referencing its name. `gather_array()` then collapses a JSON array into index-value pairs, creating a new column `array.index` to store the index of the array, and storing values in the `JSON` attribute for further tidyjson manipulation.
- Using this pipeline on the `acquisitions` array of the `companies` dataset, what values are stored in this array? That is, what are the names of the new columns that appear in the spread dataset?
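Concretely, the check-in pipeline might look like the following sketch (`glimpse()` is just one convenient way to list the new columns):

```r
# Navigate into the acquisitions array, then spread its values
companies %>%
  enter_object(acquisitions) %>%
  gather_array() %>%
  spread_all() %>%
  glimpse()
```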
Step 4 – Combining everything together
By combining this with our initial top-level `spread_all()`, we can aggregate the sale price of companies by `category_code` and the currency it was sold in.
```r
# For nice looking tables!
library(gt)

companies %>%
  spread_all() %>%
  # Manually extracting information from acquisitions
  enter_object(acquisitions) %>%
  gather_array() %>%
  spread_all() %>%
  # Removing missing values from relevant variables
  drop_na(category_code, price_currency_code, price_amount) %>%
  # Finding max sale price for different company categories
  group_by(category_code, price_currency_code) %>%
  summarize(
    max_price = max(price_amount),
    .groups = "drop"
  ) %>%
  mutate(
    # Convert to dollar format
    max_price = scales::dollar(max_price),
    # Remove _ from company sector
    category_code = str_replace_all(category_code,
                                    pattern = "_",
                                    replacement = " "),
    # Capitalize the first letter of each word
    category_code = str_to_title(category_code)
  ) %>%
  gt() %>%
  cols_label(category_code = md("**Business Sector**"),
             price_currency_code = md("**Currency of Sale**"),
             max_price = md("**Max Company Sale Price**"))
```
| Business Sector | Currency of Sale | Max Company Sale Price |
|---|---|---|
| Biotech | USD | $587,000,000 |
| Consulting | USD | $4,500,000 |
| Enterprise | USD | $4,900,000,000 |
| Hardware | USD | $710,000,000 |
| Mobile | USD | $168,000,000 |
| Network Hosting | USD | $45,340,000 |
| News | USD | $1,850,000,000 |
| Other | USD | $23,700,000 |
| Public Relations | USD | $25,000,000 |
| Search | USD | $225,000,000 |
| Security | USD | $242,000,000 |
| Semiconductor | USD | $342,500,000 |
| Software | USD | $155,000,000 |
| Web | USD | $260,000,000 |
Step 5 – Scaling to multiple columns
This is a powerful process, but you can probably see that we don’t want to repeat it manually for every value `spread_all()` was not able to parse.
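To see the problem, without a more scalable approach each un-spread value needs its own copy of the same pipeline (a sketch; the value names besides `acquisitions` are illustrative):

```r
# One near-identical pipeline per nested value quickly gets tedious
companies %>% enter_object(acquisitions) %>% gather_array() %>% spread_all()
companies %>% enter_object(funding_rounds) %>% gather_array() %>% spread_all()
companies %>% enter_object(offices) %>% gather_array() %>% spread_all()
```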