Hierarchical Data
Tabular Data
It’s likely that a huge portion of the datasets you’ve worked with in your life thus far are very well structured and organized, to the point of being tabular. That is, the dataset consists of a single table with rows and columns. If you’re not familiar with the concept of tidy data, I suggest you check out this tidyr vignette on what it means for a dataset to be “tidy.” Most of your datasets have likely even been tidy:
- Each variable forms a column
- Each observation forms a row
Any experiences beyond this probably involved taking multiple tidy datasets and combining them in some way, but the overall dataset still fit into this tabular structure. These datasets often came in the form of TXT, CSV, TSV, or Excel files.
Our world is rich with data, and a great deal of it doesn’t come in nice, neat tables…BUT a lot of it is still extremely structured and well organized.
Structured, but Non-Tabular Data
You’ve likely already noticed, via your work with leaflet, that there is another very popular data format called “JSON”. JSON is short for JavaScript Object Notation and is a syntax for storing and exchanging data…but data in a much broader sense than we’re used to thinking about.
In the middle of a statistics-related course, when I hear the word “data” my mind instinctively envisions something tabular with which we could do some sort of visualization or analysis. However, in the broadest sense, the word “data” really just describes pieces of information…which encapsulates everything from movie theater schedules when you look up showtimes on a website to what your cell phone uses as part of your “data plan.”
While the website for your favorite movie theater may not have had your visualization and analysis plans in mind, its data may still be of interest to us. Many companies and organizations have built tools that allow access to their data in a more streamlined way, and JSON has become a ubiquitous format for many of these data sources to use.
Hierarchical Data
One of the reasons the JSON format is so important and useful is because it can accommodate very complex types and shapes of data. So while it may not always be workable into a tabular form, it still maintains a high degree of structure that we can exploit!
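For instance, a single (made-up) JSON document can nest arrays inside objects inside arrays, a shape no single table can represent directly:

```json
{
  "theater": "Downtown Cinema",
  "showtimes": [
    {"movie": "Example Movie", "times": ["1:00", "4:30", "8:00"]},
    {"movie": "Another Film", "times": ["2:15"]}
  ]
}
```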
📖 Required Reading: R for Data Science, Chapter 23
Check-in: Unnesting
- What happens when you use `unnest_wider()` with unnamed list-columns like `df2`?
- What argument is now necessary?
- How does `unnest_wider()` reconcile columns with different lengths?
- From time-to-time you encounter data frames with multiple list-columns with aligned values. For example, in the following data frame, the values of `y` and `z` are aligned (i.e., `y` and `z` will always have the same length within a row, and the first value of `y` corresponds to the first value of `z`). What happens if you apply two `unnest_longer()` calls to this data frame? How can you preserve the relationship between `x` and `y`? (Hint: Carefully read the documentation.)
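The data frame in question isn’t reproduced here; for reference, a hypothetical data frame with aligned list-columns might look like:

```r
library(tibble)

# y and z are aligned: equal lengths within each row, and the i-th
# value of y pairs with the i-th value of z
df <- tibble(
  x = 1:2,
  y = list(1:2, 3:4),
  z = list(c("a", "b"), c("c", "d"))
)
```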
JSON Data
📖 Required Reading: R for Data Science, Chapter 23
Check-in: JSON, pt. 1
- Objects are held in _____.
  - parentheses
  - double quotes
  - square brackets
  - curly braces
- While it may seem tedious compared to our usual tabular forms of data, JSON data is easy to read because it _____.
  - comes in name / value pairs
  - is separated by “—–”
  - comes in table / name / value triplets
- Create rectangular versions of `df_col` and `df_row` (shown below).
```r
json_col <- parse_json('
  {
    "x": ["a", "x", "z"],
    "y": [10, null, 3]
  }
')
json_row <- parse_json('
  [
    {"x": "a", "y": 10},
    {"x": "x", "y": null},
    {"x": "z", "y": 3}
  ]
')

df_col <- tibble(json = list(json_col))
df_row <- tibble(json = json_row)
```
Working with JSON Data
The chapter covers a few functions from the `jsonlite` package, but there are several other packages for working with JSON data in R (e.g., `rjson`, `RJSONIO`). These packages automatically transform JSON data into nested R lists or complex data frames. However, working with these complex objects can be difficult.
The `tidyjson` package takes a different approach to structuring JSON data into tidy data frames. Similar to `tidyr`, `tidyjson` builds a grammar for manipulating JSON into a tidy table structure. `tidyjson` is based on the following principles:
- Work on a single JSON document, or on a collection of related documents
- Create pipelines with the pipe operator (`%>%`), producing code that can be read from left to right
- Guarantee the structure of the data produced, even if the input JSON structure changes (with the exception of `spread_all()`)
- Work with arbitrarily nested arrays or objects
- Handle ‘ragged’ arrays and / or objects (varying lengths by document)
- Ensure edge cases are handled correctly (especially empty data)
- Integrate seamlessly with `dplyr`, allowing `tbl_json` objects to pipe in and out of `dplyr` verbs where reasonable
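As a tiny illustration of these principles, here is a sketch that pipes a made-up JSON document through `tidyjson` and `dplyr` verbs (assumes both packages are installed):

```r
library(tidyjson)
library(dplyr)

# A single JSON document flowing through a tidyjson + dplyr pipeline
'{"name": "Acme", "founded": 1999}' %>%
  as.tbl_json() %>%
  spread_all() %>%
  filter(founded > 1990)
```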
Check-in: JSON, pt. 2
Let’s use the `companies` JSON dataset from the tidyjson package to explore its tools. This exploration parallels the `worldbank` examples from the tidyjson vignette.
Step 1 – Use `spread_all()` to spread the values of a JSON object
- When you use the `spread_all()` function on the `companies` dataset, how many columns are created?
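One way to answer this is a minimal sketch building on the setup above:

```r
# spread_all() creates one column per scalar (string, number,
# logical) value; ncol() then counts the resulting columns
companies %>%
  spread_all() %>%
  ncol()
```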
Step 2 – Find the values that were not able to be spread
To identify the names of each object, their type, and frequency, we use the following pipeline:

```r
name_of_data %>%
  gather_object() %>%
  json_types() %>%
  count(name, type)
```
The `gather_object()` function collapses a JSON object into name-value pairs, producing a `tbl_json` object. The `json_types()` function then inspects the JSON associated with each row of the `tbl_json` object, and adds a new column (`type`) that identifies the type according to the JSON standard. We can then `count()` the number of occurrences of each `name`, `type` combination to understand how many rows / observations contained these values.
- How many different values are included in the `companies` dataset?
Now that we know the name of every value in the `companies` dataset, we can inspect how many of these columns were not captured when we used `spread_all()`.
- How many arrays from the `companies` dataset were not spread into new columns?
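Because `spread_all()` only spreads scalar values, one way to spot what it skipped is to filter the name / type counts down to the non-scalar types (a sketch building on the pipeline above):

```r
# Arrays and objects are the values spread_all() cannot flatten
companies %>%
  gather_object() %>%
  json_types() %>%
  count(name, type) %>%
  filter(type %in% c("array", "object"))
```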
Step 3 – Manually spread the problem values
Now that we know the names of the values that were not spread into columns, we can wrangle those values individually. Let’s do this with the `acquisitions` value.
We do this with the following pipeline:

```r
name_of_dataset %>%
  enter_object(name_of_value) %>%
  gather_array() %>%
  spread_all()
```
The `enter_object()` function allows you to navigate to a specific value of the JSON data by referencing its name. `gather_array()` then collapses a JSON array into index-value pairs, creating a new column `array.index` to store the index of the array, and storing values in the `JSON` attribute for further tidyjson manipulation.
- Using this pipeline on the `acquisitions` array of the `companies` dataset, what values are stored in this array? That is, what are the names of the new columns that appear in the spread dataset?
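Concretely, the check-in pipeline might look like the following sketch (`glimpse()` is just one convenient way to list the new columns):

```r
# Navigate into the acquisitions array, then spread its values
companies %>%
  enter_object(acquisitions) %>%
  gather_array() %>%
  spread_all() %>%
  glimpse()
```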
Step 4 – Combining everything together
By combining this with our initial top-level `spread_all()`, we can aggregate the sale price of companies by `category_code` and the currency it was sold in.
```r
# For nice looking tables!
library(gt)

companies %>%
  spread_all() %>%
  # Manually extracting information from acquisitions
  enter_object(acquisitions) %>%
  gather_array() %>%
  spread_all() %>%
  # Removing missing values from relevant variables
  drop_na(category_code, price_currency_code, price_amount) %>%
  # Finding max sale price for different company categories
  group_by(category_code, price_currency_code) %>%
  summarize(
    max_price = max(price_amount),
    .groups = "drop"
  ) %>%
  mutate(
    # Convert to dollar format
    max_price = scales::dollar(max_price),
    # Remove _ from company sector
    category_code = str_replace_all(category_code,
                                    pattern = "_",
                                    replacement = " "),
    # Capitalize the first letter of each word
    category_code = str_to_title(category_code)
  ) %>%
  gt() %>%
  cols_label(category_code = md("**Business Sector**"),
             price_currency_code = md("**Currency of Sale**"),
             max_price = md("**Max Company Sale Price**"))
```
| Business Sector | Currency of Sale | Max Company Sale Price |
|---|---|---|
| Biotech | USD | $587,000,000 |
| Consulting | USD | $4,500,000 |
| Enterprise | USD | $4,900,000,000 |
| Hardware | USD | $710,000,000 |
| Mobile | USD | $168,000,000 |
| Network Hosting | USD | $45,340,000 |
| News | USD | $1,850,000,000 |
| Other | USD | $23,700,000 |
| Public Relations | USD | $25,000,000 |
| Search | USD | $225,000,000 |
| Security | USD | $242,000,000 |
| Semiconductor | USD | $342,500,000 |
| Software | USD | $155,000,000 |
| Web | USD | $260,000,000 |
Step 5 – Scaling to multiple columns
This is a powerful process, but you can probably see that we don’t want to repeat it manually for every value `spread_all()` was not able to parse.
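To see the problem, without a more scalable approach each un-spread value needs its own copy of the same pipeline (a sketch; the value names besides `acquisitions` are illustrative):

```r
# One near-identical pipeline per nested value quickly gets tedious
companies %>% enter_object(acquisitions) %>% gather_array() %>% spread_all()
companies %>% enter_object(funding_rounds) %>% gather_array() %>% spread_all()
companies %>% enter_object(offices) %>% gather_array() %>% spread_all()
```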