Wide data to long using the tidyverse (tidyr's gather function)
A wide data storage format is an efficient and compact way to store information. And this organization perhaps it makes data easier to inspect. We have wide monitors our laptops and destops. However, for visualization and analysis you generally need to transform this data from the wide format to a “tidy”, long format.
We look at the case where just one variable is stored in a spreadsheet.
library(tidyverse)
Suppose you have a data frame of rankings of schools by year, and the initial data set is organized as follows (I just build one with tribble()):
df_wide <- tribble(~rankings_of_schools_by_year, ~`2000`, ~`2001`, ~`2002`,
"U of Illinois", 1, 2, 3,
"TU Dresden", 2, 3, 1,
"U of Denver", 3, 1, 1,
"Hogwarts", 4,4,4)
df_wide
## # A tibble: 4 x 4
## rankings_of_schools_by_year `2000` `2001` `2002`
## <chr> <dbl> <dbl> <dbl>
## 1 U of Illinois 1 2 3
## 2 TU Dresden 2 3 1
## 3 U of Denver 3 1 1
## 4 Hogwarts 4 4 4
Restructuring the data with gather, I define the names of columns that will contain the information in the column names (year) and the variable of interest (rank).
df_long <- df_wide %>% gather(key = year, value = rank, `2000`:`2002`)
df_long
## # A tibble: 12 x 3
## rankings_of_schools_by_year year rank
## <chr> <chr> <dbl>
## 1 U of Illinois 2000 1
## 2 TU Dresden 2000 2
## 3 U of Denver 2000 3
## 4 Hogwarts 2000 4
## 5 U of Illinois 2001 2
## 6 TU Dresden 2001 3
## 7 U of Denver 2001 1
## 8 Hogwarts 2001 4
## 9 U of Illinois 2002 3
## 10 TU Dresden 2002 1
## 11 U of Denver 2002 1
## 12 Hogwarts 2002 4
Pretty good! But we are not all the way there. Let’s use the code above a base. We need to change the first column name to be more appropriate. Also, the years are encoded as a character variable whereas they should be numeric (in this case integers, as the years are round numbers).
df_long <- df_wide %>% gather(key = year, value = rank, `2000`:`2002`) %>%
rename(school = rankings_of_schools_by_year) %>%
mutate(year = as.integer(year))
df_long
## # A tibble: 12 x 3
## school year rank
## <chr> <int> <dbl>
## 1 U of Illinois 2000 1
## 2 TU Dresden 2000 2
## 3 U of Denver 2000 3
## 4 Hogwarts 2000 4
## 5 U of Illinois 2001 2
## 6 TU Dresden 2001 3
## 7 U of Denver 2001 1
## 8 Hogwarts 2001 4
## 9 U of Illinois 2002 3
## 10 TU Dresden 2002 1
## 11 U of Denver 2002 1
## 12 Hogwarts 2002 4
Note to students: Then you might filter by year: filter(year > 2000).