“The tidyverse is an opinionated collection of R packages designed for data science. All packages share an underlying design philosophy, grammar, and data structures.”
The core eight packages of the tidyverse are:
| Package | Purpose | Notably replaces |
|---|---|---|
| ggplot2 | Graphics | Core graphics |
| dplyr | Data manipulation | aggregate, common row and column operations |
| tidyr | Data ‘tidying’ | melt, dcast |
| readr | Text file input | read.table, etc |
| purrr | Functional programming tools | apply family |
| tibble | Tables | data.frame |
| stringr | String manipulation | grep etc |
| forcats | Factor manipulation | as.factor etc |
Several other packages are associated, of which I regularly use:
| Package | Purpose | Notably replaces |
|---|---|---|
| magrittr | Pipe operator | Excessive numbers of nested brackets |
| glue | String manipulation | paste |
| lubridate | Dates | as.Date |
Other packages, with which I am less familiar, handle times and binary data, and import other data types (Excel, SPSS, JSON, XML, web scraping).
Notably the tidyverse has not yet expanded to statistical models.
I will not cover ggplot2; that is a topic for a whole session. I also won’t go into stringr and forcats, which are fairly self-explanatory once you are familiar with the rest, or lubridate, but do note that lubridate is fantastic.
readr
Many tidyverse functions mirror their base R equivalents but replace . with _ in the name. E.g. read_delim, read_table and read_csv replace read.delim, read.table and read.csv.
readr functions:
- Output tibbles rather than data.frames
- Do not convert strings to factors unless you explicitly ask for a factor column via col_types (there is no stringsAsFactors argument)
- Guess column types from the data (sometimes stops too early, a common reason for readr errors in huge documents, but see the guess_max argument)
library(tidyverse)  # attaches readr, dplyr, tidyr, purrr and tibble, among others
metadata <- read_csv("example_patient_characteristics.csv")
## Parsed with column specification:
## cols(
## participant_id = col_character(),
## gender = col_character(),
## age = col_double(),
## dob = col_date(format = "")
## )
metadata
## # A tibble: 7 x 4
## participant_id gender age dob
## <chr> <chr> <dbl> <date>
## 1 PAT_1 M 39 1972-10-30
## 2 PAT_2 F 26 1986-02-20
## 3 PAT_3 M 25 1987-02-28
## 4 PAT_4 M 25 1986-10-11
## 5 PAT_5 F 19 1992-08-23
## 6 PAT_6 F 20 1991-04-04
## 7 PAT_7 M 28 1984-01-24
- Column types can also be specified with the col_types argument, which accepts a very compact string representation, e.g. "cfdd?-" for columns that are, in order: character, factor, double, double, guess, skip
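As a rough sketch of how the compact string relates to the longer cols() specification, the following writes a small made-up file and reads it both ways. The file and its column names (id, group, score) are invented for illustration and are not part of the example data. It also shows guess_max, which sets how many rows are inspected when types are guessed.

```r
library(readr)

# Hypothetical three-column file, written to a temporary location
path <- tempfile(fileext = ".csv")
writeLines(c("id,group,score", "A1,x,1.5", "A2,y,2", "A3,x,NA"), path)

# Compact form: character, factor, double
compact <- read_csv(path, col_types = "cfd")

# The same specification spelled out with cols()
verbose <- read_csv(path, col_types = cols(
  id = col_character(),
  group = col_factor(),
  score = col_double()
))

sapply(compact, class)
sapply(verbose, class)

# Let readr guess, but inspect up to 10000 rows before deciding on types
guessed <- read_csv(path, guess_max = 10000)
```

The compact string is quicker to type; the cols() form is easier to read back later and does not depend on column order.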
metadata2 <- read_csv("example_patient_characteristics.csv", col_types = "cf?-")
metadata2
## # A tibble: 7 x 3
## participant_id gender age
## <chr> <fct> <dbl>
## 1 PAT_1 M 39
## 2 PAT_2 F 26
## 3 PAT_3 M 25
## 4 PAT_4 M 25
## 5 PAT_5 F 19
## 6 PAT_6 F 20
## 7 PAT_7 M 28
tibble
A tibble is a data.frame, and anything you would do to the latter you can do to the former. Notable enhancements over standard data.frames are:
- Printing them displays only ten rows by default, with column headings indicating data types, and colours if the console supports them
- They do not use row.names at all
- They are much more prone to complaining if something isn’t right (e.g. column name not found)
- Column names can contain e.g. spaces (need to be quoted with backticks)
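A minimal sketch of a couple of these differences, using a tiny made-up table (the values and the column name "age in years" are invented for illustration):

```r
library(tibble)

df <- data.frame(participant = c("PAT_1", "PAT_2"), age = c(39, 26))
tb <- tibble(participant = c("PAT_1", "PAT_2"), `age in years` = c(39, 26))

df$part            # a base data.frame silently partial-matches "participant"
tb$part            # a tibble warns about the unknown column and returns NULL
tb$`age in years`  # backticks quote a column name containing spaces
is.data.frame(tb)  # a tibble is still a data.frame
```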
phylo.stats <- read_csv("example_phylogenetic_summary.csv")
## Parsed with column specification:
## cols(
## host.id = col_character(),
## tree.id = col_character(),
## tips = col_double(),
## reads = col_double(),
## subgraphs = col_double(),
## clades = col_double(),
## overall.rtt = col_double(),
## largest.rtt = col_double(),
## max.branch.length = col_double(),
## max.pat.distance = col_double(),
## global.mean.pat.distance = col_double()
## )
phylo.stats
## # A tibble: 371 x 11
## host.id tree.id tips reads subgraphs clades overall.rtt largest.rtt
## <chr> <chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 PAT_3 1000_t… 11 418 1 1 0.00661 0.00661
## 2 PAT_4 1000_t… 4 97 1 3 0.0102 0.0102
## 3 PAT_5 1000_t… 16 220 1 9 0.0106 0.0106
## 4 PAT_2 1000_t… 5 289 1 4 0.000235 0.000235
## 5 PAT_7 1000_t… 12 262 1 1 0.0131 0.0131
## 6 PAT_1 1000_t… 6 703 3 3 0.00611 0
## 7 PAT_6 1000_t… 1 53 1 1 0 0
## 8 PAT_3 1160_t… 7 384 1 1 0.00368 0.00368
## 9 PAT_4 1160_t… 5 75 3 3 0.0142 0.00563
## 10 PAT_5 1160_t… 25 312 7 11 0.0132 0.00195
## # … with 361 more rows, and 3 more variables: max.branch.length <dbl>,
## # max.pat.distance <dbl>, global.mean.pat.distance <dbl>
options(tibble.print_min = 20, tibble.width = Inf)
phylo.stats
## # A tibble: 371 x 11
## host.id tree.id tips reads subgraphs clades overall.rtt largest.rtt
## <chr> <chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 PAT_3 1000_to_1319 11 418 1 1 0.00661 0.00661
## 2 PAT_4 1000_to_1319 4 97 1 3 0.0102 0.0102
## 3 PAT_5 1000_to_1319 16 220 1 9 0.0106 0.0106
## 4 PAT_2 1000_to_1319 5 289 1 4 0.000235 0.000235
## 5 PAT_7 1000_to_1319 12 262 1 1 0.0131 0.0131
## 6 PAT_1 1000_to_1319 6 703 3 3 0.00611 0
## 7 PAT_6 1000_to_1319 1 53 1 1 0 0
## 8 PAT_3 1160_to_1479 7 384 1 1 0.00368 0.00368
## 9 PAT_4 1160_to_1479 5 75 3 3 0.0142 0.00563
## 10 PAT_5 1160_to_1479 25 312 7 11 0.0132 0.00195
## 11 PAT_2 1160_to_1479 2 333 2 2 0.0000445 0
## 12 PAT_7 1160_to_1479 5 215 1 1 0.0216 0.0216
## 13 PAT_1 1160_to_1479 6 669 1 4 0.000130 0.000130
## 14 PAT_6 1160_to_1479 1 68 1 1 0 0
## 15 PAT_3 1320_to_1639 5 334 2 5 0.00057 0.000494
## 16 PAT_4 1320_to_1639 3 73 1 3 0.00444 0.00444
## 17 PAT_5 1320_to_1639 19 172 5 13 0.0156 0
## 18 PAT_2 1320_to_1639 2 266 2 2 0.0000542 0
## 19 PAT_7 1320_to_1639 4 232 1 1 0.0119 0.0119
## 20 PAT_1 1320_to_1639 4 623 1 3 0.0000798 0.0000798
## max.branch.length max.pat.distance global.mean.pat.distance
## <dbl> <dbl> <dbl>
## 1 0.00676 0.0201 0.0116
## 2 0.0104 0.0276 0.0167
## 3 0.0102 0.0373 0.0188
## 4 0.00678 0.0136 0.0102
## 5 0.0106 0.0454 0.0238
## 6 0.00675 0.0203 0.0115
## 7 NA NA NA
## 8 0.00702 0.0140 0.00866
## 9 0.0147 0.0405 0.0212
## 10 0.0149 0.0547 0.0292
## 11 0.00725 0.00725 0.00725
## 12 0.0193 0.0344 0.0195
## 13 0.00729 0.0216 0.0106
## 14 NA NA NA
## 15 0.0104 0.0173 0.0124
## 16 0.0106 0.0176 0.0117
## 17 0.0146 0.0634 0.0282
## 18 0.00708 0.00708 0.00708
## 19 0.0106 0.0244 0.0140
## 20 0.00351 0.00702 0.00584
## # … with 351 more rows
options(tibble.print_min = 10, tibble.print_max = 10, tibble.width = NULL)
phylo.stats.df <- read.csv("example_phylogenetic_summary.csv")
phylo.stats.df$foo
## NULL
phylo.stats$foo
## Warning: Unknown or uninitialised column: 'foo'.
## NULL
Note: tibble and the tidyverse in general are not especially built for performance. Another data.frame alternative, data.table, is not as intuitive but is very fast.
magrittr

magrittr provides R’s pipe operator, %>%, which greatly improves code readability.
By default, the left hand side of the pipe is given to the function on the right hand side as the first argument.
library(magrittr)  # also provides add, extract, use_series and %<>%, used below
c("a", "b")
## [1] "a" "b"
"a" %>% c("b")
## [1] "a" "b"
If you want it to be piped to a different argument, use the placeholder, a single dot (.):
"a" %>% c("b", .)
## [1] "b" "a"
This can also be used to build functions:
f <- . %>% cos %>% sin
f(40)
## [1] -0.6185831
Tidyverse functions usually take the data.frame (or tibble) as the first argument, making it easy to construct long pipelines (see later).
magrittr likes functions of the form f(x) with a first argument x, a form that some very elementary operations (such as + and [) lack. It therefore provides aliases for them that do have this form; see ?extract
1 + 2
## [1] 3
add(1,2)
## [1] 3
1 %>% add(2)
## [1] 3
foobar = list(foo = "foo", bar = "bar")
foobar["foo"]
## $foo
## [1] "foo"
foobar[["bar"]]
## [1] "bar"
foobar %>% extract("foo")
## $foo
## [1] "foo"
foobar %>% extract2("bar")
## [1] "bar"
metadata$gender
## [1] "M" "F" "M" "M" "F" "F" "M"
metadata %>% use_series("gender")
## [1] "M" "F" "M" "M" "F" "F" "M"
(The last can also be done using pull from dplyr, when applied to data frames.)
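To see why the pipe helps readability, here is a small sketch comparing a nested call with the equivalent pipeline. The vector x is made up for illustration.

```r
library(magrittr)

x <- c(4, 9, 16)

# Nested form: read from the inside out
nested <- round(mean(sqrt(x)), 1)

# Piped form: read from left to right, one step at a time
piped <- x %>% sqrt() %>% mean() %>% round(1)

identical(nested, piped)
```

Each step receives the result of the previous one as its first argument, so round(1) here means round(<previous result>, 1).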
The %<>% operator sends the pipe result back to the original variable
n <- 1
n %<>% add(1)
n
## [1] 2
dplyr
ggplot2 aside, dplyr is probably the heart of the tidyverse. All functions take a data.frame/tibble as the first argument in normal use, for ease of piping.
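Because the data always comes first, a piped call and a plain nested call are interchangeable. A minimal sketch, using a small made-up table rather than the example files:

```r
library(dplyr)

# Hypothetical data, invented for illustration
patients <- tibble(
  id = c("PAT_1", "PAT_2", "PAT_3"),
  gender = c("M", "F", "M"),
  age = c(39, 26, 25)
)

# Plain function calls, nested
direct <- arrange(filter(patients, gender == "M"), age)

# The same steps as a pipeline
piped <- patients %>%
  filter(gender == "M") %>%
  arrange(age)

all.equal(direct, piped)
```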
To keep rows based on a condition, use filter:
metadata %>% filter(gender == "F")
## # A tibble: 3 x 4
## participant_id gender age dob
## <chr> <chr> <dbl> <date>
## 1 PAT_2 F 26 1986-02-20
## 2 PAT_5 F 19 1992-08-23
## 3 PAT_6 F 20 1991-04-04
To keep them by row number, use slice:
metadata %>% slice(1:3)
## # A tibble: 3 x 4
## participant_id gender age dob
## <chr> <chr> <dbl> <date>
## 1 PAT_1 M 39 1972-10-30
## 2 PAT_2 F 26 1986-02-20
## 3 PAT_3 M 25 1987-02-28
To keep a random sample, use sample_frac and sample_n:
metadata %>% sample_n(3)
## # A tibble: 3 x 4
## participant_id gender age dob
## <chr> <chr> <dbl> <date>
## 1 PAT_1 M 39 1972-10-30
## 2 PAT_6 F 20 1991-04-04
## 3 PAT_2 F 26 1986-02-20
To sort, use arrange:
metadata %>% arrange(age)
## # A tibble: 7 x 4
## participant_id gender age dob
## <chr> <chr> <dbl> <date>
## 1 PAT_5 F 19 1992-08-23
## 2 PAT_6 F 20 1991-04-04
## 3 PAT_3 M 25 1987-02-28
## 4 PAT_4 M 25 1986-10-11
## 5 PAT_2 F 26 1986-02-20
## 6 PAT_7 M 28 1984-01-24
## 7 PAT_1 M 39 1972-10-30
metadata %>% arrange(desc(age))
## # A tibble: 7 x 4
## participant_id gender age dob
## <chr> <chr> <dbl> <date>
## 1 PAT_1 M 39 1972-10-30
## 2 PAT_7 M 28 1984-01-24
## 3 PAT_2 F 26 1986-02-20
## 4 PAT_3 M 25 1987-02-28
## 5 PAT_4 M 25 1986-10-11
## 6 PAT_6 F 20 1991-04-04
## 7 PAT_5 F 19 1992-08-23
To keep or drop columns and keep the result as a tibble, use select:
metadata %>% select(participant_id, age)
## # A tibble: 7 x 2
## participant_id age
## <chr> <dbl>
## 1 PAT_1 39
## 2 PAT_2 26
## 3 PAT_3 25
## 4 PAT_4 25
## 5 PAT_5 19
## 6 PAT_6 20
## 7 PAT_7 28
metadata %>% select(-age)
## # A tibble: 7 x 3
## participant_id gender dob
## <chr> <chr> <date>
## 1 PAT_1 M 1972-10-30
## 2 PAT_2 F 1986-02-20
## 3 PAT_3 M 1987-02-28
## 4 PAT_4 M 1986-10-11
## 5 PAT_5 F 1992-08-23
## 6 PAT_6 F 1991-04-04
## 7 PAT_7 M 1984-01-24
But to extract a column as a vector, use pull (or use_series):
metadata %>% pull(age)
## [1] 39 26 25 25 19 20 28
metadata %>% select(age)
## # A tibble: 7 x 1
## age
## <dbl>
## 1 39
## 2 26
## 3 25
## 4 25
## 5 19
## 6 20
## 7 28
Add new columns with mutate (most useful when combined with functions from purrr, so see also below):
metadata %>% mutate(age2 = age - 1)
## # A tibble: 7 x 5
## participant_id gender age dob age2
## <chr> <chr> <dbl> <date> <dbl>
## 1 PAT_1 M 39 1972-10-30 38
## 2 PAT_2 F 26 1986-02-20 25
## 3 PAT_3 M 25 1987-02-28 24
## 4 PAT_4 M 25 1986-10-11 24
## 5 PAT_5 F 19 1992-08-23 18
## 6 PAT_6 F 20 1991-04-04 19
## 7 PAT_7 M 28 1984-01-24 27
metadata %>% mutate(foo = 1:7)
## # A tibble: 7 x 5
## participant_id gender age dob foo
## <chr> <chr> <dbl> <date> <int>
## 1 PAT_1 M 39 1972-10-30 1
## 2 PAT_2 F 26 1986-02-20 2
## 3 PAT_3 M 25 1987-02-28 3
## 4 PAT_4 M 25 1986-10-11 4
## 5 PAT_5 F 19 1992-08-23 5
## 6 PAT_6 F 20 1991-04-04 6
## 7 PAT_7 M 28 1984-01-24 7
Use group_by and summarise to get summary stats by levels of a variable:
metadata %>% group_by(gender)
## # A tibble: 7 x 4
## # Groups: gender [2]
## participant_id gender age dob
## <chr> <chr> <dbl> <date>
## 1 PAT_1 M 39 1972-10-30
## 2 PAT_2 F 26 1986-02-20
## 3 PAT_3 M 25 1987-02-28
## 4 PAT_4 M 25 1986-10-11
## 5 PAT_5 F 19 1992-08-23
## 6 PAT_6 F 20 1991-04-04
## 7 PAT_7 M 28 1984-01-24
metadata %>%
group_by(gender) %>%
summarise(count = n(), mean.age = mean(age))
## # A tibble: 2 x 3
## gender count mean.age
## <chr> <int> <dbl>
## 1 F 3 21.7
## 2 M 4 29.2
Finally, dplyr has numerous join functions, which add columns to one tibble by looking up a value in another.
phylo.relationships <- read_csv("example_pairwise_relationships.csv")
## Parsed with column specification:
## cols(
## host.1 = col_character(),
## host.2 = col_character(),
## ancestry = col_character(),
## ancestry.tree.count = col_double()
## )
phylo.relationships
## # A tibble: 12 x 4
## host.1 host.2 ancestry ancestry.tree.count
## <chr> <chr> <chr> <dbl>
## 1 PAT_5 PAT_2 complex 9
## 2 PAT_5 PAT_2 multiTrans 2
## 3 PAT_5 PAT_2 noAncestry 1
## 4 PAT_5 PAT_2 trans 9
## 5 PAT_2 PAT_5 multiTrans 3
## 6 PAT_2 PAT_5 trans 4
## 7 PAT_2 PAT_1 complex 3
## 8 PAT_2 PAT_1 multiTrans 10
## 9 PAT_2 PAT_1 noAncestry 2
## 10 PAT_2 PAT_1 trans 13
## # … with 2 more rows
phylo.relationships %>%
left_join(metadata, by=c("host.1" = "participant_id")) %>%
left_join(metadata, by=c("host.2" = "participant_id"))
## # A tibble: 12 x 10
## host.1 host.2 ancestry ancestry.tree.c… gender.x age.x dob.x gender.y
## <chr> <chr> <chr> <dbl> <chr> <dbl> <date> <chr>
## 1 PAT_5 PAT_2 complex 9 F 19 1992-08-23 F
## 2 PAT_5 PAT_2 multiTr… 2 F 19 1992-08-23 F
## 3 PAT_5 PAT_2 noAnces… 1 F 19 1992-08-23 F
## 4 PAT_5 PAT_2 trans 9 F 19 1992-08-23 F
## 5 PAT_2 PAT_5 multiTr… 3 F 26 1986-02-20 F
## 6 PAT_2 PAT_5 trans 4 F 26 1986-02-20 F
## 7 PAT_2 PAT_1 complex 3 F 26 1986-02-20 M
## 8 PAT_2 PAT_1 multiTr… 10 F 26 1986-02-20 M
## 9 PAT_2 PAT_1 noAnces… 2 F 26 1986-02-20 M
## 10 PAT_2 PAT_1 trans 13 F 26 1986-02-20 M
## # … with 2 more rows, and 2 more variables: age.y <dbl>, dob.y <date>
(Columns with matching names are joined upon automatically; by is needed only when the key columns are named differently, as here, or when you want to join on a subset of the shared names. The .x and .y suffixes appear when both tables contain non-key columns of the same name, as happens when joining the same table twice, and can be changed using the suffix argument to the join function.)
left_join returns a copy of the original tibble with all its rows; the new columns for rows with no matches will be given (by default) NA values. Several other joins are available (inner_join, right_join, full_join, semi_join and anti_join); see ?left_join.
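A small sketch of how the joins differ when a key has no match, and of the suffix argument. The two tables below are made up for illustration (PAT_9 deliberately has no entry in the lookup table).

```r
library(dplyr)

# Hypothetical tables, invented for illustration
pairs <- tibble(
  host.1 = c("PAT_1", "PAT_2", "PAT_9"),
  host.2 = c("PAT_2", "PAT_1", "PAT_1")
)
patients <- tibble(participant_id = c("PAT_1", "PAT_2"), age = c(39, 26))

# left_join keeps every row of pairs; PAT_9 gets NA for age
left <- left_join(pairs, patients, by = c("host.1" = "participant_id"))

# inner_join keeps only rows with a match
inner <- inner_join(pairs, patients, by = c("host.1" = "participant_id"))

# anti_join keeps only rows WITHOUT a match, and adds no columns
anti <- anti_join(pairs, patients, by = c("host.1" = "participant_id"))

# Joining the same table twice, with custom suffixes instead of .x and .y
both <- pairs %>%
  left_join(patients, by = c("host.1" = "participant_id")) %>%
  left_join(patients, by = c("host.2" = "participant_id"), suffix = c(".1", ".2"))
names(both)
```

anti_join is handy as a check before a left_join: if it returns any rows, those keys will end up with NA values.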
purrr
purrr is extremely versatile, and I will mention only the map family, which replaces the (somewhat bewildering) apply, lapply, mapply, vapply, sapply etc. from base R and is (in my opinion) much easier to remember. These functions apply the same function to each element of a vector or list; used on the columns of a data frame (or tibble) and combined with mutate from dplyr, they can make new columns (or replace the values of old ones) row by row.
The basic map takes one column as an argument and returns a list.
library(lubridate)  # provides year()
metadata2 <- metadata %>% mutate(yob = map(dob, year))
metadata2
## # A tibble: 7 x 5
## participant_id gender age dob yob
## <chr> <chr> <dbl> <date> <list>
## 1 PAT_1 M 39 1972-10-30 <dbl [1]>
## 2 PAT_2 F 26 1986-02-20 <dbl [1]>
## 3 PAT_3 M 25 1987-02-28 <dbl [1]>
## 4 PAT_4 M 25 1986-10-11 <dbl [1]>
## 5 PAT_5 F 19 1992-08-23 <dbl [1]>
## 6 PAT_6 F 20 1991-04-04 <dbl [1]>
## 7 PAT_7 M 28 1984-01-24 <dbl [1]>
metadata2 %>% pull(yob)
## [[1]]
## [1] 1972
##
## [[2]]
## [1] 1986
##
## [[3]]
## [1] 1987
##
## [[4]]
## [1] 1986
##
## [[5]]
## [1] 1992
##
## [[6]]
## [1] 1991
##
## [[7]]
## [1] 1984
(Both tibbles and data.frames will happily accept lists as columns and columns whose entries are lists, although tibbles make it a lot more obvious what is going on.)
To coerce output to another data type, use map_*. So map_dbl returns numerical columns, map_chr text, and so on:
metadata2 <- metadata %>% mutate(yob = map_dbl(dob, year))
metadata2
## # A tibble: 7 x 5
## participant_id gender age dob yob
## <chr> <chr> <dbl> <date> <dbl>
## 1 PAT_1 M 39 1972-10-30 1972
## 2 PAT_2 F 26 1986-02-20 1986
## 3 PAT_3 M 25 1987-02-28 1987
## 4 PAT_4 M 25 1986-10-11 1986
## 5 PAT_5 F 19 1992-08-23 1992
## 6 PAT_6 F 20 1991-04-04 1991
## 7 PAT_7 M 28 1984-01-24 1984
metadata2 <- metadata %>% mutate(yob = map_chr(dob, year))
metadata2
## # A tibble: 7 x 5
## participant_id gender age dob yob
## <chr> <chr> <dbl> <date> <chr>
## 1 PAT_1 M 39 1972-10-30 1972.000000
## 2 PAT_2 F 26 1986-02-20 1986.000000
## 3 PAT_3 M 25 1987-02-28 1987.000000
## 4 PAT_4 M 25 1986-10-11 1986.000000
## 5 PAT_5 F 19 1992-08-23 1992.000000
## 6 PAT_6 F 20 1991-04-04 1991.000000
## 7 PAT_7 M 28 1984-01-24 1984.000000
The basic map expects one column as an argument, while map2 expects two, and pmap a list consisting of an arbitrary number.
(I usually use map with anonymous functions.)
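The map functions also work on plain vectors, outside mutate, which can make it easier to see what each variant does. A minimal sketch with made-up numbers:

```r
library(purrr)

# Hypothetical inputs, invented for illustration
ages_x <- c(19, 26, 26)
ages_y <- c(26, 19, 39)

# One input vector
minus_one <- map_dbl(ages_x, function(x) x - 1)

# Two input vectors, walked through in parallel
diffs <- map2_dbl(ages_x, ages_y, function(x, y) x - y)

# The same thing using purrr's formula shorthand for anonymous functions
diffs_short <- map2_dbl(ages_x, ages_y, ~ .x - .y)

# Any number of inputs, supplied as a list
combined <- pmap_dbl(list(ages_x, ages_y, c(1, 2, 3)), function(x, y, z) z + x * y)
```

For simple arithmetic like this, plain vectorised code (ages_x - ages_y) gives the same answer; the map family earns its keep when the function only handles one element at a time.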
phylo.relationships %<>%
left_join(metadata, by=c("host.1" = "participant_id")) %>%
left_join(metadata, by=c("host.2" = "participant_id"))
phylo.relationships %>%
mutate(age.difference = map2_dbl(age.x, age.y, function(x, y){
x - y
})) %>%
select(host.1, host.2, gender.x, gender.y, age.x, age.y, age.difference)
## # A tibble: 12 x 7
## host.1 host.2 gender.x gender.y age.x age.y age.difference
## <chr> <chr> <chr> <chr> <dbl> <dbl> <dbl>
## 1 PAT_5 PAT_2 F F 19 26 -7
## 2 PAT_5 PAT_2 F F 19 26 -7
## 3 PAT_5 PAT_2 F F 19 26 -7
## 4 PAT_5 PAT_2 F F 19 26 -7
## 5 PAT_2 PAT_5 F F 26 19 7
## 6 PAT_2 PAT_5 F F 26 19 7
## 7 PAT_2 PAT_1 F M 26 39 -13
## 8 PAT_2 PAT_1 F M 26 39 -13
## 9 PAT_2 PAT_1 F M 26 39 -13
## 10 PAT_2 PAT_1 F M 26 39 -13
## # … with 2 more rows
random.numbers <- tibble(a = runif(10, 0, 1), b = rnorm(10, 0, 4), c=sample(10))
random.numbers
## # A tibble: 10 x 3
## a b c
## <dbl> <dbl> <int>
## 1 0.940 3.70 1
## 2 0.921 -2.92 6
## 3 0.205 5.30 7
## 4 0.418 -4.25 4
## 5 0.327 -2.45 3
## 6 0.655 -0.512 5
## 7 0.826 1.61 9
## 8 0.290 -3.51 2
## 9 0.110 3.58 8
## 10 0.940 2.21 10
random.numbers %>% mutate(d = pmap_dbl(list(a, b, c), function(x, y, z) z + (x*y)))
## # A tibble: 10 x 4
## a b c d
## <dbl> <dbl> <int> <dbl>
## 1 0.940 3.70 1 4.47
## 2 0.921 -2.92 6 3.31
## 3 0.205 5.30 7 8.09
## 4 0.418 -4.25 4 2.23
## 5 0.327 -2.45 3 2.20
## 6 0.655 -0.512 5 4.66
## 7 0.826 1.61 9 10.3
## 8 0.290 -3.51 2 0.982
## 9 0.110 3.58 8 8.39
## 10 0.940 2.21 10 12.1
tidyr
“Tidy” comes from the idea of tidy data, in which each variable is one column, and each observation one row. The tidyr package, which is quite small, provides functions that help to make your tibble look like this (or, indeed, untidy it). In perhaps more familiar terms, these are tools to convert between wide and long formats.
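Before turning to the real data, here is a toy sketch of the round trip between long and wide formats. The table is made up for illustration (the column name window and its values are hypothetical).

```r
library(tidyr)
library(tibble)

# Long format: one row per patient per window
long <- tibble(
  host.id = c("PAT_1", "PAT_1", "PAT_2", "PAT_2"),
  window = c("w1", "w2", "w1", "w2"),
  tips = c(6, 4, 5, 2)
)

# Wide format: one row per patient, one column per window
wide <- pivot_wider(long, names_from = window, values_from = tips)
wide

# And back again; -host.id selects every column except host.id
back <- pivot_longer(wide, cols = -host.id, names_to = "window", values_to = "tips")
identical(long$tips, back$tips)
```

Selecting columns by name (here -host.id) rather than by position is less likely to break if the number of columns changes.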
phylo.stats <- read_csv("example_phylogenetic_summary.csv")
## Parsed with column specification:
## cols(
## host.id = col_character(),
## tree.id = col_character(),
## tips = col_double(),
## reads = col_double(),
## subgraphs = col_double(),
## clades = col_double(),
## overall.rtt = col_double(),
## largest.rtt = col_double(),
## max.branch.length = col_double(),
## max.pat.distance = col_double(),
## global.mean.pat.distance = col_double()
## )
phylo.stats
## # A tibble: 371 x 11
## host.id tree.id tips reads subgraphs clades overall.rtt largest.rtt
## <chr> <chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 PAT_3 1000_t… 11 418 1 1 0.00661 0.00661
## 2 PAT_4 1000_t… 4 97 1 3 0.0102 0.0102
## 3 PAT_5 1000_t… 16 220 1 9 0.0106 0.0106
## 4 PAT_2 1000_t… 5 289 1 4 0.000235 0.000235
## 5 PAT_7 1000_t… 12 262 1 1 0.0131 0.0131
## 6 PAT_1 1000_t… 6 703 3 3 0.00611 0
## 7 PAT_6 1000_t… 1 53 1 1 0 0
## 8 PAT_3 1160_t… 7 384 1 1 0.00368 0.00368
## 9 PAT_4 1160_t… 5 75 3 3 0.0142 0.00563
## 10 PAT_5 1160_t… 25 312 7 11 0.0132 0.00195
## # … with 361 more rows, and 3 more variables: max.branch.length <dbl>,
## # max.pat.distance <dbl>, global.mean.pat.distance <dbl>
This table is in long format. Observations about one patient appear on multiple rows, with a single column (tree.id) telling you something about the numbers in other columns (in this case, which phylogeny the statistics were taken from).
phylo.stats %>% filter(host.id == "PAT_1")
## # A tibble: 53 x 11
## host.id tree.id tips reads subgraphs clades overall.rtt largest.rtt
## <chr> <chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 PAT_1 1000_t… 6 703 3 3 0.00611 0
## 2 PAT_1 1160_t… 6 669 1 4 0.000130 0.000130
## 3 PAT_1 1320_t… 4 623 1 3 0.0000798 0.0000798
## 4 PAT_1 1480_t… 2 714 1 2 0.000001 0.000001
## 5 PAT_1 1640_t… 7 728 1 4 0.000118 0.000118
## 6 PAT_1 1800_t… 7 734 7 7 0.00186 0
## 7 PAT_1 1960_t… 6 416 5 5 0.000624 0
## 8 PAT_1 2120_t… 2 153 2 2 0.000001 0
## 9 PAT_1 2280_t… 2 207 2 2 0.000001 0
## 10 PAT_1 2440_t… 1 19 1 1 0 0
## # … with 43 more rows, and 3 more variables: max.branch.length <dbl>,
## # max.pat.distance <dbl>, global.mean.pat.distance <dbl>
To move to wide data, we make every patient a row, and every different type of observation about that patient (i.e. each stat in each window), a column. Let’s work only with the number of phylogeny tips. Then call pivot_wider:
phylo.stats %>% select(host.id, tree.id, tips)
## # A tibble: 371 x 3
## host.id tree.id tips
## <chr> <chr> <dbl>
## 1 PAT_3 1000_to_1319 11
## 2 PAT_4 1000_to_1319 4
## 3 PAT_5 1000_to_1319 16
## 4 PAT_2 1000_to_1319 5
## 5 PAT_7 1000_to_1319 12
## 6 PAT_1 1000_to_1319 6
## 7 PAT_6 1000_to_1319 1
## 8 PAT_3 1160_to_1479 7
## 9 PAT_4 1160_to_1479 5
## 10 PAT_5 1160_to_1479 25
## # … with 361 more rows
phylo.stats.wide <- phylo.stats %>%
select(host.id, tree.id, tips) %>%
pivot_wider(names_from = tree.id, values_from = tips)
phylo.stats.wide
## # A tibble: 7 x 54
## host.id `1000_to_1319` `1160_to_1479` `1320_to_1639` `1480_to_1799`
## <chr> <dbl> <dbl> <dbl> <dbl>
## 1 PAT_3 11 7 5 4
## 2 PAT_4 4 5 3 5
## 3 PAT_5 16 25 19 20
## 4 PAT_2 5 2 2 1
## 5 PAT_7 12 5 4 5
## 6 PAT_1 6 6 4 2
## 7 PAT_6 1 1 1 1
## # … with 49 more variables: `1640_to_1959` <dbl>, `1800_to_2119` <dbl>,
## # `1960_to_2279` <dbl>, `2120_to_2439` <dbl>, `2280_to_2599` <dbl>,
## # `2440_to_2759` <dbl>, `2600_to_2919` <dbl>, `2760_to_3079` <dbl>,
## # `2920_to_3239` <dbl>, `3080_to_3399` <dbl>, `3240_to_3559` <dbl>,
## # `3400_to_3719` <dbl>, `3560_to_3879` <dbl>, `3720_to_4039` <dbl>,
## # `3880_to_4199` <dbl>, `4040_to_4359` <dbl>, `4200_to_4519` <dbl>,
## # `4360_to_4679` <dbl>, `4520_to_4839` <dbl>, `4680_to_4999` <dbl>,
## # `4840_to_5159` <dbl>, `5000_to_5319` <dbl>, `5160_to_5479` <dbl>,
## # `520_to_839` <dbl>, `5320_to_5639` <dbl>, `5480_to_5799` <dbl>,
## # `5640_to_5959` <dbl>, `5800_to_6119` <dbl>, `5960_to_6279` <dbl>,
## # `6120_to_6439` <dbl>, `6280_to_6599` <dbl>, `6440_to_6759` <dbl>,
## # `6760_to_7079` <dbl>, `680_to_999` <dbl>, `6920_to_7239` <dbl>,
## # `7080_to_7399` <dbl>, `7240_to_7559` <dbl>, `7400_to_7719` <dbl>,
## # `7560_to_7879` <dbl>, `7720_to_8039` <dbl>, `7880_to_8199` <dbl>,
## # `8040_to_8359` <dbl>, `8200_to_8519` <dbl>, `8360_to_8679` <dbl>,
## # `840_to_1159` <dbl>, `8520_to_8839` <dbl>, `8680_to_8999` <dbl>,
## # `8840_to_9159` <dbl>, `9160_to_9479` <dbl>
Of course, you can convert back again. The second argument to pivot_longer (the first after the data, which does not appear here because of the pipe) is the set of columns to gather.
phylo.stats.wide %>% pivot_longer(2:54)
## # A tibble: 371 x 3
## host.id name value
## <chr> <chr> <dbl>
## 1 PAT_3 1000_to_1319 11
## 2 PAT_3 1160_to_1479 7
## 3 PAT_3 1320_to_1639 5
## 4 PAT_3 1480_to_1799 4
## 5 PAT_3 1640_to_1959 6
## 6 PAT_3 1800_to_2119 3
## 7 PAT_3 1960_to_2279 3
## 8 PAT_3 2120_to_2439 3
## 9 PAT_3 2280_to_2599 2
## 10 PAT_3 2440_to_2759 2
## # … with 361 more rows
The new columns can be given custom names:
phylo.stats.wide %>% pivot_longer(2:54, names_to = "tree.id", values_to="tips")
## # A tibble: 371 x 3
## host.id tree.id tips
## <chr> <chr> <dbl>
## 1 PAT_3 1000_to_1319 11
## 2 PAT_3 1160_to_1479 7
## 3 PAT_3 1320_to_1639 5
## 4 PAT_3 1480_to_1799 4
## 5 PAT_3 1640_to_1959 6
## 6 PAT_3 1800_to_2119 3
## 7 PAT_3 1960_to_2279 3
## 8 PAT_3 2120_to_2439 3
## 9 PAT_3 2280_to_2599 2
## 10 PAT_3 2440_to_2759 2
## # … with 361 more rows
This can be particularly handy when summarising.
phylo.stats.wide %>%
pivot_longer(2:54, names_to = "tree.id", values_to = "tips") %>%
group_by(host.id) %>%
summarise(mean.tips = mean(tips), max.tips = max(tips), min.tips = min(tips))
## # A tibble: 7 x 4
## host.id mean.tips max.tips min.tips
## <chr> <dbl> <dbl> <dbl>
## 1 PAT_1 1.98 7 0
## 2 PAT_2 3 14 1
## 3 PAT_3 2.30 11 0
## 4 PAT_4 0.717 6 0
## 5 PAT_5 4.64 25 0
## 6 PAT_6 0.226 2 0
## 7 PAT_7 1.79 12 0
glue
glue replaces paste("foo", "bar", sep = "_") with:
foo <- "foo"
bar <- "bar"
glue("{foo}_{bar}")
## foo_bar
This can be fully integrated with other tidyverse commands, of course.
metadata %>%
mutate(full.gender = map_chr(gender, function(x){
ifelse(x == "M", "male", "female")
})) %>%
mutate(description = glue("Patient ID {participant_id} is a {full.gender} of age {age}")) %>%
pull(description)
## Patient ID PAT_1 is a male of age 39
## Patient ID PAT_2 is a female of age 26
## Patient ID PAT_3 is a male of age 25
## Patient ID PAT_4 is a male of age 25
## Patient ID PAT_5 is a female of age 19
## Patient ID PAT_6 is a female of age 20
## Patient ID PAT_7 is a male of age 28
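As a further sketch of glue on its own: the braces can hold any R expression, not just a variable name, and the result is vectorised. The gender and age vectors below are made up for illustration.

```r
library(glue)

foo <- "foo"
bar <- "bar"

# Same text as the paste() version
glue("{foo}_{bar}") == paste(foo, bar, sep = "_")

# Expressions inside the braces are evaluated, one result per element
gender <- c("M", "F")
age <- c(39, 26)
descriptions <- glue("A {ifelse(gender == 'M', 'male', 'female')} of age {age}")
descriptions
```

Since ifelse is already vectorised, this form avoids the intermediate map_chr step, at the cost of a busier template string.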