Introduction to Tidyverse

“The tidyverse is an opinionated collection of R packages designed for data science. All packages share an underlying design philosophy, grammar, and data structures.”

The eight core packages of the tidyverse are:

Package   Purpose                        Notably replaces
ggplot2   Graphics                       Core graphics
dplyr     Data manipulation              aggregate, common row and column operations
tidyr     Data ‘tidying’                 melt, dcast
readr     Text file input                read.table, etc
purrr     Functional programming tools   apply family
tibble    Tables                         data.frame
stringr   String manipulation            grep etc
forcats   Factor manipulation            as.factor etc

Several other packages are associated, of which I regularly use:

Package     Purpose               Notably replaces
magrittr    Pipe operator         Excessive numbers of nested brackets
glue        String manipulation   paste
lubridate   Dates                 as.Date

Other packages, with which I am less familiar, handle times and binary data, and import other data types (Excel, SPSS, JSON, XML, web scraping)…

Notably, the core tidyverse does not cover statistical models; that is the territory of the separate tidymodels project.

I will not cover ggplot2; that is a topic for a whole session. I also won’t go into stringr and forcats, which are fairly self-explanatory once you are familiar with the rest, or lubridate, but do note that lubridate is fantastic.

readr

Many tidyverse functions share their names with standard R functions but replace . with _. For example, read_delim, read_table and read_csv replace read.delim, read.table and read.csv.

readr functions:

  • Output tibbles rather than data.frames
  • Do not convert strings to factors unless you explicitly ask (so there is no need for stringsAsFactors = FALSE)
  • Guess column types from the data (sometimes stops too early, a common reason for readr errors in huge documents, but see the guess_max argument)
metadata <- read_csv("example_patient_characteristics.csv")

## Parsed with column specification:
## cols(
##   participant_id = col_character(),
##   gender = col_character(),
##   age = col_double(),
##   dob = col_date(format = "")
## )

metadata

## # A tibble: 7 x 4
##   participant_id gender   age dob       
##   <chr>          <chr>  <dbl> <date>    
## 1 PAT_1          M         39 1972-10-30
## 2 PAT_2          F         26 1986-02-20
## 3 PAT_3          M         25 1987-02-28
## 4 PAT_4          M         25 1986-10-11
## 5 PAT_5          F         19 1992-08-23
## 6 PAT_6          F         20 1991-04-04
## 7 PAT_7          M         28 1984-01-24
  • Column types can also be specified with the col_types argument, which will accept a very compact string representation
    • e.g. “cfdd?-” for columns, in order: character, factor, double, double, guess, skip
metadata2 <- read_csv("example_patient_characteristics.csv", col_types = "cf?-")

metadata2

## # A tibble: 7 x 3
##   participant_id gender   age
##   <chr>          <fct>  <dbl>
## 1 PAT_1          M         39
## 2 PAT_2          F         26
## 3 PAT_3          M         25
## 4 PAT_4          M         25
## 5 PAT_5          F         19
## 6 PAT_6          F         20
## 7 PAT_7          M         28
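
For more control than the compact string allows, column types can be spelled out with cols(). The sketch below is illustrative only: it writes two rows of made-up patient data to a temporary file so that it does not depend on the example CSV, and the factor levels and date format are assumptions about what a real file would contain. It also shows where guess_max goes when the types are left to be guessed.

```r
library(readr)

# Made-up data written to a temporary file so the example is self-contained
csv.path <- tempfile(fileext = ".csv")
writeLines(c("participant_id,gender,age,dob",
             "PAT_1,M,39,1972-10-30",
             "PAT_2,F,26,1986-02-20"),
           csv.path)

# Full column specification: every column is named and typed explicitly
patients <- read_csv(
  csv.path,
  col_types = cols(
    participant_id = col_character(),
    gender = col_factor(levels = c("F", "M")),  # assumed set of levels
    age = col_integer(),
    dob = col_date(format = "%Y-%m-%d")         # assumed date format
  )
)

# Alternatively, keep the guessing but let readr inspect more rows first
patients.guessed <- read_csv(csv.path, guess_max = 10000)
```

Naming the columns in cols() has the side benefit that a renamed or missing column in the file produces a warning rather than passing unnoticed.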

tibble

A tibble is a data.frame, and anything you would do to the latter you can do to the former. Notable enhancements over standard data.frames are:

  • Printing them displays only ten rows by default, with column headings indicating data types, and colours if the console supports them
  • They do not use row.names at all
  • They are much more prone to complaining if something isn’t right (e.g. column name not found)
  • Column names can contain e.g. spaces (need to be quoted with backticks)
phylo.stats <- read_csv("example_phylogenetic_summary.csv")

## Parsed with column specification:
## cols(
##   host.id = col_character(),
##   tree.id = col_character(),
##   tips = col_double(),
##   reads = col_double(),
##   subgraphs = col_double(),
##   clades = col_double(),
##   overall.rtt = col_double(),
##   largest.rtt = col_double(),
##   max.branch.length = col_double(),
##   max.pat.distance = col_double(),
##   global.mean.pat.distance = col_double()
## )

phylo.stats

## # A tibble: 371 x 11
##    host.id tree.id  tips reads subgraphs clades overall.rtt largest.rtt
##    <chr>   <chr>   <dbl> <dbl>     <dbl>  <dbl>       <dbl>       <dbl>
##  1 PAT_3   1000_t…    11   418         1      1    0.00661     0.00661 
##  2 PAT_4   1000_t…     4    97         1      3    0.0102      0.0102  
##  3 PAT_5   1000_t…    16   220         1      9    0.0106      0.0106  
##  4 PAT_2   1000_t…     5   289         1      4    0.000235    0.000235
##  5 PAT_7   1000_t…    12   262         1      1    0.0131      0.0131  
##  6 PAT_1   1000_t…     6   703         3      3    0.00611     0       
##  7 PAT_6   1000_t…     1    53         1      1    0           0       
##  8 PAT_3   1160_t…     7   384         1      1    0.00368     0.00368 
##  9 PAT_4   1160_t…     5    75         3      3    0.0142      0.00563 
## 10 PAT_5   1160_t…    25   312         7     11    0.0132      0.00195 
## # … with 361 more rows, and 3 more variables: max.branch.length <dbl>,
## #   max.pat.distance <dbl>, global.mean.pat.distance <dbl>

options(tibble.print_min = 20, tibble.width = Inf)

phylo.stats

## # A tibble: 371 x 11
##    host.id tree.id       tips reads subgraphs clades overall.rtt largest.rtt
##    <chr>   <chr>        <dbl> <dbl>     <dbl>  <dbl>       <dbl>       <dbl>
##  1 PAT_3   1000_to_1319    11   418         1      1   0.00661     0.00661  
##  2 PAT_4   1000_to_1319     4    97         1      3   0.0102      0.0102   
##  3 PAT_5   1000_to_1319    16   220         1      9   0.0106      0.0106   
##  4 PAT_2   1000_to_1319     5   289         1      4   0.000235    0.000235 
##  5 PAT_7   1000_to_1319    12   262         1      1   0.0131      0.0131   
##  6 PAT_1   1000_to_1319     6   703         3      3   0.00611     0        
##  7 PAT_6   1000_to_1319     1    53         1      1   0           0        
##  8 PAT_3   1160_to_1479     7   384         1      1   0.00368     0.00368  
##  9 PAT_4   1160_to_1479     5    75         3      3   0.0142      0.00563  
## 10 PAT_5   1160_to_1479    25   312         7     11   0.0132      0.00195  
## 11 PAT_2   1160_to_1479     2   333         2      2   0.0000445   0        
## 12 PAT_7   1160_to_1479     5   215         1      1   0.0216      0.0216   
## 13 PAT_1   1160_to_1479     6   669         1      4   0.000130    0.000130 
## 14 PAT_6   1160_to_1479     1    68         1      1   0           0        
## 15 PAT_3   1320_to_1639     5   334         2      5   0.00057     0.000494 
## 16 PAT_4   1320_to_1639     3    73         1      3   0.00444     0.00444  
## 17 PAT_5   1320_to_1639    19   172         5     13   0.0156      0        
## 18 PAT_2   1320_to_1639     2   266         2      2   0.0000542   0        
## 19 PAT_7   1320_to_1639     4   232         1      1   0.0119      0.0119   
## 20 PAT_1   1320_to_1639     4   623         1      3   0.0000798   0.0000798
##    max.branch.length max.pat.distance global.mean.pat.distance
##                <dbl>            <dbl>                    <dbl>
##  1           0.00676          0.0201                   0.0116 
##  2           0.0104           0.0276                   0.0167 
##  3           0.0102           0.0373                   0.0188 
##  4           0.00678          0.0136                   0.0102 
##  5           0.0106           0.0454                   0.0238 
##  6           0.00675          0.0203                   0.0115 
##  7          NA               NA                       NA      
##  8           0.00702          0.0140                   0.00866
##  9           0.0147           0.0405                   0.0212 
## 10           0.0149           0.0547                   0.0292 
## 11           0.00725          0.00725                  0.00725
## 12           0.0193           0.0344                   0.0195 
## 13           0.00729          0.0216                   0.0106 
## 14          NA               NA                       NA      
## 15           0.0104           0.0173                   0.0124 
## 16           0.0106           0.0176                   0.0117 
## 17           0.0146           0.0634                   0.0282 
## 18           0.00708          0.00708                  0.00708
## 19           0.0106           0.0244                   0.0140 
## 20           0.00351          0.00702                  0.00584
## # … with 351 more rows

options(tibble.print_min = 10, tibble.print_max = 10, tibble.width = NULL)

phylo.stats.df <- read.csv("example_phylogenetic_summary.csv")
phylo.stats.df$foo

## NULL

phylo.stats$foo

## Warning: Unknown or uninitialised column: 'foo'.

## NULL
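
Two of the points in the list above can be seen in a small sketch. The data here are made up, and the object names are purely illustrative. The first part builds a tibble whose column names contain spaces; the second shows that a data.frame quietly completes a partial column name with $, whereas a tibble requires the exact name.

```r
library(tibble)

# Column names containing spaces must be wrapped in backticks
measurements <- tibble(`sample id` = c("s1", "s2"), `read count` = c(418, 97))
measurements$`read count`

# A data.frame silently expands a partial column name...
patients.df <- data.frame(participant_id = c("PAT_1", "PAT_2"),
                          stringsAsFactors = FALSE)
patients.df$part

# ...whereas a tibble returns NULL with a warning unless the name is exact
patients.tbl <- as_tibble(patients.df)
patients.tbl$part
```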

Note: tibble and the tidyverse in general are not especially built for performance. Another data.frame alternative, data.table, is not as intuitive but is very fast.

magrittr

(In case you were wondering about the name, it is a nod to René Magritte’s painting of a pipe, The Treachery of Images.)

magrittr provides R’s pipe operator, %>%, which greatly improves code readability.

By default, the left hand side of the pipe is given to the function on the right hand side as the first argument.

c("a", "b")

## [1] "a" "b"

"a" %>% c("b")

## [1] "a" "b"

If you want it to be piped to a different argument, use the dot placeholder (.):

"a" %>% c("b", .)

## [1] "b" "a"
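
A common practical use of the placeholder is with functions that do not take the data as their first argument. The sketch below uses lm from base R and a made-up data frame whose points lie exactly on a straight line; neither appears elsewhere in this document.

```r
library(magrittr)

# Made-up data lying exactly on the line y = 1 + 2x
points <- data.frame(x = 1:5, y = c(3, 5, 7, 9, 11))

# lm() takes the formula first and the data second,
# so the piped data frame is sent to wherever the dot is
fit <- points %>% lm(y ~ x, data = .)
coef(fit)
```

Because the dot appears as an argument in its own right, magrittr does not also insert the left hand side as the first argument.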

This can also be used to build functions:

f <- . %>% cos %>% sin 

f(40)

## [1] -0.6185831

Tidyverse functions usually take the data.frame (or tibble) as first argument, making it easy to construct long pipelines (see later).

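magrittr works best with functions of the form f(x, ...), where the piped value is the first argument x. Some very elementary operations, such as +, [, [[ and $, are not written that way, so magrittr provides aliases for them (add, extract, extract2, use_series and so on); see ?extract.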

1 + 2

## [1] 3

add(1,2)

## [1] 3

1 %>% add(2)

## [1] 3

foobar = list(foo = "foo", bar = "bar")

foobar["foo"]

## $foo
## [1] "foo"

foobar[["bar"]]

## [1] "bar"

foobar %>% extract("foo")

## $foo
## [1] "foo"

foobar %>% extract2("bar")

## [1] "bar"

metadata$gender

## [1] "M" "F" "M" "M" "F" "F" "M"

metadata %>% use_series("gender")

## [1] "M" "F" "M" "M" "F" "F" "M"

(The last can also be done using pull from dplyr, when applied to data frames.)

The %<>% operator sends the pipe result back to the original variable

n <- 1

n %<>% add(1)

n

## [1] 2
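
Unlike %>%, the %<>% operator is not re-exported by dplyr, so magrittr has to be attached explicitly before it can be used. A minimal sketch with a made-up vector, showing that the assignment applies to the result of the whole pipeline on the right:

```r
library(magrittr)  # attach explicitly: dplyr re-exports %>% but not %<>%

read.counts <- c(418, 97, 220)  # made-up values

# Equivalent to: read.counts <- read.counts %>% sort() %>% rev()
read.counts %<>% sort() %>% rev()
read.counts
```

Since %<>% overwrites its input, it is easy to lose the original data by accident; an ordinary assignment with <- is the more cautious choice when the original may still be needed.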

dplyr

ggplot2 aside, dplyr is probably the heart of the tidyverse. All functions take a data.frame/tibble as the first argument in normal use, for ease of piping.

To keep rows based on a condition, use filter:

metadata %>% filter(gender == "F")

## # A tibble: 3 x 4
##   participant_id gender   age dob       
##   <chr>          <chr>  <dbl> <date>    
## 1 PAT_2          F         26 1986-02-20
## 2 PAT_5          F         19 1992-08-23
## 3 PAT_6          F         20 1991-04-04

To keep them by row number, use slice:

metadata %>% slice(1:3)

## # A tibble: 3 x 4
##   participant_id gender   age dob       
##   <chr>          <chr>  <dbl> <date>    
## 1 PAT_1          M         39 1972-10-30
## 2 PAT_2          F         26 1986-02-20
## 3 PAT_3          M         25 1987-02-28

To keep a random sample, use sample_frac and sample_n:

metadata %>% sample_n(3)

## # A tibble: 3 x 4
##   participant_id gender   age dob       
##   <chr>          <chr>  <dbl> <date>    
## 1 PAT_1          M         39 1972-10-30
## 2 PAT_6          F         20 1991-04-04
## 3 PAT_2          F         26 1986-02-20
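
In dplyr 1.0.0 and later, sample_n and sample_frac are superseded by slice_sample, which covers both cases through its n and prop arguments. The sketch below assumes such a version of dplyr is installed and uses a made-up tibble rather than the metadata object.

```r
library(dplyr)

# Made-up patient table
patients <- tibble(participant_id = paste0("PAT_", 1:7),
                   age = c(39, 26, 25, 25, 19, 20, 28))

set.seed(1)  # makes the random draws reproducible

three.patients <- patients %>% slice_sample(n = 3)       # like sample_n(3)
half.patients  <- patients %>% slice_sample(prop = 0.5)  # like sample_frac(0.5)
```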

To sort, use arrange:

metadata %>% arrange(age)

## # A tibble: 7 x 4
##   participant_id gender   age dob       
##   <chr>          <chr>  <dbl> <date>    
## 1 PAT_5          F         19 1992-08-23
## 2 PAT_6          F         20 1991-04-04
## 3 PAT_3          M         25 1987-02-28
## 4 PAT_4          M         25 1986-10-11
## 5 PAT_2          F         26 1986-02-20
## 6 PAT_7          M         28 1984-01-24
## 7 PAT_1          M         39 1972-10-30

metadata %>% arrange(desc(age))

## # A tibble: 7 x 4
##   participant_id gender   age dob       
##   <chr>          <chr>  <dbl> <date>    
## 1 PAT_1          M         39 1972-10-30
## 2 PAT_7          M         28 1984-01-24
## 3 PAT_2          F         26 1986-02-20
## 4 PAT_3          M         25 1987-02-28
## 5 PAT_4          M         25 1986-10-11
## 6 PAT_6          F         20 1991-04-04
## 7 PAT_5          F         19 1992-08-23

To keep or drop columns and keep the result as a tibble, use select:

metadata %>% select(participant_id, age)

## # A tibble: 7 x 2
##   participant_id   age
##   <chr>          <dbl>
## 1 PAT_1             39
## 2 PAT_2             26
## 3 PAT_3             25
## 4 PAT_4             25
## 5 PAT_5             19
## 6 PAT_6             20
## 7 PAT_7             28

metadata %>% select(-age)

## # A tibble: 7 x 3
##   participant_id gender dob       
##   <chr>          <chr>  <date>    
## 1 PAT_1          M      1972-10-30
## 2 PAT_2          F      1986-02-20
## 3 PAT_3          M      1987-02-28
## 4 PAT_4          M      1986-10-11
## 5 PAT_5          F      1992-08-23
## 6 PAT_6          F      1991-04-04
## 7 PAT_7          M      1984-01-24

But to extract a column as a vector, use pull (or use_series):

metadata %>% pull(age)

## [1] 39 26 25 25 19 20 28

metadata %>% select(age)

## # A tibble: 7 x 1
##     age
##   <dbl>
## 1    39
## 2    26
## 3    25
## 4    25
## 5    19
## 6    20
## 7    28

Add new columns with mutate (most useful when combined with functions from purrr, so see also below):

metadata %>% mutate(age2 = age - 1)

## # A tibble: 7 x 5
##   participant_id gender   age dob         age2
##   <chr>          <chr>  <dbl> <date>     <dbl>
## 1 PAT_1          M         39 1972-10-30    38
## 2 PAT_2          F         26 1986-02-20    25
## 3 PAT_3          M         25 1987-02-28    24
## 4 PAT_4          M         25 1986-10-11    24
## 5 PAT_5          F         19 1992-08-23    18
## 6 PAT_6          F         20 1991-04-04    19
## 7 PAT_7          M         28 1984-01-24    27

metadata %>% mutate(foo = 1:7)

## # A tibble: 7 x 5
##   participant_id gender   age dob          foo
##   <chr>          <chr>  <dbl> <date>     <int>
## 1 PAT_1          M         39 1972-10-30     1
## 2 PAT_2          F         26 1986-02-20     2
## 3 PAT_3          M         25 1987-02-28     3
## 4 PAT_4          M         25 1986-10-11     4
## 5 PAT_5          F         19 1992-08-23     5
## 6 PAT_6          F         20 1991-04-04     6
## 7 PAT_7          M         28 1984-01-24     7

Use group_by and summarise to get summary statistics for each level of a variable:

metadata %>% group_by(gender)

## # A tibble: 7 x 4
## # Groups:   gender [2]
##   participant_id gender   age dob       
##   <chr>          <chr>  <dbl> <date>    
## 1 PAT_1          M         39 1972-10-30
## 2 PAT_2          F         26 1986-02-20
## 3 PAT_3          M         25 1987-02-28
## 4 PAT_4          M         25 1986-10-11
## 5 PAT_5          F         19 1992-08-23
## 6 PAT_6          F         20 1991-04-04
## 7 PAT_7          M         28 1984-01-24

metadata %>% 
  group_by(gender) %>%
  summarise(count = n(), mean.age = mean(age))

## # A tibble: 2 x 3
##   gender count mean.age
##   <chr>  <int>    <dbl>
## 1 F          3     21.7
## 2 M          4     29.2

Finally, dplyr has numerous join functions, which add columns to one tibble by looking up matching values in another.

phylo.relationships <- read_csv("example_pairwise_relationships.csv")

## Parsed with column specification:
## cols(
##   host.1 = col_character(),
##   host.2 = col_character(),
##   ancestry = col_character(),
##   ancestry.tree.count = col_double()
## )

phylo.relationships

## # A tibble: 12 x 4
##    host.1 host.2 ancestry   ancestry.tree.count
##    <chr>  <chr>  <chr>                    <dbl>
##  1 PAT_5  PAT_2  complex                      9
##  2 PAT_5  PAT_2  multiTrans                   2
##  3 PAT_5  PAT_2  noAncestry                   1
##  4 PAT_5  PAT_2  trans                        9
##  5 PAT_2  PAT_5  multiTrans                   3
##  6 PAT_2  PAT_5  trans                        4
##  7 PAT_2  PAT_1  complex                      3
##  8 PAT_2  PAT_1  multiTrans                  10
##  9 PAT_2  PAT_1  noAncestry                   2
## 10 PAT_2  PAT_1  trans                       13
## # … with 2 more rows

phylo.relationships %>% 
  left_join(metadata, by=c("host.1" = "participant_id")) %>%
  left_join(metadata, by=c("host.2" = "participant_id"))

## # A tibble: 12 x 10
##    host.1 host.2 ancestry ancestry.tree.c… gender.x age.x dob.x      gender.y
##    <chr>  <chr>  <chr>               <dbl> <chr>    <dbl> <date>     <chr>   
##  1 PAT_5  PAT_2  complex                 9 F           19 1992-08-23 F       
##  2 PAT_5  PAT_2  multiTr…                2 F           19 1992-08-23 F       
##  3 PAT_5  PAT_2  noAnces…                1 F           19 1992-08-23 F       
##  4 PAT_5  PAT_2  trans                   9 F           19 1992-08-23 F       
##  5 PAT_2  PAT_5  multiTr…                3 F           26 1986-02-20 F       
##  6 PAT_2  PAT_5  trans                   4 F           26 1986-02-20 F       
##  7 PAT_2  PAT_1  complex                 3 F           26 1986-02-20 M       
##  8 PAT_2  PAT_1  multiTr…               10 F           26 1986-02-20 M       
##  9 PAT_2  PAT_1  noAnces…                2 F           26 1986-02-20 M       
## 10 PAT_2  PAT_1  trans                  13 F           26 1986-02-20 M       
## # … with 2 more rows, and 2 more variables: age.y <dbl>, dob.y <date>

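(If by is omitted, the join uses all column names that the two tibbles share; by is needed when the key columns have different names, as here, or when only some of the shared columns should be used. The .x and .y suffixes appear when both tibbles contain non-key columns with the same name, as happens on the second join above, and can be changed using the suffix argument to the join function.)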

left_join returns a copy of the original tibble with all its rows; the new columns are filled with NA for rows with no match. Several other joins are available (right_join, inner_join, full_join, semi_join and anti_join); see the dplyr documentation.
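
The behaviour for unmatched rows and the suffix argument can be seen together in a small sketch. The two tibbles are made up (PAT_9 deliberately has no entry in the patient table), and the suffixes .host1 and .host2 are just illustrative choices.

```r
library(dplyr)

# Made-up pair and patient tables; PAT_9 is absent from patients
pairs <- tibble(host.1 = c("PAT_1", "PAT_2"),
                host.2 = c("PAT_2", "PAT_9"))
patients <- tibble(participant_id = c("PAT_1", "PAT_2"),
                   age = c(39, 26))

joined <- pairs %>%
  left_join(patients, by = c("host.1" = "participant_id")) %>%
  # The second join would clash on "age", so name the two copies explicitly
  left_join(patients, by = c("host.2" = "participant_id"),
            suffix = c(".host1", ".host2"))
joined

# anti_join keeps only the rows of pairs that have no match in patients
unmatched <- pairs %>% anti_join(patients, by = c("host.2" = "participant_id"))
```

The second row of joined keeps its place but has NA in age.host2, because PAT_9 has no entry to look up.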

purrr

purrr is extremely versatile, and I will mention only the map family, which replaces the (somewhat bewildering) apply, lapply, mapply, vapply, sapply etc from base R and is (in my opinion) much easier to remember. These functions apply the same function to each element of a vector or list. Used on the columns of a data frame (or tibble), that means applying it to one or more entries in each row, and combined with mutate from dplyr they can be used to make new columns (or replace the values of old ones).

The basic map takes one column as an argument and returns a list.

library(lubridate)  # year() comes from lubridate

metadata2 <- metadata %>% mutate(yob = map(dob, year))

metadata2

## # A tibble: 7 x 5
##   participant_id gender   age dob        yob      
##   <chr>          <chr>  <dbl> <date>     <list>   
## 1 PAT_1          M         39 1972-10-30 <dbl [1]>
## 2 PAT_2          F         26 1986-02-20 <dbl [1]>
## 3 PAT_3          M         25 1987-02-28 <dbl [1]>
## 4 PAT_4          M         25 1986-10-11 <dbl [1]>
## 5 PAT_5          F         19 1992-08-23 <dbl [1]>
## 6 PAT_6          F         20 1991-04-04 <dbl [1]>
## 7 PAT_7          M         28 1984-01-24 <dbl [1]>

metadata2 %>% pull(yob)

## [[1]]
## [1] 1972
## 
## [[2]]
## [1] 1986
## 
## [[3]]
## [1] 1987
## 
## [[4]]
## [1] 1986
## 
## [[5]]
## [1] 1992
## 
## [[6]]
## [1] 1991
## 
## [[7]]
## [1] 1984

(Both tibbles and data.frames will happily accept list-columns, that is, columns whose entries are themselves lists or vectors, although tibbles make it a lot more obvious what is going on.)

To coerce output to another data type, use map_*. So map_dbl returns numerical columns, map_chr text, and so on:

metadata2 <- metadata %>% mutate(yob = map_dbl(dob, year))

metadata2

## # A tibble: 7 x 5
##   participant_id gender   age dob          yob
##   <chr>          <chr>  <dbl> <date>     <dbl>
## 1 PAT_1          M         39 1972-10-30  1972
## 2 PAT_2          F         26 1986-02-20  1986
## 3 PAT_3          M         25 1987-02-28  1987
## 4 PAT_4          M         25 1986-10-11  1986
## 5 PAT_5          F         19 1992-08-23  1992
## 6 PAT_6          F         20 1991-04-04  1991
## 7 PAT_7          M         28 1984-01-24  1984

metadata2 <- metadata %>% mutate(yob = map_chr(dob, year))

metadata2

## # A tibble: 7 x 5
##   participant_id gender   age dob        yob        
##   <chr>          <chr>  <dbl> <date>     <chr>      
## 1 PAT_1          M         39 1972-10-30 1972.000000
## 2 PAT_2          F         26 1986-02-20 1986.000000
## 3 PAT_3          M         25 1987-02-28 1987.000000
## 4 PAT_4          M         25 1986-10-11 1986.000000
## 5 PAT_5          F         19 1992-08-23 1992.000000
## 6 PAT_6          F         20 1991-04-04 1991.000000
## 7 PAT_7          M         28 1984-01-24 1984.000000
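
The padded values above come from map_chr converting doubles to text on its own. Recent purrr releases (believed to be 1.0.0 onwards) deprecate that automatic coercion, so it is safer to do the conversion inside the mapped function. A minimal sketch, using a made-up vector of years rather than the dob column:

```r
library(purrr)

years <- c(1972, 1986, 1987)  # made-up years of birth, stored as doubles

# Convert inside the mapped function so map_chr already receives strings
yob.chr <- map_chr(years, as.character)

# If whole numbers are wanted, map_int with an explicit conversion keeps them numeric
yob.int <- map_int(years, as.integer)
```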

The basic map expects one column as an argument, while map2 expects two, and pmap a list consisting of an arbitrary number.

(I usually use map with anonymous functions.)

phylo.relationships %<>% 
  left_join(metadata, by=c("host.1" = "participant_id")) %>%
  left_join(metadata, by=c("host.2" = "participant_id"))


phylo.relationships %>%
  mutate(age.difference = map2_dbl(age.x, age.y, function(x, y){
    x - y
  })) %>%
  select(host.1, host.2, gender.x, gender.y, age.x, age.y, age.difference)

## # A tibble: 12 x 7
##    host.1 host.2 gender.x gender.y age.x age.y age.difference
##    <chr>  <chr>  <chr>    <chr>    <dbl> <dbl>          <dbl>
##  1 PAT_5  PAT_2  F        F           19    26             -7
##  2 PAT_5  PAT_2  F        F           19    26             -7
##  3 PAT_5  PAT_2  F        F           19    26             -7
##  4 PAT_5  PAT_2  F        F           19    26             -7
##  5 PAT_2  PAT_5  F        F           26    19              7
##  6 PAT_2  PAT_5  F        F           26    19              7
##  7 PAT_2  PAT_1  F        M           26    39            -13
##  8 PAT_2  PAT_1  F        M           26    39            -13
##  9 PAT_2  PAT_1  F        M           26    39            -13
## 10 PAT_2  PAT_1  F        M           26    39            -13
## # … with 2 more rows
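
purrr also accepts a formula as shorthand for an anonymous function, with .x and .y standing for the two arguments of map2. It is worth noting too that subtraction is already vectorised in R, so for a calculation this simple mutate needs no map at all; the map functions earn their keep when the function only handles one value at a time. The sketch below compares the three spellings on made-up ages.

```r
library(dplyr)
library(purrr)

# Made-up pairs of ages
ages <- tibble(age.x = c(19, 26, 26), age.y = c(26, 19, 39))

differences <- ages %>%
  mutate(
    diff.function   = map2_dbl(age.x, age.y, function(x, y) x - y),
    diff.formula    = map2_dbl(age.x, age.y, ~ .x - .y),  # formula shorthand
    diff.vectorised = age.x - age.y                       # no map needed
  )
differences
```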

random.numbers <- tibble(a = runif(10, 0, 1), b = rnorm(10, 0, 4), c=sample(10))

random.numbers

## # A tibble: 10 x 3
##        a      b     c
##    <dbl>  <dbl> <int>
##  1 0.940  3.70      1
##  2 0.921 -2.92      6
##  3 0.205  5.30      7
##  4 0.418 -4.25      4
##  5 0.327 -2.45      3
##  6 0.655 -0.512     5
##  7 0.826  1.61      9
##  8 0.290 -3.51      2
##  9 0.110  3.58      8
## 10 0.940  2.21     10

random.numbers %>% mutate(d = pmap_dbl(list(a, b, c), function(x, y, z) z + (x*y))) 

## # A tibble: 10 x 4
##        a      b     c      d
##    <dbl>  <dbl> <int>  <dbl>
##  1 0.940  3.70      1  4.47 
##  2 0.921 -2.92      6  3.31 
##  3 0.205  5.30      7  8.09 
##  4 0.418 -4.25      4  2.23 
##  5 0.327 -2.45      3  2.20 
##  6 0.655 -0.512     5  4.66 
##  7 0.826  1.61      9 10.3  
##  8 0.290 -3.51      2  0.982
##  9 0.110  3.58      8  8.39 
## 10 0.940  2.21     10 12.1
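
When the list passed to pmap is named, its names are matched to the arguments of the function, so the order of the columns in the list no longer matters. A small sketch with fixed, made-up numbers in place of the random ones above:

```r
library(dplyr)
library(purrr)

# Made-up values chosen so the results are easy to check by hand
numbers <- tibble(a = c(0.5, 0.25), b = c(2, 4), c = c(1L, 2L))

# List elements are matched to the function arguments x, y and z by name,
# even though they are supplied in a different order
numbers.d <- numbers %>%
  mutate(d = pmap_dbl(list(z = c, x = a, y = b),
                      function(x, y, z) z + (x * y)))
numbers.d
```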

tidyr

“Tidy” comes from the idea of tidy data, in which each variable is one column, and each observation one row. The tidyr package, which is quite small, provides functions that help to make your tibble look like this (or, indeed, untidy it). In perhaps more familiar terms, these are tools to convert between wide and long formats.

phylo.stats <- read_csv("example_phylogenetic_summary.csv")

## Parsed with column specification:
## cols(
##   host.id = col_character(),
##   tree.id = col_character(),
##   tips = col_double(),
##   reads = col_double(),
##   subgraphs = col_double(),
##   clades = col_double(),
##   overall.rtt = col_double(),
##   largest.rtt = col_double(),
##   max.branch.length = col_double(),
##   max.pat.distance = col_double(),
##   global.mean.pat.distance = col_double()
## )

phylo.stats

## # A tibble: 371 x 11
##    host.id tree.id  tips reads subgraphs clades overall.rtt largest.rtt
##    <chr>   <chr>   <dbl> <dbl>     <dbl>  <dbl>       <dbl>       <dbl>
##  1 PAT_3   1000_t…    11   418         1      1    0.00661     0.00661 
##  2 PAT_4   1000_t…     4    97         1      3    0.0102      0.0102  
##  3 PAT_5   1000_t…    16   220         1      9    0.0106      0.0106  
##  4 PAT_2   1000_t…     5   289         1      4    0.000235    0.000235
##  5 PAT_7   1000_t…    12   262         1      1    0.0131      0.0131  
##  6 PAT_1   1000_t…     6   703         3      3    0.00611     0       
##  7 PAT_6   1000_t…     1    53         1      1    0           0       
##  8 PAT_3   1160_t…     7   384         1      1    0.00368     0.00368 
##  9 PAT_4   1160_t…     5    75         3      3    0.0142      0.00563 
## 10 PAT_5   1160_t…    25   312         7     11    0.0132      0.00195 
## # … with 361 more rows, and 3 more variables: max.branch.length <dbl>,
## #   max.pat.distance <dbl>, global.mean.pat.distance <dbl>

This table is in long format. Observations about one patient appear on multiple rows, with a single column (tree.id) telling you something about the numbers in other columns (in this case, which phylogeny the statistics were taken from).

phylo.stats %>% filter(host.id == "PAT_1")

## # A tibble: 53 x 11
##    host.id tree.id  tips reads subgraphs clades overall.rtt largest.rtt
##    <chr>   <chr>   <dbl> <dbl>     <dbl>  <dbl>       <dbl>       <dbl>
##  1 PAT_1   1000_t…     6   703         3      3   0.00611     0        
##  2 PAT_1   1160_t…     6   669         1      4   0.000130    0.000130 
##  3 PAT_1   1320_t…     4   623         1      3   0.0000798   0.0000798
##  4 PAT_1   1480_t…     2   714         1      2   0.000001    0.000001 
##  5 PAT_1   1640_t…     7   728         1      4   0.000118    0.000118 
##  6 PAT_1   1800_t…     7   734         7      7   0.00186     0        
##  7 PAT_1   1960_t…     6   416         5      5   0.000624    0        
##  8 PAT_1   2120_t…     2   153         2      2   0.000001    0        
##  9 PAT_1   2280_t…     2   207         2      2   0.000001    0        
## 10 PAT_1   2440_t…     1    19         1      1   0           0        
## # … with 43 more rows, and 3 more variables: max.branch.length <dbl>,
## #   max.pat.distance <dbl>, global.mean.pat.distance <dbl>

To move to wide data, we make every patient a row, and every different type of observation about that patient (i.e. each stat in each window), a column. Let’s work only with the number of phylogeny tips. Then call pivot_wider:

phylo.stats %>% select(host.id, tree.id, tips)

## # A tibble: 371 x 3
##    host.id tree.id       tips
##    <chr>   <chr>        <dbl>
##  1 PAT_3   1000_to_1319    11
##  2 PAT_4   1000_to_1319     4
##  3 PAT_5   1000_to_1319    16
##  4 PAT_2   1000_to_1319     5
##  5 PAT_7   1000_to_1319    12
##  6 PAT_1   1000_to_1319     6
##  7 PAT_6   1000_to_1319     1
##  8 PAT_3   1160_to_1479     7
##  9 PAT_4   1160_to_1479     5
## 10 PAT_5   1160_to_1479    25
## # … with 361 more rows

phylo.stats.wide <- phylo.stats %>% 
  select(host.id, tree.id, tips) %>%
  pivot_wider(names_from = tree.id, values_from = tips)
phylo.stats.wide

## # A tibble: 7 x 54
##   host.id `1000_to_1319` `1160_to_1479` `1320_to_1639` `1480_to_1799`
##   <chr>            <dbl>          <dbl>          <dbl>          <dbl>
## 1 PAT_3               11              7              5              4
## 2 PAT_4                4              5              3              5
## 3 PAT_5               16             25             19             20
## 4 PAT_2                5              2              2              1
## 5 PAT_7               12              5              4              5
## 6 PAT_1                6              6              4              2
## 7 PAT_6                1              1              1              1
## # … with 49 more variables: `1640_to_1959` <dbl>, `1800_to_2119` <dbl>,
## #   `1960_to_2279` <dbl>, `2120_to_2439` <dbl>, `2280_to_2599` <dbl>,
## #   `2440_to_2759` <dbl>, `2600_to_2919` <dbl>, `2760_to_3079` <dbl>,
## #   `2920_to_3239` <dbl>, `3080_to_3399` <dbl>, `3240_to_3559` <dbl>,
## #   `3400_to_3719` <dbl>, `3560_to_3879` <dbl>, `3720_to_4039` <dbl>,
## #   `3880_to_4199` <dbl>, `4040_to_4359` <dbl>, `4200_to_4519` <dbl>,
## #   `4360_to_4679` <dbl>, `4520_to_4839` <dbl>, `4680_to_4999` <dbl>,
## #   `4840_to_5159` <dbl>, `5000_to_5319` <dbl>, `5160_to_5479` <dbl>,
## #   `520_to_839` <dbl>, `5320_to_5639` <dbl>, `5480_to_5799` <dbl>,
## #   `5640_to_5959` <dbl>, `5800_to_6119` <dbl>, `5960_to_6279` <dbl>,
## #   `6120_to_6439` <dbl>, `6280_to_6599` <dbl>, `6440_to_6759` <dbl>,
## #   `6760_to_7079` <dbl>, `680_to_999` <dbl>, `6920_to_7239` <dbl>,
## #   `7080_to_7399` <dbl>, `7240_to_7559` <dbl>, `7400_to_7719` <dbl>,
## #   `7560_to_7879` <dbl>, `7720_to_8039` <dbl>, `7880_to_8199` <dbl>,
## #   `8040_to_8359` <dbl>, `8200_to_8519` <dbl>, `8360_to_8679` <dbl>,
## #   `840_to_1159` <dbl>, `8520_to_8839` <dbl>, `8680_to_8999` <dbl>,
## #   `8840_to_9159` <dbl>, `9160_to_9479` <dbl>

Of course, you can convert back again. The second argument to pivot_longer (the first after the data, which does not appear here because of the pipe) is the set of columns to gather.

phylo.stats.wide %>% pivot_longer(2:54)

## # A tibble: 371 x 3
##    host.id name         value
##    <chr>   <chr>        <dbl>
##  1 PAT_3   1000_to_1319    11
##  2 PAT_3   1160_to_1479     7
##  3 PAT_3   1320_to_1639     5
##  4 PAT_3   1480_to_1799     4
##  5 PAT_3   1640_to_1959     6
##  6 PAT_3   1800_to_2119     3
##  7 PAT_3   1960_to_2279     3
##  8 PAT_3   2120_to_2439     3
##  9 PAT_3   2280_to_2599     2
## 10 PAT_3   2440_to_2759     2
## # … with 361 more rows

The new columns can be given custom names:

phylo.stats.wide %>% pivot_longer(2:54, names_to = "tree.id", values_to="tips")

## # A tibble: 371 x 3
##    host.id tree.id       tips
##    <chr>   <chr>        <dbl>
##  1 PAT_3   1000_to_1319    11
##  2 PAT_3   1160_to_1479     7
##  3 PAT_3   1320_to_1639     5
##  4 PAT_3   1480_to_1799     4
##  5 PAT_3   1640_to_1959     6
##  6 PAT_3   1800_to_2119     3
##  7 PAT_3   1960_to_2279     3
##  8 PAT_3   2120_to_2439     3
##  9 PAT_3   2280_to_2599     2
## 10 PAT_3   2440_to_2759     2
## # … with 361 more rows
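
Selecting columns by position (2:54) breaks if the number of windows changes. pivot_longer accepts the same selection syntax as select, so the columns to gather can instead be described as "everything except host.id". The sketch below does a round trip on a made-up four-row table so that it can be run on its own.

```r
library(tibble)
library(tidyr)

# Made-up long table: two patients, two windows
tips.long <- tibble(
  host.id = c("PAT_1", "PAT_1", "PAT_2", "PAT_2"),
  tree.id = c("1000_to_1319", "1160_to_1479", "1000_to_1319", "1160_to_1479"),
  tips    = c(6, 6, 5, 2)
)

# Long to wide: one row per patient, one column per window
tips.wide <- tips.long %>%
  pivot_wider(names_from = tree.id, values_from = tips)

# Wide to long again, gathering every column except host.id
tips.back <- tips.wide %>%
  pivot_longer(-host.id, names_to = "tree.id", values_to = "tips")
```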

This can be particularly handy when summarising.

phylo.stats.wide %>% 
  pivot_longer(2:54, names_to = "tree.id", values_to = "tips") %>%
  group_by(host.id) %>%
  summarise(mean.tips = mean(tips), max.tips = max(tips), min.tips = min(tips))

## # A tibble: 7 x 4
##   host.id mean.tips max.tips min.tips
##   <chr>       <dbl>    <dbl>    <dbl>
## 1 PAT_1       1.98         7        0
## 2 PAT_2       3           14        1
## 3 PAT_3       2.30        11        0
## 4 PAT_4       0.717        6        0
## 5 PAT_5       4.64        25        0
## 6 PAT_6       0.226        2        0
## 7 PAT_7       1.79        12        0

glue

glue replaces paste("foo", "bar", sep = "_") with:

foo <- "foo"
bar <- "bar"


glue("{foo}_{bar}")

## foo_bar
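
The braces are not limited to variable names: whatever is inside them is evaluated as R code. The sketch below checks that glue gives the same text as paste and then uses two base R functions, toupper and nchar, inside the braces.

```r
library(glue)

foo <- "foo"
bar <- "bar"

# Same text as paste(); glue returns a "glue" object, so convert it to compare
glued <- glue("{foo}_{bar}")
identical(as.character(glued), paste(foo, bar, sep = "_"))

# Anything inside the braces is evaluated as R code
shouted <- glue("{toupper(foo)}_{nchar(bar)}")
shouted
```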

This can be fully integrated with other tidyverse commands, of course.

metadata %>%
  mutate(full.gender = map_chr(gender, function(x){
    ifelse(x == "M", "male", "female")
  })) %>%
  mutate(description = glue("Patient ID {participant_id} is a {full.gender} of age {age}")) %>%
  pull(description)

## Patient ID PAT_1 is a male of age 39
## Patient ID PAT_2 is a female of age 26
## Patient ID PAT_3 is a male of age 25
## Patient ID PAT_4 is a male of age 25
## Patient ID PAT_5 is a female of age 19
## Patient ID PAT_6 is a female of age 20
## Patient ID PAT_7 is a male of age 28
