class: title-slide, center

<span class="fa-stack fa-4x">
  <i class="fa fa-circle fa-stack-2x" style="color: #ffffff;"></i>
  <strong class="fa-stack-1x" style="color:#E7553C;">0</strong>
</span>

# Introduction to Tidyverse

##

### Paul E. Santos Andrade

#### Plant Functional Trait Course - PFCT5

---

layout: true

<div class="my-footer"><span>Plant Functional Trait Course - PFCT5</span></div>

---

# Getting Started

You will learn the basics of
R and data science, and practice using the RStudio IDE (integrated development environment). We'll discuss data transformation and tidying with **dplyr** and **tidyr**. Whether you want to move from coding in base R to the **tidyverse** or jump straight into data science in the tidyverse without any prior R experience, this is where we take the first steps!

---

## Prework

You may need to install the following:

1. A recent version of R (~3.6.3), which is available for free at [cran.r-project.org](http://www.cran.r-project.org)

2. A recent version of RStudio IDE (~1.2.5033), available for free at [www.rstudio.com/download](http://www.rstudio.com/download).

---

## Prework

3. The set of relevant R packages, which you can install by connecting to the internet, opening RStudio, and running:

```r
packages <- c("tidyverse", "janitor", "skimr", "here",
              "writexl", "readxl", "cowplot", "patchwork",
              "RColorBrewer", "learnr")

install.packages(packages)
```

---

# Using packages

.pull-left[
## 1

```r
install.packages("vegan")
```

Download files

One time per computer
]

.pull-right[
## 2

```r
library(vegan)
```

In each session
]

---
class: inverse

# Your Turn
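---

## Checking what is installed

A minimal sketch, not part of the original prework: it checks which packages from a shortened, illustrative list can already be loaded, so you know what is still left to install. The names `is_installed` and `missing_pkgs` are made up for this example.

```r
# Shortened, illustrative list; use the full prework list in practice.
packages <- c("tidyverse", "janitor", "here")

# requireNamespace() returns TRUE when a package can be loaded
# and FALSE otherwise (it does not attach the package).
is_installed <- vapply(packages, requireNamespace, logical(1), quietly = TRUE)

# Packages that still need install.packages() (one time per computer).
missing_pkgs <- packages[!is_installed]

if (length(missing_pkgs) > 0) {
  message("Still to install: ", paste(missing_pkgs, collapse = ", "))
} else {
  message("All packages are installed.")
}
```

Once nothing is missing, `library()` attaches each package at the start of every session.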
---
class: middle, center

.left-narrow[
## R for Data Science

### [English version](http://r4ds.had.co.nz/)

### [Spanish translation](https://es.r4ds.hadley.nz/)
]

.right-wide[
<img src="images/cover.png" width="50%" style="display: block; margin: auto;" />
]

---

## Data science workflow

<img src="images/data_science.png" width="80%" style="display: block; margin: auto;" />

---

## Data science workflow

<img src="images/horst-eco-r4ds-env-only.png" width="65%" style="display: block; margin: auto;" />

##### by [Allison Horst](https://github.com/allisonhorst/stats-illustrations)

---

## Data science workflow

<img src="images/horst-eco-r4ds-env-comm-only.png" width="65%" style="display: block; margin: auto;" />

##### by [Allison Horst](https://github.com/allisonhorst/stats-illustrations)

---

## Data science workflow

<img src="images/horst-eco-r4ds.png" width="65%" style="display: block; margin: auto;" />

##### by [Allison Horst](https://github.com/allisonhorst/stats-illustrations)

---

# [tidyverse.org](https://www.tidyverse.org/)

<img src="images/tidyorg.png" width="65%" style="display: block; margin: auto;" />

---
class: middle, center

## RStudio Projects

.left-narrow[
Keep all the files associated with a project organized together: input data, R scripts, results, and figures.
]

.right-wide[
<img src="images/project.png" width="50%" style="display: block; margin: auto;" />
]

---

## **here()**. A simpler way to find your files.
<img src="https://raw.githubusercontent.com/allisonhorst/stats-illustrations/master/rstats-artwork/here.png" width="50%" style="display: block; margin: auto;" />

##### by [Allison Horst](https://github.com/allisonhorst/stats-illustrations)

---
class: center

## Import data

.pull-left[
<img src="images/readr.png" width="45%" style="display: block; margin: auto;" />

```r
library(readr)
read_csv(".../")
read_tsv(".../")
read_delim(".../")
```
]

.pull-right[
<img src="images/readxl.png" width="45%" style="display: block; margin: auto;" />

```r
library(readxl)
read_xls(".../")
read_xlsx(".../")
```
]

---

## Import data

```r
chem_trait <- read_csv("data/PFTC1.2_China_2015_2016_ChemicalTraits.csv")
```

```r
chem_trait %>% 
  names()
##  [1] "Date"         "Elevation"    "Site"         "destBlockID"  "Treatment"   
##  [6] "Taxon"        "StoichLabel"  "P_percent"    "C_percent"    "N_percent"   
## [11] "CN_ratio"     "dN15_percent" "dC13_percent" "n"            "CNP_Comment"
```

---

# The pipe operator

Passes the result on the left into the first argument of the function on the right.

<img src="images/pipe.png" width="30%" style="display: block; margin: auto;" />

```r
clean_names(chem_trait)

chem_trait %>% 
  clean_names()
```

---

## **janitor::clean_names()**.

```r
chem_trait <- read_csv("data/PFTC1.2_China_2015_2016_ChemicalTraits.csv") %>% 
  clean_names()
```

```r
chem_trait %>% 
  names()
##  [1] "date"          "elevation"     "site"          "dest_block_id"
##  [5] "treatment"     "taxon"         "stoich_label"  "p_percent"    
##  [9] "c_percent"     "n_percent"     "cn_ratio"      "d_n15_percent"
## [13] "d_c13_percent" "n"             "cnp_comment"
```

---
class: inverse

# Your Turn
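---

## The pipe, side by side

A small sketch of the pipe on a made-up table, so it runs without the course CSV. The table `raw` is an assumption for illustration; it only reuses a few column names from the slides. It shows that the nested call and the piped call give the same result.

```r
library(dplyr)    # provides %>% and tibble()
library(janitor)  # provides clean_names()

# Made-up two-row table with "messy" column names (illustration only).
raw <- tibble(
  Date        = as.Date(c("2015-08-20", "2015-08-20")),
  destBlockID = c("MO", "AO"),
  P_percent   = c(0.0984, 0.218)
)

# Nested call: read from the inside out.
nested <- names(clean_names(raw))

# Piped call: read from left to right, top to bottom.
piped <- raw %>% 
  clean_names() %>% 
  names()

identical(nested, piped)  # TRUE: both forms are the same computation
```

The piped form reads in the order the steps happen, which is why the rest of the examples use it.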
---
class: inverse

## dplyr

.pull-left[
<img src="images/dplyr.png" width="70%" style="display: block; margin: auto;" />
]

.pull-right[
A package that transforms data.

**dplyr** implements a grammar for transforming tabular data.
]

---

# Isolating data

<img src="images/isolating.png" width="70%" style="display: block; margin: auto;" />

---

## select()

```r
chem_trait
```

```
## # A tibble: 576 x 15
##    date       elevation site  dest_block_id treatment taxon stoich_label
##    <date>         <dbl> <chr> <chr>         <chr>     <chr> <chr>
##  1 2015-08-20      3500 M     MO            LOCAL     Poly~ 164
##  2 2015-08-20      3850 A     AO            LOCAL     Poly~ 198
##  3 2015-08-20      3500 M     MO            LOCAL     Poly~ 163
##  4 2015-08-20      3000 L     LO            LOCAL     Poly~ 113
##  5 2015-08-20      4100 H     HO            LOCAL     Rhod~ 224
##  6 2015-08-20      3500 M     MO            LOCAL     Rhod~ 172
##  7 2015-08-20      3850 A     AO            LOCAL     Hemi~ 186
##  8 2015-08-20      3500 M     MO            LOCAL     Clin~ 134
##  9 2015-08-20      3000 L     LO            LOCAL     Swer~ 122
## 10 2015-08-20      3500 M     MO            LOCAL     Hemi~ 145
## # ... with 566 more rows, and 8 more variables: p_percent <dbl>,
## #   c_percent <dbl>, n_percent <dbl>, cn_ratio <dbl>, d_n15_percent <dbl>,
## #   d_c13_percent <dbl>, n <dbl>, cnp_comment <chr>
```

---

## select() - Extract columns by name.

```r
chem_trait %>% 
  select(date, site, taxon)
```

```
## # A tibble: 576 x 3
##    date       site  taxon
##    <date>     <chr> <chr>
##  1 2015-08-20 M     Polygonum cyanandrum
##  2 2015-08-20 A     Polygonum cyanandrum
##  3 2015-08-20 M     Polygonum cyanandrum
##  4 2015-08-20 L     Polygonum cyanandrum
##  5 2015-08-20 H     Rhodiola fastigiata
##  6 2015-08-20 M     Rhodiola yunnanensis
##  7 2015-08-20 A     Hemiphragma heterophyllum
##  8 2015-08-20 M     Clinopodium polycephalum
##  9 2015-08-20 L     Swertia macrosperma
## 10 2015-08-20 M     Hemiphragma heterophyllum
## # ... with 566 more rows
```

---

## select() - Select every column except those named

```r
chem_trait %>% 
  select(-c(date, elevation, site))
```

```
## # A tibble: 576 x 12
##    dest_block_id treatment taxon stoich_label p_percent c_percent n_percent
##    <chr>         <chr>     <chr> <chr>            <dbl>     <dbl>     <dbl>
##  1 MO            LOCAL     Poly~ 164             0.0984      44.0      2.48
##  2 AO            LOCAL     Poly~ 198             0.218       44.1      2.85
##  3 MO            LOCAL     Poly~ 163             0.0983      45.8      3.06
##  4 LO            LOCAL     Poly~ 113             0.0587      44.6      2.51
##  5 HO            LOCAL     Rhod~ 224             0.128       46.6      2.64
##  6 MO            LOCAL     Rhod~ 172             0.101       42.2      2.14
##  7 AO            LOCAL     Hemi~ 186             0.122       47.4      1.83
##  8 MO            LOCAL     Clin~ 134             0.124       46.0      3.05
##  9 LO            LOCAL     Swer~ 122             0.206       48.7      2.43
## 10 MO            LOCAL     Hemi~ 145             0.0632      48.6      2.07
## # ... with 566 more rows, and 5 more variables: cn_ratio <dbl>,
## #   d_n15_percent <dbl>, d_c13_percent <dbl>, n <dbl>, cnp_comment <chr>
```

---

## select() - Rename variables

.pull-left[
```r
chem_trait %>% 
  select(
    date, 
    site)
```

```
## # A tibble: 576 x 2
##    date       site
##    <date>     <chr>
##  1 2015-08-20 M
##  2 2015-08-20 A
##  3 2015-08-20 M
##  4 2015-08-20 L
##  5 2015-08-20 H
##  6 2015-08-20 M
##  7 2015-08-20 A
##  8 2015-08-20 M
##  9 2015-08-20 L
## 10 2015-08-20 M
## # ... with 566 more rows
```
]

.pull-right[
```r
chem_trait %>% 
  select(
    time = date, 
    location = site)
```

```
## # A tibble: 576 x 2
##    time       location
##    <date>     <chr>
##  1 2015-08-20 M
##  2 2015-08-20 A
##  3 2015-08-20 M
##  4 2015-08-20 L
##  5 2015-08-20 H
##  6 2015-08-20 M
##  7 2015-08-20 A
##  8 2015-08-20 M
##  9 2015-08-20 L
## 10 2015-08-20 M
## # ... with 566 more rows
```
]

---

#### select() **:** - Select a range of columns

```r
chem_trait %>% 
  select(date:taxon)
```

```
## # A tibble: 576 x 6
##    date       elevation site  dest_block_id treatment taxon
##    <date>         <dbl> <chr> <chr>         <chr>     <chr>
##  1 2015-08-20      3500 M     MO            LOCAL     Polygonum cyanandrum
##  2 2015-08-20      3850 A     AO            LOCAL     Polygonum cyanandrum
##  3 2015-08-20      3500 M     MO            LOCAL     Polygonum cyanandrum
##  4 2015-08-20      3000 L     LO            LOCAL     Polygonum cyanandrum
##  5 2015-08-20      4100 H     HO            LOCAL     Rhodiola fastigiata
##  6 2015-08-20      3500 M     MO            LOCAL     Rhodiola yunnanensis
##  7 2015-08-20      3850 A     AO            LOCAL     Hemiphragma heterophyllum
##  8 2015-08-20      3500 M     MO            LOCAL     Clinopodium polycephalum
##  9 2015-08-20      3000 L     LO            LOCAL     Swertia macrosperma
## 10 2015-08-20      3500 M     MO            LOCAL     Hemiphragma heterophyllum
## # ... with 566 more rows
```

---

#### select() **starts_with()** - Select columns that start with...

```r
chem_trait %>% 
  select(starts_with("cn"))
```

```
## # A tibble: 576 x 2
##    cn_ratio cnp_comment
##       <dbl> <chr>
##  1     17.8 <NA>
##  2     15.5 <NA>
##  3     15.0 <NA>
##  4     17.8 <NA>
##  5     17.6 <NA>
##  6     19.7 <NA>
##  7     25.8 <NA>
##  8     15.1 <NA>
##  9     20.1 <NA>
## 10     23.4 <NA>
## # ... with 566 more rows
```

---

## **filter()** - Extract rows that meet logical criteria

```r
chem_trait %>% 
  select(treatment, taxon) %>% 
  filter(treatment == "OTC")
```

```
## # A tibble: 74 x 2
##    treatment taxon
##    <chr>     <chr>
##  1 OTC       Potentilla leuconota
##  2 OTC       Potentilla leuconota
##  3 OTC       Plantago asiatica
##  4 OTC       Potentilla leuconota
##  5 OTC       Plantago asiatica
##  6 OTC       Potentilla leuconota
##  7 OTC       Potentilla leuconota
##  8 OTC       Epilobium fangii
##  9 OTC       Viola biflora var. rockiana
## 10 OTC       Artemisia flaccida
## # ... with 64 more rows
```

---

## filter( )

```r
chem_trait %>% 
  select(treatment, taxon) %>% 
  filter(treatment == "OTC", 
         taxon == "Epilobium fangii")
```

```
## # A tibble: 9 x 2
##   treatment taxon
##   <chr>     <chr>
## 1 OTC       Epilobium fangii
## 2 OTC       Epilobium fangii
## 3 OTC       Epilobium fangii
## 4 OTC       Epilobium fangii
## 5 OTC       Epilobium fangii
## 6 OTC       Epilobium fangii
## 7 OTC       Epilobium fangii
## 8 OTC       Epilobium fangii
## 9 OTC       Epilobium fangii
```

---

## Logical tests

|Operator| Meaning|
|:--------|:-------------|
|x < y| Less than |
|x > y| Greater than|
|x == y| Equal to|
|x <= y| Less than or equal to|
|x >= y| Greater than or equal to|
|x != y| Not equal to |
|x %in% y| Group membership|
|is.na(x)| Is NA |
|!is.na(x)| Is not NA|

---

## Boolean operators

|Operator| Meaning|
|:--------|:--------------|
|a & b| and|
|a \| b| or|
|!a| not |
|a %in% c(a, b)| one of (in)|

---

### **%in%**

```r
chem_trait %>% 
  select(site, treatment) %>% 
  filter(treatment %in% c("LOCAL", "OTC", "O"))
```

```
## # A tibble: 8 x 2
##   site  treatment
##   <chr> <chr>
## 1 M     LOCAL
## 2 A     LOCAL
## 3 L     LOCAL
## 4 H     LOCAL
## 5 L     OTC
## 6 A     OTC
## 7 M     OTC
## 8 H     OTC
```

---

## arrange( )

.pull-left[
```r
chem_trait %>% 
  select(elevation, site)
## # A tibble: 576 x 2
##    elevation site
##        <dbl> <chr>
##  1      3500 M
##  2      3850 A
##  3      3500 M
##  4      3000 L
##  5      4100 H
##  6      3500 M
##  7      3850 A
##  8      3500 M
##  9      3000 L
## 10      3500 M
## # ... with 566 more rows
```
]

.pull-right[
```r
chem_trait %>% 
  select(elevation, site) %>% 
  arrange(elevation)
## # A tibble: 576 x 2
##    elevation site
##        <dbl> <chr>
##  1      3000 L
##  2      3000 L
##  3      3000 L
##  4      3000 L
##  5      3000 L
##  6      3000 L
##  7      3000 L
##  8      3000 L
##  9      3000 L
## 10      3000 L
## # ... with 566 more rows
```
]

---

## arrange(. , desc())

.pull-left[
```r
chem_trait %>% 
  select(elevation, site)
## # A tibble: 576 x 2
##    elevation site
##        <dbl> <chr>
##  1      3500 M
##  2      3850 A
##  3      3500 M
##  4      3000 L
##  5      4100 H
##  6      3500 M
##  7      3850 A
##  8      3500 M
##  9      3000 L
## 10      3500 M
## # ... with 566 more rows
```
]

.pull-right[
```r
chem_trait %>% 
  select(elevation, site) %>% 
  arrange(desc(elevation))
## # A tibble: 576 x 2
##    elevation site
##        <dbl> <chr>
##  1      4100 H
##  2      4100 H
##  3      4100 H
##  4      4100 H
##  5      4100 H
##  6      4100 H
##  7      4100 H
##  8      4100 H
##  9      4100 H
## 10      4100 H
## # ... with 566 more rows
```
]

---
class: inverse

# Your Turn
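---

## filter(), select() and arrange() together

The isolating verbs can be chained in one pipeline. A minimal sketch on a made-up five-row table; `toy` and its row combinations are invented for illustration and are not taken from the course data.

```r
library(dplyr)

# Invented rows that mimic the structure of chem_trait (illustration only).
toy <- tibble(
  site      = c("M", "A", "L", "H", "M"),
  elevation = c(3500, 3850, 3000, 4100, 3500),
  treatment = c("LOCAL", "OTC", "LOCAL", "OTC", "C"),
  taxon     = c("Polygonum cyanandrum", "Epilobium fangii",
                "Swertia macrosperma", "Rhodiola fastigiata",
                "Plantago asiatica")
)

result <- toy %>% 
  # keep rows that match BOTH conditions (a comma works like &)
  filter(treatment %in% c("LOCAL", "OTC"), elevation >= 3500) %>% 
  # keep only the columns we need
  select(site, elevation, taxon) %>% 
  # highest elevation first
  arrange(desc(elevation))

result$site  # "H" "A" "M"
```

Filtering before selecting matters here: `treatment` has to exist when `filter()` runs, so it is dropped only afterwards.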
---
class: center, middle, inverse

## Deriving Information
---

### mutate() - create new variables

```r
chem_trait %>% 
  select(site, elevation, c_percent) %>% 
  mutate(c_percent_prop = c_percent/10)
```

```
## # A tibble: 576 x 4
##    site  elevation c_percent c_percent_prop
##    <chr>     <dbl>     <dbl>          <dbl>
##  1 M          3500      44.0           4.40
##  2 A          3850      44.1           4.41
##  3 M          3500      45.8           4.58
##  4 L          3000      44.6           4.46
##  5 H          4100      46.6           4.66
##  6 M          3500      42.2           4.22
##  7 A          3850      47.4           4.74
##  8 M          3500      46.0           4.60
##  9 L          3000      48.7           4.87
## 10 M          3500      48.6           4.86
## # ... with 566 more rows
```

---

### mutate()

```r
chem_trait %>% 
  select(site, elevation, c_percent, p_percent) %>% 
  mutate(c_p_percent = c_percent + p_percent)
```

```
## # A tibble: 576 x 5
##    site  elevation c_percent p_percent c_p_percent
##    <chr>     <dbl>     <dbl>     <dbl>       <dbl>
##  1 M          3500      44.0    0.0984        44.1
##  2 A          3850      44.1    0.218         44.4
##  3 M          3500      45.8    0.0983        45.9
##  4 L          3000      44.6    0.0587        44.7
##  5 H          4100      46.6    0.128         46.7
##  6 M          3500      42.2    0.101         42.3
##  7 A          3850      47.4    0.122         47.5
##  8 M          3500      46.0    0.124         46.1
##  9 L          3000      48.7    0.206         49.0
## 10 M          3500      48.6    0.0632        48.7
## # ... with 566 more rows
```

---

## summarise() - summarize()

```r
chem_trait %>% 
  summarise(lowest_elev = min(elevation),
            highest_elev = max(elevation))
```

```
## # A tibble: 1 x 2
##   lowest_elev highest_elev
##         <dbl>        <dbl>
## 1        3000         4100
```

---

## summarise()

### n(), number of observations

### n_distinct(), number of unique values

```r
chem_trait %>% 
  summarise(n_entries = n(),
            n_treatment = n_distinct(treatment))
```

```
## # A tibble: 1 x 2
##   n_entries n_treatment
##       <int>       <int>
## 1       576           8
```

---

## Number of species

```r
chem_trait %>% 
  summarise(n_species = n_distinct(taxon))
```

```
## # A tibble: 1 x 1
##   n_species
##       <int>
## 1        63
```

---
class: inverse

# Your Turn
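---

## mutate() then summarise()

`mutate()` keeps every row and adds columns; `summarise()` collapses the table to one row. A minimal sketch on a made-up three-row table (`toy`, `c_prop` and `cn` are names invented for this example):

```r
library(dplyr)

# Invented rows with the same column names as chem_trait (illustration only).
toy <- tibble(
  site      = c("M", "A", "M"),
  c_percent = c(44.0, 44.1, 45.8),
  n_percent = c(2.48, 2.85, 3.06)
)

# mutate(): same number of rows, two new columns.
derived <- toy %>% 
  mutate(c_prop = c_percent / 100,        # percent to proportion
         cn     = c_percent / n_percent)  # ratio of two existing columns

# summarise(): one row that describes the whole table.
overview <- derived %>% 
  summarise(n_rows     = n(),
            n_sites    = n_distinct(site),
            max_c_prop = max(c_prop))

nrow(derived)   # 3
nrow(overview)  # 1
```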
---
class: inverse, center, middle

# Splitting the data
---

## **group_by()**

Takes an existing tbl and converts it into a grouped tbl where operations are performed *by group*.

<img src="images/group_by.png" width="75%" style="display: block; margin: auto;" />

---

#### group_by() %>% summarise()

```r
chem_trait %>% 
  group_by(site) %>% 
  summarise(n_samples = n(),
            n_species = n_distinct(taxon),
            mean_p_percent = mean(p_percent))
```

```
## # A tibble: 4 x 4
##   site  n_samples n_species mean_p_percent
##   <chr>     <int>     <int>          <dbl>
## 1 A           144        25         NA
## 2 H           107        25          0.154
## 3 L           113        24         NA
## 4 M           212        31          0.145
```

---

#### group_by() %>% summarise()

```r
chem_trait %>% 
  group_by(site, treatment) %>% 
  summarise(n_samples = n(),
            n_species = n_distinct(taxon),
            mean_p_percent = mean(p_percent))
```

```
## # A tibble: 24 x 5
## # Groups:   site [4]
##    site  treatment n_samples n_species mean_p_percent
##    <chr> <chr>         <int>     <int>          <dbl>
##  1 A     0                16         4          0.165
##  2 A     1                12         6          0.180
##  3 A     2                16         7          0.167
##  4 A     C                20         5          0.164
##  5 A     LOCAL            55        15         NA
##  6 A     OTC              25         6          0.157
##  7 H     0                 5         2          0.160
##  8 H     2                21         6          0.160
##  9 H     4                 7         3          0.124
## 10 H     C                 8         2          0.155
## # ... with 14 more rows
```

---

```r
chem_trait %>% 
  group_by(site) %>% 
  summarise(n_samples = n(),
            n_species = n_distinct(taxon),
            mean_p_percent = mean(p_percent),
*           n_of_na = sum(is.na(p_percent)))
```

```
## # A tibble: 4 x 5
##   site  n_samples n_species mean_p_percent n_of_na
##   <chr>     <int>     <int>          <dbl>   <int>
## 1 A           144        25         NA           1
## 2 H           107        25          0.154       0
## 3 L           113        24         NA           1
## 4 M           212        31          0.145       0
```

---

```r
chem_trait %>% 
  group_by(site) %>% 
  summarise(n_samples = n(),
            n_species = n_distinct(taxon),
*           mean_p_percent = mean(p_percent, na.rm = TRUE),
*           n_of_na = sum(is.na(p_percent)))
```

```
## # A tibble: 4 x 5
##   site  n_samples n_species mean_p_percent n_of_na
##   <chr>     <int>     <int>          <dbl>   <int>
## 1 A           144        25          0.172       1
## 2 H           107        25          0.154       0
## 3 L           113        24          0.176       1
## 4 M           212        31          0.145       0
```

---

### ungroup()

```r
chem_trait %>% 
  group_by(site) %>% 
  summarise(n_samples = n(),
            n_species = n_distinct(taxon),
            mean_p_percent = mean(p_percent, na.rm = TRUE)) %>% 
* ungroup()
```

```
## # A tibble: 4 x 4
##   site  n_samples n_species mean_p_percent
##   <chr>     <int>     <int>          <dbl>
## 1 A           144        25          0.172
## 2 H           107        25          0.154
## 3 L           113        24          0.176
## 4 M           212        31          0.145
```

---
class: inverse

# Your Turn
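---

## Why one NA gives an NA mean

A single missing value makes `mean()` return `NA` for its whole group unless `na.rm = TRUE` is set. A minimal sketch on a made-up four-row table; the values in `toy` are invented so the arithmetic is easy to check by hand.

```r
library(dplyr)

# Invented values; site A has one missing measurement.
toy <- tibble(
  site      = c("A", "A", "M", "M"),
  p_percent = c(0.2, NA, 0.1, 0.3)
)

by_site <- toy %>% 
  group_by(site) %>% 
  summarise(mean_raw   = mean(p_percent),                # NA if any value is NA
            mean_clean = mean(p_percent, na.rm = TRUE),  # NAs dropped first
            n_na       = sum(is.na(p_percent))) %>%      # how many were dropped
  ungroup()

by_site$mean_raw    # NA 0.2
by_site$mean_clean  # 0.2 0.2
by_site$n_na        # 1 0
```

Reporting `n_na` next to the mean keeps it visible how many values each group mean is missing.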
---
class: inverse

## tidyr

.pull-left[
<img src="images/tidyr.png" width="70%" style="display: block; margin: auto;" />
]

.pull-right[
A package that **reshapes** the layout of tabular data.
]

---
class: center, middle, inverse

# "Data comes in many formats, but R prefers just one: **tidy data**."

- Garrett Grolemund

---

# Tidy Data

.pull-left[
A data set is **tidy** if:

1. Each variable is in its own `column`.

2. Each case is in its own `row`.

3. Each value is in its own `cell`.
]

.pull-right[
<img src="images/tidy_data.png" width="80%" style="display: block; margin: auto;" />
]

---
class: center, middle, inverse

# "Tidy data sets are all alike; but every messy data set is messy in its own way."

- Hadley Wickham

---

## separate()

```r
chem_trait %>% 
  select(site, taxon) %>% 
* separate(taxon, c("genus", "species"))
```

```
## # A tibble: 576 x 3
##    site  genus       species
##    <chr> <chr>       <chr>
##  1 M     Polygonum   cyanandrum
##  2 A     Polygonum   cyanandrum
##  3 M     Polygonum   cyanandrum
##  4 L     Polygonum   cyanandrum
##  5 H     Rhodiola    fastigiata
##  6 M     Rhodiola    yunnanensis
##  7 A     Hemiphragma heterophyllum
##  8 M     Clinopodium polycephalum
##  9 L     Swertia     macrosperma
## 10 M     Hemiphragma heterophyllum
## # ... with 566 more rows
```

---

## unite()

.pull-left[
```r
chem_trait %>% 
  select(site, dest_block_id)
```

```
## # A tibble: 576 x 2
##    site  dest_block_id
##    <chr> <chr>
##  1 M     MO
##  2 A     AO
##  3 M     MO
##  4 L     LO
##  5 H     HO
##  6 M     MO
##  7 A     AO
##  8 M     MO
##  9 L     LO
## 10 M     MO
## # ... with 566 more rows
```
]

.pull-right[
```r
chem_trait %>% 
  select(site, dest_block_id) %>% 
  unite("site_dest", site, dest_block_id, sep = "_" )
```

```
## # A tibble: 576 x 1
##    site_dest
##    <chr>
##  1 M_MO
##  2 A_AO
##  3 M_MO
##  4 L_LO
##  5 H_HO
##  6 M_MO
##  7 A_AO
##  8 M_MO
##  9 L_LO
## 10 M_MO
## # ... with 566 more rows
```
]

---

## pivot_wider()

```r
df %>% 
* pivot_wider(everything(), names_from = "...", values_from = "...")
```

```r
df %>% 
* pivot_wider(-c(...), names_from = "...", values_from = "...")
```

```r
df %>% 
* pivot_wider(-var, names_from = "...", values_from = "...")
```

---

## pivot_wider()

![](images/spread.gif)

---

## pivot_longer()

```r
df %>% 
* pivot_longer(cols = ,
               names_to = "...",
               values_to = "...")
```

```r
df %>% 
  pivot_longer(cols = ,
*              names_to = "...",
               values_to = "...")
```

```r
df %>% 
  pivot_longer(cols = ,
               names_to = "...",
*              values_to = "...")
```

---

## pivot_longer()

![](images/pivot_longer_smaller.gif)

---
class: inverse, right, middle

<!--customize this!-->
<img style="border-radius: 60%;" src="images/DSC_0140.JPG" width="250px"/>

# Find me at...

[@PaulEfrenSantos](https://twitter.com/PaulEfrenSantos)

[@PaulESantos](https://github.com/PaulESantos)

[paulefrensa.rbind.io](https://paulefrensa.rbind.io)
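---

## pivot_longer() and pivot_wider(): a worked sketch

The templates in the tidyr slides use `"..."` placeholders. As one possible way to fill them in, here is a round trip on a made-up two-row table; `wide`, `trait` and `value` are names invented for this example.

```r
library(dplyr)
library(tidyr)

# Invented wide table: one column per trait (illustration only).
wide <- tibble(
  site      = c("A", "M"),
  c_percent = c(44.1, 44.0),
  n_percent = c(2.85, 2.48)
)

# Wide to long: trait names go into one column, measurements into another.
long <- wide %>% 
  pivot_longer(cols = c(c_percent, n_percent),
               names_to = "trait",
               values_to = "value")

# Long to wide: the inverse operation restores the original layout.
back <- long %>% 
  pivot_wider(names_from = trait,
              values_from = value)

nrow(long)   # 4: two sites times two traits
names(long)  # "site" "trait" "value"
```

In the long layout each row holds a single measurement, which is the shape that `group_by()` and `summarise()` work with most easily.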