This tutorial introduces how to easily manipulate data in R using the tidyverse package. Data manipulation usually involves

  • computing summary statistics.

  • rows filtering (filter()) and ordering (arrange()).

  • renaming (rename()) and selecting certain columns (select()

and adding columns

  • mutate(): compute and add new variables into a data table. It preserves existing variables and adds new columns at the end of your dataset and

  • transmute(): compute new columns and only keep the new columns,

setwd(“wdirectory”) Changes the current working directory to wdirectory.

setwd('C:/Users/USER/Desktop/JUPYTER_NOTEBOOK/A_MYTUTORIALS/MYR')

Load the needed packages

library(devtools)
library(tidyverse)
library(nycflights13)
library(readxl)
Loading required package: usethis

Warning message in (function (kind = NULL, normal.kind = NULL, sample.kind = NULL) :
"non-uniform 'Rounding' sampler used"
-- Attaching packages ------------------------------------------------------------------------------- tidyverse 1.3.1 --

v ggplot2 3.3.3     v purrr   0.3.4
v tibble  3.1.1     v dplyr   1.0.5
v tidyr   1.1.3     v stringr 1.4.0
v readr   1.4.0     v forcats 0.5.1

Warning message in (function (kind = NULL, normal.kind = NULL, sample.kind = NULL) :
"non-uniform 'Rounding' sampler used"
-- Conflicts ---------------------------------------------------------------------------------- tidyverse_conflicts() --
x dplyr::filter() masks stats::filter()
x dplyr::lag()    masks stats::lag()

Datatset

We will use titanic dataset. This dataset has 1309 observations with 14 variables. To explore the basic data manipulation verbs of dplyr, we start by converting the data into a tibble data frame for easier data manipulation

my_data<-as_tibble(read_xls('data/titanic.xls'))
names(my_data)
Warning message in read_fun(path = enc2native(normalizePath(path)), sheet_i = sheet, :
"Coercing text to numeric in M1306 / R1306C13: '328'"

Variable Selection

we can select or subset variables by names or position.

Under variable selection we will learn how to use

  • select(): allow us to extract variables or variables as a data table and can also be used to remove variables from the data frame.

  • select_if(): Select variabless based on a particular condition.

  • Variabl Selection by position

select from the my_data variable positioned from 1 to 4 inclusive

head(my_data %>% select(1:4))
A tibble: 6 × 4
pclasssurvivednamesex
<dbl><dbl><chr><chr>
11Allen, Miss. Elisabeth Walton female
11Allison, Master. Hudson Trevor male
10Allison, Miss. Helen Loraine female
10Allison, Mr. Hudson Joshua Creighton male
10Allison, Mrs. Hudson J C (Bessie Waldo Daniels)female
11Anderson, Mr. Harry male

select variables positioned at 1, 4,6,7


head(my_data %>% select(1,4,6,7))

A tibble: 6 × 4
pclasssexsibspparch
<dbl><chr><dbl><dbl>
1female00
1male 12
1female12
1male 12
1female12
1male 00

Select varaibles by name

head(select(my_data,pclass,name:age),3)
A tibble: 3 × 4
pclassnamesexage
<dbl><chr><chr><dbl>
1Allen, Miss. Elisabeth Walton female29.0000
1Allison, Master. Hudson Trevormale 0.9167
1Allison, Miss. Helen Loraine female 2.0000

Select all variables except variables from survived to cabin

head(select(my_data,-(survived:cabin)),3)
A tibble: 3 × 5
pclassembarkedboatbodyhome.dest
<dbl><chr><chr><dbl><chr>
1S2 NASt Louis, MO
1S11NAMontreal, PQ / Chesterville, ON
1SNANAMontreal, PQ / Chesterville, ON

select variables whose name starts with bo

head(my_data %>% select(starts_with('bo')),3)
A tibble: 3 × 2
boatbody
<chr><dbl>
2 NA
11NA
NANA

select variables whose name ends with t

head(my_data %>% select(ends_with('t')),3)
A tibble: 3 × 3
ticketboathome.dest
<chr><chr><chr>
24160 2 St Louis, MO
11378111Montreal, PQ / Chesterville, ON
113781NAMontreal, PQ / Chesterville, ON

Select variables whose names contains “me”

head(my_data %>% select(contains('me')),4)
A tibble: 4 × 2
namehome.dest
<chr><chr>
Allen, Miss. Elisabeth Walton St Louis, MO
Allison, Master. Hudson Trevor Montreal, PQ / Chesterville, ON
Allison, Miss. Helen Loraine Montreal, PQ / Chesterville, ON
Allison, Mr. Hudson Joshua CreightonMontreal, PQ / Chesterville, ON

Variable selection based on a condtion

select only character variables

head(my_data %>% select_if(is.character),4)
A tibble: 4 × 7
namesexticketcabinembarkedboathome.dest
<chr><chr><chr><chr><chr><chr><chr>
Allen, Miss. Elisabeth Walton female24160 B5 S2 St Louis, MO
Allison, Master. Hudson Trevor male 113781C22 C26S11Montreal, PQ / Chesterville, ON
Allison, Miss. Helen Loraine female113781C22 C26SNAMontreal, PQ / Chesterville, ON
Allison, Mr. Hudson Joshua Creightonmale 113781C22 C26SNAMontreal, PQ / Chesterville, ON

selecting only numerical variables

head(select_if(my_data,is.numeric),4)
A tibble: 4 × 7
pclasssurvivedagesibspparchfarebody
<dbl><dbl><dbl><dbl><dbl><dbl><dbl>
1129.000000211.3375 NA
11 0.916712151.5500 NA
10 2.000012151.5500 NA
1030.000012151.5500135

Removing columns

For simplicity we will work with only few variables

sub_data<-my_data %>% select(sex,pclass,age,'survived',cabin,'name','sex','age','sibsp')

remove variables named pclass,age and sex

head(sub_data%>% select(-sex,-pclass,-age),3)
A tibble: 3 × 4
survivedcabinnamesibsp
<dbl><chr><chr><dbl>
1B5 Allen, Miss. Elisabeth Walton 0
1C22 C26Allison, Master. Hudson Trevor1
0C22 C26Allison, Miss. Helen Loraine 1

Rows filtering (filter())

This section describes how to subset or extract samples or rows from the dataset based on certain criteria

extract male (sex==’male’) passengers who survived (survived==1) and has sibsp==1 (Number of Siblings/Spouses Aboard)

head(sub_data %>% filter(sex=='male' & sibsp==1 & survived==1),2)
A tibble: 2 × 7
sexpclassagesurvivedcabinnamesibsp
<chr><dbl><dbl><dbl><chr><chr><dbl>
male1 0.91671C22 C26Allison, Master. Hudson Trevor1
male137.00001D35 Beckwith, Mr. Richard Leonard 1

OR

head(sub_data %>% filter(sex=='male', sibsp==1,survived==1),2)
A tibble: 2 × 7
sexpclassagesurvivedcabinnamesibsp
<chr><dbl><dbl><dbl><chr><chr><dbl>
male1 0.91671C22 C26Allison, Master. Hudson Trevor1
male137.00001D35 Beckwith, Mr. Richard Leonard 1

extract rows where passengers are male(sex==’male’) or survived (survived==1) or has sibsp==1 or 2 (Number of Siblings/Spouses Aboard)

head(sub_data %>% filter(sex=='male' | sibsp==1 |sibsp==2| survived==1),3)
A tibble: 3 × 7
sexpclassagesurvivedcabinnamesibsp
<chr><dbl><dbl><dbl><chr><chr><dbl>
female129.00001B5 Allen, Miss. Elisabeth Walton 0
male 1 0.91671C22 C26Allison, Master. Hudson Trevor1
female1 2.00000C22 C26Allison, Miss. Helen Loraine 1

select variables sibsp,sex,age and from these variables extract rows where age<10

head(sub_data %>% select(sex,sibsp,age) %>%filter(age<10),3)

A tibble: 3 × 3
sexsibspage
<chr><dbl><dbl>
male 10.9167
female12.0000
male 04.0000

Selecting random rows or samples from a dataset

selecting 10 random samples without replacement from the data

head(sub_data %>% sample_n(10,replace = FALSE),2)
A tibble: 2 × 7
sexpclassagesurvivedcabinnamesibsp
<chr><dbl><dbl><dbl><chr><chr><dbl>
male 3200NA Vendel, Mr. Olof Edvin 0
female1351C123Futrelle, Mrs. Jacques Heath (Lily May Peel)1

Select 1% random samples without replacement from the data

sub_data %>% sample_frac(0.01,replace = FALSE)
A tibble: 13 × 7
sexpclassagesurvivedcabinnamesibsp
<chr><dbl><dbl><dbl><chr><chr><dbl>
male 1341NA Seward, Mr. Frederic Kimber 0
male 3440NA Cribb, Mr. John Hatfield 0
male 3231NA Asplund, Mr. Johan Charles 0
male 3260NA Bostandyeff, Mr. Guentcho 0
male 2270NA Pulbaum, Mr. Franz 0
male 3170NA Elias, Mr. Joseph Jr 1
male 1410D21Kenyon, Mr. Frederick R 1
male 3311NA Stranden, Mr. Juho 0
female2251NA Shelley, Mrs. William (Imanita Parrish Hall)0
male 3NA0NA Petroff, Mr. Pastcho ("Pentcho") 0
male 3NA1NA O'Keefe, Mr. Patrick 0
male 2 21NA Wells, Master. Ralph Lester 1
male 3130NA Asplund, Master. Filip Oscar 4

Missing values

Number of missing values in the age variable

sub_data %>% summarise(num_na=sum(is.na(age)))
A tibble: 1 × 1
num_na
<int>
263

number of missing values in each variable

sub_data %>% purrr::map_df(~sum(is.na(.)))
A tibble: 1 × 7
sexpclassagesurvivedcabinnamesibsp
<int><int><int><int><int><int><int>
002630101400

nrow(sub_data)

1309

drop samples of the variables age and cabin with nas


no_nas<-sub_data %>% filter_at(vars(age,cabin),all_vars(!is.na(.)))
nrow(no_nas)

272

Adding New Variables

Under this section we will use the housing dataset


housing<-readr::read_csv('data/housing.csv',guess_max = 20)
-- Column specification ------------------------------------------------------------------------------------------------
cols(
  longitude = col_double(),
  latitude = col_double(),
  housing_median_age = col_double(),
  total_rooms = col_double(),
  total_bedrooms = col_double(),
  population = col_double(),
  households = col_double(),
  median_income = col_double(),
  median_house_value = col_double(),
  ocean_proximity = col_character()
)

Under this section we will select only few variables needed to created new variables

housing<-housing %>% select(total_rooms,households,total_bedrooms,total_rooms,
        population,households,ocean_proximity,median_income)

mutate()

mutate() adds new variables at the end of your dataset

head(housing 
      
      %>% mutate( rooms_per_household= total_rooms/households,
                  bedrooms_per_room=total_bedrooms/total_rooms,
                  population_per_household=population/households
                  ) 
                  %>% select(-c(total_rooms,households,population,total_bedrooms,median_income))
                  ,3)
A tibble: 3 × 4
ocean_proximityrooms_per_householdbedrooms_per_roompopulation_per_household
<chr><dbl><dbl><dbl>
NEAR BAY6.9841270.14659092.555556
NEAR BAY6.2381370.15579662.109842
NEAR BAY8.2881360.12951602.802260

transmute() only keep the new variables created

head(housing %>% 
            transmute(
                  rooms_per_household= total_rooms/households,
                  bedrooms_per_room=total_bedrooms/total_rooms,
                  population_per_household=population/households
                   ),
                   3)
A tibble: 3 × 3
rooms_per_householdbedrooms_per_roompopulation_per_household
<dbl><dbl><dbl>
6.9841270.14659092.555556
6.2381370.15579662.109842
8.2881360.12951602.802260

Summary Statistics

getmode <- function(n) {
   uniqn <- unique(n)
   uniqn[which.max(tabulate(match(n, uniqn)))]
}
housing %>% summarise(count=n(),mean_income=mean(median_income,na.rm=TRUE),
                      mode_income=getmode(median_income))
A tibble: 1 × 3
countmean_incomemode_income
<int><dbl><dbl>
206403.8706713.125

Group by one variable

  • Note : you can groupe by multiple variables
housing %>%group_by(ocean_proximity) %>% summarise(mean_income=mean(median_income,na.rm=TRUE),
                                                   mean_housholde=mean(households))
A tibble: 5 × 3
ocean_proximitymean_incomemean_housholde
<chr><dbl><dbl>
<1H OCEAN 4.230682517.7450
INLAND 3.208996477.4476
ISLAND 2.744420276.6000
NEAR BAY 4.172885488.6162
NEAR OCEAN4.005785501.2445

summary statistics on numerical variables group by ocean proximity

housing%>% select(-population)%>%group_by(ocean_proximity) %>%
summarise_if(is.numeric, mean, na.rm = TRUE)
A tibble: 5 × 5
ocean_proximitytotal_roomshouseholdstotal_bedroomsmedian_income
<chr><dbl><dbl><dbl><dbl>
<1H OCEAN 2628.344517.7450546.53924.230682
INLAND 2717.743477.4476533.88163.208996
ISLAND 1574.600276.6000420.40002.744420
NEAR BAY 2493.590488.6162514.18284.172885
NEAR OCEAN2583.701501.2445538.61574.005785