This tutorial introduces how to easily manipulate data in R using the tidyverse package. Data manipulation usually involves

computing summary statistics.
rows filtering (filter()) and ordering (arrange()).
renaming (rename()) and selecting certain columns (select()

and adding columns

mutate(): compute and add new variables into a data table. It preserves existing variables and adds new columns at the end of your dataset and
transmute(): compute new columns and only keep the new columns,

setwd(“wdirectory”) Changes the current working directory to wdirectory.

setwd('C:/Users/USER/Desktop/JUPYTER_NOTEBOOK/A_MYTUTORIALS/MYR')

Load the needed packages

library(devtools)
library(tidyverse)
library(nycflights13)
library(readxl)

Loading required package: usethis

Warning message in (function (kind = NULL, normal.kind = NULL, sample.kind = NULL) :
"non-uniform 'Rounding' sampler used"
-- [1mAttaching packages[22m ------------------------------------------------------------------------------- tidyverse 1.3.1 --

[32mv[39m [34mggplot2[39m 3.3.3     [32mv[39m [34mpurrr  [39m 0.3.4
[32mv[39m [34mtibble [39m 3.1.1     [32mv[39m [34mdplyr  [39m 1.0.5
[32mv[39m [34mtidyr  [39m 1.1.3     [32mv[39m [34mstringr[39m 1.4.0
[32mv[39m [34mreadr  [39m 1.4.0     [32mv[39m [34mforcats[39m 0.5.1

Warning message in (function (kind = NULL, normal.kind = NULL, sample.kind = NULL) :
"non-uniform 'Rounding' sampler used"
-- [1mConflicts[22m ---------------------------------------------------------------------------------- tidyverse_conflicts() --
[31mx[39m [34mdplyr[39m::[32mfilter()[39m masks [34mstats[39m::filter()
[31mx[39m [34mdplyr[39m::[32mlag()[39m    masks [34mstats[39m::lag()

Datatset

We will use titanic dataset. This dataset has 1309 observations with 14 variables. To explore the basic data manipulation verbs of dplyr, we start by converting the data into a tibble data frame for easier data manipulation

my_data<-as_tibble(read_xls('data/titanic.xls'))
names(my_data)

Warning message in read_fun(path = enc2native(normalizePath(path)), sheet_i = sheet, :
"Coercing text to numeric in M1306 / R1306C13: '328'"

Variable Selection

we can select or subset variables by names or position.

Under variable selection we will learn how to use

select(): allow us to extract variables or variables as a data table and can also be used to remove variables from the data frame.
select_if(): Select variabless based on a particular condition.
Variabl Selection by position

select from the my_data variable positioned from 1 to 4 inclusive

head(my_data %>% select(1:4))

A tibble: 6 × 4
pclass	survived	name	sex
<dbl>	<dbl>	<chr>	<chr>
1	1	Allen, Miss. Elisabeth Walton	female
1	1	Allison, Master. Hudson Trevor	male
1	0	Allison, Miss. Helen Loraine	female
1	0	Allison, Mr. Hudson Joshua Creighton	male
1	0	Allison, Mrs. Hudson J C (Bessie Waldo Daniels)	female
1	1	Anderson, Mr. Harry	male

select variables positioned at 1, 4,6,7


head(my_data %>% select(1,4,6,7))

A tibble: 6 × 4
pclass	sex	sibsp	parch
<dbl>	<chr>	<dbl>	<dbl>
1	female	0	0
1	male	1	2
1	female	1	2
1	male	1	2
1	female	1	2
1	male	0	0

Select varaibles by name

head(select(my_data,pclass,name:age),3)

A tibble: 3 × 4
pclass	name	sex	age
<dbl>	<chr>	<chr>	<dbl>
1	Allen, Miss. Elisabeth Walton	female	29.0000
1	Allison, Master. Hudson Trevor	male	0.9167
1	Allison, Miss. Helen Loraine	female	2.0000

Select all variables except variables from survived to cabin

head(select(my_data,-(survived:cabin)),3)

A tibble: 3 × 5
pclass	embarked	boat	body	home.dest
<dbl>	<chr>	<chr>	<dbl>	<chr>
1	S	2	NA	St Louis, MO
1	S	11	NA	Montreal, PQ / Chesterville, ON
1	S	NA	NA	Montreal, PQ / Chesterville, ON

select variables whose name starts with bo

head(my_data %>% select(starts_with('bo')),3)

A tibble: 3 × 2
boat	body
<chr>	<dbl>
2	NA
11	NA
NA	NA

select variables whose name ends with t

head(my_data %>% select(ends_with('t')),3)

A tibble: 3 × 3
ticket	boat	home.dest
<chr>	<chr>	<chr>
24160	2	St Louis, MO
113781	11	Montreal, PQ / Chesterville, ON
113781	NA	Montreal, PQ / Chesterville, ON

Select variables whose names contains “me”

head(my_data %>% select(contains('me')),4)

A tibble: 4 × 2
name	home.dest
<chr>	<chr>
Allen, Miss. Elisabeth Walton	St Louis, MO
Allison, Master. Hudson Trevor	Montreal, PQ / Chesterville, ON
Allison, Miss. Helen Loraine	Montreal, PQ / Chesterville, ON
Allison, Mr. Hudson Joshua Creighton	Montreal, PQ / Chesterville, ON

Variable selection based on a condtion

select only character variables

head(my_data %>% select_if(is.character),4)

A tibble: 4 × 7
name	sex	ticket	cabin	embarked	boat	home.dest
<chr>	<chr>	<chr>	<chr>	<chr>	<chr>	<chr>
Allen, Miss. Elisabeth Walton	female	24160	B5	S	2	St Louis, MO
Allison, Master. Hudson Trevor	male	113781	C22 C26	S	11	Montreal, PQ / Chesterville, ON
Allison, Miss. Helen Loraine	female	113781	C22 C26	S	NA	Montreal, PQ / Chesterville, ON
Allison, Mr. Hudson Joshua Creighton	male	113781	C22 C26	S	NA	Montreal, PQ / Chesterville, ON

selecting only numerical variables

head(select_if(my_data,is.numeric),4)

A tibble: 4 × 7
pclass	survived	age	sibsp	parch	fare	body
<dbl>	<dbl>	<dbl>	<dbl>	<dbl>	<dbl>	<dbl>
1	1	29.0000	0	0	211.3375	NA
1	1	0.9167	1	2	151.5500	NA
1	0	2.0000	1	2	151.5500	NA
1	0	30.0000	1	2	151.5500	135

Removing columns

For simplicity we will work with only few variables

sub_data<-my_data %>% select(sex,pclass,age,'survived',cabin,'name','sex','age','sibsp')

remove variables named pclass,age and sex

head(sub_data%>% select(-sex,-pclass,-age),3)

A tibble: 3 × 4
survived	cabin	name	sibsp
<dbl>	<chr>	<chr>	<dbl>
1	B5	Allen, Miss. Elisabeth Walton	0
1	C22 C26	Allison, Master. Hudson Trevor	1
0	C22 C26	Allison, Miss. Helen Loraine	1

Rows filtering (filter())

This section describes how to subset or extract samples or rows from the dataset based on certain criteria

extract male (sex==’male’) passengers who survived (survived==1) and has sibsp==1 (Number of Siblings/Spouses Aboard)

head(sub_data %>% filter(sex=='male' & sibsp==1 & survived==1),2)

A tibble: 2 × 7
sex	pclass	age	survived	cabin	name	sibsp
<chr>	<dbl>	<dbl>	<dbl>	<chr>	<chr>	<dbl>
male	1	0.9167	1	C22 C26	Allison, Master. Hudson Trevor	1
male	1	37.0000	1	D35	Beckwith, Mr. Richard Leonard	1

head(sub_data %>% filter(sex=='male', sibsp==1,survived==1),2)

A tibble: 2 × 7
sex	pclass	age	survived	cabin	name	sibsp
<chr>	<dbl>	<dbl>	<dbl>	<chr>	<chr>	<dbl>
male	1	0.9167	1	C22 C26	Allison, Master. Hudson Trevor	1
male	1	37.0000	1	D35	Beckwith, Mr. Richard Leonard	1

extract rows where passengers are male(sex==’male’) or survived (survived==1) or has sibsp==1 or 2 (Number of Siblings/Spouses Aboard)

head(sub_data %>% filter(sex=='male' | sibsp==1 |sibsp==2| survived==1),3)

A tibble: 3 × 7
sex	pclass	age	survived	cabin	name	sibsp
<chr>	<dbl>	<dbl>	<dbl>	<chr>	<chr>	<dbl>
female	1	29.0000	1	B5	Allen, Miss. Elisabeth Walton	0
male	1	0.9167	1	C22 C26	Allison, Master. Hudson Trevor	1
female	1	2.0000	0	C22 C26	Allison, Miss. Helen Loraine	1

select variables sibsp,sex,age and from these variables extract rows where age<10

head(sub_data %>% select(sex,sibsp,age) %>%filter(age<10),3)

A tibble: 3 × 3
sex	sibsp	age
<chr>	<dbl>	<dbl>
male	1	0.9167
female	1	2.0000
male	0	4.0000

Selecting random rows or samples from a dataset

selecting 10 random samples without replacement from the data

head(sub_data %>% sample_n(10,replace = FALSE),2)

A tibble: 2 × 7
sex	pclass	age	survived	cabin	name	sibsp
<chr>	<dbl>	<dbl>	<dbl>	<chr>	<chr>	<dbl>
male	3	20	0	NA	Vendel, Mr. Olof Edvin	0
female	1	35	1	C123	Futrelle, Mrs. Jacques Heath (Lily May Peel)	1

Select 1% random samples without replacement from the data

sub_data %>% sample_frac(0.01,replace = FALSE)

A tibble: 13 × 7
sex	pclass	age	survived	cabin	name	sibsp
<chr>	<dbl>	<dbl>	<dbl>	<chr>	<chr>	<dbl>
male	1	34	1	NA	Seward, Mr. Frederic Kimber	0
male	3	44	0	NA	Cribb, Mr. John Hatfield	0
male	3	23	1	NA	Asplund, Mr. Johan Charles	0
male	3	26	0	NA	Bostandyeff, Mr. Guentcho	0
male	2	27	0	NA	Pulbaum, Mr. Franz	0
male	3	17	0	NA	Elias, Mr. Joseph Jr	1
male	1	41	0	D21	Kenyon, Mr. Frederick R	1
male	3	31	1	NA	Stranden, Mr. Juho	0
female	2	25	1	NA	Shelley, Mrs. William (Imanita Parrish Hall)	0
male	3	NA	0	NA	Petroff, Mr. Pastcho ("Pentcho")	0
male	3	NA	1	NA	O'Keefe, Mr. Patrick	0
male	2	2	1	NA	Wells, Master. Ralph Lester	1
male	3	13	0	NA	Asplund, Master. Filip Oscar	4

Missing values

Number of missing values in the age variable

sub_data %>% summarise(num_na=sum(is.na(age)))

A tibble: 1 × 1
num_na
<int>
263

number of missing values in each variable

sub_data %>% purrr::map_df(~sum(is.na(.)))

A tibble: 1 × 7
sex	pclass	age	survived	cabin	name	sibsp
<int>	<int>	<int>	<int>	<int>	<int>	<int>
0	0	263	0	1014	0	0


nrow(sub_data)

1309

drop samples of the variables age and cabin with nas

no_nas<-sub_data %>% filter_at(vars(age,cabin),all_vars(!is.na(.)))
nrow(no_nas)

272

Adding New Variables

Under this section we will use the housing dataset

housing<-readr::read_csv('data/housing.csv',guess_max = 20)

[36m--[39m [1m[1mColumn specification[1m[22m [36m------------------------------------------------------------------------------------------------[39m
cols(
  longitude = [32mcol_double()[39m,
  latitude = [32mcol_double()[39m,
  housing_median_age = [32mcol_double()[39m,
  total_rooms = [32mcol_double()[39m,
  total_bedrooms = [32mcol_double()[39m,
  population = [32mcol_double()[39m,
  households = [32mcol_double()[39m,
  median_income = [32mcol_double()[39m,
  median_house_value = [32mcol_double()[39m,
  ocean_proximity = [31mcol_character()[39m
)

Under this section we will select only few variables needed to created new variables

housing<-housing %>% select(total_rooms,households,total_bedrooms,total_rooms,
        population,households,ocean_proximity,median_income)

mutate()

mutate() adds new variables at the end of your dataset

head(housing 
      
      %>% mutate( rooms_per_household= total_rooms/households,
                  bedrooms_per_room=total_bedrooms/total_rooms,
                  population_per_household=population/households
                  ) 
                  %>% select(-c(total_rooms,households,population,total_bedrooms,median_income))
                  ,3)

A tibble: 3 × 4
ocean_proximity	rooms_per_household	bedrooms_per_room	population_per_household
<chr>	<dbl>	<dbl>	<dbl>
NEAR BAY	6.984127	0.1465909	2.555556
NEAR BAY	6.238137	0.1557966	2.109842
NEAR BAY	8.288136	0.1295160	2.802260

transmute() only keep the new variables created

head(housing %>% 
            transmute(
                  rooms_per_household= total_rooms/households,
                  bedrooms_per_room=total_bedrooms/total_rooms,
                  population_per_household=population/households
                   ),
                   3)

A tibble: 3 × 3
rooms_per_household	bedrooms_per_room	population_per_household
<dbl>	<dbl>	<dbl>
6.984127	0.1465909	2.555556
6.238137	0.1557966	2.109842
8.288136	0.1295160	2.802260

Summary Statistics

getmode <- function(n) {
   uniqn <- unique(n)
   uniqn[which.max(tabulate(match(n, uniqn)))]
}
housing %>% summarise(count=n(),mean_income=mean(median_income,na.rm=TRUE),
                      mode_income=getmode(median_income))

A tibble: 1 × 3
count	mean_income	mode_income
<int>	<dbl>	<dbl>
20640	3.870671	3.125

Group by one variable

Note : you can groupe by multiple variables

housing %>%group_by(ocean_proximity) %>% summarise(mean_income=mean(median_income,na.rm=TRUE),
                                                   mean_housholde=mean(households))

A tibble: 5 × 3
ocean_proximity	mean_income	mean_housholde
<chr>	<dbl>	<dbl>
<1H OCEAN	4.230682	517.7450
INLAND	3.208996	477.4476
ISLAND	2.744420	276.6000
NEAR BAY	4.172885	488.6162
NEAR OCEAN	4.005785	501.2445

summary statistics on numerical variables group by ocean proximity

housing%>% select(-population)%>%group_by(ocean_proximity) %>%
summarise_if(is.numeric, mean, na.rm = TRUE)

A tibble: 5 × 5
ocean_proximity	total_rooms	households	total_bedrooms	median_income
<chr>	<dbl>	<dbl>	<dbl>	<dbl>
<1H OCEAN	2628.344	517.7450	546.5392	4.230682
INLAND	2717.743	477.4476	533.8816	3.208996
ISLAND	1574.600	276.6000	420.4000	2.744420
NEAR BAY	2493.590	488.6162	514.1828	4.172885
NEAR OCEAN	2583.701	501.2445	538.6157	4.005785

Data Manipulation in R