R Programming Series: Data Wrangling and Visualization

0
138
Data wrangling & visualization

Previously, we have explored the concept of 3D visualization in R programming language by using the plot3d package. Now, we will get insights into the concepts of data wrangling and data visualization in R programming language.

Data wrangling is considered as the process of cleaning and unifying messy and complex data sets which calls for easy access and analysis. Data wrangling is also referred to as data munging. With the increasing amount of data and data sources, it is getting more important to focus on large amounts of available data which is organized for analysis.

This process typically includes all the manual conversions such as converting and mapping from one raw form into another format which allows more convenient consumption and organization for the data. We will now focus on the data which includes all the attributes of athletes so that we can analyze the rank and other holdings of players with respective countries.

Step 1

Include the necessary libraries which are needed for data wrangling and visualization in R.

> 
> library(tidyverse)
-- Attaching packages --------------------------------------- tidyverse 1.3.0 --
v tidyr   1.0.0     v stringr 1.4.0
v readr   1.3.1     v forcats 0.4.0
v purrr   0.3.3     
-- Conflicts ------------------------------------------ tidyverse_conflicts() --
x dplyr::filter() masks stats::filter()
x dplyr::lag()    masks stats::lag()
x purrr::map()    masks maps::map()
> library(ggthemes)

Step 2

Understand the attributes of the data frame with structure and other dimensions. It is important to understand whether we need any exploratory data analysis which comes as a part of the data wrangling procedure. 

> atheletes =read.csv("athlete_events.csv", stringsAsFactors = F)
> View(atheletes)

Dataset of athletes- 1

> str(atheletes)
'data.frame':	271116 obs. of  15 variables:
 $ ID    : int  1 2 3 4 5 5 5 5 5 5 ...
 $ Name  : chr  "A Dijiang" "A Lamusi" "Gunnar Nielsen Aaby" "Edgar Lindenau Aabye" ...
 $ Sex   : chr  "M" "M" "M" "M" ...
 $ Age   : int  24 23 24 34 21 21 25 25 27 27 ...
 $ Height: int  180 170 NA NA 185 185 185 185 185 185 ...
 $ Weight: num  80 60 NA NA 82 82 82 82 82 82 ...
 $ Team  : chr  "China" "China" "Denmark" "Denmark/Sweden" ...
 $ NOC   : chr  "CHN" "CHN" "DEN" "DEN" ...
 $ Games : chr  "1992 Summer" "2012 Summer" "1920 Summer" "1900 Summer" ...
 $ Year  : int  1992 2012 1920 1900 1988 1988 1992 1992 1994 1994 ...
 $ Season: chr  "Summer" "Summer" "Summer" "Summer" ...
 $ City  : chr  "Barcelona" "London" "Antwerpen" "Paris" ...
 $ Sport : chr  "Basketball" "Judo" "Football" "Tug-Of-War" ...
 $ Event : chr  "Basketball Men's Basketball" "Judo Men's Extra-Lightweight" "Football Men's Football" "Tug-Of-War Men's Tug-Of-War" ...
 $ Medal : chr  NA NA NA "Gold" ...
> glimpse(atheletes)
Observations: 271,116
Variables: 15
$ ID     <int> 1, 2, 3, 4, 5, 5, 5, 5, 5, 5, 6, 6, 6, 6, 6, 6, 6, 6, 7, 7, 7, 7, 7, 7, 7, 7, 8, 8, 9, 10, 11, 12, 13, 13, 14, 15, 15, 15, 15, 15, 16,...
$ Name   <chr> "A Dijiang", "A Lamusi", "Gunnar Nielsen Aaby", "Edgar Lindenau Aabye", "Christine Jacoba Aaftink", "Christine Jacoba Aaftink", "Chris...
$ Sex    <chr> "M", "M", "M", "M", "F", "F", "F", "F", "F", "F", "M", "M", "M", "M", "M", "M", "M", "M", "M", "M", "M", "M", "M", "M", "M", "M", "F",...
$ Age    <int> 24, 23, 24, 34, 21, 21, 25, 25, 27, 27, 31, 31, 31, 31, 33, 33, 33, 33, 31, 31, 31, 31, 33, 33, 33, 33, 18, 18, 26, 26, 22, 31, 30, 34...
$ Height <int> 180, 170, NA, NA, 185, 185, 185, 185, 185, 185, 188, 188, 188, 188, 188, 188, 188, 188, 183, 183, 183, 183, 183, 183, 183, 183, 168, 1...
$ Weight <dbl> 80.0, 60.0, NA, NA, 82.0, 82.0, 82.0, 82.0, 82.0, 82.0, 75.0, 75.0, 75.0, 75.0, 75.0, 75.0, 75.0, 75.0, 72.0, 72.0, 72.0, 72.0, 72.0, ...
$ Team   <chr> "China", "China", "Denmark", "Denmark/Sweden", "Netherlands", "Netherlands", "Netherlands", "Netherlands", "Netherlands", "Netherlands...
$ NOC    <chr> "CHN", "CHN", "DEN", "DEN", "NED", "NED", "NED", "NED", "NED", "NED", "USA", "USA", "USA", "USA", "USA", "USA", "USA", "USA", "USA", "...
$ Games  <chr> "1992 Summer", "2012 Summer", "1920 Summer", "1900 Summer", "1988 Winter", "1988 Winter", "1992 Winter", "1992 Winter", "1994 Winter",...
$ Year   <int> 1992, 2012, 1920, 1900, 1988, 1988, 1992, 1992, 1994, 1994, 1992, 1992, 1992, 1992, 1994, 1994, 1994, 1994, 1992, 1992, 1992, 1992, 19...
$ Season <chr> "Summer", "Summer", "Summer", "Summer", "Winter", "Winter", "Winter", "Winter", "Winter", "Winter", "Winter", "Winter", "Winter", "Win...
$ City   <chr> "Barcelona", "London", "Antwerpen", "Paris", "Calgary", "Calgary", "Albertville", "Albertville", "Lillehammer", "Lillehammer", "Albert...
$ Sport  <chr> "Basketball", "Judo", "Football", "Tug-Of-War", "Speed Skating", "Speed Skating", "Speed Skating", "Speed Skating", "Speed Skating", "...
$ Event  <chr> "Basketball Men's Basketball", "Judo Men's Extra-Lightweight", "Football Men's Football", "Tug-Of-War Men's Tug-Of-War", "Speed Skatin...
$ Medal  <chr> NA, NA, NA, "Gold", NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA..

Step 3

We can observe that there are some missing values that need to be treated before starting data analysis and visualization in R.

> rem=atheletes %>%
+     filter(Medal != "<NA>") ## retain all data which is not NA
> 
> head(rem)
  ID                     Name Sex Age Height Weight           Team NOC
1  4     Edgar Lindenau Aabye   M  34     NA     NA Denmark/Sweden DEN
2 15     Arvo Ossian Aaltonen   M  30     NA     NA        Finland FIN
3 15     Arvo Ossian Aaltonen   M  30     NA     NA        Finland FIN
4 16 Juhamatti Tapio Aaltonen   M  28    184     85        Finland FIN
5 17  Paavo Johannes Aaltonen   M  28    175     64        Finland FIN
6 17  Paavo Johannes Aaltonen   M  28    175     64        Finland FIN
        Games Year Season      City      Sport
1 1900 Summer 1900 Summer     Paris Tug-Of-War
2 1920 Summer 1920 Summer Antwerpen   Swimming
3 1920 Summer 1920 Summer Antwerpen   Swimming
4 2014 Winter 2014 Winter     Sochi Ice Hockey
5 1948 Summer 1948 Summer    London Gymnastics
6 1948 Summer 1948 Summer    London Gymnastics
                                   Event  Medal
1            Tug-Of-War Men's Tug-Of-War   Gold
2 Swimming Men's 200 metres Breaststroke Bronze
3 Swimming Men's 400 metres Breaststroke Bronze
4            Ice Hockey Men's Ice Hockey Bronze
5 Gymnastics Men's Individual All-Around Bronze
6       Gymnastics Men's Team All-Around   Gold
> 
> class(rem$Year)
[1] "integer"
> 
> rem$Year = as.factor(rem$Year)

Step 4

Once the missing values are treated we can focus on further steps on analysis and visualization. Let us focus on athletes of China who were successful to gain medals for countries.

> ### Medals China
> 
> china = subset(rem, Team == "China")
> 
> head(china)
       ID         Name Sex Age Height Weight  Team NOC       Games Year
702  3610    An Yulong   M  19    173     70 China CHN 1998 Winter 1998
703  3610    An Yulong   M  19    173     70 China CHN 1998 Winter 1998
704  3610    An Yulong   M  23    173     70 China CHN 2002 Winter 2002
705  3611  An Zhongxin   F  23    170     65 China CHN 1996 Summer 1996
1453 6381       Ba Yan   F  21    183     78 China CHN 1984 Summer 1984
1760 7597 Bao Yingying   F  24    172     67 China CHN 2008 Summer 2008
     Season           City                     Sport
702  Winter         Nagano Short Track Speed Skating
703  Winter         Nagano Short Track Speed Skating
704  Winter Salt Lake City Short Track Speed Skating
705  Summer        Atlanta                  Softball
1453 Summer    Los Angeles                Basketball
1760 Summer        Beijing                   Fencing
                                                  Event  Medal
702          Short Track Speed Skating Men's 500 metres Silver
703  Short Track Speed Skating Men's 5,000 metres Relay Bronze
704  Short Track Speed Skating Men's 5,000 metres Relay Bronze
705                           Softball Women's Softball Silver
1453                      Basketball Women's Basketball Bronze
1760                        Fencing Women's Sabre, Team Silver
> 
> 
> ### medal tally/country (from 1896 onwards)
> 
> t = rem %>% group_by(Team) %>%
+     summarise(Medal_Tally = length(Medal))%>%
+     arrange(desc(Medal_Tally))
> 
> head(t)
# A tibble: 6 x 2
  Team          Medal_Tally
  <chr>               <int>
1 United States        5219
2 Soviet Union         2451
3 Germany              1984
4 Great Britain        1673
5 France               1550
6 Italy                1527
> 
> t %>%top_n(12) %>% ggplot(aes(x = Team,y = Medal_Tally)) +
+     geom_bar(stat='identity',colour="white", fill = "red")+
+     theme_wsj()
Selecting by Medal_Tally

Medal tally bar graph - 2

The bar graph depicts the count of medal tally rate is high for the United States in comparison with other countries mentioned in the list.

Step 5

Let us focus on the medal tally rate of countries every year which is described below:

> ### Medal tally of countries, per year
> 
> mt = rem %>% 
+     group_by(Team,Year) %>%
+     summarise(Total = n())
> 
> head(mt)
# A tibble: 6 x 3
# Groups:   Team [3]
  Team                  Year  Total
  <chr>                 <fct> <int>
1 A North American Team 1900      4
2 Afghanistan           2008      1
3 Afghanistan           2012      1
4 Algeria               1984      2
5 Algeria               1992      2
6 Algeria               1996      3
> 
> ## medal tally of a few countries across the years
> topm = mt %>% filter(Team %in% c("USA","Russia","Germany","France","China"))
> head(topm)
# A tibble: 6 x 3
# Groups:   Team [1]
  Team  Year  Total
  <chr> <fct> <int>
1 China 1984     74
2 China 1988     50
3 China 1992     73
4 China 1994      3
5 China 1996     94
6 China 1998     14
> 
> topm %>%
+     ggplot(aes(x = Year, y = Total, group = Team)) +
+     geom_line(aes(color = Team)) +
+     geom_point(aes(color = Team)) +theme_economist()+
+     theme(axis.text.x = element_text(angle = 90, hjust = 1))
> ##flip the x labels

Time series analysis

In the above-mentioned step, we summarize the medal tally rate every year and create an analysis with every country. The plot represents the time series analysis with cyclic patterns noticed from 1994 to 2016 and an uncertain pattern was noticed from 1896 to 1992.

So, this was all about Data Wrangling and Visualization in R programming!

In the next article, we will learn about Exploratory Data Analysis using R programming language.

LEAVE A REPLY

Please enter your comment!
Please enter your name here