Previously, we have explored the concept of “3D visualization in R programming language“ by using the plot3d package. Now, we will get insights into the concepts of data wrangling and data visualization in R programming language.
Data wrangling is considered as the process of cleaning and unifying messy and complex data sets which calls for easy access and analysis. Data wrangling is also referred to as data munging. With the increasing amount of data and data sources, it is getting more important to focus on large amounts of available data which is organized for analysis.
This process typically includes all the manual conversions such as converting and mapping from one raw form into another format which allows more convenient consumption and organization for the data. We will now focus on the data which includes all the attributes of athletes so that we can analyze the rank and other holdings of players with respective countries.
Step 1
Include the necessary libraries which are needed for data wrangling and visualization in R.
> > library(tidyverse) -- Attaching packages --------------------------------------- tidyverse 1.3.0 -- v tidyr 1.0.0 v stringr 1.4.0 v readr 1.3.1 v forcats 0.4.0 v purrr 0.3.3 -- Conflicts ------------------------------------------ tidyverse_conflicts() -- x dplyr::filter() masks stats::filter() x dplyr::lag() masks stats::lag() x purrr::map() masks maps::map() > library(ggthemes)
Step 2
Understand the attributes of the data frame with structure and other dimensions. It is important to understand whether we need any exploratory data analysis which comes as a part of the data wrangling procedure.Â
> atheletes =read.csv("athlete_events.csv", stringsAsFactors = F) > View(atheletes)
> str(atheletes) 'data.frame': 271116 obs. of 15 variables: $ ID : int 1 2 3 4 5 5 5 5 5 5 ... $ Name : chr "A Dijiang" "A Lamusi" "Gunnar Nielsen Aaby" "Edgar Lindenau Aabye" ... $ Sex : chr "M" "M" "M" "M" ... $ Age : int 24 23 24 34 21 21 25 25 27 27 ... $ Height: int 180 170 NA NA 185 185 185 185 185 185 ... $ Weight: num 80 60 NA NA 82 82 82 82 82 82 ... $ Team : chr "China" "China" "Denmark" "Denmark/Sweden" ... $ NOC : chr "CHN" "CHN" "DEN" "DEN" ... $ Games : chr "1992 Summer" "2012 Summer" "1920 Summer" "1900 Summer" ... $ Year : int 1992 2012 1920 1900 1988 1988 1992 1992 1994 1994 ... $ Season: chr "Summer" "Summer" "Summer" "Summer" ... $ City : chr "Barcelona" "London" "Antwerpen" "Paris" ... $ Sport : chr "Basketball" "Judo" "Football" "Tug-Of-War" ... $ Event : chr "Basketball Men's Basketball" "Judo Men's Extra-Lightweight" "Football Men's Football" "Tug-Of-War Men's Tug-Of-War" ... $ Medal : chr NA NA NA "Gold" ... > glimpse(atheletes) Observations: 271,116 Variables: 15 $ ID <int> 1, 2, 3, 4, 5, 5, 5, 5, 5, 5, 6, 6, 6, 6, 6, 6, 6, 6, 7, 7, 7, 7, 7, 7, 7, 7, 8, 8, 9, 10, 11, 12, 13, 13, 14, 15, 15, 15, 15, 15, 16,... $ Name <chr> "A Dijiang", "A Lamusi", "Gunnar Nielsen Aaby", "Edgar Lindenau Aabye", "Christine Jacoba Aaftink", "Christine Jacoba Aaftink", "Chris... $ Sex <chr> "M", "M", "M", "M", "F", "F", "F", "F", "F", "F", "M", "M", "M", "M", "M", "M", "M", "M", "M", "M", "M", "M", "M", "M", "M", "M", "F",... $ Age <int> 24, 23, 24, 34, 21, 21, 25, 25, 27, 27, 31, 31, 31, 31, 33, 33, 33, 33, 31, 31, 31, 31, 33, 33, 33, 33, 18, 18, 26, 26, 22, 31, 30, 34... $ Height <int> 180, 170, NA, NA, 185, 185, 185, 185, 185, 185, 188, 188, 188, 188, 188, 188, 188, 188, 183, 183, 183, 183, 183, 183, 183, 183, 168, 1... $ Weight <dbl> 80.0, 60.0, NA, NA, 82.0, 82.0, 82.0, 82.0, 82.0, 82.0, 75.0, 75.0, 75.0, 75.0, 75.0, 75.0, 75.0, 75.0, 72.0, 72.0, 72.0, 72.0, 72.0, ... $ Team <chr> "China", "China", "Denmark", "Denmark/Sweden", "Netherlands", "Netherlands", "Netherlands", "Netherlands", "Netherlands", "Netherlands... $ NOC <chr> "CHN", "CHN", "DEN", "DEN", "NED", "NED", "NED", "NED", "NED", "NED", "USA", "USA", "USA", "USA", "USA", "USA", "USA", "USA", "USA", "... $ Games <chr> "1992 Summer", "2012 Summer", "1920 Summer", "1900 Summer", "1988 Winter", "1988 Winter", "1992 Winter", "1992 Winter", "1994 Winter",... $ Year <int> 1992, 2012, 1920, 1900, 1988, 1988, 1992, 1992, 1994, 1994, 1992, 1992, 1992, 1992, 1994, 1994, 1994, 1994, 1992, 1992, 1992, 1992, 19... $ Season <chr> "Summer", "Summer", "Summer", "Summer", "Winter", "Winter", "Winter", "Winter", "Winter", "Winter", "Winter", "Winter", "Winter", "Win... $ City <chr> "Barcelona", "London", "Antwerpen", "Paris", "Calgary", "Calgary", "Albertville", "Albertville", "Lillehammer", "Lillehammer", "Albert... $ Sport <chr> "Basketball", "Judo", "Football", "Tug-Of-War", "Speed Skating", "Speed Skating", "Speed Skating", "Speed Skating", "Speed Skating", "... $ Event <chr> "Basketball Men's Basketball", "Judo Men's Extra-Lightweight", "Football Men's Football", "Tug-Of-War Men's Tug-Of-War", "Speed Skatin... $ Medal <chr> NA, NA, NA, "Gold", NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA..
Step 3
We can observe that there are some missing values that need to be treated before starting data analysis and visualization in R.
> rem=atheletes %>% + filter(Medal != "<NA>") ## retain all data which is not NA > > head(rem) ID Name Sex Age Height Weight Team NOC 1 4 Edgar Lindenau Aabye M 34 NA NA Denmark/Sweden DEN 2 15 Arvo Ossian Aaltonen M 30 NA NA Finland FIN 3 15 Arvo Ossian Aaltonen M 30 NA NA Finland FIN 4 16 Juhamatti Tapio Aaltonen M 28 184 85 Finland FIN 5 17 Paavo Johannes Aaltonen M 28 175 64 Finland FIN 6 17 Paavo Johannes Aaltonen M 28 175 64 Finland FIN Games Year Season City Sport 1 1900 Summer 1900 Summer Paris Tug-Of-War 2 1920 Summer 1920 Summer Antwerpen Swimming 3 1920 Summer 1920 Summer Antwerpen Swimming 4 2014 Winter 2014 Winter Sochi Ice Hockey 5 1948 Summer 1948 Summer London Gymnastics 6 1948 Summer 1948 Summer London Gymnastics Event Medal 1 Tug-Of-War Men's Tug-Of-War Gold 2 Swimming Men's 200 metres Breaststroke Bronze 3 Swimming Men's 400 metres Breaststroke Bronze 4 Ice Hockey Men's Ice Hockey Bronze 5 Gymnastics Men's Individual All-Around Bronze 6 Gymnastics Men's Team All-Around Gold > > class(rem$Year) [1] "integer" > > rem$Year = as.factor(rem$Year)
Step 4
Once the missing values are treated we can focus on further steps on analysis and visualization. Let us focus on athletes of China who were successful to gain medals for countries.
> ### Medals China > > china = subset(rem, Team == "China") > > head(china) ID Name Sex Age Height Weight Team NOC Games Year 702 3610 An Yulong M 19 173 70 China CHN 1998 Winter 1998 703 3610 An Yulong M 19 173 70 China CHN 1998 Winter 1998 704 3610 An Yulong M 23 173 70 China CHN 2002 Winter 2002 705 3611 An Zhongxin F 23 170 65 China CHN 1996 Summer 1996 1453 6381 Ba Yan F 21 183 78 China CHN 1984 Summer 1984 1760 7597 Bao Yingying F 24 172 67 China CHN 2008 Summer 2008 Season City Sport 702 Winter Nagano Short Track Speed Skating 703 Winter Nagano Short Track Speed Skating 704 Winter Salt Lake City Short Track Speed Skating 705 Summer Atlanta Softball 1453 Summer Los Angeles Basketball 1760 Summer Beijing Fencing Event Medal 702 Short Track Speed Skating Men's 500 metres Silver 703 Short Track Speed Skating Men's 5,000 metres Relay Bronze 704 Short Track Speed Skating Men's 5,000 metres Relay Bronze 705 Softball Women's Softball Silver 1453 Basketball Women's Basketball Bronze 1760 Fencing Women's Sabre, Team Silver > > > ### medal tally/country (from 1896 onwards) > > t = rem %>% group_by(Team) %>% + summarise(Medal_Tally = length(Medal))%>% + arrange(desc(Medal_Tally)) > > head(t) # A tibble: 6 x 2 Team Medal_Tally <chr> <int> 1 United States 5219 2 Soviet Union 2451 3 Germany 1984 4 Great Britain 1673 5 France 1550 6 Italy 1527 > > t %>%top_n(12) %>% ggplot(aes(x = Team,y = Medal_Tally)) + + geom_bar(stat='identity',colour="white", fill = "red")+ + theme_wsj() Selecting by Medal_Tally
The bar graph depicts the count of medal tally rate is high for the United States in comparison with other countries mentioned in the list.
Step 5
Let us focus on the medal tally rate of countries every year which is described below:
> ### Medal tally of countries, per year > > mt = rem %>% + group_by(Team,Year) %>% + summarise(Total = n()) > > head(mt) # A tibble: 6 x 3 # Groups: Team [3] Team Year Total <chr> <fct> <int> 1 A North American Team 1900 4 2 Afghanistan 2008 1 3 Afghanistan 2012 1 4 Algeria 1984 2 5 Algeria 1992 2 6 Algeria 1996 3 > > ## medal tally of a few countries across the years > topm = mt %>% filter(Team %in% c("USA","Russia","Germany","France","China")) > head(topm) # A tibble: 6 x 3 # Groups: Team [1] Team Year Total <chr> <fct> <int> 1 China 1984 74 2 China 1988 50 3 China 1992 73 4 China 1994 3 5 China 1996 94 6 China 1998 14 > > topm %>% + ggplot(aes(x = Year, y = Total, group = Team)) + + geom_line(aes(color = Team)) + + geom_point(aes(color = Team)) +theme_economist()+ + theme(axis.text.x = element_text(angle = 90, hjust = 1)) > ##flip the x labels
In the above-mentioned step, we summarize the medal tally rate every year and create an analysis with every country. The plot represents the time series analysis with cyclic patterns noticed from 1994 to 2016 and an uncertain pattern was noticed from 1896 to 1992.
So, this was all about Data Wrangling and Visualization in R programming!
In the next article, we will learn about Exploratory Data Analysis using R programming language.