Chapter 2 Geoms

2.1 Introduction

In this chapter, we will create some of the most routinely used plots to explore data using the geom_* functions.

2.2 Libraries, Code & Data

We will use the following libraries in this chapter:
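
Judging only from the code in this chapter, a minimal setup might look like the sketch below. It is an assumption, not the author's original list: ggplot2 is attached because every plot calls its functions directly, while readr and tibble are only ever called with the :: prefix.

```r
# Assumed setup, inferred from the function calls used in this chapter
library(ggplot2)  # ggplot(), aes(), and every geom_*() function used below

# readr and tibble are called as readr::read_csv() and tibble::tibble(),
# so they must be installed but do not need to be attached.
```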

All the data sets used in this chapter can be found here and code can be downloaded from here.

2.2.1 Data

ecom <- readr::read_csv('https://raw.githubusercontent.com/rsquaredacademy/datasets/master/web.csv')
ecom
## # A tibble: 1,000 x 11
##       id referrer device bouncers n_visit n_pages duration country      purchase
##    <dbl> <chr>    <chr>  <lgl>      <dbl>   <dbl>    <dbl> <chr>        <lgl>   
##  1     1 google   laptop TRUE          10       1      693 Czech Repub~ FALSE   
##  2     2 yahoo    tablet TRUE           9       1      459 Yemen        FALSE   
##  3     3 direct   laptop TRUE           0       1      996 Brazil       FALSE   
##  4     4 bing     tablet FALSE          3      18      468 China        TRUE    
##  5     5 yahoo    mobile TRUE           9       1      955 Poland       FALSE   
##  6     6 yahoo    laptop FALSE          5       5      135 South Africa FALSE   
##  7     7 yahoo    mobile TRUE          10       1       75 Bangladesh   FALSE   
##  8     8 direct   mobile TRUE          10       1      908 Indonesia    FALSE   
##  9     9 bing     mobile FALSE          3      19      209 Netherlands  FALSE   
## 10    10 google   mobile TRUE           6       1      208 Czech Repub~ FALSE   
## # ... with 990 more rows, and 2 more variables: order_items <dbl>,
## #   order_value <dbl>

2.2.2 Data Dictionary

  • id: row id
  • referrer: referrer website/search engine
  • device: device used to visit the website
  • bouncers: whether the visit bounced
  • n_visit: number of visits
  • n_pages: number of pages visited
  • duration: time spent on the website (in seconds)
  • country: country of origin
  • purchase: whether visitor purchased
  • order_items: number of items ordered
  • order_value: order value of visitor (in dollars)

2.3 Point

A scatter plot displays the relationship between two continuous variables. In ggplot2, we can build a scatter plot using geom_point(). Scatter plots can show you visually

  • the strength of the relationship between the variables
  • the direction of the relationship between the variables
  • and whether outliers exist

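The variables mapped to the X and Y axes can be specified either in ggplot() or in geom_point(). A mapping given in ggplot() is inherited by every layer, while a mapping given in geom_point() applies to that layer only. We will learn to modify the appearance of the points in a different post.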

ggplot(ecom, aes(x = n_pages, y = duration)) + 
  geom_point()
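
The two places where the axis variables can be specified can be compared side by side. The sketch below uses the built-in mtcars data as a stand-in (an assumption made so that it runs without downloading ecom); the object names p_global and p_local are illustrative.

```r
library(ggplot2)

# Mapping declared once in ggplot(): inherited by every layer
p_global <- ggplot(mtcars, aes(x = wt, y = mpg)) +
  geom_point()

# Same mapping declared inside geom_point(): applies to this layer only
p_local <- ggplot(mtcars) +
  geom_point(aes(x = wt, y = mpg))

# Extract the coordinates each plot will draw, to compare them
xy_global <- layer_data(p_global)[, c("x", "y")]
xy_local  <- layer_data(p_local)[, c("x", "y")]
```

Declaring the mapping in ggplot() tends to be more convenient when several layers share the same variables.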

2.4 Regression Line

A regression line can be added to a plot using either:

  • geom_abline()
  • geom_smooth()

geom_abline() does not fit a model; it only draws a line, so you need to supply the intercept and slope yourself, as shown in the example below:

ggplot(mtcars, aes(x = wt, y = mpg)) +
  geom_point() + 
  geom_abline(intercept = 37.285, slope = -5.344)
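
The intercept and slope used above match a simple linear regression of mpg on wt. As a sketch of how such values can be obtained rather than typed in by hand, the coefficients can be taken from lm(); the names fit and coefs are illustrative.

```r
library(ggplot2)

# Fit a simple linear regression of mpg on wt
fit <- lm(mpg ~ wt, data = mtcars)

# Named numeric vector with elements "(Intercept)" and "wt"
coefs <- coef(fit)

# Pass the fitted coefficients to geom_abline() instead of hard-coding them
p <- ggplot(mtcars, aes(x = wt, y = mpg)) +
  geom_point() +
  geom_abline(intercept = coefs[["(Intercept)"]], slope = coefs[["wt"]])
# print(p) draws the plot
```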

If you are using geom_smooth(), specify how the line is fit with the method argument, for example lm for a linear model or loess for a local regression smoother. The se argument controls whether the confidence interval is displayed around the fit; it defaults to TRUE.

ggplot(mtcars, aes(x = wt, y = mpg)) +
  geom_smooth(method = 'lm', se = TRUE)
## `geom_smooth()` using formula 'y ~ x'

Here we use the 'loess' method, which fits a smooth local regression curve rather than a straight line, and hide the confidence interval.

ggplot(mtcars, aes(x = wt, y = mpg)) +
  geom_smooth(method = 'loess', se = FALSE)
## `geom_smooth()` using formula 'y ~ x'
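
The examples above draw only the fitted line. A common variation, sketched here with illustrative colors, overlays the points and both fits on one plot so they can be compared; the formula is passed explicitly, which avoids the message about the default formula.

```r
library(ggplot2)

p <- ggplot(mtcars, aes(x = wt, y = mpg)) +
  geom_point() +
  # Straight-line fit with its confidence band
  geom_smooth(method = "lm", formula = y ~ x, se = TRUE, color = "blue") +
  # Local regression curve without a confidence band
  geom_smooth(method = "loess", formula = y ~ x, se = FALSE, color = "red")
# print(p) draws the plot
```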

2.5 Bar

Bar plots present grouped data with rectangular bars. The bars may represent the frequency of the groups or values. Bar plots can be:

  • horizontal
  • vertical
  • grouped
  • stacked
  • proportional

2.5.1 Frequency

ggplot(ecom, aes(x = factor(device))) +
  geom_bar()

2.5.2 Weight

If the bars should represent the total of a continuous variable, map that variable to the weight aesthetic within aes(). In the example below, the bars do not represent the count of devices; instead, they represent the total order value for each device type.

ggplot(ecom, aes(x = factor(device))) +
  geom_bar(aes(weight = order_value))
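
To see what the weight aesthetic does, the bar heights can be compared with group totals computed by hand. The sketch below uses a small made-up data frame (orders is hypothetical, not the ecom data) so that it runs without a download.

```r
library(ggplot2)

# Hypothetical stand-in for ecom with made-up order values
orders <- data.frame(
  device      = c("laptop", "laptop", "mobile", "tablet", "tablet"),
  order_value = c(100, 250, 80, 40, 60)
)

p <- ggplot(orders, aes(x = device)) +
  geom_bar(aes(weight = order_value))

# Heights ggplot2 computed for the bars, in axis order
bar_heights <- layer_data(p)$y

# The same totals computed directly from the data
totals <- tapply(orders$order_value, orders$device, sum)
```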

2.5.3 Stacked Bar Plot

To create a stacked bar plot, map the fill aesthetic to a categorical variable. Each bar is then split into segments, one for each value of that variable.

ggplot(ecom, aes(x = factor(device))) +
  geom_bar(aes(fill = purchase))

2.5.4 Horizontal Bar Plot

A horizontal bar plot can be created by flipping the coordinate axes using the coord_flip() function.

ggplot(ecom, aes(x = factor(device))) +
  geom_bar(aes(fill = purchase)) +
  coord_flip()
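
Grouped and proportional bar plots, listed at the start of this section, are not shown above. One way to get them is the position argument of geom_bar(), sketched here on a small made-up data frame (visits is hypothetical, standing in for ecom).

```r
library(ggplot2)

# Hypothetical stand-in for ecom: a handful of made-up visits
visits <- data.frame(
  device   = c("laptop", "laptop", "laptop", "mobile", "mobile",
               "tablet", "tablet", "tablet"),
  purchase = c(TRUE, FALSE, FALSE, TRUE, FALSE, FALSE, FALSE, TRUE)
)

# Grouped: one bar per purchase value, side by side within each device
p_grouped <- ggplot(visits, aes(x = device, fill = purchase)) +
  geom_bar(position = "dodge")

# Proportional: segments stacked and rescaled so each device sums to 1
p_prop <- ggplot(visits, aes(x = device, fill = purchase)) +
  geom_bar(position = "fill")
# print(p_grouped) or print(p_prop) draws the plot
```

The default position for geom_bar() is "stack", which is why mapping fill alone produces a stacked bar plot.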

2.6 Columns

If the data has already been summarized, use geom_col() instead of geom_bar(). In the example below, we have the total visits for each device type. geom_bar() counts the rows in each group by default, so it is not the right tool here; geom_col() uses the values in visits directly as the bar heights.

device <- c('laptop', 'mobile', 'tablet')
visits <- c(30000, 12000, 5000)
traffic <- tibble::tibble(device, visits)
ggplot(traffic, aes(x = device, y = visits)) +
  geom_col(fill = 'blue') 
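
For comparison, geom_bar() can also plot pre-summarized values when its default counting is switched off with stat = "identity". The sketch below checks that both approaches give the same bar heights; traffic_summary is an illustrative name for the same figures as above.

```r
library(ggplot2)

# Same summarized figures as above, in a plain data frame
traffic_summary <- data.frame(
  device = c("laptop", "mobile", "tablet"),
  visits = c(30000, 12000, 5000)
)

# geom_col(): heights taken directly from the y variable
p_col <- ggplot(traffic_summary, aes(x = device, y = visits)) +
  geom_col()

# geom_bar() with counting disabled behaves the same way
p_identity <- ggplot(traffic_summary, aes(x = device, y = visits)) +
  geom_bar(stat = "identity")

heights_col      <- layer_data(p_col)$y
heights_identity <- layer_data(p_identity)$y
```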

2.7 Boxplot

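The box plot is a standardized way of displaying the distribution of data based on the five number summary: minimum, first quartile, median, third quartile, and maximum. Box plots are useful for detecting outliers and for comparing distributions. They show the shape, central tendency, and variability of the data. Use geom_boxplot() to create a box plot. Note that geom_boxplot() extends the whiskers only to the most extreme observations within 1.5 times the interquartile range of the box; observations beyond that are drawn as individual points.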

ggplot(ecom, aes(x = factor(device), y = n_pages)) +
  geom_boxplot()
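
The statistics behind each box can be inspected with layer_data(). The sketch below uses mtcars as a stand-in (an assumption made so that it runs without downloading ecom) and compares the line in the middle of each box with the group median computed by hand.

```r
library(ggplot2)

p <- ggplot(mtcars, aes(x = factor(cyl), y = mpg)) +
  geom_boxplot()

# One row per box: whisker ends (ymin, ymax), hinges (lower, upper), median (middle)
box_stats <- layer_data(p)[, c("ymin", "lower", "middle", "upper", "ymax")]

# Group medians computed directly from the data
medians <- tapply(mtcars$mpg, mtcars$cyl, median)
```

Here ymin and ymax are the whisker ends, which can differ from the true minimum and maximum of a group when it contains outliers.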

2.8 Histogram

A histogram is a plot that can be used to examine the shape and spread of continuous data. It resembles a bar plot, but each bar shows the number of observations that fall within an interval (bin) of the variable. Histograms can be used to detect outliers and skewness in data. Use geom_histogram() to create a histogram.

ggplot(ecom, aes(x = duration)) +
  geom_histogram()
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.

You can control the number of bins using the bins argument.

ggplot(ecom, aes(x = duration)) +
  geom_histogram(bins = 5)
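
The message printed by geom_histogram() suggests binwidth as an alternative to bins. As a sketch, the example below sets the width of each bin in the units of the variable, using mtcars as a stand-in (an assumption; the width of 5 is arbitrary).

```r
library(ggplot2)

# Each bin spans 5 units of mpg, however many bins that requires
p <- ggplot(mtcars, aes(x = mpg)) +
  geom_histogram(binwidth = 5)

# Bin edges as computed by ggplot2
bins_used  <- layer_data(p)
bin_widths <- bins_used$xmax - bins_used$xmin
```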

2.9 Line

Line charts are used to examine trends over time. We will use a different data set for exploring line plots.

2.9.1 Data

gdp <- readr::read_csv('https://raw.githubusercontent.com/rsquaredacademy/datasets/master/gdp.csv')
## Warning: Missing column names filled in: 'X1' [1]
gdp
## # A tibble: 6 x 6
##      X1     X year       growth india china
##   <dbl> <dbl> <date>      <dbl> <dbl> <dbl>
## 1     1     1 2000-01-01      6     5     8
## 2     2     2 2001-01-01      9     9     5
## 3     3     3 2002-01-01      8     8     6
## 4     4     4 2003-01-01      9     8     8
## 5     5     5 2004-01-01      9     5     9
## 6     6     6 2005-01-01      8     7     8

Use geom_line() to create a line chart. In the below plot, we chart the GDP of India, the fastest growing economy in emerging markets, across years.

ggplot(gdp, aes(year, india)) +
  geom_line()

The color and line type can be modified using the color and linetype arguments. We will explore the different line types in an upcoming post.

ggplot(gdp, aes(year, india)) +
  geom_line(color = 'blue', linetype = 'dashed')
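
Several lines can share one plot by adding one geom_line() layer per series. The sketch below rebuilds the india and china columns from the printed output above in a local data frame (gdp_local is an illustrative name) so that it runs without a download; the colors are arbitrary.

```r
library(ggplot2)

# Values copied from the gdp tibble printed above
gdp_local <- data.frame(
  year  = as.Date(paste0(2000:2005, "-01-01")),
  india = c(5, 9, 8, 8, 5, 7),
  china = c(8, 5, 6, 8, 9, 8)
)

# x is shared by both layers; each layer supplies its own y
p <- ggplot(gdp_local, aes(x = year)) +
  geom_line(aes(y = india), color = "blue") +
  geom_line(aes(y = china), color = "red", linetype = "dashed")
# print(p) draws the plot
```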

Add horizontal or vertical lines using:

  • geom_hline()
  • geom_vline()

2.9.2 Horizontal Line

To add a horizontal line, the Y axis intercept must be supplied using the yintercept argument.

ggplot(mtcars, aes(x = wt, y = mpg)) +
  geom_point() +
  geom_hline(yintercept = 30) 

2.9.3 Vertical Line

For the vertical line, the X axis intercept must be supplied using the xintercept argument.

ggplot(mtcars, aes(x = wt, y = mpg)) +
  geom_point() +
  geom_vline(xintercept = 5) 
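
The intercepts do not have to be typed in as constants. As a sketch, the reference lines below are placed at the mean of each variable, which splits the scatter plot into four quadrants; the dashed line type is an arbitrary choice.

```r
library(ggplot2)

p <- ggplot(mtcars, aes(x = wt, y = mpg)) +
  geom_point() +
  # Horizontal line at the mean of mpg
  geom_hline(yintercept = mean(mtcars$mpg), linetype = "dashed") +
  # Vertical line at the mean of wt
  geom_vline(xintercept = mean(mtcars$wt), linetype = "dashed")
# print(p) draws the plot
```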

2.10 Label

You can label the points using geom_label().

ggplot(mtcars, aes(disp, mpg, label = rownames(mtcars))) +
  geom_label()

2.11 Text

geom_text() offers another way to add text to the plots. We will learn to modify the appearance and location of the text in another post.

ggplot(mtcars, aes(disp, mpg, label = rownames(mtcars))) +
  geom_text(check_overlap = TRUE, size = 2)
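
Labels are often combined with points and restricted to the observations of interest. The sketch below labels only the cars above an arbitrary threshold of 25 mpg by passing a subset to the data argument of geom_text(); the cars copy and its model column are illustrative names.

```r
library(ggplot2)

# Copy mtcars and store the row names in a regular column
cars <- mtcars
cars$model <- rownames(cars)

p <- ggplot(cars, aes(x = disp, y = mpg)) +
  geom_point() +
  # This layer uses its own data: only cars with mpg above 25 get a label
  geom_text(
    data = subset(cars, mpg > 25),
    aes(label = model),
    vjust = -0.5,  # shift the text slightly above each point
    size = 3
  )
# print(p) draws the plot
```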