Chapter 1 Quick Tour
1.1 Introduction
In this chapter, we will learn to quickly build a set of plots that are routinely used to explore data using qplot(). It is convenient for creating plots quickly, but it also has certain limitations. Nevertheless, if you want to explore data using a single function, qplot() is your friend.
1.2 Libraries, Code & Data
We will use the following libraries in this chapter:
All the data sets used in this chapter can be found here and code can be downloaded from here.
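The examples in this chapter call qplot() on the mtcars and economics data sets. A minimal setup sketch follows, assuming ggplot2 is already installed. qplot() and economics come from ggplot2, while mtcars ships with base R. Recent ggplot2 releases mark qplot() as deprecated, so a deprecation warning may appear even though the calls still run.

```r
# Assumption: ggplot2 is already installed, e.g. via install.packages("ggplot2").
library(ggplot2)   # provides qplot() and the economics data set

# mtcars ships with base R (datasets package); economics comes with ggplot2.
# Inspect only the columns that the examples in this chapter rely on.
str(mtcars[, c("mpg", "disp", "cyl", "am", "qsec")])
str(economics[, c("date", "unemploy")])
```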
1.3 Scatter Plot
Scatter plots are used to examine the relationship between two continuous variables. The relationship can be examined across the levels of a categorical variable as well. Let us begin by creating a scatter plot. The first two inputs are the variables/columns representing the X and Y axes. The next input is the name of the data set.
qplot(disp, mpg, data = mtcars)
If you want the relationship between the two variables to be represented by both points and a line, use the geom argument and supply the values as a character vector.
qplot(disp, mpg, data = mtcars, geom = c('point', 'line'))
The color of the points can be mapped to a categorical variable, in our case cyl, using the color argument. Ensure that the variable is categorical using factor().
qplot(disp, mpg, data = mtcars, color = factor(cyl))
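To see why factor() matters here, the sketch below compares the raw cyl column with its factor version. The object names (cyl_factor, p_gradient, p_discrete) are illustrative only. Because cyl is stored as a number, mapping it directly yields a continuous color scale, whereas the factor version yields one color per level.

```r
library(ggplot2)

# cyl is stored as a number, so mapping it directly is treated as continuous
class(mtcars$cyl)

# factor() converts it into a small set of discrete levels
cyl_factor <- factor(mtcars$cyl)
levels(cyl_factor)

# Continuous color bar vs. one color per level
p_gradient <- qplot(disp, mpg, data = mtcars, color = cyl)
p_discrete <- qplot(disp, mpg, data = mtcars, color = factor(cyl))
```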
The shape and size of the points can also be mapped to variables using the shape and size arguments, as shown in the examples below.
qplot(disp, mpg, data = mtcars, shape = factor(cyl))
Ensure that size is mapped to a continuous variable.
qplot(disp, mpg, data = mtcars, size = qsec)
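The mappings shown in this section can also be combined in a single call. The following is a sketch rather than a recommendation, since stacking several aesthetics on one plot can make it harder to read. The name p_combined is illustrative, and layer_data() is used only to inspect what ggplot2 computed for the layer.

```r
library(ggplot2)

# Map color and shape to the categorical cyl, and size to the continuous qsec
p_combined <- qplot(disp, mpg, data = mtcars,
                    color = factor(cyl), shape = factor(cyl), size = qsec)

# Inspect the computed layer: one row per car, one shape per level of cyl
combined_data <- layer_data(p_combined)
nrow(combined_data)
length(unique(combined_data$shape))
```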
1.4 Bar Plot
A bar plot represents data in rectangular bars. The lengths of the bars are proportional to the values they represent. Bar plots can be either horizontal or vertical. In a vertical bar plot, the X axis represents the levels or categories and the Y axis represents the frequency/count of the variable.
To create a bar plot, the first input must be a categorical variable. You can convert a variable to type factor (the R equivalent of categorical) using the factor() function. The next input is the name of the data set and the final input is geom, which is supplied the value 'bar'.
qplot(factor(cyl), data = mtcars, geom = c('bar'))
You can create a stacked bar plot using the fill argument and mapping it to another categorical variable.
qplot(factor(cyl), data = mtcars, geom = c('bar'), fill = factor(am))
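The bar heights in this section are simply counts of rows per category. As a quick cross-check, the same counts can be tabulated with base R's table(). The object names cyl_counts and cyl_am_counts are illustrative.

```r
# Counts behind the simple bar plot: one bar per level of cyl
cyl_counts <- table(cyl = mtcars$cyl)
cyl_counts

# Counts behind the stacked bar plot: each cyl bar is split by am
cyl_am_counts <- table(cyl = mtcars$cyl, am = mtcars$am)
cyl_am_counts

# Each row of the two-way table adds up to the height of the matching bar
rowSums(cyl_am_counts)
```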
1.5 Box Plot
The box plot is a standardized way of displaying the distribution of data based on the five number summary: minimum, first quartile, median, third quartile, and maximum. Box plots are useful for detecting outliers and for comparing distributions. They show the shape, central tendency and variability of the data.
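The five number summary can be computed directly with base R's fivenum(), which helps when reading a box plot. This is a sketch: the labels attached to the result are illustrative. fivenum() reports Tukey's hinges, which can differ slightly from the quartiles returned by quantile(). The whiskers that ggplot2 draws by default stop at 1.5 times the interquartile range, so they do not always reach the minimum and maximum.

```r
# Five number summary of mpg: minimum, lower hinge, median, upper hinge, maximum
mpg_summary <- fivenum(mtcars$mpg)
names(mpg_summary) <- c("min", "lower_hinge", "median", "upper_hinge", "max")
mpg_summary

# Quartiles from quantile() for comparison with the hinges above
quantile(mtcars$mpg, probs = c(0.25, 0.5, 0.75))
```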
Box plots can be created by supplying the value 'boxplot' to the geom argument. The first input must be a categorical variable and the second must be a continuous variable.
qplot(factor(cyl), mpg, data = mtcars, geom = c('boxplot'))
Unlike base R's boxplot() function, which accepts a single variable, the form of qplot() used here takes a categorical variable as its first input. If you are not comparing the distribution of a variable across the levels of a categorical variable, supply a constant, factor(1), as the first input, as shown below.
qplot(factor(1), mpg, data = mtcars, geom = c('boxplot'))
1.6 Line Chart
Line charts are used to examine trends across time. To create a line chart, supply the value 'line' to the geom argument. The first two inputs should be the names of the columns/variables representing the X and Y axes, and the third input must be the name of the data set.
qplot(x = date, y = unemploy, data = economics, geom = c('line'))
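The appearance of the line can be modified using the color argument. Wrap the value in I() so that qplot() treats it as a literal color instead of mapping it like a variable, as shown below.
qplot(x = date, y = unemploy, data = economics, geom = c('line'),
color = I('red'))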
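Because a line chart is read along time, it is often useful to zoom in on part of the series. The sketch below filters economics before plotting. The cutoff date and the names recent and p_recent are assumptions chosen purely for illustration.

```r
library(ggplot2)

# Hypothetical cutoff date, chosen only for illustration
recent <- subset(economics, date >= as.Date("2000-01-01"))

# Same call as before, applied to the filtered data
p_recent <- qplot(x = date, y = unemploy, data = recent, geom = "line")
p_recent
```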
1.7 Histogram
A histogram is a plot that can be used to examine the shape and spread of continuous data. It looks very similar to a bar graph and can be used to detect outliers and skewness in data. When a single continuous variable is supplied, qplot() draws a histogram, and the bins argument sets the number of bins, as shown below. The first input is the name of the continuous variable and the second is the name of the data set.
qplot(mpg, data = mtcars, bins = 5)
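The choice of bins changes how the same data are grouped. As a rough sketch, the code below builds two histograms and uses layer_data() from ggplot2 to inspect the bin edges and counts computed for each. The object names are illustrative.

```r
library(ggplot2)

# Same variable, two different numbers of bins
p_bins5  <- qplot(mpg, data = mtcars, bins = 5)
p_bins10 <- qplot(mpg, data = mtcars, bins = 10)

# layer_data() returns the computed bins, one row per bar
bins5  <- layer_data(p_bins5)
bins10 <- layer_data(p_bins10)
bins5[, c("xmin", "xmax", "count")]

# Every observation falls into exactly one bin in either plot
sum(bins5$count)
sum(bins10$count)
```

Fewer bins give a smoother but coarser picture, while more bins reveal finer detail at the cost of noisier counts.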