Chapter 1 Quick Tour

1.1 Introduction

In this chapter, we will learn to build a set of plots that are routinely used to explore data using qplot(). It creates a plot with a single function call, but it offers less control than the full ggplot() interface. Nevertheless, if you want to quickly explore data using a single function, qplot() is your friend.

1.2 Libraries, Code & Data

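We will use the following libraries in this chapter. qplot() and the economics data set come from ggplot2, while mtcars ships with base R:

library(ggplot2)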

All the data sets used in this chapter can be found here and code can be downloaded from here.

1.3 Scatter Plot

Scatter plots are used to examine the relationship between two continuous variables. The relationship can be examined across the levels of a categorical variable as well. Let us begin by creating a scatter plot. The first two inputs are the variables/columns representing the X and Y axes. The next input is the name of the data set.

qplot(disp, mpg, data = mtcars)

If you want the relationship between the two variables to be represented by both points and a line, use the geom argument and supply the values as a character vector.

qplot(disp, mpg, data = mtcars, geom = c('point', 'line'))

The color of the points can be mapped to a categorical variable, in our case cyl, using the color argument. Convert the variable with factor() so that it is treated as categorical.

qplot(disp, mpg, data = mtcars, color = factor(cyl))
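
To see why factor() is needed here, it can help to inspect the column first. The following is a small sketch using base R only; the name cyl_factor is illustrative. In mtcars, cyl is stored as a number, so without factor() it would be treated as a continuous variable.

```r
# cyl is stored as a numeric column in mtcars
class(mtcars$cyl)

# factor() turns it into a categorical variable with one level per distinct value
cyl_factor <- factor(mtcars$cyl)
levels(cyl_factor)
```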

The shape and size of the points can also be mapped to variables using the shape and size arguments, as shown in the examples below.

qplot(disp, mpg, data = mtcars, shape = factor(cyl))

Ensure that size is mapped to a continuous variable.

qplot(disp, mpg, data = mtcars, size = qsec)
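
These mappings can also be combined in a single call. The following is a minimal sketch, assuming ggplot2 is installed; the object name p is only illustrative.

```r
library(ggplot2)

# map color to a categorical variable and size to a continuous one
p <- qplot(disp, mpg, data = mtcars,
           color = factor(cyl), size = qsec)

# printing the object draws the plot
p
```

Storing the plot in an object before printing it makes it easier to reuse or save later.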

1.4 Bar Plot

A bar plot represents data as rectangular bars. The lengths of the bars are proportional to the values they represent. Bar plots can be either horizontal or vertical. In a vertical bar plot, the X axis represents the levels or categories and the Y axis represents the frequency/count of the variable.

To create a bar plot, the first input must be a categorical variable. You can convert a variable to type factor (R's equivalent of a categorical variable) using the factor() function. The next input is the name of the data set and the final input is the geom argument, which is set to 'bar'.

qplot(factor(cyl), data = mtcars, geom = c('bar'))
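
The height of each bar is the number of rows at that level. As a quick check, the same counts can be tabulated with base R; the name cyl_counts is illustrative.

```r
# number of cars for each cylinder count; these are the bar heights
cyl_counts <- table(mtcars$cyl)
cyl_counts
```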

You can create a stacked bar plot using the fill argument and mapping it to another categorical variable.

qplot(factor(cyl), data = mtcars, geom = c('bar'), fill = factor(am))
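
Each segment of a stacked bar corresponds to one cell of a two-way frequency table. The sketch below, using base R only and the illustrative name cyl_by_am, tabulates the counts behind the plot.

```r
# rows are cylinder counts, columns are the levels of am
cyl_by_am <- table(cyl = mtcars$cyl, am = mtcars$am)
cyl_by_am

# row totals equal the full height of each stacked bar
rowSums(cyl_by_am)
```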

1.5 Box Plot

The box plot is a standardized way of displaying the distribution of data based on the five number summary: minimum, first quartile, median, third quartile, and maximum. Box plots are useful for detecting outliers and for comparing distributions. They show the shape, central tendency and variability of the data.
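
The five number summary described above can be computed directly with base R. This is a small sketch with the illustrative name mpg_summary. Note that fivenum() reports hinges, which can differ slightly from the quartiles returned by quantile(), and that the whiskers drawn by ggplot2 stop at 1.5 times the interquartile range by default, with more extreme values shown as individual points.

```r
# minimum, lower hinge, median, upper hinge and maximum of mpg
mpg_summary <- fivenum(mtcars$mpg)
mpg_summary
```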

Box plots can be created by supplying the value 'boxplot' to the geom argument. The first input must be a categorical variable and the second must be a continuous variable.

qplot(factor(cyl), mpg, data = mtcars, geom = c('boxplot'))

Unlike base R's boxplot(), qplot() cannot create a box plot from a single variable. If you are not comparing the distribution of a variable across the levels of a categorical variable, you must supply a constant such as factor(1) as the first input, as shown below.

qplot(factor(1), mpg, data = mtcars, geom = c('boxplot'))

1.6 Line Chart

Line charts are used to examine trends across time. To create a line chart, supply the value 'line' to the geom argument. The first two inputs should be the names of the columns/variables representing the X and Y axes, and the third input must be the name of the data set.

qplot(x = date, y = unemploy, data = economics, geom = c('line'))
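
Before plotting a time series, it is worth confirming that the X axis column is stored as a date. The following is a minimal sketch, assuming ggplot2 is installed (it provides the economics data set).

```r
library(ggplot2)

# the X axis variable should be of class Date so the axis is drawn as a time scale
class(economics$date)

# first few rows of the two columns used in the line chart
head(economics[, c('date', 'unemploy')])
```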

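The appearance of the line can be modified using the color argument. Because qplot() treats its arguments as aesthetic mappings, wrap a fixed color in I() so that it is used as-is instead of being mapped to a legend entry.

qplot(x = date, y = unemploy, data = economics, geom = c('line'),
      color = I('red'))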

1.7 Histogram

A histogram is a plot that can be used to examine the shape and spread of continuous data. It looks very similar to a bar graph and can be used to detect outliers and skewness in data. When only one continuous variable is supplied, qplot() draws a histogram by default, and the bins argument sets the number of bins, as shown below. The first input is the name of the continuous variable and the second is the name of the data set.

qplot(mpg, data = mtcars, bins = 5)
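
To get a feel for what the bins argument does, the values can be grouped into five equal-width intervals with base R. This is only an approximation for illustration, with the illustrative name mpg_bins: ggplot2 chooses its own bin boundaries, so the counts in the plot may differ from these.

```r
# split mpg into 5 equal-width intervals and count the values in each
mpg_bins <- table(cut(mtcars$mpg, breaks = 5))
mpg_bins
```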