---
title: "Visualizing Three or More Numeric Variables"
output:
html_document:
toc: yes
---
```{r global_options, include=FALSE}
knitr::opts_chunk$set(collapse=TRUE)
```
```{r, include = FALSE}
library(UsingR)
library(lattice)
library(tidyverse)
library(gridExtra)
library(rgl)
set.seed(12345)
```
* A static graph is two-dimensional.
* Showing a third dimension is a challenge.
* For statistical data three dimensions is not particularly special.
* But we can take advantage of our ability to perceive
three-dimensional structure.
## Soil Resistivity Data
[Data](http://www.stat.uiowa.edu/~luke/data/soil.dat) from Cleveland's
_Visualizing Data_ book contains measurements of soil resistivity of
an agricultural field along a roughly rectangular grid.
Using functions from the `lattice` package we can view the locations
at which measurements were taken and a perspective plot of the 3D
point cloud:
```{r, include = FALSE}
if (! file.exists("soil.dat"))
download.file("http://www.stat.uiowa.edu/~luke/data/soil.dat",
"soil.dat")
```
```{r}
soil <- read.table("soil.dat")
p <- cloud(resistivity ~ easting * northing, pch = ".", data = soil)
s <- xyplot(northing ~ easting, pch = ".", aspect = 2.44, data = soil)
print(s, split = c(1, 1, 2, 1), more = TRUE)
print(p, split = c(2, 1, 2, 1))
```
* The data is quite noisy but there is some structure.
* The viewing angle and distance for the point cloud can be adjusted.
## Conditioning Plots (Coplots)
One way to try to get a handle on higher dimensional data is to try to
fix values of some variables and visualize the values of others in 2D.
This can be done with
* interactive tools;
* small multiples with trellis displays or faceting.
A conditioning plot, or _coplot_:
* Shows a collection of plots of two variables for different
settings of one or more additional variables, the conditioning
variables.
* For ordered conditioning variables the plots are arranged in a way
that reflects the order.
* When a conditioning variable is numeric, or ordered categorical with
many levels, the values of the conditioning variable are grouped
into bins.
* A useful strategy is to allow the bins to overlap somewhat.
* Trellis/lattice plots support this with the concept of a _shingle_.
* This does not seem easy to do with `ggplot`.
A lattice plot of `resistivity` against `northing` conditioned on `easting`:
```{r}
xyplot(resistivity ~ northing | equal.count(easting), data = soil)
```
`equal.count` produces a _shingle_:
```{r}
str(equal.count(soil$easting))
```
* The default number of intervals in 6.
* The intervals used are highlighted in the title bars.
* The intervals are chosen to contain approximately equal numbers of observations.
* By default, the intervals overlap by 50%.
* The intervals are computed by `co.intervals`; the amount of overlap
is controlled by the `overlap` argument.
Using smaller points to reduce over-plotting:
```{r}
xyplot(resistivity ~ northing | equal.count(easting), data = soil, cex = 0.2)
```
A larger number of intervals can be specified as a second argument to
`equal.bins`:
```{r}
xyplot(resistivity ~ northing | equal.count(easting, 12), data = soil,
cex = 0.2)
```
### Coplot Variations
To help see the signal we can add a smooth, by default computed using
the `loess` smoother. This can be done with a `type` specification
that includes the string `"smooth"`:
```{r}
xyplot(resistivity ~ northing | equal.count(easting, 12), data = soil,
cex = 0.2, type = c("p", "smooth"))
```
The smooth is hard to see. Some options:
* Omit the data and only show the smooth.
* Show the data in a less intense color, such as gray.
* Use a contrasting color for the smooth curves.
* Show the data using alpha blending.
Color adjustments can be accomplished with the `col` and `col.line`
arguments; the line width can be adjusted with `lwd`. For a thicker
red line:
```{r}
xyplot(resistivity ~ northing | equal.count(easting, 12), data = soil,
cex = 0.2, type = c("p", "smooth"), col.line = "red", lwd = 2)
```
For gray points and a red smooth:
```{r}
xyplot(resistivity ~ northing | equal.count(easting, 12), data = soil,
cex = 0.2, type = c("p", "smooth"),
col.line = "red", col = "gray", lwd = 2)
```
The `xyplot` function supports an `alpha` argument, but it affects
both points and lines.
One option for changing only the alpha level for points is to use a
_panel function_:
```{r, eval = FALSE}
xyplot(resistivity ~ northing | equal.count(easting, 12), data = soil,
cex = 0.2, panel = function(x, y, ...) {
panel.xyplot(x, y, ...)
panel.loess(x, y, col = "red", lwd = 2)})
```
Another option is to create a color for the points with the
desired alpha level included in its encoding:
```{r}
blue_w_alpha <- rgb(0, 0, 1, 0.1)
blue_w_alpha
```
This can then be used with the `col` argument:
```{r}
xyplot(resistivity ~ northing | equal.count(easting, 12), data = soil,
cex = 0.2, type = c("p", "smooth"),
col.line = "red", col = blue_w_alpha, lwd = 2)
```
After conditioning on one of two explanatory variables it is useful to
also consider conditioning on the other:
```{r}
xyplot(resistivity ~ easting | equal.count(northing, 12), data = soil,
cex = 0.2, type = c("p", "smooth"),
col.line = "red", col = blue_w_alpha, lwd = 2)
```
### Copots using `ggplot`
`ggplot` faceting is analogous to trellis/lattice conditioning.
* The main missing feature is the possibility of overlap among group.
* Every data point has to belong to exactly one facet.
* This also makes showing a selection of narrow windows more challenging.
The basic coplot:
```{r}
p <- ggplot(soil, aes(northing, resistivity))
p + geom_point() + facet_wrap(~ cut_number(easting, 6))
```
* `cut_number` is used to form non-overlapping groups with
approximately equal numbers of observations.
* `cut_interval` can be used to divide the range in equal intervals.
* The facet ordering is top to bottom; lattice orders bottom to top.
Using 12 facets and smaller points:
```{r}
p + geom_point(size = 0.5) + facet_wrap(~ cut_number(easting, 12))
```
The default smoothing method is currently "loess" for less than 1000 points:
```{r}
p + geom_point(size = 0.5) + geom_smooth(color = "red") +
facet_wrap(~ cut_number(easting, 12))
```
For larger data sets `method = "gam"` would be used:
```{r}
p + geom_point(size = 0.5) + facet_wrap(~ cut_number(easting, 12)) +
geom_smooth(method = "gam", color = "red", formula = y ~ s(x))
```
## Contour Plots for Surfaces
The `loess` local polynomial smoother can be used to estimate a smooth
signal surface as a function of the two location variables.
The estimated surface level can be computed on a grid of points using
the `predict` method of the fit.
```{r}
eastseq <- seq(.15, 1.410, by = .015)
northseq <- seq(.150, 3.645, by = .015)
soi.grid <- expand.grid(easting = eastseq, northing = northseq)
head(soi.grid)
m <- loess(resistivity ~ easting * northing, span = 0.25,
degree = 2, data = soil)
soi.fit <- predict(m, soi.grid)
```
Contour plots are one approach to visualizing a surface based on
values on a grid.
* Contour plots compute contours, or level curves, as polygons at
a set of levels.
* Contour plots draw the level curves, often with a level annotation.
* Contour plots can also have their polygons filled in with colors
representing the levels.
A `lattice` contour plot for the fitted surface:
```{r}
asp <- diff(range(soi.grid$n)) / diff(range(soi.grid$e))
contourplot(soi.fit ~ soi.grid$easting * soi.grid$northing,
cuts = 9, aspect = asp,
xlab = "Easting (km)",
ylab = "Northing (km)")
```
A basic contour plot in `ggplot` using `geom_contour`:
```{r}
p <- ggplot(mutate(soi.grid, fit = as.numeric(soi.fit)),
aes(x = easting, y = northing, z = fit)) +
coord_fixed()
p + geom_contour()
```
* Neither `lattice` nor `ggplot` seem to make it easy to fill in the
contours.
* The base function `filled.contour` is available for this:
```{r, fig.asp = 1}
cm.rev <- function(...) rev(cm.colors(...))
filled.contour(eastseq, northseq, soi.fit, asp = 1, color.palette = cm.rev)
```
## Level Plots
A _level plot_ colors a grid spanned by two variables by the color of a
third variable.
* Level plots are also called _image plots_
* The term _heat map_ is also used, in particular with a specific
color scheme. But heat map often means a more complex visualization
with an image plot at its core.
* Superimposing contours on a level plot is often helpful.
The `lattice` package provides the function `levelplot`.
A basic level plot:
```{r}
levelplot(soi.fit ~ soi.grid$easting * soi.grid$northing,
cuts = 9, aspect = asp,
xlab = "Easting (km)",
ylab = "Northing (km)")
```
A level plot with superimposed contours:
```{r}
lv <- levelplot(soi.fit ~ soi.grid$easting * soi.grid$northing,
cuts = 9, aspect = asp, contour = TRUE,
xlab = "Easting (km)",
ylab = "Northing (km)")
lv
```
`ggplot` provides `geom_tile` that can be used for a level plot:
```{r}
p + geom_tile(aes(fill = fit)) + geom_contour() +
scale_fill_gradientn(colors = rev(cm.colors(100)))
```
* Level plots do not require computing contours, but are not not as
smooth as filled contour plots.
* Visually, image plots and filled contour plots are very similar for
fine grids, but image plots are less smooth for coarse ones.
* Lack of smoothness is less of an issue when the data values
themselves are noisy.
The grid for the
[`volcano` data](https://teara.govt.nz/en/photograph/3920/maungawhau-mt-eden)
set is coarser and illustrates the lack of smoothness.
```{r}
vd <- expand.grid(x = seq_len(nrow(volcano)), y = seq_len(ncol(volcano)))
vd$z <- as.numeric(volcano)
levelplot(z ~ x * y, data = vd, cuts = 10)
```
A `filled.contour` plot looks like this:
```{r}
filled.contour(volcano, nlevels = 10, color.palette = cm.rev)
```
* A coarse grid can be interpolated to a finer grid.
* Irregularly spaced data can also be interpolated to a grid.
* The `interp` function in the `akima` package is useful for
this kind of interpolation.
## Wire Frame and Perspective Plots
A surface can also be visualized using a _wire frame_ plot showing a
3D view of the surface from a particular viewpoint.
* A simple wire frame plot is often sufficient.
* Lighting and shading can be used to enhance the 3D effect.
A basic wire frame plot for the volcano data:
```{r}
wireframe(z ~ x * y, data = vd, aspect = c(61 / 89, 0.3))
```
* _Wire frame_ is a bit of a misnomer since surface panels in front
occlude lines behind them.
* For a fine grid, as in the soil surface, the lines are too dense.
* The use of shading for the surfaces can help.
```{r}
wireframe(z ~ x * y, data = vd, aspect = c(61 / 89, 0.3), shade = TRUE)
```
A wire frame plot with shading for the soil resistivity data:
```{r}
wf <- wireframe(soi.fit ~ soi.grid$easting * soi.grid$northing,
aspect = asp, shade = TRUE,
xlab = "Easting (km)",
ylab = "Northing (km)")
wf
```
Both ways of looking at a surface are useful:
```{r}
print(lv, split = c(1, 1, 2, 1), more = TRUE)
print(wf, split = c(2, 1, 2, 1))
```
* The level plot/contour representation is useful for recognizing
locations of key features.
* The wire frame view helps build a mental model of the 3D structure.
* Being able to interactively adjust the viewing position for a wire
frame model greatly enhances our ability to understand the 3D
structure.
## Interactive 3D Plots Using OpenGL
[OpenGL](https://www.opengl.org/) is a standardized framework for high
performance graphics.
* The `rgl` package provides an R interface to some of OpenGL's
capabilities.
* [WebGL](https://www.khronos.org/webgl/) is a JavaScript framework
for using OpenGL within a browser window.
* Most desktop browsers support WebGL; some mobile browsers do as
well.
* In some cases support may be available but not enabled by default.
You may be able to get help at .
* `knitr` and `rgl` provide support for embedding OpenGL images in web
pages.
* It is also possible to embed OpenGL images in PDF files, but not all
PDF viewers support this.
This code run in R will open a new window containing an interactive
3D scene (but this currently fails on FastX):
```{r soi.fit_rgl, eval = FALSE}
library(rgl)
bg3d(color = "white")
clear3d()
par3d(mouseMode="trackball")
surface3d(eastseq, northseq,
soi.fit / 100, color = rep("red", length(soi.fit)))
```
To embed an image in an HTML, document first set the `webgl` hook with
a code chunk like this:
```{r}
knitr::knit_hooks$set(webgl = rgl::hook_webgl)
options(rgl.useNULL=TRUE)
```
Then a chunk with the option `webgl = TRUE` can produce an embedded OpenGL
image:
```{r, ref.label = "soi.fit_rgl", webgl = TRUE}
```
A view that includes the points and uses alpha blending to make the
surface translucent:
```{r, webgl = TRUE}
clear3d()
points3d(soil$easting, soil$northing, soil$resistivity / 100,
col = rep("black", nrow(soil)))
surface3d(eastseq, northseq,
soi.fit / 100, col = rep("red", length(soi.fit)),
alpha=0.9, front="fill", back="fill")
```
The
[WebGL vignette](https://cran.r-project.org/web/packages/rgl/vignettes/WebGL.html)
in the `rgl` package shows some more examples.
The embedded graphs show properly for me on most browser/OS
combinations I have tried.
To include a static view of an rgl scene you can use the
`rgl.snapshot` function.
To use the view found interactively with rgl for creating a `lattice`
`wireframe` view, you can use the `rglToLattice` function:
```{r, eval = FALSE}
wireframe(soi.fit ~ soi.grid$easting * soi.grid$northing,
aspect = c(asp, 0.7), shade = TRUE,
xlab = "Easting (km)",
ylab = "Northing (km)", screen = rglToLattice())
```
## Coplots for Surfaces
We can also use the idea of a coplot for examining a surface a few
slices at a time.
Choosing 12 approximately equally spaced slices along `easting`:
```{r}
sf <- soi.grid
sf$fit <- as.numeric(soi.fit)
sube <- eastseq[seq_len(length(eastseq)) %% 7 == 0]
sube <- eastseq[round(seq(1, length(eastseq), length.out = 12))]
ssf <- filter(sf, easting %in% sube)
xyplot(fit ~ northing | easting, data = ssf, type = "l")
```
Using `ggplot`:
```{r}
ggplot(ssf, aes(x = northing, y = fit)) + geom_line() + facet_wrap(~easting)
```
* For examining a surface this way we fix one variable at a specific value.
* For examining data it is also sometimes useful to choose a narrow window.
* A narrow window minimizes the variation within the variable we
are conditioning on.
* Too narrow a window contains to few observations to see a signal.
* Choosing a few narrow windows means we won't show all the data.
Shingles like `equal.count` make this easy in `lattice`, using a
negative `overlap` value:
```{r}
xyplot(resistivity ~ northing | equal.count(easting, 6, overlap = -0.8),
data = soil, cex = 0.2)
```
It is a bit more work in `ggolot`; it can be done by
* computing a set of narrow slices with a cut function
* filtering a subset of the levels:
```{r}
np <- 6
ns <- 5
s <- mutate(soil, slice = cut_number(easting, np * ns))
s <- filter(s, slice %in% levels(slice)[(1 : np) * ns])
ggplot(s, aes(x = northing, y = resistivity)) +
geom_point(size = 0.5) + facet_wrap(~slice)
```
## Conditioning with a Single Plot
It is possible to show conditioning in a single plot using an identity
channel to distinguish the conditions.
Using `lattice`:
```{r}
sf <- soi.grid
sf$fit <- as.numeric(soi.fit)
sube4 <- eastseq[round(seq(1, length(eastseq), length.out = 4))]
ssf4 <- filter(sf, easting %in% sube4)
xyplot(fit ~ northing, group = easting, data = ssf4,
type = "l", auto.key = TRUE)
```
Using `ggplot`:
```{r}
ggplot(mutate(ssf4, easting = factor(easting)),
aes(x = northing, y = fit, color = easting, group = easting)) +
geom_line()
```
* This is most useful when the effect of the conditioning variable is
a level shift.
* The number of different levels that can be used effectively is lower.
* Over-plotting becomes an issue when used with data points.
## Scatterplot Matrices
A _scatterplot matrix_ is a useful overview that shows all pairwise
scatterplots.
There are many options; a few are:
* `pairs` in base graphics;
* `splom` in lattice
* `ggpairs` in `GGally`.
```{r}
pairs(soil[1:3], cex = 0.1)
```
```{r}
splom(soil[1:3], cex = 0.1)
```
```{r}
library(GGally)
ggpairs(soil[1:3], lower = list(continuous = wrap("points", size=0.1)))
```
Some variations:
* diagonal left-top to right-bottom or left-bottom to right-top;
* how to use the panels in the two triangles;
* how to use the panels on the diagonal.
Some things to look for in the panels:
* clusters or separation of groups
* strong relationships
* outliers, rounding, clumping
Notes:
* Cleveland recommends using the full version displaying both triangles of
plots to facilitate _visual linking_.
* If you do use only one triangle, and one variable is a response,
then it is a good idea to arrange for that variable to be shown on
the vertical axis against all other variables.
* The symmetry in the plot with the diagonal running from bottom-left
to top-tight as produced by `splom` is simpler than the symmetry in
the plot with the diagonal running from top-left t bottom-right
produced by `pairs`.