Paper and screens are two-dimensional.
We live in a three-dimensional world.
For visualizing three-dimensional data we can take advantage of our visual system’s ability to reconstruct three-dimensional scenes from two-dimensional images using:
perspective rendering, lighting, and shading;
motion with animation and interaction;
stereo viewing methods.
Most of us have no intuition for four and more dimensions.
Some techniques that work in three dimensions can also be used in higher dimensions:
Grouping by encoding additional variables in color or shape channels.
Conditioning by using small multiples for different levels of additional variables.
These techniques may extend to perhaps ten dimensions; the curse of dimensionality is a limiting factor.
The lattice package provides some facilities not easily available in ggplot, so I will use lattice in a few examples.
A scatterplot matrix is a useful overview that shows all pairwise scatterplots.
There are many options for creating scatterplot matrices in R; a few are:
pairs in base graphics;
splom in package lattice;
ggpairs in GGally.
Some examples using the mpg data:
library(dplyr)    # for select()
library(ggplot2)  # for the mpg data
library(lattice)
splom(select(mpg, cty, hwy, displ),
cex = 0.5, pch = 19)
library(GGally)
ggpairs(select(mpg, cty, hwy, displ),
lower = list(continuous =
wrap("points",
size = 1)))
Some variations:
diagonal left-top to right-bottom or left-bottom to right-top;
how to use the panels in the two triangles;
how to use the panels on the diagonal.
Some things to look for in the panels:
clusters or separation of groups;
strong relationships;
outliers, rounding, clumping.
Notes:
Scatterplot matrices were popularized by Cleveland and co-workers at Bell Laboratories in the 1980s.
Cleveland recommends using the full version displaying both triangles of plots to facilitate visual linking.
If you do use only one triangle, and one variable is a response, then it is a good idea to arrange for that variable to be shown on the vertical axis against all other variables.
The symmetry in the plot with the diagonal running from bottom-left to top-right, as produced by splom, is simpler than the symmetry in the plot with the diagonal running from top-left to bottom-right, as produced by pairs and ggpairs.
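The layout choices in these notes can also be tried with base pairs. The following is a minimal sketch, assuming the ethanol data from lattice; the column ordering (response NOx last) and the name eth are chosen here only for illustration.

```r
library(lattice)  # only for the ethanol data

# Put the response last so that, with only the lower triangle shown,
# the bottom row plots NOx (vertical axis) against each other variable.
eth <- ethanol[, c("C", "E", "NOx")]
pairs(eth, upper.panel = NULL, pch = 19, cex = 0.5)

# row1attop = FALSE gives the graph-like layout, with the diagonal
# running from bottom-left to top-right as in splom.
pairs(eth, row1attop = FALSE, pch = 19, cex = 0.5)
```

Comparing the two displays side by side is a quick way to judge which symmetry is easier to read for a given data set.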
Three useful data sets to explore:
The ethanol data frame in the lattice package.
Soil resistivity data from Cleveland’s Visualizing Data book.
The quakes data frame in the datasets package.
The ethanol data frame in the lattice package contains data from an experiment on efficiency and emissions in small one-cylinder engines.
The data frame contains 88 observations on three variables:
NOx: Concentration of nitrogen oxides (NO and NO2) in micrograms/J.
C: Compression ratio of the engine.
E: Equivalence ratio, a measure of the richness of the air and ethanol fuel mixture.
A scatterplot matrix:
library(lattice)
splom(ethanol, cex = 0.5, pch = 19)
A goal is to understand the relationship between the pollutant NOx and the controllable variables E and C.
Data from Cleveland’s Visualizing Data book contains measurements of soil resistivity of an agricultural field along a roughly rectangular grid.
A scatterplot matrix of the resistivity, northing and easting variables:
if (! file.exists("soil.dat"))
download.file("http://www.stat.uiowa.edu/~luke/data/soil.dat",
"soil.dat")
soil <- read.table("soil.dat")
splom(soil[1 : 3], cex = 0.1, pch = 19)
The data is quite noisy but there is some structure.
A goal is to understand how resistivity varies across the field.
The quakes data frame contains data on locations of seismic events of magnitude 4.0 or larger in a region near Fiji.
The time frame is from 1964 to perhaps 2000.
More recent data is available from a number of sources on the web.
A scatterplot matrix:
library(lattice)
splom(quakes, cex = 0.1, pch = 19)
Quake locations:
md <- map_data("world2", c("Fiji", "Tonga", "New Zealand"))
ggplot(quakes, aes(long, lat)) +
geom_polygon(aes(group = group), data = md, color = "black", fill = NA) +
geom_point(size = 0.5, color = "red") +
coord_map() +
ggthemes::theme_map()
Some goals:
understand the three-dimensional location of the quakes;
see if there is any association between location and magnitude.
For the ethanol data there are only a small number of distinct levels for C.
This suggests considering a plot mapping the level to color.
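A quick check of the distinct levels can confirm this before choosing a color mapping. This is a small sketch using only base R and the ethanol data from lattice; the name lev is just for illustration.

```r
library(lattice)  # for the ethanol data

# Distinct compression ratios and how many runs were made at each
lev <- sort(unique(ethanol$C))
lev
table(ethanol$C)
```

With so few levels, treating C as a factor for a qualitative color scale is a reasonable choice.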
ggplot(ethanol,
aes(E, NOx,
color = C)) +
geom_point(size = 2)
A qualitative scheme can help distinguish the levels.
ggplot(ethanol,
aes(E, NOx,
color = factor(C))) +
geom_point(size = 2)
Adding smooths further helps the visual grouping:
ggplot(ethanol,
aes(E, NOx,
color = factor(C))) +
geom_point(size = 2) +
geom_smooth(se = FALSE)
Some observations:
At each level of C there is a strong non-linear relation between NOx and E.
At levels of E above 1 the value of C has little effect.
For lower levels of E the NOx level increases with C.
For the quakes data, breaking the depth values into thirds gives some insights:
quakes2 <-
mutate(quakes,
depth_cut = cut_number(depth, 3))
ggplot(quakes2, aes(x = long,
y = lat,
color = depth_cut)) +
geom_point() +
theme_bw() +
coord_map()
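To see what breaking into thirds amounts to, here is a rough base R equivalent of the cut_number step. It is a sketch under the assumption that cutting at the 1/3 and 2/3 sample quantiles is what is wanted; interval labels may differ slightly from those produced by cut_number, and the names breaks and depth_third are illustrative.

```r
# Cut depth at the 0, 1/3, 2/3 and 1 sample quantiles
breaks <- quantile(quakes$depth, probs = c(0, 1 / 3, 2 / 3, 1))
depth_third <- cut(quakes$depth, breaks = breaks, include.lowest = TRUE)

# Roughly equal group sizes; ties in depth can make them slightly unequal
table(depth_third)
```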
One way to try to get a handle on higher dimensional data is to try to fix values of some variables and visualize the values of others in 2D.
This can be done with
interactive tools;
small multiples with lattice/trellis displays or faceting.
A conditioning plot, or coplot:
Shows a collection of plots of two variables for different settings of one or more additional variables, the conditioning variables.
For ordered conditioning variables the plots are arranged in a way that reflects the order.
When a conditioning variable is numeric, or ordered categorical with many levels, the values of the conditioning variable are grouped into bins.
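Base graphics provides a coplot function that follows this description directly. The sketch below uses the quakes data and slightly overlapping depth intervals computed by co.intervals; note that these overlapping intervals differ from the non-overlapping bins produced by cut_width or cut_number.

```r
# Four slightly overlapping depth intervals with similar numbers of points
given.depth <- co.intervals(quakes$depth, number = 4, overlap = 0.1)
given.depth

# Latitude against longitude, one panel per depth interval
coplot(lat ~ long | depth, data = quakes,
       given.values = given.depth, rows = 1)
```

The strip at the top of the display shows the range of depth covered by each panel.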
For the soil resistivity data, a coplot of resistivity against easting, conditioning on northing with bins of size 0.5:
p1 <- ggplot(soil,
aes(easting, resistivity)) +
geom_point(size = 0.5) +
facet_wrap(~ cut_width(northing,
width = 0.5,
center = 0))
p1
Adding a smooth is often helpful.
With a large amount of data the smooth is hard to see.
Some options:
Omit the data and only show the smooth.
Show the data in a less intense color, such as light gray.
Use a contrasting color for the smooth curves.
Show the data using alpha blending.
This uses a muted representation of the data:
p2 <- ggplot(soil,
aes(easting, resistivity)) +
geom_point(size = 0.5,
color = "lightgrey") +
facet_wrap(~ cut_width(northing,
width = 0.5,
center = 0)) +
geom_smooth()
p2
The conditioning bins are quite wide.
Using rounding and keeping only points within 0.05 of the rounded values reduces the variability:
soil_trm <-
mutate(soil,
nrnd = round(northing * 2) / 2) %>%
filter(abs(northing - nrnd) < 0.05)
p1 %+% soil_trm +
facet_wrap(~ cut_width(northing,
width = 0.1,
center = 0))
p2 %+% soil_trm +
facet_wrap(~ cut_width(northing,
width = 0.1,
center = 0))
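The rounding and trimming step can be checked on a few made-up values. The northing values below are hypothetical and chosen only to show which points are kept.

```r
# Hypothetical northing values, not taken from the soil data
northing <- c(0.48, 0.52, 0.61, 0.97, 1.26, 1.49)

# Round to the nearest multiple of 0.5
nrnd <- round(northing * 2) / 2

# Keep only points within 0.05 of the rounded value
keep <- abs(northing - nrnd) < 0.05
data.frame(northing, nrnd, keep)
```

Only the values close to 0.5, 1.0 and 1.5 survive, which is why the trimmed panels show narrower transects than the full bins.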
For the quakes data a plot of latitude against longitude conditioned on three depth levels:
qthm <- theme(panel.border = element_rect(color = "grey30", fill = NA))
ggplot(quakes2, aes(x = long, y = lat)) +
geom_point(color = scales::muted("blue"),
size = 0.5) +
facet_wrap(~ depth_cut,
nrow = 1) +
coord_map() +
qthm
The relative positions of the depth groups are much harder to see than in the earlier plot that maps the depth group to color.
Adding the full data for background context, and using a more intense color for the panel subset, helps a lot:
## quakes does not contain the depth_cut
## variable used in the facet
ggplot(quakes2, aes(x = long, y = lat)) +
geom_point(data = quakes,
color = "gray", size = 0.5) +
geom_point(color = "blue", size = 0.5) +
facet_wrap(~ depth_cut, nrow = 1) +
coord_map() +
qthm
Switching latitude and depth shows another aspect:
quakes3 <-
mutate(quakes,
lat_cut = cut_width(lat,
width = 5,
boundary = 0))
ggplot(quakes3, aes(x = long, y = depth)) +
geom_point(data = quakes,
color = "gray", size = 0.5) +
geom_point(color = "blue", size = 0.5) +
scale_y_reverse() +
facet_wrap(~ lat_cut) +
qthm
Coplot for the ethanol data:
ggplot(ethanol, aes(x = E, y = NOx)) +
geom_point() +
facet_wrap(~ C)
Adding muted full data for context:
ggplot(ethanol, aes(x = E, y = NOx)) +
geom_point(color = "grey",
data = mutate(ethanol, C = NULL)) +
geom_point() +
facet_wrap(~ C)
For the soil resistivity data, a number of methods can be used to estimate a smooth signal surface as a function of the two location variables.
One option is the loess local polynomial smoother; another is gam from package mgcv.
The estimated surface level can be computed on a grid of points using the predict method of the fit.
These estimated surfaces can be visualized using contour plots or level plots.
m <- loess(resistivity ~ easting * northing, span = 0.25,
degree = 2, data = soil)
eastseq <- seq(.15, 1.410, by = .015)
northseq <- seq(.150, 3.645, by = .015)
soi.grid <- expand.grid(easting = eastseq, northing = northseq)
soi.fit <- predict(m, soi.grid)
soi.fit.df <- mutate(soi.grid, fit = as.numeric(soi.fit))
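The same fit-then-predict pattern can be tried without the soil file. The sketch below uses simulated data (the data frame sim and its surface are invented for illustration) and fits both a loess surface and a gam surface from mgcv, then evaluates each on a grid.

```r
library(mgcv)  # recommended package; provides gam()

# Simulated noisy surface; not the soil data
set.seed(1)
sim <- data.frame(x = runif(500), y = runif(500))
sim$z <- sin(2 * pi * sim$x) + sim$y + rnorm(500, sd = 0.2)

# Evaluation grid inside the range of the simulated points
grid <- expand.grid(x = seq(0.1, 0.9, length.out = 9),
                    y = seq(0.1, 0.9, length.out = 9))

# Local quadratic fit; predictions on an expand.grid result come back
# as a matrix with one row per x value and one column per y value
lo_fit <- loess(z ~ x * y, data = sim, span = 0.25, degree = 2)
lo_hat <- predict(lo_fit, grid)

# Smooth of both coordinates; predictions come back as a vector
gam_fit <- gam(z ~ s(x, y), data = sim)
gam_hat <- as.numeric(predict(gam_fit, newdata = grid))
```

The matrix form is convenient for base functions such as filled.contour, while the vector form fits naturally into a data frame for ggplot.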
Contour plots compute contours, or level curves, as polygons at a set of levels.
Contour plots draw the level curves, often with a level annotation.
Contour plots can also have their polygons filled in with colors representing the levels.
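The computed level curves can be inspected directly with the base function contourLines. This sketch uses the volcano matrix from the datasets package; the chosen levels are arbitrary.

```r
# Level curves of the volcano surface at a few heights
cl <- contourLines(seq_len(nrow(volcano)), seq_len(ncol(volcano)),
                   volcano, levels = seq(100, 180, by = 20))

# Each element is one curve: its level and the x, y coordinates along it
length(cl)
str(cl[[1]])

# Draw the curves by hand on an empty plot
plot(NA, xlim = c(1, nrow(volcano)), ylim = c(1, ncol(volcano)),
     xlab = "x", ylab = "y", asp = 1)
for (curve in cl) lines(curve$x, curve$y)
```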
A basic contour plot of the fitted soil resistivity surface in ggplot using geom_contour:
p <- ggplot(soi.fit.df,
aes(x = easting,
y = northing,
z = fit)) +
coord_fixed()
p + geom_contour()
Neither lattice nor ggplot seems to make it easy to fill in the contours.
The base function filled.contour is available for this:
cm.rev <- function(...) rev(cm.colors(...))
filled.contour(eastseq, northseq, soi.fit,
asp = 1,
color.palette = cm.rev)
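Newer ggplot2 releases (version 3.3.0 and later) include geom_contour_filled, which may be worth trying as an alternative to filled.contour. The sketch below assumes such a version is installed and uses the volcano data so that it runs without the soil file; the names vd_fill and p_filled are illustrative.

```r
library(ggplot2)

# Long-format version of the volcano matrix
vd_fill <- expand.grid(x = seq_len(nrow(volcano)),
                       y = seq_len(ncol(volcano)))
vd_fill$z <- as.numeric(volcano)

# Filled bands between contour levels; bins sets the number of bands
p_filled <- ggplot(vd_fill, aes(x, y, z = z)) +
    geom_contour_filled(bins = 10) +
    coord_fixed()
p_filled
```

The default fill scale is discrete, with one color per band, so a continuous palette such as cm.colors would need to be supplied through a manual or stepped scale.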
A level plot colors a grid spanned by two variables by the color of a third variable.
Level plots are also called image plots.
The term heat map is also used, in particular with a specific color scheme.
But heat map often means a more complex visualization with an image plot at its core.
ggplot provides geom_tile that can be used for a level plot:
p + geom_tile(aes(fill = fit)) +
scale_fill_gradientn(
colors = rev(cm.colors(100)))
Superimposing contours on a level plot is often helpful.
p + geom_tile(aes(fill = fit)) +
geom_contour() +
scale_fill_gradientn(
colors = rev(cm.colors(100)))
Level plots do not require computing contours, but are not as smooth as filled contour plots.
Visually, image plots and filled contour plots are very similar for fine grids, but image plots are less smooth for coarse ones.
Lack of smoothness is less of an issue when the data values themselves are noisy.
The grid for the volcano data set is coarser and illustrates the lack of smoothness.
vd <- expand.grid(x = seq_len(nrow(volcano)), y = seq_len(ncol(volcano)))
vd$z <- as.numeric(volcano)
ggplot(vd, aes(x, y, fill = z)) +
geom_tile() +
scale_fill_gradientn(colors = rev(cm.colors(100))) +
coord_equal()
A filled.contour plot looks like this:
filled.contour(seq_len(nrow(volcano)),
seq_len(ncol(volcano)),
volcano,
nlevels = 10,
color.palette = cm.rev,
asp = 1)
A coarse grid can be interpolated to a finer grid.
Irregularly spaced data can also be interpolated to a grid.
The interp function in the akima or interp packages is useful for this kind of interpolation.
(interp has a more permissive license.)
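For the simpler case of refining a regular grid, base R is enough. The sketch below is not the method used by the akima or interp packages; it is plain bilinear interpolation built from two passes of approx, and the helper refine is hypothetical.

```r
# Bilinear refinement of a matrix on a regular grid (hypothetical helper)
refine <- function(z, factor = 4) {
    nx <- nrow(z)
    ny <- ncol(z)
    xo <- seq(1, nx, length.out = (nx - 1) * factor + 1)
    yo <- seq(1, ny, length.out = (ny - 1) * factor + 1)
    # Linear interpolation along x within each original column
    zx <- apply(z, 2, function(col) approx(seq_len(nx), col, xout = xo)$y)
    # Then along y within each row of the result
    zxy <- t(apply(zx, 1, function(row) approx(seq_len(ny), row, xout = yo)$y))
    list(x = xo, y = yo, z = zxy)
}

fine <- refine(volcano)
dim(fine$z)
image(fine$x, fine$y, fine$z, col = rev(cm.colors(100)), asp = 1)
```

The refined image looks smoother, but no new information is added; the interpolated values are determined entirely by the original grid.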
There are several options for viewing surfaces or collections of points as three-dimensional objects:
Fixed views of rotated projections.
Animated or interactive views showing a sequence of rotated projections.
The lattice function cloud() shows a projection of a rotated point cloud in three dimensions.
For the soil resistivity data:
cloud(resistivity ~ easting * northing,
pch = 19, cex = 0.1, data = soil)
For the quakes data:
cloud(-depth ~ long * lat,
data = quakes)
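A single fixed view can hide structure, so it can help to compare a couple of rotations. This sketch uses the screen argument of cloud; the particular angles are arbitrary choices for illustration.

```r
library(lattice)

# Two viewing directions, given as rotations about the z and x axes
views <- list(list(z = 20, x = -70), list(z = 110, x = -70))
plots <- lapply(views, function(s)
    cloud(-depth ~ long * lat, data = quakes,
          pch = 19, cex = 0.1, screen = s))

# Show the two views side by side
print(plots[[1]], split = c(1, 1, 2, 1), more = TRUE)
print(plots[[2]], split = c(2, 1, 2, 1))
```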
A surface can also be visualized using a wire frame plot showing a 3D view of the surface from a particular viewpoint.
A simple wire frame plot is often sufficient.
Lighting and shading can be used to enhance the 3D effect.
A basic wire frame plot for the volcano data:
wireframe(z ~ x * y,
data = vd,
aspect = c(61 / 87, 0.3))
Wire frame is a bit of a misnomer since surface panels in front occlude lines behind them.
For a fine grid, as in the soil surface, the lines are too dense.
The use of shading for the surfaces can help.
wireframe(z ~ x * y,
data = vd,
aspect = c(61 / 87, 0.3),
shade = TRUE)
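Base graphics offers a comparable shaded surface through persp. The following is a sketch with arbitrary viewing and lighting angles, dropping the mesh lines (border = NA) so that only the shading conveys the shape.

```r
# Shaded perspective view of the volcano surface
vt <- persp(seq_len(nrow(volcano)), seq_len(ncol(volcano)), volcano,
            theta = 135, phi = 30, expand = 0.3,
            shade = 0.6, ltheta = -120,
            border = NA, col = "lightgreen",
            xlab = "x", ylab = "y", zlab = "z")
```

persp invisibly returns the viewing transformation matrix, saved here as vt, which can be passed to trans3d to add points or lines to the view.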
A wire frame plot with shading for the fit soil resistivity surface:
asp <- with(soi.grid,
diff(range(northing)) /
diff(range(easting)))
wf <- wireframe(
soi.fit ~
soi.grid$easting * soi.grid$northing,
aspect = asp, shade = TRUE,
screen = list(z = -50, x = -30),
xlab = "Easting (km)",
ylab = "Northing (km)")
wf
Both ways of looking at a surface are useful:
lv <-
levelplot(soi.fit ~
soi.grid$easting *
soi.grid$northing,
cuts = 9,
aspect = asp,
contour = TRUE,
xlab = "Easting (km)",
ylab = "Northing (km)")
print(lv, split = c(1, 1, 2, 1),
more = TRUE)
print(wf, split = c(2, 1, 2, 1))
The level plot/contour representation is useful for recognizing locations of key features.
The wire frame view helps build a mental model of the 3D structure.
Being able to interactively adjust the viewing position for a wire frame model greatly enhances our ability to understand the 3D structure.
OpenGL is a standardized framework for high performance graphics.
The rgl package provides an R interface to some of OpenGL’s capabilities.
WebGL is a JavaScript framework for using OpenGL within a browser window.
Most desktop browsers support WebGL; some mobile browsers do as well.
In some cases support may be available but not enabled by default. You may be able to get help at https://get.webgl.org/.
knitr and rgl provide support for embedding OpenGL images in web pages.
It is also possible to embed OpenGL images in PDF files, but not all PDF viewers support this.
Start by creating the fit surface data frame.
library(dplyr)
soil <- read.table("http://www.stat.uiowa.edu/~luke/data/soil.dat")
m <- loess(resistivity ~ easting * northing, span = 0.25,
degree = 2, data = soil)
eastseq <- seq(.15, 1.410, by = .015)
northseq <- seq(.150, 3.645, by = .015)
soi.grid <- expand.grid(easting = eastseq, northing = northseq)
soi.fit <- predict(m, soi.grid)
soi.fit.df <- mutate(soi.grid, fit = as.numeric(soi.fit))
This code run in R will open a new window containing an interactive 3D scene (but this may not work on FastX and is not available on the RStudio server):
library(rgl)
bg3d(color = "white")
clear3d()
par3d(mouseMode = "trackball")
surface3d(eastseq, northseq,
soi.fit / 100,
color = rep("red", length(soi.fit)))
This will work in the RStudio notebook server:
options(rgl.useNULL = TRUE)
library(rgl)
bg3d(color = "white")
clear3d()
par3d(mouseMode = "trackball")
surface3d(eastseq, northseq,
soi.fit / 100,
color = rep("red", length(soi.fit)))
rglwidget()
To embed an image in an HTML document, first set the webgl hook with a code chunk like this:
knitr::knit_hooks$set(webgl = rgl::hook_webgl)
options(rgl.useNULL = TRUE)
Then a chunk with the option webgl = TRUE can produce an embedded OpenGL image:
library(rgl)
bg3d(color = "white")
clear3d()
par3d(mouseMode = "trackball")
surface3d(eastseq, northseq,
soi.fit / 100,
color = rep("red", length(soi.fit)))