Lecture 7: Tabular Data, Network Data

Today’s Visualization

Assignment

Ooooooops, I forgot to put a submission point up on Blackboard.

Fixed now, redesign assignment extended until next week.

Tabular Data

Taxonomy of tabular visualizations

Munzner classifies visualizations of tabular data by 4 core design choices to be made:

  1. How values are expressed
  2. How marks are separated, ordered, aligned
  3. Which axis orientation is used
  4. Whether the layout is dense or space-filling

Key to Munzner’s analysis is the key/value semantics introduced in Chapter 2. Depending on data types of keys, different choices are needed.

Express: 1 quantitative value

A single quantitative value (for instance activation times for a neural spike train, arrival times of cars or race competitors, …)

Code
library(tidyverse)
library(R.matlab)
theme_set(theme_light())
ganglion = readMat("L7-figures/doi_10.5061_dryad.m37pvmd3q__v2/Costa_et_al_OC_Signature_Dataset.mat")
spikeTimes = ganglion$data[[2]][[1]] %>% as_tibble(.name_repair="unique")
ggplot(spikeTimes) +
  geom_point(aes(x=...4, y="Neuron 4"), shape="|") +
  labs(x="ms", y="")

Express: 2 quantitative values

The scatterplot is a widely used idiom for expressing 2 primary quantitative variables, possibly with additional attributes added in secondary visual channels.

Code
ggplot(diamonds) +
  geom_point(aes(x=carat, y=price)) +
  labs(title="Diamond carat vs. price")
ggplot(diamonds) +
  geom_point(aes(x=carat, y=price)) +
  scale_x_log10() + scale_y_log10() +
  labs(title="Diamond carat vs. price (log-scale)")
ggplot(diamonds) +
  geom_point(aes(x=carat, y=price, color=color, shape=clarity)) +
  scale_shape_manual(values=0:10) +
  scale_x_log10() + scale_y_log10() +
  labs(title="Diamond carat vs. price (log-scale)")

Separate/Sort/Align: Bi-stacked bar charts

Code
import numpy as np
import scipy as sp
import altair as a
from altair import expr, datum
from vega_datasets import data
barley = data.barley()

baseline_sel = a.selection_point(name="baseline_sel", on="click", fields=["variety"], bind="legend")
a.Chart(barley, title=a.Title("Barley yields in 1931", subtitle="Click a variety to realign the display along that variety")).add_params(baseline_sel).transform_filter(
  datum.year == 1931
).transform_calculate(
  signed_yield="datum.variety > baseline_sel.variety ? datum.yield : -datum.yield"
).transform_stack(
  stack="signed_yield",
  groupby=["site"],
  sort=[a.SortField(field="variety", order="ascending")],
  as_=["yield_lo","yield_hi"]
).mark_bar().encode(
  x=a.X("site:N", title="Site"),
  y=a.Y("yield_lo:Q", axis=a.Axis(labelExpr="abs(datum.value)"), title="Yield"),
  y2="yield_hi:Q",
  fill=a.Fill("variety:N", title="Variety")
).properties(width=300, height=300)

Separate / Sort / Align

Sorted bar charts

Code
library(tidyverse)
library(gganimate)

gdp <- read.csv("https://raw.github.com/datasets/gdp/master/data/gdp.csv")
country.codes = gdp$Country.Code %>% unique %>% as_tibble %>% filter(row_number() > 46)
gdp = gdp %>% filter(Country.Code %in% country.codes$value)

colnames(gdp) <- gsub("Country.Name", "country", colnames(gdp))
colnames(gdp) <- gsub("Country.Code", "code", colnames(gdp))
colnames(gdp) <- gsub("Value", "value", colnames(gdp))
colnames(gdp) <- gsub("Year", "year", colnames(gdp))

gap = gdp %>%
  group_by(year) %>%
  mutate(rank = min_rank(-value)*1,
         Value_rel = value/value[rank==1],
         Value_lbl = paste0(" ", format(value/1e9, digits=1, scientific=FALSE, big.mark=" "), " B$")) %>%
  filter(rank <= 25) %>%
  ungroup()

p = ggplot(gap, aes(rank, group=country)) +
  geom_tile(aes(y=value/2, height=value, width=0.9), alpha=0.8, color=NA) +
  geom_text(aes(y=0, label=paste(country, " "), hjust=1, vjust=0.2)) +
  geom_text(aes(y=value, label=Value_lbl), hjust=0) +
  coord_flip(clip = "off", expand=FALSE) +
  scale_y_continuous(labels=scales::comma) +
  scale_x_reverse() +
  labs(title="{closest_state}", x="", y="Total GDP") +
  theme(plot.margin=margin(1,2,1,4,"cm"),
        plot.title=element_text(hjust=0,size=22)) +
  transition_states(year) +
  ease_aes("cubic-in-out")

animate(p, fps=25, duration=30, end_pause=5*25)

Dot-charts, line-charts, bar-charts

Code
nordics = gdp %>% filter(country %in% c("Sweden", "Norway", "Denmark", "Iceland", "Finland")) %>%
  filter(year%%10 == 0)

ggplot(nordics, aes(x=country, y=value, color=year)) +
  geom_point() +
  labs(y="GDP")
ggplot(nordics, aes(x=year, y=value, fill=country)) +
  geom_col(position="dodge") +
  labs(y="GDP")
ggplot(nordics, aes(x=year, y=value, color=country)) +
  geom_line() +
  labs(y="GDP")

Dot-charts, line-charts and bar-charts communicate very similar data types in very similar ways. The crucial distinction is in how the viewer perceives the chart:

Lines imply connectivity between the individual observations. This leads the eye to look for trends, even if there is no inherent connectivity between the corresponding key attributes.

Bars communicate quantity primarily with an area mark. This leads the eye to lend attention proportional to the visual impact of the bar, which is why truncated axes, for instance, make bar charts misleading.

Dot-charts, line-charts, bar-charts

Resulting recommendation:

  • Dot-charts for quantitative vs. nominal, truncated axis not inherently dishonest
  • Line-charts for quantitative vs. ordinal
  • Bar-charts for quantitative vs. nominal, truncated axis inherently dishonest
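
A small worked example may help show why truncation matters for bars in particular. The sketch below uses a hypothetical helper, bar_length_ratio, and made-up values (not taken from the lecture data) to compare the ratio of two values with the ratio of the bar lengths a viewer sees once the axis starts above zero.

```python
def bar_length_ratio(small, large, axis_start=0.0):
    """Ratio of the drawn bar lengths when the value axis starts at axis_start."""
    if axis_start >= small:
        raise ValueError("axis_start must lie below the smaller value")
    return (large - axis_start) / (small - axis_start)


# Made-up illustrative values: two quantities that differ by 10 %.
true_ratio = bar_length_ratio(100.0, 110.0)             # axis starts at 0
truncated_ratio = bar_length_ratio(100.0, 110.0, 90.0)  # axis starts at 90

print(f"zero-based axis: bars differ by a factor of {true_ratio:.2f}")
print(f"axis from 90:    bars differ by a factor of {truncated_ratio:.2f}")
```

A dot only marks a position, so the same truncation moves the dots without changing the size of any mark.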

Perceptually optimal line-charts

Code
library(tidyverse)
library(tsibble)
sunspots_ts = sunspot.year %>% as_tsibble()
ggplot(sunspots_ts, aes(index, value)) +
  geom_line() +
  labs(title="Sunspots, 2:3 (Karsten)") +
  coord_fixed(ratio = 2/3)
ggplot(sunspots_ts, aes(index, value)) +
  geom_line() +
  labs(title="Sunspots, 1:√2 (ASA)") +
  coord_fixed(ratio = 1/sqrt(2))
ggplot(sunspots_ts, aes(index, value)) +
  geom_line() +
  labs(title="Sunspots, 3:4 (ANSI)") +
  coord_fixed(ratio = 3/4)

Cleveland and McGill (1987) study the perceptual accuracy of slope judgements, asking which slopes let a viewer read the shape of a curve most accurately.

Historically, Karsten (1923) suggested an aspect ratio for graphs of 2:3, the American Standards Association (1938) suggested 1:\(\sqrt{2}\), and the American National Standards Institute (1979) suggested 3:4.

Even if we decide to let the data control the choice of aspect ratio, different writers have different suggestions for which angles might be optimal: Von Huhn (1931) “somewhere between 30° and 45°”, Weld (1947) 35° to 45°, Hall (1958) 30° to 60°, Bertin (1967, 1983) 70°.

Perceptually optimal line-charts

Code
library(ggthemes)
ggplot(sunspots_ts, aes(index, value)) +
  geom_line() +
  labs(title="Sunspots, 1:1") +
  coord_fixed(ratio = 1/1)
ggplot(sunspots_ts, aes(index, value)) +
  geom_line() +
  labs(title="Sunspots, median absolute slope") +
  coord_fixed(ratio = bank_slopes(sunspots_ts$index, sunspots_ts$value))

In perceptual accuracy experiments on judging similarity between adjacent slanted lines, Cleveland and McGill established the highest accuracy (within a span of 0° - 60°) to be right near slopes of 45°.

As a result, one recommendation for line plots is to make either the median absolute slope or the average absolute slope of the line segments in the line plot be close to 45°: to bank to 45°.

R has support for automatically computing the resulting aspect ratios in the ggthemes::bank_slopes function, while in Python you may have to compute the aspect ratio by hand or eyeball it.
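
Computing the median-absolute-slope version by hand is short. The following is a minimal sketch, not a port of ggthemes::bank_slopes: the function name bank_to_45 is made up here, flat segments are skipped (an assumption, so that the median cannot be zero), and the sample series is invented for illustration.

```python
from statistics import median


def bank_to_45(x, y):
    """Length of one y unit relative to one x unit such that the median
    absolute slope of the drawn line segments becomes 1, i.e. 45 degrees."""
    points = list(zip(x, y))
    slopes = [
        abs((y1 - y0) / (x1 - x0))
        for (x0, y0), (x1, y1) in zip(points, points[1:])
        if x1 != x0 and y1 != y0  # assumption: skip vertical and flat segments
    ]
    if not slopes:
        raise ValueError("need at least one sloped segment")
    return 1.0 / median(slopes)


# Invented series: segment slopes are 10, 5, 20 and 5, so the median is 7.5.
xs = [0, 1, 2, 3, 4]
ys = [0, 10, 5, 25, 20]
ratio = bank_to_45(xs, ys)
print(f"aspect ratio (y unit / x unit): {ratio:.4f}")
```

The result plays the same role as the ratio argument of coord_fixed. With matplotlib, one option would be to pass it to Axes.set_aspect.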

More on line plots

The need to accurately judge shapes and slopes of line graphs also contributes to recommendations on axis truncation.

Code
library(quantmod)
library(lubridate)
library(ggthemes)
foo = getSymbols("AAPL", src="yahoo")
AAPL.2020 = fortify(AAPL["2020-10::2020-12"])
ggplot(AAPL.2020, aes(Index, AAPL.Open)) +
  geom_line() +
  expand_limits(y=0) +
  labs(y="AAPL Opening Price", title="Q4 2020")
ggplot(AAPL.2020, aes(Index, AAPL.Open)) +
  geom_line() +
  labs(y="AAPL Opening Price", title="Q4 2020")

More important than including a (possibly arbitrary) 0 on the axis is to avoid non-varying ink in the line chart.

Many Attributes: SPLOM and Parallel Coordinates

SPLOM - grid of pairwise scatterplots

Parallel Coordinates - each attribute on an axis of its own, lines connect the attribute values that belong to the same unit.

Code
theme_set(theme_light())
library(GGally)
library(lubridate)
library(ggeasy)
data("freeny")
ggparcoord(freeny,
           columns=2:ncol(freeny),
           order=c(2,4,5,3),
           alphaLines = 0.75) + easy_rotate_x_labels()
ggpairs(freeny,
        columns=2:ncol(freeny))
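
To make the parallel coordinates construction concrete: each unit becomes one polyline with a vertex on every axis. The sketch below assumes min-max rescaling per attribute, which is only one possible choice (ggparcoord has its own scale argument). The function name and the records are made up for illustration.

```python
def parallel_coordinates(rows, columns):
    """Turn records into polylines with one (axis index, position in [0, 1])
    vertex per attribute."""
    lo = {c: min(r[c] for r in rows) for c in columns}
    hi = {c: max(r[c] for r in rows) for c in columns}
    polylines = []
    for r in rows:
        line = []
        for axis, c in enumerate(columns):
            span = hi[c] - lo[c]
            # A constant attribute has no spread; place it mid-axis.
            pos = (r[c] - lo[c]) / span if span else 0.5
            line.append((axis, pos))
        polylines.append(line)
    return polylines


# Made-up records, only for illustration.
units = [
    {"income": 10.0, "price": 4.0, "market": 1.0},
    {"income": 20.0, "price": 2.0, "market": 3.0},
    {"income": 15.0, "price": 3.0, "market": 2.0},
]
lines = parallel_coordinates(units, ["income", "price", "market"])
for line in lines:
    print(line)
```

Crossing lines between two neighbouring axes, as between income and price here, indicate a negative association between those two attributes.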

Radial Layouts

The R help file has the following to say in its description of the pie command for drawing pie-charts:

Pie charts are a very bad way of displaying information. The eye is good at judging linear measures and bad at judging relative areas. A bar chart or dot chart is a preferable way of displaying this type of data.

Cleveland (1985), page 264: “Data that can be shown by pie charts always can be shown by a dot chart. This means that judgements of position along a common scale can be made instead of the less accurate angle judgements.” This statement is based on the empirical investigations of Cleveland and McGill as well as investigations by perceptual psychologists.

Radial Layouts

Normalized stacked bar charts give a more compact representation than pie charts, and they use more accurate visual channels.

Code
library(geofacet)
library(ggeasy)
state.race = read_csv("L7-figures/data (1).csv")
state_race_pct = state.race %>% filter(state %in% state.name) %>%
  select(White=WhiteTotalPerc, Black=BlackTotalPerc, Indian=IndianTotalPerc,
         Asian=AsianTotalPerc, Hawaiian=HawaiianTotalPerc, Other=OtherTotalPerc,
         TwoOrMore=TwoOrMoreTotalPerc, state) %>%
  add_column(state_abbr=state.abb) %>%
  pivot_longer(-c(state, state_abbr))
ggplot(state_race_pct, aes(x=value, y="1", fill=fct(name))) +
  geom_col(position="fill", orientation="y") +
  coord_polar() +
  facet_geo(~state_abbr) +
  labs(fill="Race") +
  theme_void()
ggplot(state_race_pct, aes(x=state_abbr, y=value, fill=fct(name))) +
  geom_col(position="fill") +
  easy_rotate_x_labels() +
  labs(fill="Race", y="Proportion", x="State")
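
The normalization behind position="fill" amounts to dividing by the total and accumulating. A minimal sketch of that arithmetic, with a hypothetical function name and made-up counts:

```python
def fill_stack(values):
    """Return (start, end) proportions per category, stacked to a total of 1."""
    total = sum(values.values())
    if total <= 0:
        raise ValueError("values must have a positive total")
    segments, start = {}, 0.0
    for name, value in values.items():
        end = start + value / total
        segments[name] = (start, end)
        start = end
    return segments


# Made-up counts, only for illustration.
segments = fill_stack({"A": 50, "B": 30, "C": 20})
for name, (start, end) in segments.items():
    print(f"{name}: {start:.2f} to {end:.2f}")
```

Only the first and last segments share a common baseline or endpoint across bars. The middle segments have to be compared by length alone.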

Radial Layouts

Radial layouts excel at showing periodic phenomena.

Florence Nightingale, Public domain, via Wikimedia Commons

Note: There are issues with this graph

Facets

Facets

Facets are arrangements of frames containing graphics, often so that the arrangement itself carries some information about the data or its structure.

We have seen some examples already.

Code
ggpairs(freeny,
        columns=2:ncol(freeny))
ggplot(state_race_pct, aes(x=value, y="1", fill=fct(name))) +
  geom_col(position="fill", orientation="y") +
  coord_polar() +
  facet_geo(~state_abbr) +
  labs(fill="Race") +
  theme_void()

Algebra of Facets

Wilkinson develops an entire algebra of facet specifications. ggplot2 can comfortably handle Wilkinson’s * and + operators, and can be coaxed into handling / with added packages and some hands-on work.

Code
library(tidyverse)
library(ggrepel)
library(latex2exp)
library(patchwork)

facetdf = tribble(
  ~a, ~b, ~c, ~d, ~face,
  "Barb", "Jean", "Young", "Short", "A",
  "Jean", "Jean", "Young", "Short", "B",
  "Barb", "Mark", "Young", "Short", "C",
  "Jean", "Jean", "Old", "Short", "D",
  "Jean", "Jean", "Old", "Tall", "E",
  "Jean", "Jean", "Old", "Tall", "F"
)

ggplot(facetdf, aes(c, d, label=face)) + 
  geom_text_repel(segment.colour=NA) +
  facet_wrap(~a) +
  labs(title=TeX("Single Variable Facet: $a$"))
ggplot(facetdf, aes(c, d, label=face)) + 
  geom_text_repel(segment.colour=NA) +
  facet_grid(b~a) +
  labs(title=TeX("Two Variable Facet: $a\\times b$"))
(ggplot(facetdf %>% filter(b == "Jean"), aes(c, d, label=face)) + 
  geom_text_repel(segment.colour=NA) +
  facet_grid(~a) + labs(title="Jean")) + 
  (ggplot(facetdf %>% filter(b == "Mark"), aes(c, d, label=face)) + 
  geom_text_repel(segment.colour=NA) +
  facet_grid(~a) + labs(title="Mark", y="")) + 
  plot_annotation(title=TeX("Nested Facet: $a/b$")) +
  plot_layout(widths=c(2,1))
ggplot(facetdf, aes(c, d, label=face)) + 
  geom_text_repel(segment.colour=NA) +
  facet_wrap(~a+b) +
  labs(title=TeX("Blended Facet: $a + b$"))
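
One way to get a feel for the three operators is to count the panels each one asks for. The sketch below is a simplified, set-level reading of the algebra and should be taken as an assumption, not as Wilkinson's full definitions: cross takes every combination of levels, nest keeps only the combinations that occur in the data, and blend pools the levels of both variables on one scale. It uses the a and b columns of the facetdf table above.

```python
from itertools import product

# The a and b columns of the facetdf table used in the R example.
rows = [
    ("Barb", "Jean"), ("Jean", "Jean"), ("Barb", "Mark"),
    ("Jean", "Jean"), ("Jean", "Jean"), ("Jean", "Jean"),
]
a_levels = sorted({a for a, _ in rows})
b_levels = sorted({b for _, b in rows})

cross = sorted(product(a_levels, b_levels))    # a * b: every combination of levels
nest = sorted(set(rows))                       # a / b: only combinations that occur
blend = sorted(set(a_levels) | set(b_levels))  # a + b: levels pooled on one scale

print(len(cross), "panels for a * b:", cross)
print(len(nest), "panels for a / b:", nest)
print(len(blend), "panels for a + b:", blend)
```

The crossed layout includes a (Jean, Mark) panel even though no row has that combination, while the nested layout leaves it out.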

Algebra of Facets

Vega and Altair currently handle mainly the * operator.

Code
import altair
from vega_datasets import data
barley = data.barley()

chart_html = altair.Chart(barley).mark_point().encode(
  x="yield:Q",
  y="site:N",
  color="year:N",
  tooltip="yield:Q"
).properties(width=100, height=100).facet("variety:N", columns=5).to_html(output_div="vis1", fullhtml=False)

print(chart_html)

Algebra of Facets

Vega and Altair currently handle mainly the * operator.

Code
chart_html = altair.Chart(barley).mark_bar().encode(
  x="year:O",
  y="yield:Q",
  color="year:N",
  tooltip="yield:Q"
).properties(width=50, height=50).facet(
  column="variety:N", row="site:N"
).to_html(output_div="vis2", fullhtml=False)

print(chart_html)

Network Data

Data Representation

Networks are graphs (possibly trees), and there are several choices for data structures to hold graph data:

Adjacency List
  Vertices are objects, with each vertex containing a list of its adjacent vertices.
  Edges are implicitly encoded in these adjacency lists.
Incidence List
  Vertices are objects, with each vertex containing a list of its incident edges.
  Edges are objects, with each edge containing a list of its incident vertices.
Adjacency Matrix
  2-dimensional matrix with rows representing source vertices, columns representing target vertices, and entries non-zero if an edge is present.
  Can encode weights or multiplicities of edges, but all additional attributes of either edges or vertices have to be stored externally.
Incidence Matrix
  2-dimensional matrix with rows representing vertices and columns representing edges.
  Entries are non-zero if a vertex connects to an edge.

Data Representation

Code
graph {
  layout=dot
  rankdir="LR"
  A -- B;
  A -- D;
  B -- C;
  D -- C;
  C -- E -- F;
}

[Figure: rendered layout of the graph specified above, with vertices A to F]

Code
# Adjacency List
graph = {
  "A": ["B", "D"],      "B": ["A", "C"], 
  "C": ["B", "D", "E"], "D": ["A", "C"], 
  "E": ["C", "F"],      "F": ["E"]
}
Code
# Adjacency Matrix 
# A B C D E F
  0 1 0 1 0 0 # A
  1 0 1 0 0 0 # B
  0 1 0 1 1 0 # C
  1 0 1 0 0 0 # D
  0 0 1 0 0 1 # E
  0 0 0 0 1 0 # F
Code
# Incidence List
graph = {
  "vertices": {
    "A": ["AB", "AD"], "B": ["AB", "BC"],
    "C": ["BC", "CD", "CE"], "D": ["AD", "CD"],
    "E": ["CE", "EF"], "F": ["EF"]
  },
  "edges": {
    "AB": ["A", "B"], "AD": ["A", "D"],
    "BC": ["B", "C"], "CD": ["C", "D"],
    "CE": ["C", "E"], "EF": ["E", "F"]
  }
}
Code
# Incidence Matrix
# AB AD BC CD CE EF
  1  1  0  0  0  0  # A
  1  0  1  0  0  0  # B
  0  0  1  1  1  0  # C
  0  1  0  1  0  0  # D
  0  0  0  0  1  1  # E
  0  0  0  0  0  1  # F
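
The matrix forms can be derived mechanically from the adjacency list. The following is a small sketch in plain Python (the variable names are chosen here for illustration) that builds both matrices for the example graph, treating it as undirected.

```python
# Adjacency list of the example graph.
graph = {
    "A": ["B", "D"], "B": ["A", "C"],
    "C": ["B", "D", "E"], "D": ["A", "C"],
    "E": ["C", "F"], "F": ["E"],
}
vertices = sorted(graph)
index = {v: i for i, v in enumerate(vertices)}

# Each undirected edge once, as a sorted pair such as ("A", "B").
edges = sorted({tuple(sorted((u, v))) for u in graph for v in graph[u]})

# Adjacency matrix: rows and columns are vertices.
adjacency = [[0] * len(vertices) for _ in vertices]
for u, v in edges:
    adjacency[index[u]][index[v]] = 1
    adjacency[index[v]][index[u]] = 1

# Incidence matrix: rows are vertices, columns are edges.
incidence = [[0] * len(edges) for _ in vertices]
for j, (u, v) in enumerate(edges):
    incidence[index[u]][j] = 1
    incidence[index[v]][j] = 1

print("edges:", ["".join(e) for e in edges])
for v in vertices:
    print(v, adjacency[index[v]], incidence[index[v]])
```

A quick sanity check on any incidence matrix of a graph without self-loops: every column contains exactly two non-zero entries, one for each endpoint of the edge.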

Data Representation

For a graph with \(V\) the number of vertices and \(E\) the number of edges:

| Task | Adjacency List | Adjacency Matrix | Incidence Matrix |
|---|---|---|---|
| Store Graph (space) | \(O(V+E)\) | \(O(V^2)\) | \(O(V\cdot E)\) |
| Add Vertex (time) | \(O(1)\) | \(O(V^2)\) | \(O(V\cdot E)\) |
| Add Edge (time) | \(O(1)\) | \(O(1)\) | \(O(V\cdot E)\) |
| Remove Vertex (time) | \(O(E)\) | \(O(V^2)\) | \(O(V\cdot E)\) |
| Remove Edge (time) | \(O(V)\) | \(O(1)\) | \(O(V\cdot E)\) |
| Adjacency query (time) | \(O(V)\) | \(O(1)\) | \(O(E)\) |
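
The adjacency query costs can be read off from how each structure has to be searched. Below is a minimal sketch on a made-up three-vertex path A-B-C. The function names are hypothetical and the graph is assumed to be simple and undirected.

```python
def adjacent_list(adj_list, u, v):
    """Scan u's neighbour list: at most deg(u) comparisons, so O(V)."""
    return v in adj_list[u]


def adjacent_matrix(matrix, index, u, v):
    """One array lookup: O(1)."""
    return matrix[index[u]][index[v]] != 0


def adjacent_incidence(incidence, index, u, v):
    """Scan the edge columns for one that touches both u and v: O(E)."""
    row_u, row_v = incidence[index[u]], incidence[index[v]]
    return u != v and any(a and b for a, b in zip(row_u, row_v))


# Made-up path graph A - B - C in all three representations.
index = {"A": 0, "B": 1, "C": 2}
adj_list = {"A": ["B"], "B": ["A", "C"], "C": ["B"]}
matrix = [[0, 1, 0], [1, 0, 1], [0, 1, 0]]
incidence = [[1, 0], [1, 1], [0, 1]]  # columns: AB, BC

for u, v in [("A", "B"), ("A", "C")]:
    print(u, v,
          adjacent_list(adj_list, u, v),
          adjacent_matrix(matrix, index, u, v),
          adjacent_incidence(incidence, index, u, v))
```

All three answers agree. Only the amount of scanning differs, which is what the last row of the table records.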

Visual Idioms

7 encoding idioms

  1. Vertical Node-Link
  2. Icicle
  3. Radial Node-Link
  4. Concentric Circles
  5. Nested Circles
  6. Treemap
  7. Indented Outline

Important software packages

  • Graphviz - several good layout algorithms, decent file format for specifying graph structures
  • D3.js - excellent and easy to use force-directed placement layouts
  • Gephi - graph visualization workbench
  • networkx - graph computation (and some layout) in Python