Lecture 1: Data Visualization

Today’s Visualization

Lighthouses

Full animated world map at https://geodienst.github.io/lighthousemap/

Color true to real lighthouse
Timing of blinks true to real lighthouse
Size of dot corresponds to visibility range
Data drawn from OpenStreetMap – completion and correction through crowdsourcing

Related, and also excellent: https://twitter.com/i/status/1462095711508516865

Syllabus and Structure of the Course

After this course, you will…

Be able to describe the key design guidelines and techniques used for the visual display of information.
Understand how to best use the capabilities of visual perception in a graphic display.
Understand the principles of interactive visualizations.
Understand how Machine Learning techniques can determine data structure and pattern.
Explore and critically evaluate a wide range of visualization techniques and applications.

We will primarily work with:

Tamara Munzner: Visualization Analysis & Design (ISBN 978-1466-50891-0)
Leland Wilkinson: The Grammar of Graphics, 2nd ed (ISBN 978-0387-24544-7) Wilkinson can be acquired free through the library’s ebook access to SpringerLink. That way you can also get a cheap print-on-demand copy.

Syllabus and Structure of the Course

We meet in-person (or over Zoom if we are mandated back online) Wednesdays 14.00 - 16.00.

In addition, all course information can be found on Blackboard.

I can be reached on mvejdemojohansson@gc.cuny.edu, and will happily schedule meetings if you need them.

Between meetings, you will read assigned chapters from the textbook, and work on smaller assignments and semester projects.

Syllabus and Structure of the Course

Semester-long assignments

Data Visualization Project – you take some dataset and develop a data visualization of that dataset with communicative intent.
Blog Post – you take a recent paper from the IEEE Vis conference, and summarize it for a popular audience.

Weekly assignments

Read assigned textbook chapters
Find a recent published data visualization that embodies some concept in the assigned readings. Prepare to present that graphic to the class, and to demonstrate to the class what it does particularly well, and what could be improved.

Occasional assignments

Recreate a published visualization using some data visualization tool.

Syllabus and Structure of the Course

Semester Schedule (may be changed as we go)

Date	Lecture Content	Preparation
2023-01-25	Defining Data Visualization	Munzner ch. 1, Wilkinson ch. 1-2
2022-02-01	Data Abstraction and Representation	Munzner ch. 2, Wilkinson ch. 3-4. Homework: improve the graph from the lecture slides.
2022-02-08	Task Abstraction	Munzner ch. 3, Wilkinson ch. 5
2022-02-15	Analysis and Validation	Munzner ch. 4, Wilkinson ch. 6-7. Watch: https://www.youtube.com/watch?v=Z8t4k0Q8e8Y
2022-02-22	Geometric Representation	Munzner ch. 5, Wilkinson ch. 8-9. Homework: Reproduce Minard’s March on Moscow
2022-03-01	Aesthetic Mappings; Rules of Thumb	Munzner ch. 6, Wilkinson ch. 10
2022-03-8	Tabular Data, Network Data	Munzner ch. 7, 9, Wilkinson ch. 11-12
2022-03-15	Structure of a graphing library (Hannah Aizenman guest lecture)
2022-03-22	Spatial Data, Geography, Maps	Munzner ch. 8, Wilkinson ch. 13
2022-03-29	Color	Munzner ch. 10
2022-04-19	Interactivity	Munzner ch. 11-12
2022-04-26	Summaries; Time, Time-series	Munzner ch. 13-14, Wilkinson Ch. 14
2022-05-03	Spaces, Graph Layout, Manifold Learning / Dimension Reduction	ISOMAP, MDS, UMAP
2022-05-10	Presentations

What this course will not do

…teach one specific data visualization platform.
However, you should take this time to learn at least one platform well. I’m happy to help you in the process, but you pick a platform and work through tutorials to get going yourself.
…get you ready to submit a paper to a major data vis conference yourself.
However, by the time you finish the course, you should know a direction to go if you have this ambition.
…deliver content to a passive student audience.
Your participation is essential. Prepare for each lecture, collect questions and thoughts, and discuss eagerly in class.

What is data visualization anyway?!?

Defining Data Visualization

The visual representation and presentation of data to facilitate understanding.

Andy Kirk (Data Visualization)

Visual representation of datasets designed to help people carry out tasks more effectively.

Tamara Munzner

Task-oriented / understanding
Data-oriented
Assistive
Human visual system as co-processor

Why humans?

Visual representation of datasets designed to help people carry out tasks more effectively.

Tamara Munzner

Why humans?

Visual representation of datasets designed to help people carry out tasks more effectively.

Tamara Munzner

We don’t need data vis when tasks can be fully automated.
We might not know what questions we have in advance.

Why external representation?

Visual representation of datasets designed to help people carry out tasks more effectively.

Tamara Munzner

Replaces cognition with perception.
We don’t need to know how the brain does pattern recognition in order to use it.

Why visual?

Visual representation of datasets designed to help people carry out tasks more effectively.

Tamara Munzner

Vision is high-bandwidth interface with our brain.
Overview is possible due to background processing – we experience seeing everything simultaneously, and the visual system processes in parallel and pre-attentively.
Sound: lower bandwidth, different semantics, no overview, subjective experience of sequentiality.
Touch / haptics: low bandwidth, low record/replay capacity
Taste, smell: no viable record/replay devices

Why all the data?

Visual representation of datasets designed to help people carry out tasks more effectively.

Tamara Munzner

Summaries inherently lose information

4 datasets, identical statistics

Property	Value
Mean of x	9
Sample variance of x: $s^2_x$	11
Mean of y	7.50
Sample variance of y: $s^2_y$	4.125
Correlation between x and y	0.816
Linear regression line	$y = 3.00 + 0.500x$
Coefficient of determination of the linear regression: $R^{2}$	0.67

Each value exact up to at least 2 decimal places.

Why all the data?

Visual representation of datasets designed to help people carry out tasks more effectively.

Tamara Munzner

Summaries inherently lose information
12 datasets, identical statistics

Design: balancing constraints

Design in a context

All design work has a context, which provides limitations

Computational limits
- computation time, system memory
Display limits
- pixels
- information density
- “ink ratio”
Human limits
- human time, memory, attention, capacities of vision, understanding

Analytic Scaffolds

Three different sets of questions and considerations to guide your design work.

Tamara Munzner

3 design questions:

What?
Why?
How?

Andy Kirk

3 phases of understanding:

perceiving
interpreting
comprehending

4 stage design process:

formulating your brief
working with data
establishing your editorial thinking
developing the design solution

3 design principles:

trustworthy
accessible
elegant

Edward Tufte

6 design principles
coined terminology: ink ratios, chart junk

Tufte: Fundamental Principles of Analytical Design

Show comparisons, contrasts, differences.
Show causality, mechanism, explanation, systematic structure.
Show multivariate data; that is, show more than 1 or 2 variables.
Completely integrate words, numbers, images, diagrams.
Thoroughly describe the evidence. Provide a detailed title, indicate the authors and sponsors, document the data sources, show complete measurement scales, point out relevant issues.
Analytical presentations ultimately stand or fall depending on the quality, relevance, and integrity of their content.

Edward Tufte, Beautiful Evidence, pp 122 - 139

Tufte: Data ink and chart junk

Early Tufte design guidance:

Tip

Measure and maximize the data-ink ratio. Data-ink is the non-erasable core of a graphic, and the ratio to maximize is (data-ink / total ink)

Tip

Eschew and eliminate chart junk – graphical decorations, textures, patterns, all of which just increase total ink without increasing data ink.

Kirk: Phases of Understanding

Visualizer control

Viewer control

Perceiving

What do I see?

What data is shown?
How is it represented?
What features are observable?

Interpreting

What does it mean, given the subject?

What features are…

interesting?
unexpected?
important?

Comprehending

What does it mean, to me?

What have I learnt?
What do I feel?
What do I do now?

Kirk: Design Process

Four fundamental steps

Formulating your brief
planning, defining and initiating your project
Working with data
gathering, handling and preparing your data
Establishing your editorial thinking
defining what you will show your audience
Developing your design solution
making design choices about how you represent and present what it is you want to show your audience

Kirk: Design principles

Based on Dieter Rams’ 10 principles of good design:

Good design is innovative
Good design makes a product useful
Good design is aesthetic
Good design makes a product understandable
Good design is unobtrusive

Good design is honest
Good design is long-lasting
Good design is thorough down to the last detail
Good design is environmentally friendly
Good design is as little design as possible

Kirk: Design principles

Principle 1. Good visualization design is trustworthy.

Good design is innovative
Good design makes a product useful
Good design is aesthetic
Good design makes a product understandable
Good design is unobtrusive

Good design is honest
Good design is long-lasting
Good design is thorough down to the last detail
Good design is environmentally friendly
Good design is as little design as possible

Kirk: Design principles

Principle 2. Good visualization design is accessible.

Good design is innovative
Good design makes a product useful
Good design is aesthetic
Good design makes a product understandable
Good design is unobtrusive

Good design is honest
Good design is long-lasting
Good design is thorough down to the last detail
Good design is environmentally friendly
Good design is as little design as possible

Kirk: Design principles

Principle 3. Good visualization design is elegant.

Good design is innovative
Good design makes a product useful
Good design is aesthetic
Good design makes a product understandable
Good design is unobtrusive

Good design is honest
Good design is long-lasting
Good design is thorough down to the last detail
Good design is environmentally friendly
Good design is as little design as possible

Munzner: Design questions and levels

Each level (domain/abstraction/idiom/algorithm) contained in the previous.

domain situation
- who are the target users?
abstraction
- translate from specifics of domain to vocabulary of visualization
- what is shows? data abstraction
  - don’t just draw what you’re given: transform to new form
- why is the user looking at it? task abstraction
idiom
- how is it shown?
  - visual encoding idiom: how to draw
  - interaction idiom: how to manipulate
algorithm
- efficient computation

Munzner: Design questions and levels

Different levels have different failure modes

Domain:
You misunderstand their needs
- Abstraction:
  You’re showing them the wrong thing
  - Visual encoding / Interaction idiom
    The way you show it doesn’t work
    - Algorithm
      Your code is too slow

Munzner: Design questions and levels

Solution: use methods from different fields at each level.

Domain:
Observe target users using existing tools. (anthropology/ethnography)
- Abstraction:
  - Visual encoding / Interaction idiom
    Justify design with respect to alternatives. (design)
    - Algorithm
      Measure system time/memory.
      Analyze computational complexity. (computer science)
  - Analyze results qualitatively.
    Measure human time with lab experiment (lab study). (cognitive psychology)
- Observe target users after deployment (field study). (anthropology/ethnography)
Measure adoption.

Munzner: Design questions and levels

Design questions impose a structure on an otherwise vast design space.

What?

Dataset Types

Tables (tidy data)

Attributes (columns)
Items (rows)

Networks

Items (nodes)
Links

Fields

Value cells distributed on a shape

Data cubes / tensors

Value cells distributed in a hypercube

Trees

Relationships

Geometry (spatial)

Positions

Attributes

Attribute Type

Categorical
Ordinal
Quantitative (interval)

Ordering Direction

Sequential
Diverging
Cyclic

Data Availability

Static

Dynamic

Munzner: Design questions and levels

Design questions impose a structure on an otherwise vast design space.

Why?

Actions

Analyze

Consume
- Discover
- Present
- Enjoy
Produce
- Annotate
- Record
- Derive

Query

Identify
Compare
Summarize

Target known/unknown
Location known/unknown

Targets

All Data

Trends
Outliers
Features

Attributes

One
- Distribution
- Extremes
Many
- Dependency
- Correlation
- Similarity

Network Data

Topology
Paths

Spatial Data

Shape

Munzner: Design questions and levels

Design questions impose a structure on an otherwise vast design space.

How?

Encode

Arrange

Map from categorical and ordered attributes

Color
- Hue
- Saturation
- Lightness
Size, Angle, Curvature
Shape
Motion
- Direction
- Rate
- Frequency

Manipulate

Change

Select

Navigate

Juxtapose

Partition

Superimpose

Reduce

Filter

Aggregate

Embed

No One Good Answer

Data Visualization is an aesthetic field

You will see disagreements on what is and is not a good design.

And on what is and is not a good design principle.

Data Visualization is a communication field

Many applications of data visualization communicate a message, either intentionally or unintentionally.

Notice how Kirk emphasize the communication, Munzner acknowledges it, and Tufte all but ignores that aspect.

Is this a good graphic?

Tufte: Look at all that chart junk! So much decorations that do not directly encode data!

Kirk: cites Jen Christiansen, Graphics Editor at Scientific American. “I found that when I developed magazine graphics according to [Tufte’s] philosophy, they were most often met with a yawn. The reality is that Scientific American isn’t required reading. We need to engage readers, as well as inform them.”

Decorations provide context for the information – it is immediately apparent what the data is about (something something razors) without impacting the trustworthiness of the data display itself.

Is this a good graphic?

Very popular target as an example of a bad graph. The inverted y-axis is very often invoked as a condemning feature.

Is this a good graphic?

Very popular target as an example of a bad graph. The inverted y-axis is very often invoked as a condemning feature.

Kirk points out that it was designed to emulate another chart published earlier: “Iraq’s bloody toll”.

The red coloring and the inverted y-axis in combination are attempting to evoke a metaphor of blood dribbling down a wall.

Is this a good graphic?

Very popular target as an example of a bad graph. The inverted y-axis is very often invoked as a condemning feature.

Kirk points out that it was designed to emulate another chart published earlier: “Iraq’s bloody toll”.

The red coloring and the inverted y-axis in combination are attempting to evoke a metaphor of blood dribbling down a wall.

Question: Was the intended metaphor successful? In “Gun deaths in Florida”? In “Iraq’s bloody toll”? What could have been done differently to make the message more efficiently conveyed?

Data Visualization Toolkits

Pick a platform - and stick with it

In this course, you will pick one platform and do all your exercises in this platform. Good options include:

matplotlib (and seaborn)
Python, not Grammar of Graphics
ggplot2
R, Grammar of Graphics
plotnine
Python, Grammar of Graphics
altair
Python, Grammar of Graphics
d3.js / ObservableJS JavaScript, not Grammar of Graphics

It’s better to build 80% proficiency in one tool than 20% each in 3 different tools. Your next job may well use something different - and for each tool you learn, the next one is easier to learn.

Out of the box - demo visualizations

We draw on the NYC OpenData portal and collect data on traffic on the NYC Ferry network.

The data we want is available at https://data.cityofnewyork.us/Transportation/NYC-Ferry-Ridership/t5n6-gx8c and we can compose a query (to offload some computation onto the NYC OpenData servers) to extract the dily rider count:

https://data.cityofnewyork.us/resource/t5n6-gx8c.csv?$select=date,route,SUM(boardings)&$group=date,route&$limit=1000000

We want a linegraph of the daily ridership, by ferry route

Python / matplotlib + seaborn

Code

import pandas
from matplotlib import pyplot
import seaborn

ferry_url = "https://data.cityofnewyork.us/resource/t5n6-gx8c.csv?$select=date,route,SUM(boardings)&$group=date,route&$limit=1000000"
ferry = pandas.read_csv(ferry_url)
seaborn.lineplot(data=ferry, x="date", y="SUM_boardings", hue="route")

R / ggplot2

Code

library(tidyverse)

ferry.url = "https://data.cityofnewyork.us/resource/t5n6-gx8c.csv?$select=date,route,SUM(boardings)&$group=date,route&$limit=1000000"
ferry = read_csv(ferry.url)
ggplot(ferry) +
  geom_line(aes(x=date, y=SUM_boardings, color=route))

Python / plotnine

Code

import pandas
from plotnine import ggplot, geom_line, aes

ferry_url = "https://data.cityofnewyork.us/resource/t5n6-gx8c.csv?$select=date,route,SUM(boardings)&$group=date,route&$limit=1000000"
ferry = pandas.read_csv(ferry_url)

ggplot(ferry, aes("date","SUM_boardings", color="factor(route)", group="route")) + geom_line()

<ggplot: (687517027)>

Python / Altair

Code

import altair

ferry_url = "https://data.cityofnewyork.us/resource/t5n6-gx8c.json?$select=date,route,SUM(boardings)&$group=date,route&$limit=1000000"
altair.Chart(ferry_url).mark_line().encode(
  x="date:T", y="SUM_boardings:Q", color="route:N"
).interactive()

JavaScript / ObservableJS+d3.js

Code

d3 = require("d3@7")
ferry_url = "https://data.cityofnewyork.us/resource/t5n6-gx8c.csv?$select=date,route,SUM(boardings)&$group=date,route&$limit=1000000"
ferry = d3.csv(ferry_url, (d) => {
return {
  date: new Date(d.date),
  route: d.route,
  SUM_boardings: +d.SUM_boardings
}
})
Plot.plot({
  marks: [Plot.line(ferry, {sort:"date", x:"date", y:"SUM_boardings", z:"route", stroke:"route"})]
})

Your first aesthetic critique

What differences and similarities do you see between the different “Out of the box” plots here?

What would you like to change?

What would you like to check / verify?

Would you like more (or less) binning and aggregation?

What, if any, interactive features would you like?

What, if any, labels, titles, annotations would you like to use?

What would an interesting use case for this plot be?

If you were to pull properties and features freely from all platforms (or add yourself) – how could you specify the most appropriate plot for this use case?

Homework: Reconstruct this plot in your chosen platform, and improve the things you discussed in your critique.

Property	Value
Mean of x	9
Sample variance of x: \(s^2_x\)	11
Mean of y	7.50
Sample variance of y: \(s^2_y\)	4.125
Correlation between x and y	0.816
Linear regression line	\(y = 3.00 + 0.500x\)
Coefficient of determination of the linear regression: \(R^{2}\)	0.67

Lecture 1: Data Visualization

Today’s Visualization

Lighthouses

Syllabus and Structure of the Course

Syllabus and Structure of the Course

Syllabus and Structure of the Course

Syllabus and Structure of the Course

What this course will not do

What is data visualization anyway?!?

Defining Data Visualization

Why humans?

Why humans?

Why external representation?

Why visual?

Why all the data?

Why all the data?

Design: balancing constraints

Design in a context

All design work has a context, which provides limitations

Analytic Scaffolds

Tamara Munzner

Andy Kirk

Edward Tufte

Tufte: Fundamental Principles of Analytical Design

Tufte: Data ink and chart junk

Kirk: Phases of Understanding

Kirk: Design Process

Four fundamental steps

Kirk: Design principles

Kirk: Design principles

Kirk: Design principles

Kirk: Design principles

Munzner: Design questions and levels

Munzner: Design questions and levels

Munzner: Design questions and levels

Munzner: Design questions and levels

What?

Dataset Types

Attributes

Data Availability

Munzner: Design questions and levels

Why?

Actions

Targets

Munzner: Design questions and levels

How?

Encode

Manipulate

Facet

Reduce

No One Good Answer

Data Visualization is an aesthetic field

Data Visualization is a communication field

Is this a good graphic?

Is this a good graphic?

Is this a good graphic?

Is this a good graphic?

Data Visualization Toolkits

Pick a platform - and stick with it

Out of the box - demo visualizations

Python / matplotlib + seaborn

R / ggplot2

Python / plotnine

Python / Altair

JavaScript / ObservableJS+d3.js

Your first aesthetic critique