Full animated world map at https://geodienst.github.io/lighthousemap/
Related, and also excellent: https://twitter.com/i/status/1462095711508516865
After this course, you will…
We will primarily work with:
We meet in person (or over Zoom if we are mandated back online) on Wednesdays, 14.00-16.00.
In addition, all course information can be found on Blackboard.
I can be reached at mvejdemojohansson@gc.cuny.edu, and will happily schedule meetings if you need them.
Between meetings, you will read assigned chapters from the textbook, and work on smaller assignments and semester projects.
Semester-long assignments
Weekly assignments
Occasional assignments
Semester Schedule (may be changed as we go)
Date | Lecture Content | Preparation |
---|---|---|
2023-01-25 | Defining Data Visualization | Munzner ch. 1, Wilkinson ch. 1-2 |
2023-02-01 | Data Abstraction and Representation | Munzner ch. 2, Wilkinson ch. 3-4. Homework: improve the graph from the lecture slides. |
2023-02-08 | Task Abstraction | Munzner ch. 3, Wilkinson ch. 5 |
2023-02-15 | Analysis and Validation | Munzner ch. 4, Wilkinson ch. 6-7. Watch: https://www.youtube.com/watch?v=Z8t4k0Q8e8Y |
2023-02-22 | Geometric Representation | Munzner ch. 5, Wilkinson ch. 8-9. Homework: Reproduce Minard’s March on Moscow |
2023-03-01 | Aesthetic Mappings; Rules of Thumb | Munzner ch. 6, Wilkinson ch. 10 |
2023-03-08 | Tabular Data, Network Data | Munzner ch. 7, 9, Wilkinson ch. 11-12 |
2023-03-15 | Structure of a graphing library (Hannah Aizenman guest lecture) | |
2023-03-22 | Spatial Data, Geography, Maps | Munzner ch. 8, Wilkinson ch. 13 |
2023-03-29 | Color | Munzner ch. 10 |
2023-04-19 | Interactivity | Munzner ch. 11-12 |
2023-04-26 | Summaries; Time, Time-series | Munzner ch. 13-14, Wilkinson ch. 14 |
2023-05-03 | Spaces, Graph Layout, Manifold Learning / Dimension Reduction | ISOMAP, MDS, UMAP |
2023-05-10 | Presentations | |
The visual representation and presentation of data to facilitate understanding.
Andy Kirk (Data Visualization)
Visual representation of datasets designed to help people carry out tasks more effectively.
Tamara Munzner
We don’t need data vis when tasks can be fully automated.
We might not know what questions we have in advance.
Property | Value |
---|---|
Mean of x | 9 |
Sample variance of x: \(s^2_x\) | 11 |
Mean of y | 7.50 |
Sample variance of y: \(s^2_y\) | 4.125 |
Correlation between x and y | 0.816 |
Linear regression line | \(y = 3.00 + 0.500x\) |
Coefficient of determination of the linear regression: \(R^{2}\) | 0.67 |
Each value exact up to at least 2 decimal places.
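The table can be checked directly. The sketch below assumes the table summarizes Anscombe's quartet (the values match, though the surviving slide text does not name the dataset) and uses the commonly published values for its four datasets. The names `quartet` and `summarize` are made up for this illustration.

```python
from statistics import mean, variance

# Anscombe's quartet (Anscombe, 1973). Assumption: this is the dataset
# behind the summary table; the slide text itself does not name it.
x123 = [10, 8, 13, 9, 11, 14, 6, 4, 12, 7, 5]
quartet = {
    "I": (x123, [8.04, 6.95, 7.58, 8.81, 8.33, 9.96, 7.24, 4.26, 10.84, 4.82, 5.68]),
    "II": (x123, [9.14, 8.14, 8.74, 8.77, 9.26, 8.10, 6.13, 3.10, 9.13, 7.26, 4.74]),
    "III": (x123, [7.46, 6.77, 12.74, 7.11, 7.81, 8.84, 6.08, 5.39, 8.15, 6.42, 5.73]),
    "IV": ([8, 8, 8, 8, 8, 8, 8, 19, 8, 8, 8],
           [6.58, 5.76, 7.71, 8.84, 8.47, 7.04, 5.25, 12.50, 5.56, 7.91, 6.89]),
}


def summarize(x, y):
    """Compute the summary statistics listed in the table for one dataset."""
    mx, my = mean(x), mean(y)
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    slope = sxy / sxx                  # least-squares slope
    r = sxy / (sxx * syy) ** 0.5       # Pearson correlation
    return {
        "mean_x": mx, "var_x": variance(x),
        "mean_y": my, "var_y": variance(y),
        "corr": r, "intercept": my - slope * mx, "slope": slope, "r2": r ** 2,
    }


summaries = {name: summarize(x, y) for name, (x, y) in quartet.items()}
for name, s in summaries.items():
    print(name, {k: round(v, 3) for k, v in s.items()})
```

All four rows of output should agree with the table to about two decimal places, even though scatterplots of the four datasets look nothing alike.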
Summaries inherently lose information
12 datasets, identical statistics
Three different sets of questions and considerations to guide your design work.
3 design questions:
3 phases of understanding:
4 stage design process:
3 design principles:
Edward Tufte, Beautiful Evidence, pp. 122-139
Early Tufte design guidance:
Tip
Measure and maximize the data-ink ratio. Data-ink is the non-erasable core of a graphic, and the ratio to maximize is (data-ink / total ink)
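Written out as a formula, following Tufte's definition in The Visual Display of Quantitative Information:

\[
\text{data-ink ratio} = \frac{\text{data-ink}}{\text{total ink used in the graphic}} = 1 - \left(\text{proportion of the graphic that can be erased without loss of data-information}\right)
\]

As a made-up illustration: if 30% of a chart's ink could be erased without losing any data-information, its data-ink ratio would be \(1 - 0.3 = 0.7\).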
Tip
Eschew and eliminate chart junk – graphical decorations, textures, patterns, all of which just increase total ink without increasing data ink.
Visualizer control
Viewer control
Perceiving
What do I see?
Interpreting
What does it mean, given the subject?
What features are…
Comprehending
What does it mean, to me?
Based on Dieter Rams’ 10 principles of good design:
Principle 1. Good visualization design is trustworthy.
Principle 2. Good visualization design is accessible.
Principle 3. Good visualization design is elegant.
Each level (domain/abstraction/idiom/algorithm) contained in the previous.
Different levels have different failure modes
Solution: use methods from different fields at each level.
Design questions impose a structure on an otherwise vast design space.
Tables (tidy data)
Networks
Fields
Data cubes / tensors
Trees
Geometry (spatial)
Attribute Type
Ordering Direction
Static
Dynamic
Design questions impose a structure on an otherwise vast design space.
Analyze
Query
Search
All Data
Attributes
Network Data
Spatial Data
Design questions impose a structure on an otherwise vast design space.
Arrange
Map from categorical and ordered attributes
Change
Select
Navigate
Juxtapose
Partition
Superimpose
Filter
Aggregate
Embed
You will see disagreements on what is and is not a good design.
And on what is and is not a good design principle.
Many applications of data visualization communicate a message, either intentionally or unintentionally.
Notice how Kirk emphasizes communication, Munzner acknowledges it, and Tufte all but ignores that aspect.
Tufte: Look at all that chart junk! So many decorations that do not directly encode data!
Kirk: cites Jen Christiansen, Graphics Editor at Scientific American. “I found that when I developed magazine graphics according to [Tufte’s] philosophy, they were most often met with a yawn. The reality is that Scientific American isn’t required reading. We need to engage readers, as well as inform them.”
Decorations provide context for the information – it is immediately apparent what the data is about (something something razors) without impacting the trustworthiness of the data display itself.
Very popular target as an example of a bad graph. The inverted y-axis is very often invoked as a condemning feature.
Kirk points out that it was designed to emulate another chart published earlier: “Iraq’s bloody toll”.
The red coloring and the inverted y-axis in combination are attempting to evoke a metaphor of blood dribbling down a wall.
Question: Was the intended metaphor successful? In “Gun deaths in Florida”? In “Iraq’s bloody toll”? What could have been done differently to convey the message more effectively?
In this course, you will pick one platform and do all your exercises in this platform. Good options include:
matplotlib (and seaborn)
ggplot2
plotnine
altair
d3.js / ObservableJS (JavaScript, not Grammar of Graphics)
It’s better to build 80% proficiency in one tool than 20% each in 3 different tools. Your next job may well use something different, and for each tool you learn, the next one is easier to learn.
We draw on the NYC OpenData portal and collect data on traffic on the NYC Ferry network.
The data we want is available at https://data.cityofnewyork.us/Transportation/NYC-Ferry-Ridership/t5n6-gx8c and we can compose a query (to offload some computation onto the NYC OpenData servers) to extract the daily rider count:
https://data.cityofnewyork.us/resource/t5n6-gx8c.csv?$select=date,route,SUM(boardings)&$group=date,route&$limit=1000000
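One way to keep such a query readable is to assemble it from its parts rather than writing the URL by hand. This is a minimal sketch using only the standard library. The names `BASE` and `params` are illustrative, and the `$select`, `$group` and `$limit` parameters are the ones already used in the query.

```python
from urllib.parse import urlencode

# Endpoint for the NYC Ferry ridership dataset
BASE = "https://data.cityofnewyork.us/resource/t5n6-gx8c.csv"

params = {
    "$select": "date,route,SUM(boardings)",  # columns to return, with a server-side sum
    "$group": "date,route",                  # one row per (date, route) pair
    "$limit": 1000000,                       # raise the row limit so nothing is cut off
}

# Keep $ ( ) and , unescaped so the result matches the hand-written URL
ferry_url = BASE + "?" + urlencode(params, safe="$(),")
print(ferry_url)
```

Changing the grouping (for example, dropping `route` from both `$select` and `$group`) then only requires editing the dictionary.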
We want a line graph of the daily ridership, by ferry route.
import pandas
from plotnine import ggplot, geom_line, aes

# Ask the NYC OpenData server for total daily boardings per route
ferry_url = "https://data.cityofnewyork.us/resource/t5n6-gx8c.csv?$select=date,route,SUM(boardings)&$group=date,route&$limit=1000000"
ferry = pandas.read_csv(ferry_url)
# One line per ferry route, colored by route
ggplot(ferry, aes("date", "SUM_boardings", color="factor(route)", group="route")) + geom_line()
d3 = require("d3@7")
ferry_url = "https://data.cityofnewyork.us/resource/t5n6-gx8c.csv?$select=date,route,SUM(boardings)&$group=date,route&$limit=1000000"
// Parse each CSV row: convert the date string to a Date and the count to a number
ferry = d3.csv(ferry_url, (d) => {
  return {
    date: new Date(d.date),
    route: d.route,
    SUM_boardings: +d.SUM_boardings
  }
})
// One line per route: z groups the series, stroke colors them
Plot.plot({
  marks: [Plot.line(ferry, {sort: "date", x: "date", y: "SUM_boardings", z: "route", stroke: "route"})]
})
What differences and similarities do you see between the different “Out of the box” plots here?
What would you like to change?
What would you like to check / verify?
Would you like more (or less) binning and aggregation?
What, if any, interactive features would you like?
What, if any, labels, titles, annotations would you like to use?
What would an interesting use case for this plot be?
If you were to pull properties and features freely from all platforms (or add your own), how would you specify the most appropriate plot for this use case?
Homework: Reconstruct this plot in your chosen platform, and improve the things you discussed in your critique.