Our first meeting of the semester is at 2:30PM on Monday, August 15th in Manning 209.
During our first meeting, we instructors will introduce ourselves, and we’ll go over the structure of the course, access to resources such as the weekly readings, and guidelines for success.
Our second meeting is at 2:30PM on Wednesday, August 17th. The second meeting will be an overview of the substantive content of the course.
We’ll meet for lectures each week on Monday and Wednesday at 2:30PM. We will generally not meet on Friday, with a couple of exceptions:
Monday, September 5 is Labor Day, so instead of having lectures on Monday and Wednesday that week, we will have them on Wednesday the 7th and Friday the 9th at 2:30PM.
Monday, September 26 is a wellness day, so instead of having lectures on Monday and Wednesday that week, we will have them on Wednesday the 28th and Friday the 30th at 2:30PM.
All recitation sections will begin meeting this week. See the recitation schedule.
Information is the result of a process that begins with a bunch of meaningful stuff and ends with something usable. The “bunch of meaningful stuff” can be practically anything: words on pages, recorded sounds, photographic images, tables of numbers, records of transactions, 3D models… the list goes on. The “something usable” is what we call information.
The process that leads from “stuff” to “information” can vary widely, but in this course we’ll focus on situations in which there is “too much stuff” for one person to carry out the process on their own. When that’s the case, people need to work together and construct systems to carry out the process of producing usable information. We’ll call systems that select usable information out of a mass of too much stuff “selecting systems.”
In the first part of the course we’ll introduce some concepts and terminology that will allow us to be a bit more precise when thinking and talking about what it is that selecting systems do.
Total amount of required reading for this week: 9,200 words
There’s a lot of stuff in the world, but only some of it is meaningful. If you see someone eating soup, you’re unlikely to ask them, “What does that soup mean?” On the other hand, if you see someone painting a mural, asking them what it means would be perfectly appropriate.
Not all meaningful stuff sticks around. Spoken words or messages written in sand will disappear without a trace unless they are somehow captured and made persistent. We’ll refer to persistent, meaningful stuff as “documents.”
The word “documents” might bring to mind things like college applications or tax forms. Those are certainly documents, but many other things can be documents too: photographs, pop songs, tweets, video games, even zoo animals. What makes some thing a document is not some special property it has, but the way it is used: how it is created, exchanged, understood, modified, collected, described, stored, etc.
A selecting system consists of various operations involving documents: collecting and creating them, transforming them, and arranging them.
To read before this meeting:
In this chapter from their book The Social Life of Information, John Seely Brown and Paul Duguid explain why, despite 50+ years of digital computers and networks, we still use a lot of paper documents.
Total amount of required reading for this week: 6,500 words
Documents are persistent meaningful stuff. The process through which something comes to have meaning is known as semiosis, and semiotics is the study of that process.
Semiotics does not provide a theory for explaining semiosis. What it provides are conceptual tools for thinking more precisely about the production of meaning.
Selecting systems take a bunch of meaningful stuff as input and produce as output usable information—the meaning of which is somehow related to the meaning of the stuff that was input. Semiotic concepts are thus particularly useful for thinking about what selecting systems do.
To read before this meeting:
This article compares how the concept of abstraction is understood by computer scientists and semioticians. The author argues that semiotic systems should be understood as “machines for creating differences,” of which computers are one kind.
Total amount of required reading for this week: 6,600 words
Semiotics provides conceptual tools for analyzing the meaning of “meaningful stuff.” Information theory provides conceptual tools for analyzing the stuff.
Information theory starts from the recognition that in order for stuff to be potentially meaningful, it has to be patterned in some way. Information theory is the study of those patterns, and it provides mathematical tools for comparing and measuring those patterns.
Those mathematical tools have turned out to be useful for many purposes, including for the construction of selecting systems. But unlike semiotics, information theory has nothing at all to say about meaning—it is concerned only with patterns, not with what those patterns might mean.
In other words, “information theory” is a misleading name. The word “information” in “information theory” does not mean “the result of a process that begins with a bunch of meaningful stuff and ends with something usable.” A better name for information theory would be “pattern theory.”
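To make “measuring patterns” a bit more concrete, here is a small, entirely optional sketch in Python (our own illustration, not something from the readings) of Shannon’s entropy measure. The idea: the more predictable the pattern of symbols in a message, the fewer bits per symbol it takes to encode it.

```python
import math
from collections import Counter

def entropy_bits(message: str) -> float:
    """Shannon entropy, in bits per symbol, of the empirical
    symbol distribution of `message`."""
    counts = Counter(message)
    total = len(message)
    return -sum((n / total) * math.log2(n / total) for n in counts.values())

# A highly patterned (repetitive) message carries fewer bits per symbol
# than one whose symbols are spread evenly.
print(entropy_bits("aaaaaaaab"))  # ~0.50 bits: mostly one symbol
print(entropy_bits("abcdefgh"))   # 3.0 bits: eight equally likely symbols
```

Note that nothing in this calculation depends on what (if anything) the symbols mean; it depends only on the pattern of their occurrence.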
To read before this meeting:
Claude Shannon, an engineer who worked at Bell Labs, developed a mathematical theory of communication that came to be known as “information theory.” The papers in which Shannon developed his theory were originally published in 1948 in two parts in the Bell System Technical Journal. A year later, Warren Weaver published this summary of Shannon’s work.
There is some math in this report. If you’re not mathematically inclined, just skip over it—it isn’t necessary to understand the math in order to understand the basic ideas.
About six years after information theory made its debut, Shannon wrote this one-page editorial.
This short article uses the information-theoretic concept of entropy to explain why it is so easy to identify individual people based on their web browsing activity.
This chapter from science writer James Gleick’s book The Information is an engaging mini-biography of Claude Shannon, but it is also an accessible introduction to information theory.
Monday, September 5 is Labor Day, so instead of having lectures on Monday and Wednesday, we will have them on Wednesday the 7th and Friday the 9th at 2:30PM.
There is no new material this week, as you will be working on exam #1.
During our usual lecture times, there will be open Q&A sessions on Zoom, before and during which you can submit questions about difficulties you might be having with the exam. (See this announcement on Canvas for the Zoom link.)
The recitations this week will also focus on discussing the exam questions and helping each other think about how to answer them.
Both semiotics and information theory provide tools for understanding how documents are built up out of groups of more basic meaningful things: a text is a group of words, an image is a group of figures and grounds, an electronic record is a group of keys and values…
A selecting system carries out various operations on these groups: collecting, arranging, and transforming them into new groups (and groups of groups, and groups of groups of groups…). The goal is to select out of a mass of stuff some specific group: the group of videos that will keep you watching, the group of students that are likely to succeed in college, the group of hypotheses consistent with the data.
Designing and implementing a selecting system typically requires:
deciding how things will be distinguished and grouped, and
formally describing the operations to be carried out on those groups.
Building upon what we learned in the first part of the course, in the second part we’ll examine these two requirements.
We’ll start by considering how we draw distinctions and group things as we think and communicate about the world around us, and how the desire to coordinate these activities across broader scales motivates standardization and systematization.
Then we’ll consider and contrast two different ways of formally describing operations on groups: Boolean algebra (for deductively describing and reasoning about groups) and Bayesian inference (for inductively describing and reasoning about them).
Total amount of required reading for this week: 10,100 words
Categories are groups that have names. This week we’ll examine how loose, everyday categories become standardized and systematized into classifications, in order to support some kind of collective action.
For example, scientists seek to develop universal classifications rather than relying on locally specific categories. Establishing and maintaining universal classifications is difficult, as the history of the scientific classification of clouds demonstrates. It’s not just a matter of agreeing on categories, but also a matter of establishing and documenting observational practices that make clouds classifiable.
Science is not the only institution that seeks to systematically classify things in order to coordinate collective action across great distances and over long periods of time. Law, medicine, trade and finance, engineering—every variety of large-scale coordination has its own techniques of making things classifiable (though we can identify some common features).
To read before this meeting:
Total amount of required reading for this week: 14,400 words
This week we will look at one common way of formally describing operations on groups: Boolean algebra.
Boolean algebra relies on the following “common-sense” assumptions:
every thing either belongs to a given group or it does not (there are no degrees of membership), and
for any thing and any group, it is possible to determine which is the case.
If we make these assumptions, we can define groups using Boolean algebraic expressions. We can then manipulate these expressions according to the rules of Boolean algebra to deductively reason about operations on those groups (for example combining, intersecting, and negating them).
We call this formal reasoning because it depends only on the forms (the symbols and operators) of the mathematical expressions—the actual groups of things that those symbols represent are irrelevant.
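If you’re curious what this looks like in practice, here is a small optional sketch in Python (the groups and names are invented for illustration). Python’s set operators correspond directly to the Boolean operations of combination (OR), intersection (AND), and negation (NOT):

```python
# Invented example groups; what the names refer to is irrelevant
# to the formal reasoning below.
insured = {"Ada", "Boole", "Shannon", "Maron"}
smokers = {"Boole", "Maron", "Peirce"}

print(insured | smokers)  # union: insured OR smokers
print(insured & smokers)  # intersection: insured AND smokers
print(insured - smokers)  # difference: insured AND NOT smokers

# Formal reasoning: De Morgan's law holds no matter what the
# groups actually contain.
everyone = insured | smokers
assert everyone - (insured & smokers) == (everyone - insured) | (everyone - smokers)
```

The assertion at the end illustrates what “formal” means: De Morgan’s law holds because of the forms of the expressions, no matter what the symbols happen to represent.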
To read before this meeting:
This article is by Edmund Berkeley, a pioneer of computer science and co-founder of the Association for Computing Machinery, which is still the primary scholarly association for computer scientists. But he wrote this article in 1937, before he became a computer scientist—because computers did not yet exist. At the time he was a mathematician working at the Prudential life insurance company, where he recognized the usefulness of Boolean algebra for modeling insurance data. He published this article in a professional journal for actuaries (people who compile and analyze statistics and use them to calculate insurance risks and premiums).
Berkeley uses some frightening-looking mathematical notation in parts of this article, but everything he discusses is actually quite simple. The most important parts are:
pages 373–374, where he gives a simple explanation of Boolean algebra,
pages 380–381, where he considers practical applications of Boolean algebra, and
pages 383 onward, where he pays close attention to translation back and forth between Boolean algebra and English.
This is an excerpt from one of my favorite books, Data and Reality by Bill Kent. Kent was a computer programmer and database designer at IBM and Hewlett-Packard, during the era when the database technologies we use today were first being developed. He thought deeply and carefully about the challenges of data modeling and management, which he recognized were not primarily technical challenges.
The fixed-width typewriter font makes this reading look old-fashioned, but nothing in it is out of date. These are precisely the same issues data modelers and “data scientists” struggle with today.
Monday, September 26 is a wellness day, so instead of having lectures on Monday and Wednesday, we will have them on Wednesday the 28th and Friday the 30th at 2:30PM.
Total amount of required reading for this week: 18,400 words
Boolean algebra makes it possible to formally specify precise rules for grouping. Yet we can often distinguish different groups without being able to precisely specify rules for doing so.
An example is the grouping of texts by subject. Grouping together books or journal articles that are about the same things doesn’t seem so difficult, assuming that we can read and understand them. But it turns out to be difficult to precisely specify rules for doing this.
As an alternative one can approach the problem statistically: perhaps there are patterns of correlation between the attributes of texts (for example, the words that appear in them) and the way that they are grouped by subject. In order to find such patterns, we need some evidence: a collection of texts that have already been grouped, which we can then analyze to look for correlations between their attributes and the groups they’ve been assigned to.
Bayesian inference is the mathematical formalization of this process of inductively reasoning about groups: identifying patterns of correlation in existing groups, and then applying these patterns to sort new things into those groups.
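As a toy, entirely optional illustration, here is a sketch in Python of the idea behind a Bayesian text classifier. The training texts and subject groups are invented; a real system would learn its patterns from many thousands of examples:

```python
import math
from collections import Counter, defaultdict

# Invented training data: texts that have already been grouped by subject.
training = [
    ("insurance premium risk policy", "finance"),
    ("premium policy claim actuary", "finance"),
    ("cloud rain cumulus weather", "meteorology"),
    ("weather cloud storm forecast", "meteorology"),
]

# Identify the patterns: how often each word occurs in each group.
word_counts = defaultdict(Counter)
group_counts = Counter()
vocabulary = set()
for text, group in training:
    words = text.split()
    word_counts[group].update(words)
    group_counts[group] += 1
    vocabulary.update(words)

def classify(text: str) -> str:
    """Sort a new text into the group with the highest posterior
    probability, using add-one smoothing for unseen words."""
    best_group, best_score = None, -math.inf
    for group, n_texts in group_counts.items():
        total_words = sum(word_counts[group].values())
        # log P(group) + sum over words of log P(word | group)
        score = math.log(n_texts / sum(group_counts.values()))
        for word in text.split():
            p = (word_counts[group][word] + 1) / (total_words + len(vocabulary))
            score += math.log(p)
        if score > best_score:
            best_group, best_score = group, score
    return best_group

print(classify("cloud cover and storm risk"))  # -> meteorology
```

Words like “cloud” and “risk” are doing the statistical work here; compare what Maron, in the reading below, calls “clue words.”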
To read before this meeting:
In this chapter Patrick Wilson considers the problems that arise when one tries to come up with systematic rules for classifying texts by subject.
Wilson can be a bit long-winded, but his insights are worth it. (You can skip the very long footnotes, so this reading is actually shorter than it looks.) What Wilson calls a “writing” is more typically referred to as a text. In this chapter he is criticizing the assumptions librarians make when cataloging texts by subject. The “sense of position” in the title of the chapter refers to the librarian’s sense of where in a classification scheme a text should be placed. Although he is talking about library classification, everything Wilson says is also applicable to state-of-the-art machine classification of texts today.
Bill Maron was an engineer at missile manufacturer Ramo-Wooldridge when he began investigating statistical methods for classifying and retrieving documents. In this paper he describes a method for statistically modeling the subject matter of texts. He introduces the basic ideas behind what is now known as a Bayesian classifier, a technique that is still widely used today for a variety of automatic classification tasks from spam filtering to face recognition.
Trigger warning: math. The math is relatively basic, and if you’ve studied any probability, you should be able to follow it. But if not, just skip it: Maron explains everything important about his experiment in plain English. Pay extra attention to what he says about “clue words.”
There are no new lectures or readings this week.
During our usual lecture times, there will be open Q&As, during which you can submit questions about the material we’ve covered during the first two units. (See this announcement on Canvas for the Zoom link.)
Recitations will focus on review in preparation for the second exam.
Due to the Fall break, neither the lectures nor recitations will meet this week.
During the last part of the course, you and your classmates will work together on identifying and analyzing selecting systems “in the wild.”
We’ll begin by reviewing and refining our model of how selecting systems work by carrying out various operations on groups of documents: collecting, arranging, and transforming them into new groups.
Then we’ll take another look at Boolean algebra and Bayesian inference. We’ll think about how these two different formal techniques for reasoning about groups can be used to produce different kinds of selecting systems.
Next, we’ll consider the trade-offs between using human and machine labor in selecting systems.
Finally, we'll reflect on the relationship between selecting systems and society. Do new kinds of selecting systems cause changes in culture, politics, and society? Or do social, political, and cultural norms and practices determine the kind of selecting systems we create?
Total amount of required reading for this week: 8,100 words
This week we’ll look at examples of selecting systems and try to analyze them, reviewing and refining our model of how selecting systems work by carrying out various operations on groups of documents: collecting, arranging, and transforming them into new groups.
This will also be the week that the class splits into teams, each of which will choose a selecting system to analyze.
To read before this meeting:
This reading examines the structure and components of information storage and retrieval systems and information filtering systems. It argues that all selection systems can be represented as combinations of a set of basic components of only two types: representations of data objects, and functions that operate on those representations.
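As a rough, optional illustration of that two-component view (this sketch is our own, with invented data, not the authors’), the data objects can be represented as simple records and the selecting system as a pair of functions that operate on them:

```python
# Invented representations of data objects: each document is reduced
# to a title and a set of words.
documents = [
    {"title": "On Clouds",   "words": {"cloud", "rain", "cumulus"}},
    {"title": "Risk Tables", "words": {"insurance", "risk", "premium"}},
    {"title": "Storm Atlas", "words": {"storm", "cloud", "forecast"}},
]

def collect(docs, query_words):
    """Select the documents whose representations match the query."""
    return [d for d in docs if d["words"] & query_words]

def arrange(docs, query_words):
    """Order the selected documents by how many query words they match."""
    return sorted(docs, key=lambda d: len(d["words"] & query_words), reverse=True)

query = {"cloud", "storm"}
results = arrange(collect(documents, query), query)
print([d["title"] for d in results])  # ['Storm Atlas', 'On Clouds']
```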
Total amount of required reading for this week: 13,600 words
Boolean algebra and Bayesian inference are two different formal techniques for reasoning about groups. These techniques can be applied to produce different kinds of selecting systems. Why might one technique be used rather than the other? How and why might the two techniques be combined in a selecting system?
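One common answer, sketched below in Python with invented data and made-up scores, is to combine them in stages: a Boolean expression deductively narrows the field of candidates, and a statistically estimated probability of relevance (such as a Bayesian classifier might produce) inductively ranks whatever remains:

```python
# Invented sketch: a Boolean filter narrows the candidates deductively;
# a statistically estimated relevance score then ranks them inductively.
# The scores here are made up; a real system might get them from a
# Bayesian classifier like the one in Maron's paper.
articles = [
    {"title": "Cloud Atlas Review", "words": {"cloud", "atlas", "review"}, "p_relevant": 0.35},
    {"title": "Classifying Clouds", "words": {"cloud", "classification"},  "p_relevant": 0.90},
    {"title": "Cloud Computing",    "words": {"cloud", "computing"},       "p_relevant": 0.10},
]

# Boolean step: cloud AND NOT computing.
candidates = [a for a in articles
              if "cloud" in a["words"] and "computing" not in a["words"]]

# Bayesian step: rank the survivors by estimated probability of relevance.
candidates.sort(key=lambda a: a["p_relevant"], reverse=True)
print([a["title"] for a in candidates])
# -> ['Classifying Clouds', 'Cloud Atlas Review']
```

The Boolean step gives a human searcher precise control over what gets considered at all; the statistical step handles an ordering that would be tedious to specify by hand.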
To read before this meeting:
This is an excerpt from an article arguing that, though they are perceived as outdated, selection systems based on Boolean algebra (more commonly referred to as Boolean retrieval systems) are preferable for some purposes because they offer more opportunities for human decision-making during searches.
This reading scrutinizes Bill Maron’s Bayesian classifier, identifying it as an example of a technique that is now applied to many purposes quite different from Maron’s.
Selecting system analysis proposals must be submitted to your recitation instructor before your recitation meets this week.
Total amount of required reading for this week: 6,000 words
Selecting usable information from a mass of material involves labor. This week we’ll consider the question of automation: what kinds of selecting labor can be done by people, and what kinds can be done by machines? What kinds of selecting labor should be done by people, and what kinds should be done by machines?
To read before this meeting:
In this chapter from her book Behind the Screen, Sarah Roberts provides an overview of commercial content moderation at companies like Facebook. She explains what commercial content moderation is, who does it, and the conditions under which they work.
There are various positions one might take regarding the relationship between technology and society. Sometimes people talk about technology as an external force that exerts influence on society, pushing us in certain directions. Other times people insist that technologies are “just tools” that can be used in different ways, for better or for worse.
The same questions can be raised about selecting systems. Do new kinds of selecting systems cause changes in culture, politics, and society? Or do social, political, and cultural norms and practices determine the kind of selecting systems we create?
This week’s readings are all optional.
To read before this meeting:
The authors are attacking what they describe as “linear” models of technological development, which focus on a series of “technological breakthroughs” leading inevitably to where we are today. They argue that looking at the actual historical development of a technology like the bicycle shows that what seem in retrospect to be obvious “technological breakthroughs” were not at all obvious at the time.
It may help to consult these pages to get a sense of the different bicycle models discussed in the reading:
During this week’s recitation your group will give a 5–7 minute presentation highlighting the progress you’ve made on your selecting system analysis.
Due to the Thanksgiving break, neither the lectures nor recitations will meet this week.
As classes end this week, neither lectures nor recitations will meet. However, each project group is encouraged to schedule a meeting with one of the instructors to discuss their progress so far and any issues they are having.