Foundations of Information Science

UNC School of Information and Library Science, INLS 201, Spring 2018

January 11

Today we’ll meet each other, and I’ll explain the plan for the class and how to use the course website. Finally we’ll try out our federated wiki.

If you feel like it, check out the federated wiki videos.

January 16
Genres of information

This class is about the study, or science, of information. OK, but what is information? We hear the word a lot, but it’s surprisingly hard to pin down what it means. For today we’ll read an article that attempts to explain why, written by the computer-scientist-turned-information-scholar Philip Agre. Agre is an advocate of what he calls “critical technical practice,” which he suggests requires cultivating a “split identity” as both a problem-solving engineer and problem-finding critic. In this article, Agre brings that technically-informed critical perspective to bear on the idea of “information.”

To read before this class:

  1. Agre, Philip E. “Institutional Circuitry: Thinking about the Forms and Uses of Information.” Information Technology and Libraries 14, no. 4 (December 1995): 225.

January 18
Document society

Our lives and our societies are structured by and constituted through documents. We’ll look at some examples.

Today’s reading is the first chapter of Michael Buckland’s book on Information and Society. Buckland is a professor at the Berkeley School of Information, and he was my doctoral advisor.

Optional, but highly recommended, is an excerpt from Alva Noë’s book Strange Tools: Art and Human Nature about how playing baseball requires documents. Noë is a philosopher, also at Berkeley, who writes about human consciousness, neuroscience, and art.

To read before this class:

  1. Buckland, Michael. “Introduction.” In Information and Society, 1–19. MIT Press, 2017. PDF.

  2. Noë, Alva. “Art Loops and the Garden of Eden.” In Strange Tools, 29–48. New York: Hill and Wang, a division of Farrar, Straus and Giroux, 2015. PDF.

January 23
Thinking with our eyes and hands

For today we’ll read an article by Bruno Latour, a French philosopher, anthropologist and sociologist. Latour wrote this article to persuade his colleagues in the social sciences that they need to pay more attention to documents and processes of documentation.

This is the first of our more difficult readings, which will mostly be assigned for Tuesdays, giving you five days to read them. On the Thursdays before, I will give you some tips for reading these slightly more difficult texts.

To read before this class:

  1. Latour, Bruno. “Visualisation and Cognition: Thinking with Eyes and Hands.” Knowledge and Society: Studies in the Sociology of Culture Past and Present 6 (1986): 1–40. PDF.

    This is a long article, and can be difficult to read in parts. If you’re struggling, focus on sections III–V, especially Section IV, where he summarizes his arguments.

January 25
Information theory

As we began to communicate through wires and over radio waves, engineers sought to understand and describe how communication happens, in order to design better communication systems. Claude Shannon, an engineer who worked at Bell Labs, developed an influential theory that came to be known as “information theory.” Today we’ll investigate some of the phenomena he described.

Before class you should read the excerpt from Edgar Allan Poe’s The Gold-Bug, and optionally you may also read a short historical account of the development of Shannon’s theory by science writer James Gleick.

To read before this class:

  1. Poe, Edgar Allan. “The Cryptograph / The Solution Begun / The Cipher Read.” In The Gold Bug. Chicago, New York [etc.]: Rand, McNally & Company, 1902. PDF.

  2. Gleick, James. “Information Theory.” In The Information, 1st ed., 204–32. New York: Pantheon Books, 2011. PDF.
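
Shannon’s central measure, entropy, quantifies the average uncertainty of a message source. As a preview of the phenomena we’ll investigate, here is a minimal sketch (the `entropy` function and the example distributions are mine, not from the readings):

```python
import math

def entropy(probs):
    """Shannon entropy in bits: H = -(sum of p * log2(p))."""
    return -sum(p * math.log2(p) for p in probs if p > 0)

# A fair coin is maximally uncertain: one bit per toss.
print(entropy([0.5, 0.5]))  # 1.0
# A heavily biased coin is more predictable, so it carries less information.
print(entropy([0.9, 0.1]))  # less than 1 bit
```

Compression and error correction, two of information theory’s great successes, both rest on this way of measuring uncertainty.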

January 30
Meaning, signs and codes

Another approach to understanding communication through documents (in addition to Shannon’s theory) is to focus on “signs,” the organization of signs into codes or languages, and the cultures within which signs and codes operate. This approach is known as semiotics. Media scholar John Fiske provides a good basic explanation of what semiotics is and how it differs from information theory.

To read before this class:

  1. Fiske, John. “Communication Theory / Meanings, Signs, and Codes.” In Introduction to Communication Studies, 2nd ed., 6–12, 39–46, 56–58, 64–65. London; New York: Routledge, 1990. PDF.

February 1
Understanding graphics and images

Semiotics, the study of signs, isn’t limited to texts: we can also use it to describe how we understand graphics and images. Cartoonist Scott McCloud shows how.

To read before this class:

  1. McCloud, Scott. “The Vocabulary of Comics.” In Understanding Comics, 1st HarperPerennial ed., 24–59. New York: HarperPerennial, 1994. PDF.

February 6
Making distinctions

Until now we’ve mainly focused on documents and the marks on them, and how we understand and interpret those marks. This week we change our focus a bit, to look at how our understanding of the world is structured.

We begin with some excerpts from a book by Eviatar Zerubavel about how we categorize and classify the world around us. Zerubavel is a cognitive sociologist, meaning that he studies how social processes shape our thinking, and he’s written a number of fascinating and accessible books on the topic.

To read before this class:

  1. Zerubavel, Eviatar. “Introduction / Islands of Meaning / The Great Divide / The Social Lens.” In The Fine Line, 1–17, 21–24, 61–80. New York: Free Press, 1991. PDF.

February 8
Classification in everyday life

We all categorize and classify all the time, but we don’t always do it intentionally and systematically. Today we’ll try out a form of systematic classification known as faceted classification.

To read before this class:

  1. Hunter, Eric. “What Is Classification? / Classification in an Information System / Faceted Classification.” In Classification Made Simple, 3rd ed. Farnham: Ashgate, 2009. PDF.

February 13
Scientific classification

Most of us would readily agree that our everyday “folk” classifications are historically contingent and somewhat arbitrary. Yet scientific classification presumably is different: science is the study of reality, and so scientific classifications are “real” in a way that other classifications are not. Today we’ll discuss the extent to which this is true.

The required reading is by Lorraine Daston, a historian of science. She traces the history of scientists’ attempts to classify clouds.

Optionally, you may also read a short (1.5 pages) article on scientific classification by the philosopher of science John Dupré.

To read before this class:

  1. Daston, Lorraine. “Cloud Physiognomy.” Representations 135, no. 1 (August 1, 2016): 45–71.

  2. Dupré, John. “Scientific Classification.” Theory, Culture & Society 23, no. 2–3 (May 1, 2006): 30–32.

February 15
Naming

We can’t talk or write about things or kinds of things without giving them names. Unfortunately naming isn’t as easy as it sometimes may seem. Today we’ll investigate the difficulties of agreeing on names.

The required reading is another chapter from Buckland’s Information and Society, this time on the topic of naming.

If you have time, I also highly recommend the second book chapter on naming, by Bill Kent. Kent was a computer programmer and database designer at IBM and Hewlett-Packard, during the era when the database technologies we use today were first being developed. He thought deeply and carefully about the challenges of data management, which he recognized were not primarily technical challenges.

To read before this class:

  1. Buckland, Michael. “Naming.” In Information and Society, 89–110. MIT Press, 2017. PDF.

  2. Kent, William. “Naming.” In Data and Reality, 41–61. Amsterdam: North-Holland, 1978. PDF.

February 20

The past couple of weeks we’ve looked at how people categorize, classify, and name things of interest. As we’ve seen, this can be hard work, and like other kinds of hard work, people have sought to escape it through automation.

To what extent can the organization of information be automated? Information scholar Julian Warner looks at this question by drawing a distinction between different kinds of semiotic labor.

To read before this class:

  1. Warner, Julian. “Forms of Labour in Information Systems.” Information Research 7, no. 4 (2002).

February 22

People were building systems to automate information organization and retrieval long before the invention of the computer, but the digital computer made possible many techniques that were previously unfeasible. The invention of computing also gave birth to a theory of computation, which gives us a mathematical framework for characterizing and measuring syntactic labor. Today we’ll look at one of the earliest computational techniques to be applied to information organization: Boolean logic.

To read before this class:

  1. Hillis, W. “Nuts and Bolts / Universal Building Blocks.” In The Pattern on the Stone, 1–38. New York: Basic Books, 1998. PDF.
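
Hillis’s chapter title, “Universal Building Blocks,” refers to the fact that a single gate type suffices to construct all of Boolean logic. A minimal sketch of that idea (the function names are mine):

```python
# NAND is a "universal building block": every other Boolean
# function can be wired up from it alone.
def nand(a, b):
    return not (a and b)

def not_(a):
    # A NAND gate with both inputs tied together inverts its input.
    return nand(a, a)

def and_(a, b):
    # AND is just NAND followed by NOT.
    return not_(nand(a, b))

def or_(a, b):
    # By De Morgan's law: a OR b = NOT(NOT a AND NOT b).
    return nand(not_(a), not_(b))

# Verify OR against its truth table.
for a in (False, True):
    for b in (False, True):
        assert or_(a, b) == (a or b)
```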

February 27
The logic of distinctions and sets

Boolean logic (and ultimately, set theory) is the mathematical formalization upon which many of the techniques of information organization are built. In 1937 Edmund Berkeley, a mathematician working at the Prudential life insurance company, recognized the usefulness of Boolean logic for modeling insurance data—even though at the time there were no digital computers to assist with the calculations, only punched card tabulators.

Berkeley would later go on to be a pioneer of computer science, co-founding the Association for Computing Machinery which is still the primary scholarly association for computer scientists.

To read before this class:

  1. Berkeley, Edmund C. “Boolean Algebra (the Technique for Manipulating AND, OR, NOT and Conditions).” The Record 26 part II, no. 54 (1937): 373–414. PDF.
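
Berkeley’s insight can be loosely sketched in modern terms: Boolean conditions on records correspond to set operations, no computer required. The policy records and fields below are invented for illustration:

```python
# Invented insurance records, filtered by Boolean combinations of conditions.
policies = [
    {"holder": "A", "age": 34, "smoker": False, "lapsed": False},
    {"holder": "B", "age": 61, "smoker": True,  "lapsed": False},
    {"holder": "C", "age": 47, "smoker": True,  "lapsed": True},
]

over_50 = {p["holder"] for p in policies if p["age"] > 50}
smokers = {p["holder"] for p in policies if p["smoker"]}
lapsed = {p["holder"] for p in policies if p["lapsed"]}

# Boolean operators correspond to set operations:
# AND -> intersection, OR -> union, NOT -> complement.
high_risk = (over_50 | smokers) - lapsed  # (over 50 OR smoker) AND NOT lapsed
print(high_risk)  # {'B'}
```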

March 1
Ryan is at US2TS

No class.

March 6
Two minute madness

Assignment #1: Midterm class presentation due

Assignment #2: Midterm paper due

Today your midterm papers are due, and each of you will give a two minute, one slide presentation briefly explaining the topic of your paper.

March 8
Midterm exam

The midterm exam will be given in class, and it will cover all the concepts we’ve discussed so far.

March 13
Spring break

No class.

March 15
Spring break

No class.

March 20
The limits of correctness

In computer science, correctness refers to the degree of correspondence between what a computer program actually does, and what it is supposed to do. A “correct” program is one that does what it is supposed to. But what is a computer program “supposed” to do? It may be relatively straightforward to check that a program is correct with respect to a formal model or specification—but there is still the problem of whether that formal model corresponds with the understandings of reality that the program’s designers and users have. Philosopher and computer scientist Brian Cantwell Smith considers these issues in a paper presented to International Physicians for the Prevention of Nuclear War.

To read before this class:

  1. Smith, Brian Cantwell. “The Limits of Correctness.” In Symposium on Unintentional Nuclear War, Fifth Congress of the International Physicians for the Prevention of Nuclear War. Budapest, 1985. PDF.
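
Smith’s distinction can be illustrated with an invented toy example: a program that is provably correct with respect to its specification, while the specification itself misdescribes the world.

```python
# Invented specification: "a flight is bookable if seats_sold < capacity."
# The function below is trivially correct with respect to that spec.
def bookable(seats_sold: int, capacity: int) -> bool:
    return seats_sold < capacity

assert bookable(99, 100)       # correct, per the spec
assert not bookable(100, 100)  # correct, per the spec

# But the spec itself may not match reality: it says nothing about
# crew seats, weight limits, or cancelled flights. No amount of
# program verification can detect that kind of mismatch.
```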

March 22
Statistical models

Information science took a major turn when the designers of information retrieval systems for the military and weapons manufacturers began to explore how to automatically classify and index texts. These explorations led to a new form of modeling: the statistical modeling of language. Once we could create texts digitally and digitize existing texts, we could use those texts to build statistical language models. This process was greatly accelerated by the advent of the World Wide Web, which made collecting large numbers of texts far easier than before.

Text just happened to be one of the first kinds of data that we were able to collect large amounts of. But the same techniques used to statistically model language can also be used to model other phenomena—provided that one can collect large amounts of data generated by these other phenomena. Once people began using the Web for all kinds of things beyond publishing texts, these other kinds of data suddenly became available, opening the door to statistical modeling of nearly everything. Data scientist Cathy O’Neil gives an account of our present-day modeling fever.

To read before this class:

  1. O’Neil, Cathy. “Bomb Parts: What Is a Model?” In Weapons of Math Destruction, 15–31. New York: Crown, 2016. PDF.
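
The statistical modeling of language mentioned above starts from something very simple: counting. A minimal sketch of a unigram model, over a toy corpus invented for illustration:

```python
from collections import Counter

# Estimate word probabilities by counting occurrences in a corpus.
corpus = "the cat sat on the mat the cat slept".split()
counts = Counter(corpus)
total = len(corpus)

def p(word):
    """Maximum-likelihood estimate of a word's probability."""
    return counts[word] / total

print(p("the"))  # 3/9
print(p("cat"))  # 2/9
```

Real language models condition on context (word pairs, phrases, whole documents), but the principle is the same: probabilities estimated from counts over collected text.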

March 27
Modeling text for computation

Computationally analyzing text first requires representing the text in a form that can be computationally manipulated. This form is quite different from the forms we are used to interpreting as readers.

To read before this class:

  1. Manning, Christopher, Prabhakar Raghavan, and Hinrich Schütze. “Boolean Retrieval / The Term Vocabulary and Postings Lists.” In Introduction to Information Retrieval, 1–34. New York: Cambridge University Press, 2008.
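
Manning, Raghavan, and Schütze’s opening chapters center on the inverted index: for every term, a postings list of the documents containing it. A minimal sketch, with a toy collection invented for illustration:

```python
from collections import defaultdict

# Build a tiny inverted index: each term maps to the set of
# document IDs in which it occurs.
docs = {
    1: "new home sales top forecasts",
    2: "home sales rise in july",
    3: "increase in home sales in july",
}

index = defaultdict(set)
for doc_id, text in docs.items():
    for term in text.split():
        index[term].add(doc_id)

# Boolean retrieval: AND is the intersection of postings lists.
def and_query(t1, t2):
    return sorted(index[t1] & index[t2])

print(and_query("home", "july"))  # [2, 3]
```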

March 29
Probability and inductive logic

Statistics is hard. Most people don’t intuitively understand probability, including me, and including the vast majority of scientists who rely on statistical methods. So today we’ll review some of the basics, so we know just enough to be dangerous.

To read before this class:

  1. Hacking, Ian. An Introduction to Probability and Inductive Logic. Cambridge: Cambridge University Press, 2001. PDF.
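
As a taste of why probability defies intuition, here is the classic base-rate example worked with Bayes’ rule (all the numbers are invented for illustration):

```python
# A diagnostic test scenario: rare disease, fairly accurate test.
p_disease = 0.01            # 1% of the population has the disease
p_pos_given_disease = 0.99  # sensitivity: 99% of sick people test positive
p_pos_given_healthy = 0.05  # false positive rate: 5% of healthy people do too

# Total probability of a positive result.
p_pos = (p_pos_given_disease * p_disease
         + p_pos_given_healthy * (1 - p_disease))

# Bayes' rule: P(disease | positive) =
#   P(positive | disease) * P(disease) / P(positive)
p_disease_given_pos = p_pos_given_disease * p_disease / p_pos

print(round(p_disease_given_pos, 3))  # 0.167: far lower than most people guess
```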

April 3
Automatically classifying text

The shift to statistical modeling in information science can be traced to the work of Bill Maron. Maron was an engineer at missile manufacturer Ramo-Wooldridge when he began investigating statistical methods for classifying and retrieving documents. For today we’ll read a classic paper of Maron’s in which he develops the basic ideas behind the Bayesian classifier, a technique that is still widely used today for a variety of automatic classification tasks from spam filtering to face recognition.

To read before this class:

  1. Maron, M. E. “Automatic Indexing: An Experimental Inquiry.” Journal of the ACM 8, no. 3 (July 1961): 404–17.
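
Here is a minimal naive Bayes classifier in the spirit of Maron’s paper, though not his exact method; the training documents and categories are invented:

```python
import math
from collections import Counter, defaultdict

# Invented training data: short documents labeled with a category.
training = [
    ("missiles guidance telemetry", "engineering"),
    ("rocket propulsion telemetry", "engineering"),
    ("budget contracts personnel", "administration"),
    ("personnel payroll budget", "administration"),
]

class_counts = Counter(label for _, label in training)
word_counts = defaultdict(Counter)
vocab = set()
for text, label in training:
    for w in text.split():
        word_counts[label][w] += 1
        vocab.add(w)

def classify(text):
    """Pick the class maximizing log P(class) + sum log P(word | class)."""
    best_label, best_score = None, -math.inf
    for label in class_counts:
        score = math.log(class_counts[label] / len(training))
        total = sum(word_counts[label].values())
        for w in text.split():
            # Laplace smoothing so unseen words don't zero out the score.
            score += math.log((word_counts[label][w] + 1)
                              / (total + len(vocab)))
        if score > best_score:
            best_label, best_score = label, score
    return best_label

print(classify("telemetry guidance"))  # engineering
```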

April 5
Ryan was sick

No class.

April 10
Modeling topics

Topic modeling is a technique for classifying text that does not require one to specify a set of categories ahead of time. For that reason it has become particularly popular among humanities scholars and social scientists interested in exploring large collections of text, such as archival collections or social media platforms. Today we’ll try out some simple topic models.

To read before this class:

  1. Underwood, Ted. “Topic Modeling Made Just Simple Enough.” The Stone and the Shell, April 7, 2012.
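
Underwood explains topic models through their generative story: every topic is a distribution over words, every document is a mixture of topics, and LDA works backwards from the documents to recover the topics. A sketch of the generative side, with topics, words, and proportions invented for illustration:

```python
import random

random.seed(0)

# Each topic is a probability distribution over words.
topics = {
    "seafaring": {"ship": 0.5, "sea": 0.3, "whale": 0.2},
    "domestic":  {"house": 0.4, "garden": 0.4, "tea": 0.2},
}

def generate_document(topic_mixture, length=10):
    """Generate a document from a mixture of topics."""
    words = []
    for _ in range(length):
        # 1. Pick a topic according to the document's mixture.
        topic = random.choices(list(topic_mixture),
                               weights=list(topic_mixture.values()))[0]
        # 2. Pick a word according to that topic's word distribution.
        dist = topics[topic]
        words.append(random.choices(list(dist),
                                    weights=list(dist.values()))[0])
    return words

# A document that is 80% seafaring, 20% domestic.
print(generate_document({"seafaring": 0.8, "domestic": 0.2}))
```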

April 12
Modeling everything

Once a technique for statistical modeling has been developed, it can usually be applied to problems other than those for which it was initially developed. Thus topic modeling, initially developed for the unsupervised classification of text, is easily modified to classify other things like people and organizations.

For today, please read chapter 1 of Applications of Topic Models, “The What and Wherefore of Topic Models.” In addition, please skim one of the following chapters, to get a sense of how topic modeling gets used: “Historical Documents,” “Understanding Scientific Publications,” “Fiction and Literature,” and “Computational Social Science”.

To read before this class:

  1. Boyd-Graber, Jordan, Yuening Hu, and David Mimno. “Applications of Topic Models.” Foundations and Trends in Information Retrieval 11, no. 2–3 (July 20, 2017): 143–296.

April 17
The impact of recommendation

What are the consequences of the shift from 1) information systems that allow us to precisely specify the properties of the things we seek, to 2) information systems that attempt to anticipate our needs or desires and recommend things to us? If a YouTube video, a search result, a fashion brand, a scientific paper, or a restaurant that people discover via a recommendation service becomes popular and successful, is it because that video, result, brand, paper, or restaurant is of high quality, or is it perhaps due in part to the way the recommendation service works? Sociologists Matthew Salganik and Duncan Watts sought to investigate this question by building their own streaming music service.

To read before this class:

  1. Salganik, Matthew J., and Duncan J. Watts. “Leading the Herd Astray: An Experimental Study of Self-Fulfilling Prophecies in an Artificial Cultural Market.” Social Psychology Quarterly 71, no. 4 (December 1, 2008): 338–55.

April 19
Gaming recommendations

There is reason to believe that recommendation services which rely on historical data are biased toward popular items, creating a “rich-get-richer” effect. This can also result in an overall homogenization of consumption—less overall diversity in what people read, watch, buy, eat, etc. This can be true even if individuals find that their use of recommendation services is introducing them to new things!

But a separate issue is that recommendation services which rely on historical data may be fooled into believing that unpopular items are actually popular. In other words, the services can be “gamed” by small groups who are strongly motivated to make something seem popular, in the hopes that this will become a self-fulfilling prophecy.

To read before this class:

  1. Butler, Oobah. “I Made My Shed the Top Rated Restaurant On TripAdvisor.” Vice, December 6, 2017.
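
The “rich-get-richer” dynamic described above can be seen in a minimal simulation in which each new listener picks a song with probability proportional to its current popularity (a Pólya-urn-style sketch; all the numbers are invented):

```python
import random

random.seed(42)

# Ten songs, each starting with one download.
counts = [1] * 10

# Each of 10,000 listeners picks a song in proportion to its
# current download count, then adds their own download.
for _ in range(10_000):
    song = random.choices(range(10), weights=counts)[0]
    counts[song] += 1

counts.sort(reverse=True)
print(counts)  # typically a few songs dominate while most languish
```

Small early differences, amplified by the feedback loop, produce large and largely arbitrary final inequalities, which is exactly what makes such systems gameable.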

April 24
Human decisions and machine predictions

The powerful techniques that information scientists developed for classifying and ranking texts are now being applied to every aspect of our lives. What effects is this having? How can we determine whether information technologies are aiding our decision-making or harming it? Judges make high-impact life-altering and world-altering decisions daily. One kind of high-impact decision judges make is whether to grant bail to persons accused of crimes. What is the potential impact of judges being guided in these decisions by algorithms trained on historical data?

To read before this class:

  1. Kleinberg, Jon, Himabindu Lakkaraju, Jure Leskovec, Jens Ludwig, and Sendhil Mullainathan. “Human Decisions and Machine Predictions.” Working Paper. National Bureau of Economic Research, February 2017.

April 26
Looking back / looking ahead

Assignment #3: Final paper due

Today your final papers are due. We’ll review the ground we covered this semester and look ahead to more advanced information science classes, and information science careers.

May 7
Final exam

The final exam is scheduled for 12 noon on Monday, May 7. It will cover all the concepts from this course.