Foundations of Information Science

UNC SILS, INLS 201, Fall 2017

August 22
Introduction

Today we’ll meet each other, and I’ll explain the plan for the class and how to use the course website. Finally we’ll try out our federated wiki.

If you feel like it, check out the federated wiki videos.

August 24
Document society

Total amount of required reading for this meeting: 3,800 words

Our lives and our societies are structured by and constituted through documents. We’ll look at some examples.

Today’s reading is the first chapter of Michael Buckland’s book on Information and Society. Buckland is a professor at the Berkeley School of Information, and he was my doctoral advisor.

Optional, but highly recommended, is an excerpt from Alva Noë’s book Strange Tools: Art and Human Nature about how playing baseball requires documents. Noë is a philosopher, also at Berkeley, who writes about human consciousness, neuroscience, and art.

📖 To read before this meeting:

  1. Buckland, Michael. “Introduction.” In Information and Society, 1–19. MIT Press, 2017. PDF.
    3,800 words
  2. Noë, Alva. “Art Loops and the Garden of Eden.” In Strange Tools, 29–48. New York: Hill and Wang, a division of Farrar, Straus and Giroux, 2015. PDF.

August 29
Thinking with our eyes and hands

Total amount of required reading for this meeting: 9,200 words

For today we’ll read an article by Bruno Latour, a French philosopher, anthropologist and sociologist. Latour wrote this article to persuade his colleagues in the social sciences that they need to pay more attention to documents and processes of documentation.

This is the first of our more difficult readings, which will mostly be assigned for Tuesdays, giving you five days to read them. On the Thursdays before, I will give you some tips for reading these slightly more difficult texts.

📖 To read before this meeting:

  1. Latour, Bruno. “Visualisation and Cognition: Thinking with Eyes and Hands.” Knowledge and Society: Studies in the Sociology of Culture Past and Present 6 (1986): 1–40. PDF.
    9,200 words
    Reading tips

    Latour uses some unusual terminology in this article. He refers to documents as inscriptions and practices of documentation as inscription procedures. He also refers to documents as immutable mobiles, highlighting what he considers to be two of their most important qualities: immutability and mobility.

    Latour is interested in the relationship between practices of documentation and thinking (cognition). His basic argument is that what may seem like great advances in thought are actually better understood as the emergence of new practices of documentation. Latour focuses primarily on documents as aids to visualization rather than as carriers of information. Thus he begins by discussing the emergence of new visualization techniques, such as linear perspective.

August 31
Information theory

Total amount of required reading for this meeting: 9,100 words

As we began to communicate through wires and over radio waves, engineers sought to understand and describe how such communication happens, in order to design better communication systems. Claude Shannon, an engineer who worked at Bell Labs, developed an influential theory that came to be known as “information theory.” Today we’ll investigate some of the phenomena he described.

Before class you should read the excerpt from Edgar Allan Poe’s The Gold-Bug, and optionally you may also read a short historical account of the development of Shannon’s theory by science writer James Gleick.
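
One quantity at the center of Shannon’s theory is entropy, the average number of bits per symbol needed to encode a message given how often each symbol occurs. The following is only a minimal sketch of that calculation and is not taken from the readings; the function name and the sample messages are assumptions made up for illustration.

```python
import math
from collections import Counter


def entropy_bits(message: str) -> float:
    """Shannon entropy of a message, in bits per symbol."""
    counts = Counter(message)
    total = len(message)
    # Sum p * log2(1/p) over the relative frequency p of each distinct symbol.
    return sum((n / total) * math.log2(total / n) for n in counts.values())


# Hypothetical sample messages, chosen only to show the trend.
print(entropy_bits("aaaaaaaa"))  # a single repeated symbol carries no surprise
print(entropy_bits("abababab"))  # two equally frequent symbols: 1 bit per symbol
print(entropy_bits("abcdefgh"))  # eight equally frequent symbols: 3 bits per symbol
```

The more evenly the symbols are distributed, the higher the entropy. Uneven letter frequencies are also what make a cipher like the one in The Gold-Bug possible to solve by counting.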

📖 To read before this meeting:

  1. Poe, Edgar Allan. “The Cryptograph / The Solution Begun / The Cipher Read.” In The Gold Bug. Chicago, New York [etc.] Rand, McNally & Company, 1902. http://archive.org/details/goldbug00poee_1. PDF.
  2. Gleick, James. “Information Theory.” In The Information, 1st ed., 204–232. New York: Pantheon Books, 2011. PDF.
    9,100 words
    Reading tips

    This chapter from science writer James Gleick’s book The Information is an engaging mini-biography of Claude Shannon, but it is also an accessible introduction to information theory.

September 5
Meaning, signs and codes

Another approach to understanding communication through documents (in addition to Shannon’s theory) is to focus on “signs,” the organization of signs into codes or languages, and the cultures within which signs and codes operate. This approach is known as semiotics. Media scholar John Fiske provides a good basic explanation of what semiotics is and how it differs from information theory.

📖 To read before this meeting:

  1. Fiske, John. “Communication Theory / Meanings, Signs, and Codes.” In Introduction to Communication Studies, 2nd ed., 6–12, 39–46, 56–58, 64–65. London ; New York: Routledge, 1990. PDF.

September 7
Understanding graphics and images

Semiotics, the study of signs, isn’t limited to texts: we can also use it to describe how we understand graphics and images. Cartoonist Scott McCloud shows how.

📖 To read before this meeting:

  1. McCloud, Scott. “The Vocabulary of Comics.” In Understanding Comics, 1st HarperPerennial ed., 24–59. New York: HarperPerennial, 1994. PDF.

September 12
Making distinctions

Total amount of required reading for this meeting: 16,900 words

Until now we’ve mainly focused on documents and the marks on them, and how we understand and interpret those marks. This week we change our focus a bit, to look at how our understanding of the world is structured.

We begin with some excerpts from a book by Eviatar Zerubavel about how we categorize and classify the world around us. Zerubavel is a cognitive sociologist, meaning that he studies how social processes shape our thinking, and he’s written a number of fascinating and accessible books on the topic.

📖 To read before this meeting:

  1. Zerubavel, Eviatar. “Introduction / Islands of Meaning / The Great Divide / The Social Lens.” In The Fine Line, 1–17, 21–24, 61–80. New York: Free Press, 1991. PDF.
    16,900 words
    Reading tips

    Eviatar Zerubavel is a cognitive sociologist, meaning that he studies how social processes shape our thinking, and he’s written a number of fascinating and accessible books on the topic. These are selections from his book The Fine Line about making distinctions in everyday life.

September 14
Classification in everyday life

Total amount of required reading for this meeting: 5,600 words

We all categorize and classify all the time, but we don’t always do it intentionally and systematically. Today we’ll try out a form of systematic classification known as faceted classification.

📖 To read before this meeting:

  1. Hunter, Eric. “What Is Classification? / Classification in an Information System / Faceted Classification.” In Classification Made Simple, 3rd ed. Farnham: Ashgate, 2009. PDF.
    5,600 words

September 19
Scientific classification

Total amount of required reading for this meeting: 11,300 words

Most of us would readily agree that our everyday “folk” classifications are historically contingent and somewhat arbitrary. Yet scientific classification presumably is different: science is the study of reality, and so scientific classifications are “real” in a way that other classifications are not. Today we’ll discuss the extent to which this is true.

The required reading is by Lorraine Daston, a historian of science. She traces the history of scientists’ attempts to classify clouds.

Optionally, you may also read a short (1.5 pages) article on scientific classification by the philosopher of science John Dupré.

📖 To read before this meeting:

  1. Daston, Lorraine. “Cloud Physiognomy.” Representations 135, no. 1 (August 1, 2016): 45–71. https://doi.org/10.1525/rep.2016.135.1.45.
    10,100 words
  2. Dupré, John. “Scientific Classification.” Theory, Culture & Society 23, no. 2–3 (May 1, 2006): 30–32. PDF.
    1,200 words

September 21
Naming

We can’t talk or write about things or kinds of things without giving them names. Unfortunately naming isn’t as easy as it sometimes may seem. Today we’ll investigate the difficulties of agreeing on names.

The required reading is another chapter from Buckland’s Information and Society, this time on the topic of naming.

If you have time, I also highly recommend the second reading, a book chapter on naming by Bill Kent. Kent was a computer programmer and database designer at IBM and Hewlett-Packard, during the era when the database technologies we use today were first being developed. He thought deeply and carefully about the challenges of data management, which he recognized were not primarily technical challenges.

📖 To read before this meeting:

  1. Buckland, Michael. “Naming.” In Information and Society, 89–110. MIT Press, 2017. PDF.
  2. Kent, William. “Naming.” In Data and Reality, 41–61. Amsterdam: North-Holland, 1978. PDF.

September 26
Automation

The past couple of weeks we’ve looked at how people categorize, classify, and name things of interest. As we’ve seen, this can be hard work, and like other kinds of hard work, people have sought to escape it through automation.

To what extent can the organization of information be automated? Information scholar Julian Warner looks at this question by drawing a distinction between different kinds of semiotic labor.

📖 To read before this meeting:

  1. Warner, Julian. “Forms of Labour in Information Systems.” Information Research 7, no. 4 (2002). http://www.informationr.net/ir/7-4/paper135.html.

September 28
Computation

People were building systems to automate information organization and retrieval long before the invention of the computer, but the digital computer made possible many techniques that were previously unfeasible. The invention of computing also gave birth to a theory of computation, which gives us a mathematical framework for characterizing and measuring syntactic labor. Today we’ll look at one of the earliest computational techniques to be applied to information organization: Boolean logic.

📖 To read before this meeting:

  1. Hillis, W. “Nuts and Bolts / Universal Building Blocks.” In The Pattern on the Stone, 1–38. New York: Basic Books, 1998. PDF.

October 3
The logic of distinctions and sets

Total amount of required reading for this meeting: 3,400 words

Boolean logic (and ultimately, set theory) is the mathematical formalization upon which many of the techniques of information organization are built. In 1937 Edmund Berkeley, a mathematician working at the Prudential life insurance company, recognized the usefulness of Boolean logic for modeling insurance data—even though at the time there were no digital computers to assist with the calculations, only punched card tabulators.

Berkeley would later go on to be a pioneer of computer science, co-founding the Association for Computing Machinery, which is still the primary scholarly association for computer scientists.
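
As a rough illustration of how Boolean conditions correspond to operations on sets of records, here is a small sketch. It does not reproduce Berkeley’s notation or examples; the policy records, field names, and the select helper are all hypothetical.

```python
# Hypothetical insurance policy records, invented for illustration only.
policies = [
    {"id": 1, "smoker": True, "age": 52},
    {"id": 2, "smoker": False, "age": 61},
    {"id": 3, "smoker": True, "age": 34},
    {"id": 4, "smoker": False, "age": 29},
]


def select(condition):
    """Return the set of policy ids for which the condition is true."""
    return {p["id"] for p in policies if condition(p)}


everyone = select(lambda p: True)
smokers = select(lambda p: p["smoker"])
over_50 = select(lambda p: p["age"] > 50)

smokers_and_over_50 = smokers & over_50  # AND corresponds to intersection
smokers_or_over_50 = smokers | over_50   # OR corresponds to union
non_smokers = everyone - smokers         # NOT corresponds to complement

# De Morgan's law: NOT (A OR B) is the same set as (NOT A) AND (NOT B).
assert everyone - (smokers | over_50) == (everyone - smokers) & (everyone - over_50)

print(smokers_and_over_50, smokers_or_over_50, non_smokers)
```

Each English condition ("smokers over 50") becomes a set expression, which is the kind of translation between Boolean algebra and English that the reading tips below point to.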

📖 To read before this meeting:

  1. Berkeley, Edmund C. “Boolean Algebra (the Technique for Manipulating AND, OR, NOT and Conditions).” The Record 26 part II, no. 54 (1937): 373–414. PDF.
    3,400 words
    Reading tips

    This article is by Edmund Berkeley, a pioneer of computer science and co-founder of the Association for Computing Machinery, which is still the primary scholarly association for computer scientists. But he wrote this article in 1937, before he became a computer scientist—because computers had yet to exist. At the time he was a mathematician working at the Prudential life insurance company, where he recognized the usefulness of Boolean algebra for modeling insurance data. He published this article in a professional journal for actuaries (people who compile and analyze statistics and use them to calculate insurance risks and premiums).

    Berkeley uses some frightening-looking mathematical notation in parts of this article, but everything he discusses is actually quite simple. The most important parts are:

    pages 373–374, where he gives a simple explanation of Boolean algebra,

    pages 380–381, where he considers practical applications of Boolean algebra, and

    pages 383 on, where he pays close attention to translation back and forth between Boolean algebra and English.

October 5
Modeling knowledge

Total amount of required reading for this meeting: 3,000 words

By the 1970s, computer engineers had successfully built powerful and efficient databases, which they called “relational” databases because of their basis in the way relations are modeled by set theory. (This was Codd’s famous relational model of data.)

But database designers soon realized that having relational database technology was useless without a method for translating real-world situations and processes into the relational model. What they needed was a method for modeling knowledge relationally, and this is what the computer scientist Peter Chen provided in 1976 with his entity-relationship model.

In addition to the Chen article, please read database designer Eric Evans’ short account of what it is like to engage in entity-relationship modeling. For a slightly different account, you can optionally read Stephen Wolfram’s blog post about trying to model chemistry.

📖 To read before this meeting:

  1. Chen, Peter Pin-Shan. “The Entity-Relationship Model—toward a Unified View of Data.” ACM Trans. Database Syst. 1, no. 1 (March 1976): 9–36. https://doi.org/10.1145/320434.320440.
  2. Evans, Eric. “Crunching Knowledge.” In Domain-Driven Design. Boston: Addison-Wesley, 2004. PDF.
    3,000 words
  3. Wolfram, Stephen. “The Practical Business of Ontology: A Tale from the Front Lines.” Stephen Wolfram Blog, July 2017. http://blog.stephenwolfram.com/2017/07/the-practical-business-of-ontology-a-tale-from-the-front-lines/.

October 10
Correctness

In computer science, correctness refers to the degree of correspondence between what a computer program actually does, and what it is supposed to do. A “correct” program is one that does what it is supposed to. But what is a computer program “supposed” to do? It may be relatively straightforward to check that a program is correct with respect to a formal model or specification—but there is still the problem of whether that formal model corresponds with the understandings of reality that the program’s designers and users have. Philosopher and computer scientist Brian Cantwell Smith considers these issues in a paper presented to International Physicians for the Prevention of Nuclear War.

📖 To read before this meeting:

  1. Smith, Brian Cantwell. “The Limits of Correctness.” In Symposium on Unintentional Nuclear War, Fifth Congress of the International Physicians for the Prevention of Nuclear War. Budapest, 1985. PDF.

October 12
Two minute madness

Today your midterm papers are due, and each of you will give a two-minute, one-slide presentation briefly explaining the topic of your paper.

October 12
Midterm paper due

October 17
Midterm exam

The midterm exam will be given in class, and it will cover the formal concepts we’ve covered so far: information theory, semiotics, faceted classification, Boolean logic, and entity-relationship modeling.

October 19
Fall break

October 24
From individuals to populations

There is no reading for today. I’ll return your midterm papers and exams, and we’ll review the first half of the course and look ahead to the second half.

October 26
Statistical models

Information science took a major turn when the designers of information retrieval systems for the military and weapons manufacturers began to explore how to automatically classify and index texts. These explorations led to a new form of modeling: the statistical modeling of language. Once we had the ability to create texts digitally and to digitize existing texts, we could use these texts to build statistical language models, a process that was greatly accelerated by the advent of the World Wide Web, which made the collection of large numbers of texts much easier than it had been before.

Text just happened to be one of the first kinds of data that we were able to collect large amounts of. But the same techniques used to statistically model language can also be used to model other phenomena—provided that one can collect large amounts of data generated by these other phenomena. Once people began using the Web for all kinds of things beyond publishing texts, these other kinds of data suddenly became available, opening the door to statistical modeling of nearly everything. Data scientist Cathy O’Neil gives an account of our present-day modeling fever.

📖 To read before this meeting:

  1. O’Neil, Cathy. “Bomb Parts: What Is a Model?” In Weapons of Math Destruction, 15–31. New York: Crown, 2016. PDF.

October 31
Modeling text for computation

Computationally analyzing text first requires representing the text in a form that can be computationally manipulated. This form is quite different from the forms we are used to interpreting as readers.
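
One common computational representation of text is an inverted index, which maps each term to the list of documents containing it (a postings list). The sketch below is a simplified illustration, not the textbook’s implementation; the toy documents and the tokenize and build_index helpers are assumptions.

```python
import re
from collections import defaultdict

# Hypothetical toy collection, invented for illustration only.
documents = {
    1: "The gold bug bit the man",
    2: "A man found gold",
    3: "The bug was a beetle",
}


def tokenize(text):
    """Lowercase the text and split it into alphabetic terms."""
    return re.findall(r"[a-z]+", text.lower())


def build_index(docs):
    """Map each term to a sorted list of ids of the documents containing it."""
    postings = defaultdict(set)
    for doc_id, text in docs.items():
        for term in tokenize(text):
            postings[term].add(doc_id)
    return {term: sorted(ids) for term, ids in postings.items()}


index = build_index(documents)
print(index["gold"])  # [1, 2]
print(index["bug"])   # [1, 3]

# A Boolean query such as "gold AND bug" intersects the two postings lists.
print(sorted(set(index["gold"]) & set(index["bug"])))  # [1]
```

Notice what the representation discards: word order, capitalization, and punctuation are gone, which is part of why this form differs so much from what we interpret as readers.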

📖 To read before this meeting:

  1. Manning, Christopher, Prabhakar Raghavan, and Hinrich Schütze. “Boolean Retrieval / The Term Vocabulary and Postings Lists.” In Introduction to Information Retrieval, 1–34. New York: Cambridge University Press, 2008.

November 2
Probability and inductive logic

Statistics is hard. Most people don’t intuitively understand probability, including me, and including the vast majority of scientists who rely on statistical methods. So today we’ll review some of the basics, so we know just enough to be dangerous.

📖 To read before this meeting:

  1. Hacking, Ian. An Introduction to Probability and Inductive Logic. Cambridge: Cambridge University Press, 2001. PDF.

November 7
Automatically classifying text

Total amount of required reading for this meeting: 6,500 words

The shift to statistical modeling in information science can be traced to the work of Bill Maron. Maron was an engineer at missile manufacturer Ramo-Wooldridge when he began investigating statistical methods for classifying and retrieving documents. For today we’ll read a classic paper of Maron’s in which he develops the basic ideas behind the Bayesian classifier, a technique that is still widely used today for a variety of automatic classification tasks from spam filtering to face recognition.
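
To give a feel for the general idea, here is a minimal sketch of a naive Bayes text classifier with add-one smoothing. This is a generic textbook-style version, not a reconstruction of Maron’s experiment; the training examples, labels, and function names are hypothetical.

```python
import math
from collections import Counter, defaultdict

# Hypothetical training examples, invented for illustration only.
training = [
    ("win money now", "spam"),
    ("cheap money offer", "spam"),
    ("meeting agenda for monday", "ham"),
    ("lunch on monday", "ham"),
]


def train(examples):
    """Count documents per class and word occurrences per class."""
    class_counts = Counter()
    word_counts = defaultdict(Counter)
    vocabulary = set()
    for text, label in examples:
        class_counts[label] += 1
        for word in text.split():
            word_counts[label][word] += 1
            vocabulary.add(word)
    return class_counts, word_counts, vocabulary


def classify(text, model):
    """Return the class with the highest log posterior score for the text."""
    class_counts, word_counts, vocabulary = model
    total_docs = sum(class_counts.values())
    scores = {}
    for label in class_counts:
        # Start from the log prior: how common the class is overall.
        score = math.log(class_counts[label] / total_docs)
        total_words = sum(word_counts[label].values())
        for word in text.split():
            if word not in vocabulary:
                continue  # ignore words never seen in training
            # Add-one smoothing keeps unseen class/word pairs from zeroing the score.
            count = word_counts[label][word]
            score += math.log((count + 1) / (total_words + len(vocabulary)))
        scores[label] = score
    return max(scores, key=scores.get)


model = train(training)
print(classify("money offer now", model))
print(classify("lunch monday", model))
```

Each word acts as a piece of evidence for or against a category, and the classifier simply adds up that evidence.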

📖 To read before this meeting:

  1. Maron, M. E. “Automatic Indexing: An Experimental Inquiry.” Journal of the ACM 8, no. 3 (July 1961): 404–17. https://doi.org/10.1145/321075.321084.
    6,500 words
    Reading tips

    Bill Maron was an engineer at missile manufacturer Ramo-Wooldridge when he began investigating statistical methods for classifying and retrieving documents. In this paper he describes a method for statistically modeling the subject matter of texts. He introduces the basic ideas behind what is now known as a Bayesian classifier, a technique that is still widely used today for a variety of automatic classification tasks from spam filtering to face recognition.

    Trigger warning: math. The math is relatively basic, and if you’ve studied any probability, you should be able to follow it. But if not, just skip it: Maron explains everything important about his experiment in plain English. Pay extra attention to what he says about “clue words.”

November 9
Modeling topics

Topic modeling is a technique for classifying text that does not require one to specify a set of categories ahead of time. For that reason it has become particularly popular among humanities scholars and social scientists interested in exploring large collections of text, such as archival collections or social media platforms. Today we’ll try out some simple topic models.

📖 To read before this meeting:

  1. Sievert, Carson. “A Topic Model for Movie Reviews.” Accessed August 20, 2017. https://ldavis.cpsievert.me/reviews/reviews.html.

November 14
Modeling everything

Once a technique for statistical modeling has been developed, it can usually be applied to problems other than those for which it was initially developed. Thus topic modeling, initially developed for the unsupervised classification of text, is easily modified to classify other things like people and organizations.

For today, please read chapter 1 of Applications of Topic Models, “The What and Wherefore of Topic Models.” In addition, please read one of the following chapters: “Historical Documents,” “Understanding Scientific Publications,” “Fiction and Literature,” or “Computational Social Science”.

📖 To read before this meeting:

  1. Boyd-Graber, Jordan, Yuening Hu, and David Mimno. “Applications of Topic Models.” Foundations and Trends in Information Retrieval 11, no. 2–3 (July 20, 2017): 143–296. https://doi.org/10.1561/1500000030.

November 16
Ranking, rating and recommending

One of Maron’s motivations for developing statistical methods of information retrieval was the desire to provide ranked results. Ranking results involves not only matching documents to a query, but also ordering those documents from most “relevant” to least “relevant”.
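
The two steps of ranking, matching and then ordering, can be sketched as follows. The scoring rule used here (a raw count of query term occurrences) is a deliberately crude stand-in chosen for illustration, not Maron’s probabilistic method; the documents and function names are hypothetical.

```python
from collections import Counter

# Hypothetical documents, invented for illustration only.
documents = {
    "a": "cloud atlas of cloud forms and cloud names",
    "b": "classification of cloud types",
    "c": "naming birds",
}


def score(query, text):
    """Count how many times the query's terms occur in the text."""
    term_counts = Counter(text.lower().split())
    return sum(term_counts[term] for term in query.lower().split())


def rank(query, docs):
    """Return ids of matching documents, highest score first."""
    scored = [(score(query, text), doc_id) for doc_id, text in docs.items()]
    matches = [(s, doc_id) for s, doc_id in scored if s > 0]  # matching step
    # Ordering step: sort by descending score, breaking ties by document id.
    ordered = sorted(matches, key=lambda pair: (-pair[0], pair[1]))
    return [doc_id for _, doc_id in ordered]


print(rank("cloud classification", documents))  # ['a', 'b']
```

Document "a" outranks "b" only because it repeats the word "cloud", even though "b" matches both query terms. Whether that ordering reflects relevance is exactly the kind of question a ranking algorithm’s designer has to answer.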

Sixty years later, there are algorithmically generated ranked lists for nearly everything. Today we’ll look at one example—university rankings—and discuss possible algorithms for another kind of ranking: your grades in this class.

📖 To read before this meeting:

  1. Ramage, Daniel, Christopher D Manning, and Daniel A McFarland. “Which Universities Lead and Lag? Toward University Rankings Based on Scholarly Output.” In Proc. of NIPS Workshop on Computational Social Science and the Wisdom of the Crowds, 2010. https://people.cs.umass.edu/~wallach/workshops/nips2010css/papers/ramage.pdf.

November 21
Cancelled

November 23
Thanksgiving

November 28
Being ranked and rated

The powerful techniques that information scientists developed for classifying and ranking texts are now being applied to every aspect of our lives. What effects is this having? Sociologist Wendy Espeland examines the effects of one very influential ranking system: the U.S. News & World Report college rankings.

📖 To read before this meeting:

  1. Espeland, Wendy. “Reverse Engineering and Emotional Attachments as Mechanisms Mediating the Effects of Quantification.” Historical Social Research / Historische Sozialforschung 41, no. 2 (156) (2016): 280–304. https://doi.org/10.12759/hsr.41.2016.2.280-304.

November 30
Grading algorithm proposals

50% of your grade in this class will be based on my evaluation of your midterm and final papers. Today you will make proposals for how to determine the other 50% of your grade.

December 5
Looking back / looking ahead

Today your final papers are due. We’ll review the ground we covered this semester and look ahead to more advanced information science classes and information science careers.

December 5
Final paper due

December 9
Final exam

The final exam is scheduled for 12 noon on Saturday, December 9. It will cover all the formal concepts from this course: information theory, semiotics, faceted classification, Boolean logic, entity-relationship modeling, and probabilistic modeling.