Organization of Information

UNC School of Information and Library Science, INLS 520, Spring 2011

January 10
Course Overview

First, we get to know each other a bit. Then, the basics: how the class meetings will be run, how you’ll be evaluated, expectations regarding readings and assignments, and so on. Finally, a brief and high-level overview of the topics that will be covered in the course, and how they are related.

January 17
Martin Luther King Day

No class.

January 24
The Organizing System

This course is an introduction to the conceptual foundations of information organization and retrieval: identifying things, describing things, grouping things, relating things, and selecting things. Traditionally these things have been textual documents in the narrow sense: books, periodicals, letters, administrative records, etc.—the kinds of things organized by libraries and archives. But the principles that underlie organization in libraries and archives can be generalized and applied to organize documents and information more broadly, in a variety of contexts. To emphasize what these contexts have in common, rather than how they differ, we will use the abstract notion of an organizing system.

Explicitly or by default, an organizing system makes many interdependent decisions about the identities of things of interest and the ways they are represented as “information.” The organizing system defines how things will be named and described, how they can be grouped and related, and how people or software can create, transform, combine, compare and otherwise use these names, descriptions, groups and relations. When considering how to make these decisions, we can ask five questions: What is being organized? Why is it being organized? How much is it being organized? When is it being organized? By whom (or by what computational processes) is it being organized?

To read before this class:

  1. Buckland, Michael K. “Information as Thing.” Journal of the American Society for Information Science 42, no. 5 (June 1991): 351–360. http://people.ischool.berkeley.edu/~buckland/thing.html.
    Reading tips

    When we claim to be organizing information, what exactly is it that we are doing? What is information?

  2. Svenonius, Elaine. “Information Organization.” In The Intellectual Foundation of Information Organization, 1–14. Cambridge, Massachusetts: MIT Press, 2000. PDF.
    Reading tips

    Defines the concepts of information and document and proposes a framework for thinking about what systems for organizing information are. Explains why we need a set of principles for designing these systems.

  3. Glushko, Robert J. “1. Foundations for Organizing Systems.” In The Discipline of Organizing, edited by Robert J. Glushko, 3rd ed. O’Reilly, 2015.
    Reading tips

    Introduction to the concept of an organizing system and the five facets along which one can analyze organizing systems.

  4. Morville, Peter, and Louis Rosenfeld. “Organization Systems.” In Information Architecture for the World Wide Web, 53–81. 3rd ed. Sebastopol, California: O’Reilly, 2006. http://proquestcombo.safaribooksonline.com/book/web-development/0596527349/basic-principles-of-information-architecture/i86131__chapterstart__chapter_5.
    Reading tips

    Broad overview of the ways organizing schemes and structures are deployed on Web sites.

January 31
Defining Things: Identity & Identification

Organization of Information in the News due

An organizing system reflects (or produces or enforces) a specific view of the world by defining what the things being organized are. This involves making decisions about when things are to be considered the same or different, i.e. how they are to be identified. Decisions about identity and identification define the basic units of organization, and these decisions have consequences for every other aspect of the organizing system.

One kind of decision regarding the basic units of organization that is commonly encountered by system designers has to do with the distinction between “documents” and “data.” Some designers contrast the two and argue that they cannot or should not be organized using the same terminology, techniques, and tools. Yet in practice there is no clear boundary between the two. Different design decisions can make the things being organized more “document-like” or “data-like,” and there is a continuous spectrum of decisions that can be made and perspectives that can be taken. Often, however, decisions about whether a system is organizing documents or data depend on how things were done in the past (the history of the people and organizations making the decisions) and reflect unstated assumptions about the nature of the domain.

To read before this class:

  1. Glushko, Robert J., Daniel D. Turner, Kimra McPherson, and Jess Hemerly. “3. Resources in Organizing Systems.” In The Discipline of Organizing, edited by Robert J Glushko, 3rd ed. O’Reilly, 2015.
    Reading tips

    An organizing system either explicitly creates, or assumes the existence of, a framework for identifying things.

  2. Kent, William. “Entities.” In Data and Reality, v–19. Amsterdam: North-Holland, 1978. PDF.
    Reading tips

    Through its (explicit or implicit) framework of identity and identification, an organizing system defines a set of entities. These entities are a model, not of reality, but of how some people or organizations process information about reality.

  3. Thomale, Jason. “Interpreting MARC: Where’s the Bibliographic Data?” Code4Lib, no. 11 (2010). http://journal.code4lib.org/articles/3832.
    Reading tips

    A metadata librarian discusses some of the problems that arise when trying to write computer programs to work with MARC (library cataloging) records. Don’t worry about following the details of MARC or the algorithm he describes; instead focus on his analysis of the broader issues involved regarding MARC’s original purpose vs. the needs of today’s library systems.

  4. Glushko, Robert J. “Modeling Methods and Artifacts for Crossing the Data/Document Divide.” In Proceedings of the 2005 IDEAlliance XML Conference. Amsterdam: IDEAlliance, 2005. http://people.ischool.berkeley.edu/~glushko/glushko_files/GlushkoXML2005.pdf.
    Reading tips

    Discusses the differences between treating information as documents vs. treating it as data and argues that rather than a sharp distinction there is a spectrum of modeling choices between the two.

February 7
Describing Things: Intro to Metadata

Descriptions can take a number of forms, depending on who is describing some thing and why they are describing it. What a “good” description is cannot be decided outside of these specific contexts, but we can identify some commonly recurring patterns. An organizing system systematizes the process of description by deciding what aspects of a thing will be described and how descriptions will be recorded. This kind of systematized description is called “metadata.” The level and degree of systemization that a system imposes will depend on its context of use.

When we describe things to one another in everyday life, we choose words freely, yet our choices depend on our particular experiences and social contexts. As a result, we often use different words for the same things and the same words for different things. (And of course, we may have made different decisions about when things are to be considered the same or different.) Because these mismatches can have serious consequences for finding and understanding things, an organizing system usually tries to impose some control on the language used to create metadata. It might seem straightforward to control or standardize the language we use, and much technology exists for attacking the “vocabulary problem,” but technology alone is not a complete solution because language use constantly evolves, as does the world being described.
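One common way to impose vocabulary control is to map each variant term onto a single preferred term. The following is a minimal sketch in Python, not a description of any particular system: the table `PREFERRED_TERMS` and the helpers `normalize_term` and `describe` are hypothetical names invented for illustration, and the entries are made up.

```python
# Hypothetical variant-to-preferred-term table; the entries are illustrative only.
PREFERRED_TERMS = {
    "car": "automobile",
    "auto": "automobile",
    "automobile": "automobile",
    "movie": "motion picture",
    "film": "motion picture",
    "motion picture": "motion picture",
}


def normalize_term(term):
    """Return the preferred term for a free-text term, or None if it is not controlled."""
    key = term.strip().lower()
    return PREFERRED_TERMS.get(key)


def describe(free_text_terms):
    """Split free-text terms into controlled metadata and terms needing human review."""
    controlled, unknown = set(), []
    for term in free_text_terms:
        preferred = normalize_term(term)
        if preferred is None:
            unknown.append(term)
        else:
            controlled.add(preferred)
    return sorted(controlled), unknown


if __name__ == "__main__":
    print(describe(["Car", "film", "Movie", "zeppelin"]))
```

The list of unknown terms illustrates why technology alone is not enough: new words keep arriving, and someone has to decide how the table should change. Note also that mapping “film” to “motion picture” silently ignores its other sense (photographic film), which is the “same words for different things” half of the problem.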

To read before this class:

  1. McPherson, Kimra. “Describing Resources.” In The Discipline of Organizing, edited by Robert J Glushko, 2012. PDF.
    Reading tips

    Once we’ve decided what our basic units of organization are—our entities or “instances”—we need to decide how to describe them. What makes a “good” description?

  2. Branting, L. Karl. “Name Matching in Law Enforcement and Counter-Terrorism.” In ICAIL Workshop on Data Mining, Information Extraction, and Evidentiary Reasoning for Law Enforcement and Counter-Terrorism. Bologna, 2005. http://www.karlbranting.net/papers/icail2005.pdf.
    Reading tips

    A brief discussion of the challenges of matching personal names and how to address them.

  3. Greenberg, Ryan, Kimra McPherson, and Matthew Mayernik. “Metadata: Storing Descriptions.” In The Discipline of Organizing, edited by Robert J Glushko, 2010. PDF.
    Reading tips

    The descriptions that we choose to store for an entity constitute metadata. Why do we store these descriptions? Where do we store them? How do we store them? What language do we use to codify the descriptions? Who decides which descriptions we store and which ones go unrecorded? Note that the latest draft of this chapter is not ready yet, so the file linked here is an older draft that has a different title (“Metadata: Storing Descriptions”).

  4. Gilliland, Anne J. “Setting the Stage.” In Introduction to Metadata, edited by Murtha Baca. 3rd ed. Los Angeles: Getty Publications, 2008. http://www.getty.edu/research/publications/electronic_publications/intrometadata/setting.html.
    Reading tips

    An overview of different types of metadata, their characteristics, and when and how they are used.

February 14
Describing Things: More Metadata

Creating A Vocabulary & Descriptions due

Metadata is particularly important for non-textual and multimedia documents, because it is difficult to index the content of these documents directly. Yet because the meaning of non-textual media is even less fixed than that of text, questions of what to describe when creating metadata become particularly thorny. Thesauri and other aids for professional “metadata makers” are invaluable but rarely used by ordinary people when they tag photos or videos. On the other hand, technology for creating multimedia can easily record contextual metadata at the time of creation, and systems for sharing multimedia can be designed so that documents accumulate metadata over time.

To read before this class:

  1. Doctorow, Cory. Metacrap: Putting the Torch to Seven Straw-Men of the Meta-Utopia, 2001. http://www.well.com/~doctorow/metacrap.htm.
    Reading tips

    A savage critique of the standard approach to metadata, and a call for a different approach.

  2. Harpring, Patricia. “The Language of Images: Enhancing Access to Images by Applying Metadata Schemas and Structured Vocabularies.” In Introduction to Art Image Access: Issues, Tools, Standards, and Strategies, edited by Murtha Baca. Los Angeles: Getty Publications, 2002. http://www.getty.edu/research/publications/electronic_publications/intro_aia/harpring.pdf.
    Reading tips

    How metadata schemas and controlled vocabularies are used to describe, catalogue, and index works of art and architecture, and images of them.

  3. Naaman, Mor, Susumu Harada, QianYing Wang, Hector Garcia-Molina, and Andreas Paepcke. “Context Data in Geo-Referenced Digital Photo Collections.” In Proceedings of the 12th annual ACM international conference on Multimedia - MULTIMEDIA  ’04, 196. New York, New York, USA: ACM Press, 2004. http://portal.acm.org/citation.cfm?doid=1027527.1027573.
    Reading tips

    Given devices that can automatically capture time and location metadata for, e.g., digital photographs, it is possible to generate additional contextual metadata by querying data services.

  4. Shamma, David A, Ryan Shaw, Peter L Shafton, and Yiming Liu. “Watch What I Watch.” In Proceedings of the international workshop on Workshop on multimedia information retrieval - MIR  ’07, 275. New York, New York, USA: ACM Press, 2007. http://portal.acm.org/citation.cfm?doid=1290082.1290120.
    Reading tips

    For certain kinds of media, “implicit” metadata about how the media is being used may be more useful than metadata that attempts to describe the content of the media.

February 21
Kinds of Things: Classes and Types

We impose meaning on the world by “carving it up” into concepts and categories. The conceptual and category boundaries we impose treat some things or instances as equivalent and others as different. Sometimes we do this implicitly and sometimes we do it explicitly. We do this as members of a culture and language community, as individuals, and as members of organizations or institutions. Across these different contexts the mechanisms and outcomes of our categorization efforts differ. In most cases the resulting categories are messier than our information systems and applications can handle, and understanding why and what to do about it are essential skills for information professionals.

A classification is a system of categories, ordered according to a pre-determined set of principles and used to organize a set of instances or entities. This doesn’t mean that the principles are always good or equitable or robust: every classification is biased in one way or another. Classifications are embodied in every information-intensive activity or application. Faceted or dimensional classification is especially useful in domains that don’t have a primary hierarchical structure.
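To make the idea of faceted classification concrete, here is a small Python sketch. It assumes a hypothetical set of items described along three independent facets (material, color, season); the data and the `select` helper are invented for illustration only.

```python
# Hypothetical items described along three independent facets.
ITEMS = [
    {"title": "Red wool sweater", "material": "wool", "color": "red", "season": "winter"},
    {"title": "Blue cotton shirt", "material": "cotton", "color": "blue", "season": "summer"},
    {"title": "Red cotton scarf", "material": "cotton", "color": "red", "season": "winter"},
]


def select(items, **facets):
    """Return titles of items whose facet values match every requested facet."""
    return [
        item["title"]
        for item in items
        if all(item.get(name) == value for name, value in facets.items())
    ]


if __name__ == "__main__":
    print(select(ITEMS, color="red"))
    print(select(ITEMS, color="red", material="cotton"))
```

Because no facet is primary, the same items can be narrowed by color first, by material first, or by both at once, which a single fixed hierarchy would not allow.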

To read before this class:

  1. Glushko, Robert J., Rachelle Annechino, Jess Hemerly, and Longhao Wang. “6. Categorization: Describing Resource Classes and Types.” In The Discipline of Organizing, edited by Robert J. Glushko, 3rd ed. O’Reilly, 2015.
    Reading tips

    What categories are, how they are used in information management, and how changes in the understanding of human cognitive processes have altered theories of categorization over the years.

  2. Glushko, Robert J, Paul P Maglio, Teenie Matlock, and Lawrence W Barsalou. “Categorization in the Wild.” Trends in Cognitive Sciences 12, no. 4 (April 2008): 129–35. http://dx.doi.org/10.1016/j.tics.2008.01.007.
    Reading tips

    In studying categorization, cognitive science has focused primarily on cultural categorization, ignoring individual and institutional categorization. Because recent technological developments have made individual and institutional classification systems much more available and powerful, our understanding of the cognitive and social mechanisms that produce these systems is increasingly important.

  3. Glushko, Robert J., Jess Hemerly, Vivien Petras, Michael Manoochehri, Longhao Wang, Jordan Shedlock, and Daniel Griffin. “7. Classification: Assigning Resources to Categories.” In The Discipline of Organizing, 3rd ed. O’Reilly, 2015.
    Reading tips

    The terms “classification” and “categorization” are often used interchangeably, but they are not the same. Having a set of categories is not sufficient to create a classification. A classification must be principled so that we know where to place new items and entities in accordance with our system.

  4. Wright, Alex. “Our Sentiments, Exactly.” Communications of the ACM 52, no. 4 (April 2009): 14. http://portal.acm.org/citation.cfm?doid=1498765.1498772.
    Reading tips

    Classification is increasingly done by algorithms. Algorithmic classification schemes are usually far more simple and crude than ones designed for human use, but they have the advantage of being able to scale to vast numbers of items. “Sentiment analysis” is an example of algorithmic classification used by companies to assess online opinion as manifested in millions of tweets, posts and updates.

February 28
Relating Things and Linking (Meta)Data

An ontology defines the concepts and terms used to describe and represent an area of knowledge and the relationships among them. A dictionary can be considered a simplistic ontology, and a thesaurus a slightly more rigorous one, but we usually reserve “ontology” for meaning expressed using more formal or structured language. Put another way, an ontology relies on a controlled vocabulary for describing the relationships among concepts and terms.

The “Semantic Web” vision imagines that all information resources and services have ontology-grounded metadata that enables their automated discovery and seamless integration or composition. Whether it is possible “to get there from here” with today’s mostly HTML-encoded Web, or whether “a little semantics goes a long way” are key issues for us to consider.

To read before this class:

  1. Glushko, Robert J., Matthew Mayernik, Alberto Pepe, and Murray Maloney. “5. Describing Relationships and Structures.” In The Discipline of Organizing, edited by Robert J. Glushko, 3rd ed. O’Reilly, 2015.
    Reading tips

    Defines “relationship” and introduces five perspectives for analyzing relationships among resources: semantic, lexical, structural, architectural, and implementation.

  2. Pepper, Steve. The TAO of Topic Maps: Finding the Way in the Age of Infoglut, 2000. http://www.ontopia.net/topicmaps/materials/tao.html. PDF.
    Reading tips

    Topic maps are an ISO standard for describing knowledge structures and associating them with information resources. Topic maps are grounded in a basic model consisting of Topics, Associations, and Occurrences (TAO).

    The ontopia.net site may be down, so don’t overlook the alternative PDF link above.

  3. Ray, Kate. Web 3.0, 2010. http://vimeo.com/11529540.
    Reading tips

    A video about the Semantic Web.

  4. Marshall, Catherine C, and Frank M Shipman. “Which Semantic Web?” In Proceedings of the fourteenth ACM conference on Hypertext and Hypermedia - HYPERTEXT  ’03, 57–66. New York: ACM Press, 2003. http://portal.acm.org/citation.cfm?doid=900051.900063.
    Reading tips

    Examines three different perspectives on the Semantic Web from rhetorical, theoretical, and pragmatic viewpoints, with an eye toward possible outcomes.

  5. Gruber, Tom. “Collective Knowledge Systems: Where the Social Web Meets the Semantic Web.” Web Semantics: Science, Services and Agents on the World Wide Web 6, no. 1 (December 2008): 4–13. http://linkinghub.elsevier.com/retrieve/pii/S1570826807000583.
    Reading tips

    Proposes a class of applications called collective knowledge systems, which unlock the “collective intelligence” of the Social Web with knowledge representation and reasoning techniques of the Semantic Web.

March 7
Spring Break

No class.

March 14
The Domains of Organizing Systems

Classifying due

Now that we’ve discussed the intellectual foundations for organizing systems (description, classification, vocabulary control, relations, and so on), we can apply them to a range of domains in which organizing systems are created. We’ll see the issues and principles that are shared by these domains, and those that distinguish or are characteristic of them. First, we’ll cover the “classical” or “core” domains of library and information science (libraries, archives, and museums) and then move into other domains to discuss organizing systems in scientific, business, and personal contexts.

To read before this class:

  1. Rayward, W. Boyd. “Electronic Information and the Functional Integration of Libraries, Museums, and Archives.” In History and Electronic Artefacts, edited by Edward Higgs, 207–225. Oxford: Oxford University Press, 1998. PDF.
    Reading tips

    Differences in the organizational philosophies of libraries, archives and museums have arisen from differences in the formats and media they deal with. As digital and digitized information gains prominence in these institutions, might these differences disappear, leading to more integrated approaches to organization?

  2. Borgman, Christine L, Jillian C Wallis, and Noel Enyedy. “Little Science Confronts the Data Deluge: Habitat Ecology, Embedded Sensor Networks, and Digital Libraries.” International Journal on Digital Libraries 7, no. 1-2 (July 2007): 17–30. http://www.springerlink.com/index/10.1007/s00799-007-0022-9.
    Reading tips

    While “big science” fields such as physics and astronomy have tools and repositories to handle massive amounts of data, “little science” areas dependent upon fieldwork lack the tools and infrastructure to manage the growing amounts of data generated by new forms of instrumentation.

  3. Millen, David, Jonathan Feinberg, and Bernard Kerr. “Social Bookmarking in the Enterprise.” Queue 3, no. 9 (November 2005): 28. http://portal.acm.org/citation.cfm?doid=1105664.1105676.
    Reading tips

    The apparent success of Internet-based social bookmarking applications raises the question of whether large enterprises or organizations would also benefit from social bookmarking systems. This article describes the design challenges and early lessons learned from a friendly trial of an enterprise-scale social bookmarking system.

  4. Karger, David R, and William Jones. “Data Unification in Personal Information Management.” Communications of the ACM 49, no. 1 (January 2006): 77. http://portal.acm.org/citation.cfm?doid=1107458.1107496.
    Reading tips

    Users need ways to unify, simplify, and consolidate information too often fragmented by location, device, and software application.

March 21
Standards: Syntax, Structure, Value, Process

Until now we’ve focused on developing a conceptual understanding of how to define and describe entities and types of entities when organizing information. However, to progress further, we must familiarize ourselves with some of the various (and constantly evolving) methods and standards for formally expressing these concepts in machine-readable ways, and for guiding information organization processes to ensure consistency and interoperability. We’ll look at four kinds of standards:

Standard syntaxes for data interchange

Syntax governs the arrangement of symbols to create properly formed (but not necessarily meaningful) messages. The dominant syntax standard for encoding data so that it can be exchanged among different organization systems is the eXtensible Markup Language (XML). Read XML Foundations, chapter 2 of Glushko & McGrath’s Document Engineering book. See the slides from class on XML.

An increasingly popular alternative syntax standard is JavaScript Object Notation (JSON). Read JSON: The Fat-Free Alternative to XML.
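As a rough illustration of how the same description can be carried by either syntax, the Python sketch below parses one bibliographic record written both ways, using only the standard library. The element and key names (`book`, `title`, `author`, `year`) are assumptions made for this example and do not come from any metadata standard.

```python
import json
import xml.etree.ElementTree as ET

# The same record in two syntaxes; element and key names are hypothetical.
XML_RECORD = """
<book>
  <title>Data and Reality</title>
  <author>William Kent</author>
  <year>1978</year>
</book>
"""

JSON_RECORD = """
{"title": "Data and Reality", "author": "William Kent", "year": "1978"}
"""


def from_xml(text):
    """Parse the XML record into a dict of child element name -> text."""
    root = ET.fromstring(text.strip())
    return {child.tag: child.text for child in root}


def from_json(text):
    """Parse the JSON record into a dict."""
    return json.loads(text)


if __name__ == "__main__":
    # Different arrangements of symbols, same resulting description.
    print(from_xml(XML_RECORD) == from_json(JSON_RECORD))
```

Both parsers accept any well-formed input, whether or not it makes sense as a description of a book. That is the sense in which syntax standards govern form rather than meaning.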

Standard conceptual or structural models

Conceptual or structural models aim to standardize the way information is conceptualized. They can range from very abstract to very specific. Unlike syntax standards, they do not specify how symbols are arranged but instead specify basic concepts and how they are related to one another. However, conceptual or structural models often specify how their concepts should be represented in one or more syntaxes.

The Resource Description Framework (RDF) is the conceptual model at the foundation of the Semantic Web. It is a very abstract conceptual model because it aims to standardize concepts suitable for modeling any kind of data. Watch Jenn Riley’s RDF for Librarians presentation. See the slides from class on RDF.
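RDF models every statement as a triple of subject, predicate, and object. The sketch below imitates that model with plain Python tuples so the idea can be tried without installing anything. It is not an RDF library and leaves out most of the real model (typed literals, blank nodes, namespaces). The `example.org` identifiers and the `match` and `creator_names` helpers are hypothetical; `dc:title`, `dc:creator`, and `foaf:name` abbreviate widely used Dublin Core and FOAF properties.

```python
# Each statement is a (subject, predicate, object) triple.
# The example.org identifiers are hypothetical placeholders.
TRIPLES = [
    ("http://example.org/book/1", "dc:title", "Data and Reality"),
    ("http://example.org/book/1", "dc:creator", "http://example.org/person/kent"),
    ("http://example.org/person/kent", "foaf:name", "William Kent"),
]


def match(triples, subject=None, predicate=None, obj=None):
    """Return triples matching the pattern; None acts as a wildcard."""
    return [
        (s, p, o)
        for (s, p, o) in triples
        if subject in (None, s) and predicate in (None, p) and obj in (None, o)
    ]


def creator_names(triples, book):
    """Follow dc:creator links from a book, then look up each creator's foaf:name."""
    names = []
    for _, _, person in match(triples, subject=book, predicate="dc:creator"):
        names.extend(o for _, _, o in match(triples, subject=person, predicate="foaf:name"))
    return names


if __name__ == "__main__":
    print(creator_names(TRIPLES, "http://example.org/book/1"))
```

Because the creator is itself identified by a URI rather than recorded as a string, further statements about that person can be added later without touching the statements about the book.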

A higher-level yet still rather abstract conceptual model is the Functional Requirements for Bibliographic Records (FRBR). Read What is FRBR?

The Atom Syndication Format is a model for describing the structure of blog feeds, or any kind of data that can be expressed as a list of time-stamped items. Atom is an example of a structural model that is relatively tightly tied to a specific syntax (XML).

Google recently released the Dataset Publishing Language (DSPL), a new conceptual model for describing quantitative datasets such as demographic statistics. Skim through the DSPL Tutorial.

Finally there are conceptual or structural models for relatively concrete, well-understood kinds of things such as contact information, calendar events, postal addresses, and recipes. Take a look at the model that Google supports for structuring recipes.

Standard values or names: Controlled vocabularies & thesauri

Conceptual or structural models usually define the kinds of attributes that entities have, but may not specify the actual values that those attributes can take. This is the role of value standards, which are usually lists or hierarchies of names or identifiers that can be used as values for certain kinds of attributes.

A very simple example of a value standard is ISO 3166-1, which standardizes 2 and 3-letter codes for identifying countries.
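A value standard like this is typically used to validate and normalize attribute values. The Python sketch below assumes a four-country excerpt of the code list and a hypothetical `to_alpha2` helper; a real system would load the complete, current list from an authoritative source.

```python
# A small excerpt of ISO 3166-1 codes: alpha-2 -> (alpha-3, English short name).
# Only four countries are listed here, purely for illustration.
COUNTRY_CODES = {
    "BR": ("BRA", "Brazil"),
    "FR": ("FRA", "France"),
    "JP": ("JPN", "Japan"),
    "US": ("USA", "United States of America"),
}

# Reverse lookup so that 3-letter codes can be normalized too.
ALPHA3_TO_ALPHA2 = {alpha3: alpha2 for alpha2, (alpha3, _) in COUNTRY_CODES.items()}


def to_alpha2(code):
    """Normalize a 2- or 3-letter code to alpha-2, or raise ValueError if unknown."""
    key = code.strip().upper()
    if key in COUNTRY_CODES:
        return key
    if key in ALPHA3_TO_ALPHA2:
        return ALPHA3_TO_ALPHA2[key]
    raise ValueError(f"not a known country code in this excerpt: {code!r}")


if __name__ == "__main__":
    print(to_alpha2("usa"), to_alpha2(" fr "))
```

Rejecting free text such as “France” is deliberate in this sketch: the point of a value standard is that only the listed identifiers count as valid values.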

More complex value standards resemble (or are) classifications, with faceted and/or hierarchical structure. Browse through the Art & Architecture Thesaurus and the AGROVOC agricultural vocabulary. See the slides from class on the Medical Subject Headings (MeSH).

Standard processes: Rules & best practices

Finally, rules or best practices seek to standardize the processes by which people organize information. Among other things, they may specify when and how the other kinds of standards should be used to describe and organize particular kinds of information.

Although not an official standard, the documentation at Discogs is a good example of what rules for cataloging look like. Read the Quick Start Guide and skim through some of the other database guidelines such as Genres/Styles and Master Release.

An example of a more official standard is Graphic Materials: Rules for Describing Original Items and Historical Collections, which provides rules for describing photographs, posters, cartoons, prints and drawings. Skim through the standard to get a sense of the variety of aspects of the description process that it attempts to standardize. See the slides from class on the Graphic Materials Rules.

March 28
Combining Descriptions and Developing Standards

Building a Taxonomy due

In this lecture we’ll look at the vocabulary problem as it manifests itself across organizational contexts. Within an organization, different information systems might use data models that are incomplete or incompatible with respect to each other, and between organizations these differences can be even greater. Structural, syntactic, and semantic mismatches cause problems when processes and services attempt to span these system and organizational boundaries (for example, to create a complete model of a “customer” or to conduct a business transaction). We’ll consider how technical standards and transformation techniques can help achieve integration and interoperability, but we’ll acknowledge that interoperability is not always possible and that non-technical factors play a huge role in determining the approach.

Guest speaker: Sam Ruby

Sam Ruby is a Senior Technical Staff Member in the Emerging Technologies Group at IBM, a co-chair of the W3C’s HTML Working Group, and a current Director of the Apache Software Foundation. He has been involved in several high-profile standardization efforts, including the Atom Syndication Format and Publishing Protocol and HTML5. He blogs about technology and standards at intertwingly.net.

Update 3/29: Sam has been kind enough to make his slides available on his website.

To read before this class:

  1. Nomorosa, Karen Joy, and J. J. M Ekaterin. “Integration & Interoperability: Combining Descriptions.” In Intellectual Foundations for Information Organization and Information Retrieval, 2010. PDF.
    Reading tips

    When organizations attempt to create “composite” or “extended” applications by combining or integrating information sources and services of their own with those from independent parties, they face a number of challenges. This chapter will discuss the concepts, strategies and technologies needed to meet these challenges.

  2. Manoochehri, Michael, and Robert J Glushko. “Standards and Governance in Organizing Systems.” In Intellectual Foundations for Information Organization and Information Retrieval, 2010. PDF.
    Reading tips

    When standards are successful, they are barely noticed. But the road to a successful standard can be highly contentious, and maintaining success usually requires an ongoing process of governance.

  3. Moore, Cathleen. “Debate Flares over Weblog Standards.” InfoWorld (2003). http://www.infoworld.com/print/8893.
    Reading tips

    In 2003 an effort begun by Sam Ruby to develop a new standard for syndicating weblog content quickly attracted attention, and controversy ensued.

  4. Hicks, Matthew. “RSS Backer Seeks Merged Syndication Format.” eWEEK (2004). http://www.eweek.com/c/a/Messaging-and-Collaboration/RSS-Backer-Seeks-Merged-Syndication-Format/.
    Reading tips

    Large organizations can have a tremendous influence on standards creation processes when they make decisions to support or implement a given standard.

  5. Mazzocchi, Stefano. “Interoperability by Friction.” Stefano’s Linotype, 2008. http://web.archive.org/web/20080521183013/http://www.betaversion.org/~stefano/linotype/news/143/.
    Reading tips

    Stable standards are dead standards.

April 4
Comparing Descriptions: Where IO Meets IR

Although they are typically treated as separate subjects, information organization is fundamentally intertwined with information retrieval. The core problems of information retrieval are finding relevant documents and ordering the found documents according to relevance. The IR model explains how these problems are solved by (1) designing the representations of queries and documents in the collection being searched and (2) specifying the information used, and the calculations performed, that order the retrieved documents by relevance. Different IR models solve these problems in different ways; the better they solve them, the more computationally complex they are, so there are tradeoffs. The simplest, most familiar, and least effective model is the Boolean model: representations are sets of index terms, and relevance is calculated in an all-or-none way using set operations and Boolean algebra.
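The Boolean model can be sketched in a few lines of Python. The example assumes a hypothetical three-document collection and whitespace tokenization. It builds an inverted index (each term mapped to the set of documents containing it) and answers queries with set operations.

```python
from collections import defaultdict

# Hypothetical three-document collection.
DOCS = {
    1: "metadata for digital photographs",
    2: "classification of digital documents",
    3: "metadata and classification standards",
}


def build_index(docs):
    """Build an inverted index: term -> set of ids of documents containing it."""
    index = defaultdict(set)
    for doc_id, text in docs.items():
        for term in text.lower().split():
            index[term].add(doc_id)
    return index


INDEX = build_index(DOCS)


def postings(term):
    """Return the set of documents containing the term (empty if unseen)."""
    return INDEX.get(term.lower(), set())


# Boolean operators are plain set operations; a document either matches or does not.
both = postings("metadata") & postings("classification")    # AND
either = postings("metadata") | postings("classification")  # OR
without = postings("digital") - postings("metadata")        # AND NOT

if __name__ == "__main__":
    print(both, either, without)
```

Every document in a result set matches equally: nothing in the model says that one match is more relevant than another, which is the all-or-none limitation described above.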

To read before this class:

  1. Rao, Ramana. “From IR to Search, and Beyond.” Queue 2, no. 3 (May 2004): 66. http://dl.acm.org/citation.cfm?doid=1005062.1005070.
    Reading tips

    A brief history of information retrieval, from its beginnings in the 1960s, through Xerox PARC in the 1980s, to current mainstream uses of information on the Internet. Highlights the contrast between narrowly defined technological approaches and a broader understanding of the full problem set and the possible solutions.

  2. Buckland, Michael, and Christian Plaunt. “On the Construction of Selection Systems.” Library Hi Tech 12, no. 4 (1994): 15–28. PDF.
    Reading tips

    An examination of the structure and components of information storage and retrieval systems and information filtering systems. Argues that all selection systems can be represented in terms of combinations of a set of basic components. The components are of only two types: representations of data objects and functions that operate on them.

  3. Manning, Christopher D, Prabhakar Raghavan, and Hinrich Schütze. “Boolean Retrieval.” In Introduction to Information Retrieval. Cambridge, UK: Cambridge University Press, 2008. http://nlp.stanford.edu/IR-book/pdf/01bool.pdf.
    Reading tips

    Introduces inverted indexes and shows how simple Boolean queries can be processed using such indexes.

April 11
Algorithmic Description Creation & Comparison

The Boolean model represents documents as a set of index terms that are either present or absent. This binary notion doesn’t fit our intuition that terms differ in how much they suggest what the document is about. Vector models capture this notion by representing documents and queries as word or term vectors and assigning weights that can capture term counts within a document or the importance of the term in discriminating the document in the collection. Vector algebra provides a model for computing similarity between queries and documents and between documents because of the assumption that “closeness in space” means “closeness in meaning”.
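A minimal Python sketch of a vector model follows. It assumes one common weighting scheme (term frequency multiplied by a logarithmic inverse document frequency) and cosine similarity; many other weightings exist, and the tiny collection and the function names are hypothetical.

```python
import math
from collections import Counter

# Hypothetical collection, already tokenized.
DOCS = {
    "d1": "metadata metadata photographs".split(),
    "d2": "classification documents".split(),
    "d3": "metadata classification standards".split(),
}


def idf(term, docs):
    """Inverse document frequency: terms found in fewer documents get more weight."""
    df = sum(1 for tokens in docs.values() if term in tokens)
    return math.log(len(docs) / df) if df else 0.0


def vectorize(tokens, docs):
    """Represent a token list as a sparse vector of term -> tf * idf weights."""
    counts = Counter(tokens)
    return {term: tf * idf(term, docs) for term, tf in counts.items()}


def cosine(u, v):
    """Cosine of the angle between two sparse vectors (1.0 means same direction)."""
    dot = sum(weight * v.get(term, 0.0) for term, weight in u.items())
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0


def rank(query, docs):
    """Return document ids ordered by decreasing similarity to the query."""
    q = vectorize(query.split(), docs)
    scores = {doc_id: cosine(q, vectorize(tokens, docs)) for doc_id, tokens in docs.items()}
    return sorted(scores, key=scores.get, reverse=True)


if __name__ == "__main__":
    print(rank("metadata photographs", DOCS))
```

Unlike the Boolean model, this produces a graded ordering: a document sharing only one weakly discriminating term with the query still scores above a document sharing none, but below one that shares the rarer term as well.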

Because the calculations used by simple vector models use the frequency of words and word forms, they can’t distinguish different meanings of the same word (polysemy) and they can’t detect equivalent meaning expressed with different words (synonymy). The dimensionality of the space in the simple vector model is the number of different terms in it, but the “semantic dimensionality” of the space is the number of distinct topics represented in it, which is much smaller.

Somewhat paradoxically, the reduced-dimensionality vectors that define “topic space” rather than “term space” are calculated using the statistical co-occurrence of the terms in the collection, so the process is completely automatable: it requires no humanly constructed dictionaries, knowledge bases, ontologies, semantic networks, grammars, syntactic parsers, morphologies, or anything else that represents “language”. For this reason these approaches are said to extract “latent” semantics.
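
One common way to compute such a reduced space, and the one used by latent semantic indexing, is a truncated singular value decomposition of the term-document matrix. The notation below is an assumption made for illustration: A is the matrix with t terms as rows and d documents as columns, and k is the chosen number of "topic" dimensions, much smaller than t.

```latex
% Rank-k approximation of the t x d term-document matrix A
A \;\approx\; A_k \;=\; U_k \,\Sigma_k\, V_k^{\mathsf{T}},
\qquad U_k \in \mathbb{R}^{t \times k},\;
\Sigma_k \in \mathbb{R}^{k \times k},\;
V_k \in \mathbb{R}^{d \times k}

% A query q (a vector in term space) is mapped into the k-dimensional topic space
\hat{q} \;=\; \Sigma_k^{-1}\, U_k^{\mathsf{T}}\, q
```

Each row of V_k gives a document's coordinates in topic space, so queries and documents can be compared there with the same cosine measure used in term space.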

Guest speakers: Jane Greenberg and Hollie White will discuss automatic metadata generation in the context of their HIVE system.

To read before this class:

  1. Greenberg, Jane. “Metadata Generation: Processes, People and Tools.” Bulletin of the American Society for Information Science and Technology 29, no. 2 (January 2005): 16–19. http://doi.wiley.com/10.1002/bult.269.
    Reading tips

    This article sketches a framework for thinking about how human and automatic metadata generation can complement one another.

  2. Hlava, Marjorie M. “Automatic Indexing: A Matter of Degree.” Bulletin of the American Society for Information Science and Technology 29, no. 1 (January 2005): 12–15. http://doi.wiley.com/10.1002/bult.261.
    Reading tips

    A basic overview of automatic text classification, indexing and categorization systems.

  3. Manning, Christopher D, Prabhakar Raghavan, and Hinrich Schütze. “Scoring, Term Weighting and the Vector Space Model.” In Introduction to Information Retrieval. Cambridge, UK: Cambridge University Press, 2008. http://nlp.stanford.edu/IR-book/pdf/06vect.pdf.
    Reading tips

    Simple Boolean retrieval usually cannot sufficiently narrow a set of documents to be useful. Thus IR systems usually rely on ranking the set of retrieved documents by assigning a score to each document given a query.

  4. Yu, Clara, John Cuadrado, Maciej Ceglowski, and J. Scott Payne. Patterns in Unstructured Data: Discovery, Aggregation, and Visualization. National Institute for Technology in Liberal Education, 2002. http://www.knowledgesearch.org/lsi/.
    Reading tips

    Latent semantic indexing adds an important step to the document indexing process. In addition to recording which keywords a document contains, the method examines the document collection as a whole, to see which other documents contain some of those same words. Read only the sections “Latent Semantic Indexing” through “Applications of LSI.”

April 18
Structural & Social Metadata

Computationally Representing Text due

Structure-based IR models combine representations of terms with information about structures within documents (i.e., hierarchical organization) and between documents (i.e., hypertext links and other explicit relationships). This structural information tells us which documents and parts of documents are most important and relevant, and provides additional justification for determining relevance and ordering a result set. The nature and pattern of links between documents has been studied for almost a century by “bibliometricians” who measured patterns of scientific citation to quantify the influence of specific documents or authors. The concepts and techniques of citation analysis seem applicable to the web since we can view it as a network of interlinked articles, and Google’s “PageRank” algorithm is now the classic example. With the advent of “social media” there is now a wealth of new potential sources of structural metadata.
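
To make the link-analysis idea concrete, here is a simplified Python sketch of PageRank computed by repeated redistribution of scores along links. It illustrates the published idea only, not Google's production system. The damping value of 0.85 is the commonly cited default, while the iteration count, the handling of pages with no outgoing links, and the four-page toy graph are assumptions made for this example.

```python
def pagerank(links, damping=0.85, iterations=50):
    """Estimate PageRank scores.

    links maps each page to the list of pages it links to.
    Returns a dict mapping each page to a score; scores sum to 1.
    """
    pages = sorted(set(links) | {p for targets in links.values() for p in targets})
    n = len(pages)
    rank = {p: 1.0 / n for p in pages}  # start from a uniform distribution
    for _ in range(iterations):
        # Every page receives a small baseline share regardless of links.
        new_rank = {p: (1.0 - damping) / n for p in pages}
        for page in pages:
            targets = links.get(page, [])
            if targets:
                # A page splits its damped score evenly among the pages it cites.
                share = damping * rank[page] / len(targets)
                for t in targets:
                    new_rank[t] += share
            else:
                # Assumption: a page with no outgoing links spreads its score evenly.
                for t in pages:
                    new_rank[t] += damping * rank[page] / n
        rank = new_rank
    return rank

# Hypothetical four-page web: "c" is cited by three pages, "d" by none.
links = {"a": ["b", "c"], "b": ["c"], "c": ["a"], "d": ["c"]}
scores = pagerank(links)
```

As in citation analysis, a page's score depends on who links to it: "c" ends up with the highest score, "a" benefits from being the only page "c" links to, and "d", which nothing links to, keeps only the baseline share.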

To read before this class:

  1. Diaz, Alejandro M. “Through the Google Goggles: Sociopolitical Bias in Search Engine Design.” Stanford University, 2005. http://epl.scu.edu/~stsvalues/readings/Diaz_thesis_final.pdf#page=55.
    Reading tips

    The most famous and influential exploitation of “structural metadata” is PageRank, the secret sauce behind Google search (and now all other major search engines). While the idea behind PageRank is simple, its implications as a system for mediating access to information are not. Read only chapters 4 and 5.

  2. MacRoberts, M. H, and Barbara R MacRoberts. “Problems of Citation Analysis.” Scientometrics 36, no. 3 (July 1996): 435–444. http://www.springerlink.com/index/10.1007/BF02129604.
    Reading tips

    As this examination of citation analysis shows, interpretations can vary widely as to what “links” in a given structure mean.

  3. Iskold, Alex. Social Graph: Concepts and Issues, 2007. http://www.readwriteweb.com/archives/social_graph_concepts_and_issues.php.
    Reading tips

    With the success of Facebook, a new buzzword appeared: the “social graph.”

  4. Dixon, Chris. Here’s What Comes After The Social Graph, 2010. http://www.businessinsider.com/heres-what-comes-after-the-social-graph-2010-7.
    Reading tips

    But, as usual, technologists and capitalists are already in search of the next “graph.”

April 25
Course Review

We’ll spend the last class wrapping up loose ends and reviewing for the final exam. Be sure to review the list of terms and concepts you should know.

April 29
Final Exam

Final Exam due

The final exam will be from 4-7PM on Friday, April 29th, in 304 Murphey Hall. It will be open-book, open-notes, but not open-Web.