First, we get to know each other a bit. Then, the basics: how the class meetings will be run, how you’ll be evaluated, expectations regarding readings and assignments, and so on. Finally, a brief and high-level overview of the topics that will be covered in the course, and how they are related.
This course is an introduction to the conceptual foundations of information organization and retrieval: identifying things, describing things, grouping things, relating things, and selecting things. Traditionally these things have been textual documents in the narrow sense: books, periodicals, letters, administrative records, etc.—the kinds of things organized by libraries and archives. But the principles that underlie organization in libraries and archives can be generalized and applied to organize documents and information more broadly, in a variety of contexts. To emphasize what these contexts have in common, rather than how they differ, we will use the abstract notion of an organizing system.
Explicitly or by default, an organizing system makes many interdependent decisions about the identities of things of interest and the ways they are represented as “information.” The organizing system defines how things will be named and described, how they can be grouped and related, and how people or software can create, transform, combine, compare and otherwise use these names, descriptions, groups and relations. When considering how to make these decisions, we can ask five questions: What is being organized? Why is it being organized? How much is it being organized? When is it being organized? By whom (or by what computational processes) is it being organized?
To read before this class:
Introduction to the concept of an organizing system and the five facets along which one can analyze organizing systems.
Defines the concepts of information and document and proposes a framework for thinking about what systems for organizing information are. Explains why we need a set of principles for designing these systems.
Broad overview of the ways organizing schemes and structures are deployed on Web sites.
Assignment #1 Organization of Information in the News due
To read before this class:
When we take an expansive view of organizing systems we can identify four activities that all organizing systems support or perform: selecting resources, organizing resources, designing resource-based interactions, and maintaining resources. These four activities are deeply ingrained in curricula and practice for organizing systems like libraries and museums, and they can be extended to other kinds of organizing systems employed by individuals, groups and enterprises in various domains.
To read before this class:
The great variety in what individuals, groups, and enterprises do is reflected in the huge breadth of organizing systems we encounter and the diversity of the resources that these systems organize. Even so, because every organizing system has a collection of resources at its foundation and shares some of the same general purposes and goals, organizing systems tend to follow patterns in how they organize resources, the interactions they support, and how they are implemented and operated.
You may already have some familiarity with XML, but perhaps mostly as a data format for applications or programming. In IO and IR it is essential to take a more abstract and intellectual view of XML and understand how it represents structured information models. XML encourages the separation of content from presentation, which is an important principle of information architecture. Encoding information in XML is an investment in information organization that pays off “downstream” in IR and language processing applications.
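As a small illustration of separating content from presentation, the sketch below parses one XML record and renders it two different ways. The element names (`book`, `title`, `author`), the record itself, and both rendering functions are invented for this example and are not taken from the readings.

```python
import xml.etree.ElementTree as ET

# Hypothetical record: the XML says what the content is, not how it looks.
RECORD = "<book><title>An Example Title</title><author>A. Author</author></book>"

def as_citation(book):
    """One presentation: a plain-text citation line."""
    return f"{book.findtext('author')}. {book.findtext('title')}."

def as_html(book):
    """Another presentation of the same content: an HTML fragment."""
    return f"<p><em>{book.findtext('title')}</em> by {book.findtext('author')}</p>"

book = ET.fromstring(RECORD)
print(as_citation(book))
print(as_html(book))
```

Neither rendering required changing the record, which is the sense in which the encoding effort pays off downstream.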
To read before this class:
No class.
An organizing system reflects (or produces or enforces) a specific view of the world by defining what the things being organized are. This involves making decisions about when things are to be considered the same or different, i.e. how they are to be identified. Decisions about identity and identification define the basic units of organization, and these decisions have consequences for every other aspect of the organizing system.
To read before this class:
An organizing system either explicitly creates, or assumes the existence of, a framework for identifying things.
Through its (explicit or implicit) framework of identity and identification, an organizing system defines a set of entities. These entities are a model, not of reality, but of how some people or organizations process information about reality.
When we describe things to one another in everyday life, we choose words freely yet our choices depend on our particular experiences and social contexts. As a result, we often use different words for the same things and the same words for different things. (And of course, we may have made different decisions about when things are to be considered the same or different.) Because these mismatches can have serious consequences for finding and understanding things, an organizing system usually tries to impose some control on the language used to create metadata. It might seem straightforward to control or standardize the language we use, and much technology exists for attacking the “vocabulary problem,” but technology alone is not a complete solution because language use constantly evolves, as does the world being described.
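One simple form of vocabulary control maps each variant word to a single preferred term. The sketch below assumes a tiny, made-up vocabulary; the `PREFERRED` table and the `normalize` function are hypothetical names chosen for illustration.

```python
# Hypothetical controlled vocabulary: each variant maps to one preferred term.
PREFERRED = {
    "car": "automobile",
    "auto": "automobile",
    "automobile": "automobile",
    "film": "motion picture",
    "movie": "motion picture",
}

def normalize(tag):
    """Return the preferred term for a tag, or None if the tag is uncontrolled."""
    return PREFERRED.get(tag.strip().lower())

tags = ["Car", "auto ", "movie", "selfie"]
print([normalize(t) for t in tags])
```

The last tag has no preferred term, a small instance of the limitation described above: new words appear faster than any fixed list can be updated.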
To read before this class:
Once we’ve decided what our basic units of organization are—our entities or “instances”—we need to decide how to describe them. What makes a “good” description?
A brief discussion of the challenges of matching personal names and how to address them.
Descriptions can take a number of forms, depending on who is describing some thing and why they are describing it. What a “good” description is cannot be decided outside of these specific contexts, but we can identify some commonly recurring patterns. An organizing system systematizes the process of description by deciding what aspects of a thing will be described and how descriptions will be recorded. This kind of systematized description is called “metadata.” The level and degree of systemization that a system imposes will depend on its context of use.
To read before this class:
The descriptions that we choose to store for an entity constitute metadata. Why do we store these descriptions? Where do we store them? How do we store them? What language do we use to codify the descriptions? Who decides which descriptions we store and which ones go unrecorded? Note that the latest draft of this chapter is not ready yet, so the file linked here is an older draft that has a different title (“Metadata: Storing Descriptions”).
A savage critique of the standard approach to metadata, and a call for a different approach.
A survey of research looking at metadata quality, measurement, and evaluation criteria and best practices for improving metadata quality.
Assignment #2 Creating a Vocabulary & Descriptions due
Metadata is particularly important for non-textual and multimedia documents, because it is difficult to index the content of these documents directly. Yet because the meaning of non-textual media is even less fixed than that of text, questions of what to describe when creating metadata become particularly thorny. Thesauri and other aids for professional “metadata makers” are invaluable but rarely used by ordinary people when they tag photos or videos. On the other hand, technology for creating multimedia can easily record contextual metadata at the time of creation, and systems for sharing multimedia can be designed so that documents accumulate metadata over time.
To read before this class:
How metadata schemas and controlled vocabularies are used to describe, catalogue, and index works of art and architecture, and images of them.
Given devices that can automatically capture time and location metadata for, e.g., digital photographs, it is possible to generate additional contextual metadata by querying data services.
For certain kinds of media, “implicit” metadata about how the media is being used may be more useful than metadata that attempts to describe the content of the media.
We impose meaning on the world by “carving it up” into concepts and categories. The conceptual and category boundaries we impose treat some things or instances as equivalent and others as different. Sometimes we do this implicitly and sometimes we do it explicitly. We do this as members of a culture and language community, as individuals, and as members of organizations or institutions. Across these different contexts the mechanisms and outcomes of our categorization efforts differ. In most cases the resulting categories are messier than our information systems and applications can handle, and understanding why and what to do about it are essential skills for information professionals.
To read before this class:
What categories are, how they are used in information management, and how changes in the understanding of human cognitive processes have altered theories of categorization over the years.
In studying categorization, cognitive science has focused primarily on cultural categorization, ignoring individual and institutional categorization. Because recent technological developments have made individual and institutional classification systems much more available and powerful, our understanding of the cognitive and social mechanisms that produce these systems is increasingly important.
A classification is a system of categories, ordered according to a pre-determined set of principles and used to organize a set of instances or entities. This doesn’t mean that the principles are always good or equitable or robust: every classification is biased in one way or another. Classifications are embodied in every information-intensive activity or application. Faceted or dimensional classification is especially useful in domains that don’t have a primary hierarchical structure.
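To see why faceted classification suits domains without one primary hierarchy, consider the sketch below. It assumes three invented items described along three independent facets; the facet names and the `select` helper are hypothetical.

```python
# Hypothetical items described along three independent facets.
ITEMS = [
    {"name": "red wool sweater", "color": "red", "material": "wool", "type": "sweater"},
    {"name": "blue wool scarf", "color": "blue", "material": "wool", "type": "scarf"},
    {"name": "red cotton scarf", "color": "red", "material": "cotton", "type": "scarf"},
]

def select(items, **facets):
    """Keep the names of items whose facet values match every given constraint."""
    return [item["name"] for item in items
            if all(item.get(facet) == value for facet, value in facets.items())]

print(select(ITEMS, material="wool"))
print(select(ITEMS, color="red", type="scarf"))
```

No facet has to come first: the same collection can be narrowed by material, by color, or by any combination, which a single fixed hierarchy would not allow.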
To read before this class:
The terms “classification” and “categorization” are often used interchangeably, but they are not the same. Having a set of categories is not sufficient to create a classification. A classification must be principled so that we know where to place new items and entities in accordance with our system.
An ontology defines the concepts and terms used to describe and represent an area of knowledge and the relationships among them. A dictionary can be considered a simplistic ontology, and a thesaurus a slightly more rigorous one, but we usually reserve “ontology” for meaning expressed using more formal or structured language. Put another way, an ontology relies on a controlled vocabulary for describing the relationships among concepts and terms.
To read before this class:
Defines “relationship” and introduces five perspectives for analyzing relationships among resources: semantic, lexical, structural, architectural, and implementation.
Topic maps are an ISO standard for describing knowledge structures and associating them with information resources. Topic maps are grounded in a basic model consisting of Topics, Associations, and Occurrences (TAO).
The ontopia.net site may be down, so don’t overlook the alternative PDF link above.
Assignment #3 Classifying due
The “Semantic Web” vision imagines that all information resources and services have ontology-grounded metadata that enables their automated discovery and seamless integration or composition. Whether it is possible “to get there from here” with today’s mostly HTML-encoded Web, or whether “a little semantics goes a long way” are key issues for us to consider.
To read before this class:
A video about the Semantic Web.
Examines three different perspectives on the Semantic Web from rhetorical, theoretical, and pragmatic viewpoints, with an eye toward possible outcomes.
Specifies usage scenarios, goals and requirements for a web ontology language. An ontology formally defines a common set of terms that are used to describe and represent a domain. Ontologies can be used by automated tools to power advanced services such as more accurate web search, intelligent software agents and knowledge management.
In this podcast Karen Coyle explains why libraries are keen on the idea of using Linked Data to produce more value from their cataloging efforts.
Although the two are typically treated as separate subjects, information organization is fundamentally intertwined with information retrieval. The core problems of information retrieval are finding relevant resources and ordering the found resources according to relevance. The IR model explains how these problems are solved by (1) designing the representations of queries and resources in the collection being searched and (2) specifying the information used, and the calculations performed, that order the retrieved resources by relevance.
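The two parts of an IR model can be sketched in a few lines. This is a deliberately naive illustration, not any particular system's method: the representation is a bag of words, the relevance calculation is a count of matching terms, and the three documents are invented.

```python
from collections import Counter

def represent(text):
    """Representation: a bag of lowercase words with their counts."""
    return Counter(text.lower().split())

def score(query, doc):
    """Calculation: add up how often each query term occurs in the document."""
    return sum(doc[term] for term in query)

# Hypothetical collection of three very short documents.
DOCS = {
    "d1": "organizing information in libraries",
    "d2": "information retrieval ranks information resources",
    "d3": "museum collections",
}

def search(query_text):
    """Find documents that match at all, then order them by score."""
    query = represent(query_text)
    scored = {name: score(query, represent(text)) for name, text in DOCS.items()}
    hits = [name for name in scored if scored[name] > 0]
    return sorted(hits, key=lambda name: scored[name], reverse=True)

print(search("information retrieval"))
```

Changing either `represent` or `score` gives a different IR model, which is why decisions about how resources are described bear directly on what can be retrieved.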
To read before this class:
An examination of the structure and components of information storage and retrieval systems and information filtering systems. Argues that all selection systems can be represented in terms of combinations of a set of basic components. The components are of only two types: representations of data objects and functions that operate on them.
A brief history of information retrieval, from its beginnings in the 1960s, through Xerox PARC in the 1980s, to current mainstream uses of information on the Internet. Highlights the contrast between narrowly defined technological approaches and a broader understanding of the full problem set and the possible solutions.
No class.
No class.
Structure-based IR models combine representations of terms with information about structures within documents (i.e., hierarchical organization) and between documents (i.e., hypertext links and other explicit relationships). This structural information tells us what documents and parts of documents are most important and relevant, and provides additional justification for determining relevance and ordering a result set. The nature and pattern of links between documents has been studied for almost a century by “bibliometricians” who measured patterns of scientific citation to quantify the influence of specific documents or authors. The concepts and techniques of citation analysis seem applicable to the web since we can view it as a network of interlinked articles, and Google’s “PageRank” algorithm is now the classic example. With the advent of “social media” there is now a wealth of new potential sources of structural metadata.
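The core idea of link-based ranking can be sketched as repeated score passing: each page divides its current score among the pages it links to. The code below is a simplified textbook-style version, not Google's production algorithm; the damping value of 0.85, the iteration count, and the three-page link graph are assumptions made for illustration.

```python
def pagerank(links, damping=0.85, iterations=50):
    """Simplified PageRank computed by repeated score passing."""
    pages = list(links)
    rank = {page: 1.0 / len(pages) for page in pages}
    for _ in range(iterations):
        # Every page starts each round with a small baseline score.
        new = {page: (1.0 - damping) / len(pages) for page in pages}
        for page, outlinks in links.items():
            # A page with no outlinks spreads its score over all pages.
            targets = outlinks or pages
            share = damping * rank[page] / len(targets)
            for target in targets:
                new[target] += share
        rank = new
    return rank

# Hypothetical three-page web: A and B both link to C, and C links back to A.
LINKS = {"A": ["C"], "B": ["C"], "C": ["A"]}
ranks = pagerank(LINKS)
print(sorted(ranks, key=ranks.get, reverse=True))
```

Page C comes out on top because two pages link to it, and A outranks B because it receives a link from the highly ranked C; nothing about the pages' text is consulted.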
To read before this class:
The most famous and influential exploitation of “structural metadata” is PageRank, the secret sauce behind Google search (and now all other major search engines). While the idea behind PageRank is simple, its implications as a system for mediating access to information are not. Read only chapters 4 and 5.
As this examination of citation analysis shows, interpretations can vary widely as to what “links” in a given structure mean.
With the success of Facebook, a new buzzword appeared: the “social graph.”
Today we will have a guest: Jean Godby, Senior Research Scientist at OCLC. Jean has spent over twenty years exploring data-oriented research interests in information retrieval, library metadata standards, data exchange between libraries and publishers, knowledge organization, automated content analysis, and data transformation. We will talk with her about how libraries are managing the transition to a post-MARC world, and the possibilities that lie therein.
To read before this class:
Assignment #4 Building a Taxonomy due
Today we’ll consider the vocabulary problem as it manifests itself across organizational contexts. Within an organization, different information systems might use data models that are incomplete or incompatible with respect to each other, and between organizations these differences can be even greater. Structural, syntactic, and semantic mismatches cause problems when processes and services attempt to span these system and organizational boundaries (for example, to create a complete model of a “customer” or to conduct a business transaction). We’ll consider how technical standards and transformation techniques can help achieve integration and interoperability, but we’ll acknowledge that interoperability is not always possible and that non-technical factors play a huge role in determining the approach.
To read before this class:
The ostensible failure of a standard has to be examined not so much from the focus of whether the standard or specification was written or even implemented (the usual metric), but rather from the viewpoint of whether the participants achieved their goals from their participation in the standardization process.
Stable standards are dead standards.
Until now we’ve focused on developing a conceptual understanding of how to define and describe entities and types of entities when organizing information. However, to progress further we must familiarize ourselves with some of the various (and constantly evolving) methods and standards for formally expressing these concepts in machine-readable ways, and for guiding information organization processes to ensure consistency and interoperability. Today we’ll look at two kinds of standards: standardized syntaxes for data interchange and standardized conceptual or structural models.
Syntax governs the arrangement of symbols to create properly formed (but not necessarily meaningful) messages.
The dominant syntax standard for encoding data so that it can be exchanged among different organization systems is the eXtensible Markup Language (XML). Review the XML Foundations reading from 8/31, and the XML tutorials at ZVON and W3Schools if you’ve forgotten what you learned about XML.
An increasingly popular alternative syntax standard is JavaScript Object Notation (JSON). Read JSON: The Fat-Free Alternative to XML.
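To make the point that syntax is only the arrangement of symbols, the sketch below writes the same record in both syntaxes mentioned above and checks that they carry the same content. The record and its field names (`title`, `year`) are invented for this example.

```python
import json
import xml.etree.ElementTree as ET

# The same hypothetical record in two interchange syntaxes.
AS_XML = "<book><title>An Example Title</title><year>2011</year></book>"
AS_JSON = '{"title": "An Example Title", "year": "2011"}'

# Read each syntax into a plain dictionary of field names and values.
from_xml = {child.tag: child.text for child in ET.fromstring(AS_XML)}
from_json = json.loads(AS_JSON)

print(from_xml == from_json)
```

The two strings look nothing alike, yet once parsed they are identical, so the choice between them is about interchange convenience rather than meaning.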
Conceptual or structural models aim to standardize the way information is conceptualized. They can range from very abstract to very specific. Unlike syntax standards, they do not specify how symbols are arranged but instead specify basic concepts and how they are related to one another. However, conceptual or structural models often specify how their concepts should be represented in one or more syntaxes.
As we discussed in class two weeks ago, the Resource Description Framework (RDF) is the conceptual model at the foundation of the Semantic Web. It is a very abstract conceptual model because it aims to standardize concepts suitable for modeling any kind of data. Watch Jenn Riley’s RDF for Librarians presentation for a more detailed explanation of RDF.
A higher-level yet still rather abstract conceptual model is the Functional Requirements for Bibliographic Records (FRBR). Read What is FRBR?
The Atom Syndication Format is a model for describing the structure of blog feeds, or any kind of data that can be expressed as a list of time-stamped items. Atom is an example of a structural model that is relatively tightly tied to a specific syntax (XML).
Google recently released the Dataset Publishing Language (DSPL), a new conceptual model for describing quantitative datasets such as demographic statistics. Skim through the DSPL Tutorial.
Finally, there are conceptual or structural models for relatively concrete, well-understood kinds of things such as contact information, calendar events, postal addresses, and recipes. Recently the three major search engines agreed on a set of conceptual models for these types of information and published them at schema.org. Skim the schema.org documentation and take a look at the model for structuring recipes.
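At the most abstract end of the models above, RDF reduces every statement to a subject, a predicate, and an object. The sketch below represents such triples as plain Python tuples without any RDF library. The `example.org` identifiers and the `objects` helper are hypothetical; the predicates come from the Dublin Core terms and FOAF vocabularies.

```python
# Each statement is a (subject, predicate, object) triple.
DC = "http://purl.org/dc/terms/"
FOAF_NAME = "http://xmlns.com/foaf/0.1/name"
BOOK = "http://example.org/book/1"      # hypothetical identifier
PERSON = "http://example.org/person/1"  # hypothetical identifier

TRIPLES = [
    (BOOK, DC + "title", "An Example Title"),
    (BOOK, DC + "creator", PERSON),
    (PERSON, FOAF_NAME, "A. Author"),
]

def objects(subject, predicate):
    """Return every object asserted for a given subject and predicate."""
    return [o for s, p, o in TRIPLES if s == subject and p == predicate]

# Follow two statements in a row: book -> creator -> name.
creator = objects(BOOK, DC + "creator")[0]
print(objects(creator, FOAF_NAME))
```

Because the model says nothing about books or people in particular, the same three-part structure can describe any kind of data, which is what makes it so abstract.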
Today we’ll look at two more kinds of standards: standardized values or names and standardized processes.
Conceptual or structural models usually define the kinds of attributes that entities have, but may not specify the actual values that those attributes can take. This is the role of value standards, which are usually lists or hierarchies of names or identifiers that can be used as values for certain kinds of attributes.
A very simple example of a value standard is ISO 3166-1, which standardizes two- and three-letter codes for identifying countries.
More complex value standards resemble (or are) classifications, with faceted and/or hierarchical structure. Browse through the Art & Architecture Thesaurus, the AGROVOC agricultural vocabulary, and the Medical Subject Headings (MeSH).
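A value standard does its work when a system refuses values that are not on the list. The sketch below assumes a four-entry excerpt of ISO 3166-1 codes rather than the full standard; the `check_country` function is a hypothetical name for illustration.

```python
# A small excerpt of ISO 3166-1 (two-letter code -> three-letter code);
# the full standard covers every country.
ALPHA2_TO_ALPHA3 = {"US": "USA", "DE": "DEU", "JP": "JPN", "BR": "BRA"}

def check_country(value):
    """Accept a value only if it is one of the standardized two-letter codes."""
    code = value.strip().upper()
    if code not in ALPHA2_TO_ALPHA3:
        raise ValueError(f"not a known country code: {value!r}")
    return code

code = check_country("de")
print(code, ALPHA2_TO_ALPHA3[code])
```

A free-text value such as "Germany" would be rejected, which is how the standard keeps one attribute from filling up with many spellings of the same country.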
Finally, rules or best practices seek to standardize the processes by which people organize information. Among other things, they may specify when and how the other kinds of standards should be used to describe and organize particular kinds of information.
Although not an official standard, the database guidelines at Discogs are a good example of what rules for cataloging look like. Read the Quick Start Guide and skim through some of the other database guidelines such as Genres/Styles and Master Release.
An example of a more official standard is Graphic Materials: Rules for Describing Original Items and Historical Collections, which provides rules for describing photographs, posters, cartoons, prints and drawings. Skim through the standard to get a sense of the variety of aspects of the description process that it attempts to standardize.
Increasingly, description of resources is done by algorithms. Knowing what algorithms can and can’t do is critical for understanding the potential of automatic description.
To read before this class:
This article sketches a framework for thinking about how human and automatic metadata generation can complement one another.
Automated indexing is a very broad notion that encompasses various technologies and techniques, some of which involve taxonomies and some of which do not. Automated tagging, auto-classification, and auto-categorization refer to automated indexing technologies that utilize taxonomies in some way or another. Simpler search engines perform a form of automated indexing without using taxonomies, but more recently, some search systems have incorporated taxonomies.
Be sure to review the list of terms and concepts you should know for the midterm.
Assignment #5 Midterm Exam due
The midterm will be given during regular class time. It will be distributed as a Word document, so you’ll need to bring a laptop to work on it. It is open-book, open-notes.
This will be the last day that we meet as a class. For the remainder of the semester you will work with your branch groups, checking in with me periodically as needed.
Note: The cataloging branch will meet with Hollie White on Thursday, 4/12 at 3:30PM in Dey 202.
Assignment #6 Final Branch Report due