Web Information Organization

UNC SILS, INLS 620, Fall 2018

Choosing a dataset

Due September 4.

Your first assignment is to decide upon a dataset that you will work with throughout the semester to publish as Linked Data. The dataset must be:

  1. In the public domain, or available under an open license, or owned by you
  2. Not so small that it can be maintained manually
  3. Not already modeled using RDF or published as Linked Data

For this assignment you will describe the dataset thoroughly, and come up with some rough ideas for what you would like to do with the published data.

Deliverables

  1. A Markdown document with:
    • a URL for the dataset, if it is available online
    • where the data comes from / how it was created / what it is for
    • the ownership / licensing status of the data
    • the abstract model (e.g. tabular, relational, meta-markup) for the data
    • the specific serialization for the copy of the dataset you have (be as specific as possible)
    • some preliminary ideas about what you would like to do with the published Linked Data
  2. If the data is not available online, submit the dataset itself (if it is less than 100MB) or a representative extract

Your Markdown file should have the file suffix .md. To turn in the assignment, zip your deliverables (even if there is just one file) and submit the zip file using the link below.

Submit this assignment.

Writing RDF

Due September 18.

For this assignment, you will write RDF triples by hand in the Turtle syntax.

  1. Before you do anything else, you will need a text editor that supports syntax highlighting of Turtle files. Many text editors do not have syntax highlighting for Turtle built in, so you may need to install a package or extension to add support for it. Below are some links to such packages for some popular editors—I encourage you to help one another with getting these set up. You can use the messages tool in Sakai to coordinate with one another.

  2. For the first part of your assignment, create a file named about-me.ttl. This file should contain the six triples you created, using the FOAF vocabulary, to describe yourself.

  3. For the second part of your assignment, create a file named my-dataset.ttl. This file should contain sixteen triples that express assertions equivalent to the ones found in your dataset. In other words, you are “translating” a small excerpt of your data into RDF. You are free to create whatever predicates you need to express your data—simply include them in your Turtle file. For each predicate you define, you should indicate whether its values are expected to be literals or resources. For example, you could define a predicate serialNumber, the value of which is expected to be a literal, and a predicate owner, the value of which is expected to be a resource, with the following two triples:

    @prefix : <#> .
    @prefix rdfs: <http://www.w3.org/2000/01/rdf-schema#> .
    
    :serialNumber rdfs:range rdfs:Literal .
    :owner rdfs:range rdfs:Resource .
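
With these definitions in place, triples using the two predicates might look like the following (the subject and values are made up purely for illustration): a literal as the object of :serialNumber, and a resource as the object of :owner.

    # Hypothetical example data using the two predicates defined above
    :widget42 :serialNumber "SN-00417" .
    :widget42 :owner <http://example.org/people/alice> .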
    

Both of the files you create should use correct Turtle syntax, and you should take advantage of the features Turtle provides to group together triples having the same subject or the same subject and predicate. When you are done, zip the two files and submit them using the link below.
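
For example, an about-me.ttl sketch (the person and values here are placeholders; use whatever FOAF terms fit you) might group six triples about one subject, using a semicolon to repeat the subject and a comma to repeat both the subject and the predicate:

    @prefix foaf: <http://xmlns.com/foaf/0.1/> .

    # Six triples about a single subject, grouped with ; and ,
    <#me> a foaf:Person ;
        foaf:name "Jane Doe" ;
        foaf:mbox <mailto:jdoe@example.org> ;
        foaf:homepage <https://example.org/~jdoe/> ;
        foaf:knows <https://example.org/people/asmith#me>, <https://example.org/people/blee#me> .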

Submit this assignment.

Using OpenRefine to clean data and export RDF

Due October 11.

For this assignment, you will use OpenRefine to import your dataset, clean up your data, and export it as an RDF graph. At a minimum, your RDF graph should be structured as it was in your submission for the previous assignment—the only difference is that now you will be applying that structure to your entire dataset, rather than just producing 16 triples. However, you are welcome to make changes to that structure if you desire or feel it is necessary.

Begin by using OpenRefine to explore your dataset. Are there missing values? Are there fields with inconsistent values (e.g. differing date formats, different forms of names, “true” in some rows but “yes” in others)? Clean these up as best you can.

Then, edit your RDF skeleton to produce an RDF graph structured according to the example you provided in the previous assignment. Again, you are welcome to make changes to that structure, but please document the changes you make. Remember that anything that will be the subject of a triple needs to have a unique URI assigned to it.
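
For example (the base URI and identifier pattern here are hypothetical; you will choose your own), a row with identifier 42 might be minted as a resource like this:

    @prefix : <http://example.org/widgets/vocab#> .

    # One row of the dataset, given its own URI so it can serve as a subject
    <http://example.org/widgets/42> a :Widget ;
        :serialNumber "SN-00417" .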

When you are successfully able to produce an RDF graph in Turtle format, load it into Fuseki. Run some simple SPARQL queries to check whether your transformation was successful. For example: if you know that your dataset contains information about 17,003 political prisoners, you might write a SPARQL query verifying that there are 17,003 resources of type :PoliticalPrisoner in your RDF graph. If you further expect that, for every political prisoner, there is a single corresponding resource of type :Sentence, and a triple relating the two, you might write a SPARQL query to verify that.
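
As a rough sketch (the namespace and the :hasSentence predicate are hypothetical), those two checks might be written as:

    PREFIX : <http://example.org/vocab#>

    # Check 1: count the political prisoners in the graph (expect 17,003)
    SELECT (COUNT(?p) AS ?prisoners)
    WHERE { ?p a :PoliticalPrisoner }

    # Check 2 (run separately): find any prisoner that does not have
    # exactly one :Sentence; an empty result means the expectation holds
    SELECT ?p (COUNT(?s) AS ?sentences)
    WHERE {
      ?p a :PoliticalPrisoner .
      OPTIONAL { ?p :hasSentence ?s . ?s a :Sentence }
    }
    GROUP BY ?p
    HAVING (COUNT(?s) != 1)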

Deliverables

  1. A short narrative report (in Markdown format, file suffix .md) explaining

    • what steps you took to clean your data using OpenRefine,
    • what changes you made, if any, to the structure of your RDF (compared to the last assignment), and why you made these changes,
    • how you verified that your data transformation was successful, using at least 5 different SPARQL queries, and
    • any other problems or challenges you ran into, and how you solved them.
  2. A single Turtle file with the RDF graph you produced.

Submit this assignment.

Enriching your dataset

Due November 8.

For this assignment, you will:

  1. use OpenRefine to reconcile values in your dataset to entities in Wikidata,
  2. use the reconciled values to fetch additional data related to yours, and
  3. add this new data to your RDF dataset.

Depending on your data, you may also use entity recognition to extract names from unstructured text before doing the steps above.

Finally, you will try replacing some of the made-up predicate and class terms you’ve been using with terms from a published vocabulary.

Entity recognition

If you don’t have enough names or IDs in your dataset to make reconciliation interesting, you may need to produce some by processing unstructured text in your dataset. I recommend using the spaCy NLP library for this; consult the Jupyter notebook we used in class as an example.

Reconciliation

Use OpenRefine to reconcile names of things in your data against Wikidata. Remember that reconciliation can be slow: you should use OpenRefine facets to narrow down your data to a subset while you are experimenting with reconciliation. Once you have something that seems to work OK, you can run it on your whole dataset overnight.

Enrichment

Once you’ve successfully reconciled some values, use them to add additional columns to your dataset. These additional columns may also have Wikidata entity identifiers in them—you can then again use these to add yet more columns, if you wish.

Add these additional columns to your RDF skeleton, and verify that you can export it as Turtle, load it into Fuseki, and execute SPARQL queries that use the new triples.
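
A quick sanity check might look something like this (the :wikidataEntity and :birthPlace predicates are hypothetical stand-ins for whatever columns your enrichment actually added):

    PREFIX : <http://example.org/vocab#>

    # Sample a few resources together with the values pulled in from Wikidata
    SELECT ?item ?wikidataEntity ?birthPlace
    WHERE {
      ?item :wikidataEntity ?wikidataEntity ;
            :birthPlace ?birthPlace .
    }
    LIMIT 10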

Standardizing vocabulary

Browse and search Linked Open Vocabularies to find terms for predicates and classes that might replace the ones you’ve made up. Edit your RDF skeleton to use these predicates and classes. You do not need to replace all of your made-up terms, just as many as you can find equivalents for.

Pay attention to the domains and ranges of predicates—what do they imply about their subjects and objects? In some cases you may need to change the structure of your RDF slightly to accommodate the data modeling assumptions of the external vocabulary.
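
For instance (the original :authorName predicate and the URIs below are hypothetical), replacing a made-up literal-valued predicate with dcterms:creator, whose declared range is dcterms:Agent, means the name string has to move onto a resource of its own:

    @prefix : <http://example.org/vocab#> .
    @prefix dcterms: <http://purl.org/dc/terms/> .
    @prefix foaf: <http://xmlns.com/foaf/0.1/> .

    # Before: a made-up predicate with a plain string value
    # <http://example.org/item/42> :authorName "Alice Ayers" .

    # After: dcterms:creator expects an Agent, so the string becomes
    # a person resource with a foaf:name
    <http://example.org/item/42> dcterms:creator <http://example.org/person/alice-ayers> .

    <http://example.org/person/alice-ayers> a foaf:Person ;
        foaf:name "Alice Ayers" .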

Export your dataset using the standard vocabulary terms, load it into Fuseki, and execute SPARQL queries that use the new terms.
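
A simple query over the standardized terms (again assuming dcterms:creator and foaf:name, as in the sketch above) confirms that they made it into the graph:

    PREFIX dcterms: <http://purl.org/dc/terms/>
    PREFIX foaf: <http://xmlns.com/foaf/0.1/>

    # Spot-check that the published vocabulary terms are now in use
    SELECT ?item ?creatorName
    WHERE {
      ?item dcterms:creator ?creator .
      ?creator foaf:name ?creatorName .
    }
    LIMIT 10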

Deliverables

  1. A short narrative report (in Markdown format, file suffix .md) explaining

    • what steps you took to enrich your data using OpenRefine (and possibly spaCy),
    • what changes you made, if any, to the structure of your RDF (compared to the last assignment), and why you made these changes,
    • how you verified that your enrichment and standardization were successful, using at least 5 different SPARQL queries (be sure to list the SPARQL queries in full, exactly as you ran them), and
    • any other problems or challenges you ran into, and how you solved them.
  2. A single Turtle file with the final RDF graph you produced.

Submit this assignment.

Final deliverable

Due December 13.

For the final project you will demonstrate some kind of use of, or interaction with, your RDF dataset.

On November 29 and December 4, everyone will give a short presentation (about 10 minutes)

  1. explaining how they produced their dataset, and
  2. describing their plans for and initial steps toward a demonstration of interacting with the data.

As part of #1, plan on demonstrating some SPARQL queries to show the structure of your dataset.

The final deliverables are due December 13. We won’t meet that day; instead, you will upload them to the course web site using the link below. The specific deliverables will depend on the nature of your demonstration, but all will include a final Markdown report explaining the demonstration and the steps leading to it.

Submit this assignment.