Due September 4.
Your first assignment is to decide upon a dataset that you will work with throughout the semester to publish as Linked Data. The dataset must be:
For this assignment you will describe the dataset thoroughly, and come up with some rough ideas for what you would like to do with the published data.
Your Markdown file should have the file suffix `.md`. To turn in the assignment, zip your deliverables—even if it is just one file—and submit the zip file using the link below.
Due September 18.
For this assignment, you will write RDF triples by hand in the Turtle syntax.
Before you do anything else, you will need a text editor that supports syntax highlighting of Turtle files. Many text editors do not have syntax highlighting for Turtle built in, so you may need to install a package or extension to add support for it. Below are some links to such packages for some popular editors—I encourage you to help one another with getting these set up. You can use the messages tool in Sakai to coordinate with one another.
For the first part of your assignment, create a file named `about-me.ttl`. This file should contain the six triples you created, using the FOAF vocabulary, to describe yourself.
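As a rough sketch of what such a file might look like (the name, email address, and homepage below are invented placeholders, not a required template):

```turtle
@prefix foaf: <http://xmlns.com/foaf/0.1/> .

<#me> a foaf:Person ;                      # 1: I am a person
    foaf:name "Jane Doe" ;                 # 2: full name
    foaf:givenName "Jane" ;                # 3: given name
    foaf:familyName "Doe" ;                # 4: family name
    foaf:mbox <mailto:jane@example.org> ;  # 5: email address
    foaf:homepage <http://example.org/> .  # 6: homepage
```

Note that the `a` on the first line is itself a triple (it asserts the `rdf:type` of the subject), so this example contains exactly six triples.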
For the second part of your assignment, create a file named `my-dataset.ttl`. This file should contain sixteen triples that express assertions equivalent to the ones found in your dataset. In other words, you are “translating” a small excerpt of your data into RDF. You are free to create whatever predicates you need to express your data—simply include them in your Turtle file. For each predicate you define, you should indicate whether its values are expected to be literals or resources. For example, you could define a predicate `serialNumber`, the value of which is expected to be a literal, and a predicate `owner`, the value of which is expected to be a resource, with the following two triples:

```turtle
@prefix : <#> .
@prefix rdfs: <http://www.w3.org/2000/01/rdf-schema#> .

:serialNumber rdfs:range rdfs:Literal .
:owner rdfs:range rdfs:Resource .
```
Both of the files you create should use correct Turtle syntax, and you should take advantage of the features Turtle provides to group together triples having the same subject or the same subject and predicate. When you are done, zip the two files and submit them using the link below.
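For instance (using invented data), Turtle's `;` separator groups triples sharing a subject, and `,` groups triples sharing both a subject and a predicate:

```turtle
@prefix : <#> .

:widget42 :serialNumber "SN-0042" ;   # same subject continues with ;
    :owner :jane ;
    :color "red", "blue" .            # two triples sharing subject and predicate
```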
Due October 11.
For this assignment, you will use OpenRefine to import your dataset, clean up your data, and export it as an RDF graph. At a minimum, your RDF graph should be structured as it was in your submission for the previous assignment—the only difference is that now you will be applying that structure to your entire dataset, rather than just producing 16 triples. However, you are welcome to make changes to that structure if you desire or feel it is necessary.
Begin by using OpenRefine to explore your dataset. Are there missing values? Are there fields with inconsistent values (e.g. differing date formats, different forms of names, `true` in some rows but `yes` in others)? Clean these up as best you can.
Then, edit your RDF skeleton to produce an RDF graph structured according to the example you provided in the previous assignment. Again, you are welcome to make changes to that structure, but please document the changes you make. Remember that anything that will be the subject of a triple needs to have a unique URI assigned to it.
When you are successfully able to produce an RDF graph in Turtle format, load it into Fuseki. Run some simple SPARQL queries to check whether your transformation was successful. For example: if you know that your dataset contains information about 17,003 political prisoners, you might write a SPARQL query verifying that there are 17,003 resources of type `:PoliticalPrisoner` in your RDF graph. If you further expect that, for every political prisoner, there is a single corresponding resource of type `:Sentence`, and a triple relating the two, you might write a SPARQL query to verify that.
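Queries along those lines might look like the following (the prefix URI and the `:sentencedTo` predicate are placeholders—substitute whatever your own graph uses):

```sparql
PREFIX : <http://example.org/>

# Count the resources of a given type; expect 17003 in the example above.
SELECT (COUNT(?p) AS ?total)
WHERE { ?p a :PoliticalPrisoner }
```

```sparql
PREFIX : <http://example.org/>

# Find prisoners with no associated :Sentence resource;
# if the expectation holds, this returns no results.
SELECT ?p
WHERE {
  ?p a :PoliticalPrisoner .
  FILTER NOT EXISTS { ?p :sentencedTo ?s . ?s a :Sentence }
}
```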
A short narrative report (in Markdown format, file suffix `.md`).
A single Turtle file with the RDF graph you produced.
Due November 8.
For this assignment, you will:
Depending on your data, you may also use entity recognition to extract names from unstructured text before doing the steps above.
Finally, you will try replacing some of the predicate and class terms that you’ve been making up with terms from a published vocabulary.
If you don’t have any (or enough) names or IDs in your dataset to make reconciliation interesting, you may need to produce some by processing unstructured text in your dataset. I recommend using the spaCy NLP library for this—consult the Jupyter notebook we used in class as an example.
Use OpenRefine to reconcile names of things in your data against Wikidata. Remember that reconciliation can be slow: you should use OpenRefine facets to narrow down your data to a subset while you are experimenting with reconciliation. Once you have something that seems to work OK, you can run it on your whole dataset overnight.
Once you’ve successfully reconciled some values, use them to add additional columns to your dataset. These additional columns may also have Wikidata entity identifiers in them—you can then again use these to add yet more columns, if you wish.
Add these additional columns to your RDF skeleton, and verify that you can export it as Turtle, load it into Fuseki, and execute SPARQL queries that use the new triples.
Browse and search Linked Open Vocabularies to find terms for predicates and classes that might replace the ones you’ve made up. Edit your RDF skeleton to use these predicates and classes. You do not need to replace all of your made-up terms, just as many as you can find equivalents for.
Pay attention to the domains and ranges of predicates—what do they imply about their subjects and objects? In some cases you may need to change the structure of your RDF slightly to accommodate the data modeling assumptions of the external vocabulary.
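For example, suppose you had been using a made-up `:creator` predicate with a literal name as its value, and you replace it with `dcterms:creator`, whose declared range is `dcterms:Agent`. That range implies the object should be a resource rather than a string, so the structure changes slightly (the URIs below are placeholders):

```turtle
@prefix : <http://example.org/> .
@prefix dcterms: <http://purl.org/dc/terms/> .
@prefix rdfs: <http://www.w3.org/2000/01/rdf-schema#> .

# Before: a made-up predicate with a literal value
# :item17 :creator "Ada Lovelace" .

# After: dcterms:creator expects a resource (an agent),
# so the creator becomes a resource with its own label.
:item17 dcterms:creator :ada .
:ada rdfs:label "Ada Lovelace" .
```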
Export your dataset using the standard vocabulary terms, load it into Fuseki, and execute SPARQL queries that use the new terms.
A short narrative report (in Markdown format, file suffix `.md`).
A single Turtle file with the final RDF graph you produced.
Due December 13.
For the final project you will demonstrate some kind of use of, or interaction with, your RDF dataset.
On November 29 and December 4, everyone will give a short presentation (about 10 minutes).
As part of #1, plan on demonstrating some SPARQL queries to show the structure of your dataset.
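One generic query for showing the overall shape of a dataset (not specific to any one project) counts instances by class:

```sparql
# List every class in the graph with its number of instances.
SELECT ?class (COUNT(?s) AS ?instances)
WHERE { ?s a ?class }
GROUP BY ?class
ORDER BY DESC(?instances)
```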
The final deliverables are due December 13th. We won’t meet that day—you will upload them to the course web site using the link below. The specific deliverables will depend on the nature of your demonstration, but all will include a final Markdown report explaining the demonstration and the steps leading to it.