Keeping track of my research day activities!
April 15-17, 2020
While these research days have not been what I expected – I was originally planning on being in Richmond, British Columbia for the BC Library Conference – everything is requiring a lot of flexibility in these COVID-19 days. May Chan and I are still planning on giving our conference presentation virtually, but in the meantime I decided to use this time to start taking courses that will count towards the Library Juice Academy certificate in XML and RDF-Based Systems.
I’m currently enrolled in Introduction to RDF, and in May I will start RDF, RDFa and Structured Data Vocabularies. So far we’ve covered an introduction to the Semantic Web and RDF, as well as the components of RDF XML and an introduction to vocabularies and ontologies. Having worked with graph databases and RDF using python last fall, some of the course content is not new to me, but I have been enjoying the systematic way that the instructor introduces terms, as well as the weekly chat sessions where we can ask questions. I now feel like I have a much better handle on how to talk about RDF data.
To supplement this work, I’ve decided to experiment with python packages to convert MARC records to RDF/XML. But first, I’d like to share some of the “failed” attempts that got me to this point…
- I tried following the instructions in the Converting MARC to RDF lightning talk by F. Tim Knight. While the instructions were sparse, they were clear; even so, I had issues processing my files using XSLT stylesheets. I’m definitely going to revisit this method, but I got impatient and decided to look for python packages instead.
- When searching for python alternatives, I came across a Code4Lib pre-conference session about Transforming MARC and Metadata into RDF-based Applications by Jeremy Nelson and Mike Stabile. They used a package called BIBCAT; however, when I went to test it out, it wasn’t for me. The library was last updated in 2018, doesn’t seem to be actively maintained, and despite originally appearing promising, the README files were not helpful to my new-to-python eye.
This experience led me to browse pypi.org (the Python Package Index, a repository of python-language software) for “marc” and “rdf” packages, and I quickly stumbled upon pybibframe. It is simple to use, and even offers the option to create an RDF/XML representation, which is what I’m using in my course.
Before I get to the steps required to convert sample data I extracted from the University of Toronto catalogue to RDF/XML, I want to explain how this python package does things a little differently from what I originally wanted. I was hoping to convert directly to RDF/XML using whichever vocabularies I feel like, because in my course we’re often looking at schema.org and Dublin Core predicates. However, by converting from MARC XML to RDF/XML using pybibframe, I’m required to have some knowledge of BIBFRAME (the Library of Congress Linked Data model), because that is the model this package uses. This isn’t shocking, given that bibframe is right there in the name of the package, but BIBFRAME creates RDF data using its own vocabulary, and pybibframe doesn’t appear to have any built-in functions for creating triples with other ontologies. It does make sense for me, as a librarian, to use BIBFRAME, because it is being touted as the replacement for MARC21. Still, I may need to investigate other tools that could cross-walk between my BIBFRAME outputs and other models/vocabularies, or look into ways to augment the data depending on the parameters of future projects that I might undertake.
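To make the cross-walk idea concrete, here is a toy sketch of remapping predicate URIs over plain subject-predicate-object tuples. The mapping table and sample triple are my own inventions for illustration (a real BIBFRAME-to-Dublin-Core alignment would need many more entries and rules for predicates with no one-to-one equivalent), not anything pybibframe provides:

```python
# Toy sketch of a predicate crosswalk from BIBFRAME to Dublin Core terms.
BF = "http://id.loc.gov/ontologies/bibframe/"
DCT = "http://purl.org/dc/terms/"

# Hypothetical mapping table; illustrative, not an authoritative alignment.
PREDICATE_MAP = {
    BF + "title": DCT + "title",
    BF + "subject": DCT + "subject",
}

def crosswalk(triples):
    """Rewrite predicates found in PREDICATE_MAP; pass others through."""
    return [(s, PREDICATE_MAP.get(p, p), o) for (s, p, o) in triples]

# An invented sample triple, standing in for real pybibframe output.
sample = [("http://example.org/work/1", BF + "title", "Sample Title")]
print(crosswalk(sample))
```

The point of the sketch is only that a crosswalk at the triple level is mechanical once a mapping exists; the hard part is deciding the mapping itself.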
But enough about that, now for the actual conversion process!
- Convert from MARC21 to MARC XML using MarcEdit (documentation about this process can be found on the Illinois Library MarcEdit Libguide).
- Open Terminal (I’m using a mac), and after installing pybibframe using pip, run:
marc2bf -o resources.versa.json --rdfxml resources.rdf records.mrx
That’s it! To confirm that the file successfully converted, I ran the triples that corresponded to the first MARC21 bibliographic record through the W3C RDF validator, and it was a success! You can see the graph and validated triples below, and if you click on the graph you will open it in full size in a new window. My next steps (during future research days) will be to play with ways to handle all of the string literals, maybe by introducing other ontologies, or to see what skills would be required to add support for other ontologies to the pybibframe package.
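Alongside the W3C validator, a rough local sanity check is possible with Python’s standard library alone: parse the RDF/XML and count the resource descriptions (the direct children of the rdf:RDF root). This is my own sketch; the inline sample document below stands in for resources.rdf, and real pybibframe output will look different:

```python
import xml.etree.ElementTree as ET

RDF_NS = "{http://www.w3.org/1999/02/22-rdf-syntax-ns#}"

# Inline stand-in for the converted file; for the real output,
# swap in: root = ET.parse("resources.rdf").getroot()
sample = """<?xml version="1.0"?>
<rdf:RDF xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#"
         xmlns:dct="http://purl.org/dc/terms/">
  <rdf:Description rdf:about="http://example.org/work/1">
    <dct:title>Sample Title</dct:title>
  </rdf:Description>
</rdf:RDF>"""

root = ET.fromstring(sample)
assert root.tag == RDF_NS + "RDF"   # confirm the root is rdf:RDF
descriptions = list(root)           # each child describes one resource
print(f"{len(descriptions)} resource description(s) found")
```

This only confirms the file is well-formed XML with the expected root, of course; validating the triples themselves still needs a proper RDF tool.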
December 18, 2019
I’m looking forward to being a reviewer for Against the Grain, and have finally received my first book in the mail!
Without spoiling the review, I’m enjoying the book overall, and have found it to be insightful when considering larger questions around professionalization, working conditions, and my very recent past as an LIS student.
November 1, 2019
Rachel and I continued to work away preparing for PyCon Canada. I really should have kept better notes, but the day just flew by. I was so excited – and nervous – for PyCon Canada that I really didn’t have time to relax until after it was all done. That said, I’m looking forward to coming back to the linked data work we were doing, but starting from scratch and creating the dataset ourselves.
Because PyCon Canada fell on a weekend I didn’t take any research days for it. However, I thought I’d include it in this post because it was what Rachel and I were working towards. At PyCon, I mostly attended sessions in the PyData track, and had the opportunity to meet a lot of people who were not librarians, which I always find to be a rich way to spend my time. I often worry that as librarians we make things harder for ourselves than they have to be, when people in other disciplines and professions are doing similar work, but we just don’t know about it.
I enjoyed learning about word2vec, in a presentation given by Roberto Rocha (a data journalist at CBC). Word2vec is used to determine document similarity, and in this case the results were presented in 3D through a tensor visualization. Leaving the session I couldn’t help but think of the interesting potential word2vec presents, maybe analyzing online reference questions…
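Under the hood, the similarity measure usually used with word2vec embeddings is plain cosine similarity between vectors, which is easy to sketch in pure Python. The three-dimensional “embeddings” below are made up for illustration (real word2vec vectors have hundreds of dimensions and come from training on a corpus):

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two vectors: 1.0 means same direction."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(x * x for x in b))
    return dot / (norm_a * norm_b)

# Made-up toy vectors standing in for learned word embeddings.
embeddings = {
    "library": [0.9, 0.1, 0.3],
    "archive": [0.8, 0.2, 0.4],
    "gazpacho": [0.1, 0.9, 0.2],
}

# "library" should land closer to "archive" than to "gazpacho".
print(cosine_similarity(embeddings["library"], embeddings["archive"]))
print(cosine_similarity(embeddings["library"], embeddings["gazpacho"]))
```

Comparing reference questions this way would mean embedding each question (for example, by averaging its word vectors) and ranking pairs by this score.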
Another thing that I enjoyed learning about, which was presented in the lightning talks, is gazpacho. Gazpacho is meant to be a simpler, easier-to-use replacement for (most) things that are currently done with beautifulsoup.
I was also grateful for the knowledge shared by Niharika Krishnan (Understanding Autistic Children Using Python), Stephen Childs (Data Viz with Altair), Serena Peruzzo (Data Cleaning), Jill Cates (Algorithmic Bias), Manav Mehra (Growing Plants with Python), Cesar Osario (Voice recognition using python and deep learning), Anuj Menta, and Josh Reed (Putting your data in a box). In addition, the conversations and connections I made with other people who are interested in tech and libraries were so valuable!
October 16, 2019
In preparation for a talk that Rachel Wang and I will be giving at PyCon Canada on November 17, 2019, I took my second research day. We’ll be presenting on gathering insights from linked data, using RDFlib and SPARQL queries.
What did we get up to?
- Explored some graph databases: Neo4j and GraphDB. While both had great user interfaces, we decided to focus our efforts on GraphDB because it works well with RDF, and therefore RDFlib, whereas Neo4j uses Labelled Property Graphs. More information on this difference can be found on the Neo4j blog. GraphDB also has a tabular-data-to-RDF loader, which is built on OpenRefine, a tool we’re both already familiar with and hope to use more (there will be more on that later!).
- Explored loading data into GraphDB through both the RDF options and the Tabular (OntoRefine) option. The OntoRefine option kept giving us problems – no matter the size of the file, it seemed to load forever or error out. This was frustrating because it happened even with the sample file provided by GraphDB for testing this type of load.
- Developed SPARQL queries.
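At their core, the SPARQL queries we developed are basic graph pattern matching over subject-predicate-object triples, which can be sketched without any dependencies. The wine facts below are invented stand-ins for our test dataset, and with RDFlib or GraphDB the equivalent query would of course be written in SPARQL itself:

```python
# A dependency-free sketch of the triple pattern matching that underlies
# a SPARQL query. None acts as a wildcard, like a SPARQL variable.

triples = [
    # Invented sample facts in the spirit of a wine test dataset.
    ("Merlot", "type", "RedWine"),
    ("Chardonnay", "type", "WhiteWine"),
    ("Merlot", "locatedIn", "BordeauxRegion"),
]

def match(pattern, data):
    """Return triples matching (s, p, o), where None matches anything."""
    s, p, o = pattern
    return [t for t in data
            if (s is None or t[0] == s)
            and (p is None or t[1] == p)
            and (o is None or t[2] == o)]

# Analogous to: SELECT ?wine WHERE { ?wine :type :RedWine }
print(match((None, "type", "RedWine"), triples))
```

Real SPARQL adds joins across multiple patterns, filters, and aggregation, but this is the basic operation each pattern performs.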
Below is a screenshot of the GraphDB interface with our sample wine dataset we were testing on.
July 30, 2019
For my first-ever research day, I decided to focus on skills that would be directly applicable in my current role. It wasn’t hard to decide that it would be invaluable to focus on developing skills with the system API that I work with nearly daily, but that I lack experience with and a conceptual understanding of. I worked my way through the training manual, though I’m certainly left with some lingering questions. For example, some of the arguments don’t seem to be documented in the manual at all, and I still need clarification on how to construct scripts that combine Unix commands with the local tools.
Some of the things I learnt during my research day were:
- the way our databases are structured and which keys are used where
- server configuration (specifically, what the Unicorn directory I keep using actually is)
- what are Unix commands and the Bourne and Bash shells?
I now feel like I have a better basic understanding of how the API functions and which tools to use for which types of problems. However, I wish I had a better understanding of exactly how to write some of the commands, because I have a sneaking feeling that many options were missing from the training manual. I think this because there are some scripts that I use, created before my time, that have inputs or outputs not defined in the manual. When I get back to work I’ll see if I can find the definitions in the API using a shortcut I learnt about, “-x”, which lists tool, input, and output options; if not, I’ll ask my supervisor for assistance.
All of this was made even better by the view! I took the opportunity to work at one of my local public libraries – the Hillsburgh Public Library. Recently opened in a building that architecturally incorporates a heritage house, the library is on the water and has an exceptional view and a beautiful collection, and my commute was 10 minutes instead of approximately 2 hours.