It was my great privilege to have been invited as a speaker and panelist at Critical Perspectives on the Practice of Digital Archaeology at Harvard University on February 3 and 4, 2017. Here's my paper.
Other People’s Data1: Practical Realities and Ethics of Preservation, Reuse, and Dissemination at a State Repository
I’m Jolene Smith. I work for the Virginia Department of Historic Resources, the State Historic Preservation Office (or SHPO). My title is long and new, although I’ve been in this position and another at the agency for about a decade collectively. I’m DHR’s Digital Media Preservation Specialist and Archaeological Data Manager. When I started in this position, my primary responsibilities as Archaeology Inventory Manager (as I was known until a couple of months ago) were to assign state site numbers to incoming site records, digitize site level geospatial data, and basically to help people find reliable information about Virginia archaeology. A lot has changed in the intervening years, but the core remains the same. I’ve expanded my purview to digital archives at the agency more broadly (we also have a huge architectural dataset, among others) and made exploding silos of data inside the agency and preserving digital media high priorities. My position is within DHR’s archives, staffed by three full-time employees (including myself) and 1-3 colleagues who are not classified as permanent employees. Our archives are an organic being. Physical file organization and storage, numbering systems, and then digital file organization and maintenance have developed internally over 50 years to meet very specific agency research and information needs. In short, I, my colleagues, and our predecessors did things in a way that worked at the time from the perspective of our agency, not necessarily from the perspective of a capital L Library or capital A Archive.
I’m assuming several voices in this presentation. I’m not speaking for my employers, but I’m describing what we do and my own priorities moving forward. And I’m also speaking as an archaeologist in the world and as a concerned citizen. Hopefully this will become clear enough.
I came on board after the agency was well on our way into a digital transition. Basic site records all came in through a web application and I heads-up digitized site shapes into our GIS from all variety of graphic maps. Sometimes there would be a CD mailed in along with bound cultural resource management reports. But for the most part, the authoritative documents were paper. Associated digital files were pleasant additions. The balance has shifted, although we’re still wobbling on the fulcrum. All of this is to say that small state agencies like this one are often slow on the uptake when it comes to technological innovation, for plenty of valid reasons. The upside to this fact is that there’s more time to get out in front of some of these digital challenges when internal change comes at a less-than-rapid pace.
For the purposes of this presentation, I’m going to define archaeological “digital data” as a singular concept very broadly. It’s the state-level site inventory. It’s the accompanying media: photos, artifact inventories, feature- or even artifact-level geospatial data, 3D scans, remote sensing datasets, digital versions of gray literature that synthesizes the rest, and on and on. It is born digital and digitized from some other, more tangible form. It is produced by hundreds of different individuals with varying goals and abilities. In short, it’s kind of a mess.
Getting Data In (and keeping it there)
Here’s my obligatory “herd of cats” photo that I’ll leave on the screen while I talk about the hard stuff.
I’m about to recite a laundry list of problems here, but I don’t intend for any of this to be disparaging. In fact, some of the most helpful elements of researching approaches to data preservation by small archives are the war stories. Not only because of the psychological comfort in commiseration, but also because it’s become abundantly clear that many of these challenges are common. Some are more soluble than others, but following along (or, better yet, collaborating) with another entity tackling the same problems is so much easier and more efficient than going it alone.
In Virginia, the archaeological data we curate at the Department is produced by hundreds of different archaeologists for different purposes. The vast majority of information in our collection is related to cultural resources management archaeology in support of environmental or preservation legislation. We maintain data for over 43,000 sites and each year we receive roughly between 300 and 400 archaeological surveys and accompanying data (at various scales). Our guidelines for survey and documentation are a balancing act.2 While, in a perfect world, we could mandate very strict data standards for archaeological surveys reviewed by our agency, it’s frankly not politically viable to make changes that are perceived as placing undue financial or regulatory burden on the private sector. As I watch funding stripped from humanities programs and regulatory policies gutted at the federal level, these issues are more sensitive than ever.
As mentioned previously, digital files are currently organized in a way that inherits structure from physical records storage needs of the past. Metadata are stored in folder and filenames on the network. All of this is backed up to state IT standards, but not necessarily to accepted digital archives preservation standards3. Database relationships between parent and child site records (for example, individual sites within a larger archaeological district site) don’t exist because of choices made over 20 years ago when the agency adopted its first relational database of historic properties. Consulting archaeologists around the state send accompanying data on optical media, but there are no firm standards aside from a PDF report requirement. We do a lot of work to try to keep things in order, but it’s still fairly Wild West.
From a technological resources perspective, we face a lot of challenges, as I’m sure is common. We’re a tiny, independent state agency with an equally modest budget and a fairly conservative climate. Due to various state contracting requirements, we’re limited to an incredibly small amount of network storage space for both archival and working data. We have one dedicated (and heroic) IT person (who may not always be amused with my schemes that all seem to require special installs or waivers). About a year ago I came across a slide deck from a Dublin Core Workshop entitled “Implementing Linked Data in Low Resource Conditions” that was clearly designed for work in developing nations.4 But the shoe fits our scenario quite well and I have the feeling we’re not alone. Having this kind of descriptor really allowed me to reframe my thinking about our institutional challenges and has given me the freedom to be comfortable starting small. I wear the Low Resource Conditions badge with pride.
All of these issues break out like this:
- Irregular or nonstandard data submitted by outside archaeologists
- Pressures against requiring additional labor
- Small budget for internal technology
- Very limited personnel resources
Alright. That’s a pretty steep hill. And it’s only getting steeper with a rapidly crumbling political environment. So how do I make progress here? I’m starting with education, connections, and politics. I’m educating myself in the ways of more formally trained digital archivists, taking great advantage of open course models. I’m working to train my less technically-inclined colleagues about concepts like fixity and data interoperability. As I work to revise our guidelines to (hopefully) radically improve data submitted to our agency, I’m also planning to develop entry level training (and I look forward to tomorrow’s workshop here). From very informal surveys of my Virginia colleagues, many of them don’t really understand electronic data. So developing videos and workshop content to really break down why we’re doing what we’re doing will hopefully get some buy-in. If I ask for more or better, I have to be prepared to deliver more, and better. But there’s the advantage of clean, stable datasets.
The nuclear option in my pocket at all times is the reminder that collecting and disseminating reliable data is part of archaeological ethics and that our agency is mandated to preserve it, specifically by state level digital data standards and records retention schedules, as well as by federal regulations requiring preservation.5
On Digital Preservation of Public Data in the Trump Era
A few days after the election, I pondered in a tweet about practical ways to prepare our data for some degree of dormancy, in the event that small operations like mine lose staff or programs. I tried to phrase it calmly but it felt silly and alarmist. Or, at least, I wanted it to be silly and alarmist. But the situation is becoming more serious by the day. We are two weeks into this presidential administration and already seeing direct threats to research and to specific categories of information at the federal level. In my role at a small state agency, I fear the impact of deregulation and defunding. Dare I say fortunately, being at the state level affords a little bit of time to prepare. As a discipline, we need to be ready for rapid response. We need a toolkit of easy strategies that organizations can use to get their information into a stable state and distribute its curation. We need to design our systems so others can access or “rescue” the data if we are unable to or prohibited from doing so ourselves. We need a network of diverse organizations and individuals who can step in to provide technical assistance. We need a Bat Signal and some life rafts.
In this environment of uncertainty, every application we develop, every database we create, every policy we enact must consider the possibility that the lights could be turned off without a whole lot of notice. It’s unpleasant (to say the least) to think about what we do this way, but I argue that we’ll come out on top with more robust and resilient systems and datasets if all of this preparation turns out to be unwarranted.
We need to ask ourselves some very unsettling questions, like, if Section 106 becomes obsolete tomorrow, who in the private sector is going to lose their contracts? If CRM companies rapidly fold, who will become the custodians of an avalanche of artifacts and records? The answer is, well, us. And we’re frankly not ready. We’re at critical storage capacity for physical and digital objects and we need to be preparing for a reckoning (note to self: schedule another meeting). Our agency got a tiny taste of this in 2008. We can’t leave it up to firms to do the right thing, either. I call for proactive statements from professional organizations like the Society for American Archaeology and state-level professional organizations to encourage coalitions for rapid response. As we have seen at the federal level, centralized data repositories (especially inaccessible ones) are very vulnerable.
Okay. Eyes on a stable future. Getting data out.
I discussed earlier the diversity of information and information quality we collect at DHR. So optimizing any system to get data out of our collections in meaningful ways is tricky. And since we operate within tight budget constraints with few experts working on data curation, getting it all to a point of real standardization is near on impossible.
When I first met Eric Kansa in 2013 at a workshop for the Digital Index of North American Archaeology, he advocated for annotation over standardization. I still want this on a t-shirt. I’m thrilled that Virginia’s site dataset is included in DINAA’s linked open data model, but we’ve got more work to do in order to make this feasible into the future, including reworking our own databases with an eye on interoperability. And with improved standards and procedures on ingest (open data, machine-readable formats, stable file formats, etc.) possibilities for output and reuse will continue to grow and improve. Better standards will also make acceptable preservation much easier. Basic implementation of simple strategies like a hard drive exchange “buddy network” and free, easy tools like BagIt developed by the Library of Congress to track file integrity are essentially instant changes we can make at very low cost with immediate results.
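To make "tracking file integrity" a little more concrete, here is a minimal sketch of the underlying idea in Python. It is not BagIt itself and not a description of anything running at DHR; the function names (`sha256_of`, `build_manifest`, `check_fixity`) are hypothetical and chosen only for illustration. The assumption is simply that recording a checksum for every file at ingest, and recomputing it later, reveals files that have gone missing or silently changed.

```python
import hashlib
from pathlib import Path


def sha256_of(path: Path, chunk_size: int = 1 << 20) -> str:
    """Return the SHA-256 hex digest of a file, read in chunks."""
    digest = hashlib.sha256()
    with path.open("rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def build_manifest(root: Path) -> dict[str, str]:
    """Map each file's path (relative to root) to its checksum."""
    return {
        p.relative_to(root).as_posix(): sha256_of(p)
        for p in sorted(root.rglob("*"))
        if p.is_file()
    }


def check_fixity(root: Path, manifest: dict[str, str]) -> list[str]:
    """List files that are missing or whose contents have changed."""
    problems = []
    for rel_path, expected in manifest.items():
        target = root / rel_path
        if not target.is_file():
            problems.append(f"missing: {rel_path}")
        elif sha256_of(target) != expected:
            problems.append(f"changed: {rel_path}")
    return problems
```

In practice the manifest would be saved alongside the files (and alongside any copy handed to a "buddy"), so that either party can rerun the check later. A purpose-built tool like BagIt adds a standard package layout on top of this basic idea, which is why I would reach for it rather than a homemade script.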
Following this approach is the tricky business of presenting this type of non-standardized data in meaningful ways. Again, I suggest that at least a partial solution is to increase data literacy on the part of the consumers. In Virginia, the primary users of our information are private sector archaeology consultants, followed by governmental agencies involved at various stages of the environmental review process. Researchers and the general public are not our primary customers, but I endeavor to make archaeological data more useful and relevant to these groups moving into the future. Even the trained “regulars,” CRM professionals, need tools to help them make their own data more useful, to understand the variety and limitations within the datasets of other archaeologists, and to draw meaning from it all in consistent ways. I was thrilled to see the agenda for the workshop accompanying this conference tomorrow for these very reasons.
Budgetary pressure against prioritizing research and public access has always been an unpleasant undercurrent at this and other state level archives. Anecdotally, there was a period in the past where researchers were advised not to submit archaeological data to the SHPO if it was not a part of compliance archaeology. This advice, thankfully, has not been conveyed for a long time. But real damage was done. In my 9 years in this position, I’ve had to do a lot of work to create renewed connections with academic and museum archaeologists. Virginia archaeology is still balkanized to some degree between academic, museum, and CRM archaeologists, although we continue to encourage collaboration through organizations like the Council of Virginia Archaeologists. As a matter of fact, just Wednesday, an archaeologist with one of Virginia’s most well-known archaeological sites and historical attractions emailed me to initiate getting site forms updated. Victories.
The 2013 DINAA meeting in Knoxville as well as other gatherings of various state-level data folks have really driven home a problem with isolation. A personal goal of mine since 2013 has been to push outside interaction, get skills beyond the expected, and engage frequently with interdisciplinary groups, including conferences like this one, the wider digital humanities community, libraries and archives, and software developers. It’s great fun and I’ve met a lot of wonderful people, but my ultimate goal is to bring these seeds of connection back to others in similar government positions so they can form their own.
And back to those unfortunate political pressures. As our programs continue to face risk of elimination, our data can become a shield. Much of my work over the past two years has centered on how to get more meaningful information about Virginia archaeology into the hands of the public. And from a standpoint of pure self-preservation, this is how we get people to care. This is how we get regular citizens to pick up the phone and call their elected officials when Virginia archaeology comes under threat. In 2015 and 2016 I was privileged to be a participant in the Institute on Digital Archaeology Method and Practice at Michigan State University, where I developed a proof of concept public digital repository with the aim of making gray literature and archaeological datasets accessible and interesting to the general public. It includes the ability to explore texts and datasets without a lot of technical expertise. It’s still just a pilot, but has been well-received and, with perseverance on my part, will hopefully continue to evolve.
Alas, opening up our full corpus of archaeological data for the world to freely see and use is also not an easy ethical option. Recently I read a piece by author William Gibson entitled The Future of Privacy.6 His essay explores the contemporary implications of large-scale digital data collection and the values and ethics of individual privacy. But as I read, I thought about material “dirty laundry,” embodied trauma, and objects that represent events people may not want to be forced to remember. I remembered an excavation of an early 20th century urban domestic site in Baltimore. There was an old privy filled with an incredible number of liquor bottles, apparently concealed. We joke, but may also find ourselves feeling sensitive about commercial use of our Google search data about our own secrets. What’s the statute of limitations on ethically airing dirty laundry? I don't have the answer, by the way.
Indigenous rights and values (as well as those of descendants of enslaved peoples and other groups) also need to be front and center in these discussions about open archaeological data in Virginia. North American archaeology has undeniably exploitative, colonial roots. We, as primarily white, middle class archaeologists, cannot thoughtlessly assume disclosure is appropriate in the name of science. Information extraction from indigenous communities can easily contribute to ongoing structural violence.
This is not an easy problem to solve with a technical tool or a prescribed approach. A start lies primarily in meaningful consultation and collaboration with supplementary practical, technical solutions to enable content flagging by users to initiate a conversation and provide for differential access. I also seek to provide accessible raw but easily usable data to communities (however they choose to define themselves) so that they may tell their own stories on their own terms. I realize that neither content flagging nor data remixing are perfect solutions, but they are steps in a more equitable direction.
So, takeaways. If a similar organization asked me for advice, I would offer these four points:
- Be flexible. Prepare to adjust quickly to challenging parameters and various interests.
- Be uncomplicated. Not in the data itself, but in the systems that store the data.
- Be visible. Nobody is going to fight for something they don’t realize exists.
- Be connected. Form interdisciplinary partnerships and listen to the people you serve.
After this presentation was written, I discovered that I had inadvertently duplicated part of the title from the following: Atici, Levent, Sarah Whitcher Kansa, Justin Lev-Tov, and Eric C. Kansa. “Other People’s Data: A Demonstration of the Imperative of Publishing Primary Data.” Journal of Archaeological Method and Theory 20 (n.d.): 663–81.
Department of Historic Resources. Guidelines for Conducting Historic Resources Survey in Virginia, 2013.
See Phillips, Megan, Jefferson Bailey, Andrea Goethals, and Trevor Owens. “The NDSA Levels of Digital Preservation: Explanation and Uses.” Archiving Conference 2013, no. 1 (January 1, 2013): 216–22. A great deal of my recent progress in digital preservation is thanks to participation as a case study in Trevor Owens' graduate Digital Preservation course at the University of Maryland, as well as my own study of the open course materials.
Caracciolo, Caterina, and Johannes Keizer. “Implementing Linked Data in Low Resource Conditions.” Presented at the Food and Agriculture Organization of the UN, September 9, 2015.
Gibson, William. “The Future of Privacy.” The New York Times, December 6, 2016. https://www.nytimes.com/2016/12/06/opinion/the-future-of-privacy.html.