Primary Archaeology data for non-archaeologists?

This post is part of the May 2012 Technology Week, a quarterly topical discussion about technology and historical archaeology, presented by the SHA Technology Committee. This week’s topic examines the use and application of digital data in historical archaeology. Visit this link to view the other posts.

Is there value in exposing archaeological primary data to non-professional audiences? Can online archaeology databases serve broader goals? Can they both inform and serve as a tool for advocacy at a time when the practice of archaeology is again being challenged in popular culture?

The National Park Service's museum.nps.gov.

The National Park Service website, museum.nps.gov, is the online face of ICMS, the database tool that the Department of the Interior uses to manage its collections. In pre-launch testing the most common reaction was surprise that the parks actually had collections. Individual parks decide what to present on the website, which currently includes nearly 450,000 records representing over four million objects, half of which are archaeological. Some information is removed before it reaches the web. Crucially for archaeology, this includes site name, site location, within-site provenience, and UTM data, excluded to protect sites from the very real threat of looting and at the request of Native American groups.
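As a concrete illustration, scrubbing records before publication can be as simple as dropping a fixed set of location-sensitive fields. This is a minimal Python sketch; the field names and example values are invented for illustration and do not reflect ICMS's actual schema.

```python
# Minimal sketch of pre-publication filtering of catalog records.
# Field names here are hypothetical; the real ICMS schema differs.

SENSITIVE_FIELDS = {"site_name", "site_location", "provenience",
                    "utm_easting", "utm_northing"}

def scrub_record(record: dict) -> dict:
    """Return a copy of a catalog record with location-sensitive fields removed."""
    return {k: v for k, v in record.items() if k not in SENSITIVE_FIELDS}

record = {
    "catalog_number": "FOVA-12345",          # invented example values
    "object_name": "transfer-printed sherd",
    "site_name": "Fort Vancouver Village",
    "utm_easting": 525000.0,
    "utm_northing": 5052000.0,
}

public_record = scrub_record(record)
# public_record keeps catalog_number and object_name, but no location data
```

The point of the sketch is that the scrub happens once, at publication time, so the full record with provenience remains intact in the internal database.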

But stripping the artifacts of physical context before they reach the web is problematic at best for archaeology, so an attempt has been made to restore some contextual information. Collection highlights were developed for park staff to group objects, creating a virtual context that can represent a physical space – a site or an archaeological feature – a thematic context, or a virtual exhibit. Fort Vancouver National Historic Site has created several highlights, including The Fort Vancouver Village. The highlight includes narrative text to explain the complex cultural landscape and is supported by 32 selected artifacts. Those artifacts are hyperlinked to the more than two hundred thousand records that are part of Fort Vancouver's online collection. I'd argue that even if most visitors never look at those records, they need to know that they are there. The National Park Service doesn't just have great scenery; it has curated over forty million cataloged objects.

At Mount Vernon, George Washington's Virginia plantation along the Potomac River, the South Grove midden excavation uncovered more than 60,000 artifacts. These represent almost 400 ceramic and glass vessels, hundreds of pounds of brick, mortar, and plaster fragments from renovating buildings, buckles, buttons, tobacco pipes, and more than 30,000 animal bones. A new website (in progress at www.mountvernonmidden.com) focuses on 400 objects, but the full database is there (and available on the Digital Archaeological Archive of Comparative Slavery site) and items are presented in the context of the wider collection. Additionally, the website includes a timeline, a map of the site in relation to the broader plantation landscape, historical notes and related published papers, and a database of the Washington family Invoices and Orders – all part of the larger data set that comprises the project.

So site databases, like the truth, need to be out there. Showing artifacts to the public, without this data-rich environment, suggests that just a few objects have primacy, elevating the qualitative over the quantitative. And if archaeologists want support for the process of archaeology and for digital preservation, then showing the volume of data makes sense.

The problem of exposing the soft underbelly of archaeological data is that at least some members of the public might start to question what's presented. Why is it so hard to compare one site with another? Why are different methodologies used at different sites? Why does every project record different information? Why does the terminology differ between sites? There is a slow move forward in addressing all these issues (Kansa et al. 2011), but if archaeologists want to hammer home the point that pot hunting and looting are bad, then they should be willing to present and rationalize the datasets that professional archaeologists create.

I’m not suggesting that advocacy is the only reason to show data. As textbooks and other electronic publications slowly transition from electronic copies of physical books into fully interactive media, perhaps they’ll also start to include accessible databases, and not just as appendices. Databases could support graphs and result sets, allowing data to be manipulated, examined, and even challenged. Perhaps eventually these datasets could be more than just one-way presentations of data. On websites, by recording the questions asked of the data and by tracking the datasets produced, these databases might come to be part of research as well as publication.

References Cited

Sustainable Archaeological Databases — a view from Digital Antiquity


At the Center for Digital Antiquity (Digital Antiquity), we are committed to improving access to and preservation and use of archaeological information. Over the past four years, we’ve built tDAR (The Digital Archaeological Record), a digital repository designed to preserve the digital documents, data sets, images, and other digital results of archaeological investigations and excavations. tDAR is one of a number of discipline-specific repositories designed from the bottom up to better support the needs of their content by providing rich, archaeology-specific metadata along with tools to discover, access, and use the uploaded materials.

Looking into the crystal ball, there are a number of significant challenges and important opportunities ahead:

  1. creating and maintaining a stable foundation for future archaeological research and resource management
  2. access and use (and preservation too)
  3. collaboration

Sustainability

If there’s anything that we can learn from the basic practice of archaeology, it’s that things do not get preserved unless the environment is right to enable preservation. This works best if there are multiple sources and tools available. In the case of archaeological data, it means that there is a mixture of sustainable technology, organizations, and tools to enable and facilitate preservation.

A digital repository that has the ambition of providing long-term preservation for archaeological data must be sustainable for the long term. There must be a realistic plan for funding the variety of activities required to ensure access and preservation of information, as well as succession plans. These are core components of being certified as a “Trusted Digital Repository,” a certification that Digital Antiquity aspires to achieve for tDAR in the near future.

At Digital Antiquity, we have a plan and a schedule for achieving it. We see the development of a digital curation service useful for public agencies, research organizations, and individual researchers as key to sustaining the tDAR repository. We plan to charge for the deposit of information into tDAR to support the archiving of those materials, and are negotiating with other archives to serve as backup repositories for tDAR. The main point here is that any organization that is serious about providing long-term support must have a plan to ensure financial support and must work diligently to execute this plan.

Digital Antiquity cannot solve this problem alone, however; sustainability requires multiple sources, technologies, and approaches, with tools like LOCKSS and organizations like the Internet Archive or HathiTrust helping to ensure sustainable archaeological information. Sustainability also requires a change in culture. It requires that public agencies, research organizations, and individual researchers who create data ensure that it is available and remains preserved for future access and use, and that they budget funds as part of their activities to support the digital repositories.

Access and Use

One of the easiest ways to understand the challenges of the future is to look at the problems we’re still struggling with from the past. Looking back to the 1970s, 1980s, and 1990s, tremendous quantities of archaeological data, in the form of reports, documents, data sets, and other materials, were produced. Most of this data collected in the US has been produced by publicly funded undertakings conducted through cultural resource management (CRM) investigations.

The challenge is that much, perhaps most, of this information is on the verge of being forgotten and lost. Almost all of the reports from the CRM era are available only as paper records. Unless systematic efforts are undertaken to preserve, digitize, and make these older reports and data more widely available, this body of work will be forgotten or essentially lost.

Recently produced archaeological reports and other data often are in digital formats. However, if these reside only on a floppy disk they too are one step away from being lost. The digital analog to the situation with paper records is not much better: a broken hard-drive or a Dropbox account that’s been corrupted, and the critical data has been lost. When data is maintained and kept at the “personal” level without appropriate documentation and backup, it’s at risk.

With the advent of the web, some documents and databases have moved online as simple webpages or more complex websites. Moving to the web has been a major step forward, enhancing discovery and providing easier access. Tools like Google may enable these materials to be discovered and used, but not all databases are “discoverable.” For example, the NADB database has been hosted for a number of years by the Center for Advanced Spatial Technology (CAST) at the University of Arkansas. In this form it was available online, but potential users had to know both about NADB and how to access the NADB web page in order to perform a search. Simply putting something on the web does not equate with accessibility.

From an archival standpoint, a database like NADB in its current form would not be preserved either. Services like the Internet Archive attempt to archive sites, but only pages that can be linked to, and many databases are accessible only via search forms. Furthermore, even when they are accessible, the data is preserved in a translated form – definitely better than not preserving the data at all, but not ideal.

The other challenge can be boiled down to a fundamental question: what will happen to the website in 20 years? Sites like Geocities or ma.gnol.ia are examples of what can happen to data on the web without stewardship. Software reaches end-of-life comparatively quickly (five years in some cases), with backend software or hardware no longer supported; tools like ColdFusion, early versions of Oracle, or older file formats such as WordPerfect are becoming scarcer and harder to use or access. Over the next 10–20 years, these challenges will grow as computing continues to evolve. The growth of cloud computing has great potential: tools like Google Docs and online databases provide a myriad of features we could only have dreamed of in the past, but they pose new challenges for preservation and use, as access may be dependent on the tool and restricted for preservation or use. These too will involve time and cost and will require online migration and future support.

Regarding use, within the United States there are federal and state regulations that prohibit the general availability of some kinds of archaeological information, specifically detailed site location information. This protection is critical to the management and preservation of the physical site. It requires, however, that online tools be sensitive to this information and that repositories develop methods for screening access and dealing with this kind of information.

There are two aspects to consider. First, most information about archaeological resources need not be held as confidential. In our experience, a document of several hundred pages may have only a few pages with specific site location information, and many reports do not contain any of this kind of detail. The challenge is to ensure that the goal of site protection does not endanger the overall ability to preserve and provide access. tDAR addresses this by enabling documents to be marked as confidential (or redacted), preserving the site location information and keeping it discoverable while restricting access to it.

The other aspect of this issue is how to ensure that the individuals and officials who need access to confidential information can get it. Over time, tools will need to be created to manage the identities of repository users and to vet them, moving away from each system managing separate credentials or requiring the initial uploader to validate all users.
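The logic of screening access can be sketched in a few lines: public resources are open to everyone, while confidential ones are visible only to users vetted through some out-of-band process. This Python sketch is illustrative only; the resource IDs, user names, and vetting flow are invented and do not describe tDAR's actual implementation.

```python
# Hedged sketch of screening access to confidential materials.
# Resource IDs, users, and the vetting process are invented for illustration.

CONFIDENTIAL = {"report_0042"}  # resources flagged as containing site locations
VETTED_USERS = {"shpo_reviewer", "agency_archaeologist"}  # vetted out of band

def can_view(user: str, resource: str) -> bool:
    """Open resources are public; confidential ones require a vetted identity."""
    return resource not in CONFIDENTIAL or user in VETTED_USERS
```

The design choice worth noting is that the confidential flag travels with the resource, not the user, so a record can be preserved and discoverable even while its contents stay restricted.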

Collaboration

With the advent of the web, real-time, large-scale collaboration has become feasible, and in many cases quite productive. It requires a shared knowledge base and shared interests between the parties, as well as trust. Examples of collaboration range from NSF projects that span a country to the development of state site files. But for these collaborations to work, significant synthesis work must be accomplished first: agreed-upon terms, definitions, and archaeological and data standards. Within the world of archaeology, this is problematic. There are definitely some categories of classification that can be agreed upon, from faunal characteristics to scientific measurements, but many qualitative classifications do not have formal, agreed-upon meanings. Furthermore, significant work must be done once data has been collected in order to prepare it for collaborative endeavors. But for any of this to happen, there must be more data sharing and publication through tools like tDAR or Open Context.

Reuse

The technology visionary dreams of the Semantic Web and linked data: a world where data is infinitely accessible and any query can be answered with a quick search and a click of the mouse, where data can be collated from multiple sources automatically to answer questions that were impossible otherwise. The dream of the semantic web is one where data is “free” of the database, there are no silos, and data is interconnected in ways the original creator could never have conceived: linking online databases of various types together would enable complex, advanced searching across all of them in new and unique ways.

The challenges here, however, are great, from data quality to knowledge of external tools to technical skill. The last is, in some ways, the greatest challenge. Archaeologists are, in general, a smart bunch, and often quite technically savvy, but these tools have a high barrier to entry. Some of these barriers include:

  1. Perceived value and need. If putting data into a semantic format were as simple as clicking “save as” in Access, Excel, or Word, then this discussion would be moot. Instead, it’s a manual or technically involved process that requires users to isolate different types of data, evaluate it, standardize it, and map it. It works best for quantitative measurements and poses real challenges for qualitative data. Regardless of the ability to publish the data, until there are shining examples of how the data can be used to produce new, impactful, and significant results that change the work-to-reward ratio, this will remain a problem. Within tDAR, we have started to develop tools that help users make their data accessible through simple web forms. These enable the analysis and mapping of data from coding sheets to shared knowledge structures (ontologies) that can be used in data analysis within tDAR and, in the future, outside it as well.
  2. Once data is in a semantic form, it’s difficult to use. Most archaeologists are not, and do not want to be, programmers (though many programmers may want to be archaeologists). While large companies like Google, Microsoft, and Facebook are starting to make use of semantic data in searches (reviews, product searches, and flight times are examples), the main way of integrating semantic data into your own is to do it programmatically. Until off-the-shelf or discipline-specific tools make use of this information, most archaeologists will not be able to use it (or even understand its value). Within tDAR, we’ve started to build built-in tools that enable users to map, collate, and integrate data sets without being programmers. Faunal analysts have used these tools to look at use patterns across sites and continents, among other applications.
  3. Once data is in semantic form, how do you evaluate its quality? This is likely the final challenge: semantic or open data is useful only insofar as you can evaluate its quality. Leveraging data from the semantic web often means joining or comparing data sets by one aspect in order to gain an understanding of another, but this requires that those connections be evaluated and the quality of the data vetted before the connections are made, something that may be hard with online data sets.
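The coding-sheet mapping in point 1 and the integration tools in point 2 can be illustrated with a toy example: two projects record the same taxa under different local codes, and mapping both onto shared terms lets their counts be collated. All codes, taxa, and counts below are invented; this Python sketch is the general idea, not tDAR's actual implementation.

```python
# Toy sketch of coding-sheet-to-ontology mapping and data-set integration.
# Two projects use different local codes for the same taxa; mapping each
# code onto a shared term allows the counts to be combined.

site_a_codes = {"BV": "Bos taurus", "OV": "Ovis aries"}      # project A's coding sheet
site_b_codes = {"cow": "Bos taurus", "sheep": "Ovis aries"}  # project B's coding sheet

site_a_data = [("BV", 12), ("OV", 3)]    # (local code, bone count)
site_b_data = [("cow", 7), ("sheep", 9)]

def integrate(datasets):
    """Collate counts from multiple datasets onto shared ontology terms."""
    totals = {}
    for mapping, rows in datasets:
        for code, count in rows:
            term = mapping[code]  # translate the local code to the shared term
            totals[term] = totals.get(term, 0) + count
    return totals

combined = integrate([(site_a_codes, site_a_data), (site_b_codes, site_b_data)])
# combined: {"Bos taurus": 19, "Ovis aries": 12}
```

The manual work the post describes lives almost entirely in building the `site_a_codes`-style mappings; once those exist, the integration itself is mechanical.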

In summary, none of these challenges is insurmountable. We have organizations dedicated to the preservation and use of digital data, and we have tools that are evolving to make it easier to ask and answer questions we could only dream of in the past, linking data together and making new connections.

What we must work together to do is continue to change the culture of archaeology to ensure that both legacy and new data are properly archived and preserved. And the challenge for technologists is to build tools that empower non-programmers to analyze and reuse data in new ways.