Digital illustration showing sheets from a calendar strewn about on a flat surface and out of focus, with a vector outline of a calendar containing lines of code bringing one sheet into focus

Updating the open source python-edtf library to support the EDTF 2019 standard

Written by Cole Crawford
Published on 1 August 2024

Tagged under: open source, edtf

About the author

Cole Crawford, a software engineer in Harvard’s Arts and Humanities Research Computing unit, merges literary studies with software development, aiding scholars in creating digital collections and research applications. His research highlights eighteenth- and nineteenth-century British labouring-class writers.


In 2017, Greg Turner wrote about the EDTF standard, why it's important, and the initial development of our python-edtf library. We made that library available through PyPI and GitHub under an open source licence.

Recently, Cole Crawford got in touch hoping to contribute an update to the library so that it supports a more recent version of the standard.

In this guest article, he talks about why python-edtf supports his work at the Arts and Humanities Research Computing unit at Harvard University and what the experience of contributing was like.

+++

After working on many digital humanities projects over the past decade, I have noticed that many grapple with a few similar, persistent problems. The most widespread is how to store, sort, and display uncertain, partial, or otherwise incomplete dates. This issue also frequently affects datasets in libraries, museums, and archives.

As a Senior Software Engineer with Harvard University Arts and Humanities Research Computing, I work closely with faculty to develop research software and applications. In Mapping Color in History (MCH), Dr. Jinah Kim and her research team of art historians and conservation scientists are compiling a database of pigment analysis in the context of Asian painting. The project uses Raman spectroscopy, XRF, optical microscopy, and other analytical methodologies to trace how pigments are used to create colour in artworks from roughly 1000 to 1800 AD.

Composite of three images. Top left: Miniature artwork from the Mughal Empire depicting a couple seated beneath a tree in a peaceful scene. Top right: A machine on a table features a work of art next to it, indicating preparation for scanning and analysis. Bottom: Colour spectrum graph displaying the composite of a pigment sampled from a scanned artwork.

Top left: Artist Unknown, In Praise of the Simple Life (Detail), 1588, Lahore, Punjab, Pakistan. Harvard Art Museums.

Top right: In-situ XRF analysis. Mapping Color in History Project, Harvard University.

Bottom: Graph showing how XRF is applied to the painting to determine that copper green is the most likely pigment used to create the vivid green hue within the artwork "In Praise of the Simple Life". Mapping Color in History Project, Harvard University.

Assigning dates to paintings is a difficult task that requires archival and historical expertise, sometimes supplemented by clues from imaging data that set upper or lower bounds on a date estimate.

MCH initially modelled dates with a required start_date field, an optional end_date field, and string fields for human-readable display and description. This approach was quick to set up and initially met most of our needs, but it also introduced some undesirable quirks. The research team developed a “styleguide” to standardise data entry for the temporal data that research assistants (RAs) encountered in their archival research.

Mapping Colour In History research group: Style Guide for Dates

Certain Dates   | Date             | Start date | End date   | Notes
Specific date   | February 8, 1672 | 1672-02-08 | 1672-02-08 | Do not index like 2/8/2011 or 8 February 1672
Year            | 1147             | 1147-01-01 | 1147-12-31 |
                | 10 CE            | 10-01-01   | 10-12-31   | Do not use "AD". "CE" follows the date.
Year range      | 1660-1680        | 1660-01-01 | 1680-12-31 | Do not abbreviate years.
                | 50-57 CE         | 50-01-01   | 57-12-31   |
"Before date"   | before 1758      | 1753-01-01 | 1757-12-31 | Subtract more than 5 years where necessary.
"After date"    | after 1865       | 1865-01-01 | 1870-01-01 | Add more than 5 years where necessary.
Decade date     | 1660s            | 1660-01-01 | 1669-12-31 |
Century date    | 16th century     | 1500-01-01 | 1599-12-31 |
                | 2nd century CE   | 100-01-01  | 199-12-31  |
Partial century | early 16th       | 1500-01-01 | 1533-12-31 | Begin or end

As the research team added new artworks to the project and uncovered additional edge cases with historical dates, it eventually became clear that this model was no longer sufficiently precise, powerful, or flexible. We were shoehorning data into a data model that didn’t fully address the project’s needs.

The application could handle date ranges, but often forced researchers to be too precise with their computed dates because it couldn’t handle partial dates, approximation, or uncertainty - for instance, assigning “1147-01-01” even though the only known date component was “1147”. Useful qualifying information was often relegated to the label or description fields, which were unusable for computational purposes like sorting or filtering.
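
To make the problem concrete, here is a minimal sketch of a start/end date record in plain Python. The `ArtworkDate` class and `from_year` helper are hypothetical stand-ins, not the project's actual Django model. They only illustrate how a bare year, an approximate year, and an explicit full-year range collapse to identical computed bounds, leaving the free-text label as the only place where the difference survives.

```python
from dataclasses import dataclass
from datetime import date
from typing import Optional, Tuple


@dataclass
class ArtworkDate:
    """Hypothetical stand-in for a start_date / end_date / display model."""

    start_date: date
    end_date: Optional[date] = None
    display: str = ""

    def bounds(self) -> Tuple[date, Optional[date]]:
        # The only values available for sorting and filtering.
        return (self.start_date, self.end_date)


def from_year(year: int, display: str) -> ArtworkDate:
    # Style-guide rule for a year: 1 January to 31 December.
    return ArtworkDate(date(year, 1, 1), date(year, 12, 31), display)


year_only = from_year(1147, "1147")
approximate = from_year(1147, "circa 1147")
explicit_range = ArtworkDate(
    date(1147, 1, 1), date(1147, 12, 31), "1 January to 31 December 1147"
)

# All three records share the same computed bounds; the qualifiers
# ("circa", "only the year is known") live only in the display string.
same_bounds = year_only.bounds() == approximate.bounds() == explicit_range.bounds()
print(same_bounds)
```

Any query that sorts or filters on `start_date` and `end_date` treats these three records as the same date, which is the loss of information described above.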

These problems led me to seek a new approach for capturing historical dates. I soon found the Extended Date/Time Format (EDTF) Specification, which encompasses the ISO 8601-1 and ISO 8601-2 standards for handling date/time strings. EDTF provides a standard but powerful way to represent partial dates, approximation, and uncertainty - precisely the challenges I was struggling to solve in Mapping Color in History.
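
As a rough illustration of what this looks like in practice, the sketch below pairs some of the style guide's cases with EDTF strings from the 2019 specification. The pairing is an assumption made for illustration: the article does not say which encodings the project adopted, and the open-ended intervals simply mirror the bounds used in the style guide.

```python
# One possible EDTF (2019) encoding for some of the style guide's cases.
# Illustrative only: the labels and the choice of encoding are assumptions.
STYLE_GUIDE_AS_EDTF = [
    ("Specific date: February 8, 1672", "1672-02-08"),  # Level 0 date
    ("Year: 1147", "1147"),                             # Level 0 year, no invented day or month
    ("Year range: 1660-1680", "1660/1680"),             # Level 0 interval
    ("Before 1758", "../1757"),                         # Level 1 open start
    ("After 1865", "1865/.."),                          # Level 1 open end
    ("Decade: 1660s", "166X"),                          # Level 1 unspecified digit
    ("Century: 16th century", "15XX"),                  # Level 1 unspecified digits
    ("Approximate: circa August 1979", "1979-08~"),     # Level 1 approximate
    ("Uncertain: possibly 1984", "1984?"),              # Level 1 uncertain
]

for label, edtf_string in STYLE_GUIDE_AS_EDTF:
    print(f"{label:35} {edtf_string}")
```

Each case becomes a single string that carries its own precision and qualification, rather than a pair of over-precise dates plus a free-text note.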

Mapping Color in History is a Django project, so finding a Python library that implements EDTF was a natural next step. The most popular library was python-edtf, maintained by The Interaction Consortium. I was thrilled to find that it even included an EDTF field for easy Django integration.

>>> from edtf import parse_edtf

# Parse an EDTF string to an EDTFObject
>>> e = parse_edtf("1979-08~")  # approx August 1979
>>> e
UncertainOrApproximate: '1979-08~'

# Normalised string representation (some different EDTF strings have identical meanings)
>>> str(e)
'1979-08~'

# Derive Python date objects

# Lower and upper bounds that strictly adhere to the given range
>>> e.lower_strict()[:3], e.upper_strict()[:3]
((1979, 8, 1), (1979, 8, 31))

# Lower and upper bounds that are padded if there's indicated uncertainty
>>> e.lower_fuzzy()[:3], e.upper_fuzzy()[:3]
((1979, 7, 1), (1979, 9, 30))

# Date intervals
>>> interval = parse_edtf("1979-08~/..")
>>> interval
Level1Interval: '1979-08~/..'

# Intervals have lower and upper EDTF objects
>>> interval.lower, interval.upper
(UncertainOrApproximate: '1979-08~', UnspecifiedIntervalSection: '..')
>>> interval.lower.lower_strict()[:3], interval.lower.upper_strict()[:3]
((1979, 8, 1), (1979, 8, 31))
>>> interval.upper.upper_strict()  # '..' denotes an open interval, so the bound is -/+ math.inf
inf

# Date collections
>>> coll = parse_edtf('{1667,1668, 1670..1672}')
>>> coll
MultipleDates: '{1667, 1668, 1670..1672}'
>>> coll.objects
(Date: '1667', Date: '1668', Consecutives: '1670..1672')
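
The difference between the strict and fuzzy bounds in the session above can be pictured with a small standard-library sketch. The `month_bounds` function is hypothetical and is not part of python-edtf. The one-month pad is an assumption chosen only because it reproduces the output shown for '1979-08~'; the library's own padding rules may be implemented differently.

```python
import calendar
from datetime import date
from typing import Tuple


def month_bounds(year: int, month: int, pad_months: int = 0) -> Tuple[date, date]:
    """Return the first and last day of a month, widened by pad_months each side.

    Hypothetical helper for illustration; not python-edtf's implementation.
    """
    # Work with a zero-based month counter so padding can cross year boundaries.
    centre = year * 12 + (month - 1)
    start_year, start_month0 = divmod(centre - pad_months, 12)
    end_year, end_month0 = divmod(centre + pad_months, 12)
    last_day = calendar.monthrange(end_year, end_month0 + 1)[1]
    return (
        date(start_year, start_month0 + 1, 1),
        date(end_year, end_month0 + 1, last_day),
    )


strict = month_bounds(1979, 8)               # exactly August 1979
fuzzy = month_bounds(1979, 8, pad_months=1)  # widened because of the '~' qualifier

print(strict)
print(fuzzy)
```

Storing both pairs lets an application sort on the strict bounds while still matching the padded range in searches.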

I began integrating the module, but quickly realized that it only implemented the 2012 draft specification of EDTF. I considered providing the research team the old spec for data entry instead, or trying to back-convert the EDTF strings, but instead decided to carve out a bit more time to contribute to the package and help upgrade it.

I reached out to Dr. Sabine Müller, who had begun porting some of the specification changes but did not complete the transition work. With Dr. Müller’s permission, I built on her fork in collaboration with Dr. Alastair Weakley from the Interaction Consortium.

  • I implemented a modern pyproject.toml build system;
  • switched the testing framework to pytest and the CI to GitHub Actions;
  • added code coverage checks;
  • upgraded to Django 5 and Python 3.12 compatibility;
  • updated the natural language parser to work with the 2019 specification;
  • added an example Django app and integration tests for the EDTF field;
  • updated documentation;
  • improved error handling and readability of errors;
  • and implemented significant digits, exponential year precision, and qualified dates at all precision levels.
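
For readers unfamiliar with the last item, the sketch below illustrates what significant digits and exponential years mean in the 2019 specification, using plain Python. Both function names are hypothetical, and this is not how python-edtf implements these features. It only works through the arithmetic behind strings such as '1950S2' (the year 1950, accurate to two significant digits) and 'Y-17E7' (the year -17 × 10^7).

```python
from typing import Tuple


def exponential_year(mantissa: int, exponent: int) -> int:
    """Expand an exponential year such as 'Y-17E7' (mantissa -17, exponent 7)."""
    return mantissa * 10 ** exponent


def significant_digit_range(year: int, digits: int) -> Tuple[int, int]:
    """Return the span of years implied by a significant-digits suffix.

    For '1950S2' only the first two digits are significant, so the year
    lies somewhere between 1900 and 1999. This sketch handles positive
    years only.
    """
    if year <= 0:
        raise ValueError("this sketch only handles positive years")
    width = len(str(year))
    if not 1 <= digits <= width:
        raise ValueError("digits must be between 1 and the number of digits in the year")
    step = 10 ** (width - digits)
    lower = (year // step) * step
    return lower, lower + step - 1


print(significant_digit_range(1950, 2))                         # '1950S2'
print(exponential_year(-17, 7))                                 # 'Y-17E7'
print(significant_digit_range(exponential_year(3388, 2), 3))    # 'Y3388E2S3'
```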

Alastair provided thoughtful and timely feedback on my pull requests (PRs), implemented linting across the project with ruff, added benchmark testing and an automatically generated GitHub Pages benchmark site, and improved the parser's performance by enabling packrat parsing.

I had previously contributed small PRs to open-source packages, and had even open-sourced some of my own, but I had never worked at such length on an existing open source project in tandem with the package maintainers. Having the existing EDTF spec as a guideline was particularly helpful, as it provided Alastair and me with examples and references for implementing some of the more esoteric requirements and helped define what “done” consisted of.

While I have more ideas for future improvements to the library, such as type checking, it is in much better shape than it was two months ago. I found the process to be a wonderfully productive collaboration that resulted in a v5 release of python-edtf which is fully compliant with the 2019 EDTF spec. The next version of Mapping Color in History will use this new release, and I look forward to using python-edtf in future projects as well.

A composite of two separate images. Top: A colourful Indian work of art featuring a group of men and women in traditional attire. Bottom: A woman examines an old work of art under a microscope in a laboratory setting.

Top: Artist Unknown, Lakshmana Removes a Thorn from Rama’s Foot, c.1700-1710, Himachal Pradesh, India

Bottom: MCH Mobile Heritage Lab (taken at a collaborating site at the Asiatic Society in Mumbai) Mapping Color in History Project, Harvard University

We want to thank Cole for not only putting so much work into improving python-edtf, but also taking the time to write about the experience.

If you're working on a Python project that deals with dates in a cultural collection, python-edtf might save you some time and tears. You can check out the new version (v5.0.0) that Cole has made possible on PyPI and GitHub.
