Updating the open source python-edtf library to support the EDTF 2019 standard
Written by
Cole Crawford
Published on 1 August 2024
About the author
Cole Crawford, a software engineer in Harvard’s Arts and Humanities Research Computing unit, merges literary studies with software development, aiding scholars in creating digital collections and research applications. His research highlights eighteenth and nineteenth-century British labouring class writers.
In 2017, Greg Turner wrote about the EDTF standard, why it's important and the initial development of our python-edtf library. We made that library available through PyPI and GitHub with an open source licence.
Recently, Cole Crawford got in touch hoping to contribute an update to the library so that it supports a more recent version of the standard.
In this guest article, he talks about why python-edtf supports his work at the Arts and Humanities Research Computing unit at Harvard University and what the experience of contributing was like.
+++
After working on many digital humanities projects over the past decade, I have noticed that many grapple with a few similar, persistent problems. The most widespread is how to store, sort, and display uncertain, partial, or otherwise incomplete dates. This issue also frequently affects datasets in libraries, museums, and archives.
As a Senior Software Engineer with Harvard University Arts and Humanities Research Computing, I work closely with faculty to develop research software and applications. In Mapping Color in History (MCH), Dr. Jinah Kim and her research team of art historians and conservation scientists are compiling a database of pigment analysis in the context of Asian painting. The project uses Raman spectroscopy, XRF, optical microscopy, and other analytical methodologies to trace how pigments are used to create colour in artworks from roughly 1000-1800AD.
Assigning dates to paintings is a difficult task that requires archival and historical expertise, sometimes supplemented by clues from imaging data that set upper or lower bounds on a date estimate.
MCH initially modelled dates with a required start_date field, an optional end_date field, and string fields for human-readable display and description. This approach was quick to set up and initially met most of our needs, but the format also introduced some undesirable quirks. The research team developed a “styleguide” to standardise data entry for the temporal data that research assistants (RAs) encountered in their archival research.
| Certain Dates | Date | Start date | End date | Notes |
|---|---|---|---|---|
| Specific date | February 8, 1672 | 1672-02-08 | 1672-02-08 | Do not index like 2/8/2011 or 8 February 1672 |
| Year | 1147 | 1147-01-01 | 1147-12-31 | |
| | 10 CE | 10-01-01 | 10-12-31 | Do not use "AD". "CE" follows the date. |
| Year range | 1660-1680 | 1660-01-01 | 1680-12-31 | Do not abbreviate years. |
| | 50-57 CE | 50-01-01 | 57-12-31 | |
| "Before date" | before 1758 | 1753-01-01 | 1757-12-31 | Subtract more than 5 years where necessary. |
| "After date" | after 1865 | 1865-01-01 | 1870-01-01 | Add more than 5 years where necessary. |
| Decade date | 1660s | 1660-01-01 | 1669-12-31 | |
| Century date | 16th century | 1500-01-01 | 1599-12-31 | |
| | 2nd century CE | 100-01-01 | 199-12-31 | |
| Partial century | early 16th | 1500-01-01 | 1533-12-31 | Begin or end |
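To make the styleguide concrete, here is a minimal Python sketch of how two of its rules translate into start and end dates. The function names and the five-year default window are illustrative assumptions taken from the table, not code from the MCH project.

```python
from datetime import date


def year_range_bounds(first_year: int, last_year: int) -> tuple[date, date]:
    """'Year' and 'Year range' rows: 1 January of the first year to 31 December of the last."""
    return date(first_year, 1, 1), date(last_year, 12, 31)


def before_bounds(year: int, window: int = 5) -> tuple[date, date]:
    """'Before date' row: a window of whole years that ends the day before `year` begins."""
    return date(year - window, 1, 1), date(year - 1, 12, 31)


# Hypothetical helpers; the inputs are taken from rows of the styleguide table.
print(year_range_bounds(1147, 1147))  # the bare year 1147
print(year_range_bounds(1660, 1680))  # the range 1660-1680
print(before_bounds(1758))            # "before 1758"
```

Once stored this way, the bare year 1147 and a fully documented span from 1 January to 31 December 1147 are indistinguishable, which is one way the model overstates precision.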
As the research team added new artworks to the project and uncovered additional edge cases with historical dates, it eventually became clear that this model was no longer sufficiently precise, powerful, or flexible. We were shoehorning data into a data model that didn’t fully address the project’s needs.
The application could handle date ranges, but often forced researchers to be too precise with their computed dates because it couldn’t handle partial dates, approximation, or uncertainty - for instance, assigning “1147-01-01” even though the only known date component was “1147”. Useful qualifying information was often relegated to the label or description fields, which were unusable for computational purposes like sorting or filtering.
These problems led me to seek a new approach for capturing historical dates. I soon found the Extended Date/Time Format (EDTF) Specification, which encompasses the ISO 8601-1 and ISO 8601-2 standards for handling date/time strings. EDTF provides a standard but powerful way to represent partial dates, approximation, and uncertainty - precisely the challenges I was struggling to solve in Mapping Color in History.
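For instance, EDTF writes a decade or century by replacing the unknown trailing digits of a year with X, so the styleguide's "1660s" becomes 166X and "16th century" becomes 15XX. The sketch below is a standard-library illustration of how such a string implies a range. The helper name is hypothetical, and this is not how python-edtf itself parses dates.

```python
import re
from datetime import date


def unspecified_year_bounds(edtf_year: str) -> tuple[date, date]:
    """Earliest and latest dates covered by a four-character year such as '166X' or '15XX'."""
    if len(edtf_year) != 4 or not re.fullmatch(r"[0-9]+X*", edtf_year):
        raise ValueError(f"expected digits followed by X placeholders, got {edtf_year!r}")
    first_year = int(edtf_year.replace("X", "0"))  # smallest year the pattern allows
    last_year = int(edtf_year.replace("X", "9"))   # largest year the pattern allows
    return date(first_year, 1, 1), date(last_year, 12, 31)


print(unspecified_year_bounds("166X"))  # the 1660s
print(unspecified_year_bounds("15XX"))  # the 16th century
```

Python's date type cannot represent year 0 or earlier, so this sketch only covers years from 1 CE onwards.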
Mapping Color in History is a Django project, so finding a Python library to implement EDTF was a natural next step. The most popular library was python-edtf, maintained by The Interaction Consortium. I was thrilled to find that it even included an EDTF field for easy Django integration.
>>> from edtf import parse_edtf

# Parse an EDTF string to an EDTFObject
>>> e = parse_edtf("1979-08~")  # approx August 1979
>>> e
UncertainOrApproximate: '1979-08~'

# Normalised string representation (some different EDTF strings have identical meanings)
>>> str(e)
'1979-08~'

# Derive Python date objects
# Lower and upper bounds that strictly adhere to the given range
>>> e.lower_strict()[:3], e.upper_strict()[:3]
((1979, 8, 1), (1979, 8, 31))

# Lower and upper bounds that are padded if there's indicated uncertainty
>>> e.lower_fuzzy()[:3], e.upper_fuzzy()[:3]
((1979, 7, 1), (1979, 9, 30))

# Date intervals
>>> interval = parse_edtf("1979-08~/..")
>>> interval
Level1Interval: '1979-08~/..'

# Intervals have lower and upper EDTF objects
>>> interval.lower, interval.upper
(UncertainOrApproximate: '1979-08~', UnspecifiedIntervalSection: '..')
>>> interval.lower.lower_strict()[:3], interval.lower.upper_strict()[:3]
((1979, 8, 1), (1979, 8, 31))

# '..' is interpreted as an open interval, so the bound is -/+ math.inf
>>> interval.upper.upper_strict()
inf

# Date collections
>>> coll = parse_edtf('{1667,1668, 1670..1672}')
>>> coll
MultipleDates: '{1667, 1668, 1670..1672}'
>>> coll.objects
(Date: '1667', Date: '1668', Consecutives: '1670..1672')
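The fuzzy bounds in the example are consistent with padding an approximate month by one month on each side. The following standard-library sketch reproduces those numbers under that assumption. The function names are hypothetical, and python-edtf's own padding logic and defaults may differ.

```python
import calendar
from datetime import date


def shift_month(year: int, month: int, delta: int) -> tuple[int, int]:
    """Move a (year, month) pair forwards or backwards by `delta` months."""
    index = year * 12 + (month - 1) + delta
    return index // 12, index % 12 + 1


def strict_month_bounds(year: int, month: int) -> tuple[date, date]:
    """First and last day of the given month."""
    last_day = calendar.monthrange(year, month)[1]
    return date(year, month, 1), date(year, month, last_day)


def fuzzy_month_bounds(year: int, month: int, pad_months: int = 1) -> tuple[date, date]:
    """Widen the month by `pad_months` whole months on each side (an assumed padding rule)."""
    lower_year, lower_month = shift_month(year, month, -pad_months)
    upper_year, upper_month = shift_month(year, month, pad_months)
    lower, _ = strict_month_bounds(lower_year, lower_month)
    _, upper = strict_month_bounds(upper_year, upper_month)
    return lower, upper


print(strict_month_bounds(1979, 8))  # the strict bounds for August 1979
print(fuzzy_month_bounds(1979, 8))   # the padded bounds for "1979-08~"
```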
I began integrating the module, but quickly realised that it only implemented the 2012 draft specification of EDTF. I considered giving the research team the old spec for data entry, or trying to back-convert the EDTF strings, but decided instead to carve out a bit more time to contribute to the package and help upgrade it.
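For a sense of what changed between the two versions, the sketch below rewrites a handful of 2012-draft notations into their 2019 equivalents: "u" for an unspecified digit became "X", the combined "?~" qualifier became "%", the interval keywords "unknown" and "open" became an empty endpoint and "..", and the long-year prefix "y" became "Y". The function is hypothetical, covers only these simple cases (not, for example, the draft's parenthesised partial qualifiers), and is not part of python-edtf.

```python
def draft_to_2019(edtf_2012: str) -> str:
    """Rewrite a few simple 2012-draft EDTF notations in 2019 syntax (illustrative only)."""
    converted = []
    for part in edtf_2012.split("/"):
        if part == "unknown":
            converted.append("")       # unknown interval endpoint is left empty
        elif part == "open":
            converted.append("..")     # open interval endpoint
        else:
            part = part.replace("?~", "%")  # uncertain and approximate
            part = part.replace("u", "X")   # unspecified digit
            if part.startswith("y"):
                part = "Y" + part[1:]       # long-year prefix
            converted.append(part)
    return "/".join(converted)


for draft in ["199u", "1984?~", "unknown/2006", "2004-06-01/open", "y170000002"]:
    print(draft, "->", draft_to_2019(draft))
```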
I reached out to Dr. Sabine Müller, who had begun porting some of the specification changes but did not complete the transition work. With Dr. Müller’s permission, I built on her fork in collaboration with Dr. Alastair Weakley from the Interaction Consortium.
- I implemented a modern pyproject.toml build system;
- switched the testing framework to Pytest and the CI to GitHub Actions;
- added code coverage checks;
- upgraded to Django 5 and Python 3.12 compatibility;
- updated the natural language parser to work with the 2019 specification;
- added an example Django app and integration tests for the EDTF field;
- updated documentation;
- improved error handling and readability of errors;
- and implemented significant digits, exponential year precision, and qualified dates at all precision levels.
Alastair provided thoughtful and timely feedback on my pull requests (PRs), implemented linting across the project with ruff, added benchmark testing and an automatically generated GitHub Pages benchmark site, and improved the parser performance by adding packrat.
I had previously contributed small PRs to open-source packages, and had even open-sourced some of my own, but I had never worked at such length on an existing open source project in tandem with the package maintainers. Having the existing EDTF spec as a guideline was particularly helpful, as it provided Alastair and me with examples and references for implementing some of the more esoteric requirements and helped define what “done” consisted of.
While I have more ideas for future improvements to the library, such as type checking, it is in much better shape than it was two months ago. I found the process to be a wonderfully productive collaboration that resulted in a v5 release of python-edtf which is fully compliant with the 2019 EDTF spec. The next version of Mapping Color in History will use this new release, and I look forward to using python-edtf in future projects as well.
We want to thank Cole for not only putting so much work into improving python-edtf, but also taking the time to write about the experience.
If you're working on a Python project that deals with dates in a cultural collection, python-edtf might save you some time and tears. You can check out the new version – v5.0.0 – that Cole has made possible on PyPI and GitHub.