Data migration best practice: moving 10,000 pages for the Australian Museum’s website rebuild

Written by Dr Alastair Weakley
Published on 6 June 2019

About the author

Alastair co-founded the Interaction Consortium in 2009 and serves as one of the studio's Principal Developers. He has a degree in Design and Technology and, after more than a decade in a first career as a product designer, returned to study for a Master's degree in Information Technology. He subsequently completed a PhD in Computing Science ("Internet-based Support for Creative Collaboration", 2007).

Alastair has collaborated with artists on exhibited interactive artworks as well as publishing in the disciplines of HCI, Information Systems, Information Visualisation and Presence.

The Interaction Consortium recently undertook a large and complex data migration – involving more than 10,000 pages – during the site rebuild for the Australian Museum. We outline below the best practice principles we applied to the Australian Museum website migration, which we use for all the large data migrations we undertake. We also go into detail here about specific approaches we take involving Django and Wagtail.

Structure mismatches, mapping models: beyond ‘trust, but verify’

Moving your website to a different platform often requires a migration of data from the old system to the new. If you only have a few pages, and you’re able to get the two systems running at the same time, then you can manually copy the data across. But when there are many thousands of items that need to be moved, you probably need to think about automating the process.

We recently undertook such a migration – involving more than 10,000 pages – during the site rebuild we did for the Australian Museum. The principles we applied to this site's data migration are ones we use for all of our large site migrations. There are also a number of approaches we use during data migration involving the Django web framework and Wagtail content management system.

In many of the site migrations we deal with, the structure of pages and the approach to editing in the old system are quite different to those of the new system. Additionally, in the example described here, the page tree of the site was also being updated, so many pages were moving to new URLs.

One issue that needs particular attention is verification of the migrated data. At the end of the process we need to be sure that everything that needed to change has changed in the correct way and, most importantly, that nothing has been lost. Three things make this hard: the change in page structure, the sheer size of any checking task, and uncertainty about exactly how the legacy system interpreted or displayed its data. For that reason we find it really useful to retain access to the legacy data, so that we can always refer back to it, or even re-migrate a few items if something was overlooked.

When undertaking large-scale migrations of big websites over the past 10 years or so, we’ve tried various approaches. The one we often return to is to set up a `legacy` app within the new project. If there’s a Django object-relational mapping (ORM) backend available for the legacy database system, then we’ll often connect to a copy of the original database directly with a read-only router. In some cases where no ORM backend is available, we’ve been able to write scripts that transform the dumped data into a file we can load into a Postgres database.
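As a rough illustration of the read-only router idea (a sketch, not the code used on the project), something like the following could work. The app label `legacy` comes from the text above; the database alias `legacy`, the class name and the module path in the settings comment are our assumptions. Raising an error on writes is just one way to make accidental saves fail loudly.

```python
class LegacyReadOnlyRouter:
    """Send reads for the `legacy` app to the legacy database and refuse writes."""

    legacy_app = "legacy"  # app label (from the text above)
    legacy_db = "legacy"   # assumed alias in settings.DATABASES

    def db_for_read(self, model, **hints):
        if model._meta.app_label == self.legacy_app:
            return self.legacy_db
        return None  # no opinion: other routers or the default database decide

    def db_for_write(self, model, **hints):
        if model._meta.app_label == self.legacy_app:
            # Fail loudly rather than risk changing the reference copy.
            raise RuntimeError("The legacy database is read-only")
        return None

    def allow_migrate(self, db, app_label, model_name=None, **hints):
        # Never run migrations for the legacy app or against the legacy database.
        if app_label == self.legacy_app or db == self.legacy_db:
            return False
        return None


# In settings.py (names assumed):
# DATABASES = {"default": {...}, "legacy": {...}}
# DATABASE_ROUTERS = ["legacy.routers.LegacyReadOnlyRouter"]
```

A router is a plain class, so it can be exercised without a database, which makes it easy to check the routing rules before pointing it at real data.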

Using Django’s `inspectdb` command, we can generate models from the legacy schema, and then we knock up a read-only admin for each model using something such as a generic read-only admin class. It only needs to be quick-and-dirty, as it’s just there to reassure us that we’re looking at the right data and that we’ve interpreted the foreign keys and other relations correctly.
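For illustration, a quick-and-dirty read-only admin could be built from a small mixin like the one below. This is a minimal sketch rather than the class we linked to: the mixin name, the `LegacyPage` model and its `title` field are hypothetical. The models themselves could be generated with something like `python manage.py inspectdb --database legacy > legacy/models.py`, where the `legacy` alias is again an assumption.

```python
class ReadOnlyAdminMixin:
    """Mix into django.contrib.admin.ModelAdmin to make a model browse-only."""

    def get_readonly_fields(self, request, obj=None):
        # Show every concrete field on the model as read-only.
        return [field.name for field in self.model._meta.fields]

    def has_add_permission(self, request):
        return False

    def has_delete_permission(self, request, obj=None):
        return False


# Usage inside the legacy app's admin.py (model and field names are hypothetical):
#
# from django.contrib import admin
# from .models import LegacyPage
#
# @admin.register(LegacyPage)
# class LegacyPageAdmin(ReadOnlyAdminMixin, admin.ModelAdmin):
#     list_display = ("id", "title")
```

The mixin goes first in the class bases so that its methods take precedence over the `ModelAdmin` defaults.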

For the models in the new system, we add whatever fields we need so that we can always refer back to the source data. If the source structure is flat, then often `legacy_id` is all we need, but it’s sometimes more complex than that. This way, if we miss something in the migration, we can always go back to it. Once we’re sure it has all gone well, we can discard the `legacy` app and remove the connection to the legacy database.
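A `legacy_id` on every migrated record also makes a basic "nothing has been lost" check cheap. The sketch below is only an illustration of that idea: the function name and report keys are ours, and in a real project the two ID lists might come from something like `values_list("id", flat=True)` on the legacy model and `values_list("legacy_id", flat=True)` on the new one.

```python
from collections import Counter


def check_migration(source_ids, migrated_legacy_ids):
    """Compare IDs in the legacy data with the legacy_id values on migrated records."""
    source = set(source_ids)
    counts = Counter(migrated_legacy_ids)
    migrated = set(counts)
    return {
        # In the legacy data but never migrated.
        "missing": sorted(source - migrated),
        # Migrated records pointing at a legacy ID that does not exist.
        "unexpected": sorted(migrated - source),
        # Legacy records that were migrated more than once.
        "duplicated": sorted(i for i, n in counts.items() if n > 1),
    }


report = check_migration([1, 2, 3, 4], [1, 2, 2, 5])
print(report)
```

An empty list under every key is the result we would hope for; anything else gives a short list of records to go and look at in the legacy admin.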

‘Legacy bodies’: empowering editors, harnessing old and new

Another trick – courtesy of Thomas Ashelford, a former director of the IC – which we used during the recent Australian Museum migration, was to add a ‘legacy body’ field to the newly imported page models. We added a checkbox so that users could specify whether the legacy body or the new one should be displayed, and where legacy was chosen (the default), we made sure the template worked adequately.

Obviously the legacy body wasn’t as sophisticated as the new page design, but it was fine and functional. This meant that we could import the pages from the old system, and the editors did not necessarily have to go through and fix up every single one before the new site was launched. They could take their time working on each page, referring to the legacy content as they created the new component-based content with the new editing system. Because of Wagtail’s publishing mechanism, they could preview the new content while the public still saw the legacy content.
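The switch between the two bodies can be pictured with a small stand-in for the page model. This is a plain-Python sketch so that it runs without a Django project; on a real Wagtail page these would be model fields, and every name here apart from `legacy_id` is our assumption.

```python
from dataclasses import dataclass


@dataclass
class ImportedPage:
    """Stand-in for an imported page model (field names are assumptions)."""

    legacy_id: int
    title: str
    legacy_body: str = ""         # HTML carried over from the old system
    body: str = ""                # content built with the new editing system
    use_legacy_body: bool = True  # legacy is the default, as described above

    def body_for_display(self):
        # The template shows the legacy body until an editor opts out.
        return self.legacy_body if self.use_legacy_body else self.body


page = ImportedPage(legacy_id=42, title="Frogs", legacy_body="<p>Old frogs page</p>")
page.body = "<p>New component-based frogs page</p>"
print(page.body_for_display())  # still the legacy body
page.use_legacy_body = False
print(page.body_for_display())  # now the new body
```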

In this project it was important to allow the editors to continue to make updates on the legacy site while at the same time editing was happening on the imported data on the new site. We also could not get a live connection to the legacy database, so we were running our own Docker container for our copy of that legacy data. Also, the page tree was being restructured, so we knew things would be moving around quite a lot, although we did not know the final tree layout at the start.

To get around all these things we started with a dump of the data, carried out the process described above, and migrated all the page types into the new CMS, but into a flat page tree. There was a branch for blog posts, another for general pages and so on, but no more structure than that.

Then the editors got to work on those newly imported pages. When they were happy with the new content, they deselected ‘use legacy body’ and those pages were good to go. Meanwhile, editing continued on the legacy site.

When we got an updated data dump from the legacy system, we overwrote the `legacy_body`, `title`, `introduction` and similar fields on every page where ‘use legacy body’ was still selected, and left every other page untouched. This way we wouldn’t mess up the editors’ work on the new site, but could still benefit from their work on the legacy site. The simple structure and the use of the `legacy_id` field meant it was easy to match up the previously imported pages.
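The re-import rule can be sketched as follows. This is an illustration under assumptions rather than the project's import script: the function name is ours, the dump rows are assumed to have already been mapped to the new field names with `id` holding the legacy ID, and `SimpleNamespace` objects stand in for page instances, which in a real project would be saved after the update.

```python
from types import SimpleNamespace

# Fields refreshed from each new dump (taken from the description above).
SYNCED_FIELDS = ("title", "introduction", "legacy_body")


def resync_from_dump(pages, dump_rows, fields=SYNCED_FIELDS):
    """Refresh legacy-sourced fields, but only on pages still using the legacy body.

    Returns the pages that were actually changed.
    """
    rows_by_id = {row["id"]: row for row in dump_rows}
    updated = []
    for page in pages:
        row = rows_by_id.get(page.legacy_id)
        if row is None or not page.use_legacy_body:
            continue  # no source row, or an editor has taken over this page
        changed = False
        for name in fields:
            if getattr(page, name) != row[name]:
                setattr(page, name, row[name])
                changed = True
        if changed:
            updated.append(page)
    return updated


pages = [
    SimpleNamespace(legacy_id=1, use_legacy_body=True, title="Old",
                    introduction="", legacy_body="<p>v1</p>"),
    SimpleNamespace(legacy_id=2, use_legacy_body=False, title="Reworked",
                    introduction="New intro", legacy_body="<p>v1</p>"),
]
dump = [
    {"id": 1, "title": "Old (edited)", "introduction": "", "legacy_body": "<p>v2</p>"},
    {"id": 2, "title": "Changed on legacy", "introduction": "", "legacy_body": "<p>v2</p>"},
]
changed = resync_from_dump(pages, dump)
print([page.legacy_id for page in changed])
```

Here only page 1 is refreshed; page 2 keeps the editors' new title and introduction because ‘use legacy body’ has been deselected.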

Once that work was done, and the importing complete, we ran a second migration that shuffled all the pages into their final positions.
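How the final positions are specified will vary from project to project. Purely as an illustration, assume the final tree is supplied as a mapping from `legacy_id` to a target URL path (a hypothetical format). The moves then need to be ordered so that each parent is in place before its children, which the sketch below does by sorting on depth. In Wagtail, each planned move could then be applied with `page.move(parent, pos="last-child")`.

```python
def plan_moves(final_paths):
    """Return (legacy_id, parent_path, slug) tuples, shallowest pages first.

    `final_paths` maps legacy_id to a target path such as "/learn/species/".
    """
    moves = []
    for legacy_id, path in final_paths.items():
        parts = [part for part in path.split("/") if part]
        if not parts:
            raise ValueError(f"No target path for legacy_id {legacy_id}")
        if len(parts) > 1:
            parent_path = "/" + "/".join(parts[:-1]) + "/"
        else:
            parent_path = "/"  # top-level page, directly under the site root
        moves.append((len(parts), legacy_id, parent_path, parts[-1]))
    # Sort by depth so that parents are moved before their children.
    moves.sort()
    return [(legacy_id, parent, slug) for _, legacy_id, parent, slug in moves]


plan = plan_moves({
    7: "/learn/species/frogs/",
    3: "/learn/",
    5: "/learn/species/",
})
for legacy_id, parent_path, slug in plan:
    print(legacy_id, parent_path, slug)
```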

Australian Museum migration: where it landed

Overall the migration was a great success, and because we had an easy path back to the legacy data for each page, it was straightforward to verify and correct a few issues that came up.

In the coming weeks, we will be writing more about how we approached other technical challenges during the Australian Museum website rebuild project, such as integrations, our contribution to the digital brand extension and the work we did on the CMS to support accessibility in the final site. To be sure you receive details of these posts as soon as they are available, join our mailing list.