I need to remove a Scalar project that was done several years ago. Is there a way to download the project into something like a PDF? (Understanding that any dynamic media will be screenshots, etc.).
For Omeka/Neatline, I take screenshots and download the database and the public_html directory, but I was wondering if there was something more “book-like” for Scalar.
Unfortunately, Scalar does not have a PDF export that I’m aware of, so I would go the site-archiving route that you mentioned for Omeka.
In that direction, though, I can recommend a few site-archiving tools that may speed up the process for you. I personally like archiveweb.page, as it is a browser extension that lets you archive the site by visiting each page, and you then end up with a portable web archive file that can be viewed at replayweb.page.
If you are looking to make a flattened HTML version of the site that you can put on just about any web server, you could also check out SiteSucker on macOS or HTTrack on Windows (with a CLI version on Linux and macOS).
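For what it’s worth, here is roughly what driving the HTTrack CLI from a short Python script looks like. This is only a sketch with placeholder URLs and folders, not something tuned for a Scalar site, so check HTTrack’s options (rate limits, depth, filters) before pointing it at a live server.

```python
import subprocess
from pathlib import Path

# Placeholders: the site you want to flatten and where to put the mirror.
site = "https://example-scalar-project.org/"
out = Path("mirror/example-scalar-project")
out.mkdir(parents=True, exist_ok=True)

# httrack writes a browsable, flattened copy of the site into the -O folder;
# the trailing filter keeps the crawl on the same host.
subprocess.run(
    ["httrack", site, "-O", str(out), "+*.example-scalar-project.org/*"],
    check=True,
)
```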
I’ve said this a couple times to some people on these forums, but I would check out the archive work that Stanford University Press is doing in their digital archives.
I think they might be doing some of the best work in the country on this.
I’ve been looking into a suite of software made by a group called WebRecorder.
What got me interested in WebRecorder was a series of blog posts by the Stanford University Press digital team. They are working to archive their digital scholarship, using a combination of WebRecorder tools to archive and display projects and to store the files long-term in their open-access library archive.
Here’s their blog post where they describe some of their work:
Basically, they first archive the webpage using one of WebRecorder’s tools. There are two tools I am interested in: (1) a manual tool, where you visit a webpage and, as you move through each page, either a browser plugin or their standalone app saves it; and (2) a Docker-based command-line tool, where you give it a list of webpages to crawl automatically based on rules you define.

I’m going to attach to this email an archive of ed-beck.com that I made using their manual tool. I’m really impressed. If you open it, check out my blog post “Building the SUNY Math Accessibility Cohort.” The tool is powerful enough that it scooped up the embedded YouTube video and included it in the archive.

Their command-line tool runs in a Docker environment, and you can upload a list of websites you want it to crawl. What if that list of websites could be automatically generated? A list of all your DoOO sites? Just the sites you are about to remove from the server? The important websites?
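To give a sense of what that could look like, here is a rough Python sketch (my own illustration, not anything Stanford or WebRecorder publishes) that loops over a placeholder list of sites and runs their Docker-based crawler, browsertrix-crawler, once per site. The docker command follows the basic example in their README; the site list and collection names are made up, and you would want to check their docs for scoping and politeness options.

```python
import subprocess
from pathlib import Path

# Hypothetical list -- imagine it generated automatically from a DoOO
# inventory, or from the set of sites scheduled for removal.
sites = [
    "https://example-one.sunycreate.cloud/",
    "https://example-two.sunycreate.cloud/",
]

crawls = Path("crawls").resolve()
crawls.mkdir(exist_ok=True)

for url in sites:
    # Collection name taken from the hostname, e.g. "example-one".
    name = url.split("//")[1].split(".")[0]
    subprocess.run(
        [
            "docker", "run", "--rm",
            "-v", f"{crawls}:/crawls/",
            "webrecorder/browsertrix-crawler", "crawl",
            "--url", url,
            "--generateWACZ",      # produce a portable .wacz web archive
            "--collection", name,
        ],
        check=True,
    )
    # The .wacz for each site should end up under crawls/collections/<name>/
```

From there, each .wacz could be uploaded to whatever repository or storage you’re using and viewed with replayweb.page.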
Second, they have an HTML website located at archive.supdigital.org. Each archived project is a directory. The WebRecorder suite can be called from an iframe, and it can also reference files from another resource. So this is an extremely simple HTML website with three parts: a small header, the script that calls the replayweb.page CDN, and a container that loads, in an iframe, the web archive file that is actually stored in their library’s open-access repository, stacks.stanford.edu.
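Here is a hedged sketch of what one of those per-project pages could look like, written as a small Python script that writes out the HTML. The CDN path and the replay-web-page element follow Webrecorder’s embedding docs as I understand them (they also expect a small sw.js service-worker file to be hosted alongside the page, so double-check their current instructions), and the repository URL is a made-up placeholder, not Stanford’s actual file.

```python
from pathlib import Path

# Placeholders -- swap in the real project title and the real URL of the
# .wacz file in your repository (Stanford's live in stacks.stanford.edu).
title = "Example Archived Project"
wacz_url = "https://repository.example.edu/archives/example-project.wacz"
start_url = "https://example-project.org/"  # page inside the archive to show first

html = f"""<!doctype html>
<html>
  <head>
    <title>{title} (archived)</title>
    <!-- replayweb.page loaded from a CDN; check Webrecorder's embedding docs
         for the current version and the sw.js service worker they require. -->
    <script src="https://cdn.jsdelivr.net/npm/replaywebpage/ui.js"></script>
  </head>
  <body>
    <header><h1>{title} (archived)</h1></header>
    <!-- The embed element fetches the .wacz from the repository and replays
         it in an iframe on this page. -->
    <replay-web-page source="{wacz_url}" url="{start_url}"></replay-web-page>
  </body>
</html>
"""

out = Path("example-project")
out.mkdir(exist_ok=True)
(out / "index.html").write_text(html)
```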
Finally, in the open-access repository, they have the content backed up in a variety of ways, depending on the project. They store the web archive, and possibly a few backups in different formats.
To check out the archived version of Ed-Beck.com, use this link to my OneDrive. (The complete Ed-Beck.com is around 700 MB; this one, which includes the YouTube videos, is 300 MB. It saved a lot of space on text and images, but making sure it included all of the embedded videos added a lot back.)
To open the file, you need to download their desktop app, archiveweb.page, or view it from their webpage, replayweb.page.
I’m really interested in the future of this for 3 reasons:
That, using their command-line tool that runs in Docker, some of the archiving work could be automated. Their manual tool is fine for one-off projects and small projects, but it would be really neat if we could input a list of websites and have an automated tool just crawl them for us.
That the archives that are created can be manipulated and displayed on something like a live website. Imagine if, whenever we deactivated a major website we wanted to archive, we immediately set up a redirect from site.sunycreate.cloud to archive.sunycreate.cloud/site.
That the actual files that make the live archive work can be stored in existing institutional storage built for the long-term archiving of digital documents and items. Many of us work at libraries where we have existing infrastructure for digital archives. Stanford’s solution “stores” the files in their digital repository and only makes them appear on their website in an iframe. It’s slick.
They’ve done a pretty decent job documenting their progress, but if Reclaim invited them to speak, or invited them to do a web archiving workshop as part of their instructional design program, I’d be really interested.