Archiving WordPress Sites as Static HTML

We can argue over whether to call it “easily parseable” or “documented” metadata; either way, the markdown files are more like a DB backup.

Those markdown files are easier to re-hydrate than HTML, which you have to write a custom parser for.
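
To make the “easily parseable” point concrete: assuming the markdown export uses Jekyll-style YAML front matter (which is what most WordPress-to-markdown exporters produce, though the earlier posts don't spell it out), pulling a post's metadata back out is a few lines of Python. A rough sketch, good enough for flat key: value metadata; anything richer should go through a real YAML library:

    import re
    import sys

    # Minimal front-matter reader: split a Jekyll-style markdown post into its
    # YAML metadata block and its body; no custom HTML parser required.
    def read_post(path):
        text = open(path, encoding="utf-8").read()
        match = re.match(r"---\n(.*?)\n---\n(.*)", text, re.DOTALL)
        if not match:
            return {}, text  # file has no front matter
        front, body = match.groups()
        meta = {}
        for line in front.splitlines():
            if ":" in line:
                key, _, value = line.partition(":")
                meta[key.strip()] = value.strip()
        return meta, body

    if __name__ == "__main__":
        meta, body = read_post(sys.argv[1])
        print(meta.get("title"), meta.get("date"))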

Small sites: not a big deal. Big sites: big deal.

I had not thought of the metadata issues. As Tim suggested, the two small-to-puny sites I did were purely to preserve the sites as they were presented. The chances of their ever being needed again are on the order of my taking a trip to Mars.

But a slap-on-the-head thing I should have done before decommissioning the WordPress front end would have been to export the site; at least then I would have the posts and metadata as XML. Someone (i.e., not me) could probably figure out a way to do some kind of metadata export as JSON?
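
For what it's worth, the standard WordPress export (Tools → Export) produces a WXR file, which is just RSS with extra wp: elements, so pulling the metadata out into JSON is a short script. A rough sketch, assuming the WXR 1.2 namespace (older exports use 1.0 or 1.1, so check the file header) and using made-up file names wordpress-export.xml / metadata.json:

    import json
    import xml.etree.ElementTree as ET

    # Namespaces used in a WXR (WordPress eXtended RSS) export.
    # The wp: version varies by WordPress release; check the export's header.
    NS = {
        "wp": "http://wordpress.org/export/1.2/",
        "dc": "http://purl.org/dc/elements/1.1/",
        "content": "http://purl.org/rss/1.0/modules/content/",
    }

    def wxr_to_json(xml_path, json_path):
        posts = []
        for item in ET.parse(xml_path).findall("./channel/item"):
            posts.append({
                "title": item.findtext("title"),
                "link": item.findtext("link"),
                "date": item.findtext("wp:post_date", namespaces=NS),
                "author": item.findtext("dc:creator", namespaces=NS),
                "type": item.findtext("wp:post_type", namespaces=NS),
                "status": item.findtext("wp:status", namespaces=NS),
                "categories": [c.text for c in item.findall("category")],
            })
        with open(json_path, "w", encoding="utf-8") as fh:
            json.dump(posts, fh, indent=2, ensure_ascii=False)

    wxr_to_json("wordpress-export.xml", "metadata.json")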

It’s pretty exciting to see how a thread like this rolls into a big ball of possibility.

I think I’m with you @timmmmyboy on this. The use cases I’m thinking of are:

1. A WP site is created to facilitate students writing as part of a course assignment (write-in-public type thing) for course x in semester y, probably as a sub-domain or sub-blog on a multisite. For various reasons, we want/need to create a different WP site for the similar assignment for course x in semester y+1. Once semester y is finished, I want students and others to always be able to view the site on the web and have links to it continue to work, but nothing on that site will be edited or changed. I don’t want to keep it in a live WP install, since that’s work to maintain and protect. If the metadata can be recovered, that’s OK; I don’t need to ever reconstitute the site as a live WP install.

2. A student has a blog on a school’s WP multisite (think Rampages). After the student graduates and is gone, but hasn’t or doesn’t want to migrate it to their own domain somewhere, I would like to keep a static version on the public web but not allow editing.


Yes, WordPress XML contains all the metadata. It’s another example of “easily parseable” metadata. There are Jekyll importers, for example, that can read in WP XML and reconstitute a site.
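
The real importer is the jekyll-import Ruby gem, but the basic move is easy to sketch: walk the items in the WXR file and write each published post out as a markdown file with front matter. A rough Python version of that idea (not the actual importer; post bodies come over as raw HTML in content:encoded, so no HTML-to-markdown conversion happens here, and the file names just follow the usual Jekyll convention):

    import pathlib
    import xml.etree.ElementTree as ET

    # WXR namespaces; the wp: version varies by WordPress release.
    NS = {
        "wp": "http://wordpress.org/export/1.2/",
        "content": "http://purl.org/rss/1.0/modules/content/",
    }

    def wxr_to_markdown(xml_path, out_dir="_posts"):
        out = pathlib.Path(out_dir)
        out.mkdir(exist_ok=True)
        for item in ET.parse(xml_path).findall("./channel/item"):
            if item.findtext("wp:status", namespaces=NS) != "publish":
                continue
            slug = item.findtext("wp:post_name", namespaces=NS) or "untitled"
            date = (item.findtext("wp:post_date", namespaces=NS) or "")[:10]
            title = (item.findtext("title") or "").replace('"', "'")
            body = item.findtext("content:encoded", namespaces=NS) or ""
            front = f'---\ntitle: "{title}"\ndate: {date}\n---\n\n'
            # Jekyll convention: _posts/YYYY-MM-DD-slug.md; body stays raw HTML.
            (out / f"{date}-{slug}.md").write_text(front + body, encoding="utf-8")

    wxr_to_markdown("wordpress-export.xml")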

Definitely agree there. I think they’re separate use cases but important for people thinking about archiving to consider as a backup plan.

Where I think this does get tricky from an ethical standpoint is when you’re archiving work that is not your own. Just because we can scrape, the question is whether we should. For personal stuff, absolutely. I’m less clear on the implications for things like the work of students who have already graduated @econproph (recognizing we already do this through large multisite systems with blogs that tend to stick around after a student leaves). I suppose a request to remove content is easy when the student’s work is the entire site, like a blog in a multisite. When it’s a scattering of posts throughout a site with multiple authors, removal is going to be almost impossible without first reconstructing the site back into something dynamic, so I think a backup plan like those proposed is good practice even if one doesn’t expect to ever need to use it.


On one hand, the whole point of having students have their own space is so that they are in control. But, if you post something publicly on the internet, it’s public. It can end up in the Way Back Machine, or archived in Google indefinitely. Works that are private and part of a class come under FERPA guidelines.

When I was teaching with Discourse, I tried hard to encourage students to always refer to each other by @usernames rather than their names. This would make it easier to reliably anonymize the data by replacing @usernames with pseudonyms if the discussions were ever analyzed for research. (Sure, that’ll never happen, but my undergrad degree was in CS.)


Just wanted to chime in on this thread, as I’ve spent years archiving the public content of faculty curricular WordPress sites using wget and SiteSucker. I’m finally getting very close to automated archives by scripting HTTrack. Once you get it up and running and figure out all the flags you need, it’s the fastest way I’ve found to do a full scrape while converting to relative URLs.
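
Out of curiosity about the “scripting HTTrack” part, here is a minimal sketch of what such a wrapper might look like, assuming httrack is on the PATH; the flag set is only a plausible starting point (dump each site into its own output folder and keep the crawl on the site's own host), not the actual flags referred to above, and the example.edu URLs are made up:

    import subprocess
    from pathlib import Path

    # Hypothetical batch archiver: run HTTrack once per course site.
    SITES = [
        "http://course1.example.edu/",
        "http://course2.example.edu/",
    ]

    def archive(url, root="archives"):
        host = url.split("//", 1)[-1].strip("/").replace("/", "_")
        out = Path(root) / host
        out.mkdir(parents=True, exist_ok=True)
        # -O sets the output directory; the "+" filter keeps the crawl on-site.
        subprocess.run(["httrack", url, "-O", str(out), f"+{host}/*"], check=True)

    for site in SITES:
        archive(site)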


Very interested in what you’re describing. That’s essentially my use case: curricular sites that need to be viewable but will not have content added after some semester.

I was just having a conversation about this very approach for archiving a domain project, or archiving a university’s old tilde space servers. I would love to know more!

I’m definitely keen on having these kinds of individual archiving tools available.

I also believe that if institutions of higher education, especially public ones, are not actively ensuring that their public web content gets into the Internet Archive, they are failing in a public service duty. (Toss tomatoes at me now.)

You can use wget to grab a single page along with its images and CSS (-p) and convert the links for local viewing (-k):

wget -p -k http://www.somedomain.com

or, better, mirror the whole site, fetching page requisites, adding .html extensions where needed, converting links, and never climbing above the starting URL:

wget --mirror --convert-links --adjust-extension --page-requisites --no-parent http://www.somedomain.com

You can also use HTTrack; it’s more user-friendly and runs under Windows.

Either way, you still need to rewrite links so they don’t point at …/subpage/index.html, which gives you a complete WordPress-to-static-HTML conversion (a rough cleanup sketch follows below).
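
A rough sketch of that cleanup step, assuming the scrape sits in a local folder of .html files and that turning links ending in index.html into the trailing-slash form is enough; it is a crude regex pass over the markup, so run it on a copy of the archive first:

    import pathlib
    import re

    # Rewrite local links of the form href="something/index.html" (or the bare
    # href="index.html") to the directory form, e.g. href="something/".
    PATTERN = re.compile(r'href="(?!https?://)([^"]*?)index\.html"')

    def clean_links(root="archive"):
        for page in pathlib.Path(root).rglob("*.html"):
            html = page.read_text(encoding="utf-8", errors="ignore")
            fixed = PATTERN.sub(lambda m: f'href="{m.group(1) or "./"}"', html)
            if fixed != html:
                page.write_text(fixed, encoding="utf-8")

    clean_links()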
