Archiving Discourse Sites

We've been running a Discourse site at Reclaim Hosting for our UDG Agora project, but it looks like they will not be using it in the future. I'm being asked if there is any way to archive it.

I read a few threads mentioning using HTTrack… but that's likely what the Mac app SiteSucker uses under the hood. Curious if anyone has given it a try, or has thoughts on how to convert Discourse to a static archive.

Alan,

I can take my SiteSucker to it and send you the archive and you can let me know if it works.

Teamwork makes the dream work.

Thanks,

I have SiteSucker so I can do it. I was just curious if anyone had tried.

Not that I know of, but Tim may have a better sense. This is our first request for archiving, so be a dog maverick and disrupt the archiving web 🙂

I'd be curious whether SiteSucker works, since Discourse does all that fancy lazy loading with JavaScript. Although maybe there are settings to handle some of that I haven't played with.

What about using webrecorder.io?

New to me. So all I need to do is record me clicking through every message on the site? Hmmmm.

It's messy: I did a full site suck and, so far, it sucks. I tried the "Web Views" option in SiteSucker because it sounded like it would deal with lazy loading.

So far I have had to disable some of the Discourse Ember.js, redo the Font Awesome icons, and use CSS to hide buttons and other things not needed in an archive… and it's still not quite there. I think I can get one out as a reasonable record, but it's a messy route.

See the thread "A basic Discourse archival tool" (dev - Discourse Meta). There's discussion there about changing the User-Agent so that you get a plain HTML version of the site.

This method worked for me with the ia_archiver user agent, @cogdog. I got the HTML version of all pages using SiteSucker. It might not be visually representative of what the forum looks like, but all the content is there, which I think is the important piece here.
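
If you want a quick sanity check of what the crawler will be served before running a long capture, something like this works from the command line (a sketch; yoursite.edu stands in for the real forum):

curl -s -A "ia_archiver" https://yoursite.edu/ | head -n 20

You should see server-rendered HTML with the topic list as plain links, rather than the mostly empty shell the JavaScript app renders into in a browser.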

Awesome, thanks. I have to redo it because I forgot a few categories from previous years that were hidden, so I need to put them back on the front page.

That's under Preferences → Identities? Can you screenshot your settings there?

That's right, under Preferences → Identities I have this: [screenshot of the custom identity settings]

and then I set this in the Settings for the capture: [screenshot of the capture settings]

OK, guess I didn’t understand what you needed. Won’t happen again.

Maybe I did not spend long enough seeing what that tool did.

Here's the need. My client has a Discourse site. They no longer plan to use it as a Discourse-powered site (and maybe don't want to pay to keep it going), but they want an archive of the conversations. So how can I create a site that has the content as plain HTML and no longer relies on Discourse: an archive.

What I thought I saw in the tool you sent was something that recorded an experience of moving through the site, which might mean manually following every Discourse thread. ??

No, I shouldn't have suggested anything without knowing what the problem was. You have always been so helpful to me; I just jumped the gun trying to help. Not to worry. You're right, the tool I suggested would be unwieldy.

@cogdog I started the meta.discourse.org thread that @Jay_Pfaffman mentioned and have worked on this issue a bit. Here’s my advice.

First, try HTTrack, which is a command-line, GPL tool for Mac and Linux. There's a Windows version as well, but I haven't used it.

The httrack command should look something like this:

httrack https://yoursite.edu -https://yoursite.edu/users* -*.rss -O arxiv_name -x -o -M10000000 --user-agent "Googlebot"
  • The '-https://yoursite.edu/users* -*.rss' bit prevents httrack from downloading files matching those patterns. You might choose to include those patterns or exclude others.
  • The '-O arxiv_name' sets the directory the mirror is written to.
  • The '-x -o' combo replaces both external links and errors with a local file indicating the error. So, for example, we don't link to user profiles on the original that weren't downloaded locally.
  • The '-M10000000' restricts the total amount downloaded to 10MB. There appears to be some post-processing and downloading of supplemental files that makes the total larger than this anyway.
  • The '--user-agent "Googlebot"' should not be necessary if the forum is powered by a recent version of Discourse.

That will probably take several minutes to run. If it works and you’re happy with the results, try running it overnight with a larger -M value.
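
For instance, the overnight version might just raise the cap to roughly 1GB (a sketch, with the same placeholders as the command above):

httrack https://yoursite.edu -https://yoursite.edu/users* -*.rss -O arxiv_name -x -o -M1000000000 --user-agent "Googlebot"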


If that doesn’t work to your liking, you might try the Discourse Archival Tool that I wrote. It’s really tailored for my specific needs, though. On the other hand, if you’re comfortable with Python, HTML, and CSS, it should be pretty easy to hack it to your purposes.
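
For context, if you do roll your own: Discourse serves a JSON version of most pages when you append .json to the URL, which is the kind of endpoint a script like that can build on. A minimal sketch, assuming jq is installed and with yoursite.edu as a placeholder:

curl -s https://yoursite.edu/latest.json | jq -r '.topic_list.topics[].slug'

That prints the slugs of recent topics; each topic's full content is likewise available at /t/<slug>/<id>.json.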

Both have advantages and disadvantages and there’s really not a perfect solution. I do hope that helps some, though!

Totally appreciate this, Mark; will be trying it soon. I just hope to get a basic archive, and our Discourse was not super active, but there is stuff worth saving. I speculate that under the hood the SiteSucker Mac OS X app is calling HTTrack.

No, I don't think that's correct. HTTrack is GPL, so if SiteSucker were using it, it too would have to be GPL, but it's not.

Also, on the server side, you can see that they have different user agent strings. Which makes me realize: if you know and like SiteSucker, you should be able to use it, if you set the user agent identity correctly. I'd recommend setting it to Googlebot simply because that works for HTTrack. I don't have SiteSucker so I can't tell you how to do this, but it looks like you can, according to the documentation. Just scroll to the bottom, where you see the heading Identity.

You know, I think this is exactly what @timmmmyboy suggested before, though I guess he recommended ia_archiver.