
Monthly Archives: July 2010

Web-archiving: behaviours of the Wayback Machine

Below is one of the screencasts from the recent Web Archiving Workshop. Please note there is no sound on the video. Read the synopsis below to understand what’s happening. You may also need to view the video in full screen mode.

This screencast will show some of the behaviours of the Wayback Machine at archive.org.

I’m going to the www.archive.org URL. First I’ll use the search engine on the front page, just doing a text search for my target, which is the Arts and Humanities Research Council.

The result is “Not in Archive”. However, the Wayback Machine itself admits that text search isn’t the best way to retrieve web content: “Using the Internet Archive Wayback Machine, it is possible to search for the names of sites contained in the Archive (URLs) and to specify date ranges for your search. We hope to implement a full text search engine at some point in the future.”

Instead, let’s find it by its URL. I go back, click on the Web tab, and arrive at the front page of the Wayback Machine. The form here allows me to enter a web address in the http:// box. I enter www.ahrc.ac.uk and click Take Me Back.

Here we see the results of the search: a collection of dated archived copies of the site from 2003 to 2008. The harvest of an entire site is called a ‘page’ here, which means, for example, that the site was harvested 32 times in 2007.
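Incidentally (this isn’t spelled out in the screencast), the results listing has a predictable address of its own: a URL of the form http://web.archive.org/web/*/www.ahrc.ac.uk, with the asterisk standing in for ‘any date’, should bring up the same list of dated captures.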

Let’s follow the first link for a harvest dated 04 January 2006.

Pay attention to the Address Bar at the top of the screen. Note that the URL includes a datestamp in the path, followed by the original www.ahrc.ac.uk URL.
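As an aside, and not part of the screencast, that address pattern is regular enough to assemble by hand. Below is a minimal Python sketch, assuming the http://web.archive.org/web/[YYYYMMDDhhmmss]/[original URL] layout seen here; if no capture exists at the exact moment requested, the Wayback Machine normally redirects to the nearest one it holds.

# Minimal sketch: build a Wayback Machine address from a capture timestamp
# and the original URL. The time-of-day portion used below is illustrative,
# not a real capture time.
def wayback_url(original_url: str, timestamp: str) -> str:
    return f"http://web.archive.org/web/{timestamp}/{original_url}"

print(wayback_url("http://www.ahrc.ac.uk/", "20060104000000"))
# -> http://web.archive.org/web/20060104000000/http://www.ahrc.ac.uk/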

I’m now just scrolling up and down to show you how their captured page has rendered.

Now I’ll compare their capture with a similar page which I captured and added to the UKWA archive. Mine is from 2005.

I’m now toggling between the two and you can see that the Wayback version’s layout isn’t the same. This is because the style sheet was not captured, or is lost, or is not rendering in some way.

Now I’m navigating within the archived version in Wayback and going to the Links page. I follow a link at random and get the result ‘Data Retrieval Failure’.

Notice another thing – we’ve now strayed slightly from 04 January 2006 and gone back to December 2005.

Now to follow another link to the Postgraduate page.

I’m now clicking on www.prospects.csu.ac.uk, a site which lies outside the AHRC domain. There will be a pause while this loads up, so feel free to take a closer look at the Address bar.

Now I’m taken outside the AHRC page I was looking at, to another site altogether. It so happens this site is also being harvested by the Internet Archive. The Wayback Machine is effectively making these connections within its own collection, allowing the user to browse around copies of the entire world wide web.
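What’s happening under the hood, as far as I can tell, is that the Wayback Machine rewrites the links inside each archived page so that they point back into the archive rather than out onto the live web: the link to www.prospects.csu.ac.uk is served as something like http://web.archive.org/web/20060104000000/http://www.prospects.csu.ac.uk/ (the timestamp here is illustrative). When the archive holds no capture for the exact date requested, the nearest one is substituted, which is why we drifted from 04 January 2006 back to December 2005 a moment ago.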

Web-archiving: HTTrack in Action

Below is one of the screencasts from the recent Web Archiving Workshop. Please note it’s in three parts, and there is no sound on any of the videos. Read the synopsis below to understand what’s happening.

HTTrack Screencasts pt 1

We’ll attempt to use HTTrack to copy a simple website.
The target is Copywriter.
As it happens it’s a simple, flat HTML website with no complicated scripts.
This is the target’s front page.
We’ll just look at a few pages of the target.

HTTrack Screencasts pt 2

Let’s minimise the front page and open HTTrack software.
Here’s the simple start screen prompting me to add a Project name.
I’ll just type it in the box.
The next screen is prompting me for a web address.
I’ll go back to our target and just select it and Copy it to the clipboard.
Now to paste it in.
We won’t use any settings, let’s just let it run and see what happens.
Next, Finish.
As you can see, the gather fails quite badly.
It shows us an error message, and when we try to browse the ‘mirror’, or copied website, it says the mirror is empty.

What went wrong?
Let’s go back and try it again.
HTTrack has saved some of the settings, so I just select the project and I’m given an option to update the existing download.
This time I will set some options.
Quite a lot of tabs here, but we’ll ignore most of them and use just three:
On Links, I’ll tick ‘get non-HTML files’. This means HTTrack will actively locate embedded files like CSS and images, and preserve the internal logic of the site.
On Spider, I’ll tell HTTrack to ignore robots.txt.
On Browser ID, I’ll set it to ‘None’. This gets HTTrack to present no browser identity, so the target site will accept its requests.
Click Next and Finish, and watch this. (A rough command-line equivalent of these settings is sketched after this part.)
There it is, gathering the files. It should only take 18 seconds.
There’s a “live report” on the screen and as you can see there’s at least one Not Found item, but no matter.
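For anyone who prefers the command line, HTTrack can also be driven without the wizard. The sketch below, using Python’s subprocess module, is my rough reading of how the three settings above map onto command-line flags; the flag spellings are taken from the HTTrack documentation and may differ between versions, and the target URL and project folder are placeholders rather than the ones used in the screencast.

import subprocess

# Rough command-line equivalent of the three wizard options used above.
# Flag meanings, as I read the HTTrack documentation (may vary by version):
#   -O    output (mirror) folder for the project
#   -n    get non-HTML files (CSS, images) "near" a captured page
#   -s0   never obey robots.txt
#   -F    browser identity string; an empty value approximates the GUI's 'None'
subprocess.run(
    [
        "httrack",
        "http://www.example.org/",            # placeholder target URL
        "-O", "C:/My Web Sites/Copywriter",   # placeholder project folder
        "-n",
        "-s0",
        "-F", "",
    ],
    check=True,
)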

HTTrack Screencasts pt 3

There’s a success message, so let’s Browse the mirrored website.
And here is the finished item. It looks just like the real thing.
Note the URL location path: we’re now looking at something on my C Drive in the My Web Sites folder.
Let’s QA some of the pages which look fine.
Under Links, though, we can check out this ‘Sudden Impact’ design site.
See what happened there – we are now in the live web, no longer looking at the copy.
Lastly I’ll show you what the copy looks like in My Web Sites. As you can see I’ve successfully copied the structure of my target website.
Also I’ve got a log file which reports on what HTTrack did.
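For reference, and from memory rather than from the screencast, the project folder HTTrack leaves behind looks roughly like this, assuming the project was named after the target (exact names depend on the HTTrack version):

My Web Sites\Copywriter\
    index.html        - entry point for browsing the mirror
    hts-log.txt       - the log file mentioned above
    hts-cache\        - HTTrack's own bookkeeping data
    www.example.org\  - the copied pages and embedded files, one folder per host (placeholder host name)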

Download the software from http://www.httrack.com/