Archiving Government websites: Should it really be this hard?

When I did the Government 2.0 Taskforce, one of the subjects that was earnestly discussed was the archiving of government sites. It’s a big problem in government. I could never see why it should be. After all, you can look at anything written on ClubTroppo since it started. We haven’t spent any huge amount of money to deliver that kind of functionality, nor burned any midnight oil. But IT people in government told me that it’s very expensive to keep web pages live. I have no idea why, but they swore black and blue that it was.

Anyway, I recently sought to track down the results of Obama’s less than spectacularly successful community brainstorming on open government when he came into office. (The top suggestions for promoting open government were legalising marijuana and releasing Obama’s birth certificate.) I emailed an American friend who’d been in the White House at the relevant time – now back in academia – asking for any write-up of the program, and she told me there was one in a 2009 annual review of operations. But it’s gone from the website, and no-one has been able to find it in a couple of weeks of looking. This is 2009!

For another project I was also looking up the old Power of Information Taskforce in the UK. Here’s Tom Steinberg’s blog entry announcing the release of its review.

I’m delighted to announce that the review I’ve been working on with Ed Mayo and the Cabinet Office has launched today. You can get the official PDF version here or my friend Sam Smith’s annotatable version that he just threw together.

I clicked on the first link and it went through to here.

http://webarchive.nationalarchives.gov.uk/+/http://www.cabinetoffice.gov.uk/publications/reports/power_information/power_information.pdf

Which was promising. It said this.

This snapshot taken on 25/11/2010, shows web content selected for preservation by The National Archives. External links, forms and search boxes may not work in archived websites. Find out more about web archiving at The National Archives. See all dates available for this archived website 

Object moved to here.

Alas, it wasn’t there either, and I was diverted to a Cabinet Office “Page not found” message – as you can see for yourself if you click on the link.

Meanwhile, one of the things the Power of Information Taskforce and Review did was publish using commercial blogging platforms, and everything published that way remains safe and sound. “Sam Smith’s annotatable version” that Steinberg says Sam “threw together” is still there on his blog. Likewise the Government 2.0 Taskforce published to its own URL using WordPress software, and it’s still there too. Its cost to government would be the same as the cost of Troppo to those of us who run it – the domain name registration, which is about $30 a year – while the cost to government of maintaining the UK’s Power of Information review, which sits on a sub-domain of wordpress.com, is exactly zero.

So it still eludes me why, with all the resources to hand, governments make it quite so difficult for themselves.

31 Comments

Antonios
12 years ago

In my experience working in IT, free and simple solutions are largely ignored by large companies/departments (and if a large company/department seeks advice from an IT services company, said IT services company will ALWAYS favour the Rube Goldberg machine).

And when there’s an internal IT department, they would find it difficult to justify their existence if everything was as easy as a WordPress installation.

Generally speaking, non-IT people asking IT people for advice and guidance is problematic — it’s like a blank cheque.

john r walker
12 years ago
Reply to  Antonios

An expert solution that does not involve trailing service fees to experts is rare, no?

cbp
12 years ago

As a computer programmer I’m often astounded at the government’s ineptitude in the web sphere. I frequently see costings for ostensibly very simple websites out by a factor of 10-50.

derrida derider
12 years ago

I’d suggest the IT boffins were told by their managers that the government must have the ability to “edit” old web pages. As you know, actually removing things from the web is a non-trivial exercise.

NotTheGovernorGeneral
12 years ago

There’s quite a bit of policy regarding Govt. Website Archiving requirements. A good starting point would be AGIMO’s WebGuide [http://webguide.gov.au/recordkeeping/archiving-a-website/ ]

URLs get moved, content gets updated, content management systems are abandoned and / or migrated, domains get renamed, often as part of machinery of government changes, vendor contract conclusions or product support cycles.

Link Rot [http://en.wikipedia.org/wiki/Link_rot] is a natural, if unfortunate, part of life on the web. I suspect there’s a direct correlation between the extent of link rot and the number of authors or systems involved in a body of content, too.

Agencies aren’t obliged to tell anyone that their content / URL structures have changed, either. The National Archives, for example, aren’t obliged to tell AGIMO that the last two links on the URL above – “Keeping Records Safe” [http://www.naa.gov.au/records-management/secure-and-store/index.aspx] and “Social Media and Commonwealth Records” [http://www.naa.gov.au/records-management/create-capture-describe/socialmedia/index.aspx] have been renamed or moved…

For the ‘high value’ archived online Govt. content, there’s always Pandora [http://pandora.nla.gov.au/].

aidan
12 years ago

If you want to champion something postcode standardisation would be awesome.

In a former life I did health stats database stuff. We relied entirely on postcode as a proxy for location. Unfortunately Auspost changes postcodes whenever they feel like it to suit their purposes. Fair enough, from their perspective, but it makes meaningful comparisons over time very difficult.

The gummint could mandate that postcodes are terribly important and that they either not change, or do so in a well defined statistically interpretable way.

This would be very good.

Note: I haven’t done that stuff for more than 10 years, so maybe they’ve made this change already; if so, ignore me.

FDB
12 years ago

Well obviously it’s because old postings are very likely to yield information which reflects poorly on Governments: unkept promises, abrupt changes in policy direction und so weiter.

Where’s the incentive for the incumbent regime to do the trivial work of archiving everything like a regular blog?

Peter Mariani
12 years ago

You may already be familiar with this website – I have found it useful
http://www.archive.org/web/web.php

Don Arthur
12 years ago

Nicholas – Are these the Open Government brainstorming docs you’re after?

Summary Analysis of the Open Government Brainstorm
Memo from Lena Trudeau, Vice President
The National Academy of Public Administration


Wrap-Up of the Open Government Brainstorming: Transparency.

Wrap-Up of the Open Government Brainstorming: Participation.

Wrap-Up of the Open Government Brainstorming: Collaboration.

Don Arthur
12 years ago

Nicholas – Here’s the Power of Information report on the National Archives site.
http://webarchive.nationalarchives.gov.uk/20100413152047/http://www.opsi.gov.uk/advice/poi/power-of-information-review.pdf

And yes, I understand that you’ve already got this document and that you’re highlighting the ironies of open government.

But since I went to the trouble of finding the document (and yes, it was much more trouble than it should have been) I thought I might as well share the link.

Don Arthur
12 years ago

Just out of curiosity, do you know the name of the thing you were referred to?

Stephen Bounds
12 years ago

Nicholas, there are traditionally two different issues involved in archiving government websites.

(1) The “snapshot” question — i.e. how did a government intranet or public website appear at a particular point in time? This is mostly relevant for its legal ramifications: if someone is suing the Government, it is critical to know whether they relied upon information present on a website on a particular date.

To do this right, each agency would need a full HTML mirror of all content on all their sites which can be “wound back” a la Apple Time Machine to any arbitrary date. This is just not that easy to do at present (although I do think there’s a gap for an entrepreneur to fill).

The public archives are only a partial solution to this requirement, not least because they can’t archive intranets.

In practice, the minimum requirement from Archives is to take a full snapshot of the website just before major content updates and/or machinery of government changes. Even just this is becoming harder to do well due to the intricacies of AJAX and other Web 2.0 dynamic features.

(2) The “persistence” question. This is exacerbated by CMS platform incompatibility (eg come talk to me again if you ever migrate Club Troppo from WordPress to Drupal), but is more fundamentally caused by ongoing content re-architecture.

Let’s say I have 500 pages on my site, but then cull these by 80% and rewrite the content in a much condensed form. I’m still aiming for the same outcomes, but tests show my visitors understand their obligations better.

Then I have a conundrum: If I keep the old contents accessible, users will continue to read out-of-date material. And there’s no easy way to map old content to the new pages since there is no 1:1 mapping!!

Note that this is really no more or less of a problem than referring to a chapter or page number of a textbook that is updated every year, since these kinds of “reference links” will also decay over time.

My personal opinion here is that there needs to be a more concerted effort at setting up persistent URL handler services within government (eg ARK, pURL, DOI) for publications that should always be reference-able such as taskforce reports.
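
For what it’s worth, the core of a persistent URL service is tiny: a stable identifier that redirects to wherever the document currently lives, backed by a registry that is updated when content moves. Here is a minimal sketch in Python; the identifiers and target URLs are invented for illustration, and real ARK/PURL/DOI services are of course far more involved.

```python
# Minimal sketch of a persistent-identifier resolver. A stable ID redirects to
# wherever the document currently lives; only the registry changes when content
# moves. All identifiers and target URLs below are invented for illustration.
from http.server import BaseHTTPRequestHandler, HTTPServer

REGISTRY = {
    "/id/poi-review": "http://www.example.gov.uk/reports/power-of-information.pdf",
    "/id/gov20-report": "http://www.example.gov.au/gov20/final-report.html",
}

class Resolver(BaseHTTPRequestHandler):
    def do_GET(self):
        target = REGISTRY.get(self.path)
        if target:
            self.send_response(302)              # redirect to the current location
            self.send_header("Location", target)
            self.end_headers()
        else:
            self.send_error(404, "Unknown persistent identifier")

if __name__ == "__main__":
    HTTPServer(("localhost", 8080), Resolver).serve_forever()
```

The published identifier never changes; only the registry entry does, which is the whole point.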

john
12 years ago

cbp @ 1

Is there some sort of independent guide to how much a website should cost?
The sort of thing you are talking about is too common in the funded arts sector.

FDB
12 years ago

Stephen – a six-year-old post with some broken links I could forgive. Easily.

Are we making perfect the enemy of barely adequate here?

Stephen Bounds
12 years ago

@FDB: What I was trying to get at is that most government websites aren’t blogs. A blog has the simplest possible Information Architecture — a big long chronological list of posts — and that’s an easy structure to maintain in the long term.

By contrast, most government websites have multiple pages that interrelate and need to be maintained or updated as a set. For major content updates, this often means republishing the whole site and this is where persistence becomes a problem.

If we think of these kinds of websites as analogous to tourism information for a town, we have content that is “signposts”, “tourist guides” and “historical books”. There are different impacts of keeping an incorrect signpost up (currency most important) versus losing a rare historical book (persistence most important).

Craig
12 years ago

Stephen Bounds,

The snapshot question can be solved by a wiki or any content management system with version control.

However, the proliferation and ageing of content management systems is an ongoing issue.

Governments could adopt a standard platform which met all central requirements, mandate its use and fund ongoing work on the platform to keep it current – even centrally host and manage it technically. This could even be an open source platform, which would allow a much larger and more engaged developer community to remain involved.

However, they won’t as commercial influences would deem this anti-competitive and closing the market to all those wonderful for sale systems.

Also some agencies would be unaware of or ignore the central mandate (due to internal politics and the difficulties of informing all present and potential website owners across government) and simply create or buy their own solutions, perpetuating the issue.

Plus everyone would have to migrate to the platform in the first place – this could not be reliably or cheaply done in a short time. It would need at least a five year (and based on continued IE6 use in government, at least ten year) transition period.

The other alternative is to mandate much higher minimum website standards in government than presently exist, as well as a minimum standard for content management systems which must demonstrably (not just verbally) be met for archival purposes, and a standard strategic framework for how agencies must manage their online presence (all websites, mobile apps, social media beachheads, etc).

This is doable, but few are interested in supporting it unless it is a priority of the party in government.

A lot of the basic systems in government are poorly constructed and only work with massive resourcing cost as they are not vote winners or ‘interesting’ to politicians or senior managers.

As there’s no profit motive at work, waste is accepted. Poor archival of websites is part of this waste. No Minister cares about the old websites of their predecessors – except if they can use them for political point scoring.

Correspondingly, most public servants with decision-making authority don’t care either.

john
12 years ago

Technically, how difficult would it be to create a near-enough Wayback Machine solely devoted to Australian Government websites?

Stephen Bounds
12 years ago

@Craig – I’m always far more interested in promoting standards than mandating software solutions.

For example, CMIS is meant to be part of the solution in terms of interoperability and long-term accessibility, but I am yet to be convinced that it is more than just a figleaf for the major vendors.

I think something like Atompub has far more chance of success as a lightweight standard for online published content but as you say, the political and policy will just isn’t there to mandate anything like that.

Consider that we haven’t even got an agreed electronic archival standard! VERS is the standard in Victoria but my understanding is that NAA wants to build its own. Best practice is sadly lacking in long-term electronic archival techniques right across the globe because there is always something “more important” to do in the short term.

Stephen Bounds
12 years ago

@john – technically not hard, but it’s the resources (necessary disk space and bandwidth) and commitment to operate in the long term which are hard to obtain.

Oh, and legal advice and workers to handle challenges, selectively remove content on request and so on.
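
To illustrate the “technically not hard” part: at its simplest, archiving a page is just filing away a dated copy of it, which is roughly what the Wayback Machine does at enormous scale. A toy sketch in Python follows; the URL is purely illustrative, it handles a single page only, and a real archive would crawl whole sites and their assets (for example with wget --mirror).

```python
# Toy sketch: file a dated copy of a single page so it can be "wound back" later.
# The URL in the usage example is purely illustrative.
import datetime
import os
import urllib.request
from urllib.parse import urlparse

def snapshot(url, archive_root="archive"):
    """Fetch one page and store it under archive/<date>/<host>/<path>."""
    today = datetime.date.today().isoformat()
    parsed = urlparse(url)
    path = parsed.path.lstrip("/") or "index.html"
    if path.endswith("/"):
        path += "index.html"
    dest = os.path.join(archive_root, today, parsed.netloc, path)
    os.makedirs(os.path.dirname(dest), exist_ok=True)
    with urllib.request.urlopen(url) as response:
        content = response.read()
    with open(dest, "wb") as f:
        f.write(content)
    return dest

# snapshot("http://www.example.gov.au/publications/report.html")
```

Everything beyond that (crawling, scale, storage and the long-term commitment Stephen mentions) is where the real cost sits.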

john
12 years ago

Stephen
“Legal advice and workers to handle challenges, selectively remove content on request and so on”

Does the government use the robots.txt protocol?
I would have assumed that if it’s on a publicly available government site then it’s (if it’s not covered by privilege) the Government’s problem.

Stephen Bounds
12 years ago

@john – You’re right, I’m probably overblowing that part of the problem. Still real, but probably a rarity.

Tel
12 years ago

As a computer programmer I’m often astounded at the government’s ineptitude in the web sphere. I frequently see costings for ostensibly very simple websites out by a factor of 10-50.

You think that’s bad, try building an education revolution (i.e. school sheds) or try Aboriginal housing. Perhaps have a go at getting 1500 households connected to optic fibre. I think that web design drops out of the field of financial relevance very rapidly.

The gummint could mandate that postcodes are terribly important and that they either not change, or do so in a well defined statistically interpretable way.

Such an invention already exists, going under the old name of Latitude and Longitude (thank the British Navy for that effort) or the modern shorthand “GPS” and I guess we have to give credit to the Goobers on that one.

By the way, I still keep plugging the idea of content addressable websites, which could solve the archive problem, amongst many others.

Stephen Bounds
12 years ago

By the way, I still keep plugging the idea of content addressable websites, which could solve the archive problem, amongst many others.

Care to expand/give an example?

Tel
12 years ago

Stephen Bounds, I can point to all the pieces. Please start here —

http://en.wikipedia.org/wiki/Content-addressable_storage

In particular “Content-addressed vs. Location-addressed” is the crux of it.

So consider some random number like the uuid 4b0ff2c1-5f8c-4a42-ba56-0e9de8ed29cc, which basically means nothing, but now that I’ve put it up on a blog, you can plug that same number into google and get back to this blog post – use it instead of a URL. Thanks to google, the uuid is now an address for the content, thus by means of keyword search you are describing *WHAT* you want, rather than *WHERE* to look. It should not matter whether you use google, or bing, or yahoo, because they all point back to here.

Sadly, if some nefarious gangster was to copy the above uuid, they might redirect all that important traffic to their own blog, and deliver a completely different message. How to stop that from happening? We need an ID that we can trust.

This has already been solved by things like the MD5 hash, SHA1 hash, etc., whereby you generate a magic hash number from the content itself, and anyone can repeat the process to get exactly the same magic hash number. Better than that, it is extremely difficult (not impossible) to make a fake, and better still, newer pages can include hash numbers pointing to older versions of themselves (and also other pages), thus providing a reasonably solid guarantee of provenance. Edits will generally be visible if someone examines the chain; history can be erased, but it cannot be rewritten (and with provision for multiple copies in caches and so on, even erasure is incredibly difficult).

http://www.eecs.harvard.edu/~cduan/technical/git/

Here is a good little tutorial of this design in action, but not in a WWW context (although potentially it could easily be extended to one)… and that’s the next step IMHO. The “git” tool is widely available for free; unfortunately there’s a bit of a learning curve. To make it user-friendly would require a bit of work, and maybe changing the browsers a bit or creating some sort of portal.

Hopefully that’s got you very, very excited about the possibilities here :-)
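
For anyone who wants to see the idea in miniature: the “address” of a piece of content is just a hash of its bytes, so any copy, fetched from anywhere, can be verified against its own address. A rough sketch in Python, using SHA-256 rather than MD5/SHA-1 and an in-memory dictionary standing in for the web:

```python
# Rough illustration of content addressing: the address of a document is a hash
# of its bytes, so any copy can be checked against the address it claims to have.
import hashlib

def content_address(data: bytes) -> str:
    return hashlib.sha256(data).hexdigest()

store = {}  # an in-memory dict standing in for the web / a cache / a mirror

def put(data: bytes) -> str:
    addr = content_address(data)
    store[addr] = data
    return addr

def get(addr: str) -> bytes:
    data = store[addr]
    # Anyone can recompute the hash; a tampered copy simply won't match.
    assert content_address(data) == addr, "content does not match its address"
    return data

report = b"Power of Information Review, final text ..."
addr = put(report)
print(addr)                  # the same bytes always produce the same address
print(get(addr) == report)   # True
```

Move the bytes to any mirror you like and the address still identifies them, which is exactly the property that would take the sting out of link rot.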

john
12 years ago

Tel
Perhaps we could reduce all of the Government to a Gödel number… how peaceful it would be >

Webs
12 years ago

I work as a web developer and in IT, and from what I’ve seen there are tons of little things that could save the government loads of time and resources, but they choose to look past them to “more important things”.

Max
12 years ago

Part of the reason for a government’s inability to keep a lid on the ultimate cost of any project is that these are public officials going up against private sector project managers and contract negotiators. It’s like putting donkeys against thoroughbreds and expecting a fair race. Any project manager worth their salt knows that you bid low to get the contract; once the point in the project has been reached where it would be more expensive to abandon than to continue, the client is hit with a never-ending stream of expensive “contract variations”.