Scraping the Scottish Parliament Official Record

Posted on Fri, 15th June 2007 at 15:17 under Scotland, Politics, Skills, Coding, Web Standards

My friends know that I’m a bit of a political geek who likes to keep abreast of what our elected representatives are doing. I follow the Official Reports of the UK Parliaments with an almost religious fervour, although I’ve been a bit lazy recently and have just been watching the BBC Parliament channel.

I’m also a huge fan of theyworkforyou.com, a volunteer-run website which republishes the Official Reports of the House of Commons, House of Lords and the Northern Ireland Assembly in a far friendlier form than the “official” sites, with the added bonus of public commentary and statistical analysis of MPs’ performance.

Missing from theyworkforyou.com is the Official Report of the Scottish Parliament, all the more sorely missed given the now very interesting composition of that body, with its minority SNP government. Republishing an official record is no easy task, for a number of reasons that I’ll go into later, but I know for sure that I have all the requisite skills and tools to do it.

I’ve had this project on the back-burner for some months now and since I’m starting a new job soon, any spare time I might have to work on hobby projects is going to vanish.

So here I go. I don’t need to get it perfect - just far enough along so that someone else can pick up and complete the task.

The Tools I’m Using

The programs used by TWFY (theyworkforyou.com) are currently written in languages I do not know, nor do I wish to know: Python and Perl. I dislike Python because whitespace is significant and I fucking hate Perl. This is not important, because the data used by the site is XML, a format I know, love and can produce with relative ease.

One laudable characteristic of the recent Scottish Parliament Official Report (SOR) is that it is published in valid XHTML, making it accessible to modern standards-based tools such as XML processors and my favourite language of the moment, XSLT. Using these tools allows one to process documents as data structures rather than simple text, meaning that the transformation from source to target is expressed as a set of deterministic rules rather than as a computer program. It’s a subtle but important distinction. When the rules are correctly expressed, they work correctly for every document that follows the same structure, without exception.

Where I’m At

Going by the dates on my files, I started working on a Scottish Parliament scraping tool in January 2007 and got to a point where I couldn’t continue, due to flaws in my thinking and therefore in my rule design. I’ve been thinking about better rules for a few months now and I reckon I’ve got the gist of what needs to be done, and will develop them here. In the meantime, for those who want to examine my initial sketching, here’s my current transformation stylesheet and some source data.

scrape-scot-parl.xsl
Link to a sample source XML data document on the Scottish Parliament website

The processed output is nothing like what is required by TWFY.

Where I Need To Get

TWFY Parliament Parser XML format definitions define the target data structures
theyworkforyou-scotland mailing list archives give a feel for how much more work is necessary than just scraping the transcripts. Some more volunteers would be great!

Parsing The Transcript

The TWFY target data structure breaks the transcript into headings and speeches. The transcript itself is made up of classified paragraphs, contained within classified divisions, contained within a table.

A speech is an unbroken sequence of paragraphs uttered by the same speaker, with the exception of the last paragraph, which might be a timestamp. Headings generally precede a debate, which is a sequence of speeches by one or more members on the same subject, the topic being introduced in the heading. Speeches and debates can be punctuated by notes, which are paragraphs placed in the record to indicate some inaudible activity, such as when the meeting is suspended [“12:28 … Meeting suspended until 14:15. … 14:15 … On resuming—”] or a member rises [“Jeremy Purvis rose—”]. Occasionally, many people speak at once [“Members: Answer the question!”]. Some speakers hold office, which is indicated before their name the first time they speak in any debate [“The Minister for Transport, Infrastructure and Climate Change (Stewart Stevenson)”]. Speakers’ names and party affiliations are written out in full the first time they speak in a debate [“Ms Wendy Alexander (Paisley North) (Lab): Will the minister give way?”] and are shortened for any subsequent speeches [“Ms Alexander: Will the minister give way?”].
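As an illustration of the full-form speaker label described above, here is a minimal XSLT 1.0 sketch that splits a label of the shape “Name (Constituency) (Party):” into its parts using only XPath 1.0 string functions. The template name `split-speaker` and the output elements `speaker`, `name`, `constituency` and `party` are hypothetical, chosen for this example only; they are not taken from the TWFY format definitions. The short form (“Ms Alexander:”) and office-holder labels have a different shape and would need their own rules.

```xml
<?xml version="1.0"?>
<xsl:stylesheet version="1.0"
    xmlns:xsl="http://www.w3.org/1999/XSL/Transform">

  <xsl:output method="xml" indent="yes"/>

  <!-- Hypothetical helper: splits "Name (Constituency) (Party):" -->
  <xsl:template name="split-speaker">
    <xsl:param name="label"/>
    <!-- Everything after the first opening bracket -->
    <xsl:variable name="after-name" select="substring-after($label, '(')"/>
    <!-- Everything after the second opening bracket -->
    <xsl:variable name="after-seat" select="substring-after($after-name, '(')"/>
    <speaker>
      <name>
        <xsl:value-of select="normalize-space(substring-before($label, '('))"/>
      </name>
      <constituency>
        <xsl:value-of select="substring-before($after-name, ')')"/>
      </constituency>
      <party>
        <xsl:value-of select="substring-before($after-seat, ')')"/>
      </party>
    </speaker>
  </xsl:template>

  <!-- Demonstration only: ignores the input document and splits a fixed label -->
  <xsl:template match="/">
    <xsl:call-template name="split-speaker">
      <xsl:with-param name="label"
          select="'Ms Wendy Alexander (Paisley North) (Lab):'"/>
    </xsl:call-template>
  </xsl:template>

</xsl:stylesheet>
```

Run against any well-formed XML document, this should yield a name of “Ms Wendy Alexander”, a constituency of “Paisley North” and a party of “Lab”. A label with no brackets produces empty elements, which is one way to detect the short form.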

There is a co-ordinate system called column numbers, which relates directly to the structure of the printed report, i.e. columns like those in a newspaper. Each session of Parliament begins at column number 1 and column numbers increase sequentially throughout the session. Column numbers are interspersed, seemingly at random (due to the disconnect between the physical and the digital), throughout the report documents. Members refer to or quote from speeches in the record by giving a date and column number(s) immediately after the quoted text (“quarterly reviews of project progress against cost and time targets have been established”.—[Official Report, 16 March 2006; c 24058-59.]).

Transformation Rules

Locating the Transcript

The transcript is contained in a table classified “contentTable”. Speech paragraphs are contained in divisions classified “orindent”. Column numbers are outside the speech divisions in paragraphs classified “orcolno”.

select //xhtml:table[@class="contentTable"]//xhtml:div[@class="orindent"]
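For the select above to match anything, the `xhtml` prefix has to be bound to the XHTML namespace in the stylesheet. A minimal sketch of a complete XSLT 1.0 stylesheet around that expression follows. The output elements `transcript` and `division` are placeholders for this example and are not part of the TWFY target format.

```xml
<?xml version="1.0"?>
<xsl:stylesheet version="1.0"
    xmlns:xsl="http://www.w3.org/1999/XSL/Transform"
    xmlns:xhtml="http://www.w3.org/1999/xhtml"
    exclude-result-prefixes="xhtml">

  <xsl:output method="xml" indent="yes"/>

  <!-- Start at the document root and visit only the speech divisions -->
  <xsl:template match="/">
    <transcript>
      <xsl:apply-templates
          select="//xhtml:table[@class='contentTable']//xhtml:div[@class='orindent']"/>
    </transcript>
  </xsl:template>

  <!-- Placeholder output: one element per division, with its paragraph count -->
  <xsl:template match="xhtml:div">
    <division paragraphs="{count(xhtml:p)}"/>
  </xsl:template>

</xsl:stylesheet>
```

Without the `xmlns:xhtml` declaration, an unprefixed `table` or `div` in the XPath would match nothing, because the source elements live in the XHTML namespace.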

Gathering Speeches

Find all the paragraphs where the speaker changes and convert to speeches. Look backwards for the nearest column number and timestamp. Include all subsequent contiguous paragraphs where the speaker does not change.

The speaker’s name is in a strong phrase as the first child of the speech paragraph, the phrase ending with a colon, the second child being a text node, any further children of mixed types.

select xhtml:p[xhtml:*[1][self::xhtml:strong[ends-with(normalize-space(text()), ':')]]]
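One caveat: `ends-with()` is an XPath 2.0 function, so the expression above needs an XSLT 2.0 processor. For an XSLT 1.0 processor such as xsltproc, the same test can be written with `substring()` and `string-length()`. The sketch below is one possible equivalent, not the stylesheet used for this post; the output elements `speakers` and `speaker` are placeholders.

```xml
<?xml version="1.0"?>
<xsl:stylesheet version="1.0"
    xmlns:xsl="http://www.w3.org/1999/XSL/Transform"
    xmlns:xhtml="http://www.w3.org/1999/xhtml"
    exclude-result-prefixes="xhtml">

  <xsl:output method="xml" indent="yes"/>

  <xsl:template match="/">
    <speakers>
      <!-- Paragraphs whose first element child is a strong phrase -->
      <xsl:for-each
          select="//xhtml:div[@class='orindent']/xhtml:p[xhtml:*[1][self::xhtml:strong]]">
        <xsl:variable name="label" select="normalize-space(xhtml:strong[1])"/>
        <!-- XPath 1.0 stand-in for ends-with($label, ':'):
             take the last character and compare it with a colon -->
        <xsl:if test="substring($label, string-length($label)) = ':'">
          <speaker>
            <xsl:value-of select="$label"/>
          </speaker>
        </xsl:if>
      </xsl:for-each>
    </speakers>
  </xsl:template>

</xsl:stylesheet>
```

An empty strong phrase gives an empty label, so the comparison fails and the paragraph is skipped.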

Sketch 2 Abandoned

Whew! A single-pass transformation is much more complicated than I thought, primarily due to speeches straddling columns, sometimes multiple columns for long speeches. I’m not aware of any XPath axis which will naturally solve the problem and there’s no way I know of to navigate within a selected node-set.

However, the output I have now is much closer to the target and I think a third sketch using multiple passes (1. extract all paragraphs, 2. transform to target) will be ideal.

Sketch 2 stylesheet

Sketch 3 - Two Passes

Combining the characteristics of the first two sketches, I shall create sketch 3 as a two-pass system. The first pass shall extract all the transcript paragraphs, integrating the timestamps and column numbers, which will overcome the difficulties I had straddling columns in sketch 2. The second pass shall transform the extracted paragraphs into the target format.
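A rough sketch of what such a first pass could look like in XSLT 1.0 is below. It assumes that each column-number paragraph precedes the text it labels, and it leaves timestamps out because their markup is not described here. The output elements `paragraphs` and `p` and the `column` attribute are placeholders for this example, not the format actually produced by sketch 3.

```xml
<?xml version="1.0"?>
<xsl:stylesheet version="1.0"
    xmlns:xsl="http://www.w3.org/1999/XSL/Transform"
    xmlns:xhtml="http://www.w3.org/1999/xhtml"
    exclude-result-prefixes="xhtml">

  <xsl:output method="xml" indent="yes"/>

  <!-- Flatten the transcript into a single list of paragraphs -->
  <xsl:template match="/">
    <paragraphs>
      <xsl:apply-templates
          select="//xhtml:table[@class='contentTable']//xhtml:div[@class='orindent']/xhtml:p"/>
    </paragraphs>
  </xsl:template>

  <!-- Stamp each paragraph with the nearest column number before it.
       On the reverse "preceding" axis, [1] means the closest match. -->
  <xsl:template match="xhtml:p">
    <p column="{normalize-space(preceding::xhtml:p[@class='orcolno'][1])}">
      <xsl:value-of select="normalize-space(.)"/>
    </p>
  </xsl:template>

</xsl:stylesheet>
```

Because every output paragraph carries its own column number, the second pass can group paragraphs into speeches without caring where the column breaks fell.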

Although there are techniques I could use to perform both passes with a single stylesheet (such as the EXSLT function node-set()), I wish the process to be as portable as possible. I do not want the process to be dependent on PHP, which I’m using locally to perform the transformation. Ideally, the process should be compatible with the xsltproc program, so that the following command will work:

xsltproc scot-parl-pass-1.xsl source.xml | xsltproc scot-parl-pass-2.xsl > target.xml

Trying to use these tools under Windows is an exercise in frustration, mostly related to the resolving of named entities. However, all is not lost. Starting with these instructions, I created a C:\XML directory, copied the XHTML DTD files from the W3C site into it and created a catalog.xml file like so:

<?xml version="1.0"?>
<catalog xmlns="urn:oasis:names:tc:entity:xmlns:xml:catalog">

  <public publicId="-//W3C//DTD XHTML 1.0 Transitional//EN"
          uri="xhtml1-transitional.dtd"/>

</catalog>

Directory of C:\XML

16/06/2007  17:00               204 catalog.xml
09/06/2007  20:12            11,775 xhtml-lat1.ent
09/06/2007  20:15             4,131 xhtml-special.ent
09/06/2007  20:13            13,848 xhtml-symbol.ent
09/06/2007  20:10            32,111 xhtml1-transitional.dtd

Anyways, on to sketch 3…

First Pass

The first pass is based on sketch 1, which extracted and classified the transcript paragraphs, but lost some important details. It removed break elements, which must now be retained in order to allow the separation of members’ names in votes. It did not retain the date of the meeting.
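Retaining the break elements could be sketched in XSLT 1.0 as follows. This is an assumption about one way to do it, not the sketch 3 stylesheet itself, and it does not address the meeting date. The output elements `paragraphs`, `p` and `br` are placeholders.

```xml
<?xml version="1.0"?>
<xsl:stylesheet version="1.0"
    xmlns:xsl="http://www.w3.org/1999/XSL/Transform"
    xmlns:xhtml="http://www.w3.org/1999/xhtml"
    exclude-result-prefixes="xhtml">

  <xsl:output method="xml" indent="yes"/>

  <xsl:template match="/">
    <paragraphs>
      <xsl:apply-templates select="//xhtml:div[@class='orindent']/xhtml:p"/>
    </paragraphs>
  </xsl:template>

  <!-- Process child nodes instead of flattening to a string,
       so that breaks inside the paragraph survive -->
  <xsl:template match="xhtml:p">
    <p>
      <xsl:apply-templates/>
    </p>
  </xsl:template>

  <!-- Keep each break as an empty marker element -->
  <xsl:template match="xhtml:br">
    <br/>
  </xsl:template>

</xsl:stylesheet>
```

Other inline elements, such as strong phrases, fall through to the built-in rules, which output their text and drop the tags.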

One Response

  1. Lazy Sunday Morning

    Picked this up again for a couple of hours. I tried to transform the paragraphs classified as “heading” from the first pass into the “major-heading” and “minor-heading” elements required in the target, and bumped into an oddity. It feels to me that what makes a heading major or minor is more down to taste than rules and I’m too lazy to program for taste!

    Perhaps I shall examine the code for the existing scrapers and “borrow” their rule sets. I expect to find many special cases.
