Scraping the Scottish Parliament Official Record
Posted on Fri, 15th June 2007 at 15:17 under Scotland, Politics, Skills, Coding, Web Standards
My friends know that I’m a bit of a political geek and like to keep abreast of what our elected representatives are doing, so I follow the Official Reports of the UK Parliaments with an almost religious fervour, although I’ve been a bit lazy recently and just been watching the
I’m also a huge fan of
Missing from theyworkforyou.com is the Official Report of the
I’ve had this project
So here I go. I don’t need to get it perfect - just far enough along so that someone else can pick up and complete the task.
The Tools I’m Using
The programs used by TWFY are currently written in languages I do not know, nor do I wish to know;
One laudable characteristic of the recent SOR is that it is published in valid
Where I’m At
Going by the dates on my files, I started working on a Scottish Parliament scraping tool in January 2007 and got to a point where I couldn’t continue, due to flaws in my thinking and therefore in my rule design. I’ve been thinking about better rules for a few months now and I reckon I’ve got the gist of what needs to be done, and will develop them here. In the meantime, for those who want to examine my initial sketching, here’s my current transformation stylesheet and some source data.
scrape-scot-parl.xsl
Link to a sample source XML data document on the Scottish Parliament website
The processed output is nothing like what is required by TWFY.
Where I Need To Get
TWFY Parliament Parser XML format definitions define the target data structures
theyworkforyou-scotland mailing list archives give a feel for how much more work is necessary than just scraping the transcripts. Some more volunteers would be great!
Parsing The Transcript
The TWFY target data structure breaks the transcript into headings and speeches. The transcript itself is made up of classified paragraphs, contained within classified divisions, contained within a table.
A speech is an unbroken sequence of paragraphs uttered by the same speaker, with the exception of the last paragraph, which might be a timestamp. Headings generally precede a debate, which is a sequence of speeches by one or more members on the same subject, the topic being introduced in the heading. Speeches and debates can be punctuated by notes, which are paragraphs placed in the record to indicate some inaudible activity, such as when the meeting is suspended [“12:28 … Meeting suspended until 14:15. … 14:15 … On resuming—”] or a member rises [“Jeremy Purvis rose—”]. Occasionally, many people speak at once [“Members: Answer the question!”]. Some speakers hold office, which is indicated before their name the first time they speak in any debate [“The Minister for Transport, Infrastructure and Climate Change (Stewart Stevenson)”]. Speakers’ names and party affiliations are written out in full the first time they speak in a debate [“Ms Wendy Alexander (Paisley North) (Lab): Will the minister give way?”] and are shortened for any subsequent speeches [“Ms Alexander: Will the minister give way?”].
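To make the label forms above concrete, here is a minimal sketch in Python (used purely for illustration; my own work is in XSLT). The function name and the regular expressions are hypothetical, inferred only from the quoted examples, and I would expect the real record to contain special cases they miss.

```python
import re

# Full form: "Name (Constituency) (Party)". Assumed shape, taken from the
# single example quoted in the text.
FULL_LABEL = re.compile(
    r"^(?P<name>[^()]+?)\s*\((?P<constituency>[^()]+)\)\s*\((?P<party>[^()]+)\)$"
)
# Office form: "The <office> (Name)". Also an assumption based on one example.
OFFICE_LABEL = re.compile(r"^(?P<office>The [^()]+?)\s*\((?P<name>[^()]+)\)$")


def parse_speaker_label(label):
    """Classify a speaker label taken from the start of a speech paragraph."""
    text = label.strip().rstrip(":").strip()
    match = FULL_LABEL.match(text)
    if match:
        return {"form": "full", **match.groupdict()}
    match = OFFICE_LABEL.match(text)
    if match:
        return {"form": "office", **match.groupdict()}
    if text == "Members":
        # Many people speaking at once.
        return {"form": "collective", "name": text}
    # Anything else is treated as the shortened form used in later speeches.
    return {"form": "short", "name": text}
```

A shortened label such as "Ms Alexander:" carries no party or constituency, so a later step would have to match it back to the full label seen earlier in the same debate.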
There is a co-ordinate system called column numbers, which relate directly to the structure of the printed report, i.e. columns in a newspaper. Each session of Parliament begins at column number 1 and column numbers increase sequentially throughout the session. Column numbers are interspersed, seemingly at random (due to the disconnect between physical and digital), throughout the report documents. Members refer to or quote from speeches in the record by giving a date and column number(s) immediately after the quoted text (“quarterly reviews of project progress against cost and time targets have been established”.—[Official Report, 16 March 2006; c 24058-59.]).
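As a rough illustration of how such a citation could be picked apart, here is a small Python sketch. The pattern and the function name are hypothetical and based only on the one citation quoted above. In particular, I am assuming that a shortened end column such as "59" borrows its leading digits from the start column.

```python
import re

# Matches citations shaped like "[Official Report, 16 March 2006; c 24058-59.]".
CITATION = re.compile(
    r"\[Official Report,\s*(?P<date>\d{1,2} \w+ \d{4});\s*"
    r"c\s*(?P<first>\d+)(?:-(?P<last>\d+))?\.\]"
)


def parse_citation(text):
    """Return the date and column range of the first citation in text, or None."""
    match = CITATION.search(text)
    if match is None:
        return None
    first = match.group("first")
    last = match.group("last") or first
    # Assumption: "24058-59" means columns 24058 to 24059.
    if len(last) < len(first):
        last = first[: len(first) - len(last)] + last
    return {
        "date": match.group("date"),
        "first_column": int(first),
        "last_column": int(last),
    }
```

The date is left as a string here; converting it to a real date is a separate step.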
Transformation Rules
Locating the Transcript
The transcript is contained in a table classified “contentTable”. Speech paragraphs are contained in divisions classified “orindent”. Column numbers are outside the speech divisions in paragraphs classified “orcolno”.
select //xhtml:table[@class="contentTable"]//xhtml:div[@class="orindent"]
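For anyone who wants to try the same selection outside XSLT, here is a minimal Python sketch using only the standard library. The sample markup is invented for the example and is far simpler than the real report pages, and the function name is hypothetical.

```python
import xml.etree.ElementTree as ET

XHTML = {"xhtml": "http://www.w3.org/1999/xhtml"}


def speech_divisions(root):
    """Collect the "orindent" divisions found inside any "contentTable" table."""
    divisions = []
    for table in root.iterfind(".//xhtml:table[@class='contentTable']", XHTML):
        divisions.extend(table.iterfind(".//xhtml:div[@class='orindent']", XHTML))
    return divisions


# Invented sample: one speech division inside the table, one outside it.
SAMPLE = """<html xmlns="http://www.w3.org/1999/xhtml"><body>
<table class="contentTable"><tr><td>
<p class="orcolno">Col 24058</p>
<div class="orindent"><p><strong>Ms Alexander:</strong> Will the minister give way?</p></div>
</td></tr></table>
<div class="orindent"><p>Outside the table, so not selected.</p></div>
</body></html>"""

root = ET.fromstring(SAMPLE)
found = speech_divisions(root)
print(len(found))
```

Note that if content tables were ever nested, this sketch would return the inner divisions more than once.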
Gathering Speeches
Find all the paragraphs where the speaker changes and convert to speeches. Look backwards for the nearest column number and timestamp. Include all subsequent contiguous paragraphs where the speaker does not change.
The speaker’s name is in a strong phrase as the first child of the speech paragraph, the phrase ending with a colon, the second child being a text node, any further children of mixed types.
select xhtml:p[xhtml:*[1][self::xhtml:strong[substring(normalize-space(text()), string-length(normalize-space(text()))) = ':']]]
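The same test can be sketched in Python for experimenting outside the stylesheet. This is an illustration under my own assumptions, not the rule as it runs in XSLT: the function name is hypothetical, and the sketch is slightly stricter than the selection above because it also rejects paragraphs that have text before the strong phrase.

```python
import xml.etree.ElementTree as ET

XHTML_NS = "{http://www.w3.org/1999/xhtml}"


def speaker_label(paragraph):
    """Return the speaker label if this paragraph starts a new speech, else None."""
    children = list(paragraph)
    if not children or children[0].tag != XHTML_NS + "strong":
        return None
    # Text before the first child means the strong phrase does not open the paragraph.
    if (paragraph.text or "").strip():
        return None
    # Collapse whitespace in the label, as normalize-space() would.
    label = " ".join("".join(children[0].itertext()).split())
    return label if label.endswith(":") else None
```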
Sketch 2 Abandoned
Whew! A single-pass transformation is much more complicated than I thought, primarily due to speeches straddling columns, sometimes multiple columns for long speeches. I’m not aware of any XPath axis which will naturally solve the problem and there’s no way I know of to navigate within a selected node-set.
However, the output I have now is much closer to the target and I think a third sketch using multiple passes (1. extract all paragraphs, 2. transform to target) will be ideal.
Sketch 3 - Two Passes
Combining the characteristics of the first two sketches, I shall create sketch 3 as a two-pass system. The first pass shall extract all the transcript paragraphs, integrating the timestamps and column numbers, which will overcome the difficulties I had straddling columns in sketch 2. The second pass shall transform the extracted paragraphs into the target format.
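The shape of the two passes can be sketched in a few lines of Python. This is only an illustration of the idea, not the stylesheets themselves: the item kinds ("column", "time", "speaker-change", "continuation"), the function names and the sample data are all invented, and headings and notes are ignored.

```python
def first_pass(items):
    """Pass 1: attach the most recent column number and timestamp to each paragraph."""
    column = timestamp = None
    for kind, text in items:
        if kind == "column":
            column = text
        elif kind == "time":
            timestamp = text
        else:
            yield {"kind": kind, "text": text, "column": column, "time": timestamp}


def second_pass(paragraphs):
    """Pass 2: group contiguous paragraphs into speeches, splitting on speaker changes."""
    speeches = []
    for para in paragraphs:
        if para["kind"] == "speaker-change" or not speeches:
            # A speech takes the column and time in force where it starts.
            speeches.append(
                {"column": para["column"], "time": para["time"], "paragraphs": []}
            )
        speeches[-1]["paragraphs"].append(para["text"])
    return speeches


# Invented sample: the first speech straddles a column boundary.
items = [
    ("column", "24058"),
    ("time", "14:15"),
    ("speaker-change", "Ms Alexander: Will the minister give way?"),
    ("column", "24059"),
    ("continuation", "I ask again."),
    ("speaker-change", "Stewart Stevenson: No."),
]
speeches = second_pass(first_pass(items))
```

Because every paragraph already carries its own column after the first pass, the second pass never has to look backwards across a column marker.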
Although there are techniques I could use to perform both passes with a single stylesheet (such as the ), I wish the process to be as portable as possible. I do not want the process to be dependent on PHP, which I’m using locally to perform the transformation. Ideally, the process should be compatible with the
xsltproc scot-parl-pass-1.xsl source.xml | xsltproc scot-parl-pass-2.xsl - > target.xml
Trying to use these tools under Windows is an exercise in frustration, mostly related to the resolving of named entities. However, all is not lost. Starting with
<?xml version="1.0"?>
<catalog xmlns="urn:oasis:names:tc:entity:xmlns:xml:catalog">
<public publicId="-//W3C//DTD XHTML 1.0 Transitional//EN"
uri="xhtml1-transitional.dtd"/>
</catalog>
Directory of C:\XML
16/06/2007 17:00 204 catalog.xml
09/06/2007 20:12 11,775 xhtml-lat1.ent
09/06/2007 20:15 4,131 xhtml-special.ent
09/06/2007 20:13 13,848 xhtml-symbol.ent
09/06/2007 20:10 32,111 xhtml1-transitional.dtd
Anyways, on to sketch 3…
First Pass
The first pass is based on sketch 1, which extracted and classified the transcript paragraphs, but lost some important details. It removed break elements, which must now be retained in order to allow the separation of members’ names in votes. It did not retain the date of the meeting.
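To show why the break elements matter, here is a small Python sketch that splits a paragraph into names at each break. It is an illustration only: the function name is hypothetical, and the layout of a vote paragraph is my assumption, so the real markup should be checked before relying on it.

```python
import xml.etree.ElementTree as ET

XHTML_NS = "{http://www.w3.org/1999/xhtml}"


def split_on_breaks(paragraph):
    """Split a paragraph's text into separate entries at each <br/> element."""
    entries = []
    current = paragraph.text or ""
    for child in paragraph:
        if child.tag == XHTML_NS + "br":
            # A break closes the current entry; the text after it starts the next.
            entries.append(current)
            current = child.tail or ""
        else:
            # Other inline elements contribute their text to the current entry.
            current += "".join(child.itertext()) + (child.tail or "")
    entries.append(current)
    return [" ".join(entry.split()) for entry in entries if entry.strip()]
```

Without the breaks, the names in such a paragraph would run together into a single string with nothing reliable to split on.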
Libertus said: July 1st, 2007 at 09:41
Lazy Sunday Morning
Picked this up again for a couple of hours. I tried to transform the paragraphs classified as “heading” from the first pass into the “major-heading” and “minor-heading” elements required in the target, and bumped into an oddity. It feels to me that what makes a heading major or minor is more down to taste than rules and I’m too lazy to program for taste!
Perhaps I shall examine the code for the existing scrapers and “borrow” their rule sets. I expect to find many special cases.