Developing Outreach 2 (requires XSL and HTML Tidy)
Posted on Fri, 7th July 2006 at 23:00 under Outreach, WordPress, Plugins, Web Standards
Wow! Another thought inspired by
Narrative
Now that libertus is hosted on a server under my complete control, I can pick and choose (and even compile) the components and services I want available to the blog. I am very fond of Outreach v0.5. As a piece of experimental and educational code, it has served its purpose well, without fail, for many months. It has even evolved a primitive sense of style. See the source.
It’s time to throw it away and start again. This time, I have available the
Outreach 2 shall be a plugin for WordPress that transforms post and comment content using XSLT to implement the same well-defined (with
Additionally, I would like to offer a broader choice of entry
Requirements
- Re-engineer Outreach Content Filter Using XSLT (completed, the wrong way, Sunday 9th July 2006 at 18:00)
- Re-engineer Outreach Plugin for WordPress
- Implement XHTML Validation With HTML Tidy
- Explore Markdown and Wiki Syntaxes
- Test Locally
- Prepare Customised PHP
- Install Customised PHP on libertus
- Install Outreach 2 on libertus
- Integration Testing
- Launch
Libertus said: July 8th, 2006 at 11:22
Re-engineer Outreach Content Filter Using XSLT
What made the first version of the Outreach plugin a challenging development was finding the tags, which involved my learning and using regular expressions. See the source.
With XSLT, the tags are already found so all that is left to do is declare what I want done with them. Sweet.
A Blank XSLT Page
Every program begins with a blank page. A blank PHP page is gorgeously simple (). What about XSLT?
Nowhere near as simple but, I assure all you non-geeks, that is identical to the PHP, only far more specific. There is no chance of anyone or anything, especially not a computer, misunderstanding what is going on. I am especially pleased with the version declaration.
Basic Content Filtering
As I mentioned above, with XSLT, the tags are already found, so all I need to tell the program is what to do with them, if anything. Unless told otherwise, XSLT extracts all the text it can find in the input - not very useful. The first thing I need is an identity transformation or, in other words, a way to copy the input unchanged directly to the output. If there are no Outreach tags in the content, the unchanged content can be the only correct output. XSLT has a well-defined, if somewhat complex, identity transformation pattern.
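The listings from the original post are not reproduced here, so the following is only a minimal sketch of the standard XSLT 1.0 identity transformation inside an otherwise blank stylesheet. It is the well-known textbook pattern, not necessarily the exact code used for Outreach 2.

```xml
<?xml version="1.0" encoding="UTF-8"?>
<!-- A "blank page" in XSLT: the root element carries the version
     declaration and binds the xsl prefix to the XSLT namespace. -->
<xsl:stylesheet version="1.0"
                xmlns:xsl="http://www.w3.org/1999/XSL/Transform">

  <!-- Identity transformation: match every attribute and every node,
       copy it, and recurse into its attributes and children. -->
  <xsl:template match="@*|node()">
    <xsl:copy>
      <xsl:apply-templates select="@*|node()"/>
    </xsl:copy>
  </xsl:template>

</xsl:stylesheet>
```

Without the template, the built-in rules would output only the text content of the input. With it, the output is a copy of the input, and more specific templates can then override the copy for particular tags.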
Matching Outreach Tags - The Wrong Way
There are two ways I can go about this; the wrong way and the right way. I am going to do both, starting with the former. It’s a bit much to expect to do something the right way the first time.
The Outreach tags in my content all have names prefixed with is:, both to differentiate them from the XHTML tags and to make them easier to read, remember and therefore use. The simplest tags, which I use a lot, are <is:uri @cite=URI @title=title @level=level>, to reference a specific webpage, and <is:thing @cite=citation @title=title @level=level @from=URI>, to reference an authoritative source (I use Wikipedia). The result of my embedding these tags in my content is that an automatic <a> link is generated, to the cited href in the former, and to the correct href for the Wikipedia page named after the citation, or some other href directly specified by the from attribute. See the source code.
The simpler of the two tags I want to transform is the uri. So, I have to tell XSLT I want to do something with any such tag found.
Yes, it’s that simple! So the problem is, what exactly do I want to do with this tag? Two things: leave it in the content unchanged, and wrap whatever the tag wrapped with a link, according to the cited URI. So, in my normal use, <is:uri cite="http://somwhere">link to somewhere</is:uri> is transformed into <is:uri cite="http://somwhere"><a href="http://somwhere">link to somewhere</a></is:uri>.
Putting this all together gives me outreach-2/wrong-way-1.xsl. I do so hope that it does what I think it does.
The only way to find out if a program works is to run (or execute) it. If this were a PHP program I could ask the shell to pass it as input to the PHP processor like so: php outreach-2/wrong-way-1.php but no such luck. XSL is a programming language for stylesheets, so the program I have written is not strictly standalone. As a transformation program, it acts on input to produce output as part of a process. To get my output by running my program, I need two more things; input and an XSLT processor.
For input, I shall use the markup for the parent post to this comment because it is chock-full of Outreach tags (copy-and-paste). A command-line XSLT processor is already installed on my Ubuntu Linux system (package ). Ask the shell to put everything together and…
Usage: xsltproc [options] stylesheet file [file ...]
Options:
--version or -V: show the version of libxml and libxslt used
--verbose or -v: show logs of what's happening
… a lot of options
--load-trace : print trace of all external entites loaded
--profile or --norman : dump profiling informations
Project libxslt home page: http://xmlsoft.org/XSLT/
To report bugs and get help: http://xmlsoft.org/XSLT/bugs.html
Oh boy, that’s a lot of options. Bugger that. Just go for it.
My program is wrong. I cannot get away without declaring to the XSLT processor that I’m currently using is: to mean the Outreach tag namespace, in the same way that xsl: is bound to the XSL namespace. I directly copy the namespace declaration from the blog page source.
Hmmm… seems my test data isn’t tidy enough for the XSLT processor. No problem. Rather than change the test data, I can process it with HTML Tidy and provide the tidied output to the processor instead. I love the command line. What does tidy have to say about my test data?
Double hmmm…! HTML Tidy is refusing to do its job because it doesn’t recognise the tags I want the XSLT processor to transform for me! No problem - I’ll just have to tell Tidy about my Outreach tags, using the new-inline-tags configuration option.
Tidied test data follows
Yeah, yeah, yeah! Indeed, Outreach is not approved by the W3C. Many warnings, but no errors, so I get what I want - a clean and tidy XHTML file that I can pass into the XSLT processor. Tidy even fixed a mistake in my markup with an Outreach tag - line 11, column 577. What a wonderful tool! Unfortunately, Tidy did not copy across my namespace declaration, so the XSLT processor still cannot understand the input. I’ll have to process the output from Tidy, using the stream editor, to add the missing namespace to the input before the XSLT processor gets it.
Wait! I know I’m going crazy when I think about using sed to do a job! I prefer PHP, which can do the tidying, replace the <html> tag line and run the XSLT processor. I’ll write a program called wrong-way-1.php that pulls everything together.
Success. A simple is:uri.
Libertus said: July 9th, 2006 at 08:45
Continued Wrongdoing
Yesterday I began with nothing and ended the day with a minor success - a single, specific form of an Outreach tag being rendered, albeit the wrong way. Today my intention is to expand the repertoire of the new Outreach XSL content filter to cover all the tags I have previously defined, their attributes and respective link generators.
But first, I shall complete the implementation of the is:uri tag so that it is perfectly wrong. I can then copy that code to speed the implementation of the others.
Perfecting is:uri
There’s one thing left to do with is:uri - draw the citation URI from the text if no cite attribute is specified. That requires a conditional expression in the transformation program, simply .
Yesterday’s transformation program for is:uri is as follows:
The problem with this is that should there be no cite attribute, the link is generated with an empty href. Not for long.
The href attribute of the link is now generated thusly: .
As I’m on the second day and I want to keep things in order, I’m creating a new set of files; test data, stylesheet and PHP script.
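One way to express the fallback described above is with xsl:choose. This is a sketch, not the author's listing. The namespace URIs are as assumed elsewhere in this thread, and using normalize-space() on the element's text is a choice made here to keep stray whitespace out of the href.

```xml
<?xml version="1.0" encoding="UTF-8"?>
<xsl:stylesheet version="1.0"
                xmlns:xsl="http://www.w3.org/1999/XSL/Transform"
                xmlns:is="http://libertini.net/libertus/outreach/"
                xmlns="http://www.w3.org/1999/xhtml">

  <!-- Identity transformation for everything that is not an Outreach tag. -->
  <xsl:template match="@*|node()">
    <xsl:copy>
      <xsl:apply-templates select="@*|node()"/>
    </xsl:copy>
  </xsl:template>

  <xsl:template match="is:uri">
    <xsl:copy>
      <xsl:apply-templates select="@*"/>
      <a>
        <!-- href comes from the cite attribute when there is one,
             otherwise from the text wrapped by the tag. -->
        <xsl:attribute name="href">
          <xsl:choose>
            <xsl:when test="@cite">
              <xsl:value-of select="@cite"/>
            </xsl:when>
            <xsl:otherwise>
              <xsl:value-of select="normalize-space(.)"/>
            </xsl:otherwise>
          </xsl:choose>
        </xsl:attribute>
        <xsl:apply-templates select="node()"/>
      </a>
    </xsl:copy>
  </xsl:template>

</xsl:stylesheet>
```

With this in place, <is:uri>http://somwhere</is:uri> would link to its own text, while the cite form behaves as before.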
Libertus said: July 9th, 2006 at 11:14
It May Be Wrong, But It Works!
I love days like this! Days where I get a little grin of pride. It irked me that the PHP script I have working locally didn’t work on the blog, so I fixed it. The result from yesterday’s work is now available.
Libertus said: July 9th, 2006 at 12:07
Sustained, Willful Wrongdoing
With is:uri nailed, time to move on to the rather more useful is:thing, which I use to link to the Wikipedia entry about something, just like that, or like that when the text isn’t the citation I’m making, or, on rare occasions, like this when Wikipedia isn’t an authority.
The transformation is pretty simple. The generated link is followed by the citation. There is one snag though. Wikipedia likes underscores instead of spaces, and multi-word citations are separated by spaces. So, a little text transformation is also required. Finally, the from attribute is supported by is:thing, which overrides the automatically generated href leaving both the text and citation intact.
This XSL Thing Is Like So Random
I start with a copy of the template for is:uri, altered accordingly. The from attribute was already implemented by the is:uri cite attribute, so it just gets renamed. If there is no from attribute, either the citation or the text must be transformed into a Wikipedia URL, somehow. I used the PHP strtr() function the first time around. See the source.
My first cut, without any attempt to translate the citation for Wikipedia, is as follows. I’m starting to really like XSLT. It is ugly but elegant, like a certain someone I know.
Hmmm… what are those newlines ( ) doing in my hrefs? It’s coming from the text() function. I need to make sure that kind of whitespace is stripped out. Hmmm… I read the XPath documentation and find the likely-looking core function normalize-space().
Nearly there. I’m sure XPath also offers a character translation function.
Hmm… in a citation doesn’t match a space character, no surprise. I need to disable the fix-uri HTML Tidy configuration option. That does it.
Success. is:thing now supported. Perhaps it is time to add a little style, to show what is going on.
Hmmm… styling with namespaces doesn’t seem to work. Generated links need to have a class anyway, so I’ll go with what I did before () and add a rel attribute set to the element name. This lets me write some simple but classy CSS.
Done. Now for Google links.
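The pieces discussed in this comment (normalize-space(), a character translation function, the from override and the rel attribute) might fit together roughly as follows. This is a sketch under assumptions: the Wikipedia base URL http://en.wikipedia.org/wiki/ is a guess at the link target, the class attribute is left out because its value is not given, and the "link followed by the citation" behaviour is not attempted.

```xml
<?xml version="1.0" encoding="UTF-8"?>
<xsl:stylesheet version="1.0"
                xmlns:xsl="http://www.w3.org/1999/XSL/Transform"
                xmlns:is="http://libertini.net/libertus/outreach/"
                xmlns="http://www.w3.org/1999/xhtml">

  <!-- Identity transformation for everything that is not an Outreach tag. -->
  <xsl:template match="@*|node()">
    <xsl:copy>
      <xsl:apply-templates select="@*|node()"/>
    </xsl:copy>
  </xsl:template>

  <xsl:template match="is:thing">
    <!-- The citation is the cite attribute if present, otherwise the
         wrapped text; normalize-space() strips the stray newlines. -->
    <xsl:variable name="citation">
      <xsl:choose>
        <xsl:when test="@cite">
          <xsl:value-of select="normalize-space(@cite)"/>
        </xsl:when>
        <xsl:otherwise>
          <xsl:value-of select="normalize-space(.)"/>
        </xsl:otherwise>
      </xsl:choose>
    </xsl:variable>
    <xsl:copy>
      <xsl:apply-templates select="@*"/>
      <!-- rel is set to the element name without its prefix: "thing". -->
      <a rel="{local-name()}">
        <xsl:attribute name="href">
          <xsl:choose>
            <!-- An explicit from attribute overrides the generated URL. -->
            <xsl:when test="@from">
              <xsl:value-of select="@from"/>
            </xsl:when>
            <!-- Otherwise build a Wikipedia URL (assumed base), turning
                 spaces into underscores with translate(). -->
            <xsl:otherwise>
              <xsl:value-of select="concat('http://en.wikipedia.org/wiki/',
                                           translate($citation, ' ', '_'))"/>
            </xsl:otherwise>
          </xsl:choose>
        </xsl:attribute>
        <xsl:apply-templates select="node()"/>
      </a>
    </xsl:copy>
  </xsl:template>

</xsl:stylesheet>
```

The rel value gives CSS something to select on, for example an attribute selector such as a[rel="thing"], without relying on namespace-aware styling.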
Libertus said: July 9th, 2006 at 15:59
Deliberate, Wrongful Use Of Google
A secondary purpose of Outreach is to combat link rot by linking indirectly through Google. The most relevant page on a particular subject today may not be so relevant next week, even if the URL has not changed. Google sets an impressive quality standard for directing people to the most relevant website for a particular concept, say search engine optimisation. Google also works well for famous people, such as Armageddon T. Thunderbird, products such as Ozric Tentacles and corporate bodies such as The Government of the United States of America. Even money Google has trouble with the last one.
There is absolutely nothing special about these tags. They all generate Google links to their citations. I copy from the is:thing template and adjust accordingly. I have to calculate the rel attribute and Google doesn’t need the citation translated like Wikipedia.
Add a bit more style; concepts are solid gold, people are brown, products are silver and corporations are black. Presto!
Libertus said: July 9th, 2006 at 17:33
Proper Use Of Google
is:search is so simple that I won’t bother repeating the code here. Remarkably like the previous one, only less confident.
is:word is also too simple to repeat. It’s just a Google search on define:$citation.
As I decided to merge the two into a single template, I’ll repeat the code here. Why not? Part of doing things the wrong way is to learn from mistakes, especially the simple ones.
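A merged template for the two tags could look something like the sketch below. It is an illustration rather than the code from the post. The Google URL form http://www.google.com/search?q= is an assumption, and spaces are turned into + signs with no further URL escaping attempted.

```xml
<?xml version="1.0" encoding="UTF-8"?>
<xsl:stylesheet version="1.0"
                xmlns:xsl="http://www.w3.org/1999/XSL/Transform"
                xmlns:is="http://libertini.net/libertus/outreach/"
                xmlns="http://www.w3.org/1999/xhtml">

  <!-- Identity transformation for everything that is not an Outreach tag. -->
  <xsl:template match="@*|node()">
    <xsl:copy>
      <xsl:apply-templates select="@*|node()"/>
    </xsl:copy>
  </xsl:template>

  <!-- One template serves both tags; they differ only in the query. -->
  <xsl:template match="is:search | is:word">
    <xsl:variable name="citation">
      <xsl:choose>
        <xsl:when test="@cite">
          <xsl:value-of select="normalize-space(@cite)"/>
        </xsl:when>
        <xsl:otherwise>
          <xsl:value-of select="normalize-space(.)"/>
        </xsl:otherwise>
      </xsl:choose>
    </xsl:variable>
    <!-- is:word asks Google for a definition; is:search passes the
         citation through unchanged. -->
    <xsl:variable name="query">
      <xsl:if test="self::is:word">define:</xsl:if>
      <xsl:value-of select="$citation"/>
    </xsl:variable>
    <xsl:copy>
      <xsl:apply-templates select="@*"/>
      <!-- rel is calculated from the element name: "search" or "word". -->
      <a rel="{local-name()}"
         href="http://www.google.com/search?q={translate($query, ' ', '+')}">
        <xsl:apply-templates select="node()"/>
      </a>
    </xsl:copy>
  </xsl:template>

</xsl:stylesheet>
```

The union pattern in the match attribute is what merges the two tags, and the self::is:word test is the only place where they are told apart.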
Add a bit of style (blue for search and yellow for word), twiddle the test data to include an is:search and… here they are. Almost done.
Libertus said: July 9th, 2006 at 17:57
A Final Act Of Wrongness
Only two tags left, and they are trivial. is:place has the same implementation as is:thing and is:money has no implementation (although I have thought of a few), so just needs to be accommodated for the moment.
I do declare, the first step is complete.
Libertus said: July 10th, 2006 at 11:40
Re-engineer Outreach Plugin For WordPress
With a content filtering solution proven, I turn my attention to the difficult second step - turning it back into a plugin for WordPress, without losing any function.
Outreach Plugin For WordPress, version 0.5, Functional Review
The Outreach Plugin does just a little more than link generation. It also provides a control panel for the reader, who may choose from a menu of available outreach settings; removing the links entirely, revealing hidden links and returning to the default links. A control panel appears inside any block of content that contains outreach tags, providing an anchor to which the menu links refer back. See the source.
Finally, the plugin provided a rudimentary validation service during post editing, warning of imbalanced tags, which was broken in WordPress 2.0 by the snazzy new feature.
Outreach Plugin For WordPress, version 2, Functional Specification
Function - Specification
- Activation Hook: Check that the machine is capable of running the plugin. Look for XSL, Tidy etc. Helpful messages with links should required components be missing.
- Post and Comment Content Save: Validate content with HTML Tidy (unless forced not to), and on-demand. Refuse to save if errors in content, allow save on warnings (filtering out those that are a by-product of the validation process itself), offer clean and repair service (do not force).
- Content Filter: Light linear scan for occurrence of outreach tag prefix to trigger heavy text->DOM->XSL->DOM->text sequence. Content DOM load failure (invalid content) indicated by marker prepended to text. XSL processor prepared once. Processor performs link generation and control panel generation according to query string parameter outreach, including .
Libertus said: July 10th, 2006 at 12:54
Outreach Plugin, version 2, Technical Specification
The Outreach plugin, version 2, provides three primary functions: a systems OK check, content validation and content filtering.
An integration layer is required for each particular CMS.
Systems OK Check
Two modes; quick or complete. The default is complete. Boolean response.
quick and attempt to load classes
Make human and other diagnostic information available for systems not OK.
Content Validation
This is a suite of services, oriented around one data structure, streams of text, and one library, HTML Tidy.
The intended use case is that a person has prepared a stream of text content containing markup and they want to save it in the database, with a presumption of validity. The function is therefore the validity checking of a stream of text with boolean response, the provision of validity diagnostics, with optional provision of a repaired or beautified text stream.
The incoming text is assumed to be a fragment of rather than a complete document, so is placed in a valid skeleton document (using DOM if possible) before being passed to Tidy, which then only has to return the body. If there are no errors, return the original content, unless beautification or repair explicitly requested.
Make human and other validation diagnostics available.
Make SHA1 hash of returned valid content available.
Content Filtering
This is a single function, processing a text stream through XSL to add Outreach links and a control panel. All calls made to this function are presumed to be for a single page of content unless indicated otherwise. A parameter level, intended to be provided by the reader through the query string, controls the filter. The control panel provides these links.
Move the closest outreach tag title attribute into the generated link
Control panel divisions (class outreach control-panel) are sequentially id-numbered () and emitted before the content. It contains a link to the help page, so entitled, an X off link, an unordered list of main level control links (none, default, all (if different from default)), with optional sub-list of controls for levels found on tags in content. The title for control links states how many tags. All control links adjust the outreach query parameter only and include the fragment id of the containing control panel.
Actual link generation should be handled by the integration layer. The default is .
Generated links must not invalidate the content. Nested anchors are not allowed. In-line elements within generated links are valid and so must be allowed.
The generated link destinations must be reader-configurable with host-provided defaults.
Libertus said: July 10th, 2006 at 15:05
Outreach Plugin 2 Architecture Analysis
Always seems to be a matter of taste, this part. All the libraries being used are object-oriented, the interface is simple and universal CMS integration seems to be possible.
The plugin requires PHP 5 because it requires Tidy 2. The object system in PHP 5 is far better than in any previous version and seems to be the main aspect of PHP getting attention from the core developers. I’ve had difficulties mixing object syntax with array syntax, but do not anticipate the need to manipulate complicated data structures in the plugin code as all that crap is being handled by the libraries.
An object architecture seems like a good fit. There is the Outreach Core, as described in the technical specification, which does the real work and is separated from the CMS by the Outreach Plugin, which gets and delivers the work.
Each CMS requires an integration class, based on the Plugin class, for instance Outreach Plugin for WordPress.
Libertus said: July 10th, 2006 at 15:18
Outreach Plugin 2 Programming - The Wrong Way
Formal software development systems tend to have a few more levels of paperwork before reaching this point in the project - writing the actual code. The benefit of doing things the wrong way (and not caring) is that it makes something real for people to play with and think about before committing to the rather more costly enterprise of developing it the right way.
It’s a bit of a dice roll. If the wrong way is good enough, maybe the right way can be put off for a while. Maybe the wrong way will show that the right way is too costly, which is also good to know.
Getting Started
I already have a project directory and Subversion repository. It even contains a few files! I’ll create a directory and put a blank PHP file called in it. And . And .
Started
I have code that looks almost, but not entirely unlike a WordPress plugin. If I ran it, it would do nothing, although it would be working. Perhaps I should test that… perhaps not.
Libertus said: July 11th, 2006 at 12:32
The Outreach2 Control Panel
Having proven that tag rendering is easy, how challenging is constructing a useful little control panel from the tags themselves? Immediately, I can see that I need all the Outreach tags from the content being filtered. If there are none, the filter has no work to do and no control panel is required.
The Outreach content filter is defined by that template’s match, although an empty template would do the exact opposite of its intended function - strip out all the outreach tags. That’s worth a run against my test data. Create wrong-way-3.php, wrong-way-3.xsl and wrong-way-3-test-data. Contrast with the output from the previous wrong way script which uses the same test data. There are distinct signs of something being missing.
What Was Removed?
The template matches all outreach tags and does precisely bugger all with each one. If the template is going to take the tags, the least it can do is give back a count of how many were taken.
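For the counting step on its own, XPath's count() function over every element in the Outreach namespace is one possibility. The sketch below is standalone and only prints the number; it is not the wrong-way-3.xsl stylesheet, and the namespace URI is the one quoted later in the comments.

```xml
<?xml version="1.0" encoding="UTF-8"?>
<xsl:stylesheet version="1.0"
                xmlns:xsl="http://www.w3.org/1999/XSL/Transform"
                xmlns:is="http://libertini.net/libertus/outreach/">

  <!-- Plain text output: only the number is wanted. -->
  <xsl:output method="text"/>

  <!-- is:* matches any element whose namespace is the Outreach one,
       whatever its local name; // looks at every depth, so nested
       Outreach tags are counted too. -->
  <xsl:template match="/">
    <xsl:value-of select="count(//is:*)"/>
    <xsl:text>&#10;</xsl:text>
  </xsl:template>

</xsl:stylesheet>
```

Run against the tidied test data, this should report how many Outreach tags the filter template would have taken.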
Libertus said: July 11th, 2006 at 19:46
XSL Is A Frustrating Language
indeed! So is Brainfuck. Unfortunately, my brain is fucked, being naturally Turing incomplete. I’m left with this feeling of stupidity. There are 27 Outreach tags in my test data. Now what?
Oh well. There’s always tomorrow.
And there’s always a fallback available to PHP.
Libertus said: July 12th, 2006 at 13:45
Yesterday’s Frustration, Today’s …?
I don’t want the types of the tags but I do want their level attributes. Each unique level is added to the list in the control panel, linked and titled accordingly. The control panel can be built as a result tree fragment in a variable and copied into the content where necessary.
Libertus said: July 16th, 2006 at 21:38
Phew!
Learning how a language works is hard work! I’ve studied XSLT enough and developed Outreach 2 enough to stop. There are other priorities and I think I’m able to take them on now.
I’ll be back…
Libertus said: July 27th, 2006 at 11:28
Question From Another Project
Could the entirety of the Outreach tag language be expressed using <a>? If not, why not?
Sean said: August 18th, 2006 at 01:24
Wow, you could really use a plugin that allows authors to have threaded posts. That way you’re not using your comments to continue on with a thought or idea.
I’ve been playing around with your idea. The only problem is your site - and any site I try your stuff with - doesn’t even come close to validating.
So to me it seems like I can have either a valid site, or semantic site. Can’t have both.
Libertus said: August 18th, 2006 at 10:01
Hi Sean,
I miss you on IRC! You should be spending your time there instead of expanding your plugin portfolio!!
Threaded posts? Yeah, I’d like that. I love to explore tangents and off-shoot ideas, which isn’t easy with a linear discussion format. I’m working on it, just not for WordPress, which isn’t designed to host serious discussions. I’m quite fond of the style used at theyworkforyou.com, which was designed to host serious discussions.
As to validation, my site is standards-compliant - more so than the W3C validator. I mean, come on, who actually uses XML namespaces in web documents, so why bother considering them valid, or even worse, ignoring them?
Libertus said: August 18th, 2006 at 10:13
Go on, run this page by the W3C Markup Validator. A few errors on my part, lost in a sea of computer-generated errors. What a mess!
Typing in raw XHTML isn’t easy. Markup is particularly susceptible to minor errors, such as the typo and the brain fart.
Sean said: August 19th, 2006 at 09:07
Hehe.. 68 errors is pretty significant. What gets me is declaring the is: namespace is apparently invalid - xmlns:is="http://libertini.net/libertus/outreach/">.
I’ve been trying to avoid IRC. Nothing blows enormous amounts of my time more than #wordpress.
It’s an addiction.
A threaded posts plugin would be interesting. I picture something like GMail’s threaded emails thingy. You should email me sometime. I’ve been working on “minor” projects with friends that you might be interested in.
Libertus said: September 1st, 2006 at 18:53
I’ve got threaded comments now. They’ll do.
Libertus said: August 19th, 2006 at 16:07
Ironic. The namespace is correctly declared - even the validator notices it. The namespace attribute isn’t valid HTML, nor are any of the namespaced tags. So, my page is valid XML but invalid XHTML. I can live with that. So long as <xsl:value-of select="document('http://libertini.net/libertus/')//title/text()"/> returns libertus, I’m happy.
I understand the #wordpress addiction! If you prefer e-mail, I’m on that too.