Introducing The BlogRAMthing
Posted on Fri, 19th May 2006 at 22:00 under WordPress, Coding, Plugins
A new plugin is running here which offers in-line post editing. It will offer a lot more soon. See also: blogramthing, blog-ram-thing, The Blog RAM Thing
? posted 14th March 2006.
Building Another BlogRAMThing
The first BlogRAMthing does its job well enough to prove the concept. On to the next BlogRAMthing for a completely different purpose - a rating system.
People may leave comments against certain posts they rate. Those ratings are counted and analysed to produce an overall rating for the post. Posts may be organised according to their overall rating.
Easy.
Except for the arbitrary technical limitation. No database modifications of any kind are allowed. It has to be done the BlogRAMthing way - the content is the data.
Easy? This task is currently rated .
Plan
Do you know what you are looking for? Would you recognise it if you saw it? Is your query specific or general? Do you know what you are not looking for?
- Recognise Post
- Specific category nicename
- Post Recognises Back
- Who are you? Why do you want to look at me?
- Recognise Post Content
- Find pre-calculated rating. Calculate and store rating.
- Recognise Rating Comment
- The many and varied ways people express their feeling about something: Rating: 1-5, love/hate, terrible/awful/horrible/ugly to great/excellent/lovely/beautiful, + ++ ? -- -, 5 stars to *, :), >:O, this post sucks! and so on. How deeply do you want to look?
- Comment Recognises Back
- Author offered rating change.
- Recognition Pattern Selector Language (CSS/Regex)
- Recognising The Loop
- Reorder loop posts according to rating
- Feeding Data Only
Diversions
- Tidying Is Still Unpopular But More Useful Than Ever
- Building And Using A Customised PHP With Tidy On Ubuntu Linux
- Customising Ubuntu/Debian Source Packages
- PHP5 With Tidy Support On Ubuntu Linux
- The Secret Heart of BlogRAMthing
- From Markup Analyser To Markup Recogniser
- Analysing Marked-up Text With W3C HTML Tidy And PHP5
- ?
- ?
Libertus said: May 22nd, 2006 at 15:18
Recognise Post
The first BlogRAMthing was pretty dumb. It was switched on-and-off depending on the URL query string and allowed one recognition pattern of a fixed depth. This BlogRAMthing will support multiple recognition patterns of arbitrary depths.
I like simple function names. recognise_post( $post, $as_what=null ) will do for starters. This function can be recognition step 1 (what does this post look like) or step 2 (get this data from the post).
BlogRAMthing will have a library of recognition patterns available to analyse the post. In this specific instance, the post is recognised as being ratable by having the appropriate category. Let’s say I wanted people to be able to rate my plugin posts. I have a category named . They have a thing called a . Hmmm…
I have some rules to write. The target is any post categorised as plugins or rating. The recogniser operates by default on content, addressable by regular expression, and makes available other information about the post, addressable by $value, e.g. $title.
Somewhere in the content of the post there should be found something identifiable as a rating. I would prefer an element with a CSS class of rating. I could go with a number between 1 and 5 near a word . At a push I’ll take just the number and if there’s nothing? I’ll have to figure out how to calculate it.
.rating,
/\b(?:rate|rating)\b.*\b(?P<rating>[1-5])\b/,
/\b(?P<rating>[1-5])\b/
{
}
Analysing the post content with those rules should yield something I can call a . It could be the correct rating or it could be nonsense. One way or the other, it may need to be calculated. The result of that calculation might be submitted back to the post.
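A minimal sketch of how that ordered rule list might be applied, assuming plain PHP regular expressions. The function name recognise_post_rating is hypothetical, the class="rating" rule is approximated with a regular expression instead of a real CSS selector, and word boundaries (\b) are used around the digit so that only a lone 1-5 counts.

```php
<?php
// Hypothetical helper: try each rule in order of preference and return the
// first rating found in the post content, or null when nothing matches.
function recognise_post_rating($content)
{
    $rules = array(
        // Preferred: an element carrying class="rating" whose text is 1-5.
        '/<(\w+)[^>]*\bclass="[^"]*\brating\b[^"]*"[^>]*>\s*(?P<rating>[1-5])\s*<\/\1>/i',
        // Fallback: a lone digit 1-5 somewhere after the word "rate" or "rating".
        '/\b(?:rate|rating)\b.*?\b(?P<rating>[1-5])\b/is',
        // Last resort: any lone digit 1-5, which may well be nonsense.
        '/\b(?P<rating>[1-5])\b/',
    );
    foreach ($rules as $rule) {
        if (preg_match($rule, $content, $m)) {
            return (int) $m['rating'];
        }
    }
    return null; // nothing recognisable: the rating has to be calculated
}
```

A null result is the "nothing there" case, where the rating would have to be calculated from the comments.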
The calculation involves all the comments left against the post, which could be a big load. The more the post stores about the previous calculation, if any, each successive calculation needs less work. If the last calculation time were available with the rating, both human and computer could observe the discrepancy and react. For now, I’ll take the hit of recalculating from scratch for each comment. A future BlogRAMthing must have a lighter touch.
The patterns matched to recognise a rating comment are more numerous and relaxed than those for the post. In all instances, a template containing correct markup is preferred. The subsequent patterns are the equivalent of guesses, which offer a little freedom of expression.
[1-5]
(hate,dislike|don’t like,okay,like,love)
(\*|1\*|one star,\*\*|2\*|two stars,\*\*\*|3\*|three stars,\*\*\*\*|4\*|four stars,\*\*\*\*\*|5\*|five stars)
But not too much, as only a concept needs proven. The first pattern is canonical and results in a numeric rating. The other patterns are synonym lists that resolve to the numeric offset of the pattern they match, so “okay” will result in 3.
Which comments to match on? I think approved only. What of the author of comments? If an author comments with a rating more than once, only the last rating counts. Ratings are unique to authors.
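The comment rules above can be sketched as follows. Everything named here is an assumption for illustration: the function names, the comment array shape (author, approved, text, oldest first) and the use of a plain mean as the overall rating, which the post does not specify. The star synonyms are simplified to a bare run of asterisks plus the spelled-out forms, since a lone digit such as "2*" is already caught by the canonical pattern.

```php
<?php
// Hypothetical helper: resolve one comment's text to a rating 1-5, or null.
function recognise_comment_rating($text)
{
    // Canonical pattern: a lone digit 1-5 is the rating itself.
    if (preg_match('/\b([1-5])\b/', $text, $m)) {
        return (int) $m[1];
    }
    // Synonym lists: the 1-based position of the matching entry is the rating.
    $lists = array(
        array('hate', 'dislike|don(?:\'|’)t like', 'okay', 'like', 'love'),
        array('one star', 'two stars', 'three stars', 'four stars', 'five stars'),
    );
    foreach ($lists as $list) {
        foreach ($list as $offset => $words) {
            if (preg_match('/\b(?:' . $words . ')\b/i', $text)) {
                return $offset + 1;
            }
        }
    }
    // A bare run of one to five asterisks counts as its own length.
    if (preg_match('/(?<!\*)(\*{1,5})(?!\*)/', $text, $m)) {
        return strlen($m[1]);
    }
    return null;
}

// Hypothetical helper: overall rating from approved comments, one vote per
// author, the most recent vote winning. Comments are assumed oldest first.
function rate_post($comments)
{
    $latest = array();
    foreach ($comments as $comment) {
        if (empty($comment['approved'])) {
            continue; // unapproved comments do not count
        }
        $rating = recognise_comment_rating($comment['text']);
        if ($rating !== null) {
            $latest[$comment['author']] = $rating; // later vote overwrites earlier
        }
    }
    if (count($latest) === 0) {
        return null;
    }
    return array_sum($latest) / count($latest); // assumed: a plain mean
}
```

This recalculates from scratch on every call, which matches the "take the hit" decision for now.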
Libertus said: May 24th, 2006 at 08:59
Tidying Is Still Unpopular But More Useful Than Ever
What is Tidy? What are the PHP Tidy functions and why would I use them? Why on earth would I limit BlogRAMthing to working only on PHP 5.1 and above?
Tidy is a program for cleaning and repairing X(HT)ML marked-up text.
The PHP Tidy functions integrate PHP and Tidy so that PHP scripts gain the ability to use existing Tidy facilities available on the host.
PHP4 supports Tidy 1.0. PHP5 supports Tidy 2.0, which can parse marked-up text such that a PHP script can navigate the document through the markup, a facility I need. PHP5.1 adds to that navigation facility the line and column numbers in the original text where the markup appears, which is the fundamental information BlogRAMthing uses.
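A rough sketch of the navigation described above, assuming the tidy extension is loaded. It walks the parsed tree and reports each tag with the line and column that tidyNode exposes. The function name tag_positions and the sample markup are hypothetical, and the script only prints a notice on a PHP built without Tidy.

```php
<?php
// Hypothetical helper: collect (name, line, column) for every element node
// reachable from $node, depth first.
function tag_positions($node, &$found = array())
{
    if ($node->type === TIDY_NODETYPE_START || $node->type === TIDY_NODETYPE_STARTEND) {
        $found[] = array($node->name, $node->line, $node->column);
    }
    if ($node->hasChildren()) {
        foreach ($node->child as $child) {
            tag_positions($child, $found);
        }
    }
    return $found;
}

if (extension_loaded('tidy')) {
    $markup = "<p>Rating:\n<span class=\"rating\">4</span></p>";
    $tidy = tidy_parse_string($markup, array(), 'utf8');
    foreach (tag_positions($tidy->body()) as $row) {
        echo $row[0], ' at line ', $row[1], ', column ', $row[2], "\n";
    }
} else {
    echo "tidy extension not loaded\n";
}
```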
So my choices are a) write my own standard markup recogniser or b) use someone else’s.
Going Out On A Limb
I am essentially choosing between a new responsibility and a new responsibility. The salient feature of both is dependence on someone else’s work. If I choose to create my own tool, I have to take responsibility for it, whether that is a new development or using a customised PHP. I choose to relieve either the HTML Tidy developers or my upstream Linux distributor of responsibility for the correct operation of part of my overall system. Whose responsibilities do I feel most confident of being able to take on?
Libertus said: May 24th, 2006 at 09:40
Building And Using A Customised PHP With Tidy On Ubuntu Linux
The idea of building and maintaining my own markup recogniser is tempting but insane. Tidy is just… wow… look at it! “Tidy may be the biggest new piece of functionality in PHP for a long time”. I’m not sure the developers of KSES would agree, but I do. The only PHP extension to excite me so far has been MySQL.
The Ubuntu Linux maintainers do not share my excitement, so they deliver PHP5 without Tidy but with practically everything else. I shall make a polite request for a change to this policy. In the meantime, I must learn the minimal impact method for customising PHP on Ubuntu.
Minimal Impact?
Before going any further with any impact assessment, I must confirm that the problem still exists and is not likely to correct itself in the near future. Simple patience is my tool of choice where possible, so a quick recap of the problem and check on Ubuntu’s site will tell me if I have work to do.
Reconfirm absence of Tidy support in latest upstream package
My coding workstation is currently running the latest beta of Xubuntu, which I take to be the current edge. If the PHP5 delivered to my workstation cannot execute the example Tidy code from the PHP manual, I shall accept that as proof that the upstream packages are insufficient.
php -v
Copyright (c) 1997-2006 The PHP Group
Zend Engine v2.1.0, Copyright (c) 1998-2006 Zend Technologies
cat > tidy_test.php
php tidy_test.php
dpkg -s tidy
Status: install ok installed
Maintainer: Jason Thomas
Version: 20051018-1
Depends: libc6 (>= 2.3.4-1), libtidy-0.99-0
Description: HTML syntax checker and reformatter
Corrects markup in a way compliant with the latest standards, and
optimal for the popular browsers. It has a comprehensive knowledge
of the attributes defined in the HTML 4.0 recommendation from W3C,
and understands the US ASCII, ISO Latin-1, UTF-8 and the ISO 2022
family of 7-bit encodings. In the output:
.
* HTML entity names for characters are used when appropriate.
* Missing attribute quotes are added, and mismatched quotes found.
* Tags lacking a terminating ‘>’ are spotted.
* Proprietary elements are recognized and reported as such.
* The page is reformatted, from a choice of indentation styles.
.
Tidy is a product of the World Wide Web Consortium.
dpkg -s php5
Status: install ok installed
Maintainer: Debian PHP Maintainers
Version: 5.1.2-1ubuntu3
Depends: libapache2-mod-php5 (>= 5.1.2-1ubuntu3) | php5-cgi (>= 5.1.2-1ubuntu3), php5-common (>= 5.1.2-1ubuntu3)
Description: server-side, HTML-embedded scripting language (meta-package)
This package is a meta-package that, when installed, guarantees that you
have at least one of the four server-side versions of the PHP5 interpreter
installed. Removing this package won’t remove PHP5 from your system, however
it may remove other packages that depend on this one.
.
PHP5 is an HTML-embedded scripting language. Much of its syntax is borrowed
from C, Java and Perl with a couple of unique PHP-specific features thrown
in. The goal of the language is to allow web developers to write dynamically
generated pages quickly.
.
Homepage: http://www.php.net/
The current Ubuntu PHP packages do not support Tidy.
Determine if upstream already plans to include Tidy support.
I have no idea where to find this information, so I start with the Ubuntu Linux homepage looking for something to do with future plans or development. There are mailing lists. I’ll search them for , especially the developer’s list.
Success! Thanks again to Google, ubuntu PHP with tidy yields Bug #41690 (2006-04-27): Support for tidy not included (libtidy) in php5. Surely following this thread will lead me to what is going to happen soon.
There’s the upstream Debian bug report, Debian Bug report logs - #355976 (08 Mar 2006) Please include tidy extension, which has not been classified.
Oh no! There’s the even older, still unclassified #332763 (08 Oct 2005): libapache2-mod-php5: no tidy support with php5 in apache module or in another package?
There seems to be a disturbing lack of excitement about tidying at Debian! Someone, somewhere along the chain, has to be persuaded to flick the switch. I have no idea how much work is involved in doing so. I need to flick the switch myself and learning how to do so properly could take days. Someone out there knows how to do it. While I’m learning, I’ll surely encounter their e-mail addresses. I’ll ask advice along the way.
Libertus said: May 24th, 2006 at 11:31
Do It Myself, Ask Others To Follow
I have a problem. I need it fixed. I also need other people to fix the problem too, which requires diplomacy. Among engineers, diplomacy takes two forms; effort and money. The less effort you put into diplomacy, the more money it costs you. Engineers love to say .
My first effort has to be an assessment of the options available.
- Full Manual
  - Build From PHP Source
    - Acquire PHP source code
    - Read INSTALL file
    - Say the magic words for all the things I need PHP to do
    - Say the magic words to make the PHP that does all the things I need
    - Remove all Ubuntu PHP5 packages
    - Say the magic words to put all the new bits in the right places on my host
- Semi-Automatic
  - Customise Ubuntu Source Package
    - Learn about source packages
    - Learn about tools needed to build and manage source packages
    - Acquire tools and PHP source packages
    - Learn how to add Tidy support to the PHP source packages
    - Customise, build and test PHP source packages
    - Integrate custom packages on my host
- Automatic
  - Persuade upstream to package PHP
    - Locate correct address for package configuration policy change request
    - Issue and track change request
    - Await change
What next?
Libertus said: May 24th, 2006 at 12:29
Customising Ubuntu/Debian Source Packages
As I understand the term, a source package contains the source files and build instructions for a corresponding binary package in a ready-to-build form for a particular program. Package maintainers for a particular distribution look after this lump of source and determine the build instructions necessary to best integrate the program into the overall operating system. In doing so they smooth out and manage the potentially conflicting needs not only of the program and operating system developers, but also of the developers of any programs on which the maintainer’s package depends and the developers of any programs which depend on the maintainer’s package. Maintenance of computer programs is a fiendishly complicated task, so computers are used to help.
For every binary package I can install, there is a corresponding source package that allows me to create the binary package for myself, tailored to my own needs. This is a fundamental feature of Unix operating systems, so fundamental that somewhere on my machine already there must be instructions for doing so.
I’ll start with the help for the package management tool, apt-get. What does it have to say about source packages?
apt-get --help | grep source
source - Download source archives
build-dep - Configure build-dependencies for source packages
-b Build the source package after fetching it
See the apt-get(8), sources.list(5) and apt.conf(5) manual
Quite a lot. Mainly, see the manual! No surprise. I don’t know something, so I need to read man apt-get.
Source Packages, According To The Manual
The manual for the apt-get command explains the operation of the source option adequately enough to give me ideas for things to try.
APT will examine the available packages to decide which source package to fetch. It will then find and download into the current directory the newest available version of that source package. Source packages are tracked separately from binary packages via deb-src type lines in the sources.list(5) file. This probably will mean that you will not get the same source as the package you have installed or as you could install. If the --compile option is specified then the package will be compiled to a binary .deb using dpkg-buildpackage, if --download-only is specified then the source package will not be unpacked.
A specific source version can be retrieved by postfixing the source name with an equals and then the version to fetch, similar to the mechanism used for the package files. This enables exact matching of the source package name and version, implicitly enabling the APT::Get::Only-Source option.
Note that source packages are not tracked like binary packages, they exist only in the current directory and are similar to downloading source tar balls.
My “full manual” option involves downloading the source tar ball. I accept that my source packages will not be managed like the ones supplied with the operating system. My guess is that the source package for php5 is worth a look.
Downloading The PHP5 Source Package
cd Desktop
mkdir php5_source_examination
cd php5_source_examination/
apt-get source php5
Building dependency tree… Done
Need to get 8163kB of source archives.
Get: 1 http://gb.archive.ubuntu.com dapper/main php5 5.1.2-1ubuntu3 (dsc) [1763B]
Get: 2 http://gb.archive.ubuntu.com dapper/main php5 5.1.2-1ubuntu3 (tar) [8064kB]
Get: 3 http://gb.archive.ubuntu.com dapper/main php5 5.1.2-1ubuntu3 (diff) [97.4kB]
Fetched 8163kB in 32s (254kB/s)
sh: dpkg-source: command not found
Unpack command ‘dpkg-source -x php5_5.1.2-1ubuntu3.dsc’ failed.
Check if the ‘dpkg-dev’ package is installed.
E: Child process failed
Oops! Some tools necessary for working with source packages need installing.
sudo apt-get install dpkg-dev
and try again.
apt-get source php5
Building dependency tree… Done
Skipping already downloaded file ‘php5_5.1.2-1ubuntu3.dsc’
Skipping already downloaded file ‘php5_5.1.2.orig.tar.gz’
Skipping already downloaded file ‘php5_5.1.2-1ubuntu3.diff.gz’
Need to get 0B of source archives.
dpkg-source: extracting php5 in php5-5.1.2
dpkg-source: unpacking php5_5.1.2.orig.tar.gz
dpkg-source: applying ./php5_5.1.2-1ubuntu3.diff.gz
OOO! Look at that. An original and a patch. Very interesting. A clear structure. What files do I have?
ls -l
total 7996
drwxr-xr-x 15 paul paul 4096 2006-05-24 12:56 php5-5.1.2
-rw-r--r-- 1 paul paul 97351 2006-05-18 05:08 php5_5.1.2-1ubuntu3.diff.gz
-rw-r--r-- 1 paul paul 1763 2006-05-18 05:08 php5_5.1.2-1ubuntu3.dsc
-rw-r--r-- 1 paul paul 8064193 2006-01-18 07:15 php5_5.1.2.orig.tar.gz
One directory, three files, all previously noted in the output from apt-get. The total of 7996 in the listing counts disk blocks, roughly 8MB, rather than files. One look in the php5-5.1.2 directory tree confirms that I have the source.
Examining the PHP5 Source Package
My goal is to compile PHP5 with Tidy support system-wide by adding the configuration option --with-tidy. I’m looking for any file in the source package devoted to configuration.
find -name "*config*"
./php5-5.1.2/ext/gd/config.w32
./php5-5.1.2/ext/bz2/config.m4
several pages of configuration files
./php5-5.1.2/debian/patches/052-phpinfo_no_configure.patch
./php5-5.1.2/debian/php5-module.config
Paydirt! The module.config file. What’s in there? Can I have the Tidy module, please? No… nothing to do with the PHP configuration. What about the patch? Aha! One mystery solved. Debians remove the PHP configuration options from the display produced by the phpinfo() function. OK, I grok that.
Still no help for my primary goal. I need to look in the php5-5.1.2/debian directory.
I find two files of interest; modulelist and rules. I’m not confident that Tidy is a module. What I need to do is add to the PHP compilation options.
Love Rules
Eureka! I have found it! Here are the rules I need to change. A makefile. There is a list of configuration options for PHP, called COMMON_OPTIONS, and other configuration option lists depending on which PHP is being built e.g. the Apache2 module or the standalone CLI. I want Tidy available to them all, so I’m going to add the line --with-tidy to the COMMON_OPTIONS and figure out how to rebuild.
Rebuilding Debian PHP5 With Tidy From Source
I refer once more to the apt-get manual. It was by that program I acquired the source, so from there I should learn what next to do with it. I need to compile the source into a binary package. The manual says I do that with dpkg-buildpackage. How does that program work?
dpkg-buildpackage --help
Debian dpkg-buildpackage .
Copyright (C) 1996 Ian Jackson.
Copyright (C) 2000 Wichert Akkerman
This is free software; see the GNU General Public Licence version 2
or later for copying conditions. There is NO warranty.
Usage: dpkg-buildpackage [options]
Many lines of options I don’t understand
It doesn’t take a package name. Presumably it gets everything it needs from the current directory. Oh well, why not?
dpkg-buildpackage
dpkg-buildpackage: source version is 5.1.2-1ubuntu3
dpkg-buildpackage: source changed by Adam Conrad
dpkg-buildpackage: host architecture i386
dpkg-checkbuilddeps: Unmet build dependencies: apache2-prefork-dev (>= 2.0.53-3) chrpath debhelper (>= 3) freetds-dev libbz2-dev (>= 1.0.0) libcurl3-openssl-dev | libcurl3-dev libfreetype6-dev libgcrypt11-dev libgd2-xpm-dev (>= 2.0.28-3) libgdbm-dev libjpeg62-dev libkrb5-dev libmhash-dev (>= 0.8.8) libmysqlclient15-dev | libmysqlclient12-dev libncurses5-dev libpam0g-dev libpng12-dev libpq-dev | postgresql-dev librecode-dev libsnmp9-dev | libsnmp-dev libsqlite0-dev libt1-dev libwrap0-dev libxmltok1-dev libxml2-dev (>= 2.4.14) libxslt1-dev (>= 1.0.18) re2c unixodbc-dev
dpkg-buildpackage: Build dependencies/conflicts unsatisfied; aborting.
dpkg-buildpackage: (Use -d flag to override.)
Oh boy! First, I better find out how to change the revision and maintainer. Second, I need to install more development libraries. I might be able to force my way through the second but the first is important. This is my package so I’ll add my own changelog entry.
dpkg-buildpackage -d
dpkg-buildpackage: source version is 5.1.2-1ubuntu3-libertus
dpkg-buildpackage: source changed by Paul Mitchell
debian/rules clean
dh_testdir
make: dh_testdir: Command not found
make: *** [unpatch] Error 127
Fair enough. I must install more packages and that gives me time for a smoke-break!
Satisfying Dependencies
This is a bit like a game. How few extra bits do I have to install to build the thing I need? After a little digging, I decide on my opening move.
sudo apt-get install build-essential chrpath debhelper
dpkg-buildpackage -d
sudo dpkg-buildpackage -d
configure: error: xml2-config not found. Please check your libxml2 installation.
Yay! That was very successful considering the many missing dependent libraries. I’ll add the ones I know for sure I need.
sudo apt-get install apache2-prefork-dev libmysqlclient15-dev libxml2-dev
sudo dpkg-buildpackage -d
configure: error: Please reinstall the BZip2 distribution
install libbz2-dev libjpeg62-dev libpng12-dev
configure: error: Please reinstall the libcurl distribution
This could take a long time. Dependency means dependency! So, there’s no point in trying to force things through. I should install the packages it says I need. Also, I’ve noticed a repeating message in the build output.
Autoconf 2.13 is marked obsolete, so I’ll ignore the warning. I take a deep breath and issue what I hope to be my final install command that leads to my first successful package rebuild.
sudo apt-get install freetds-dev libbz2-dev libcurl3-openssl-dev libfreetype6-dev libgcrypt11-dev libgd2-xpm-dev libgdbm-dev libjpeg62-dev libkrb5-dev libmhash-dev libncurses5-dev libpam0g-dev libpng12-dev libpq-dev librecode-dev libsnmp9-dev libsqlite0-dev libt1-dev libwrap0-dev libxmltok1-dev libxslt1-dev re2c unixodbc-dev
sudo dpkg-buildpackage
dpkg-deb - error: Debian revision (`libertus') doesn’t contain any digits
HAH! I wondered about that. I twice considered changing my revision number but twice thought better. Now I have to. Pity the tools didn’t check that at the start. I love the old ways.
Update the changelog and start again.
sudo dpkg-buildpackage
dpkg-deb: building package `php5' in `../php5_5.1.2-1ubuntu3-tidy1_all.deb'.
dpkg-deb: building package `php-pear' in `../php-pear_5.1.2-1ubuntu3-tidy1_all.deb'.
signfile php5_5.1.2-1ubuntu3-tidy1.dsc
gpg: WARNING: unsafe ownership on configuration file `/home/paul/.gnupg/gpg.conf'
gpg: skipped “Paul Mitchell
gpg: [stdin]: clearsign failed: secret key not available
Did that work? Unsafe ownership on my configuration file? I don’t care about my package being signed, so I hope it doesn’t matter. Do I have packages? What’s in the parent directory?
ls -l ..total 34156
-rw-r–r– 1 root root 2270856 2006-05-24 16:18 libapache2-mod-php5_5.1.2-1ubuntu3-tidy1_i386.deb
drwxr-xr-x 19 paul paul 4096 2006-05-24 16:17 php5-5.1.2
-rw-r–r– 1 paul paul 367483 2006-05-18 05:08 php5_5.1.2-1ubuntu3.diff
-rw-r–r– 1 paul paul 1763 2006-05-18 05:08 php5_5.1.2-1ubuntu3.dsc
-rw-r–r– 1 root root 1478 2006-05-24 15:06 php5_5.1.2-1ubuntu3-libertus.dsc
-rw-r–r– 1 root root 8148601 2006-05-24 15:06 php5_5.1.2-1ubuntu3-libertus.tar.gz
-rw-r–r– 1 root root 1042 2006-05-24 16:18 php5_5.1.2-1ubuntu3-tidy1_all.deb
-rw-r–r– 1 root root 1472 2006-05-24 15:44 php5_5.1.2-1ubuntu3-tidy1.dsc
-rw-r–r– 1 root root 0 2006-05-24 16:18 php5_5.1.2-1ubuntu3-tidy1.dsc.asc
-rw-r–r– 1 root root 8149073 2006-05-24 15:44 php5_5.1.2-1ubuntu3-tidy1.tar.gz
-rw-r–r– 1 paul paul 8064193 2006-01-18 07:15 php5_5.1.2.orig.tar.gz
-rw-r–r– 1 root root 4489962 2006-05-24 16:18 php5-cgi_5.1.2-1ubuntu3-tidy1_i386.deb
-rw-r–r– 1 root root 2257292 2006-05-24 16:18 php5-cli_5.1.2-1ubuntu3-tidy1_i386.deb
-rw-r–r– 1 root root 131282 2006-05-24 16:17 php5-common_5.1.2-1ubuntu3-tidy1_i386.deb
-rw-r–r– 1 root root 22588 2006-05-24 16:18 php5-curl_5.1.2-1ubuntu3-tidy1_i386.deb
-rw-r–r– 1 root root 312268 2006-05-24 16:18 php5-dev_5.1.2-1ubuntu3-tidy1_i386.deb
-rw-r–r– 1 root root 32844 2006-05-24 16:18 php5-gd_5.1.2-1ubuntu3-tidy1_i386.deb
-rw-r–r– 1 root root 19804 2006-05-24 16:18 php5-ldap_5.1.2-1ubuntu3-tidy1_i386.deb
-rw-r–r– 1 root root 8386 2006-05-24 16:18 php5-mhash_5.1.2-1ubuntu3-tidy1_i386.deb
-rw-r–r– 1 root root 22012 2006-05-24 16:18 php5-mysql_5.1.2-1ubuntu3-tidy1_i386.deb
-rw-r–r– 1 root root 37378 2006-05-24 16:18 php5-mysqli_5.1.2-1ubuntu3-tidy1_i386.deb
-rw-r–r– 1 root root 27050 2006-05-24 16:18 php5-odbc_5.1.2-1ubuntu3-tidy1_i386.deb
-rw-r–r– 1 root root 39798 2006-05-24 16:18 php5-pgsql_5.1.2-1ubuntu3-tidy1_i386.deb
-rw-r–r– 1 root root 8068 2006-05-24 16:18 php5-recode_5.1.2-1ubuntu3-tidy1_i386.deb
-rw-r–r– 1 root root 14170 2006-05-24 16:18 php5-snmp_5.1.2-1ubuntu3-tidy1_i386.deb
-rw-r–r– 1 root root 25650 2006-05-24 16:18 php5-sqlite_5.1.2-1ubuntu3-tidy1_i386.deb
-rw-r–r– 1 root root 20556 2006-05-24 16:18 php5-sybase_5.1.2-1ubuntu3-tidy1_i386.deb
-rw-r–r– 1 root root 37828 2006-05-24 16:18 php5-xmlrpc_5.1.2-1ubuntu3-tidy1_i386.deb
-rw-r–r– 1 root root 15146 2006-05-24 16:18 php5-xsl_5.1.2-1ubuntu3-tidy1_i386.deb
-rw-r–r– 1 root root 302004 2006-05-24 16:18 php-pear_5.1.2-1ubuntu3-tidy1_all.deb
My! How you’ve grown! I suppose I use dpkg-something to install this. I’ll go for the “all” option.
dpkg --help
dpkg -i|--install <.deb file name> … | -R|--recursive <dir> …
sudo dpkg -i php5_5.1.2-1ubuntu3-tidy1_all.deb
Preparing to replace php5 5.1.2-1ubuntu3 (using php5_5.1.2-1ubuntu3-tidy1_all.deb) …
Unpacking replacement php5 …
dpkg: dependency problems prevent configuration of php5:
php5 depends on libapache2-mod-php5 (>= 5.1.2-1ubuntu3-tidy1) | php5-cgi (>= 5.1.2-1ubuntu3-tidy1); however:
Version of libapache2-mod-php5 on system is 5.1.2-1ubuntu3.
Package php5-cgi is not installed.
php5 depends on php5-common (>= 5.1.2-1ubuntu3-tidy1); however:
Version of php5-common on system is 5.1.2-1ubuntu3.
dpkg: error processing php5 (--install):
dependency problems - leaving unconfigured
Errors were encountered while processing:
php5
More dependency problems, but I have the packages it needs. I install php5-common, libapache2-mod-php5 and php5-cli.
Libertus said: May 24th, 2006 at 16:44
PHP5 With Tidy Support On Ubuntu Linux
Not quite back to where I started and not that much effort later, I have fulfilled my needs.
php -v
Copyright (c) 1997-2006 The PHP Group
Zend Engine v2.1.0, Copyright (c) 1998-2006 Zend Technologies
php tidy_test.php
<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Transitional//EN" "http://www.w3.org/TR/xhtml1/DTD/xhtml1-transitional.dtd"> <html xmlns="http://www.w3.org/1999/xhtml"> <head> <title></title> </head> <body> a html document </body> </html>
Fucking TA-DA!!! Isn’t that a neat and Tidy HTML document? PHP5 with Tidy on Ubuntu Linux. Well, on my workstation at least. I wonder how far I can make this innovation spread?
Libertus said: May 25th, 2006 at 09:19
Dear $name,
I am writing to you for two reasons;
1) I believe you are responsible for maintenance of package “php5”, and
2) you are a competent engineer, gracious individual and rational thinker.
I and others need your help. Please may I draw your attention to the following wishlist items recorded against your package?
The request seems simple to us. Please compile PHP5 --with-tidy.
To better understand the effort required, I learned how to acquire and compile a so-changed source package. That alone took 8 hours and I can’t be sure my PHP is working as well as it did before. My brief investigation convinced me that no change to a source package, however minor, is to be taken lightly.
So in this case I want to make a different request of you. Please may we take this one on faith? The necessary change is a one-liner with clear intent, minimal impact and simple reversibility.
Libertus said: May 25th, 2006 at 09:39
The Secret Heart of BlogRAMthing
Somehow, BlogRAMthing has become the most important program I have devised so far, yet its heart belongs to someone else! Strip away all my code (when I write some) and there, glowing in the centre, you will find Tidy. This has some shameful side-effects.
First, due to the way in which Tidy has to be integrated with PHP, I won’t be able to demonstrate my progress on this blog, which is a shame. Not yet at least.
Second, few other people will be able to benefit from BlogRAMthing. No matter how open I make the source, if your host doesn’t do Tidy, you can only look, which is a shame.
Finally, BlogRAMthing is bleeding-edge, a term virtually synonymous with “doomed to failure”. Talk it up all I like, the likelihood is I’m talking bollocks again and nothing of value will reward my effort, which is a shame.
So fucking what? Shame or not, I sense a rich seam of pure programming pleasure to be extracted from building BlogRAMthing, especially if it doesn’t work. Anyone who enjoys my peculiar brand of real-time documentation, please wish me luck and pray for success, as you may be richly rewarded. Anyone who enjoys my angry rants, please wish me luck and pray for failure, for similar reward.
Libertus said: May 25th, 2006 at 10:07
From Markup Analyser To Markup Recogniser
The basic Tidy function required by BlogRAMthing is markup analysis, the process of separating the tags from the text in a piece of content. Additionally, Tidy can validate the markup and perform limited repairs. BlogRAMthing can only work reliably with valid markup and I’m going to depend on Tidy to make that available, no matter what I type into the computer.
BlogRAMthing explores the content using a library of recognition patterns, noting the presence and location of each as it goes. The result of this exploration is fed back through another set of recognition patterns to adjust the content, depending on state.
I need three separate recognisers, in order of complexity; recipe, ratings and dive log. I am presuming that basic post recognition, for instance the presence of a particular category, has already happened.
Recognising A Recipe
A recipe is recognisable as a goal, the title or name of the recipe, and two primary elements; ingredients and instructions. Process the ingredients according to the instructions and the goal will be achieved.
The ingredients have a definite structure but no particular order. The instructions have a definite order but no particular structure. There’s pretty much only one way to mark them up; an unordered list and an ordered list. If there is a goal, it will be marked up as a heading immediately preceding the lists. Recipes also contain narrative and images.
In a perfect markup world, the elements that comprise a recipe would be classified accordingly to ease recognition but also open a new possibility; nesting. “A Nice, Cold Glass Of Clean, Fresh Water” is a simple recipe. “Steak and Kidney Pie” is a complex recipe that contains two other recipes; the pastry and the filling.
Pleasant Minimum For Humans
Pleasant Minimum For Computers (With A Little Indent For Humans)
<h2>A Nice, Cold Glass Of Water</h2>
<p>You will need:</p>
<ul>
<li>1 clean glass</li>
<li>1 cold fresh water tap</li>
</ul>
<ol>
<li>Turn on the tap until the flow rate fills the glass in three seconds</li>
<li>Fill the glass and discard the contents 10 times</li>
<li>Fill the glass one more time</li>
<li>Turn off the tap</li>
<li>Serve</li>
</ol>
<h2>Steak and Kidney Pie</h2>
<p>Yummy!</p>
<ul>
<li>
<h3>Filling</h3>
<ul>
<li>Steak</li>
<li>Kidney</li>
</ul>
<ol>
<li>Mix steak and kidney, somehow</li>
</ol>
</li>
<li>
<h3>Pastry</h3>
<ul>
<li>Flour</li>
<li>Water</li>
<li>Egg</li>
</ul>
<ol>
<li>Mix flour, water and egg, somehow</li>
</ol>
</li>
</ul>
<ol>
<li>Line baking tin with pastry</li>
<li>Put filling on top of pastry</li>
<li>Put more pastry on top</li>
<li>Put in oven</li>
<li>Wait</li>
<li>Take out of oven</li>
<li>Serve and enjoy</li>
</ol>
<p>Works every time!</p>
The First Recognition Specification
Find the first pair of unordered and ordered list in the content. Transform list items into textarea. Update content from transformed submission.
recipe {
  may have heading then must have ingredients, must have instructions;
  ingredients { want li in ul; make li into textarea; }
  instructions { want li in ol; make li into textarea; }
}
heading { want h[1-6]; make into input; }
Recognising A Language
Three languages in fact. The language of the content (in my case English), the language of the markup (XHTML, of course) and the language of BlogRAMthing. At the time of writing, only two of the languages exist and only two are relevant. Unfortunately, the sets intersect. The BlogRAMthing language doesn’t exist and is relevant. I have to make it real. It is a simple programming language.
A programming language for a machine that also doesn’t exist. Is this the chicken or the egg scenario? Which comes first? The machine or the language that operates it? I don’t have time to mess around with philosophy, so I build both at once. They are, after all, one and the same thing.
What’s That Thing?
BlogRAMthing deals with things, how to recognise things (by what they have and what you want or need them to be) and what to make of things when recognised.
Can I drop the thing thing now, please? A thing is to BlogRAMthing as a coin is to a currency.
Internally, BlogRAMthing maintains a collection of things. BlogRAMthing can tell every thing apart but doesn’t know what any thing means other than what it is, except tags, about which BlogRAMthing need be told nothing. The BlogRAMthing language defines what tags and content mean to the machine.
Simple Language Parser
When you think you hear a word, follow these rules:
The words I need BlogRAMthing to recognise are; the names of tags in the content, , , , , and . The punctuation symbols I need BlogRAMthing to recognise are; , and . Any other word is a thing.
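As a rough illustration of that word-by-word classification, here is a minimal sketch in PHP. The keyword and punctuation sets are assumptions read off the recipe specification earlier in the post, not the author's actual lists, and the names brt_tokenise, BRT_KEYWORDS and BRT_PUNCTUATION are hypothetical.

```php
<?php
// Hypothetical keyword and punctuation sets, guessed from the recipe
// specification earlier in the post. They are assumptions for illustration.
const BRT_KEYWORDS = ['may', 'must', 'have', 'want', 'make', 'into', 'then', 'in'];
const BRT_PUNCTUATION = ['{', '}', ';', ','];

/**
 * Split a specification into tokens and classify each one.
 * $tag_names is the list of tag names found in the content.
 */
function brt_tokenise(string $source, array $tag_names): array
{
    // A token is either one punctuation symbol or a run of characters that
    // are neither whitespace nor punctuation (so "h[1-6]" stays in one piece).
    preg_match_all('/[{};,]|[^\s{};,]+/', $source, $matches);

    $tokens = [];
    foreach ($matches[0] as $word) {
        if (in_array($word, BRT_PUNCTUATION, true)) {
            $type = 'punctuation';
        } elseif (in_array($word, BRT_KEYWORDS, true)) {
            $type = 'keyword';
        } elseif (in_array($word, $tag_names, true)) {
            $type = 'tag';
        } else {
            // Any other word is a thing.
            $type = 'thing';
        }
        $tokens[] = [$type, $word];
    }
    return $tokens;
}

$spec = 'ingredients { want li in ul; make li into textarea; }';
$tokens = brt_tokenise($spec, ['ul', 'ol', 'li']);
foreach ($tokens as [$type, $word]) {
    echo str_pad($type, 12), $word, "\n";
}
```

Checking punctuation first, then keywords, then tag names means the fall-through case is exactly “any other word is a thing”.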
Parsing A Simple Language
I’m keeping the details of my coding work for this initial part private until I’m done. First has to come the framework, which has to look and feel right. I’m pretty sure I’m on to something good when my code makes me laugh.
Libertus said: May 26th, 2006 at 10:08
Analysing Marked-up Text With W3C HTML Tidy And PHP5
Understanding language is not easy. People do it naturally, depending on their linguistic skills to support almost every aspect of life, yet I expect no-one knows quite how they do it.
Programming languages, HTML included, are designed to be easily understood, especially by computers. From the beginning of time, computers have been used by a relatively small few to do one thing only: make new languages. Consequently, there are many ancient tools such as lex, yacc and bison and protocols such as SGML to simplify and, to some extent, formalise and standardise the process of language construction.
Understanding language is not easy, even when the language is formal and standard. A language must be learned before it can be understood. People take years to understand a language and each individual has the same learning curve every time. Computers, fortunately, can become masterfully fluent in any programming language instantaneously, so long as one person has taught one computer how to understand it. The process of teaching computers to understand a language can take many people many years, but the reward is then that everyone else can benefit with minimal effort.
With minimal effort I taught my computer how to understand HTML far better than I could write my own code to do so. Tidy parses HTML, cleans it, repairs it then presents the verified document both in textual form and as a document object model, kinda like a map of the page and its content. I need to find a way to make that map easier to search for specific things, such as tags and patterns of text.
Tidy presents a map of the structure of the document whereas I also need a simple list of what tags and text are where. PHP can do some pretty funky stuff on “arrays” and by using reference variables I can make new maps that point to Tidy’s map rather than make my own copy.
I am very interested in list items within lists. I am keen to know which tags have particular identifiers and classes.
Navigating A Document With An Object Model
As well as text and tags I have data objects and a structure that interconnects those objects. They’re both the same thing, a clean and tidy HTML document, just different ways of looking at it. The plain text of the HTML document can only be navigated in one dimension: from beginning to end. Tidy analyses this long stream of text for the markup tags and creates a structure that has many beginnings and ends, according to the nesting of the tags. Tidy also separates the document into its structural elements and content elements. The document has structural elements that I’m not interested in, called the root and the head. The part I want is called the body, which contains the stuff people can see in their browser.
Tidy calls each point on its map of the document a tidyNode. A tidyNode can represent a structural element, a tag, a piece of text and many other things that may appear in HTML documents. The body of the document is a node and I ask for it like so:
$body = $tidy->body();
From the body I can go everywhere else in the content, up and down the document via siblings and into and out of the document via children and parents.
The children of body. All siblings have the same parent.
body
–h6
–p
–ul
–ol
–h6
–p
–ul
–ol
–p
The next generation
body
–h6
–p
–ul
—-li
—-li
–ol
—-li
—-li
–h6
–p
–ul
—-li
—-li
–ol
—-li
—-li
—-li
—-li
—-li
—-li
—-li
–p
Uhh… family trees are so tedious. However, this one has a little surprise. Note that with two generations revealed, the first and second ul look identical. They’re not, as the second hides a dark little secret if you look close enough.
–ul
—-li
——h7
——ul
——–li
——–li
——ol
——–li
—-li
——h7
——ul
——–li
——–li
——–li
——ol
——–li
–ol
Those sneaky list items contain their own little mini-documents that look just like their parents. There is no way the parent list can be made into a textarea (as per my recipe recognition specification), but the child lists sure can be, as they are finally simple lines of text.
Mapping The Document Tree With Tidy
My rationale for using Tidy is that writing my own code to do what Tidy has done could take ages, maybe never. I wrote the above family trees by hand. Now I’ll use Tidy to produce them for me.
tidy_family_tree( $tidy->body(), 1 )
tidy_family_tree( $tidy->body(), 2 )
My PHP Tidy family tree isn’t the same as my hand-drawn one. At the second generation, there are nodes without names. These are the nodes for the text contained by the parent tag. If a node has only one child and that child has no name, no need to bother following that route as it is a guaranteed dead end.
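The printing version of tidy_family_tree() is not shown in the post. A minimal sketch of how such a function might look, including the dead-end rule just described, follows. SketchNode is a hypothetical stand-in that mirrors only the tidyNode members used here (name, child and hasChildren()), so the example runs without the tidy extension; sketch_family_tree is likewise an invented name.

```php
<?php
// Hypothetical stand-in for tidyNode so the sketch runs without the tidy
// extension. It mirrors only: name, child and hasChildren().
class SketchNode
{
    public $name;
    public $child;

    public function __construct($name, array $children = [])
    {
        $this->name = $name;
        $this->child = $children ?: null;   // tidyNode uses null for "no children"
    }

    public function hasChildren()
    {
        return !empty($this->child);
    }
}

// Return the tree as indented text, at most $generations levels below $node.
function sketch_family_tree($node, $generations, $indent = 0)
{
    $out = str_repeat('--', $indent) . $node->name . "\n";

    if (!$node->hasChildren() || $generations <= 0) {
        return $out;
    }
    // A sole nameless child is the node's own text: a guaranteed dead end.
    if (count($node->child) === 1 && empty($node->child[0]->name)) {
        return $out;
    }
    foreach ($node->child as $child) {
        $out .= sketch_family_tree($child, $generations - 1, $indent + 1);
    }
    return $out;
}

// A text node is modelled as a node with an empty name.
$text = function () { return new SketchNode(''); };
$body = new SketchNode('body', [
    new SketchNode('h6', [$text()]),
    new SketchNode('ul', [
        new SketchNode('li', [$text()]),
        new SketchNode('li', [$text()]),
    ]),
]);

echo sketch_family_tree($body, 1);
echo sketch_family_tree($body, 2);
```

Because the function only touches name, child and hasChildren(), the same code should accept a real tidyNode from $tidy->body(), though that is untested here.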
Ignore Nameless Children
tidy_family_tree( $tidy->body(), 5 )
Almost right, but what’s going on in the deep dark lists? Why are there still blanks? What happened to my <h7>?
Lucky For Some, Number Seven
I refer to the HTML specification to discover that only H1 through H6 are recognised. Oops! My markup is incorrect. Well done, Tidy! I’ll use H2 and H3 instead of H6 and the invalid, although perfectly logical, H7.
Libertus said: May 27th, 2006 at 14:47
Tidy, Topography And Topiary
So far, I have used Tidy to convert an HTML document into a hierarchical structure and performed some basic analysis, just to get a feel for how Tidy sees things. The next step is to produce a topographical map of the document which allows each level (or generation) to be isolated and scanned. The search phrase is “li in ul” which I will satisfy by finding all the list items at a particular level and discarding any that do not have a ul as a parent. Tidy does not bless each node on the map with knowledge of its parent, only its children. The topographical map must also allow me to look up the ancestry as well as down.
Topography
Topography means “lay of the land”. In this case, I want to look at the “lay of the document” which is a bit like the family tree rotated with the body at the lowest level and each generation of descendent tags one level higher. Further, I want to be able to find the tags by name, know the parent of every tag and the location of every piece of plain text in the document. That may sound like a tall order but in reality is a fairly simple adaptation of the tidy_family_tree() function that exploits PHP5 arrays and references to the max.
function tidy_family_tree( $node, $generations, &$levels, $indent=0, $parent=null ) {
    // the combination of line number and column for each node is unique
    $node_key = $node->line . '.' . $node->column;
    // add reference to node in tag list for current level
    $levels['tag'][$indent][$node->name][] =& $node;
    // if node has parent, store reference against this node's key
    if( $parent )
        $levels['parent'][$node_key] =& $parent;
    // if node has children and the search depth limit has not been reached
    if( $node->hasChildren() and $generations-- ) {
        ++$indent;
        if( count($node->child)==1 and empty($node->child[0]->name) )
            // if node only has text child, add reference to text list for current level
            $levels['text'][$indent][] =& $node;
        else
            // recurse over all children
            foreach( $node->child as $child_node )
                tidy_family_tree( $child_node, $generations, $levels, $indent, $node );
    }
}
Whilst this works, it is syntactically clumsy. I suspect the PHP object-oriented syntax is more suited to this kind of task. Different code for the same functionality. In modern parlance this is known as “refactoring”. I prefer to call it what it is: topiary.
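To make the “li in ul” search concrete, here is a rough sketch of how the map might be queried. The mapping function is the one above with only its layout changed. MapNode is a hypothetical stand-in for tidyNode (name, line, column, child, hasChildren()) so the example runs without the tidy extension, and li_in_ul is an invented helper name.

```php
<?php
// Hypothetical stand-in for tidyNode, carrying only what the map needs.
class MapNode
{
    public $name, $line, $column, $child;

    public function __construct($name, $line, $column, array $children = [])
    {
        $this->name = $name;
        $this->line = $line;
        $this->column = $column;
        $this->child = $children ?: null;
    }

    public function hasChildren()
    {
        return !empty($this->child);
    }
}

// The mapping function from the post, layout changed only.
function tidy_family_tree($node, $generations, &$levels, $indent = 0, $parent = null)
{
    $node_key = $node->line . '.' . $node->column;
    $levels['tag'][$indent][$node->name][] =& $node;
    if ($parent) {
        $levels['parent'][$node_key] =& $parent;
    }
    if ($node->hasChildren() and $generations--) {
        ++$indent;
        if (count($node->child) == 1 and empty($node->child[0]->name)) {
            $levels['text'][$indent][] =& $node;
        } else {
            foreach ($node->child as $child_node) {
                tidy_family_tree($child_node, $generations, $levels, $indent, $node);
            }
        }
    }
}

// All list items at $level whose recorded parent is a ul.
function li_in_ul(array $levels, $level)
{
    $found = [];
    foreach ($levels['tag'][$level]['li'] ?? [] as $li) {
        $parent = $levels['parent'][$li->line . '.' . $li->column] ?? null;
        if ($parent !== null && $parent->name === 'ul') {
            $found[] = $li;
        }
    }
    return $found;
}

$body = new MapNode('body', 1, 1, [
    new MapNode('ul', 2, 1, [new MapNode('li', 3, 1), new MapNode('li', 4, 1)]),
    new MapNode('ol', 5, 1, [new MapNode('li', 6, 1)]),
]);

$levels = [];
tidy_family_tree($body, -1, $levels);   // -1: no depth limit
$matches = li_in_ul($levels, 2);        // list items sit two levels above body
foreach ($matches as $li) {
    echo $li->name, ' at ', $li->line, '.', $li->column, "\n";
}
```

The parent lookup is what lets the ol’s list item be discarded even though it shares a level with the others.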
Topiary
$tidy = new tidy;
$tidy->parseString( $html_document );
$tidy->cleanRepair();
$topgraphic_map = new TidyTopography( $tidy->body() );

class TidyTopography {
    var $tags;
    var $text;
    var $parents;
    var $max_level;

    function TidyTopography( $root_node, $max_depth=-1 ) {
        $this->map_node( $root_node, $max_depth );
    }

    function map_node( $node, $max_depth, $level=0, $parent=null ) {
        if( $level > $max_level )
            $max_level = $level;
        // add reference to node in tag list for current level
        $this->$tags[$level][$node->name][] =& $node;
        // if node has parent, store reference against this node's key
        if( $parent ) {
            // the combination of line number and column for each node is unique
            $node_key = $node->line . '.' . $node->column;
            $this->$parents[$node_key] =& $parent;
        }
        // if node has children and the search depth limit has not been reached
        if( $node->hasChildren() and $max_depth-- ) {
            if( count($node->child)==1 and empty($node->child[0]->name) )
                // if node only has text child, add reference to text list for current level
                $this->$text[$level][] =& $node;
            else {
                ++$level;
                // recurse over all children
                foreach( $node->child as $child_node )
                    $this->map_node( $child_node, $max_depth, $level, $node );
            }
        }
    }
}
Oh Dear! A Snip Too Far, It Seems
PHP object-oriented syntax may be marginally more suitable but the PHP5 parser is not!
$this->$tags[$level][$node->name][] =& $node;
I find PHP bug #17290, sigh, and go back to the procedural syntax. At least it works.
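For anyone who wants the object-oriented shape anyway, here is a minimal sketch with the fixes listed in the follow-up comment applied: properties written as $this->tags rather than $this->$tags, and max_level actually stored on the object. The class name TidyTopographySketch and the stand-in TopoNode are hypothetical, and the sketch is written for current PHP rather than the PHP5 of the post. Treat it as one possible reading, not as the code that eventually shipped.

```php
<?php
// Hypothetical stand-in for tidyNode (name, line, column, child, hasChildren()).
class TopoNode
{
    public $name, $line, $column, $child;

    public function __construct($name, $line, $column, array $children = [])
    {
        $this->name = $name;
        $this->line = $line;
        $this->column = $column;
        $this->child = $children ?: null;
    }

    public function hasChildren()
    {
        return !empty($this->child);
    }
}

class TidyTopographySketch
{
    public $tags = [];
    public $text = [];
    public $parents = [];
    public $max_level = 0;

    public function __construct($root_node, $max_depth = -1)
    {
        $this->map_node($root_node, $max_depth);
    }

    private function map_node($node, $max_depth, $level = 0, $parent = null)
    {
        if ($level > $this->max_level) {
            $this->max_level = $level;
        }
        // Objects are handles, so plain assignment stores the same node; no =& needed.
        $this->tags[$level][$node->name][] = $node;
        if ($parent) {
            $this->parents[$node->line . '.' . $node->column] = $parent;
        }
        // A negative $max_depth never reaches zero, so it means "no limit".
        if ($node->hasChildren() && $max_depth !== 0) {
            if (count($node->child) === 1 && empty($node->child[0]->name)) {
                // Node holds only text: record it, do not descend.
                $this->text[$level][] = $node;
            } else {
                foreach ($node->child as $child_node) {
                    $this->map_node($child_node, $max_depth - 1, $level + 1, $node);
                }
            }
        }
    }
}

$body = new TopoNode('body', 1, 1, [
    new TopoNode('ul', 2, 1, [
        new TopoNode('li', 3, 1, [new TopoNode('', 3, 5)]),
        new TopoNode('li', 4, 1, [new TopoNode('', 4, 5)]),
    ]),
    new TopoNode('ol', 5, 1, [
        new TopoNode('li', 6, 1, [new TopoNode('', 6, 5)]),
    ]),
]);

$map = new TidyTopographySketch($body);
echo 'deepest level: ', $map->max_level, "\n";
echo 'parent of 3.1: ', $map->parents['3.1']->name, "\n";
```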
Libertus said: January 6th, 2007 at 14:44
Oh dear, that code wasn’t so good. Multiple syntax errors and $this->max_level never being set! The code snippet that caused the parsing error should be:
$this->tags[$level][$node->name][] =& $node;
Libertus said: May 27th, 2006 at 16:52
The Magic Of Understanding
Somewhere during this project I must solve the problem of translating this:
into this:
I took a stab at building a language parser by hand and quickly decided not to bother. Parsing language is hard and definitely a science. Being a science, there are tools available to do the job for me. The output is what interests me, not the input, so for now I can work with hand-crafted test data.
Three Things
For input I have 1) a document with object model and topographical map and 2) a library of things to be recognised. For output I want 3) a list of recognised things with document co-ordinates.
Recognition begins at the deepest level of the document. Each recognition pattern is processed against the level, again deepest first. In the case of recipe, the ingredients and instructions must be satisfied and span two levels. The heading therefore may only match at the upper level.
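A rough sketch of deepest-first recipe recognition, under stated assumptions: a recipe is taken to be a ul immediately followed by an ol among siblings, the optional heading is the nearest preceding h1 to h6 sibling, and document co-ordinates are the line.column keys used for the map. RecipeNode, node_key and recognise_recipes are hypothetical names, and the hand-built tree stands in for Tidy’s output.

```php
<?php
// Hypothetical stand-in for tidyNode (name, line, column, child, hasChildren()).
class RecipeNode
{
    public $name, $line, $column, $child;

    public function __construct($name, $line, $column, array $children = [])
    {
        $this->name = $name;
        $this->line = $line;
        $this->column = $column;
        $this->child = $children ?: null;
    }

    public function hasChildren()
    {
        return !empty($this->child);
    }
}

function node_key($node)
{
    return $node->line . '.' . $node->column;
}

// Collect recognised recipes below $node, deepest matches first.
function recognise_recipes($node, array &$found = [])
{
    if (!$node->hasChildren()) {
        return $found;
    }
    // Deepest first: descend into every child before scanning this level.
    foreach ($node->child as $child) {
        recognise_recipes($child, $found);
    }
    $heading = null;
    $siblings = $node->child;
    foreach ($siblings as $i => $sibling) {
        if (preg_match('/^h[1-6]$/', $sibling->name)) {
            $heading = $sibling;             // candidate goal for the next lists
        } elseif ($sibling->name === 'ul'
            && isset($siblings[$i + 1])
            && $siblings[$i + 1]->name === 'ol') {
            $found[] = [
                'heading'      => $heading ? node_key($heading) : null,
                'ingredients'  => node_key($sibling),
                'instructions' => node_key($siblings[$i + 1]),
            ];
            $heading = null;                 // a heading serves one recipe only
        }
    }
    return $found;
}

// Pie recipe with one nested recipe inside its first ingredient.
$body = new RecipeNode('body', 1, 0, [
    new RecipeNode('h2', 1, 1),
    new RecipeNode('p', 2, 1),
    new RecipeNode('ul', 3, 1, [
        new RecipeNode('li', 4, 1, [
            new RecipeNode('h3', 5, 1),
            new RecipeNode('ul', 6, 1, [new RecipeNode('li', 6, 5)]),
            new RecipeNode('ol', 7, 1, [new RecipeNode('li', 7, 5)]),
        ]),
    ]),
    new RecipeNode('ol', 8, 1, [new RecipeNode('li', 8, 5)]),
]);

$found = recognise_recipes($body);
foreach ($found as $recipe) {
    echo json_encode($recipe), "\n";
}
```

The nested filling recipe is reported before the pie that contains it, which matches the deepest-first order described above.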