use.perl journals and full text feeds

One of the sites I make a point of reading regularly is use.perl, and in particular, the user journals / blogs. They don't take too long to read, and there's normally a couple of posts a day that teach me something I didn't know about Perl, or that highlight a new module that's doing something useful.

But there's a problem.



The feeds for the use.perl journals don't include the full text of the entry. This is a pain -- most entries aren't too long, and I'd much rather read them all in my aggregator of choice than have to click on lots of different links to see each entry in full.

Plagger to the rescue. It can extract the links from the list of new journals, follow the links to find the individual journal entries, and then extract the content from those pages to build a new feed that contains the full text of the entries.

To do this with Plagger requires two config files.

First, config.yaml. You can put this anywhere you like.

plugins:
  - module: Subscription::Config
    config:
      feed:
        - url: http://use.perl.org/search.pl?op=journals
          meta:
            follow_link: /journal/\d+
  - module: CustomFeed::Simple
  - module: Publish::Feed
    config:
      dir: /home/nik/public_html/
      format: RSS
      filename: use.perl.journals.rss
  - module: Filter::EntryFullText


This instructs Plagger to build a feed by starting at the journal page and following every link that matches the regexp /journal/\d+. This feed is published at use.perl.journals.rss, and takes a trip through Plagger::Plugin::Filter::EntryFullText to extract the content for each entry.
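
To make the link-following step concrete, here is a minimal Perl sketch of what the follow_link pattern selects. It is not Plagger's own code, only an illustration of the filtering idea, and the hrefs in @links are made up for the example.

```perl
use strict;
use warnings;

# The same pattern as follow_link in config.yaml.
my $follow_link = qr{/journal/\d+};

# Hypothetical hrefs of the kind a journal listing page might contain.
my @links = (
    '/~alice/journal/31790',
    '/~bob/journal/',
    '/search.pl?op=journals',
    '/~carol/journal/31802',
);

# Keep only the links that point at an individual journal entry.
my @entries = grep { $_ =~ $follow_link } @links;

print "$_\n" for @entries;
```

Only the two links that end in an entry number survive the filter; the bare /journal/ link and the search page are skipped.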

Now you need a config file for Plagger::Plugin::Filter::EntryFullText. This needs to be in the correct sub-directory of your Plagger assets path. On my system that's /usr/local/share/Plagger/assets/plugins/Filter-EntryFullText. If you take a look in that directory (or its equivalent on your system) you'll see tens of small configuration files.

There's one configuration file for every site that Filter::EntryFullText knows about.

I created a new file in this directory called use.perl.journal.yaml, with the following contents.

handle: use\.perl\.org/.*?/journal/\d+
extract_xpath:
  title: //div[@id="journalslashdot"]/div[@class="title"]/h3/text()
  body: //div[@class="intro"]
  author: /html/head/link[@rel="author"]/@title
  day: //div[@id="journalslashdot"]/div[@class="journaldate"]/text()
  time: //div[@id="journalslashdot"]/div[@class="details"]/text()
extract_after_hook: |
  my %months = (January => 1, February => 2, March => 3,
                April => 4, May => 5, June => 6,
                July => 7, August => 8, September => 9,
                October => 10, November => 11, December => 12);
  my($m, $d, $y) = $data->{day} =~ /([a-z]+)\s+(\d+),\s(\d{4})/i;
  $m = $months{$m};
  my($h, $M) = $data->{time} =~ /(\d+):(\d+)/;
  $h = 0 if $h == 12;
  $h += 12 if $data->{time} =~ /PM/;
  $data->{date} = "$y-$m-$d $h:$M";


There are a few things going on here.

The handle directive contains a regexp -- any pages that match this regexp will have their content extracted using this configuration. Because it's a regexp, metacharacters like "." need to be escaped.
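
A quick way to see why the escaping matters is to try the pattern as a plain Perl regexp. This is only a sketch: the handled() helper and both URLs are hypothetical, and the "loose" pattern is the handle regexp with the dots left unescaped.

```perl
use strict;
use warnings;

# The handle pattern with the dots escaped, and a loose variant without.
my $handle = qr{use\.perl\.org/.*?/journal/\d+};
my $loose  = qr{use.perl.org/.*?/journal/\d+};

# Hypothetical helper: returns 1 if the URL matches the pattern, else 0.
sub handled {
    my ($re, $url) = @_;
    return $url =~ $re ? 1 : 0;
}

my $real  = 'http://use.perl.org/~someuser/journal/12345';
my $other = 'http://usexperl.org/~someuser/journal/12345';

printf "%-6s %s => %d\n", 'strict', $real,  handled($handle, $real);
printf "%-6s %s => %d\n", 'strict', $other, handled($handle, $other);
printf "%-6s %s => %d\n", 'loose',  $other, handled($loose,  $other);
```

With the dots escaped, only the genuine use.perl.org URL matches. The loose pattern also accepts the second URL, because an unescaped "." matches any character.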

The extract_xpath section contains directives that tell the plugin how to retrieve certain key bits of data from the page using XPath expressions. title, body, and author are all special keys that correspond to sections of the generated RSS feed.

There's a problem with the use.perl journal pages because they don't contain the date and time the entry was published in a trivially parseable format -- the information's human readable, and split over a couple of different elements.

So here the day and time of the journal entry are extracted into two separate variables. But these need to be post-processed to get a timestamp for the feed.

This is where extract_after_hook comes in. This is a block of Perl code that's run after the data has been extracted. In this code $data is a hash ref, with keys that match those used under the extract_xpath section.

By this point, $data->{day} looks something like Tuesday December 05, 2006, and $data->{time} might be 08:36 AM. The month name needs to be converted to a number, hence the %months declaration, and the time needs to be normalised to a 24-hour clock, so the hour and minute values are extracted. With use.perl the hour "12" might be midnight or midday, and all afternoon hours are indicated with "PM" in the string. So the hour is reset to 0 if it's 12, and then 12 is added to all times in the afternoon. That turns 12:15 AM into hour 0 and 12:15 PM into hour 12.
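
The conversion is easy to try outside Plagger. The sketch below wraps the same logic as the hook in a standalone function; journal_date() is a hypothetical name used only for this example, and the sample strings follow the formats described above.

```perl
use strict;
use warnings;

# Month names as they appear on the journal pages, mapped to numbers.
my %months = (January => 1, February => 2, March     => 3,
              April   => 4, May      => 5, June      => 6,
              July    => 7, August   => 8, September => 9,
              October => 10, November => 11, December => 12);

# Hypothetical helper: same steps as the hook, taking the two scraped
# strings and returning the combined date, or nothing if parsing fails.
sub journal_date {
    my ($day, $time) = @_;
    my ($m, $d, $y) = $day  =~ /([a-z]+)\s+(\d+),\s(\d{4})/i or return;
    my ($h, $M)     = $time =~ /(\d+):(\d+)/                 or return;
    $m = $months{$m};
    $h = 0   if $h == 12;         # 12 AM is hour 0; 12 PM is fixed up below
    $h += 12 if $time =~ /PM/;    # afternoon hours move to the 24-hour clock
    return "$y-$m-$d $h:$M";
}

for my $time ('08:36 AM', '12:15 AM', '12:15 PM', '03:05 PM') {
    print "$time => ", journal_date('Tuesday December 05, 2006', $time), "\n";
}
```

Note that the hour is only rewritten when it is 12 or in the afternoon, so morning times keep whatever leading zero the page supplied.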

Finally, these values are brought together and assigned to $data->{date}, which is another special key for Plagger.

Here's the full RSS feed for use.perl journals. Feel free to use this in your aggregator; it is currently set to auto-update every 30 minutes.

10 comments:

  1. Thanks!

    When I subscribe in Google Reader though, I'm seeing lots of "&apos;" (in case the comments system mangles that, it should be ampserand, a, p, o, s, semicolon) instead of actual apostrophes.

  2. Looks like I can't spell ampersand either.

    Let's try again: I'm seeing &apos; instead of '.

  3. This is something in Plagger. More specifically, in Plagger::Util, around line 176, the apostrophe is listed as one of the special characters that is automatically encoded to the entity.

    I'm not sure why it's there. Possibly as defensive code against SQL injection attacks or similar. Googling for "apos" is instructive, and suggests that the numeric entity should really be used instead. I'll bring this up with Miyagawa-san.

  4. This is marvellous; well done, nik! I'm also experiencing ill-escaped character codes, so I knocked together this shonky GreaseMonkey script to straighten'em out:

    // ==UserScript==
    // @name Google Reader use Perl subscription fixer.
    // @namespace http://danvsdan.com
    // @description Currently the escape sequences are little broken, this fixes them.
    // @include http://www.google.com/reader/view/*
    // ==/UserScript==

    function fix_escapes(evt) {
        // Possibly cache this if possible.
        if (!document.getElementById('chrome-stream-title').innerHTML.match(/use Perl - Journals/))
            return;
        var node = evt.relatedNode;
        // Undo the double encoding: turn "&amp;apos;" back into "&apos;".
        if (node.innerHTML.match(/&amp;([a-z]+;)/))
            node.innerHTML = node.innerHTML.replace(/&amp;([a-z]+;)/g, "&$1");
    }

    document.getElementById('chrome').addEventListener('DOMNodeInserted', fix_escapes, false);

  5. Nothing wrong with encoding it once, an HTML "&apos;" and an apostrophe are the same. But it looks like it's getting encoded twice - view source on the RSS shows &amp;apos;. But I don't know how all this stuff is dealt with in RSS.

  6. Ah, yes, it does look like it's being double-encoded. This might be due to the encoding by Plagger, and then a subsequent set of encoding by XML::RSS (which is eventually called to write the .rss file, called by XML::Feed).

  7. OK here is my response:

    http://groups.google.com/group/plagger-dev/browse_thread/thread/edbc2c7e4a6e5793/9199bb028256c64e#9199bb028256c64e

    Summary: double encoding apos here is syntactically correct because we're embedding HTML into XML, if not the most desired.

    Let me say this is not a bug in Plagger, but a problem with the RSS format itself, since RSS elements don't have a good way to express whether their content is plaintext or HTML markup (while Atom feeds do). And as far as I tested, the feed worked fine with Thunderbird and Bloglines, so I suspect it's a Google Reader bug.

  8. I don't see this happening on other feeds in Google Reader (although that doesn't prove it's not a Reader bug of course).

  9. Plagger configuration that shows how to create new feed metadata...

    Plagger provides great flexibility when processing feeds. This example shows how to create a new feed by scraping an existing web site, and post-processing the scraped HTML to upgrade content in the HTML in to metadata in the feed....
