February 15, 2005

Mmmm, recursion

I spent a very fun Saturday night (turns out I am a ginormous geek) rewriting my Perl script that builds index pages for folders of images. Mmm. Now it works about twenty times faster. Basic structure:

indexImages(a folder)
  if (no sub folders)
    create page for images in this folder
    return link to this image page
  else
    start creating new index page
    create page for images in this folder (if any)
    add link to local images page to new index page (if required)
    for each subfolder
      add link to indexImages(subfolder) to new index page
    return a link to this index page

Done!

That only took about five hours on Saturday night (and I started at 10.30 *facepalm*) and two hours on Sunday morning. The script also does some clever checking to see if there are existing files and overwrites them if required. It's neat. Just the thing to help read all those scanlations that I download.
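
For anyone who wants to see the recursion spelled out, here is a rough sketch of the outline above. It is in Python rather than the Perl of the real script, and every name in it (index_images, write_image_page, the images.html and index.html file names, the list of image extensions) is made up for illustration. It also skips the checking for existing files and simply overwrites whatever is there.

```python
import os

IMAGE_EXTS = {".jpg", ".jpeg", ".png", ".gif"}  # assumed set of image types


def write_image_page(folder, images):
    """Write images.html listing every image in this folder; return its path."""
    page = os.path.join(folder, "images.html")
    with open(page, "w", encoding="utf-8") as out:
        for name in images:
            out.write('<img src="%s"><br>\n' % name)
    return page


def index_images(folder):
    """Build pages for folder and return the page a parent index should link to."""
    entries = sorted(os.listdir(folder))
    subfolders = [e for e in entries if os.path.isdir(os.path.join(folder, e))]
    images = [e for e in entries
              if os.path.splitext(e)[1].lower() in IMAGE_EXTS]

    if not subfolders:
        # Leaf folder: the image page itself is what the parent links to.
        return write_image_page(folder, images)

    links = []
    if images:
        # Local images get their own page, linked from this folder's index.
        write_image_page(folder, images)
        links.append(("images.html", "Images in this folder"))
    for sub in subfolders:
        # The recursive call returns whichever page represents the subfolder.
        target = index_images(os.path.join(folder, sub))
        links.append((os.path.relpath(target, folder), sub))

    index = os.path.join(folder, "index.html")
    with open(index, "w", encoding="utf-8") as out:
        for href, label in links:
            out.write('<a href="%s">%s</a><br>\n' % (href, label))
    return index
```

Calling index_images on the top folder returns the path of the page to open first: the top-level index.html, or images.html if the folder has no subfolders at all.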

Then tonight I spent an hour or so fixing up my stripBadHtml script. I download a lot of HTML pages to read later, and usually other people have god-awful tables and fonts and background colours, so this script rips through a folder of files and processes each one against a file of regular expression rules. It's recursive too, natch, though it really is just a loop. It also does some cool things around detecting whether a file requires a backup, and accepts a switch to force an update of all files (if I change the HTML rules, for instance).

Sample regular expressions from the rules:
<h3.*?>::
</h3>::
&\#133;::...
\.{4,}::...
†::
&nbsp;::
{2,}::
Youji::Yohji

Which shows that I'm fussy about illegal characters (like the dagger and character 133), grammar (ellipses only have three dots, dammit), and spelling of characters' names. I'm actually very fussy: I've also got a bunch of common spelling mistakes in there. And lots more formatting stripping. There are almost eighty lines. Ahem.
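
A minimal sketch of how rules like these could be applied, again in Python rather than Perl. It assumes each line of the rules file is a pattern and a replacement separated by "::", which is how the samples above read; parse_rules and strip_bad_html are hypothetical names, not the functions in the real script, and only a handful of the sample rules are used.

```python
import re


def parse_rules(lines):
    """Turn 'pattern::replacement' lines into (compiled regex, replacement) pairs."""
    rules = []
    for line in lines:
        line = line.rstrip("\n")
        if not line:
            continue  # skip blank lines in the rules file
        # Assumed format: everything before the first '::' is the pattern.
        pattern, _, replacement = line.partition("::")
        rules.append((re.compile(pattern), replacement))
    return rules


def strip_bad_html(text, rules):
    """Apply every rule to the text, in the order given."""
    for regex, replacement in rules:
        # A callable replacement keeps the replacement text literal.
        text = regex.sub(lambda m, r=replacement: r, text)
    return text


rules = parse_rules([
    r"<h3.*?>::",
    r"</h3>::",
    r"&\#133;::...",
    r"\.{4,}::...",
    r"Youji::Yohji",
])
cleaned = strip_bad_html(
    "<h3 class='x'>Youji</h3> waited&#133; and waited......", rules)
print(cleaned)  # Yohji waited... and waited...
```

Order matters here: the entity is turned into three dots before the "four or more dots" rule runs, so anything the first rule produces still gets tidied by the second.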

Emma suggested that the Perl programming might be a Microsoft overreaction, and she's probably not wrong. Still: fun!
