Tuesday 9 October 2007

Production release

EPrints for Windows has now been designated a production release. To install, see the instructions on the EPrints Wiki.

Friday 28 September 2007

New release

A new release that includes the indexer, plus image and document conversion, is in the pipeline. It's still awaiting editor approval, but will eventually be found at http://files.eprints.org/300/.

Thursday 13 September 2007

HTML, Word, and PowerPoint

HTML conversion is working using HTML::Parser. I might need to add some code to detect character encodings; we'll see after more testing.
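For what it's worth, the kind of encoding detection I have in mind could look roughly like the sketch below. It's in Python rather than the Perl the plugin is written in, and it isn't the EPrints code: the function name guess_html_encoding, the 2048-byte sniffing window, and the ISO-8859-1 fallback are all assumptions made for illustration. It checks for a byte order mark first, then for a charset declared in a meta tag.

```python
import codecs
import re

# Byte order marks to look for, paired with the encoding each implies.
_BOMS = [
    (codecs.BOM_UTF8, "utf-8"),
    (codecs.BOM_UTF16_LE, "utf-16-le"),
    (codecs.BOM_UTF16_BE, "utf-16-be"),
]

# Matches both <meta charset="..."> and the older
# <meta http-equiv="Content-Type" content="text/html; charset=..."> form.
_META = re.compile(
    rb"""<meta[^>]+charset\s*=\s*["']?([A-Za-z0-9._:-]+)""",
    re.IGNORECASE,
)


def guess_html_encoding(data, default="iso-8859-1"):
    """Guess the encoding of raw HTML bytes (hypothetical helper)."""
    for bom, name in _BOMS:
        if data.startswith(bom):
            return name
    # Assumption: a declaration, if present, sits near the top of the file.
    match = _META.search(data[:2048])
    if match:
        declared = match.group(1).decode("ascii")
        try:
            # Normalise to the codec name Python knows it by.
            return codecs.lookup(declared).name
        except LookupError:
            pass  # Unknown charset label: fall through to the default.
    return default
```

A document that declares nothing and has no byte order mark just gets the default, which is where the extra testing would have to tell me whether that's good enough.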

As for Word, there's been a stroke of luck: I found a Windows binary of catdoc, which isn't officially available for Windows. Even better, there's also catppt, so maybe PowerPoint can be converted after all. I'm currently integrating both tools into EPrints plugins.

Update: The EPrints plugins for HTML, Word, and PowerPoint are written and seem to be working.

Monday 6 August 2007

Converting PostScript with ImageMagick

I've had a dig through the tortuous source code of ImageMagick and it seems to take all appropriate precautions when handling PostScript and PDF. I suppose I'll trust it to make thumbnails of these formats. The code is very confusing for someone not intimately familiar with the internals, so I wouldn't like to bet that there aren't any race conditions, but surely some security-minded people have picked over it already.

I still need custom code to convert to plain text, because the batch files GhostScript provides don't check filenames with the level of paranoia a Web server needs. That code is finished and working already, though. Next stop, HTML; then another attempt to convert Word.

Update: I spoke too soon. There's some problem with creating temporary files when you're using the API and not the command line. Working on it.

Update of the update: A different API call works. PostScript and PDFs can now both be previewed by EPrints on Windows.

Wednesday 1 August 2007

Converting PostScript

I'm halfway there with PostScript conversion. I think it can convert to plain text, or at least as reliably as is possible. I'm tempted to let ImageMagick handle converting to images, and hope it handles filenames correctly.

The perennial problem with PostScript is, of course, that it's a Turing-complete programming language, so any attempt to convert it to anything else runs straight into the halting problem: there's no general way to know in advance whether processing a given document will ever finish, let alone to guarantee a successful conversion. It doesn't matter whether a malicious document is indexed correctly, so the converter only needs to extract text and images from safe documents, but it mustn't be allowed to compromise security or cause a denial of service either.
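Since you can't predict whether a PostScript program will halt, the usual way out is to stop waiting after a fixed time. Here is a minimal sketch of that idea in Python (not the Perl used in EPrints, and not the actual plugin code). The function name run_with_timeout and the time limits are assumptions for illustration; in practice the argument list would be whatever converter command is in use.

```python
import subprocess
import sys


def run_with_timeout(argv, seconds):
    """Run a converter command, giving up after `seconds` of wall-clock time.

    Returns (ok, output). ok is False if the command failed or had to be
    killed, in which case the document is simply not converted.
    """
    try:
        # subprocess.run kills the child process when the timeout expires.
        result = subprocess.run(argv, capture_output=True, timeout=seconds)
    except subprocess.TimeoutExpired:
        return False, b""
    return result.returncode == 0, result.stdout


if __name__ == "__main__":
    # Stand-in for a document that never finishes rendering.
    ok, _ = run_with_timeout([sys.executable, "-c", "while True: pass"], 1)
    print("converted" if ok else "gave up")
```

This only bounds running time. A document that eats memory or disk space would need separate limits, which this sketch doesn't attempt.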

Monday 30 July 2007

ImageMagick memory problems

I noticed my Windows development machine had slowed to a crawl recently, and I assumed it was something to do with the new image conversion plugins that are in development. Turns out it was, but not in an obvious way. Because the plugin requires ImageMagick, any EPrints process will load ImageMagick; this includes the scheduled task that runs the indexer, which I had installed and promptly forgotten about.

It looks as though, because I had installed ImageMagick after I created this task, the task's environment didn't include the paths the Perl module needs to load correctly. It could find the Perl component, but not the XS part. Because of the way the module autoloads unknown functions, this was causing an infinite recursion every time the indexer ran, which Perl eventually caught, but not before it had used 2 GB of my swap space and all my physical memory.

Recreating the task (in fact, just changing a property and changing it back) seems to have got rid of the problem. Maybe tasks store the path from the environment when they are created. This just shows why scheduling the indexer has to be done carefully; the Unix version has a complicated script to check that everything's running successfully, which I'm hoping Task Scheduler can duplicate under Windows.

On the positive side, the new plugins to convert images using the Perl API look to be working fine. Next stop, GhostScript, and the Windows version is getting closer to feature parity with Unix.

Thursday 26 July 2007

ImageMagick spoils my day

An ImageMagick design flaw means that it's unsafe to use in a batch environment where you can't trust the filename you are given. This is a problem the standard Unix command-line utilities solved a long time ago, but to work around it here I have to mess around with the Perl API rather than the command line.

What if you need to delete a file called -rf, or cat a file called --help? That's what the -- option is for: it tells the command that there aren't any more options, just filenames. ImageMagick interprets special characters in filenames as syntax, and there's no way to tell it not to. As far as I can tell, you can't load a file called image.jpg[23x42], even though it's a valid (though odd) filename. This kind of flaw can easily lead to security holes in server applications.
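To make the point concrete, here is a small Python sketch (not the Perl that EPrints uses) of the two halves of the problem. The names build_argv and looks_like_syntax are made up for illustration. The first shows the -- convention, which is enough for well-behaved tools. The second is only a rough heuristic, based on my assumption that a leading dash or a trailing bracketed suffix is what gets treated as syntax; it is certainly not a complete list of what ImageMagick interprets.

```python
import re

# Assumption: a trailing [...] is what would be read as a frame or
# geometry selector rather than as part of the filename.
_BRACKET_SUFFIX = re.compile(r"\[[^\]]*\]$")


def build_argv(command, options, filenames):
    """Build an argument list in which every filename stays a filename.

    Everything after "--" is taken as an operand, never as an option,
    by tools that follow the usual Unix convention.
    """
    return [command, *options, "--", *filenames]


def looks_like_syntax(filename):
    """Flag names that a tool without a "--" equivalent might misread."""
    return filename.startswith("-") or bool(_BRACKET_SUFFIX.search(filename))
```

With this, build_argv("rm", [], ["-rf"]) gives ["rm", "--", "-rf"], which deletes the file instead of parsing it as options. For a tool with no such escape hatch, all a check like looks_like_syntax can do is reject or rename the file before it gets anywhere near the command line.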

It doesn't help either that the (sparse) documentation for the API says that you can read an image from a filehandle, when in fact that crashes Perl. Several hoops later I seem to have a reliable way of converting images from Windows; the next step is to hack together some EPrints plugins and see if it works for real.