Monday 6 August 2007

Converting PostScript with ImageMagick

I've had a dig through the tortuous source code of ImageMagick and it seems to take all appropriate precautions when handling PostScript and PDF. I suppose I'll trust it to make thumbnails of these formats. It's very confusing for someone not intimately familiar with the internals, so I wouldn't like to bet that there aren't any race conditions, but surely some security-minded people have picked over it already.

I still need custom code to convert to plain text, because the batch files GhostScript provides don't check filenames with Web server grade paranoia. That code is finished and working already, though. Next stop, HTML; then another attempt to convert Word.

Update: I spoke too soon. There's some problem with creating temporary files when you're using the API and not the command line. Working on it.

Update of the update: A different API call works. PostScript and PDFs can now both be previewed by EPrints on Windows.

Wednesday 1 August 2007

Converting PostScript

I'm halfway there with PostScript conversion. I think it can convert to plain text—as reliably as is possible. I'm tempted to let ImageMagick handle converting to images, and hope it does correct filename handling.

The perennial problem with PostScript is of course that it's a Turing complete programming language, and any attempt to convert it to anything else rapidly encounters the halting problem. There's mathematically no way to guarantee successful conversion of a document. Of course it's not important if a malicious document is indexed correctly, so it only needs to extract text and images from safe documents, but it mustn't be allowed to compromise security or cause a denial of service, either.