Wednesday 1 August 2007

Converting PostScript

I'm halfway there with PostScript conversion. I think it can convert to plain text—as reliably as is possible. I'm tempted to let ImageMagick handle converting to images, and hope it does correct filename handling.

The perennial problem with PostScript is of course that it's a Turing complete programming language, and any attempt to convert it to anything else rapidly encounters the halting problem. There's mathematically no way to guarantee successful conversion of a document. Of course it's not important if a malicious document is indexed correctly, so it only needs to extract text and images from safe documents, but it mustn't be allowed to compromise security or cause a denial of service, either.

No comments: