Wednesday 16 May 2007

Converting Office formats

I've been experimenting with ways to convert popular Microsoft Office formats to plain text, so that EPrints can offer full-text searching. Unfortunately, it looks like using Office itself is out of the question for this task:

  • Office is not designed to run server-side. It expects to be running on a real user's desktop. If a confirmation popup appears and there's nobody at a UI to see it, there's no way to dismiss it; the process will just be left running.
  • It could be a security hole, or at least a denial-of-service vector. Given anything unusual in an uploaded file, Office might crash, pop up a message which can't be dismissed, or, worst of all, run malicious code.
  • The license agreement requires each client to have their own license if Office is run server-side. It's debatable whether a user uploading a Word document to EPrints counts as a client, but it's best not to take the risk.

There are a few free tools for converting Word documents, which I'll try next, but finding one that can convert PowerPoint seems impossible. If only everyone would move over to the new XML formats; until then, Office documents probably won't be indexed.
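
To give a rough idea of why the XML formats would make this easier, here's a minimal sketch in Python. It isn't EPrints code, and the function name docx_to_text is made up for this example. It assumes the upload is a .docx file, which is a zip archive that conventionally keeps its body text in word/document.xml, and it only collects the text runs from that one part (no headers, footnotes or comments).

```python
import zipfile
import xml.etree.ElementTree as ET

# WordprocessingML namespace used for body text in .docx files.
W_NS = "{http://schemas.openxmlformats.org/wordprocessingml/2006/main}"


def docx_to_text(source):
    """Return the body text of a .docx file, one line per paragraph.

    `source` is a file path or a binary file-like object.
    Hypothetical helper for illustration only.
    """
    # Assumption: the main document part lives at word/document.xml.
    with zipfile.ZipFile(source) as archive:
        xml_bytes = archive.read("word/document.xml")

    root = ET.fromstring(xml_bytes)
    paragraphs = []
    for para in root.iter(W_NS + "p"):
        # Join every text run (w:t) found inside this paragraph (w:p).
        runs = [node.text or "" for node in para.iter(W_NS + "t")]
        paragraphs.append("".join(runs))
    return "\n".join(paragraphs)
```

Nothing here needs Office installed or a desktop session, which sidesteps the popup and licensing worries above. Uploaded files are still untrusted input, though, so a real indexer would probably want size limits and a hardened XML parser as well.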
