Tuesday 29 May 2007

Indexing

I made a start on the indexer a while ago, and while it can do a single pass, I haven't yet looked at how to make it a background task. It's similar to subscriptions, which are run by Windows Task Scheduler, but the Unix indexer script takes great care to restart the indexer if it crashes. I need to investigate how Task Scheduler works to see if it can do this.
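To make the requirement concrete, here is a minimal sketch of a supervisor that reruns a command whenever it exits abnormally. It is written in Python for brevity and is not the Unix indexer script; the function name, the restart limit, and the stand-in command are all assumptions made for illustration.

```python
import subprocess
import sys
import time


def supervise(command, max_restarts=5, delay_seconds=1.0):
    """Run command, restarting it whenever it exits with a non-zero status.

    Returns the number of restarts that were needed before a clean exit.
    """
    restarts = 0
    while True:
        exit_code = subprocess.call(command)
        if exit_code == 0:
            return restarts  # clean shutdown, nothing more to do
        if restarts >= max_restarts:
            raise RuntimeError("command kept crashing; giving up")
        restarts += 1
        time.sleep(delay_seconds)  # short pause so a crash loop cannot spin


if __name__ == "__main__":
    # Hypothetical stand-in for the real indexer command line.
    supervise([sys.executable, "-c", "print('one indexing pass')"])
```

Whether Task Scheduler can provide the same behaviour on its own is exactly the open question; a wrapper like this would be the fallback if it cannot.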

On top of this is the problem of indexing full text. Most of the formats the Unix version can understand are converted using Unix-specific tools; these need to be replaced with pure Perl, Windows-specific, or portable tools. I already found a module which should do the trick for HTML, and GhostScript will probably work for PS and PDF.
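As an illustration of what a portable HTML converter has to do, here is a minimal sketch using only the Python standard library. It is not the Perl module mentioned above; the class and function names are made up for this example.

```python
from html.parser import HTMLParser


class TextExtractor(HTMLParser):
    """Collect visible text, skipping the contents of script and style."""

    SKIPPED = {"script", "style"}

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self._skip_depth = 0
        self._chunks = []

    def handle_starttag(self, tag, attrs):
        if tag in self.SKIPPED:
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in self.SKIPPED and self._skip_depth > 0:
            self._skip_depth -= 1

    def handle_data(self, data):
        if self._skip_depth == 0:
            self._chunks.append(data)

    def text(self):
        # Join the chunks and collapse runs of whitespace into single spaces.
        return " ".join(" ".join(self._chunks).split())


def html_to_text(html):
    """Return the plain text of an HTML string, suitable for indexing."""
    parser = TextExtractor()
    parser.feed(html)
    parser.close()
    return parser.text()
```

Chunks are joined with spaces, so a word split by inline markup comes out as two tokens; that is usually acceptable for a search index but would not be for display.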

Thursday 24 May 2007

Status report

EPrints 3 can be installed and run under Windows. It's not yet feature complete, nor fully tested, but it's suitable for archive administrators to experiment with. If you just want to see what EPrints can do, you can try DemoPrints.

What works? All the basic features are there: you can upload documents, browse and search, and create users. Full text searching is currently missing, as is anything which depends on external commands.

New releases will be made available from the EPrints Files repository.

Wednesday 23 May 2007

Subversion for desktop integration

Having looked into DAV to provide desktop integration for EPrints, I concluded it wasn't up to the job. For a start, newer versions of Web Folders (the Windows DAV client) don't communicate well with Apache DAV; also, making a DAV handler of any complexity requires writing C modules to plug into Apache.

I was hoping to find a DAV handler which exported the API to Perl, but not only would that still require plenty of careful coding, there don't seem to be any systems which use DAV any more. Subversion seems to be the only DAV handler in common use apart from the simple filesystem handler. On investigation though, it looks like we could simply use Subversion to provide desktop integration—and it would be powerful, easy to build into EPrints, and cross platform on both server and client.

How it would work:

  • The user creates a new eprint in their EPrints work area, and adds a special type of document.
  • EPrints creates a new SVN repository for that document, and gives the user the URL of the repository.
  • Using a client like TortoiseSVN, the user checks out that URL into an empty working copy on their computer.
  • They add files to that directory and add them to version control. When they're done working, they check in their changes, and they're synchronised with the SVN repository.
  • EPrints detects that the repository has changed, and updates its version of the document.
  • When the document is committed to the archive, the repository is frozen. Cloning an eprint will also clone its repository, so the user can work on an updated version of the document.

Subversion has hooks for when things in the repository change, so it can notify EPrints where appropriate, and using EPrints's user authentication shouldn't be too difficult.
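The notification step could be sketched as a post-commit hook. Subversion runs that hook with the repository path and the new revision number as its first two arguments; everything else here is an assumption for illustration, including the notification URL, the JSON message, and the idea that each repository directory is named after its document.

```python
import json
import os
import sys
import urllib.request

# Hypothetical endpoint; nothing like this is known to exist in EPrints.
NOTIFY_URL = "http://localhost/cgi/svn_changed"


def build_notification(argv):
    """Turn the post-commit hook arguments into a small message."""
    repos_path, revision = argv[1], int(argv[2])
    return {
        # Assumes the repository directory is named after the document.
        "document": os.path.basename(os.path.normpath(repos_path)),
        "revision": revision,
    }


def notify(message, url=NOTIFY_URL):
    """POST the message as JSON and return the HTTP status code."""
    request = urllib.request.Request(
        url,
        data=json.dumps(message).encode("utf-8"),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(request, timeout=10) as response:
        return response.status


if __name__ == "__main__":
    # Only act when called the way Subversion calls a post-commit hook.
    if len(sys.argv) >= 3:
        notify(build_notification(sys.argv))
```

Keeping the hook this small means the logic for updating the document stays inside EPrints, and the repository only needs to know where to send the message.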

Subversion, Apache, and Windows

Installer packages

For a custom GUI, I'm looking at Tk::Wizard, because the Perl version we're using comes with Tk. I'm surprised Tk is as ubiquitous as this. In case a real installer package is better, Inno Setup might do the job.

Tuesday 22 May 2007

More installers

My gripe with NSIS is that to do user input, you have to copy over a DLL and INI file which communicate with the main installer's (ugly) scripting language by writing and parsing the INI file. It also seems to require you to hardcode the exact coordinates of each input field, so you have to hope the user hasn't changed their default font or language. It seems like such a horrible hack that I was convinced there must be a better way.

I was hoping to avoid moving to a Windows-based solution so that the Linux and Windows builds can be generated by the same process, but today I gave up and started reading about WiX, which generates MSI installers. It has a nice logical XML format and real scripting language, so I was hopeful.

My first observation is that anything not provided by default, including simple text input fields, has to be done through scripting, and it suffers from the same problem as NSIS: you have to specify absolute positions for everything.

My next observation is that the XML format requires you to list each installed file explicitly! There's no way to do a whole directory at a time, and EPrints has hundreds of files to install. There's a tool (heat) to generate this list automatically, but it's intended to be run once and then maintained manually. Without completely reworking the development system or writing tools to do this, it looks like MSI will be too unwieldy.
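The kind of tool this would need is roughly the following: walk the install tree and emit one Component/File pair per file. This is a sketch only. The element and attribute names follow my understanding of the WiX schema and should be checked against it, the generated Ids are arbitrary, and the nested Directory elements and XML namespace a real .wxs file needs are left out.

```python
import os
import xml.etree.ElementTree as ET


def list_files(root):
    """Return the paths of all files under root, relative to root, sorted."""
    found = []
    for directory, _subdirs, names in os.walk(root):
        for name in names:
            full = os.path.join(directory, name)
            found.append(os.path.relpath(full, root))
    return sorted(found)


def component_fragment(root):
    """Build a Fragment element holding one Component/File pair per file."""
    fragment = ET.Element("Fragment")
    for number, relative in enumerate(list_files(root), start=1):
        component = ET.SubElement(
            fragment, "Component", Id="cmp%04d" % number, Guid="*"
        )
        ET.SubElement(
            component,
            "File",
            Id="fil%04d" % number,
            Source=os.path.join(root, relative),
            KeyPath="yes",
        )
    return fragment
```

Calling `ET.tostring(component_fragment("eprints3"), encoding="unicode")` on a hypothetical `eprints3` directory would give the XML text to paste or include. Regenerating the list on every build is simple, but it means the component Ids change whenever files are added or removed, which may matter for upgrades.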

My best idea at the moment is to use NSIS to generate a trivial installer which doesn't prompt the user for anything. Configuring the archive options can then be done by a separate utility - probably using a GUI toolkit for Perl, which will be able to work most efficiently with the configuration files as they're all serialised Perl hashes.

Wednesday 16 May 2007

Installer

Up to now the plan was to use NSIS to build the installer, because it's cross-platform; the existing Linux-based development cycle can also produce the Windows build. However, it looks like it's really difficult (not to mention ugly) to make NSIS prompt the user for anything. The installer ought to be a configuration tool too.

Next task is to evaluate Windows Installer for the same job, if I can find out what I have to download to get the SDK. That is unless I find out how to make NSIS do anything useful in the meantime.

Converting Office formats

I've been experimenting with ways to convert popular Microsoft Office formats to plain text, to allow EPrints to do full text searching. Unfortunately it looks like Office itself is out of the question for this task:

  • Office is not designed to run server side. It expects to be running on a real user's desktop. If a confirmation popup appears and there's no UI, there's no way to clear it; it'll just be left running.
  • It could be a security hole, or at least a denial of service vector. Anything unusual in the uploaded file and Office might crash, pop up a message which can't be dismissed, or worst of all run malicious code.
  • The license agreement requires each client to have their own license if Office is run server side. It's debatable whether a user uploading a Word document to EPrints counts, but it's best not to take the risk.

There are a few free tools to convert Word which I'll try next, but finding one which can convert PowerPoint seems impossible. If only everyone would move over to the new XML formats; until then Office documents probably won't be indexed.
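To show why the XML formats would help, here is a minimal sketch of pulling text out of a Word 2007 .docx file without Office. A .docx is a ZIP archive whose main body lives in word/document.xml, with the text held in `w:t` elements inside `w:p` paragraphs. The function name is made up, and headers, footnotes, and text boxes are ignored.

```python
import zipfile
import xml.etree.ElementTree as ET

# WordprocessingML namespace used by word/document.xml.
W = "{http://schemas.openxmlformats.org/wordprocessingml/2006/main}"


def docx_to_text(path_or_file):
    """Return the body text of a .docx file, one line per paragraph."""
    with zipfile.ZipFile(path_or_file) as archive:
        root = ET.fromstring(archive.read("word/document.xml"))
    paragraphs = []
    for paragraph in root.iter(W + "p"):
        # A paragraph's text is split across runs; join them back together.
        runs = [node.text or "" for node in paragraph.iter(W + "t")]
        paragraphs.append("".join(runs))
    return "\n".join(paragraphs)
```

Nothing here helps with the older binary .doc and .ppt files, which are the ones people actually upload at the moment.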

Introduction

Thanks to the support of Microsoft, the EPrints Web-based institutional repository software is being ported to the Windows platform.

This blog will be used to record developer progress on the Windows port, and to provide general information for people interested in using EPrints on Windows.