Tuesday, 9 October 2007

Production release

EPrints for Windows has now been designated a production release. To install, see the instructions on the EPrints Wiki.

Friday, 28 September 2007

New release

A new release with the indexer, plus image and document conversion, is in the pipeline. It's still awaiting editor approval, but will eventually be found at http://files.eprints.org/300/.

Thursday, 13 September 2007

HTML, Word, and PowerPoint

HTML conversion is working using HTML::Parser. I might need to add some code to detect character encodings; we'll see after more testing.
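For illustration only, here is roughly what event-driven text extraction from HTML looks like. This is a minimal sketch in Python using the standard library's html.parser, not the EPrints plugin itself (which uses Perl's HTML::Parser); the names TextExtractor and html_to_text are made up for the example, and it assumes the input has already been decoded to a string, so the character encoding question above is left open.

```python
from html.parser import HTMLParser


class TextExtractor(HTMLParser):
    """Collect the text content of a page, ignoring script and style blocks."""

    SKIPPED_TAGS = {"script", "style"}

    def __init__(self):
        super().__init__(convert_charrefs=True)  # turn &amp; etc. into characters
        self.chunks = []
        self._skip_depth = 0

    def handle_starttag(self, tag, attrs):
        if tag in self.SKIPPED_TAGS:
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in self.SKIPPED_TAGS and self._skip_depth:
            self._skip_depth -= 1

    def handle_data(self, data):
        # Keep text only when we are not inside a skipped element.
        if not self._skip_depth:
            self.chunks.append(data)


def html_to_text(html):
    """Return the visible text of an HTML string with whitespace collapsed."""
    parser = TextExtractor()
    parser.feed(html)
    parser.close()  # flush any text still buffered by the parser
    return " ".join(" ".join(parser.chunks).split())
```

Calling html_to_text on a decoded page gives a single whitespace-normalised string, which is about all a full text indexer needs.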

As for Word, there's been a stroke of luck—I found a Windows binary of catdoc, which isn't officially available for Windows. Even better, there's also catppt, so maybe PowerPoint can be converted after all. I'm currently integrating both tools into EPrints plugins.

Update: The EPrints plugins for HTML, Word, and PowerPoint are written and seem to be working.

Monday, 6 August 2007

Converting PostScript with ImageMagick

I've had a dig through the tortuous source code of ImageMagick and it seems to take all appropriate precautions when handling PostScript and PDF. I suppose I'll trust it to make thumbnails of these formats. It's very confusing for someone not intimately familiar with the internals, so I wouldn't like to bet that there aren't any race conditions, but surely some security-minded people have picked over it already.

I still need custom code to convert to plain text, because the batch files GhostScript provides don't check filenames with Web server grade paranoia. That code is finished and working already, though. Next stop, HTML; then another attempt to convert Word.

Update: I spoke too soon. There's some problem with creating temporary files when you're using the API and not the command line. Working on it.

Update of the update: A different API call works. PostScript and PDFs can now both be previewed by EPrints on Windows.

Wednesday, 1 August 2007

Converting PostScript

I'm halfway there with PostScript conversion. I think it can convert to plain text—as reliably as is possible. I'm tempted to let ImageMagick handle converting to images, and hope it does correct filename handling.

The perennial problem with PostScript is, of course, that it's a Turing-complete programming language, and any attempt to convert it to anything else rapidly runs into the halting problem: there's mathematically no way to guarantee that converting an arbitrary document will even finish. It doesn't matter whether a malicious document is indexed correctly, so the converter only needs to extract text and images from safe documents, but it mustn't be allowed to compromise security or cause a denial of service, either.
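One common way to cap the damage from a document that never finishes is to run the converter as a child process with a time limit. The following is a minimal Python sketch of that idea, not the code used in EPrints; run_with_timeout is a hypothetical helper, and the command to run is whatever converter you trust. A timeout only bounds running time; memory and disk use would need separate limits.

```python
import subprocess
import sys


def run_with_timeout(command, seconds):
    """Run a command; return its stdout bytes, or None if it fails or overruns."""
    try:
        result = subprocess.run(
            command,
            stdin=subprocess.DEVNULL,   # never wait for input from a user
            stdout=subprocess.PIPE,
            stderr=subprocess.DEVNULL,
            timeout=seconds,            # the child is killed when this expires
        )
    except subprocess.TimeoutExpired:
        return None
    if result.returncode != 0:
        return None
    return result.stdout


if __name__ == "__main__":
    # A stand-in for a document that loops forever.
    endless = [sys.executable, "-c", "while True: pass"]
    print(run_with_timeout(endless, 1))
```

A document that fails or times out is simply not indexed, which matches the requirement above: correct indexing only matters for well-behaved documents.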

Monday, 30 July 2007

ImageMagick memory problems

I noticed my Windows development machine had slowed to a crawl recently, and I assumed it was something to do with the new plugins to convert images that are in development. Turns out it was, but not in an obvious way. Because the plugin requires ImageMagick, any EPrints process will load ImageMagick; this includes the scheduled task to run the indexer which I had installed and promptly forgotten about.

It looks like, because I had installed ImageMagick after creating this task, the task's environment didn't include the paths the Perl module needs to load correctly. It could find the Perl component, but not the XS part. Because of the way the module autoloads unknown functions, this caused an infinite recursion every time the indexer ran, which Perl eventually caught, but not before it had used 2GB of my swap space and all my physical memory.
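The failure mode is easy to reproduce in miniature. This is a Python analogue rather than the Perl code in question: a fallback for unknown names that itself depends on something missing ends up calling itself. ImageWrapper and _native_lookup are invented for the example, and Python gives up far sooner than Perl did here because it enforces a recursion limit.

```python
class ImageWrapper:
    """Hypothetical wrapper whose native half failed to load."""

    def __getattr__(self, name):
        # Called only for attributes that normal lookup could not find.
        # The fallback asks the native layer, but _native_lookup is itself
        # missing, so looking it up lands back here and recurses.
        return self._native_lookup(name)


def try_call():
    """Touch an unknown attribute and report what happened."""
    try:
        ImageWrapper().resize
    except RecursionError:
        return "recursion detected"
    return "ok"


if __name__ == "__main__":
    print(try_call())
```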

Recreating the task (in fact, just changing a property and changing it back) seems to have got rid of the problem. Maybe tasks store the path from the environment when they are created. This just shows why scheduling the indexer has to be done carefully; the Unix version has a complicated script to check that everything's running successfully, which I'm hoping Task Scheduler can duplicate under Windows.

On the positive side, the new plugins to convert images using the Perl API look to be working fine. Next stop, GhostScript—and the Windows version is getting closer to feature complete relative to Unix.

Thursday, 26 July 2007

ImageMagick spoils my day

An ImageMagick design flaw means that it's unsafe to use in a batch environment where you can't trust the filename you are given. This is a problem solved a long time ago by all other Unix command line utilities, but to work around it I have to mess around with the Perl API rather than the command line.

What if you need to delete a file called -rf, or cat a file called --help? That's what the -- option is for: it tells the command that there aren't any more options, just filenames. ImageMagick interprets special characters in filenames as syntax, and there's no way to tell it not to. As far as I can tell, you can't load a file called image.jpg[23x42], even though it's a valid (though odd) filename. This kind of problem can easily lead to security problems in server applications.
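The -- convention is simple enough to show in a few lines. This is a minimal Python sketch of how a command might separate options from filenames; split_args is a hypothetical function written for the example, not part of any real tool.

```python
def split_args(argv):
    """Split arguments into (options, filenames), honouring the -- marker."""
    options, filenames = [], []
    only_filenames = False
    for arg in argv:
        if only_filenames:
            # After --, everything is a filename, whatever it looks like.
            filenames.append(arg)
        elif arg == "--":
            only_filenames = True
        elif arg.startswith("-") and arg != "-":
            options.append(arg)
        else:
            filenames.append(arg)
    return options, filenames


if __name__ == "__main__":
    print(split_args(["-f", "--", "-rf", "--help"]))
```

With the marker in place, -rf and --help are treated as filenames rather than options. The complaint above is that ImageMagick's filename syntax has no equivalent escape hatch.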

It doesn't help either that the (sparse) documentation for the API says that you can read an image from a filehandle, when in fact that crashes Perl. Several hoops later I seem to have a reliable way of converting images from Windows; the next step is to hack together some EPrints plugins and see if it works for real.

Thursday, 12 July 2007

New release, indexer progress

A new Windows package has now been released. This release fixes a few bugs and has a new graphical installer. Download it from http://files.eprints.org/279/.

Good progress has been made with the indexer. Metadata can now be indexed periodically, and so can full text of plain ASCII documents thanks to a bug fix. I need to write some file format plugins to convert other formats, which might involve a change in the way plugins are configured in the core. The next release (soon I hope) will have the indexer included; until then I'll post a script which can be configured manually.

Tuesday, 19 June 2007

New release coming

I'm preparing a new package for release, hopefully within the next week. There are compatibility bugs in the core which I'd like to have fixed first, in case they sneak into future releases; I can't commit into the core code so I need to wait until another developer gets back.

In the meantime I'm testing the indexer. No full text yet, but it works for metadata.

Tuesday, 29 May 2007

Indexing

I made a start on the indexer a while ago, and while it could do a single pass I haven't yet looked at how to make it a background task. It's similar to subscriptions, which are done with Windows Task Scheduler, but the Unix indexer script takes great care to interrupt the indexer if it crashes. I need to investigate how Task Scheduler works to see if it can do this.
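One way a scheduled task could supervise a background indexer is to have the indexer touch a file regularly and treat a stale file as a sign of trouble. This is a minimal Python sketch of that check under assumed names (indexer_is_stalled, a tick file path chosen by the caller); it is not a description of how the Unix script actually does it.

```python
import os
import time


def indexer_is_stalled(tick_path, max_age_seconds, now=None):
    """Return True if the tick file is missing or has not been touched recently."""
    now = time.time() if now is None else now
    try:
        last_tick = os.path.getmtime(tick_path)
    except OSError:
        # No tick file at all: the indexer never started or has gone away.
        return True
    return (now - last_tick) > max_age_seconds
```

A periodic task would call this and, if it returns True, stop any leftover indexer process and start a fresh one.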

On top of this is the problem of indexing full text. Most of the formats the Unix version can understand are converted using Unix specific tools; these need to be converted to pure Perl, Windows specific, or portable tools. I already found a module which should do the trick for HTML, and GhostScript will probably work for PS and PDF.

Thursday, 24 May 2007

Status report

EPrints 3 can be installed and run under Windows. It's not yet feature complete, nor fully tested, but it's suitable for archive administrators to experiment with. If you just want to see what EPrints can do, you can try DemoPrints.

What works? All the basic features are there: you can upload documents, browse and search, and create users. Full text searching is currently missing, as is anything which depends on external commands.

New releases will be made available from the EPrints Files repository.

Wednesday, 23 May 2007

Subversion for desktop integration

Having looked into DAV to provide desktop integration for EPrints, I concluded it wasn't up to the job. For a start, newer versions of Web Folders (the Windows DAV client) don't communicate well with Apache DAV; also making a DAV handler with any complexity requires writing C modules to plug into Apache.

I was hoping to find a DAV handler which exported the API to Perl, but not only would that still require plenty of careful coding, there don't seem to be any systems which use DAV any more. Subversion seems to be the only DAV handler in common use apart from the simple filesystem handler. On investigation though, it looks like we could simply use Subversion to provide desktop integration—and it would be powerful, easy to build into EPrints, and cross platform on both server and client.

How it would work:

  • The user creates a new eprint in their EPrints work area, and adds a special type of document.
  • EPrints creates a new SVN repository for that document, and gives the user the URL of the repository.
  • Using a client like TortoiseSVN, the user checks out that URL into an empty working copy on their computer.
  • They add files to that directory and add them to version control. When they're done working, they check in their changes, and they're synchronised with the SVN repository.
  • EPrints detects that the repository has changed, and updates its version of the document.
  • When the document is committed to the archive, the repository is frozen. Cloning an eprint will also clone its repository, so the user can work on an updated version of the document.

Subversion has hooks for when things in the repository change, so it can notify EPrints where appropriate, and using EPrints's user authentication shouldn't be too difficult.
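The lifecycle described above can be sketched as a small state model. This is a Python illustration with made-up names (DocumentRepository, on_commit, the example URLs), not EPrints or Subversion code; in a real setup, notifying EPrints would hang off a repository hook, and refusing changes to a frozen repository would need a check before the commit is accepted.

```python
from dataclasses import dataclass, field


@dataclass
class DocumentRepository:
    """Hypothetical record EPrints might keep for one document's repository."""

    url: str
    revision: int = 0
    frozen: bool = False
    files: dict = field(default_factory=dict)

    def on_commit(self, changed_files):
        """Stand-in for a hook: record a check-in and return the new revision."""
        if self.frozen:
            raise PermissionError("repository is frozen")
        self.files.update(changed_files)
        self.revision += 1
        return self.revision

    def freeze(self):
        """Called when the eprint is committed to the archive."""
        self.frozen = True

    def clone(self, new_url):
        """Copy the contents into a new, unfrozen repository for a new version."""
        return DocumentRepository(
            url=new_url, revision=self.revision, files=dict(self.files)
        )


if __name__ == "__main__":
    repo = DocumentRepository("http://example.org/svn/doc1")
    repo.on_commit({"paper.tex": "draft 1"})
    repo.freeze()
    copy = repo.clone("http://example.org/svn/doc2")
    print(copy.on_commit({"paper.tex": "draft 2"}))
```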

Subversion, Apache, and Windows

Installer packages

For a custom GUI, I'm looking at Tk::Wizard, because the Perl version we're using comes with Tk. I'm surprised Tk is as ubiquitous as this. In case a real installer package is better, Inno Setup might do the job.

Tuesday, 22 May 2007

More installers

My gripe with NSIS is that to do user input, you have to copy over a DLL and INI file which communicate with the main installer's (ugly) scripting language by writing and parsing the INI file. It also seems to require you to hardcode the exact coordinates of each input field - so hope the user hasn't changed their default font or language. It seems like such a horrible hack that I was convinced there must be a better way.

I was hoping to avoid moving to a Windows-based solution so that the Linux and Windows builds can be generated by the same process, but today I gave up and started reading about WiX, which generates MSI installers. It has a nice logical XML format and a real scripting language, so I was hopeful.

My first observation is that anything not provided by default, including simple text input fields, has to be done through scripting, and it suffers from the same problem as NSIS: you have to specify absolute positions for everything.

My next observation is that the XML format requires you to list each installed file explicitly! There's no way to do a whole directory at a time, and EPrints has hundreds of files to install. There's a tool (heat) to generate this list automatically, but it's intended to be run once and then maintained manually. Without completely reworking the development system or writing tools to do this, it looks like MSI will be too unwieldy.
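A tool to regenerate the file list on every build would not need to be large. The following Python sketch walks a directory and emits one entry per file; the element and attribute names (ComponentGroup, Component, File, Id, Guid, Source) are written from memory of the WiX schema and should be treated as assumptions, and the sketch ignores the nested directory declarations a real installer would also need.

```python
import os
import xml.etree.ElementTree as ET


def wix_file_elements(root_dir):
    """Build an XML element listing every file under root_dir explicitly."""
    group = ET.Element("ComponentGroup", Id="EPrintsFiles")  # hypothetical Id
    count = 0
    for dirpath, dirnames, filenames in os.walk(root_dir):
        dirnames.sort()  # make the walk order repeatable between builds
        for name in sorted(filenames):
            count += 1
            relative = os.path.relpath(os.path.join(dirpath, name), root_dir)
            component = ET.SubElement(
                group, "Component", Id=f"cmp{count:04d}", Guid="*"
            )
            ET.SubElement(component, "File", Id=f"fil{count:04d}", Source=relative)
    return group


if __name__ == "__main__":
    print(ET.tostring(wix_file_elements("."), encoding="unicode"))
```

Even so, this is exactly the kind of extra tooling that makes MSI look unwieldy next to an installer that can take a whole directory.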

My best idea at the moment is to use NSIS to generate a trivial installer which doesn't prompt the user for anything. Configuring the archive options can then be done by a separate utility - probably using a GUI toolkit for Perl, which will be able to work most efficiently with the configuration files as they're all serialised Perl hashes.

Wednesday, 16 May 2007

Installer

Up to now the plan was to use NSIS to build the installer, because it's cross platform; the existing Linux based development cycle can also produce the Windows build. However it looks like it's really difficult (not to mention ugly) to make NSIS prompt the user for anything. The installer ought to be a configuration tool too.

The next task is to evaluate Windows Installer for the same job, if I can find out what I have to download to get the SDK. That is, unless I find out how to make NSIS do anything useful in the meantime.

Converting Office formats

I've been experimenting with ways to convert popular Microsoft Office formats to plain text, to allow EPrints to do full text searching. Unfortunately it looks like Office itself is out of the question for this task:

  • Office is not designed to run server side. It expects to be running on a real user's desktop. If a confirmation popup appears and there's no UI, there's no way to clear it; it'll just be left running.
  • It could be a security hole, or at least a denial of service vector. Anything unusual in the uploaded file and Office might crash, pop up a message which can't be dismissed, or worst of all run malicious code.
  • The license agreement requires each client to have their own license if Office is run server side. It's debatable whether a user uploading a Word document to EPrints counts, but it's best not to take the risk.

There are a few free tools to convert Word which I'll try next, but finding one which can convert PowerPoint seems impossible. If only everyone would move over to the new XML formats; until then Office documents probably won't be indexed.

Introduction

Thanks to the support of Microsoft, the EPrints Web based institutional repository software is being ported to the Windows platform.

This blog will be used to record developer progress on the Windows port, and to provide general information for people interested in using EPrints on Windows.