Lucene and Amazon S3
I spent some time trying to store a Lucene index on the Amazon S3 service. Amazon S3 is a really cool idea, and storing a Lucene index on top of it would provide a simple way to keep the index in a distributed environment that supports HA. It would also make a lot of sense for applications deployed on Amazon EC2, since working with S3 from EC2 is free.
It was pretty simple to implement the Lucene Directory interface on top of Amazon S3. A bucket is considered to be a Lucene index, and each file has one file object that holds its metadata, and 0 or more file objects holding portions of it (naturally, it is configurable). This, together with Compass support for such storage and Compass local cache support, should keep the performance overhead minor when switching from the local file system to S3.
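To make the layout above concrete, here is a minimal sketch of the "one metadata object plus 0 or more portion objects" idea. It is not the actual implementation: an in-memory map stands in for the S3 bucket, and the class name `ChunkedIndexStore`, the `.meta` / `.chunk.N` key naming, and the `chunkSize` parameter are all hypothetical names chosen for illustration.

```java
import java.nio.charset.StandardCharsets;
import java.util.Arrays;
import java.util.HashMap;
import java.util.Map;

// Hypothetical sketch: the map stands in for one S3 bucket (one bucket = one index).
public class ChunkedIndexStore {
    private final Map<String, byte[]> bucket = new HashMap<>();
    private final int chunkSize; // assumed to be the configurable portion size, in bytes

    public ChunkedIndexStore(int chunkSize) {
        if (chunkSize <= 0) {
            throw new IllegalArgumentException("chunkSize must be positive");
        }
        this.chunkSize = chunkSize;
    }

    // Stores the file as 0 or more chunk objects, then one metadata object.
    public void writeFile(String name, byte[] data) {
        int chunks = (data.length + chunkSize - 1) / chunkSize; // ceiling division
        for (int i = 0; i < chunks; i++) {
            int from = i * chunkSize;
            int to = Math.min(from + chunkSize, data.length);
            bucket.put(name + ".chunk." + i, Arrays.copyOfRange(data, from, to));
        }
        // The metadata object here only records the total length of the file.
        byte[] meta = Integer.toString(data.length).getBytes(StandardCharsets.UTF_8);
        bucket.put(name + ".meta", meta);
    }

    // Reads the metadata object first, then reassembles the chunks in order.
    public byte[] readFile(String name) {
        byte[] meta = bucket.get(name + ".meta");
        if (meta == null) {
            throw new IllegalArgumentException("no such file: " + name);
        }
        int length = Integer.parseInt(new String(meta, StandardCharsets.UTF_8));
        byte[] out = new byte[length];
        int offset = 0;
        for (int i = 0; offset < length; i++) {
            byte[] chunk = bucket.get(name + ".chunk." + i);
            System.arraycopy(chunk, 0, out, offset, chunk.length);
            offset += chunk.length;
        }
        return out;
    }

    // Number of objects currently stored in the stand-in bucket.
    public int objectCount() {
        return bucket.size();
    }
}
```

With a chunk size of 4, a 10-byte file becomes three chunk objects plus one metadata object, and an empty file becomes a metadata object only. The sketch assumes files are written once, as Lucene index files are, so it does not clean up stale chunks on overwrite.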
Even before I embarked on this quick hacking session, the main thing I was concerned about was how to implement locking on top of S3. There is no formal locking API for it, but I had heard somewhere that bucket creation is atomic. Assuming that it is, very simple locking support can be built: creating a bucket successfully means the lock was obtained, failure means it is already locked, and deleting the bucket releases the lock. Sadly, this is not the case, and bucket creation is certainly not atomic. Funnily enough, it does not even fail when trying to create an already existing bucket.
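For reference, this is roughly the locking scheme I had in mind, sketched against a store that does offer an atomic "create if absent" operation. Here `ConcurrentMap.putIfAbsent` plays that role purely as a stand-in; as described above, S3 bucket creation does not behave this way. The class name `CreateBasedLock` and the `owner` parameter are hypothetical.

```java
import java.util.concurrent.ConcurrentMap;

// Hypothetical sketch of a lock built on an atomic "create if absent" primitive.
public class CreateBasedLock {
    private final ConcurrentMap<String, String> store; // stand-in for the remote store
    private final String lockName;
    private final String owner;

    public CreateBasedLock(ConcurrentMap<String, String> store, String lockName, String owner) {
        this.store = store;
        this.lockName = lockName;
        this.owner = owner;
    }

    // Creating the entry and succeeding means the lock was obtained;
    // an existing entry means somebody else already holds it.
    public boolean obtain() {
        return store.putIfAbsent(lockName, owner) == null;
    }

    // Deleting the entry releases the lock, but only if this owner holds it.
    public void release() {
        store.remove(lockName, owner);
    }

    public boolean isLocked() {
        return store.containsKey(lockName);
    }
}
```

The whole scheme rests on `obtain()` being atomic and failing when the entry already exists. Without both properties, two writers can each believe they hold the lock.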
So for now I have shelved the implementation. It would be great if the good people at Amazon would allow for simple locking support. I understand that this is not simple to do in a distributed environment (hey, I work at GigaSpaces), but it must be there in some form; it would make S3 a much more attractive offer.
November 18th, 2007 at 2:55 pm
Hi, interesting stuff, I intend to use S3 also.
But why not use some other storage for the locks and simulate a distributed transaction across both that storage and S3?
Hm…Don’t shelve it.
November 19th, 2007 at 1:09 am
I can build a distributed lock system, for example using JGroups or something like that. I am not sure how well this is going to hold up within EC2. Would you be interested in having the ability to store things on Amazon S3 with distributed locks using JGroups?
November 19th, 2007 at 6:26 am
There is no out-of-the-box pessimistic locking API for JGroups (though it is doable to write one). But the JGroups / JBoss Cache guys want to add such an API soonish. I don’t know if it will sit in JGroups, JBoss Cache or both, but that will definitely make life easier for people who have such needs :)
November 28th, 2007 at 9:54 am
If you are trying to use Lucene and its indexing, there are a couple of things that I would suggest. You will want to check out JSR 170 and possibly write your own Node type. There are already built-in Node types for files, but they are easily extended to persist to places other than the file system. This has several advantages. JSR 170 will allow you to adapt as the underlying S3 API changes, and it will allow you to move away from S3 without changing the business logic if Amazon really starts to jack up their prices. You also might want to check the numbers again on the “free” from EC2. You don’t have to pay for the bandwidth used, but you do have to pay for every request under the new pricing scheme. It is still cheap, but not free. If you are planning on indexing often or handling a ton of requests, this is something to account for in the design.