Lucene and Amazon S3

2007 November 16
by Shay Banon


I spent some time trying to store a Lucene index on the Amazon S3 service. Amazon S3 is a really cool idea, and storing a Lucene index on top of it would provide a simple way to keep the index in a distributed environment with HA support. It would also make a lot of sense for applications deployed on Amazon EC2, since working with S3 from EC2 is free.

It was pretty simple to implement the Lucene Directory interface on top of Amazon S3. A bucket is treated as a Lucene index, and each Lucene file is represented by one S3 object that holds its metadata, plus 0 or more S3 objects holding portions of its content (naturally, this is configurable). Combined with Compass support for such storage and Compass local cache support, this should add only minor performance overhead when switching from the local file system to S3.
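To make the layout above concrete, here is a minimal sketch of the "one metadata object plus 0 or more chunk objects" idea. It is not the actual Compass or Lucene Directory code: the class name `ChunkedFileStore`, the key naming (`name.meta`, `name.0`, `name.1`, ...) and the `length:chunks` metadata format are all assumptions made for illustration, and an in-memory map stands in for the S3 bucket.

```java
import java.nio.charset.StandardCharsets;
import java.util.Arrays;
import java.util.Map;
import java.util.TreeMap;

/** Hypothetical sketch: each index file becomes one metadata object plus 0..n chunk objects. */
class ChunkedFileStore {
    // Stand-in for an S3 bucket: object key -> object bytes (assumption, no real S3 calls).
    private final Map<String, byte[]> bucket = new TreeMap<>();
    private final int chunkSize; // the configurable portion size

    ChunkedFileStore(int chunkSize) {
        if (chunkSize <= 0) throw new IllegalArgumentException("chunkSize must be positive");
        this.chunkSize = chunkSize;
    }

    /** Splits data into chunk objects, then writes the metadata object last. */
    void write(String name, byte[] data) {
        int chunks = (data.length + chunkSize - 1) / chunkSize; // 0 for an empty file
        for (int i = 0; i < chunks; i++) {
            int from = i * chunkSize;
            int to = Math.min(from + chunkSize, data.length);
            bucket.put(name + "." + i, Arrays.copyOfRange(data, from, to));
        }
        // Assumed metadata format: "<total length>:<number of chunks>".
        String meta = data.length + ":" + chunks;
        bucket.put(name + ".meta", meta.getBytes(StandardCharsets.UTF_8));
    }

    /** Reads the metadata object, then reassembles the file from its chunk objects. */
    byte[] read(String name) {
        byte[] meta = bucket.get(name + ".meta");
        if (meta == null) throw new IllegalArgumentException("no such file: " + name);
        String[] parts = new String(meta, StandardCharsets.UTF_8).split(":");
        int length = Integer.parseInt(parts[0]);
        int chunks = Integer.parseInt(parts[1]);
        byte[] out = new byte[length];
        int pos = 0;
        for (int i = 0; i < chunks; i++) {
            byte[] chunk = bucket.get(name + "." + i);
            System.arraycopy(chunk, 0, out, pos, chunk.length);
            pos += chunk.length;
        }
        return out;
    }

    /** Number of objects currently in the stand-in bucket. */
    int objectCount() {
        return bucket.size();
    }
}
```

With a chunk size of 4, a 10-byte file ends up as three chunk objects and one metadata object, while an empty file is a metadata object only. The sketch does not clean up stale chunks on overwrite, which is tolerable here only because Lucene index files are normally written once.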

Even before I embarked on this quick hacking session, the main thing I was concerned about was how to implement locking on top of S3. There is no formal locking API for it, but I had heard somewhere that bucket creation is atomic. Assuming that it is, very simple locking support can be built: creating a bucket and succeeding means the lock was obtained, failure means it is already locked, and deleting the bucket releases the lock. Sadly, this is not the case, and bucket creation is certainly not atomic. Funnily enough, it does not even fail when trying to create an already existing bucket.
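The locking scheme described above, and the reason it breaks, can be sketched as follows. This is an illustration only: `BucketStore`, `AtomicCreateStore`, `IdempotentCreateStore` and `BucketLock` are hypothetical names, and both stores are in-memory stand-ins rather than real S3 clients. The first store models the behaviour I had hoped for (create fails if the bucket exists); the second models what I observed (create succeeds even for an existing bucket).

```java
import java.util.Set;
import java.util.concurrent.ConcurrentHashMap;

/** Hypothetical minimal view of a bucket service, just enough for the lock. */
interface BucketStore {
    /** Returns true if the create request reported success. */
    boolean createBucket(String name);
    void deleteBucket(String name);
}

/** Assumed ideal: creation is an atomic create-if-absent that fails on an existing bucket. */
class AtomicCreateStore implements BucketStore {
    private final Set<String> buckets = ConcurrentHashMap.newKeySet();
    public boolean createBucket(String name) { return buckets.add(name); }
    public void deleteBucket(String name) { buckets.remove(name); }
}

/** Models the observed behaviour: creating an already existing bucket still succeeds. */
class IdempotentCreateStore implements BucketStore {
    private final Set<String> buckets = ConcurrentHashMap.newKeySet();
    public boolean createBucket(String name) { buckets.add(name); return true; }
    public void deleteBucket(String name) { buckets.remove(name); }
}

/** The lock: obtaining it means creating the bucket, releasing it means deleting the bucket. */
class BucketLock {
    private final BucketStore store;
    private final String lockName;

    BucketLock(BucketStore store, String lockName) {
        this.store = store;
        this.lockName = lockName;
    }

    boolean obtain() { return store.createBucket(lockName); }

    void release() { store.deleteBucket(lockName); }
}
```

Against `AtomicCreateStore`, a second `obtain()` on the same lock name returns false until the first holder calls `release()`. Against `IdempotentCreateStore`, every `obtain()` returns true, so two writers would both believe they hold the lock, which is exactly why the approach had to be dropped.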

So for now I have shelved the implementation. It would be great if the good people at Amazon would allow for simple locking support. I understand that this is not simple to do in a distributed environment (hey, I work at GigaSpaces), but it must be there in some form; it would make S3 a much more attractive offering.

9 Responses
  1. 2007 November 18
    ursuletzu

    Hi, interesting stuff, I intend to use S3 as well.
    But why not use another store for the locks and simulate a distributed transaction across both that store and S3?
    Hm… Don’t shelve it.

  2. 2007 November 19

    I can build a distributed lock system, for example using JGroups or something like that. I am not sure how this is going to hold up within EC2. Would you be interested in having the ability to store things on Amazon S3 and have distributed locks using JGroups?

  3. 2007 November 19
    Emmanuel Bernard

    There is no out of the box pessimistic locking API for JGroups (though it is doable to write one). But the JGroups / JBoss Cache guys want to add such an API soonish. I don’t know if it will sit in JGroups, JBoss Cache or both but that will definitely make life easier for people who have such needs :)

  4. 2007 November 28

    If you are trying to use Lucene and its indexing, there are a couple of things that I would suggest. You will want to check out JSR 170 and possibly write your own Node type. There are already built-in Node types for files, but they are easily extended to persist to places other than the file system. This has several advantages. JSR 170 will allow you to adapt as the underlying S3 API changes, and it will allow you to move away from S3 without changing the business logic if Amazon really starts to jack up their prices. You also might want to check the numbers again on the “free” from EC2. You don’t have to pay for the bandwidth used, but you do have to pay for every request under the new pricing scheme. It is still cheap, but not free. If you are planning on indexing often or having a ton of requests, this is something to account for in the design.

  5. 2009 February 16

    I use Lucene (but not Compass) in a live vertical search engine. I need to update the index every hour. For performance reasons I also want to optimize it. Currently I perform this operation on an offline machine and then transfer the index (> 1GB) to the production machine and reload it.

    I was wondering if using S3 instead of transferring the index would help? Any thoughts/comments are appreciated. I am not familiar with Compass. Would switching to Compass help?

  6. 2009 February 16

    It really depends, what is the problem with transferring the index now?

  7. 2009 February 17

    Current issues are:
    1) Ideally, I would like to transfer more frequently, but the network overhead of transferring 1GB+ discourages me from doing so more often. Incremental transfer, for example, would have been ideal, but then the index is not optimized.

    2) The transfer process (rsync) temporarily spikes the CPU of the production machine.

    3) I would like to scale out, but with the current infrastructure I would have to replicate the index to all the nodes. Not a big deal, but it would be ideal if I had a central master copy or a central Lucene server which can be scaled out independently of the web server.

    Any thoughts, comments are very appreciated!

  8. 2009 February 17

    Can you open a thread in the Compass forum? I think it makes more sense to have the discussion there…

  9. 2009 February 24

    I always enjoy learning how other people employ Amazon S3 online storage. I am wondering if you can check out my very own tool, CloudBerry Explorer, which helps to manage S3. It is freeware.
