I'll look into it.
I think that in any case, it would make sense to initialize the
chunk_size to a specific value when working with Cassandra, something like
10-50MB. This will speed up the snapshotting process, since chunks are
written in parallel. It will also make the Cassandra gateway good for any
type of usage "out of the box".
On Friday, December 3, 2010 at 3:05 AM, Tom May wrote:
On Thu, Dec 2, 2010 at 4:51 PM, Shay Banon firstname.lastname@example.org wrote:
On Thu, Dec 2, 2010 at 10:14 PM, Tom May email@example.com wrote:
I've done a Cassandra gateway implementation as a plugin at
https://github.com/gistinc/elasticsearch/tree/cassandra. It's built with
ElasticSearch 0.13.0. I'm using Cassandra 0.6.8.
Great job! Once it's settled and done, we can maybe think about getting it
into elasticsearch itself, though my Cassandra knowledge is quite lacking.
Thanks. I'm all for it being included in ElasticSearch at some point.
And I've just realized I could factor things more cleanly, so I'm working on
that.
I'm wondering how Cassandra's eventual consistency model interacts with how
ElasticSearch uses its gateway. I admit I have no idea how ElasticSearch
uses its gateway, especially from multiple nodes in the cluster. It seems
safest to use Cassandra's QUORUM consistency level (which requires at least
three Cassandra nodes to tolerate one failing), but it's not clear whether
multiple ElasticSearch nodes might be writing the same blob concurrently,
and what they expect to happen if they do. Can somebody enlighten me about
that?
Only the primary shard in each shard replication group writes to the blob
store, so there is only a single writer per shard path. I think QUORUM is a
good value: if a write fails, it will simply be retried next time. Reading
from the blob store only happens when a primary shard is allocated and it
needs to recover its data.
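To make the QUORUM arithmetic concrete, here's a small illustrative sketch (not from the plugin, just the majority math behind "at least three nodes to tolerate one failing"):

```python
# QUORUM means a majority of replicas must acknowledge a read or write.
# With replication factor N, that is N // 2 + 1 replicas, so the operation
# still succeeds with N - (N // 2 + 1) replicas down.

def quorum(replication_factor: int) -> int:
    """Replicas that must respond for a QUORUM operation to succeed."""
    return replication_factor // 2 + 1

def tolerated_failures(replication_factor: int) -> int:
    """Replicas that can be down while QUORUM still succeeds."""
    return replication_factor - quorum(replication_factor)

# With three replicas, QUORUM needs two acks and tolerates one failure.
print(quorum(3), tolerated_failures(3))  # 2 1
```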
Good news, thanks.
The Cassandra gateway is not much beyond a proof-of-concept but it does
work and we're currently testing with it at Gist http://www.gist.com/.
Its biggest limitation is that it only talks to a single Cassandra server.
It's using a bare thrift interface; I'll be looking into using an existing
higher-level Java interface that supports failover and such.
I'm not sure how well it will handle very large blobs. Handling large
blobs wasn't a goal. We intend to use this with per-user indexes that won't
get very large. We're also using it with the code on the gistlru branch of
that same git repo, written by my co-worker Matt, which allows us to use
index.store.type = memory and keep only indexes for active users in memory,
the rest being persisted to the gateway. We hope.
There is already automatic support for that. If you
set gateway.blobstore.chunk_size, files will be automatically broken
down into several blobs. This is how it is done with the s3 gateway, which
actually sets the default chunk size to 100MB.
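As an illustration of the chunking idea (the function name is made up, not ElasticSearch's actual code), splitting a blob at a fixed chunk size amounts to:

```python
# Hypothetical sketch: break a file's bytes into fixed-size chunks, each of
# which would be stored as a separate blob. The last chunk may be smaller.

def split_into_chunks(data: bytes, chunk_size: int) -> list[bytes]:
    return [data[i:i + chunk_size] for i in range(0, len(data), chunk_size)]

blob = b"x" * 250
chunks = split_into_chunks(blob, 100)
print([len(c) for c in chunks])  # [100, 100, 50]
```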
Ok, I saw that, but I pretty well ignored it for this go, since I could.