Evaluating ES Questions

Hello All,
First off, ES seems to be pretty great from my testing. If all
holds, I would love to choose this sort of product over an Apache
Zookeeper + SOLR setup (ugh, the admin).

After testing a few things and trying out some configs, I have the
following questions:

1 - We currently run a web crawler that gets around 100K documents per
day. As we get the docs, we tag them with metadata, such as the topics
associated with them, named entities, whether it is a video, etc.
We are up to around 30 million docs.
What sort of tricks/performance should I expect from distributed ES
when sorting by publication date?
So, if I do a query asking for all documents that match a "Topic",
then sort by date, is that something that will query against all
shards/replicas?
For the purpose of this question, let's say I do 3 shards, 2
replicas.
From just reviewing, it seems like all shards or replicas would be
queried, then whoever initiated the query would get the results back,
and that box would then have to sort. Is that correct?
Any better way to do that?

a - In the realm of better ways, would forcing certain Topic- or
meta-tagged content into the same shards make sense?
b - Is this something where you would want to have an index per
Topic? We could have millions of topics, though.

Or am I worried about nothing here, and just having some small
number of shards + replicas would give me sub-second responses, no
real trickery needed?

  • just some more on this. Most of the time we do a query where we
    get documents that have "Topic(s)" and return by date. Sometimes we do
    "Topic(s)" + keywords of the text, or Topic(s) + what source they came
    from.

2 - Shards and replica placements
I have seen talk in this group about multi-data center awareness,
and wanted to see if there was any good response to this.
We are hosted at EC2 and would initially put these ES clusters in
us-east. I would want a hot backup in us-west, though, just in case
east failed. I would not want any of those shards/replicas to take
any traffic, however.
Almost like... I have 3 shards/2 replicas in us-east, across say 5
nodes. I would want to run say 2 nodes in us-west as a hot backup,
just in case there is a major failure.
Any way to do that? Or is it just: you fail in us-east, and then you
try to start from your s3 gateway in us-west?
Or is there a way to have other machines in us-west be a gateway,
like s3 is, and recover from those?

3 - Speaking of #2 and recovery
I know that there is the s3 gateway. What is a good way to attach an
EBS volume, snapshot that EBS, and then recover from that snapshot?
Would all "shard" nodes need to have this EBS?
Or would every node in the cluster need EBS, and you have to snapshot
all of them just to have recovery?

4 - index freshness
So, I got a bit confused by the documentation on index freshness.
There is an index refresh versus the transaction log timings. I'm just
trying to figure out what happens after I insert a doc.
So, at time 0, I index document A.
By default, after 1 second, I should be able to query for document A?
That is what the index refresh says.
Or is it 60 minutes, which is the default on the transaction log?

If it is the default 1 second for the index, what is the
transaction log doing that it has some kind of 60-minute flush?

5 - API queries for health errors
So, according to the docs, I was supposed to be able to call this
API URL:
/_cluster/health?pretty=true

which gave me an error that said you can't have a cluster with "_" in
the name. So then I did:
//health?pretty=true

and got this error:
{
  "error" : "ElasticSearchParseException[Failed to derive xcontent from (offset=0, length=0): []]",
  "status" : 500
}

Getting and putting documents worked fine though, so I'm just trying to
figure out why this did not work.

Thanks for any help in the evaluation. So far, this seems like the
easiest and best way to manage a large, distributed index. Way better
than the Zookeeper/SOLR path.

Scott

Heya, answered inline:

On Wed, Aug 31, 2011 at 7:11 PM, Scott Decker scott@publishthis.com wrote:

Hello All,
First off, ES seems to be pretty great from my testing. If all
holds, I would love to choose this sort of product over an Apache
Zookeeper + SOLR (ugh the admin)

After testing a few things, and trying out some configs, I have the
following questions

1 - We currently run a web crawler that gets around 100K documents per
day. As we get the docs, we tag it with meta data, such as the topics
associated with it, named entities, is it a video, etc.
We are up to around 30 million docs.
What sort of tricks/performance would I expect to get using this
distributed ES when Sorting by a publication date?
So.. if I do a query where I ask to show me all documents that match
a "Topic" then sort by date, is this something that will query against
all shards/replicas?
For the purpose of this question, lets say I do 3 shards, 2
replicas.
From just reviewing, it seems like all shards or replicas would be
queried for the query, then who ever initiated the query would get the
results back, and that box would then have to SORT. Is that correct?
Any better way to do that?

What do you mean by "that box will have to sort"? The results will be sorted
and "reduced" before being returned by elasticsearch. If you have 3 shards
and 2 replicas, it will go and hit the 3 shards (round robin by default between
the replicas), execute the search on each, reduce (re-sort) the results, and
return them.
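
For illustration, a "match a Topic, then sort by date" request might look something like the sketch below (the docs index name and the topic / publish_date field names are made-up placeholders for whatever your mapping actually uses):

# hypothetical example: term query on a topic field, sorted by publication date;
# the request fans out to the 3 shards (or their replicas), and the node that
# received it merges and re-sorts the per-shard results before responding
curl -XGET 'http://localhost:9200/docs/_search?pretty=true' -d '{
  "query" : { "term" : { "topic" : "sports" } },
  "sort"  : [ { "publish_date" : { "order" : "desc" } } ],
  "size"  : 10
}'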

a - in the realms of better ways, would forcing certain Topic or
Meta Tagged content into the same shards make sense?
b - Is this something where you would want to have an index per
Topic? We could have millions of topics though

Does not sound like this is what you want to do.

Or, am I worried about nothing here, and just having some small
amount of shards + replicas would give me sub second responses, no
real trickery needed?

Should be, but you should test. The number of shards and replicas does require
some upfront testing.
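
If you want to experiment with different counts, shards and replicas are set when the index is created, something like the following (the index name is just an example):

# hypothetical index creation with explicit shard/replica counts for testing
curl -XPUT 'http://localhost:9200/docs' -d '{
  "settings" : {
    "index" : {
      "number_of_shards" : 3,
      "number_of_replicas" : 2
    }
  }
}'

Note that the number of replicas can be changed later, but the number of shards is fixed once the index is created, so it is worth settling on that during testing.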

  • just some more on this. Most of the time we do a query where we
    get documents that have "Topic(s)" and return by date. Sometimes we do
    "Topic(s)" + keywords of the text, or Topic(s) + what source they came
    from.

2 - Shards and replica placements
I have seen talk in this group about multi-data center awareness,
and wanted to see if there was any good response to this.
We are hosted at EC2 and would initially put these ES2 clusters in
us-east. I would want a hot backup in us-west though, just in case
east failed. I would not want any of the shards/replicas to take any
traffic though.
Almost like... I have 3 shards/2 replicas in us-east, across say 5
nodes. I would want to run say 2 nodes in us-west as a hot backup,
just in case there is a major failure.
Anyway to do that? Or is it just, you fail in us-east, and then you
try and start from your s3 gateway in us-west?
Or, is there a way to have other machines in us-west be a gateway,
like s3 is, and recover from those?

The multi data center scenario has been discussed on the mailing list. The
proper elasticsearch solution is two different clusters that are kept in
sync with one another. Currently, you have to make sure of that yourself
(for example, by applying the same changes done on one DC to the other).
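
At the application level, "applying the same changes" can be as simple as writing every document to both clusters; a rough sketch (hostnames, index, type and id are all made up):

# index the same document into the active us-east cluster and the stand-by us-west cluster
curl -XPUT 'http://es-us-east.example.com:9200/docs/article/1' -d '{ "topic" : "sports", "publish_date" : "2011-08-31" }'
curl -XPUT 'http://es-us-west.example.com:9200/docs/article/1' -d '{ "topic" : "sports", "publish_date" : "2011-08-31" }'

In practice you would want this behind a queue or retry logic so that a temporary failure of one DC does not silently drop writes there.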

3 - Speaking of #2 and recovery
I know that there is the s3 gateway, what is a good way to possibly
attach an EBS, snapshot that EBS, and then recover from that
snapshot? Would all "shard" nodes need to have this EBS?
Or, would every node in the cluster need EBS, and you have to
snapshot all of them, just to have recovery?

You can use the default local gateway, not the s3 one. The local gateway uses
each node's "local" storage (which can be placed on EBS), and the cluster knows
how to recover from that.
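
A rough sketch of that on EC2 (device names and paths are just examples): every data-holding node keeps its storage on its own EBS volume, and you snapshot each of those volumes for backup.

# format and mount an attached EBS volume for this node's data directory
sudo mkfs.ext4 /dev/xvdf
sudo mkdir -p /mnt/es-data
sudo mount /dev/xvdf /mnt/es-data

# then in elasticsearch.yml on that node (local gateway is the default):
#   gateway.type: local
#   path.data: /mnt/es-data

Since the local gateway keeps each node's shards on that node's own disk, recovery from snapshots means restoring a volume for every data node, not just one of them.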

4 - index freshness
so, I got a bit confused at the documentation on index freshness.
There is a index refresh versus the transaction log timings. Just
trying to figure out what happens after I insert a doc.
So, at time 0, I index document A
by default, after 1 second, I should be able to query for document
A?

Yes.

That is what the index refresh says.
Or, is it 60 minutes, which is the default on the transaction log?

Did that part of the docs mention "index freshness"? Why the confusion? The
transaction log flush has nothing to do with index freshness.

if it is the default 1 second for the index, what is the
transaction log doing that it has some kind of 60 minute flush?

A flush basically means cleaning up the transaction log and "committing"
things in Lucene. This is done in order not to let the transaction log grow
too big, since replaying a large one on restart would be slow.
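
You can see the two mechanisms separately by triggering them by hand (the index name is an example):

curl -XPOST 'http://localhost:9200/docs/_refresh'   # make recently indexed docs visible to search
curl -XPOST 'http://localhost:9200/docs/_flush'     # Lucene commit + clear the transaction log

The automatic versions of these are the index refresh interval (1 second by default) and the translog flush thresholds (operations/size/period), which is where the much longer flush timing you saw comes from.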

5 - API queries for health errors
so, according to the docs, I was supposed to be able to call this
api url
/_cluster/health?pretty=true

which gave me an error that said you can't have a cluster with "_" in
the name
so, then I did
//health?pretty=true

and got this error:
{
"error" : "ElasticSearchParseException[Failed to derive xcontent
from (offset=0, length=0): ]",
"status" : 500
}

The API is: curl -XGET host:port/_cluster/health. I think you did a POST/PUT instead of a GET.
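
In full, with the pretty flag you were using:

curl -XGET 'http://localhost:9200/_cluster/health?pretty=true'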

Getting and Putting documents were fine though, so, just trying to
figure out why this did not work.

Thanks for any help in the evaluation. So far, this seems like the
easiest and best way to manage a large, distributed index. Way better
than the Zookeeper/SOLR path.

Scott

Super helpful, thanks!!

On the sorting piece, here is what I was trying to figure out.

Let's say my query spans all 3 shards, and I do a sort by date
with a limit of only returning 10 items.
Does the query and sort run on each shard, with a limit of 10 applied
there, so only those results are returned to the main caller, merged,
and then only the "limit" of what I was looking for returned?

Or does each shard run the query and sort it, then all results are
returned to the main caller, the results are merged, and then the limit
is applied?

I hope it is the first one, because that would mean far less
network/bandwidth happening between shards/replicas.

Thanks,
Scott


Heya,

It's the first option, which does mean less network traffic. But it does mean
that you might not get a fully globally sorted result (i.e. the 11th hit on the
1st shard might still sort higher than the other 10 hits from the other 2
shards). You can ask for more hits in this case, something like size *
number_of_shards, to make sure you get what you want. And thanks to the fact
that the default search type is query then fetch, there will still be less
network traffic: instead of streaming 3 * 10 = 30 full hits from each shard
(90 in total), each shard sends just its 30 "id and score/sort value" entries,
and then only the best 30 (and not 90) full hits are fetched from the
respective shards they belong to.
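
To make the query-then-fetch mechanics concrete (same made-up index and field names as before):

# with the default search type (spelled out here), each shard returns only doc
# ids plus the publish_date sort values for its top candidates; the node
# handling the request merges those and then fetches just the winning documents
# from the shards they live on
curl -XGET 'http://localhost:9200/docs/_search?search_type=query_then_fetch&pretty=true' -d '{
  "query" : { "term" : { "topic" : "sports" } },
  "sort"  : [ { "publish_date" : { "order" : "desc" } } ],
  "size"  : 10
}'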

Long answer..., but it brings me to a possible feature in elasticsearch
where we could do this automatically for you (use the number of shards to
automatically control how many hits are requested from each shard).
Naturally, this does not scale well for a very large number of shards, but it
can be good for most cases.

-shay.banon


Actually, strike that: I got confused and answered the usual facet
aggregation question. Search-hits-wise, all is well.
