Elasticsearch newbie questions

Dear list,

I've been reading the documentation and this list in order to evaluate the
adoption of elasticsearch in a project.
I've got a few open questions I'd like to see clarified.

  • number of indexes in a cluster

What are the effects of the number of different indexes managed by a single
cluster? What parameters and factors come into play when using multiple
indexes, and what kind of overhead occurs as the number of managed
indexes grows?
What would be the effect of having a high number of indexes, say 10000, of
which only 1000 or so are actively used (indexing and querying), with the
remaining 9000 being used very infrequently (say a few times per day)?

  • Indexes and mappings

The mapping documentation states [1]: "Mapping types are a way to try
and divide the documents indexed into the same index into logical groups.
Think of it as tables in a database".
I understand that a mapping can't be updated [2] and that a full index
rebuild is needed if the mapping definition changes.
Is this still the case if the mapping change is only the addition of a new
field?
If adding a new field to a mapping is possible, where can I find an example
of how to do it?

  • Document storage

By default elasticsearch stores the indexed document. Only one post on the
mailing list [3] explains how to turn this feature off. Are there any
consequences to turning document storage off?
What's the overhead in terms of storage, memory usage and indexing speed of
having document storage turned on or off?
Will the update [4] script work with document storage turned off? (It's my
understanding that, because of Lucene internals, updating a document actually
means deleting the old document from the index and adding the new one.)

[1] http://www.elasticsearch.org/guide/reference/mapping/
[2] https://groups.google.com/d/topic/elasticsearch/nPHP9mH_C4Y/discussion
[3] https://groups.google.com/forum/#!topic/elasticsearch/4XDF_mHRgAw/discussion
[4] http://www.elasticsearch.org/guide/reference/api/update.html

Thanks,

Paolo

--

Answers inline.

On Tue, Aug 21, 2012 at 4:20 AM, Paolo Negri hungryblank@gmail.com wrote:

What are the effects of the number of different indexes managed by a single
cluster? What parameters and factors come into play when using multiple
indexes, and what kind of overhead occurs as the number of managed indexes
grows?
What would be the effect of having a high number of indexes, say 10000, of
which only 1000 or so are actively used (indexing and querying), with the
remaining 9000 being used very infrequently (say a few times per day)?

Each shard is essentially a Lucene index, so every Elasticsearch index
consumes OS resources, especially open file handles. I can't speak to
performance at that scale, since I keep the number of indexes very low.
Filtered aliases can help keep the number of indexes low.
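
As a rough illustration of the filtered-alias idea: a single shared index
can be exposed under many alias names, each restricted by a filter, instead
of creating one index per user or tenant. The sketch below only builds the
JSON body for the _aliases endpoint. The index name "users", the alias name
"user_42" and the "user_id" field are made up for the example.

```python
import json


def filtered_alias_actions(index, alias, field, value):
    """Build an _aliases request body that adds one filtered alias.

    The index, alias and field names passed in are hypothetical placeholders.
    """
    return {
        "actions": [
            {
                "add": {
                    "index": index,
                    "alias": alias,
                    # Searches through the alias only see documents
                    # that match this term filter.
                    "filter": {"term": {field: value}},
                }
            }
        ]
    }


body = filtered_alias_actions("users", "user_42", "user_id", 42)
print(json.dumps(body, indent=2))  # body for POST /_aliases
```

With this approach the 9000 rarely used datasets could share a handful of
physical indexes, so the open-file cost depends on the number of shards
rather than on the number of aliases.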

Is this still the case if the mapping change is only the addition of a new
field?

It depends on whether a document with that field has already been indexed.
If that field has never been indexed in any document, then the
mapping change does not require a new index. Not every field change
requires a re-index. Changing the type requires one, but I believe
changing the default analyzer does not (although the results might be
confusing if documents are analyzed differently).
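
Since the question asked for an example of adding a field, here is a minimal
sketch. It builds a put-mapping request (without sending it) using the typed
/{index}/{type}/_mapping path of the releases current at the time of this
thread. The host, the index "blog", the type "post" and the field "subtitle"
are assumptions for illustration. Later releases removed mapping types and
replaced the "string" field type, so check the documentation for your
version.

```python
import json
import urllib.request


def put_mapping_request(base_url, index, doc_type, field, field_def):
    """Build (but do not send) a put-mapping request that adds one field."""
    # Only the new field is listed; existing fields are left untouched.
    body = {doc_type: {"properties": {field: field_def}}}
    return urllib.request.Request(
        url=f"{base_url}/{index}/{doc_type}/_mapping",
        data=json.dumps(body).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="PUT",
    )


req = put_mapping_request(
    "http://localhost:9200", "blog", "post", "subtitle", {"type": "string"}
)
print(req.get_method(), req.full_url)
print(req.data.decode("utf-8"))
# urllib.request.urlopen(req) would send it to a running node.
```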

are there any consequences to turning document storage off?

Without the source, you would need to retrieve data from each individual
stored field. Each field would require its own seek in Lucene instead of
one call to retrieve the entire source.
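
To make that trade-off concrete, here is a sketch of a mapping that disables
_source and stores two fields individually, followed by a search body that
has to name each field it wants back. The type "post" and the fields "title"
and "author" are hypothetical. The "store": "yes" value and the "fields"
search parameter follow the syntax of older releases; newer ones use boolean
values and a differently named parameter, so treat the exact keys as an
assumption to verify against your version.

```python
import json


def mapping_without_source(doc_type, stored_fields):
    """Build a mapping that disables _source and stores fields one by one."""
    properties = {
        name: {"type": field_type, "store": "yes"}
        for name, field_type in stored_fields.items()
    }
    return {doc_type: {"_source": {"enabled": False}, "properties": properties}}


mapping = mapping_without_source("post", {"title": "string", "author": "string"})
print(json.dumps(mapping, indent=2))

# With no _source to return, a search must list every stored field it needs.
search_body = {"query": {"match_all": {}}, "fields": ["title", "author"]}
print(json.dumps(search_body))
```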

What's the overhead in terms of storage, memory usage and indexing speed of
having document storage turned off or on?

I've never benchmarked it myself. Having the source available has been too
useful to turn off.

Will the update [4] script work with document storage turned off? (It's my
understanding that, because of Lucene internals, updating a document actually
means deleting the old document from the index and adding the new one.)

I've never used update, but I believe it requires source to be enabled.
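
The reasoning can be shown with a toy model. This is not Elasticsearch code,
just an in-memory stand-in written for this thread: an update reads the
stored original document, applies the change, and indexes the result as a
replacement. If the original was never stored, there is nothing to apply
the change to.

```python
class ToyIndex:
    """In-memory stand-in for an index; purely illustrative."""

    def __init__(self, store_source=True):
        self.store_source = store_source
        self._sources = {}   # doc id -> original document
        self._versions = {}  # doc id -> number of times it was indexed

    def index(self, doc_id, doc):
        # Indexing replaces any earlier document with the same id.
        if self.store_source:
            self._sources[doc_id] = dict(doc)
        self._versions[doc_id] = self._versions.get(doc_id, 0) + 1

    def update(self, doc_id, change):
        if doc_id not in self._versions:
            raise KeyError(doc_id)
        if not self.store_source:
            raise RuntimeError("cannot update: original document was not stored")
        doc = dict(self._sources[doc_id])  # 1. fetch the stored original
        change(doc)                        # 2. apply the change
        self.index(doc_id, doc)            # 3. re-index as a replacement
        return doc


idx = ToyIndex(store_source=True)
idx.index("1", {"counter": 1})
result = idx.update("1", lambda d: d.update(counter=d["counter"] + 4))
print(result)
```

Creating the toy index with store_source=False and calling update raises
the RuntimeError, which mirrors the constraint described above.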

Cheers,

Ivan

--