First post, some questions

Hi,
first of all I would like to say thanks for this great piece of software.
I see a lot of potential for making elasticsearch a Swiss Army knife for
search and database development in general.

As an ES newbie I still have some specific questions even after
reading the docs.
I have seen many requests about using ES as the primary database, which is
what I want to do here as well. Furthermore, I see no problem at all in
'emulating/mimicking' SQL-like syntax; ES provides all of that and more
to handle these cases. Joins can be avoided thanks to the document-store
nature.
ACID has a lower priority here, so what are the best practices for using ES
in a more DB-like way?

  1. mapping: to match on complete fields, the field always has to be
    explicitly indexed as keyword, otherwise ES will tokenize the field
    with the standard tokenizer, which will prevent a complete field match.
    Is this correct?
    If a partial match is needed as well, the field could be mapped as a
    multi_field. Is this the optimal solution for a simple but complete
    text field match?

  2. mapping templates: mapping can be done by API or file. For the file
    approach the docs are not so clear to me. Do I have to copy a file
    named after the type (e.g. for /twitter/tweets, a file tweets.json) to
    config/mappings/twitter/tweets.json?
    Will ES load this tweets.json on every startup, so that changes are
    picked up at that point? Can I use this to define the index
    (twitter.json) and all analyzers, tokenizers etc. as well? And if so,
    where do I copy this file (_default)?

  3. sharding: from what I understand, sharding will improve indexing
    speed, and replicas improve search speed and durability? If so, what
    is the reason not to set the shard count to a very high value on a
    single box? Dispatcher overhead, or will this degrade search speed on
    a single server?

  4. backup: the HDFS gateway can be used as a backup gateway. Will adding
    this gateway to a running system sync to HDFS automatically, or is it
    necessary to activate this gateway before filling ES?

I have about 1 TB of zipped archives here (~40 million docs) which I want
to store in ES. Each doc has 3 fields; only one text field (50 chars) has
to be indexed for a quick lookup. Is it possible to store this on a single
server (32 GB RAM, 2 TB SSD)?

I know I could try out most of these questions myself, but perhaps I can
get some ideas and tips to point me down the right path.

thanks in advance
tom

--

Hello and welcome, Tom :)

I'll answer inline.

On Sat, Oct 27, 2012 at 4:35 PM, Tom tombon2007@gmail.com wrote:

Hi,
first of all I would like to say thanks for this great piece of software.
I see a lot of potential for making elasticsearch a Swiss Army knife for
search and database development in general.

As an ES newbie I still have some specific questions even after
reading the docs.
I have seen many requests about using ES as the primary database, which is
what I want to do here as well. Furthermore, I see no problem at all in
'emulating/mimicking' SQL-like syntax; ES provides all of that and more
to handle these cases. Joins can be avoided thanks to the document-store
nature.
ACID has a lower priority here, so what are the best practices for using ES
in a more DB-like way?

I don't know a generic answer to this general question, but if you
have more specific ones (like below), I'd be happy to contribute. If
the question is about the points below, please ignore this comment :)

  1. mapping: to match on complete fields, the field always has to be
    explicitly indexed as keyword, otherwise ES will tokenize the field
    with the standard tokenizer, which will prevent a complete field
    match. Is this correct?

Correct.

If a partial match is needed as well, the field could be mapped as a
multi_field. Is this the optimal solution for a simple but complete text
field match?

I suppose in most cases this would be the best option. With a few notes:

  • if you only want complete matches, you can just map your field as
    "not_analyzed" (I hope I'm not being too Captain Obvious here)
  • if you store the source of the document (which is the default
    setting), you might want to use _source for complete matches. That
    would be slower, but there are situations where you'd have to use it
    anyway. For example, if you want to do a Terms Facet with lots of
    terms and you run into memory issues with the default approach,
    which uses the indexed field.
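To make the multi_field option concrete, here's a sketch of such a mapping
(the field name "title" and the sub-field "exact" are just made-up
examples; the syntax is the multi_field type from the 0.x releases):

```json
{
  "tweets": {
    "properties": {
      "title": {
        "type": "multi_field",
        "fields": {
          "title": {"type": "string", "index": "analyzed"},
          "exact": {"type": "string", "index": "not_analyzed"}
        }
      }
    }
  }
}
```

You'd then search "title" for tokenized matches and "title.exact" (with a
term query or filter) for complete field matches.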

  2. mapping templates: mapping can be done by API or file. For the file
    approach the docs are not so clear to me. Do I have to copy a file
    named after the type (e.g. for /twitter/tweets, a file tweets.json) to
    config/mappings/twitter/tweets.json?
    Will ES load this tweets.json on every startup, so that changes are
    picked up at that point?

Actually, you don't need to restart ES in order to use config
mappings. So you can just have something like this:

cat /etc/elasticsearch/mappings/twitter/tweets.json

{
  "tweets": {
    "properties": {
      "foo": {
        "type": "string"
      }
    }
  }
}

Then, assuming the index "twitter" doesn't already exist, if you do:

curl -XPUT localhost:9200/twitter/tweets/1 -d '{"bar":1}'

You can see that twitter/tweets was created with the defined mapping.
Plus, as dynamic mapping is enabled by default, ES will also add the
field "bar", autodetected as long. So your mapping will be:

curl -XGET localhost:9200/twitter/tweets/_mapping?pretty=true

{
  "tweets": {
    "properties": {
      "bar": {
        "type": "long"
      },
      "foo": {
        "type": "string"
      }
    }
  }
}

Can I use this to define the index (twitter.json) and all analyzers,
tokenizers etc. as well? And if so, where do I copy this file (_default)?

If you want to define index-level settings, you can do that with index
templates (see the index templates documentation).

As with mappings, you can handle them through the REST API or by
adding configuration files. And for config files, an ES restart is not
required - the template takes effect as long as the config file is there
when the index is created.
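For example, such a config-file template might look like this (the
template name, index pattern, and analyzer here are hypothetical; as far
as I know, these files go under a templates/ directory inside your
configuration directory):

```json
{
  "template_twitter": {
    "template": "twitter*",
    "settings": {
      "index.number_of_shards": 2,
      "index.analysis.analyzer.lowercase_keyword": {
        "type": "custom",
        "tokenizer": "keyword",
        "filter": ["lowercase"]
      }
    }
  }
}
```

Any index whose name matches "twitter*" would then be created with these
settings, including the custom analyzer.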

Regarding _default, you can put your tweets.json file (see above) in a
directory named mappings/_default/ under your configuration directory.
For example, /etc/elasticsearch/mappings/_default/tweets.json. This
will apply your mapping to all the types called "tweets", no matter
the index name.

  3. sharding: from what I understand, sharding will improve indexing
    speed, and replicas improve search speed and durability?

Yes, that's the rule of thumb - but you need to have enough nodes. For
example, on the search-speed front, if you have 2 shards and 2
replicas (4 shards in total) on 2 nodes, your searches are likely to be
slower than if you disable the replicas - ending up with 2 shards on 2
nodes (one per node).

If so, what is the reason not to set the shard count to a very high value
on a single box? Dispatcher overhead, or will this degrade search speed
on a single server?

Each shard is a Lucene index. This comes with a memory overhead, and
they also need to be "coordinated". More shards will create more such
overhead.
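That's why, on a single box, I'd stick to a modest shard count and set
replicas to 0 (replicas can be raised later on a live index, while the
shard count is fixed at index creation). A sketch of such index settings,
passed in the body when creating the index:

```json
{
  "settings": {
    "index.number_of_shards": 2,
    "index.number_of_replicas": 0
  }
}
```

If you later add nodes, you can bump number_of_replicas through the update
settings API without reindexing; changing number_of_shards would require
reindexing.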

  4. backup: the HDFS gateway can be used as a backup gateway. Will adding
    this gateway to a running system sync to HDFS automatically, or is it
    necessary to activate this gateway before filling ES?

I've never tried it, but I assume you have to define the gateway
first, especially if you store your indices on disk (which is the
default). This is what I understand from the gateway documentation.

I have about 1 TB of zipped archives here (~40 million docs) which I want
to store in ES. Each doc has 3 fields; only one text field (50 chars) has
to be indexed for a quick lookup. Is it possible to store this on a single
server (32 GB RAM, 2 TB SSD)?

Unless you run out of storage, it sounds reasonable. But of course
there's no getting away from testing, because it all depends on lots
of factors.

Best regards,
Radu

http://sematext.com/ -- Elasticsearch -- Solr -- Lucene
