New to ES


(yoni) #1

Hi,
I'm new to ES and to the search server world. I decided to try ES
because I read in some reviews that it is easy to use.
I managed to index my docs (10M) and run some queries, but the
problem is that they are very slow (10 sec).

I didn't configure anything and didn't change the yml file or any
other file. My problem with setting up and tuning the system is that
I have very little background in indexing, clusters, nodes, etc. I
don't know how to proceed, because the more I read the less I
understand, and each subject leads me to more subjects...

Does anyone know a very basic tutorial on the basic concepts and on
basic performance tuning?

1) If not, maybe just explain to me the difference between index
name and index type.
2) Does every doc need its own ID?

Thanks to everyone who helps,
jhonny


(Lukáš Vlček) #2

Hi,

If you can share some info about your documents and the kind of queries
you were running, that would help a lot.

As for 1)
You can think of an index name as an identifier for a single Lucene index.
All document types that fall under the same index name are indexed into the
same Lucene index. (The index can in fact be distributed, so for one index
name there can exist several Lucene indices on different machines, but
let's not complicate the discussion with that for now.) Try searching the ES
doc pages and reading a bit about it. Index names and types are not
complicated.
For example take a look at:

As for 2)
No, if you do not provide a doc ID then ES will generate a unique ID for it.
http://www.elasticsearch.org/guide/reference/api/index_.html (Automatic ID
Generation)
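
To make the two cases concrete, here is a minimal sketch of the REST request lines for indexing one document, assuming the layout described on the index API page linked above (PUT with an explicit ID, POST to the type for a generated one). The class name, the index name "books" and the type "book" are hypothetical and only serve the illustration.

```java
// Sketch only: builds the HTTP method and path for indexing one document.
// "books" (index name) and "book" (type) are hypothetical examples.
final class IndexRequestLine {

    /** Explicit ID: the caller chooses the document ID. */
    static String withId(String index, String type, String id) {
        return "PUT /" + index + "/" + type + "/" + id;
    }

    /** No ID given: POST to the type and let ES generate a unique ID. */
    static String withGeneratedId(String index, String type) {
        return "POST /" + index + "/" + type + "/";
    }

    public static void main(String[] args) {
        System.out.println(withId("books", "book", "42"));
        System.out.println(withGeneratedId("books", "book"));
    }
}
```

The index name selects the index and the type groups documents of one kind inside it; the ID is optional.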

Regards,
Lukas


(yoni) #3

First, thanks for the quick reply, it helped a lot.
What I tried to do is use the ES Java API to index my DB. Each
record in the DB represents a book. This is the code:
    // needs: import static org.elasticsearch.common.xcontent.XContentFactory.jsonBuilder;
    IndexResponse response = client
        .prepareIndex(ES.INDEX_NAME, ES.INDEX_TYPE, String.valueOf(counter))
        .setSource(jsonBuilder()
            .startObject()
                .field("id_unique", counter)
                .field("id", catalogID.toString())
                .field("parentIsbn", parentIsbn)
                .field("title", rs.getString("Title"))
                // lower-cased copy of the title
                .field("title_exact", rs.getString("Title").toLowerCase())
                .field("seriesId", rs.getString("SeriesID"))
                .field("seriesNumber", rs.getString("SeriesNumber"))
                .field("publicationDate", rs.getString("PublicationDate"))
                .field("editionNumber", rs.getInt("EditionNumber"))
                .field("edition", rs.getString("EditionDescription"))
                .field("createDate", rs.getDate("CreateDt"))
                .field("updateDate", rs.getDate("UpdateDt"))
                .field("description", rs.getString("DescriptionTxt"))
                .field("textbook", rs.getInt("TextBook") == 1)
                .field("catgory.k", "bbbbbbbbbbbbbbbbbbbb")
                .field("longText",   // value missing in the original post
            .endObject())
        .execute()
        .actionGet();

In my code each book gets a different unique index ID (the loop
counter). Now, I don't know how this data is analyzed on the server:
on which field is an index created?
Because now, when I try to search for a word or a 2-3 word
combination across the whole doc (q=" ") or search directly
inside a field (q=title:),
it takes something like 10-15 seconds.
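
For reference, the two kinds of browser query described above can be built like this. It is only a sketch: the host "localhost:9200", the index name "books" and the search terms are assumptions for illustration, not values from my setup. A query without a field prefix searches across all fields, and a "field:" prefix restricts it to that field.

```java
import java.io.UnsupportedEncodingException;
import java.net.URLEncoder;

// Sketch only: builds a URI-search URL of the form /{index}/_search?q=...
final class UriSearch {

    static String url(String hostAndPort, String index, String query) {
        try {
            // URL-encode the query so spaces and ':' survive in the address bar.
            return "http://" + hostAndPort + "/" + index + "/_search?q="
                    + URLEncoder.encode(query, "UTF-8");
        } catch (UnsupportedEncodingException e) {
            throw new IllegalStateException("UTF-8 is always supported", e);
        }
    }

    public static void main(String[] args) {
        // Hypothetical queries: one across all fields, one on the title field.
        System.out.println(url("localhost:9200", "books", "java programming"));
        System.out.println(url("localhost:9200", "books", "title:java"));
    }
}
```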

How can I know on which field of the JSON an index is created?
Does it have a non-clustered index?
Or how can I decide which fields I want to be indexed?

And what settings can help me improve performance? (Right now all my
settings are the defaults, and my indexing process is simply a loop
over the DB records that sends them using the Java API.)

Thanks very much


(Lukáš Vlček) #4

Hi,

As a next step I would recommend taking a look at mapping:
http://www.elasticsearch.org/guide/reference/mapping/index.html

Once you have indexed your data, check which mappings were used:
http://www.elasticsearch.org/guide/reference/api/admin-indices-get-mapping.html

This will help you understand which fields are searchable and which are not
(from a quick glance at your example I think all fields are searchable).
Most of your fields (probably all) will be automatically mapped to one of
the core types. The documentation below explains which analysis is used by
default (you can change it, but since you do not define a mapping now, the
default analyzers are used):
http://www.elasticsearch.org/guide/reference/mapping/core-types.html
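
As a rough illustration of what an explicit mapping could look like, the sketch below builds a mapping body for three of the fields from your example. It assumes the core-types syntax of the ES versions current at the time of this thread ("string" fields with an "index" option of "analyzed", "not_analyzed" or "no"); later versions use different type names. The type name "book" and the per-field choices are hypothetical.

```java
// Sketch only: a mapping body as a plain JSON string, no ES client required.
final class BookMapping {

    static String json() {
        return "{\n"
            + "  \"book\": {\n"
            + "    \"properties\": {\n"
            // analyzed with the default analyzer: full-text searchable
            + "      \"title\": { \"type\": \"string\" },\n"
            // indexed as a single token: exact matches only
            + "      \"title_exact\": { \"type\": \"string\", \"index\": \"not_analyzed\" },\n"
            // kept in the document but not searchable
            + "      \"seriesId\": { \"type\": \"string\", \"index\": \"no\" }\n"
            + "    }\n"
            + "  }\n"
            + "}";
    }

    public static void main(String[] args) {
        System.out.println(json());
    }
}
```

Comparing a body like this with what the get-mapping API returns for your index shows which fields ES mapped automatically and how.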

The search response is quite slow in your case. Apart from incorrect use of
the search API, there can be many other reasons (for example, your cluster
connects to another cluster on your network, some expensive process is
running on the machine, or it does not have enough memory).

BTW, do not hesitate to share your search code as well. (You can use a gist
instead of pasting code snippets into mail.)

Regards,
Lukas


(David Pilato) #5

About performance, I will add:

  • It seems that you use the default ES settings. For more than 1M docs, I
    suggest modifying the memory settings:
    set ES_MIN_MEM and ES_MAX_MEM (they default to 256m and 1g).
  • If you run on 32-bit Windows, you will certainly be constrained by the
    physical memory available to the JVM (contiguous memory), so you will
    certainly not be able to go beyond 1500m :(
  • As you are "starting" with ES, try to play with fewer documents (IMHO 100k
    docs is enough to test search, facets and index/shards/nodes concepts).
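
A small sketch to check what heap a JVM actually gets on a given machine. It assumes that the ES startup script passes ES_MIN_MEM and ES_MAX_MEM to the JVM as -Xms and -Xmx; the class name is hypothetical. Note that it reports the heap of the JVM it runs in, not of a running ES node.

```java
// Sketch only: prints the heap limits of the JVM this class runs in.
final class HeapReport {

    /** Formats a byte count as whole megabytes, e.g. "256m". */
    static String toMegabytes(long bytes) {
        return (bytes / (1024L * 1024L)) + "m";
    }

    public static void main(String[] args) {
        Runtime rt = Runtime.getRuntime();
        // maxMemory: upper bound the heap may grow to (roughly -Xmx)
        System.out.println("max heap:   " + toMegabytes(rt.maxMemory()));
        // totalMemory: heap currently reserved by the JVM
        System.out.println("total heap: " + toMegabytes(rt.totalMemory()));
    }
}
```

Starting it with the values you plan to use, for example `java -Xms256m -Xmx1g HeapReport`, shows whether the JVM on that machine can start with that heap at all, which is the limit to watch for on 32-bit Windows.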

HTH
David.

--
David Pilato
http://dev.david.pilato.fr/
Twitter : @dadoonet


(yoni) #6

Hi

About the search testing: I didn't try searching with the Java
API. I just went to the browser address bar and tried some queries,
some on all fields (using q= ) and some on specific
fields (q=:).
The reason I'm starting with 10M docs and not a lower number is that
I'm trying to check whether ES is a good solution for my site's search
engine.
(We use Solr now and I need to decide whether to move to ES.)

Do you know if 10M-15M docs with complex queries, returning results in
less than 2 seconds, can be done with ES?
And what load can ES handle? Is 1k search requests per second
reasonable?

Thanks


(Lukáš Vlček) #7

Hi Yoni,

Not sure if that was your case, but note that after you index data into ES,
the first queries have to populate all the internal caches, and this takes
time. In other words, the first queries take significantly longer.
You should monitor caches in ES (for example via JMX or the admin cluster
REST API: http://www.elasticsearch.org/guide/reference/api/ which I used to
build BigDesk https://github.com/lukas-vlcek/bigdesk ; you might also find
elasticsearch-head useful: https://github.com/mobz/elasticsearch-head).
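
To see the warm-up effect in numbers, one option is to run the same query several times and compare the first run with the average of the later ones. The sketch below is a generic timing helper under that assumption; the class and method names are hypothetical, and the task passed in stands for whatever code issues your search request.

```java
// Sketch only: times repeated runs of a task so cold and warm runs can be compared.
final class WarmupTimer {

    /** Runs the task `runs` times and returns each elapsed time in milliseconds. */
    static long[] time(Runnable task, int runs) {
        long[] elapsedMs = new long[runs];
        for (int i = 0; i < runs; i++) {
            long start = System.nanoTime();
            task.run();
            elapsedMs[i] = (System.nanoTime() - start) / 1_000_000L;
        }
        return elapsedMs;
    }

    /** Average of all runs after the first `skip` ones, or -1 if none are left. */
    static double averageAfter(long[] elapsedMs, int skip) {
        if (skip >= elapsedMs.length) {
            return -1.0;
        }
        long sum = 0;
        for (int i = skip; i < elapsedMs.length; i++) {
            sum += elapsedMs[i];
        }
        return (double) sum / (elapsedMs.length - skip);
    }
}
```

If the first entry is far above the average of the rest, the slow responses are mostly cache warm-up; if all runs are equally slow, the cause is more likely memory or configuration.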

Regards,
Lukas


(Shay Banon) #8

Yoni,

It's a bit hard to give advice without knowing a bit more about what you
are running: which operating system, how much memory you have on the
server, and how much is allocated to the elasticsearch process. How many
machines are you running? (This one I think I can guess: 1.) A simple
term-based search should not take more than a second.

