How to improve AutoComplete performance?

Hi,

Currently I am building an autocomplete with Elasticsearch.

My Config:

index:
  analysis:
    filter:
      my_gram_filter:
        type: edgeNGram
        side: front
        min_gram: 1
        max_gram: 10
    tokenizer:
      my_gram:
        type: edgeNGram
        side: front
        min_gram: 2
        max_gram: 20
    analyzer:
      default:
        tokenizer: standard
        filter: [asciifolding, lowercase]
      auto:
        type: custom
        tokenizer: my_gram
        filter: [asciifolding, lowercase]
      auto2:
        type: custom
        tokenizer: standard
        filter: [standard, lowercase, asciifolding, my_gram_filter]

Here is my mapping:

{
  "song_name": {
    "properties": {
      "id": {"type": "string"},
      "name": {"type": "string", "index_analyzer": "auto2", "search_analyzer": "default"},
      "data": {"type": "string", "index": "not_analyzed"}
    }
  }
}
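(To sketch what this combination should do, given the settings above and a made-up name "Hello World": indexing it with auto2 produces the front edge ngrams h, he, hel, hell, hello, w, wo, wor, worl, world, while a search for "hel" analyzed with the default analyzer stays a single token, hel, which matches the indexed grams directly.)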

My servers:

Server 1: 30GB RAM, 16 cores, Intel(R) Xeon(R) CPU E5620 @ 2.40GHz
Server 2: 20GB RAM, 16 cores, Intel(R) Xeon(R) CPU E5620 @ 2.40GHz
Server 3: 20GB RAM, 16 cores, Intel(R) Xeon(R) CPU E5620 @ 2.40GHz

I have 5 indices, with 5 million records.

I stress tested it at around 6k requests/second, and the responses took from 2s to 6s.

Is there anything I can do to increase the performance?

P.S.: I have tried to set up caching:

indices.cache.filter.size: 3072mb
index.cache.filter.max_size: 1000000
index.cache.filter.expire: 5m
index.cache.filter.type: resident

index.cache.field.max_size: 1000000
index.cache.field.expire: 5m

I checked the cache status using the bigdesk plugin and found:

Filter Size: 16mb
Field Size: 0

Is the caching working correctly?

Thanks in advance.

--

Hi,

Can you share the actual query that you use for the autocomplete?

The field cache is used for sorting by field, faceting, and in scripts.
If these features aren't used, then the cache size is 0.
The filter cache is used only for filters, which can be specified in the search DSL. Only if your autocomplete search request contains filters can you benefit from the filter cache.
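For illustration only (this is not code from the original post, and the filter field and value below are made up): with the Java API, only the filter part of something like a filtered query goes through the filter cache, for example:

import org.elasticsearch.index.query.FilterBuilders;
import org.elasticsearch.index.query.QueryBuilder;
import org.elasticsearch.index.query.QueryBuilders;

// The text query part is scored on every request and never cached;
// only the term filter below can end up in the filter cache.
QueryBuilder q = QueryBuilders.filteredQuery(
        QueryBuilders.textQuery("name", "hel"),
        FilterBuilders.termFilter("data", "some_value"));

A plain text query without any filter, like the autocomplete queries in this thread, never touches that cache.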

Why did you actually set the expire and max_size options on both caches? A filter cache that has entries is good, because that will prevent unneeded disk I/O. The filter cache can manage itself pretty well. Also, ES has a very good default max_size for the filter cache, which is 20% of the specified maximum heap size. And how much memory did you allocate to ES via the ES_HEAP_SIZE environment variable?

Martijn


--

Hi Martijn,

Server 1: 30GB RAM, 16 cores, Intel(R) Xeon(R) CPU E5620 @ 2.40GHz
Server 2: 20GB RAM, 16 cores, Intel(R) Xeon(R) CPU E5620 @ 2.40GHz
Server 3: 20GB RAM, 16 cores, Intel(R) Xeon(R) CPU E5620 @ 2.40GHz

30GB of RAM has been allocated by setting ES_HEAP_SIZE.

My queries are a text query and a text phrase query.

// Phrase suggest query:
TextQueryBuilder tq = QueryBuilders.textPhraseQuery(field, query);
tq.operator(TextQueryBuilder.Operator.AND);
tq.analyzer("default");

// Plain text suggest query:
TextQueryBuilder tq = QueryBuilders.textQuery(field, query);
tq.operator(TextQueryBuilder.Operator.AND);
tq.analyzer("default");
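For completeness, a minimal, self-contained sketch of how such a builder is typically executed with the Java client; the host, port, and index names ("songs_1" to "songs_5") below are placeholders, not the real ones from this setup:

import org.elasticsearch.action.search.SearchResponse;
import org.elasticsearch.client.transport.TransportClient;
import org.elasticsearch.common.transport.InetSocketTransportAddress;
import org.elasticsearch.index.query.QueryBuilders;
import org.elasticsearch.index.query.TextQueryBuilder;

public class SuggestExample {
    public static void main(String[] args) {
        // Connect to one node of the cluster (placeholder host and port).
        TransportClient client = new TransportClient()
                .addTransportAddress(new InetSocketTransportAddress("localhost", 9300));

        // Same kind of query as above, "hel" standing in for the prefix typed so far.
        TextQueryBuilder tq = QueryBuilders.textQuery("name", "hel");
        tq.operator(TextQueryBuilder.Operator.AND);
        tq.analyzer("default");

        // Query all suggest indices in one request and only fetch the top 10 hits.
        SearchResponse response = client
                .prepareSearch("songs_1", "songs_2", "songs_3", "songs_4", "songs_5")
                .setQuery(tq)
                .setSize(10)
                .execute()
                .actionGet();

        System.out.println("total hits: " + response.getHits().getTotalHits());
        client.close();
    }
}

Keeping setSize() small matters here; for autocomplete only the first few hits are ever shown.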

The data is random, around 4 words per record, each word about 3-6 characters.

I have checked via the bigdesk plugin and found that ES only uses about 30% of the allocated memory.

Do you have any suggestions on how to increase my suggest performance?

I have taken a look at the suggest plugin but it doesn't help much.

Regards.


--

Hi kidkid,

Since you're using queries for auto suggest, the filter cache won't help you improve your performance.

I see that you allocated all your memory to ES; this isn't recommended and is actually bad for performance.
ES utilizes the OS file system cache a lot. By allocating a lot of memory to ES's heap space, the file system cache doesn't have much memory left to do its job. A healthy memory balance for ES is to give around 50% of the memory to ES's heap space and leave the other 50% to the OS. Queries in ES rely a lot on the file system cache.

Since you only utilize ~30% of the heap space, I would allocate 15GB of memory to ES's heap space on server 1 and around 12GB on servers 2 and 3. I think this will improve your performance.

In your code I see the variable "field" being used. What actual field is this variable referring to?

Martijn


--

Hi, the 30GB of RAM is the memory I allocated for ES. The server has 40GB of RAM.
It's a server for load-testing purposes only.
"field" is the "name" field, which you can see in my mapping.

Is there any way to use caching, or anything else, to improve the performance?

Regards.


--

Hi,

Just a question: why are you using min_gram: 1?

Best regards,

Jörg

--

Hi, the 30GB of RAM is the memory I allocated for ES. The server has 40GB of RAM.
It's a server for load-testing purposes only.
"field" is the "name" field, which you can see in my mapping.

OK, if you lower the ES_HEAP_SIZE to 20GB, then this will improve your query performance. The filter cache won't help you here, since you aren't using filters in your query (the code that you shared).

--

OK, I will try.
But I notice that the server works fine with 10GB of RAM left. :expressionless:

Is there any other way to improve performance?

@Jörg Prante: Yeah, I set min_gram to 1 so that I can suggest from 1 character.


--

Hi,

I have set up the servers and allocated 50% of the RAM to ES.
The result hasn't improved much.

Currently, with single requests at 1-10 req/sec, I get the suggest response in 5-10 ms.

All I want is to get results in 10-50 ms at 1-2k req/sec.
So what should I do? More RAM, more servers, or something else?

Currently I have 5 indices, each index has 1M records, and I run the suggest across all 5 indices.
Should I try with 1 index only?

I saw the documentation for the in-memory store, but if the server goes down I need to reindex everything again.

Thanks.

--

How many nodes are you using for your tests right now? Does each index have 5 shards?
The easy way to improve your response times is to add more nodes.

Adding more RAM would give your file system cache more space. This allows the OS to cache more index files.
What type of disk are you using? Using SSDs can improve your response times many times over.

Martijn


--

You should reconsider your decision; nobody can work with suggestions based on just one character, as there are too many, mostly unusable, alternatives.

A min_gram of 2 or 3 will take some load off your system and make suggestions more performant.
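As a rough back-of-the-envelope illustration (using the roughly 5M records of about 4 words each mentioned earlier, so on the order of 20M indexed words): a 6-character word produces 6 front edge ngrams with min_gram 1, 5 with min_gram 2, and 4 with min_gram 3. Each step up therefore drops on the order of 20 million term occurrences from the index, and it drops exactly the shortest, most frequent grams, the single letters that occur in almost every document and produce the largest postings lists.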

Best regards,

Jörg

On Tuesday, December 11, 2012 3:23:55 AM UTC+1, kidkid wrote:

@Jörg Prante: Yeah, I set min_gram to 1 so that I can suggest from 1 character.

--

@Martijn v Groningen

How many nodes are you using for your tests right now? Does each index have 5 shards?
-> I have 3 nodes on 3 servers. Each index is set up with 5 primary shards and 2 replica shards.

The easy way to improve your response times is to add more nodes.
-> It's hard to justify more nodes; I only have 5M records.

Adding more RAM would give your file system cache more space. This allows the OS to cache more index files.
-> The servers and ES aren't using up the RAM they already have.

What type of disk are you using? Using SSDs can improve your response times many times over.
-> HDD; SSDs are too expensive here.

Thanks for your suggestions, but with 5M records I could put everything into RAM, so I really don't know why it's so slow :expressionless:

@Jörg Prante:

I set min_gram = 1 so that I can still suggest on input like "hello w", where the last word is only one character.

--

Hi,

I have set up 3 nodes to bring the index up into memory, and 1 node for holding the data.

index.number_of_shards: 5
index.number_of_replicas: 0
index.store.type: memory
index.gateway.type: none
gateway.type: fs
gateway.fs.location: /path/data

The request rate is just about 200 req/s.
With the same query I get many different response times: 7ms, 32ms, 150ms.

Is there any way to keep the response time around 50ms?

Thanks.

--