Improving search speed for 100 million queries


(Uli Köhler) #1

Hi,
I'm currently trying to index about 1 billion small document with
ElasticSearch 0.17.1 on a 21-node dedicated cluster. Each one contains a
text fragment and doesn't have to be stored (I'm only interested in the
IDs).
For my usecase, I need to apply four different analyzers to the text:

  • Case-insensitive
  • Case sensitive (splitting tokens on hyphens and whitespace)
  • Case sensitive (splitting tokens only on whitespace)
  • Case insensitive stemmed (Snowball stemmer)

Each analyzers additionally filters using a WordDelimiterFilter with
different parameters.

In order to do this, I simply use four different fields with the appropriate
analyzer (defined in the mapping) for one document.

After indexing (which happens in a Hadoop MapReduce job) the index isn't
changed any more (at least until the query phase is finished).

I've got a list of about 100 million entities and need to know in which of
the documents each of them occurs - I need all the document IDs the entity
occurs in in order to save them elsewhere.
My Java library builds queries from the entities depending on the type -
about 75 % of them result in span_near queries (with in_order = false and
slop <= 4).
Currently I manually apply the appropriate analyzer (I rebuilt it in Java
using Lucene) to the span_term queries inside the span_near queries because
I didn't find a way of doing this without additional overhead. Because I
neither need scoring nor sorting, I'm using SearchType.SCROLL with a scroll
size of 200.
Some of the entities result in queries with more than 10 million results but
a reasonable amount of queries. In order to reduce the overhead, I'm using
TransportClient.

My current configuration uses unicast discovery, 100 shards without any
replicas and a local gateway - with this configuration most of the queries
take a reasonable time but some of them take more than ten minutes even for
very few results. Hadoop executes about 125 queries in parallel for me.

Each cluster node has 32 GB RAM and 16 CPU cores.

Full config: https://gist.github.com/1127490
(/hadoop/hdfs0 is a physical HDD, not a FUSE-mounted HDFS).

Does anyone of you have an idea how to speed up searching for my usecase?
Would it be reasonable to execute not as much queries at once?

Thanks in advance,
Uli


(Shay Banon) #2

How much memory do you allocate to elasticsearch java process? Also, can you
upgrade to 0.17.4?

On Fri, Aug 5, 2011 at 4:42 PM, Uli Köhler ulikoehler.dev@googlemail.comwrote:

Hi,
I'm currently trying to index about 1 billion small document with
ElasticSearch 0.17.1 on a 21-node dedicated cluster. Each one contains a
text fragment and doesn't have to be stored (I'm only interested in the
IDs).
For my usecase, I need to apply four different analyzers to the text:

  • Case-insensitive
  • Case sensitive (splitting tokens on hyphens and whitespace)
  • Case sensitive (splitting tokens only on whitespace)
  • Case insensitive stemmed (Snowball stemmer)

Each analyzers additionally filters using a WordDelimiterFilter with
different parameters.

In order to do this, I simply use four different fields with the
appropriate analyzer (defined in the mapping) for one document.

After indexing (which happens in a Hadoop MapReduce job) the index isn't
changed any more (at least until the query phase is finished).

I've got a list of about 100 million entities and need to know in which of
the documents each of them occurs - I need all the document IDs the entity
occurs in in order to save them elsewhere.
My Java library builds queries from the entities depending on the type -
about 75 % of them result in span_near queries (with in_order = false and
slop <= 4).
Currently I manually apply the appropriate analyzer (I rebuilt it in Java
using Lucene) to the span_term queries inside the span_near queries because
I didn't find a way of doing this without additional overhead. Because I
neither need scoring nor sorting, I'm using SearchType.SCROLL with a scroll
size of 200.
Some of the entities result in queries with more than 10 million results
but a reasonable amount of queries. In order to reduce the overhead, I'm
using TransportClient.

My current configuration uses unicast discovery, 100 shards without any
replicas and a local gateway - with this configuration most of the queries
take a reasonable time but some of them take more than ten minutes even for
very few results. Hadoop executes about 125 queries in parallel for me.

Each cluster node has 32 GB RAM and 16 CPU cores.

Full config: https://gist.github.com/1127490
(/hadoop/hdfs0 is a physical HDD, not a FUSE-mounted HDFS).

Does anyone of you have an idea how to speed up searching for my usecase?
Would it be reasonable to execute not as much queries at once?

Thanks in advance,
Uli


(Uli Köhler) #3

Hi,
thanks for your fast reply!
I set ES_MIN_MEM and ES_MAX_MEM to 30 GB.

Yes, an upgrade to 0.17.4 is possible. Are there any bugs related to shard
allocation that have been fixed since 0.17.1?

Best regards,
Uli

2011/8/6 Shay Banon kimchy@gmail.com

How much memory do you allocate to elasticsearch java process? Also, can
you upgrade to 0.17.4?

On Fri, Aug 5, 2011 at 4:42 PM, Uli Köhler ulikoehler.dev@googlemail.comwrote:

Hi,
I'm currently trying to index about 1 billion small document with
ElasticSearch 0.17.1 on a 21-node dedicated cluster. Each one contains a
text fragment and doesn't have to be stored (I'm only interested in the
IDs).
For my usecase, I need to apply four different analyzers to the text:

  • Case-insensitive
  • Case sensitive (splitting tokens on hyphens and whitespace)
  • Case sensitive (splitting tokens only on whitespace)
  • Case insensitive stemmed (Snowball stemmer)

Each analyzers additionally filters using a WordDelimiterFilter with
different parameters.

In order to do this, I simply use four different fields with the
appropriate analyzer (defined in the mapping) for one document.

After indexing (which happens in a Hadoop MapReduce job) the index isn't
changed any more (at least until the query phase is finished).

I've got a list of about 100 million entities and need to know in which of
the documents each of them occurs - I need all the document IDs the entity
occurs in in order to save them elsewhere.
My Java library builds queries from the entities depending on the type -
about 75 % of them result in span_near queries (with in_order = false and
slop <= 4).
Currently I manually apply the appropriate analyzer (I rebuilt it in Java
using Lucene) to the span_term queries inside the span_near queries because
I didn't find a way of doing this without additional overhead. Because I
neither need scoring nor sorting, I'm using SearchType.SCROLL with a scroll
size of 200.
Some of the entities result in queries with more than 10 million results
but a reasonable amount of queries. In order to reduce the overhead, I'm
using TransportClient.

My current configuration uses unicast discovery, 100 shards without any
replicas and a local gateway - with this configuration most of the queries
take a reasonable time but some of them take more than ten minutes even for
very few results. Hadoop executes about 125 queries in parallel for me.

Each cluster node has 32 GB RAM and 16 CPU cores.

Full config: https://gist.github.com/1127490
(/hadoop/hdfs0 is a physical HDD, not a FUSE-mounted HDFS).

Does anyone of you have an idea how to speed up searching for my usecase?
Would it be reasonable to execute not as much queries at once?

Thanks in advance,
Uli


(Shay Banon) #4

Then you are not leaving enough memory for the OS if you set it to 30gb, I
suggest setting it to a lower value. You can see how much its taking now
using node stats API or bigdesk (https://github.com/lukas-vlcek/bigdesk). I
would say something around 10gb to 16gb to start with.

On Mon, Aug 8, 2011 at 10:11 AM, Uli Köhler
ulikoehler.dev@googlemail.comwrote:

Hi,
thanks for your fast reply!
I set ES_MIN_MEM and ES_MAX_MEM to 30 GB.

Yes, an upgrade to 0.17.4 is possible. Are there any bugs related to shard
allocation that have been fixed since 0.17.1?

Best regards,
Uli

2011/8/6 Shay Banon kimchy@gmail.com

How much memory do you allocate to elasticsearch java process? Also, can
you upgrade to 0.17.4?

On Fri, Aug 5, 2011 at 4:42 PM, Uli Köhler <ulikoehler.dev@googlemail.com

wrote:

Hi,
I'm currently trying to index about 1 billion small document with
ElasticSearch 0.17.1 on a 21-node dedicated cluster. Each one contains a
text fragment and doesn't have to be stored (I'm only interested in the
IDs).
For my usecase, I need to apply four different analyzers to the text:

  • Case-insensitive
  • Case sensitive (splitting tokens on hyphens and whitespace)
  • Case sensitive (splitting tokens only on whitespace)
  • Case insensitive stemmed (Snowball stemmer)

Each analyzers additionally filters using a WordDelimiterFilter with
different parameters.

In order to do this, I simply use four different fields with the
appropriate analyzer (defined in the mapping) for one document.

After indexing (which happens in a Hadoop MapReduce job) the index isn't
changed any more (at least until the query phase is finished).

I've got a list of about 100 million entities and need to know in which
of the documents each of them occurs - I need all the document IDs the
entity occurs in in order to save them elsewhere.
My Java library builds queries from the entities depending on the type -
about 75 % of them result in span_near queries (with in_order = false and
slop <= 4).
Currently I manually apply the appropriate analyzer (I rebuilt it in Java
using Lucene) to the span_term queries inside the span_near queries because
I didn't find a way of doing this without additional overhead. Because I
neither need scoring nor sorting, I'm using SearchType.SCROLL with a scroll
size of 200.
Some of the entities result in queries with more than 10 million results
but a reasonable amount of queries. In order to reduce the overhead, I'm
using TransportClient.

My current configuration uses unicast discovery, 100 shards without any
replicas and a local gateway - with this configuration most of the queries
take a reasonable time but some of them take more than ten minutes even for
very few results. Hadoop executes about 125 queries in parallel for me.

Each cluster node has 32 GB RAM and 16 CPU cores.

Full config: https://gist.github.com/1127490
(/hadoop/hdfs0 is a physical HDD, not a FUSE-mounted HDFS).

Does anyone of you have an idea how to speed up searching for my usecase?
Would it be reasonable to execute not as much queries at once?

Thanks in advance,
Uli


(Andy-2) #5

Does elasticsearch manage the caching of the index & data itself or
does it leave that job to the OS file cache?

So if I have index & data that total 10GB and I want it to stay in
RAM, should I set ES_MIN_MEM to 10G or should I leave 10G for the OS
to do the caching?

On Aug 8, 3:55 am, Shay Banon kim...@gmail.com wrote:

Then you are not leaving enough memory for the OS if you set it to 30gb, I
suggest setting it to a lower value. You can see how much its taking now
using node stats API or bigdesk (https://github.com/lukas-vlcek/bigdesk). I
would say something around 10gb to 16gb to start with.

On Mon, Aug 8, 2011 at 10:11 AM, Uli Köhler
ulikoehler....@googlemail.comwrote:

Hi,
thanks for your fast reply!
I set ES_MIN_MEM and ES_MAX_MEM to 30 GB.

Yes, an upgrade to 0.17.4 is possible. Are there any bugs related to shard
allocation that have been fixed since 0.17.1?

Best regards,
Uli

2011/8/6 Shay Banon kim...@gmail.com

How much memory do you allocate to elasticsearch java process? Also, can
you upgrade to 0.17.4?

On Fri, Aug 5, 2011 at 4:42 PM, Uli Köhler <ulikoehler....@googlemail.com

wrote:

Hi,
I'm currently trying to index about 1 billion small document with
ElasticSearch 0.17.1 on a 21-node dedicated cluster. Each one contains a
text fragment and doesn't have to be stored (I'm only interested in the
IDs).
For my usecase, I need to apply four different analyzers to the text:

  • Case-insensitive
  • Case sensitive (splitting tokens on hyphens and whitespace)
  • Case sensitive (splitting tokens only on whitespace)
  • Case insensitive stemmed (Snowball stemmer)

Each analyzers additionally filters using a WordDelimiterFilter with
different parameters.

In order to do this, I simply use four different fields with the
appropriate analyzer (defined in the mapping) for one document.

After indexing (which happens in a Hadoop MapReduce job) the index isn't
changed any more (at least until the query phase is finished).

I've got a list of about 100 million entities and need to know in which
of the documents each of them occurs - I need all the document IDs the
entity occurs in in order to save them elsewhere.
My Java library builds queries from the entities depending on the type -
about 75 % of them result in span_near queries (with in_order = false and
slop <= 4).
Currently I manually apply the appropriate analyzer (I rebuilt it in Java
using Lucene) to the span_term queries inside the span_near queries because
I didn't find a way of doing this without additional overhead. Because I
neither need scoring nor sorting, I'm using SearchType.SCROLL with a scroll
size of 200.
Some of the entities result in queries with more than 10 million results
but a reasonable amount of queries. In order to reduce the overhead, I'm
using TransportClient.

My current configuration uses unicast discovery, 100 shards without any
replicas and a local gateway - with this configuration most of the queries
take a reasonable time but some of them take more than ten minutes even for
very few results. Hadoop executes about 125 queries in parallel for me.

Each cluster node has 32 GB RAM and 16 CPU cores.

Full config:https://gist.github.com/1127490
(/hadoop/hdfs0 is a physical HDD, not a FUSE-mounted HDFS).

Does anyone of you have an idea how to speed up searching for my usecase?
Would it be reasonable to execute not as much queries at once?

Thanks in advance,
Uli


(Shay Banon) #6

the main caching elasticsearch does is filter caching, or field caching (for
sorting and faceting) and on the Lucene level, things like loading a skip
list of terms to speedup search. But, it relies also on the file system
cache heavily. Can't answer your question since it really depends on what
you do with your index (facets, sorting), and, how much memory the OS has.

On Mon, Aug 8, 2011 at 11:02 AM, Andy selforganized@gmail.com wrote:

Does elasticsearch manage the caching of the index & data itself or
does it leave that job to the OS file cache?

So if I have index & data that total 10GB and I want it to stay in
RAM, should I set ES_MIN_MEM to 10G or should I leave 10G for the OS
to do the caching?

On Aug 8, 3:55 am, Shay Banon kim...@gmail.com wrote:

Then you are not leaving enough memory for the OS if you set it to 30gb,
I
suggest setting it to a lower value. You can see how much its taking now
using node stats API or bigdesk (https://github.com/lukas-vlcek/bigdesk).
I
would say something around 10gb to 16gb to start with.

On Mon, Aug 8, 2011 at 10:11 AM, Uli Köhler
ulikoehler....@googlemail.comwrote:

Hi,
thanks for your fast reply!
I set ES_MIN_MEM and ES_MAX_MEM to 30 GB.

Yes, an upgrade to 0.17.4 is possible. Are there any bugs related to
shard

allocation that have been fixed since 0.17.1?

Best regards,
Uli

2011/8/6 Shay Banon kim...@gmail.com

How much memory do you allocate to elasticsearch java process? Also,
can

you upgrade to 0.17.4?

On Fri, Aug 5, 2011 at 4:42 PM, Uli Köhler <
ulikoehler....@googlemail.com

wrote:

Hi,
I'm currently trying to index about 1 billion small document with
ElasticSearch 0.17.1 on a 21-node dedicated cluster. Each one
contains a

text fragment and doesn't have to be stored (I'm only interested in
the

IDs).
For my usecase, I need to apply four different analyzers to the text:

  • Case-insensitive
  • Case sensitive (splitting tokens on hyphens and whitespace)
  • Case sensitive (splitting tokens only on whitespace)
  • Case insensitive stemmed (Snowball stemmer)

Each analyzers additionally filters using a WordDelimiterFilter with
different parameters.

In order to do this, I simply use four different fields with the
appropriate analyzer (defined in the mapping) for one document.

After indexing (which happens in a Hadoop MapReduce job) the index
isn't

changed any more (at least until the query phase is finished).

I've got a list of about 100 million entities and need to know in
which

of the documents each of them occurs - I need all the document IDs
the

entity occurs in in order to save them elsewhere.
My Java library builds queries from the entities depending on the
type -

about 75 % of them result in span_near queries (with in_order = false
and

slop <= 4).
Currently I manually apply the appropriate analyzer (I rebuilt it in
Java

using Lucene) to the span_term queries inside the span_near queries
because

I didn't find a way of doing this without additional overhead.
Because I

neither need scoring nor sorting, I'm using SearchType.SCROLL with a
scroll

size of 200.
Some of the entities result in queries with more than 10 million
results

but a reasonable amount of queries. In order to reduce the overhead,
I'm

using TransportClient.

My current configuration uses unicast discovery, 100 shards without
any

replicas and a local gateway - with this configuration most of the
queries

take a reasonable time but some of them take more than ten minutes
even for

very few results. Hadoop executes about 125 queries in parallel for
me.

Each cluster node has 32 GB RAM and 16 CPU cores.

Full config:https://gist.github.com/1127490
(/hadoop/hdfs0 is a physical HDD, not a FUSE-mounted HDFS).

Does anyone of you have an idea how to speed up searching for my
usecase?

Would it be reasonable to execute not as much queries at once?

Thanks in advance,
Uli


(Andy-2) #7

So the inverted index is cached by the OS?

What about if I use ES as a K/V store - is the K/V data cached by ES
or OS?

On Aug 8, 4:06 am, Shay Banon kim...@gmail.com wrote:

the main caching elasticsearch does is filter caching, or field caching (for
sorting and faceting) and on the Lucene level, things like loading a skip
list of terms to speedup search. But, it relies also on the file system
cache heavily. Can't answer your question since it really depends on what
you do with your index (facets, sorting), and, how much memory the OS has.

On Mon, Aug 8, 2011 at 11:02 AM, Andy selforgani...@gmail.com wrote:

Does elasticsearch manage the caching of the index & data itself or
does it leave that job to the OS file cache?

So if I have index & data that total 10GB and I want it to stay in
RAM, should I set ES_MIN_MEM to 10G or should I leave 10G for the OS
to do the caching?

On Aug 8, 3:55 am, Shay Banon kim...@gmail.com wrote:

Then you are not leaving enough memory for the OS if you set it to 30gb,
I
suggest setting it to a lower value. You can see how much its taking now
using node stats API or bigdesk (https://github.com/lukas-vlcek/bigdesk).
I
would say something around 10gb to 16gb to start with.

On Mon, Aug 8, 2011 at 10:11 AM, Uli Köhler
ulikoehler....@googlemail.comwrote:

Hi,
thanks for your fast reply!
I set ES_MIN_MEM and ES_MAX_MEM to 30 GB.

Yes, an upgrade to 0.17.4 is possible. Are there any bugs related to
shard

allocation that have been fixed since 0.17.1?

Best regards,
Uli

2011/8/6 Shay Banon kim...@gmail.com

How much memory do you allocate to elasticsearch java process? Also,
can

you upgrade to 0.17.4?

On Fri, Aug 5, 2011 at 4:42 PM, Uli Köhler <
ulikoehler....@googlemail.com

wrote:

Hi,
I'm currently trying to index about 1 billion small document with
ElasticSearch 0.17.1 on a 21-node dedicated cluster. Each one
contains a

text fragment and doesn't have to be stored (I'm only interested in
the

IDs).
For my usecase, I need to apply four different analyzers to the text:

  • Case-insensitive
  • Case sensitive (splitting tokens on hyphens and whitespace)
  • Case sensitive (splitting tokens only on whitespace)
  • Case insensitive stemmed (Snowball stemmer)

Each analyzers additionally filters using a WordDelimiterFilter with
different parameters.

In order to do this, I simply use four different fields with the
appropriate analyzer (defined in the mapping) for one document.

After indexing (which happens in a Hadoop MapReduce job) the index
isn't

changed any more (at least until the query phase is finished).

I've got a list of about 100 million entities and need to know in
which

of the documents each of them occurs - I need all the document IDs
the

entity occurs in in order to save them elsewhere.
My Java library builds queries from the entities depending on the
type -

about 75 % of them result in span_near queries (with in_order = false
and

slop <= 4).
Currently I manually apply the appropriate analyzer (I rebuilt it in
Java

using Lucene) to the span_term queries inside the span_near queries
because

I didn't find a way of doing this without additional overhead.
Because I

neither need scoring nor sorting, I'm using SearchType.SCROLL with a
scroll

size of 200.
Some of the entities result in queries with more than 10 million
results

but a reasonable amount of queries. In order to reduce the overhead,
I'm

using TransportClient.

My current configuration uses unicast discovery, 100 shards without
any

replicas and a local gateway - with this configuration most of the
queries

take a reasonable time but some of them take more than ten minutes
even for

very few results. Hadoop executes about 125 queries in parallel for
me.

Each cluster node has 32 GB RAM and 16 CPU cores.

Full config:https://gist.github.com/1127490
(/hadoop/hdfs0 is a physical HDD, not a FUSE-mounted HDFS).

Does anyone of you have an idea how to speed up searching for my
usecase?

Would it be reasonable to execute not as much queries at once?

Thanks in advance,
Uli


(Shay Banon) #8

The lookup keys are loaded (skipped) by Lucene, some recent keys lookup info
is cached done by elasticsearch, but the actual content is served by the OS.
If you want a slight fast lookup and you have enough mem to make it count,
you can use mmapfs as the index storage option.

On Mon, Aug 8, 2011 at 11:17 AM, Andy selforganized@gmail.com wrote:

So the inverted index is cached by the OS?

What about if I use ES as a K/V store - is the K/V data cached by ES
or OS?

On Aug 8, 4:06 am, Shay Banon kim...@gmail.com wrote:

the main caching elasticsearch does is filter caching, or field caching
(for
sorting and faceting) and on the Lucene level, things like loading a skip
list of terms to speedup search. But, it relies also on the file system
cache heavily. Can't answer your question since it really depends on what
you do with your index (facets, sorting), and, how much memory the OS
has.

On Mon, Aug 8, 2011 at 11:02 AM, Andy selforgani...@gmail.com wrote:

Does elasticsearch manage the caching of the index & data itself or
does it leave that job to the OS file cache?

So if I have index & data that total 10GB and I want it to stay in
RAM, should I set ES_MIN_MEM to 10G or should I leave 10G for the OS
to do the caching?

On Aug 8, 3:55 am, Shay Banon kim...@gmail.com wrote:

Then you are not leaving enough memory for the OS if you set it to
30gb,

I

suggest setting it to a lower value. You can see how much its taking
now

using node stats API or bigdesk (
https://github.com/lukas-vlcek/bigdesk).

I

would say something around 10gb to 16gb to start with.

On Mon, Aug 8, 2011 at 10:11 AM, Uli Köhler
ulikoehler....@googlemail.comwrote:

Hi,
thanks for your fast reply!
I set ES_MIN_MEM and ES_MAX_MEM to 30 GB.

Yes, an upgrade to 0.17.4 is possible. Are there any bugs related
to

shard

allocation that have been fixed since 0.17.1?

Best regards,
Uli

2011/8/6 Shay Banon kim...@gmail.com

How much memory do you allocate to elasticsearch java process?
Also,

can

you upgrade to 0.17.4?

On Fri, Aug 5, 2011 at 4:42 PM, Uli Köhler <
ulikoehler....@googlemail.com

wrote:

Hi,
I'm currently trying to index about 1 billion small document with
ElasticSearch 0.17.1 on a 21-node dedicated cluster. Each one
contains a

text fragment and doesn't have to be stored (I'm only interested
in

the

IDs).
For my usecase, I need to apply four different analyzers to the
text:

  • Case-insensitive
  • Case sensitive (splitting tokens on hyphens and whitespace)
  • Case sensitive (splitting tokens only on whitespace)
  • Case insensitive stemmed (Snowball stemmer)

Each analyzers additionally filters using a WordDelimiterFilter
with

different parameters.

In order to do this, I simply use four different fields with the
appropriate analyzer (defined in the mapping) for one document.

After indexing (which happens in a Hadoop MapReduce job) the
index

isn't

changed any more (at least until the query phase is finished).

I've got a list of about 100 million entities and need to know in
which

of the documents each of them occurs - I need all the document
IDs

the

entity occurs in in order to save them elsewhere.
My Java library builds queries from the entities depending on the
type -

about 75 % of them result in span_near queries (with in_order =
false

and

slop <= 4).
Currently I manually apply the appropriate analyzer (I rebuilt it
in

Java

using Lucene) to the span_term queries inside the span_near
queries

because

I didn't find a way of doing this without additional overhead.
Because I

neither need scoring nor sorting, I'm using SearchType.SCROLL
with a

scroll

size of 200.
Some of the entities result in queries with more than 10 million
results

but a reasonable amount of queries. In order to reduce the
overhead,

I'm

using TransportClient.

My current configuration uses unicast discovery, 100 shards
without

any

replicas and a local gateway - with this configuration most of
the

queries

take a reasonable time but some of them take more than ten
minutes

even for

very few results. Hadoop executes about 125 queries in parallel
for

me.

Each cluster node has 32 GB RAM and 16 CPU cores.

Full config:https://gist.github.com/1127490
(/hadoop/hdfs0 is a physical HDD, not a FUSE-mounted HDFS).

Does anyone of you have an idea how to speed up searching for my
usecase?

Would it be reasonable to execute not as much queries at once?

Thanks in advance,
Uli


(system) #9