Hi,
I'm currently trying to index about 1 billion small documents with
Elasticsearch 0.17.1 on a 21-node dedicated cluster. Each one contains a
text fragment and doesn't have to be stored (I'm only interested in the
IDs).
For my use case, I need to apply four different analyzers to the text:
- case-insensitive
- case-sensitive (splitting tokens on hyphens and whitespace)
- case-sensitive (splitting tokens only on whitespace)
- case-insensitive, stemmed (Snowball stemmer)
Each analyzer additionally filters using a WordDelimiterFilter with
different parameters.
To do this, I simply use four different fields per document, each with the
appropriate analyzer defined in the mapping.
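For illustration, here is a minimal sketch of how such an index could be
created through the Java admin API. The index, type, field, and analyzer
names are invented for this example, only two of the four analyzers are
shown, and the WordDelimiterFilter setup is elided; the exact builder
methods may differ slightly between 0.17.x releases.

    import org.elasticsearch.client.Client;

    // Sketch: the same text is indexed into several fields, each with its
    // own analyzer; the document source itself is not stored.
    public class IndexSetupSketch {
        public static void create(Client client) {
            String settings =
                "{ \"index\": {" +
                "    \"number_of_shards\": 100," +
                "    \"number_of_replicas\": 0," +
                "    \"analysis\": { \"analyzer\": {" +
                "      \"ci\":    { \"type\": \"custom\", \"tokenizer\": \"whitespace\"," +
                "                   \"filter\": [\"lowercase\"] }," +
                "      \"cs_ws\": { \"type\": \"custom\", \"tokenizer\": \"whitespace\" }" +
                "    } } } }";

            // One field per analyzer, source disabled since only IDs matter.
            String mapping =
                "{ \"doc\": {" +
                "    \"_source\": { \"enabled\": false }," +
                "    \"properties\": {" +
                "      \"text_ci\":    { \"type\": \"string\", \"analyzer\": \"ci\" }," +
                "      \"text_cs_ws\": { \"type\": \"string\", \"analyzer\": \"cs_ws\" }" +
                "    } } }";

            client.admin().indices().prepareCreate("fragments")
                  .setSettings(settings)
                  .addMapping("doc", mapping)
                  .execute().actionGet();
        }
    }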
After indexing (which happens in a Hadoop MapReduce job) the index isn't
changed any more (at least until the query phase is finished).
I've got a list of about 100 million entities and need to know in which of
the documents each of them occurs - I need all the document IDs an entity
occurs in, in order to save them elsewhere.
My Java library builds queries from the entities depending on their type -
about 75% of them result in span_near queries (with in_order = false and
slop <= 4).
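As a rough sketch, such a query can be built with QueryBuilders from the
Java API (the field name and terms are hypothetical, and the builder
signatures have changed across versions, so this follows the old 0.x style):

    import org.elasticsearch.index.query.QueryBuilders;
    import org.elasticsearch.index.query.SpanNearQueryBuilder;

    // Unordered span_near with slop 4: one span_term clause per token of
    // the entity, against the hypothetical case-insensitive field.
    SpanNearQueryBuilder query = QueryBuilders.spanNearQuery()
            .clause(QueryBuilders.spanTermQuery("text_ci", "new"))
            .clause(QueryBuilders.spanTermQuery("text_ci", "york"))
            .slop(4)
            .inOrder(false);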
Currently I manually apply the appropriate analyzer (which I rebuilt in Java
using Lucene) to the span_term queries inside the span_near queries, because
I didn't find a way of doing this without additional overhead. Since I need
neither scoring nor sorting, I'm using SearchType.SCROLL with a scroll size
of 200.
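Put together, the query-side analysis and the scrolling could look roughly
like the sketch below. The analyzer instance and the index and field names
are assumptions on my part, and I use the scan search type together with the
scroll API, which is how a "no scoring, fetch everything" pass was typically
done in that era:

    import java.io.StringReader;
    import java.util.ArrayList;
    import java.util.List;

    import org.apache.lucene.analysis.Analyzer;
    import org.apache.lucene.analysis.TokenStream;
    import org.apache.lucene.analysis.tokenattributes.CharTermAttribute;
    import org.elasticsearch.action.search.SearchResponse;
    import org.elasticsearch.action.search.SearchType;
    import org.elasticsearch.client.Client;
    import org.elasticsearch.common.unit.TimeValue;
    import org.elasticsearch.index.query.QueryBuilders;
    import org.elasticsearch.index.query.SpanNearQueryBuilder;
    import org.elasticsearch.search.SearchHit;

    public class EntityLookupSketch {

        // Tokenize an entity with the same (client-side, Lucene-built)
        // analyzer that was used at index time.
        static List<String> analyze(Analyzer analyzer, String text) throws Exception {
            List<String> tokens = new ArrayList<String>();
            TokenStream stream = analyzer.tokenStream("text_ci", new StringReader(text));
            CharTermAttribute term = stream.addAttribute(CharTermAttribute.class);
            stream.reset();
            while (stream.incrementToken()) {
                tokens.add(term.toString());
            }
            stream.close();
            return tokens;
        }

        // Collect the IDs of all documents matching the entity.
        static List<String> collectIds(Client client, Analyzer analyzer,
                                       String entity) throws Exception {
            SpanNearQueryBuilder query =
                    QueryBuilders.spanNearQuery().slop(4).inOrder(false);
            for (String token : analyze(analyzer, entity)) {
                query.clause(QueryBuilders.spanTermQuery("text_ci", token));
            }

            List<String> ids = new ArrayList<String>();
            SearchResponse response = client.prepareSearch("fragments")
                    .setSearchType(SearchType.SCAN)  // no scoring needed
                    .setScroll(TimeValue.timeValueMinutes(5))
                    .setQuery(query)
                    .setSize(200)  // scroll batch size
                    .execute().actionGet();
            while (true) {
                response = client.prepareSearchScroll(response.getScrollId())
                        .setScroll(TimeValue.timeValueMinutes(5))
                        .execute().actionGet();
                if (response.getHits().getHits().length == 0) {
                    break;  // scrolled through all hits
                }
                for (SearchHit hit : response.getHits()) {
                    ids.add(hit.getId());
                }
            }
            return ids;
        }
    }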
Some of the entities result in queries with more than 10 million results,
but the number of such queries is reasonable. To reduce overhead, I'm using
a TransportClient.
My current configuration uses unicast discovery, 100 shards without any
replicas, and a local gateway. With this configuration most of the queries
take a reasonable time, but some of them take more than ten minutes even for
very few results. Hadoop executes about 125 queries in parallel for me.
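For reference, a sketch of the corresponding node-level settings in
elasticsearch.yml (host names are placeholders, and the keys follow the
0.17.x-era naming as I understand it):

    # unicast discovery, local gateway, 100 shards / 0 replicas per index
    discovery.zen.ping.multicast.enabled: false
    discovery.zen.ping.unicast.hosts: ["node01:9300", "node02:9300", "node03:9300"]
    gateway.type: local
    index.number_of_shards: 100
    index.number_of_replicas: 0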
Does Elasticsearch manage the caching of the index & data itself, or
does it leave that job to the OS file cache?
So if I have index & data that total 10 GB and I want them to stay in
RAM, should I set ES_MIN_MEM to 10G, or should I leave 10G for the OS
to do the caching?
The main caching Elasticsearch does is filter caching and field caching (for
sorting and faceting), plus, on the Lucene level, things like loading a skip
list of terms to speed up search. But it also relies heavily on the file
system cache. I can't answer your question definitively, since it really
depends on what you do with your index (facets, sorting) and how much memory
the OS has.
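To make that concrete: the heap is what you control via the ES_MIN_MEM /
ES_MAX_MEM environment variables, and whatever you leave unallocated is what
the OS can use for its file cache. A hypothetical split on a 16 GB node (the
numbers are made up for illustration, not a recommendation):

    # Give 6 GB to the JVM heap for Elasticsearch's own caches and leave
    # the remaining ~10 GB to the OS file system cache.
    export ES_MIN_MEM=6g
    export ES_MAX_MEM=6g
    bin/elasticsearch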
The term lookup keys are loaded (skipped through) by Lucene, and some
recently used key lookup info is cached by Elasticsearch, but the actual
content is served by the OS. If you want slightly faster lookups and have
enough memory to make it count, you can use mmapfs as the index storage
option.
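A sketch of that option as it would appear in elasticsearch.yml (assuming
the 0.17.x setting name):

    # Memory-map index files so reads are served straight out of the OS
    # page cache through the mapping.
    index.store.type: mmapfs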