I'm currently trying to index about 1 billion small documents with
ElasticSearch 0.17.1 on a 21-node dedicated cluster. Each one contains a
text fragment and doesn't have to be stored (I'm only interested in the
For my use case, I need to apply four different analyzers to the text:
- Case sensitive (splitting tokens on hyphens and whitespace)
- Case sensitive (splitting tokens only on whitespace)
- Case insensitive stemmed (Snowball stemmer)
Each analyzer additionally applies a WordDelimiterFilter with
In order to do this, I simply use four different fields with the appropriate
analyzer (defined in the mapping) for one document.
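To illustrate how the tokenizations differ, here is a minimal plain-Java sketch. It is not my actual analyzer chain: it uses regular expressions instead of Lucene tokenizers, it leaves out the Snowball stemmer and the WordDelimiterFilter, and the class and method names are made up for this example.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;

// Hypothetical illustration only; the real analyzers are defined in the mapping.
final class TokenizerSketch {

    // Splits text on the given regex, drops empty tokens, optionally lowercases.
    static List<String> split(String text, String regex, boolean lowercase) {
        List<String> tokens = new ArrayList<>();
        for (String token : text.split(regex)) {
            if (!token.isEmpty()) {
                tokens.add(lowercase ? token.toLowerCase(Locale.ROOT) : token);
            }
        }
        return tokens;
    }

    // Case sensitive, splitting tokens on hyphens and whitespace.
    static List<String> hyphenAndWhitespace(String text) {
        return split(text, "[\\s-]+", false);
    }

    // Case sensitive, splitting tokens only on whitespace.
    static List<String> whitespaceOnly(String text) {
        return split(text, "\\s+", false);
    }

    // Case insensitive; stemming is deliberately omitted in this sketch.
    static List<String> lowercased(String text) {
        return split(text, "\\s+", true);
    }

    public static void main(String[] args) {
        String text = "Jean-Luc Picard";
        System.out.println(hyphenAndWhitespace(text));
        System.out.println(whitespaceOnly(text));
        System.out.println(lowercased(text));
    }
}
```

Each variant would be indexed into its own field, so the same source text yields different term sequences depending on the field that is queried.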
After indexing (which happens in a Hadoop MapReduce job) the index isn't
changed any more (at least until the query phase is finished).
I've got a list of about 100 million entities and need to know which of
the documents each of them occurs in - I need all of the matching document
IDs so that I can save them elsewhere.
My Java library builds queries from the entities depending on the type -
about 75 % of them result in span_near queries (with in_order = false and
slop <= 4).
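To make the query shape concrete, here is a minimal sketch of how such a span_near query element could be assembled as JSON in plain Java. It is not my library's code: the class name, the method names and the field name "text_stemmed" are assumptions for this example, the terms are assumed to be already analyzed, and the escaping only handles backslashes and double quotes.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Hypothetical helper, not part of ElasticSearch or of my library.
final class SpanNearSketch {

    // Quotes a string for JSON; only backslashes and double quotes are escaped.
    static String quote(String s) {
        return "\"" + s.replace("\\", "\\\\").replace("\"", "\\\"") + "\"";
    }

    // Builds an unordered span_near element with one span_term clause per term.
    static String spanNear(String field, List<String> terms, int slop) {
        List<String> clauses = new ArrayList<>();
        for (String term : terms) {
            clauses.add("{\"span_term\":{" + quote(field) + ":" + quote(term) + "}}");
        }
        return "{\"span_near\":{\"clauses\":[" + String.join(",", clauses)
                + "],\"slop\":" + slop + ",\"in_order\":false}}";
    }

    public static void main(String[] args) {
        // "text_stemmed" is a made-up field name for this example.
        System.out.println(spanNear("text_stemmed", Arrays.asList("new", "york"), 4));
    }
}
```

The resulting string would go under the "query" key of the search request body; every term of an entity becomes one span_term clause.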
Currently I manually apply the appropriate analyzer (I rebuilt it in Java
using Lucene) to the span_term queries inside the span_near queries because
I didn't find a way of doing this without additional overhead. Because I
neither need scoring nor sorting, I'm using SearchType.SCROLL with a scroll
size of 200.
Some of the entities result in queries with more than 10 million results but
a reasonable amount of queries. In order to reduce the overhead, I'm using
My current configuration uses unicast discovery, 100 shards without any
replicas and a local gateway - with this configuration most of the queries
take a reasonable time but some of them take more than ten minutes even for
very few results. Hadoop executes about 125 queries in parallel for me.
Each cluster node has 32 GB RAM and 16 CPU cores.
Full config: https://gist.github.com/1127490
(/hadoop/hdfs0 is a physical HDD, not a FUSE-mounted HDFS).
Does anyone have an idea how to speed up searching for my use case?
Would it be reasonable to run fewer queries at once?
Thanks in advance,