Hi,
We have the following search scenarios, and to achieve them we are using the
analyzer settings given below. With these settings, the index on disk ends up
almost six times the size of the raw data: indexing 1 GB of raw data produces
roughly a 6 GB index in Elasticsearch. Memory consumption also shoots up to
4-5 GB while indexing.
I want to know:
- Is the high memory usage and disk space a result of using Shingle and
NGram together?
- Is there any other combination of analyzers that gives us the same
behaviour but uses less disk space and memory?
- Note that we have set "term_vector" to "with_positions_offsets" as we need
highlighting for the content.
Version being used: 0.20.6
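For reference, this is roughly the mapping we apply for the highlighted
field (index, type and field names here are placeholders, not our real
ones):

curl -XPUT 'localhost:9200/myindex/doc/_mapping' -d '{
  "doc": {
    "properties": {
      "content": {
        "type": "string",
        "term_vector": "with_positions_offsets"
      }
    }
  }
}'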
Partial Searches
e.g. if we have documents with content like:
DOC1 -> "Search isn’t just free text search anymore – it’s about
exploring your data. Understanding it. Gaining insights that will make your
business better or improve your product"
DOC2 -> "Store complex real world entities in Elasticsearch as structured
JSON documents. All fields are indexed by default, and all the indices can
be used in a single query, to return results at breathtaking speed."
DOC3 -> "Operators like *, +, -, % are used for
performing arithmetic operations"
Search Queries:
Query String (search) - Should result in both DOC1 and DOC2
Query String (doc) - Should result in DOC2 (as it partially matches
"documents")
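For illustration, such a partial-match search looks roughly like the
following (index name "myindex" is a placeholder; "content" is the field we
highlight):

curl -XGET 'localhost:9200/myindex/_search' -d '{
  "query": { "query_string": { "query": "doc" } },
  "highlight": { "fields": { "content": {} } }
}'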
As-Is Searches
Should search for phrases as-is, without tokenizing, when given in double
quotes. This includes the special characters. For this, we are specifying
"keyword" as the explicit analyzer in our search queries, in addition to the
node-level analyzer settings.
e.g. For the same set of DOCs
Search Queries:
Query String ("search") - Should result in only DOC1 and NOT DOC2 (because
in DOC2 its a partial match)
Query String ("like *") - Should result in only DOC3, * should NOT be
treated as wildcard.
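For these we send something along these lines (again, the index name is a
placeholder; the "analyzer" parameter of query_string forces the keyword
analyzer for the quoted phrase):

curl -XGET 'localhost:9200/myindex/_search' -d '{
  "query": {
    "query_string": {
      "query": "\"like *\"",
      "analyzer": "keyword"
    }
  }
}'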
Analyzer settings @ node level
index:
analysis:
analyzer:
default_search:
type: custom
tokenizer: whitespace
filter: [lowercase,stop,asciifolding,kstem]
default_index:
type: custom
tokenizer: whitespace
                filter: [lowercase, asciifolding, my_shingle, kstem, my_ngram, stop]
filter:
my_ngram:
max_gram: 50
type: nGram
min_gram: 2
my_shingle:
type: shingle
max_shingle_size: 5
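For what it's worth, the token output of the two filters combined can be
inspected with the analyze API (index name again a placeholder). A 5-word
shingle is easily 30+ characters long, and nGram with min_gram 2 /
max_gram 50 then emits sum(len - n + 1) for n = 2..len grams per shingle,
i.e. hundreds of terms for a single shingle:

curl -XGET 'localhost:9200/myindex/_analyze?analyzer=default_index' -d 'exploring your data'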
Appreciate any help!!
-katta