Keyword analyzer

111116 · May 19, 2017, 5:50pm

This may be a very stupid question, but how does this analyzer work with duplicates? If I add 100000 strings like

Caro m'è 'l sonno, e più l'esser di sasso, mentre che 'l danno e la vergogna dura; non veder, non sentir m'è gran ventura; però non mi destar, deh, parla basso.

(this is a random poem from the internet)
will it store 100000 copies or just 1?

Ivan · May 19, 2017, 6:36pm

Lucene, and therefore Elasticsearch, is schema-less and multi-valued, so
if you insert the same data multiple times to the same field, there will be
multiple instances of the same field with the data. So to answer your
question, it will store 100000 copies.

Elasticsearch does use string ordinals, but only the most common strings
are converted. Not sure if string length is a factor.

dadoonet · May 19, 2017, 6:43pm

Well, in the inverted index it will store the text once and associate it with all the doc IDs.
But in other data structures, like the _source stored field, there will be a copy for each document.

My 2 cents.
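
To make the distinction above concrete, here is a toy sketch in Python. It is only a conceptual model, not how Lucene is actually implemented: `index_documents`, `inverted`, and `stored` are hypothetical names, and the keyword analyzer is modeled simply as "the whole string is one term".

```python
from collections import defaultdict


def index_documents(docs):
    """Toy model of indexing with a keyword analyzer.

    Returns (inverted, stored):
      inverted: term -> list of doc ids (one entry per distinct term)
      stored:   doc id -> original value (one entry per document, like _source)
    """
    inverted = defaultdict(list)
    stored = {}
    for doc_id, text in enumerate(docs):
        # Keyword analyzer: the entire string is treated as a single term.
        inverted[text].append(doc_id)
        # The stored copy is kept per document, regardless of duplicates.
        stored[doc_id] = text
    return inverted, stored


poem = "Caro m'è 'l sonno, e più l'esser di sasso"
inverted, stored = index_documents([poem] * 1000)

print(len(inverted))        # number of distinct terms
print(len(inverted[poem]))  # number of doc ids attached to that term
print(len(stored))          # number of stored per-document copies
```

In this model the term appears once in the inverted structure with 1000 doc IDs attached, while the stored side holds one entry per document.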

111116 · May 19, 2017, 7:09pm

Well, I don't know how to handle this problem. How would you handle it? As far as I know, Elasticsearch is not about relations (unlike SQL, for example, where you would just create another table and put the duplicated values in it).

dadoonet · May 19, 2017, 7:37pm

And? What is the problem you want to solve? Is disk space your concern?

111116 · May 20, 2017, 10:36am

Well, kinda, yeah.
Let's imagine that I will have 20,000,000 records with around 4 string fields each. Every string is about 40 characters, and 1 character is 1 byte.
20000000 * 4 * 40 = 3200000000 bytes, and 3200000000 / 1024 / 1024 ≈ 3052 MB of duplicates. (The number of records is 100% accurate; 4 strings per record is an approximate average, as is 40 characters per string.)

With the amount of SSD space available this is not so important, but it does not sound very good, at least. Anyway, it would be great to know a way of "fixing" this. Not the most important thing, though.
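
The estimate above can be checked with a few lines of Python. This is a back-of-the-envelope calculation using only the figures from the post; the variable names are illustrative, and it assumes one byte per character and no compression or index overhead.

```python
records = 20_000_000
fields_per_record = 4      # approximate average from the post
bytes_per_string = 40      # ~40 characters, assuming 1 byte per character

# Raw size of all string values if every one is stored separately.
total_bytes = records * fields_per_record * bytes_per_string
total_mib = total_bytes / 1024 / 1024

print(total_bytes)
print(round(total_mib))
```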

dadoonet · May 20, 2017, 10:58am

But are you looking for a datastore or a search engine?

I mean: don't overthink tuning. Very often the defaults are good. So just index and search.

For now, my simple advice (without knowing more about your use case) would be: disable the _all field. (Having it enabled is not a good default, and this will be fixed in 6.0.)
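
As a rough sketch of what disabling _all could look like, the Python below builds an index-creation request body. It assumes Elasticsearch 5.x, where mappings are keyed by a type name; the type name `item` and the helper `mapping_without_all` are hypothetical, and the body would be sent when creating the index.

```python
import json


def mapping_without_all(type_name):
    """Build an index-creation body that disables the _all field.

    Assumes the Elasticsearch 5.x mapping layout (mappings keyed by type name).
    """
    return {
        "mappings": {
            type_name: {
                "_all": {"enabled": False},
            }
        }
    }


body = mapping_without_all("item")  # "item" is a made-up type name
print(json.dumps(body, indent=2))
```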

dadoonet · May 20, 2017, 11:00am

Oh yeah, and measure the effect of any tuning decision you make.

For example, a simple sample dataset I'm using with 15 fields takes around 250 MB for 1 million documents, with no real tuning.

111116 · May 20, 2017, 11:32am

I'm storing a database of in-game items which have different stats like "Adds # - # damage to attacks". I plan to replace the numbers with # and store each stat like ["Adds # - # damage to attacks", (firstNumber), (secondNumber), (average)] (I will probably have separate fields for those). Most of the time the search will be something like [string = X AND average < Y AND average > Z]. As you can see, I'm not very familiar with Elasticsearch yet, but as far as I know Elasticsearch doesn't compare the strings themselves, so it should be... not a tragedy.
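
A search of the shape [string = X AND average < Y AND average > Z] might be expressed with a bool query combining a term filter and a range filter. The sketch below only builds the request body in Python; the field names `stat` and `average` and the helper `build_item_query` are assumptions for illustration, and it presumes `stat` is indexed as an exact, un-analyzed value (for example a keyword field).

```python
import json


def build_item_query(stat_text, min_avg, max_avg):
    """Build a search body: exact match on the stat string AND min_avg < average < max_avg."""
    return {
        "query": {
            "bool": {
                "filter": [
                    # Exact match against the whole templated stat string.
                    {"term": {"stat": stat_text}},
                    # Strict bounds on the numeric average.
                    {"range": {"average": {"gt": min_avg, "lt": max_avg}}},
                ]
            }
        }
    }


query = build_item_query("Adds # - # damage to attacks", 10, 50)
print(json.dumps(query, indent=2))
```

Using filter clauses rather than scored clauses seems a reasonable fit here, since the conditions are yes/no matches and relevance ranking is not needed.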

system · June 17, 2017, 11:33am

This topic was automatically closed 28 days after the last reply. New replies are no longer allowed.
