Keyword analyzer


#1

This may be a very stupid question, but how does this analyzer handle duplicates? If I add the following string 100000 times:

Caro m'è 'l sonno, e più l'esser di sasso, mentre che 'l danno e la vergogna dura; non veder, non sentir m'è gran ventura; però non mi destar, deh, parla basso.

(this is a random poem from the internet)
will it store 100000 copies or just 1?


(Ivan Brusic) #2

Lucene, and therefore Elasticsearch, is schema-less and multi-valued, so
if you insert the same data multiple times to the same field, there will be
multiple instances of the same field with the data. So to answer your
question, it will store 100000 copies.

Elasticsearch does use string ordinals, but only the most common strings
are converted. Not sure if string length is a factor.


(David Pilato) #3

Well, in the inverted index the text is stored once as a term and associated with all the doc IDs.
But in other data structures, like the _source stored field, there will be one copy per document.

My 2 cents.
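
To make the distinction above concrete, here is a toy model in plain Python. It is only a sketch of the idea, not how Lucene is implemented: `index_documents`, `inverted_index`, and `source_store` are made-up names, and the keyword analyzer is modeled as "the whole string is one term".

```python
from collections import defaultdict

def index_documents(values):
    """Toy model: each value is indexed as a single keyword term."""
    inverted_index = defaultdict(list)  # term -> list of doc ids (postings)
    source_store = {}                   # doc id -> original value (models _source)
    for doc_id, value in enumerate(values):
        inverted_index[value].append(doc_id)
        source_store[doc_id] = value
    return inverted_index, source_store

poem = "Caro m'è 'l sonno, e più l'esser di sasso"
inverted_index, source_store = index_documents([poem] * 100000)

print(len(inverted_index))        # 1 distinct term
print(len(inverted_index[poem]))  # 100000 doc ids attached to that term
print(len(source_store))          # 100000 per-document source entries
```

Python shares one string object between all the entries here, so the dictionary only models the count of stored copies, not their real size on disk.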


#4

Well, I don't know how to handle this problem, so how would you handle it? As far as I know, Elasticsearch is not about relations (unlike SQL, for example, where you would just create another table and move the duplicated values into it).


(David Pilato) #5

And? What is the problem you want to solve? Is disk space your concern?


#6

Well, kinda, yeah.
Let's imagine that I will have 20,000,000 records with around 4 string fields each. Every string is about 40 characters, and 1 character is 1 byte.
20000000 * 4 * 40 = 3200000000 bytes, and 3200000000 / 1024 / 1024 ≈ 3052 MB of duplicates. (The number of records is 100% accurate; 4 strings per record and 40 characters per string are rough averages.)

With this amount of SSD space it is not so important, but it still doesn't sound great. Anyway, it would be great to know a way of "fixing" this. Not the most important thing, though.
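
The estimate above can be checked with a few lines of Python. This is just the same back-of-the-envelope arithmetic, assuming 1 byte per character as stated; it says nothing about what Elasticsearch would actually use on disk after compression.

```python
records = 20_000_000
fields_per_record = 4    # rough average from the post
bytes_per_string = 40    # about 40 characters at 1 byte each

total_bytes = records * fields_per_record * bytes_per_string
total_mb = total_bytes / 1024 / 1024

print(total_bytes)      # 3200000000
print(round(total_mb))  # 3052
```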


(David Pilato) #7

But are you looking for a datastore or a search engine?

I mean: don't overthink tuning. Our defaults are very often good, so just index and search.

For now, my simple advice (without knowing more about your use case) would be: disable the _all field. (It is not a good default, and this will be fixed in 6.0.)
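
As a rough sketch of what disabling `_all` could look like, the Python below builds a create-index body for a pre-6.0 (typed) mapping. The type name `item` and the field names `mod` and `average` are made up for illustration, and the body is only printed here, not sent to a cluster.

```python
import json

# Hypothetical mapping for a pre-6.0 index: "item", "mod" and "average"
# are illustrative names, not taken from the thread.
index_body = {
    "mappings": {
        "item": {
            "_all": {"enabled": False},  # turn off the catch-all _all field
            "properties": {
                "mod": {"type": "keyword"},
                "average": {"type": "float"},
            },
        }
    }
}

# This JSON would be the body of a create-index request.
print(json.dumps(index_body, indent=2))
```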


(David Pilato) #8

Oh yeah, and measure the effect of any tuning decision you make.

For example, a simple sample dataset I'm using, with 15 fields, takes around 250 MB for 1 million documents, with no real tuning.


#9

I'm storing a database of in-game items that have different stats, like "Adds # - # damage to attacks". I plan to replace each number with # and store the stat as ["Adds # - # damage to attacks", (firstNumber), (secondNumber), (average)] (I will probably have separate fields for those). Most of the time the search will be something like [string = X AND average < Y AND average > Z]. As you can see, I'm not very familiar with Elasticsearch yet, but as far as I know Elasticsearch doesn't compare the strings themselves, so it should be.. not a tragedy.
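
The plan above could be sketched roughly like this in Python. The helper names `normalize_stat` and `build_query` and the field names `mod`, `values`, and `average` are assumptions for illustration, and the query assumes `mod` is mapped as a keyword field so that a `term` filter matches the whole template string.

```python
import re

NUMBER = re.compile(r"\d+(?:\.\d+)?")

def normalize_stat(text):
    """Replace each number with '#' and keep the numbers in separate fields."""
    numbers = [float(n) for n in NUMBER.findall(text)]
    average = sum(numbers) / len(numbers) if numbers else None
    return {"mod": NUMBER.sub("#", text), "values": numbers, "average": average}

def build_query(template, low, high):
    """Query body for: mod = template AND low < average < high."""
    return {
        "query": {
            "bool": {
                "filter": [
                    {"term": {"mod": template}},
                    {"range": {"average": {"gt": low, "lt": high}}},
                ]
            }
        }
    }

doc = normalize_stat("Adds 10 - 20 damage to attacks")
query = build_query(doc["mod"], 5, 25)
print(doc["mod"])      # Adds # - # damage to attacks
print(doc["average"])  # 15.0
```

Putting both conditions in `filter` means they only include or exclude documents and do not contribute to scoring, which fits an exact-match plus range lookup.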

