I am confused about keyword, doc_values and analysis.
I found two topics (/keyword-datatype-and-analysis/66359 and help-understanding-keyword-vs-not-analyzed/10374) on this subject, but I cannot tell whether their answers address my question.
I understood that keywords are stored as doc_values and that doc_values are not analyzed. This is why they can be used for filtering and for any operation that would otherwise need fielddata.
But it is possible to run a match query on a keyword field, and match queries are analyzed. So how is that possible if keywords are not analyzed?
This works quite well and I do not understand how.
PUT test
{
  "mappings": {
    "doc": {
      "properties": {
        "category": {
          "type": "keyword"
        }
      }
    }
  }
}
Elasticsearch creates multiple data structures out of the documents that you index. One of those data structures is the inverted index. Both text and keyword type string fields get their values indexed in such an inverted index, and it is the inverted index that Elasticsearch uses to answer queries on those fields.
Another data structure that gets created is doc values. Doc values are used for operations like aggregations and sorting. Doc values are available for keyword fields, but not for text fields. This is why by default you can aggregate and sort on keyword fields but not on text fields. If you want to aggregate and sort on text fields, you have to enable fielddata. Fielddata is very similar to doc values, with the difference being that fielddata is built in memory when needed, while doc values are written to disk when you index your documents.
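To make the difference between the two structures concrete, here is a toy Python model. It is only a sketch: it is not how Lucene stores anything, and the names `docs`, `inverted_index` and `doc_values` are made up for illustration. The inverted index goes from a term to the documents that contain it, which suits queries. Doc values go from a document to its value, which suits sorting and aggregations.

```python
from collections import Counter, defaultdict

# Hypothetical documents: doc id -> value of a keyword field "category"
docs = {1: "New York", 2: "Boston", 3: "New York"}

# Inverted index: term -> set of doc ids ("which docs contain this term?")
inverted_index = defaultdict(set)
for doc_id, value in docs.items():
    inverted_index[value].add(doc_id)

# Doc values: doc id -> value, column style ("what is the value of doc N?")
doc_values = dict(docs)

# Query: look the term up in the inverted index
matching = sorted(inverted_index["New York"])

# Aggregation (similar in spirit to a terms aggregation): count each value
buckets = Counter(doc_values[doc_id] for doc_id in docs)

# Sort: order doc ids by their field value
sorted_ids = sorted(docs, key=lambda doc_id: doc_values[doc_id])
```

Answering the query never touches `doc_values`, and the aggregation and the sort never touch `inverted_index`. That is the sense in which the two structures serve different operations.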
Now, how does text analysis play into all this? Text analysis is the processing that Elasticsearch applies to text fields. By default, any value that you index as a text field will be lowercased and broken up into its individual words. For example, a string like "New York" becomes new and york, and these two tokens will be put in the inverted index.
Whenever you query such a text field with, for example, a match query, Elasticsearch will apply the same analysis to your query string. It is this processing that allows you to search case-insensitively for the individual words new or york and find the document that contains the string "New York".
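Here is a small Python sketch of that idea. It is a simplification under stated assumptions: `analyze` is a rough stand-in for the default analyzer (lowercase, split into words), and `index_text` and `match_text` are hypothetical helpers, not Elasticsearch APIs. The point is that the same function runs at index time and at query time.

```python
import re
from collections import defaultdict

def analyze(text):
    """Rough stand-in for the default analyzer: lowercase, then split into words."""
    return re.findall(r"\w+", text.lower())

def index_text(docs):
    """Build a toy inverted index for a text field: token -> set of doc ids."""
    index = defaultdict(set)
    for doc_id, value in docs.items():
        for token in analyze(value):
            index[token].add(doc_id)
    return index

def match_text(index, query):
    """Toy match query: analyze the query string, then OR the tokens together."""
    hits = set()
    for token in analyze(query):
        hits |= index.get(token, set())
    return hits

text_index = index_text({1: "New York"})
```

With this model, `match_text(text_index, "YORK")` finds document 1, because both the indexed value and the query string were reduced to the token york before being compared.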
Text analysis is not applied to keyword fields. Whenever you index a string like "New York" in a keyword field, what Elasticsearch will put in the inverted index is the exact original string "New York". When you query that field, the query string will also not get analyzed and as a result, you will only be able to find this document if you query for the exact string "New York", capital N, capital Y and one space between those words.
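The keyword case can be sketched the same way. Again this is a toy model with hypothetical helper names, and it assumes the keyword field has no normalizer configured. The whole string is stored as a single term, and the query string is looked up as-is.

```python
def index_keyword(docs):
    """Toy inverted index for a keyword field: the whole value is one term."""
    index = {}
    for doc_id, value in docs.items():
        index.setdefault(value, set()).add(doc_id)
    return index

def match_keyword(index, query):
    """Toy match query on a keyword field: the query string is not changed."""
    return index.get(query, set())

keyword_index = index_keyword({1: "New York"})
```

Here `match_keyword(keyword_index, "New York")` finds document 1, while "new york" or "York" find nothing. So a match query does work on a keyword field. It just compares the untouched query string against the untouched indexed value.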
If I were to summarize all this I would say: text fields are for full text searches (case-insensitive search on individual words, powered by text analysis), while keyword fields are for exact-value matching, sorting and aggregating.
(To format code on this forum you can use the </> button)
Many thanks for this complete and detailed explanation. I now understand that keywords are stored in the inverted index without analysis. This is why it is possible to search on them, but the usual analysis is applied neither to the keyword nor to the search query.