I noticed that when inserting a string with multiple words, ES will clean up the spacing between them if there is more than one space.
Is that normal, and should ES really be doing that?
Well, depending on what you mean, yes.
By default, Elasticsearch uses the Standard Analyzer to analyze strings. The Standard Analyzer uses a tokenizer that splits on spaces (and special characters). So, using underscores to represent spaces, the string "quick_brown_fox" is tokenized into ["quick", "brown", "fox"].
In the case of double spaces, the tokenizer still splits on spaces, but the double space is ignored since there is no valid token between the two spaces. E.g. "quick__brown_fox" (with a double space after "quick") still tokenizes into ["quick", "brown", "fox"].
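As a rough illustration of that behavior (this is not the real Lucene tokenizer, which also splits on punctuation following the Unicode text-segmentation rules; the actual tokenization of a string can be inspected with Elasticsearch's `_analyze` API), Python's `str.split()` with no separator behaves the same way, in that runs of whitespace never produce empty tokens:

```python
# A minimal sketch approximating how the Standard Analyzer's tokenizer
# handles runs of whitespace: no empty token appears between two spaces.
def tokenize(text):
    # str.split() with no separator splits on any run of whitespace,
    # discarding the whitespace itself
    return text.lower().split()

print(tokenize("quick brown fox"))   # ['quick', 'brown', 'fox']
print(tokenize("quick  brown fox"))  # double space: same three tokens
```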
Basically, ES isn't saving the spaces at all, because they are being used to tokenize. If you don't want that behavior, you can specify a different analyzer that treats spaces differently, or set the field to not_analyzed and no processing will be done.
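For example, a mapping along these lines (the index, type, and field names here are just placeholders) marks a string field as not_analyzed, so the whole value is indexed as a single unmodified token:

```
curl -XPUT 'localhost:9200/my_index' -d '{
  "mappings": {
    "my_type": {
      "properties": {
        "title": {
          "type": "string",
          "index": "not_analyzed"
        }
      }
    }
  }
}'
```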
Well, it is saving the _source, and the spaces are in there. They should be returned exactly as you sent them if you ask for _source as part of the search, or perform a GET or something.
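For example (index and type names here are just placeholders), indexing a document with a double space and fetching it back with a GET should return the _source exactly as it was sent:

```
# Index a document whose title contains a double space
curl -XPUT 'localhost:9200/test/doc/1' -d '{"title": "quick  brown fox"}'

# Fetch it back; the _source in the response should still
# contain "quick  brown fox" with both spaces intact
curl -XGET 'localhost:9200/test/doc/1'
```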
Oops yeah, that's a good point. The _source is always saved as-is. What I was referring to is just the indexed and searchable data =)
I'm seeing the spaces removed in _source also.
Do you have any Source Transforms enabled? Alternatively, do you perhaps have some update scripts that might strip out spaces? Or something upstream of Elasticsearch (logstash, etc) that might manipulate the string?
Source Transform and updates are the only mechanisms I can think of that actively change the _source in Elasticsearch.
Both of them turn the source into a Java Map, play with the map, and then save it. They shouldn't be changing the spaces in the strings.
Honestly, your best bet is posting a minimal recreation of what you are seeing with curl.
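Something along these lines (index, type, and field names here are just placeholders) would be a good starting point for a recreation:

```
# Index a document containing a double space
curl -XPUT 'localhost:9200/spacetest/doc/1?refresh=true' \
  -d '{"title": "quick  brown fox"}'

# Search for it; the hit's _source should come back with both
# spaces intact, even though the indexed tokens are just
# [quick, brown, fox]
curl -XGET 'localhost:9200/spacetest/_search' \
  -d '{"query": {"match": {"title": "brown"}}}'
```

If the _source in that response already shows a single space, the problem is inside Elasticsearch; if not, something between ES and your display is changing it.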
I was really hoping you were going to tell me that I was technically correct.
Can you execute the raw curl command and check the returned JSON? It's possible the Head plugin (which I think is what's shown in that screenshot) is munging the response for display.
How are you indexing documents? Is it possible something is munging the data there?
It's not the Head plugin, because it also shows in the web app that's displaying the query result. I've looked at the JSON right before it's inserted into Elasticsearch and it still has the extra spaces. I'm doing a batch insert using the C# driver PlainElastic.Net.
Working on another project right now, so I'll post a curl example when I get a chance.