Is it normal for ES to clean up double spaces?

Donny_Velazquez · December 8, 2015, 8:35pm

I noticed that when I insert a string with multiple words, ES cleans up the spacing between them if there is more than one space.

Is that normal and should ES really be doing that?

polyfractal · December 8, 2015, 9:10pm

Well, depending on what you mean, yes.

By default, Elasticsearch will use the Standard Analyzer to analyze strings. The Standard Analyzer uses a tokenizer that splits on spaces (and special characters). So if we use underscores to represent spaces, the string "quick_brown_fox" is tokenized into ["quick", "brown", "fox"].

In the case of double spaces, the tokenizer still splits on spaces...but the double space is ignored since there is no valid token found in between the spaces. E.g. "quick__brown_fox" (with double space after "quick") still tokenizes into ["quick", "brown", "fox"]
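To make the double-space case concrete, here is a minimal Python sketch. It is not the Standard Analyzer itself, only a simplified stand-in that splits on spaces and lowercases; the function name `simple_tokenize` is made up for illustration.

```python
import re

def simple_tokenize(text):
    """Split on single spaces, drop empty pieces, lowercase the rest.

    Simplified stand-in for illustration only; the real Standard Analyzer
    also handles punctuation and Unicode word boundaries.
    """
    # Splitting on each space yields an empty string between two adjacent
    # spaces; dropping it mirrors "no valid token found in between".
    pieces = re.split(r" ", text)
    return [p.lower() for p in pieces if p]

single = simple_tokenize("quick brown fox")
double = simple_tokenize("quick  brown fox")  # two spaces after "quick"
print(single)
print(double)
```

Both calls produce the same token list, so the extra space leaves no trace in the indexed terms.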

Basically, ES isn't saving the spaces at all, because they are being used to tokenize. If you don't want that behavior, you can specify a new analyzer that treats spaces differently. Or set the field to not_analyzed and no processing will be done.

nik9000 · December 8, 2015, 9:29pm

Well, it is saving the _source and the spaces are in there. And they should be returned exactly as you sent them.

nik9000 · December 8, 2015, 9:29pm

If you ask for _source as part of the search. Or perform a GET or something.

polyfractal · December 8, 2015, 9:32pm

Oops yeah, that's a good point. The _source is always saved as-is. What I was referring to is just the indexed and searchable data =)

Donny_Velazquez · December 8, 2015, 9:34pm

I'm seeing the spaces removed in _source also.

polyfractal · December 8, 2015, 9:38pm

Do you have any Source Transforms enabled? Alternatively, do you perhaps have some update scripts that might strip out spaces? Or something upstream of Elasticsearch (logstash, etc) that might manipulate the string?

Source Transforms and updates are the only mechanisms I can think of that actively change the _source in Elasticsearch.

Donny_Velazquez · December 8, 2015, 9:45pm

No transforms or scripts. The Address property should have extra spaces, and so should the Grantor's name.

nik9000 · December 8, 2015, 9:46pm

Both of them turn the source into a Java Map, play with the map, and then save it. That shouldn't be changing the spaces in the strings.
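A rough Python analogue of that parse, modify, and save cycle, assuming a dict stands in for the Java Map. The document and its values are hypothetical; only the `Address` field name comes from this thread.

```python
import json

# Hypothetical document with repeated spaces inside a string value.
original = '{"Address": "12  Main   St", "count": 1}'

# Parse into a dict (the Python analogue of a Java Map).
doc = json.loads(original)

# "Play with the map": change an unrelated field.
doc["count"] += 1

# Serialize again, then parse the saved form to inspect it.
saved = json.dumps(doc)
roundtrip = json.loads(saved)

# repr() makes every space visible.
print(repr(roundtrip["Address"]))
```

The string value comes back with its spacing intact, which is why a plain map round trip is an unlikely culprit.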

Honestly your best bet is posting a minimal recreation of what you are seeing with curl.

I was really hoping you were going to tell me that I was technically correct.

polyfractal · December 9, 2015, 4:45pm

Can you execute the raw curl command and check the returned JSON? It's possible the Head plugin (which I think is what's shown in that screenshot) is munging the response for display.
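One way to check the returned JSON without relying on how a UI renders it is sketched below in Python. The helper name `fields_with_repeated_spaces` and the sample body are hypothetical; the only assumption about Elasticsearch is that the reply carries the document under a `_source` key.

```python
import json

def fields_with_repeated_spaces(raw_body):
    """Return {field: repr(value)} for top-level string fields in
    _source that contain two or more consecutive spaces."""
    source = json.loads(raw_body).get("_source", {})
    return {
        name: repr(value)
        for name, value in source.items()
        if isinstance(value, str) and "  " in value
    }

# Hypothetical response body, shaped like a GET-by-id reply.
sample_body = (
    '{"_id": "1", "found": true, '
    '"_source": {"Address": "12  Main St", "City": "Austin"}}'
)

report = fields_with_repeated_spaces(sample_body)
print(report)
```

Feeding it the raw body from the curl command would show whether the stored _source still has the extra spaces. If it does, the stripping is happening somewhere in the display layer.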

How are you indexing documents? Is it possible that something is munging the data there?

Donny_Velazquez · December 9, 2015, 4:59pm

It's not the Head plugin, because the problem also shows up in the web app that's displaying the query result. I've looked at the JSON right before it's inserted into Elasticsearch and it still has the extra spaces. I'm doing a batch insert using the C# driver PlainElastic.Net.

I'm working on another project right now, so I'll post a curl example when I get a chance.
