Is it normal for ES to clean up double spaces?

(Donny Velazquez) #1

I noticed that when I insert a string containing multiple words, ES cleans up the spacing between them if there is more than one space.

Is that normal and should ES really be doing that?

(Zachary Tong) #2

Well, depending on what you mean, yes. :smile:

By default, Elasticsearch will use the Standard Analyzer to analyze strings. The Standard Analyzer uses a tokenizer that splits on spaces (and special characters). So if we use underscores to represent spaces, the string "quick_brown_fox" is tokenized into ["quick", "brown", "fox"].

In the case of double spaces, the tokenizer still splits on spaces...but the double space is ignored since there is no valid token found in between the spaces. E.g. "quick__brown_fox" (with double space after "quick") still tokenizes into ["quick", "brown", "fox"]
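To make that concrete, here is a rough sketch in Python. It is not the real Standard Analyzer, just a regex stand-in (the `approximate_tokenize` helper is made up for illustration) that keeps runs of word characters and lowercases them. It is only meant to show why an extra space never produces an extra token.

```python
import re

def approximate_tokenize(text):
    # Hypothetical stand-in for the tokenizer: keep runs of word
    # characters, discard whatever sits between them, then lowercase.
    return [tok.lower() for tok in re.findall(r"\w+", text)]

single = approximate_tokenize("quick brown fox")
double = approximate_tokenize("quick  brown fox")  # two spaces after "quick"

print(single)
print(double)
print(single == double)
```

Both inputs produce the same three tokens, so from the index's point of view the two strings are indistinguishable.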

Basically, ES isn't saving the spaces at all, because they are being used to tokenize. If you don't want that behavior, you can specify a new analyzer that treats spaces differently. Or set the field to not_analyzed and no processing will be done.
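As a sketch of the not_analyzed option, the mapping body could look something like the following, built here as a Python dict and printed as JSON. The type name `record` and field name `address` are placeholders I made up, and this assumes a version of Elasticsearch that still uses `string` fields with an `index` setting.

```python
import json

# Hypothetical mapping: "record" and "address" are placeholder names.
mapping = {
    "mappings": {
        "record": {
            "properties": {
                "address": {
                    "type": "string",
                    # Index the value as one exact term, with no analysis.
                    "index": "not_analyzed",
                }
            }
        }
    }
}

# JSON body you would send when creating the index.
body = json.dumps(mapping, indent=2)
print(body)
```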

(Nik Everett) #3

Well, it is saving the _source and the spaces are in there. And they should be returned exactly as you sent them.

(Nik Everett) #4

That is, if you ask for _source as part of the search, or perform a GET or something.

(Zachary Tong) #5

Oops yeah, that's a good point. The _source is always saved as-is. What I was referring to is just the indexed and searchable data =)

(Donny Velazquez) #6

I'm seeing the spaces removed in _source also.

(Zachary Tong) #7

Do you have any Source Transforms enabled? Alternatively, do you perhaps have some update scripts that might strip out spaces? Or something upstream of Elasticsearch (logstash, etc) that might manipulate the string?

Source Transforms and updates are the only mechanisms I can think of that actively change the _source in Elasticsearch.

(Donny Velazquez) #8

No transforms or scripts. Both the Address property and the Grantors name should have extra spaces.

(Nik Everett) #9

Both of them turn the source into a Java Map, play with the map, and then save it. It shouldn't be changing the spaces in the strings.

Honestly your best bet is posting a minimal recreation of what you are seeing with curl.
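If curl isn't handy, the same two requests (index one document, then GET it back) can be sketched in Python with only the standard library. Everything specific here is an assumption: a node on `localhost:9200`, and placeholder index, type, and id values that you would pick yourself.

```python
import json
import urllib.request

BASE_URL = "http://localhost:9200"  # assumption: local node, default port

def doc_url(index, doc_type, doc_id):
    # Address of a single document: /{index}/{type}/{id}
    return f"{BASE_URL}/{index}/{doc_type}/{doc_id}"

def put_doc(url, doc):
    # Index the document by PUTting its JSON body to the document URL.
    body = json.dumps(doc).encode("utf-8")
    req = urllib.request.Request(
        url,
        data=body,
        method="PUT",
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req) as resp:
        return json.loads(resp.read().decode("utf-8"))

def get_source(url):
    # Fetch the document and return only its stored _source.
    with urllib.request.urlopen(url) as resp:
        return json.loads(resp.read().decode("utf-8"))["_source"]

def roundtrip_matches(index, doc_type, doc_id, doc):
    # True if what comes back is exactly what was sent, spaces included.
    url = doc_url(index, doc_type, doc_id)
    put_doc(url, doc)
    return get_source(url) == doc
```

Calling something like `roundtrip_matches("spacetest", "doc", "1", {"address": "123  Main   St"})` against a test node would then tell you whether the extra spaces survive the trip.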

I was really hoping you were going to tell me that I was technically correct.

(Zachary Tong) #10

Can you execute the raw curl command and check the returned JSON? It's possible the Head plugin (which I think is what's shown in that screenshot) is munging the response for display.
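One way to take the display layer out of the picture is to look at the `repr` of the field in the raw response text. A small Python sketch, where the `sample` response is a made-up body for illustration, not real output:

```python
import json

def field_repr(raw_response, field):
    # Parse the raw response text and show the field exactly as stored,
    # quotes and all, so runs of spaces are easy to see.
    doc = json.loads(raw_response)
    return repr(doc["_source"][field])

# Hypothetical response body with two spaces after "123".
sample = '{"_id": "1", "_source": {"address": "123  Main St"}}'

print(field_repr(sample, "address"))  # prints '123  Main St'
```

If the double space shows up here but not in the UI, the UI is what's collapsing it.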

How are you indexing documents? Is it possible something is munging the data there?

(Donny Velazquez) #11

It's not the Head plugin, because it also shows up in the web app that's displaying the query result. I've looked at the JSON right before it's inserted into Elasticsearch, and it still has the extra spaces. I'm doing a batch insert using the C# driver PlainElastic.Net.

I'm working on another project right now, so I'll post a curl example when I get a chance.
