Words with special characters are not displayed correctly after indexing into Elasticsearch

Hello, I need your help.
My data looks like this (my JSON files):

{"index": {"_id":2530}}
{"visitor_id": "4395667", "timestamp": "2020-08-07T10:24:00Z", "station_id": "Place A"}
{"index": {"_id":2531}}
{"visitor_id": "4395668", "timestamp": "2020-08-07T11:09:55Z", "station_id": "Réaumur"}

I use the following curl command to index it into Elasticsearch:
curl -s -H "Content-Type:application/json" -XPOST http://localhost:9200/gpsdata/log/_bulk?pretty --data-binary "@%%f"

I am working with the Python Elasticsearch client to run some queries on my data, for example:

from elasticsearch import Elasticsearch

es = Elasticsearch('http://localhost:9200')

# Terms aggregation: count documents per distinct station_id value
search_param = {
    "size": 0,
    "aggs": {
        "langs": {
            "terms": {"field": "station_id.keyword"}
        }
    }
}

response = es.search(index="gpsdata", body=search_param)
response['aggregations']

The result I get after running the Python code is:


{'langs': {'doc_count_error_upper_bound': 0,
  'sum_other_doc_count': 0,
  'buckets': [{'key': 'Dor', 'doc_count': 639},
   {'key': 'Réaumur', 'doc_count': 92},
   {'key': 'PlaceA', 'doc_count': 61},
   {'key': 'Curé Crampette', 'doc_count': 33},
   {'key': 'Le Prieuré', 'doc_count': 17},

Words with accents, like "Réaumur", are not displayed correctly. Thanks in advance.

Bonjour and welcome!

It's probably because you are not using UTF-8 to encode your text. Elasticsearch expects UTF-8.
Most likely you are using ISO-8859-1.
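
To illustrate what an encoding mismatch does to accented characters, here is a minimal sketch. It assumes the text was written as UTF-8 and then read as ISO-8859-1 somewhere along the way, which is only one possible cause of the symptom:

```python
# Sketch: UTF-8 bytes that are decoded as ISO-8859-1 turn into mojibake.
original = "Réaumur"

# "é" becomes two bytes (0xC3 0xA9) in UTF-8.
utf8_bytes = original.encode("utf-8")

# Reading those bytes as ISO-8859-1 maps each byte to its own character.
misread = utf8_bytes.decode("iso-8859-1")

print(utf8_bytes)  # b'R\xc3\xa9aumur'
print(misread)     # RÃ©aumur
```

If an accented letter shows up as two characters starting with "Ã", this kind of mismatch is a likely suspect.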

Thanks for your answer. Could you please tell me how I can use UTF-8 encoding?

Check the encoding of your json files.
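
One way to do that check from Python is sketched below. The helper name `is_valid_utf8` and the demo file names are made up for illustration. Note that this only tells you whether the bytes are valid UTF-8; a file can pass the check and still contain text that was already damaged before it was saved.

```python
import os
import tempfile


def is_valid_utf8(path):
    """Return True if the file's raw bytes decode cleanly as UTF-8."""
    with open(path, "rb") as f:
        data = f.read()
    try:
        data.decode("utf-8")
        return True
    except UnicodeDecodeError:
        return False


# Demo with two throwaway files: one saved as UTF-8, one as ISO-8859-1.
line = '{"station_id": "Réaumur"}'
with tempfile.TemporaryDirectory() as tmp:
    good = os.path.join(tmp, "good.json")
    bad = os.path.join(tmp, "bad.json")
    with open(good, "wb") as f:
        f.write(line.encode("utf-8"))
    with open(bad, "wb") as f:
        f.write(line.encode("iso-8859-1"))
    good_ok = is_valid_utf8(good)
    bad_ok = is_valid_utf8(bad)

print(good_ok, bad_ok)  # True False
```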

When I opened my JSON files with Windows 10 Notepad, they show something like this: "F\u00c3\u00a9tilly", and "UTF-8" is mentioned at the bottom of the Notepad window.
Is the problem in decoding those characters, or were they already broken at the source?

I can't tell. Share the file somewhere and I'll take a look.

I could not upload it here, so I uploaded it to my Google Drive instead, and here is a link to it:

The file looks like this (the field where I have the problem is "module_nb"):

{"265626570": [{"visitor_id": "265626570", "timestamp": "2020-08-11T17:47:13Z", "station_id": "UNKNOWN", "module_nb": "F\u00c3\u00a9tilly",..

Try with:

https://gist.githubusercontent.com/dadoonet/55b88a88d789a0dcfd9f722ad85a65bb/raw/b27f08892c83b0b99b6279ee56c5cb9158ef45f4/265626570.json

I changed the bad characters to UTF-8 encoded accents.
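
For larger files, a repair like this could be scripted. The sketch below is not necessarily how the file above was fixed; it assumes the damage is UTF-8 text that was misread as ISO-8859-1 (Latin-1) before being written to JSON, and the helper name `fix_mojibake` is hypothetical:

```python
import json


def fix_mojibake(value):
    """Recursively undo 'UTF-8 read as Latin-1' damage in parsed JSON data."""
    if isinstance(value, str):
        try:
            # Turn each character back into the byte it came from,
            # then decode those bytes as the UTF-8 they originally were.
            return value.encode("latin-1").decode("utf-8")
        except (UnicodeEncodeError, UnicodeDecodeError):
            # Not this kind of damage (or already correct): leave as-is.
            return value
    if isinstance(value, list):
        return [fix_mojibake(v) for v in value]
    if isinstance(value, dict):
        return {fix_mojibake(k): fix_mojibake(v) for k, v in value.items()}
    return value


# Sample record modeled on the field shown earlier in this thread.
raw = r'{"module_nb": "F\u00c3\u00a9tilly", "station_id": "UNKNOWN"}'
repaired = fix_mojibake(json.loads(raw))

# ensure_ascii=False keeps the accent as a real character in the output.
print(json.dumps(repaired, ensure_ascii=False))
# {"module_nb": "Fétilly", "station_id": "UNKNOWN"}
```

Strings that are already correct, such as "Réaumur", fail the round trip and are returned unchanged, so running the function on clean data should be harmless. When writing the result back to disk, open the file with `encoding="utf-8"`.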

Thank you so much for your time, sorry if my question wasn't clear in the beginning, I am a beginner...

For people who may run into the same problem in the future: replace the bad characters, or check the source you get your data from. The data should be UTF-8 encoded.
