Elasticsearch is not able to search for non-English text present in PDF attachments

Hi Team,

We are facing an issue when searching for non-English text indexed from a PDF document. The complete details are below.

  1. I have a PDF document, New_Pdf_issue.pdf, which is attached to this mail.
  2. I created an indexing request along with the mapping, attached as pdf_index_issue.sh (http://elasticsearch-users.115913.n3.nabble.com/file/n4074717/pdf_index_issue.sh).
  3. The PDF contains keywords such as "अधिकार", but when I search for "अधिकार" I do not get any matching documents.

Note: This is what we observed when running the following search query:

{ 
  "fields": [ 
    "SessionAtt.content_type", 
    "SessionAtt" 
  ], 
  "query": { 
    "bool": { 
      "must": [ 
        {
          "query_string": {
            "fields": [
              "Content",
              "SessionAtt"
            ],
            "query": "*" 
          } 
        } 
      ] 
    } 
  } 
} 

We observe that the word "अधिकार" has been indexed as "अधधकार".
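
To make the difference between the two spellings concrete, here is a minimal Python sketch that compares them code point by code point. The helper names (codepoints, diff_positions) are only illustrative and are not part of Elasticsearch or of the attached script; the two strings are copied from the sentence above.

```python
import unicodedata

def codepoints(text):
    """Return (hex code point, Unicode name) pairs for each character."""
    return [(f"U+{ord(ch):04X}", unicodedata.name(ch, "UNKNOWN")) for ch in text]

def diff_positions(a, b):
    """Return (index, char_in_a, char_in_b) for every position where a and b differ."""
    return [(i, x, y) for i, (x, y) in enumerate(zip(a, b)) if x != y]

expected = "अधिकार"  # the word as it appears in the PDF
indexed = "अधधकार"   # the word as it is returned from the index

# Print each differing position with the code points on both sides.
for pos, x, y in diff_positions(expected, indexed):
    print(pos, codepoints(x), codepoints(y))
```

If the two strings differ in their code points, as they do here, a query for the first spelling cannot match a term stored with the second spelling, whatever the query itself looks like.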

Can anyone tell me what could be causing this?

Note: I was not able to upload the PDF document and the script file here, so please see the post linked below for both.
http://elasticsearch-users.115913.n3.nabble.com/Elasticsearch-is-not-able-to-search-for-Nonnglish-text-present-in-PDF-type-of-attachment-td4074717.html
~Prashant

Hi All,

Is there any solution or suggestion for this?

~Prashant

Hi All,

Is there any solution or workaround for the issue described above?

~Prashant

This might help:
http://gibrown.com/2013/05/01/three-principles-for-multilingal-indexing-in-elasticsearch/

Look at the section on stop words and analyzers. It mentions a Hindi analyzer that you might want to use in your mapping.
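
As a rough illustration of what using it in the mapping could look like, here is a minimal Python sketch that builds an index-creation body assigning Elasticsearch's built-in "hindi" analyzer to the "Content" field from the query above. The type name "doc", the "string" field type (1.x-era syntax) and the helper name build_index_body are assumptions made for the example; the real mapping in pdf_index_issue.sh may differ, and the attachment field "SessionAtt" would need its own settings.

```python
import json

def build_index_body(type_name="doc", field="Content"):
    """Build an index-creation body that applies the built-in 'hindi' analyzer.

    type_name and the 'string' field type are assumptions (1.x-era mapping
    syntax); adjust them to match the actual mapping.
    """
    return {
        "mappings": {
            type_name: {
                "properties": {
                    field: {
                        "type": "string",
                        "analyzer": "hindi",  # built-in language analyzer
                    }
                }
            }
        }
    }

body = build_index_body()
# Print the JSON that would be sent as the request body when creating the index.
print(json.dumps(body, indent=2, ensure_ascii=False))
```

An analyzer only controls how the extracted text is tokenized and normalized; on its own it would not change which characters were extracted from the PDF.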

Sarwar