Elasticsearch is not able to search for non-English text in PDF attachments


(Prashant Agrawal) #1

Hi Team,

We are facing an issue when searching for non-English text in a document indexed as a PDF attachment. The complete details are below.

  1. I have a PDF document, New_Pdf_issue.pdf, which is attached to this mail.
  2. I created an indexing request together with the mapping, which is attached as pdf_index_issue.sh (http://elasticsearch-users.115913.n3.nabble.com/file/n4074717/pdf_index_issue.sh).
  3. If you look at the PDF attachment, you will find keywords such as "अधिकार". When I search for "अधिकार", I do not get any matching documents.

Note: This is what we observed when we ran the following search query:

{ 
  "fields": [ 
    "SessionAtt.content_type", 
    "SessionAtt" 
  ], 
  "query": { 
    "bool": { 
      "must": [ 
        { 
          "query_string": { 
            "fields": [ 
              "Content", 
              "SessionAtt"
            ], 
            "query": "*" 
          } 
        } 
      ] 
    } 
  } 
} 

We see that the word "अधिकार" has been indexed as "अधधकार".
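
To make the difference between the two forms concrete, here is a minimal Python sketch that compares them code point by code point. The two strings are the ones quoted above, written as escapes so the code points are unambiguous; the helper names `describe` and `differences` are hypothetical and used only for illustration.

```python
import unicodedata

# The keyword as it appears in the PDF: "अधिकार"
expected = "\u0905\u0927\u093f\u0915\u093e\u0930"
# The form found in the index: "अधधकार"
indexed = "\u0905\u0927\u0927\u0915\u093e\u0930"


def describe(ch):
    """Return the code point and Unicode name of a single character."""
    return f"U+{ord(ch):04X} {unicodedata.name(ch)}"


def differences(a, b):
    """List (position, char in a, char in b) wherever the strings differ."""
    return [
        (i, describe(x), describe(y))
        for i, (x, y) in enumerate(zip(a, b))
        if x != y
    ]


for pos, want, got in differences(expected, indexed):
    print(f"position {pos}: expected {want}, indexed {got}")
```

The two strings differ in a single position: the vowel sign I (U+093F) has been replaced by a second letter DHA (U+0927). A search for the correctly spelled word therefore cannot match the indexed term.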

Can anyone let me know what could be causing this?

Note: I was not able to upload the PDF document and the script file here, so please get them from the post linked below.
http://elasticsearch-users.115913.n3.nabble.com/Elasticsearch-is-not-able-to-search-for-Nonnglish-text-present-in-PDF-type-of-attachment-td4074717.html
~Prashant


(Prashant Agrawal) #2

Hi All,

Is there any solution or suggestion for this?

~Prashant


(Prashant Agrawal) #3

Hi All,

Is there any solution or workaround for the query mentioned above?

~Prashant


(Sarwar Bhuiyan) #4

This might help:

Look at the section on stop words and analyzers. There is a Hindi analyzer there that you might want to use in your mapping.
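
As a rough sketch of what that could look like, the Python snippet below builds a mapping that applies the built-in `hindi` analyzer to the extracted text of the `SessionAtt` field. The mapping in pdf_index_issue.sh is not shown in this thread, so the type name `doc`, the helper name `attachment_mapping`, and the `fields`/`content` layout (assumed here to follow the mapper-attachments plugin with `string` fields) are all assumptions and would need to be adapted to the real mapping.

```python
import json


def attachment_mapping(type_name, field, analyzer="hindi"):
    """Build a mapping body that analyzes the attachment's text with `analyzer`.

    Assumption: the field is an `attachment` field whose extracted text
    is exposed as a `content` sub-field of type `string`.
    """
    return {
        type_name: {
            "properties": {
                field: {
                    "type": "attachment",
                    "fields": {
                        "content": {
                            "type": "string",
                            "analyzer": analyzer,
                        }
                    },
                }
            }
        }
    }


# "doc" is a hypothetical type name; "SessionAtt" is the field from the query above.
mapping = attachment_mapping("doc", "SessionAtt")
print(json.dumps(mapping, indent=2))
```

Note that an analyzer only changes how the extracted text is tokenized and normalized. If the characters are already altered when the text is extracted from the PDF, changing the analyzer alone is unlikely to fix the search.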

Sarwar

