Elasticsearch is not able to search for non-English text present in PDF attachments


(Prashant Agrawal) #1

Hi Team,

We are facing an issue when searching for non-English text indexed from a PDF document. The complete details are below.

  1. I have a PDF document, New_Pdf_issue.pdf, which is attached to this mail.
  2. I created an indexing request along with a mapping, attached as pdf_index_issue.sh.
  3. The PDF contains keywords such as "अधिकार", but when I search for "अधिकार" I do not get any matching documents.

Note: this is what we observed when running the following search query:
{
  "fields": [
    "SessionAtt.content_type",
    "SessionAtt"
  ],
  "query": {
    "bool": {
      "must": [
        {
          "query_string": {
            "fields": [
              "Content",
              "SessionAtt"
            ],
            "query": "*"
          }
        }
      ]
    }
  }
}

We observe that the word "अधिकार" has been indexed as "अधधकार".
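
For reference, the search from step 3 that returns no hits uses the same request body with the literal term in place of "*". A minimal Python sketch that only builds and prints that body (nothing is sent to a cluster; the helper name build_term_query is made up for illustration, while the field names are the ones from the query above):

```python
import json


def build_term_query(term, fields=("Content", "SessionAtt")):
    """Build the same request body as above, but for a specific search term.

    build_term_query is a hypothetical helper, not part of any client library.
    """
    return {
        "fields": ["SessionAtt.content_type", "SessionAtt"],
        "query": {
            "bool": {
                "must": [
                    {"query_string": {"fields": list(fields), "query": term}}
                ]
            }
        },
    }


# The word as a user would type it, written as escapes to make the code points explicit.
term = "\u0905\u0927\u093f\u0915\u093e\u0930"  # अधिकार
body = build_term_query(term)

# ensure_ascii=False keeps the Devanagari text readable in the printed JSON.
print(json.dumps(body, ensure_ascii=False, indent=2))
```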

Can anyone let me know what the issue could be?

~Prashant

Attachments:
pdf_index_issue.sh <http://elasticsearch-users.115913.n3.nabble.com/file/n4074717/pdf_index_issue.sh>
New_Pdf_issue.pdf <http://elasticsearch-users.115913.n3.nabble.com/file/n4074717/New_Pdf_issue.pdf>


(Robert Muir-3) #2

It's your PDF (and the font being used plays a role in this case).

PDFs encode glyphs (display order), not characters (logical order).
Usually the distinction is not important, but for complex writing systems
it matters.

Open your PDF in Acrobat, highlight the word in question, and copy/paste
it; you will see it pastes the same way.
You can also see this bogus mapping clearly if you extract the font data
with FontForge (attached).
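
To make the mismatch concrete, here is a small Python sketch (an illustration, not the attached font dump) that compares the two strings from the first post code point by code point. The variable names are made up; the strings are the searched word and the word as it came out of the PDF.

```python
import unicodedata

# Logical-order text, as a user types it in a search box.
expected = "\u0905\u0927\u093f\u0915\u093e\u0930"   # अधिकार
# Text as extracted from the PDF and indexed.
extracted = "\u0905\u0927\u0927\u0915\u093e\u0930"  # अधधकार


def describe(text):
    """Return one 'U+XXXX NAME' string per code point in text."""
    return [f"U+{ord(ch):04X} {unicodedata.name(ch)}" for ch in text]


# Collect every position where the two strings hold different code points.
mismatches = [
    (i, want, got)
    for i, (want, got) in enumerate(zip(describe(expected), describe(extracted)))
    if want != got
]

for i, want, got in mismatches:
    print(f"position {i}: expected {want}, got {got}")
```

The only difference is the vowel sign: where the search term has DEVANAGARI VOWEL SIGN I, the extracted text has a second DEVANAGARI LETTER DHA. That sign is drawn to the left of its consonant, so it is one of the places where display order and logical order differ, and the result is consistent with the font mapping that glyph to the wrong character. The analyzer never sees the original word, so a search for it cannot match.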

In reply to the message from Prashant Agrawal <prashant.agrawal@paladion.net>, Tue, May 12, 2015 at 5:17 AM.

--
View this message in context:
http://elasticsearch-users.115913.n3.nabble.com/Elasticsearch-is-not-able-to-search-for-Nonnglish-text-present-in-PDF-type-of-attachment-tp4074717.html
Sent from the Elasticsearch Users mailing list archive at Nabble.com.

--
Please update your bookmarks! We have moved to https://discuss.elastic.co/

You received this message because you are subscribed to the Google Groups
"elasticsearch" group.
To unsubscribe from this group and stop receiving emails from it, send an
email to elasticsearch+unsubscribe@googlegroups.com.
To view this discussion on the web visit
https://groups.google.com/d/msgid/elasticsearch/1431422241775-4074717.post%40n3.nabble.com
.
For more options, visit https://groups.google.com/d/optout.



(Prashant Agrawal) #3

Hi Robert,

I didn't fully understand this. Can you elaborate on whether anything can be done about it from the Elasticsearch or plugin side, or whether this is an inconsistency inherent to PDF attachments that nothing can be done about?

It all depends on the PDF content: some PDFs are indexed properly and some are not.

~Prashant


(Prashant Agrawal) #4

Hi All,

Is there any solution for the issue mentioned above?

~Prashant

