Elasticsearch extracts text inside the title attribute of anchor tags when adding a PDF attachment

sandeepv · October 18, 2019, 4:41pm

Using Kibana, I indexed a base64-encoded PDF file into Elasticsearch with pipeline=attachment.

When I queried the PDF with highlighting, I got the result below:

dan@1abmedia.com

https://www.globenewswire.com/Tracker?data=D3_Gb2rHO0RJs5ptt_YaxhbTrJp2no3K1iZwzpAcG4YENDUM1UZ9wuY6DcxpQ5h0Se8zNYREibjVPWmQf024bA==

The URL in the highlight is actually the href of the anchor tag; the same value is also in its title attribute.

When I add the PDF to Elasticsearch, it also indexes the value of the title attribute. Is there any way to keep it from picking up these tooltips?

When I fetch the document by ID, the content also contains the URL.

Does anyone know how to avoid this, so that only the text between the HTML tags is indexed and the text inside the tags' attribute values is ignored?

Please help.

dadoonet · October 18, 2019, 5:32pm

I'm wondering if you are looking for https://www.elastic.co/guide/en/elasticsearch/reference/current/analysis-htmlstrip-charfilter.html ?
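
For context, here is a minimal sketch of how that char filter is usually wired into a custom analyzer, written as a Python dict that mirrors the JSON body of an index-creation request. The analyzer name `html_text` and the choice to apply it to `attachment.content` are assumptions for illustration, not something from this thread. Keep in mind that a char filter runs at analysis time and strips markup from the text being tokenized; it does not change the stored `_source`.

```python
import json

# Hypothetical analyzer name; any name works.
ANALYZER_NAME = "html_text"


def build_index_body(analyzer_name=ANALYZER_NAME):
    """Return an index-creation body with an html_strip-based analyzer.

    Assumes Elasticsearch 7.x style typeless mappings.
    """
    return {
        "settings": {
            "analysis": {
                "analyzer": {
                    analyzer_name: {
                        "type": "custom",
                        # Strips HTML markup before tokenizing.
                        "char_filter": ["html_strip"],
                        "tokenizer": "standard",
                        "filter": ["lowercase"],
                    }
                }
            }
        },
        "mappings": {
            "properties": {
                "attachment": {
                    "properties": {
                        "content": {"type": "text", "analyzer": analyzer_name}
                    }
                }
            }
        },
    }


if __name__ == "__main__":
    # Prints the JSON that would be sent as the body of PUT <index>.
    print(json.dumps(build_index_body(), indent=2))
```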

sandeepv · October 19, 2019, 12:02pm

Thanks for your reply, @dadoonet. But how do I apply that to the data below?
Below is the PUT request that adds the PDF file using pipeline=attachment:

PUT pdf/_doc/1?pipeline=attachment
{
  "data": "base64 encoded string"
}
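
A minimal sketch of producing that request body from a local file, assuming Python. The helper names `build_attachment_doc` and `load_attachment_doc` are hypothetical; the field name `data` matches the request above.

```python
import base64
import json


def build_attachment_doc(pdf_bytes):
    """Return the request body for PUT pdf/_doc/<id>?pipeline=attachment."""
    # The attachment processor expects the binary file as a base64 string.
    encoded = base64.b64encode(pdf_bytes).decode("ascii")
    return {"data": encoded}


def load_attachment_doc(path):
    """Read a file from disk and wrap it in the request body."""
    with open(path, "rb") as handle:
        return build_attachment_doc(handle.read())


if __name__ == "__main__":
    # Stand-in bytes instead of a real PDF, just to show the shape of the body.
    print(json.dumps(build_attachment_doc(b"%PDF-1.4 sample")))
```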

link to the pdf file

Please try indexing the attached PDF file after converting it to base64, and then search for
dan@1abmedia.com using the request below:

GET /pdf/_search
{
  "_source": false,
  "from":0,
  "size":20,
  "query": {
    "query_string": {
      "query": "dan@1abmedia.com"
    }
  },
  "highlight": {
    "fields": {
      "content": {
        "type": "plain",
        "fragment_size": 500,
        "number_of_fragments": 1
      },
      "attachment.content": {
        "fragment_size": 500,
        "number_of_fragments": 1,
        "type": "plain"
      },
      "attachment.title": {
        "fragment_size": 500,
        "number_of_fragments": 1,
        "type": "unified"
      }
    }
  }
}

Response:

"hits" : {
    "total" : {
      "value" : 2,
      "relation" : "eq"
    },
    "max_score" : 3.0848434,
    "hits" : [      
      {
        "_index" : "pdf",
        "_type" : "_doc",
        "_id" : "1",
        "_score" : 0.68484235,
        "highlight" : {
          "attachment.content" : [
            """
 of the company in general, see Gritstone’s most recent Quarterly Report on Form 10-Q filed on August 12, 2019
and any current and periodic reports filed with the Securities and Exchange Commission.

Contacts
Media:
<em>Dan</em> Budwick
1AB
xxxxxxxxxxxx
<em>dan</em>@<em>1abmedia.com</em>

https://www.globenewswire.com/Tracker?data=D3_Gb2rHO0RJs5ptt_YaxhbTrJp2no3K1iZwzpAcG4YENDUM1UZ9wuY6DcxpQ5h0Se8zNYREibjVPWmQf024bA==


Investors:
Alexandra Santos
Wheelhouse Life Science Advisors
xxxxxxxxxx
xxxxxx@xxxxxx.com
"""
          ]
        }
      }
    ]
  }
}

A URL appears in the search result above. It is actually the tooltip of the link dan@1abmedia.com, and it should not show up in the highlights.

Elasticsearch is indexing the text inside the title attribute's value. I want to avoid that. Is there any way to do it?

Any help would be greatly appreciated.

Thanks.

dadoonet · October 19, 2019, 1:59pm

I see. I can't think of any workaround.
You could maybe add a gsub processor to your pipeline that removes all http strings... But that's not what you are looking for here.
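
A rough sketch of that gsub idea, assuming the pipeline is named `attachment`, reads the base64 string from `data`, and writes the extracted text to `attachment.content` (the attachment processor's default target). The URL pattern is an assumption as well. The Python below only builds the pipeline body and uses `re.sub` to approximate locally what the gsub step would do to the extracted text. It strips every URL from the content, not only those that came from tooltips.

```python
import json
import re

# Assumed pattern: any http(s) URL up to the next whitespace.
URL_PATTERN = r"https?://\S+"


def build_pipeline(pattern=URL_PATTERN):
    """Body for PUT _ingest/pipeline/attachment (pipeline name assumed)."""
    return {
        "description": "Extract attachment text, then drop URLs",
        "processors": [
            # Step 1: extract text from the base64 string in "data".
            {"attachment": {"field": "data"}},
            # Step 2: remove URLs from the extracted text.
            {
                "gsub": {
                    "field": "attachment.content",
                    "pattern": pattern,
                    "replacement": "",
                    "ignore_missing": True,
                }
            },
        ],
    }


def strip_urls(text, pattern=URL_PATTERN):
    """Local approximation of the gsub step, for trying the pattern out."""
    return re.sub(pattern, "", text)


if __name__ == "__main__":
    # Made-up sample text shaped like the extracted content in this thread.
    sample = (
        "dan@1abmedia.com\n\n"
        "https://example.com/Tracker?data=abc==\n\n"
        "Investors:"
    )
    print(json.dumps(build_pipeline(), indent=2))
    print(strip_urls(sample))
```

The email address survives because it does not start with `http://` or `https://`; any URL that is genuinely part of the visible text would be removed too.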

system · November 16, 2019, 1:59pm

This topic was automatically closed 28 days after the last reply. New replies are no longer allowed.
