Elasticsearch extracts text inside the title attribute of anchor tags when adding a PDF attachment

I have added a base64-encoded PDF file to Elasticsearch from Kibana, using pipeline=attachment.

When I queried the PDF with highlights, I got the result below:

dan@1abmedia.com

https://www.globenewswire.com/Tracker?data=D3_Gb2rHO0RJs5ptt_YaxhbTrJp2no3K1iZwzpAcG4YENDUM1UZ9wuY6DcxpQ5h0Se8zNYREibjVPWmQf024bA==

The URL inside the highlights is actually the href of the anchor tag, and the same value is in its title attribute.

When I added the PDF to Elasticsearch, it also indexed the title attribute's value. Is there any way to avoid indexing the tooltips?

When I viewed the PDF document by ID, the content also contained the URL.

Does anyone know how to avoid this and index only the text between the HTML tags, ignoring the text inside the tags' attribute values?

Please help...

I'm wondering if you are looking for https://www.elastic.co/guide/en/elasticsearch/reference/current/analysis-htmlstrip-charfilter.html ?

Thanks for your reply @dadoonet. But how do I apply it to the data below?
Below is the PUT request that adds the PDF file using pipeline=attachment:

PUT pdf/_doc/1?pipeline=attachment
{
  "data": "base64 encoded string"
}

link to the pdf file

Please try adding the attached PDF file after converting it to base64, and then search for
dan@1abmedia.com using the request below:

GET /pdf/_search
{
  "_source": false,
  "from":0,
  "size":20,
  "query": {
    "query_string": {
      "query": "dan@1abmedia.com"
    }
  },
  "highlight": {
    "fields": {
      "content": {
        "type": "plain",
        "fragment_size": 500,
        "number_of_fragments": 1
      },
      "attachment.content": {
        "fragment_size": 500,
        "number_of_fragments": 1,
        "type": "plain"
      },
      "attachment.title": {
        "fragment_size": 500,
        "number_of_fragments": 1,
        "type": "unified"
      }
    }
  }
}

Response:

"hits" : {
    "total" : {
      "value" : 2,
      "relation" : "eq"
    },
    "max_score" : 3.0848434,
    "hits" : [      
      {
        "_index" : "pdf",
        "_type" : "_doc",
        "_id" : "1",
        "_score" : 0.68484235,
        "highlight" : {
          "attachment.content" : [
            """
 of the company in general, see Gritstone’s most recent Quarterly Report on Form 10-Q filed on August 12, 2019
and any current and periodic reports filed with the Securities and Exchange Commission.

Contacts
Media:
<em>Dan</em> Budwick
1AB
xxxxxxxxxxxx
<em>dan</em>@<em>1abmedia.com</em>

https://www.globenewswire.com/Tracker?data=D3_Gb2rHO0RJs5ptt_YaxhbTrJp2no3K1iZwzpAcG4YENDUM1UZ9wuY6DcxpQ5h0Se8zNYREibjVPWmQf024bA==


Investors:
Alexandra Santos
Wheelhouse Life Science Advisors
xxxxxxxxxx
xxxxxx@xxxxxx.com
"""
          ]
        }
      }
    ]
  }
}

There is a URL in the search result above. It is actually the tooltip of the link dan@1abmedia.com, and it should not show up in the highlights.

Elasticsearch is indexing the text inside the title attribute's value. I want to avoid that. Is there any way to do it?

Any help would be greatly appreciated.

Thanks.

I see. I can't think of any workaround.
You could maybe add a gsub processor to your pipeline that removes all http strings... But that's not what you are looking for here.
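
A rough sketch of that gsub idea, in case it is still useful. It assumes the attachment processor reads the base64 string from the "data" field (as in the PUT request earlier in this thread) and writes the extracted text to attachment.content, which is its default target. The pipeline name attachment_no_urls and the pattern are only illustrative assumptions, not something from the original setup. Note that this strips every http/https URL from the extracted text, including URLs that are visibly printed in the PDF, not just the ones that come from link tooltips.

```console
PUT _ingest/pipeline/attachment_no_urls
{
  "description": "Extract PDF text, then strip http/https URLs (illustrative sketch)",
  "processors": [
    {
      "attachment": {
        "field": "data"
      }
    },
    {
      "gsub": {
        "field": "attachment.content",
        "pattern": "https?://\\S+",
        "replacement": "",
        "ignore_missing": true
      }
    }
  ]
}

PUT pdf/_doc/1?pipeline=attachment_no_urls
{
  "data": "base64 encoded string"
}
```

Since the gsub processor runs at ingest time, the URLs are removed from the stored attachment.content itself, so they should no longer appear in highlights or when fetching the document by ID. Existing documents would need to be reindexed through the new pipeline.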
