Possible Bug: Unable to fetch term vectors after applying filter on ingested document

Hi all.

Situation: I am using Elasticsearch 8.4.1. I ingest a document, apply the apostrophe token filter (which removes everything from the apostrophe to the end of each token), apply a couple of other filters, and finally retrieve the term vectors.

Problem: When I ingest a document, I can't obtain any term vectors with the apostrophe filter applied, only term vectors without the filter. However, if I insert the same text as an artificial document into that same index, I do get the filtered term vectors.

Here is an analysis of the apostrophe filter, and it works as expected:

curl -X GET "localhost:9200/_analyze?pretty" -H 'Content-Type: application/json' -d'
{
  "tokenizer" : "whitespace",
  "filter" : ["apostrophe"],
  "text" : "Istanbul\u0027a veya Istanbul\u0027dan company\u0027s company Istanbul\u0027s they\u0027re Istanbul media\u0027ll medium media\u0027s media"
}
'
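As a sanity check, the filter's behaviour can be simulated in plain Python. This is only a rough sketch of what the whitespace tokenizer plus apostrophe filter do, not Elasticsearch's actual implementation:

```python
def apostrophe_filter(text):
    """Whitespace-tokenize, then strip each token from the first
    apostrophe (inclusive) to the end, mimicking Lucene's
    apostrophe filter. Rough approximation only."""
    return [token.split("'", 1)[0] for token in text.split()]

text = ("Istanbul'a veya Istanbul'dan company's company "
        "Istanbul's they're Istanbul media'll medium media's media")
print(apostrophe_filter(text))
# ['Istanbul', 'veya', 'Istanbul', 'company', 'company', 'Istanbul',
#  'they', 'Istanbul', 'media', 'medium', 'media', 'media']
```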

Now, here is a minimum working example of my index:

curl -X PUT "localhost:9200/test-index?pretty" -H 'Content-Type: application/json' -d'
{
  "settings": {
    "number_of_shards": 1,
    "analysis": {
      "analyzer":{
        "my_index":{
          "type": "custom",
          "tokenizer": "standard",
          "filter": ["apostrophe","asciifolding","lowercase"]
        },
        "my_search":{
          "type": "custom",
          "tokenizer": "standard",
          "filter": ["apostrophe","asciifolding","lowercase"]
        }
      }
    }
  },
  "mappings": {
    "properties": {
      "doc_id": { 
        "type": "keyword",
        "index": true,
        "store": true
      },
      "attachment.data":{
        "type": "text",
        "store": true,
        "fields": {
          "terms": {
            "type": "text",
            "store": true,
            "analyzer": "my_index",
            "search_analyzer": "my_search",
            "term_vector": "with_positions_offsets"
          }
        }
      }
    }
  }
}
'

I will now pass an artificial document to the _termvectors API, to show the term vectors I am looking for. This works fine for the artificial document:

curl -X GET "localhost:9200/test-index/_termvectors?pretty" -H 'Content-Type: application/json' -d'
{
  "doc" : {
  	"doc_id": "test-file-001",
    "attachment.data" : "Istanbul\u0027a veya Istanbul\u0027dan company\u0027s company Istanbul\u0027s they\u0027re Istanbul media\u0027ll medium media\u0027s media"
  },
  "fields": ["attachment.data.terms"],
  "offsets" : true,
  "payloads" : false,
  "positions" : true,
  "field_statistics" : false,
  "term_statistics" : true
}
'
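The terms I expect back from that request can be approximated in Python. This is a sketch of the full my_index chain (apostrophe, asciifolding, lowercase), with whitespace splitting standing in for the standard tokenizer; it is not the real Lucene analyzer:

```python
import unicodedata

def my_index_analyzer(text):
    """Rough approximation of the custom 'my_index' analyzer:
    tokenize on whitespace, cut each token at the first apostrophe,
    fold accented characters to ASCII, then lowercase."""
    terms = []
    for token in text.split():
        token = token.split("'", 1)[0]                 # apostrophe filter
        token = unicodedata.normalize("NFKD", token)   # asciifolding (approx.)
        token = token.encode("ascii", "ignore").decode()
        terms.append(token.lower())                    # lowercase filter
    return terms

text = ("Istanbul'a veya Istanbul'dan company's company "
        "Istanbul's they're Istanbul media'll medium media's media")
print(sorted(set(my_index_analyzer(text))))
# ['company', 'istanbul', 'media', 'medium', 'they', 'veya']
```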

However, when I actually ingest the document (rather than using an artificial one), things fall apart. I am using this ingest attachment pipeline:

curl -X PUT "http://localhost:9200/_ingest/pipeline/attachment?pretty" -H 'Content-Type: application/json' -d'
{
  "description" : "Extract attachment information using CBOR encoding",
  "processors" : [
    {
      "attachment" : {
        "field" : "data",
        "indexed_chars": -1,
        "ignore_missing": true
      }
    },
    {
      "set" : {
        "field": "last_update_time",
        "value": "{{_ingest.timestamp}}"
      }
    }
  ]
}
'

This is the content of the text file being ingested (the same as the artificial document):

Istanbul'a veya Istanbul'dan company's company Istanbul's they're Istanbul media'll medium media's media

In case it matters, ingestion is done using CBOR encoding via a Python script. The ingestion works fine, but the script is below for reference. I run it with python3 script.py.

#!/usr/local/bin/python3.8
import cbor2
import requests

file = "/var/www/test-file-001.txt"
headers = {'content-type': 'application/cbor'}

try:
    with open(file, 'rb') as f:
        doc = {
            'doc_id': "test-file-001",
            'data': f.read(),  # raw bytes; CBOR encodes binary natively
        }
    resp = requests.put(
        "http://127.0.0.1:9200/test-index/_doc/test-file-001?pipeline=attachment",
        data=cbor2.dumps(doc),
        headers=headers,
    )
    resp.raise_for_status()
    print("Ingested successfully")
except Exception as e:
    print("Error ingesting:", e)

After ingesting the file into the index, I try to fetch the term vectors, but the apostrophe filter does not get applied:

curl -X GET "localhost:9200/test-index/_termvectors/test-file-001?pretty" -H 'Content-Type: application/json' -d'
{
  "fields" : ["attachment.content"],
  "offsets" : true,
  "payloads" : false,
  "positions" : true,
  "field_statistics" : false,
  "term_statistics" : true
}
'

Note that the field I query above is attachment.content (which is produced by the ingest pipeline), whereas I used attachment.data.terms for the artificial document. If I instead request "fields" : ["attachment.data.terms"] or "fields" : ["attachment.data"] when fetching term vectors for the ingested document, I get no term vector results at all.

This is either a bug or I am missing something fundamental. Can someone please advise?

Solved my own problem. Glad I figured it out, because I am an Elasticsearch newbie. This was not a bug, but a "feature".

Solution (hopefully I am stating this correctly): it appears the term vectors request does not take the custom analyzer in the index definition into account for this field. You need to specify it explicitly using per_field_analyzer, i.e.:

curl -X GET "localhost:9200/test-index/_termvectors/test-file-001?pretty" -H 'Content-Type: application/json' -d'
{
  "fields" : ["attachment.content"],
  "per_field_analyzer": {
    "attachment.content":"my_index"
  },
  "offsets" : true,
  "payloads" : false,
  "positions" : true,
  "field_statistics" : false,
  "term_statistics" : true
}
'
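For completeness, the same request body can be assembled in Python before sending it to the _termvectors endpoint. This is just a sketch of the JSON structure (field and analyzer names taken from the index above; nothing here talks to a live server):

```python
import json

# _termvectors request body, overriding the analyzer for the
# dynamically mapped attachment.content field with the custom
# my_index analyzer defined in the index settings.
body = {
    "fields": ["attachment.content"],
    "per_field_analyzer": {"attachment.content": "my_index"},
    "offsets": True,
    "payloads": False,
    "positions": True,
    "field_statistics": False,
    "term_statistics": True,
}
print(json.dumps(body, indent=2))
```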

I am also new to this forum, so I can't find how to close a thread. Anyone more experienced, feel free to do that. Thanks!
