Unexpected behaviour of termvector API in Python

Thek10patil · April 21, 2020, 2:49am

I am working with an Elasticsearch index that holds a large number of documents (around 500K). I want to store the n-grams of each document's text in another index. The text is also large: roughly two pages per document. To do this, I am calculating the term vectors and their counts for each document and storing them in the new index, so that I can run aggregation queries on it.

The settings of the old index allow me to call the termvectors and mtermvectors APIs. Due to an Elasticsearch server timeout issue, the mtermvectors API throws a proxy error most of the time.

Sometimes I get the expected response, but most of the time I get the following error:

Proxy Error
The proxy server received an invalid response from an upstream server.
The proxy server could not handle the request GET /elastic/*indexname*/article/_mtermvectors.

Reason: Error reading from remote server

Sample mtermvectors HTTP URL generated by calling the mtermvectors API in Python:

http://*servername*/elastic/*indexname*/article/_mtermvectors?offsets=false&fields=plain_text&ids=608467%2C608469%2C608473%2C608475%2C608477%2C608482%2C608485%2C608492%2C608498%2C608504%2C608509%2C608511%2C608520%2C608522%2C608528%2C608530%2C608541%2C608549%2C608562%2C608570%2C608573%2C608576%2C608577%2C608579%2C608585&field_statistics=true&term_statistics=true&payloads=false&positions=false
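The request above asks for 25 documents with full term and field statistics in one call. One thing that might be worth trying is sending smaller batches with an explicit client-side timeout. The sketch below is an assumption about what could help, not a confirmed fix: `fetch_term_vectors_in_batches` and `chunked` are hypothetical helpers, the `batch_size` and `request_timeout` values are illustrative, and the client is assumed to be an elasticsearch-py 7.x-style object whose `mtermvectors` accepts `index`, `doc_type`, `body` and a per-request `request_timeout`. If the timeout is enforced by the proxy rather than by Elasticsearch, a longer client timeout alone will not change anything.

```python
from typing import Any, Dict, Iterable, Iterator, List


def chunked(ids: Iterable[str], size: int) -> Iterator[List[str]]:
    """Yield consecutive lists of at most `size` IDs."""
    if size < 1:
        raise ValueError("size must be at least 1")
    batch: List[str] = []
    for doc_id in ids:
        batch.append(doc_id)
        if len(batch) == size:
            yield batch
            batch = []
    if batch:
        yield batch


def fetch_term_vectors_in_batches(
    client: Any,
    index: str,
    doc_type: str,
    ids: Iterable[str],
    batch_size: int = 5,        # illustrative value, tune for your cluster
    request_timeout: int = 60,  # seconds, illustrative value
) -> List[Dict[str, Any]]:
    """Call mtermvectors once per small batch and collect the returned docs."""
    docs: List[Dict[str, Any]] = []
    for batch in chunked(ids, batch_size):
        # IDs and options travel in the request body instead of the query string.
        body = {
            "ids": batch,
            "parameters": {
                "fields": ["plain_text"],
                "term_statistics": True,
                "field_statistics": True,
                "offsets": False,
                "positions": False,
                "payloads": False,
            },
        }
        response = client.mtermvectors(
            index=index,
            doc_type=doc_type,
            body=body,
            request_timeout=request_timeout,
        )
        docs.extend(response.get("docs", []))
    return docs
```

Called as `fetch_term_vectors_in_batches(elasticsearch_client, INDEX_NAME, DOC_TYPE, ids)`, this issues several small requests in place of one large one, which also makes it easier to see whether failures correlate with particular documents.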

That's why I decided to use the termvectors API instead:

elasticsearch_client.termvectors(
    index=INDEX_NAME, doc_type=DOC_TYPE, id=fetched_id,
    fields=["plain_text"],
    offsets=False, positions=False, payloads=False,
    term_statistics=True, field_statistics=False,
)

Sample termvectors HTTP URL generated by calling the termvectors API in Python:

http://**servername**/elastic/**indexName**/article/608588/_termvectors?offsets=false&fields=plain_text&field_statistics=false&term_statistics=true&payloads=false&positions=false
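For the stated goal of storing each document's n-grams and their counts, the per-document counts can be read from the `term_freq` value of each term in the termvectors response. The following is a minimal sketch, assuming the usual response layout (`term_vectors` → field name → `terms`). `term_counts` is a hypothetical helper, and `sample_response` is made-up data for illustration, not output from the index described here.

```python
from typing import Any, Dict


def term_counts(response: Dict[str, Any], field: str = "plain_text") -> Dict[str, int]:
    """Map each term in a termvectors response to its in-document frequency."""
    terms = response.get("term_vectors", {}).get(field, {}).get("terms", {})
    return {term: info["term_freq"] for term, info in terms.items()}


# Made-up response fragment, trimmed to the keys this helper reads.
sample_response = {
    "_id": "608588",
    "found": True,
    "term_vectors": {
        "plain_text": {
            "terms": {
                "machin learn": {"term_freq": 3, "doc_freq": 120, "ttf": 410},
                "learn model": {"term_freq": 1, "doc_freq": 45, "ttf": 60},
            }
        }
    },
}

counts = term_counts(sample_response)
print(counts)
```

A document that has no term vectors for the field simply yields an empty dictionary, so the helper can be applied to every response without extra checks.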

This request behaves unexpectedly, and I cannot understand why: the same HTTP request gives a timeout error one time and a proper response the next.
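Since the same request fails on one attempt and succeeds on the next, a retry wrapper with exponential backoff is one possible workaround while the root cause is investigated. This is only a sketch and does not explain or fix the timeout itself. `call_with_retries` is a hypothetical helper, and the attempt count and delays are illustrative.

```python
import time
from typing import Callable, Tuple, Type, TypeVar

T = TypeVar("T")


def call_with_retries(
    func: Callable[[], T],
    retry_on: Tuple[Type[BaseException], ...] = (Exception,),
    attempts: int = 3,        # illustrative value
    base_delay: float = 1.0,  # seconds before the first retry, illustrative
    sleep: Callable[[float], None] = time.sleep,
) -> T:
    """Call `func`, retrying on the given exceptions with exponential backoff."""
    if attempts < 1:
        raise ValueError("attempts must be at least 1")
    for attempt in range(1, attempts + 1):
        try:
            return func()
        except retry_on:
            if attempt == attempts:
                raise  # out of attempts: surface the last error unchanged
            # Wait base_delay, then 2x, 4x, ... before the next try.
            sleep(base_delay * 2 ** (attempt - 1))
    raise AssertionError("unreachable")
```

With elasticsearch-py, the call could be wrapped as `call_with_retries(lambda: elasticsearch_client.termvectors(...), retry_on=(elasticsearch.exceptions.TransportError,))`, on the assumption that the proxy error reaches the client as a `TransportError`. Restricting `retry_on` this way avoids retrying on programming errors.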

Index setting and mapping

{
  "settings": {
    "analysis": {
      "analyzer": {
        "shingleAnalyzer": {
          "tokenizer": "letter_tokenizer",
          "filter": [
            "lowercase",
            "custom_stop",
            "custom_shingle",
            "custom_stemmer",
            "length_filter"
          ]
        }
      },
      "filter": {
        "custom_stemmer": {
          "type": "stemmer",
          "name": "english"
        },
        "custom_stop": {
          "type": "stop",
          "stopwords": "_english_"
        },
        "custom_shingle": {
          "type": "shingle",
          "min_shingle_size": "2",
          "max_shingle_size": "4",
          "filler_token":""
        },
        "length_filter": {
          "type": "length",
          "min": 2
        }
      },
      "tokenizer": {
        "letter_tokenizer": {
          "type": "letter"
        }
      }
    }
  },
  "mappings": {
    "properties": {"article_id":{"type": "text"},
      "plain_text": {
        "term_vector": "with_positions_offsets_payloads",
        "store": true,
        "analyzer": "shingleAnalyzer",
        "type": "text",
        "fields": {
          "keyword": {
            "type": "keyword",
            "ignore_above": 256
          }
        }
      }
    }
  }
}

I don't think there is a problem with these settings and mappings, since I sometimes get the expected response.

Please let me know if you need more information from my side. Any help will be appreciated.

system · May 19, 2020, 2:49am

This topic was automatically closed 28 days after the last reply. New replies are no longer allowed.
