Mtermvectors API not working properly

Thek10patil · April 19, 2020, 6:05am

Hi Team,

I am currently using Elasticsearch for the journal data in my research lab. I am trying to retrieve term vectors for 25 documents at a time with the mtermvectors API from Python. Sometimes I get the result, but most of the time I get the error "Undecodable raw error response from the server: No JSON object could be decoded". I cannot figure out why the same HTTP request sometimes fails and sometimes returns the expected result.

Example HTTP request that sometimes works but fails most of the time:

http://servername/elastic/indexname/article/_mtermvectors?offsets=false&fields=plain_text&ids=85183%2C85190%2C85191%2C85194%2C85196%2C85197%2C85198%2C85203%2C85207%2C85211%2C85214%2C85230%2C85244%2C85252%2C85267%2C85269%2C85275%2C85279%2C85280%2C85284%2C85296%2C83702%2C83704%2C83707%2C83720&field_statistics=false&term_statistics=true&payloads=false&positions=false&realtime=false

Any help would be appreciated.

xeraa · April 20, 2020, 11:28pm

That is a Python error I guess, right? Could you log the raw Elasticsearch response?
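
One way to capture the raw response is sketched below, using only the Python standard library. The helper name `decode_or_report` is hypothetical, and it assumes you can get at the undecoded body and status code from whatever HTTP client you use. The idea is to log the start of a non-JSON body so that something like a proxy error page shows up instead of a generic decode failure.

```python
import json
import logging

logger = logging.getLogger("mtermvectors_debug")


def decode_or_report(raw_body: bytes, status: int):
    """Return the parsed JSON body, or None after logging the raw text.

    Hypothetical helper: `raw_body` and `status` are whatever your HTTP
    client returned for the _mtermvectors call.
    """
    text = raw_body.decode("utf-8", errors="replace")
    try:
        return json.loads(text)
    except ValueError:
        # Not JSON: possibly an HTML error page from something sitting in
        # front of Elasticsearch. Log the first 500 characters of it.
        logger.error("HTTP %s, non-JSON body: %.500s", status, text)
        return None
```

A `None` return together with the logged body should make it clear whether the response came from Elasticsearch at all.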

Also, I tried to run the query and it just worked for me 10 consecutive times (though without Python), so I'm wondering what is happening there. My first thought was that you might hit the max length of a URI, but if you're always running the same query that won't be it. Out of curiosity: does the same problem happen if you switch from a URI query to a request body one (as used in the docs)?
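
For reference, a rough sketch of what the request body variant could look like from Python, using only the standard library. The function names and the `url` argument are assumptions for illustration. The body layout (`ids` plus a shared `parameters` object) follows the form shown in the multi term vectors docs, and the `realtime=false` flag from the original URL is left out here because it is a query-string option.

```python
import json
import urllib.request


def build_mtermvectors_body(ids, field):
    """Build a request body mirroring the query-string parameters."""
    return {
        "ids": [str(i) for i in ids],
        "parameters": {
            "fields": [field],
            "term_statistics": True,
            "field_statistics": False,
            "offsets": False,
            "payloads": False,
            "positions": False,
        },
    }


def post_mtermvectors(url, body, timeout_s=60):
    """POST the body to a full _mtermvectors URL and return the raw bytes.

    `url` is an assumption, for example
    "http://servername/elastic/indexname/_mtermvectors".
    urlopen raises urllib.error.HTTPError for 4xx/5xx responses.
    """
    request = urllib.request.Request(
        url,
        data=json.dumps(body).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )
    with urllib.request.urlopen(request, timeout=timeout_s) as response:
        return response.read()
```

Calling `build_mtermvectors_body([85183, 85190], "plain_text")` gives a body for two of the IDs from the example URL; the same dictionary could also be passed as `body` to the Python client's `mtermvectors` call.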

Thek10patil · April 20, 2020, 11:41pm

I am currently working on Elasticsearch with a large number of documents (around 500K) in an index. I want to store the n-grams of each document's text (which is also large, about 2 pages of text per document) in another index. So I am calculating the term vectors and their counts in each document and storing them in the new index, so that I can run aggregation queries on the new index.
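
To make the goal concrete, here is a minimal sketch of how an mtermvectors response could be flattened into one row per (document, term) for the new index. The function name `term_counts` and the output field names are hypothetical; the input layout (`docs`, `term_vectors`, `terms`, `term_freq`, `doc_freq`) is the shape the API documents, with `doc_freq` only present when `term_statistics=true`.

```python
def term_counts(response, field="plain_text"):
    """Flatten an mtermvectors response into one dict per (doc, term)."""
    rows = []
    for doc in response.get("docs", []):
        # Documents that were not found have no "term_vectors" key.
        terms = doc.get("term_vectors", {}).get(field, {}).get("terms", {})
        for term, stats in terms.items():
            rows.append({
                "article_id": doc.get("_id"),
                "term": term,
                "term_freq": stats.get("term_freq", 0),
                # None when term statistics were not requested.
                "doc_freq": stats.get("doc_freq"),
            })
    return rows
```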

The settings of the old index allow me to use the termvectors and mtermvectors APIs. I don't want to send too many requests to the Elasticsearch server in a short amount of time, so I am going with the mtermvectors Python API. I am trying to get the term vectors of 25 documents by passing their 25 IDs.

Sample HTTP URL produced by calling the mtermvectors API in Python:
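
The batching itself is roughly the following (a simplified sketch; the helper name `batches` is just for illustration, and 25 is the batch size I am using):

```python
def batches(ids, size=25):
    """Yield consecutive slices of `ids`, each with at most `size` items."""
    ids = list(ids)
    for start in range(0, len(ids), size):
        yield ids[start:start + size]
```

Each yielded list is what gets passed as the `ids` of one mtermvectors call.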

http://*servername*/elastic/*indexname*/article/_mtermvectors?offsets=false&fields=plain_text&ids=608467%2C608469%2C608473%2C608475%2C608477%2C608482%2C608485%2C608492%2C608498%2C608504%2C608509%2C608511%2C608520%2C608522%2C608528%2C608530%2C608541%2C608549%2C608562%2C608570%2C608573%2C608576%2C608577%2C608579%2C608585&field_statistics=true&term_statistics=true&payloads=false&positions=false

Sometimes I get the expected response, but most of the time I get the following error:

Proxy Error
The proxy server received an invalid response from an upstream server.
The proxy server could not handle the request GET /elastic/*indexname*/article/_mtermvectors.

Reason: Error reading from remote server

Index settings and mapping:

{
  "settings": {
    "analysis": {
      "analyzer": {
        "shingleAnalyzer": {
          "tokenizer": "letter_tokenizer",
          "filter": [
            "lowercase",
            "custom_stop",
            "custom_shingle",
            "custom_stemmer",
            "length_filter"
          ]
        }
      },
      "filter": {
        "custom_stemmer": {
          "type": "stemmer",
          "name": "english"
        },
        "custom_stop": {
          "type": "stop",
          "stopwords": "_english_"
        },
        "custom_shingle": {
          "type": "shingle",
          "min_shingle_size": "2",
          "max_shingle_size": "4",
          "filler_token": ""
        },
        "length_filter": {
          "type": "length",
          "min": 2
        }
      },
      "tokenizer": {
        "letter_tokenizer": {
          "type": "letter"
        }
      }
    }
  },
  "mappings": {
    "properties": {
      "article_id": {
        "type": "text"
      },
      "plain_text": {
        "term_vector": "with_positions_offsets_payloads",
        "store": true,
        "analyzer": "shingleAnalyzer",
        "type": "text",
        "fields": {
          "keyword": {
            "type": "keyword",
            "ignore_above": 256
          }
        }
      }
    }
  }
}

I don't think there is any problem with these settings and this mapping, as I sometimes get the expected response.

Please let me know if you need more information from my side. Any help will be appreciated.

Thek10patil · April 20, 2020, 11:43pm

Hi Xeraa,

I have added more details about the error and the mapping I used to create the index.

I am not able to figure out why the same request only returns a response some of the time.

xeraa · April 21, 2020, 12:31am

The "Proxy Error" would be interesting. IMO your proxy is hiding what is happening here. What is that invalid response? Maybe some logs from the proxy or maybe also Elasticsearch might shed some light.

Thek10patil · April 21, 2020, 12:51am

I don't have direct access to that server; I have asked the admin to share those logs with me. Once I have them, I will share them here.

But the server admin told me that there was a timeout error from the server. The admin increased that timeout to 10 minutes and tried sending the same request multiple times, but still got the same error. Any idea what could be the reason behind that?

Thek10patil · April 21, 2020, 2:52am

One more thing: because of this issue I tried the termvectors API instead and decided to call it 25 times for 25 documents.

I have created a new thread for that.

@xeraa If you are aware of that issue, please let me know. That would be a great help.

xeraa · April 21, 2020, 4:24am

Both threads sound like they have a similar issue:

There is no timeout on Elasticsearch queries by default. You might want to add one to actually fail the query on the Elasticsearch side rather than running into a proxy timeout.
Turn on the slow query log to capture any suspiciously slow queries to figure out what is going wrong; say 3s threshold to WARN.
I would enable cluster monitoring to get a better understanding of bottlenecks.
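
As a rough sketch of the slow log suggestion: the settings body below could be sent with `PUT /indexname/_settings`. The helper name `slowlog_settings` is hypothetical; the two setting keys are the standard search slow log thresholds. Note that the search slow log is about search requests, and whether term vector requests show up in it is not something confirmed in this thread.

```python
import json


def slowlog_settings(warn_threshold="3s"):
    """Build an index settings body that logs slow search phases at WARN."""
    return {
        "index.search.slowlog.threshold.query.warn": warn_threshold,
        "index.search.slowlog.threshold.fetch.warn": warn_threshold,
    }


# JSON payload for the update index settings request.
payload = json.dumps(slowlog_settings("3s"))
```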

system · May 19, 2020, 4:24am

This topic was automatically closed 28 days after the last reply. New replies are no longer allowed.