I am currently working with Elasticsearch on an index that holds a large number of documents (around 500K). I want to store n-grams of each document's text data (which is also large, roughly 2 pages of text per document) in another index. To do that, I calculate the term vectors and their counts for each document and store them in the new index, so that I can run aggregation queries on it.
The settings of the old index allow me to call the termvectors and mtermvectors APIs. However, due to an Elasticsearch server timeout, the mtermvectors API throws a proxy error most of the time.
Sometimes I get the expected response, but most of the time I get the following error:
Proxy Error
The proxy server received an invalid response from an upstream server.
The proxy server could not handle the request GET /elastic/*indexname*/article/_mtermvectors.
Reason: Error reading from remote server
Sample mtermvectors HTTP URL produced by calling the mtermvectors API from Python:
http://*servername*/elastic/*indexname*/article/_mtermvectors?offsets=false&fields=plain_text&ids=608467%2C608469%2C608473%2C608475%2C608477%2C608482%2C608485%2C608492%2C608498%2C608504%2C608509%2C608511%2C608520%2C608522%2C608528%2C608530%2C608541%2C608549%2C608562%2C608570%2C608573%2C608576%2C608577%2C608579%2C608585&field_statistics=true&term_statistics=true&payloads=false&positions=false
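For reference, that URL is produced by a call roughly like this from the Python client (the list of ids here is just a sample batch; the real batches contain about 25 ids per request, and elasticsearch_client, INDEX_NAME and DOC_TYPE are the same as in the termvectors call below):

# Sample batch of ids; the real batches contain ~25 ids per request.
ids_batch = ["608467", "608469", "608473", "608475", "608477"]

response = elasticsearch_client.mtermvectors(index=INDEX_NAME, doc_type=DOC_TYPE,
                                             ids=ids_batch, fields=["plain_text"],
                                             offsets=False, positions=False,
                                             payloads=False,
                                             term_statistics=True,
                                             field_statistics=True)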
That's why I decided to use the termvectors API instead:
# Fetch per-document term vectors; the result is used to build the n-gram index.
response = elasticsearch_client.termvectors(index=INDEX_NAME, doc_type=DOC_TYPE,
                                            id=fetched_id,
                                            fields=["plain_text"],
                                            offsets=False, positions=False,
                                            payloads=False,
                                            term_statistics=True,
                                            field_statistics=False)
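From that response I take each term and its term_freq to build the documents for the new index. A rough sketch of that step (the new index name "article_ngrams", its document shape and the "_doc" type are placeholders used here only for illustration):

from elasticsearch import helpers

# `response` is the result of the termvectors call above.
terms = response["term_vectors"]["plain_text"]["terms"]

# Illustrative target index name and document shape for the n-gram index.
actions = [
    {
        "_index": "article_ngrams",
        "_type": "_doc",  # placeholder doc type
        "_source": {"article_id": fetched_id, "ngram": term,
                    "count": stats["term_freq"]},
    }
    for term, stats in terms.items()
]
helpers.bulk(elasticsearch_client, actions)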
Sample termvectors HTTP URL produced by calling the termvectors API from Python:
http://**servername**/elastic/**indexName**/article/608588/_termvectors?offsets=false&fields=plain_text&field_statistics=false&term_statistics=true&payloads=false&positions=false
This request shows unexpected behaviour that I am not able to understand: one HTTP request gives a timeout error, and the next time the exact same request returns a proper response.
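In case the client setup matters, the client is created roughly like this (the host string and the timeout/retry values below are placeholders, not my exact configuration):

from elasticsearch import Elasticsearch

# Placeholder host and timeout/retry values; the real server sits behind
# the proxy shown in the URLs above.
elasticsearch_client = Elasticsearch(
    ["http://servername/elastic"],
    timeout=60,             # per-request timeout in seconds
    max_retries=3,          # retry the request on connection errors
    retry_on_timeout=True,  # also retry when a request times out
)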
Index settings and mapping:
{
  "settings": {
    "analysis": {
      "analyzer": {
        "shingleAnalyzer": {
          "tokenizer": "letter_tokenizer",
          "filter": [
            "lowercase",
            "custom_stop",
            "custom_shingle",
            "custom_stemmer",
            "length_filter"
          ]
        }
      },
      "filter": {
        "custom_stemmer": {
          "type": "stemmer",
          "name": "english"
        },
        "custom_stop": {
          "type": "stop",
          "stopwords": "_english_"
        },
        "custom_shingle": {
          "type": "shingle",
          "min_shingle_size": "2",
          "max_shingle_size": "4",
          "filler_token": ""
        },
        "length_filter": {
          "type": "length",
          "min": 2
        }
      },
      "tokenizer": {
        "letter_tokenizer": {
          "type": "letter"
        }
      }
    }
  },
  "mappings": {
    "properties": {
      "article_id": {
        "type": "text"
      },
      "plain_text": {
        "term_vector": "with_positions_offsets_payloads",
        "store": true,
        "analyzer": "shingleAnalyzer",
        "type": "text",
        "fields": {
          "keyword": {
            "type": "keyword",
            "ignore_above": 256
          }
        }
      }
    }
  }
}
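(For completeness, these settings and the mapping are applied when the index is created from Python, roughly like this; index_body is the JSON above loaded as a dict, and the index name is a placeholder.)

# index_body is the settings/mapping JSON shown above, loaded as a Python dict.
elasticsearch_client.indices.create(index="indexname", body=index_body)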
I don't think there is any problem with these settings and the mapping, since I sometimes do get the expected response.
Please let me know if you need more information from my side. Any help will be appreciated.