I am currently working with Elasticsearch on an index that holds a large number of documents (around 500K). I want to store the n-grams of each document's text (which is also large, about 2 pages of text per document) in another index. To do that, I am computing the term vectors and their counts for each document and storing them in the new index, so that I can run aggregation queries on the new index.
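To make the counting step concrete, here is a minimal sketch of how an mtermvectors response could be turned into documents for the new index. The function name `ngram_actions`, the target index name `ngram_index`, and the output field names `ngram` and `count` are placeholders I made up for illustration; only the response layout (`docs` -> `term_vectors` -> field -> `terms` -> `term_freq`) comes from the API itself.

```python
from typing import Any, Dict, Iterator


def ngram_actions(
    response: Dict[str, Any],
    field: str = "plain_text",
    target_index: str = "ngram_index",  # hypothetical name for the new index
) -> Iterator[Dict[str, Any]]:
    """Yield one bulk action per (source document, n-gram) pair."""
    for doc in response.get("docs", []):
        # Documents that were not found carry no "term_vectors" key.
        terms = doc.get("term_vectors", {}).get(field, {}).get("terms", {})
        for term, stats in terms.items():
            yield {
                "_index": target_index,
                "_source": {
                    "article_id": doc["_id"],
                    "ngram": term,                 # hypothetical field name
                    "count": stats["term_freq"],   # occurrences in this document
                },
            }
```

The yielded dictionaries have the shape accepted by `elasticsearch.helpers.bulk`, so they could be indexed in batches rather than one request per n-gram.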
The settings of the old index allow me to use the termvectors and mtermvectors APIs. I don't want to send too many requests to the Elasticsearch server in a short amount of time, so I am using the mtermvectors API of the Python client. I am trying to get the term vectors of 25 documents at a time by passing the IDs of those 25 documents.
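The call is roughly like the sketch below. The helper names `batches` and `fetch_term_vectors` are made up for this post, and `client` is assumed to be an `elasticsearch.Elasticsearch` instance. The document type `article` that appears in the URL is not passed here, since whether it is needed depends on the client and server versions.

```python
from typing import Any, Dict, Iterator, List, Sequence


def batches(ids: Sequence[str], size: int = 25) -> Iterator[List[str]]:
    """Split the full ID list into chunks of at most `size` IDs."""
    for start in range(0, len(ids), size):
        yield list(ids[start:start + size])


def fetch_term_vectors(client: Any, index: str, ids: Sequence[str]) -> Dict[str, Any]:
    """Request term vectors for one batch of documents.

    `client` is assumed to be an elasticsearch.Elasticsearch instance;
    the keyword arguments mirror the query-string parameters of the request.
    """
    return client.mtermvectors(
        index=index,
        ids=list(ids),
        fields=["plain_text"],
        field_statistics=True,
        term_statistics=True,
        offsets=False,
        payloads=False,
        positions=False,
    )
```

Each batch produces one GET request, with the IDs joined by commas in the query string.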
Sample HTTP URL generated by calling the mtermvectors API from Python:
http://*servername*/elastic/*indexname*/article/_mtermvectors?offsets=false&fields=plain_text&ids=608467%2C608469%2C608473%2C608475%2C608477%2C608482%2C608485%2C608492%2C608498%2C608504%2C608509%2C608511%2C608520%2C608522%2C608528%2C608530%2C608541%2C608549%2C608562%2C608570%2C608573%2C608576%2C608577%2C608579%2C608585&field_statistics=true&term_statistics=true&payloads=false&positions=false
Sometimes I get the expected response, but most of the time I get the following error:
Proxy Error
The proxy server received an invalid response from an upstream server.
The proxy server could not handle the request GET /elastic/*indexname*/article/_mtermvectors.
Reason: Error reading from remote server
Index settings and mapping:
{
  "settings": {
    "analysis": {
      "analyzer": {
        "shingleAnalyzer": {
          "tokenizer": "letter_tokenizer",
          "filter": [
            "lowercase",
            "custom_stop",
            "custom_shingle",
            "custom_stemmer",
            "length_filter"
          ]
        }
      },
      "filter": {
        "custom_stemmer": {
          "type": "stemmer",
          "name": "english"
        },
        "custom_stop": {
          "type": "stop",
          "stopwords": "_english_"
        },
        "custom_shingle": {
          "type": "shingle",
          "min_shingle_size": "2",
          "max_shingle_size": "4",
          "filler_token": ""
        },
        "length_filter": {
          "type": "length",
          "min": 2
        }
      },
      "tokenizer": {
        "letter_tokenizer": {
          "type": "letter"
        }
      }
    }
  },
  "mappings": {
    "properties": {
      "article_id": {
        "type": "text"
      },
      "plain_text": {
        "term_vector": "with_positions_offsets_payloads",
        "store": true,
        "analyzer": "shingleAnalyzer",
        "type": "text",
        "fields": {
          "keyword": {
            "type": "keyword",
            "ignore_above": 256
          }
        }
      }
    }
  }
}
I don't think there is any problem with these settings or the mapping, since I do sometimes get the expected response.
Please let me know if you need more information from my side. Any help will be appreciated.