We have an ES index that contains a large and constantly growing number of documents (PDFs, word-processor files, etc.). The index stores several details about each document, including the complete plaintext contents in a single field. What we need is a completion-suggestion scheme that, for a given query string, suggests the most common matching words occurring anywhere in these plaintext fields.
The best solution so far, in terms of both results and performance, has been to use a "search_as_you_type" field and let ES return only the relevant documents. The Python code that receives these results then scans the returned documents and finds the matching substrings. (ES highlighting either doesn't work or is too slow.)
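Roughly, the Python post-processing step looks like this (a simplified sketch: the field name plaintext matches the mapping below, but the helper itself and the hit format are illustrative, not our exact code):

```python
import re
from collections import Counter

def collect_completions(hits, query, top_n=10):
    """Count words (or word sequences) in the returned documents that
    start with the query string.  `hits` is a list of source documents
    returned by Elasticsearch."""
    # Build a prefix pattern: earlier tokens must match whole words,
    # the last token may be partial ("elastic se" -> r"\belastic\s+se\w*").
    parts = query.lower().split()
    pattern = re.compile(
        r"\b" + r"\s+".join(map(re.escape, parts)) + r"\w*",
        re.IGNORECASE,
    )
    counts = Counter()
    for doc in hits:
        for match in pattern.findall(doc["plaintext"]):
            counts[match.lower()] += 1
    return counts.most_common(top_n)
```

For the query "ela" this yields completions such as ("elasticsearch", 1), and for "elastic se" it yields ("elastic search", 2), counted across all returned documents.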
So we have a mapping that is something like:
"mappings": {
  "properties": {
    "plaintext": {
      "type": "search_as_you_type"
    }
  }
}
And we have several documents of the form:
{
"plaintext": ".... <thousands of words> ... elasticsearch .... <thousands of words..."
}
{
"plaintext": ".... <thousands of words> ... elastic search .... <thousands of words..."
}
And when the query is something like
"query": {
  "match_phrase_prefix": {
    "plaintext": "ela"
  }
}
then ES should return the aforementioned documents with the highlights elasticsearch and elastic. The "search_as_you_type" field with highlighting works OK, but the process is much faster if highlighting is dropped and the results are handled in Python with regular expressions. And when the query string is something like "elastic se", the returned highlight should be elastic search, which only seems to (sort of) work if the highlight has a separate query for each whitespace-separated substring. And then things get really slow.
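For what it's worth, for multi-token input the documented way to query a "search_as_you_type" field is a multi_match of type bool_prefix over the subfields that the field type generates automatically (plaintext._2gram, plaintext._3gram); a sketch of the request body we'd expect, not something we have benchmarked:

```json
{
  "query": {
    "multi_match": {
      "query": "elastic se",
      "type": "bool_prefix",
      "fields": [
        "plaintext",
        "plaintext._2gram",
        "plaintext._3gram"
      ]
    }
  }
}
```

Highlighting can then be requested on the same fields, but that is exactly the part that is slow for us.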
On the other hand, the problem with the Python method is that we cannot be sure we've actually found the most common words, since we only see the documents ES returns. So is there a "pure" Elasticsearch way of doing what is described above, or should we stick with the current solution?
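One pure-ES direction we have considered but not validated: a terms aggregation restricted with an include prefix regex, which would count matching terms server-side. Note the assumptions: plaintext.words is a hypothetical multi-field (e.g. "type": "text" with "fielddata": true, since keyword fields aren't tokenized per word), the terms aggregation counts document frequency rather than total occurrences, and multi-word suggestions like "elastic search" would additionally require something like a shingle token filter:

```json
{
  "size": 0,
  "query": {
    "match_phrase_prefix": { "plaintext": "ela" }
  },
  "aggs": {
    "suggestions": {
      "terms": {
        "field": "plaintext.words",
        "include": "ela.*",
        "size": 10
      }
    }
  }
}
```

Whether enabling fielddata on a field holding entire document bodies is viable memory-wise is exactly the kind of thing we'd like input on.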