Best config for massive add and delete of documents

Hello,

I have a use case where I need to add documents, run one or more queries over each document, and then delete it, since once I have the result the document is no longer needed on the cluster.

It's a PHP app that searches for almost 1.5k keywords in each document. I tried several approaches looking for the best performance:

  1. One query per keyword, sending the requests concurrently ("lazy").
    Result --> With a high number of requests per second, the Elasticsearch CPU spikes and sometimes it can't keep up with the queued requests.
  2. Multiple queries per request, using msearch and splitting the keywords into chunks of 100, checking after each chunk whether I already have results and, if so, skipping the remaining requests.
  3. Generating a single query with 1.5k clauses (the clause limit is 1024), so I again split into chunks of 100 and followed the same approach as before, checking each chunk for results and avoiding further requests once a result is found.
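For reference, a chunk in approach 2 would be sent to `_msearch` roughly like this (NDJSON body; the index name `documents-en` is taken from the mapping template later in the thread, the exact query structure is illustrative, and a real chunk would contain up to 100 header/body pairs rather than the two shown):

```json
GET /documents-en/_msearch
{}
{"size":1,"query":{"bool":{"filter":[{"term":{"id":"2"}}],"must":[{"simple_query_string":{"query":"+\"clash of clans\"","fields":["text"],"default_operator":"and"}}]}}}
{}
{"size":1,"query":{"bool":{"filter":[{"term":{"id":"2"}}],"must":[{"simple_query_string":{"query":"+\"deez nuts\"","fields":["text"],"default_operator":"and"}}]}}}
```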

The third one gives us the best results, allowing 3k requests/min to our app and 7k requests/s to Elasticsearch.

However, CPU usage on the Elasticsearch cluster is not regular: it sometimes spikes to 100%, while normal usage is around 40%.

I'm looking for some help, or maybe another approach to this use case.

My Elasticsearch cluster is hosted on Amazon, running ES 6.3 with c4.2xlarge.elasticsearch nodes.

Any ideas?

Thanks in advance!

Why not flip it? Index a single query as a percolator and then percolate all documents against it?

Hello Igor,

Thanks for your response. If I understood you correctly, you mean storing the keywords as percolator queries and matching each document against them, right?

I didn't mention it before, but I've got 5 indexes (one per language), because I'm storing documents by language, as I need to search with stemming and other features via simple_query_string. Do you know if it's possible to do what you mention with these requirements?

Thanks again!

P.S. I've attached a sample of one of the queries and the mapping.

Mapping

{
  "template": "documents-en",
  "settings": {
    "index": {
      "number_of_shards": 5,
      "number_of_replicas": 0
    },
    "analysis": {
      "filter": {
        "english_stop": {
          "type":       "stop",
          "stopwords":  "_english_"
        },
        "english_keywords": {
          "type":       "keyword_marker",
          "keywords":   ["example"]
        },
        "english_stemmer": {
          "type":       "stemmer",
          "language":   "english"
        },
        "english_possessive_stemmer": {
          "type":       "stemmer",
          "language":   "possessive_english"
        }
      },
      "analyzer": {
        "rebuilt_english": {
          "tokenizer":  "standard",
          "filter": [
            "english_possessive_stemmer",
            "lowercase",
            "english_stop",
            "english_keywords",
            "english_stemmer"
          ]
        }
      }
    }
  },
  "mappings": {
    "document": {
      "properties": {
        "id": {
          "type": "keyword"
        },
        "parent_id": {
          "type": "keyword"
        },
        "text": {
          "type": "text",
          "analyzer": "rebuilt_english"
        },
        "text_identifier": {
          "type": "keyword"
        }
      }
    }
  }
}

Query example (I've put only a few elements in the should clause)

{
    "size": 1,
    "query": {
        "bool": {
            "must": [{
                "term": {
                    "id": "2"
                }
            }],
            "should": [{
                "simple_query_string": {
                    "query": "+\\\"\"clash of clans\\\"\" ",
                    "fields": ["text"],
                    "default_operator": "and",
                    "_name": 20
                }
            }, {
                "simple_query_string": {
                    "query": "+\"deez nuts\"",
                    "fields": ["text"],
                    "default_operator": "and",
                    "_name": 31
                }
            }],
            "minimum_should_match": 1
        }
    },
    "highlight": {
        "order": "score",
        "pre_tags": ["<span class=\"highlight\">"],
        "post_tags": ["<\/span>"],
        "fields": {
            "text": {
                "number_of_fragments": 0,
                "force_source": true
            }
        }
    }
}

you mean storing the keywords as percolator queries and matching each document against them, right?

Not storing keywords, I meant storing actual queries. It's a bit of a strange concept, but if you go through some examples in the docs, it might become a bit clearer.
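A minimal sketch of that idea on 6.x, assuming a hypothetical `keywords-en` index (the analysis settings from the `documents-en` template would also need to be added here so the stored queries analyze text the same way):

```json
PUT /keywords-en
{
  "mappings": {
    "document": {
      "properties": {
        "query": { "type": "percolator" },
        "text":  { "type": "text" }
      }
    }
  }
}

PUT /keywords-en/document/20
{
  "query": {
    "simple_query_string": {
      "query": "+\"clash of clans\"",
      "fields": ["text"],
      "default_operator": "and"
    }
  }
}
```

The `text` field has to exist in this mapping so Elasticsearch can parse and index the stored queries against it.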

Oh wow, I see it! I'm gonna try and test it :slight_smile:

Thanks again!

Sorry, one more thing...

Do I need to store the queries every time I receive a request, or just once, only adding a new document (query) when I have a new keyword?

Thanks !

Hmm, you are using named queries, so I assume you need to know which keyword matched, and just having highlights is not enough. If that's the case, you might need to store each current should clause as an individual document, which might slow the percolation down a bit. You should probably test it to see if you get satisfactory performance out of it.
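Assuming one percolator document per keyword in a hypothetical `keywords-en` index with a `query` field of type `percolator`, the search side would look roughly like this; each hit is a stored query that matched, so its `_id` (or any extra field stored alongside it) identifies the keyword without needing `_name`:

```json
GET /keywords-en/_search
{
  "query": {
    "percolate": {
      "field": "query",
      "document": {
        "text": "the full text of the incoming document"
      }
    }
  }
}
```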

Thanks for your explanation again.

I made the changes, storing one query per keyword, and I get the results as expected. However, I'm wondering if it's possible to filter the matching "keywords" by other input params in my request. For example:

In my request I receive params that tell my app to only check whether a text matches keywords belonging to a given country ("TR") and goal ("1"). How can I filter on those params, instead of searching across all the stored documents (percolate queries)?

Do you have a reference where I can read more about that?

Thanks in advance again.

A document containing a percolation query is just a document, and the percolation query is just a query. So you can add more fields to the document that contains the query and filter on these fields while percolating, using a bool query. I am not sure where you can read about it besides the documentation and the blog.
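For example, if each stored query document is indexed with extra `country` and `goal` keyword fields (index and field names here are illustrative), the percolate query can be combined with ordinary filters:

```json
GET /keywords-en/_search
{
  "query": {
    "bool": {
      "must": [
        {
          "percolate": {
            "field": "query",
            "document": { "text": "the incoming document text" }
          }
        }
      ],
      "filter": [
        { "term": { "country": "TR" } },
        { "term": { "goal": "1" } }
      ]
    }
  }
}
```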

This topic was automatically closed 28 days after the last reply. New replies are no longer allowed.