Structured hierarchy: Repeating fields causing low performance

I have a mapping setting that has below types (self maintained type field). I have not used parent child relationship, rather repeated fields and the data for writing simpler queries.

{
	"type": "city",
	"countryName": "text",
	"stateName": "text",
	"cityName": "text"
}

{
	"type": "state",
	"countryName": "text",
	"stateName": "text"
}

{
	"type": "country",
	"countryName": "text"
}

Each of the above City has 10million docs, state has about 2million docs and country has 200 docs. This is conceptually hierarchical structure build on flat document structure.

But such type of structure is giving problems :fearful: on performance front, because when i try to do a term query with countryName, stateName and cityName fields with a filter of CITY as follows:

{
	"query": {
		"constant_score": {
			"filter": {
				"bool": {
					"must": [
						{
							"term": {
								"city": {
									"value": "boston"
								}
							}
						},
						{
							"term": {
								"state": {
									"value": "massachussets"
								}
							}
						},
						{
							"term": {
								"country": {
									"value": "usa"
								}
							}
						}
					],
					"filter": [
						{
							"term": {
								"type": {
									"value": "city"
								}
							}
						}
					]
				}
			}
		}
	}
}

QUERY
While i have filter of type "city", i expect that it should hit only city documents.
But the fact is it also hits the stateName and countryName fields of types "state" and "country".
I have tried this by comparing above query performance vs same query without the stateName and countryName clauses. The later gives better performance.

We have a very big similar concept system and we are stuck with the performance issue.
Please help !!

May be this could help: https://www.elastic.co/guide/en/elasticsearch/reference/current/search-request-rescore.html

Thanks for the reply.

*If we put the type=city in query clause and the real given query (except filter) in rescoring clause, the documents fetched by type=city will be about 10million on which the real rescoring query will run.
It will be difficult to give the window size for rescoring.

Is * the thing you are trying to suggest ?

Hmmm. Right. May be that's not the best thing as all documents will most likely score the same.

I'm curious to know why actually you are seeing the behavior you described.
Which version is that?

With the term queries and constant_score filter i am not looking at scoring results as of now.

The version i am using is 6.6.2.
The behavior seems weird to me too, that is why this question.
When i add a "city" filter i expect that it will query on the Name fields for city documents only.

How is the execution sequence of query clause and filter clause decided ? If its the filter clause which executes first and the query clause executes on the documents fetched by filter clause, then something seems wrong.

We have more than 2Tbs of data and a little more than 500million documents, and growing. Once i execute below query, my performance increases because i do not query on common fields now. But that's just a test query, i want it filtered by state and country too, because Boston can be in any state or country.

{
	"query": {
		"constant_score": {
			"filter": {
				"bool": {
					"must": [
						{
							"term": {
								"city": {
									"value": "boston"
								}
							}
						}
					],
					"filter": [
						{
							"term": {
								"type": {
									"value": "city"
								}
							}
						}
					]
				}
			}
		}
	}
}

I see. I guess you have a lot of documents with type is city right?
May be that's the reason why. @jpountz could probably tell more.

Could you in the mean time run the same query with profile: true and add the ?human parameter?

Unfortunately this behavior is expected. Elasticsearch doesn't know that all documents that match city:boston also match country:usa and so it needs to verify all clauses.

One way to address this issue would be to change the way that you model your documents and move from

{
	"type": "city",
	"countryName": "text",
	"stateName": "text",
	"cityName": "text"
}

{
	"type": "state",
	"countryName": "text",
	"stateName": "text"
}

{
	"type": "country",
	"countryName": "text"
}

to

{
	"location": [ "{country}", "{country}/{state}", "{country}/{state}/{city}" ],
	"type": "city"
}

{
	"location": [ "{country}", "{country}/{state}" ],
	"type": "state"
}

{
	"location": [ "{country}" ],
	"type": "country"
}

Then your query would look like that:

{
	"query": {
		"constant_score": {
			"filter": {
				"bool": {
					"must": [
						{
							"term": {
								"location": {
									"value": "usa/massachussets/boston"
								}
							}
						}
					],
					"filter": [
						{
							"term": {
								"type": {
									"value": "city"
								}
							}
						}
					]
				}
			}
		}
	}
}

If changing your schema is not an option, something that would help as well (but less) would be to enable index sorting on your index (note that this requires reindexing), e.g. by type then country then state then city.

Thanks for you valuable inputs @jpountz as always.

So, overall it seems that all the clauses, no matter if they are in filter or query do execute in parallel, and then the results are selected based on the results from each clause.

With the filter in place, i had expected that my query would work on the type=City documents only.
It looks like we will have to think again our overall query design, a lot of change :disappointed:

Thanks for the alternative solution. We have a incrementally updating process which updates indexes from time to time. Such as Boston name changes to Toston, or may be boston gets deleted. In which case we need to know which amongst the "location" is a city so that it can be updated/deleted. Changing a big hierarchical model now would be not so good for us. In my example i have explained the concept in just a 3 level hierarchy.

Index sorting could be a solution. We will have to check the feasibility.

I wonder why filter clause is not executed before query clause. It would be cool to do so. Also if every clause is executed in parallel, why does the filter section exists ? Is it only for query caching ?

This topic was automatically closed 28 days after the last reply. New replies are no longer allowed.