Hey all!
I have an issue with my data.
I'm building an SEO tool to analyze information about domains and their positions.
Here is part of the Elasticsearch mapping:
"mappings": {
  "keywords": {
    "properties": {
      "keyword": {
        "type": "string",
        "analyzer": "english",
        "fields": {
          "raw": {
            "type": "string",
            "index": "not_analyzed"
          }
        }
      },
      "keyword_id": {
        "type": "long"
      },
      "organic": {
        "type": "nested",
        "properties": {
          "position": {
            "type": "short"
          },
          "base_domain": {
            "type": "string",
            "index": "not_analyzed"
          }
        }
      }
    }
  }
}
Each keyword is the text of a search query, and it contains 100 nested documents: the first 100 positions returned by the search engine for that query.
In total there are about 400 million keywords, so together with the nested search data that is about 40 billion documents.
I want to find the top 10 keywords for a specific domain, ranked by that domain's position.
For example, for the domain "elastic.co" the query could return:
Position 1 for keyword "java rest client"
Position 1 for keyword "delete index elasticsearch"
.... etc
Position 2 for keyword "cluster health"
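To make the expected result concrete, here is a rough in-memory sketch in Python of the ranking I'm after. The sample documents and the function name top_keywords are made up for illustration, and this is only meant to show the desired output, not how Elasticsearch would compute it.

```python
import heapq

# Made-up sample documents that mirror the mapping above.
# Real keywords would each carry 100 nested "organic" entries.
SAMPLE_DOCS = [
    {"keyword": "cluster health", "keyword_id": 3, "organic": [
        {"position": 1, "base_domain": "example.com"},
        {"position": 2, "base_domain": "elastic.co"},
    ]},
    {"keyword": "java rest client", "keyword_id": 1, "organic": [
        {"position": 1, "base_domain": "elastic.co"},
        {"position": 2, "base_domain": "example.com"},
    ]},
    {"keyword": "unrelated query", "keyword_id": 4, "organic": [
        {"position": 1, "base_domain": "example.org"},
    ]},
    {"keyword": "delete index elasticsearch", "keyword_id": 2, "organic": [
        {"position": 1, "base_domain": "elastic.co"},
    ]},
]


def top_keywords(docs, domain, size=10):
    """Return up to `size` (position, keyword) pairs for `domain`, best position first."""
    ranked = []
    for doc in docs:
        # Keep only the nested entries that belong to the requested domain.
        positions = [hit["position"] for hit in doc["organic"]
                     if hit["base_domain"] == domain]
        if positions:
            # If a domain appears more than once, rank by its best (lowest) position.
            ranked.append((min(positions), doc["keyword"]))
    # Ties on position are broken alphabetically by keyword, purely so that
    # the output of this sketch is deterministic.
    return heapq.nsmallest(size, ranked)


if __name__ == "__main__":
    for position, keyword in top_keywords(SAMPLE_DOCS, "elastic.co"):
        print(f'Position {position} for keyword "{keyword}"')
```

Keywords where the domain does not appear at all are dropped, which is what the nested filter in my query is supposed to do.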
The issue is that response time grows too fast for domains that have more than 100,000 keywords in the search results. A simple search without sorting takes 0.1 sec, while sorting by position can take more than 10 sec.
Judging by server performance metrics, the bottleneck is IO operations.
Here is the query:
"body": {
  "_source": {
    "includes": ["*"],
    "excludes": ["organic"]
  },
  "query": {
    "bool": {
      "filter": {
        "nested": {
          "path": "organic",
          "query": {
            "bool": {
              "filter": {
                "term": {
                  "organic.base_domain": "elastic.co"
                }
              }
            }
          },
          "inner_hits": {
            "_source": [
              "base_domain",
              "position"
            ],
            "sort": {
              "organic.position": "asc"
            }
          }
        }
      }
    }
  },
  "sort": [
    {
      "organic.position": "asc"
    },
    {
      "organic.position": {
        "order": "asc",
        "nested_path": "organic",
        "nested_filter": {
          "term": {
            "organic.base_domain": "elastic.co"
          }
        }
      }
    }
  ]
}
10 seconds seems like too much for sorting 100,000 integer positions.
I'd appreciate any ideas. Thanks for your help!
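For a sense of scale, here is a small Python sketch that picks the 10 lowest values out of 100,000 random positions held in memory. The data is randomly generated and the variable names are my own, so this is only a rough baseline and not a fair comparison with Elasticsearch, which presumably has to read the nested documents from disk first.

```python
import heapq
import random
import time

# Hypothetical data: 100,000 positions in the range 1..100, as in the mapping.
random.seed(0)
positions = [random.randint(1, 100) for _ in range(100_000)]

start = time.perf_counter()
# Select the 10 best (lowest) positions without sorting the whole list.
top10 = heapq.nsmallest(10, positions)
elapsed = time.perf_counter() - start

print(f"top 10 positions: {top10}")
print(f"selection took {elapsed:.4f} sec")
```

On data that is already in memory the selection step itself is a tiny part of the work, which is why I suspect the time goes into IO rather than into the sort.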