Sorting issue on 40 billion rows

Hey all!
I have an issue with sorting my data.

I'm building an SEO tool to analyze information about domains and their search positions.

Here is part of the Elasticsearch mapping:

"mappings": {
    "keywords": {
        "properties": {
            "keyword": {
                "type": "string",
                "analyzer": "english",
                "fields": {
                    "raw": {
                        "type": "string",
                        "index": "not_analyzed"
                    }
                }
            },
            "keyword_id": {
                "type": "long"
            },
            "organic": {
                "type": "nested",
                "properties": {
                    "position": {
                        "type": "short"
                    },
                    "base_domain": {
                        "type": "string",
                        "index": "not_analyzed"
                    }
                }
            }
        }
    }
}

Each keyword is the text of a search query, and it contains 100 nested documents: the first 100 positions returned by the search engine for that query.

In total there are about 400 million keywords, so with the search data that makes 40 billion nested documents.
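
As a rough sanity check on those sizes, here is a minimal Python sketch. The constant names are just for illustration, and the second figure relies on Elasticsearch indexing every nested object as its own hidden Lucene document next to its parent.

```python
# Rough size check; the names below are illustrative, not part of my setup.
KEYWORDS = 400_000_000         # about 400 million keyword documents
POSITIONS_PER_KEYWORD = 100    # nested "organic" entries per keyword

# One nested document per search position.
nested_docs = KEYWORDS * POSITIONS_PER_KEYWORD

# Nested objects are stored as separate hidden Lucene documents,
# so the parents add another 400 million on top.
lucene_docs = nested_docs + KEYWORDS

print(f"{nested_docs:,} nested documents")                 # 40,000,000,000
print(f"{lucene_docs:,} Lucene documents incl. parents")   # 40,400,000,000
```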

I want to find the top 10 keywords for a specific domain, ranked by that domain's position.

For example, for the domain "elastic.co" the query could return:

Position 1 for keyword "java rest client"
Position 1 for keyword "delete index elasticsearch"
... etc.
Position 2 for keyword "cluster health"
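
To make the expected result concrete, here is a small in-memory Python sketch of what I want the query to compute. The sample documents and the `top_keywords` helper are made up for illustration, and I'm assuming a keyword is ranked by the domain's best (lowest) position if the domain appears more than once.

```python
import heapq

# Toy stand-in for the index: each keyword document carries its nested
# "organic" results. The data is invented for this example.
docs = [
    {"keyword": "cluster health", "organic": [
        {"position": 1, "base_domain": "example.com"},
        {"position": 2, "base_domain": "elastic.co"},
    ]},
    {"keyword": "java rest client", "organic": [
        {"position": 1, "base_domain": "elastic.co"},
    ]},
    {"keyword": "unrelated query", "organic": [
        {"position": 1, "base_domain": "example.org"},
    ]},
]


def top_keywords(docs, domain, size=10):
    """Return up to `size` (position, keyword) pairs for `domain`, best first."""
    hits = []
    for doc in docs:
        # Keep only the nested entries that belong to the requested domain.
        positions = [o["position"] for o in doc["organic"]
                     if o["base_domain"] == domain]
        if positions:
            hits.append((min(positions), doc["keyword"]))
    # Only the `size` smallest entries are needed, not a full sort.
    return heapq.nsmallest(size, hits)


for position, keyword in top_keywords(docs, "elastic.co"):
    print(f'Position {position} for keyword "{keyword}"')
```

The point of `heapq.nsmallest` here is that only the best 10 entries are needed, so a full ordering of all matching keywords should not be required.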

The issue is that response time grows too fast for domains that have more than 100,000 keywords in the search results. A simple search without sorting takes 0.1 sec, while sorting by position can take more than 10 sec.

Judging by the server performance metrics, the bottleneck is I/O operations.

Here is the query:

"body": {
    "_source": {
        "includes": ["*"],
        "excludes": ["organic"]
    },
    "query": {
        "bool": {
            "filter": {
                "nested": {
                    "path": "organic",
                    "query": {
                        "bool": {
                            "filter": {
                                "term": {
                                    "organic.base_domain": "elastic.co"
                                }
                            }
                        }
                    },
                    "inner_hits": {
                        "_source": [
                            "base_domain",
                            "position"
                        ],
                        "sort": {
                            "organic.position": "asc"
                        }
                    }
                }
            }
        }
    },
    "sort": [
        {
            "organic.position": "asc"
        },
        {
            "organic.position": {
                "order": "asc",
                "nested_path": "organic",
                "nested_filter": {
                    "term": {
                        "organic.base_domain": "elastic.co"
                    }
                }
            }
        }
    ]
}
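
In case it helps with reproducing this for other domains, here is a minimal Python sketch that builds the same request body from a domain name. The `build_search_body` helper is hypothetical and only mirrors the query as posted. The explicit `size` is my addition and matches the Elasticsearch default of 10 hits. No client library is involved; it just produces the JSON.

```python
import json


def build_search_body(domain, size=10):
    """Hypothetical helper: build the search body shown above for `domain`."""
    # The same term filter is used for matching and for the nested sort.
    domain_filter = {"term": {"organic.base_domain": domain}}
    return {
        "size": size,  # added for clarity; 10 is also the default
        "_source": {"includes": ["*"], "excludes": ["organic"]},
        "query": {
            "bool": {
                "filter": {
                    "nested": {
                        "path": "organic",
                        "query": {"bool": {"filter": domain_filter}},
                        "inner_hits": {
                            "_source": ["base_domain", "position"],
                            "sort": {"organic.position": "asc"},
                        },
                    }
                }
            }
        },
        # Both sort clauses are kept exactly as in the posted query.
        "sort": [
            {"organic.position": "asc"},
            {
                "organic.position": {
                    "order": "asc",
                    "nested_path": "organic",
                    "nested_filter": domain_filter,
                }
            },
        ],
    }


body = build_search_body("elastic.co")
print(json.dumps(body, indent=4))
```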

10 seconds seems too long for sorting 100,000 integer positions.

I'd appreciate any ideas. Thanks for the help!
