We have two main fields, F1 and F2, in our mapping, and they hold largely the same tokens.
For instance -
F1 - Every token in F1 carries payload information in addition to the term and its offsets.
F2 - This has mostly the same tokens (without the payload information), plus some additional tokens that are present only in F2. Each of these additional tokens appears in a very large number of documents.
I am investigating query slowness with the above configuration, especially for phrase queries. I used the Profile API to collect some additional data.
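To make the comparison concrete, here is a minimal sketch of how the "next_doc" and "match" timings could be totalled from a profile response for one field. It assumes the usual `profile.shards[].searches[].query[]` layout with a `breakdown` object and optional `children` on each node. The helper names and the sample response are illustrative only, not our actual code or data.

```python
from collections import defaultdict


def build_profiled_phrase_query(field, phrase):
    """Request body for a profiled match_phrase search (field and phrase are placeholders)."""
    return {"profile": True, "query": {"match_phrase": {field: phrase}}}


def sum_breakdown(profile_response, keys=("next_doc", "match")):
    """Sum the selected breakdown timings (nanoseconds) over all shards and nested query nodes."""
    totals = defaultdict(int)

    def walk(node):
        breakdown = node.get("breakdown", {})
        for key in keys:
            totals[key] += breakdown.get(key, 0)
        for child in node.get("children", []):
            walk(child)

    for shard in profile_response.get("profile", {}).get("shards", []):
        for search in shard.get("searches", []):
            for query_node in search.get("query", []):
                walk(query_node)
    return dict(totals)


# Made-up response fragment, only to show the expected shape.
sample = {
    "profile": {"shards": [{"searches": [{"query": [{
        "type": "PhraseQuery",
        "breakdown": {"next_doc": 1200, "match": 3400, "score": 50},
        "children": [],
    }]}]}]}
}
print(sum_breakdown(sample))
```

Running the same phrase once against F1 and once against F2 and comparing the two totals would show whether the extra time sits in "next_doc", in "match", or in both.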
Here are my observations and questions (all in the context of phrase queries):
(1) For phrase queries against F2, the time taken by "match" and "next_doc" is significantly higher. My suspicion is that the additional tokens, which appear in a very large number of documents, are contributing to the time taken. Is that right?
To confirm this, is there a way to see the size of the inverted index, or a tool that shows the term-to-document mapping of the inverted index? Or is there a recommended way to inspect the inverted index information present in each segment?
(1b) As a follow-up to (1): according to the Profile API documentation, "match" is a two-phase process. How can I tell how much overhead the tokens with the most document hits contribute to it?
(2) When queries are executed alternately against F1 and F2, they are slow. I believe a cache miss causes eviction and adds to the time, compared with firing the queries repeatedly against the same field.
Is there any way (tool, plugin, or API) to track and understand such cache hits and misses, and the memory usage pattern when different fields are loaded into memory?
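One candidate, stated as an assumption rather than a confirmed answer: the node stats API (`GET _nodes/stats/indices/query_cache,request_cache`) reports `hit_count`, `miss_count` and `evictions` for the query cache and the request cache. These counters cover the Elasticsearch-level caches only, not the operating system page cache, so they may not explain the whole effect. A minimal sketch of diffing two snapshots of a single node's stats, with a hypothetical helper name and made-up numbers:

```python
def cache_delta(before, after, cache="query_cache"):
    """Difference in cache counters between two node-stats snapshots of the same node.

    `before` and `after` are the per-node objects from the node stats response
    (the value under nodes.<node_id>), taken before and after a test run.
    """
    counters = ("hit_count", "miss_count", "evictions")
    b = before["indices"][cache]
    a = after["indices"][cache]
    delta = {name: a[name] - b[name] for name in counters}
    lookups = delta["hit_count"] + delta["miss_count"]
    # The hit ratio is undefined when nothing was looked up between the snapshots.
    delta["hit_ratio"] = delta["hit_count"] / lookups if lookups else None
    return delta


# Made-up snapshots, only to show the expected shape.
before = {"indices": {"query_cache": {"hit_count": 10, "miss_count": 5, "evictions": 0}}}
after = {"indices": {"query_cache": {"hit_count": 12, "miss_count": 25, "evictions": 3}}}
print(cache_delta(before, after))
```

Taking one pair of snapshots around a run that alternates between F1 and F2, and another pair around a run that stays on one field, would show whether misses and evictions actually differ between the two patterns.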