How to get a document by ID (Get API) in a time series multi-index environment?


(Elia Palme) #1

In a time series multi-index environment where a new index is created every month, what is the best approach to look up a document by ID, so that we can ensure the document is unique across all indexes?
I came up with two solutions:

  1. Using the Search API and searching across multiple indexes with a wildcard in the index name.
    pro: simple and supposedly the standard approach.
    cons: does not provide realtime results; it is affected by the refresh interval.

Because in our setup the same document could be added to the index more than once within a very short period of time, we would prefer to use the Get API, since it provides realtime results.

  2. Using the Multi Get API and querying every index individually. To get the list of indexes that need to be queried in the multi get request, we first call the Get Index API, specifying the indexes to be searched with a wildcard.
    pro: provides realtime data.
    cons: we don't know whether this scales (max 12 indexes) or how the Get Index API performs.
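
For illustration, here is a minimal sketch of the request body that approach 2 would send to the Multi Get endpoint once the index names are known. It only builds the JSON; sending it is left out. The class name `MultiGetBody`, the monthly index names and the id `doc-42` are hypothetical, and depending on the Elasticsearch version a `_type` field may also be relevant for each entry.

```java
import java.util.List;
import java.util.stream.Collectors;

// Sketch only: builds the JSON body for a Multi Get request that asks
// every index in `indices` for the same document id.
final class MultiGetBody {

    // Escapes backslashes and double quotes so the value is a valid JSON
    // string here. Control characters are not handled in this sketch.
    static String quote(String s) {
        return "\"" + s.replace("\\", "\\\\").replace("\"", "\\\"") + "\"";
    }

    // One {"_index": ..., "_id": ...} entry per index, all with the same id.
    static String build(List<String> indices, String id) {
        return indices.stream()
                .map(index -> "{\"_index\":" + quote(index) + ",\"_id\":" + quote(id) + "}")
                .collect(Collectors.joining(",", "{\"docs\":[", "]}"));
    }

    public static void main(String[] args) {
        // Hypothetical monthly index names and document id.
        System.out.println(build(List.of("indexname-2017-01", "indexname-2017-02"), "doc-42"));
    }
}
```

With at most 12 monthly indexes the body stays small, and each entry in the response reports whether the document was found in that index.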

Does anyone have a better suggestion for how we could achieve this, or any insight into whether the proposed method 2) is sustainable?


(Mark Walkom) #2

A GET on an ID is super fast, so even a mget will be efficient.


(Elia Palme) #3

I am actually more concerned about the "Get Index" API.
In order to run the Multi Get request I need to retrieve the names of all available indexes at an exact point in time.
How scalable is this approach? Is the "Get Index" API meant to support multiple calls per second?
What is the performance of the "Get Index" API? Is it distributed or always handled by the master node? Does it provide realtime data or is it affected by a refresh rate?
Those are the kind of questions that worry me about approach 2).


(Christian Dahlqvist) #4

Why not use the cat indices API?


(Elia Palme) #5

The Cat Indices API sounds like a bit of an overkill: it provides a lot of unnecessary information such as the number of documents, index status, etc. I am worried that it would consume a substantial amount of bandwidth.

Actually, I just realised the Get Index API is probably even worse in terms of bandwidth, since all index information is returned.


(Christian Dahlqvist) #6

You can control what it returns. GET /_cat/indices/filebeat*?v&h=i would just return a header and a list of the indices matching the filebeat* pattern.
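
As a rough sketch of how a client might use such a trimmed-down cat indices call: the helper below builds the request path with only the index-name column and turns the plain-text response into a list of names. The class and method names are hypothetical. It leaves out the `v` parameter, so there is no header line to skip, and it sorts the names only to make the result deterministic.

```java
import java.util.List;
import java.util.stream.Collectors;

// Sketch only: request path and response parsing for a cat indices call
// that returns nothing but index names.
final class CatIndices {

    // Path restricted to the "index" column; without "v" no header is returned.
    static String path(String pattern) {
        return "/_cat/indices/" + pattern + "?h=index";
    }

    // The response is plain text with one index name per line.
    static List<String> parse(String responseBody) {
        return responseBody.lines()
                .map(String::trim)
                .filter(line -> !line.isEmpty())
                .sorted()
                .collect(Collectors.toList());
    }

    public static void main(String[] args) {
        // Hypothetical response body for the filebeat* pattern.
        System.out.println(path("filebeat*"));
        System.out.println(parse("filebeat-2017.02\nfilebeat-2017.01\n"));
    }
}
```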


(Elia Palme) #7

Unfortunately this API is not supported by the Java client. I probably need to use the ClusterStateRequest as explained in this discussion.

Somehow I feel this is getting over complicated.
Doing a realtime lookup for a unique id across multiple indices (time series) sounds like a common use case.

Am I misusing Elasticsearch, or is my set-up wrong? Is there a better way to achieve this than getting all index names and running a Multi Get?


(Mark Walkom) #8

Just use a wildcard in the index name then, that's supported.


(Elia Palme) #9

Well, as I wrote in the post, I need realtime querying, so I suppose the Search API with a wildcard is not an option.
Or is there a way to mitigate the refresh rate issue?


(Elia Palme) #10

I ran some stress tests to better understand the amount of data and the average response time for the mentioned approaches.

  1. Get Index API (winner and baseline)
    final GetIndexRequest getIndexRequest = new GetIndexRequest()
        .indices("indexname-*")
        .features(GetIndexRequest.Feature.ALIASES);
    client.admin().indices().getIndex(getIndexRequest).actionGet().getIndices();

  2. Cat Indices API: +30% avg response time, +10% data transferred
    GET /_cat/indices/indexname-*?h=index

  3. Cluster State API: +400% avg response time, +30'000% data transferred
    final ClusterStateRequest clusterStateRequest = new ClusterStateRequest();
    final IndicesOptions strictExpandIndicesOptions = IndicesOptions.strictExpand();
    client.admin().cluster().state(clusterStateRequest).get().getState().getMetaData().getIndices();

My understanding is that the most efficient method to retrieve all indices matching a wildcard is the Get Index API if the request is limited to the ALIASES feature.

To recap, it seems that a combination of the Get Index API and the Multi Get API is the most efficient way to do a realtime lookup for a unique document ID across multiple (time series) indices.
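
The two-step flow recapped here can be sketched as follows. This is not client code: `resolveIndices` and `existsIn` are hypothetical stand-ins for the Get Index call and for a realtime get, so the logic can be run against an in-memory fake. In a real client the loop would more likely be a single Multi Get request.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.Set;
import java.util.function.BiPredicate;
import java.util.function.Function;

// Sketch only: resolve the index names first, then ask each index for the id.
final class CrossIndexLookup {

    // resolveIndices: hypothetical stand-in for the Get Index call (pattern -> names).
    // existsIn: hypothetical stand-in for a realtime get ((index, id) -> found?).
    static List<String> indicesContaining(String pattern, String id,
            Function<String, List<String>> resolveIndices,
            BiPredicate<String, String> existsIn) {
        List<String> hits = new ArrayList<>();
        for (String index : resolveIndices.apply(pattern)) {
            if (existsIn.test(index, id)) {
                hits.add(index);
            }
        }
        return hits;
    }

    public static void main(String[] args) {
        // In-memory fake cluster: index name -> ids stored in it.
        Map<String, Set<String>> cluster = Map.of(
                "indexname-2017-01", Set.of("a", "b"),
                "indexname-2017-02", Set.of("b"));
        List<String> hits = indicesContaining("indexname-*", "b",
                pattern -> List.of("indexname-2017-01", "indexname-2017-02"),
                (index, docId) -> cluster.get(index).contains(docId));
        // More than one hit means the id is not unique across the indices.
        System.out.println(hits.size() > 1 ? "duplicate id in " + hits : "unique or absent: " + hits);
    }
}
```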

Could anybody with an understanding of the implementation of those APIs confirm my findings?

P.S. Please note that a search with an index wildcard is not an option, since realtime data is required.

