How to get document by id (Get API) on a time series multi indexes environment?

elia.palme · October 6, 2017, 4:06pm

In a time series, multi-index environment where a new index is created every month, what is the best approach to look up a document by id so that the document is guaranteed to be unique among all indexes?
I came up with two solutions:

1) Using the Search API and searching across multiple indexes with a wildcard for the index name.
pro: simple and supposedly the standard approach.
cons: does not provide realtime results; it is affected by the refresh rate.

Because in our setup the same document may be added to the index again within a very short amount of time, we would prefer to use the Get API, since it provides realtime results.

2) Using the Multi Get API and individually querying every index. To get all indexes that need to be queried in the multi get request, we first use the Get Index API, specifying the indexes to be searched with a wildcard.
pro: provides realtime data.
cons: we don't know if this is scalable (max 12 indexes) and what the performance of the Get Index API is.

Does anyone have a better suggestion on how we could achieve this, or any insight into whether the proposed method 2) is sustainable?
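
For illustration, here is a minimal sketch of the request body that method 2) would send to the Multi Get REST endpoint (POST /_mget): one entry per index, all asking for the same id. The class name MultiGetBody and the index names in the usage note are made up for this example, and the escaping is deliberately minimal, so treat it as a sketch rather than a finished client.

```java
import java.util.List;
import java.util.stream.Collectors;

/** Builds the JSON body for a _mget request that asks every index for the same id. */
final class MultiGetBody {

    /** Minimal JSON string quoting: handles backslashes and double quotes only (assumes no control characters). */
    static String quote(String value) {
        return "\"" + value.replace("\\", "\\\\").replace("\"", "\\\"") + "\"";
    }

    /** Emits one {"_index": ..., "_id": ...} entry per index, in the order given. */
    static String build(List<String> indices, String id) {
        return indices.stream()
                .map(index -> "{\"_index\":" + quote(index) + ",\"_id\":" + quote(id) + "}")
                .collect(Collectors.joining(",", "{\"docs\":[", "]}"));
    }
}
```

Calling MultiGetBody.build(List.of("indexname-2017.09", "indexname-2017.10"), "42") yields a body with two entries under "docs", one per monthly index.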

warkolm · October 6, 2017, 9:40pm

A GET on an ID is super fast, so even a mget will be efficient.

elia.palme · October 7, 2017, 7:18am

I am actually more concerned about the "Get Index" API.
In order to run the Multi Get request I need to retrieve the names of all available indexes at an exact point in time.
How scalable is this approach? Is the "Get Index" API meant to support multiple calls per second?
What is the performance of the "Get Index" API? Is it distributed or always handled by the master node? Does it provide realtime data or is it affected by a refresh rate?
Those are the kind of questions that worry me about approach 2).

Christian_Dahlqvist · October 7, 2017, 8:04am

Why not use the cat indices API?

elia.palme · October 7, 2017, 8:55am

The Cat Indices API sounds like a bit of an overkill: it provides a lot of unnecessary information, such as the number of documents, index status, etc. I am worried that it would consume a substantial amount of bandwidth.

Actually, I just realised the Get Index API is probably even worse in terms of bandwidth, since all index information is returned.

Christian_Dahlqvist · October 7, 2017, 9:00am

You can control what it returns. GET /_cat/indices/filebeat*?v&h=i would just return a header and a list of the indices matching the filebeat* pattern.
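
As a rough sketch of how a client might consume that plain-text response, the snippet below turns the body into a list of index names. It assumes one index name per line, possibly padded with spaces, and an optional header row when v is set. The class name CatIndicesParser is hypothetical and the HTTP call itself is left out.

```java
import java.util.ArrayList;
import java.util.List;

/** Parses the plain-text body of a single-column _cat/indices response into index names. */
final class CatIndicesParser {

    /**
     * @param body      raw response text, one index name per line
     * @param hasHeader true when the request used the v flag, so the first non-empty line is a header
     */
    static List<String> parse(String body, boolean hasHeader) {
        List<String> names = new ArrayList<>();
        boolean skipNext = hasHeader;
        for (String raw : body.split("\\R")) {
            String line = raw.trim();        // cat output may pad columns with spaces
            if (line.isEmpty()) {
                continue;                    // ignore blank and trailing lines
            }
            if (skipNext) {
                skipNext = false;            // drop the header row once
                continue;
            }
            names.add(line);
        }
        return names;
    }
}
```

Dropping the v flag from the request removes the header row, in which case hasHeader should be false.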

elia.palme · October 7, 2017, 9:10am

Unfortunately this API is not supported by the Java client. I probably need to use the ClusterStateRequest as explained in this discussion.

Somehow I feel this is getting overcomplicated.
Doing a realtime lookup for a unique id across multiple indices (time series) sounds like a common use case.

Am I misusing Elasticsearch, or is my set-up wrong? Is there a better way to achieve this than getting all index names and running a Multi Get?

warkolm · October 7, 2017, 9:25am

Just use a wildcard in the index name then, that's supported.
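
A minimal sketch of what that could look like as a REST call: an ids query sent to a wildcard index pattern. The class name WildcardIdSearch and the pattern indexname-* are only illustrative, and the id is assumed to contain no characters that need JSON escaping. As with any search, documents only show up after a refresh.

```java
/** Builds the path and body of an ids-query search against a wildcard index pattern. */
final class WildcardIdSearch {

    /** Search endpoint for every index matching the pattern, e.g. /indexname-*/_search. */
    static String path(String indexPattern) {
        return "/" + indexPattern + "/_search";
    }

    /** Query body matching a single document id (assumes the id needs no JSON escaping). */
    static String body(String id) {
        return "{\"query\":{\"ids\":{\"values\":[\"" + id + "\"]}}}";
    }
}
```

More than one hit in the response would mean the same id exists in several monthly indices.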

elia.palme · October 7, 2017, 9:58am

Well, as I wrote in the post, I need realtime querying, so I suppose the Search API with a wildcard is not an option.
Or is there a way to mitigate the refresh rate issue?

elia.palme · October 8, 2017, 4:33pm

I ran some stress tests to better understand the amount of data and the avg response time for the mentioned approaches.

Get Index API (winner and baseline)
final GetIndexRequest getIndexRequest = new GetIndexRequest().indices("indexname-*").features(GetIndexRequest.Feature.ALIASES);
client.admin().indices().getIndex(getIndexRequest).actionGet().getIndices();

Cat Indices API: +30% avg response time, +10% data transferred
GET /_cat/indices/indexname-*?h=index

Cluster State API: +400% avg response time, +30'000% data transferred
final ClusterStateRequest clusterStateRequest = new ClusterStateRequest();
final IndicesOptions strictExpandIndicesOptions = IndicesOptions.strictExpand();
client.admin().cluster().state(clusterStateRequest).get().getState().getMetaData().getIndices();

My understanding is that the most efficient method to retrieve all indices matching a wildcard is the Get Index API if the request is limited to the ALIASES feature.

To recap, it seems that for a realtime lookup of a unique document id across multiple indices (time series), a combination of the Get Index API and the Multi Get API is the most efficient way.

Could anybody with an understanding of the implementation of those APIs confirm my findings?

p.s. Please note that a Search with an index wildcard is not an option since realtime data are required.
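
To make the last step of the recap concrete, here is a small sketch of how the Multi Get results could be checked for uniqueness once they are reduced to a found flag per index. The class name UniqueIdCheck and the map-based input are assumptions made for this example; extracting the flags from the actual response is not shown.

```java
import java.util.List;
import java.util.Map;
import java.util.stream.Collectors;

/** Interprets per-index "found" flags from a multi get on a single document id. */
final class UniqueIdCheck {

    /** Returns the indices that reported found=true, in the map's iteration order. */
    static List<String> indicesContaining(Map<String, Boolean> foundByIndex) {
        return foundByIndex.entrySet().stream()
                .filter(entry -> Boolean.TRUE.equals(entry.getValue()))
                .map(Map.Entry::getKey)
                .collect(Collectors.toList());
    }

    /** True when the id exists in at most one index. */
    static boolean isUnique(Map<String, Boolean> foundByIndex) {
        return indicesContaining(foundByIndex).size() <= 1;
    }
}
```

An empty result means the id is new and can be indexed; two or more entries mean a duplicate already slipped in.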

