Query on two indices of elasticsearch database


(Kaushik Vankayala) #1

Hi There,

I have a requirement to query on elasticsearch data and my operation is dependent on two indices at a time.

select x.date,distinct(y.number)
from index1 x,index2 y
where x.id1=y.id1
and id2 in (select id2 from index1
     having count(distinct(id1))>=10)
and a.date between 'FROM' and 'TO'
order by number

So as can be seen above, my current SQL query gives me two columns date and number, while having followed some conditions.

id2 column is present only in index1. The inner query results in a limited set of entries of index1 to be queried further.

My actual query is on that limited set of index1 and where the id1 value matches from both index and there is some restriction on date...

Could you suggest me the best possible way to achieve this as i am clueless to proceed!

I have tried SQL of elasticsearch but it does not have distinct functionality, it does not have current date function etc.

Regards
Kaushik


(Andrei Stefan) #2

Hi @kaushik.vankayala,

Most of the times, switching from RDBMS to ES is not a straightforward process. Elasticsearch has parent/child relationships and nested documents as ways to have documents relationships but usually it is better to denormalize your RDBMS data when switching to Elasticsearch.
And the same goes with your query: if you could take a step back and consider having a field/some fields from one index copied in the second index that should help you the query.

Regarding ES SQL, joins are not on our plan for the near future, nor nested selects. JOINs are not native to Elasticsearch and if we would do something like this, it would be more of an workaround. NOW() function is on our near future plans and DISTINCT should work in the scenario you were showing.


(Kaushik Vankayala) #3

Thanks a lot for a prompt response..! :slight_smile:

I did try copying the fields into one index and now i achieved the above query result with help of a python script which uses the REST API to query on the elasticsearch data with multiple calls.

But it is clearly seeming to be an inefficient way while it is able to attain results as it takes about hours together (in my case for a set of 46k docs it takes around 4-5 hours).

I tried the above with DSL but the the result of aggregation cannot be conditioned to get the specific data. What i mean is, in regards to the above quer firstly asked....

having count(distinc(pid))>10

say i applied some aggregation and achieved the list now i had to use another API call to get the list of other fileds giving the input parameter with the output list of aggregation one at a time.

Can you think of a more reasonable approach or is there a way to apply the aggregation result in the same DSL query to get the desired output?

Regards

Kaushik


(system) #4

This topic was automatically closed 28 days after the last reply. New replies are no longer allowed.