How can I retrieve time-based indices from Elasticsearch using a Python script?


(Yash M.) #1

Hi folks,
I'm trying to retrieve time-based events from Elasticsearch for a few time intervals using a Python script.
Could anyone share an example of how to do this using a core API such as elasticsearch-py or elasticsearch_dsl?

Thanks,
Yash M.


(Daniel Mitterdorfer) #2

Hi,

assuming that you want to run a range query, this should get you started:

import elasticsearch

es = elasticsearch.Elasticsearch()
result = es.search(body={
    "query": {
        "range" : {
            "date" : {
                "gte" : "now-1d/d",
                "lt" :  "now/d"
            }
        }
    }
})
# TODO: process the result...
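
A note on the date-math expressions above: the `/d` suffix rounds to the day, and for `gte` and `lt` bounds Elasticsearch rounds down to the start of that day, so `now-1d/d` to `now/d` covers all of yesterday. The sketch below imitates that rounding with the standard library only. `round_down_to_day` and `day_range` are hypothetical helpers, not part of elasticsearch-py, and the fixed UTC instant is an assumption for illustration.

```python
from datetime import datetime, timedelta, timezone


def round_down_to_day(moment):
    """Imitate the `/d` rounding that Elasticsearch applies to gte and lt bounds."""
    return moment.replace(hour=0, minute=0, second=0, microsecond=0)


def day_range(now, offset):
    """Return the (gte, lt) pair for "now-<offset>/d" and "now/d"."""
    return round_down_to_day(now - offset), round_down_to_day(now)


# Fixed example instant (an assumption for illustration), 10:30 UTC.
now = datetime(2018, 7, 8, 10, 30, tzinfo=timezone.utc)

# "now-1d/d" .. "now/d": all of the previous day.
yesterday = day_range(now, timedelta(days=1))

# "now-15m/d" .. "now/d": both bounds round to the same midnight,
# so the range is empty rather than "the last 15 minutes".
last_15m_rounded = day_range(now, timedelta(minutes=15))

print(yesterday)
print(last_15m_rounded)
```

For a sliding window such as the last 15 minutes, the rounding suffix would normally be left out, for example `now-15m` and `now`.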

Daniel


(Yash M.) #3

Hi @danielmitterdorfer,

I tried the example query in the same way, but the output is not what I expected. For example,
if we have time-based indices in Elasticsearch named [abc-2018.07.08 & abc-2018.07.07]
and I run the following code:
import elasticsearch

es = elasticsearch.Elasticsearch()
result = es.search(index="abc-*", body={
    "query": {
        "range" : {
            "date" : {
                "gte" : "now-15m/d",
                "lt" :  "now/d"
            }
        }
    }
})

then it returns a result set from abc-2018.07.07 with the initial timestamps only. The timestamps in the result set do not match the range in the query, which is weird.
Have you faced this kind of issue before?


(Daniel Mitterdorfer) #4

Hi,

I sense the problem is not the Python API but rather what the query needs to look like. From what you describe, I am not sure whether you actually want to query a range of indices based on some naming convention or whether you want to query an actual timestamp field. Can you please provide a small, self-contained example (e.g. a small number of example documents in bulk API format) and explain, based on that, what you want to achieve? Then it's probably easier to come up with a suggestion for a query.
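
To make the two options concrete: querying by naming convention means selecting index names, while querying a timestamp field means running a range query against a wildcard pattern. A minimal sketch of the first option follows, assuming daily indices named `<prefix>-YYYY.MM.DD` as in the `abc-2018.07.08` example. `daily_index_names` is a hypothetical helper, not a library function.

```python
from datetime import date, timedelta


def daily_index_names(prefix, start, end):
    """List daily index names from start to end (both inclusive)."""
    days = (end - start).days
    return [
        "%s-%s" % (prefix, (start + timedelta(days=offset)).strftime("%Y.%m.%d"))
        for offset in range(days + 1)
    ]


names = daily_index_names("abc", date(2018, 7, 7), date(2018, 7, 8))

# A comma-separated list can be passed as the index argument
# in place of a wildcard pattern such as "abc-*".
index_param = ",".join(names)
print(index_param)
```

Restricting the index list this way only narrows which indices are searched; a range query on the timestamp field is still needed to select a window inside a day.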

Daniel


(Yash M.) #5

Hi @danielmitterdorfer,

I can't post the actual data, but I'll try to explain as much as I can.
Basically, I did some troubleshooting to find out what the issue might be. I found that the elasticsearch_dsl API is not working properly, whereas elasticsearch-py is mostly fine and we can use it.

To show the time range issue, I wrote a sample script for both the elasticsearch and elasticsearch_dsl APIs.
The index names in my Elasticsearch cluster are:

abc-2018.09.09
abc-2018.09.08
abc-2018.09.07

If I set the index name to abc-* in both scripts, the elasticsearch API starts returning results from abc-2018.09.09, whereas elasticsearch_dsl starts from abc-2018.09.07.

The second issue I faced is in elasticsearch_dsl: when I run a query, it does not return the result set that the condition defines, even though Kibana shows that values exist in that time range when I cross-check.

For debugging purposes I can share the script but not the data. The issue is about time, so the data itself does not matter at this stage; you can use a random dataset or any other time-based dataset to evaluate it.

Please find the code for the elasticsearch_dsl API below:

from dateutil.relativedelta import *
from dateutil.easter import *
from dateutil.rrule import *
from dateutil.parser import *
from datetime import  *
from elasticsearch import Elasticsearch
from elasticsearch_dsl import Search

print "Start Time: ", datetime.now()
es = Elasticsearch()
s = Search(using=es, index="threat-*")

#s.filter('term', **{'@collection': 'crimeservers'})
s.filter('range', **{'updatedAt': {'gte':'now-15m', 'lt':'now'}})
for k in s.execute():
	print k['@collection']
	x = parse(k['updatedAt'])
	print x

If anything is missing, let me know.


(Daniel Mitterdorfer) #6

Hi,

I've just created a small, self-contained sample script that creates 20 documents and simulates that each of them gets inserted into a dedicated index every minute. A range query over the last 15 minutes retrieves 15 documents as we'd expect (I tested this against Elasticsearch 6.4.0 but any recent version should work fine as well). Hope that helps you to resolve your issue.

import datetime
import elasticsearch
import json


def main():
    es = elasticsearch.Elasticsearch()

    now = datetime.datetime.now()
    num_docs = 20

    setup(es, now, num_docs)
    query(es, now, num_docs)


def setup(es, now, num_docs):
    docs = []
    es.indices.delete("elastic-test-*")
    es.indices.put_template(name="timeseries", body={
        "index_patterns": ["elastic-test-*"],
        "settings": {
            "number_of_shards": 1
        },
        "mappings": {
            "_doc": {
                "properties": {
                    "created_at": {
                        "type": "date"
                    }
                }
            }
        }
    })
    # index a number of documents
    for i in range(1, num_docs + 1):
        docs.append({"index": {"_index": "elastic-test-%d" % i, "_type": "_doc", "_id": str(i)}})
        docs.append({"created_at": (now - datetime.timedelta(minutes=i))})
    es.bulk(body=docs)
    # force refresh so we can search
    es.indices.refresh(index="elastic-test-*")


def query(es, now, num_docs):
    result = es.search(index="elastic-test-*", body={
        "query": {
            "range": {
                "created_at": {
                    "gte": now - datetime.timedelta(minutes=15),
                    "lt": now
                }
            }
        },
        # ensure that we return all docs in our test corpus
        "size": num_docs
    })
    print("Found %d results." % result["hits"]["total"])
    # uncomment below to see all results
    # print(json.dumps(result))


if __name__ == '__main__':
    main()
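
Processing the hits is left open in the script above. As a rough sketch, the timestamps can be pulled out of the response dictionary as shown below. The `sample_response` is made-up data that only mirrors the `hits.hits[]._source` layout of a search response, and `extract_timestamps` is a hypothetical helper.

```python
def extract_timestamps(response, field="created_at"):
    """Collect (index name, timestamp) pairs from a search response dict."""
    pairs = []
    for hit in response["hits"]["hits"]:
        source = hit.get("_source", {})
        if field in source:
            pairs.append((hit["_index"], source[field]))
    # ISO 8601 strings in the same format sort chronologically.
    return sorted(pairs, key=lambda pair: pair[1])


# Made-up response fragment for illustration only.
sample_response = {
    "hits": {
        "total": 2,
        "hits": [
            {"_index": "elastic-test-1", "_id": "1",
             "_source": {"created_at": "2018-09-10T09:59:00"}},
            {"_index": "elastic-test-2", "_id": "2",
             "_source": {"created_at": "2018-09-10T09:58:00"}},
        ],
    }
}

for index_name, created_at in extract_timestamps(sample_response):
    print(index_name, created_at)
```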

Daniel


(Yash M.) #7

Hi @danielmitterdorfer,

In the scenario above, you didn't use the elasticsearch_dsl API; you only used the elasticsearch API. The example I mentioned is related to the elasticsearch_dsl API.

That said, we are also comfortable using the elasticsearch API, but there is an issue there as well.
When we store data using a Logstash script and retrieve a specific time range using a Python script, I think there is some issue with @timestamp. I don't know exactly what it is.


(Daniel Mitterdorfer) #8

Hi,

I have used the Elasticsearch API instead of the DSL because it is much closer to the actual REST API and you mentioned that you have problems with the Elasticsearch API as well (+ you can usually copy & paste request bodies between the console in Kibana and the Python script and it will just work). At this point I think it is important that we get on the same page about what the actual problem is.
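
For instance, keeping the request body in a plain dictionary makes it easy to print it as JSON and paste it into the Kibana console, and the other way round. A small sketch, where `range_query_body` is a hypothetical helper and `@timestamp` is only an assumed field name:

```python
import json


def range_query_body(field, gte, lt, size=10):
    """Build a range-query request body as a plain dict."""
    return {
        "query": {"range": {field: {"gte": gte, "lt": lt}}},
        "size": size,
    }


body = range_query_body("@timestamp", "now-15m", "now")

# Paste the printed JSON below a "GET abc-*/_search" line in the Kibana console,
# or pass the dict itself as body=... to es.search().
print(json.dumps(body, indent=2))
```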

I am a bit confused by your statement. Is this now a new issue or is this related to the example above? What is it exactly that you do not know or understand about @timestamp?

Daniel


(Yash M.) #9

In my earlier post, when I described the issue, I also mentioned the following.

Basically, there are two Python APIs available for Elasticsearch:

  1. elasticsearch
  2. elasticsearch_dsl

Their syntax and behavior differ a little. For simple tasks people often use the elasticsearch API, but for large projects they look to the elasticsearch_dsl API.

The code I shared is for the elasticsearch_dsl API.

As I said earlier, I am also fine with the elasticsearch API.

People generally use Logstash to push real-time data into Elasticsearch, and that is an easy process. When it comes to retrieving the data, we use certain technologies for correlation. Kibana is for visualization, and its Dev Tools are available as well, so we can test our queries there.
Since we both need to be on the same page, here is how to reproduce the problem:

  1. Push some time-based events into Elasticsearch using Logstash.
  2. Write a query in Kibana Dev Tools to retrieve the last 15 minutes of data.
  3. Write a Python script to retrieve the data using the same query.
  4. Compare the @timestamp values of both result sets; you'll probably find a difference.

Thanks,
Yash


(Daniel Mitterdorfer) #10

Ok, then the query function in the script would be:

def query(es, now, num_docs):
    s = elasticsearch_dsl.Search(using=es, index="elastic-test-*")\
        .filter("range", created_at={"gte": now - datetime.timedelta(minutes=15), "lt": now})[0:num_docs]
    response = s.execute()
    print("Found %d results." % len(response.hits))
    for hit in response:
        print(hit.created_at)

which is still producing the same query and thus the same results.
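
One detail worth checking when comparing this with the earlier elasticsearch_dsl snippet in this thread: `Search.filter()` returns a modified copy and leaves the original object unchanged, so calling `s.filter(...)` without assigning the result runs the search without the range filter. The sketch below only builds the request bodies and needs no running cluster. The index and field names are taken from that earlier snippet, and the guarded import is only there so the example degrades gracefully if elasticsearch_dsl is not installed.

```python
import json

try:
    from elasticsearch_dsl import Search
except ImportError:  # elasticsearch_dsl is an optional dependency of this sketch
    Search = None

if Search is not None:
    s = Search(index="threat-*")

    # Not assigned: the filter ends up on a copy that is thrown away.
    s.filter("range", updatedAt={"gte": "now-15m", "lt": "now"})
    unfiltered_body = json.dumps(s.to_dict())

    # Assigned: the returned copy carries the range filter.
    filtered = s.filter("range", updatedAt={"gte": "now-15m", "lt": "now"})
    filtered_body = json.dumps(filtered.to_dict())

    print("without assignment:", unfiltered_body)
    print("with assignment:   ", filtered_body)
```

Chaining the calls, as in the query function above, or writing `s = s.filter(...)` both keep the filter.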

It seems expected to me that if you use now at different points in time you will get different results as the current point in time has changed?
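
Another possible source of mismatched timestamps, offered as an assumption since nothing in this thread confirms it: Logstash writes `@timestamp` in UTC, whereas `datetime.datetime.now()` returns naive local time, which Elasticsearch treats as UTC when no offset is present. Building the bounds from timezone-aware UTC values avoids that ambiguity. `utc_range_bounds` is a hypothetical helper, and `@timestamp` is the assumed field name.

```python
from datetime import datetime, timedelta, timezone


def utc_range_bounds(minutes, now=None):
    """Return ISO 8601 gte/lt bounds for the last `minutes` minutes, in UTC."""
    if now is None:
        now = datetime.now(timezone.utc)
    start = now - timedelta(minutes=minutes)
    return {"gte": start.isoformat(), "lt": now.isoformat()}


# Fixed instant for a reproducible example; omit `now` to use the current time.
fixed_now = datetime(2018, 9, 9, 12, 0, 0, tzinfo=timezone.utc)
bounds = utc_range_bounds(15, now=fixed_now)
body = {"query": {"range": {"@timestamp": bounds}}}
print(body)
```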

Daniel


(Yash M.) #11

Hi @danielmitterdorfer,

I don't see it that way. I push data using a Logstash script at 10-minute intervals, so Elasticsearch already has the data, and we retrieve the last 15 minutes of it. To go deeper, you can apply any value-based filter query on top of that if this simple query seems too complicated.

Run the same query in Kibana Dev Tools and try to match the results; I don't think they will be the same.

I am also looking for a solution to another problem: suppose I have to find a value from one index in another index. What would be the right approach to write a Python script that matches values against another index and runs on a schedule in near real time?

