How to upload a file into Elasticsearch

Hi, I am new to ELK.
I need to upload different types of files into Elasticsearch. Is it possible to do that? If it is, can anyone please show me how, with some sample code? As of now, I am reading each file and sending its data. Out of 6k files in one zip file, I am able to send only around 3k. But I need to attach a file directly, without reading its content.

Welcome to our community! :smiley:
What sort of files are they?

Hi warkolm,
Thanks for the reply.
It is a Linux SOS report zip file consisting of many files (nearly 11k) with different extensions, like txt, conf, rules, xml, File, CRON, CNF, CF, REPO, ALLOW, CFG, etc.

You could use Filebeat for most of it, but it would require a bit of configuration so that it'd extract the right patterns.

Sorry, I don't know much about ELK; I only moved to this project a few days back. Can't we store it as a blob? I need to store all 11k files and then be able to get them back again.

No, Elasticsearch is not a binary store. Everything is converted to JSON.

Ok, so Filebeat is the only way to store all these files?

It sends the data to Elasticsearch.

Check out https://www.elastic.co/products/ for a bit more info on everything.

Ok thanks warkolm for your time.

Hi warkolm,

I am iterating over all the files, reading each one's content, and creating a document in Elasticsearch. Out of 6k files I am able to read and upload only around 3k on the first run, and the number of uploaded files decreases on every subsequent run. On different platforms it inserts a different number of files.
Here is my sample code:

import datetime
import json
import os

import requests

def upload_file_to_es(name='', path=''):
    # elastic_url, headers and count21 are assumed to be defined at module level
    global count21
    if name != '':
        try:
            # Read as text first; fall back to binary mode if decoding fails
            try:
                with open(os.path.join(path, name), 'r') as file_obj:
                    log_file_string = file_obj.read()
            except Exception:
                with open(os.path.join(path, name), 'rb') as file_obj:
                    log_file_string = file_obj.read()

            # Strip the local extraction directory so only the relative path is stored
            file_path = os.path.join(
                path.replace(r"C:\Users\Manikindi_shaik_Noor\Music\100.64.24.138_2020_Apr_15_07_10\nts_sles12_base_200415_0259", ""),
                name
            )

            upload_data = {}
            upload_data["path"] = file_path
            upload_data["file"] = name
            upload_data["content"] = str(log_file_string)  # json.dumps(log_file_string)
            upload_data["ip"] = "0.0.0.3"
            upload_data["exec"] = "execution-ghi-013"
            upload_data["timestamp"] = datetime.datetime.now().strftime("%Y-%m-%dT%H:%M:%S")

            try:
                resp_elastic = requests.post(
                    elastic_url,
                    headers=headers,
                    data=json.dumps(upload_data),
                    verify=False
                )
                print("ES Entry completed for {}".format(file_path))
                count21 += 1
                print("file number %s" % count21)
            except Exception as e:
                print("ERROR: %s" % e)
                print("ERROR: FILE %s" % file_path)

        except Exception as e:
            print("ERROR: %s" % str(e))
            print("ERROR: FILE %s" % file_path)

Each time, the number of documents that make it into Elasticsearch is different. Any suggestions, please?

I see you are using requests there. I would recommend the official Python client library (and in particular the streaming bulk helper: https://elasticsearch-py.readthedocs.io/en/master/helpers.html#example).
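For illustration only, here is a minimal sketch of that approach, assuming the elasticsearch-py client, a cluster at localhost:9200, and an illustrative index name and directory (none of which come from the original post). The helper reports per-document success or failure, so dropped files become visible instead of silently reducing the count:

import os

from elasticsearch import Elasticsearch
from elasticsearch.helpers import streaming_bulk

es = Elasticsearch(["http://localhost:9200"])  # assumed local cluster

def generate_docs(root_dir):
    """Yield one bulk action per file under root_dir."""
    for dirpath, _, filenames in os.walk(root_dir):
        for fname in filenames:
            full_path = os.path.join(dirpath, fname)
            try:
                with open(full_path, "r", errors="replace") as f:
                    content = f.read()
            except OSError:
                continue
            yield {
                "_index": "sos-files",  # illustrative index name
                "_source": {
                    "path": os.path.relpath(full_path, root_dir),
                    "file": fname,
                    "content": content,
                },
            }

# streaming_bulk yields (ok, result) per document, so failures can be logged
for ok, result in streaming_bulk(es, generate_docs("extracted_sosreport"), raise_on_error=False):
    if not ok:
        print("Failed:", result)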

As stated, Elasticsearch will not do well with raw file contents. You are likely hitting indexing errors on some files, because Elasticsearch has guessed the field mapping (based on the first document) and some file contents will not be valid for that mapping. You can ingest them as blobs (base64-encode them in your Python first) with a non-analysed field mapping if you really need to store them (you cannot search non-analysed fields, however).

The easiest way is to use an index mapping template:
https://www.elastic.co/guide/en/elasticsearch/reference/current/index-templates.html

You need to set enabled to false, as described here (see the sketch below):
https://www.elastic.co/guide/en/elasticsearch/reference/current/enabled.html
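As a sketch only (assuming Elasticsearch 7.8+ composable index templates; the template name, index pattern, and field name are made up for illustration), the template could disable indexing of the raw contents like this:

import json
import requests

# Whatever is stored under "content" is kept in _source but never parsed or
# indexed, because enabled:false applies to the object field as a whole.
template = {
    "index_patterns": ["sos-files*"],   # illustrative pattern
    "template": {
        "mappings": {
            "properties": {
                "content": {"type": "object", "enabled": False}
            }
        }
    }
}

resp = requests.put(
    "http://localhost:9200/_index_template/sos-files",  # assumed local cluster
    headers={"Content-Type": "application/json"},
    data=json.dumps(template),
)
print(resp.json())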

This is a bit of a hacky workaround. I really recommend storing the data outside Elasticsearch and just ingesting a link to the file, e.g. in S3, a UNC share, or something like that. It is not much more code to upload the file from Python to such a store and then ingest the link for search purposes.

Off topic:
The "correct" way to do this is to use something like Tika (https://tika.apache.org/) to parse metadata from the contents and then ingest that in a uniform manner; see the sketch below.

You can use the ingest attachment plugin.

There's an example here: https://www.elastic.co/guide/en/elasticsearch/plugins/current/using-ingest-attachment.html

PUT _ingest/pipeline/attachment
{
  "description" : "Extract attachment information",
  "processors" : [
    {
      "attachment" : {
        "field" : "data"
      }
    }
  ]
}
PUT my_index/_doc/my_id?pipeline=attachment
{
  "data": "e1xydGYxXGFuc2kNCkxvcmVtIGlwc3VtIGRvbG9yIHNpdCBhbWV0DQpccGFyIH0="
}
GET my_index/_doc/my_id

The data field is basically the BASE64 representation of your binary file.
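For example, a small Python sketch that base64-encodes a file and sends it through that pipeline (the index name, document id, and pipeline name mirror the snippet above; the cluster URL and file path are illustrative):

import base64
import json
import requests

# Read the file as bytes and base64-encode it for the "data" field
with open("/tmp/sosreport/var/log/messages", "rb") as f:
    encoded = base64.b64encode(f.read()).decode("ascii")

resp = requests.put(
    "http://localhost:9200/my_index/_doc/my_id?pipeline=attachment",
    headers={"Content-Type": "application/json"},
    data=json.dumps({"data": encoded}),
)
print(resp.json())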

You can use FSCrawler. There's a tutorial to help you get started.
