I have a folder with 590,035 JSON files. Each file is a document that has to be indexed. If I index each document one at a time from Python, it takes more than 30 hours. How do I index these documents quickly?
Note - I've seen the bulk API, but that seems to require merging all the files into one, which takes a similar amount of time as above.
Please tell me how to improve the speed. Thank you.
Maybe you can use Filebeat, or have a look at FSCrawler, which has a JSON file importer mode. See https://fscrawler.readthedocs.io/en/latest/admin/fs/local-fs.html
I don't want to index the files themselves; I want to index the JSON content of the files. I don't see how FSCrawler or Filebeat can help with that.
I have not used FSCrawler, but I know that Filebeat and Logstash can ship JSON to Elasticsearch.
Data in Elasticsearch is stored as JSON documents, whatever you call them. Filebeat can read your JSON files and ship the content, one document at a time, to Elasticsearch, which will index each complete JSON object as an individual document.
I think Filebeat expects one JSON object per line, though, so it depends a bit on the format of your JSON files.
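If the files are pretty-printed (multi-line) JSON, one option is to rewrite each file as a single compact line so a line-oriented shipper like Filebeat can treat each file's content as one event. A minimal sketch, assuming one JSON object per file (the folder layout and in-place rewrite are assumptions for illustration):

```python
import json
from pathlib import Path

def flatten_json_files(folder):
    """Rewrite every .json file in `folder` as one compact line.

    A line-oriented shipper (e.g. Filebeat with its JSON decoding
    options) can then read each file as a single JSON event.
    """
    for path in Path(folder).glob("*.json"):
        doc = json.loads(path.read_text(encoding="utf-8"))
        # separators=(",", ":") strips whitespace; one trailing newline per file
        path.write_text(json.dumps(doc, separators=(",", ":")) + "\n",
                        encoding="utf-8")
```

This keeps the files separate (no merging), only changing their layout to one object per line.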
I have not tried to do this directly from Python, but it should definitely be possible. I have Python scripts as parts of ingestion pipelines, and they have no problem doing thousands of JSON objects per second. But it depends on the complexity and size of the JSON objects you want to index and the hardware resources you have.
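For reference, the official Python client's bulk helpers take a generator of actions, so the files never need to be merged on disk; the client batches documents over the wire. A sketch of the generator side (the index name, `_id` scheme, and folder layout are assumptions, not from the thread):

```python
import json
from pathlib import Path

def generate_actions(folder, index_name):
    """Yield one bulk action per JSON file in `folder`.

    Nothing is merged on disk; each file is read and turned into an
    action dict the bulk helpers understand. Using the file stem as
    `_id` is just an illustrative choice.
    """
    for path in Path(folder).glob("*.json"):
        doc = json.loads(path.read_text(encoding="utf-8"))
        yield {"_index": index_name, "_id": path.stem, "_source": doc}

# With the official elasticsearch client installed, this generator
# would be fed to a bulk helper, e.g.:
#
#   from elasticsearch import Elasticsearch, helpers
#   es = Elasticsearch("http://localhost:9200")
#   for ok, info in helpers.parallel_bulk(
#           es, generate_actions("docs", "myindex"),
#           thread_count=4, chunk_size=500):
#       if not ok:
#           print("failed:", info)
```

Batching (and optionally parallelising) the requests this way is usually what turns a 30-hour one-document-per-request loop into minutes, though the exact numbers depend on document size and hardware.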
That's what I meant as well.
This topic was automatically closed 28 days after the last reply. New replies are no longer allowed.