I have a folder with around 590,035 JSON files. Each file is a document that has to be indexed. Indexing them one at a time from Python takes more than 30 hours. How do I index these documents quickly?
Note - I've seen the bulk API, but that requires merging all the files into one, which takes a similar amount of time.
Please tell me how to improve the speed. Thank you.
I have not used FSCrawler, but I do know Filebeat and Logstash, which can ship JSON to Elasticsearch.
Data in Elasticsearch is stored as JSON documents, whatever technical naming convention you use. Filebeat, for example, can read your JSON files and ship the content, one document at a time, to Elasticsearch, which will index each complete JSON object as an individual document.
I think Filebeat expects one JSON object per line, though, so it depends a bit on the format of your JSON files.
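If your files are pretty-printed (multi-line) JSON, a small Python pass can rewrite each one as a single line so Filebeat's line-oriented JSON handling can pick them up. A rough sketch, with the source and destination folder names as placeholders:

```python
# Rough sketch: rewrite pretty-printed JSON files as single-line JSON so
# Filebeat (one JSON object per line) can harvest them. Paths are placeholders.
import json
from pathlib import Path

src = Path("./json_files")
dst = Path("./ndjson_files")
dst.mkdir(exist_ok=True)

for path in src.glob("*.json"):
    with open(path, "r", encoding="utf-8") as f:
        doc = json.load(f)
    # Dumping without indentation yields one JSON object on one line.
    (dst / path.name).write_text(json.dumps(doc, ensure_ascii=False) + "\n",
                                 encoding="utf-8")
```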
I have not tried to do this directly from Python, but it should definitely be possible. I have some Python scripts as parts of ingestion pipelines, and they have no problem indexing thousands of JSON objects per second. But it depends on the complexity and size of the JSON objects you want to index and the hardware resources you have.
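For the Python route, the bulk helpers in the official elasticsearch client can stream actions from a generator, so you never have to merge the files on disk. A minimal sketch, assuming a local cluster, an index called "documents", and that the document ID should come from the file name; tune chunk_size and thread_count for your hardware:

```python
# Minimal sketch: stream one bulk action per JSON file into Elasticsearch
# using parallel_bulk. Host, index name, folder, and batch sizes are
# assumptions to adjust for your setup.
import json
from pathlib import Path

from elasticsearch import Elasticsearch
from elasticsearch.helpers import parallel_bulk

es = Elasticsearch("http://localhost:9200")

def generate_actions(folder):
    # Yield one indexing action per file; the helper batches them into
    # bulk requests, so no merged file is ever created.
    for path in Path(folder).glob("*.json"):
        with open(path, "r", encoding="utf-8") as f:
            doc = json.load(f)
        yield {"_index": "documents", "_id": path.stem, "_source": doc}

# parallel_bulk sends batches concurrently from a thread pool and yields
# an (ok, result) tuple per document so failures can be logged.
for ok, result in parallel_bulk(es, generate_actions("./json_files"),
                                chunk_size=1000, thread_count=4):
    if not ok:
        print("Failed:", result)
```

With batching and a few threads like this, the bottleneck usually shifts from the per-request overhead to disk reads and cluster indexing capacity, which is where the 30-hour runs typically come from.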