Hello everybody
I have an mbox file of about 26 GB (this file is called bigFile.mbox).
I want to index bigFile.mbox in Elasticsearch. How can I do this?
Here's what I tried to put in place:
1. I started writing a Python script that converts bigFile.mbox to bigFile.json.
The script works correctly on small files (<= 1 MB),
but it fails when converting bigFile.mbox, which is 26 GB.
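For what it's worth, converting the whole file into one giant JSON document will exhaust memory; one alternative is to stream the messages out one at a time as newline-delimited JSON using Python's standard `mailbox` module. This is only a minimal sketch: the field names (`subject`, `from`, `date`, `body`) are illustrative assumptions, not a required mapping.

```python
import json
import mailbox

def mbox_to_ndjson(mbox_path, out_path):
    """Stream one JSON document per email instead of building one giant JSON.

    Assumption: the chosen fields are illustrative; adapt them to whatever
    mapping you plan to use in Elasticsearch.
    """
    box = mailbox.mbox(mbox_path)
    with open(out_path, "w", encoding="utf-8") as out:
        for msg in box:
            doc = {
                "subject": msg.get("Subject", ""),
                "from": msg.get("From", ""),
                "date": msg.get("Date", ""),
                # Multipart bodies need real MIME handling; skipped in this sketch.
                "body": msg.get_payload() if not msg.is_multipart() else "",
            }
            out.write(json.dumps(doc) + "\n")
```

Because each message is written as soon as it is read, memory use stays roughly constant regardless of the mbox size.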
2. I created a Python script (splitter.py) that splits an mbox file into several files of a given size.
But this script takes a long time to split a 26 GB file, and it
generates a huge number of small 1 MB mbox files, which makes indexing into
Elasticsearch much slower.
Please, what is the most efficient way to index a 26 GB mbox file in Elasticsearch?
You probably don't want to index the whole mailbox as one document; you probably want to index every single email that is in the mailbox.
My guess is that you want to be able to find an email, not a mailbox, which contains XYZ.
So you need to find a way to read every single email and send it to Elasticsearch. Then you could perhaps use the ingest-attachment plugin to index each email.
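A rough sketch of that approach, using the stdlib `mailbox` module to read the emails one by one and the bulk helper of the official Python client to send them. The index name `emails` and the document fields are assumptions for illustration, not a prescribed schema:

```python
import mailbox

def generate_actions(mbox_path, index_name="emails"):
    """Yield one bulk-API action per email in the mbox file.

    Assumption: index name and field choices are illustrative only.
    """
    for i, msg in enumerate(mailbox.mbox(mbox_path)):
        yield {
            "_index": index_name,
            "_id": i,
            "_source": {
                "subject": msg.get("Subject", ""),
                "from": msg.get("From", ""),
                "date": msg.get("Date", ""),
                # Multipart messages would need proper MIME walking here.
                "body": msg.get_payload() if not msg.is_multipart() else "",
            },
        }

# To actually index (requires the elasticsearch client and a running cluster):
# from elasticsearch import Elasticsearch, helpers
# es = Elasticsearch("http://localhost:9200")
# helpers.bulk(es, generate_actions("bigFile.mbox"))
```

Since `generate_actions` is a generator, emails are read and sent in batches rather than loaded all at once, so a 26 GB mailbox can be indexed without splitting it first.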