Is it possible to index a 26GB file in Elasticsearch?

Hello everybody,
I have a 26GB mbox file (called bigFile.mbox).
I want to index bigFile.mbox in Elasticsearch. How can I do this?

Here's what I tried to put in place:

1. I started writing a Python script that converts bigFile.mbox to bigFile.json. The script works correctly on small files (<= 1MB), but it fails on the 26GB bigFile.mbox.
2. I created a Python script (splitter.py) that splits an mbox file into several files of a given size. But this script takes a long time to cut up a 26GB file, and it generates a huge number of small 1MB mbox files, which makes indexing into Elasticsearch much slower. (A streaming alternative is sketched after this list.)
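For reference, since the scripts themselves aren't shown here, a streaming version of the conversion could look like the sketch below. It uses Python's standard mailbox module, which records message offsets and loads one message at a time instead of reading all 26GB into memory (the initial scan still passes over the whole file once). The output file name and the chosen fields are illustrative assumptions, not the original script:

```python
import json
import mailbox

def body_text(msg):
    """Best-effort plain-text body; real mail needs more care
    (nested multiparts, character sets, attachments)."""
    if msg.is_multipart():
        for part in msg.walk():
            if part.get_content_type() == "text/plain":
                payload = part.get_payload(decode=True)
                return payload.decode(errors="replace") if payload else ""
        return ""
    payload = msg.get_payload(decode=True)
    return payload.decode(errors="replace") if payload else ""

# mailbox.mbox builds a table of message offsets, then loads each
# message on demand, so memory use stays modest even for a 26GB file.
box = mailbox.mbox("bigFile.mbox")

with open("bigFile.jsonl", "w", encoding="utf-8") as out:
    for key in box.iterkeys():
        msg = box.get_message(key)
        doc = {
            "subject": msg["subject"],
            "from": msg["from"],
            "to": msg["to"],
            "date": msg["date"],
            "body": body_text(msg),
        }
        # One JSON document per line (NDJSON) rather than one giant
        # JSON array, so the output can also be streamed later.
        out.write(json.dumps(doc) + "\n")
```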

Please, is this the most efficient way to index a 26GB mbox file in Elasticsearch?

I thank you in advance.

Best regards
MBE

You probably don't want to index the mailbox as one whole document; instead, you probably want to index every single email that is in the mailbox.

My guess is that you want to be able to find an email, not a mailbox, that contains XYZ.

So you need to find a way to read every single email and send it to Elasticsearch. Then you can perhaps use the ingest-attachment plugin to index each email.
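Something like this might work (an untested sketch with the official elasticsearch Python client and the standard mailbox module; the emails index name and the localhost address are assumptions):

```python
import mailbox

from elasticsearch import Elasticsearch, helpers

es = Elasticsearch("http://localhost:9200")  # assumed local cluster

def actions(path, index="emails"):
    """Yield one bulk action per email found in the mbox file."""
    box = mailbox.mbox(path)
    for key in box.iterkeys():
        msg = box.get_message(key)
        yield {
            "_index": index,
            "_source": {
                "subject": msg["subject"],
                "from": msg["from"],
                "to": msg["to"],
                "date": msg["date"],
            },
        }

# Each email becomes its own document; the bulk helper batches the
# requests so you never send 26GB in a single call.
helpers.bulk(es, actions("bigFile.mbox"))
```

If you want ingest-attachment to do the parsing instead, you could base64-encode each raw message into a field and send the documents through a pipeline; extracting the headers yourself, as above, keeps the mapping simpler.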

My 2 cents.


Hello dadoonet,
do you know an effective way to index each email that is in a mailbox?

No, I don't.
Maybe the Apache Tika project has some code related to parsing mailbox content. It's in Java, though.
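If you want to stay in Python, the third-party tika-python bindings can call out to a running Tika server. Something like this might work for inspecting a small mailbox (Tika parses the whole file in one go, so it won't help with the 26GB case):

```python
# Untested sketch: tika-python sends the file to a running Tika server,
# whose mbox parser extracts text and metadata. Whole-file parsing, so
# this is only realistic for small mailboxes.
from tika import parser

parsed = parser.from_file("small.mbox")
print(parsed["metadata"])        # headers picked up by Tika
print(parsed["content"][:500])   # beginning of the extracted text
```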

Thank you very much! :grinning:

This topic was automatically closed 28 days after the last reply. New replies are no longer allowed.