Elasticsearch indices stored on S3 mounted with S3FS

Hi,

So I have a rather specific infrastructure where I need to store my "older than 30 days" indices on COLD/WARM nodes. Those nodes have an S3 bucket (one bucket shared by all 4 nodes) mounted as a filesystem on each node in the /data/ folder, and /data/ is of course set as the data path for those nodes to store their indices.

The setup is: 4 hot nodes and 4 cold/warm nodes, 15 GB RAM each (7 GB heap).
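
For context, indices are moved between the hot and cold/warm nodes with shard allocation filtering on a node attribute; a rough sketch of that step with the Python client is below (the `box_type` attribute, index name, and host are placeholders, not necessarily our exact settings):

```python
from elasticsearch import Elasticsearch

es = Elasticsearch("http://localhost:9200")  # placeholder host

# Assumes the COLD/WARM nodes were started with a node attribute such as
# "node.attr.box_type: warm" in elasticsearch.yml, and that their path.data
# points at the S3FS-mounted /data folder.
es.indices.put_settings(
    index="logs-2019.01.01",  # placeholder daily index name
    body={"index.routing.allocation.require.box_type": "warm"},
)
```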

What I'd like to ask is: when we are talking about 100 GB of data daily (right now) and something like 500 GB daily in the future, does an infrastructure like this make any sense?

We have been testing this for a while now, but we ran into stability problems: the whole Elasticsearch cluster kept falling over. It seemed like S3 + S3FS is simply too slow to handle such amounts of data. All HOT and COLD/WARM nodes have 15 GB of RAM and a 7 GB heap - that is sized for 100 GB of data per day, and we will of course expand it, but the most important question is:
Does mounting S3 with S3FS as a filesystem for Elasticsearch indices on RHEL 7 make any sense, or should I look for some other way to store old data?

I know this is a very abstract question, so I will be very grateful for any answers!

It makes sense as a concept, but it's definitely not supported and probably won't work. Elasticsearch (really Lucene) expects to be able to quickly access random sections of its index files, and that's not the interface S3 exposes, so I guess it's going to be downloading a lot of data (slowly) each time it wants to access each block of each file. You could even end up paying more in access and transfer costs than simply for storing the data on block storage.


Well, to be honest we use something like this: https://www.emc.com/techpubs/ecs/ecs_s3_supported_features-1.htm#GUID-8725EEF9-EE9C-4423-A9DD-58B6877B8486

It acts like S3 but it is not one, and it's also slow as hell. Anyway, costs don't matter in this situation because we aren't actually using the real Amazon S3.

If your storage is very slow when it comes to random read access, you might very well end up with an unusable node even at low data volumes. I would recommend using some other type of storage.

What if we closed each index before moving it to a COLD/WARM node and re-opened it only when it was really necessary? We create a new index every 24 hours, so finding the index for a specific day wouldn't be a problem.
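
For reference, that would just be two API calls per index; a minimal sketch with the Python client (index name and host are placeholders):

```python
from elasticsearch import Elasticsearch

es = Elasticsearch("http://localhost:9200")  # placeholder host

index = "logs-2019.01.01"  # placeholder daily index name

# Closing an index keeps its shard data on disk but releases the open file
# handles and heap-resident structures, so it costs almost nothing at rest.
es.indices.close(index=index)

# Re-open it later, only when it actually needs to be queried.
es.indices.open(index=index)
```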

Closing indices just helps with resource usage, not query performance (which is my primary concern in this case). If you are suffering from heap pressure, I would recommend you read this blog post on sharding and watch this webinar on optimising storage.

Ah yes, I've come across these before :slight_smile:. The storage underneath is a normal filesystem as far as I can tell, which is what Elasticsearch wants, so the S3 compatibility layer is really just getting in the way here. Can you cut out the middleman and talk directly to the disks?

If you can do this, could you also snapshot each index and only restore it when you want to query it? Snapshot/restore is designed to work with S3's whole-object access model.
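
Roughly, that workflow could look like the sketch below with the Python client. This is only a sketch: it assumes the repository-s3 plugin is installed, and that the endpoint and credentials for your S3-compatible store are set up on each node via the `s3.client.default.*` settings and the keystore; all names are placeholders.

```python
from elasticsearch import Elasticsearch

es = Elasticsearch("http://localhost:9200")  # placeholder host

# Register a snapshot repository of type "s3" pointing at the bucket.
# The custom endpoint and credentials for an S3-compatible store are
# configured on each node (s3.client.default.* settings / keystore).
es.snapshot.create_repository(
    repository="cold_archive",  # placeholder repository name
    body={"type": "s3", "settings": {"bucket": "my-cold-bucket"}},
)

# Snapshot one daily index into the repository, waiting until it completes.
es.snapshot.create(
    repository="cold_archive",
    snapshot="logs-2019.01.01",  # placeholder snapshot name
    body={"indices": "logs-2019.01.01"},
    wait_for_completion=True,
)

# Once the snapshot succeeds, the local copy can be deleted to free disk.
es.indices.delete(index="logs-2019.01.01")
```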

Incidentally, the idea you describe is what https://github.com/elastic/elasticsearch/issues/34352 should offer, but, as @Christian_Dahlqvist says, it's unlikely to help with the kind of storage you describe.
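
For completeness, on versions where that feature has landed, freezing is a single call per index; a sketch assuming a Python client version that exposes the freeze/unfreeze helpers (the index name is a placeholder):

```python
from elasticsearch import Elasticsearch

es = Elasticsearch("http://localhost:9200")  # placeholder host

# Freeze an old index: it remains searchable, but its transient in-memory
# structures are dropped and rebuilt on demand for each search.
es.indices.freeze(index="logs-2019.01.01")  # placeholder index name

# Unfreeze it again if it starts being queried frequently.
es.indices.unfreeze(index="logs-2019.01.01")
```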

If the cold tier is truly cold, you can use snapshot and restore with this file system. This will however require you to restore indices before querying them, and you still need spare capacity in your hot/warm cluster to restore into.
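
The restore side would then look something like this (same sketch-level Python client and placeholder names as above):

```python
from elasticsearch import Elasticsearch

es = Elasticsearch("http://localhost:9200")  # placeholder host

# Restore a single archived index only when it is actually needed; the
# hot/warm nodes must have enough free disk and heap to host it again.
es.snapshot.restore(
    repository="cold_archive",   # placeholder repository name
    snapshot="logs-2019.01.01",  # placeholder snapshot name
    body={"indices": "logs-2019.01.01"},
    wait_for_completion=True,
)

# Query it as usual, then drop the restored copy when finished.
es.indices.delete(index="logs-2019.01.01")
```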

Whoa! Frozen indices sound really cool!

This week I'm about to turn on those COLD nodes (because the wheels are already turning), and my biggest "fear" is that the whole Elasticsearch cluster will become extremely unstable. As @Christian_Dahlqvist said, we would probably need more memory to run something like that. If our cluster goes kaboom, then I'll have to use snapshots; it seems that is the only way, because I can't change much in our Elastic project :confused:
