Hello Community, I appreciate you. Ok, on to my question. I have 30 TB of data on a Windows drive that I plan to move to my Linux cluster hosting the Elastic Stack. All resources are self-managed on premises.
A) Is this 30 TB of data already ingested and stored in another Elasticsearch cluster, and you want to move it to a new Elasticsearch cluster?
B) Raw data that has never been ingested into Elasticsearch, like logs, files, documents, etc., and you want to ingest it into Elasticsearch. If this, please describe the data and what you plan to do with it once it is in Elasticsearch.
A or B? Depending on that, we can give some suggestions.
To be clear, pointing the Elasticsearch data path at a random directory of files does not mean the data will be available in Elasticsearch. All data must be ingested/indexed into Elasticsearch to be useful. There are many approaches/tools/methods to ingest data into Elasticsearch, but in the end they all write to Elasticsearch via the Elasticsearch REST APIs.
More clarification: typically the Elasticsearch data path is initially set to an empty path, which then stores the ingested and indexed data in a binary format that cannot be viewed, edited, copied, or modified directly; the data can only be accessed via the Elastic APIs / Kibana, etc.
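To make the "everything goes through the REST APIs" point concrete, here is a minimal sketch of indexing a single document. The URL, credentials, index name, and fields are all placeholder assumptions for illustration, not details from your setup:

```python
# Minimal sketch: index one JSON document into Elasticsearch via its REST API.
# The URL, credentials, and index name ("my-files") are placeholder assumptions.
import requests

doc = {
    "file_name": "report_2021.docx",
    "file_size_bytes": 48213,
    "content": "Text extracted from the file...",
}

resp = requests.post(
    "https://localhost:9200/my-files/_doc",   # index API endpoint (placeholder host)
    json=doc,
    auth=("elastic", "changeme"),             # placeholder credentials
    verify=False,                             # only acceptable for a local test cluster
)
resp.raise_for_status()
print(resp.json())  # Elasticsearch returns the generated _id, shard info, etc.
```

Every ingest tool (Logstash, Agent, connectors, custom scripts) ultimately does some variant of this under the hood.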
(btw, the word "document" usually has a more specific meaning in Elasticsearch, similar to a record in a database.)
Out of curiosity, precisely why are you ingesting them into Elasticsearch? Be as specific as possible in your answer; e.g. "so I can search them" isn't the sort of detail I'm looking for.
And as @RainTown asked, how and what you want to do with your data is super important, because you will want to come up with an "Information Architecture". Depending on the data categories, that will include:
Index / data stream strategy (how you want to partition your data: number and size of shards, etc., and how you want to manage it)
Whether the data is static or updated
How you want to access and search your data
Mapping / schema (a small sketch of this follows below)
Index Lifecycle Management
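For the mapping / schema piece, here is a minimal sketch of creating an index with explicit settings and a mapping over the REST API. The cluster URL, credentials, index name, and field names are placeholder assumptions for illustration, not a recommendation for your data:

```python
# Sketch: create an index with explicit settings and a mapping (illustrative names only).
import requests

index_body = {
    "settings": {
        "number_of_shards": 1,     # partitioning decision
        "number_of_replicas": 1,   # redundancy decision
    },
    "mappings": {
        "properties": {
            "file_name": {"type": "keyword"},
            "created_at": {"type": "date"},
            "content": {"type": "text"},
        }
    },
}

resp = requests.put(
    "https://localhost:9200/my-files",   # placeholder cluster URL and index name
    json=index_body,
    auth=("elastic", "changeme"),        # placeholder credentials
    verify=False,
)
print(resp.json())
```

Whether you use one index, many indices, or data streams, and how many shards/replicas each gets, should fall out of the information-architecture questions above.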
Then there is the ingest architecture (I suspect you will need a couple of tools/approaches):
Whether you are using Elastic Agent for things like Twitter feeds
When you say images, what do you mean: do you want to search the metadata of the images, or do you want to "vectorize" the image data for search?
For PDFs, you will need the connectors (OOTB) or FSCrawler, etc., and will you do simple search, semantic search, etc.? (A rough do-it-yourself sketch follows below.)
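If, for a quick test, you wanted to hand-roll something instead of the connectors or FSCrawler, a rough sketch might look like the following. pypdf and every name/value here are my assumptions for illustration; the real connectors/FSCrawler handle metadata, OCR, scheduling, retries, and much more:

```python
# Sketch: extract text from one local PDF and index it (illustrative only;
# pypdf, the index name, and credentials are assumptions for this example).
import requests
from pypdf import PdfReader

reader = PdfReader("sample.pdf")
text = "\n".join(page.extract_text() or "" for page in reader.pages)

requests.post(
    "https://localhost:9200/my-files/_doc",  # placeholder cluster URL / index
    json={"file_name": "sample.pdf", "content": text},
    auth=("elastic", "changeme"),
    verify=False,
)
```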
This is a pretty substantial undertaking....
What I would suggest is to pick one of the simpler sources and then have a bit deeper conversation so you can learn along the way... trying to do this all at once... probably not a great plan.
A very useful and diplomatic answer from @stephenb
[ If you had asked this 10 years ago, a decent answer to your question would be "get yourself a Google Search Appliance" and simply point it at your files. Google don't sell them anymore, partially because tools like Elasticsearch sort of took their market and enhanced the capabilities, but the appliance was really easy to deploy and use. You have a lot more work to do. ]
I think you also have maybe partially misunderstood how Elasticsearch works, given the original "use ntfs-3g to copy the data into linux and then change the data path in the elasticsearch.yml" idea. That, er, would not have worked. It's pretty important you understand why not, as a first priority. Hint: assuming you have some Elasticsearch/Kibana installation somewhere now, and I hope you have, to start learning, take a look at the files under your path.data directory tree now. You might be surprised.
As well as what Stephen wrote, just on the data volume: your 30 TB of data might correspond to a lot less or a lot more when indexed into Elasticsearch. If, e.g., you only want to index the metadata of the images, that's going to be a lot less data than the image files themselves. Maybe for PDFs too (depends on the content).
As to storage generally, you likely want at least one replica of each "document" (in practice, replica shards), for standard data protection and availability reasons. It also helps with search. A nice 3- or 5-node cluster, with primary and replica shards spread across the N data nodes, is a fairly standard architecture. Also, the time taken to index all your files will be pretty significant; I'd be surprised if it completed in even a few days, given you will likely have to try something, check, test, improve, and iterate around this loop a few times.
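On indexing time specifically, at this volume you would normally batch documents through the _bulk API rather than send one request per file. A rough sketch, with placeholder values throughout:

```python
# Sketch: send documents to the _bulk API in batches (placeholder values throughout).
import json
import requests

docs = [
    {"file_name": f"file_{i}.txt", "content": "..."}  # stand-in documents
    for i in range(1000)
]

lines = []
for doc in docs:
    lines.append(json.dumps({"index": {"_index": "my-files"}}))  # action line
    lines.append(json.dumps(doc))                                # source line
body = "\n".join(lines) + "\n"   # _bulk requires newline-delimited JSON with a trailing newline

resp = requests.post(
    "https://localhost:9200/_bulk",                       # placeholder cluster URL
    data=body,
    headers={"Content-Type": "application/x-ndjson"},
    auth=("elastic", "changeme"),
    verify=False,
)
print(resp.json()["errors"])  # True if any item in the batch failed
```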
@RainTown and @stephenb, thank you for your candor. Ok, the best way to eat an elephant is one bite at a time. What would be the simplest type of data to ingest? For testing's sake, I want to consider the data static.
Yes, candor, very politely put - thank you for this.
My own first contact with elasticsearch/logstash/kibana was with tweets (using Twitter API) some years ago. It's a mix of text (the tweets) and numbers (hits, likes, followers, retweets, etc), so ideal for learning the tools, visualizations, etc.
Anything that is "just text" is probably easiest.
But again, it depends what you eventually want to do with your data. If it's just simple searching of a bucket of personal data, for personal use, and you want to find any documents that mention Paris, pictures taken in Paris, or tweets that mention Paris, then Elasticsearch is maybe over-engineering a bit, though it's certainly a nice toy project. We simply don't know enough about your use case.