Hello Community, I appreciate you. Ok, on to my question. I have 30 TB of data on a Windows drive that I plan to move to my Linux cluster hosting the Elastic Stack. All resources are self-managed on premises.
A) Is this 30 TB of data already ingested and stored in another Elasticsearch cluster, and you want to move it to a new Elasticsearch cluster?
B) Raw data that has never been ingested into Elasticsearch, like logs, files, documents, etc., and you want to ingest it into Elasticsearch. If this, please describe the data and what you plan to do with it once it is in Elasticsearch.
A or B? Depending on that, we can give some suggestions.
To be clear, pointing the Elasticsearch data path at a random directory of files does not mean the data will be available in Elasticsearch. All data must be ingested/indexed into Elasticsearch to be useful. There are many approaches/tools/methods to ingest data into Elasticsearch, but in the end they all write to Elasticsearch via the Elasticsearch REST APIs.
More clarification: typically the Elasticsearch data path is initially set to an empty path, which then stores the ingested and indexed data in a binary format that cannot be viewed, edited, copied, or modified directly; the data can only be accessed via the Elastic APIs / Kibana, etc.
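To make the "everything goes through the REST APIs" point concrete, here is a minimal sketch of indexing a single document. The URL, credentials, index name, and fields are all placeholder assumptions for illustration, not details from your setup:

```python
# Minimal sketch: index one JSON document into Elasticsearch via its REST API.
# The URL, credentials, and index name ("my-files") are placeholder assumptions.
import requests

doc = {
    "file_name": "report_2021.docx",
    "file_size_bytes": 48213,
    "content": "Text extracted from the file...",
}

resp = requests.post(
    "https://localhost:9200/my-files/_doc",   # index API endpoint (placeholder host)
    json=doc,
    auth=("elastic", "changeme"),             # placeholder credentials
    verify=False,                             # only acceptable for a local test cluster
)
resp.raise_for_status()
print(resp.json())  # Elasticsearch returns the generated _id, shard info, etc.
```

Every ingest tool (Logstash, Agent, connectors, custom scripts) ultimately does some variant of this under the hood.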
(btw, the word "document" usually has a more specific meaning in Elasticsearch, similar to a record in a database.)
Out of curiosity, precisely why are you ingesting them into Elasticsearch? Be as specific as possible in your answer; e.g. "so I can search them" isn't the sort of detail I'm looking for.
And as @RainTown asked, how and what you want to do with your data is super important, because you will want to come up with an "Information Architecture". Depending on the data categories, that will include:
Index / data stream strategy (how you want to partition your data: number and size of shards, etc., and how you want to manage it)
Whether the data is static or updated
How you want to access and search your data
Mapping / schema (a small sketch of this follows below)
Index Lifecycle Management
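For the mapping / schema piece, here is a minimal sketch of creating an index with explicit settings and a mapping over the REST API. The cluster URL, credentials, index name, and field names are placeholder assumptions for illustration, not a recommendation for your data:

```python
# Sketch: create an index with explicit settings and a mapping (illustrative names only).
import requests

index_body = {
    "settings": {
        "number_of_shards": 1,     # partitioning decision
        "number_of_replicas": 1,   # redundancy decision
    },
    "mappings": {
        "properties": {
            "file_name": {"type": "keyword"},
            "created_at": {"type": "date"},
            "content": {"type": "text"},
        }
    },
}

resp = requests.put(
    "https://localhost:9200/my-files",   # placeholder cluster URL and index name
    json=index_body,
    auth=("elastic", "changeme"),        # placeholder credentials
    verify=False,
)
print(resp.json())
```

Whether you use one index, many indices, or data streams, and how many shards/replicas each gets, should fall out of the information-architecture questions above.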
Then there is the ingest architecture (I suspect you will need a couple of tools/approaches):
Whether you are using Elastic Agent for things like Twitter feeds
When you say images, what do you mean: do you want to search the metadata of the images, or do you want to "vectorize" the image data for search?
For PDFs, you will need the connectors (OOTB) or FSCrawler, etc., and will you do simple search, semantic search, etc.? (A rough do-it-yourself sketch follows below.)
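If, for a quick test, you wanted to hand-roll something instead of the connectors or FSCrawler, a rough sketch might look like the following. pypdf and every name/value here are my assumptions for illustration; the real connectors/FSCrawler handle metadata, OCR, scheduling, retries, and much more:

```python
# Sketch: extract text from one local PDF and index it (illustrative only;
# pypdf, the index name, and credentials are assumptions for this example).
import requests
from pypdf import PdfReader

reader = PdfReader("sample.pdf")
text = "\n".join(page.extract_text() or "" for page in reader.pages)

requests.post(
    "https://localhost:9200/my-files/_doc",  # placeholder cluster URL / index
    json={"file_name": "sample.pdf", "content": text},
    auth=("elastic", "changeme"),
    verify=False,
)
```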
This is a pretty substantial undertaking....
What I would suggest is to pick one of the simpler sources and then have a bit deeper conversation so you can learn along the way... trying to do this all at once... probably not a great plan.
A very useful and diplomatic answer from @stephenb
[ If you had asked this 10 years ago, a decent answer to your question would be "get yourself a Google Search Appliance" and simply point it at your files. Google don't sell them anymore, partially because tools like Elasticsearch sort of took their market and enhanced the capabilities, but the appliance was really easy to deploy and use. You have a lot more work to do. ]
I think you also have maybe partially misunderstood how Elasticsearch works, given the original "use ntfs-3g to copy the data into linux and then change the data path in the elasticsearch.yml" idea. That, er, would not have worked. It's pretty important you understand why not, as a first priority. Hint: assuming you have some Elasticsearch/Kibana installation somewhere now, and I hope you have, to start learning, take a look at the files under your path.data directory tree now. You might be surprised.
As well as what Stephen wrote, just on the data volume: your 30 TB of data might correspond to a lot less or a lot more when indexed into Elasticsearch. If, e.g., you only want to index the metadata of the images, that's going to be a lot less data than the image files themselves. Maybe for PDFs too (depends on the content).
As to storage generally, you likely want at least one replica of each "document" (in practice, replica shards), for standard data protection and availability reasons. It also helps with search. A nice 3- or 5-node cluster, with primary and replica shards spread across the N data nodes, is a fairly standard architecture. Also, the time taken to index all your files will be pretty significant; I'd be surprised if it completed in even a few days, given you will likely have to try something, check, test, improve, and iterate around this loop a few times.
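On indexing time specifically, at this volume you would normally batch documents through the _bulk API rather than send one request per file. A rough sketch, with placeholder values throughout:

```python
# Sketch: send documents to the _bulk API in batches (placeholder values throughout).
import json
import requests

docs = [
    {"file_name": f"file_{i}.txt", "content": "..."}  # stand-in documents
    for i in range(1000)
]

lines = []
for doc in docs:
    lines.append(json.dumps({"index": {"_index": "my-files"}}))  # action line
    lines.append(json.dumps(doc))                                # source line
body = "\n".join(lines) + "\n"   # _bulk requires newline-delimited JSON with a trailing newline

resp = requests.post(
    "https://localhost:9200/_bulk",                       # placeholder cluster URL
    data=body,
    headers={"Content-Type": "application/x-ndjson"},
    auth=("elastic", "changeme"),
    verify=False,
)
print(resp.json()["errors"])  # True if any item in the batch failed
```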
@RainTown and @stephenb, thank you for your candor. Ok, the best way to eat an elephant is one bite at a time. What would be the simplest type of data to ingest? For testing's sake, I want to consider the data static.
Yes, candor, very politely put - thank you for this.
My own first contact with elasticsearch/logstash/kibana was with tweets (using Twitter API) some years ago. It's a mix of text (the tweets) and numbers (hits, likes, followers, retweets, etc), so ideal for learning the tools, visualizations, etc.
Anything that is "just text" is probably easiest.
But again, it depends what you eventually want to do with your data. If it's just simple searching of a bucket of personal data, for personal use, and you want to find any documents that mention Paris, pictures taken in Paris, or tweets that mention Paris, then Elasticsearch is maybe over-engineering a bit, though it's certainly a nice toy project. We simply don't know enough about your use case.