We are currently storing the source for each document we index in Elasticsearch. Is it possible to get the raw source of each document as a file from Elasticsearch? How is source data stored on disk compared to indices?
Ideally, we want a way to take the data directly off the disk and push it to another server rather than having to query Elasticsearch for all the documents. The reason is that we want to run a job every day that pushes the source (the raw request) for the past 24 hours to more permanent storage. We only index 24 hours' worth of data in Elasticsearch, but we need the source (which we update 2-3 times while it's in Elasticsearch) for a longer duration for some machine learning jobs.
Nope, I'm afraid there's no way to do that at the moment.
_source is stored in a Lucene "stored field", and lives with the inverted index in the segment files. There's no way to extract it without using something like a scroll to iterate over all the docs.
That said, iterating over all the docs with a big scroll should be relatively quick, since it's optimized for bulk exporting data from the segments.
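For what it's worth, here is a minimal sketch of what such a scroll-based export could look like in Python, using only the standard library. It assumes the REST scroll endpoints as they exist in recent Elasticsearch versions (`_search?scroll=...`, `_search/scroll`); older releases may need a different request shape. The function names (`http_json`, `iter_sources`, `dump_sources`), the host, and the index name are made up for illustration.

```python
import json
import urllib.request


def http_json(method, url, body=None):
    """Send a JSON request and decode the JSON response (stdlib only)."""
    data = json.dumps(body).encode("utf-8") if body is not None else None
    req = urllib.request.Request(
        url,
        data=data,
        method=method,
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req) as resp:
        return json.loads(resp.read().decode("utf-8"))


def iter_sources(base_url, index, query=None, page_size=1000,
                 keep_alive="1m", send=http_json):
    """Yield the _source of every matching document via the scroll API.

    `send` is injectable so the paging logic can be exercised without a cluster.
    """
    page = send(
        "POST",
        f"{base_url}/{index}/_search?scroll={keep_alive}",
        {
            "size": page_size,
            # Sorting by _doc avoids scoring, which suits a full export.
            "sort": ["_doc"],
            "query": query or {"match_all": {}},
        },
    )
    scroll_id = page.get("_scroll_id")
    try:
        while page["hits"]["hits"]:
            for hit in page["hits"]["hits"]:
                yield hit["_source"]
            # Ask for the next page; the scroll id may change between pages.
            page = send(
                "POST",
                f"{base_url}/_search/scroll",
                {"scroll": keep_alive, "scroll_id": scroll_id},
            )
            scroll_id = page.get("_scroll_id", scroll_id)
    finally:
        # Release the server-side scroll context once we are done.
        if scroll_id:
            send("DELETE", f"{base_url}/_search/scroll", {"scroll_id": scroll_id})


def dump_sources(sources, path):
    """Write one JSON document per line and return the number written."""
    count = 0
    with open(path, "w", encoding="utf-8") as out:
        for src in sources:
            out.write(json.dumps(src) + "\n")
            count += 1
    return count
```

A daily job could then call something like `dump_sources(iter_sources("http://localhost:9200", "my-index"), "sources.ndjson")` and ship the resulting file to permanent storage. Because documents are streamed one page at a time, memory use stays bounded by `page_size` rather than by the size of the index.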