Currently, CSV file format is supported and we’ll be adding support for JSON soon.
I know "soon" doesn't mean "the next day", but after 17 months I checked again and still can't ingest JSON.
I have TBs of data streaming into my GCS buckets and couldn't find anything able to ingest that data without crashing or duplicating events. Currently I'm downloading the files in batches with a script and having Filebeat forward the written events to my MQ. Honestly, in 2023 it seems strange to me that I can't ingest large volumes of data (50-90K events/s) in JSON (the format all Elastic logs are written in) from Google's storage (where I can even store ES snapshots) into Elasticsearch.
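For reference, a minimal sketch of that batch-download workaround, assuming the `google-cloud-storage` client library and application-default credentials; the bucket name, prefix, and spool path are placeholders:

```python
# Sketch: list objects in a GCS bucket and download them in fixed-size batches
# to a local spool directory that Filebeat tails. Placeholders throughout.
from itertools import islice
from pathlib import Path

def batched(iterable, size):
    """Yield lists of up to `size` items from `iterable`."""
    it = iter(iterable)
    while batch := list(islice(it, size)):
        yield batch

def download_batches(bucket_name, prefix, spool_dir, batch_size=100):
    # Assumes google-cloud-storage is installed and authenticated
    # (e.g. via GOOGLE_APPLICATION_CREDENTIALS).
    from google.cloud import storage
    client = storage.Client()
    spool = Path(spool_dir)
    spool.mkdir(parents=True, exist_ok=True)
    for batch in batched(client.list_blobs(bucket_name, prefix=prefix), batch_size):
        for blob in batch:
            target = spool / blob.name.replace("/", "_")
            if not target.exists():  # crude dedup: skip files already downloaded
                blob.download_to_filename(str(target))

# Example invocation (placeholders):
# download_batches("my-log-bucket", "logs/", "/var/spool/gcs-logs")
```

This does nothing about crash recovery mid-file or ordering; it only avoids re-downloading files that already exist in the spool directory.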
The questions:
- Is there even a plan to make it possible to stream JSON files to ES?
- Is there any way to read hundreds of files per minute and forward the results to an MQ or Elasticsearch?
Currently only JSON and NDJSON are supported object/file formats. Objects/files may also be gzip compressed. "JSON credential keys" and "credential files" are supported authentication types. If an array is present as the root object of an object/file, it is automatically split into individual objects and processed. If a download for a file/object fails or gets interrupted, the download is retried twice. This is currently not user-configurable.
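A sketch of what the beta GCS input configuration looks like, based on the Filebeat 8.x documentation; the project ID, credentials path, and bucket name are placeholders, and field names should be checked against the docs for your version:

```yaml
filebeat.inputs:
  - type: gcs
    id: my-gcs-input                       # placeholder input ID
    project_id: my-project                 # placeholder GCP project
    auth.credentials_file.path: /etc/filebeat/gcs-sa.json  # service-account key (placeholder)
    buckets:
      - name: my-log-bucket                # placeholder bucket
        max_workers: 3
        poll: true
        poll_interval: 15s
```

Per the quoted docs, this would only handle JSON/NDJSON objects (optionally gzipped), not CSV or other formats.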
Elastic can certainly scale to your ingest EPS, but yes, the ingest layer needs to scale as well.
You can run different Beats against different buckets with more workers; perhaps that will help.
Just for information, in AWS we often see this pattern:
S3 -> SQS/SNS -> many Filebeats -> Elasticsearch
Filebeat knows how to pull the SNS message, which contains the file to go get from S3. It does not look like we have the same pattern in GCP yet, or perhaps I am missing that too.
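For comparison, that AWS pattern maps to Filebeat's `aws-s3` input in SQS notification mode, roughly as documented; the queue URL below is a placeholder:

```yaml
filebeat.inputs:
  - type: aws-s3
    # SQS queue receiving S3 event notifications (placeholder URL)
    queue_url: https://sqs.us-east-1.amazonaws.com/123456789012/my-queue
    visibility_timeout: 300s
```

Each SQS message points Filebeat at an object to fetch, so multiple Filebeat instances can share one queue and scale horizontally without duplicating files.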
I'll check out whether a newer version of Filebeat can handle what I need, but in the past it failed due to the huge number of files.
As I see it, for the Agent integration (on a self-managed ES cluster) I'd need to set up Fleet. I checked the list of integrations and couldn't find this beta one (I did enable displaying beta integrations).