Ingest data directly from Google Cloud Storage into Elastic using Google

Hi,

I was reading this article and read this line:

Currently, CSV file format is supported and we’ll be adding support for JSON soon.

I know that doesn't mean "the next day", but 17 months later I checked and still can't ingest JSON.

I have TBs of data streaming into my GCS buckets and couldn't find anything able to ingest that data without crashing or duplicating events. Currently, I'm downloading the files in batches with a script and having Filebeat forward the written events to my MQ. TBH, in 2023 it's weird to me that I can't ingest large volumes of data (50-90K events/s) in JSON (e.g., the format in which all Elastic logs are written) from Google's storage (where I can store ES snapshots) into Elasticsearch.
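For context, the batch-download workaround can be sketched roughly like this (all names are hypothetical, not from this thread; the download half assumes the `google-cloud-storage` package, and the `processed` set stands in for whatever state tracking prevents re-downloading the same files):

```python
import os

def next_batch(object_names, processed, batch_size=100):
    """Pick up to batch_size object names that haven't been handled yet,
    so re-running the script doesn't re-download (and duplicate) files."""
    fresh = [name for name in object_names if name not in processed]
    return fresh[:batch_size]

def download_batch(bucket_name, prefix, out_dir, processed, batch_size=100):
    """Download one batch of GCS objects into out_dir for Filebeat to tail."""
    # Imported here so the dedup logic above works without the package.
    from google.cloud import storage  # pip install google-cloud-storage

    client = storage.Client()  # uses GOOGLE_APPLICATION_CREDENTIALS
    names = [b.name for b in client.list_blobs(bucket_name, prefix=prefix)]
    batch = next_batch(names, processed, batch_size)
    bucket = client.bucket(bucket_name)
    for name in batch:
        dest = os.path.join(out_dir, name.replace("/", "_"))
        bucket.blob(name).download_to_filename(dest)
        processed.add(name)  # persist this set across runs to avoid duplicates
    return batch
```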

The questions:

  • Is there even a plan to make it possible to stream JSON files to ES?
  • Is there any way to read hundreds of files per minute and forward the results to an MQ or Elasticsearch?

Hi @YvorL

Agreed this should be easier...

Perhaps I am looking in the wrong place.

Or the Agent integration.

Yes, still beta ... sigh...

Currently only JSON and NDJSON are supported object/file formats. Objects/files may also be gzip compressed. "JSON credential keys" and "credential files" are supported authentication types. If an array is present as the root object of an object/file, it is automatically split into individual objects and processed. If a download for a file/object fails or gets interrupted, it is retried 2 times. This is currently not user configurable.

Elastic can certainly scale to your ingest EPS, but yes, the ingest side needs to scale as well.

You can run different Beats with different buckets and workers; perhaps that will help.
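As a sketch of what that might look like with the Filebeat `gcs` input (beta in 8.x): one Filebeat instance can poll several buckets, each with its own worker pool. All project, bucket, and path names below are placeholders.

```yaml
filebeat.inputs:
  - type: gcs
    project_id: my-project-id
    auth.credentials_file.path: /etc/filebeat/gcs-sa.json
    buckets:
      - name: bucket-a
        max_workers: 10       # parallel object downloads for this bucket
        poll: true
        poll_interval: 1m
      - name: bucket-b
        max_workers: 10
        poll: true
        poll_interval: 1m
```

To spread load further, you could run separate Filebeat instances, each configured with a disjoint set of buckets.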

Just for information in AWS we often see this pattern
S3 -> SQS/SNS -> many Filebeats -> Elasticsearch, where Filebeat knows how to pull the SQS message containing the reference to the file to go get from S3. It does not look like we have the same pattern in GCP yet, or perhaps I am missing that too.
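For reference, the AWS side of that pattern uses the Filebeat `aws-s3` input pointed at an SQS queue that receives the bucket notifications (queue URL below is a placeholder):

```yaml
filebeat.inputs:
  - type: aws-s3
    # Each SQS message tells Filebeat which S3 object to fetch.
    queue_url: https://sqs.us-east-1.amazonaws.com/123456789012/my-queue
    visibility_timeout: 300s
```

Because many Filebeat instances can consume from the same queue, this fan-out is what lets ingest scale horizontally.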


Unfortunately, these are out of my hands:

  • number of JSON files (hundreds every minute)
  • target cloud (GCP)
  • format (JSON)

I'll check whether a newer version of Filebeat can handle what I need, but in the past it failed due to the huge number of files.
As I see it, for the Agent integration (for a self-managed ES cluster) I'd need to set up Fleet. I checked the list of integrations and couldn't find this beta one (I did allow displaying beta integrations).


Which version of Elasticsearch/Kibana are you using? It looks like the integration requires 8.6.2+ to be usable.

Ahhh :smile: , it's 8.6.1 :smile: :smile:
I'll check that one too in this case :smile:

Thanks!

I upgraded to 8.6.2 (ES, Kibana), but I still don't see the integration :frowning:

Hmm, it looks like it wasn't categorized under Google; try just searching for the word "custom" and see if it shows up.

@YvorL

I still see this but am just not sure how to use it... I will try to take a look later today / tomorrow. I asked internally for some guidance.

Looks like that PR was just merged, so it should be in the next release (major or minor); I am trying to get some guidance.

I am checking in 8.6.2, which I would expect it to be in, but I think it is missing.

This is the link where the PR says the package description should be listed... it is not:

https://epr.elastic.co/search?package=google_cloud_storage

If we check another package, it is there:

https://epr.elastic.co/search?package=gcp_pubsub

I have asked internally

@YvorL Found it, and yes it is beta, but this is a brand-new integration.

You have to select Beta Integrations; then you will see it.

With respect to GA, it will most likely be a couple of releases as it also uses the new dynamic ECS template, which is also beta.

This topic was automatically closed 28 days after the last reply. New replies are no longer allowed.