Currently, CSV file format is supported and we’ll be adding support for JSON soon.
I know "soon" doesn't mean "the next day", but after 17 months I checked again and still can't ingest JSON.
I have TBs of data streaming into my GCS buckets and couldn't find anything able to ingest that data without crashing or duplicating events. Currently I'm downloading the files in batches with a script and having Filebeat forward the written events to my MQ. Honestly, in 2023 it seems strange to me that I can't ingest large volumes of data (50-90K events/s) in JSON (the format all Elastic logs are written in) from Google's storage (where I can even store ES snapshots) into Elasticsearch.
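For reference, a minimal sketch of that batch-download workaround, assuming the `google-cloud-storage` client library and application-default credentials; the bucket name, prefix, and spool path are placeholders:

```python
# Sketch: list objects in a GCS bucket and download them in fixed-size batches
# to a local spool directory that Filebeat tails. Placeholders throughout.
from itertools import islice
from pathlib import Path

def batched(iterable, size):
    """Yield lists of up to `size` items from `iterable`."""
    it = iter(iterable)
    while batch := list(islice(it, size)):
        yield batch

def download_batches(bucket_name, prefix, spool_dir, batch_size=100):
    # Assumes google-cloud-storage is installed and authenticated
    # (e.g. via GOOGLE_APPLICATION_CREDENTIALS).
    from google.cloud import storage
    client = storage.Client()
    spool = Path(spool_dir)
    spool.mkdir(parents=True, exist_ok=True)
    for batch in batched(client.list_blobs(bucket_name, prefix=prefix), batch_size):
        for blob in batch:
            target = spool / blob.name.replace("/", "_")
            if not target.exists():  # crude dedup: skip files already downloaded
                blob.download_to_filename(str(target))

# Example invocation (placeholders):
# download_batches("my-log-bucket", "logs/", "/var/spool/gcs-logs")
```

This does nothing about crash recovery mid-file or ordering; it only avoids re-downloading files that already exist in the spool directory.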
The questions:
- Is there even a plan to make it possible to stream JSON files to ES?
- Is there any way to read hundreds of files per minute and forward the results to an MQ or Elasticsearch?
Currently only JSON and NDJSON are supported object/file formats. Objects/files may also be gzip compressed. "JSON credential keys" and "credential files" are supported authentication types. If an array is present as the root object of an object/file, it is automatically split into individual objects and processed. If a download for a file/object fails or gets interrupted, the download is retried twice. This is currently not user-configurable.
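A sketch of what the beta GCS input configuration looks like, based on the Filebeat 8.x documentation; the project ID, credentials path, and bucket name are placeholders, and field names should be checked against the docs for your version:

```yaml
filebeat.inputs:
  - type: gcs
    id: my-gcs-input                       # placeholder input ID
    project_id: my-project                 # placeholder GCP project
    auth.credentials_file.path: /etc/filebeat/gcs-sa.json  # service-account key (placeholder)
    buckets:
      - name: my-log-bucket                # placeholder bucket
        max_workers: 3
        poll: true
        poll_interval: 15s
```

Per the quoted docs, this would only handle JSON/NDJSON objects (optionally gzipped), not CSV or other formats.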
Elastic can certainly scale to your ingest EPS, but yes, the ingest layer needs to scale as well.
You can run different Beats against different buckets with more workers; perhaps that will help.
Just for information, in AWS we often see this pattern:
S3 -> SQS/SNS -> many Filebeats -> Elasticsearch
Filebeat knows how to pull the SNS message, which contains the file to go get from S3. It does not look like we have the same pattern in GCP yet, or perhaps I am missing that too.
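For comparison, that AWS pattern maps to Filebeat's `aws-s3` input in SQS notification mode, roughly as documented; the queue URL below is a placeholder:

```yaml
filebeat.inputs:
  - type: aws-s3
    # SQS queue receiving S3 event notifications (placeholder URL)
    queue_url: https://sqs.us-east-1.amazonaws.com/123456789012/my-queue
    visibility_timeout: 300s
```

Each SQS message points Filebeat at an object to fetch, so multiple Filebeat instances can share one queue and scale horizontally without duplicating files.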
I'll check out whether a newer version of Filebeat can handle what I need, but in the past it failed due to the huge number of files.
As I see it, for the Agent integration (on a self-managed ES cluster) I'd need to set up Fleet. I checked the list of integrations and couldn't find this beta one (I did enable displaying beta integrations).