Snapshot and Restore in a readable format

Hi all,

I have configured an S3 bucket to push my snapshots to, following this guide:

https://www.elastic.co/guide/en/cloud/current/ec-aws-custom-repository.html

Is it possible, out of the box, to push the indices in a readable format such as CSV or JSON?

I am using Elastic Cloud, FYI.

Any help/discussion is appreciated. I have done some research into this and cannot find a definitive answer.

Jason

No, it's not. The goal of that API is to make backups in the most suitable format possible for restoring them, not to extract search responses as CSV or JSON.
You'd be better off using Logstash for that.
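For example, something like this Logstash pipeline would read an index back out and write it to a file as JSON lines. It's just a sketch; hosts, credentials and the index name are placeholders you would need to adapt:

input {
  elasticsearch {
    hosts    => ["https://your-cluster.example.com:9243"]
    user     => "elastic"
    password => "changeme"
    index    => "my_index"
    query    => '{ "query": { "match_all": {} } }'
  }
}

output {
  # one JSON document per line, easy to read or post-process into CSV
  file {
    path  => "/tmp/my_index_export.json"
    codec => json_lines
  }
}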

Hi,

Thanks for the response.

Can I use Logstash to grab specific indices from Elasticsearch on a regular basis, the way snapshot/restore works?

Any documentation for this?

Jason

Yes and no.

If you have a timestamp within your documents, you can use a query to filter the data you want to extract.
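For example, assuming your documents have an @timestamp field (adjust the field and index names to your own mapping), a range query like this would pull back only the documents older than three days:

GET my_index/_search
{
  "query": {
    "range": {
      "@timestamp": {
        "lt": "now-3d/d"
      }
    }
  }
}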

I'm curious about the use case though. Why do you need this?

I'd rather not state the specific use case. However, each customer is split by tenant, and each tenant has its own index, named with a tenantId and a dateTimestamp.

Example Index: tid.555444333.2019-05-13

Each tenant will want hot storage for their logs for a different length of time... So if it is 3 days, on the 4th day we will pipe their first day's worth of logs into cold storage. I want it in a readable format so it can be sent to them on request.

Any ideas?

Can I query based on tenant and a timestamp, and grab all the logs older than currentTime - 3 days (in this example)?

Then pipe those to cold storage?

GET /index/type/_mget
{
  "ids" : ["631stWoBFYQ5mT065i6N", "6n1stWoBFAQ5mT065i6N"]
}

This works for what I need to do. However, do you know if there is a wildcard substitute for "ids"?

We don't externally store the _id field for each event, and as it is a random string there is no way for us to automate this...

Hard to know for sure, since you don't want to explain the use case fully, and I don't know how much performance you're willing to sacrifice to get the stuff back out of ES in a somewhat weird way.

Do consider that most of the time it makes a lot more sense to archive the raw events in readable formats at their point of entry, compared to getting them back out of Elasticsearch through search queries, full scan-dumps, or backups.
By that I mean something like Logstash duplicating everything and shipping one copy into cold storage and the other copy into ES. Or whatever is receiving the events in the first place.
A nice example of that I heard recently was a presenter at Elastic{ON} in SFO explaining how their Logstash infrastructure duplicated everything, sending one copy to AWS S3 and the other copy into ES.
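In Logstash terms, the fan-out is just one pipeline with two outputs. A rough sketch (bucket name, hosts and the tenantId field are made up for illustration):

input {
  beats {
    port => 5044
  }
}

output {
  # copy 1: raw, readable archive in S3 (one JSON document per line)
  s3 {
    bucket => "my-log-archive"
    region => "us-east-1"
    codec  => json_lines
  }
  # copy 2: indexed into Elasticsearch for search
  elasticsearch {
    hosts => ["https://my-cluster.example.com:9243"]
    index => "tid.%{tenantId}.%{+YYYY-MM-dd}"
  }
}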

Although I would strongly advise against re-extracting everything from ES the way you seem to want to do it, technically it looks like you're looking for a tool I sometimes use: elasticdump.

Martin


@martinr_ubi

That is exactly what I was looking for.

I will look into archiving them beforehand; however, I don't handle that side of the log stream.

Although, won't running multiple outputs pre-ES create performance issues with the maximum number of events/second that we can achieve?

You're not giving any specific arguments for why that would be the case, so it's hard to argue anything without potentially putting my foot in my mouth. That being said: if it's done correctly, no.

Imagine a correctly designed, duplicating, fanning-out Logstash infrastructure with on-disk (fast disk) buffer queues or in-memory buffer queues, where one side of the fan-out leads to a Logstash cluster that stores events in files on "storage" directly... and one side that leads to a Logstash cluster that indexes events into an Elasticsearch cluster.
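By disk buffer queues I mean Logstash persistent queues, by the way; that's just a couple of settings in logstash.yml (the values here are placeholders):

# logstash.yml
queue.type: persisted            # buffer events on local disk instead of in memory
queue.max_bytes: 8gb             # back-pressure kicks in once the queue reaches this size
path.queue: /fast-disk/logstash-queue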

Which do you think costs more in terms of performance, or is slower: indexing into ES, or dumping to files on storage in a scalable fashion?

I believe that if you do it by ingesting into ES and then reading it all back out through queries, that is by definition slower and more costly than diverting a copy directly to storage.

So it depends on how you do it, and there are ways to do it very badly, but with the right setup I think you would create more performance issues by doing backups via re-extraction than by doing a proper fan-out to storage of the original events.
Naturally, just writing an event to disk is faster than indexing it into ES in almost all real-world use cases. So I have no reason to assume your storage flow would be slower than your ES flow, or that it would "slow" your events/sec performance... if done correctly.


Hi Martin,

Thank you for taking time to discuss/reply to me :smile:

I will put this forward to our devs and work out the best solution... It makes more sense to do it your way, given the eventual multi-cluster, multi-tenant environment that we will have, and especially given the climbing number of "re-extractions" we would have to do daily to support the process.

The aim is to reduce cost in ES by reducing the time that the logs stay in hot storage, and therefore reducing the total size of the cluster we require to store the logs.

I have seen this: https://www.elastic.co/blog/implementing-hot-warm-cold-in-elasticsearch-with-index-lifecycle-management

However, I'm not sure it would be as effective as the solution you proposed, both in terms of cost and in terms of capabilities.
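As far as I understand it, the ILM route would just be a policy along these lines (a rough sketch, with placeholder phase ages), which ages indices out but doesn't give me a readable copy to hand to the tenant:

PUT _ilm/policy/tenant-logs
{
  "policy": {
    "phases": {
      "hot": {
        "actions": {}
      },
      "delete": {
        "min_age": "3d",
        "actions": {
          "delete": {}
        }
      }
    }
  }
}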

Hi Martin,

I'm experimenting with the tool, but I am getting "security_exception" errors - I am assuming this is because I left out the auth for ES. Any idea where I pass in a username and password for ES? I don't see any placeholders in the "elasticdump" file for them.

I tried:

elasticdump --input https://elastic:password@esaddress:9243 --output=/tmp/my_index_mapping.json

Full error:

Wed, 15 May 2019 10:38:58 GMT | Error Emitted => {"error":{"root_cause":[{"type":"security_exception","reason":"action [indices:data/read/search] requires authentication","header":{"WWW-Authenticate":"Basic realm=\"security\" charset=\"UTF-8\""}}],"type":"security_exception","reason":"action [indices:data/read/search] requires authentication","header":{"WWW-Authenticate":"Basic realm=\"security\" charset=\"UTF-8\""}},"status":401}
Wed, 15 May 2019 10:38:58 GMT | Total Writes: 0
Wed, 15 May 2019 10:38:58 GMT | dump ended with error (get phase) => Error: {"error":{"root_cause":[{"type":"security_exception","reason":"action [indices:data/read/search] requires authentication","header":{"WWW-Authenticate":"Basic realm=\"security\" charset=\"UTF-8\""}}],"type":"security_exception","reason":"action [indices:data/read/search] requires authentication","header":{"WWW-Authenticate":"Basic realm=\"security\" charset=\"UTF-8\""}},"status":401}

Last time I played with it, it was 3.3.7, it seems.
I was calling it like this (I use a Docker container, but that makes no difference at all):

elasticdump \
  --input=https://user:password@production.es.com:9200/my_index \
  --output=/data/my_index.json \
  --type=data

I do have security enabled, since it's licensed, and basic auth was working just fine for me.
It's essentially the same as trying to hit https://user:password@production.es.com:9200/_cluster/health from your browser.
Does that work for you? I mean, it doesn't if you try without the user and password from an incognito Chrome window... and it does work if you enter the user and password?

The only thing I see is that you're missing an equals sign after --input... I'm not sure if it supports both forms or not, and you have no index in your input URL.

You're also on Elastic Cloud and I'm not; I'm not sure if that changes anything, but I would bet it doesn't.

Do you have strange characters in your user or password that are messing up the shell?
Sorry, I don't know how else to help you; it works for me.
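As an aside, if you ever need to dump only a slice of an index (say, one tenant's documents older than three days) rather than the whole thing, elasticdump also accepts a query via --searchBody. Something like this should work, although I haven't tried it against Elastic Cloud:

elasticdump \
  --input=https://user:password@production.es.com:9200/my_index \
  --output=/data/my_index_old.json \
  --type=data \
  --searchBody='{"query":{"range":{"@timestamp":{"lt":"now-3d/d"}}}}'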

Hi Martin,

I wrote a detailed response to this explaining what I did, and what I was going to do to try to fix it... :sweat_smile:

But the usual "uninstall, reinstall, and try again" tactic worked perfectly... :laughing:

Thank you for taking the time to run through this with me and for showing me the tool; it works like a dream.

Regards,

Jason

This topic was automatically closed 28 days after the last reply. New replies are no longer allowed.