Best option to migrate data from old cluster to new

Hi Everyone,

I've also posted this exact same question on the Graylog forum - I'm just hedging my bets a bit by posting here as well, since my issue is more ES than Graylog, I guess :slight_smile:

I have a new cluster and an old cluster, and I'm using the reindex API to bring old documents into a new index on our new server. This works, and I can see 20M+ documents under management in the index once the reindex command finishes.

{"took":6156976,"timed_out":false,"total":20000085,"updated":0,"created":20000085,"deleted":0,"batches":20001,"version_conflicts":0,"noops":0,"retries":{"bulk":0,"search":0},"throttled_millis":0,"requests_per_second":-1.0,"throttled_until_millis":0,"failures":[]}

However, I can only see those messages if I find out the gl2_source_input and search for that:

gl2_source_input : 5a58389f21394d0e92c8f4fd

The messages also report that they have been "Received by deleted input on stopped node", which I understand, because that is exactly what is happening. However, I thought that a reindex updated this data as it was imported.

How do I get this data to be searchable via the usual front-end search and not resorting to a search on gl2_source_input?

Is the re-index API the best way to migrate this old data to a new cluster?

I've seen various posts about doing a snapshot and restore, but it's a bit beyond me at the moment and I'm struggling to get my head around it all. Any advice very much appreciated.

New setup:
Running 2 graylog nodes, Clustered MongoDB (3 hosts) and Clustered Elasticsearch (3 hosts)

Old setup:
Running 2 graylog nodes, MongoDB on Graylog Primary, and Clustered Elasticsearch (3 hosts)

I can still reach the old setup and I have enabled the "whitelist remote" option in order to use the API in the first place.

Thank you.

Archie.

I am wondering if it's something simple like not having performed a refresh yet?

POST /_refresh
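For example, via curl (placeholder host, in the same style as the rest of this thread):

curl -X POST "http://<LOCAL IP>:9200/_refresh"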

Could you share the exact query that is returning a document, along with the result, and a corresponding query that should return the same document but isn't?

Also what are the versions of Elasticsearch involved?

It sounds like a good way to do this, yes.

Hi @DavidTurner - Thank you for the reply.

I did a refresh, and it returned the following:

{"_shards":{"total":208,"successful":208,"failed":0}}

However, nothing changed, but...

The more I read and look at answers on the forums, the more I suspect my understanding of how the indices and streams work is flawed. When "fresh" data is diverted via a stream, I can click on that stream and view the messages, drilled down to just that stream. My understanding was that this should also be possible when a reindex is done, because the stream in question is using the same index - however, that is not the case.

I can still see the data I have reindexed if I do a search from the main console, outlined below:

From the main search console I can search for source: 10.0.4.67
And the message is returned, telling me it is stored in the correct index, iamlog_0

I guess what is confusing me is that the "Received by" field is showing "deleted input on stopped node".

However, that is to be expected (I assume), because that is exactly what happened to this data. My understanding was that a reindex would kind of "reindex" the data to match the current deployment; however, it seems a reindex just fixes up the date ranges?

Is that correct? I can't find any examples on the Web of what to expect from a reindex, only how to perform one.

Thank you.

Here is the detailed data. The original post was getting a bit long, so I've split this out.

The reindex data is stored in iamlog_0

I can submit a curl request to view this index:

curl -H 'Content-Type: application/json' -X GET http://xx.xx.xx.xx:9200/iamlog_0/_search?pretty

Gives me:

"_index" : "iamlog_0",
        "_type" : "message",
        "_id" : "2bfbfed1-abdc-11e8-807b-005056a568da",   <-- Message ID
        "_score" : 1.0,
        "_source" : {
          "Action_Event_Name" : "QUERY_ACCOUNT_SECURITY_TOKEN",
          "Observer_Account_Id" : "0",
          "gl2_remote_ip" : "10.0.4.67",
          "Action_Event_CorrelationID" : "nmas#0#",
          "gl2_remote_port" : 41193,
          "Initiator_Account_Name" : "<REDACTED>",
               "Source" : "<REDACTED>",
          "Target_Account_Name" : "<REDACTED>",
          "Initiator_Entity_SysAddr" : "<REDACTED>",
          "Action_Outcome" : "0",
          "gl2_source_node" : "440b2d90-b30f-4aa7-be47-e91ac8237a49",   <--- OLD node
          "Observer_Account_Name" : "<REDACTED>",
          "Target_Account_Domain" : "<REDACTED>",
          "Observer_Entity_SysAddr" : "<REDACTED>",
          "timestamp" : "2018-08-29 22:37:55.389",
          "Action_ExtendedOutcome" : "0",
          "Observer_Entity_SvcName" : "nmas",
          "level" : -1,
          "streams" : [
            "5a58504021394d0e92c90e9c"     "source" : "10.0.4.67",
          "Observer_Account_Domain" : "<REDACTED>",
          "gl2_source_input" : "5a58389f21394d0e92c8f4fd",   <--- Old INPUT

Of course, "gl2_source_node" of 440b2d90-b30f-4aa7-be47-e91ac8237a49 is the old node UUID.

Our new node UUIDs end in 600e0904 and b3ee28fd.

I can also search on any of the values, including the gl2 values, without issue - but only from the main search console, not from within the stream that uses this index (iamlog_0).

That is where my understanding fails me :slight_smile:

Thank you.

I do not know what the "streams" you're referring to are.

Reindex basically just copies documents from one index to another. I don't know what "fixes up" means here. Can you share the sequence of commands you're using to create the target index and perform the reindex, for the sake of clarity? There are many options, and it'll be easier to describe what your specific commands are doing than to try to describe reindexing in general.

You lost me here. What's the main search console? What's "within the stream"? Can you find out what the queries going to Elasticsearch are in each case?

Thanks, @DavidTurner.

My language is referring to Graylog so that is likely why you are getting lost :slight_smile:

I am using an Elasticsearch cluster with Graylog, and the stream is the conditional forwarder to the given index (iamlog_0).


Sure, I am using the API to do the Reindex.

I've set reindex.remote.whitelist: ["xxx.xxx.xxx.xxx:9200"] in elasticsearch.yml.

Then the command I ran at the CLI:

curl -X POST "http://<LOCAL IP>:9200/_reindex" -H 'Content-Type: application/json' -d'
{
  "source": {
    "remote": {
      "host": "http://<REMOTE IP>:9200"
    },
    "index": "iamlog_0"     <-- Remote cluster index
  },
  "dest": {
    "index": "iamlog_0"     <-- Local index
  }
}'

This works well - No issues with that.
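As an aside, for a reindex of this size (the "took" above is roughly 100 minutes) it may be worth running it asynchronously rather than holding the connection open. A sketch, reusing the same placeholder hosts:

curl -X POST "http://<LOCAL IP>:9200/_reindex?wait_for_completion=false" -H 'Content-Type: application/json' -d'
{
  "source": {
    "remote": { "host": "http://<REMOTE IP>:9200" },
    "index": "iamlog_0"
  },
  "dest": { "index": "iamlog_0" }
}'

That returns a task ID, and progress can then be polled with:

curl "http://<LOCAL IP>:9200/_tasks/<TASK ID>?pretty"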

The main search console is the Graylog search console - again, I'm not after a Graylog 101, just trying to understand what a reindex does :slight_smile:


And this brings me back to my lack of understanding. I had assumed that a reindex would put data into a new index and give the imported data the UUID of the index it is being imported into - my reference to "fixes up". That does not happen, and if that is by design then all good. I just can't find a definitive answer as to what a reindex actually does between clusters.

In essence, let us say the first Elasticsearch cluster had failed and I now need to get all that data into a new cluster so that it can be referenced in the new cluster by Graylog searches. Is a reindex (as per the above command) the way to go?

As stated, the data is there and I can search for it without issue; I just want to know whether reindex is the best way to do this. I think coupling my question with Graylog has muddied the waters for you - my apologies.

Thanks for your patience :smiley:

Ok, this simply copies every document from the source to the destination as-is.

Each index does technically have a UUID, but it does not appear in any documents. The UUIDs to which you referred earlier in this thread are not related to the UUID of the index. Put differently, reindexing a document into a different index as you are doing doesn't change anything in the document.

However, one thing that reindex does not do is copy the mappings across. From the docs:

Reindex does not attempt to set up the destination index. It does not copy the settings of the source index. You should set up the destination index prior to running a _reindex action, including setting up mappings, shard counts, replicas, etc.

Did you create the destination index with the right mappings first?
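In outline, that would look something like this - a sketch only, with this thread's placeholder hosts, and with illustrative fields rather than your actual mappings (which should be copied from whatever the source index returns):

curl "http://<REMOTE IP>:9200/iamlog_0/_mapping?pretty"

curl -X PUT "http://<LOCAL IP>:9200/iamlog_0" -H 'Content-Type: application/json' -d'
{
  "mappings": {
    "message": {
      "properties": {
        "timestamp": { "type": "date", "format": "yyyy-MM-dd HH:mm:ss.SSS" },
        "source": { "type": "keyword" }
      }
    }
  }
}'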

Probably not in that case, because a reindex requires the source cluster not to have failed. That's what snapshot and restore is for.
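At a high level, snapshot and restore looks like this - a sketch, with an invented repository name and path; both clusters would need access to the same filesystem location, registered under path.repo in elasticsearch.yml:

# Register a repository on the old cluster and take a snapshot
curl -X PUT "http://<OLD IP>:9200/_snapshot/my_backup" -H 'Content-Type: application/json' -d'
{ "type": "fs", "settings": { "location": "/mnt/es_backups" } }'

curl -X PUT "http://<OLD IP>:9200/_snapshot/my_backup/snapshot_1?wait_for_completion=true"

# Register the same repository on the new cluster (read-only) and restore
curl -X PUT "http://<NEW IP>:9200/_snapshot/my_backup" -H 'Content-Type: application/json' -d'
{ "type": "fs", "settings": { "location": "/mnt/es_backups", "readonly": true } }'

curl -X POST "http://<NEW IP>:9200/_snapshot/my_backup/snapshot_1/_restore"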

Thank you - It is making sense now.

@DavidTurner - Can you explain what mappings are please?

The index on the destination exactly matched the index on the source: same shards, replicas...
I'm keen to understand what mappings are in case I've missed something along the way.

I have read the document here: Mapping | Elasticsearch Reference [5.4] | Elastic
However, I am still none the wiser... I'll be honest.

How do you determine what mappings are currently in a working index? Are they simply the fields that an input gets split into?

For example, fields like?

"facility" "message" "level" "source" "timestamp"

Thank you.

I think I have found the answer after some searching.

I can issue a GET request to http://xxx.xxx.xx.xxx:9200/<INDEX>/_mapping?pretty to see the mappings of a given index.

We don't have any non-default mappings defined and are relying on the index breaking the data into the correct fields via the dynamic mapping default.
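For reference, a dynamically mapped string field in 5.x typically comes back from _mapping looking something like this (illustrative output, not our actual mapping):

{
  "iamlog_0" : {
    "mappings" : {
      "message" : {
        "properties" : {
          "source" : {
            "type" : "text",
            "fields" : {
              "keyword" : { "type" : "keyword", "ignore_above" : 256 }
            }
          }
        }
      }
    }
  }
}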

I did notice that the fields on the imported data match the old data.

If I understand all this, then in summary:

  • Reindex API is just a copy of the data. Analogous to pouring water from one bucket into another.

  • I don't need to worry about specific mappings because my source and destination data are correctly mapped and identified via the "dynamic mapping" function.

  • So long as the destination index matches up on shards, replicas, etc. (basically a direct copy of the old index's settings), all should be well.

I'm a bit surprised that Graylog isn't defining any specific mappings on the original index. Elasticsearch does a pretty good job of guessing what each field might be, but it's only a guess, and applications will typically define mappings for at least some of their fields so they're properly searchable. Could you share the mappings of the source and destination indices here?

It copies the documents. But if the mappings are different then the documents will be indexed differently, so they will behave differently in searches - hence my questions about mappings.

With dynamic mapping, Elasticsearch will guess the type of each field, but this guess might not be what the application expects. It's not just that it finds all the fields (that's the easy bit); it's also important that each field is indexed correctly.

The numbers of shards and replicas don't really matter in a reindex. You can change the number of replicas at any time anyway, and in fact if you want to change the number of shards in an index then reindexing might be the right tool to use.
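For instance, reindexing locally into an index with a different shard count might look like this (iamlog_1 and the numbers here are invented for the example):

curl -X PUT "http://<LOCAL IP>:9200/iamlog_1" -H 'Content-Type: application/json' -d'
{ "settings": { "number_of_shards": 6, "number_of_replicas": 1 } }'

curl -X POST "http://<LOCAL IP>:9200/_reindex" -H 'Content-Type: application/json' -d'
{
  "source": { "index": "iamlog_0" },
  "dest": { "index": "iamlog_1" }
}'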

Thanks @DavidTurner.

I am actually prepping the new server for a migration so I won't have the mappings to show until we start that process.

When I do start the migration I'll try to use an old index for testing so I can show the mappings. I had to remove the old data - from the new server that I had been testing with - to ensure the migration is "fresh".

I'll follow up this thread once I've done all that. Thank you.
