How to reindex doc with multifield?

Hi,
I am writing a Python script to reindex our documents with new mappings on Elasticsearch 1.5.2 (before upgrading to 2.3). One of the changes was to the mapping of title:
"title":{"type":"string"}
that was changed to new mappings:
"title": {
"type": "string",
"fields": {
"english": {
"type": "string",
"analyzer": "buzzilla_english"
},
"arabic": {
"type": "string",
"analyzer": "buzzilla_arabic"
}
}
}

I tried to copy each doc and change it to the new mapping as follows:
old_doc = hit['_source']
new_doc = old_doc
title = new_doc['title']
language = "english"  # for the example
new_doc['title'] = {language:title }

When I run it and it tries to "put" the new document, I get:
{"error":"MapperParsingException[failed to parse [title]]; nested: ElasticsearchIllegalArgumentException[unknown property [english]]; ","status":400}

I also tried:
new_doc['title'] = {}
new_doc['title']['fields'] = {language:title }

But it also failed:
{"error":"MapperParsingException[failed to parse [parent_title]]; nested: ElasticsearchIllegalArgumentException[unknown property [fields]]; ","status":400}

What is the correct way to do it please?

Hi @Moshe_Sucaz,

it's much easier than that. You just do:

old_doc = hit['_source']
new_doc = old_doc

In other words, you don't need to alter the document. You just index the same document again based on the new mapping. You cannot explicitly index multi-fields. Elasticsearch adds them automatically, and you can use them in queries.
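
To make that concrete, here is a minimal Python sketch of the copy step: each search hit becomes an index action whose _source is passed through untouched. The function name to_index_action and the index name "target" are assumptions for illustration. The dict layout follows what the bulk helper of the official Python client accepts, but check it against the client version you use.

```python
import copy

def to_index_action(hit, target_index):
    """Build an index action that re-sends the hit's _source unchanged."""
    return {
        "_index": target_index,
        "_type": hit["_type"],
        "_id": hit["_id"],
        # Deep copy so later edits to the new doc never touch the old one.
        "_source": copy.deepcopy(hit["_source"]),
    }

# A hit shaped like one entry of a search response's hits.hits list.
hit = {
    "_index": "src",
    "_type": "titles",
    "_id": "1",
    "_source": {"title": "Hello World"},
}
action = to_index_action(hit, "target")
print(action["_source"])
```

Note that the title stays a plain string here. The multi-field subfields are derived from it by the mapping of the target index at index time.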

Daniel

Thanks Daniel, but this way the title is not getting the analyzer I want. Where I set language="english" for the example, I meant that I do some manipulation there:
The old doc has a language "_analyzer" field, as in Elasticsearch 1.5.2 and older versions. So for each doc I need to check its "_analyzer" and set the title (and some other fields that we use analyzers for) in the appropriate field. For example, if I read a doc in Hebrew, I want the new doc to have title.hebrew with the old title, analyzed with our buzzilla_hebrew analyzer. With the way you suggested, I see the title as a string (when opening with Sense: "title": "Hello world"), and not as an object, as I understand it should be: "title.hebrew": "Hello world"

Also, when searching in this way:
GET /[index name]/_search
{
"query": {
"multi_match": {
"type": "most_fields",
"query": "Hello world",
"fields": [ "title", "title.hebrew" ]
}
}
}

I do get results, but when searching with "fields": [ "title.hebrew" ], I don't get results.
How can I know that the doc will be analyzed with our buzzilla_hebrew analyzer when I search for Hebrew words in it?

Hi @Moshe_Sucaz,

so if I've understood you correctly, you currently have strings in different languages in the same field (e.g. for doc[0] the "title" field contains an English title, for doc[357] the "title" field contains an Italian title)?

If that is the case, then you should map this differently. See the chapter Getting Started with Languages in the Definitive Guide.

Daniel

Yes! That is exactly what I am trying to do. I did that as I understood it from the link you sent. On the next page of this link (https://www.elastic.co/guide/en/elasticsearch/guide/current/using-language-analyzers.html) I saw the use of multi-fields to index the title field twice: once with the english/hebrew/italian analyzer and once with the standard analyzer. But I still don't understand what I am doing wrong :confused:

Hi @Moshe_Sucaz,

Ok, then let's walk through a concrete example so we're on the same page:

Suppose you have two documents in your source index (let's call it /src):

doc[1].title = "Hello World"
doc[2].title = "Hallo erstmal"

Suppose you also know that the language of doc[1].title is English and the language of doc[2].title is German.

Now you create your target index (just in case: delete it if you run this example multiple times):

DELETE /target

PUT /target
{
   "mappings": {
      "titles": {
         "properties": {
            "title": {
               "type": "string",
               "fields": {
                  "english": {
                     "type": "string",
                     "analyzer": "english"
                  },
                  "german": {
                     "type": "string",
                     "analyzer": "german"
                  }
               }
            }
         }
      }
   }
}

You reindex the documents as is, i.e. you don't change anything:

PUT target/titles/1
{
  "title": "Hello World"
}

PUT target/titles/2
{
  "title": "Hallo erstmal"
}

If you need to remember that doc id 1 is in English and doc id 2 is in German, you need some kind of "language" field.

Note that the "title" field (of each document) is now indexed with three different analyzers:

  1. the "root" title field is analyzed with the "standard" analyzer
  2. the "english" subfield is analyzed with the "english" analyzer
  3. the "german" subfield is analyzed with the "german" analyzer

So all analyzers run for each document you'll index.
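
Since the reindex script is in Python anyway, the mapping above can also be generated instead of written by hand. The following is only a sketch: the helper name title_mapping is made up for illustration, and the type name "titles" and the built-in "english" and "german" analyzers are taken from the example.

```python
import json

def title_mapping(analyzers):
    """Return a multi-field mapping for "title" with one subfield per analyzer.

    `analyzers` maps a subfield name to the analyzer used for that subfield.
    The root field has no explicit analyzer, so it gets the default one.
    """
    return {
        "type": "string",
        "fields": {
            name: {"type": "string", "analyzer": analyzer}
            for name, analyzer in analyzers.items()
        },
    }

mapping = title_mapping({"english": "english", "german": "german"})
# Request body for creating the target index, as in the PUT /target call.
body = {"mappings": {"titles": {"properties": {"title": mapping}}}}
print(json.dumps(body, indent=2))
```

Adding another language is then a matter of adding one more entry to the dict, for example a "hebrew" subfield pointing at a custom analyzer that is defined in the index settings.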

Now we can run a query (and as far as I've understood this concrete case did not yield a result for you):

GET /target/titles/_search
{
   "query": {
      "multi_match": {
         "type": "most_fields",
         "query": "Hello world",
         "fields": [
            "title.german"
         ]
      }
   }
}

which gives:

{
   "took": 3,
   "timed_out": false,
   "_shards": {
      "total": 5,
      "successful": 5,
      "failed": 0
   },
   "hits": {
      "total": 1,
      "max_score": 0.2712221,
      "hits": [
         {
            "_index": "target",
            "_type": "titles",
            "_id": "1",
            "_score": 0.2712221,
            "_source": {
               "title": "Hello World"
            }
         }
      ]
   }
}

because the title.german field also exists for the document that is actually in English.
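
Because every subfield exists for every document, the choice of analyzer effectively happens at query time, by picking which subfield to search. A small sketch of how a script could build such a query, assuming the language of the search text is known (for example from a "language" field or from the user's settings); the function name title_query is hypothetical:

```python
def title_query(text, language=None):
    """Build a multi_match query on "title" plus an optional language subfield."""
    fields = ["title"]
    if language:
        # e.g. "title.german" is the subfield analyzed with the german analyzer
        fields.append("title." + language)
    return {
        "query": {
            "multi_match": {
                "type": "most_fields",
                "query": text,
                "fields": fields,
            }
        }
    }

query = title_query("Hallo erstmal", "german")
print(query["query"]["multi_match"]["fields"])
```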

I hope we can get down to your problem based on this example.

Daniel

Daniel, thank you very much for this detailed example! Now I understand it much better, and it's also working for me like magic...
I didn't get results for "fields": [ "title.hebrew" ] because I accidentally ran this query on the old index.

Let me please ask one small question regarding the reindex:
I have a field with dots in its name. For this one I ran the following lines in my script:

new_doc = old_doc
if 'foo.bar' in new_doc:
    new_doc['foo_bar'] = new_doc['foo.bar']
    del new_doc['foo.bar']

I just want to make sure I am doing it right...
Thanks!

Hi Moshe,

glad I could help finally resolve your problem. :slight_smile:

To your question: Yes, this looks ok to me. Maybe you can come up with an even more generic approach, something along these lines (a sketch that only renames top-level keys):

new_doc = {}
for k, v in old_doc.items():
    # str.replace swaps every "." in the key for "_"
    new_doc[k.replace(".", "_")] = v

I figure you are migrating to Elasticsearch 2.x where we disallowed dots in field names.
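
The loop above only looks at top-level keys. If dotted names can also appear inside nested objects or arrays, a recursive variant might look like the sketch below. The function name replace_dots is made up, and the sketch assumes that no document contains both "foo.bar" and "foo_bar" at the same level (if it did, one value would silently overwrite the other).

```python
def replace_dots(value):
    """Return a copy of `value` with "." replaced by "_" in every dict key."""
    if isinstance(value, dict):
        return {
            key.replace(".", "_"): replace_dots(inner)
            for key, inner in value.items()
        }
    if isinstance(value, list):
        # Arrays may hold objects that need the same treatment.
        return [replace_dots(item) for item in value]
    # Strings, numbers, booleans and None are returned unchanged.
    return value

old_doc = {"foo.bar": 1, "nested": {"a.b": [{"c.d": "x.y"}]}}
new_doc = replace_dots(old_doc)
print(new_doc)
```

Only keys are rewritten. Dots inside string values, such as "x.y" here, are left alone.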

I just wanted to make you aware that we allow them again in 5.0:

See https://github.com/elastic/elasticsearch/issues/15951 and specifically https://github.com/elastic/elasticsearch/issues/19443.

I don't know about your timeline, but maybe you want to consider upgrading to 5.0 (although this is a huge step and probably requires a lot of testing). But be aware that it is not yet final and we also don't have a release date. I just wanted to raise your awareness as you're struggling with removing dots from your field names.

With 2.x we have also introduced a reindex feature in Elasticsearch. So if you change your mapping and don't need to do any transformations (unlike your case), you can also use that in the future (see docs).
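
For reference, a plain copy with the reindex API needs little more than a source and a destination index. The sketch below only builds and prints the request body that would be sent with POST /_reindex; the index names "src" and "target" are the ones from the example earlier in this thread, and the helper name reindex_body is made up.

```python
import json

def reindex_body(source_index, dest_index):
    """Build the request body for a plain index-to-index copy via _reindex."""
    return {
        "source": {"index": source_index},
        "dest": {"index": dest_index},
    }

body = reindex_body("src", "target")
print(json.dumps(body))
```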

Daniel

Thanks Daniel, we are on a tight timeline, so for now I will upgrade to 2.3...

Oh, sure. This makes sense then.