Changing mappings of some group of ids without affecting other documents


(apanimesh061) #1

I have document sets, each with its own id. Each document set contains documents that also have their own ids. I am trying to figure out a way to update the mapping of all documents in a set without affecting the mappings of documents that belong to other sets. By updating I mean adding custom analyzers to the fields. These analyzers are defined in the settings. For this I got a suggestion to use a shared type like this:

PUT /my-index/
{
  "settings": {
    "index.store.type": "default",
    "index": {
        "number_of_shards": 5,
        "number_of_replicas": 1,
        "refresh_interval": "60s"
    },
    "analysis": {
        "filter": {
            "porter_stemmer_en_EN": {
                "type": "stemmer",
                "name": "porter"
            },
            "default_stop_name_en_EN": {
                "type": "stop",
                "stopwords": "_english_"
            },
            "snowball_stop_words_en_EN": {
                "type": "stop",
                "stopwords_path": "snowball.stop"
            },
            "smart_stop_words_en_EN": {
                "type": "stop",
                "stopwords_path": "smart.stop"
            },
            "shingle_filter_en_EN": {
                "type": "shingle",
                "min_shingle_size": "2",
                "max_shingle_size": "2",
                "output_unigrams": true
            }
        }
    }
  }
}

PUT /my-index/document_set/_mapping
{
  "properties": {
    "type": {
      "type": "string",
      "index": "not_analyzed"
    },
    "doc_id": {
      "type": "string",
      "index": "not_analyzed"
    },
    "plain_text": {
      "type": "string",
      "store": true,
      "index": "analyzed"
    },
    "pdf_text": {
      "type": "attachment",
      "fields": {
        "pdf_text": {
          "type": "string",
          "store": true,
          "index": "analyzed"
        }
      }
    }
  }
}

POST /my-index/document_set/1
{
  "type": "d1",
  "doc_id": "1",
  "plain_text": "simple text for doc1."
}

POST /my-index/document_set/2
{
  "type": "d1",
  "doc_id": "2",
  "pdf_text": "cGRmIHRleHQgaXMgaGVyZS4="
}

POST /my-index/document_set/3
{
  "type": "d2",
  "doc_id": "3",
  "plain_text": "simple text for doc3 in d2."
}

POST /my-index/document_set/4
{
  "type": "d2",
  "doc_id": "4",
  "pdf_text": "cGRmIHRleHQgaXMgaGVyZSBpbiBkMi4="
}

// To get documents grouped by their document set ids
GET /my-index/document_set/_search
{
  "query" : {
    "filtered" : {
      "filter" : {
        "term" : {
          "type" : "d1"
        }
      }
    }
  }
}

Now I have 4 documents, where documents 1 and 2 are in group d1 and documents 3 and 4 are in d2. How can I update the mapping for only one group without affecting the other group's mapping? I can get the documents per group using the search query given at the end.

There can be 1000s of such document sets having ~10000 documents.

Any help/suggestion would be appreciated.


(Mark Walkom) #2

You can't because they are both of the same _type.
You need to reindex.
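
Assuming a 1.x cluster (the `filtered` query and `not_analyzed` strings above suggest one), there is no server-side reindex endpoint to call, so reindexing one group usually means scrolling over that group's documents and bulk-indexing them into a new index. Below is a minimal sketch of the bulk half only. The helper name `bulk_body_for_hits` and the target index `set_d1` are hypothetical, and the target index is assumed to have been created beforehand with the new analyzers.

```python
import json

def bulk_body_for_hits(hits, target_index, target_type):
    """Build a newline-delimited _bulk body that copies search hits into target_index."""
    lines = []
    for hit in hits:
        # Action line: reuse the original _id so a re-run overwrites instead of duplicating.
        action = {"index": {"_index": target_index, "_type": target_type, "_id": hit["_id"]}}
        lines.append(json.dumps(action, sort_keys=True))
        # Source line: the document body as returned by the search.
        lines.append(json.dumps(hit["_source"], sort_keys=True))
    # The bulk API expects the body to end with a newline.
    return "\n".join(lines) + "\n"

# Hits shaped like the response of the term-filter search on "type": "d1".
sample_hits = [
    {"_id": "1", "_source": {"type": "d1", "doc_id": "1", "plain_text": "simple text for doc1."}},
    {"_id": "2", "_source": {"type": "d1", "doc_id": "2", "pdf_text": "cGRmIHRleHQgaXMgaGVyZS4="}},
]
body = bulk_body_for_hits(sample_hits, "set_d1", "document_set")
```

The resulting string would be sent as the body of POST /_bulk; the scroll loop that collects the hits is left out here.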


(apanimesh061) #3

So what should I be doing in this situation?

I thought of creating an alias for each document set, but updating the settings through any one alias updates them for all aliases.
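
For context, an alias is only a named (and optionally filtered) view onto an index, so it carries no analysis settings or mapping of its own; that is why a settings change made through one alias shows up for every alias on the same index. A minimal sketch of what filtered aliases do provide, namely per-set search scoping. The helper name and the `set_` alias prefix are hypothetical.

```python
import json

def filtered_alias_actions(index, set_ids):
    """Build a POST /_aliases body with one filtered alias per document set."""
    actions = []
    for set_id in set_ids:
        actions.append({
            "add": {
                "index": index,
                "alias": "set_" + set_id,  # hypothetical naming scheme
                # Searches through the alias only see documents of this set.
                "filter": {"term": {"type": set_id}},
            }
        })
    return {"actions": actions}

alias_body = filtered_alias_actions("my-index", ["d1", "d2"])
print(json.dumps(alias_body, indent=2))
```

Both aliases still point at the same index, so they still share one mapping and one set of analyzers.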

Should I create a separate index for each set? Is there a limit on the number of indices?


(Mark Walkom) #4

Again, this is because they are the same _type.

It'd be easier to put them into their own index, especially since ES 2.0 will require fields with the same name in different _types of the same index to share the same mapping.
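
As a rough illustration of the index-per-set approach, the sketch below builds a create-index body in which a set's own custom analyzer is wired to `plain_text`. The filter definition is copied from the question; the function name and the analyzer name `d1_text_analyzer` are hypothetical, and the choice of the `standard` tokenizer plus `lowercase` is an assumption, not something stated in the thread.

```python
import json

def index_body_for_set(analyzer_name, filter_defs):
    """Return a create-index body whose plain_text field uses a set-specific analyzer."""
    return {
        "settings": {
            "analysis": {
                "filter": filter_defs,
                "analyzer": {
                    analyzer_name: {
                        "type": "custom",
                        "tokenizer": "standard",
                        # lowercase first, then the set's filters in the order they were given
                        "filter": ["lowercase"] + list(filter_defs),
                    }
                },
            }
        },
        "mappings": {
            "document_set": {
                "properties": {
                    "doc_id": {"type": "string", "index": "not_analyzed"},
                    "plain_text": {"type": "string", "store": True, "analyzer": analyzer_name},
                }
            }
        },
    }

d1_body = index_body_for_set(
    "d1_text_analyzer",
    {"porter_stemmer_en_EN": {"type": "stemmer", "name": "porter"}},
)
print(json.dumps(d1_body, indent=2))
```

The body would be sent with a PUT to a per-set index name; each set can then get different filters without touching any other set's index.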


(apanimesh061) #5

OK, I'll make a separate index for each type.

You mean multi_field will be removed in the next release?

Is there any limit on the number of indices? The application that I am working on does not have a lot of query volume/requests. So, if I keep ~10000 indices holding a total of ~15 million documents, is there anything I should keep in mind?


(Mark Walkom) #6

That seems like a lot; you may run into resource issues there, so you may want to refactor your doc structure.

Multi field is not being removed; check out https://www.elastic.co/blog/great-mapping-refactoring


(apanimesh061) #7

It is more like the projected no. of document sets. It might be less than that as well.

What exactly do you mean by refactoring my document structure?


(Mark Walkom) #8

Figure out a way to put the docs in the same index: normalise the format to remove the set-specific mapping, or make the mapping applicable to multiple document sets.
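
One possible reading of this suggestion, not necessarily the one intended, is a single shared mapping in which `plain_text` has one sub-field per analyzer, with each query picking the sub-field that belongs to its set. A minimal sketch under that assumption follows. The sub-field names `stemmed` and `shingled`, the analyzer names, and both helper names are hypothetical, and the analyzers are assumed to be defined in the index settings already.

```python
def shared_text_mapping(subfield_analyzers):
    """Build one mapping whose plain_text field is indexed once per analyzer."""
    fields = {
        name: {"type": "string", "analyzer": analyzer}
        for name, analyzer in subfield_analyzers.items()
    }
    return {
        "properties": {
            "type": {"type": "string", "index": "not_analyzed"},
            "plain_text": {"type": "string", "fields": fields},
        }
    }

def query_for_set(set_id, subfield, text):
    """Search one set's documents through the sub-field analyzed for that set."""
    return {
        "query": {
            "filtered": {
                "query": {"match": {"plain_text." + subfield: text}},
                "filter": {"term": {"type": set_id}},
            }
        }
    }

mapping = shared_text_mapping({"stemmed": "stem_analyzer", "shingled": "shingle_analyzer"})
d1_query = query_for_set("d1", "stemmed", "simple text")
```

The trade-off is that every document is analyzed by every analyzer, so index size grows with the number of distinct analyzers rather than with the number of sets.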


(apanimesh061) #9

Analysis has to be done on the documents belonging to one particular set. Each set can have separate analyzers associated with it. So, I don't think normalizing would be possible. Even if I create a single index, my queries will be confined to one particular document set.

