How to add "position_increment_gap" in a field created by a Crawler Index?

freddyrb · April 10, 2024, 9:43pm

I have an index created by a crawler, and it creates a title field. When I try to run a match_phrase_prefix search, I get the error "failed to create query: field:[title] was indexed without position data; cannot run PhraseQuery". I read that "position_increment_gap" needs to be added, but when I tried to add it to the mappings, the request was rejected with this error:

"type": "resource_already_exists_exception",
        "reason": "index [search-index1/BCnmIxvbQD-SUdKfU_IRpg] already exists".

My question is: how can I add that attribute to the mapping when the index is created by a crawler?

I need to use match_phrase_prefix to check for exact matches on the title. Only "match" works.

Jedr_Blaszyk · April 16, 2024, 12:08pm

Hi @freddyrb. The default mapping of the title field in a crawler index should support match_phrase_prefix queries. Can you try running a simple query and see if it works?

GET search-test/_search
{
  "query": {
    "match_phrase_prefix": {
      "title": "<your query>"
    }
  },
  "_source": ["title"]
}

As far as I can tell, position_increment_gap only applies to multi-valued text fields.

If this doesn't help, can you share your search query so that we can help with troubleshooting?

freddyrb · April 16, 2024, 7:37pm

Thanks for the answer, @Jedr_Blaszyk. The query is:

GET search-testindex1/_search
{
  "query": {
    "match_phrase_prefix": {
      "title": "nuxeo drive"
    }
  },
  "_source": ["title"]
}

The problem appears when I send more than one word. I want to find the documents whose title contains the phrase "nuxeo drive"; I don't want Elasticsearch to search for "nuxeo" and "drive" separately. That is when I get the error.

Here is the relevant part of the full query:

"query": {
    "bool": {
      "filter": {
        "terms": {
          "indexedContentType.keyword": [
            "Documents,TSKB Article,blog,forum"
          ]
        }
      },
      "must": {
        "bool": {
          "should": [
            {
              "match_phrase_prefix": {
                "title": {
                  "query": "Documentation Docs Home Getting Starte"
                }
              }
            },
            {
              "match_phrase_prefix": {
                "subject": {
                  "query": "Documentation Docs Home Getting Starte"
                }
              }
            },

Jedr_Blaszyk · April 17, 2024, 10:01am

Aha! Now that I see the example you provided, I understand the issue. The default template mapping of search-testindex1 should already include a number of subfields for the title property. They are indexed differently so that the field suits a range of queries.

The title field itself is indexed with "index_options": "freqs", which stores no position data, hence the PhraseQuery error. The stem subfield of title is indexed with position information. You can try running:

GET search-testindex1/_search
{
  "query": {
    "match_phrase_prefix": {
      "title.stem": "nuxeo drive"
    }
  },
  "_source": ["title"]
}
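
To check which subfields of an index can serve phrase queries, you can inspect the response of GET search-testindex1/_mapping. Below is a minimal Python sketch of that check. The helper name phrase_capable_fields and the sample mapping are illustrative assumptions (the sample is a trimmed, made-up stand-in, not the actual crawler mapping). It relies on the fact that text fields keep positions unless index_options says otherwise.

```python
def phrase_capable_fields(properties, prefix=""):
    """Return paths of text fields whose index_options keep position data."""
    found = []
    for name, spec in properties.items():
        path = prefix + name
        # Text fields default to "positions" when index_options is not set.
        options = spec.get("index_options", "positions")
        if spec.get("type") == "text" and options in ("positions", "offsets"):
            found.append(path)
        # Multi-fields live under "fields", object sub-properties under "properties".
        for key in ("fields", "properties"):
            if key in spec:
                found.extend(phrase_capable_fields(spec[key], path + "."))
    return found


# Hypothetical, trimmed "properties" section of a mapping response.
sample_properties = {
    "title": {
        "type": "text",
        "index_options": "freqs",
        "fields": {
            "stem": {"type": "text", "index_options": "positions"},
            "enum": {"type": "keyword"},
        },
    }
}

usable = phrase_capable_fields(sample_properties)
print(usable)
```

Any path this returns for your real mapping is a candidate field for match_phrase_prefix.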

Let me know if it helps!

Also, you can always extend your existing mapping with a subfield or a new field. Here is an example of how to add a subfield called phrase to the title property:

PUT search-testindex1/_mapping
{
  "properties": {
    "title": {
      "type": "text",
      "fields": {
        "phrase": {
          "type": "text",
          "index_options": "positions",
          "analyzer": "iq_text_base"
        }
      },
      "index_options": "freqs",
      "analyzer": "iq_text_base"
    }
  }
}
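
One caveat, which is general Elasticsearch behavior and not specific to the crawler: a multi-field added to an existing mapping is only populated for documents indexed after the change. Existing documents need an _update_by_query pass (or a re-crawl) before title.phrase returns hits. Here is a rough Python sketch that builds the mapping body and prints both requests. The helper add_subfield_body is a hypothetical name, not part of any client library, and the parent field settings are assumed to match the existing mapping shown above.

```python
import json


def add_subfield_body(field, subfield, analyzer, parent_index_options="freqs"):
    """Build a PUT <index>/_mapping body that adds a positions-indexed subfield.

    The parent settings (index_options, analyzer) must repeat what the
    existing mapping already has; they cannot be changed in place.
    """
    return {
        "properties": {
            field: {
                "type": "text",
                "index_options": parent_index_options,
                "analyzer": analyzer,
                "fields": {
                    subfield: {
                        "type": "text",
                        "index_options": "positions",
                        "analyzer": analyzer,
                    }
                },
            }
        }
    }


body = add_subfield_body("title", "phrase", "iq_text_base")

# Step 1: add the subfield to the mapping.
print("PUT search-testindex1/_mapping")
print(json.dumps(body, indent=2))

# Step 2: touch every existing document so the new subfield gets populated.
print("POST search-testindex1/_update_by_query?conflicts=proceed")
```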

freddyrb · April 17, 2024, 3:44pm

Thanks @Jedr_Blaszyk, I will look into your suggestions. However, my use case is a query across multiple indexes that all have the title field. All of them (API type) return results; only the crawler index throws the error.

Any suggestion?

Jedr_Blaszyk · April 18, 2024, 2:03pm

my use case is a query with multiple indexes that have the field title

OK, that complicates things a bit, but I see two ways forward to support your use case.

Solution 1: Filter on the index inside a bool query. This is more manageable, with a slight query-time performance hit.

Example query:

GET index1,index2,index3,crawler_index/_search
{
  "query": {
    "bool": {
      "should": [
        {
          "bool": {
            "must": {
              "multi_match": {
                "query": "<your query>",
                "type": "phrase_prefix",
                "fields": ["title"]
              }
            },
            "must_not": {
              "terms": {
                "_index": ["crawler_index"]
              }
            }
          }
        },
        {
          "bool": {
            "must": {
              "multi_match": {
                "query": "<your query>",
                "type": "phrase_prefix",
                "fields": ["title.stem"]
              }
            },
            "filter": {
              "terms": {
                "_index": ["crawler_index"]
              }
            }
          }
        }
      ],
      "minimum_should_match": 1
    }
  }
}
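
If you build this query in application code, the per-index clauses can be generated from a small lookup table. The Python sketch below is one way to do that under stated assumptions: the function name phrase_prefix_across_indices and the overrides argument are made up for illustration, and the index and field names are the placeholders from the example query.

```python
import json


def phrase_prefix_across_indices(query_text, default_field, overrides):
    """Build a bool query that uses a different field for specific indices.

    overrides maps an index name to the field to query in that index;
    every other index is queried on default_field.
    """
    def clause(field):
        return {
            "multi_match": {
                "query": query_text,
                "type": "phrase_prefix",
                "fields": [field],
            }
        }

    # Default clause: applies to all indices that have no override.
    should = [{
        "bool": {
            "must": clause(default_field),
            "must_not": {"terms": {"_index": sorted(overrides)}},
        }
    }]
    # One clause per overridden index, restricted to that index.
    for index_name, field in sorted(overrides.items()):
        should.append({
            "bool": {
                "must": clause(field),
                "filter": {"terms": {"_index": [index_name]}},
            }
        })
    return {"query": {"bool": {"should": should, "minimum_should_match": 1}}}


body = phrase_prefix_across_indices(
    "nuxeo drive", "title", {"crawler_index": "title.stem"}
)
print(json.dumps(body, indent=2))
```

The printed body can be sent to GET index1,index2,index3,crawler_index/_search as is.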

Solution 2: Update the mappings in the other indices with the same subfield, e.g. title.stem (indexed with position data), and reindex the data. This is a bit more work for you, but it gives slightly better query-time performance. You would then be able to run a query like:

GET index1,index2,index3,crawler_index/_search
{
  "query": {
    "match_phrase_prefix": {
      "title.stem": "<your query>"
    }
  }
}

Hope this helps!

freddyrb · April 18, 2024, 5:04pm

Thanks, I will check those options.
