Web crawler fields indexed without position data; cannot run PhraseQuery

sarahg · September 18, 2024, 9:57pm

Hello,

I'm getting started setting up Elasticsearch for a technical documentation website. We are using this stack:

Elastic Cloud
Indexing via the Web Crawler
Search UI for the frontend

I'm having a hard time getting a simple keyword search working against the fields that were defined by the crawler. When I search for something with multiple words, I get an error:

failed to create query: field:[body_content] was indexed without position data; cannot run PhraseQuery

Searches with one word work fine.

The error comes up both when searching through our UI, and querying using the console directly like this:

GET /my-index-name/_search
{
  "query": {
    "match_phrase": {
      "headings": "git merge"
    }
  }
}

This request returns the PhraseQuery error. Searching just the word "git" returns results as expected.

The crawler-defined fields don't appear to be configurable, so everything I've found about adjusting the field mappings does not seem to apply here.

This seems so basic, but how do we search multiple words against fields defined by the crawler?

Thanks!

Sean_Story · September 19, 2024, 5:34pm

Hi @sarahg , thanks for reporting.

Totally agree, this behavior is unexpected. Trying this out myself, crawling example.com, I get a document with a body_content value of:

Example Domain This domain is for use in illustrative examples in documents. You may use this domain in literature without prior coordination or asking for permission. More information...

Searching for "domain for" (two words), I do get a hit, which is expected with a slop of 1, since the phrase "domain is for" does appear in this value. But "domain foo" throws the same error you're seeing.

I don't have a root cause or workaround for you yet, I'm going to follow up internally, but wanted to let you know we're looking into it.

Sean_Story · September 19, 2024, 8:22pm

@sarahg I think we've (real credit to @Kathleen_DeRusso) tracked the root cause down to the mappings that Elastic Crawler is creating. If you do a:

GET <your-crawler-index>/_mapping

You should see:

...
        "body_content": {
          "type": "text",
          "index_options": "freqs",
...

That index_options: freqs is the problem. See these docs:

freqs
Doc number and term frequencies are indexed. Term frequencies are used to score repeated terms higher than single terms.
positions (default)
Doc number, term frequencies, and term positions (or order) are indexed. Positions can be used for proximity or phrase queries.

So by not using the default of positions, we're losing the ability to do phrase queries.
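To check which fields in an index are affected, one option is to scan the mapping for text fields whose index_options stop short of positions. The helper below is a minimal sketch, not part of any Elastic client: the function name and the sample mapping are hypothetical, and it assumes you pass in the "properties" object from a GET <index>/_mapping response.

```javascript
// Hypothetical helper: lists text fields whose index_options omit positions.
// `properties` is the "properties" object from a GET <index>/_mapping response.
function fieldsWithoutPositions(properties, prefix = "") {
  const found = [];
  for (const [name, def] of Object.entries(properties)) {
    const path = prefix + name;
    // "docs" and "freqs" both skip term positions, so phrase queries fail on them.
    if (def.type === "text" && ["docs", "freqs"].includes(def.index_options)) {
      found.push(path);
    }
    // Recurse into object fields and multi-field subfields.
    if (def.properties) found.push(...fieldsWithoutPositions(def.properties, path + "."));
    if (def.fields) found.push(...fieldsWithoutPositions(def.fields, path + "."));
  }
  return found;
}

// Made-up sample shaped like the crawler mapping shown above.
const sampleProperties = {
  body_content: {
    type: "text",
    index_options: "freqs",
    fields: { stem: { type: "text", analyzer: "iq_text_stem" } },
  },
  url: { type: "keyword" },
};

console.log(fieldsWithoutPositions(sampleProperties)); // [ 'body_content' ]
```

A subfield with no explicit index_options is not reported, because it falls back to the default of positions.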

This was originally added because it provided significant on-disk storage savings, and we were able to facilitate phrase queries (through App Search) via other subfields. However, App Search is no longer the only way to search your crawl data, so this rationale no longer holds water. I'll be filing an issue for our backlog to look at changing this behavior in the future.

For now, I've got 3 workaround options for you.

1. You could use the stem subfield. This query works:

GET /search-test/_search
{
  "query": {
    "match_phrase": {
      "body_content.stem": "domain foo"
    }
  }
}

2. You can edit the mappings for your crawler index. This might mean starting a fresh index, or using a re-index operation. But you probably don't need that "index_options": "freqs" on the top-level fields that the crawler's mappings create (body_content, headings, title, etc.).
3. You can use an ingest pipeline or a copy_to field to copy body_content to a new field. You'll notice that the dynamic_templates do have that "index_options": "freqs" too, so to avoid that applying, you'd need to update the mapping first to explicitly define the new text field.
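As a rough illustration of the ingest pipeline option, the request bodies might look like the sketch below. The field name body_content_phrase is hypothetical, and the sketch assumes the set processor's copy_from option is available in your Elasticsearch version. How the pipeline gets attached to the crawler index is not shown here.

```javascript
// Hypothetical field names; adjust to your index.
const SOURCE_FIELD = "body_content";
const PHRASE_FIELD = "body_content_phrase";

// Body for: PUT <your-crawler-index>/_mapping
// Defines the new field explicitly so the crawler's dynamic_templates
// (which also set "index_options": "freqs") do not apply to it.
const mappingUpdate = {
  properties: {
    [PHRASE_FIELD]: { type: "text", index_options: "positions" },
  },
};

// Body for: PUT _ingest/pipeline/<your-pipeline-name>
// Copies the crawled body text into the phrase-searchable field.
const pipeline = {
  description: "Copy body_content into a field indexed with positions",
  processors: [
    { set: { field: PHRASE_FIELD, copy_from: SOURCE_FIELD, ignore_empty_value: true } },
  ],
};

// Body for: GET <your-crawler-index>/_search
const phraseQuery = {
  query: { match_phrase: { [PHRASE_FIELD]: "git merge" } },
};

console.log(JSON.stringify({ mappingUpdate, pipeline, phraseQuery }, null, 2));
```

Documents crawled before the pipeline is in place would not have the new field, so a re-crawl or reindex is likely needed before phrase queries return everything.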

Again, thank you for reporting!

sarahg · September 25, 2024, 9:48pm

@Sean_Story Hey Sean, thank you for looking into this, I really appreciate it. I am really stuck on this one and it's been very humbling, haha.

Since we're using SearchUI, I don't think we can query subfields. Our queries are built like this: Configuration | Documentation

I'd link to our actual project code since it's open-source, but the forum bans links to GitLab and I'm working on the GitLab Docs website. But it's very similar to the basic example in the docs so far:

    const driver = new SearchDriver({
      apiConnector: connector,
      searchQuery: {
        result_fields: {
          headings: { raw: {} },
          url_path: { raw: {} },
          gitlab_docs_breadcrumbs: { raw: {} },
          gitlab_docs_section: { raw: {} },
        },
        disjunctiveFacets: [],
        facets: {},
        search_fields: {
          headings: {},
          body_content: {},
        },
      },
      initialState: {
        searchTerm: query,
      },
    });

Seems like this issue, which was closed as "won't fix" is maybe relevant: Nested objects are not rendered when `elasticsearch-connector` is used · Issue #907 · elastic/search-ui · GitHub. Probably can't use that workaround.

The crawler index field mappings do not appear to be editable in the admin UI. I see the fields, and the subfields, but there's nothing there to modify them, even for a new index that hasn't crawled a site yet. For something like this, would I need to do this via the API somehow?

Ingest pipelines look like a lot to learn right now just for getting a two-word search working, but if that's our only option, we'll figure it out.

Is the web crawler just not a good option if you need to search multiple words? Options 2 and 3 seem like a lot of modification/fighting against the product. Is there a better path if we need a fairly basic search that can handle spaces?

sarahg · September 26, 2024, 4:56pm

Spent most of yesterday trying to adjust the field mappings for the crawler index (workaround #2).

This didn't work, but maybe it's close, or I'm missing something easy here?

1. In the console, retrieve the mappings for the existing crawler-defined index: GET /search-gitlab-docs-hugo/_mapping
2. Copy that JSON, and remove the "index_options": "freqs" lines for the top-level body_content and headings fields
3. Create a new index with the new mappings
4. API returns an error

Here's a Gist that shows the request and error response we get back: Elastic index mapping updates · GitHub

I tried a few other things, like changing the mapping on the existing index, but that also failed, I suppose because you can't modify an index with data in it.

Another thing that seems to not be working with this method is that when I create a new index, I don't see a way to associate a crawler with it.

For example, I did get a more simplified index to post: basic-index.txt · GitHub

This was a copy of the crawler's index mappings with most of the analyzer and freq fields removed. It does create an index, but even with "search-" prefixed on the name, it doesn't seem like something I can hook up to a crawler.

Sean_Story · September 26, 2024, 5:22pm

I bet you're missing the index _settings as well. That's what defines the things mentioned in the mappings, like iq_text_stem, i_text_bigram, q_text_bigram, etc.
Instead of GET <index>/_mapping you can do GET <index>, and it'll show you mappings and settings at the same time. And you can use both at the same time to create the new index.
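The GET <index> response can't be sent back unchanged, so some cleanup is needed before using it to create an index. Here is a minimal sketch of that editing step. The function name is hypothetical, and the list of generated settings is an assumption about which keys the create-index API rejects. The error message you get back is the authoritative list.

```javascript
// Settings Elasticsearch generates itself; assumed here to be the ones the
// create-index API rejects when they are sent back verbatim.
const GENERATED_SETTINGS = ["uuid", "version", "creation_date", "provided_name"];

// `getIndexResponse` is the JSON returned by GET <index>.
// `fieldsToFix` are top-level fields that should fall back to positions.
function buildCreateIndexBody(getIndexResponse, indexName, fieldsToFix) {
  const { mappings, settings } = getIndexResponse[indexName];
  // Deep copy so the original response is left untouched.
  const body = JSON.parse(JSON.stringify({ mappings, settings }));

  if (body.settings?.index) {
    for (const key of GENERATED_SETTINGS) delete body.settings.index[key];
  }

  for (const field of fieldsToFix) {
    const def = body.mappings?.properties?.[field];
    // Removing index_options falls back to the default, "positions".
    if (def) delete def.index_options;
  }
  return body; // use as the body of PUT <new-index>
}
```

Called as buildCreateIndexBody(response, "search-gitlab-docs-hugo", ["body_content", "headings"]), this keeps the analysis settings that define analyzers such as iq_text_stem while dropping the freqs option on the named fields.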

when I create a new index, I don't see a way to associate a crawler with it.

You can't point a crawler at a different index once it's created, unfortunately. So what you can do is:

1. Create a new crawler in the UI.
2. Before running the first crawl, change the mappings.
3. Run your crawl.

OR

1. You already have a crawler.
2. You define your new index.
3. You delete the original index associated with the crawler.
4. You run a reindex operation to replace the original index with your new index (I think there are some gotchas here in making the mappings/settings transfer).

Since we're using SearchUI, I don't think we can query subfields.

I totally missed this in your first message, that you're using Search UI. Let me loop in @joemcelroy , to see if there's something special that can be done here. I expect this "just works" if you use the Search UI connector for App Search, but I could believe that this is less straightforward with the Elasticsearch connector.

joemcelroy · September 26, 2024, 7:11pm

Hey, since you're using the Elasticsearch connector, you can override the Elasticsearch query that Search UI sends, which lets you control exactly how you want to search.

Below is an example that overrides the default query with a multi_match query, but you can return whatever Elasticsearch query you want.

const connector = new ElasticsearchAPIConnector(
  {
    host: "https://example-host.es.us-central1.gcp.cloud.es.io:9243",
    index: "national-parks",
    apiKey: "exampleApiKey"
  },
  (requestBody, requestState, queryConfig) => {
    console.log("postProcess requestBody Call", requestBody); // logging out the requestBody before sending to Elasticsearch
    if (!requestState.searchTerm) return requestBody;

    // transforming the query before sending to Elasticsearch using the requestState and queryConfig
    const searchFields = queryConfig.search_fields;

    requestBody.query = {
      multi_match: {
        query: requestState.searchTerm,
        fields: Object.keys(searchFields).map((fieldName) => {
          const weight = searchFields[fieldName].weight || 1;
          return `${fieldName}^${weight}`;
        })
      }
    };

    return requestBody;
  }
);
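If phrase matching is still wanted, the same hook could combine this override with the stem subfield workaround mentioned earlier in the thread. The sketch below is untested against a real cluster. The helper names are hypothetical, and it assumes every configured search field has a ".stem" subfield indexed with positions, which the thread only confirms for body_content.

```javascript
// Hypothetical helper: builds a phrase query against the ".stem" subfield of
// each configured search field. Assumes each field has a ".stem" subfield.
function buildStemPhraseQuery(searchTerm, searchFields) {
  return {
    multi_match: {
      query: searchTerm,
      type: "phrase",
      fields: Object.keys(searchFields).map((fieldName) => {
        const weight = searchFields[fieldName].weight || 1;
        return `${fieldName}.stem^${weight}`;
      }),
    },
  };
}

// Same shape as the post-process function passed to ElasticsearchAPIConnector.
function postProcessRequestBody(requestBody, requestState, queryConfig) {
  if (!requestState.searchTerm) return requestBody;
  requestBody.query = buildStemPhraseQuery(
    requestState.searchTerm,
    queryConfig.search_fields
  );
  return requestBody;
}
```

Passing postProcessRequestBody as the second argument to the connector would replace the inline function above. A phrase query is stricter than the plain multi_match, so it returns fewer results.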

sarahg · September 26, 2024, 7:59pm

Thanks again @Sean_Story.

I tried both approaches you mentioned here.

New crawler method fails like this: mappingFixes1.txt · GitHub

Updating an existing crawler fails like this: mappingFixes1.txt · GitHub

Trying to reindex also leads to some weirdness in the UI. I've now got a bunch of phantom crawlers that I can't delete:

To delete, you have to specify a crawler name, but these are empty, so I can't trash them. Not concerned about this, just figured I'd flag it in case it wasn't a known issue. We've been trying a ton of things this week to try to get anything to work, so it's possible some of the ghost crawlers were created in other ways, but I know the reindex operation adds new ones.

I'll try Joe's method of changing the SearchUI query next.

Sean_Story · September 26, 2024, 8:28pm

Ok, I think I've caused some confusion here. Some manual editing needs to be done between the GET and the PUT if you're creating a new index and need to copy over the settings. The error message identifies the pieces of the payload that you'll want to just remove from the PUT. But then you got an "index already exists" error because you used the index creation API for the approach where we don't need to create the index (it exists already) and only need to modify the mappings of an existing index. For that, use the PUT <index>/_mapping API you were trying originally. No settings are necessary for that. You only need to worry about settings if you try to replace the original index.

// Delete the original, which was associated with the crawler
DELETE /search-sarah-test-5

// Replace it with the new one. Error

What did you do for the "replace it with the new one" part? Given that the error complains that search-sarah-test-5 does not exist, I'm worried that the "replace" didn't work. Again, I think it's probably going to be more straightforward to edit the mappings, so maybe we should just shelve this "replace the index" approach for now.

I've now got a bunch of phantom crawlers that I can't delete

Oof, we're just hitting all the issues. I'm sorry for the experience you're having. This is a known issue we're working to get fixed. The workaround is described here: Ghost Web Crawlers - #3 by Jedr_Blaszyk

sarahg · September 26, 2024, 8:54pm

@joemcelroy Thank you so much for the code example. That solved our issue for now. With the query adjustment, we can now run searches with multiple words. Knowing how to modify the query like this will be useful for us going forward. Thanks again

@Sean_Story Ah, thanks for clarifying! It definitely felt like an order-of-operations problem, or just replacing the wrong pieces, or something like that. I think I'm just too new to Elastic to really understand using the API like this right now. It's not a happy path for a new user for sure (but I also work on a product that's hard on new users, so I get that it's hard to support everyone and everything).

Documenting more specifically how to change mappings on a crawler index could be helpful for others if it's going to be a while before the crawler's mappings stop being configured specifically for App Search. If I get it figured out, maybe I'll do a docs PR. I do imagine we'll eventually need to make this happen for something later on.

Also I'm glad I'm not the only one that managed to create a bunch of ghost crawlers, lol.

Thanks again guys, have a good rest of your week!

system · October 24, 2024, 8:55pm

This topic was automatically closed 28 days after the last reply. New replies are no longer allowed.
