How to reduce information in search results

When I get the results of an Elasticsearch search, the individual hits contain this information:

    {
        "_index" : "xxxxxxx",
        "_type" : "_doc",
        "_id" : "WRn-1W8Bh5qsL8eVO5Ao",
        "_score" : 3.7232523,
        "_source" : {
           "talentId" : "yyyyyyy"
        }
    }

This is more information than I need. I only need the "_score" and "_source" fields.

Having the extra information slows things down:

  • Elasticsearch has to serialize information to JSON that is never used

  • It takes longer to transmit the response (I'm retrieving chunks of 10,000 hits at a time)

  • It takes longer on my side to parse the larger JSON

How can I reduce the Hits to:

    {
        "_score" : 3.7232523,
        "_source" : {
           "talentId" : "yyyyyyy"
        }
    }

I'm using the .NET low-level driver

-Thanks in advance
David

As you want _source, Elasticsearch is already reading data from disk.

If the field talentId is indexed as keyword (or a talentId.keyword sub-field exists), you can help Elasticsearch avoid decoding the whole _source by relying on doc values only.
Elasticsearch will then read the values from the Lucene doc values instead of accessing _source, and it won't decode _id either.

This can be done using:

GET <your_index>/_search
{
  "stored_fields": "_none_",
  "docvalue_fields": [ "talentId.keyword"]
}

The format of the hits will be:

      {
        "_index" : "xxxxxxx",
        "_type" : "_doc",
        "_score" : 1.0,
        "fields" : {
          "talentId.keyword" : [
            "yyyyyyy"
          ]
        }
      }, ...
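
If you're on the low-level .NET client, the same request can be sent as a raw string body. A minimal sketch (I'm assuming Elasticsearch.Net 6.x here, and "<your_index>" is a placeholder for your index name):

// sketch: same stored_fields/docvalue_fields request through the low-level client
var body = @"{
    ""stored_fields"": ""_none_"",
    ""docvalue_fields"": [ ""talentId.keyword"" ]
}";
var response = client.LowLevel.Search<StringResponse>("<your_index>", body);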

You can also use the filter_path parameter to remove some elements from the response:

GET <your_index>/_search?filter_path=hits.hits.fields,hits.hits._score,hits.total
{
  "stored_fields": "_none_",
  "docvalue_fields": [ "talentId.keyword"]
}

Result:

{
  "hits" : {
    "total" : {
      "value" : ...,
      "relation" : "gte"
    },
    "hits" : [
      {
        "_score" : 1.0,
        "fields" : {
          "talentId.keyword" : [
            "yyyyyyy"
          ]
        }
      }, ...

It is also possible to use filter_path as follows to return only _score and _source,
but I would not expect huge benefits from this approach.

GET <your_index>/_search?filter_path=hits.hits._source,hits.hits._score,hits.total
{
...
}
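
On the .NET side, filter_path is just a query-string parameter. Assuming your version of the low-level client exposes FilterPath on the request parameters (recent versions do; otherwise you can append ?filter_path=... to the URL yourself), it would look something like this sketch:

// sketch: filter_path set via SearchRequestParameters (hypothetical body)
var body = @"{ ""query"": { ""match_all"": {} } }";
var response = client.LowLevel.Search<StringResponse>("<your_index>", body,
    new SearchRequestParameters
    {
        FilterPath = new[] { "hits.hits._score", "hits.hits._source", "hits.total" }
    });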

If you're scanning an index, please use the scroll API.
The ScrollAll helper is available only on the high-level .NET client (NEST).
Another example is available here.


It seems that no matter which approach I take, the reply size isn't dramatically reduced.

At the end of your reply you mentioned the scroll API.

I've been trying to use it with zero success.

My code builds a query string from user input (which is an arbitrary Boolean search expression). I can't use the DSL format for NEST (at least I haven't been able to figure out how to build a query dynamically as I parse a Boolean expression).

In all the examples of Scroll (or ScrollAll) I don't see where the iteration on the scroll is being done.

I'm looking for an example (.NET) of sending a pre-constructed arbitrary query string to elasticsearch, while turning on scrolling, and then showing how to retrieve all the data. Seems like a simple enough thing to do but no such examples (or I'm missing the obvious).

Thanks in advance.

The suggestions proposed reduce the response from Elasticsearch to the fields you requested.
Even better, the first one avoids having Elasticsearch deserialize _source from disk (when using doc values).

What are the errors? What are the problems with it?

If you have a Lucene query string, you can use Query String or Simple Query String.

@Russ_Cam posted an example here:

This is just a skeleton and might not work as-is:

var client = new ElasticClient();

// number of slices in slice scroll
var numberOfSlices = 4;

var scrollObserver = client.ScrollAll<MyObject>("1m", numberOfSlices, s => s
    .MaxDegreeOfParallelism(numberOfSlices)
    .Search(search => search
        .Index("test")
        .Type("one")
        .Query(q => q.QueryString(qs => qs.Query("hello world")))
    )
);

scrollObserver.Wait(TimeSpan.FromMinutes(60), r =>
{
    // do something with documents from a given response
    var documents = r.SearchResponse.Documents;
});

This is along the lines of what I was hoping for, BUT...

[1] My queries are dynamically constructed (based on user input). My query is not a simple 'hello world' test against a particular field in a document. A typical query for me might look like:

{
    "_source" : "talentId",
    "from" : 0,
    "size" : 10000,
    "query" : {
        "bool" : {
            "must" : [
                {
                    "terms" : {
                        "talentId" : [
                            "-1Nb66Le1Ag", "-3JBeHIE1Qg", "-4f8N7Ha1Ag",
                            "-46EAg6r1Ag", "-64epcYD1Qg", "-5m_tFVt1Ag",
                            "-5RLUpXc1Ag", "-6pNEKtC1Qg"
                        ]
                    }
                },
                {
                    "match_phrase" : {
                        "freeText" : "java"
                    }
                }
            ]
        }
    }
}

All the stuff inside the query node is dynamically constructed. In reality, the talentId node might have 1M+ values (representing which documents I want to search for the match_phrase).

Note also that I'm not asking for the documents found, but rather just the _source.

[2] Through trial and error (and a lot of googling) I've got this much working:

    var scroll = client.LowLevel.Search<StringResponse>(SearchIndex, queryText,
        new SearchRequestParameters()
        {
            Scroll = TimeSpan.FromSeconds(10)
        });

    //-- parse the JSON string response into an in-memory representation
    JsonObject jParsed = JsonParser.ParseString(scroll.Body).AsJsonObject;

    //-- extract the scroll id
    string scrollId = jParsed["_scroll_id"].AsString;

    //-- request the data each scroll id holds
    while (!string.IsNullOrEmpty(scrollId))
    {
        //-- get the scroll segments;
        //!! MY expectation is that this will return the
        //   hits data and a continuation scroll
        //!! BUT it fails with a 400 error
===>    var scroll2 = client.LowLevel.Scroll<StringResponse>(scrollId);

        jParsed = JsonParser.ParseString(scroll2.Body).AsJsonObject;

        //-- iterate through the results
        //TODO

        scrollId = jParsed["_scroll_id"].AsString;
    }

I'm getting this error when trying to get the hits for each scroll id:

{
    "error" : {
        "root_cause" : [
            {
                "type" : "illegal_argument_exception",
                "reason" : "Failed to parse request body"
            }
        ],
        "type" : "illegal_argument_exception",
        "reason" : "Failed to parse request body",
        "caused_by" : {
            "type" : "json_parse_exception",
            "reason" : "Unrecognized token 'DnF1ZXJ5VGhlbkZldGNoDwAAAAAMCNnFFnB4d2lOMDRBUnQ2RndEUUZCOHoyMlEAAAAADAjZxhZweHdpTjA0QVJ0NkZ3RFFGQjh6MjJRAAAAAAwImB8WemVGQU9LT3NSREdzWUJqU2dGcVdMZwAAAAAMCJgeFnplRkFPS09zUkRHc1lCalNnRnFXTGcAAAAADAiV7hZvNkIxMGxxalNZT0VZMlZITkZpQ2Z3AAAAAAwImCAWemVGQU9LT3NSREdz...': was expecting ('true', 'false' or 'null')\n at [Source: org.elasticsearch.common.bytes.BytesReference$MarkSupportingStreamInputWrapper@5f661aad; line: 1, column: 257]"
        }
    },
    "status" : 400
}

So I presume the call is incorrect:

client.LowLevel.Scroll<StringResponse>(scrollId)

I've been struggling with this for days! (googling for examples has not helped :-()
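
From the json_parse_exception (it chokes on the raw DnF1ZXJ5... token), my best guess so far is that the scroll endpoint expects a JSON body wrapping the scroll id, rather than the bare id string. Something like this, perhaps (untested; PostData.Serializable is the low-level client's helper for serializing a body object):

// guess: wrap the scroll id in a JSON body instead of sending it bare
var scroll2 = client.LowLevel.Scroll<StringResponse>(
    PostData.Serializable(new { scroll = "10s", scroll_id = scrollId }));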

By user input I meant a search box with Lucene syntax.

Constructing a boolean query with a dynamic list of terms and a match phrase is a pure programming question, unrelated to the client.

Terms within a query cannot be 1M+, as they're limited to 65,536 by default (doc); that cap comes from the index.max_terms_count index setting.

Probably I'm missing something about how you're indexing the documents.
The _source is the original document you've indexed,
so "documents" and _source are probably the same thing.

Which version of the .NET / NEST client are you using?

The documents have basically two fields:
[1] A unique field, talentId, which we supply when we store the document (an 11-char field)

[2] A text field, freeText, which is the field we ultimately want to search (1K to 100K chars on average)

We currently have 15M such documents, and plan to grow to over 200M. We used to store them in MongoDB, but searching free text with Mongo is pathetically slow.

We are trying to build an engine that can search the free text field for any Boolean expressions of user-supplied terms (ex: "python OR C# AND NOT (lisp OR java)")

In all cases I know which subset of documents to search. It may be just a few, or very many (well over 64K, perhaps even 1M+).

I construct a query string from the user-supplied terms. That works well. I agree with you that this is an unrelated issue, but I included it in the description to make it obvious that all the DSL examples I see are really not applicable to my project (I don't know the form/content of the query string until run time, and I don't see how to construct a fluent-style query as I parse the user input).

Regardless, for running the search, I have two choices:
[a] I can search all documents (no explicit document selection)

[b] Supply a list of documents to search

For now, I choose, at run time, between [a] and [b] based on how many documents I need to search. My thinking is that sending a list of a million talent ids over the wire (corresponding to the documents I want to search) is not wise; better to just search the whole document collection.

I am surprised at the 64K limit you mentioned.

In the future, when we get to 200M documents, I'd hate to think that I would be doing a whole collection search when I only want to search 100K documents.

So I'm down to two questions:

[1] How can I specify which documents to search when I want to restrict the searches to a set of documents where the set size may be 1M+?

[2] How can I iterate the search results using scrolling (essentially a cursor)?

Through an ungodly amount of trial and error (and incredible frustration) I'm making some headway on #2, but I would love to see what a good solution (rather than one by a newbie like me) looks like.

-Regards
David

If the talentId is unique, you might be interested in also using that talentId as the document _id. It would allow direct access without searching.
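
For example, with the low-level client (a sketch; the index name and id are placeholders):

// direct lookup by _id, no search involved
var doc = client.LowLevel.Get<StringResponse>("<your_index>", "_doc", "yyyyyyy");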

So, as I was saying, with the Lucene query string syntax you can take the user's query as-is and put it in a query_string query (as is done in the Kibana search bar, if you've ever used it).

The translation into an HTTP request would be:

GET <yourindex>/_search
{
    "query": {
        "query_string" : {
            "query" : "python OR C# AND NOT (lisp OR java)",
            "default_field" : "content"
        }
    }
}

I would suggest tagging documents instead of specifying a terms list to declare the subset to search in.
As I said, restricting the number of documents using a terms query with so many terms seems sub-optimal.

I am wondering whether, behind those talentId values, there is any tagging or grouping you could do.
For example, they belong to some group or share some property.
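
To illustrate, with a hypothetical groupId tag field indexed on each document, the subset becomes a single term filter instead of a huge terms list (a sketch; esClient, index, Source and groupId are placeholders for your client, index, document type and tag field):

// sketch: one term on a hypothetical "groupId" field instead of a million ids
var response = esClient.Search<Source>(s => s
    .Index(index)
    .Query(q => q
        .Bool(b => b
            .Filter(f => f.Term("groupId", "group-42"))
            .Must(m => m.MatchPhrase(mp => mp.Field("freeText").Query("java"))))));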

I mean, suppose you had no limitation: how would a user select 1.5M talentId values to be searched on?

Using the example provided by Russ linked a few messages ago, here is another one:

esClient.ScrollAll<Source>("5m", parallelism, s => s
    .MaxDegreeOfParallelism(parallelism)
    .Search(search => search
        .Index(index)
        .Source(true)
        .Size(1000)
        .Query(q => q
            .Bool(b => b
                .Filter(bf => bf
                    .Terms(t => t
                        .Field(p => p.talentId)
                        .Terms("-1Nb66Le1Ag", "-3JBeHIE1Qg", ...)))
                .Must(mu => mu
                    .QueryString(qs => qs.Query("python OR C# AND NOT (lisp OR java)"))))))
).Wait(TimeSpan.FromMinutes(5), r =>
{
    // Do something with r.SearchResponse.Documents
    // (each response contains part of the documents, in this case 1000 at a time)
});

This might be a good approach if the number of terms starts to be "big".
