Fetching specific fields in ES 5.X Java client API


(Teemu Kanstrén) #1

Hello,

I am still working on upgrading to ES 5.X. This time the problems are in my Java client API code.

In ES 2.X this would return specific field data as expected:

  public List<String> getNames(String index, String type) {
    List<String> names = new ArrayList<>();
    SearchResponse scrollResp = client.prepareSearch()
        .setIndices(index)
        .setTypes(type)
        .addFields("my_name")
        .setQuery(QueryBuilders.matchAllQuery())
        .setSize(10).execute().actionGet();
    for (SearchHit hit : scrollResp.getHits().getHits()) {
      Map<String, SearchHitField> fields = hit.getFields();
      String name = fields.get("my_name").value();
      names.add(name);
    }
    return names;
  }

But in 5.X the addFields() method seems to have disappeared; there is only addStoredFields(). Since my mapping does not define the field as stored, I assume that is why changing the code above to addStoredFields() returns zero fields. What is the correct way to query for specific field values in 5.X?

I also posted this question on StackOverflow and got a reply pointing to another question, about ES 1.5, with a different way to do this. But that was for 1.5, the code above worked in 2.X, and now we are on 5.X... :confused:

Cheers,
T


(David Pilato) #2

You need to use stored fields instead.

https://www.elastic.co/guide/en/elasticsearch/reference/current/breaking_50_search_changes.html#_literal_fields_literal_parameter

If you don't store your fields, then you need to extract the fields you want yourself from _source, which is stored by default.
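A minimal sketch of that second option, using the field name from the question above as a placeholder; a plain Map stands in for a real hit.getSourceAsMap() result, since the cast is the same either way:

```java
import java.util.Map;

// Sketch of reading a field out of _source instead of stored fields.
// In the 5.x client you would narrow the fetch with
// setFetchSource("my_name", null) and then call hit.getSourceAsMap();
// here a plain Map stands in for that result. The field name "my_name"
// is the placeholder from the question above.
public class SourceFieldExtract {

  // getSourceAsMap() returns Map<String, Object>, so values need a cast.
  static String extractName(Map<String, Object> source) {
    return (String) source.get("my_name");
  }

  public static void main(String[] args) {
    Map<String, Object> source = Map.of("my_name", "example");
    System.out.println(extractName(source));
  }
}
```

Note that setFetchSource() only filters what comes back over the wire; the whole _source document is still loaded on the Elasticsearch side.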


(Teemu Kanstrén) #3

OK, thanks for the clarification. Just a few questions to make sure I get the concept right.

To get the specific fields I should either set them to store:true and use the addStoredFields() as in my code above, or use source filtering, which in the Java API is done with setFetchSource() as in the SO post I linked in my initial post in this thread?

The choice whether to use store:true or source filtering would depend on whether I have a set of smaller fields that I want to retrieve separately as per the ES documentation?

So source filtering still loads the whole source, but with stored fields I can load only a specific set of fields? And if the source has no huge fields in it, source filtering is fine, otherwise I should use store:true? But store:true has the impact of requiring more separate disk accesses, and thus some performance impact?


(David Pilato) #4

That is correct.

But the store:true has the impact of requiring more separate disk access, and thus some performance impact?

I believe this part could be negligible, though. But it probably depends on the exact use case and on the amount of data that needs to be stored in the field.


(Teemu Kanstrén) #5

OK, I was just thinking that the docs quite strongly recommend not setting fields to stored; I thought it was for performance reasons. But you are likely correct, and for my purposes it is fine. I can always re-index if I need to re-map. Seems I end up doing that quite often anyway.

But another question.

I have a field in mapping defined as

"doc_date": {
  "type": "date",
  "format": "strict_date_optional_time||epoch_millis",
  "store": true
},

I index values into this field as Long values, in the map I pass to my es.indexData() method in the Java API.

In 2.X this used to work for search results

  SearchHit hit = ...
  Map<String, SearchHitField> fields = hit.getFields();
  long time = fields.get("doc_date").value();

But with 5.X API this throws java.lang.ClassCastException: java.lang.String cannot be cast to java.lang.Long and looking at what "fields.get("doc_date").value();" returns, it is now a String of format such as "2014-11-10T15:00:00.000Z".

If I curl the type schema, "doc_date" is shown as type "date" and stored. If I curl the document itself, "doc_date" comes back as a long numeric value, as expected. But the Java API behaves differently.

Is this some change in the API I have missed again? What is now the correct method to get the long value from a date field?

T


(Jörg Prante) #6

You did not index a long, but a java.util.Date or java.time.Instant?

One method to get the epoch millis back is to use the Java Time API

Instant instant = Instant.parse(fields.get("doc_date").value());
long millis = instant.toEpochMilli();
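Wrapped up as a runnable snippet (the sample timestamp pair is the one that appears later in this thread):

```java
import java.time.Instant;

// Converts the ISO-8601 string that ES returns for a stored date field
// back into epoch millis via java.time.Instant.
public class DateFieldParse {

  static long toMillis(String esDateString) {
    return Instant.parse(esDateString).toEpochMilli();
  }

  public static void main(String[] args) {
    // 2016-12-11T10:01:41.749Z <-> 1481450501749 (pair taken from this thread)
    System.out.println(toMillis("2016-12-11T10:01:41.749Z")); // prints 1481450501749
  }
}
```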

See also https://gist.github.com/jprante/cd511083d7ee38beb621adeef3053d3b for a full Groovy example (which is very close to Java).


(Teemu Kanstrén) #7

Yeah, sorry, I see I did not properly show my indexing code above. Shouldn't post when too tired.. :slight_smile:

I believe I do index a long, and it used to work.. But I can certainly use the workaround you suggested. Thanks.

It just seems a bit of a waste if ES stores the date as a long, then on every query converts it to a string and gives me the string, and I parse it back to a long..

Here is some better example code to illustrate my question:

public class DateTest {
  public static void main(String[] args) throws Exception {
    Settings settings = Settings.builder().put("cluster.name", "elasticsearch").build();
    Client client = new PreBuiltTransportClient(settings)
        .addTransportAddress(new InetSocketTransportAddress(InetAddress.getByName("localhost"), 9300));

    //create index and mapping if needed
    IndicesExistsResponse res = client.admin().indices().prepareExists("doc_test_index").execute().actionGet();
    if (!res.isExists()) {
      CreateIndexRequestBuilder reqBuilder = client.admin().indices().prepareCreate("doc_test_index");
      String testSchema = "{\n" +
          "  \"properties\": {\n" +
          "    \"stuff\": {\n" +
          "      \"type\": \"text\",\n" +
          "      \"index\": \"analyzed\"\n" +
          "    },\n" +
          "    \"doc_date\": {\n" +
          "      \"type\": \"date\",\n" +
          "      \"format\": \"strict_date_optional_time||epoch_millis\",\n" +
          "      \"store\": true\n" +
          "    }\n" +
          "  }\n" +
          "}";
      reqBuilder.addMapping("test_mapping", testSchema);
      reqBuilder.execute().actionGet();

      //insert example data
      Map<String, Object> map = new HashMap<>();
      map.put("stuff", "just something");
      map.put("doc_date", System.currentTimeMillis());
      IndexResponse response = client.prepareIndex("doc_test_index", "test_mapping", "1")
          .setSource(map)
          .get();
    }

    SearchResponse scrollResp = client.prepareSearch()
        .setIndices("doc_test_index")
        .setTypes("test_mapping")
        .addStoredField("doc_date")
        .setQuery(QueryBuilders.matchAllQuery())
        .setSize(10).execute().actionGet();
    for (SearchHit hit : scrollResp.getHits().getHits()) {
      Map<String, SearchHitField> fields = hit.getFields();
      Object docDate = fields.get("doc_date").value();
      System.out.println(docDate);
    }
  }
}

Sorry for the lengthy spam, but that hopefully illustrates my question better.

For me this prints out (from the System.out in the for loop over the results)

2016-12-11T10:01:41.749Z

Now if I run curl:

curl -XGET 'http://localhost:9200/doc_test_index/test_mapping/1'

I get:

{"_index":"doc_test_index","_type":"test_mapping","_id":"1","_version":1,"found":true,"_source":{"doc_date":1481450501749,"stuff":"just something"}}

It seems to me that I store the value as a long, curl gets it as a long, and the Java API used to get it as a long but now gets it as a formatted date string.

So this is what I was wondering: is this some change in the Java API, and is there some way to get the long without going through all the conversions?

For now, I guess I will go with the proposed parsing so I get it to work but would be nice to understand what changed..

Thanks,
T


(Jörg Prante) #8

The document source contains the content you sent over the API, so you get the long value back when you sent a long. In Java, you can extract those values with hit.sourceAsMap().get(...)

Accessing a field of a specific type over the Java API is controlled by the value conversions of the Elasticsearch mapping specification. So you will see the formatted date string because of the date type in the mapping.

If you index dates before Jan 1, 1970, e.g.

field("doc_date", -9184838400000 )

you will get a negative long when applying toEpochMilli(), as specified by the java.time.Instant class.

SearchResponse response = client.prepareSearch()
  .setQuery(matchAllQuery())
  .addStoredField("_source")
  .addStoredField("doc_date")
  .execute().actionGet()

println response

response.hits.hits.each { hit ->
    println hit.sourceAsMap().get("doc_date")  // numeric
    println hit.fields.get("doc_date").value() // date string
    Instant instant = Instant.parse(hit.fields.get("doc_date").value())
    long millis = instant.toEpochMilli() // parsed date
    println millis
}

This will give a result like

{"took":1,"timed_out":false,"_shards":{"total":5,"successful":5,"failed":0},"hits":{"total":1,"max_score":1.0,"hits":[{"_index":"test","_type":"test","_id":"1","_score":1.0,"_source":{"doc_date":-9184838400000},"fields":{"doc_date":["1678-12-11T00:00:00.000Z"]}}]}}

Note the difference between _source and doc_date.
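That difference can be double-checked in plain Java: the long in _source and the formatted string in the fields section of the response above describe the same instant.

```java
import java.time.Instant;

// The _source long and the stored-field date string from the response
// above are two representations of one instant: parsing the string
// yields the original (negative) epoch millis.
public class NegativeEpochDate {

  static boolean sameInstant(long epochMillis, String dateString) {
    return Instant.parse(dateString).toEpochMilli() == epochMillis;
  }

  public static void main(String[] args) {
    System.out.println(sameInstant(-9184838400000L, "1678-12-11T00:00:00.000Z")); // prints true
  }
}
```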


(Teemu Kanstrén) #9

So the reason why using addField("doc_date") in ES 2.X gives me the raw long value is because addField() in 2.X performs implicit source filtering in the background without conversions? And in 5.X the addStoredField() does not load full source but just the defined fields, and then performs the ES mapping specification conversions on them?

My options in 5.X then are to either use explicit source filtering or stored fields? Source filtering requires loading the full source, including any large value fields such as document contents, but allows access to raw long? Stored fields allow loading only specific fields but require me to convert the date if I want the long?


(Jörg Prante) #10

There is no "raw long" any more in ES 5.X for a date field.

Dates are stored in a packed numeric byte-encoded format, for instance, to be able to execute fast numeric range queries. See also https://www.elastic.co/blog/lucene-points-6.0 and org.elasticsearch.index.mapper.DateFieldMapper. ES will do the conversion for you (with the help of Joda, but that does not matter); even the formatted date string is the result of a conversion, it is not the native format in which a date is stored in Lucene.

To get a long value from ES from a date field, you need extra conversion.

If you pass a long value over the API to instruct ES to use it as the input value for a date field, you can, vice versa, extract that input value from the doc source again. This does not depend on the internal date representation, but yes, it comes at the cost of loading the document source.


(Teemu Kanstrén) #11

OK, thanks a lot, this clears it all up for me. I will just use the conversion you proposed.

Maybe some update on the docs would also clarify:

For 5.1 in:

https://www.elastic.co/guide/en/elasticsearch/reference/current/date.html

"Internally, dates are converted to UTC (if the time-zone is specified) and stored as a long number representing milliseconds-since-the-epoch."

Also, if "store" has no big penalty, maybe this could be clarified as well:

https://www.elastic.co/guide/en/elasticsearch/reference/current/search-request-stored-fields.html

"The stored_fields parameter is about fields that are explicitly marked as stored in the mapping, which is off by default and generally not recommended."

So one could understand why "store": true is generally not recommended, etc. (or does that sentence refer to "store" or to "stored_fields"?).

But you have answered all my questions to details, so thank you for the information and patience..

Cheers,
Teemu


(system) #12

This topic was automatically closed 28 days after the last reply. New replies are no longer allowed.