Multi_field with collection value using Java api


(fletch) #1

Hi all,

I'm coming across some odd behavior and wondering if I'm missing something
or if it's a bug.

I created an index (see script below), and added some documents. There is
a mapping for a multi_field type, and objects of type java.util.List are
used as values. When I pull back a single document using curl, the json
looks like I would expect. However, when I do a simple bool query using
the Java api against the same index, the multi_field property is nested
inside an extra ArrayList. This ArrayList has a single entry, which is the
list I expect. Where did the extra wrapping collection come from? I know
it has to do with using the multi_field type, because when I map the
property directly to a string type it works as I would expect. And the
returned json doesn't change when I use curl...

Please let me know if any of the above needs clarification... Java objects
are not as easy to copy and paste as JSON. :)

Thanks in advance for any insights.
Matt

Index creation script:
curl -XPUT http://server:9200/photoindex1/ -d @photoIndex1.json

photoIndex1.json (abbreviated):
{
  "settings" : {
    "analysis" : {
      "analyzer" : {
        "stemmingloweringanalyzer" : {
          "type" : "custom",
          "tokenizer" : "lowercase",
          "filter" : ["porterStem"]
        }
      }
    }
  },
  "mappings" : {
    "photoType" : {
      "_source" : {
        "enabled" : true
      },
      "_all" : {
        "analyzer" : "default",
        "enabled" : true
      },
      "properties" : {
        "photoId" : {
          "type" : "integer",
          "store" : "yes",
          "index" : "no",
          "include_in_all" : false
        },
        "tags" : {
          "type" : "multi_field",
          "fields" : {
            "tags" : {
              "type" : "string",
              "index_name" : "tag",
              "store" : "no",
              "index" : "analyzed",
              "analyzer" : "stemmingloweringanalyzer",
              "include_in_all" : true,
              "boost" : 0.7
            },
            "untouched" : {
              "type" : "string",
              "index_name" : "tag",
              "store" : "yes",
              "index" : "not_analyzed"
            }
          }
        }
      }
    }
  }
}

Get a document using curl (doesn't change if I reindex without the
multi_field type and instead have a mapping to string type):
curl -XGET http://dsearchin001:9200/photoalias/photoType/40
{
  "_index":"photoindex1",
  "_type":"photoType",
  "_id":"40",
  "_version":1,
  "exists":true,
  "_source" : {
    "tags": ["white kitchen cabinets","light brown granite counter tops","recessed can lights","stainless steel refrigerator","bright"],
    "photoId":40
  }
}

Java code that I would expect to work....
protected List extractListField(String fieldName, SearchHit hit, Class genericClass)
{
    Map<String, SearchHitField> fieldMap = hit.getFields();
    SearchHitField hitField = fieldMap.get(fieldName);
    List list = Lists.newLinkedList();
    if (hitField != null && hitField.getValues() != null
            && !hitField.getValues().isEmpty())
    {
        Collection values = (Collection) hitField.getValues();
        list.addAll(values);
    }
    return list;
}

Java code I actually have to use...
protected List extractListField(String fieldName, SearchHit hit, Class genericClass)
{
    Map<String, SearchHitField> fieldMap = hit.getFields();
    SearchHitField hitField = fieldMap.get(fieldName);
    List list = Lists.newLinkedList();
    if (hitField != null && hitField.getValues() != null
            && !hitField.getValues().isEmpty())
    {
        // Note the use of getValue() instead of getValues()
        Collection values = (Collection) hitField.getValue();
        list.addAll(values);
    }
    return list;
}
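
For anyone who hits the same thing, here is a minimal sketch of a helper that copes with both shapes, i.e. values that come back as a flat list and values that come back wrapped in an extra collection. It is plain Java with no Elasticsearch types; the class name FieldValues and the method flatten are made up for illustration, and it assumes that dropping null entries is acceptable.

```java
import java.util.ArrayList;
import java.util.Collection;
import java.util.List;

// Hypothetical helper (not part of Elasticsearch): normalizes field values
// that may or may not be wrapped in an extra collection.
final class FieldValues {

    private FieldValues() {
    }

    /**
     * Copies the given values into a flat list. Any element that is itself
     * a collection is unwrapped recursively; null elements are skipped.
     */
    static List<Object> flatten(Collection<?> values) {
        List<Object> flat = new ArrayList<Object>();
        if (values == null) {
            return flat;
        }
        for (Object value : values) {
            if (value instanceof Collection) {
                // Nested collection: unwrap it instead of adding it as one element.
                flat.addAll(flatten((Collection<?>) value));
            } else if (value != null) {
                flat.add(value);
            }
        }
        return flat;
    }
}
```

With something like this, the extraction method could call FieldValues.flatten(hitField.getValues()) whether or not the field is mapped as multi_field, instead of switching between getValues() and getValue().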

--


(Clinton Gormley) #2

Hi Fletch

The JSON that you send to ES is stored in the _source field, and that
exact same JSON is returned to you when you request a document (via GET
or search or whatever).

This is distinct from the values that are indexed for that document.

A multi-field (e.g. you have a field 'title' and add the sub-fields
'untouched' and 'autocomplete') results in the value from the 'title'
field being indexed into all sub-fields:

  • title.title (the main field, also accessible as 'title')
  • title.untouched
  • title.autocomplete

This is the reason you're seeing a difference, and it is not a bug. It
is like this by design.
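
As a rough mental model only (plain Java, not Elasticsearch code, with made-up names MultiFieldModel and indexedFields), the fan-out from one source value to several indexed sub-fields can be sketched like this. It ignores the fact that each sub-field is analyzed and stored according to its own mapping.

```java
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical illustration: one source value is copied to every sub-field
// of a multi-field under the name "<field>.<subField>".
final class MultiFieldModel {

    private MultiFieldModel() {
    }

    static Map<String, Object> indexedFields(String field, Object sourceValue,
                                             List<String> subFields) {
        // LinkedHashMap keeps the sub-fields in the order they were declared.
        Map<String, Object> indexed = new LinkedHashMap<String, Object>();
        for (String subField : subFields) {
            indexed.put(field + "." + subField, sourceValue);
        }
        return indexed;
    }
}
```

Calling indexedFields("title", value, Arrays.asList("title", "untouched", "autocomplete")) yields the three keys listed above, all pointing at the same value, while the _source JSON itself is left untouched.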

clint

--


(fletch) #3

Hi Clint,

Thanks for the reply. I believe I understand the difference between what
is stored in _source and what is indexed. What I'm really trying to get at
is that when I build up a query in Java, the SearchHitField comes back with
a different value depending on whether I map the field as multi_field or
not, while the JSON response (running exactly the same query, obtained from
toString() on the ES SearchRequestBuilder instance) does not change at all.
It just seems odd to me that one data representation changes and the other
doesn't...

Sorry if I didn't explain this well the first time.
Matt

PS - Here's the JSON representation of the query I'm running, in case that
makes a difference:
{
  "size" : 1,
  "query" : {
    "bool" : {
      "should" : [ {
        "field" : {
          "primarySpaceDesc" : {
            "query" : "Attic",
            "boost" : 5.0
          }
        }
      }, {
        "field" : {
          "secondarySpaceDesc" : "Attic"
        }
      } ]
    }
  },
  "fields" : [ "description", "displayTypeId", "photoId", "segmentDesc",
    "segmentId", "graphicId", "largeImageFilename", "mediumImageFilename",
    "primarySpaceDesc", "primarySpaceId", "primaryStyleDesc", "primaryStyleId",
    "secondarySpaceDesc", "secondarySpaceId", "secondaryStyleDesc",
    "secondaryStyleId", "smallImageFilename", "tags" ]
}

--

