Search in a matrix-type string field

Hi everybody,

I have an index that defines a field, called "categories", mapped as a string and used like a matrix. Something like this:
{
   "customers": {
      "mappings": {
         "customers_list": {
            "properties": {
               "categories": {
                  "type": "string"
               }
            }
         }
      }
   }
}

Then I have created a document:
{
   "_index": "cmcustomers",
   "_type": "customers_list",
   "_id": "xi_pKX_yTvSIRdZRqdebxQ",
   "_score": 14.122552,
   "_source": {
      "categories": "[{\"first_level\":585,\"second_level\":[1559,2445]},{\"first_level\":987,\"second_level\":[20384]}]"
   }
}

Now I do a search:
GET cmcustomers/customers_list/_search
{
   "query": {
      "range": {
         "categories": {
            "from": "2445",
            "to": "2445"
         }
      }
   }
}

The result does not include the previous document, but if I search for the category 20384 the result is correct.

It looks like when the second_level array has more than one element, the document cannot be found, but when the second_level array has only one element, it is found.

Can someone help me? What am I doing wrong?
A thousand thanks in advance for your support.

Regards

Jorge von Rudno

Hi Jorge,

Your immediate problem is the tokenization process. You can use the Analyze API to inspect what's going on.

First we create an index:

PUT /customers
{
   "mappings": {
      "customers_list": {
         "properties": {
            "categories": {
               "type": "string"
            }
         }
      }
   }
}

Then we can use the Analyze API to see the tokenization in action:

GET /customers/_analyze
{
    "field": "customers_list.categories",
    "text": "[{\"first_level\":585,\"second_level\":[1559,2445]},{\"first_level\":987,\"second_level\":[20384]}]"
}

The response is (I just post a shortened version which highlights the problem):

{
   "tokens": [
      {
         "token": "first_level",
         "start_offset": 3,
         "end_offset": 14,
         "type": "<ALPHANUM>",
         "position": 0
      },
      {
         "token": "585",
         "start_offset": 16,
         "end_offset": 19,
         "type": "<NUM>",
         "position": 1
      },
      {
         "token": "second_level",
         "start_offset": 21,
         "end_offset": 33,
         "type": "<ALPHANUM>",
         "position": 2
      },
      {
         "token": "1559,2445",
         "start_offset": 36,
         "end_offset": 45,
         "type": "<NUM>",
         "position": 3
      },
      ...
      {
         "token": "20384",
         "start_offset": 83,
         "end_offset": 88,
         "type": "<NUM>",
         "position": 7
      }
   ]
}

So, your second_level field gets tokenized by the analyzer of this field as "1559,2445" and not as two separate tokens as you seem to expect. You could maybe change the analyzer of this field but an even better solution is to change your data model.
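Because term queries are matched against the indexed tokens verbatim, you can verify this directly: a term query for the exact token "1559,2445" finds the document, while one for "2445" alone does not. Just as a quick check, assuming the document above has been indexed into this mapping:

GET /customers/customers_list/_search
{
   "query": {
      "term": {
         "categories": "1559,2445"
      }
   }
}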

If you really just have two levels of categories I'd rather store them as two separate fields. Also, I don't understand why you have two numbers on the second level. Nevertheless, I'd suggest changing your mapping to something along these lines:

PUT /customers
{
   "mappings": {
      "customers_list": {
         "properties": {
            "categories": {
               "type": "nested",
               "properties": {
                   "first_level": {
                       "type": "integer"
                   },
                   "second_level": {
                       "type": "integer"
                   }
               }
            }
         }
      }
   }
}

As I really don't know what the second number for the second_level field should be, I simplified the example, but this should get you started. I've also assumed that your keys are integers; you can also use long if you need to.

If you really need the JSON structure in Elasticsearch you can still add it as a not_analyzed field.
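Such a field could look along these lines (the field name categories_json is just made up for illustration); a not_analyzed string is indexed as a single token, so the raw JSON stays intact as one exact value:

"categories_json": {
   "type": "string",
   "index": "not_analyzed"
}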

To complete the mapping example, let's insert two customers:

POST /customers/customers_list/1
{
    "categories": [
        {
            "first_level": 585,
            "second_level": 2445
        },
        {
            "first_level": 987,
            "second_level": 20384
        }
    ]
}
POST /customers/customers_list/2
{
    "categories": [
        {
            "first_level": 600,
            "second_level": 3500
        },
        {
            "first_level": 987,
            "second_level": 20384
        }
    ]
}

And search for the customer with a range query similar to your original one (for second_level = 2445), which should return only the first customer:

GET /customers/customers_list/_search
{
   "query": {
      "nested": {
         "path": "categories",
         "query": {
            "range": {
               "categories.second_level": {
                  "from": 2445,
                  "to": 2445
               }
            }
         }
      }
   }
}

And indeed, only the first customer is returned:

{
   "took": 3,
   "timed_out": false,
   "_shards": {
      "total": 5,
      "successful": 5,
      "failed": 0
   },
   "hits": {
      "total": 1,
      "max_score": 1,
      "hits": [
         {
            "_index": "customers",
            "_type": "customers_list",
            "_id": "1",
            "_score": 1,
            "_source": {
               "categories": [
                  {
                     "first_level": 585,
                     "second_level": 2445
                  },
                  {
                     "first_level": 987,
                     "second_level": 20384
                  }
               ]
            }
         }
      ]
   }
}

If you need to search on multiple levels you can combine the individual queries with a bool query.
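For example, a sketch of a query requiring both first_level 585 and second_level 2445 in the same category object:

GET /customers/customers_list/_search
{
   "query": {
      "nested": {
         "path": "categories",
         "query": {
            "bool": {
               "must": [
                  { "term": { "categories.first_level": 585 } },
                  { "term": { "categories.second_level": 2445 } }
               ]
            }
         }
      }
   }
}

Since the bool query sits inside the nested query, both conditions have to match within the same category object, not just somewhere in the same document.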

Also, the data modelling chapter of the Elasticsearch Definitive Guide should help.

Daniel

Hi Daniel,
first of all, thanks a lot for your complete and clear explanation of the problem.
I welcome your suggestions and have changed the data model, but now I have another problem that I want to ask you about.

I need to reload the data from my database into Elasticsearch. To simplify, I'll just take a few fields and rows.
In the database I have defined a table with these fields and types:

FIELD         TYPE
customer_id   integer
geolocation   text
categories    json     ==> please note that this field is of type json

And I have some records like this:
customer_id   geolocation             categories
829390        12.1711827,47.5872872   [{"first_level":359,"second_level":null}]
786297        16.3641083,48.2855517   [{"first_level":62,"second_level":null}]
854881        14.3415962,46.6132867   [{"first_level":359,"second_level":null}]
535998        6.8753946,51.2854886    null
992614        8.3640076,49.0025001    [{"first_level":585,"second_level":[1559,2445]},{"first_level":987,"second_level":[2}]
840925        16.4138530,48.2708595   [{"first_level":592,"second_level":[20521]},{"first_level":335,"second_level":null}]

I have defined the mapping:
{
   "mappings": {
      "customers_categories": {
         "properties": {
            "customer_id": {
               "type": "integer"
            },
            "geolocation": {
               "type": "geo_point",
               "lat_lon": "true"
            },
            "categories": {
               "type": "nested",
               "properties": {
                  "first_level": {
                     "type": "integer"
                  },
                  "second_level": {
                     "type": "integer"
                  }
               }
            }
         }
      }
   }
}

and the _river (I know that rivers are deprecated, but I use the river only to load the data):
PUT /_river/cat/_meta
{
   "type": "jdbc",
   "max_request_wait": "180S",
   "jdbc": {
      "strategy": "oneshot",
      "url": "jdbc:postgresql://my-psql:5432/customer",
      "user": "my-user",
      "password": "my-password",
      "sql": "SELECT * FROM to_test_categories",
      "max_retries_wait": 300,
      "query_timeout": 5400,
      "index": "categories_v3",
      "type": "customers_categories"
   }
}

When I run the river to load the data, I get this log message:
message [MapperParsingException[object mapping for [customers_categories] tried to parse field [categories] as object, but got EOF, has a concrete value been provided to it?]].

I googled the message, and in principle I understand that the problem is with the object I send in the "categories" field.
I really don't know how to handle this case.
In advance thanks for your help.

Regards Jorge

Hi Jorge,

I see. Your query provides one record to Elasticsearch in the following format:

POST /customers/customers_list/1
{
    "categories": "[{\"first_level\": 585, \"second_level\": 2445}, {\"first_level\": 987,\"second_level\": 20384}]"
}

Whereas based on the mapping it expects:

POST /customers/customers_list/1
{
    "categories": [
        {
            "first_level": 585,
            "second_level": 2445
        },
        {
            "first_level": 987,
            "second_level": 20384
        }
    ]
}

I'd try to extract the structure from the JSON string with the Postgres JSON functions first. If this doesn't help, I fear you have to extract the data from the database yourself and transform it accordingly in some sort of script. An even more heavyweight solution would be Logstash with a JDBC input, an Elasticsearch output and your transformation step in between.
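Just to sketch the first idea (assuming Postgres 9.3 or later; the alias names are made up), something like this would flatten each category element into its own row:

SELECT c.customer_id,
       c.geolocation,
       (elem ->> 'first_level')::int AS first_level,
       elem -> 'second_level'        AS second_level
FROM to_test_categories c,
     json_array_elements(c.categories) AS elem
WHERE c.categories IS NOT NULL;

This still leaves second_level as a JSON array, and you would need to reassemble the rows into the nested structure on the way into Elasticsearch, so treat it only as a starting point.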

Daniel

Hi Daniel,

I have tried to follow your recommendation, but I was unsuccessful, so I would like to apply your first suggestion (creating an analyzer for this field). To do this I have:

  1. changed the settings like:

    {
       "customers": {
          "settings": {
             "index": {
                "refresh_interval": "1000",
                "analysis": {
                   "analyzer": {
                      "separate_categories": {
                         "type": "custom",
                         "tokenizer": "separate_categories"
                      }
                   },
                   "tokenizer": {
                      "separate_categories": {
                         "pattern": ",",
                         "type": "pattern"
                      }
                   }
                },
                "number_of_shards": "3",
                "number_of_replicas": "1",
                "version": {
                   "created": "1030999"
                },
                "uuid": "pa7QLPjzRIGmc5mYGLJl9A"
             }
          }
       }
    }

and changed the mapping. The categories field now looks like:

"cmcustomers": {
"mappings": {
"customers_list": {
"properties": {
"categories": {
"type": "string",
"analyzer": "separate_categories"
}
}
}
}
}

Then I loaded the data and applied the analyzer to a document, but I cannot reach the expected result.

GET /customers/_analyze
{
   "field": "customer_list.categories",
   "text": "[{\"first_level\":228,\"second_level\":[]},{\"first_level\":236,\"second_level\":[238,609]},{\"first_level\":357,\"second_level\":[]},{\"first_level\":1449,\"second_level\":[]},{\"first_level\":20982,\"second_level\":[]}]"
}

The relevant part of the response is:

  {
     "token": "second_level",
     "start_offset": 118,
     "end_offset": 130,
     "type": "<ALPHANUM>",
     "position": 9
  },
  {
     "token": "238,609",
     "start_offset": 134,
     "end_offset": 141,
     "type": "<NUM>",
     "position": 10
  },

What am I doing wrong?

Any suggestions?

Thanks for your support.

Regards Jorge

Hi Jorge,

I understand that you want to avoid the additional complexity of a preprocessing stage (like a custom script or Logstash), but I think sooner or later you will have to deal with it, as rivers are gone in 2.0. Also, I mentioned the analyzer only in order to explain what is going wrong; I don't think going down this route will really help you. Having said all that, what would you expect from the tokenization process (i.e. can you provide an example of what tokens you would expect for a specific input)?

Daniel