Which schema is better?

I have two options for my Elasticsearch schema.

Option 1.
One entry per image. The key is the image's unique ID and the value is one long string containing the entire list of numbers.

"visualwords" : {
    "image_id" : { "type" : "text", "store" : true },
    "numbers" : { "type" : "text", "store" : true }
}

Option 2.
There will be around 970 entries per image on average. Storing integers instead of a string may reduce the storage space needed. Multiple entries can be stored in bulk with a single request.

"visualwords" : {
    "image_id" : { "type" : "text", "store" : true },
    "number" : { "type" : "integer", "store" : true }
}

For background, I have a bunch of image_ids, and each image_id has about 970 vector numbers such as 125, 156, 2303.

The use case is to get an image_id by searching by vectors. If [123, 124, 125] is the input, the image_id with the most matches should be returned.

I have seen some people put all of the numbers into a single string with a space delimiter.
Alternatively, I can store each number as an integer in its own entry, the traditional way.

I think each option has pros and cons in terms of search latency and storage space.
What do you think about these options, and which would you choose?

Numeric types in Elasticsearch are optimised for range queries. Unless you want to query for ranges of numbers, e.g. "range 100 to 200", a numeric type is probably not the right data structure.

If you map the data as a keyword field, you'd need to supply the numbers in a JSON array to pass multiple values. With a text field, you'd need to use a single string and make sure the associated analyzer splits the contents on whatever delimiter you use (whitespace or commas) and preserves all of the values, i.e. keeps the numbers. The keyword type feels like the more natural option.
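To make the keyword option concrete, here is a minimal sketch in Python that only builds the request bodies as plain dicts. The field name "words" is borrowed from the query later in this thread, the helper names are hypothetical, and the mapping layout assumes the typeless format used by Elasticsearch 7 and later (older versions nest the properties under a type name).

```python
import json


def build_keyword_mapping():
    """Index body mapping both fields as keyword (assumed ES 7+ typeless layout)."""
    return {
        "mappings": {
            "properties": {
                "image_id": {"type": "keyword"},
                "words": {"type": "keyword"},
            }
        }
    }


def build_document(image_id, words):
    """One document per image; the visual words go in as a JSON array of strings."""
    return {"image_id": image_id, "words": [str(w) for w in words]}


if __name__ == "__main__":
    print(json.dumps(build_keyword_mapping(), indent=2))
    print(json.dumps(build_document("P1", [123, 125, 235]), indent=2))
```

Each array element is indexed as its own exact term, so no analyzer or delimiter handling is involved.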

The use case is to get image_id by searching by vectors. If [123, 124, 125] is an input, an image_id which has the most matches should be returned

Is the IDF (rarity) of these terms important to you? If so, avoid the terms query, which assumes all provided terms are of equal value, and instead opt for a bool query with a should array of individual term queries.
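The two query shapes mentioned here can be sketched as follows. This is only an illustration in Python that builds the request bodies; the helper names and the "words" field name are assumptions, not something prescribed by Elasticsearch.

```python
import json


def terms_query(field, words):
    """Single terms query: every listed term is treated as equally valuable."""
    return {"query": {"terms": {field: [str(w) for w in words]}}}


def should_term_query(field, words, minimum_should_match=1):
    """bool/should of individual term queries: each clause is scored separately,
    so rarer terms can contribute more to the final score."""
    clauses = [{"term": {field: str(w)}} for w in words]
    return {
        "query": {
            "bool": {
                "should": clauses,
                "minimum_should_match": minimum_should_match,
            }
        }
    }


if __name__ == "__main__":
    print(json.dumps(terms_query("words", [123, 124, 125])))
    print(json.dumps(should_term_query("words", [123, 124, 125], 2), indent=2))
```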

Thank you for your comment. I am not sure I follow the IDF part, though. First of all, please take a look at my requests below.

curl -XPUT url/vectors -d '{
  "image_id":"P1",
  "words":"123 125 235 ... 10304 50305"
}'
curl -XGET url/vectors -d '
{
  "query": {
    "bool": {
      "should": [
        { "match": { "words": "59796" } },
        { "match": { "words": "60928" } },
        { "match": { "words": "61027" } },
         ...
        { "match": { "words": "62161" } },
        { "match": { "words": "71747" } },
        { "match": { "words": "90993" } }
      ],
      "minimum_should_match": 30
    }
  },
  "_source": ["image_id"],
  "from": 0,
  "size": 10,
  "sort": [
    {
      "_score": {
        "order": "desc"
      }
    }
  ]
}
'

As you might notice, I want to get a list of image_ids in descending order of score, limited to images that match at least 30 of the input words.

My query does what it is supposed to do, but I want to know if there's a better or more efficient way to get the results. Also, the result list from my query currently has a lot of duplicates: if only one image_id meets the requirements, the returned list is filled with that image_id 30 times. Is there a way to reduce the duplicates?

Thanks!

It just means that matches on some words can rank higher than matches on others.
In free-text search, that would mean a search for "aardvark book" scores a document containing the word "aardvark" higher than a document containing only the word "book".
Rarer words score higher in text search, but in image search the inverse may be true. If your "word" token 372 actually represents a blue sky, it is likely to be a common word but still important to score on.
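If rarity should not influence the ranking at all, one possible approach (a swapped-in technique, not something proposed earlier in this thread) is to wrap each term in a constant_score query, so every matching clause adds the same fixed amount and the score becomes the number of matched words. A minimal sketch in Python that builds such a body, with a hypothetical helper name and the "words" field assumed:

```python
import json


def match_count_query(field, words, minimum_should_match=1):
    """bool/should of constant_score clauses: each matching term adds exactly
    1.0 to the score, regardless of how rare or common the term is."""
    clauses = [
        {"constant_score": {"filter": {"term": {field: str(w)}}, "boost": 1.0}}
        for w in words
    ]
    return {
        "query": {
            "bool": {
                "should": clauses,
                "minimum_should_match": minimum_should_match,
            }
        }
    }


if __name__ == "__main__":
    print(json.dumps(match_count_query("words", [59796, 60928, 61027], 2), indent=2))
```

Sorting by _score in descending order then returns the images with the most matching words first.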

Do you want duplicates in your index to begin with?
If not, use your image_id as the Elasticsearch "_id" field, which ensures there is only ever one document per image (like a primary key). By default, Elasticsearch autogenerates a unique ID on insertion.
If for some reason you need duplicates, you can use the "field collapsing" feature to group results by the image_id field.
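Both suggestions can be sketched as request shapes. The Python below only builds the method, path, and body; the helper names are hypothetical, the "/{index}/_doc/{id}" path assumes Elasticsearch 7 or later, and collapsing assumes image_id is mapped as a keyword (or numeric) field, since collapse does not work on analyzed text fields.

```python
import json


def index_request(index, image_id, words):
    """Use image_id as the document _id, so re-indexing the same image
    overwrites the existing document instead of creating a duplicate."""
    body = {"image_id": image_id, "words": [str(w) for w in words]}
    return ("PUT", f"/{index}/_doc/{image_id}", body)


def collapsed_search(query, collapse_field="image_id", size=10):
    """Wrap an existing query so at most one hit per image_id is returned."""
    return {
        "query": query,
        "collapse": {"field": collapse_field},
        "_source": [collapse_field],
        "size": size,
    }


if __name__ == "__main__":
    method, path, body = index_request("vectors", "P1", [123, 125, 235])
    print(method, path, json.dumps(body))
    print(json.dumps(collapsed_search({"match": {"words": "59796"}}), indent=2))
```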
