Mapping data, tokenizer

Hi

I am new to Elasticsearch and have a few questions about how best to store data for maximum search speed.

In a relational database, I have the following three entities (tables), related like this:
car brand 1:n car model 1:n car type

Example:

Brand          Model                        Type
MERCEDES-BENZ  E-CLASS COUPE (C123)         280 CE
MERCEDES-BENZ  E-CLASS COUPE (C123)         300 C Turbo-D (123.153)
MERCEDES-BENZ  E-CLASS COUPE (C123)         280 C
MERCEDES-BENZ  E-CLASS KOMBI ESTATE (S123)  200 T (123.280)
MERCEDES-BENZ  E-CLASS KOMBI ESTATE (S123)  200 T
MERCEDES-BENZ  E-CLASS COUPE (C124)         230 CE (124.043)
MERCEDES-BENZ  E-CLASS COUPE (C124)         300 CE
MERCEDES-BENZ  E-CLASS COUPE (C124)         300 CE-24 (124.051)

The last entity (table), product, has an n:n relationship with car type.

How should I best store this data in Elasticsearch? The main task is to find products by car.

I am considering the following options:

  • four separate indexes (car brand, car model, car type, product + car type), but in this case I don't know how to link the indexes and write a simple query across all of them
  • a single product index with three separate fields (brand, model, type), where the values in each field are separated by commas
  • a single product index with a single nested field with this structure:
    {
      "car": [
        {
          "brand": "Mercedes",
          "model": [
            {
              "name": "E-CLASS COUPE (C123)",
              "cartype": [
                "280 CE",
                "300 C Turbo-D (123.153)"
              ]
            },
            {
              "name": "E-CLASS KOMBI ESTATE (S123)",
              "cartype": [
                "200 T (123.280)",
                "200 T"
              ]
            }
          ]
        },
        {
          "brand": "BMW" …. Next cars
        }
      ]
    } 

but I also have no idea how to write a query against this structure.

Next question: how should I configure the tokenizer so that a product is found when the input string is, for example, “Mercedes coupe 300 turbo”?

Thank you for any answers.

Jaroslav

Elasticsearch does not support joins between indices, so the first option (four separate, linked indexes) is not possible.

The best way to work with Elasticsearch is often to completely flatten and denormalize the data model and avoid trying to replicate relational concepts using nested documents or parent-child relationships.

You could create one car document per brand/model/type combination, looking something like this:

{
  "brand": "Mercedes",
  "model": {
    "name": "E-CLASS COUPE (C123)",
    "cartype":  "280 CE"
  }
}

You have not shown what information a product contains, but you could either add the products as an array on each car document (if the array is not too large) or denormalize further: create one document per car/product combination and add the product data to it.

Try to store the data so that the document structure aligns with what you will be searching for.

This approach increases storage somewhat, though maybe not as much as you would initially think, and it leads to dramatically simpler and faster queries. If data is updated, multiple documents need to be updated, but this is generally quite fast as long as updates are not very frequent.
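
As a rough illustration of the flattening, here is a minimal Python sketch that turns joined brand/model/type rows into one flat document each. The function name build_car_documents, the products_by_car lookup and the "products" field are assumptions made for this example only; the real product structure has not been shown, so the embedded product is purely illustrative.

```python
def build_car_documents(car_rows, products_by_car):
    """Return one flat document per (brand, model, cartype) row.

    car_rows: iterable of (brand, model, cartype) tuples, i.e. the three
        relational tables already joined together.
    products_by_car: dict mapping a (brand, model, cartype) tuple to a list
        of product dicts (hypothetical shape, assumed for this sketch).
    """
    docs = []
    for brand, model, cartype in car_rows:
        docs.append({
            "brand": brand,
            "model": {"name": model, "cartype": cartype},
            # Products are embedded as an array; an empty list means the
            # car has no associated products.
            "products": list(products_by_car.get((brand, model, cartype), [])),
        })
    return docs


car_rows = [
    ("MERCEDES-BENZ", "E-CLASS COUPE (C123)", "280 CE"),
    ("MERCEDES-BENZ", "E-CLASS COUPE (C123)", "280 C"),
    ("MERCEDES-BENZ", "E-CLASS KOMBI ESTATE (S123)", "200 T"),
]
products_by_car = {
    # Illustrative product only; not taken from real data.
    ("MERCEDES-BENZ", "E-CLASS COUPE (C123)", "280 CE"): [
        {"id": 1, "name": "Wiper"}
    ],
}
car_docs = build_car_documents(car_rows, products_by_car)
```

Each resulting dict could then be indexed as its own document, so a search only ever has to look at one flat document at a time.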

Once the document structure has been determined we can look at mappings to support the queries you need to run.

OK, I understand that I should denormalize all the data.

Here is an example of a simple mapping for products:

{
  "product": {
    "properties": {
      "id": {
        "type": "integer"
      },
      "name": {
        "type": "text"
      },
      "brand": {
        "type": "keyword"
      },
      "warranty": {
        "type": "keyword"
      },
      "cars" : {
        "type": "nested"
      } 
    }
  }
}

Here are two example products:

{
  "id": 1,
  "name": "Wipper",
  "brand": "Bosch",
  "warranty": 2,
  "cars": [
    {
      "brand": "ALFA ROMEO",
      "model": [
        {
          "name": "147 (937_)",
          "cartype": [
            "1.6 16V T.SPARK",
            "1.6 16V",
            "1.9 JDT"
          ]
        }
      ]
    },
    {
      "brand": "AUDI",
      "model": [
        {
          "name": "TT (8J3)",
          "type": [
            "1.8 TFSI",
            "1.6 16V"
          ]
        }
      ]
    }
  ]
}
{
  "id": 2,
  "name": "Wipper",
  "brand": "Eltis",
  "warranty": 2,
  "cars": [
    {
      "brand": "Avia",
      "model": [
        {
          "name": "D-Line",
          "cartype": [
            "D100",
            "D120"
          ]
        }
      ]
    },
    {
      "brand": "DAF",
      "model": [
        {
          "name": "95",
          "cartype": [
            "FAC 95.400",
            "VAC 95.350"
          ]
        }
      ]
    }
  ]
}

Note: one product can have up to 1000 vehicles, 100 models and 10 brands (one brand can have several models). I could also structure the cars differently.

Example:

{
  "id": 2,
  "name": "Wipper",
  "brand": "Eltis",
  "warranty": 2,
  "cars": [
    {"brand": "Avia", "model": "D-Line", "type": "D100"}
    ...
    ...
  ]
}

But I can't judge which is better. I have 500 thousand vehicles associated with 10 thousand products.
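
For comparison, converting the nested cars array into the flat one-entry-per-vehicle form can be sketched in Python as follows. This is only a sketch: the helper name flatten_cars is hypothetical, and it assumes the nested input uses the "cartype" key for the list of types, as in the second example product.

```python
def flatten_cars(cars):
    """Expand nested brand -> model -> cartype lists into flat entries."""
    flat = []
    for car in cars:
        for model in car["model"]:
            for cartype in model["cartype"]:
                # One entry per concrete vehicle (brand + model + type).
                flat.append({
                    "brand": car["brand"],
                    "model": model["name"],
                    "type": cartype,
                })
    return flat


nested_cars = [
    {"brand": "Avia",
     "model": [{"name": "D-Line", "cartype": ["D100", "D120"]}]},
    {"brand": "DAF",
     "model": [{"name": "95", "cartype": ["FAC 95.400", "VAC 95.350"]}]},
]
flat_cars = flatten_cars(nested_cars)
```

With the flat form, each entry carries a complete brand/model/type combination, at the cost of repeating the brand and model names for every type.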

Jaroslav