Some help for a beginner to structuring his data for elasticsearch

(Stefan Kruse-2) #1

i need to move from a mysql db to elasticsearch.
I have the follow situation:
We have these tables: companies, categories and listings.
Companies includes datas from the company like name, address etc.
Subcategories includes a name like "Festmacher","Zollagenturen".
Listings includes the mapping from companies to subcategories like "Company"=>1 and "Subcategory"=>8 and so on. This table ist very big because every relation from a company to a subcategory was saved here.

Ok, now i need to structure this to es. In the endresult i need the following options.

I have to search for subcategories with the result of how many times the founded subcategories are listed in listings. Like "Festmacher => 60" or "Zollagenturen => 101".
The next thing is a overview of all comapnies for a subcategory id. I have a get param like ?scat=1. Now i need all companies listet for this subcategogry.

How can i make this the best way with es? How was a good way to index the datas with an eye of performance?


(David Pilato) #2

If I understand correctly, listings is only a technical table. It does not really mean anything in term of user perspective, right?

So, I'd index companies with all their categories.

If you want to get the number of companies which have a given category, it's then super easy. And having a flat model in elasticsearch is better in term of performance IMO.

(Stefan Kruse-2) #3

Hi, yes you are right, listings is only technical, only for the relations.But i found the solution with saving the categories by a company is not so good. What if a subcategory is changing? Then i have to reindex the companies with the new cateogryname, right? Or do i only index the id by the company?

(David Pilato) #4

Yes. Reindex.

What is the cost of reindexing?
Let's say you have 1 million companies. How long does it take to reindex 1 million docs in elasticsearch? 1 or 2 minutes? (I'm not speaking about reading data from the source which can be slow but only on the elasticsearch side).

As it depends on many factors you have to test that but as an example, I'm indexing locally on my laptop around 10k to 15k documents per second.

So, basically don't try to solve issue you actually don't have by over engineering your model.
You can use parent / child feature to create relations in elasticsearch but for the use case you presented here, I won't use that. It comes with a cost.

(Stefan Kruse-2) #6

Ok, i will give it a try. And you mean i should index the categoryname directly by the company. Ok, but how can i now search for a subcategory with the count of companies where the subcategory are stored?

(David Pilato) #7


GET companies/_search
  "query": {
    "match": {
       "subcategory":  "mysubcategory"

Can you describe with a sample JSON document how your model would be like?
Something like:

  "name": "xyz",
  "category": "abc",
  "subcategory": "def"


(Stefan Kruse-2) #8

This looks good, but subcategories would be an array. But i need all subcategories for a given searchword with the counter of the related companies. Something like:

Scat 1 = 10 Companies
Scat 2 = 6 Companies

(David Pilato) #9

I think that a Terms aggregation on subcategory field will answer to that.

(Stefan Kruse-2) #10

yeah, this could work, ok thanks a lot for helping me out. i will give this a try. MAybe i come back later :slight_smile: thanks

(Stefan Kruse-2) #11

Hi, okay this works fine but one thing i dont understand. I need the aggregation only for the query match result, not for all subcategory fields. You understand?

(David Pilato) #12

Aggregations are computed on the resultset.

(Stefan Kruse-2) #13

ok, but if i stored the subcategories as array like

"subcategories":["Scat One","Scat Two","Scat Three"]

then i could search for example Scat One but if i made aggreagtion to the field subcategories he took all Subcategories in the founded docs. But i need only the founded one. In my case "Scat One".

Maybe i am wrong :slight_smile:

(David Pilato) #14

Unsure but may be using nested type could help?

I'm sorry I can't really help more. May be create a full example as a script like here would help?

(system) #15