Question about structure

Hello everyone,

I'm developing a program that uses Elasticsearch as its search engine.

I'd be interested in the community's opinion on how to structure my data.

The project is quite simple: we have documents (a document can exist in several languages and have several versions) that are categorized and have dynamic metadata depending on their category.

I have already done some tests with the Elasticsearch ingest attachment plugin to send the document and parse it directly.

But I do not see how to deal with metadata that is dynamic depending on the category of the document.

Can you advise me on it?

Thank you in advance,

Have a good day,

Are you providing the metadata by yourself?

Hello :slight_smile:

Yes, fields entered by me, such as dates, text, numbers, multiple values...

So you can do something like:

PUT _ingest/pipeline/attachment
{
  "description" : "Extract attachment information",
  "processors" : [
    {
      "attachment" : {
        "field" : "data"
      }
    }
  ]
}
PUT my_index/_doc/my_id?pipeline=attachment
{
  "data": "e1xydGYxXGFuc2kNCkxvcmVtIGlwc3VtIGRvbG9yIHNpdCBhbWV0DQpccGFyIH0=",
  "meta": {
    "foo1": "bar1",
    "foo2": "bar2"
  }
}

And that should work.
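
The `data` value in the request above is the file content encoded as Base64, which is what the attachment processor reads. Here is a minimal Python sketch of preparing that body on the client side, assuming the file bytes are already in memory. `build_document` is a hypothetical helper name, and actually sending the request is left out:

```python
import base64
import json


def build_document(file_bytes, meta):
    """Build the JSON body for PUT my_index/_doc/<id>?pipeline=attachment."""
    return {
        # The attachment processor reads Base64 text from the "data" field.
        "data": base64.b64encode(file_bytes).decode("ascii"),
        # Free-form metadata supplied by the application.
        "meta": meta,
    }


# The sample payload used in this thread is a tiny RTF file.
rtf = b"{\\rtf1\\ansi\r\nLorem ipsum dolor sit amet\r\n\\par }"
doc = build_document(rtf, {"foo1": "bar1", "foo2": "bar2"})
print(json.dumps(doc, indent=2))
```

Because `meta` is a plain object, each category can send a different set of keys without changing the pipeline.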

Thank you, that's what I was planning to do :smiley:

And for categories, I add them like this:

PUT my_index/_doc/my_id?pipeline=attachment
{
  "data": "e1xydGYxXGFuc2kNCkxvcmVtIGlwc3VtIGRvbG9yIHNpdCBhbWV0DQpccGFyIH0=",
  "category": {
    "id": 1,
    "name": "Category name "
  },
  "meta": {
    "foo1": "bar1",
    "foo2": "bar2"
  }
}

OR

PUT my_index/_doc/my_id?pipeline=attachment
{
  "data": "e1xydGYxXGFuc2kNCkxvcmVtIGlwc3VtIGRvbG9yIHNpdCBhbWV0DQpccGFyIH0=",
  "category": "Category name ",
  "meta": {
    "foo1": "bar1",
    "foo2": "bar2"
  }
}

Or like this, and is there a way to create a relationship with another index containing the categories?

PUT my_index/_doc/my_id?pipeline=attachment
{
  "data": "e1xydGYxXGFuc2kNCkxvcmVtIGlwc3VtIGRvbG9yIHNpdCBhbWV0DQpccGFyIH0=",
  "category_id":  1,
  "meta": {
    "foo1": "bar1",
    "foo2": "bar2"
  }
}

Thanks a lot :smiley:

I'd not do that. I'd do:

PUT my_index/_doc/my_id?pipeline=attachment
{
  "data": "e1xydGYxXGFuc2kNCkxvcmVtIGlwc3VtIGRvbG9yIHNpdCBhbWV0DQpccGFyIH0=",
  "category": "Category name ",
  "meta": {
    "foo1": "bar1",
    "foo2": "bar2"
  }
}
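
With the category stored as a plain string, filtering by it is a single exact-match clause. Here is a rough Python sketch of such a search body, assuming the index relies on Elasticsearch's default dynamic mapping (where a string field also gets a `.keyword` sub-field) and that the pipeline writes extracted text to its default `attachment.content` field. `category_filter` is a hypothetical helper name:

```python
import json


def category_filter(category, text):
    """Search body: full-text match on extracted content, exact category filter."""
    return {
        "query": {
            "bool": {
                # Scored full-text match on the text extracted by the pipeline.
                "must": [{"match": {"attachment.content": text}}],
                # Non-scoring exact match on the category string.
                "filter": [{"term": {"category.keyword": category}}],
            }
        }
    }


body = category_filter("Category name", "lorem")
print(json.dumps(body, indent=2))
```

A `term` filter is exact, so the indexed value must match character for character, including any trailing space.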

Thanks :wink:

If my document and my categories are multilingual, should I index the document once for each language?

That is, not like this:

PUT my_index/_doc/my_id?pipeline=attachment
{
  "data": {
    "en": "e1xydGYxXGFuc2kNCkxvcmVtIGlwc3VtIGRvbG9yIHNpdCBhbWV0DQpccGFyIH0=",
    "fr": "e1xydGYxXGFuc2kNCkxvcmVtIGlwc3VtIGRvbG9yIHNpdCBhbWV0DQpccGFyIH0="
  },
  "category": {
    "en": "Category name EN",
    "fr": "Category name FR"
  },
  "meta": {
    "foo1": "bar1",
    "foo2": "bar2"
  }
}

but rather like this for the file in English:

PUT my_index/_doc/my_id?pipeline=attachment
{
  "data": "e1xydGYxXGFuc2kNCkxvcmVtIGlwc3VtIGRvbG9yIHNpdCBhbWV0DQpccGFyIH0=",
  "category": "Category name EN ",
  "meta": {
    "foo1": "bar1",
    "foo2": "bar2"
  }
}

and like this for the file in French:

PUT my_index/_doc/my_id?pipeline=attachment
{
  "data": "e1xydGYxXGFuc2kNCkxvcmVtIGlwc3VtIGRvbG9yIHNpdCBhbWV0DQpccGFyIH0=",
  "category": "Category name FR ",
  "meta": {
    "foo1": "bar1",
    "foo2": "bar2"
  }
}

I prefer the latter form, but I'd use 2 indices: my_index_fr and my_index_en. Unless there is a need to have an absolute relationship between both versions of the same document.

What are the advantages of using 2 indices?

And what do you mean by "an absolute relationship"? How can I do that with ES?
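
As a rough illustration of the two-index layout, here is a Python sketch that routes each language version to its own index. The `my_index_<lang>` naming follows the post above; `index_for` and `index_request` are hypothetical helper names, and the sketch only builds the request line rather than sending it:

```python
def index_for(lang, base="my_index"):
    """Return the per-language index name, e.g. my_index_fr."""
    return f"{base}_{lang.lower()}"


def index_request(doc_id, lang, pipeline="attachment"):
    """Return (method, path) for indexing one language version of a document."""
    return ("PUT", f"{index_for(lang)}/_doc/{doc_id}?pipeline={pipeline}")


print(index_request("my_id", "fr"))
# ('PUT', 'my_index_fr/_doc/my_id?pipeline=attachment')
```

Using the same document id in both indices is one simple way to keep the language versions loosely tied together on the application side.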

I mean that if you want to have within the same elasticsearch document both versions, then you have to do something like:

PUT my_index/_doc/my_id?pipeline=attachment
{
  "data": {
    "en": "e1xydGYxXGFuc2kNCkxvcmVtIGlwc3VtIGRvbG9yIHNpdCBhbWV0DQpccGFyIH0=",
    "fr": "e1xydGYxXGFuc2kNCkxvcmVtIGlwc3VtIGRvbG9yIHNpdCBhbWV0DQpccGFyIH0="
  },
  "category": {
    "en": "Category name EN",
    "fr": "Category name FR"
  },
  "meta": {
    "foo1": "bar1",
    "foo2": "bar2"
  }
}
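
One caveat with this shape: the `attachment` pipeline defined earlier in this thread reads a single Base64 string from `data`, so a document whose `data` is an object would need one processor per language. Here is a hedged Python sketch of such a pipeline body, assuming the processor's `field` and `target_field` options are given dotted paths into the object. `multilingual_pipeline` is a hypothetical helper name:

```python
import json


def multilingual_pipeline(langs):
    """Pipeline body with one attachment processor per language key."""
    return {
        "description": "Extract attachment information per language",
        "processors": [
            {
                "attachment": {
                    # Read the Base64 payload for this language...
                    "field": f"data.{lang}",
                    # ...and write the extracted text next to its siblings.
                    "target_field": f"attachment.{lang}",
                }
            }
            for lang in langs
        ],
    }


pipeline = multilingual_pipeline(["en", "fr"])
print(json.dumps(pipeline, indent=2))
```

The resulting body would be sent with `PUT _ingest/pipeline/<name>` in the same way as the single-language pipeline.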

Okay :slight_smile:

No, I do not need it (I think :stuck_out_tongue:). Otherwise, I also have a MySQL database behind it which can do that.

And what are the advantages of using 2 indices?

Thanks a lot for all the information and your time :smiley:

And what are the advantages of using 2 indices?

Well, it reduces the number of fields within one index. Not a big deal here, I guess, as you only have a few of them.
What I'd think about is your "reindex" needs. Suppose something goes wrong with a specific language, e.g. FR, and you need to change the text analyzer (which means a reindex). Do you want to reindex both languages or only one?

I prefer reindexing only one, but it's really up to you.

Just giving my 2 cents here :wink:
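
To make the analyzer point concrete, here is a small Python sketch of a per-language index creation body. It assumes Elasticsearch 7 or later (typeless mappings), the built-in `english` and `french` analyzers, and that the pipeline writes extracted text to its default `attachment.content` field. `create_index_body` and the `ANALYZERS` table are hypothetical names:

```python
import json

# Built-in Elasticsearch language analyzers, keyed by language code.
ANALYZERS = {"en": "english", "fr": "french"}


def create_index_body(lang):
    """Body for PUT my_index_<lang>: analyze extracted text per language."""
    return {
        "mappings": {
            "properties": {
                "attachment": {
                    "properties": {
                        "content": {"type": "text", "analyzer": ANALYZERS[lang]}
                    }
                }
            }
        }
    }


print(json.dumps(create_index_body("fr"), indent=2))
```

With one index per language, changing the French analyzer later only requires rebuilding `my_index_fr`, and the English index stays untouched.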

Thanks a lot for the information :smiley:

Dummy question about "PUT _ingest/pipeline/attachment": do I need to call it only once, or before each "PUT my_index/_doc/my_id?pipeline=attachment" request?

Thanks :wink:

Only once.
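
In other words, the pipeline is a stored definition that every later indexing request refers to by name. Here is a tiny Python sketch of the resulting request sequence; `request_plan` is a hypothetical helper name, and the sketch only lists the calls rather than sending them:

```python
def request_plan(doc_ids, index="my_index", pipeline="attachment"):
    """List the requests needed: one pipeline definition, then one PUT per document."""
    # The pipeline is created a single time...
    plan = [("PUT", f"_ingest/pipeline/{pipeline}")]
    # ...and then referenced by name in every indexing request.
    plan += [("PUT", f"{index}/_doc/{doc_id}?pipeline={pipeline}") for doc_id in doc_ids]
    return plan


for method, path in request_plan(["1", "2", "3"]):
    print(method, path)
```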
