Feedback for planned data structure/mapping

Hello everyone,
I am new to ES and would love to hear some feedback for a planned data
structure/mapping.

Some facts:

  • The application can have multiple projects (up to 1000)
  • Each project can contain multiple media file entries (up to 10,000)
  • Files can be shared with multiple users (up to 100 / project)
  • Files can have multiple metadata entries, which are user-defined and will
    vary from project to project (up to 30)
  • ES should be the primary data store for "media files"
  • Users and metadata are stored in MongoDB
  • Searchable fields should be the "name" and all "metadata" entries
  • Each search request is first filtered by the "shared" field to make sure
    the user has access
  • Mappings are individually defined per _type(=project)

Index: localhost:9200/media
Type(s): localhost:9200/media/
Example schema for a "file entry":
{
  "_index" : "media",
  // Mongo id of the current project
  "_type" : "51d406802e6b5e92b4000003",
  "_id" : "uCdAr0J1Qdu7Iv3xdycbrg",
  "_score" : 1.0,
  "_source" : {
    "name": "My media file",
    "shared": [
      // Mongo ids of users
      "51d3f86f31517b3fa5000003",
      "51d3f86f31517b3fa5000004",
      "51d3f86f31517b3fa5000005",
      "51d3f86f31517b3fa5000006",
      ...
    ],
    "metadata": [
      // Mongo ids of metadata entries
      {"51d69d8f0c62690000000011": "Mike"},
      {"51d69d8f0c62690000000016": 2000},
      {"51d69d8f0c62690000000017": 2005},
      {"51d69d8f0c62690000000016": [
        "Autor1", "Autor2", "Autor3"
      ]},
      ...
    ]
  }
}

Would this structure work at all, or would it lead to any problems?
Thanks for your feedback!

--
You received this message because you are subscribed to the Google Groups "elasticsearch" group.
To unsubscribe from this group and stop receiving emails from it, send an email to elasticsearch+unsubscribe@googlegroups.com.
For more options, visit https://groups.google.com/groups/opt_out.

I have never used elasticsearch to store media files, so I cannot
adequately comment on the feasibility or tuning aspects.

Is your only separation between projects via the type? Since types are
primarily used to define which mapping to use (as you stated), do you
envision the mapping to be different between each project? There is nothing
in your example document which leads me to believe so.

Your volume seems to be low, so one index overall could be adequate. You
could separate each project into its own index if the performance
characteristics (size, number of search requests) vary greatly between
projects.
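
For a rough sense of scale, the upper bounds from the original post can be multiplied out. This is only a back-of-the-envelope sketch: the variable names are illustrative, and it assumes the limit of 30 metadata entries applies per project.

```python
# Upper bounds taken from the "Some facts" list in the original post.
MAX_PROJECTS = 1000
MAX_FILES_PER_PROJECT = 10_000
# Assumption: the "up to 30" metadata entries are counted per project.
MAX_METADATA_PER_PROJECT = 30

# Worst-case number of documents if everything lives in one index.
max_documents = MAX_PROJECTS * MAX_FILES_PER_PROJECT

# Worst-case number of distinct field names if every metadata id
# becomes its own field, as in the posted example document.
max_metadata_fields = MAX_PROJECTS * MAX_METADATA_PER_PROJECT

print(max_documents)        # 10000000
print(max_metadata_fields)  # 30000
```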

What you didn't specify is how any part of the data model is updated. Since
data tends to be de-normalized, updating only certain sections can be
less efficient.

Cheers,

Ivan

(In reply to joa's message of Wed, Aug 28, 2013, 5:29 AM.)

Hi Ivan, thanks for your answer.
First of all, I am not going to store the original files! I am just storing
a link to an AWS S3 object.

And yes, I do the separation between projects via the type, because the
metadata section in each project will be completely different, so I need
different mappings. (As the metadata names (= Mongo ids) are unique,
separating them into types isn't strictly necessary, but without it there
would be lots of different fields within the metadata object... Would that
be a problem?)

I am not sure if it's better to separate projects via the type or the index.
I decided to choose the type, because as stated here
(https://groups.google.com/forum/#!topic/elasticsearch/kiFI0QoZ3v4) "...its
pretty expensive to have many indices...". But I wonder if there is a
similar limit for types?

Would separating into more indices be better for load balancing in a
cluster (if needed later)? One other idea is to separate projects first by
language and then by type:
/media_en/<id_of_project1>
/media_en/<id_of_project2>
...
/media_de/<id_of_project6>
...

(In reply to Ivan Brusic's message of Thursday, 29 August 2013, 20:40:37 UTC+2.)

If you put your metadata ids on the left side of the JSON (as keys), like in
the "metadata" array, you will pile up myriads of field names in the index.
You surely don't want that, and it will put ES under significant stress.

A large number of ES index types uses quite a lot of resident memory, and may
add some mapping overhead if the mappings you use are dynamic.

I would always put metadata object ids on the right side of the JSON (as
values) so they become indexed values in ES.

If you want to connect user objects to the ids, you can consider
client-side logic (multi get) or denormalization in the docs you index.
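
To illustrate the difference between ids on the left and on the right side, here is a minimal Python sketch that converts one shape into the other. The function name and the field names "k" and "v" are hypothetical; any fixed pair of names would do.

```python
from typing import Any, Dict, List


def to_key_value_entries(metadata: Dict[str, Any]) -> List[Dict[str, Any]]:
    """Turn {metadata_id: value} into [{"k": metadata_id, "v": value}, ...].

    The field names "k" and "v" are hypothetical placeholders.
    """
    return [{"k": key, "v": value} for key, value in metadata.items()]


# Metadata ids as JSON keys: every new id adds a new field name to the index.
ids_as_keys = {
    "51d69d8f0c62690000000011": "Mike",
    "51d69d8f0c62690000000017": 2005,
}

# Metadata ids as values: the only field names are "k" and "v".
entries = to_key_value_entries(ids_as_keys)
print(entries)
```

With this shape the number of field names stays constant no matter how many projects or metadata definitions exist.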

Jörg

On Thu, Aug 29, 2013 at 9:40 PM, joa joafeldmann@gmail.com wrote:

(As the metadata names (=mongo id) are unique, separating them into types
isn't necessary, but it would lead to lots of different fields within the
metadata object... Would that be a problem?)
But I wonder if there is a similar limit for types?

Hi Jörg, I've now put the metadata name on the right side of the JSON and
removed the different types for projects.

Each search request will first be filtered by the projectId AND the shared
array.
But now the metadata array may, for example, have up to 25000 (e.g. 500
projects * 50 possible metadata entries) different fields. Is that a
problem?

(As I plan to store the user data in MongoDB, connecting the user objects
with their ids will happen either on the client side or in backend code)

{
  // Same index for all projects
  "_index" : "my_index",
  // Same type for all projects
  "_type" : "media",
  "_id" : "uCdAr0J1Qdu7Iv3xdycbrg",
  "_score" : 1.0,
  "_source" : {
    // MONGO ID of the project
    "projectId": "51d3f86f31517b22a5000001",
    "name": "My media file",
    ...
    // other fields valid for ALL projects, like file size etc.
    ...
    "shared": [
      // MONGO IDs of users
      "51d3f86f31517b3fa5000003",
      "51d3f86f31517b3fa5000004",
      "51d3f86f31517b3fa5000005",
      "51d3f86f31517b3fa5000006",
      ...
    ],
    "metadata": [
      {
        "k": "51d69d8f0c62690000000011",
        "v": "Mike"
      },
      {
        "k": "51d69d8f0c62690000000016",
        "v": 2000
      },
      {
        "k": "51d69d8f0c62690000000017",
        "v": 2005
      },
      {
        "k": "51d69d8f0c62690000000016",
        "v": [
          "Autor1", "Autor2", "Autor3"
        ]
      }
    ]
  }
}
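
The access check described above (filter by projectId AND the shared array) could be expressed as a search body roughly like the one below. This is only a sketch: it assumes a more recent Elasticsearch where the bool query accepts filter clauses (releases from the time of this thread used a filtered query for the same purpose), and it assumes "projectId" and "shared" are indexed as exact, non-analyzed values. The function name is hypothetical.

```python
import json
from typing import Any, Dict


def build_access_filter(project_id: str, user_id: str) -> Dict[str, Any]:
    """Search body that restricts hits to one project and one user."""
    return {
        "query": {
            "bool": {
                # Both clauses must hold; filter clauses do not affect scoring.
                "filter": [
                    {"term": {"projectId": project_id}},
                    {"term": {"shared": user_id}},
                ]
            }
        }
    }


body = build_access_filter("51d3f86f31517b22a5000001", "51d3f86f31517b3fa5000003")
print(json.dumps(body, indent=2))
```

The actual search on "name" and the metadata would then be added alongside these two filter clauses.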

Note that now, in order to find a document with metadata k=... and v=..., you would need to index the metadata entries as a nested type. Otherwise there will be cross matches within your metadata array: a key from one entry and a value from another.
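
The cross-match effect can be simulated in plain Python without a running cluster. The sketch below only mimics the behaviour (values of a plain object array being pooled per field, versus nested entries being matched one by one); it is not how ES is implemented, and all function names are made up for the illustration.

```python
from typing import Any, Dict, List

Entry = Dict[str, Any]


def flatten(entries: List[Entry]) -> Dict[str, List[Any]]:
    """Mimic a plain object array: values are pooled per field name."""
    pooled: Dict[str, List[Any]] = {}
    for entry in entries:
        for field, value in entry.items():
            pooled.setdefault(field, []).append(value)
    return pooled


def matches_flattened(entries: List[Entry], k: str, v: Any) -> bool:
    """Match if k occurs anywhere and v occurs anywhere in the document."""
    pooled = flatten(entries)
    return k in pooled.get("k", []) and v in pooled.get("v", [])


def matches_nested(entries: List[Entry], k: str, v: Any) -> bool:
    """Match only if one single entry carries both k and v."""
    return any(e.get("k") == k and e.get("v") == v for e in entries)


metadata = [
    {"k": "51d69d8f0c62690000000011", "v": "Mike"},
    {"k": "51d69d8f0c62690000000017", "v": 2005},
]

# Key taken from the first entry, value taken from the second one.
print(matches_flattened(metadata, "51d69d8f0c62690000000011", 2005))  # True (cross match)
print(matches_nested(metadata, "51d69d8f0c62690000000011", 2005))     # False
```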

Thanks for the hint. So is the following mapping sufficient for avoiding
cross object matches?

{
  "media" : {
    "properties" : {
      ...
      "metadata" : {
        "type" : "nested"
      },
      ...
    }
  }
}
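
With a nested mapping like this, searches on the metadata have to go through a nested query. A minimal sketch of such a search body, built in Python: the field paths "metadata.k" and "metadata.v" follow the example document above, the function name is hypothetical, and the term clause assumes the key is indexed as an exact, non-analyzed value.

```python
import json
from typing import Any, Dict


def build_metadata_query(key_id: str, value: Any) -> Dict[str, Any]:
    """Nested query: key and value must match inside the same metadata entry."""
    return {
        "query": {
            "nested": {
                "path": "metadata",
                "query": {
                    "bool": {
                        "must": [
                            {"term": {"metadata.k": key_id}},
                            {"match": {"metadata.v": value}},
                        ]
                    }
                },
            }
        }
    }


body = build_metadata_query("51d69d8f0c62690000000011", "Mike")
print(json.dumps(body, indent=2))
```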

Yes, something like this. I have managed to avoid using it so far. It comes at a price: performance, and some limitations on queries and facets, so look up the nested query/facet docs.
Also, I wonder whether the heterogeneous array you have will work. I have never tried to index an array whose elements have different types. It may work, but I would check. Maybe someone more experienced can comment on it.
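
As far as I know, a field in ES is mapped to a single type, so mixing strings and numbers under one "v" field is likely to cause either coercion or indexing errors. One possible way around it, which nobody in this thread has proposed or tested, is to write the value into a field chosen by its type. A minimal sketch, with the hypothetical field names "v_str" and "v_num":

```python
from typing import Any, Dict


def typed_entry(key_id: str, value: Any) -> Dict[str, Any]:
    """Store the value under a field chosen by its Python type.

    "v_str" and "v_num" are hypothetical field names.
    """
    values = value if isinstance(value, list) else [value]
    entry: Dict[str, Any] = {"k": key_id}
    for item in values:
        # bool is a subclass of int in Python, so exclude it explicitly.
        is_number = isinstance(item, (int, float)) and not isinstance(item, bool)
        field = "v_num" if is_number else "v_str"
        entry.setdefault(field, []).append(item if is_number else str(item))
    return entry


print(typed_entry("51d69d8f0c62690000000011", "Mike"))
print(typed_entry("51d69d8f0c62690000000016", 2000))
print(typed_entry("51d69d8f0c62690000000016", ["Autor1", "Autor2", "Autor3"]))
```

Each typed field can then get its own mapping, and a query picks the field that fits the type of the search value.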
