I am storing information about companies in an ES index. Right now I have
a company type and a file type. Each file doc has a parent document
of the company type. I would like to be able to search companies by
both the data within the company docs and the data within the child file docs.
I have been unable to figure out how to do this and am considering storing
the files as nested objects inside the company docs. My concern is
that this will create massive company docs that will cause some
unforeseen problems, as a company can have thousands of files associated
with it.
Company docs have a lot of information that I need to search by:
geolocation data, certifications, descriptions, titles, etc.
File docs also have a lot of information, like title, description, keywords,
etc., as well as the content of the actual file (PDF, Word, PPT).
Searching for top children won't take company doc data into account.
Filtering by a query on the children won't allow me to add a weighting
based on file results.
What should I do? Will including thousands of file docs (with file
attachments) as nested objects inside the parent doc cause problems?
Would it be possible to push the company info into each file doc? This
would be painful, though, if the company data changes.
I think your only other option is nested documents. This means that whenever
a file or the company data changes, the whole company doc, including all of
its nested files, gets re-indexed. Will this work at your scale? I have no clue.
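To make that re-indexing cost concrete, here is a minimal sketch in plain Python (no Elasticsearch client involved). The field names and the build_company_doc and payload_size helpers are hypothetical, made up for illustration. The only point is that with nested files the unit sent to the index is the whole company document, so a change to one file means rebuilding and resending all of them.

```python
import json


def build_company_doc(company, files):
    """Return one company document with every file embedded as a nested object.

    `company` and `files` are plain dicts; the field names are illustrative only.
    """
    doc = dict(company)                      # copy so the input is not mutated
    doc["files"] = [dict(f) for f in files]  # every file travels with the company
    return doc


def payload_size(doc):
    """Rough size in bytes of the JSON that would be sent for indexing."""
    return len(json.dumps(doc).encode("utf-8"))


company = {"name": "Acme", "description": "Widgets",
           "location": {"lat": 40.0, "lon": -105.0}}
files = [{"title": f"File {i}", "content": "x" * 1000} for i in range(1000)]

doc = build_company_doc(company, files)

# Changing a single file still means rebuilding and resending the whole document.
files[0]["title"] = "Renamed file"
updated = build_company_doc(company, files)

print(len(updated["files"]))   # number of files resent
print(payload_size(updated))   # bytes resent for a one-field change
```

With real attachments (PDF, Word, PPT content) the per-file payload would be far larger than the 1000 characters assumed here, so it is worth measuring with your own data before committing to this layout.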
So, you have to denormalize one way or another. Pushing the company data
into each file seems to be the cleaner approach.
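As a rough illustration of pushing the company data into each file, the sketch below uses plain Python dicts. The field names, the "company" and "company_id" keys, and both helper functions are assumptions made for this example, not anything from the original setup. It also shows the painful part mentioned earlier in the thread: when the company changes, every one of its file docs has to be rewritten.

```python
def denormalize(company, files):
    """Return one standalone doc per file, each carrying a copy of the company fields."""
    company_fields = {k: v for k, v in company.items() if k != "id"}
    docs = []
    for f in files:
        doc = dict(f)                          # copy so the input is not mutated
        doc["company_id"] = company["id"]      # lets us find the docs again later
        doc["company"] = dict(company_fields)  # denormalized copy of company data
        docs.append(doc)
    return docs


def apply_company_change(file_docs, company):
    """Refresh the embedded company copy; return how many file docs need re-indexing."""
    company_fields = {k: v for k, v in company.items() if k != "id"}
    touched = 0
    for doc in file_docs:
        if doc["company_id"] == company["id"]:
            doc["company"] = dict(company_fields)
            touched += 1
    return touched


company = {"id": "c1", "name": "Acme", "certifications": ["ISO 9001"]}
files = [
    {"id": "f1", "title": "Brochure", "content": "..."},
    {"id": "f2", "title": "Datasheet", "content": "..."},
]

file_docs = denormalize(company, files)

# A company edit fans out to every file doc that belongs to it.
company["name"] = "Acme Corp"
print(apply_company_change(file_docs, company))  # number of file docs to re-index
```

The trade-off compared with nested documents is that a file change touches only one small doc, while a company change touches as many docs as the company has files.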
Best Regards,
Paul
On Thursday, March 28, 2013 9:31:14 PM UTC-6, Brian Jones wrote:
[original post, quoted in full at the top of this thread]