Analyzers and JSON

Hello,

I'm trying to get an understanding of how to have full-text search on a document and have the body of the document be considered during the search. I understand how to do the mapping and use analyzers, but what I don't understand is how they get the body of the document. If your fields are file name, file size, file path, and file type, how do the analyzers get the body of the document? Surely you wouldn't have to put the body of every document into the JSON? That is how it's done in all the examples I've seen, but it doesn't make sense for large-scale production environments. If someone could give me some insight into how this process works, it would be greatly appreciated.

Thank you,
Austin Harmon


Yes, you need to include all the text you want indexed and searchable as part of the JSON.
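
For example (a minimal sketch; the index, type, and field names here are placeholders, and the body value would be the text you extracted from the file):

curl -XPUT 'http://localhost:9200/docs/doc/1' -d '{
  "file_name": "report.docx",
  "file_size": 48231,
  "file_path": "/share/reports/report.docx",
  "file_type": "docx",
  "body": "The extracted plain text of the document goes here..."
}'

The analyzers defined in your mapping run over the body field at index time; that is what makes the content searchable.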

How else would you expect Elasticsearch to receive the data?

Regarding large scale production environments, this is why Elasticsearch
scales out.

Aaron


Okay, so I have a large amount of data, 2 TB, and it's all Microsoft Office documents, PDFs, and emails. What is the best way to go about indexing the body of these documents and making their contents searchable? I tried to use the PHP client, but that isn't helping, and while I know there are ways to convert files in PHP, is there nothing available that takes in these types of documents? I tried the file_get_contents function in PHP, but it only takes in text documents. Also, would you know of a good tool or method to make the files that are searched downloadable?

Thanks,
Austin


Take a look at Apache Tika (http://tika.apache.org/). It will allow you to extract the contents of the documents for indexing; this part is outside the scope of Elasticsearch indexing itself. A good tool to make these files downloadable is also out of scope, but I'll answer what is in scope: you need to put the files somewhere they can be accessed by a URL. Any web server is capable of this; of course your needs may vary, but this isn't the list for those questions. Once you have a URL that the document can be accessed by, include it when you index the document so that you can point to that URL in your search results.
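
For example (field names are placeholders), the document you index might carry both the extracted text and the download URL:

{
  "file_name": "report.docx",
  "url": "http://files.example.com/reports/report.docx",
  "body": "...extracted plain text..."
}

Your search front end can then render the url field as the download link in each result.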

I am sure there are other options out there for extracting the contents of Word documents; Apache Tika is just one that is frequently used for this purpose.


Thank you for the information. I've been trying to use the mapper attachments plugin, which has Apache Tika built into it. I am just surprised and confused that so many companies use Elasticsearch, yet it is so difficult to index the contents of a document. If I need to index the contents of documents, would it be easier and more efficient to switch over to Apache Solr? As I said, I have 2 TB of data, so it isn't efficient for me to manually input each document so it can be indexed with specific JSON. If you have any experience with Solr, please let me know if it would be a good solution to my problem.

thanks,
Austin


You're going to have the same issue with Solr: putting the contents into XML, which is even heavier than JSON.

I wish I had more experience using Tika, but I do not; I am aware of its capabilities but have not had reason to use it myself.

I see what you are saying about others not having the same issue, but what you must realize is that most users are not indexing that type of document. They are indexing events, database records, web pages, and so on. It is a very small subset that indexes things like Word docs and PDFs.


Thank you for the information. This is going to be very difficult, I can tell. Do you have experience with the mapper attachments plugin?



Hi Austin,

Solr's SolrCell lets you submit documents in various formats directly to
Solr, which then uses Tika to extract the plain text for indexing.
https://cwiki.apache.org/confluence/display/solr/Uploading+Data+with+Solr+Cell+using+Apache+Tika
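
Typical usage looks something like this (core layout, id, and file name are placeholders):

curl 'http://localhost:8983/solr/update/extract?literal.id=doc1&commit=true' -F 'myfile=@report.pdf'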

However, we don't like this approach, as Tika itself can fall over (when faced with a great big complex PDF, for example; I've seen ones that run to 3,000 pages) or just eat up all the resources on your Solr server. So we tend to run Tika as part of an external indexing process, written in Python or Java, that then throws the plain text at Solr. We can then manage it, restart it, etc.

There are many other ways to do this as well of course - here's some
code that we wrote many moons ago which might be helpful:

Cheers

Charlie


--
Charlie Hull
Flax - Open Source Enterprise Search

tel/fax: +44 (0)8700 118334
mobile: +44 (0)7767 825828
web: www.flax.co.uk


I'm not certain what you are referring to, so I expect not. I have used Elasticsearch mappings, but I can't see how those would directly integrate with Tika.


There is a plugin called mapper attachments (https://github.com/elastic/elasticsearch-mapper-attachments). This plugin is supposed to use Tika to index the content of documents, but it doesn't seem to be working correctly for me. I base64-encode the documents, but the content comes back as null when I decode it.
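
Roughly what I have been trying looks like this (index, type, and field names are just placeholders):

curl -XPUT 'http://localhost:9200/test' -d '{
  "mappings": {
    "doc": {
      "properties": {
        "file": { "type": "attachment" }
      }
    }
  }
}'

curl -XPUT 'http://localhost:9200/test/doc/1' -d '{
  "file": "...base64 of the original file..."
}'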

Have you looked at the StandAloneRunner included with that plugin?

I would practice with that, first seeing if it can extract the content at all, then seeing if it can extract the content from your base64-encoded version of the document. When that is working, I suspect you will be able to do what you are hoping.

However, while this plugin aims to make indexing easier, it does not make it more efficient. You have mentioned several times that you have a large number of documents to process, and it sounds like you think that by avoiding putting the contents of the document into the JSON you are being more efficient. Instead you have opted to put the entire document, base64-encoded, into the JSON, which is far less efficient.

Base64 encoding increases the size of a document, and depending on the document format it may also inflate the size of the actual text. If instead you use Tika to extract the text yourself, not via the plugin, put that text into the JSON, and then gzip and post the JSON, that is the optimal way to post your documents for indexing. It also gives you the greatest level of control and will allow you to use the bulk API.
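
A rough sketch of that extraction step in Java (the class name is mine, you would need the tika-app jar, or tika-core plus tika-parsers, on the classpath, and the Elasticsearch client wiring is left out):

import java.io.File;
import org.apache.tika.Tika;

public class ExtractText {
    public static void main(String[] args) throws Exception {
        Tika tika = new Tika();
        // Tika auto-detects the format (doc, docx, pdf, msg, ...)
        // and returns the plain text.
        String text = tika.parseToString(new File(args[0]));
        // Put this text into the "body" field of your JSON document
        // and send it to Elasticsearch, ideally via the bulk API.
        System.out.println(text);
    }
}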

One note: Elasticsearch has a maximum HTTP POST size by default, controlled by the http.max_content_length setting. If you are posting large documents you may exceed it, especially if you are using the bulk API.
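
For example, in elasticsearch.yml (100mb is the default; this raises it):

http.max_content_length: 500mb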

If your concern is that you need to use PHP, then you do have an issue. This should be written in Java to fully leverage Tika. Writing it in Java will also allow you to leverage the node client API for writing to Elasticsearch. All this will make your loading far more efficient than trying to stay in PHP. If PHP is the only language you know, it might be time to learn another; you should not be afraid to, and you might find it easier than what you have been doing so far. If I had a requirement to do this in PHP, after raising significant objections to the requirement, with adequate explanation that it was the wrong way to do it, I would pursue finding alternatives to Tika that work in PHP. I see there are extractors for the .doc format, and .docx is just XML in a ZIP file, so that can be extracted; there are other options. Worst case, you could call a command-line Tika to extract and then post using PHP, though this will be slow.
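
For instance, with the runnable tika-app jar (the jar name and file path here are illustrative):

java -jar tika-app.jar --text report.docx > report.txt

You could then read the resulting text file from PHP and post it as part of your JSON.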

The real point is that in order for Elasticsearch to index your content, you need to show it the content. Putting that content into JSON is not only a good way to do that, it is the way it is done with Elasticsearch; you should stop looking for an alternative. Even the plugin you are using will ultimately put the content into JSON and send it to Elasticsearch. This does not mean that you have to store the full content of the document in Elasticsearch; the mappings on your index can take care of that. It also does not mean that you have to retrieve the full content in your search results; your queries can take care of that if your mappings do not.
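
For example, a sketch against the 1.x mapping API (index, type, and field names are placeholders) that leaves the body field analyzed and searchable but excludes it from the stored _source, so it is neither kept nor returned:

curl -XPUT 'http://localhost:9200/docs' -d '{
  "mappings": {
    "doc": {
      "_source": { "excludes": ["body"] },
      "properties": {
        "body": { "type": "string" }
      }
    }
  }
}'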


I'm a bit concerned about your "it does not work" statement. As of today we have only 4 open issues on the plugin (1 bug and 3 feature requests): https://github.com/elastic/elasticsearch-mapper-attachments/issues

Could you explain a bit more what is not working? Maybe I missed something.

--
David Pilato - Developer | Evangelist

@dadoonet (https://twitter.com/dadoonet) | @elasticsearchfr (https://twitter.com/elasticsearchfr) | @scrutmydocs (https://twitter.com/scrutmydocs)

On 13 March 2015 at 10:49, Austin Harmon <aharmon2165@gmail.com> wrote:

There is a plugin called mapper attachments: https://github.com/elastic/elasticsearch-mapper-attachments. This plugin is supposed to use Tika to index the content of documents, but it doesn't seem to be working correctly. I base64 encode the documents, but it comes back as null when I decode it.

--
You received this message because you are subscribed to the Google Groups "elasticsearch" group.
To unsubscribe from this group and stop receiving emails from it, send an email to elasticsearch+unsubscribe@googlegroups.com.
To view this discussion on the web visit https://groups.google.com/d/msgid/elasticsearch/169B3761-629C-471D-9E97-07EA75473F7E%40pilato.fr.
For more options, visit https://groups.google.com/d/optout.

He posted limited details in a separate thread.

"mapper-attachment and base64 encoding"

I was not asserting that it does not work, just that it may not be the best
way to handle "large number of documents".

I suspect there is an issue with encoding or submitting the document.

On Fri, Mar 13, 2015 at 1:35 PM, David Pilato david@pilato.fr wrote:

I'm a bit concerned about your "it does not work" statement.
As of today we have only 4 open issues on it:
https://github.com/elastic/elasticsearch-mapper-attachments/issues
(1 bug and 3 feature requests).

Could you explain a bit more what is not working? Maybe I missed something.

--
You received this message because you are subscribed to the Google Groups "elasticsearch" group.
To unsubscribe from this group and stop receiving emails from it, send an email to elasticsearch+unsubscribe@googlegroups.com.
To view this discussion on the web visit https://groups.google.com/d/msgid/elasticsearch/CAF9vEEqxZ9XPD8aB0jg3xartJEUW6NAKQGe%3D7Z8_udHsUQDijA%40mail.gmail.com.
For more options, visit https://groups.google.com/d/optout.

Thanks. I missed the post.
Will answer there.

--
David Pilato - Developer | Evangelist

@dadoonet (https://twitter.com/dadoonet) | @elasticsearchfr (https://twitter.com/elasticsearchfr) | @scrutmydocs (https://twitter.com/scrutmydocs)

On 13 March 2015 at 12:41, Aaron Mefford <aaron@definemg.com> wrote:

He posted limited details in a separate thread.

"mapper-attachment and base64 encoding"

I was not asserting that it does not work, just that it may not be the best way to handle "large number of documents".

I suspect there is an issue with encoding or submitting the document.

--
You received this message because you are subscribed to the Google Groups "elasticsearch" group.
To unsubscribe from this group and stop receiving emails from it, send an email to elasticsearch+unsubscribe@googlegroups.com.
To view this discussion on the web visit https://groups.google.com/d/msgid/elasticsearch/5F337472-7F1B-462F-A9A2-A617D6F4536A%40pilato.fr.
For more options, visit https://groups.google.com/d/optout.

Hello,

I'm running an instance of Elasticsearch 1.3.2 on Ubuntu Server 14.04 on an
iMac. I have the mapper-attachments plugin installed, plus elasticsearch-gui,
which I'm using as my front end.

It's possible that I am missing something. Here is everything I've tried so
far:

I got the mapper-attachments plugin installed.
Then I created the index with this mapping:

curl -XPUT 'http://localhost:9200/historicdata' -d '{
  "mappings": {
    "docs": {
      "properties": {
        "content": { "type": "attachment" }
      }
    }
  }
}'

Now I use a PHP script to walk the documents and convert them to base64:

<?php
$root = '/home/aharmon/test';
$iters = new RecursiveIteratorIterator(
    new RecursiveDirectoryIterator($root),
    RecursiveIteratorIterator::CHILD_FIRST
);
try {
    foreach ($iters as $fullFileName => $iter) {
        $base64 = base64_encode($iter);
        $indexarray = array("File" => $base64);
        $jsonarray = json_encode($indexarray);
        file_put_contents("/home/aharmon/data.json", $jsonarray, FILE_APPEND);
    }
} catch (UnexpectedValueException $e) {
    printf("Directory [%s] contained a directory we can not recurse into", $root);
}
?>

Then I take my data.json file and build a bulk API request from it:

{"index": {"_index": "historicdata", "_type": "docs" } }
{"File":"L2hvbWUvYWhhcm1vbi90ZXN0L0EgUGx1cyAtIFN1bW1hcnkgYnkgVmVudWUucGRm"}
{"index": {"_index": "historicdata", "_type": "docs" } }
{"File":"L2hvbWUvYWhhcm1vbi90ZXN0L0EgUGx1cyAtIE1lZGlhIFBsYW4gU3VtbWFyeS54bHM="}
{"_index": "historicdata", "_type": "docs" } }
{"File":"L2hvbWUvYWhhcm1vbi90ZXN0L0EgUGx1cyAtIFN1bW1hcnkgYnkgVmVudWUueGxz"}
{"_index": "historicdata", "_type": "docs" } }
{"File":"L2hvbWUvYWhhcm1vbi90ZXN0L0FnZW5jaWVzIE1hc3RlciBMaXN0Lnhsc3g="}

This is in a separate file called bulk-requests.

Then I run this command:

curl -s -XPOST localhost:9200/_bulk --data-binary @bulk-requests; echo

I got a successful message back, so it is all indexed.

Then I run this command:

curl -XGET 'http://localhost:9200/historicdata/docs/_search' -d '{"fields": [
"content.content_type" ], "query":{"match":{"content.content_type":"text
plain"}}}'

{"took":2,
"timed_out":false,"_shards":{"total":5,"successful":5,"failed":0},
"hits":{"total":2,"max_score":1.0,"hits":[{"_index":"historicdata","_type":"docs","_id":"LMkqzKbyWTGffNtr1mGPZA","_score":1.0,"_source":{"File":"L2hvbWUvYWhhcm1vbi9-ZXN0L)EgUGx1cyAtIFN1bW1hcnkgYnkgVmVudWUucGRM"}},
{"_index":"historicdata","_type":"docs","_id":"GBEIWECwRgiUbYB6pnq7dQ","_score":1.0,"_source":{"File":"L2hvbWUvYWhhcm1vbi90ZXN0L0EgUGx1cyAtIE1lZGlhIFBsYW4gU3VtbWFyeS54bHM="}
}]}}

So it is indexing the documents and the search works, but the contents aren't
being decoded from base64. Maybe there is a general rule with base64 that I
don't know about? I have followed the documentation religiously on GitHub and
on elasticsearch's site. Also, when I decode the base64 within the PHP script
before I put it into the JSON array, it all comes back null. These are .xlsx,
.xls, and .pdf documents.

Thanks for your help, guys; it is greatly appreciated.

Let me know if you need any more information than what I have provided.

--
You received this message because you are subscribed to the Google Groups "elasticsearch" group.
To unsubscribe from this group and stop receiving emails from it, send an email to elasticsearch+unsubscribe@googlegroups.com.
To view this discussion on the web visit https://groups.google.com/d/msgid/elasticsearch/c06948a0-5822-475e-9725-411fddaba903%40googlegroups.com.
For more options, visit https://groups.google.com/d/optout.

Well... I think I may see your issue.

I decoded this string:

L2hvbWUvYWhhcm1vbi90ZXN0L0EgUGx1cyAtIE1lZGlhIFBsYW4gU3VtbWFyeS54bHM=

It is:

/home/aharmon/test/A Plus - Media Plan Summary.xls

Another is:
/home/aharmon/test/A Plus - Summary by Venue.pdf
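
You can verify this with a one-liner (the string below is copied straight
from your bulk file):

<?php
// Decoding the indexed value yields only the file's path, not its contents.
echo base64_decode('L2hvbWUvYWhhcm1vbi90ZXN0L0EgUGx1cyAtIE1lZGlhIFBsYW4gU3VtbWFyeS54bHM=');
// prints: /home/aharmon/test/A Plus - Media Plan Summary.xls
?>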

I think you misunderstand the purpose or how this all fits together.

As I said, you must send the contents of the document to Elasticsearch for
indexing. Sending the file name is not sufficient, unless you are just
hoping to index the file name, but then why all the fuss with the Tika
extension?

Your PHP code needs to read the full binary content of the xls, xlsx, or
PDF file, and then base64 encode that full content (see the sketch below).
The result will be a very large string, roughly 33% larger than the original
file. This is done because base64 uses a safe character set that is
acceptable in a JSON document, while raw binary is not.
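
A minimal sketch of the fix, reusing your directory walk: the essential
change is that base64_encode is fed the file's contents rather than the
SplFileInfo object (which stringifies to the path). Note also that the
document field should match the attachment field declared in your mapping
("content"), where your bulk lines used "File":

<?php
$root = '/home/aharmon/test';
$iters = new RecursiveIteratorIterator(
    new RecursiveDirectoryIterator($root),
    RecursiveIteratorIterator::CHILD_FIRST
);
foreach ($iters as $fullFileName => $iter) {
    if (!$iter->isFile()) {
        continue; // skip directories; the iterator yields them too
    }
    // Read the full binary content, then base64 encode it for JSON safety.
    $base64 = base64_encode(file_get_contents($fullFileName));
    // Emit one action line plus one document line per file, each
    // newline-terminated, as the bulk API expects.
    $action = json_encode(array("index" => array("_index" => "historicdata", "_type" => "docs")));
    $doc = json_encode(array("content" => $base64));
    file_put_contents('/home/aharmon/data.json', $action . "\n" . $doc . "\n", FILE_APPEND);
}
?>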

With this understanding, perhaps you will now see why it has been suggested
that this is not the ideal way to handle a large volume of documents. It
will be more efficient to run Tika locally, build your JSON from the
extracted text, compress it, and then send it to ES.
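
For illustration, a rough sketch of that approach, assuming the standalone
tika-app jar has been downloaded from http://tika.apache.org/ (the jar path
and the field names here are hypothetical):

<?php
$tikaJar = '/home/aharmon/tika-app.jar'; // hypothetical local path
$file = '/home/aharmon/test/A Plus - Summary by Venue.pdf';

// `java -jar tika-app.jar --text <file>` prints the extracted plain text.
$text = shell_exec('java -jar ' . escapeshellarg($tikaJar)
                 . ' --text ' . escapeshellarg($file));

// Index the extracted text as an ordinary string field; no attachment
// type or base64 needed, so the payload stays close to the text's size.
$action = json_encode(array("index" => array("_index" => "historicdata", "_type" => "docs")));
$doc = json_encode(array("file" => $file, "body" => $text));
file_put_contents('/home/aharmon/data.json', $action . "\n" . $doc . "\n", FILE_APPEND);
?>

Searches then run against the body field with whatever analyzer you map,
and the stored file path can be returned in results for download links.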

--
You received this message because you are subscribed to the Google Groups "elasticsearch" group.
To unsubscribe from this group and stop receiving emails from it, send an email to elasticsearch+unsubscribe@googlegroups.com.
To view this discussion on the web visit https://groups.google.com/d/msgid/elasticsearch/CAF9vEEq7hGbOjpryy-j7ce%3Dw3KqY5UP75OB-2ab3TTMtFuKrTg%40mail.gmail.com.
For more options, visit https://groups.google.com/d/optout.