Indexing web pages

Hi,

I'm trying to index web pages so that they can later be searched based on
their content. The problem is that I don't know the appropriate approach
for dealing with the content. I tried using the attachment plugin, and also
storing the content as a document field, but either the field is not stored
at all or it can't be searched. If I encode the InputStream as a base64
stream, the content is stored in ES but can't be searched. If I convert the
InputStream to a byte array, nothing is stored in the content field. The
only way I managed to make this work was to save the web page as a local
file and then index it as an attachment.

Could you please suggest an appropriate approach, or give me an example of
how I should do this?

Thanks,
Zoran

--
You received this message because you are subscribed to the Google Groups "elasticsearch" group.
To unsubscribe from this group and stop receiving emails from it, send an email to elasticsearch+unsubscribe@googlegroups.com.
For more options, visit https://groups.google.com/groups/opt_out.

Could you share a Gist of what you did with the attachment plugin?
It should work.

--
David ;-)
Twitter : @dadoonet / @elasticsearchfr / @scrutmydocs

On 22 June 2013, at 08:12, Zoran Jeremic zoran.jeremic@gmail.com wrote:


--
You received this message because you are subscribed to the Google Groups "elasticsearch" group.
To unsubscribe from this group and stop receiving emails from it, send an email to elasticsearch+unsubscribe@googlegroups.com.
For more options, visit https://groups.google.com/groups/opt_out.

Is there any reason the webpage field couldn't be stored as a String? That's what I've done personally in the past to store HTML content.
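That approach can be made explicit in the mapping. A minimal sketch, assuming a hypothetical `webpage` type and 0.90-era syntax (an analyzed `string` field is what makes the content searchable):

```json
{
  "webpage": {
    "properties": {
      "url":     { "type": "string", "index": "not_analyzed" },
      "content": { "type": "string", "index": "analyzed" }
    }
  }
}
```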

Ahh, I see now you're wanting to actually tokenize parts of the HTML document to make it searchable.

If I were you, I'd just strip out the HTML (taking just the text content of the page) and use the standard tokenizer. That should allow the searching of pages based on their textual content, though I can't attest to performance considering webpages can be quite large. You might be able to provide a more strict pattern-based tokenizer to cut down on the size of indexed content.
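To make the strip-out step concrete, here is a minimal sketch using only the Python standard library (illustrative only; the thread is Java-based, and a production version would want a real HTML extractor for malformed markup). It keeps the visible text and drops script/style blocks:

```python
from html.parser import HTMLParser


class TextExtractor(HTMLParser):
    """Collects visible text, skipping <script> and <style> blocks."""

    def __init__(self):
        super().__init__()
        self._skip_depth = 0
        self._chunks = []

    def handle_starttag(self, tag, attrs):
        if tag in ("script", "style"):
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in ("script", "style") and self._skip_depth:
            self._skip_depth -= 1

    def handle_data(self, data):
        if not self._skip_depth:
            self._chunks.append(data)

    def text(self):
        # collapse the runs of whitespace left behind by the markup
        return " ".join("".join(self._chunks).split())


def strip_html(html):
    parser = TextExtractor()
    parser.feed(html)
    return parser.text()
```

The resulting plain text can then be indexed as an ordinary analyzed field, with no attachment plugin involved.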

What the mapper attachment plugin and Tika do under the hood is extract metadata from the HTML and index it as fields.
You can do that yourself as well.
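A minimal sketch of that do-it-yourself extraction, again Python-stdlib-only for brevity (it handles just the `<title>` text and name/content `<meta>` pairs; Tika covers far more):

```python
from html.parser import HTMLParser


class MetaExtractor(HTMLParser):
    """Pulls out the <title> text and name/content <meta> pairs."""

    def __init__(self):
        super().__init__()
        self.title = ""
        self.meta = {}
        self._in_title = False

    def handle_starttag(self, tag, attrs):
        if tag == "title":
            self._in_title = True
        elif tag == "meta":
            d = dict(attrs)
            if "name" in d and "content" in d:
                self.meta[d["name"]] = d["content"]

    def handle_endtag(self, tag):
        if tag == "title":
            self._in_title = False

    def handle_data(self, data):
        if self._in_title:
            self.title += data


def extract_fields(html):
    """Return a flat dict of fields ready to index as a document."""
    parser = MetaExtractor()
    parser.feed(html)
    return {"title": parser.title.strip(), **parser.meta}
```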

--
David ;-)
Twitter : @dadoonet / @elasticsearchfr / @scrutmydocs

On 22 June 2013, at 17:45, Ben Hundley ben@qbox.io wrote:


Hi guys,

Could you Gist what you did using the attachment plugin?
Because it should work.

I committed it to Bitbucket.
It stores the documents, but I can't search over them and find terms that
exist in the web page content. I tried both Elasticsearch Head and the Java
client, and the content from the web page could not be found. However,
storing a PDF or another document, even an HTML page saved on the PC, works
fine, and I can search over the content and find the terms.
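One detail that may be relevant here: the mapper-attachments plugin expects the field value to be the raw bytes of the document encoded as a single base64 string. Passing a stream wrapper, or base64-encoding content that the client then encodes again, are easy ways to end up with stored-but-unsearchable data. A minimal sketch of preparing the field (Python for brevity; the Java client would do the same with a Base64 encoder):

```python
import base64


def to_attachment_field(raw_bytes):
    """Encode the document bytes as the single base64 string that the
    mapper-attachments plugin expects as the attachment field value."""
    return base64.b64encode(raw_bytes).decode("ascii")


# hypothetical page body, as fetched from a URL
html = b"<html><body>Hello attachment</body></html>"
doc = {"content": to_attachment_field(html)}
```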

Ahh, I see now you're wanting to actually tokenize parts of the HTML
document
to make it searchable.

I'm actually trying to implement a recommendation service. The scenario I
have is as follows: the application is a learning system where students can
upload documents or links to interesting web pages. These documents and web
pages can later be recommended to other students if they are relevant to
what a student is currently working on. I implemented this before with
Lucene and Tika: I extracted a vector of terms related to the student's
current working context and then compared that vector with all documents
stored in the Lucene repository, using MoreLikeThis to find documents
similar to the provided vector of terms. All stored documents (web pages)
were described by a vector of the most interesting terms, extracted earlier
while indexing the web page.
I didn't like this approach very much, as I had to re-index all the
documents after a certain period of time: as the repository was populated
with new documents, the IDF changed, which affected the list of interesting
terms for each document. I thought it might be possible to tokenize the
content of the web page and use that instead of the interesting terms,
provided the performance is still good.
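As an aside, the Lucene MoreLikeThis machinery described above is exposed in Elasticsearch as the more_like_this query, so the comparison step can be pushed to the server. A hedged sketch, with a hypothetical content field and 0.90-era syntax:

```json
{
  "query": {
    "more_like_this": {
      "fields": ["content"],
      "like_text": "terms extracted from the student's current context",
      "min_term_freq": 1,
      "min_doc_freq": 1
    }
  }
}
```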

I'm not sure how ES works on the back end, or how the content of the web
page is stored if I provide an InputStream as the content. Will the plugin
do the rest of the job and strip out the HTML code, or should I take care
of that myself? Is it better to use the old approach and extract
interesting terms from the content, or can I store the whole web page
content? I know Lucene's MoreLikeThis has a getInterestingTerms function
for this, but I couldn't find anything similar in ES. Is it implemented?

I would appreciate your suggestions on this.

Best,
Zoran

On Friday, 21 June 2013 23:12:28 UTC-7, Zoran Jeremic wrote:


Zorane,

Sounds like you should first take a look at Nutch or Droids from Apache.

Otis

Search Analytics | Sematext
Performance Monitoring | Sematext

On Saturday, June 22, 2013 2:12:28 AM UTC-4, Zoran Jeremic wrote:
