Index HTML documents

mjk · April 7, 2013, 1:56am

I have been trying to use the attachment plugin to index HTML documents. I
want to start by understanding two things: a) how to control a mapping, and
b) how to control what gets stored and what does not. This seems like it
should be trivial, but after several hours of searching and experimenting
with various mapping definitions, nothing seems to have affected the index.

I would like to store the Base64-decoded document, NOT the Base64-encoded
document, and use a char_filter to strip out the markup.

Here is what I have been using to create my mapping:

curl -XPUT http://localhost:9200/myindex/htmldoc/_mapping -d '{
    "htmldoc": {
        "properties": {
            "_source": {
                "enabled": "false"
            },
            "contents": {
                "type": "string",
                "analyzer": "htmlContentAnalyzer",
                "store": "yes"
            },
            "file": {
                "type": "attachment",
                "fields": {
                    "file": {
                        "store": "no"
                    },
                    "date": {
                        "store": "yes"
                    },
                    "author": {
                        "store": "yes"
                    }
                }
            },
            "header-Connection": {
                "type": "string"
            },
            "header-Content-Length": {
                "type": "string"
            },
            "header-Content-Type": {
                "type": "string"
            },
            "header-Keep-Alive": {
                "type": "string"
            },
            "header-Server": {
                "type": "string"
            },
            "header-Transfer-Encoding": {
                "type": "string"
            },
            "header-Vary": {
                "type": "string"
            }
        }
    }
}'
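
A note on the mapping above: in the 0.20-era mapping syntax, `_source` is normally a type-level setting that sits next to `properties`, not a field inside it, which may be one reason the setting appears to have no effect. The following is only a minimal Python sketch of that layout, not a verified fix. The helper name `build_htmldoc_mapping` is hypothetical, and only a few fields from the original mapping are repeated.

```python
import json


def build_htmldoc_mapping(source_enabled=False):
    """Build a mapping body with _source as a sibling of properties.

    Assumption: 0.20-era mapping syntax with the attachment plugin installed.
    """
    return {
        "htmldoc": {
            # Type-level setting, placed outside "properties".
            "_source": {"enabled": source_enabled},
            "properties": {
                "contents": {
                    "type": "string",
                    "analyzer": "htmlContentAnalyzer",
                    "store": "yes",
                },
                "file": {
                    "type": "attachment",
                    "fields": {
                        "file": {"store": "no"},
                        "date": {"store": "yes"},
                        "author": {"store": "yes"},
                    },
                },
                "header-Keep-Alive": {"type": "string"},
            },
        }
    }


if __name__ == "__main__":
    # Print the JSON so it can be pasted into a curl -d '...' body.
    print(json.dumps(build_htmldoc_mapping(), indent=2))
```

Note that with `_source` disabled and the attachment's `file` field not stored, there is no original text left for the result set to display, so at least one of the two probably needs to be kept.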

I have the attachment plugin installed, and I know it's installed because
the first time I tried to create this mapping, I got a well-documented
exception that basically meant the attachment plugin was missing. I
installed version 1.6 of the plugin because the GitHub site seemed to
indicate that this was the supported version for my ES instance (0.20.5).
After installing the plugin, the exception went away.

For now, when I search, I want the returned results to show some of the
HTML.

--mike

--
You received this message because you are subscribed to the Google Groups "elasticsearch" group.
To unsubscribe from this group and stop receiving emails from it, send an email to elasticsearch+unsubscribe@googlegroups.com.
For more options, visit https://groups.google.com/groups/opt_out.

vineeth_mohan · April 7, 2013, 3:24am

On Sun, Apr 7, 2013 at 7:26 AM, mjk mj.kelleher@gmail.com wrote:

I have been trying to use the attachment plugin to index HTML documents.
I want to do a couple things to start to build an understanding of: a) how
to control a mapping b) control what gets stored and what does not. Seems
like this should be trivial, but after several hours of searching and
experimenting with various mapping definitions, nothing seems to have
affected the index.

Analyzers are what you are looking for

I would like to store the Base64 Un-encoded document, NOT the Base64
Encoded document, and using the char_filter, strip out the markup.

Here is what I have been using to create my mapping:

curl -XPUT http://localhost:9200/myindex/htmldoc/_mapping -d '{
"htmldoc": {
"properties": {
"_source": {
"enabled": "false"
},
"contents": {
"type": "string",
"analyzer": "htmlContentAnalyzer",

I am not aware of a built-in analyzer called htmlContentAnalyzer, but if it
is meant for stripping HTML tags, the html_strip character filter works for
me.

For this to work, you need to define the analyzer in the index settings JSON
(sent when the index is created) and then reference the analyzer by name in
the field mapping, as you have done. Here is an example of creating a custom
analyzer that strips HTML:

https://gist.github.com/Vineeth-Mohan/5328791

curl -X PUT "http://$hostname:9200/index_name" -d '{
    "index" : {
        "number_of_shards" : 2,
        "number_of_replicas" : 1,
        "analysis" : {
            "analyzer" : {
                "html" : {
                    "type" : "custom",
                    "tokenizer" : "standard",
                    "filter" : ["lowercase", "stop"],

This file has been truncated.
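
Because the gist preview is cut off, here is a separate minimal sketch (not a reconstruction of the gist) of index settings that define a custom analyzer using the html_strip character filter. The analyzer name `htmlContentAnalyzer` comes from the mapping in the question; the helper name `build_index_settings` is hypothetical, and the body is assumed to be sent when the index is created.

```python
import json


def build_index_settings(analyzer_name="htmlContentAnalyzer"):
    """Build index-creation settings with a custom HTML-stripping analyzer."""
    return {
        "settings": {
            "analysis": {
                "analyzer": {
                    analyzer_name: {
                        "type": "custom",
                        # Character filters run before the tokenizer,
                        # so markup is removed before tokens are produced.
                        "char_filter": ["html_strip"],
                        "tokenizer": "standard",
                        "filter": ["lowercase", "stop"],
                    }
                }
            }
        }
    }


if __name__ == "__main__":
    # Print the JSON so it can be used as the body of the index-creation request.
    print(json.dumps(build_index_settings(), indent=2))
```

The analyzer name used in the settings has to match the `analyzer` value on the field in the mapping, otherwise the mapping refers to an analyzer the index does not know about.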

HTH

mjk · April 7, 2013, 7:10pm

Vineeth,

Thanks for your reply. This was a complete newbie error and
misunderstanding on my part.

Apparently the attachment plugin does decode, strip the markup, and index
the document source. I have been able to search on text within the HTML
pages, and ES returns the correct result set. However, I am now trying to
figure out how to have the result set include the matching term together
with some of the surrounding text, which will eventually be used within a
SERP. I am doing this all from the CLI for now to build the fundamentals,
and when I have that down, I will use the Java API to build my SERP.

Here is my query:

curl -XGET localhost/myindex/mytype?pretty=true -d '{
    "fields" : [ "*" , "header-Keep-Alive" ],
    "query": {
        "query_string": {
            "query": "triangle"
        }
    }
}'

echo ""
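
One way to get the matching term with some surrounding text is the `highlight` section of a search request. The following is a minimal sketch, assuming the body is sent to a `_search` endpoint and that the highlighted field is stored (or `_source` is enabled), since the highlighter needs the original text to build fragments. The field name `file`, the fragment sizes, and the helper name `build_highlight_query` are assumptions for illustration.

```python
import json


def build_highlight_query(term, field="file", fragment_size=150, fragments=3):
    """Build a search body that asks for highlighted fragments around matches."""
    return {
        "query": {"query_string": {"query": term}},
        "highlight": {
            "fields": {
                field: {
                    # Approximate length of each returned snippet, in characters.
                    "fragment_size": fragment_size,
                    # Maximum number of snippets returned per hit.
                    "number_of_fragments": fragments,
                }
            }
        },
    }


if __name__ == "__main__":
    # Print the JSON so it can be pasted into a curl -d '...' body.
    print(json.dumps(build_highlight_query("triangle"), indent=2))
```

Each hit in the response would then carry a `highlight` entry with the fragments, which is the part a SERP would typically display.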

I also need to figure out how to store individual fields from _source, such
as "header-Keep-Alive", so that they can be searched and returned directly
rather than extracted from _source at query time.

Thanks,

--mike
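
On the last question, it may help to separate two mapping settings: `index` controls whether a field can be searched, while `store` controls whether its value can be returned without reading `_source`. The following is a minimal sketch, assuming 0.20-era string mappings; the helper names `build_header_field_mapping` and `build_fields_query` are hypothetical.

```python
import json


def build_header_field_mapping(searchable=True, stored=True):
    """Build a string field mapping with independent index/store settings."""
    return {
        "type": "string",
        # "analyzed" makes the field searchable; "no" leaves it out of the index.
        "index": "analyzed" if searchable else "no",
        # "yes" keeps a copy of the value that can be fetched on its own.
        "store": "yes" if stored else "no",
    }


def build_fields_query(term, fields):
    """Build a search body that requests specific fields for each hit."""
    return {
        "fields": list(fields),
        "query": {"query_string": {"query": term}},
    }


if __name__ == "__main__":
    mapping = {"header-Keep-Alive": build_header_field_mapping()}
    print(json.dumps(mapping, indent=2))
    print(json.dumps(build_fields_query("triangle", ["header-Keep-Alive"]), indent=2))
```

A field can therefore be searchable without being stored, and stored without being searchable; which combination is right depends on what the SERP needs to show.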


vineeth_mohan · April 8, 2013, 5:12am

I didn't fully understand your requirements, but -

this might help.

Thanks
Vineeth

