Indexing html as raw content

pcyber · September 8, 2022, 12:04pm

Hi folks, I'm really new to Elasticsearch and am trying to use it for my project.

What I want to do, in simple form:
get the HTML source from a website with Python
push the source to Elasticsearch
make it searchable

So I'm getting the HTML source with Python and pushing it to Elasticsearch as a base64-encoded attachment, using this JSON:

data = {
    "url": new_domain,
    "cloudflare": "false",
    "status": "online",
    "timestamp": timestamp,
    "encoded_doc": base64page_source,
}

and this request:

response = requests.post('http://es.local:9200/test/doc/?pipeline=doc-parser', data=json.dumps(data), verify=False, headers = headers)
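
For reference, the encoding step that produces the value of encoded_doc might look like the sketch below. The function name build_document and the sample HTML string are assumptions made for illustration; only the field names come from the post.

```python
import base64
import json


def build_document(url: str, page_source: str, timestamp: str) -> dict:
    """Build the body of the index request (hypothetical helper)."""
    # The attachment processor reads base64 text from the field, so encode
    # the UTF-8 bytes of the page and turn the result back into a str.
    encoded = base64.b64encode(page_source.encode("utf-8")).decode("ascii")
    return {
        "url": url,
        "cloudflare": "false",
        "status": "online",
        "timestamp": timestamp,
        "encoded_doc": encoded,
    }


# Sample values; the HTML string is a stand-in for the fetched page source.
doc = build_document(
    "some-url.com",
    '<html lang="en"><head></head></html>',
    "20220908-132744",
)
body = json.dumps(doc)  # this string is what gets passed as data=... in the request
```

Decoding encoded_doc with base64.b64decode gives back the original markup byte for byte, so the raw source is still present in the document even though attachment.content is not.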

I post it to an ingest pipeline named doc-parser, which has the following entries:
[
  {
    "attachment": {
      "field": "encoded_doc"
    }
  }
]
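
A pipeline like this can be registered through the ingest API (PUT _ingest/pipeline/<name>). The sketch below only builds the URL and JSON body; the helper name pipeline_request and the description text are assumptions, while the host, pipeline name and processor come from the post.

```python
import json


def pipeline_request(base_url: str, name: str = "doc-parser") -> tuple[str, str]:
    """Return the URL and JSON body for registering the pipeline (hypothetical helper)."""
    url = f"{base_url}/_ingest/pipeline/{name}"
    body = {
        # Free-text description; the wording is an assumption.
        "description": "Extract text from the base64 field encoded_doc",
        # Same single attachment processor as listed above.
        "processors": [{"attachment": {"field": "encoded_doc"}}],
    }
    return url, json.dumps(body)


url, body = pipeline_request("http://es.local:9200")
```

The pair could then be sent with something like requests.put(url, data=body, headers={"Content-Type": "application/json"}).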

What I get in the database is this:

{
  "took" : 3,
  "timed_out" : false,
  "_shards" : {
    "total" : 1,
    "successful" : 1,
    "skipped" : 0,
    "failed" : 0
  },
  "hits" : {
    "total" : {
      "value" : 1,
      "relation" : "eq"
    },
    "max_score" : 0.28818804,
    "hits" : [
      {
        "_index" : "test",
        "_type" : "doc",
        "_id" : "KJO5HIMBiXf5Ym7v4is_",
        "_score" : 0.28818804,
        "_ignored" : [
          "attachment.content.keyword",
          "encoded_doc.keyword"
        ],
        "_source" : {
          "attachment" : {
            "content" : """You need to enable JavaScript to run this app.
                  
                
            
        	Registration
	Login


	Live 
	TV games
	popular
	New
...
""",
            "content_length" : 2618
          },
          "cloudflare" : "true",
          "url" : "some-url.com",
          "encoded_doc" : " PGh0bWwgbGFuZz0iZW4iPjxoZWFkPjxtZXRhIGNoYXJzZXQ9InV0Zi04Ij48bGluayByZWw9ImFwcGxlLXRvdWNoLWljb24iIHNpemVzPSIxODB4MTgwIiBocmVmPSIvYXBwbGUtdG91Y2gtaWNvbi5wbmciPj
some base64content
        "status" : "online",
          "timestamp" : "20220908-132744"
        }
      }
    ]
  }
}

Now what I want:
Elasticsearch somehow extracts the text from the base64 content, but in a filtered way.
What I want is the raw HTML source, with all filenames, script code and so on,
like this:

<html lang="en"><head><meta charset="utf-8"><link rel="apple-touch-icon" sizes="180x180" href="/apple-touch-icon.png"><link rel="icon" type="image/png" sizes="32x32" href="/favicon-32x32.png"><link rel="icon" type="image/png" sizes="16x16" href="/favicon-16x16.png"><link rel="manifest" href="/site.webmanifest"><link rel="mask-icon" href="/safari-pinned-tab.svg" color="#808080">
...... and so on

I only want to make the raw source searchable. So if I search for, for example, "car",
I want to get all URLs whose raw source contains that word (or any other word in the raw source).

It should find "car" in "img=thisisTheNewCarofmyfriend.jpg".

I think it has something to do with the pipeline processor, but I can't figure out how to resolve it.

Thank you for your help.
Greetings

dadoonet · September 8, 2022, 7:11pm

Maybe have a look at the wildcard field type?

And apply it to the attachment.content field.
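
A search against such a field could use a wildcard query with leading and trailing asterisks. This is a minimal sketch: the helper name wildcard_query is made up, and the case_insensitive flag assumes an Elasticsearch version recent enough to support it on wildcard queries (it is what would let "car" match "Car").

```python
def wildcard_query(term: str, field: str = "attachment.content") -> dict:
    """Build a search body matching `term` anywhere inside `field` (hypothetical helper)."""
    return {
        "query": {
            "wildcard": {
                field: {
                    # Asterisks on both sides turn this into a substring match.
                    "value": f"*{term}*",
                    # Assumption: the cluster version supports this flag.
                    "case_insensitive": True,
                }
            }
        }
    }


search_body = wildcard_query("car")
```

The body would go to the index's _search endpoint, for example as the JSON payload of a POST to http://es.local:9200/test/_search.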

pcyber · September 10, 2022, 10:28am

Thank you for your input, it's up and running.

Here are my mappings:

 {
     "mappings" : {
       "properties" : {
         "attachment" : {
           "properties" : {
             "content" : {
               "type" : "wildcard"
             },
             "content_length" : {
               "type" : "long"
             },
             "content_type" : {
               "type" : "text",
               "fields" : {
                 "keyword" : {
                   "type" : "keyword",
                   "ignore_above" : 256
                 }
               }

            },
and so on.......
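
Note that attachment.content holds the text extracted by the attachment processor, not the markup. A possible variation, not something tried in this thread, is to send the page source as an ordinary string in its own wildcard field and skip the pipeline for that field. In the sketch below the field name raw_html, the helper names and the sample HTML are all assumptions.

```python
def raw_html_index_body() -> dict:
    """Index-creation body with a wildcard field for the markup (hypothetical)."""
    return {
        "mappings": {
            "properties": {
                "url": {"type": "keyword"},
                # raw_html is an invented field name holding the unmodified source.
                "raw_html": {"type": "wildcard"},
            }
        }
    }


def raw_html_document(url: str, page_source: str) -> dict:
    """Document body carrying the page source as a plain string, no base64."""
    return {"url": url, "raw_html": page_source}


index_body = raw_html_index_body()
doc = raw_html_document(
    "some-url.com",
    '<html><body><img src="thisisTheNewCarofmyfriend.jpg"></body></html>',
)
```

With this layout a wildcard query for *car* on raw_html would run against tag attributes and script code as well as visible text.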

system · October 8, 2022, 10:28am

This topic was automatically closed 28 days after the last reply. New replies are no longer allowed.
