Hi folks, I'm really new to Elasticsearch and trying to use it for my project.
What I want to do, in simple form:
get the HTML source from a website with Python
push the source to Elasticsearch
make it searchable
So I'm getting the HTML source with Python and pushing it to Elasticsearch as a base64-encoded attachment with this JSON:
data = {
    "url": new_domain,
    "cloudflare": "false",
    "status": "online",
    "timestamp": timestamp,
    "encoded_doc": base64page_source
}
and this request:
response = requests.post('http://es.local:9200/test/doc/?pipeline=doc-parser', data=json.dumps(data), verify=False, headers = headers)
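A minimal sketch of how a payload like the one above could be built, assuming base64page_source is simply the UTF-8 page source encoded with base64 (the post does not show that step). build_payload is a hypothetical helper name and the sample values are made up for illustration; the HTTP call itself is left out so the snippet runs offline.

```python
import base64
import json


def build_payload(url, page_source, timestamp):
    """Hypothetical helper: build the JSON body sent to the doc-parser pipeline."""
    # Assumption: the attachment field carries the page source as base64 text.
    encoded = base64.b64encode(page_source.encode("utf-8")).decode("ascii")
    return {
        "url": url,
        "cloudflare": "false",
        "status": "online",
        "timestamp": timestamp,
        "encoded_doc": encoded,
    }


# Made-up sample values, only to show the shape of the document.
sample_html = '<html lang="en"><head><meta charset="utf-8"></head></html>'
payload = build_payload("some-url.com", sample_html, "20220908-132744")
body = json.dumps(payload)  # this string would be the request body
```

Decoding payload["encoded_doc"] with base64.b64decode gives back exactly the original source, so the raw HTML is still present in the document at this point.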
I post it to an ingest pipeline named doc-parser, which has the following processors:
[
  {
    "attachment": {
      "field": "encoded_doc"
    }
  }
]
What I get in the database is this:
{
"took" : 3,
"timed_out" : false,
"_shards" : {
"total" : 1,
"successful" : 1,
"skipped" : 0,
"failed" : 0
},
"hits" : {
"total" : {
"value" : 1,
"relation" : "eq"
},
"max_score" : 0.28818804,
"hits" : [
{
"_index" : "test",
"_type" : "doc",
"_id" : "KJO5HIMBiXf5Ym7v4is_",
"_score" : 0.28818804,
"_ignored" : [
"attachment.content.keyword",
"encoded_doc.keyword"
],
"_source" : {
"attachment" : {
"content" : """You need to enable JavaScript to run this app.
Registration
Login
Live
TV games
popular
New
...
""",
"content_length" : 2618
},
"cloudflare" : "true",
"url" : "some-url.com",
"encoded_doc" : " PGh0bWwgbGFuZz0iZW4iPjxoZWFkPjxtZXRhIGNoYXJzZXQ9InV0Zi04Ij48bGluayByZWw9ImFwcGxlLXRvdWNoLWljb24iIHNpemVzPSIxODB4MTgwIiBocmVmPSIvYXBwbGUtdG91Y2gtaWNvbi5wbmciPj
some base64content
"status" : "online",
"timestamp" : "20220908-132744"
}
}
]
}
}
Now what I want:
Elasticsearch does somehow extract the text from the base64 content, but only in a filtered form.
What I want is the raw HTML source with all filenames, script code and so on,
like this:
<html lang="en"><head><meta charset="utf-8"><link rel="apple-touch-icon" sizes="180x180" href="/apple-touch-icon.png"><link rel="icon" type="image/png" sizes="32x32" href="/favicon-32x32.png"><link rel="icon" type="image/png" sizes="16x16" href="/favicon-16x16.png"><link rel="manifest" href="/site.webmanifest"><link rel="mask-icon" href="/safari-pinned-tab.svg" color="#808080">
...... and so on
I only want to make the raw source searchable. So if I'm searching for "car", for example,
I want to get all URLs whose raw source contains that word (or any other word I search for).
It should also find "car" in "img=thisisTheNewCarofmyfriend.jpg".
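To pin down the matching that is being asked for, here is a small sketch in plain Python (not Elasticsearch): a case-insensitive substring check against the decoded raw source. raw_source_contains is a hypothetical name used only for illustration, and the sample string is the one from the example above.

```python
import base64


def raw_source_contains(encoded_doc, term):
    """Hypothetical check: is term a case-insensitive substring of the raw source?"""
    raw = base64.b64decode(encoded_doc).decode("utf-8", errors="replace")
    return term.lower() in raw.lower()


# The example string from the post, base64-encoded like the indexed field.
sample = base64.b64encode(b"img=thisisTheNewCarofmyfriend.jpg").decode("ascii")

found = raw_source_contains(sample, "car")      # matches inside "...NewCarof..."
missing = raw_source_contains(sample, "truck")  # not present anywhere
```

This only describes the desired result. It says nothing about how Elasticsearch would have to be configured to search this way.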
I think it has something to do with the pipeline processor, but I can't figure out how to resolve it.
Thank you for your help...
Greetings