Inserting and Indexing Raw HTML into Elasticsearch

I have to index about 42 million raw HTML files, and the end objective is to search through this HTML. The search is supposed to cover both HTML tags and content. For example:

Insert the following HTML:

<!DOCTYPE html>
<html xmlns="http://www.w3.org/1999/xhtml" lang="en" class=" js flexbox canvas canvastext webgl no-touch geolocation postmessage no-websqldatabase indexeddb hashchange history draganddrop websockets rgba hsla multiplebgs backgroundsize borderimage borderradius boxshadow textshadow opacity cssanimations csscolumns cssgradients no-cssreflections csstransforms csstransforms3d csstransitions fontface generatedcontent video audio localstorage sessionstorage webworkers applicationcache svg inlinesvg smil svgclippaths"><head>
    <meta charset="utf-8" />
    <title>TemplateGuru.co</title>
    <link type="text/css" rel="stylesheet" href="https://fonts.googleapis.com/css?family=Roboto" />
    <link rel="stylesheet" href="https://fonts.googleapis.com/icon?family=Material+Icons" />
    <link href="css/bootstrap.min.css" media="all" type="text/css" rel="stylesheet" />
    <link href="css/normalize.min.css?v=2019025" media="all" type="text/css" rel="stylesheet" />
    <link href="css/main.css?v=190606" media="all" type="text/css" rel="stylesheet" />
    <link href="css/addon.css?v=2019025" media="all" type="text/css" rel="stylesheet" />
    <script>
        window.noMoneyLink = "https://templateguru.co/go/base2.php?id=3";
        window.noMoneyLink2 = "https://templateguru.co/go/base2.php?id=3";
        window.offers = [{"url":"http:\/\/track.qdyqv.com\/aff_c?offer_id=25587&amp;aff_id=39883&amp;aff_sub=convertguru4TEST&amp;aff_sub2=","id":"oighojadcgkohahodblhlfkmiechpfgb","type":"propel"}];
        window.se_offer = "http://typ.navigateto.net/go/aff/redirect?implementation_id=aff555-ty-nf&amp;offer_id=1029&amp;aff_id=84&amp;source=googlecancel&amp;aff_sub5=converter_&amp;sendToWebstore=true&amp;retry=1";
        window.extension_name = "Templates Guru";
        window.is_mac = false;
        window.show_loading = true;
    </script>
</head>

and then search for the following:

<script>
        window.noMoneyLink = "https://templateguru.co/go/base2.php?id=3";
        window.noMoneyLink2 = "https://templateguru.co/go/base2.php?id=3";
        window.offers = [{"url":"http:\/\/track.qdyqv.com\/aff_c?offer_id=25587&amp;aff_id=39883&amp;aff_sub=convertguru4TEST&amp;aff_sub2=","id":"oighojadcgkohahodblhlfkmiechpfgb","type":"propel"}];
        window.se_offer = "http://typ.navigateto.net/go/aff/redirect?implementation_id=aff555-ty-nf&amp;offer_id=1029&amp;aff_id=84&amp;source=googlecancel&amp;aff_sub5=converter_&amp;sendToWebstore=true&amp;retry=1";
        window.extension_name = "Templates Guru";
        window.is_mac = false;
        window.show_loading = true;
    </script>

I am using the following mapping as of now.

{
  "test_index": {
    "mappings": {
      "properties": {
        "created_on": {
          "type": "text",
          "fields": {
            "keyword": {
              "type": "keyword",
              "ignore_above": 256
            }
          }
        },
        "html": {
          "type": "text",
          "fields": {
            "keyword": {
              "type": "keyword",
              "ignore_above": 256
            }
          }
        }
      }
    }
  }
}
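
For context, here is a minimal sketch of how a search for such a fragment might be expressed against this mapping. It assumes a match_phrase query on the analyzed html field; the helper name phrase_query and the sample fragment are illustrative only, and this is not a confirmed solution.

```python
import json


def phrase_query(fragment, field="html"):
    """Return a search body that looks for `fragment` as a phrase in `field`.

    `phrase_query` is a hypothetical helper; `html` is the text field
    from the mapping above.
    """
    return {"query": {"match_phrase": {field: fragment}}}


# Sample fragment taken from the <script> block in the question.
fragment = 'window.extension_name = "Templates Guru";'

# Serialize with json.dumps so quotes inside the fragment are escaped.
search_body = json.dumps(phrase_query(fragment))
print(search_body)
```

Note that match_phrase compares analyzed tokens in order, so how tags and punctuation are treated depends on the analyzer of the html field. The html.keyword subfield is unlikely to help with whole pages, because "ignore_above": 256 skips indexing any value longer than 256 characters in that subfield.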

The problem is that the insertion sometimes gets stuck, with the following error in the Elasticsearch logs.

{"type": "server", "timestamp": "2019-09-02T07:32:33,800-0400", "level": "DEBUG", "component": "o.e.a.b.TransportShardBulkAction", "cluster.name": "rnet-1", "node.name": "n2", "cluster.uuid": "wDe7HdVWRZmOaiW7nwKBRg", "node.id": "GAZddULmT42NIrLgbv66AA",  "message": "[test_raw_data_insertion2][0] failed to execute bulk item (index) index {[test_raw_data_insertion2][_doc][6RO_8WwB9lHwCadse92H], source[n/a, actual length: [109.9kb], max length: 2kb]}" ,

and sometimes the error changes to the following:

Caused by: com.fasterxml.jackson.core.JsonParseException: Illegal unquoted character ((CTRL-CHAR, code 13)): has to be escaped using backslash to be included in string value
at [Source: org.elasticsearch.common.bytes.BytesReference$MarkSupportingStreamInputWrapper@9289079; line: 1, column: 44938]
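
Character code 13 is a carriage return (\r), which JSON does not allow unescaped inside a string value. The sketch below shows one way to build a bulk body so that such characters are escaped. It assumes the request body is assembled in Python; the helper name bulk_lines and the created_on value are illustrative and not taken from my actual ingestion code.

```python
import json


def bulk_lines(index_name, docs):
    """Build an NDJSON bulk body: one action line and one source line per doc.

    `bulk_lines` is a hypothetical helper. json.dumps escapes control
    characters such as \r and \n inside string values.
    """
    lines = []
    for doc in docs:
        lines.append(json.dumps({"index": {"_index": index_name}}))
        lines.append(json.dumps(doc))
    # The bulk API expects newline-delimited JSON ending with a newline.
    return "\n".join(lines) + "\n"


# Raw HTML with Windows-style line endings (contains carriage returns).
raw_html = "<script>\r\n    window.is_mac = false;\r\n</script>"

# The created_on value is a made-up placeholder for illustration.
body = bulk_lines("test_index", [{"created_on": "2019-09-02", "html": raw_html}])
print(body)
```

Building the document with a JSON serializer instead of string concatenation means the raw HTML never has to be escaped by hand, and the original text, line endings included, comes back unchanged when the document is parsed.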

Can someone please check what exactly is wrong here?
