How to index a file with elasticsearch 5.5.1

bilpor · August 3, 2017, 7:39am

I'm new to Elasticsearch. I have successfully installed Elasticsearch with Kibana, X-Pack and ingest-attachment, and both Elasticsearch and Kibana are running. I have kept it simple for now, installing with the default options on a Windows 2012 server. I have a directory on another drive, w\mydocs, which at the moment holds just 3 plain text files, but I will want to add other file types such as PDF and DOC. Now I want to get these files into the Elasticsearch index.

I have set a pipeline and index up as follows.

PUT _ingest/pipeline/docs 
{
  "description": "documents",
  "processors" : [
    {
      "attachment" : {
        "field": "data",
        "indexed_chars" : -1
      }
    }
  ]
}

PUT myindex
{
  "mappings" : {
    "documents" : {
      "properties" : {
        "attachment.data" : {
          "type": "text",
          "analyzer": "standard"
        }
      }
    }
  }
}

I then try to get my document into the index in the following way:
PUT localhost:9200/documents/1?pipeline=docs -d @/w/mydocs/README.TXT

Looking around the net, it seems I have to somehow get my document into Base64, but then I read that in this release Elasticsearch doesn't store the document, only the reference. When I come to set Elasticsearch up for live use we will have thousands of documents, and I can't believe that I would need to encode all of them to Base64 first. If I do have to, how do I do it? My PUT statement above just fails.
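For readers wondering what the Base64 step looks like: the ingest-attachment processor reads the file content from the configured field ("data" in the pipeline above) as a Base64 string inside a JSON document. The following is a minimal Python sketch of that encoding step only, not a complete indexing workflow. The helper name build_attachment_body is hypothetical, and the target request in the comment (myindex/documents/1 with pipeline=docs) is an assumption based on the names used in this post.

```python
import base64
import json


def build_attachment_body(content: bytes, field: str = "data") -> str:
    """Return a JSON document whose `field` holds `content` as Base64 text.

    `field` must match the "field" configured in the attachment processor
    ("data" in the pipeline shown in this post).
    """
    encoded = base64.b64encode(content).decode("ascii")
    return json.dumps({field: encoded})


# Demonstration on in-memory bytes. For a real file, the bytes could come
# from pathlib.Path(<your file>).read_bytes().
# The resulting string is the request body you would send to something like
#   PUT myindex/documents/1?pipeline=docs   (assumed names)
body = build_attachment_body(b"This is my text")
print(body)
```

Looping such a helper over a directory is roughly what a crawler does for you, which is why a tool that handles the extraction and upload is usually the more practical route for thousands of files.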

dadoonet · August 3, 2017, 8:08am

Yes. Elasticsearch speaks JSON. You need to provide a JSON document. You can't "upload" a binary document.

Have a look at FSCrawler. It now has a REST endpoint which can act as a proxy to elasticsearch. So you can run:

echo "This is my text" > test.txt
curl -F "file=@test.txt" "http://127.0.0.1:8080/fscrawler/_upload"

HTH

bilpor · August 3, 2017, 8:28am

So what does ingest-attachment do? If I have to use FSCrawler, is there any point to the ingest-attachment plugin?

dadoonet · August 3, 2017, 9:07am

No. FSCrawler does a similar thing, but outside the context of Elasticsearch (it runs in a separate process/JVM).

bilpor · August 3, 2017, 9:12am

Hi David,

I have downloaded FSCrawler 2.3 from GitHub. The installation guide doesn't mention anything about installing in a Windows environment. Do I just need to unzip it to a subdirectory within Elasticsearch? My Elasticsearch is at the default endpoint of http://localhost:9200 at the moment. How do I let one know about the other?

bilpor · August 3, 2017, 12:44pm

Hi David, I unpacked FSCrawler to my C:\Program Files\Elastic\FScrawler directory. Then, using PowerShell, I went to the bin directory and ran .\fscrawler mydocs --loop 1. The job didn't exist, so I said yes to creating it. It has created a directory .fscrawler under C:\Users\wCrawley. Can I move this directory to a more appropriate place, and will it still run?

dadoonet · August 3, 2017, 1:00pm

--config_dir is what you are looking for I guess: https://github.com/dadoonet/fscrawler#cli-options

bilpor · August 3, 2017, 2:07pm

Thanks David,

Can I point you to this issue that I've raised on Stack Overflow: fscrawler issue. You will probably have to expand my screenshot, but I'm getting 3 errors when I run fscrawler. The directory that I'm pointing to has 3 text files.

dadoonet · August 3, 2017, 3:08pm

It's better not to include screenshots but copy and paste the logs.

Then:

- You don't need to define an ingest pipeline.
- What do your FSCrawler settings look like?
- There is a warning about an old FSCrawler version. Were you using 2.2 before?

bilpor · August 3, 2017, 3:10pm

Hi David,

No I haven't used an older version.

I seem to have resolved the Java error. In the elasticsearch section of the _settings.json file I added a "pipeline" node. Now when I run from the PowerShell window all looks OK, but if I flip over into Kibana and enter GET myindex/_search it doesn't show anything.

dadoonet · August 3, 2017, 3:11pm

Can't tell without logs and config.

bilpor · August 3, 2017, 3:13pm

What logs and config do you need to see?

dadoonet · August 3, 2017, 3:38pm

FSCrawler settings and logs.

bilpor · August 4, 2017, 8:19am

Good Morning David,

On your advice, I have deleted my pipeline and index and created a new index in the following way:

PUT myindex
{
  "settings" : {
    "number_of_shards" : 3,
    "number_of_replicas" : 2
  },
  "mappings" : {
    "mymap" : {
      "properties" : {
        "data" : {
          "type" : "text"
        }
      }
    }
  }
}

Now when I run fscrawler I do not receive any errors. In the _settings.json file I have removed the "pipeline" setting from the elasticsearch section, but left the "index" setting as myindex. In the _status.json file, which has been created in the same place as the _settings.json file, the indexed value is 0. Where do the fscrawler logs reside? I cannot find anything obvious.

If I use Kibana and do:
GET / myindex { "query" : { "match_all": {} } }
I receive the following error in the output window:
{ "error": "Content-Type header [text/plain] is not supported", "status": 406 }

bilpor · August 4, 2017, 9:01am

Hi David,

I realised that I hadn't set the JAVA_HOME variable correctly. I set it to the correct value and re-ran my fscrawler command. Still no error after stopping and starting Elasticsearch and Kibana. Back in Kibana, if I enter the following:
GET localhost:9200/myindex { "query" : { "match_all" : {} }}
then I get this in the output window:

{
  "_index": "localhost:9200",
  "_type": "bamindex",
  "_id": "AV2sbjxqeV3LhZmJBs0X",
  "_version": 1,
  "result": "created",
  "_shards": {
    "total": 2,
    "successful": 1,
    "failed": 0
  },
  "created": true
}

dadoonet · August 4, 2017, 9:13am

Wrong usage of Kibana

Try

GET myindex/_search
{ "query" : { "match_all" : {} }}
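Outside the Kibana console, the same search can be sent over plain HTTP. The sketch below is a minimal Python illustration, assuming Elasticsearch is listening on http://localhost:9200 as mentioned earlier in this thread. The helper names build_search_request and run_search are hypothetical. It sets the Content-Type header to application/json explicitly, since the 406 error quoted above complains about a text/plain content type.

```python
import json
import urllib.request


def build_search_request(base_url: str, index: str) -> urllib.request.Request:
    """Build a match_all search request against `<base_url>/<index>/_search`."""
    body = json.dumps({"query": {"match_all": {}}}).encode("utf-8")
    return urllib.request.Request(
        url=f"{base_url}/{index}/_search",
        data=body,
        # Declare the body as JSON rather than relying on a default type.
        headers={"Content-Type": "application/json"},
        # The _search endpoint accepts a request body via POST as well as GET.
        method="POST",
    )


def run_search(request: urllib.request.Request) -> dict:
    """Send the request and return the decoded JSON response.

    Not called here: it needs a running Elasticsearch node.
    """
    with urllib.request.urlopen(request) as response:
        return json.load(response)


request = build_search_request("http://localhost:9200", "myindex")
print(request.get_method(), request.full_url)
```

Calling run_search(request) against a live node would return the same structure Kibana shows, with the matching documents under hits.hits.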

bilpor · August 4, 2017, 9:27am

Hi David,

I have tried your suggestion in Kibana and it gives me the following output:

{
  "took": 16,
  "timed_out": false,
  "_shards": {
    "total": 3,
    "successful": 3,
    "failed": 0
  },
  "hits": {
    "total": 0,
    "max_score": null,
    "hits": []
  }
}

What I'm trying to see in the output window is the data from the 3 files that have been indexed. When I ran the fscrawler command in my PowerShell window I also added --debug, which clearly showed that my 3 files had been found.

dadoonet · August 4, 2017, 9:42am

Can you try with --restart --debug options and if it still does not work, copy all your logs here?

bilpor · August 4, 2017, 10:08am

Interesting: adding --restart reports that it cannot access _status.json because it is being used by another process. I stopped the Kibana server in case it was that, but it made no difference. Where would it be logging to? I can't see anything other than the output window of PowerShell.

bilpor · August 4, 2017, 10:15am

It thought IE had it open. I closed IE and the error has gone.
