ElasticSearch Indexing question

Yorko · November 29, 2015, 8:32pm

Hi,

Is there an easy way to index a lot of documents into Elasticsearch? I was thinking of using a CMS of sorts that is hooked up to ES.

warkolm · November 30, 2015, 9:35pm

There are a number of ways. Many people use Logstash to do this.

What sort of data is it? Where is it held?

Yorko · December 1, 2015, 12:18am

It's a shared folder with many .doc and .docx files in various subfolders.
Isn't Logstash just for logs etc.?

warkolm · December 1, 2015, 12:24am

Logstash can be used for many things, but not that sort of data.

You could use the mapper attachments plugin - https://github.com/elastic/elasticsearch-mapper-attachments - for this.

dadoonet · December 1, 2015, 2:52am

You can give fscrawler a try: https://github.com/dadoonet/fscrawler

Yorko · December 1, 2015, 8:44pm

Thanks!
I actually found both of these using Google, but my concerns are the following:

mapper-attachments is a plugin that basically converts the documents to a text version and allows you to index them, if I got it right, but I don't see how it automates the process.
fscrawler now states that it's standalone and not supported by ES, so do the results still integrate with ES later?
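
On the automation concern: mapper-attachments only extracts text from what it is sent, so something else has to walk the folder and submit the files. A minimal sketch of that missing piece follows, not a tested integration. It assumes an index whose mapping declares a field of type `attachment`; the field name `file`, the index and type names `docs`/`doc`, and both helper functions are hypothetical. It only builds the body for the `_bulk` endpoint and does not send it.

```python
import base64
import json
import os

WORD_EXTENSIONS = (".doc", ".docx")


def find_word_files(root):
    """Yield paths of .doc/.docx files under root, including subfolders."""
    for folder, _subfolders, filenames in os.walk(root):
        for name in sorted(filenames):
            if name.lower().endswith(WORD_EXTENSIONS):
                yield os.path.join(folder, name)


def build_bulk_body(paths, index="docs", doc_type="doc", field="file"):
    """Return a newline-delimited JSON body for the _bulk endpoint.

    Each file becomes an action line plus a source line whose `field`
    holds the base64-encoded file, which is what an attachment-type
    field expects. Index, type and field names are hypothetical.
    """
    lines = []
    for path in paths:
        with open(path, "rb") as handle:
            encoded = base64.b64encode(handle.read()).decode("ascii")
        lines.append(json.dumps({"index": {"_index": index, "_type": doc_type}}))
        lines.append(json.dumps({field: encoded, "path": path}))
    # Every line, including the last one, must end with a newline.
    return "".join(line + "\n" for line in lines)
```

For a large share it would probably be safer to send the files in batches rather than as one body, since every file is held in memory as base64 text.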

dadoonet · December 1, 2015, 9:13pm

Fscrawler is not a plugin anymore. It's a standalone app which sends data to elasticsearch.

Yorko · December 1, 2015, 10:53pm

My question was regarding the big warning "Elasticsearch 2.0.0 doesn't support anymore rivers." Since rivers were a way for ES to get data from external sources, I was asking how FSCrawler handles this in 2.0 now that they're not supported anymore. I mean, are the results still compatible with and accepted by ES?

I didn't notice you are the owner of FSCrawler.
I guess it should still work. I will test it out and report back.
Right now I will test it on Windows; later I may have a Linux machine.

dadoonet · December 2, 2015, 6:46am

Yes. Fscrawler has been rewritten FOR elasticsearch 2.0. It might work for previous versions as well, but that is untested.

Yorko · December 13, 2015, 9:53pm

OK, I've tried it and have 3 questions:

1. Is there any way to not include the \r\n whitespace in the _source? In my .doc and .docx files there are different parts separated by whitespace. I would prefer the newlines to be kept but not printed...
2. Is there a way to make it recycle memory, or will Java just keep eating all the memory until there is no more or the scan finishes?
3. I've tried hooking the results into Kibana and I didn't get any results in the Discover tab. Here is the template I used:

{
  "name" : "test2",
  "fs" : {
    "url" : "C:/ABP",
    "update_rate" : "15m",
    "includes" : null,
    "excludes" : null,
    "json_support" : false,
    "filename_as_id" : false,
    "add_filesize" : true,
    "remove_deleted" : true,
    "store_source" : false,
    "index_content" : true,
    "indexed_chars" : null
  },
  "server" : null,
  "elasticsearch" : {
    "nodes" : [ {
      "host" : "127.0.0.1",
      "port" : 9200
    } ],
    "index" : "test2",
    "type" : "doc",
    "bulk_size" : 100,
    "flush_interval" : "5s"
  }
}

I've used test2 and test2* as the index name and selected the date-modified field (which exists in ES) as the time field, and still got nothing in the Discover tab. Plus, it only shows _source as the field it searches.

Any ideas?

dadoonet · December 15, 2015, 8:36am

No. It's indexed as it is extracted by Tika. But TBH I did not understand what the problem is. Could you illustrate with an example what you have now and what you would like to see?

Maybe some enhancements need to be made in the fscrawler project. For sure I should make it easy to add memory settings to the fscrawler job. For now, you have to hack the script or set $JAVA_OPTS.
I opened issue #134 ("Add FS_JAVA_OPTS JVM option") in dadoonet/fscrawler on GitHub for this. Feel free to contribute!
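
As a rough illustration of the $JAVA_OPTS route: the JVM garbage-collects on its own, so the practical lever is to cap the heap so the process cannot grow past a chosen size. The sketch below prepares such an environment. The heap sizes, the `bin/fscrawler` launcher path and passing the job name as the only argument are assumptions, not taken from the fscrawler documentation, so check them against your installation.

```python
import os
import subprocess


def build_launch(job_name, max_heap="512m", launcher="bin/fscrawler"):
    """Return (command, environment) for starting a job with a heap cap.

    -Xms and -Xmx are standard JVM flags. The launcher path and the
    argument layout are assumptions made for this sketch.
    """
    env = dict(os.environ)
    env["JAVA_OPTS"] = "-Xms128m -Xmx{}".format(max_heap)
    return [launcher, job_name], env


def run_job(job_name):
    """Start the crawler with the capped heap and return its exit code."""
    command, env = build_launch(job_name)
    return subprocess.call(command, env=env)
```

Calling `run_job("test2")` would start the job from the settings posted earlier, provided the launcher path is correct for your setup.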

I have never tested it with Kibana so far. I'd advise that you first check with simple curl commands that everything has been indexed as expected. Is that the case?
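
A rough Python equivalent of such curl checks is sketched below. The host, port and index name come from the job settings posted above; the helper names and any search term you pass are illustrative assumptions. The `hits.total` handling assumes the plain number that Elasticsearch 2.x returns.

```python
import json
from urllib.parse import urlencode
from urllib.request import urlopen

BASE_URL = "http://127.0.0.1:9200"  # host and port from the job settings
INDEX = "test2"


def count_url(index=INDEX):
    """URL that reports how many documents the index holds."""
    return "{}/{}/_count".format(BASE_URL, index)


def search_url(term, index=INDEX, size=5):
    """URL for a simple query-string search (the ?q= form)."""
    query = urlencode({"q": term, "size": size})
    return "{}/{}/_search?{}".format(BASE_URL, index, query)


def fetch_json(url):
    """GET a URL and decode the JSON response; needs a running node."""
    with urlopen(url) as response:
        return json.loads(response.read().decode("utf-8"))


def summarize(search_response):
    """Return (total hits, list of document ids) from a search response."""
    hits = search_response["hits"]
    return hits["total"], [hit["_id"] for hit in hits["hits"]]
```

With a node running, `fetch_json(count_url())` shows whether anything was indexed at all, and `summarize(fetch_json(search_url("some word")))` shows whether the extracted text is searchable.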

Yorko · December 16, 2015, 12:01am

I have a file that goes:
Header
Body
Footer

So the result is Header\nBody\nFooter\n, and when I view it I want to see it like the original (on separate lines) and not as one long string.

OK, nice. I will definitely try to help.

The thing is that in the end I need a dashboard, and Kibana is an easy choice here, so I must have it working. I appreciate the concern, and you are correct that I should try it the simple way first, but I also need it to work with Kibana. If you haven't tested that yet, I will gladly volunteer.

dadoonet · December 18, 2015, 11:34am

Could you share your binary document somewhere so I can try some tests on my side?

Yorko · December 19, 2015, 11:47pm

First of all, it works fine with Kibana. I just had the wrong time settings, so FYI on that.

Secondly, sure, here is an example:

Defect subject: XXXX 
Product: XXXX vX.X 
Severity: XXX 

Description:
First paragraph: A short explanation of the issue. 

Technical Details:
Technical details about how the product was tested. 
1.    Example: 
Figure 
2.    Example: 
Figure 
 
Recommended Remediation:
Recommendation

Thanks for your help!

dadoonet · December 20, 2015, 12:08am

Is it a TXT file?

Yorko · December 20, 2015, 7:22am

MS Word, usually .doc or .docx.

dadoonet · December 20, 2015, 7:51am

Could you share your binary document somewhere?

Yorko · December 22, 2015, 12:00am

Here you go. This is just a template, but it's the same format, just missing real text and images.
Template Link

dadoonet · December 29, 2015, 10:12am

So I extracted your file with fscrawler and got: Defect subject: XXXX\nProduct: XXXX vX.X\nSeverity: XXX\nDescription\nFirst paragraph: A short explanation of the issue.\nTechnical Details\nTechnical details about how the product was tested.\nExample:\nFigure\n1. Example:\nFigure\n\nRecommended Remediation\n1. Recommendation\n1. Recommendation\n1. Recommendation\n\n

I was then able to search for "figure", for example, without any issue.

Is there anything wrong with that then?

Yorko · December 29, 2015, 10:05pm

The problem is with the format: it's not human readable. You can search for words and find them in the whole mess of text, as you showed, but if you are only interested in reading a particular section, it's hard to quickly find where one begins and another ends...

It would be much simpler if the \n characters weren't represented as strings.
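
A note on what is likely happening here: in the raw JSON of _source, a line break has to be written as the two-character escape `\n`, so any tool that displays the raw JSON shows it literally. Once the JSON is parsed, those escapes become real line breaks again. The sketch below illustrates this with a shortened version of the extracted text from this thread; the `content` field name and the helper function are assumptions for illustration.

```python
import json

# Raw JSON as a REST client would receive it. "\\n" in this Python
# literal is the two-character JSON escape \n, not a real newline.
RAW_SOURCE = (
    '{"content": "Defect subject: XXXX\\nProduct: XXXX vX.X\\n'
    'Severity: XXX\\nDescription\\n'
    'First paragraph: A short explanation of the issue.\\n"}'
)


def readable_lines(raw_json, field="content"):
    """Parse the JSON and return the field's text as a list of lines."""
    text = json.loads(raw_json)[field]
    return text.splitlines()


# Printing the parsed lines shows one section per line.
for line in readable_lines(RAW_SOURCE):
    print(line)
```

So the stored text still carries the line structure; whether it is shown on separate lines depends on the client that renders it.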
