ElasticSearch Indexing question


(John) #1

Hi,

Is there an easy way to index a lot of documents into ES? I was thinking of using a CMS of sorts that is hooked up to ES.


(Mark Walkom) #2

There are a number of ways. Many people use Logstash to do this.

What sort of data is it? Where is it held?


(John) #3

It's a shared folder with many .doc and .docx files in various subfolders.
Isn't Logstash just for logs, etc.?


(Mark Walkom) #4

Logstash can be used for many things, but not that sort of data.

You could use the mapper attachments plugin - https://github.com/elastic/elasticsearch-mapper-attachments - for this.
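
The plugin only covers text extraction, so finding the files and sending them still has to be scripted separately. A minimal sketch of that preparation step, assuming the target index has a field mapped as type "attachment" (the field name `file`, the extra `path` field and both function names are illustrative, not taken from the plugin):

```python
import base64
import os

# Extensions to pick up; these are the file types mentioned in this thread.
WORD_EXTENSIONS = (".doc", ".docx")


def find_word_files(root):
    """Yield paths of .doc/.docx files under root, including subfolders."""
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in sorted(filenames):
            if name.lower().endswith(WORD_EXTENSIONS):
                yield os.path.join(dirpath, name)


def build_attachment_doc(path, field="file"):
    """Build a document body whose `field` holds the file as base64.

    Assumption: `field` is mapped as type "attachment" in the target index.
    """
    with open(path, "rb") as handle:
        encoded = base64.b64encode(handle.read()).decode("ascii")
    return {"path": path, field: encoded}
```

Each returned dictionary would then be serialized to JSON and sent to the index; that sending step is left out here on purpose.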


(David Pilato) #5

You can give fscrawler a try: https://github.com/dadoonet/fscrawler


(John) #6

Thanks!
I actually found both of these using Google, but my concerns are the following:

  1. mapper-attachments is a plugin that basically converts the documents to a text version and allows you to index them, if I got it right, but I don't see how it automates the process.

  2. fscrawler now states it's standalone and not supported by ES, so do the results still integrate with ES later?


(David Pilato) #7

Fscrawler is not a plugin anymore. It's a standalone app which sends data to elasticsearch.


(John) #8

My question was about the big warning "Elasticsearch 2.0.0 doesn't support anymore rivers." Since rivers were a way for ES to get data from external sources, I was asking how FScrawler handles this in 2.0 now that rivers are not supported anymore. I mean, are the results still compatible with and accepted by ES?

I didn't notice you are the owner of FScrawler :blush:
I guess it should still work. I will test it out and report back.
Right now I will test it on Windows; later I may have a Linux machine.


(David Pilato) #9

Yes. Fscrawler has been rewritten FOR elasticsearch 2.0. It might work with previous versions as well, but that is untested.


(John) #10

OK, I've tried it and have 3 questions:

  1. Is there any way to not include the \r\n whitespace in the _source? In my .doc and .docx files there are different parts separated by blank space. I would prefer the new lines to be kept but not printed...
  2. Is there a way to make it recycle memory, or will Java just keep eating memory until there is no more or the scan finishes?
  3. I've tried hooking the results into Kibana and I didn't get any results in the Discover tab. Here is the template I used:

{
  "name" : "test2",
  "fs" : {
    "url" : "C:/ABP",
    "update_rate" : "15m",
    "includes" : null,
    "excludes" : null,
    "json_support" : false,
    "filename_as_id" : false,
    "add_filesize" : true,
    "remove_deleted" : true,
    "store_source" : false,
    "index_content" : true,
    "indexed_chars" : null
  },
  "server" : null,
  "elasticsearch" : {
    "nodes" : [ { "host" : "127.0.0.1", "port" : 9200 } ],
    "index" : "test2",
    "type" : "doc",
    "bulk_size" : 100,
    "flush_interval" : "5s"
  }
}

I've used test2 and test2* as the index name and selected date modified (which exists in ES) as the time field, and still got nothing in the Discover tab. Also, it only shows _source as the field it searches.

Any ideas?


(David Pilato) #11

No. The text is indexed as it is extracted by Tika. But TBH I did not understand what the problem is. Maybe illustrate with an example of what you have now and what you would like to see?

Maybe some enhancements are needed in the fscrawler project. For sure, I should make it easy to add memory settings to the fscrawler job. For now, you have to hack the script or set $JAVA_OPTS.
I opened https://github.com/dadoonet/fscrawler/issues/134 for this. Feel free to contribute! :stuck_out_tongue:

I have never tested it with Kibana. I'd advise that you first check with simple curl commands that everything has been indexed as expected. Is that the case?
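
A check like that can also be scripted. A rough sketch using only the Python standard library, assuming the host, port and index from the job settings posted above; the field name `content` is an assumption about where the extracted text ends up, and both function names are made up for illustration:

```python
import json
import urllib.request


def build_search_request(host, port, index, field, text):
    """Build (but do not send) a match query against one index."""
    url = "http://{}:{}/{}/_search".format(host, port, index)
    body = json.dumps({"query": {"match": {field: text}}}).encode("utf-8")
    return urllib.request.Request(
        url,
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def total_hits(request):
    """Send the request and return the raw hits.total value of the response."""
    with urllib.request.urlopen(request) as response:
        result = json.loads(response.read().decode("utf-8"))
    return result["hits"]["total"]
```

For example, `total_hits(build_search_request("127.0.0.1", 9200, "test2", "content", "figure"))` would report how many indexed documents match the word "figure", provided a node is listening on that address.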


(John) #12

I have a file that goes:
Header
Body
Footer

So the result is header\nbody\nfooter\n, and when I view it I want to see it like the original (separated) and not as one long string.

OK, nice, I will definitely try to help :wink:

The thing is that in the end I need a dashboard, and Kibana is an easy choice here, so I must have it working. I appreciate the concern, and you are correct that I should try it the simple way first. But I also need it to work, and if you haven't tested it yet I will gladly volunteer here :slight_smile:


(David Pilato) #13

Could you share your binary document somewhere so I can run some tests on my side?


(John) #14

First of all, it works fine with Kibana. I just had the wrong time settings, so FYI on that.

Secondly sure here is an example:

Defect subject: XXXX 
Product: XXXX vX.X 
Severity: XXX 

Description:
First paragraph: A short explanation of the issue. 

Technical Details:
Technical details about how the product was tested. 
1.    Example: 
Figure 
2.    Example: 
Figure 
 
Recommended Remediation:
Recommendation 

Thanks for your help!


(David Pilato) #15

Is it a TXT file?


(John) #16

MS Word, usually .doc or .docx.


(David Pilato) #17

Could you share your binary document somewhere?


(John) #18

Here you go. This is just a template, but it's the same format, just missing the real text and images.
Template Link


(David Pilato) #19

So I extracted your file with fscrawler and got: Defect subject: XXXX\nProduct: XXXX vX.X\nSeverity: XXX\nDescription\nFirst paragraph: A short explanation of the issue.\nTechnical Details\nTechnical details about how the product was tested.\nExample:\nFigure\n1. Example:\nFigure\n\nRecommended Remediation\n1. Recommendation\n1. Recommendation\n1. Recommendation\n\n

I was then able to search for figure for example without any issue.

Is there anything wrong with that then?


(John) #20

The problem is with the format: it's not human readable. You can search for words and find them in the whole mess of text, as you showed, but if you are only interested in reading a particular section it's hard to quickly find where one begins and another ends...

It would be much simpler if the \n characters weren't represented as strings.
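
One point that may help here: inside a JSON response, `\n` is only the escaped spelling of a line break, so the breaks are still present in the stored text and reappear once the JSON is decoded. A small sketch, using a shortened copy of the extracted text from post #19 (the `content` key and the helper name are illustrative assumptions):

```python
import json

# Shortened copy of the extracted text, as it appears inside a JSON response.
RAW_JSON = '{"content": "Defect subject: XXXX\\nProduct: XXXX vX.X\\nSeverity: XXX\\n"}'


def readable_content(raw_json, field="content"):
    """Decode the JSON and return the field with real line breaks."""
    return json.loads(raw_json)[field]


# Printing the decoded value puts each part on its own line.
text = readable_content(RAW_JSON)
print(text)
```

Whether a particular viewer renders those line breaks or shows the raw JSON is a separate question that depends on the viewer.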