File search solution using ES 5.5.0

Hi all, I am trying to set up a file-search capability for a not-for-profit org I am helping out.

The use case is something like this:

  • users drop documents (mainly doc files such as pdf, txt, xls, word but also some audio, video & image files) in a share folder.
  • crawl the folder every few minutes and index the files for ES to search
  • users to access search through web interface

Here is what I tried:
a) set up ES and Kibana
b) use FSCrawler 2.3 to crawl the folder and create the index.

While it does work, I faced a couple of problems:
a) ES did not run on Windows 7 due to some Java version issue which I could not resolve (I will create another thread for it). Luckily I had access to a server running Windows Server 2012 where I got ES running; however, that means I have only one node working for ES.

b) I tested on a sample folder with about 15,000 files (~200-300 GB of data). Unfortunately the performance was a little underwhelming: when I try to access the index through Kibana, it takes about 10 seconds to load!

Given that the target data I am supposed to work with has a volume of about 2-3 TB and is growing at about 2-3 GB/week, it doesn't look good.

I suspect the number of fields being created in the index (900+) might be a factor in slowing down retrieval. Is it?
If so, I am quite ready to strip the number of fields down to a bare minimum, because I don't need most of them.

Sorry for the long post, hoping for some Yoda-wisdom from the community!

The problem is that Kibana by default loads 500 docs per page IIRC. You should decrease that value in Kibana settings (using the UI).
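If I remember correctly, the setting is discover:sampleSize under Management > Advanced Settings; it defaults to 500 and lowering it (e.g. to 100) should make the Discover page load noticeably faster.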

Hi, thanks for the reply. Will try that out. Great job on FSCrawler btw, it's a godsend!

Do you think the approach I have taken is optimal for my use case, or would you suggest something different?

I have a couple of questions regarding FSCrawler:
Is there a way to tweak how FSCrawler parses text? E.g. I discovered that the search string (through Kibana) is case sensitive; I would like to make it case insensitive.

How does FSCrawler 2.3 create fields in the index? Does it simply append new metadata fields it encounters in various file types to the index? If so, is it possible for me to define rules about ingestion of metadata into fields?
Ideally I would like to have a small set of common fields, e.g. type, filename, filesize, format, last edited date, indexing date, content (for full-text search).
The rest of the file metadata fields I would like to map to a single field per file type, say image metadata, doc metadata, etc.
Is there any tool I could use to do this through a UI?
I did look at ES head but I am afraid of messing it up if I have to write everything in code.
Thanks again!

:hugging: Thank you!

Your use case is:

  • users drop documents (mainly doc files such as pdf, txt, xls, word but also some audio, video & image files) in a share folder.
  • crawl the folder every few minutes and index the files for ES to search
  • users to access search through web interface

I believe that FSCrawler can help for that use case as it watches a directory.
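For reference, the "every few minutes" part is just the update_rate in the job settings file. A minimal sketch (the job name, path and interval below are placeholders to adapt to your setup):

    {
      "name" : "shared_docs",
      "fs" : {
        "url" : "/path/to/share/folder",
        "update_rate" : "5m"
      },
      "elasticsearch" : {
        "index" : "shared_docs"
      }
    }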

That said, you can also imagine building a web app where your users can drag and drop a file, and then use the REST layer of FSCrawler to send this document.

Is there a way to tweak how FSCrawler parses text? E.g. I discovered that the search string (through Kibana) is case sensitive; I would like to make it case insensitive.

It should not be case sensitive for the content itself, but it may be for other fields. Which fields do you mean?
To answer your question, you can make it case insensitive by changing the default mapping FSCrawler is using. I believe I'm using a keyword type for some fields. You can either define new text subfields or switch those fields to text.

That might be a bit hard, but you can indeed do that by using an ingest pipeline in Elasticsearch. FSCrawler supports it: https://github.com/dadoonet/fscrawler

Is there any tool I could use to do this through a UI?

No. Not yet. I have been thinking of providing at some point a Kibana application. But I'm not yet there.
I have so many other things to do on the backend side :slight_smile:

I did look at ES head but I am afraid of messing it up if I have to write everything in code.

Kibana dev console is a good compromise IMO. This is what I'm using every day.


Well, the share folder is needed for other things, so I won't mess with that. What I would need is a customised web front end for the users. Maybe it can be achieved through Kibana, or it might need a custom application; we will see.

I believe it was in filename.
I didn't quite get the subfield thing, to be honest. :disappointed:

Don't really see an alternative to this. Always ready to learn, but tutorials are hard to come by.

That would be awesome!

Thanks again for the help.
Regards!

As I was saying, I have no alternative but to learn how to do custom ingestion of data. Could you suggest some good tutorials etc. to start me off?
I had no luck searching on my own, being new to this and not knowing what toolset I need to accomplish this.
Thanks in advance.

Added later: one more quick question.
Can this https://github.com/appbaseio/gem be used to achieve the above? And if so, do you think it will work with ES 5.5.0?

filename uses a keyword type by default:

https://github.com/dadoonet/fscrawler/blob/master/src/main/resources/fr/pilato/elasticsearch/crawler/fs/_default/5/_settings.json#L41-L43

You can change it to:

        "filename" : {
          "type" : "text"
        },

I didn't quite get the subfield thing, to be honest.

or use sub fields:

        "filename" : {
          "type" : "keyword",
          "fields": {
            "forsearch": {
              "type" : "text"
            }
          }
        },

The latter means that if you search in filename.forsearch, it will be case insensitive.
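For example, a query like this one (from the Kibana dev console; documents is a placeholder index name, and depending on the FSCrawler mapping the full path may be file.filename.forsearch) would then match regardless of case:

    GET documents/_search
    {
      "query": {
        "match": {
          "filename.forsearch": "MyReport.PDF"
        }
      }
    }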


It will be returned if I search in _all, right?

I have an ingest node question. I want to either:

a) concatenate all meta.* fields into a single field. Is there a way to do this?
I have no way of knowing which meta.* fields are going to be created dynamically.

Say, if I have field-value pairs such as:
meta.field1 : value1
meta.field2 : value2

I want to append them both into a single field-value pair, like:
meta.catchallfield : meta.field1 : value1 | meta.field2 : value2

OR, failing that

b) remove all meta.* fields from my index (they are not very useful to me for filtering).
If I use the remove processor, something like this:

    {
      "remove": {
        "field": "meta.*"
      }
    }

Would it work or throw errors?

I think you want to use copy_to: https://www.elastic.co/guide/en/elasticsearch/reference/5.5/copy-to.html

Can I apply copy_to on fields with a wildcard (e.g. meta.*), without explicitly declaring them?
Because FSCrawler creates the fields dynamically, it's not possible to create explicit mappings for them.

Also, copy_to will still keep all the original fields in the index, right?
That is the problem actually: I am having an explosion in the number of fields. It's crossing the 1000-field limit pretty quickly, and in any case those fields aren't of much use to me.

All I want to do is either club all meta.* fields into one or get rid of them altogether.
Thanks again.

Good to know. Can you open an issue in the FSCrawler project about the field explosion?

You can use a dynamic template to configure all meta.* fields to be copied to whatever field you like. But that is not going to solve the field explosion problem here, as the original fields will still be there.
And you can't use ingest in that case (with copy_to, I mean), since copy_to is a mapping feature.
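A dynamic template along these lines might do it (untested sketch; documents and doc are placeholder index and type names, and meta_all is just an example target field):

    PUT documents
    {
      "mappings": {
        "doc": {
          "dynamic_templates": [
            {
              "meta_fields": {
                "path_match": "meta.*",
                "mapping": {
                  "type": "text",
                  "copy_to": "meta_all"
                }
              }
            }
          ]
        }
      }
    }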

So maybe the remove processor can support wildcards?
(Did not check.)
Or maybe you can do all that with a script processor?

If you find a good solution for this, would you like to send a PR to add it to the FSCrawler documentation?

Update:
The wildcard option in the remove processor throws a java.lang exception. :slightly_frowning_face:

So, the script processor seems to be the way to go. Any pointers on how to retrieve fields from the FSCrawler pipeline??

I can then use the remove processor to get rid of them.

Any pointers on how to retrieve fields from the FSCrawler pipeline??

I have barely any experience with Painless. But it's pretty well documented here: Painless Scripting Language | Elasticsearch Reference [5.5] | Elastic
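For what it's worth, since all the extracted metadata ends up under a single meta object, a script processor that drops the whole object might be enough. An untested sketch (strip_meta is just an example pipeline name; FSCrawler would then need to be pointed at it through the pipeline option in the job's elasticsearch settings, if I remember the option name correctly):

    PUT _ingest/pipeline/strip_meta
    {
      "description": "drop the dynamically created meta.* fields before indexing",
      "processors": [
        {
          "script": {
            "lang": "painless",
            "inline": "ctx.remove('meta')"
          }
        }
      ]
    }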

This topic was automatically closed 28 days after the last reply. New replies are no longer allowed.