We have plain text files with very specific formatting
These files are in a directory
I want to do a search in ES (or another front end) for a specific ID in those files
I want to return the actual file - or the exact formatting of the text
i.e.
we have several million text files, all with the same formatting, which look basically like a fancy invoice (but plain text)
User wants to search for a string and return the text file
Can Logstash or ES do this?
I have ES, LS & Kibana 5.2 on a 4-node physical cluster
I don't think Elasticsearch will preserve formatting of plain text. However, if those files were PDF, then the ingest attachment plugin would, I believe, allow you to do what you desire.
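Roughly, using it would look something like this - just a sketch I have not run, assuming the ingest-attachment plugin is installed (bin/elasticsearch-plugin install ingest-attachment) and with made-up file and index names; the extracted text ends up in the attachment.content field:

```bash
# 1) Define an ingest pipeline that runs the attachment processor on a
#    base64-encoded "data" field.
curl -XPUT 'http://localhost:9200/_ingest/pipeline/attachments' -d '{
  "description": "Extract text from base64-encoded documents",
  "processors": [ { "attachment": { "field": "data" } } ]
}'

# 2) Base64-encode a file (no line wrapping) and index it through the pipeline.
curl -XPUT 'http://localhost:9200/invoices/invoice/1?pipeline=attachments' -d '{
  "data": "'"$(base64 -w0 invoice_0001.pdf)"'"
}'
```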
Yes, I saw that. Seems the file actually has to be in binary format though. I'm wondering if I can convert them to PDF... I'll check on that. I appreciate the reply.
In all honesty, there is no other way to preserve formatting in Elasticsearch. Whatever you send becomes the _source field, and it's a single field. Whether it preserves newlines and extra spaces and tabs or not may depend on what tool sends it to Elasticsearch, and what tool reads it from Elasticsearch. You should be able to convert to PDF at the command line, however. The problem with many command-line formatters is that they, too, may ignore the custom formatting. There may be a lot of trial and error involved.
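For example, something like this might do it (just a sketch, untested; it assumes GNU enscript and Ghostscript are installed, and the filenames are made up):

```bash
# Convert a fixed-layout text invoice to PostScript, then PDF.
# -B drops page headers; a monospaced font keeps the column layout intact.
enscript -B -f Courier10 -o invoice_0001.ps invoice_0001.txt
ps2pdf invoice_0001.ps invoice_0001.pdf
```

libreoffice --headless --convert-to pdf invoice_0001.txt is another option, but as noted, any of these tools may reflow or re-wrap the text, so check the output carefully.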
I mean, you can put an entire file into one field (you may have to tweak the field settings so it is fully searchable), and that data will be stored untouched, exactly as you insert it, as long as it is ASCII data.
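For example, a mapping along these lines (just a sketch; the index, type and field names are placeholders) keeps the whole body analyzed and searchable while storing it untouched in _source:

```bash
# Create the index with an explicit mapping: "message" holds the raw file
# text as a fully analyzed text field, "filename" is an exact-match keyword.
curl -XPUT 'http://localhost:9200/invoices' -d '{
  "mappings": {
    "invoice": {
      "properties": {
        "message":  { "type": "text" },
        "filename": { "type": "keyword" }
      }
    }
  }
}'
```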
If it is binary, you could Base64-encode the document and then create meta fields holding extracted data that represents what is in the file. But at that point you might as well keep your file on local storage and just provide an HTTP link to it; that way your Elastic data cluster stays small.
The file copies and pastes easily from Windows to Linux and keeps all the formatting. How can I tell what encoding the file is?
I just tried the hardcopy feature of Vim,
but the file wouldn't open in Windows - I was just testing. I may have to try to bring it in with Logstash and the PDF/binary plugin, just an idea...
Not sure what you mean; if it is an ASCII file then you don't need to convert it to PDF. Since you want the whole file untouched, you just have to come up with a relatively simple shell script to insert it.
Mind you, since I am not sure of everything you're trying to accomplish, you're going to have to figure out what is best for you. Elastic is a search engine, not a CMS. While it could be used for that, you may want to review what you're actually trying to do and match the technologies accordingly.
This is just quick pseudo code; I have not tested it, and you may have issues with a Bash variable not being able to hold 15K, but it should work without problems on smaller files. You will probably have to go to a programming language to do it better. Oh, and you will have to change the mapping on the "message" field so more than the first 1K is searchable.
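Something along these lines, against an index like the one above (completely untested; the path and names are just placeholders):

```bash
#!/bin/bash
# Index one invoice file into Elasticsearch as a single "message" field.
# jq -Rs reads the whole file as one JSON string, which handles the
# newline/quote escaping that plain Bash makes painful.
FILE="$1"
BODY=$(jq -Rs --arg fn "$(basename "$FILE")" '{filename: $fn, message: .}' "$FILE")
curl -XPOST 'http://localhost:9200/invoices/invoice' -d "$BODY"
```

Run it in a loop over the directory (e.g. for f in /data/invoices/*.txt; do ./index_file.sh "$f"; done) and you have your several million documents.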
I really think you should change how you're thinking about doing this.
Instead of pulling in the whole file, I think you should just parse it, create separate fields for all the invoice data, and then just have a URL for the user to download the original file.
So your metadata might look like this (just assuming, since you said it was invoice-type data):
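For example (every field name and value here is made up; use whatever your invoices actually contain):

```json
{
  "invoice_number": "INV-0012345",
  "customer_name":  "ACME Corp",
  "invoice_date":   "2017-03-01",
  "total_amount":   1234.56,
  "file_url":       "http://fileserver.internal/invoices/INV-0012345.txt"
}
```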
I don't know how Logstash could map all the fields, horizontal & vertical, whereas I believe LS would be able to build an index on the PDF file itself rather than converting and mapping fields. There is a new attachment plugin that can index binary files... at least I hope it works that way.
My data in the file is like a checkerboard: some blocks have the reference (like a name) with the value below it, others have the reference on the left and the value on the right. I think it would be hard to parse... but I'm not that skilled.
Re-reading your initial question, display and search would be two different ideas.
Elasticsearch would have no problem holding the data and making it searchable. Getting it in there can be done with Logstash, or even just simple CLI commands (or any language you like), parsed or unparsed.
Display would not be available right out of the box, meaning you would have to write a plugin for Kibana, or something in HTML/JavaScript/PHP/your language of choice, to display the data in whatever format you wanted. Kibana mostly deals with key=value pairs and aggregation of data.
We could make this work, but the end result would most likely be something custom.
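To be clear, the raw text itself comes back from a search exactly as you stored it; it is the presentation in front of the user that you would have to build. A quick sketch with the placeholder names from above:

```bash
# Search for an invoice ID and print the stored text with its original
# newlines and spacing; jq -r emits the raw string rather than JSON-escaped.
curl -s 'http://localhost:9200/invoices/_search?q=message:INV-0012345' \
  | jq -r '.hits.hits[0]._source.message'
```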
Now thinking outside the box,
You may want to look at a wiki, maybe Confluence, MediaWiki, or really any wiki with search capability (some do plug into Elasticsearch). There you would make each "invoice" a different wiki page (or multiples on the same page for the same person), and when you search for something it would give you your page options. Seems like what you're actually looking for.
Heck, you could just upload them to a private GitHub or Bitbucket repo; that would give you everything with very little work: search, file storage, and display of the data in raw format. It would not be as pretty as a wiki, but it would be secure and give people a web GUI to use.
I think the wiki idea is excellent and would save all kinds of coding and massaging, and it also has built-in search. Very good.
A few security concerns:
Needs read-only access - no one can change the files
Needs authentication to access the site
Ideally would need some kind of auditing
But this would be an internal website
I'm thinking files get loaded into a directory and we use the built-in search of the wiki software - or do I need something else?
I'm going to do some googling on this. I really appreciate your ideas...
MediaWiki could work well enough for your needs as I understand them. It is PHP-based, with MySQL/Postgres for storage.
As for loading the files, they would have to be loaded into the database, at least for the wikis I use; it's the only way they could be searched. But there should be an API to auto-create the different pages.
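For example, something like this might work (untested sketch; it assumes shell access to the MediaWiki box and uses the maintenance/edit.php script, whose flags can vary by version; the paths, user and titles are made up):

```bash
# Create one wiki page per invoice file, wrapping the text in <pre>
# so MediaWiki keeps the original spacing and line breaks.
cd /var/www/mediawiki
for f in /data/invoices/*.txt; do
  title="$(basename "$f" .txt)"
  { echo '<pre>'; cat "$f"; echo '</pre>'; } \
    | php maintenance/edit.php -u ImportBot -s "Bulk invoice import" "$title"
done
```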
It has user levels, so you can not only have login, but also, I believe, different levels of security.
Of course there are Git/SVN and other types of wikis that may fit your needs, but for search they will need to be loaded into either a database or something like Elastic. There are none that I know of that are directory-based and provide that functionality, but who knows, someone might have created one.
Glad I could point you in the right direction; hit me up if you have any more questions.
OK, thanks. I'm playing around with it now - PmWiki looks like it has a lot of the features I need. I'm deploying that on top of a Tomcat web server. I'll keep you posted.