Extracting specific fields from long documents and moving them to a structured DB for analysis


(Rajesh Iyer) #1

I am struggling with creating the following application:

Extract specific data from 1000s of policies

  • Searchable PDFs - can get full text directly
  • Image PDFs - using Tesseract to OCR to get full text

Feed the full policy text to ES and store the following indices

  • Policy # - String
  • Premium - $

and store them so that, at the end of the day, I have a table in, say, Oracle
like this:

Policy #    Premium
12345       $ 2314
23451       $ 4231

And so on...

There is a lot of analytics that I can do with this table (there are more
fields I am expecting to extract, of course, ~7-10 total fields).

We can get the full text and we can feed it to ES in one field.
We are kind of on our way to creating the indices we want.
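
One way to picture the extraction step, as a minimal sketch only: pull the labelled values out of the full text with regular expressions before indexing, so that each ES document carries them as separate fields next to the full text. The label patterns and the field names `policy_no` and `premium` are assumptions made for illustration; real policies (especially OCR output) will need patterns tuned to their actual layout.

```python
import re

# Hypothetical label patterns; adjust to how the policies are actually worded.
POLICY_RE = re.compile(r"Policy\s*(?:#|No\.?|Number)\s*:?\s*(\d+)", re.IGNORECASE)
PREMIUM_RE = re.compile(r"Premium\s*:?\s*\$\s*([\d,]+(?:\.\d{2})?)", re.IGNORECASE)


def extract_fields(text):
    """Return the extracted fields as a dict; a field is None when not found."""
    policy = POLICY_RE.search(text)
    premium = PREMIUM_RE.search(text)
    return {
        "policy_no": policy.group(1) if policy else None,
        "premium": float(premium.group(1).replace(",", "")) if premium else None,
    }


# Made-up sample text standing in for one policy's full text.
sample_text = "POLICY # 12345\nInsured: ...\nTotal Premium: $ 2,314.00\n"
doc = extract_fields(sample_text)
doc["full_text"] = sample_text  # keep the full text alongside the extracted fields
print(doc["policy_no"], doc["premium"])
```

Fields that come back as None are worth logging, since they point at policies whose wording the patterns do not cover yet.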

I just don't know if there is a way to get the stored index data (label,
value) out of ES into a structured DB table.
If you have experience attempting something like this, I'd love to hear
about the feasibility/challenges of such an attempt.

Regards,

--
You received this message because you are subscribed to the Google Groups "elasticsearch" group.
To unsubscribe from this group and stop receiving emails from it, send an email to elasticsearch+unsubscribe@googlegroups.com.
To view this discussion on the web visit https://groups.google.com/d/msgid/elasticsearch/91dce4e2-e28b-444a-aef8-1a48c123c740%40googlegroups.com.
For more options, visit https://groups.google.com/d/optout.


(AsyncAwait) #2

So what is your question?

The ES team has commercial support options to help with the overall design etc., in case you have enough bandwidth.

Disclaimer: I am an open source enthusiast.



(Jörg Prante) #4

It is hard to answer a question like this because you do not specify which
tools are available to you (clients, programming language, etc.), so the
answer depends on that.

If you can write programs it is not very hard to query ES, look up the
result docs, and execute appropriate SQL insert/update.
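
As a rough sketch of that approach, assuming Python: the code below walks the `hits.hits[]._source` entries of a search response and upserts them into a table. The field names `policy_no` and `premium`, the table name `policy`, and the hard-coded sample response are assumptions for illustration. `sqlite3` stands in for Oracle here so the example runs on its own; with a real Oracle driver the SQL (for example a `MERGE` instead of `INSERT OR REPLACE`) and the placeholder style would differ.

```python
import sqlite3


def hits_to_rows(response):
    """Turn an ES search response (a dict) into (policy_no, premium) tuples."""
    rows = []
    for hit in response["hits"]["hits"]:
        src = hit["_source"]
        rows.append((src["policy_no"], float(src["premium"])))
    return rows


def upsert_rows(conn, rows):
    """Insert rows, replacing any existing row with the same policy number."""
    conn.execute(
        "CREATE TABLE IF NOT EXISTS policy ("
        "policy_no TEXT PRIMARY KEY, premium REAL)"
    )
    conn.executemany(
        "INSERT OR REPLACE INTO policy (policy_no, premium) VALUES (?, ?)",
        rows,
    )
    conn.commit()


# Hand-written stand-in for what a search call would return.
sample_response = {
    "hits": {"hits": [
        {"_id": "1", "_source": {"policy_no": "12345", "premium": 2314}},
        {"_id": "2", "_source": {"policy_no": "23451", "premium": 4231}},
    ]}
}

conn = sqlite3.connect(":memory:")
upsert_rows(conn, hits_to_rows(sample_response))
print(conn.execute("SELECT policy_no, premium FROM policy ORDER BY policy_no").fetchall())
```

For thousands of policies the response would have to be paged (for example with the scroll API) rather than fetched in one search, but the per-hit handling stays the same.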

If you cannot write programs and you want ES to do the SQL for you
automatically, this is not possible. The closest you can get is probably
using a plugin like the CSV format plugin, where you can extract values
from ES into tabular data from the command line. This CSV can be saved
and used for RDBMS import.
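
To illustrate what such a tabular export amounts to, here is a minimal sketch in Python that does not use the plugin: it flattens the `_source` of each hit into CSV rows. The column names and the sample response are assumptions for illustration, not the plugin's actual output format.

```python
import csv
import io


def hits_to_csv(response, fieldnames):
    """Render the chosen _source fields of each hit as CSV text."""
    buf = io.StringIO()
    writer = csv.DictWriter(
        buf,
        fieldnames=fieldnames,
        extrasaction="ignore",  # skip fields such as the full text
        lineterminator="\n",
    )
    writer.writeheader()
    for hit in response["hits"]["hits"]:
        writer.writerow(hit["_source"])
    return buf.getvalue()


# Hand-written stand-in for a search response.
sample_response = {
    "hits": {"hits": [
        {"_source": {"policy_no": "12345", "premium": 2314, "full_text": "..."}},
        {"_source": {"policy_no": "23451", "premium": 4231, "full_text": "..."}},
    ]}
}

csv_text = hits_to_csv(sample_response, ["policy_no", "premium"])
print(csv_text)
```

The resulting text can be written to a file and loaded with whatever bulk import tool the RDBMS provides.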

Jörg

On Sat, Jul 12, 2014 at 5:09 PM, Rajesh Iyer iyer70@gmail.com wrote:

I just don't know if there is a way to get the stored index data (label,
value) out of ES into a structured DB table.
If you have experience attempting something like this, I'd love to hear
about the feasibility/challenges of such an attempt.


