Searching PDF


(IronMike) #1

I am trying to index and search a PDF file with the Java API from here:


After indexing, a search returns the file as Base64. How do I get back
the original text/source?

"_score" : 0.0047945753, "_source" :
{"name":"fn6742.pdf","file":"JVBERi0xLjQNJeLjz9MNCjE1OCAwIG9iaiA8PC9MaW5lYXJpemVkIDEvTCAzODExNDQvTyAxNjMvRSAyNDcxMS9OIDEzL1QgMzc3OTM
....
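
For reference, the `file` value in the response above appears to be the uploaded PDF itself in Base64 form, not text extracted from it. A minimal sketch to check this, assuming Java 8+ for `java.util.Base64`; the class and method names are made up for illustration and are not part of the original code:

```java
import java.nio.charset.StandardCharsets;
import java.util.Base64;

class Base64Peek {
    // Decodes a Base64 prefix (length must be a multiple of 4) and returns it
    // as ISO-8859-1 text, so that every decoded byte maps to exactly one char.
    static String decodePrefix(String base64Prefix) {
        byte[] raw = Base64.getDecoder().decode(base64Prefix);
        return new String(raw, StandardCharsets.ISO_8859_1);
    }

    public static void main(String[] args) {
        // First 12 characters of the "file" value in the search response.
        String prefix = "JVBERi0xLjQN";
        // trim() drops the trailing carriage return that follows the header.
        System.out.println(decodePrefix(prefix).trim()); // prints "%PDF-1.4"
    }
}
```

The decoded prefix is the PDF file header, so the search is returning the stored `_source` unchanged.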

[Cyclone] [docs] deleting index

[2014-02-07 15:56:40,415][INFO ][cluster.metadata ] [Cyclone] [docs] creating index, cause [api], shards [5]/[1], mappings []
[2014-02-07 15:56:40,537][INFO ][cluster.metadata ] [Cyclone] [docs] create_mapping [pdf]
[2014-02-07 15:56:40,793][INFO ][cluster.metadata ] [Cyclone] [docs] update_mapping [pdf] (dynamic)
[2014-02-07 16:07:11,040][INFO ][cluster.metadata ] [Cyclone] [docs] deleting index
[2014-02-07 16:08:10,611][INFO ][cluster.metadata ] [Cyclone] [docs] creating index, cause [api], shards [5]/[1], mappings []
[2014-02-07 16:08:10,732][INFO ][cluster.metadata ] [Cyclone] [docs] create_mapping [pdf]
[2014-02-07 16:08:10,927][INFO ][cluster.metadata ] [Cyclone] [docs] update_mapping [pdf] (dynamic)

Code:

private static void internalMain() throws Exception {
    String fileContents = readContent( new File("fn6742.pdf") );

    Client client = new TransportClient().addTransportAddress(
            new InetSocketTransportAddress("localhost", 9300));

    try {
        DeleteIndexResponse deleteIndexResponse =
                new DeleteIndexRequestBuilder( client.admin().indices(), INDEX_NAME ).execute().actionGet();
        if ( deleteIndexResponse.isAcknowledged() ) {
            System.out.println( "Deleted index" );
        }
    } catch ( Exception e ) {
        System.out.println("Index already deleted");
    }

    System.out.println("before index create call");
    CreateIndexResponse createIndexResponse =
            new CreateIndexRequestBuilder( client.admin().indices(), INDEX_NAME ).execute().actionGet();
    System.out.println("after index create call");

    if ( createIndexResponse.isAcknowledged() ) {
        System.out.println( "created index" );
    }

    PutMappingResponse putMappingResponse =
            new PutMappingRequestBuilder( client.admin().indices() )
                    .setIndices(INDEX_NAME).setType( DOCUMENT_TYPE ).setSource(
                    jsonBuilder()
                        .startObject()
                            .field("doc")
                            .startObject()
                                .field( "properties" )
                                .startObject()
                                    .field( "file" )
                                    .startObject()
                                        .field( "term_vector", "with_positions_offsets" )
                                        .field( "store", "yes" )
                                        .field( "type", "attachment")
                                        //.field("index", "analyzed")
                                    .endObject()
                                .endObject()
                            .endObject()
                        .endObject()
                    ).execute().actionGet();

    if ( putMappingResponse.isAcknowledged() ) {
        System.out.println( "successfully defined mapping" );
    }

    IndexResponse indexResponse = client.prepareIndex( INDEX_NAME , DOCUMENT_TYPE, "1")
            .setSource(jsonBuilder()
                .startObject()
                    .field("name","test/fn6742.pdf")
                    .field( "file", fileContents)
                    .field( "modified", new Date() )
                    .field( "updated_at", new Date() )
                .endObject()
            )
            .execute()
            .actionGet();

    System.out.println( indexResponse );

    client.admin().indices().refresh(refreshRequest()).actionGet();

    //////////////////////// Search ///////////////////////////////////
    SearchResponse searchResponse = client.prepareSearch( INDEX_NAME )
            .setSearchType(SearchType.QUERY_AND_FETCH)
            .setQuery(fieldQuery("file", "200nA"))
            //.setQuery(queryString("c"))
            .setFrom(0)
            .setSize(60)
            .setExplain(true)
            .execute()
            .actionGet();

    System.out.println( searchResponse );
    client.close();
}

--
You received this message because you are subscribed to the Google Groups "elasticsearch" group.
To unsubscribe from this group and stop receiving emails from it, send an email to elasticsearch+unsubscribe@googlegroups.com.
To view this discussion on the web visit https://groups.google.com/d/msgid/elasticsearch/087fb06d-46e4-44fb-9e54-018be4bacd9d%40googlegroups.com.
For more options, visit https://groups.google.com/groups/opt_out.


(Binh Ly) #2

You should be able to get the textual field values by explicitly requesting
them from fields. For example:

GET localhost:9200/_search
{
  "fields": "*",
  "query": {
    "match_all": {}
  }
}



(IronMike) #3

So, what's wrong with this?
GET localhost:9200/_search
{
  "fields": "file",
  "query": {
    "match_all": {}
  }
}

......
"hits": {
"total": 1,
"max_score": 1,
"hits": [
{
"_index": "docs",
"_type": "pdf",
"_id": "1",
"_score": 1,
"fields": {
"file":
"JVBERi0xLjQNJeLjz9MNCjE1OCAwIG9iaiA8PC9MaW5lYXJpemVkIDEvTCAzODExNDQvTyAxNjMvRSAyNDcxMS9OIDEzL1QgMzc3OTM2L0ggWyAxMTU2IDQ2OF0+Pg1lbmRvYmoNICAgICAgICAgICAgDQp4cmVmDQoxNTggNDMNCjAwMDAwMDAwMTYgMDAwMDAgbg0KMDAwMDAwMTYyNCAwMDAwMCBuDQowMDAwMDAxNzk0IDAwMDAwIG4NCjAwMDAwMDE4MjAgMDAwMDAgbg0KMDAwMDAwMTg2NiAwMDAwMCBuDQowMDAwMDAxOTAwIDAwMDAwIG4NCjAwMDAwMDIxMDkgMDAwMDAgbg0KMDAwMDAwMjE4OSAwMDAwMCBuDQowMDAwMDAyMjY3IDAwMDAwIG4NCjAwMDAwMDIzNDQgMDAwMDAgbg0KMDAwMDAwMjQyMSAwMDAwMCBuDQowMDAwMDAyNDk4IDAwMDAwIG4NCjAwMDAwMDI1NzUgMDAwMDAgbg0KMDAwMDAwMjY1MiAwMDAwMCBuDQowMDAwMDAyNzI5IDAwMDAwIG4NCjAwMDAwMDI4MDYgMDAwMDAgbg0KMDAwMDAwMjg4MyAwMDAwMCBuDQowMDAwMDAyOTYwIDAwMDAwIG4NCjAwMDAwMDMwMzYgMDAwMDAgbg0KMDAwMDAwMzE5OCAwMDAwMCBuDQowMDAwMDAzNjMwIDAwMDAwIG4NCjAwMDAwMDM2NjYgMDAwMDAgbg0KMDAwMDAwMzkwMCAwMDAwMCBuDQowMDAwMDAzOTc3IDAwMDAwIG4NCjAwMDAwMDQwNTMgMDAwMDAgbg0KMDAwMDAwNDkxMSAwMDAwMCBuDQowMDAwMDA1NzA5IDAwMDAwIG4NCjAwMDAwMD




(Binh Ly) #4

It looks like the indexing code might not be correct. I just tried this
code and it works for me:

  try {
    String fileContents = readContent( new File( "fn6742.pdf" ) );

    try {
      DeleteIndexResponse deleteIndexResponse =
          new DeleteIndexRequestBuilder( client.admin().indices(), INDEX_NAME ).execute().actionGet();
      if ( deleteIndexResponse.isAcknowledged() ) {
        System.out.println( "Deleted index" );
      }
    }
    catch (Exception e) {
      //ignore
    }

    CreateIndexResponse createIndexResponse =
        new CreateIndexRequestBuilder( client.admin().indices(), INDEX_NAME ).execute().actionGet();

    if ( createIndexResponse.isAcknowledged() ) {
      System.out.println( "Created index" );
    }

    PutMappingResponse putMappingResponse =
        new PutMappingRequestBuilder( client.admin().indices() )
            .setIndices(INDEX_NAME).setType( DOCUMENT_TYPE ).setSource(
            XContentFactory.jsonBuilder().startObject()
              .field("doc").startObject()
                .field( "properties" ).startObject()
                  .field( "file" ).startObject()
                    .field( "term_vector", "with_positions_offsets" )
                    .field( "store", "yes" )
                    .field( "type", "attachment" )
                    .field("fields").startObject()
                      .field("file").startObject()
                        .field("store", "yes")
                      .endObject()
                    .endObject()
                  .endObject()
                .endObject()
              .endObject()
            .endObject()
        ).execute().actionGet();

    if ( putMappingResponse.isAcknowledged() ) {
      System.out.println( "Successfully defined mapping" );
    }

    IndexResponse indexResponse = client.prepareIndex( INDEX_NAME , DOCUMENT_TYPE, "1")
        .setSource(XContentFactory.jsonBuilder()
          .startObject()
            .field( "file").startObject()
              .field("content", fileContents)
              .field("_indexed_chars", -1)
            .endObject()
          .endObject()
        ).execute().actionGet();

    System.out.println( "Document indexed success: " + indexResponse.isCreated() );
  } catch ( Exception e ) {
    System.out.println(e.toString());
  }

And then when I query:

{
  "fields": "*",
  "query": {
    "match_all": {}
  }
}

I get back this:

{
  "took" : 2,
  "timed_out" : false,
  "_shards" : {
    "total" : 5,
    "successful" : 5,
    "failed" : 0
  },
  "hits" : {
    "total" : 1,
    "max_score" : 1.0,
    "hits" : [ {
      "_index" : "msdocs",
      "_type" : "doc",
      "_id" : "1",
      "_score" : 1.0,
      "fields" : {
        "file" : [ "\n1\nISL99201\nCAUTION: These devices are sensitive to electrostatic discharge; follow proper IC Handling Procedures.\n1-888-INTERSIL or 1-888-468-3774"]
      }
    } ]
  }
}
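
The `readContent` helper called in the code above is not shown anywhere in this thread. A minimal sketch of what it might look like, assuming it only needs to Base64-encode the file's bytes (the form the attachment type expects) and that Java 8+ is available; the class name and body are hypothetical, not the original implementation:

```java
import java.io.File;
import java.io.IOException;
import java.nio.file.Files;
import java.util.Base64;

class AttachmentEncoding {
    // Hypothetical stand-in for the thread's readContent helper:
    // reads the whole file into memory and returns it Base64-encoded.
    static String readContent(File file) throws IOException {
        byte[] bytes = Files.readAllBytes(file.toPath());
        return Base64.getEncoder().encodeToString(bytes);
    }

    public static void main(String[] args) throws IOException {
        // Falls back to the file name used in the thread when no argument is given.
        File pdf = new File(args.length > 0 ? args[0] : "fn6742.pdf");
        String encoded = readContent(pdf);
        // Print only a short prefix; the full string can be very long.
        System.out.println(encoded.substring(0, Math.min(12, encoded.length())));
    }
}
```

In the working index request above, the encoded string is nested under `file` as `content` next to `_indexed_chars`, whereas the original request put it directly in `file`.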



(IronMike) #5

You are correct, my JSON mapping had a wrong entry. Thanks for the help!



