I have a one-to-many data set where each unique identifier (a document) can have between 1 and 5 attributes, and each of those attributes has additional attributes of its own. My initial feeling is to just flatten it and run match queries against the 5 columns.
Ultimately it's the same as the flat format you posted (Lucene will flatten all those values to g1.type, g1.confidence, etc.); it just looks a little cleaner.
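To make the flattening concrete, here is a small Python sketch of the idea. It only illustrates how object fields end up as dotted field names; it is not how Lucene or Elasticsearch implements it, and the sample values (including the "xml" type) are made up for illustration.

```python
def flatten(doc, prefix=""):
    """Flatten nested dicts into dotted field names,
    e.g. {"g1": {"type": "csv"}} becomes {"g1.type": "csv"}."""
    flat = {}
    for key, value in doc.items():
        path = f"{prefix}.{key}" if prefix else key
        if isinstance(value, dict):
            # Recurse into sub-objects, carrying the dotted path along.
            flat.update(flatten(value, path))
        else:
            flat[path] = value
    return flat


# Hypothetical document with two of the (up to five) groups filled in.
doc = {
    "g1": {"type": "csv", "confidence": 80},
    "g2": {"type": "xml", "confidence": 15},
}
flat_doc = flatten(doc)
print(flat_doc)
```

Either way you index it, the searchable fields are g1.type, g1.confidence, g2.type and so on.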
Alternatively you could use nested documents, but that's probably overkill for what you need.
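For comparison, a nested version might look roughly like the sketch below. It assumes the groups are stored as an array under a field called groups that is mapped with type nested, and that confidence is a number from 0 to 100; the field names are placeholders, not something from your data.

```python
import json

# Hypothetical nested query: both conditions must hold within the
# same nested object, so type and confidence stay paired.
nested_query = {
    "query": {
        "nested": {
            "path": "groups",  # assumed name of the nested field
            "query": {
                "bool": {
                    "must": [
                        {"match": {"groups.type": "csv"}},
                        {"range": {"groups.confidence": {"gte": 80}}},
                    ]
                }
            },
        }
    }
}

print(json.dumps(nested_query, indent=2))
```

With at most five groups per document, the object-field approach gets you the same result with less machinery.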
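Note that your query may need some re-arranging. If you want docs that are 80% CSV, you need to tie together the "type" query with the "confidence" query for the same group; otherwise a document could match on the type of one group and the confidence of another. Something like: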
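A minimal sketch of such a query, built in Python so the five groups don't have to be written out by hand. It assumes the fields are named g1 through g5, that confidence is stored as a number from 0 to 100, and that "80% CSV" means a confidence of at least 80; the helper name type_with_confidence is made up for this example.

```python
import json


def type_with_confidence(doc_type, min_confidence,
                         groups=("g1", "g2", "g3", "g4", "g5")):
    """Build a bool query that matches when any single group has
    both the given type and at least the given confidence."""
    per_group = [
        {
            "bool": {
                "must": [
                    # Both clauses refer to the same group, which is
                    # what ties type and confidence together.
                    {"match": {f"{g}.type": doc_type}},
                    {"range": {f"{g}.confidence": {"gte": min_confidence}}},
                ]
            }
        }
        for g in groups
    ]
    # One matching group is enough for the document to qualify.
    return {"query": {"bool": {"should": per_group,
                               "minimum_should_match": 1}}}


query = type_with_confidence("csv", 80)
print(json.dumps(query, indent=2))
```

The printed JSON can be sent as the body of a search request. If your confidence values are stored as fractions (0.8 rather than 80), adjust the threshold accordingly.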