I am sourcing some documents from a remote cluster and storing them in files on a local machine. I would like to store the documents as valid JSON that I can later read in, for example, in Python.
I used the file output plugin, which uses json_lines as its default codec. This, however, doesn't seem to do the job: it saves the documents in a newline-delimited format, which is not a single valid JSON document.
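To illustrate what I mean, here is a minimal sketch (the file contents are hypothetical examples, not my actual documents) of why the json_lines output cannot be parsed as one JSON document in Python:

```python
import json

# Hypothetical contents of a file written with the json_lines codec:
# each event is a standalone JSON object, separated by newlines.
raw = '{"id": 1, "msg": "first event"}\n{"id": 2, "msg": "second event"}\n'

# Parsing the whole file as one JSON value fails, because the file
# is a sequence of JSON objects rather than a single JSON document.
try:
    json.loads(raw)
    print("parsed as a single document")
except json.JSONDecodeError:
    print("not a single valid JSON document")
```

Each individual line is valid JSON, but the file as a whole is not.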
Has anyone faced this issue before? Any leads on this?
...however, if you have a lot of files, I would not recommend this; you can take out your OS or file system by trying to add millions of little files to a single directory. You may want to consider an intermediary system such as MySQL or Kafka (or even Elasticsearch), or batch the documents together in a single file with a codec such as Avro or Protobuf (or even JSON Lines) that your downstream application (say, Python) can decode.