Converting CSV files to ORC usually requires a Hadoop cluster. Since I only wanted to convert files for later upload into an existing cluster, I tried a different approach. While searching for a tool that could do the job, I arrived at Apache NiFi.
Here is the flow I used to transform my data.
- step 1 - list all existing CSV files
- step 2 - read each file into memory
- step 3 - convert content into AVRO
- sadly, AVRO needs a schema of your data to do the actual conversion, so here is the simple schema I used for my data:
- step 4 - convert AVRO to ORC
- step 5 - UpdateAttribute: set the target filename
- step 6 - write the ORC file to the target location
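
To illustrate the kind of schema step 3 needs, here is a minimal Python sketch that derives an AVRO record schema from a CSV header line. It is not the schema used for my data: the column names (`id`, `name`, `created_at`), the record name `CsvRecord`, and the helper `avro_schema_from_header` are made up for the example, and every column is simply typed as a nullable string.

```python
import csv
import io
import json


def avro_schema_from_header(csv_text, record_name="CsvRecord"):
    """Build a minimal Avro record schema from the first line of a CSV.

    Assumption: every column becomes a nullable string field. Avro field
    names must match [A-Za-z_][A-Za-z0-9_]*, so headers containing spaces
    or dashes would need to be renamed first (not handled here).
    """
    header = next(csv.reader(io.StringIO(csv_text)))
    fields = [
        # The default must match the first branch of the union ("null").
        {"name": column.strip(), "type": ["null", "string"], "default": None}
        for column in header
    ]
    return {"type": "record", "name": record_name, "fields": fields}


# Hypothetical sample data, only used to show the shape of the result.
sample = "id,name,created_at\n1,alice,2017-01-01\n"
schema = avro_schema_from_header(sample)
print(json.dumps(schema, indent=2))
```

The printed JSON is the kind of text the CSV to AVRO step expects as its schema. Tighten the types (for example `long` or `double`) where your columns are numeric.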
Here is the NiFi flow I used. You will have to change the file locations and the data schema before you can use it yourself.