in2csv: Support for large JSON files
Created by: dhulke
Recently I tried converting a 1 GB JSON file to CSV so I would be able to import it into SQLite and run a few queries for data analysis. It turns out in2csv crashes with a MemoryError. So I wrote a quick streaming JSON parser, somewhat like a SAX parser, and generated my own CSV (I hardcoded the total number of columns). My question is: are you guys planning on adding support for a streaming JSON parser so we could convert JSON to CSV without a limit on file size?
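For reference, here is roughly what my quick-and-dirty converter looked like, as a minimal sketch. It assumes the input is a top-level JSON array of flat objects and uses the ijson library for stream parsing; the column names and function name are just placeholders, and the header is hardcoded, exactly like in my hack:

```python
import csv
import ijson

COLUMNS = ["id", "name", "value"]  # hardcoded, hypothetical column set

def stream_json_to_csv(json_path, csv_path):
    with open(json_path, "rb") as src, open(csv_path, "w", newline="") as dst:
        writer = csv.writer(dst)
        writer.writerow(COLUMNS)
        # ijson.items() yields one object at a time, so memory use stays flat
        # regardless of the file size.
        for record in ijson.items(src, "item"):
            writer.writerow([record.get(col) for col in COLUMNS])
```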
If so, I could maybe try to implement that. I saw that today agate.Table.from_json loads the entire thing into memory, and agate.Table.to_csv actually needs the entire thing in memory to loop over it calling writer.writerow(). I initially thought about changing agate.Table.from_object (which from_json uses) to return a generator that would stream-parse the JSON, but I'm not sure that would work, since the total number of columns could vary halfway through the JSON. I even thought about skipping the header, parsing the entire file, and then rewriting the header at the end, but then I would have to fseek to every row with the wrong number of columns and add commas and nulls.
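To make the column problem concrete, here is a sketch of what a naive generator-based version would look like (again assuming a top-level JSON array and ijson; the function name is made up). The header has to be written before the first row, so any key that only appears in a later record is lost:

```python
import csv
import ijson

def naive_streaming_to_csv(json_path, csv_path):
    with open(json_path, "rb") as src, open(csv_path, "w", newline="") as dst:
        writer = csv.writer(dst)
        records = ijson.items(src, "item")  # lazy stream of dicts
        first = next(records)
        columns = list(first)  # header frozen from the first record only
        writer.writerow(columns)
        writer.writerow([first.get(col) for col in columns])
        for record in records:
            # Any key that wasn't in the first record is silently dropped here,
            # because the header has already been written.
            writer.writerow([record.get(col) for col in columns])
```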
Do you guys have any ideas/plans on how to go about this?
Edit: What I actually thought about doing was adding SQLite to agate and stream-parsing the JSON straight into SQLite; as I discover new columns, run a quick ALTER TABLE adding the column with a specified default value and move on. At the end, change agate.Table.to_csv to actually call an SQLite export function. That could work and would be somewhat faster than fseeking my way around, but it sounds like a bit of a hack.
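A rough sketch of what I mean, under the same assumptions as above (top-level JSON array of flat objects, ijson for parsing) and storing everything as TEXT. The table and function names are just placeholders, not an actual agate/csvkit API:

```python
import csv
import sqlite3
import ijson

def json_to_csv_via_sqlite(json_path, csv_path, db_path=":memory:"):
    conn = sqlite3.connect(db_path)
    columns = []

    with open(json_path, "rb") as src:
        for record in ijson.items(src, "item"):
            # Discover new columns on the fly; rows inserted before the
            # ALTER TABLE simply get NULL for the new column.
            for key in record:
                if key not in columns:
                    if not columns:
                        conn.execute('CREATE TABLE data ("%s" TEXT)' % key)
                    else:
                        conn.execute('ALTER TABLE data ADD COLUMN "%s" TEXT' % key)
                    columns.append(key)
            placeholders = ", ".join("?" for _ in columns)
            collist = ", ".join('"%s"' % c for c in columns)
            conn.execute(
                "INSERT INTO data (%s) VALUES (%s)" % (collist, placeholders),
                [None if record.get(c) is None else str(record[c]) for c in columns],
            )

    # The "export" step: dump the accumulated table back out as CSV.
    with open(csv_path, "w", newline="") as dst:
        writer = csv.writer(dst)
        writer.writerow(columns)
        query = "SELECT %s FROM data" % ", ".join('"%s"' % c for c in columns)
        for row in conn.execute(query):
            writer.writerow(row)
    conn.close()
```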