wireservice / csvkit · Issues · #957 (Closed)
Issue created Apr 28, 2018 by Administrator (@root)

in2csv: Support for large JSON files

Created by: dhulke

Recently I tried converting a 1 GB JSON file to CSV so I could import it into SQLite and run a few queries for data analysis. It turns out in2csv crashes with a MemoryError. So I wrote a quick JSON stream parser, somewhat like a SAX parser, and generated the CSV myself (with the total number of columns hard-coded). My question is: are you planning to add support for a stream-based JSON parser so we could convert JSON to CSV without a limit on file size?
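
For reference, this is roughly what my quick-and-dirty script did. It is only a sketch: it assumes the input is a top-level JSON array of flat objects, the column list and file names are hard-coded stand-ins, and ijson is used purely as an example of an event/stream parser (not something csvkit depends on today).

```python
import csv
import ijson

COLUMNS = ["id", "name", "value"]  # hard-coded, stand-in column names

with open("big.json", "rb") as src, open("big.csv", "w", newline="") as dst:
    writer = csv.DictWriter(dst, fieldnames=COLUMNS, extrasaction="ignore")
    writer.writeheader()
    # ijson yields one object at a time, so memory use stays flat
    # regardless of how large the input file is.
    for record in ijson.items(src, "item"):
        writer.writerow({col: record.get(col) for col in COLUMNS})
```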

If so, I could maybe try to implement it. I see that today agate.Table.from_json loads the entire file into memory, and agate.Table.to_csv also needs the whole table in memory so it can loop over it calling writer.writerow(). I initially thought about changing agate.Table.from_object (what from_json ultimately returns) to return a generator that stream-parses the JSON, but I'm not sure that would work, because the total number of columns could vary halfway through the file. I even thought about skipping the header, parsing the entire file, and then rewriting the header at the end, but then I would have to fseek back to every row written with the wrong number of columns and add commas and nulls.
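
One alternative that avoids the fseek problem, at the cost of reading the file twice, would be a two-pass approach: first stream through the file to collect the union of keys, then stream through it again to write the rows. This is just a sketch of the idea, again using ijson for illustration; the helper name is hypothetical and nothing like it exists in agate or csvkit today.

```python
import csv
import ijson

def json_array_to_csv(json_path, csv_path):
    # Pass 1: stream through the file once to collect the union of keys,
    # preserving the order in which they are first seen.
    columns = []
    seen = set()
    with open(json_path, "rb") as src:
        for record in ijson.items(src, "item"):
            for key in record:
                if key not in seen:
                    seen.add(key)
                    columns.append(key)

    # Pass 2: stream again and emit rows; missing keys become empty cells.
    with open(json_path, "rb") as src, open(csv_path, "w", newline="") as dst:
        writer = csv.DictWriter(dst, fieldnames=columns)
        writer.writeheader()
        for record in ijson.items(src, "item"):
            writer.writerow({col: record.get(col) for col in columns})
```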

Do you guys have any ideas/plans on how to go about this?

Edit: What I actually thought about doing was adding SQLite to agate and stream-parsing the JSON straight into SQLite; as I discover new columns, run a quick ALTER TABLE to add the column with a specified default value and move on. At the end, change agate.Table.to_csv to call an SQLite export instead. That could work and would be somewhat faster than fseeking my way around, but it sounds like a bit of a hack.
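
Roughly what I have in mind, sketched with the standard-library sqlite3 module and ijson. The function name and overall shape are hypothetical, it assumes flat objects, and it does no real identifier escaping; it is only meant to show the staging-table idea, not a real agate integration.

```python
import csv
import sqlite3
import ijson

def json_to_csv_via_sqlite(json_path, csv_path):
    conn = sqlite3.connect(":memory:")  # a temp file would suit truly huge inputs
    conn.execute("CREATE TABLE staging (_dummy INTEGER)")  # SQLite needs >= 1 column
    columns = []

    with open(json_path, "rb") as src:
        for record in ijson.items(src, "item"):
            if not record:
                continue
            for key in record:
                if key not in columns:
                    # New column discovered mid-stream: widen the table.
                    conn.execute(
                        f'ALTER TABLE staging ADD COLUMN "{key}" TEXT DEFAULT NULL'
                    )
                    columns.append(key)
            names = ", ".join(f'"{k}"' for k in record)
            placeholders = ", ".join("?" for _ in record)
            conn.execute(
                f"INSERT INTO staging ({names}) VALUES ({placeholders})",
                [None if v is None else str(v) for v in record.values()],
            )

    # The "export" step: dump the staging table back out as CSV.
    with open(csv_path, "w", newline="") as dst:
        writer = csv.writer(dst)
        writer.writerow(columns)
        select = "SELECT {} FROM staging".format(", ".join(f'"{c}"' for c in columns))
        for row in conn.execute(select):
            writer.writerow(row)
    conn.close()
```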
