wireservice / csvkit · Issues · #913

Closed

Issue created Dec 15, 2017 by Administrator @root (Contributor)

csvsql: Streaming mode for schema generation

Created by: fgregg

Related to the conversation in #737, I'd really like a memory-efficient (and faster) version of schema generation. Let's start by describing how csvkit currently generates a schema (this is a bit simplified); a sketch of the same flow in code follows the list.

  1. csvsql passes a file object to the from_csv class method of Table
  2. from_csv reads the file as a CSV and saves all rows to a rows object
  3. from_csv passes rows to Table.__init__
  4. Table.__init__ iterates through all rows to do type inference
  5. Table.__init__ iterates through all rows again, casting each field to its inferred type and saving the results to a new new_rows object. This effectively doubles the memory footprint until we exit from __init__ and rows is garbage collected
  6. the Table.__init__ method exits and control returns to the csvsql script
  7. csvsql calls the to_sql_create_statement method on the Table object
  8. to_sql_create_statement calls the make_sql_table method
  9. if the user is not passing the --no-constraints flag to csvsql, make_sql_table calculates precision and length constraints for every column. This effectively doubles the memory footprint again, because the agate aggregation methods call column.values and column.values_without_nulls, which create list objects of the values in a column. These methods are memoized, so the lists remain in memory until the parent column object is garbage collected, which won't happen until csvsql exits.
  10. (in agate-sql) The column information is passed to make_sql_column, which generates the appropriate SQL schema entry for that column.
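
For reference, here's a minimal sketch of what this path looks like from the caller's side, assuming the public agate / agate-sql APIs (the file name is illustrative):

```python
import agate
import agatesql  # noqa: F401 -- importing patches to_sql_create_statement onto Table

# Steps 1-6: the whole file is read, type-inferred, and cast in memory.
table = agate.Table.from_csv('data.csv')

# Steps 7-10: constraint calculation materializes the column.values /
# column.values_without_nulls lists for every column before the
# CREATE TABLE statement is emitted.
print(table.to_sql_create_statement('data', dialect='postgresql'))
```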

Here's an alternative flow that I would suggest for a streaming process:

  1. the csv reader is passed to a type-inference routine similar to the existing one, which exhausts all the rows. In addition to doing the type inference, this single pass also gathers information about max length and precision
  2. this information is passed to `make_sql_column`

So, basically, this bypasses creation of the Table object and expands the functionality of the type tester.
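
To make that concrete, here's a rough, hypothetical sketch of a single pass over the csv reader that infers a (deliberately crude) type while tracking max length and precision per column. None of these helper names exist in csvkit or agate today, and real type inference would be much richer:

```python
import csv
from decimal import Decimal, InvalidOperation

def stream_schema_info(f):
    """Single pass: infer a crude type and track length/precision per column."""
    reader = csv.reader(f)
    header = next(reader)
    # One accumulator per column: candidate type, max length, precision, scale.
    stats = [{'type': 'number', 'max_len': 0, 'precision': 0, 'scale': 0}
             for _ in header]
    for row in reader:
        for i, value in enumerate(row):
            s = stats[i]
            s['max_len'] = max(s['max_len'], len(value))
            if s['type'] == 'number' and value:
                try:
                    tup = Decimal(value).as_tuple()
                    if isinstance(tup.exponent, int):  # skip NaN/Infinity
                        s['scale'] = max(s['scale'], max(-tup.exponent, 0))
                        s['precision'] = max(s['precision'], len(tup.digits))
                except InvalidOperation:
                    # First non-numeric value demotes the column to text.
                    s['type'] = 'text'
    return header, stats
```

Each entry in stats would then carry enough information to feed make_sql_column directly, without ever holding the full table in memory.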

Thoughts?
