csvsql: Streaming mode for schema generation
Created by: fgregg
Related to the conversation in #737, I'd really like a memory-efficient (and faster) version of schema generation. Let's start by describing how csvkit currently generates a schema (this is a bit simplified; a rough code sketch follows the list):
- `csvsql` passes a file object to the `from_csv` class method of `Table`
- `from_csv` reads the file as a csv and saves all rows to a `rows` object
- `from_csv` passes `rows` to `Table.__init__`
- `Table.__init__` iterates through all rows to do type inference
- `Table.__init__` iterates through all rows again, casting fields to the appropriate type and saving the result to a `new_rows` object. This effectively doubles the memory footprint until we exit from `__init__` and the original `rows` is garbage collected
- the `Table.__init__` method exits and control is passed back to the `csvsql` script
- `csvsql` calls the `to_sql_create_statement` method on the `Table` object
- `to_sql_create_statement` calls the `make_sql_table` method
- if the user is not using the `--no-constraints` flag on `csvsql`, `make_sql_table` calculates precision and length constraints for every column. This effectively doubles the memory footprint again, because the agate aggregation methods call `column.values` and `column.values_without_nulls`, which create list objects of the values in each column. These methods are memoized, so the lists remain in memory until the parent column object is garbage collected, which won't happen until `csvsql` exits
- (agate-sql) the column information is passed to `make_sql_column`, which makes the appropriate SQL schema entry for that column
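For context, here is roughly what that flow boils down to. This is a minimal sketch, not the actual `csvsql` source, and it assumes `agate` and `agate-sql` are installed and that the monkey-patched `to_sql_create_statement` method takes a table name argument:

```python
import agate
import agatesql  # noqa: F401 -- adds to_sql_create_statement to agate.Table

# Sketch of the current, fully in-memory flow (not the csvsql source).
with open('data.csv') as f:
    # from_csv reads every row, runs type inference over all of them, and then
    # casts every field into a second set of rows before __init__ returns.
    table = agate.Table.from_csv(f)

# Without --no-constraints, the constraint calculation runs aggregations that
# materialize (and memoize) column.values / column.values_without_nulls lists,
# which stay alive until the column objects are garbage collected.
print(table.to_sql_create_statement('data'))
```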
Here's an alternative flow that I would suggest for a streaming process:
- a csv reader is passed to a type inference step similar to the existing one, which exhausts all the rows. In addition to doing the type inference, this single iteration also collects information about max length and precision
- this information is passed to `make_sql_column`
So, basically, we bypass creation of the `Table` object and expand the functionality of the type tester. A rough sketch of what that pass could look like is below.
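This is only a hypothetical illustration of the single streaming pass: the `infer_schema` function and the per-column stats dicts it returns are made up for this sketch (they are not existing csvkit or agate API), and a real implementation would extend agate's `TypeTester` rather than the crude Decimal-vs-text check used here:

```python
import csv
from decimal import Decimal, InvalidOperation

def infer_schema(reader):
    """One pass over a csv.reader: infer a crude per-column type and track
    the max length and numeric precision/scale needed for constraints."""
    header = next(reader)
    stats = [{'type': None, 'max_length': 0, 'precision': 0, 'scale': 0}
             for _ in header]

    for row in reader:
        for col, value in zip(stats, row):
            if value == '':
                continue  # empty cells don't affect type or constraints
            col['max_length'] = max(col['max_length'], len(value))
            try:
                number = Decimal(value)
            except InvalidOperation:
                col['type'] = 'text'
                continue
            if not number.is_finite():
                col['type'] = 'text'
                continue
            _, digits, exponent = number.as_tuple()
            col['precision'] = max(col['precision'], len(digits))
            col['scale'] = max(col['scale'], max(-exponent, 0))
            if col['type'] is None:
                col['type'] = 'number'

    return header, stats

with open('data.csv') as f:
    header, stats = infer_schema(csv.reader(f))
```

The per-column dicts end up carrying the pieces a `make_sql_column`-style step would need (type, max length, precision/scale), so the full table never has to be held in memory and no second casting pass is required.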
Thoughts?