wireservice / csvkit / Issues / #944 (Closed)
Issue created Mar 18, 2018 by Administrator (@root)

csvstat: UnicodeError in Python 2 when --csv option enabled

Created by: binarytooth

csvstat 1.0.2, Python 2.7.12, Ubuntu 16.04.3 LTS (xenial)

When I run csvstat on a file containing a header and a single euro sign € (byte 0xA4 in ISO-8859-15), it exits with an error, but only if the --csv option is included. Without the option, csvstat generates output correctly. A copy of the test file, euro-sign-iso-8859-15.txt, is attached to this ticket.

This happens with every byte from 0x80 (128) through 0xFF (255), with the exception of 0x85 (the next-line character). All of them produce the same error when csvstat is run with the --csv option.
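For anyone who wants to reproduce this without the attachment, here is a minimal sketch (written for Python 3) that generates a comparable file for any byte in the affected range. The helper names, the output file name, and the `\n` line endings are assumptions of mine; only the header name "Contents" comes from the csvstat output in this report, and the attached file may differ in its line endings.

```python
import os
import tempfile

# Header name taken from the csvstat output in this report.
HEADER = b"Contents\n"


def build_test_bytes(byte_value):
    """Return a header line plus a one-byte data row (hypothetical helper)."""
    if not 0x80 <= byte_value <= 0xFF:
        raise ValueError("expected a byte in the range 0x80-0xFF")
    return HEADER + bytes([byte_value]) + b"\n"


def write_test_file(directory, byte_value):
    """Write the test bytes to <directory>/byte-XX.txt and return the path."""
    path = os.path.join(directory, "byte-%02x.txt" % byte_value)
    with open(path, "wb") as handle:
        handle.write(build_test_bytes(byte_value))
    return path


if __name__ == "__main__":
    target = tempfile.mkdtemp()
    # 0xA4 is the euro sign in ISO-8859-15.
    print(write_test_file(target, 0xA4))
```

Each generated file can then be passed to csvstat with -e ISO-8859-15, with and without --csv, to compare the two behaviours.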

When csvstat is run on the test file without the --csv option, it produces the following correct output:

$ csvstat -e ISO-8859-15 euro-sign-iso-8859-15.txt

  1. "Contents"
Type of data:          Text
Contains null values:  False
Unique values:         1
Longest value:         1 characters
Most common values:    € (1x)

Row count: 1

When I run the same command but with the --csv option enabled it blows up.

$ csvstat -v -e ISO-8859-15 euro-sign-iso-8859-15.txt --csv
column_id,column_name,type,nulls,unique,min,max,sum,mean,median,stdev,len,freq
Traceback (most recent call last):
  File "/usr/local/bin/csvstat", line 9, in <module>
    load_entry_point('csvkit==1.0.2', 'console_scripts', 'csvstat')()
  File "/usr/local/lib/python2.7/dist-packages/csvkit/utilities/csvstat.py", line 335, in launch_new_instance
    utility.run()
  File "/usr/local/lib/python2.7/dist-packages/csvkit/cli.py", line 114, in run
    self.main()
  File "/usr/local/lib/python2.7/dist-packages/csvkit/utilities/csvstat.py", line 166, in main
    self.print_csv(table, column_ids, stats)
  File "/usr/local/lib/python2.7/dist-packages/csvkit/utilities/csvstat.py", line 318, in print_csv
    writer.writerow(output_row)
  File "/usr/local/lib/python2.7/dist-packages/agate/csv_py2.py", line 190, in writerow
    UnicodeWriter.writerow(self, row)
  File "/usr/local/lib/python2.7/dist-packages/agate/csv_py2.py", line 103, in writerow
    self.writer.writerow([six.text_type(s if s is not None else '').encode(self.encoding) for s in row])
  File "/usr/lib/python2.7/codecs.py", line 369, in write
    data, consumed = self.encode(object, self.errors)
UnicodeDecodeError: 'ascii' codec can't decode byte 0xe2 in position 32: ordinal not in range(128)
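One possible reading of the traceback, offered as a sketch and not as a confirmed diagnosis: the row is encoded to UTF-8 bytes in csv_py2.py, and the codecs writer that receives those bytes then tries to encode them again, which in Python 2 implies an ASCII decode first. Byte 0xe2 is the first byte of € in UTF-8 (E2 82 AC). The Python 3 snippet below performs that decode explicitly. The row values are my reconstruction from the non-CSV output above and `ascii_decode_error` is a hypothetical helper; neither is taken from csvkit or agate.

```python
# Row values reconstructed from the non-CSV csvstat output (an assumption):
# column_id, column_name, type, nulls, unique, six empty numeric stats, len, freq.
ROW = ["1", "Contents", "Text", "False", "1",
       "", "", "", "", "", "",
       "1", "\u20ac"]


def ascii_decode_error(text):
    """Encode text as UTF-8, then decode as ASCII; return the error message or None."""
    encoded = text.encode("utf-8")
    try:
        encoded.decode("ascii")
    except UnicodeDecodeError as error:
        return str(error)
    return None


if __name__ == "__main__":
    line = ",".join(ROW)
    # Mimics the implicit ASCII decode that Python 2 applies to byte strings.
    print(ascii_decode_error(line))
```

With this reconstructed row the euro sign sits at offset 32 of the line, which matches the position reported in the traceback and is consistent with the whole encoded row being passed to the codecs writer.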

This problem also occurs for files encoded as WINDOWS-1252. The bug prevents csvstat --csv from handling many useful characters in CSV files, such as smart quotes, the dagger and double dagger, many accented letters, and the euro sign.
