The CSV Clean Service is a tool that automatically cleans up CSV data sources. It detects the encoding of the input CSV and recognizes several delimiter variants, such as tab- or semicolon-separated files.
The cleaned file is a UTF-8 encoded, RFC 4180 compliant CSV document. RFC 4180 requires the following properties:
- Each record is located on a separate line, delimited by a line break (CRLF).
- The last record in the file may or may not have an ending line break.
- There may be an optional header line appearing as the first line of the file with the same format as normal record lines. This header will contain names corresponding to the fields in the file and should contain the same number of fields as the records in the rest of the file (the presence or absence of the header line should be indicated via the optional “header” parameter of this MIME type).
- Within the header and each record, there may be one or more fields, separated by commas. Each line should contain the same number of fields throughout the file. Spaces are considered part of a field and should not be ignored. The last field in the record must not be followed by a comma.
- Each field may or may not be enclosed in double quotes (however some programs, such as Microsoft Excel, use double quotes for all fields, or not at all). If fields are not enclosed with double quotes, then double quotes may not appear inside the fields.
- Fields containing line breaks (CRLF), double quotes, and commas should be enclosed in double-quotes.
- If double-quotes are used to enclose fields, then a double-quote appearing inside a field must be escaped by preceding it with another double quote.
Additionally, the RFC standard specifies the MIME type of CSV files as “text/csv”.
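The quoting and line-break rules listed above can be illustrated with Python's standard `csv` module. This is a minimal sketch, not the service's actual code; the helper name `to_rfc4180` and the sample rows are assumptions made for illustration.

```python
import csv
import io


def to_rfc4180(rows):
    """Serialize rows as comma-separated text with CRLF line breaks.

    Fields containing commas, double quotes, or line breaks are quoted,
    and embedded double quotes are escaped by doubling them.
    """
    buffer = io.StringIO(newline="")  # keep "\r\n" untouched
    writer = csv.writer(
        buffer,
        delimiter=",",
        quotechar='"',
        doublequote=True,
        quoting=csv.QUOTE_MINIMAL,
        lineterminator="\r\n",
    )
    writer.writerows(rows)
    return buffer.getvalue()


# Hypothetical sample data: one field with quotes and a comma,
# one field with an embedded line break.
rows = [
    ["id", "comment"],
    ["1", 'said "hi", then left'],
    ["2", "two\r\nlines"],
]
text = to_rfc4180(rows)

# Encode as UTF-8 before writing the cleaned file to disk.
payload = text.encode("utf-8")
```

Reading `text` back with `csv.reader` yields the original rows, which is a quick way to check that the quoting is lossless.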
The CSV Clean Service makes use of two open source software libraries that were developed in the course of the ADEQUATe project: the anycsv and the pyyacp CSV parsers.
The anycsv module is a robust CSV parser for Python based on Python's default csv module. It extends the default module with automated file encoding detection and an extended delimiter detection heuristic. This way, the parser is able to handle a wide range of CSV variants in country-specific file encodings.
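The two steps described above, encoding detection and delimiter detection, can be sketched with the standard library alone. The following is an illustration under stated assumptions, not the module's actual implementation: the function names `guess_encoding` and `guess_delimiter`, the list of candidate encodings, and the scoring rule (pick the delimiter that yields the most columns with a consistent row width) are all hypothetical.

```python
import csv

# Assumed candidate list; a real detector would use a statistical approach.
CANDIDATE_ENCODINGS = ("utf-8-sig", "cp1252")


def guess_encoding(raw: bytes) -> str:
    """Return the first candidate encoding that decodes the bytes without error."""
    for encoding in CANDIDATE_ENCODINGS:
        try:
            raw.decode(encoding)
            return encoding
        except UnicodeDecodeError:
            continue
    return "latin-1"  # latin-1 maps every byte, so it never fails


def guess_delimiter(sample: str, candidates=(",", ";", "\t", "|")) -> str:
    """Pick the delimiter giving the widest consistent row width."""
    lines = sample.splitlines()
    best, best_width = ",", 1
    for delimiter in candidates:
        widths = [len(row) for row in csv.reader(lines, delimiter=delimiter) if row]
        if widths and len(set(widths)) == 1 and widths[0] > best_width:
            best, best_width = delimiter, widths[0]
    return best


# Hypothetical input: a semicolon-separated file in a Western European encoding.
raw = "Name;Stadt\r\nJörg;Wien\r\n".encode("cp1252")
encoding = guess_encoding(raw)
delimiter = guess_delimiter(raw.decode(encoding))
rows = list(csv.reader(raw.decode(encoding).splitlines(), delimiter=delimiter))
```

Trying encodings in a fixed order is only reliable for a small set of candidates; it is used here because it keeps the sketch free of third-party dependencies.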
The pyyacp module is an extended CSV parser that builds on top of the anycsv parser. It integrates several heuristics to detect (i) missing or multiple header lines, (ii) leading comment and description lines in the document, (iii) multiple tables (separated by empty rows) within the same document, and (iv) simple column data types, such as numeric values or dates. A detailed description of the heuristics can be found in the paper “Characteristics of open data CSV files”.
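To make heuristics (ii) to (iv) more concrete, the sketch below shows one simple way each could work on already parsed rows. It is an assumption-laden illustration, not the parser's actual algorithm: the function names, the comment rule (a leading row with at most one filled cell, or starting with `#`), and the two date formats are all hypothetical. Header detection (i) is not covered.

```python
from datetime import datetime


def drop_leading_comments(rows):
    """Skip leading rows that look like comments or descriptions."""
    width = max((len(row) for row in rows), default=0)
    start = 0
    for row in rows:
        filled = [cell for cell in row if cell.strip()]
        starts_with_hash = bool(filled) and filled[0].lstrip().startswith("#")
        mostly_empty = width > 1 and len(filled) <= 1
        if not (starts_with_hash or mostly_empty):
            break
        start += 1
    return rows[start:]


def split_tables(rows):
    """Split rows into separate tables at entirely empty rows."""
    tables, current = [], []
    for row in rows:
        if any(cell.strip() for cell in row):
            current.append(row)
        elif current:
            tables.append(current)
            current = []
    if current:
        tables.append(current)
    return tables


def guess_type(values, date_formats=("%Y-%m-%d", "%d.%m.%Y")):
    """Classify a column as 'numeric', 'date', 'string', or 'empty'."""
    def is_number(value):
        try:
            float(value.replace(",", "."))  # accept decimal commas
            return True
        except ValueError:
            return False

    def is_date(value):
        for fmt in date_formats:
            try:
                datetime.strptime(value, fmt)
                return True
            except ValueError:
                pass
        return False

    values = [v.strip() for v in values if v.strip()]
    if not values:
        return "empty"
    if all(is_number(v) for v in values):
        return "numeric"
    if all(is_date(v) for v in values):
        return "date"
    return "string"


# Hypothetical parsed document: a description line, then two tables.
rows = [
    ["Example report", "", ""],
    ["name", "amount", "updated"],
    ["A", "10", "2020-01-31"],
    ["B", "2,5", "2020-02-29"],
    ["", "", ""],
    ["code", "label", ""],
    ["X", "Example", ""],
]
tables = split_tables(drop_leading_comments(rows))
data_rows = tables[0][1:]  # assume the first row of a table is its header
column_types = [guess_type(column) for column in zip(*data_rows)]
```

With this input the description line is dropped, the empty row separates two tables, and the three columns of the first table are classified independently from their data rows.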