The CSV MetaData Editor is an user interface that allows the assisted generation of metadata for a CSV. The JSON output is compliant to the W3C standard for CSV on the Web (CSVW). The metadata can describe CSV specifics such as the delimiter, the encoding and the quotation character. It also describes the columns and their headers and datatypes. Figure 2 below displays an example JSON metadata document for a dataset by the mumok museum. Additionally, it can contain keywords, publisher and title, the language, and licensing information.
Many files containing tabular data embed metadata, for example in lines before the header row of an otherwise standard CSV document. The CSVW specification does not define any formats for embedding metadata within CSV files, aside from the titles of columns in the header row which is defined in CSV. Instead it provides this metadata information in an additional (JSON) file which should be made available alongside the CSV. In the specification there are several ways described where and how to place the JSON metadata file. In the context of Open Data portals it is best practice and most straightforward to provide the metadata file as an additional resource at the dataset landing page of the portal, e.g.: http://data.opendataportal.at/dataset/kunstler-der-sammlung-des-mumok
Figure 1: Example CSV metadata generated by the CSV MetaData Editor
In order to automatically generate this CSV description we try to retrieve the first 100 lines of each CSV and apply the following methods and heuristics to determine the dialect and properties of the CSVs:
- We use the “Content-Type” and “Content-Length” HTTP response header fields to get the media type and file size of the resource. Note, that both of these fields might contain not accurate information in some cases, since some servers send the content length of the compressed resource and also use the compression’s media type (e.g., application/gzip).
- We use the Python magic package to detect the the encoding of the retrieved resource.
- We adapted the default Python CSV parser by including the encoding detection and refining the delimiter detection (by increasing the number of sniffed lines and modifying the preferred delimiters); see anycsv in section 2.1.
- We heuristically determine the number of header lines in a CSV file by considering changes to datatypes within the first rows. For instance, if we observe columns where all entries are numerical values and follow the same pattern – including the first row – we do not consider the first row as a leading header row. Obviously, there are cases where this heuristic may fail. Our intention here is that this “guessed” information already might be of value for an user.
- We perform a simple datatype detection on the columns of the CSVs: we distinguish between columns which contain numerical, binary, datetime or any other “string” values, and use the respective XSD datatypes number, binary, datetime and anyAtomicType.