The ADEQUATe-project spent considerable time to analyse storage strategies to retrieve and analyse open data. Storage requirements arouse in these areas:
- Open data portals metadata. The ADEQUATe-project performs analysis on different aspects of the metadata like it’s readability. For this to be performed and especially to monitor changes to those metrics over time it is required to fetch metadata and store this data.
- Open data portals data sets. Quality improvement metrics can be divided into different fields like assessing quality aspects of the providing portal (for example it’s responsiveness), the metadata (eg. readability) and of course quality aspects of the data sets themselves. Arguably data quality aspects of datasets are the most important one: After a developer identifies a data set and made herself familiar with, he or she is mostly interested in the data and not longer in the expressiveness of the assigned metadata.
- The process data to operate the ADEQUATe portal. As the ADEQUATe project monitors quality changes of data portals and data sets over time, past quality assessments have to be kept for analysis and visualisations. Furthermore ADEQUATe supports user interactions for human-made data quality improvements. This administrative data accrues automatically over time and adds considerable payload to handle efficiently.
- User contributed content. The ADEQUATe-project will incorporate user provided improvements to data sets. Some quality improvements require contextual information the system is not aware of to perform automatically. These changes require to be stored in the system ideally as a record changes to the original data sets.
We store data portals metadata in a PostgreSQL Database. Metadata obtained from CKAN portals via API is in JSON format and Postgres json-data type comes in handy for that. Storing the raw JSON allows querying the metadata using SQL to perform crude assessments of metadata existence and statistics. While we originally used MongoDB for that purpose, we ran into performance issues and found the data analytic capability of Mongo wanting.
Storing an open data portals data sets is more involved. First, not everything is “data”. You cannot simply download what you get from an open data API, store it’s contents and assume this to be any meaningful data. Even more so, identifying an API from a mere data set is not as easy as it might seem. From the Internet perspective, both are ordinary resources which respond to HTTP-Get requests and deliver data in XML, JSON, CSV or whatever mime-type, as an “ordinary”data sets itself would also do. The difference is that an API often accepts query parameters whereas a simple download doesn’t. Furthermore, using the same query parameter, an API might not return the same data (think of an API returning the current water tide level at a certain location). As of now we neglected the peculiarities of APIs.
Data can quickly accrue over time, especially as we harvest data sets in re-occurring intervals. Even a data portal with only a mediocre amount of data (in volume) can become big over time. Our analysis has shown, that only a relatively tiny amount of data (10 -15%) changes frequently, with most data sets being relatively static. Which makes sense, given the claimed refresh intervals of eg. monthly or yearly and the distribution of the data sets claiming shorter refresh intervals.
Techniques to cope with data which doesn’t change can be applied at different levels. Some file systems like the BTRFS, XFS or ZFS support de-duplication. Against intuition, de-duplication is the opposite to what it stands for: It prevents duplicate data to be written to disk at the disk block level. When you copy a file, the copy will contain the some content as the source. If the copy is to be stored on the same disk (array), it would make sense to point at the disk block level to the same data blob and simply logically duplicate the file with maybe new name, access right and so on. That’s what de-duplication does at the expense of requiring lots amounts of RAM as every write operation has to be checked if there is already a matching data block the newly to be written data can be assigned to.
De-duplication support varies from operating system and the used file system. While using de-dup to cope with sparse data writes has the advantage to be able to concentrate on the business logic, it is still new technology which should be used with caveat. We decided to go with a robust and reliable solution in user space. When downloading a resource we calculate a hash and use this hash value as the linking element between a data sets metadata and the data itself. The data itself is stored on disk using the hash value as it’s filename. When a file name clashes as the same file name has already been used (the hash calculated from it’s content) we can be confident that the resources content did not change. We update metadata (at what time did we download the resource etc.) but we need not duplicate our resources. In CS this approach is known as content-addressable-storage (CAS) and the pillar of Camlistore or the IPFS.
Ideally web servers would give us the information whether or when a resource did change. The HTTP-HEAD request has been designed to actually avoid retrieving a data set if you are only interested in associated metadata like it’s size, format, change date and so on. Unfortunately
- some web services do not support HTTP-HEAD lookups;
- send the body although they must not send it;
- leave out important information which would be useful to detect changes to resources or
- the information in a HEAD-lookup differs to the actual GET-request which makes it unreliable to detect changes.
Although during the course of the ADEQUATe-project our interest in the data itself to perform analytics rose, even detecting changes in a reliable manner involves downloading the data. You simply cannot trust the response of HTTP HEAD.
During the course of performing data quality improvements, a range of actions and data processing steps are performed. This ranges from calculating metrics (completeness, timeliness) to actually transforming data itself (eg. CSV to JSON or CSV on the web). That means an original data gets enriched with additional metadata or the data itself changes.
We currently store metadata in Postgres and so do we store the data quality assessment results. Intermediate and final data processing results are a different beast. Thinking about data processing in general, it can be perceived as a function which takes data as input and produces some form of data as output:
f’ = f(data)
If we change the functions to be of curried form, what we finally obtain is list of operations on data where the output of one function feeds as input into the next processing step and can be conveniently modelled as
X → Y → Z
This allows us to model a processing pipeline as a logical path where we take data as input, specify the functions to be performed on on it and which eventually yields a result. The set of operations will vary on the data itself (CSV files allows a different set of operations than eg. RDF or JSON) and will produce side effects like metadata.
In ADEQUATe we model data processing as a URI, where each segment denotes a function. Thus the URI
instructs the processing engine to take the data blob referenced by hash 77438ad…, feed it into the function CSVWeb and feed that output into JSONConvert. As processing pipeline we use Unified Views and intermediate steps along their side effects are persisted into a key value-database.
User contributed content
When we talk about user contributed content we actually mean user contributed changes. The ADEQUATe platform empowers users to provide changes to existing data sets but not to provide new data sets. Enabling user contributed changes to existing data is a delicate trade-off between usability and power. The set of operations a user would want to perform on data varies on his level of involvement and knowledge. Less tech-savy and less involved users would want to make quick changes online, whereas data scientist are aware of more versatile and powerful tools and likely consider the set of operations provided through a web interface limiting and not fitting to set of tools they are accustomed with.
In ADEQUATe we decided to follow the Git-Approach. Git is a decentralized storage systems which records changes to files. These changes can be pushed back to the originator with other users being able to inspect these changes and continue working from these changes instead of using the originally provided files. A prominent use-case would be someone downloading a CSV file containing coordinates in EPSG:31256, adding another column for latitude/longitude in EPSG:4326 for portrayal on Google Maps, and providing these changes to other. In git-speach downloading is called “forking” and providing changes to others is called “pushing”.
Git is an open source project with many additional components available like a friendly open source user interface called Gitlab which ADEQUATe-projects uses. Git records changes made to files, thus small changes to big files consume little resources. Additionally it is widely used in the data analytics community and open data community. However, git has considerable downsides too, most notably the notion of “change” performed to a resource. Git has it’s origins in source code management. Changes to source code are performed on a line basis, whereas changes to data (especially tabular data as is the case with the widely used CSV files) are often column-oriented. In our case of adding an additional column to a CSV file will result in changes to each and every row whereas seen from a data perspective only one column has been added. Further downsides include
- the set of visualisations available to track and monitor changes operate line-wise and do not account for the changed file type. The ODI has provided plugins to circumvent this shortcomings but they have not (yet) been included into the general Git workflow and tools;
- Git has not been originally designed to work with large files. There exists two approaches to deal with big files in git, namely Git Annex and Git-LFS. Whereas the first solution is generic and supports a wide range of storage solutions, it is more involved to set up and use. The later focuses on one thing doing right but is more limiting in scope. In ADEQUATe we went with GIT-LFS provided by GitLab;
- Git has not been designed to work with binary files. While there are binary open data formats like PNG image files, we considered this to be of a minor issue.
Including user contributed content into data quality processing pipeline means quite some changes to the underlying data infrastructure. First and foremost the notion of one data set per originator is not longer sustainable. Second, our data quality dashboard and associated components (data retrieval in re-occurring intervals) have to be made aware of users making changes to data sets delivered via Git. Also changes to data can now happen at any time and should trigger re-calculating quality metrics as the overall ADEQUATe goal is to provide comprehensible and replicable metrics to monitor data quality improvements. GUI components which displayed these metrics now have to deal with the fact that there will be more but one active version to a data set. Another issue is how to deal with new versions provided by the data owner which invalidate user provided changes.
There are alternatives to the Gitlab workflow we shortly considered but did not pursue any further as these alternatives are still a moving target (Dat changed what it wanted to be from the moment we became aware of to what it is now), to young or do not come with friendly end user interfaces which would draw away less technology oriented end users. (eg. Noms is promising as it claims to be a distributed, forkable database in spirit what Git is to files)
The incorporation of user provided changes into the data quality improvement pipeline is an ongoing challenge. We are happy to hear from your experiences. Let’s exhange our knowledge and get int touch!