My doctoral thesis “Open Collaborative Data Engineering With Subject-Matter Experts Using Domain-Specific Languages” was published in the Open-FAU Repository.
Abstract
Open collaborative workflows are common, for example, in open-source software development or Wikipedia. They reduce costs for individual participants and can improve the overall quality of the result. A potential application domain for open collaboration is data engineering, especially for open data, which shares many qualities with open-source software as it can be freely used, modified, and shared.
However, data from complex domains requires the expertise of human subject-matter experts to be understood and made usable for later applications. These experts often lack the technical background needed to collaborate with software engineers using existing, text-based collaboration tools like general-purpose programming languages and project forges. Instead, various visual programming tools exist that allow non-technical contributors to build data pipelines. These tools are often proprietary and do not lend themselves to collaboration.
In this thesis, we explore a potential middle ground in the form of using domain-specific languages as the foundation for a shared collaboration artifact to describe data pipelines. To do so, we follow a design science methodology to identify underlying problems for collaborative data engineering with subject-matter experts, contribute an innovative artifact in the form of a domain-specific language, and empirically validate and evaluate this artifact to investigate the underlying reasons for its performance.
Initially, we summarize the literature on collaboration systems in open collaborative data engineering using a systematic literature review to develop an understanding of the current state of the art. We find a diverse ecosystem of participants, activities, tools used, and artifacts that are created during collaboration.
Based on an interview study with data engineering practitioners, we describe how their work is organized in social systems based on roles and their interactions for small-scale project groups and the wider open data ecosystem. We identify concrete challenges to collaborative data engineering and develop recommendations for resolving them.
Following up on a recommendation for the most pressing challenges, such as high technical barriers to contribution and the lack of standard collaboration artifacts, we propose and implement a textual domain-specific language for creating data pipelines, based on the well-known pipes-and-filters architecture.
Lastly, in a series of empirical studies with human participants, we first validate that the domain-specific language is a potential basis for a collaboration artifact for non-professional programmers and then evaluate its performance compared to Python using controlled experiments. By combining the results of the controlled experiments with a follow-up survey, we describe the effects that using a domain-specific language for data engineering has on collaborators.
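For readers unfamiliar with the pipes-and-filters architecture mentioned in the abstract, the following is a minimal Python sketch of the general pattern. It is not the domain-specific language developed in the thesis. The filter names (`drop_incomplete`, `normalize_city`), the `pipeline` helper, and the sample rows are hypothetical and chosen only for illustration.

```python
from functools import reduce
from typing import Callable, Iterable, Iterator

# Assumption for this sketch: a row is a flat mapping of column name to text.
Row = dict[str, str]
Filter = Callable[[Iterable[Row]], Iterator[Row]]


def drop_incomplete(rows: Iterable[Row]) -> Iterator[Row]:
    """Filter: discard rows that contain an empty value."""
    for row in rows:
        if all(value.strip() for value in row.values()):
            yield row


def normalize_city(rows: Iterable[Row]) -> Iterator[Row]:
    """Filter: trim and title-case the (hypothetical) 'city' column."""
    for row in rows:
        yield {**row, "city": row["city"].strip().title()}


def pipeline(*filters: Filter) -> Filter:
    """Pipe: chain filters so each consumes the previous one's output."""
    def run(rows: Iterable[Row]) -> Iterator[Row]:
        return reduce(lambda stream, f: f(stream), filters, iter(rows))
    return run


# Hypothetical sample data, not taken from the thesis.
raw = [
    {"city": " erlangen ", "stops": "12"},
    {"city": "nuremberg", "stops": ""},
    {"city": "BAMBERG", "stops": "7"},
]

clean = list(pipeline(drop_incomplete, normalize_city)(raw))
for row in clean:
    print(row)
```

Each filter only knows about the stream of rows it receives and the stream it emits, so steps can be added, removed, or reordered without touching the others.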