
## Overview
This repo contains the code we created for the so-called StatBot project. StatBot was initiated by the Federal Statistical Office (FSO) and CORSTAT, the Swiss Conference of Cantonal Statistical Offices. The project was commissioned as part of the FSO's Data Innovation Strategy, within the Data Science Competence Center (DSCC).

The goal of the project is twofold:
- We want to help harmonize statistical data from various official sources and bring the datasets into a common data space (we call it «StatBot warehouse»).
- In a second step, we want to investigate whether we can create a chatbot system that allows a user to query the harmonized data.

This repo mainly served our collaboration on the first goal of the project. We finished our explorations at the end of 2022 and will not continue to work on this repo.

Since the beginning of 2023, a research team at the Zürich University of Applied Sciences (ZHAW), led by Prof. Kurt Stockinger, has been working on the second goal of the project. The team plans to finish its work at the end of 2023 and to publish the results in a research paper and likely also in a public repo.

This project is free for everyone to use. However, the workflows are not self-explanatory and require detailed knowledge of the data structures and ETL processes involved. If you are interested, please contact us directly (datashop at statistik dot ji dot zh dot ch). We are happy to give you more context about the project and the insights we gained.

## Project structure
StatBot can be described as a system that

- allows a statistical institution to export its data into a common, standardized data space
- allows a user to ask a data-related question in natural language and get an answer back in natural language


The left part of the system is the data warehouse. It consists of data pipelines that extract, transform and ingest data from various sources into a common, standardized data space.
The right part of the system is the machine learning solution. It translates a user's natural language question into an SQL query, uses that query to retrieve an answer from a database and delivers the result to the user in natural language.
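
The question-to-answer flow described above can be sketched roughly as follows. This is only an illustration under stated assumptions, not the project's implementation: the table name `population`, the spatial unit code `A` and the functions `question_to_sql` and `answer` are hypothetical, and the machine learning step is replaced by a hard-coded stub.

```python
import sqlite3


def question_to_sql(question: str) -> str:
    """Hypothetical stand-in for the machine learning component.

    A real system would generate the SQL from the question. This stub
    ignores the question and returns one fixed query, so that the
    surrounding flow can be shown.
    """
    return (
        "SELECT value FROM population "
        "WHERE spatialunit_ontology = 'A' AND time_value = '2021'"
    )


def answer(question: str, conn: sqlite3.Connection) -> str:
    """Translate the question to SQL, run it, and phrase the result."""
    sql = question_to_sql(question)
    row = conn.execute(sql).fetchone()
    if row is None:
        return "No data found."
    return f"The answer is {row[0]}."


# Tiny in-memory table with made-up numbers (illustration only).
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE population "
    "(spatialunit_ontology TEXT, time_value TEXT, value REAL)"
)
conn.executemany(
    "INSERT INTO population VALUES (?, ?, ?)",
    [("A", "2020", 100.0), ("A", "2021", 105.0)],
)

print(answer("How many people lived in A in 2021?", conn))
```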

StatBot standardizes raw source data from data publishers on these levels:

- Time and spatial units – All data in all datasets share a common definition of spatial units and time.
- Value encoding – All values are encoded in a standardized way. For example, the values for the biological gender of a person are either 1 or 2, and all totals are indicated with -111.
- Variable naming – All variables are named in a standardized way. For example, all variables that indicate the biological gender of a person are named gender.
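
As a rough illustration of the value encoding and variable naming rules above, the following sketch standardizes a single record. The source column names, the source labels and the function `standardize_row` are hypothetical. The README only states that gender is encoded as 1 or 2 and totals as -111; which label maps to 1 and which to 2 is an assumption made here for the example.

```python
TOTAL_CODE = -111  # from the README: all totals are indicated with -111

# Hypothetical source column names that all mean "gender".
GENDER_ALIASES = {"sex", "geschlecht", "gender"}

# Hypothetical source labels. Which label gets 1 and which gets 2 is an
# assumption; the README only says the values are either 1 or 2.
GENDER_CODES = {"male": 1, "female": 2, "total": TOTAL_CODE}


def standardize_row(row: dict) -> dict:
    """Rename gender-like columns to 'gender' and encode their values.

    All other columns are passed through unchanged. An unknown gender
    label raises KeyError instead of being silently guessed.
    """
    out = {}
    for name, value in row.items():
        if name.strip().lower() in GENDER_ALIASES:
            label = str(value).strip().lower()
            out["gender"] = GENDER_CODES[label]
        else:
            out[name] = value
    return out


print(standardize_row({"Geschlecht": "Total", "value": 42}))
```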


Datasets in the StatBot warehouse:

- are tidy data
- contain a spatial unit (named spatialunit_ontology, with mappings defined in spatialunits.parquet)
- contain a time value (named time_value following the ISO Standard and eCH)
- and contain a value
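
The requirements listed above could be checked with a small validator along these lines. This is a minimal sketch, not part of the repo: the function names are hypothetical, the value column is assumed to be named `value`, and the duplicate check is only a rough proxy for tidiness (one observation per row).

```python
from collections import Counter

# Column names from the README; "value" as the column name is an assumption.
REQUIRED_COLUMNS = ("spatialunit_ontology", "time_value", "value")


def missing_columns(rows: list) -> list:
    """Return the required columns that are absent from at least one row."""
    return [c for c in REQUIRED_COLUMNS if any(c not in row for row in rows)]


def duplicate_keys(rows: list) -> list:
    """Return combinations of non-value columns that occur more than once.

    In a tidy dataset each observation should appear in exactly one row,
    so repeated key combinations hint at a problem.
    """
    counts = Counter(
        tuple(sorted((k, v) for k, v in row.items() if k != "value"))
        for row in rows
    )
    return [dict(key) for key, n in counts.items() if n > 1]


# Made-up rows for illustration only.
rows = [
    {"spatialunit_ontology": "A", "time_value": "2021", "value": 1},
    {"spatialunit_ontology": "A", "time_value": "2021", "value": 2},
]
print(missing_columns(rows))
print(duplicate_keys(rows))
```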
![](StatBot_02.png)