Data Package as a Service

Make open data small, self-published, and actionable

Data as a (tiny) Service

This is a template repository which lets you create a quick API around your Frictionless Data Package. This could be useful in several ways: as a microservice for your SPA frontend, for integration with Web-based workflows, for paginated access to larger datasets, or for setting up a cheap and simple Data-as-a-Service offering.

The open source code is based on Python and Pandas, and can be easily extended to fit the needs of your data science project.

Getting started

Place a datapackage.json file and a data folder containing your own data here to start setting up an API.

If you have not used Data Packages before, an easy way to get started is to convert your dataset to a CSV file (or a set of CSV files) in UTF-8 format, which you can create with any spreadsheet program. Then generate a Data Package with the Data Package CLI, or with the Create Frictionless Data tool: click the "Load" button, add and define the columns and metadata, then "Download" and place the resulting files here. Visit frictionlessdata.io for more advice on this.
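
For illustration, a minimal datapackage.json describing a single CSV resource might look like the sketch below; the dataset, resource and field names are placeholders, chosen to match the "tree" example used further down:

{
  "name": "my-dataset",
  "resources": [
    {
      "name": "tree",
      "path": "data/tree.csv",
      "schema": {
        "fields": [
          {"name": "quartier", "type": "string"},
          {"name": "baum_id", "type": "integer"}
        ]
      }
    }
  ]
}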

Installation

This repository contains a minimalist backend service API based on the Falcon framework and Pandas DataPackage Reader. To run:

cd api
virtualenv env
. env/bin/activate
pip install -Ur requirements.txt
python server.py

(Alternatively: use Pipenv and run pipenv install && pipenv run python server.py)

At this point you should see the message "Serving on port..."
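
For orientation, here is a minimal sketch of how such a Falcon service could look. The actual server.py in this repository is more complete; this assumes a Data Package with a single resource named "tree", and all names are illustrative:

import falcon
from wsgiref.simple_server import make_server
from pandas_datapackage_reader import read_datapackage

# Assumes a Data Package with a single resource; with several
# resources, read_datapackage returns a dict of DataFrames instead.
data = read_datapackage("datapackage.json")

class ResourceAPI:
    def __init__(self, df):
        self.df = df

    def on_get(self, req, resp):
        df = self.df
        # Treat any query parameter that matches a column as a search filter
        for col, value in req.params.items():
            if col in df.columns:
                df = df[df[col].astype(str).str.contains(str(value))]
        # Paginate the results with the page / per_page parameters
        page = req.get_param_as_int("page", default=1)
        per_page = req.get_param_as_int("per_page", default=10)
        start = (page - 1) * per_page
        resp.media = df.iloc[start:start + per_page].to_dict(orient="records")

app = falcon.App()  # Falcon 3.x; older versions use falcon.API()
app.add_route("/tree", ResourceAPI(data))

with make_server("", 8000, app) as httpd:
    print("Serving on port 8000...")
    httpd.serve_forever()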

Soon there will be a webpage where you can test the API. Until then ...

Test the API using a REST client such as RESTer, with queries like:

http://localhost:8000/[my resource name]?[column]=[query]

For instance, if you have a Resource in your Data Package with the name "tree", which has a "quartier" column, you can search it for "Oerlikon" using:

http://localhost:8000/tree?quartier=Oerlikon

You can adjust the amount of output with the page and per_page parameters in your query.
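
For example, to fetch the second page of the query above, fifty rows at a time:

http://localhost:8000/tree?quartier=Oerlikon&page=2&per_page=50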

License

This project is licensed by its maintainers under the MIT License.

If you intend to use this data in a public or commercial product, please check the data sources themselves for any specific restrictions.


Open-data-by-default web applications such as Flask-based CKAN or dribdat (which runs this site), Django-based SDPP, search engines like OpenSearch, etc., offer full-text search of their content and other APIs as a standard feature. But for quickly sharing single datasets, or for developing 'single page applications' (SPAs) and visualizations, a large backend application like these may be excessive.

Rationale

I support Portal.js and Livemark, which accomplish this very well, but I sometimes want something even simpler and more integrated with my data stack of Python and Pandas. There are portal previews, linked data endpoints, and wonderful tools like Datasette for diving into a resource, but these might not be ideal for tinkering with data in code. Providing data services through a statically-generated site like JKAN or Datacentral is another cheap and cheerful option. Or you may already be working on a data science notebook in Jupyter, R Shiny or Observable, but have trouble preparing your data on your own.

While working with Frictionless Data (a global initiative to improve the way quality open data is crowdsourced), I often wished that there was a way to put a quick API around a Data Package. On top of it, a user interface, or a data science notebook, or a natural language interface could be built. The proposed project DaatS is a service in Python and Pandas which instantly turns a Data Package into an API.

The idea of connecting this project to workflows is to treat the API-fication of a dataset as a data transformation step: something a user might want to add to their data collection with a couple of clicks, in order to benefit from Frictionless Data tools and other components of the open data ecosystem.

Example

You can see the idea in action, combined with Wikidata, as part of a recent project: Living Herbarium (GLAMhack 2022). Another example, involving data scraping automation, is Baumkataster.


{ hacknight challenges }

Create a Data Package. It might be your first or your 99th. It is easy and fun to scrape some data off the web and put some shiny wrapping and "nutritional" guidance around it. Ask @loleg if you need some help here, or see this or this detailed guide.

Use the DaatS template to create a repository with boilerplate code on your GitHub account, or just download the repository to your local machine. Follow the README to install the packages, and drop in your datapackage.json and CSV dataset. Use your browser or an API testing tool to run some queries, and you should see it paginating and searching your data.

Write a converter to patch your DaatS service into a no-code workflow. This could be a Proxeus node, a Node-RED flow, an Airtable or Slack workflow step, a GitHub Action, etc.: whatever would scratch your own itch. Make it super easy for users to connect a dataset and invoke search queries, or even statistical / data science functions, embedded in their process. See the sketch below for a starting point.
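
For instance, a tiny Python wrapper that a workflow step or custom node could call. This is a sketch; daats_search is a hypothetical helper, not part of the repository:

import requests

# Hypothetical helper wrapping a running DaatS endpoint, so that a
# workflow step only has to call one function
def daats_search(resource, base_url="http://localhost:8000", **filters):
    response = requests.get(f"{base_url}/{resource}", params=filters)
    response.raise_for_status()
    return response.json()

# e.g. fetch the trees of the Oerlikon quartier, ten rows per page
rows = daats_search("tree", quartier="Oerlikon", per_page=10)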
