BigCode

"A boost in performance, that's kind of like hiring 33% more coders"

The next generation of programmers will have new tools for improving transparency about where code snippets and generated code come from. Leandro von Werra (machine learning engineer at Hugging Face) presented the BigCode project at DINAcon 2022: a research collaboration inviting us to pick up the tools, use the data, and become more conscientious about how we license and reuse open source code.

Photo of @lvwerra by Oleg Lavrovsky - CC BY 4.0

{ hacknight challenges }

Use Am I in The Stack? to see if your code is included in the project, and follow @BigCodeProject to stay up to date with developments. There is not yet a user-facing tool available, but stay tuned!

Learn to work with Hugging Face, which hosts projects like The Stack (3 TB of permissively licensed source code that is the basis for BigCode's model weights and datasets): take the official course online. Example notebooks can be found on GitHub. If you manage to crunch some of this data, drop a link to your notebook on forum.opendata.ch.
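As a starting point for crunching the data, here is a minimal sketch of streaming a slice of The Stack with the Hugging Face `datasets` library instead of downloading all 3 TB. It assumes the dataset id `bigcode/the-stack`, per-language subfolders such as `data/python`, and a `content` field holding the file text; it also assumes you have accepted the dataset's terms on the Hub and are logged in. The helper names `stream_the_stack` and `preview` are made up for this example.

```python
from itertools import islice


def stream_the_stack(language="python"):
    """Return a streaming iterator over one language subset of The Stack.

    Assumption: the dataset is gated, so this needs a logged-in Hugging Face
    account that has accepted the terms of use.
    """
    # Imported lazily so the preview helper below works without `datasets`.
    from datasets import load_dataset

    return load_dataset(
        "bigcode/the-stack",
        data_dir=f"data/{language}",  # assumed layout: one folder per language
        split="train",
        streaming=True,  # iterate over records without a full download
    )


def preview(records, n=3, width=80):
    """Return the first `width` characters of the first `n` source files."""
    # Assumption: each record stores the file text under the "content" key.
    return [record["content"][:width] for record in islice(records, n)]
```

For example, `preview(stream_the_stack("python"))` would fetch only the first few records, which is enough to check that access works before starting a longer notebook.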

Explore the BigCode Dataset and contribute to some of the engineering, ethical and legal issues being worked on. Do you have a relevant professional background? BigCode is a research collaboration that you can apply to join here.

BigCode Dataset

This repository gathers all the code used to build the BigCode datasets, such as The Stack, as well as the preprocessing needed for model training.

language_selection: notebooks and a file mapping languages to file extensions, used to build The Stack v1.1.
pii: code for running PII detection and anonymization on code datasets.
preprocessing: code for filtering code datasets based on:
- line length and percentage of alphanumeric characters.
- number of stars.
- comments to code ratio.
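To make the filtering criteria above more concrete, the following is an illustrative sketch of how line-length, alphanumeric-fraction and comment-ratio checks might look. It is not the repository's actual code: the function names, the thresholds and the `#`-only comment detection are all assumptions chosen for the example, and the star-count filter is left out because it depends on repository metadata rather than file content.

```python
def line_stats(source):
    """Return (longest line length, mean line length) for a source file."""
    lines = source.splitlines() or [""]
    lengths = [len(line) for line in lines]
    return max(lengths), sum(lengths) / len(lengths)


def alnum_fraction(source):
    """Fraction of characters that are letters or digits."""
    if not source:
        return 0.0
    return sum(ch.isalnum() for ch in source) / len(source)


def comment_ratio(source, prefix="#"):
    """Share of non-empty lines that start with the comment prefix.

    Simplification: only whole-line comments with a single prefix are counted.
    """
    lines = [line.strip() for line in source.splitlines() if line.strip()]
    if not lines:
        return 0.0
    return sum(line.startswith(prefix) for line in lines) / len(lines)


def keep_file(source, max_line_len=1000, max_mean_len=100,
              min_alnum=0.25, max_comment=0.8):
    """Apply all three checks; every threshold here is a hypothetical value."""
    longest, mean = line_stats(source)
    if longest > max_line_len or mean > max_mean_len:
        return False  # likely minified or generated code
    if alnum_fraction(source) < min_alnum:
        return False  # mostly symbols or whitespace, e.g. data dumps
    if comment_ratio(source) > max_comment:
        return False  # almost entirely comments
    return True
```

A short, ordinary function passes all checks, while a single 5000-character line or a file made only of punctuation is rejected.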

All attendees, sponsors, partners, volunteers and staff at our hackathon are required to agree with the Hack Code of Conduct. Organisers will enforce this code throughout the event. We expect cooperation from all participants to ensure a safe environment for everybody.

The contents of this website, unless otherwise stated, are licensed under a Creative Commons Attribution 4.0 International License.

HACKnight 2022