BigCode
"A boost in performance, that's kind of like hiring 33% more coders"
The next generation of programmers will have new tools for improving the transparency of where code snippets and generated code are coming from. Leandro von Werra (machine learning engineer at Hugging Face) presented the BigCode project at DINAcon 2022: a research collaboration inviting us to pick up the tools, use the data, and become more conscientious about how we license and reuse open source code.
Photo of @lvwerra by Oleg Lavrovsky - CC BY 4.0
{ hacknight challenges }
Use Am I in The Stack? to see if your code is included in the project, and follow @BigCodeProject to stay up to date with developments. There is not yet a user-facing tool available, but stay tuned!
Learn to work with Hugging Face, where projects like The Stack (3 TB of permissively licensed source code that is the basis for BigCode's model weights and datasets) can be found: take the official course online. Example notebooks can be found on GitHub. If you manage to crunch some of this data, drop a link to your notebook on forum.opendata.ch.
Explore the BigCode Dataset and contribute to some of the engineering, ethical and legal issues being worked on. Do you have a relevant professional background? BigCode is a research collaboration that you can apply to join here.
BigCode Dataset
This repository gathers all the code used to build the BigCode datasets, such as The Stack, as well as the preprocessing necessary for model training.
Contents
language_selection: notebooks and file with language to file extensions mapping used to build the Stack v1.1.
pii: code for running PII detection and anonymization on code datasets.
preprocessing: code for filtering code datasets based on:
- line length and percentage of alphanumeric characters.
- number of stars.
- comments to code ratio.
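
To make the preprocessing criteria listed above more concrete, here is a minimal sketch of how such filters could be combined for a single source file. It is not the code from the BigCode repository: the names `FilterThresholds`, `line_stats`, `alphanum_fraction`, `comment_ratio` and `keep_file` are hypothetical, every threshold value is an illustrative assumption, and comments are detected only as whole lines starting with `#`.

```python
from dataclasses import dataclass
from typing import Optional, Tuple


@dataclass(frozen=True)
class FilterThresholds:
    # All values are illustrative assumptions, not the BigCode settings.
    max_line_length: int = 1000
    max_mean_line_length: float = 100.0
    min_alphanum_fraction: float = 0.25
    min_stars: int = 0
    min_comment_ratio: float = 0.01
    max_comment_ratio: float = 0.8


def line_stats(text: str) -> Tuple[int, float]:
    """Return (longest line length, mean line length) for a file's text."""
    lengths = [len(line) for line in text.splitlines()] or [0]
    return max(lengths), sum(lengths) / len(lengths)


def alphanum_fraction(text: str) -> float:
    """Fraction of all characters that are letters or digits."""
    if not text:
        return 0.0
    return sum(ch.isalnum() for ch in text) / len(text)


def comment_ratio(text: str, marker: str = "#") -> float:
    """Fraction of non-blank lines that start with the comment marker.

    Simplification: inline and block comments are not counted.
    """
    lines = [line.strip() for line in text.splitlines() if line.strip()]
    if not lines:
        return 0.0
    return sum(line.startswith(marker) for line in lines) / len(lines)


def keep_file(text: str, stars: int,
              thresholds: Optional[FilterThresholds] = None) -> bool:
    """Return True if the file passes every filter."""
    t = thresholds or FilterThresholds()
    longest, mean = line_stats(text)
    return (
        longest <= t.max_line_length
        and mean <= t.max_mean_line_length
        and alphanum_fraction(text) >= t.min_alphanum_fraction
        and stars >= t.min_stars
        and t.min_comment_ratio <= comment_ratio(text) <= t.max_comment_ratio
    )
```

With these assumed thresholds, a short commented function passes, while a minified one-line file is rejected for its line length. A real pipeline would tune the thresholds per language and use a proper comment parser.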