BigCode

"A boost in performance, that's kind of like hiring 33% more coders"

BigCode Dataset

This repository gathers all the code used to build the BigCode datasets, such as The Stack, as well as the preprocessing needed for model training.

Contents

  • language_selection: notebooks and a file mapping languages to file extensions, used to build The Stack v1.1.
  • pii: code for running PII detection and anonymization on code datasets.
  • preprocessing: code for filtering code datasets based on:
    • line length and percentage of alphanumeric characters.
    • number of stars.
    • comments to code ratio.
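
To make the filtering criteria above concrete, here is a minimal sketch of how the three kinds of filters could be combined. It is not the repository's implementation: the names `FilterConfig`, `line_stats`, `alnum_fraction`, `comment_fraction` and `keep_file` are hypothetical, every threshold is an illustrative assumption, and the comment measure is deliberately naive (it only counts line comments starting with `#`).

```python
from dataclasses import dataclass
from typing import Optional, Tuple


@dataclass
class FilterConfig:
    # All thresholds are illustrative assumptions, not the repository's values.
    max_line_length: int = 1000
    max_mean_line_length: float = 100.0
    min_alnum_fraction: float = 0.25
    min_stars: int = 0
    min_comment_fraction: float = 0.01
    max_comment_fraction: float = 0.8


def line_stats(content: str) -> Tuple[int, float]:
    """Return (longest line length, mean line length) of a file."""
    lengths = [len(line) for line in content.splitlines()] or [0]
    return max(lengths), sum(lengths) / len(lengths)


def alnum_fraction(content: str) -> float:
    """Fraction of characters that are alphanumeric."""
    if not content:
        return 0.0
    return sum(ch.isalnum() for ch in content) / len(content)


def comment_fraction(content: str, prefix: str = "#") -> float:
    """Fraction of non-blank lines that start with a line-comment prefix."""
    lines = [line.strip() for line in content.splitlines() if line.strip()]
    if not lines:
        return 0.0
    return sum(line.startswith(prefix) for line in lines) / len(lines)


def keep_file(content: str, stars: int, cfg: Optional[FilterConfig] = None) -> bool:
    """Apply the basic, stars and comments filters in turn."""
    cfg = cfg or FilterConfig()
    longest, mean = line_stats(content)
    if longest > cfg.max_line_length or mean > cfg.max_mean_line_length:
        return False
    if alnum_fraction(content) < cfg.min_alnum_fraction:
        return False
    if stars < cfg.min_stars:
        return False
    fraction = comment_fraction(content)
    return cfg.min_comment_fraction <= fraction <= cfg.max_comment_fraction


sample = "# add two numbers\ndef add(a, b):\n    return a + b\n"
print(keep_file(sample, stars=5))
```

Keeping the thresholds in one config object makes it easy to try stricter or looser settings on the same data without touching the filter logic.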

The next generation of programmers will have new tools for improving the transparency of where code snippets and generated code come from. Leandro von Werra (machine learning engineer at Hugging Face) presented the BigCode project at DINAcon 2022: a research collaboration inviting us to pick up the tools, use the data, and become more conscientious about how we license and reuse open source code.

 BigCode at DINAcon 2022

Photo of @lvwerra by Oleg Lavrovsky - CC BY 4.0


{ hacknight challenges }

Use Am I in The Stack? to see if your code is included in the project, and follow @BigCodeProject to stay up to date with developments. There is not yet a user-facing tool available, but stay tuned!

Learn to work with Hugging Face, where projects like The Stack (3 TB of permissively licensed source code that is the basis for BigCode's model weights and datasets) can be found: take the official course online. Example notebooks can be found on GitHub. If you manage to crunch some of this data, drop a link to your notebook on forum.opendata.ch.
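
As a possible starting point for crunching the data, the sketch below streams a slice of The Stack with the `datasets` library instead of downloading all of it. Several details are assumptions to check against the dataset card: the dataset id `bigcode/the-stack`, the `data/python` subdirectory, and the `content` field name. Access may also require logging in to the Hugging Face Hub and accepting the dataset's terms. The helper names `stream_the_stack` and `summarize` are hypothetical.

```python
from itertools import islice
from typing import Any, Dict, Iterable, Iterator, Tuple


def stream_the_stack(data_dir: str = "data/python") -> Iterable[Dict[str, Any]]:
    """Open The Stack in streaming mode (needs network access and Hub login)."""
    # Imported lazily so the rest of this sketch runs without the library.
    from datasets import load_dataset  # pip install datasets

    # Dataset id and data_dir layout are assumptions; see the dataset card.
    return load_dataset(
        "bigcode/the-stack", data_dir=data_dir, split="train", streaming=True
    )


def summarize(
    rows: Iterable[Dict[str, Any]], limit: int = 5, field: str = "content"
) -> Iterator[Tuple[int, int]]:
    """Yield (row index, character count) for the first `limit` rows."""
    for index, row in enumerate(islice(rows, limit)):
        yield index, len(row[field])


# Offline demonstration with stand-in rows; replace with stream_the_stack()
# once you have access to the dataset.
fake_rows = [{"content": "print(1)"}, {"content": ""}, {"content": "abc"}]
for index, size in summarize(fake_rows, limit=2):
    print(index, size)
```

Streaming keeps memory and disk use small, which matters for a dataset of this size; `islice` stops after a handful of rows so a first experiment finishes quickly.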

Explore the BigCode Dataset and contribute to some of the engineering, ethical and legal issues being worked on. Do you have a relevant professional background? BigCode is a research collaboration that you can apply to join here.

Event finished

30.11.2022 14:00

fix indentation (@loubnabnl)

enable dataset saving (@loubnabnl)

Merge pull request #25 from bigcode-project/loubnabnl-patch-2

Update README.md (@loubnabnl)

Update README.md (@loubnabnl)

Merge pull request #24 from loubnabnl/filtering-script

Add filtering script with options: basic, stars, comments (@loubnabnl)

add details about comment to code filter (@loubnabnl)

rename args class (@loubnabnl)

Create LICENSE (@loubnabnl)

update readme (@loubnabnl)

update requirements (@loubnabnl)

add stars and comments filtering and logging mechanism (@loubnabnl)

add args and utils (@loubnabnl)

Event started

23.11.2022 13:00

Ask

23.11.2022 12:40

Repository updated

23.11.2022 12:40 ~ loleg

Challenge posted

23.11.2022 12:40 ~ loleg
 
All attendees, sponsors, partners, volunteers and staff at our hackathon are required to agree with the Hack Code of Conduct. Organisers will enforce this code throughout the event. We expect cooperation from all participants to ensure a safe environment for everybody.

The contents of this website, unless otherwise stated, are licensed under a Creative Commons Attribution 4.0 International License.

HACKnight 2022