Show HN: A High-Quality Chinese Internet Language Dataset for AI

3 years ago

Hello HN,

We at BAAI are proud to release the Chinese Corpora Internet (CCI) dataset v1.0.0.

CCI's standout features:

- Extensive Data: CCI provides 104GB of text and is among the highest-quality Chinese language datasets available.

- Broad Temporal Range: Spanning 2001 to 2023, CCI captures more than two decades of linguistic change and trends.

- Trusted Data Sources: All data is sourced from reputable Chinese internet platforms, ensuring authenticity and reducing noise.

- Stringent Quality Control: CCI has been processed with careful attention to detail at every stage, from cleaning to deduplication.

- Ethical Standards: CCI adheres to strict ethical guidelines, with sensitive content being carefully filtered out to ensure the dataset's safety and reliability.

- Anti-Cheating Measures: We proactively filtered benchmark datasets out of the corpus, so that models trained on CCI have not seen the test data and can be evaluated fairly and accurately.
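For readers curious what deduplication and benchmark filtering can look like in practice, here is a minimal sketch. It is not BAAI's actual pipeline, which this post does not describe. It assumes exact-match deduplication on normalized text and character n-gram overlap for decontamination; the function names and the default `n=13` are hypothetical choices made for illustration.

```python
import hashlib
import re
import unicodedata


def normalize(text: str) -> str:
    """NFKC-normalize, lowercase, and remove all whitespace."""
    text = unicodedata.normalize("NFKC", text).lower()
    return re.sub(r"\s+", "", text)


def dedupe(docs):
    """Drop documents whose normalized text was already seen; keep the first copy."""
    seen = set()
    kept = []
    for doc in docs:
        digest = hashlib.sha256(normalize(doc).encode("utf-8")).hexdigest()
        if digest not in seen:
            seen.add(digest)
            kept.append(doc)
    return kept


def char_ngrams(text: str, n: int) -> set:
    """Character n-grams of the normalized text (empty if the text is shorter than n)."""
    text = normalize(text)
    return {text[i:i + n] for i in range(len(text) - n + 1)}


def decontaminate(docs, benchmark_items, n=13):
    """Drop any document that shares at least one character n-gram with a benchmark item."""
    banned = set()
    for item in benchmark_items:
        banned |= char_ngrams(item, n)
    return [doc for doc in docs if banned.isdisjoint(char_ngrams(doc, n))]
```

Character n-grams are used here because Chinese text has no whitespace word boundaries, so this avoids depending on a word segmenter. A real pipeline would likely add near-duplicate detection as well, which this sketch omits.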

Get started with CCI:

- BAAI Open Data Repository: https://data.baai.ac.cn/details/BAAI-CCI

- HuggingFace: https://huggingface.co/datasets/BAAI/CCI-Data
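If you want to peek at the data without downloading all 104GB, something like the following should work. This is a sketch under assumptions: the repository id is taken from the HuggingFace URL above, the split name `"train"` is a guess, and `open_cci_stream` and `preview` are hypothetical helper names. It relies on the `datasets` library's streaming mode and makes no assumption about the record fields.

```python
from itertools import islice


def open_cci_stream(repo_id="BAAI/CCI-Data", split="train"):
    """Return a lazily streamed view of the dataset (requires network access)."""
    # Imported here so the rest of the module works without the library installed.
    from datasets import load_dataset  # pip install datasets

    # The split name is an assumption; check the dataset card if this fails.
    return load_dataset(repo_id, split=split, streaming=True)


def preview(records, n=3, max_chars=80):
    """Return the first n records, truncating long string fields for display."""
    out = []
    for record in islice(records, n):
        out.append(
            {k: (v[:max_chars] if isinstance(v, str) else v) for k, v in record.items()}
        )
    return out
```

Usage would be along the lines of `for row in preview(open_cci_stream()): print(row)`, which pulls only the first few records over the network.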

We're looking forward to the community's feedback and the advancements in AI that CCI will facilitate. Your engagement is key to the success of this initiative.
