We at BAAI are proud to release the Chinese Corpora Internet (CCI) dataset v1.0.0.
CCI's standout features:
- Extensive Data: With 104GB of rich linguistic data, CCI is one of the highest-quality Chinese language datasets available.
- Broad Temporal Range: Spanning from 2001 to 2023, CCI captures a wide array of linguistic evolution and trends.
- Trusted Data Sources: All data is sourced from reputable Chinese internet platforms, ensuring authenticity and reducing noise.
- Stringent Quality Control: From cleaning to deduplication, CCI has been processed with meticulous attention to detail to maintain data excellence.
- Ethical Standards: CCI adheres to strict ethical guidelines, with sensitive content being carefully filtered out to ensure the dataset's safety and reliability.
- Anti-Cheating Measures: We've taken proactive steps to filter out benchmark datasets, ensuring that AI models trained on CCI are evaluated fairly and accurately.
Get started with CCI:
- BAAI Open Data Repository: https://data.baai.ac.cn/details/BAAI-CCI
- HuggingFace: https://huggingface.co/datasets/BAAI/CCI-Data
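For a quick look at the data without downloading all 104GB, something like the following sketch may help. It assumes the HuggingFace `datasets` library is installed and that the repository exposes a `train` split; the split name and the helper `peek_cci` are assumptions for illustration, not something documented in this announcement, so check the dataset card if the call fails.

```python
from itertools import islice

# Repository ID taken from the HuggingFace URL above.
REPO_ID = "BAAI/CCI-Data"


def peek_cci(n=3, split="train"):
    """Stream the first n records instead of downloading the whole corpus.

    `peek_cci` is a hypothetical helper; the default split name "train"
    is an assumption and may differ on the actual dataset card.
    """
    # Imported lazily so this module loads even if `datasets` is missing.
    from datasets import load_dataset

    # streaming=True reads records from the remote files on demand.
    stream = load_dataset(REPO_ID, split=split, streaming=True)
    return list(islice(stream, n))


if __name__ == "__main__":
    for record in peek_cci():
        # Field names are not listed in this post, so just print them.
        print(sorted(record.keys()))
```

Streaming is used here because of the dataset's size; dropping `streaming=True` would download and cache the full corpus locally.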
We're looking forward to the community's feedback and the advancements in AI that CCI will facilitate. Your engagement is key to the success of this initiative.