The GitHub archive events are published in hourly `json.gz` archives. For example, https://data.gharchive.org/2025-01-21-15.json.gz contains all the events for the [15:00, 16:00) time range on January 21, 2025. Every event is published as a JSON line containing various event fields. This format fits the VictoriaLogs data model [5] well, so GitHub events can be ingested into VictoriaLogs with the following command:
curl -s https://data.gharchive.org/2025-01-21-15.json.gz \
| curl -T - -X POST -H 'Content-Encoding: gzip' 'http://localhost:9428/insert/jsonline?_time_field=created_at&_stream_fields=type'
This command streams the `json.gz` event data directly into the VictoriaLogs data ingestion endpoint [6], without any intermediate transformations.
So I started using GitHub archive events as test data during VictoriaLogs development, and I regularly query this data for insights. Today I discovered some "interesting" repositories on GitHub that receive thousands of commits per day, with each repository's commits generated by a single GitHub user. For example, the https://github.com/frdpzk2/ppub repository has more than 6 million commits, and this number increases by 35000 commits per day.
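The list below covers a full day, which spans 24 hourly archives. Here is a minimal sketch of loading them all in one go, assuming GH Archive's hour suffixes are not zero-padded (`-0` through `-23`) and reusing the same local endpoint as the single-hour command. The names `gharchive_urls` and `ingest_day` are hypothetical helpers, not part of any tool:

```shell
#!/bin/sh
# Print the 24 hourly GH Archive URLs for a given day (YYYY-MM-DD).
# Assumption: hour suffixes are not zero-padded (-0.json.gz ... -23.json.gz).
gharchive_urls() {
  day="$1"
  hour=0
  while [ "$hour" -le 23 ]; do
    echo "https://data.gharchive.org/${day}-${hour}.json.gz"
    hour=$((hour + 1))
  done
}

# Stream every hourly archive of the day into a local VictoriaLogs instance,
# with the same endpoint and query args as the single-hour command.
ingest_day() {
  gharchive_urls "$1" | while read -r url; do
    curl -sf "$url" \
      | curl -s -T - -X POST -H 'Content-Encoding: gzip' \
          'http://localhost:9428/insert/jsonline?_time_field=created_at&_stream_fields=type'
  done
}
```

Calling `ingest_day 2025-01-21` would then ingest the whole day, one archive at a time.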
Below is the list of GitHub repositories that got more than 10K commits on a single day, January 21, 2025:
pushes=28263 repo_url=https://github.com/frdpzk2/ppub
pushes=24714 repo_url=https://github.com/freeukapp/uk
pushes=23598 repo_url=https://github.com/CelestiaNFT/Welcome-NFT
pushes=17815 repo_url=https://github.com/freefastconnect/fastconnect
pushes=15854 repo_url=https://github.com/iniadittt/iniadittt
pushes=13000 repo_url=https://github.com/adi224foreverg/globaldl
pushes=12364 repo_url=https://github.com/frdpzk3/ppub
pushes=11670 repo_url=https://github.com/rmousavi-raspberry/raspberrypi
pushes=11221 repo_url=https://github.com/brand22/d3
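A list like the one above could be produced with a LogsQL query along the following lines. This is only a sketch and not necessarily the query I ran: it assumes that `pushes` counts `PushEvent` entries, that repositories are identified by the flattened `repo.name` field (so it does not rebuild the `repo_url` values shown above), and that VictoriaLogs listens on `localhost:9428`. The names `top_pushers_query` and `run_top_pushers` are hypothetical helpers:

```shell
#!/bin/sh
# Build a LogsQL query that counts PushEvent entries per repository on one day
# and keeps only repositories above the given threshold.
# Assumption: the repository is identified by the flattened `repo.name` field.
top_pushers_query() {
  day="$1"
  min_pushes="$2"
  printf '_time:%s type:="PushEvent" | stats by (repo.name) count() as pushes | filter pushes:>%s | sort by (pushes desc)' \
    "$day" "$min_pushes"
}

# Send the query to a local VictoriaLogs instance via its HTTP querying API.
# The response is a stream of JSON lines, one per repository.
run_top_pushers() {
  curl -s http://localhost:9428/select/logsql/query \
    --data-urlencode "query=$(top_pushers_query "$1" "$2")"
}
```

For example, `run_top_pushers 2025-01-21 10000` asks for repositories with more than 10000 push events on January 21, 2025.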
You can investigate the GitHub archive data yourself in the VictoriaLogs playground [7].
[1] https://docs.victoriametrics.com/victorialogs/
[2] https://jeremymorrell.dev/blog/a-practitioners-guide-to-wide-events/
[3] https://www.gharchive.org/
[4] https://docs.github.com/en/rest/using-the-rest-api/github-event-types
[5] https://docs.victoriametrics.com/victorialogs/keyconcepts/
[6] https://docs.victoriametrics.com/victorialogs/data-ingestion/#json-stream-api
[7] https://tinyurl.com/ymn8sre8