The Pile

    An 800GB Dataset of Diverse Text for Language Modeling

    What is the Pile?

    The Pile is an 825 GiB diverse, open-source language modeling dataset that combines 22 smaller, high-quality datasets.

    Download

    The Pile is hosted by the Eye.

    The Pile is distributed as jsonlines data compressed with zstandard.
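
    As a rough illustration of reading this format, the sketch below streams a zstandard-compressed jsonlines file in Python. It assumes the third-party `zstandard` package is installed; the helper names (`iter_jsonl`, `open_zst_text`, `read_shard`) are hypothetical and not part of any official Pile tooling.

```python
import io
import json
from typing import IO, Any, Dict, Iterable, Iterator


def iter_jsonl(lines: Iterable[str]) -> Iterator[Dict[str, Any]]:
    """Yield one parsed JSON object per non-empty line."""
    for line in lines:
        line = line.strip()
        if line:
            yield json.loads(line)


def open_zst_text(path: str) -> IO[str]:
    """Open a zstandard-compressed file as a UTF-8 text stream.

    Assumes the third-party `zstandard` package (pip install zstandard).
    """
    import zstandard  # imported lazily so iter_jsonl works without it

    raw = open(path, "rb")
    reader = zstandard.ZstdDecompressor().stream_reader(raw)
    return io.TextIOWrapper(reader, encoding="utf-8")


def read_shard(path: str) -> Iterator[Dict[str, Any]]:
    """Stream records from one compressed jsonlines file without loading it whole."""
    with open_zst_text(path) as stream:
        yield from iter_jsonl(stream)
```

    Calling `read_shard` on a downloaded file yields one dictionary per line. Streaming keeps memory use flat, which matters for files of this size. The keys inside each record are not described on this page, so inspect the first record before relying on any field name.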

    Have a model that uses or evaluates on the Pile? Let us know!

    Why is the Pile a good training set?

    Recent work has shown that, especially for large models, diversity in data sources improves both general cross-domain knowledge and downstream generalization capability. In our evaluations, models trained on the Pile show moderate improvements on traditional language modeling benchmarks and significant improvements on Pile BPB.

    Why is the Pile a good benchmark?

    To score well on Pile BPB (bits per byte), a model must be able to understand many disparate domains, including books, GitHub repositories, webpages, chat logs, and medical, physics, math, computer science, and philosophy papers. Pile BPB is a measure of world knowledge and reasoning ability in these domains, making it a robust benchmark of general, cross-domain text modeling ability for large language models.
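
    This page does not spell out how BPB is computed, so the following is only a sketch of the usual definition: the model's total negative log-likelihood over a text, converted from nats to bits, divided by the number of UTF-8 bytes in that text. The function name `bits_per_byte` and the assumption that the loss is summed in nats are illustrative choices, not the official evaluation code.

```python
import math


def bits_per_byte(total_nll_nats: float, text: str) -> float:
    """Convert a summed negative log-likelihood (in nats) to bits per byte."""
    n_bytes = len(text.encode("utf-8"))
    if n_bytes == 0:
        raise ValueError("text must contain at least one byte")
    # Dividing by ln(2) converts nats to bits; dividing by the byte
    # count normalizes independently of the model's tokenizer.
    return total_nll_nats / (math.log(2) * n_bytes)
```

    Because the denominator counts bytes rather than tokens, models with different tokenizers can be compared on the same scale. Lower values are better.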

    Citing

    If you use the Pile or any of the components, please cite us!

    @article{pile,
      title={The {P}ile: An 800GB Dataset of Diverse Text for Language Modeling},
      author={Gao, Leo and Biderman, Stella and Black, Sid and Golding, Laurence and Hoppe, Travis and Foster, Charles and Phang, Jason and He, Horace and Thite, Anish and Nabeshima, Noa and Presser, Shawn and Leahy, Connor},
      journal={arXiv preprint arXiv:2101.00027},
      year={2020}
    }
                    

    Leaderboard

    * indicates potential test-set overlap. Zero-shot indicates that not all of the components of the Pile were present in the training data.

    Rank  Date         Model               Organization  Test BPB
    1.    Jan 1, 2021  GPT-3 (Zero-Shot)*  OpenAI        0.7177
    2.    Jan 1, 2021  GPT-2 (Zero-Shot)*  OpenAI        1.2253
