WIT: Wikipedia-based Image Text Dataset

The Wikipedia-based Image Text (WIT) Dataset is a large multimodal, multilingual dataset. WIT is composed of a curated set of 37.6 million entity-rich image-text examples with 11.5 million unique images across 108 Wikipedia languages. Its size enables WIT to be used as a pretraining dataset for multimodal machine learning models.

Key Advantages

A few unique advantages of WIT:

  • The largest multimodal dataset (publicly available at the time of this writing) by number of image-text examples.
  • A massively multilingual dataset, the first of its kind, covering 108 languages.
  • The first image-text dataset with page-level metadata and contextual information.
  • A diverse collection of concepts and real-world entities.
  • Challenging real-world test sets.

You can learn more about WIT Dataset from our arXiv paper.

Latest Updates

2021 April: We are happy to share that our paper was accepted at the SIGIR conference. From the ACM site, you can find our paper, slides and presentation.

2021 September: The WIT Image-Text Competition is live on Kaggle. Our collaborators from Wikimedia Research blogged about this and have made available the raw pixels and ResNet-50 embeddings for the images in this set. Here is our Google AI blog post.

2022 April: We are happy to share that the WIT paper and dataset were awarded the Wikimedia Foundation's Research Award of the Year (tweet 1, tweet 2). We are deeply honored and thank you for the recognition.

2022 May: We have released the WIT validation set and test set. Please see the data page for download links.

2022 Oct: Our Authoring Tools for Multimedia Content proposal was accepted at TREC 2023.

2023 Apr: AToMiC accepted at SIGIR 2023.

2023 Apr: WikiWeb2M Dataset released.

2023 May: Accepted submissions at WikiWorkshop 2023.

  • WikiWeb2M: A Page-Level Multimodal Wikipedia Dataset (pdf, arXiv)
  • Building Authoring Tools for Multimedia Content with Human-in-the-loop Relevance Annotations (pdf)
  • Characterizing Image Accessibility on Wikipedia across Languages (pdf)

WIT Example

Wikipedia Page

For example, let's take the Wikipedia page for Half Dome in Yosemite, CA.

WIT Wikipedia Half Dome Image

From the Wikipedia page for Half Dome : Photo by DAVID ILIFF. License: CC BY-SA 3.0

Wikipedia Page with Annotations of what we can extract

From this page, we highlight the various key pieces of data that we can extract - images, their respective text snippets and some contextual metadata.

WIT Half Dome Page with Annotations

By extracting and filtering these carefully, we get a clean, high quality image-text example that can be used in multimodal modeling.
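To make the extraction concrete, here is a sketch of what one such image-text example might look like as a Python record, with a caption-presence filter of the kind the pipeline applies. The field names follow the column names described in the paper and on the data page, but the values here are illustrative placeholders, not actual dataset rows.

```python
# One WIT example pairs an image with several texts extracted from its
# Wikipedia page. Field names mirror the dataset's documented columns;
# the values below are illustrative only.
half_dome_example = {
    "language": "en",
    "page_url": "https://en.wikipedia.org/wiki/Half_Dome",
    "caption_reference_description": "Half Dome as viewed from Glacier Point",
    "caption_attribution_description": "Photo by DAVID ILIFF. License: CC BY-SA 3.0",
    "caption_alt_text_description": None,  # alt text is often absent
    "context_page_description": "Half Dome is a granite dome at the eastern end of Yosemite Valley.",
    "context_section_description": "Geology of the Half Dome formation.",
}

# A simple quality filter: keep examples with at least one caption field.
caption_fields = (
    "caption_reference_description",
    "caption_attribution_description",
    "caption_alt_text_description",
)
has_caption = any(half_dome_example[f] for f in caption_fields)
print(has_caption)  # True
```

The real pipeline applies many more filters (described in the paper); this only illustrates the shape of a single image-text set.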

Motivation

Multimodal visio-linguistic models rely on rich datasets to learn to model the relationship between images and texts. Large image-text datasets can significantly improve performance, as shown by recent works. Furthermore, the lack of language coverage in existing datasets (which are mostly English-only) also impedes research in the multilingual multimodal space – we consider this a lost opportunity given the potential shown in leveraging images (as a language-agnostic medium) to help improve multilingual textual understanding.

To address these challenges and advance research on multilingual, multimodal learning we created the Wikipedia-based Image Text (WIT) Dataset. WIT is created by extracting multiple different texts associated with an image (e.g., as shown in the above image) from Wikipedia articles and Wikimedia image links. This was accompanied by rigorous filtering to only retain high quality image-text sets.

The resulting dataset contains over 37.6 million image-text sets – making WIT the largest multimodal dataset (publicly available at the time of this writing) with unparalleled multilingual coverage – with 12K+ examples in each of 108 languages (53 languages have 100K+ image-text pairs).

WIT: Dataset Numbers

Type           Train    Val      Test     Total / Unique
Rows / Tuples  37.13M   261.8K   210.7K   37.6M
Unique Images  11.4M    58K      57K      11.5M
Ref. Text      16.9M    150K     104K     17.2M / 16.7M
Attr. Text     34.8M    193K     200K     35.2M / 10.9M
Alt Text       5.3M     29K      29K      5.4M / 5.3M
Context Texts  -        -        -        119.8M
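As a quick sanity check on the table, the per-split row counts should add up to the stated total:

```python
# Split sizes in millions, taken from the "Rows / Tuples" row above.
train, val, test = 37.13, 0.2618, 0.2107
total = train + val + test
print(round(total, 1))  # 37.6 (million image-text sets)
```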

WIT: Image-Text Stats by Language

Image-Text    # Langs   Uniq. Images   # Langs
total > 1M    9         images > 1M    6
total > 500K  10        images > 500K  12
total > 100K  36        images > 100K  35
total > 50K   15        images > 50K   17
total > 14K   38        images > 13K   38

Get WIT

We believe such a powerful and diverse dataset will aid researchers in building better multimodal, multilingual models and in identifying better learning and representation techniques, ultimately improving machine learning models on real-world visio-linguistic tasks.

WIT Dataset is now available for download. Please check the data page.
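The dataset is distributed as tab-separated files. A minimal sketch for streaming examples out of one shard might look like the following; the shard filename here is a hypothetical placeholder, since the actual file names and download links are listed on the data page, and the files may be gzip-compressed as assumed below.

```python
import csv
import gzip

# Hypothetical shard name; see the data page for the real download links.
SHARD = "wit_train_shard.tsv.gz"

def iter_wit_rows(path):
    """Yield each image-text example as a dict keyed by the TSV header row."""
    with gzip.open(path, "rt", encoding="utf-8", newline="") as f:
        reader = csv.DictReader(f, delimiter="\t")
        for row in reader:
            yield row

# Example usage: count English examples in one shard.
# n_en = sum(1 for row in iter_wit_rows(SHARD) if row.get("language") == "en")
```

Streaming with a generator avoids loading a multi-gigabyte shard into memory at once.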

Citing WIT

If you use the WIT dataset, you can cite our work as follows.

@inproceedings{10.1145/3404835.3463257,
author = {Srinivasan, Krishna and Raman, Karthik and Chen, Jiecao and Bendersky, Michael and Najork, Marc},
title = {WIT: Wikipedia-Based Image Text Dataset for Multimodal Multilingual Machine Learning},
year = {2021},
isbn = {9781450380379},
publisher = {Association for Computing Machinery},
address = {New York, NY, USA},
url = {https://doi.org/10.1145/3404835.3463257},
doi = {10.1145/3404835.3463257},
booktitle = {Proceedings of the 44th International ACM SIGIR Conference on Research and Development in Information Retrieval},
pages = {2443–2449},
numpages = {7},
keywords = {dataset, multimodal, machine learning, wikipedia, multilingual, image-text retrieval, neural networks},
location = {Virtual Event, Canada},
series = {SIGIR '21}
}

License

This data is available under the Creative Commons Attribution-ShareAlike 3.0 Unported license.

Projects using WIT

See MURAL (Multimodal, Multitask Retrieval Across Languages), a paper accepted at EMNLP 2021.

Contact

For any questions, please contact wit-dataset@google.com. For questions to the first author, Krishna, please see his personal page, krishna2.com, for contact information.

If the WIT dataset is useful to you, please do write to us about it. Be it a blog post, a research project or a paper, we are delighted to learn about it.
