The Latest Twenty Interesting Deep Learning Datasets

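Larger labeled datasets and more available computing power are the cornerstones of the AI revolution. In this article, I list some very fun deep learning datasets that we recently discovered for data scientists.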

1. EMNIST: An Extension of MNIST to Handwritten Letters

MNIST is a very popular dataset for people getting started with deep learning in particular and machine learning on images in general. It contains images of handwritten digits, each of which is to be mapped to the digit it shows. EMNIST extends this to images of letters as well. The dataset can be downloaded here. We also discovered an alternative dataset on Reddit. It’s called HASYv2 and can be downloaded here.
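To make the step from digits to letters concrete, here is a minimal sketch of how the label space grows when letters are added. The class ordering below (digits, then uppercase, then lowercase) is an assumption made for illustration only; the actual EMNIST splits define their own class mappings, so check the files that ship with the dataset before relying on any ordering.

```python
import string

# Hypothetical label vocabulary: 10 digits, then 26 uppercase and
# 26 lowercase letters. This ordering is an assumption, not the
# mapping shipped with EMNIST.
CLASSES = string.digits + string.ascii_uppercase + string.ascii_lowercase


def index_to_char(index: int) -> str:
    """Map a predicted class index back to the character it stands for."""
    return CLASSES[index]


def char_to_index(char: str) -> int:
    """Map a character to its class index in the hypothetical vocabulary."""
    return CLASSES.index(char)


# A digit-only task in the style of MNIST uses just the first ten classes.
digit_only = CLASSES[:10]
print(len(digit_only), len(CLASSES))
```

A classifier for the extended task would then need one output per entry in `CLASSES` instead of ten.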

2. HICO & HICO-DET

HICO contains images with multiple objects, and these objects have been tagged along with their relationships. The proposed problem is for algorithms trained on this dataset to pick out the objects in an image and the relationships between them. I expect multiple papers to come out of this dataset in the future.

3. CLEVR: A Diagnostic Dataset for Compositional Language and Elementary Visual Reasoning

CLEVR is an effort by the group of Fei-Fei Li, the scientist who developed the revolutionary ImageNet dataset. It has objects and questions asked about those objects, along with their answers specified by humans. The aim of the project is to develop machines with common sense about what they see. For example, the machine should be able to find “an odd one out” in an image automatically. You can download the dataset here.

4. HolStep: A Machine Learning Dataset for Higher-order Logic Theorem Proving

This dataset is tagged so that algorithms trained on it can be used for automatic theorem proving. The download link is here.

5. The Parallel Meaning Bank: Towards a Multilingual Corpus of Translations Annotated with Compositional Meaning Representations

The Parallel Meaning Bank (PMB), developed at the University of Groningen, comprises sentences and texts in raw and tokenised format, tags for part of speech, named entities and lexical categories, and formal meaning representations. The download link is here.

6. JFLEG: A Fluency Corpus and Benchmark for Grammatical Error Correction

The JFLEG dataset tags sentences with both nominal grammatical corrections and smart grammatical corrections. It aims to help build machines that can automatically correct grammar for people making mistakes. The dataset can be downloaded here.

7. Introducing VQA v2.0: A More Balanced and Bigger VQA Dataset!

This dataset has images, questions asked about them, and their tagged answers. The aim is to train machines to answer questions asked about images (and, by extension, about the real world they are seeing). Visual QA is an older dataset, but its 2.0 version came out just this December.
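The image, question, and answer structure described above can be sketched as a small record type with a simple scoring function. Everything here is illustrative: the `VQAExample` fields, the sample records, and the exact-match scoring are assumptions, not the dataset's actual file format or the benchmark's own evaluation protocol, which may differ.

```python
from dataclasses import dataclass
from typing import Callable, Sequence


@dataclass(frozen=True)
class VQAExample:
    """One hypothetical record: an image reference, a question, an answer."""
    image_id: str
    question: str
    answer: str


def normalize(text: str) -> str:
    """Lowercase and collapse whitespace so trivial differences do not count."""
    return " ".join(text.lower().split())


def exact_match_accuracy(
    examples: Sequence[VQAExample],
    predict: Callable[[str, str], str],
) -> float:
    """Fraction of examples where the prediction equals the tagged answer."""
    if not examples:
        return 0.0
    correct = sum(
        normalize(predict(ex.image_id, ex.question)) == normalize(ex.answer)
        for ex in examples
    )
    return correct / len(examples)


# Made-up records for illustration only.
examples = [
    VQAExample("img_001", "What color is the ball?", "red"),
    VQAExample("img_002", "How many dogs are there?", "2"),
]


def always_red(image_id: str, question: str) -> str:
    """A trivial baseline that ignores both the image and the question."""
    return "Red"


print(exact_match_accuracy(examples, always_red))
```

A baseline like `always_red` is a useful sanity check: a model that never looks at the image should score poorly on a well-balanced dataset.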

8. Google Cloud & YouTube-8M Video Understanding Challenge

This is probably the largest dataset openly available for training. It is a dataset of 8 million YouTube videos tagged with the objects within them. There is also a running Kaggle competition on the dataset with a bounty of 100,000 dollars.

9. Data Science Bowl 2017

This turns out to be the largest bounty offered to crack a data science problem. There are prizes of $1 million up for grabs for data scientists who can detect lung cancer using this dataset of tagged CT scans.

10. Exoplanets Dataset

Today, a team that includes MIT and is led by the Carnegie Institution for Science has released the largest collection of observations made with a technique called radial velocity, to be used for hunting exoplanets. The dataset can be downloaded here.

11. End-to-End Interpretation of the French Street Name Signs Dataset

This is a huge dataset of French street signs labeled with what they denote. The dataset is easily readable by everyone’s favorite, TensorFlow, and can be downloaded here.

12. A Realistic Dataset for the Smart Home Device Scheduling Problem for DCOPs

An upcoming dataset at the interface of IoT and AI. You can download it here.

13. RepEval 2017 Shared Task

From Sam Bowman’s team, the creators of the famous SNLI dataset, this dataset about understanding the meaning of text is going to be out soon as a competition. The dataset is expected by 15th March. You can find it here once it’s live.

14. Driver Speed Dataset

A huge 200 GB dataset aimed at calculating the speed of moving vehicles. It can be downloaded here.

15. NWPU-RESISC45 Remote sensing images dataset

A huge dataset of remote sensing images covering a wide array of landscapes as seen from satellites. Potential technology that can be developed includes satellite surveys, monitoring, and surveillance. Unfortunately, we are still waiting for the download link here.

16. Recipe to create your own free datasets from the open web

This is probably the most interesting of the datasets. It has been tagged not by humans but by machines. The authors also make clear what needs to be done to create a similar dataset from the millions of images already available on the web.

17. The LIP Dataset

This large-scale data set focuses on the semantic understanding of a person. The download link for the dataset is here.

18. WikiReading Data

WikiReading is a large-scale natural language understanding task and a publicly available dataset with 18 million instances. The download link is here.

19. MUSCIMA++

MUSCIMA++ is a dataset of handwritten music notation for musical symbol detection. Here is the download link.

20. DeScript (Describing Script Structure)

DeScript is a corpus of event sequence descriptions (ESDs) for different scenarios, crowdsourced via Amazon Mechanical Turk. Here is the download link.


Reference:

http://blog.paralleldots.com/data-scientist/new-deep-learning-datasets-data-scientists/


