知行编程网 · 2022-07-06 15:00

180G! Chinese ELECTRA Pre-trained Models Upgraded Again

Source | Joint Laboratory of HIT and iFLYTEK Research (HFL)

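In March this year, the Joint Laboratory of HIT and iFLYTEK Research (HFL) released Chinese ELECTRA pre-trained models and open-sourced the related resources, which have so far received 580 stars on GitHub. In this update, we enlarged the pre-training corpus from about 20G to 180G, a dataset nearly 9 times the original size. On Chinese natural language processing tasks such as reading comprehension, natural language inference, and sentence-pair classification, ELECTRA-180G achieves significant performance gains over the original ELECTRA. Readers are welcome to download and try the models.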

Project page: http://github.com/ymcui/Chinese-ELECTRA

   Introduction to ELECTRA

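ELECTRA proposes a new pre-training framework that consists of two parts: a Generator and a Discriminator.
  • Generator: a small MLM that predicts the original token at each [MASK] position. The Generator is used to replace some of the tokens in the input text.
  • Discriminator: determines whether each token in the input sentence has been replaced, i.e. it uses the Replaced Token Detection (RTD) pre-training task in place of BERT's original Masked Language Model (MLM). Note that the Next Sentence Prediction (NSP) task is not used here.
After pre-training is complete, only the Discriminator is used as the base model for fine-tuning on downstream tasks.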
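The Generator/Discriminator interplay described above can be illustrated with a small, dependency-free sketch of how an RTD training example might be built. It is not the authors' implementation: the function names and the toy sampler are hypothetical, and the real Generator is a trained MLM rather than a random choice. The labeling rule (a sampled token that happens to equal the original is treated as "original") follows the ELECTRA paper.

```python
import random


def build_rtd_example(tokens, mask_positions, sample_token):
    """Corrupt `tokens` at `mask_positions` and return (corrupted, labels).

    `sample_token(tokens, pos)` stands in for the Generator: it proposes a
    token for position `pos`. Labels are the Discriminator's targets:
    1 = replaced, 0 = original.
    """
    corrupted = list(tokens)
    for pos in mask_positions:
        corrupted[pos] = sample_token(tokens, pos)
    # A sampled token identical to the original counts as "original" (label 0).
    labels = [int(new != old) for new, old in zip(corrupted, tokens)]
    return corrupted, labels


def make_toy_sampler(vocab, seed=0):
    """Hypothetical stand-in for a trained Generator: picks a random vocab token."""
    rng = random.Random(seed)

    def sample(tokens, pos):
        return rng.choice(vocab)

    return sample
```

For example, calling `build_rtd_example(["the", "chef", "cooked", "the", "meal"], [0, 2], sampler)` yields a corrupted sentence together with one 0/1 label per token; the Discriminator is trained to predict those labels at every position, not only at the masked ones.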
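For more technical details, see the ELECTRA paper: ELECTRA: Pre-training Text Encoders as Discriminators Rather Than Generators (https://openreview.net/pdf?id=r1xMH1BtvB).
You can also read our tutorial slides "Revisiting Pre-trained Models for Chinese Natural Language Processing" to learn more about recent advances in pre-trained language models (reply NLPCC2020 to the official account to download).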

   Chinese ELECTRA

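In addition to the extended training data (about 20G) shared with the RoBERTa-wwm-ext series of models, we collected larger-scale Chinese text from CommonCrawl and, after data cleaning and related processing, expanded the pre-training corpus to 180G. This release includes the following four models: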
  • ELECTRA-180g-large, Chinese: 24-layer, 1024-hidden, 16-heads, 324M parameters

  • ELECTRA-180g-base, Chinese: 12-layer, 768-hidden, 12-heads, 102M parameters

  • ELECTRA-180g-small-ex, Chinese: 24-layer, 256-hidden, 4-heads, 25M parameters

  • ELECTRA-180g-small, Chinese: 12-layer, 256-hidden, 4-heads, 12M parameters


   Quick Loading

All Chinese pre-trained language models released by HFL can be loaded quickly through the huggingface transformers library. Please visit our shared page for more information.

https://huggingface.co/HFL
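As a rough sketch of what loading through transformers might look like: the snippet below builds a hub identifier and loads the tokenizer and Discriminator weights with `AutoTokenizer` and `AutoModel`. The identifier pattern `hfl/chinese-electra-180g-{size}-discriminator` and the helper names are assumptions, not taken from this article; please confirm the exact model names on the HFL page above before use.

```python
SIZES = ("small", "small-ex", "base", "large")


def hub_model_id(size, part="discriminator"):
    """Build a hub id; the naming pattern is an assumption, verify it on the HFL page."""
    if size not in SIZES:
        raise ValueError(f"unknown size {size!r}, expected one of {SIZES}")
    if part not in ("discriminator", "generator"):
        raise ValueError(f"unknown part {part!r}")
    return f"hfl/chinese-electra-180g-{size}-{part}"


def load_electra(size="small"):
    """Download and return (tokenizer, model) for the Discriminator.

    Requires the `transformers` package and network access; the import is
    kept local so that building ids does not need the library installed.
    """
    from transformers import AutoModel, AutoTokenizer

    model_id = hub_model_id(size)
    tokenizer = AutoTokenizer.from_pretrained(model_id)
    model = AutoModel.from_pretrained(model_id)
    return tokenizer, model
```

The Discriminator is chosen as the default because, as noted earlier, it is the part used as the base model for downstream fine-tuning. A typical call would be `tokenizer, model = load_electra("base")`.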

   Evaluation

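On CMRC 2018 (Simplified Chinese reading comprehension), DRCD (Traditional Chinese reading comprehension), XNLI (natural language inference), and BQ Corpus (sentence-pair classification), ELECTRA-180G significantly outperforms the original ELECTRA. For more detailed evaluation results, please see the project's GitHub page.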

CMRC 2018
[Results table: image in the original post]

DRCD
[Results table: image in the original post]

XNLI
[Results table: image in the original post]

BQ Corpus
[Results table: image in the original post]

Related Resources

TextBrewer knowledge distillation toolkit
http://github.com/airaria/TextBrewer
Chinese BERT, RoBERTa, and RBT series models
https://github.com/ymcui/Chinese-BERT-wwm
Chinese XLNet series models
https://github.com/ymcui/Chinese-XLNet
Chinese MacBERT model
https://github.com/ymcui/MacBERT



This article is from: 深度学习这件小事
