
15.2. transformers

15.2.1. Installing transformers

Install transformers with pip:

	
pip install transformers
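To verify the installation, you can print the library version; a quick check (this snippet is not from the original text):

import transformers
print(transformers.__version__)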

15.2.2. Loading a Local Model

		
#!/usr/bin/python
# -*- coding: utf-8 -*-
from transformers import BertTokenizer

# Path to the model directory downloaded to local disk
model_path = "./transformers/bert-base-chinese"
tokenizer = BertTokenizer.from_pretrained(model_path)
print(tokenizer.tokenize("I have a good time, thank you."))

The local model directory contains the following files:
neo@MacBook-Pro-M2 ~/w/test> ll transformers/bert-base-chinese/
total 804576
-rw-r--r--  1 neo  staff   624B 10 23 17:00 config.json
-rw-r--r--  1 neo  staff   392M 10 23 17:05 model.safetensors
-rw-r--r--  1 neo  staff   263K 10 23 17:05 tokenizer.json
-rw-r--r--  1 neo  staff    29B 10 23 17:05 tokenizer_config.json
-rw-r--r--@ 1 neo  staff   107K 10 23 16:13 vocab.txt

Output:

		
neo@MacBook-Pro-M2 ~/w/test> python3 /Users/neo/workspace/test/test.py
['[UNK]', 'have', 'a', 'good', 'time', ',', 'than', '##k', 'you', '.']
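The tokenizer alone does not load the network weights; BertModel.from_pretrained reads them from the same directory. A minimal sketch of a full forward pass (the input sentence is illustrative):

from transformers import BertTokenizer, BertModel

model_path = "./transformers/bert-base-chinese"
tokenizer = BertTokenizer.from_pretrained(model_path)
model = BertModel.from_pretrained(model_path)

inputs = tokenizer("你好世界", return_tensors="pt")
outputs = model(**inputs)
print(outputs.last_hidden_state.shape)  # (batch, sequence_length, hidden_size)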

15.2.3. Automatically Downloading a Model

		
#!/usr/bin/python
# -*- coding: utf-8 -*-
from transformers import AutoTokenizer

# A Hugging Face Hub model ID; the files are downloaded automatically on first use
model_path = "hfl/chinese-macbert-base"
tokenizer = AutoTokenizer.from_pretrained(model_path)
print(tokenizer.tokenize("I have a good time, thank you."))
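The AutoModel class downloads the network weights the same way. A minimal sketch, reusing the model ID from above:

from transformers import AutoTokenizer, AutoModel

model_path = "hfl/chinese-macbert-base"
tokenizer = AutoTokenizer.from_pretrained(model_path)
model = AutoModel.from_pretrained(model_path)
print(model.config.model_type)  # expected: bert (MacBERT uses the BERT architecture)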

15.2.4. Encoding

		
#!/usr/bin/python
# -*- coding: utf-8 -*-
from transformers import BertTokenizer

model_path = "./transformers/bert-base-chinese"
tokenizer = BertTokenizer.from_pretrained(model_path)

# Encode a single sentence to token IDs, adding [CLS] and [SEP]
print(tokenizer.encode('你好世界'))
# Encode a sentence pair; returns input_ids, token_type_ids and attention_mask
print(tokenizer.encode_plus('我不喜欢这世界', '你好中国'))
# Map token IDs back to their string tokens
print(tokenizer.convert_ids_to_tokens(tokenizer.encode('我爱你')))
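tokenizer.decode reverses the encoding, turning IDs back into text. A short sketch (same local path as above):

from transformers import BertTokenizer

model_path = "./transformers/bert-base-chinese"
tokenizer = BertTokenizer.from_pretrained(model_path)

ids = tokenizer.encode('你好世界')
print(tokenizer.decode(ids))                            # includes [CLS] and [SEP]
print(tokenizer.decode(ids, skip_special_tokens=True))  # just the original characters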

15.2.5. Computing Vectors

		
from transformers import AutoTokenizer, AutoModel

cache_dir = "/tmp/transformers"
# model = "hfl/chinese-macbert-base"
pretrained_model_name_or_path = "bert-base-chinese"

# Pass force_download=True to re-download a corrupted cache
tokenizer = AutoTokenizer.from_pretrained(pretrained_model_name_or_path, cache_dir=cache_dir)
model = AutoModel.from_pretrained(pretrained_model_name_or_path, cache_dir=cache_dir)

sentences = ['https://www.netkiller.cn']

inputs = tokenizer(sentences, return_tensors="pt")
outputs = model(**inputs)
# pooler_output: the [CLS] hidden state passed through a dense layer and tanh
array = outputs.pooler_output.tolist()
print(array)
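pooler_output is not the only choice of sentence vector; a common alternative is mean pooling over last_hidden_state, weighted by the attention mask. A sketch of that variant (not from the original text):

import torch
from transformers import AutoTokenizer, AutoModel

tokenizer = AutoTokenizer.from_pretrained("bert-base-chinese")
model = AutoModel.from_pretrained("bert-base-chinese")

inputs = tokenizer(['https://www.netkiller.cn'], return_tensors="pt")
with torch.no_grad():
    outputs = model(**inputs)

# Average the token embeddings, ignoring padding positions
mask = inputs["attention_mask"].unsqueeze(-1)           # (batch, seq, 1)
summed = (outputs.last_hidden_state * mask).sum(dim=1)  # (batch, hidden)
embedding = summed / mask.sum(dim=1)
print(embedding.shape)  # torch.Size([1, 768])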

15.2.6. FAQ

15.2.6.1. Hiding Warnings

			
- This IS expected if you are initializing BertModel from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing BertModel from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).

Solution: lower the logging verbosity.

			
from transformers import logging
logging.set_verbosity_warning()

or:

from transformers import logging
logging.set_verbosity_error()

Alternatively, open the config.json file, find the architectures entry, and change BertModel to BertForMaskedLM:

			
{
  "architectures": [
    "BertForMaskedLM"
  ],
  "attention_probs_dropout_prob": 0.1,
  "directionality": "bidi",
  "hidden_act": "gelu",
  "hidden_dropout_prob": 0.1,
  "hidden_size": 768,
  "initializer_range": 0.02,
  "intermediate_size": 3072,
  "layer_norm_eps": 1e-12,
  "max_position_embeddings": 512,
  "model_type": "bert",
  "num_attention_heads": 12,
  "num_hidden_layers": 12,
  "pad_token_id": 0,
  "pooler_fc_size": 768,
  "pooler_num_attention_heads": 12,
  "pooler_num_fc_layers": 3,
  "pooler_size_per_head": 128,
  "pooler_type": "first_token_transform",
  "type_vocab_size": 2,
  "vocab_size": 21128
}
Loading the checkpoint as BertForMaskedLM then reports:

Some weights of the model checkpoint at ./transformers/bert-base-chinese were not used when initializing BertForMaskedLM: ['cls.seq_relationship.bias', 'cls.seq_relationship.weight']
- This IS expected if you are initializing BertForMaskedLM from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing BertForMaskedLM from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).

Add the from_tf=True parameter to load the weights from a TensorFlow checkpoint:

			
#!/usr/bin/python
# -*- coding: utf-8 -*-
from transformers import BertTokenizer, BertForMaskedLM

pretrained_model_name_or_path = "./transformers/bert-base-chinese"
tokenizer = BertTokenizer.from_pretrained(pretrained_model_name_or_path)
# from_tf=True loads the weights from a TensorFlow checkpoint
model = BertForMaskedLM.from_pretrained(pretrained_model_name_or_path, from_tf=True)

The log then reports:
All TF 2.0 model weights were used when initializing BertForMaskedLM.

All the weights of BertForMaskedLM were initialized from the TF 2.0 model.
If your task is similar to the task the model of the checkpoint was trained on, you can already use BertForMaskedLM for predictions without further training.
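As the log suggests, the masked-LM head is usable for predictions right away. A minimal sketch with the fill-mask pipeline (the example sentence is illustrative):

from transformers import BertTokenizer, BertForMaskedLM, pipeline

model_path = "./transformers/bert-base-chinese"
tokenizer = BertTokenizer.from_pretrained(model_path)
model = BertForMaskedLM.from_pretrained(model_path)

fill = pipeline("fill-mask", model=model, tokenizer=tokenizer)
for candidate in fill("巴黎是法国的[MASK]都。"):
    print(candidate["token_str"], candidate["score"])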