
Chapter 15. AI

Table of Contents

15.1. tokenizers
15.1.1. Normalization
15.1.2. Pre-Tokenization
15.2. transformers
15.2.1. Installing transformers
15.2.2. Loading a local model
15.2.3. Downloading a model automatically
15.2.4. Encoding
15.2.5. Computing vectors
15.2.6. FAQ

15.1. tokenizers

15.1.1. Normalization

Text cleaning. Normalization applies a series of cleanup operations to the raw input sentence, such as stripping whitespace, removing accents, and lowercasing.

from tokenizers import normalizers
from tokenizers.normalizers import NFD, StripAccents

# Chain NFD Unicode decomposition with accent stripping into one pipeline.
normalizer = normalizers.Sequence([NFD(), StripAccents()])
text = normalizer.normalize_str("Héllò hôw are ü?")
print(text)
# "Hello how are u?"
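
The description above also mentions lowercasing, which NFD and StripAccents alone do not perform. Below is a minimal sketch, assuming the library's Lowercase normalizer is appended to the same Sequence (the sample string is illustrative):

from tokenizers import normalizers
from tokenizers.normalizers import NFD, StripAccents, Lowercase

# Append Lowercase so the pipeline strips accents and lowercases in one pass.
normalizer = normalizers.Sequence([NFD(), StripAccents(), Lowercase()])
print(normalizer.normalize_str("Héllò HÔW are ü?"))
# expected output: "hello how are u?"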

15.1.2. Pre-Tokenization

Splits the text into tokens and records each token's offsets (start and end positions) in the original string.

from tokenizers import pre_tokenizers
from tokenizers.pre_tokenizers import Whitespace, Digits

# Whitespace splits on the regular expression \w+|[^\w\s]+, i.e. a token is either
# a run of word characters or a run of punctuation; it returns List[Tuple[str, Offsets]].
pre_tokenizer = Whitespace()
data1 = pre_tokenizer.pre_tokenize_str("What's your nickname? My nickname is netkiller.")
print(data1)

# Chain Whitespace with Digits; individual_digits=True makes every digit its own token.
pre_tokenizer = pre_tokenizers.Sequence([Whitespace(), Digits(individual_digits=True)])
data2 = pre_tokenizer.pre_tokenize_str("https://www.netkiller.cn")
print(data2)

Output:

[('What', (0, 4)), ("'", (4, 5)), ('s', (5, 6)), ('your', (7, 11)), ('nickname', (12, 20)), ('?', (20, 21)), ('My', (22, 24)), ('nickname', (25, 33)), ('is', (34, 36)), ('netkiller', (37, 46)), ('.', (46, 47))]
[('https', (0, 5)), ('://', (5, 8)), ('www', (8, 11)), ('.', (11, 12)), ('netkiller', (12, 21)), ('.', (21, 22)), ('cn', (22, 24))]
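
The URL in the second example contains no digits, so the Digits step has nothing to split there. Below is a minimal sketch with an input that does contain digits (the sample string is illustrative), showing individual_digits=True emitting each digit as a separate token:

from tokenizers import pre_tokenizers
from tokenizers.pre_tokenizers import Whitespace, Digits

pre_tokenizer = pre_tokenizers.Sequence([Whitespace(), Digits(individual_digits=True)])
# Whitespace first yields ('netkiller', (0, 9)) and ('2024', (10, 14));
# Digits then splits '2024' into single-digit tokens with absolute offsets.
print(pre_tokenizer.pre_tokenize_str("netkiller 2024"))
# expected: [('netkiller', (0, 9)), ('2', (10, 11)), ('0', (11, 12)), ('2', (12, 13)), ('4', (13, 14))]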