Text cleaning: normalization applies a series of cleanup operations to the raw sentence, such as removing whitespace, stripping accented characters, and lowercasing.
from tokenizers import normalizers
from tokenizers.normalizers import NFD, StripAccents

normalizer = normalizers.Sequence([NFD(), StripAccents()])
text = normalizer.normalize_str("Héllò hôw are ü?")
print(text)  # "Hello how are u?"
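The list above also mentions lowercasing, which the example does not perform. A minimal sketch adding the library's Lowercase normalizer to the same sequence (the input string here is just an illustration):

from tokenizers import normalizers
from tokenizers.normalizers import NFD, StripAccents, Lowercase

# NFD decomposes characters, StripAccents drops the combining marks,
# Lowercase then folds the result to lower case
normalizer = normalizers.Sequence([NFD(), StripAccents(), Lowercase()])
print(normalizer.normalize_str("Héllò HÔW are ü?"))  # "hello how are u?"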
Pre-tokenization splits the text into pieces and records each piece's offsets in the original string.
from tokenizers import pre_tokenizers
from tokenizers.pre_tokenizers import Whitespace, Digits

# Whitespace splits on the regex \w+|[^\w\s]+, i.e. a token is either a run of
# word characters or a run of non-word, non-space characters;
# pre_tokenize_str returns List[Tuple[str, Offsets]]
pre_tokenizer = Whitespace()
data1 = pre_tokenizer.pre_tokenize_str("What's your nickname? My nickname is netkiller.")
print(data1)

# Digits(individual_digits=True) would additionally split each digit into its own token
pre_tokenizer = pre_tokenizers.Sequence([Whitespace(), Digits(individual_digits=True)])
data2 = pre_tokenizer.pre_tokenize_str("https://www.netkiller.cn")
print(data2)
Output:
[('What', (0, 4)), ("'", (4, 5)), ('s', (5, 6)), ('your', (7, 11)), ('nickname', (12, 20)), ('?', (20, 21)), ('My', (22, 24)), ('nickname', (25, 33)), ('is', (34, 36)), ('netkiller', (37, 46)), ('.', (46, 47))]
[('https', (0, 5)), ('://', (5, 8)), ('www', (8, 11)), ('.', (11, 12)), ('netkiller', (12, 21)), ('.', (21, 22)), ('cn', (22, 24))]
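In practice these two stages are usually attached to a Tokenizer rather than called standalone. A minimal sketch wiring the normalizer and pre-tokenizer from the examples above into a tokenizer (the untrained BPE model here is only a placeholder to host the pipeline, not part of the original examples):

from tokenizers import Tokenizer, normalizers, pre_tokenizers
from tokenizers.models import BPE
from tokenizers.normalizers import NFD, StripAccents
from tokenizers.pre_tokenizers import Whitespace, Digits

# An untrained BPE model, used here only so the pipeline stages have a home
tokenizer = Tokenizer(BPE())

# The same normalization and pre-tokenization as above, now run
# automatically on every input before the model sees it
tokenizer.normalizer = normalizers.Sequence([NFD(), StripAccents()])
tokenizer.pre_tokenizer = pre_tokenizers.Sequence([Whitespace(), Digits(individual_digits=True)])

Once a model is trained, tokenizer.encode() applies normalization and pre-tokenization in this order before the model maps the pieces to token ids.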