Building prefix dict from the default dictionary ...
Loading model from cache C:\Users\HUAWEI\AppData\Local\Temp\jieba.cache
Loading model cost 1.153 seconds.
Prefix dict has been built successfully.
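# Read the stopword list from a file.
# A set stores the stopwords because membership tests are faster than on a list.
with open(stopwords_file, 'r', encoding='utf-8') as f:
    stopwords = set(line.strip() for line in f)

# Assume words is a list whose elements are the tokens produced by segmentation.
result = [w for w in words if w not in stopwords]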
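
The snippet above leaves `stopwords_file` and `words` undefined, so here is a minimal self-contained sketch of the same idea. The helper name `load_stopwords`, the temporary file, and the token list are all hypothetical and exist only to make the example runnable; the blank-line check is an added assumption so that empty lines in the file do not become an empty-string stopword.

```python
import os
import tempfile


def load_stopwords(path):
    """Read one stopword per line into a set, skipping blank lines."""
    with open(path, 'r', encoding='utf-8') as f:
        return {line.strip() for line in f if line.strip()}


# Write a tiny hypothetical stopword file so the sketch runs on its own.
with tempfile.NamedTemporaryFile('w', encoding='utf-8', suffix='.txt',
                                 delete=False) as tmp:
    tmp.write('的\n是\n\n了\n')
    stopwords_file = tmp.name

stopwords = load_stopwords(stopwords_file)
os.remove(stopwords_file)  # clean up the temporary file

# Hypothetical token list standing in for the output of a segmenter.
words = ['今天', '的', '天气', '是', '晴天', '了']
result = [w for w in words if w not in stopwords]
print(result)
```

Because `stopwords` is a set, each `w not in stopwords` check takes roughly constant time regardless of how long the stopword list is.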
18.4 Custom Dictionaries
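jieba's built-in dictionary is comprehensive, but some domain-specific terms may not be included.
jieba.add_word() modifies the dictionary dynamically from within a program, adding a new word to it: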
import jieba

text = '原神是一款开放世界游戏'
print('Before adding:', jieba.lcut(text))
jieba.add_word('原神')
jieba.add_word('开放世界')
print('After adding:', jieba.lcut(text))
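
To build intuition for why adding a word changes the segmentation, the sketch below uses a toy forward maximum matching segmenter over a miniature dictionary. This is not jieba's algorithm: jieba scores candidate segmentations using word frequencies rather than greedily taking the longest match. The function `max_match`, the `max_len` parameter, and the contents of `toy_dict` are all hypothetical, chosen only for illustration.

```python
def max_match(text, dictionary, max_len=4):
    """Toy forward maximum matching: take the longest dictionary word at each position."""
    tokens = []
    i = 0
    while i < len(text):
        # Try the longest candidate first; fall back to a single character.
        for size in range(min(max_len, len(text) - i), 0, -1):
            piece = text[i:i + size]
            if size == 1 or piece in dictionary:
                tokens.append(piece)
                i += size
                break
    return tokens


# Hypothetical miniature dictionary, not jieba's real one.
toy_dict = {'一款', '开放', '世界', '游戏'}
text = '原神是一款开放世界游戏'

before = max_match(text, toy_dict)
toy_dict.update({'原神', '开放世界'})  # plays the role of jieba.add_word()
after = max_match(text, toy_dict)

print('before:', before)  # ['原', '神', '是', '一款', '开放', '世界', '游戏']
print('after:', after)    # ['原神', '是', '一款', '开放世界', '游戏']
```

In this toy model a word that is missing from the dictionary falls apart into shorter pieces, and once it is added the segmenter can return it whole.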

import jieba

text = '中国社会各阶级的分析是毛泽东的一篇重要文章。中国社会各阶级的分析分析了中国社会的各个阶级。'

# Segment the text
words = jieba.lcut(text)

# Remove stopwords
stopwords = {'的', '是', '了', '。'}
words = [w for w in words if w not in stopwords]

# Count word frequencies (with a dict)
freq = {}
for w in words:
    if w in freq:
        freq[w] += 1
    else:
        freq[w] = 1

# Sort and print the top 5
sorted_freq = sorted(freq.items(), key=lambda x: x[1], reverse=True)
print(sorted_freq[:5])
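
The manual dict loop above can also be written with `collections.Counter` from the standard library. The sketch below is one possible alternative, not a replacement the text prescribes; the token list is hypothetical and hard-coded so the example runs without jieba, and it is not claimed to match what jieba would produce for the sentence above.

```python
from collections import Counter

# Hypothetical tokens, as if segmentation and stopword removal were already done.
words = ['中国', '社会', '阶级', '分析', '中国', '社会', '中国']

# Counter tallies each distinct token in one pass.
freq = Counter(words)

# most_common(n) returns the n most frequent (token, count) pairs, highest first.
top2 = freq.most_common(2)
print(top2)  # [('中国', 3), ('社会', 2)]
```

`Counter` is a `dict` subclass, so `freq['中国']` and `freq.items()` work the same way as with the hand-built dictionary.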