K-means clustering based on word2vec
2022-04-21 14:04:00 【ddy-ddy】
1. Use word2vec to convert each word in a txt file into a word vector.
2. Use PCA to reduce the 300-dimensional word vectors to 2 dimensions.
3. Use the 2-dimensional data as the input to k-means clustering.
text.txt: the training text (English works best; for Chinese text, segment it into words first with the jieba library)
word_model.txt: create an empty text file
data.csv: create an empty csv file
## 1. Replace the punctuation in the text with spaces
## Punctuation marks to replace (named so it does not shadow the built-in list)
punctuation = [',', '?', '.', '!', '*', '(', ')', '“', '”', ':', '"', '`', '\'']
with open('text.txt', 'r', encoding='utf-8') as f:  ## text.txt is the training text (an English novel)
    result = f.read()
for mark in punctuation:
    result = result.replace(mark, ' ')
with open('text.txt', 'w', encoding='utf-8') as w:
    w.write(result)
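The replacement loop above scans the text once per punctuation mark. A minimal single-pass sketch using only the standard library is shown here; the names `PUNCTUATION` and `strip_punctuation` are hypothetical and not part of the original post.

```python
# Hypothetical helper (not from the original post): replace every
# punctuation mark with a space in one pass using str.translate.
PUNCTUATION = ',?.!*()“”:"`\''


def strip_punctuation(text: str, marks: str = PUNCTUATION) -> str:
    """Return `text` with every character in `marks` replaced by a space."""
    table = str.maketrans({ch: ' ' for ch in marks})
    return text.translate(table)


# Small in-memory example instead of reading text.txt
print(strip_punctuation('He said: "Stop!" (twice).'))
```

The translation table is built once, so the cost does not grow with the number of marks in the list.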
## 2. Get the word vectors with word2vec
from gensim.models import Word2Vec
from gensim.models.word2vec import LineSentence

def wordsCluster(text, vectorSize):
    """text: local path of the input text; vectorSize: dimensionality of the word vectors."""
    ## Train word2vec on the text (one sentence per line).
    ## gensim >= 4.0 names this parameter vector_size; gensim 3.x called it size.
    model = Word2Vec(LineSentence(text), vector_size=vectorSize, window=5, min_count=1, workers=4)
    ## Save the word vectors to word_model.txt in text format
    model.wv.save_word2vec_format('word_model.txt', binary=False)
    ## All words in the model's vocabulary (gensim 3.x: model.wv.vocab.keys())
    return list(model.wv.key_to_index.keys())

wordsCluster('text.txt', 300)
## 3. Convert the txt file of word vectors into a csv file
## Blank out the header line (vocabulary size and dimension) of word_model.txt
with open('word_model.txt', 'r', encoding='utf-8') as f:
    new = f.readlines()
new[0] = '\n'
with open('word_model.txt', 'w', encoding='utf-8') as f:
    f.writelines(new)
import csv
with open('data.csv', 'w', newline='', encoding='utf-8') as csvfile:  ## data.csv stores the word vectors
    writer = csv.writer(csvfile)
    with open('word_model.txt', 'r', encoding='utf-8') as data:
        for each_line in data:
            writer.writerow(each_line.split())  ## the blanked header becomes an empty first row
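The first line of a file written by `save_word2vec_format` is a header with the vocabulary size and the vector dimension, which is why step 3 blanks it before writing the csv. As a rough alternative sketch, the text format can also be parsed straight into Python lists without the csv round trip. The function name `parse_word2vec_text` is hypothetical, and the sketch assumes space-separated fields with no spaces inside a word.

```python
from typing import Iterable, List, Tuple


def parse_word2vec_text(lines: Iterable[str]) -> Tuple[List[str], List[List[float]]]:
    """Parse word2vec text-format lines into (words, vectors).

    The first line is the header "<vocab_size> <dimensions>"; every later
    line is "<word> <v1> <v2> ...". Lines with the wrong field count
    (for example blank lines) are skipped.
    """
    it = iter(lines)
    header = next(it).split()
    dim = int(header[1])
    words: List[str] = []
    vectors: List[List[float]] = []
    for line in it:
        parts = line.rstrip().split(' ')
        if len(parts) != dim + 1:
            continue  # blank or malformed line
        words.append(parts[0])
        vectors.append([float(x) for x in parts[1:]])
    return words, vectors


# In-memory example; with a real file, pass the open file object instead.
sample = ["2 3\n", "cat 0.1 0.2 0.3\n", "dog 1 2 3\n"]
print(parse_word2vec_text(sample))
```

Because the values are converted with `float`, the result can be passed to `np.array` and then to PCA without a separate conversion step.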
## 4. Use PCA to reduce the 300-dimensional data to 2 dimensions
import numpy as np
from sklearn.decomposition import PCA
from matplotlib import pyplot as plt
l = []
words = []
with open('data.csv', 'r', encoding='utf-8') as fd:
    fd.readline()  ## skip the empty first row left by the blanked header
    for line in fd:
        line = line.strip()
        if line == "":
            continue
        word = line.split(",")
        words.append(word[0])
        l.append([float(x) for x in word[1:]])  ## convert the csv strings to numbers
X = np.array(l)  ## input data, 300 dimensions
pca = PCA(n_components=2)  ## reduce to 2 dimensions
newX = pca.fit_transform(X)  ## fit and project; newX is an array of shape (n_words, 2)
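To make the dimensionality reduction less of a black box, here is a minimal NumPy sketch of what a PCA projection computes: center the data, take the singular value decomposition, and project onto the leading directions. The function `pca_project` and the random stand-in data are assumptions for illustration, not the author's code. The signs of the components may differ from scikit-learn's output, which applies its own sign convention.

```python
import numpy as np


def pca_project(X, n_components=2):
    """Project the rows of X onto their top principal components."""
    X = np.asarray(X, dtype=float)
    centered = X - X.mean(axis=0)
    # Rows of vt are the principal directions, sorted by decreasing variance.
    _, _, vt = np.linalg.svd(centered, full_matrices=False)
    return centered @ vt[:n_components].T


rng = np.random.default_rng(0)
demo = rng.normal(size=(50, 300))  # stand-in for 50 word vectors of dimension 300
projected = pca_project(demo)
print(projected.shape)  # (50, 2)
```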
## 5. Build a word-to-vector dictionary, train k-means, and get the clusters
word_to_point = {}  ## named so it does not shadow the built-in dict
for i in range(len(words)):
    word_to_point[words[i]] = newX[i]
for word in words:
    print(word + ':', end='')
    print(word_to_point[word])
from sklearn.cluster import KMeans
X = np.array(newX)
kmeans = KMeans(n_clusters=5, random_state=0).fit(X)
print("Coordinates of the five cluster centers:")
print(kmeans.cluster_centers_)
## Group the words by cluster label (0 to 4)
clusters = [[] for _ in range(5)]
for j in range(len(words)):
    clusters[kmeans.labels_[j]].append(words[j])
## The first word of each cluster is printed as its keyword
for cluster in clusters:
    print("Words related to the keyword " + cluster[0] + ":", end='')
    print(cluster)
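For readers who want to see what the clustering step does, the following is a rough sketch of plain Lloyd's algorithm in NumPy: assign each point to its nearest center, then move each center to the mean of its points, and repeat until the centers stop moving. The function `lloyd_kmeans` is hypothetical and uses simple random initialization, whereas scikit-learn's `KMeans` defaults to k-means++ initialization, so the labels it produces may differ.

```python
import numpy as np


def lloyd_kmeans(points, k, n_iter=100, seed=0):
    """Plain Lloyd's algorithm; returns (labels, centers)."""
    points = np.asarray(points, dtype=float)
    rng = np.random.default_rng(seed)
    # Random initialization from k distinct data points (not k-means++).
    centers = points[rng.choice(len(points), size=k, replace=False)]

    def assign(current_centers):
        # Distance from every point to every center, then the nearest center.
        dists = np.linalg.norm(points[:, None, :] - current_centers[None, :, :], axis=2)
        return dists.argmin(axis=1)

    for _ in range(n_iter):
        labels = assign(centers)
        new_centers = centers.copy()
        for j in range(k):
            members = points[labels == j]
            if len(members):  # keep the old center if a cluster is empty
                new_centers[j] = members.mean(axis=0)
        if np.allclose(new_centers, centers):
            break
        centers = new_centers
    return assign(centers), centers


# Two well-separated groups of 2-D points
pts = np.array([[0, 0], [0, 1], [1, 0], [10, 10], [10, 11], [11, 10]], dtype=float)
labels, centers = lloyd_kmeans(pts, k=2)
print(labels)
```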
## Visualize the data with a scatter plot
f1 = [point[0] for point in newX]  ## first principal component
f2 = [point[1] for point in newX]  ## second principal component
plt.scatter(f1, f2, c='blue', s=6)
plt.show()
Test results (figure not preserved)

Copyright notice
This article was written by [ddy-ddy]. Please include a link to the original when reposting. Thank you.
https://yzsam.com/2022/04/202204211351090557.html