Parsing Sogou cell dictionaries (.scel): extracting only words and word frequencies
2022-04-23 18:52:00 【Brick Porter】
#pragma once
#include <windows.h> // Windows SDK types used below: BYTE, byte, UINT16, INT32
#include <string>
#include <list>
#include <fstream>
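struct Data {
    // Clamp the count before narrowing it to BYTE; clamping afterwards would test an already truncated value
    Data(const std::wstring& _word, UINT16 count)
        :word(_word), byRate(count > 250 ? 250 : static_cast<BYTE>(count))
    {
    }
    std::wstring word;// Word
    BYTE byRate;// Word frequency, capped at 250
};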
class SougouScelReader
{
    // The file has two main parts.
    // 1. Global Pinyin table: apparently every Pinyin syllable, in dictionary order.
    //    It is a list of (index, len, pinyin) entries:
    //    index:  two-byte integer, the index of this Pinyin
    //    len:    two-byte integer, the byte length of the Pinyin
    //    pinyin: the Pinyin itself, two bytes per character, len bytes in total
    //
    // 2. Chinese phrase list
    //    It is a list of (same, py_table_len, py_table, {word_len, word, ext_len, ext}) entries:
    //    same:         two-byte integer, the number of homophones
    //    py_table_len: two-byte integer, the byte length of py_table
    //    py_table:     list of two-byte integers, each one a Pinyin index
    //
    //    word_len: two-byte integer, the byte length of the phrase
    //    word:     the phrase, two bytes per Chinese character, word_len bytes in total
    //    ext_len:  two-byte integer, the length of the extension data; it always seems to be 10
    //    ext:      extension data; the first two bytes are an integer (possibly the word frequency), the last eight bytes are all 0
    //
    //    {word_len, word, ext_len, ext} repeats `same` times: homophones share one Pinyin table
public:
    // Offset of the Pinyin table
    static const INT32 startPy = 0x1540;
    // Offset of the Chinese phrase list
    static const INT32 startChinese = 0x2628;
    // Convert raw little-endian two-byte characters to a string, skipping NUL characters
    std::wstring byte2str(byte data[], size_t len)const
    {
        size_t pos = 0;
        std::wstring str;
        // Require two readable bytes so an odd len cannot read past the end
        while (pos + 1 < len)
        {
            wchar_t c = (wchar_t)(data[pos + 1] << 8 | data[pos]);
            if (c != 0)
            {
                str += c;
            }
            pos += 2;
        }
        return str;
    }
    void getChinese(byte data[], size_t len, std::list<Data> &out)const
    {
        size_t pos = 0;
        while (pos + 4 <= len)
        {
            // Number of homophones sharing this Pinyin
            UINT16 same = data[pos + 1] << 8 | data[pos];
            pos += 2;
            // Length of the Pinyin index table, in bytes
            UINT16 py_table_len = data[pos + 1] << 8 | data[pos];
            pos += 2;
            // Skip the Pinyin index table; only words and frequencies are extracted
            pos += py_table_len;
            for (int i = 0; i < same; i++)
            {
                // Stop on a truncated record rather than read past the buffer
                if (pos + 2 > len)
                    return;
                // Length of the phrase, in bytes
                UINT16 c_len = data[pos + 1] << 8 | data[pos];
                pos += 2;
                // The phrase, ext_len and the frequency must all fit in the buffer
                if (pos + c_len + 4 > len)
                    return;
                // The phrase itself
                std::wstring word = byte2str(data + pos, c_len);
                pos += c_len;
                // Length of the extension data
                UINT16 ext_len = data[pos + 1] << 8 | data[pos];
                pos += 2;
                // Word frequency (first two bytes of the extension data)
                UINT16 count = data[pos + 1] << 8 | data[pos];
                out.push_back(Data(word, count));
                pos += ext_len;
            }
        }
    }
};
class CSogoScelParse
{
    std::wstring name;
    std::list<Data> words;
public:
    CSogoScelParse(std::wstring inputPath)
    {
        std::ifstream infile(inputPath.c_str(), std::ios_base::binary | std::ios_base::in);
        if (infile.is_open())
        {
            infile.seekg(0, std::ios_base::end);
            int nFileLen = (int)infile.tellg();
            infile.seekg(0, std::ios_base::beg);
            SougouScelReader scelReader;
            // The file is too small to contain a phrase list; no further validation is done for now.
            if (nFileLen < scelReader.startChinese)
            {
                infile.close();
                return;
            }
            byte* buffes = new byte[nFileLen];
            infile.read((char*)buffes, nFileLen);
            infile.close();
            // Dictionary name
            name = scelReader.byte2str(buffes + 0x130, 0x338 - 0x130);
            // Parse the phrase list
            scelReader.getChinese(buffes + scelReader.startChinese, nFileLen - scelReader.startChinese, words);
            // Memory from new[] must be released with delete[]
            delete[] buffes;
        }
    }
    size_t GetWordCount()const
    {
        return words.size();
    }
    const std::list<Data>& GetWordList()const
    {
        return words;
    }
};
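The record layout described in the comments of SougouScelReader can be tried out without a real .scel file. What follows is only a minimal, portable sketch of that layout, not the author's code: the names Entry, readU16, parsePhraseList and demoBuffer are hypothetical, the sample bytes are hand-built, and treating the first two extension bytes as the word frequency is the same guess the original comments make.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <string>
#include <utility>
#include <vector>

// Hypothetical portable counterpart of Data (name and fields are assumptions).
struct Entry {
    std::u16string word;  // phrase as UTF-16 code units
    std::uint16_t freq;   // first two bytes of the extension data
};

// Read a little-endian 16-bit value; the caller guarantees pos + 2 <= buf.size().
static std::uint16_t readU16(const std::vector<std::uint8_t>& buf, std::size_t pos) {
    return static_cast<std::uint16_t>(buf[pos] | (buf[pos + 1] << 8));
}

// Walk records of the form
// (same, py_table_len, py_table, {word_len, word, ext_len, ext} repeated `same` times).
// Parsing stops at the first record that would run past the end of the buffer.
std::vector<Entry> parsePhraseList(const std::vector<std::uint8_t>& buf) {
    std::vector<Entry> out;
    std::size_t pos = 0;
    while (pos + 4 <= buf.size()) {
        const std::uint16_t same = readU16(buf, pos);
        const std::uint16_t pyLen = readU16(buf, pos + 2);
        pos += 4;
        if (pyLen > buf.size() - pos) return out;  // truncated Pinyin index table
        pos += pyLen;                              // skip the Pinyin indexes
        for (std::uint16_t i = 0; i < same; ++i) {
            if (buf.size() - pos < 2) return out;
            const std::uint16_t wordLen = readU16(buf, pos);
            pos += 2;
            // The word plus the two-byte ext_len field must fit.
            if (buf.size() - pos < static_cast<std::size_t>(wordLen) + 2) return out;
            std::u16string word;
            for (std::size_t k = 0; k + 1 < wordLen; k += 2) {
                const char16_t c = readU16(buf, pos + k);
                if (c != 0) word.push_back(c);
            }
            pos += wordLen;
            const std::uint16_t extLen = readU16(buf, pos);
            pos += 2;
            if (extLen < 2 || buf.size() - pos < extLen) return out;
            out.push_back({std::move(word), readU16(buf, pos)});
            pos += extLen;
        }
    }
    return out;
}

// Hand-built sample: one record holding the two-character phrase U+4F60 U+597D.
std::vector<std::uint8_t> demoBuffer() {
    return {
        0x01, 0x00,              // same = 1
        0x02, 0x00,              // py_table_len = 2 bytes
        0x00, 0x00,              // one Pinyin index
        0x04, 0x00,              // word_len = 4 bytes
        0x60, 0x4F, 0x7D, 0x59,  // U+4F60 U+597D, little-endian
        0x0A, 0x00,              // ext_len = 10
        0x07, 0x00,              // presumed frequency = 7
        0, 0, 0, 0, 0, 0, 0, 0   // remaining extension bytes
    };
}
```

Calling parsePhraseList(demoBuffer()) should yield a single entry with that phrase and a frequency of 7. Cutting bytes off the end of the buffer makes the parser drop the incomplete record instead of reading out of bounds, which is the main difference from a loop that trusts the length fields.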
Copyright notice
This article was written by [Brick Porter]. Please include a link to the original when reposting. Thanks.
https://yzsam.com/2022/04/202204210603257515.html