Parsing Sogou cell thesaurus (.scel) files (extracting only words and word frequencies)
2022-04-23 18:52:00 【Brick Porter】
#pragma once
#include <windows.h> // Windows typedefs used below: UINT16, INT16, INT32, BYTE, byte
#include <string>
#include <list>
#include <fstream>
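struct Data {
public:
    // Clamp the frequency to 250 before narrowing it to a BYTE; narrowing first
    // would wrap values above 255 (for example 300 would become 44).
    Data(std::wstring _word, UINT16 count) :word(_word), byRate(count > 250 ? 250 : (BYTE)count)
    {
    }
    std::wstring word; // The word
    BYTE byRate;       // Word frequency, capped at 250
};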
class SougouScelReader
{
    // A .scel file has two main parts.
    // 1. Global Pinyin table: apparently every Pinyin syllable, in dictionary order.
    //    It is a list of (index, len, pinyin) records:
    //    index:  two-byte integer, the index of this Pinyin syllable
    //    len:    two-byte integer, the byte length of the Pinyin string
    //    pinyin: the Pinyin string, two bytes per character, len bytes in total
    //
    // 2. Chinese phrase list
    //    It is a list of (same, py_table_len, py_table, {word_len, word, ext_len, ext}) records:
    //    same:         two-byte integer, the number of homophones
    //    py_table_len: two-byte integer, the byte length of py_table
    //    py_table:     list of two-byte integers, each one a Pinyin index
    //
    //    word_len: two-byte integer, the byte length of the Chinese phrase
    //    word:     the Chinese phrase, two bytes per character, word_len bytes in total
    //    ext_len:  two-byte integer, the length of the extended information (apparently always 10)
    //    ext:      extended information; the first two bytes are an integer
    //              (possibly the word frequency), the last eight bytes are all 0
    //
    //    {word_len, word, ext_len, ext} repeats `same` times; the homophones share one Pinyin table.
public:
    // Offset of the Pinyin table
    static const INT32 startPy = 0x1540;
    // Offset of the Chinese phrase list
    static const INT32 startChinese = 0x2628;
    // The global Pinyin table is not parsed here; the result is only a list of
    // (word, word frequency) entries, without the Pinyin.
    // Convert raw little-endian two-byte characters to a string, skipping NUL characters
    std::wstring byte2str(byte data[], size_t len)const
    {
        size_t pos = 0;
        std::wstring str;
        while (pos + 1 < len) // each character needs two bytes; never read past an odd-length buffer
        {
            wchar_t c = (wchar_t)(data[pos + 1] << 8 | data[pos]);
            if (c != 0)
            {
                str += c;
            }
            pos += 2;
        }
        return str;
    }
    void getChinese(byte data[], size_t len, std::list<Data> &out)const
    {
        size_t pos = 0;
        while (pos + 4 <= len) // need at least same and py_table_len
        {
            // Number of homophones
            UINT16 same = data[pos + 1] << 8 | data[pos];
            pos += 2;
            // Pinyin index table length in bytes
            UINT16 py_table_len = data[pos + 1] << 8 | data[pos];
            pos += 2;
            // Skip the Pinyin index table
            pos += py_table_len;
            for (int i = 0; i < same; i++)
            {
                if (pos + 2 > len)
                    return; // truncated record
                // Chinese phrase length in bytes (unsigned, so it can never be negative)
                UINT16 c_len = data[pos + 1] << 8 | data[pos];
                pos += 2;
                if (pos + c_len + 4 > len)
                    return; // need the phrase, ext_len and the frequency
                // Chinese phrase
                std::wstring word = byte2str(data + pos, c_len);
                pos += c_len;
                // Extended data length
                UINT16 ext_len = data[pos + 1] << 8 | data[pos];
                pos += 2;
                // Word frequency: the first two bytes of the extended data
                UINT16 count = data[pos + 1] << 8 | data[pos];
                out.push_back(Data(word, count));
                pos += ext_len;
            }
        }
    }
};
class CSogoScelParse
{
    std::wstring name;     // Thesaurus name
    std::list<Data> words; // Parsed words with their frequencies
public:
    CSogoScelParse(std::wstring inputPath)
    {
        // Opening a stream from a wide-character path relies on an MSVC extension.
        std::ifstream infile(inputPath.c_str(), std::ios_base::binary | std::ios_base::in);
        if (infile.is_open())
        {
            infile.seekg(0, std::ios_base::end);
            int nFileLen = (int)infile.tellg();
            infile.seekg(0, std::ios_base::beg);
            SougouScelReader scelReader;
            if (nFileLen < scelReader.startChinese) // The file is too small; no further validation is done for now.
            {
                return; // infile is closed by its destructor
            }
            byte* buffes = new byte[nFileLen];
            infile.read((char*)buffes, nFileLen);
            infile.close();
            // Thesaurus name
            name = scelReader.byte2str(buffes + 0x130, 0x338 - 0x130);
            // Parse the word list
            scelReader.getChinese(buffes + scelReader.startChinese, nFileLen - scelReader.startChinese, words);
            delete[] buffes; // allocated with new[], so it must be released with delete[]
        }
    }
    size_t GetWordCount()const
    {
        return words.size();
    }
    const std::list<Data>& GetWordList()const
    {
        return words;
    }
};
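To try the phrase-list layout described in the comments without a real .scel file and without the Windows typedefs, the sketch below walks the same record structure over an in-memory buffer. It is only an illustration under assumptions: `Entry`, `readU16`, `parsePhraseList` and `makeSampleGroup` are hypothetical names that do not appear in the listing above, the sample bytes are hand-built rather than taken from a real thesaurus, and the first two bytes of `ext` are treated as the frequency, which the original comments themselves mark as uncertain.

```cpp
#include <cstddef>
#include <cstdint>
#include <string>
#include <utility>
#include <vector>

// Hypothetical portable record: (phrase as UTF-16, raw frequency value).
using Entry = std::pair<std::u16string, std::uint16_t>;

// Read one little-endian 16-bit value; the caller guarantees pos + 1 < buf.size().
static std::uint16_t readU16(const std::vector<std::uint8_t>& buf, std::size_t pos)
{
    return static_cast<std::uint16_t>(buf[pos] | (buf[pos + 1] << 8));
}

// Walk a phrase list laid out as
// (same, py_table_len, py_table, {word_len, word, ext_len, ext} x same).
// Parsing stops quietly at the first truncated record.
std::vector<Entry> parsePhraseList(const std::vector<std::uint8_t>& buf)
{
    std::vector<Entry> out;
    std::size_t pos = 0;
    while (pos + 4 <= buf.size())
    {
        const std::uint16_t same = readU16(buf, pos);
        const std::uint16_t pyLen = readU16(buf, pos + 2);
        pos += 4 + pyLen; // skip the Pinyin index table
        for (std::uint16_t i = 0; i < same; ++i)
        {
            if (pos + 2 > buf.size()) return out;
            const std::uint16_t wordLen = readU16(buf, pos);
            pos += 2;
            if (pos + wordLen + 4 > buf.size()) return out;
            std::u16string word;
            for (std::size_t k = 0; k + 1 < wordLen; k += 2)
            {
                const char16_t c = static_cast<char16_t>(readU16(buf, pos + k));
                if (c != 0) word.push_back(c); // skip NUL characters, as byte2str does
            }
            pos += wordLen;
            const std::uint16_t extLen = readU16(buf, pos);
            const std::uint16_t count = readU16(buf, pos + 2); // assumed to be the frequency
            out.emplace_back(word, count);
            pos += 2 + extLen;
        }
    }
    return out;
}

// Hand-built sample: one group, one Pinyin index, the phrase U+4F60 U+597D, frequency 5.
std::vector<std::uint8_t> makeSampleGroup()
{
    return {
        0x01, 0x00,             // same = 1
        0x02, 0x00,             // py_table_len = 2 bytes
        0x00, 0x00,             // py_table: one Pinyin index
        0x04, 0x00,             // word_len = 4 bytes
        0x60, 0x4F, 0x7D, 0x59, // word: U+4F60 U+597D, little-endian
        0x0A, 0x00,             // ext_len = 10
        0x05, 0x00,             // ext: first two bytes, read as the frequency
        0, 0, 0, 0, 0, 0, 0, 0  // ext: remaining eight bytes
    };
}
```

Calling `parsePhraseList(makeSampleGroup())` yields a single entry for the two-character phrase. Returning early on a truncated record, instead of reading past the end of the buffer, is a design choice of this sketch.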
Copyright notice
This article was written by [Brick Porter]. Please include a link to the original when reposting. Thank you.
https://yzsam.com/2022/04/202204210603257515.html