Parsing Sogou cell thesaurus (.scel) files (extracting only words and word frequencies)
2022-04-23 18:52:00 【Brick Porter】
#pragma once
#include <windows.h> // UINT16, INT32, BYTE and byte typedefs
#include <string>
#include <list>
#include <fstream>
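struct Data {
// Clamp before narrowing: converting count to BYTE first would wrap values above 255.
Data(std::wstring _word, UINT16 count) :word(_word), byRate(static_cast<BYTE>(count > 250 ? 250 : count))
{
}
std::wstring word;// The word
BYTE byRate;// Word frequency, capped at 250
};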
class SougouScelReader
{
// A .scel file has two main parts.
// 1. Global pinyin table: apparently every pinyin syllable, in dictionary order.
// It is a list of (index, len, pinyin) entries:
// index: two-byte integer, the index of this pinyin
// len: two-byte integer, the byte length of the pinyin string
// pinyin: the pinyin string, two bytes per character, len bytes in total
//
// 2. Chinese phrase list
// It is a list of (same, py_table_len, py_table, {word_len, word, ext_len, ext}) entries:
// same: two-byte integer, the number of homophones
// py_table_len: two-byte integer, the byte length of py_table
// py_table: list of two-byte integers, each one a pinyin index
//
// word_len: two-byte integer, the byte length of the phrase
// word: the phrase, two bytes per character, word_len bytes in total
// ext_len: two-byte integer, the length of the extension data (apparently always 10)
// ext: extension data; the first two bytes are an integer (possibly the word frequency), the remaining eight bytes are all 0
//
// {word_len, word, ext_len, ext} repeats same times; the homophones share one pinyin table.
public:
// Offset of the pinyin table
static const INT32 startPy = 0x1540;
// Offset of the Chinese phrase list
static const INT32 startChinese = 0x2628;
// Decode raw UTF-16LE bytes into a string, skipping NUL characters
std::wstring byte2str(byte data[], size_t len)const
{
size_t pos = 0;
std::wstring str;
// Stop at the last complete two-byte unit so data[pos + 1] stays in bounds.
while (pos + 1 < len)
{
wchar_t c = (wchar_t)(data[pos + 1] << 8 | data[pos]);
if (c != 0)
{
str += c;
}
pos += 2;
}
return str;
}
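void getChinese(byte data[], size_t len, std::list<Data> &out)const
{
size_t pos = 0;
// Each record needs at least the two 2-byte header fields.
while (pos + 4 <= len)
{
// Number of homophones
UINT16 same = data[pos + 1] << 8 | data[pos];
pos += 2;
// Byte length of the pinyin index table
UINT16 py_table_len = data[pos + 1] << 8 | data[pos];
pos += 2;
// Skip the pinyin index table; stop if the data is truncated
if (py_table_len > len - pos)
return;
pos += py_table_len;
for (int i = 0; i < same; i++)
{
if (len - pos < 2)
return;
// Byte length of the phrase
UINT16 c_len = data[pos + 1] << 8 | data[pos];
pos += 2;
// The phrase must fit, together with the two-byte ext_len field after it
if (len - pos < (size_t)c_len + 2)
return;
std::wstring word = byte2str(data + pos, c_len);
pos += c_len;
// Byte length of the extension data
UINT16 ext_len = data[pos + 1] << 8 | data[pos];
pos += 2;
// The first two extension bytes are read as the word frequency
if (ext_len < 2 || ext_len > len - pos)
return;
UINT16 count = data[pos + 1] << 8 | data[pos];
out.push_back(Data(word, count));
pos += ext_len;
}
}
}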
};
class CSogoScelParse
{
std::wstring name;
std::list<Data> words;
public:
CSogoScelParse(std::wstring inputPath)
{
std::ifstream infile(inputPath.c_str(), std::ios_base::binary| std::ios_base::in);
if (infile.is_open())
{
infile.seekg(0, std::ios_base::end);
int nFileLen = (int)infile.tellg();
infile.seekg(0, std::ios_base::beg);
SougouScelReader scelReader;
if (nFileLen < scelReader.startChinese)// The file is too small; no further validation is done for now.
{
infile.close();
return;
}
byte* buffes = new byte[nFileLen];
infile.read((char*)buffes, nFileLen);
infile.close();
// Thesaurus name
name = scelReader.byte2str(buffes+0x130, 0x338-0x130);
// Parse the word list
scelReader.getChinese(buffes + scelReader.startChinese, nFileLen - scelReader.startChinese,words);
// Memory from new[] must be released with delete[]
delete[] buffes;
}
}
size_t GetWordCount()const
{
return words.size();
}
const std::list<Data>& GetWordList()const
{
return words;
}
};
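The phrase-list layout described in the comments of SougouScelReader can be checked without a real .scel file. The following is a minimal, portable sketch, not the author's code: it uses the fixed-width types from <cstdint> instead of the Windows typedefs, the helper names readU16, parsePhraseList and makeSampleRecord are hypothetical, and the sample bytes are synthetic (hand-built to match the described layout, not taken from an actual thesaurus).

```cpp
#include <cstddef>
#include <cstdint>
#include <string>
#include <utility>
#include <vector>

// Hypothetical helper: read a little-endian 16-bit value at data[pos].
// The caller must make sure pos + 1 is in bounds.
std::uint16_t readU16(const std::vector<std::uint8_t>& data, std::size_t pos)
{
    return static_cast<std::uint16_t>(data[pos] | (data[pos + 1] << 8));
}

// Hypothetical helper: walk (same, py_table_len, py_table, {word_len, word, ext_len, ext})
// records and collect (word, frequency) pairs. Parsing stops at the first truncated record.
std::vector<std::pair<std::u16string, std::uint16_t>>
parsePhraseList(const std::vector<std::uint8_t>& data)
{
    std::vector<std::pair<std::u16string, std::uint16_t>> out;
    std::size_t pos = 0;
    while (pos + 4 <= data.size())
    {
        const std::uint16_t same = readU16(data, pos);      // number of homophones
        const std::uint16_t pyLen = readU16(data, pos + 2);  // pinyin table length in bytes
        pos += 4;
        if (pyLen > data.size() - pos)
            break;
        pos += pyLen;                                        // pinyin indices are skipped
        for (std::uint16_t i = 0; i < same; ++i)
        {
            if (data.size() - pos < 2)
                return out;
            const std::uint16_t wordLen = readU16(data, pos);
            pos += 2;
            if (data.size() - pos < wordLen + 2u)            // word plus the ext_len field
                return out;
            std::u16string word;
            for (std::size_t k = 0; k + 1 < wordLen; k += 2) // two bytes per character
                word.push_back(static_cast<char16_t>(readU16(data, pos + k)));
            pos += wordLen;
            const std::uint16_t extLen = readU16(data, pos);
            pos += 2;
            if (extLen < 2 || data.size() - pos < extLen)
                return out;
            out.emplace_back(word, readU16(data, pos));      // first two ext bytes
            pos += extLen;
        }
    }
    return out;
}

// Synthetic record: one word made of U+4E2D U+6587, two pinyin indices, frequency 7.
std::vector<std::uint8_t> makeSampleRecord()
{
    return {
        0x01, 0x00,             // same = 1
        0x04, 0x00,             // py_table_len = 4 bytes
        0x10, 0x00, 0x20, 0x00, // two pinyin indices (arbitrary values)
        0x04, 0x00,             // word_len = 4 bytes
        0x2D, 0x4E, 0x87, 0x65, // U+4E2D U+6587 in UTF-16LE
        0x0A, 0x00,             // ext_len = 10
        0x07, 0x00,             // first two ext bytes, read as the frequency
        0, 0, 0, 0, 0, 0, 0, 0  // remaining eight ext bytes
    };
}
```

Calling parsePhraseList(makeSampleRecord()) yields a single entry whose word is the two-character string and whose frequency is 7. Whether the first two extension bytes really hold the word frequency is the same open question the original comments raise.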
Copyright notice
This article was written by [Brick Porter]. Please include a link to the original when reposting. Thanks.
https://yzsam.com/2022/04/202204210603257515.html