Sogou cell thesaurus analysis (only extract words and word frequency)
2022-04-23 18:52:00 【Brick Porter】
#pragma once
#include <windows.h> // UINT16, INT16, INT32, BYTE, byte
#include <string>
#include <list>
#include <fstream>
struct Data {
public:
Data(std::wstring _word, UINT16 count) :word(_word), byRate((BYTE)(count > 250 ? 250 : count))// Clamp to 250 before narrowing to BYTE
{
}
std::wstring word;// word
BYTE byRate;// Word frequency
};
class SougouScelReader
{
// The file has two main parts.
// 1. Global pinyin table: apparently every pinyin syllable, in dictionary order.
// It is a list of (index, len, pinyin) records:
// index: two-byte integer, the index of this pinyin
// len: two-byte integer, the byte length of the pinyin string
// pinyin: the pinyin string, two bytes per character, len bytes in total
//
// 2. Chinese phrase list
// It is a list of (same, py_table_len, py_table, {word_len, word, ext_len, ext}) records:
// same: two-byte integer, the number of homophones
// py_table_len: two-byte integer, the byte length of py_table
// py_table: list of two-byte integers, each one a pinyin index
//
// word_len: two-byte integer, the byte length of the Chinese phrase
// word: the Chinese phrase, two bytes per character, word_len bytes in total
// ext_len: two-byte integer, the length of the extended information; it appears to always be 10
// ext: extended information; the first two bytes are an integer (possibly the word frequency), the remaining eight bytes are all 0
//
// {word_len, word, ext_len, ext} is repeated same times: homophones share one pinyin table.
public:
// Offset of the pinyin table
static const INT32 startPy = 0x1540;
// Offset of the Chinese phrase list
static const INT32 startChinese = 0x2628;
// The global pinyin table and the (word frequency, pinyin, Chinese phrase) result tuples
// are not stored here; only words and word frequencies are extracted.
// Convert raw little-endian two-byte characters to a string, skipping NUL characters
std::wstring byte2str(byte data[], size_t len)const
{
size_t pos = 0;
std::wstring str;
while (pos + 1 < len)// Two bytes per character; never read past an odd length
{
wchar_t c = (wchar_t)(data[pos + 1] << 8 | data[pos]);
if (c != 0)
{
str += c;
}
pos += 2;
}
return str;
}
void getChinese(byte data[], size_t len, std::list<Data> &out)const
{
size_t pos = 0;
while (pos + 4 <= len)// A record header needs at least four bytes
{
// Number of homophones
UINT16 same = data[pos + 1] << 8 | data[pos];
pos += 2;
// Byte length of the pinyin index table
UINT16 py_table_len = data[pos + 1] << 8 | data[pos];
pos += 2;
// Skip the pinyin index table; only words and frequencies are extracted
pos += py_table_len;
for (int i = 0; i < same; i++)
{
// Byte length of the Chinese phrase
if (pos + 2 > len) return;// Truncated data
UINT16 c_len = data[pos + 1] << 8 | data[pos];
pos += 2;
// Chinese phrase, followed by the two-byte extended data length
if (pos + c_len + 2 > len) return;// Truncated data
std::wstring word = byte2str(data + pos, c_len);
pos += c_len;
UINT16 ext_len = data[pos + 1] << 8 | data[pos];
pos += 2;
// Word frequency: the first two bytes of the extended data
if (pos + 2 > len || pos + ext_len > len) return;// Truncated data
UINT16 count = data[pos + 1] << 8 | data[pos];
out.push_back(Data(word, count));
pos += ext_len;
}
}
}
};
class CSogoScelParse
{
std::wstring name;
std::list<Data> words;
public:
CSogoScelParse(std::wstring inputPath)
{
// Note: opening an ifstream from a wide-character path is a Microsoft extension.
std::ifstream infile(inputPath.c_str(), std::ios_base::binary| std::ios_base::in);
if (infile.is_open())
{
infile.seekg(0, std::ios_base::end);
int nFileLen = (int)infile.tellg();
infile.seekg(0, std::ios_base::beg);
SougouScelReader scelReader;
if (nFileLen < scelReader.startChinese)// The file is too small; no further validation is done for now.
{
infile.close();
return;
}
byte* buffes = new byte[nFileLen];
infile.read((char*)buffes, nFileLen);
infile.close();
// Thesaurus name
name = scelReader.byte2str(buffes+0x130, 0x338-0x130);
// Parse the word list
scelReader.getChinese(buffes + scelReader.startChinese, nFileLen - scelReader.startChinese,words);
delete[] buffes;// Array form must match new[]
}
}
size_t GetWordCount()const
{
return words.size();
}
const std::list<Data>& GetWordList()const
{
return words;
}
};
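The phrase-list layout described in the comments above can also be exercised without a real .scel file. The following is a minimal, portable sketch under stated assumptions, not the author's implementation: the names `Entry`, `readU16`, `parsePhraseList` and `makeSampleRecord` are hypothetical, standard fixed-width types stand in for the Windows typedefs, and the sample bytes are synthetic rather than taken from an actual thesaurus. It decodes one (same, py_table_len, py_table, {word_len, word, ext_len, ext}) record and stops quietly when a record would run past the end of the buffer.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <string>
#include <utility>
#include <vector>

// Hypothetical names throughout; a (word, frequency) pair replaces the Data struct.
using Entry = std::pair<std::u16string, std::uint16_t>;

// Read a little-endian 16-bit value; the caller must guarantee two readable bytes.
static std::uint16_t readU16(const std::vector<std::uint8_t>& buf, std::size_t pos)
{
    assert(pos + 1 < buf.size());
    return static_cast<std::uint16_t>(buf[pos] | (buf[pos + 1] << 8));
}

// Decode records of the form (same, py_table_len, py_table, {word_len, word, ext_len, ext}).
// Parsing stops at the first record that would run past the end of the buffer.
std::vector<Entry> parsePhraseList(const std::vector<std::uint8_t>& buf)
{
    std::vector<Entry> out;
    std::size_t pos = 0;
    while (pos + 4 <= buf.size()) {
        const std::uint16_t same = readU16(buf, pos);
        const std::uint16_t pyTableLen = readU16(buf, pos + 2);
        pos += 4 + pyTableLen;                      // skip the pinyin index table
        for (std::uint16_t i = 0; i < same; ++i) {
            if (pos + 2 > buf.size()) return out;   // no room for word_len
            const std::uint16_t wordLen = readU16(buf, pos);
            pos += 2;
            if (pos + wordLen + 2 > buf.size()) return out; // word plus ext_len
            std::u16string word;
            for (std::size_t k = 0; k + 1 < wordLen; k += 2) {
                const char16_t c = readU16(buf, pos + k);
                if (c != 0) word.push_back(c);      // drop NUL padding
            }
            pos += wordLen;
            const std::uint16_t extLen = readU16(buf, pos);
            pos += 2;
            if (extLen < 2 || pos + extLen > buf.size()) return out;
            out.emplace_back(word, readU16(buf, pos)); // first two ext bytes: frequency
            pos += extLen;
        }
    }
    return out;
}

// Build one synthetic record: one pinyin index, one word, 10 bytes of ext data.
std::vector<std::uint8_t> makeSampleRecord()
{
    return {
        0x01, 0x00,             // same = 1
        0x02, 0x00,             // py_table_len = 2 bytes
        0x05, 0x00,             // py_table: a single (arbitrary) pinyin index
        0x04, 0x00,             // word_len = 4 bytes
        0x2D, 0x4E, 0x87, 0x65, // U+4E2D U+6587 in little-endian order
        0x0A, 0x00,             // ext_len = 10
        0x07, 0x00,             // frequency = 7
        0, 0, 0, 0, 0, 0, 0, 0  // remaining ext bytes
    };
}
```

Calling `parsePhraseList(makeSampleRecord())` yields a single entry. Returning early on a short buffer is a design choice of this sketch: a damaged file gives a partial word list rather than an out-of-bounds read.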
Copyright notice
This article was written by [Brick Porter]. Please include a link to the original when reposting. Thanks.
https://yzsam.com/2022/04/202204210603257515.html