C# F23.StringSimilarity library: string duplication, text similarity, and plagiarism detection
2022-04-23 11:40:00 【Detective Liu Yadong】
Recently, in our Shop Detective project (a platform for buying and renting shops), we wanted to stop brokers from copying other brokers' listing descriptions when they publish a shop listing. We decided to add a preliminary plagiarism check based on how similar the content is.
Enough preamble. Here is the practical part:
In Visual Studio, search for F23.StringSimilarity in the NuGet package manager and install it. The project's GitHub repository has a detailed introduction and is updated regularly.
The library currently implements more than ten algorithms, each with its own strengths and weaknesses. Pick the one that fits your business needs. It is worth getting a rough understanding of each algorithm before choosing; you can look each one up by the name it has in the library.
[Figure: screenshot of the library's algorithm list, machine-translated by Google. Each algorithm is described in the GitHub repository mentioned above.]
Each algorithm is simple to use. The following example calls several of them:
using System;
using System.Text.RegularExpressions;
using F23.StringSimilarity;

public static class Program
{
    public static void Main(string[] args)
    {
        // Strip the HTML first so that only the visible text is compared.
        var str1 = Html2Text(@"<div><div style='padding: 10px 0 5px 0'><p>1: Property area : On the second floor 1221㎡, Third floor :518㎡.<br/>2: Whole floor selling , Independent aisle , You can get through and use it by yourself .<br/>3: Holiday Resort , The flow of people is very large .</p></div></div>");
        var str2 = Html2Text(@"<div><div style='padding: 10px 0 5px 0'><p>1:<br/> Third floor :518.76 Square meters <br/> 2: Whole floor , Independent aisle , You can get through and use by yourself <br/>3: Holiday Resort , Large commercial Street Center , The traffic is huge .</p></div></div>");

        // These calls return a similarity score; higher means more alike.
        var a = new JaroWinkler().Similarity(str1, str2);
        var b = new NormalizedLevenshtein().Similarity(str1, str2);
        var c = new Cosine().Similarity(str1, str2);
        var d = new Jaccard().Similarity(str1, str2);
        var e = new SorensenDice().Similarity(str1, str2);
        var f = new RatcliffObershelp().Similarity(str1, str2);

        // This call returns a distance, not a similarity; lower means more alike.
        var g = new LongestCommonSubsequence().Distance(str1, str2);

        Console.WriteLine($"JaroWinkler: {a}");
        Console.WriteLine($"NormalizedLevenshtein: {b}");
        Console.WriteLine($"Cosine: {c}");
        Console.WriteLine($"Jaccard: {d}");
        Console.WriteLine($"SorensenDice: {e}");
        Console.WriteLine($"RatcliffObershelp: {f}");
        Console.WriteLine($"LongestCommonSubsequence distance: {g}");
    }

    // Reduces an HTML fragment to its text content, with all whitespace removed.
    public static string Html2Text(string htmlStr)
    {
        if (string.IsNullOrEmpty(htmlStr))
        {
            return "";
        }

        const string styleBlock = @"<style[^>]*?>[\s\S]*?</style>";    // <style> blocks
        const string scriptBlock = @"<script[^>]*?>[\s\S]*?</script>"; // <script> blocks
        const string htmlTag = @"<[^>]+>";                             // any remaining HTML tag

        htmlStr = Regex.Replace(htmlStr, styleBlock, "", RegexOptions.IgnoreCase);  // remove CSS
        htmlStr = Regex.Replace(htmlStr, scriptBlock, "", RegexOptions.IgnoreCase); // remove JavaScript
        htmlStr = Regex.Replace(htmlStr, htmlTag, "");                              // remove HTML tags
        htmlStr = Regex.Replace(htmlStr, @"\s+", "");                               // remove spaces, tabs and line breaks
        htmlStr = htmlStr.Replace("\"", "");                                        // remove double quotes
        return htmlStr;
    }
}
As a bonus, the code above includes Html2Text(string htmlStr), a helper that strips HTML markup from a string.
Now for my own approach. I use NormalizedLevenshtein, whose similarity value falls between 0 and 1.
I chose it because, from what I found online, this algorithm is well suited to plagiarism checks on papers, which matches my needs. You still need to add some logic of your own, for example stripping the HTML from both strings before comparing them, and deciding whether digits and letters should take part in the comparison. Then set a threshold. Mine is 0.65, and anything above that value is treated as plagiarism. Note that this only detects plagiarism within the system itself, because I compare only against this system's own data.
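The threshold check described above can be sketched roughly as follows. This is a minimal sketch under stated assumptions, not the project's actual code: the class name PlagiarismCheck and the helpers IsLikelyPlagiarized and StripAsciiLettersAndDigits are hypothetical, and NormalizedLevenshteinSimilarity is a hand-written stand-in (1 minus the edit distance divided by the length of the longer string) so that the example runs without the NuGet package. Only the 0.65 threshold comes from the article.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

public static class PlagiarismCheck
{
    // Threshold from the article: scores above 0.65 count as plagiarism.
    public const double Threshold = 0.65;

    // Hypothetical stand-in for the library's NormalizedLevenshtein.Similarity:
    // 1 - (edit distance / length of the longer string).
    public static double NormalizedLevenshteinSimilarity(string s1, string s2)
    {
        if (s1 == s2) return 1.0;
        int maxLength = Math.Max(s1.Length, s2.Length);
        return 1.0 - (double)LevenshteinDistance(s1, s2) / maxLength;
    }

    // Classic edit distance, computed with two rows of the dynamic programming table.
    private static int LevenshteinDistance(string s1, string s2)
    {
        var previous = new int[s2.Length + 1];
        var current = new int[s2.Length + 1];
        for (int j = 0; j <= s2.Length; j++) previous[j] = j;

        for (int i = 1; i <= s1.Length; i++)
        {
            current[0] = i;
            for (int j = 1; j <= s2.Length; j++)
            {
                int cost = s1[i - 1] == s2[j - 1] ? 0 : 1;
                current[j] = Math.Min(
                    Math.Min(current[j - 1] + 1, previous[j] + 1),
                    previous[j - 1] + cost);
            }
            (previous, current) = (current, previous); // the finished row becomes "previous"
        }
        return previous[s2.Length];
    }

    // Optional preprocessing (an assumption about what "whether digits and
    // letters should be compared" could mean): drop ASCII letters and digits.
    public static string StripAsciiLettersAndDigits(string text)
    {
        return new string(text.Where(ch => !(ch < 128 && char.IsLetterOrDigit(ch))).ToArray());
    }

    // True when the candidate scores above the threshold against any existing text.
    public static bool IsLikelyPlagiarized(
        string candidate,
        IEnumerable<string> existingTexts,
        Func<string, string, double> similarity)
    {
        return existingTexts.Any(text => similarity(candidate, text) > Threshold);
    }

    public static void Main()
    {
        var existingTexts = new[]
        {
            "Whole floor for sale, independent aisle",
            "Corner shop near the station"
        };
        string candidate = "Whole floor for sale, independent aisles";

        // The candidate differs from the first text by one character, so this prints True.
        Console.WriteLine(IsLikelyPlagiarized(candidate, existingTexts, NormalizedLevenshteinSimilarity));
    }
}
```

Because the similarity function is passed in as a delegate, the library version should slot in by passing new NormalizedLevenshtein().Similarity instead of the stand-in. In the setting described in this post, both strings would first go through Html2Text so that markup does not affect the score.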
Copyright notice
This article was written by [Detective Liu Yadong]. Please include a link to the original when reposting. Thanks.
https://yzsam.com/2022/04/202204231134496510.html