C# F23.StringSimilarity library: string duplication checks, text similarity, and anti-plagiarism
2022-04-23 11:40:00 【Detective Liu Yadong】
Recently, in our Shop Detective project (slogan: to buy or rent a shop, just use Shop Detective), we wanted to stop brokers from copying other brokers' listing descriptions when they publish a shop listing. So we decided to add a preliminary anti-plagiarism check based on how similar the content is.
Enough talk. Here is the practical part:
In Visual Studio's NuGet package manager, search for F23.StringSimilarity and install it. The project's GitHub page has a detailed introduction and is updated regularly.
The library currently implements more than ten algorithms. Each has its own strengths and weaknesses, so pick the one that fits your business needs. It helps to get a rough understanding of each algorithm before choosing; you can look each one up by the name it has in the library.
The figure below is a screenshot of the algorithm list, machine-translated by Google. The GitHub page mentioned above describes each algorithm:
Each algorithm is simple to use. Here are examples for several of them:
using System;
using System.Text.RegularExpressions; // Needed by Html2Text
using F23.StringSimilarity;

public static void Main(string[] args)
{
    // Strip the HTML first so that only the visible text is compared.
    var str1 = Html2Text(@"<div><div style='padding: 10px 0 5px 0'><p>1: Property area : On the second floor 1221㎡, Third floor :518㎡.<br/>2: Whole floor selling , Independent aisle , You can get through and use it by yourself .<br/>3: Holiday Resort , The flow of people is very large .</p></div></div>");
    var str2 = Html2Text(@"<div><div style='padding: 10px 0 5px 0'><p>1:<br/> Third floor :518.76 Square meters <br/> 2: Whole floor , Independent aisle , You can get through and use by yourself <br/>3: Holiday Resort , Large commercial Street Center , The traffic is huge .</p></div></div>");

    // Each Similarity call returns a score between 0 and 1; higher means more alike.
    var jaroWinkler = new JaroWinkler();
    var a = jaroWinkler.Similarity(str1, str2);
    var normalizedLevenshtein = new NormalizedLevenshtein();
    var b = normalizedLevenshtein.Similarity(str1, str2);
    var cosine = new Cosine();
    var c = cosine.Similarity(str1, str2);
    var jaccard = new Jaccard();
    var d = jaccard.Similarity(str1, str2);
    var sorensenDice = new SorensenDice();
    var e = sorensenDice.Similarity(str1, str2);
    var ratcliffObershelp = new RatcliffObershelp();
    var f = ratcliffObershelp.Similarity(str1, str2);

    // This one is a distance, not a similarity: lower means more alike.
    var longestCommonSubsequence = new LongestCommonSubsequence();
    var g = longestCommonSubsequence.Distance(str1, str2);
}
public static string Html2Text(string htmlStr)
{
    if (String.IsNullOrEmpty(htmlStr))
    {
        return "";
    }
    string regEx_style = "<style[^>]*?>[\\s\\S]*?<\\/style>"; // Regular expression for <style> blocks
    string regEx_script = "<script[^>]*?>[\\s\\S]*?<\\/script>"; // Regular expression for <script> blocks
    string regEx_html = "<[^>]+>"; // Regular expression for HTML tags
    htmlStr = Regex.Replace(htmlStr, regEx_style, ""); // Remove CSS
    htmlStr = Regex.Replace(htmlStr, regEx_script, ""); // Remove JavaScript
    htmlStr = Regex.Replace(htmlStr, regEx_html, ""); // Remove HTML tags
    htmlStr = Regex.Replace(htmlStr, "\\s+", ""); // Remove spaces, tabs, and line breaks
    htmlStr = htmlStr.Replace("&nbsp;", ""); // Remove non-breaking-space entities
    htmlStr = htmlStr.Replace("&quot;", ""); // Remove escaped double quotes
    htmlStr = htmlStr.Replace("\"", ""); // Remove literal double quotes
    return htmlStr.Trim();
}
As a bonus, the code above also includes a function for filtering out HTML: Html2Text(string htmlStr). Haha.
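For readers who would rather decode HTML entities than delete them one by one, here is a minimal sketch of a variant. It is not the function used in the project: the class name HtmlTextSketch and the method name ToPlainText are made up for illustration, and it assumes System.Net.WebUtility is available on your target framework.

```csharp
using System;
using System.Net;
using System.Text.RegularExpressions;

// Hypothetical helper, shown only as an alternative sketch to Html2Text.
public static class HtmlTextSketch
{
    // Matches <style>...</style> and <script>...</script> blocks, including their content.
    private static readonly Regex StyleOrScript = new Regex(
        @"<(style|script)\b[^>]*>[\s\S]*?</\1\s*>", RegexOptions.IgnoreCase);

    // Matches any remaining tag such as <div ...>, </p> or <br/>.
    private static readonly Regex AnyTag = new Regex(@"<[^>]+>");

    // Matches runs of whitespace; in .NET, \s also covers the non-breaking space U+00A0.
    private static readonly Regex Whitespace = new Regex(@"\s+");

    public static string ToPlainText(string html)
    {
        if (string.IsNullOrEmpty(html))
        {
            return "";
        }

        string text = StyleOrScript.Replace(html, ""); // Drop CSS and JavaScript blocks
        text = AnyTag.Replace(text, "");               // Drop the remaining tags
        text = WebUtility.HtmlDecode(text);            // Turn &nbsp;, &quot;, &amp; into characters
        text = Whitespace.Replace(text, "");           // Drop all whitespace, as Html2Text does
        return text;
    }
}
```

With this variant, an input like `<p>A &amp; B<br/> C&nbsp;D</p>` comes out as `A&BCD`. Whether quotes should be kept or removed afterwards depends on what you want the comparison to ignore.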
Now for the approach I actually use. I went with NormalizedLevenshtein, whose similarity value lies between 0 and 1.
From what I found online, this algorithm is well suited to plagiarism checks on papers, which matches my needs. You still have to add some logic of your own, for example stripping the HTML from both strings before comparing them, and deciding whether digits and letters should take part in the comparison. Then set a threshold. Mine is 0.65, and anything above that value is treated as plagiarism. Of course, this only catches direct copying inside our own system, because I compare only against this system's data.
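To make the 0 to 1 score and the 0.65 threshold concrete, here is a small library-free sketch. It uses the common definition of normalized Levenshtein similarity, one minus the edit distance divided by the length of the longer string. In the real project the score comes from the library's NormalizedLevenshtein.Similarity, so treat this hand-written version as an illustration only. The names PlagiarismSketch and IsLikelyCopied are hypothetical, and looping over a list of existing texts is my assumption about how a new listing would be checked against the system's data.

```csharp
using System;
using System.Collections.Generic;

// Hypothetical sketch; not the library's implementation.
public static class PlagiarismSketch
{
    // Classic dynamic-programming edit distance: insert, delete, substitute each cost 1.
    public static int Levenshtein(string a, string b)
    {
        var previous = new int[b.Length + 1];
        var current = new int[b.Length + 1];
        for (int j = 0; j <= b.Length; j++) previous[j] = j;

        for (int i = 1; i <= a.Length; i++)
        {
            current[0] = i;
            for (int j = 1; j <= b.Length; j++)
            {
                int substitution = previous[j - 1] + (a[i - 1] == b[j - 1] ? 0 : 1);
                int insertion = current[j - 1] + 1;
                int deletion = previous[j] + 1;
                current[j] = Math.Min(substitution, Math.Min(insertion, deletion));
            }
            // Reuse the two rows instead of allocating a full matrix.
            var swap = previous; previous = current; current = swap;
        }
        return previous[b.Length];
    }

    // Similarity in [0, 1]: 1 means identical, 0 means completely different.
    public static double Similarity(string a, string b)
    {
        int longest = Math.Max(a.Length, b.Length);
        if (longest == 0) return 1.0; // Two empty strings are identical
        return 1.0 - (double)Levenshtein(a, b) / longest;
    }

    // Returns true when the new text scores above the threshold against any existing text.
    public static bool IsLikelyCopied(string newText, IEnumerable<string> existingTexts, double threshold = 0.65)
    {
        foreach (string existing in existingTexts)
        {
            if (Similarity(newText, existing) > threshold) return true;
        }
        return false;
    }
}
```

Both inputs should already have had their HTML removed before they reach IsLikelyCopied. The threshold is a parameter so it can be tuned; 0.65 is simply the value that worked for me.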
Copyright notice
This article was written by [Detective Liu Yadong]. Please include a link to the original when reposting. Thank you.
https://yzsam.com/2022/04/202204231134496510.html