Recognizing PDF files with OCR
2022-04-23 02:41:00 【An effort to XX program yuan】
1 Existing PDF parsing methods
Reading a PDF with org.apache.pdfbox only extracts the text layer. Papers that were scanned into PDF come out garbled, and any text that is stored as an image is not extracted at all, so the result is incomplete and often lacks the data you need.
2 OCR text recognition
The PDF is first converted to images, which are then passed to OCR. The recognition rate is high.
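The size of the converted image depends on the resolution chosen for the conversion. PDF page sizes are expressed in points (1/72 inch), so the pixel dimensions follow directly from the DPI. A minimal sketch of that arithmetic, where `page_pixels` is a hypothetical helper and the A4 size of 595 x 842 points is only an example input:

```python
def page_pixels(width_pt, height_pt, dpi):
    """Return the (width, height) in pixels of a page rendered at `dpi`.

    PDF page sizes are given in points, and 1 point = 1/72 inch.
    """
    scale = dpi / 72.0
    return round(width_pt * scale), round(height_pt * scale)


if __name__ == "__main__":
    # An A4 page is roughly 595 x 842 points (assumed example input).
    for dpi in (96, 300):
        print(dpi, page_pixels(595, 842, dpi))
```

Knowing the pixel size matters later in this article, because the rectangular regions are given in pixel coordinates and only match images rendered at the same resolution.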
2.1 Calling the Baidu API
Advantages: high recognition rate, fast recognition
Disadvantage: charged per call
2.2 Reading PDF files with open-source tools
2.2.1 Download the language data
Download chi_sim.traineddata and chi_sim_vert.traineddata from https://github.com/tesseract-ocr/tessdata
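Before running OCR it can help to confirm that the downloaded files are where the engine will look for them. A minimal sketch, assuming the files are stored as `<language>.traineddata` in a single folder; `missing_traineddata` and the folder name `tessdata` are hypothetical:

```python
from pathlib import Path


def missing_traineddata(tessdata_dir, languages):
    """Return the languages whose <lang>.traineddata file is absent."""
    folder = Path(tessdata_dir)
    return [lang for lang in languages
            if not (folder / f"{lang}.traineddata").is_file()]


if __name__ == "__main__":
    # Hypothetical folder; replace it with your own tessdata location.
    print(missing_traineddata("tessdata", ["chi_sim", "chi_sim_vert"]))
```

An empty list means every requested language file was found.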
2.2.2 Add dependency
<dependencies>
    <dependency>
        <groupId>net.sourceforge.tess4j</groupId>
        <artifactId>tess4j</artifactId>
        <version>4.4.0</version>
    </dependency>
</dependencies>
2.2.3 Code
import java.io.File;

import net.sourceforge.tess4j.ITesseract;
import net.sourceforge.tess4j.Tesseract;
import net.sourceforge.tess4j.TesseractException;

public class TestTess {
    public static void main(String[] args) {
        // Path of the image to recognize (change to your own image path)
        String path = "C:\\work\\notebook\\prototype\\target\\qq1.jpg";
        // Location of the language data (change to your own tessdata folder)
        // String languagePath = "C:\\work\\projects\\tess\\resources\\tessdata";
        File file = new File(path);
        ITesseract instance = new Tesseract();
        // Set the location of the language data
        // instance.setDatapath(languagePath);
        // chi_sim: Simplified Chinese; use eng or another language as needed
        instance.setLanguage("chi_sim");
        String result = null;
        try {
            long startTime = System.currentTimeMillis();
            result = instance.doOCR(file);
            long endTime = System.currentTimeMillis();
            System.out.println("Time is: " + (endTime - startTime) + " ms");
        } catch (TesseractException e) {
            e.printStackTrace();
        }
        System.out.println("result: ");
        System.out.println(result);
    }
}
2.3 Reading data at specific positions in a PDF
2.3.1 Manually selecting the rectangular regions of the PDF
This needs front-end support in the form of a page for selecting the regions, or the step can be changed to automatic detection. With this approach, every newly added document type has to be configured.
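One way to keep that per-type configuration manageable is a small file that maps each document type to its named regions. A minimal sketch, assuming a JSON layout and a document type name `notice` that are both hypothetical; the coordinates are two of the `[x, y, w, h]` values used later in this article:

```python
import json

# Hypothetical layout: document type -> field name -> [x, y, w, h] in pixels.
REGION_CONFIG = """
{
  "notice": {
    "address": [61, 312, 734, 82],
    "doc_num": [1002, 338, 527, 78]
  }
}
"""


def load_regions(config_text, doc_type):
    """Return {field: (x, y, w, h)} for one document type."""
    config = json.loads(config_text)
    if doc_type not in config:
        raise KeyError(f"no regions configured for document type {doc_type!r}")
    regions = {}
    for field, box in config[doc_type].items():
        if len(box) != 4:
            raise ValueError(f"{field}: expected [x, y, w, h], got {box}")
        regions[field] = tuple(box)
    return regions


if __name__ == "__main__":
    print(load_regions(REGION_CONFIG, "notice"))
```

Adding a document type then means adding one entry to the file rather than changing code.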
2.3.2 A Python program that finds the rectangle coordinates of each field region within the whole page image
import aircv


def matchImg(imgsrc, imgobj, confidence=0.2):
    """Locate the template image imgobj inside imgsrc (template matching).

    Useful for batch-locating the same regions in pages that share a layout.

    :param imgsrc: path of the original image (str)
    :param imgobj: path of the template image to find (str)
    :param confidence: matching threshold (0 < confidence < 1.0)
    :return: None, or a dict {'confidence': similarity (float),
             'rectangle': corner coordinates on the original image (tuple),
             'result': center coordinates (tuple)}
    """
    imsrc = aircv.imread(imgsrc)
    imobj = aircv.imread(imgobj)
    # Example result: {'confidence': 0.5435812473297119, 'rectangle': ((394, 384), (394, 416), (450, 384), (450, 416)), 'result': (422.0, 400.0)}
    match_result = aircv.find_template(imsrc, imobj, confidence)
    if match_result is not None:
        # shape[0] is the height, shape[1] is the width
        match_result['shape'] = (imsrc.shape[1], imsrc.shape[0])
    return match_result


template = {
    'address': 'dz.jpg', 'doc_num': 'fw.jpg', 'doc_type': 'fwlx.jpg', 'issue_date': 'fwrq.jpg',
    'int_cls': 'splb.jpg', 'apply_num': 'sqh.jpg', 'applyer': 'sqr.jpg', 'content': 'zw.jpg'}
for key, value in template.items():
    orig = matchImg("target/qq.jpg", "target/" + value)
    if orig is None:
        # No match above the confidence threshold
        print(key, "not found")
        continue
    # Corners are ordered: top-left, bottom-left, top-right, bottom-right
    rect = orig['rectangle']
    w = rect[3][0] - rect[0][0]
    h = rect[3][1] - rect[0][1]
    x = rect[0][0]
    y = rect[0][1]
    ret = [x, y, w, h]
    print(key, ret)
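The corner-to-box conversion at the end of the loop can be checked in isolation, without aircv or any image files. A minimal sketch, where `rect_to_xywh` is a hypothetical helper and the input is the sample rectangle from the `find_template` comment above:

```python
def rect_to_xywh(corners):
    """Convert four (x, y) corner points into [x, y, width, height].

    Taking the min and max makes the result independent of corner order.
    """
    xs = [p[0] for p in corners]
    ys = [p[1] for p in corners]
    x, y = min(xs), min(ys)
    return [x, y, max(xs) - x, max(ys) - y]


if __name__ == "__main__":
    # Sample rectangle taken from the find_template comment above.
    sample = ((394, 384), (394, 416), (450, 384), (450, 416))
    print(rect_to_xywh(sample))  # [394, 384, 56, 32]
```

The resulting `[x, y, w, h]` list has the same argument order as the `java.awt.Rectangle` constructor used in the next section.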
2.3.3 A Java program that reads the text at the given rectangle coordinates
package odysssey.tess;

import java.awt.Rectangle;
import java.io.File;
import java.io.IOException;
import java.util.List;

import javax.imageio.ImageIO;

import net.sourceforge.tess4j.ITessAPI.TessPageIteratorLevel;
import net.sourceforge.tess4j.ITesseract;
import net.sourceforge.tess4j.Tesseract;
import net.sourceforge.tess4j.TesseractException;

public class TestTess {
    public static void main(String[] args) throws IOException {
        // Path of the image to recognize (change to your own image path)
        String path = "C:\\work\\notebook\\prototype\\target\\qq.jpg";
        // Location of the language data (change to your own tessdata folder)
        // String languagePath = "C:\\work\\projects\\tess\\resources\\tessdata";
        File file = new File(path);
        ITesseract instance = new Tesseract();
        // Set the location of the language data
        // instance.setDatapath(languagePath);
        /*
         * Regions printed by the Python program, as [x, y, w, h]:
         * address    [61, 312, 734, 82]
         * doc_num    [1002, 338, 527, 78]
         * doc_type   [425, 736, 801, 82]
         * issue_date [64, 593, 495, 64]
         * int_cls    [115, 969, 346, 82]
         * apply_num  [676, 589, 388, 68]
         * applyer    [72, 450, 481, 68]
         * content    [107, 899, 1439, 70]
         */
        int[][] rects = {
            {61, 312, 734, 82},
            {1002, 338, 527, 78},
            {425, 736, 801, 82},
            {64, 593, 495, 64},
            {115, 969, 346, 82},
            {676, 589, 388, 68},
            {72, 450, 481, 68},
            {107, 899, 1439, 70}};
        // chi_sim: Simplified Chinese; use eng or another language as needed
        instance.setLanguage("chi_sim");
        instance.setTessVariable("user_defined_dpi", "96");
        String result = null;
        try {
            long startTime = System.currentTimeMillis();
            // result = instance.doOCR(file);
            // Run OCR on each rectangle separately
            for (int i = 0; i < rects.length; i++) {
                Rectangle rr = new Rectangle(rects[i][0], rects[i][1], rects[i][2], rects[i][3]);
                result = instance.doOCR(file, rr);
                System.out.println(result);
            }
            /*
            List<Rectangle> resul = instance.getSegmentedRegions(ImageIO.read(file), TessPageIteratorLevel.RIL_SYMBOL);
            for (int i = 0; i < resul.size(); i++) {
                Rectangle rect = resul.get(i);
                System.out.println(String.format("Box[%d]: x=%d, y=%d, w=%d, h=%d", i, rect.x, rect.y, rect.width, rect.height));
            }
            */
            long endTime = System.currentTimeMillis();
            System.out.println("Time is: " + (endTime - startTime) + " ms");
        } catch (TesseractException e) {
            e.printStackTrace();
        }
    }
}
Copyright notice
This article was written by [An effort to XX program yuan]. Please include a link to the original when reposting. Thank you.
https://yzsam.com/2022/04/202204230239165580.html