---
title: Applying LSTM to CAPTCHA Recognition
tags:
  - LSTM
  - Python
  - tesseract
  - CAPTCHA
id: '309'
categories:
  - - Python Practice
date: 2020-07-11 20:12:46
---

[![]()](https://limour.lanzous.com/iTNKZeg7tja)

[jTessBoxEditorFX-2.3.0](https://sourceforge.net/projects/vietocr/files/jTessBoxEditor/)

[![]()](https://limour.lanzous.com/iLQ25eh2o1i)

[Pretrained data](https://tesseract-ocr.github.io/tessdoc/Data-Files)

```
#For CentOS 7 run the following as root to install Tesseract with English language traineddata:
yum -y install yum-utils
yum-config-manager --add-repo https://download.opensuse.org/repositories/home:/Alexander_Pozdnyakov/CentOS_7/
sudo rpm --import https://build.opensuse.org/projects/home:Alexander_Pozdnyakov/public_key
yum update
yum install tesseract
yum install tesseract-langpack-eng
```

```
#For Win10 to install Tesseract:
1. Download and extract jTessBoxEditor
2. Add {extraction directory}\jTessBoxEditorFX\tesseract-ocr to Path
3. Download and extract the pretrained data into the current directory
4. Create a new environment variable TESSDATA_PREFIX with the value {extraction directory}\tessdata
```

![](https://img-cdn.limour.top/blog_wp/2020/07/微信图片_20200710090821.png)

Run `tesseract --help-extra` in a terminal. If it prints the information shown above, the installation succeeded.

[![]()](https://limour.lanzous.com/icIheeh321c)

Collect the CAPTCHA images needed for training yourself.

Generate the `fdu.ufont.exp0.tif` file by following the method in [Xiao Pengwei (肖鹏伟)'s "Tesseract-OCR-04: Using jTessBoxEditor to Improve Text Recognition Accuracy"](https://blog.csdn.net/qq_40147863/article/details/82290015).

```
# Generate the fdu.ufont.exp0.box file with this command
tesseract fdu.ufont.exp0.tif fdu.ufont.exp0 -l eng --psm 8 --oem 0 nobatch box.train -c tessedit_char_whitelist=abcdefghijklmnopqrstuvwxyzAT-
```

Then correct the `.box` file, again following Xiao Pengwei's method.

```
# Save fdu.ufont.exp0.tif and the corrected fdu.ufont.exp0.box together in a separate new folder,
# then run this .ps1 file in that folder to obtain fdu.traineddata
tesseract fdu.ufont.exp0.tif fdu.ufont.exp0 -l enb --psm 8 lstm.train
combine_tessdata -e "$env:TESSDATA_PREFIX\enb.traineddata" enb.lstm
$PSroot = Get-ChildItem
$PSroot = Split-Path $PSroot.Get(0).FullName
$fso=New-Object -ComObject Scripting.FileSystemObject
$fso.CreateTextFile('fdu.training_files.txt',2).Write("$PSroot\fdu.ufont.exp0.lstmf" )
if (-not (Test-Path -Path output)){mkdir output}
lstmtraining --model_output="$PSroot\output\output" --continue_from="$PSroot\enb.lstm" --train_listfile="$PSroot\fdu.training_files.txt" --traineddata="$env:TESSDATA_PREFIX\enb.traineddata" --debug_interval -1 --target_error_rate 0.001
lstmtraining --stop_training --continue_from="$PSroot\output\output_checkpoint" --traineddata="$env:TESSDATA_PREFIX\enb.traineddata" --model_output="$PSroot\fdu.traineddata"
```

[![]()](https://limour.lanzous.com/ifW3jeireib)

The final result is shown above.

Move the resulting `fdu.traineddata` file into the `tessdata` folder; it can then be used by passing `-l fdu`.

```
# This program gives a quick check of the training result
from PIL import Image
#from itertools import cycle
import os, random, re
import pytesseract
from IPython.display import display  # already available in Jupyter; imported so the name is explicit

fl = re.compile(r'[a-zA-Z-]+')

def clearStr(text):
    # Keep only letters and hyphens from the OCR output
    return ''.join(fl.findall(text))

class Fileset(list):
    """A list of file names under one directory, optionally opened with _read."""
    def __init__(self, name, ext='', _read=None, root=None):
        if isinstance(name, str):
            self.root = os.path.join(root or os.getcwd(), name)
            self.extend(f for f in os.listdir(self.root) if f.endswith(ext))
        self._read = _read

    def __getitem__(self, index):
        if isinstance(index, int):  # index is an integer position
            return os.path.join(self.root, super().__getitem__(index))
        else:  # index is a slice
            fileset = Fileset(None)
            fileset.root = self.root
            fileset._read = self._read
            fileset.extend(super().__getitem__(index))
            return fileset

    def getFileName(self, index):
        fname, ext = os.path.splitext(super().__getitem__(index))
        return fname

    def __iter__(self):
        if self._read:
            return (self._read(os.path.join(self.root, f)) for f in super().__iter__())
        else:
            return (os.path.join(self.root, f) for f in super().__iter__())

    def __call__(self):
        # Return one random file, opened with _read if a reader was given
        retn = random.choice(self)
        if self._read:
            return self._read(retn)
        else:
            return retn

# def fopen(path):
#     with open(path, 'rb') as f:
#         return f.read()
# #from tesOCR import tesOCR as OCR1
# sample = Fileset('Captcha', '.jpg', fopen)

sample = Fileset('Captcha', '.jpg', Image.open)

# OCR1: the newly trained LSTM model (fdu)
config1 = '--psm 8'
def OCR1(img):
    return pytesseract.image_to_string(img, lang='fdu', config=config1)

# OCR2: the stock Legacy engine (eng) restricted by a character whitelist
config2 = "--psm 8 --oem 0 -c tessedit_char_whitelist=abcdefghijklmnopqrstuvwxyzAT-"
def OCR2(img):
    return pytesseract.image_to_string(img, lang='eng', config=config2)

# Show only the images on which the two engines disagree
for a in sample:
    b = a.convert("L")  # convert to grayscale before recognition
    x = clearStr(OCR1(b))
    y = clearStr(OCR2(b))
    if x != y:
        display(a)
        print(f"LSTM is {x} ; Legacy is {y}")
```

[![]()](https://limour.lanzous.com/imK6me9lurg)

My results and the Python wrapper.

Notes:

1. In [jTessBoxEditor](https://sourceforge.net/projects/vietocr/files/jTessBoxEditor/), the FX in the name indicates Chinese support.
2. Of the pretrained data files, the 22.3 MB one is the Legacy data and the 14.6 MB one is the LSTM data; both are for the language eng.
3. The value after "tessedit\_char\_whitelist=" is the set of characters that may appear in the CAPTCHA.
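
The test script hard-codes the character set twice: once in the `tessedit_char_whitelist` option and once, more loosely, in the `[a-zA-Z-]+` regex used to clean the output. The following is a minimal sketch, not part of the original script, of keeping that set in one place so the config string and the output filter cannot drift apart. The names `CAPTCHA_CHARS`, `build_config`, and `keep_allowed` are hypothetical helpers introduced here for illustration; they are not pytesseract APIs. The sketch assumes, as in the script, that the whitelist is only passed to the Legacy engine (`--oem 0`).

```python
import string

# Characters that may occur in the CAPTCHA, copied from the whitelist used in the script.
CAPTCHA_CHARS = string.ascii_lowercase + "AT-"


def build_config(chars: str, psm: int = 8, legacy: bool = False) -> str:
    """Assemble a Tesseract config string (hypothetical helper).

    With legacy=True the result matches config2 from the script;
    otherwise it matches config1.
    """
    parts = [f"--psm {psm}"]
    if legacy:
        parts.append("--oem 0")
        parts.append(f"-c tessedit_char_whitelist={chars}")
    return " ".join(parts)


def keep_allowed(text: str, chars: str = CAPTCHA_CHARS) -> str:
    """Drop every character that is not in the allowed set (hypothetical helper).

    This is stricter than the regex in the script, which lets any
    ASCII letter through.
    """
    allowed = set(chars)
    return "".join(c for c in text if c in allowed)


if __name__ == "__main__":
    print(build_config(CAPTCHA_CHARS))               # LSTM-style config
    print(build_config(CAPTCHA_CHARS, legacy=True))  # Legacy-style config
    print(keep_allowed("ab-C d\n"))                  # 'C', space and newline are dropped
```

In the script, `config2` could then be written as `build_config(CAPTCHA_CHARS, legacy=True)` and `clearStr` replaced by `keep_allowed`, so changing the CAPTCHA alphabet means editing only `CAPTCHA_CHARS`.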