title: Applying LSTM to CAPTCHA Recognition tags:
#For CentOS 7 run the following as root to install Tesseract with English language traineddata:
yum -y install yum-utils
yum-config-manager --add-repo https://download.opensuse.org/repositories/home:/Alexander_Pozdnyakov/CentOS_7/
sudo rpm --import https://build.opensuse.org/projects/home:Alexander_Pozdnyakov/public_key
yum update
yum install tesseract
yum install tesseract-langpack-eng
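#To install Tesseract on Windows 10:
1. Download and extract jTessBoxEditor
2. Add {extraction directory}\jTessBoxEditorFX\tesseract-ocr to Path
3. Download the pre-trained data and extract it into the current directory
4. Create a new environment variable TESSDATA_PREFIX with the value {extraction directory}\tessdata
Run the command tesseract --help-extra in a terminal; if it prints the information shown above, the installation succeeded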
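The installation steps above can also be checked from Python before any training starts. The following is a minimal sketch, not part of the original setup: the function name `check_tesseract_setup` is hypothetical, and it only verifies that a `tesseract` executable is reachable through `PATH` and that `TESSDATA_PREFIX` points to an existing directory.

```python
import os
import shutil


def check_tesseract_setup(env=None):
    """Return a list of problems; an empty list means the setup looks fine.

    `env` is a mapping like os.environ (hypothetical parameter, useful for testing).
    """
    env = os.environ if env is None else env
    problems = []

    # Step 2 of the list above: the tesseract executable must be on PATH
    if shutil.which("tesseract", path=env.get("PATH")) is None:
        problems.append("tesseract executable not found on PATH")

    # Step 4 of the list above: TESSDATA_PREFIX must point to the tessdata folder
    prefix = env.get("TESSDATA_PREFIX")
    if not prefix:
        problems.append("TESSDATA_PREFIX is not set")
    elif not os.path.isdir(prefix):
        problems.append(f"TESSDATA_PREFIX is not a directory: {prefix}")

    return problems


if __name__ == "__main__":
    for problem in check_tesseract_setup():
        print("Problem:", problem)
```

An empty result does not prove that the language data itself is valid; it only catches the two most common configuration mistakes.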
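Collect the CAPTCHA images needed for training yourself
Follow the method in Xiao Pengwei's (肖鹏伟) article 《Tesseract-OCR-04-使用 jTessBoxEditor提高文字识别准确率》 (Tesseract-OCR-04: Using jTessBoxEditor to Improve Text Recognition Accuracy) to generate the file fdu.ufont.exp0.tif
#Generate the fdu.ufont.exp0.box file with this command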
tesseract fdu.ufont.exp0.tif fdu.ufont.exp0 -l eng --psm 8 --oem 0 nobatch box.train -c tessedit_char_whitelist=abcdefghijklmnopqrstuvwxyzAT-
Continue following Xiao Pengwei's method to correct the .box file
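Before or after correcting the box file by hand, a quick automated pass can flag lines that are likely wrong. The sketch below is an illustration only: it assumes the classic per-character box format (`symbol left bottom right top page`, separated by single spaces) and reuses the whitelist from the command above. The function name `find_suspect_box_lines` is hypothetical.

```python
# Whitelist taken from the tessedit_char_whitelist value in the command above
ALLOWED = set("abcdefghijklmnopqrstuvwxyzAT-")


def find_suspect_box_lines(box_text, allowed=ALLOWED):
    """Return (line_number, line, reason) for every box line that looks wrong."""
    suspects = []
    for lineno, line in enumerate(box_text.splitlines(), start=1):
        if not line.strip():
            continue  # ignore empty lines
        parts = line.split(" ")
        # Assumed layout: symbol left bottom right top page
        if len(parts) != 6:
            suspects.append((lineno, line, "expected 6 space-separated fields"))
            continue
        symbol, *numbers = parts
        if not all(n.isdigit() for n in numbers):
            suspects.append((lineno, line, "coordinates must be non-negative integers"))
        elif symbol not in allowed:
            suspects.append((lineno, line, "symbol is outside the whitelist"))
    return suspects


if __name__ == "__main__":
    with open("fdu.ufont.exp0.box", encoding="utf-8") as f:
        for lineno, line, reason in find_suspect_box_lines(f.read()):
            print(f"line {lineno}: {reason}: {line}")
```

This only narrows down where to look in jTessBoxEditor; it cannot tell whether a whitelisted symbol was assigned to the wrong glyph.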
#Save fdu.ufont.exp0.tif and the corrected fdu.ufont.exp0.box together in a new, separate folder, then run this .ps1 script in that folder to obtain fdu.traineddata
tesseract fdu.ufont.exp0.tif fdu.ufont.exp0 -l enb --psm 8 lstm.train
combine_tessdata -e "$env:TESSDATA_PREFIX\enb.traineddata" enb.lstm
#Use the current directory as the working root
$PSroot = (Get-Location).Path
$fso=New-Object -ComObject Scripting.FileSystemObject
$fso.CreateTextFile('fdu.training_files.txt',2).Write("$PSroot\fdu.ufont.exp0.lstmf" )
if (-not (Test-Path -Path output)){mkdir output}
lstmtraining --model_output="$PSroot\output\output" --continue_from="$PSroot\enb.lstm" --train_listfile="$PSroot\fdu.training_files.txt" --traineddata="$env:TESSDATA_PREFIX\enb.traineddata" --debug_interval -1 --target_error_rate 0.001
lstmtraining --stop_training --continue_from="$PSroot\output\output_checkpoint" --traineddata="$env:TESSDATA_PREFIX\enb.traineddata" --model_output="$PSroot\fdu.traineddata"
The final result is as shown above
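For readers who are not on PowerShell, the same four-step pipeline can be sketched in Python. This is an assumption-laden illustration, not the author's script: the function name `build_training_commands` and its parameters are hypothetical, it only assembles the command lines using the flags that appear in the script above, and it assumes the commands are run from the folder that holds the .tif and .box files.

```python
from pathlib import Path


def build_training_commands(workdir, tessdata_prefix, name="fdu", base="enb"):
    """Return the fine-tuning pipeline as a list of argument lists."""
    work = Path(workdir)
    base_traineddata = Path(tessdata_prefix) / f"{base}.traineddata"
    stem = f"{name}.ufont.exp0"
    return [
        # 1. Turn the .tif/.box pair into an .lstmf training file
        ["tesseract", f"{stem}.tif", stem, "-l", base, "--psm", "8", "lstm.train"],
        # 2. Extract the LSTM network from the base model
        ["combine_tessdata", "-e", str(base_traineddata), f"{base}.lstm"],
        # 3. Fine-tune that network on the files named in the list file
        ["lstmtraining",
         f"--model_output={work / 'output' / 'output'}",
         f"--continue_from={work / (base + '.lstm')}",
         f"--train_listfile={work / (name + '.training_files.txt')}",
         f"--traineddata={base_traineddata}",
         "--debug_interval", "-1",
         "--target_error_rate", "0.001"],
        # 4. Convert the last checkpoint into a usable .traineddata file
        ["lstmtraining", "--stop_training",
         f"--continue_from={work / 'output' / 'output_checkpoint'}",
         f"--traineddata={base_traineddata}",
         f"--model_output={work / (name + '.traineddata')}"],
    ]


if __name__ == "__main__":
    for cmd in build_training_commands(".", "tessdata"):
        print(" ".join(cmd))
```

Each list could be passed to `subprocess.run(cmd, check=True)`. The list file (fdu.training_files.txt, containing the path of the .lstmf file) and the output folder still have to be created between steps 1 and 3, as the script above does.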
Move the resulting fdu.traineddata file into the tessdata folder; it can then be used by passing the parameter -l fdu
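The installation step can be scripted as well. The sketch below is only an illustration: the helper `install_traineddata` is a hypothetical name, and it copies the file instead of moving it so that the training output stays in place. It returns the language code to pass to `-l`.

```python
import shutil
from pathlib import Path


def install_traineddata(model_path, tessdata_dir):
    """Copy a .traineddata file into the tessdata folder and return its language code."""
    model = Path(model_path)
    target_dir = Path(tessdata_dir)
    if model.suffix != ".traineddata":
        raise ValueError(f"not a .traineddata file: {model}")
    if not target_dir.is_dir():
        raise FileNotFoundError(f"tessdata folder does not exist: {target_dir}")
    # Copy rather than move, so the original training result is kept
    shutil.copy2(model, target_dir / model.name)
    # The file name without its extension is the value for -l
    return model.stem
```

For example, `install_traineddata("fdu.traineddata", os.environ["TESSDATA_PREFIX"])` would return `"fdu"`, assuming `TESSDATA_PREFIX` is set as described in the Windows installation steps.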
#This program gives a quick check of the training result
from PIL import Image
from IPython.display import display  # display() below assumes a Jupyter/IPython session
#from itertools import cycle
import os, random, re
import pytesseract

# Keep only the characters that can appear in these CAPTCHAs
fl = re.compile(r'[a-zA-Z-]+')

def clearStr(text):
    return ''.join(fl.findall(text))

class Fileset(list):
    """A list of file names in one directory; indexing returns full paths."""
    def __init__(self, name, ext='', _read=None, root=None):
        if isinstance(name, str):
            self.root = os.path.join(root or os.getcwd(), name)
            self.extend(f for f in os.listdir(self.root) if f.endswith(ext))
        self._read = _read

    def __getitem__(self, index):
        if isinstance(index, int):  # index is an integer position
            return os.path.join(self.root, super().__getitem__(index))
        else:  # index is a slice
            fileset = Fileset(None)
            fileset.root = self.root
            fileset._read = self._read
            fileset.extend(super().__getitem__(index))
            return fileset

    def getFileName(self, index):
        fname, ext = os.path.splitext(super().__getitem__(index))
        return fname

    def __iter__(self):
        # Yield loaded objects if a reader was given, otherwise full paths
        if self._read: return (self._read(os.path.join(self.root, f)) for f in super().__iter__())
        else: return (os.path.join(self.root, f) for f in super().__iter__())

    def __call__(self):
        # Return one random sample
        retn = random.choice(self)
        if self._read: return self._read(retn)
        else: return retn

# def fopen(path):
#     with open(path, 'rb') as f:
#         return f.read()
# #from tesOCR import tesOCR as OCR1
# sample = Fileset('Captcha', '.jpg', fopen)
sample = Fileset('Captcha', '.jpg', Image.open)

# OCR1: the fine-tuned LSTM model
config1 = '--psm 8'
def OCR1(img):
    return pytesseract.image_to_string(img, lang='fdu', config=config1)

# OCR2: the legacy engine with a character whitelist
config2 = "--psm 8 --oem 0 -c tessedit_char_whitelist=abcdefghijklmnopqrstuvwxyzAT-"
def OCR2(img):
    return pytesseract.image_to_string(img, lang='eng', config=config2)

# Show only the images on which the two engines disagree
for a in sample:
    b = a.convert("L")
    x = clearStr(OCR1(b))
    y = clearStr(OCR2(b))
    if x != y:
        display(a)
        print(f"LSTM is {x} ; Legacy is {y}")
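The loop above only shows where the two engines disagree. If the correct text of each CAPTCHA is known, the comparison can be turned into numbers. The following is a sketch under an explicit assumption that the original post does not confirm: that a ground-truth label is available for every image (for example from the file name). The names `clean` and `score` are hypothetical.

```python
import re

# Same character filter as in the program above
_VALID = re.compile(r'[a-zA-Z-]+')


def clean(text):
    """Drop whitespace and any character that cannot occur in the CAPTCHAs."""
    return ''.join(_VALID.findall(text))


def score(records):
    """Summarize (label, lstm_text, legacy_text) triples.

    Assumption: `label` is the known correct text of the image.
    """
    total = lstm_ok = legacy_ok = agree = 0
    for label, lstm_text, legacy_text in records:
        x, y = clean(lstm_text), clean(legacy_text)
        total += 1
        lstm_ok += (x == label)
        legacy_ok += (y == label)
        agree += (x == y)
    if total == 0:
        return {"total": 0, "lstm_accuracy": 0.0, "legacy_accuracy": 0.0, "agreement": 0.0}
    return {
        "total": total,
        "lstm_accuracy": lstm_ok / total,
        "legacy_accuracy": legacy_ok / total,
        "agreement": agree / total,
    }
```

With the program above, the triples could be collected as `(label, OCR1(b), OCR2(b))` inside the loop, where `label` comes from wherever the correct answers are stored.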
My results and the Python wrapper
Notes: