

---
title: Applying LSTM to CAPTCHA Recognition
tags:
  - LSTM
  - Python
  - tesseract
  - CAPTCHA
id: '309'
categories:
  - - Python Practice
date: 2020-07-11 20:12:46
---

jTessBoxEditorFX-2.3.0

Pretrained data

#For CentOS 7 run the following as root to install Tesseract with English language traineddata:
yum -y install yum-utils
yum-config-manager --add-repo https://download.opensuse.org/repositories/home:/Alexander_Pozdnyakov/CentOS_7/
sudo rpm --import https://build.opensuse.org/projects/home:Alexander_Pozdnyakov/public_key
yum update
yum install tesseract 
yum install tesseract-langpack-eng
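#For Windows 10, install Tesseract as follows:
1. Download and extract jTessBoxEditor
2. Add {extract directory}\jTessBoxEditorFX\tesseract-ocr to Path
3. Download the pretrained data and extract it into the current directory
4. Create the environment variable TESSDATA_PREFIX with the value {extract directory}\tessdata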

Run tesseract --help-extra in a terminal. If it prints the extended help text, the installation succeeded.
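The same check can be sketched in Python before any training starts. This is only an illustration: `check_tesseract_setup` is a hypothetical helper, not part of Tesseract or pytesseract, and it assumes the traineddata files sit directly in the folder named by TESSDATA_PREFIX, as in the Windows steps above.

```python
import os
import shutil
from pathlib import Path


def check_tesseract_setup(env=None, langs=("eng",)):
    """Return a list of problems found; an empty list means the setup looks fine."""
    env = os.environ if env is None else env
    problems = []

    # The tesseract executable must be reachable through PATH.
    if shutil.which("tesseract", path=env.get("PATH")) is None:
        problems.append("tesseract executable not found on PATH")

    # TESSDATA_PREFIX must point at the folder holding <lang>.traineddata.
    prefix = env.get("TESSDATA_PREFIX")
    if not prefix:
        problems.append("TESSDATA_PREFIX is not set")
    else:
        for lang in langs:
            if not (Path(prefix) / f"{lang}.traineddata").is_file():
                problems.append(f"{lang}.traineddata missing from {prefix}")
    return problems


if __name__ == "__main__":
    for problem in check_tesseract_setup():
        print("problem:", problem)
```

Calling `check_tesseract_setup(langs=("eng", "fdu"))` after training would additionally confirm that the new model has been copied into place.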

Collect the CAPTCHA images needed for training yourself.

Follow the method in Xiao Pengwei's article 《Tesseract-OCR-04-使用 jTessBoxEditor提高文字识别准确率》 to generate the file fdu.ufont.exp0.tif.

#Generate the file fdu.ufont.exp0.box with this command
tesseract fdu.ufont.exp0.tif fdu.ufont.exp0 -l eng --psm 8 --oem 0 nobatch box.train -c tessedit_char_whitelist=abcdefghijklmnopqrstuvwxyzAT-

Continue with Xiao Pengwei's method to correct the .box file.
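Before correcting boxes by hand, a quick scan of the .box file can point to the lines that need attention. The sketch below assumes the classic per-character box format, one `symbol left bottom right top page` entry per line; `parse_box_line` and `find_suspicious_boxes` are hypothetical helper names, and the whitelist simply repeats the one from the command above.

```python
WHITELIST = "abcdefghijklmnopqrstuvwxyzAT-"  # same characters as tessedit_char_whitelist


def parse_box_line(line):
    """Split one box-file line into (symbol, left, bottom, right, top, page)."""
    symbol, *numbers = line.rstrip("\r\n").rsplit(" ", 5)
    left, bottom, right, top, page = (int(n) for n in numbers)
    return symbol, left, bottom, right, top, page


def find_suspicious_boxes(lines, whitelist=WHITELIST):
    """Return (line_number, reason) pairs for boxes worth checking by hand."""
    allowed = set(whitelist)
    problems = []
    for number, line in enumerate(lines, start=1):
        if not line.strip():
            continue  # skip blank lines
        symbol, left, bottom, right, top, _page = parse_box_line(line)
        if symbol not in allowed:
            problems.append((number, f"symbol {symbol!r} not in whitelist"))
        elif right <= left or top <= bottom:
            problems.append((number, "empty box"))
    return problems


if __name__ == "__main__":
    with open("fdu.ufont.exp0.box", encoding="utf-8") as box_file:
        for number, reason in find_suspicious_boxes(box_file):
            print(f"line {number}: {reason}")
```

The scan only flags candidates; whether a box is actually wrong still has to be judged in jTessBoxEditor.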

#Save fdu.ufont.exp0.tif and the corrected fdu.ufont.exp0.box together in a new, separate folder, then run this .ps1 script from that folder to produce fdu.traineddata
tesseract fdu.ufont.exp0.tif fdu.ufont.exp0 -l enb --psm 8 lstm.train
combine_tessdata -e "$env:TESSDATA_PREFIX\enb.traineddata" enb.lstm
$PSroot = Get-ChildItem
$PSroot = Split-Path $PSroot.Get(0).FullName
$fso=New-Object -ComObject Scripting.FileSystemObject
$fso.CreateTextFile('fdu.training_files.txt',2).Write("$PSroot\fdu.ufont.exp0.lstmf" )
if (-not (Test-Path -Path output)){mkdir output}
lstmtraining --model_output="$PSroot\output\output" --continue_from="$PSroot\enb.lstm" --train_listfile="$PSroot\fdu.training_files.txt" --traineddata="$env:TESSDATA_PREFIX\enb.traineddata" --debug_interval -1 --target_error_rate 0.001
lstmtraining --stop_training --continue_from="$PSroot\output\output_checkpoint" --traineddata="$env:TESSDATA_PREFIX\enb.traineddata" --model_output="$PSroot\fdu.traineddata"

The final result is as shown above.

Move the resulting fdu.traineddata file into the tessdata folder; it can then be used with the option -l fdu.
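For readers not on PowerShell, the training script above can be sketched in Python. The command names and flags are copied from that script; `build_training_commands` and `prepare_workdir` are hypothetical helpers, and the defaults (`enb` as the starting model, `fdu` as the new one) are taken from the script rather than from any Tesseract convention.

```python
from pathlib import Path


def build_training_commands(workdir, tessdata, base="fdu.ufont.exp0",
                            start_lang="enb", new_lang="fdu"):
    """Return the four command lines of the fine-tuning run as argument lists."""
    workdir, tessdata = Path(workdir), Path(tessdata)
    start_model = tessdata / f"{start_lang}.traineddata"
    lstm = workdir / f"{start_lang}.lstm"
    listfile = workdir / f"{new_lang}.training_files.txt"
    checkpoints = workdir / "output" / "output"
    return [
        # 1. Convert the .tif/.box pair into an .lstmf training file.
        ["tesseract", f"{base}.tif", base, "-l", start_lang, "--psm", "8", "lstm.train"],
        # 2. Extract the LSTM network from the starting model.
        ["combine_tessdata", "-e", str(start_model), str(lstm)],
        # 3. Fine-tune from that network until the target error rate is reached.
        ["lstmtraining", f"--model_output={checkpoints}", f"--continue_from={lstm}",
         f"--train_listfile={listfile}", f"--traineddata={start_model}",
         "--debug_interval", "-1", "--target_error_rate", "0.001"],
        # 4. Pack the last checkpoint into a new .traineddata file.
        ["lstmtraining", "--stop_training",
         f"--continue_from={checkpoints}_checkpoint",
         f"--traineddata={start_model}",
         f"--model_output={workdir / (new_lang + '.traineddata')}"],
    ]


def prepare_workdir(workdir, base="fdu.ufont.exp0", new_lang="fdu"):
    """Create output/ and the list file that names the .lstmf training file."""
    workdir = Path(workdir)
    (workdir / "output").mkdir(exist_ok=True)
    listfile = workdir / f"{new_lang}.training_files.txt"
    listfile.write_text(f"{workdir / (base + '.lstmf')}\n")
    return listfile
```

A run would call `prepare_workdir(workdir)` and then pass each list to `subprocess.run(cmd, cwd=workdir, check=True)` in order, with `tessdata` set to the value of TESSDATA_PREFIX.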

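#A simple script for judging the training result
from PIL import Image
from IPython.display import display  # display() below assumes a Jupyter/IPython session
#from itertools import cycle
import os, random, re
import pytesseract
fl = re.compile(r'[a-zA-Z-]+')
def clearStr(text):
    # keep only letters and hyphens from the OCR output
    return ''.join(fl.findall(text))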

class Fileset(list):
    def __init__(self, name,  ext='', _read=None, root=None):
        if isinstance(name, str)  :
            self.root = os.path.join(root or os.getcwd(), name)
            self.extend(f for f in os.listdir(self.root) if f.endswith(ext))
            self._read = _read
    def __getitem__(self, index):
        if isinstance(index, int):# index is a single position
            return os.path.join(self.root, super().__getitem__(index))
        else:# index is a slice
            fileset = Fileset(None)
            fileset.root = self.root
            fileset._read = self._read
            fileset.extend(super().__getitem__(index))
            return fileset
    def getFileName(self, index):
        fname, ext = os.path.splitext(super().__getitem__(index))
        return fname
    def __iter__(self):
        if self._read: return (self._read(os.path.join(self.root, f)) for f in super().__iter__())
        else: return (os.path.join(self.root, f) for f in super().__iter__())
    def __call__(self):
        retn = random.choice(self)
        if self._read: return self._read(retn)
        else: return retn

# def fopen(path):
    # with open(path, 'rb') as f:
        # return f.read()
# #from tesOCR import tesOCR as OCR1
# sample = Fileset('Captcha', '.jpg', fopen)
sample = Fileset('Captcha', '.jpg', Image.open)

config1 = '--psm 8'
def OCR1(img):
     return pytesseract.image_to_string(img, lang='fdu', config=config1)

config2 = "--psm 8 --oem 0 -c tessedit_char_whitelist=abcdefghijklmnopqrstuvwxyzAT-"
def OCR2(img):
     return pytesseract.image_to_string(img, lang='eng', config=config2)

for a in sample:
    b = a.convert("L")
    x = clearStr(OCR1(b))
    y = clearStr(OCR2(b))
    if x != y:
        display(a)
        print(f"LSTM is {x} ; Legacy is {y}")
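The loop above only shows the images on which the two engines disagree. If ground-truth labels are available (for example, if each file name is the CAPTCHA text, which is an assumption and not something the script states), the comparison can be turned into numbers. `exact_match_accuracy` and `compare_engines` are hypothetical helpers for this sketch.

```python
def exact_match_accuracy(predictions, truths):
    """Fraction of positions where the prediction equals the expected string."""
    if len(predictions) != len(truths):
        raise ValueError("predictions and truths must have the same length")
    if not truths:
        return 0.0
    hits = sum(p == t for p, t in zip(predictions, truths))
    return hits / len(truths)


def compare_engines(lstm_out, legacy_out, truths):
    """Summarize both engines against the labels, plus how often they agree."""
    return {
        "lstm": exact_match_accuracy(lstm_out, truths),
        "legacy": exact_match_accuracy(legacy_out, truths),
        "agree": exact_match_accuracy(lstm_out, legacy_out),
    }


if __name__ == "__main__":
    # Made-up strings for illustration only, not real recognition results.
    print(compare_engines(["abcd", "aT-x"], ["abcd", "aTx"], ["abcd", "aT-x"]))
```

In the script above, the two prediction lists would be filled with `x` and `y` inside the loop; the labels could come from `sample.getFileName(i)` under the file-name assumption.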

My results and Python wrapper

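Notes:

  1. The "FX" in jTessBoxEditorFX indicates the build that supports Chinese.
  2. Of the pretrained data files, the 22.3 MB one is the Legacy data and the 14.6 MB one is the LSTM data; both are for the language eng.
  3. The text after "tessedit_char_whitelist=" is the set of characters that can appear in the CAPTCHAs.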