项目介绍

验证码识别 - 该项目是基于 CNN5/ResNet+BLSTM/LSTM/GRU/SRU/BSRU+CTC 来实现验证码识别. 该项目仅用于训练，如果需要部署模型请移步：

https://github.com/kerlomz/captcha_platform （通用WEB服务，HTTP请求调用）

https://github.com/kerlomz/captcha_library_c （动态链接库，DLL调用，基于TensoFlow C++）

https://github.com/kerlomz/captcha_demo_csharp （C#源码调用，基于TensorFlowSharp）

许多人问我，部署识别也需要GPU吗？我的答案是，完全没必要。理想中是用GPU训练，使用CPU部署识别服务，部署如果也需要这么高的成本，那还有什么现实意义和应用场景呢，实测阿里云最低配1核1G的配置识别1次大约30ms，我的i7-8700k大约10-15ms之间。

注意事项

如何使用CPU训练：

本项目默认安装TensorFlow-GPU版，建议使用GPU进行训练，如需换用CPU训练请替换 requirements.txt 文件中的tensorflow-gpu==1.6.0 为tensorflow==1.6.0，其他无需改动。
关于LSTM网络:

保证CNN得到的featuremap输入到LSTM时的宽度至少大于等于最大字符数的3倍左右，即time_step大于等于最大字符数3倍。
No valid path found 问题解决：

在model.yaml中修改Pretreatment->Resize的参数，自行调整为合适的值，总结了百来个验证码训练经验，可以尝试这个较为通用的值：Resize: [150, 50]，或者使用代码tutorial.py （自动生成配置文件、打包样本、训练一体化），填写训练集路径执行。
参数修改：

切记，ModelName 是绑定一个模型的唯一标志，如果修改了训练参数如：ImageWidth，ImageHeight，Resize，CharSet，CNNNetwork，RecurrentNetwork，HiddenNum 这类影响计算图的参数，需要删除model路径下的旧文件，重新训练，或者使用新的ModelName 重新训练，否则默认作为断点续练。

准备工作

如果你准备使用GPU训练，请先安装CUDA和cuDNN，可以了解下官方测试过的编译版本对应: https://www.tensorflow.org/install/install_sources#tested_source_configurations Github上可以下载到第三方编译好的TensorFlow的WHL安装包：

https://github.com/fo40225/tensorflow-windows-wheel

CUDA下载地址：https://developer.nvidia.com/cuda-downloads

cuDNN下载地址：https://developer.nvidia.com/rdp/form/cudnn-download-survey （需要注册账号）

笔者使用的版本为：CUDA10+cuDNN7.3.1+TensorFlow 1.12

环境安装

安装Python 3.6 环境（包含pip）
安装虚拟环境 virtualenv pip3 install virtualenv

为该项目创建独立的虚拟环境:

virtualenv -p /usr/bin/python3 venv # venv is the name of the virtual environment.
cd venv/ # venv is the name of the virtual environment.
source bin/activate # to activate the current virtual environment.
cd captcha_trainer # captcha_trainer is the project path.

安装本项目的依赖列表：pip install -r requirements.txt

开始

1. 架构与流程

本项目依赖于训练配置config.yaml和模型配置model.yaml，初始化项目的时候请复制config_demo.yaml到当前目录下命名为config.yaml，model_demo.yaml同理。或者可以使用tutorial.py 自动设置模型配置。

训练流程：配置好两个配置文件后，执行trains.py 中的代码，读取配置，根据model.yaml配置文件构建神经网络计算图，依据config.yaml的配置参数进行训练。

关于config.yaml中的训练参数有几点建议：

BatchSize（训练批次大小）与TestBatchSize（测试批次大小）是需要大家关注的，建议根据显卡条件进行调整，显存小的建议BatchSize不要太大，TestBatchSize也是，我提供的默认配置是基于显存8G，使用率50%设置的，请悉知。
LearningRate（学习率）也是需要关注的，深度学习本质就是调参，一般的模型可以保持默认的配置无需调整，有些模型想要获得更高的识别精度可以先使用0.01快速收敛，准确率差不多95%左右再使用0.001/0.0001提高精度。
TestSetNum（测试集数目），这个是专门为懒人（说我自己）设计提供的，根据给定的测试集数目切割训练集，有一个前提，测试集必须是随机的，随机的，随机的，重要的事说三遍，有些人用Windows资源管理器打开，一拖动选择几百个，默认都是按名称排序的，如果名称是标注，那么就不是随机了，也就是很可能你取的测试集是标注为0~3之间的图片，这样可能导致永远无法收敛。

TrainRegex 和 TestRegex，正则匹配，请各位采集样本的时候，尽量和我给的示例保持一致吧，正则问题请谷歌，如果是为1111.jpg这种命名的话，这里提供了一个批量转换的代码：

import re
import os
import hashlib

# 训练集路径
root = r"D:\TrainSet\***"
all_files = os.listdir(root)

for file in all_files:
    old_path = os.path.join(root, file)
    
    # 已被修改过忽略
    if len(file.split(".")[0]) > 32:
        continue
    
    # 采用标注_文件md5码.图片后缀 进行命名
    with open(old_path, "rb") as f:
        _id = hashlib.md5(f.read()).hexdigest()
    new_path = os.path.join(root, file.replace(".", "_{}.".format(_id)))
    
    # 重复标签的时候会出现形如：abcd (1).jpg 这种形式的文件名
    new_path = re.sub(" \(\d+\)", "", new_path)
    print(new_path)
    os.rename(old_path, new_path)

2. 配置化

config.yaml - System Config

# - requirement.txt  -  GPU: tensorflow-gpu, CPU: tensorflow
# - If you use the GPU version, you need to install some additional applications.
# TrainRegex and TestRegex: Default matching apple_20181010121212.jpg file.
# - The Default is .*?(?=_.*\.)
# TrainsPath and TestPath: The local absolute path of your training and testing set.
# TestSetNum: This is an optional parameter that is used when you want to extract some of the test set
# - from the training set when you are not preparing the test set separately.
System:
  DeviceUsage: 0.5
  TrainRegex: '.*?(?=_)'
  TestRegex: '.*?(?=_)'
  TestSetNum: 300

# CNNNetwork: [CNN5, DenseNet]
# RecurrentNetwork: [BLSTM, LSTM]
# - The recommended configuration is CNN5+BLSTM / DenseNet+BLSTM
# HiddenNum: [64, 128, 256]
# - This parameter indicates the number of nodes used to remember and store past states.
NeuralNet:
  CNNNetwork: CNN5
  RecurrentNetwork: BLSTM
  HiddenNum: 64
  KeepProb: 0.99

# SavedEpochs: A Session.run() execution is called a Epochs,
# - Used to save traininsed to calculate accuracy, Default value is 100.
# TestNum: The number of samples for each test batch.
# - A test for every saved steps.
# CompileAcc: When the accuracy reaches the set threshold,
# - the model will be compiled together each time it is archived.
# - Available for specific usage scenarios.
# EndAcc: Finish the training when the accuracy reaches [EndAcc*100]%.
# EndEpochs: Finish the training when the epoch is greater than the defined epoch.
# PreprocessCollapseRepe ated: If True, then a preprocessing step runs
# - before loss calculation, wherein repeated labels passed to the loss
# - are merged into single labels.  This is useful if the training labels come
# - from, e.g., forced alignments and therefore have unnecessary repetitions.
# CTCMergeRepeated: If False, then deep within the CTC calculation,
# - repeated non-blank labels will not be merged and are interpreted
# - as individual labels. This is a simplified (non-standard) version of CTC.
Trains:
  SavedSteps: 100
  ValidationSteps: 500
  EndAcc: 0.98
  EndCost: 1
  EndEpochs: 2
  BatchSize: 64
  TestBatchSize: 300
  LearningRate: 0.01
  DecayRate: 0.98
  DecaySteps: 100000
  PreprocessCollapseRepeated: False
  CTCMergeRepeated: True  
  CTCBeamWidth: 5
  CTCTopPaths: 1

There are several common examples of TrainRegex: i. apple_20181010121212.jpg

.*?(?=_.*\.)

ii apple.png

.*?(?=\.)

model.yaml - Model Config

# Sites: A bindable parameter used to select a model. 
# - If this parameter is defined, 
# - it can be identified by using the model_site parameter 
# - to identify a model that is inconsistent with the actual size of the current model.
# ModelName: Corresponding to the model file in the model directory,
# - such as YourModelName.pb, fill in YourModelName here.
# ModelType: This parameter is also used to locate the model. 
# - The difference from the sites is that if there is no corresponding site, 
# - the size will be used to assign the model. 
# - If a model of the corresponding size and corresponding to the ModelType is not found, 
# - the model belonging to the category is preferentially selected.
# CharSet: Provides a default optional built-in solution:
# - [ALPHANUMERIC, ALPHANUMERIC_LOWER, ALPHANUMERIC_UPPER,
# -- NUMERIC, ALPHABET_LOWER, ALPHABET_UPPER, ALPHABET]
# - Or you can use your own customized character set like: ['a', '1', '2'].
# CharExclude: CharExclude should be a list, like: ['a', '1', '2']
# - which is convenient for users to freely combine character sets.
# - If you don't want to manually define the character set manually,
# - you can choose a built-in character set
# - and set the characters to be excluded by CharExclude parameter.
Model:
  Sites: []
  ModelName: YourModelName
  ModelType: 150x50
  CharSet: ALPHANUMERIC_LOWER
  CharExclude: []
  CharReplace: {}
  ImageWidth: 150
  ImageHeight: 50

# Binaryzation: [-1: Off, >0 and < 255: On].
# Smoothing: [-1: Off, >0: On].
# Blur: [-1: Off, >0: On].
Pretreatment:
  Binaryzation: -1
  Smoothing: -1
  Blur: -1
  Resize: [150, 50]

工具集

预处理预览工具，只支持为打包的训练集查看 python -m tools.preview
新手指南（只支持字符集推荐，我觉得是个鸡肋各位请忽略） python -m tools.navigator
PyInstaller 一键打包（训练的话支持不好，部署的打包效果不错）
```
pip install pyinstaller
python -m tools.package
```

运行

命令行或终端运行：python trains.py
使用 PyCharm 运行，右键 Run

开源许可

身在一个965的公司难以想像996是怎样可怕的一件事情。 996工作制意味着8点多起，10点多到家，意味着几乎没有个人时间，没有时间学习，没有时间陪伴爱人亲人，没有时间维持工作以外的社交，人生中只有睡觉吃饭上班和唯一的周末，那么我们从工作中等价交换了什么？

个人报酬：生存的主要收入来源
个人价值：通过工作收获技能和社会承认
社会接触：了解不同的人，不同的观点、经验、思想等等

所以这些就是生活的全部了吗？你的付出是否交换到等价的收益？想要得到多少就应该牺牲等价的东西去交换，有些人牺牲一切去换取被爱的可能，有些人牺牲生活和爱情去换金钱和社会地位，有些人牺牲一切去逐梦或筑梦，韭菜春风吹又生，但这不能成为我们虐待它们的理由，这些也不应该成为企业盲目跟风996制度的理由，那些敢于提出996的企业领导人应该学习的是承担，承担把大饼从纸上送到手上，比起《跳槽上征信》，那些天天熬制无法兑现的鸡汤厨子更应该上征信吧。

即使你们有一万种虐待韭菜的方法，即使是飞蛾扑火，是以卵击石，我仍愿以个人的名义加入 ANTI-996 大军

详细指南

之前专门为该项目写的文章，欢迎大家点评

https://www.jianshu.com/p/80ef04b16efc

Name		Name	Last commit message	Last commit date
Latest commit History 68 Commits
dataset		dataset
network		network
resource		resource
tools		tools
.gitignore		.gitignore
LICENSE		LICENSE
README.md		README.md
character.py		character.py
config.py		config.py
config_demo.yaml		config_demo.yaml
exception.py		exception.py
framework.py		framework.py
make_dataset.py		make_dataset.py
model_demo.yaml		model_demo.yaml
predict_testing.py		predict_testing.py
pretreatment.py		pretreatment.py
requirements.txt		requirements.txt
trainer.spec		trainer.spec
trains.py		trains.py
tutorial.py		tutorial.py
utils.py		utils.py

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Repository files navigation

项目介绍

注意事项

准备工作

环境安装

开始

1. 架构与流程

2. 配置化

工具集

运行

开源许可

详细指南

About

Releases

Packages

Languages

License

joanneffffff/captcha_trainer

Folders and files

Latest commit

History

Repository files navigation

项目介绍

注意事项

准备工作

环境安装

开始

1. 架构与流程

2. 配置化

工具集

运行

开源许可

详细指南

About

Resources

License

Stars

Watchers

Forks

Releases

Packages 0

Languages

Packages