[CAPTCHA recognition - training] This project implements CAPTCHA recognition based on CNN5/ResNet+BLSTM/LSTM/GRU/SRU/BSRU+CTC. It is for training models only.


joanneffffff/captcha_trainer

 
 


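Project Introduction

CAPTCHA recognition: this project implements CAPTCHA recognition based on CNN5/ResNet+BLSTM/LSTM/GRU/SRU/BSRU+CTC. It covers training only; to deploy a model, see:

https://github.com/kerlomz/captcha_platform (general-purpose web service, called over HTTP)

https://github.com/kerlomz/captcha_library_c (dynamic link library, called as a DLL, based on TensorFlow C++)

https://github.com/kerlomz/captcha_demo_csharp (called from C# source, based on TensorFlowSharp)

Many people ask me whether deployed recognition also needs a GPU. My answer is that it is completely unnecessary. The ideal setup is to train on a GPU and serve recognition on a CPU; if deployment were that expensive too, what practical value or use case would remain? In my tests, one recognition takes about 30 ms on Alibaba Cloud's lowest-spec instance (1 core, 1 GB RAM) and about 10-15 ms on my i7-8700K.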

LICENSE

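Notes

  1. How to train on a CPU:

    This project installs the GPU build of TensorFlow by default, and training on a GPU is recommended. To train on a CPU instead, replace tensorflow-gpu==1.6.0 with tensorflow==1.6.0 in requirements.txt; nothing else needs to change.

  2. About the LSTM network:

    Make sure the width of the feature map that the CNN feeds into the LSTM is at least roughly 3 times the maximum number of characters, i.e. time_step should be at least about 3 times the maximum character count.

  3. Fixing the "No valid path found" error:

    Change the Pretreatment->Resize parameter in model.yaml to a suitable value. Based on experience training around a hundred CAPTCHA types, a fairly general value to try is Resize: [150, 50]. Alternatively, use tutorial.py (which generates the configuration files, packages the samples, and trains in one go): fill in the training set path and run it.

  4. Changing parameters:

    Remember that ModelName is the unique identifier bound to a model. If you change training parameters that affect the computation graph, such as ImageWidth, ImageHeight, Resize, CharSet, CNNNetwork, RecurrentNetwork, or HiddenNum, either delete the old files under the model path and retrain, or retrain under a new ModelName. Otherwise training resumes from the existing checkpoint by default.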
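The time_step rule from note 2 can be checked with simple arithmetic before training. The sketch below is only illustrative: `enough_time_steps` is a hypothetical helper, and the downsampling factor of 8 is an assumption about how much the CNN shrinks the image width, not a value taken from this project. Substitute the factor of the network you actually use.

```python
def enough_time_steps(resize_width, max_chars, downsample=8, factor=3):
    """Estimate time_step and check it against the 3x rule of thumb.

    downsample is an ASSUMED width reduction of the CNN (hypothetical value);
    factor is the "about 3 times the maximum character count" rule.
    """
    time_steps = resize_width // downsample
    return time_steps, time_steps >= factor * max_chars


# A 4-character CAPTCHA resized to a width of 150 leaves enough time steps.
print(enough_time_steps(150, 4))  # (18, True)
# The same CAPTCHA resized to a width of 60 does not.
print(enough_time_steps(60, 4))   # (7, False)
```

If the check fails, increasing the Resize width in model.yaml is the adjustment that note 3 suggests.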

Preparation

If you plan to train on a GPU, install CUDA and cuDNN first. The officially tested build configurations are listed here: https://www.tensorflow.org/install/install_sources#tested_source_configurations Third-party prebuilt TensorFlow WHL packages can be downloaded from GitHub:

https://github.com/fo40225/tensorflow-windows-wheel

CUDA download: https://developer.nvidia.com/cuda-downloads

cuDNN download: https://developer.nvidia.com/rdp/form/cudnn-download-survey (account registration required)

The author uses CUDA10+cuDNN7.3.1+TensorFlow 1.12
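After installing, it can help to confirm which TensorFlow build Python actually picks up. The following is a minimal sketch, not part of this project: `describe_tensorflow` is a hypothetical helper, and it only reports the installed version and whether that build was compiled with CUDA support.

```python
def describe_tensorflow():
    """Report the installed TensorFlow version and whether it is a CUDA build."""
    try:
        import tensorflow as tf
    except ImportError:
        return "TensorFlow is not installed"
    return "TensorFlow {} (built with CUDA: {})".format(
        tf.__version__, tf.test.is_built_with_cuda()
    )


print(describe_tensorflow())
```

A CUDA build still needs matching CUDA and cuDNN libraries at run time, so this check complements the version table linked above rather than replacing it.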

Environment Setup

  1. Install Python 3.6 (including pip)

  2. Install virtualenv: pip3 install virtualenv

  3. Create a dedicated virtual environment for this project:

    virtualenv -p /usr/bin/python3 venv # venv is the name of the virtual environment.
    cd venv/ # venv is the name of the virtual environment.
    source bin/activate # to activate the current virtual environment.
    cd captcha_trainer # captcha_trainer is the project path.

  4. Install the project's dependencies: pip install -r requirements.txt

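Getting Started

1. Architecture and Workflow

This project relies on the training configuration config.yaml and the model configuration model.yaml. When initializing the project, copy config_demo.yaml to the current directory as config.yaml, and do the same with model_demo.yaml to create model.yaml. Alternatively, use tutorial.py to set up the model configuration automatically.

Training workflow: once both configuration files are ready, run trains.py. It reads the configuration, builds the neural network computation graph from model.yaml, and trains with the parameters in config.yaml.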
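The copy step can also be scripted. The sketch below assumes the two demo files sit in the project root, as described above; `init_configs` is a hypothetical helper and not part of this project. It never overwrites an existing config.yaml or model.yaml.

```python
import os
import shutil


def init_configs(project_dir="."):
    """Copy each *_demo.yaml to its working name unless the target exists."""
    created = []
    for demo, target in (
        ("config_demo.yaml", "config.yaml"),
        ("model_demo.yaml", "model.yaml"),
    ):
        src = os.path.join(project_dir, demo)
        dst = os.path.join(project_dir, target)
        # Skip missing demo files and never clobber an edited configuration.
        if os.path.exists(src) and not os.path.exists(dst):
            shutil.copyfile(src, dst)
            created.append(target)
    return created


if __name__ == "__main__":
    print("created:", init_configs())
```

Running it a second time returns an empty list, so it is safe to call before every training run.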

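A few suggestions on the training parameters in config.yaml:

  1. BatchSize (training batch size) and TestBatchSize (test batch size) deserve attention. Adjust them to your graphics card: with little GPU memory, keep both small. Note that the default configuration I provide assumes 8 GB of GPU memory at 50% usage.

  2. LearningRate also deserves attention; deep learning is largely a matter of parameter tuning. Most models can keep the default configuration. For higher accuracy on some models, first use 0.01 to converge quickly, then switch to 0.001/0.0001 once accuracy reaches about 95%.

  3. TestSetNum (size of the test set) is designed for lazy people (myself included): it splits the given number of samples off the training set. There is one precondition: the test set must be random. Random. Random (important things are said three times). Some people open the folder in Windows Explorer and drag-select a few hundred files, which are sorted by name by default. If the file name is the label, that selection is not random: the test set may well contain only images labeled between 0 and 3, and training may then never converge.

  4. TrainRegex and TestRegex are regular expressions matched against the file names. When collecting samples, please keep the file names as close to my example as possible; for regex questions, search online. If your files are named like 1111.jpg, here is a batch conversion script: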

    import re
    import os
    import hashlib
    
    # Path to the training set
    root = r"D:\TrainSet\***"
    all_files = os.listdir(root)
    
    for file in all_files:
        old_path = os.path.join(root, file)
        
        # Skip files that have already been renamed
        if len(file.split(".")[0]) > 32:
            continue
        
        # Rename to: label_md5-of-file.image-extension
        with open(old_path, "rb") as f:
            _id = hashlib.md5(f.read()).hexdigest()
        new_path = os.path.join(root, file.replace(".", "_{}.".format(_id)))
        
        # Duplicate labels produce file names such as: abcd (1).jpg
        new_path = re.sub(r" \(\d+\)", "", new_path)
        print(new_path)
        os.rename(old_path, new_path)
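The randomness requirement in suggestion 3 is easy to satisfy in code if you prepare the test set yourself. The sketch below is illustrative only: `split_test_set` is a hypothetical helper, not this project's implementation, and the file names are synthetic. It shuffles the file list before cutting off the test samples, so a name-sorted folder no longer yields a test set with only a few labels.

```python
import random


def split_test_set(file_names, test_set_num, seed=None):
    """Shuffle a copy of the file list, then split off test_set_num items."""
    shuffled = list(file_names)
    random.Random(seed).shuffle(shuffled)
    return shuffled[test_set_num:], shuffled[:test_set_num]


# Synthetic, name-sorted sample list: 10 labels with 50 files each.
names = [
    "{}_{:03d}.jpg".format(label, i)
    for label in "0123456789"
    for i in range(50)
]

train, test = split_test_set(names, 100, seed=0)
print(len(train), len(test))  # 400 100

# Taking the first 100 names without shuffling would cover only two labels.
print(sorted({n[0] for n in names[:100]}))  # ['0', '1']
print(sorted({n[0] for n in test}))
```

Passing a fixed seed makes the split reproducible between runs; leave it as None for a different split each time.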

2. Configuration

  1. config.yaml - System Config

    # - requirements.txt  -  GPU: tensorflow-gpu, CPU: tensorflow
    # - If you use the GPU version, you need to install some additional applications.
    # TrainRegex and TestRegex: Default matching apple_20181010121212.jpg file.
    # - The Default is .*?(?=_.*\.)
    # TrainsPath and TestPath: The local absolute path of your training and testing set.
    # TestSetNum: This is an optional parameter that is used when you want to extract some of the test set
    # - from the training set when you are not preparing the test set separately.
    System:
      DeviceUsage: 0.5
      TrainRegex: '.*?(?=_)'
      TestRegex: '.*?(?=_)'
      TestSetNum: 300
    
    # CNNNetwork: [CNN5, DenseNet]
    # RecurrentNetwork: [BLSTM, LSTM]
    # - The recommended configuration is CNN5+BLSTM / DenseNet+BLSTM
    # HiddenNum: [64, 128, 256]
    # - This parameter indicates the number of nodes used to remember and store past states.
    NeuralNet:
      CNNNetwork: CNN5
      RecurrentNetwork: BLSTM
      HiddenNum: 64
      KeepProb: 0.99
    
    # SavedSteps: A Session.run() execution is called a step.
    # - Used to save training progress. Default value is 100.
    # ValidationSteps: Used to calculate accuracy.
    # TestNum: The number of samples for each test batch.
    # - A test for every saved steps.
    # CompileAcc: When the accuracy reaches the set threshold,
    # - the model will be compiled together each time it is archived.
    # - Available for specific usage scenarios.
    # EndAcc: Finish the training when the accuracy reaches [EndAcc*100]%.
    # EndEpochs: Finish the training when the epoch is greater than the defined epoch.
    # PreprocessCollapseRepeated: If True, then a preprocessing step runs
    # - before loss calculation, wherein repeated labels passed to the loss
    # - are merged into single labels.  This is useful if the training labels come
    # - from, e.g., forced alignments and therefore have unnecessary repetitions.
    # CTCMergeRepeated: If False, then deep within the CTC calculation,
    # - repeated non-blank labels will not be merged and are interpreted
    # - as individual labels. This is a simplified (non-standard) version of CTC.
    Trains:
      SavedSteps: 100
      ValidationSteps: 500
      EndAcc: 0.98
      EndCost: 1
      EndEpochs: 2
      BatchSize: 64
      TestBatchSize: 300
      LearningRate: 0.01
      DecayRate: 0.98
      DecaySteps: 100000
      PreprocessCollapseRepeated: False
      CTCMergeRepeated: True  
      CTCBeamWidth: 5
      CTCTopPaths: 1
    

    There are several common examples of TrainRegex:

    i. apple_20181010121212.jpg

    .*?(?=_.*\.)
    

    ii. apple.png

    .*?(?=\.)
    
  2. model.yaml - Model Config

    # Sites: A bindable parameter used to select a model. 
    # - If this parameter is defined, 
    # - it can be identified by using the model_site parameter 
    # - to identify a model that is inconsistent with the actual size of the current model.
    # ModelName: Corresponding to the model file in the model directory,
    # - such as YourModelName.pb, fill in YourModelName here.
    # ModelType: This parameter is also used to locate the model. 
    # - The difference from the sites is that if there is no corresponding site, 
    # - the size will be used to assign the model. 
    # - If a model of the corresponding size and corresponding to the ModelType is not found, 
    # - the model belonging to the category is preferentially selected.
    # CharSet: Provides a default optional built-in solution:
    # - [ALPHANUMERIC, ALPHANUMERIC_LOWER, ALPHANUMERIC_UPPER,
    # -- NUMERIC, ALPHABET_LOWER, ALPHABET_UPPER, ALPHABET]
    # - Or you can use your own customized character set like: ['a', '1', '2'].
    # CharExclude: CharExclude should be a list, like: ['a', '1', '2']
    # - which is convenient for users to freely combine character sets.
    # - If you don't want to define the character set manually,
    # - you can choose a built-in character set
    # - and set the characters to be excluded by CharExclude parameter.
    Model:
      Sites: []
      ModelName: YourModelName
      ModelType: 150x50
      CharSet: ALPHANUMERIC_LOWER
      CharExclude: []
      CharReplace: {}
      ImageWidth: 150
      ImageHeight: 50
    
    # Binaryzation: [-1: Off, >0 and < 255: On].
    # Smoothing: [-1: Off, >0: On].
    # Blur: [-1: Off, >0: On].
    Pretreatment:
      Binaryzation: -1
      Smoothing: -1
      Blur: -1
      Resize: [150, 50]
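TrainRegex from config.yaml and CharSet/CharExclude from model.yaml work together: the regex pulls the label out of the file name, and every character of that label has to exist in the character set. The sketch below illustrates this under stated assumptions. `BUILT_IN`, `build_charset`, and `extract_label` are hypothetical names, and the contents assumed for ALPHANUMERIC_LOWER and NUMERIC (digits plus lowercase letters, and digits) are a guess at what the built-in sets contain, not the project's actual definitions.

```python
import re
import string

# ASSUMED contents of two built-in sets; the project's definitions may differ.
BUILT_IN = {
    "ALPHANUMERIC_LOWER": list(string.digits + string.ascii_lowercase),
    "NUMERIC": list(string.digits),
}


def build_charset(char_set, char_exclude=()):
    """Resolve a built-in name or a custom list, then drop excluded characters."""
    chars = BUILT_IN[char_set] if isinstance(char_set, str) else list(char_set)
    return [c for c in chars if c not in char_exclude]


def extract_label(file_name, pattern=r".*?(?=_)"):
    """Return the part of the file name matched by a TrainRegex-style pattern."""
    match = re.search(pattern, file_name)
    return match.group() if match else None


charset = build_charset("ALPHANUMERIC_LOWER", char_exclude=["0", "o"])

for name in ("ab3d_20181010121212.jpg", "f0o1_20181010121213.jpg"):
    label = extract_label(name)
    # Characters that the model could never output with this character set.
    unknown = [c for c in label if c not in charset]
    print(name, "->", label, "unknown:", unknown)

# The two patterns listed under config.yaml give the same label here.
print(extract_label("apple_20181010121212.jpg", r".*?(?=_.*\.)"))  # apple
print(extract_label("apple.png", r".*?(?=\.)"))                    # apple
```

Running a check like this over the whole training folder before training can surface labels that fall outside the chosen character set.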

Tools

  1. Preprocessing preview tool (only supports viewing training sets that have not been packaged): python -m tools.preview

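  2. Beginner's guide (it only recommends a character set; I find it of little use, so feel free to ignore it): python -m tools.navigator

  3. One-step packaging with PyInstaller (works poorly for training, but well for deployment)

    pip install pyinstaller
    python -m tools.package
    

Run

  1. From the command line or a terminal: python trains.py
  2. In PyCharm: right-click and choose Run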

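Open Source License

Working at a 965 company, I find it hard to imagine how dreadful 996 must be. A 996 schedule means getting up a little after 8 and getting home after 10 at night. It means almost no personal time: no time to study, no time for your partner and family, no time for a social life outside work. Life is reduced to sleeping, eating, working, and a single day off each week. So what do we get in exchange for our work?

  1. Personal income: the main source of livelihood

  2. Personal value: skills and social recognition gained through work

  3. Social contact: getting to know different people, viewpoints, experiences, ideas, and so on

So is that all there is to life? Does what you give buy an equivalent return? To get something, you should sacrifice something of equal value in exchange. Some people sacrifice everything for the chance of being loved; some sacrifice life and love for money and social status; some sacrifice everything to chase or build a dream. Leeks grow back with every spring breeze, but that is no reason for us to mistreat them, and none of this should be a reason for companies to blindly follow the 996 trend. The business leaders who dare to propose 996 should learn to take responsibility: the responsibility of moving the promised pie from paper into people's hands. Compared with "job-hopping goes on your credit record", the cooks who spend every day brewing chicken soup they can never deliver are the ones who belong on the credit record.

Even if you have ten thousand ways to mistreat leeks, even if this is a moth flying into the flame or an egg thrown against a rock, I am still willing to join the ANTI-996 army in my own name.

Detailed Guide

An article I wrote specifically for this project; comments are welcome.

https://www.jianshu.com/p/80ef04b16efc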
