pycon2013 / image-processing-pipeline.txt

- Users upload a receipt; the service extracts data from it and rewards the user
- Stack
    - Ubuntu, nginx, redis, s3, mako, mysql, tornado
    - OpenCV, NumPy, ImageMagick, Tesseract
    - Mongo + Hadoop
- Pipeline: pre process -> OCR -> parsing -> scoring -> select best
                    <-------------------------+
    - The inner part (spanned by the arrow) runs many times
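
A minimal sketch of the "run many times, select best" idea. It assumes the repeated part is preprocess -> OCR -> parse -> score, tried once per preprocessing variant; the notes do not say exactly which steps repeat. All names here (`run_pipeline`, `variants`, the toy stand-ins) are hypothetical.

```python
from typing import Callable, Iterable, Optional, Tuple


def run_pipeline(
    image: str,
    variants: Iterable[Callable[[str], str]],
    ocr: Callable[[str], str],
    parse: Callable[[str], dict],
    score: Callable[[dict], float],
) -> Tuple[Optional[dict], float]:
    """Try every preprocessing variant and keep the best-scoring parse."""
    best, best_score = None, float("-inf")
    for preprocess in variants:
        parsed = parse(ocr(preprocess(image)))
        s = score(parsed)
        if s > best_score:
            best, best_score = parsed, s
    return best, best_score


# Toy stand-ins so the sketch runs end to end (all hypothetical):
# the "image" is just a string and each "preprocessing" is a string method.
variants = [str.lower, str.upper, str.strip]
ocr = lambda img: img
parse = lambda text: {"items": text.split()}
score = lambda parsed: sum(w.isupper() for w in parsed["items"])

best, best_score = run_pipeline(" milk Bread eggs ", variants, ocr, parse, score)
```

Passing the stages in as callables keeps the loop independent of how each stage is implemented, so new preprocessing variants can be added without touching it.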
- An upload can be several images (e.g. one long receipt)
- Pre processing (OpenCV + NumPy):
    [1]
    - color -> b/w
    - unblur / sharpen
    - un-highlight color regions
    - adaptive thresholding
    [2]
    - Cropping (carpet story)
    [3]
    - Extracting lines (line recognition)
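
A rough, dependency-free sketch of two of the steps above: adaptive thresholding and line extraction. The talk used OpenCV + NumPy for this (OpenCV ships `cv2.adaptiveThreshold`). Plain lists are used here only so the example runs anywhere. The mean-minus-constant rule, the row-projection approach to finding lines, and the names `adaptive_threshold` and `extract_lines` are assumptions, not the speaker's code.

```python
def adaptive_threshold(gray, block=3, c=2):
    """Mark a pixel as ink (1) if it is darker than its local mean minus c."""
    h, w = len(gray), len(gray[0])
    r = block // 2
    out = [[0] * w for _ in range(h)]
    for y in range(h):
        for x in range(w):
            # Neighborhood clipped at the image border.
            ys = range(max(0, y - r), min(h, y + r + 1))
            xs = range(max(0, x - r), min(w, x + r + 1))
            window = [gray[j][i] for j in ys for i in xs]
            mean = sum(window) / len(window)
            out[y][x] = 1 if gray[y][x] < mean - c else 0
    return out


def extract_lines(binary, min_ink=1):
    """Return (top, bottom) row ranges, bottom exclusive, that contain ink."""
    lines, start = [], None
    for y, row in enumerate(binary):
        has_ink = sum(row) >= min_ink
        if has_ink and start is None:
            start = y                      # a text line begins
        elif not has_ink and start is not None:
            lines.append((start, y))       # a text line ends
            start = None
    if start is not None:
        lines.append((start, len(binary)))
    return lines


# Tiny synthetic "receipt": bright paper (200) with dark text rows (50).
BRIGHT, DARK = [200] * 5, [50] * 5
gray = [BRIGHT, DARK, DARK, BRIGHT, BRIGHT, DARK, BRIGHT]
binary = adaptive_threshold(gray)
text_lines = extract_lines(binary)
```

Because the threshold is computed per neighborhood rather than once for the whole image, uneven lighting across a photographed receipt affects the result less than a single global cutoff would.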
- Tesseract OCR
    - Had to train on receipt font
    - Created shopping dictionary
- Use Levenshtein distance for fuzzy matching
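
A small sketch of how Levenshtein distance could drive fuzzy matching of OCR tokens against a shopping dictionary. The `closest_word` helper, the `max_distance` cutoff, and the sample word list are illustrative assumptions.

```python
def levenshtein(a: str, b: str) -> int:
    """Edit distance: minimum insertions, deletions, and substitutions."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(
                prev[j] + 1,               # delete ca
                cur[j - 1] + 1,            # insert cb
                prev[j - 1] + (ca != cb),  # substitute, free if equal
            ))
        prev = cur
    return prev[-1]


def closest_word(token, dictionary, max_distance=2):
    """Return the nearest dictionary word, or None if nothing is close enough."""
    token = token.lower()
    best = min(dictionary, key=lambda w: levenshtein(token, w))
    return best if levenshtein(token, best) <= max_distance else None


shopping_words = ["milk", "bread", "eggs"]   # hypothetical dictionary
fixed = closest_word("M1LK", shopping_words)
unknown = closest_word("XYZZY", shopping_words)
```

The cutoff matters: without it, every garbled token would be forced onto some dictionary word, however unrelated.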
- Handling errors
    - Never lose originals
    - Have re-run capabilities
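
One way these two rules could look in code: persist the original before any processing, and make it possible to re-run the pipeline from the stored copy. A local directory stands in for whatever storage was actually used (the stack lists s3). `OriginalStore`, `process_upload`, and `rerun` are hypothetical names.

```python
import hashlib
import tempfile
from pathlib import Path


class OriginalStore:
    """Write-once store: originals are saved first and never overwritten."""

    def __init__(self, root):
        self.root = Path(root)
        self.root.mkdir(parents=True, exist_ok=True)

    def save(self, data: bytes) -> str:
        key = hashlib.sha256(data).hexdigest()   # content-addressed key
        path = self.root / key
        if not path.exists():
            path.write_bytes(data)
        return key

    def load(self, key: str) -> bytes:
        return (self.root / key).read_bytes()


def process_upload(store, data, pipeline):
    key = store.save(data)            # persist before processing
    try:
        return key, pipeline(data)
    except Exception:
        return key, None              # failed, but the original is safe


def rerun(store, key, pipeline):
    """Re-run a (possibly fixed) pipeline on the stored original."""
    return pipeline(store.load(key))


def broken_pipeline(data):
    raise ValueError("OCR crashed")


def fixed_pipeline(data):
    return data.decode().upper()


with tempfile.TemporaryDirectory() as tmp:
    store = OriginalStore(tmp)
    key, result = process_upload(store, b"milk 1.99", broken_pipeline)
    rerun_result = rerun(store, key, fixed_pipeline)
```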
- 80% accuracy on good pictures