[FFmpeg-trac] #9151(undetermined:new): Missing white space in the white list of tesseract configuration

FFmpeg trac at avcodec.org
Sat Mar 13 00:46:42 EET 2021


#9151: Missing white space in the white list of tesseract configuration
-------------------------------------+-------------------------------------
             Reporter:  dominic108   |                     Type:  defect
               Status:  new          |                 Priority:  normal
            Component:               |                  Version:
  undetermined                       |  unspecified
             Keywords:  tesseract    |               Blocked By:
             Blocking:               |  Reproduced by developer:  0
Analyzed by developer:  0            |
-------------------------------------+-------------------------------------
 Summary of the bug: I compiled ffmpeg on Ubuntu with Tesseract support
 (--enable-libtesseract):
 {{{
 ffmpeg version N-101412-gb7e7813 Copyright (c) 2000-2021 the FFmpeg developers
 built with gcc 9 (Ubuntu 9.3.0-17ubuntu1~20.04)
 configuration: --prefix=/home/working/app_download/ffmpeg_build
 --pkg-config-flags=--static
 --extra-cflags=-I/home/working/app_download/ffmpeg_build/include
 --extra-ldflags=-L/home/working/app_download/ffmpeg_build/lib
 --extra-libs='-lpthread -lm' --ld=g++
 --bindir=/home/working/app_download/ffmpeg_build/bin --enable-gpl
 --enable-gnutls --enable-libaom --enable-libass --enable-libfdk-aac
 --enable-libfreetype --enable-libmp3lame --enable-libopus
 --enable-libsvtav1 --enable-libdav1d --enable-libvorbis --enable-libvpx
 --enable-libx264 --enable-libx265 --enable-nonfree --enable-libtesseract
 }}}

 I tested with the command line:
 {{{
 % ffmpeg -i input -vf "ocr,metadata=mode=print:file=ocr.txt:direct=1" output
 }}}
 Here is an extract from ocr.txt after the above command:
 {{{
 frame:53   pts:212752  pts_time:212.752
 lavfi.ocr.text=Transcendingisthedeepsettlingoftheactivityofthemind
 whilethemindremainsawake.
 lavfi.ocr.confidence=0 0 95
 }}}
 Tesseract did not recognize the white spaces, although it recognizes them
 perfectly when applied directly to the frame image. So I checked
 libavfilter/vf_ocr.c to see what was going on: the white space is not in
 the default character whitelist. Even though it seems strange to treat a
 white space as a character in the context of OCR, I boldly added one to
 the list. The original code was:
 {{{
    { "whitelist", "set character whitelist", OFFSET(whitelist),
 AV_OPT_TYPE_STRING,
 {.str="0123456789abcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOPQRSTUVWXYZ.:;,-+_!?\"'[]{}()<>|/\\=*&%$#@!~"},
 0, 0, FLAGS },
 }}}
 The modified code, with a white space appended to the whitelist, was:
 {{{
    { "whitelist", "set character whitelist", OFFSET(whitelist),
 AV_OPT_TYPE_STRING,
 {.str="0123456789abcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOPQRSTUVWXYZ.:;,-+_!?\"'[]{}()<>|/\\=*&%$#@!~ "},
 0, 0, FLAGS },
 }}}
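
 To make the difference between the two strings explicit, here is a minimal
 sketch that checks whether a given character appears in each whitelist. It
 is not part of vf_ocr.c: the constant names and the helper
 whitelist_allows() are hypothetical, introduced only for illustration, and
 the check assumes the whitelist is treated as a plain set of allowed
 characters.

```c
#include <assert.h>
#include <stdbool.h>
#include <string.h>

/* Default whitelist as quoted from the original libavfilter/vf_ocr.c entry. */
const char OCR_WHITELIST_ORIGINAL[] =
    "0123456789abcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOPQRSTUVWXYZ"
    ".:;,-+_!?\"'[]{}()<>|/\\=*&%$#@!~";

/* Same list with one trailing space, as in the modified entry. */
const char OCR_WHITELIST_PATCHED[] =
    "0123456789abcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOPQRSTUVWXYZ"
    ".:;,-+_!?\"'[]{}()<>|/\\=*&%$#@!~ ";

/* Hypothetical helper: true if c is one of the characters in whitelist. */
bool whitelist_allows(const char *whitelist, char c)
{
    assert(whitelist != NULL);
    /* strchr() also matches the terminating NUL, so reject it explicitly. */
    return c != '\0' && strchr(whitelist, c) != NULL;
}
```

 With these definitions, whitelist_allows(OCR_WHITELIST_ORIGINAL, ' ') is
 false and whitelist_allows(OCR_WHITELIST_PATCHED, ' ') is true, while every
 other character is treated the same by both lists.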
 After recompiling, I ran the same command again and got this result:
 {{{
 frame:53   pts:212752  pts_time:212.752
 lavfi.ocr.text=Transcending is the deep settling of the activity of the mind
 while the mind remains awake.
 lavfi.ocr.confidence=96 96 97 96 96 96 96 96 96 96 96 96 96 96 96 96 95
 }}}

--
Ticket URL: <https://trac.ffmpeg.org/ticket/9151>
FFmpeg <https://ffmpeg.org>
FFmpeg issue tracker

