[FFmpeg-trac] #9151(undetermined:new): Missing white space in the white list of tesseract configuration
FFmpeg
trac at avcodec.org
Sat Mar 13 00:46:42 EET 2021
#9151: Missing white space in the white list of tesseract configuration
-------------------------------------+-------------------------------------
Reporter: dominic108 | Type: defect
Status: new | Priority: normal
Component: undetermined | Version: unspecified
Keywords: tesseract | Blocked By:
Blocking: | Reproduced by developer: 0
Analyzed by developer: 0 |
-------------------------------------+-------------------------------------
Summary of the bug: I compiled ffmpeg on Ubuntu with libtesseract support
enabled (--enable-libtesseract):
{{{
ffmpeg version N-101412-gb7e7813 Copyright (c) 2000-2021 the FFmpeg developers
built with gcc 9 (Ubuntu 9.3.0-17ubuntu1~20.04)
configuration: --prefix=/home/working/app_download/ffmpeg_build --pkg-config-flags=--static --extra-cflags=-I/home/working/app_download/ffmpeg_build/include --extra-ldflags=-L/home/working/app_download/ffmpeg_build/lib --extra-libs='-lpthread -lm' --ld=g++ --bindir=/home/working/app_download/ffmpeg_build/bin --enable-gpl --enable-gnutls --enable-libaom --enable-libass --enable-libfdk-aac --enable-libfreetype --enable-libmp3lame --enable-libopus --enable-libsvtav1 --enable-libdav1d --enable-libvorbis --enable-libvpx --enable-libx264 --enable-libx265 --enable-nonfree --enable-libtesseract
}}}
I tested with the command line:
{{{
% ffmpeg -i input -vf "ocr,metadata=mode=print:file=ocr.txt:direct=1" output
}}}
Here is an extract from ocr.txt after the above command:
{{{
frame:53 pts:212752 pts_time:212.752
lavfi.ocr.text=Transcendingisthedeepsettlingoftheactivityofthemind
whilethemindremainsawake.
lavfi.ocr.confidence=0 0 95
}}}
The spaces between words are missing from the OCR output, although
Tesseract recognizes them correctly when it is run directly on the frame
image. So I looked at libavfilter/vf_ocr.c to see what was going on: the
space character is not in the default character whitelist. Even though it
may seem strange to treat white space as a character in the context of
OCR, I added a space to the end of the list. The original code was:
{{{
{ "whitelist", "set character whitelist", OFFSET(whitelist),
AV_OPT_TYPE_STRING,
{.str="0123456789abcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOPQRSTUVWXYZ.:;,-+_!?\"'[]{}()<>|/\\=*&%$#@!~"},
0, 0, FLAGS },
}}}
The modified code, with a space appended to the whitelist string, was:
{{{
{ "whitelist", "set character whitelist", OFFSET(whitelist),
AV_OPT_TYPE_STRING,
{.str="0123456789abcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOPQRSTUVWXYZ.:;,-+_!?\"'[]{}()<>|/\\=*&%$#@!~ "},
0, 0, FLAGS },
}}}
After recompiling, I ran the same command again and got this result:
{{{
frame:53 pts:212752 pts_time:212.752
lavfi.ocr.text=Transcending is the deep settling of the activity of the mind
while the mind remains awake.
lavfi.ocr.confidence=96 96 97 96 96 96 96 96 96 96 96 96 96 96 96 96 95
}}}
--
Ticket URL: <https://trac.ffmpeg.org/ticket/9151>
FFmpeg <https://ffmpeg.org>
FFmpeg issue tracker