Some errors in the 800K dataset: oversized word/char boxes, missing labels

Hi, I am running into problems similar to those discussed in #13 and #15.

I am using the pre-generated 800K dataset to train a model and found the following issues:

(1) Some word/char boxes are oversized, as discussed in #13 and #15. (2) Some word recognition annotations are wrong. (3) Some bounding box coordinates are confusing: negative values, coordinates that fall outside the image boundary, and char boxes that consist of only 2 distinct vertices, each repeated (e.g. p1, p1, p2, p2), where 4 different points are expected.
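
To make the cases in (3) concrete, here is a minimal sketch of the kind of check that flags them. It assumes each box is available as four (x, y) points and that the image width and height are known. `check_box` and the problem labels are hypothetical names used for illustration, not part of the dataset or its tooling, and the check does not cover oversized boxes.

```python
from typing import List, Sequence, Tuple

Point = Tuple[float, float]


def check_box(box: Sequence[Point], width: float, height: float) -> List[str]:
    """Return labels for the problems found in one box (empty list if none).

    Assumption: `box` holds four (x, y) vertices in pixel coordinates.
    """
    if len(box) != 4:
        return ["not_four_points"]

    problems = []
    # Any coordinate below zero lies left of or above the image.
    if any(x < 0 or y < 0 for x, y in box):
        problems.append("negative_coordinate")
    # Any coordinate past the width/height lies right of or below the image.
    if any(x > width or y > height for x, y in box):
        problems.append("outside_image")
    # Fewer than 4 distinct vertices, e.g. p1, p1, p2, p2.
    if len({(float(x), float(y)) for x, y in box}) < 4:
        problems.append("duplicate_vertices")
    return problems


if __name__ == "__main__":
    # Hypothetical char box with only two distinct vertices.
    sample = [(10, 10), (10, 10), (50, 30), (50, 30)]
    print(check_box(sample, width=100, height=100))
```

Running a check like this over every word and char box gives a count per problem type, which helps decide whether to clip, repair, or drop the affected samples.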