We have not trained explicitly on handwriting datasets (completely handwritten documents). But, there are lots of forms data with handwriting present in training. So, do try on your files, there is a huggingface demo, you can quickly test there: https://huggingface.co/spaces/Souvik3333/Nanonets-ocr-s
We are currently working on creating completely handwritten document datasets for our next model release.
Hey, the reason for the long processing time is that lots of people are using it, and with probably larger documents. I tested your file locally seems to be working correctly. https://ibb.co/C36RRjYs
Regarding the token limit, it depends on the text. We are using the qwen-2.5-vl tokenizer in case you are interested in reading about it.
> I tested your file locally seems to be working correctly
Apologies if there's some unspoken nuance in this exchange, but by "working correctly" did you just mean that it ran to completion? I don't even recognize some of the unicode characters that it emitted (or maybe you're using some kind of strange font, I guess?)
Don't misunderstand me, a ginormous number of floating point numbers attempting to read that handwriting is already doing better than I can, but I was just trying to understand if you thought that outcome is what was expected
We have not trained explicitly on handwriting datasets (completely handwritten documents). But, there are lots of forms data with handwriting present in training. So, do try on your files, there is a huggingface demo, you can quickly test there: https://huggingface.co/spaces/Souvik3333/Nanonets-ocr-s
We are currently working on creating completely handwritten document datasets for our next model release.
Document:
* https://imgur.com/cAtM8Qn
Result:
* https://imgur.com/ElUlZys
Perhaps it needed more than 1K tokens? But it took about an hour (number 28 in queue) to generate that and I didn't feel like trying again.
How many tokens does it usually take to represent a page of text with 554 characters?
Hey, the reason for the long processing time is that lots of people are using it, and with probably larger documents. I tested your file locally seems to be working correctly. https://ibb.co/C36RRjYs
Regarding the token limit, it depends on the text. We are using the qwen-2.5-vl tokenizer in case you are interested in reading about it.
You can run it very easily in a Colab notebook. This should be faster than the demo https://github.com/NanoNets/docext/blob/main/PDF2MD_README.m...
There are incorrect words in the extraction, so I would suggest you to wait for the handwritten text model's release.
> I tested your file locally seems to be working correctly
Apologies if there's some unspoken nuance in this exchange, but by "working correctly" did you just mean that it ran to completion? I don't even recognize some of the unicode characters that it emitted (or maybe you're using some kind of strange font, I guess?)
Don't misunderstand me, a ginormous number of floating point numbers attempting to read that handwriting is already doing better than I can, but I was just trying to understand if you thought that outcome is what was expected
It actually did a decent. Perhaps the font is weird? For reference here is the 'ground truth' content, not in markdown:
Page# 8
Log: MA 6100 2.03.15
34 cement emitter resistors - 0.33R 5W 5% measure 0.29R 0.26R
35 replaced R436, R430 emitter resistors on R-chn P.O. brd w/new WW 5W .33R 5% w/ ceramic lead insulators
36 applied de-oxit d100 to speaker outs, card terminals, terminal blocks, output trans jacks
37 replace R-chn drivers and class A BJTs w/ BD139/146, & TIP31AG
38 placed boards back in
39 desoldered grnd lug from volume control
40 contact cleaner, Deoxit D5, faderlube on pots & switches teflon lube on rotor joint
41 cleaned ground lug & resoldered, reattached panel
This is the result. ``` Page 1 of 1 Page # <page_number>8</page_number>
Log: MA 6100 Z. O 3. 15
<table> <tr> <td>34</td> <td>cement emitter resistors -</td> </tr> <tr> <td></td> <td>0.33 R SW 5% measure</td> </tr> <tr> <td></td> <td>0.29 R, 0.26 R</td> </tr> <tr> <td>35</td> <td>replaced R'4 36, R4 30</td> </tr> <tr> <td></td> <td>emitter resistor on R-44</td> </tr> <tr> <td></td> <td>0.0. 3rd w/ new WW 5W .33R</td> </tr> <tr> <td>36</td> <td>% w/ ceramic lead insulators</td> </tr> <tr> <td></td> <td>applied de-oat d100 to Speak</td> </tr> <tr> <td></td> <td>outs, card terminals, terminal</td> </tr> <tr> <td></td> <td>blocks, output tran jacks</td> </tr> <tr> <td>37</td> <td>replace &-clun diviers</td> </tr> <tr> <td></td> <td>and class A BJTs w/ BD139/140</td> </tr> <tr> <td></td> <td>& TIP37A2</td> </tr> <tr> <td>38</td> <td>placed boards back in</td> </tr> <tr> <td>39</td> <td>desoldered ground lus from volume</td> </tr> <tr> <td></td> <td>(con 48)</td> </tr> <tr> <td>40</td> <td>contact cleaner, Deox. t DS, facel/42</td> </tr> <tr> <td></td> <td>on pots & switches</td> </tr> <tr> <td></td> <td>· teflon lube on rotor joint</td> </tr> <tr> <td>41</td> <td>reably cleaned ground lus &</td> </tr> <tr> <td></td> <td>resoldered, reattatched panel</td> </tr> </table> ```
You can paste it in https://markdownlivepreview.com/ and see the extraction. This is using the Colab notebook I have shared before.
Which Unicode characters are you mentioning here?