Monday, January 26, 2009

Let's Code Together..

..right now..ohh yeah..in a Sweet Harmony. Whoever was around in 1993-1994 might still remember this song by The Beloved..in a video that cleverly caught your eye with the possible nudity of a dozen models while blurring their private parts with some creative ..techniques.:))
While this started as an introductory digression, I do not wish to digress further into the world of the incidental, so I will get right to the point. For several weeks now I have been coding my latest dataset, and along with my planned future additions this will (hopefully) make a great and flexible tool for various research questions at the plant, firm and country level. Heck, I might throw in some of that multi-level analysis, if possible. And although hopes are high right now and the anxiety is building over the possible future ramifications of this project, the coding is a daunting task. It is going faster than anticipated, and with less stress, for unrelated external reasons; still, sitting still for hours checking and double-checking everything, matching names et al., is not as fun as it looks.
The Publishing Conundrums and Some Surprisingly Fast Rewards

In the end I opted for a more trade-oriented journal, as suggested by Mr. X, the editor of my previous try, and it seems that he was right. Lesson #1: do your homework before submitting!! On the other hand, don't get me wrong: the former journal fit like a glove in terms of the geographical and comparative scope of the work debated there. However, Lesson #2: The editor is ALWAYS right! So there is no point in being upset if your work doesn't make it..which brings me to Lesson #3: Search & you will find it. There are so many economics journals nowadays that if you have indeed pursued a new endeavor, or done something new or interesting, you will find THE ONE which is best suited for your work. However, keep in mind the final lesson of this small publishing-grievance article: "Small can be beautiful! (but especially a lot faster)". I have heard that it takes 2-3 years to publish in second-tier A journals, and from my experience even with lesser-ranked journals in Economics it takes on average 12-15 months...so if you have the right stuff for the next AER or JEL issue, don't hold your breath, and un-cross your fingers...it might take a while..
On the other hand, exceptions are always welcome. I have just benefited from one, which made me happy and quite optimistic that not the whole world functions in the same agnostic way.

Sunday, January 11, 2009

Choosing the best OCR software for you

Proved to be a daunting task (for me at least). I was trying to get some scanned data into Excel. However, I ran into multiple problems:
- scanning quality (some pages were too dark for OCR)--> hence, unusable
- misrecognition (getting an O instead of a 0 and Gs instead of 6s can become a common pain in the @@$)
- complex tables (this is the biggest challenge, since messed-up columns are hard to fix in post-processing).
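For the misrecognition problem at least, a bit of post-processing can catch the easy cases. Here is a minimal Python sketch, not something I actually ran on my data: it assumes you already know which column is supposed to be numeric, that commas are thousands separators, and that the look-alike pairs in CONFUSIONS (a hypothetical list, apart from the O/0 and G/6 pairs mentioned above) match the errors in your scans. Anything it cannot turn into a clean number is flagged for manual checking rather than guessed at. It does nothing for dark pages or scrambled columns.

```python
import re

# Hypothetical map of letters that OCR tends to confuse with digits.
# Only O->0 and G->6 come from my own experience; the rest are assumptions.
CONFUSIONS = str.maketrans(
    {"O": "0", "o": "0", "G": "6", "l": "1", "I": "1", "S": "5", "B": "8"}
)

# A plain integer or decimal number, optionally negative.
NUMBER = re.compile(r"-?\d+(\.\d+)?")


def fix_numeric_cell(cell):
    """Swap look-alike letters for digits in a cell that should be numeric.

    Returns the cleaned string, or None if the result is still not a number.
    """
    # Assumption: commas are thousands separators and can be dropped.
    candidate = cell.strip().translate(CONFUSIONS).replace(",", "")
    if NUMBER.fullmatch(candidate):
        return candidate
    return None  # leave it for manual checking instead of guessing


def fix_numeric_column(cells):
    """Clean every cell of a numeric column; unfixable cells become None."""
    return [fix_numeric_cell(c) for c in cells]
```

For example, fix_numeric_column(["1O5", "3G", "n/a"]) gives ["105", "36", None], so the None entries are the ones you still have to look up in the scan yourself.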
Choosing an OCR software
I tried several programs (under both Win and Mac): first my personal favorite..a small, simple OCR program called Able2Extract Professional. It does the job for simple tables, usually from clean-cut PDFs, but in most cases it doesn't go beyond that. Then I moved up to the big guns: ABBYY FineReader Pro 9.0, IRIS Readiris Pro 11.5.6 and OmniPage Professional 16.0. However, none of the above blew me away. ABBYY is terribly slow but seems to have a few more options for customization; OmniPage is the fastest and gives the best quality, but I had trouble getting it to do what I wanted.
Conclusion:
In the end, none of the above could do both FAST and HI-Q OCR of the text, considering the difficulties associated with my PDF files. My Chinese names and other foreign firms were painful to distinguish even for me, not to mention for any OCR software, so I opted for manual data recognition (MDR) and just entered everything myself in Excel. Lots and lots of hours and nerves wasted, but I think it would have taken the same (or more) hours just to follow, correct and change the OCR outputs.