Optical Character Recognition (OCR)

What to do when someone asks you to type up 100 pages of text from clippings of old newspapers.

In the webscraping and Twitter API tutorials, we worked with texts that are already stored in digital form, so that getting them into R is just a matter of removing all the HTML code. But other texts (lots of texts) are not so digitally accessible. Maybe you’re interested in historical archives living in a dark basement. Or old press releases living in a PDF. In either case, we need a method that can recognize text in images, and convert it into plain text. This is a job for Optical Character Recognition (OCR).

Optical Character Recognition

OCR is a notoriously difficult task for computers, which is why the “are you a human” tests on some websites might ask you to type a bunch of numbers and letters within a blurry or distorted image.

But if the text is in straight lines on a white background, off-the-shelf OCR packages do a pretty good job. The workhorse OCR engine is called Tesseract, and it’s available in R through a package called tesseract.

Let’s see if it can recognize the text in that xkcd comic above, using the ocr() function.

library(tesseract)

xkcd <- ocr(image = 'img/suspicion.png')

xkcd
[1] "Fie, HED UR 1S JUST... NOW AND THEN | BEFORE THIS GOES ANY FURTHER,\nONLINE CHATS You MENTION PRODUCTS YOU | T THINK WE SHOULD GO GET TESTED,\nTHESE. PAST FEW LIKE, AND... I UORRY. | YOUKNOU, TOGETHER. VK Coupes Tene\nONTHS, I = PROVES Cape.\nUSA. \\ ROM KE. WHAT? HONEY... \\ You Dont Taist mee\nYOU, ROB. I. JUST WANT To BE\n4 SURE. O S\nio) #) OKAY, PINE SANS Uy\nd z PUBRARY: YOURS? \"5\n\\ iS TH MORE THAN A\n4) WE OCOD. SeAMBOT! OUR LOVE\nVe = Goooere, usa, 4 WAS REAL!\nana VC\n"

Clearly it cannot make out “Library” and “Kittens”, but by my count it reads about 65% of these handwritten words correctly. They’re not in the right order though. Tesseract reads from left to right, top to bottom, and does not understand things like comic panels or speech bubbles. If we’re just using a bag of words representation and we don’t care about word order, then we can use the ocr_data() function, which splits the input into a dataframe with one row for each word. It even includes a handy confidence column, which tells you how confident the model is in its prediction.

xkcd <- ocr_data(image = 'img/suspicion.png')

head(xkcd)
# A tibble: 6 × 3
  word    confidence bbox        
  <chr>        <dbl> <chr>       
1 Fie,           0   4,6,37,41   
2 HED           41.0 50,6,74,41  
3 UR            48.8 85,6,107,41 
4 1S            55.7 170,9,185,21
5 JUST...       82.5 191,9,318,23
6 NOW           65.9 238,5,260,33
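
One way to use the confidence column is to drop words that Tesseract is unsure about before building a bag of words. The sketch below is only illustrative: it rebuilds the six rows printed above as a plain data frame so it runs without the image file, and the cutoff of 60 is an arbitrary assumption rather than a recommended value.

```r
# Rebuild the first six rows of the ocr_data() output shown above
xkcd_words <- data.frame(
  word = c("Fie,", "HED", "UR", "1S", "JUST...", "NOW"),
  confidence = c(0, 41.0, 48.8, 55.7, 82.5, 65.9),
  stringsAsFactors = FALSE
)

# Hypothetical cutoff; the right value depends on the image and the task
min_confidence <- 60

# Keep only the words Tesseract is reasonably sure about
confident_words <- xkcd_words[xkcd_words$confidence >= min_confidence, ]

# Lowercase and strip punctuation to get bag-of-words tokens
tokens <- tolower(gsub("[[:punct:]]", "", confident_words$word))
tokens
```

With the full data frame returned by ocr_data(), the same filter applies unchanged; only the cutoff would need tuning against a few words you have checked by hand.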

But if we do care about word order, then we’ll need to be more careful about pre-processing the image before conducting OCR. For example, let’s try to read in the text on page 3 of this document about the California Supreme Court. The pdftools package can convert the page to an image, which we can then OCR.

library(pdftools)

pdf_convert('img/SOJ.pdf', pages = 3, dpi = 600, filenames = 'img/SOJ.png')
Converting page 3 to img/SOJ.png... done!
[1] "img/SOJ.png"
text <- ocr('img/SOJ.png')

text
[1] "Remarks  . ~~\n_ By Chief Justice Rose Elizabeth Bird*  - a,\nIt is a pleasure to be here this afternoon. I own making. Our court system is an indepen-\nwould like to thank the Conference of dent branch of government, and there is\nDelegates for affording me this opportunity to strength to be derived from thai unity. But un-\nshare with you some thoughts about issues til we judges begin to see ourselves as part of :\nwhich judges, lawyers, and citizens of Califor- an organic whole, that strength will be |\nnia willbe facing during thecoming year. _ . dissipated and wasted. .\nOur democratic form of government is Sadly, our justice system is marred by\nblessed with a court system that is particular- fractionization and segmentation at all levels.\nly well designed to resolve the disputes that Judging is by its nature often a solitary and\narise in a heterogeneous society. There is time-consuming task. Its demands tend to\nstrength in the diversity of views found in our isolate judges from one another, even though\nsociety, but that strength cannot be drawn they may serve on the same court. This isola-\nupon unless conflicting views are moderated tion is intensified when judges are sitting at\nand balanced and excesses checked. different levels and is sometimes characteriz-\nThe courts hold a unique position among —_ ed by feelings of elitism and of superiority\nour democratic institutions. In a sense, they over judges of “lower courts,” __ |\nrepresent one of the last bastions of par- _ dn turn, these antagonistic feelings have 5\nticipatory democracy in our society. They — often led judges to take a competitive, rather i\nstand as a symbol of the great strength of our than a cooperative, view of one another ~- an .\n- representative form of government. The in- attitude which only further deepens the sense\n. dividual disputants go directly before a judge of isolation and fragmentation.\nora jury to raise and resolve a specific issue. 
It is time for this cycle to stop. Our justice §\nIn no other context within our governmental system is an interrelated whole, and the more\n, system does an individual have the opportuni- that judges in one part of the system unders-\nty to take a problem directly to the decision ._ tand how the other levels function, the more i\nmaker who represents the full force and power effective we will be in meeting the complex i\nof that particular branch of government. realities of the society we serve. The judges at i\nThis direct interchange between the in- different levels of the system have different\ndividual and the state is at the heart of the but equally important roles to play. :\ndemocratic process. AS more barriers are. There are many things that can be done to\nraised between the litigant and the decision- | promote the spirit of cooperation of which I\nmaker, the participatory nature of the ex- speak. And this spirit can be achieved without i\nperience is diminished. We must protect this imposing any constraints on the power of\nunique heritage and strive to preserve the judges torun their own courtrooms or to make :\nvalues it represents. local administrative decisions, which they are\nThe barriers to which I refer are in part uniquely well situated to do. :\nthe resuit of the increasingly complex society An important step toward this sense of\nin which we live. However, I fear that to some common judicial venture could be taken by\nextent these barriers are of the judiciary’s providing state funding for our trail courts.\nnen This issue, though not a new one, was recently\n* California Supreme Court. This is a speech given brought into sharp focus by the passage of pro-\nbefore the Canference of Delegates at the State position 13 and the availability of local funding\nBar Convention in San Francisco, Cal. on Septem- for the trial courts. As you may know, Califor-\n, ber 10, 1978. nia ranks last among all states in the percen- |\n4\n"

Notice two things. First, ocr() performs much better with the typed text than it did with the handwritten comic. Second, the sentences are scrambled, because Tesseract reads each line left to right across both columns on the page, interleaving the left column with the right. So we need to first crop the image into two columns and read each column separately.

For that, we’ll turn to the magick package.

Image Pre-Processing

library(magick)

# First, read in the image with magick
page3 <- image_read('img/SOJ.png')
page3
# Next, crop it into two images with image_crop()
# syntax is width x height + left offset + top offset
page3_left <- image_crop(page3, '2550 x 4300 + 0 + 1000')
page3_left
page3_right <- image_crop(page3, '2550 x 5000 + 2550 + 1000')
page3_right
# Finally, ocr() each column, then paste the results together
text_left <- ocr(page3_left)
text_right <- ocr(page3_right)

text <- paste(text_left, text_right)

text
[1] "It is a pleasure to be here this afternoon. I\nwould like te thank the Conference of\nDelegates for affording me this opportunity to\nshare with you some thoughts about issues\nwhich judges, lawyers, and citizens of Califor-\nnia willbe facing during thecoming year. _\n\nOur democratic form of government is\nblessed with a court system that is particular-\nly well designed to resolve the disputes that\narise in a heterogeneous society. There is\nstrength in the diversity of views found in our\nsociety, but that strength cannot be drawn\nupon unless conflicting views are moderated\nand balanced and excesses checked.\n\nThe courts hold a unique position among\nour democratic institutions. In a sense, they\nrepresent one of the last bastions of par-\nticipatory democracy in our society. They\nstand as a symbol of the great strength of our\n\n. representative form of government. The in-\n\n. dividual disputants go directly before a judge\n\nora jury to raise and resolve a specific issue.\n\nIn no other context within our governmental\n\n, system does an individual have the opportuni-\nty to take a problem directly to the decision |\n\nmaker who represents the full force and power\n\nof that particular branch of government.\n\nThis direct interchange between the in-\ndividual and the state is at the heart of the\ndemocratic precess. AS more barriers are\nraised between the litigant and the decision- _\nmaker, the participatory nature of the ex-\nperience is diminished. We must protect this\nunique heritage and strive to preserve the\nvalues it represents.\n\nThe barriers to which I refer are in part\nthe result of the increasingly complex society\nin which we live. However, I fear that to some\nextent these barriers are of the judiciary’s\n own making. Our court system is an indepen-\ndent branch of government, and there is\nstrength to be derived from thai unity. 
But un-\ntil we judges begin to see ourselves as part of :\nan organic whole, that strength will be |\n. dissipated and wasted. ,\nSadly, our justice system is marred by\nfractionization and segmentation at all levels.\nJudging is by its nature often a solitary and\ntime-consuming task. Its demands tend to\nisolate judges from one another, even though\nthey may serve on the same court. This isola-\ntion is intensified when judges are sitting at\ndifferent levels and is sometimes characteriz-\ned by feelings of elitism and of superiority\nover judges of “lower courts,” __ }\n_ dn turn, these antagonistic feelings have 5\n_ often led judges to take a competitive, rather i\nthan a cooperative, view of one another ~- an\nattitude which only further deepens the sense\nof isolation and fragmentation.\n\nIt is time for this cycle to stop. Our justice f\nsystem is an interrelated whole, and the more\nthat judges in one part of the system unders-\ntand how the other levels function, the more |\neffective we will be in meeting the complex i\nrealities of the society we serve. The judges at\ndifferent levels of the system have different P\nbut equally important roles to play. :\n\n; There are many things that can be done to :\npromote the spirit of cooperation of which I\nspeak. And this spirit can be achieved without i\nimposing any constraints on the power of\njudges torun their own courtrooms or to make :\nlocal administrative decisions, which they are\nuniquely well situated to do. :\n\nAn important step toward this sense of\ncommon judicial venture could be taken by\nproviding state funding for our trail courts.\n\nThis issue, though not a new one, was recently\nbrought into sharp focus by the passage of pro-\nposition 13 and the availability of local funding\nfor the trial courts. As you may know, Califor-\nnia ranks last among all states in the percen- |\n"
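
The crop geometries above are hard-coded for this particular page. As a rough sketch, a small helper could build equal-width column geometries from the image dimensions instead. The function name column_geometries() is hypothetical, and the 5100 x 6600 pixel page size is an assumption (a letter-size page at 600 dpi, consistent with the 2550-pixel offset used above); in practice, image_info(page3) reports the real width and height. Unlike the hand-tuned crops, this version gives every column the same height.

```r
# Hypothetical helper: build "WIDTHxHEIGHT+LEFT+TOP" strings for n equal columns
column_geometries <- function(width, height, n_cols = 2, top = 0) {
  col_width <- as.integer(width %/% n_cols)               # width of each column
  crop_height <- as.integer(height - top)                 # skip the header area
  left <- (seq_len(n_cols) - 1L) * col_width              # left offset per column
  sprintf("%dx%d+%d+%d", col_width, crop_height, left, as.integer(top))
}

# Assumed page size: 8.5 x 11 inches at 600 dpi
geoms <- column_geometries(width = 5100, height = 6600, n_cols = 2, top = 1000)
geoms
```

Each string in geoms could then be passed to image_crop() in place of the hand-written geometry, and the resulting images passed to ocr() one column at a time.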

Practice Problems

  1. OCR the text on page 4 of SOJ.pdf.

  2. OCR the text in this newspaper clipping about the sinking of the Titanic.

Further Reading

Torres, Michelle, and Francisco Cantú. 2021. “Learning to See: Convolutional Neural Networks for the Analysis of Social Science Data.” Political Analysis, April, 1–19. https://doi.org/10.1017/pan.2021.9.
