What I did on my summer vacation

Almost a year and a half ago, I started working on a secret project at
Google. I wasn't allowed to say anything about it except that it had
something to do with "image processing," which is what I told everyone
who asked (and some who didn't).

As of this morning, the project is *finally* public: Google Print. Not
the wimpy text-only small-excerpts minimal-content Google Print that's
been up for a while, but a new Google Print that actually has full
scans of a lot of books, lets you search the full text, and then shows
you the original page scans. Think Amazon's search-inside-the-book,
but handier and better implemented. (Maybe I'm biased. ;) )

Google Print results are supposed to come up for any search that's
relevant, but putting "books about" (no quotes) in front of the search
seems to nudge it into showing more books. For example,
"books about tolkien" comes up with a little "Book results for
tolkien" subheading, complete with a cute little icon. Click the link
and you get to see 5 pages from the book by using the arrow links. You
can see more by using the "more results" box on the left.

So, what did I do? Page numbers. Google actually scans the physical
books itself to get the images and text out. There's a decent chance
for errors like scanning the same pages twice or skipping a page, so
they needed a way to tell if pages were missing. I wrote code to pull
as many page numbers as possible out of a series of page scans and
check that they were all present and accounted for. And on the side, I
wrote code to check that the images were all in focus too, and tried
to segment the page into header, body, and footer.
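
To give a feel for the "present and accounted for" check, here is a
rough sketch in Python. It is not the code I wrote at Google; the
function name and the input format (one detected page number per scan,
or None when nothing could be read) are made up for illustration. The
idea is just that between two scans with readable numbers, the page
numbers should advance by the same amount as the scan positions.

```python
from typing import Optional


def find_sequence_breaks(
    detected: list[Optional[int]],
) -> list[tuple[int, int, str, int]]:
    """Compare page-number jumps against scan-position jumps.

    `detected` is hypothetical input: one entry per scanned image, in
    scan order, holding the page number read from that image or None.
    Returns (earlier_scan, later_scan, kind, count) tuples, where kind
    is "skipped" or "repeated".
    """
    breaks = []
    prev_index = None
    prev_number = None
    for index, number in enumerate(detected):
        if number is None:
            continue  # unreadable scan: bridge over it
        if prev_index is not None:
            scan_gap = index - prev_index
            page_gap = number - prev_number
            if page_gap > scan_gap:
                # Numbers advanced further than the scans did.
                breaks.append((prev_index, index, "skipped", page_gap - scan_gap))
            elif page_gap < scan_gap:
                # Scans advanced further than the numbers did.
                breaks.append((prev_index, index, "repeated", scan_gap - page_gap))
        prev_index, prev_number = index, number
    return breaks


if __name__ == "__main__":
    # Page 5 scanned twice, page 7 never scanned, one unreadable scan.
    print(find_sequence_breaks([1, 2, None, 4, 5, 5, 6, 8]))
```

An unreadable scan between two readable ones doesn't raise an alarm as
long as the numbers on either side line up with the scan count.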

I don't know for sure, but I suspect that they're also using my code
to come up with the page numbers that get used to label the pages in
the viewer. When it works, you can't really tell, but the current
system seems to have the same failure modes as what I wrote...

For example, do a search for [books about "Game Developer's Market
Guide"] and then do a Search Within This Book for 500 (or just about
anything else). See how it labelled all the page numbers with roman
numerals? This book somehow got scanned backwards, so my code would
completely fail to find the page numbers. When it can't find anything,
it leans slightly toward roman numerals, because they're harder to spot
even when they are there, so it just labelled most of the book with
roman numerals. (It should also have flagged the book to be rescanned,
but actually doing the rescanning wasn't my problem.)
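
To make that failure mode concrete, here is a toy version in Python of
what a roman-numeral fallback could look like. This is purely
illustrative: the names are invented, and the real logic (which only
"slightly prefers" roman numerals) was more involved than labelling
every unread scan by its position.

```python
from typing import Optional

# Value/symbol pairs in descending order, including subtractive forms.
_ROMAN = [
    (1000, "m"), (900, "cm"), (500, "d"), (400, "cd"),
    (100, "c"), (90, "xc"), (50, "l"), (40, "xl"),
    (10, "x"), (9, "ix"), (5, "v"), (4, "iv"), (1, "i"),
]


def to_roman(n: int) -> str:
    """Convert a positive integer to a lowercase roman numeral."""
    if n <= 0:
        raise ValueError("roman numerals need a positive integer")
    parts = []
    for value, symbol in _ROMAN:
        count, n = divmod(n, value)
        parts.append(symbol * count)
    return "".join(parts)


def label_pages(detected: list[Optional[int]]) -> list[str]:
    """Label each scan with its detected number, else a roman numeral.

    Hypothetical fallback: a scan with no readable number is labelled
    with the roman numeral of its 1-based position in the scan order.
    """
    return [
        str(number) if number is not None else to_roman(position)
        for position, number in enumerate(detected, start=1)
    ]


if __name__ == "__main__":
    # Nothing readable at all, as with the backwards-scanned book.
    print(label_pages([None] * 5))
```

If nothing in the book can be read, every label comes out as a roman
numeral, which is roughly what that search result looks like.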

So, yeah. That's what I did. Isn't the project cool?

 - John