AI deciphers an 18th-century “indecipherable” ledger in seconds, as blind tests of Google’s mystery model go viral

Recently, a mysterious model on Google AI Studio not only successfully transcribed a merchant’s nearly indecipherable ledger from over 200 years ago, but also corrected a formatting error and clarified an ambiguous entry, demonstrating reasoning ability that stunned historians.

Has Google quietly solved two long-standing problems in AI?

Not long ago, a mysterious model on Google AI Studio caught the attention of netizens, including a historian named Mark Humphries.

He used the nearly indecipherable ledger of an Albany merchant, written over 200 years ago, to test the model’s handwritten text recognition (HTR) ability.

What followed was astonishing.

The mysterious model not only achieved near-perfect accuracy in handwriting recognition, but also corrected a formatting error in the original ledger and clarified an entry that could otherwise have been misread.

This means the model can not only recognize the characters on the page, but also understand the logic and background knowledge behind them.

Moreover, the model demonstrated these abilities without being prompted to do so.

Solving these two major challenges, expert-level handwritten text recognition and inference without explicit rules, would mark a leap in AI model capabilities.

Netizens speculate that the mystery model may be Google’s upcoming Gemini-3, though this has not been officially confirmed.

Cracking the historians’ problem

Mark Humphries is a history professor at Wilfrid Laurier University.

As a historian, he is keenly interested in whether AI has reached human expert-level reasoning in his own field.

Google AI Studio

Humphries therefore chose historical handwriting recognition as his benchmark, a task he considers a gold-standard test of a large model’s overall ability.

Recognizing historical handwriting is not just a visual task; it also requires some understanding of the historical context in which a manuscript was produced.

Without this knowledge, it is almost impossible to accurately identify and transcribe a historical document.

In Humphries’ view, this contextual knowledge is precisely what makes historical documents so difficult to transcribe.

As large models have developed, their HTR accuracy has come to exceed 90%, but the remaining 10% is the hardest and most critical part.

Humphries notes that today’s large models, built on the Transformer architecture, are essentially predictive: their core mechanism is predicting the next token. But the spelling errors and inconsistent styles found in historical documents are inherently unpredictable, low-probability answers.

Therefore, to transcribe ‘the cat sat on the rug’ rather than the far more likely ‘mat’, the model must work against the pull of its training distribution.

This is also why large models struggle to transcribe unfamiliar personal names (especially surnames), obscure place names, dates, and numbers such as monetary amounts.

For example: was a letter written by Richard Darby or Richard Derby? Is the date March 15th, 1762 or March 16th, 1782? Is the bill for 339 dollars or 331 dollars?

When such hard-to-read letters or numbers appear in historical documents, the answer often has to come from other kinds of background knowledge.

Humphries believes this ‘last-mile accuracy’ is the prerequisite for historical HTR to be genuinely usable.

Does predictive architecture have a ‘ceiling’?

To measure the accuracy of handwritten transcription, Humphries and Dr. Lianne Leddy created a test set of 50 documents totaling approximately 10,000 words.

They took all reasonable precautions to ensure, as far as possible, that these documents were not included in any model’s training data.

The test set includes different styles of handwriting (from barely legible scrawl to formal secretary hand), as well as images captured with a variety of tools.

In Humphries’ view, these documents represent the types most commonly encountered by historians like him who work with 18th- and 19th-century English-language documents.

They measure the proportion of transcription errors using Character Error Rate (CER) and Word Error Rate (WER).

Research shows that non-professional transcribers typically have a WER of 4-10%.

Even professional transcription services make some errors; they typically guarantee a WER of about 1%, provided the text is clear and legible.

That, in effect, is the practical ceiling on accuracy.
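
For readers unfamiliar with the two metrics, here is a minimal Python sketch (illustrative only, not the authors’ evaluation code) of how CER and WER are conventionally computed: a Levenshtein edit distance between the reference transcription and the model output, normalized by the length of the reference. The normalize() helper mirrors the “ignoring capitalization and punctuation” variant reported below.

```python
import string

def edit_distance(ref, hyp):
    """Minimum insertions, deletions, and substitutions turning ref into hyp."""
    dp = list(range(len(hyp) + 1))            # row for the empty prefix of ref
    for i, r in enumerate(ref, 1):
        prev, dp[0] = dp[0], i
        for j, h in enumerate(hyp, 1):
            prev, dp[j] = dp[j], min(dp[j] + 1,         # deletion
                                     dp[j - 1] + 1,     # insertion
                                     prev + (r != h))   # substitution (free if equal)
    return dp[-1]

def cer(ref, hyp):
    return edit_distance(ref, hyp) / len(ref)                          # characters

def wer(ref, hyp):
    return edit_distance(ref.split(), hyp.split()) / len(ref.split())  # words

def normalize(text):
    """Drop the error types that rarely change meaning: case and punctuation."""
    return text.lower().translate(str.maketrans("", "", string.punctuation))

ref, hyp = "the cat sat on the rug", "the cat sat on the mat"
print(cer(ref, hyp))   # 3 substituted chars / 22 chars ~ 0.136
print(wer(ref, hyp))   # 1 wrong word / 6 words ~ 0.167
```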

Last year, on Humphries et al.’s test set, Gemini-2.5-Pro performed as follows:

Strict CER was 4% and WER was 11%.

When errors in capitalization and punctuation were excluded (they usually change neither the actual meaning of the text nor its searchability and readability), the rates fell to a CER of 2% and a WER of 4%.

Humphries also found that each generation of models does improve steadily.

Gemini-2.5-Pro’s error rates were about 50-70% better than those of the Gemini-1.5-Pro they had tested a few months earlier, which in turn was about 50-70% better than the GPT-4 they tested initially.

This also confirms the expected scaling pattern: as models grow, their performance on such tasks can be roughly predicted from size alone.

Performance of the new model

They started testing Google’s new model on the same dataset.

The method: upload each image to AI Studio and enter the following fixed prompt:

Your task is to accurately transcribe handwritten historical documents, minimizing the CER and WER. Work word by word and line by line, transcribing the text exactly as it appears on the page. To preserve the authenticity of the historical text, retain spelling errors, grammar, syntax, punctuation, and line breaks as written. Transcribe all text on the page, including headers, footers, footnotes, insertions, page numbers, etc.; where they exist, insert them where indicated by the author.
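
Humphries ran his tests through the AI Studio web interface; for readers who want to reproduce the workflow programmatically, a rough sketch against Google’s google-generativeai Python package might look like the following. The model name, image file, and API key are placeholders, since the mystery model has no public identifier.

```python
import PIL.Image
import google.generativeai as genai

genai.configure(api_key="YOUR_API_KEY")            # key from AI Studio (placeholder)
model = genai.GenerativeModel("gemini-2.5-pro")    # stand-in model name

PROMPT = """Your task is to accurately transcribe handwritten historical
documents, minimizing the CER and WER. Work word by word and line by line,
transcribing the text exactly as it appears on the page. ..."""  # abridged

page = PIL.Image.open("albany_daybook_page.jpg")   # hypothetical scan
response = model.generate_content([PROMPT, page])  # image + instructions in one call
print(response.text)
```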

In selecting test documents, Humphries deliberately chose those with the most errors and the hardest-to-read handwriting.

They feature not only sloppy handwriting but also abundant spelling and grammar errors, a lack of proper punctuation, and wildly inconsistent capitalization.

The goal was simple: to probe the limits of this mysterious model.

In the end, he selected 5 documents from the test set.

The results were astonishing.

On the 5 documents (just over 1,000 words in total, about a tenth of the full test set), the model achieved a strict CER of 1.7% and a WER of 6.5%.

In other words, counting punctuation and capitalization, that is roughly one error every 59 characters.

Moreover, almost all of the errors involved capitalization or punctuation, usually in genuinely ambiguous places; errors at the actual word level were very rare.

If these types of errors are excluded from the count, the error rate drops to CER 0.56% and WER 1.22%.

In other words, this new Gemini model’s HTR performance has reached human expert level.

Cracking a 200-year-old ledger in seconds

Humphries then decided to raise the difficulty.

He brought out the daybook of an Albany merchant from over 200 years ago.

This is a ledger recorded in English by a Dutch shop assistant.

The clerk apparently spoke English poorly: the spelling and letter forms are extremely irregular, and Dutch and English are mixed together.

The accounts were also written in the old-fashioned British pound/shilling/pence format, using the common shorthand of the time: ‘To 30 Gallons Rum @ 4/6 6/15/0’.

This indicates that someone purchased (debited to their account) 30 gallons of rum at 4 shillings and 6 pence per gallon, totaling 6 pounds, 15 shillings, and 0 pence.

For most people today, this non-decimal currency is unfamiliar: 1 shilling equals 12 pence, and 1 pound equals 20 shillings.
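
As a concrete check of that arithmetic (a worked sketch, not anything from Humphries’ setup), the rum entry above can be verified in a few lines of Python:

```python
def to_pence(pounds=0, shillings=0, pence=0):
    """Flatten a non-decimal pound/shilling/pence amount into pence."""
    return (pounds * 20 + shillings) * 12 + pence

def from_pence(total):
    """Convert a pence total back into (pounds, shillings, pence)."""
    pounds, rest = divmod(total, 240)        # 240 pence to the pound
    shillings, pence = divmod(rest, 12)      # 12 pence to the shilling
    return pounds, shillings, pence

unit_price = to_pence(shillings=4, pence=6)  # '@ 4/6' per gallon = 54 pence
print(from_pence(30 * unit_price))           # (6, 15, 0), i.e. the ledger's 6/15/0
```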

Each transaction was entered as it occurred, separated from its neighbors by a horizontal line, with the date written across the middle of the page.

Each transaction is recorded as a debit (Dr, purchase) or credit (Cr, payment).

Some transactions are crossed out, which may indicate they had been reconciled or posted to the customer’s account in the general ledger (roughly, ‘pending’ becoming ‘recorded’).

There was no standard format for such records.

Large models have always struggled with ledgers like this.

Not only is there very little training data; there are also few patterns to exploit: people can buy any quantity of anything, unit prices can be arbitrary, and totals do not round to convenient figures.

Models can often make out names and products, but get completely lost in the numbers.

They often fail to transcribe the figures accurately and tend to confuse unit prices with totals.

On particularly complex pages, a model may effectively break down, repeating certain numbers or phrases over and over, or simply failing to answer.

Google’s new model, however, transcribed the Albany merchant’s daybook pages almost perfectly.

Not only were the figures astonishingly accurate; more interestingly, the model also corrected a small formatting error the clerk had made.

For example, Samuel Stitt bought two punch bowls, which the clerk recorded as ‘2/’ each, meaning 2 shillings apiece; for brevity he omitted the ‘0 pence’. For consistency, the model transcribed it as ‘@ 2/0’, which is in fact the more standard and unambiguous form.

Reading through the transcription, Humphries also spotted a ‘mistake’ that made his hair stand on end.

He saw Gemini transcribe the line “To 1 loft Sugar 145 @ 1/4 0 19 1” as “To 1 loft Sugar 14 lb 5 oz @ 1/4 0 19 1”.

In the 18th century, sugar was sold as hardened cone-shaped loaves, and Mr. Stitt was a shopkeeper buying sugar in quantity for resale.

At first glance, this looks like a hallucination: the model was instructed to transcribe the original strictly, yet it inserted a ‘14 lb 5 oz’ that does not appear in the original.

On closer examination, Humphries realized the model had done something extremely clever.

Gemini correctly inferred that the digits 1, 4, and 5 were a quantity expressed in units of weight, describing the total amount of sugar purchased.

To decode ‘145’ and determine the correct weight, Gemini used the entry’s final total price of 0/19/1, a deduction that requires converting back and forth between decimal arithmetic and two non-decimal systems (pounds/shillings/pence for money, pounds/ounces for weight).

Humphries reconstructed the model’s likely chain of inference:

The unit price of sugar is 1 shilling and 4 pence per unit, which is 16 pence. The total transaction price is 0 pounds, 19 shillings, and 1 penny, which can be converted to 229 pence.

To find the quantity of sugar, divide 229 by 16: 14.3125 pounds, i.e., 14 pounds 5 ounces (0.3125 × 16 oz = 5 oz).

So Gemini concluded that the figure was not “145” but “14 5”, that is, 14 lb 5 oz, and made this explicit in the transcription.
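
The whole chain of inference fits in a few lines; here is an illustrative sketch of the arithmetic Humphries attributes to the model:

```python
unit_price_pence = 1 * 12 + 4            # "@ 1/4": 1s 4d = 16 pence per lb
total_pence = (0 * 20 + 19) * 12 + 1     # "0 19 1": 0 pounds 19s 1d = 229 pence

lbs, remainder_pence = divmod(total_pence, unit_price_pence)
oz = remainder_pence * 16 // unit_price_pence   # leftover pence as 16ths of a lb

print(lbs, "lb", oz, "oz")               # 14 lb 5 oz -> "145" was really "14 5"
```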

In Humphries’ testing, no other model has shown similar performance when asked to transcribe the same document.

The example caught Humphries’ attention because the AI appears to have crossed a boundary that some experts have long claimed existing models cannot cross.

Faced with an ambiguous number, it inferred the missing context, performed a series of multi-step conversions between historical currency and weight systems, and reached the correct conclusion, a process that requires abstract reasoning about the world the document describes.

Humphries believes what may be happening is an emergent, implicit form of inference that spontaneously combines perception, memory, and logic inside a statistical model, rather than anything purpose-built for symbolic reasoning, though he is not yet sure of the underlying mechanism.

If that is right, the ‘sugar loaf entry’ is not just a remarkable transcription; it is a small but clear signal that pattern recognition is beginning to cross into genuine ‘understanding’.

It suggests that large models can not only transcribe historical documents with human expert-level accuracy, but are also beginning to demonstrate an understanding of the economic and cultural systems behind them.

Humphries believes this may mark the beginning of something else: machines becoming capable of genuinely abstract, symbolic reasoning about the world they perceive.

Reference:

https://generativehistory.substack.com/p/has-google-quietly-solved-two-of
