Monday, December 10, 2007

AuthorBot

So let's say that you take the sum of all human English literature, and feed it into a computer program. Would that computer program, if properly written, be able to make judgments about the nature of literature? My guess is yes. The tricky part is the "if properly written" clause.

When we talk about emergent properties, we're talking about properties that arise from a system, rather than properties that are encoded in the system. Some people would call these "high-level phenomena". What I want to talk about isn't emergent properties, but something that's sort of the opposite: deriving a system's basic rules from an examination of relevant data. This is something that we do all the time in the sciences. We conduct studies to try to figure out what's really going on in the world. This is the basis of the scientific method.

Literature isn't something whose basic rules we really need to figure out. As humans, we naturally know language, and our understanding of it doesn't come from the basic rules but from patterns that get built up in our brains. Humans are naturally suited to language from birth. That's why it takes babies so little time to go from gibberish to forming words and sentences; they do this without knowing what a word or sentence is. It makes a lot of sense, really, because how could someone tell them what a word or sentence is without using language in the first place? Language is an emergent property of the way the human brain is structured. The structure of the human brain is, in turn, an emergent property of our genetic code: your DNA doesn't code for every single neuron in your brain. This dynamic system that doesn't depend on hard-coding is why the human brain (and body) can handle all sorts of different experiences and environments. And on top of that, what DNA codes for is, in the long run, determined by another emergent process called evolution.

I've digressed. Let me bring it back: in the first generation of artificial intelligence, programmers attempted to hard-code hundreds and thousands of rules, hoping to make something intelligent. They failed, for the simple reason that intelligence is an emergent property. Perhaps the logic was that by piling so many rules and their exceptions on top of each other, something could be produced that resembled intelligence. Philosopher John Searle makes a related argument against the possibility of so-called "strong AI" (read: human-level). He argues that a mere system of rules can't produce intelligence, because it wouldn't be anything more than what was put into it. I would agree, so long as we confine our definition of artificial intelligence to programs which can't create their own rules.

So then, to create intelligence, we need a system that can develop its own rules. What I want to make is a program that can understand literature. The rules (other than a few bootstrap and admin rules) will be developed from a vast amount of data, rather than being designed by us to create emergence. The idea is that the bootstrap rules will be able to look at repeating sequences and statistical correlations to make new rules.

So we feed the program 88,000 books. I use this number only as an example, and because it's the number of e-books currently supported by the Amazon Kindle. The program runs through a book, notes statistically significant repetitions, and analyzes for structure. For example, if we assume that the computer starts without knowing words or grammar, then in running through the book the program would make a note of the ways in which characters are arranged. It would see that certain letters appear in certain ways. For example: Q is usually followed by U; X is usually preceded by E or A; one space is rarely followed by another; a quote symbol usually has another quote symbol close by; strings of characters are ended with a period. These are basic rules, rules that the program can figure out without knowing anything about the system itself. The program still wouldn't know what a word is. But if, through one of our hard-coded admin rules, we ask it whether a set of characters is likely to have a space or punctuation on either side of it, it would be able to give us a pretty good judgment of that (which is how we would define a word in a purely character-based way).
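To make the character-level idea concrete, here's a minimal sketch in Python. The toy corpus and variable names are my own inventions for illustration; the real program would run over the full library, but the principle is the same: tally which character follows which, knowing nothing about words.

```python
from collections import Counter

# Toy corpus standing in for the 88,000 books.
corpus = "The quick cat sat. The quiet queen quoted a quaint quote."

# Tally which character follows each 'q', knowing nothing about words
# or grammar -- pure character-arrangement statistics.
followers = Counter()
for a, b in zip(corpus, corpus[1:]):
    if a.lower() == "q":
        followers[b.lower()] += 1

# In this corpus, every 'q' turns out to be followed by 'u'.
print(followers)
```

The same counting, run over every character pair instead of just Q, yields the whole family of rules above (spaces rarely doubled, quotes appearing in pairs, and so on).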

The next step, once the program has written new rules into itself about what it believes to be valid constructions in our system, is for the program to go up a level. It would have to look at the sequences of letters that keep appearing and determine both their frequency and their location within the data set. For example, it would have to see that "cat" appears as a standalone word (separated by spaces/punctuation), embedded among other characters (as an unrelated thing), or as part of another word (in compounds). In the case of "cat", it would see "cat", "catty", "categorically", "scat", etc. But it would also see that the sequence "categorically" appears almost exclusively as a word on its own. From that, the program would derive rules about word frequency.
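A rough sketch of that measurement, again in Python with a made-up toy corpus (the function name `standalone_fraction` is mine, not anything from the original plan): of all the places a character sequence occurs, what fraction of them have a space or punctuation on either side?

```python
import re

# Toy corpus; the real data set would be thousands of books.
corpus = ("The cat sat on the mat. Scat! The cat was categorically "
          "a catty cat, the cattiest cat.")

def standalone_fraction(seq, text):
    """Of all occurrences of `seq`, what fraction stand alone as a
    word (bounded by spaces or punctuation on both sides)?"""
    total = len(re.findall(re.escape(seq), text, re.IGNORECASE))
    alone = len(re.findall(r"\b" + re.escape(seq) + r"\b", text, re.IGNORECASE))
    return alone / total if total else 0.0

print(standalone_fraction("cat", corpus))            # 0.5
print(standalone_fraction("categorically", corpus))  # 1.0
```

"cat" shows up half the time as a word and half the time buried inside "scat", "catty", and so on, while "categorically" only ever appears on its own: exactly the distinction the program needs to learn.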

From a study of which words appear where, it would have to be able to derive other properties. It would find out that punctuation marks are significant in determining word order. And when looking at word order, it would eventually (with the application of statistics) find that certain types of words tend to follow other types of words. Nouns are usually followed by verbs, and vice versa (with adjectives sometimes separating them). It would be able to see that many words share character sequences within them, at certain locations within the sequence - we call these prefixes and suffixes. Both of those, along with word locations in relation to other word locations, are essential to knowing whether a thing is a verb, noun, adjective, etc.
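The suffix part is easy to sketch. Assuming the program has already segmented words (the toy vocabulary below is mine, purely for illustration), it can just count trailing character sequences across the vocabulary; the ones that recur across many words are candidate suffixes.

```python
from collections import Counter

# Toy vocabulary standing in for words the program has already segmented.
words = ["quickly", "quietly", "slowly", "happily", "walking",
         "talking", "running", "walked", "talked", "jumped"]

# Count trailing character sequences of length 2 and 3; endings shared
# by many different words are candidate suffixes.
endings = Counter()
for w in words:
    for n in (2, 3):
        if len(w) > n:
            endings[w[-n:]] += 1

print(endings.most_common(3))  # '-ly' tops the list
```

Run the same loop from the front of each word and you get candidate prefixes; cluster words by the positions they occupy relative to each other and you start to get noun/verb/adjective classes.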

So far this is all stuff that we could hard-code, and to be sure, scientists have laboriously done it. Those programs could make words without spelling errors (usually pulled from a dictionary database) and sentences with proper grammar (the sentences themselves being emergent). Intelligence it was not. The next stumbling block for our theoretical program is the big one: meaning. That's what separates a clever trick like making sentences from something really astounding.

How would the program determine meaning? This is why we need thousands of books. By analyzing where words show up in relation to each other, and figuring out that some words with the same root are statistically correlated despite prefixes and suffixes (such as "like" and "likely"), the program can come up with a new rule that postulates those derivations to be related. By looking at words that show up near each other, and the ways they do so, the program would be able to postulate both synonyms and antonyms. It would make new rules for tense and new rules for point of view. With a large enough data set, the muddy rules of English could be mapped automatically. It still wouldn't really know what anything meant, but it would be able to make rules that describe patterns. It would know that "dog" can be preceded by "fat" or "skinny", and that once a particular dog has been described that way, it will often be described that way again (or by one of that word's synonyms). And with enough of these associations, it would see that one synonym set tends to be associated with another synonym set. In that way, it would get to "know" (make new predictive rules about) things. For example: cats are agile, or rocks are hard.
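The association step can be sketched too. Here, assuming sentences have already been segmented, a co-occurrence table counts which words appear near which (the toy sentences are mine; a serious version would weight counts against each word's overall frequency rather than using raw tallies):

```python
from collections import Counter, defaultdict

# Toy sentences standing in for segmented text from thousands of books.
sentences = [
    "the agile cat jumped", "the cat was agile", "an agile cat ran",
    "the rock was hard", "a hard rock fell", "the hard rock stayed",
]

# For each word, count every other word it shares a sentence with.
# Strong pairings become candidate associations.
cooc = defaultdict(Counter)
for s in sentences:
    ws = s.split()
    for w in ws:
        for v in ws:
            if v != w:
                cooc[w][v] += 1

print(cooc["cat"].most_common(2))   # 'agile' ranks highest for 'cat'
print(cooc["rock"].most_common(2))  # 'hard' ranks highest for 'rock'
```

That's the whole mechanism behind "cats are agile" and "rocks are hard": not understanding, just a predictive rule extracted from which words keep showing up together.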

So we have grammar rules, and we have a rudimentary form of meaning, so it's likely that this program could make a sentence like "The rock is hard." But the real challenge is to produce multiple sentences that maintain a thought. Again, we need statistical analysis to give our program this ability. This is the same kind of analysis as at the earlier levels, just bigger. It would have to know the types of sentences. And once it got those, it would have to see (again, through statistical analysis) that sentences vary within a paragraph, rarely repeating their types. Of course, it would also need to define "paragraph", but that wouldn't be too hard if we've come this far: a paragraph is any group of sentences set apart by the carriage-return symbol.
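At the character level, that paragraph definition really is this simple (toy text again, and I'm using the newline character as the "carriage-return symbol"; sentences here are just period-delimited runs, per the rule the program derived earlier):

```python
# Toy text: two-sentence paragraphs, with a blank line before the last.
text = ("The rock is hard. It sat there.\n"
        "The cat is agile. It jumped.\n\n"
        "A new scene began.")

# A paragraph is any run of sentences set apart by the newline symbol;
# a sentence is a run of characters ended by a period.
paragraphs = [p for p in text.split("\n") if p.strip()]
sentences = [s.strip() for p in paragraphs for s in p.split(".") if s.strip()]

print(len(paragraphs))  # 3
print(len(sentences))   # 5
```

With paragraphs in hand, the program can tabulate sentence types within each one, which is exactly the statistic it needs to notice that types rarely repeat back-to-back.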

So once it knows both what a paragraph is and how paragraphs are typically structured, it needs to look at chapters, and section breaks, and entire books, and compare and contrast those both within and across their containing structures. Patterns would emerge, and those would be codified into further rules (ones that are much muddier than the lower-level rules). Here comes another tricky part: once it knows all of these rules, that might be enough. It might be able to write a halfway decent novel from that. But if it isn't enough, then it needs to be able to figure out (again, through pattern analysis) that words talk about each other, both explicitly and implicitly. And once it understood that ... well, the world would be its oyster.

Imagine the following: "AuthorBot, write me a three-hundred-page novel in the style of Dickens, set in 1970s New York." A few seconds later, you're printing it out for easy reading. That's the end goal, but as you can see, there are many layers that need to be built up for it to happen. This all rests on the assumption that meaning can be derived solely by looking at a language, without any knowledge of the outside world. And someday, when my brain is big and strong, and the computers can handle the data flow, I'll program it (unless someone does it first). Consider: Moore's law predicts that in about ten years, computers will be roughly 100 times more powerful than they are today.

If it's possible, it will be done.
