I hack, heads-down, on link-grammar every now and then. Yesterday, I fixed another round of broken parse rules, making sure that sentences like “John is altogether amazingly quick”, “That one is marginally better”, “I am done working”, “I asked Jim a question”, and “I was told that crap, too” all parse correctly.
Solving these required adding new rules to the link-grammar dictionary: so, for example, adding the rule “O+ & {@MV+}” to the dictionary entry for “asked”, thus allowing the word “asked” to take a direct object (“…asked Jim…”) and an indirect object (“…a question”).
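For the curious, a dictionary entry is just a word (or word class) followed by a connector expression. A minimal sketch, in the dictionary’s own notation (the real entry for “asked” carries many more connectors than shown here):

    % simplified, illustrative entry -- not the actual 4.0.dict entry
    asked: O+ & {@MV+};

Here O+ links rightward to an object, braces mark the modifier part as optional, and the @ allows any number of such modifier links.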
Patching dictionary entries by hand is tedious and time-consuming. The long-term goal is to get to the point where the system can learn new entries on its own. So I’ve been day-dreaming about how to do that, and about some of the baby steps in that direction.
The first step is to put all of the rules into a database, where they can easily be modified, added to, and deleted. Currently, the rules all live in a file, which can only be hand-edited.
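Just to make the idea concrete, here is a minimal sketch in Python, using SQLite and a made-up two-column schema (nothing here is actual link-grammar infrastructure):

    import sqlite3

    # Hypothetical rule store: one row per (word, connector expression),
    # so entries can be added or deleted programmatically instead of
    # by hand-editing the dictionary file.
    conn = sqlite3.connect("lg_rules.db")
    conn.execute("""CREATE TABLE IF NOT EXISTS rules (
                        word       TEXT NOT NULL,  -- e.g. 'asked'
                        expression TEXT NOT NULL,  -- e.g. 'O+ & {@MV+}'
                        PRIMARY KEY (word, expression))""")
    conn.execute("INSERT OR IGNORE INTO rules VALUES (?, ?)",
                 ("asked", "O+ & {@MV+}"))
    conn.commit()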
A second step is to assign probabilities to the rules: often, multiple parses are possible, yet not every rule is equally likely to lead to a correct parse. So it would be better to have a probability assigned to each rule, indicating how often it leads to a good parse. These probabilities are already needed for the GSoC project of adding a SAT solver to link-grammar. I’ve got a database of tens of millions of word pairs and link types, ready to go for this step.
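One simple way to get such probabilities is to count, over a hand-checked corpus, how often each rule participates in a parse judged good. A sketch, with invented counts and a smoothed maximum-likelihood estimate:

    from collections import Counter

    # Hypothetical counts accumulated from a corpus of checked parses:
    # how often a rule appeared in any parse, and in a good parse.
    good  = Counter({("asked", "O+ & {@MV+}"): 97})
    total = Counter({("asked", "O+ & {@MV+}"): 120})

    def rule_probability(word, expression, smoothing=1.0):
        """Estimate P(good parse | rule used), with add-one smoothing."""
        key = (word, expression)
        return (good[key] + smoothing) / (total[key] + 2.0 * smoothing)

    print(rule_probability("asked", "O+ & {@MV+}"))  # about 0.80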
A third step is to distinguish good parses from bad ones. This is the parse-ranking step: judging each parse for the likelihood of it being good or bad. There are superficial things one can do for parse ranking, and quite complex ones; a simple scheme is sketched below. I am hoping to someday soon reinforce the parse-ranking scores with the results of word-sense disambiguation. Which brings us to the next point:
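On the superficial end, once each rule has a probability, a parse can be ranked by the joint (log) probability of the rules it used. A sketch, with a hypothetical probability table:

    import math

    # Hypothetical per-rule probabilities, as estimated above.
    rule_probs = {("asked", "O+ & {@MV+}"): 0.80,
                  ("asked", "O+"): 0.15}

    def parse_score(rules_used):
        # Sum of log probabilities; unseen rules get a tiny floor value.
        return sum(math.log(rule_probs.get(r, 1e-6)) for r in rules_used)

    parses = [[("asked", "O+ & {@MV+}")], [("asked", "O+")]]
    best = max(parses, key=parse_score)  # picks the higher-probability parse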
Step four: some grammar parsing rules are appropriate only for some senses of a word, but not for others. Link-grammar already accounts for this at a rough scale: it has distinct rules for different parts of speech. These are indicated by tacking a single extra letter onto the end of a word: walk.v (“I walk.v”) versus walk.n (“I took.v a walk.n”). This basic idea can be refined further, into finer divisions than just parts of speech: some word senses can only be used in certain ways, and not others. The technical problem is that 26 letters is not enough… link-grammar is already using some 20 or so of these suffixes. So this mechanism needs to be expanded, somehow.
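One obvious expansion is to allow multi-character subscripts after the dot, so that sense tags are not limited to the alphabet. A sketch (the tags themselves are invented for illustration):

    # Sketch: tokens carry an arbitrary-length sense tag after the dot,
    # e.g. 'walk.v-motion' vs 'walk.n-stroll', instead of one letter.
    def split_subscript(token):
        word, _, tag = token.partition(".")
        return word, (tag or None)

    print(split_subscript("walk.v-motion"))  # ('walk', 'v-motion')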
Step five, where the rubber hits the road: actually learning new rules. This is much more vague; my ideas are still swimming. There are two distinct problems: new words, and fixing the rules for existing words. New words should not be *much* of a problem: try to find synonyms for the word, and assume the new word can be used like its synonym (a sketch follows). Modifying existing rules is much, much harder…
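The synonym heuristic could be as simple as the following sketch, where both the synonym list and the rules table are stand-ins for real resources (say, WordNet and the rule database above):

    # Sketch: guess rules for an unknown word by borrowing the connector
    # expression of a known near-synonym. All data here is invented.
    synonyms = {"grok": ["understand", "comprehend"]}
    rules    = {"understand": "O+ & {@MV+}"}

    def guess_rules(new_word):
        for candidate in synonyms.get(new_word, []):
            if candidate in rules:
                return rules[candidate]  # borrow the synonym's expression
        return None

    print(guess_rules("grok"))  # 'O+ & {@MV+}'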
… so hard, in fact, that I’m not sure I’m ready to engage in that just yet. Some steps can be taken in that direction by looking at minimum-spanning-tree dependency grammars (MST grammars): that is, computing the mutual information of word pairs, and comparing it to the output of link-grammar. This should suggest where new links can be created. Then comes the question: what should the appropriate link type be for this new link? For some words, perhaps synonyms can help; but other words are so unique that they have no synonyms. The verb “to be” is so central to the English language that trying to discover information about it is very difficult: it has no true synonyms, even though a thesaurus will list many near ones.
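The mutual-information computation itself is straightforward. A self-contained sketch of pointwise mutual information over word pairs (the counts are toy numbers; the real ones would come from a large parsed corpus):

    import math
    from collections import Counter

    # Toy counts standing in for corpus statistics.
    pair_counts = Counter({("ask", "question"): 50, ("ask", "door"): 1})
    word_counts = Counter({"ask": 300, "question": 120, "door": 80})
    total_pairs = sum(pair_counts.values())
    total_words = sum(word_counts.values())

    def pmi(left, right):
        """Pointwise mutual information: log2 of p(l,r) / (p(l) * p(r))."""
        p_pair  = pair_counts[(left, right)] / total_pairs
        p_left  = word_counts[left]  / total_words
        p_right = word_counts[right] / total_words
        return math.log2(p_pair / (p_left * p_right))

    print(pmi("ask", "question"))  # strongly positive: a likely link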
I’ve got some infrastructure set up to run this last experiment: I’ve got a fairly large collection of parsed sentences, from which I can build mutual-information pairs, and compare these to link-grammar parses. In fact, I’ve already done so: I use these for parse ranking. (That is, if a link-grammar parse has a high mutual-information content, then I assume it’s a good parse.)
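Concretely, that ranking amounts to summing mutual-information scores over the word pairs that a linkage connects. A sketch, with a stand-in table of precomputed scores:

    # Hypothetical precomputed PMI scores; a linkage is reduced to a
    # bare list of (left-word, right-word) links.
    pmi_table = {("ask", "question"): 2.8, ("ask", "door"): -1.5}

    def linkage_score(links, default=-2.0):
        # Unseen pairs get a pessimistic default score.
        return sum(pmi_table.get(link, default) for link in links)

    assert linkage_score([("ask", "question")]) > linkage_score([("ask", "door")])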
I could turn this around: given a sentence that is parsing badly, I could look for high-mutual-information parses. Perhaps the coverage could be broadened by comparing to parses of approximately synonymous words, reinforced by word-sense disambiguation. But this generally leads to a combinatorial explosion, which, I guess, is to be expected: we know that general intelligence requires a lot of CPU. The trick is to have the patience to set up an experiment, run it, wait for the results, and then do it all over again…
Enough for now. I’m off to work on “related words” — another part of the puzzle.