Link Grammar Parser

by Davy Temperley, John Lafferty and Daniel Sleator
(this variant maintained by Dom Lachowicz - <domlachowicz@gmail.com> and Linas Vepstas - <linasvepstas@gmail.com> )

News

October, 2014: link-grammar 5.1.3 released! See below for a description of recent changes.

The 5.0.0 version of Link Grammar now uses a new license: the LGPL v2.1 license. Older versions remain available under the BSD license. This license change was made to allow greater participation in the project.

The new version includes the Persian and Arabic systems, which were previously distributed separately. It also includes prototype, experimental dictionaries for Hebrew and Turkish, and an expanded Lithuanian dictionary. In addition, the programming interfaces for Python and OCaml are now integrated, joining those for Java and Common Lisp. A shell script to run the JSON network parse server is included.

What is the Link Grammar?

The Link Grammar Parser is a syntactic parser of English, Russian, Arabic and Persian (and other languages as well), based on link grammar, an original theory of English syntax. Given a sentence, the system assigns to it a syntactic structure, which consists of a set of labelled links connecting pairs of words. The parser also produces a "constituent" (Penn tree-bank style phrase tree) representation of a sentence (showing noun phrases, verb phrases, etc.). The RelEx extension provides dependency-parse output.
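As a rough illustration of what "a set of labelled links connecting pairs of words" looks like, the sketch below stores a linkage as a list of links between word positions. The sentence, the `Link` type and the `describe` helper are invented for this example, and the links are written by hand rather than produced by the parser. W, D, S and O are real link-type names, shown here without their usual subscripts.

```python
from typing import List, NamedTuple


class Link(NamedTuple):
    left: int   # position of the left word
    right: int  # position of the right word
    label: str  # link type


# Hand-written example; not actual parser output.
words = ["LEFT-WALL", "the", "cat", "chased", "a", "mouse"]
links = [
    Link(0, 2, "W"),  # wall to the subject
    Link(1, 2, "D"),  # determiner to noun
    Link(2, 3, "S"),  # subject to verb
    Link(3, 5, "O"),  # verb to object
    Link(4, 5, "D"),  # determiner to noun
]


def describe(words: List[str], links: List[Link]) -> List[str]:
    """Render each link as 'left-word --LABEL-- right-word'."""
    return [f"{words[l.left]} --{l.label}-- {words[l.right]}" for l in links]


for line in describe(words, links):
    print(line)
```

Each link joins exactly two words, so a whole parse is just this list plus the word sequence.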

Did the AbiWord team write Link Grammar?

In large part, no. The project is the brainchild of Davy Temperley, John Lafferty and Daniel Sleator, all university professors. It is the product of a decade of academic research into grammar, and is founded on a theory backed by numerous publications. Its original homepage is hosted by Carnegie Mellon University.

So, then what is it doing @ AbiSource.com?

The AbiWord team had a concrete need - to integrate a grammar checking feature into AbiWord. The best choice, they felt, was to build upon Temperley et al.'s successful Link Grammar project.

However, in order for the link-grammar project to be useful to them and to the greater Free Software world, the AbiWord community felt that a variety of changes to the project would be necessary. While they did have success (a few years ago) convincing the authors to release Link Grammar under a GPL-compatible license, there was no practical way to continue project development and maintenance at the CMU website. So the AbiWord community took it under its wing and has nurtured the project since.

Ongoing development by OpenCog

Ongoing development of link-grammar is being primarily guided by the Open Cognition project, where the parser plays an important role in the OpenCog natural language processing subsystem. Research and implementation are ongoing; current work includes investigations into unsupervised learning of language, unsupervised learning of morphology, semantically guided parsing and grammatically induced word-sense disambiguation.

A sibling project, RelEx, uses constraint-grammar-like techniques to extract dependency relations and assorted additional linguistic information, including FrameNet-style framing and reference (anaphora) resolution. The dependency output is similar to that of the Stanford parser. Its performance is comparable to the Stanford PCFG parsing model, and it is more than three times faster than the Stanford "lexicalized" (factored) model.

The NLGen and NLGen2 projects provide natural language generation modules based on, and compatible with, link-grammar and RelEx. They implement the SegSim ideas for NL generation. See the following NLGen demos: Demo of Virtual Dog Learning to Play Fetch via Imitation and Reinforcement, AI Virtual Dog's Emotions Fluctuate Based on Its Experiences, Demo of Embodied Anaphora Resolution and AI Virtual Dog Answers Simple Questions about Itself and Its Environment.

Notable changes from the Carnegie-Mellon Link Grammar package include:

  • Actively maintained! New releases typically happen bi-annually.
  • Many bug fixes and large performance improvements.
  • Russian dictionaries!
  • Expanded English dictionaries, with many thousands of new words; dramatically improved parse coverage for a wide variety of constructions.
  • Merger of BioLG project changes, for improved parsing of biomedical text. This includes enhanced entity recognition, and precise identification of numeric quantities.
  • New bindings, including Ruby, Python, Perl, Lisp, Java and OCaml.
  • Support for UTF8 Unicode; Arabic and Persian dictionaries; prototype German dictionary.
  • Multi-threading support; a standard build system; pkg-config integration; dynamic/shared library support; fixes for non-Linux platforms, including Windows, MacOSX, FreeBSD.

Downloading Link Grammar

The system can be downloaded either as a tarball, or via git. The current stable version is Link Grammar 5.1.3 (October, 2014). Older versions are available here. Unstable, development versions are available via the link-grammar github repository.

Documentation

One of the best ways to obtain a solid, easy-to-understand overview of the parser is to review the original papers describing it, here, here, here and here. There is an extensive set of pages documenting the dictionary; specifically, the names of links and their meanings, as well as how to write new rules. There is also a short primer for creating dictionaries for new languages. The documentation for the programming API is here. Documentation for additions made in the 4.0 release is on the improvements page. A fairly comprehensive bibliography of papers written before 2004 is here (mirror).

Mailing Lists

The mailing list for Link Grammar discussion is at the link-grammar google group.

Bug Tracker

Bug reports, patches, RFEs, etc. are gladly welcomed.

Disclaimer

Link grammar is a natural language parser, not an artificial intelligence. This means that there are many sentences that it cannot parse correctly, and many others for which it generates multiple parses. There are also entire classes of speech that it cannot parse, such as Valley-girl speak. Link grammar does best on "newspaper English": medium-length sentences written with good grammar, proper punctuation, and proper capitalization. It don't do 733t speek, etc. In particular, it has problems with the following "registers" and types of writing:

  • Phrases (that are not a part of a complete sentence)
  • Bulleted lists, such as this.
  • Quotations within sentences (and parenthetical remarks). These can be handled by an appropriate front-end that separates out the quotations from the rest of the text.
  • Slang speech and words, such as 733t warez d00dz, although it can certainly guess from context if the slang is sufficiently grammatical.
  • Long run-on sentences. These can generate thousands of alternative parses in a combinatorial explosion.
  • Certain "registers", such as newspaper headlines; for example, "Thieves rob bank."

In addition, it has a variety of "bugs": it currently has trouble with "if...then..." constructs, compound queries ("who did it, and why?"), lists, "...not only...but also..." constructs, certain types of idiomatic phrases, certain types of "institutional utterances", and so on. The goal of the project is to eventually fix all of these cases; progress is ongoing.


Languages

Ranked in order of maturity.

English
The main English documentation is here.
Russian
A set of Russian dictionaries providing full coverage for the language has been incorporated into the main distribution as of version 4.7.10 (March 2013). An older version, from which these are derived, can be found at http://slashzone.ru/parser/. By Sergey Protasov. Includes link documentation (mirror) and subscript (morphology) documentation (mirror). Russian morpheme dictionaries can be had at http://aot.ru.

Documentation for the links and for the word classes is available as a list of examples.

Persian
The Persian dictionaries from Jon Dehdari have been incorporated into the main distribution, as of version 5.0.0 (April 2014). This includes a copy of the Persian stemming engine, as significant morphology analysis needs to be performed to parse Persian.
Arabic
The Arabic dictionaries from Jon Dehdari have been incorporated into the main distribution, as of version 5.0.0 (April 2014). These are derived from the older, original version. [Mirror] These require the Aramorph stemming package, which is included.
German
A small German dictionary is available as a part of the distribution. It contains roughly one thousand words. A brief description is provided here.
Lithuanian
A very small Lithuanian prototype dictionary has been created. It contains a few hundred words. A few basic sentences parse just fine; the current version focuses on morphological analysis coupled with grammatical analysis. Documentation is here.

A very rudimentary Lithuanian dictionary has been created; almost nothing works so far. Documentation is here.

Hebrew
A very small Hebrew prototype dictionary has been created. It contains a few dozen words. Almost nothing works correctly (yet).
Turkish
A very small Turkish prototype dictionary has been created. It contains a few dozen words. Almost nothing works correctly (yet).
French, Luthor project
The Luthor project aims to develop a set of scripts to automatically construct Link Grammar linkage dictionaries by mining Wiktionary data. Current efforts are focusing on French. (This project appears to be defunct).

Adjunct Projects

The default distribution for Link Grammar includes bindings for Java, Python, OCaML, Common Lisp, and AutoIt, as well as a SWIG FFI interface file. Additional language bindings, and some related projects, are listed below:

Windows binary
A binary .exe for MS Windows is available at WinLinkGrammar which includes the compiled Link Grammar, Corpus Statistics, and SQlite3 files. The zip file also has the required regex2.dll and installation instructions.
RelEx Semantic Relation Extractor
RelEx is an English-language semantic relationship extractor, built on the Link Parser. It can identify subject, object, indirect object and many other relationships between words in a sentence. It will also provide part-of-speech tagging, noun-number tagging, verb tense tagging, gender tagging, and so on. RelEx includes a basic implementation of the Hobbs anaphora (pronoun) resolution algorithm.
Delphi bindings
Delphi (Pascal) is a popular development environment for Windows. The LaKraven Page provides the source for Delphi bindings, as well as pre-compiled DLLs for Windows.
Perl bindings
The perl bindings, created by Danny Brian, have been updated. See the Lingua-LinkParser page on CPAN. There is also a tutorial written against an older version of the bindings; some details may be different.
Psi Toolkit (Perl)
The Psi Toolkit, an NLP toolkit aimed at linguists and NLP engineers, includes bindings for link-grammar, via perl.
Ruby bindings
There are two different packages providing Ruby bindings: Ruby Link Grammar, which is up-to-date and currently maintained, and Link Grammar 4 Ruby, which is wildly out-of-date (it's for version 4.2.2) and is unmaintained. You only need one!
Pre-parsed Wikipedia
Parsed versions of various texts, including all articles from a May 2008 dump of Wikipedia, as well as a partial parse of an October 2010 dump, are available at http://gnucash.org/linas/nlp/data/

Of related interest

Genia tagger
The Genia tagger is useful for named entity extraction. BSD license source.
After the Deadline
After the Deadline is a GPL-licensed language-checking tool. If you just want to have your text proof-read, this is probably a good choice.

Recent Applications and Publications

Some recent uses and applications of the Link Grammar Parser are shown below. There is also an extensive bibliography on the CMU website (mirror) referencing several dozen older (pre-2004) papers pertaining to the Link Grammar Parser.

Some miscellaneous facts:

  • Any categorial grammar can be easily converted to a link grammar; see section 6 of Daniel Sleator and Davy Temperley. 1993. "Parsing English with a Link Grammar." Third International Workshop on Parsing Technologies.
  • Link grammars can be learned by performing a statistical analysis on a large corpus: see John Lafferty, Daniel Sleator, and Davy Temperley. 1992. "Grammatical Trigrams: A Probabilistic Model of Link Grammar." Proceedings of the AAAI Conference on Probabilistic Approaches to Natural Language, October, 1992.

Recent Changes

Version 5.1.3 (7 October 2014)

This release continues with fixes for build-breaks for Apple OSX.

  • More fixes for build breaks on Apple OSX.
  • Minor fixes involving "to do"

Version 5.1.2 (4 October 2014)

The most serious fix in this release is a build-break fix for Apple OSX Mavericks.

  • Fix greeting: "How do you do?"
  • Fix indirect object in 'what' questions: 'To what do you owe your success?'
  • Fix assorted questions with verb "to be".
  • Compile fixes for Apple OSX version "Mavericks"

Version 5.1.1 (23 September 2014)

The most serious fix in this release is a fix involving parse ranking in the Java API, which was causing RelEx to generate incorrect parse rankings for certain sentences.

  • Minor post-processing cleanup.
  • English dict: Fix questions with "it".
  • swig: add missing API functions sentence_split, dictionary_get_lang
  • Swap order of post-processing and bad morphology rejection.
  • Fix handling of ellipsis when there's missing whitespace.
  • Java: API bugfix/change: costs should have been doubles not ints.
  • Fat-linkage code: fix it so it compiles again.
  • Sat-solver: re-enable it so that it runs.

Version 5.1.0 (29 August 2014)

This version includes a number of important changes. One of these is that the connectors can now be given a direction (head and tail indicators), so that link-grammar dependencies can now be true, hierarchical dependency arrows. This is of marginal importance for English, where dependency directions are implicit, but is vital for free-word-order languages, where bi-directional links are not enough.

Another important change is that costs can now be arbitrary floating point numbers. This is particularly useful for providing fine-grained parse ranking. The LG cost system assigns a "cost" to every connector, and the sum-total of costs for a sentence determines the parse ranking. Since costs are additive, they behave as entropies (log P -- the logarithm of a probability: probabilities are multiplicative, logarithms are additive).
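The additive behaviour described above can be checked numerically. The sketch below is not the parser's ranking code; the function names, probabilities and candidate parses are invented, and it assumes the sign convention cost = -log P, so that a lower total cost corresponds to a more probable parse and ranks first.

```python
import math


def cost_from_probability(p: float) -> float:
    """Assumed convention: cost = -log P, so likelier choices cost less."""
    return -math.log(p)


def total_cost(costs) -> float:
    """Costs are additive: the cost of a parse is the sum of its parts."""
    return sum(costs)


# Illustrative probabilities for three independent connector choices.
probs = [0.5, 0.25, 0.8]
costs = [cost_from_probability(p) for p in probs]

# Adding costs corresponds to multiplying probabilities.
joint = math.prod(probs)
print(math.isclose(total_cost(costs), -math.log(joint)))

# Fractional costs allow fine-grained ranking (hypothetical parses).
parses = {"parse_a": [0.0, 1.0, 0.5], "parse_b": [0.0, 0.05]}
ranked = sorted(parses, key=lambda name: total_cost(parses[name]))
print(ranked)
```

With integer-only costs the two hypothetical parses could not be separated by anything smaller than a whole unit; floating point costs remove that restriction.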

Under the covers, there's been some major work on the tokenization (splitting sentences into words) and morphology (splitting words into morphemes) code. This work is ongoing, and should eventually result in much better support for non-English languages.

Other notable changes include an updated Russian dictionary, and an assortment of changes to the English dictionary. An intriguing step towards phonology: LG can now distinguish between the use of the determiners "a" and "an" preceding nouns that start with consonants or vowels. Whether fancier phonology support is possible is a curious question.

  • Updated Russian dictionaries from Sergei Protasov.
  • Added morphology-based unknown-word handling for Russian, from Sergei.
  • Fix up fat-linkage code, which was recently broken...
  • API cleanup: many command-line options never belonged in the API.
  • New emoticon support was clobbering certain dictionary words.
  • Fix: "Go to spot X", "It happens at time T."
  • Add a dozen missing verbs.
  • Minor work on greetings.
  • Add mechanism for denoting fractional costs in the file-backed dict.
  • Fix: broken handling of gerunds (due to bad verb-wall connectors)
  • Major redesign of morpheme splitting mechanism (from AmirP)
  • Minor extensions to support numeric formulas, e.g. 1 + 1 = 2.
  • Remove fat linkage support from the SAT solver.
  • Enable build of SAT solver by default.
  • Fix multiple bugs with unit stripping.
  • Add bounds-checking to the C API.
  • Fix the old disjunct-printing implementation.
  • Add support for easy-to-use link direction indicator.
  • Add random morphology generator tool.
  • Partial support for phonetic use of "a" vs. "an" for English.
  • Rework how coordination between conjunctions works: "either... or ...", etc.
  • Major redesign of tokenization mechanism (from AmirP)

Version 5.0.8 (30 April 2014)

  • Fix handling of initial letters in ordered lists.
  • Fix another serious error in constituent printing, introduced in 4.8.0.
  • New emoticon support was clobbering certain number expressions.
  • Misc English dict fixes, more verb-wall connectors.

Version 5.0.7 (29 April 2014)

  • Compile fixes in SAT solver.
  • Add missing verb-wall connectors for is, hasn't, haven't, hadn't, etc.
  • Remove verb-wall connector for imperative verbs.
  • Fix serious error in constituent printing, introduced in 5.0.3
  • Fix old bug in command-line handling of options.
  • Fix parsing of various ordered lists, including some tables of contents.

Version 5.0.6 (18 April 2014)

  • Fix: JSON output format missing brace; from Matt Kruse.
  • Fix: Serious error in Russian morphology printing.

Version 5.0.5 (17 April 2014)

  • Fix packaging bug with the English dictionary.

Version 5.0.4 (16 April 2014)

  • Expanded unit tests for capitalization.
  • Fix who questions: "Who are they?", "Who are you?", etc.
  • Provide verb-wall linkage for many questions.
  • Add Biblical naming idioms: "Lud, son of Shem, ..."
  • Fix MacOSX build break.
  • Fix the 'make clean' target to not remove critical files.
  • Fix broken emoticon support in English dict.
  • Remove obsolete entity detection tokens from English dict.
  • Fix broken equation parsing.

Version 5.0.3 (13 April 2014)

  • Minor memory usage optimization
  • Fix unit test: suppress printing of empty word, and of morphology.
  • Fix: Swig and python were meant to be optional, not required!

Version 5.0.2 (10 April 2014)

  • Expanded unit tests
  • Fix another sqlite3-dev build break

Version 5.0.1 (9 April 2014)

  • Dictionary debugging print fixes from Amir P
  • Print summary of parse statistics when in batch mode (from AmirP)
  • Generalize the notion of prefix/suffix to arbitrary classes (Amir P)
  • Fixes for German adjectives.
  • Fix build break when sqlite3-dev not installed.
  • Fix regression in Russian morphology handling.

Version 5.0.0 (1 April 2014)

We are proud to announce a major new release of the Link Grammar Parser! It contains many important changes and new additions. One of the most significant changes is that the license has been changed from the BSD license to the LGPL. This was done to enable considerably more flexibility in accepting contributions to the project: it seems that few are particularly interested in contributing to a BSD-licensed project. This change has enabled folding in some new work:

  • Arabic and Persian dictionaries! These were previously maintained as separate add-ons. Including them as part of the distribution should make it easier for interested users.
  • A new 'bindings' directory, containing code for Java, Python, Common Lisp, OCaML and AutoIt programming languages. The Python bindings are an updated version of the older pylinkgrammar-0.2.13 bindings. A SWIG interface file should make it easy to create other language bindings as well.
  • Improved morphology support. This will be invisible to most users, but it lays the groundwork for adding Hebrew support to the parser.
  • Expanded Lithuanian support. This remains a simplistic prototype, but it now performs a more sophisticated morphological analysis.
  • Experimental Turkish and Hebrew dictionaries.
  • A demo of the JSON parser server: it shows how to run the server, which will accept raw sentences on a socket and return the parsed forms.
  • Some slightly incompatible changes to the API: it was time for some housekeeping.
  • Misc minor updates to the English Language dictionaries.
  • Preliminary work for SQL-backed dynamic dictionaries. This should enable certain types of automated language learning.

The full changelog is shown below.

  • License upgrade to LGPLv2.1
  • Arabic dictionaries, from Jon Dehdari
  • Persian dictionaries, from Jon Dehdari
  • Support for Hebrew tokenization, from Amir P.
  • Fix wild-card matching for user-supplied word lookup.
  • Prototype Turkish dictionary from Can Bruce.
  • Re-arrange programming language bindings directory.
  • Adopt the orphaned/unsupported pylinkgrammar Python bindings.
  • Deprecate the obsolete CNode interface.
  • Provide low-level perl bindings.
  • Adopt the orphaned/unsupported OCaML bindings.
  • Support affirmative replies: "Who did it?" "John's evil twin."
  • Expanded Lithuanian dictionary.
  • Minor disjunct printing fixes.
  • Fix: "Mary is too XXX to talk to."
  • Prototype Hebrew dictionary from Amir P.
  • Change !suffixes flag to !morphology.
  • Introduce a bi-directional connector, for free-word-order languages.
  • Introduce a symmetric-AND operator, for free-word-order languages.
  • Add demo shell script for running the JSON parse server.
  • Bugfix: Java server failing when input sentence has commas in it!
  • New !test and !debug commands for selective debugging support.
  • Print post-processing rejection message, when !bad is enabled.
  • Remove some deprecated functions for C API.
  • Remove all deprecated functions from Java API.
  • Initial support for an SQL-backed dynamic dictionary.

Version 4.8.6 (2 February 2014)

  • Fix minor OSX compiler warnings.
  • Check for presence of Java ant before assuming it is there.
  • Fix crash on certain sentences containing equals sign.
  • Fix parsing of lists (blah, blah and blah).
  • Fix build break for uClibc systems (Gentoo).
  • Allow ungrammatical usage of 'ages' instead of 'aged'.
  • Fix crash on certain sentences containing words with periods.

Version 4.8.5 (5 January 2014)

  • Update memory usage accounting; fix accounting bugs.
  • Fix Java garbage collection bug.
  • Fix numerous compiler warnings in the SAT-solver code.
  • Fix build-break involving multiple declaration of 'Boolean'.

Version 4.8.4 (31 December 2013)

  • Fix build break for Mac OSX.

Version 4.8.3 (30 December 2013)

  • Create new msvc12 build files, restore old msvc9 files.
  • Revert location of the Windows mbrtowc declaration.
  • Add verb-wall connector for present participles.
  • Fix build-time include file directory paths.
  • Provide the 'any' language to enumerate all possible linkages.
  • Fix recognition of U+00A0, c2 a0, NO-BREAK SPACE as whitespace.
  • Improve parse-time performance of exceptionally long sentences.
  • Fix crash on certain sentences containing equals sign.

Version 4.8.2 (25 November 2013)

Add missing file, needed for Java bindings.

  • More MSWindows UTF-8/multi-byte fixes (for Russian).
  • Add missing JSONUtils file.

Version 4.8.1 (21 November 2013)

Minor updates, unless you are using Java, or the Russian dictionaries on Windows, in which case, you'll need this update.

  • Ongoing work on viterbi.
  • Updated MSVC9 project files from Jand Hashemi (Lucky--)
  • Fix important bug in Java services: return top parses, not random ones.
  • Java: for the link-diagram string, do not limit to 80 char term width.
  • Windows: UTF-8 fixes so that Russian works in most MSWindows locales.

Version 4.8.0 (24 October 2013)

This is the start of a new version series, containing an important change to the English language dictionary. Three new link types are introduced: WV, CV and IV. These are used to connect the left-wall to the primary verb of the sentence (WV), to connect the ruling clause to the primary verb of a dependent clause (CV), and a similar link for certain infinitive verbs (IV). The goal of these links is to make it easier to locate verbs, and thus to provide a more direct mapping from the link-grammar formalism to a dependency parse (as dependency parses always put the verb at the root of a sentence).

These are not the first links that explicitly indicate root verbs: the AF, CP, Eq, COq and B links already play this role. The new WV, CV and IV links round out this capability and do so in a very general form. See WV, CV and IV for details.

With this release, we expect that all (non-auxiliary) verbs in a sentence will be linked either to the wall, or to a controlling parent. We also expect some additional fixes and tightening-up in future releases, especially with regard to comparative sentences.
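To show how the WV and CV links make verbs easy to locate, here is a minimal sketch that scans a linkage for them. The sentence, the tuple layout and the `verbs_by_link` helper are invented for illustration, and the links are written by hand rather than taken from parser output. The sketch assumes the verb sits at the right-hand end of each WV or CV link, and matches on the label prefix because link names may carry lowercase subscripts.

```python
# Hand-written, illustrative linkage for "John thinks Mary left";
# this is not actual parser output.
words = ["LEFT-WALL", "John", "thinks", "Mary", "left"]
links = [
    (0, 2, "WV"),  # left-wall to the primary verb of the sentence
    (1, 2, "S"),   # subject to verb
    (2, 4, "CV"),  # ruling clause to the verb of the dependent clause
    (3, 4, "S"),   # subject to verb
]


def verbs_by_link(words, links, prefix):
    """Return the right-hand words of all links whose label starts with prefix."""
    return [words[right] for _, right, label in links if label.startswith(prefix)]


root_verbs = verbs_by_link(words, links, "WV")
clause_verbs = verbs_by_link(words, links, "CV")
print(root_verbs, clause_verbs)
```

A dependency-style consumer could treat the word found through WV as the root of the sentence, and the words found through CV as the heads of dependent clauses.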

This release also includes a variety of fixes to the Java API/server. In addition, some ancient, deprecated C code was removed.

  • Fix "he answered yes"
  • Support bulleted, numbered lists.
  • New link types from Lian Ruiting, for identifying the head-verb.
  • Java: fix bug when totaling WordNet word-sense score.
  • Java: add info to README about using the JSON parse server.
  • Java: remove many deprecated functions.
  • C API: remove some deprecated functions.
  • Java: fix silent failure when library is not found.
  • Java: Add support for fetching the ASCII-art diagram string.
  • Java: Fix insane language selection initialization.
  • Fix: "The pig runs SLOWER than the cat."
  • Fix: conjoined superlatives: "... the longest and the farthest."
  • Fix: "inside" can be used with conjunction: "near or inside..."
  • Fix: conjoined question modifiers: "exactly when and precisely where..."
  • Fix: issue 59: crash/corruption when dictionary opened twice.
  • Fix: assorted exclamations!

A list of older changes can be found here.

License

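Versions of Link Grammar prior to 5.0.0 were released under what is essentially the BSD license; as noted in the News section, version 5.0.0 and later use the LGPL v2.1. A copy of the BSD-style license can be found below, and at the original authors' CMU site.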

Copyright (c) 2003-2004 Daniel Sleator, David Temperley, and John Lafferty. All rights reserved.
Copyright (c) 2003 Peter Szolovits
Copyright (c) 2004,2012,2013 Sergey Protasov
Copyright (c) 2006 Sampo Pyysalo
Copyright (c) 2007 Mike Ross.
Copyright (c) 2008,2009,2010 Borislav Iordanov.
Copyright (c) 2008-2014 Linas Vepstas
Copyright (c) 2014 Amir Plivatsky

Redistribution and use in source and binary forms, with or without modification, are permitted provided that the following conditions are met:

  1. Redistributions of source code must retain the above copyright notice, this list of conditions and the following disclaimer.
  2. Redistributions in binary form must reproduce the above copyright notice, this list of conditions and the following disclaimer in the documentation and/or other materials provided with the distribution.
  3. The names "Link Grammar" and "Link Parser" must not be used to endorse or promote products derived from this software without prior written permission. To obtain permission, contact sleator@cs.cmu.edu

THIS SOFTWARE IS PROVIDED BY DANIEL SLEATOR, DAVID TEMPERLEY, JOHN LAFFERTY AND OTHER CONTRIBUTORS "AS IS" AND ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE ARE DISCLAIMED. IN NO EVENT SHALL THE COPYRIGHT HOLDERS OR CONTRIBUTORS BE LIABLE FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, EXEMPLARY, OR CONSEQUENTIAL DAMAGES (INCLUDING, BUT NOT LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES; LOSS OF USE, DATA, OR PROFITS; OR BUSINESS INTERRUPTION) HOWEVER CAUSED AND ON ANY THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY, OR TORT (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE.