Automatically Generated Inflection Database (AGID)

August 19, 2000
Revision 2

Copyright 2000 by Kevin Atkinson <kevina@users.sourceforge.net>

The file "infl.txt" is an automatically created database of the
inflected forms of words from an insanely large word list.

The latest version can be found at http://aspell.sourceforge.net/wl/.

Entries are in the following form.

<word> <pos>: <inflected forms>

Where <pos> is V for verb, N for noun, or A or adjective or adverb.
If <pos> is followed by a ? that means that the part-of-speech was not
in the part-of-speech database however the inflected forms of the word
where found in the word list.

The inflected forms are in the following order for verbs (except for
the verb "be"):
  <past tense>  [<past participle>]  <-ing form>  <plural form>
and for adjective or adverbs:
  <-er form>  <-est form>
There are two spaces between each form.

A word in parentheses mean that it is considered a less preferred form
of the previous inflection.  Two parentheses means that the word is
even less preferred, etc.  A / between two words means that the two
words are considered almost equal variants or that is is difficult to
tell which one is the primary form.  They are ordered by preference
however sometime this distinction is so slight it is meaningless.  A
"|" between words means that both inflections are used depending on
the meaning of the word.  If the distinction between the two forms can
be described in a word than that word is found after the word in
braces, for example:

  hang V: hung {suspend} | hanged {execute}  hanging  hangs

Notice how there is two spaces between the past tense, -ing form and
plural form but not between the alternate forms of the past tense.  In
general, if the "|" symbol would be needed more than once the words
the entry is split up into multiple lines like so:

  <word> [{explanation}] <POS>: <inflected forms>

However, the past particle as past tense form are considered a single
form. Thus, a "|" may appear more than once when the word contains
both a past participial and past tense form.

A /? between words means that both inflections were found in the word
list but the script was not sure which one to use.  A ~ after a word
means that there is a slight chance that it is the plural of a word.
A ! after a word indicates that the word is likely an inflections of a
similar word (generally one ending in e) and not the current word.  A
? after a word means that the word was not in the word list but if it
was it would be considered an inflected form of the base word.

Fell free to send me corrections to correct any of these questionable
words.  I am mostly interested in the preferred form of the word in
the case of /? or words marked with a ~ that are actually valid.

Words are in mixed case but all accents have been scripted thus words
like caf are instead cafe.

The file "variant" contains a list of alternate inflections.

The file "irregular" contains extra information where a noun or verb
has irregular inflected forms.

The file "dontuse" contains a list of words not to consider an
inflected form of a word if more than one inflected form of a word is
found.

The files "prefixes" and "suffixes" contains a list of common prefixes
and suffixes respectfully.  These files are used by the script to
produce inflected forms for words that end in a word in the
"irregular" file. If the beginning appears in the word list or the
prefixes file and the ending appears in the irregular file I also
consider <prefix>+<irregular inflections>.  If the prefix is 3 letters
or more OR appears in the prefixes file and the suffix is 4 letters or
more OR appears in the suffixes file I consider it the most likely
choice, otherwise I consider it as a possible candidate but not the
most likely choice.

The file "make-infl" is the actual Perl script used to create the
data base.

CHANGES:

From Revision 1 to 2 (August 18, 2000)

  Classified variants as either almost equal, also used, or
  secondary.

  The / is now used to indicate equal variants.  "/?" is now used to
  mean what "/" used to be.

  Lots of additional rules added which greatly improved the results.

COPYRIGHT AND SOURCE:

The final product is under the following copyright, as well as any
copyrights mentioned below.

  Copyright 2000 by Kevin Atkinson

  Permission to use, copy, modify, distribute and sell this database,
  the associated scripts, the output created form the scripts and its
  documentation for any purpose is hereby granted without fee,
  provided that the above copyright notice appears in all copies and
  that both that copyright notice and this permission notice appear in
  supporting documentation. Kevin Atkinson makes no representations
  about the suitability of this array for any purpose. It is provided
  "as is" without express or implied warranty.

The part-of-speech database used is created form the Moby
part-of-speech database which is in the public domain:

    The Moby lexicon project is complete and has
    been place into the public domain. Use, sell,
    rework, excerpt and use in any way on any platform.
    
    Placing this material on internal or public servers is
    also encouraged. The compiler is not aware of any
    export restrictions so freely distribute world-wide.
    
    You can verify the public domain status by contacting
    
    Grady Ward
    3449 Martha Ct.
    Arcata, CA  95521-4884
    
    grady@netcom.com
    grady@northcoast.com

and the WordNet database which is under the following copyright:

    This software and database is being provided to you, the LICENSEE, by
    Princeton University under the following license.  By obtaining, using  
    and/or copying this software and database, you agree that you have  
    read, understood, and will comply with these terms and conditions.:  
  
    Permission to use, copy, modify and distribute this software and
    database and its documentation for any purpose and without fee or
    royalty is hereby granted, provided that you agree to comply with  
    the following copyright notice and statements, including the disclaimer,  
    and that the same appear on ALL copies of the software, database and  
    documentation, including modifications that you make for internal  
    use or for distribution.  
  
    WordNet 1.6 Copyright 1997 by Princeton University.  All rights reserved.  
  
    THIS SOFTWARE AND DATABASE IS PROVIDED "AS IS" AND PRINCETON  
    UNIVERSITY MAKES NO REPRESENTATIONS OR WARRANTIES, EXPRESS OR  
    IMPLIED.  BY WAY OF EXAMPLE, BUT NOT LIMITATION, PRINCETON  
    UNIVERSITY MAKES NO REPRESENTATIONS OR WARRANTIES OF MERCHANT-  
    ABILITY OR FITNESS FOR ANY PARTICULAR PURPOSE OR THAT THE USE  
    OF THE LICENSED SOFTWARE, DATABASE OR DOCUMENTATION WILL NOT  
    INFRINGE ANY THIRD PARTY PATENTS, COPYRIGHTS, TRADEMARKS OR  
    OTHER RIGHTS.
  
    The name of Princeton University or Princeton may not be used in  
    advertising or publicity pertaining to distribution of the software  
    and/or database.  Title to copyright in this software, database and  
    any associated documentation shall at all times remain with  
    Princeton University and LICENSEE agrees to preserve same.  

The word list used is a combination of several word list:

1) Most of the word lists from the Moby Words package:

     10196pla.ces 113809of.fic 21986na.mes 256772co.mpo 354984si.ngl
     3897male.nam 4160offi.cia 4946fema.len 6213acro.nym 74550com.mon
   
   The Moby Word package, like the Part-Of-Speech database is in the
   public domain.

2) The ENABLE2K word lists which is in the public domain:

     The ENABLE master word list, WORD.LST, is herewith formally
     released into the Public Domain. Anyone is free to use it or
     distribute it in any manner they see fit. No fee or registration
     is required for its use nor are "contributions" solicited (if you
     feel you absolutely must contribute something for your own peace
     of mind, the authors of the ENABLE list ask that you make a
     donation on their behalf to your favorite charity). This word
     list is our gift to the Scrabble community, as an alternate to
     "official" word lists. Game designers may feel free to
     incorporate the WORD.LST into their games. Please mention the
     source and credit us as originators of the list. Note that if
     you, as a game designer, use the WORD.LST in your product, you
     may still copyright and protect your product, but you may *not*
     legally copyright or in any way restrict redistribution of the
     WORD.LST portion of your product. This *may* under law restrict
     your rights to restrict your users' rights, but that is only
     fair.

3) All of the word lists in the ENABLE2K Supplemnt which consists of:

     2DICTS.LST  ALSO.LST   LETTERS.LST  OSPDADD.LST  UCACR.LST
     ABLE.LST    LCACR.LST  NOPOS.LST    PLURALS.LST  UPPER.LST

   All of these word lists are also in the public domain.

4) The list of signature words from the YAWL package which is in the
   public domain.

5) The UK Advanced Cryptics Dictionary which in under the following
   copyright:

     Copyright (c) J Ross Beresford 1993-1999. All Rights Reserved.

     The following restriction is placed on the use of this
     publication: if The UK Advanced Cryptics Dictionary is used
     in a software package or redistributed in any form, the
     copyright notice must be prominently displayed and the text
     of this document must be included verbatim.

     There are no other restrictions: I would like to see the
     list distributed as widely as possible.

6) Some extra words found in the Part-Of-Speech database that was not
   found in any of the above word list.

7) Words found in the Jargon File Word List package, available at
   http://aspell.sourceforge.net/wl/, which is in the Public Domain.

8) And finally some extra words that I added myself.  These words can be
   found in the file "extra-words"

The "dontuse", "irregular", and "variant" file was created by me
(Kevin Atkinson) from numerous sources.

