                                            Mons, June 25th 1999

Hi,

guess you want to figure out what that small package you randomly
downloaded from the net actually is (before pressing the DELETE key) ....

Let's say it deals with automatic and multilingual transcription from 
LETTER to SOUND. Example: you can download the American English CMU
dictionary, build learning vectors from it (conv2vec.pl), feed them 
to the ID3 learning algorithm (id3.c), which produces a decision tree.
Then you load that tree into the RUN_TREE program, type words that
are unknown to the system, and it transcribes them more or less 
accurately.... FREE GIFT: it's able to deal with lexical accent
(more or less, and vice-versa). Once that's done, you can utter the word 
with the MBRDICO project mentioned below.

The same operation has been tested with the British English Oxford 
Advanced Learner's Dictionary and the French Brulex lexicon, and is 
narrated in the thrilling "LETTER TO SOUND RULES FOR ACCENTED LEXICON
COMPRESSION", Vincent Pagel, Kevin Lenzo and Alan W. Black, Proc.
ICSLP'98, Sydney.

These programs are free software; you can redistribute them and/or modify
them under the terms of the GNU General Public License as published by
the Free Software Foundation, version 1. Please quote the ICSLP paper in
any scientific publication using this work.

----------------------------------------------------------------------

If you have not yet pressed the delete key, you can compile 2 programs in
the TRAIN_TEST directory:

	 1) build the decision tree from learning vectors and the classes
	 they belong to -> id3

	 to compile on a Unix box: gcc -O6 -o id3 id3.c -lm
	 it's ANSI C, so portability is not an issue

	 Choose a learning vector file:
Example of learning vectors coming from CMU with a 7-letter span:
        @ - - - - a b a n d
        b - - - a b a n d o
        {1 - - a b a n d o n
        n - a b a n d o n i
        d a b a n d o n i n
        @ b a n d o n i n g
        n a n d o n i n g -
        I n d o n i n g - -

The class (a phoneme in our example) is the first element of the
vector. The size of the vector doesn't matter. If you're working in the
field of grapheme-to-phoneme conversion, you can automatically build
the vector file with conv2vec.pl from an aligned dictionary. We
provide an excerpt of cmu.align in the TEST directory. It contains
lines like:

abbreviate @ b _ r i1 v i EI t _ 
abbreviated @ b _ r i1 v i EI 4 @ d 
abbreviates @ b _ r i1 v i EI t _ s 

The special '_' symbol stands for EPSILON (meaning that the letter
doesn't issue a phoneme). If you don't have an aligned dictionary, we
provide an alignment package in the ALIGN directory. Yep, written in
PERL, with kind regards from Kevin Lenzo. For languages more vicious
than English with regard to letter/phoneme association (namely French),
I provide another package in ALIGN_C++, which includes an ngram-style
algorithm; the documentation explains more.

So if you want to work with a 3-letter context on the left and on the
right, plus a phonemic feedback of 1, just write:

conv2vec.pl 3 1 0 < TEST/cmu.align | id3 - > cmu.tree

Using pipes avoids the creation of large intermediate files. The
output format is "binary", which just means that the output will look
like: [ 4 C a A b B ], which is equivalent to:

  switch (feature[4]) {
      case 'a': return(A);
      case 'b': return(B);
      default:  return(C);
  }

Once the .tree file is constructed (2 or 3 minutes for the full CMU
dictionary), you can test the performance with the program given in
the next section.

The "conv2vec.pl" program is flexible enough to allow phonemic
feedback (second argument) and POS (Part Of Speech) tags. If the
entries of your training file look like:

record NOUN r E1 k r= _ d
record VERB r I k r=1 _ d

Typing: conv2vec.pl 3 0 1 < my_entries 

means "use a 3-letter context on the left, the same on the right,
0 phonemic feedback, and skip 1 item as it is a POS tag". This
information is stored in the header of the vector file in the
mnemonic form: LLLTRRR S (3 left letters, target letter, 3 right
letters, 1 POS tag). The run_tree program fetches that header from
the decision tree so it can pass the proper vector for the decision
to take place.

A new feature lets you specify the direction of the phonemic feedback:
if the feedback is > 0 it means "transcribe from right to left",
otherwise it means "transcribe from left to right". It can make a
difference with the phonemic feedback, depending on the language.

2) run the decision tree -> run_tree

	 to compile: gcc -O6 -o run_tree run_tree.c

	 run it:

run_tree cmu.tree cmu.align

        the output looks like:

        Read 860711 vectors, 847406 success (98.454185/100)

if you want to limit the search depth in the decision tree, give the depth
as a last argument:

run_tree cmu.tree cmu.align 7
Read 860711 vectors, 847019 success (98.409222/100)

Note: for the figures above, the train set and the test set are the same,
but it gives a clue about the popularity of that 7-letter context :-)

To test the small example we give, just type:

run_tree TEST/cmu.tree TEST/cmu.align 

it will display:

Read 99 words, 79 success (79.797980/100)
Read 666 letters, 645 success (96.846847/100)
Nbnode=335

We encourage you to test the verbose debug mode:

run_tree TEST/cmu.tree TEST/cmu.align 100 1 2


If you have still not pressed the delete key, we're interested in bug
reports, enhancements and so on. If you're interested in talking
dictionaries, you may want to download the MBRDICO package, which is
based on decision trees and generates a minimal prosody to utter the
words (another "advantage": the MBRDICO package implements the
equivalent of run_tree in true C++):

			 http://tcts.fpms.ac.be/synthesis/mbrdico

To get an audio output, use the MBROLA speech synthesizer:

			 http://tcts.fpms.ac.be/synthesis

3) We provide in Letter2Phone a C++ class equivalent to run_tree that
   you can include in your programs to handle the decision
   trees. 

4) If you want to compute the generalization performance of your tree,
   use the ten_fold_cross_valdation perl script. It splits your aligned
   dictionary randomly into ten subsets, runs on every combination
   of 90% train / 10% test, and provides the average accuracy + standard
   deviation. 

5) Freely available dictionaries:

	          French: ftp://ftp.ulb.ac.be/pub/packages/psyling/Brulex/
	American English: http://www.speech.cs.cmu.edu/cgi-bin/cmudict
	British  English: ftp://ota.ox.ac.uk/pub/ota/public/dicts/710/

Others can be found in the comp.speech FAQ.

ACKNOWLEDGEMENTS:

This work was achieved in collaboration with Kevin Lenzo (CMU), who
provided the original ID3 implementation in PERL, as well as several
alignment algorithms.

Thanks to Alan Black, who independently reproduced the results with
his amazing WAGON and proposed alternatives for the alignment
procedure. Check "ISSUES IN BUILDING GENERAL LETTER TO SOUND RULES",
Alan W. Black, Kevin Lenzo, Vincent Pagel, Proc. 3rd ESCA/COCOSDA
Int. Workshop on Speech Synthesis, Jenolan Caves, Australia, 1998,
pp. 77-80.

Have fun

	Vincent Pagel
------------------------------------------------------------------------------
Vincent PAGEL               Labo. Traitement du Signal et Theorie des Circuits
email: pagel@tcts.fpms.ac.be                     Faculte Polytechnique de Mons
tel: /32/65/374133  fax:374129             31, bvd Dolez, B-7000 Mons, Belgium
