GRAPHEME PHONEME ALIGNMENT PACKAGE:

We assume that the graphemic string is always at least as long as the phonemic one. If not,
you need to introduce pseudo-phonemes (for example k+s and g+z in English, written as a
single symbol such as g-z).

Once you have a dictionary that begins like oald.dic:

aberystwyth n a1 b @ r i1 s t w i th 
addressograph n @1 d r e1 s ou g r aa f 
afghans n a1 f g a n z 
agra n aa1 g r @ 
albania n a1 l b ei1 n i@ 
alexandra n a1 l i g-z aa1 n d r @ 
(Excerpt from the Oxford Advanced Learner's Dictionary.
  Note that the second field is the part-of-speech tag.)
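
The length assumption stated at the top can be checked before running the aligner.
The Python sketch below is not part of the package. It assumes whitespace-separated
fields (the word, one POS tag, then the phonemes) as in the excerpt, and the helper
names parse_entry and violates_length_assumption are hypothetical.

```python
def parse_entry(line, skip=1):
    """Split a dictionary line into (word, phonemes).

    `skip` is the number of tag fields (e.g. the POS tag) between the
    word and its phonemes; this mirrors the excerpt, not the Perl code.
    """
    fields = line.split()
    return fields[0], fields[1 + skip:]


def violates_length_assumption(line, skip=1):
    """Return True if the entry has more phonemes than letters."""
    word, phonemes = parse_entry(line, skip)
    return len(phonemes) > len(word)


if __name__ == "__main__":
    sample = [
        "agra n aa1 g r @",
        "alexandra n a1 l i g-z aa1 n d r @",  # x mapped to pseudo-phoneme g-z
        "alexandra n a1 l i g z aa1 n d r @",  # g and z kept apart: too long
    ]
    for line in sample:
        word, _ = parse_entry(line)
        print(word, violates_length_assumption(line))
```

Entries flagged by such a check are the ones that need a pseudo-phoneme before alignment.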

You can automatically derive the alignment with:

iterativelyAlign.pl oald.dic 1

(* The 1 here indicates that one POS tag field must be skipped, since it is useless for alignment *)
After several iterations one gets _oald.*.aff and _oald.*.ali, respectively the affinity matrix
and the aligned dictionary, which begins with:

aberystwyth n a1 b @ r i1 s t w i th _ 
addressograph n @1 _ d r e1 _ s ou g r aa f _ 
afghans n a1 f g _ a n z 
agra n aa1 g r @ 
albania n a1 l b ei1 n i@ _ 
alexandra n a1 l i g-z aa1 n d r @ 
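
In the aligned excerpt every letter of the word lines up with exactly one symbol, and _
appears where a letter receives no phoneme of its own. The Python sketch below makes that
pairing explicit. It is not part of the package: pair_alignment is a hypothetical helper,
and reading _ as a null phoneme is inferred from the excerpt.

```python
def pair_alignment(line, skip=1):
    """Pair each letter of an aligned entry with its phoneme symbol.

    `skip` is the number of tag fields between the word and the symbols.
    Raises ValueError if the entry does not have one symbol per letter.
    """
    fields = line.split()
    word, symbols = fields[0], fields[1 + skip:]
    if len(word) != len(symbols):
        raise ValueError(
            "%s: %d letters but %d symbols" % (word, len(word), len(symbols))
        )
    return list(zip(word, symbols))


if __name__ == "__main__":
    for letter, symbol in pair_alignment("afghans n a1 f g _ a n z"):
        print(letter, symbol)
```

For "afghans", the letter h is the one paired with _.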

These aligned words can then be fed into conv2vec.pl.

Adding one more parameter lets you remove from the alignment
the words whose score is too low:

iterativelyAlign.pl oald.dic 1 1

This is helpful as a post-processing step on the alignment; for example:

a  . b . c 
EI b i s i

is obviously a dangerous alignment to learn from :-)
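
How iterativelyAlign.pl computes a word's score is not described here. Purely as an
illustration of the filtering idea, the Python sketch below supposes that each
(letter, phoneme) pair has an affinity between 0 and 1 and that a word's score is the
mean affinity of its pairs. The affinity values, the default for unseen pairs, the
threshold and the function names are all made up for this example.

```python
def word_score(pairs, affinity, default=0.0):
    """Mean affinity of the (letter, phoneme) pairs of one aligned word."""
    if not pairs:
        return 0.0
    total = sum(affinity.get(pair, default) for pair in pairs)
    return total / len(pairs)


def filter_alignments(aligned, affinity, threshold):
    """Keep only the aligned words whose score reaches the threshold."""
    return [pairs for pairs in aligned
            if word_score(pairs, affinity) >= threshold]


if __name__ == "__main__":
    # Toy affinities, invented for illustration only.
    affinity = {
        ("a", "aa1"): 0.6, ("g", "g"): 0.9, ("r", "r"): 0.9,
        ("a", "@"): 0.6, ("a", "EI"): 0.3,
    }
    good = [("a", "aa1"), ("g", "g"), ("r", "r"), ("a", "@")]
    bad = [("a", "EI"), (".", "b"), ("b", "i"), (".", "s"), ("c", "i")]
    kept = filter_alignments([good, bad], affinity, threshold=0.5)
    print(len(kept))
```

With these toy numbers the misaligned "a . b . c" entry scores far below the threshold
and is dropped, while the "agra" entry is kept.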
