lish words, and nearest aligned neighbors. Martin et al. (2005) reported that this method resulted in absolute improvements of up to 20% as compared with the case of only using limited resources. Tufis et al. (2005) combined two word aligners: one is based on the limited resources[r]
methods, some researchers modeled the alignments with different statistical models (Wu, 1997; Och and Ney, 2000; Cherry and Lin, 2003). Some researchers used similarity and association measures to build alignment links (Ahrenberg et al., 1998; Tufis and Barbu, 2002). However, all of the[r]
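To make the association-measure idea concrete, the sketch below scores candidate links with the Dice coefficient over sentence-level co-occurrence counts; this is a generic illustration with invented toy data, not the specific measures used in the cited works.

from collections import Counter

def dice(cooc, count_src, count_tgt):
    # Dice association score between a source and a target word.
    if count_src == 0 or count_tgt == 0:
        return 0.0
    return 2.0 * cooc / (count_src + count_tgt)

# Toy sentence-aligned corpus (invented for illustration).
bitext = [
    ("the house".split(), "das haus".split()),
    ("the book".split(), "das buch".split()),
]

src_counts, tgt_counts, pair_counts = Counter(), Counter(), Counter()
for src_sent, tgt_sent in bitext:
    for s in set(src_sent):
        src_counts[s] += 1
    for t in set(tgt_sent):
        tgt_counts[t] += 1
    for s in set(src_sent):
        for t in set(tgt_sent):
            pair_counts[(s, t)] += 1

# Propose a link whenever the association exceeds a threshold.
links = [(s, t) for (s, t), c in pair_counts.items()
         if dice(c, src_counts[s], tgt_counts[t]) > 0.9]
print(sorted(links))  # [('book', 'buch'), ('house', 'haus'), ('the', 'das')]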
MT systems, such as statistical phrase-based and syntax-based systems, learn phrase translation pairs or translation rules from large amounts of bilingual data with word alignment. The quality of the parallel data and the word alignment have significant impacts on the learned[r]
on co-occurrence, rare correspondences go unnoticed, even though they may be relevant for applications such as terminology or lexicography. This means that even for these applications, higher recall and precision will give better results. For machine translation, errors in alignm[r]
lack of sufficient part-of-speech information. We have removed all context vectors that were built for a word that was registered in CELEX with a PoS-tag different from 'noun'. But some words are not found in CELEX, and although they are not of the word type 'noun', their context vecto[r]
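A rough sketch of the PoS-based filtering step described above; the dictionary stands in for the CELEX lexicon and all words and vectors are invented. Words absent from the lexicon are kept, mirroring the situation the excerpt describes.

# Hypothetical PoS lookup standing in for the CELEX lexicon.
celex_pos = {"house": "noun", "run": "verb", "book": "noun"}

# Context vectors keyed by word (toy values).
context_vectors = {"house": [0.2, 0.1], "run": [0.4, 0.0], "tree": [0.1, 0.3]}

filtered = {
    word: vec
    for word, vec in context_vectors.items()
    # Remove a vector only if the lexicon explicitly lists a non-noun PoS.
    if celex_pos.get(word, "noun") == "noun"
}
print(sorted(filtered))  # ['house', 'tree']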
to reach an automatic extraction of lexical tuples from the AC Corpus. The AC document collection was constituted when ten new countries joined the European Union in 2004. They had to translate an existing collection of about ten thousand legal documents covering a large variety of subject areas. T[r]
We train model parameters on a development corpus, which consists of hundreds of manually aligned bilingual sentence pairs. Using an n-best approximation may result in the problem that the parameters trained with the GIS algorithm yield worse alignments even on the development corpus. This ca[r]
We proposed a novel framework that incorporates synonyms from monolingual linguistic resources in a word alignment generative model. This approach utilizes both bilingual and monolingual synonym resources effectively for word alignment. Our proposed method uses a latent t[r]
the recall with an acceptable increase of noise. Previous links point to the context from which they originated. Therefore, we can access any pair of features which is available for the context as well as for the linked items themselves. In this way, clue probabilities can be based o[r]
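As an illustration of how per-feature clue scores for a candidate link might be combined into one probability, the sketch below uses a probabilistic disjunction; this is only one plausible combination scheme and is not claimed to be the one used in the cited work.

def combine_clues(clue_scores):
    # Combine independent clue probabilities for one candidate link:
    # P = 1 - prod(1 - c_i), i.e. the link fires if any clue does.
    p = 1.0
    for c in clue_scores:
        p *= (1.0 - c)
    return 1.0 - p

# Hypothetical clues for one word pair: string similarity, PoS match,
# and a co-occurrence score (values invented for illustration).
print(combine_clues([0.4, 0.6, 0.3]))  # 0.832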
HMM, this model estimates the parameters indirectly from various sources, such as word semantic similarity, surface similarity, and a distortion penalty. For a fair comparison, we also use the surface similarity computed as Equation (2) and the position-difference-based distortion
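Since Equation (2) is not reproduced in the excerpt, the sketch below uses stand-in formulations: a character-overlap ratio for surface similarity and a penalty that decays with the normalized position difference. Both are assumptions for illustration, not the paper's actual definitions.

from difflib import SequenceMatcher

def surface_similarity(src_word, tgt_word):
    # Stand-in surface similarity based on matching character subsequences.
    return SequenceMatcher(None, src_word, tgt_word).ratio()

def distortion_penalty(i, j, src_len, tgt_len, alpha=2.0):
    # Penalty that shrinks as the relative positions drift apart.
    diff = abs(i / src_len - j / tgt_len)
    return (1.0 - diff) ** alpha

print(surface_similarity("nation", "nationale"))  # high for cognate-like pairs
print(distortion_penalty(2, 3, 10, 12))           # close to 1 for nearby positions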
1 Introduction
Given a parallel sentence pair, or bitext, bilingual word alignment finds word-to-word connections across languages. Originally introduced as a byproduct of training statistical translation models in (Brown et al., 1993), word alignment h[r]
able comments. This work is supported by the National Natural Science Foundation of China (No. 61003112), the National Fundamental Research Program of China (2010CB327903) and by NSF under the CluE program, award IIS 084450.
References
Peter F. Brown, Stephen Della Pietra, Vincent J. Della Pietra, and[r]
There are 25K entries in the English vocabulary and 90K on the Arabic side. Data sparseness severely challenges the word alignment model and consequently automatic phrase translation induction. There are 42K singletons in the Arabic vocabulary, and 14K Arabic words with an occurrence of twice eac[r]
pairs, are designed to discount incorrect translation rules caused by alignment errors. Third, the large language model (trained with 9 billion words) in our experiments further alleviated the impact of incorrect translation rules. Fourth, the GALE test set has fewer reference translat[r]
1993) that characterize word-level alignments in parallel corpora. Parameters of these alignment models are learnt in an unsupervised manner using the EM algorithm over sentence-level aligned parallel corpora. While the ease of automatically aligning sentences at the word level with tool[r]
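To illustrate the unsupervised EM training mentioned above, here is a minimal IBM Model 1 style sketch over an invented toy corpus; it omits the NULL word, smoothing, and the higher IBM models of Brown et al. (1993), so it is a simplification rather than a faithful reimplementation.

from collections import defaultdict

# Toy sentence-aligned corpus (invented for illustration).
bitext = [
    ("the house".split(), "das haus".split()),
    ("the book".split(), "das buch".split()),
    ("a book".split(), "ein buch".split()),
]

# Uniform initialization of the translation probabilities t(tgt | src).
src_vocab = {s for src, _ in bitext for s in src}
tgt_vocab = {t for _, tgt in bitext for t in tgt}
t = {(tw, sw): 1.0 / len(tgt_vocab) for sw in src_vocab for tw in tgt_vocab}

for _ in range(10):  # EM iterations
    counts = defaultdict(float)
    totals = defaultdict(float)
    # E-step: expected alignment counts under the current parameters.
    for src, tgt in bitext:
        for tw in tgt:
            norm = sum(t[(tw, sw)] for sw in src)
            for sw in src:
                delta = t[(tw, sw)] / norm
                counts[(tw, sw)] += delta
                totals[sw] += delta
    # M-step: re-estimate t(tgt | src) from the expected counts.
    for (tw, sw), c in counts.items():
        t[(tw, sw)] = c / totals[sw]

print(round(t[("haus", "house")], 2))  # converges toward 1.0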
Up to this point our Hansards model has been trained using only the sure (S) alignments. As the data set contains many possible (P) alignments, we would like to use these to improve our model. Most of the possible alignments flag blocks of ambiguous or idiomatic (or just difficult) phrase-level alignments.[r]
signed for inference and parameter estimation. With the inferred latent topics, BiTAM models facilitate coherent pairing of bilingual linguistic entities that share common topical aspects. Our preliminary experiments show that the proposed models improve word alignment accu[r]
[Figure 2: Visualization of word alignments with an alignment matrix.]
presented in a single line or column. Pairs of long sentences therefore often cannot be shown entirely on the screen. Aligning pairs of long[r]
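The matrix view the caption refers to can be sketched in a few lines; the character-grid rendering below and its example sentences and links are invented, and the cited tool's actual interface is not reproduced here.

def print_alignment_matrix(src_words, tgt_words, links):
    # Print a word-alignment matrix; '*' marks an aligned pair.
    width = max(len(w) for w in src_words) + 1
    print(" " * width + " ".join(tgt_words))
    for i, sw in enumerate(src_words):
        row = [("*" if (i, j) in links else ".").center(len(tw))
               for j, tw in enumerate(tgt_words)]
        print(sw.ljust(width) + " ".join(row))

# Invented example pair with alignment links as (src index, tgt index).
print_alignment_matrix(
    ["that", "would", "be", "the", "position"],
    ["ce", "serait", "la", "position"],
    {(0, 0), (1, 1), (2, 1), (3, 2), (4, 3)},
)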
Word alignment is a well-studied problem in Natural Language Computing. This is hardly surprising given its significance in many applications: word-aligned data is crucial for example-based machine translation and statistical machine translation, but also for other applications suc[r]
Wang, 2004). In addition, when we train the model with a smaller-scale in-domain corpus as described in (Wu and Wang, 2004), our method achieves an error rate reduction of 10.15% as compared with the method in (Wu and Wang, 2004). We also use in-domain corpora and out-of-domain corpora[r]