UNED at PASCAL RTE-2 Challenge Jesús Herrera, Anselmo Peñas, Álvaro Rodrigo, Felisa Verdejo Departmento de Lenguajes y Sistemas Informáticos Universidad Nacional de Educación a Distancia Madrid, Spain {jesus.herrera, anselmo, alvarory, felisa}@lsi.uned.es Abstract This paper reports the description of the developed system and the results obtained in the participation of the UNED1 in the Second Recognizing Textual Entailment (RTE) Challenge. New techniques and tools have been added: enriched queries to WordNet, detection of numeric expres- sions and their entailment, and Support Vector Machine classification (SVM) are the more relevant. The accuracy per- formed is slightly higher than the one from the previous edition system. 1 Introduction The system presented to the Second Recognizing Textual Entailment Challenge is based on the one presented to the First RTE Challenge (Herrera et al., 2005). The core of this latter was basically kept, but enhanced by means of several subsystems in order to study the efficiency of other not previously applied techniques that seemed promising for RTE. In short, the techniques involved in this new sys- tem are the following: • Dependency analysis of texts and hypothesises. • Lexical entailment between dependency tree nodes using WordNet. The subsystem consult- ing WordNet was enriched with respect to the one presented to the First RTE Challenge. 1Spanish Distance Learning University. • Mapping between dependency trees, which is the one defined for the previous system (Her- rera et al., 2005). • Detection of numeric expressions. A new module, which detects entailment between nu- meric expressions of the texts and the hypoth- esises, was implemented. For this detection, the train and test corpora were automatically tagged (cardinals, dates and named entities) by the López-Ostenero’s system (Peinado et al., 2005). • Support Vector Machine classification in order to determine the final decision about textual en- tailment between pairs of text and hypothesis, following previous ideas from successful works in Natural Language Processing using machine learning applications (Joachims, 1998). 2 System Description The proposed system is based on surface techniques of lexical and syntactic analysis, complemented with queries to WordNet as an external source of knowl- edge. It works in a non-specific way, not giving any kind of special treatment for the different set- tings considered in the RTE Challenge (Information Retrieval, Multi-document summarization, Question Answering and Information Extraction). The system accepts pairs of text snippets (text and hypothesis) at the input and gives a boolean value at the output: YES if the text entails the hypothesis and NO otherwise. This value is obtained by the appli- cation of the learned model by a SVM classifier. Figure 1: System’s architecture. System’s components, whose graphic representa- tion is shown in figure 1, are the following: 2.1 Linguistic processing A dependency parser, based on Lin’s Minipar (Lin, 1998), which normalizes data from the corpus of text and hypothesis pairs and accomplishes the de- pendency analysis, generating a dependency tree for every text and hypothesis. A named entities recognizer, implemented by López-Ostenero (Peinado et al., 2005), has been used to normalize numeric expressions and named entities. 2.2 Lexical entailment A WordNet-based entailment module – which takes the information given by the parser and returns the hypothesis’ nodes that are entailed by the text (Her- rera et al., 2005) – uses WordNet in order to find synonymy, similarity, hyponymy, WordNet’s entail- ment and negation relations between pairs of lexical units, as described in (Herrera et al., 2005). For the current edition some features have been added to the lexical entailment module: • Search of entailment paths. It has been stud- ied whether the strategy for searching the en- tailment paths affects or not the results. Two strategies have been tested: depth search and breadth search. In addition, the length of the path has been used as a final criteria for deciding the entailment between words. Al- though behaviour is slightly different, the ef- fect over results in the exercise is no significant. However, the breadth strategy is significatively slower. • WordNet relations. Synonymy, hyponymy, verb entailment and antonymy have been used as in the last edition. In addition, part meronym (e.g. Italy entails Europe), and adjective / ad- verb pertainym (e.g. Italian entails Italy) have been added in the search of entailment paths between the lemmas of the text and the ones of the hypothesis. Table 1: Entailment between numeric expressions Text Hypothesis Recognition 17 million citizens more than 15 million people Normalization lower bound: 17, 000, 000 lower bound: 15, 000, 000 upper bound: 17, 000, 000 upper bound: infinite unit: citizen unit: person Entailment TRUE if 15, 000, 000 ≤ 17, 000, 000 and infinite ≥ 17, 000, 000 and citizen entails person • Entailment between phrases / multiwords. Lev- ensthein distance has been used for an approxi- mate matching between multiwords only if the one related to the hypothesis is present in Word- Net. A new and simple entailment relation between phrases has been defined assuming the compositional meaning of phrases. Thus, a phrase is expected to entail all its compo- nents. This entailment relation can‘t be used over Named Entities since they haven’t a com- positional meaning. • Entailment between numeric expressions. Nu- meric expressions from the corpus are detected by means of an entities recognizer; they are normalized after the recognition, and the units affected (e.g. kilometers, years, etcetera) are considered for the detection of an entailment relation between these expressions. Thus, a numeric expression N1 entails a numeric ex- pression N2 if the range associated to N2 en- closes the range of N1 and the unit of N1 entails the unit of N2. When a numeric expression in the hypothesis is not entailed by one or more numeric expressions in the text, then the sys- tem responses that there is not entailment be- tween numeric expressions in the pair. An ex- ample is shown in table 1. The experiment in figure 2 shows the accuracy obtained over the development corpus when considering: coin- cidence between lemmas (LEM), WordNet re- lations (WN), and entailment between numeric expressions (NUMFAIL). For every percentage of overlap between the text and the hypothesis, the accuracy obtained is higher when adding NUMFAIL to the set of considered features; and the lower is the overlap the higher is this accu- racy improvement. Thus, NUMFAIL is an in- teresting feature to decide if there is entailment between a text and a hypothesis when a lower overlap exists. The named entities (NE) entailment module, as described in section 5, did not contribute to the runs submitted. 2.3 Sentence level matching A tree matching module, which searches for match- ing branches into the hypothesises’ dependency trees. These kind of branches are the ones whose all nodes are lexically entailed, as described in (Herrera et al., 2005). A plain text matching module that calculates the percentage of lemmas from the hypothesis entailed by lemmas from the text, according to section 2.2. 2.4 Entailment decision A SVM classifier, from Yet Another Learning Envi- ronment Yale 3.0 (Fischer et al., 2005), which was applied in order to train a model from the develop- ment corpus given by the organization and to apply it to the test corpus. The model was trained by means of a set of features obtained from the other mod- ules of the system; these ones, for every pair , are the following: 1. Percentage of nodes of the hypothesis’ de- pendency tree pertaining to matching branches (Herrera et al., 2005) considering, respectively: • Lexical entailment between the words of the snippets involved, without consulting WordNet. • Lexical entailment between the lemmas of the snippets involved, consulting Word- Net. 2. Percentage of lemmas of the hypothesis en- tailed by lemmas of the text, considering the lexical entailment relations described in section 2.2. Figure 2: Effect of the numerical entailment restriction over the development corpus 3. Percentage of words of the hypothesis in the text (treated as bags of words). 4. Percentage of lemmas of the hypothesis in the text (treated as bags of lemmas). 5. Existence or absence of any numeric expres- sion within the hypothesis. 6. Existence or absence of any numeric expres- sion within the text. 7. Existence or absence of entailment between nu- meric expressions of the text and the hypothe- sis, as described in section 2.2. 3 Runs Submitted Two runs were submitted to the Second RTE Chal- lenge. Run 1 was obtained using only the features 2 and 7 of section 2.4. Run 2 was produced by the system described in section 2. The SVM was trained with the features enumerated in section 2.4, obtained from the devel- opment corpus provided by the organizers. Thus, the model was applied to the features from the test cor- pus in order to obtain a prediction for the existence or absence of textual entailment for every pair. 4 Performance Two evaluation measures were applied to the partic- ipating systems: accuracy (Dagan et al. , 2005), as the main measure, and average precision (Voorhees and Harman, 1999), as the secondary measure. Ac- curacy was computed for all the runs submitted, but average precision only for the runs giving the results ranked according to their entailment confidence; this ranking was not mandatory. One of the two runs submitted was ranked and, then, average precision was computed for it. The results obtained over the test corpus are shown in tables 2 and 3. The accuracy of the system is not homogeneous over all the pairs and depends on the different appli- Table 2: Results for run 1 Accuracy Average Precision IE 49.00% 47.74% IR 64.50% 69.14% QA 56.50% 50.24% SUM 69.00% 79.05% Overall 59.75% 56.63% Table 3: Results for run 2 Accuracy IE 52.00% IR 57.00% QA 52.00% SUM 74.50% Overall 58.87% cation settings proposed by the organizers: Informa- tion Retrieval (IR), Multi-document Summarization (SUM), Question Answering (QA) and Information Extraction (IE). The overall accuracy shown by the current system is basically due to the contribution of the Multi-document Summarization setting. This setting is characterized by sentence pairs with high lexical overlap, and the system shows its better ac- curacy for the subset of pairs pertaining to this kind of setting, for which reaches 74.50% accuracy. 4.1 Performance comparison Though the overall accuracy is better than the one obtained in the First RTE Challenge, the improve- ment has not been very significant: only 3.38 percentage points between the best performances reached in every edition. Considering a combina- tion of the two systems used to obtain the runs sub- mitted to de Second RTE Challenge, in which the IR and QA pairs were treated by the system produc- ing run 1 and the IE and SUM pairs were treated by the system producing run 2, the best performance could be given. In such a case, the overall accuracy will be 61.88%; thus, the performance improvement with respect to the previous edition of the Challenge will be of 5.51 percentage points. During the development of the system, some ex- periments were accomplished in order to compute its accuracy. Training the SVM with a half of the development corpus and applying the model to the other half, and vice-versa, the overall accuracy ob- tained ranged between 63% and 64%. It overcomes in more than 4 percentage points the best accuracy obtained by the two runs submitted. Since the de- velopment corpus and the test corpus are similar, it can be concluded that the accuracy of the system is quite variable depending on the concrete samples of the corpus. Then, it is not easy to affirm that the cur- rent system is clearly better than the previous one. In the first edition of the RTE Challenge the Multi-document Summarization setting was not pro- posed but a similar one called Comparable Docu- ments (Dagan et al. , 2005), characterized by a high lexical overlap between texts and hypothesises, too. For Comparable Documents, the system pre- sented to the previous edition of the RTE Challenge reached its best accuracy, with a 79.33%. The ac- curacy obtained for the settings with a high lexical overlap went slightly down from the first to the sec- ond edition. Because of the settings are not exactly the same, no definitely conclusions can be stated, but it is clear that both systems are significantly good at recognizing textual entailment when the pairs show a high lexical overlap. 5 New Experiments after Second RTE Challenge Participation After submitting the results to the Second RTE Chal- lenge, the development of the system continued and some experiments were accomplished. 5.1 New features based on named entities A new module for recognizing entailment between named entities (see figure 1) was implemented in order to study its usefulness for RTE. It works in a similar way to the numerical entailment module. The features computed were the following: a) ex- istence or absence of any named entity within the hypothesis; b) existence or absence of any named entity within the text; c) existence or absence of en- tailment between named entities of the text and the hypothesis; it is said that there is entailment between named entities of the text and its correspondant hy- pothesis if, in case of existence of named entities in the hypothesis, all them are entailed by one ore more named entities from the text. The results obtained after executing the whole system described in figure 1 – over the test corpus – are shown in table 4. Table 4: Results considering named entities Accuracy IE 51.00% IR 63.50% QA 53.50% SUM 74.00% Overall 60.50% The overall accuracy of this system is slightly bet- ter than the ones obtained by the other two systems. Thus, named entities seem to be an interesting field of study in order to improve RTE. 6 Conclusions and Future Work The system presented to the First RTE Challenge has been improved in order to put in the same level the weight of the analysis due to the overlap between dependency trees – which represented the only de- cision item in the previous system – and the weight of other kinds of analysis, such as bag of words, bag of lemmas, numerical entailment, etcetera. The rele- vance of every feature computed for the pairs has been determined automatically by a SVM algorithm (training with the development corpus) which is the responsible for the prediction of existence/absence of entailment between the pairs of the test corpus. Despite the complexity of the developed system is quite higher than the complexity of the previous one, the obtained accuracy is slightly better. It seems not easy to determine the way to obtain a higher ac- curacy for every application setting. With the cur- rently experimented techniques and tools, the results obtained for settings with a high lexical overlap are significantly higher than the others, which are quite similar among themselves. It suggest that it should be stimulated the development of setting-oriented systems, aiming to increase the performance of RTE systems focusing on only one or a few settings. Thus, in the medium term, useful RTE-based sys- tems for specific uses could be hopefully available. From the results obtained along the development time and after the execution of the test it can be de- duced that, nowadays, the RTE systems could be used with a remarkable success to identify informa- tion redundancy in multi-document summarization tasks. Other applications for RTE systems should be ex- plored, such as automatic Answer Validation tasks. An example for this kind of task is proposed within the Cross Language Evaluation Forum (CLEF) for the year 20062. 7 Acknowledgments We are grateful to Fernando López-Ostenero, from UNED-NLP Group, for his named entities recog- nizer. This work has been partially supported by the Spanish Ministry of Science and Technology within the project: TIC-2003-07158-C04-02 Multilingual Answer Retrieval Systems and Evaluation, SyEM- BRA. References I. Dagan, O. Glickman and B. Magnini. 2005. The PAS- CAL Recognising Textual Entailment Challenge. Pro- ceedings of the First PASCAL Recognizing Textual Entailment Workshop. LNAI. Springer. In press. S. Fischer, R. Klinkenberg, I. Mierswa and O. Ritthoff. 2005. Yale 3.0, Yet Another Learning Environment. User Guide, Operator Reference, Developer Tutorial. University of Dortmund, Department of Computer Science. Dortmund, Germany. J. Herrera, A. Peñas and F. Verdejo. 2005. Textual Entailment Recognition Based on Dependency Analy- sis and WordNet. Proceedings of the First PASCAL Recognizing Textual Entailment Workshop. LNAI. Springer. In press. T. Joachims. 1998. Text Categorization with Support Vector Machines: Learning with Many Relevant Fea- tures. LS-8 Report 23. University of Dortmund, De- partment of Computer Science. Dortmund, Germany. D. Lin. 1998. Dependency-based Evaluation of MINI- PAR. Workshop on the Evaluation of Parsing Systems, Granada, Spain, May, 1998. V. Peinado, F. López-Ostenero, J. Gonzalo and F. Verdejo. 2005. UNED at ImageCLEF 2005: Au- tomatically Structured Queries with Named Entities over Metadata. Cross Language Evaluation Forum, Working Notes for the CLEF 2005 Workshop. LNCS. Springer. In press. M. Voorhees and D. Harman. 1999. Overview of the seventh text retrieval conference. Proceedings of the Seventh Text Retrieval Conference (TREC-7). NIST Special Publication. 2http://nlp.uned.es/QA/AVE/