Tutorials


TU01: Statistical Machine Translation: the Basic, the Novel, and the Speculative
TU02: Tutorial on Ontology Learning from Text
TU03: Language Independent Methods of Clustering Similar Contexts (with applications)
TU04: Text Mining in Biomedicine: an Overview of Techniques


TU01: Statistical Machine Translation: the Basic, the Novel, and the Speculative

Philipp Koehn     (date: Apr 4, 2006 in the morning)

Statistical machine translation has matured in recent years into a viable challenge to traditional, more knowledge-driven methods, with an energetic research community and commercial interest. This tutorial will serve as an introduction to the basic principles of current statistical machine translation methods, present the latest research in the field, and point to the challenges and ideas beyond.

Specifically, the tutorial will cover:

  • Foundations: available data and tools, generative modeling, EM training (a minimal sketch follows this list), phrase-based models, evaluation
  • Advanced Methods: log-linear models, discriminative training, specialized models, integration with speech
  • Outlook: Syntax-based and syntax-aided approaches
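
As a concrete illustration of the EM training listed under foundations, here is a minimal Python sketch of IBM Model 1 word alignment on the toy German-English corpus familiar from introductory treatments of the model. The corpus, iteration count, and variable names are illustrative choices, not material from the tutorial itself.

    from collections import defaultdict

    # Toy parallel corpus (German, English) -- illustrative data only.
    corpus = [
        ("das haus".split(), "the house".split()),
        ("das buch".split(), "the book".split()),
        ("ein buch".split(), "a book".split()),
    ]

    # Initialize translation probabilities t(f|e) uniformly.
    e_vocab = {e for _, es in corpus for e in es}
    t = defaultdict(lambda: 1.0 / len(e_vocab))

    for _ in range(10):  # EM iterations
        count = defaultdict(float)  # expected counts c(f, e)
        total = defaultdict(float)  # expected counts c(e)
        # E-step: collect expected alignment counts under the current t.
        for fs, es in corpus:
            for f in fs:
                norm = sum(t[(f, e)] for e in es)
                for e in es:
                    delta = t[(f, e)] / norm
                    count[(f, e)] += delta
                    total[e] += delta
        # M-step: re-estimate t(f|e) from the expected counts.
        for (f, e), c in count.items():
            t[(f, e)] = c / total[e]

    print(round(t[("haus", "house")], 3))  # approaches 1.0

Even on this three-sentence corpus, the estimates quickly concentrate on the intuitively correct word pairs (haus/house, buch/book), which is the behaviour the foundations segment builds on.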

Philipp Koehn received his PhD from the University of Southern California, where he was a research assistant at the Information Sciences Institute (ISI) from 1997 to 2003. He was a postdoctoral research associate at the Massachusetts Institute of Technology (MIT) in 2004, and joined the University of Edinburgh as a lecturer in 2005. His research centres on statistical machine translation, but he has also worked on speech (in 1999, at AT&T Research Labs) and text classification (in 2000, at Whizbang Labs). Besides his research, his major contributions to the machine translation community are the preparation and release of the DE-News and Europarl corpora, as well as the Pharaoh decoder.


TU02: Tutorial on Ontology Learning from Text

Paul Buitelaar and Philipp Cimiano    (date: Apr 4, 2006 in the morning)

An ontology is commonly defined as an explicit and formal specification of a shared conceptualization of a domain of interest. Ontologies formalize the intensional aspects of a domain, whereas the extensional part is provided by a knowledge base that contains assertions about instances of concepts and relations as defined by the ontology. The process of defining and instantiating a knowledge base is referred to as knowledge markup or ontology population, whereas (semi-)automatic support in ontology development is usually referred to as ontology learning.
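
To make the intensional/extensional distinction concrete, here is a hypothetical Python sketch: the ontology supplies a subsumption hierarchy and relation signatures, while the knowledge base asserts instances against them. All names below are invented for illustration.

    # Intensional part (the ontology): a subsumption hierarchy and
    # relation signatures. All names here are invented for illustration.
    is_a = {"Researcher": "Person"}  # Researcher is-a Person
    relations = {"affiliatedWith": ("Person", "Organization")}

    # Extensional part (the knowledge base): assertions about instances.
    instances = {"p1": "Researcher", "org1": "Organization"}
    assertions = [("affiliatedWith", "p1", "org1")]

    def subsumed_by(concept, target):
        """Walk up the is_a hierarchy to test subsumption."""
        while concept is not None:
            if concept == target:
                return True
            concept = is_a.get(concept)
        return False

    # Check each assertion against the relation's domain and range.
    for rel, subj, obj in assertions:
        domain, rng = relations[rel]
        assert subsumed_by(instances[subj], domain), (rel, subj)
        assert subsumed_by(instances[obj], rng), (rel, obj)
    print("knowledge base is consistent with the ontology")

Ontology population, in these terms, is the task of adding entries to instances and assertions; ontology learning is the task of inducing the hierarchy and relation signatures themselves.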

Ontologies have been broadly used in knowledge management applications, with a recent upsurge around Semantic Web applications and research. In recent years, ontologies have also regained interest within the NLP community, specifically in the context of applications such as information extraction, text mining, and question answering. However, as ontology development is a tedious and costly process, there has been an equally growing interest in the automatic learning or extraction of ontologies. Much of this work has been directed towards extraction from textual data, as human language is a primary mode of knowledge transfer. In this way, textual data provide both a resource for the ontology learning process and an application medium for developed ontologies.

The tutorial will give an introduction to ontology learning from textual data. It will assume no prior knowledge of the field and will thus suit people with very different backgrounds, although some emphasis will be placed on the role of linguistic analysis, NLP, and machine learning in ontology learning. The role of ontologies in NLP applications, e.g. information extraction, text mining, information retrieval, machine translation, and question answering, will also be discussed, as will the relation between ontologies and lexical semantics.


TU03: Language Independent Methods of Clustering Similar Contexts (with applications)

Ted Pedersen (http://www.d.umn.edu/~tpederse)     (date: Apr 4, 2006 in the afternoon)

Methods that identify similar (but not identical) units of text have wide potential application. For example, Web search results can be better organized by grouping together pages with related and similar content. Email can be automatically sorted into folders and categorized by finding which messages are similar to each other. Word senses can be discovered by clustering multiple contexts that use a particular ambiguous word.

This tutorial will introduce a language independent methodology for identifying similar contexts based on lexical features. It will explore the use of first and second order co-occurrence vectors for representing contexts, and introduce dimensionality reduction methods that cut the noise and computational complexity associated with these large feature spaces. A number of different clustering methods will be discussed, as will various ways of evaluating the quality of the clustering results. Finally, the tutorial will explore methods of automatically generating descriptive labels for clusters.
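
As a rough sketch of that pipeline, using numpy and scikit-learn rather than SenseClusters itself, the following Python fragment builds second-order context vectors for an invented set of contexts around the ambiguous word "bank", reduces their dimensionality with SVD, and clusters them. Every detail of the data and parameter choices is illustrative.

    import numpy as np
    from sklearn.cluster import KMeans
    from sklearn.decomposition import TruncatedSVD

    # Invented contexts around the ambiguous word "bank".
    contexts = [
        "deposit money bank account interest",
        "bank loan money credit interest",
        "river bank water fishing shore",
        "muddy bank river flood water",
    ]
    tokens = [c.split() for c in contexts]
    vocab = sorted({w for t in tokens for w in t})
    idx = {w: i for i, w in enumerate(vocab)}

    # First-order representation: each context as a bag-of-words vector.
    first = np.zeros((len(tokens), len(vocab)))
    for row, t in enumerate(tokens):
        for w in t:
            first[row, idx[w]] += 1

    # Word-by-word co-occurrence counts over the same contexts.
    cooc = first.T @ first
    np.fill_diagonal(cooc, 0)

    # Second-order representation: average the co-occurrence vectors
    # of the words appearing in each context.
    second = np.stack([cooc[[idx[w] for w in t]].mean(axis=0) for t in tokens])

    # Dimensionality reduction to cut noise, then clustering.
    reduced = TruncatedSVD(n_components=2).fit_transform(second)
    labels = KMeans(n_clusters=2, n_init=10).fit_predict(reduced)
    print(labels)  # the two "money" contexts should share a cluster

The second-order representation is what lets two contexts be grouped together even when they share no words directly, provided their words co-occur with the same other words.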

The tutorial will also include a hands-on option for those with laptop computers. Attendees will be given a bootable Knoppix CD that will let them experiment with many of these ideas and applications using the SenseClusters package (http://senseclusters.sourceforge.net).


TU04: Text Mining in Biomedicine: an Overview of Techniques

Sophia Ananiadou and Yoshimasa Tsuruoka    (date: Apr 4, 2006 in the afternoon)

In the past few years, there has been an upsurge of research papers on the topic of text mining from biomedical literature. The primary goal of text mining is to retrieve knowledge that is hidden in text, and to present the distilled knowledge to users in a concise form. The advantage of text mining is that it enables scientists to collect, maintain, interpret, curate, and discover knowledge needed for research or education, efficiently and systematically.

This tutorial will provide a critical overview of state-of-the-art techniques applied to biomedical text mining, aiming to make clear what can be expected of the field at present and in the near future. The tutorial will focus in particular on terminology management and information extraction.

One of the core challenges for text mining from biomedical literature is presented by terminology. Given the number of neologisms characterising biomedical terminology, it is necessary to provide tools which automatically extract newly coined terms from texts and link them with bio-databases, controlled vocabularies, and ontologies. The importance of this topic has triggered significant research, which has in turn resulted in several approaches used to collect, classify, and identify term occurrences in biomedical texts. Terminological processing also covers aspects such as extraction, term variation, classification, and mapping.
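
One well-known measure from this line of research is the C-value of Frantzi and Ananiadou, which weights a candidate term's frequency by its length and discounts occurrences nested inside longer candidates. The following Python sketch computes it over an invented candidate list; a real system would first apply linguistic filters (e.g. POS patterns) to generate the candidates.

    import math

    # Invented candidate multi-word terms with corpus frequencies.
    freq = {
        ("basal", "cell"): 34,
        ("basal", "cell", "carcinoma"): 27,
        ("adenoid", "basal", "cell", "carcinoma"): 4,
    }

    def c_value(term):
        """C-value: length-weighted frequency, discounted when the
        candidate mostly occurs nested inside longer candidates."""
        n = len(term)
        longer = [t for t in freq if len(t) > n and
                  any(t[i:i + n] == term for i in range(len(t) - n + 1))]
        if not longer:
            return math.log2(n) * freq[term]
        nested = sum(freq[t] for t in longer) / len(longer)
        return math.log2(n) * (freq[term] - nested)

    # Rank candidates by termhood.
    for term in sorted(freq, key=c_value, reverse=True):
        print(" ".join(term), round(c_value(term), 2))

On this toy data, "basal cell carcinoma" outranks "basal cell", reflecting the intuition that a string occurring mostly inside a longer term is less likely to be a term in its own right.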

The second part of this tutorial introduces technologies and resources that have been developed for information extraction from biomedical literature. These include linguistically annotated biomedical corpora, various NLP tools designed to deal with biomedical text, and several approaches to extracting useful information, such as protein-protein interactions and disease-gene associations, from biomedical documents.
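
To give a flavour of the simplest of these approaches, pattern matching, here is a minimal Python sketch that extracts protein-protein interaction candidates by matching trigger verbs between known protein names. The lexicon, trigger list, and sentence are invented, and real systems rely on much richer linguistic analysis.

    import re

    # Invented protein lexicon and trigger verbs -- a real system would
    # draw these from curated databases and a larger trigger list.
    proteins = ["RAD51", "BRCA2", "TP53", "MDM2"]
    verbs = ["interacts with", "binds", "phosphorylates", "inhibits"]

    pattern = re.compile(
        rf"\b({'|'.join(proteins)})\s+({'|'.join(verbs)})\s+({'|'.join(proteins)})\b"
    )

    text = ("Our results suggest that RAD51 interacts with BRCA2 in vivo, "
            "while MDM2 inhibits TP53 under the same conditions.")

    for p1, verb, p2 in pattern.findall(text):
        print(f"PPI candidate: {p1} --[{verb}]--> {p2}")
    # PPI candidate: RAD51 --[interacts with]--> BRCA2
    # PPI candidate: MDM2 --[inhibits]--> TP53

Such surface patterns are brittle (they miss passives, coordination, and intervening material), which is precisely why the outline below also covers full parsing, sublanguage-driven, ontology based, and machine learning approaches.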

TUTORIAL OUTLINE

  1. Introduction
    • What are the main challenges of text mining in biomedicine?
    • Needs and applications
  2. Terminology management
    • Terminological resources in biomedicine
    • Automatic term recognition approaches
      • Rule-based
      • Machine learning
      • Dictionary-based
      • Hybrid
    • Dealing with ambiguity and variation
    • An example of term variation: acronyms
  3. Information Extraction
    • Resources for Bio-Text Mining (corpora)
      • GENIA, PennBioIE, etc.
      • Corpus annotation in Biology
    • Linguistic analysis for Bio-Text Mining
      • POS tagging
      • Shallow parsing
      • Syntactic parsing
      • Deep syntactic parsing
    • Approaches to IE in Biology
      • Pattern matching
      • Full parsing
      • Sublanguage-driven
      • Ontology based
      • Machine learning
    • Conclusions and extensive bibliography

Sophia Ananiadou is Reader in Text Mining in the School of Informatics, University of Manchester. She is also deputy director of the National Centre for Text Mining (NaCTeM, http://www.nactem.ac.uk), whose aim is to provide leadership in text mining for the UK academic community, focusing initially on the life sciences. Her main interests are bio-text mining, natural language processing, automatic terminology management, ontologies, and linguistic knowledge acquisition from biomedical texts. She is co-editor of the book Text Mining in Biology and Biomedicine (Artech House, 2006). Dr Ananiadou has organised workshops and given tutorials on text mining in biomedicine at conferences such as PSB, ISMB, ACL, MIE, and COLING.

Yoshimasa Tsuruoka's research interests include machine learning approaches to natural language processing, as well as text mining from biomedical literature. He has been working on NLP tools, including part-of-speech taggers, named-entity recognizers, and parsers, and on machine learning techniques for biomedical text mining. He is a research fellow at the GENIA project (http://www-tsujii.is.s.u-tokyo.ac.jp), which aims at corpus-based knowledge acquisition and information extraction from genomic literature, and at the National Centre for Text Mining (University of Manchester).


TUTORIALS CHAIRS

Alexis Nasr (Université Paris 7, France)
Kemal Oflazer (Sabanci University, Turkey)
Miles Osborne (University of Edinburgh, UK)