What We Do
Most NLP applications, such as information extraction, machine translation, sentiment analysis, and question answering, require both syntactic and semantic analysis at various levels. Traditionally, NLP research has focused on developing algorithms that are language-specific, perform well only on closed-domain text, or both. At Google, we work on solving these problems in multiple languages at web scale by leveraging the massive amounts of unlabeled data on the Web. We support a number of Google products, such as web search and search advertising.
At the syntactic level, we develop algorithms to predict a part-of-speech tag for each word in a given sentence (e.g., noun, verb, adjective), as well as the various relationships between words (e.g., subject, object, and other modifiers). Historically, parsing systems were developed primarily for English, did not scale well, and were not robust to large shifts in vocabulary, e.g., from well-formed news text to unedited Web 2.0 content. Our focus is therefore on developing multilingual, linear-time parsing algorithms that are robust to these kinds of domain shifts. To this end, we work on algorithms that leverage large amounts of unlabeled web data and can even be trained to maximize application-specific performance. Furthermore, we are pushing the state of the art in multilingual syntactic analysis by building robust modeling techniques to transfer knowledge from resource-rich languages (like English) to resource-poor languages.
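To make "linear-time parsing" concrete, here is a minimal sketch of one well-known family of linear-time parsers, the arc-standard transition system. This is an illustration only, not a description of the parser used at Google. In a real system a trained model would choose each action; in this sketch the action sequence is written by hand, and the sentence, the labels, and the function name `parse` are assumptions made for the example.

```python
from typing import List, Tuple

SHIFT, LEFT_ARC, RIGHT_ARC = "SHIFT", "LEFT_ARC", "RIGHT_ARC"

Arc = Tuple[str, str, str]  # (head word, relation label, dependent word)


def parse(words: List[str], actions: List[Tuple[str, str]]) -> List[Arc]:
    """Apply a sequence of arc-standard transitions and return the arcs built."""
    stack: List[int] = []   # indices of partially processed words
    next_word = 0           # index of the first word not yet shifted
    arcs: List[Arc] = []
    for action, label in actions:
        if action == SHIFT:
            # Move the next unread word onto the stack.
            stack.append(next_word)
            next_word += 1
        elif action == LEFT_ARC:
            # The second-from-top word becomes a dependent of the top word.
            dependent = stack.pop(-2)
            arcs.append((words[stack[-1]], label, words[dependent]))
        elif action == RIGHT_ARC:
            # The top word becomes a dependent of the word beneath it.
            dependent = stack.pop()
            arcs.append((words[stack[-1]], label, words[dependent]))
        else:
            raise ValueError(f"unknown action: {action}")
    return arcs


# Hypothetical example sentence with hand-assigned tags and actions.
words = ["Google", "parses", "text"]
tags = ["noun", "verb", "noun"]
actions = [
    (SHIFT, ""),
    (SHIFT, ""),
    (LEFT_ARC, "subject"),   # "Google" is the subject of "parses"
    (SHIFT, ""),
    (RIGHT_ARC, "object"),   # "text" is the object of "parses"
]
arcs = parse(words, actions)
```

Each word is shifted exactly once and attached exactly once (except the root), so a sentence of n words needs 2n - 1 transitions. Here that is 5 transitions for 3 words, which is why parsers of this kind run in time linear in sentence length.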
On the semantic side, we work on problems such as noun-phrase extraction (e.g., identifying "Barack Obama" and "CEO" in free text); tagging these noun phrases as person, organization, location, or common noun; clustering noun phrases that refer to the same entity both within and across documents (coreference resolution); resolving mentions of entities in free text against entities in a knowledge base; and relation and knowledge extraction (e.g., is-a). While most state-of-the-art NLP algorithms attempt to solve these problems for data from a closed domain, at Google we solve them at web scale, bringing to bear the different sources of knowledge at our disposal, including our cutting-edge syntactic analysis. The scale and nature of the data on the web (a web page could be from newswire, a blog, or a personal homepage) require us to design algorithms that are efficient, perform well on text from different domains, and can be easily distributed across thousands of cores.
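As a rough illustration of two of the tasks above, resolving mentions against a knowledge base and grouping mentions of the same entity across documents, the following toy sketch uses a simple alias lookup. It is not how these problems are solved at web scale. The knowledge-base entries, the entity identifiers, and the function names `resolve` and `cluster` are all hypothetical.

```python
from collections import defaultdict
from typing import Dict, List, Optional, Tuple

# Hypothetical toy knowledge base: entity id -> known surface forms.
KNOWLEDGE_BASE: Dict[str, List[str]] = {
    "kb:barack_obama": ["Barack Obama", "Obama", "President Obama"],
    "kb:google": ["Google", "Google Inc."],
}

# Inverted index from lowercased alias to entity id.
ALIAS_TO_ENTITY: Dict[str, str] = {
    alias.lower(): entity_id
    for entity_id, aliases in KNOWLEDGE_BASE.items()
    for alias in aliases
}

Mention = Tuple[str, str]  # (document id, mention text)


def resolve(mention: str) -> Optional[str]:
    """Return the entity id for a mention, or None if it is not in the KB."""
    return ALIAS_TO_ENTITY.get(mention.strip().lower())


def cluster(mentions: List[Mention]) -> Dict[str, List[Mention]]:
    """Group mentions from any document by the entity they resolve to."""
    clusters: Dict[str, List[Mention]] = defaultdict(list)
    for doc_id, text in mentions:
        entity_id = resolve(text)
        if entity_id is not None:
            clusters[entity_id].append((doc_id, text))
    return dict(clusters)


mentions = [("doc1", "Barack Obama"), ("doc2", "Obama"), ("doc2", "Google")]
clusters = cluster(mentions)
```

Exact alias matching breaks down quickly on real web text, where mentions are ambiguous, misspelled, or absent from the knowledge base. That gap is what makes context-aware, domain-robust methods necessary.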