Scalable Ad-hoc Entity Linking of Wikipedia Revision Using Apache Spark and Hedera
Category: Big data, graph processing
As the largest multi-lingual online encyclopedia on the Web, Wikipedia is increasingly recognized as a gold-standard resource for various data mining and machine learning applications. While Wikipedia text has been widely used in many studies, little work has been able to process the Wikipedia revision history, owing to its exceptional size and ad-hoc formats.
This project aims to build a machine learning framework for studying entity linking algorithms on the Wikipedia revision dataset. The outcome is a system that annotates named entities in free text (for instance, detecting "Barack Obama" in a news article) while taking the time of publication into account (for example, during the 2012 US presidential election). To improve on the state of the art, the system needs to synchronize the training data with the period under study, and thereby remove temporal noise (for instance, mentions of Barack Obama during the Iraq conflict). The Wikipedia revision history serves as the training data.
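The time-aware idea above can be sketched in a few lines. The following is a minimal, illustrative Python sketch (not the project's actual implementation): the toy triples, function names, and probabilities are assumptions. It estimates the "commonness" of each candidate entity for a mention using only links observed in revisions up to the query time, so the same surface form can resolve differently in different periods.

```python
from collections import defaultdict

# Hypothetical toy data: (year, anchor_text, linked_entity) triples,
# standing in for links extracted from Wikipedia revision history.
REVISION_LINKS = [
    (2005, "Obama", "Obama,_Fukui"),    # a Japanese city also named Obama
    (2007, "Obama", "Barack_Obama"),
    (2008, "Obama", "Barack_Obama"),
    (2012, "Obama", "Barack_Obama"),
]

def link_probability(anchor, year, links=REVISION_LINKS):
    """Commonness of each candidate entity for `anchor`, computed only
    from links observed no later than `year` (a time-aware training set)."""
    counts = defaultdict(int)
    for ts, a, entity in links:
        if a == anchor and ts <= year:
            counts[entity] += 1
    total = sum(counts.values())
    return {e: c / total for e, c in counts.items()} if total else {}

def annotate(anchor, year):
    """Resolve `anchor` to its most probable entity as of `year`."""
    probs = link_probability(anchor, year)
    return max(probs, key=probs.get) if probs else None
```

With this toy data, `annotate("Obama", 2006)` resolves to the Japanese city, while `annotate("Obama", 2012)` resolves to Barack Obama, illustrating why the training data must be synchronized with the time of study.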
The challenge lies in the evolving nature of the Wikipedia revision history, as well as the huge size of its textual content. To address it, we will revise the classical link-based entity linking algorithms [1,2] using the novel temporally evolving graph algorithms of [3]. The thesis can potentially lead to a scientific publication, given reasonably good outcomes.
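To make the link-based direction concrete, here is a minimal, illustrative sketch of personalized PageRank by power iteration on a toy entity graph. The graph, node names, and parameter values are assumptions for illustration only; the project itself targets ad-hoc, temporally evolving variants of this computation.

```python
# Toy directed link graph over entities (illustrative only).
GRAPH = {
    "Barack_Obama": ["United_States", "US_Election_2012"],
    "US_Election_2012": ["Barack_Obama", "United_States"],
    "United_States": ["Barack_Obama"],
}

def personalized_pagerank(graph, seed, alpha=0.15, iters=50):
    """Power iteration for personalized PageRank: with probability
    `alpha` the walker restarts at `seed`, otherwise it follows a
    random outgoing link."""
    nodes = list(graph)
    rank = {n: (1.0 if n == seed else 0.0) for n in nodes}
    for _ in range(iters):
        new = {n: (alpha if n == seed else 0.0) for n in nodes}
        for n in nodes:
            out = graph.get(n, [])
            if not out:
                continue  # dangling node: its mass is dropped in this sketch
            share = (1 - alpha) * rank[n] / len(out)
            for m in out:
                new[m] += share
        rank = new
    return rank
```

Scores from the seed's neighborhood can then serve as a relatedness signal when ranking candidate entities for a mention, in the spirit of the link-based measures of [1,2].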
[1] Milne, David and Witten, Ian H. Learning to Link with Wikipedia. CIKM 2008.
[2] Page, Lawrence and Brin, Sergey and Motwani, Rajeev and Winograd, Terry. The PageRank Citation Ranking: Bringing Order to the Web. Technical Report, Stanford InfoLab, 1999.
[3] Fujiwara, Yasuhiro and Nakatsuji, Makoto and Shiokawa, Hiroaki and Mishima, Takeshi and Onizuka, Makoto. Efficient Ad-hoc Search for Personalized PageRank. SIGMOD 2013.
If you are interested, contact Tuan Tran.