The URX technology platform helps predict what a mobile user wants to do next and enables users to easily take action inside apps.
A critical component of this technology is the ability to dynamically understand user intent from a web page. At run time, URX analyzes text and metadata on the page and uses our knowledge graph to gain an understanding of what a user is reading about. In other words, we make sense of the people, places, and things that are mentioned in order to recommend related actions.
As you can imagine, this is a serious “big data” problem that we have had to overcome. We were selected to share how we have addressed these challenges in the Startup Showcase at the 2015 IEEE Big Data Conference last week in Santa Clara, California. Below is a summary of the presentation, which addressed how URX collects, stores, and processes the data needed to train our machine learning models that dynamically detect what’s mentioned on a page.
Using Wikipedia to Train Machine Learning Models
We are continually ingesting data to train our machine learning models and improve our knowledge graph. Wikipedia is an incredibly valuable dataset for this task because it is a rich source of semantic information and is often considered a vast and comprehensive knowledge base. It’s also a heterogeneous data set with lots of diversity: information on people, sports, politics, companies, and many things in between.
We have a lot to gain by training and testing our machine learning models against it and using such models to help us add facts to our knowledge base. To achieve this, however, we must first collect and parse the data quickly and effectively.
The English Wikipedia corpus consists of 15 million pages and is available for complete download as an 11 gigabyte XML file. While Wikipedia is considered reasonably accurate, it does include a lot of noise: we consider only about one in three pages informative enough for learning.
Wikipedia is also updated all the time; to refine your model and stay up to date with changes, you need to be able to parse the entire corpus as quickly as possible. We tried several Python libraries to do this, including gensim and a suite of MediaWiki parsers such as mediawiki parser and mwparserfromhell. We eventually figured out how to do it in 20 minutes using a combination of pyspark, mwparserfromhell, and wikihadoop.
The main challenge in parsing the text (aside from size) is that Wikipedia text is written in “wiki markup,” which has a different grammar from hypertext markup (HTML). Translating wikitext into raw text or HTML is not trivial: it requires understanding the wikitext grammar and then mapping each construct accordingly. The mwparserfromhell library helps with this translation. Once we had that working, we used wikihadoop to split the input XML file and then parallelized the processing across many machines with a pyspark job.
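To make the translation problem concrete, here is a minimal sketch that handles only two wikitext constructs: simple wikilinks and bold/italic quotes. It is a toy stand-in written with regular expressions, not mwparserfromhell and not the parser we ran; the function names and the sample sentence are made up for illustration, and templates, tables, and nested links are ignored.

```python
import re

# Matches [[Target]] and [[Target|label]]. Nested links and templates
# are deliberately not handled in this toy version.
WIKILINK = re.compile(r"\[\[([^\[\]|]+)(?:\|([^\[\]]+))?\]\]")


def extract_links(wikitext):
    """Return (target, anchor_text) pairs for each simple wikilink."""
    return [
        (m.group(1).strip(), (m.group(2) or m.group(1)).strip())
        for m in WIKILINK.finditer(wikitext)
    ]


def strip_markup(wikitext):
    """Replace wikilinks with their anchor text and drop bold/italic quotes."""
    text = WIKILINK.sub(lambda m: m.group(2) or m.group(1), wikitext)
    return re.sub(r"'{2,}", "", text)


# Hypothetical sample sentence, not taken from a real article.
sample = (
    "'''Taylor Swift''' is an American [[singer-songwriter]] "
    "and [[Performing arts|performer]]."
)
plain_text = strip_markup(sample)
links = extract_links(sample)
print(plain_text)
print(links)
```

Even this small case shows why a real parser is worth using: the link target ("Performing arts") and the text a reader sees ("performer") differ, and both are needed later when building context for anchor texts.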
Once we had collected and parsed the data, we needed to store it in two ways: 1) persistent storage that allows us to iterate over all the data quickly, and 2) a search index to look up context for keywords.
For persistent storage we used the Hadoop Distributed File System (HDFS), which is ideal for distributed storage of large amounts of data. HDFS also facilitates relatively fast reading and writing of large-scale data. However, Hadoop is notoriously inefficient at searching across stored data when results are needed quickly.
For this, we needed an Elasticsearch index. Elasticsearch is a distributed index built on top of Lucene. It facilitates sharding, which is the splitting of data into manageable chunks for fast retrieval across the cluster. It also provides data replication, a failsafe against data loss that keeps copies of the same data on multiple machines.
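The idea behind sharding can be sketched in a few lines. This is only an illustration of hash-based routing in general: the CRC32 hash, the function names, and the document IDs are assumptions chosen for the example, and Elasticsearch uses its own routing hash internally.

```python
import zlib


def shard_for(doc_id, num_shards):
    """Pick a shard by hashing the document ID (CRC32 is an illustrative choice)."""
    return zlib.crc32(doc_id.encode("utf-8")) % num_shards


def assign(doc_ids, num_shards):
    """Group document IDs by the shard they route to."""
    shards = {i: [] for i in range(num_shards)}
    for doc_id in doc_ids:
        shards[shard_for(doc_id, num_shards)].append(doc_id)
    return shards


# Hypothetical page titles used as document IDs.
doc_ids = ["Taylor_Swift", "Santa_Clara", "Apache_Hadoop", "Wikipedia"]
shards = assign(doc_ids, num_shards=3)
print(shards)
```

Because the shard is a pure function of the ID, a lookup for a known document only has to touch one shard, while a keyword search can be fanned out to all shards in parallel.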
Given that the data has been collected and stored, it must be processed and analyzed. That is, for each anchor text (or wikilink) in Wikipedia, we must develop an abstract representation of its context to feed to our machine learning models. A preliminary step in this process is making a dictionary of terms from the corpus in order to create a sparse vector representation of context. In other words, if we know that the term “performer” is mentioned in the context of “Taylor Swift” 300 times in Wikipedia, and performer is the 25th term in the dictionary, in a simplistic model, we can assign it a weight (or term frequency) of 300 at position 25 in the vector.
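As a rough sketch of that simplistic model, the snippet below builds a dictionary from tokenized contexts and turns one context into a sparse term-frequency vector. The function names and the tiny token lists are hypothetical, and positions here are zero-based and assigned in first-seen order, which is just one possible convention.

```python
from collections import Counter


def build_dictionary(documents):
    """Map each distinct term to an integer position, in first-seen order."""
    dictionary = {}
    for tokens in documents:
        for term in tokens:
            dictionary.setdefault(term, len(dictionary))
    return dictionary


def to_sparse_vector(tokens, dictionary):
    """Return sorted (position, term_frequency) pairs for known terms."""
    counts = Counter(t for t in tokens if t in dictionary)
    return sorted((dictionary[t], n) for t, n in counts.items())


# Hypothetical tokenized contexts around an anchor text.
contexts = [
    ["american", "performer", "singer"],
    ["performer", "tour", "performer"],
]
dictionary = build_dictionary(contexts)
vector = to_sparse_vector(contexts[1], dictionary)
print(dictionary)
print(vector)
```

Only the non-zero positions are stored, which matters when the dictionary has hundreds of thousands of terms and each context mentions a few dozen.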
Term frequency alone is often an oversimplified weighting scheme for vectorization. Instead, the importance of a word is usually normalized by its document frequency, that is, the number of documents across the corpus in which it appears. The resulting metric is the commonly used term frequency-inverse document frequency (TF-IDF).
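One common form of this weighting multiplies a term's count in a document by the logarithm of the total number of documents divided by the term's document frequency. The sketch below uses that textbook variant with the natural logarithm; this is an assumption for illustration, since libraries differ in the log base and in whether they normalize the resulting vector, and the token lists are made up.

```python
import math
from collections import Counter


def document_frequencies(documents):
    """Count how many documents each term appears in."""
    df = Counter()
    for tokens in documents:
        df.update(set(tokens))  # set(): count each term once per document
    return df


def tfidf(tokens, df, num_docs):
    """Weight each term by tf * ln(N / df); terms never seen in df are skipped."""
    tf = Counter(tokens)
    return {
        term: count * math.log(num_docs / df[term])
        for term, count in tf.items()
        if df[term] > 0
    }


# Hypothetical tokenized documents.
docs = [["performer", "singer"], ["performer", "tour"], ["politics", "tour"]]
df = document_frequencies(docs)
weights = tfidf(docs[0], df, num_docs=len(docs))
print(weights)
```

Here "singer" appears in one of three documents and "performer" in two, so "singer" receives the higher weight even though both occur once in the first document. A term that appears in every document gets a weight of zero.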
Computing TF-IDF across Wikipedia is no trivial task. The gensim wikicorpus Python library, which uses multithreading to read the Wikipedia corpus, takes more than three hours to generate the dictionary. Creating a bag of words (BOW) for each document and generating the TF-IDF model then takes almost another four hours. We completed this task in about one hour by: 1) using wikihadoop to split the input 11 GB XML file, 2) parallelizing the work using pyspark, 3) tokenizing the text using gensim’s text extractor, and then 4) merging the local dictionaries created at each executor, again using gensim.
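The merge in step 4 works because document-frequency counts are additive across disjoint splits of the corpus. The sketch below simulates this on a single machine with plain Python: each inner list stands in for the documents seen by one executor, and the function names and data are hypothetical. It is not our pyspark job and does not use gensim's dictionary class; in a Spark setting the same pattern would roughly correspond to a per-partition map followed by a reduce.

```python
from collections import Counter
from functools import reduce


def local_dictionary(partition):
    """For one partition, count how many documents each term appears in."""
    df = Counter()
    for tokens in partition:
        df.update(set(tokens))  # each term counted once per document
    return df


def merge(left, right):
    """Combine two local document-frequency tables by adding their counts."""
    return left + right


# Hypothetical corpus split into two partitions of tokenized documents.
partitions = [
    [["performer", "singer"], ["performer", "tour"]],
    [["politics", "tour"]],
]
global_df = reduce(merge, map(local_dictionary, partitions), Counter())
print(global_df)
```

Since the merge is associative and commutative, partial results can be combined in any order, so no executor has to wait for a particular neighbor to finish.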
Given the dictionary and tf-idf model, we can then generate the features needed to train our machine learning models. Initial experiments using this pipeline to train various machine learning classifiers show approximately 80 percent precision in entity disambiguation, which is important for determining what a page is really talking about.
Over the past two years, we’ve made huge progress on the challenge of helping companies deliver a better, more context-driven experience on mobile. With the help of machine learning, we are helping companies put relevant, context-driven actions in front of the right users. There are a number of benefits we see from ingesting Wikipedia, and it is just one of many sources we are using to refine our models and add facts to our knowledge graph.