How URX Derives Context from Big Data

The URX technology platform helps predict what a mobile user wants to do next and enables users to easily take action inside apps.

A critical component of this technology is the ability to dynamically understand user intent from a web page. At run time, URX analyzes text and metadata on the page and uses our knowledge graph to gain an understanding of what a user is reading about. In other words, we make sense of the people, places, and things that are mentioned in order to recommend related actions.

As you can imagine, this is a serious “big data” problem that we have had to overcome. We were selected to share how we have addressed these challenges in the Startup Showcase at the 2015 IEEE Big Data Conference last week in Santa Clara, California. Below is a summary of the presentation, which addressed how URX collects, stores, and processes the data needed to train our machine learning models that dynamically detect what’s mentioned on a page.

Using Wikipedia to Train Machine Learning Models

We are continually ingesting data to train our machine learning models and improve our knowledge graph. Wikipedia is an incredibly valuable dataset for this task: it is a rich source of semantic information and is often considered a vast and comprehensive knowledge base. It is also a heterogeneous dataset, covering people, sports, politics, companies, and much in between.

We have a lot to gain by training and testing our machine learning models against it and using such models to help us add facts to our knowledge base. To achieve this, however, we must first collect and parse the data quickly and effectively.

Data Collection

The English Wikipedia corpus consists of roughly 15 million pages and is available for complete download as a single 11-gigabyte XML file. While Wikipedia is considered reasonably accurate, it does include a lot of noise: we consider only about one in three pages informative enough for learning.

Wikipedia is also updated all the time; to refine your model and stay current with those changes, you need to be able to parse the entire corpus as quickly as possible. We tried several Python libraries for this, including gensim and a suite of MediaWiki parsers such as mediawiki parser and mwparserfromhell. We eventually got the full parse down to 20 minutes using a combination of PySpark, mwparserfromhell, and wikihadoop.

The main challenge in parsing the text (aside from its size) is that Wikipedia articles are written in “wiki markup,” a different grammar from hypertext markup (i.e., HTML). Translating wikitext into raw text or HTML is not trivial: it requires understanding the wikitext grammar and mapping it accordingly. The mwparserfromhell library helps with this translation. From there, we used wikihadoop to split the input XML file and a PySpark job to parallelize the processing across many machines.
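To make the translation concrete, here is a deliberately simplified sketch of wikitext stripping in plain Python. The real pipeline uses mwparserfromhell, which properly handles templates, nesting, and the many edge cases this toy regex version ignores:

```python
import re

def strip_wikitext(text):
    """Toy wikitext-to-plain-text conversion (illustration only)."""
    # Replace [[target|label]] with label, and [[target]] with target
    text = re.sub(r"\[\[(?:[^|\]]*\|)?([^\]]+)\]\]", r"\1", text)
    # Drop bold/italic quote markers ('' and ''')
    text = re.sub(r"'{2,}", "", text)
    # Drop simple (non-nested) {{template}} calls
    text = re.sub(r"\{\{[^}]*\}\}", "", text)
    return text.strip()

raw = "'''Taylor Swift''' is an [[United States|American]] [[singer]]."
print(strip_wikitext(raw))
# → Taylor Swift is an American singer.
```

In practice, mwparserfromhell exposes this capability via `mwparserfromhell.parse(text).strip_code()`, which is what makes it usable as the translation step inside a distributed parsing job.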

Data Storage

Once we had collected and parsed the data, we needed to store it in two ways: 1) persistent storage that allows us to iterate over all the data quickly, and 2) a search index for looking up the context of keywords.

For persistent storage we used the Hadoop Distributed File System (HDFS), which is ideal for distributed storage of large amounts of data and facilitates relatively fast reading and writing at scale. However, Hadoop is notoriously inefficient at searching across stored data to return results quickly.

For this, we needed an Elasticsearch index. Elasticsearch is a distributed search index built on top of Lucene. It facilitates sharding, the splitting of data into manageable chunks for fast retrieval across the cluster. Further, it provides data replication - a failsafe against data loss that keeps copies of the same data on multiple machines.
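Both sharding and replication are configured when an index is created. A minimal sketch of such settings (the index name, shard count, and replica count here are illustrative, not URX's actual configuration):

```python
# Illustrative Elasticsearch index settings: primary shards split the
# data across the cluster for parallel retrieval; replicas keep a
# backup copy of each shard on another machine.
index_settings = {
    "settings": {
        "number_of_shards": 5,    # split the index into 5 chunks
        "number_of_replicas": 1,  # one backup copy of each shard
    }
}
```

With the official elasticsearch-py client, a dict like this would be passed to `indices.create` when setting up the index.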

Data Processing

With the data collected and stored, it must be processed and analyzed. That is, for each anchor text (or wikilink) in Wikipedia, we must build an abstract representation of its context to feed to our machine learning models. A preliminary step in this process is making a dictionary of terms from the corpus in order to create a sparse vector representation of context. In other words, if we know that the term “performer” is mentioned in the context of “Taylor Swift” 300 times in Wikipedia, and “performer” is the 25th term in the dictionary, a simplistic model can assign it a weight (or term frequency) of 300 at position 25 in the vector.
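The sparse-vector idea above can be sketched in a few lines. This is a simplified illustration (the term positions and counts mirror the “performer”/“Taylor Swift” example from the text; a real dictionary would hold millions of terms):

```python
# Map each term to its position in the corpus dictionary.
# "performer" at position 25 is the example from the text.
dictionary = {"performer": 25}

def to_sparse_vector(term_counts, dictionary):
    """Encode raw context counts as a sparse {position: weight} vector.

    Terms missing from the dictionary are simply dropped; only nonzero
    entries are stored, which is what makes the vector sparse.
    """
    return {dictionary[t]: c for t, c in term_counts.items() if t in dictionary}

# "performer" co-occurs with "Taylor Swift" 300 times in this toy context.
context_counts = {"performer": 300, "some_unknown_term": 7}
print(to_sparse_vector(context_counts, dictionary))
# → {25: 300}
```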

Raw term frequency, however, is an oversimplified weighting scheme for vectorization. Instead, the importance of a word is typically normalized by the number of documents it appears in across the corpus - its document frequency. The resulting metric is the commonly used term frequency-inverse document frequency (TF-IDF).
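In its basic form, the weighting just described is tf × log(N/df): a term scores highest when it is frequent in the document but rare across the corpus. A plain-Python sketch (libraries such as gensim's TfidfModel add smoothing and normalization variants on top of this):

```python
import math

def tf_idf(term_freq, doc_freq, n_docs):
    """Basic TF-IDF: term frequency scaled by inverse document frequency.

    term_freq: occurrences of the term in this document/context
    doc_freq:  number of corpus documents containing the term
    n_docs:    total number of documents in the corpus
    """
    return term_freq * math.log(n_docs / doc_freq)

# A term seen 300 times but present in only 10 of 1,000 documents
# gets a high weight; a term present in every document gets zero.
print(tf_idf(300, 10, 1000))   # high weight
print(tf_idf(300, 1000, 1000)) # → 0.0
```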

Computing TF-IDF across Wikipedia is no trivial task. The gensim wikicorpus Python library, which uses multithreading to read the Wikipedia corpus, takes more than three hours to generate the dictionary. Creating a bag-of-words (BoW) representation of each document and generating the TF-IDF model takes almost another four hours. We completed the whole task in about one hour by: 1) using wikihadoop to split the input 11GB XML file, 2) parallelizing the work with PySpark, 3) tokenizing the text using gensim’s text extractor, and 4) merging the local dictionaries created at each executor, again using gensim.
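Step 4, merging per-executor dictionaries, is essentially a reduce over term-count maps. In the actual pipeline this runs as a Spark reduce over gensim dictionaries; the sketch below shows the same merge in plain Python with illustrative terms:

```python
from collections import Counter
from functools import reduce

def merge_counts(a, b):
    """Combine two local term-count dictionaries by summing counts."""
    a.update(b)  # Counter.update adds counts rather than replacing them
    return a

# Hypothetical local dictionaries built independently on two executors.
local_dicts = [
    Counter({"swift": 3, "performer": 1}),
    Counter({"performer": 2, "album": 5}),
]

merged = reduce(merge_counts, local_dicts, Counter())
print(merged)
# → Counter({'album': 5, 'swift': 3, 'performer': 3})
```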

Given the dictionary and tf-idf model, we can then generate the features needed to train our machine learning models. Initial experiments using this pipeline to train various machine learning classifiers show approximately 80 percent precision in entity disambiguation, which is important for determining what a page is really talking about.


Over the past two years, we’ve made huge progress on the challenge of helping companies deliver a better, more context-driven experience on mobile. With the help of machine learning, we are helping companies put relevant, context-driven actions in front of the right users. There are a number of benefits we see from ingesting Wikipedia, and it is just one of many sources we are using to refine our models and add facts to our knowledge graph.

Many thanks to @delroycam and the @urxtech Science Team for all of their help with this post. Please reach out to us if you are interested in learning more!

The Eagle(s) Have Landed in New Haven!

YHack is an international hackathon hosted by and held at Yale, bringing together 1,500 like-minded hackers and creatives from all over the world. This year, two URX team members, Jeremy Lucas and Andrew Goldstein, traveled across the country from San Francisco to give YHackers the opportunity to build innovative experiences using the URX App Search API. We can't wait to see the cool sh!t the YHackers come up with!

If you're attending YHack and want to see what URX is working on, please visit our event website.

Takeaways from the ExchangeWire ATS New York Conference

This past Tuesday, November 2nd, ExchangeWire held its last ATS event of the year in New York City. URX’s Chief Revenue Officer, Lauren Nemeth, spoke on stage about the promise of deep linking to solve the challenges of mobile fragmentation. Her slides and key takeaways are below:


  • Platform control is dangerous. Facebook, Twitter, Google, and Apple dominate referral traffic and app store distribution on mobile. Publishers have to figure out how to monetize content outside of their own site, often without access to cookies or user data.
  • Focus on what a user wants to do next. Mobile marketing campaigns tend to focus on targeting users and the actions they have taken. The future of mobile marketing will be about appealing to what mobile users want to do in the moment (context) rather than fixating on what they have done in the past (retargeting).
  • Deep linking drives action. Mobile users love the functionality and convenience of apps; however, apps lack the basic link functionality of the web. The solution is deep linking, which lets users take action inside an app with a single click, streamlining the path to purchase.

Lauren’s presentation was well received by the ~400 attendees across agencies, publishers, ad tech companies, ad networks, and more:

We’d love to know what you think. Please reach out to us on Twitter @urxtech to continue the conversation!

Deep links help Uber drivers keep their eyes on the road

An easy button to hail an Uber when you are looking up directions in Google Maps is one of the most-cited examples of deep linking. It just makes sense.

However, Uber's implementation of deep links in its driver app might actually save lives! It gives drivers a one-click option to open Waze or Google Maps and check for the best possible route. The more time drivers keep their hands on the wheel instead of typing in addresses while cruising in traffic, the better.