# URX is joining Pinterest to solve content discovery at scale

URX was founded just over 3 years ago to create seamless, interconnected mobile experiences by helping people discover content inside apps. During that time, we've partnered with many of the world's leading developers and brands to help them distribute and monetize their content on mobile. We’ve helped lay the foundation for a world in which mobile apps exist as a part of the web, and I couldn’t be more excited to take the next step in our journey.

Today, I’m excited to announce that the URX team is joining Pinterest. As we learned more about Pinterest's mission of helping people discover and do the things they love, it was clear that we share a similar view of the future. Discovery is one of the largest problems and opportunities on the web, and Pinterest is well positioned to solve it at an unprecedented scale.

As a part of this transition, we will be sunsetting URX’s AppViews advertising product, effective immediately. We’ve been working with our customers over the last month to ease this transition.

We wouldn’t be where we are without the customers, developers, investors and mentors who believed in us. We want to thank everyone who has been a part of the first piece of our journey, and hope you all continue to follow along as we work with Pinterest to make the web’s content more discoverable and actionable.

- John

# URX launches product that helps revitalize the mobile ecosystem

Traditional monetization models are failing as users migrate from desktop to mobile. Millions of users are installing ad blockers, and advertisers and content publishers are getting desperate for new ways to convert page views into revenue. Enter the AppView Carousel: URX’s new advertising platform that connects publishers and users directly to relevant, revenue-generating mobile experiences.

For instance, if you’re reading an Entertainment Weekly article about Beyonce and Coldplay’s upcoming Super Bowl 50 halftime show, the AppView Carousel will allow you to do things like buy Beyonce paraphernalia on Wish, listen to the artists’ popular songs on Spotify, or find tickets to the game on SeatGeek.

Cards in the AppView Carousel related to Beyonce, Coldplay, and Super Bowl 50.

URX believes that the key to mobile commerce lies in anonymously and accurately harnessing user intent. Using predictive technology, the AppView Carousel determines what a user might want to do next, based on what they’re reading about. Our platform then helps turn intent into action by deep linking users directly to the desired content, wherever it lives across the mobile ecosystem.

The AppView Carousel helps extinguish fires started by intrusive and ineffective mobile ads. Instead of serving static, irrelevant ads, publishers can now use the AppView Carousel to bring useful, easy-to-navigate products directly into users’ hands.

We’re thrilled to partner with The Huffington Post, Entertainment Weekly, Spin, sovrn, Disqus, Dancing Astronaut, FanSided and many other properties to help them build dynamic commerce opportunities from their content. With the addition of the AppView Carousel, readers of these premium publications will be able to take advantage of a new wave of mobile commerce opportunities within the retail, local, travel, and ticketing verticals due to our partnerships with advertisers like Fetch, Groupon, SeatGeek, OpenTable, Spotify, and more.

"URX allowed us to maintain control of our top priority - building the best user experience - while making critical progress toward our mobile revenue goals,” says David Rosenbloom, General Manager of Entertainment Digital for Entertainment Weekly at Time Inc.

By helping publishers unlock new revenue streams, brands reach users with relevant and specific offers, and users discover relevant content, the AppView Carousel marks a major step toward creating a healthy and thriving mobile ecosystem.

We at URX are excited to develop additional solutions that empower users to interact with relevant mobile content in the coming months.

# Introduction

## Background

URX’s mission is to help publishers distribute and monetize content on mobile in a way that’s useful to users. We use anonymous context from mobile websites (e.g., keywords) to recommend engaging actions to users, be it listening to Adele’s Hello in Spotify or buying a new pair of winter boots in the Amazon Shopping app.

Imagine that John Milinovich is reading an article on his phone about the January 1st, 2016 Rose Bowl football game between Stanford and Iowa. URX might recommend relevant actions to John, such as purchasing tickets for the game or buying gear to show his school spirit:

The URX Science team supports the company's mission by parsing, storing, and learning from the content contained within mobile web and app pages. This puts us in a position to intelligently suggest the next action a mobile user should take.

## Challenges

One of the core challenges is understanding the underlying structure of web content. Once we can identify structure in webpages, we have more power to focus on the sections that best inform our matching process. The problem is that web content is rarely consistent. Looking across websites, you will find similar HTML elements: `<html>`, `<body>`, `<head>`, `<div>`, … but the layout of these elements varies considerably between sites.

What we’ve discovered, however, is that while inter-domain (across domains) structure is hard to come by, there is often a lot of structure within most domains (intra-domain). That is, within a domain, web developers typically use a similar layout page-to-page with a similar organization of HTML tags. Visually, this creates a consistent design; algorithmically, this provides patterns we can exploit. For example, below you can see that SpinMedia (Spin) uses consistent title, author, and tag selection across various pages on its Spin.com property:

# Problem Definition

We restrict the problem of finding structure in HTML pages to finding structure within a domain rather than between domains. Once we solve the intra-domain problem, we can automate the algorithm to run against an array of domains. We distill our problem to: given a set of web pages in a domain, is it possible to automatically "learn" similar content locations shared among the pages? Can we automatically detect that all Spin sites have a title, a keyword section, a header section, and more, given nothing but a list of Spin URLs?

## How to define a "location" on a website

The first step toward identifying common content locations within a domain is to look at how web pages are laid out. The structure of a website is represented as a tree of HTML elements of varying depth and complexity called the document object model (DOM). Specific content on a given web page corresponds to a node, or a set of nodes, in a web page’s DOM tree. The XML Path Language (XPath) is commonly used for describing locations within a DOM tree. We can use multiple different types of information to constrain our search for a particular XPath.

file.py

```python
from StringIO import StringIO
from lxml import etree

sample_html_doc = \
"""<html>
<meta http-equiv="Content-Type" content="text/html; charset=utf-8" />
<meta name="generator" content="Docutils 0.10: http://docutils.sourceforge.net/" />
<title>I luv html </title>
<body>
<div><div></div></div>
</body>

</html>"""

# Parse with the HTML parser so lxml builds the implicit <head> element
tree = etree.parse(StringIO(sample_html_doc), etree.HTMLParser())

# Get the title element
xpath = '/html/head/title'
elements = tree.xpath(xpath)
print "We found %d tag for %s: " % (len(elements), xpath)

# Get the second meta tag
xpath = '/html/head/meta[2]'
elements = tree.xpath(xpath)
print "We found %d tag for %s: " % (len(elements), xpath)

# Meta tag must have a particular key-value pair
xpath = '/html/head/meta[@content="text/html; charset=utf-8"]'
elements = tree.xpath(xpath)
print "We found %d tag for %s: " % (len(elements), xpath)

# Meta tag must have a certain attribute value, key doesn't matter
xpath = '/html/head/meta[@*="text/html; charset=utf-8"]'
elements = tree.xpath(xpath)
print "We found %d tag for %s: " % (len(elements), xpath)

# Meta tag must have the http-equiv key and a particular content value
xpath = '/html/head/meta[@http-equiv and @content="text/html; charset=utf-8"]'
elements = tree.xpath(xpath)
print "We found %d tag for %s: " % (len(elements), xpath)
```

```
We found 1 tag for /html/head/title:
We found 1 tag for /html/head/meta[2]:
We found 1 tag for /html/head/meta[@content="text/html; charset=utf-8"]:
We found 1 tag for /html/head/meta[@*="text/html; charset=utf-8"]:
We found 1 tag for /html/head/meta[@http-equiv and @content="text/html; charset=utf-8"]:
```

Now that we have a consistent language for defining HTML element locations, we can redefine our core problem: how do you identify a set of XPaths for a domain which point to consistent content? It should be possible to apply each of these XPaths to the majority of pages within a domain. Furthermore, we want XPaths to uniquely point to only one piece of content within a page. This latter point is important; if we are too loose in our XPath matching criteria, the corresponding site content will be ambiguous.
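To make the uniqueness requirement concrete, here is a minimal sketch with toy, hypothetical markup (using the standard library's ElementTree rather than lxml, for brevity): a loose XPath matches several nodes and is therefore ambiguous, while an attribute-constrained one pins down a single element.

```python
import xml.etree.ElementTree as ET

# A toy, well-formed page (hypothetical markup, not a real publisher page).
page = """<html><body>
  <div class="tags">Beyonce, Coldplay</div>
  <div class="body-text">Article text...</div>
</body></html>"""

root = ET.fromstring(page)

# Too loose: matches every div, so the extracted content is ambiguous.
loose = root.findall('.//div')

# Constrained by an attribute value: matches exactly one node.
strict = root.findall('.//div[@class="tags"]')

print(len(loose), len(strict))  # 2 1
```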

## Conventional approaches for structured prediction

At first glance, our problem seems to map to the machine learning field of structure learning, as we want to learn the structure of a web domain. Several Python libraries are available for prototyping structured learners like a Conditional Random Field (CRF). In addition, some modules are specifically designed for website DOM structure. Structured learners are more flexible than hard-coding a rule set for identifying DOM elements, as they can use features like the depth of the tree and the density of text data to elucidate patterns in web text. If we were to train a CRF to learn the common structure embedded in Spin websites, we would:

1. Manually identify the common elements across multiple Spin pages
2. Label these elements with a consistent schema (e.g., "title", "header 1", "keywords")
3. Extract features from these pages
4. Train a machine learning model to correlate these features with the human labels

Unfortunately, this approach requires creating a labeled dataset of common elements for each domain we care about. This is non-trivial: a human must study several web pages within the domain and interpret which elements are “consistent”. But that interpretation is precisely the work we want to automate - we are trying to build a program that tells us which elements are important.

## Solution

Let’s instead look for a solution that will work for any new domain, using only a list of URLs from the domain as a starting point - no human labeling required. Consider the following intra-domain approach:

1. Choose one page to start; we will call it the reference page
2. Parse the reference page's DOM and enumerate its leaf elements
3. For each reference leaf element, try to figure out whether that element represents a "common" XPath that has an analogue in all or most of the pages within the domain

In Step 3, we define an element to be “common” if the same XPath, defined as a list of HTML tags plus attribute key-value pairs, successfully extracts text from all pages. At first we require all HTML tags and attribute-key-value pairs to match. If this doesn’t work, it doesn't necessarily mean the XPath is bad; a particular attribute key, or value, may differ systematically across sites in a domain. We then use increasingly less strict criteria by allowing one, two, or three non-matches in specific attribute keys or values, iterating over all combinations. Each time, we examine whether we’ve successfully produced a common path that works on all sites. This approach identifies common XPaths and successfully isolates the attribute keys and values that differ systematically within a common XPath.
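The relaxation step can be sketched in a few lines (the representation and names below are hypothetical illustrations, not URX's actual code): represent an XPath as a list of (tag, attributes) pairs, then yield variants that keep each attribute key but allow any value, for every combination of k dropped value constraints.

```python
from itertools import combinations

def relaxed_xpaths(steps, k):
    """Yield XPath strings that relax every combination of k attribute
    value constraints. `steps` is a list of (tag, {attr: value}) pairs.
    This is a simplified sketch, not URX's production code."""
    # Enumerate every (step index, attribute key) constraint we could drop.
    constraints = [(i, key)
                   for i, (_, attrs) in enumerate(steps)
                   for key in attrs]
    for dropped in combinations(constraints, k):
        parts = []
        for i, (tag, attrs) in enumerate(steps):
            preds = []
            for key, value in sorted(attrs.items()):
                if (i, key) in dropped:
                    # Keep the key but allow any value.
                    preds.append('@%s' % key)
                else:
                    preds.append('@%s="%s"' % (key, value))
            parts.append(tag + ('[%s]' % ' and '.join(preds) if preds else ''))
        yield '/' + '/'.join(parts)

steps = [('html', {}),
         ('body', {'class': 'single single-post postid-164076'}),
         ('div', {'class': 'tags'})]

variants = list(relaxed_xpaths(steps, 1))
for xp in variants:
    print(xp)
```

Each variant is then tried against every page in the domain until one extracts exactly one element everywhere.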

Below we start with a full XPath (including all attribute key-value pairs) for the keywords section of a reference website (Step 1). We then create a generator that yields “partial" XPaths, which have less strict matching criteria, leaving either one attribute key or value out of the matching criteria. Below, we document the process of finding the partial path that works well across all websites.

nn.py

```python
import requests
exec requests.get(
    'https://raw.githubusercontent.com/URXtech/domblogpost/master/notebookcode.py').text

# This XPath corresponds to the place on the reference website where we found
# the Spin keywords (for one reference site). Below, we output a generic XPath
# that will work for any Spin site.

xpathjsonstr = """[
    {"elename": "[document]"},
    {"elename": "html"},
    {"elename": "body",
     "attrs": {"class": "single single-post postid-164076 single-format-image"}},
    {"elename": "div",
     "attrs": {"class": "container main-container"}},
    {"elename": "div",
     "attrs": {"class": "row"}},
    {"elename": "div",
     "attrs": {"class": "col-sm-8 left-col"}},
    {"elename": "div",
     "attrs": {"class": "article-body"}},
    {"elename": "div",
     "attrs": {"class": "article-content-holder"}},
    {"elename": "div",
     "attrs": {}},
    {"elename": "div",
     "attrs": {"class": "col-sm-9"}},
    {"attribute_key": "null",
     "typet": "rawtext",
     "elename": "div",
     "attrs": {"class": "tags"}}
]"""

xpath = XPath.from_json(xpathjsonstr)
partialpathgenerator = xpath.get_less_strict_xpath(1)
partialpathgenerator.next()
working_xpath = partialpathgenerator.next().as_xpathlang()
print working_xpath
```

```
/html/body[@class]/div[@class='container main-container']/div[@class='row']/
div[@class='col-sm-8 left-col']/div[@class='article-body']/
div[@class='col-sm-9']/div[@class='tags']
```

We've produced a single partial XPath with slightly less strict matching criteria than the original JSON string we started with. Note how deeply embedded the specific XPath is in the DOM structure. We see that the class key in the body tag is now orphaned: we no longer mandate that the "class" key of the body tag have a particular value. Although a matching tag must still have a class attribute key, we'll allow the corresponding value to be anything.

This is a good thing, because the actual value (e.g., "single single-post postid-164053 single-format-image") turns out to be highly variable across pages, with an identifier for the particular webpage embedded within it ("postid-164053"). As such, the original XPath wouldn't work on anything but the reference page. With this less strict criteria, however, the partial path we created will work with any Spin website.

In the typical case, we'll have to create many of these partial XPaths before we find one that works well across all sites in the domain - we are often successful by considering all combinations. Here we show the partial XPath found for the keywords section from the Spin diagram above (shown in blue). We then apply it to a few Spin webpages and pull out the textual content on each page.

nn.py

```python
from lxml import etree
from StringIO import StringIO
import requests

parser = etree.HTMLParser()

urls = [
    'http://www.spin.com/2015/09/global-citizen-festival-2015-livestream/',
    'http://www.spin.com/2015/11/palm-trading-basics-album-premiere-lp/'
]

def apply_xpath(url):
    tree = etree.parse(StringIO(requests.get(url).text), parser)
    tag = tree.xpath(working_xpath)
    print 'Here is our Tag Text for %s\n%s\n' % \
        (url, '"' + ' '.join([a for a in tag[0].itertext()]).replace('\n', '') + '"')

for url in urls:
    apply_xpath(url)
```

```
Here is our Tag Text for http://www.spin.com/2015/09/global-citizen-festival-2015-livestream/
" Tags:  Beyonce ,  coldplay ,  ed sheeran ,  pearl jam  "

Here is our Tag Text for http://www.spin.com/2015/11/palm-trading-basics-album-premiere-lp/
" Tags:  PALM  "
```



# Conclusions

We've developed a systematic approach for isolating common XPaths within a website domain. We use a powerful brute-force search that defines an XPath as a list of HTML tags plus all associated attribute key-value pairs. If a given XPath can’t be applied to all pages, we try looser matching criteria: we iteratively leave out one, two, or three attribute keys or values until we find a slightly looser match that still identifies one unique place on the page for all sites.

This approach has proved valuable for identifying domain-specific elements that can be used to extract valuable information related to the page's content. In turn, this content can be used to create excellent features for machine learning models, thus feeding downstream recommendation engines.

# URX Debrief: The Best of Deep Linking & Mobile Monetization

Digital ads helped build the internet we know today, but as user behavior moves to mobile devices, digital ads have grown ever more annoying - nearly to a breaking point. Read more

### Cyber Monday Beat Forecasts With A Record $3.07 Billion In Sales, 26% From Mobile Devices

Consumers spent 16% more online year-over-year on Cyber Monday, with $799M coming from mobile devices. Read more

Snapchat now lets publishers link directly to Discover content on its platform, making it easier for users to find and engage with Snapchat content outside of its own walls. Read more

Google and Facebook drive the majority of traffic to many publisher sites, and publishers need to consider the risks of depending on these sources so heavily. Read more

Google launches a new display product to help commercialize its recently-launched app streaming functionality. Users will see ads that are mini-versions of apps themselves. Read more

### Uber’s API Now Lets Other Apps Add A “Ride Request Button”, But Not Competitors Too

Uber now offers a turnkey way for developers to get paid for driving qualified users to its app. Read more

# How URX Derives Context from Big Data

The URX technology platform helps predict what a mobile user wants to do next and enables users to easily take action inside apps.

A critical component of this technology is the ability to dynamically understand user intent from a web page. At run time, URX analyzes text and metadata on the page and uses our knowledge graph to gain an understanding of what a user is reading about. In other words, we make sense of the people, places, and things that are mentioned in order to recommend related actions.

As you can imagine, this is a serious “big data” problem that we have had to overcome. We were selected to share how we have addressed these challenges in the Startup Showcase at the 2015 IEEE Big Data Conference last week in Santa Clara, California. Below is a summary of the presentation, which addressed how URX collects, stores, and processes the data needed to train our machine learning models that dynamically detect what’s mentioned on a page.

## Using Wikipedia to Train Machine Learning Models

We are continually ingesting data to train our machine learning models and improve our knowledge graph. Wikipedia is an incredibly valuable dataset for this task because it is a rich source of semantic information and is often considered a vast and comprehensive knowledge base. It’s also a heterogeneous data set with lots of diversity - information on people, sports, politics, companies and many things in between.

We have a lot to gain by training and testing our machine learning models against it and using such models to help us add facts to our knowledge base. To achieve this, however, we must first collect and parse the data quickly and effectively.

## Data Collection

The English Wikipedia corpus consists of 15 million pages and is available for complete download as an 11-gigabyte XML file. While Wikipedia is considered reasonably accurate, it does include a lot of noise: we consider only about one in three pages informative enough for learning.

Wikipedia is also updated all the time; in order to refine your model and stay up to date with changes, you need to be able to parse the entire corpus as quickly as possible. We tried several Python libraries to do this, including gensim and a suite of MediaWiki parsers such as mediawiki-parser and mwparserfromhell. We eventually figured out how to do it in 20 minutes using a combination of pyspark, mwparserfromhell, and wikihadoop.

The main challenge in parsing the text (aside from size) is that Wikipedia text is written using “wiki markup,” which is a different grammar from hypertext markup (i.e., HTML). Translating “wikitext grammar” into raw text or HTML format is not trivial; it requires understanding and then mapping the wikitext accordingly. The mwparserfromhell library helps with this translation. Once we did that, we used wikihadoop to split the input XML file and then parallelize the processing among many machines using a pyspark job.
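To give a flavor of what the translation involves, here is a toy sketch of just one wikitext rule, the link syntax; the full grammar (templates, refs, tables, nesting) is why we rely on a real parser like mwparserfromhell rather than regexes.

```python
import re

def strip_wikilinks(wikitext):
    """Toy illustration of one wikitext rule: [[target|label]] renders as
    'label' and [[target]] as 'target'. A real parser such as
    mwparserfromhell also handles templates, refs, tables, and nesting."""
    def repl(match):
        # Use the display label if present, otherwise the link target.
        return match.group(1).split('|')[-1]
    return re.sub(r'\[\[([^\]]+)\]\]', repl, wikitext)

sample = "[[Taylor Swift]] is an American [[singer-songwriter|performer]]."
print(strip_wikilinks(sample))  # Taylor Swift is an American performer.
```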

## Data Storage

Once we had collected and parsed the data, we needed to store it in two ways: 1) persistent storage that allows us to iterate over all the data quickly, and 2) a search index to look up context for keywords.

For persistent storage we used the Hadoop FileSystem (HDFS), which is ideal for distributed storage of large amounts of data. HDFS also facilitates relatively fast reading and writing of large scale data. However, Hadoop is notoriously inefficient at providing search support across stored data to get results quickly.

For this, we needed an Elasticsearch index. Elasticsearch is a distributed index built on top of Lucene. It facilitates sharding, which is the splitting of data into manageable chunks for fast retrieval across the cluster. Further, it provides data replication - a failsafe against data loss that keeps copies of the same data on multiple machines.

## Data Processing

Given that the data has been collected and stored, it must be processed and analyzed. That is, for each anchor text (or wikilink) in Wikipedia, we must develop an abstract representation of its context to feed to our machine learning models. A preliminary step in this process is making a dictionary of terms from the corpus in order to create a sparse vector representation of context. In other words, if we know that the term “performer” is mentioned in the context of “Taylor Swift” 300 times in Wikipedia, and performer is the 25th term in the dictionary, in a simplistic model, we can assign it a weight (or term frequency) of 300 at position 25 in the vector.

Term frequency alone is often an oversimplified weighting scheme for vectorization. Instead, the importance of a word is often normalized by the number of documents it appears in across the corpus; that is, its document frequency. The resulting metric is the commonly used term frequency-inverse document frequency (TF-IDF).
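For reference, the textbook form of the weighting looks like this (libraries, gensim included, differ in smoothing and normalization details, so treat it as the idea rather than the exact formula):

```python
import math

def tf_idf(tf, df, n_docs):
    """Textbook tf-idf: term frequency scaled by the log of the inverse
    document frequency. Rare terms are up-weighted, ubiquitous terms
    are down-weighted."""
    return tf * math.log(float(n_docs) / df)

# A term seen 300 times that appears in only 1,000 of 15M documents
# scores far higher than one appearing in half the corpus.
rare = tf_idf(300, 1000, 15000000)
common = tf_idf(300, 7500000, 15000000)
print(rare > common)  # True
```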

Computing TF-IDF across Wikipedia is no trivial task. The gensim wikicorpus Python library, which uses multithreading to read the Wikipedia corpus, takes more than three hours to generate the dictionary. Creating a bag of words (BoW) from each document and generating the TF-IDF model then takes almost another four hours. We completed this task in about one hour by: 1) using wikihadoop to split the input 11GB XML file, 2) parallelizing it using pyspark, 3) tokenizing the text using gensim’s text extractor, and then 4) merging local dictionaries created at each executor, again using gensim.
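Step 4, merging the per-executor dictionaries, can be sketched in pure Python (gensim's Dictionary offers this via merge_with; the version below is a simplified stand-in with hypothetical data):

```python
def merge_dictionaries(local_dicts):
    """Merge per-executor {token: local_id} vocabularies into one global
    {token: global_id} mapping - a pure-Python stand-in for the gensim
    dictionary merging done in the pyspark job."""
    merged = {}
    for local in local_dicts:
        # Iterate tokens in a stable order so global ids are reproducible.
        for token in sorted(local):
            if token not in merged:
                merged[token] = len(merged)
    return merged

# Hypothetical local dictionaries built on two executors.
partition_a = {'adele': 0, 'hello': 1}
partition_b = {'hello': 0, 'spotify': 1}
merged = merge_dictionaries([partition_a, partition_b])
print(sorted(merged.items()))  # [('adele', 0), ('hello', 1), ('spotify', 2)]
```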

Given the dictionary and tf-idf model, we can then generate the features needed to train our machine learning models. Initial experiments using this pipeline to train various machine learning classifiers show approximately 80 percent precision in entity disambiguation, which is important for determining what a page is really talking about.

## Conclusion

Over the past two years, we’ve made huge progress on the challenge of helping companies deliver a better, more context-driven experience on mobile. With the help of machine learning, we are helping companies put relevant, context-driven actions in front of the right users. There are a number of benefits we see from ingesting Wikipedia, and it is just one of many sources we are using to refine our models and add facts to our knowledge graph.

Many thanks to @delroycam and the @urxtech Science Team for all of their help with this post. Please reach out to research@urx.com if you are interested in learning more!