Sehrch.com: A Structured Search Engine Powered By Hypertable

Introduction

Sehrch.com is a structured search engine. It provides powerful querying capabilities that enable users to quickly complete complex information retrieval tasks. It gathers conceptual awareness from the Linked Open Data cloud and can be used (1) as a regular search engine or (2) as a structured search engine. In both cases conceptual awareness is used to build entity centric result sets.

To facilitate structured search we have introduced a new, simple search query syntax that allows for bound properties and relations (contains, less than, more than, between, etc.). The initial implementation of Sehrch.com was built over an eight-month period. The primary technologies used are Hypertable and Lucene. Hypertable is used to store all Sehrch.com data, which is in RDF (Resource Description Framework). Lucene provides the underlying searcher capability. This post provides an introduction to structured web search and an overview of how we tackled our big data problem.

Some example searches on Sehrch.com:

Query                        | Type of query   | Notes
Madonna                      | unstructured    | like a regular search engine
Justin (age<25)              | semi-structured | token "Justin", property age less than 25
(type:Pop singers) (age<20)  | structured      | entity type pop singers (free text) and property age less than 20
(type:building) (floors>150) | structured      | entity type building (free text) and property floors more than 150

About me

I am a computer scientist who focuses mostly on big data problems. My research area is the Semantic Web. I am based in the United Kingdom.

This post is written from my perspective as the creator of Sehrch.com and the opinions expressed may not represent those of Hypertable, Inc or other publishers of this post.

Background

Since their advent, web search engines have evolved to scale. Whereas they once served a few thousand results for a given query, they can now serve billions; the result sets have themselves become web scale. To progress information retrieval on the web, querying capability must also evolve. Sehrch.com introduces structured search, which allows documents to be found using an ever-expanding list of search operators, which we call properties, harvested from the Semantic Web.

Over the last decade, Google has shown the importance of web search. It is a difficult problem to tackle, as evidenced by the fact that there has been little viable competition in the English-language market. Moreover, the typical differentiating factors of web search have been limited to speed, coverage and relevance. While web search is difficult, some factors, like scale, are well understood. With Moore's Law and the Bigtable architecture, Google and others have managed to keep pace with the web. But while web search engines have evolved to scale, their interaction with users has not.

Application-specific, enterprise or controlled-domain search engines can offer fielded (property) search, entity models, faceting and other features to assist with information retrieval tasks. Web search engines, on the other hand, are still restricted to accepting keyword queries and place a heavier dependency upon relevance algorithms. With continuous scaling, it is not uncommon to be served result sets that are billions of documents in size. This raises the question: in the context of information retrieval, how are such large result sets useful?

As a human being, barring a mining task, why would you need a hundred-million-document result set for a query? Any SEO expert will tell you it's all about page number one, and users rarely go beyond the first few pages of search results at all. So while a hundred-million-document result set provides far more than adequate coverage for keyword queries, it does not escape the fact that, from an information retrieval perspective, a result set that large and widespread is lousy.

The primary reason for the large result sets is the limitation of the inputs. While search has scaled, the inputs have not. User queries are still mostly ambiguous keywords. Users have no capacity to define semantics, unlike with an answer engine, where semantics can be extracted from natural language. Search engines, on the other hand, provide a limited, fixed set of operators, which mostly expose niche functionality (e.g. define: to look up a dictionary definition) rather than focusing a query (e.g. site: to search a subset of the web).

Documents on the web are unstructured. They contain markup and they contain links, but those links are anonymous: there is no link type property. But the web is evolving and structured data now exists on the web. Whether as RDF in the Linked Open Data (LOD) cloud, or as microdata, RDFa or now Schema.org markup in pages, structured data is available in abundance, with more datasets being published daily. Ontologies also exist on the Semantic Web, with rich domain model definitions in OWL (Web Ontology Language).

Why can't that data and those domain models be harnessed by search engines to address the input problem? While answer engines attempt to extract query semantics, why can't users define those query semantics in some form of structure, with an expanding list of operators, as a structured web search query? The data and domain models gathered from the Semantic Web should be used to augment, focus and scope a search result. This should enrich a user's search experience and make information retrieval tasks more efficient than they are now.

The Vision

Advertising over the last decade has taught users how to use search engines most effectively, for instance by identifying keywords and avoiding natural language. However, search engines employ little, if any, conceptual awareness when deriving search results. This makes them under-equipped to determine context outside of user profiling. For example, given the query "Magic Mountain", a user may be referring to the theme park, the German novel, or even an alternative novel set in the theme park. By employing conceptual knowledge, result sets can become entity centric, after which they may be further scoped. An entity centric result set about Magic Mountain the theme park can be searched further for novels, finding novels set in Magic Mountain the theme park.

Entity centric scoping can be achieved with conceptual awareness (knowledge of the existence of, and facts about, entities). If a user queries for "Magic Mountain", but actually means the novel, the search engine could internally query "Magic Mountain Thomas Mann book". The user did not know the author's name, or even specify "book", but the search engine appended these tokens to the query, transparently to the user, to better focus the result set. The system knew that Magic Mountain was an entity of type book (among others, like a theme park) and also had author data. Using this methodology, search engines can derive focused search queries that then yield entity centric result sets.
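To make that mechanism concrete, here is a minimal sketch of this kind of query augmentation. It is illustrative only: the facts map and the choice of properties stand in for whatever the conceptual awareness store returns, and none of it is Sehrch.com's actual code.

```java
import java.util.LinkedHashSet;
import java.util.Map;
import java.util.Set;

// Illustrative sketch: augment an ambiguous keyword query with disambiguating
// tokens drawn from known facts about the chosen entity. The facts map stands
// in for a conceptual-awareness lookup; it is not Sehrch.com's internal API.
public class QueryAugmenter {

    public static String augment(String userQuery, Map<String, String> entityFacts) {
        Set<String> tokens = new LinkedHashSet<>();
        for (String t : userQuery.toLowerCase().split("\\s+")) {
            tokens.add(t);
        }
        // Append the entity's type and author (when present) as extra tokens,
        // mirroring the "Magic Mountain" -> "Magic Mountain Thomas Mann book" example.
        for (String key : new String[] {"type", "author"}) {
            String value = entityFacts.get(key);
            if (value != null) {
                for (String t : value.toLowerCase().split("\\s+")) {
                    tokens.add(t);
                }
            }
        }
        return String.join(" ", tokens);
    }

    public static void main(String[] args) {
        Map<String, String> facts = Map.of("type", "book", "author", "Thomas Mann");
        System.out.println(augment("Magic Mountain", facts));
        // -> "magic mountain book thomas mann"
    }
}
```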

Given an ambiguous query, how is the entity of interest determined? Well, it is either specified or it is not. If it is not, the query is type ambiguous. With such queries, an entity match attempt is made, and an initial entity is selected using relevance versus popularity algorithms. Other entity suggestions may be offered (ranked using the same algorithms). If a user selects a different entity, the query is rerun, but now better scoped. Therefore, in the worst case the user will receive either an entity centric result set (1) for Magic Mountain the theme park or a result set (2) for Magic Mountain the novel, whereas no conceptual awareness would produce a single non-centric result set specific to neither.

With a structured query, users can specify the entity type or another property to focus the initial entity match and then derive entity centric result sets with a better probability of appropriate query context. Whereas answer engines rely on extracting those semantics from natural language, search engines that place emphasis on keywords (and in turn focused querying) should provide for accepting a form of structure to accommodate query context. Sehrch.com accepts unstructured, semi-structured or structured queries, with the table at the start of this post providing an example of each. Structured queries also enable entity discovery, as users can search by describing aspects of their target (as properties) rather than relying solely on a textual hit.

Entities can be discovered by searching against properties. For instance, if the user wanted to find pop singers, they could query "(type:pop singers)". Here the property "type" is bound to the value pop singers, and the engine's conceptual awareness is consulted to retrieve entities that are pop singers. Parentheses are used in this structured search syntax to group multiple terms, as opposed to quotes, which additionally define order. Additional properties can be bound, for instance "(type:pop singers) (age<20)". The parentheses around the age clause are optional, as its value is a single token. Less-than and more-than relations can also be expressed between properties and bound values.
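As an illustration of the syntax, the sketch below tokenises queries of this form into property/relation/value bindings. The grammar is inferred from the examples in this post and is a simplification, not the production parser.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Illustrative sketch: pull bound properties such as "(type:pop singers)" or
// "age<20" out of a query string. The grammar is inferred from the examples
// in this post, not taken from Sehrch.com's parser.
public class StructuredQueryParser {

    // Matches either a parenthesised clause or a bare single-token clause.
    private static final Pattern CLAUSE = Pattern.compile("\\(([^)]+)\\)|(\\S+)");
    // Splits a clause into property, relation (:, <, >) and value.
    private static final Pattern BINDING = Pattern.compile("(\\w+)\\s*([:<>])\\s*(.+)");

    public static List<String[]> parse(String query) {
        List<String[]> bindings = new ArrayList<>();
        Matcher clauses = CLAUSE.matcher(query);
        while (clauses.find()) {
            String clause = clauses.group(1) != null ? clauses.group(1) : clauses.group(2);
            Matcher b = BINDING.matcher(clause);
            if (b.matches()) {
                bindings.add(new String[] {b.group(1), b.group(2), b.group(3)});
            } else {
                bindings.add(new String[] {"text", "contains", clause}); // plain keyword
            }
        }
        return bindings;
    }

    public static void main(String[] args) {
        for (String[] b : parse("(type:pop singers) (age<20)")) {
            System.out.println(b[0] + " " + b[1] + " " + b[2]);
        }
        // type : pop singers
        // age < 20
    }
}
```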

So how does a search engine gather or obtain the conceptual knowledge required to make this all possible? The coverage needs to be sufficient to attempt to serve context for every possible query a user can make. That conceptual knowledge can be obtained: it exists in the Semantic Web, as RDF, and it grows by the minute. Armed with that conceptual knowledge, we can move beyond looking for a needle in a haystack and become entity centric, serving new types of information retrieval tasks that at present cannot be fulfilled. This is why we see Sehrch.com as the next generation of web search engine.

Caveat: Semantic Web people are already aware of the vast amounts of structured data within the LOD cloud. They can already harness it within applications and query it across endpoints on the web or within local triplestores (purpose-built data stores for RDF). Triplestores usually accept queries in SPARQL, a query language for RDF. Some Semantic Web people may argue against the need for an expanded search query syntax when SPARQL exists. Our answer is that search and SPARQL are different things; they are tools for different jobs conducted by different types of users. One is used for matching graph patterns and traversing links, the other for free-form keyword queries that enable information retrieval tasks.

The Implementation

Sehrch.com's implementation has focused on speed and scalability. We were well aware that if we could not achieve response times comparable to the search leaders from day one, it was game over. Users rightly expect fast response times when conducting searches. We also need to be able to scale with the LOD cloud. The conceptual awareness gathered from the LOD cloud must be retrievable fast enough to focus our searcher during query time.

The initial implementation task was then divided into three problems:

  1. The data storage problem
  2. The search problem
  3. The user interface problem

The main concerns when addressing each of these problems were scaling and speed. Not finding appropriate solutions to any one of these problems could have stopped Sehrch.com from progressing further than the drawing board. While we will provide some high level architecture, this post focuses on the data storage problem and why we use Hypertable rather than a purpose built RDF store to serve our storage needs.

Storing billions of triples

Sehrch.com gathers RDF from the web. We crawl the LOD cloud, making requests with headers accepting RDF N-Triples (and its derivatives) and ignoring any other responses. Data dumps are also obtained from various sources. All incoming RDF is parsed and the context (graph) is added for provenance, producing N-Quads. The N-Quads then need to be stored. So where do we store this data? And what does this data look like?
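The sketch below shows the shape of that step: request a resource with an Accept header asking for N-Triples, then append the source graph to each triple line to form an N-Quad. It is a minimal illustration using plain HttpURLConnection; the real crawler naturally deals with redirects, politeness, content negotiation failures and malformed data.

```java
import java.io.BufferedReader;
import java.io.InputStreamReader;
import java.net.HttpURLConnection;
import java.net.URL;
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;

// Illustrative sketch: fetch a resource asking for N-Triples, then append the
// source graph as a fourth component so each triple becomes an N-Quad.
public class QuadFetcher {

    public static List<String> fetchAsQuads(String resourceUri, String graphUri) throws Exception {
        HttpURLConnection conn = (HttpURLConnection) new URL(resourceUri).openConnection();
        // Ask for N-Triples; non-RDF responses would simply be discarded.
        conn.setRequestProperty("Accept", "application/n-triples, text/plain");
        List<String> quads = new ArrayList<>();
        try (BufferedReader in = new BufferedReader(
                new InputStreamReader(conn.getInputStream(), StandardCharsets.UTF_8))) {
            String line;
            while ((line = in.readLine()) != null) {
                line = line.trim();
                if (line.isEmpty() || line.startsWith("#") || !line.endsWith(".")) {
                    continue; // skip comments and anything that is not a triple line
                }
                // Insert the provenance context before the terminating dot.
                quads.add(line.substring(0, line.length() - 1).trim()
                        + " <" + graphUri + "> .");
            }
        }
        return quads;
    }
}
```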

To provide some context to readers unfamiliar with RDF from the LOD cloud, below are two triples (a triple is a three-part tuple) in N-Triples format:

Subject Predicate Object
<http://dbpedia.org/resource/Michael_Jackson> <http://dbpedia.org/ontology/birthDate> “1958-08-29”^^<http://www.w3.org/2001/XMLSchema#date>
<http://dbpedia.org/resource/Michael_Jackson> <http://www.w3.org/1999/02/22-rdf-syntax-ns#type> <http://dbpedia.org/class/yago/AmericanPopSingers>

Triples have a subject, predicate and object. The first triple contains a literal object, Michael Jackson's date of birth, which has an XSD date datatype. The second triple contains a resource object. The object component of a triple can contain either a literal, where the graph comes to an end, or a resource, which can be followed. A property is described as a datatype property when the expected object is a literal, and as an object property when the expected object is a resource. Triples may also contain blank nodes, where a triple refers to an anonymous resource that has no identifying URI. RDF can be serialized in a variety of syntaxes, e.g. XML, N-Triples, Turtle, N3, etc. Triplestores are the native, purpose-built data stores for RDF.
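For illustration, the sketch below splits a single N-Triples line into its three components and classifies the object as a resource or a literal, the distinction that matters later in our data model. A production system would use a proper RDF parser; this only mirrors the structure described above.

```java
// Illustrative sketch: split a single N-Triples line into subject, predicate
// and object, and decide whether the object is a resource or a literal.
// Real N-Triples parsing (escapes, blank nodes, malformed input) is best left
// to an RDF toolkit; this only mirrors the distinction described above.
public class TripleLine {

    public final String subject;
    public final String predicate;
    public final String object;
    public final boolean objectIsResource;

    public TripleLine(String line) {
        String body = line.trim();
        if (body.endsWith(".")) {
            body = body.substring(0, body.length() - 1).trim();
        }
        // Subject and predicate are whitespace-free terms; the object is the rest.
        int firstSpace = body.indexOf(' ');
        int secondSpace = body.indexOf(' ', firstSpace + 1);
        this.subject = body.substring(0, firstSpace);
        this.predicate = body.substring(firstSpace + 1, secondSpace);
        this.object = body.substring(secondSpace + 1).trim();
        // A resource object is written as <...>; anything else here is a literal
        // (possibly carrying a datatype such as ^^xsd:date) or a blank node.
        this.objectIsResource = object.startsWith("<");
    }

    public static void main(String[] args) {
        TripleLine t = new TripleLine(
            "<http://dbpedia.org/resource/Michael_Jackson> "
            + "<http://dbpedia.org/ontology/birthDate> "
            + "\"1958-08-29\"^^<http://www.w3.org/2001/XMLSchema#date> .");
        System.out.println(t.predicate + " -> " + t.object
            + " (resource? " + t.objectIsResource + ")");
    }
}
```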

To determine which data store would meet our performance requirements, we set a challenge while building our initial prototype. DBpedia and Freebase are the two datasets that appear at the center of the LOD cloud. They are both known to be two of the larger datasets, and are therefore challenging to host and query. It is common to come across posts on various mailing lists from users attempting to load one or both of these datasets, asking for help after experiencing days and nights of unresponsiveness at load or query time. The usual response is to prescribe more hardware. In early 2011, the minimum requirements (across various triplestore implementations) to host these datasets on a single node were reported on mailing lists as 8 cores, 16GB of memory and fast disks. Storing the data was half of the problem, as we also required responsive querying under load.

We then devised the challenge details. Rather than building the prototype on cutting edge hardware, we limited ourselves to using 5% of the projected resources our eventual production machines would provide. The challenge was set: given certain hardware, store DBpedia 3.6 (~630 million triples, all languages) and Freebase (~700 million triples, January 2011) in their entirety and serve a query throughput of 1,000 queries per second. The hardware? A dual core machine (Intel Core2Duo 2.4 GHz E6600) with 3GB memory and a 200GB 5400rpm disk. While the mailing lists flamed about needing 4 times the cores and 5 times the memory to make use of those particular datasets, we started work with what we had. If we could store and query ~1.3 billion triples, 500GB of uncompressed RDF, on hardware that was commodity in 2009, then we stood a chance of achieving the responsiveness that would make us a viable alternative for search.

We started the task by obtaining the datasets. Uncompressed, the RDF is 500GB in size. We then attempted to store that data in various triplestore implementations. The implementations all follow common standards, for instance by exposing SPARQL, a language for querying RDF. We derived the shortlist below of triplestore implementations that were commonly used within the Semantic Web community for hosting the challenge datasets at the time (January 2011):

Triplestore                  | License     | Scale-out        | Creators
TDB                          | Open source | No (single node) | HP Labs
4store                       | Open source | Yes              | Garlik
Virtuoso Open Source Edition | Open source | No (single node) | OpenLink
Virtuoso                     | Proprietary | Yes              | OpenLink
OWLIM                        | Proprietary | Yes              | Ontotext
5store                       | Proprietary | Yes              | Garlik
AllegroGraph                 | Proprietary | Yes              | Franz Inc

Triplestores that had not yet reached a stable release were not considered, such as Bigdata, which had not reached version 1.0 at the time of our initial prototype. Our shortlist highlighted 4store as the only open source scale-out triplestore. TDB and Virtuoso (Open Source Edition) were mature implementations but lacked scale-out capabilities. Due to the limited choice, we decided not to preclude them, hoping to play to their advantages and possibly experiment with federation strategies should they pass the challenge. Proprietary triplestores like Virtuoso, OWLIM, 5store and AllegroGraph, on the other hand, do scale out, but it was not feasible for us to use proprietary software.

We began by attempting the challenge with 4store, at the time the only scale-out open source triplestore, but very quickly ran into problems. While 4store can scale to store tens of billions of triples, its indexing strategy prevents it from serving datasets with more than a few hundred properties. This may be sufficient for enterprises with well-controlled, in-house datasets, but not for data in the LOD cloud: DBpedia 3.6 (all languages) contains tens of thousands of properties. 4store appears to create index files per property, so at query time many files are opened and soft limit issues arise very quickly. This caused us to revise our requirements: the data store must not impose any limit on the number of properties.

After 4store we attempted to load the data into TDB. TDB emerged from the Jena work of the Semantic Web research group at HP Labs in Bristol, UK. The group disbanded in late 2009 and Jena is at present an Apache project in incubation. TDB is an RDF data store and its creator is also an editor of the SPARQL 1.0 and 1.1 standards; TDB therefore provides one of the most comprehensive implementations of the SPARQL query language. Also, unlike 4store, TDB has a very comprehensive indexing strategy. Whereas 4store creates property-based files, TDB uses composite indexes (B+ trees) over the variations SPO, POS, OSP, GSPO, GPOS, GOSP, POSG, OSPG and SPOG (nine in total). As its indexes are not property based, it does not encounter the same soft limit issues as 4store when hosting datasets with a large number of properties.

TDB’s comprehensive indexing makes it suitable for matching the various forms of graph patterns that you can template in SPARQL, and its implementation of the SPARQL standards is the most comprehensive. However, for specific applications this may provide more query-time flexibility than is needed, at the expense of performance. Further, because of the comprehensive indexing, the indexes grow very quickly, making it less suitable for large datasets. We were not able to complete the challenge with TDB, as loading would grind to a halt and the loader would eventually exit. It was obvious that our hardware could not serve TDB’s comprehensive indexing strategy. Reviewing mailing list discussions shows that users attempting to load either of our two datasets would usually offer TDB at least four times the hardware. We therefore discounted TDB as too resource hungry: its comprehensive indexing strategy provides query-time flexibility that we did not need, paid for in performance.

Virtuoso is an enterprise grade triplestore well known within the Semantic Web community. Its highest profile user within the community is most likely the DBpedia project, for which it hosts the dataset, a public SPARQL endpoint and the linked data browser. Virtuoso is a proprietary triplestore, but there is also an open source edition which removes the clustering and replication features, making it a single-node store. It is a very mature product and the first port of call for most Semantic Web users setting up their own local DBpedia mirrors; Virtuoso's online documentation provides step by step instructions to guide novice users embarking on such a task. Our attempt to load the challenge datasets into Virtuoso Open Source Edition 6.1.2 triggered the most conversation with the triplestore developer.

We began by following the instructions provided in the Virtuoso documentation, which is very comprehensive, at almost 4,000 pages in length. Bulk loading of the dataset always began well, but we were never able to load more than ~80 million triples before encountering failure. We attempted the load using both dedicated hardware and virtual machines that matched the challenge specification, to remove the possibility of any hardware faults. Once the index files grew to a certain size (in our case 8GB), loading would stall and then fail. This generated a fair amount of discussion on the Virtuoso user forum, with us reporting back many details so the developers could investigate. In the end we concluded that, again, we could not index the datasets because the indexing strategy was more comprehensive than our challenge hardware could serve.

Attempting to load the challenge datasets into three triplestore implementations, we found that one failed due to insufficient indexing (4store, whose indexes could not scale with the number of properties), and two failed due to comprehensive indexing, mostly in service of the flexibility of the SPARQL 1.0 and 1.1 standards (TDB, Virtuoso). We then had to question whether triplestores were the right type of data store for the volumes of data that we envision. Do we require all of the flexibility that the SPARQL standard provides? Is that query-time flexibility worth sacrificing performance for? Do we need reasoning and other capabilities specific to triplestores? We answered no to all of these questions. Instead, we needed guarantees on speed and scalability, and given our resources we determined that no open source triplestore implementation could provide those guarantees.

Enter Hypertable

After discovering that the open source SPARQL compliant triplestores were inappropriate for storing RDF data for our use case, we began investigating HBase. We knew that HBase would be an appropriate store for our data due to its column-oriented nature and because it was modeled upon Bigtable, so scalability was guaranteed. We also came across Hypertable, and then we read the Hypertable vs HBase results republished by Baidu. We downloaded Hypertable, set it up (single node with DfsBroker) and found it very easy to work with. The command line interface enabled us to quickly load and query some data. We then began thinking about how we could store RDF in Hypertable while providing effective query capabilities for our use case. We quickly produced a simple model for a table, for simplicity named "rdf", with columns "res" and "lit":

RDF component    | Hypertable component
triple subject   | row key
triple predicate | column qualifier
triple object    | cell value

Yes, it really is that simple! Because RDF is triples, the hard work of breaking data down to its atomic form has already been done, which makes it highly suited to column-oriented stores. In order to introduce some immediate efficiency, we created separate columns for datatype properties and object properties (which can be determined by examining the object: either literal or resource). The column qualifiers, which are unlimited in number and can be generated on the fly, hold the triple predicates. Once we had a model, we wrote a multi-threaded RDF importer in Java that would load the data into Hypertable via the Thrift API. We then began to load data into Hypertable and were staggered by the results.
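The sketch below illustrates that mapping: subject to row key, predicate to column qualifier, object to cell value, with the column family chosen by object kind. The CellWriter interface merely stands in for Hypertable's Thrift mutator so the example stays self-contained; it is not the real client API, and batching, threading and error handling are omitted.

```java
// Illustrative sketch of the data model described above: subject -> row key,
// predicate -> column qualifier, object -> cell value, with the column family
// chosen by object kind ("res" for resources, "lit" for literals). CellWriter
// stands in for Hypertable's Thrift mutator; it is not the real client API.
public class RdfCellMapper {

    public interface CellWriter {
        void write(String rowKey, String columnFamily, String qualifier, String value);
    }

    // subject, predicate, object are still in their N-Triples form.
    public static void storeTriple(String subject, String predicate, String object,
                                   CellWriter writer) {
        // Resource objects are written <...>; everything else is treated as a literal.
        String family = object.startsWith("<") ? "res" : "lit";
        writer.write(subject, family, predicate, object);
    }

    public static void main(String[] args) {
        // Print cells instead of talking to Hypertable, just to show the shape.
        CellWriter printer = (row, family, qualifier, value) ->
                System.out.println(row + "  " + family + ":" + qualifier + " = " + value);
        storeTriple("<http://dbpedia.org/resource/Michael_Jackson>",
                    "<http://dbpedia.org/ontology/birthDate>",
                    "\"1958-08-29\"^^<http://www.w3.org/2001/XMLSchema#date>",
                    printer);
        storeTriple("<http://dbpedia.org/resource/Michael_Jackson>",
                    "<http://www.w3.org/1999/02/22-rdf-syntax-ns#type>",
                    "<http://dbpedia.org/class/yago/AmericanPopSingers>",
                    printer);
    }
}
```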

We achieved a stable loading throughput of 22,000 triples per second (tps), peaking at 30,000 tps. Within 24 hours we had loaded the full 1.3 billion triples on a single node, on hardware that was at least two years out of date. We were shocked, mostly because on the same hardware the SPARQL compliant triplestores had managed at most ~80 million triples (Virtuoso), and Hypertable had just loaded all 1.3 billion. The 500GB of input RDF had become a very efficient 50GB of Hypertable data files. But loading was only half of the problem: could we query? We wrote a multi-threaded data exporter that would query Hypertable for entities by subject (Hypertable row key) at random. We ran the exporter and achieved speeds that peaked at 1,800 queries per second. Again we were shocked. Now that the data the challenge had set forth was loaded, we wondered how far Hypertable could go on the same hardware.

So we reloaded the data, this time appending 1 to the row keys. Hypertable completed the load again, in approximately the same time. So we ran it again, now appending 2 to the keys. Hypertable completed again, again in the same time frame. We now had a machine, at only 5% of our eventual production specification, that stored 3.6 billion triples: three copies each of DBpedia and Freebase. We reran our data exporter and achieved query speeds that ranged between 1,000 and 1,300 queries per second. From that day on we have never looked back. Hypertable solved our data storage problem; it smashed the challenge we had set forth to determine whether Sehrch.com was at all possible. Hypertable made it possible.

But what do we lose by not using triplestores, the stores that were purpose built for storing RDF? Triplestores may be the native data stores for RDF, but due to the comprehensive indexing required to serve SPARQL, they cannot host data as efficiently as stores modeled on Bigtable. While we lose much flexibility at query time, we must question how much of that flexibility is actually required by the application. How much query-time flexibility are we willing to sacrifice for the sake of performance? How much is enough for our application to perform at its bleeding edge? In our case, we sacrificed all of it. We would start bottom-up and specify everything that we needed, rather than work top-down and end up with flexibility that we never utilise but pay for in performance and scalability.

With our simple Hypertable data model, we can quickly map some very common SPARQL queries to Hypertable Query Language (HQL). Here are some examples:

SPARQL: SELECT * WHERE { <http://dbpedia.org/resource/Michael_Jackson> ?p ?o }
HQL:    SELECT * FROM rdf WHERE ROW="<http://dbpedia.org/resource/Michael_Jackson>";

SPARQL: SELECT * WHERE { <http://dbpedia.org/resource/Michael_Jackson> <http://dbpedia.org/ontology/birthDate> ?o }
HQL:    SELECT lit:"<http://dbpedia.org/ontology/birthDate>" FROM rdf WHERE ROW="<http://dbpedia.org/resource/Michael_Jackson>";

SPARQL: SELECT * WHERE { <http://dbpedia.org/resource/Michael_Jackson> <http://www.w3.org/1999/02/22-rdf-syntax-ns#type> ?o } LIMIT 5
HQL:    SELECT res:"<http://www.w3.org/1999/02/22-rdf-syntax-ns#type>" FROM rdf WHERE ROW="<http://dbpedia.org/resource/Michael_Jackson>" REVS=5;

SPARQL: SELECT * WHERE { <http://dbpedia.org/resource/Michael_Jackson> ?p ?o FILTER isLiteral(?o) }
HQL:    SELECT lit FROM rdf WHERE ROW="<http://dbpedia.org/resource/Michael_Jackson>";

SPARQL: SELECT * WHERE { <http://dbpedia.org/resource/Michael_Jackson> ?p ?o FILTER isURI(?o) }
HQL:    SELECT res FROM rdf WHERE ROW="<http://dbpedia.org/resource/Michael_Jackson>";

SPARQL: SELECT ?type ?o WHERE { <http://dbpedia.org/resource/Michael_Jackson> <http://dbpedia.org/ontology/birthDate> ?o . <http://dbpedia.org/resource/Michael_Jackson> <http://www.w3.org/1999/02/22-rdf-syntax-ns#type> ?type . }
HQL:    SELECT res:"<http://www.w3.org/1999/02/22-rdf-syntax-ns#type>", lit:"<http://dbpedia.org/ontology/birthDate>" FROM rdf WHERE ROW="<http://dbpedia.org/resource/Michael_Jackson>";

Other triplestore-specific features that we appear to lose include reasoning. 4store, the only open source scale-out triplestore at the time, did not provide any reasoning functionality at all; so not all triplestores offer reasoning, and the benefit must be weighed against the overhead. Other triplestores, for instance Virtuoso and AllegroGraph, implement various portions of RDFS and OWL, providing limited reasoning capacity. The danger with reasoning is that there are almost no guarantees once the data grows beyond a few million triples. While reasoning over particular properties seemed to be a benefit of using a triplestore, the overhead introduced at query time carries performance implications and risks not worth taking. After some thought, we realised that if you know your data models well enough (say for particular datasets or properties), you can obtain many of the outputs of inferencing by simply processing the data, and Hypertable offers complete Hadoop MapReduce compatibility to do just that.

By running MapReduce jobs we can preprocess common graph pattern queries that would otherwise run expensively at query time had we been using SPARQL and a triplestore. One example of a MapReduce job that we run on an ongoing basis is for inverse links across all of our data. These appear on many Sehrch.com results as related concepts. Inverse links are usually found by executing a SPARQL query with the target placed as the object; this finds all resources that link to the target:

SPARQL:                    SELECT * WHERE { ?s ?p <http://dbpedia.org/resource/Michael_Jackson> }
HQL (after MapReduce job): SELECT inverse FROM rdf WHERE ROW="<http://dbpedia.org/resource/Michael_Jackson>";

We can compute these inverses efficiently ahead of query time by running a MapReduce job that streams through all "res" columns of the table and inverts the resource values against the row key. The inverted links can then be persisted in a new column, for instance named "inverse", that can be accessed at query time. We use inverse links on Sehrch.com to discover related objects that display on the right hand side of result pages (with images). They are also used to highlight object descriptions and search snippets with wiki-style links to related concepts. Using MapReduce allows us to perform simple to complex graph pattern analysis with the results persisted in Hypertable, providing us with performance guarantees and users with rich object relationships and interactions.
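The sketch below shows the shape of such a job in Hadoop MapReduce. For simplicity it assumes the "res" cells have been dumped as tab-separated "rowKey<TAB>resourceValue" lines and writes plain text output, rather than reading from and writing back to Hypertable as our production job does.

```java
import java.io.IOException;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

// Illustrative sketch of the inverse-link job, over tab-separated text dumps
// of the "res" cells rather than Hypertable's Hadoop integration.
public class InverseLinksJob {

    public static class InverseMapper extends Mapper<LongWritable, Text, Text, Text> {
        @Override
        protected void map(LongWritable offset, Text line, Context ctx)
                throws IOException, InterruptedException {
            String[] parts = line.toString().split("\t", 2);
            if (parts.length == 2) {
                // Emit (object, subject): the linked-to resource becomes the key.
                ctx.write(new Text(parts[1]), new Text(parts[0]));
            }
        }
    }

    public static class InverseReducer extends Reducer<Text, Text, Text, Text> {
        @Override
        protected void reduce(Text target, Iterable<Text> sources, Context ctx)
                throws IOException, InterruptedException {
            StringBuilder linkers = new StringBuilder();
            for (Text s : sources) {
                if (linkers.length() > 0) linkers.append(' ');
                linkers.append(s);
            }
            // One line per target resource, listing everything that links to it.
            ctx.write(target, new Text(linkers.toString()));
        }
    }

    public static void main(String[] args) throws Exception {
        Job job = Job.getInstance(new Configuration(), "inverse-links");
        job.setJarByClass(InverseLinksJob.class);
        job.setMapperClass(InverseMapper.class);
        job.setReducerClass(InverseReducer.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(Text.class);
        FileInputFormat.addInputPath(job, new Path(args[0]));
        FileOutputFormat.setOutputPath(job, new Path(args[1]));
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}
```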

Conclusion

Every component of Sehrch.com was implemented with a strong focus on speed and scalability. We used a combination of amazing open source technologies to build Sehrch.com, and Hypertable addressed one of the three critical problems that stood between Sehrch.com and reality. Since adopting Hypertable we have never looked back. While this post only describes our initial interaction with it at the prototype stage, long ago and at a much smaller scale than we use it at now, we hope it triggers the imagination of others in the Semantic Web community to embrace it and deliver high performance, scalable systems worthy of real world users. Not using native data stores and looking towards big data technology may not be the purist approach for people in the community, but should the Semantic Web community want to embrace real scale, the technology exists and it is Hypertable.

As for the future of search, we hope this post leads users to ask more from the search leaders. The Semantic Web exists, structured data exists on the web, and the main search players could be doing much more. Whether in the ways we have demonstrated, by expanding search query syntax and becoming entity centric, or in other ways, users must begin demanding more meaningful results. Should the incumbents not deliver, whether because of a generational shift or because their business models depend upon amassing sense from the meaningless, then, privacy issues and other sensitivities aside, an opportunity exists now more than ever for new entrants to make a technological contribution and progress information retrieval on the Web.