Feeds
Tech Semantic Web and Linked Data
(590 unread)
-
OpenLink Community Blog (88 unread)
-
Z-Blog
(242 unread) -
Nodalities
(47 unread) -
EN - Flux RSS - R & D (11 unread)
-
The Semantic Puzzle
(68 unread) -
OpenCalais - Official Blog
(6 unread) -
Chief Marketing Technologist (124 unread)
-
Semai
(1 unread) -
Reactive, autonomous (3 unread)
Tech Coding the Web and Software
(156 unread)
-
Software Cooperative News (69 unread)
-
Talance Friendly Web Tools Blog (87 unread)
Tech General News
(14447 unread)
-
Tech Eye - Latest technology headlines (4200 unread)
-
BBC News - Technology (4590 unread)
-
NYT > Technology (5657 unread)
Knowledge Man and Eng
(310 unread)
-
ISKO UK (56 unread)
-
KOnnect
(10 unread) -
CELSTEC Publications
(216 unread) -
Knowledge Engineering (19 unread)
-
Open Intelligence
(9 unread)
Friends
(311 unread)
-
VISION AFORETHOUGHT
(82 unread) -
Snell-Pym
(229 unread)
Newspapers
(27293 unread)
-
The Guardian World News (10891 unread)
-
The Independent - Frontpage RSS Feed
(16402 unread)
Politics UK and Ireland
(1186 unread)
-
Liberal Democrats RSS (481 unread)
-
Green Liberal Democrats News Stories
(100 unread) -
Liberal Democrat Christian Forum
(5 unread) -
Liberal Youth - Latest News
-
The Alliance Party of Northern Ireland News Stories
(528 unread) -
Home
(72 unread)
Politics EU and International
(825 unread)
-
European Movement UK (38 unread)
-
European Movement Ireland
(89 unread) -
OSCE press releases and media advisories (376 unread)
-
ALDE News
(39 unread) -
ELDR News
(283 unread) -
IFLRY News and Updates
Religion Christian
(2234 unread)
-
Church of England News (259 unread)
-
Latest News
(725 unread) -
Open Path
(19 unread) -
Affirming Liberalism
(11 unread) -
Greenbelt Blog (464 unread)
-
Fresh Expressions RSS feed (442 unread)
-
Emergent Village
(56 unread) -
Taizé (258 unread)
Religion Interfaith and Universalism
(496 unread)
-
Interfaith (230 unread)
-
IDC Interfaith Dialog Center
(139 unread) -
Inter-Religious Dialogue
(127 unread)
OpenLink Community Blog
-
14:55
IEEE publication of ?Virtuoso, a Hybrid RDBMS/Graph Column Store?
» OpenLink Community BlogMy article, Virtuoso, a Hybrid RDBMS/Graph Column Store (PDF), can be found in Volume 35, Number 1, March 2012 (PDF) of the Bulletin of the IEEE Computer Society Technical Committee on Data Engineering (also known as the IEEE Data Engineering Bulletin).
Abstract:
We discuss applying column store techniques to both graph (RDF) and relational data for mixed workloads ranging from lookup to analytics in the context of the OpenLink Virtuoso DBMS. In so doing, we need to obtain the excellent memory efficiency, locality and bulk read throughput that are the hallmark of column stores while retaining low-latency random reads and updates, under serializable isolation.
DBLP BibTeX Record 'journals/debu/Erling12' (XML)
@article{DBLP:journals/debu/Erling12, author = {Orri Erling}, title = {Virtuoso, a Hybrid RDBMS/Graph Column Store}, journal = {IEEE Data Eng. Bull.}, volume = {35}, number = {1}, year = {2012}, pages = {3-8}, ee = [sites.computer.org] bibsource = {DBLP, [dblp.uni-trier.de}] } -
14:55
IEEE publication of ?Virtuoso, a Hybrid RDBMS/Graph Column Store?
» OpenLink Community BlogMy article, Virtuoso, a Hybrid RDBMS/Graph Column Store (PDF), can be found in Volume 35, Number 1, March 2012 (PDF) of the Bulletin of the IEEE Computer Society Technical Committee on Data Engineering (also known as the IEEE Data Engineering Bulletin).
Abstract:
We discuss applying column store techniques to both graph (RDF) and relational data for mixed workloads ranging from lookup to analytics in the context of the OpenLink Virtuoso DBMS. In so doing, we need to obtain the excellent memory efficiency, locality and bulk read throughput that are the hallmark of column stores while retaining low-latency random reads and updates, under serializable isolation.
DBLP BibTeX Record 'journals/debu/Erling12' (XML)
@article{DBLP:journals/debu/Erling12, author = {Orri Erling}, title = {Virtuoso, a Hybrid RDBMS/Graph Column Store}, journal = {IEEE Data Eng. Bull.}, volume = {35}, number = {1}, year = {2012}, pages = {3-8}, ee = [sites.computer.org] bibsource = {DBLP, [dblp.uni-trier.de}] } -
19:38
ICDE 2012 (post 6 of 6) - Science Data Panel
» OpenLink Community BlogMichael Stonebraker chaired a panel on the future of science data at ICDE 2012 last week. Other participants were Jeremy Kepner from MIT Lincoln Labs, Anastasia Ailamaki from EPFL, and Alex Szalay from Johns Hopkins University.
This is the thrust of what was said, noted from memory. My comments follow after the synopsis.
Jeremy Kepner: When Java was new we saw it as the coming thing and figured that in HPC we should find space for this. When MapReduce and Hadoop came along, we saw this as a sea change in parallel programming models. This was so simple literally anybody could make parallel algorithms whereas this was not so with MPI. Even parallel distributed arrays are harder. So MapReduce was a game changer, together with the cloud where anybody can get a cluster. Hardly a week passes without me having to explain to somebody in government what MapReduce and Hadoop are about.We have a lot of arrays and a custom database for them. But the arrays are sparse so this is in fact a triple store. Our users like to work in MATLAB, and any data management must run together with that.
Of course, MapReduce is not a real scheduler, and Hadoop is not a real file system. For deployment, we must integrate real schedulers and make HDFS look like a file system to applications. The abstraction of a file system is something people like. Being able to skip a time-consuming data-ingestion process with a database is an advantage with file-based paradigms like Hadoop. If this is enhanced with the right scheduling features, this can be a good component in the HPC toolbox.
Michael Stonebraker: Users of the data use math packages like R, MATLAB, SAS, SPSS, or similar. If business intelligence is about AVG, MIN, MAX, COUNT, and GROUP BY, science applications are much more diverse in their analytics. All science algorithms have an inner loop that resembles linear algebra operations like matrix multiplication. Data is more often than not a large array. There are some graphs in biology and chemistry, but the world is primarily rectangular. Relational databases can emulate sparse arrays but are 20x slower than a custom-made array database for dense arrays. And I will not finish without picking on MapReduce: I know of 2000-node MapReduce clusters. The work they do is maybe that of a 100-node parallel database. So if 2000 nodes is what you want to operate, be my guest.
Science database is a zero billion dollar business. We do not expect to make money from the science market with SciDB, which by now works and has commercial services supplied by Paradigm 4, while the code itself is open source, which is a must for the science community. The real business opportunity is in the analytics needed by insurance and financial services in general, which are next to identical with the science use cases SciDB tackles. This makes the vendors pay attention.
Alex Szalay: The way astronomy is done today is through surveys: a telescope scans through the sky and produces data. We have now for 10 years operated the Sloane Sky Survey and kept the data online. We have all the data, and complete query logs, available for anyone interested. When we set out to do this with Jim Gray, everybody found this a crazy idea, but it has worked out.
Anastasia Ailamaki: We do not use SciDB. We find a lot of spatial use cases. Researchers need access to simulation results which are usually over a spatial model, like in earthquake simulations and the brain. Off-the-shelf techniques like R trees do not work -- the objects overlap too much -- so we have made our own spatial indexing. We make custom software when it is necessary, and are not tied to vendors. In geospatial applications, we can create meshes of different shapes -- like tetrahedral or cubes for earthquakes, and cylinders for the brain -- and index these in a geospatial index. But since an R tree is inefficient when objects overlap too much, as these do, we just find one; and then because there is reachability from an object to neighboring ones, we use this to get all the objects in the area of interest.
* * *
This is obviously a diverse field. Probably the message that we can synthesize out of this is that flexibility and parallel programming models are what we need to pay attention to. There is a need to go beyond what one can do in SQL while continuing to stay close to the data. Also, allowing for plug-in data types and index structures may be useful; we sometimes get requests for such anyway.
The continuing argument around MapReduce and Hadoop is a lasting feature of the landscape. A parallel DB will beat MapReduce any day at joining across partitions; the problem is to overcome the mindset that sees Hadoop as the always-first answer to anything parallel. People will likely have to fail with this before they do anything else. For us, the matter is about having database-resident logic for extract-transform-load (ETL) that can do data-integration type-transformations and maybe iterative graph algorithms that constantly join across partitions, better than a MapReduce job, while still allowing application logic to be written in Java. Teaching sem-web-heads to write SQL procedures and to know about join order, join type, and partition locality, has proven to be difficult. People do not understand latency, whether in client-server or cluster settings. This is why they do not see the point of stored procedures or of shipping functions to data. This sounds like a terrible indictment, like saying that people do not understand why rivers flow downhill. Yet, it is true. This is also why MapReduce is maybe the only parallel programming paradigm that can be successfully deployed in the absence of this understanding, since it is actually quite latency-tolerant, not having any synchronous cross-partition operations except for the succession of the map and reduce steps themselves.
Maybe it is so that the database guys see MapReduce as an insult to their intelligence and the rest of the world sees it as the only understandable way of running grep and sed (Unix commands for string search/replace) in parallel, with the super bonus of letting you reshuffle the outputs so that you can compare everything to everything else, which grep alone never let you do.
* * *
Making a database that does not need data loading seems a nice idea, and CWI has actually done something in this direction in "Here are my Data Files. Here are my Queries. Where are my Results?"] However, there is another product called Algebra Data that claims to take in data without loading and to optimize storage based on access. We do not have immediate plans in this direction. Bulk load is already quite fast (take 100G TPC-H in 70 minutes or so), but faster is always possible.
-
19:38
ICDE 2012 (post 5 of 6) - Graphs
» OpenLink Community BlogThere were quite a few talks about graphs at ICDE 2012. Neither the representations of graphs, nor the differences between RDF and generic graph models, entered much into the discussion. On the other hand, graph similarity searches and related were addressed a fair bit.
Graph DB and RDF/Linked Data are distinct, if neighboring disciplines. On one hand, graph problems predate Linked Data, and the RDF/Linked Data world is a web artifact, which graphs are not as such, so a slightly different cultural derivation also makes these disjoint. Besides, graphs may imply schema first whereas linked data basically cannot. Then another differentiation might be derived from edges not really being first class citizens in RDF, except for reification, at which the RDF reification vocabulary is miserably inadequate, as pointed out before.
RDF is being driven by the web-style publishing of Linked Open Data (LOD), with some standardization and uptake by publishers; Graph DB is not standardized but driven by diverse graph-analytics use cases.
There is no necessary reason why these could not converge, but it will be indefinitely long before any standards come to cover this, so best not hold one's breath. Communities are jealous of their borders, so if the neighbor does something similar one tends to emphasize the differences and not the commonalities.
So for some things, one could warehouse the original RDF of the web microformats and LOD, and then ETL into some other graph model for specific tasks, or just do these in RDF. Of course, then RDF systems need to offer suitable capabilities. These seem to be about very fast edge traversal within a rather local working set, and about accommodating large, iteratively-updated intermediate results, e.g., edge weights.
Judging by the benchmarks paper (Benchmarking traversal operations over graph databases (Slidedeck (ppt), paper (pdf)); Marek Ciglan, Alex Averbuch, and Ladialav Hluchy.) at the GDM workshop, the state of benchmarking in graph databases is even worse than in RDF, where the state is bad enough. The paper's premise was flawed to start, using application logic to do JOINs instead of doing them in the DBMS. In this way, latency comes to dominate, and only the most blatant differences are seen. There is nothing like this style of benchmarking to make an industry look bad. The supercomputer Graph 500 benchmark, on the other hand, lets the contestants make their own implementations on a diversity of architectures with random traversal as well as loading and generating large intermediate results. It is somewhat limited, but still broader than the the graph database benchmarks paper at the GDM workshop.
Returning to graphs, there were some papers on similarity search and clique detection. As players in this space, beyond just RDF, we might as well consider implementing necessary features for efficient expression of such problems. The algorithms discussed were expressed in procedural code against memory-based data structures; there is usually no query language or parallel/distributed processing involved.
MapReduce has become the default way in which people would tackle such problems at scale; in fact, people do not consider anything else, as far as I can tell. Well, they certainly do not consider MPI for example as a first choice. The parallel array things in Fortran do not at first sight seem very graphy, so this is likely not something that crosses one's mind either.
We should try some of the similarity search and clustering in SQL with a parallel programming model. We have excellent expression-evaluation speed from vectoring and unrestricted recursion between partitions, and no file system latencies like MapReduce. The initial test case will be some of the linking/data-integration/mapping workloads in LOD2.
Having some sort-of-agreed-upon benchmark for these workloads would make this more worthwhile. Again, we will see what emerges.
-
19:38
ICDE 2012 (post 4 of 6) - Graph Data Management Workshop
» OpenLink Community BlogI gave an invited talk ("Virtuoso 7 - Column Store and Adaptive Techniques for Graph" (Slides (ppt))) at the Graph Data Management Workshop at ICDE 2012.
Bryan Thompson of Systap (Bigdata® RDF store) was also invited, so we got to talk about our common interests. He told me about two cool things they have recently done, namely introducing tables to SPARQL, and adding a way of reifying statements that does not rely on extra columns. The table business is just about being able to store a multicolumn result set into a named persistent entity for subsequent processing. But this amounts to a SQL table, so the relational model has been re-arrived at, once more, from practical considerations. The reification just packs all the fields of a triple (or quad) into a single string and this string is then used as an RDF S or O (Subject or Object), less frequently a P or G (Predicate or Graph). This works because Bigdata® has variable length fields in all columns of the triple/quad table. The query notation then accepts a function-looking thing in a triple pattern to mark reification. Nice. Virtuoso has a variable length column in only the O but could of course have one in also S and even in P and G. The column store would still compress the same as long as reified values did not occur. These values on the other hand would be unlikely to compress very well but run length and dictionary would always work.
So, we could do it like Bigdata®, or we could add a "quad ID" column to one of the indices, to give a reification ID to quads. Again no penalty in a column store, if you do not access the column. Or we could make an extra table of PSOG->R.
Yet another variation would be to make the SPOG concatenation a literal that is interned in the RDF literal table, and then used as any literal would be in the O, and as an IRI in a special range when occurring as S. The relative merits depend on how often something will be reified and on whether one wishes to SELECT based on parts of reification. Whichever the case may be, the idea of a function-looking placeholder for a reification is a nice one and we should make a compatible syntax if we do special provenance/reification support. The model in the RDF reification vocabulary is a non-starter and a thing to discredit the sem web for anyone from database.
I heard from Bryan that the new W3 RDF WG had declared provenance out of scope, unfortunately. The word on the street on the other hand is that provenance is increasingly found to be an issue. This is confirmed by the active work of the W3 Provenance Working Group.
-
19:38
ICDE 2012 (post 3 of 6) - What Is Timely LOD Search Worth?
» OpenLink Community BlogThere was a talk (Linked Data and Live Querying for Enabling Support Platforms for Web Dataspaces (Slides (PDF)); Jürgen Umbrich, Marcel Karnstedt, Josiane Xavier Parreira, Axel Polleres and Manfred Hauswirth) at the Data Engineering Meets the Semantic Web (DESWEB) workshop at ICDE last week about the problems of caching LOD, whether attempted by Sindice or OpenLink's LOD Cloud Cache. The conclusion was that OpenLink covered a bit more of the test data sets and that Sindice was maybe better up to date on the ones that it covered but that neither did it very well. The data sets were random graphs of user FOAF profiles and such collected from some Billion Triples Data set, thus not data that is likely to have commercial value, except in huge quantities maybe for some advertising, except that click streams and the like are much more valuable.
Being involved with at least one of these, and being in the audience, I felt obligated to comment. The fact is, neither OpenLink's LOD Cloud Cache nor Sindice is a business, and there is not a business model which could justify keeping them timely on the web crawls they contain. Doing so is easy enough, if there is a good enough reason.
The talk did make a couple of worthwhile points: The data does change; and if one queries entities, one encounters large variation in change-frequency across entities and their attributes.
The authors suggested to have a piece of middleware decide what things can be safely retrieved from a copy and what have to be retrieved from the source. Not too much is in fact known about the change frequency of the data, except that it changes, as the authors pointed out.
The crux of the matter is that the thing that ought to know this best is the query processor at the LOD warehouse. For client-side middleware to split the query, it needs access to statistics that it must get from the warehouse or keep by itself. Of course, in concrete application scenarios, you go to the source if you ask about the weather or traffic jams, and otherwise go to the warehouse based on application-level knowledge.
But for actual business intelligence, one needs histories, so a search engine with only the present is not so interesting. At any rate, refreshing the data should leave a trail of past states. Exposing this for online query would just triple the price, so we forget about that for now. Just keeping an append-only table of history is not too much of a problem. One may make extracts from this table into a relational form for specific business questions. There is no point doing such analytics in RDF itself. One would have to just try to see if there is anything remotely exploitable in such histories. Making a history table is easy enough. Maybe I will add one.
Let us now see what it would take to operate a web crawl cache that would be properly provisioned, kept fresh, and managed. We base this on the Sindice crawl sizes and our experiments on these; the non-web-crawl LOD Cloud Cache is not included.
From previous experience we know the sizing: 5Gt/144GB RAM. Today's best price point is on 24-DIMM E5 boards, so 192GB RAM, or 6.67Gt. A unit like that (8TB HDD, 0.5TB SSD, 192GB RAM, 12 core E5, InfiniBand) costs about $6800.
The Sindice crawl is now about 20Gt, so $28K of gear (768GB RAM) is enough. Let us count this 4 times: 2x for anticipated growth; and 2x for running two copies -- one for online, and one for batch jobs. This is 3TB RAM. Power is 16x500W = 8KW, which we could round to 80A at 110V. Colocation comes to $500 for the space, and $1200 per month for power; make it $2500 per month with traffic included.
At this rate, 3 year TCO is $120K + ( 36 * $2.5K ) = $210K. This takes one person half time to operate, so this is another $50K per year.
We do not count software development in this, except some scripting that should be included in the yearly $50K DBA bill.
Under what circumstances is such a thing profitable? Or can such a thing be seen as a marketing demo, to be paid for by license or service sales?
A third party can operate a system of this sort, but then the cost will be dominated by software licenses if running on Virtuoso cluster.
For comparison, the TB at EC2 costs ((( 16 * $2 ) * 24 ) * 31 ) = $24,808 per month. With reserved instances, it is ( 16 * ( $2192 + ((( 0.7 * 24 ) * 365 ) * 3 ))) / 36 = $8938 per month for a 3 year term. Counting at 3TB, the 3 year TCO is $965K at EC2. AWS has volume discounts but they start higher than this; ( 3 * ( 16 * $2K )) = $96K reserved host premium is under $250K. So if you do not even exceed their first volume discount threshold, it does not look likely you can cut a special deal with AWS.
(The AWS prices are calculated with the high memory instances, approximately 64GB usable RAM each. The slightly better CC2 instance is a bit more expensive.)
Yet another experiment to make is whether a system as outlined will even run at anywhere close to the performance of physical equipment. This is uncertain; clouds are not for speed, based on what we have seen. They make the most sense when the monthly bill is negligible in relation to the cost of a couple of days of human time.
-
19:37
ICDE 2012 (post 2 of 6) - LOD Column Store Experiences and Sizing
» OpenLink Community BlogWe have played around with LOD data sets and Virtuoso Column Store for the past several months. I will here give a few numbers and comment on some different platform comparisons that we have made. The answer at the end of this is how to size a system for often-changing web-style data. The conclusion is a data-to-RAM ratio that gives an acceptable working set without driving the price up by forcing 100% RAM residence.
The experiment is loading Sindice web crawls. The platform is 2 x Xeon 5520 and 144G RAM. The initial load rate is 200-180Kt, and drops to 100Kt at 5Gt because of I/O. The system is Virtuoso Column Store configured to run as 4 processes and 32 partitions, all on the same box. After 5Gt, we see just more I/O and going further is not relevant; one runs CPU-bound or not at all.
We use 4 Crucial SSDs in the setup. The hot structures like the RDF quad indices are on SSD, and the cold ones are on hard disk. A cold structure is a write-only index like the dictionary of literals (id to lit).
For bulk load, SSDs turn out not to be particularly useful. For a cold start on the other hand, SSDs cut warmup time of 144G RAM from over half an hour to a couple of minutes. It is possible that Intel SSDs would also help with bulk load, but this has not been tried. The SSD problem during bulk load is that these do not write very fast, and while there are writes in queue, read latency goes up; so under a constant write load, the SSD's famous instantaneous random read no longer works.
The fragment considered in the example is 4.95Gt: 8.1M pages worth of quads; 12.7M of literals and iris; and 4.71M of full text index. A page is 8KB. The files on disk contain empty pages, but these do not matter since they do not take up RAM. The quad indices take 13.4 bytes/quad. The row-wise equivalent used to be 38 or so bytes/quad with similar data. Two-thirds of the IRI and literal string data can benefit from column-wise stream compression. (This was not used but if it were, we could count on a 50% drop in size for the data affected, so instead of 12.7M pages, we could maybe get 8.5M on a good day. This could be worth doing but is not a priority.) The system was configured to have 12M database pages in RAM, so a little under half the database pages of the set fit in RAM at one time; thus one cannot call this a memory-only setup. Due to the locality in the unusually non-local data, this is as far as secondary storage can reach without becoming an over-2x slowdown. In practice, we are talking about under 1% of rows accessed coming from secondary storage, but that alone means half throughput.
We note that this data set represents the worst that we have seen. It has 129M distinct graphs, 38 t/g. Regular data like the synthetic benchmark sets take half the space per quad. This is about a third of a Sindice crawl; the other two-thirds look the same as far as we looked.
So if you are interested in hosting data like this, you can budget 144GB RAM for every 5Gt. Do not try it with anything less. Budgeting double this is wise, so that you have space to cook the data; this is important since in order to do things with it, one needs to at least copy things for materializing transformations.
If you are budget-constrained and hosting very regular content like UniProt, you can budget maybe 144GB RAM for every 10Gt.
As for CPU, this does not matter as much as long as you do not go to disk. Just for load speed, Dbpedia is loaded in 300s on a cluster of eight (8) dual AMD 2378 boxes at 2.6GHz (total 8 cores per host, so 64 cores in the cluster), and in 945s on one (1) dual Xeon 5520 box at 2.26GHz (total 8 cores in the host). Intel makes much better CPUs, as we see. Both scenarios are 100% in RAM. For even more regular data, the load rates are a bit higher: 1.3Mt/s for the AMD cluster, and 300Kt/s for the Xeon host.
The interconnect for the AMD cluster is 1 x gigE but this does not matter for load. For CPU-bound cross-partition JOINs, 1 or 2 x gigE is insufficient; 4 x gigE might barely make it; InfiniBand should be safe. When running cross-partition JOINs, a single 8-core Xeon box generates about 300MB/s of interconnect traffic; a gigE connection can maybe take 50MB/s with some luck.
Intel E5 is not dramatically better than Nehalem but this is something we will see in a while when we make measurements with real equipment. Prior to the E5 release, we tried Amazon EC2 CC2 ("Cluster Compute Eight Extra Large Instance" -- 2x8 core E5, 2.66GHz). The results were inconclusive; it never did more than 1.9x better than Xeon 5520 even when running an empty loop (i.e., recursive Fibonacci function in SQL, no cache misses, no I/O). With a database JOIN, 1.3x better is the best we saw. But this must be the fault of Amazon and not of E5.
We also tried AMD "Magny-Cours", but for 32 cores against 8 it never did over 2x better, more like 1.4x often enough, and and single thread speed was 50% worse, so not a good buy. We did not find a Bulldozer to try, and did not feel like buying one since the reviews did not promise more core speed over the Magny-Cours.
It seems that especially with Column Store, we are truly CPU-bound and not memory-latency- or bandwidth-bound. This is based on the observation that a Xeon 5620 with 2 of 3 memory channels populated loads BSBM data only 10% faster than the same with 1 of 3 channels populated, with CPU affinity set on a dual socket system.
So, if you have a choice between a $2K processor (E5-2690) and a $600 processor (E5-2630), buy the cheaper one and get RAM with the money saved. $1440 buys 128G in $90 8G DIMMs. Then buy E5 boards with 24 DIMMs -- one for every 7Gt of web crawl data. If your software licenses are priced per core, getting higher-clock 4-core E5’s might make sense.
While on the subject of bytes and quads/triples, we note that Bigdata®'s recent announcement says up to 50 billion triples per single server. Franz loaded at a good 800+ Kt/s rate up to a trillion triples. One is led to think from the spec that this was with less than full cpu but still with highly local data, considering 1.5 bytes a triple would hit very heavy I/O otherwise. Their statement to the effect of LUBM-like data corroborates this, so we are not talking about exactly the same thing.
So if you compare the claims, I am talking about running CPU-bound on the worst data there is. Franz and Bigdata® do not specify, so it is hard to compare. LOD2 should in principle publish actual metrics with at least Bigdata®; Franz is not participating in these races.
We may publish some more detailed measurements with more varied configurations later. The thing to remember is minimum 144GB RAM for every 5Gt of web crawls, if you want to load and refresh in RAM.
-
19:37
ICDE 2012 (post 1 of 6) - LOD2 Plenary
» OpenLink Community BlogLOD2's database contributions are, on one hand, Virtuoso Column Store and Elastic Cluster, and on the other, the demonstration and proof from CWI that indeed all of the relational innovations for which CWI is well known apply to graph/RDF data as well.
The value is unquestionable both to Virtuoso users in the short-term, and to the state of science and to all RDF users and vendors in the mid-term.
The LOD2 claim of "linking the universe" (my words) will be tested soon enough, after we first put the universe in a bucket. This refers to a real-time quad store of Sindice crawls, plus a warehouse of the LOD data sets.
This effort raises a few questions that I will treat in a number of posts to follow, such as --
- How do you size a real-time copy of LOD/web data?
- What does it cost to operate a properly provisioned warehouse of all RDF web crawls?
What is done now is under-provisioned and not kept up to date. We are talking about all the RDF on the web in near real time with arbitrary queries. This is very far from the "billion triples" data sets or vertical portals, which are both easy by comparison.
-
19:36
ICDE 2012 (post 6 of 6) - Science Data Panel
» OpenLink Community BlogMichael Stonebraker chaired a panel on the future of science data at ICDE 2012 last week. Other participants were Jeremy Kepner from MIT Lincoln Labs, Anastasia Ailamaki from EPFL, and Alex Szalay from Johns Hopkins University.
This is the thrust of what was said, noted from memory. My comments follow after the synopsis.
Jeremy Kepner: When Java was new we saw it as the coming thing and figured that in HPC we should find space for this. When MapReduce and Hadoop came along, we saw this as a sea change in parallel programming models. This was so simple literally anybody could make parallel algorithms whereas this was not so with MPI. Even parallel distributed arrays are harder. So MapReduce was a game changer, together with the cloud where anybody can get a cluster. Hardly a week passes without me having to explain to somebody in government what MapReduce and Hadoop are about.We have a lot of arrays and a custom database for them. But the arrays are sparse so this is in fact a triple store. Our users like to work in MATLAB, and any data management must run together with that.
Of course, MapReduce is not a real scheduler, and Hadoop is not a real file system. For deployment, we must integrate real schedulers and make HDFS look like a file system to applications. The abstraction of a file system is something people like. Being able to skip a time-consuming data-ingestion process with a database is an advantage with file-based paradigms like Hadoop. If this is enhanced with the right scheduling features, this can be a good component in the HPC toolbox.
Michael Stonebraker: Users of the data use math packages like R, MATLAB, SAS, SPSS, or similar. If business intelligence is about AVG, MIN, MAX, COUNT, and GROUP BY, science applications are much more diverse in their analytics. All science algorithms have an inner loop that resembles linear algebra operations like matrix multiplication. Data is more often than not a large array. There are some graphs in biology and chemistry, but the world is primarily rectangular. Relational databases can emulate sparse arrays but are 20x slower than a custom-made array database for dense arrays. And I will not finish without picking on MapReduce: I know of 2000-node MapReduce clusters. The work they do is maybe that of a 100-node parallel database. So if 2000 nodes is what you want to operate, be my guest.
Science database is a zero billion dollar business. We do not expect to make money from the science market with SciDB, which by now works and has commercial services supplied by Paradigm 4, while the code itself is open source, which is a must for the science community. The real business opportunity is in the analytics needed by insurance and financial services in general, which are next to identical with the science use cases SciDB tackles. This makes the vendors pay attention.
Alex Szalay: The way astronomy is done today is through surveys: a telescope scans through the sky and produces data. We have now for 10 years operated the Sloane Sky Survey and kept the data online. We have all the data, and complete query logs, available for anyone interested. When we set out to do this with Jim Gray, everybody found this a crazy idea, but it has worked out.
Anastasia Ailamaki: We do not use SciDB. We find a lot of spatial use cases. Researchers need access to simulation results which are usually over a spatial model, like in earthquake simulations and the brain. Off-the-shelf techniques like R trees do not work -- the objects overlap too much -- so we have made our own spatial indexing. We make custom software when it is necessary, and are not tied to vendors. In geospatial applications, we can create meshes of different shapes -- like tetrahedral or cubes for earthquakes, and cylinders for the brain -- and index these in a geospatial index. But since an R tree is inefficient when objects overlap too much, as these do, we just find one; and then because there is reachability from an object to neighboring ones, we use this to get all the objects in the area of interest.
* * *
This is obviously a diverse field. Probably the message that we can synthesize out of this is that flexibility and parallel programming models are what we need to pay attention to. There is a need to go beyond what one can do in SQL while continuing to stay close to the data. Also, allowing for plug-in data types and index structures may be useful; we sometimes get requests for such anyway.
The continuing argument around MapReduce and Hadoop is a lasting feature of the landscape. A parallel DB will beat MapReduce any day at joining across partitions; the problem is to overcome the mindset that sees Hadoop as the always-first answer to anything parallel. People will likely have to fail with this before they do anything else. For us, the matter is about having database-resident logic for extract-transform-load (ETL) that can do data-integration type-transformations and maybe iterative graph algorithms that constantly join across partitions, better than a MapReduce job, while still allowing application logic to be written in Java. Teaching sem-web-heads to write SQL procedures and to know about join order, join type, and partition locality, has proven to be difficult. People do not understand latency, whether in client-server or cluster settings. This is why they do not see the point of stored procedures or of shipping functions to data. This sounds like a terrible indictment, like saying that people do not understand why rivers flow downhill. Yet, it is true. This is also why MapReduce is maybe the only parallel programming paradigm that can be successfully deployed in the absence of this understanding, since it is actually quite latency-tolerant, not having any synchronous cross-partition operations except for the succession of the map and reduce steps themselves.
Maybe it is so that the database guys see MapReduce as an insult to their intelligence and the rest of the world sees it as the only understandable way of running grep and sed (Unix commands for string search/replace) in parallel, with the super bonus of letting you reshuffle the outputs so that you can compare everything to everything else, which grep alone never let you do.
* * *
Making a database that does not need data loading seems a nice idea, and CWI has actually done something in this direction in "Here are my Data Files. Here are my Queries. Where are my Results?"] However, there is another product called Algebra Data that claims to take in data without loading and to optimize storage based on access. We do not have immediate plans in this direction. Bulk load is already quite fast (take 100G TPC-H in 70 minutes or so), but faster is always possible.
-
19:35
ICDE 2012 (post 5 of 6) - Graphs
» OpenLink Community BlogThere were quite a few talks about graphs at ICDE 2012. Neither the representations of graphs, nor the differences between RDF and generic graph models, entered much into the discussion. On the other hand, graph similarity searches and related were addressed a fair bit.
Graph DB and RDF/Linked Data are distinct, if neighboring disciplines. On one hand, graph problems predate Linked Data, and the RDF/Linked Data world is a web artifact, which graphs are not as such, so a slightly different cultural derivation also makes these disjoint. Besides, graphs may imply schema first whereas linked data basically cannot. Then another differentiation might be derived from edges not really being first class citizens in RDF, except for reification, at which the RDF reification vocabulary is miserably inadequate, as pointed out before.
RDF is being driven by the web-style publishing of Linked Open Data (LOD), with some standardization and uptake by publishers; Graph DB is not standardized but driven by diverse graph-analytics use cases.
There is no necessary reason why these could not converge, but it will be indefinitely long before any standards come to cover this, so best not hold one's breath. Communities are jealous of their borders, so if the neighbor does something similar one tends to emphasize the differences and not the commonalities.
So for some things, one could warehouse the original RDF of the web microformats and LOD, and then ETL into some other graph model for specific tasks, or just do these in RDF. Of course, then RDF systems need to offer suitable capabilities. These seem to be about very fast edge traversal within a rather local working set, and about accommodating large, iteratively-updated intermediate results, e.g., edge weights.
Judging by the benchmarks paper (Benchmarking traversal operations over graph databases (Slidedeck (ppt), paper (pdf)); Marek Ciglan, Alex Averbuch, and Ladialav Hluchy.) at the GDM workshop, the state of benchmarking in graph databases is even worse than in RDF, where the state is bad enough. The paper's premise was flawed to start, using application logic to do JOINs instead of doing them in the DBMS. In this way, latency comes to dominate, and only the most blatant differences are seen. There is nothing like this style of benchmarking to make an industry look bad. The supercomputer Graph 500 benchmark, on the other hand, lets the contestants make their own implementations on a diversity of architectures with random traversal as well as loading and generating large intermediate results. It is somewhat limited, but still broader than the the graph database benchmarks paper at the GDM workshop.
Returning to graphs, there were some papers on similarity search and clique detection. As players in this space, beyond just RDF, we might as well consider implementing necessary features for efficient expression of such problems. The algorithms discussed were expressed in procedural code against memory-based data structures; there is usually no query language or parallel/distributed processing involved.
MapReduce has become the default way in which people would tackle such problems at scale; in fact, people do not consider anything else, as far as I can tell. Well, they certainly do not consider MPI for example as a first choice. The parallel array things in Fortran do not at first sight seem very graphy, so this is likely not something that crosses one's mind either.
We should try some of the similarity search and clustering in SQL with a parallel programming model. We have excellent expression-evaluation speed from vectoring and unrestricted recursion between partitions, and no file system latencies like MapReduce. The initial test case will be some of the linking/data-integration/mapping workloads in LOD2.
Having some sort-of-agreed-upon benchmark for these workloads would make this more worthwhile. Again, we will see what emerges.
-
19:34
ICDE 2012 (post 4 of 6) - Graph Data Management Workshop
» OpenLink Community BlogI gave an invited talk ("Virtuoso 7 - Column Store and Adaptive Techniques for Graph" (Slides (ppt))) at the Graph Data Management Workshop at ICDE 2012.
Bryan Thompson of Systap (Bigdata® RDF store) was also invited, so we got to talk about our common interests. He told me about two cool things they have recently done, namely introducing tables to SPARQL, and adding a way of reifying statements that does not rely on extra columns. The table business is just about being able to store a multicolumn result set into a named persistent entity for subsequent processing. But this amounts to a SQL table, so the relational model has been re-arrived at, once more, from practical considerations. The reification just packs all the fields of a triple (or quad) into a single string and this string is then used as an RDF S or O (Subject or Object), less frequently a P or G (Predicate or Graph). This works because Bigdata® has variable length fields in all columns of the triple/quad table. The query notation then accepts a function-looking thing in a triple pattern to mark reification. Nice. Virtuoso has a variable length column in only the O but could of course have one in also S and even in P and G. The column store would still compress the same as long as reified values did not occur. These values on the other hand would be unlikely to compress very well but run length and dictionary would always work.
So, we could do it like Bigdata®, or we could add a "quad ID" column to one of the indices, to give a reification ID to quads. Again no penalty in a column store, if you do not access the column. Or we could make an extra table of PSOG->R.
Yet another variation would be to make the SPOG concatenation a literal that is interned in the RDF literal table, and then used as any literal would be in the O, and as an IRI in a special range when occurring as S. The relative merits depend on how often something will be reified and on whether one wishes to SELECT based on parts of reification. Whichever the case may be, the idea of a function-looking placeholder for a reification is a nice one and we should make a compatible syntax if we do special provenance/reification support. The model in the RDF reification vocabulary is a non-starter and a thing to discredit the sem web for anyone from database.
I heard from Bryan that the new W3 RDF WG had declared provenance out of scope, unfortunately. The word on the street on the other hand is that provenance is increasingly found to be an issue. This is confirmed by the active work of the W3 Provenance Working Group.
-
19:33
ICDE 2012 (post 3 of 6) - What Is Timely LOD Search Worth?
» OpenLink Community BlogThere was a talk (Linked Data and Live Querying for Enabling Support Platforms for Web Dataspaces (Slides (PDF)); Jürgen Umbrich, Marcel Karnstedt, Josiane Xavier Parreira, Axel Polleres and Manfred Hauswirth) at the Data Engineering Meets the Semantic Web (DESWEB) workshop at ICDE last week about the problems of caching LOD, whether attempted by Sindice or OpenLink's LOD Cloud Cache. The conclusion was that OpenLink covered a bit more of the test data sets and that Sindice was maybe better up to date on the ones that it covered but that neither did it very well. The data sets were random graphs of user FOAF profiles and such collected from some Billion Triples Data set, thus not data that is likely to have commercial value, except in huge quantities maybe for some advertising, except that click streams and the like are much more valuable.
Being involved with at least one of these, and being in the audience, I felt obligated to comment. The fact is, neither OpenLink's LOD Cloud Cache nor Sindice is a business, and there is not a business model which could justify keeping them timely on the web crawls they contain. Doing so is easy enough, if there is a good enough reason.
The talk did make a couple of worthwhile points: The data does change; and if one queries entities, one encounters large variation in change-frequency across entities and their attributes.
The authors suggested to have a piece of middleware decide what things can be safely retrieved from a copy and what have to be retrieved from the source. Not too much is in fact known about the change frequency of the data, except that it changes, as the authors pointed out.
The crux of the matter is that the thing that ought to know this best is the query processor at the LOD warehouse. For client-side middleware to split the query, it needs access to statistics that it must get from the warehouse or keep by itself. Of course, in concrete application scenarios, you go to the source if you ask about the weather or traffic jams, and otherwise go to the warehouse based on application-level knowledge.
But for actual business intelligence, one needs histories, so a search engine with only the present is not so interesting. At any rate, refreshing the data should leave a trail of past states. Exposing this for online query would just triple the price, so we forget about that for now. Just keeping an append-only table of history is not too much of a problem. One may make extracts from this table into a relational form for specific business questions. There is no point doing such analytics in RDF itself. One would have to just try to see if there is anything remotely exploitable in such histories. Making a history table is easy enough. Maybe I will add one.
Let us now see what it would take to operate a web crawl cache that would be properly provisioned, kept fresh, and managed. We base this on the Sindice crawl sizes and our experiments on these; the non-web-crawl LOD Cloud Cache is not included.
From previous experience we know the sizing: 5Gt/144GB RAM. Today's best price point is on 24-DIMM E5 boards, so 192GB RAM, or 6.67Gt. A unit like that (8TB HDD, 0.5TB SSD, 192GB RAM, 12 core E5, InfiniBand) costs about $6800.
The Sindice crawl is now about 20Gt, so $28K of gear (768GB RAM) is enough. Let us count this 4 times: 2x for anticipated growth; and 2x for running two copies -- one for online, and one for batch jobs. This is 3TB RAM. Power is 16x500W = 8KW, which we could round to 80A at 110V. Colocation comes to $500 for the space, and $1200 per month for power; make it $2500 per month with traffic included.
At this rate, 3 year TCO is $120K + ( 36 * $2.5K ) = $210K. This takes one person half time to operate, so this is another $50K per year.
We do not count software development in this, except some scripting that should be included in the yearly $50K DBA bill.
Under what circumstances is such a thing profitable? Or can such a thing be seen as a marketing demo, to be paid for by license or service sales?
A third party can operate a system of this sort, but then the cost will be dominated by software licenses if running on Virtuoso cluster.
For comparison, the TB at EC2 costs ((( 16 * $2 ) * 24 ) * 31 ) = $24,808 per month. With reserved instances, it is ( 16 * ( $2192 + ((( 0.7 * 24 ) * 365 ) * 3 ))) / 36 = $8938 per month for a 3 year term. Counting at 3TB, the 3 year TCO is $965K at EC2. AWS has volume discounts but they start higher than this; ( 3 * ( 16 * $2K )) = $96K reserved host premium is under $250K. So if you do not even exceed their first volume discount threshold, it does not look likely you can cut a special deal with AWS.
(The AWS prices are calculated with the high memory instances, approximately 64GB usable RAM each. The slightly better CC2 instance is a bit more expensive.)
Yet another experiment to make is whether a system as outlined will even run at anywhere close to the performance of physical equipment. This is uncertain; clouds are not for speed, based on what we have seen. They make the most sense when the monthly bill is negligible in relation to the cost of a couple of days of human time.
-
19:31
ICDE 2012 (post 2 of 6) - LOD Column Store Experiences and Sizing
» OpenLink Community BlogWe have played around with LOD data sets and Virtuoso Column Store for the past several months. I will here give a few numbers and comment on some different platform comparisons that we have made. The answer at the end of this is how to size a system for often-changing web-style data. The conclusion is a data-to-RAM ratio that gives an acceptable working set without driving the price up by forcing 100% RAM residence.
The experiment is loading Sindice web crawls. The platform is 2 x Xeon 5520 and 144G RAM. The initial load rate is 200-180Kt, and drops to 100Kt at 5Gt because of I/O. The system is Virtuoso Column Store configured to run as 4 processes and 32 partitions, all on the same box. After 5Gt, we see just more I/O and going further is not relevant; one runs CPU-bound or not at all.
We use 4 Crucial SSDs in the setup. The hot structures like the RDF quad indices are on SSD, and the cold ones are on hard disk. A cold structure is a write-only index like the dictionary of literals (id to lit).
For bulk load, SSDs turn out not to be particularly useful. For a cold start on the other hand, SSDs cut warmup time of 144G RAM from over half an hour to a couple of minutes. It is possible that Intel SSDs would also help with bulk load, but this has not been tried. The SSD problem during bulk load is that these do not write very fast, and while there are writes in queue, read latency goes up; so under a constant write load, the SSD's famous instantaneous random read no longer works.
The fragment considered in the example is 4.95Gt: 8.1M pages worth of quads; 12.7M of literals and iris; and 4.71M of full text index. A page is 8KB. The files on disk contain empty pages, but these do not matter since they do not take up RAM. The quad indices take 13.4 bytes/quad. The row-wise equivalent used to be 38 or so bytes/quad with similar data. Two-thirds of the IRI and literal string data can benefit from column-wise stream compression. (This was not used but if it were, we could count on a 50% drop in size for the data affected, so instead of 12.7M pages, we could maybe get 8.5M on a good day. This could be worth doing but is not a priority.) The system was configured to have 12M database pages in RAM, so a little under half the database pages of the set fit in RAM at one time; thus one cannot call this a memory-only setup. Due to the locality in the unusually non-local data, this is as far as secondary storage can reach without becoming an over-2x slowdown. In practice, we are talking about under 1% of rows accessed coming from secondary storage, but that alone means half throughput.
We note that this data set represents the worst that we have seen. It has 129M distinct graphs, 38 t/g. Regular data like the synthetic benchmark sets take half the space per quad. This is about a third of a Sindice crawl; the other two-thirds look the same as far as we looked.
So if you are interested in hosting data like this, you can budget 144GB RAM for every 5Gt. Do not try it with anything less. Budgeting double this is wise, so that you have space to cook the data; this is important since in order to do things with it, one needs to at least copy things for materializing transformations.
If you are budget-constrained and hosting very regular content like UniProt, you can budget maybe 144GB RAM for every 10Gt.
As for CPU, this does not matter as much as long as you do not go to disk. Just for load speed, Dbpedia is loaded in 300s on a cluster of eight (8) dual AMD 2378 boxes at 2.6GHz (total 8 cores per host, so 64 cores in the cluster), and in 945s on one (1) dual Xeon 5520 box at 2.26GHz (total 8 cores in the host). Intel makes much better CPUs, as we see. Both scenarios are 100% in RAM. For even more regular data, the load rates are a bit higher: 1.3Mt/s for the AMD cluster, and 300Kt/s for the Xeon host.
The interconnect for the AMD cluster is 1 x gigE but this does not matter for load. For CPU-bound cross-partition JOINs, 1 or 2 x gigE is insufficient; 4 x gigE might barely make it; InfiniBand should be safe. When running cross-partition JOINs, a single 8-core Xeon box generates about 300MB/s of interconnect traffic; a gigE connection can maybe take 50MB/s with some luck.
Intel E5 is not dramatically better than Nehalem but this is something we will see in a while when we make measurements with real equipment. Prior to the E5 release, we tried Amazon EC2 CC2 ("Cluster Compute Eight Extra Large Instance" -- 2x8 core E5, 2.66GHz). The results were inconclusive; it never did more than 1.9x better than Xeon 5520 even when running an empty loop (i.e., recursive Fibonacci function in SQL, no cache misses, no I/O). With a database JOIN, 1.3x better is the best we saw. But this must be the fault of Amazon and not of E5.
We also tried AMD "Magny-Cours", but for 32 cores against 8 it never did over 2x better, more like 1.4x often enough, and and single thread speed was 50% worse, so not a good buy. We did not find a Bulldozer to try, and did not feel like buying one since the reviews did not promise more core speed over the Magny-Cours.
It seems that especially with Column Store, we are truly CPU-bound and not memory-latency- or bandwidth-bound. This is based on the observation that a Xeon 5620 with 2 of 3 memory channels populated loads BSBM data only 10% faster than the same with 1 of 3 channels populated, with CPU affinity set on a dual socket system.
So, if you have a choice between a $2K processor (E5-2690) and a $600 processor (E5-2630), buy the cheaper one and get RAM with the money saved. $1440 buys 128G in $90 8G DIMMs. Then buy E5 boards with 24 DIMMs -- one for every 7Gt of web crawl data. If your software licenses are priced per core, getting higher-clock 4-core E5’s might make sense.
While on the subject of bytes and quads/triples, we note that Bigdata®'s recent announcement says up to 50 billion triples per single server. Franz loaded at a good 800+ Kt/s rate up to a trillion triples. One is led to think from the spec that this was with less than full cpu but still with highly local data, considering 1.5 bytes a triple would hit very heavy I/O otherwise. Their statement to the effect of LUBM-like data corroborates this, so we are not talking about exactly the same thing.
So if you compare the claims, I am talking about running CPU-bound on the worst data there is. Franz and Bigdata® do not specify, so it is hard to compare. LOD2 should in principle publish actual metrics with at least Bigdata®; Franz is not participating in these races.
We may publish some more detailed measurements with more varied configurations later. The thing to remember is minimum 144GB RAM for every 5Gt of web crawls, if you want to load and refresh in RAM.
-
19:28
ICDE 2012 (post 1 of 6) - LOD2 Plenary
» OpenLink Community BlogLOD2's database contributions are, on one hand, Virtuoso Column Store and Elastic Cluster, and on the other, the demonstration and proof from CWI that indeed all of the relational innovations for which CWI is well known apply to graph/RDF data as well.
The value is unquestionable both to Virtuoso users in the short-term, and to the state of science and to all RDF users and vendors in the mid-term.
The LOD2 claim of "linking the universe" (my words) will be tested soon enough, after we first put the universe in a bucket. This refers to a real-time quad store of Sindice crawls, plus a warehouse of the LOD data sets.
This effort raises a few questions that I will treat in a number of posts to follow, such as --
- How do you size a real-time copy of LOD/web data?
- What does it cost to operate a properly provisioned warehouse of all RDF web crawls?
What is done now is under-provisioned and not kept up to date. We are talking about all the RDF on the web in near real time with arbitrary queries. This is very far from the "billion triples" data sets or vertical portals, which are both easy by comparison.
-
21:02
LOD2 Plenary and Review: Semanticist, Think Database!
» OpenLink Community BlogLast week the LOD2 FP7 project had its first review, preceded by its third plenary meeting.
Before this, we did, as promised, get the column store and vectored execution capabilities of Virtuoso 7 Single-Server Edition extended to Virtuoso 7 Cluster Edition. More interesting still, we decoupled storage from the database server process, so now database files can migrate between server processes. This means that clusters are now elastic, i.e., new servers can be added to a cluster and the load can be redistributed without reloading the data.
These things were long planned, but now are done. Measurements will be published in some weeks, as part of CWI's continued running of RDF store benchmarks, per the LOD2 plan.
Doing the column store and elastic cluster is work enough, so I do not in general participate in support or consultancy or the like. This has some pros and cons. On the plus side, there is a relative lack of noise and a very clear idea of focus. Of course, this work is most highly applied, thus always informed by use cases, thus forgetting what ought to be done out there is not the problem. Rather, the problem is forgetting how things in fact are done as opposed to how they could or should be done.
To cut a long story short, it has become clear to me that the DBMS must tell the application developer what to do. Of course, the application developer could also look at performance metrics, but they do not, and explaining these metrics is too much work and yields no lasting benefit. Developers will produce all kinds of performance diagnostic traces if requested, but going through this song and dance can also be avoided by the right automation.
So, I will introduce two new product features called Wazzup? and Saywhat?
Wazzup? is answered by a mood line, like "Heavily disk bound: 100G more memory will give 10x speedup" or "Network bound: Processing in larger batches will give 5x more throughput" and Saywhat? is answered by some commentary on the user's last action, for example "there is no ?order with o_totalprice < 0" or "there is no property O_misspelledtotallprrice."
Wazzup? is about overall system state, and Saywhat? is about the user session, specifically query plans. But an explanation of a query plan is not understandable, so this will just point out some salient facts, like the reason why the answer comes out empty.
The other thing that came to my attention is the fact that a user has no instinctive feel for ETL. A database person takes it for a self-evident truth that data is loaded in bulk, but the application developer does not think of that. Likewise, the line between warehousing and federating is not instinctively felt; actually the question is not even posed in these terms. So one will find Web protocols and end-points and glue code on the app server when one ought to have ETL and adequate hardware for running the consolidated database.
Further, under-provisioning of equipment is endemic with semanticists. The Semantic Web gets a needlessly bad rap just because we find too much data on too little equipment. For example, I was surprised to learn that the Linked Geodata demo ran on only 16 GB RAM and 6 processor cores with 2 billion triples and 350 million points in a geo index. Now, even with our greatest space efficiency advances, there is no way this will run from memory.
It is not that the Web 2.0 stack is necessarily efficient (we hear the wildest stories of lack of database understanding from that side too), but at least there is a culture of running with enough equipment. Surely when the web-scale data gear (e.g. Google Bigtable, Yahoo PNUTS, Amazon Dynamo) was new, by the operators' own admission there was no way for this to be particularly efficient, database-wise. Not if your eventual consistency is a client application to a shared MySQL back-end. For a lookup or single-record-update workload, who cares when there is enough hardware? For analytics, there is the de facto impossibility of doing big joins, but map reduce is for that, all offline. The big web houses have always known how to deal with data; it is the smaller Web 2.0 guys who patch systems together with duct tape and memcache. Even so, the online experience gets created.
Semanticism has no part of this outlook, except maybe for Freebase, but then they are from California and now have been inside Google for a while.
We quite understand that when one needs to get big data online, one makes a key-value store as a point solution, because this way one owns what one operates, and the time to market is a lot shorter than if one tried building all this inside a general-purpose DBMS. Besides, the people who can in fact do this almost do not exist, and even if one had a whole army of this rare breed, development is not very scalable in a tightly-integrated system like a high-performance DBMS. Still further, to even start, one needs to own the DBMS, meaning that the initial platform must be known through and through. This is an issue even though open source platforms exist.
The graph data, semdata, schema-last, RDF, linked data enterprise -- whatever one calls it -- makes the bold proposition of bringing complex-query-at-scale to heterogeneous data. This is a database claim.
In the meantime, test deployments are made in defiance of database best practices. This is a bit like test driving a race car in reverse gear and steering by looking in the rear-view mirror.
There is also no short-term scalable way to educate people. At the LOD2 review, one comment was that an integrated project ought to clearly indicate how to set up the tool chain for good performance, specially as concerns interfaces between the tools. This is very true. Experience shows that developers of tools cannot accurately anticipate what usage patterns will emerge in the field. Therefore, we propose to do better than just documentation; we will make the server recognize the common sources of inefficiency and point the user to the right action.
Provisioning and usage patterns: The DBMS ought to know best.Imagine the following conversation:
DBMS: Your application does single-triple INSERTs over client-server protocol all day, from a single client. 57% of real time goes in client server latency, 40% in cluster interconnect latency, 2% in compiling the statements, and 1% in doing the work. Use array parameters or bulk load from a file.
Operator: My developers use industry-standard Java class libraries with a service-oriented architecture and strictly enforced interfaces. This is called software engineering. Watch out ere you raise your voice against the canon.
[Some weeks later, after the load job has gone on for 10 days and gotten a third of the way, developers have discovered that JDBC has array parameters and are trying these.]
DBMS: 60% of real time goes into waiting for locks. 10% of transactions get aborted for deadlock. Transactions consist of an average of 10 client-server operations. Use stored procedures; acquire locks in predictable order; do SELECT FOR UPDATE. Throughput will be 4x higher if client-server operations are merged into a single operation. The transactions only INSERT; hence consider bulk load instead.
Operator: We are using an enterprise-class three-tier architecture. It has "enterprise" in the name and all the big guys are using it, so it must be scalable. Besides, it is distributed transactions, and distributed computing is the wave of the future. You are a cluster yourself, so the pot's got no business calling the kettle black.
[After a while, the data gets loaded with bulk load, but now on a single stream.]
DBMS: CPU is at 400% for an INSERT workload; adding more parallel threads will get 4.5x better throughput.
[Some time has elapsed and there are Ajax client apps out there trying to use the data.]
DBMS: Will you really not give me another 140 GB RAM and 16 more cores?
Operator: No, on general principles I will not, shut up.
DBMS: Do you know that your page impression takes 3 seconds and anything over 0.25 seconds is visibly slow? 300 GB worth of distinct pages have been accessed in the last 24 hours for 160 GB of RAM. Latency will drop 10x by using SSD; 50x by increasing RAM.
Operator: No dice, bucket. Shut up, besides, when I scroll through the data I always use for testing, I get it fast enough, you are just doing this out of greed and self-importance. You are a server among many, just like the mail server; you databases are just pretentious.
Currently addressing any of the above sorts of issues takes a long time and involves mostly-avoidable support communication. Questions of this sort do occur. We can probably produce commentary like the above based on logging some 50 numbers, and making some 15 regularly-run reports over these. The patterns to watch out for are well known. No, we will not make a Zippy the Pinhead office assistant; a computer should not try to be cute. This one will talk only in terms of gains from adjusting the deployment or usage patterns.
Now, suppose the operator said yes to the request for more cores and memory; then it would be up to the DBMS to deliver. This entails a capacity to redistribute itself automatically, and to give a quantitative report on the success of this measure. This means usage-based repartitioning of the data to equalize load over a cluster. The relevant metric in the above case is the drop in response time. On the other hand, the DBMS should also notice if there is clearly unused capacity.
This all will be presented as a line in the status report, so there is no extra wizard or workload analyzer that one must remember to run. For programmatic use there are SQL views for the relevant reports.
As for ETL, even if the DBMS can detect that it is not being done right, this does not mean that the user will know what to do. Therefore, for all the Web harvesting we support, as well as any import from local file system or Web services, with some RDF-ization, we will simply implement a proper ETL utility that will do things right. Wazzup? can just point the user to that if the workload looks like loading. This will have its own status report giving a load and transform rate and will point out what takes the longest, after everything is duly parallelized and made asynchronous.
Beyond these lessons, there is more to say about the review and plenary, we will get to that a bit later. We did promise a new edition of the LOD cache in a couple of months, now on the clustered column-store platform. Look for advances in data discoverability.
-
14:50
LOD2 Plenary and Review: Semanticist, Think Database!
» OpenLink Community BlogLast week the LOD2 FP7 project had its first review, preceded by its third plenary meeting.
Before this, we did, as promised, get the column store and vectored execution capabilities of Virtuoso 7 Single-Server Edition extended to Virtuoso 7 Cluster Edition. More interesting still, we decoupled storage from the database server process, so now database files can migrate between server processes. This means that clusters are now elastic, i.e., new servers can be added to a cluster and the load can be redistributed without reloading the data.
These things were long planned, but now are done. Measurements will be published in some weeks, as part of CWI's continued running of RDF store benchmarks, per the LOD2 plan.
Doing the column store and elastic cluster is work enough, so I do not in general participate in support or consultancy or the like. This has some pros and cons. On the plus side, there is a relative lack of noise and a very clear idea of focus. Of course, this work is most highly applied, thus always informed by use cases, thus forgetting what ought to be done out there is not the problem. Rather, the problem is forgetting how things in fact are done as opposed to how they could or should be done.
To cut a long story short, it has become clear to me that the DBMS must tell the application developer what to do. Of course, the application developer could also look at performance metrics, but they do not, and explaining these metrics is too much work and yields no lasting benefit. Developers will produce all kinds of performance diagnostic traces if requested, but going through this song and dance can also be avoided by the right automation.
So, I will introduce two new product features called Wazzup? and Saywhat?
Wazzup? is answered by a mood line, like "Heavily disk bound: 100G more memory will give 10x speedup" or "Network bound: Processing in larger batches will give 5x more throughput" and Saywhat? is answered by some commentary on the user's last action, for example "there is no ?order with o_totalprice < 0" or "there is no property O_misspelledtotallprrice."
Wazzup? is about overall system state, and Saywhat? is about the user session, specifically query plans. But an explanation of a query plan is not understandable, so this will just point out some salient facts, like the reason why the answer comes out empty.
The other thing that came to my attention is the fact that a user has no instinctive feel for ETL. A database person takes it for a self-evident truth that data is loaded in bulk, but the application developer does not think of that. Likewise, the line between warehousing and federating is not instinctively felt; actually the question is not even posed in these terms. So one will find Web protocols and end-points and glue code on the app server when one ought to have ETL and adequate hardware for running the consolidated database.
Further, under-provisioning of equipment is endemic with semanticists. The Semantic Web gets a needlessly bad rap just because we find too much data on too little equipment. For example, I was surprised to learn that the Linked Geodata demo ran on only 16 GB RAM and 6 processor cores with 2 billion triples and 350 million points in a geo index. Now, even with our greatest space efficiency advances, there is no way this will run from memory.
It is not that the Web 2.0 stack is necessarily efficient (we hear the wildest stories of lack of database understanding from that side too), but at least there is a culture of running with enough equipment. Surely when the web-scale data gear (e.g. Google Bigtable, Yahoo PNUTS, Amazon Dynamo) was new, by the operators' own admission there was no way for this to be particularly efficient, database-wise. Not if your eventual consistency is a client application to a shared MySQL back-end. For a lookup or single-record-update workload, who cares when there is enough hardware? For analytics, there is the de facto impossibility of doing big joins, but map reduce is for that, all offline. The big web houses have always known how to deal with data; it is the smaller Web 2.0 guys who patch systems together with duct tape and memcache. Even so, the online experience gets created.
Semanticism has no part of this outlook, except maybe for Freebase, but then they are from California and now have been inside Google for a while.
We quite understand that when one needs to get big data online, one makes a key-value store as a point solution, because this way one owns what one operates, and the time to market is a lot shorter than if one tried building all this inside a general-purpose DBMS. Besides, the people who can in fact do this almost do not exist, and even if one had a whole army of this rare breed, development is not very scalable in a tightly-integrated system like a high-performance DBMS. Still further, to even start, one needs to own the DBMS, meaning that the initial platform must be known through and through. This is an issue even though open source platforms exist.
The graph data, semdata, schema-last, RDF, linked data enterprise -- whatever one calls it -- makes the bold proposition of bringing complex-query-at-scale to heterogeneous data. This is a database claim.
In the meantime, test deployments are made in defiance of database best practices. This is a bit like test driving a race car in reverse gear and steering by looking in the rear-view mirror.
There is also no short-term scalable way to educate people. At the LOD2 review, one comment was that an integrated project ought to clearly indicate how to set up the tool chain for good performance, specially as concerns interfaces between the tools. This is very true. Experience shows that developers of tools cannot accurately anticipate what usage patterns will emerge in the field. Therefore, we propose to do better than just documentation; we will make the server recognize the common sources of inefficiency and point the user to the right action.
Provisioning and usage patterns: The DBMS ought to know best.Imagine the following conversation:
DBMS: Your application does single-triple INSERTs over client-server protocol all day, from a single client. 57% of real time goes in client server latency, 40% in cluster interconnect latency, 2% in compiling the statements, and 1% in doing the work. Use array parameters or bulk load from a file.
Operator: My developers use industry-standard Java class libraries with a service-oriented architecture and strictly enforced interfaces. This is called software engineering. Watch out ere you raise your voice against the canon.
[Some weeks later, after the load job has gone on for 10 days and gotten a third of the way, developers have discovered that JDBC has array parameters and are trying these.]
DBMS: 60% of real time goes into waiting for locks. 10% of transactions get aborted for deadlock. Transactions consist of an average of 10 client-server operations. Use stored procedures; acquire locks in predictable order; do SELECT FOR UPDATE. Throughput will be 4x higher if client-server operations are merged into a single operation. The transactions only INSERT; hence consider bulk load instead.
Operator: We are using an enterprise-class three-tier architecture. It has "enterprise" in the name and all the big guys are using it, so it must be scalable. Besides, it is distributed transactions, and distributed computing is the wave of the future. You are a cluster yourself, so the pot's got no business calling the kettle black.
[After a while, the data gets loaded with bulk load, but now on a single stream.]
DBMS: CPU is at 400% for an INSERT workload; adding more parallel threads will get 4.5x better throughput.
[Some time has elapsed and there are Ajax client apps out there trying to use the data.]
DBMS: Will you really not give me another 140 GB RAM and 16 more cores?
Operator: No, on general principles I will not, shut up.
DBMS: Do you know that your page impression takes 3 seconds and anything over 0.25 seconds is visibly slow? 300 GB worth of distinct pages have been accessed in the last 24 hours for 160 GB of RAM. Latency will drop 10x by using SSD; 50x by increasing RAM.
Operator: No dice, bucket. Shut up, besides, when I scroll through the data I always use for testing, I get it fast enough, you are just doing this out of greed and self-importance. You are a server among many, just like the mail server; you databases are just pretentious.
Currently addressing any of the above sorts of issues takes a long time and involves mostly-avoidable support communication. Questions of this sort do occur. We can probably produce commentary like the above based on logging some 50 numbers, and making some 15 regularly-run reports over these. The patterns to watch out for are well known. No, we will not make a Zippy the Pinhead office assistant; a computer should not try to be cute. This one will talk only in terms of gains from adjusting the deployment or usage patterns.
Now, suppose the operator said yes to the request for more cores and memory; then it would be up to the DBMS to deliver. This entails a capacity to redistribute itself automatically, and to give a quantitative report on the success of this measure. This means usage-based repartitioning of the data to equalize load over a cluster. The relevant metric in the above case is the drop in response time. On the other hand, the DBMS should also notice if there is clearly unused capacity.
This all will be presented as a line in the status report, so there is no extra wizard or workload analyzer that one must remember to run. For programmatic use there are SQL views for the relevant reports.
As for ETL, even if the DBMS can detect that it is not being done right, this does not mean that the user will know what to do. Therefore, for all the Web harvesting we support, as well as any import from local file system or Web services, with some RDF-ization, we will simply implement a proper ETL utility that will do things right. Wazzup? can just point the user to that if the workload looks like loading. This will have its own status report giving a load and transform rate and will point out what takes the longest, after everything is duly parallelized and made asynchronous.
Beyond these lessons, there is more to say about the review and plenary, we will get to that a bit later. We did promise a new edition of the LOD cache in a couple of months, now on the clustered column-store platform. Look for advances in data discoverability.
-
13:37
GDB for the Data Driven Age (STI Summit Position Paper)
» OpenLink Community BlogNote: The following was written prior to the event, but was not posted until later due to human error.
The Semantic Technology Institute (STI) is organizing a meeting around the questions of making semantic technology deliver on its promise. We were asked to present a position paper (reproduced below). This is another recap of our position on making graph databasing come of age. While the database technology matters are getting tackled, we are drawing closer to the question of deciding actually what kind of inference will be needed close to the data. My personal wish is to use this summit for clarifying exactly what is needed from the database in order to extract value from the data explosion. We have a good idea of what to do with queries but what is the exact requirement for transformation and alignment of schema and identifiers? What is the actual use case of inference, OWL or other, in this? It is time to get very concrete in terms of applications. We expect a mixed requirement but it is time to look closely at the details.
GDB for the Data Driven AgeDatabases and knowledge representation both have decades of history, but to date the exchange of ideas and techniques between these disciplines has been limited. The intuition that there would be value in greater cooperation has not failed to occur to researchers on either side; after all, both sides deal with data. From this, we have seen deductive databases emerge, as well as more recently "database friendly" profiles of OWL.
In this position paper we will examine what, in the most concrete terms, is needed in order to bring leading edge database technology together with expressive querying and reasoning. This draws on our experience in building Virtuoso, one of today's leading graph data stores. Following this, we argue for the creation of benchmarks and challenges that in fact do reflect reality and facilitate open and fair comparison of products and technologies.
Data integration is often mentioned as the motivating use case for GDB, commonly popularized today as RDF. Database research has over the past few years produced great advances for business intelligence (i.e., complex queries and read-mostly workloads). These advances are typified by compressed columnar storage and architecture-conscious execution models, mostly based on the idea of always processing multiple sets of values in each operation (vectoring). With these techniques, raw performance with relatively simple schemas and regular data (e.g., TPC-H) is no longer a barrier to extracting value from data.
A similar breakthrough has not been seen on the semantics side. Data integration still requires manual labor. Publishing GDB datasets is a good and necessary intermediate stage, but producing these datasets from diverse sources is not fundamentally different from doing the same work without GDB or RDF. Even so, GDB and RDF serve as a catalyst for a culture of publishing datasets.
GDB, as a base model for integration, offers the following benefits over a purely relational result format:
- All entities have globally unique identifiers.
- Any statements may be associated ad hoc to any entities.
- These statements can be scoped into graphs according to their provenance, time, validity, etc.
Obtaining this flexibility on a relational basis would simply require moving to an graph-like representation with essentially one-row-per-attribute. Indeed, we see key-value stores being used in online applications with high volatility of schema (e.g., social networks, search); and we also see relational applications making provisions for post-hoc addition of per-entity attributes (i.e., associating a bag of mixed non-first normal form data with entities). The benefits of a schema-last approach are recognized in many places.
GDB seems a priori a fit for all these requirements, thus how will it claim its place as a solution?
The first part of the answer lies in learning all the relevant database lessons. The second part lies in eliminating the impedance mismatch between querying and reasoning. The third and most important part consists of substantiating these claims in a manner that is understandable to the relevant publics, finally leading to the creation of a semantics-aware segment of the database industry. We will address each of these aspects in turn.
GDB and RDBThe problem is divided into storage format, execution, and query optimization. For the first two, Daniel Abadi's renowned Ph.D. thesis holds most of the keys. Space efficiency is specially important for Linked Data, since data is often voluminous, and many datasets have to be brought together for integration. Access patterns are also unpredictable, with indexed-random-access predominating, as opposed to RDB BI workloads where sequential scans and hash joins represent the bulk of the work. However, we find that a sorted column-wise compressed representation of Linked Data with a single quad table for all statements gives excellent space efficiency and good random access as well as random insert speed. The space efficiency is close to par with the equivalent column-wise relational format, since three of the four columns of the quad table compress to almost nothing. As many sort orders as are necessary may be maintained, but we find that two are enough, with some extra data structures for dealing with queries where the predicate is unspecified. The details are found in VLDB 2010 Semdata workshop paper, Directions and Challenges for Semantically Linked Data. Since GDB/RDF is a model typed at run time, the engine must support an "
ANY" data type for columns and query variables, where values on successive rows may be of different types. This is a straightforward enhancement.Vectored execution is traditionally associated with column stores because the per-row access cost is relatively high, thus needing to access many nearby rows at a time in order to amortize the overhead. Aside this, vectored execution provides many opportunities for parallelism, from the instruction level all the way to threading and distributed execution on clusters, thus some form of execution on large numbers of concurrent query states is needed for RDF stores, just as it is needed for RDBMS"s.
Query optimization for GDBMS is similar to that for RDBMS, except that the statistics can no longer be collected by column and table, but must rather apply to individual entities and ranges of a single quad table. This can be provided through run-time sampling of the database based on constants in the query being optimized. This may take into account trivial inference such as expanding properties into the set of their sub-properties and the like. Beyond this, interleaving execution and optimization (as in ROX) seems to offer limitless possibilities, especially when inference is introduced, making optimizer statistics less predictive.
In summary, starting with an RDBMS and going to GDB entails changes to all parts of the engine, but these changes are not fundamental. One does need to own the engine; however, otherwise the expertise for efficiently implementing these changes will not exist. Essentially any DBMS technique may be translated to a GDB use case, if its application can be decided at run-time. GDB may be schema-less, yet most datasets have fairly regular structure; the question is simply to reconstruct the needed statistics and schema information from the data on an as you go basis. Techniques with high up-front cost, like constructing specially ordered materializations for optimizing specific queries, are harder to deploy but still conceivable for GDB also.
RDB and InferenceCompared to the straightforwardly performance oriented world of database engines, the contours of the landscape become less defined when moving to inference. Databases, whether relational or schema-less all perform roughly the same functions but inference is more diverse. We include here also techniques like machine learning and meta-reasoning for guiding reasoning, although these might not strictly fit the definition.
As we posit that data integration is the motivating use case for GDB as opposed to RDB (Relational Database Model), we must ask which modes of inference are actually required for data integration. Further, we need to ask whether these inferences ought to be applied as a preprocessing step (ETL or forward chaining), or as needed (backward chaining). Some low-hanging fruit can be collected by simply constructing class or property hierarchies; e.g., in the data at hand, the following properties have the meaning of company name, and the following classes have the meaning of company. We have found that such techniques can be efficiently supported at run-time, without materialization, if the support is simply built into the engine, which is in itself straightforward as long as one controls the engine. The same applies to trivial identity resolution, such as
owl:sameAsor resolution of identity based on sharing an inverse-functional property value. These things take longer at run-time, but if one caches and reuses the result, one can get around materialization.We do not believe in weak statements of identity, as in X is similar to Y, since the meaning of similarity is entirely contextual. X and Y may or may not be interchangeable depending on the application; thus the statement on identity needs to be strong, but it must be easy to modify the grounds on which such a statement is made. This is a further argument for why one should not automatically materialize consequences of identity, particularly if dealing with web data where identity is especially problematic.
Real-world problems are however harder than just bundling properties, classes, or instances into sets of interchangeable equivalents, which is all we have mentioned thus far. There are differences of modeling ("address as many columns in customer table" vs. "address normalized away under a contact entity"), normalization ("first name" and "last name" as one or more properties; national conventions on person names; tags as comma-separated in a string or as a one-to-many), incomplete data (one customer table has family income bracket, the other does not), diversity in units of measurement (Imperial vs. metric), variability in the definition of units (seven different things all called blood pressure), variability in unit conversions (currency exchange rates), to name a few. What a world!
If data exists, the conversion questions are often answerable but their answer depends on context -- e.g., date of transaction for currency exchange rate; source of data for the definition of blood pressure.
Alongside these, there remain issues of identity, e.g., depending on the perspective, a national subsidiary is or is not the same entity as the parent company, companies with the same name can be entirely unrelated in different jurisdictions.
It appears that we may need a multi-level approach, combining different techniques for different phases of the integration process. We do not a priori believe that using SQL VIEWs for unit and modeling conversion, and then OWL for unifying terminology on top of this, were the whole solution. Even if this were the solution, the pipeline from the relational sources to SPARQL and OWL needs to be optimized for real-world BI information volumes, and the query language needs to be able to express the business questions and needs to interface with the reporting tools the analyst has come to expect.
Our answer so far consists of a SPARQL extension with non-recursive rules, roughly equivalent to SQL VIEWs in expressive power, tightly integrated to the query engine. There is also limited support for recursion through transitive subqueries; thus one can compactly express things like "all parts of all assemblies and subassemblies must satisfy applicable safety requirements, where the requirements depend on the type of the part in question."
This is only an intermediate step. We believe that a database-scale generic inference engine with at least Datalog power, with second-order extensions like computed predicates, is needed, executing inside the DBMS, benefiting from the whole array of optimizations database-science expects of execution engines, as part of the answer.
This will not relieve the analyst of having to consider that the currency rates in effect at the time of conversion must be taken into account when calculating profits, but this will at least make expressing this and similar pieces of context more compact.
We note that time-to-answer has historically won over raw performance. This was also the case for RDBMS when these were the fresh challenger to the CODASYL incumbents, just as was the case with the adoption of high-level languages. The key is that the raw performance must be sufficient for the real world task. With the adoption of the database lessons outlined in the previous section, we believe this to be the case for GDB (and thus, RDF).
Substantiating the ClaimsBenchmarks have a stellar record for improving any metric they measure. The question is, how can we make a metric that measures GDB's ability to deliver on its claim to fame -- time-to-answer for big data -- with all the integration and other complexities this entails?
So far, GDB benchmarks have consisted of workloads where RDBMS are clearly better (e.g., LUBM, or the Berlin SPARQL Benchmark). This does not remove their usefulness for GDB, but does not constitute a GDB selling point, either.
We suggest a dual approach. The first part is demonstrating that GDB is scalable for BI: We take the industry standard decision support benchmark TPC-H, which is very favorable to RDB and quite unfavorable to GDB, and show that we can tackle the workload at reasonable cost. If TPC-H is all one wants, an RDBMS will stay a better fit, but then this benchmark does not capture any of the heterogeneity, schema evolution, or other such requirements faced by real-world data warehouses. This is still a qualification test, not the selling point.
The issue of benchmark is inextricably tied to the issue of messaging. There must be a compelling story, with which the IT community can identify. Further, the benchmark must capture real-world challenges in the area of interest. With all this, the benchmark should not be too expensive to run. Here too, a multistage approach suggests itself.
Our tentative answer to this question is the Social Intelligence Benchmark (SIB), developed together with CWI and other partners in the LOD2 consortium. This simulates a social network and combines an online workload with complex analytics. This benchmark should cover all of the target areas of the LOD2 project, so that the project itself generates its own metric of success. The project has clear data integration targets, especially as applies to Web and Linked Data. Questions of integration with enterprise sources need to be further developed; for example, comparing CRM data with extractions from the online conversation space for market research.
Data integration will invariably involve human effort, and the area cannot be satisfactorily covered with metrics of scale and throughput alone. Development time, accuracy of results, and cost of maintenance are all factors. Furthermore, the task being modeled must correspond to reality, still without being too domain-specific or prohibitively time-consuming to implement.
ConclusionsThe data driven world will increase rewards for efficiency in data integration. We believe that such efficiency crucially depends on semantics. Real world requirements just might throw the database and AI communities together with enough heat and pressure for fusion to ignite, allegorically speaking. Without a clear and present need, the geek world analog of electrostatic repulsion will keep the communities separate, as has been the case thus far, and no new, qualitatively-different element will arise.
Efforts such as this STI Summit and the LOD2 Project are needed for setting directions and communicating the requirement to the research world. In our fusion analogy, this is the field which directs the nuclei to collide.
Once there is an actual reaction that produces more than it consumes by a sufficient margin, regular business dynamics will take over, and we will have an industry with several products of comparable capability, as well as a set of metrics, all to the benefit of the end user.
References-
TPC-H results pages
-
Daniel Abadi's Ph.D. Thesis, Query Execution in Column-Oriented Database Systems ( PDF )
-
Our VLDB 2010 Semdata workshop paper, Directions and Challenges for Semantically Linked Data ( HTML | PDF )
-
CWI's ROX: Run-time Optimization of XQueries ( PDF )
-
13:36
GDB for the Data Driven Age (STI Summit Position Paper)
» OpenLink Community BlogNote: The following was written prior to the event, but was not posted until later due to human error.
The Semantic Technology Institute (STI) is organizing a meeting around the questions of making semantic technology deliver on its promise. We were asked to present a position paper (reproduced below). This is another recap of our position on making graph databasing come of age. While the database technology matters are getting tackled, we are drawing closer to the question of deciding actually what kind of inference will be needed close to the data. My personal wish is to use this summit for clarifying exactly what is needed from the database in order to extract value from the data explosion. We have a good idea of what to do with queries but what is the exact requirement for transformation and alignment of schema and identifiers? What is the actual use case of inference, OWL or other, in this? It is time to get very concrete in terms of applications. We expect a mixed requirement but it is time to look closely at the details.
GDB for the Data Driven AgeDatabases and knowledge representation both have decades of history, but to date the exchange of ideas and techniques between these disciplines has been limited. The intuition that there would be value in greater cooperation has not failed to occur to researchers on either side; after all, both sides deal with data. From this, we have seen deductive databases emerge, as well as more recently "database friendly" profiles of OWL.
In this position paper we will examine what, in the most concrete terms, is needed in order to bring leading edge database technology together with expressive querying and reasoning. This draws on our experience in building Virtuoso, one of today's leading graph data stores. Following this, we argue for the creation of benchmarks and challenges that in fact do reflect reality and facilitate open and fair comparison of products and technologies.
Data integration is often mentioned as the motivating use case for GDB, commonly popularized today as RDF. Database research has over the past few years produced great advances for business intelligence (i.e., complex queries and read-mostly workloads). These advances are typified by compressed columnar storage and architecture-conscious execution models, mostly based on the idea of always processing multiple sets of values in each operation (vectoring). With these techniques, raw performance with relatively simple schemas and regular data (e.g., TPC-H) is no longer a barrier to extracting value from data.
A similar breakthrough has not been seen on the semantics side. Data integration still requires manual labor. Publishing GDB datasets is a good and necessary intermediate stage, but producing these datasets from diverse sources is not fundamentally different from doing the same work without GDB or RDF. Even so, GDB and RDF serve as a catalyst for a culture of publishing datasets.
GDB, as a base model for integration, offers the following benefits over a purely relational result format:
- All entities have globally unique identifiers.
- Any statements may be associated ad hoc to any entities.
- These statements can be scoped into graphs according to their provenance, time, validity, etc.
Obtaining this flexibility on a relational basis would simply require moving to an graph-like representation with essentially one-row-per-attribute. Indeed, we see key-value stores being used in online applications with high volatility of schema (e.g., social networks, search); and we also see relational applications making provisions for post-hoc addition of per-entity attributes (i.e., associating a bag of mixed non-first normal form data with entities). The benefits of a schema-last approach are recognized in many places.
GDB seems a priori a fit for all these requirements, thus how will it claim its place as a solution?
The first part of the answer lies in learning all the relevant database lessons. The second part lies in eliminating the impedance mismatch between querying and reasoning. The third and most important part consists of substantiating these claims in a manner that is understandable to the relevant publics, finally leading to the creation of a semantics-aware segment of the database industry. We will address each of these aspects in turn.
GDB and RDBThe problem is divided into storage format, execution, and query optimization. For the first two, Daniel Abadi's renowned Ph.D. thesis holds most of the keys. Space efficiency is specially important for Linked Data, since data is often voluminous, and many datasets have to be brought together for integration. Access patterns are also unpredictable, with indexed-random-access predominating, as opposed to RDB BI workloads where sequential scans and hash joins represent the bulk of the work. However, we find that a sorted column-wise compressed representation of Linked Data with a single quad table for all statements gives excellent space efficiency and good random access as well as random insert speed. The space efficiency is close to par with the equivalent column-wise relational format, since three of the four columns of the quad table compress to almost nothing. As many sort orders as are necessary may be maintained, but we find that two are enough, with some extra data structures for dealing with queries where the predicate is unspecified. The details are found in VLDB 2010 Semdata workshop paper, Directions and Challenges for Semantically Linked Data. Since GDB/RDF is a model typed at run time, the engine must support an "
ANY" data type for columns and query variables, where values on successive rows may be of different types. This is a straightforward enhancement.Vectored execution is traditionally associated with column stores because the per-row access cost is relatively high, thus needing to access many nearby rows at a time in order to amortize the overhead. Aside this, vectored execution provides many opportunities for parallelism, from the instruction level all the way to threading and distributed execution on clusters, thus some form of execution on large numbers of concurrent query states is needed for RDF stores, just as it is needed for RDBMS"s.
Query optimization for GDBMS is similar to that for RDBMS, except that the statistics can no longer be collected by column and table, but must rather apply to individual entities and ranges of a single quad table. This can be provided through run-time sampling of the database based on constants in the query being optimized. This may take into account trivial inference such as expanding properties into the set of their sub-properties and the like. Beyond this, interleaving execution and optimization (as in ROX) seems to offer limitless possibilities, especially when inference is introduced, making optimizer statistics less predictive.
In summary, starting with an RDBMS and going to GDB entails changes to all parts of the engine, but these changes are not fundamental. One does need to own the engine; however, otherwise the expertise for efficiently implementing these changes will not exist. Essentially any DBMS technique may be translated to a GDB use case, if its application can be decided at run-time. GDB may be schema-less, yet most datasets have fairly regular structure; the question is simply to reconstruct the needed statistics and schema information from the data on an as you go basis. Techniques with high up-front cost, like constructing specially ordered materializations for optimizing specific queries, are harder to deploy but still conceivable for GDB also.
RDB and InferenceCompared to the straightforwardly performance oriented world of database engines, the contours of the landscape become less defined when moving to inference. Databases, whether relational or schema-less all perform roughly the same functions but inference is more diverse. We include here also techniques like machine learning and meta-reasoning for guiding reasoning, although these might not strictly fit the definition.
As we posit that data integration is the motivating use case for GDB as opposed to RDB (Relational Database Model), we must ask which modes of inference are actually required for data integration. Further, we need to ask whether these inferences ought to be applied as a preprocessing step (ETL or forward chaining), or as needed (backward chaining). Some low-hanging fruit can be collected by simply constructing class or property hierarchies; e.g., in the data at hand, the following properties have the meaning of company name, and the following classes have the meaning of company. We have found that such techniques can be efficiently supported at run-time, without materialization, if the support is simply built into the engine, which is in itself straightforward as long as one controls the engine. The same applies to trivial identity resolution, such as
owl:sameAsor resolution of identity based on sharing an inverse-functional property value. These things take longer at run-time, but if one caches and reuses the result, one can get around materialization.We do not believe in weak statements of identity, as in X is similar to Y, since the meaning of similarity is entirely contextual. X and Y may or may not be interchangeable depending on the application; thus the statement on identity needs to be strong, but it must be easy to modify the grounds on which such a statement is made. This is a further argument for why one should not automatically materialize consequences of identity, particularly if dealing with web data where identity is especially problematic.
Real-world problems are however harder than just bundling properties, classes, or instances into sets of interchangeable equivalents, which is all we have mentioned thus far. There are differences of modeling ("address as many columns in customer table" vs. "address normalized away under a contact entity"), normalization ("first name" and "last name" as one or more properties; national conventions on person names; tags as comma-separated in a string or as a one-to-many), incomplete data (one customer table has family income bracket, the other does not), diversity in units of measurement (Imperial vs. metric), variability in the definition of units (seven different things all called blood pressure), variability in unit conversions (currency exchange rates), to name a few. What a world!
If data exists, the conversion questions are often answerable but their answer depends on context -- e.g., date of transaction for currency exchange rate; source of data for the definition of blood pressure.
Alongside these, there remain issues of identity, e.g., depending on the perspective, a national subsidiary is or is not the same entity as the parent company, companies with the same name can be entirely unrelated in different jurisdictions.
It appears that we may need a multi-level approach, combining different techniques for different phases of the integration process. We do not a priori believe that using SQL VIEWs for unit and modeling conversion, and then OWL for unifying terminology on top of this, were the whole solution. Even if this were the solution, the pipeline from the relational sources to SPARQL and OWL needs to be optimized for real-world BI information volumes, and the query language needs to be able to express the business questions and needs to interface with the reporting tools the analyst has come to expect.
Our answer so far consists of a SPARQL extension with non-recursive rules, roughly equivalent to SQL VIEWs in expressive power, tightly integrated to the query engine. There is also limited support for recursion through transitive subqueries; thus one can compactly express things like "all parts of all assemblies and subassemblies must satisfy applicable safety requirements, where the requirements depend on the type of the part in question."
This is only an intermediate step. We believe that a database-scale generic inference engine with at least Datalog power, with second-order extensions like computed predicates, is needed, executing inside the DBMS, benefiting from the whole array of optimizations database-science expects of execution engines, as part of the answer.
This will not relieve the analyst of having to consider that the currency rates in effect at the time of conversion must be taken into account when calculating profits, but this will at least make expressing this and similar pieces of context more compact.
We note that time-to-answer has historically won over raw performance. This was also the case for RDBMS when these were the fresh challenger to the CODASYL incumbents, just as was the case with the adoption of high-level languages. The key is that the raw performance must be sufficient for the real world task. With the adoption of the database lessons outlined in the previous section, we believe this to be the case for GDB (and thus, RDF).
Substantiating the ClaimsBenchmarks have a stellar record for improving any metric they measure. The question is, how can we make a metric that measures GDB's ability to deliver on its claim to fame -- time-to-answer for big data -- with all the integration and other complexities this entails?
So far, GDB benchmarks have consisted of workloads where RDBMS are clearly better (e.g., LUBM, or the Berlin SPARQL Benchmark). This does not remove their usefulness for GDB, but does not constitute a GDB selling point, either.
We suggest a dual approach. The first part is demonstrating that GDB is scalable for BI: We take the industry standard decision support benchmark TPC-H, which is very favorable to RDB and quite unfavorable to GDB, and show that we can tackle the workload at reasonable cost. If TPC-H is all one wants, an RDBMS will stay a better fit, but then this benchmark does not capture any of the heterogeneity, schema evolution, or other such requirements faced by real-world data warehouses. This is still a qualification test, not the selling point.
The issue of benchmark is inextricably tied to the issue of messaging. There must be a compelling story, with which the IT community can identify. Further, the benchmark must capture real-world challenges in the area of interest. With all this, the benchmark should not be too expensive to run. Here too, a multistage approach suggests itself.
Our tentative answer to this question is the Social Intelligence Benchmark (SIB), developed together with CWI and other partners in the LOD2 consortium. This simulates a social network and combines an online workload with complex analytics. This benchmark should cover all of the target areas of the LOD2 project, so that the project itself generates its own metric of success. The project has clear data integration targets, especially as applies to Web and Linked Data. Questions of integration with enterprise sources need to be further developed; for example, comparing CRM data with extractions from the online conversation space for market research.
Data integration will invariably involve human effort, and the area cannot be satisfactorily covered with metrics of scale and throughput alone. Development time, accuracy of results, and cost of maintenance are all factors. Furthermore, the task being modeled must correspond to reality, still without being too domain-specific or prohibitively time-consuming to implement.
ConclusionsThe data driven world will increase rewards for efficiency in data integration. We believe that such efficiency crucially depends on semantics. Real world requirements just might throw the database and AI communities together with enough heat and pressure for fusion to ignite, allegorically speaking. Without a clear and present need, the geek world analog of electrostatic repulsion will keep the communities separate, as has been the case thus far, and no new, qualitatively-different element will arise.
Efforts such as this STI Summit and the LOD2 Project are needed for setting directions and communicating the requirement to the research world. In our fusion analogy, this is the field which directs the nuclei to collide.
Once there is an actual reaction that produces more than it consumes by a sufficient margin, regular business dynamics will take over, and we will have an industry with several products of comparable capability, as well as a set of metrics, all to the benefit of the end user.
References-
TPC-H results pages
-
Daniel Abadi's Ph.D. Thesis, Query Execution in Column-Oriented Database Systems ( PDF )
-
Our VLDB 2010 Semdata workshop paper, Directions and Challenges for Semantically Linked Data ( HTML | PDF )
-
CWI's ROX: Run-time Optimization of XQueries ( PDF )
-
15:49
The 2011 STI Semantic Summit
» OpenLink Community BlogI was recently at the STI 2011 summit in Riga, Latvia. This is a meeting of senior participants in the semantic web and sem tech scene, organized by STI of Dieter Fensel fame, with board members like Michael Brodie, Mark Greaves, and Jim Hendler.
This is substantially about the intersection of AI, knowledge representation, and databases. As we have said before, the database side has not been very prominent in these meetings in the past, but this time we had Peter Boncz of CWI, of MonetDB and VectorWise fame, attending the proceedings.
Will DB and AI finally meet? Well, they have met, but how do they get along? Before I try to answer this, let us look at some background.
At present, CWI and OpenLink are working together in the LOD2 EU FP7 project, around the general topic of bringing the best of Relational Database (RDB) science to the Graph Database (GDB) world. Virtuoso has for a few months had a column store capability (which is about to be made available for public preview). CWI has a long history of column store work, with MonetDB and Ingres VectorWise as results. OpenLink's column store implementation is separate in terms of code but is of course influenced by the work at CWI and other published column store results. The plan is to transplant the applicable CWI innovations into the graph context within Virtuoso. These improvements naturally also benefit Virtuoso RDB (SQL), but the LOD2 project is primarily concerned with GDB applications. The RDB yardstick for much of this work is TPC-H, of which we have made a GDB translation. CWI is uniquely qualified as concerns this in light of VectorWise holding some of the top places in the TPC-H charts.
Even now, we do in fact run the 22 TPC-H queries in SPARQL against the Virtuoso column store. True, these run faster in SQL against relational tables but we have established a beach head. From this initial position, we can incrementally improve the GDB/SPARQL and RDB/SQL functions, and see how close to SQL we get with SPARQL. I will make a separate post commenting on the differences between SQL and SPARQL.
So let's get back to Riga. Mark Greaves said in his opening comments that he would be sick if he once again heard complaining about how bad and un-scalable the tools were. From all the talks, I did get the overall impression that just better databasing for Graph Data is still needed. OK, we have 1-1/2 years of unreleased work just for that about to hit the street; advances are substantial. Along these lines, the people from Bio2RDF pointed out that there still is a cost to publishing query services, specially for complex queries. Well, this cost will be substantially reduced.
The takeaway from the meeting is that the most useful thing, for both our public and ourselves, is simply to keep advancing database tech for graph data. In the first instance, this is about launching what we already have; in the second, about going through the CWI record of innovation and adapting this to GDB.
The thinking is that once query-answering on some tens-of-billions of triples is easily interactive no matter what question one asks, a tipping point will be reached, and GDB can efficiently play the role of data-melting-pot that has been envisioned for it.
This is just a beginning, though. Michael Brodie has on a number of occasions pointed out that that (relational) database guys are only about performance with little or no regard to meaning or even questions of the applicability of the relational model. Peter Boncz then comments back that it can well be that the bulk of IT expenditure worldwide in fact goes into data integration. However, data integration is an "AI-complete" problem with infinite variety and consequent difficulty of measurement. So, making better database engines stands a much greater chance of success and has the nicety of relatively unambiguous metrics.
Quite so. We are somewhere in the middle. I'd say that GDB is still at the stage where making better databases is a matter of make-or-break and not a matter of cutting already vanishingly-short response times just for the sake of it. We will have progress if we just keep at it; for now, performance is still a basic need and not a luxury.
Now that there is all this potentially integrable data published as graphs (most commonly as RDF serializations), what do we do? Someone at the Riga meeting suggested we take a look across the tracks to the RDB world to see what is being done there for data integration. The question is raised, what does GDB have for data integration? The automatic answer that GDB and RDF have OWL is not adequate, as was rightly pointed out by many. Having schema-last, global identifiers, and some culture of vocabulary reuse is nice, but this is only a start. To cite an example,
owl:sameAswill not work when entities simply do not align: One database models a product as a parts hierarchy; another does the same but now based on the materials used in the parts. One tree just has a node that is not in the other. Besides, things like string matching (as in extracting area codes from phone numbers) are common, and OWL specifically excludes any such functions.It is now time to look at what will come after all the database advances. In my talk I outlined some things that have or are about to get solutions:
-
Database technology: Applying advances from RDB (specifically columns, vectoring, and some adaptive query execution) will make GDB a possibility for data warehousing at some scale.
-
Benchmarks: These advances will be demonstrable through benchmarking. There is a better suite of benchmarks with many variations of BSBM, an GDB-modified TPC-H, and the upcoming Social Intelligence Benchmark (SIBB) with actual graph data. There are the beginnings of an auditing process for result publishing, and a fair chance the semdata world will get its analog of the TPC.
After these basics are more or less in hand, we have a vista of more diverse questions:
-
What to do about inference? We do not want OWL or RIF for their own sake; instead we want whatever will declaratively facilitate making sense of data. This is an entirely use-case-driven question. If this can have a reasonably generic answer, we will build it into the engine.
-
Data integration is highly diverse, and tool sets like IBM Infosphere have thousands of modules and functions for different aspects of the problem. To what degree does it make sense to put DI-oriented capabilities into a DBMS?
-
Is it the case that SQL or SPARQL, plus or minus a few details, is as powerful as a language can be while staying application domain-agnostic? In other words, if more powerful reasoning is built into the query language, will the requirements vary so much between application domains that the work is not generally applicable? Datalog is general enough, but can we demonstrate substantially reduced time to answer with big data if this is built into the engine? Berkeley Orders Of Magnitude claims this, even though their claim is not exactly in a database context. We need use cases to refine the actual requirement for inference.
In all these questions, we of necessity turn to the user community. In fact we do not follow the usage of these technologies as much as we ought to. One outcome of the Riga summit is a set of public challenges that will hopefully ameliorate this state of matters, to be released soon.
The general feeling was that there is more going on on the data side than the AI side. The LOD movement proceeds and lightweight everything predominates, also for knowledge representation. There was some discussion about "pay as you go" integration. On the one hand, there is no up-front integration of information systems just for its own sake, so pay as you go is the only kind that exists, system by system, as the need becomes sufficient. On the other hand, each such integration is a process which has its distinct steps and maintenance and within itself it is planned, and thus pre-paid, so to speak. We need more work with the data itself to better understand the matter. The open government data should offer a playground for this and there will be a special challenge around this.
Schema.org and Microdata got their share of discussion. As we see it, it is good that search engines make their pre-competitive data open. This is better than, for example, Google wanting retailers to put their catalogs in Google Base. We do not care about the specific syntax in which data is embedded; we support them all. Microdata converts easily to triples, and if one wants to make a tabular extraction for use with relational tools, this too is simple enough. Applications will have to do their own entity resolution, but this is independent of data publication format.
All in all, the mood was positive. Mark Greaves noted in his closing remarks that there has been a 1000x increase in published GDB data over a few years. There is in fact a large quantity of technology for tackling almost any aspect of the LOD value chain, but people do not necessarily know about this nor is it easy to integrate. Still there would be great value in integration. Getting software to interoperate in a meaningful way is manual labor, so it might make sense to organize hackathons around this. While the STI Summit is for the senior people, there could be a parallel track of events for bringing the coders together to actually practice tool integration and interoperation.
-
-
15:12
The 2011 STI Semantic Summit
» OpenLink Community BlogI was recently at the STI 2011 summit in Riga, Latvia. This is a meeting of senior participants in the semantic web and sem tech scene, organized by STI of Dieter Fensel fame, with board members like Michael Brodie, Mark Greaves, and Jim Hendler.
This is substantially about the intersection of AI, knowledge representation, and databases. As we have said before, the database side has not been very prominent in these meetings in the past, but this time we had Peter Boncz of CWI, of MonetDB and VectorWise fame, attending the proceedings.
Will DB and AI finally meet? Well, they have met, but how do they get along? Before I try to answer this, let us look at some background.
At present, CWI and OpenLink are working together in the LOD2 EU FP7 project, around the general topic of bringing the best of Relational Database (RDB) science to the Graph Database (GDB) world. Virtuoso has for a few months had a column store capability (which is about to be made available for public preview). CWI has a long history of column store work, with MonetDB and Ingres VectorWise as results. OpenLink's column store implementation is separate in terms of code but is of course influenced by the work at CWI and other published column store results. The plan is to transplant the applicable CWI innovations into the graph context within Virtuoso. These improvements naturally also benefit Virtuoso RDB (SQL), but the LOD2 project is primarily concerned with GDB applications. The RDB yardstick for much of this work is TPC-H, of which we have made a GDB translation. CWI is uniquely qualified as concerns this in light of VectorWise holding some of the top places in the TPC-H charts.
Even now, we do in fact run the 22 TPC-H queries in SPARQL against the Virtuoso column store. True, these run faster in SQL against relational tables but we have established a beach head. From this initial position, we can incrementally improve the GDB/SPARQL and RDB/SQL functions, and see how close to SQL we get with SPARQL. I will make a separate post commenting on the differences between SQL and SPARQL.
So let's get back to Riga. Mark Greaves said in his opening comments that he would be sick if he once again heard complaining about how bad and un-scalable the tools were. From all the talks, I did get the overall impression that just better databasing for Graph Data is still needed. OK, we have 1-1/2 years of unreleased work just for that about to hit the street; advances are substantial. Along these lines, the people from Bio2RDF pointed out that there still is a cost to publishing query services, specially for complex queries. Well, this cost will be substantially reduced.
The takeaway from the meeting is that the most useful thing, for both our public and ourselves, is simply to keep advancing database tech for graph data. In the first instance, this is about launching what we already have; in the second, about going through the CWI record of innovation and adapting this to GDB.
The thinking is that once query-answering on some tens-of-billions of triples is easily interactive no matter what question one asks, a tipping point will be reached, and GDB can efficiently play the role of data-melting-pot that has been envisioned for it.
This is just a beginning, though. Michael Brodie has on a number of occasions pointed out that that (relational) database guys are only about performance with little or no regard to meaning or even questions of the applicability of the relational model. Peter Boncz then comments back that it can well be that the bulk of IT expenditure worldwide in fact goes into data integration. However, data integration is an "AI-complete" problem with infinite variety and consequent difficulty of measurement. So, making better database engines stands a much greater chance of success and has the nicety of relatively unambiguous metrics.
Quite so. We are somewhere in the middle. I'd say that GDB is still at the stage where making better databases is a matter of make-or-break and not a matter of cutting already vanishingly-short response times just for the sake of it. We will have progress if we just keep at it; for now, performance is still a basic need and not a luxury.
Now that there is all this potentially integrable data published as graphs (most commonly as RDF serializations), what do we do? Someone at the Riga meeting suggested we take a look across the tracks to the RDB world to see what is being done there for data integration. The question is raised, what does GDB have for data integration? The automatic answer that GDB and RDF have OWL is not adequate, as was rightly pointed out by many. Having schema-last, global identifiers, and some culture of vocabulary reuse is nice, but this is only a start. To cite an example,
owl:sameAswill not work when entities simply do not align: One database models a product as a parts hierarchy; another does the same but now based on the materials used in the parts. One tree just has a node that is not in the other. Besides, things like string matching (as in extracting area codes from phone numbers) are common, and OWL specifically excludes any such functions.It is now time to look at what will come after all the database advances. In my talk I outlined some things that have or are about to get solutions:
-
Database technology: Applying advances from RDB (specifically columns, vectoring, and some adaptive query execution) will make GDB a possibility for data warehousing at some scale.
-
Benchmarks: These advances will be demonstrable through benchmarking. There is a better suite of benchmarks with many variations of BSBM, an GDB-modified TPC-H, and the upcoming Social Intelligence Benchmark (SIBB) with actual graph data. There are the beginnings of an auditing process for result publishing, and a fair chance the semdata world will get its analog of the TPC.
After these basics are more or less in hand, we have a vista of more diverse questions:
-
What to do about inference? We do not want OWL or RIF for their own sake; instead we want whatever will declaratively facilitate making sense of data. This is an entirely use-case-driven question. If this can have a reasonably generic answer, we will build it into the engine.
-
Data integration is highly diverse, and tool sets like IBM Infosphere have thousands of modules and functions for different aspects of the problem. To what degree does it make sense to put DI-oriented capabilities into a DBMS?
-
Is it the case that SQL or SPARQL, plus or minus a few details, is as powerful as a language can be while staying application domain-agnostic? In other words, if more powerful reasoning is built into the query language, will the requirements vary so much between application domains that the work is not generally applicable? Datalog is general enough, but can we demonstrate substantially reduced time to answer with big data if this is built into the engine? Berkeley Orders Of Magnitude claims this, even though their claim is not exactly in a database context. We need use cases to refine the actual requirement for inference.
In all these questions, we of necessity turn to the user community. In fact we do not follow the usage of these technologies as much as we ought to. One outcome of the Riga summit is a set of public challenges that will hopefully ameliorate this state of matters, to be released soon.
The general feeling was that there is more going on on the data side than the AI side. The LOD movement proceeds and lightweight everything predominates, also for knowledge representation. There was some discussion about "pay as you go" integration. On the one hand, there is no up-front integration of information systems just for its own sake, so pay as you go is the only kind that exists, system by system, as the need becomes sufficient. On the other hand, each such integration is a process which has its distinct steps and maintenance and within itself it is planned, and thus pre-paid, so to speak. We need more work with the data itself to better understand the matter. The open government data should offer a playground for this and there will be a special challenge around this.
Schema.org and Microdata got their share of discussion. As we see it, it is good that search engines make their pre-competitive data open. This is better than, for example, Google wanting retailers to put their catalogs in Google Base. We do not care about the specific syntax in which data is embedded; we support them all. Microdata converts easily to triples, and if one wants to make a tabular extraction for use with relational tools, this too is simple enough. Applications will have to do their own entity resolution, but this is independent of data publication format.
All in all, the mood was positive. Mark Greaves noted in his closing remarks that there has been a 1000x increase in published GDB data over a few years. There is in fact a large quantity of technology for tackling almost any aspect of the LOD value chain, but people do not necessarily know about this nor is it easy to integrate. Still there would be great value in integration. Getting software to interoperate in a meaningful way is manual labor, so it might make sense to organize hackathons around this. While the STI Summit is for the senior people, there could be a parallel track of events for bringing the coders together to actually practice tool integration and interoperation.
-
-
14:15
The 2011 STI Semantic Summit
» OpenLink Community BlogI was recently at the STI 2011 summit in Riga, Latvia. This is a meeting of senior participants in the semantic web and sem tech scene, organized by STI of Dieter Fensel fame, with board members like Michael Brodie, Mark Greaves, and Jim Hendler.
This is substantially about the intersection of AI, knowledge representation, and databases. As we have said before, the database side has not been very prominent in these meetings in the past, but this time we had Peter Boncz of CWI, of MonetDB and VectorWise fame, attending the proceedings.
Will DB and AI finally meet? Well, they have met, but how do they get along? Before I try to answer this, let us look at some background.
At present, CWI and OpenLink are working together in the LOD2 EU FP7 project, around the general topic of bringing the best of Relational Database (RDB) science to the Graph Database (GDB) world. Virtuoso has for a few months had a column store capability (which is about to be made available for public preview). CWI has a long history of column store work, with MonetDB and Ingres VectorWise as results. OpenLink's column store implementation is separate in terms of code but is of course influenced by the work at CWI and other published column store results. The plan is to transplant the applicable CWI innovations into the graph context within Virtuoso. These improvements naturally also benefit Virtuoso RDB (SQL), but the LOD2 project is primarily concerned with GDB applications. The RDB yardstick for much of this work is TPC-H, of which we have made a GDB translation. CWI is uniquely qualified as concerns this in light of VectorWise holding some of the top places in the TPC-H charts.
Even now, we do in fact run the 22 TPC-H queries in SPARQL against the Virtuoso column store. True, these run faster in SQL against relational tables but we have established a beach head. From this initial position, we can incrementally improve the GDB/SPARQL and RDB/SQL functions, and see how close to SQL we get with SPARQL. I will make a separate post commenting on the differences between SQL and SPARQL.
So let's get back to Riga. Mark Greaves said in his opening comments that he would be sick if he once again heard complaining about how bad and un-scalable the tools were. From all the talks, I did get the overall impression that just better databasing for Graph Data is still needed. OK, we have 1-1/2 years of unreleased work just for that about to hit the street; advances are substantial. Along these lines, the people from Bio2RDF pointed out that there still is a cost to publishing query services, specially for complex queries. Well, this cost will be substantially reduced.
The takeaway from the meeting is that the most useful thing, for both our public and ourselves, is simply to keep advancing database tech for graph data. In the first instance, this is about launching what we already have; in the second, about going through the CWI record of innovation and adapting this to GDB.
The thinking is that once query-answering on some tens-of-billions of triples is easily interactive no matter what question one asks, a tipping point will be reached, and GDB can efficiently play the role of data-melting-pot that has been envisioned for it.
This is just a beginning, though. Michael Brodie has on a number of occasions pointed out that that (relational) database guys are only about performance with little or no regard to meaning or even questions of the applicability of the relational model. Peter Boncz then comments back that it can well be that the bulk of IT expenditure worldwide in fact goes into data integration. However, data integration is an "AI-complete" problem with infinite variety and consequent difficulty of measurement. So, making better database engines stands a much greater chance of success and has the nicety of relatively unambiguous metrics.
Quite so. We are somewhere in the middle. I'd say that GDB is still at the stage where making better databases is a matter of make-or-break and not a matter of cutting already vanishingly-short response times just for the sake of it. We will have progress if we just keep at it; for now, performance is still a basic need and not a luxury.
Now that there is all this potentially integrable data published as graphs (most commonly as RDF serializations), what do we do? Someone at the Riga meeting suggested we take a look across the tracks to the RDB world to see what is being done there for data integration. The question is raised, what does GDB have for data integration? The automatic answer that GDB and RDF have OWL is not adequate, as was rightly pointed out by many. Having schema-last, global identifiers, and some culture of vocabulary reuse is nice, but this is only a start. To cite an example,
owl:sameAswill not work when entities simply do not align: One database models a product as a parts hierarchy; another does the same but now based on the materials used in the parts. One tree just has a node that is not in the other. Besides, things like string matching (as in extracting area codes from phone numbers) are common, and OWL specifically excludes any such functions.It is now time to look at what will come after all the database advances. In my talk I outlined some things that have or are about to get solutions:
-
Database technology: Applying advances from RDB (specifically columns, vectoring, and some adaptive query execution) will make GDB a possibility for data warehousing at some scale.
-
Benchmarks: These advances will be demonstrable through benchmarking. There is a better suite of benchmarks with many variations of BSBM, an GDB-modified TPC-H, and the upcoming Social Intelligence Benchmark (SIBB) with actual graph data. There are the beginnings of an auditing process for result publishing, and a fair chance the semdata world will get its analog of the TPC.
After these basics are more or less in hand, we have a vista of more diverse questions:
-
What to do about inference? We do not want OWL or RIF for their own sake; instead we want whatever will declaratively facilitate making sense of data. This is an entirely use-case-driven question. If this can have a reasonably generic answer, we will build it into the engine.
-
Data integration is highly diverse, and tool sets like IBM Infosphere have thousands of modules and functions for different aspects of the problem. To what degree does it make sense to put DI-oriented capabilities into a DBMS?
-
Is it the case that SQL or SPARQL, plus or minus a few details, is as powerful as a language can be while staying application domain-agnostic? In other words, if more powerful reasoning is built into the query language, will the requirements vary so much between application domains that the work is not generally applicable? Datalog is general enough, but can we demonstrate substantially reduced time to answer with big data if this is built into the engine? Berkeley Orders Of Magnitude claims this, even though their claim is not exactly in a database context. We need use cases to refine the actual requirement for inference.
In all these questions, we of necessity turn to the user community. In fact we do not follow the usage of these technologies as much as we ought to. One outcome of the Riga summit is a set of public challenges that will hopefully ameliorate this state of matters, to be released soon.
The general feeling was that there is more going on on the data side than the AI side. The LOD movement proceeds and lightweight everything predominates, also for knowledge representation. There was some discussion about "pay as you go" integration. On the one hand, there is no up-front integration of information systems just for its own sake, so pay as you go is the only kind that exists, system by system, as the need becomes sufficient. On the other hand, each such integration is a process which has its distinct steps and maintenance and within itself it is planned, and thus pre-paid, so to speak. We need more work with the data itself to better understand the matter. The open government data should offer a playground for this and there will be a special challenge around this.
Schema.org and Microdata got their share of discussion. As we see it, it is good that search engines make their pre-competitive data open. This is better than, for example, Google wanting retailers to put their catalogs in Google Base. We do not care about the specific syntax in which data is embedded; we support them all. Microdata converts easily to triples, and if one wants to make a tabular extraction for use with relational tools, this too is simple enough. Applications will have to do their own entity resolution, but this is independent of data publication format.
All in all, the mood was positive. Mark Greaves noted in his closing remarks that there has been a 1000x increase in published GDB data over a few years. There is in fact a large quantity of technology for tackling almost any aspect of the LOD value chain, but people do not necessarily know about this nor is it easy to integrate. Still there would be great value in integration. Getting software to interoperate in a meaningful way is manual labor, so it might make sense to organize hackathons around this. While the STI Summit is for the senior people, there could be a parallel track of events for bringing the coders together to actually practice tool integration and interoperation.
-
-
23:55
Transaction Semantics in RDF and Relational Models
» OpenLink Community BlogAs a part of defining benchmark audit for testing ACID properties on RDF stores, we will here examine different RDF scenarios where lack of concurrency control causes inconsistent results. In so doing, we consider common implementation techniques and implications as concern locking (pessimistic) and multi-version (optimistic) concurrency control schemes.
In the following, we will talk in terms of triples, but the discussion can be trivially generalized to quads. We will use numbers for IRIs and literals. In most implementations, the internal representation for these is indeed a number (or at least some data type that has a well defined collation order). For ease of presentation, we consider a single index with key parts
Insert (Create) and DeleteSPO. Any other index-like setting with any possible key order will have similar issues.INSERTandDELETEas defined in SPARQL are queries which generate a result set which is then used for instantiating triple patterns. We note that aDELETEmay delete a triple which theDELETEhas not read; thus the delete set is not a subset of the read set. The SQL equivalent is theDELETE FROM table WHERE key IN ( SELECT key1 FROM other_table )
expression, supposing it were implemented as a scan of
other_tableand an index lookup followed byDELETEon table.The meaning of
INSERTis that the triples in question exist after the operation, and the meaning ofDELETEis that said triples do not exist. In a transactional context, this means that the after-image of the transaction is guaranteed either to have or not-have said triples.Suppose that the triples
{ 1 0 0 },{ 1 5 6 }, and{ 1 5 7 }exist in the beginning. If weDELETE { 1 ?x ?y }and concurrentlyINSERT { 1 2 4 . 1 2 3 . 1 3 5 }, then whichever was considered to be first by the concurrency control of the DBMS would complete first, and the other after that. Thus the end state would either have no triples with subject1or would have the three just inserted.Suppose the
INSERTinserts the first triple,{ 1 2 4 }. TheDELETEat the same time reads all triples with subject1. The exclusive read waits for the uncommittedINSERT. TheINSERTthen inserts the second triple,{ 1 2 3 }. Depending on the isolation of the read, this either succeeds, since no{ 1 2 3 }was read, or causes a deadlock. The first corresponds toREPEATABLE READisolation; the second toSERIALIZABLE.We would not get the desired end-state of either all the inserted triples or no triples with subject
1if the read or theDELETEwere not serializable.Furthermore if a
Read and UpdateDELETEtemplate produced a triple that did not exist in the pre-image, theDELETEsemantics still imply that this also does not exist in the after-image, which implies serializability.Let us consider the prototypical transaction example of transferring funds from one account to another. Two balances are updated, and a history record is inserted.
The initial state is
a balance 10 b balance 10
We transfer
1fromatob, and at the same time transfer2frombtoa. The end state must haveaat11andbat9.A relational database needs
REPEATABLE READisolation for this.With RDF,
txn1reads thatahas abalanceof10. At the same time,txn1reads thebalanceofa.txn2waits because the read oftxn1is exclusive.txn1proceeds and read thebalanceofb. It then updates thebalanceofaandb.All goes without the deadlock which is always cited in this scenario, because the locks are acquired in the same order. The act of updating the balance of
a, since RDF does not really have an update-in-place, consists of deleting{ a balance 10 }and inserting{ a balance 9 }. This gets done andtxn1commits. At this point,txn2proceeds after its wait on the row that stated{ a balance 10 }. This row is now gone, andtxn2sees thatahas no balance, which is quite possible in RDF's schema-less model.We see that
REPEATABLE READis not adequate with RDF, even though it is with relational. The reason why there is noUPDATE-in-place is that thePRIMARY KEYof the triple includes all the parts, including the object. Even in a RDBMS, anUPDATEof a primary key part amounts to aDELETE-plus-INSERT. One could here argue that an implementation might stillUPDATE-in-place if the key order were not changed. This would resolve the special case of the accounts but not a more general case.Thus we see that the read of the balance must be
locking order and OLTPSERIALIZABLE. This means that the read locks the space before the first balance, so that no insertion may take place. In this way the read oftxn2waits on the lock that is conceptually before the first possible match of{ a balance ?x }.To implement TPC-C, I would update the table with the highest cardinality first, and then all tables in descending order of cardinality. In this way, the locks with the highest likelihood for contention are held for the least time. If locking multiple rows of a table, these should be locked in a deterministic order, e.g., lowest key-value first. In this way, the workload would not deadlock. In actual fact, with clusters and parallel execution, the lock acquisition will not be guaranteed to be serial, so deadlocks do not entirely go away, but still may get fewer. Besides, any outside transaction might still lock in the wrong order and cause deadlocks, which is why the OLTP application must in any case be built to deal with the possibility of deadlock.
This is the conventional relational view of the matter. In more recent times, in-memory schemes with deterministic lock acquisition (Abadi VLDB 2010) or single-threaded atomic execution of transactions (Uni Munich BIRTE workshop at VLDB2010, VoltDB) have been proposed. There the transaction is described as a stored procedure, possibly with extra annotations. These techniques might apply to RDF also. RDF is however an unlikely model for transaction-intensive applications, so we will not for now examine these further.
RDBMS usually implement row-level locking. This means that once a column of a row has an uncommitted state, any other transaction is prevented from changing the row. This has no ready RDF equivalent. RDF is usually implemented as a row-per-triple system and applying row-level locking to this does not give the semantic one expects of a relational row.
I would argue that it is not essential to enforce transactional guarantees in units of rows. The guarantees must apply between data that is read and written by a transaction. It does not need to apply to columns that the transaction does not reference. To take the TPC-C example, the new order transaction updates the stock level and the delivery transaction updates the delivery count on the stock table. In practice, a delivery and a new order falling on the same row of stock will lock each other out, but nothing in the semantics of the workload mandates this.
It does not seem a priori necessary to recreate the row as a unit of concurrency control in RDF. One could say that a multi-attribute whole (such as an address) ought to be atomic for concurrency control, but then applications updating addresses will most likely read and update all the fields together even if only the street name changes.
Pessimistic Vs. Optimistic Concurrency ControlWe have so far spoken only in terms of row-level locking, which is to my knowledge the most widely used model in RDBMS, and one we implement ourselves. Some databases (e.g., MonetDB and VectorWise) implement optimistic concurrency control. The general idea is that each transaction has a read and write set and when a transaction commits, any other transactions whose read or write set intersects with the write set of the committing transaction are marked un-committable. Once a transaction thus becomes un-committable, it may presumably continue reading indefinitely but may no longer commit its updates. Optimistic concurrency is generally coupled with multi-version semantics where the pre-image of a transaction is a clean committed state of the database as of a specific point in time, i.e., snapshot isolation.
To implement
SERIALIZABLEisolation, i.e., the guarantee that if a transaction twice performs aCOUNTthe result will be the same, one locks also the row that precedes the set of selected rows and marks each lock so as to prevent an insert to the right of the lock in key order. The same thing may be done in an optimistic setting.Positional Handling of Updates in Column Stores [Heman, Zukowski, CWI science library] discusses management of multiple consecutive snapshots in some detail. The paper does not go into the details of different levels of isolation but nothing there suggests that serializability could not be supported. There is some complexity in marking the space between ordered rows as non-insertable across multiple versions but this should be feasible enough.
The issue of optimistic Vs. pessimistic concurrency does not seem to be affected by the differences between RDF and relational models. We note that an OLTP workload can be made to run with very few transaction aborts (deadlocks) by properly ordering operations when using a locking scheme. The same does not work with optimistic concurrency since updates happen immediately and transaction aborts occur whenever the writes of one intersect the reads or writes of another, regardless of the order in which these were made.
Developers seldom understand transactions; therefore DBMS should, within the limits of the possible, optimize locking order for locking schemes. A simple example is locking in key order when doing an operation on a set of values. A more complex variant would consist of analyzing data dependencies in stored procedures and reordering updates so as to get the highest cardinality tables first. We note that this latter trick also benefits optimistic schemes.
In RDF, the same principles apply but distinguishing cardinality of an updated set will have to rely on statistics of predicate cardinality. Such are anyhow needed for query optimization.
Eventual ConsistencyWeb scale systems that need to maintain consistent state across multiple data centers sometimes use "eventual consistency" schemes. Two-phase-commit becomes very inefficient as latency increases, thus strict transactional semantics have prohibitive cost if the system is more distributed than a cluster with a fast interconnect.
Eventual consistency schemes (Amazon Dynamo, Yahoo! PNUTS) maintain history information on the record which is the unit of concurrency control. The record is typically a non-first normal form chunk of related data that it makes sense to store together from the application's viewpoint. Application logic can then be applied to reconciling differing copies of the same logical record.
Such a scheme seems a priori ill-suited for RDF, where the natural unit of concurrency control would seem to be the quad. We first note that only recently changed (i.e.,
DELETEd + INSERTedquads, as there is noUPDATE-in-place) need history information. This history information can be stored away from the quad itself, thus not disrupting compression. When detecting that one site hasINSERTeda quad that another hasDELETEdin the same general time period, application logic can still be applied for reading related quads in order to arrive at a decision on how to reconcile two databases that have diverged. The same can apply to conflicting values of properties that for the application should be single-valued. Comparing time-stamped transaction logs on quads is not fundamentally different from comparing record histories in Dynamo or PNUTS.As we overcome the data size penalties that have until recently been associated with RDF, RDF becomes even more interesting as a data model for large online systems such as social network platforms where frequent application changes lead to volatility of schema. Key value stores are currently found in such applications, but they generally do not provide the query flexibility at which RDF excels.
ConclusionsWe have gone over basic aspects of the endlessly complex and variable topic of transactions, and drawn parallels as well as outlined two basic differences between relational and RDF systems: What used to be
REPEATABLE READbecomesSERIALIZABLE; and row-level locking becomes locking at the level of a single attribute value. For the rest, we see that the optimistic and pessimistic modes of concurrency control, as well as guidelines for writing transaction procedures, remain much the same.Based on this overview, it should be possible to design an ACID test for describing the ACID behavior of benchmarked systems. We do not intend to make transaction support a qualification requirement for an RDF benchmark, but information on transaction support will still be valuable in comparing different systems.
-
23:55
Transaction Semantics in RDF and Relational Models
» OpenLink Community BlogAs a part of defining benchmark audit for testing ACID properties on RDF stores, we will here examine different RDF scenarios where lack of concurrency control causes inconsistent results. In so doing, we consider common implementation techniques and implications as concern locking (pessimistic) and multi-version (optimistic) concurrency control schemes.
In the following, we will talk in terms of triples, but the discussion can be trivially generalized to quads. We will use numbers for IRIs and literals. In most implementations, the internal representation for these is indeed a number (or at least some data type that has a well defined collation order). For ease of presentation, we consider a single index with key parts
Insert (Create) and DeleteSPO. Any other index-like setting with any possible key order will have similar issues.INSERTandDELETEas defined in SPARQL are queries which generate a result set which is then used for instantiating triple patterns. We note that aDELETEmay delete a triple which theDELETEhas not read; thus the delete set is not a subset of the read set. The SQL equivalent is theDELETE FROM table WHERE key IN ( SELECT key1 FROM other_table )
expression, supposing it were implemented as a scan of
other_tableand an index lookup followed byDELETEon table.The meaning of
INSERTis that the triples in question exist after the operation, and the meaning ofDELETEis that said triples do not exist. In a transactional context, this means that the after-image of the transaction is guaranteed either to have or not-have said triples.Suppose that the triples
{ 1 0 0 },{ 1 5 6 }, and{ 1 5 7 }exist in the beginning. If weDELETE { 1 ?x ?y }and concurrentlyINSERT { 1 2 4 . 1 2 3 . 1 3 5 }, then whichever was considered to be first by the concurrency control of the DBMS would complete first, and the other after that. Thus the end state would either have no triples with subject1or would have the three just inserted.Suppose the
INSERTinserts the first triple,{ 1 2 4 }. TheDELETEat the same time reads all triples with subject1. The exclusive read waits for the uncommittedINSERT. TheINSERTthen inserts the second triple,{ 1 2 3 }. Depending on the isolation of the read, this either succeeds, since no{ 1 2 3 }was read, or causes a deadlock. The first corresponds toREPEATABLE READisolation; the second toSERIALIZABLE.We would not get the desired end-state of either all the inserted triples or no triples with subject
1if the read or theDELETEwere not serializable.Furthermore if a
Read and UpdateDELETEtemplate produced a triple that did not exist in the pre-image, theDELETEsemantics still imply that this also does not exist in the after-image, which implies serializability.Let us consider the prototypical transaction example of transferring funds from one account to another. Two balances are updated, and a history record is inserted.
The initial state is
a balance 10 b balance 10
We transfer
1fromatob, and at the same time transfer2frombtoa. The end state must haveaat11andbat9.A relational database needs
REPEATABLE READisolation for this.With RDF,
txn1reads thatahas abalanceof10. At the same time,txn1reads thebalanceofa.txn2waits because the read oftxn1is exclusive.txn1proceeds and read thebalanceofb. It then updates thebalanceofaandb.All goes without the deadlock which is always cited in this scenario, because the locks are acquired in the same order. The act of updating the balance of
a, since RDF does not really have an update-in-place, consists of deleting{ a balance 10 }and inserting{ a balance 9 }. This gets done andtxn1commits. At this point,txn2proceeds after its wait on the row that stated{ a balance 10 }. This row is now gone, andtxn2sees thatahas no balance, which is quite possible in RDF's schema-less model.We see that
REPEATABLE READis not adequate with RDF, even though it is with relational. The reason why there is noUPDATE-in-place is that thePRIMARY KEYof the triple includes all the parts, including the object. Even in a RDBMS, anUPDATEof a primary key part amounts to aDELETE-plus-INSERT. One could here argue that an implementation might stillUPDATE-in-place if the key order were not changed. This would resolve the special case of the accounts but not a more general case.Thus we see that the read of the balance must be
locking order and OLTPSERIALIZABLE. This means that the read locks the space before the first balance, so that no insertion may take place. In this way the read oftxn2waits on the lock that is conceptually before the first possible match of{ a balance ?x }.To implement TPC-C, I would update the table with the highest cardinality first, and then all tables in descending order of cardinality. In this way, the locks with the highest likelihood for contention are held for the least time. If locking multiple rows of a table, these should be locked in a deterministic order, e.g., lowest key-value first. In this way, the workload would not deadlock. In actual fact, with clusters and parallel execution, the lock acquisition will not be guaranteed to be serial, so deadlocks do not entirely go away, but still may get fewer. Besides, any outside transaction might still lock in the wrong order and cause deadlocks, which is why the OLTP application must in any case be built to deal with the possibility of deadlock.
This is the conventional relational view of the matter. In more recent times, in-memory schemes with deterministic lock acquisition (Abadi VLDB 2010) or single-threaded atomic execution of transactions (Uni Munich BIRTE workshop at VLDB2010, VoltDB) have been proposed. There the transaction is described as a stored procedure, possibly with extra annotations. These techniques might apply to RDF also. RDF is however an unlikely model for transaction-intensive applications, so we will not for now examine these further.
RDBMS usually implement row-level locking. This means that once a column of a row has an uncommitted state, any other transaction is prevented from changing the row. This has no ready RDF equivalent. RDF is usually implemented as a row-per-triple system and applying row-level locking to this does not give the semantic one expects of a relational row.
I would argue that it is not essential to enforce transactional guarantees in units of rows. The guarantees must apply between data that is read and written by a transaction. It does not need to apply to columns that the transaction does not reference. To take the TPC-C example, the new order transaction updates the stock level and the delivery transaction updates the delivery count on the stock table. In practice, a delivery and a new order falling on the same row of stock will lock each other out, but nothing in the semantics of the workload mandates this.
It does not seem a priori necessary to recreate the row as a unit of concurrency control in RDF. One could say that a multi-attribute whole (such as an address) ought to be atomic for concurrency control, but then applications updating addresses will most likely read and update all the fields together even if only the street name changes.
Pessimistic Vs. Optimistic Concurrency ControlWe have so far spoken only in terms of row-level locking, which is to my knowledge the most widely used model in RDBMS, and one we implement ourselves. Some databases (e.g., MonetDB and VectorWise) implement optimistic concurrency control. The general idea is that each transaction has a read and write set and when a transaction commits, any other transactions whose read or write set intersects with the write set of the committing transaction are marked un-committable. Once a transaction thus becomes un-committable, it may presumably continue reading indefinitely but may no longer commit its updates. Optimistic concurrency is generally coupled with multi-version semantics where the pre-image of a transaction is a clean committed state of the database as of a specific point in time, i.e., snapshot isolation.
To implement
SERIALIZABLEisolation, i.e., the guarantee that if a transaction twice performs aCOUNTthe result will be the same, one locks also the row that precedes the set of selected rows and marks each lock so as to prevent an insert to the right of the lock in key order. The same thing may be done in an optimistic setting.Positional Handling of Updates in Column Stores [Heman, Zukowski, CWI science library] discusses management of multiple consecutive snapshots in some detail. The paper does not go into the details of different levels of isolation but nothing there suggests that serializability could not be supported. There is some complexity in marking the space between ordered rows as non-insertable across multiple versions but this should be feasible enough.
The issue of optimistic Vs. pessimistic concurrency does not seem to be affected by the differences between RDF and relational models. We note that an OLTP workload can be made to run with very few transaction aborts (deadlocks) by properly ordering operations when using a locking scheme. The same does not work with optimistic concurrency since updates happen immediately and transaction aborts occur whenever the writes of one intersect the reads or writes of another, regardless of the order in which these were made.
Developers seldom understand transactions; therefore DBMS should, within the limits of the possible, optimize locking order for locking schemes. A simple example is locking in key order when doing an operation on a set of values. A more complex variant would consist of analyzing data dependencies in stored procedures and reordering updates so as to get the highest cardinality tables first. We note that this latter trick also benefits optimistic schemes.
In RDF, the same principles apply but distinguishing cardinality of an updated set will have to rely on statistics of predicate cardinality. Such are anyhow needed for query optimization.
Eventual ConsistencyWeb scale systems that need to maintain consistent state across multiple data centers sometimes use "eventual consistency" schemes. Two-phase-commit becomes very inefficient as latency increases, thus strict transactional semantics have prohibitive cost if the system is more distributed than a cluster with a fast interconnect.
Eventual consistency schemes (Amazon Dynamo, Yahoo! PNUTS) maintain history information on the record which is the unit of concurrency control. The record is typically a non-first normal form chunk of related data that it makes sense to store together from the application's viewpoint. Application logic can then be applied to reconciling differing copies of the same logical record.
Such a scheme seems a priori ill-suited for RDF, where the natural unit of concurrency control would seem to be the quad. We first note that only recently changed (i.e.,
DELETEd + INSERTedquads, as there is noUPDATE-in-place) need history information. This history information can be stored away from the quad itself, thus not disrupting compression. When detecting that one site hasINSERTeda quad that another hasDELETEdin the same general time period, application logic can still be applied for reading related quads in order to arrive at a decision on how to reconcile two databases that have diverged. The same can apply to conflicting values of properties that for the application should be single-valued. Comparing time-stamped transaction logs on quads is not fundamentally different from comparing record histories in Dynamo or PNUTS.As we overcome the data size penalties that have until recently been associated with RDF, RDF becomes even more interesting as a data model for large online systems such as social network platforms where frequent application changes lead to volatility of schema. Key value stores are currently found in such applications, but they generally do not provide the query flexibility at which RDF excels.
ConclusionsWe have gone over basic aspects of the endlessly complex and variable topic of transactions, and drawn parallels as well as outlined two basic differences between relational and RDF systems: What used to be
REPEATABLE READbecomesSERIALIZABLE; and row-level locking becomes locking at the level of a single attribute value. For the rest, we see that the optimistic and pessimistic modes of concurrency control, as well as guidelines for writing transaction procedures, remain much the same.Based on this overview, it should be possible to design an ACID test for describing the ACID behavior of benchmarked systems. We do not intend to make transaction support a qualification requirement for an RDF benchmark, but information on transaction support will still be valuable in comparing different systems.
-
22:52
RDF and Transactions
» OpenLink Community BlogI will here talk about RDF and transactions for developers in general. The next one talks about specifics and is for specialists.
Transactions are certainly not the first thing that comes to mind when one hears "RDF". We have at times used a recruitment questionnaire where we ask applicants to define a transaction. Many vaguely remember that it is a unit of work, but usually not more than that. We sometimes get questions from users about why they get an error message that says "deadlock". "Deadlock" is what happens when multiple users concurrently update balances on multiple bank accounts in the wrong order. What does this have to do with RDF?
There are in fact users who even use XA with a Virtuoso-based RDF application. Franz also has publicized their development of full ACID capabilities for AllegroGraph. RDF is a database schema model, and transactions will inevitably become an issue in databases.
At the same time, the developer population trained with MySQL and PHP is not particularly transaction-aware. Transactions have gone out of style, declares the No-SQL crowd. Well, it is not so much SQL they object to but ACID, i.e., transactional guarantees. We will talk more about this in the next post. The SPARQL language and protocol do not go into transactions, except for expressing the wish that an
UPDATErequest to an end-point be atomic. But beware -- atomicity is a gateway drug, and soon one finds oneself on full ACID.If one says that a thing will either happen in its entirety or not at all, which is what (A) atomicity means, then the question arises of (I) isolation; that is, what happens if somebody else does something to the same data at the same time? Then comes the question of whether a thing, once having happened, will stay that way; i.e., (D) durability. Finally, there is (C) consistency, which means that the transaction's result must not contradict restrictions the database is supposed to enforce. RDF usually has no restrictions; thus consistency mostly means that the internal state of the DBMS must be consistent, e.g., different indices on triples/quads should contain the same data.
There are, of course, database-like consistency criteria that one can express in RDF Schema and OWL, concerning data types, mandatory presence of properties, or restrictions on cardinality (i.e., one may only have one spouse at a time, and the like).
If one indeed did enforce them all, then RDF would be very like the relational model -- with all the restrictions, but without the 40 years of work on RDBMS performance. For this reason, RDF use tends to involve data that is not structured enough to be a good fit for RDBMS.
There is of course the OWL side, where consistency is important but is defined in such complex ways that they again are not a good fit for RDBMS. RDF could be seen to be split between the schema-last world and the knowledge representation world. I will here focus on the schema-last side.
Transactions are relevant in RDF in two cases: 1. If data is trickle loaded in small chunks, one likes to know that the chunks do not get lost or corrupted; 2. If the application has any semantics that reserve resources, then these operations need transactions. The latter is not so common with RDF but examples include read-write situations, like checking if a seat is available and then reserving it. Transactionality guarantees that the same seat does not get reserved twice.
Web people argue with some justification that since the four cardinal virtues of database never existed on the web to begin with, applying strict ACID to web data is beside the point, like locking the stable after the horse has long since run away. This may be so; yet the systems used for processing data, whether that data is dirty or not, benefit from predictable operation under concurrency and from not losing data.
Analytics workloads are not primarily about transactions, but still need to specify what happens with updates. Analyzing data from measurements may not have concurrent updates, but there the transaction issue is replaced by the question of making explicit how the data was acquired and what processing has been applied to it before storage.
As mentioned before, the LOD2 project is at the crossroads of RDF and database. I construe its mission to be the making of RDF into a respectable database discipline. Database respectability in turn is as good as inconceivable without addressing the very bedrock on which this science was founded: transactions.
As previously argued, we need well-defined and auditable benchmarks. This again brings up the topic of transactions. Once we embark on the database benchmark route, there is no way around this. TPC-H mandates that the system under test support transactions, and the audit involves a test for this. We can do no less.
This has led me to more closely examine the issue of RDF and transactions, and whether there exist differences between transactions applied to RDF and to relational data.
As concerns Virtuoso, our position has been that one can get full ACID in Virtuoso, whether in SQL or SPARQL, by using a connected client (e.g., ODBC, JDBC, or the Jena or Sesame frameworks), and setting the isolation options on the connection. Having taken this step, one then must take the next step, which consists of dealing with deadlocks; i.e., with concurrent utilization, it may happen that the database at any time notifies the client that the transaction got aborted and the client must retry.
Web developers especially do not like this, because this is not what MySQL has taught them to expect. MySQL does have transactional back-ends like InnoDB, but often gets used without transactions.
With the March 2011 Virtuoso releases, we have taken a closer look at transactions with RDF. It is more practical to reduce the possibility of errors than to require developers to pay attention. For this reason we have automated isolation settings for RDF, greatly reduced the incidence of deadlocks, and even incorporated automatic deadlock retries where applicable.
If all users lock resources they need in the same order, there will be no deadlocks. This is what we do with RDF load in Virtuoso 7; thus any mix of concurrent
INSERTsandDELETEs, if these are under a certain size (normally 10000 quads) are guaranteed never to fail due to locking. These could still fail due to running out of space, though. With previous versions, there always was a possibility of having anINSERTorDELETEfail because of deadlock with multiple users. VectoredINSERTandDELETEare sufficient for making web crawling or archive maintenance practically deadlock free, since there the primary transaction is theINSERTorDELETEof a small graph.Furthermore, since the SPARQL protocol has no way of specifying transactions consisting of multiple client-server exchanges, the SPARQL end-point may deal with deadlocks by itself. If all else fails, it can simply execute requests one after the other, thus eliminating any possibility of locking. We note that many statements will be intrinsically free of deadlocks by virtue of always locking in key order, but this cannot be universally guaranteed with arbitrary size operations; thus concurrent operations might still sometimes deadlock. Anyway, vectored execution as introduced in Virtuoso 7, besides getting easily double-speed random access, also greatly reduces deadlocks by virtue of ordering operations.
In the next post we will talk about what transactions mean with RDF and whether there is any difference with the relational model.
-
22:52
RDF and Transactions
» OpenLink Community BlogI will here talk about RDF and transactions for developers in general. The next one talks about specifics and is for specialists.
Transactions are certainly not the first thing that comes to mind when one hears "RDF". We have at times used a recruitment questionnaire where we ask applicants to define a transaction. Many vaguely remember that it is a unit of work, but usually not more than that. We sometimes get questions from users about why they get an error message that says "deadlock". "Deadlock" is what happens when multiple users concurrently update balances on multiple bank accounts in the wrong order. What does this have to do with RDF?
There are in fact users who even use XA with a Virtuoso-based RDF application. Franz also has publicized their development of full ACID capabilities for AllegroGraph. RDF is a database schema model, and transactions will inevitably become an issue in databases.
At the same time, the developer population trained with MySQL and PHP is not particularly transaction-aware. Transactions have gone out of style, declares the No-SQL crowd. Well, it is not so much SQL they object to but ACID, i.e., transactional guarantees. We will talk more about this in the next post. The SPARQL language and protocol do not go into transactions, except for expressing the wish that an
UPDATErequest to an end-point be atomic. But beware -- atomicity is a gateway drug, and soon one finds oneself on full ACID.If one says that a thing will either happen in its entirety or not at all, which is what (A) atomicity means, then the question arises of (I) isolation; that is, what happens if somebody else does something to the same data at the same time? Then comes the question of whether a thing, once having happened, will stay that way; i.e., (D) durability. Finally, there is (C) consistency, which means that the transaction's result must not contradict restrictions the database is supposed to enforce. RDF usually has no restrictions; thus consistency mostly means that the internal state of the DBMS must be consistent, e.g., different indices on triples/quads should contain the same data.
There are, of course, database-like consistency criteria that one can express in RDF Schema and OWL, concerning data types, mandatory presence of properties, or restrictions on cardinality (i.e., one may only have one spouse at a time, and the like).
If one indeed did enforce them all, then RDF would be very like the relational model -- with all the restrictions, but without the 40 years of work on RDBMS performance. For this reason, RDF use tends to involve data that is not structured enough to be a good fit for RDBMS.
There is of course the OWL side, where consistency is important but is defined in such complex ways that they again are not a good fit for RDBMS. RDF could be seen to be split between the schema-last world and the knowledge representation world. I will here focus on the schema-last side.
Transactions are relevant in RDF in two cases: 1. If data is trickle loaded in small chunks, one likes to know that the chunks do not get lost or corrupted; 2. If the application has any semantics that reserve resources, then these operations need transactions. The latter is not so common with RDF but examples include read-write situations, like checking if a seat is available and then reserving it. Transactionality guarantees that the same seat does not get reserved twice.
Web people argue with some justification that since the four cardinal virtues of database never existed on the web to begin with, applying strict ACID to web data is beside the point, like locking the stable after the horse has long since run away. This may be so; yet the systems used for processing data, whether that data is dirty or not, benefit from predictable operation under concurrency and from not losing data.
Analytics workloads are not primarily about transactions, but still need to specify what happens with updates. Analyzing data from measurements may not have concurrent updates, but there the transaction issue is replaced by the question of making explicit how the data was acquired and what processing has been applied to it before storage.
As mentioned before, the LOD2 project is at the crossroads of RDF and database. I construe its mission to be the making of RDF into a respectable database discipline. Database respectability in turn is as good as inconceivable without addressing the very bedrock on which this science was founded: transactions.
As previously argued, we need well-defined and auditable benchmarks. This again brings up the topic of transactions. Once we embark on the database benchmark route, there is no way around this. TPC-H mandates that the system under test support transactions, and the audit involves a test for this. We can do no less.
This has led me to more closely examine the issue of RDF and transactions, and whether there exist differences between transactions applied to RDF and to relational data.
As concerns Virtuoso, our position has been that one can get full ACID in Virtuoso, whether in SQL or SPARQL, by using a connected client (e.g., ODBC, JDBC, or the Jena or Sesame frameworks), and setting the isolation options on the connection. Having taken this step, one then must take the next step, which consists of dealing with deadlocks; i.e., with concurrent utilization, it may happen that the database at any time notifies the client that the transaction got aborted and the client must retry.
Web developers especially do not like this, because this is not what MySQL has taught them to expect. MySQL does have transactional back-ends like InnoDB, but often gets used without transactions.
With the March 2011 Virtuoso releases, we have taken a closer look at transactions with RDF. It is more practical to reduce the possibility of errors than to require developers to pay attention. For this reason we have automated isolation settings for RDF, greatly reduced the incidence of deadlocks, and even incorporated automatic deadlock retries where applicable.
If all users lock resources they need in the same order, there will be no deadlocks. This is what we do with RDF load in Virtuoso 7; thus any mix of concurrent
INSERTsandDELETEs, if these are under a certain size (normally 10000 quads) are guaranteed never to fail due to locking. These could still fail due to running out of space, though. With previous versions, there always was a possibility of having anINSERTorDELETEfail because of deadlock with multiple users. VectoredINSERTandDELETEare sufficient for making web crawling or archive maintenance practically deadlock free, since there the primary transaction is theINSERTorDELETEof a small graph.Furthermore, since the SPARQL protocol has no way of specifying transactions consisting of multiple client-server exchanges, the SPARQL end-point may deal with deadlocks by itself. If all else fails, it can simply execute requests one after the other, thus eliminating any possibility of locking. We note that many statements will be intrinsically free of deadlocks by virtue of always locking in key order, but this cannot be universally guaranteed with arbitrary size operations; thus concurrent operations might still sometimes deadlock. Anyway, vectored execution as introduced in Virtuoso 7, besides getting easily double-speed random access, also greatly reduces deadlocks by virtue of ordering operations.
In the next post we will talk about what transactions mean with RDF and whether there is any difference with the relational model.
-
22:32
Benchmarks, Redux (part 15): BSBM Test Driver Enhancements
» OpenLink Community BlogThis article covers the changes we have made to the BSBM test driver during our series of experiments.
-
Drill-down mode - For queries that have a product type as parameter, the test driver will invoke the query multiple times with each time a random subtype of the product type of the previous invocation. The starting point of the drill-down is an a random type from a settable level in the hierarchy. The rationale for the drill-down mode is that depending on the parameter choice, there can be 1000x differences in query run time. Thus run times of consecutive query mixes will be incomparable unless we guarantee that each mix has a predictable number of queries with a product type from each level in the hierarchy.
- Permutation of query mix - In the BI workload, the queries are run in a random order on each thread in multiuser mode. Doing exactly the same thing on many threads is not realistic for large queries. The data access patterns must be spread out in order to evaluate how bulk IO is organized with differing concurrent demands. The permutations are deterministic on consecutive runs and do not depend on the non-deterministic timing of concurrent activities. For queries with a drill-down, the individual executions that make up the drill-down are still consecutive.
-
New metrics - The BI Power is the geometric mean of query run times scaled to queries per hour and multiplied by the scale factor, where 100 Mt is considered the unit scale. The BI Throughput is the arithmetic mean of the run times scaled to QPH and adjusted to scale as with the Power metric. These are analogous to the TPC-H Power and Throughput metrics.
The Power is defined as
(scale_factor / 284826) * 3600 / ((t0 * t1 * ... * tn) ^(1 / n))
The Throughput is defined as
(scale_factor / 284826) * 3600 / ((t0 + t2 + ... + tn) / n)
The magic number 284826 is the scale that generates approximately 100 million triples (100 Mt). We consider this "scale one." The reason for the multiplication is that scores at different scales should get similar numbers, otherwise 10x larger scale would result roughly in 10x lower throughput with the BI queries.
We also show the percentage each query represents from the total time the test driver waits for responses.
-
Deadlock retry - When running update mixes, it is possible that a transaction gets aborted by a deadlock. We have made a retry logic for this.
-
Cluster mode - Cluster databases may have multiple interchangeable HTTP listeners. With this mode, one can specify multiple end-points so a multi-user workload can divide itself evenly over these.
-
Identifying matter - A version number was added to test driver output. Use of the new switches is also indicated in the test driver output.
-
SUT CPU - In comparing results it is crucial to differentiate between in memory runs and IO bound runs. To make this easier, we have added an option to report server CPU times over the timed portion (excluding warm-ups). A pluggable self-script determines the CPU times for the system; thus clusters can be handled, too. The time is given as a sum of the time the server processes have aged during the run and as a percentage over the wall-clock time.
These changes will soon be available as a diff and as a source tree. This version is labeled
BSBM Test Driver 1.1-opl; the-oplsignifies OpenLink additions.We invite FU Berlin to include these enhancements into their Source Forge repository of the BSBM test driver. There is more precise documentation of these options in the README file in the above distribution.
The next planned upgrade of the test driver concerns adding support for "RDF-H", the RDF adaptation of the industry standard TPC-H decision support benchmark for RDBMS.
Benchmarks, Redux Series- Benchmarks, Redux (part 1): On RDF Benchmarks
- Benchmarks, Redux (part 2): A Benchmarking Story
- Benchmarks, Redux (part 3): Virtuoso 7 vs 6 on BSBM Load and Explore
- Benchmarks, Redux (part 4): Benchmark Tuning Questionnaire
- Benchmarks, Redux (part 5): BSBM and I/O; HDDs and SSDs
- Benchmarks, Redux (part 6): BSBM and I/O, continued
- Benchmarks, Redux (part 7): What Does BSBM Explore Measure?
- Benchmarks, Redux (part 8): BSBM Explore and Update
- Benchmarks, Redux (part 9): BSBM With Cluster
- Benchmarks, Redux (part 10): LOD2 and the Benchmark Process
- Benchmarks, Redux (part 11): The Substance of Benchmarks
- Benchmarks, Redux (part 12): Our Own BSBM Results Report
- Benchmarks, Redux (part 13): BSBM BI Modifications
- Benchmarks, Redux (part 14): BSBM BI Mix
- Benchmarks, Redux (part 15): BSBM Test Driver Enhancements (this post)
-
-
22:32
Benchmarks, Redux (part 15): BSBM Test Driver Enhancements
» OpenLink Community BlogThis article covers the changes we have made to the BSBM test driver during our series of experiments.
-
Drill-down mode - For queries that have a product type as parameter, the test driver will invoke the query multiple times with each time a random subtype of the product type of the previous invocation. The starting point of the drill-down is an a random type from a settable level in the hierarchy. The rationale for the drill-down mode is that depending on the parameter choice, there can be 1000x differences in query run time. Thus run times of consecutive query mixes will be incomparable unless we guarantee that each mix has a predictable number of queries with a product type from each level in the hierarchy.
- Permutation of query mix - In the BI workload, the queries are run in a random order on each thread in multiuser mode. Doing exactly the same thing on many threads is not realistic for large queries. The data access patterns must be spread out in order to evaluate how bulk IO is organized with differing concurrent demands. The permutations are deterministic on consecutive runs and do not depend on the non-deterministic timing of concurrent activities. For queries with a drill-down, the individual executions that make up the drill-down are still consecutive.
-
New metrics - The BI Power is the geometric mean of query run times scaled to queries per hour and multiplied by the scale factor, where 100 Mt is considered the unit scale. The BI Throughput is the arithmetic mean of the run times scaled to QPH and adjusted to scale as with the Power metric. These are analogous to the TPC-H Power and Throughput metrics.
The Power is defined as
(scale_factor / 284826) * 3600 / ((t0 * t1 * ... * tn) ^(1 / n))
The Throughput is defined as
(scale_factor / 284826) * 3600 / ((t0 + t2 + ... + tn) / n)
The magic number 284826 is the scale that generates approximately 100 million triples (100 Mt). We consider this "scale one." The reason for the multiplication is that scores at different scales should get similar numbers, otherwise 10x larger scale would result roughly in 10x lower throughput with the BI queries.
We also show the percentage each query represents from the total time the test driver waits for responses.
-
Deadlock retry - When running update mixes, it is possible that a transaction gets aborted by a deadlock. We have made a retry logic for this.
-
Cluster mode - Cluster databases may have multiple interchangeable HTTP listeners. With this mode, one can specify multiple end-points so a multi-user workload can divide itself evenly over these.
-
Identifying matter - A version number was added to test driver output. Use of the new switches is also indicated in the test driver output.
-
SUT CPU - In comparing results it is crucial to differentiate between in memory runs and IO bound runs. To make this easier, we have added an option to report server CPU times over the timed portion (excluding warm-ups). A pluggable self-script determines the CPU times for the system; thus clusters can be handled, too. The time is given as a sum of the time the server processes have aged during the run and as a percentage over the wall-clock time.
These changes will soon be available as a diff and as a source tree. This version is labeled
BSBM Test Driver 1.1-opl; the-oplsignifies OpenLink additions.We invite FU Berlin to include these enhancements into their Source Forge repository of the BSBM test driver. There is more precise documentation of these options in the README file in the above distribution.
The next planned upgrade of the test driver concerns adding support for "RDF-H", the RDF adaptation of the industry standard TPC-H decision support benchmark for RDBMS.
Benchmarks, Redux Series- Benchmarks, Redux (part 1): On RDF Benchmarks
- Benchmarks, Redux (part 2): A Benchmarking Story
- Benchmarks, Redux (part 3): Virtuoso 7 vs 6 on BSBM Load and Explore
- Benchmarks, Redux (part 4): Benchmark Tuning Questionnaire
- Benchmarks, Redux (part 5): BSBM and I/O; HDDs and SSDs
- Benchmarks, Redux (part 6): BSBM and I/O, continued
- Benchmarks, Redux (part 7): What Does BSBM Explore Measure?
- Benchmarks, Redux (part 8): BSBM Explore and Update
- Benchmarks, Redux (part 9): BSBM With Cluster
- Benchmarks, Redux (part 10): LOD2 and the Benchmark Process
- Benchmarks, Redux (part 11): The Substance of Benchmarks
- Benchmarks, Redux (part 12): Our Own BSBM Results Report
- Benchmarks, Redux (part 13): BSBM BI Modifications
- Benchmarks, Redux (part 14): BSBM BI Mix
- Benchmarks, Redux (part 15): BSBM Test Driver Enhancements (this post)
-
-
22:31
Benchmarks, Redux (part 14): BSBM BI Mix
» OpenLink Community BlogIn this post, we look at how we run the BSBM-BI mix. We consider the 100 Mt and 1000 Mt scales with Virtuoso 7 using the same hardware and software as in the previous posts. The changes to workload and metric are given in the previous post.
Our intent here is to look at whether the metric works, and to see what results will look like in general. We are as much testing the benchmark as we are testing the system-under-test (SUT). The results shown here will likely not be comparable with future ones because we will most likely change the composition of the workload since it seems a bit out of balance. Anyway, for the sake of disclosure, we attach the query templates. The test driver we used will be made available soon, so the interested may still try a comparison with their systems. If you practice with this workload for the coming races, the effort will surely not be wasted.
Once we have come up with a rules document, we will redo all that we have published so far by-the-book, and have it audited as part of the LOD2 service we plan for this (see previous posts in this series). This will introduce comparability; but before we get that far with the BI workload, the workload needs to evolve a bit.
Below we show samples of test driver output; the whole output is downloadable.
100 Mt Single User
bsbm/testdriver -runs 1 -w 0 -idir /bs/1 -drill -ucf bsbm/usecases/businessIntelligence/sparql.txt -dg http://bsbm.org [localhost:8604]
0: 43348.14ms, total: 43440ms Scale factor: 284826 Explore Endpoints: 1 Update Endpoints: 1 Drilldown: on Number of warmup runs: 0 Seed: 808080 Number of query mix runs (without warmups): 1 times min/max Querymix runtime: 43.3481s / 43.3481s Elapsed runtime: 43.348 seconds QMpH: 83.049 query mixes per hour CQET: 43.348 seconds average runtime of query mix CQET (geom.): 43.348 seconds geometric mean runtime of query mix AQET (geom.): 0.492 seconds geometric mean runtime of query Throughput: 1494.874 BSBM-BI throughput: qph*scale BI Power: 7309.820 BSBM-BI Power: qph*scale (geom)
100 Mt 8 User
Thread 6: query mix 3: 195793.09ms, total: 196086.18ms Thread 8: query mix 0: 197843.84ms, total: 198010.50ms Thread 7: query mix 4: 201806.28ms, total: 201996.26ms Thread 2: query mix 5: 221983.93ms, total: 222105.96ms Thread 4: query mix 7: 225127.55ms, total: 225317.49ms Thread 3: query mix 6: 225860.49ms, total: 226050.17ms Thread 5: query mix 2: 230884.93ms, total: 231067.61ms Thread 1: query mix 1: 237836.61ms, total: 237959.11ms Benchmark run completed in 237.985427s Scale factor: 284826 Explore Endpoints: 1 Update Endpoints: 1 Drilldown: on Number of warmup runs: 0 Number of clients: 8 Seed: 808080 Number of query mix runs (without warmups): 8 times min/max Querymix runtime: 195.7931s / 237.8366s Total runtime (sum): 1737.137 seconds Elapsed runtime: 1737.137 seconds QMpH: 121.016 query mixes per hour CQET: 217.142 seconds average runtime of query mix CQET (geom.): 216.603 seconds geometric mean runtime of query mix AQET (geom.): 2.156 seconds geometric mean runtime of query Throughput: 2178.285 BSBM-BI throughput: qph*scale BI Power: 1669.745 BSBM-BI Power: qph*scale (geom)
1000 Mt Single User
0: 608707.03ms, total: 608768ms Scale factor: 2848260 Explore Endpoints: 1 Update Endpoints: 1 Drilldown: on Number of warmup runs: 0 Seed: 808080 Number of query mix runs (without warmups): 1 times min/max Querymix runtime: 608.7070s / 608.7070s Elapsed runtime: 608.707 seconds QMpH: 5.914 query mixes per hour CQET: 608.707 seconds average runtime of query mix CQET (geom.): 608.707 seconds geometric mean runtime of query mix AQET (geom.): 5.167 seconds geometric mean runtime of query Throughput: 1064.552 BSBM-BI throughput: qph*scale BI Power: 6967.325 BSBM-BI Power: qph*scale (geom)
1000 Mt 8 User
bsbm/testdriver -runs 8 -mt 8 -w 0 -idir /bs/10 -drill -ucf bsbm/usecases/businessIntelligence/sparql.txt -dg [bsbm.org] http://localhost:8604/sparql
Thread 3: query mix 4: 2211275.25ms, total: 2211371.60ms Thread 4: query mix 0: 2212316.87ms, total: 2212417.99ms Thread 8: query mix 3: 2275942.63ms, total: 2276058.03ms Thread 5: query mix 5: 2441378.35ms, total: 2441448.66ms Thread 6: query mix 7: 2804001.05ms, total: 2804098.81ms Thread 2: query mix 2: 2808374.66ms, total: 2808473.71ms Thread 1: query mix 6: 2839407.12ms, total: 2839510.63ms Thread 7: query mix 1: 2889199.23ms, total: 2889263.17ms Benchmark run completed in 2889.302566s Scale factor: 2848260 Explore Endpoints: 1 Update Endpoints: 1 Drilldown: on Number of warmup runs: 0 Number of clients: 8 Seed: 808080 Number of query mix runs (without warmups): 8 times min/max Querymix runtime: 2211.2753s / 2889.1992s Total runtime (sum): 20481.895 seconds Elapsed runtime: 20481.895 seconds QMpH: 9.968 query mixes per hour CQET: 2560.237 seconds average runtime of query mix CQET (geom.): 2544.284 seconds geometric mean runtime of query mix AQET (geom.): 13.556 seconds geometric mean runtime of query Throughput: 1794.205 BSBM-BI throughput: qph*scale BI Power: 2655.678 BSBM-BI Power: qph*scale (geom) Metrics for Query: 1 Count: 8 times executed in whole run Time share 2.120884% of total execution time AQET: 54.299656 seconds (arithmetic mean) AQET(geom.): 34.607302 seconds (geometric mean) QPS: 0.13 Queries per second minQET/maxQET: 11.71547600s / 148.65379700s Metrics for Query: 2 Count: 8 times executed in whole run Time share 0.207382% of total execution time AQET: 5.309462 seconds (arithmetic mean) AQET(geom.): 2.737696 seconds (geometric mean) QPS: 1.34 Queries per second minQET/maxQET: 0.78729800s / 25.80948200s Metrics for Query: 3 Count: 8 times executed in whole run Time share 17.650472% of total execution time AQET: 451.893890 seconds (arithmetic mean) AQET(geom.): 410.481088 seconds (geometric mean) QPS: 0.02 Queries per second minQET/maxQET: 171.07262500s / 721.72939200s Metrics for Query: 5 Count: 32 times executed in whole run Time share 6.196565% of total execution time AQET: 39.661685 seconds (arithmetic mean) AQET(geom.): 6.849882 seconds (geometric mean) QPS: 0.18 Queries per second minQET/maxQET: 0.15696500s / 189.00906200s Metrics for Query: 6 Count: 8 times executed in whole run Time share 0.119916% of total execution time AQET: 3.070136 seconds (arithmetic mean) AQET(geom.): 2.056059 seconds (geometric mean) QPS: 2.31 Queries per second minQET/maxQET: 0.41524400s / 7.55655300s Metrics for Query: 7 Count: 40 times executed in whole run Time share 1.577963% of total execution time AQET: 8.079921 seconds (arithmetic mean) AQET(geom.): 1.342079 seconds (geometric mean) QPS: 0.88 Queries per second minQET/maxQET: 0.02205800s / 40.27761500s Metrics for Query: 8 Count: 40 times executed in whole run Time share 72.126818% of total execution time AQET: 369.323481 seconds (arithmetic mean) AQET(geom.): 114.431863 seconds (geometric mean) QPS: 0.02 Queries per second minQET/maxQET: 5.94377300s / 1824.57867400s
The CPU for the multiuser runs stays above 1500% for the whole run. The CPU for the single user 100 Mt run is 630%; for the 1000 Mt run, this is 574%. This can be improved since the queries usually have a lot of data to work on. But final optimization is not our goal yet; we are just surveying the race track. The difference between a warm single user run and a cold single user run is about 15% with data on SSD; with data on disk, this would be more. The numbers shown are with warm cache. The single-user and multi-user Throughput difference, 1064 single-user vs. 1794 multi-user, is about what one would expect from the CPU utilization.
With these numbers, the CPU does not appear badly memory-bound, else the increase would be less; also core multi-threading seems to bring some benefit. If the single-user run was at 800%, the Throughput would be 1488. The speed in excess of this may be attributed to core multi-threading, although we must remember that not every query mix is exactly the same length, so the figure is not exact. Core multi-threading does not seem to hurt, at the very least. Comparison of the same numbers with the column store will be interesting since it misses the cache a lot less and accordingly has better SMP scaling. The Intel Nehalem memory subsystem is really pretty good.
For reference, we show a run with Virtuoso 6 at 100Mt.
0: 424754.40ms, total: 424829ms Scale factor: 284826 Explore Endpoints: 1 Update Endpoints: 1 Drilldown: on Number of warmup runs: 0 Seed: 808080 Number of query mix runs (without warmups): 1 times min/max Querymix runtime: 424.7544s / 424.7544s Elapsed runtime: 424.754 seconds QMpH: 8.475 query mixes per hour CQET: 424.754 seconds average runtime of query mix CQET (geom.): 424.754 seconds geometric mean runtime of query mix AQET (geom.): 1.097 seconds geometric mean runtime of query Throughput: 152.559 BSBM-BI throughput: qph*scale BI Power: 3281.150 BSBM-BI Power: qph*scale (geom)
and 8 user
Thread 5: query mix 3: 616997.86ms, total: 617042.83ms Thread 7: query mix 4: 625522.18ms, total: 625559.09ms Thread 3: query mix 7: 626247.62ms, total: 626304.96ms Thread 1: query mix 0: 629675.17ms, total: 629724.98ms Thread 4: query mix 6: 667633.36ms, total: 667670.07ms Thread 8: query mix 2: 674206.07ms, total: 674256.72ms Thread 6: query mix 5: 695020.21ms, total: 695052.29ms Thread 2: query mix 1: 701824.67ms, total: 701864.91ms Benchmark run completed in 701.909341s Scale factor: 284826 Explore Endpoints: 1 Update Endpoints: 1 Drilldown: on Number of warmup runs: 0 Number of clients: 8 Seed: 808080 Number of query mix runs (without warmups): 8 times min/max Querymix runtime: 616.9979s / 701.8247s Total runtime (sum): 5237.127 seconds Elapsed runtime: 5237.127 seconds QMpH: 41.031 query mixes per hour CQET: 654.641 seconds average runtime of query mix CQET (geom.): 653.873 seconds geometric mean runtime of query mix AQET (geom.): 2.557 seconds geometric mean runtime of query Throughput: 738.557 BSBM-BI throughput: qph*scale BI Power: 1408.133 BSBM-BI Power: qph*scale (geom)
Having the numbers, let us look at the metric and its scaling. We take the geometric mean of the single-user Power and the multiuser Throughput.
100 Mt: sqrt ( 7771 * 2178 ); = 4114 1000 Mt: sqrt ( 6967 * 1794 ); = 3535
Scaling seems to work; the results are in the same general ballpark. The real times for the 1000 Mt run are a bit over 10x the times for the 100Mt run, as expected. The relative percentages of the queries are about the same on both scales, with the drill-down in Q8 alone being 77% and 72% respectively. The Q8 drill-down starts at the root of the product hierarchy. If we made this start one level from the top, its share would drop. This seems reasonable.
Conversely, Q2 is out of place, with far too little share of the time. It takes a product as a starting point and shows a list of products with common features, sorted by descending count of common features. This would more appropriately be applied to a leaf product category instead, measuring how many of the products in the category have the top 20 features found in this category, to name an example.
Also there should be more queries.
At present it appears that BSBM-BI is definitely runnable, but a cursory look suffices to show that the workload needs more development and variety. We remember that I dreamt up the business questions last fall without much analysis, and that these questions were subsequently translated to SPARQL by FU Berlin. So, on one hand, BSBM-BI is of crucial importance because it is the first attempt at doing a benchmark with long running queries in SPARQL. On the other hand, BSBM-BI is not very good as a benchmark; TPC-H is a lot better. This stands to reason, as TPC-H has had years and years of development and participation by many people.
Benchmark queries are trick questions: For example, TPC-H Q18 cannot be done without changing an
INinto aJOINwith theINsubquery in the outer loop and doing streaming aggregation. Q13 cannot be done without a well-optimizedHASH JOINwhich besides must be partitioned at the larger scales.Having such trick questions in an important benchmark eventually results in everybody doing the optimizations that the benchmark clearly calls for. Making benchmarks thus entails a responsibility ultimately to the end user, because an irrelevant benchmark might in the worst case send developers chasing things that are beside the point.
In the following, we will look at what BSBM-BI requires from the database and how these requirements can be further developed and extended.
BSBM-BI does not have any clear trick questions, at least not premeditatedly. BSBM-BI just requires a cost model that can guess the fanout of a
JOINand the cardinality of aGROUP BY; it is enough to distinguish smaller from greater; the guess does not otherwise have to be very good. Further, the queries are written in the benchmark text so that joining from left to right would work, so not even a cost-based optimizer is strictly needed. I did however have to add some cardinality statistics to get reasonableJOINorder since we always reorder the query regardless of the source formulation.BSBM-BI does have variable selectivity from the drill-downs; thus these may call for different
JOINorders for different parameter values. I have not looked into whether this really makes a difference, though.There are places in BSBM-BI where using a
HASH JOINmakes sense. We do not useHASH JOINswith RDF because there is an index for everything and making aHASH JOINin the wrong place can have a large up-front cost, so one is more robust against cost model errors if one does not doHASH JOINs. This said, aHASH JOINin the right place is a lot better than an index lookup. With TPC-H Q13, our bestHASH JOINis over 2x better than the bestINDEX-basedJOIN, both being well tuned. For questions like "count the hairballs made in Germany reviewed by Japanese Hello Kitty fans," where two ends of aJOINpath are fairly selective doing the other as aHASH JOINis good. This can, if theJOINis always cardinality-reducing, even be merged inside anINDEXlookup. We have such capabilities since we have been for a while gearing up for the relational races, but are not using any of these with BSBM-BI, although they would be useful.Let us see the profile for a single user 100 Mt run.
The database activity summary is --
select db_activity (0, ' [http');]161.3M rnd 210.2M seq 0 same seg 104.5M same pg 45.08M same par 0 disk 0 spec disk 0B / 0 messages 2.393K forkSee the post "What Does BSBM Explore Measure" for an explanation of the numbers. We see that there is more sequential access than random and the random has fair locality with over half on the same page as the previous and a lot of the rest falling under the same parent. Funnily enough, the explore mix has more locality. Running with a longer vector size would probably increase performance by getting better locality. There is an optimization that adjusts vector size on the fly if locality is not sufficient but this is not being used here. So we manually set vector size to 100000 instead of the default 10000. We get --
172.4M rnd 220.8M seq 0 same seg 149.6M same pg 10.99M same par 21 disk 861 spec disk 0B / 0 messages 754 forkThe throughput goes from 1494 to 1779. We see more hits on the same page, as expected. We do not make this setting a default since it raises the cost for small queries; therefore the vector size must be self-adjusting -- besides, expecting a DBA to tune this is not reasonable. We will just have to correctly tune the self-adjust logic, and we have again clear gains.
Let us now go back to the first run with vector size 10000.
The top of the CPU
oprofileis as follows:722309 15.4507 cmpf_iri64n_iri64n 434791 9.3005 cmpf_iri64n_iri64n_anyn_iri64n 294712 6.3041 itc_next_set 273488 5.8501 itc_vec_split_search 203970 4.3631 itc_dive_transit 199687 4.2714 itc_page_rcf_search 181614 3.8848 dc_itc_append_any 173043 3.7015 itc_bm_vec_row_check 146727 3.1386 cmpf_int64n 128224 2.7428 itc_vec_row_check 113515 2.4282 dk_alloc 97296 2.0812 page_wait_access 62523 1.3374 qst_vec_get_int64 59014 1.2623 itc_next_set_parent 53589 1.1463 sslr_qst_get 48003 1.0268 ds_add 46641 0.9977 dk_free_tree 44551 0.9530 kc_var_col 43650 0.9337 page_col_cmp_1 35297 0.7550 cmpf_iri64n_iri64n_anyn_gt_lt 34589 0.7399 dv_compare 25864 0.5532 cmpf_iri64n_anyn_iri64n_iri64n_lte 23088 0.4939 dk_free
The top 10 are all index traversal, with the key compare for two leading IRI keys in the lead, corresponding to a lookup with
PandSgiven. The one after that is with all parts given, corresponding to an existence test. The existence tests could probably be converted toHASH JOINlookups to good advantage. Aggregation and arithmetic are absent. We should probably add a query like TPC-H Q1 that does nothing but these two. Considering the overall profile,GROUP BYseems to be around 3%. We should probably put in a query that makes a very large number of groups and could make use of streaming aggregation, i.e., take advantage of a situation where aggregation input comes already grouped by the grouping columns.A BI use case should offer no problem with including arithmetic, but there are not that many numbers in the BSBM set. Some code sections in the queries with conditional execution and costly tests inside
ANDsandORswould be good. TPC-H has such in Q21 and Q19. AnORwith existences where there would be gain from good guesses of a subquery's selectivity would be appropriate. Also, there should be conditional expressions somewhere with a lot of data, like theCASE-WHENin TPC-H Q12.We can make BSBM-BI more interesting by putting in the above. Also we will have to see where we can profit from
HASH JOIN, both small and large. There should be such places in the workload already so this is a matter of just playing a bit more.This post amounts to a cheat sheet for the BSBM-BI runs a bit farther down the road. By then we should be operational with the column store and Virtuoso 7 Cluster, though, so not everything is yet on the table.
Benchmarks, Redux Series- Benchmarks, Redux (part 1): On RDF Benchmarks
- Benchmarks, Redux (part 2): A Benchmarking Story
- Benchmarks, Redux (part 3): Virtuoso 7 vs 6 on BSBM Load and Explore
- Benchmarks, Redux (part 4): Benchmark Tuning Questionnaire
- Benchmarks, Redux (part 5): BSBM and I/O; HDDs and SSDs
- Benchmarks, Redux (part 6): BSBM and I/O, continued
- Benchmarks, Redux (part 7): What Does BSBM Explore Measure?
- Benchmarks, Redux (part 8): BSBM Explore and Update
- Benchmarks, Redux (part 9): BSBM With Cluster
- Benchmarks, Redux (part 10): LOD2 and the Benchmark Process
- Benchmarks, Redux (part 11): The Substance of Benchmarks
- Benchmarks, Redux (part 12): Our Own BSBM Results Report
- Benchmarks, Redux (part 13): BSBM-BI Modifications
- Benchmarks, Redux (part 14): BSBM-BI Mix (this post)
- Benchmarks, Redux (part 15): BSBM Test Driver Enhancements
-
22:31
Benchmarks, Redux (part 14): BSBM BI Mix
» OpenLink Community BlogIn this post, we look at how we run the BSBM-BI mix. We consider the 100 Mt and 1000 Mt scales with Virtuoso 7 using the same hardware and software as in the previous posts. The changes to workload and metric are given in the previous post.
Our intent here is to look at whether the metric works, and to see what results will look like in general. We are as much testing the benchmark as we are testing the system-under-test (SUT). The results shown here will likely not be comparable with future ones because we will most likely change the composition of the workload since it seems a bit out of balance. Anyway, for the sake of disclosure, we attach the query templates. The test driver we used will be made available soon, so the interested may still try a comparison with their systems. If you practice with this workload for the coming races, the effort will surely not be wasted.
Once we have come up with a rules document, we will redo all that we have published so far by-the-book, and have it audited as part of the LOD2 service we plan for this (see previous posts in this series). This will introduce comparability; but before we get that far with the BI workload, the workload needs to evolve a bit.
Below we show samples of test driver output; the whole output is downloadable.
100 Mt Single User
bsbm/testdriver -runs 1 -w 0 -idir /bs/1 -drill -ucf bsbm/usecases/businessIntelligence/sparql.txt -dg http://bsbm.org [localhost:8604]
0: 43348.14ms, total: 43440ms Scale factor: 284826 Explore Endpoints: 1 Update Endpoints: 1 Drilldown: on Number of warmup runs: 0 Seed: 808080 Number of query mix runs (without warmups): 1 times min/max Querymix runtime: 43.3481s / 43.3481s Elapsed runtime: 43.348 seconds QMpH: 83.049 query mixes per hour CQET: 43.348 seconds average runtime of query mix CQET (geom.): 43.348 seconds geometric mean runtime of query mix AQET (geom.): 0.492 seconds geometric mean runtime of query Throughput: 1494.874 BSBM-BI throughput: qph*scale BI Power: 7309.820 BSBM-BI Power: qph*scale (geom)
100 Mt 8 User
Thread 6: query mix 3: 195793.09ms, total: 196086.18ms Thread 8: query mix 0: 197843.84ms, total: 198010.50ms Thread 7: query mix 4: 201806.28ms, total: 201996.26ms Thread 2: query mix 5: 221983.93ms, total: 222105.96ms Thread 4: query mix 7: 225127.55ms, total: 225317.49ms Thread 3: query mix 6: 225860.49ms, total: 226050.17ms Thread 5: query mix 2: 230884.93ms, total: 231067.61ms Thread 1: query mix 1: 237836.61ms, total: 237959.11ms Benchmark run completed in 237.985427s Scale factor: 284826 Explore Endpoints: 1 Update Endpoints: 1 Drilldown: on Number of warmup runs: 0 Number of clients: 8 Seed: 808080 Number of query mix runs (without warmups): 8 times min/max Querymix runtime: 195.7931s / 237.8366s Total runtime (sum): 1737.137 seconds Elapsed runtime: 1737.137 seconds QMpH: 121.016 query mixes per hour CQET: 217.142 seconds average runtime of query mix CQET (geom.): 216.603 seconds geometric mean runtime of query mix AQET (geom.): 2.156 seconds geometric mean runtime of query Throughput: 2178.285 BSBM-BI throughput: qph*scale BI Power: 1669.745 BSBM-BI Power: qph*scale (geom)
1000 Mt Single User
0: 608707.03ms, total: 608768ms Scale factor: 2848260 Explore Endpoints: 1 Update Endpoints: 1 Drilldown: on Number of warmup runs: 0 Seed: 808080 Number of query mix runs (without warmups): 1 times min/max Querymix runtime: 608.7070s / 608.7070s Elapsed runtime: 608.707 seconds QMpH: 5.914 query mixes per hour CQET: 608.707 seconds average runtime of query mix CQET (geom.): 608.707 seconds geometric mean runtime of query mix AQET (geom.): 5.167 seconds geometric mean runtime of query Throughput: 1064.552 BSBM-BI throughput: qph*scale BI Power: 6967.325 BSBM-BI Power: qph*scale (geom)
1000 Mt 8 User
bsbm/testdriver -runs 8 -mt 8 -w 0 -idir /bs/10 -drill -ucf bsbm/usecases/businessIntelligence/sparql.txt -dg [bsbm.org] http://localhost:8604/sparql
Thread 3: query mix 4: 2211275.25ms, total: 2211371.60ms Thread 4: query mix 0: 2212316.87ms, total: 2212417.99ms Thread 8: query mix 3: 2275942.63ms, total: 2276058.03ms Thread 5: query mix 5: 2441378.35ms, total: 2441448.66ms Thread 6: query mix 7: 2804001.05ms, total: 2804098.81ms Thread 2: query mix 2: 2808374.66ms, total: 2808473.71ms Thread 1: query mix 6: 2839407.12ms, total: 2839510.63ms Thread 7: query mix 1: 2889199.23ms, total: 2889263.17ms Benchmark run completed in 2889.302566s Scale factor: 2848260 Explore Endpoints: 1 Update Endpoints: 1 Drilldown: on Number of warmup runs: 0 Number of clients: 8 Seed: 808080 Number of query mix runs (without warmups): 8 times min/max Querymix runtime: 2211.2753s / 2889.1992s Total runtime (sum): 20481.895 seconds Elapsed runtime: 20481.895 seconds QMpH: 9.968 query mixes per hour CQET: 2560.237 seconds average runtime of query mix CQET (geom.): 2544.284 seconds geometric mean runtime of query mix AQET (geom.): 13.556 seconds geometric mean runtime of query Throughput: 1794.205 BSBM-BI throughput: qph*scale BI Power: 2655.678 BSBM-BI Power: qph*scale (geom) Metrics for Query: 1 Count: 8 times executed in whole run Time share 2.120884% of total execution time AQET: 54.299656 seconds (arithmetic mean) AQET(geom.): 34.607302 seconds (geometric mean) QPS: 0.13 Queries per second minQET/maxQET: 11.71547600s / 148.65379700s Metrics for Query: 2 Count: 8 times executed in whole run Time share 0.207382% of total execution time AQET: 5.309462 seconds (arithmetic mean) AQET(geom.): 2.737696 seconds (geometric mean) QPS: 1.34 Queries per second minQET/maxQET: 0.78729800s / 25.80948200s Metrics for Query: 3 Count: 8 times executed in whole run Time share 17.650472% of total execution time AQET: 451.893890 seconds (arithmetic mean) AQET(geom.): 410.481088 seconds (geometric mean) QPS: 0.02 Queries per second minQET/maxQET: 171.07262500s / 721.72939200s Metrics for Query: 5 Count: 32 times executed in whole run Time share 6.196565% of total execution time AQET: 39.661685 seconds (arithmetic mean) AQET(geom.): 6.849882 seconds (geometric mean) QPS: 0.18 Queries per second minQET/maxQET: 0.15696500s / 189.00906200s Metrics for Query: 6 Count: 8 times executed in whole run Time share 0.119916% of total execution time AQET: 3.070136 seconds (arithmetic mean) AQET(geom.): 2.056059 seconds (geometric mean) QPS: 2.31 Queries per second minQET/maxQET: 0.41524400s / 7.55655300s Metrics for Query: 7 Count: 40 times executed in whole run Time share 1.577963% of total execution time AQET: 8.079921 seconds (arithmetic mean) AQET(geom.): 1.342079 seconds (geometric mean) QPS: 0.88 Queries per second minQET/maxQET: 0.02205800s / 40.27761500s Metrics for Query: 8 Count: 40 times executed in whole run Time share 72.126818% of total execution time AQET: 369.323481 seconds (arithmetic mean) AQET(geom.): 114.431863 seconds (geometric mean) QPS: 0.02 Queries per second minQET/maxQET: 5.94377300s / 1824.57867400s
The CPU for the multiuser runs stays above 1500% for the whole run. The CPU for the single user 100 Mt run is 630%; for the 1000 Mt run, this is 574%. This can be improved since the queries usually have a lot of data to work on. But final optimization is not our goal yet; we are just surveying the race track. The difference between a warm single user run and a cold single user run is about 15% with data on SSD; with data on disk, this would be more. The numbers shown are with warm cache. The single-user and multi-user Throughput difference, 1064 single-user vs. 1794 multi-user, is about what one would expect from the CPU utilization.
With these numbers, the CPU does not appear badly memory-bound, else the increase would be less; also core multi-threading seems to bring some benefit. If the single-user run was at 800%, the Throughput would be 1488. The speed in excess of this may be attributed to core multi-threading, although we must remember that not every query mix is exactly the same length, so the figure is not exact. Core multi-threading does not seem to hurt, at the very least. Comparison of the same numbers with the column store will be interesting since it misses the cache a lot less and accordingly has better SMP scaling. The Intel Nehalem memory subsystem is really pretty good.
For reference, we show a run with Virtuoso 6 at 100Mt.
0: 424754.40ms, total: 424829ms Scale factor: 284826 Explore Endpoints: 1 Update Endpoints: 1 Drilldown: on Number of warmup runs: 0 Seed: 808080 Number of query mix runs (without warmups): 1 times min/max Querymix runtime: 424.7544s / 424.7544s Elapsed runtime: 424.754 seconds QMpH: 8.475 query mixes per hour CQET: 424.754 seconds average runtime of query mix CQET (geom.): 424.754 seconds geometric mean runtime of query mix AQET (geom.): 1.097 seconds geometric mean runtime of query Throughput: 152.559 BSBM-BI throughput: qph*scale BI Power: 3281.150 BSBM-BI Power: qph*scale (geom)
and 8 user
Thread 5: query mix 3: 616997.86ms, total: 617042.83ms Thread 7: query mix 4: 625522.18ms, total: 625559.09ms Thread 3: query mix 7: 626247.62ms, total: 626304.96ms Thread 1: query mix 0: 629675.17ms, total: 629724.98ms Thread 4: query mix 6: 667633.36ms, total: 667670.07ms Thread 8: query mix 2: 674206.07ms, total: 674256.72ms Thread 6: query mix 5: 695020.21ms, total: 695052.29ms Thread 2: query mix 1: 701824.67ms, total: 701864.91ms Benchmark run completed in 701.909341s Scale factor: 284826 Explore Endpoints: 1 Update Endpoints: 1 Drilldown: on Number of warmup runs: 0 Number of clients: 8 Seed: 808080 Number of query mix runs (without warmups): 8 times min/max Querymix runtime: 616.9979s / 701.8247s Total runtime (sum): 5237.127 seconds Elapsed runtime: 5237.127 seconds QMpH: 41.031 query mixes per hour CQET: 654.641 seconds average runtime of query mix CQET (geom.): 653.873 seconds geometric mean runtime of query mix AQET (geom.): 2.557 seconds geometric mean runtime of query Throughput: 738.557 BSBM-BI throughput: qph*scale BI Power: 1408.133 BSBM-BI Power: qph*scale (geom)
Having the numbers, let us look at the metric and its scaling. We take the geometric mean of the single-user Power and the multiuser Throughput.
100 Mt: sqrt ( 7771 * 2178 ); = 4114 1000 Mt: sqrt ( 6967 * 1794 ); = 3535
Scaling seems to work; the results are in the same general ballpark. The real times for the 1000 Mt run are a bit over 10x the times for the 100Mt run, as expected. The relative percentages of the queries are about the same on both scales, with the drill-down in Q8 alone being 77% and 72% respectively. The Q8 drill-down starts at the root of the product hierarchy. If we made this start one level from the top, its share would drop. This seems reasonable.
Conversely, Q2 is out of place, with far too little share of the time. It takes a product as a starting point and shows a list of products with common features, sorted by descending count of common features. This would more appropriately be applied to a leaf product category instead, measuring how many of the products in the category have the top 20 features found in this category, to name an example.
Also there should be more queries.
At present it appears that BSBM-BI is definitely runnable, but a cursory look suffices to show that the workload needs more development and variety. We remember that I dreamt up the business questions last fall without much analysis, and that these questions were subsequently translated to SPARQL by FU Berlin. So, on one hand, BSBM-BI is of crucial importance because it is the first attempt at doing a benchmark with long running queries in SPARQL. On the other hand, BSBM-BI is not very good as a benchmark; TPC-H is a lot better. This stands to reason, as TPC-H has had years and years of development and participation by many people.
Benchmark queries are trick questions: For example, TPC-H Q18 cannot be done without changing an
INinto aJOINwith theINsubquery in the outer loop and doing streaming aggregation. Q13 cannot be done without a well-optimizedHASH JOINwhich besides must be partitioned at the larger scales.Having such trick questions in an important benchmark eventually results in everybody doing the optimizations that the benchmark clearly calls for. Making benchmarks thus entails a responsibility ultimately to the end user, because an irrelevant benchmark might in the worst case send developers chasing things that are beside the point.
In the following, we will look at what BSBM-BI requires from the database and how these requirements can be further developed and extended.
BSBM-BI does not have any clear trick questions, at least not premeditatedly. BSBM-BI just requires a cost model that can guess the fanout of a
JOINand the cardinality of aGROUP BY; it is enough to distinguish smaller from greater; the guess does not otherwise have to be very good. Further, the queries are written in the benchmark text so that joining from left to right would work, so not even a cost-based optimizer is strictly needed. I did however have to add some cardinality statistics to get reasonableJOINorder since we always reorder the query regardless of the source formulation.BSBM-BI does have variable selectivity from the drill-downs; thus these may call for different
JOINorders for different parameter values. I have not looked into whether this really makes a difference, though.There are places in BSBM-BI where using a
HASH JOINmakes sense. We do not useHASH JOINswith RDF because there is an index for everything and making aHASH JOINin the wrong place can have a large up-front cost, so one is more robust against cost model errors if one does not doHASH JOINs. This said, aHASH JOINin the right place is a lot better than an index lookup. With TPC-H Q13, our bestHASH JOINis over 2x better than the bestINDEX-basedJOIN, both being well tuned. For questions like "count the hairballs made in Germany reviewed by Japanese Hello Kitty fans," where two ends of aJOINpath are fairly selective doing the other as aHASH JOINis good. This can, if theJOINis always cardinality-reducing, even be merged inside anINDEXlookup. We have such capabilities since we have been for a while gearing up for the relational races, but are not using any of these with BSBM-BI, although they would be useful.Let us see the profile for a single user 100 Mt run.
The database activity summary is --
select db_activity (0, ' [http');]161.3M rnd 210.2M seq 0 same seg 104.5M same pg 45.08M same par 0 disk 0 spec disk 0B / 0 messages 2.393K forkSee the post "What Does BSBM Explore Measure" for an explanation of the numbers. We see that there is more sequential access than random and the random has fair locality with over half on the same page as the previous and a lot of the rest falling under the same parent. Funnily enough, the explore mix has more locality. Running with a longer vector size would probably increase performance by getting better locality. There is an optimization that adjusts vector size on the fly if locality is not sufficient but this is not being used here. So we manually set vector size to 100000 instead of the default 10000. We get --
172.4M rnd 220.8M seq 0 same seg 149.6M same pg 10.99M same par 21 disk 861 spec disk 0B / 0 messages 754 forkThe throughput goes from 1494 to 1779. We see more hits on the same page, as expected. We do not make this setting a default since it raises the cost for small queries; therefore the vector size must be self-adjusting -- besides, expecting a DBA to tune this is not reasonable. We will just have to correctly tune the self-adjust logic, and we have again clear gains.
Let us now go back to the first run with vector size 10000.
The top of the CPU
oprofileis as follows:722309 15.4507 cmpf_iri64n_iri64n 434791 9.3005 cmpf_iri64n_iri64n_anyn_iri64n 294712 6.3041 itc_next_set 273488 5.8501 itc_vec_split_search 203970 4.3631 itc_dive_transit 199687 4.2714 itc_page_rcf_search 181614 3.8848 dc_itc_append_any 173043 3.7015 itc_bm_vec_row_check 146727 3.1386 cmpf_int64n 128224 2.7428 itc_vec_row_check 113515 2.4282 dk_alloc 97296 2.0812 page_wait_access 62523 1.3374 qst_vec_get_int64 59014 1.2623 itc_next_set_parent 53589 1.1463 sslr_qst_get 48003 1.0268 ds_add 46641 0.9977 dk_free_tree 44551 0.9530 kc_var_col 43650 0.9337 page_col_cmp_1 35297 0.7550 cmpf_iri64n_iri64n_anyn_gt_lt 34589 0.7399 dv_compare 25864 0.5532 cmpf_iri64n_anyn_iri64n_iri64n_lte 23088 0.4939 dk_free
The top 10 are all index traversal, with the key compare for two leading IRI keys in the lead, corresponding to a lookup with
PandSgiven. The one after that is with all parts given, corresponding to an existence test. The existence tests could probably be converted toHASH JOINlookups to good advantage. Aggregation and arithmetic are absent. We should probably add a query like TPC-H Q1 that does nothing but these two. Considering the overall profile,GROUP BYseems to be around 3%. We should probably put in a query that makes a very large number of groups and could make use of streaming aggregation, i.e., take advantage of a situation where aggregation input comes already grouped by the grouping columns.A BI use case should offer no problem with including arithmetic, but there are not that many numbers in the BSBM set. Some code sections in the queries with conditional execution and costly tests inside
ANDsandORswould be good. TPC-H has such in Q21 and Q19. AnORwith existences where there would be gain from good guesses of a subquery's selectivity would be appropriate. Also, there should be conditional expressions somewhere with a lot of data, like theCASE-WHENin TPC-H Q12.We can make BSBM-BI more interesting by putting in the above. Also we will have to see where we can profit from
HASH JOIN, both small and large. There should be such places in the workload already so this is a matter of just playing a bit more.This post amounts to a cheat sheet for the BSBM-BI runs a bit farther down the road. By then we should be operational with the column store and Virtuoso 7 Cluster, though, so not everything is yet on the table.
Benchmarks, Redux Series- Benchmarks, Redux (part 1): On RDF Benchmarks
- Benchmarks, Redux (part 2): A Benchmarking Story
- Benchmarks, Redux (part 3): Virtuoso 7 vs 6 on BSBM Load and Explore
- Benchmarks, Redux (part 4): Benchmark Tuning Questionnaire
- Benchmarks, Redux (part 5): BSBM and I/O; HDDs and SSDs
- Benchmarks, Redux (part 6): BSBM and I/O, continued
- Benchmarks, Redux (part 7): What Does BSBM Explore Measure?
- Benchmarks, Redux (part 8): BSBM Explore and Update
- Benchmarks, Redux (part 9): BSBM With Cluster
- Benchmarks, Redux (part 10): LOD2 and the Benchmark Process
- Benchmarks, Redux (part 11): The Substance of Benchmarks
- Benchmarks, Redux (part 12): Our Own BSBM Results Report
- Benchmarks, Redux (part 13): BSBM-BI Modifications
- Benchmarks, Redux (part 14): BSBM-BI Mix (this post)
- Benchmarks, Redux (part 15): BSBM Test Driver Enhancements
-
22:30
Benchmarks, Redux (part 13): BSBM BI Modifications
» OpenLink Community BlogIn this post we introduce changes to the BSBM BI queries and metric. These changes are motivated by prevailing benchmark practice and by our experiences in optimizing for the BSBM BI workload.
We will publish results according to the definitions given here and recommend that any interested parties do likewise. The rationales are given in the text.
Query MixWe have removed Q4 from the mix because it is quadratic to the scale factor. The other queries are roughly
Parameter Substitutionn * log (n).All queries that take a product type as parameter are run in flights of several query invocations where the product type goes from broader to more specific. The initial product type specifies either the root product type or an immediate subtype of this, and the last in the drill-down is a leaf type.
The rationale for this is that the choice of product type may make several orders of magnitude difference in the run time of a query. In order to make consecutive query mixes roughly comparable in execution time, all mixes should have a predictable number of query invocations with product types of each level.
Query OrderIn the BI mix, when running multiple concurrent clients, each query mix is submitted in a random order. Queries which do drill-downs always have the steps of the drill-down as consecutive in the session, but the query templates are permuted. This is done so as to make less likely that there were two concurrent queries accessing exactly the same data. In this way, scans cannot be trivially shared between queries -- but there are still opportunities for reuse of results and adapting execution to working set, e.g., starting with what is in memory.
MetricsWe use a TPC-H-like metric. This metric consists of a single-user part and a multi-user part, called respectively Power and Throughput. The Power metric is a geometric mean of query run-time. The Throughput is the total run-time divided by the number of queries completed. After taking the mean, the time is converted into queries-per-hour. This time is then multiplied by the scale factor divided by the scale factor for 100 Mt. In other words, we consider the 100 Mt data set as the unit scale.
The Power is defined as
( scale_factor / 284826 ) * 3600 / ( ( t1 * t1 * ... * tn ) ^ ( 1 / n ) )
The Throughput is defined as
( scale_factor / 284826 ) * 3600 / ( ( t1 + t2 + ... + tn ) / n )
The magic number
284826is the scale that generates approximately 100 million triples (100 Mt). We consider this scale "one". The reason for the multiplication is that scores at different scales should get similar numbers; otherwise 10x larger scale would result roughly in 10x lower throughput with the BI queries.The Composite metric is the geometric mean of the Power and Throughput metrics. A complete report shows both Power and Throughput metrics, as well as individual query times for all queries. The rationale for using a geometric mean is to give an equal importance to long and short queries. Halving the execution time of either a long query or a short query will have the same effect on the metric. This is good for encouraging research into all aspects of query processing. On the other hand, real-life users are more interested in halving the time of queries that take one hour than of queries that take one second; therefore, the throughput metric considers run times.
Taking the geometric mean of the two metrics gives more weight to the lower of the two than an arithmetic mean, hence we pay more attention to the worse of the two.
Single-user and multi-user metrics are separate because of the relative importance of intra-query parallelization in BI workloads: There may not be large numbers of concurrent users, yet queries are still complex, and it is important to have maximum parallelization. Therefore the metric rewards single-user performance.
In the next post we will look at the use of this metric and the actual content of BSBM BI.
Benchmarks, Redux Series- Benchmarks, Redux (part 1): On RDF Benchmarks
- Benchmarks, Redux (part 2): A Benchmarking Story
- Benchmarks, Redux (part 3): Virtuoso 7 vs 6 on BSBM Load and Explore
- Benchmarks, Redux (part 4): Benchmark Tuning Questionnaire
- Benchmarks, Redux (part 5): BSBM and I/O; HDDs and SSDs
- Benchmarks, Redux (part 6): BSBM and I/O, continued
- Benchmarks, Redux (part 7): What Does BSBM Explore Measure?
- Benchmarks, Redux (part 8): BSBM Explore and Update
- Benchmarks, Redux (part 9): BSBM With Cluster
- Benchmarks, Redux (part 10): LOD2 and the Benchmark Process
- Benchmarks, Redux (part 11): The Substance of Benchmarks
- Benchmarks, Redux (part 12): Our Own BSBM Results Report
- Benchmarks, Redux (part 13): BSBM BI Modifications (this post)
- Benchmarks, Redux (part 14): BSBM BI Mix
- Benchmarks, Redux (part 15): BSBM Test Driver Enhancements
-
22:30
Benchmarks, Redux (part 13): BSBM BI Modifications
» OpenLink Community BlogIn this post we introduce changes to the BSBM BI queries and metric. These changes are motivated by prevailing benchmark practice and by our experiences in optimizing for the BSBM BI workload.
We will publish results according to the definitions given here and recommend that any interested parties do likewise. The rationales are given in the text.
Query MixWe have removed Q4 from the mix because it is quadratic to the scale factor. The other queries are roughly
Parameter Substitutionn * log (n).All queries that take a product type as parameter are run in flights of several query invocations where the product type goes from broader to more specific. The initial product type specifies either the root product type or an immediate subtype of this, and the last in the drill-down is a leaf type.
The rationale for this is that the choice of product type may make several orders of magnitude difference in the run time of a query. In order to make consecutive query mixes roughly comparable in execution time, all mixes should have a predictable number of query invocations with product types of each level.
Query OrderIn the BI mix, when running multiple concurrent clients, each query mix is submitted in a random order. Queries which do drill-downs always have the steps of the drill-down as consecutive in the session, but the query templates are permuted. This is done so as to make less likely that there were two concurrent queries accessing exactly the same data. In this way, scans cannot be trivially shared between queries -- but there are still opportunities for reuse of results and adapting execution to working set, e.g., starting with what is in memory.
MetricsWe use a TPC-H-like metric. This metric consists of a single-user part and a multi-user part, called respectively Power and Throughput. The Power metric is a geometric mean of query run-time. The Throughput is the total run-time divided by the number of queries completed. After taking the mean, the time is converted into queries-per-hour. This time is then multiplied by the scale factor divided by the scale factor for 100 Mt. In other words, we consider the 100 Mt data set as the unit scale.
The Power is defined as
( scale_factor / 284826 ) * 3600 / ( ( t1 * t1 * ... * tn ) ^ ( 1 / n ) )
The Throughput is defined as
( scale_factor / 284826 ) * 3600 / ( ( t1 + t2 + ... + tn ) / n )
The magic number
284826is the scale that generates approximately 100 million triples (100 Mt). We consider this scale "one". The reason for the multiplication is that scores at different scales should get similar numbers; otherwise 10x larger scale would result roughly in 10x lower throughput with the BI queries.The Composite metric is the geometric mean of the Power and Throughput metrics. A complete report shows both Power and Throughput metrics, as well as individual query times for all queries. The rationale for using a geometric mean is to give an equal importance to long and short queries. Halving the execution time of either a long query or a short query will have the same effect on the metric. This is good for encouraging research into all aspects of query processing. On the other hand, real-life users are more interested in halving the time of queries that take one hour than of queries that take one second; therefore, the throughput metric considers run times.
Taking the geometric mean of the two metrics gives more weight to the lower of the two than an arithmetic mean, hence we pay more attention to the worse of the two.
Single-user and multi-user metrics are separate because of the relative importance of intra-query parallelization in BI workloads: There may not be large numbers of concurrent users, yet queries are still complex, and it is important to have maximum parallelization. Therefore the metric rewards single-user performance.
In the next post we will look at the use of this metric and the actual content of BSBM BI.
Benchmarks, Redux Series- Benchmarks, Redux (part 1): On RDF Benchmarks
- Benchmarks, Redux (part 2): A Benchmarking Story
- Benchmarks, Redux (part 3): Virtuoso 7 vs 6 on BSBM Load and Explore
- Benchmarks, Redux (part 4): Benchmark Tuning Questionnaire
- Benchmarks, Redux (part 5): BSBM and I/O; HDDs and SSDs
- Benchmarks, Redux (part 6): BSBM and I/O, continued
- Benchmarks, Redux (part 7): What Does BSBM Explore Measure?
- Benchmarks, Redux (part 8): BSBM Explore and Update
- Benchmarks, Redux (part 9): BSBM With Cluster
- Benchmarks, Redux (part 10): LOD2 and the Benchmark Process
- Benchmarks, Redux (part 11): The Substance of Benchmarks
- Benchmarks, Redux (part 12): Our Own BSBM Results Report
- Benchmarks, Redux (part 13): BSBM BI Modifications (this post)
- Benchmarks, Redux (part 14): BSBM BI Mix
- Benchmarks, Redux (part 15): BSBM Test Driver Enhancements
-
22:29
Benchmarks, Redux (part 12): Our Own BSBM Results Report
» OpenLink Community BlogThis is a placeholder; it will be replaced with a complete report in the very near future.
-
22:29
Benchmarks, Redux (part 12): Our Own BSBM Results Report
» OpenLink Community BlogThis is a placeholder; it will be replaced with a complete report in the very near future.
-
23:30
Benchmarks, Redux (part 11): The Substance of Benchmarks
» OpenLink Community BlogLet us talk about what ought to be benchmarked in the context of RDF.
A point that often gets brought up by RDF-ers when talking about benchmarks is that there already exist systems which perform very well at TPC-H and similar workloads, and therefore there is no need for RDF to go there. It is, as it were, somebody else's problem; besides, it is a solved one.
On the other hand, being able to express what is generally expected of a query language might not be a core competence or a competitive edge, but it certainly is a checklist item.
BSBM seems to be adopted as a de facto RDF benchmark, as there indeed is almost nothing else. But we should not lose sight of the fact that this is in fact a relational schema and workload that has just been straightforwardly transformed to RDF. BSBM was made, after all, in part for measuring RDB to RDF mapping. Thus BSBM is no more RDF-ish than a trivially RDF-ized TPC-H would be. TPC-H is however a bit more difficult if also a better thought out benchmark than the BSBM BI Mix proposal. But I do not expect an RDF audience to have any enthusiasm for this as this is indeed a very tough race by now, and besides one in which RDB and SQL will keep some advantage. However, using this as a validation test is meaningful, as there exists a validation dataset and queries that we already have RDF-ized. We could publish these and call this "RDF-H".
In the following I will outline what would constitute an RDF-friendly, scientifically interesting benchmark. The points are in part based on discussions with Peter Boncz of CWI.
The Social Network Intelligence Benchmark (SNIB) takes the social web Facebook-style schema Ivan Mikhailov and I made last year under the name of Botnet BM. In LOD2, CWI is presently working on this.
The data includes DBpedia as a base component used for providing conversation topics, information about geographical locales of simulated users, etc. DBpedia is not very large, around 200M-300M triples, but it is diverse enough.
The data will have correlations, e.g., people who talk about sports tend to know other people who talk about the same sport, and they are more likely to know people from their geographical area than from elsewhere.
The bulk of the data consists of a rich history of interactions including messages to individuals and groups, linking to people, dropping links, joining and leaving groups, and so forth. The messages are tagged using real-world concepts from DBpedia, and there is correlation between tagging and textual content since both are generated from Dbpedia articles. Since there is such correlation, NLP techniques like entity and relationship extraction can be used with the data even though this is not the primary thrust of SNIB.
There is variation in frequency of online interaction, and this interaction consist of sessions. For example, one could analyze user behavior per time of day for online ad placement.
The data probably should include propagating memes, fashions, and trends that travel on the social network. With this, one could query about their origin and speed of propagation.
There should probably be cases of duplicate identities in the data, i.e., one real person using many online accounts to push an agenda. Resolving duplicate identities makes for nice queries.
Ragged data with half-filled profiles and misspelled identifiers like person and place names are a natural part of the social web use case. The data generator should take this into account.
-
Distribution of popularity and activity should follow a power-law-like pattern; actual measures of popularity can be sampled from existing social networks even though large quantities of data cannot easily be extracted.
-
The dataset should be predictably scalable. For the workload considered, the relative importance of the queries or other measured tasks should not change dramatically with the scale.
For example some queries are logarithmic to data size (e.g., find connections to a person), some are linear (e.g., find average online time of sports fans on Sundays), and some are quadratic or worse (e.g., find two extremists of the same ideology that are otherwise unrelated). Making a single metric from such parts may not be meaningful. Therefore, SNIB might be structured into different workloads.
The first would be an online mix with typically short lookups and updates, around
O ( log ( n ) ).The Business Intelligence Mix would be composed of queries around
OO ( n log ( n ) ). Even so, with real data, choice of parameters will provide dramatic changes in query run-time. Therefore a run should be specified to have a predictable distribution of "hard" and "easy" parameter choices. In the BSBM BI mix modification, I did this by defining some to be drill downs from a more general to a more specific level of a hierarchy. This could be done here too in some cases; other cases would have to be defined with buckets of values.Both the real world and LOD2 are largely concerned with data integration. The SNIB workload can have aspects of this, for example, in resolving duplicate identities. These operations are more complex than typical database queries, as the attributes used for joining might not even match in the initial data.
One characteristic of these is the production of sometimes large intermediate results that need to be materialized. Doing these operations in practice requires procedural control. Further, running algorithms like network analytics (e.g., Page rank, centrality, etc.) involves aggregation of intermediate results that is not very well expressible in a query language. Some basic graph operations like shortest path are expressible but then are not in unextended SPARQL 1.1; as these would for example involve returning paths, which are explicitly excluded from the spec.
These are however the areas where we need to go for a benchmark that is more than a repackaging of a relational BI workload.
We find that such a workload will have procedural sections either in application code or stored procedures. Map-reduce is sometimes used for scaling these. As one would expect, many cluster databases have their own version of these control structures. Therefore some of the SNIB workload could even be implemented as map-reduce jobs alongside parallel database implementations. We might here touch base with the LarKC map-reduce work to see if it could be applied to SNIB workloads.
We see a three-level structure emerging. There is an Online mix which is a bit like the BSBM Explore mix, and an Analytics mix which is on the same order of complexity as TPC-H. These may have a more-or-less fixed query formulation and test driver. Beyond these, yet working on the same data, we have a set of Predefined Tasks which the test sponsor may implement in a manner of their choice.
We would finally get to the "raging conflict" between the "declarativists" and the "map reductionists." Last year's VLDB had a lot of map-reduce papers. I know of comparisons between Vertica and map reduce for doing a fairly simple SQL query on a lot of data, but here we would be talking about much more complex jobs on more interesting (i.e., less uniform) data.
We might even interest some of the cluster RDBMS players (Teradata, Vertica, Greenplum, Oracle Exadata, ParAccel, and/or Aster Data, to name a few) in running this workload using their map-reduce analogs.
We see that as we get to topics beyond relational BI, we do not find ourselves in an RDF-only world but very much at a crossroads of many technologies, e.g., map-reduce and its database analogs, various custom built databases, graph libraries, data integration and cleaning tools, and so forth.
There is not, nor ought there to be, a sheltered, RDF-only enclave. RDF will have to justify itself in a world of alternatives.
This must be reflected in our benchmark development, so relational BI is not irrelevant; in fact, it is what everybody does. RDF cannot be a total failure at this, even if this were not RDF's claim to fame. The claim to fame comes after we pass this stage, which is what we intend to explore in SNIB.
Benchmarks, Redux Series- Benchmarks, Redux (part 1): On RDF Benchmarks
- Benchmarks, Redux (part 2): A Benchmarking Story
- Benchmarks, Redux (part 3): Virtuoso 7 vs 6 on BSBM Load and Explore
- Benchmarks, Redux (part 4): Benchmark Tuning Questionnaire
- Benchmarks, Redux (part 5): BSBM and I/O; HDDs and SSDs
- Benchmarks, Redux (part 6): BSBM and I/O, continued
- Benchmarks, Redux (part 7): What Does BSBM Explore Measure?
- Benchmarks, Redux (part 8): BSBM Explore and Update
- Benchmarks, Redux (part 9): BSBM With Cluster
- Benchmarks, Redux (part 10): LOD2 and the Benchmark Process
- Benchmarks, Redux (part 11): On the Substance of RDF Benchmarks (this post)
- Benchmarks, Redux (part 12): Our Own BSBM Results Report
- Benchmarks, Redux (part 13): BSBM BI Modifications
- Benchmarks, Redux (part 14): BSBM BI Mix
- Benchmarks, Redux (part 15): BSBM Test Driver Enhancements
-
-
23:30
Benchmarks, Redux (part 11): The Substance of Benchmarks
» OpenLink Community BlogLet us talk about what ought to be benchmarked in the context of RDF.
A point that often gets brought up by RDF-ers when talking about benchmarks is that there already exist systems which perform very well at TPC-H and similar workloads, and therefore there is no need for RDF to go there. It is, as it were, somebody else's problem; besides, it is a solved one.
On the other hand, being able to express what is generally expected of a query language might not be a core competence or a competitive edge, but it certainly is a checklist item.
BSBM seems to be adopted as a de facto RDF benchmark, as there indeed is almost nothing else. But we should not lose sight of the fact that this is in fact a relational schema and workload that has just been straightforwardly transformed to RDF. BSBM was made, after all, in part for measuring RDB to RDF mapping. Thus BSBM is no more RDF-ish than a trivially RDF-ized TPC-H would be. TPC-H is however a bit more difficult if also a better thought out benchmark than the BSBM BI Mix proposal. But I do not expect an RDF audience to have any enthusiasm for this as this is indeed a very tough race by now, and besides one in which RDB and SQL will keep some advantage. However, using this as a validation test is meaningful, as there exists a validation dataset and queries that we already have RDF-ized. We could publish these and call this "RDF-H".
In the following I will outline what would constitute an RDF-friendly, scientifically interesting benchmark. The points are in part based on discussions with Peter Boncz of CWI.
The Social Network Intelligence Benchmark (SNIB) takes the social web Facebook-style schema Ivan Mikhailov and I made last year under the name of Botnet BM. In LOD2, CWI is presently working on this.
The data includes DBpedia as a base component used for providing conversation topics, information about geographical locales of simulated users, etc. DBpedia is not very large, around 200M-300M triples, but it is diverse enough.
The data will have correlations, e.g., people who talk about sports tend to know other people who talk about the same sport, and they are more likely to know people from their geographical area than from elsewhere.
The bulk of the data consists of a rich history of interactions including messages to individuals and groups, linking to people, dropping links, joining and leaving groups, and so forth. The messages are tagged using real-world concepts from DBpedia, and there is correlation between tagging and textual content since both are generated from Dbpedia articles. Since there is such correlation, NLP techniques like entity and relationship extraction can be used with the data even though this is not the primary thrust of SNIB.
There is variation in frequency of online interaction, and this interaction consist of sessions. For example, one could analyze user behavior per time of day for online ad placement.
The data probably should include propagating memes, fashions, and trends that travel on the social network. With this, one could query about their origin and speed of propagation.
There should probably be cases of duplicate identities in the data, i.e., one real person using many online accounts to push an agenda. Resolving duplicate identities makes for nice queries.
Ragged data with half-filled profiles and misspelled identifiers like person and place names are a natural part of the social web use case. The data generator should take this into account.
-
Distribution of popularity and activity should follow a power-law-like pattern; actual measures of popularity can be sampled from existing social networks even though large quantities of data cannot easily be extracted.
-
The dataset should be predictably scalable. For the workload considered, the relative importance of the queries or other measured tasks should not change dramatically with the scale.
For example some queries are logarithmic to data size (e.g., find connections to a person), some are linear (e.g., find average online time of sports fans on Sundays), and some are quadratic or worse (e.g., find two extremists of the same ideology that are otherwise unrelated). Making a single metric from such parts may not be meaningful. Therefore, SNIB might be structured into different workloads.
The first would be an online mix with typically short lookups and updates, around
O ( log ( n ) ).The Business Intelligence Mix would be composed of queries around
OO ( n log ( n ) ). Even so, with real data, choice of parameters will provide dramatic changes in query run-time. Therefore a run should be specified to have a predictable distribution of "hard" and "easy" parameter choices. In the BSBM BI mix modification, I did this by defining some to be drill downs from a more general to a more specific level of a hierarchy. This could be done here too in some cases; other cases would have to be defined with buckets of values.Both the real world and LOD2 are largely concerned with data integration. The SNIB workload can have aspects of this, for example, in resolving duplicate identities. These operations are more complex than typical database queries, as the attributes used for joining might not even match in the initial data.
One characteristic of these is the production of sometimes large intermediate results that need to be materialized. Doing these operations in practice requires procedural control. Further, running algorithms like network analytics (e.g., Page rank, centrality, etc.) involves aggregation of intermediate results that is not very well expressible in a query language. Some basic graph operations like shortest path are expressible but then are not in unextended SPARQL 1.1; as these would for example involve returning paths, which are explicitly excluded from the spec.
These are however the areas where we need to go for a benchmark that is more than a repackaging of a relational BI workload.
We find that such a workload will have procedural sections either in application code or stored procedures. Map-reduce is sometimes used for scaling these. As one would expect, many cluster databases have their own version of these control structures. Therefore some of the SNIB workload could even be implemented as map-reduce jobs alongside parallel database implementations. We might here touch base with the LarKC map-reduce work to see if it could be applied to SNIB workloads.
We see a three-level structure emerging. There is an Online mix which is a bit like the BSBM Explore mix, and an Analytics mix which is on the same order of complexity as TPC-H. These may have a more-or-less fixed query formulation and test driver. Beyond these, yet working on the same data, we have a set of Predefined Tasks which the test sponsor may implement in a manner of their choice.
We would finally get to the "raging conflict" between the "declarativists" and the "map reductionists." Last year's VLDB had a lot of map-reduce papers. I know of comparisons between Vertica and map reduce for doing a fairly simple SQL query on a lot of data, but here we would be talking about much more complex jobs on more interesting (i.e., less uniform) data.
We might even interest some of the cluster RDBMS players (Teradata, Vertica, Greenplum, Oracle Exadata, ParAccel, and/or Aster Data, to name a few) in running this workload using their map-reduce analogs.
We see that as we get to topics beyond relational BI, we do not find ourselves in an RDF-only world but very much at a crossroads of many technologies, e.g., map-reduce and its database analogs, various custom built databases, graph libraries, data integration and cleaning tools, and so forth.
There is not, nor ought there to be, a sheltered, RDF-only enclave. RDF will have to justify itself in a world of alternatives.
This must be reflected in our benchmark development, so relational BI is not irrelevant; in fact, it is what everybody does. RDF cannot be a total failure at this, even if this were not RDF's claim to fame. The claim to fame comes after we pass this stage, which is what we intend to explore in SNIB.
Benchmarks, Redux Series- Benchmarks, Redux (part 1): On RDF Benchmarks
- Benchmarks, Redux (part 2): A Benchmarking Story
- Benchmarks, Redux (part 3): Virtuoso 7 vs 6 on BSBM Load and Explore
- Benchmarks, Redux (part 4): Benchmark Tuning Questionnaire
- Benchmarks, Redux (part 5): BSBM and I/O; HDDs and SSDs
- Benchmarks, Redux (part 6): BSBM and I/O, continued
- Benchmarks, Redux (part 7): What Does BSBM Explore Measure?
- Benchmarks, Redux (part 8): BSBM Explore and Update
- Benchmarks, Redux (part 9): BSBM With Cluster
- Benchmarks, Redux (part 10): LOD2 and the Benchmark Process
- Benchmarks, Redux (part 11): On the Substance of RDF Benchmarks (this post)
- Benchmarks, Redux (part 12): Our Own BSBM Results Report
- Benchmarks, Redux (part 13): BSBM BI Modifications
- Benchmarks, Redux (part 14): BSBM BI Mix
- Benchmarks, Redux (part 15): BSBM Test Driver Enhancements
-
-
23:29
Benchmarks, Redux (part 10): LOD2 and the Benchmark Process
» OpenLink Community BlogI have in the previous posts generally argued for and demonstrated the usefulness of benchmarks.
Here I will talk about how this could be organized in a way that is tractable, and takes vendor and end user interests into account. These are my views on the subject and do not represent a LOD2 members consensus, but have been discussed in the consortium.
My colleague Ivan Mikhailov once proposed that the only way to get benchmarks run right is to package them as a single script that does everything, like instant noodles -- just add water! But even instant noodles can be abused: Cook too long, add too much water, maybe forget to light the stove, and complain that the result is unsatisfyingly hard and brittle, lacking the suppleness one has grown to expect from this delicacy. No, the answer lies at the other end of the culinary spectrum, in gourmet cooking. Let the best cooks show what they can do, and let them work at it; let those who in fact have capacity and motivation for creating le chef d'oeuvre culinaire ("the culinary masterpiece") create it. Even so, there are many value points along the dimensions of preparation time, cost, and esthetic layout, not to forget taste and nutritional values. Indeed, an intimate knowledge de la vie secrete du canard ("the secret life of duck") is required in order to liberate the aroma that it might take flight and soar. In the previous, I have shed some light on how we prepare le canard, and if le canard be such then la dinde (turkey) might in some ways be analogous; who is to say?
In other words, as a vendor, we want to have complete control over the benchmarking process, and have it take place in our environment at a time of our choice. In exchange for this, we are ready to document and observe possibly complicated rules, document how the runs are made, and let others monitor and repeat them on the equipment on which the results are obtained. This is the TPC (Transaction Processing Performance Council) model.
Another culture of doing benchmarks is the periodic challenge model used in TREC, the Billion Triples Challenge, the Semantic Search Challenge and others. In this model, vendors prepare the benchmark submission and agree to joint publication.
A third party performing benchmarks by itself is uncommon in databases. Licenses even often explicitly prohibit this, for understandable reasons.
The LOD2 project has an outreach activity called Publink where we offer to help owners of data to publish it as Linked Data. Similarly, since FP 7s are supposed to offer a visible service to their communities, I proposed that LOD2 offer to serve a role in disseminating and auditing RDF store benchmarks.
One representative of an RDF store vendor I talked to, in relation to setting up a benchmark configuration of their product, told me that we could do this and that they would give some advice but that such an exercise was by its nature fundamentally flawed and could not possibly produce worthwhile results. The reason for this was that OpenLink engineers could not possibly learn enough about the other products nor unlearn enough of their own to make this a meaningful comparison.
Isn't this the very truth? Let the chefs mix their own spices.
This does not mean that there would not be comparability of results. If the benchmarks and processes are well defined, documented, and checked by a third party, these can be considered legitimate and not just one-off best-case results without further import.
In order to stretch the envelope, which is very much a LOD2 goal, this benchmarking should be done on a variety of equipment -- whatever works best at the scale in question. Increasing the scale remains a stated objective. LOD2 even promised to run things with a trillion triples in another 3 years.
Imagine that the unimpeachably impartial Berliners made house calls. Would this debase Justice to be a servant of mere show-off? Or would this on the contrary combine strict Justice with edifying Charity? Who indeed is in greater need of the light of objective evaluation than the vendor whose very nature makes a being of bias and prejudice?
Even better, CWI, with its stellar database pedigree, agreed in principle to audit RDF benchmarks in LOD2.
In this way one could get a stamp of approval for one's results regardless of when they were produced, and be free of the arbitrary schedule of third party benchmarking runs. On the relational side this is a process of some cost and complexity, but since the RDF side is still young and more on mutually friendly terms, the process can be somewhat lighter here. I did promise to draft some extra descriptions of process and result disclosure so that we could see how this goes.
We could even do this unilaterally -- just publish Virtuoso results according to a predefined reporting and verification format. If others wished to publish by the same rules, LOD2 could use some of the benchmarking funds for auditing the proceedings. This could all take place over the net, so we are not talking about any huge cost or prohibitive amount of trouble. It would be in the FP7 spirit that LOD2 provide this service for free, naturally within reason.
Then there is the matter of the BSBM Business Intelligence (BI) mix. At present, it seems everybody has chosen to defer the matter to another round of BSBM runs in the summer. This seems to fit the pattern of a public challenge with a few months given for contenders to prepare their submissions. Here we certainly should look at bigger scales and more diverse hardware than in the Berlin runs published this time around. The BI workload is in fact fairly cluster friendly, with big joins and aggregations that parallelize well. There it would definitely make sense to reserve an actual cluster, and have all contenders set up their gear on it. If all have access to the run environment and to monitoring tools, we can be reasonably sure that things will be done in a transparent manner.
(I will talk about the BI mix in more detail in part 13 and part 14 of this series.)
Once the BI mix has settled and there are a few interoperable implementations, likely in the summer, we could pass from the challenge model to a situation where vendors may publish results as they become available, with LOD2 offering its services for audit.
Of course, this could be done even before then, but the content of the mix might not be settled. We likely need to check it on a few implementations first.
For equipment, people can use their own, or LOD2 partners might on a case-by-case basis make some equipment available for running on the same hardware on which say the Virtuoso results were obtained. For example, FU Berlin could give people a login to get their recently published results fixed. Now this might or might not happen, so I will not hold my breath waiting for this but instead close with a proposal.
As a unilateral diplomatic overture I put forth the following: If other vendors are interested in 1:1 comparison of their results with our publications, we can offer them a login to the same equipment. They can set up and tune their systems, and perform the runs. We will just watch. As an extra quid pro quo, they can try Virtuoso as configured for the results we have published, with the same data. Like this, both parties get to see the others' technology with proper tuning and installation. What, if anything, is reported about this activity is up to the owner of the technology being tested. We will publish a set of benchmark rules that can serve as a guideline for mutually comparable reporting, but we cannot force anybody to use these. This all will function as a catalyst for technological advance, all to the ultimate benefit of the end user. If you wish to take advantage of this offer, you may contact Hugh Williams at OpenLink Software, and we will see how this can be arranged in practice.
The next post will talk about the actual content of benchmarks. The milestone after this will be when we publish the measurement and reporting protocols.
Benchmarks, Redux Series- Benchmarks, Redux (part 1): On RDF Benchmarks
- Benchmarks, Redux (part 2): A Benchmarking Story
- Benchmarks, Redux (part 3): Virtuoso 7 vs 6 on BSBM Load and Explore
- Benchmarks, Redux (part 4): Benchmark Tuning Questionnaire
- Benchmarks, Redux (part 5): BSBM and I/O; HDDs and SSDs
- Benchmarks, Redux (part 6): BSBM and I/O, continued
- Benchmarks, Redux (part 7): What Does BSBM Explore Measure?
- Benchmarks, Redux (part 8): BSBM Explore and Update
- Benchmarks, Redux (part 9): BSBM With Cluster
- Benchmarks, Redux (part 10): LOD2 and the Benchmark Process (this post)
- Benchmarks, Redux (part 11): On the Substance of RDF Benchmarks
- Benchmarks, Redux (part 12): Our Own BSBM Results Report
- Benchmarks, Redux (part 13): BSBM BI Modifications
- Benchmarks, Redux (part 14): BSBM BI Mix
- Benchmarks, Redux (part 15): BSBM Test Driver Enhancements
-
23:29
Benchmarks, Redux (part 10): LOD2 and the Benchmark Process
» OpenLink Community BlogI have in the previous posts generally argued for and demonstrated the usefulness of benchmarks.
Here I will talk about how this could be organized in a way that is tractable, and takes vendor and end user interests into account. These are my views on the subject and do not represent a LOD2 members consensus, but have been discussed in the consortium.
My colleague Ivan Mikhailov once proposed that the only way to get benchmarks run right is to package them as a single script that does everything, like instant noodles -- just add water! But even instant noodles can be abused: Cook too long, add too much water, maybe forget to light the stove, and complain that the result is unsatisfyingly hard and brittle, lacking the suppleness one has grown to expect from this delicacy. No, the answer lies at the other end of the culinary spectrum, in gourmet cooking. Let the best cooks show what they can do, and let them work at it; let those who in fact have capacity and motivation for creating le chef d'oeuvre culinaire ("the culinary masterpiece") create it. Even so, there are many value points along the dimensions of preparation time, cost, and esthetic layout, not to forget taste and nutritional values. Indeed, an intimate knowledge de la vie secrete du canard ("the secret life of duck") is required in order to liberate the aroma that it might take flight and soar. In the previous, I have shed some light on how we prepare le canard, and if le canard be such then la dinde (turkey) might in some ways be analogous; who is to say?
In other words, as a vendor, we want to have complete control over the benchmarking process, and have it take place in our environment at a time of our choice. In exchange for this, we are ready to document and observe possibly complicated rules, document how the runs are made, and let others monitor and repeat them on the equipment on which the results are obtained. This is the TPC (Transaction Processing Performance Council) model.
Another culture of doing benchmarks is the periodic challenge model used in TREC, the Billion Triples Challenge, the Semantic Search Challenge and others. In this model, vendors prepare the benchmark submission and agree to joint publication.
A third party performing benchmarks by itself is uncommon in databases. Licenses even often explicitly prohibit this, for understandable reasons.
The LOD2 project has an outreach activity called Publink where we offer to help owners of data to publish it as Linked Data. Similarly, since FP 7s are supposed to offer a visible service to their communities, I proposed that LOD2 offer to serve a role in disseminating and auditing RDF store benchmarks.
One representative of an RDF store vendor I talked to, in relation to setting up a benchmark configuration of their product, told me that we could do this and that they would give some advice but that such an exercise was by its nature fundamentally flawed and could not possibly produce worthwhile results. The reason for this was that OpenLink engineers could not possibly learn enough about the other products nor unlearn enough of their own to make this a meaningful comparison.
Isn't this the very truth? Let the chefs mix their own spices.
This does not mean that there would not be comparability of results. If the benchmarks and processes are well defined, documented, and checked by a third party, these can be considered legitimate and not just one-off best-case results without further import.
In order to stretch the envelope, which is very much a LOD2 goal, this benchmarking should be done on a variety of equipment -- whatever works best at the scale in question. Increasing the scale remains a stated objective. LOD2 even promised to run things with a trillion triples in another 3 years.
Imagine that the unimpeachably impartial Berliners made house calls. Would this debase Justice to be a servant of mere show-off? Or would this on the contrary combine strict Justice with edifying Charity? Who indeed is in greater need of the light of objective evaluation than the vendor whose very nature makes a being of bias and prejudice?
Even better, CWI, with its stellar database pedigree, agreed in principle to audit RDF benchmarks in LOD2.
In this way one could get a stamp of approval for one's results regardless of when they were produced, and be free of the arbitrary schedule of third party benchmarking runs. On the relational side this is a process of some cost and complexity, but since the RDF side is still young and more on mutually friendly terms, the process can be somewhat lighter here. I did promise to draft some extra descriptions of process and result disclosure so that we could see how this goes.
We could even do this unilaterally -- just publish Virtuoso results according to a predefined reporting and verification format. If others wished to publish by the same rules, LOD2 could use some of the benchmarking funds for auditing the proceedings. This could all take place over the net, so we are not talking about any huge cost or prohibitive amount of trouble. It would be in the FP7 spirit that LOD2 provide this service for free, naturally within reason.
Then there is the matter of the BSBM Business Intelligence (BI) mix. At present, it seems everybody has chosen to defer the matter to another round of BSBM runs in the summer. This seems to fit the pattern of a public challenge with a few months given for contenders to prepare their submissions. Here we certainly should look at bigger scales and more diverse hardware than in the Berlin runs published this time around. The BI workload is in fact fairly cluster friendly, with big joins and aggregations that parallelize well. There it would definitely make sense to reserve an actual cluster, and have all contenders set up their gear on it. If all have access to the run environment and to monitoring tools, we can be reasonably sure that things will be done in a transparent manner.
(I will talk about the BI mix in more detail in part 13 and part 14 of this series.)
Once the BI mix has settled and there are a few interoperable implementations, likely in the summer, we could pass from the challenge model to a situation where vendors may publish results as they become available, with LOD2 offering its services for audit.
Of course, this could be done even before then, but the content of the mix might not be settled. We likely need to check it on a few implementations first.
For equipment, people can use their own, or LOD2 partners might on a case-by-case basis make some equipment available for running on the same hardware on which say the Virtuoso results were obtained. For example, FU Berlin could give people a login to get their recently published results fixed. Now this might or might not happen, so I will not hold my breath waiting for this but instead close with a proposal.
As a unilateral diplomatic overture I put forth the following: If other vendors are interested in 1:1 comparison of their results with our publications, we can offer them a login to the same equipment. They can set up and tune their systems, and perform the runs. We will just watch. As an extra quid pro quo, they can try Virtuoso as configured for the results we have published, with the same data. Like this, both parties get to see the others' technology with proper tuning and installation. What, if anything, is reported about this activity is up to the owner of the technology being tested. We will publish a set of benchmark rules that can serve as a guideline for mutually comparable reporting, but we cannot force anybody to use these. This all will function as a catalyst for technological advance, all to the ultimate benefit of the end user. If you wish to take advantage of this offer, you may contact Hugh Williams at OpenLink Software, and we will see how this can be arranged in practice.
The next post will talk about the actual content of benchmarks. The milestone after this will be when we publish the measurement and reporting protocols.
Benchmarks, Redux Series- Benchmarks, Redux (part 1): On RDF Benchmarks
- Benchmarks, Redux (part 2): A Benchmarking Story
- Benchmarks, Redux (part 3): Virtuoso 7 vs 6 on BSBM Load and Explore
- Benchmarks, Redux (part 4): Benchmark Tuning Questionnaire
- Benchmarks, Redux (part 5): BSBM and I/O; HDDs and SSDs
- Benchmarks, Redux (part 6): BSBM and I/O, continued
- Benchmarks, Redux (part 7): What Does BSBM Explore Measure?
- Benchmarks, Redux (part 8): BSBM Explore and Update
- Benchmarks, Redux (part 9): BSBM With Cluster
- Benchmarks, Redux (part 10): LOD2 and the Benchmark Process (this post)
- Benchmarks, Redux (part 11): On the Substance of RDF Benchmarks
- Benchmarks, Redux (part 12): Our Own BSBM Results Report
- Benchmarks, Redux (part 13): BSBM BI Modifications
- Benchmarks, Redux (part 14): BSBM BI Mix
- Benchmarks, Redux (part 15): BSBM Test Driver Enhancements
-
22:54
Benchmarks, Redux (part 9): BSBM With Cluster
» OpenLink Community BlogThis post is dedicated to our brothers in horizontal partitioning (or sharding), Garlik and Bigdata.
At first sight, the BSBM Explore mix appears very cluster-unfriendly, as it contains short queries that access data at random. There is every opportunity for latency and few opportunities for parallelism.
For this reason we had not even run the BSBM mix with Virtuoso Cluster. We were not surprised to learn that Garlik hadn't run BSBM either. We have understood from Systap that their Bigdata BSBM experiments were on a single-process configuration.
But the 4Store results in the recent Berlin report were with a distributed setup, as 4Store always runs a multiprocess configuration, even on a single server, so it seemed interesting to us to compare how Virtuoso Cluster compares with Virtuoso Single with this workload. These tests were run on a different box than the recent BSBM tests, so those 4Store figures are not directly comparable.
The setup here consists of 8 partitions, each managed by its own process, all running on the same box. Any of these processes can have its HTTP and SQL listener and can provide the same service. Most access to data goes over the interconnect, except when the data is co-resident in the process which is coordinating the query. The interconnect is Unix domain sockets since all 8 processes are on the same box.
6 Cluster - Load Rates and Times Scale Rate
(quads per second)Load time
(seconds)Checkpoint time
(seconds)100 Mt 119,204 749 89 200 Mt 121,607 1486 157 1000 Mt 102,694 8737 979
6 Single - Load Rates and Times Scale Rate
(quads per second)Load time
(seconds)Checkpoint time
(seconds)100 Mt 74,713 1192 145 The load times are systematically better than for 6 Single. This is also not bad compared to the 7 Single vectored load rates of 220 Kt/s or so. We note that loading is a cluster friendly operation, going at a steady 1400+% CPU utilization with an aggregate message throughput of 40MB/s. 7 Single is faster because of vectoring at the index level, not because the clusters were hitting communication overheads. 6 Cluster is faster than 6 Single because scale-out in this case diminishes contention, even on a single box.
Throughput is as follows:
6 Cluster - Throughput
(QMpH, query mixes per hour)Scale Single User 16 User 100 Mt 7318 43120 200 Mt 6222 29981 1000 Mt 2526 11156
6 Single - Throughput
(QMpH, query mixes per hour)Scale Single User 16 User 100 Mt 7641 29433 200 Mt 6017 13335 1000 Mt 1770 2487 Below is a snapshot of status during the 6 Cluster 100 Mt run.
Cluster 8 nodes, 15 s. 25784 m/s 25682 KB/s 1160% cpu 0% read 740% clw threads 18r 0w 10i buffers 1133459 12 d 4 w 0 pfs cl 1: 10851 m/s 3911 KB/s 597% cpu 0% read 668% clw threads 17r 0w 10i buffers 143992 4 d 0 w 0 pfs cl 2: 2194 m/s 7959 KB/s 107% cpu 0% read 9% clw threads 1r 0w 0i buffers 143616 3 d 2 w 0 pfs cl 3: 2186 m/s 7818 KB/s 107% cpu 0% read 9% clw threads 0r 0w 0i buffers 140787 0 d 0 w 0 pfs cl 4: 2174 m/s 2804 KB/s 77% cpu 0% read 10% clw threads 0r 0w 0i buffers 140654 0 d 2 w 0 pfs cl 5: 2127 m/s 1612 KB/s 71% cpu 0% read 9% clw threads 0r 0w 0i buffers 140949 1 d 0 w 0 pfs cl 6: 2060 m/s 544 KB/s 66% cpu 0% read 10% clw threads 0r 0w 0i buffers 141295 2 d 0 w 0 pfs cl 7: 2072 m/s 517 KB/s 65% cpu 0% read 11% clw threads 0r 0w 0i buffers 141111 1 d 0 w 0 pfs cl 8: 2105 m/s 522 KB/s 66% cpu 0% read 10% clw threads 0r 0w 0i buffers 141055 1 d 0 w 0 pfsThe main meters for cluster execution are the messages-per-second (m/s), the message volume (KB/s), and the total CPU% of the processes.
We note that CPU utilization is highly uneven and messages are short, about 1K on the average, compared to about 100K during the load. CPU would be evenly divided between the nodes if each got a share of the HTTP requests. We changed the test driver to round-robin requests between multiple end points. The work does then get evenly divided, but the speed is not affected. Also, this does not improve the message sizes since the workload consists mostly of short lookups. However, with the processes spread over multiple servers, the round-robin would be essential for CPU and especially for interconnect throughput.
Then we try 6 Cluster at 1000 Mt. For Single User, we get 1180 m/s, 6955 KB/s, and 173% cpu. For 16 User, this is 6573 m/s, 44366 KB/s, 1470% cpu.
This is a lot better than the figures with 6 Single, due to lower contention on the index tree, as discussed in A Benchmarking Story. Also Single User throughput on 6 Cluster outperforms 6 Single, due to the natural parallelism of doing the Q5 joins in parallel in each partition. The larger the scale, the more weight this has in the metric. We see this also in the average message size, i.e., the KB/s throughput is almost double while the messages/s is a bit under a third.
The small-scale 6 Cluster run is about even with the 6 Single figure. Looking at the details, we see that the qps for Q1 in 6 Cluster is half of that on 6 Single, whereas the qps for Q5 on 6 Cluster is about double that of the 6 Single. This is as one might expect; longer queries are favored, and single row lookups are penalized.
Looking further at the 6 Cluster status we see the cluster wait (
clw) to be 740%. For 16 Users, this means that about half of the execution real time is spent waiting for responses from other partitions. A high figure means uneven distribution between partitions; a low figure means even. This is as expected, since many queries are concerned with just one S and its related objects.We will update this section once 7 Cluster is ready. This will implement vectored execution and column store inside the cluster nodes.
Benchmarks, Redux Series- Benchmarks, Redux (part 1): On RDF Benchmarks
- Benchmarks, Redux (part 2): A Benchmarking Story
- Benchmarks, Redux (part 3): Virtuoso 7 vs 6 on BSBM Load and Explore
- Benchmarks, Redux (part 4): Benchmark Tuning Questionnaire
- Benchmarks, Redux (part 5): BSBM and I/O; HDDs and SSDs
- Benchmarks, Redux (part 6): BSBM and I/O, continued
- Benchmarks, Redux (part 7): What Does BSBM Explore Measure?
- Benchmarks, Redux (part 8): BSBM Explore and Update
- Benchmarks, Redux (part 9): BSBM With Cluster (this post)
- Benchmarks, Redux (part 10): LOD2 and the Benchmark Process
- Benchmarks, Redux (part 11): On the Substance of RDF Benchmarks
- Benchmarks, Redux (part 12): Our Own BSBM Results Report
- Benchmarks, Redux (part 13): BSBM BI Modifications
- Benchmarks, Redux (part 14): BSBM BI Mix
- Benchmarks, Redux (part 15): BSBM Test Driver Enhancements
-
22:54
Benchmarks, Redux (part 9): BSBM With Cluster
» OpenLink Community BlogThis post is dedicated to our brothers in horizontal partitioning (or sharding), Garlik and Bigdata.
At first sight, the BSBM Explore mix appears very cluster-unfriendly, as it contains short queries that access data at random. There is every opportunity for latency and few opportunities for parallelism.
For this reason we had not even run the BSBM mix with Virtuoso Cluster. We were not surprised to learn that Garlik hadn't run BSBM either. We have understood from Systap that their Bigdata BSBM experiments were on a single-process configuration.
But the 4Store results in the recent Berlin report were with a distributed setup, as 4Store always runs a multiprocess configuration, even on a single server, so it seemed interesting to us to compare how Virtuoso Cluster compares with Virtuoso Single with this workload. These tests were run on a different box than the recent BSBM tests, so those 4Store figures are not directly comparable.
The setup here consists of 8 partitions, each managed by its own process, all running on the same box. Any of these processes can have its HTTP and SQL listener and can provide the same service. Most access to data goes over the interconnect, except when the data is co-resident in the process which is coordinating the query. The interconnect is Unix domain sockets since all 8 processes are on the same box.
6 Cluster - Load Rates and Times Scale Rate
(quads per second)Load time
(seconds)Checkpoint time
(seconds)100 Mt 119,204 749 89 200 Mt 121,607 1486 157 1000 Mt 102,694 8737 979
6 Single - Load Rates and Times Scale Rate
(quads per second)Load time
(seconds)Checkpoint time
(seconds)100 Mt 74,713 1192 145 The load times are systematically better than for 6 Single. This is also not bad compared to the 7 Single vectored load rates of 220 Kt/s or so. We note that loading is a cluster friendly operation, going at a steady 1400+% CPU utilization with an aggregate message throughput of 40MB/s. 7 Single is faster because of vectoring at the index level, not because the clusters were hitting communication overheads. 6 Cluster is faster than 6 Single because scale-out in this case diminishes contention, even on a single box.
Throughput is as follows:
6 Cluster - Throughput
(QMpH, query mixes per hour)Scale Single User 16 User 100 Mt 7318 43120 200 Mt 6222 29981 1000 Mt 2526 11156
6 Single - Throughput
(QMpH, query mixes per hour)Scale Single User 16 User 100 Mt 7641 29433 200 Mt 6017 13335 1000 Mt 1770 2487 Below is a snapshot of status during the 6 Cluster 100 Mt run.
Cluster 8 nodes, 15 s. 25784 m/s 25682 KB/s 1160% cpu 0% read 740% clw threads 18r 0w 10i buffers 1133459 12 d 4 w 0 pfs cl 1: 10851 m/s 3911 KB/s 597% cpu 0% read 668% clw threads 17r 0w 10i buffers 143992 4 d 0 w 0 pfs cl 2: 2194 m/s 7959 KB/s 107% cpu 0% read 9% clw threads 1r 0w 0i buffers 143616 3 d 2 w 0 pfs cl 3: 2186 m/s 7818 KB/s 107% cpu 0% read 9% clw threads 0r 0w 0i buffers 140787 0 d 0 w 0 pfs cl 4: 2174 m/s 2804 KB/s 77% cpu 0% read 10% clw threads 0r 0w 0i buffers 140654 0 d 2 w 0 pfs cl 5: 2127 m/s 1612 KB/s 71% cpu 0% read 9% clw threads 0r 0w 0i buffers 140949 1 d 0 w 0 pfs cl 6: 2060 m/s 544 KB/s 66% cpu 0% read 10% clw threads 0r 0w 0i buffers 141295 2 d 0 w 0 pfs cl 7: 2072 m/s 517 KB/s 65% cpu 0% read 11% clw threads 0r 0w 0i buffers 141111 1 d 0 w 0 pfs cl 8: 2105 m/s 522 KB/s 66% cpu 0% read 10% clw threads 0r 0w 0i buffers 141055 1 d 0 w 0 pfsThe main meters for cluster execution are the messages-per-second (m/s), the message volume (KB/s), and the total CPU% of the processes.
We note that CPU utilization is highly uneven and messages are short, about 1K on the average, compared to about 100K during the load. CPU would be evenly divided between the nodes if each got a share of the HTTP requests. We changed the test driver to round-robin requests between multiple end points. The work does then get evenly divided, but the speed is not affected. Also, this does not improve the message sizes since the workload consists mostly of short lookups. However, with the processes spread over multiple servers, the round-robin would be essential for CPU and especially for interconnect throughput.
Then we try 6 Cluster at 1000 Mt. For Single User, we get 1180 m/s, 6955 KB/s, and 173% cpu. For 16 User, this is 6573 m/s, 44366 KB/s, 1470% cpu.
This is a lot better than the figures with 6 Single, due to lower contention on the index tree, as discussed in A Benchmarking Story. Also Single User throughput on 6 Cluster outperforms 6 Single, due to the natural parallelism of doing the Q5 joins in parallel in each partition. The larger the scale, the more weight this has in the metric. We see this also in the average message size, i.e., the KB/s throughput is almost double while the messages/s is a bit under a third.
The small-scale 6 Cluster run is about even with the 6 Single figure. Looking at the details, we see that the qps for Q1 in 6 Cluster is half of that on 6 Single, whereas the qps for Q5 on 6 Cluster is about double that of the 6 Single. This is as one might expect; longer queries are favored, and single row lookups are penalized.
Looking further at the 6 Cluster status we see the cluster wait (
clw) to be 740%. For 16 Users, this means that about half of the execution real time is spent waiting for responses from other partitions. A high figure means uneven distribution between partitions; a low figure means even. This is as expected, since many queries are concerned with just one S and its related objects.We will update this section once 7 Cluster is ready. This will implement vectored execution and column store inside the cluster nodes.
Benchmarks, Redux Series- Benchmarks, Redux (part 1): On RDF Benchmarks
- Benchmarks, Redux (part 2): A Benchmarking Story
- Benchmarks, Redux (part 3): Virtuoso 7 vs 6 on BSBM Load and Explore
- Benchmarks, Redux (part 4): Benchmark Tuning Questionnaire
- Benchmarks, Redux (part 5): BSBM and I/O; HDDs and SSDs
- Benchmarks, Redux (part 6): BSBM and I/O, continued
- Benchmarks, Redux (part 7): What Does BSBM Explore Measure?
- Benchmarks, Redux (part 8): BSBM Explore and Update
- Benchmarks, Redux (part 9): BSBM With Cluster (this post)
- Benchmarks, Redux (part 10): LOD2 and the Benchmark Process
- Benchmarks, Redux (part 11): On the Substance of RDF Benchmarks
- Benchmarks, Redux (part 12): Our Own BSBM Results Report
- Benchmarks, Redux (part 13): BSBM BI Modifications
- Benchmarks, Redux (part 14): BSBM BI Mix
- Benchmarks, Redux (part 15): BSBM Test Driver Enhancements
-
17:32
Benchmarks, Redux (part 8): BSBM Explore and Update
» OpenLink Community BlogWe will here look at the Explore and Update scenario of BSBM. This presents us with a novel problem as the specification does not address any aspect of ACID.
A transaction benchmark ought to have something to say about this. The SPARUL (also known as SPARQL/Update) language does not say anything about transactionality, but I suppose it is in the spirit of the SPARUL protocol to promise atomicity and durability.
We begin by running Virtuoso 7 Single, with Single User and 16 User, each at scales of 100 Mt, 200 Mt, and 1000 Mt. The transactionality is default, meaning
SERIALIZABLEisolation betweenINSERTsandDELETEs, andREAD COMMITTEDisolation betweenREADand anyUPDATEtransaction. (Figures for Virtuoso 6 will also be presented here in the near future, as they are the currently shipping production versions.)Virtuoso 7 Single, Full ACID
(QMpH, query mixes per hour)Scale Single User 16 User 100 Mt 9,969 65,537 200 Mt 8,646 40,527 1000 Mt 5,512 17,293
Virtuoso 6 Cluster, Full ACID
(QMpH, query mixes per hour)Scale Single User 16 User 100 Mt 5604.520 34079.019 1000 Mt 2866.616 10028.325
Virtuoso 6 Single, Full ACID
(QMpH, query mixes per hour)Scale Single User 16 User 100 Mt 7,152 21,065 200 Mt 5,862 16,895 1000 Mt 1,542 4,548 Each run is preceded by a warm-up of 500 or 300 mixes (the exact number is not material), resulting in a warm cache; see previous post on read-ahead for details. All runs do 1000 Explore and Update mixes. The initial database is in the state following the Explore only runs.
The results are in line with the Explore results. There is a fair amount of variability between consecutive runs; the 16 User run at 1000 Mt varies between 14K and 19K QMpH depending on the measurement. The smaller runs exhibit less variability.
In the following we will look at transactions and at how the definition of the workload and reporting could be made complete.
Full ACID means serializable semantic of concurrent insert and delete of the same quad. Non-transactional means that on concurrent insert and delete of overlapping sets of quads the result is undefined. Further if one logged such "transactions," the replay would give serialization although the initial execution did not, hence further confusing the issue. Considering the hypothetical use case of an e-commerce information portal, there is little chance of deletes and inserts actually needing serialization. An insert-only workload does not need serializability because an insert cannot fail. If the data already exists the insert does nothing, if the quad does not previously exist it is created. The same applies to deletes alone. If a delete and insert overlap, serialization would be needed but the semantics implicit in the use case make this improbable.
Read-only transactions (i.e., the Explore mix in the Explore and Update scenario) will be run as
READ COMMITTED. These do not see uncommitted data and never block for lock wait. The reads may not be repeatable.Our first point of call is to determine the cost of ACID. We run 1000 mixes of Explore and Update at 1000 Mt. The throughput is 19214 after a warm-up of 500 mixes. This is pretty good in comparison with the diverse read-only results at this scale.
We look at the pertinent statistics:
SELECT TOP 5 * FROM sys_l_stat ORDER BY waits DESC;
KEY_TABLE INDEX_NAME LOCKS WAITS WAIT_PCT DEADLOCKS LOCK_ESC WAIT_MSECS =============== ============= ====== ===== ======== ========= ======== ========== DB.DBA.RDF_QUAD RDF_QUAD_POGS 179205 934 0 0 0 35164 DB.DBA.RDF_IRI RDF_IRI 20752 217 1 0 0 16445 DB.DBA.RDF_QUAD RDF_QUAD_SP 9244 3 0 0 0 235
We see 934 waits with a total duration of 35 seconds on the index with the most contention. The run was 187 seconds, real time. The lock wait time is not real time since this is the total elapsed wait time summed over all threads. The lock wait frequency is a little over one per query mix, meaning a little over one per five locking transactions.
We note that we do not get deadlocks since all inserts and deletes are in ascending key order due to vectoring. This guarantees the absence of deadlocks for single insert transactions, as long as the transaction stays within the vector size. This is always the case since the inserts are a few hundred triples at the maximum. The waits concentrate on POGS, because this is a bitmap index where the locking resolution is less than a row, and the values do not correlate with insert order. The locking behavior could be better with the column store, where we would have row level locking also for this index. This is to be seen. The column store would otherwise tend to have higher cost per random insert.
Considering these results it does not seem crucial to "drop ACID," though doing so would save some time. We will now run measurements for all scales with 16 Users and ACID.
Let us now see what the benchmark writes:
SELECT TOP 10 * FROM sys_d_stat ORDER BY n_dirty DESC;
KEY_TABLE INDEX_NAME TOUCHES READS READ_PCT N_DIRTY N_BUFFERS =========================== ============================ ========= ======= ======== ======= ========= DB.DBA.RDF_QUAD RDF_QUAD_POGS 763846891 237436 0 58040 228606 DB.DBA.RDF_QUAD RDF_QUAD 213282706 1991836 0 30226 1940280 DB.DBA.RDF_OBJ RO_VAL 15474 17837 115 13438 17431 DB.DBA.RO_START RO_START 10573 11195 105 10228 11227 DB.DBA.RDF_IRI RDF_IRI 61902 125711 203 7705 121300 DB.DBA.RDF_OBJ RDF_OBJ 23809053 3205963 13 636 3072517 DB.DBA.RDF_IRI DB_DBA_RDF_IRI_UNQC_RI_ID 3237687 504486 15 340 488797 DB.DBA.RDF_QUAD RDF_QUAD_SP 89995 70446 78 99 68340 DB.DBA.RDF_QUAD RDF_QUAD_OP 19440 47541 244 66 45583 DB.DBA.VTLOG_DB_DBA_RDF_OBJ VTLOG_DB_DBA_RDF_OBJ 3014 1 0 11 11 DB.DBA.RDF_QUAD RDF_QUAD_GS 1261 801 63 10 751 DB.DBA.RDF_PREFIX RDF_PREFIX 14 168 1120 1 153 DB.DBA.RDF_PREFIX DB_DBA_RDF_PREFIX_UNQC_RP_ID 1807 200 11 1 200
The most dirty pages are on the
POGSindex, which is reasonable; values are spread out at random. After this we have thePSOGindex, likely because of random deletes. New IRIs tend to get consecutive numbers and do not make many dirty pages. Literals come next, with the index from leading string or hash of the literal to id leading, as one would expect, again because of values being distributed at random. After this come IRIs. The distribution of updates is generally as one would expect.* * *
Going back to BSBM, at least the following aspects of the benchmark have to be further specified:
-
Disclosure of ACID properties. If the benchmark required full ACID many would not run this at all. Besides full ACID is not necessarily an absolute requirement based on the hypothetical usage scenario of the benchmark. However, when publishing numbers the guarantees that go with the numbers must be made explicit. This includes logging, checkpoint frequency or equivalent etc.
-
Steady state. The working set of the Update mix is different from that of the Explore mixes. This touches more indices than Explore. The Explore warm-up is in part good but does not represent steady state.
-
Checkpoint and sustained throughput. Benchmarks involving update generally have rules for checkpointing the state and for sustained throughput. In specific, the throughput of an update benchmark cannot rely on never flushing to persistent storage. Even bulk load must be timed with a checkpoint guaranteeing durability at the end. A steady update stream should be timed with a test interval of sufficient length involving a few checkpoints; for example, a minimum duration of 30 minutes with no less than 3 completed checkpoints in the interval with at least 9 minutes between the end of one and the start of the next. Not all DBMSs work with logs and checkpoints, but if an alternate scheme is used then this needs to be described.
-
Memory and warm-up issues.We have seen the test data generator run out of memory when trying to generate update streams of meaningful length. Also the test driver should allow running updates in timed and non-timed mode (warm-up).
With an update benchmark, many more things need to be defined, and the set-up becomes more system specific, than with a read-only workload. We will address these shortcomings in the measurement rules proposal to come. Especially with update workloads, the vendors need to provide tuning expertise; however, this will not happen if the benchmark does not properly set the expectations. If benchmarks serve as a catalyst for clearly defining how things are to be set up, then they will have served the end user.
Benchmarks, Redux Series- Benchmarks, Redux (part 1): On RDF Benchmarks
- Benchmarks, Redux (part 2): A Benchmarking Story
- Benchmarks, Redux (part 3): Virtuoso 7 vs 6 on BSBM Load and Explore
- Benchmarks, Redux (part 4): Benchmark Tuning Questionnaire
- Benchmarks, Redux (part 5): BSBM and I/O; HDDs and SSDs
- Benchmarks, Redux (part 6): BSBM and I/O, continued
- Benchmarks, Redux (part 7): What Does BSBM Explore Measure?
- Benchmarks, Redux (part 8): BSBM Explore and Update (this post)
- Benchmarks, Redux (part 9): BSBM With Cluster
- Benchmarks, Redux (part 10): LOD2 and the Benchmark Process
- Benchmarks, Redux (part 11): On the Substance of RDF Benchmarks
- Benchmarks, Redux (part 12): Our Own BSBM Results Report
- Benchmarks, Redux (part 13): BSBM BI Modifications
- Benchmarks, Redux (part 14): BSBM BI Mix
- Benchmarks, Redux (part 15): BSBM Test Driver Enhancements
-
-
17:32
Benchmarks, Redux (part 8): BSBM Explore and Update
» OpenLink Community BlogWe will here look at the Explore and Update scenario of BSBM. This presents us with a novel problem as the specification does not address any aspect of ACID.
A transaction benchmark ought to have something to say about this. The SPARUL (also known as SPARQL/Update) language does not say anything about transactionality, but I suppose it is in the spirit of the SPARUL protocol to promise atomicity and durability.
We begin by running Virtuoso 7 Single, with Single User and 16 User, each at scales of 100 Mt, 200 Mt, and 1000 Mt. The transactionality is default, meaning
SERIALIZABLEisolation betweenINSERTsandDELETEs, andREAD COMMITTEDisolation betweenREADand anyUPDATEtransaction. (Figures for Virtuoso 6 will also be presented here in the near future, as they are the currently shipping production versions.)Virtuoso 7 Single, Full ACID
(QMpH, query mixes per hour)Scale Single User 16 User 100 Mt 9,969 65,537 200 Mt 8,646 40,527 1000 Mt 5,512 17,293
Virtuoso 6 Cluster, Full ACID
(QMpH, query mixes per hour)Scale Single User 16 User 100 Mt 5604.520 34079.019 1000 Mt 2866.616 10028.325
Virtuoso 6 Single, Full ACID
(QMpH, query mixes per hour)Scale Single User 16 User 100 Mt 7,152 21,065 200 Mt 5,862 16,895 1000 Mt 1,542 4,548 Each run is preceded by a warm-up of 500 or 300 mixes (the exact number is not material), resulting in a warm cache; see previous post on read-ahead for details. All runs do 1000 Explore and Update mixes. The initial database is in the state following the Explore only runs.
The results are in line with the Explore results. There is a fair amount of variability between consecutive runs; the 16 User run at 1000 Mt varies between 14K and 19K QMpH depending on the measurement. The smaller runs exhibit less variability.
In the following we will look at transactions and at how the definition of the workload and reporting could be made complete.
Full ACID means serializable semantic of concurrent insert and delete of the same quad. Non-transactional means that on concurrent insert and delete of overlapping sets of quads the result is undefined. Further if one logged such "transactions," the replay would give serialization although the initial execution did not, hence further confusing the issue. Considering the hypothetical use case of an e-commerce information portal, there is little chance of deletes and inserts actually needing serialization. An insert-only workload does not need serializability because an insert cannot fail. If the data already exists the insert does nothing, if the quad does not previously exist it is created. The same applies to deletes alone. If a delete and insert overlap, serialization would be needed but the semantics implicit in the use case make this improbable.
Read-only transactions (i.e., the Explore mix in the Explore and Update scenario) will be run as
READ COMMITTED. These do not see uncommitted data and never block for lock wait. The reads may not be repeatable.Our first point of call is to determine the cost of ACID. We run 1000 mixes of Explore and Update at 1000 Mt. The throughput is 19214 after a warm-up of 500 mixes. This is pretty good in comparison with the diverse read-only results at this scale.
We look at the pertinent statistics:
SELECT TOP 5 * FROM sys_l_stat ORDER BY waits DESC;
KEY_TABLE INDEX_NAME LOCKS WAITS WAIT_PCT DEADLOCKS LOCK_ESC WAIT_MSECS =============== ============= ====== ===== ======== ========= ======== ========== DB.DBA.RDF_QUAD RDF_QUAD_POGS 179205 934 0 0 0 35164 DB.DBA.RDF_IRI RDF_IRI 20752 217 1 0 0 16445 DB.DBA.RDF_QUAD RDF_QUAD_SP 9244 3 0 0 0 235
We see 934 waits with a total duration of 35 seconds on the index with the most contention. The run was 187 seconds, real time. The lock wait time is not real time since this is the total elapsed wait time summed over all threads. The lock wait frequency is a little over one per query mix, meaning a little over one per five locking transactions.
We note that we do not get deadlocks since all inserts and deletes are in ascending key order due to vectoring. This guarantees the absence of deadlocks for single insert transactions, as long as the transaction stays within the vector size. This is always the case since the inserts are a few hundred triples at the maximum. The waits concentrate on POGS, because this is a bitmap index where the locking resolution is less than a row, and the values do not correlate with insert order. The locking behavior could be better with the column store, where we would have row level locking also for this index. This is to be seen. The column store would otherwise tend to have higher cost per random insert.
Considering these results it does not seem crucial to "drop ACID," though doing so would save some time. We will now run measurements for all scales with 16 Users and ACID.
Let us now see what the benchmark writes:
SELECT TOP 10 * FROM sys_d_stat ORDER BY n_dirty DESC;
KEY_TABLE INDEX_NAME TOUCHES READS READ_PCT N_DIRTY N_BUFFERS =========================== ============================ ========= ======= ======== ======= ========= DB.DBA.RDF_QUAD RDF_QUAD_POGS 763846891 237436 0 58040 228606 DB.DBA.RDF_QUAD RDF_QUAD 213282706 1991836 0 30226 1940280 DB.DBA.RDF_OBJ RO_VAL 15474 17837 115 13438 17431 DB.DBA.RO_START RO_START 10573 11195 105 10228 11227 DB.DBA.RDF_IRI RDF_IRI 61902 125711 203 7705 121300 DB.DBA.RDF_OBJ RDF_OBJ 23809053 3205963 13 636 3072517 DB.DBA.RDF_IRI DB_DBA_RDF_IRI_UNQC_RI_ID 3237687 504486 15 340 488797 DB.DBA.RDF_QUAD RDF_QUAD_SP 89995 70446 78 99 68340 DB.DBA.RDF_QUAD RDF_QUAD_OP 19440 47541 244 66 45583 DB.DBA.VTLOG_DB_DBA_RDF_OBJ VTLOG_DB_DBA_RDF_OBJ 3014 1 0 11 11 DB.DBA.RDF_QUAD RDF_QUAD_GS 1261 801 63 10 751 DB.DBA.RDF_PREFIX RDF_PREFIX 14 168 1120 1 153 DB.DBA.RDF_PREFIX DB_DBA_RDF_PREFIX_UNQC_RP_ID 1807 200 11 1 200
The most dirty pages are on the
POGSindex, which is reasonable; values are spread out at random. After this we have thePSOGindex, likely because of random deletes. New IRIs tend to get consecutive numbers and do not make many dirty pages. Literals come next, with the index from leading string or hash of the literal to id leading, as one would expect, again because of values being distributed at random. After this come IRIs. The distribution of updates is generally as one would expect.* * *
Going back to BSBM, at least the following aspects of the benchmark have to be further specified:
-
Disclosure of ACID properties. If the benchmark required full ACID many would not run this at all. Besides full ACID is not necessarily an absolute requirement based on the hypothetical usage scenario of the benchmark. However, when publishing numbers the guarantees that go with the numbers must be made explicit. This includes logging, checkpoint frequency or equivalent etc.
-
Steady state. The working set of the Update mix is different from that of the Explore mixes. This touches more indices than Explore. The Explore warm-up is in part good but does not represent steady state.
-
Checkpoint and sustained throughput. Benchmarks involving update generally have rules for checkpointing the state and for sustained throughput. In specific, the throughput of an update benchmark cannot rely on never flushing to persistent storage. Even bulk load must be timed with a checkpoint guaranteeing durability at the end. A steady update stream should be timed with a test interval of sufficient length involving a few checkpoints; for example, a minimum duration of 30 minutes with no less than 3 completed checkpoints in the interval with at least 9 minutes between the end of one and the start of the next. Not all DBMSs work with logs and checkpoints, but if an alternate scheme is used then this needs to be described.
-
Memory and warm-up issues.We have seen the test data generator run out of memory when trying to generate update streams of meaningful length. Also the test driver should allow running updates in timed and non-timed mode (warm-up).
With an update benchmark, many more things need to be defined, and the set-up becomes more system specific, than with a read-only workload. We will address these shortcomings in the measurement rules proposal to come. Especially with update workloads, the vendors need to provide tuning expertise; however, this will not happen if the benchmark does not properly set the expectations. If benchmarks serve as a catalyst for clearly defining how things are to be set up, then they will have served the end user.
Benchmarks, Redux Series- Benchmarks, Redux (part 1): On RDF Benchmarks
- Benchmarks, Redux (part 2): A Benchmarking Story
- Benchmarks, Redux (part 3): Virtuoso 7 vs 6 on BSBM Load and Explore
- Benchmarks, Redux (part 4): Benchmark Tuning Questionnaire
- Benchmarks, Redux (part 5): BSBM and I/O; HDDs and SSDs
- Benchmarks, Redux (part 6): BSBM and I/O, continued
- Benchmarks, Redux (part 7): What Does BSBM Explore Measure?
- Benchmarks, Redux (part 8): BSBM Explore and Update (this post)
- Benchmarks, Redux (part 9): BSBM With Cluster
- Benchmarks, Redux (part 10): LOD2 and the Benchmark Process
- Benchmarks, Redux (part 11): On the Substance of RDF Benchmarks
- Benchmarks, Redux (part 12): Our Own BSBM Results Report
- Benchmarks, Redux (part 13): BSBM BI Modifications
- Benchmarks, Redux (part 14): BSBM BI Mix
- Benchmarks, Redux (part 15): BSBM Test Driver Enhancements
-
-
23:39
Benchmarks, Redux (part 7): What Does BSBM Explore Measure?
» OpenLink Community BlogWe will here analyze what the BSBM Explore workload does. This is necessary in order to compare benchmark results at different scales. Historically, BSBM had a Query 6 whose share of the metric approached 100% as scale increased. The present mix does not have this query, but different queries still have different relative importance at different scales.
We will here look at database-running statistics for BSBM at different scales. Finally, we look at CPU profiles.
But first, let us see what BSBM reads in general. The system is in steady state after around 1500 query mixes; after this the working set does not shift much. After several thousand query mixes, we have:
SELECT TOP 10 * FROM sys_d_stat ORDER BY reads DESC;KEY_TABLE INDEX_NAME TOUCHES READS READ_PCT N_DIRTY N_BUFFERS ================= ============================ ========== ======= ======== ======= ========= DB.DBA.RDF_OBJ RDF_OBJ 114105938 3302150 2 0 3171275 DB.DBA.RDF_QUAD RDF_QUAD 977426773 2041156 0 0 1970712 DB.DBA.RDF_IRI DB_DBA_RDF_IRI_UNQC_RI_ID 8250414 509239 6 15 491631 DB.DBA.RDF_QUAD RDF_QUAD_POGS 3677233812 183860 0 0 175386 DB.DBA.RDF_IRI RDF_IRI 32 99710 302151 5 95353 DB.DBA.RDF_QUAD RDF_QUAD_OP 30597 51593 168 0 48941 DB.DBA.RDF_QUAD RDF_QUAD_SP 265474 47210 17 0 46078 DB.DBA.RDF_PREFIX DB_DBA_RDF_PREFIX_UNQC_RP_ID 6020 212 3 0 212 DB.DBA.RDF_PREFIX RDF_PREFIX 0 167 16700 0 157
The first column is the table, then the index, then the number of times a row was found. The fourth number is the count of disk pages read. The last number is the count of 8K buffer pool pages in use for caching pages of the index in question. Note that the index is clustered, i.e., there is no table data structure separate from the index. Most of the reads are for strings or RDF literals. After this comes the
PSOGindex for getting a property value given the subject. After this, but much lower, we have lookups of IRI strings given the ID. The index from object value to subject is used the most but the number of pages is small; only a few properties seem to be concerned. The rest is minimal in comparison.Now let us reset the counts and see what the steady state I/O profile is.
SELECT key_stat (key_table, name_part (key_name, 2), 'reset') FROM sys_keys WHERE key_migrate_to IS NULL;SELECT TOP 10 * FROM sys_d_stat ORDER BY reads DESC;KEY_TABLE INDEX_NAME TOUCHES READS READ_PCT N_DIRTY N_BUFFERS ================= ============================ ========== ======= ======== ======= ========= DB.DBA.RDF_OBJ RDF_OBJ 30155789 79659 0 0 3191391 DB.DBA.RDF_QUAD RDF_QUAD 259008064 8904 0 0 1948707 DB.DBA.RDF_QUAD RDF_QUAD_SP 68002 7730 11 0 53360 DB.DBA.RDF_IRI RDF_IRI 12 5415 41653 6 98804 DB.DBA.RDF_QUAD RDF_QUAD_POGS 975147136 1597 0 0 173459 DB.DBA.RDF_IRI DB_DBA_RDF_IRI_UNQC_RI_ID 2213525 1286 0 17 485093 DB.DBA.RDF_QUAD RDF_QUAD_OP 7999 904 11 0 48568 DB.DBA.RDF_PREFIX DB_DBA_RDF_PREFIX_UNQC_RP_ID 1494 1 0 0 213
Literal strings dominate. The
SPindex is used only for situations where thePis not specified, i.e., theDESCRIBEquery. Based on this, I/O seems to be attributable mostly to this. The firstRDF_IRIrepresents translations from string to IRI id; the second represents translations from IRI id to string. The touch count for the firstRDF_IRIis not properly recorded, hence the miss % is out of line. We seeSPmissing the cache the most since its use is infrequent in the mix.We will next look at query processing statistics. For this we introduce a new meter.
The
db_activitySQL function provides a session-by-session cumulative statistic of activity. The fields are:-
rnd- Count of random index lookups. Each first row of a select or insert counts as one, regardless of whether something was found. -
seq- Count of sequential rows. Every move to next row on a cursor counts as 1, regardless of whether conditions match. -
same seg- For column store only; counts how many times the next row in a vectored join using an index falls in the same segment as the previous random access. A segment is the stretch of rows between entries in the sparse top level index on the column projection. -
same pg- Counts how many times a vectored index join finds the next match on the same page as the previous one. -
same par- Counts how many times the next lookup in a vectored index join falls on a different page than the previous but still under the same parent. -
disk- Counts how many disk reads were made, including any speculative reads initiated. -
spec disk- Counts speculative disk reads. -
messages- Counts cluster interconnect messages -
B (KB, MB, GB)- is the total length of the cluster interconnect messages. -
fork- Counts how many times a thread was forked (started) for query parallelization.
The numbers are given with 4 significant digits and a scale suffix. G is 10^9 (1,000,000,000); M is 10^6 (1,000,000), K is 10^3 (1,000).
We run 2000 query mixes with 16 Users. The special
httpaccount keeps a cumulative account of all activity on web server threads.SELECT db_activity (2, ' [http');]1.674G rnd 3.223G seq 0 same seg 1.286G same pg 314.8M same par 6.186M disk 6.461M spec disk 0B / 0 messages 298.6K forkWe see that random access dominates. The
seqnumber is about twice therndnumber, meaning that the average random lookup gets two rows. Getting a row at random obviously takes more time than getting the next row. Since the index used is row-wise, thesame segis 0; thesame pgindicates that 77% of the random accesses fall on the same page as the previous random access; most of the remaining random accesses fall under the same parent as the previous one.There are more speculative reads than disk reads which is an artifact of counting some concurrently speculated reads twice. This does indicate that speculative reads dominate. This is because a large part of the run was in the warm-up state with aggressive speculative reading. We reset the counts and run another 2000 mixes.
Now let us look at the same reading after 2000 mixes, 16 user at 100Mt.
234.3M rnd 420.5M seq 0 same seg 188.8M same pg 29.09M same par 808.9K disk 919.9K spec disk 0B / 0 messages 76K forkWe note that the ratios between the random and sequential and same page/parent counts are about the same. The sequential number looks to be even a bit smaller in proportion. The count of random accesses for the 100Mt run is 14% of the count for the 1000Mt run. The count of query parallelization threads is also much lower since it is worthwhile to schedule a new thread only if there are at least a few thousand operations to perform on it. The precise criterion for making a thread is that according to the cost model guess, the thread must have at least 5ms worth of work.
We note that the 100 Mt throughput is a little over three-times that of the 1000 Mt throughput, as reported before. We might justifiably ask why the 100 Mt run is not seven-times faster instead, for this much less work.
We note that for one-off random access, it makes no real difference whether the tree has 100 M or 1000 M rows; this translates to roughly 27 vs 30 comparisons, so the depth of the tree is not a factor per se. Besides, vectoring makes the tree often look only one or two levels deep, so the total row count matters even less there.
To elucidate this last question, we look at the CPU profiles. We take an oprofile of 100 Single User mixes at both scales.
For 100 Mt:
For 1000 Mt61161 10.1723 cmpf_iri64n_iri64n_anyn_gt_lt 31321 5.2093 box_equal 19027 3.1646 sqlo_parse_tree_has_node 15905 2.6453 dk_alloc 15647 2.6024 itc_next_set_neq 12702 2.1126 itc_vec_split_search 12487 2.0768 itc_dive_transit 11450 1.9044 itc_bm_vec_row_check 10646 1.7706 itc_page_rcf_search 9223 1.5340 id_hash_get 9215 1.5326 gen_qsort 8867 1.4748 sqlo_key_part_best 8807 1.4648 itc_param_cmp 8062 1.3409 cmpf_iri64n_iri64n 6820 1.1343 sqlo_in_list 6005 0.9987 dc_iri_id_cmp 5905 0.9821 dk_free_tree 5801 0.9648 box_hash 5509 0.9163 dks_esc_write 5444 0.9054 sql_tree_hash_1
754331 31.4149 cmpf_iri64n_iri64n_anyn_gt_lt 146165 6.0872 itc_vec_split_search 144795 6.0301 itc_next_set_neq 131671 5.4836 itc_dive_transit 110870 4.6173 itc_page_rcf_search 66780 2.7811 gen_qsort 66434 2.7667 itc_param_cmp 58450 2.4342 itc_bm_vec_row_check 55213 2.2994 dk_alloc 47793 1.9904 cmpf_iri64n_iri64n 44277 1.8440 dc_iri_id_cmp 39489 1.6446 cmpf_int64n 36880 1.5359 dc_append_bytes 36601 1.5243 dv_compare 31286 1.3029 dc_any_value_prefetch 25457 1.0602 itc_next_set 20852 0.8684 box_equal 19895 0.8285 dk_free_tree 19698 0.8203 itc_page_insert_search 19367 0.8066 dc_copy
The top function in both is the compare for an equality of two leading IRIs and a range for the trailing any. This corresponds to the range check in Q5. At the larger scale this is three times more important. At the smaller scale, the share of query optimization is about 6.5 times greater. The top function in this category is
box_equalwith 5.2% vs 0.87%. The remaining SQL compiler functions are all in proportion to this, totaling 14.3% of the 100 Mt top-20 profile.From this sample it appears ten times more scale is seven times more database operations. This is not taken into account in the metric. Query compilation is significant at the small end, and no longer significant at 1000 Mt. From these numbers, we could say that Virtuoso is about two times more efficient in terms of database operation throughput at 1000 Mt than at 100 Mt.
We may conclude that different BSBM scales measure different things. The TPC workloads are relatively better in that they have a balance between metric components that stay relatively constant across a large range of scales.
This is not necessarily something that should be fixed in the BSBM Explore mix. We must however take these factors better into account in developing the BI mix.
Let us also remember that BSBM Explore is a relational workload. Future posts in this series will outline how we propose to make RDF-friendlier benchmarks.
Benchmarks, Redux Series- Benchmarks, Redux (part 1): On RDF Benchmarks
- Benchmarks, Redux (part 2): A Benchmarking Story
- Benchmarks, Redux (part 3): Virtuoso 7 vs 6 on BSBM Load and Explore
- Benchmarks, Redux (part 4): Benchmark Tuning Questionnaire
- Benchmarks, Redux (part 5): BSBM and I/O; HDDs and SSDs
- Benchmarks, Redux (part 6): BSBM and I/O, continued
- Benchmarks, Redux (part 7): What Does BSBM Explore Measure? (this post)
- Benchmarks, Redux (part 8): BSBM Explore and Update
- Benchmarks, Redux (part 9): BSBM With Cluster
- Benchmarks, Redux (part 10): LOD2 and the Benchmark Process
- Benchmarks, Redux (part 11): On the Substance of RDF Benchmarks
- Benchmarks, Redux (part 12): Our Own BSBM Results Report
- Benchmarks, Redux (part 13): BSBM BI Modifications
- Benchmarks, Redux (part 14): BSBM BI Mix
- Benchmarks, Redux (part 15): BSBM Test Driver Enhancements
-
-
23:39
Benchmarks, Redux (part 7): What Does BSBM Explore Measure?
» OpenLink Community BlogWe will here analyze what the BSBM Explore workload does. This is necessary in order to compare benchmark results at different scales. Historically, BSBM had a Query 6 whose share of the metric approached 100% as scale increased. The present mix does not have this query, but different queries still have different relative importance at different scales.
We will here look at database-running statistics for BSBM at different scales. Finally, we look at CPU profiles.
But first, let us see what BSBM reads in general. The system is in steady state after around 1500 query mixes; after this the working set does not shift much. After several thousand query mixes, we have:
SELECT TOP 10 * FROM sys_d_stat ORDER BY reads DESC;KEY_TABLE INDEX_NAME TOUCHES READS READ_PCT N_DIRTY N_BUFFERS ================= ============================ ========== ======= ======== ======= ========= DB.DBA.RDF_OBJ RDF_OBJ 114105938 3302150 2 0 3171275 DB.DBA.RDF_QUAD RDF_QUAD 977426773 2041156 0 0 1970712 DB.DBA.RDF_IRI DB_DBA_RDF_IRI_UNQC_RI_ID 8250414 509239 6 15 491631 DB.DBA.RDF_QUAD RDF_QUAD_POGS 3677233812 183860 0 0 175386 DB.DBA.RDF_IRI RDF_IRI 32 99710 302151 5 95353 DB.DBA.RDF_QUAD RDF_QUAD_OP 30597 51593 168 0 48941 DB.DBA.RDF_QUAD RDF_QUAD_SP 265474 47210 17 0 46078 DB.DBA.RDF_PREFIX DB_DBA_RDF_PREFIX_UNQC_RP_ID 6020 212 3 0 212 DB.DBA.RDF_PREFIX RDF_PREFIX 0 167 16700 0 157
The first column is the table, then the index, then the number of times a row was found. The fourth number is the count of disk pages read. The last number is the count of 8K buffer pool pages in use for caching pages of the index in question. Note that the index is clustered, i.e., there is no table data structure separate from the index. Most of the reads are for strings or RDF literals. After this comes the
PSOGindex for getting a property value given the subject. After this, but much lower, we have lookups of IRI strings given the ID. The index from object value to subject is used the most but the number of pages is small; only a few properties seem to be concerned. The rest is minimal in comparison.Now let us reset the counts and see what the steady state I/O profile is.
SELECT key_stat (key_table, name_part (key_name, 2), 'reset') FROM sys_keys WHERE key_migrate_to IS NULL;SELECT TOP 10 * FROM sys_d_stat ORDER BY reads DESC;KEY_TABLE INDEX_NAME TOUCHES READS READ_PCT N_DIRTY N_BUFFERS ================= ============================ ========== ======= ======== ======= ========= DB.DBA.RDF_OBJ RDF_OBJ 30155789 79659 0 0 3191391 DB.DBA.RDF_QUAD RDF_QUAD 259008064 8904 0 0 1948707 DB.DBA.RDF_QUAD RDF_QUAD_SP 68002 7730 11 0 53360 DB.DBA.RDF_IRI RDF_IRI 12 5415 41653 6 98804 DB.DBA.RDF_QUAD RDF_QUAD_POGS 975147136 1597 0 0 173459 DB.DBA.RDF_IRI DB_DBA_RDF_IRI_UNQC_RI_ID 2213525 1286 0 17 485093 DB.DBA.RDF_QUAD RDF_QUAD_OP 7999 904 11 0 48568 DB.DBA.RDF_PREFIX DB_DBA_RDF_PREFIX_UNQC_RP_ID 1494 1 0 0 213
Literal strings dominate. The
SPindex is used only for situations where thePis not specified, i.e., theDESCRIBEquery. Based on this, I/O seems to be attributable mostly to this. The firstRDF_IRIrepresents translations from string to IRI id; the second represents translations from IRI id to string. The touch count for the firstRDF_IRIis not properly recorded, hence the miss % is out of line. We seeSPmissing the cache the most since its use is infrequent in the mix.We will next look at query processing statistics. For this we introduce a new meter.
The
db_activitySQL function provides a session-by-session cumulative statistic of activity. The fields are:-
rnd- Count of random index lookups. Each first row of a select or insert counts as one, regardless of whether something was found. -
seq- Count of sequential rows. Every move to next row on a cursor counts as 1, regardless of whether conditions match. -
same seg- For column store only; counts how many times the next row in a vectored join using an index falls in the same segment as the previous random access. A segment is the stretch of rows between entries in the sparse top level index on the column projection. -
same pg- Counts how many times a vectored index join finds the next match on the same page as the previous one. -
same par- Counts how many times the next lookup in a vectored index join falls on a different page than the previous but still under the same parent. -
disk- Counts how many disk reads were made, including any speculative reads initiated. -
spec disk- Counts speculative disk reads. -
messages- Counts cluster interconnect messages -
B (KB, MB, GB)- is the total length of the cluster interconnect messages. -
fork- Counts how many times a thread was forked (started) for query parallelization.
The numbers are given with 4 significant digits and a scale suffix. G is 10^9 (1,000,000,000); M is 10^6 (1,000,000), K is 10^3 (1,000).
We run 2000 query mixes with 16 Users. The special
httpaccount keeps a cumulative account of all activity on web server threads.SELECT db_activity (2, ' [http');]1.674G rnd 3.223G seq 0 same seg 1.286G same pg 314.8M same par 6.186M disk 6.461M spec disk 0B / 0 messages 298.6K forkWe see that random access dominates. The
seqnumber is about twice therndnumber, meaning that the average random lookup gets two rows. Getting a row at random obviously takes more time than getting the next row. Since the index used is row-wise, thesame segis 0; thesame pgindicates that 77% of the random accesses fall on the same page as the previous random access; most of the remaining random accesses fall under the same parent as the previous one.There are more speculative reads than disk reads which is an artifact of counting some concurrently speculated reads twice. This does indicate that speculative reads dominate. This is because a large part of the run was in the warm-up state with aggressive speculative reading. We reset the counts and run another 2000 mixes.
Now let us look at the same reading after 2000 mixes, 16 user at 100Mt.
234.3M rnd 420.5M seq 0 same seg 188.8M same pg 29.09M same par 808.9K disk 919.9K spec disk 0B / 0 messages 76K forkWe note that the ratios between the random and sequential and same page/parent counts are about the same. The sequential number looks to be even a bit smaller in proportion. The count of random accesses for the 100Mt run is 14% of the count for the 1000Mt run. The count of query parallelization threads is also much lower since it is worthwhile to schedule a new thread only if there are at least a few thousand operations to perform on it. The precise criterion for making a thread is that according to the cost model guess, the thread must have at least 5ms worth of work.
We note that the 100 Mt throughput is a little over three-times that of the 1000 Mt throughput, as reported before. We might justifiably ask why the 100 Mt run is not seven-times faster instead, for this much less work.
We note that for one-off random access, it makes no real difference whether the tree has 100 M or 1000 M rows; this translates to roughly 27 vs 30 comparisons, so the depth of the tree is not a factor per se. Besides, vectoring makes the tree often look only one or two levels deep, so the total row count matters even less there.
To elucidate this last question, we look at the CPU profiles. We take an oprofile of 100 Single User mixes at both scales.
For 100 Mt:
For 1000 Mt61161 10.1723 cmpf_iri64n_iri64n_anyn_gt_lt 31321 5.2093 box_equal 19027 3.1646 sqlo_parse_tree_has_node 15905 2.6453 dk_alloc 15647 2.6024 itc_next_set_neq 12702 2.1126 itc_vec_split_search 12487 2.0768 itc_dive_transit 11450 1.9044 itc_bm_vec_row_check 10646 1.7706 itc_page_rcf_search 9223 1.5340 id_hash_get 9215 1.5326 gen_qsort 8867 1.4748 sqlo_key_part_best 8807 1.4648 itc_param_cmp 8062 1.3409 cmpf_iri64n_iri64n 6820 1.1343 sqlo_in_list 6005 0.9987 dc_iri_id_cmp 5905 0.9821 dk_free_tree 5801 0.9648 box_hash 5509 0.9163 dks_esc_write 5444 0.9054 sql_tree_hash_1
754331 31.4149 cmpf_iri64n_iri64n_anyn_gt_lt 146165 6.0872 itc_vec_split_search 144795 6.0301 itc_next_set_neq 131671 5.4836 itc_dive_transit 110870 4.6173 itc_page_rcf_search 66780 2.7811 gen_qsort 66434 2.7667 itc_param_cmp 58450 2.4342 itc_bm_vec_row_check 55213 2.2994 dk_alloc 47793 1.9904 cmpf_iri64n_iri64n 44277 1.8440 dc_iri_id_cmp 39489 1.6446 cmpf_int64n 36880 1.5359 dc_append_bytes 36601 1.5243 dv_compare 31286 1.3029 dc_any_value_prefetch 25457 1.0602 itc_next_set 20852 0.8684 box_equal 19895 0.8285 dk_free_tree 19698 0.8203 itc_page_insert_search 19367 0.8066 dc_copy
The top function in both is the compare for an equality of two leading IRIs and a range for the trailing any. This corresponds to the range check in Q5. At the larger scale this is three times more important. At the smaller scale, the share of query optimization is about 6.5 times greater. The top function in this category is
box_equalwith 5.2% vs 0.87%. The remaining SQL compiler functions are all in proportion to this, totaling 14.3% of the 100 Mt top-20 profile.From this sample it appears ten times more scale is seven times more database operations. This is not taken into account in the metric. Query compilation is significant at the small end, and no longer significant at 1000 Mt. From these numbers, we could say that Virtuoso is about two times more efficient in terms of database operation throughput at 1000 Mt than at 100 Mt.
We may conclude that different BSBM scales measure different things. The TPC workloads are relatively better in that they have a balance between metric components that stay relatively constant across a large range of scales.
This is not necessarily something that should be fixed in the BSBM Explore mix. We must however take these factors better into account in developing the BI mix.
Let us also remember that BSBM Explore is a relational workload. Future posts in this series will outline how we propose to make RDF-friendlier benchmarks.
Benchmarks, Redux Series- Benchmarks, Redux (part 1): On RDF Benchmarks
- Benchmarks, Redux (part 2): A Benchmarking Story
- Benchmarks, Redux (part 3): Virtuoso 7 vs 6 on BSBM Load and Explore
- Benchmarks, Redux (part 4): Benchmark Tuning Questionnaire
- Benchmarks, Redux (part 5): BSBM and I/O; HDDs and SSDs
- Benchmarks, Redux (part 6): BSBM and I/O, continued
- Benchmarks, Redux (part 7): What Does BSBM Explore Measure? (this post)
- Benchmarks, Redux (part 8): BSBM Explore and Update
- Benchmarks, Redux (part 9): BSBM With Cluster
- Benchmarks, Redux (part 10): LOD2 and the Benchmark Process
- Benchmarks, Redux (part 11): On the Substance of RDF Benchmarks
- Benchmarks, Redux (part 12): Our Own BSBM Results Report
- Benchmarks, Redux (part 13): BSBM BI Modifications
- Benchmarks, Redux (part 14): BSBM BI Mix
- Benchmarks, Redux (part 15): BSBM Test Driver Enhancements
-
-
22:36
Benchmarks, Redux (part 6): BSBM and I/O, continued
» OpenLink Community BlogIn the words of Jim Gray, disks have become tapes. By this he means that a disk is really only good for sequential access. For this reason, the SSD extent read ahead was incomparably better. We note that in the experiment, every page in the general area of the database the experiment touched would in time be touched, and that the whole working set would end up in memory. Therefore no speculative read would be wasted. Therefore it stands to reason to read whole extents.
So I changed the default behavior to use a very long window for triggering read-ahead as long as the buffer pool was not full. After the initial filling of the buffer pool, the read ahead would require more temporal locality before kicking in.
Still, the scheme was not really good since the rest of the extent would go for background-read and the triggering read would be done right then, leading to extra seeks. Well, this is good for latency but bad for throughput. So I changed this too, going to an "elevator only" scheme where reads that triggered read-ahead would go with the read-ahead batch. Reads that did not trigger read-ahead would still be done right in place, thus favoring latency but breaking any sequentiality with its attendant 10+ ms penalty.
We keep in mind that the test we target is BSBM warm-up time, which is purely a throughput business. One could have timeouts and could penalize queries that sacrificed too much latency to throughput.
We note that even for this very simple metric, just reading the allocated database pages from start to end is not good since a large number of pages in fact never get read during a run.
We further note that the vectored read-ahead without any speculation will be useful as-is for cases with few threads and striping, since at least one thread's random I/Os get to go to multiple threads. The benefit is less in multiuser situations where disks are randomly busy anyhow.
In the previous I/O experiments, we saw that with vectored read ahead and no speculation, there were around 50 pages waiting for I/O at all times. With an easily-triggered extent read-ahead, there were around 4000 pages waiting. The more pages are waiting for I/O, the greater the benefit from the elevator algorithm of servicing I/O in order of file offset.
In Virtuoso 5 we had a trick that would, if the buffer pool was not full, speculatively read every uncached sibling of every index tree node it visited. This filled the cache quite fast, but was useless after the cache was full. The extent read ahead first implemented in 6 was less aggressive, but would continue working with full cache and did in fact help with shifts in the working set.
The next logical step is to combine the vector and extent read-ahead modes. We see what pages we will be getting, then take the distinct extents; if we have been to this extent within the time window, we just add all the uncached allocated pages of the extent to the batch.
With this setting, especially at the start of the run, we get large read-ahead batches and maintain I/O queues of 5000 to 20000 pages. The SSD starting time drops to about 120 seconds from cold start to reach 1200% CPU. We see transfer rates of up to 150 MB/s per SSD. With HDDs, we see transfer rates around 14 MB/s per drive, mostly reading chunks of an average of seventy-one (71) 8K pages.
The BSBM workload does not offer better possibilities for optimization, short of pre-reading the whole database, which is not practical at large scales.
Some DetailsFirst we start from cold disk, with and without mandatory read of the whole extent on the touch.
Without any speculation but with vectored read-ahead, here are the times for the first 11 query mixes:
0: 151560.82 ms, total: 151718 ms 1: 179589.08 ms, total: 179648 ms 2: 71974.49 ms, total: 72017 ms 3: 102701.73 ms, total: 102729 ms 4: 58834.41 ms, total: 58856 ms 5: 65926.34 ms, total: 65944 ms 6: 68244.69 ms, total: 68274 ms 7: 39197.15 ms, total: 39215 ms 8: 45654.93 ms, total: 45674 ms 9: 34850.30 ms, total: 34878 ms 10: 100061.30 ms, total: 100079 ms
The average CPU during this time was 5%. The best read throughput was 2.5 MB/s; the average was 1.35 MB/s. The average disk read was 16 ms.
With vectored read-ahead and full extents only, i.e., max speculation:
0: 178854.23 ms, total: 179034 ms 1: 110826.68 ms, total: 110887 ms 2: 19896.11 ms, total: 19941 ms 3: 36724.43 ms, total: 36753 ms 4: 21253.70 ms, total: 21285 ms 5: 18417.73 ms, total: 18439 ms 6: 21668.92 ms, total: 21690 ms 7: 12236.49 ms, total: 12267 ms 8: 14922.74 ms, total: 14945 ms 9: 11502.96 ms, total: 11523 ms 10: 15762.34 ms, total: 15792 ms ... 90: 1747.62 ms, total: 1761 ms 91: 1701.01 ms, total: 1714 ms 92: 1300.62 ms, total: 1318 ms 93: 1873.15 ms, total: 1886 ms 94: 1508.24 ms, total: 1524 ms 95: 1748.15 ms, total: 1761 ms 96: 2076.92 ms, total: 2090 ms 97: 2199.38 ms, total: 2212 ms 98: 2305.75 ms, total: 2319 ms 99: 1771.91 ms, total: 1784 ms Scale factor: 2848260 Number of warmup runs: 0 Seed: 808080 Number of query mix runs (without warmups): 100 times min/max Querymix runtime: 1.3006s / 178.8542s Elapsed runtime: 872.993 seconds QMpH: 412.374 query mixes per hour
The peak throughput is 91 MB/s, with average around 50 MB/s; CPU average around 50%.
We note that the latency of the first query mix is hardly greater than in the non-speculative run, but starting from mix 3 the speed is clearly better.
Then the same with cold SSDs. First with no speculation:
0: 5177.68 ms, total: 5302 ms 1: 2570.16 ms, total: 2614 ms 2: 1353.06 ms, total: 1391 ms 3: 1957.63 ms, total: 1978 ms 4: 1371.13 ms, total: 1386 ms 5: 1765.55 ms, total: 1781 ms 6: 1658.23 ms, total: 1673 ms 7: 1273.87 ms, total: 1289 ms 8: 1355.19 ms, total: 1380 ms 9: 1152.78 ms, total: 1167 ms 10: 1787.91 ms, total: 1802 ms ... 90: 1116.25 ms, total: 1128 ms 91: 989.50 ms, total: 1001 ms 92: 833.24 ms, total: 844 ms 93: 1137.83 ms, total: 1150 ms 94: 969.47 ms, total: 982 ms 95: 1138.04 ms, total: 1149 ms 96: 1155.98 ms, total: 1168 ms 97: 1178.15 ms, total: 1193 ms 98: 1120.18 ms, total: 1132 ms 99: 1013.16 ms, total: 1025 ms Scale factor: 2848260 Number of warmup runs: 0 Seed: 808080 Number of query mix runs (without warmups): 100 times min/max Querymix runtime: 0.8201s / 5.1777s Elapsed runtime: 127.555 seconds QMpH: 2822.321 query mixes per hour
The peak I/O is 45 MB/s, with average 28.3 MB/s; CPU average is 168%.
Now, SSDs with max speculation.
0: 44670.34 ms, total: 44809 ms 1: 18490.44 ms, total: 18548 ms 2: 7306.12 ms, total: 7353 ms 3: 9452.66 ms, total: 9485 ms 4: 5648.56 ms, total: 5668 ms 5: 5493.21 ms, total: 5511 ms 6: 5951.48 ms, total: 5970 ms 7: 3815.59 ms, total: 3834 ms 8: 4560.71 ms, total: 4579 ms 9: 3523.74 ms, total: 3543 ms 10: 4724.04 ms, total: 4741 ms ... 90: 673.53 ms, total: 685 ms 91: 534.62 ms, total: 545 ms 92: 730.81 ms, total: 742 ms 93: 1358.14 ms, total: 1370 ms 94: 1098.64 ms, total: 1110 ms 95: 1232.20 ms, total: 1243 ms 96: 1259.57 ms, total: 1273 ms 97: 1298.95 ms, total: 1310 ms 98: 1156.01 ms, total: 1166 ms 99: 1025.45 ms, total: 1034 ms Scale factor: 2848260 Number of warmup runs: 0 Seed: 808080 Number of query mix runs (without warmups): 100 times min/max Querymix runtime: 0.4725s / 44.6703s Elapsed runtime: 269.323 seconds QMpH: 1336.683 query mixes per hour
The peak I/O is 339 MB/s, with average 192 MB/s; average CPU is 121%.
The above was measured with the read-ahead thread doing single-page reads. We repeated the test with merging reads with small differences. The max IO was 353 MB/s, and average 173 MB/s; average CPU 113%.
We see that the start latency is quite a bit longer than without speculation and the CPU % is lower due to higher latency of individual I/O. The I/O rate is fair. We would expect more throughput however.
We find that a supposedly better use of the API, doing single requests of up to 100 pages instead of consecutive requests of 1 page, does not make a lot of difference. The peak I/O is a bit higher; overall throughput is a bit lower.
We will have to retry these experiments with a better controller. We have at no point seen anything like the 50K 4KB random I/Os promised for the SSDs by the manufacturer. We know for a fact that the controller gives about 700 MB/s sequential read with
cat file /dev/nulland two drives busy. With 4 drives busy, this does not get better. The best 30 second stretch we saw in a multiuser BSBM warm-up was 590 MB/s, which is consistent with thecatto/dev/nullfigure. We will later test with 8 SSDs with better controllers.Note that the average I/O and CPU are averages over 30 second measurement windows; thus for short running tests, there is some error from the window during which the activity ended.
Let us now see if we can make a BSBM instance warm up from disk in a reasonable time. We run 16 users with max speculation. We note that after reading 7,500,000 buffers we are not entirely free of disk. The max speculation read-ahead filled the cache in 17 minutes, with an average of 58 MB/s. After the cache is filled, the system shifts to a more conservative policy on extent read-ahead; one which in fact never gets triggered with the BSBM Explore in steady state. The vectored read-ahead is kept on since this by itself does not read pages that are not needed. However, the vectored read-ahead does not run either, because the data that is accessed in larger batches is already in memory. Thus there remains a trickle of an average 0.49 MB/s from disk. This keeps CPU around 350%. With SSDs, the trickle is about 1.5 MB/s and CPU is around 1300% in steady state. Thus SSDs give approximately triple the throughput in a situation where there is a tiny amount of continuous random disk access. The disk access in question is 80% for retrieving RDF literal strings, presumably on behalf of the
DESCRIBEquery in the mix. This query touches things no other query touches and does so one subject at a time, in a way that can neither be anticipated nor optimized.The Virtuoso 7 column store will deal with this better because it is more space efficient overall. If we apply stream-compression to literals, these will go in under half the space, while quads will go in maybe one-quarter the space. Thus 3000 Mt all from memory should be possible with 72 GB RAM. 1000 Mt row-wise did fit in in 72 GB RAM except for the random literals accessed by the the
ConclusionsDESCRIBE. This alone drops throughput to under a third of the memory-only throughput if using HDDs. SSDs, on the other hand, can largely neutralize this effect.We have looked at basics of I/O. SSDs have been found to be a readily available solution to I/O bottlenecks without need for reconfiguration or complex I/O policies. We have been able to get a decent read rate under conditions of server warm-up or shift of working set even with HDDs.
More advanced I/O matters will be covered with the column store. We note that the techniques discussed here apply identically to rows and columns.
As concerns BSBM, it seems appropriate to include a warm-up time. In practice, this means that the store just must eagerly pre-read. This is not hard to do and can be quite useful.
Benchmarks, Redux Series- Benchmarks, Redux (part 1): On RDF Benchmarks
- Benchmarks, Redux (part 2): A Benchmarking Story
- Benchmarks, Redux (part 3): Virtuoso 7 vs 6 on BSBM Load and Explore
- Benchmarks, Redux (part 4): Benchmark Tuning Questionnaire
- Benchmarks, Redux (part 5): BSBM and I/O; HDDs and SSDs
- Benchmarks, Redux (part 6): BSBM and I/O, continued (this post)
- Benchmarks, Redux (part 7): What Does BSBM Explore Measure?
- Benchmarks, Redux (part 8): BSBM Explore and Update
- Benchmarks, Redux (part 9): BSBM With Cluster
- Benchmarks, Redux (part 10): LOD2 and the Benchmark Process
- Benchmarks, Redux (part 11): On the Substance of RDF Benchmarks
- Benchmarks, Redux (part 12): Our Own BSBM Results Report
- Benchmarks, Redux (part 13): BSBM BI Modifications
- Benchmarks, Redux (part 14): BSBM BI Mix
- Benchmarks, Redux (part 15): BSBM Test Driver Enhancements
-
22:36
Benchmarks, Redux (part 6): BSBM and I/O, continued
» OpenLink Community BlogIn the words of Jim Gray, disks have become tapes. By this he means that a disk is really only good for sequential access. For this reason, the SSD extent read ahead was incomparably better. We note that in the experiment, every page in the general area of the database the experiment touched would in time be touched, and that the whole working set would end up in memory. Therefore no speculative read would be wasted. Therefore it stands to reason to read whole extents.
So I changed the default behavior to use a very long window for triggering read-ahead as long as the buffer pool was not full. After the initial filling of the buffer pool, the read ahead would require more temporal locality before kicking in.
Still, the scheme was not really good since the rest of the extent would go for background-read and the triggering read would be done right then, leading to extra seeks. Well, this is good for latency but bad for throughput. So I changed this too, going to an "elevator only" scheme where reads that triggered read-ahead would go with the read-ahead batch. Reads that did not trigger read-ahead would still be done right in place, thus favoring latency but breaking any sequentiality with its attendant 10+ ms penalty.
We keep in mind that the test we target is BSBM warm-up time, which is purely a throughput business. One could have timeouts and could penalize queries that sacrificed too much latency to throughput.
We note that even for this very simple metric, just reading the allocated database pages from start to end is not good since a large number of pages in fact never get read during a run.
We further note that the vectored read-ahead without any speculation will be useful as-is for cases with few threads and striping, since at least one thread's random I/Os get to go to multiple threads. The benefit is less in multiuser situations where disks are randomly busy anyhow.
In the previous I/O experiments, we saw that with vectored read ahead and no speculation, there were around 50 pages waiting for I/O at all times. With an easily-triggered extent read-ahead, there were around 4000 pages waiting. The more pages are waiting for I/O, the greater the benefit from the elevator algorithm of servicing I/O in order of file offset.
In Virtuoso 5 we had a trick that would, if the buffer pool was not full, speculatively read every uncached sibling of every index tree node it visited. This filled the cache quite fast, but was useless after the cache was full. The extent read ahead first implemented in 6 was less aggressive, but would continue working with full cache and did in fact help with shifts in the working set.
The next logical step is to combine the vector and extent read-ahead modes. We see what pages we will be getting, then take the distinct extents; if we have been to this extent within the time window, we just add all the uncached allocated pages of the extent to the batch.
With this setting, especially at the start of the run, we get large read-ahead batches and maintain I/O queues of 5000 to 20000 pages. The SSD starting time drops to about 120 seconds from cold start to reach 1200% CPU. We see transfer rates of up to 150 MB/s per SSD. With HDDs, we see transfer rates around 14 MB/s per drive, mostly reading chunks of an average of seventy-one (71) 8K pages.
The BSBM workload does not offer better possibilities for optimization, short of pre-reading the whole database, which is not practical at large scales.
Some DetailsFirst we start from cold disk, with and without mandatory read of the whole extent on the touch.
Without any speculation but with vectored read-ahead, here are the times for the first 11 query mixes:
0: 151560.82 ms, total: 151718 ms 1: 179589.08 ms, total: 179648 ms 2: 71974.49 ms, total: 72017 ms 3: 102701.73 ms, total: 102729 ms 4: 58834.41 ms, total: 58856 ms 5: 65926.34 ms, total: 65944 ms 6: 68244.69 ms, total: 68274 ms 7: 39197.15 ms, total: 39215 ms 8: 45654.93 ms, total: 45674 ms 9: 34850.30 ms, total: 34878 ms 10: 100061.30 ms, total: 100079 ms
The average CPU during this time was 5%. The best read throughput was 2.5 MB/s; the average was 1.35 MB/s. The average disk read was 16 ms.
With vectored read-ahead and full extents only, i.e., max speculation:
0: 178854.23 ms, total: 179034 ms 1: 110826.68 ms, total: 110887 ms 2: 19896.11 ms, total: 19941 ms 3: 36724.43 ms, total: 36753 ms 4: 21253.70 ms, total: 21285 ms 5: 18417.73 ms, total: 18439 ms 6: 21668.92 ms, total: 21690 ms 7: 12236.49 ms, total: 12267 ms 8: 14922.74 ms, total: 14945 ms 9: 11502.96 ms, total: 11523 ms 10: 15762.34 ms, total: 15792 ms ... 90: 1747.62 ms, total: 1761 ms 91: 1701.01 ms, total: 1714 ms 92: 1300.62 ms, total: 1318 ms 93: 1873.15 ms, total: 1886 ms 94: 1508.24 ms, total: 1524 ms 95: 1748.15 ms, total: 1761 ms 96: 2076.92 ms, total: 2090 ms 97: 2199.38 ms, total: 2212 ms 98: 2305.75 ms, total: 2319 ms 99: 1771.91 ms, total: 1784 ms Scale factor: 2848260 Number of warmup runs: 0 Seed: 808080 Number of query mix runs (without warmups): 100 times min/max Querymix runtime: 1.3006s / 178.8542s Elapsed runtime: 872.993 seconds QMpH: 412.374 query mixes per hour
The peak throughput is 91 MB/s, with average around 50 MB/s; CPU average around 50%.
We note that the latency of the first query mix is hardly greater than in the non-speculative run, but starting from mix 3 the speed is clearly better.
Then the same with cold SSDs. First with no speculation:
0: 5177.68 ms, total: 5302 ms 1: 2570.16 ms, total: 2614 ms 2: 1353.06 ms, total: 1391 ms 3: 1957.63 ms, total: 1978 ms 4: 1371.13 ms, total: 1386 ms 5: 1765.55 ms, total: 1781 ms 6: 1658.23 ms, total: 1673 ms 7: 1273.87 ms, total: 1289 ms 8: 1355.19 ms, total: 1380 ms 9: 1152.78 ms, total: 1167 ms 10: 1787.91 ms, total: 1802 ms ... 90: 1116.25 ms, total: 1128 ms 91: 989.50 ms, total: 1001 ms 92: 833.24 ms, total: 844 ms 93: 1137.83 ms, total: 1150 ms 94: 969.47 ms, total: 982 ms 95: 1138.04 ms, total: 1149 ms 96: 1155.98 ms, total: 1168 ms 97: 1178.15 ms, total: 1193 ms 98: 1120.18 ms, total: 1132 ms 99: 1013.16 ms, total: 1025 ms Scale factor: 2848260 Number of warmup runs: 0 Seed: 808080 Number of query mix runs (without warmups): 100 times min/max Querymix runtime: 0.8201s / 5.1777s Elapsed runtime: 127.555 seconds QMpH: 2822.321 query mixes per hour
The peak I/O is 45 MB/s, with average 28.3 MB/s; CPU average is 168%.
Now, SSDs with max speculation.
0: 44670.34 ms, total: 44809 ms 1: 18490.44 ms, total: 18548 ms 2: 7306.12 ms, total: 7353 ms 3: 9452.66 ms, total: 9485 ms 4: 5648.56 ms, total: 5668 ms 5: 5493.21 ms, total: 5511 ms 6: 5951.48 ms, total: 5970 ms 7: 3815.59 ms, total: 3834 ms 8: 4560.71 ms, total: 4579 ms 9: 3523.74 ms, total: 3543 ms 10: 4724.04 ms, total: 4741 ms ... 90: 673.53 ms, total: 685 ms 91: 534.62 ms, total: 545 ms 92: 730.81 ms, total: 742 ms 93: 1358.14 ms, total: 1370 ms 94: 1098.64 ms, total: 1110 ms 95: 1232.20 ms, total: 1243 ms 96: 1259.57 ms, total: 1273 ms 97: 1298.95 ms, total: 1310 ms 98: 1156.01 ms, total: 1166 ms 99: 1025.45 ms, total: 1034 ms Scale factor: 2848260 Number of warmup runs: 0 Seed: 808080 Number of query mix runs (without warmups): 100 times min/max Querymix runtime: 0.4725s / 44.6703s Elapsed runtime: 269.323 seconds QMpH: 1336.683 query mixes per hour
The peak I/O is 339 MB/s, with average 192 MB/s; average CPU is 121%.
The above was measured with the read-ahead thread doing single-page reads. We repeated the test with merging reads with small differences. The max IO was 353 MB/s, and average 173 MB/s; average CPU 113%.
We see that the start latency is quite a bit longer than without speculation and the CPU % is lower due to higher latency of individual I/O. The I/O rate is fair. We would expect more throughput however.
We find that a supposedly better use of the API, doing single requests of up to 100 pages instead of consecutive requests of 1 page, does not make a lot of difference. The peak I/O is a bit higher; overall throughput is a bit lower.
We will have to retry these experiments with a better controller. We have at no point seen anything like the 50K 4KB random I/Os promised for the SSDs by the manufacturer. We know for a fact that the controller gives about 700 MB/s sequential read with
cat file /dev/nulland two drives busy. With 4 drives busy, this does not get better. The best 30 second stretch we saw in a multiuser BSBM warm-up was 590 MB/s, which is consistent with thecatto/dev/nullfigure. We will later test with 8 SSDs with better controllers.Note that the average I/O and CPU are averages over 30 second measurement windows; thus for short running tests, there is some error from the window during which the activity ended.
Let us now see if we can make a BSBM instance warm up from disk in a reasonable time. We run 16 users with max speculation. We note that after reading 7,500,000 buffers we are not entirely free of disk. The max speculation read-ahead filled the cache in 17 minutes, with an average of 58 MB/s. After the cache is filled, the system shifts to a more conservative policy on extent read-ahead; one which in fact never gets triggered with the BSBM Explore in steady state. The vectored read-ahead is kept on since this by itself does not read pages that are not needed. However, the vectored read-ahead does not run either, because the data that is accessed in larger batches is already in memory. Thus there remains a trickle of an average 0.49 MB/s from disk. This keeps CPU around 350%. With SSDs, the trickle is about 1.5 MB/s and CPU is around 1300% in steady state. Thus SSDs give approximately triple the throughput in a situation where there is a tiny amount of continuous random disk access. The disk access in question is 80% for retrieving RDF literal strings, presumably on behalf of the
DESCRIBEquery in the mix. This query touches things no other query touches and does so one subject at a time, in a way that can neither be anticipated nor optimized.The Virtuoso 7 column store will deal with this better because it is more space efficient overall. If we apply stream-compression to literals, these will go in under half the space, while quads will go in maybe one-quarter the space. Thus 3000 Mt all from memory should be possible with 72 GB RAM. 1000 Mt row-wise did fit in in 72 GB RAM except for the random literals accessed by the the
ConclusionsDESCRIBE. This alone drops throughput to under a third of the memory-only throughput if using HDDs. SSDs, on the other hand, can largely neutralize this effect.We have looked at basics of I/O. SSDs have been found to be a readily available solution to I/O bottlenecks without need for reconfiguration or complex I/O policies. We have been able to get a decent read rate under conditions of server warm-up or shift of working set even with HDDs.
More advanced I/O matters will be covered with the column store. We note that the techniques discussed here apply identically to rows and columns.
As concerns BSBM, it seems appropriate to include a warm-up time. In practice, this means that the store just must eagerly pre-read. This is not hard to do and can be quite useful.
Benchmarks, Redux Series- Benchmarks, Redux (part 1): On RDF Benchmarks
- Benchmarks, Redux (part 2): A Benchmarking Story
- Benchmarks, Redux (part 3): Virtuoso 7 vs 6 on BSBM Load and Explore
- Benchmarks, Redux (part 4): Benchmark Tuning Questionnaire
- Benchmarks, Redux (part 5): BSBM and I/O; HDDs and SSDs
- Benchmarks, Redux (part 6): BSBM and I/O, continued (this post)
- Benchmarks, Redux (part 7): What Does BSBM Explore Measure?
- Benchmarks, Redux (part 8): BSBM Explore and Update
- Benchmarks, Redux (part 9): BSBM With Cluster
- Benchmarks, Redux (part 10): LOD2 and the Benchmark Process
- Benchmarks, Redux (part 11): On the Substance of RDF Benchmarks
- Benchmarks, Redux (part 12): Our Own BSBM Results Report
- Benchmarks, Redux (part 13): BSBM BI Modifications
- Benchmarks, Redux (part 14): BSBM BI Mix
- Benchmarks, Redux (part 15): BSBM Test Driver Enhancements
-
19:17
Benchmarks, Redux (part 5): BSBM and I/O; HDDs and SSDs
» OpenLink Community BlogIn the context of database benchmarks we cannot ignore I/O, as pretty much has been done so far by BSBM.
There are two approaches:
-
run twice or otherwise make sure one runs from memory and forget about I/O, or
-
make rules and metrics for warm-up.
We will see if the second is possible with BSBM.
From this starting point, we look at various ways of scheduling I/O in Virtuoso using a 1000 Mt BSBM database on sets of each of HDDs (hard disk devices) and SSDs (solid-state storage devices). We will see that SSDs in this specific application can make a significant difference.
In this test we have the same 4 stripes of a 1000 Mt BSBM database on each of two storage arrays.
Storage Arrays Type Quantity Maker Size Speed Interface speed Controller Drive Cache RAID SSD 4 Crucial 128 GB N/A 6Gbit SATA RocketRaid 640 128 MB None HDD 4 Samsung 1000 GB 7200 RPM 3Gbit SATA Intel ICH on Supermicro motherboard 16 MB None We make sure that the files are not in OS cache by filling it with other big files, reading a total of 120 GB off SSDs with
`cat file > /dev/null`.The configuration files are as in the report on the 1000 Mt run. We note as significant that we have a few file descriptors for each stripe, and that read-ahead for each is handled by its own thread.
Two different read-ahead schemes are used:
-
With 6 Single, if a 2MB extent gets a second read within a given time after the first, the whole extent is scheduled for background read.
-
With 7 Single, as an index search is vectored, we know a large number of values to fetch at one time and these values are sorted into an ascending sequence. Therefore, by looking at a node in an index tree, we can determine which sub-trees will be accessed and schedule these for read-ahead, skipping any that will not be accessed.
In either model, a sequential scan touching more than a couple of consecutive index leaf pages triggers a read-ahead, to the end of the scanned range or to the next 3000 index leaves, whichever comes first. However, there are no sequential scans of significant size in BSBM.
There are a few different possibilities for the physical I/O:
-
Using a separate read system call for each page. There may be several open file descriptors on a file so that many such calls can proceed concurrently on different threads; the OS will order the operations.
-
A thread finds it needs a page and reads it.
-
Using Unix asynchronous I/O,
aio.h, with theaio_*andlio_listiofunctions. -
Using single-read system calls for adjacent pages. In this way, the drive sees longer requests and should give better throughput. If there are short gaps in the sequence, the gaps are also read, wasting bandwidth but saving on latency.
The two latter apply only to bulk I/O that are scheduled on background threads, one per independently-addressable device (HDD, SSD, or RAID-set). These bulk-reads operate on an elevator model, keeping a sorted queue of things to read or write and moving through this queue from start to end. At any time, the queue may get more work from other threads.
There is a further choice when seeing single-page random requests. They can either go to the elevator or they can be done in place. Taking the elevator is presumably good for throughput but bad for latency. In general, the elevator should have a notion of fairness; these matters are discussed in the CWI collaborative scan paper. Here we do not have long queries, so we do not have to talk about elevator policies or scan sharing; there are no scans. We may touch on these questions later with the column store, the BSBM BI mix, and TPC-H.
While we may know principles, I/O has always given us surprises; the only way to optimize this is to measure.
The metric we try to optimize here is the time it takes for a multiuser BSBM run starting from cold cache to get to 1200% CPU. When running from memory, the CPU is around 1350% for the system in question.
This depends on getting I/O throughput, which in turn depends on having a lot of speculative reading since the workload itself does not give any long stretches to read.
The test driver is set at 16 clients, and the run continues for 2000 query mixes or until target throughput is reached. Target throughput is deemed reached after the first 20 second stretch with CPU at 1200% or higher.
The meter is a stored procedure that records the CPU time, count of reads, cumulative elapsed time spent waiting for I/O, and other metrics. The code for this procedure (for 7 Single; this file will not work on Virtuoso 6 or earlier) is available here.
The database space allocation gives each index a number of 2MB segments, each with 256 8K pages. When a page splits, the new page is allocated from the same extent if possible, or from a specific second extent which is designated as the overflow extent of this extent. This scheme provides for a sort of pseudo-locality within extents over random insert order. Thus there is a chance that pre-reading an extent will get key values in the same range a the ones on the page being requested in the first place. At least the pre-read pages will be from the same index tree. There are insertion orders that do not create good locality with this allocation scheme, though. In order to generally improve locality, one could shuffle pages of an all-dirty subtree before writing this out so as to have physical order match key order. We will look at some tricks in this vein with the column store.
For the sake of simplicity we only run 7 Single with the 1000 Mt scale.
The first experiment was with SSDs and the vectored read-ahead. The target throughput was reached after 280 seconds.
The next test was with HDDs and extent read-ahead. One hour into the experiment, the CPU was about 70% after processing around 1000 query mixes. It might have been hours before HDD reads became rare enough for hitting 1200% CPU. The test was not worth continuing.
The result with HDDs and vectored read-ahead would be worse since vectored read-ahead leads to smaller read-ahead batches and to less contiguous read patterns. The individual read times here, are over twice the individual read times with per-extent read-ahead. The fact that vectored read-ahead does not read potentially unneeded pages makes no difference. Hence this test is also not worth running to completion.
There are other possibilities for improving HDD I/O. If only 2MB read requests are made, a transfer will be about 20 ms at a sequential transfer speed of 50 MB/s. Then seeking to the next 2MB extent will be a few ms, most often less than 20, so the HDD should give at least half the nominal throughput.
We note that, when reading sequential 8K pages inside a single 2MB (256 page) extent, the seek latency is not 0 as one would expect but an extreme 5 ms. One would think that the drive would buffer a whole track, and a track would hold a large number of 2MB sections, but apparently this is not so.
Therefore, now if we have a sequential read pattern that is more dense than 1 page out of 10, we read all the pages and just keep the ones we want.
So now we set the read-ahead to merge reads that fall within 10 pages. This wastes bandwidth, but supposedly saves on latency. We will see.
So we try, and we find that read-ahead does not account for most pages since it does not get triggered. Thus, we change the triggering condition to be the 2nd read to fall in the extent within 20 seconds of the first.
The HDDs were in all cases 700% busy for 4 HDDs. But with the new setting we get longer requests, most often full extents, which gets a per-HDD transfer rate of about 5 MB/s. With the looser condition for starting read-ahead, 89% of all pages were read in a read-ahead batch. We see the I/O throughput decrease during the run because there are more single-page reads that do not trigger extent read-ahead. So HDDs have 1.7 concurrent operations pending, but the batch size drops, dropping the throughput.
Thus with the best settings, the test with 2000 query mixes finishes in 46 minutes, and the CPU utilization is steadily increasing, hitting 392% for the last minute. In comparison, with SSDs and our worst read-ahead setting we got 1200% CPU in under 5 minutes from cold start. The I/O system can be further tuned; for example, by only reading full extents as long as the buffer pool is not full. In the next post we will measure some more.
BSBM NoteWe look at query times with semi-warm cache, with CPU around 400%. We note that Q8-Q12 are especially bad. Q5 runs at about half speed. Q12 runs at under 1/10th speed. The relatively slowest queries appear to be single-instance lookups. Nothing short of the most aggressive speculative reading can help there. Neither query nor workload has any exploitable pattern. Therefore if an I/O component is to be included in a BSBM metric, the only way to score in this is to use speculative read to the maximum.
Some of the queries take consecutive property values of a single instance. One could parallelize this pipeline, but this would be a one-off and would make sense only when reading from storage (whether HDD, SSD, or otherwise). Multithreading for single rows is not worth the overhead.
A metric for BSBM warm-up is not interesting for database science, but may still be of practical interest in the specific case of RDF stores. Specially reading large chunks at startup time is good, so putting a section in BSBM that would force one to implement this would be a service to most end users. Measuring and reporting such I/O performance would favor space efficiency in general. Space efficiency is generally a good thing, especially at larger scales, so we can put an optional section in the report for warm-up. This is also good for comparing HDDs and SSDs, and for testing read-ahead, which is still something a database is expected to do. Implementors have it easy; just speculatively read everything.
Looking at the BSBM fictional use case, anybody running such a portal would do this from RAM only, so it makes sense to define the primary metric as running from warm cache, in practice 100% from memory.
Benchmarks, Redux Series- Benchmarks, Redux (part 1): On RDF Benchmarks
- Benchmarks, Redux (part 2): A Benchmarking Story
- Benchmarks, Redux (part 3): Virtuoso 7 vs 6 on BSBM Load and Explore
- Benchmarks, Redux (part 4): Benchmark Tuning Questionnaire
- Benchmarks, Redux (part 5): BSBM and I/O; HDDs and SSDs (this post)
- Benchmarks, Redux (part 6): BSBM and I/O, continued
- Benchmarks, Redux (part 7): What Does BSBM Explore Measure?
- Benchmarks, Redux (part 8): BSBM Explore and Update
- Benchmarks, Redux (part 9): BSBM With Cluster
- Benchmarks, Redux (part 10): LOD2 and the Benchmark Process
- Benchmarks, Redux (part 11): On the Substance of RDF Benchmarks
- Benchmarks, Redux (part 12): Our Own BSBM Results Report
- Benchmarks, Redux (part 13): BSBM BI Modifications
- Benchmarks, Redux (part 14): BSBM BI Mix
- Benchmarks, Redux (part 15): BSBM Test Driver Enhancements
-
-
19:17
Benchmarks, Redux (part 5): BSBM and I/O; HDDs and SSDs
» OpenLink Community BlogIn the context of database benchmarks we cannot ignore I/O, as pretty much has been done so far by BSBM.
There are two approaches:
-
run twice or otherwise make sure one runs from memory and forget about I/O, or
-
make rules and metrics for warm-up.
We will see if the second is possible with BSBM.
From this starting point, we look at various ways of scheduling I/O in Virtuoso using a 1000 Mt BSBM database on sets of each of HDDs (hard disk devices) and SSDs (solid-state storage devices). We will see that SSDs in this specific application can make a significant difference.
In this test we have the same 4 stripes of a 1000 Mt BSBM database on each of two storage arrays.
Storage Arrays Type Quantity Maker Size Speed Interface speed Controller Drive Cache RAID SSD 4 Crucial 128 GB N/A 6Gbit SATA RocketRaid 640 128 MB None HDD 4 Samsung 1000 GB 7200 RPM 3Gbit SATA Intel ICH on Supermicro motherboard 16 MB None We make sure that the files are not in OS cache by filling it with other big files, reading a total of 120 GB off SSDs with
`cat file > /dev/null`.The configuration files are as in the report on the 1000 Mt run. We note as significant that we have a few file descriptors for each stripe, and that read-ahead for each is handled by its own thread.
Two different read-ahead schemes are used:
-
With 6 Single, if a 2MB extent gets a second read within a given time after the first, the whole extent is scheduled for background read.
-
With 7 Single, as an index search is vectored, we know a large number of values to fetch at one time and these values are sorted into an ascending sequence. Therefore, by looking at a node in an index tree, we can determine which sub-trees will be accessed and schedule these for read-ahead, skipping any that will not be accessed.
In either model, a sequential scan touching more than a couple of consecutive index leaf pages triggers a read-ahead, to the end of the scanned range or to the next 3000 index leaves, whichever comes first. However, there are no sequential scans of significant size in BSBM.
There are a few different possibilities for the physical I/O:
-
Using a separate read system call for each page. There may be several open file descriptors on a file so that many such calls can proceed concurrently on different threads; the OS will order the operations.
-
A thread finds it needs a page and reads it.
-
Using Unix asynchronous I/O,
aio.h, with theaio_*andlio_listiofunctions. -
Using single-read system calls for adjacent pages. In this way, the drive sees longer requests and should give better throughput. If there are short gaps in the sequence, the gaps are also read, wasting bandwidth but saving on latency.
The two latter apply only to bulk I/O that are scheduled on background threads, one per independently-addressable device (HDD, SSD, or RAID-set). These bulk-reads operate on an elevator model, keeping a sorted queue of things to read or write and moving through this queue from start to end. At any time, the queue may get more work from other threads.
There is a further choice when seeing single-page random requests. They can either go to the elevator or they can be done in place. Taking the elevator is presumably good for throughput but bad for latency. In general, the elevator should have a notion of fairness; these matters are discussed in the CWI collaborative scan paper. Here we do not have long queries, so we do not have to talk about elevator policies or scan sharing; there are no scans. We may touch on these questions later with the column store, the BSBM BI mix, and TPC-H.
While we may know principles, I/O has always given us surprises; the only way to optimize this is to measure.
The metric we try to optimize here is the time it takes for a multiuser BSBM run starting from cold cache to get to 1200% CPU. When running from memory, the CPU is around 1350% for the system in question.
This depends on getting I/O throughput, which in turn depends on having a lot of speculative reading since the workload itself does not give any long stretches to read.
The test driver is set at 16 clients, and the run continues for 2000 query mixes or until target throughput is reached. Target throughput is deemed reached after the first 20 second stretch with CPU at 1200% or higher.
The meter is a stored procedure that records the CPU time, count of reads, cumulative elapsed time spent waiting for I/O, and other metrics. The code for this procedure (for 7 Single; this file will not work on Virtuoso 6 or earlier) is available here.
The database space allocation gives each index a number of 2MB segments, each with 256 8K pages. When a page splits, the new page is allocated from the same extent if possible, or from a specific second extent which is designated as the overflow extent of this extent. This scheme provides for a sort of pseudo-locality within extents over random insert order. Thus there is a chance that pre-reading an extent will get key values in the same range a the ones on the page being requested in the first place. At least the pre-read pages will be from the same index tree. There are insertion orders that do not create good locality with this allocation scheme, though. In order to generally improve locality, one could shuffle pages of an all-dirty subtree before writing this out so as to have physical order match key order. We will look at some tricks in this vein with the column store.
For the sake of simplicity we only run 7 Single with the 1000 Mt scale.
The first experiment was with SSDs and the vectored read-ahead. The target throughput was reached after 280 seconds.
The next test was with HDDs and extent read-ahead. One hour into the experiment, the CPU was about 70% after processing around 1000 query mixes. It might have been hours before HDD reads became rare enough for hitting 1200% CPU. The test was not worth continuing.
The result with HDDs and vectored read-ahead would be worse since vectored read-ahead leads to smaller read-ahead batches and to less contiguous read patterns. The individual read times here, are over twice the individual read times with per-extent read-ahead. The fact that vectored read-ahead does not read potentially unneeded pages makes no difference. Hence this test is also not worth running to completion.
There are other possibilities for improving HDD I/O. If only 2MB read requests are made, a transfer will be about 20 ms at a sequential transfer speed of 50 MB/s. Then seeking to the next 2MB extent will be a few ms, most often less than 20, so the HDD should give at least half the nominal throughput.
We note that, when reading sequential 8K pages inside a single 2MB (256 page) extent, the seek latency is not 0 as one would expect but an extreme 5 ms. One would think that the drive would buffer a whole track, and a track would hold a large number of 2MB sections, but apparently this is not so.
Therefore, now if we have a sequential read pattern that is more dense than 1 page out of 10, we read all the pages and just keep the ones we want.
So now we set the read-ahead to merge reads that fall within 10 pages. This wastes bandwidth, but supposedly saves on latency. We will see.
So we try, and we find that read-ahead does not account for most pages since it does not get triggered. Thus, we change the triggering condition to be the 2nd read to fall in the extent within 20 seconds of the first.
The HDDs were in all cases 700% busy for 4 HDDs. But with the new setting we get longer requests, most often full extents, which gets a per-HDD transfer rate of about 5 MB/s. With the looser condition for starting read-ahead, 89% of all pages were read in a read-ahead batch. We see the I/O throughput decrease during the run because there are more single-page reads that do not trigger extent read-ahead. So HDDs have 1.7 concurrent operations pending, but the batch size drops, dropping the throughput.
Thus with the best settings, the test with 2000 query mixes finishes in 46 minutes, and the CPU utilization is steadily increasing, hitting 392% for the last minute. In comparison, with SSDs and our worst read-ahead setting we got 1200% CPU in under 5 minutes from cold start. The I/O system can be further tuned; for example, by only reading full extents as long as the buffer pool is not full. In the next post we will measure some more.
BSBM NoteWe look at query times with semi-warm cache, with CPU around 400%. We note that Q8-Q12 are especially bad. Q5 runs at about half speed. Q12 runs at under 1/10th speed. The relatively slowest queries appear to be single-instance lookups. Nothing short of the most aggressive speculative reading can help there. Neither query nor workload has any exploitable pattern. Therefore if an I/O component is to be included in a BSBM metric, the only way to score in this is to use speculative read to the maximum.
Some of the queries take consecutive property values of a single instance. One could parallelize this pipeline, but this would be a one-off and would make sense only when reading from storage (whether HDD, SSD, or otherwise). Multithreading for single rows is not worth the overhead.
A metric for BSBM warm-up is not interesting for database science, but may still be of practical interest in the specific case of RDF stores. Specially reading large chunks at startup time is good, so putting a section in BSBM that would force one to implement this would be a service to most end users. Measuring and reporting such I/O performance would favor space efficiency in general. Space efficiency is generally a good thing, especially at larger scales, so we can put an optional section in the report for warm-up. This is also good for comparing HDDs and SSDs, and for testing read-ahead, which is still something a database is expected to do. Implementors have it easy; just speculatively read everything.
Looking at the BSBM fictional use case, anybody running such a portal would do this from RAM only, so it makes sense to define the primary metric as running from warm cache, in practice 100% from memory.
Benchmarks, Redux Series- Benchmarks, Redux (part 1): On RDF Benchmarks
- Benchmarks, Redux (part 2): A Benchmarking Story
- Benchmarks, Redux (part 3): Virtuoso 7 vs 6 on BSBM Load and Explore
- Benchmarks, Redux (part 4): Benchmark Tuning Questionnaire
- Benchmarks, Redux (part 5): BSBM and I/O; HDDs and SSDs (this post)
- Benchmarks, Redux (part 6): BSBM and I/O, continued
- Benchmarks, Redux (part 7): What Does BSBM Explore Measure?
- Benchmarks, Redux (part 8): BSBM Explore and Update
- Benchmarks, Redux (part 9): BSBM With Cluster
- Benchmarks, Redux (part 10): LOD2 and the Benchmark Process
- Benchmarks, Redux (part 11): On the Substance of RDF Benchmarks
- Benchmarks, Redux (part 12): Our Own BSBM Results Report
- Benchmarks, Redux (part 13): BSBM BI Modifications
- Benchmarks, Redux (part 14): BSBM BI Mix
- Benchmarks, Redux (part 15): BSBM Test Driver Enhancements
-
-
20:28
Benchmarks, Redux (part 4): Benchmark Tuning Questionnaire
» OpenLink Community BlogBelow is a questionnaire I sent to the BSBM participants in order to get tuning instructions for the runs we were planning. I have filled in the answers for Virtuoso, here. This can be a checklist for pretty much any RDF database tuning.
-
Threading - What settings should be used (e.g., for query parallelization, I/O parallelization [e.g., prefetch, flush of dirty], thread pools [e,.g. web server], any other thread related)? We will run with 8 and 32 cores, so if there are settings controlling number of read/write (R/W) locks or mutexes or such for serializing diverse things, these should be set accordingly to minimize contention.
The following three settings are all in the
[Parameters]section of thevirtuoso.inifile.-
AsyncQueueMaxThreadscontrols the size of a pool of extra threads that can be used for query parallelization. This should be set to either 1.5 * the number of cores or 1.5 * the number of core threads; see which works better. -
ThreadsPerQueryis the maximum number of threads a single query will take. This should be set to either the number of cores or the number of core threads; see which works better. -
IndexTreeMapsis the number of mutexes over which control for buffering an index tree is split. This can generally be left at default (256 in normal operation; valid settings are powers of 2 from 2 to 1024), but setting to 64, 128, or 512 may be beneficial.A low number will lead to frequent contention; upwards of 64 will have little contention. We have sometimes seen a multiuser workload go 10% faster when setting this to 64 (down from 256), which seems counter-intuitive. This may be a cache artifact.
In">[HTTPServer">In] the
[HTTPServer]]section of thevirtuoso.inifile, theServerThreadssetting is the number of web server threads, i.e., the maximum number of concurrent SPARQL protocol requests. Having a value larger than the number of concurrent clients is OK; for large numbers of concurrent clients a lower value may be better, which will result in requests waiting for a thread to be available.Note — The
[HTTPServer]] ServerThreadsare taken from the total pool made available by the[Parameters] ServerThreads. Thus, the[Parameters] ServerThreadsshould always be at least as large as (and is best set greater than) the[HTTPServer]] ServerThreads, and if using the closed-source Commercial Version,[Parameters] ServerThreadscannot exceed the licensed thread count. -
-
File layout - Are there settings for striping over multiple devices? Settings for other file access parallelism? Settings for SSDs (e.g., SSD based cache of hot set of larger db files on disk)? The target config is for 4 independent disks and 4 independent SSDs. If you depend on RAID, are there settings for this? If you need RAID to be set up, please provide the settings/script for doing this with 4 SSDs on Linux (RH and Debian). This will be software RAID, as we find the hardware RAID to be much worse than an independent disk setup on the system in question.
It is best to stripe database files over all available disks, and to not use RAID. If RAID is desired, then stripe database files across many RAID sets. Use the
segmentdeclaration in thevirtuoso.inifile. It is very important to give each independently seekable device its own I/O queue thread. See the documentation on the TPC-C sample for examples.in the
[Parameters]section of thevirtuoso.inifile, setFDsPerFileto be(the number of concurrent threads * 1.5) ÷ the number of distinct database files.There are no SSD specific settings.
-
Loading - How many parallel streams work best? We are looking for non-transactional bulk load, with no inference materialization. For partitioned cluster settings, do we divide the load streams over server processes?
Use one stream per core (not per core thread). In the case of a cluster, divide load streams evenly across all processes. The total number of streams on a cluster can equal the total number of cores; adjust up or down depending on what is observed.
Use the built-in bulk load facility, i.e.,
ld_dir ('<source-filename-or-directory>', '<file name pattern>', '<destination graph iri>');For example,
SQL> ld_dir ('/path/to/files', '*.n3', 'http://dbpedia.org');Then do a
rdf_loader_run ()on enough connections. For example, you can use the shell commandisql rdf_loader_run () &to start one in a background isql process. When starting background load commands from the shell, you can use the shell
waitcommand to wait for completion. If starting from isql, use thewait_for_children;command (see isql documentation for details).See the BSBM disclosure report for an example load script.
-
What command should be used after non-transactional bulk load, to ensure a consistent persistent state on disk, like a log checkpoint or similar? Load and checkpoint will be timed separately, load being CPU-bound and checkpoint being I/O-bound. No roll-forward log or similar is required; the load does not have to recover if it fails before the checkpoint.
Execute
CHECKPOINT;through a SQL client, e.g.,
isql. This is not a SPARQL statement and cannot be executed over the SPARQL protocol. -
What settings should be used for trickle load of small triple sets into a pre-existing graph? This should be as transactional as supported; at least there should be a roll forward log, unlike the case for the bulk load.
No special settings are needed for load testing; defaults will produce transactional behavior with a roll forward log. Default transaction isolation is
REPEATABLE READ, but this may be altered via SQL session settings or at Virtuoso server start-up through the[Parameters]section of thevirtuoso.inifile, withDefaultIsolation = 4Transaction isolation cannot be set over the SPARQL protocol.
NOTE: When testing full CRUD operations, other isolation settings may be preferable, due to ACID considerations. See answer #12, below, and detailed discussion in part 8 of this series, BSBM Explore and Update.
-
What settings control allocation of memory for database caching? We will be running mostly from memory, so we need to make sure that there is enough memory configured.
In the
[Parameters]section of thevirtuoso.inifile,NumberOfBufferscontrols the amount of RAM used by Virtuoso to cache database files. One buffer caches an 8KB database page. In practice, count 10KB of memory per page. If "swappiness" on Linux is low (e.g., 2), two-thirds or more of physical memory can be used for database buffers. If swapping occurs, decrease the setting. -
What command gives status on memory allocation (e.g., number of buffers, number of dirty buffers, etc.) so that we can verify that things are indeed in server memory and not, for example, being served from OS disk cache. If the cached format is different from the disk layout (e.g., decompression after disk read), is there a command for space statistics for database cache?
In an
isqlsession, executeSTATUS ( ? ? );The second result paragraph gives counts of total, used, and dirty buffers. If used buffers is steady and less than total, and if the disk read count on the line below does not increase, the system is running from memory. The cached format is the same as the disk based format.
-
What command gives information on disk allocation for different things? We are looking for the total size of allocated database pages for quads (including table, indices, anything else associated with quads) and dictionaries for literals, IRI names, etc. If there is a text index on literals, what command gives space stats for this? We count used pages, excluding any preallocated unused pages or other gaps. There is one number for quads and another for the dictionaries or other such structures, optionally a third for text index.
Execute on an
isqlsession:CHECKPOINT; SELECT TOP 20 * FROM sys_index_space_stats ORDER BY iss_pages DESC;
The
iss_pagescolumn is the total pages for each index, including blob pages. Pages are 8KB. Only used pages are reported, gaps and unused pages are not counted. The rows pertaining toRDF_QUADare for quads;RDF_IRI,RDF_PREFIX,RO_START,RDF_OBJare for dictionaries;RDF_OBJ_RO_FLAGS_WORDSandVTLOG_DB_DBA_RDF_OBJare for text index. -
If there is a choice between triples and quads, we will run with quads. How do we ascertain that the run is with quads? How do we find out the index scheme? Should be use an alternate index scheme? Most of the data will be in a single big graph.
The default scheme uses quads. The default index layout is
PSOG,POGS,GS,SP,OP. To see the current index scheme, use anisqlsession to executeSTATISTICS DB.DBA.RDF_QUAD; -
For partitioned cluster settings, are there partitioning-related settings to control even distribution of data between partitions? For example, is there a way to set partitioning by
SorOdepending on which is first in key order for each index?The default partitioning settings are good, i.e., partitioning is on
OorS, whichever is first in key order. -
For partitioned clusters, are there settings to control message batching or similar? What are the statistics available for checking interconnect operation, e.g. message counts, latencies, total aggregate throughput of interconnect?
In the
[Cluster]section of thecluster.inifile,ReqBatchSizeis the number of query states dispatched between cluster nodes per message round trip. This may be incremented from the default of10000to50000or so if this is seen to be useful.To change this on the fly, the following can be issued through an
isqlsession:cl_exec ( ' __dbf_set (''cl_request_batch_size'', 50000) ' );The commands below may be executed through an
isqlsession to get a summary of CPU and message traffic for the whole cluster or process-by-process, respectively. The documentation details the fields.STATUS ('cluster') ;; whole cluster
STATUS ('cluster_d') ;; process-by-process -
Other settings - Are there settings for limiting query planning, when appropriate? For example, the BSBM Explore mix has a large component of unnecessary query optimizer time, since the queries themselves access almost no data. Any other relevant settings?
-
For BSBM, needless query optimization should be capped at Virtuoso server start-up through the
[Parameters]section of thevirtuoso.ini, withStopCompilerWhenXOverRun = 1 -
When testing full CRUD operations (not simply CREATE, i.e., load, as discussed in #5, above), it is essential to make queries run with transaction isolation of
READ COMMITTED, to remove most lock contention. Transaction isolation cannot be adjusted via SPARQL. This can be changed through SQL session settings, or at Virtuoso server start-up through the[Parameters]section of thevirtuoso.inifile, withDefaultIsolation = 2
-
- Benchmarks, Redux (part 1): On RDF Benchmarks
- Benchmarks, Redux (part 2): A Benchmarking Story
- Benchmarks, Redux (part 3): Virtuoso 7 vs 6 on BSBM Load and Explore
- Benchmarks, Redux (part 4): Benchmark Tuning Questionnaire (this post)
- Benchmarks, Redux (part 5): BSBM and I/O; HDDs and SSDs
- Benchmarks, Redux (part 6): BSBM and I/O, continued
- Benchmarks, Redux (part 7): What Does BSBM Explore Measure?
- Benchmarks, Redux (part 8): BSBM Explore and Update
- Benchmarks, Redux (part 9): BSBM With Cluster
- Benchmarks, Redux (part 10): LOD2 and the Benchmark Process
- Benchmarks, Redux (part 11): On the Substance of RDF Benchmarks
- Benchmarks, Redux (part 12): Our Own BSBM Results Report
- Benchmarks, Redux (part 13): BSBM BI Modifications
- Benchmarks, Redux (part 14): BSBM BI Mix
- Benchmarks, Redux (part 15): BSBM Test Driver Enhancements
-
-
20:28
Benchmarks, Redux (part 4): Benchmark Tuning Questionnaire
» OpenLink Community BlogBelow is a questionnaire I sent to the BSBM participants in order to get tuning instructions for the runs we were planning. I have filled in the answers for Virtuoso, here. This can be a checklist for pretty much any RDF database tuning.
-
Threading - What settings should be used (e.g., for query parallelization, I/O parallelization [e.g., prefetch, flush of dirty], thread pools [e,.g. web server], any other thread related)? We will run with 8 and 32 cores, so if there are settings controlling number of read/write (R/W) locks or mutexes or such for serializing diverse things, these should be set accordingly to minimize contention.
The following three settings are all in the
[Parameters]section of thevirtuoso.inifile.-
AsyncQueueMaxThreadscontrols the size of a pool of extra threads that can be used for query parallelization. This should be set to either 1.5 * the number of cores or 1.5 * the number of core threads; see which works better. -
ThreadsPerQueryis the maximum number of threads a single query will take. This should be set to either the number of cores or the number of core threads; see which works better. -
IndexTreeMapsis the number of mutexes over which control for buffering an index tree is split. This can generally be left at default (256 in normal operation; valid settings are powers of 2 from 2 to 1024), but setting to 64, 128, or 512 may be beneficial.A low number will lead to frequent contention; upwards of 64 will have little contention. We have sometimes seen a multiuser workload go 10% faster when setting this to 64 (down from 256), which seems counter-intuitive. This may be a cache artifact.
In">[HTTPServer">In] the
[HTTPServer]]section of thevirtuoso.inifile, theServerThreadssetting is the number of web server threads, i.e., the maximum number of concurrent SPARQL protocol requests. Having a value larger than the number of concurrent clients is OK; for large numbers of concurrent clients a lower value may be better, which will result in requests waiting for a thread to be available.Note — The
[HTTPServer]] ServerThreadsare taken from the total pool made available by the[Parameters] ServerThreads. Thus, the[Parameters] ServerThreadsshould always be at least as large as (and is best set greater than) the[HTTPServer]] ServerThreads, and if using the closed-source Commercial Version,[Parameters] ServerThreadscannot exceed the licensed thread count. -
-
File layout - Are there settings for striping over multiple devices? Settings for other file access parallelism? Settings for SSDs (e.g., SSD based cache of hot set of larger db files on disk)? The target config is for 4 independent disks and 4 independent SSDs. If you depend on RAID, are there settings for this? If you need RAID to be set up, please provide the settings/script for doing this with 4 SSDs on Linux (RH and Debian). This will be software RAID, as we find the hardware RAID to be much worse than an independent disk setup on the system in question.
It is best to stripe database files over all available disks, and to not use RAID. If RAID is desired, then stripe database files across many RAID sets. Use the
segmentdeclaration in thevirtuoso.inifile. It is very important to give each independently seekable device its own I/O queue thread. See the documentation on the TPC-C sample for examples.in the
[Parameters]section of thevirtuoso.inifile, setFDsPerFileto be(the number of concurrent threads * 1.5) ÷ the number of distinct database files.There are no SSD specific settings.
-
Loading - How many parallel streams work best? We are looking for non-transactional bulk load, with no inference materialization. For partitioned cluster settings, do we divide the load streams over server processes?
Use one stream per core (not per core thread). In the case of a cluster, divide load streams evenly across all processes. The total number of streams on a cluster can equal the total number of cores; adjust up or down depending on what is observed.
Use the built-in bulk load facility, i.e.,
ld_dir ('<source-filename-or-directory>', '<file name pattern>', '<destination graph iri>');For example,
SQL> ld_dir ('/path/to/files', '*.n3', 'http://dbpedia.org');Then do a
rdf_loader_run ()on enough connections. For example, you can use the shell commandisql rdf_loader_run () &to start one in a background isql process. When starting background load commands from the shell, you can use the shell
waitcommand to wait for completion. If starting from isql, use thewait_for_children;command (see isql documentation for details).See the BSBM disclosure report for an example load script.
-
What command should be used after non-transactional bulk load, to ensure a consistent persistent state on disk, like a log checkpoint or similar? Load and checkpoint will be timed separately, load being CPU-bound and checkpoint being I/O-bound. No roll-forward log or similar is required; the load does not have to recover if it fails before the checkpoint.
Execute
CHECKPOINT;through a SQL client, e.g.,
isql. This is not a SPARQL statement and cannot be executed over the SPARQL protocol. -
What settings should be used for trickle load of small triple sets into a pre-existing graph? This should be as transactional as supported; at least there should be a roll forward log, unlike the case for the bulk load.
No special settings are needed for load testing; defaults will produce transactional behavior with a roll forward log. Default transaction isolation is
REPEATABLE READ, but this may be altered via SQL session settings or at Virtuoso server start-up through the[Parameters]section of thevirtuoso.inifile, withDefaultIsolation = 4Transaction isolation cannot be set over the SPARQL protocol.
NOTE: When testing full CRUD operations, other isolation settings may be preferable, due to ACID considerations. See answer #12, below, and detailed discussion in part 8 of this series, BSBM Explore and Update.
-
What settings control allocation of memory for database caching? We will be running mostly from memory, so we need to make sure that there is enough memory configured.
In the
[Parameters]section of thevirtuoso.inifile,NumberOfBufferscontrols the amount of RAM used by Virtuoso to cache database files. One buffer caches an 8KB database page. In practice, count 10KB of memory per page. If "swappiness" on Linux is low (e.g., 2), two-thirds or more of physical memory can be used for database buffers. If swapping occurs, decrease the setting. -
What command gives status on memory allocation (e.g., number of buffers, number of dirty buffers, etc.) so that we can verify that things are indeed in server memory and not, for example, being served from OS disk cache. If the cached format is different from the disk layout (e.g., decompression after disk read), is there a command for space statistics for database cache?
In an
isqlsession, executeSTATUS ( ? ? );The second result paragraph gives counts of total, used, and dirty buffers. If used buffers is steady and less than total, and if the disk read count on the line below does not increase, the system is running from memory. The cached format is the same as the disk based format.
-
What command gives information on disk allocation for different things? We are looking for the total size of allocated database pages for quads (including table, indices, anything else associated with quads) and dictionaries for literals, IRI names, etc. If there is a text index on literals, what command gives space stats for this? We count used pages, excluding any preallocated unused pages or other gaps. There is one number for quads and another for the dictionaries or other such structures, optionally a third for text index.
Execute on an
isqlsession:CHECKPOINT; SELECT TOP 20 * FROM sys_index_space_stats ORDER BY iss_pages DESC;
The
iss_pagescolumn is the total pages for each index, including blob pages. Pages are 8KB. Only used pages are reported, gaps and unused pages are not counted. The rows pertaining toRDF_QUADare for quads;RDF_IRI,RDF_PREFIX,RO_START,RDF_OBJare for dictionaries;RDF_OBJ_RO_FLAGS_WORDSandVTLOG_DB_DBA_RDF_OBJare for text index. -
If there is a choice between triples and quads, we will run with quads. How do we ascertain that the run is with quads? How do we find out the index scheme? Should be use an alternate index scheme? Most of the data will be in a single big graph.
The default scheme uses quads. The default index layout is
PSOG,POGS,GS,SP,OP. To see the current index scheme, use anisqlsession to executeSTATISTICS DB.DBA.RDF_QUAD; -
For partitioned cluster settings, are there partitioning-related settings to control even distribution of data between partitions? For example, is there a way to set partitioning by
SorOdepending on which is first in key order for each index?The default partitioning settings are good, i.e., partitioning is on
OorS, whichever is first in key order. -
For partitioned clusters, are there settings to control message batching or similar? What are the statistics available for checking interconnect operation, e.g. message counts, latencies, total aggregate throughput of interconnect?
In the
[Cluster]section of thecluster.inifile,ReqBatchSizeis the number of query states dispatched between cluster nodes per message round trip. This may be incremented from the default of10000to50000or so if this is seen to be useful.To change this on the fly, the following can be issued through an
isqlsession:cl_exec ( ' __dbf_set (''cl_request_batch_size'', 50000) ' );The commands below may be executed through an
isqlsession to get a summary of CPU and message traffic for the whole cluster or process-by-process, respectively. The documentation details the fields.STATUS ('cluster') ;; whole cluster
STATUS ('cluster_d') ;; process-by-process -
Other settings - Are there settings for limiting query planning, when appropriate? For example, the BSBM Explore mix has a large component of unnecessary query optimizer time, since the queries themselves access almost no data. Any other relevant settings?
-
For BSBM, needless query optimization should be capped at Virtuoso server start-up through the
[Parameters]section of thevirtuoso.ini, withStopCompilerWhenXOverRun = 1 -
When testing full CRUD operations (not simply CREATE, i.e., load, as discussed in #5, above), it is essential to make queries run with transaction isolation of
READ COMMITTED, to remove most lock contention. Transaction isolation cannot be adjusted via SPARQL. This can be changed through SQL session settings, or at Virtuoso server start-up through the[Parameters]section of thevirtuoso.inifile, withDefaultIsolation = 2
-
- Benchmarks, Redux (part 1): On RDF Benchmarks
- Benchmarks, Redux (part 2): A Benchmarking Story
- Benchmarks, Redux (part 3): Virtuoso 7 vs 6 on BSBM Load and Explore
- Benchmarks, Redux (part 4): Benchmark Tuning Questionnaire (this post)
- Benchmarks, Redux (part 5): BSBM and I/O; HDDs and SSDs
- Benchmarks, Redux (part 6): BSBM and I/O, continued
- Benchmarks, Redux (part 7): What Does BSBM Explore Measure?
- Benchmarks, Redux (part 8): BSBM Explore and Update
- Benchmarks, Redux (part 9): BSBM With Cluster
- Benchmarks, Redux (part 10): LOD2 and the Benchmark Process
- Benchmarks, Redux (part 11): The Substance of Benchmarks
- Benchmarks, Redux (part 12): Our Own BSBM Results Report
-
-
23:23
Benchmarks, Redux (part 3): Virtuoso 7 vs 6 on BSBM Load and Explore
» OpenLink Community BlogIn this post I will summarize the figures for BSBM Load and Explore mixes at 100 Mt, 200 Mt, and 1000 Mt. (1 Mt = 1 Megatriple, or one million triples.) The measurements were made on a 72GB 2xXeon 5520 with 4 SSDs. The exact specifications and configurations are in the raw reports to follow.
The load time in the recent Berlin report was measured with the wrong function, and so far as we can tell, without multiple threads. The intermediate cut of Virtuoso they tested also had broken SPARQL/Update (also known as SPARUL) features. We have fixed this since, and give here the right numbers.
In the course of the discussion to follow, we talk about 3 different kinds of Virtuoso:
-
6 Single is the generally available single server configuration of Virtuoso. Whether this is open source or not does not make a difference.
-
6 Cluster is the generally available commercial only cluster-capable Virtuoso.
-
7 Single is the next generation single server Virtuoso, about to be released as a preview.
To understand the numbers, we must explain how these differ from each other in execution:
-
6 Single has one thread-per-query, and operates on one state of the query at a time.
-
6 Cluster has one thread-per-query-per-process, and between processes it operates on batches of some tens-of-thousands of simultaneous query states. Within each node, these batches run through the execution pipeline one state at a time. Aggregation is distributed, and the query optimizer is generally smart about shipping colocated functions together.
-
7 Single has multiple threads-per-query and in all situations operates on batches of 10,000 or more simultaneous query states. This means, for example, that index lookups get large numbers of parameters which then are sorted to get an ascending search pattern which benefits from locality, so the
n * log(n)index access for the batch becomes more like linear if the data accessed has any locality. Furthermore, if there are many operands to an operator, these can be split on multiple threads. Also, scans of consecutive rows can be split before the scan on multiple threads, each doing a range of the scan. These features are called vectored execution and query parallelization. These techniques will also be applied to the cluster variant in due time.
The version 6 and 7 variants discussed here use the same physical storage layout with row-wise key compression. Additionally, there exists a column-wise storage option in 7 that can fit 4x the number of quads in the same space. This column store option is not used here because it still has some problems with random order inserts.
We will first consider loading. Below are the load times and rates for 7 at each scale.
7 Single Scale Rate
(quads per second)Load time
(seconds)Checkpoint time
(seconds)100 Mt 261,366 301 82 200 Mt 216,000 802 123 1000 Mt 130,378 6641 1012 In each case the load was made on 8 concurrent streams, each reading a file from a pool of 80 files for the two smaller scales and 360 files for the larger scale.
We also loaded the smallest data set with 6 Single using the same load script.
6 Single Scale Rate
(quads per second)Load time
(seconds)Checkpoint time
(seconds)100 Mt 74,713 1192 145 CPU time with 6 Single was 8047 seconds. We compare this to 4453 seconds of CPU for the same load on 7 Single. The CPU% during the run was on either side of 700% for 6 Single and 1300% for 7 Single. Note that high percentages involve core threads, not real cores.
The difference is mostly attributable to vectoring and the introduction of a non-transactional insert. The 6 Single inserts transactionally but makes very frequent commits and writes no log, resulting in de facto non-transactional behavior but still there is a lock and commit cycle. Inserts in RDF load usually exhibit locality on all SPOG. Sorting by value gives ascending insert order and eliminates much of the lookup time for deciding where the next row will go. Contention on page read-write locks is less because the engine stays longer on a page, inserting multiple values in one go, instead of re-acquiring the read-write lock and possible transaction locks for each row.
Furthermore, for single stream loading the non-transactional mode can serve one thread doing the parsing with many threads doing the inserting; hence, in practice the speed is bounded by the parsing speed. In multi-stream load this parallelization also happens but is less significant, as adding threads past the count of core threads is not useful. Writes are all in-place, and no delta-merge mechanism is involved. For transactional inserts, the uncommitted rows are not visible to read-committed readers, which do not block. Repeatable and serializable readers would block before an uncommitted insert.
Now for the run (larger numbers indicate more queries executed, and are therefore better):
6 Single Throughput
(QMpH, query mixes per hour)Scale Single User 16 User 100 Mt 7641 29433 200 Mt 6017 13335 1000 Mt 1770 2487
7 Single Throughput
(QMpH, query mixes per hour)Scale Single User 16 User 100 Mt 11742 72278 200 Mt 10225 60951 1000 Mt 6262 24672 The 100 Mt and 200 Mt runs are entirely in memory; the 1000 Mt run is mostly in memory, with about a 1.6 MB/s trickle from SSD in steady state. Accordingly, the 1000 Mt run is longer, with 2000 query mixes in the timed period, preceded by a warm-up of 2000 mixes with a different seed. For the memory-only scales, we run 500 mixes twice, and take the timing of the second run.
Looking at single user speeds, 6 Single and 7 Single are closest at the small end and drift farther apart at the larger scales. This comes from the increased opportunity to parallelize Q5, since this works on more data and is relatively more important as the scale gets larger. The 100 Mt run of 7 Single has about 130% CPU, and the 1000 Mt run has about 270%. This also explains why adding clients gives a larger boost at the smaller scale.
Now let us look at the relative effects of parallelizing and vectoring in 7 Single. We run 50 mixes of Single User Explore: 6132 QMpH with both parallelizing and vectoring on; 2805 QMpH with execution limited to a single thread. Then we set the vector size to 1, meaning that the query pipeline runs one row at a time. This gets us 1319 QMpH which is a bit worse than 6 Single. This is to be expected since there is some overhead to running vectored with single-element vectors. Q5 on 7 Single with vectoring and a single thread runs at 1.9 qps; with single-element vectors, at 0.8 qps. The 6 Single engine runs Q5 at 1.13 qps.
The 100 Mt scale 7 Single gains the most from adding clients; the 1000 Mt 6 Single gains the least. The reason for the latter is covered in detail in A Benchmarking Story. We note that while vectoring is primarily geared to better single-thread speed and better cache hit rates, it delivers a huge multithreaded benefit by eliminating the mutex contention at the index tree top which stops 6 Single dead at 1000 Mt.
In conclusion, we see that even with a workload of short queries and little opportunity for parallelism, we get substantial benefits from query parallelization and vectoring. When moving to more complex workloads, the benefits become more pronounced. For a single user complex query load, we can get 7x speed-up from parallelism (8 core), plus up to 3x from vectoring. These numbers do not take into account the benefits of the column store; those will be analyzed separately a bit later.
The full run details will be supplied at the end of this blog series.
Benchmarks, Redux Series- Benchmarks, Redux (part 1): On RDF Benchmarks
- Benchmarks, Redux (part 2): A Benchmarking Story
- Benchmarks, Redux (part 3): Virtuoso 7 vs 6 on BSBM Load and Explore (this post)
- Benchmarks, Redux (part 4): Benchmark Tuning Questionnaire
- Benchmarks, Redux (part 5): BSBM and I/O; HDDs and SSDs
- Benchmarks, Redux (part 6): BSBM and I/O, continued
- Benchmarks, Redux (part 7): What Does BSBM Explore Measure?
- Benchmarks, Redux (part 8): BSBM Explore and Update
- Benchmarks, Redux (part 9): BSBM With Cluster
- Benchmarks, Redux (part 10): LOD2 and the Benchmark Process
- Benchmarks, Redux (part 11): The Substance of Benchmarks
- Benchmarks, Redux (part 12): Our Own BSBM Results Report
-
-
23:23
Benchmarks, Redux (part 3): Virtuoso 7 vs 6 on BSBM Load and Explore
» OpenLink Community BlogIn this post I will summarize the figures for BSBM Load and Explore mixes at 100 Mt, 200 Mt, and 1000 Mt. (1 Mt = 1 Megatriple, or one million triples.) The measurements were made on a 72GB 2xXeon 5520 with 4 SSDs. The exact specifications and configurations are in the raw reports to follow.
The load time in the recent Berlin report was measured with the wrong function, and so far as we can tell, without multiple threads. The intermediate cut of Virtuoso they tested also had broken SPARQL/Update (also known as SPARUL) features. We have fixed this since, and give here the right numbers.
In the course of the discussion to follow, we talk about 3 different kinds of Virtuoso:
-
6 Single is the generally available single server configuration of Virtuoso. Whether this is open source or not does not make a difference.
-
6 Cluster is the generally available commercial only cluster-capable Virtuoso.
-
7 Single is the next generation single server Virtuoso, about to be released as a preview.
To understand the numbers, we must explain how these differ from each other in execution:
-
6 Single has one thread-per-query, and operates on one state of the query at a time.
-
6 Cluster has one thread-per-query-per-process, and between processes it operates on batches of some tens-of-thousands of simultaneous query states. Within each node, these batches run through the execution pipeline one state at a time. Aggregation is distributed, and the query optimizer is generally smart about shipping colocated functions together.
-
7 Single has multiple threads-per-query and in all situations operates on batches of 10,000 or more simultaneous query states. This means, for example, that index lookups get large numbers of parameters which then are sorted to get an ascending search pattern which benefits from locality, so the
n * log(n)index access for the batch becomes more like linear if the data accessed has any locality. Furthermore, if there are many operands to an operator, these can be split on multiple threads. Also, scans of consecutive rows can be split before the scan on multiple threads, each doing a range of the scan. These features are called vectored execution and query parallelization. These techniques will also be applied to the cluster variant in due time.
The version 6 and 7 variants discussed here use the same physical storage layout with row-wise key compression. Additionally, there exists a column-wise storage option in 7 that can fit 4x the number of quads in the same space. This column store option is not used here because it still has some problems with random order inserts.
We will first consider loading. Below are the load times and rates for 7 at each scale.
7 Single Scale Rate
(quads per second)Load time
(seconds)Checkpoint time
(seconds)100 Mt 261,366 301 82 200 Mt 216,000 802 123 1000 Mt 130,378 6641 1012 In each case the load was made on 8 concurrent streams, each reading a file from a pool of 80 files for the two smaller scales and 360 files for the larger scale.
We also loaded the smallest data set with 6 Single using the same load script.
6 Single Scale Rate
(quads per second)Load time
(seconds)Checkpoint time
(seconds)100 Mt 74,713 1192 145 CPU time with 6 Single was 8047 seconds. We compare this to 4453 seconds of CPU for the same load on 7 Single. The CPU% during the run was on either side of 700% for 6 Single and 1300% for 7 Single. Note that high percentages involve core threads, not real cores.
The difference is mostly attributable to vectoring and the introduction of a non-transactional insert. The 6 Single inserts transactionally but makes very frequent commits and writes no log, resulting in de facto non-transactional behavior but still there is a lock and commit cycle. Inserts in RDF load usually exhibit locality on all SPOG. Sorting by value gives ascending insert order and eliminates much of the lookup time for deciding where the next row will go. Contention on page read-write locks is less because the engine stays longer on a page, inserting multiple values in one go, instead of re-acquiring the read-write lock and possible transaction locks for each row.
Furthermore, for single stream loading the non-transactional mode can serve one thread doing the parsing with many threads doing the inserting; hence, in practice the speed is bounded by the parsing speed. In multi-stream load this parallelization also happens but is less significant, as adding threads past the count of core threads is not useful. Writes are all in-place, and no delta-merge mechanism is involved. For transactional inserts, the uncommitted rows are not visible to read-committed readers, which do not block. Repeatable and serializable readers would block before an uncommitted insert.
Now for the run (larger numbers indicate more queries executed, and are therefore better):
6 Single Throughput
(QMpH, query mixes per hour)Scale Single User 16 User 100 Mt 7641 29433 200 Mt 6017 13335 1000 Mt 1770 2487
7 Single Throughput
(QMpH, query mixes per hour)Scale Single User 16 User 100 Mt 11742 72278 200 Mt 10225 60951 1000 Mt 6262 24672 The 100 Mt and 200 Mt runs are entirely in memory; the 1000 Mt run is mostly in memory, with about a 1.6 MB/s trickle from SSD in steady state. Accordingly, the 1000 Mt run is longer, with 2000 query mixes in the timed period, preceded by a warm-up of 2000 mixes with a different seed. For the memory-only scales, we run 500 mixes twice, and take the timing of the second run.
Looking at single user speeds, 6 Single and 7 Single are closest at the small end and drift farther apart at the larger scales. This comes from the increased opportunity to parallelize Q5, since this works on more data and is relatively more important as the scale gets larger. The 100 Mt run of 7 Single has about 130% CPU, and the 1000 Mt run has about 270%. This also explains why adding clients gives a larger boost at the smaller scale.
Now let us look at the relative effects of parallelizing and vectoring in 7 Single. We run 50 mixes of Single User Explore: 6132 QMpH with both parallelizing and vectoring on; 2805 QMpH with execution limited to a single thread. Then we set the vector size to 1, meaning that the query pipeline runs one row at a time. This gets us 1319 QMpH which is a bit worse than 6 Single. This is to be expected since there is some overhead to running vectored with single-element vectors. Q5 on 7 Single with vectoring and a single thread runs at 1.9 qps; with single-element vectors, at 0.8 qps. The 6 Single engine runs Q5 at 1.13 qps.
The 100 Mt scale 7 Single gains the most from adding clients; the 1000 Mt 6 Single gains the least. The reason for the latter is covered in detail in A Benchmarking Story. We note that while vectoring is primarily geared to better single-thread speed and better cache hit rates, it delivers a huge multithreaded benefit by eliminating the mutex contention at the index tree top which stops 6 Single dead at 1000 Mt.
In conclusion, we see that even with a workload of short queries and little opportunity for parallelism, we get substantial benefits from query parallelization and vectoring. When moving to more complex workloads, the benefits become more pronounced. For a single user complex query load, we can get 7x speed-up from parallelism (8 core), plus up to 3x from vectoring. These numbers do not take into account the benefits of the column store; those will be analyzed separately a bit later.
The full run details will be supplied at the end of this blog series.
Benchmarks, Redux Series- Benchmarks, Redux (part 1): On RDF Benchmarks
- Benchmarks, Redux (part 2): A Benchmarking Story
- Benchmarks, Redux (part 3): Virtuoso 7 vs 6 on BSBM Load and Explore (this post)
- Benchmarks, Redux (part 4): Benchmark Tuning Questionnaire
- Benchmarks, Redux (part 5): BSBM and I/O; HDDs and SSDs
- Benchmarks, Redux (part 6): BSBM and I/O, continued
- Benchmarks, Redux (part 7): What Does BSBM Explore Measure?
- Benchmarks, Redux (part 8): BSBM Explore and Update
- Benchmarks, Redux (part 9): BSBM With Cluster
- Benchmarks, Redux (part 10): LOD2 and the Benchmark Process
- Benchmarks, Redux (part 11): The Substance of Benchmarks
- Benchmarks, Redux (part 12): Our Own BSBM Results Report
-
-
23:49
Data Spaces
» OpenLink Community BlogThere is increasing coalescence around the idea that [HTTP-based] Linked Data adds a tangible dimension to the World Wide Web (Web). This Data Dimension grants end-users, power-users, integrators, and developers the ability to experience the Web not solely as a Information Space or Document Space, but now also as a Data Space.
Here is a simple What and Why guide covering the essence of Data Spaces.
What is a Data Space?A Data Space is a point of presence on a network, where every Data Object (item or entity) is given a Name (e.g., a URI) by which it may be Referenced or Identified.
In a Data Space, every Representation of those Data Objects (i.e., every Object Representation) has an Address (e.g., a URL) from which it may be Retrieved (or "gotten").
In a Data Space, every Object Representation is a time variant (that is, it changes over time), streamable, and format-agnostic Resource.
An Object Representation is simply a Description of that Object. It takes the form of a graph, pictorially constructed from sets of 3 elements which are themselves named Subject, Predicate, and Object (or SPO); or Entity, Attribute, and Value (or EAV). Each Entity+Attribute+Value or Subject+Predicate+Object set (or triple), is one datum, one piece of data, one persisted observation about a given Subject or Entity.
The underlying Schema that defines and constrains the construction of Object Representations is based on Logic, specifically First-Order Logic. Each Object Representation is a collection of persisted observations (Data) about a given Subject, which aid observers in materializing their perception (Information), and ultimately comprehension (Knowledge), of that Subject.
Why are Data Spaces important?In the real-world -- which is networked by nature -- data is heterogeneously (or "differently") shaped, and disparately located.
Data has been increasing at an alarming rate since the advent of computing; the interWeb simply provides context that makes this reality more palpable and more exploitable, and in the process virtuously ups the ante through increasingly exponential growth rates.
We can't stop data heterogeneity; it is endemic to the nature of its producers -- humans and/or human-directed machines. What we can do, though, is create a powerful Conceptual-level "bus" or "interface" for data integration, based on Data Description oriented Logic rather than Data Representation oriented Formats. Basically, it's possible for us to use a Common Logic as the basis for expressing and blending SPO- or EAV-based Object Representations in a variety of Formats (or "dialects").
The roadmap boils down to:
-
Assigning unambiguous Object Names to:
-
Every record (or, in table terms, every row);
-
Every record attribute (or, in table terms, every field or column);
-
Every record relationship (that is, every relationship between one record and another);
-
Every record container (e.g., every table or view in a relational database, every named graph, every spreadsheet, every text file, etc.);
-
-
Making each Object Name resolve to an Address through which Create, Read, Update, and Delete ("CRUD") operations can be performed against (can access) the associated Object Representation graph.
-
-
21:12
Benchmarks, Redux (part 2): A Benchmarking Story
» OpenLink Community BlogCaeterum censeo, benchmarks are for vendors...
This is an edifying story about benchmarks and how databases work. I will show how one detail makes a 5+x difference, and how one really must understand how things work in order to make sense of benchmarks.
We begin right after the publication of the recent Berlin report. This report gives us OK performance for queries and very bad performance for loading. Trickle updates were not measurable. This comes as a consequence of testing intermediate software cuts and having incomplete instructions for operating them. I will cover the whole BSBM matter and the general benchmarking question in forthcoming posts; for now, let's talk about specifics.
In the course of the discussion to follow, we talk about 3 different kinds of Virtuoso:
-
6 Single is the generally available single-instance-server configuration of Virtuoso. Whether this is open source or not does not make a difference.
-
6 Cluster is the generally available, commercial-only, cluster-capable Virtuoso.
-
7 Single is the next-generation single-instance-server Virtuoso, about to be released as a preview.
We began by running the various parts of BSBM at different scales with different Virtuoso variants. In so doing, we noticed that the BSBM explore mix at one scale got better throughput as we added more clients, approximately as one would expect based on CPU usage and number of cores, while at another scale this was not so.
At the 1-billion-triple scale (1000 Mt; 1 Mt = 1 Megatriple, or one million triples) we saw CPU going from 200% with 1 client to 1400% with 16 clients but throughput increased by less than 20%.
When we ran the same scale with our shared-nothing 6 Cluster, running 8 processes on the same box, throughput increased normally with the client count. We have not previously tried BSBM with 6 Cluster simply because there is little to gain and a lot to lose by distributing this workload. But here we got a multiuser throughput with 6 Cluster that is easily 3 times that of the single server, even with a cluster-unfriendly workload.
See, sometimes scaling out even within a shared memory multiprocessor pays! Still, what we saw was rather anomalous.
Over the years we have looked at performance any number of times and have a lot of built-in meters. For cases of high CPU with no throughput, the prime suspect is contention on critical sections. Quite right, when building with the mutex meter enabled, counting how many times each mutex is acquired and how many times this results in a wait, we found a mutex which gets acquired 600M times in the run, of which an insane 450M result in a wait. One can count a microsecond of real time each time a mutex wait results in the kernel switching tasks. The run took 500 s or so, of which 450 s of real time were attributable to the overhead of waiting for this one mutex.
Waiting for a mutex is a real train wreck. We have tried spinning a few times before it, which the OS does anyhow, but this does not help. Using spin locks is good only if waits are extremely rare; with any frequency of waiting, even for very short waits, a mutex is still a lot better.
Now, the mutex in question happens to serialize the buffer cache for one specific page of data, one level down from the root of the index for RDF PSOG. By the luck of the draw, the Ps falling on that page are commonly accessed Ps pertaining to product features. In order to get any product feature value, one must pass via this page. At the smaller scale, the different properties web their different ways based on the index root.
One might here ask why the problem is one level down from the root and not in the root. The index root is already handled specially, so the read-write locks for buffers usually apply only for the first level down. One might also ask why have a mutex in the first place. Well, unless one is read-only and all in memory, there simply must be a way to say that a buffer must not get written to by one thread while another is reading it. Same for cache replacement. Some in-memory people fork a whole copy of the database process to do a large query and so can forget about serialization. But one must have long queries for this and have all in memory. One can make writes less frequent by keeping deltas, but this does not remove the need to merge the deltas at some point, which cannot happen without serializing this with the readers.
Most of the time the offending mutex is acquired for getting a property of a product in Q5, the one that looks for products with similar values of a numeric property. We retrieve this property for a number of products in one go, due to vectoring. Vectoring is supposed to save us from constantly hitting the index tree top when getting the next match. So how come there is contention in the index tree top? As it happens, the vectored index lookup checks for locality only when all search conditions on key parts are equalities. Here however there is equality on P and S and a range on O; hence, the lookup starts from the index root every time.
So I changed this. The effect was Q5 getting over twice as fast, with the single user throughput at 1000 Mt going from 2000 to 5200 QMpH (Query Mixes per Hour) and the 16-user throughput going from 3800 to over 21000 QMpH. The previously "good" throughput of 40K QMpH at 100 Mt went to 66K QMpH.
Vectoring can make a real difference. The throughputs for the same workload on 6 Single, without vectoring, thus unavoidably hitting the page with the crazy contention, are 1770 QMpH single user and 2487 QMpH with 16 users. The 6 Cluster throughput, avoiding the contention but without the increased locality from vectoring and with the increased latency of going out-of-process for most of the data, was about 11.5K QMpH with 16 users. Each partition had a page getting the hits but since the partitioning was on S and S was about-evenly distributed, each partition got 1/8 of the load; thus waiting on the mutex did not become a killer issue.
We see how detailed analysis of benchmarks can lead to almost an order of magnitude improvements in a short time. This analysis is however both difficult and tedious. It is not readily delegable; one needs real knowledge of how things work and of how they ought to work in order to get anywhere with this. Experience tends to show that a competitive situation is needed in order to motivate one to go to the trouble. Unless something really sticks out in an obvious manner, one is most likely not going to look deep enough. Of course, this is seen in applications too but application optimization tends to stop at a point where the application is usable. Also stored procedures and specially-tweaked queries will usually help. In most application scenarios, we are not simultaneously looking at multiple different implementations, except maybe at the start of development but then this falls under benchmarking and evaluation.
So, the usefulness of benchmarks is again confirmed. There is likely great unexplored space for improvement as we move to more interesting and diverse scenarios.
Benchmarks, Redux Series- Benchmarks, Redux (part 1): On RDF Benchmarks
- Benchmarks, Redux (part 2): A Benchmarking Story (this post)
- Benchmarks, Redux (part 3): Virtuoso 7 vs 6 on BSBM Load and Explore
- Benchmarks, Redux (part 4): Benchmark Tuning Questionnaire
- Benchmarks, Redux (part 5): BSBM and I/O; HDDs and SSDs
- Benchmarks, Redux (part 6): BSBM and I/O, continued
- Benchmarks, Redux (part 7): What Does BSBM Explore Measure?
- Benchmarks, Redux (part 8): BSBM Explore and Update
- Benchmarks, Redux (part 9): BSBM With Cluster
- Benchmarks, Redux (part 10): LOD2 and the Benchmark Process
- Benchmarks, Redux (part 11): The Substance of Benchmarks
- Benchmarks, Redux (part 12): Our Own BSBM Results Report
-
-
21:12
Benchmarks, Redux (part 2): A Benchmarking Story
» OpenLink Community BlogCaeterum censeo, benchmarks are for vendors...
This is an edifying story about benchmarks and how databases work. I will show how one detail makes a 5+x difference, and how one really must understand how things work in order to make sense of benchmarks.
We begin right after the publication of the recent Berlin report. This report gives us OK performance for queries and very bad performance for loading. Trickle updates were not measurable. This comes as a consequence of testing intermediate software cuts and having incomplete instructions for operating them. I will cover the whole BSBM matter and the general benchmarking question in forthcoming posts; for now, let's talk about specifics.
In the course of the discussion to follow, we talk about 3 different kinds of Virtuoso:
-
6 Single is the generally available single-instance-server configuration of Virtuoso. Whether this is open source or not does not make a difference.
-
6 Cluster is the generally available, commercial-only, cluster-capable Virtuoso.
-
7 Single is the next-generation single-instance-server Virtuoso, about to be released as a preview.
We began by running the various parts of BSBM at different scales with different Virtuoso variants. In so doing, we noticed that the BSBM explore mix at one scale got better throughput as we added more clients, approximately as one would expect based on CPU usage and number of cores, while at another scale this was not so.
At the 1-billion-triple scale (1000 Mt; 1 Mt = 1 Megatriple, or one million triples) we saw CPU going from 200% with 1 client to 1400% with 16 clients but throughput increased by less than 20%.
When we ran the same scale with our shared-nothing 6 Cluster, running 8 processes on the same box, throughput increased normally with the client count. We have not previously tried BSBM with 6 Cluster simply because there is little to gain and a lot to lose by distributing this workload. But here we got a multiuser throughput with 6 Cluster that is easily 3 times that of the single server, even with a cluster-unfriendly workload.
See, sometimes scaling out even within a shared memory multiprocessor pays! Still, what we saw was rather anomalous.
Over the years we have looked at performance any number of times and have a lot of built-in meters. For cases of high CPU with no throughput, the prime suspect is contention on critical sections. Quite right, when building with the mutex meter enabled, counting how many times each mutex is acquired and how many times this results in a wait, we found a mutex which gets acquired 600M times in the run, of which an insane 450M result in a wait. One can count a microsecond of real time each time a mutex wait results in the kernel switching tasks. The run took 500 s or so, of which 450 s of real time were attributable to the overhead of waiting for this one mutex.
Waiting for a mutex is a real train wreck. We have tried spinning a few times before it, which the OS does anyhow, but this does not help. Using spin locks is good only if waits are extremely rare; with any frequency of waiting, even for very short waits, a mutex is still a lot better.
Now, the mutex in question happens to serialize the buffer cache for one specific page of data, one level down from the root of the index for RDF PSOG. By the luck of the draw, the Ps falling on that page are commonly accessed Ps pertaining to product features. In order to get any product feature value, one must pass via this page. At the smaller scale, the different properties web their different ways based on the index root.
One might here ask why the problem is one level down from the root and not in the root. The index root is already handled specially, so the read-write locks for buffers usually apply only for the first level down. One might also ask why have a mutex in the first place. Well, unless one is read-only and all in memory, there simply must be a way to say that a buffer must not get written to by one thread while another is reading it. Same for cache replacement. Some in-memory people fork a whole copy of the database process to do a large query and so can forget about serialization. But one must have long queries for this and have all in memory. One can make writes less frequent by keeping deltas, but this does not remove the need to merge the deltas at some point, which cannot happen without serializing this with the readers.
Most of the time the offending mutex is acquired for getting a property of a product in Q5, the one that looks for products with similar values of a numeric property. We retrieve this property for a number of products in one go, due to vectoring. Vectoring is supposed to save us from constantly hitting the index tree top when getting the next match. So how come there is contention in the index tree top? As it happens, the vectored index lookup checks for locality only when all search conditions on key parts are equalities. Here however there is equality on P and S and a range on O; hence, the lookup starts from the index root every time.
So I changed this. The effect was Q5 getting over twice as fast, with the single user throughput at 1000 Mt going from 2000 to 5200 QMpH (Query Mixes per Hour) and the 16-user throughput going from 3800 to over 21000 QMpH. The previously "good" throughput of 40K QMpH at 100 Mt went to 66K QMpH.
Vectoring can make a real difference. The throughputs for the same workload on 6 Single, without vectoring, thus unavoidably hitting the page with the crazy contention, are 1770 QMpH single user and 2487 QMpH with 16 users. The 6 Cluster throughput, avoiding the contention but without the increased locality from vectoring and with the increased latency of going out-of-process for most of the data, was about 11.5K QMpH with 16 users. Each partition had a page getting the hits but since the partitioning was on S and S was about-evenly distributed, each partition got 1/8 of the load; thus waiting on the mutex did not become a killer issue.
We see how detailed analysis of benchmarks can lead to almost an order of magnitude improvements in a short time. This analysis is however both difficult and tedious. It is not readily delegable; one needs real knowledge of how things work and of how they ought to work in order to get anywhere with this. Experience tends to show that a competitive situation is needed in order to motivate one to go to the trouble. Unless something really sticks out in an obvious manner, one is most likely not going to look deep enough. Of course, this is seen in applications too but application optimization tends to stop at a point where the application is usable. Also stored procedures and specially-tweaked queries will usually help. In most application scenarios, we are not simultaneously looking at multiple different implementations, except maybe at the start of development but then this falls under benchmarking and evaluation.
So, the usefulness of benchmarks is again confirmed. There is likely great unexplored space for improvement as we move to more interesting and diverse scenarios.
Benchmarks, Redux Series- Benchmarks, Redux (part 1): On RDF Benchmarks
- Benchmarks, Redux (part 2): A Benchmarking Story (this post)
- Benchmarks, Redux (part 3): Virtuoso 7 vs 6 on BSBM Load and Explore
- Benchmarks, Redux (part 4): Benchmark Tuning Questionnaire
- Benchmarks, Redux (part 5): BSBM and I/O; HDDs and SSDs
- Benchmarks, Redux (part 6): BSBM and I/O, continued
- Benchmarks, Redux (part 7): What Does BSBM Explore Measure?
- Benchmarks, Redux (part 8): BSBM Explore and Update
- Benchmarks, Redux (part 9): BSBM With Cluster
- Benchmarks, Redux (part 10): LOD2 and the Benchmark Process
- Benchmarks, Redux (part 11): The Substance of Benchmarks
- Benchmarks, Redux (part 12): Our Own BSBM Results Report
-
-
20:20
Benchmarks, Redux (part 1 of 4)
» OpenLink Community BlogThis post introduces a series on RDF benchmarking. In these posts I will cover the following:
-
Correct misleading information about us in the recent Berlin report: The load rate is off-the wall and the update mix is missing. We supply the right numbers and explain how to load things so that one gets decent performance.
-
Discuss configuration options for Virtuoso.
-
Tell a story about multithreading and its perils and how vectoring and scale-out can save us.
-
Analyze the run time behavior of Virtuoso 6 Single, 6 Cluster, and 7 Single.
-
Look at the benefits of SSDs (solid-state storage devices) over HDDs (hard disk devices; spinning platters), and I/O matters in general.
-
Talk in general about modalities of benchmark running, and how to reconcile vendors doing what they know best with the air of legitimacy of a third party. Whether to do things a la TPC or a la TREC? We will hopefully try a bit of both, at least so I have proposed to our partners in LOD2, the EU FP7 that also funded the recent Berlin report.
-
Outline the desiderata for an RDF benchmark that is not just an RDF-ized relational workload, the Social Intelligence Benchmark.
-
Talk about BSBM in specific. What does it measure?
-
Discuss some experiments with the BI use case of BSBM.
-
Document how the results mentioned here were obtained and suggest practices for benchmark running and disclosure.
The background is that the LOD2 FP7 project is supposed to deliver a report about the state of the art and benchmark laboratory by March 1. The Berlin report is a part thereof. In the project proposal we talk about an ongoing benchmarking activity and about having up-to-date installations of the relevant RDF stores and RDBMS.
Since this is taxpayer money for supposedly the common good, I see no reason why such a useful thing should be restricted to the project participants. On the other hand, running a display window of stuff for benchmarking, when in at least in some cases licenses prohibit unauthorized publishing of benchmark results might be seen to conflict with the spirit of the license if not its letter. We will see.
For now, my take is that we want to run benchmarks of all interesting software, inviting the vendors to tell us how to do that if they will, and maybe even letting them perform those runs themselves. Then we promise not to disclose results without the vendor's permission. Access to the installations is limited to whoever operates the equipment. Configuration files and detailed hardware specs and such on the other hand will be made public. If a run is published, it will be with permission and in a format that includes full information for replicating the experiment.
In the LOD2 proposal we also in so many words say that we will stretch the limits of the state of the art. This stretching is surely not limited to the project's own products but should also include the general benchmarking aspect. I will say with confidence that running single server benchmarks at a max 200 Mtriples of data is not stretching anything.
So to ameliorate this situation, I thought to run the same at 10x the scale on a couple of large boxes we have access to. 1 and 2 billion triples are still comfortably single server scales. Then we could go for example to Giovanni's cluster at DERI and do 10 and 20 billion triples, this should fly reasonably on 8 or 16 nodes of the DERI gear. Or we might talk to SEALS who by now should have their own lab. Even Amazon EC2 might be an option, although not the preferred one.
So I asked everybody about config instructions, which produced a certain amount of dismay as I might be said to be biased and to be skirting the edges of conflict of interest. The inquiry was not altogether negative though since Ontotext and Garlik provided some information. We will look into these this and next week. We will not publish any information without asking first.
In this series of posts I will only talk about OpenLink Software.
Benchmarks, Redux Series- Benchmarks, Redux (part 1): On RDF Benchmarks (this post)
- Benchmarks, Redux (part 2): A Benchmarking Story
- Benchmarks, Redux (part 3): Virtuoso 7 vs 6 on BSBM Load and Explore
- Benchmarks, Redux (part 4): Benchmark Tuning Questionnaire
- Benchmarks, Redux (part 5): BSBM and I/O; HDDs and SSDs
- Benchmarks, Redux (part 6): BSBM and I/O, continued
- Benchmarks, Redux (part 7): What Does BSBM Explore Measure?
- Benchmarks, Redux (part 8): BSBM Explore and Update
- Benchmarks, Redux (part 9): BSBM With Cluster
- Benchmarks, Redux (part 10): LOD2 and the Benchmark Process
- Benchmarks, Redux (part 11): The Substance of Benchmarks
- Benchmarks, Redux (part 12): Our Own BSBM Results Report
-
-
20:20
Benchmarks, Redux (part 1 of 4)
» OpenLink Community BlogThis post introduces a series on RDF benchmarking. In these posts I will cover the following:
-
Correct misleading information about us in the recent Berlin report: The load rate is off-the wall and the update mix is missing. We supply the right numbers and explain how to load things so that one gets decent performance.
-
Discuss configuration options for Virtuoso.
-
Tell a story about multithreading and its perils and how vectoring and scale-out can save us.
-
Analyze the run time behavior of Virtuoso 6 Single, 6 Cluster, and 7 Single.
-
Look at the benefits of SSDs (solid-state storage devices) over HDDs (hard disk devices; spinning platters), and I/O matters in general.
-
Talk in general about modalities of benchmark running, and how to reconcile vendors doing what they know best with the air of legitimacy of a third party. Whether to do things a la TPC or a la TREC? We will hopefully try a bit of both, at least so I have proposed to our partners in LOD2, the EU FP7 that also funded the recent Berlin report.
-
Outline the desiderata for an RDF benchmark that is not just an RDF-ized relational workload, the Social Intelligence Benchmark.
-
Talk about BSBM in specific. What does it measure?
-
Discuss some experiments with the BI use case of BSBM.
-
Document how the results mentioned here were obtained and suggest practices for benchmark running and disclosure.
The background is that the LOD2 FP7 project is supposed to deliver a report about the state of the art and benchmark laboratory by March 1. The Berlin report is a part thereof. In the project proposal we talk about an ongoing benchmarking activity and about having up-to-date installations of the relevant RDF stores and RDBMS.
Since this is taxpayer money for supposedly the common good, I see no reason why such a useful thing should be restricted to the project participants. On the other hand, running a display window of stuff for benchmarking, when in at least in some cases licenses prohibit unauthorized publishing of benchmark results might be seen to conflict with the spirit of the license if not its letter. We will see.
For now, my take is that we want to run benchmarks of all interesting software, inviting the vendors to tell us how to do that if they will, and maybe even letting them perform those runs themselves. Then we promise not to disclose results without the vendor's permission. Access to the installations is limited to whoever operates the equipment. Configuration files and detailed hardware specs and such on the other hand will be made public. If a run is published, it will be with permission and in a format that includes full information for replicating the experiment.
In the LOD2 proposal we also in so many words say that we will stretch the limits of the state of the art. This stretching is surely not limited to the project's own products but should also include the general benchmarking aspect. I will say with confidence that running single server benchmarks at a max 200 Mtriples of data is not stretching anything.
So to ameliorate this situation, I thought to run the same at 10x the scale on a couple of large boxes we have access to. 1 and 2 billion triples are still comfortably single server scales. Then we could go for example to Giovanni's cluster at DERI and do 10 and 20 billion triples, this should fly reasonably on 8 or 16 nodes of the DERI gear. Or we might talk to SEALS who by now should have their own lab. Even Amazon EC2 might be an option, although not the preferred one.
So I asked everybody about config instructions, which produced a certain amount of dismay as I might be said to be biased and to be skirting the edges of conflict of interest. The inquiry was not altogether negative though since Ontotext and Garlik provided some information. We will look into these this and next week. We will not publish any information without asking first.
In this series of posts I will only talk about OpenLink Software.
Benchmarks, Redux Series- Benchmarks, Redux (part 1): On RDF Benchmarks (this post)
- Benchmarks, Redux (part 2): A Benchmarking Story
- Benchmarks, Redux (part 3): Virtuoso 7 vs 6 on BSBM Load and Explore
- Benchmarks, Redux (part 4): Benchmark Tuning Questionnaire
- Benchmarks, Redux (part 5): BSBM and I/O; HDDs and SSDs
- Benchmarks, Redux (part 6): BSBM and I/O, continued
- Benchmarks, Redux (part 7): What Does BSBM Explore Measure?
- Benchmarks, Redux (part 8): BSBM Explore and Update
- Benchmarks, Redux (part 9): BSBM With Cluster
- Benchmarks, Redux (part 10): LOD2 and the Benchmark Process
- Benchmarks, Redux (part 11): The Substance of Benchmarks
- Benchmarks, Redux (part 12): Our Own BSBM Results Report
-
-
1:20
New Preconfigured Virtuoso AMI for Amazon EC2 Cloud comprised of Linked Data from BBC & DBpedia
» OpenLink Community BlogWhat?Introducing a new preloaded and preconfigured Virtuoso (Cluster Edition) AMI for the Amazon EC2 Cloud that hosts combined Linked Datasets from:
Why?Predictably instantiate a powerful database with high quality data and cross links within minutes, for personal or service specific use.
How?Simply follow the instructions in our Amazon EC2 guide for the BBC + DBpedia 3.6 Linked Dataset guide.
Your installation steps are as follows:
- Instantiate a Virtuoso EC2 AMI
- Mount the Amazon Elastic Block Storage (EBS) snapshot that hosts the preloaded Virtuoso Database.
- BBC Music Linked Dataset Snapshot -- PivotViewer Page Screenshot
- BBC Programmes Linked Dataset Snapshot -- -- PivotViewer Page Screenshot
- BBC Nature Linked Dataset Snapshot -- PivotViewer Page Screenshot
- BBC Food Recipes Snapshot -- PivotViewer Page Screenshot
- My Del.icio.us bookmark collection re. BBC Linked Data Demos
- Amazon EC2 Snapshots for DBpedia 3.6 + BBC combo -- delivers the BBC and DBpedia dataset combo via a mountable Elastic Block Storage (EBS) device usable with an Amazon Machine Image (AMI)
- Amazon EC2 Snapshots for DBpedia 3.6 & 3.5
- Virtuoso Commercial Edition Download Page
- Virtuoso Cluster Edition Guide
-
22:15
DBpedia + BBC (combined) Linked Data Space Installation Guide
» OpenLink Community BlogWhat?The DBpedia + BBC Combo Linked Dataset is a preconfigured Virtuoso Cluster (4 Virtuoso Cluster Nodes, each comprised of one Virtuoso Instance; initial deployment is to a single Cluster Host, but license may be converted for physically distributed deployment), available via the Amazon EC2 Cloud, preloaded with the following datasets:
Why?The BBC has been publishing Linked Data from its Web Data Space for a number of years. In line with best practices for injecting Linked Data into the World Wide Web (Web), the BBC datasets are interlinked with other datasets such as DBpedia and MusicBrainz.
Typical follow-your-nose exploration using a Web Browser (or even via sophisticated SPARQL query crawls) isn't always practical once you get past the initial euphoria that comes from comprehending the Linked Data concept. As your queries get more complex, the overhead of remote sub-queries increases its impact, until query results take so long to return that you simply give up.
Thus, maximizing the effects of the BBC's efforts requires Linked Data that shares locality in a Web-accessible Data Space — i.e., where all Linked Data sets have been loaded into the same data store or warehouse. This holds true even when leveraging SPARQL-FED style virtualization — there's always a need to localize data as part of any marginally-decent locality-aware cost-optimization algorithm.
This DBpedia + BBC dataset, exposed via a preloaded and preconfigured Virtuoso Cluster, delivers a practical point of presence on the Web for immediate and cost-effective exploitation of Linked Data at the individual and/or service specific levels.
How? To work through this guide, you'll need to start with 90 GB of free disk space. (Only 41 GB will be consumed after you delete the installer archives, but starting with 90+ GB ensures enough work space for the installation.) Install Virtuoso-
Download Virtuoso installer archive(s). You must deploy the Personal or Enterprise Edition; the Open Source Edition does not support Shared-Nothing Cluster Deployment.
-
Set key environment variables and start the OpenLink License Manager, using command (this may vary depending on your shell and install directory):
. /opt/virtuoso/virtuoso-enterprise.sh -
Optional: To keep the default single-server configuration file and demo database intact, set the
VIRTUOSO_HOMEenvironment variable to a different directory, e.g.,export VIRTUOSO_HOME=/opt/virtuoso/cluster-home/Note: You will have to adjust this setting every time you shift between this cluster setup and your single-server setup. Either may be made your environment's default through the
virtuoso-enterprise.shand related scripts. -
Set up your cluster by running the
mkcluster.shscript. Note that initial deployment of the DBpedia + BBC Combo requires a 4 node cluster, which is the default for this script. -
Start the Virtuoso Cluster with this command:
virtuoso-start.sh -
Stop the Virtuoso Cluster with this command:
virtuoso-stop.sh
-
Navigate to your installation directory.
-
Download the combo dataset installer script —
bbc-dbpedia-install.sh. -
For best results, set the downloaded script to fully executable using this command:
chmod 755 bbc-dbpedia-install.sh -
Shut down any Virtuoso instances that may be currently running.
-
Optional: As above, if you have decided to keep the default single-server configuration file and demo database intact, set the
VIRTUOSO_HOMEenvironment variable appropriately, e.g.,export VIRTUOSO_HOME=/opt/virtuoso/cluster-home/ -
Run the combo dataset installer script with this command:
sh bbc-dbpedia-install.sh
The combo dataset typically deploys to EC2 virtual machines in under 90 minutes; your time will vary depending on your network connection speed, machine speed, and other variables.
Once the script completes, perform the following steps:
-
Verify that the Virtuoso Conductor [HTTP-based] Admin UI) is in place via:
http://localhost:[port]/conductor -
Verify that the Virtuoso SPARQL endpoint is in place via:
http://localhost:[port]/sparql -
Verify that the Precision Search & Find UI is in place via:
http://localhost:[port]/fct -
Verify that the Virtuoso hosted PivotViewer is in place via:
http://localhost:[port]/PivotViewer
- BBC Music Linked Dataset Snapshot -- PivotViewer Page Screenshot
- BBC Programmes Linked Dataset Snapshot -- -- PivotViewer Page Screenshot
- BBC Nature Linked Dataset Snapshot -- PivotViewer Page Screenshot
- BBC Food Recipes Snapshot -- PivotViewer Page Screenshot
- My Del.icio.us bookmark collection re. BBC Linked Data Demos
- Amazon EC2 Snapshots for DBpedia 3.6 + BBC combo -- delivers the BBC and DBpedia dataset combo via a mountable Elastic Block Storage (EBS) device usable with an Amazon Machine Image (AMI)
- Amazon EC2 Snapshots for DBpedia 3.6 & 3.5
- Virtuoso Commercial Edition Download Page
- Virtuoso Cluster Edition Guide
-
-
16:05
SPARQL Guide for the Perl Developer
» OpenLink Community BlogWhat?A simple guide usable by any Perl developer seeking to exploit SPARQL without hassles.
Why?SPARQL is a powerful query language, results serialization format, and an HTTP based data access protocol from the W3C. It provides a mechanism for accessing and integrating data across Deductive Database Systems (colloquially referred to as triple or quad stores in Semantic Web and Linked Data circles) -- database systems (or data spaces) that manage proposition oriented records in 3-tuple (triples) or 4-tuple (quads) form.
How?SPARQL queries are actually HTTP payloads (typically). Thus, using a RESTful client-server interaction pattern, you can dispatch calls to a SPARQL compliant data server and receive a payload for local processing.
Steps:- Determine which SPARQL endpoint you want to access e.g. DBpedia or a local Virtuoso instance (typically: [localhost:8890]
- If using Virtuoso, and you want to populate its quad store using SPARQL, assign "SPARQL_SPONGE" privileges to user "SPARQL" (this is basic control, more sophisticated WebID based ACLs are available for controlling SPARQL access).
# # Demonstrating use of a single query to populate a # Virtuoso Quad Store via Perl. # # # HTTP URL is constructed accordingly with CSV query results format as the default via mime type. # use CGI qw/:standard/; use LWP::UserAgent; use Data::Dumper; use Text::CSV_XS; sub sparqlQuery(@args) { my $query=shift; my $baseURL=shift; my $format=shift; %params=( "default-graph" => "", "should-sponge" => "soft", "query" => $query, "debug" => "on", "timeout" => "", "format" => $format, "save" => "display", "fname" => "" ); @fragments=(); foreach $k (keys %params) { $fragment="$k=".CGI::escape($params{$k}); push(@fragments,$fragment); } $query=join("&", @fragments); $sparqlURL="${baseURL}?$query"; my $ua = LWP::UserAgent->new; $ua->agent("MyApp/0.1 "); my $req = [HTTP::Request->new(GET] => $sparqlURL); my $res = $ua->request($req); $str=$res->content; $csv = Text::CSV_XS->new(); foreach $line ( split(/^/, $str) ) { $csv->parse($line); @bits=$csv->fields(); push(@rows, [ @bits ] ); } return @rows; } # Setting Data Source Name (DSN) $dsn=" [dbpedia.org] # Virtuoso pragmas for instructing SPARQL engine to perform an HTTP GET using the IRI in # FROM clause as Data Source URL en route to DBMS # record Inserts. $query="DEFINE get:soft "replace"n # Generic (non Virtuoso specific SPARQL # Note: this will not add records to the # DBMS SELECT DISTINCT * FROM <$dsn> WHERE {?s ?p ?o}"; $data=sparqlQuery($query, " [localhost:8890] "text/csv"); print "Retrieved data:n"; print Dumper($data);
OutputRetrieved data: $VAR1 = [ [ 's', 'p', 'o' ], [ ' [dbpedia.org] ' [www.w3.org] ' [www.w3.org] ], [ ' [dbpedia.org] ' [www.w3.org] ' [dbpedia.org] ], [ ' [dbpedia.org] ' [www.w3.org] ' [dbpedia.org] ], ...ConclusionCSV was chosen over XML (re. output format) since this is about a "no-brainer installation and utilization" guide for a Perl developer that already knows how to use Perl for HTTP based data access within HTML. SPARQL just provides an added bonus to URL dexterity (delivered via URI abstraction) with regards to constructing Data Source Names or Addresses.
Related- RDF::Query::Client Guide
- SPARQL Guide for the Perl Developer
- SPARQL Guide for the PHP Developer
- SPARQL Guide for the Python Developer
- SPARQL Guide for the Ruby Developer
- Simple Guide for using SPARQL with Virtuoso
- General SPARQL Tutorial Collection
- Virtuoso Specific SPARQL Tutorial Collection
- The URI, URL, and Linked Data Meme's Generic HTTP URI.
-
1:08
Virtuoso + DBpedia 3.6 Installation Guide
» OpenLink Community BlogWhat is DBpedia?DBpedia is a community effort to provide a contemporary deductive database derived from Wikipedia content. Project contributions can be partitioned as follows:
- Ontology Construction and Maintenance
- Dataset Generation via Wikipedia Content Extraction & Transformation
- Live Database Maintenance & Administration -- includes actual Linked Data loading and publishing, provision of SPARQL endpoint, and traditional DBA activity
- Internationalization.
Comprising the nucleus of the Linked Open Data effort, DBpedia also serves as a fulcrum for the burgeoning Web of Linked Data by delivering a dense and highly-interlinked lookup database. In its most basic form, DBpedia is a great source of strong and resolvable identifiers for People, Places, Organizations, Subject Matter, and many other data items of interest. Naturally, it provides a fantastic starting point for comprehending the fundamental concepts underlying TimBL's initial Linked Data meme.
How do I use DBpedia?Depending on your particular requirements, whether personal or service-specific, DBpedia offers the following:
- Datasets that can be loaded on your deductive database (also known as triple or quad stores) platform of choice
- Live browsable HTML+RDFa based entity description pages
- A wide variety of data formats for importing entity description data into a broad range of existing applications and services
- A SPARQL endpoint allowing ad-hoc querying over HTTP using the SPARQL query language, and delivering results serialized in a variety of formats
- A broad variety of tools covering query by example, faceted browsing, full text search, entity name lookups, etc.
OpenLink Software has preloaded the DBpedia 3.6 datasets into a preconfigured Virtuoso Cluster Edition database, and made the package available for easy installation.
Why is the DBpedia+Virtuoso package important?The DBpedia+Virtuoso package provides a cost-effective option for personal or service-specific incarnations of DBpedia.
For instance, you may have a service that isn't best-served by competing with the rest of the world for ad-hoc query time and resources on the live instance, which itself operates under various restrictions which enable this ad-hoc query service to be provided at Web Scale.
Now you can easily commission your own instance and quickly exploit DBpedia and Virtuoso's database feature set to the max, powered by your own hardware and network infrastructure.
How do I use the DBpedia+Virtuoso package?Pre-requisites are simply:
- Functional Virtuoso Cluster Edition installation.
- Virtuoso Cluster Edition License.
- 90 GB of free disk space -- you ultimately only need 43 gigs, but this our recommended free disk space size pre installation completion.
To install the Virtuoso Cluster Edition simply perform the following steps:
- Download Software.
- Run installer
-
Set key environment variables and start the OpenLink License Manager, using command (this may vary depending on your shell):
. /opt/virtuoso/virtuoso-enterprise.sh -
Run the
mkcluster.shscript which defaults to a 4 node cluster -
Set
VIRTUOSO_HOMEenvironment variable -- if you want to start cluster databases distinct from single server databases via distinct root directory for database files (one that isn't adjacent to single-server database directories) -
Start Virtuoso Cluster Edition instances using command:
virtuoso-start.sh -
Stop Virtuoso Cluster Edition instances using command:
virtuoso-stop.sh
To install your personal or service specific edition of DBpedia simply perform the following steps:
- Navigate to your installation directory
-
Download Installer script (
dbpedia-install.sh) -
Set execution mode on script using command:
chmod 755 dbpedia-install.sh - Shutdown any Virtuoso instances that may be currently running
-
Set your
VIRTUOSO_HOMEenvironment variable, e.g., to the current directory, via command (this may vary depending on your shell):export VIRTUOSO_HOME=`pwd` -
Run script using command:
sh dbpedia-install.sh
Once the installation completes (approximately 1 hour and 30 minutes from start time), perform the following steps:
-
Verify that the Virtuoso Conductor (HTML based Admin UI) is in place via:
http://localhost:[port]/conductor -
Verify that the Precision Search & Find UI is in place via:
http://localhost:[port]/fct - Verify that DBpedia's Green Entity Description Pages are in place via:
http://localhost:[port]/resource/DBpedia
-
19:59
SPARQL guide for the Javascript Developer
» OpenLink Community BlogWhat?A simple guide usable by any Javascript developer seeking to exploit SPARQL without hassles.
Why?SPARQL is a powerful query language, results serialization format, and an HTTP based data access protocol from the W3C. It provides a mechanism for accessing and integrating data across Deductive Database Systems (colloquially referred to as triple or quad stores in Semantic Web and Linked Data circles) -- database systems (or data spaces) that manage proposition oriented records in 3-tuple (triples) or 4-tuple (quads) form.
How?SPARQL queries are actually HTTP payloads (typically). Thus, using a RESTful client-server interaction pattern, you can dispatch calls to a SPARQL compliant data server and receive a payload for local processing.
Steps:- Determine which SPARQL endpoint you want to access e.g. DBpedia or a local Virtuoso instance (typically: [localhost:8890]
- If using Virtuoso, and you want to populate its quad store using SPARQL, assign "SPARQL_SPONGE" privileges to user "SPARQL" (this is basic control, more sophisticated WebID based ACLs are available for controlling SPARQL access).
/* Demonstrating use of a single query to populate a # Virtuoso Quad Store via Javascript. */ /* HTTP URL is constructed accordingly with JSON query results format as the default via mime type. */ function sparqlQuery(query, baseURL, format) { if(!format) format="application/json"; var params={ "default-graph": "", "should-sponge": "soft", "query": query, "debug": "on", "timeout": "", "format": format, "save": "display", "fname": "" }; var querypart=""; for(var k in params) { querypart+=k+"="+encodeURIComponent(params[k])+"&"; } var queryURL=baseURL + '?' + querypart; if (window.XM [HttpRequest)] { xm [http=new] XM [HttpRequest();] } else { xm [http=new] ActiveXObject("Microsoft.XM [HTTP");] } xm [http.open("GET",queryURL,false);] xm [http.send();] return JSON.parse(xm [http.responseText);] } /* setting Data Source Name (DSN) */ var dsn=" [dbpedia.org] /* Virtuoso pragma "DEFINE get:soft "replace" instructs Virtuoso SPARQL engine to perform an HTTP GET using the IRI in FROM clause as Data Source URL with regards to DBMS record inserts */ var query="DEFINE get:soft "replace"nSELECT DISTINCT * FROM <"+dsn+"> WHERE {?s ?p ?o}"; var data=sparqlQuery(query, "/sparql/");
OutputPlace the snippet above into the <script/> section of an HTML document to see the query result.
ConclusionJSON was chosen over XML (re. output format) since this is about a "no-brainer installation and utilization" guide for a Javascript developer that already knows how to use Javascript for HTTP based data access within HTML. SPARQL just provides an added bonus to URL dexterity (delivered via URI abstraction) with regards to constructing Data Source Names or Addresses.
Related -
21:25
SPARQL Guide for the PHP Developer
» OpenLink Community BlogWhat?A simple guide usable by any PHP developer seeking to exploit SPARQL without hassles.
Why?SPARQL is a powerful query language, results serialization format, and an HTTP based data access protocol from the W3C. It provides a mechanism for accessing and integrating data across Deductive Database Systems (colloquially referred to as triple or quad stores in Semantic Web and Linked Data circles) -- database systems (or data spaces) that manage proposition oriented records in 3-tuple (triples) or 4-tuple (quads) form.
How?SPARQL queries are actually HTTP payloads (typically). Thus, using a RESTful client-server interaction pattern, you can dispatch calls to a SPARQL compliant data server and receive a payload for local processing e.g. local object binding re. PHP.
Steps:- From your command line execute: aptitude search '^PHP26', to verify PHP is in place
- Determine which SPARQL endpoint you want to access e.g. DBpedia or a local Virtuoso instance (typically: [localhost:8890]
- If using Virtuoso, and you want to populate its quad store using SPARQL, assign "SPARQL_SPONGE" privileges to user "SPARQL" (this is basic control, more sophisticated WebID based ACLs are available for controlling SPARQL access).
#!/usr/bin/env php <?php # # Demonstrating use of a single query to populate a # Virtuoso Quad Store via PHP. # # HTTP URL is constructed accordingly with JSON query results format in mind. function sparqlQuery($query, $baseURL, $format="application/json") { $params=array( "default-graph" => "", "should-sponge" => "soft", "query" => $query, "debug" => "on", "timeout" => "", "format" => $format, "save" => "display", "fname" => "" ); $querypart="?"; foreach($params as $name => $value) { $querypart=$querypart . $name . '=' . urlencode($value) . "&"; } $sparqlURL=$baseURL . $querypart; return json_decode(file_get_contents($sparqlURL)); }; # Setting Data Source Name (DSN) $dsn=" [dbpedia.org] #Virtuoso pragmas for instructing SPARQL engine to perform an HTTP GET #using the IRI in FROM clause as Data Source URL $query="DEFINE get:soft "replace" SELECT DISTINCT * FROM <$dsn> WHERE {?s ?p ?o}"; $data=sparqlQuery($query, " [localhost:8890] print "Retrieved data:n" . json_encode($data); ?>
OutputRetrieved data: {"head": {"link":[],"vars":["s","p","o"]}, "results": {"distinct":false,"ordered":true, "bindings":[ {"s": {"type":"uri","value":" [dbpedia.org] {"type":"uri","value":" [www.w3.org] {"type":"uri","value":" [www.w3.org] {"s": {"type":"uri","value":" [dbpedia.org] {"type":"uri","value":" [www.w3.org] {"type":"uri","value":" [dbpedia.org] {"s": {"type":"uri","value":" [dbpedia.org] {"type":"uri","value":" [www.w3.org] {"type":"uri","value":" [dbpedia.org] ...ConclusionJSON was chosen over XML (re. output format) since this is about a "no-brainer installation and utilization" guide for a PHP developer that already knows how to use PHP for HTTP based data access. SPARQL just provides an added bonus to URL dexterity (delivered via URI abstraction) with regards to constructing Data Source Names or Addresses.
Related -
17:13
SPARQL Guide for Python Developer
» OpenLink Community BlogWhat?A simple guide usable by any Python developer seeking to exploit SPARQL without hassles.
Why?SPARQL is a powerful query language, results serialization format, and an HTTP based data access protocol from the W3C. It provides a mechanism for accessing and integrating data across Deductive Database Systems (colloquially referred to as triple or quad stores in Semantic Web and Linked Data circles) -- database systems (or data spaces) that manage proposition oriented records in 3-tuple (triples) or 4-tuple (quads) form.
How?SPARQL queries are actually HTTP payloads (typically). Thus, using a RESTful client-server interaction pattern, you can dispatch calls to a SPARQL compliant data server and receive a payload for local processing e.g. local object binding re. Python.
Steps:- From your command line execute: aptitude search '^python26', to verify Python is in place
- Determine which SPARQL endpoint you want to access e.g. DBpedia or a local Virtuoso instance (typically: [localhost:8890]
- If using Virtuoso, and you want to populate its quad store using SPARQL, assign "SPARQL_SPONGE" privileges to user "SPARQL" (this is basic control, more sophisticated WebID based ACLs are available for controlling SPARQL access).
#!/usr/bin/env python # # Demonstrating use of a single query to populate a # Virtuoso Quad Store via Python. # import urllib, json # HTTP URL is constructed accordingly with JSON query results format in mind. def sparqlQuery(query, baseURL, format="application/json"): params={ "default-graph": "", "should-sponge": "soft", "query": query, "debug": "on", "timeout": "", "format": format, "save": "display", "fname": "" } querypart=urllib.urlencode(params) response = urllib.urlopen(baseURL,querypart).read() return json.loads(response) # Setting Data Source Name (DSN) dsn=" [dbpedia.org] # Virtuoso pragmas for instructing SPARQL engine to perform an HTTP GET # using the IRI in FROM clause as Data Source URL query="""DEFINE get:soft "replace" SELECT DISTINCT * FROM <%s> WHERE {?s ?p ?o}""" % dsn data=sparqlQuery(query, " [localhost:8890] print "Retrieved data:n" + json.dumps(data, sort_keys=True, indent=4) # # End
OutputRetrieved data: { "head": { "link": [], "vars": [ "s", "p", "o" ] }, "results": { "bindings": [ { "o": { "type": "uri", "value": " [www.w3.org] }, "p": { "type": "uri", "value": " [www.w3.org] }, "s": { "type": "uri", "value": " [dbpedia.org] } }, ...ConclusionJSON was chosen over XML (re. output format) since this is about a "no-brainer installation and utilization" guide for a Python developer that already knows how to use Python for HTTP based data access. SPARQL just provides an added bonus to URL dexterity (delivered via URI abstraction) with regards to constructing Data Source Names or Addresses.
Related -
16:29
Virtuoso Directions for 2011
» OpenLink Community BlogAt the start of 2010, I wrote that 2010 would be the year when RDF became performance- and cost-competitive with relational technology for data warehousing and analytics. More specifically, RDF would shine where data was heterogenous and/or where there was a high frequency of schema change.
I will now discuss what we have done towards this end in 2010 and how you will gain by this in 2011.
At the start of 2010, we had internally demonstrated 4x space efficiency gains from column-wise compression and 3x loop join speed gains from vectored execution. To recap, column-wise compression means a column-wise storage layout where values of consecutive rows of a single column are consecutive in memory/disk and are compressed in a manner that benefits from the homogenous data type and possible sort order of the column. Vectored execution means passing large numbers of query variable bindings between query operators and possibly sorting inputs to joins for improving locality. Furthermore, always operating on large sets of values gives extra opportunities for parallelism, from instruction level to threads to scale out.
So, during 2010, we integrated these technologies into Virtuoso, for relational- and graph-based applications alike. Further, even if we say that RDF will be close to relational speed in Virtuoso, the point is moot if Virtuoso's relational speed is not up there with the best of analytics-oriented RDBMS. RDF performance does rest on the basis of general-purpose database performance; what is sauce for the goose is sauce for the gander. So we reimplemented
HASH JOINandGROUP BY, and fine-tuned many of the tricks required by TPC-H. TPC-H is not the sole final destination, but it is a step on the way and a valuable checklist for what a database ought to do.At the Semdata workshop of VLDB 2010 we presented some results of our column store applied to RDF and relational tasks. As noted in the paper, the implementation did demonstrate significant gains over the previous row-wise architecture but was not yet well optimized, so not ready to be compared with the best of the relational analytics world. A good part of the fall of 2010 went into optimizing the column store and completing functionality such as transaction support with columns.
A lot of this work is not specifically RDF oriented, but all of this work is constantly informed by the specific requirements of RDF. For example, the general idea of vectored execution is to eliminate overheads and optimize CPU cache and other locality by doing single query operations on arrays of operands so that the whole batch runs more or less in CPU cache. Are the gains not lost if data is typed at run time, as in RDF? In fact, the cost of run-time-typing turns out to be small, since data in practice tends to be of homogenous type and with locality of reference in values. Virtuoso's column store implementation resembles in broad outline other column stores like Vertica or VectorWise, the main difference being the built-in support for run-time heterogenous types.
The LOD2 EU FP 7 project started in September 2010. In this project OpenLink and the celebrated heroes of the column store, CWI of MonetDB and VectorWise fame, represent the database side.
The first database task of LOD2 is making a survey of the state of the art and a round of benchmarking of RDF stores. The Berlin SPARQL Benchmark (BSBM) has accordingly evolved to include a business intelligence section and an update stream. Initial results from running these will become available in February/March, 2011. The specifics of this process merit another post; let it for now be said that benchmarking is making progress. In the end, it is our conviction that we need a situation where vendors may publish results as and when they are available and where there exists a well defined process for documenting and checking results.
LOD2 will continue by linking the universe, as I half-facetiously put it on a presentation slide. This means alignment of anything from schema to instance identifiers, with and without supervision, and always with provenance, summarization, visualization, and so forth. In fact, putting it this way, this gets to sound like the old chimera of generating applications from data or allowing users to derive actionable intelligence from data of which they do not even know the structure. No, we are not that unrealistic. But we are moving toward more ad-hoc discovery and faster time to answer. And since we provide an infrastructure element under all this, we want to do away with the "RDF tax," by which we mean any significant extra cost of RDF compared to an alternate technology. To put it another way, you ought to pay for unpredictable heterogeneity or complex inference only when you actually use them, not as a fixed up-front overhead.
So much for promises. When will you see something? It is safe to say that we cannot very well publish benchmarks of systems that are not generally available in some form. This places an initial technology preview cut of Virtuoso 7 with vectored execution somewhere in January or early February. The column store feature will be built in, but more than likely the row-wise compressed RDF format of Virtuoso 6 will still be the default. Version 6 and 7 databases will be interchangeable unless column-store structures are used.
For now, our priority is to release the substantial gains that have already been accomplished.
After an initial preview cut, we will return to the agenda of making sure Virtuoso is up there with the best in relational analytics, and that the equivalent workload with an RDF data model runs as close as possible to relational performance. As a first step this means taking TPC-H as is, and then converting the data and queries to the trivially equivalent RDF and SPARQL and seeing how it goes. In the September paper we dabbled a little with the data at a small scale but now we must run the full set of queries at 100GB and 300GB scales, which come to about 14 billion and 42 billion triples, respectively. A well done analysis of the issues encountered, covering similarities and dissimilarities of the implementation of the workload as SQL and SPARQL, should make a good VLDB paper.
Database performance is an entirely open-ended quest and the bag of potentially applicable tricks is as good as infinite. Having said this, it seems that the scales comfortably reached in the TPC benchmarks are more than adequate for pretty much anything one is likely to encounter in real world applications involving comparable workloads. Businesses getting over 6 million new order transactions per minute (the high score of TPC-C) or analyzing a warehouse of 60 billion orders shipped to 6 billion customers over 7 years (10000GB or 10TB TPC-H) are not very common if they exist at all.
The real world frontier has moved on. Scaling up the TPC workloads remains a generally useful exercise that continues to contribute to the state of the art but the applications requiring this advance are changing.
Someone once said that for a new technology to become mainstream, it needs to solve a new class of problem. Yes, while it is a preparatory step to run TPC-H translated to SPARQL without dying of overheads, there is little point in doing this in production since SQL is anyway likely better and already known, proven, and deployed.
The new class of problem, as LOD2 sees it, is the matter of web-wide cross-organizational data integration. Web-wide does not necessarily mean crawling the whole web, but does tend to mean running into significant heterogeneity of sources, both in terms of modeling and in terms of usage of more-or-less standard data models. Around this topic we hear two messages. The database people say that inference beyond what you can express in SQL views is theoretically nice but practically not needed; on the other side, we hear that the inference now being standardized in efforts like RIF and OWL is not expressive enough for the real world. As one expert put it, if enterprise data integration in the 1980s was between a few databases, today it is more like between 1000 databases, which makes this matter similar to searching the web. How can one know in such a situation that the data being aggregated is in fact meaningfully aggregate-able?
Add to this the prevalence of unstructured data in the world and the need to mine it for actionable intelligence. Think of combining data from CRM, worldwide media coverage of own and competitive brands, and in-house emails for assessing organizational response to events on the market.
These are the actual use cases for which we need RDF at relational DW performance and scale. This is not limited to RDF and OWL profiles, since we fully believe that inference needs are more diverse. The reason why this is RDF and not SQL plus some extension of Datalog, is the widespread adoption of RDF and linked data as a data publishing format, with all the schema-last and open world aspects that have been there from the start.
Stay tuned for more news later this month!
Related -
16:29
Virtuoso Directions for 2011
» OpenLink Community BlogAt the start of 2010, I wrote that 2010 would be the year when RDF became performance- and cost-competitive with relational technology for data warehousing and analytics. More specifically, RDF would shine where data was heterogenous and/or where there was a high frequency of schema change.
I will now discuss what we have done towards this end in 2010 and how you will gain by this in 2011.
At the start of 2010, we had internally demonstrated 4x space efficiency gains from column-wise compression and 3x loop join speed gains from vectored execution. To recap, column-wise compression means a column-wise storage layout where values of consecutive rows of a single column are consecutive in memory/disk and are compressed in a manner that benefits from the homogenous data type and possible sort order of the column. Vectored execution means passing large numbers of query variable bindings between query operators and possibly sorting inputs to joins for improving locality. Furthermore, always operating on large sets of values gives extra opportunities for parallelism, from instruction level to threads to scale out.
So, during 2010, we integrated these technologies into Virtuoso, for relational- and graph-based applications alike. Further, even if we say that RDF will be close to relational speed in Virtuoso, the point is moot if Virtuoso's relational speed is not up there with the best of analytics-oriented RDBMS. RDF performance does rest on the basis of general-purpose database performance; what is sauce for the goose is sauce for the gander. So we reimplemented
HASH JOINandGROUP BY, and fine-tuned many of the tricks required by TPC-H. TPC-H is not the sole final destination, but it is a step on the way and a valuable checklist for what a database ought to do.At the Semdata workshop of VLDB 2010 we presented some results of our column store applied to RDF and relational tasks. As noted in the paper, the implementation did demonstrate significant gains over the previous row-wise architecture but was not yet well optimized, so not ready to be compared with the best of the relational analytics world. A good part of the fall of 2010 went into optimizing the column store and completing functionality such as transaction support with columns.
A lot of this work is not specifically RDF oriented, but all of this work is constantly informed by the specific requirements of RDF. For example, the general idea of vectored execution is to eliminate overheads and optimize CPU cache and other locality by doing single query operations on arrays of operands so that the whole batch runs more or less in CPU cache. Are the gains not lost if data is typed at run time, as in RDF? In fact, the cost of run-time-typing turns out to be small, since data in practice tends to be of homogenous type and with locality of reference in values. Virtuoso's column store implementation resembles in broad outline other column stores like Vertica or VectorWise, the main difference being the built-in support for run-time heterogenous types.
The LOD2 EU FP 7 project started in September 2010. In this project OpenLink and the celebrated heroes of the column store, CWI of MonetDB and VectorWise fame, represent the database side.
The first database task of LOD2 is making a survey of the state of the art and a round of benchmarking of RDF stores. The Berlin SPARQL Benchmark (BSBM) has accordingly evolved to include a business intelligence section and an update stream. Initial results from running these will become available in February/March, 2011. The specifics of this process merit another post; let it for now be said that benchmarking is making progress. In the end, it is our conviction that we need a situation where vendors may publish results as and when they are available and where there exists a well defined process for documenting and checking results.
LOD2 will continue by linking the universe, as I half-facetiously put it on a presentation slide. This means alignment of anything from schema to instance identifiers, with and without supervision, and always with provenance, summarization, visualization, and so forth. In fact, putting it this way, this gets to sound like the old chimera of generating applications from data or allowing users to derive actionable intelligence from data of which they do not even know the structure. No, we are not that unrealistic. But we are moving toward more ad-hoc discovery and faster time to answer. And since we provide an infrastructure element under all this, we want to do away with the "RDF tax," by which we mean any significant extra cost of RDF compared to an alternate technology. To put it another way, you ought to pay for unpredictable heterogeneity or complex inference only when you actually use them, not as a fixed up-front overhead.
So much for promises. When will you see something? It is safe to say that we cannot very well publish benchmarks of systems that are not generally available in some form. This places an initial technology preview cut of Virtuoso 7 with vectored execution somewhere in January or early February. The column store feature will be built in, but more than likely the row-wise compressed RDF format of Virtuoso 6 will still be the default. Version 6 and 7 databases will be interchangeable unless column-store structures are used.
For now, our priority is to release the substantial gains that have already been accomplished.
After an initial preview cut, we will return to the agenda of making sure Virtuoso is up there with the best in relational analytics, and that the equivalent workload with an RDF data model runs as close as possible to relational performance. As a first step this means taking TPC-H as is, and then converting the data and queries to the trivially equivalent RDF and SPARQL and seeing how it goes. In the September paper we dabbled a little with the data at a small scale but now we must run the full set of queries at 100GB and 300GB scales, which come to about 14 billion and 42 billion triples, respectively. A well done analysis of the issues encountered, covering similarities and dissimilarities of the implementation of the workload as SQL and SPARQL, should make a good VLDB paper.
Database performance is an entirely open-ended quest and the bag of potentially applicable tricks is as good as infinite. Having said this, it seems that the scales comfortably reached in the TPC benchmarks are more than adequate for pretty much anything one is likely to encounter in real world applications involving comparable workloads. Businesses getting over 6 million new order transactions per minute (the high score of TPC-C) or analyzing a warehouse of 60 billion orders shipped to 6 billion customers over 7 years (10000GB or 10TB TPC-H) are not very common if they exist at all.
The real world frontier has moved on. Scaling up the TPC workloads remains a generally useful exercise that continues to contribute to the state of the art but the applications requiring this advance are changing.
Someone once said that for a new technology to become mainstream, it needs to solve a new class of problem. Yes, while it is a preparatory step to run TPC-H translated to SPARQL without dying of overheads, there is little point in doing this in production since SQL is anyway likely better and already known, proven, and deployed.
The new class of problem, as LOD2 sees it, is the matter of web-wide cross-organizational data integration. Web-wide does not necessarily mean crawling the whole web, but does tend to mean running into significant heterogeneity of sources, both in terms of modeling and in terms of usage of more-or-less standard data models. Around this topic we hear two messages. The database people say that inference beyond what you can express in SQL views is theoretically nice but practically not needed; on the other side, we hear that the inference now being standardized in efforts like RIF and OWL is not expressive enough for the real world. As one expert put it, if enterprise data integration in the 1980s was between a few databases, today it is more like between 1000 databases, which makes this matter similar to searching the web. How can one know in such a situation that the data being aggregated is in fact meaningfully aggregate-able?
Add to this the prevalence of unstructured data in the world and the need to mine it for actionable intelligence. Think of combining data from CRM, worldwide media coverage of own and competitive brands, and in-house emails for assessing organizational response to events on the market.
These are the actual use cases for which we need RDF at relational DW performance and scale. This is not limited to RDF and OWL profiles, since we fully believe that inference needs are more diverse. The reason why this is RDF and not SQL plus some extension of Datalog, is the widespread adoption of RDF and linked data as a data publishing format, with all the schema-last and open world aspects that have been there from the start.
Stay tuned for more news later this month!
Related -
19:48
SPARQL for the Ruby Developer
» OpenLink Community BlogWhat?A simple guide usable by any Ruby developer seeking to exploit SPARQL without hassles.
Why?SPARQL is a powerful query language, results serialization format, and an HTTP based data access protocol from the W3C. It provides a mechanism for accessing and integrating data across Deductive Database Systems (colloquially referred to as triple or quad stores in Semantic Web and Linked Data circles) -- database systems (or data spaces) that manage proposition oriented records in 3-tuple (triples) or 4-tuple (quads) form.
How?SPARQL queries are actually HTTP payloads (typically). Thus, using a RESTful client-server interaction pattern, you can dispatch calls to a SPARQL compliant data server and receive a payload for local processing e.g. local object binding re. Ruby.
Steps:- From your command line execute: aptitude search '^ruby', to verify Ruby is in place
- Determine which SPARQL endpoint you want to access e.g. DBpedia or a local Virtuoso instance (typically: [localhost:8890]
- If using Virtuoso, and you want to populate its quad store using SPARQL, assign "SPARQL_SPONGE" privileges to user "SPARQL" (this is basic control, more sophisticated WebID based ACLs are available for controlling SPARQL access).
#!/usr/bin/env ruby # # Demonstrating use of a single query to populate a # Virtuoso Quad Store. # require 'net/http' require 'cgi' require 'csv' # # We opt for CSV based output since handling this format is straightforward in Ruby, by default. # HTTP URL is constructed accordingly with CSV as query results format in mind. def sparqlQuery(query, baseURL, format="text/csv") params={ "default-graph" => "", "should-sponge" => "soft", "query" => query, "debug" => "on", "timeout" => "", "format" => format, "save" => "display", "fname" => "" } querypart="" params.each { |k,v| querypart+="#{k}=#{CGI.escape(v)}&" } sparqlURL=baseURL+"?#{querypart}" response = Net: [HTTP.get_response(] URI.parse(sparqlURL)) return CSV::parse(response.body) end # Setting Data Source Name (DSN) dsn=" [dbpedia.org] #Virtuoso pragmas for instructing SPARQL engine to perform an HTTP GET #using the IRI in FROM clause as Data Source URL query="DEFINE get:soft "replace" SELECT DISTINCT * FROM <#{dsn}> WHERE {?s ?p ?o} " #Assume use of local installation of Virtuoso #otherwise you can change URL to that of a public endpoint #for example DBpedia: [dbpedia.org] data=sparqlQuery(query, " [localhost:8890] puts "Got data:" p data # # End
OutputGot data: [["s", "p", "o"], [" [dbpedia.org] " [www.w3.org] " [www.w3.org] [" [dbpedia.org] " [www.w3.org] " [dbpedia.org] [" [dbpedia.org] " [www.w3.org] " [dbpedia.org] ...
ConclusionCSV was chosen over XML (re. output format) since this is about a "no-brainer installation and utilization" guide for a Ruby developer that already knows how to use Ruby for HTTP based data access. SPARQL just provides an added bonus to URL dexterity (delivered via URI abstraction) with regards to constructing Data Source Names or Addresses.
Related -
7:06
Simple Virtuoso Installation & Utilization Guide for SPARQL Users
» OpenLink Community BlogWhat is SPARQL?A declarative query language from the W3C for querying structured propositional data (in the form of 3-tuple [triples] or 4-tuple [quads] records) stored in a deductive database (colloquially referred to as triple or quad stores in Semantic Web and Linked Data parlance).
SPARQL is inherently platform independent. Like SQL, the query language and the backend database engine are distinct. Database clients capture SPARQL queries which are then passed on to compliant backend databases.
Why is it important?Like SQL for relational databases, it provides a powerful mechanism for accessing and joining data across one or more data partitions (named graphs identified by IRIs). The aforementioned capability also enables the construction of sophisticated Views, Reports (HTML or those produced in native form by desktop productivity tools), and data streams for other services.
Unlike SQL, SPARQL includes result serialization formats and an HTTP based wire protocol. Thus, the ubiquity and sophistication of HTTP is integral to SPARQL i.e., client side applications (user agents) only need to be able to perform an HTTP GET against a URL en route to exploiting the power of SPARQL.
How do I use it, generally?- Locate a SPARQL endpoint (DBpedia, LOD Cloud Cache, Data.Gov, URIBurner, others), or;
- Install a SPARQL compliant database server (quad or triple store) on your desktop, workgroup server, data center, or cloud (e.g., Amazon EC2 AMI)
- Start the database server
- Execute SPARQL Queries via the SPARQL endpoint.
What follows is a very simple guide for using SPARQL against your own instance of Virtuoso:
- Software Download and Installation
- Data Loading from Data Sources exposed at Network Addresses (e.g. HTTP URLs) using very simple methods
- Actual SPARQL query execution via SPARQL endpoint.
- Download Virtuoso Open Source or Virtuoso Commercial Editions
- Run installer (if using Commercial edition of Windows Open Source Edition, otherwise follow build guide)
- Follow post-installation guide and verify installation by typing in the command: virtuoso -? (if this fails check you've followed installation and setup steps, then verify environment variables have been set)
- Start the Virtuoso server using the command: virtuoso-start.sh
- Verify you have a connection to the Virtuoso Server via the command: isql localhost (assuming you're using default DB settings) or the command: isql localhost:1112 (assuming demo database) or goto your browser and type in: [<virtuoso-server-host-name>:[port]] (e.g. [localhost:8889] for default DB or [localhost:8890] if using Demo DB)
- Go to SPARQL endpoint which is typically -- [<virtuoso-server-host-name>:[port]]
- Run a quick sample query (since the database always has system data in place): select distinct * where {?s ?p ?o} limit 50 .
- Ensure environment settings are set and functional -- if using Mac OS X or Windows, so you don't have to worry about this, just start and stop your Virtuoso server using native OS services applets
- If using the Open Source Edition, follow the getting started guide -- it covers PATH and startup directory location re. starting and stopping Virtuoso servers.
- Sponging (HTTP GETs against external Data Sources) within SPARQL queries is disabled by default. You can enable this feature by assigning "SPARQL_SPONGE" privileges to user "SPARQL". Note, more sophisticated security exists via WebID based ACLs.
- Identify an RDF based structured data source of interest -- a file that contains 3-tuple / triples available at an address on a public or private HTTP based network
- Determine the Address (URL) of the RDF data source
- Go to your Virtuoso SPARQL endpoint and type in the following SPARQL query: DEFINE GET:SOFT "replace" SELECT DISTINCT * FROM <RDFDataSourceURL> WHERE {?s ?p ?o}
- All the triples in the RDF resource (data source accessed via URL) will be loaded into the Virtuoso Quad Store (using RDF Data Source URL as the internal quad store Named Graph IRI) as part of the SPARQL query processing pipeline.
Note: the data source URL doesn't even have to be RDF based -- which is where the Virtuoso Sponger Middleware comes into play (download and install the VAD installer package first) since it delivers the following features to Virtuoso's SPARQL engine:
- Transformation of data from non RDF data sources (file content, hypermedia resources, web services output etc..) into RDF based 3-tuples (triples)
- Cache Invalidation Scheme Construction -- thus, subsequent queries (without the define get:soft "replace" pragma will not be required bar when you forcefully want to override cache).
- If you have very large data sources like DBpedia etc. from CKAN, simply use our bulk loader .
Public SPARQL endpoints are emerging at an ever increasing rate. Thus, we've setup up a DNS lookup service that provides access to a large number of SPARQL endpoints. Of course, this doesn't cover all existing endpoints, so if our endpoint is missing please ping me.
Here are a collection of commands for using DNS-SD to discover SPARQL endpoints:
- dns-sd -B _sparql._tcp sparql.openlinksw.com -- browse for services instances
- dns-sd -Z _sparql._tcp sparql.openlinksw.com -- output results in Zone File format
- Using HTTP from Ruby -- you can just make SPARQL Protocol URLs re. SPARQL
- Using SPARQL Endpoints via Ruby -- Ruby example using DBpedia endpoint
- Interactive SPARQL Query By Example (QBE) tool -- provides a graphical user interface (as is common in SQL realm re. query building against RDBMS engines) that works with any SPARQL endpoint
- Other methods of loading RDF data into Virtuoso
- Virtuoso Sponger -- architecture and how it turns a wide variety of non RDF data sources into SPARQL accessible data
- Using OpenLink Data Explorer (ODE) to populate Virtuoso -- locate a resource of interest; click on a bookmarklet or use context menus (if using ODE extensions for Firefox, Safari, or Chrome); and you'll have SPARQL accessible data automatically inserted into your Virtuoso instance.
- W3C's SPARQLing Data Access Ingenuity -- an older generic SPARQL introduction post
- Collection of SPARQL Query Examples -- GoodRelations (Product Offers), FOAF (Profiles), SIOC (Data Spaces -- Blogs, Wikis, Bookmarks, Feed Collections, Photo Galleries, Briefcase/DropBox, AddressBook, Calendars, Discussion Forums)
- Collection of Live SPARQL Queries against LOD Cloud Cache -- simple and advanced queries.
-
18:44
Rough draft poem: Document what art thou?
» OpenLink Community BlogI am the Data Container, Disseminator, and Canvas.
I came to be when the cognitive skills of mankind deemed oral history inadequate.
I am transcendent, I take many forms, but my core purpose is constant - Container, Disseminator, and Canvas.
I am dexterous, so I can be blank, partitioned horizontally, horizontally and vertically, and if you get moi excited and I'll show you fractals.
I am accessible in a number of ways, across a plethora of media.
I am loose, so you can access my content too.
I am loose in a cool way, so you can refer to moi independent of my content.
I am cool in a loose way, so you can refer to my content independent of moi.
I am even cool and loose enough to let you figure out stuff from my content including how its totally distinct from moi.
But...
I am possessive about my coolness, so all Containment, Dissemination, and Canvas requirements must first call upon moi, wherever I might be.
So...
If you postulate about my demise or irrelevance, across any medium, I will punish you with confusion!
Remember...
I just told you who I am.
Lesson to be learned..
When something tells you what it is, and it is as powerful as I, best you believe it.
BTW -- I am Okay with HTTP response code 200 OK :-) -
21:43
7 Things Brought to You by HTTP-based Hypermedia
» OpenLink Community BlogThere are some very powerful benefits that accrue from the use of HTTP based Hypermedia. 7 that come to mind immediately include:
- Structured & Platform Independent Enterprise Data Virtualization -- concrete conceptual level access and provisioning of abstract domain entities such as Customers, Orders, Employees, Products, Countries, Competitors etc.
- Distributed Application State (REST) -- application state transitions via links
- Structured Data Representation (Linked Data) -- whole data data representation via links
- Structured Identity (WebID) -- verifiable distributed identity
- Structured Profiles (FOAF) -- platform independent profiles for people and organizations
- Articulation of Structured Value Propositions (GoodRelations) -- Product & Service Offers, Business Entities, Locations, Business Hours, etc.
- Structured Collaboration Spaces (SIOC) -- Blogs, Wikis, File Sharing, Discussion Forums, Aggregated Feeds, Statuses, Photo Galleries, Polls etc.
-
17:02
6 Things That Must Remain Distinct re. Data
» OpenLink Community BlogConflation is the tech industry's equivalent of macroeconomic inflation. Whenever it rears it head, we lose value courtesy of diminishing productivity.
Looking retrospectively at any technology failure -- enterprises or industry at large -- you will eventually discover -- at the core -- messy conflation of at least one of the following:
- Data Model (Semantics)
- Data Object (Entity) Names (Identifiers)
- Data Representation Syntax (Markup)
- Data Access Protocol
- Data Presentation Syntax (Markup)
- Data Presentation Media.
The Internet & World Wide Web (InterWeb) are massive successes because their respective architectural cores embody the critical separation outlined above.
The Web of Linked Data is going to become a global reality, and massive success, because it leverages inherently sound architecture -- bar conflationary distractions of RDF. :-)
-
22:54
Virtuoso Linked Deployment 3-Step
» OpenLink Community BlogInjecting Linked Data into the Web has been a major pain point for those who seek personal, service, or organization-specific variants of DBpedia. Basically, the sequence goes something like this:
- You encounter DBpedia or the LOD Cloud Pictorial.
- You look around (typically following your nose from link to link).
- You attempt to publish your own stuff.
- You get stuck.
The problems typically take the following form:
- Functionality confusion about the complementary Name and Address functionality of a single URI abstraction
- Terminology confusion due to conflation and over-loading of terms such as Resource, URL, Representation, Document, etc.
- Inability to find robust tools with which to generate Linked Data from existing data sources such as relational databases, CSV files, XML, Web Services, etc.
To start addressing these problems, here is a simple guide for generating and publishing Linked Data using Virtuoso.
Step 1 - RDF Data GenerationExisting RDF data can be added to the Virtuoso RDF Quad Store via a variety of built-in data loader utilities.
Many options allow you to easily and quickly generate RDF data from other data sources:
- Install the Sponger Bookmarklet for the URIBurner service. Bind this to your own SPARQL-compliant backend RDF database (in this scenario, your local Virtuoso instance), and then Sponge some [HTTP-accessible] resources.
- Convert relational DBMS data to RDF using the Virtuoso RDF Views Wizard.
-
Starting with CSV files, you can
- Place them at an [HTTP-accessible] location, and use the Virtuoso Sponger to convert them to RDF or;
- Use the CVS import feature to import their content into Virtuoso's relational data engine; then use the built-in RDF Views Wizard as with other RDBMS data.
-
Starting from XML files, you can
- Use Virtuoso's inbuilt XSLT-Processor for manual XML to RDF/XML transformation or;
- Leverage the Sponger Cartridge for GRDDL, if there is a transformation service associated with your XML data source, or;
- Let the Sponger analyze the XML data source and make a best-effort transformation to RDF.
Install the Faceted Browser VAD package (
fct_dav.vad) which delivers the following:- Faceted Browser Engine UI
- Dynamic Hypermedia Resource Generator
Three simple steps allow you, your enterprise, and your customers to consume and exploit your newly deployed Linked Data --
-
Load a page like this in your browser:
http://<cname>[:<port>]/describe/?uri=<entity-uri>-
<cname>[:<port>]gets replaced by the host and port of your Virtuoso instance -
<entity-uri>gets replaced by the URI you want to see described -- for instance, the URI of one of the resources you let the Sponger handle.
-
- Follow the links presented in the descriptor page.
- If you ever see a blank page with a hyperlink subject name in the About: section at the top of the page, simply add the parameter "&sp=1" to the URL in the browser's Address box, and hit [ENTER]. This will result in an "on the fly" resource retrieval, transformation, and descriptor page generation.
- Use the navigator controls to page up and down the data associated with the "in scope" resource descriptor.
- Sample">[http%2Fwww.amazon.com%2Fo%2FASIN%2F006251587X">Sample] Descriptor Page (what you see post completion of the steps in this post)
- What is Linked Data, really?
- Painless Linked Data Generation via URIBurner
- How To Load RDF Data Into Virtuoso (various methods)
- Virtuoso Bulk Loader Script for RDF
- Bulk Loader Script for CSV
- Wizard based generation of RDF based Linked Data from ODBC accessible Relational Databases
-
22:54
Virtuoso Linked Data Deployment In 3 Simple Steps
» OpenLink Community BlogInjecting Linked Data into the Web has been a major pain point for those who seek personal, service, or organization-specific variants of DBpedia. Basically, the sequence goes something like this:
- You encounter DBpedia or the LOD Cloud Pictorial.
- You look around (typically following your nose from link to link).
- You attempt to publish your own stuff.
- You get stuck.
The problems typically take the following form:
- Functionality confusion about the complementary Name and Address functionality of a single URI abstraction
- Terminology confusion due to conflation and over-loading of terms such as Resource, URL, Representation, Document, etc.
- Inability to find robust tools with which to generate Linked Data from existing data sources such as relational databases, CSV files, XML, Web Services, etc.
To start addressing these problems, here is a simple guide for generating and publishing Linked Data using Virtuoso.
Step 1 - RDF Data GenerationExisting RDF data can be added to the Virtuoso RDF Quad Store via a variety of built-in data loader utilities.
Many options allow you to easily and quickly generate RDF data from other data sources:
- Install the Sponger Bookmarklet for the URIBurner service. Bind this to your own SPARQL-compliant backend RDF database (in this scenario, your local Virtuoso instance), and then Sponge some [HTTP-accessible] resources.
- Convert relational DBMS data to RDF using the Virtuoso RDF Views Wizard.
-
Starting with CSV files, you can
- Place them at an [HTTP-accessible] location, and use the Virtuoso Sponger to convert them to RDF or;
- Use the CVS import feature to import their content into Virtuoso's relational data engine; then use the built-in RDF Views Wizard as with other RDBMS data.
-
Starting from XML files, you can
- Use Virtuoso's inbuilt XSLT-Processor for manual XML to RDF/XML transformation or;
- Leverage the Sponger Cartridge for GRDDL, if there is a transformation service associated with your XML data source, or;
- Let the Sponger analyze the XML data source and make a best-effort transformation to RDF.
Install the Faceted Browser VAD package (
fct_dav.vad) which delivers the following:- Faceted Browser Engine UI
- Dynamic Hypermedia Resource Generator
Three simple steps allow you, your enterprise, and your customers to consume and exploit your newly deployed Linked Data --
-
Load a page like this in your browser:
http://<cname>[:<port>]/describe/?uri=<entity-uri>-
<cname>[:<port>]gets replaced by the host and port of your Virtuoso instance -
<entity-uri>gets replaced by the URI you want to see described -- for instance, the URI of one of the resources you let the Sponger handle.
-
- Follow the links presented in the descriptor page.
- If you ever see a blank page with a hyperlink subject name in the About: section at the top of the page, simply add the parameter "&sp=1" to the URL in the browser's Address box, and hit [ENTER]. This will result in an "on the fly" resource retrieval, transformation, and descriptor page generation.
- Use the navigator controls to page up and down the data associated with the "in scope" resource descriptor.
- Sample">[http%2Fwww.amazon.com%2Fo%2FASIN%2F006251587X">Sample] Descriptor Page (what you see post completion of the steps in this post)
- What is Linked Data, really?
- Painless Linked Data Generation via URIBurner
- How To Load RDF Data Into Virtuoso (various methods)
- Virtuoso Bulk Loader Script for RDF
- Bulk Loader Script for CSV
- Wizard based generation of RDF based Linked Data from ODBC accessible Relational Databases
-
17:50
Business Of Linked Data: Data Quality Factors
» OpenLink Community BlogVia my "context lenses" (i.e., my subjective view of the world) a unit of Data (or Datum) is like a cube of sugar, each side representing a value factor along the following dimensions:
- Identity -- via Resolvable URIs based Names for everything
- Data Representation Format Dexterity -- e.g., HTTP based Content Negotiation which loosens the coupling between Data Model Semantics and actual Data Representation (Syntax/Markup)
- Platform Agnostic Data Access -- e.g. via ubiquitous HTTP
- Change Sensitivity -- data warehouses are like real-world warehouses, goods rot and perish overtime
- Provenance -- data about the data (metadata) that helps establish "Who", "What", "When", "Where", and at least approximate or guesstimate "Why"
- Data Mesh Navigability -- delivered via inference rules.
The quality of service factors above nullify many of the typical concerns associated data driven business models, such as:
- Wholesale Imports (crawls) - where your data is crawled and/or imported wholesale into a new data space with zero attribution to the source
- Lossy Attribution -- attribution is delivered in literal form which doesn't deliver branding fidelity across many value chain layers or entire life cycle of a given data item
- Service Provisioning -- effectively build any business model if you can align services with unambiguously identifiable consumers with actual data items or across entire data spaces.
-
23:10
What is Linked Data, really?
» OpenLink Community BlogLinked Data is simply hypermedia-based structured data.
Linked Data offers everyone a Web-scale, Enterprise-grade mechanism for platform-independent creation, curation, access, and integration of data.
The fundamental steps to creating Linked Data are as follows:
-
Choose a Name Reference Mechanism — i.e., URIs.
-
Choose a Data Model with which to Structure your Data — minimally, you need a model which clearly distinguishes
- Subjects (also known as Entities)
- Subject Attributes (also known as Entity Attributes), and
- Attribute Values (also known as Subject Attribute Values or Entity Attribute Values).
-
Choose one or more Data Representation Syntaxes (also called Markup Languages or Data Formats) to use when creating Resources with Content based on your chosen Data Model. Some Syntaxes in common use today are HTML+RDFa, N3, Turtle, RDF/XML, TriX, XRDS, GData, OData, OpenGraph, and many others.
-
Choose a URI Scheme that facilitates binding Referenced Names to the Resources which will carry your Content -- your Structured Data.
-
Create Structured Data by using your chosen Name Reference Mechanism, your chosen Data Model, and your chosen Data Representation Syntax, as follows:
- Identify Subject(s) using Resolvable URI(s).
- Identify Subject Attribute(s) using Resolvable URI(s).
- Assign Attribute Values to Subject Attributes. These Values may be either Literals (e.g., STRINGs, BLOBs) or Resolvable URIs.
You can create Linked Data (hypermedia-based data representations) Resources from or for many things. Examples include: personal profiles, calendars, address books, blogs, photo albums; there are many, many more.
Related- Linked Data an Introduction -- simple introduction to Linked Data and its virtues
- How Data Makes Corporations Dumb -- Jeff Jonas (IBM) interview
- Hypermedia Types -- evolving information portal covering different aspects of Hypermedia resource types
- URIBurner -- service that generates Linked Data from a plethora of heterogeneous data sources
- Linked Data Meme -- TimbL design issues note about Linked Data
- Data 3.0 Manifesto -- note about format agnostic Linked Data
- DBpedia -- large Linked Data Hub
- Linked Open Data Cloud -- collection of Linked Data Spaces
- Linked Open Commerce Cloud -- commerce (clicks & mortar and/or clicks & clicks) oriented Linked Data Space
- LOD Cloud Cache -- massive Linked Data Space hosting most of the LOD Cloud Datasets
- LOD2 Initiative -- EU Co-Funded Project to develop global knowledge space from LOD .
-
-
21:54
What is Linked Data, really?
» OpenLink Community BlogLinked Data is simply hypermedia-based structured data.
Linked Data offers everyone a Web-scale, Enterprise-grade mechanism for platform-independent creation, curation, access, and integration of data.
The fundamental steps to creating Linked Data are as follows:
-
Choose a Name Reference Mechanism — i.e., URIs.
-
Choose a Data Model with which to Structure your Data — minimally, you need a model which clearly distinguishes
- Subjects (also known as Entities)
- Subject Attributes (also known as Entity Attributes), and
- Attribute Values (also known as Subject Attribute Values or Entity Attribute Values).
-
Choose one or more Data Representation Syntaxes (also called Markup Languages or Data Formats) to use when creating Resources with Content based on your chosen Data Model. Some Syntaxes in common use today are HTML+RDFa, N3, Turtle, RDF/XML, TriX, XRDS, GData, and OData; there are many others.
-
Choose a URI Scheme that facilitates binding Referenced Names to the Resources which will carry your Content -- your Structured Data.
-
Create Structured Data by using your chosen Name Reference Mechanism, your chosen Data Model, and your chosen Data Representation Syntax, as follows:
- Identify Subject(s) using Resolvable URI(s).
- Identify Subject Attribute(s) using Resolvable URI(s).
- Assign Attribute Values to Subject Attributes. These Values may be either Literals (e.g., STRINGs, BLOBs) or Resolvable URIs.
You can create Linked Data (hypermedia-based data representations) Resources from or for many things. Examples include: personal profiles, calendars, address books, blogs, photo albums; there are many, many more.
Related- Hypermedia Types -- evolving information portal covering different aspects of Hypermedia resource types
- URIBurner -- service that generates Linked Data from a plethora of heterogeneous data sources
- Linked Data Meme -- TimbL design issues note about Linked Data
- Data 3.0 Manifesto -- note about format agnostic Linked Data
- DBpedia -- large Linked Data Hub
- Linked Open Data Cloud -- collection of Linked Data Spaces
- Linked Open Commerce Cloud -- commerce (clicks & mortar and/or clicks & clicks) oriented Linked Data Space
- LOD Cloud Cache -- massive Linked Data Space hosting most of the LOD Cloud Datasets
- LOD2 Initiative -- EU Co-Funded Project to develop global knowledge space from LOD .
-
-
21:08
Virtuoso 6.2 brings New Features!
» OpenLink Community BlogVirtuoso 6.2 introduces a major number of enhancements to areas including...
- Linked Data Deployment
- Linked Data Middleware
- Data Virtualization
- Dynamic Data Exchange & Data Replication
- Security
Linked Data Deployment
Linked Data MiddlewareFeature Description Benefit Automatic Deployment Linked Data Pages are now automatically published for every Virtuoso Data Object; users need only load their data into the RDF Quad Store. Handcrafted URL-Rewrite Rules are no longer necessary. HTTP Metadata Enhancements HTTP Link:header is used to transfer vital metadata (e.g., relationships between a Descriptor Resource and its Subject) from HTTP Servers to User Agents.Enables [HTTP-oriented] tools to work with such relationships and other metadata. HTML Metadata Embedding HTML resource <head />and<link />elements and their@relattributes are used to transfer vital metadata (e.g., relationships between a Descriptor Resource and its Subject) from HTTP Servers to User Agents.Enables HTML-oriented tools to work with such relationships and other metadata. Hammer Stack Auto-Discovery Patterns HTML resource <head />section and<link />elements, the HTTPLink:header, and XRD-based"host-meta"resources collectively provide structured metadata about Virtuoso hosts, associated Linked Data Spaces, and specific Data Items (Entities).Enables humans and machines to easily distinguish between Descriptor Resources and their Subjects, irrespective of URI scheme.
Data VirtualizationFeature Description Benefit New Sponger Cartridges New cartridges (data access and transformation drivers) for Twitter, Facebook, Amazon, eBay, LinkedIn, and others. Enable users and user agents to deal with the Sponged data spaces as though they were named graphs in a quad store, or tables in an RDBMS. New Descriptor Pages HTML-based descriptor pages are automatically generated. Descriptor subjects, and the constellation of navigable attribute-and-value pairs that constitute their descriptive representation, are clearly identified. Automatic Subject Identifier Generation De-referenceable data object identifiers are automatically created. Removes tedium and risk of error associated with nuance-laced manual construction of identifiers. Support for OData, JSON, RDFa Additional data representation and serialization formats associated with Linked Data. Increases flexibility and interoperability.
Dynamic Data Exchange & Data ReplicationFeature Description Benefit Materialized RDF Views RDF Views over ODBC/JDBC Data Sources can now (optionally) keep the Quad Store in sync with the RDBMS data source. Enables high-performance Faceted Browsing while remaining sensitive to changes in the RDBMS data sources. CSV-to-RDF Transformation Wizard-based generation of RDF Linked Data from CSV files. Speeds deployment of data which may only exist in CSV form as Linked Data. Transparent Data Access Binding SPASQL (SPARQL Query Language integrated into SQL) is usable over ODBC, JDBC, ADO.NET, OLEDB, or XMLA connections. Enables Desktop Productivity Tools to transparently work with any blend of RDBMS and RDF data sources.
SecurityFeature Description Benefit Quad Store to Quad Store Replication High-fidelity graph-data replication between one or more database instances. Enables a wide variety of deployment topologies. Delta Engine Automated generation of deltas at the named-graph-level, matches transactional replication offered by the Virtuoso SQL engine. Brings RDF replication on par with SQL replication. PubSubHubbub Support Deep integration within Quad Store as an optional mechanism for shipping deltas. Enables push-based data replication across a variety of topologies. Feature Description Benefit WebID support at the DBMS core Use WebID protocol for low-level ACL-based protection of database objects (RDF or Relational) and Web Services. Enables application of sophisticated security and data access policies to Web Services (e.g., SPARQL endpoint) and actual DBMS objects. Webfinger Supports using mailto:andacct:URIs in the context of WebID and other mechanisms, when domain holders have published necessary XRDS resources.Enables more intuitive identification of people and organizations. Fingerpoint Similar to Webfinger but does not require XRDS resources; instea,d it works directly with SPARQL endpoints exposed using auto-discovery patterns in the <head />section of HTML documents.Enables more intuitive identification of people and organizations. -
18:20
The Business of Semantically Linked Data ("SemData")
» OpenLink Community BlogI had the opportunity the other day to converse about the semantic technology business proposition in terms of business development. My interlocutor was a business development consultant who had little prior knowledge of this technology but a background in business development inside a large diversified enterprise.
I will here recap some of the points discussed, since these can be of broader interest.
Why is there no single dominant vendor?The field is young. We can take the relational database industry as a historical precedent. From the inception of the relational database around 1970, it took 15 years for the relational model to become mainstream. "Mainstream" here does not mean dominant in installed base, but does mean something that one tends to include as a component in new systems. The figure of 15 years might repeat with RDF, from around 1990 for the first beginnings to 2015 for routine inclusion in new systems, where applicable.
This does not necessarily mean that the RDF graph data model (or more properly, EAV+CR; Entity-Attribute-Value + Classes and Relationships) will take the place of the RDBMS as the preferred data backbone. This could mean that RDF model serialization formats will be supported as data exchange mechanisms, and that systems will integrate data extracted by semantic technology from unstructured sources. Some degree of EAV storage is likely to be common, but on-line transactional data is guaranteed to stay pure relational, as EAV is suboptimal for OLTP. Analytics will see EAV alongside relational especially in applications where in-house data is being combined with large numbers of outside structured sources or with other open sources such as information extracted from the web.
EAV offerings will become integrated by major DBMS vendors, as is already the case with Oracle. Specialized vendors will exist alongside these, just as is the case with relational databases.
Can there be a positive reinforcement cycle (e.g., building cars creates a need for road construction, and better roads drive demand for more cars)? Or is this an up-front infrastructure investment that governments make for some future payoff or because of science-funding policies?
The Document Web did not start as a government infrastructure initiative. The infrastructure was already built, albeit first originating with the US defense establishment. The Internet became ubiquitous through the adoption of the Web. The general public's adoption of the Web was bootstrapped by all major business and media adopting the Web. They did not adopt the web because they particularly liked it, as it was essentially a threat to the position of media and to the market dominance of big players who could afford massive advertising in this same media. Adopting the web became necessary because of the prohibitive opportunity cost of not adopting it.
A similar process may take place with open data. For example, in E-commerce, vendors do not necessarily welcome easy-and-automatic machine-based comparison of their offerings against those of their competitors. Publishing data will however be necessary in order to be listed at all. Also, in social networks, we have the identity portability movement which strives to open the big social network silos. Data exchange via RDF serializations, as already supported in many places, is the natural enabling technology for this.
Will the web of structured data parallel the development of web 2.0?
Web 2.0 was about the blogosphere, exposure of web site service APIs, creation of affiliate programs, and so forth. If the Document Web was like a universal printing press, where anybody could publish at will, Web 2.0 was a newspaper, bringing the democratization of journalism, creating the blogger, the citizen journalist. The Data Web will create the Citizen Analyst, the Mini Media Mogul (e.g., social-network-driven coops comprised of citizen journalists, analysts, and other content providers such as video and audio producers and publishers). As the blogosphere became an alternative news source to the big media, the web of data may create an ecosystem of alternative data products. Analytics is no longer a government or big business only proposition.
Is there a specifically semantic market or business model, or will semantic technology be exploited under established business models and merged as a component technology into existing offerings?
We have seen a migration from capital expenses to operating expenses in the IT sector in general, as exemplified by cloud computing's Platform as a Service (PaaS) and Software as a Service (SaaS). It is reasonable to anticipate that this trend will continue to Data as a Service (DaaS). Microsoft Odata and Dallas are early examples of this and go towards legitimizing the data as service concept. DaaS is not related to semantic technology per se, but since this will involve integration of data, RDF serializations will be attractive, especially given the takeoff of linked data in general. The data models in Odata are also much like RDF, as both stem from EAV+CR, which makes for easy translation and a degree of inherent interoperability.
The integration of semantic technology into existing web properties and business applications will manifest to the end user as increased serendipity. The systems will be able to provide more relevant and better contextualized data for the user's situation. This applies equally to the consumer and business user cases.
Identity virtualization in the forms of WebID and Webfinger — making first-class de-referenceable identifiers of
mailto:andacct:schemes — is emerging as a new way to open social network and Web 2.0 data silos.On the software production side, especially as concerns data integration, the increased schema- and inference-flexibility of EAV will lead to a quicker time to answer in many situations. The more complex the task or the more diverse the data, the higher the potential payoff. Data in cyberspace is mirroring the complexity and diversity of the real world, where heterogeneity and disparity are simply facts of life, and such flexibility is becoming an inescapable necessity.
-
18:20
The Business of Semantically Linked Data ("SemData")
» OpenLink Community BlogI had the opportunity the other day to converse about the semantic technology business proposition in terms of business development. My interlocutor was a business development consultant who had little prior knowledge of this technology but a background in business development inside a large diversified enterprise.
I will here recap some of the points discussed, since these can be of broader interest.
Why is there no single dominant vendor?The field is young. We can take the relational database industry as a historical precedent. From the inception of the relational database around 1970, it took 15 years for the relational model to become mainstream. "Mainstream" here does not mean dominant in installed base, but does mean something that one tends to include as a component in new systems. The figure of 15 years might repeat with RDF, from around 1990 for the first beginnings to 2015 for routine inclusion in new systems, where applicable.
This does not necessarily mean that the RDF graph data model (or more properly, EAV+CR; Entity-Attribute-Value + Classes and Relationships) will take the place of the RDBMS as the preferred data backbone. This could mean that RDF model serialization formats will be supported as data exchange mechanisms, and that systems will integrate data extracted by semantic technology from unstructured sources. Some degree of EAV storage is likely to be common, but on-line transactional data is guaranteed to stay pure relational, as EAV is suboptimal for OLTP. Analytics will see EAV alongside relational especially in applications where in-house data is being combined with large numbers of outside structured sources or with other open sources such as information extracted from the web.
EAV offerings will become integrated by major DBMS vendors, as is already the case with Oracle. Specialized vendors will exist alongside these, just as is the case with relational databases.
Can there be a positive reinforcement cycle (e.g., building cars creates a need for road construction, and better roads drive demand for more cars)? Or is this an up-front infrastructure investment that governments make for some future payoff or because of science-funding policies?
The Document Web did not start as a government infrastructure initiative. The infrastructure was already built, albeit first originating with the US defense establishment. The Internet became ubiquitous through the adoption of the Web. The general public's adoption of the Web was bootstrapped by all major business and media adopting the Web. They did not adopt the web because they particularly liked it, as it was essentially a threat to the position of media and to the market dominance of big players who could afford massive advertising in this same media. Adopting the web became necessary because of the prohibitive opportunity cost of not adopting it.
A similar process may take place with open data. For example, in E-commerce, vendors do not necessarily welcome easy-and-automatic machine-based comparison of their offerings against those of their competitors. Publishing data will however be necessary in order to be listed at all. Also, in social networks, we have the identity portability movement which strives to open the big social network silos. Data exchange via RDF serializations, as already supported in many places, is the natural enabling technology for this.
Will the web of structured data parallel the development of web 2.0?
Web 2.0 was about the blogosphere, exposure of web site service APIs, creation of affiliate programs, and so forth. If the Document Web was like a universal printing press, where anybody could publish at will, Web 2.0 was a newspaper, bringing the democratization of journalism, creating the blogger, the citizen journalist. The Data Web will create the Citizen Analyst, the Mini Media Mogul (e.g., social-network-driven coops comprised of citizen journalists, analysts, and other content providers such as video and audio producers and publishers). As the blogosphere became an alternative news source to the big media, the web of data may create an ecosystem of alternative data products. Analytics is no longer a government or big business only proposition.
Is there a specifically semantic market or business model, or will semantic technology be exploited under established business models and merged as a component technology into existing offerings?
We have seen a migration from capital expenses to operating expenses in the IT sector in general, as exemplified by cloud computing's Platform as a Service (PaaS) and Software as a Service (SaaS). It is reasonable to anticipate that this trend will continue to Data as a Service (DaaS). Microsoft Odata and Dallas are early examples of this and go towards legitimizing the data as service concept. DaaS is not related to semantic technology per se, but since this will involve integration of data, RDF serializations will be attractive, especially given the takeoff of linked data in general. The data models in Odata are also much like RDF, as both stem from EAV+CR, which makes for easy translation and a degree of inherent interoperability.
The integration of semantic technology into existing web properties and business applications will manifest to the end user as increased serendipity. The systems will be able to provide more relevant and better contextualized data for the user's situation. This applies equally to the consumer and business user cases.
Identity virtualization in the forms of WebID and Webfinger — making first-class de-referenceable identifiers of
mailto:andacct:schemes — is emerging as a new way to open social network and Web 2.0 data silos.On the software production side, especially as concerns data integration, the increased schema- and inference-flexibility of EAV will lead to a quicker time to answer in many situations. The more complex the task or the more diverse the data, the higher the potential payoff. Data in cyberspace is mirroring the complexity and diversity of the real world, where heterogeneity and disparity are simply facts of life, and such flexibility is becoming an inescapable necessity.
-
21:14
VLDB Semdata Workshop
» OpenLink Community BlogI will begin by extending my thanks to the organizers, in specific Reto Krummenacher of STI and Atanas Kiryakov of Ontotext for inviting me to give a position paper at the workshop. Indeed, it is the builders of bridges, the pontifs (pontifex) amongst us who shall be remembered by history. The idea of organizing a semantic data management workshop at VLDB is a laudable attempt at rapprochement between two communities to the advantage of all concerned.
Franz, Ontotext, and OpenLink were the vendors present at the workshop. To summarize very briefly, Jans Aasman of Franz talked about the telco call center automation solution by Amdocs, where the AllegroGraph RDF store is integrated. On the technical side, AllegroGraph has Javascript as a stored procedure language, which is certainly a good idea. Naso of Ontotext talked about the BBC FIFA World Cup site. The technical proposition was that materialization is good and data partitioning is not needed; a set of replicated read-only copies is good enough.
I talked about making RDF cost competitive with relational for data integration and BI. The crux is space efficiency and column store techniques.
One question that came up was that maybe RDF could approach relational in some things, but what about string literals being stored in a separate table? Or URI strings being stored in a separate table?
The answer is that if one accesses a lot of these literals the access will be local and fairly efficient. If one accesses just a few, it does not matter. For user-facing reports, there is no point in returning a million strings that the user will not read anyhow. But then it turned out that there in fact exist reports in bioinformatics where there are 100,000 strings. Now taking the worst abuse of SPARQL, a regexp over all literals in a property of a given class. With a column store this is a scan of the column; with RDF, a three table join. The join is about 10x slower than the column scan. Quite OK, considering that a full text index is the likely solution for such workloads anyway. Besides, a sensible relational schema will also not use strings for foreign keys, and will therefore incur a similar burden from fetching the strings before returning the result.
Another question was about whether the attitude was one of confrontation between RDF and relational and whether it would not be better to join forces. Well, as said in my talk, sauce for the goose is sauce for the gander and generally speaking relational techniques apply equally to RDF. There are a few RDB tricks that have no RDF equivalent, like clustering a fact table on dimension values, e.g., sales ordered by country, manufacturer, month. But by and large, column-store techniques apply. The execution engine can be essentially identical, just needing a couple of extra data types and some run-time typing and in some cases producing nulls instead of errors. Query optimization is much the same, except that RDB stats are not applicable as such; one needs to sample the data in the cost model. All in all, these adaptations to a RDB are not so large, even though they do require changes to source code.
Another question was about combining data models, e.g., relational (rows and columns), RDF (graph), XML (tree), and full text. Here I would say that it is a fault of our messaging that we do not constantly repeat the necessity of this combining, as we take it for granted. Most RDF stores have a full text index on literal values. OWLIM and a CWI prototype even have it for URIs. XML is a valid data type for an RDF literal, even though this does not get used very much. So doing SPARQL to select the values, and then doing XPath and XSLT on the values, is entirely possible, at least in Virtuoso which has an XPath/XSLT engine built in. Same for invoking SPARQL from an XSLT sheet. Colocating a native RDBMS with local and federated SQL is what Virtuoso has always done. One can, for example, map tables in heterogenous remote RDBs into tables in Virtuoso, then map these into RDF, and run SPARQL queries that get translated into SQL against the original tables, thereby getting SPARQL access without any materialization. Alongside this, one can ETL relational data into RDF via the same declarative mapping.
Further, there are RDF extensions for geospatial queries in Virtuoso and AllegroGraph, and soon also in others.
With all this cross-model operation, RDF is definitely not a closed island. We'll have to repeat this more.
Of the academic papers, the SpiderStore (paper is not yet available at time of writing, but should be soon) and Webpie that should be specially noted.
Let us talk about SpiderStore first.
SpiderStoreThe SpiderStore from the University of Innsbruck is a main-memory-only system that has a record for each distinct IRI. The IRI record has one array of pointers to all IRI records that are objects where the referencing record is the subject, and a similar array of pointers to all records where the referencing record is the object. Both sets of pointers are clustered based on the predicate labeling the edge.
According to the authors (Robert Binna, Wolfgang Gassler, Eva Zangerle, Dominic Pacher, and Günther Specht), a distinct IRI is 5 pointers and each triple is 3 pointers. This would make about 4 pointers per triple, i.e., 32 bytes with 64-bit pointers.
This is not particularly memory efficient, since one must count unused space after growing the lists, fragmentation, etc., which will make the space consumption closer to 40 bytes per triple, plus should one add a graph to the mix one would need another pointer per distinct predicate, adding another 1-4 bytes per triple. Supporting non-IRI types in the object position is not a problem, as long as all distinct values have a chunk of memory to them with a type tag.
We get a few times better memory efficiency with column compressed quads, plus we are not limited to main memory.
But SpiderStore has a point. Making the traversal of an edge in the graph into a pointer dereference is not such a bad deal, especially if the data set is not that big. Furthermore, compiling the queries into C procedures playing with the pointers alone would give performance to match or exceed any hard coded graph traversal library and would not be very difficult. Supporting multithreaded updates would spoil much of the gain but allowing single threaded updates and forking read-only copies for reading would be fine.
SpiderStore as such is not attractive for what we intend to do, this being aggregating RDF quads in volumes far exceeding main memory and scaling to clusters. We note that SpiderStore hits problems with distributed memory, since SpiderStore executes depth first, which is manifestly impossible if significant latencies are involved. In other words, if there can be latency, one must amortize by having a lot of other possible work available. Running with long vectors of values is one way, as in MonetDB or Virtuoso Cluster. The other way is to have a massively multithreaded platform which favors code with few instructions but little memory locality. SpiderStore could be a good fit for massive multithreading, specially if queries were compiled to C, dramatically cutting down on the count of instructions to execute.
We too could adopt some ideas from SpiderStore. Namely, if running vectored, one just in passing, without extra overhead, generates an array of links to the next IRI, a bit like the array that SpiderStore has for each predicate for the incoming and outgoing edges of a given IRI. Of course, here these would be persistent IDs and not pointers, but a hash from one to the other takes almost no time. So, while SpiderStore alone may not be what we are after for data warehousing, Spiderizing parts of the working set would not be so bad. This is especially so since the Spiderizable data structure almost gets made as a by-product of query evaluation.
If an algorithm made several passes over a relatively small subgraph of the whole database, Spiderizing it would accelerate things. The memory overhead could have a fixed cap so as not to ruin the working set if locality happened not to hold.
Running a SpiderStore-like execution model on vectors instead of single values would likely do no harm and might even result in better cache behavior. The exception is in the event of completely unpredictable patterns of connections which may only be amortized by massive multithreading.
WebpieWebpie from VU Amsterdam and the LarKC EU FP 7 project is, as it were, the opposite of SpiderStore. This is a map-reduce-based RDFS and OWL Horst inference engine which is all about breadth-first passes over the data in a map-reduce framework with intermediate disk-based storage.
Webpie is not however a database. After the inference result has been materialized, it must be loaded into a SPARQL engine in order to evaluate a query against the result.
The execution plan of Webpie is made from the ontology whose consequences must be materialized. The steps are sorted and run until a fixed point is reached for each. This is similar to running SPARQL
INSERT … SELECTstatements until no new inserts are produced. The only requirement is that theINSERTstatement should report whether new inserts were actually made. This is easy to do. In this way, a comparison between map-reduce plus memory-based joining and a parallel RDF database could be made.We have suggested such an experiment to the LarKC people. We will see.
-
21:14
VLDB Semdata Workshop
» OpenLink Community BlogI will begin by extending my thanks to the organizers, in specific Reto Krummenacher of STI and Atanas Kiryakov of Ontotext for inviting me to give a position paper at the workshop. Indeed, it is the builders of bridges, the pontifs (pontifex) amongst us who shall be remembered by history. The idea of organizing a semantic data management workshop at VLDB is a laudable attempt at rapprochement between two communities to the advantage of all concerned.
Franz, Ontotext, and OpenLink were the vendors present at the workshop. To summarize very briefly, Jans Aasman of Franz talked about the telco call center automation solution by Amdocs, where the AllegroGraph RDF store is integrated. On the technical side, AllegroGraph has Javascript as a stored procedure language, which is certainly a good idea. Naso of Ontotext talked about the BBC FIFA World Cup site. The technical proposition was that materialization is good and data partitioning is not needed; a set of replicated read-only copies is good enough.
I talked about making RDF cost competitive with relational for data integration and BI. The crux is space efficiency and column store techniques.
One question that came up was that maybe RDF could approach relational in some things, but what about string literals being stored in a separate table? Or URI strings being stored in a separate table?
The answer is that if one accesses a lot of these literals the access will be local and fairly efficient. If one accesses just a few, it does not matter. For user-facing reports, there is no point in returning a million strings that the user will not read anyhow. But then it turned out that there in fact exist reports in bioinformatics where there are 100,000 strings. Now taking the worst abuse of SPARQL, a regexp over all literals in a property of a given class. With a column store this is a scan of the column; with RDF, a three table join. The join is about 10x slower than the column scan. Quite OK, considering that a full text index is the likely solution for such workloads anyway. Besides, a sensible relational schema will also not use strings for foreign keys, and will therefore incur a similar burden from fetching the strings before returning the result.
Another question was about whether the attitude was one of confrontation between RDF and relational and whether it would not be better to join forces. Well, as said in my talk, sauce for the goose is sauce for the gander and generally speaking relational techniques apply equally to RDF. There are a few RDB tricks that have no RDF equivalent, like clustering a fact table on dimension values, e.g., sales ordered by country, manufacturer, month. But by and large, column-store techniques apply. The execution engine can be essentially identical, just needing a couple of extra data types and some run-time typing and in some cases producing nulls instead of errors. Query optimization is much the same, except that RDB stats are not applicable as such; one needs to sample the data in the cost model. All in all, these adaptations to a RDB are not so large, even though they do require changes to source code.
Another question was about combining data models, e.g., relational (rows and columns), RDF (graph), XML (tree), and full text. Here I would say that it is a fault of our messaging that we do not constantly repeat the necessity of this combining, as we take it for granted. Most RDF stores have a full text index on literal values. OWLIM and a CWI prototype even have it for URIs. XML is a valid data type for an RDF literal, even though this does not get used very much. So doing SPARQL to select the values, and then doing XPath and XSLT on the values, is entirely possible, at least in Virtuoso which has an XPath/XSLT engine built in. Same for invoking SPARQL from an XSLT sheet. Colocating a native RDBMS with local and federated SQL is what Virtuoso has always done. One can, for example, map tables in heterogenous remote RDBs into tables in Virtuoso, then map these into RDF, and run SPARQL queries that get translated into SQL against the original tables, thereby getting SPARQL access without any materialization. Alongside this, one can ETL relational data into RDF via the same declarative mapping.
Further, there are RDF extensions for geospatial queries in Virtuoso and AllegroGraph, and soon also in others.
With all this cross-model operation, RDF is definitely not a closed island. We'll have to repeat this more.
Of the academic papers, the SpiderStore (paper is not yet available at time of writing, but should be soon) and Webpie that should be specially noted.
Let us talk about SpiderStore first.
SpiderStoreThe SpiderStore from the University of Innsbruck is a main-memory-only system that has a record for each distinct IRI. The IRI record has one array of pointers to all IRI records that are objects where the referencing record is the subject, and a similar array of pointers to all records where the referencing record is the object. Both sets of pointers are clustered based on the predicate labeling the edge.
According to the authors (Robert Binna, Wolfgang Gassler, Eva Zangerle, Dominic Pacher, and Günther Specht), a distinct IRI is 5 pointers and each triple is 3 pointers. This would make about 4 pointers per triple, i.e., 32 bytes with 64-bit pointers.
This is not particularly memory efficient, since one must count unused space after growing the lists, fragmentation, etc., which will make the space consumption closer to 40 bytes per triple, plus should one add a graph to the mix one would need another pointer per distinct predicate, adding another 1-4 bytes per triple. Supporting non-IRI types in the object position is not a problem, as long as all distinct values have a chunk of memory to them with a type tag.
We get a few times better memory efficiency with column compressed quads, plus we are not limited to main memory.
But SpiderStore has a point. Making the traversal of an edge in the graph into a pointer dereference is not such a bad deal, especially if the data set is not that big. Furthermore, compiling the queries into C procedures playing with the pointers alone would give performance to match or exceed any hard coded graph traversal library and would not be very difficult. Supporting multithreaded updates would spoil much of the gain but allowing single threaded updates and forking read-only copies for reading would be fine.
SpiderStore as such is not attractive for what we intend to do, this being aggregating RDF quads in volumes far exceeding main memory and scaling to clusters. We note that SpiderStore hits problems with distributed memory, since SpiderStore executes depth first, which is manifestly impossible if significant latencies are involved. In other words, if there can be latency, one must amortize by having a lot of other possible work available. Running with long vectors of values is one way, as in MonetDB or Virtuoso Cluster. The other way is to have a massively multithreaded platform which favors code with few instructions but little memory locality. SpiderStore could be a good fit for massive multithreading, specially if queries were compiled to C, dramatically cutting down on the count of instructions to execute.
We too could adopt some ideas from SpiderStore. Namely, if running vectored, one just in passing, without extra overhead, generates an array of links to the next IRI, a bit like the array that SpiderStore has for each predicate for the incoming and outgoing edges of a given IRI. Of course, here these would be persistent IDs and not pointers, but a hash from one to the other takes almost no time. So, while SpiderStore alone may not be what we are after for data warehousing, Spiderizing parts of the working set would not be so bad. This is especially so since the Spiderizable data structure almost gets made as a by-product of query evaluation.
If an algorithm made several passes over a relatively small subgraph of the whole database, Spiderizing it would accelerate things. The memory overhead could have a fixed cap so as not to ruin the working set if locality happened not to hold.
Running a SpiderStore-like execution model on vectors instead of single values would likely do no harm and might even result in better cache behavior. The exception is in the event of completely unpredictable patterns of connections which may only be amortized by massive multithreading.
WebpieWebpie from VU Amsterdam and the LarKC EU FP 7 project is, as it were, the opposite of SpiderStore. This is a map-reduce-based RDFS and OWL Horst inference engine which is all about breadth-first passes over the data in a map-reduce framework with intermediate disk-based storage.
Webpie is not however a database. After the inference result has been materialized, it must be loaded into a SPARQL engine in order to evaluate a query against the result.
The execution plan of Webpie is made from the ontology whose consequences must be materialized. The steps are sorted and run until a fixed point is reached for each. This is similar to running SPARQL
INSERT … SELECTstatements until no new inserts are produced. The only requirement is that theINSERTstatement should report whether new inserts were actually made. This is easy to do. In this way, a comparison between map-reduce plus memory-based joining and a parallel RDF database could be made.We have suggested such an experiment to the LarKC people. We will see.
-
21:13
Suggested Extensions to the BSBM
» OpenLink Community BlogBelow is a list of possible extensions to the Berlin SPARQL Benchmark. Our previous critique of BSBM consists of:
-
The queries touch very little data, to the point where compilation is a large fraction of execution time. This is not representative of the data integration/analytics orientation of RDF.
-
Most queries are logarithmic to scale factor, but some are linear. The linear ones come to dominate the metric at larger scales.
-
An update stream would make the workload more realistic.
We could rectify this all with almost no changes to the data generator or test driver by adding one or two more metrics.
So I am publishing the below as a starting point for discussion.
BSBM Analytics MixBelow is a set of business questions that can be answered with the BSBM data set. These are more complex and touch a greater percentage of the data than the initial mix. Their evaluation is between linear and n * log(n) to the data size. The TPC-H rules can be used for a power (single user) and a throughput (multi-user, where each submits queries from the mix with different parameters and in different order). The TPC-H score formula and executive summary formats are directly applicable.
This can be a separate metric from the "restricted" BSBM score. Restricted means "without a full scan with regexp" which will dominate the whole metric at larger scales.
Vendor specific variations in syntax will occur, hence these are allowed but disclosure of specific query text should accompany results. Hints for
JOINorder and the like are not allowed; queries must be declarative. We note that both SPARQL and SQL implementations of the queries are possible.The queries are ordered so that the first ones fill the cache. Running the analytics mix immediately after backup post initial load is allowed, resulting in semi-warm cache. Steady-state rules will be defined later, seeing the characteristics of the actual workload.
-
For each country, list the top 10 product categories, ordered by the count of reviews from the country.
-
Product with the most reviews during its first month on the market
-
10 products most similar to X, with similarity score based on the count of features in common
-
Top 10 reviewers of category X
-
Product with largest increase in reviews in month X compared to month X-minus-1.
-
Product of category X with largest change in mean price in the last month
-
Most active American reviewer of Japanese cameras last year
-
Correlation of price and average review
-
Features with greatest impact on price — for features occurring in category X, find the top 10 features where the mean price with the feature is most above the mean price without the feature
-
Country with greatest popularity of products in category X — reviews of category X from country Y divided by total reviews
-
Leading product of category X by country, mentioning mean price in each country and number of offers, sort by number of offers
-
Fans of manufacturer — find top reviewers who score manufacturer above their mean score
-
Products sold only in country X
Since RDF stores often implement a full text index, and since a full scan with regexp matching would never be used in an online E-commerce portal, it is meaningful to extend the benchmark to have some full text queries.
For the SPARQL implementation, text indexing should be enabled for all string-valued literals even though only some of them will be queried in the workload.
-
Q6 from the original mix, now allowing use of text index.
-
Reviews of products of category X where the review contains the names of 1 to 3 product features that occur in said category of products; e.g., MP3 players with support for mp4 and ogg.
-
ibid but now specifying review author. The intent is that structured criteria are here more selective than text.
-
Difference in the frequency of use of "awesome", "super", and "suck(s)" by American vs. European vs. Asian review authors.
For full text queries, the search terms have to be selected according to a realistic distribution. DERI has offered to provide a definition and possibly an implementation for this.
The parameter distribution for the analytics queries will be defined when developing the queries; the intent is that one run will touch 90% of the values in the properties mentioned in the queries.
The result report will have to be adapted to provide a TPC-H executive summary-style report and appropriate metrics.
Changes to Data GenerationFor supporting the IR mix, reviews should, in addition to random text, contain the following:
-
For each feature in the product concerned, add the label of said feature to 60% of the reviews.
-
Add the names of review author, product, product category, and manufacturer.
-
The review score should be expressed in the text by adjectives (e.g., awesome, super, good, dismal, bad, sucky). Every 20th word can be an adjective from the list correlating with the score in 80% of uses of the word and random in 20%. For 90% of adjectives, pick the adjectives from lists of idiomatic expressions corresponding to the country of the reviewer. In 10% of cases, use a random list of idioms.
-
Skew the review scores so that comparatively expensive products have a smaller chance for a bad review.
During the benchmark run:
-
1% of products are added;
-
3% of initial offers are deleted and 3% are added; and
-
5% of reviews are added.
Updates may be divided into transactions and run in series or in parallel in a manner specified by the test sponsor. The code for loading the update stream is vendor specific but must be disclosed.
The initial bulk load does not have to be transactional in any way.
Loading the update stream must be transactional, guaranteeing that all information pertaining to a product or an offer constitutes a transaction. Multiple offers or products may be combined in a transaction. Queries should run at least in
READ COMMITTEDisolation, so that half-inserted products or offers are not seen.Full text indices do not have to be updated transactionally; the update can lag up to 2 minutes behind the insertion of the literal being indexed.
The test data generator generates the update stream together with the initial data. The update stream is a set of files containing Turtle-serialized data for the updates, with all triples belonging to a transaction in consecutive order. The possible transaction boundaries are marked with a comment distinguishable from the text. The test sponsor may implement a special load program if desired. The files must be loaded in sequence but a single file may be loaded on any number of parallel threads.
The data generator should generate multiple files for the initial dump in order to facilitate parallel loading.
The same update stream can be used during all tests, starting each run from a backup containing only the initial state. In the original run, the update stream is applied starting at the measurement interval, after the SUT is in steady state.
-
-
21:13
Suggested Extensions to the BSBM
» OpenLink Community BlogBelow is a list of possible extensions to the Berlin SPARQL Benchmark. Our previous critique of BSBM consists of:
-
The queries touch very little data, to the point where compilation is a large fraction of execution time. This is not representative of the data integration/analytics orientation of RDF.
-
Most queries are logarithmic to scale factor, but some are linear. The linear ones come to dominate the metric at larger scales.
-
An update stream would make the workload more realistic.
We could rectify this all with almost no changes to the data generator or test driver by adding one or two more metrics.
So I am publishing the below as a starting point for discussion.
BSBM Analytics MixBelow is a set of business questions that can be answered with the BSBM data set. These are more complex and touch a greater percentage of the data than the initial mix. Their evaluation is between linear and n * log(n) to the data size. The TPC-H rules can be used for a power (single user) and a throughput (multi-user, where each submits queries from the mix with different parameters and in different order). The TPC-H score formula and executive summary formats are directly applicable.
This can be a separate metric from the "restricted" BSBM score. Restricted means "without a full scan with regexp" which will dominate the whole metric at larger scales.
Vendor specific variations in syntax will occur, hence these are allowed but disclosure of specific query text should accompany results. Hints for
JOINorder and the like are not allowed; queries must be declarative. We note that both SPARQL and SQL implementations of the queries are possible.The queries are ordered so that the first ones fill the cache. Running the analytics mix immediately after backup post initial load is allowed, resulting in semi-warm cache. Steady-state rules will be defined later, seeing the characteristics of the actual workload.
-
For each country, list the top 10 product categories, ordered by the count of reviews from the country.
-
Product with the most reviews during its first month on the market
-
10 products most similar to X, with similarity score based on the count of features in common
-
Top 10 reviewers of category X
-
Product with largest increase in reviews in month X compared to month X-minus-1.
-
Product of category X with largest change in mean price in the last month
-
Most active American reviewer of Japanese cameras last year
-
Correlation of price and average review
-
Features with greatest impact on price — for features occurring in category X, find the top 10 features where the mean price with the feature is most above the mean price without the feature
-
Country with greatest popularity of products in category X — reviews of category X from country Y divided by total reviews
-
Leading product of category X by country, mentioning mean price in each country and number of offers, sort by number of offers
-
Fans of manufacturer — find top reviewers who score manufacturer above their mean score
-
Products sold only in country X
Since RDF stores often implement a full text index, and since a full scan with regexp matching would never be used in an online E-commerce portal, it is meaningful to extend the benchmark to have some full text queries.
For the SPARQL implementation, text indexing should be enabled for all string-valued literals even though only some of them will be queried in the workload.
-
Q6 from the original mix, now allowing use of text index.
-
Reviews of products of category X where the review contains the names of 1 to 3 product features that occur in said category of products; e.g., MP3 players with support for mp4 and ogg.
-
ibid but now specifying review author. The intent is that structured criteria are here more selective than text.
-
Difference in the frequency of use of "awesome", "super", and "suck(s)" by American vs. European vs. Asian review authors.
For full text queries, the search terms have to be selected according to a realistic distribution. DERI has offered to provide a definition and possibly an implementation for this.
The parameter distribution for the analytics queries will be defined when developing the queries; the intent is that one run will touch 90% of the values in the properties mentioned in the queries.
The result report will have to be adapted to provide a TPC-H executive summary-style report and appropriate metrics.
Changes to Data GenerationFor supporting the IR mix, reviews should, in addition to random text, contain the following:
-
For each feature in the product concerned, add the label of said feature to 60% of the reviews.
-
Add the names of review author, product, product category, and manufacturer.
-
The review score should be expressed in the text by adjectives (e.g., awesome, super, good, dismal, bad, sucky). Every 20th word can be an adjective from the list correlating with the score in 80% of uses of the word and random in 20%. For 90% of adjectives, pick the adjectives from lists of idiomatic expressions corresponding to the country of the reviewer. In 10% of cases, use a random list of idioms.
-
Skew the review scores so that comparatively expensive products have a smaller chance for a bad review.
During the benchmark run:
-
1% of products are added;
-
3% of initial offers are deleted and 3% are added; and
-
5% of reviews are added.
Updates may be divided into transactions and run in series or in parallel in a manner specified by the test sponsor. The code for loading the update stream is vendor specific but must be disclosed.
The initial bulk load does not have to be transactional in any way.
Loading the update stream must be transactional, guaranteeing that all information pertaining to a product or an offer constitutes a transaction. Multiple offers or products may be combined in a transaction. Queries should run at least in
READ COMMITTEDisolation, so that half-inserted products or offers are not seen.Full text indices do not have to be updated transactionally; the update can lag up to 2 minutes behind the insertion of the literal being indexed.
The test data generator generates the update stream together with the initial data. The update stream is a set of files containing Turtle-serialized data for the updates, with all triples belonging to a transaction in consecutive order. The possible transaction boundaries are marked with a comment distinguishable from the text. The test sponsor may implement a special load program if desired. The files must be loaded in sequence but a single file may be loaded on any number of parallel threads.
The data generator should generate multiple files for the initial dump in order to facilitate parallel loading.
The same update stream can be used during all tests, starting each run from a backup containing only the initial state. In the original run, the update stream is applied starting at the measurement interval, after the SUT is in steady state.
-
-
21:13
LOD2 Kick Off
» OpenLink Community BlogThe LOD2 kick off meeting was held in Leipzig on Sept 6-8. I will here talk about OpenLink plans as concerns LOD2; hence this is not to be taken as representative of the whole project. I will first discuss the immediate and conclude with the long term.
As concerns OpenLink specifically, we have two short term activities, namely publishing the initial LOD2 repository in December and publishing a set of RDB and RDF benchmarks in February.
The LOD2 repository is a fusion of the OpenLink LOD Cloud Cache (which includes data from URIBurner and PingTheSemanticWeb) and Sindice, both hosted at DERI. The value-add compared to Sindice or the Virtuoso-based LOD Cloud Cache alone is the merger of the timeliness and ping-ping crawling of Sindice with the SPARQL of Virtuoso.
Further down the road, after we migrate the system to the Virtuoso column store, we will also see gains in performance, primarily due to much better working set, as data is many times more compact than with the present row-wise key compression.
Still further, but before next September, we will have dynamic repartitioning; the time of availability is set as this is part of the LOD2 project roadmap. The operational need for this is pushed back somewhat by the compression gains from column-wise storage.
As for benchmarks, I just compiled a draft of suggested extensions to the BSBM (Berlin SPARQL Benchmark). I talked about this with Peter Boncz and Chris Bizer, to the effect that some extensions of BSBM could be done but that the time was a bit short for making a RDF-specific benchmark. We do recall that BSBM is fully feasible with a relational schema and that RDF offers no fundamental edge for the workload.
There was a graph benchmark talk at the TPC workshop at VLDB 2010. There too, the authors were suggesting a social network use case for benchmarking anything from RDF stores to graph libraries. The presentation did not include any specification of test data, so it may be that some cooperation is possible there. The need for such a benchmark is well acknowledged. The final form of this is not yet set but LOD2 will in time publish results from such.
We did informally talk about a process for publishing with our colleagues from Franz and Ontotext at VLDB 2010. The idea is that vendors tune their own systems and do the runs and that the others check on this, preferably all using the same hardware.
Now, the LOD2 benchmarks will also include relational-to-RDF comparisons, for example TPC-H in SQL and SPARQL. The SQL will be Virtuoso, MonetDB, and possibly VectorWise and others, depending on what legal restrictions apply at the time. This will give an RDF-to-SQL comparison of TPC-H at least on Virtuoso, later also on MonetDB, depending on the schedule for a MonetDB SPARQL front-end.
In the immediate term, this of course focuses our efforts on productizing the Virtuoso column store extension and the optimizations that go with it.
LOD2 is however about much more than database benchmarks. Over the longer term, we plan to apply suitable parts of the ground-breaking database research done at CWI to RDF use cases.
This involves anything from adaptive indexing, to reuse and caching of intermediate results, to adaptive execution. This is however more than just mapping column store concepts to RDF. New challenges are posed by running on clusters and dealing with more expressive queries than just SQL, in specific queries with Datalog-like rules and recursion.
LOD2 is principally about integration and alignment, from the schema to the instance level. This involves complex batch processing, close to the data, on large volumes of data. Map-reduce is not the be-all-end-all of this. Of course, a parallel database like Virtuoso, Greenplum, or Vertica can do map-reduce style operations under control of the SQL engine. After all, the SQL engine needs to do map-reduce and a lot more to provide good throughput for parallel, distributed SQL. Something like the Berkeley Orders Of Magnitude (BOOM) distributed Datalog implementation (Overlog, Deadalus, BLOOM) could be a parallel computation framework that would subsume any map-reduce-style functionality under a more elegant declarative framework while still leaving control of execution to the developer for the cases where this is needed.
From our viewpoint, the project's gains include:
-
Significant narrowing of the RDB to RDF performance gap. RDF will be an option for large scale warehousing, cutting down on time to integration by providing greater schema flexibility.
-
Ready to use toolbox for data integration, including schema alignment and resolution of coreference.
-
Data discovery, summarization and visualization
Integrating this into a relatively unified stack of tools is possible, since these all cluster around the task of linking the universe with RDF and linked data. In this respect the integration of results may be stronger than often seen in European large scale integrating projects.
The use cases fit the development profile well:
-
Wolters Kluwer will develop an application for integrating resources around law, from the actual laws to court cases to media coverage. The content is modeled in a fine grained legal ontology.
-
Exalead will implement the linked data enterprise, addressing enterprise search and any typical enterprise data integration plus generating added value from open sources.
-
The Open Knowledge Foundation will create a portal of all government published data for easy access by citizens.
In all these cases, the integration requirements of schema alignment, resolution of identity, information extraction, and efficient storage and retrieval play a significant role. The end user interfaces will be task-specific but developer interfaces around integration tools and query formulation may be quite generic and suited for generic RDF application development.
-
-
21:13
LOD2 Kick Off
» OpenLink Community BlogThe LOD2 kick off meeting was held in Leipzig on Sept 6-8. I will here talk about OpenLink plans as concerns LOD2; hence this is not to be taken as representative of the whole project. I will first discuss the immediate and conclude with the long term.
As concerns OpenLink specifically, we have two short term activities, namely publishing the initial LOD2 repository in December and publishing a set of RDB and RDF benchmarks in February.
The LOD2 repository is a fusion of the OpenLink LOD Cloud Cache (which includes data from URIBurner and PingTheSemanticWeb) and Sindice, both hosted at DERI. The value-add compared to Sindice or the Virtuoso-based LOD Cloud Cache alone is the merger of the timeliness and ping-ping crawling of Sindice with the SPARQL of Virtuoso.
Further down the road, after we migrate the system to the Virtuoso column store, we will also see gains in performance, primarily due to much better working set, as data is many times more compact than with the present row-wise key compression.
Still further, but before next September, we will have dynamic repartitioning; the time of availability is set as this is part of the LOD2 project roadmap. The operational need for this is pushed back somewhat by the compression gains from column-wise storage.
As for benchmarks, I just compiled a draft of suggested extensions to the BSBM (Berlin SPARQL Benchmark). I talked about this with Peter Boncz and Chris Bizer, to the effect that some extensions of BSBM could be done but that the time was a bit short for making a RDF-specific benchmark. We do recall that BSBM is fully feasible with a relational schema and that RDF offers no fundamental edge for the workload.
There was a graph benchmark talk at the TPC workshop at VLDB 2010. There too, the authors were suggesting a social network use case for benchmarking anything from RDF stores to graph libraries. The presentation did not include any specification of test data, so it may be that some cooperation is possible there. The need for such a benchmark is well acknowledged. The final form of this is not yet set but LOD2 will in time publish results from such.
We did informally talk about a process for publishing with our colleagues from Franz and Ontotext at VLDB 2010. The idea is that vendors tune their own systems and do the runs and that the others check on this, preferably all using the same hardware.
Now, the LOD2 benchmarks will also include relational-to-RDF comparisons, for example TPC-H in SQL and SPARQL. The SQL will be Virtuoso, MonetDB, and possibly VectorWise and others, depending on what legal restrictions apply at the time. This will give an RDF-to-SQL comparison of TPC-H at least on Virtuoso, later also on MonetDB, depending on the schedule for a MonetDB SPARQL front-end.
In the immediate term, this of course focuses our efforts on productizing the Virtuoso column store extension and the optimizations that go with it.
LOD2 is however about much more than database benchmarks. Over the longer term, we plan to apply suitable parts of the ground-breaking database research done at CWI to RDF use cases.
This involves anything from adaptive indexing, to reuse and caching of intermediate results, to adaptive execution. This is however more than just mapping column store concepts to RDF. New challenges are posed by running on clusters and dealing with more expressive queries than just SQL, in specific queries with Datalog-like rules and recursion.
LOD2 is principally about integration and alignment, from the schema to the instance level. This involves complex batch processing, close to the data, on large volumes of data. Map-reduce is not the be-all-end-all of this. Of course, a parallel database like Virtuoso, Greenplum, or Vertica can do map-reduce style operations under control of the SQL engine. After all, the SQL engine needs to do map-reduce and a lot more to provide good throughput for parallel, distributed SQL. Something like the Berkeley Orders Of Magnitude (BOOM) distributed Datalog implementation (Overlog, Deadalus, BLOOM) could be a parallel computation framework that would subsume any map-reduce-style functionality under a more elegant declarative framework while still leaving control of execution to the developer for the cases where this is needed.
From our viewpoint, the project's gains include:
-
Significant narrowing of the RDB to RDF performance gap. RDF will be an option for large scale warehousing, cutting down on time to integration by providing greater schema flexibility.
-
Ready to use toolbox for data integration, including schema alignment and resolution of coreference.
-
Data discovery, summarization and visualization
Integrating this into a relatively unified stack of tools is possible, since these all cluster around the task of linking the universe with RDF and linked data. In this respect the integration of results may be stronger than often seen in European large scale integrating projects.
The use cases fit the development profile well:
-
Wolters Kluwer will develop an application for integrating resources around law, from the actual laws to court cases to media coverage. The content is modeled in a fine grained legal ontology.
-
Exalead will implement the linked data enterprise, addressing enterprise search and any typical enterprise data integration plus generating added value from open sources.
-
The Open Knowledge Foundation will create a portal of all government published data for easy access by citizens.
In all these cases, the integration requirements of schema alignment, resolution of identity, information extraction, and efficient storage and retrieval play a significant role. The end user interfaces will be task-specific but developer interfaces around integration tools and query formulation may be quite generic and suited for generic RDF application development.
-
-
22:10
Perseus, Andromeda, and RDF
» OpenLink Community BlogIt has been several months since my last blog post. In this day and age of the attention economy, what gives me the insolence so to neglect my duty to mindshare?
Well, Perseus wasn't blogging or checking his email either, when he went to fetch the Gorgon's head. As Joseph Campbell puts it, the hero breaks into a world separate from the ordinary in order to bring back a blessing which will revitalize the community.
Thus, I deliberately withdrew from the public conversation, in faith that it would take care of itself and that I would still not be altogether forgotten. As it happens, I was confirmed in this when recently invited to submit a talk for the Semdata workshop at VLDB 2010.
Great deeds are not only personal accomplishments but also play a role in a broader context. The quest may appear remote and difficult to execute but its outcome can be quite tangible: Andromeda needed no elaborate sales pitch to convince her of the advantages of not being eaten by the sea serpent.
Thus right after the meeting in Sofia last March, I followed the vertical treasure map into the realm of first principles. As Perseus received advice from Athena, so was I informed by the Platonic ideas of locality and concurrency.
The great quests have an outer and inner aspect. Likewise here, bringing the ideas to physical reality gave me a great deal of material on cognitive function itself. For human and computer alike, it appears that the main reason why anything at all works is cache. Locality and parallelism again. Maybe I will say something more about memory, attention, interface, and paradigm some other time. On the other hand, such material is bound to be unpopular even if valid.
By now, you may ask yourself what I am talking about.
We remember that Andromeda's fix was due to her mother, Cassiopeia, having claimed greater beauty than the daughters of the sea-god Poseidon. To transpose the archetype into the present, it is like Tim B-L saying that OWLs (by the way sacred to Athena) are more semantic than Codd's brainchild. Yet the relational community sees RDF as something not quite serious. A matter of scale(s) — just think of the sea serpent.
So, I am talking about what I alluded to in the 2010 New Year's statement on this blog: RDF as a viable alternative to relational for big data. This means that RDF is no longer a specialty niche where, due to the hopeless task of bringing everything into a relational model, the fact of everything taking several times both the time and space is tolerated because there is no real alternative.
The value proposition is that for any current RDF user, the present assets will go four times farther than before with the next release of Virtuoso. For a prospective RDF user, the cost of keeping an ETLed RDF integration warehouse is now in the same ballpark as the relational cost, except that schema is now flexible, and the time to integrate and answer is accordingly shorter. For users of analytics-oriented RDBMS, the next Virtuoso is a full cluster-capable SQL column store. Its merits compared to others in this space will be published later with benchmarks like TPC-H. As an extra bonus for such users, Virtuoso brings SQL federation and a growth path to RDF, should this become interesting.
This is accomplished by introducing a new column-wise compressed-storage engine with corresponding changes to query execution. The general principles are explained in Daniel Abadi's famous Ph.D. thesis. The compression is tuned by the data itself, without user intervention. Further, our implementation remains capable of run-time-typing, thus the column-store advantages to RDF are obtained without going to a task-specific schema. But since data types, even if determined at run-time, are still in practice repetitive, the advantages of running on homogenous vectors are not lost.
When storing an RDF extraction of TPC-H data, we get a storage usage of 6.3 bytes per quad. If you do not care about queries where the predicate is unspecified, the storage requirement drops to 4.7 bytes per quad. Whether storing the data as RDF quads or as Vertica-style multicolumn projections, the working set is about the same. Since having enough of the data in memory is the sine qua non prerequisite of flexible querying, the point is made. QED.
In Virtuoso also, relational remains a bit faster but a penalty of 1.3x or so for RDF is quite tolerable, considering that a priori schema is no longer needed.
This means that we are coming into an age where the warehouse becomes an ad hoc asset, to be filled with RDF, without the need to develop an a priori universal schema for all data one may ever wish to integrate, now or in the future. The data can be stored as RDF and projected from there into any form that may be needed at any time, whether the target format is more RDF or a task-specific relational schema.
Availability is planned for late 2010, first as a Virtuoso Open Source preview.
-
22:10
Perseus, Andromeda, and RDF
» OpenLink Community BlogIt has been several months since my last blog post. In this day and age of the attention economy, what gives me the insolence so to neglect my duty to mindshare?
Well, Perseus wasn't blogging or checking his email either, when he went to fetch the Gorgon's head. As Joseph Campbell puts it, the hero breaks into a world separate from the ordinary in order to bring back a blessing which will revitalize the community.
Thus, I deliberately withdrew from the public conversation, in faith that it would take care of itself and that I would still not be altogether forgotten. As it happens, I was confirmed in this when recently invited to submit a talk for the Semdata workshop at VLDB 2010.
Great deeds are not only personal accomplishments but also play a role in a broader context. The quest may appear remote and difficult to execute but its outcome can be quite tangible: Andromeda needed no elaborate sales pitch to convince her of the advantages of not being eaten by the sea serpent.
Thus right after the meeting in Sofia last March, I followed the vertical treasure map into the realm of first principles. As Perseus received advice from Athena, so was I informed by the Platonic ideas of locality and concurrency.
The great quests have an outer and inner aspect. Likewise here, bringing the ideas to physical reality gave me a great deal of material on cognitive function itself. For human and computer alike, it appears that the main reason why anything at all works is cache. Locality and parallelism again. Maybe I will say something more about memory, attention, interface, and paradigm some other time. On the other hand, such material is bound to be unpopular even if valid.
By now, you may ask yourself what I am talking about.
We remember that Andromeda's fix was due to her mother, Cassiopeia, having claimed greater beauty than the daughters of the sea-god Poseidon. To transpose the archetype into the present, it is like Tim B-L saying that OWLs (by the way sacred to Athena) are more semantic than Codd's brainchild. Yet the relational community sees RDF as something not quite serious. A matter of scale(s) — just think of the sea serpent.
So, I am talking about what I alluded to in the 2010 New Year's statement on this blog: RDF as a viable alternative to relational for big data. This means that RDF is no longer a specialty niche where, due to the hopeless task of bringing everything into a relational model, the fact of everything taking several times both the time and space is tolerated because there is no real alternative.
The value proposition is that for any current RDF user, the present assets will go four times farther than before with the next release of Virtuoso. For a prospective RDF user, the cost of keeping an ETLed RDF integration warehouse is now in the same ballpark as the relational cost, except that schema is now flexible, and the time to integrate and answer is accordingly shorter. For users of analytics-oriented RDBMS, the next Virtuoso is a full cluster-capable SQL column store. Its merits compared to others in this space will be published later with benchmarks like TPC-H. As an extra bonus for such users, Virtuoso brings SQL federation and a growth path to RDF, should this become interesting.
This is accomplished by introducing a new column-wise compressed-storage engine with corresponding changes to query execution. The general principles are explained in Daniel Abadi's famous Ph.D. thesis. The compression is tuned by the data itself, without user intervention. Further, our implementation remains capable of run-time-typing, thus the column-store advantages to RDF are obtained without going to a task-specific schema. But since data types, even if determined at run-time, are still in practice repetitive, the advantages of running on homogenous vectors are not lost.
When storing an RDF extraction of TPC-H data, we get a storage usage of 6.3 bytes per quad. If you do not care about queries where the predicate is unspecified, the storage requirement drops to 4.7 bytes per quad. Whether storing the data as RDF quads or as Vertica-style multicolumn projections, the working set is about the same. Since having enough of the data in memory is the sine qua non prerequisite of flexible querying, the point is made. QED.
In Virtuoso also, relational remains a bit faster but a penalty of 1.3x or so for RDF is quite tolerable, considering that a priori schema is no longer needed.
This means that we are coming into an age where the warehouse becomes an ad hoc asset, to be filled with RDF, without the need to develop an a priori universal schema for all data one may ever wish to integrate, now or in the future. The data can be stored as RDF and projected from there into any form that may be needed at any time, whether the target format is more RDF or a task-specific relational schema.
Availability is planned for late 2010, first as a Virtuoso Open Source preview.
-
22:09
VLDB Semdata Workshop - The New Frontier of Semdata
» OpenLink Community BlogThis is a revised version of the talk I will be giving at the Semdata workshop at VLDB 2010.
The paper shows how we store TPC-H data as RDF with relational-level efficiency and how we query both RDF and relational versions in comparable time. We also compare row-wise and column-wise storage formats as implemented in Virtuoso.
A question that has come up a few times during the Semdata initiative is how semantic data will avoid the fate of other would-be database revolutions like OODBMS and deductive databases.
The need and opportunity are driven by the explosion of data in quantity and diversity of structure. The competition consists of analytics RDBMS, point solutions done with map-reduce or the like, and lastly in some cases from key-value stores with relaxed schema but limited querying.
The benefits of RDF are the ever expanding volume of data published in it, reuse of vocabulary, and well-defined semantics. The downside is efficiency. This is not so much a matter of absolute scalability — you can run an RDF database on a cluster — but a question of relative cost as opposed to alternatives.
The baseline is that for relational-style queries, one should get relational performance or close enough. We outline in the paper how RDF reduces to a run-time-typed relational column-store, and gets all the compression and locality advantages traditionally associated with such. After memory is no longer the differentiator, the rest is engineering. So much for the scalability barrier to adoption.
I do not need to talk here about the benefits of linked data and more or less ad hoc integration per se. But again, to make these practical, there are logistics to resolve: How to keep data up to date? How to distribute it incrementally? How to monetize freshness? We propose some solutions for these, looking at diverse-RDF replication and RDB-to-RDF replication in Virtuoso.
But to realize the ultimate promise of RDF/Linked Data/Semdata, however we call it, we must look farther into the landscape of what is being done with big data. Here we are no longer so much running against the RDBMS, but against map-reduce and key-value stores.
Given the psychology of geekdom, the charm of map-reduce is understandable: One controls what is going on, can work in the usual languages, can run on big iron without being picked to pieces by the endless concurrency and timing and order-of-events issues one gets when programming a cluster. Tough for the best, and unworkable for the rest.
The key-value store has some of the same appeal, as it is the DBMS laid bare, so to say, made understandable, without the again intractably-complex questions of fancy query planning and distributed ACID transactions. The psychological rewards of the sense of control are there, never mind the complex query; one can always hard code a point solution for the business question, if really must — maybe even in map-reduce.
Besides, for some things that go beyond SQL (for example, with graph structures), there really isn't a good solution.
Now, enter Vertica, Greenplum, VectorWise (a MonetDB project derivative from Ingres) and Virtuoso, maybe others, who all propose some combination of SQL- and explicit map-reduce-style control structures. This is nice but better is possible.
Here we find the next frontier of Semdata. Take Joe Hellerstein et al's work on declarative logic for the data centric data center.
We have heard it many times — when the data is big, the logic must go to it. We can take declarative, location-conscious rules, à la BOOM and BLOOM, and combine these with the declarative query, well-defined semantics, parallel-database capability of the leading RDF stores. Merge this with locality compression and throughput from the best analytics DBMS.
Here we have a data infrastructure that subsumes map-reduce as a special case of arbitrary distributed-parallel control flow, can send the processing to the data, and has flexible queries and schema-last capability.
Further, since RDF more or less reduces to relational columns, the techniques of caching and reuse and materialized joins and demand-driven indexing, à la MonetDB, are applicable with minimal if any adaptation.
Such a hybrid database-fusion frontier is relevant because it addresses heterogenous, large-scale data, with operations that are not easy to reduce to SQL, still without loss of the advantages of SQL. Apply this to anything from enhancing the business intelligence process by faster integration, including integration with linked open data to the map-reduce bulk processing of today. Do it with strong semantics and inference close to the data.
In short, RDF stays relevant by tackling real issues, with scale second to none, and decisive advantages in time-to-integrate and expressive power.
Last week I was at the LOD2 kick off and a LarKC meeting. The capabilities envisioned in this and the following post mirror our commitments to the EU co-funded LOD2 project. This week is VLDB and the Semdata workshop. I will talk more about how these trends are taking shape within the Virtuoso product development roadmap in future posts.
-
22:09
VLDB Semdata Workshop - The New Frontier of Semdata
» OpenLink Community BlogThis is a revised version of the talk I will be giving at the Semdata workshop at VLDB 2010.
The paper shows how we store TPC-H data as RDF with relational-level efficiency and how we query both RDF and relational versions in comparable time. We also compare row-wise and column-wise storage formats as implemented in Virtuoso.
A question that has come up a few times during the Semdata initiative is how semantic data will avoid the fate of other would-be database revolutions like OODBMS and deductive databases.
The need and opportunity are driven by the explosion of data in quantity and diversity of structure. The competition consists of analytics RDBMS, point solutions done with map-reduce or the like, and lastly in some cases from key-value stores with relaxed schema but limited querying.
The benefits of RDF are the ever expanding volume of data published in it, reuse of vocabulary, and well-defined semantics. The downside is efficiency. This is not so much a matter of absolute scalability — you can run an RDF database on a cluster — but a question of relative cost as opposed to alternatives.
The baseline is that for relational-style queries, one should get relational performance or close enough. We outline in the paper how RDF reduces to a run-time-typed relational column-store, and gets all the compression and locality advantages traditionally associated with such. After memory is no longer the differentiator, the rest is engineering. So much for the scalability barrier to adoption.
I do not need to talk here about the benefits of linked data and more or less ad hoc integration per se. But again, to make these practical, there are logistics to resolve: How to keep data up to date? How to distribute it incrementally? How to monetize freshness? We propose some solutions for these, looking at diverse-RDF replication and RDB-to-RDF replication in Virtuoso.
But to realize the ultimate promise of RDF/Linked Data/Semdata, however we call it, we must look farther into the landscape of what is being done with big data. Here we are no longer so much running against the RDBMS, but against map-reduce and key-value stores.
Given the psychology of geekdom, the charm of map-reduce is understandable: One controls what is going on, can work in the usual languages, can run on big iron without being picked to pieces by the endless concurrency and timing and order-of-events issues one gets when programming a cluster. Tough for the best, and unworkable for the rest.
The key-value store has some of the same appeal, as it is the DBMS laid bare, so to say, made understandable, without the again intractably-complex questions of fancy query planning and distributed ACID transactions. The psychological rewards of the sense of control are there, never mind the complex query; one can always hard code a point solution for the business question, if really must — maybe even in map-reduce.
Besides, for some things that go beyond SQL (for example, with graph structures), there really isn't a good solution.
Now, enter Vertica, Greenplum, VectorWise (a MonetDB project derivative from Ingres) and Virtuoso, maybe others, who all propose some combination of SQL- and explicit map-reduce-style control structures. This is nice but better is possible.
Here we find the next frontier of Semdata. Take Joe Hellerstein et al's work on declarative logic for the data centric data center.
We have heard it many times — when the data is big, the logic must go to it. We can take declarative, location-conscious rules, à la BOOM and BLOOM, and combine these with the declarative query, well-defined semantics, parallel-database capability of the leading RDF stores. Merge this with locality compression and throughput from the best analytics DBMS.
Here we have a data infrastructure that subsumes map-reduce as a special case of arbitrary distributed-parallel control flow, can send the processing to the data, and has flexible queries and schema-last capability.
Further, since RDF more or less reduces to relational columns, the techniques of caching and reuse and materialized joins and demand-driven indexing, à la MonetDB, are applicable with minimal if any adaptation.
Such a hybrid database-fusion frontier is relevant because it addresses heterogenous, large-scale data, with operations that are not easy to reduce to SQL, still without loss of the advantages of SQL. Apply this to anything from enhancing the business intelligence process by faster integration, including integration with linked open data to the map-reduce bulk processing of today. Do it with strong semantics and inference close to the data.
In short, RDF stays relevant by tackling real issues, with scale second to none, and decisive advantages in time-to-integrate and expressive power.
Last week I was at the LOD2 kick off and a LarKC meeting. The capabilities envisioned in this and the following post mirror our commitments to the EU co-funded LOD2 project. This week is VLDB and the Semdata workshop. I will talk more about how these trends are taking shape within the Virtuoso product development roadmap in future posts.
-
3:25
Solving Real Problems by Leveraging Linked Data: Unambiguous & Verifiable Identity for HTTP Networks
» OpenLink Community BlogProblem: Unambiguous Verifiable Network Identity.How Does Linked Data Address This Problem? It provides critical infrastructure for the WebID Protocol that enables an innovative tweak of SSL/TLS.
What about OpenID? The WebID Protocol embraces and extends OpenID (in an open and positive way) via the WebID + OpenID Hybrid variant of the protocol -- basic effect is that OpenID calls are re-routed to the WebID aspect which simply removes Username and Password Authentication from the authentication challenge interaction pattern.
WebID Components- X.509 Certificate and Private Key Generator
- Structured Profile Document (e.g. a FOAF based Profile) published to an HTTP Network (e.g. World Wide Web) and accessible at an Address (URL)
- An Agent Identifier aka. WebID (an HTTP Name Reference re. URI variant) that's the Subject of a Structured Profile Document (actually a Descriptor Resource)
- Mechanism for persisting Public Key data from X.509 Certificate to Structured Profile Document and associating it with Subject WebID (e.g. SPARUL or other HTTP based methods)
- Mechanism for de-referencing Public Key data associated with a WebID (from its Structured Profile Document) for comparison against Public Key data following successful standard SSL/TLS protocol handshake (e.g. via SPARQL Query).
- WebID + OpenID Hybrid Protocol Demo using ODS, Stackoverflow.com, and identi.ca. - YouTube Screencast Demo Part 1 using Firefox
- WebID + OpenID Hybrid Protocol Demo using ODS, Stackoverflow.com, and identi.ca. - YouTube Screencast Demo Part 2 using Safari
-
21:09
Data 3.0 (a Manifesto for Platform Agnostic Structured Data) Update 5
» OpenLink Community BlogAfter a long period of trying to demystify and unravel the wonders of standards compliant structured data access, combined with protocols (e.g., [HTTP)] that separate:
- Identity,
- Access,
- Storage,
- Representation, and
- Presentation.
I ended up with what I can best describe as the Data 3.0 Manifesto. A manifesto for standards complaint access to structured data object (or entity) descriptors.
Some Related WorkAlex James (Program Manager Entity Frameworks at Microsoft), put together something quite similar to this via his Base4 blog (around the Web 2.0 bootstrap time), sadly -- quoting Alex -- that post has gone where discontinued blogs and their host platforms go (deep deep irony here).
It's also important to note that this manifesto is also a variant of the TimBL's Linked Data Design Issues meme re. Linked Data, but totally decoupled from RDF (data representation formats aspect) and SPARQL which -- in my world view -- remain implementation details.
Data 3.0 manifesto- An "Entity" is the "Referent" of an "Identifier."
- An "Identifier" SHOULD provide a global, unambiguous, and unchanging (though it MAY be opaque!) "Name" for its "Referent".
- A "Referent" MAY have many "Identifiers" (Names), but each "Identifier" MUST have only one "Referent".
- Structured Entity Descriptions SHOULD be based on the Entity-Attribute-Value (EAV) Data Model, and SHOULD therefore take the form of one or more 3-tuples (triples), each comprised of:
- an "Identifier" that names an "Entity" (i.e., Entity Name),
- an "Identifier" that names an "Attribute" (i.e., Attribute Name), and
- an "Attribute Value", which may be an "Identifier" or a "Literal".
- Structured Descriptions SHOULD be CARRIED by "Descriptor Documents" (i.e., purpose specific documents where Entity Identifiers, Attribute Identifiers, and Attribute Values are clearly discernible by the document's intended consumers, e.g., humans or machines).
- Structured Descriptor Documents can contain (carry) several Structured Entity Descriptions
- Stuctured Descriptor Documents SHOULD be network accessible via network addresses (e.g., HTTP URLs when dealing with [HTTP-based] Networks).
- An Identifier SHOULD resolve (de-reference) to a Structured Representation of the Referent's Structured Description.
- Referent, Identifier, and Descriptor/Sense (The Data Perception Trinity) illustration
- Referent, Identifier, and Descriptor/Sense Trinity (as exploited in FOAF+SSL based Secure WebIDs) illustration
- Demystifying Linked Data via EAV Model based Structured Descriptions
- What do people have against URIs and URLs?
- The URI, URL, and Linked Data Meme's Generic HTTP URI
- Simple Explanation of RDF and Linked Data Dynamics
- Linked Data and Identity
- FOAF+SSL FAQ
- LOD Community Thread (showing evolution of this manifesto based on feedback from members such as Richard Cyganiak).
- Googlebase Data API Docs
- Google Data Protocol (GData)
- Microsoft's OData Protocol
- Magic of De-referencable Names and actual Data via Binky Video
- Social Objects Presentation (aka. Social Linked Data Objects) - by Jyri Engeström
- What's a Reference?
-
22:21
Transactional High Availability in Virtuoso Cluster Edition
» OpenLink Community BlogIntroductionThis post discusses the technical specifics of how we accomplish smooth transactional operation in a database server cluster under different failure conditions. (A higher-level short version was posted last week.) The reader is expected to be familiar with the basics of distributed transactions.
Someone on a cloud computing discussion list called two-phase commit (2PC) the "anti-availability protocol." There is indeed a certain anti-SQL and anti-2PC sentiment out there, with key-value stores and "eventual consistency" being talked about a lot. Indeed, if we are talking about wide-area replication over high-latency connections, then 2PC with synchronously-sharp transaction boundaries over all copies is not really workable.
For multi-site operations, a level of eventual consistency is indeed quite unavoidable. Exactly what the requirements are depends on the application, so I will focus here on operations inside one site.
The key-value store culture seems to focus on workloads where a record is relatively self-contained. The record can be quite long, with repeating fields, different selections of fields in consecutive records, and so forth. Such a record would typically be split over many tables of a relational schema. In the RDF world, such a record would be split even wider, with the information needed to reconstitute the full record almost invariably split over many servers. This comes from the mapping between the text of URIs and their internal IDs being partitioned in one way, and the many indices on the RDF quads each in yet another way.
So it comes to pass that in the data models we are most interested in, the application-level entity (e.g., a user account in a social network) is not a contiguous unit with a single global identifier. The social network user account, that the key-value store would consider a unit of replication mastering and eventual consistency, will be in RDF or SQL a set of maybe hundreds of tuples, each with more than one index, nearly invariably spanning multiple nodes of the database cluster.
So, before we can talk about wide-area replication and eventual consistency with application-level semantics, we need a database that can run on a fair-sized cluster and have cast-iron consistency within its bounds. If such a cluster is to be large and is to operate continuously, it must have some form of redundancy to cover for hardware failures, software upgrades, reboots, etc., without interruption of service.
This is the point of the design space we are tackling here.
Non Fault-Tolerant OperationThere are two basic modes of operation we cover: bulk load, and online transactions.
In the case of bulk load, we start with a consistent image of the database; load data; and finish by making another consistent image. If there is a failure during load, we lose the whole load, and restart from the initial consistent image. This is quite simple and is not properly transactional. It is quicker for filling a warehouse but is not to be used for anything else. In the remainder, we will only talk about online transactions.
When all cluster nodes are online, operation is relatively simple. Each entry of each index belongs to a partition that is determined by the values of one or more partitioning columns of said index. There are no tables separate from indices; the relational row is on the index leaf of its primary key. Secondary indices reference the row by including the primary key. Blobs are in the same partition as the row which contains the blob. Each partition is then stored on a "cluster node." In non fault-tolerant operations, each such cluster node is a single process with exclusive access to its own permanent storage, consisting of database files and logs; i.e., each node is a single server instance. It does not matter if the storage is local or on a SAN, the cluster node is still the only one accessing it.
When things are not fault tolerant, transactions work as follows:
When there are updates, two-phase commit is used to guarantee a consistent result. Each transaction is coordinated by one cluster node, which issues the updates in parallel to all cluster nodes concerned. Sending two update messages instead of one does not significantly impact latency. The coordinator of each transaction is the primary authority for the transaction's outcome. If the coordinator of the transaction dies between the phases of the commit, the transaction branches stay in the prepared state until the coordinator is recovered and can be asked again about the outcome of the transaction. Likewise, if a non-coordinating cluster node with a transaction branch dies between the phases, it will do a roll-forward and ask the coordinator for the outcome of the transaction.
If cluster nodes occasionally crash and then recover relatively quickly, without ever losing transaction logs or database files, this is resilient enough. Everything is symmetrical; there are no cluster nodes with special functions, except for one master node that has the added task of resolving distributed deadlocks.
I suppose our anti-SQL person called 2PC "anti-availability" because in the above situation we have the following problems: if any one cluster node is offline, it is quite likely that no transaction can be committed. This is so unless the data is partitioned on a key with application semantics, and all data touched by a transaction usually stays within a single partition. Then operations could proceed on most of the data while one cluster node was recovering. But, especially with RDF, this is never the case, since keys are partitioned in ways that have nothing to do with application semantics. Further, if one uses XA or Microsoft DTC with the monitor on a single box, this box can become a bottleneck and/or a single point of failure. (Among other considerations, this is why Virtuoso does not rely on any such monitor.) Further, if a cluster node dies never to be heard of again, leaving prepared but uncommitted transaction branches, the rest of the system has no way of telling what to do with them, again unless relying on a monitor that is itself liable to fail.
If transactions have a real world counterpart, it is possible, at least in theory, to check the outcome against the real world state: One can ask a customer if an order was actually placed or a shipment delivered. But when a transaction has to do with internal identifiers of things, for example whether
Fault-Tolerant Operationmailto://plaidskirt@hotdate.comhas internal ID0xacebabe, such a check against external reality is not possible.In a fault tolerant setting, we introduce the following extra elements: Cluster nodes are comprised of "quorums" of mutually-mirroring server instances. Each such quorum holds a partition of the data. Such a quorum typically consists of two server instances, but may have three for extra safety. If all server instances in the quorum are offline, then the cluster node is offline, and the cluster is not fully operational. If at least one server instance in a quorum is online, then the cluster node is online, and the cluster is operational and can process new transactions.
We designate one cluster node (i.e., one quorum of 2 or 3 server instances) to act as a master node, and we set an order of precedence among its member instances. In addition to arbitrating distributed deadlocks, the master instance on duty will handle reports of server instance failures, and answer questions about any transactions left hanging in prepared state by a dead transaction coordinator. If the master on duty fails, the next master in line will either notice this itself in the line of normal business or get a complaint from another server instance about not being able to contact the previous master.
There is no global heartbeat messaging per se, but since connections between server instances are reused long-term, a dropped connection will be noticed and the master on duty will be notified. If all masters are unavailable, that entire quorum (i.e., the master node) is offline and thus (as with any entire node going offline) most operations will fail anyway, unless by chance they do not hit any data managed by that failed quorum.
When it receives a notice of unavailability, the master instance on duty tries to contact the unavailable server instance and if it fails, it will notify all remaining instances that that server instance is removed from the cluster. The effect is that the remaining server instances will stop attempting to access the failed instance. Updates to the partitions managed by the failed server instance are no longer sent to it, which results in updates to this data succeeding, as they are made against the other server instances in that quorum. Updates to the data of the failed server instance will fail in the window of time between the actual failure and the removal, which is typically well under a second. The removal of a failed server instance is delegated to a central authority in order not to have everybody get in each other's way when trying to effect the removal.
If the failed server instance left prepared uncommitted transactions behind, the server instances having such branches will in due order contact the transaction coordinator to ask what should be done. This is a normal procedure for dealing with possibly dropped commit or rollback messages. When they discover that the coordinator has been removed, the master on duty will be contacted instead. Each prepare message of a transaction lists all the server instances participating in the transaction; thus the master can check whether each has received the prepare. If all have the prepare and none has an abort, the transaction is committed. The dead coordinator may not know this or may indeed not have the transaction logged, since it sends the prepares before logging its own prepare. The recovery will handle this though. We note that of the remaining branches, there is at least one copy of the branch with the failed server instance, or else we would have a whole quorum failed. In cases where there are branches participating in an unresolved transaction where all the quorum members have failed, the system cannot decide the outcome, and will periodically retry until at least one member of the failed quorum becomes available.
The most complex part of the protocol is the recovery of a failed server instance. The recovery starts with a normal roll forward from the local transaction log. After this, the server instance will contact the master on duty to ask for its state. Typically, the master will reply that the recovering server instance had been removed and is out of date. When this is established, the recovering server instance will contact a live member of its quorum and ask for sync. The failed server instance has an approximate timestamp of its last received transaction. It knows this from the roll forward, where time markers are interspersed now and then between transaction records. The live partner then sends its transaction log(s) covering the time from a few seconds before the last transaction of the failed partner up to the present. A few transactions may get rolled forward twice but this does no harm, since these records have absolute values and no deltas and the second insert of a key is simply ignored. When the sender of the log reaches its last committed log entry, it asks the recovering server instance to confirm successful replay of the log so far. Having the confirmation, the sender will abort all unprepared transactions affecting it and will not accept any new ones until the sync is completed. If new transactions were committed between sending the last of the log and killing the uncommitted new transactions, these too are shipped to the recovering server instance in their committed or prepared state. When these are also confirmed replayed, the recovering server instance is in exact sync up to the transaction. The sender then notifies the rest of the cluster that the sync is complete and that the recovered server instance will be included in any updates of its slice of the data. The time between freeze and re-enable of transactions is the time to replay what came in between the first sync and finishing the freeze. Typically nothing came in, so the time is in milliseconds. If an application got its transaction killed in this maneuver, it will be seen as a deadlock.
If the recovering server instance received transactions in prepared state, it will ask about their outcome as a part of the periodic sweep through pending transactions. One of these transactions could have been one originally prepared by itself, where the prepares had gone out before it had time to log the transaction. Thus, this eventuality too is covered and has a consistent outcome. Failures can interrupt the recovery process. The recovering server instance will have logged as far as it got, and will pick up from this point onward. Real time clocks on the host nodes of the cluster will have to be in approximate sync, within a margin of a minute or so. This is not a problem in a closely connected network.
For simultaneous failure of a entire quorum of server instances (i.e., a set of mutually-mirroring partners; a cluster node), the rule is that the last one to fail must be the first to come back up. In order to have uninterrupted service across arbitrary double failures, one must store things in triplicate; statistically, however, most double failures will not hit cluster nodes of the same group.
The protocol for recovery of failed server instances of the master quorum (i.e., the master cluster node) is identical, except that a recovering master will have to ask the other master(s) which one is more up to date. If the recovering master has a log entry of having excluded all other masters in its quorum from the cluster, it can come back online without asking anybody. If there is no such entry, it must ask the other master(s). If all had failed at the exact same instant, none has an entry of the other(s) being excluded and all will know that they are in the same state since any update to one would also have been sent to the other(s).
Failure of Storage MediaWhen a server instance fails, its permanent storage may or may not survive. Especially with mirrored disks, storage most often survives a failure. However, the survival of the database does not depend on any single server instance retaining any permanent storage over failure. If storage is left in place, as in the case of an OS reboot or replacing a faulty memory chip, rejoining the cluster is done based on the existing copy of the database on the server instance. if there is no existing copy, a copy can be taken from any surviving member of the same quorum. This consists of the following steps: First, a log checkpoint is forced on the surviving instance. Normally log checkpoints are done at regular intervals, independently on each server instance. The log checkpoint writes a consistent state of the database to permanent storage. The disk pages forming this consistent image will not be written to until the next log checkpoint. Therefore copying the database file is safe and consistent as long as a log checkpoint does not take place between the start and end of copy. Thus checkpoints are disabled right after the initial checkpoint. The copy can take a relatively long time; consider 20s per gigabyte on a 1GbE network a good day. At the end of copy, checkpoints are re-enabled on the surviving cluster node. The recovering database starts without a log, sees the timestamp of the checkpoint in the database, and asks for transactions from just before this time up to present. The recovery then proceeds as outlined above.
Network FailuresThe CAP theorem states that Consistency, Availability, and Partition-tolerance do not mix. "Partition" here means the split of a network.
It is trivially true that if the network splits so that on both sides there is a copy of each partition of the data, both sides will think themselves the live copy left online after the other died, and each will thus continue to accumulate updates. Such an event is not very probable within one site where all machines are redundantly connected to two independent switches. Most servers have dual 1GbE on the motherboard, and both ports should be used for cluster interconnect for best performance, with each attached to an independent switch. Both switches would have to fail in such a way as to split their respective network for a single-site network split to happen. Of course, the likelihood of a network split in multi-site situations is higher.
One way of guarding against network splits is to require that at least one partition of the data have all copies online. Additionally, the master on duty can request each cluster node or server instance it expects to be online to connect to every other node or instance, and to report which they could reach. If the reports differ, there is a network problem. This procedure can be performed using both interfaces or only the first or second interface of each server to determine if one of the switches selectively blocks some paths. These simple sanity checks protect against arbitrary network errors. Using TCP for inter-cluster-node communication in principle protects against random message loss, but the Virtuoso cluster protocols do not rely on this. Instead, there are protocols for retry of any transaction messages and for using keep-alive messages on any long-running functions sent across the cluster. Failure to get a keep-alive message within a certain period will abort a query even if the network connections look OK.
Backups, and Recovery from Loss of Entire SiteFor a constantly-operating distributed system, it is hard to define what exactly constitutes a consistent snapshot. The checkpointed state on each cluster node is consistent as far as this cluster node is concerned (i.e., it contains no uncommitted data), but the checkpointed states on all the cluster nodes are not from exactly the same moment in time. The complete state of a cluster is the checkpoint state of each cluster node plus the current transaction log of each. If the logs were shipped in real time to off-site storage, a consistent image could be reconstructed from them. Since such shipping cannot be synchronous due to latency considerations, some transactions could be received only in part in the event of a failure of the off-site link. Such partial transactions can however be detected at reconstruction time because each record contains the list of all participants of the transaction. If some piece is found missing, the whole can be discarded. In this way integrity is guaranteed but it is possible that a few milliseconds worth of transactions get lost. In these cases, the online client will almost certainly fail to get the final success message and will recheck the status after recovery.
For business continuity purposes, a live feed of transactions can be constantly streamed off-site, for example to a cloud infrastructure provider. One low-cost virtual machine on the cloud will typically be enough for receiving the feed. In the event of long-term loss of the whole site, replacement servers can be procured on the cloud; thus, capital is not tied up in an aging inventory of spare servers. The cloud-based substitute can be maintained for the time it takes to rebuild an owned infrastructure, which is still at present more economical than a cloud-only solution.
Switching a cluster from an owned site to the cloud could be accomplished in a few hours. The prerequisite of this is that there are reasonably recent snapshots of the database files, so that replay of logs does not take too long. The bulk of the time taken by such a switch would be in transferring the database snapshots from S3 or similar to the newly provisioned machines, formatting the newly provisioned virtual disks, etc.
Rehearsing such a maneuver beforehand is quite necessary for predictable execution. We do not presently have a productized set of tools for such a switch, but can advise any interested parties on implementing and testing such a disaster recovery scheme.
ConclusionsIn conclusion, we have shown how we can have strong transactional guarantees in a database cluster without single points of failure or performance penalties when compared with a non fault-tolerant cluster. Operator intervention is not required for anything short of hardware failure. Recovery procedures are simple, at most consisting of installing software and copying database files from a surviving cluster node. Unless permanent storage is lost in the failure, not even this is required. Real-time off-site log shipment can easily be added to these procedures to protect against site-wide failures.
Future work may be directed toward concurrent operation of geographically-distributed data centers with eventual consistency. Such a setting would allow for migration between sites in the event of whole-site failures, and for reconciliation between inconsistent histories of different halves of a temporarily split network. Such schemes are likely to require application-level logic for reconciliation and cannot consist of an out-of-the-box DBMS alone. All techniques discussed here are application-agnostic and will work equally well for Graph Model (e.g., RDF) and Relational Model (e.g., SQL) workloads.
Glossary- Virtuoso Cluster (VC) -- a collection of Virtuoso Cluster Nodes on one or more machines, working in parallel as part of a Virtuoso Cluster.
- Virtuoso Cluster Node (VCN) -- a Virtuoso Server Instance (Non Fault-Tolerant Operations), or a Quorum of Server Instances (Fault Tolerant Operations), which is a member of a collection of Virtuoso Cluster Nodes working in parallel as part of a Virtuoso Cluster.
- Virtuoso Host Cluster (VHC) -- a collection of machines, each hosting one or more Virtuoso Server Instances, making up a Virtuoso Cluster.
- Virtuoso Host Cluster Node (VHCN) -- a machine hosting one or more Virtuoso Server Instances that are members of a Virtuoso Cluster.
- Virtuoso Server Instance (VSI) -- a single Virtuoso process with exclusive access to its own permanent storage, consisting of database files and logs. May comprise an entire Virtuoso Cluster Node (Non Fault-Tolerant Operations), or be one member of a quorum which comprises a Virtuoso Cluster Node (Fault Tolerant Operations).
- Special Relativity and the Problem of Database Scalability (PDF), by James Starkey of NimbusDB, Inc.
-
22:21
Transactional High Availability in Virtuoso Cluster Edition
» OpenLink Community BlogIntroductionThis post discusses the technical specifics of how we accomplish smooth transactional operation in a database server cluster under different failure conditions. (A higher-level short version was posted last week.) The reader is expected to be familiar with the basics of distributed transactions.
Someone on a cloud computing discussion list called two-phase commit (2PC) the "anti-availability protocol." There is indeed a certain anti-SQL and anti-2PC sentiment out there, with key-value stores and "eventual consistency" being talked about a lot. Indeed, if we are talking about wide-area replication over high-latency connections, then 2PC with synchronously-sharp transaction boundaries over all copies is not really workable.
For multi-site operations, a level of eventual consistency is indeed quite unavoidable. Exactly what the requirements are depends on the application, so I will focus here on operations inside one site.
The key-value store culture seems to focus on workloads where a record is relatively self-contained. The record can be quite long, with repeating fields, different selections of fields in consecutive records, and so forth. Such a record would typically be split over many tables of a relational schema. In the RDF world, such a record would be split even wider, with the information needed to reconstitute the full record almost invariably split over many servers. This comes from the mapping between the text of URIs and their internal IDs being partitioned in one way, and the many indices on the RDF quads each in yet another way.
So it comes to pass that in the data models we are most interested in, the application-level entity (e.g., a user account in a social network) is not a contiguous unit with a single global identifier. The social network user account, that the key-value store would consider a unit of replication mastering and eventual consistency, will be in RDF or SQL a set of maybe hundreds of tuples, each with more than one index, nearly invariably spanning multiple nodes of the database cluster.
So, before we can talk about wide-area replication and eventual consistency with application-level semantics, we need a database that can run on a fair-sized cluster and have cast-iron consistency within its bounds. If such a cluster is to be large and is to operate continuously, it must have some form of redundancy to cover for hardware failures, software upgrades, reboots, etc., without interruption of service.
This is the point of the design space we are tackling here.
Non Fault-Tolerant OperationThere are two basic modes of operation we cover: bulk load, and online transactions.
In the case of bulk load, we start with a consistent image of the database; load data; and finish by making another consistent image. If there is a failure during load, we lose the whole load, and restart from the initial consistent image. This is quite simple and is not properly transactional. It is quicker for filling a warehouse but is not to be used for anything else. In the remainder, we will only talk about online transactions.
When all cluster nodes are online, operation is relatively simple. Each entry of each index belongs to a partition that is determined by the values of one or more partitioning columns of said index. There are no tables separate from indices; the relational row is on the index leaf of its primary key. Secondary indices reference the row by including the primary key. Blobs are in the same partition as the row which contains the blob. Each partition is then stored on a "cluster node." In non fault-tolerant operations, each such cluster node is a single process with exclusive access to its own permanent storage, consisting of database files and logs; i.e., each node is a single server instance. It does not matter if the storage is local or on a SAN, the cluster node is still the only one accessing it.
When things are not fault tolerant, transactions work as follows:
When there are updates, two-phase commit is used to guarantee a consistent result. Each transaction is coordinated by one cluster node, which issues the updates in parallel to all cluster nodes concerned. Sending two update messages instead of one does not significantly impact latency. The coordinator of each transaction is the primary authority for the transaction's outcome. If the coordinator of the transaction dies between the phases of the commit, the transaction branches stay in the prepared state until the coordinator is recovered and can be asked again about the outcome of the transaction. Likewise, if a non-coordinating cluster node with a transaction branch dies between the phases, it will do a roll-forward and ask the coordinator for the outcome of the transaction.
If cluster nodes occasionally crash and then recover relatively quickly, without ever losing transaction logs or database files, this is resilient enough. Everything is symmetrical; there are no cluster nodes with special functions, except for one master node that has the added task of resolving distributed deadlocks.
I suppose our anti-SQL person called 2PC "anti-availability" because in the above situation we have the following problems: if any one cluster node is offline, it is quite likely that no transaction can be committed. This is so unless the data is partitioned on a key with application semantics, and all data touched by a transaction usually stays within a single partition. Then operations could proceed on most of the data while one cluster node was recovering. But, especially with RDF, this is never the case, since keys are partitioned in ways that have nothing to do with application semantics. Further, if one uses XA or Microsoft DTC with the monitor on a single box, this box can become a bottleneck and/or a single point of failure. (Among other considerations, this is why Virtuoso does not rely on any such monitor.) Further, if a cluster node dies never to be heard of again, leaving prepared but uncommitted transaction branches, the rest of the system has no way of telling what to do with them, again unless relying on a monitor that is itself liable to fail.
If transactions have a real world counterpart, it is possible, at least in theory, to check the outcome against the real world state: One can ask a customer if an order was actually placed or a shipment delivered. But when a transaction has to do with internal identifiers of things, for example whether
Fault-Tolerant Operationmailto://plaidskirt@hotdate.comhas internal ID0xacebabe, such a check against external reality is not possible.In a fault tolerant setting, we introduce the following extra elements: Cluster nodes are comprised of "quorums" of mutually-mirroring server instances. Each such quorum holds a partition of the data. Such a quorum typically consists of two server instances, but may have three for extra safety. If all server instances in the quorum are offline, then the cluster node is offline, and the cluster is not fully operational. If at least one server instance in a quorum is online, then the cluster node is online, and the cluster is operational and can process new transactions.
We designate one cluster node (i.e., one quorum of 2 or 3 server instances) to act as a master node, and we set an order of precedence among its member instances. In addition to arbitrating distributed deadlocks, the master instance on duty will handle reports of server instance failures, and answer questions about any transactions left hanging in prepared state by a dead transaction coordinator. If the master on duty fails, the next master in line will either notice this itself in the line of normal business or get a complaint from another server instance about not being able to contact the previous master.
There is no global heartbeat messaging per se, but since connections between server instances are reused long-term, a dropped connection will be noticed and the master on duty will be notified. If all masters are unavailable, that entire quorum (i.e., the master node) is offline and thus (as with any entire node going offline) most operations will fail anyway, unless by chance they do not hit any data managed by that failed quorum.
When it receives a notice of unavailability, the master instance on duty tries to contact the unavailable server instance and if it fails, it will notify all remaining instances that that server instance is removed from the cluster. The effect is that the remaining server instances will stop attempting to access the failed instance. Updates to the partitions managed by the failed server instance are no longer sent to it, which results in updates to this data succeeding, as they are made against the other server instances in that quorum. Updates to the data of the failed server instance will fail in the window of time between the actual failure and the removal, which is typically well under a second. The removal of a failed server instance is delegated to a central authority in order not to have everybody get in each other's way when trying to effect the removal.
If the failed server instance left prepared uncommitted transactions behind, the server instances having such branches will in due order contact the transaction coordinator to ask what should be done. This is a normal procedure for dealing with possibly dropped commit or rollback messages. When they discover that the coordinator has been removed, the master on duty will be contacted instead. Each prepare message of a transaction lists all the server instances participating in the transaction; thus the master can check whether each has received the prepare. If all have the prepare and none has an abort, the transaction is committed. The dead coordinator may not know this or may indeed not have the transaction logged, since it sends the prepares before logging its own prepare. The recovery will handle this though. We note that of the remaining branches, there is at least one copy of the branch with the failed server instance, or else we would have a whole quorum failed. In cases where there are branches participating in an unresolved transaction where all the quorum members have failed, the system cannot decide the outcome, and will periodically retry until at least one member of the failed quorum becomes available.
The most complex part of the protocol is the recovery of a failed server instance. The recovery starts with a normal roll forward from the local transaction log. After this, the server instance will contact the master on duty to ask for its state. Typically, the master will reply that the recovering server instance had been removed and is out of date. When this is established, the recovering server instance will contact a live member of its quorum and ask for sync. The failed server instance has an approximate timestamp of its last received transaction. It knows this from the roll forward, where time markers are interspersed now and then between transaction records. The live partner then sends its transaction log(s) covering the time from a few seconds before the last transaction of the failed partner up to the present. A few transactions may get rolled forward twice but this does no harm, since these records have absolute values and no deltas and the second insert of a key is simply ignored. When the sender of the log reaches its last committed log entry, it asks the recovering server instance to confirm successful replay of the log so far. Having the confirmation, the sender will abort all unprepared transactions affecting it and will not accept any new ones until the sync is completed. If new transactions were committed between sending the last of the log and killing the uncommitted new transactions, these too are shipped to the recovering server instance in their committed or prepared state. When these are also confirmed replayed, the recovering server instance is in exact sync up to the transaction. The sender then notifies the rest of the cluster that the sync is complete and that the recovered server instance will be included in any updates of its slice of the data. The time between freeze and re-enable of transactions is the time to replay what came in between the first sync and finishing the freeze. Typically nothing came in, so the time is in milliseconds. If an application got its transaction killed in this maneuver, it will be seen as a deadlock.
If the recovering server instance received transactions in prepared state, it will ask about their outcome as a part of the periodic sweep through pending transactions. One of these transactions could have been one originally prepared by itself, where the prepares had gone out before it had time to log the transaction. Thus, this eventuality too is covered and has a consistent outcome. Failures can interrupt the recovery process. The recovering server instance will have logged as far as it got, and will pick up from this point onward. Real time clocks on the host nodes of the cluster will have to be in approximate sync, within a margin of a minute or so. This is not a problem in a closely connected network.
For simultaneous failure of a entire quorum of server instances (i.e., a set of mutually-mirroring partners; a cluster node), the rule is that the last one to fail must be the first to come back up. In order to have uninterrupted service across arbitrary double failures, one must store things in triplicate; statistically, however, most double failures will not hit cluster nodes of the same group.
The protocol for recovery of failed server instances of the master quorum (i.e., the master cluster node) is identical, except that a recovering master will have to ask the other master(s) which one is more up to date. If the recovering master has a log entry of having excluded all other masters in its quorum from the cluster, it can come back online without asking anybody. If there is no such entry, it must ask the other master(s). If all had failed at the exact same instant, none has an entry of the other(s) being excluded and all will know that they are in the same state since any update to one would also have been sent to the other(s).
Failure of Storage MediaWhen a server instance fails, its permanent storage may or may not survive. Especially with mirrored disks, storage most often survives a failure. However, the survival of the database does not depend on any single server instance retaining any permanent storage over failure. If storage is left in place, as in the case of an OS reboot or replacing a faulty memory chip, rejoining the cluster is done based on the existing copy of the database on the server instance. if there is no existing copy, a copy can be taken from any surviving member of the same quorum. This consists of the following steps: First, a log checkpoint is forced on the surviving instance. Normally log checkpoints are done at regular intervals, independently on each server instance. The log checkpoint writes a consistent state of the database to permanent storage. The disk pages forming this consistent image will not be written to until the next log checkpoint. Therefore copying the database file is safe and consistent as long as a log checkpoint does not take place between the start and end of copy. Thus checkpoints are disabled right after the initial checkpoint. The copy can take a relatively long time; consider 20s per gigabyte on a 1GbE network a good day. At the end of copy, checkpoints are re-enabled on the surviving cluster node. The recovering database starts without a log, sees the timestamp of the checkpoint in the database, and asks for transactions from just before this time up to present. The recovery then proceeds as outlined above.
Network FailuresThe CAP theorem states that Consistency, Availability, and Partition-tolerance do not mix. "Partition" here means the split of a network.
It is trivially true that if the network splits so that on both sides there is a copy of each partition of the data, both sides will think themselves the live copy left online after the other died, and each will thus continue to accumulate updates. Such an event is not very probable within one site where all machines are redundantly connected to two independent switches. Most servers have dual 1GbE on the motherboard, and both ports should be used for cluster interconnect for best performance, with each attached to an independent switch. Both switches would have to fail in such a way as to split their respective network for a single-site network split to happen. Of course, the likelihood of a network split in multi-site situations is higher.
One way of guarding against network splits is to require that at least one partition of the data have all copies online. Additionally, the master on duty can request each cluster node or server instance it expects to be online to connect to every other node or instance, and to report which they could reach. If the reports differ, there is a network problem. This procedure can be performed using both interfaces or only the first or second interface of each server to determine if one of the switches selectively blocks some paths. These simple sanity checks protect against arbitrary network errors. Using TCP for inter-cluster-node communication in principle protects against random message loss, but the Virtuoso cluster protocols do not rely on this. Instead, there are protocols for retry of any transaction messages and for using keep-alive messages on any long-running functions sent across the cluster. Failure to get a keep-alive message within a certain period will abort a query even if the network connections look OK.
Backups, and Recovery from Loss of Entire SiteFor a constantly-operating distributed system, it is hard to define what exactly constitutes a consistent snapshot. The checkpointed state on each cluster node is consistent as far as this cluster node is concerned (i.e., it contains no uncommitted data), but the checkpointed states on all the cluster nodes are not from exactly the same moment in time. The complete state of a cluster is the checkpoint state of each cluster node plus the current transaction log of each. If the logs were shipped in real time to off-site storage, a consistent image could be reconstructed from them. Since such shipping cannot be synchronous due to latency considerations, some transactions could be received only in part in the event of a failure of the off-site link. Such partial transactions can however be detected at reconstruction time because each record contains the list of all participants of the transaction. If some piece is found missing, the whole can be discarded. In this way integrity is guaranteed but it is possible that a few milliseconds worth of transactions get lost. In these cases, the online client will almost certainly fail to get the final success message and will recheck the status after recovery.
For business continuity purposes, a live feed of transactions can be constantly streamed off-site, for example to a cloud infrastructure provider. One low-cost virtual machine on the cloud will typically be enough for receiving the feed. In the event of long-term loss of the whole site, replacement servers can be procured on the cloud; thus, capital is not tied up in an aging inventory of spare servers. The cloud-based substitute can be maintained for the time it takes to rebuild an owned infrastructure, which is still at present more economical than a cloud-only solution.
Switching a cluster from an owned site to the cloud could be accomplished in a few hours. The prerequisite of this is that there are reasonably recent snapshots of the database files, so that replay of logs does not take too long. The bulk of the time taken by such a switch would be in transferring the database snapshots from S3 or similar to the newly provisioned machines, formatting the newly provisioned virtual disks, etc.
Rehearsing such a maneuver beforehand is quite necessary for predictable execution. We do not presently have a productized set of tools for such a switch, but can advise any interested parties on implementing and testing such a disaster recovery scheme.
ConclusionsIn conclusion, we have shown how we can have strong transactional guarantees in a database cluster without single points of failure or performance penalties when compared with a non fault-tolerant cluster. Operator intervention is not required for anything short of hardware failure. Recovery procedures are simple, at most consisting of installing software and copying database files from a surviving cluster node. Unless permanent storage is lost in the failure, not even this is required. Real-time off-site log shipment can easily be added to these procedures to protect against site-wide failures.
Future work may be directed toward concurrent operation of geographically-distributed data centers with eventual consistency. Such a setting would allow for migration between sites in the event of whole-site failures, and for reconciliation between inconsistent histories of different halves of a temporarily split network. Such schemes are likely to require application-level logic for reconciliation and cannot consist of an out-of-the-box DBMS alone. All techniques discussed here are application-agnostic and will work equally well for Graph Model (e.g., RDF) and Relational Model (e.g., SQL) workloads.
Glossary- Virtuoso Cluster (VC) -- a collection of Virtuoso Cluster Nodes on one or more machines, working in parallel as part of a Virtuoso Cluster.
- Virtuoso Cluster Node (VCN) -- a Virtuoso Server Instance (Non Fault-Tolerant Operations), or a Quorum of Server Instances (Fault Tolerant Operations), which is a member of a collection of Virtuoso Cluster Nodes working in parallel as part of a Virtuoso Cluster.
- Virtuoso Host Cluster (VHC) -- a collection of machines, each hosting one or more Virtuoso Server Instances, making up a Virtuoso Cluster.
- Virtuoso Host Cluster Node (VHCN) -- a machine hosting one or more Virtuoso Server Instances that are members of a Virtuoso Cluster.
- Virtuoso Server Instance (VSI) -- a single Virtuoso process with exclusive access to its own permanent storage, consisting of database files and logs. May comprise an entire Virtuoso Cluster Node (Non Fault-Tolerant Operations), or be one member of a quorum which comprises a Virtuoso Cluster Node (Fault Tolerant Operations).
- Special Relativity and the Problem of Database Scalability (PDF), by James Starkey of NimbusDB, Inc.
-
16:40
Fault Tolerance in Virtuoso Cluster Edition (Short Version)
» OpenLink Community BlogWe have for some time had the option of storing data in a cluster in multiple copies, in the Commercial Edition of Virtuoso. (This feature is not in and is not planned to be added to the Open Source Edition.)
Based on some feedback from the field, we decided to make this feature more user friendly. The gist of the matter is that failure and recovery processes have been automated so that neither application developer nor operating personnel needs any knowledge of how things actually work.
So I will here make a few high level statements about what we offer for fault tolerance. I will follow up with technical specifics in another post.
Three types of individuals need to know about fault tolerance:
- Executives: What does it cost? Will it really eliminate downtime?
- System Administrators: Is it hard to configure? What do I do when I get an alert?
- Application Developers/Programmers: Will I need to write extra code? Can old applications get fault tolerance with no changes?
I will explain the matter to each of these three groups:
ExecutivesThe value gained is elimination of downtime. The cost is in purchasing twice (or thrice) the hardware and software licenses. In reality, the cost is less since you get the whole money's worth of read throughput and half the money's worth of write throughput. Since most applications are about reading, this is a good deal. You do not end up paying for unused capacity.
Server instances are grouped in "quorums" of two or, for extra safety, three; as long as one member of each quorum is available, the system keeps running and nobody sees a difference, except maybe for slower response. This does not protect against widespread power outage or the building burning down; the scope is limited to hardware and software failures at one site.
The most basic site-wide disaster recovery plan consists of constantly streaming updates off-site. Using an off-site backup plus update stream, one can reconstitute the failed data center on a cloud provider in a few hours. Details will vary; please contact us for specifics.
Running multiple sites in parallel is also possible but specifics will depend on the application. Again, please contact us if you have a specific case in mind.
System AdministratorsTo configure, divide your server instances into quorums of 2 or 3, according to which will be mirrors of each other, with each quorum member on a different host from the others in its quorum. These things are declared in a configuration file. Table definitions do not have to be altered for fault tolerance. It is enough for tables and indices to specify partitioning. Use two switches, and two NICs per machine, and connect one of each server's network cables to each switch, to cover switch failures.
When things break, as long as there is at least one server instance up from each quorum, things will continue to work. Reboots and the like are handled without operator intervention; if there is a broken host, then remove it and put a spare in its place. If the disks are OK, put the old disks in the replacement host and start. If the disks are gone, then copy the database files from the live copy. Finally start the replacement database, and the system will do the rest. The system is online in read-write mode during all this time, including during copying.
Having mirrored disks in individual hosts is optional since data will anyhow be in two copies. Mirrored disks will shorten the vulnerability window of running a partition on a single server instance since this will for the most part eliminate the need to copy many (hundreds) of GB of database files when recovering a failed instance.
Application Developers/ProgrammersAn application can connect to any server instance in the cluster and have access to the same data, with full ACID properties.
There are two types of errors that can occur in any database application: The database server instance may be offline or otherwise unreachable; and a transaction may be aborted due to a deadlock.
For the missing server instance, the application should try to reconnect. An ODBC/JDBC connect string can specify a list of alternate server instances; thus as long as the application is written to try to reconnect as best practices dictate, there is no new code needed.
For the deadlock, the application is supposed to retry the transaction. Sometimes when a server instance drops out or rejoins a running cluster, some transactions will have to be retried. To the application, these conditions look like a deadlock. If the application handles deadlocks (SQL State 40001) as best practices dictate, there is no change needed.
ConclusionIn summary...
- Limited extra cost for fault tolerance; no equipment sitting idle.
- Easy operation: Replace servers when they fail; the cluster does the rest.
- No changes needed to most applications.
- No proprietary SQL APIs or special fault tolerance logic needed in applications.
- Fully transactional programming model.
All the above applies to both the Graph Model (RDF) and Relational (SQL) sides of Virtuoso. These features will be in the commercial release of Virtuoso to be publicly available in the next 2-3 weeks. Please contact OpenLink Software Sales for details of availability or for getting advance evaluation copies.
Glossary- Virtuoso Cluster (VC) -- a collection of Virtuoso Cluster Nodes on one or more machines, working in parallel as part of a Virtuoso Cluster.
- Virtuoso Cluster Node (VCN) -- a Virtuoso Server Instance (Non Fault-Tolerant Operations), or a Quorum of Server Instances (Fault Tolerant Operations), which is a member of a collection of Virtuoso Cluster Nodes working in parallel as part of a Virtuoso Cluster.
- Virtuoso Host Cluster (VHC) -- a collection of machines, each hosting one or more Virtuoso Server Instances, making up a Virtuoso Cluster.
- Virtuoso Host Cluster Node (VHCN) -- a machine hosting one or more Virtuoso Server Instances that are members of a Virtuoso Cluster.
- Virtuoso Server Instance (VSI) -- a single Virtuoso process with exclusive access to its own permanent storage, consisting of database files and logs. May comprise an entire Virtuoso Cluster Node (Non Fault-Tolerant Operations), or be one member of a quorum which comprises a Virtuoso Cluster Node (Fault Tolerant Operations).
- Special Relativity and the Problem of Database Scalability (PDF), by James Starkey of NimbusDB, Inc.
-
16:40
Fault Tolerance in Virtuoso Cluster Edition (Short Version)
» OpenLink Community BlogWe have for some time had the option of storing data in a cluster in multiple copies, in the Commercial Edition of Virtuoso. (This feature is not in and is not planned to be added to the Open Source Edition.)
Based on some feedback from the field, we decided to make this feature more user friendly. The gist of the matter is that failure and recovery processes have been automated so that neither application developer nor operating personnel needs any knowledge of how things actually work.
So I will here make a few high level statements about what we offer for fault tolerance. I will follow up with technical specifics in another post.
Three types of individuals need to know about fault tolerance:
- Executives: What does it cost? Will it really eliminate downtime?
- System Administrators: Is it hard to configure? What do I do when I get an alert?
- Application Developers/Programmers: Will I need to write extra code? Can old applications get fault tolerance with no changes?
I will explain the matter to each of these three groups:
ExecutivesThe value gained is elimination of downtime. The cost is in purchasing twice (or thrice) the hardware and software licenses. In reality, the cost is less since you get the whole money's worth of read throughput and half the money's worth of write throughput. Since most applications are about reading, this is a good deal. You do not end up paying for unused capacity.
Server instances are grouped in "quorums" of two or, for extra safety, three; as long as one member of each quorum is available, the system keeps running and nobody sees a difference, except maybe for slower response. This does not protect against widespread power outage or the building burning down; the scope is limited to hardware and software failures at one site.
The most basic site-wide disaster recovery plan consists of constantly streaming updates off-site. Using an off-site backup plus update stream, one can reconstitute the failed data center on a cloud provider in a few hours. Details will vary; please contact us for specifics.
Running multiple sites in parallel is also possible but specifics will depend on the application. Again, please contact us if you have a specific case in mind.
System AdministratorsTo configure, divide your server instances into quorums of 2 or 3, according to which will be mirrors of each other, with each quorum member on a different host from the others in its quorum. These things are declared in a configuration file. Table definitions do not have to be altered for fault tolerance. It is enough for tables and indices to specify partitioning. Use two switches, and two NICs per machine, and connect one of each server's network cables to each switch, to cover switch failures.
When things break, as long as there is at least one server instance up from each quorum, things will continue to work. Reboots and the like are handled without operator intervention; if there is a broken host, then remove it and put a spare in its place. If the disks are OK, put the old disks in the replacement host and start. If the disks are gone, then copy the database files from the live copy. Finally start the replacement database, and the system will do the rest. The system is online in read-write mode during all this time, including during copying.
Having mirrored disks in individual hosts is optional since data will anyhow be in two copies. Mirrored disks will shorten the vulnerability window of running a partition on a single server instance since this will for the most part eliminate the need to copy many (hundreds) of GB of database files when recovering a failed instance.
Application Developers/ProgrammersAn application can connect to any server instance in the cluster and have access to the same data, with full ACID properties.
There are two types of errors that can occur in any database application: The database server instance may be offline or otherwise unreachable; and a transaction may be aborted due to a deadlock.
For the missing server instance, the application should try to reconnect. An ODBC/JDBC connect string can specify a list of alternate server instances; thus as long as the application is written to try to reconnect as best practices dictate, there is no new code needed.
For the deadlock, the application is supposed to retry the transaction. Sometimes when a server instance drops out or rejoins a running cluster, some transactions will have to be retried. To the application, these conditions look like a deadlock. If the application handles deadlocks (SQL State 40001) as best practices dictate, there is no change needed.
ConclusionIn summary...
- Limited extra cost for fault tolerance; no equipment sitting idle.
- Easy operation: Replace servers when they fail; the cluster does the rest.
- No changes needed to most applications.
- No proprietary SQL APIs or special fault tolerance logic needed in applications.
- Fully transactional programming model.
All the above applies to both the Graph Model (RDF) and Relational (SQL) sides of Virtuoso. These features will be in the commercial release of Virtuoso to be publicly available in the next 2-3 weeks. Please contact OpenLink Software Sales for details of availability or for getting advance evaluation copies.
Glossary- Virtuoso Cluster (VC) -- a collection of Virtuoso Cluster Nodes on one or more machines, working in parallel as part of a Virtuoso Cluster.
- Virtuoso Cluster Node (VCN) -- a Virtuoso Server Instance (Non Fault-Tolerant Operations), or a Quorum of Server Instances (Fault Tolerant Operations), which is a member of a collection of Virtuoso Cluster Nodes working in parallel as part of a Virtuoso Cluster.
- Virtuoso Host Cluster (VHC) -- a collection of machines, each hosting one or more Virtuoso Server Instances, making up a Virtuoso Cluster.
- Virtuoso Host Cluster Node (VHCN) -- a machine hosting one or more Virtuoso Server Instances that are members of a Virtuoso Cluster.
- Virtuoso Server Instance (VSI) -- a single Virtuoso process with exclusive access to its own permanent storage, consisting of database files and logs. May comprise an entire Virtuoso Cluster Node (Non Fault-Tolerant Operations), or be one member of a quorum which comprises a Virtuoso Cluster Node (Fault Tolerant Operations).
- Special Relativity and the Problem of Database Scalability (PDF), by James Starkey of NimbusDB, Inc.
-
15:21
"The Acquired, The Innate, and the Semantic" or "Teaching Sem Tech"
» OpenLink Community BlogI was recently asked to write a section for a policy document touching the intersection of database and semantics, as a follow up to the meeting in Sofia I blogged about earlier. I will write about technology, but this same document also touches the matter of education and computer science curricula. Since the matter came up, I will share a few thoughts on the latter topic.
I have over the years trained a few truly excellent engineers and managed a heterogeneous lot of people. These days, since what we are doing is in fact quite difficult and the world is not totally without competition, I find that I must stick to core competence, which is hardcore tech and leave management to those who have time for it.
When younger, I thought that I could, through sheer personal charisma, transfer either technical skills, sound judgment, or drive and ambition to people I was working with. Well, to the extent I believed this, my own judgment was not sound. Transferring anything at all is difficult and chancy. I must here think of a fantasy novel where a wizard said that, "working such magic that makes things do what they already want to do is easy." There is a grain of truth in that.
In order to build or manage organizations, we must work, as the wizard put it, with nature, not against it. There are also counter-examples, for example my wife's grandmother had decided to transform a regular willow into a weeping one by tying down the branches. Such "magic," needless to say, takes constant maintenance; else the spell breaks.
To operate efficiently, either in business or education, we need to steer away from such endeavors. This is a valuable lesson, but now consider teaching this to somebody. Those who would most benefit from this wisdom are the least receptive to it. So again, we are reminded to stay away from the fantasy of being able to transfer some understanding we think to have and to have this take root. It will if it will and if it does not, it will take constant follow up, like the would-be weeping willow.
Now, in more specific terms, what can we realistically expect to teach about computer science?
Complexity of algorithms would be the first thing. Understanding the relative throughputs and latencies of the memory hierarchy (i.e., cache, memory, local network, disk, wide area network) is the second. Understanding the difference of synchronous and asynchronous and the cost of synchronization (i.e., anything from waiting for a mutex to waiting for a network message) is the third.
Understanding how a database works would be immensely helpful for almost any application development task but this is probably asking too much.
Then there is the question of engineering. Where do we put interfaces and what should these interfaces expose? Well, they certainly should expose multiple instances of whatever it is they expose, since passing through an interface takes time.
I tried once to tell the SPARQL committee that parameterized queries and array parameters are a self-evident truism on the database side. This is an example of an interface that exposes multiple instances of what it exposes. But the committee decided not to standardize these. There is something in the "semanticist" mind that is irrationally antagonistic to what is self-evident for databasers. This is further an example of ignoring precept 2 above, the point about the throughputs and latencies in the memory hierarchy. Nature is a better and more patient teacher than I; the point will become clear of itself in due time, no worry.
Interfaces seem to be overvalued in education. This is tricky because we should not teach that interfaces are bad either. Nature has islands of tightly intertwined processes, separated by fairly narrow interfaces. People are taught to think in block diagrams, so they probably project this also where it does not apply, thereby missing some connections and porosity of interfaces.
LarKC (EU FP7 Large Knowledge Collider project) is an exercise in interfaces. The lessons so far are that coupling needs to be tight, and that the roles of the components are not always as neatly separable as the block diagram suggests.
Recognizing the points where interfaces are naturally narrow is very difficult. Teaching this in a curriculum is likely impossible. This is not to say that the matter should not be mentioned and examples of over-"paradigmatism" given. The geek mind likes to latch on to a paradigm (e.g., object orientation), and then they try to put it everywhere. It is safe to say that taking block diagrams too naively or too seriously makes for poor performance and needless code. In some cases, block diagrams can serve as tactical disinformation; i.e., you give lip service to the values of structure, information hiding, and reuse, which one is not allowed to challenge, ever, and at the same time you do not disclose the competitive edge, which is pretty much always a breach of these same principles.
I was once at a data integration workshop in the US where some very qualified people talked about the process of science. They had this delightfully American metaphor for it:
The edge is created in the "Wild West" — there are no standards or hard-and-fast rules, and paradigmatism for paradigmatism's sake is a laughing matter with the cowboys in the fringe where new ground is broken. Then there is the OK Corral, where the cowboys shoot it out to see who prevails. Then there is Dodge City, where the lawman already reigns, and compliance, standards, and paradigms are not to be trifled with, lest one get the tar-and-feather treatment and be "driven out o'Dodge."
So, if reality is like this, what attitude should the curriculum have towards it? Do we make innovators or followers? Well, as said before, they are not made. Or if they are made, they are not at least made in the university but much before that. I never made any of either, in spite of trying, but did meet many of both kinds. The education system needs to recognize individual differences, even though this is against the trend of turning out a standardized product. Enforced mediocrity makes mediocrity. The world has an amazing tolerance for mediocrity, it is true. But the edge is not created with this, if edge is what we are after.
But let us move to specifics of semantic technology. What are the core precepts, the equivalent of the complexity/memory/synchronization triangle of general purpose CS basics? Let us not forget that, especially in semantic technology, when we have complex operations, lots of data, and almost always multiple distributed data sources, forgetting the laws of physics carries an especially high penalty.
-
Know when to ontologize, when to folksonomize. The history of standards has examples of "stacks of Babel," sky-high and all-encompassing, which just result in non-communication and non-adoption. Lighter weight, community driven, tag folksonomy, VoCamp-style approaches can be better. But this is a judgment call, entirely contextual, having to do with the maturity of the domain of discourse, etc.
-
Answer only questions that are actually asked. This precept is two-pronged. The literal interpretation is not to do inferential closure for its own sake, materializing all implied facts of the knowledge base.
The broader interpretation is to take real-world problems. Expanding RDFS semantics with map-reduce and proving how many iterations this will take is a thing one can do but real-world problems will be more complex and less neat.
-
Deal with ambiguity. Data on which semantic technologies will be applied will be dirty, with errors from machine processing of natural language to erroneous human annotations. The knowledge bases will not be contradiction free. Michael Witbrock of CYC said many good things about this in Sofia; he would have something to say about a curriculum, no doubt.
Here we see that semantic technology is a younger discipline than computer science. We can outline some desirable skills and directions to follow but the idea of core precepts is not as well formed.
So we can approach the question from the angle of needed skills more than of precepts of science. What should the certified semantician be able to do?
-
Data integration. Given heterogenous relational schemas talking about the same entities, the semantician should find existing ontologies for the domain, possibly extend these, and then map the relational data to them. After the mapping is conceptually done, the semantician must know what combination of ETL and on-the-fly mapping fits the situation. This does mean that the semantician indeed must understand databases, which I above classified as an almost unreachable ideal. But there is no getting around this. Data is increasingly what makes the world go round. From this it follows that everybody must increasingly publish, consume, and refine, i.e., integrate. The anti-database attitude of the semantic web community simply has to go.
-
Design and implement workflows for content extraction, e.g., NLP or information extraction from images. This also means familiarity with NLP, desirably to the point of being able to tune the extraction rule sets of various NLP frameworks.
-
Design SOA workflows. The semantician should be able to extract and represent the semantics of business transactions and the data involved therein.
-
Lightweight knowledge engineering. The experience of building expert systems from the early days of AI is not the best possible, but with semantics attached to data, some sort of rules seem about inevitable. The rule systems will merge into the DBMS in time. Some ability to work with these, short of making expert systems, will be desirable.
-
Understand information quality in the sense of trust, provenance, errors in the information, etc. If the world is run based on data analytics, then one must know what the data in the warehouse means, what accidental and deliberate errors it contains, etc.
Of course, most of these tasks take place at some sort of organizational crossroads or interface. This means that the semantician must have some project management skills; must be capable of effectively communicating with different publics and simply getting the job done, always in the face of organizational inertia and often in the face of active resistance from people who view the semantician as some kind of intruder on their turf.
Now, this is a tall order. The semantician will have to be reasonably versatile technically, reasonably clever, and a self-starter on top. The self-starter aspect is the hardest.
The semanticists I have met are more of the scholar than the IT consultant profile. I say semanticist for the semantic web research people and semantician for the practitioner we are trying to define.
We could start by taking people who already do data integration projects and educating them in some semantic technology. We are here talking about a different breed than the one that by nature gravitates to description logics and AI. Projecting semanticist interests or attributes on this public is a source of bias and error.
If we talk about a university curriculum, the part that cannot be taught is the leadership and self-starter aspect, or whatever makes a good IT consultant. Thus the semantic technology studies must be profiled so as to attract people with this profile. As quoted before, the dream job for each era is a scarce skill that makes value from something that is plentiful in the environment. At this moment and for a few moments to come, this is the data geek, or maybe even semantician profile, if we take data geek past statistics and traditional business intelligence skills.
The semantic tech community, especially the academic branch of it, needs to reinvent itself in order to rise to this occasion. The flavor of the dream job curriculum will be away from the theoretical computer science towards the hands-on of database, large systems performance, and the practicalities of getting data intensive projects delivered.
Related
-
-
15:21
"The Acquired, The Innate, and the Semantic" or "Teaching Sem Tech"
» OpenLink Community BlogI was recently asked to write a section for a policy document touching the intersection of database and semantics, as a follow up to the meeting in Sofia I blogged about earlier. I will write about technology, but this same document also touches the matter of education and computer science curricula. Since the matter came up, I will share a few thoughts on the latter topic.
I have over the years trained a few truly excellent engineers and managed a heterogeneous lot of people. These days, since what we are doing is in fact quite difficult and the world is not totally without competition, I find that I must stick to core competence, which is hardcore tech and leave management to those who have time for it.
When younger, I thought that I could, through sheer personal charisma, transfer either technical skills, sound judgment, or drive and ambition to people I was working with. Well, to the extent I believed this, my own judgment was not sound. Transferring anything at all is difficult and chancy. I must here think of a fantasy novel where a wizard said that, "working such magic that makes things do what they already want to do is easy." There is a grain of truth in that.
In order to build or manage organizations, we must work, as the wizard put it, with nature, not against it. There are also counter-examples, for example my wife's grandmother had decided to transform a regular willow into a weeping one by tying down the branches. Such "magic," needless to say, takes constant maintenance; else the spell breaks.
To operate efficiently, either in business or education, we need to steer away from such endeavors. This is a valuable lesson, but now consider teaching this to somebody. Those who would most benefit from this wisdom are the least receptive to it. So again, we are reminded to stay away from the fantasy of being able to transfer some understanding we think to have and to have this take root. It will if it will and if it does not, it will take constant follow up, like the would-be weeping willow.
Now, in more specific terms, what can we realistically expect to teach about computer science?
Complexity of algorithms would be the first thing. Understanding the relative throughputs and latencies of the memory hierarchy (i.e., cache, memory, local network, disk, wide area network) is the second. Understanding the difference of synchronous and asynchronous and the cost of synchronization (i.e., anything from waiting for a mutex to waiting for a network message) is the third.
Understanding how a database works would be immensely helpful for almost any application development task but this is probably asking too much.
Then there is the question of engineering. Where do we put interfaces and what should these interfaces expose? Well, they certainly should expose multiple instances of whatever it is they expose, since passing through an interface takes time.
I tried once to tell the SPARQL committee that parameterized queries and array parameters are a self-evident truism on the database side. This is an example of an interface that exposes multiple instances of what it exposes. But the committee decided not to standardize these. There is something in the "semanticist" mind that is irrationally antagonistic to what is self-evident for databasers. This is further an example of ignoring precept 2 above, the point about the throughputs and latencies in the memory hierarchy. Nature is a better and more patient teacher than I; the point will become clear of itself in due time, no worry.
Interfaces seem to be overvalued in education. This is tricky because we should not teach that interfaces are bad either. Nature has islands of tightly intertwined processes, separated by fairly narrow interfaces. People are taught to think in block diagrams, so they probably project this also where it does not apply, thereby missing some connections and porosity of interfaces.
LarKC (EU FP7 Large Knowledge Collider project) is an exercise in interfaces. The lessons so far are that coupling needs to be tight, and that the roles of the components are not always as neatly separable as the block diagram suggests.
Recognizing the points where interfaces are naturally narrow is very difficult. Teaching this in a curriculum is likely impossible. This is not to say that the matter should not be mentioned and examples of over-"paradigmatism" given. The geek mind likes to latch on to a paradigm (e.g., object orientation), and then they try to put it everywhere. It is safe to say that taking block diagrams too naively or too seriously makes for poor performance and needless code. In some cases, block diagrams can serve as tactical disinformation; i.e., you give lip service to the values of structure, information hiding, and reuse, which one is not allowed to challenge, ever, and at the same time you do not disclose the competitive edge, which is pretty much always a breach of these same principles.
I was once at a data integration workshop in the US where some very qualified people talked about the process of science. They had this delightfully American metaphor for it:
The edge is created in the "Wild West" — there are no standards or hard-and-fast rules, and paradigmatism for paradigmatism's sake is a laughing matter with the cowboys in the fringe where new ground is broken. Then there is the OK Corral, where the cowboys shoot it out to see who prevails. Then there is Dodge City, where the lawman already reigns, and compliance, standards, and paradigms are not to be trifled with, lest one get the tar-and-feather treatment and be "driven out o'Dodge."
So, if reality is like this, what attitude should the curriculum have towards it? Do we make innovators or followers? Well, as said before, they are not made. Or if they are made, they are not at least made in the university but much before that. I never made any of either, in spite of trying, but did meet many of both kinds. The education system needs to recognize individual differences, even though this is against the trend of turning out a standardized product. Enforced mediocrity makes mediocrity. The world has an amazing tolerance for mediocrity, it is true. But the edge is not created with this, if edge is what we are after.
But let us move to specifics of semantic technology. What are the core precepts, the equivalent of the complexity/memory/synchronization triangle of general purpose CS basics? Let us not forget that, especially in semantic technology, when we have complex operations, lots of data, and almost always multiple distributed data sources, forgetting the laws of physics carries an especially high penalty.
-
Know when to ontologize, when to folksonomize. The history of standards has examples of "stacks of Babel," sky-high and all-encompassing, which just result in non-communication and non-adoption. Lighter weight, community driven, tag folksonomy, VoCamp-style approaches can be better. But this is a judgment call, entirely contextual, having to do with the maturity of the domain of discourse, etc.
-
Answer only questions that are actually asked. This precept is two-pronged. The literal interpretation is not to do inferential closure for its own sake, materializing all implied facts of the knowledge base.
The broader interpretation is to take real-world problems. Expanding RDFS semantics with map-reduce and proving how many iterations this will take is a thing one can do but real-world problems will be more complex and less neat.
-
Deal with ambiguity. Data on which semantic technologies will be applied will be dirty, with errors from machine processing of natural language to erroneous human annotations. The knowledge bases will not be contradiction free. Michael Witbrock of CYC said many good things about this in Sofia; he would have something to say about a curriculum, no doubt.
Here we see that semantic technology is a younger discipline than computer science. We can outline some desirable skills and directions to follow but the idea of core precepts is not as well formed.
So we can approach the question from the angle of needed skills more than of precepts of science. What should the certified semantician be able to do?
-
Data integration. Given heterogenous relational schemas talking about the same entities, the semantician should find existing ontologies for the domain, possibly extend these, and then map the relational data to them. After the mapping is conceptually done, the semantician must know what combination of ETL and on-the-fly mapping fits the situation. This does mean that the semantician indeed must understand databases, which I above classified as an almost unreachable ideal. But there is no getting around this. Data is increasingly what makes the world go round. From this it follows that everybody must increasingly publish, consume, and refine, i.e., integrate. The anti-database attitude of the semantic web community simply has to go.
-
Design and implement workflows for content extraction, e.g., NLP or information extraction from images. This also means familiarity with NLP, desirably to the point of being able to tune the extraction rule sets of various NLP frameworks.
-
Design SOA workflows. The semantician should be able to extract and represent the semantics of business transactions and the data involved therein.
-
Lightweight knowledge engineering. The experience of building expert systems from the early days of AI is not the best possible, but with semantics attached to data, some sort of rules seem about inevitable. The rule systems will merge into the DBMS in time. Some ability to work with these, short of making expert systems, will be desirable.
-
Understand information quality in the sense of trust, provenance, errors in the information, etc. If the world is run based on data analytics, then one must know what the data in the warehouse means, what accidental and deliberate errors it contains, etc.
Of course, most of these tasks take place at some sort of organizational crossroads or interface. This means that the semantician must have some project management skills; must be capable of effectively communicating with different publics and simply getting the job done, always in the face of organizational inertia and often in the face of active resistance from people who view the semantician as some kind of intruder on their turf.
Now, this is a tall order. The semantician will have to be reasonably versatile technically, reasonably clever, and a self-starter on top. The self-starter aspect is the hardest.
The semanticists I have met are more of the scholar than the IT consultant profile. I say semanticist for the semantic web research people and semantician for the practitioner we are trying to define.
We could start by taking people who already do data integration projects and educating them in some semantic technology. We are here talking about a different breed than the one that by nature gravitates to description logics and AI. Projecting semanticist interests or attributes on this public is a source of bias and error.
If we talk about a university curriculum, the part that cannot be taught is the leadership and self-starter aspect, or whatever makes a good IT consultant. Thus the semantic technology studies must be profiled so as to attract people with this profile. As quoted before, the dream job for each era is a scarce skill that makes value from something that is plentiful in the environment. At this moment and for a few moments to come, this is the data geek, or maybe even semantician profile, if we take data geek past statistics and traditional business intelligence skills.
The semantic tech community, especially the academic branch of it, needs to reinvent itself in order to rise to this occasion. The flavor of the dream job curriculum will be away from the theoretical computer science towards the hands-on of database, large systems performance, and the practicalities of getting data intensive projects delivered.
Related
-
-
14:15
Upcoming RDF Loader in Unclustered Virtuoso loads Uniprot at 279 Ktriples/s!
» OpenLink Community BlogWe recently heard that Oracle 11G loaded RDF faster than we did. Now, we never thought the speed of loading a database was as important as the speed of query results, but since this is the sole area where they have reportedly been tested as faster, we decided it was time loading was addressed. Indeed, without Oracle to challenge us on query performance, we would not be half as good as we are. So, spurred on by the Oracular influence, we did something about our RDF loading.
Performance, I have said before, is a matter of locality and parallelism. So we applied both to the otherwise quite boring exercise of loading RDF. The recipe is this: Take a large set of triples; resolve the IRIs and literals into their IDs; then insert each index of the triple table on its own thread. All the lookups and inserts are first sorted in key order to get the locality. Running the indices in parallel gets the parallelism. Then run the parser on its own thread, fetching chunks of consecutive triples and queueing them for a pool of loader threads. Then run several parsers concurrently on different files so as to make sure there is work enough at all times. Do not make many more process threads than available CPU threads, since they would just get in each other's way.
The whole process is non-transactional, starting from a checkpoint and ending with a checkpoint.
The test system was a dual-Xeon 5520 with 72G RAM. The Virtuoso was a single server; no cluster capability was used.
We loaded English Dbpedia, 179M triples, in 15 minutes, for a rate of 198 Kt/s. Uniprot with 1.33 G triples loaded in 79 minutes, for 279 Kt/s.
The source files were the Dbpedia 3.4 English files and the Bio2RDF copy of Uniprot, both in Turtle syntax. The uniref, uniparc and uniprot files from the Bio2RDF set were sliced into smaller chunks so as to have more files to load in parallel; the taxonomy file was as such; and no other Bio2RDF files were loaded. Both experiments ran with 8 load streams, 1 per core. The CPU utilization was mostly between 1400% and 1500%, 14-15 of 16 CPU threads busy. Top load speed for a measurement window of 2 minutes was 383 Kt/s.
The index scheme for RDF quads was the default Virtuoso 6 configuration of 5 indices — GS, SP, OP, PSOG, and POGS. (We call this "3+2" indexing, because there are 3 partial and 2 full indices, delivering massive performance benefits over most other index schemes.) IRIs and literals reside in their own tables, each indexed from string to ID and vice versa. A full-text index on literals was not used.
Compared to previous performance, we have more than tripled our best single server multi-stream load speed, and multiplied our single stream load speed by a factor of 8. Some further gains may be reached by adjusting thread counts and matching vector sizes to CPU cache.
This will be available in a forthcoming release; this is not for download yet. Now that you know this, you may guess what we are doing with queries. More on this another time.
-
14:15
Upcoming RDF Loader in Unclustered Virtuoso loads Uniprot at 279 Ktriples/s!
» OpenLink Community BlogWe recently heard that Oracle 11G loaded RDF faster than we did. Now, we never thought the speed of loading a database was as important as the speed of query results, but since this is the sole area where they have reportedly been tested as faster, we decided it was time loading was addressed. Indeed, without Oracle to challenge us on query performance, we would not be half as good as we are. So, spurred on by the Oracular influence, we did something about our RDF loading.
Performance, I have said before, is a matter of locality and parallelism. So we applied both to the otherwise quite boring exercise of loading RDF. The recipe is this: Take a large set of triples; resolve the IRIs and literals into their IDs; then insert each index of the triple table on its own thread. All the lookups and inserts are first sorted in key order to get the locality. Running the indices in parallel gets the parallelism. Then run the parser on its own thread, fetching chunks of consecutive triples and queueing them for a pool of loader threads. Then run several parsers concurrently on different files so as to make sure there is work enough at all times. Do not make many more process threads than available CPU threads, since they would just get in each other's way.
The whole process is non-transactional, starting from a checkpoint and ending with a checkpoint.
The test system was a dual-Xeon 5520 with 72G RAM. The Virtuoso was a single server; no cluster capability was used.
We loaded English Dbpedia, 179M triples, in 15 minutes, for a rate of 198 Kt/s. Uniprot with 1.33 G triples loaded in 79 minutes, for 279 Kt/s.
The source files were the Dbpedia 3.4 English files and the Bio2RDF copy of Uniprot, both in Turtle syntax. The uniref, uniparc and uniprot files from the Bio2RDF set were sliced into smaller chunks so as to have more files to load in parallel; the taxonomy file was as such; and no other Bio2RDF files were loaded. Both experiments ran with 8 load streams, 1 per core. The CPU utilization was mostly between 1400% and 1500%, 14-15 of 16 CPU threads busy. Top load speed for a measurement window of 2 minutes was 383 Kt/s.
The index scheme for RDF quads was the default Virtuoso 6 configuration of 5 indices — GS, SP, OP, PSOG, and POGS. (We call this "3+2" indexing, because there are 3 partial and 2 full indices, delivering massive performance benefits over most other index schemes.) IRIs and literals reside in their own tables, each indexed from string to ID and vice versa. A full-text index on literals was not used.
Compared to previous performance, we have more than tripled our best single server multi-stream load speed, and multiplied our single stream load speed by a factor of 8. Some further gains may be reached by adjusting thread counts and matching vector sizes to CPU cache.
This will be available in a forthcoming release; this is not for download yet. Now that you know this, you may guess what we are doing with queries. More on this another time.
-
14:46
SemData@Sofia Roundtable write-up
» OpenLink Community BlogThere was last week an invitation-based roundtable about semantic data management in Sofia, Bulgaria.
Lots of smart people together. The meeting was hosted by Ontotext and chaired by Dieter Fensel. On the database side we had Ontotext, SYSTAP (Bigdata), CWI (MonetDB), Karlsruhe Institute of Technology (YARS2/SWSE). LarKC was well represented, being our hosts, with STI, Ontotext, CYC, and VU Amsterdam. Notable absences were Oracle, Garlik, Franz, and Talis.
Now of semantic data management... What is the difference between a relational database and a semantic repository, a triple/quad store, a whatever-you-call-them?
I had last fall a meeting at CWI with Martin Kersten, Peter Boncz and Lefteris Sidirourgos from CWI, and Frank van Harmelen and Spiros Kotoulas of VU Amsterdam, to start a dialogue between semanticists and databasers. Here we were with many more people trying to discover what the case might be. What are the differences?
Michael Stonebraker and Martin Kersten have basically said that what is sauce for the goose is sauce for the gander, and that there is no real difference between relational DB and RDF storage, except maybe for a little tuning in some data structures or parameters. Semantic repository implementors on the other hand say that when they tried putting triples inside an RDB it worked so poorly that they did everything from scratch. (It is a geekly penchant to do things from scratch, but then this is not always unjustified.)
OpenLink Software and Virtuoso are in agreement with both sides, contradictory as this might sound. We took our RDBMS and added data types and structures and cost model alterations to an existing platform. Oracle did the same. MonetDB considers doing this and time will tell the extent of their RDF-oriented alterations. Right now the estimate is that this will be small and not in the kernel.
I would say with confidence that without source code access to the RDB, RDF will not be particularly convenient or efficient to accommodate. With source access, we found that what serves RDB also serves RDF. For example, execution engine and data compression considerations are the same, with minimal tweaks for RDF's run time typing needs.
So now we are founding a platform for continuing this discussion. There will be workshops and calls for papers and the beginnings of a research community.
After the initial meeting at CWI, I tried to figure what the difference was between the databaser and semanticist minds. Really, the things are close but there is still a disconnect. Database is about big sets and semantics is about individuals, maybe. The databaser discovers that the operation on each member of the set is not always the same, and the semanticist discovers that the operation on each member of the set is often the same.
So the semanticist says that big joins take time. The databaser tells the semanticist not to repeat what's been obvious for 40 years and for which there is anything from partitioned hashes to merges to various vectored execution models. Not to mention columns.
Spiros of VU Amsterdam/LarKC says that map-reduce materializes inferential closure really fast. Lefteris of CWI says that while he is not a semantic person, he does not understand what the point of all this materializing is, nobody is asking the question, right? So why answer? I say that computing inferential closure is a semanticist tradition; this is just what they do. Atanas Kiryakov of Ontotext says that this is not just a tradition whose start and justification is in the forgotten mists of history, but actually a clear and present need; just look at all the joining you would need.
Michael Witbrock of CYC says that it is not about forward or backward inference on toy rule sets, but that both will be needed and on massively bigger rule sets at that. Further, there can be machine learning to direct the inference, doing the meta-reasoning merged with the reasoning itself.
I say that there is nothing wrong with materialization if it is guided by need, in the vein of memo-ization or cracking or recycling as is done in MonetDB. Do the work when it is needed, and do not do it again.
Brian Thompson of Systap/Bigdata asks whether it is not a contradiction in terms to both want pluggability and merging inference into the data, like LarKC would be doing. I say that this is difficult but not impossible and that when you run joins in a cluster database, as you decide based on the data where the next join step will be, so it will be with inference. Right there, between join steps, integrated with whatever data partitioning logic you have, for partitioning you will have, data being bigger and bigger. And if you have reuse of intermediates and demand driven indexing à la MonetDB, this too integrates and applies to inference results.
So then, LarKC and CYC, can you picture a pluggable inference interface at this level of granularity? So far, I have received some more detail as to the needs of inference and database integration, essentially validating our previous intuitions and plans.
Aside talking of inference, we have the more immediate issue of creating an industry out of the semantic data management offerings of today.
What do we need for this? We need close-to-parity with relational — doing your warehouse in RDF with the attendant agility thereof can't cost 10x more to deploy than the equivalent relational solution.
We also want to tell the key-value, anti-SQL people, who throw away transactions and queries, that there is a better way. And for this, we need to improve our gig just a little bit. Then you have the union of some level of ACID, at least consistent read, availability, complex query, large scale.
And to do this, we need a benchmark. It needs a differentiation of online queries and browsing and analytics, graph algorithms and such. We are getting there. We will soon propose a social web benchmark for RDF which has both online and analytical aspects, a data generator, a test driver, and so on, with a TPC-style set of rules. If there is agreement on this, we will all get a few times faster. At this point, RDF will be a lot more competitive with mainstream and we will cross another qualitative threshold.
-
14:46
SemData@Sofia Roundtable write-up
» OpenLink Community BlogThere was last week an invitation-based roundtable about semantic data management in Sofia, Bulgaria.
Lots of smart people together. The meeting was hosted by Ontotext and chaired by Dieter Fensel. On the database side we had Ontotext, SYSTAP (Bigdata), CWI (MonetDB), Karlsruhe Institute of Technology (YARS2/SWSE). LarKC was well represented, being our hosts, with STI, Ontotext, CYC, and VU Amsterdam. Notable absences were Oracle, Garlik, Franz, and Talis.
Now of semantic data management... What is the difference between a relational database and a semantic repository, a triple/quad store, a whatever-you-call-them?
I had last fall a meeting at CWI with Martin Kersten, Peter Boncz and Lefteris Sidirourgos from CWI, and Frank van Harmelen and Spiros Kotoulas of VU Amsterdam, to start a dialogue between semanticists and databasers. Here we were with many more people trying to discover what the case might be. What are the differences?
Michael Stonebraker and Martin Kersten have basically said that what is sauce for the goose is sauce for the gander, and that there is no real difference between relational DB and RDF storage, except maybe for a little tuning in some data structures or parameters. Semantic repository implementors on the other hand say that when they tried putting triples inside an RDB it worked so poorly that they did everything from scratch. (It is a geekly penchant to do things from scratch, but then this is not always unjustified.)
OpenLink Software and Virtuoso are in agreement with both sides, contradictory as this might sound. We took our RDBMS and added data types and structures and cost model alterations to an existing platform. Oracle did the same. MonetDB considers doing this and time will tell the extent of their RDF-oriented alterations. Right now the estimate is that this will be small and not in the kernel.
I would say with confidence that without source code access to the RDB, RDF will not be particularly convenient or efficient to accommodate. With source access, we found that what serves RDB also serves RDF. For example, execution engine and data compression considerations are the same, with minimal tweaks for RDF's run time typing needs.
So now we are founding a platform for continuing this discussion. There will be workshops and calls for papers and the beginnings of a research community.
After the initial meeting at CWI, I tried to figure what the difference was between the databaser and semanticist minds. Really, the things are close but there is still a disconnect. Database is about big sets and semantics is about individuals, maybe. The databaser discovers that the operation on each member of the set is not always the same, and the semanticist discovers that the operation on each member of the set is often the same.
So the semanticist says that big joins take time. The databaser tells the semanticist not to repeat what's been obvious for 40 years and for which there is anything from partitioned hashes to merges to various vectored execution models. Not to mention columns.
Spiros of VU Amsterdam/LarKC says that map-reduce materializes inferential closure really fast. Lefteris of CWI says that while he is not a semantic person, he does not understand what the point of all this materializing is, nobody is asking the question, right? So why answer? I say that computing inferential closure is a semanticist tradition; this is just what they do. Atanas Kiryakov of Ontotext says that this is not just a tradition whose start and justification is in the forgotten mists of history, but actually a clear and present need; just look at all the joining you would need.
Michael Witbrock of CYC says that it is not about forward or backward inference on toy rule sets, but that both will be needed and on massively bigger rule sets at that. Further, there can be machine learning to direct the inference, doing the meta-reasoning merged with the reasoning itself.
I say that there is nothing wrong with materialization if it is guided by need, in the vein of memo-ization or cracking or recycling as is done in MonetDB. Do the work when it is needed, and do not do it again.
Brian Thompson of Systap/Bigdata asks whether it is not a contradiction in terms to both want pluggability and merging inference into the data, like LarKC would be doing. I say that this is difficult but not impossible and that when you run joins in a cluster database, as you decide based on the data where the next join step will be, so it will be with inference. Right there, between join steps, integrated with whatever data partitioning logic you have, for partitioning you will have, data being bigger and bigger. And if you have reuse of intermediates and demand driven indexing à la MonetDB, this too integrates and applies to inference results.
So then, LarKC and CYC, can you picture a pluggable inference interface at this level of granularity? So far, I have received some more detail as to the needs of inference and database integration, essentially validating our previous intuitions and plans.
Aside talking of inference, we have the more immediate issue of creating an industry out of the semantic data management offerings of today.
What do we need for this? We need close-to-parity with relational — doing your warehouse in RDF with the attendant agility thereof can't cost 10x more to deploy than the equivalent relational solution.
We also want to tell the key-value, anti-SQL people, who throw away transactions and queries, that there is a better way. And for this, we need to improve our gig just a little bit. Then you have the union of some level of ACID, at least consistent read, availability, complex query, large scale.
And to do this, we need a benchmark. It needs a differentiation of online queries and browsing and analytics, graph algorithms and such. We are getting there. We will soon propose a social web benchmark for RDF which has both online and analytical aspects, a data generator, a test driver, and so on, with a TPC-style set of rules. If there is agreement on this, we will all get a few times faster. At this point, RDF will be a lot more competitive with mainstream and we will cross another qualitative threshold.
