Showing posts with label database technology.

Thursday, 9 July 2009

ParAccel Toots Its Horn and Revs Its Database Engine

Posted on 11:42 by Unknown
Summary: Over the past year, columnar analytical database vendor ParAccel has methodically proven its claims about speed, scalability and easy deployment. Now it's looking to grow fast.

When I first wrote about analytical database vendor ParAccel in a February 2008 post, it was one of several barely distinguishable vendors offering massively parallel, SQL-compatible columnar databases. Their main claim to fame was a record-setting performance on the TPC-H benchmark, but even the significance of that was unclear since few vendors bother with the TPC process.

Since then, ParAccel has delivered an impressive string of accomplishments, including deals with demanding customers (Merkle, PriceChopper, Autometrics, TRX) and an important alliance with EMC to create a “scalable analytic appliance”. To top it off, they recently announced their 2.0 release, a new TPC-H record, and $22 million Series C funding. (Full disclosure: they also hired me to write a white paper.)

Of all these, perhaps the most significant news is that the new TPC-H benchmark comes at the 30 terabyte level.* ParAccel’s previous TPC-H championships were at the 100 GB to 1 TB levels.

The change reflects a general growth in the scale of systems supported by MPP columnar databases. ParAccel reports its largest production installation holds 18 TB of compressed data, which probably translates to something more than 50 TB of input. Segment-leader Vertica reports several production installations larger than 100 TB. Neither had more than 10 TB in production a year ago.

These figures still don’t put the columnar systems in the same ballpark as the petabyte-scale database appliances like Netezza, Greenplum and Aster Data, but they do open up some major new possibilities. In case you’re wondering, ParAccel’s TPC-H results were seven times faster and had 16 times better price / performance than the previous record, held by Oracle.

But pure scalability isn’t the key selling point for ParAccel. More than anything, the company stresses its ability to handle complex queries without specialized data schemas or indexes. This means that existing data structures can be loaded as is and queried immediately. The net result is a much faster “time to answer” than competitive systems, which do tailor schemas and/or indexes to specific questions. It also means that new queries can be answered immediately, without waiting for schema modifications or new indexes.

The 2.0 release extends these advantages with a new query optimizer that handles very complex joins and correlated subqueries; parallel data loading (nearly 9 TB per hour in the TPC-H benchmark) and User Defined Functions; enhanced compression; and “blended scans” that avoid Storage Area Network (SAN) controller bottlenecks by loading SAN data onto compute nodes and querying it there directly. It also adds some special features such as Oracle SQL support and column encryption for financial data. Another set of enhancements is designed to provide enterprise-class reliability, availability and manageability, including backup and failover. Several of these features are already in production, although the official 2.0 release date is August.

The new release and added funding mark a transition of ParAccel from quiet introduction to full-throated selling. Over the past year, the company has carefully limited its participation in Proof of Concept (POC) competitions, the key selection tool in this segment. This gave it time to refine its POC processes, add system features, and build initial client references. It says it can now complete a typical POC in three days, often leaving while other vendors are still getting started. The company is now ramping up its lead generation and inside sales operations, aiming to grow quickly beyond its dozen-plus existing installations. (To provide some context: Vertica reports more than 100 clients.) We'll see what comes next.


______________
* For some serious doubt-sowing about the new benchmarks, see Daniel Abadi's post (be sure to read the comments) and ParAccel's response. What really matters, as ParAccel points out, is performance in customer POCs. The company says its performance has never been beaten, although there was one tie. (For sheer entertainment, check out the related string on Curt Monash's blog.)
Posted in analytical database, column data store, database technology, paraccel, vertica

Wednesday, 27 August 2008

Looking for Differences in MPP Analytical Databases

Posted on 11:15 by Unknown
“You gotta get a gimmick if you wanna get ahead” sing the strippers in the classic musical Gypsy. The same rule seems to apply to analytical databases: each vendor has its own little twist that makes it unique, if not necessarily better than the competition. This applies even, or maybe especially, to the non-columnar systems that use a massively parallel (“shared-nothing”) architecture to handle very large volumes.

You’ll note I didn’t refer to these systems as “appliances”. Most do indeed follow the appliance path pioneered by industry leader Netezza, but not all. Since my earlier posts on the topic, I’ve been contacted by Dataupia, DATAllegro (recently acquired by Microsoft), Aster Data, and Kognitio. A review of my notes shows that no two are quite alike.

Let’s start with Dataupia. CEO and founder Foster Hinshaw was also a founder at Netezza, which he left in 2005. Hinshaw still considers Netezza the “killer machine” for large analytical workloads, but positions Dataupia as a more flexible product that can handle conventional reporting in addition to ad hoc analytics. “A data warehouse for the rest of us” is how he puts it.

As it happens, all the vendors in this group stress their ability to handle “mixed workloads”. It’s not clear they mean the same thing, although the phrase may indicate that data can be stored in structures other than only star/snowflake schemas. In any event, the overlap is large enough that I don’t think we can classify Dataupia as unique on that particular dimension. What does set the system apart is its ability to manage “dynamic aggregation” of inputs into the data cubes required by many business intelligence and reporting applications. Cube building is notoriously time-consuming for conventional databases, and although any MPP database can presumably maintain cubes, it appears that Dataupia is especially good at it. This would indeed support Dataupia’s position as more reporting-oriented than its competitors.
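
To make the cube idea concrete, here is a toy Python sketch of dynamic aggregation as I understand the claim—my own illustration with invented names, not Dataupia’s implementation. The cube is kept current as detail rows load, so reports read tiny pre-summed cells instead of raw detail:

from collections import defaultdict

cube = defaultdict(float)            # (region, month) -> total sales

def load_row(region, month, sales):
    cube[(region, month)] += sales   # aggregates stay current as detail loads

for row in [("East", "Jan", 100.0), ("East", "Feb", 80.0),
            ("West", "Jan", 60.0), ("East", "Jan", 40.0)]:
    load_row(*row)

print(cube[("East", "Jan")])         # 140.0, answered without rescanning detail
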

The other apparently unique feature of Dataupia is its ability to connect with applications through common relational databases such as Oracle and DB2. None of the other vendors made a similar claim, but I say this is “apparently” unique because Hinshaw said the connection is made via the federation layer built into the common databases, and I don’t know whether other systems could also connect in the same way. In any case, Hinshaw said this approach makes Dataupia look to Oracle like nothing more than some additional table space. So integration with existing applications can’t get much simpler.

One final point about Dataupia is pricing. A 2 terabyte blade costs $19,500, which includes both hardware and software. (Dataupia is a true appliance.) This is a much lower cost than any competitor.

The other true appliance in this group is DATAllegro. When we spoke in April, it was building its nodes with a combination of EMC storage, Cisco networking, Dell servers, Ingres database and the Linux operating system. Presumably the Microsoft acquisition will change those last two. DATAllegro’s contribution was the software to distribute data across and within the hardware nodes and to manage queries against that data. In my world, this falls under the heading of intelligent partitioning, which is not itself unique: in fact, three of the four vendors listed here do it. Of course, the details vary and DATAllegro’s version no doubt has some features that no one else shares. DATAllegro was also unique in requiring a large (12 terabyte) initial configuration, for close to $500,000. This will also probably change under Microsoft management.

Aster Data lets users select and assemble their own hardware rather than providing an appliance. Otherwise, it generally resembles the Dataupia and DATAllegro appliances in that it uses intelligent partitioning to distribute its data. Aster assigns separate nodes to the tasks of data loading, query management, and data storage/query execution. The vendor says this makes it easy to support different types of workloads by adding the appropriate types of nodes. But DATAllegro also has separate loader nodes, and I’m not sure about the other systems. So I’m not going to call that one unique. Aster pricing starts at $100,000 for the first terabyte.

Kognitio resembles Aster in its ability to use any type of hardware: in fact, a single network can combine dissimilar nodes. A more intriguing difference is that Kognitio is the only one of these systems that distributes incoming data in a round-robin fashion, instead of attempting to put related data on the same node. It can do this without creating excessive inter-node traffic because it loads data into memory during query execution—another unique feature among this group. (The trick is that related data is sent to the same node when it's loaded into memory. See the comments on this post for details.)
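
For the mechanically inclined, here is a little Python sketch of the two placement schemes—my own simplification, not Kognitio’s code. Round-robin spreads rows evenly with no regard to content; a hash redistribution (the “trick” above) sends rows with equal join keys to the same node, so joins create no cross-node traffic:

from collections import defaultdict

NODES = 4

def round_robin_distribute(rows):
    # Assign each row to a node in strict rotation (on-disk placement).
    placement = defaultdict(list)
    for i, row in enumerate(rows):
        placement[i % NODES].append(row)
    return placement

def hash_redistribute(rows, key):
    # Re-send rows so equal join keys land on the same node
    # (what happens when data is loaded into memory for a query).
    placement = defaultdict(list)
    for row in rows:
        placement[hash(row[key]) % NODES].append(row)
    return placement

orders = [{"customer_id": c, "amount": a}
          for c, a in [(101, 5.0), (102, 7.5), (101, 2.0), (103, 9.9)]]

print(round_robin_distribute(orders))            # even spread, keys scattered
print(hash_redistribute(orders, "customer_id"))  # all of customer 101 together
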

Kognitio also wins the prize for the oldest (or, as they probably prefer, most mature) technology in this group, tracing its WX2 product to the WhiteCross analytical database of the early 1980s. (WX2…WhiteCross…get it?) It also has by far the highest list price, at $180,000 per terabyte. But this is clearly negotiable, especially in the U.S. market, which Kognitio entered just this year. (Note: after this post was originally published, Kognitio called to remind me that a. they will build an appliance for you with commodity hardware if you wish and b. they also offer a hosted solution they call Data as a Service. They also note that the price per terabyte drops when you buy more than one.)

Whew. I should probably offer a prize for anybody who can correctly infer which vendors have which features from the above. But I’ll make it easy for you (with apologies that I still haven’t figured out how to do a proper table within Blogger).

                          Dataupia    DATAllegro   Aster Data   Kognitio
Mixed Workload            Yes         Yes          Yes          Yes
Intelligent Partitioning  Yes         Yes          Yes          no
Appliance                 Yes         Yes          no           no
Dynamic Aggregation       Yes         no           no           no
Federated Access          Yes         no           no           no
In-Memory Execution       no          no           no           Yes
Entry Cost per TB         $10K(1)     ~$40K(2)     $100K        $180K

(1) $19.5K for 2TB
(2) under $500K for 12TB; pre-acquisition pricing

As I noted earlier, some of these differences may not really matter in general or for your application in particular. In other cases, the real impact depends on the implementation details not captured in such a simplistic list. So don’t take this list for anything more than it is: an interesting overview of the different choices made by analytical database developers.
Posted in analysis systems, analytical database, database technology

Wednesday, 6 August 2008

More on QlikView - Curt Monash Blog

Posted on 07:31 by Unknown
I somehow ended up posting some comments on QlikView technology on Curt Monash's DBMS2 blog. This is actually a more detailed description than I've ever posted here about how I think QlikView works. If you're interested in that sort of thing, do take a look.
Posted in analytical database, database technology, qliktech, qlikview

Saturday, 12 July 2008

Sybase IQ: A Different Kind of Columnar Database (Or Is It The Other Way Around?)

Posted on 12:18 by Unknown


I spent a fair amount of time this past week getting ready for my part in the July 10 DM Radio Webcast on columnar databases. Much of this was spent updating my information on SybaseIQ, whose CTO Irfan Khan was a co-panelist.

Sybase was particularly eager to educate me because I apparently ruffled a few feathers when my July DM Review column described SybaseIQ as a “variation on a columnar database” and listed it separately from other columnar systems. Since IQ has been around for much longer than the other columnar systems and has a vastly larger installed base—over 1,300 customers, as they reminded me several times—the Sybase position seems to be that they should be considered the standard, and everyone else as the variation. (Not that they put it that way.) I can certainly see why it would be frustrating to be set apart from other columnar systems at exactly the moment when columnar technology is finally at the center of attention.

The irony is that I’ve long been fond of SybaseIQ, precisely because I felt its unconventional approach offered advantages that few people recognized. I also feel good about IQ because I wrote about its technology back in 1994, before Sybase purchased it from Expressway Technologies—as I reminded Sybase several times.

In truth, though, that original article was part of the problem. Expressway was an indexing system that used a very clever, and patented, variation on bitmap indexes that allowed calculations within the index itself. Although that technology is still an important feature within SybaseIQ, it now supplements a true column-based data store. Thus, while Expressway was not a columnar database, SybaseIQ is.
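
For readers who haven’t met bitmap indexes, here is a generic Python sketch—deliberately simple, and certainly not Expressway’s patented variant—of how counts and combined conditions can be computed entirely within the index by bit arithmetic:

states = ["MA", "NY", "MA", "CA", "MA"]
ages   = [30, 45, 30, 22, 45]

def bitmap_index(column):
    # Each distinct value gets one bit vector over the rows.
    bitmaps = {}
    for row, value in enumerate(column):
        bitmaps[value] = bitmaps.get(value, 0) | (1 << row)
    return bitmaps

state_idx, age_idx = bitmap_index(states), bitmap_index(ages)

# "How many rows have state = MA and age = 30?" -- answered entirely
# within the index by ANDing two bitmaps and counting set bits.
hits = state_idx["MA"] & age_idx[30]
print(bin(hits), bin(hits).count("1"))   # 0b101 -> 2 rows
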

I was aware that Sybase had extended Expressway substantially, which is why my DM Review article did refer to them as a type of columnar database. So there was no error in what I wrote. But I’ll admit that until this week’s briefings I didn’t realize just how far SybaseIQ has moved from its bitmapped roots. It now uses seven or nine types of indexes (depending on which document you read), including traditional b-tree indexes and word indexes. Many of its other indexes do use some form of bitmaps, often in conjunction with tokenization (i.e., replacing an actual value with a key that points to a look-up table of actual values. Tokenization saves space when the same value occurs repeatedly, because the key is much smaller than the value itself. Think how much smaller a database is if it stores “MA” instead of “Massachusetts” in its addresses.)
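
A quick Python sketch of tokenization, using my own invented example data, shows how little is left once repeated values shrink to small integer keys:

def tokenize(column):
    lookup, tokens = [], []
    positions = {}                    # value -> token
    for value in column:
        if value not in positions:
            positions[value] = len(lookup)
            lookup.append(value)      # each distinct value stored once
        tokens.append(positions[value])
    return tokens, lookup

states = ["Massachusetts", "Massachusetts", "Maine", "Massachusetts"]
tokens, lookup = tokenize(states)
print(tokens)    # [0, 0, 1, 0] -- one small int per row
print(lookup)    # ['Massachusetts', 'Maine']

# Reconstructing the original column is a simple lookup:
print([lookup[t] for t in tokens])
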

Of course, tokenization is really a data compression technique, so I have a hard time considering a column of tokenized data to be an index. To me, an index is an access mechanism, not the data itself, regardless of how well it’s compressed. Sybase serenely glides over the distinction with the Zen-like aphorism that “the index is the column” (or maybe it was the other way around). I’m not sure I agree, but the point doesn’t seem worth debating.

Yet, semantics aside, SybaseIQ’s heavy reliance on “indexes” is a major difference between it and the raft of other systems currently gaining attention as columnar databases: Vertica, ParAccel, Exasol and Calpont among them. These systems do rely heavily on compression of their data columns, but don’t describe (or, presumably, use) these as indexes. In particular, so far as I know, they don’t build different kinds of indexes on the same column, which IQ treats as a main selling point. Some of the other systems store several versions of the same column in different sort sequences, but that’s quite different.

The other very clear distinction between IQ and the other columnar systems is that IQ uses Symmetrical Multi-Processing (SMP) servers to process queries against a unified data store, while the others rely on shared-nothing, Massively Parallel Processing (MPP) servers. This reflects a fundamentally different approach to scalability. Sybase scales by having different servers execute different queries simultaneously, relying on its indexes to minimize the amount of data that must be read from the disk. The MPP-based systems scale by partitioning the data so that many servers can work in parallel to scan it quickly. (Naturally, the MPP systems do more than a brute-force column scan; for example, those sorted columns can substantially reduce read volumes.)
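
Here is a toy Python contrast of the two scaling models—purely illustrative, with arbitrary data and a stand-in predicate, not either vendor’s engine. The MPP style splits one query across partitions scanned in parallel; the SMP style answers from one shared store, using an “index” to avoid scanning everything:

from concurrent.futures import ThreadPoolExecutor

data = list(range(1_000_000))

# MPP-style: partition the data, scan partitions concurrently, merge.
def scan_partition(part):
    return sum(1 for x in part if x % 97 == 0)   # stand-in predicate

partitions = [data[i::4] for i in range(4)]      # four "nodes"
with ThreadPoolExecutor(max_workers=4) as pool:
    mpp_result = sum(pool.map(scan_partition, partitions))

# SMP-style: one shared store; a precomputed "index" of qualifying
# positions lets a single server answer without a full scan.
index = [i for i, x in enumerate(data) if x % 97 == 0]
smp_result = len(index)

assert mpp_result == smp_result
print(mpp_result)
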

It’s possible that understanding these differences would allow someone to judge which type of columnar system works better for a particular application. But I am not that someone. Sybase makes a plausible case that its approach is inherently better for a wider range of ad hoc queries, because it doesn’t depend on how the data is partitioned or sorted. However, I haven’t heard the other vendors’ side of that argument. In any event, actual performance will depend on how the architecture has been implemented. So even a theoretically superior approach will not necessarily deliver better results in real life. Until the industry has a great deal more experience with the MPP systems in particular, the only way to know which database is better for a particular application will be to test them.

The SMP/MPP distinction does raise a question about SybaseIQ’s uniqueness. My original DM Review article actually listed two classes of columnar systems: SMP-based and MPP-based. Other SMP-based systems include Alterian, SmartFocus, Infobright, 1010Data and open-source LucidDB. (The LucidDB site contains some good technical explanations of columnar techniques, incidentally.)

I chose not to list SybaseIQ in the SMP category because I thought its reliance on bitmap techniques made it significantly different from the others, and in particular because I believed it made IQ substantially more scalable. I’m not so sure about the bitmap part anymore, now that I realize SybaseIQ makes less use of bitmaps than I thought and have found that some of the other vendors use them too. On the other hand, IQ’s proven scalability is still much greater than any of these other systems—Sybase cites installations over 100 TB, while none of the others (possibly excepting Infobright) has an installation over 10 TB.

So where does all this leave us? Regarding SybaseIQ, not so far from where we started: I still say it’s an excellent columnar database that is significantly different from the (MPP-based) columnar databases that are the focus of recent attention. But, to me, the really important word in the preceding sentence is “excellent”, not “columnar”. The point of the original DM Review article was that there are many kinds of analytical databases available, and you should consider them all when assessing which might fit your needs. It would be plain silly to finally look for alternatives to conventional relational databases and immediately restrict yourself to just one other approach.

Posted in analytical database, column data store, column-oriented database, columnar database, database technology, sybase iq, vertica

Thursday, 8 May 2008

Infobright Puts a Clever Twist on the Columnar Database

Posted on 18:25 by Unknown
It took me some time to form a clear picture of analytical database vendor Infobright, despite an excellent white paper that seems to have since vanished from their Web site. [Note: Per Susan Davis' comment below, they have since reloaded it here.] Infobright’s product, named BrightHouse, confused me because it is a SQL-compatible, columnar database, which makes it sound similar to systems like Vertica and ParAccel (click here for my ParAccel entry).

But it turns out there is a critical difference: while those other products rely on massively parallel (MPP) hardware for scalability and performance, BrightHouse runs on conventional (SMP) servers. The system gains its performance edge by breaking each database column into chunks of 65,536 values called “data packs”, and reading relatively few packs to resolve most queries.

The trick is that BrightHouse stores descriptive information about each data pack and can often use this information to avoid loading the pack itself. For example, the descriptive information holds minimum and maximum values of data within the pack, plus summary data such as totals. This means that a query involving a certain value range may determine that all or none of the records within a pack are qualified. If all values are out of range, the pack can be ignored; if all values are in range, the summary data may suffice. Only when some but not all of the records within a pack are relevant must the pack itself be loaded from disk and decompressed. According to CEO Miriam Tuerk, this approach can reduce data transfers by up to 90%. The data is also highly compressed when loaded into the packs—by ratios as high as 50:1, although 10:1 is average. This reduces hardware costs and yields even faster disk reads. By contrast, data in MPP columnar systems often takes up as much or more storage space as the source files.
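
Here is a small Python sketch of how I understand the data-pack trick—the pack size is real, but the layout and metadata details are my assumptions, not Infobright’s actual format:

PACK_SIZE = 65_536

def build_packs(values):
    packs = []
    for i in range(0, len(values), PACK_SIZE):
        chunk = values[i:i + PACK_SIZE]
        packs.append({"min": min(chunk), "max": max(chunk),
                      "sum": sum(chunk), "rows": chunk})
    return packs

def count_in_range(packs, lo, hi):
    # Count rows with lo <= value <= hi, opening as few packs as possible.
    total = 0
    for p in packs:
        if p["max"] < lo or p["min"] > hi:
            continue                      # irrelevant pack: skip entirely
        elif lo <= p["min"] and p["max"] <= hi:
            total += len(p["rows"])       # fully qualified: metadata suffices
        else:
            total += sum(1 for v in p["rows"] if lo <= v <= hi)  # open the pack
    return total

values = list(range(500_000))             # naturally ordered, like a date column
packs = build_packs(values)
print(count_in_range(packs, 100, 200))    # 101; only the first pack is opened
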

This design is substantially more efficient than conventional columnar systems, which read every record in a given column to resolve queries involving that column. The small size of the BrightHouse data packs means that many packs will be totally included or excluded from queries even without their contents being sorted when the data is loaded. This lack of sorting, along with the lack of indexing or data hashing, yields load rates of up to 250 GB per hour. This is impressive for an SMP system, although MPP systems are faster.

You may wonder what happens to BrightHouse when queries require joins across tables. It turns out that even in these cases, the system can use its summary data to exclude many data packs. In addition, the system watches queries as they execute and builds a record of which data packs are related to other data packs. Subsequent queries can use this information to avoid opening data packs unnecessarily. The system thus gains a performance advantage without requiring a single, predefined join path between tables—something that is present in some other columnar systems, though not all of them. The net result of all this is great flexibility: users can load data from existing source systems without restructuring it, and still get excellent analytical performance.

BrightHouse uses the open source MySQL database interface, allowing it to connect with any data source that is accessible to MySQL. According to Tuerk, it is the only version of MySQL that scales beyond 500 GB. Its scalability is still limited, however, to 30 to 50 TB of source data, which would be a handful of terabytes once compressed. The system runs on any Red Hat Linux 5 server—for example, a 1 TB installation runs on a $22,000 Dell. A Windows version is planned for later this year. The software itself costs $30,000 per terabyte of source data (one-time license plus annual maintenance), which puts it towards the low end of other analytical systems.

Infobright was founded in 2005 although development of the BrightHouse engine began earlier. Several production systems were in place by 2007. The system was officially launched in early 2008 and now has about a dozen production customers.
Posted in analysis systems, analytics tools, columnar database, database technology, open source bi, open source software

Wednesday, 2 April 2008

illuminate Systems' iLuminate May Be the Most Flexible Analytical Database Ever

Posted on 08:07 by Unknown
OK, I freely admit I’m fascinated by offbeat database engines. Maybe there is a support group for this. In any event, the highlight of my brief visit to the DAMA International Symposium and Wilshire Meta-Data Conference last month was a presentation by Joe Foley of illuminate Solutions, which marked the U.S. launch of his company’s iLuminate analytical database.

(Yes, the company name is “illuminate” and the product is “iLuminate”. And if you look inside the HTML tag, you’ll see the Internet domain is “i-lluminate.com”. Marketing genius or marketing madness? You decide.)

illuminate calls iLuminate a “correlation database”, a term they apparently invented to distinguish it from everything else. It does appear to be unique, although somewhat similar to other products I’ve seen over the years: Digital Archaeology (now deceased), Omnidex and even QlikView come to mind. Like iLuminate, these systems store links among data values rather than conventional database tables or columnar structures. iLuminate is the purest expression I’ve seen of this approach: it literally stores each value once, regardless of how many tables, columns or rows contain it in the source data. Indexes and algorithms capture the context of each original occurrence so the system can recreate the original records when necessary.

The company is rather vague on the details, perhaps on purpose. They do state that each value is tied to a conventional b-tree index that makes it easily and quickly accessible. What I imagine—and let me stress I may have this wrong—is that each value is then itself tied to a hierarchical list of the tables, columns and rows in which it appears. There would be a count associated with each level, so a query that asked how many times a value appears in each table would simply look at the pre-calculated value counts; a query of how many times the value appeared in a particular column could look down one level to get the column counts. A query that needed to know about individual rows would retrieve the row numbers. A query that involved multiple values would retrieve multiple sets of row numbers and compare them: so, say, a query looking for state = “NY” and eye color = “blue” might find that “NY” appears in the state column for records 3, 21 and 42, while “blue” appears in the eye color for records 5, 21 and 56. It would then return row=21 as the only intersection of the two sets. Another set of indexes would make it simple to retrieve the other components of row 21.
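
Here is my speculation translated into a few lines of Python—again, a guess at the general shape, not illuminate’s actual structures. Each value maps to the set of rows where it occurs, and a two-condition query becomes a set intersection, just like the NY/blue-eyes example:

from collections import defaultdict

def build_value_index(rows):
    # Map (column, value) -> set of row ids; each value stored once per column.
    index = defaultdict(set)
    for rid, row in enumerate(rows):
        for col, val in row.items():
            index[(col, val)].add(rid)
    return index

people = [
    {"state": "CA", "eye_color": "brown"},   # row 0
    {"state": "NY", "eye_color": "blue"},    # row 1
    {"state": "NY", "eye_color": "green"},   # row 2
]
idx = build_value_index(people)

matches = idx[("state", "NY")] & idx[("eye_color", "blue")]
print(matches)   # {1} -- the only intersection of the two row sets
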

Whether or not that’s what actually happens under the hood, this does illustrate the main advantages of iLuminate. Specifically, it can import data in any structure and access it without formal database design; it can identify instances of the same value across different tables or columns; it can provide instant table and column level value distributions; and it can execute incremental queries against a previously selected set of records. The company also claims high speed and terabyte scalability, although some qualification is needed: initial results from a query appear very quickly, but calculations involving a large result set must wait for the system to assemble and process the full set of data. Foley also said that although the system has been tested with a terabyte of data, the largest production implementation is a much less impressive 50 gigabytes. Still, the largest production row count is 200 million rows – nothing to sneeze at.

The system avoids some of the pitfalls that sometimes trap analytical databases: load times are comparable to those of relational databases (once you include the relational systems’ time for indexing, etc.); total storage (including the indexes, which take up most of the space) is about the same as the relational source data; and users can write queries in standard SQL via ODBC. This final point is particularly important, because many past analytical systems were not SQL-compatible, which deterred many potential buyers. The new crop of analytical database vendors has learned this lesson: nearly all of the new analytical systems are SQL-accessible. Just to be clear, iLuminate is not an in-memory database, although it will make intelligent use of what memory is available, often loading the data values and b-tree indexes into memory when possible.

Still, at least from my perspective, the most important feature of iLuminate is its ability to work with any structure of input data—including structures that SQL would handle poorly or not at all. This is where users gain huge time savings, because they need not predict the queries they will write and then design a data structure to support them. In this regard, the system is even more flexible than QlikView, which it in many ways resembles: while QlikView links tables with fixed keys during the data load, iLuminate does not. Instead, like a regular SQL system, iLuminate can apply different relationships to one set of data by defining the relationships within different queries. (On the other hand, QlikView’s powerful scripting language allows much more data manipulation during the load process.)

Part of the reason I mention QlikView is that iLuminate itself uses QlikView as a front-end tool under the label of iAnalyze. This extracts data from iLuminate using ODBC and then loads it into QlikView. Naturally, the data structure at that point must include fixed relationships. In addition to QlikView, iAnalyze also includes integrated mapping. A separate illuminate product, called iCorrelated, allows ad hoc, incremental queries directly against iLuminate and takes full advantage of its special capabilities.

illuminate, which is based in Spain, has been in business for nearly three years. It has more than 40 iLuminate installations, mostly in Europe. Pricing is based on several factors but the entry cost is very reasonable: somewhere in the $80,000 to $100,000 range, including iAnalyze. As part of its U.S. launch, the company is offering no-cost proof of concept projects to qualified customers.
Posted in analysis systems, business intelligence, columnar database, database technology, qliktech, qlikview

Thursday, 27 March 2008

The Limits of On-Demand Business Intelligence

Posted on 13:31 by Unknown
I had an email yesterday from Blink Logic, which offers on-demand business intelligence. That could mean quite a few things, but the most likely definition is indeed what Blink Logic provides: remote access to business intelligence software loaded with your own data. I looked a bit further and it appears Blink Logic does this with conventional technologies, primarily Microsoft SQL Server Analysis Services and Cognos Series 7.

At that point I pretty much lost interest because (a) there’s no exotic technology, (b) quite a few vendors offer similar services*, and (c) the real issue with business intelligence is the work required to prepare the data for analysis, which doesn’t change just because the system is hosted.

Now, this might be unfair to Blink Logic, which could have some technology of its own for data loading or the user interface. It does claim that at least one collaboration feature, direct annotation of data in reports, is unique. But the major point remains: Blink Logic and other “on-demand business intelligence” vendors are simply offering a hosted version of standard business intelligence systems. Does anyone truly think the location of the data center is the chief reason that business intelligence has so few users?

As I see it, the real obstacle is that most source data must be integrated and restructured before business intelligence systems can use it. It may be literally true that hosted business intelligence systems can be deployed in days and users can build dashboards in minutes, but this only applies given the heroic assumption that the proper data is already available. Under those conditions, on-premise systems can be deployed and used just as quickly. Hosting per se has little benefit when it comes to speed of deployment. (Well, maybe some: it can take days or even a week or two to set up a new server in some corporate data centers. Still, that is a tiny fraction of the typical project schedule.)

If hosting isn't the answer, what can make true “business intelligence on demand” a reality? Since the major obstacle is data preparation, anything that allows less preparation will help. This brings us back to the analytical databases and appliances I’ve been writing about recently: Alterian, Vertica, ParAccel, QlikView, Netezza and so on. At least some of them do reduce the need for preparation because they let users query raw data without restructuring or aggregating it. This isn’t because they avoid SQL queries, but because they offer a great enough performance boost over conventional databases that aggregation and denormalization are no longer needed to return results quickly.

Of course, performance alone can’t solve all data preparation problems. The really knotty challenges like customer data integration and data quality still remain. Perhaps some of those will be addressed by making data accessible as a service (see last week’s post). But services themselves do not appear automatically, so a business intelligence application that requires a new service will still need advance planning. Where services will help is when business intelligence users can take advantage of services created for operational purposes.

“On demand business intelligence” also requires that end-users be able to do more for themselves. I actually feel this is one area where conventional technology is largely adequate: although systems could always be easier, end-users willing to invest a bit of time can already create useful dashboards, reports and analyses without deep technical skills. There are still substantial limits to what can be done – this is where QlikView’s scripting and macro capabilities really add value by giving still more power to non-technical users (or, more precisely, to power users outside the IT department). Still, I’d say that when the necessary data is available, existing business intelligence tools let users accomplish most of what they want.

If there is an issue in this area, it’s that SQL-based analytical databases don’t usually include an end-user access tool. (Non-SQL systems do provide such tools, since users have no alternatives.) This is a reasonable business decision on their part, both because many firms have already selected a standard access tool and because the vendors don’t want to invest in a peripheral technology. But not having an integrated access tool means clients must take time to connect the database to another product, which does slow things down. Apparently I'm not the only person to notice this: some of the analytical vendors are now developing partnerships with access tool vendors. If they can automate the relationship so that data sources become visible in the access tool as soon as they are added to the analytical system, this will move “on demand business intelligence” one step closer to reality.

* Results of a quick Google search: OnDemandIQ, LucidEra, PivotLink (an in-memory columnar database), oco, VisualSmart, GoodData and Autometrics.
Posted in analysis systems, business intelligence, dashboards, database technology, hosted software, on-demand software, qliktech, qlikview, service oriented architecture

Wednesday, 27 February 2008

ParAccel Enters the Analytical Database Race

Posted on 18:49 by Unknown
As I’ve now written more times than I care to admit, specialized analytical databases are very much in style. In addition to my beloved QlikView, market entrants include Alterian, SmartFocus, QD Technology, Vertica, 1010data, Kognitio, Advizor and Polyhedra, not to mention established standbys including Teradata and Sybase IQ. Plus you have to add appliances like Netezza, Calpont, Greenplum and DATAllegro. Many of these run on massively parallel hardware platforms; several use columnar data structures and in-memory data access. It’s all quite fascinating, but after a while even I tend to lose interest in the details.

None of which dimmed my enthusiasm when I learned about yet another analytical database vendor, ParAccel. Sure enough, ParAccel is a massively parallel, in-memory-capable, SQL-compatible columnar database, which pretty much hits all the tick boxes on my list. Run by industry veterans, the company seems to have refined many of the details that will let it scale linearly with large numbers of processors and extreme data volumes. One point that seemed particularly noteworthy was that the standard data loader can handle 700 GB per hour, which is vastly faster than many columnar systems and can be a major stumbling block. And that’s just the standard loader, which passes all data through a single node: for really large volumes, the work can be shared among multiple nodes.

Still, if ParAccel had one particularly memorable claim to my attention, it was having blown past previous records for several of the TPC-H analytical query benchmarks run by the Transaction Processing Council. The TPC process is grueling and many vendors don’t bother with it, but it still carries some weight as one of the few objective performance standards available. While other winners had beaten the previous marks by a few percentage points, ParAccel's improvement was on the order of 500%.

When I looked at the TPC-H Website for details, it turned out that ParAccel’s winning results have since been bested by yet another massively parallel database vendor, EXASOL, based in Nuremberg, Germany. (Actually, ParAccel is still listed by TPC as best in the 300 GB category, but that’s apparently only because EXASOL has only run the 100 GB and 1 TB tests.) Still, none of the other analytic database vendors seem to have attempted the TPC-H process, so I’m not sure how impressed to be by ParAccel’s performance. Sure it clearly beats the pants off Oracle, DB2 and SQL Server, but any columnar database should be able to do that.

One insight I did gain from my look at ParAccel was that in-memory doesn’t need to mean small. I’ll admit to being used to conventional PC servers, where 16 GB of memory is a lot and 64 GB is definitely pushing it. The massively parallel systems are a whole other ballgame: ParAccel’s 1 TB test ran on a 48-node system. At a cost of maybe $10,000 per node, that’s some pretty serious hardware, so this is not something that will replace QlikView under my desk any time soon. And bear in mind that even a terabyte isn’t really that much these days: as a point of reference, the TPC-H goes up to 30 TB. Try paying for that much memory, massively parallel or not. The good news is that ParAccel can work with on-disk as well as in-memory data, although the performance won’t be quite as exciting. Hence the term "in-memory-capable".

Hardware aside, ParAccel itself is not especially cheap either. The entry price is $210,000, which buys licenses for five nodes and a terabyte of data. Licenses cost $40,000 for each additional node and $10,000 for each additional terabyte. An alternative pricing scheme doesn’t charge for nodes but costs $1,000 per GB, which is also a good bit of money. Subscription pricing is available, but any way you slice it, this is not a system for small businesses.

So is ParAccel the cat’s meow of analytical databases? Well, maybe, but only because I’m not sure what “the cat’s meow” really means. It’s surely an alternative worth considering for anyone in the market. Perhaps more significant, the company raised $20 million in December 2007, which may make it more commercially viable than most. Even in a market as refined as this one, commercial considerations will ultimately be more important than pure technical excellence.
Posted in analysis systems, columnar database, database technology, qliktech, qlikview

Thursday, 31 January 2008

QlikView Scripts Are Powerful, Not Sexy

Posted on 10:30 by Unknown
I spent some time recently delving into QlikView’s automation functions, which allow users to write macros to control various activities. These are an important and powerful part of QlikView, since they let it function as a real business application rather than a passive reporting system. But what the experience really did was clarify why QlikView is so much easier to use than traditional software.

Specifically, it highlighted the difference between QlikView’s own scripting language and the VBScript used to write QlikView macros.

I was going to label QlikView scripting as a “procedural” language and contrast it with VBScript as an “object-oriented” language, but a quick bit of Wikipedia research suggests those may not be quite the right labels. Still, whatever the nomenclature, the difference is clear when you look at the simple task of assigning a value to a variable. With QlikView scripts, I use a statement like:

Set Variable1 = ‘123’;

With VBScript using the QlikView API, I need something like:

set v = ActiveDocument.GetVariable("Variable1")
v.SetContent "123",true

That the first option is considerably easier may not be an especially brilliant insight. But the implications are significant, because they mean vastly less training is needed to write QlikView scripts than to write similar programs in a language like VBScript, let alone Visual Basic itself. This in turn means that vastly less technical people can do useful things in QlikView than with other tools. And that gets back to the core advantage I’ve associated with QlikView previously: that it lets non-IT people like business analysts do things that normally require IT assistance. The benefit isn’t simply that the business analysts are happier or that IT gets a reduced workload. It's that the entire development cycle is accelerated because analysts can develop and refine applications for themselves. Otherwise, they'd be writing specifications, handing these to IT, waiting for IT to produce something, reviewing the results, and then repeating the cycle to make changes. This is why we can realistically say that QlikView cuts development time to hours or days instead of weeks or months.

Of course, any end-user tool cuts the development cycle. Excel reduces development time in exactly the same way. The difference lies in the power of QlikView scripts. They can do very complicated things, giving users the ability to create truly powerful systems. These capabilities include all kinds of file manipulation—loading data, manipulating it, splitting or merging files, comparing individual records, and saving the results.

The reason it’s taken me so long to recognize that this is important is that database management is not built into today's standard programming languages. We’ve simply become so used to the division between SQL queries and programs that the distinction feels normal. But reflecting on QlikView script brought me back to the old days of the FoxPro and dBase database languages, which did combine database management with procedural coding. They were tremendously useful tools. Indeed, I still use FoxPro for certain tasks. (And not that crazy new-fangled Visual FoxPro, either. It’s especially good after a brisk ride on the motor-way in my Model T. You got a problem with that, sonny?)

Come to think of it, FoxPro and dBase played a similar role in their day to what QlikView offers now: bringing hugely expanded data management power to the desktops of lightly trained users. Their fate was essentially to be overwhelmed by Microsoft Access and SQL Server, which made reasonably priced SQL databases available to end-users and data centers. Although I don’t think QlikView is threatened from that direction, the analogy is worth considering.

Back to my main point, which is that QlikView scripts are both powerful and easy to use. I think they’re an underreported part of the QlikView story, which tends to be dominated by the sexy technology of the in-memory database and the pretty graphics of QlikView reports. Compared with those, scripting seems pedestrian and probably a bit scary to the non-IT users whom I consider QlikView’s core market. I know I myself was put off when I first realized how dependent QlikView was on scripts, because I thought it meant only serious technicians could take advantage of the system. Now that I see how much easier the scripts are than today’s standard programming languages, I consider them a major QlikView advantage.

(Standard disclaimer: although my firm is a reseller for QlikView, opinions expressed in this blog are my own.)
Posted in analysis systems, analytics tools, business intelligence, dashboards, database technology, qliktech, qlikview

Thursday, 6 December 2007

1010data Offers A Powerful Columnar Database

Posted on 14:03 by Unknown
Back in October I wrote here about the resurgent interest in alternatives to standard relational databases for analytical applications. Vendors on my list included Alterian, SmartFocus, Vertica and QD Technology. Most use some form of a columnar structure, meaning data is stored so the system can load only the columns required for a particular query. This reduces the total amount of data read from disk and therefore improves performance. Since a typical analytical query might read only a half-dozen columns out of hundreds or even thousands available, the savings can be tremendous.
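
A tiny Python illustration of the point (my own example data): organize the same table by column, and a query touching two columns out of four never reads the other two:

# The same table organized by column rather than by row:
col_store = {
    "id":      [1, 2],
    "region":  ["East", "West"],
    "revenue": [100.0, 80.0],
    "cost":    [60.0, 50.0],
}

# SELECT sum(revenue) WHERE region = 'East' touches 2 of 4 columns:
total = sum(r for r, g in zip(col_store["revenue"], col_store["region"])
            if g == "East")
print(total)   # 100.0 -- the id and cost columns were never read
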

I recently learned about another columnar database, Tenbase from 1010data. Tenbase, introduced in 2000, turns out to be a worthy alternative to better-known columnar products.

Like other columnar systems, Tenbase is fast: an initial query against a 4.3 billion row, 305 gigabyte table came back in about 12 seconds. Subsequent queries against the results were virtually instantaneous, because they were limited to the selected data and that data had been moved into memory. Although in-memory queries will always be faster, Tenbase says reading from disk takes only three times as long, which is a very low ratio. This reflects a focused effort by the company to make disk access as quick as possible.

What’s particularly intriguing is that Tenbase achieves this performance without compressing, aggregating or restructuring the input. Although indexes are used in some situations, queries generally read the actual data. Even with indexes, the Tenbase files usually occupy about the same amount of space as the input. This factor varies widely among columnar databases, which sometimes expand file size significantly and sometimes compress it. Tenbase also handles very large data sets: the largest in production is nearly 60 billion rows and 4.6 terabytes. Fast response on such large installations is maintained by adding servers that process queries in parallel. Each server contains a complete copy of the full data set.

Tenbase can import data from text files or connect directly to multi-table relational databases. Load speed is about 30 million observations per minute for fixed width data. Depending on the details, this comes to around 10 gigabytes per hour. Time for incremental loads, which add new data to an existing table, is determined only by the volume of the new data. Some columnar databases essentially reload the entire file during an ‘incremental’ update.

Regardless of the physical organization, Tenbase presents loaded data as if it were in the tables of a conventional relational database. Multiple tables can be linked on different keys and queried. This contrasts with some columnar systems that require all tables be linked on the same key, such as a customer ID.

Tenbase has an ODBC connector that lets it accept standard SQL queries. Results come back as quickly as queries in the system’s own query language. This is also special: some columnar systems run SQL queries much more slowly or won’t accept them at all. The Tenbase developers demonstrated this feature by querying a 500 million row database through Microsoft Access, which feels a little like opening the door to a closet and finding yourself in the Sahara desert.

Tenbase’s own query language is considerably more powerful than SQL. It gives users advanced functions for time-series analysis, which actually allows many types of comparisons between rows in the data set. It also contains a variety of statistical, aggregation and calculation functions. It’s still set-based rather than a procedural programming language, so it doesn't support features like if/then loops. This is one area where some other columnar databases may have an edge.

The Tenbase query interface is rather plain but it does let users pick the columns and values to select by, and the columns and summary types to include in the result. Users can also specify a reference column for calculations such as weighted averages. Results can be viewed as tables, charts or cross tabs (limited to one value per intersection), which can themselves be queried. Outputs can be exported in Excel, PDF, XML, text or CSV formats. The interface also lets users create calculated columns and define links among tables.

Under the covers, the Tenbase interface automatically creates XML statements written to the Tenbase API. Users can view and edit the XML or write their own statements from scratch. This lets them create alternate interfaces for special purposes or simply to improve the esthetics. Queries built in Tenbase can be saved and rerun either in their original form or with options for users to enter new values at run time. The latter feature gives a simple way to build query applications for casual users.

The user interface is browser-based, so no desktop client is needed. Perhaps I'm easily impressed, but I like that the browser back button actually works. This is often not the case in such systems. Performance depends on the amount of data and query complexity but it scales with the number of servers, so even very demanding queries against huge databases can be returned in a few minutes with the right hardware. The servers themselves are commodity Windows PCs. Simple queries generally come back in seconds.

Tenbase clients pay for the system on a monthly basis. Fees are based primarily on the number of servers, which is determined by the number of users, amount of data, types of queries, and other details. The company does not publish its pricing but the figures it mentioned seemed competitive. The servers can reside at 1010data or at the client, although 1010data will manage them either way. Users can load data themselves no matter where the server is located.

Most Tenbase clients are in the securities industry, where the product is used for complex analytics. The company has recently added several customers in retail, consumer goods and health care. There are about 45 active Tenbase installations, including the New York Stock Exchange, Procter & Gamble and Pathmark Stores.
Posted in 1010Data, analysis systems, analytics tools, columnar database, database technology, on-demand software, Tenbase

Tuesday, 27 November 2007

Just How Scalable Is QlikTech?

Posted on 14:07 by Unknown
A few days ago, I replied to a question regarding QlikTech scalability. (See What Makes QlikTech So Good?, August 3, 2007) I asked QlikTech itself for more information on the topic but haven’t learned anything new. So let me simply discuss this based on my own experience (and, once again, remind readers that while my firm is a QlikTech reseller, comments in this blog are strictly my own.)

The first thing I want to make clear is that QlikView is a wonderful product, so it would be a great pity if this discussion were to be taken as a criticism. Like any product, QlikView works within limits that must be understood to use it appropriately. No one benefits from unrealistic expectations, even if fans like me sometimes create them.

That said, let’s talk about what QlikTech is good at. I find two fundamental benefits from the product. The first is flexibility: it lets you analyze data in pretty much any way you want, without first building a data structure to accommodate your queries. By contrast, most business intelligence tools must pre-aggregate large data sets to deliver fast response. Often, users can’t even formulate a particular query if the dimensions or calculated measures were not specified in advance. Much of the development time and cost of conventional solutions, whether based in standard relational databases or specialized analytical structures, is spent on this sort of work. Avoiding it is the main reason QlikTech is able to deliver applications so quickly.

The other big benefit of QlikTech is scalability. I can work with millions of records on my desktop with the 32-bit version of the system (maximum memory 4 GB if your hardware allows it) and still get subsecond response. This is much more power than I’ve ever had before. A 64-bit server can work with tens or hundreds of millions of rows: the current limit for a single data set is apparently 2 billion rows, although I don’t know how close anyone has come to that in the field. I have personally worked with tables larger than 60 million rows, and QlikTech literature mentions an installation of 300 million rows. I strongly suspect that larger ones exist.

So far so good. But here’s the rub: there is a trade-off in QlikView between really big files and really great flexibility. The specific reason is that the more interesting types of flexibility often involve on-the-fly calculations, and those calculations require resources that slow down response. This is more a law of nature (there’s no free lunch) than a weakness in the product, but it does exist.

Let me give an example. One of the most powerful features of QlikView is a “calculated dimension”. This lets reports construct aggregates by grouping records according to ad hoc formulas. You might want to define ranges for a value such as age, income or unit price, or create categories using if/then/else statements. These formulas can get very complex, which is generally a good thing. But each formula must be calculated for each record every time it is used in a report. On a few thousand rows, this can happen in an instant, but on tens of millions of rows, it can take several minutes (or much longer if the formula is very demanding, such as on-the-fly ranking). At some point, the wait becomes unacceptable, particularly for users who have become accustomed to QlikView’s typically-immediate response.

As problems go, this isn’t a bad one because it often has a simple solution: instead of on-the-fly calculations, precalculate the required values in QlikView scripts and store the results on each record. There’s little or no performance cost to this strategy since expanding the record size doesn’t seem to slow things down. The calculations do add time to the data load, but that happens only once, typically in an unattended batch process. (Another option is to increase the number and/or speed of processors on the server. QlikTech makes excellent use of multiple processors.)
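
Here is a language-agnostic sketch, written in Python with an invented banding formula, of why precalculation pays: the on-the-fly version re-evaluates the formula for every record on every report refresh, while the precalculated version runs it once at load time and reports simply group on the stored value:

records = [{"age": a} for a in range(30)]

def age_band(age):                       # stand-in for an expensive ad hoc formula
    return "under 21" if age < 21 else "21+"

# On-the-fly "calculated dimension": the formula runs on every record,
# every time a report that uses it is refreshed.
def report_on_the_fly():
    counts = {}
    for r in records:
        band = age_band(r["age"])        # recomputed per record, per refresh
        counts[band] = counts.get(band, 0) + 1
    return counts

# Precalculated field: the formula runs once during the (batch) load,
# and reports group on the stored value.
for r in records:
    r["age_band"] = age_band(r["age"])

def report_precalculated():
    counts = {}
    for r in records:
        counts[r["age_band"]] = counts.get(r["age_band"], 0) + 1
    return counts

assert report_on_the_fly() == report_precalculated()
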

The really good news is you can still get the best of both worlds: work out design details with ad hoc reports on small data sets; then, once the design is stabilized, add precalculations to handle large data volumes. This is vastly quicker than prebuilding everything before you can see even a sample. It’s also something that’s done by business analysts with a bit of QlikView training, not database administrators or architects.

Other aspects of formulas and database design also become more important in QlikView as data volumes grow. The general solution is the same: make the application more efficient through tighter database and report design. So even though it’s true that you can often just load data into QlikView and work with it immediately, it’s equally true that very large or sophisticated applications may take some tuning to work effectively. In other words, QlikView is not pure magic (any result you want for absolutely no work), but it does deliver much more value for a given amount of work than conventional business intelligence systems. That’s more than enough to justify the system.

Interestingly, I haven’t found that the complexity or over-all size of a particular data set impacts QlikView performance. That is, removing tables which are not used in a particular query doesn’t seem to speed up that query, nor does removing fields from tables within the query. This probably has to do with QlikTech’s “associative” database design, which treats each field independently and connects related fields directly to each other. But whatever the reason, most of the performance slow-downs I’ve encountered seem related to processing requirements.

And, yes, there are some upper limits to the absolute size of a QlikView implementation. Two billion rows is one, although my impression (I could be wrong) is that it could be expanded if necessary. The need to load data into memory is another limit: even though the 64-bit address space is effectively infinite, there are physical limits to the amount of memory that can be attached to Windows servers. (A quick scan of the Dell site finds a maximum of 128 GB.) This could translate into more input data, since QlikView does some compression. At very large scales, processing speed will also impose a limit. Whatever the exact upper boundary, it’s clear that no one will be loading dozens of terabytes into QlikView any time soon. It can certainly be attached to a multi-terabyte warehouse, but would have to work with multi-gigabyte extracts. For most purposes, that’s plenty.

While I’m on the topic of scalability, let me repeat a couple of points I made in the comments on the August post. One addresses the notion that QlikTech can replace a data warehouse. This is true in the sense that QlikView can indeed load and join data directly from operational systems. But a data warehouse is usually more than a federated view of current operational tables. Most warehouses include data integration to link otherwise-disconnected operational data. For example, customer records from different systems often can only be linked through complex matching techniques because there is no shared key such as a universal customer ID. QlikView doesn’t offer that kind of matching. You might be able to build some of it using QlikView scripts, but you’d get better results at a lower cost from software designed for the purpose.
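To see why, here is a deliberately naive matching rule sketched in Python (the records and threshold are hypothetical); real matching software layers many more techniques on top of this:

```python
import re
from difflib import SequenceMatcher

def normalize(name: str) -> str:
    """Lowercase and strip punctuation before comparing."""
    return re.sub(r"[^a-z ]", "", name.lower()).strip()

def likely_same(a: dict, b: dict, threshold: float = 0.85) -> bool:
    """Naive rule: identical ZIP plus a fuzzy name score above threshold."""
    if a["zip"] != b["zip"]:
        return False
    score = SequenceMatcher(None, normalize(a["name"]),
                            normalize(b["name"])).ratio()
    return score >= threshold

crm_record = {"name": "Robert J. Smith", "zip": "10001"}
billing_record = {"name": "Bob Smith", "zip": "10001"}

# Plain string similarity can't see that "Bob" is "Robert": exactly the
# kind of knowledge that dedicated matching software encodes.
print(likely_same(crm_record, billing_record))
```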

In addition, most warehouses store historical information that is not retained in operational systems. A typical example is an end-of-month account balance. Some of these values can be recreated from transaction details, but it’s usually much easier just to take and store a snapshot. Other data may simply be removed from operational systems after a relatively brief period. QlikView can act as a repository for such data: in fact, it’s quite well suited for this. Yet in such cases, it’s probably more accurate to say that QlikView is acting as the data warehouse than to say a warehouse is not required.

I hope this clarifies matters without discouraging anyone from considering QlikTech. Yes, QlikView is a fabulous product. No, it won’t replace your multi-terabyte data warehouse. Yes, it will complement that warehouse, or possibly substitute for a much smaller one, by providing a tremendously flexible and efficient business intelligence system. No, it won’t run itself: you’ll still need some technical skills to do complicated things on large data volumes. But for a combination of speed, power, flexibility and cost, QlikTech can’t be beat.
Posted in analysis systems, analytics tools, business intelligence, database technology, qliktech, qlikview

Wednesday, 31 October 2007

Independent Teradata Makes New Friends

Posted on 18:21 by Unknown
I had a product briefing from Teradata earlier this week after not talking with them for nearly two years. They are getting ready to release version 6 of their marketing automation software, Teradata Relationship Manager (formerly Teradata CRM). The new version has a revamped user interface and a large number of minor refinements, such as allowing multiple levels of control groups. But the real change is technical: the system has been entirely rebuilt on a J2EE platform. This was apparently a huge effort – when I checked my notes from two years ago, Teradata was talking about releasing the same version 6 with pretty much the same changes. My contact at Teradata told me the delay was due to difficulties with the migration. She promises the current schedule for releasing version 6 by December will definitely be met.

I’ll get back to v6 in a minute, but I did want to mention the other big news out of Teradata recently: alliances with Assetlink and Infor for marketing automation enhancements, and with SAS Institute for analytic integration. Each deal has its own justification, but it’s hard not to see them as showing a new interest in cooperation at Teradata, whose proprietary technology has long kept it isolated from the rest of the industry. The new attitude might be related to Teradata’s spin-off from NCR, completed October 1, which presumably frees (or forces) management to consider options it rejected while inside the NCR family. It might also reflect increasing competition from database appliances like Netezza, DATAllegro, and Greenplum. (The Greenplum Web site offers links to useful Gartner and Ventana Research papers if you want to look at the database appliance market in more detail.)

But I digress. Let’s talk first about the alliances and then v6.

The Assetlink deal is probably the most significant yet least surprising new arrangement. Assetlink is one of the most complete marketing resource management suites, so it gives Teradata a quick way to provide a set of features that are now standard in enterprise marketing systems. (Teradata had an earlier alliance in this area with Aprimo, but that never took hold. Teradata mentioned technical incompatibility with Aprimo’s .NET foundation as well as competitive overlap with Aprimo’s own marketing automation software.) In the all-important area of integration, Assetlink and Teradata will both run on the same data structures and coordinate their internal processes, so they should work reasonably seamlessly. Assetlink still has its own user interface and workflow engine, though, so some separation will still be apparent. Teradata stressed that it will be investing to create a version of Assetlink that runs on the Teradata database and will sell that under the Teradata brand.

The Infor arrangement is a little more surprising because Infor also has its own marketing automation products (the old Epiphany system) and because Infor is more oriented to mid-size businesses than the giant retailers, telcos, and others served by Teradata. Perhaps the separate customer bases make the competitive issue less important. In any event, the Infor alliance is limited to Infor’s real time decision engine, currently known as CRM Epiphany Inbound Marketing, which was always Epiphany’s crown jewel. Like Assetlink, Infor gives Teradata a quick way to offer a capability (real time interaction management, including self-adjusting predictive models) that is increasingly requested by clients and offered by competitors. Although Epiphany is also built on J2EE, the initial integration (available today) is still limited: the software will run on a separate server using SQL Server as its data store. A later release, due in the first quarter of next year, will still have a separate server but connect directly with the Teradata database. Even then, though, real-time interaction flows will be defined outside of Teradata Relationship Manager. Integration will be at the data level: Teradata will provide lists of customers who are eligible for different offers and will be notified of interaction results. Teradata will be selling its own branded version of the Infor product too.

The SAS alliance is described as a “strategic partnership” in the firms’ joint press release, which sounds jarring coming from two former competitors. Basically, it involves running SAS analytic functions inside the Teradata database. This turns out to be part of a larger SAS initiative called “in-database processing,” which seeks similar arrangements with other database vendors. Teradata is simply the first partner to be announced, so maybe the relationship isn’t so special after all. On the other hand, the companies’ joint roadmap includes deeper integration of selected SAS “solutions” with Teradata, including mapping of industry-specific SAS logical data models to corresponding Teradata structures. The companies will also create a joint technical “center of excellence” where specialists from both firms will help clients improve performance of SAS and Teradata products. We’ll see whether other database vendors work this closely with SAS. In the specific area of marketing automation, the two vendors will continue to compete head-to-head, at least for the time being.

This brings us back to Teradata Relationship Manager itself. As I already mentioned, v6 makes major changes at the deep technical level and in the user interface, but the 100+ changes in functionality are relatively minor. In other words, the functional structure of the product is the same.

This structure has always been different from other marketing automation systems. What sets Teradata apart is a very systematic approach to the process of customer communications: it’s not simply about matching offers to customers, but about managing all the components that contribute to those offers. For example, communication plans are built up from messages, which contain collateral, channels and response definitions, and the collateral itself may contain personalized components. Campaigns are created by attaching communication plans to segment plans, which are constructed from individual segments. All these elements in turn are subject to cross-campaign constraints on channel capacity, contacts per customer, customer channel preferences, and message priorities. In other words, everything is related to everything else in a very logical, precise fashion – just like a database design. Did I mention that Teradata is a database company?
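Here is a minimal sketch of that compositional structure as hypothetical Python dataclasses (my own illustration of the relationships described above, not Teradata’s actual schema):

```python
from dataclasses import dataclass, field
from typing import Dict, List

@dataclass
class Message:                      # the unit of communication
    collateral: str                 # may itself contain personalized parts
    channel: str
    response_definition: str

@dataclass
class CommunicationPlan:            # built up from messages
    messages: List[Message] = field(default_factory=list)

@dataclass
class SegmentPlan:                  # constructed from individual segments
    segments: List[str] = field(default_factory=list)

@dataclass
class Campaign:                     # a communication plan attached to a segment plan
    communication_plan: CommunicationPlan
    segment_plan: SegmentPlan

@dataclass
class CrossCampaignConstraints:     # applied across all campaigns
    channel_capacity: Dict[str, int]
    max_contacts_per_customer: int
    message_priorities: List[str]
```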

This approach takes some practice before you understand how the parts are connected – again, like a sophisticated database. It can also make simple tasks seem unnecessarily complicated. But it rewards patient users with a system that handles complex tasks accurately and supports high volumes without collapsing. For example, managing customers across channels is very straightforward because all channels are structurally equivalent.

The functional capabilities of Relationship Manager are not so different from those of Teradata’s main competitors (SAS Marketing Automation and Unica). But those products have evolved incrementally, often through acquisition, and parts are still sold as separate components. It’s probably fair to say that they are not as tightly or logically integrated as Teradata’s.

This very tight integration also has drawbacks, since any changes to the data structure need careful consideration. Teradata definitely has a tendency to fit new functions into existing structures, such as setting up different types of campaigns (outbound, multi-step, inbound) through a single interface. Sometimes that’s good; sometimes it’s just easier to do different things in different ways.

Teradata has also been something of a laggard at integrating statistical modeling into its system. Even what it calls “optimization” is rule-based rather than the constrained statistical optimization offered by other vendors. I’m actually rather fond of Teradata’s optimization approaches: its ability to allocate leads across channels based on sophisticated capacity rules (e.g., minimum and maximum volumes from different campaigns; automatically sending overflow from one channel to another; automatically reallocating leads based on current work load) has always impressed me and I believe remains unrivaled. But allowing marketers to build and deploy true predictive models is increasingly important and, unless I’ve missed something, is still not offered by Teradata.
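Here is a minimal sketch, in Python, of how rule-based allocation with overflow might work; the channels, capacities and rules are my own hypothetical illustration, not Teradata’s implementation:

```python
def allocate(leads, capacities, overflow_order):
    """Assign each lead to its preferred channel; route overflow down the list."""
    remaining = dict(capacities)
    assignments = {}
    for lead_id, preferred in leads:
        # Try the preferred channel first, then the overflow sequence.
        for channel in [preferred] + overflow_order:
            if remaining.get(channel, 0) > 0:
                remaining[channel] -= 1
                assignments[lead_id] = channel
                break
        else:
            assignments[lead_id] = "deferred"  # no capacity anywhere today
    return assignments

leads = [("lead1", "call_center"), ("lead2", "call_center"), ("lead3", "email")]
print(allocate(leads, {"call_center": 1, "email": 5}, ["email", "direct_mail"]))
# lead1 -> call_center; lead2 overflows to email; lead3 -> email
```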

This is why the new alliances are so intriguing. Assetlink adds a huge swath of capabilities that Teradata would otherwise have created slowly and painstakingly by expanding its core data model. Infor and SAS both address the analytical weaknesses of the existing system, while Infor in particular adds another highly desired feature without waiting to build new structures in-house. All these changes suggest a welcome sense of urgency in responding to customer needs. If this new attitude holds, it seems unlikely that Teradata will accept another two-year delay in the release of Relationship Manager version 7.
Posted in customer relationship management, database technology, marketing automation, marketing software, software selection, vendor evaluation

Monday, 8 October 2007

Proprietary Databases Rise Again

Posted on 10:33 by Unknown
I’ve been noticing for some time that “proprietary” databases are making a comeback in the world of marketing systems. “Proprietary” is a loaded term that generally refers to anything other than the major relational databases: Oracle, SQL Server and DB2, plus some of the open source products like MySQL. In the marketing database world, proprietary systems have a long history tracing back to the mid-1980’s MCIF products from Customer Insight, OKRA Marketing, Harte-Hanks and others. These originally used specialized structures to get adequate performance from the limited PC hardware available in the mid-1980’s. Their spiritual descendants today are Alterian and SmartFocus, both with roots in the mid-1990’s Brann Viper system and both having reinvented themselves in the past few years as low cost / high performance options for service bureaus to offer their clients.

Nearly all the proprietary marketing databases used some version of an inverted (now more commonly called “columnar”) database structure. In such a structure, data for each field (e.g., Customer Name) is physically stored in adjacent blocks on the hard drive, so it can be accessed with a single read. This makes sense for marketing systems, and analytical queries in general, which typically scan all contents of a few fields. By contrast, most transaction processes use a key to find a particular record (row) and read all its elements. Standard relational databases are optimized for such transaction processing and thus store entire rows together on the hard drive, making it easy to retrieve their contents.
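A minimal sketch in Python of the two layouts (hypothetical data) may help:

```python
# The same three records in the two layouts. A transactional row store
# keeps each record's fields together; a columnar store keeps each
# field's values together, so an analytical scan of one field reads one
# contiguous block instead of every full row.
rows = [
    ("Alice", 34, 120.0),
    ("Bob",   51,  80.5),
    ("Carol", 29, 200.0),
]

columns = {
    "customer_name": ["Alice", "Bob", "Carol"],
    "age":           [34, 51, 29],
    "revenue":       [120.0, 80.5, 200.0],
}

total_from_rows = sum(r[2] for r in rows)        # touches every full row
total_from_columns = sum(columns["revenue"])     # touches one field's block
assert total_from_rows == total_from_columns == 400.5
```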

Columnar databases themselves date back at least to mid-1970’s products including Computer Corporation of America Model 204, Software AG ADABAS, and Applied Data Research (now CA) Datacom/DB. All of these are still available, incidentally. In an era when hardware was vastly more expensive, the great efficiency of these systems at analytical queries made them highly attractive. But as hardware costs fell and relational databases became increasingly dominant, they fell by the wayside except in special situations. Their sweet spot of high-volume analytical applications was further invaded by massively parallel systems (Teradata and more recently Netezza) and multi-dimensional data cubes (Cognos Powerplay, Oracle/Hyperion EssBase, etc.). These had different strengths and weaknesses but still competed for some of the same business.

What’s interesting today is that a new generation of proprietary systems is appearing. Vertica has recently gained a great deal of attention due to the involvement of database pioneer Michael Stonebraker, architect of INGRES and POSTGRES. (Click here for an excellent technical analysis by the Winter Corporation; registration required.) QD Technology, launched last year (see my review), isn’t precisely a columnar structure, but uses indexes and compression in a similar fashion. I can’t prove it, but suspect the new interest in alternative approaches is because analytical databases are now getting so large—tens and hundreds of gigabytes—that the efficiency advantages of non-relational systems (which translate into cost savings) are now too great to ignore.

We’ll see where all this leads. One of the few columnar systems introduced in the 1990’s was Expressway (technically, a bit map index—not unlike Model 204), which was purchased by Sybase and is now moderately successful as Sybase IQ. I think Oracle also added some bit-map capabilities during this period, and suspect the other relational database vendors have their versions as well. If columnar approaches continue to gain strength, we can certainly expect the major database vendors to add them as options, even though they are literally orthogonal to standard relational database design. In the meantime, it’s fun to see some new options become available and to hope that costs will come down as new competitors enter the domain of very large analytical databases.
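To illustrate the bit map idea mentioned above, here is a minimal sketch in Python (hypothetical data): one bit vector per distinct value of a low-cardinality field, with boolean predicates answered by bitwise operations over those vectors.

```python
# One bit vector per distinct value; boolean predicates become fast
# bitwise operations over those vectors.
region = ["East", "West", "East", "South", "West", "East"]

bitmaps = {value: [1 if r == value else 0 for r in region]
           for value in set(region)}

# WHERE region = 'East' OR region = 'West':
east_or_west = [e | w for e, w in zip(bitmaps["East"], bitmaps["West"])]
print(east_or_west)  # [1, 1, 1, 0, 1, 1]
```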
Posted in analytics tools, business intelligence, columnar database, database technology, marketing software