Adapters

Adapters

Schema adapters

A schema adapter allows Calcite to read particular kind of data,presenting the data as tables within a schema.

Cassandra adapter (calcite-cassandra)
CSV adapter (example/csv)
Druid adapter (calcite-druid)
Elasticsearch adapter(calcite-elasticsearch)
File adapter (calcite-file)
Geode adapter (calcite-geode)
JDBC adapter (part of calcite-core)
MongoDB adapter (calcite-mongodb)
OS adapter (calcite-os)
Pig adapter (calcite-pig)
Solr cloud adapter (solr-sql)
Spark adapter (calcite-spark)
Splunk adapter (calcite-splunk)
Eclipse Memory Analyzer (MAT) adapter (mat-calcite-plugin)
Apache Kafka adapter

Other language interfaces

Piglet (calcite-piglet) runs queries in a subset of Pig Latin

Engines

Many projects and products use Apache Calcite for SQL parsing,query optimization, data virtualization/federation,and materialized view rewrite. Some of them are listed on the“powered by Calcite”page.

Drivers

A driver allows you to connect to Calcite from your application.

JDBC driver

The JDBC driver is powered byAvatica.Connections can be local or remote (JSON over HTTP or Protobuf over HTTP).

The basic form of the JDBC connect string is

jdbc:calcite:property=value;property2=value2

where property, property2 are properties as described below.(Connect strings are compliant with OLE DB Connect String syntax,as implemented by Avatica’sConnectStringParser.)

JDBC connect string parameters

Property	Description
approximateDecimal	Whether approximate results from aggregate functions on `DECIMAL` types are acceptable.
approximateDistinctCount	Whether approximate results from `COUNT(DISTINCT …)` aggregate functions are acceptable.
approximateTopN	Whether approximate results from “Top N” queries (`ORDER BY aggFun() DESC LIMIT n`) are acceptable.
caseSensitive	Whether identifiers are matched case-sensitively. If not specified, value from `lex` is used.
conformance	SQL conformance level. Values: DEFAULT (the default, similar to PRAGMATIC_2003), LENIENT, MYSQL_5, ORACLE_10, ORACLE_12, PRAGMATIC_99, PRAGMATIC_2003, STRICT_92, STRICT_99, STRICT_2003, SQL_SERVER_2008.
createMaterializations	Whether Calcite should create materializations. Default false.
defaultNullCollation	How NULL values should be sorted if neither NULLS FIRST nor NULLS LAST are specified in a query. The default, HIGH, sorts NULL values the same as Oracle.
druidFetch	How many rows the Druid adapter should fetch at a time when executing SELECT queries.
forceDecorrelate	Whether the planner should try de-correlating as much as possible. Default true.
fun	Collection of built-in functions and operators. Valid values are “standard” (the default), “oracle”, “spatial”, and may be combined using commas, for example “oracle,spatial”.
lex	Lexical policy. Values are ORACLE (default), MYSQL, MYSQL_ANSI, SQL_SERVER, JAVA.
materializationsEnabled	Whether Calcite should use materializations. Default false.
model	URI of the JSON/YAML model file or inline like `inline:{…}` for JSON and `inline:…` for YAML.
parserFactory	Parser factory. The name of a class that implements `interface SqlParserImplFactory` and has a public default constructor or an `INSTANCE` constant.
quoting	How identifiers are quoted. Values are DOUBLE_QUOTE, BACK_QUOTE, BRACKET. If not specified, value from `lex` is used.
quotedCasing	How identifiers are stored if they are quoted. Values are UNCHANGED, TO_UPPER, TO_LOWER. If not specified, value from `lex` is used.
schema	Name of initial schema.
schemaFactory	Schema factory. The name of a class that implements `interface SchemaFactory` and has a public default constructor or an `INSTANCE` constant. Ignored if `model` is specified.
schemaType	Schema type. Value must be “MAP” (the default), “JDBC”, or “CUSTOM” (implicit if `schemaFactory` is specified). Ignored if `model` is specified.
spark	Specifies whether Spark should be used as the engine for processing that cannot be pushed to the source system. If false (the default), Calcite generates code that implements the Enumerable interface.
timeZone	Time zone, for example “gmt-3”. Default is the JVM’s time zone.
typeSystem	Type system. The name of a class that implements `interface RelDataTypeSystem` and has a public default constructor or an `INSTANCE` constant.
unquotedCasing	How identifiers are stored if they are not quoted. Values are UNCHANGED, TO_UPPER, TO_LOWER. If not specified, value from `lex` is used.
typeCoercion	Whether to make implicit type coercion when type mismatch during sql node validation, default is true.

To make a connection to a single schema based on a built-in schema type, you don’t need to specifya model. For example,

jdbc:calcite:schemaType=JDBC; schema.jdbcUser=SCOTT; schema.jdbcPassword=TIGER; schema.jdbcUrl=jdbc:hsqldb:res:foodmart

creates a connection with a schema mapped via the JDBC schema adapter to the foodmart database.

Similarly, you can connect to a single schema based on a user-defined schema adapter.For example,

jdbc:calcite:schemaFactory=org.apache.calcite.adapter.cassandra.CassandraSchemaFactory; schema.host=localhost; schema.keyspace=twissandra

makes a connection to the Cassandra adapter, equivalent to writing the following model file:

{
  "version": "1.0",
  "defaultSchema": "foodmart",
  "schemas": [
    {
      type: 'custom',
      name: 'twissandra',
      factory: 'org.apache.calcite.adapter.cassandra.CassandraSchemaFactory',
      operand: {
        host: 'localhost',
        keyspace: 'twissandra'
      }
    }
  ]
}

Note how each key in the operand section appears with a schema. prefix in the connect string.

Server

Calcite’s core module (calcite-core) supports SQL queries (SELECT) and DMLoperations (INSERT, UPDATE, DELETE, MERGE)but does not support DDL operations such as CREATE SCHEMA or CREATE TABLE.As we shall see, DDL complicates the state model of the repository and makesthe parser more difficult to extend, so we left DDL out of core.

The server module (calcite-server) adds DDL support to Calcite.It extends the SQL parser,using the same mechanism used by sub-projects,adding some DDL commands:

CREATE and DROP SCHEMA
CREATE and DROP FOREIGN SCHEMA
CREATE and DROP TABLE (including CREATE TABLE … AS SELECT)
CREATE and DROP MATERIALIZED VIEW
CREATE and DROP VIEW
CREATE and DROP FUNCTION
CREATE and DROP TYPE

Commands are described in the SQL reference.

To enable, include calcite-server.jar in your class path, and addparserFactory=org.apache.calcite.sql.parser.ddl.SqlDdlParserImpl#FACTORYto the JDBC connect string (see connect string propertyparserFactory).Here is an example using the sqlline shell.

$ ./sqlline
sqlline version 1.3.0
> !connect jdbc:calcite:parserFactory=org.apache.calcite.sql.parser.ddl.SqlDdlParserImpl#FACTORY sa ""
> CREATE TABLE t (i INTEGER, j VARCHAR(10));
No rows affected (0.293 seconds)
> INSERT INTO t VALUES (1, 'a'), (2, 'bc');
2 rows affected (0.873 seconds)
> CREATE VIEW v AS SELECT * FROM t WHERE i > 1;
No rows affected (0.072 seconds)
> SELECT count(*) FROM v;
+---------------------+
|       EXPR$0        |
+---------------------+
| 1                   |
+---------------------+
1 row selected (0.148 seconds)
> !quit

The calcite-server module is optional.One of its goals is to showcase Calcite’s capabilities(for example materialized views, foreign tables and generated columns) usingconcise examples that you can try from the SQL command line.All of the capabilities used by calcite-server are available via APIs incalcite-core.

If you are the author of a sub-project, it is unlikely that your syntaxextensions match those in calcite-server, so we recommend that you add yourSQL syntax extensions by extending the core parser;if you want DDL commands, you may be able to copy-paste from calcite-serverinto your project.

At present, the repository is not persisted. As you execute DDL commands, youare modifying an in-memory repository by adding and removing objectsreachable from a rootSchema.All commands within the same SQL session will see those objects.You can create the same objects in a future session by executing the samescript of SQL commands.

Calcite could also act as a data virtualization or federation server:Calcite manages data in multiple foreign schemas, but to a client the dataall seems to be in the same place. Calcite chooses where processing shouldoccur, and whether to create copies of data for efficiency.The calcite-server module is a step towards that goal; anindustry-strength solution would require further on packaging (to make Calciterunnable as a service), repository persistence, authorization and security.

Extensibility

There are many other APIs that allow you to extend Calcite’s capabilities.

In this section, we briefly describe those APIs, to give you an idea what ispossible. To fully use these APIs you will need to read other documentationsuch as the javadoc for the interfaces, and possibly seek out the tests thatwe have written for them.

Functions and operators

There are several ways to add operators or functions to Calcite.We’ll describe the simplest (and least powerful) first.

User-defined functions are the simplest (but least powerful).They are straightforward to write (you just write a Java class and register itin your schema) but do not offer much flexibility in the number and type ofarguments, resolving overloaded functions, or deriving the return type.

It you want that flexibility, you probably need to write you auser-defined operator(see interface SqlOperator).

If your operator does not adhere to standard SQL function syntax,“f(arg1, arg2, …)”, then you need toextend the parser.

There are many good examples in the tests:class UdfTesttests user-defined functions and user-defined aggregate functions.

Aggregate functions

User-defined aggregate functions are similar to user-defined functions,but each function has several corresponding Java methods, one for eachstage in the life-cycle of an aggregate:

init creates an accumulator;
add adds one row’s value to an accumulator;
merge combines two accumulators into one;
result finalizes an accumulator and converts it to a result.

For example, the methods (in pseudo-code) for SUM(int) are as follows:

struct Accumulator {
  final int sum;
}
Accumulator init() {
  return new Accumulator(0);
}
Accumulator add(Accumulator a, int x) {
  return new Accumulator(a.sum + x);
}
Accumulator merge(Accumulator a, Accumulator a2) {
  return new Accumulator(a.sum + a2.sum);
}
int result(Accumulator a) {
  return new Accumulator(a.sum + x);
}

Here is the sequence of calls to compute the sum of two rows with column values 4 and 7:

a = init()    # a = {0}
a = add(a, 4) # a = {4}
a = add(a, 7) # a = {11}
return result(a) # returns 11

Window functions

A window function is similar to an aggregate function but it is applied to a setof rows gathered by an OVER clause rather than by a GROUP BY clause.Every aggregate function can be used as a window function, but there are somekey differences. The rows seen by a window function may be ordered, andwindow functions that rely upon order (RANK, for example) cannot be used asaggregate functions.

Another difference is that windows are non-disjoint: a particular row canappear in more than one window. For example, 10:37 appears in both the9:00-10:00 hour and also the 9:15-9:45 hour.

Window functions are computed incrementally: when the clock ticks from10:14 to 10:15, two rows might enter the window and three rows leave.For this, window functions have have an extra life-cycle operation:

remove removes a value from an accumulator.

It pseudo-code for SUM(int) would be:

Accumulator remove(Accumulator a, int x) {
  return new Accumulator(a.sum - x);
}

Here is the sequence of calls to compute the moving sum,over the previous 2 rows, of 4 rows with values 4, 7, 2 and 3:

a = init()       # a = {0}
a = add(a, 4)    # a = {4}
emit result(a)   # emits 4
a = add(a, 7)    # a = {11}
emit result(a)   # emits 11
a = remove(a, 4) # a = {7}
a = add(a, 2)    # a = {9}
emit result(a)   # emits 9
a = remove(a, 7) # a = {2}
a = add(a, 3)    # a = {5}
emit result(a)   # emits 5

Grouped window functions

Grouped window functions are functions that operate the GROUP BY clauseto gather together records into sets. The built-in grouped window functionsare HOP, TUMBLE and SESSION.You can define additional functions by implementinginterface SqlGroupedWindowFunction.

Table functions and table macros

_User-defined table functions_are defined in a similar way to regular “scalar” user-defined functions,but are used in the FROM clause of a query. The following query uses a tablefunction called Ramp:

SELECT * FROM TABLE(Ramp(3, 4))

User-defined table macros use the same SQL syntax as table functions,but are defined differently. Rather than generating data, they generate anrelational expression.Table macros are invoked during query preparation and the relational expressionthey produce can then be optimized.(Calcite’s implementation of views uses table macros.)

class TableFunctionTesttests table functions and contains several useful examples.

Extending the parser

Suppose you need to extend Calcite’s SQL grammar in a way that will becompatible with future changes to the grammar. Making a copy of the grammar fileParser.jj in your project would be foolish, because the grammar is editedquite frequently.

Fortunately, Parser.jj is actually anApache FreeMarkertemplate that contains variables that can be substituted.The parser in calcite-core instantiates the template with default values ofthe variables, typically empty, but you can override.If your project would like a different parser, you can provide yourown config.fmpp and parserImpls.ftl files and therefore generate anextended parser.

The calcite-server module, which was created in[CALCITE-707] andadds DDL statements such as CREATE TABLE, is an example that you could follow.Also seeclass ExtensionSqlParserTest.

Customizing SQL dialect accepted and generated

To customize what SQL extensions the parser should accept, implementinterface SqlConformanceor use one of the built-in values inenum SqlConformanceEnum.

To control how SQL is generated for an external database (usually via the JDBCadapter), useclass SqlDialect.The dialect also describes the engine’s capabilities, such as whether itsupports OFFSET and FETCH clauses.

Defining a custom schema

To define a custom schema, you need to implementinterface SchemaFactory.

During query preparation, Calcite will call this interface to find outwhat tables and sub-schemas your schema contains. When a table in your schemais referenced in a query, Calcite will ask your schema to create an instance ofinterface Table.

That table will be wrapped in aTableScanand will undergo the query optimization process.

Reflective schema

A reflective schema(class ReflectiveSchema)is a way of wrapping a Java object so that it appearsas a schema. Its collection-valued fields will appear as tables.

It is not a schema factory but an actual schema; you have to create the objectand wrap it in the schema by calling APIs.

Seeclass ReflectiveSchemaTest.

Defining a custom table

To define a custom table, you need to implementinterface TableFactory.Whereas a schema factory a set of named tables, a table factory produces asingle table when bound to a schema with a particular name (and optionally aset of extra operands).

Modifying data

If your table is to support DML operations (INSERT, UPDATE, DELETE, MERGE),your implementation of interface Table must implementinterface ModifiableTable.

Streaming

If your table is to support streaming queries,your implementation of interface Table must implementinterface StreamableTable.

Seeclass StreamTestfor examples.

Pushing operations down to your table

If you wish to push processing down to your custom table’s source system,consider implementing eitherinterface FilterableTableorinterface ProjectableFilterableTable.

If you want more control, you should write a planner rule.This will allow you to push down expressions, to make a cost-based decisionabout whether to push down processing, and push down more complex operationssuch as join, aggregation, and sort.

Type system

You can customize some aspects of the type system by implementinginterface RelDataTypeSystem.

Relational operators

All relational operators implementinterface RelNodeand most extendclass AbstractRelNode.The core operators (used bySqlToRelConverterand covering conventional relational algebra) areTableScan,TableModify,Values,Project,Filter,Aggregate,Join,Sort,Union,Intersect,Minus,Window andMatch.

Each of these has a “pure” logical sub-class,LogicalProjectand so forth. Any given adapter will have counterparts for the operations thatits engine can implement efficiently; for example, the Cassandra adapter hasCassandraProjectbut there is no CassandraJoin.

You can define your own sub-class of RelNode to add a new operator, oran implementation of an existing operator in a particular engine.

To make an operator useful and powerful, you will needplanner rules to combine it with existing operators.(And also provide metadata, see below).This being algebra, the effects are combinatorial: you write a fewrules, but they combine to handle an exponential number of query patterns.

If possible, make your operator a sub-class of an existingoperator; then you may be able to re-use or adapt its rules.Even better, if your operator is a logical operation that you can rewrite(again, via a planner rule) in terms of existing operators, you should do that.You will be able to re-use the rules, metadata and implementations of thoseoperators with no extra work.

Planner rule

A planner rule(class RelOptRule)transforms a relational expression into an equivalent relational expression.

A planner engine has many planner rules registered and fires themto transform the input query into something more efficient. Planner rules aretherefore central to the optimization process, but surprisingly each plannerrule does not concern itself with cost. The planner engine is responsible forfiring rules in a sequence that produces an optimal plan, but each individualrules only concerns itself with correctness.

Calcite has two built-in planner engines:class VolcanoPlanneruses dynamic programming and is good for exhaustive search, whereasclass HepPlannerfires a sequence of rules in a more fixed order.

Calling conventions

A calling convention is a protocol used by a particular data engine.For example, the Cassandra engine has a collection of relational operators,CassandraProject, CassandraFilter and so forth, and these operators can beconnected to each other without the data having to be converted from one formatto another.

If data needs to be converted from one calling convention to another, Calciteuses a special sub-class of relational expression called a converter(see class Converter).But of course converting data has a runtime cost.

When planning a query that uses multiple engines, Calcite “colors” regions ofthe relational expression tree according to their calling convention. Theplanner pushes operations into data sources by firing rules. If the engine doesnot support a particular operation, the rule will not fire. Sometimes anoperation can occur in more than one place, and ultimately the best plan ischosen according to cost.

A calling convention is a class that implementsinterface Convention,an auxiliary interface (for instanceinterface CassandraRel),and a set of sub-classes ofclass RelNodethat implement that interface for the core relational operators(Project,Filter,Aggregate,and so forth).

Built-in SQL implementation

How does Calcite implement SQL, if an adapter does not implement all of the corerelational operators?

The answer is a particular built-in calling convention,EnumerableConvention.Relational expressions of enumerable convention are implemented as “built-ins”:Calcite generates Java code, compiles it, and executes inside its own JVM.Enumerable convention is less efficient than, say, a distributed enginerunning over column-oriented data files, but it can implement all corerelational operators and all built-in SQL functions and operators. If a datasource cannot implement a relational operator, enumerable convention isa fall-back.

Statistics and cost

Calcite has a metadata system that allow you to define cost functions andstatistics about relational operators, collectively referred to as metadata.Each kind of metadata has an interface with (usually) one method.For example, selectivity is defined byinterface RelMdSelectivityand the methodgetSelectivity(RelNode rel, RexNode predicate).

There are many built-in kinds of metadata, includingcollation,column origins,column uniqueness,distinct row count,distribution,explain visibility,expression lineage,max row count,node types,parallelism,percentage original rows,population size,predicates,row count,selectivity,size,table references, andunique keys;you can also define your own.

You can then supply a metadata provider that computes that kind of metadatafor particular sub-classes of RelNode. Metadata providers can handle built-inand extended metadata types, and built-in and extended RelNode types.While preparing a query Calcite combines all of the applicable metadataproviders and maintains a cache so that a given piece of metadata (for examplethe selectivity of the condition x > 10 in a particular Filter operator)is computed only once.