Introduction to Apache Kylin
2025-04-20 18:44 UTC by Ismail Ajagbe

1. Introduction

Apache Kylin is an open-source OLAP engine built to bring sub-second query performance to massive datasets. Originally developed by eBay and later donated to the Apache Software Foundation, Kylin has grown into a widely adopted tool for big data analytics, particularly in environments dealing with trillions of records across complex pipelines.

The platform is known for blending OLAP performance with the scale of distributed systems. It bridges the gap between complex, large-scale data storage and the speed requirements of modern business intelligence tools, enabling faster decisions on fresher data.

In this tutorial, we’ll explore the core features that make Kylin stand out, walk through its architecture, and look at how it changes the game in big data analytics. Let’s get started!

2. Understanding Apache Kylin’s Core Capabilities

Let’s start by looking at what Apache Kylin does well.

Apache Kylin delivers sub-second latency even when operating on datasets that span trillions of rows. This is possible due to its heavy use of pre-computed data models and optimized indexing. When performance and speed are critical, Kylin shines.

Similarly, Kylin also easily handles high concurrency. Whether the system is serving hundreds of queries simultaneously or performing heavy aggregations, the underlying architecture is built to scale without becoming a bottleneck.

Another strength is Kylin’s unified big data warehouse architecture. It integrates natively with the Hadoop ecosystem and data lake platforms, making it a solid fit for organizations already invested in distributed storage. For visualization and business reporting, Kylin integrates seamlessly with tools like Tableau, Superset, and Power BI. It exposes query interfaces that allow us to explore data without needing to understand the underlying complexity.

Furthermore, if we’re looking for production-ready features, Kylin provides robust security, metadata management, and multi-tenant capabilities, making it suitable for enterprise use at scale. Kylin’s performance isn’t just luck; its components are engineered from the ground up using multidimensional modeling, smart indexing, and an efficient data-loading pipeline.

Let’s take a closer look at how each of these elements contributes to its capabilities.

2.1. Multidimensional Modeling and the Role of Models

At the heart of Kylin is its data model, which is built using star or snowflake schemas to define the relationships between the underlying data tables. In this structure, we define dimensions, which are the perspectives or categories we want to analyze (like region, product, or time). Alongside them are measures, which are aggregated numerical values such as total sales or average price.

Kylin also supports computed columns, which let us define new fields using expressions or transformations; these are useful for standardizing date formats or creating derived attributes. Joins are handled during the model definition stage, allowing Kylin to understand table relationships and optimize the model accordingly.
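
To make this concrete, here's a minimal sketch of the kind of SQL shape such a model captures, assuming hypothetical sales and products tables (the column names are illustrative, not part of Kylin's API):

SELECT
  p.category                AS product_category,  -- dimension from the joined dimension table
  s.region,                                       -- dimension from the fact table
  s.order_date,                                   -- dimension
  SUM(s.price * s.quantity) AS total_sales,       -- measure over a computed expression
  AVG(s.price)              AS avg_price          -- measure
FROM sales s
JOIN products p ON s.product_id = p.product_id
GROUP BY p.category, s.region, s.order_date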

Once a model is built, it becomes the foundation for index creation and data loading.

2.2. Index Design and Pre-Computation (CUBEs)

To achieve its speed, Kylin heavily relies on pre-computation. It builds indexes (also known as CUBEs) that aggregate data ahead of time based on the model dimensions and measures. There are two main types of indexes in Kylin:

  • Aggregate Indexes: These store pre-aggregated combinations of dimensions and measures, such as total revenue by product and month.
  • Table Indexes: These store detail-level records and serve detailed or drill-down queries, like fetching the last 50 orders placed by a specific user.

By precomputing the possible combinations and storing them efficiently, Kylin avoids the need to scan raw data at query time. This drastically reduces latency, even for complex analytical queries.
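
For illustration, assuming a sales_data table like the one we model later (the user_id column is hypothetical), the first query below is the kind an aggregate index answers from pre-aggregated results, while the second, detail-level query is the kind a table index serves:

-- typically answered by an aggregate index: reads pre-aggregated combinations
SELECT region, SUM(order_amount) AS total_sales
FROM sales_data
GROUP BY region;

-- typically answered by a table index: detail rows, no aggregation
SELECT order_id, order_date, order_amount
FROM sales_data
WHERE user_id = 42
ORDER BY order_date DESC
LIMIT 50;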

Notably, index design is critical. The more targeted and efficient the indexes are, the less storage and processing power is consumed during query time.

2.3. Data Loading Process

Once the model and indexes are in place, we need to load the data. Data loading in Kylin involves building the CUBEs and populating them with pre-computed results.

Traditionally, this is done in batch mode using offline data. Kylin reads from source tables, often from Hive or Parquet files in HDFS, and processes the data into its index structures.

In addition, there’s also support for streaming sources like Apache Kafka, enabling near real-time ingestion and analysis. This makes it possible to use Kylin in hybrid batch-streaming scenarios without changing the analytical layer.

Importantly, once we load the data, queries run against the pre-built indexes instead of raw datasets, providing consistent and predictable performance regardless of the underlying volume.

3. How to Run Apache Kylin in Docker

The fastest way to explore Apache Kylin is by spinning it up in a Docker container. This is perfect if we want to test out new features locally or evaluate the latest releases.

Let’s see a docker run command to start using Apache Kylin:

$ docker run -d \
  --name Kylin5-Machine \
  --hostname localhost \
  -e TZ=UTC \
  -m 10G \
  -p 7070:7070 \
  -p 8088:8088 \
  -p 9870:9870 \
  -p 8032:8032 \
  -p 8042:8042 \
  -p 2181:2181 \
  apachekylin/apache-kylin-standalone:5.0.0-GA

Here, we pull the standalone image for Apache Kylin 5.0 and launch it as a container, exposing the common service ports for easy access:

  • --name: assigns a name to the container
  • --hostname: sets the container’s hostname (helpful for internal references)
  • -e TZ=UTC: sets the timezone to UTC
  • -m 10G: limits the container’s memory usage to 10 GB (it’s recommended to assign at least 10GB of memory to the container for a smoother experience)
  • -p options: map essential Kylin and Hadoop-related service ports from the container to the host
  • apachekylin/apache-kylin-standalone:5.0.0-GA: the image, which includes all necessary services bundled together

While the docker run command itself doesn’t produce output beyond the container ID, we can validate that it’s running with docker ps:

$ docker ps --filter name=Kylin5-Machine
CONTAINER ID   IMAGE                                         STATUS          PORTS                                             NAMES
abc123456789   apachekylin/apache-kylin-standalone:5.0.0-GA   Up 10 seconds   0.0.0.0:7070->7070/tcp, ...                      Kylin5-Machine

Once we’re sure that the container is up, we can access the Kylin web UI at http://localhost:7070 and start exploring. This setup gives us everything we need to build models and explore datasets in a self-contained environment.
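
If the UI isn’t reachable right away, the bundled services may still be starting up. We can follow the container logs while everything initializes:

$ docker logs -f Kylin5-Machine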

3.1. Verifying the Kylin Instance

Once the container is running, we can verify the instance using a simple health check via curl:

$ curl http://localhost:7070/kylin/api/system/health

If everything is working, we should see a response indicating the server status as UP:

{
    "status": "UP",
    "storage": {
        "status": "UP"
    },
    "metadata": {
        "status": "UP"
    },
    "query": {
        "status": "UP"
    }
}

This confirms that Kylin’s internal services (metadata, query engine, and storage) are up and ready to accept operations.

3.2. Accessing the Kylin Web Interface

The Kylin UI will be available at http://localhost:7070. We can use the default credentials to log in:

Username: ADMIN
Password: KYLIN

Once the container is up, the Spark and Hadoop UI components are also accessible through the other exposed ports.

From here, we can create a project, upload a data model, and begin building CUBEs. The interface also includes sections for managing metadata, monitoring build jobs, and testing SQL queries interactively.

4. How to Define a Model and Build a CUBE in Apache Kylin Using SQL

With Kylin, we can also define models and kick off CUBE builds using plain SQL and the REST API. This makes the process cleaner, automatable, and perfect for dev-heavy environments. Let’s walk through it.

4.1. Loading a Table Into Kylin

Assuming the source table sales_data exists in Hive or a similar catalog, we begin by telling Kylin about it.

To do this, we can make a POST request to the /tables API via curl:

$ curl -X POST http://localhost:7070/kylin/api/tables/default.sales_data \
  -H "Authorization: Basic $(echo -n 'ADMIN:KYLIN' | base64)" \
  -d '{"project":"sales_analytics"}'

Here, we register our source table, sales_data, into the sales_analytics project. This tells Kylin to pull metadata for the sales_data table from the configured catalog (like Hive or JDBC). Let’s see an example output:

{
    "uuid": "fcbe5a9a-xxxx-xxxx-xxxx-87d8c1e6b2c5",
    "database": "default",
    "name": "sales_data",
    "project": "sales_analytics"
}

As we can see, once registered, it’s available for model creation.

4.2. Creating a Model From SQL

Here’s where things get interesting. We can now define a model using a SQL statement, and Kylin infers the dimensions, measures, and joins automatically.

Let’s see an example SQL:

SELECT
  order_id,
  product_id,
  region,
  order_date,
  SUM(order_amount) AS total_sales
FROM sales_data
GROUP BY order_id, product_id, region, order_date

This tells Kylin what the dimensions and measures are.

Now, let’s send this SQL to Kylin’s modeling engine via API:

$ curl -X POST http://localhost:7070/kylin/api/models/sql \
  -H "Authorization: Basic $(echo -n 'ADMIN:KYLIN' | base64)" \
  -H "Content-Type: application/json" \
  -d '{
    "project": "sales_analytics",
    "modelName": "sales_cube_model",
    "sql": "SELECT order_id, product_id, region, order_date, SUM(order_amount) AS total_sales FROM sales_data GROUP BY order_id, product_id, region, order_date"
  }'

If the request is successful, Kylin creates a new model that includes all the columns mentioned, along with a basic aggregation on order_amount:

{
    "model_id": "sales_cube_model",
    "status": "ONLINE",
    "fact_table": "sales_data",
    "dimensions": ["order_id", "product_id", "region", "order_date"],
    "measures": ["SUM(order_amount)"]
}

This creates a new model, sales_cube_model, by inferring metadata directly from the SQL: Kylin automatically marks the grouping fields as dimensions and registers the aggregation as a measure.

4.3. Triggering a CUBE Build Job

Once the model is created, we can trigger a build job to materialize the index.

First, we get the model’s ID (or name), then we send a build request:

$ curl -X PUT http://localhost:7070/kylin/api/jobs \
  -H "Authorization: Basic $(echo -n 'ADMIN:KYLIN' | base64)" \
  -H "Content-Type: application/json" \
  -d '{
    "model_name": "sales_cube_model",
    "project": "sales_analytics",
    "build_type": "BUILD",
    "start_time": 0,
    "end_time": 2000000000000
  }'

After running this, Kylin starts building the CUBE using the default aggregation groups, and outputs the status:

{
    "uuid": "job_3f23c498-xxxx-xxxx-xxxx-9eab1a66f79c",
    "status": "PENDING",
    "exec_start_time": 1711700000000,
    "model_name": "sales_cube_model"
}

This schedules a full CUBE build (covering all time ranges) for the model. Kylin precomputes aggregates defined in the model. The timestamp range here is wide open, which works well for full builds.

4.4. Monitoring the Build Status

To monitor the build, we can check the status of the job at any time using the jobs API:

$ curl -X GET "http://localhost:7070/kylin/api/jobs?projectName=sales_analytics" \
  -H "Authorization: Basic $(echo -n 'ADMIN:KYLIN' | base64)"
[
    {
        "job_status": "FINISHED",
        "model_name": "sales_cube_model",
        "duration": 52300,
        "last_modified": 1711700150000
    }
]

The response shows job stages, status, duration, and whether the build succeeded. Once it reaches "job_status": "FINISHED", we’re ready to query.
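
With the CUBE in place, we can run SQL against the model. As a sketch, assuming the query REST endpoint behaves as in earlier Kylin releases (the exact path and payload may differ between versions), a query could look like this:

$ curl -X POST http://localhost:7070/kylin/api/query \
  -H "Authorization: Basic $(echo -n 'ADMIN:KYLIN' | base64)" \
  -H "Content-Type: application/json" \
  -d '{
    "project": "sales_analytics",
    "sql": "SELECT region, SUM(order_amount) AS total_sales FROM sales_data GROUP BY region"
  }'

Because the aggregates were precomputed during the build, this query is served from the CUBE rather than by scanning the raw sales_data table.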

Notably, Kylin supports index pruning based on query patterns. After a few queries, we can check the index usage stats in the API. We may find that some dimensions are rarely used together, and trimming those combinations from the index definition can improve build times and reduce storage without affecting query coverage.

In short, we’ve fully modeled a dataset, defined aggregations, and materialized a CUBE. We can now query the data, or automate this flow as part of a CI/CD analytics pipeline. For recurring data loads, Kylin’s REST API is script-friendly, so it’s easy to trigger builds at midnight, hourly, or whenever new data lands in the source system.
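
As a minimal sketch of that automation, assuming a small helper script at a hypothetical path like /opt/scripts/build-cube.sh, we could wrap the build request and schedule it with cron:

#!/usr/bin/env bash
# /opt/scripts/build-cube.sh (hypothetical path): trigger a full CUBE build via Kylin's REST API
curl -s -X PUT http://localhost:7070/kylin/api/jobs \
  -H "Authorization: Basic $(echo -n 'ADMIN:KYLIN' | base64)" \
  -H "Content-Type: application/json" \
  -d '{
    "model_name": "sales_cube_model",
    "project": "sales_analytics",
    "build_type": "BUILD",
    "start_time": 0,
    "end_time": 2000000000000
  }'

A matching crontab entry could then run it every night:

# build the CUBE at 00:30 every day and keep a simple log
30 0 * * * /opt/scripts/build-cube.sh >> /var/log/kylin-build.log 2>&1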

5. Conclusion

In this article, we explored Apache Kylin, a purpose-built tool for extreme scale and performance in big data analytics. It combines the power of OLAP modeling with distributed computing to deliver fast, reliable insights across massive datasets.

In its newer releases, the platform adds streaming support, a native compute engine, automated modeling, and smarter metadata handling. These changes make it more approachable, more performant, and more aligned with modern data architectures.

Whether we’re building dashboards, powering real-time metrics, or democratizing data access, Kylin provides the tooling to get it done at scale, and at speed.
