OLAP Platform Construction Practice of ClickHouse in Industrial Internet Scenario

  • JDT Developer
  • 2022-01-12
  • IP归属:未知
  • 111 browse

    I. Background


     
    JD Industry BG (“JD Industry”) is a new independent business group established in 2021, engaging in four major business segments including industrial products, industrial services, and industrial interconnection. Industrial interconnection business mainly aims to build industrial Internet platforms where real-time on-site industrial data will be imported for analysis and data intelligence. Currently, it supports the businesses such as the state grid management platform, integrated energy, carbon neutral trading and electricity trading. The application of JD Cloud’s ClickHouse in integrated energy of JD Industry will be highlighted in the article. The data in the industrial Internet scenario mainly has the following three features:

    I. Large volume
    There may be hundreds of devices in micro customers and hundreds of thousands of instrumentation devices for large customers. The reporting frequency of these devices ranges from 1 to 60 per minute, so the volume of reported data is very large.

    II. High requirements for real-time query
    Most customers often use visualization screen real-time applications as real-time dashboards, and keep monitoring them, making real-time applications the most frequently used applications.


    clickhouse工业.jpg

    Example of a Real-Time Application Dashboard


    III. High requirements for data consistency


    The underlying environment may change frequently, resulting in unpredictable dirty data. However, mistakes are not allowed by customers. In the case of device disorder, there have been instances where the wrongful deployment of devices resulted in misalignment of reported data in a long period; in the case of new fields, new fields may be added if a customer updates the devices globally; in the case of unit conversion, such as converting t/h into kg/h, the value may be magnified 1,000 times instantly.


    II. Technical Architecture


    Industrial Internet platform has also experienced the evolution of the following technical architecture due to the evolution of business architecture.


    clickhouse工业2.jpg



    Architecture 1.0 Storage-aggregation separation


     
    Generation I architecture

    The overall architecture of the Generation I is shown below, which is comprised of the following important components: data filtering, data processing engine, Influxdb and Kylin.

    clickhouse工业3.jpg


    Data filtering


    1. The Policy Factory filters dirty data by global general rules + enterprise-specific personalization rules.
    2. Process for handling abnormal data. Changes in the environment ahead may lead to misjudgment, in this case, abnormal data shall be forwarded and stored temporarily.

    Data processing engine

    1. Widening of dimensions and features
    Such as the device dimensions, organizational structure dimensions, and alarm rules used for OLAP and algorithm model

    2. Data differential
    - Dimensions such as traffic and electricity are convenient for complex query instantaneous status and complex aggregation
    - Anti-intermediate data loss



    3. Warning and alarm


    - Monitor whether non-dirty data triggers alert rules set by the enterprise
    - Trigger the alarm processing process


    Influxdb


    Entering detailed time series data into database

    Kylin

    Query aggregated data


    Business performance and feelings

    clickhouse工业4.jpg


    Actual Business Performance of Kylin

    1. Dependent hadoop environments should be built in advance, resulting in heavy operation and maintenance workload.
    2. The combinational relationships of cube, dimension, measures, etc. should be preset in advance and cannot be changed after launch.
    3. It is merely a query engine based on predictive computing that cannot store data, so it is impossible to query detailed historical data.


    clickhouse工业5.jpg

    Influxdb Architecture


    Actual Business Performance of Influxdb


    1. At that time, the distributed version was not open sourced, and operation and maintenance was rarely known by people then;
    2. It is hard to find supporting tools and documentation, and get started. And the concepts of tag, series, fields, etc. were not easy to understand...;
    3. The aggregation query performance was poor, and the detailed query had not ever been improved;
    4. It hasn’t been endorsed by core applications of large manufacturers.

    Comprehensive Feelings

    1. Heavy operation and maintenance workload: Two sets of databases should be operated and maintained, in addition, Kylin often crashed in the test environment with average performance
    2. Difficult to use: One had to distinguish between querying detailed and aggregated data
    3. Data inconsistency: The data should often be changed in IoT and industrial scenarios
    4. The architecture could basically meet the requirements, but its demerits outweighed merits, therefore, a better architecture solution was needed


    Architecture 2.0 ClickHouse that merged the two into one


      The Generation II architecture aimed to address the three pain points of the first generation:
    I. The operation and maintenance workload was heavy, and two sets of storage frameworks were required;
    II. It was difficult to use and the data sources should be distinguished during querying;
    III. It was difficult to maintain the consistency of data.

    The Generation II architecture employed the practice recommended officially by ClickHouse: Kafka engine table + objectification process + local table + distributed table


    clickhouse工业6.jpg


    Merits and Demerits of Generation II Architecture

    Merits:
    1. It addressed all the pain points and required only one set of solution.
    2. It has outstanding performance. It supports stable and high-throughput data writing, which can meet the basic storage requirements for middle-end construction. It features ultra-high data compression rate which helps save disk storage. In addition, with reasonable index settings, the data query speed is extremely fast.

    Demerits:
    1. Huge data volume; 2. The QPS was not high enough for toC.


    Business Performance

    clickhouse工业7.jpg

    ClickHouse Architecture


    clickhouse工业8.jpg

    Performance Comparison of ClickHouse


    Practical Feeling

    I. The architecture is clear and concise. It adopts the simplest multi-primary shard structure, and only depends on several zks, enabling the setting up of any operation and maintenance in merely half a day.

    II. It is a one-stop solution that can not only store data, but also query data, with favorable performance optimization of the query internally by default.

    III. It is easy to use with user friendly SQL, basic mysql skills are adequate for using ClickHouse freely.

    In summary, ClickHouse is an excellent package for storing and querying data, and is a great choice for scenario of querying aggregated data that involves a lot of actions, such as “group by” and sorting. It also provides adequate support for DBMS. In addition, it operates the same as mysql, greatly simplifying the operation and maintenance, as well as R&D.


    New Problems


    Large-screen applications suffer performance bottlenecks

    clickhouse工业9.jpg

    clickhouse工业10.jpg


    Main bottlenecks


    I. High data throughput for instantaneous query
    The data is basically partitioned by day. If switched to “year”, the data of 365 partitions will be aggregated instantaneously for the API, and the API will be delayed for 5~10s.

    II. High QPS for instantaneous query
    There are a surprising number of large-screen application components. Roughly, more than a dozen SQLs will be submitted to ClickHouse in an instant when the large screen of the home page is refreshed. In the case of query by year that spans 365 partitions, the workload will increase sharply.

    It is advisable to perform a maximum of 100 query operations per second.


    In view of the above, people began to consider trying to build a real-time data warehouse in the next-generation architecture: separating production from consumption


    Architecture 3.0 ClickHouse + real-time data warehouse


    The Generation III architecture is as follows:

    clickhouse工业11.jpg


    DWD detailed data layer:


    1. Detailed big and wide table data by topic.
    The big and wide table is decoupled and split in business.
    2. Dimensional data.
    It is recommended to widen the dimensions in advance to avoid “join”; spare fields can be reserved for mapping in business.

    DWS data aggregation layer:
    1. Aggregation of objectification process
    AggregatingMergeTree
    Orientation: Oriented to scenarios where detailed data is not frequently modified
    Merits: Simple and fast implementation, smooth performance, and obvious query optimization
    Demerits: It takes time for evaluation and modification in case of change in detailed data


    2. Fixed-time task scheduling


    Online/offline scheduling of script or visual tasks
    Orientation: Oriented to scenarios where detailed data is frequently modified, strong data consistency is required
    Merits: Data consistency can be guaranteed
    Demerits: Pre-construction work is required, including analysis and extraction of business data models.

    ADS data application layer:
    1. Data production and consumption decoupling layer: It employs Redis, and is designed according to QPS 5W+
    2. Data application: It employs the data in Redis directly without accessing the lower-level data warehouse.


    III. Summary of ClickHouse Practice

        
    Applicable Scenarios of ClickHouse

    I. It is applicable to OLAP scenario for complex query of aggregated data and does not support the OLTP scenario.
    It is typically characterized by the use of a large number of “group by” and “order by” operations.


    II. It needs to support stable mass data writing.


    ClickHouse can generally support data writing of hundreds of MB per second, and this figure would be greater for optimized version.

    III. It does not require frequent queries.
    It can be used without login. It is advisable to perform a maximum of 100 query operations per second. However, this speed can be greatly increased after optimization according to the above real-time data warehouse solution. It is most suitable for scenario of internal use by internal operation personnel.

    IV. It is an excellent solution in scenario where you need detailed query beside OLAP aggregation.

    V. In addition, it does not require advanced DBMS capabilities such as transactional function, nor frequently complex inter-table operations such as join.


    Practice in Applicable Scenarios of ClickHouse


    I. It employs the officially advised big and wide table process, i.e. Kafka engine table + objectification process + local table + distributed table.

    It solves problems by means of advanced ETL process of “dimension widening + big and wide table + reasonable objectification”, instead of complex inter-table join.

    A large number of different table engines of the mergeTree family are used, such as in data deduplication, data aggregation and other special scenarios.

    II. All database solutions are part of the data warehouse solution, which should be taken into account from the perspective of the data warehouse.


    There’s no absolute “silver bullet” that can solve all data problems. All we could do is to use databases with reasonable collocation, maximize strengths and circumvent weaknesses, and give full play to their functions.


    clickhouse工业12.jpg

    ClickHouse Engine Table Family




    number of articles
    544
    reading volume
    3296738