Why data warehouse ?
To answer this let's take one comman example that you have your busness in multiple countries running for decades.
If you want to know the strength of your employees, If you want to know the sales , If you want to know your products ,
If you want to track the customer etc is not imaginable with out datawarehouse. Hence the complete globe depends and needs datawarehouse.
What is data warehouse ?
A data warehouse is nothing but a relational database which is subject-oriented, integrated, time-variant and non-volatile collection of data
It supports the business for analysing and taking decisions.
Subject-Oriented: Data warehouse is used to analyze a particular subject area. For example, "sales" can be a particular subject.
Integrated: Data warehouse integrates data from multiple data sources. For example, there will be source A and source B may have different approaches
of identifying a product,but in a data warehouse, there will be only a single way of identifying a product.
Time-Variant: Historical data is kept in a data warehouse. For example, one can retrieve data from 3 months, 6 months, 12 months, or even older data from a data warehouse. This contrasts with a transactions system, where often only the most recent data is kept. For example, a transaction system may hold the most recent address of a customer, where a data warehouse can hold all addresses associated with a customer.
Non-volatile: Once data is in the data warehouse, it will not change. So, historical data in a data warehouse should never be altered.
What is a DATA MART ?
A Data Mart is a specific, subject oriented, repository of data designed to answer specific questions for a specific set of users. So an organization could have multiple data marts serving the needs of marketing, sales, operations, collections, etc. A data mart usually is organized as one dimensional model as a star-schema (OLAP cube) made of a fact table and multiple dimension tables.
A Data Mart act as an index and extraction system for rather than bring all the company's data into a single warehouse, the data mart knows what data each database contains and how to extract information from multiple databases when asked.
Generally Data Marts are of two types.
Dependent data Mart
Data warehouse is build carefully after the selection and completion of all data marts and this is called Bottom up approach.
Independent data Mart
Data warehouse is build without depending on data marts here the data marts are independent and this is called top down approach.
What is the difference between DATA WAREHOUE and DATA MART ?
Data Warehouse:
- Holds multiple subject areas
- Holds very detailed information
- Works to integrate all data sources
- Does not necessarily use a dimensional model but feeds dimensional models.
Data Mart
- Often holds only one subject area- for example, Finance, or Sales
- May hold more summarised data (although many hold full detail)
- Concentrates on integrating information from a given subject area or set of source systems
- Is built focused on a dimensional model using a star schema
Shortly we can say Data Mart is restricted to a single business group and where as Data Warehouse is the union of all the Data Marts.
WHAT IS ODS ?
- An operational data store (or "ODS") is a database designed to integrate data from multiple sources to make analysis and reporting easier.
- Complete History data is not available. Data will be stored only for certain period of time.
- An operational data store (ODS) is a place where data from multiple source systems is stored. As a general rule, an ODS is maintained in near real time and the data is usually at the transaction level
WHAT IS STAGING AREA ?
Staging Area is temporary location where data from source systems is copied. A staging area is mainly required in a Data Warehousing Architecture for timing reasons. In short, all required data must be available before data can be integrated into the Data Warehouse.
The business logic is going to develop here after loading the source data into staging area.
What exactly happen in staging Area ?
In staging area the business logic is going to develop as per business requirement.
Data is going to cleanse or Scrub or integrate or aggregate.
Data cleansing: Removing inconsistencies or inaccuracies from the source data is data cleansing.
Data Scrubbing: Deriving new attributes from the existing source attributes is data scrubbing.
Data Integration: A data integration system combines the data residing at
different sources, and provides a unified, reconciled view of these data, called global
schema.
Data Aggregation: Summary data is a key part of data warehouse design. It provides a cost effective means of improving query performance without the cost of investing in additional hardware and storage devices.
Creating summary table structure allows the data architect to design a solution that will rapidly satisfy common queries that access a sub-set of data.
Creating summary table structure allows the data architect to design a solution that will rapidly satisfy common queries that access a sub-set of data.
SCHEMAS IN DATAWARE HOUSE:
There are three schemas in datawarehouse and these are very crucial for the entire datawarehouse. Complete datawarehouse depends on the schema how it designed.
STAR SCHEMA:
A fact table surrounded by dimensional tables and the design looks like a star and hence the name is given as star schema.
FACT TABLE:
A fact table typically has two types of columns: foreign keys to dimension tables and measures those that contain numeric facts. A fact table can contain fact's data on detail or aggregated level.
Fact table tell us the measure value for the business across any of the dimensions that you link to fact.
There are three kinds of facts are there
- Additive: Additive facts are facts that can be summed up through all of the dimensions in the fact table.
- Semi-Additive: Semi-additive facts are facts that can be summed up for some of the dimensions in the fact table, but not the others.
- Non-Additive: Non-additive facts are facts that cannot be summed up for any of the dimensions present in the fact table.
DIMENSION TABLES:
A dimension is a structure usually composed of one or more hierarchies that categorizes data. If a dimension hasn't got a hierarchies and levels it is called flat dimension or list. The primary keys of each of the dimension tables are part of the composite primary key of the fact table. Dimensional attributes help to describe the dimensional value. They are normally descriptive, textual values. Dimension tables are generally small in size than fact table.
A dimension is a structure usually composed of one or more hierarchies that categorizes data. If a dimension hasn't got a hierarchies and levels it is called flat dimension or list. The primary keys of each of the dimension tables are part of the composite primary key of the fact table. Dimensional attributes help to describe the dimensional value. They are normally descriptive, textual values. Dimension tables are generally small in size than fact table.
One major thing here is to understand about the grain or the lowest level of granularity.
Let have a close look on dimension tables.
In TIME dimension table the hierarchy of attributes is Year under Year Quarter and below Month and below month day of the week and below is day which is the lowest level of granularity and we cannot get any sales data below a day.
In LOCATION dimension the hierarchy level is country and under country state_or_province and below city and below street which is lowest level of granularity and we cannot get any sales data below city.
In Item dimension there is no hierarchy level
In Branch dimension there is no grain.
So while loading the data into fact table grain plays a major role and one should be taken care otherwise the data should be wasted and which results to poor analysis and poor decisions.
SNOW FLAKE SCHEMA:
A snowflake schema is similar to star schema and the only difference is the dimension tables are normalized and hence there is very less scope in getting duplicates.
OLTP databases will prefer this Schema. If you observe closely the dimension tables ITEM and LOCATION are normalized and hence here very less scope for getting duplicates.
HYBRID SCHEMA:
Hybrid schema or Fact constellation schema. Here two fact tables will share the dimensions and these dimensions are called confirmed dimensions. This Schema is used mainly for the aggregate fact tables, OR where we want to split a fact table for better comprehension. The split of fact table is done only when we want to focus on aggregation over few facts & dimensions. In the below figure if you observe sales fact table and shipping fact table are using same dimension tables TIME , ITEM,LOCATION
Fact Constellation Schema
Note: In any of the modeling first dimensions will be loaded and after that only fact table is loaded other wise data ware house will be wasted.
What is Aggregation and what is Incremental Aggregation ?
Aggregation is Nothing but calculating the summary values over the data.
Suppose there is a table department with deptno ,dname,sal,empno as columns.You can perform any aggregation on column sal like min sal, max sal, avg sal based on any key or non key columns and load into target table. This is called Aggregation.
Source data:
Deptno dname sal empno
10 physics 100 1
20 maths 200 2
30 comp 300 3
10 physics 600 4
20 maths 700 5
After Aggregation if you calculate the sum
Target data:
Deptno sum( sal)
10 700
20 900
30 300
Incremental aggregation: In Incremental Aggregation the target summary values will be stored in aggregator cache and the new records aggregate values will be added to the target summary value.
Eg:
Let there will be new records inserted into the source data given above.
Newly added source records.
Deptno dname sal empno
10 physics 200 6
20 maths 300 7
30 comp 400 8
If you use Incremental aggregator then source aggregate values should be added to target summary values.
After Incremental Aggregation:
Target Data:
Deptno sum( sal)
10 900
20 1200
30 700
No comments:
Post a Comment