Cleansing or scrubbing: removes inconsistent or invalid data, or data with errors to improve data quality.Joins: used to join extracted data, similar to SQL joins (inner join, left join, right join, full join, etc.).This combination means adding, deleting, and updating data in the staging area based on the extracted data. Integration: the combination of data from the extraction phase to the staging area.The selection can be made during the extraction phase, during the transformation phase, or in both phases. Selection: criteria to select extracted data.There is a range of tasks performed during the transformation phase: In some projects, normally those with a small amount of data, there is no need to use staging areas, but most projects use it. A staging area is an intermediate storage space used to temporarily store data extracted from data sources to be transformed. TransformationĪfter extracting the data, we can begin the transformation process: clean, manipulate, and transform data according to business rules and technical criteria to maintain an acceptable level of data quality.ĭepending on a number of factors, the use of a staging area may be necessary. If an engineer is in the middle of the test and has answered a few questions but hasn’t finished yet, the extraction process can’t read the questions from unfinished tests. For example, an online test to identify an engineer’s knowledge with 10 questions. It means that only completed transactions will be considered for extraction, eliminating data that could cause an integrity fault. In addition to these aspects, some other considerations are necessary when choosing between full or partial extraction, and I want to describe one of them: data availability and integrity. In a partial extraction, only updated or new data will be extracted. Partial extraction is recommended when there is no need to extract all the data again, or the amount of data is big enough to make a full extraction impossible.Full extraction is used for initial extractions or when the amount of data and consequently the time for extraction are acceptable.There are two main types of extraction: full and partial. Sources can have different formats and structures, such as documents, spreadsheets, CSV files, flat files, relational databases such as Oracle, MySQL, SQL Server, non-relational databases, and so on. ExtractionĮxtraction is the process of receiving data from single or multiple sources. Let’s dive into the three steps of ETL to elaborate on the process. In our case, we’ll receive data from an Oracle database (most kiosks), from Salesforce (stores), and from spreadsheets (newer kiosks), extract the data, transform when necessary, and load into a single data warehouse database to be accessed by reporting tools and used to generate dashboards and scorecards. ETL is a group of processes designed to turn this complex store of data into an organized, reliable, and replicable process to help your company generate more sales with the data you already have. How will you collate all the data from different locations, formats, and structures into a unique database ready to be accessed for dashboard generation? ETL FundamentalsĮTL stands for Extract, Transform, and Load. The newest kiosks that started operations during the system transition have shared spreadsheets used to compute sales. Bigger stores are using a new Salesforce system. Some of the retail kiosks are still using a proprietary system in an Oracle database. You already know that it is almost impossible to consolidate the information due to different formats and structures. One of the directors asks you to generate a dashboard with sales information from all the points of sale to present in the next board meeting. Your board of directors is struggling to make decisions based on sales because the information is available but it is in different locations and formats. Imagine this scenario-you are the IT manager of a cell phone case company with retail locations all over the United States. One of the most critical mistakes that prevent companies from providing vital decision-making information to business users is the absence of reliable data from one or more data sources, gathered in a single location, organized and prepared to be used.
0 Comments
Leave a Reply. |
AuthorWrite something about yourself. No need to be fancy, just an overview. ArchivesCategories |