Data Warehousing and Data Mining Essay
Data warehousing is a useful tool for many companies because it creates an easily accessible, permanent central storage space that supports data analysis, retrieval, and reporting (Rosencrance, 2011). Five benefits of data warehousing are enhanced business intelligence, time savings, higher and more consistent data quality, access to historical information, and a high return on investment. Data warehousing is especially valuable for businesses that currently make important decisions without consulting data. A data warehouse makes it simple for business professionals to consult various aspects of their business’s history, ranging from marketing information to profits and inventory needs. Since all of this information is located on a single system, it saves time compared to digging through paper files; in addition, this centralization allows the IT department to focus on its other responsibilities, which increases the overall efficiency of the company. The data retrieved from a data warehouse can be presented in a consistent format, which allows businesses to compare new data with previously collected data and gain a better understanding of their progress. Lastly, practice has shown that businesses that implement a data warehouse generate more revenue than those that use other forms of data storage. Although the initial investment needed to create a data warehouse is high, many business owners believe that it is worth it. Databases are useful for data storage practices that support both enterprise and web-based applications. Such a system allows company owners to collect data from the Internet and convert this information into usable models that predict trends. Eventually, the company will be able to use this information to understand patterns that will help the business succeed.
Data mining is the process of extracting information from a data warehouse (Alexander, n.d.). This process is of particular use to businesses because it allows them to predict future patterns and trends based on current and previous information, which assists them in making important decisions. Even expert statisticians are unable to predict trends as well as certain data mining schemes; human error can cause them to overlook data points, which skews the end result. Data mining uses algorithms that allow the computer to build mathematical models for problems such as market segmentation, customer churn, fraud detection, direct marketing, interactive marketing, market basket analysis, and trend analysis. Respectively, these models help businesses determine the similarities between customers who buy similar products from their company, predict how likely current customers are to switch to a competing company, indicate which customer transactions are most likely to be fraudulent, identify the customers who should be included in a mailing list for marketing purposes, determine what customers would like to see on the company’s website, reveal which products are usually purchased at the same time, and analyze how recurring customers differ over a certain period of time. All of these strategies help businesses gain a greater understanding of who they are marketing to, what their market is interested in, and how to persuade people to purchase more of their product.
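To make market basket analysis concrete, the following is a minimal sketch of the idea, not a production algorithm: it counts how often each pair of products appears in the same transaction. The function name `pair_support` and the sample baskets are hypothetical and chosen only for illustration.

```python
from collections import Counter
from itertools import combinations


def pair_support(transactions):
    """Return the fraction of transactions that contain each item pair."""
    counts = Counter()
    for basket in transactions:
        # Sort and de-duplicate so (a, b) and (b, a) count as the same pair.
        for pair in combinations(sorted(set(basket)), 2):
            counts[pair] += 1
    total = len(transactions)
    return {pair: n / total for pair, n in counts.items()}


# Hypothetical sample data: each inner list is one customer transaction.
baskets = [
    ["bread", "milk"],
    ["bread", "diapers", "beer"],
    ["milk", "diapers", "beer"],
    ["bread", "milk", "diapers"],
]
support = pair_support(baskets)
```

Pairs with high support, such as bread and milk in this sample, are candidates for being shelved or promoted together. Real tools typically add further measures such as confidence and lift on top of raw support.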
According to a January 2013 article published in Forbes magazine, data warehousing is becoming a trend and is replacing physical storage and some forms of less organized computer storage. The article explains that one of the major reasons for this shift is the acquisition of increasingly large amounts of data; for businesses to be successful in the modern era, they must be able to both create and store ever-larger volumes of data (Evans, 2013). Performance has become one of the most important goals of the 21st century, which means that we have been forced to find new ways to store, retrieve, and analyze data; data warehouses have solved this problem in a way that provides a significant return on investment. Simplicity, accessibility, and mixed workload support have allowed data warehousing to replace conventional methods of data storage. The efficacy and efficiency of data mining and data warehousing have allowed businesses to evolve and take on more tasks than they were ever capable of in the past.
Many companies have successfully integrated data warehouses into their daily business activities; two notable corporations that rely on these systems are Apple and Walmart. Although it may seem obvious that Apple uses data warehousing since it is a computer company, the business actually uses data warehousing and mining for far more than its development of electronics and programs. Apple uses a multi-petabyte Teradata system to study its customers and determine which types of people buy its products (Harris, 2013). It saves every piece of demographic information that people enter into its website, iTunes, and electronics to study these relationships. Walmart began using a data warehouse in 1992, and it has grown significantly over the years. The corporation now uses three separate databases: one for Walmart stores, one for Sam’s Club, and a backup system. One of the major uses of this system is to provide store managers with information about their store’s layout. The system tracks where shelf space is free, and the business uses data mining to figure out how to optimize this space. It also tells the stores which items are selling best, how fast they are selling, and even whether packaging should be redesigned so that certain products fit the shelving more efficiently.
There are several types of data warehouse architecture; however, the precise elements used will depend on the needs of the individual business. For a company to have a highly efficient data warehouse, the warehouse should be able to run nightly updates, its servers should have connectivity around the world so that different offices can access the same database, it should allow customer-level service, it should be capable of storing new data sources, and it must be reliable. Therefore, this data warehouse should have adequate staging horsepower, parallel or distributed servers, a large server size, flexible tools with support for metadata, and job control features (Hadley, 2002).
The two main forms of system design are OLTP and OLAP. OLTP (Online Transaction Processing) is “characterized by a large number of short online transactions (INSERT, UPDATE, DELETE)” (datawarehouse4u, n.d.). This kind of system is built to process queries very quickly while maintaining data integrity. An OLTP database holds detailed, current data, and the schema used to store transactional databases is the entity model. OLAP (Online Analytical Processing) is defined by a relatively low volume of transactions because the queries are often complex. OLAP applications are generally used for data mining on aggregated and historical data that are stored in multi-dimensional schemas (usually the star schema). For our purposes, the OLAP application will be the most effective because it allows the data mining that the company requires for its business analyses.
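The contrast between the two workloads can be illustrated with a small sketch. It uses Python’s built-in sqlite3 module purely for convenience; SQLite is neither an OLTP server nor an OLAP engine, and the `sales` table, its columns, and the sample rows are hypothetical.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE sales ("
    "id INTEGER PRIMARY KEY, region TEXT, product TEXT, amount REAL)"
)

# OLTP-style work: many short transactions that each touch a few rows.
rows = [
    ("east", "laptop", 1200.0),
    ("east", "phone", 800.0),
    ("west", "laptop", 1100.0),
    ("west", "phone", 750.0),
]
for row in rows:
    with conn:  # each 'with' block commits one small transaction
        conn.execute(
            "INSERT INTO sales (region, product, amount) VALUES (?, ?, ?)", row
        )
with conn:
    conn.execute(
        "UPDATE sales SET amount = 700.0 "
        "WHERE region = 'west' AND product = 'phone'"
    )

# OLAP-style work: one read-only query that aggregates over the history.
totals = dict(
    conn.execute("SELECT region, SUM(amount) FROM sales GROUP BY region")
)
```

The inserts and the update each change a single row and commit immediately, while the final query reads every row to produce one total per region, which is the kind of aggregated answer that data mining builds on.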
The model used in the creation of the data warehouse should be the normalized approach, which follows database normalization rules. This is the most effective model for a business’s data warehouse because it allows tables to be grouped together based on subject categories; these can be further divided into entities, which allows users to create tables in a relational database (Kimball, 1996). Since studying relationships between consumers and products is a goal of many businesses, this representation of information will be most useful. This system will use materialized views because they are useful for many business applications. One of the most important advantages of materialized views is that aggregated data is computed in advance and stored, which improves performance and allows fast lookups (Oracle, n.d.).
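The benefit of precomputed aggregates can be sketched as follows. This is an emulation under stated assumptions: SQLite has no materialized views, so a summary table that is rebuilt on demand stands in for one, and the names `sales`, `monthly_sales`, and `refresh_monthly_sales` are hypothetical. In Oracle the equivalent object would be declared with CREATE MATERIALIZED VIEW.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript(
    """
    CREATE TABLE sales (product TEXT, sale_month TEXT, amount REAL);
    INSERT INTO sales VALUES
        ('laptop', '2013-01', 1200.0),
        ('laptop', '2013-01', 1100.0),
        ('phone',  '2013-01', 800.0),
        ('phone',  '2013-02', 750.0);
    """
)


def refresh_monthly_sales(conn):
    """Rebuild the summary table (a stand-in for a materialized view refresh)."""
    with conn:
        conn.execute("DROP TABLE IF EXISTS monthly_sales")
        conn.execute(
            "CREATE TABLE monthly_sales AS "
            "SELECT product, sale_month, SUM(amount) AS total, "
            "COUNT(*) AS n_sales "
            "FROM sales GROUP BY product, sale_month"
        )


refresh_monthly_sales(conn)

# Lookups read the stored aggregate instead of rescanning the detail rows.
lookup = conn.execute(
    "SELECT total, n_sales FROM monthly_sales "
    "WHERE product = ? AND sale_month = ?",
    ("laptop", "2013-01"),
).fetchone()
```

The trade-off is that the stored aggregate is only as current as its last refresh, which is why a nightly update window fits this design well.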
There are several techniques that are used to optimize data warehousing. Some problems that can arise when using a database include overloading the system, poor overall performance, and low storage. To avoid system overload, one should monitor parallel execution performance; overload may be occurring, for example, if many parallel statements are being downgraded. In this situation, I/O performance should also be monitored. If the system has poor overall performance and low storage, it may be necessary to reduce the storage requirements; a useful way to do this is data compression. Several optimization methods can also be used to increase the efficiency of data mining. According to a 2009 dissertation published by the University of Iowa, the main ways that data mining can be improved involve changing the algorithm used (Yu, 2009). It states that optimization can either be one part of a larger data mining process or be applied directly to data mining as a whole. Although the work cites several useful ways that mathematical methods can optimize data mining, the most interesting ones discussed include the use of the k-means cluster analysis algorithm to minimize the distances of points to their nearest centers, time-series data mining to define similarity, and the practice of determining what defines normal measures. There are many ways to optimize data warehousing and data mining, and the number of available methods will grow as the IT field evolves. Although the aforementioned optimization techniques are well suited to the business setting, it is essential to review the literature regularly for new techniques as they arise.
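As an illustration of the k-means idea of minimizing the distances of points to their nearest centers, here is a minimal sketch of the standard Lloyd iteration. It is not taken from Yu (2009); the function name, the fixed iteration count, the starting centers, and the sample points are assumptions made for this example.

```python
import math


def kmeans(points, centers, iterations=10):
    """Alternate nearest-center assignment and center updates."""
    centers = list(centers)
    clusters = []
    for _ in range(iterations):
        # Assignment step: attach every point to its nearest center.
        clusters = [[] for _ in centers]
        for p in points:
            nearest = min(
                range(len(centers)), key=lambda i: math.dist(p, centers[i])
            )
            clusters[nearest].append(p)
        # Update step: move each center to the mean of its cluster.
        # A center with an empty cluster stays where it is.
        centers = [
            tuple(sum(coord) / len(cluster) for coord in zip(*cluster))
            if cluster
            else center
            for cluster, center in zip(clusters, centers)
        ]
    return centers, clusters


# Hypothetical data: two visibly separate groups of 2-D points.
points = [(1.0, 1.0), (1.0, 2.0), (2.0, 1.0), (8.0, 8.0), (9.0, 8.0), (8.0, 9.0)]
centers, clusters = kmeans(points, centers=[(0.0, 0.0), (10.0, 10.0)])
```

In a business setting the points might be customers described by spending and visit frequency, and the resulting clusters would correspond to market segments.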
If the company has accumulated 20 terabytes of data and expects the data warehouse to grow by 20% per year, I would recommend Oracle Exadata because it will accommodate the company’s need for data storage as it continues to grow, performs well, requires little maintenance, and is quick to install. Since this is Big Data, the network configuration will require 1GbE access layer switch capacity (Borovick, 2012). However, as the amount of data storage that the company requires increases, it may need to upgrade to 10GbE server connections after one or two years, which will in turn create a need for 40GbE switch capacity. The specific hardware needed for the database depends mostly on the data storage growth and the number of people who will need to access it. Since the company may have many employees in many different parts of the world, I recommend RAID 10. RAID is ideal in this situation because it offers several different storage methods, each with its own advantages and disadvantages (Acronis, n.d.). RAID 10 is one of these options and is a good fit because it tolerates disk failures and is fast. RAID 10 requires at least four physical hard drives, arranged as striped mirrored pairs, and a disk controller that supports it. One of the most useful functions of RAID 10 is its redundancy; it uses a process called mirroring to save the data to two or more disks at once. If one disk completely fails, the information will still be safe on its mirror. RAID 10 also increases performance because it is able to retrieve information from more than one disk at a time. Despite this ability to copy information to several places at once, Acronis recommends that a separate backup also be kept because the information can still be wiped. RAID 10 itself is a vendor-neutral configuration rather than a product, so any backup tool can be paired with it; I will use Acronis backup software.
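The sizing assumption of 20 terabytes growing at 20% per year can be worked through with a short calculation. The sketch below assumes the growth compounds annually and that RAID 10 uses two-way mirroring, so the raw disk capacity is twice the usable capacity; the function names are hypothetical.

```python
def projected_size_tb(initial_tb, annual_growth, years):
    """Warehouse size after compound annual growth."""
    return initial_tb * (1 + annual_growth) ** years


def raid10_raw_tb(usable_tb, mirror_copies=2):
    """Raw disk capacity needed when every block is stored mirror_copies times."""
    return usable_tb * mirror_copies


# Year 0 through year 5, rounded to two decimals for readability.
projection = [round(projected_size_tb(20, 0.20, year), 2) for year in range(6)]
raw_year_5 = raid10_raw_tb(projected_size_tb(20, 0.20, 5))
```

Under these assumptions the warehouse grows from 20 TB to roughly 50 TB of data in five years, which corresponds to roughly 100 TB of raw disk under two-way mirroring, before any allowance for compression, indexes, or staging space.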
To conclude, data warehouses are generally broken up into three tiers: the data tier, the application tier, and the presentation tier. The data tier is responsible for physically storing all of the warehouse’s data in addition to handling the ETL process, in which data is extracted, transformed, and loaded (Paoletti, 2012). The application tier is where business information is converted to models for company use; this is the tier that deals with all user queries. Finally, the presentation tier gives the business reports, analyses, and event management information. It allows for user administration, dashboarding, and scorecarding. A graphical representation of these three tiers can be seen below:
- Acronis. (n.d.). What is RAID 10 and why should I use it? Retrieved from http://www.acronis.eu/resource/tips-tricks/2005/whats-raid-10.html
- Alexander, D. (n.d.). Data Mining. University of Texas. Retrieved from http://www.laits.utexas.edu/~anorman/BUS.FOR/course.mat/Alex/
- Borovick, L. (2012). The Critical Role of the Network in Big Data Applications. IDC. Retrieved from http://www.cisco.com/en/US/solutions/collateral/ns340/ns517/ns224/ns944/critical_big_data_applications.pdf
- Datawarehouse4u. (n.d.). OLTP vs. OLAP. Retrieved from http://datawarehouse4u.info/OLTP-vs-OLAP.html
- Evans, B. (2013). Data Warehouse 2.0: The 10 Top Trends Driving the Revolution. Forbes. Retrieved from http://www.forbes.com/sites/oracle/2013/01/14/data-warehouse-2-0-the-10-top-trends-driving-the-revolution/2/
- Hadley, L. (2002). Developing a Data Warehouse Architecture. Retrieved from http://www.users.qwest.net/~lauramh/resume/thorn.htm
- Harris, D. (2013). Why Apple, eBay, and Walmart have some of the biggest data warehouses you’ve ever seen. Gigaom. Retrieved from http://gigaom.com/2013/03/27/why-apple-ebay-and-walmart-have-some-of-the-biggest-data-warehouses-youve-ever-seen/
- Kimball, R. (1996). The Data Warehouse Toolkit. Wiley.
- Oracle. (n.d.). Data Warehousing with Materialized Views. Retrieved from http://docs.oracle.com/cd/F49540_01/DOC/server.815/a67775/ch1.htm#12524
- Paoletti, L. (2012). Data Warehousing: “Conceptual Architecture”. Retrieved from http://www.tomsitpro.com/articles/data_warehsouing-business_intelligence-data_warehouse_conceptual_design2-271.html
- Rosencrance, L. (2011). Top Five Benefits of a Data Warehouse. TIBCO Software. Retrieved from http://spotfire.tibco.com/blog/?p=7597
- Yu, Z. (2009). Optimization techniques in data mining with applications to biomedical and psychophysiological data sets. University of Iowa. Retrieved from http://ir.uiowa.edu/cgi/viewcontent.cgi?article=1459&context=etd