DATA SCIENCE

INTRODUCTION



Data science is the discipline of solving problems by using data. The problem may be a decision-making task, such as identifying which emails are spam and which are not. The core job of a data scientist is therefore to understand the data, extract the relevant information from it, and apply that information to solve the problem.

 

DATA COLLECTION

A data scientist first needs to collect all the data that can help solve the problem. Data collection is a systematic approach to gathering relevant information from a variety of sources. Depending on the problem statement, data collection methods are broadly classified into two categories.

The first category applies when the problem is unique and no related research has been done on the subject. In that case you need to collect new data, which is called primary data collection. No public data is available for such problems, but you can collect it yourself through methods such as surveys, interviews with employees, and monitoring the time spent by employees.

The second category is to use data that is readily available or has already been collected by someone else. Such data can be found on the internet, in news articles, in government censuses and so on. This is called secondary data collection, and it is less time-consuming than the primary method.

 

DATA QUALITY CHECK AND REMEDIATION

After collecting the data, most people start analysing it straight away and often forget to do a sanity check first. If the data is of bad quality, the analysis can give misleading information.

A data quality check typically involves detecting and correcting corrupt or inaccurate records by replacing, modifying or deleting the “dirty” data. It can be performed manually, with cleansing tools, as a batch process, through data migration, or through a combination of these methods.
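As a rough illustration, a basic sanity check could look like the sketch below. It is only one possible approach, written with the Python standard library. The function name check_and_clean, the sample employee records and the 0 to 24 hour range are all assumptions made up for this example. The sketch flags records with missing values, out-of-range values or exact duplicates, and remediates by deleting them.

```python
def check_and_clean(records, required_fields, ranges):
    """Return (clean_records, issues) for a list of dict records.

    Hypothetical helper: "remediation" here simply means dropping
    any record that fails a check.
    """
    clean, issues, seen = [], [], set()
    for index, record in enumerate(records):
        # Check 1: a required field is absent or empty (None).
        missing = [f for f in required_fields if record.get(f) is None]
        if missing:
            issues.append((index, "missing: " + ", ".join(missing)))
            continue
        # Check 2: a numeric field lies outside its allowed range.
        out_of_range = [
            f for f, (low, high) in ranges.items()
            if record.get(f) is not None and not low <= record[f] <= high
        ]
        if out_of_range:
            issues.append((index, "out of range: " + ", ".join(out_of_range)))
            continue
        # Check 3: the record is an exact duplicate of an earlier one.
        key = tuple(sorted(record.items()))
        if key in seen:
            issues.append((index, "duplicate"))
            continue
        seen.add(key)
        clean.append(record)
    return clean, issues


# Made-up sample data: hours worked per employee per day.
records = [
    {"employee": "A", "hours": 8},
    {"employee": "B", "hours": None},  # missing value
    {"employee": "C", "hours": 30},    # impossible value for one day
    {"employee": "A", "hours": 8},     # duplicate of the first record
]
clean, issues = check_and_clean(records, ["employee", "hours"], {"hours": (0, 24)})
for index, problem in issues:
    print(f"record {index}: {problem}")
```

Deleting is the simplest remedy. Depending on the problem, replacing or modifying the dirty value may be a better choice.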



EXPLORATORY DATA ANALYSIS

Exploratory Data Analysis (EDA) refers to the critical process of performing initial investigations on data so as to discover patterns, spot anomalies, test hypotheses and check assumptions with the help of summary statistics and graphical representations.

You can use descriptive statistics such as measures of central tendency (mean, median, mode) and measures of variability (range, variance, standard deviation). Visualisation methods such as graphs and plots can also be used for analysis.
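The descriptive statistics mentioned above can be computed with Python's built-in statistics module. The following is a minimal sketch. The function name summarize and the sample values are assumptions chosen only for illustration.

```python
import statistics


def summarize(values):
    """Return basic central value and variability measures for a list of numbers."""
    return {
        # Central value measures
        "mean": statistics.mean(values),
        "median": statistics.median(values),
        "mode": statistics.mode(values),
        # Variability measures
        "range": max(values) - min(values),
        "variance": statistics.pvariance(values),  # population variance
        "std_dev": statistics.pstdev(values),      # population standard deviation
    }


# Made-up sample data.
values = [2, 4, 4, 4, 5, 5, 7, 9]
summary = summarize(values)
for name, value in summary.items():
    print(f"{name}: {value}")
```

The sketch uses the population forms (pvariance, pstdev). If the data is a sample drawn from a larger population, statistics.variance and statistics.stdev are the usual alternatives.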

EDA is primarily used to see what the data can reveal beyond the formal modelling or hypothesis testing task, and it provides a better understanding of the data set variables and the relationships between them. It can also help determine whether the statistical techniques you are considering for data analysis are appropriate.

 

DATA MODELLING

Data modelling means formulating every step needed to reach the required solution. We need to list the flow of calculations, which is simply the sequence of modelling steps that leads to the solution. The main question is how to perform these calculations. There are various techniques in statistics and machine learning that you can choose from, based on the requirement.
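To make the idea of a "flow of calculations" concrete, here is a small sketch that writes out the steps of one simple statistical technique, fitting a straight line by ordinary least squares. This technique is chosen purely as an illustration and is not the only option. The function names fit_line and predict and the sample numbers are assumptions for this example.

```python
def fit_line(xs, ys):
    """Fit y = slope * x + intercept by ordinary least squares."""
    n = len(xs)
    # Step 1: compute the mean of each variable.
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Step 2: measure how x and y vary together, and how x varies alone.
    s_xy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    s_xx = sum((x - mean_x) ** 2 for x in xs)
    # Step 3: derive the model parameters from those sums.
    slope = s_xy / s_xx
    intercept = mean_y - slope * mean_x
    return slope, intercept


def predict(x, slope, intercept):
    """Step 4: apply the fitted model to a new input."""
    return slope * x + intercept


# Made-up sample data that lies exactly on the line y = 2x + 1.
xs = [1, 2, 3, 4]
ys = [3, 5, 7, 9]
slope, intercept = fit_line(xs, ys)
print(slope, intercept, predict(5, slope, intercept))
```

Each numbered comment corresponds to one modelling step. A different requirement, such as the spam classification problem, would call for a different technique and a different list of steps.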

 

CONCLUSION

Data science education is well into its formative stages of development; it is evolving into a self-supporting discipline and producing professionals with distinct and complementary skills relative to professionals in the computer, information, and statistical sciences.

