Data exploration is done to become familiar with the data. This step is especially important when dealing with new data. There are a number of things you will want to do in this step –
a. What is there in the data – look at the list of all the variables in the data set. Understand the meaning of each variable using the data dictionary. Go back to the business for more information in case of any confusion.
b. How much data is there – look at the volume of the data (how many records), look at the time frame of the data (last 3 months, last 6 months, etc.)
c. Quality of the data – how much information is missing, and how reliable is the data in each variable? Are all fields usable? If a field has data for only 10% of the observations, that field may not be usable.
d. You will also identify some important variables and may investigate these more deeply, for example by looking at averages, min and max values, and perhaps the 10th and 90th percentiles as well.
e. You may also identify fields that you need to transform in the data prep stage.
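The checks in the list above can be sketched in a few lines of Python. This is only an illustration using the standard library: the `records` data, the field names, and the helper functions `missing_share` and `numeric_summary` are all made up for the example, and `None` is assumed to mark a missing value.

```python
import statistics

# Hypothetical toy records; None marks a missing value.
records = [
    {"age": 25, "income": 40000, "region": "N"},
    {"age": 32, "income": None, "region": "S"},
    {"age": 47, "income": 82000, "region": None},
    {"age": 51, "income": 61000, "region": "N"},
    {"age": None, "income": 55000, "region": "E"},
]

def missing_share(rows, field):
    """Fraction of rows where the field is missing (None)."""
    return sum(r[field] is None for r in rows) / len(rows)

def numeric_summary(rows, field):
    """Mean, min, max and 10th/90th percentiles of the non-missing values."""
    values = sorted(r[field] for r in rows if r[field] is not None)
    # n=10 returns the nine decile cut points; the first is the 10th
    # percentile and the last is the 90th.
    deciles = statistics.quantiles(values, n=10, method="inclusive")
    return {
        "count": len(values),
        "mean": statistics.mean(values),
        "min": values[0],
        "max": values[-1],
        "p10": deciles[0],
        "p90": deciles[-1],
    }

fields = list(records[0])                                  # what is in the data
missing = {f: missing_share(records, f) for f in fields}   # quality of the data
age_summary = numeric_summary(records, "age")              # deeper look at one variable

print(len(records), "records; fields:", fields)
print("missing share by field:", missing)
print("age summary:", age_summary)
```

On real data you would run the same checks on every variable and compare the results with the data dictionary before deciding which fields are usable.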
In data preparation, you will prepare the data for the next stage i.e. the modeling stage. What you do here is influenced by the choice of technique you use in the next stage.
But some things are done in most cases – for example, identifying missing values and treating them, identifying outlier values (unusual values) and treating them, transforming variables, creating binary variables if required, etc.
This is the stage where you will partition the data as well, i.e. create training data (to build the model) and validation data (to check how well the model performs on data it has not seen).
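A minimal sketch of the partitioning step, assuming a simple random split is acceptable. The function name `train_validation_split`, the 70/30 ratio, and the seed are illustrative choices, not a prescribed method.

```python
import random

def train_validation_split(rows, validation_share=0.3, seed=42):
    """Shuffle a copy of the rows and split it into training and validation sets."""
    shuffled = list(rows)                      # leave the caller's data untouched
    random.Random(seed).shuffle(shuffled)      # fixed seed makes the split repeatable
    n_validation = round(len(shuffled) * validation_share)
    return shuffled[n_validation:], shuffled[:n_validation]

rows = list(range(100))                        # stand-in for 100 observations
train, validation = train_validation_split(rows)
print(len(train), "training rows,", len(validation), "validation rows")
```

Every observation ends up in exactly one of the two sets, so the model is never validated on data it was trained on.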
The first step is to identify variables with missing values. Assess the extent of missing values. Is there a pattern in missing values? If yes, try and identify the pattern. It may lead to interesting insights.
If there is no pattern, you can either ignore the missing values (by default, SAS modeling procedures leave out any observation with a missing value in a model variable) or impute the missing values.
Simple imputation – missing values are substituted with the mean or median of the variable.
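Simple imputation can be sketched as follows. The `impute` helper and the `income` values are hypothetical, and `None` is assumed to mark a missing value.

```python
import statistics

def impute(values, strategy="median"):
    """Replace None entries with the mean or median of the observed values."""
    observed = [v for v in values if v is not None]
    if strategy == "mean":
        fill = statistics.mean(observed)
    elif strategy == "median":
        fill = statistics.median(observed)
    else:
        raise ValueError(f"unknown strategy: {strategy}")
    return [fill if v is None else v for v in values]

# Hypothetical incomes (in thousands) with two missing values and one outlier.
income = [40, None, 82, 61, 55, None, 400]
mean_filled = impute(income, "mean")
median_filled = impute(income, "median")
print("mean imputation:  ", mean_filled)
print("median imputation:", median_filled)
```

In this toy data the outlier of 400 pulls the mean well above most observed values, while the median stays at a typical value. That is one reason the median is often preferred for skewed variables.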
You can use different methods to assess how good a logistic model is.
a. Concordance – This tells you about the ability of the model to discriminate between the event happening and not happening.
b. Lift – It helps you assess how much better the model is compared to random selection.
c. Classification matrix – helps you look at the false positives and false negatives, alongside the correctly classified cases.
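The three measures above can be illustrated with a small Python sketch. It assumes you already have the actual outcomes (1 = event, 0 = non-event) and the model's predicted probabilities; the data, the function names, the 20% lift fraction, and the 0.5 cutoff are all made up for the example.

```python
def concordance(y_true, scores):
    """Shares of (event, non-event) pairs that are concordant and tied."""
    events = [s for y, s in zip(y_true, scores) if y == 1]
    non_events = [s for y, s in zip(y_true, scores) if y == 0]
    pairs = len(events) * len(non_events)
    # A pair is concordant when the event has the higher predicted probability.
    concordant = sum(e > n for e in events for n in non_events)
    tied = sum(e == n for e in events for n in non_events)
    return concordant / pairs, tied / pairs

def top_fraction_lift(y_true, scores, fraction=0.2):
    """Event rate in the top-scored fraction divided by the overall event rate."""
    ranked = sorted(zip(scores, y_true), reverse=True)
    k = max(1, round(len(ranked) * fraction))
    top_rate = sum(y for _, y in ranked[:k]) / k
    overall_rate = sum(y_true) / len(y_true)
    return top_rate / overall_rate

def classification_matrix(y_true, scores, cutoff=0.5):
    """Counts of true/false positives and negatives at a probability cutoff."""
    counts = {"TP": 0, "FP": 0, "TN": 0, "FN": 0}
    for y, s in zip(y_true, scores):
        if s >= cutoff:                       # predicted event
            counts["TP" if y == 1 else "FP"] += 1
        else:                                 # predicted non-event
            counts["FN" if y == 1 else "TN"] += 1
    return counts

# Hypothetical outcomes and predicted probabilities for ten observations.
y_true = [1, 0, 1, 0, 0, 1, 0, 0, 0, 1]
scores = [0.9, 0.2, 0.7, 0.6, 0.1, 0.4, 0.3, 0.2, 0.5, 0.8]

pct_concordant, pct_tied = concordance(y_true, scores)
lift = top_fraction_lift(y_true, scores)
matrix = classification_matrix(y_true, scores)
print("concordant share:", pct_concordant, "tied share:", pct_tied)
print("lift in top 20%:", lift)
print("classification matrix:", matrix)
```

In this toy data, 22 of the 24 event/non-event pairs are concordant. The top 20% of scores contain events at 2.5 times the overall rate. At a cutoff of 0.5 the model produces two false positives and one false negative.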
Some other general questions you will most likely be asked:
- What have you done to improve your data analytics knowledge in the past year?
- What are your career goals?
- Why do you want a career in data analytics?
The answers to these questions will have to be unique to the person answering them. The key is to show confidence and give well-thought-out answers that demonstrate you are knowledgeable about the industry and have the conviction to work hard and excel as a data analyst.