Data Mining the 2017 Data Professional Salary Survey
For a while now, I've been playing with data mining/data science. When Brent (b/t) released the results of his 2017 Data Professional Salary Survey, I thought it would be a good excuse to build a Decision Tree and see what information I could gather.
For this Decision Tree, I wanted to see which factors had the most impact on salary. That means salary in general: not which factors lead to the highest pay, but how salaries break down across the respondents. Now, I need to be clear: I didn't do any data cleansing here, and this data was generated by humans, so I wouldn't run out and change your life plan and goals based on what I found here.
There are several ways to create Decision Trees, but I used the Data Mining functionality included in SSAS. One nice thing is that you don't have to have a cube in order to take advantage of the data mining models. You do, however, need to have a multidimensional SSAS instance. There are several mining algorithms available, but I thought that a Decision Tree made sense for this experiment. For complete transparency, I didn't use all of the fields that were included in the survey, but chose them based on their weight. SSAS will suggest input columns based on what you are trying to predict. This is what came back for Salary:
So already of interest - it didn't look like the number of database servers was weighted heavily, but I did add it to the input columns. I'm not going to go into detail about how Decision Trees are created - there are plenty of great blog posts about that already (check here and here for examples). So after building and deploying my solution, I came back with this.
Voila! Perfectly understandable, right?
Now if we focus on the parts of the tree where most of the responses fell (darker blue tiles), we can see how different factors influence different levels.
It makes sense to me that the first differentiating attribute was the number of years of experience. I was surprised, though, that Country was such a large factor.
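To give a feel for why one attribute ends up at the top of a tree, here is a minimal Python sketch of a common textbook approach: score each attribute by how much it reduces the variance of salary when you group by it, then split on the best one. This is not necessarily how SSAS scores its splits, and the rows, bucket labels, and function names below are all made up for illustration; none of it comes from the survey.

```python
from collections import defaultdict
from statistics import pvariance

# Hypothetical toy rows, NOT taken from the survey data.
ROWS = [
    {"years": "0-5", "country": "A", "salary": 50},
    {"years": "0-5", "country": "B", "salary": 54},
    {"years": "6+", "country": "A", "salary": 90},
    {"years": "6+", "country": "B", "salary": 96},
]


def variance_reduction(rows, attribute, target="salary"):
    """How much the target's variance drops after grouping by one attribute."""
    total = pvariance([row[target] for row in rows])
    groups = defaultdict(list)
    for row in rows:
        groups[row[attribute]].append(row[target])
    # Weight each group's variance by the share of rows it holds.
    weighted = sum(
        len(values) / len(rows) * pvariance(values) for values in groups.values()
    )
    return total - weighted


def best_first_split(rows, attributes, target="salary"):
    """Pick the attribute whose grouping removes the most variance."""
    return max(attributes, key=lambda a: variance_reduction(rows, a, target))


if __name__ == "__main__":
    for name in ("years", "country"):
        print(name, variance_reduction(ROWS, name))
    print("first split:", best_first_split(ROWS, ["years", "country"]))
```

In this toy data, grouping by years leaves each group with nearly identical salaries, while grouping by country barely helps, so years would be chosen as the first split.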
Another interesting piece is the Dependency Network. This shows all of the attributes that the salary depended on. Within it, you can look at how strong each attribute's link to salary is. I selected the top 50%.
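If you think of each link as having a strength score, taking the top 50% amounts to ranking the links and keeping the strongest half. Here is a rough sketch of that idea in Python, assuming that is roughly what the viewer's filter does; the attribute names and scores are placeholders I made up, not values from the model.

```python
import math


def strongest_links(strengths, fraction=0.5):
    """Return the names of the strongest `fraction` of links, strongest first."""
    ranked = sorted(strengths, key=strengths.get, reverse=True)
    keep = math.ceil(len(ranked) * fraction)
    return ranked[:keep]


# Hypothetical attribute names and scores, for illustration only.
EXAMPLE_STRENGTHS = {"attr_a": 0.9, "attr_b": 0.4, "attr_c": 0.7, "attr_d": 0.1}

if __name__ == "__main__":
    print(strongest_links(EXAMPLE_STRENGTHS))
```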
I thought it was interesting that Education, Certifications, and Job Title weren't in that top 50%.
I hope that this was as interesting for you as it was for me to put together. Hopefully it encourages you to start playing around with Data Mining.