A day in the life of a wannabe data scientist

May 10, 2017

I recently had an experience that I think matches well with the textbook steps often presented in “day in the life of a data scientist” discussions:

  1. Frame the question
  2. Collect data
  3. Clean the data
  4. Explore the data
  5. Analyze the data
  6. Communicate the findings

I thought it might be fun to share my trip. Join me, won’t you?

(It should be noted that real-world applications of these steps often lead to looping behavior—initial exploration of the data might suggest additional cleaning steps, or analysis might reveal the need to collect more or different data. For simplicity of this illustration, we’ll pretend that I foresaw all of those possibilities and avoided looping by doing each step perfectly the first time.)

1. Frame the question

Here at SEP, we often refer to individuals using their initials. On a quick walk around the office you’ll see stickies everywhere with strings like “TET”, “JCF”, “MCT”, etc. on them. Occasionally, though, this form of abbreviation leads to ambiguity. SEP is big enough these days that we have overlapping initials in a few cases. Recently at lunch a coworker asked, “How many sets of duplicated initials do we have at SEP anyway?”

2. Collect the data

I initially thought I could answer the question just by thinking hard enough about sets of duplicates that I have encountered. I came up with a couple of examples…eventually. But, I wasn’t satisfied with the answer “a few…I think”. I was going to need more authoritative data.

So, I asked the SEP API:

$ curl -s app.sep.com/sepeers | jq . | grep initials > initials.txt

The result was 122 lines that looked like this:

"initials" : "TET",
"initials" : "JCF",
"initials" : "MCT",

And here was my first opportunity for a sanity check; I know there are roughly 120 current employees at SEP. 122 data points passed that sniff test. Hooray!
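
For anyone following along in a terminal, that count is a one-liner against the file from the previous step:

$ wc -l initials.txt
122 initials.txt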

3. Clean the data

While I could have done analysis on data points that look like those above, the signal to noise ratio there is pretty low. Three characters from each of those lines are interesting—the rest is boilerplate resulting from the original JSON-esque structure that I slurped those lines out of.

Thankfully the SEP API data source is pretty clean already as far as accuracy and completeness of data goes. I didn’t really need to address outliers, mis-recorded, corrupt or inaccurate data, non-standard capitalization, etc. I just needed to transform the data into a more convenient form.

So, I opened my initials.txt file with Excel, declared colon a delimiter, and ignored the resulting first column. That yielded data like this:

"TET",
"JCF",
"MCT",

I cleaned up the quotes and commas with =LEFT(RIGHT(A1,LEN(A1)-2), LEN(A1)-4) then copied and pasted the resulting values to get a clean list:

TET
JCF
MCT

In hindsight, a further optimization is possible that blends data collection with data cleaning:

$ curl -s app.sep.com/sepeers \
    | jq . \
    | grep initials \
    | sed -e 's/.*: "\(.*\)",/\1/' \
    > initials.txt
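
If the /sepeers endpoint returns a plain JSON array of employee objects (I’m assuming that shape rather than confirming it), jq alone can pull the field out and skip the grep and sed steps entirely:

$ curl -s app.sep.com/sepeers | jq -r '.[].initials' > initials.txt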

I should note here that SEP has employees who have no middle initial recorded in the source data. The specific questions I was asking didn’t require that I do anything special to handle those data points. But, one can imagine similar questions involving middle initials that might drive cleaning steps like assigning placeholder middle initials or dropping those data points from further consideration before moving on.
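
For example, here’s a quick sketch of the “drop them” option, assuming the two-character entries in initials.txt are exactly the folks with no recorded middle initial (initials-with-middles.txt is just a name I made up for the filtered file):

$ awk 'length == 3' initials.txt > initials-with-middles.txt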

4. Explore the data

To get a general feel for the data and see whether any patterns were obvious, I first sorted the list alphabetically and scanned it visually. It was relatively easy to pick out the duplicates:

– ARS (x2)
– JAM (x2)
– RPH (x2)
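
A visual scan works fine at a list this size, but for the record, standard shell tools will surface the same duplicates straight from the cleaned initials.txt:

$ sort initials.txt | uniq -c | sort -rn | awk '$1 > 1'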

It seemed that we didn’t have as many duplicated sets of initials as I’d thought. Time for another sanity check…

Upon reflection, I realized that I had been mentally counting some potential collisions that could have but never actually did occur due to timing. For example, we have current employees with the initials “CSB”, “MDS”, and “TLR”. We also have previous employees with the initials “CSB” and “MDS” and a current employee who previously had the initials “TLR”. But, the timelines there don’t actually overlap; we only ever had one of each of those at any given time.

In addition, neither I nor anyone nearby could cite any current collisions that weren’t on that short list. OK, this passed the second sniff test after all.

(I’ll probably try to run this experiment in the future with additional historical data from app.sep.com/alums to capture those “collisions in space, but not time”. However, that will introduce new work back in the “clean the data” step—currently, the /alums data doesn’t directly contain each person’s initials. I’ll have to construct that from clues such as first/last name and username.)

5. Analyze the data

To answer the specific original question in a more definitive manner, I created a PivotTable in Excel: Insert->PivotChart->PivotChart & PivotTable. Using Excel’s handy wizards, I configured my PivotChart to display initials versus count of initials. The three previously identified repeated initials jumped right out of this visualization:

Next I started asking related questions about distributions of initials, fiddling with =LEFT() and =RIGHT(), and generating more charts.

Yes, we would have significantly more collisions if we used only first and last initials—24 cases of duplicates, some with as many as 4 individuals:
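
A rough command-line version of that check (assuming each cleaned line is two or three characters, with the first and last characters being the first and last initials):

$ sed 's/^\(.\).*\(.\)$/\1\2/' initials.txt | sort | uniq -c | sort -rn | awk '$1 > 1'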

No, we’re not really special snowflakes—given our small data set, SEP’s distribution of last initials tracks relatively closely with the US population as recorded in the 2000 census:

(It might be worth digging into that last one a little deeper given that SEP skews significantly more male and less culturally diverse than the US population as a whole. I know that distributions of first initials are different between males and females. I don’t know if last initials vary in the same way.)

6. Communicate the findings

You’re soaking in it.

Also, I made a few posts in assorted Slack channels and had some follow-up conversations with the coworkers who had been involved in the original discussion. I still occasionally drop a factoid from my findings into conversations with coworkers who I think might be interested.

What next?

Partly because I got frustrated with PivotCharts and partly because I wanted to see if I could do it, I started exploring these same questions and data in R. Stay tuned for those adventures in the near future!