Introduction

There are mainly two types of DS interview questions you'll run into: technical and business.

For technical questions, expect brain teasers (quickly solve a math problem), machine learning questions (e.g., how do you deal with overfitting?), or database-related case questions.

For business questions, be prepared for estimation, data strategy, or even consultant-style case studies.

Bear in mind that every company has its own style of interviewing for its DS positions, based on its current development phase: global companies need DS for research and for developing data-driven strategy; mid-size companies need DS for KPI dashboards and trial-and-error ML projects; start-ups need DS for fast-paced product iteration and new ML/AI services. You should be prepared for different types of DS interview questions and, during the interview process, engage as closely as possible with the core business problem the company is trying to solve.

DS Interview Sharing [Year - Location - Company Size - Question Type - Details]

2019 - New Jersey - Start up - Technical - Database - [Failed but corrected]

This interview had 4 parts: self-introduction, a brain teaser, a database case study, and Q&A.

  • Brain Teasing Problem: You have a die with 6 sides, numbered 1-6, and each number is equally likely on every roll. You keep rolling until you see the same number twice in a row. What is the expected number of rolls?

Interview Strategy: in the first few minutes, start from the simplest scenario to get familiar with the situation. For example, you can say, 'Suppose we stop at the 2nd roll; then we must have seen one of the sequences 11, 22, 33, ..., 66, and the probability of that is 6/36 = 1/6.' Along the way you should sense that this problem is about expectations and that there is a natural recursive way to solve it. But always start from the simple scenario before moving forward.

Now consider the case where the 2-digit scenario doesn't end the sequence: there is a 1/6 probability that the 3rd digit matches the 2nd, and a 5/6 probability that the 3-digit scenario doesn't end either... Getting a sense of a Markov chain? At any step t there are only 2 states: state 1, where the latest digit differs from the previous one (denote the expected additional length of the sequence from state 1 at step t as \(E_1^t\)), and state 2, where two consecutive digits are the same (denoted similarly as \(E_2^t\)). We can also define \(E_0\) for the initial step, where only 1 digit has been rolled. Now ask yourself: does the step t actually matter? Whether you are in state 1 at step t or in state 1 at step t+1, the expected additional length of the sequence before you stop is the same! Take a moment to digest that, then move on to the equations:

\(
\begin{aligned}
E_0 &= 1 + E_1\\
E_1 &= 1 + \frac{5}{6}E_1
\end{aligned}
\)

The 2nd line says that with probability 5/6, state 1 at step t transitions to state 1 at step t+1. There is no need to write down \(E_2\): once two consecutive digits match, the sequence has already stopped, so \(E_2 = 0\) and it adds no information.

Solving these equations gives \(E_1 = 6\) and \(E_0 = 1 + E_1 = 7\), so the expected length of the sequence from the start is 7 digits.
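If you want to double-check the answer outside the interview, here is a minimal Python simulation sketch (the trial count and seed are arbitrary choices, not part of the original problem):

```python
import random

def rolls_until_repeat(trials=200_000, seed=42):
    """Average number of rolls until the same face comes up twice in a row."""
    rng = random.Random(seed)
    total = 0
    for _ in range(trials):
        prev = rng.randint(1, 6)
        length = 1
        while True:
            cur = rng.randint(1, 6)
            length += 1
            if cur == prev:
                break
            prev = cur
        total += length
    return total / trials

print(rolls_until_repeat())  # ~7.0, matching E_0 = 7
```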

  • Alteration to the original problem: if we add a second stopping criterion -- the sequence also stops when the sum of the last 2 digits is >= 10 -- what is the expected length of the sequence when it stops?

It's essentially the same process, except that the stopping condition now also includes non-repeating consecutive pairs that sum to at least 10: 4+6, 6+4, 5+6, 6+5. Averaged over a uniformly random previous digit, 10 of the 36 possible consecutive pairs stop the sequence (the 6 repeats plus these 4), so replacing 5/6 with \(1-\frac{10}{36}=\frac{13}{18}\) gives the quick estimate \(E_0 = 1 + E_1 = \frac{23}{5} = 4.6\). Be careful, though: the stopping probability now depends on which digit you just rolled (1/6 after a 1, 2, or 3; 2/6 after a 4 or 5; 3/6 after a 6), so a single lumped state is only an approximation. Writing one equation per last digit, \(E_d = 1 + \frac{1}{6}\sum_{n} E_n\) summed over the non-stopping next digits \(n\), and solving the resulting system gives the exact answer \(E_0 = \frac{329}{69} \approx 4.77\).
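Here is a small sketch that solves the per-digit system by fixed-point iteration and cross-checks it with simulation (iteration and trial counts are arbitrary):

```python
import random

def simulate(trials=200_000, seed=7):
    """Stop when two consecutive faces match OR the last two faces sum to >= 10."""
    rng = random.Random(seed)
    total = 0
    for _ in range(trials):
        prev = rng.randint(1, 6)
        length = 1
        while True:
            cur = rng.randint(1, 6)
            length += 1
            if cur == prev or cur + prev >= 10:
                break
            prev = cur
        total += length
    return total / trials

def exact():
    """Fixed-point iteration on E_d = 1 + (1/6) * sum of E_n over non-stopping next faces n."""
    E = {d: 0.0 for d in range(1, 7)}
    for _ in range(1_000):
        E = {d: 1 + sum(E[n] for n in range(1, 7) if n != d and n + d < 10) / 6
             for d in range(1, 7)}
    return 1 + sum(E.values()) / 6  # E_0 = first roll + expected continuation

print(exact())     # 329/69 ~ 4.77
print(simulate())  # ~ 4.77
```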

  • Machine Learning General: How to avoid overfitting?

    1. Model side: add penalty terms (see Ridge, LASSO, and ElasticNet with L1/L2 regularization; a minimal sketch follows this list); small datasets are easier to overfit, so consider SVMs or Bayesian methods; model averaging also acts like a form of regularization
    2. Data side: randomly subsample columns (see Random Forest, XGBoost) and rows (see training/testing splits and bagging)
    3. Data side: for small datasets with continuous variables, you can also generate artificial data points (see SMOTE or ADASYN)
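To make the penalization point concrete, here is a minimal scikit-learn sketch on synthetic data (the dataset shape and the alpha values are illustrative assumptions, not tuned recommendations), showing how L1/L2 penalties typically shrink the train/test gap in an easy-to-overfit regime:

```python
import numpy as np
from sklearn.linear_model import LinearRegression, Ridge, Lasso
from sklearn.model_selection import train_test_split

# Small, noisy dataset with many features -- the easy-to-overfit regime.
rng = np.random.default_rng(0)
X = rng.normal(size=(60, 40))
y = 3.0 * X[:, 0] + rng.normal(scale=2.0, size=60)  # only 1 feature is truly informative

X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.5, random_state=0)

for name, model in [("OLS (no penalty)", LinearRegression()),
                    ("Ridge (L2)", Ridge(alpha=10.0)),
                    ("Lasso (L1)", Lasso(alpha=0.5))]:
    model.fit(X_tr, y_tr)
    print(f"{name:16s}  train R^2 = {model.score(X_tr, y_tr):.2f}"
          f"  test R^2 = {model.score(X_te, y_te):.2f}")
```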
  • Database Case Study: You are given the schema of a SQL table as your ONLY data source. There is a 7-day marketing campaign for ProductID 1 starting Sep 02 in store OKOK; how can you compute the average discount rate during the campaign from this table?

| ProductID | Date | RetailStore | Region | Unit_sale (# sold) | Dollar_sale (revenue) |
| --- | --- | --- | --- | --- | --- |
| 1 | 2019-09-02 | OKOK | Florida | 3,000 | 6,000 |
  1. Answer: From the given table, extract the time series for ProductID = 1 at RetailStore = OKOK. The per-unit price (Dollar_sale/Unit_sale) should hold roughly steady before the campaign and then drop to another steady level during the campaign. Since we don't know the full price, the best proxy for the full price of ProductID 1 in OKOK is MAX(Dollar_sale/Unit_sale) over the pre-campaign period (to allow for daily dynamic pricing), and the average discounted price is AVG(Dollar_sale/Unit_sale) during the campaign.

Now discount rate = 1 - AVG(Dollar_sale/Unit_sale during campaign)/MAX(Dollar_sale/Unit_sale before campaign)
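In code, the same logic might look like the following pandas sketch (the DataFrame name `sales` and the exact 7-day window Sep 02-08 are assumptions for illustration; the column names follow the schema above):

```python
import pandas as pd

def avg_discount_rate(sales: pd.DataFrame) -> float:
    """sales has the columns from the schema above:
    ProductID, Date, RetailStore, Region, Unit_sale, Dollar_sale."""
    df = sales[(sales["ProductID"] == 1) & (sales["RetailStore"] == "OKOK")].copy()
    df["Date"] = pd.to_datetime(df["Date"])
    df["unit_price"] = df["Dollar_sale"] / df["Unit_sale"]

    in_campaign = df["Date"].between("2019-09-02", "2019-09-08")        # the 7-day window
    full_price = df.loc[df["Date"] < "2019-09-02", "unit_price"].max()  # pre-campaign full price
    avg_price = df.loc[in_campaign, "unit_price"].mean()                # avg discounted price

    return 1 - avg_price / full_price
```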

2019 - Florida - Medium Size - Business - Decision Making - [Failed but corrected]

Interviewer: You have been the owner of a Chipotle, a fast-food restaurant selling burritos, for several months. Being the owner means the profit and loss are all on you, but you can leverage Chipotle's brand and campaigns. The company is now planning a new campaign in which all of your burritos will be priced at a fixed $6 for 6 months. You need to use data to decide whether or not to join the campaign.

Me: OK, so in order to make this decision I need to find out, first, what metrics we have in hand; then I'll prioritize those metrics by importance and relevance to the question; and last, I'll turn the data insights into a decision.

The first things that come to mind are demographic information for the neighbourhood (for example, how large the local customer base for Mexican food is) and the local income distribution...

Interviewer: Remember, you've been the owner for a while, so you should already have some higher-level data.

Me: Oh, you're right, I'm not starting a restaurant from scratch, so I should already have data directly related to the business. I'd like to consider:

  1. average monthly revenue
  2. weekly customer visits
  3. customer sensitivity to campaigns
  4. new customer acquisition through email/mobile app

Interviewer: Yes, so here is the information you need. There are 2 types of burritos sold in your restaurant: steak burritos (price $9, cost $2) and chicken burritos (price $7, cost $1). Based on historical data, customers buy chicken burritos 3 times as often as steak burritos. Burritos can be served either in a bowl (cost $1.5) or with bread (cost $0.5). Every customer adds toppings (cost $1.5/burrito). Customers choose the bowl twice as often as bread when ordering chicken, and split evenly when ordering steak. Now your 1st question: calculate the average gross profit across all burritos.

Me: OK, please allow me to jot down some notes...

So, suppose n customers choose steak burritos in our time period of interest; then 3n customers choose chicken burritos. We need to consider the following customer segments:

| Meat | Serving | Topping | Gross profit per burrito | Sales |
| --- | --- | --- | --- | --- |
| Steak | Bowl | Yes | 9 - 2 - 1.5 - 1.5 = $4 | 0.5n |
| Steak | Bread | Yes | 9 - 2 - 0.5 - 1.5 = $5 | 0.5n |
| Chicken | Bowl | Yes | 7 - 1 - 1.5 - 1.5 = $3 | 2n |
| Chicken | Bread | Yes | 7 - 1 - 0.5 - 1.5 = $4 | n |

Then we can write down the formula for the average gross profit across all burritos:

\(
\frac{0.5n \times 4 + 0.5n \times 5 + 2n \times 3 + n \times 4}{4n}= \$3.625
\)

Interviewer: OK, now say you do participate in the campaign: what percentage increase in sales are you looking for as the burrito price goes down?

Me: Got it. Now I need to find the break-even point where the percentage increase in sales offsets the price decrease. Given that during the campaign every burrito sells at the same $6 price, the new table looks like this:

| Meat | Serving | Topping | Gross profit per burrito | Sales |
| --- | --- | --- | --- | --- |
| Steak | Bowl | Yes | 6 - 2 - 1.5 - 1.5 = $1 | 0.5n |
| Steak | Bread | Yes | 6 - 2 - 0.5 - 1.5 = $2 | 0.5n |
| Chicken | Bowl | Yes | 6 - 1 - 1.5 - 1.5 = $2 | 2n |
| Chicken | Bread | Yes | 6 - 1 - 0.5 - 1.5 = $3 | n |

and

\(
\frac{0.5n \times 1+ 0.5n \times 2 + 2n \times 2 + n \times 3}{4n}= \$2.125
\)

To offset the drop in average profit per burrito, we clearly need to increase n; the required percentage increase can be calculated like this:

\(
\frac{3.625}{2.125} - 1 = 70.6\%\ \text{increase in sales}
\)
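For anyone who wants to double-check the arithmetic, a short Python sketch using the prices, costs, and sales splits from the tables above:

```python
# Each row: (regular price, meat cost, serving cost, share of total sales)
menu = [
    (9, 2, 1.5, 0.5 / 4),  # steak bowl
    (9, 2, 0.5, 0.5 / 4),  # steak with bread
    (7, 1, 1.5, 2.0 / 4),  # chicken bowl
    (7, 1, 0.5, 1.0 / 4),  # chicken with bread
]
TOPPING = 1.5  # every burrito gets toppings

def avg_profit(campaign_price=None):
    """Sales-weighted average gross profit per burrito."""
    return sum(((price if campaign_price is None else campaign_price)
                - meat - serve - TOPPING) * share
               for price, meat, serve, share in menu)

before = avg_profit()   # 3.625
during = avg_profit(6)  # 2.125
print(before, during, before / during - 1)  # 3.625 2.125 0.7058... (~70.6% increase needed)
```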

Interviewer: Great. Now you run the campaign trial for 1 month, and you find that sales increased by 100%, but your total profit stayed the same as before. What do you think might be the reason?

Me: Hmm... The most likely reason, I believe, is that since all burritos are now the same price, customers who usually pick chicken burritos now tend to pick steak ones instead, and from our tables, the per-burrito profit drop for steak ($3) is larger than for chicken ($1)...

Interviewer: Good job, people are indeed favoring steak burritos now. Given the situation, can you estimate the current sales mix for the two burritos?

Me: Sure. Suppose the new steak sales are \(s\) and the new chicken sales are \(c\), keeping the same serving splits as before (steak 50/50 bowl/bread, chicken 2:1 bowl/bread). Sales have doubled, so \(s + c = 8n\), and total profit is unchanged from before the campaign, which was \(3.625 \times 4n = 14.5n\). Using the campaign per-unit profits from the second table (steak averages $1.5 per burrito, chicken averages 7/3, about $2.33, per burrito), we can write:

\(
\begin{aligned}
s + c &= 8n\\
1.5\,s + \tfrac{7}{3}\,c &= 14.5n
\end{aligned}
\)

Solving gives \(s = 5n\) and \(c = 3n\).

Hence steak burritos now outsell chicken burritos 5 to 3, whereas before the campaign chicken outsold steak 3 to 1, so the mix has indeed shifted heavily toward steak.
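The same two constraints in code (a sketch with n normalized to 1, using the serving splits assumed above):

```python
import numpy as np

# Average campaign profit per burrito, keeping the serving splits from the tables above.
steak_profit = 0.5 * 1 + 0.5 * 2            # 1.5  (bowl/bread split 50/50)
chicken_profit = (2 / 3) * 2 + (1 / 3) * 3  # 7/3  (bowl/bread split 2:1)

# Unknowns: s = steak units, c = chicken units (n normalized to 1).
#   s + c = 8                   units doubled from 4
#   1.5*s + (7/3)*c = 14.5      total profit unchanged from before the campaign
A = np.array([[1.0, 1.0],
              [steak_profit, chicken_profit]])
b = np.array([8.0, 14.5])
s, c = np.linalg.solve(A, b)
print(s, c)  # 5.0 3.0 -> steak now outsells chicken 5 to 3
```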

Interviewer: Given the above data, do you want to continue with the campaign after the 1-month trial?

Me: Yes, because the restaurant doesn't sell only burritos. With a 100% increase in traffic there is additional profit from other items such as beverages, starters, and desserts, plus the popularity gained in the community, which together make joining the campaign a win-win for both the company and my restaurant.

Interviewer: Thank you!

2019 - California - Big Name - Technical - Database & Product Case

The interview had 2 parts: a SQL test and a product case.

For the SQL test, I was asked a problem very similar to Leetcode Database Problem 1113, with follow-ups on calculating the percentage of posts reported as 'spam' over a given time window.

Tips:

  1. When dealing with a SQL test, first make sure you understand the expected output from the interviewer -- asking clarifying questions is a good way to show deep and structured thinking. For example, I asked the interviewer, 'Should we round the output to 2 decimal places?'
  2. As you write out the query, follow a pattern you're comfortable with. I always start with the subquery and then the main query, but choose whatever suits your coding style. What's important is that the interviewer knows you're thinking out loud while you're writing on the code pad.
  3. After you finish, take a second look at your code for silly mistakes, such as leaving out the GROUP BY or forgetting DISTINCT where duplicates are possible. After reviewing, recap your logic for the interviewer.

I personally liked the consistency between the SQL test and the product case, where the latter builds on the former while testing you from a different angle.

For the product case, the 1st question was: 'If we have an ML model that can predict spam and down-rank it in the news feed -- how can we come up with a metric that tells whether this model is successful or not?'

Answer: We can look at average spam reported per user = total # of spam reports / total # of users over a given time period; if the ML model is working, this metric should drop by a statistically significant amount.

2nd question: OK, now based on that metric, what method would you use to conclude whether the model works?

Answer: We can use A/B testing to reach a solid conclusion. Since this spam-filtering model doesn't have much of a network effect, we can launch the control and variant groups in the same market -- the control group are users under the previous news feed ranking, and the variant group are users under the new ML model. After running the test for about 2 weeks (to smooth out day-of-week effects), we compare the metric between the 2 groups and use a two-sample t-test to reach a statistical conclusion.
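As a rough illustration of that comparison, here is a minimal SciPy sketch on synthetic per-user spam-report counts (the Poisson rates and sample sizes are made-up placeholders standing in for real logs):

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)

# Per-user spam-report counts over the 2-week window (synthetic stand-ins for real data).
control = rng.poisson(lam=0.50, size=10_000)  # old news feed ranking
variant = rng.poisson(lam=0.45, size=10_000)  # with the spam down-ranking model

t_stat, p_value = stats.ttest_ind(control, variant, equal_var=False)  # Welch's two-sample t-test
print(f"t = {t_stat:.2f}, p = {p_value:.4f}")
# A small p-value (e.g. < 0.05) supports the claim that average spam reported
# per user dropped significantly in the variant group.
```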

3rd question: During the test, what metrics shouldn't change?

Answer: The model should not affect the number of users who land on the site, so we can monitor DAP (daily active people) and make sure it stays the same across the 2 groups.

4th question: If we observe the metric indeed goes down, but our revenue goes down as well -- what might be the reason?

Answer: (First I asked clarifying questions to confirm that revenue comes mainly from ads and that ads are charged by impression -- i.e., when a user sees the ad.) If our model is working properly and is not filtering out the ads on our site, then a likely reason is that users are scrolling a shorter distance in their news feed: they report less spam, but they also generate fewer ad impressions. (Background knowledge: the news feed contains friends' activity, trends in your groups or community, and ads.)

Interviewer: Good. These are all the questions I have for you.

2019 - Boston - ECommerce - Business - General Case - [Passed]

Background: Starbucks is launching a new promotion with its stainless-steel travel mugs, and one of the marketing features is that customers who buy the mug enjoy free coffee refills from day one. How should we price it? (No actual numeric calculation is needed; what matters most is your approach to the answer.)

Me: I'd like to ask some clarifying questions before I start. First, is the promotion only for the US market? Second, what are the options for the free refill coffee?

Interviewer: You're on the right track. Let's assume it's for the US ONLY, and we're offering just the standard coffee for free refills.

Me: Got it. In order to price the travel mug, I'd first look at 2 pieces of data -- travel-mug usage among US customers and user preference across coffee types -- because the first tells us the potential buyer population (revenue side) and the second tells us what the free refills will cost (cost side).

Sorry, I forgot the rest of the conversation, but you get the gist of it -- finding a reasonable way to peel the onion (here, the pricing problem) is the key.


Hope you like my sharing. Feel free to comment below to discuss!