%pylab inline
import pandas as pd
import numpy as np
from scipy import stats
import re
Jupyter Notebook页面宽度没有优化好,使用电脑的小伙伴可以通过鼠标中键左右移动,使用手机的小伙伴只有部分output需要左右拖动,感谢亲们谅解!
Problem¶
Please go to the URL:
https://cdn2.hubspot.net/hubfs/2384307/test_data.csv
and download the data file test_data.csv. This file contains data in three fields:
code: Randomly generated 11-character strings (may include uppercase letters, lowercase letters, numerals, and non-alphanumeric characters).
x: Random deviates uniformly distributed between -50 and 50.
y: Random deviates uniformly distributed between 0 and 100.
Your task is to classify each observation (i.e., each row of the table, containing one code, one x, and one y) into one of 26 groups. The groups are defined by the first alphabetic character that appears in the code, irrespective of case. For example, the code 7%g9aPqwE;4 would belong to group G since the first alphabetic character is g and we're not distinguishing between upper- and lowercase.
Once you have identified the letter-group to which each observation belongs, calculate two results for each group:
- The number of observations in the group;
- The slope of the best-fit line (as determined by linear regression) for the points (x, y). In other words, for each letter-group, use linear regression to calculate the best-fit line y = A + Bx, and report the slope B.
Your final product should be a table with three columns: the letter groups, the number of observations in each letter group, and the slope of the regression line for each letter group. Of course, it should have 26 rows, one for each letter of the alphabet. I would also like you to provide a brief comment (one sentence is fine) on the results that you got for the slopes, explaining why they are or are not what you would expect given how the data set was generated. Finally, I would like you to generate a visualization of the data that presents all 26 letter-groups and their slopes, sorted by the slope. It's up to you to decide what kind of visualization is most relevant or appropriate.
My Solution¶
Given that it is a very big dataset with every row to be processed (actually only the first columns is of our concern for now), I'd like to compare both for loop in Python and apply function. Turns out that in total, for loop is estimated to take >5h while apply() in minutes, so I definitely recommend using apply() in your code.
The group regression is trivial and the test is meant to test the grouping function in pandas and you analytical skills. Please refer to my codes and analysis for details.
Generate Result Table¶
df = pd.read_csv("test_data.csv") # read in csv file
df
def groupingFirstAlphabet(df): # function to assign group according to 1st alphabetic character in the code
matchObj = re.match(r'[^a-z]*([a-z]).*', df['code'], re.I)
return matchObj.group(1).upper()
group_df = df.apply(groupingFirstAlphabet, axis = 1) # apply the function over rows of df
df['group'] = group_df
df
def calculateGroupSlope(group_data): # aggregate function to calculate the linear regression slope in each group
x = group_data['x']; y = group_data['y']
gradient, intercept, r_value, p_value, std_err = stats.linregress(x,y)
return gradient
# first group by 'group' column, then apply the aggregate function
df_slope = df.groupby('group')['x', 'y'].aggregate(calculateGroupSlope)
df_slope = df_slope['x']
df_count = df.groupby('group')['code'].count() # do the same for the observation count in each group
df_result = pd.concat([df_count, df_slope], axis=1, join_axes=[df_count.index]) # merge the above results
df_result.rename(index=str, columns={'x':'slope', 'code':'count'}, inplace=True) # rename the columns
df_result
# save to csv
df_result.to_csv("result.csv")
Comment¶
The slope for each group should be around 0 (as expected) because both x and y values are uniform random variables, which distribute evenly in a rectangular space (in a 2-D plot) with its border paralleled to x and y axis, hence the linear regression line will also be paralleled to the x axis.
Visualization¶
import matplotlib.pyplot as plt
width = .45
df_result_sorted = df_result.sort_values('slope', ascending=False)['slope']
df_result_sorted.plot(kind='bar', width=width, )
df_result_sorted.plot(secondary_y=False)
plt.title("Slope in Each Group (Descending)")
Leave A Comment