Challenge Solution-- Ancheng Deng


In [1]:
%pylab inline
Populating the interactive namespace from numpy and matplotlib
In [2]:
import pandas as pd
import numpy as np
from scipy import stats
import re

Jupyter Notebook页面宽度没有优化好,使用电脑的小伙伴可以通过鼠标中键左右移动,使用手机的小伙伴只有部分output需要左右拖动,感谢亲们谅解!

Problem

Please go to the URL:

https://cdn2.hubspot.net/hubfs/2384307/test_data.csv

and download the data file test_data.csv. This file contains data in three fields:

code: Randomly generated 11-character strings (may include uppercase letters, lowercase letters, numerals, and non-alphanumeric characters).
x: Random deviates uniformly distributed between -50 and 50.
y: Random deviates uniformly distributed between 0 and 100.

Your task is to classify each observation (i.e., each row of the table, containing one code, one x, and one y) into one of 26 groups. The groups are defined by the first alphabetic character that appears in the code, irrespective of case. For example, the code 7%g9aPqwE;4 would belong to group G since the first alphabetic character is g and we're not distinguishing between upper- and lowercase.

Once you have identified the letter-group to which each observation belongs, calculate two results for each group:

  1. The number of observations in the group;
  2. The slope of the best-fit line (as determined by linear regression) for the points (x, y). In other words, for each letter-group, use linear regression to calculate the best-fit line y = A + Bx, and report the slope B.

Your final product should be a table with three columns: the letter groups, the number of observations in each letter group, and the slope of the regression line for each letter group. Of course, it should have 26 rows, one for each letter of the alphabet. I would also like you to provide a brief comment (one sentence is fine) on the results that you got for the slopes, explaining why they are or are not what you would expect given how the data set was generated. Finally, I would like you to generate a visualization of the data that presents all 26 letter-groups and their slopes, sorted by the slope. It's up to you to decide what kind of visualization is most relevant or appropriate.

My Solution

  1. Given that it is a very big dataset with every row to be processed (actually only the first columns is of our concern for now), I'd like to compare both for loop in Python and apply function. Turns out that in total, for loop is estimated to take >5h while apply() in minutes, so I definitely recommend using apply() in your code.

  2. The group regression is trivial and the test is meant to test the grouping function in pandas and you analytical skills. Please refer to my codes and analysis for details.

Generate Result Table

In [3]:
df = pd.read_csv("test_data.csv") # read in csv file
df
Out[3]:
codexy
07B22eXm7dO649.86026975.403216
1sb$Tb0NdrJ2-36.01108311.730664
2RE3h5#ym<d9-17.36750896.169508
3Dkh;g>mX.Ik26.94520765.115143
4uw7X6u9$vG.-12.54731174.810954
5CJIO6ysIVE436.6214555.112278
6!4BLN89TpBF4.01641830.832358
7F<p53Q"$o"9-10.59741374.086422
8pCdAsnDkf9q-48.76223270.355983
9tSoLASxuBOh41.61911777.286080
10Q5m$@Asxf,F-8.12786371.630108
11qQ#n"vZb%E#8.4536300.511155
12d4KcPPFs,Ma-0.61036569.068589
13Vr5UsYtJyJX41.70106915.144145
14ZyV6o8.dhIq19.81218969.967689
15%#MPZdPj9e014.71785189.427976
16<cpj8$2SBrO28.80145678.859624
17OVn58<OSEkm-48.34091953.387808
18tX"bF5kUTL30.37575549.082788
19pI,WYW#eQaM15.37676899.660078
20HX"B4DMoH,I46.47790664.719937
21wjG>%Xc1sMf31.02667997.436465
22v0tcW%wZNJS16.9274637.121895
23s,4XSud9,BI-37.69578146.936189
24@HA0wakl4<Y-1.20104781.218190
25eJqHbkZwUe9-23.54504244.113845
266Nhsc;KZrNP33.2762501.742119
27YCobn7V;sXk-8.03855833.588995
283H67Hfm6gjF-5.02030816.326110
29s5Q0dyqhOni-4.55569217.214983
............
999970!EP3Hwl<h$c-36.44670157.589548
999971kSm">mGTKHk26.00422622.132954
999972Bg"FS.32>wk-41.32435354.080752
999973rHnO%Ze%UWk10.00609657.485213
999974llvo.$4Vme7-4.57318553.635453
9999750a6iMcV6VYQ-5.26735565.000891
999976B6P@2u9S%iX21.96208233.205745
999977O7#RAEc4Trm-44.34615958.277554
9999784qZEG!"U1R13.30049630.722155
999979G58e0jDxFVa-30.50757260.911724
999980bKH8VRQMQE,-18.52664165.759977
999981SuaVPK#yPh$23.4865267.522892
999982pWP"mAHn6Ac6.86673386.535044
999983Kl3Q0tDN5QG-15.61644563.761934
9999841$ZCMskLE#w43.18347242.417812
999985hM73Av;8C.Y-46.43388251.245138
999986Vbm0;UGD5Hm27.16531347.320975
999987wIc,K4.>5be22.42460132.616290
999988bh%;%oFj39R-7.33244277.768665
999989oq%WNIVpLZ!-11.8147232.941206
999990.!7Fu%z05CZ-45.64789579.260987
999991wJqKO8wbjFI-5.32849161.507668
9999921QA3j%w.xVZ37.55590762.396403
999993#1uO9cJU7p8-8.70073827.081070
999994Y26%n9jwm1y23.11358263.736481
999995Jnuo.IXPC6113.82327982.483209
999996Qw%0mqDZ,Ji43.17285190.463722
999997%Z06s>#RlnC-22.66301811.892157
999998VgSpr4dDgt045.68799645.578465
999999p<cg!JrlwH549.43068079.457338

1000000 rows × 3 columns

In [4]:
def groupingFirstAlphabet(df): # function to assign group according to 1st alphabetic character in the code
    matchObj = re.match(r'[^a-z]*([a-z]).*', df['code'], re.I)
    return matchObj.group(1).upper()

group_df = df.apply(groupingFirstAlphabet, axis = 1) # apply the function over rows of df
In [5]:
df['group'] = group_df
df
Out[5]:
codexygroup
07B22eXm7dO649.86026975.403216B
1sb$Tb0NdrJ2-36.01108311.730664S
2RE3h5#ym<d9-17.36750896.169508R
3Dkh;g>mX.Ik26.94520765.115143D
4uw7X6u9$vG.-12.54731174.810954U
5CJIO6ysIVE436.6214555.112278C
6!4BLN89TpBF4.01641830.832358B
7F<p53Q"$o"9-10.59741374.086422F
8pCdAsnDkf9q-48.76223270.355983P
9tSoLASxuBOh41.61911777.286080T
10Q5m$@Asxf,F-8.12786371.630108Q
11qQ#n"vZb%E#8.4536300.511155Q
12d4KcPPFs,Ma-0.61036569.068589D
13Vr5UsYtJyJX41.70106915.144145V
14ZyV6o8.dhIq19.81218969.967689Z
15%#MPZdPj9e014.71785189.427976M
16<cpj8$2SBrO28.80145678.859624C
17OVn58<OSEkm-48.34091953.387808O
18tX"bF5kUTL30.37575549.082788T
19pI,WYW#eQaM15.37676899.660078P
20HX"B4DMoH,I46.47790664.719937H
21wjG>%Xc1sMf31.02667997.436465W
22v0tcW%wZNJS16.9274637.121895V
23s,4XSud9,BI-37.69578146.936189S
24@HA0wakl4<Y-1.20104781.218190H
25eJqHbkZwUe9-23.54504244.113845E
266Nhsc;KZrNP33.2762501.742119N
27YCobn7V;sXk-8.03855833.588995Y
283H67Hfm6gjF-5.02030816.326110H
29s5Q0dyqhOni-4.55569217.214983S
...............
999970!EP3Hwl<h$c-36.44670157.589548E
999971kSm">mGTKHk26.00422622.132954K
999972Bg"FS.32>wk-41.32435354.080752B
999973rHnO%Ze%UWk10.00609657.485213R
999974llvo.$4Vme7-4.57318553.635453L
9999750a6iMcV6VYQ-5.26735565.000891A
999976B6P@2u9S%iX21.96208233.205745B
999977O7#RAEc4Trm-44.34615958.277554O
9999784qZEG!"U1R13.30049630.722155Q
999979G58e0jDxFVa-30.50757260.911724G
999980bKH8VRQMQE,-18.52664165.759977B
999981SuaVPK#yPh$23.4865267.522892S
999982pWP"mAHn6Ac6.86673386.535044P
999983Kl3Q0tDN5QG-15.61644563.761934K
9999841$ZCMskLE#w43.18347242.417812Z
999985hM73Av;8C.Y-46.43388251.245138H
999986Vbm0;UGD5Hm27.16531347.320975V
999987wIc,K4.>5be22.42460132.616290W
999988bh%;%oFj39R-7.33244277.768665B
999989oq%WNIVpLZ!-11.8147232.941206O
999990.!7Fu%z05CZ-45.64789579.260987F
999991wJqKO8wbjFI-5.32849161.507668W
9999921QA3j%w.xVZ37.55590762.396403Q
999993#1uO9cJU7p8-8.70073827.081070U
999994Y26%n9jwm1y23.11358263.736481Y
999995Jnuo.IXPC6113.82327982.483209J
999996Qw%0mqDZ,Ji43.17285190.463722Q
999997%Z06s>#RlnC-22.66301811.892157Z
999998VgSpr4dDgt045.68799645.578465V
999999p<cg!JrlwH549.43068079.457338P

1000000 rows × 4 columns

In [6]:
def calculateGroupSlope(group_data): # aggregate function to calculate the linear regression slope in each group
    x = group_data['x']; y = group_data['y']
    gradient, intercept, r_value, p_value, std_err = stats.linregress(x,y)
    return gradient

# first group by 'group' column, then apply the aggregate function 
df_slope = df.groupby('group')['x', 'y'].aggregate(calculateGroupSlope) 
df_slope = df_slope['x']

df_count = df.groupby('group')['code'].count() # do the same for the observation count in each group

df_result = pd.concat([df_count, df_slope], axis=1, join_axes=[df_count.index]) # merge the above results
df_result.rename(index=str, columns={'x':'slope', 'code':'count'}, inplace=True) # rename the columns
df_result
Out[6]:
countslope
group
A382340.000431
B384050.000891
C383270.007697
D38309-0.002585
E38547-0.006771
F387040.009832
G38459-0.001408
H381940.003276
I384380.005457
J38413-0.004127
K383760.006628
L384510.009429
M382490.002112
N388550.000844
O38793-0.007709
P38465-0.000075
Q385470.002014
R38564-0.001255
S38425-0.000138
T387470.002590
U385150.002055
V384780.000453
W38569-0.003936
X38182-0.005922
Y38573-0.008263
Z381810.003301
In [7]:
# save to csv
df_result.to_csv("result.csv")

Comment

The slope for each group should be around 0 (as expected) because both x and y values are uniform random variables, which distribute evenly in a rectangular space (in a 2-D plot) with its border paralleled to x and y axis, hence the linear regression line will also be paralleled to the x axis.

Visualization

In [8]:
import matplotlib.pyplot as plt
width = .45
df_result_sorted = df_result.sort_values('slope', ascending=False)['slope']

df_result_sorted.plot(kind='bar', width=width, )
df_result_sorted.plot(secondary_y=False)
plt.title("Slope in Each Group (Descending)")
Out[8]:
Text(0.5,1,'Slope in Each Group (Descending)')