%pylab inline

Populating the interactive namespace from numpy and matplotlib

import pandas as pd
import numpy as np
from scipy import stats
import re

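The Jupyter Notebook page width is not well optimized: on a computer you can scroll left and right with the middle mouse button, and on a phone only some of the outputs need to be dragged left and right. Thanks for your understanding!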

Problem

Please go to the URL:

https://cdn2.hubspot.net/hubfs/2384307/test_data.csv

and download the data file test_data.csv. This file contains data in three fields:

code: Randomly generated 11-character strings (may include uppercase letters, lowercase letters, numerals, and non-alphanumeric characters).
x: Random deviates uniformly distributed between -50 and 50.
y: Random deviates uniformly distributed between 0 and 100.

Your task is to classify each observation (i.e., each row of the table, containing one code, one x, and one y) into one of 26 groups. The groups are defined by the first alphabetic character that appears in the code, irrespective of case. For example, the code 7%g9aPqwE;4 would belong to group G since the first alphabetic character is g and we're not distinguishing between upper- and lowercase.
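As a quick check of the grouping rule, here is a minimal sketch that uses only the standard `re` module. The helper name `first_letter_group` is made up for illustration and is not part of the task statement.

```python
import re

def first_letter_group(code):
    """Return the uppercase first alphabetic character of code, or None if it has none."""
    match = re.search(r"[A-Za-z]", code)
    return match.group(0).upper() if match else None

# The example code from the problem statement
print(first_letter_group("7%g9aPqwE;4"))  # G
```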

Once you have identified the letter-group to which each observation belongs, calculate two results for each group:

The number of observations in the group;
The slope of the best-fit line (as determined by linear regression) for the points (x, y). In other words, for each letter-group, use linear regression to calculate the best-fit line y = A + Bx, and report the slope B.
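For a simple linear regression the least-squares slope has a closed form, B = cov(x, y) / var(x). The sketch below checks this against `scipy.stats.linregress` on synthetic data; the seed and the sample size are arbitrary assumptions and are not taken from the data file.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)        # arbitrary seed, for reproducibility only
x = rng.uniform(-50, 50, size=1000)   # same ranges as in the problem statement
y = rng.uniform(0, 100, size=1000)

# Closed-form least-squares slope: sample covariance divided by sample variance of x
slope_manual = np.cov(x, y)[0, 1] / np.var(x, ddof=1)
slope_scipy = stats.linregress(x, y).slope

print(slope_manual, slope_scipy)
```

The two values should agree up to floating-point rounding.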

Your final product should be a table with three columns: the letter groups, the number of observations in each letter group, and the slope of the regression line for each letter group. Of course, it should have 26 rows, one for each letter of the alphabet. I would also like you to provide a brief comment (one sentence is fine) on the results that you got for the slopes, explaining why they are or are not what you would expect given how the data set was generated. Finally, I would like you to generate a visualization of the data that presents all 26 letter-groups and their slopes, sorted by the slope. It's up to you to decide what kind of visualization is most relevant or appropriate.

My Solution

Because this is a very large dataset (one million rows) in which every row has to be processed (although only the first column, code, matters for the grouping step), I compared a plain Python for loop with the apply function. The for loop was estimated to take more than 5 hours in total, while apply() finished in minutes, so I definitely recommend using apply() in your code.
The group regression itself is trivial; the test is really meant to exercise the grouping functions in pandas and your analytical skills. Please refer to my code and analysis for details.

Generate Result Table

df = pd.read_csv("test_data.csv") # read in csv file
df

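def groupingFirstAlphabet(row): # function to assign group according to 1st alphabetic character in the code
    matchObj = re.search(r'[a-z]', row['code'], re.I) # first letter anywhere in the code, ignoring case
    return matchObj.group(0).upper() if matchObj else None # None if the code contains no letter at all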

group_df = df.apply(groupingFirstAlphabet, axis = 1) # apply the function over rows of df

df['group'] = group_df
df
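A vectorized alternative to the row-wise apply() is sketched below using the pandas string accessor. This is a variant offered for comparison, not the code used above; the small DataFrame `demo` is an assumption for illustration and reuses a few codes from this notebook's output.

```python
import pandas as pd

demo = pd.DataFrame({"code": ["7B22eXm7dO6", "!4BLN89TpBF", "<cpj8$2SBrO", "6Nhsc;KZrNP"]})

# str.extract returns the first match of the capture group per row (NaN if there is no letter)
demo["group"] = demo["code"].str.extract(r"([A-Za-z])", expand=False).str.upper()
print(demo["group"].tolist())  # ['B', 'B', 'C', 'N']
```

Because the whole column is handled in one call, this avoids invoking a Python function once per row.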

def calculateGroupSlope(group_data): # aggregate function to calculate the linear regression slope in each group
    x = group_data['x']; y = group_data['y']
    gradient, intercept, r_value, p_value, std_err = stats.linregress(x,y)
    return gradient

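# first group by 'group' column, then compute the slope within each group
# (recent pandas requires the list selection [['x', 'y']]; apply passes each group's DataFrame to the function)
df_slope = df.groupby('group')[['x', 'y']].apply(calculateGroupSlope)
df_slope = df_slope.rename('x') # keep the name 'x' so that the rename below maps it to 'slope'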

df_count = df.groupby('group')['code'].count() # do the same for the observation count in each group

df_result = pd.concat([df_count, df_slope], axis=1) # merge the above results (both are indexed by group; join_axes was removed in pandas 1.0)
df_result.rename(index=str, columns={'x':'slope', 'code':'count'}, inplace=True) # rename the columns
df_result

# save to csv
df_result.to_csv("result.csv")

Comment

The slope for each group is close to 0, as expected: x and y are generated as uniform random variables with no dependence on each other, so the points fill a rectangle whose sides are parallel to the x and y axes, and the best-fit line is therefore (nearly) parallel to the x axis.
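How close to 0 should the slopes be? Assuming x and y are independent, the standard error of a fitted slope is roughly sigma_y / (sigma_x * sqrt(n)). The back-of-the-envelope sketch below takes n = 38,400 as an approximate, typical group size read off the result table.

```python
import math

n = 38_400                      # approximate typical group size in the result table
sigma_x = 100 / math.sqrt(12)   # std of a uniform variable on an interval of width 100
sigma_y = 100 / math.sqrt(12)   # y is also uniform on an interval of width 100

# Approximate standard error of the slope when x and y are independent
se_slope = sigma_y / (sigma_x * math.sqrt(n))
print(round(se_slope, 4))  # 0.0051
```

The slopes in the result table all lie within about ±0.01, that is, within roughly two standard errors of 0.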

Visualization

import matplotlib.pyplot as plt
width = .45
df_result_sorted = df_result.sort_values('slope', ascending=False)['slope']

df_result_sorted.plot(kind='bar', width=width, )
df_result_sorted.plot(secondary_y=False)
plt.title("Slope in Each Group (Descending)")

Text(0.5,1,'Slope in Each Group (Descending)')

	code	x	y
0	7B22eXm7dO6	49.860269	75.403216
1	sb$Tb0NdrJ2	-36.011083	11.730664
2	RE3h5#ym<d9	-17.367508	96.169508
3	Dkh;g>mX.Ik	26.945207	65.115143
4	uw7X6u9$vG.	-12.547311	74.810954
5	CJIO6ysIVE4	36.621455	5.112278
6	!4BLN89TpBF	4.016418	30.832358
7	F<p53Q"$o"9	-10.597413	74.086422
8	pCdAsnDkf9q	-48.762232	70.355983
9	tSoLASxuBOh	41.619117	77.286080
10	Q5m$@Asxf,F	-8.127863	71.630108
11	qQ#n"vZb%E#	8.453630	0.511155
12	d4KcPPFs,Ma	-0.610365	69.068589
13	Vr5UsYtJyJX	41.701069	15.144145
14	ZyV6o8.dhIq	19.812189	69.967689
15	%#MPZdPj9e0	14.717851	89.427976
16	<cpj8$2SBrO	28.801456	78.859624
17	OVn58<OSEkm	-48.340919	53.387808
18	tX"bF5kUTL3	0.375755	49.082788
19	pI,WYW#eQaM	15.376768	99.660078
20	HX"B4DMoH,I	46.477906	64.719937
21	wjG>%Xc1sMf	31.026679	97.436465
22	v0tcW%wZNJS	16.927463	7.121895
23	s,4XSud9,BI	-37.695781	46.936189
24	@HA0wakl4<Y	-1.201047	81.218190
25	eJqHbkZwUe9	-23.545042	44.113845
26	6Nhsc;KZrNP	33.276250	1.742119
27	YCobn7V;sXk	-8.038558	33.588995
28	3H67Hfm6gjF	-5.020308	16.326110
29	s5Q0dyqhOni	-4.555692	17.214983
...	...	...	...
999970	!EP3Hwl<h$c	-36.446701	57.589548
999971	kSm">mGTKHk	26.004226	22.132954
999972	Bg"FS.32>wk	-41.324353	54.080752
999973	rHnO%Ze%UWk	10.006096	57.485213
999974	llvo.$4Vme7	-4.573185	53.635453
999975	0a6iMcV6VYQ	-5.267355	65.000891
999976	B6P@2u9S%iX	21.962082	33.205745
999977	O7#RAEc4Trm	-44.346159	58.277554
999978	4qZEG!"U1R1	3.300496	30.722155
999979	G58e0jDxFVa	-30.507572	60.911724
999980	bKH8VRQMQE,	-18.526641	65.759977
999981	SuaVPK#yPh$	23.486526	7.522892
999982	pWP"mAHn6Ac	6.866733	86.535044
999983	Kl3Q0tDN5QG	-15.616445	63.761934
999984	1$ZCMskLE#w	43.183472	42.417812
999985	hM73Av;8C.Y	-46.433882	51.245138
999986	Vbm0;UGD5Hm	27.165313	47.320975
999987	wIc,K4.>5be	22.424601	32.616290
999988	bh%;%oFj39R	-7.332442	77.768665
999989	oq%WNIVpLZ!	-11.814723	2.941206
999990	.!7Fu%z05CZ	-45.647895	79.260987
999991	wJqKO8wbjFI	-5.328491	61.507668
999992	1QA3j%w.xVZ	37.555907	62.396403
999993	#1uO9cJU7p8	-8.700738	27.081070
999994	Y26%n9jwm1y	23.113582	63.736481
999995	Jnuo.IXPC61	13.823279	82.483209
999996	Qw%0mqDZ,Ji	43.172851	90.463722
999997	%Z06s>#RlnC	-22.663018	11.892157
999998	VgSpr4dDgt0	45.687996	45.578465
999999	p<cg!JrlwH5	49.430680	79.457338

	code	x	y	group
0	7B22eXm7dO6	49.860269	75.403216	B
1	sb$Tb0NdrJ2	-36.011083	11.730664	S
2	RE3h5#ym<d9	-17.367508	96.169508	R
3	Dkh;g>mX.Ik	26.945207	65.115143	D
4	uw7X6u9$vG.	-12.547311	74.810954	U
5	CJIO6ysIVE4	36.621455	5.112278	C
6	!4BLN89TpBF	4.016418	30.832358	B
7	F<p53Q"$o"9	-10.597413	74.086422	F
8	pCdAsnDkf9q	-48.762232	70.355983	P
9	tSoLASxuBOh	41.619117	77.286080	T
10	Q5m$@Asxf,F	-8.127863	71.630108	Q
11	qQ#n"vZb%E#	8.453630	0.511155	Q
12	d4KcPPFs,Ma	-0.610365	69.068589	D
13	Vr5UsYtJyJX	41.701069	15.144145	V
14	ZyV6o8.dhIq	19.812189	69.967689	Z
15	%#MPZdPj9e0	14.717851	89.427976	M
16	<cpj8$2SBrO	28.801456	78.859624	C
17	OVn58<OSEkm	-48.340919	53.387808	O
18	tX"bF5kUTL3	0.375755	49.082788	T
19	pI,WYW#eQaM	15.376768	99.660078	P
20	HX"B4DMoH,I	46.477906	64.719937	H
21	wjG>%Xc1sMf	31.026679	97.436465	W
22	v0tcW%wZNJS	16.927463	7.121895	V
23	s,4XSud9,BI	-37.695781	46.936189	S
24	@HA0wakl4<Y	-1.201047	81.218190	H
25	eJqHbkZwUe9	-23.545042	44.113845	E
26	6Nhsc;KZrNP	33.276250	1.742119	N
27	YCobn7V;sXk	-8.038558	33.588995	Y
28	3H67Hfm6gjF	-5.020308	16.326110	H
29	s5Q0dyqhOni	-4.555692	17.214983	S
...	...	...	...	...
999970	!EP3Hwl<h$c	-36.446701	57.589548	E
999971	kSm">mGTKHk	26.004226	22.132954	K
999972	Bg"FS.32>wk	-41.324353	54.080752	B
999973	rHnO%Ze%UWk	10.006096	57.485213	R
999974	llvo.$4Vme7	-4.573185	53.635453	L
999975	0a6iMcV6VYQ	-5.267355	65.000891	A
999976	B6P@2u9S%iX	21.962082	33.205745	B
999977	O7#RAEc4Trm	-44.346159	58.277554	O
999978	4qZEG!"U1R1	3.300496	30.722155	Q
999979	G58e0jDxFVa	-30.507572	60.911724	G
999980	bKH8VRQMQE,	-18.526641	65.759977	B
999981	SuaVPK#yPh$	23.486526	7.522892	S
999982	pWP"mAHn6Ac	6.866733	86.535044	P
999983	Kl3Q0tDN5QG	-15.616445	63.761934	K
999984	1$ZCMskLE#w	43.183472	42.417812	Z
999985	hM73Av;8C.Y	-46.433882	51.245138	H
999986	Vbm0;UGD5Hm	27.165313	47.320975	V
999987	wIc,K4.>5be	22.424601	32.616290	W
999988	bh%;%oFj39R	-7.332442	77.768665	B
999989	oq%WNIVpLZ!	-11.814723	2.941206	O
999990	.!7Fu%z05CZ	-45.647895	79.260987	F
999991	wJqKO8wbjFI	-5.328491	61.507668	W
999992	1QA3j%w.xVZ	37.555907	62.396403	Q
999993	#1uO9cJU7p8	-8.700738	27.081070	U
999994	Y26%n9jwm1y	23.113582	63.736481	Y
999995	Jnuo.IXPC61	13.823279	82.483209	J
999996	Qw%0mqDZ,Ji	43.172851	90.463722	Q
999997	%Z06s>#RlnC	-22.663018	11.892157	Z
999998	VgSpr4dDgt0	45.687996	45.578465	V
999999	p<cg!JrlwH5	49.430680	79.457338	P

group	count	slope
A	38234	0.000431
B	38405	0.000891
C	38327	0.007697
D	38309	-0.002585
E	38547	-0.006771
F	38704	0.009832
G	38459	-0.001408
H	38194	0.003276
I	38438	0.005457
J	38413	-0.004127
K	38376	0.006628
L	38451	0.009429
M	38249	0.002112
N	38855	0.000844
O	38793	-0.007709
P	38465	-0.000075
Q	38547	0.002014
R	38564	-0.001255
S	38425	-0.000138
T	38747	0.002590
U	38515	0.002055
V	38478	0.000453
W	38569	-0.003936
X	38182	-0.005922
Y	38573	-0.008263
Z	38181	0.003301

Challenge Question — Python, RegExp, Pandas, Group Regression