Introduction to Pandas

Brief introduction to the data structrue and grammars of Pandas

Posted by Yingshan Li on October 8, 2019

Preface

1. Data structure

1.1 Series

1.1.1 Creation:

s = pd.Series(data, index=index)

From ndarray: 
	s = pd.Series(np.random.randn(5), index=['a', 'b', 'c', 'd', 'e'])
From dict: 
	d = {'b' : 1, 'a' : 0, 'c' : 2}
	s = pd.Series(d)
From scalar value:
	s = pd.Series(5., index=['a', 'b', 'c', 'd', 'e'])

1.1.2 Opration:

ndarray-like:
		s[0]
		s[:3]
		s[s > s.median()]
		s[[4, 3, 1]]
		np.exp(s)
dict-like:
		s['a']
		'e' in s
Vectorized operations:
		s + s
		s * 2
		np.exp(s)
		s[1:] + s[:-1]	automatically align the data based on label
Name attribute:
		s = pd.Series(np.random.randn(5), name='something')
		s2 = s.rename("different")

1.2. DataFrame

1.2.1 Creation:

From dict of Series or dicts:
		d = {'one' : pd.Series([1., 2., 3.], index=['a', 'b', 'c']),
 			     'two' : pd.Series([1., 2., 3., 4.], index=['a', 'b', 'c', 'd'])}
		df1 = pd.DataFrame(d)
		df2 = pd.DataFrame(d, index=['d', 'b', 'a'])
		df3 = pd.DataFrame(d, index=['d', 'b', 'a'], columns=['two', 'three'])
		df.index
		df.columns

From dict of ndarrays/lists:
		d = {'one' : [1., 2., 3., 4.],
			     'two' : [4., 3., 2., 1.]}
		df1 = pd.DataFrame(d)
		df2 = pd.DataFrame(d, index=['a', 'b', 'c', 'd'])

From structured or record array:
		d = np.zeros((2,), dtype=[('A', 'i4'),('B', 'f4'),('C', 'a10')])
		d[:] = [(1,2.,'Hello'), (2,3.,"World")]
		df1 = pd.DataFrame(d)
		df2 = pd.DataFrame(d, index=['first', 'second'])
		df3 = pd.DataFrame(d, columns=['C', 'A', 'B'])

From a list of dicts:
		d = [{'a': 1, 'b': 2}, {'a': 5, 'b': 10, 'c': 20}]
		df1 = pd.DataFrame(d)
		df2 = pd.DataFrame(d, index=['first', 'second'])
		df3 = pd.DataFrame(d, columns=['a', 'b'])

DataFrame.from_dict:
		df1 = pd.DataFrame.from_dict(dict([('A', [1, 2, 3]), ('B', [4, 5, 6])]))
		df2 = pd.DataFrame.from_dict(dict([('A', [1, 2, 3]), ('B', [4, 5, 6])]),orient='index', columns=['one', 'two', 'three'])

DataFrame.from_records:
		df = pd.DataFrame.from_records(data, index='C')

1.2.2 Operation:

Column selection, addition, deletion
		df['one']
		df['three'] = df['one'] * df['two']
		df['flag'] = df['one'] > 2
		del df['two']
		three = df.pop('three')
		df['foo'] = 'bar'
		df['one_trunc'] = df['one'][:2]
		df.insert(1, 'bar', df['one'])

Assigning New Columns in Method Chains: 
		iris = pd.read_csv('data/iris.data')
		iris.assign(sepal_ratio = iris['SepalWidth']/iris['SepalLength'])
		iris.assign(sepal_ratio = lambda x: (x['SepalWidth’]/x['SepalLength']))

Indexing/Selection:
		Select column			df[col]		Series
		Select row by label		df.loc[label]       Series
		Select row by integer location	df.iloc[loc]      Series
		Slice rows			df[5:10]	DataFrame
		Select rows by boolean vector	df[bool_vec]     DataFrame

Data alignment and arithmetic:
		df = pd.DataFrame(np.random.randn(10, 4), columns=['A', 'B', 'C', 'D'])
		df2 = pd.DataFrame(np.random.randn(7, 3), columns=['A', 'B', 'C'])
		df + df2

		df1 = pd.DataFrame({'a' : [1, 0, 1], 'b' : [0, 1, 1] }, dtype=bool)
		df2 = pd.DataFrame({'a' : [0, 1, 1], 'b' : [1, 1, 0] }, dtype=bool)
		df1 & df2

Transposing:
		df[:5].T

DataFrame interoperability with NumPy functions:
		np.exp(df)
		np.asarray(df)
		pd.set_option('display.width', 40) # default is 80
		pd.DataFrame(np.random.randn(3, 12))

2. Tutorial:

2.1. Object creation:

s = pd.Series([1,3,5,np.nan,6,8])
dates = pd.date_range('20130101', periods=6)
df = pd.DataFrame(np.random.randn(6,4), index=dates, columns=list('ABCD'))
df = pd.read_csv('data/iris.data')

2.2. Viewing data:

df.head()
df.tail(3)
df.info()
df.shape
df.index
df.columns
df.values
df.describe()
df.T
df.sort_index(axis=1, ascending=False)
df.sort_values(by='B')

2.3. Indexing/Selection:

	Select column			df[col]		Series
	Select row by label		df.loc[label]	Series
	Select row by integer location	df.iloc[loc]	Series
	Slice rows			df[5:10]	DataFrame
	Select rows by boolean vector	df[bool_vec]	DataFrame

2.4. Missing Data:

	df1 = df.reindex(index=dates[0:4], columns=list(df.columns) + ['E'])
	df1.loc[dates[0]:dates[1],'E'] = 1
	df1.dropna(how='any')		To drop any rows that have missing data.
	df1.fillna(value=5)		Filling missing data.
	pd.isna(df1)			To get the boolean mask where values are nan.

2.5. Operations:

	df.mean()
	df.apply(np.cumsum)
	df.apply(lambda x: x.max() - x.min())
	s = pd.Series(np.random.randint(0, 7, size=10))
	s.value_counts()	Histogramming
	s = pd.Series(['A', 'B', 'C', 'Aaba', 'Baca', np.nan, 'CABA', 'dog', 'cat'])
	s.str.lower()		String Methods

2.6. Merge

	Concat:
		df = pd.DataFrame(np.random.randn(10, 4))
		pieces = [df[:3], df[3:7], df[7:]]
		pd.concat(pieces)
	Join:
		left = pd.DataFrame({'key': ['foo', 'foo'], 'lval': [1, 2]})
		right = pd.DataFrame({'key': ['foo', 'foo'], 'rval': [4, 5]})
		pd.merge(left, right, on='key')
	Append:
		df = pd.DataFrame(np.random.randn(8, 4), columns=['A','B','C','D'])
		s = df.iloc[3]
		df.append(s, ignore_index=True)
	Grouping:
		df = pd.DataFrame({'A' : ['foo', 'bar', 'foo', 'bar',
 					  'foo', 'bar', 'foo', 'foo'],
					  'B' : ['one', 'one', 'two', 'three',
					  'two', 'two', 'one', 'three'],
					  'C' : np.random.randn(8),
					  'D' : np.random.randn(8)})
		df.groupby(['A','B']).sum()

2.7. Categoricals:

	df = pd.DataFrame({"id":[1,2,3,4,5,6], "raw_grade":['a', 'b', 'b', 'a', 'a', 'e']})
	df["grade"] = df["raw_grade"].astype("category")
	df["grade"].cat.categories = ["very good", "good", "very bad"]
	df.sort_values(by="grade")

2.8. Plotting:

	ts = pd.Series(np.random.randn(1000), index=pd.date_range('1/1/2000', periods=1000))
	ts = ts.cumsum()
	ts.plot()

	df = pd.DataFrame(np.random.randn(1000, 4), index=ts.index, columns=['A', 'B', 'C', 'D'])
	df = df.cumsum()
	plt.figure(); df.plot(); plt.legend(loc='best')

2.9. IO:

	df.to_csv('foo.csv')		Writing to a csv file.
	pd.read_csv('foo.csv')		Reading from a csv file.
	df.to_excel('foo.xlsx', sheet_name='Sheet1')	Reading and writing to MS Excel.
	pd.read_excel('foo.xlsx', 'Sheet1', index_col=None, na_values=['NA'])	Reading from an excel file.