Preface
1. Data structure
1.1 Series
1.1.1 Creation:
s = pd.Series(data, index=index)
From ndarray:
s = pd.Series(np.random.randn(5), index=['a', 'b', 'c', 'd', 'e'])
From dict:
d = {'b' : 1, 'a' : 0, 'c' : 2}
s = pd.Series(d)
From scalar value:
s = pd.Series(5., index=['a', 'b', 'c', 'd', 'e'])
1.1.2 Opration:
ndarray-like:
s[0]
s[:3]
s[s > s.median()]
s[[4, 3, 1]]
np.exp(s)
dict-like:
s['a']
'e' in s
Vectorized operations:
s + s
s * 2
np.exp(s)
s[1:] + s[:-1] automatically align the data based on label
Name attribute:
s = pd.Series(np.random.randn(5), name='something')
s2 = s.rename("different")
1.2. DataFrame
1.2.1 Creation:
From dict of Series or dicts:
d = {'one' : pd.Series([1., 2., 3.], index=['a', 'b', 'c']),
'two' : pd.Series([1., 2., 3., 4.], index=['a', 'b', 'c', 'd'])}
df1 = pd.DataFrame(d)
df2 = pd.DataFrame(d, index=['d', 'b', 'a'])
df3 = pd.DataFrame(d, index=['d', 'b', 'a'], columns=['two', 'three'])
df.index
df.columns
From dict of ndarrays/lists:
d = {'one' : [1., 2., 3., 4.],
'two' : [4., 3., 2., 1.]}
df1 = pd.DataFrame(d)
df2 = pd.DataFrame(d, index=['a', 'b', 'c', 'd'])
From structured or record array:
d = np.zeros((2,), dtype=[('A', 'i4'),('B', 'f4'),('C', 'a10')])
d[:] = [(1,2.,'Hello'), (2,3.,"World")]
df1 = pd.DataFrame(d)
df2 = pd.DataFrame(d, index=['first', 'second'])
df3 = pd.DataFrame(d, columns=['C', 'A', 'B'])
From a list of dicts:
d = [{'a': 1, 'b': 2}, {'a': 5, 'b': 10, 'c': 20}]
df1 = pd.DataFrame(d)
df2 = pd.DataFrame(d, index=['first', 'second'])
df3 = pd.DataFrame(d, columns=['a', 'b'])
DataFrame.from_dict:
df1 = pd.DataFrame.from_dict(dict([('A', [1, 2, 3]), ('B', [4, 5, 6])]))
df2 = pd.DataFrame.from_dict(dict([('A', [1, 2, 3]), ('B', [4, 5, 6])]),orient='index', columns=['one', 'two', 'three'])
DataFrame.from_records:
df = pd.DataFrame.from_records(data, index='C')
1.2.2 Operation:
Column selection, addition, deletion
df['one']
df['three'] = df['one'] * df['two']
df['flag'] = df['one'] > 2
del df['two']
three = df.pop('three')
df['foo'] = 'bar'
df['one_trunc'] = df['one'][:2]
df.insert(1, 'bar', df['one'])
Assigning New Columns in Method Chains:
iris = pd.read_csv('data/iris.data')
iris.assign(sepal_ratio = iris['SepalWidth']/iris['SepalLength'])
iris.assign(sepal_ratio = lambda x: (x['SepalWidth’]/x['SepalLength']))
Indexing/Selection:
Select column df[col] Series
Select row by label df.loc[label] Series
Select row by integer location df.iloc[loc] Series
Slice rows df[5:10] DataFrame
Select rows by boolean vector df[bool_vec] DataFrame
Data alignment and arithmetic:
df = pd.DataFrame(np.random.randn(10, 4), columns=['A', 'B', 'C', 'D'])
df2 = pd.DataFrame(np.random.randn(7, 3), columns=['A', 'B', 'C'])
df + df2
df1 = pd.DataFrame({'a' : [1, 0, 1], 'b' : [0, 1, 1] }, dtype=bool)
df2 = pd.DataFrame({'a' : [0, 1, 1], 'b' : [1, 1, 0] }, dtype=bool)
df1 & df2
Transposing:
df[:5].T
DataFrame interoperability with NumPy functions:
np.exp(df)
np.asarray(df)
pd.set_option('display.width', 40) # default is 80
pd.DataFrame(np.random.randn(3, 12))
2. Tutorial:
2.1. Object creation:
s = pd.Series([1,3,5,np.nan,6,8])
dates = pd.date_range('20130101', periods=6)
df = pd.DataFrame(np.random.randn(6,4), index=dates, columns=list('ABCD'))
df = pd.read_csv('data/iris.data')
2.2. Viewing data:
df.head()
df.tail(3)
df.info()
df.shape
df.index
df.columns
df.values
df.describe()
df.T
df.sort_index(axis=1, ascending=False)
df.sort_values(by='B')
2.3. Indexing/Selection:
Select column df[col] Series
Select row by label df.loc[label] Series
Select row by integer location df.iloc[loc] Series
Slice rows df[5:10] DataFrame
Select rows by boolean vector df[bool_vec] DataFrame
2.4. Missing Data:
df1 = df.reindex(index=dates[0:4], columns=list(df.columns) + ['E'])
df1.loc[dates[0]:dates[1],'E'] = 1
df1.dropna(how='any') To drop any rows that have missing data.
df1.fillna(value=5) Filling missing data.
pd.isna(df1) To get the boolean mask where values are nan.
2.5. Operations:
df.mean()
df.apply(np.cumsum)
df.apply(lambda x: x.max() - x.min())
s = pd.Series(np.random.randint(0, 7, size=10))
s.value_counts() Histogramming
s = pd.Series(['A', 'B', 'C', 'Aaba', 'Baca', np.nan, 'CABA', 'dog', 'cat'])
s.str.lower() String Methods
2.6. Merge
Concat:
df = pd.DataFrame(np.random.randn(10, 4))
pieces = [df[:3], df[3:7], df[7:]]
pd.concat(pieces)
Join:
left = pd.DataFrame({'key': ['foo', 'foo'], 'lval': [1, 2]})
right = pd.DataFrame({'key': ['foo', 'foo'], 'rval': [4, 5]})
pd.merge(left, right, on='key')
Append:
df = pd.DataFrame(np.random.randn(8, 4), columns=['A','B','C','D'])
s = df.iloc[3]
df.append(s, ignore_index=True)
Grouping:
df = pd.DataFrame({'A' : ['foo', 'bar', 'foo', 'bar',
'foo', 'bar', 'foo', 'foo'],
'B' : ['one', 'one', 'two', 'three',
'two', 'two', 'one', 'three'],
'C' : np.random.randn(8),
'D' : np.random.randn(8)})
df.groupby(['A','B']).sum()
2.7. Categoricals:
df = pd.DataFrame({"id":[1,2,3,4,5,6], "raw_grade":['a', 'b', 'b', 'a', 'a', 'e']})
df["grade"] = df["raw_grade"].astype("category")
df["grade"].cat.categories = ["very good", "good", "very bad"]
df.sort_values(by="grade")
2.8. Plotting:
ts = pd.Series(np.random.randn(1000), index=pd.date_range('1/1/2000', periods=1000))
ts = ts.cumsum()
ts.plot()
df = pd.DataFrame(np.random.randn(1000, 4), index=ts.index, columns=['A', 'B', 'C', 'D'])
df = df.cumsum()
plt.figure(); df.plot(); plt.legend(loc='best')
2.9. IO:
df.to_csv('foo.csv') Writing to a csv file.
pd.read_csv('foo.csv') Reading from a csv file.
df.to_excel('foo.xlsx', sheet_name='Sheet1') Reading and writing to MS Excel.
pd.read_excel('foo.xlsx', 'Sheet1', index_col=None, na_values=['NA']) Reading from an excel file.