Pandas - Scientists Data Analysis
1. Sample Data
2. Import Module
import pandas as pd
from print_df import print_df
3. Data Analysis
- CSV (comma-separated values): a file in which the values are separated by commas (,).
- Load the CSV file (read_csv uses ',' as its default separator, so sep can be omitted)
df = pd.read_csv('data/scientists.csv')
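To make the separator point concrete, here is a minimal sketch that reads in-memory text instead of data/scientists.csv. The two-row sample and the names csv_text, tsv_text, df_csv and df_tsv are assumptions made for illustration only.

```python
import io

import pandas as pd

# Hypothetical two-row sample standing in for data/scientists.csv
csv_text = "Name,Age\nMarie Curie,66\nJohn Snow,45\n"
tsv_text = csv_text.replace(",", "\t")

# Comma is the default separator, so sep can be left out
df_csv = pd.read_csv(io.StringIO(csv_text))

# Any other delimiter (here a tab) has to be passed explicitly
df_tsv = pd.read_csv(io.StringIO(tsv_text), sep="\t")

# Both calls should produce the same DataFrame
print(df_csv.equals(df_tsv))
```

If sep were left out for the tab-separated text, each line would be read as a single column.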
- Check the number of rows and columns in the data
df = pd.read_csv('data/scientists.csv')
print('shape:', df.shape)
shape: (8, 5)
Process finished with exit code 0
- The dataset is small, so print the whole DataFrame to check its structure (8 rows, 5 columns)
df = pd.read_csv('data/scientists.csv')
print_df(df)
| | Name | Born | Died | Age | Occupation |
| 0 | Rosaline Franklin | 1920-07-25 | 1958-04-16 | 37 | Chemist |
| 1 | William Gosset | 1876-06-13 | 1937-10-16 | 61 | Statistician |
| 2 | Florence Nightingale | 1820-05-12 | 1910-08-13 | 90 | Nurse |
| 3 | Marie Curie | 1867-11-07 | 1934-07-04 | 66 | Chemist |
| 4 | Rachel Carson | 1907-05-27 | 1964-04-14 | 56 | Biologist |
| 5 | John Snow | 1813-03-15 | 1858-06-16 | 45 | Physician |
| 6 | Alan Turing | 1912-06-23 | 1954-06-07 | 41 | Computer Scientist |
| 7 | Johann Gauss | 1777-04-30 | 1855-02-23 | 77 | Mathematician |
- Print the first 3 rows (head)
| | Name | Born | Died | Age | Occupation |
| 0 | Rosaline Franklin | 1920-07-25 | 1958-04-16 | 37 | Chemist |
| 1 | William Gosset | 1876-06-13 | 1937-10-16 | 61 | Statistician |
| 2 | Florence Nightingale | 1820-05-12 | 1910-08-13 | 90 | Nurse |
- Print the last 4 rows (tail)
| | Name | Born | Died | Age | Occupation |
| 4 | Rachel Carson | 1907-05-27 | 1964-04-14 | 56 | Biologist |
| 5 | John Snow | 1813-03-15 | 1858-06-16 | 45 | Physician |
| 6 | Alan Turing | 1912-06-23 | 1954-06-07 | 41 | Computer Scientist |
| 7 | Johann Gauss | 1777-04-30 | 1855-02-23 | 77 | Mathematician |
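The calls that produced the two tables above are not shown. One plausible way to get them is DataFrame.head and DataFrame.tail, sketched below on a small DataFrame rebuilt from the Name and Age values in the table. The names first_three and last_four are hypothetical.

```python
import pandas as pd

# Name and Age values copied from the table above; other columns omitted for brevity
df = pd.DataFrame({
    "Name": ["Rosaline Franklin", "William Gosset", "Florence Nightingale",
             "Marie Curie", "Rachel Carson", "John Snow",
             "Alan Turing", "Johann Gauss"],
    "Age": [37, 61, 90, 66, 56, 45, 41, 77],
})

first_three = df.head(3)  # rows with index 0, 1, 2
last_four = df.tail(4)    # rows with index 4, 5, 6, 7

print(first_three)
print(last_four)
```

Both methods default to 5 rows when called without an argument.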
- Extract the Age column as a Series
ages = df['Age']
print(ages)
0 37
1 61
2 90
3 66
4 56
5 45
6 41
7 77
Name: Age, dtype: int64
- Sort Age in ascending order
0 37
6 41
5 45
4 56
1 61
3 66
7 77
2 90
Name: Age, dtype: int64
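The sorting call itself is missing from the notes. A minimal sketch, assuming Series.sort_values was used, reproduces the index order of the sorted output above. The Series is rebuilt by hand and sorted_ages is a hypothetical name.

```python
import pandas as pd

# Age values copied from the output above
ages = pd.Series([37, 61, 90, 66, 56, 45, 41, 77], name="Age")

# sort_values returns a new Series; ascending=True is the default
sorted_ages = ages.sort_values()
print(sorted_ages)

# The original index labels travel with the values
print(list(sorted_ages.index))
```

Passing ascending=False would give the descending order instead.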
- Get the minimum and maximum values
ages = df['Age']
print('Min : ', ages.min())
print('Max : ', ages.max())
Min : 37
Max : 90
- Get the median
print('Median : ', ages.median())
Median : 58.5
- Get the mean
print('Mean : ', ages.mean())
Mean : 59.125
- Get the standard deviation
print('Std : ', ages.std())
Std : 18.325918413937288
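The standard deviation above is the sample standard deviation, because Series.std divides by n - 1 (ddof=1) by default. The sketch below contrasts it with the population version and shows describe() as a one-call summary. The Series is rebuilt by hand, and sample_std, population_std and summary are hypothetical names.

```python
import pandas as pd

# Age values copied from the output above
ages = pd.Series([37, 61, 90, 66, 56, 45, 41, 77], name="Age")

# Series.std() uses ddof=1 (sample standard deviation) by default
sample_std = ages.std()

# ddof=0 gives the population standard deviation instead
population_std = ages.std(ddof=0)

# describe() collects count, mean, std, min, quartiles and max in one call
summary = ages.describe()

print(sample_std, population_std)
print(summary)
```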
- Build a boolean mask marking the ages above the mean
above_mean = ages > ages.mean()
print(above_mean)
0 False
1 True
2 True
3 True
4 False
5 False
6 False
7 True
Name: Age, dtype: bool
- Extract only the rows whose Age is above the mean into a new DataFrame
df_above_mean = df[above_mean]
print_df(df_above_mean)
| | Name | Born | Died | Age | Occupation |
| 1 | William Gosset | 1876-06-13 | 1937-10-16 | 61 | Statistician |
| 2 | Florence Nightingale | 1820-05-12 | 1910-08-13 | 90 | Nurse |
| 3 | Marie Curie | 1867-11-07 | 1934-07-04 | 66 | Chemist |
| 7 | Johann Gauss | 1777-04-30 | 1855-02-23 | 77 | Mathematician |
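The two steps above are boolean indexing: a comparison builds a True/False Series, and indexing the DataFrame with it keeps the True rows. A self-contained sketch on a DataFrame rebuilt from the Name and Age values follows. The name df_at_or_below_mean is hypothetical and added only to show mask inversion.

```python
import pandas as pd

# Name and Age values copied from the tables above
df = pd.DataFrame({
    "Name": ["Rosaline Franklin", "William Gosset", "Florence Nightingale",
             "Marie Curie", "Rachel Carson", "John Snow",
             "Alan Turing", "Johann Gauss"],
    "Age": [37, 61, 90, 66, 56, 45, 41, 77],
})

# The comparison yields a boolean Series aligned with df's index
above_mean = df["Age"] > df["Age"].mean()

# Indexing with the mask keeps only the rows where it is True
df_above_mean = df[above_mean]

# ~ inverts the mask and selects the remaining rows
df_at_or_below_mean = df[~above_mean]

print(df_above_mean)
print(df_at_or_below_mean)
```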
- Extract the rows whose Occupation column equals 'Chemist'
print_df(df[df['Occupation'] == 'Chemist'])
| | Name | Born | Died | Age | Occupation |
| 0 | Rosaline Franklin | 1920-07-25 | 1958-04-16 | 37 | Chemist |
| 3 | Marie Curie | 1867-11-07 | 1934-07-04 | 66 | Chemist |
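The equality test above is case-sensitive, so 'chemist' would not match 'Chemist'. The sketch below shows that, along with a case-insensitive variant and isin() for several values at once. The DataFrame is rebuilt from the table, and exact, chemists and chem_or_bio are hypothetical names.

```python
import pandas as pd

# Name and Occupation values copied from the tables above
df = pd.DataFrame({
    "Name": ["Rosaline Franklin", "William Gosset", "Florence Nightingale",
             "Marie Curie", "Rachel Carson", "John Snow",
             "Alan Turing", "Johann Gauss"],
    "Occupation": ["Chemist", "Statistician", "Nurse", "Chemist", "Biologist",
                   "Physician", "Computer Scientist", "Mathematician"],
})

# String comparison is case-sensitive: lowercase 'chemist' matches nothing
exact = df[df["Occupation"] == "chemist"]

# Lower-casing the column first makes the match case-insensitive
chemists = df[df["Occupation"].str.lower() == "chemist"]

# isin() selects rows matching any of several values
chem_or_bio = df[df["Occupation"].isin(["Chemist", "Biologist"])]

print(len(exact), len(chemists), len(chem_or_bio))
```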
- Check the data type of each column
print(df.dtypes)
Name object
Born object
Died object
Age int64
Occupation object
dtype: object
- Convert the Born column to a datetime type and build a new column from it
born_date = pd.to_datetime(df['Born'],
0 1920-07-25
1 1876-06-13
2 1820-05-12
3 1867-11-07
4 1907-05-27
5 1813-03-15
6 1912-06-23
7 1777-04-30
Name: Born, dtype: datetime64[ns]
- Convert the Died column to a datetime type and build a new column from it
dide_date = pd.to_datetime(df['Died'],
0 1958-04-16
1 1937-10-16
2 1910-08-13
3 1934-07-04
4 1964-04-14
5 1858-06-16
6 1954-06-07
7 1855-02-23
Name: Died, dtype: datetime64[ns]
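The two to_datetime calls above are cut off after the first argument, so the remaining arguments are unknown. The sketch below is one possible complete call, assuming an explicit format='%Y-%m-%d'; that argument is an assumption, not necessarily what the notes used. It runs on three Born values copied from the output.

```python
import pandas as pd

# Three Born values copied from the output above, stored as plain strings
born = pd.Series(["1920-07-25", "1876-06-13", "1820-05-12"], name="Born")

# Assumption: an explicit format string; the original second argument is not shown
born_date = pd.to_datetime(born, format="%Y-%m-%d")

# The converted Series has a datetime dtype and exposes the .dt accessor
print(born_date.dtype)
print(born_date.dt.year.tolist())
```

With ISO-style dates like these, pd.to_datetime(born) without a format would likely parse the same way; the explicit format just makes the expected layout visible.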
- Add the new columns to the DataFrame
df['Born_date'] = born_date
df['Dide_date'] = dide_date
| | Name | Born | Died | Age | Occupation | Born_date | Dide_date |
| 0 | Rosaline Franklin | 1920-07-25 | 1958-04-16 | 37 | Chemist | 1920-07-25 00:00:00 | 1958-04-16 00:00:00 |
| 1 | William Gosset | 1876-06-13 | 1937-10-16 | 61 | Statistician | 1876-06-13 00:00:00 | 1937-10-16 00:00:00 |
| 2 | Florence Nightingale | 1820-05-12 | 1910-08-13 | 90 | Nurse | 1820-05-12 00:00:00 | 1910-08-13 00:00:00 |
| 3 | Marie Curie | 1867-11-07 | 1934-07-04 | 66 | Chemist | 1867-11-07 00:00:00 | 1934-07-04 00:00:00 |
| 4 | Rachel Carson | 1907-05-27 | 1964-04-14 | 56 | Biologist | 1907-05-27 00:00:00 | 1964-04-14 00:00:00 |
| 5 | John Snow | 1813-03-15 | 1858-06-16 | 45 | Physician | 1813-03-15 00:00:00 | 1858-06-16 00:00:00 |
| 6 | Alan Turing | 1912-06-23 | 1954-06-07 | 41 | Computer Scientist | 1912-06-23 00:00:00 | 1954-06-07 00:00:00 |
| 7 | Johann Gauss | 1777-04-30 | 1855-02-23 | 77 | Mathematician | 1777-04-30 00:00:00 | 1855-02-23 00:00:00 |
- Drop the object-typed Born / Died columns (modifies the original DataFrame in place)
df.drop(['Born', 'Died'], axis=1, inplace=True)
| | Name | Age | Occupation | Born_date | Dide_date |
| 0 | Rosaline Franklin | 37 | Chemist | 1920-07-25 00:00:00 | 1958-04-16 00:00:00 |
| 1 | William Gosset | 61 | Statistician | 1876-06-13 00:00:00 | 1937-10-16 00:00:00 |
| 2 | Florence Nightingale | 90 | Nurse | 1820-05-12 00:00:00 | 1910-08-13 00:00:00 |
| 3 | Marie Curie | 66 | Chemist | 1867-11-07 00:00:00 | 1934-07-04 00:00:00 |
| 4 | Rachel Carson | 56 | Biologist | 1907-05-27 00:00:00 | 1964-04-14 00:00:00 |
| 5 | John Snow | 45 | Physician | 1813-03-15 00:00:00 | 1858-06-16 00:00:00 |
| 6 | Alan Turing | 41 | Computer Scientist | 1912-06-23 00:00:00 | 1954-06-07 00:00:00 |
| 7 | Johann Gauss | 77 | Mathematician | 1777-04-30 00:00:00 | 1855-02-23 00:00:00 |
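Whether drop modifies the original matters for the steps that follow. A minimal sketch on a hypothetical two-row DataFrame shows that drop without inplace=True returns a new DataFrame, and that dropping labels that are already gone raises KeyError unless errors='ignore' is passed. The names second_drop_failed and still_ok are illustrative.

```python
import pandas as pd

# Hypothetical two-row sample with the same column names as the tutorial data
df = pd.DataFrame({
    "Name": ["Marie Curie", "John Snow"],
    "Born": ["1867-11-07", "1813-03-15"],
    "Died": ["1934-07-04", "1858-06-16"],
})

# Without inplace=True, drop returns a new DataFrame and leaves df untouched
dropped_df = df.drop(["Born", "Died"], axis=1)

# Dropping the same labels again raises KeyError because they no longer exist
try:
    dropped_df.drop(["Born", "Died"], axis=1)
    second_drop_failed = False
except KeyError:
    second_drop_failed = True

# errors="ignore" skips labels that are missing
still_ok = dropped_df.drop(["Born", "Died"], axis=1, errors="ignore")

print(list(df.columns), list(dropped_df.columns), second_drop_failed)
```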
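- Create a column with the number of days lived
# Non-in-place alternative to the drop above: Born / Died must still be present in df
dropped_df = df.drop(['Born', 'Died'], axis=1)
dropped_df['Days'] = dropped_df['Dide_date'] - dropped_df['Born_date']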
| | Name | Age | Occupation | Born_date | Dide_date | Days |
| 0 | Rosaline Franklin | 37 | Chemist | 1920-07-25 00:00:00 | 1958-04-16 00:00:00 | 13779 days 00:00:00 |
| 1 | William Gosset | 61 | Statistician | 1876-06-13 00:00:00 | 1937-10-16 00:00:00 | 22404 days 00:00:00 |
| 2 | Florence Nightingale | 90 | Nurse | 1820-05-12 00:00:00 | 1910-08-13 00:00:00 | 32964 days 00:00:00 |
| 3 | Marie Curie | 66 | Chemist | 1867-11-07 00:00:00 | 1934-07-04 00:00:00 | 24345 days 00:00:00 |
| 4 | Rachel Carson | 56 | Biologist | 1907-05-27 00:00:00 | 1964-04-14 00:00:00 | 20777 days 00:00:00 |
| 5 | John Snow | 45 | Physician | 1813-03-15 00:00:00 | 1858-06-16 00:00:00 | 16529 days 00:00:00 |
| 6 | Alan Turing | 41 | Computer Scientist | 1912-06-23 00:00:00 | 1954-06-07 00:00:00 | 15324 days 00:00:00 |
| 7 | Johann Gauss | 77 | Mathematician | 1777-04-30 00:00:00 | 1855-02-23 00:00:00 | 28422 days 00:00:00 |
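Subtracting two datetime columns gives a timedelta column, which is why Days prints as "13779 days 00:00:00". The sketch below repeats the subtraction on two rows copied from the table and then pulls out plain integers with .dt.days. The columns Days_int and Years_approx are hypothetical additions, and 365.25 days per year is only a rough approximation.

```python
import pandas as pd

# Two rows copied from the table above (column names kept as in the tutorial)
df = pd.DataFrame({
    "Name": ["Rosaline Franklin", "Alan Turing"],
    "Born_date": pd.to_datetime(["1920-07-25", "1912-06-23"]),
    "Dide_date": pd.to_datetime(["1958-04-16", "1954-06-07"]),
})

# Subtracting two datetime columns yields a timedelta column
df["Days"] = df["Dide_date"] - df["Born_date"]

# .dt.days extracts the whole-day count as integers
df["Days_int"] = df["Days"].dt.days

# Rough age in years, assuming an average year length of 365.25 days
df["Years_approx"] = df["Days_int"] / 365.25

print(df[["Name", "Days_int", "Years_approx"]])
```

Truncating Years_approx to an integer gives 37 and 41 for these two rows, which agrees with the Age column.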