Pandas - Scientists Data Analysis

1. Sample Data

2. Import Module

import pandas as pd
from print_df import print_df
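`print_df` comes from a local module that this post does not show. For readers who want to run the examples, here is a minimal stand-in, assuming the helper only draws a bordered text table. The name `format_table` and the left-aligned layout are assumptions, not the original helper.

```python
import pandas as pd


def format_table(df: pd.DataFrame) -> str:
    """Render df as a bordered text table (hypothetical stand-in for print_df)."""
    # First column holds the index labels, so its header cell is empty.
    header = [""] + [str(c) for c in df.columns]
    rows = [[str(idx)] + [str(v) for v in row]
            for idx, row in zip(df.index, df.itertuples(index=False))]
    # Each column is as wide as its longest cell.
    widths = [max(len(r[i]) for r in [header] + rows) for i in range(len(header))]
    border = "+" + "+".join("-" * (w + 2) for w in widths) + "+"

    def line(cells):
        return "|" + "|".join(f" {c:<{w}} " for c, w in zip(cells, widths)) + "|"

    return "\n".join([border, line(header), border, *map(line, rows), border])


table_text = format_table(pd.DataFrame({"Name": ["Marie Curie"], "Age": [66]}))
print(table_text)
```

With this in place, `print(format_table(df))` could be used wherever the post calls `print_df(df)`.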

3. Data Analysis

- CSV (comma-separated values): a file whose data fields are separated by commas (,).

- Load the CSV file (read_csv uses a comma as the default separator, so sep can be omitted)

df = pd.read_csv('data/scientists.csv')

- Check the number of rows and columns in the data

df = pd.read_csv('data/scientists.csv')
print('shape:', df.shape)

shape: (8, 5)

Process finished with exit code 0
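To check the claim that `sep` can be omitted without needing the file on disk, the sketch below reads a small in-memory CSV both ways. The two rows and the variable names (`csv_text`, `df_default`, `df_explicit`) are illustrative assumptions; the values are copied from this post's data.

```python
import io

import pandas as pd

# Two rows from this post's data, held in memory instead of a file.
csv_text = (
    "Name,Born,Died,Age,Occupation\n"
    "Marie Curie,1867-11-07,1934-07-04,66,Chemist\n"
    "John Snow,1813-03-15,1858-06-16,45,Physician\n"
)

df_default = pd.read_csv(io.StringIO(csv_text))            # sep defaults to ','
df_explicit = pd.read_csv(io.StringIO(csv_text), sep=',')  # same call, spelled out

print('shape:', df_default.shape)          # (rows, columns)
print(df_default.equals(df_explicit))      # both calls parse identically
```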

- The dataset is small, so print the whole CSV to check its layout (8 rows, 5 columns)

df = pd.read_csv('data/scientists.csv')
print_df(df)

+---+----------------------+------------+------------+-----+--------------------+
|   | Name                 | Born       | Died       | Age | Occupation         |
+---+----------------------+------------+------------+-----+--------------------+
| 0 | Rosaline Franklin    | 1920-07-25 | 1958-04-16 | 37  | Chemist            |
| 1 | William Gosset       | 1876-06-13 | 1937-10-16 | 61  | Statistician       |
| 2 | Florence Nightingale | 1820-05-12 | 1910-08-13 | 90  | Nurse              |
| 3 | Marie Curie          | 1867-11-07 | 1934-07-04 | 66  | Chemist            |
| 4 | Rachel Carson        | 1907-05-27 | 1964-04-14 | 56  | Biologist          |
| 5 | John Snow            | 1813-03-15 | 1858-06-16 | 45  | Physician          |
| 6 | Alan Turing          | 1912-06-23 | 1954-06-07 | 41  | Computer Scientist |
| 7 | Johann Gauss         | 1777-04-30 | 1855-02-23 | 77  | Mathematician      |
+---+----------------------+------------+------------+-----+--------------------+

Process finished with exit code 0

- Print the first 3 rows (head)

print_df(df.head(n=3))

+---+----------------------+------------+------------+-----+--------------+
|   | Name                 | Born       | Died       | Age | Occupation   |
+---+----------------------+------------+------------+-----+--------------+
| 0 | Rosaline Franklin    | 1920-07-25 | 1958-04-16 | 37  | Chemist      |
| 1 | William Gosset       | 1876-06-13 | 1937-10-16 | 61  | Statistician |
| 2 | Florence Nightingale | 1820-05-12 | 1910-08-13 | 90  | Nurse        |
+---+----------------------+------------+------------+-----+--------------+

Process finished with exit code 0

- Print the last 4 rows (tail)

print_df(df.tail(n=4))

+---+---------------+------------+------------+-----+--------------------+
|   | Name          | Born       | Died       | Age | Occupation         |
+---+---------------+------------+------------+-----+--------------------+
| 4 | Rachel Carson | 1907-05-27 | 1964-04-14 | 56  | Biologist          |
| 5 | John Snow     | 1813-03-15 | 1858-06-16 | 45  | Physician          |
| 6 | Alan Turing   | 1912-06-23 | 1954-06-07 | 41  | Computer Scientist |
| 7 | Johann Gauss  | 1777-04-30 | 1855-02-23 | 77  | Mathematician      |
+---+---------------+------------+------------+-----+--------------------+

Process finished with exit code 0
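`head` and `tail` can be thought of as shortcuts for positional slicing. The sketch below compares them with `iloc` on a reduced copy of the data; `df_demo` is a stand-in built inline for illustration, not the DataFrame loaded from the CSV file.

```python
import pandas as pd

# Reduced copy of the scientists data (names and ages only).
df_demo = pd.DataFrame({
    "Name": ["Rosaline Franklin", "William Gosset", "Florence Nightingale",
             "Marie Curie", "Rachel Carson", "John Snow", "Alan Turing",
             "Johann Gauss"],
    "Age": [37, 61, 90, 66, 56, 45, 41, 77],
})

first_three = df_demo.head(n=3)
last_four = df_demo.tail(n=4)

print(first_three.equals(df_demo.iloc[:3]))   # head(3) matches the first 3 positions
print(last_four.equals(df_demo.iloc[-4:]))    # tail(4) matches the last 4 positions
print(list(last_four.index))                  # original index labels are kept
```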

- Extract the data in the Age column

ages = df['Age']
print(ages)

0    37
1    61
2    90
3    66
4    56
5    45
6    41
7    77
Name: Age, dtype: int64

Process finished with exit code 0

- Sort Age in ascending order

print(ages.sort_values())

0    37
6    41
5    45
4    56
1    61
3    66
7    77
2    90
Name: Age, dtype: int64

Process finished with exit code 0
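The sorted output keeps each value's original index label, which is why the index column looks shuffled. A short sketch of two common variations follows: descending order, and renumbering the index after sorting. `ages_demo` is a stand-in Series with the same ages, built inline for illustration.

```python
import pandas as pd

ages_demo = pd.Series([37, 61, 90, 66, 56, 45, 41, 77], name="Age")

# Descending order: the original index labels travel with the values.
descending = ages_demo.sort_values(ascending=False)
print(descending)

# Ascending order with a fresh 0..n-1 index.
renumbered = ages_demo.sort_values().reset_index(drop=True)
print(renumbered)
```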

- Extract the minimum value

ages = df['Age']
print('Minimum:', ages.min())

Minimum: 37

Process finished with exit code 0

- Extract the maximum value

print('Maximum:', ages.max())

Maximum: 90

Process finished with exit code 0

- Extract the median

print('Median:', ages.median())

Median: 58.5

Process finished with exit code 0

- Extract the mean

print('Mean:', ages.mean())

Mean: 59.125

Process finished with exit code 0

- Extract the standard deviation

print('Standard deviation:', ages.std())

Standard deviation: 18.325918413937288

Process finished with exit code 0
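The statistics above can also be collected in one call with `describe()`. One detail worth knowing is that `Series.std()` returns the sample standard deviation (`ddof=1`) by default, which is the value printed above. The sketch below shows both; `ages_demo` is a stand-in Series with the same ages, built inline for illustration.

```python
import math

import pandas as pd

ages_demo = pd.Series([37, 61, 90, 66, 56, 45, 41, 77], name="Age")

# count, mean, std, min, quartiles and max in a single Series.
summary = ages_demo.describe()
print(summary)

sample_std = ages_demo.std()            # divides by n - 1 (default ddof=1)
population_std = ages_demo.std(ddof=0)  # divides by n
print(sample_std, population_std)
```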

- Build a Boolean Series: True where the age is greater than the mean age, False otherwise

above_mean = ages > ages.mean()
print(above_mean)

0    False
1     True
2     True
3     True
4    False
5    False
6    False
7     True
Name: Age, dtype: bool

Process finished with exit code 0

- Keep only the rows whose age is above the mean, as a new DataFrame

df_above_mean = df[above_mean]
print_df(df_above_mean)

+---+----------------------+------------+------------+-----+---------------+
|   | Name                 | Born       | Died       | Age | Occupation    |
+---+----------------------+------------+------------+-----+---------------+
| 1 | William Gosset       | 1876-06-13 | 1937-10-16 | 61  | Statistician  |
| 2 | Florence Nightingale | 1820-05-12 | 1910-08-13 | 90  | Nurse         |
| 3 | Marie Curie          | 1867-11-07 | 1934-07-04 | 66  | Chemist       |
| 7 | Johann Gauss         | 1777-04-30 | 1855-02-23 | 77  | Mathematician |
+---+----------------------+------------+------------+-----+---------------+

Process finished with exit code 0

- Extract the rows whose Occupation column equals Chemist

print_df(df[df['Occupation'] == 'Chemist'])

+---+-------------------+------------+------------+-----+------------+
|   | Name              | Born       | Died       | Age | Occupation |
+---+-------------------+------------+------------+-----+------------+
| 0 | Rosaline Franklin | 1920-07-25 | 1958-04-16 | 37  | Chemist    |
| 3 | Marie Curie       | 1867-11-07 | 1934-07-04 | 66  | Chemist    |
+---+-------------------+------------+------------+-----+------------+

Process finished with exit code 0
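The two filters shown so far (age above the mean, occupation equal to Chemist) can be combined in one mask. A minimal sketch follows, assuming a reduced copy of the data named `df_demo`. Each condition needs its own parentheses because `&` binds more tightly than the comparison operators.

```python
import pandas as pd

df_demo = pd.DataFrame({
    "Name": ["Rosaline Franklin", "William Gosset", "Florence Nightingale",
             "Marie Curie", "Rachel Carson", "John Snow", "Alan Turing",
             "Johann Gauss"],
    "Age": [37, 61, 90, 66, 56, 45, 41, 77],
    "Occupation": ["Chemist", "Statistician", "Nurse", "Chemist", "Biologist",
                   "Physician", "Computer Scientist", "Mathematician"],
})

# Both conditions must hold: older than the mean age AND a chemist.
older_chemists = df_demo[(df_demo["Age"] > df_demo["Age"].mean())
                         & (df_demo["Occupation"] == "Chemist")]
print(older_chemists)

# isin() matches any value from a list of occupations.
selected = df_demo[df_demo["Occupation"].isin(["Chemist", "Nurse"])]
print(selected)
```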

- Check the data type of each column

print(df.dtypes)

Name          object
Born          object
Died          object
Age            int64
Occupation    object
dtype: object

Process finished with exit code 0

- Convert the Born column to a datetime type to build a new column

born_date = pd.to_datetime(df['Born'],
                           format='%Y-%m-%d')
print(born_date)

0   1920-07-25
1   1876-06-13
2   1820-05-12
3   1867-11-07
4   1907-05-27
5   1813-03-15
6   1912-06-23
7   1777-04-30
Name: Born, dtype: datetime64[ns]

Process finished with exit code 0

- Convert the Died column to a datetime type to build a new column

dide_date = pd.to_datetime(df['Died'],
                           format='%Y-%m-%d')
print(dide_date)

0   1958-04-16
1   1937-10-16
2   1910-08-13
3   1934-07-04
4   1964-04-14
5   1858-06-16
6   1954-06-07
7   1855-02-23
Name: Died, dtype: datetime64[ns]

Process finished with exit code 0
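As an alternative to converting after loading, `read_csv` can parse date columns while reading via its `parse_dates` parameter. The sketch below uses a two-row in-memory CSV with values copied from this post; `csv_text` and `df_dates` are illustrative names, and this is a different route from the `to_datetime` calls above rather than a replacement for them.

```python
import io

import pandas as pd

csv_text = (
    "Name,Born,Died,Age,Occupation\n"
    "Marie Curie,1867-11-07,1934-07-04,66,Chemist\n"
    "John Snow,1813-03-15,1858-06-16,45,Physician\n"
)

# Born and Died are parsed as datetimes at load time instead of staying as object.
df_dates = pd.read_csv(io.StringIO(csv_text), parse_dates=["Born", "Died"])
print(df_dates.dtypes)
```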

- Add the new columns to the DataFrame

df['Born_date'] = born_date
df['Dide_date'] = dide_date
print_df(df)

+---+----------------------+------------+------------+-----+--------------------+---------------------+---------------------+
|   | Name                 | Born       | Died       | Age | Occupation         | Born_date           | Dide_date           |
+---+----------------------+------------+------------+-----+--------------------+---------------------+---------------------+
| 0 | Rosaline Franklin    | 1920-07-25 | 1958-04-16 | 37  | Chemist            | 1920-07-25 00:00:00 | 1958-04-16 00:00:00 |
| 1 | William Gosset       | 1876-06-13 | 1937-10-16 | 61  | Statistician       | 1876-06-13 00:00:00 | 1937-10-16 00:00:00 |
| 2 | Florence Nightingale | 1820-05-12 | 1910-08-13 | 90  | Nurse              | 1820-05-12 00:00:00 | 1910-08-13 00:00:00 |
| 3 | Marie Curie          | 1867-11-07 | 1934-07-04 | 66  | Chemist            | 1867-11-07 00:00:00 | 1934-07-04 00:00:00 |
| 4 | Rachel Carson        | 1907-05-27 | 1964-04-14 | 56  | Biologist          | 1907-05-27 00:00:00 | 1964-04-14 00:00:00 |
| 5 | John Snow            | 1813-03-15 | 1858-06-16 | 45  | Physician          | 1813-03-15 00:00:00 | 1858-06-16 00:00:00 |
| 6 | Alan Turing          | 1912-06-23 | 1954-06-07 | 41  | Computer Scientist | 1912-06-23 00:00:00 | 1954-06-07 00:00:00 |
| 7 | Johann Gauss         | 1777-04-30 | 1855-02-23 | 77  | Mathematician      | 1777-04-30 00:00:00 | 1855-02-23 00:00:00 |
+---+----------------------+------------+------------+-----+--------------------+---------------------+---------------------+

Process finished with exit code 0

- Drop the object-typed Born / Died columns (modifies the original DataFrame in place)

df.drop(['Born', 'Died'], axis=1, inplace=True)
print_df(df)

+---+----------------------+-----+--------------------+---------------------+---------------------+
|   | Name                 | Age | Occupation         | Born_date           | Dide_date           |
+---+----------------------+-----+--------------------+---------------------+---------------------+
| 0 | Rosaline Franklin    | 37  | Chemist            | 1920-07-25 00:00:00 | 1958-04-16 00:00:00 |
| 1 | William Gosset       | 61  | Statistician       | 1876-06-13 00:00:00 | 1937-10-16 00:00:00 |
| 2 | Florence Nightingale | 90  | Nurse              | 1820-05-12 00:00:00 | 1910-08-13 00:00:00 |
| 3 | Marie Curie          | 66  | Chemist            | 1867-11-07 00:00:00 | 1934-07-04 00:00:00 |
| 4 | Rachel Carson        | 56  | Biologist          | 1907-05-27 00:00:00 | 1964-04-14 00:00:00 |
| 5 | John Snow            | 45  | Physician          | 1813-03-15 00:00:00 | 1858-06-16 00:00:00 |
| 6 | Alan Turing          | 41  | Computer Scientist | 1912-06-23 00:00:00 | 1954-06-07 00:00:00 |
| 7 | Johann Gauss         | 77  | Mathematician      | 1777-04-30 00:00:00 | 1855-02-23 00:00:00 |
+---+----------------------+-----+--------------------+---------------------+---------------------+

Process finished with exit code 0

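- Create a column with the number of days lived

# Born / Died may already have been dropped in place above; errors='ignore' skips missing labels.
dropped_df = df.drop(['Born', 'Died'], axis=1, errors='ignore')

dropped_df['Days'] = dropped_df['Dide_date'] - dropped_df['Born_date']
print_df(dropped_df)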

+---+----------------------+-----+--------------------+---------------------+---------------------+---------------------+
|   | Name                 | Age | Occupation         | Born_date           | Dide_date           | Days                |
+---+----------------------+-----+--------------------+---------------------+---------------------+---------------------+
| 0 | Rosaline Franklin    | 37  | Chemist            | 1920-07-25 00:00:00 | 1958-04-16 00:00:00 | 13779 days 00:00:00 |
| 1 | William Gosset       | 61  | Statistician       | 1876-06-13 00:00:00 | 1937-10-16 00:00:00 | 22404 days 00:00:00 |
| 2 | Florence Nightingale | 90  | Nurse              | 1820-05-12 00:00:00 | 1910-08-13 00:00:00 | 32964 days 00:00:00 |
| 3 | Marie Curie          | 66  | Chemist            | 1867-11-07 00:00:00 | 1934-07-04 00:00:00 | 24345 days 00:00:00 |
| 4 | Rachel Carson        | 56  | Biologist          | 1907-05-27 00:00:00 | 1964-04-14 00:00:00 | 20777 days 00:00:00 |
| 5 | John Snow            | 45  | Physician          | 1813-03-15 00:00:00 | 1858-06-16 00:00:00 | 16529 days 00:00:00 |
| 6 | Alan Turing          | 41  | Computer Scientist | 1912-06-23 00:00:00 | 1954-06-07 00:00:00 | 15324 days 00:00:00 |
| 7 | Johann Gauss         | 77  | Mathematician      | 1777-04-30 00:00:00 | 1855-02-23 00:00:00 | 28422 days 00:00:00 |
+---+----------------------+-----+--------------------+---------------------+---------------------+---------------------+

Process finished with exit code 0
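Subtracting two datetime columns yields a timedelta column, which is why each value prints as "N days 00:00:00". If a plain integer is easier to work with, the `.dt.days` accessor extracts it. The sketch below uses two rows from the table above; `df_demo`, `Days_int` and `Years_approx` are illustrative names, and dividing by 365.25 is only a rough approximation of age in years.

```python
import pandas as pd

df_demo = pd.DataFrame({
    "Name": ["Rosaline Franklin", "Alan Turing"],
    "Born_date": pd.to_datetime(["1920-07-25", "1912-06-23"]),
    "Dide_date": pd.to_datetime(["1958-04-16", "1954-06-07"]),
})

# Timedelta Series: died minus born.
lifespan = df_demo["Dide_date"] - df_demo["Born_date"]

df_demo["Days_int"] = lifespan.dt.days                                # whole days as integers
df_demo["Years_approx"] = (df_demo["Days_int"] / 365.25).astype(int)  # rough age in years
print(df_demo[["Name", "Days_int", "Years_approx"]])
```

For these two rows the approximate years agree with the Age column in the source data (37 and 41).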
