Pandas - Scientists Data Analysis

1. Sample Data

scientists.csv


2. Import Module

import pandas as pd
from print_df import print_df
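
The `print_df` module imported above is a local helper that this post does not show. As a rough sketch only, assuming it simply draws a bordered text table with the index in the first column, something like the following would produce similar output. `format_df` is a hypothetical name introduced here for illustration, and the exact cell alignment of the real helper may differ.

```python
import pandas as pd


def format_df(df: pd.DataFrame) -> str:
    """Render a DataFrame as a bordered text table (hypothetical helper)."""
    # The first column holds the index labels; its header cell is left blank.
    header = [""] + [str(col) for col in df.columns]
    rows = [[str(idx)] + [str(value) for value in row]
            for idx, row in zip(df.index, df.to_numpy().tolist())]
    # Each column is as wide as its longest cell, header included.
    widths = [max(len(cells[i]) for cells in [header] + rows)
              for i in range(len(header))]
    border = "+" + "+".join("-" * (w + 2) for w in widths) + "+"

    def line(cells):
        # Center every cell and pad it with one space on each side.
        return "|" + "|".join(f" {c.center(w)} " for c, w in zip(cells, widths)) + "|"

    return "\n".join([border, line(header), border, *map(line, rows), border])


def print_df(df: pd.DataFrame) -> None:
    """Print the bordered table built by format_df."""
    print(format_df(df))


sample = pd.DataFrame({"Name": ["Marie Curie", "John Snow"], "Age": [66, 45]})
table_text = format_df(sample)
print_df(sample)
```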


3. Data Analysis

- CSV (comma-separated values): a file in which the data values are separated by commas (,).


- Load the CSV file (a comma is the default separator of read_csv, so sep can be omitted)

df = pd.read_csv('data/scientists.csv')
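
To make the note about `sep` concrete, here is a small sketch that reads the same two rows once as comma-separated text and once as tab-separated text. The inline strings are made-up stand-ins for a file on disk, wrapped in `io.StringIO` so the example runs without `scientists.csv`.

```python
import io

import pandas as pd

comma_text = "Name,Age\nMarie Curie,66\nJohn Snow,45\n"
tab_text = comma_text.replace(",", "\t")

# sep defaults to ',' so it can be omitted for a regular CSV.
df_comma = pd.read_csv(io.StringIO(comma_text))

# Any other delimiter has to be passed explicitly.
df_tab = pd.read_csv(io.StringIO(tab_text), sep="\t")

print(df_comma.equals(df_tab))
```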


- Check the number of rows and columns in the data

df = pd.read_csv('data/scientists.csv')
print('shape:', df.shape)

shape: (8, 5)
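
`shape` is a plain `(rows, columns)` tuple, so it can be unpacked directly. A minimal sketch on a small hand-built DataFrame (the three rows are taken from the table in this post):

```python
import pandas as pd

df = pd.DataFrame({
    "Name": ["Rosaline Franklin", "William Gosset", "Florence Nightingale"],
    "Age": [37, 61, 90],
})

# Unpack the tuple into separate row and column counts.
n_rows, n_cols = df.shape
print(n_rows, n_cols)

# len() gives the same numbers one axis at a time.
print(len(df), len(df.columns))
```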




- The dataset is small, so print the whole DataFrame to inspect its layout (8 rows, 5 columns)

df = pd.read_csv('data/scientists.csv')
print_df(df)

+---+----------------------+------------+------------+-----+--------------------+
|   |         Name         |    Born    |    Died    | Age |     Occupation     |
+---+----------------------+------------+------------+-----+--------------------+
| 0 |  Rosaline Franklin   | 1920-07-25 | 1958-04-16 |  37 |      Chemist       |
| 1 |    William Gosset    | 1876-06-13 | 1937-10-16 |  61 |    Statistician    |
| 2 | Florence Nightingale | 1820-05-12 | 1910-08-13 |  90 |       Nurse        |
| 3 |     Marie Curie      | 1867-11-07 | 1934-07-04 |  66 |      Chemist       |
| 4 |    Rachel Carson     | 1907-05-27 | 1964-04-14 |  56 |     Biologist      |
| 5 |      John Snow       | 1813-03-15 | 1858-06-16 |  45 |     Physician      |
| 6 |     Alan Turing      | 1912-06-23 | 1954-06-07 |  41 | Computer Scientist |
| 7 |     Johann Gauss     | 1777-04-30 | 1855-02-23 |  77 |   Mathematician    |
+---+----------------------+------------+------------+-----+--------------------+


- Print the first 3 rows (head)

print_df(df.head(n=3))

+---+----------------------+------------+------------+-----+--------------+
|   |         Name         |    Born    |    Died    | Age |  Occupation  |
+---+----------------------+------------+------------+-----+--------------+
| 0 |  Rosaline Franklin   | 1920-07-25 | 1958-04-16 |  37 |   Chemist    |
| 1 |    William Gosset    | 1876-06-13 | 1937-10-16 |  61 | Statistician |
| 2 | Florence Nightingale | 1820-05-12 | 1910-08-13 |  90 |    Nurse     |
+---+----------------------+------------+------------+-----+--------------+


- Print the last 4 rows (tail)

print_df(df.tail(n=4))

+---+---------------+------------+------------+-----+--------------------+
|   |      Name     |    Born    |    Died    | Age |     Occupation     |
+---+---------------+------------+------------+-----+--------------------+
| 4 | Rachel Carson | 1907-05-27 | 1964-04-14 |  56 |     Biologist      |
| 5 |   John Snow   | 1813-03-15 | 1858-06-16 |  45 |     Physician      |
| 6 |  Alan Turing  | 1912-06-23 | 1954-06-07 |  41 | Computer Scientist |
| 7 |  Johann Gauss | 1777-04-30 | 1855-02-23 |  77 |   Mathematician    |
+---+---------------+------------+------------+-----+--------------------+
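
As a quick illustration of what `head` and `tail` return, the sketch below compares them with positional slicing on the eight ages from this dataset. It also shows that `n` defaults to 5 when omitted.

```python
import pandas as pd

df = pd.DataFrame({"Age": [37, 61, 90, 66, 56, 45, 41, 77]})

first_three = df.head(3)   # same rows as df.iloc[:3]
last_four = df.tail(4)     # same rows as df.iloc[-4:]
default_head = df.head()   # n defaults to 5

print(first_three)
print(last_four)
```

Note that `tail` keeps the original index labels (4 to 7 here), which is why the table above starts at row 4.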


- Extract the Age column as a Series

ages = df['Age']
print(ages)

0    37
1    61
2    90
3    66
4    56
5    45
6    41
7    77
Name: Age, dtype: int64


- Sort Age in ascending order

print(ages.sort_values())

0    37
6    41
5    45
4    56
1    61
3    66
7    77
2    90
Name: Age, dtype: int64
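
`sort_values` also works on a whole DataFrame and can sort in descending order. A small sketch, using four rows from the table above; `reset_index(drop=True)` is optional and only renumbers the rows after sorting.

```python
import pandas as pd

df = pd.DataFrame({
    "Name": ["Rosaline Franklin", "William Gosset",
             "Florence Nightingale", "Marie Curie"],
    "Age": [37, 61, 90, 66],
})

# Sort the rows by Age, oldest first; the original index labels are kept.
oldest_first = df.sort_values("Age", ascending=False)

# Renumber the rows 0..n-1 if the old labels are no longer needed.
renumbered = oldest_first.reset_index(drop=True)

print(oldest_first)
print(renumbered)
```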


- Get the minimum value

ages = df['Age']
print('Minimum : ', ages.min())

Minimum :  37


- Get the maximum value

print('Maximum : ', ages.max())

Maximum :  90



- Get the median

print('Median : ', ages.median())

Median :  58.5



- Get the mean

print('Mean : ', ages.mean())

Mean :  59.125



- Get the standard deviation (pandas returns the sample standard deviation, ddof=1, by default)

print('Standard deviation : ', ages.std())

Standard deviation :  18.325918413937288

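
The standard deviation printed above is the sample standard deviation: `Series.std` divides by n - 1 (`ddof=1`) unless told otherwise. The sketch below rebuilds the Age values by hand and shows the population version alongside it. `describe()` is included as one way to get most of these statistics in a single call.

```python
import pandas as pd

ages = pd.Series([37, 61, 90, 66, 56, 45, 41, 77], name="Age")

sample_std = ages.std()             # ddof=1 by default: divides by n - 1
population_std = ages.std(ddof=0)   # divides by n

print("sample std     :", sample_std)
print("population std :", population_std)

# count, mean, std, min, quartiles and max in a single call
summary = ages.describe()
print(summary)
```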


- Build a Boolean Series: True where the age is greater than the mean age, False otherwise

above_mean = ages > ages.mean()
print(above_mean)

0    False
1     True
2     True
3     True
4    False
5    False
6    False
7     True
Name: Age, dtype: bool


- Use the Boolean Series as a mask to keep only the rows whose age is above the mean

df_above_mean = df[above_mean]
print_df(df_above_mean)

+---+----------------------+------------+------------+-----+---------------+
|   |         Name         |    Born    |    Died    | Age |   Occupation  |
+---+----------------------+------------+------------+-----+---------------+
| 1 |    William Gosset    | 1876-06-13 | 1937-10-16 |  61 |  Statistician |
| 2 | Florence Nightingale | 1820-05-12 | 1910-08-13 |  90 |     Nurse     |
| 3 |     Marie Curie      | 1867-11-07 | 1934-07-04 |  66 |    Chemist    |
| 7 |     Johann Gauss     | 1777-04-30 | 1855-02-23 |  77 | Mathematician |
+---+----------------------+------------+------------+-----+---------------+


- Extract the rows where the Occupation column equals 'Chemist'

print_df(df[df['Occupation'] == 'Chemist'])

+---+-------------------+------------+------------+-----+------------+
|   |        Name       |    Born    |    Died    | Age | Occupation |
+---+-------------------+------------+------------+-----+------------+
| 0 | Rosaline Franklin | 1920-07-25 | 1958-04-16 |  37 |  Chemist   |
| 3 |    Marie Curie    | 1867-11-07 | 1934-07-04 |  66 |  Chemist   |
+---+-------------------+------------+------------+-----+------------+
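
The two filters above can also be combined. As a sketch, the Boolean Series are joined with `&` (and), `|` (or) and `~` (not); Python's `and` / `or` keywords do not work element-wise on a Series. The DataFrame is rebuilt by hand from the table in this post so the example runs on its own.

```python
import pandas as pd

df = pd.DataFrame({
    "Name": ["Rosaline Franklin", "William Gosset", "Florence Nightingale",
             "Marie Curie", "Rachel Carson", "John Snow",
             "Alan Turing", "Johann Gauss"],
    "Age": [37, 61, 90, 66, 56, 45, 41, 77],
    "Occupation": ["Chemist", "Statistician", "Nurse", "Chemist",
                   "Biologist", "Physician", "Computer Scientist",
                   "Mathematician"],
})

above_mean = df["Age"] > df["Age"].mean()
is_chemist = df["Occupation"] == "Chemist"

# Rows that satisfy both conditions at once.
older_chemists = df[above_mean & is_chemist]

# ~ inverts a mask: everyone who is not a chemist.
non_chemists = df[~is_chemist]

# True counts as 1, so sum() gives the number of matching rows.
n_above_mean = int(above_mean.sum())

print(older_chemists)
print(n_above_mean)
```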


- Check the data type of each column

print(df.dtypes)

Name          object
Born          object
Died          object
Age            int64
Occupation    object
dtype: object


- Convert the Born column from strings to a datetime type and keep the result as a new Series

born_date = pd.to_datetime(df['Born'], format='%Y-%m-%d')
print(born_date)

0   1920-07-25
1   1876-06-13
2   1820-05-12
3   1867-11-07
4   1907-05-27
5   1813-03-15
6   1912-06-23
7   1777-04-30
Name: Born, dtype: datetime64[ns]


- Convert the Died column from strings to a datetime type and keep the result as a new Series

dide_date = pd.to_datetime(df['Died'], format='%Y-%m-%d')
print(dide_date)

0   1958-04-16
1   1937-10-16
2   1910-08-13
3   1934-07-04
4   1964-04-14
5   1858-06-16
6   1954-06-07
7   1855-02-23
Name: Died, dtype: datetime64[ns]
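
An alternative to converting after the fact is to let `read_csv` parse the date columns while loading, via its `parse_dates` parameter. This is a sketch on two inline rows rather than the real file. Once a column is a datetime type, the `.dt` accessor exposes parts such as the year.

```python
import io

import pandas as pd

csv_text = (
    "Name,Born,Died\n"
    "Marie Curie,1867-11-07,1934-07-04\n"
    "John Snow,1813-03-15,1858-06-16\n"
)

# parse_dates converts the listed columns to datetime while reading.
df = pd.read_csv(io.StringIO(csv_text), parse_dates=["Born", "Died"])
print(df.dtypes)

# The .dt accessor works only on datetime-like columns.
birth_years = df["Born"].dt.year
print(birth_years)
```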


- Add the new Series to the DataFrame as columns

df['Born_date'] = born_date
df['Dide_date'] = dide_date
print_df(df)

+---+----------------------+------------+------------+-----+--------------------+---------------------+---------------------+
|   |         Name         |    Born    |    Died    | Age |     Occupation     |      Born_date      |      Dide_date      |
+---+----------------------+------------+------------+-----+--------------------+---------------------+---------------------+
| 0 |  Rosaline Franklin   | 1920-07-25 | 1958-04-16 |  37 |      Chemist       | 1920-07-25 00:00:00 | 1958-04-16 00:00:00 |
| 1 |    William Gosset    | 1876-06-13 | 1937-10-16 |  61 |    Statistician    | 1876-06-13 00:00:00 | 1937-10-16 00:00:00 |
| 2 | Florence Nightingale | 1820-05-12 | 1910-08-13 |  90 |       Nurse        | 1820-05-12 00:00:00 | 1910-08-13 00:00:00 |
| 3 |     Marie Curie      | 1867-11-07 | 1934-07-04 |  66 |      Chemist       | 1867-11-07 00:00:00 | 1934-07-04 00:00:00 |
| 4 |    Rachel Carson     | 1907-05-27 | 1964-04-14 |  56 |     Biologist      | 1907-05-27 00:00:00 | 1964-04-14 00:00:00 |
| 5 |      John Snow       | 1813-03-15 | 1858-06-16 |  45 |     Physician      | 1813-03-15 00:00:00 | 1858-06-16 00:00:00 |
| 6 |     Alan Turing      | 1912-06-23 | 1954-06-07 |  41 | Computer Scientist | 1912-06-23 00:00:00 | 1954-06-07 00:00:00 |
| 7 |     Johann Gauss     | 1777-04-30 | 1855-02-23 |  77 |   Mathematician    | 1777-04-30 00:00:00 | 1855-02-23 00:00:00 |
+---+----------------------+------------+------------+-----+--------------------+---------------------+---------------------+


- Drop the object-typed Born / Died columns (inplace=True modifies the original DataFrame)

df.drop(['Born', 'Died'], axis=1, inplace=True)
print_df(df)

+---+----------------------+-----+--------------------+---------------------+---------------------+
|   |         Name         | Age |     Occupation     |      Born_date      |      Dide_date      |
+---+----------------------+-----+--------------------+---------------------+---------------------+
| 0 |  Rosaline Franklin   |  37 |      Chemist       | 1920-07-25 00:00:00 | 1958-04-16 00:00:00 |
| 1 |    William Gosset    |  61 |    Statistician    | 1876-06-13 00:00:00 | 1937-10-16 00:00:00 |
| 2 | Florence Nightingale |  90 |       Nurse        | 1820-05-12 00:00:00 | 1910-08-13 00:00:00 |
| 3 |     Marie Curie      |  66 |      Chemist       | 1867-11-07 00:00:00 | 1934-07-04 00:00:00 |
| 4 |    Rachel Carson     |  56 |     Biologist      | 1907-05-27 00:00:00 | 1964-04-14 00:00:00 |
| 5 |      John Snow       |  45 |     Physician      | 1813-03-15 00:00:00 | 1858-06-16 00:00:00 |
| 6 |     Alan Turing      |  41 | Computer Scientist | 1912-06-23 00:00:00 | 1954-06-07 00:00:00 |
| 7 |     Johann Gauss     |  77 |   Mathematician    | 1777-04-30 00:00:00 | 1855-02-23 00:00:00 |
+---+----------------------+-----+--------------------+---------------------+---------------------+
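
For comparison, here is a sketch of the non-in-place form of `drop`: without `inplace=True` it returns a new DataFrame and leaves the original untouched. `errors="ignore"` is shown as well, since dropping a column that is already gone raises a `KeyError` by default. The tiny DataFrame is illustrative only.

```python
import pandas as pd

df = pd.DataFrame({
    "Name": ["Marie Curie", "John Snow"],
    "Born": ["1867-11-07", "1813-03-15"],
    "Age": [66, 45],
})

# Without inplace=True, drop returns a copy and df keeps all its columns.
trimmed = df.drop(columns=["Born"])

# 'Died' does not exist here; errors="ignore" skips it instead of raising.
trimmed_again = df.drop(columns=["Born", "Died"], errors="ignore")

print(list(df.columns))
print(list(trimmed.columns))
```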


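- Create a column with the number of days each person lived

# Born / Died may already have been dropped in place in the previous step,
# so errors='ignore' avoids a KeyError when they are missing.
dropped_df = df.drop(['Born', 'Died'], axis=1, errors='ignore')

dropped_df['Days'] = dropped_df['Dide_date'] - dropped_df['Born_date']
print_df(dropped_df)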

+---+----------------------+-----+--------------------+---------------------+---------------------+---------------------+
|   |         Name         | Age |     Occupation     |      Born_date      |      Dide_date      |         Days        |
+---+----------------------+-----+--------------------+---------------------+---------------------+---------------------+
| 0 |  Rosaline Franklin   |  37 |      Chemist       | 1920-07-25 00:00:00 | 1958-04-16 00:00:00 | 13779 days 00:00:00 |
| 1 |    William Gosset    |  61 |    Statistician    | 1876-06-13 00:00:00 | 1937-10-16 00:00:00 | 22404 days 00:00:00 |
| 2 | Florence Nightingale |  90 |       Nurse        | 1820-05-12 00:00:00 | 1910-08-13 00:00:00 | 32964 days 00:00:00 |
| 3 |     Marie Curie      |  66 |      Chemist       | 1867-11-07 00:00:00 | 1934-07-04 00:00:00 | 24345 days 00:00:00 |
| 4 |    Rachel Carson     |  56 |     Biologist      | 1907-05-27 00:00:00 | 1964-04-14 00:00:00 | 20777 days 00:00:00 |
| 5 |      John Snow       |  45 |     Physician      | 1813-03-15 00:00:00 | 1858-06-16 00:00:00 | 16529 days 00:00:00 |
| 6 |     Alan Turing      |  41 | Computer Scientist | 1912-06-23 00:00:00 | 1954-06-07 00:00:00 | 15324 days 00:00:00 |
| 7 |     Johann Gauss     |  77 |   Mathematician    | 1777-04-30 00:00:00 | 1855-02-23 00:00:00 | 28422 days 00:00:00 |
+---+----------------------+-----+--------------------+---------------------+---------------------+---------------------+
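
Subtracting two datetime columns gives a timedelta column, which is why the Days values print as "N days 00:00:00". As a sketch, `.dt.days` turns them into plain integers. Dividing by 365.25 is only a rough conversion to years (an assumption made here for illustration), but it is enough to cross-check the Age column for the two rows used below.

```python
import pandas as pd

born = pd.to_datetime(pd.Series(["1867-11-07", "1813-03-15"]), format="%Y-%m-%d")
died = pd.to_datetime(pd.Series(["1934-07-04", "1858-06-16"]), format="%Y-%m-%d")

lifespan = died - born    # timedelta64 Series
days = lifespan.dt.days   # integer number of whole days

# Rough conversion to completed years; 365.25 approximates leap years.
approx_years = (days / 365.25).astype(int)

print(days.tolist())          # [24345, 16529]
print(approx_years.tolist())  # [66, 45]
```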