1. Sample Data
Records of visits to the usa.gov site, stored in JSON format
JSON (JavaScript Object Notation): a text format based on JavaScript object syntax
A JSON object maps closely onto Python's dict type (json.loads returns a dict)
{key1 : value1, key2 : value2 ...}
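To make the JSON-to-dict correspondence concrete, here is a minimal sketch using only the standard library. The sample line is a made-up record: the field names mirror columns in the dataset, but the values are illustrative assumptions.

```python
import json

# A hypothetical single-line JSON record; the values are illustrative only.
line = '{"tz": "America/New_York", "c": "US", "nk": 1, "kw": null}'

record = json.loads(line)   # parse JSON text into Python objects

print(type(record))         # a JSON object becomes a dict
print(record["tz"])         # values are accessed by key, as with any dict
print(record["kw"])         # JSON null becomes Python None
```

The two are close but not identical: JSON keys are always strings, and JSON null, true, and false become None, True, and False after parsing.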
2. import Module
import json
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
from print_df import print_df
3. Pandas Code
- Set the data file path
path = 'data/example.txt'
- Read the data file (one JSON record per line)
with open(path, encoding='utf-8') as f:
    records = [json.loads(line) for line in f]
- Build a DataFrame from the records read from the file
df = pd.DataFrame(records)
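Not every record has every key, which is why the columns end up with different non-null counts. The sketch below shows the behavior on three made-up records (hypothetical data, not taken from the file): a missing key becomes NaN, while an empty string stays an empty string.

```python
import pandas as pd

# Hypothetical records: the second one has no 'tz' key at all,
# the third one has 'tz' set to an empty string.
records = [
    {"tz": "America/New_York", "c": "US"},
    {"c": "KR"},
    {"tz": "", "c": "US"},
]

df = pd.DataFrame(records)

nan_count = df["tz"].isna().sum()      # rows where 'tz' is missing (NaN)
empty_count = (df["tz"] == "").sum()   # rows where 'tz' is an empty string

print(df)
print(nan_count, empty_count)
```

The distinction matters later on, because NaN and the empty string are cleaned up in two separate steps.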
- Check the DataFrame info (info() prints the summary itself and returns None, so no print() is needed)
df.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 3560 entries, 0 to 3559
Data columns (total 18 columns):
_heartbeat_ 120 non-null float64
a 3440 non-null object
al 3094 non-null object
c 2919 non-null object
cy 2919 non-null object
g 3440 non-null object
gr 2919 non-null object
h 3440 non-null object
hc 3440 non-null float64
hh 3440 non-null object
kw 93 non-null object
l 3440 non-null object
ll 2919 non-null object
nk 3440 non-null float64
r 3440 non-null object
t 3440 non-null float64
tz 3440 non-null object
u 3440 non-null object
dtypes: float64(4), object(14)
memory usage: 500.7+ KB
- Print a sample value from the DataFrame (the User-Agent string)
print(df['a'][0])
Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/535.11 (KHTML, like Gecko) Chrome/17.0.963.78 Safari/535.11
- Inspect the tz (time zone) column
print(df['tz'][1])
print(df['tz'][:5])
print(df['tz'].value_counts())
America/Denver
0 America/New_York
1 America/Denver
2 America/New_York
3 America/Sao_Paulo
4 America/New_York
Name: tz, dtype: object
America/New_York 1251
521
America/Chicago 400
America/Los_Angeles 382
America/Denver 191
Europe/London 74
Asia/Tokyo 37
Pacific/Honolulu 36
Europe/Madrid 35
America/Sao_Paulo 33
Europe/Berlin 28
Europe/Rome 27
America/Rainy_River 25
Europe/Amsterdam 22
America/Phoenix 20
America/Indianapolis 20
Europe/Warsaw 16
America/Mexico_City 15
Europe/Stockholm 14
Europe/Paris 14
America/Vancouver 12
Pacific/Auckland 11
Europe/Oslo 10
Europe/Helsinki 10
Asia/Hong_Kong 10
Europe/Prague 10
Europe/Moscow 10
America/Puerto_Rico 10
Asia/Calcutta 9
Asia/Istanbul 9
...
Europe/Riga 2
Asia/Novosibirsk 1
Asia/Nicosia 1
America/Mazatlan 1
America/Argentina/Cordoba 1
America/La_Paz 1
Asia/Riyadh 1
America/Lima 1
America/Caracas 1
America/Monterrey 1
Australia/Queensland 1
Europe/Sofia 1
America/Costa_Rica 1
America/St_Kitts 1
Asia/Pontianak 1
Europe/Skopje 1
America/Santo_Domingo 1
Asia/Kuching 1
Europe/Volgograd 1
America/Montevideo 1
Europe/Ljubljana 1
Europe/Uzhgorod 1
Africa/Lusaka 1
America/Argentina/Mendoza 1
Asia/Manila 1
America/Argentina/Buenos_Aires 1
Africa/Casablanca 1
America/Tegucigalpa 1
Asia/Yekaterinburg 1
Africa/Johannesburg 1
Name: tz, Length: 97, dtype: int64
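In the counts above, the entry with a blank label (521) is the empty string. The rows where tz is NaN (3560 rows minus 3440 non-null values, so 120 rows) do not appear, because value_counts() skips NaN by default. A minimal sketch on made-up data, assuming nothing beyond pandas and NumPy:

```python
import numpy as np
import pandas as pd

# Hypothetical series with one NaN and two empty strings.
tz = pd.Series(["America/New_York", "", np.nan, "America/New_York", ""])

counts = tz.value_counts()                      # NaN is excluded by default
counts_with_nan = tz.value_counts(dropna=False) # NaN gets its own entry

print(counts)
print(counts_with_nan)
```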
- Store the value counts in a variable
tz_counts = df['tz'].value_counts()
- Drop the empty-string entry from the counts
df_drop = tz_counts.drop([''])
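Series.drop removes entries by index label, so passing [''] removes the row whose label is the empty string. It raises a KeyError if the label is not present. The sketch below uses made-up counts and shows the errors='ignore' option as one way to make the step safe to re-run:

```python
import pandas as pd

# Hypothetical counts indexed by time zone name; '' is the empty-string label.
counts = pd.Series({"America/New_York": 3, "": 2, "Asia/Tokyo": 1})

dropped = counts.drop([""])                  # remove the '' label
safe = dropped.drop([""], errors="ignore")   # label already gone: no error

print(dropped)
```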
- Replace NaN with 'Missing'
clean_tz = df['tz'].fillna('Missing')
0 America/New_York
1 America/Denver
2 America/New_York
3 America/Sao_Paulo
4 America/New_York
5 America/New_York
6 Europe/Warsaw
7
8
9
10 America/Los_Angeles
11 America/New_York
12 America/New_York
13 Missing
14 America/New_York
15 Asia/Hong_Kong
16 Asia/Hong_Kong
17 America/New_York
18 America/Denver
19 Europe/Rome
20 Africa/Ceuta
21 America/New_York
22 America/New_York
23 America/New_York
24 Europe/Madrid
25 Asia/Kuala_Lumpur
26 Asia/Nicosia
27 America/Sao_Paulo
28
29
...
3530 America/Los_Angeles
3531
3532 America/New_York
3533 America/New_York
3534 America/Chicago
3535 America/Chicago
3536
3537 America/Tegucigalpa
3538 America/Los_Angeles
3539 America/Los_Angeles
3540 America/Denver
3541 America/Los_Angeles
3542 America/Los_Angeles
3543 Missing
3544 America/Chicago
3545 America/Chicago
3546 America/Los_Angeles
3547 America/New_York
3548 America/Chicago
3549 Europe/Stockholm
3550 America/New_York
3551
3552 America/Chicago
3553 America/New_York
3554 America/New_York
3555 America/New_York
3556 America/Chicago
3557 America/Denver
3558 America/Los_Angeles
3559 America/New_York
Name: tz, Length: 3560, dtype: object
- Replace empty strings with 'Unknown'
clean_tz[clean_tz == ''] = 'Unknown'
0 America/New_York
1 America/Denver
2 America/New_York
3 America/Sao_Paulo
4 America/New_York
5 America/New_York
6 Europe/Warsaw
7 Unknown
8 Unknown
9 Unknown
10 America/Los_Angeles
11 America/New_York
12 America/New_York
13 Missing
14 America/New_York
15 Asia/Hong_Kong
16 Asia/Hong_Kong
17 America/New_York
18 America/Denver
19 Europe/Rome
20 Africa/Ceuta
21 America/New_York
22 America/New_York
23 America/New_York
24 Europe/Madrid
25 Asia/Kuala_Lumpur
26 Asia/Nicosia
27 America/Sao_Paulo
28 Unknown
29 Unknown
...
3530 America/Los_Angeles
3531 Unknown
3532 America/New_York
3533 America/New_York
3534 America/Chicago
3535 America/Chicago
3536 Unknown
3537 America/Tegucigalpa
3538 America/Los_Angeles
3539 America/Los_Angeles
3540 America/Denver
3541 America/Los_Angeles
3542 America/Los_Angeles
3543 Missing
3544 America/Chicago
3545 America/Chicago
3546 America/Los_Angeles
3547 America/New_York
3548 America/Chicago
3549 Europe/Stockholm
3550 America/New_York
3551 Unknown
3552 America/Chicago
3553 America/New_York
3554 America/New_York
3555 America/New_York
3556 America/Chicago
3557 America/Denver
3558 America/Los_Angeles
3559 America/New_York
Name: tz, Length: 3560, dtype: object
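The two cleaning steps can be tried end to end on a small made-up series. The second variant, which chains fillna with replace, is an alternative shown for comparison and is not the approach used in this post:

```python
import numpy as np
import pandas as pd

# Hypothetical series with one NaN and two empty strings.
tz = pd.Series(["America/New_York", "", np.nan, "Asia/Tokyo", ""])

clean = tz.fillna("Missing")       # step 1: NaN -> 'Missing'
clean[clean == ""] = "Unknown"     # step 2: '' -> 'Unknown' via a boolean mask

# Alternative one-liner giving the same result.
clean_alt = tz.fillna("Missing").replace("", "Unknown")

print(clean.value_counts())
```

Counting the cleaned series keeps the 'Missing' and 'Unknown' groups visible instead of dropping them.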
- Extract the top 10 entries from the counts (empty-string entry already dropped)
subset = df_drop[:10]
America/New_York 1251
America/Chicago 400
America/Los_Angeles 382
America/Denver 191
Europe/London 74
Asia/Tokyo 37
Pacific/Honolulu 36
Europe/Madrid 35
America/Sao_Paulo 33
Europe/Berlin 28
Name: tz, dtype: int64
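Slicing the first ten rows gives the top ten only because value_counts() returns its result sorted in descending order. When the order of a series is not guaranteed, sorting first or using nlargest is safer. A small sketch with made-up, deliberately unsorted counts:

```python
import pandas as pd

# Hypothetical counts, deliberately not sorted.
counts = pd.Series({"a": 5, "b": 3, "c": 9, "d": 1})

top2_sorted = counts.sort_values(ascending=False).iloc[:2]  # sort, then slice
top2_nlargest = counts.nlargest(2)                          # same result in one call

print(top2_sorted)
```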