[Crawling] IMDb (Internet Movie Database) Tutorial - 1
https://developer-ankiwoong.tistory.com/manage/newpost/843
- Import modules
import pandas as pd
- DataFrame structure
| | movie | year | timeMin | imdb | metascore | votes | us_grossMillions |
|---|---|---|---|---|---|---|---|
| 1 | | | | | | | |
| 2 | | | | | | | |
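- Create the DataFrame
# titles, years, time, imdb_ratings, metascores, votes and us_gross are the
# lists filled in during the scraping step; all must have the same length.
df = pd.DataFrame({
    'movie': titles,
    'year': years,
    'timeMin': time,
    'imdb': imdb_ratings,
    'metascore': metascores,
    'votes': votes,
    'us_grossMillions': us_gross,
})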
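To try the remaining steps without running the scraper, the DataFrame can be built from a few hand-typed rows. This is only a minimal sketch: the values are copied from the first four rows of the result printed in this post, and the raw types (strings for everything except the IMDb rating, `float('nan')` for a missing gross) are an assumption based on the `object` dtypes reported further down.

```python
import pandas as pd

# Hypothetical stand-ins for the scraped lists (values taken from the
# printed result; the raw string formats are assumed, not confirmed).
titles = ['Knives Out', '1917', 'Gisaengchung', 'Uncut Gems']
years = ['(2019)', '(2019)', '(2019)', '(2019)']
time = ['131 min', '119 min', '132 min', '135 min']
imdb_ratings = [8.0, 8.4, 8.6, 7.6]
metascores = ['82', '78', '96', '90']
votes = ['243378', '249378', '333107', '134433']
us_gross = ['$165.36M', '$159.18M', '$53.37M', float('nan')]

# Every list must have the same length, otherwise pandas raises ValueError.
df = pd.DataFrame({
    'movie': titles,
    'year': years,
    'timeMin': time,
    'imdb': imdb_ratings,
    'metascore': metascores,
    'votes': votes,
    'us_grossMillions': us_gross,
})

print(df.shape)
print(df.dtypes)
```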
- Print the DataFrame
print(df)
> Result
movie year timeMin imdb metascore votes us_grossMillions
0 Knives Out (2019) 131 min 8.0 82 243378 $165.36M
1 1917 (2019) 119 min 8.4 78 249378 $159.18M
2 Gisaengchung (2019) 132 min 8.6 96 333107 $53.37M
3 Uncut Gems (2019) 135 min 7.6 90 134433 NaN
4 Jojo Rabbit (2019) 108 min 8.0 58 175187 $0.35M
5 Once Upon a Time... in Hollywood (2019) 161 min 7.7 83 418425 $142.50M
6 Joker (2019) 122 min 8.5 59 728676 $335.45M
7 Ford v Ferrari (2019) 152 min 8.1 81 180106 $117.62M
8 Little Women (2019) 135 min 8.0 91 80965 $108.05M
9 The Shawshank Redemption (1994) 142 min 9.3 80 2206090 $28.34M
10 The Irishman (2019) 209 min 7.9 94 264221 NaN
11 Avengers: Endgame (2019) 181 min 8.4 78 684592 $858.37M
12 The Gentlemen (2019) 113 min 8.0 51 65504 NaN
13 Toy Story 4 (2019) 100 min 7.8 84 167476 $434.04M
14 28 Days Later... (2002) 113 min 7.6 73 359873 $45.06M
15 Daeboo (1972) 175 min 9.2 100 1519634 $134.97M
16 The Lighthouse (I) (2019) 109 min 7.7 83 86850 $0.43M
17 Blade Runner 2049 (2017) 164 min 8.0 81 419421 $92.05M
18 The Dark Knight (2008) 152 min 9.0 84 2186071 $534.86M
19 Harry Potter and the Sorcerer's Stone (2001) 152 min 7.6 64 598595 $317.58M
20 Inception (2010) 148 min 8.8 74 1935107 $292.58M
21 Guardians of the Galaxy (2014) 121 min 8.0 76 1001925 $333.18M
22 Marriage Story (2019) 137 min 8.0 93 194103 NaN
23 Goodfellas (1990) 146 min 8.7 90 960538 $46.84M
24 Interstellar (2014) 169 min 8.6 74 1387921 $188.02M
25 The Shining (1980) 146 min 8.4 66 840607 $44.02M
26 Twelve Monkeys (1995) 129 min 8.0 74 556357 $57.14M
27 Fight Club (1999) 139 min 8.8 66 1761132 $37.03M
28 The Lord of the Rings: The Fellowship of the Ring (2001) 178 min 8.8 92 1578877 $315.54M
29 Pulp Fiction (1994) 154 min 8.9 94 1732917 $107.93M
30 The Wolf of Wall Street (2013) 180 min 8.2 75 1098527 $116.90M
31 Thor: Ragnarok (2017) 130 min 7.9 74 538627 $315.06M
32 Portrait de la jeune fille en feu (2019) 122 min 8.2 95 24959 NaN
33 Sen to Chihiro no kamikakushi (2001) 125 min 8.6 96 596931 $10.06M
34 Inglourious Basterds (2009) 153 min 8.3 69 1191303 $120.54M
35 Green Book (2018) 130 min 8.2 69 302240 $85.08M
36 Titanic (1997) 194 min 7.8 75 998230 $659.33M
37 Forrest Gump (1994) 142 min 8.8 82 1702661 $330.25M
38 Rapunjel (2010) 100 min 7.7 71 385422 $200.82M
39 The Matrix (1999) 136 min 8.7 73 1587845 $171.48M
40 Mad Max: Fury Road (2015) 120 min 8.1 90 826098 $154.06M
41 Call Me by Your Name (2017) 132 min 7.9 93 182019 $18.10M
42 There Will Be Blood (2007) 158 min 8.2 93 486507 $40.22M
43 Mission: Impossible - Fallout (2018) 147 min 7.7 86 265865 $220.16M
44 Avengers: Infinity War (2018) 149 min 8.5 68 755809 $678.82M
45 Gone Girl (2014) 149 min 8.1 79 804447 $167.77M
46 Shutter Island (2010) 138 min 8.1 63 1060520 $128.01M
47 The Lion King (1994) 88 min 8.5 88 892216 $422.78M
48 Prisoners (2013) 153 min 8.1 71 558987 $61.00M
49 Gladiator (2000) 155 min 8.5 67 1273173 $187.71M
- Check the data types
print(df.dtypes)
> Result
movie object
year object
timeMin object
imdb float64
metascore object
votes object
us_grossMillions object
dtype: object
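- Data preprocessing, Part 1
* Remove the special characters and convert the data to integers
# Raw string avoids the invalid-escape warning for '\d'; expand=False returns a Series.
df['year'] = df['year'].str.extract(r'(\d+)', expand=False).astype(int)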
print(df['year'])
> Result
0 2019
1 2019
2 2019
3 2019
4 2019
5 2019
6 2019
7 2019
8 2019
9 1994
10 2019
11 2019
12 2019
13 2019
14 2002
15 1972
16 2019
17 2017
18 2008
19 2001
20 2010
21 2014
22 2019
23 1990
24 2014
25 1980
26 1995
27 1999
28 2001
29 1994
30 2013
31 2017
32 2019
33 2001
34 2009
35 2018
36 1997
37 1994
38 2010
39 1999
40 2015
41 2017
42 2007
43 2018
44 2018
45 2014
46 2010
47 1994
48 2013
49 2000
Name: year, dtype: int32
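The extraction step can be checked in isolation on a handful of strings. The sketch below uses made-up sample values in the shapes seen in the printed DataFrame; the third value assumes that the "(I)" marker shown for The Lighthouse is part of the scraped year string, which the post does not confirm.

```python
import pandas as pd

# Sample year strings (hypothetical; shaped like the scraped values).
raw_years = pd.Series(['(2019)', '(1994)', '(I) (2019)'])

# r'(\d+)' captures the first run of digits in each string.
# expand=False returns a Series rather than a one-column DataFrame.
clean_years = raw_years.str.extract(r'(\d+)', expand=False).astype(int)

print(clean_years.tolist())  # [2019, 1994, 2019]
```

Because "(I)" contains no digits, the first match is still the four-digit year. A string with no digits at all would produce NaN, and `astype(int)` would then raise an error.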
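- Data preprocessing, Part 2
* Remove the special characters and convert the data to integers
# Keeps only the leading number of minutes, e.g. '131 min' becomes 131.
df['timeMin'] = df['timeMin'].str.extract(r'(\d+)', expand=False).astype(int)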
print(df['timeMin'])
> Result
0 131
1 119
2 132
3 135
4 108
5 161
6 122
7 152
8 135
9 142
10 209
11 181
12 113
13 100
14 113
15 175
16 109
17 164
18 152
19 152
20 148
21 121
22 137
23 146
24 169
25 146
26 129
27 139
28 178
29 154
30 180
31 130
32 122
33 125
34 153
35 130
36 194
37 142
38 100
39 136
40 120
41 132
42 158
43 147
44 149
45 149
46 138
47 88
48 153
49 155
Name: timeMin, dtype: int32
- Data preprocessing, Part 3
* Convert to integers
df['metascore'] = df['metascore'].astype(int)
print(df['metascore'])
> Result
0 82
1 78
2 96
3 90
4 58
5 83
6 59
7 81
8 91
9 80
10 94
11 78
12 51
13 84
14 73
15 100
16 83
17 81
18 84
19 64
20 74
21 76
22 93
23 90
24 74
25 66
26 74
27 66
28 92
29 94
30 75
31 74
32 95
33 96
34 69
35 69
36 75
37 82
38 71
39 73
40 90
41 93
42 93
43 86
44 68
45 79
46 63
47 88
48 71
49 67
Name: metascore, dtype: int32
- Data preprocessing, Part 4
* Convert to integers
df['votes'] = df['votes'].astype(int)
print(df['votes'])
> Result
0 243512
1 249538
2 333231
3 134509
4 175267
5 418537
6 728746
7 180106
8 81002
9 2206133
10 264259
11 684629
12 65504
13 167476
14 359887
15 1519649
16 86869
17 419434
18 2186096
19 598607
20 1935132
21 1001936
22 194132
23 960557
24 1387945
25 840616
26 556364
27 1761157
28 1578892
29 1732938
30 1098541
31 538634
32 25008
33 596948
34 1191329
35 302256
36 998243
37 1702661
38 385435
39 1587862
40 826109
41 182023
42 486522
43 265874
44 755822
45 804465
46 1060531
47 892234
48 558996
49 1273190
Name: votes, dtype: int32
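`astype(int)` works here because every value is a clean string of digits. If a scrape returned vote counts with thousands separators or a missing entry (an assumption; the output above shows neither), the direct cast would raise an error. One possible way to handle that case is to strip the commas and use pandas' nullable `Int64` dtype, sketched here on hypothetical values:

```python
import pandas as pd

# Hypothetical raw values: thousands separators and one missing entry.
raw_votes = pd.Series(['243,512', '2,206,133', None])

# Remove the separators as plain text (regex=False), then convert.
no_commas = raw_votes.str.replace(',', '', regex=False)

# errors='coerce' turns anything unparseable into NaN;
# 'Int64' (capital I) is the nullable integer dtype that can hold <NA>.
clean_votes = pd.to_numeric(no_commas, errors='coerce').astype('Int64')

print(clean_votes)
```

The same approach would apply to the metascore column if some films had no score.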
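- Data preprocessing, Part 5
* Convert to strings and strip the special characters
# Cast to str first so that missing values become the text 'nan'.
df['us_grossMillions'] = df['us_grossMillions'].astype(str)
# Drop the leading '$' and the trailing 'M', e.g. '$165.36M' becomes '165.36'.
df['us_grossMillions'] = df['us_grossMillions'].map(
    lambda x: x.lstrip('$').rstrip('M'))
# Convert to float; anything that cannot be parsed becomes NaN.
df['us_grossMillions'] = pd.to_numeric(
    df['us_grossMillions'], errors='coerce')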
print(df['us_grossMillions'])
> Result
0 165.36
1 159.18
2 53.37
3 NaN
4 0.35
5 142.50
6 335.45
7 117.62
8 108.05
9 28.34
10 NaN
11 858.37
12 NaN
13 434.04
14 45.06
15 134.97
16 0.43
17 92.05
18 534.86
19 317.58
20 292.58
21 333.18
22 NaN
23 46.84
24 188.02
25 44.02
26 57.14
27 37.03
28 315.54
29 107.93
30 116.90
31 315.06
32 NaN
33 10.06
34 120.54
35 85.08
36 659.33
37 330.25
38 200.82
39 171.48
40 154.06
41 18.10
42 40.22
43 220.16
44 678.82
45 167.77
46 128.01
47 422.78
48 61.00
49 187.71
Name: us_grossMillions, dtype: float64
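The same cleanup can also be written with the vectorized `.str` accessor, which skips missing values on its own, so the `astype(str)` and `map` steps are not needed. This is an alternative sketch on sample values, not the method used in this post:

```python
import pandas as pd

# Sample gross strings with one missing value (values from the result above).
raw_gross = pd.Series(['$165.36M', '$0.35M', float('nan')])

# Inside a character class, '$' is a literal dollar sign,
# so '[$M]' matches either '$' or 'M'.
stripped = raw_gross.str.replace(r'[$M]', '', regex=True)

# Missing values stay NaN through the conversion.
clean_gross = pd.to_numeric(stripped, errors='coerce')

print(clean_gross.tolist())  # [165.36, 0.35, nan]
```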
- Save to CSV
df.to_csv('movie.csv')
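By default `to_csv` also writes the row index as an unnamed first column, which comes back as `Unnamed: 0` when the file is read again. If that column is not wanted, `index=False` leaves it out. The sketch below shows the round trip on a small hypothetical DataFrame, using an in-memory buffer in place of `movie.csv`:

```python
import io

import pandas as pd

# Small stand-in for the cleaned DataFrame (hypothetical subset of columns).
sample = pd.DataFrame({
    'movie': ['Knives Out', 'Uncut Gems'],
    'year': [2019, 2019],
    'us_grossMillions': [165.36, float('nan')],
})

# Write without the index column; a file path would work the same way.
buffer = io.StringIO()
sample.to_csv(buffer, index=False)

# Rewind and read the CSV text back in.
buffer.seek(0)
restored = pd.read_csv(buffer)

print(restored.columns.tolist())  # ['movie', 'year', 'us_grossMillions']
```

The missing gross value is written as an empty field and is read back as NaN.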