Pandas - 데이터프레임 병합(merge)
1. 데이터프레임 병합(merge)
Data Table 중에 여러개의 csv로 되어있으나 파일은 서로 연관성이 있는 경우도 있다.
이를 병합하는 과정이 필요하다.
2. SampleData
3. Import Module
import pandas as pd
from print_df import print_df
4. Data Code
- person DataFrame
person = pd.read_csv('data\survey_person.csv')
+---+----------+-----------+----------+
| | ident | personal | family |
+---+----------+-----------+----------+
| 0 | dyer | William | Dyer |
| 1 | pb | Frank | Pabodie |
| 2 | lake | Anderson | Lake |
| 3 | roe | Valentina | Roerich |
| 4 | danforth | Frank | Danforth |
+---+----------+-----------+----------+
- site DataFrame
site = pd.read_csv('data\survey_site.csv')
+---+-------+--------+---------+
| | name | lat | long |
+---+-------+--------+---------+
| 0 | DR-1 | -49.85 | -128.57 |
| 1 | DR-3 | -47.15 | -126.72 |
| 2 | MSK-4 | -48.87 | -123.4 |
+---+-------+--------+---------+
- survey DataFrame
survey = pd.read_csv('data\survey_survey.csv')
+----+-------+--------+-------+---------+
| | taken | person | quant | reading |
+----+-------+--------+-------+---------+
| 0 | 619 | dyer | rad | 9.82 |
| 1 | 619 | dyer | sal | 0.13 |
| 2 | 622 | dyer | rad | 7.8 |
| 3 | 622 | dyer | sal | 0.09 |
| 4 | 734 | pb | rad | 8.41 |
| 5 | 734 | lake | sal | 0.05 |
| 6 | 734 | pb | temp | -21.5 |
| 7 | 735 | pb | rad | 7.22 |
| 8 | 735 | nan | sal | 0.06 |
| 9 | 735 | nan | temp | -26.0 |
| 10 | 751 | pb | rad | 4.35 |
| 11 | 751 | pb | temp | -18.5 |
| 12 | 751 | lake | sal | 0.1 |
| 13 | 752 | lake | rad | 2.19 |
| 14 | 752 | lake | sal | 0.09 |
| 15 | 752 | lake | temp | -16.0 |
| 16 | 752 | roe | sal | 41.6 |
| 17 | 837 | lake | rad | 1.46 |
| 18 | 837 | lake | sal | 0.21 |
| 19 | 837 | roe | sal | 22.5 |
| 20 | 844 | roe | rad | 11.25 |
+----+-------+--------+-------+---------+
- visited DataFrame
visited = pd.read_csv('data\survey_visited.csv')
+---+-------+-------+------------+
| | ident | site | dated |
+---+-------+-------+------------+
| 0 | 619 | DR-1 | 1927-02-08 |
| 1 | 622 | DR-1 | 1927-02-10 |
| 2 | 734 | DR-3 | 1939-01-07 |
| 3 | 735 | DR-3 | 1930-01-12 |
| 4 | 751 | DR-3 | 1930-02-26 |
| 5 | 752 | DR-3 | nan |
| 6 | 837 | MSK-4 | 1932-01-14 |
| 7 | 844 | DR-1 | 1932-03-22 |
+---+-------+-------+------------+
- visited site 기준점 생성(Site - DR-1 / DR-3 / MSK-4)
visited_subset = visited.loc[[0, 2, 6]]
+---+-------+-------+------------+
| | ident | site | dated |
+---+-------+-------+------------+
| 0 | 619 | DR-1 | 1927-02-08 |
| 2 | 734 | DR-3 | 1939-01-07 |
| 6 | 837 | MSK-4 | 1932-01-14 |
+---+-------+-------+------------+
- 부분집합(merge : 공통적인 컬럼을 합침 / site DataFrame에 name 컬럼과 visited subset에 공통적인 컬럼은 site 이다.)
merge1 = site.merge(visited_subset, left_on='name', right_on='site')
+---+-------+--------+---------+-------+-------+------------+
| | name | lat | long | ident | site | dated |
+---+-------+--------+---------+-------+-------+------------+
| 0 | DR-1 | -49.85 | -128.57 | 619 | DR-1 | 1927-02-08 |
| 1 | DR-3 | -47.15 | -126.72 | 734 | DR-3 | 1939-01-07 |
| 2 | MSK-4 | -48.87 | -123.4 | 837 | MSK-4 | 1932-01-14 |
+---+-------+--------+---------+-------+-------+------------+
- 전체집합(site DataFrame와 visted DataFrame에 집합은 site 와 name이다)
merge2 = site.merge(visited, left_on='name', right_on='site')
+---+-------+--------+---------+-------+-------+------------+
| | name | lat | long | ident | site | dated |
+---+-------+--------+---------+-------+-------+------------+
| 0 | DR-1 | -49.85 | -128.57 | 619 | DR-1 | 1927-02-08 |
| 1 | DR-1 | -49.85 | -128.57 | 622 | DR-1 | 1927-02-10 |
| 2 | DR-1 | -49.85 | -128.57 | 844 | DR-1 | 1932-03-22 |
| 3 | DR-3 | -47.15 | -126.72 | 734 | DR-3 | 1939-01-07 |
| 4 | DR-3 | -47.15 | -126.72 | 735 | DR-3 | 1930-01-12 |
| 5 | DR-3 | -47.15 | -126.72 | 751 | DR-3 | 1930-02-26 |
| 6 | DR-3 | -47.15 | -126.72 | 752 | DR-3 | nan |
| 7 | MSK-4 | -48.87 | -123.4 | 837 | MSK-4 | 1932-01-14 |
+---+-------+--------+---------+-------+-------+------------+
- Data 중에 Nan 결측치 값은 비교할 값이 없으므로 제거 된다.
merge3 = person.merge(survey, left_on='ident', right_on='person')
+----+-------+-----------+---------+-------+--------+-------+---------+
| | ident | personal | family | taken | person | quant | reading |
+----+-------+-----------+---------+-------+--------+-------+---------+
| 0 | dyer | William | Dyer | 619 | dyer | rad | 9.82 |
| 1 | dyer | William | Dyer | 619 | dyer | sal | 0.13 |
| 2 | dyer | William | Dyer | 622 | dyer | rad | 7.8 |
| 3 | dyer | William | Dyer | 622 | dyer | sal | 0.09 |
| 4 | pb | Frank | Pabodie | 734 | pb | rad | 8.41 |
| 5 | pb | Frank | Pabodie | 734 | pb | temp | -21.5 |
| 6 | pb | Frank | Pabodie | 735 | pb | rad | 7.22 |
| 7 | pb | Frank | Pabodie | 751 | pb | rad | 4.35 |
| 8 | pb | Frank | Pabodie | 751 | pb | temp | -18.5 |
| 9 | lake | Anderson | Lake | 734 | lake | sal | 0.05 |
| 10 | lake | Anderson | Lake | 751 | lake | sal | 0.1 |
| 11 | lake | Anderson | Lake | 752 | lake | rad | 2.19 |
| 12 | lake | Anderson | Lake | 752 | lake | sal | 0.09 |
| 13 | lake | Anderson | Lake | 752 | lake | temp | -16.0 |
| 14 | lake | Anderson | Lake | 837 | lake | rad | 1.46 |
| 15 | lake | Anderson | Lake | 837 | lake | sal | 0.21 |
| 16 | roe | Valentina | Roerich | 752 | roe | sal | 41.6 |
| 17 | roe | Valentina | Roerich | 837 | roe | sal | 22.5 |
| 18 | roe | Valentina | Roerich | 844 | roe | rad | 11.25 |
+----+-------+-----------+---------+-------+--------+-------+---------+
- Data 중에 Nan 결측치 값은 비교할 값이 없으므로 제거 된다.
merge4 = visited.merge(survey, left_on='ident', right_on='taken')
+----+-------+-------+------------+-------+--------+-------+---------+
| | ident | site | dated | taken | person | quant | reading |
+----+-------+-------+------------+-------+--------+-------+---------+
| 0 | 619 | DR-1 | 1927-02-08 | 619 | dyer | rad | 9.82 |
| 1 | 619 | DR-1 | 1927-02-08 | 619 | dyer | sal | 0.13 |
| 2 | 622 | DR-1 | 1927-02-10 | 622 | dyer | rad | 7.8 |
| 3 | 622 | DR-1 | 1927-02-10 | 622 | dyer | sal | 0.09 |
| 4 | 734 | DR-3 | 1939-01-07 | 734 | pb | rad | 8.41 |
| 5 | 734 | DR-3 | 1939-01-07 | 734 | lake | sal | 0.05 |
| 6 | 734 | DR-3 | 1939-01-07 | 734 | pb | temp | -21.5 |
| 7 | 735 | DR-3 | 1930-01-12 | 735 | pb | rad | 7.22 |
| 8 | 735 | DR-3 | 1930-01-12 | 735 | nan | sal | 0.06 |
| 9 | 735 | DR-3 | 1930-01-12 | 735 | nan | temp | -26.0 |
| 10 | 751 | DR-3 | 1930-02-26 | 751 | pb | rad | 4.35 |
| 11 | 751 | DR-3 | 1930-02-26 | 751 | pb | temp | -18.5 |
| 12 | 751 | DR-3 | 1930-02-26 | 751 | lake | sal | 0.1 |
| 13 | 752 | DR-3 | nan | 752 | lake | rad | 2.19 |
| 14 | 752 | DR-3 | nan | 752 | lake | sal | 0.09 |
| 15 | 752 | DR-3 | nan | 752 | lake | temp | -16.0 |
| 16 | 752 | DR-3 | nan | 752 | roe | sal | 41.6 |
| 17 | 837 | MSK-4 | 1932-01-14 | 837 | lake | rad | 1.46 |
| 18 | 837 | MSK-4 | 1932-01-14 | 837 | lake | sal | 0.21 |
| 19 | 837 | MSK-4 | 1932-01-14 | 837 | roe | sal | 22.5 |
| 20 | 844 | DR-1 | 1932-03-22 | 844 | roe | rad | 11.25 |
+----+-------+-------+------------+-------+--------+-------+---------+
- 최종적으로 Data를 병합
merge_final = merge3.merge(merge4, left_on=['ident', 'taken', 'quant', 'reading'],
right_on=['person', 'ident', 'quant', 'reading'])
+----+---------+-----------+---------+---------+----------+-------+---------+---------+-------+------------+---------+----------+
| | ident_x | personal | family | taken_x | person_x | quant | reading | ident_y | site | dated | taken_y | person_y |
+----+---------+-----------+---------+---------+----------+-------+---------+---------+-------+------------+---------+----------+
| 0 | dyer | William | Dyer | 619 | dyer | rad | 9.82 | 619 | DR-1 | 1927-02-08 | 619 | dyer |
| 1 | dyer | William | Dyer | 619 | dyer | sal | 0.13 | 619 | DR-1 | 1927-02-08 | 619 | dyer |
| 2 | dyer | William | Dyer | 622 | dyer | rad | 7.8 | 622 | DR-1 | 1927-02-10 | 622 | dyer |
| 3 | dyer | William | Dyer | 622 | dyer | sal | 0.09 | 622 | DR-1 | 1927-02-10 | 622 | dyer |
| 4 | pb | Frank | Pabodie | 734 | pb | rad | 8.41 | 734 | DR-3 | 1939-01-07 | 734 | pb |
| 5 | pb | Frank | Pabodie | 734 | pb | temp | -21.5 | 734 | DR-3 | 1939-01-07 | 734 | pb |
| 6 | pb | Frank | Pabodie | 735 | pb | rad | 7.22 | 735 | DR-3 | 1930-01-12 | 735 | pb |
| 7 | pb | Frank | Pabodie | 751 | pb | rad | 4.35 | 751 | DR-3 | 1930-02-26 | 751 | pb |
| 8 | pb | Frank | Pabodie | 751 | pb | temp | -18.5 | 751 | DR-3 | 1930-02-26 | 751 | pb |
| 9 | lake | Anderson | Lake | 734 | lake | sal | 0.05 | 734 | DR-3 | 1939-01-07 | 734 | lake |
| 10 | lake | Anderson | Lake | 751 | lake | sal | 0.1 | 751 | DR-3 | 1930-02-26 | 751 | lake |
| 11 | lake | Anderson | Lake | 752 | lake | rad | 2.19 | 752 | DR-3 | nan | 752 | lake |
| 12 | lake | Anderson | Lake | 752 | lake | sal | 0.09 | 752 | DR-3 | nan | 752 | lake |
| 13 | lake | Anderson | Lake | 752 | lake | temp | -16.0 | 752 | DR-3 | nan | 752 | lake |
| 14 | lake | Anderson | Lake | 837 | lake | rad | 1.46 | 837 | MSK-4 | 1932-01-14 | 837 | lake |
| 15 | lake | Anderson | Lake | 837 | lake | sal | 0.21 | 837 | MSK-4 | 1932-01-14 | 837 | lake |
| 16 | roe | Valentina | Roerich | 752 | roe | sal | 41.6 | 752 | DR-3 | nan | 752 | roe |
| 17 | roe | Valentina | Roerich | 837 | roe | sal | 22.5 | 837 | MSK-4 | 1932-01-14 | 837 | roe |
| 18 | roe | Valentina | Roerich | 844 | roe | rad | 11.25 | 844 | DR-1 | 1932-03-22 | 844 | roe |
+----+---------+-----------+---------+---------+----------+-------+---------+---------+-------+------------+---------+----------+