
Kaggle's Learn Machine Learning step02 | Translate to Korean

Starting Your Project

You are about to build a simple model and then continually improve it.

It is easiest to keep one browser tab (or window) for the tutorials you are reading and a separate browser window for the code you are writing.

You will continue writing code in the same place even as you progress through the sequence of tutorials.

The starting point for your project is at THIS LINK.

Open that link in a new tab.

Then click the "Fork Notebook" button towards the top of the screen.

You will see examples predicting home prices using data from Melbourne, Australia.

You will then write code to build a model predicting prices in the US state of Iowa.

The Iowa data is pre-loaded in your coding notebook.

Working in Kaggle Notebooks

You will be coding in a "notebook" environment.

Notebooks let you easily see your code and its output in one place.

A couple of tips on the Kaggle notebook environment:

1) It is composed of "cells." You will write code in the cells.

Add a new cell by clicking on a cell and then using the arrow buttons that appear.

The arrows indicate whether the new cell goes above or below your current location.

2) Execute the code in the current cell with the keyboard shortcut Control-Enter.

Using Pandas to Get Familiar With Your Data

The first thing you'll want to do is familiarize yourself with the data.

You'll use the Pandas library for this.

Pandas is the primary tool that modern data scientists use for exploring and manipulating data.

Most people abbreviate pandas in their code as pd.

We do this with the command:

import pandas as pd

The most important part of the Pandas library is the DataFrame.

A DataFrame holds the type of data you might think of as a table.

This is similar to a sheet in Excel, or a table in a SQL database.

The Pandas DataFrame has powerful methods for most things you'll want to do with this type of data.
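To make the "table" idea concrete, here is a minimal sketch of a DataFrame built by hand. The column names echo the Melbourne data, but the three rows are made-up values chosen only for illustration, and the variable name homes is hypothetical.

```python
import pandas as pd

# Hypothetical toy data: three homes with made-up values.
homes = pd.DataFrame({
    "Rooms": [2, 3, 4],
    "Price": [650000, 880000, 1300000],
})

# A DataFrame knows its own shape: (number of rows, number of columns).
print(homes.shape)

# Column names work like the header row of a spreadsheet.
print(homes.columns.tolist())

# Each column supports summary methods such as mean().
print(homes["Price"].mean())
```

Each dictionary key becomes a column and each list position becomes a row, which is why the result behaves like a small spreadsheet.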

Let's start with a basic overview of the data, using our example data from Melbourne and the data you'll be working with from Iowa.

The example will use data at the file path ../input/melbourne-housing-snapshot/melb_data.csv.

Your data will be available in your notebook at ../input/train.csv (which is already typed into the sample code for you).

We load and explore the data with the following:

# save filepath to variable for easier access
melbourne_file_path = '../input/melbourne-housing-snapshot/melb_data.csv'

# read the data and store data in DataFrame titled melbourne_data
melbourne_data = pd.read_csv(melbourne_file_path)

# print a summary of the data in Melbourne data
print(melbourne_data.describe())

         Unnamed: 0         Rooms         Price      Distance      Postcode  \
count  18396.000000  18396.000000  1.839600e+04  18395.000000  18395.000000
mean   11826.787073      2.935040  1.056697e+06     10.389986   3107.140147
std     6800.710448      0.958202  6.419217e+05      6.009050     95.000995
min        1.000000      1.000000  8.500000e+04      0.000000   3000.000000
25%     5936.750000      2.000000  6.330000e+05      6.300000   3046.000000
50%    11820.500000      3.000000  8.800000e+05      9.700000   3085.000000
75%    17734.250000      3.000000  1.302000e+06     13.300000   3149.000000
max    23546.000000     12.000000  9.000000e+06     48.100000   3978.000000

           Bedroom2      Bathroom           Car       Landsize  BuildingArea  \
count  14927.000000  14925.000000  14820.000000   13603.000000   7762.000000
mean       2.913043      1.538492      1.615520     558.116371    151.220219
std        0.964641      0.689311      0.955916    3987.326586    519.188596
min        0.000000      0.000000      0.000000       0.000000      0.000000
25%        2.000000      1.000000      1.000000     176.500000     93.000000
50%        3.000000      1.000000      2.000000     440.000000    126.000000
75%        3.000000      2.000000      2.000000     651.000000    174.000000
max       20.000000      8.000000     10.000000  433014.000000  44515.000000

         YearBuilt     Lattitude    Longtitude  Propertycount
count  8958.000000  15064.000000  15064.000000   18395.000000
mean   1965.879996    -37.809849    144.996338    7517.975265
std      37.013261      0.081152      0.106375    4488.416599
min    1196.000000    -38.182550    144.431810     249.000000
25%    1950.000000    -37.858100    144.931193    4294.000000
50%    1970.000000    -37.803625    145.000920    6567.000000
75%    2000.000000    -37.756270    145.060000   10331.000000
max    2018.000000    -37.408530    145.526350   21650.000000

Interpreting Data Description

The results show 8 numbers for each numeric column in your original dataset.

The first number, the count, shows how many rows have non-missing values.

Missing values arise for many reasons.

For example, the size of the 2nd bedroom wouldn't be collected when surveying a 1-bedroom house.

We'll come back to the topic of missing data.
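As a small illustration of how count relates to missing values, the sketch below builds a toy table in which one column has gaps. The column names are borrowed from the Melbourne data, but the values and the variable names are made up for this example.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data: BuildingArea is missing (NaN) for two of four rows.
toy = pd.DataFrame({
    "Rooms": [1, 2, 3, 4],
    "BuildingArea": [np.nan, 93.0, np.nan, 174.0],
})

total_rows = len(toy)            # every row, missing values or not
non_missing = toy.count()        # per-column count of non-missing values
missing = toy.isnull().sum()     # per-column count of missing values

print(total_rows)
print(non_missing)
print(missing)
```

The count row printed by describe() is the same non-missing count, so a count lower than the number of rows signals gaps in that column.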

The second value is the mean, which is the average.

Under that, std is the standard deviation, which measures how numerically spread out the values are.

To interpret the min, 25%, 50%, 75% and max values, imagine sorting each column from lowest to highest value.

The first (smallest) value is the min.

If you go a quarter of the way through the list, you'll find a number that is bigger than 25% of the values and smaller than 75% of the values.

That is the 25% value (pronounced "25th percentile").

The 50th and 75th percentiles are defined analogously, and the max is the largest number.
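The percentile rows can be checked by hand on a tiny example. The sketch below uses five made-up numbers, already sorted, so the quarter, half and three-quarter positions land exactly on values in the list. When a position falls between two values, pandas interpolates between them by default.

```python
import pandas as pd

# Hypothetical values, sorted from lowest to highest.
values = pd.Series([10, 20, 30, 40, 50])

# describe() returns the same 8 numbers discussed above.
summary = values.describe()
print(summary)

# quantile() computes a single percentile directly.
first_quartile = values.quantile(0.25)
median = values.quantile(0.50)
third_quartile = values.quantile(0.75)
print(first_quartile, median, third_quartile)
```

Here 20 sits a quarter of the way through the sorted list, 30 sits halfway, and 40 sits three quarters of the way, matching the 25%, 50% and 75% rows.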

Your Turn

Remember, the notebook you want to "fork" is here.

Run the equivalent commands (to read the data and print the summary) in the code cell below.

The file path for your data is already shown in your coding notebook.
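One possible shape for the equivalent commands is sketched below. The helper name summarize_csv and the stand-in rows are assumptions made for illustration. Only the path ../input/train.csv comes from the text above, and it is left in a comment because the file exists only inside the Kaggle notebook.

```python
import io

import pandas as pd


def summarize_csv(source):
    """Read CSV data from a path or file-like object and return its summary."""
    data = pd.read_csv(source)
    return data.describe()


# In the Kaggle notebook, the path from the text would be used instead:
# iowa_summary = summarize_csv('../input/train.csv')

# Stand-in data with made-up values so the sketch runs anywhere.
sample_csv = io.StringIO(
    "Id,LotArea,YearBuilt\n"
    "1,8450,2003\n"
    "2,9600,1976\n"
    "3,11250,2001\n"
)
sample_summary = summarize_csv(sample_csv)
print(sample_summary)
```

This mirrors the Melbourne example: read the file into a DataFrame, then call describe() on it.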

Look at the mean, minimum and maximum values for the first few fields.

Are any of the values so implausible that you think you've misinterpreted the data?

There are a lot of fields in this data. You don't need to look at all of them yet.

When your code is correct, you'll see the size, in square feet, of the smallest lot in your dataset.

This is the min value of LotArea, and you can see the max size too.

You should notice that the lot sizes span a wide range!
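If you prefer to pull those two numbers out directly rather than read them off the printed table, something like the following should work. The LotArea values here are made up; in the notebook you would apply the same lookups to the summary of your own data.

```python
import pandas as pd

# Hypothetical lot sizes in square feet (made-up values).
lots = pd.DataFrame({"LotArea": [1300, 9600, 215245]})

summary = lots.describe()

# Rows of the summary are labelled by statistic, columns by field name.
smallest_lot = summary.loc["min", "LotArea"]
largest_lot = summary.loc["max", "LotArea"]

print(smallest_lot, largest_lot)
print(largest_lot - smallest_lot)  # width of the range
```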

You'll also see some columns filled with "...".

That indicates that we had too many columns of data to print, so the middle ones were omitted from printing.

We'll take care of both issues in the next step.
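The tutorial handles this in the next step. Separately, and only as an aside, the sketch below shows how pandas display options affect the "..." truncation. The wide table is made up, and changing display.max_columns is one possible approach, not necessarily the one the tutorial uses.

```python
import pandas as pd

# Hypothetical wide table: 25 made-up columns, 2 rows.
wide = pd.DataFrame({f"col_{i}": [i, i + 1] for i in range(25)})

# With a small column limit, the middle columns are replaced by "...".
with pd.option_context("display.max_columns", 5):
    truncated_text = repr(wide)

# Lifting the limit prints every column, wrapped over several lines.
with pd.option_context("display.max_columns", None):
    full_text = repr(wide)

print(truncated_text)
print(full_text)
```

Using option_context keeps the change temporary, so the rest of the notebook prints with the default settings.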

Continue

Move on to the next page, where you will focus on the most relevant columns.

License: Attribution, NonCommercial, NoDerivatives
