[STATA] Extracting and replacing text with regular expressions

A regular expression is a notation for describing a set of strings that follow a particular pattern. It is widely used in programming, mostly for searching and replacing text.

regexm

regexm() is useful for creating a variable that equals 1 when a string variable contains the pattern you are looking for and 0 otherwise.

gen newvar = regexm(oldvar, "string_to_find")

For example, the following commands create a dummy variable that flags the observations where brand contains BMW, and then tabulate it to count them.

gen BMW=regexm(brand, "BMW")

tab BMW
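As a small illustration of the dummy above, here is a sketch on a made-up four-row dataset (the brand values are invented for this example). It also shows that the match is case-sensitive, and one possible workaround using upper().

```stata
* Hypothetical toy data, not from the original post
clear
input str20 brand
"BMW 320i"
"Audi 5000"
"bmw 528"
"BMW X5"
end

* regexm() is case-sensitive, so "bmw 528" gets 0 here
gen BMW = regexm(brand, "BMW")

* One way to ignore case: match against an uppercased copy
gen BMW_anycase = regexm(upper(brand), "BMW")

list
```

With this data, BMW should be 1 for the first and last rows only, while BMW_anycase should also flag "bmw 528".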

regexr

regexr() finds a string and replaces it with a new one. Only the first match in each value is replaced.

gen newvar = regexr(oldvar, "string_to_replace", "replacement_string")

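For example, the following command replaces the string BMW in the variable brand with Luxury and stores the result in a new variable, luxurycar. Where the search string is not found, the original value is carried over unchanged.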

gen luxurycar=regexr(brand, "BMW", "Luxury")
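The sketch below, on invented data, shows the two behaviors worth knowing about regexr(): a value without a match is copied as is, and only the first occurrence is replaced. The subinstr() line is an alternative added here for comparison (it does literal, not regex, replacement) and is not part of the original example.

```stata
* Hypothetical toy data, not from the original post
clear
input str20 brand
"BMW"
"BMW and BMW"
"Audi"
end

* regexr() replaces the first match only
gen luxurycar = regexr(brand, "BMW", "Luxury")

* subinstr() with "." as the last argument replaces every literal occurrence
gen luxuryall = subinstr(brand, "BMW", "Luxury", .)

list
```

Here the second row should become "Luxury and BMW" in luxurycar but "Luxury and Luxury" in luxuryall, and "Audi" should be unchanged in both.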

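To convert text that was entered in lowercase to uppercase, you can use regexr() as follows. Here a local macro l has already been defined, and the command replaces the string it holds with its uppercase version wherever that string is found in project_code.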

replace project_code=regexr(project_code,"`l'",upper("`l'"))
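The original does not show how the local macro l gets defined. One plausible setup, sketched here as an assumption, is a foreach loop over a list of prefixes (the prefixes ab and cd and the data values are invented).

```stata
* Hypothetical toy data, not from the original post
clear
input str10 project_code
"ab-101"
"cd-202"
"ab-303"
end

* Assumed setup: l takes each prefix in turn inside the loop
foreach l in ab cd {
    replace project_code = regexr(project_code, "`l'", upper("`l'"))
}

list
```

Because the macro contents are used as a regular expression, a prefix containing characters such as . or + would need escaping with a backslash.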

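You can also replace strings that satisfy a set of conditions rather than one specific string. The following example replaces any value that starts with a capital B and ends with three digits followed by one lowercase letter with "found". The .* in the middle allows any characters in between.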

generate make2 = make

replace make2 = regexr(make2, "^B.*[0-9][0-9][0-9][a-z]$", "found")

list make make2 if make != make2
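To see which kinds of values this pattern does and does not catch, the following sketch flags matches with regexm() on a few invented values of make.

```stata
* Hypothetical toy data, not from the original post
clear
input str20 make
"BMW 320i"
"B320i"
"BMW 320"
"Buick Regal"
"Audi 500s"
end

* 1 if the value starts with B and ends in three digits plus one lowercase letter
gen byte hit = regexm(make, "^B.*[0-9][0-9][0-9][a-z]$")

list
```

Only the first two rows should be flagged: "BMW 320" lacks the trailing letter, "Buick Regal" has no digits, and "Audi 500s" does not start with B.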

regexs

Here the s stands for subexpression. regexs() extracts part of the original string so that you can build a new variable from it. It is used together with regexm(), which performs the match.

gen newvar = regexs(number_of_subexpression_to_extract) if regexm(oldvar, "(first subexpression)(second subexpression)...(nth subexpression)")

For example, suppose a variable named date contains "12March2015" and you want to extract only the month, "March". The following command does this.

gen month=regexs(2) if regexm(date, "([0-9]+)([a-zA-Z]+)([0-9]+)")

Each subexpression is enclosed in ( ), and the command extracts the second one, the run of letters.

You can also pin down the start (^) and end ($) of the string explicitly, as follows.

gen month=regexs(2) if regexm(date, "^([0-9]+)([a-zA-Z]+)([0-9]+)$")
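The same anchored pattern can pull out all three parts of the date. The sketch below uses invented values; the explicit str types and the destring step are choices made for this example, not something the original prescribes.

```stata
* Hypothetical toy data, not from the original post
clear
input str12 date
"12March2015"
"3July1999"
"March2015"
end

* regexs(n) returns the n-th parenthesized group of the last regexm() match
gen str2  day   = regexs(1) if regexm(date, "^([0-9]+)([a-zA-Z]+)([0-9]+)$")
gen str10 month = regexs(2) if regexm(date, "^([0-9]+)([a-zA-Z]+)([0-9]+)$")
gen str4  year  = regexs(3) if regexm(date, "^([0-9]+)([a-zA-Z]+)([0-9]+)$")

* Convert the numeric parts from string to number
destring day year, replace

list
```

The third row has no leading day, so it does not match the pattern and all three new variables should be missing for it.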

Suppose you have a dataset like the following and want to tidy up only the state information. The commands after the table can be used for this.

                                  where |      Freq.     Percent        Cum.
----------------------------------------+-----------------------------------
                       philadelphia, pa |          1        0.12       93.55
                          pikeville, ky |          1        0.12       93.67
                          pittsburg, pa |          1        0.12       93.80
                      portland , oregan |          1        0.12       93.92
               providence, rhode island |          1        0.12       94.04
                raleigh, north carolina |          1        0.12       94.17
                     san francisco , ca |          1        0.12       94.29
                      swan quarter, n.c |          1        0.12       95.91

gen where_upper=upper(where)

gen noper=regexr(where_upper, "\.", "")

gen state=regexs(2) if regexm(noper, "(, )+([A-Z][A-Z]$)")

tab state

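First convert every value of where to uppercase, then strip the period from states coded like n.c, and finally extract the two-letter state code. Values where the state is spelled out (such as "north carolina") do not match the pattern and are left missing.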
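The three steps can be tried end to end on a few rows copied from the table above. This sketch swaps in subinstr() for the period-removal step, which is a substitution made here so that every period is dropped (regexr() would remove only the first one).

```stata
* Four rows copied from the tabulation above
clear
input str30 where
"philadelphia, pa"
"portland , oregan"
"san francisco , ca"
"swan quarter, n.c"
end

* Step 1: uppercase everything
gen where_upper = upper(where)

* Step 2: drop all periods (subinstr used here instead of regexr)
gen noper = subinstr(where_upper, ".", "", .)

* Step 3: keep the two capital letters that follow ", " at the end of the string
gen str2 state = regexs(2) if regexm(noper, "(, )+([A-Z][A-Z]$)")

list where state
```

With these rows, state should come out as PA, missing, CA, and NC.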

substr

Although it is not a regular-expression function, substr() also extracts part of a string variable.

gen newvar = substr(oldvar, starting_position_n1, number_of_characters_n2)
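A brief sketch of substr() on the date string used earlier, including two variants: a negative starting position counts from the end of the string, and a missing value (.) as the length means "to the end".

```stata
* Hypothetical one-row dataset reusing the earlier date value
clear
input str12 date
"12March2015"
end

* Characters 1-2
gen first2 = substr(date, 1, 2)

* Four characters starting four from the end
gen last4 = substr(date, -4, 4)

* Everything from the third character onward
gen rest = substr(date, 3, .)

list
```

This works when the positions are fixed; when the day can have one or two digits, the regexs() approach above is the more robust choice.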

Symbols used in regular expressions

Counting
*	Asterisk means “match zero or more” of the preceding expression.
+	Plus sign means “match one or more” of the preceding expression.
?	Question mark means “match either zero or one” of the preceding expression.
Characters
a-z	The dash operator means “match a range of characters or numbers”. The “a” and “z” are merely an example. It could also be 0-9, 5-8, F-M, etc.
.	Period means “match any character”.
\	A backslash is used as an escape character to match characters that would otherwise be interpreted as a regular-expression operator.
Anchors
^	When placed at the beginning of a regular expression, the caret means “match expression at beginning of string”. This character can be thought of as an “anchor” character since it does not directly match a character, only the location of the match.
$	When the dollar sign is placed at the end of a regular expression, it means “match expression at end of string”. This is the other anchor character.
Groups
|	The pipe character signifies a logical “or” that is often used in character sets (see square brackets below).
[ ]	Square brackets denote a set of allowable characters/expressions to use in matching, such as [a-zA-Z0-9] for all alphanumeric characters.
( )	Parentheses must match and denote a subexpression group.
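A few of the operators in the table can be tried out with regexm(). The sketch below uses invented strings to illustrate ?, the escaped period, and | inside a group.

```stata
* Hypothetical toy data, not from the original post
clear
input str20 s
"color"
"colour"
"colouur"
"3.14"
"314"
"cat"
"dog"
end

* ? makes the preceding "u" optional (zero or one)
gen byte q = regexm(s, "^colou?r$")

* \. matches a literal period rather than "any character"
gen byte dot = regexm(s, "^[0-9]+\.[0-9]+$")

* | inside parentheses means "cat" or "dog"
gen byte pipe = regexm(s, "^(cat|dog)$")

list
```

Here q should be 1 for "color" and "colour" only, dot should be 1 for "3.14" only, and pipe should be 1 for "cat" and "dog".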

Source: http://www.stata.com/support/faqs/data-management/regular-expressions/

License: Attribution, NonCommercial, NoDerivatives

새벽첫빛의 꿈꾸는 아프리카