|
他们经常说:R语言处理字符串、文本数据不行
我说:就这?
我给他们来个 tidyverse 版本。
<hr/>先创建数据:
library(tidyverse)
df = tibble(
item_id = 1001:1006,
description = c(&#34;used car 3 years old&#34;, &#34;brand new car&#34;,
&#34;5-year-old used car&#34;, &#34;used bicycle&#34;,
&#34;car, 10k mileage, used&#34;, &#34;car&#34;),
lot = c(&#34;A-124-X&#34;, &#34;B-102-X&#34;, &#34;A-039-Y&#34;,
&#34;A-200-Y&#34;, &#34;B-120-Y&#34;, &#34;B-025-X&#34;),
price = c(24000, 36000, &#34;$18000&#34;, 1200, &#34;12k&#34;, 14000))
df

1. 检查字符串是否包含特定的单词或字符序列
- 筛选 description 列,包含 &#34;used car&#34; 的行
df %>%
filter(str_detect(description, &#34;used car&#34;))

- 筛选 description 列,包含 &#34;used&#34; 且包含 &#34;car&#34; 的行
df %>%
filter(str_detect(description, &#34;used&#34;), str_detect(description, &#34;car&#34;))

2. 根据字符串的长度进行过滤
- 筛选 description 列字符串长度 > 15 的行:
df %>%
filter(str_count(description) > 15)

3. 基于字符串的第一个或最后一个字母进行过滤
- 筛选 lot 列以 &#34;A&#34; 开头的行:
df %>%
filter(str_starts(lot, &#34;A&#34;))

- 筛选 lot 列以 &#34;A-0&#34; 开头的行:
df %>%
filter(str_starts(lot, &#34;A-0&#34;))

4. 过滤非数字字符
df %>%
filter(!is.na(as.numeric(price)))

更好的做法(正则表达式,结果同上):
df %>%
filter(!str_detect(price, &#34;\\D&#34;))注:若要选出带字符的,去掉 &#34;!&#34; 即可。
5. 计算单个字符或字符序列的出现次数
- 统计 description 列 &#34;used&#34; 出现的次数:
df %>%
mutate(n = str_count(description, &#34;used&#34;))

若要筛选出现次数小于 1 的行:
df %>%
filter(str_count(description, &#34;used&#34;) < 1)
 |
|