Character sets #9

### Problem
I think the package will be incomplete until we find a way to express groups of characters. Here's a challenge: express the following email-matching pattern in `rx`.

![regex-example](https://user-images.githubusercontent.com/13419011/54075789-9e08f380-42a3-11e9-9374-45dd126d8977.png)

### Challenges
First of all, I don't know of a way to express a single "word" character (`alnum` + `_`). We used `rx_word` to denote `\\w+`, and perhaps it should have been `rx_word_char() %>% rx_one_or_more()`. The `rx_char()` below emits `\\w` when called without a value and a sanitized literal otherwise.
```r
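rx_char <- function(.data = NULL, value = NULL) {
  # No value supplied: append a single "word" character (alnum + underscore)
  if (missing(value))
    return(paste0(.data, "\\w"))
  # Otherwise append the literal value, escaped via sanitize()
  paste0(.data, sanitize(value))
}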
```
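I also extended `rx_count` to accept a range: when `n` has more than one element, its first and last values become the bounds, and an `NA` leaves that bound open.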

```r
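rx_count <- function(.data = NULL, n = 1) {
  if (length(n) > 1) {
    # Range: an NA bound becomes empty, so c(2, NA) gives "{2,}"
    n[is.na(n)] <- ""
    return(paste0(.data, "{", n[1], ",", n[length(n)], "}"))
  }
  # Single value: exact repetition count
  paste0(.data, "{", n, "}")
}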
```
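As a quick check of the range behaviour, the snippet below repeats the `rx_count()` definition so it runs on its own, then builds a few quantifiers. The inputs (`2:6`, `c(2, NA)`) and the `tld` test vector are illustrative choices of mine, not part of the package.

```r
# Same definition as above, repeated so this snippet is self-contained
rx_count <- function(.data = NULL, n = 1) {
  if (length(n) > 1) {
    n[is.na(n)] <- ""
    return(paste0(.data, "{", n[1], ",", n[length(n)], "}"))
  }
  paste0(.data, "{", n, "}")
}

exact      <- rx_count(n = 3)          # "{3}"
bounded    <- rx_count(n = 2:6)        # "{2,6}"
open_ended <- rx_count(n = c(2, NA))   # "{2,}"

# Applied to a character class: two to six letters, anchored at both ends
pattern <- paste0("^", rx_count("[[:alpha:]]", n = 2:6), "$")
tld <- c("io", "museum", "a", "website")
matches <- grepl(pattern, tld)
matches
#> [1]  TRUE  TRUE FALSE FALSE
```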
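We also don't have a way to express word boundaries (`\\b`), and it might be useful to denote them. Since `\\b` matches at both the start and the end of a word, a single function, say `rx_word_edge`, would do; for now I define it under two names: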
```r
rx_word_start <- function(.data = NULL){
  paste0(.data, "\\b")
}

rx_word_end <- rx_word_start
``` 
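Because `\\b` is zero-width and matches on either side of a word, one definition can serve as both the start and the end marker. A small base-R check, using made-up test strings, shows the effect:

```r
rx_word_start <- function(.data = NULL) paste0(.data, "\\b")
rx_word_end <- rx_word_start

# Build \bcat\b by chaining the helpers by hand (no pipe needed)
pattern <- rx_word_end(paste0(rx_word_start(), "cat"))

# "cat" as a whole word matches; "cat" buried inside another word does not
hits <- grepl(pattern, c("cat", "a cat sat", "concatenate"), perl = TRUE)
hits
#> [1]  TRUE  TRUE FALSE
```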
Finally, our biggest problem is that there's no way to express groups of characters other than through `rx_any_of()`. But if we pass other `rx` expressions to it, the values are sanitized twice, so we get four backslashes before each symbol instead of two.

```r
# this function is exactly like rx_any_of() but without sanitization
rx_group <- function(.data = NULL, value) {
  paste0(.data, "[", value, "]")
}
```
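To make the double-sanitization problem concrete, here is a minimal sketch. `sanitize_sketch()` is a hypothetical stand-in I wrote for the package's internal `sanitize()`, whose exact implementation may differ; it simply prefixes regex metacharacters with a backslash.

```r
# Hypothetical stand-in for the package's internal sanitize():
# prefixes each regex metacharacter with a backslash.
sanitize_sketch <- function(x) {
  gsub("([\\\\.|()\\[\\]{}^$*+?])", "\\\\\\1", x, perl = TRUE)
}

word_char     <- "\\w"                        # already a valid regex fragment
escaped_again <- sanitize_sketch(word_char)   # the backslash gets doubled

good <- paste0("[", word_char, "]")       # class of word characters
bad  <- paste0("[", escaped_again, "]")   # class holding a backslash and "w"

good_hit <- grepl(good, "a", perl = TRUE)
bad_hit  <- grepl(bad, "a", perl = TRUE)
c(good_hit, bad_hit)
#> [1]  TRUE FALSE
```

The second pattern no longer means "any word character", which is why the unsanitized `rx_group()` was needed.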
### Solution
Here's what it looks like when we put all the pieces together:
```r
x <- rx_word_start() %>% 
  rx_group(
    rx() %>% 
      rx_char() %>% 
      rx_char(".%+-")
  ) %>%
  rx_one_or_more() %>% 
  rx_char("@") %>% 
  rx_group(
    rx() %>% 
      rx_char() %>% 
      rx_char(".-")
  ) %>% 
  rx_one_or_more() %>% 
  rx_char(".") %>% 
  rx_alpha() %>% 
  rx_count(2:6) %>% 
  rx_word_end()
x
#> [1] "\\b[\\w\\.%\\+-]+@[\\w\\.-]+\\.[[:alpha:]]{2,6}\\b"

txt <- "This text contains email first.last@gmail.com and noname@post.io. The latter is no longer valid."
regmatches(txt, gregexpr(x, txt, perl = TRUE))
#> [[1]]
#> [1] "first.last@gmail.com" "noname@post.io"  
stringr::str_extract_all(txt, x)
#> [[1]]
#> [1] "first.last@gmail.com" "noname@post.io"  
```
The code works but I don't like it.
1. The `rx()` constructor looks redundant (I believe there's a way to get rid of it entirely using a specialized class, see below).
1. It is not very clear what `rx_one_or_more()` refers to. I wonder if all functions should have a `rep` argument with default `one` and options `some`/`any`, in addition to what `rx_count` does today.
1. Should `rx_char()` without arguments be called `rx_wordchar()`?
1. Should `rx_char()` with arguments be called `rx_literal()` or `rx_plain()`?
1. We should be very explicit about sanitization of arguments, to the extent that we should just state: "input will be sanitized".
1. `rx_group` is an artificial construct, a duplicate of `rx_any_of` without sanitization. Here I see a couple of solutions.
   - (a) Allow "nested pipes" (as I have done above). Create an S3 class and use it to detect when the `value` argument is not a plain character vector but an `rx_string`. Input of this class does not need to be sanitized, because it was sanitized at creation.
   - (b) Do not allow "nested pipes". Instead, define `rx_any_of()` to take `...` and allow multiple arguments mixing functions and characters. The hypothetical pipe would then look like this:
```r
rx_word_edge() %>% 
  rx_any_of(rx_wordchar(), ".%+-", rep="some") %>%
  rx_literal("@") %>% 
  rx_any_of(rx_wordchar(), ".-", rep="some") %>% 
  rx_literal(".") %>% 
  rx_alpha(rep=2:6) %>% 
  rx_word_edge()
```
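For comparison, option (a) could be sketched roughly as follows. Every name here (`sanitize_sketch()`, `as_rx_string()`, `rx_escape()`, `rx_any_of_sketch()`) is hypothetical and only illustrates the idea of dispatching on an S3 class so that already-built expressions skip sanitization; it is not the package's implementation.

```r
# Hypothetical stand-in for the package's internal sanitize()
sanitize_sketch <- function(x) {
  gsub("([\\\\.|()\\[\\]{}^$*+?])", "\\\\\\1", x, perl = TRUE)
}

# Tag a string as an already-sanitized rx expression
as_rx_string <- function(x) structure(x, class = "rx_string")

# S3 generic: plain input is escaped, rx_string input passes through
rx_escape <- function(value) UseMethod("rx_escape")
rx_escape.default   <- function(value) sanitize_sketch(value)
rx_escape.rx_string <- function(value) unclass(value)

# Like rx_any_of(), but the escaping decision depends on the class of value
rx_any_of_sketch <- function(.data = NULL, value) {
  as_rx_string(paste0(.data, "[", rx_escape(value), "]"))
}

# Plain characters are sanitized once
plain <- rx_any_of_sketch(value = ".%+-")

# A nested expression built elsewhere is not sanitized a second time
inner  <- as_rx_string(paste0("\\w", sanitize_sketch(".%+-")))
nested <- rx_any_of_sketch(value = inner)

as.character(nested)
#> [1] "[\\w\\.%\\+-]"
```

With this approach `rx_group()` would become unnecessary, at the cost of every `rx_*` function having to return the classed object.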
It's a lot to digest, but it all relates to one particular problem. Happy to split this up once we identify the issues worth tackling.
