When I was first learning about what regular expressions are and how to use them, I was having a harder than average time finding good resources to explain the basics. Luckily for me, I have an amazing community that came to my aid. Erin Call wrote this awesome post, and Cory Kolbeck wrote this one explaining the bones of using regexes.

I wanted to write about regexes, but the write ups from Erin and Cory are far and away better than any intro I could write. So, I wondered for a while what I could contribute to the discussion. Then a friend who just graduated from Epicodus mentioned that she understood how to construct regexes (use Rubular!), but not really how they would integrate with an actual program. And I have done that! So, here is a description of how my pair partner and I used regexes as an integral part of solving PCS’s code challenge 4.

For our code challenge, we had to parse and clean a text file of data in this format:

[prefix] [first_name] [middle_name | middle_initial] last_name [suffix] phone_number phone_extension

where anything in square brackets is optional, and then appropriately assign the parts that are present to the correct column in a CSV. We used OptionParser to create our own flags to run a command line script that would do all this.

The way my pair partner and I ended up solving this problem was by creating two scripts. The first script read in the raw data, stripped off the first word in every line of that data and created a reverse sorted histogram*, and then wrote that histogram data to a CSV. We then examined the output and determined which of the words were legitimately prefixes. This is the script we used:

histogram = Hash.new(0)
while line = gets
  prefix = (/^S*/).match(line)
  prefix = prefix.to_s
  histogram[prefix.to_sym] += 1
histogram = Hash[histogram.sort_by { |prefix, count| count }.reverse]
histogram.each { |prefix, count| puts "#{prefix} #{count}" }

We needed to do the same for the suffixes, so we had to split the raw data into the name portion and phone number portion in order to expose the last word of the name section. At this point, we used a regex to match the tab whitespace separating those two parts of the data, saved the first part to a variable, then ran it through a similar script to create a histogram for the suffixes.

Once we’d done that, we set up OptionParser to take the prefixes and suffixes files as input, used those lists to break up the name parts of the data, then used digit matching regexes to both match the existing phone information and reformat it in a consistent way. You can see the source code for the cleaning script. Please forgive our horribly verbose code. Hopefully I’ll get around to refactoring it soon, but for now, done is better than perfect.

So, hopefully this will be helpful to someone looking for some way to actually use regexes in their code! If you feel like playing around with the scripts we wrote, the regexes therein, and OptionParser, you can clone the project and run it in your command line. In your terminal, follow these steps:

shawnacscott in ~/Code on master #%
+ git clone https://github.com/shawnacscott/pcs_code_challenge_04.git
Cloning into 'pcs_code_challenge_04'...
remote: Counting objects: 128, done.
remote: Compressing objects: 100% (86/86), done.
remote: Total 128 (delta 54), reused 103 (delta 32)
Receiving objects: 100% (128/128), 382.55 KiB | 541.00 KiB/s, done.
Resolving deltas: 100% (54/54), done.
Checking connectivity... done
shawnacscott in ~/Code on master #%
+ cd pcs_code_challenge_04
shawnacscott in ~/Code/pcs_code_challenge_04 on master<>
+ ruby clean.rb --prefixes prefix_words.txt --suffixes suffix_words.txt --input raw_customers.txt --output customers.csv

*a hash where the first word is the key, and the value is the count of how many times that word is present.