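Every day we recommend one quality open source project from GitHub and one selected English-language technology or programming article in its original form. Follow Open Source Daily (开源日报) to keep up. QQ group: 202790710; Telegram group: https://t.me/OpeningSourceOrg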


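Today's recommended open source project: chinese-poetry, a database of classical Chinese poetry

Why we recommend it: a comprehensive database of classical Chinese poetry, covering nearly 14,000 poets from the Tang and Song dynasties, with close to 55,000 Tang poems and 260,000 Song poems, plus 21,050 ci (lyric poems) by 1,564 ci poets of the Song period.

Classical poetry is one of the great treasures of the Chinese nation, but many people do not own the classical anthologies, which puts a distance between them and the poems. A convenient, practical electronic edition makes a big difference here, and that is why this poetry database exists.

This large database has already helped a number of poetry-related applications, such as the Android app "离线全唐诗" (Offline Complete Tang Poems) and pytorch-poetry-gen, which trains a computer to write poetry. Their GitHub links are below: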

https://github.com/justdark/pytorch-poetry-gen

https://github.com/animalize/QuanTangshi

Related: Open Source Weekly, 2018 Issue 7: Writing Poems for You, Knowing Everything for You

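Today's recommended English article: "Getting started with regular expressions" Author:

Original link: https://opensource.com/article/18/5/getting-started-regular-expressions

Why we recommend it: regular expressions are a very powerful tool for working with strings, and many programming languages support them. This article is a beginner's guide to regular expressions.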

Getting started with regular expressions

Regular expressions can be one of the most powerful tools in your toolbox as a Linux user, system administrator, or even as a programmer. They can also be one of the most daunting things to learn, but it doesn't have to be that way! While there are an infinite number of ways to write an expression, you don't have to learn every single switch and flag. In this short how-to, I'll show you a few simple ways to use regex that will have you up and running in no time and share some follow-up resources that will make you a regex master if you want to be.

A quick overview

Regular expressions, also referred to as "regex" patterns or even "regular statements," are in simple terms "a sequence of characters that define a search pattern." The idea came about in the 1950s when Stephen Cole Kleene wrote a description of an idea he called a "regular language," of which part came to be known as "Kleene's theorem." At a very high level, it says if the elements of the language can be defined, then an expression can be written to match patterns within that language.

Since then, regular expressions have been part of even the earliest Unix programs, including vi, sed, awk, grep, and others. In fact, the word grep is derived from the command that was used in the earliest "ed" editor, namely g/re/p, which essentially means "do a global search for this regular expression and print the lines." Cool!

Why we need regular expressions

As mentioned above, regular expressions are used to define a pattern to help us match on or "find" objects that match that pattern. Those objects can be files in a filesystem when using the find command for instance, or a block of text in a file which we might search using grep, awk, vi, or sed, for example.

Start with the basics

Let's start at the very beginning; it's a very good place to start.

The first pattern everyone seems to learn is probably one you already know without realizing what it is. Have you ever wanted to print out a list of files in a directory, but it was too long? Maybe you've seen someone type *.gif to list GIF images in a directory, like:

$ ls *.gif

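That's pattern matching in action, and a close cousin of regular expressions! (Strictly speaking, the shell calls this a glob or wildcard pattern; its syntax is simpler than regex, but the idea is the same.)

When writing patterns like these, certain characters have special meaning to allow us to move beyond matching just characters to matching entire sets of characters. In this case, the * character, also called "star" or "splat," stands in for any run of characters in a filename, which lets you match all files ending with .gif.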
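As a rough illustration, Python's fnmatch module implements this shell-style wildcard matching, and its translate function shows the regular expression that corresponds to a wildcard. The file names below are made up for the example; this is a sketch, not what the shell literally runs.

```python
import fnmatch
import re

# Hypothetical directory listing, used only for illustration.
names = ["cat.gif", "dog.gif", "notes.txt", "gif.png"]

# Shell-style wildcard: * stands for any run of characters.
gifs = fnmatch.filter(names, "*.gif")
print(gifs)

# translate() converts the wildcard into an equivalent regular expression,
# which we can then use with the re module directly.
regex = fnmatch.translate("*.gif")
same = [name for name in names if re.match(regex, name)]
print(same)
```

Both lists contain the same two .gif names, which is the sense in which the wildcard and the regular expression describe the same set of files.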

Search for patterns in a file

The next step in your regex-fu training is searching for patterns within a file, especially using the replace pattern to make quick changes.

Two common ways to do this are:

  1. Use vi to open the file, search for a pattern, and make the change (even automatically using replace).
  2. Use the "stream editor," aka sed, to programmatically search within the file and make the change.

Let's start by learning some regex by using vi to edit the following file:

The quick brown fox jumped over the lazy dog.
Simple test
Harder test
Extreme test case
ABC 123 abc 567
The dog is lazy

Now, with this file open in vi, let's look at some regex examples that will help us find some matching strings inside and even replace them automatically.

To make things easier, let's set vi to ignore case. Type :set ic to enable case-insensitive searching.

Now, to start searching in vi, type the / character followed by your search pattern.

Search for things at the beginning or end of a line

To find a line that starts with "Simple," use this regex pattern:

/^Simple

Notice in the image below that only the line starting with "Simple" is highlighted. The caret symbol (^) is the regex equivalent of "starts with."

[Image: 'Simple' highlighted]

Next, let's use the $ symbol, which in regex speak is "ends with."

/test$

[Image: 'Test' highlighted]

See how it highlights both lines that end in "test"? Also, notice that the fourth line has the word test in it, but not at the end, so this line is not highlighted.

This is the power of regular expressions: you can scan a large amount of text quickly, yet narrow the results down to only the exact matches you want.
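If you'd like to try the two anchors outside of vi, here is a minimal sketch using Python's re module on the same sample file. The re.IGNORECASE flag is an assumption standing in for vi's ignore-case setting, and the variable names are only for illustration.

```python
import re

text = """The quick brown fox jumped over the lazy dog.
Simple test
Harder test
Extreme test case
ABC 123 abc 567
The dog is lazy"""

lines = text.splitlines()

# ^ anchors the pattern to the start of the line.
starts = [line for line in lines if re.search(r"^Simple", line, re.IGNORECASE)]

# $ anchors the pattern to the end of the line.
ends = [line for line in lines if re.search(r"test$", line, re.IGNORECASE)]

print(starts)  # ['Simple test']
print(ends)    # ['Simple test', 'Harder test']
```

"Extreme test case" is left out of the second list because its "test" is not at the end of the line.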

Test for the frequency of occurrence

To further extend your skills in regular expressions, let's take a look at some more common special characters that allow us to look for not just matching text, but also patterns of matches.

Frequency matching characters:

Character  Meaning          Example
*          Zero or more     ab* – the letter a followed by zero or more b's
+          One or more      ab+ – the letter a followed by one or more b's
?          Zero or one      ab? – the letter a followed by zero or one b
{n}        Exactly n        ab{2} – the letter a followed by exactly two b's
{n,}       At least n       ab{2,} – the letter a followed by at least two b's
{n,m}      Between n and m  ab{1,3} – the letter a followed by between one and three b's
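To see how each of these behaves, here is a small sketch in Python that checks every pattern from the table against a few candidate strings. It uses re.fullmatch so the whole string has to fit the pattern; the candidate list is made up for the example. Note that the table shows the unescaped syntax; Vim's default search mode expects some of these to be written with a backslash (for example \+ and \{2}).

```python
import re

patterns = ["ab*", "ab+", "ab?", "ab{2}", "ab{2,}", "ab{1,3}"]
candidates = ["a", "ab", "abb", "abbb", "abbbb"]


def matches(pattern):
    """Return the candidates that the pattern matches in full."""
    return [s for s in candidates if re.fullmatch(pattern, s)]


for pattern in patterns:
    print(pattern, matches(pattern))
```

For instance, ab? accepts only "a" and "ab", while ab{2,} accepts every candidate with two or more b's.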

Find classes of characters

The next step in regex training is to use classes of characters in our pattern matching. What's important to note here is that these classes can be written either as a list, such as [adxz], or as a range, such as [a-z], and that characters are usually case sensitive. (Don't separate the list with commas: inside the brackets, a comma is just another character to match.)

To see this work in vi, we'll need to turn off the ignore case we set earlier. Type :set noic to turn ignore case off again.

Some common classes of characters that are used as ranges are:

  • a-z – all lowercase characters
  • A-Z – all UPPERCASE characters
  • 0-9 – numbers

Now, let's try a search similar to one we ran earlier:

/tT

Do you notice that it finds nothing? That's because the previous regex looks for exactly "tT." If we replace this with:

/[tT]

We'll see that both the lowercase and UPPERCASE T's are matched across the document.

[Image: letter 't' highlighted]

Now, let's chain a couple of class ranges together and see what we get. Try:

/[A-Z1-3]

[Image: capital letters and 123 are highlighted]

Notice that the capital letters and 123 are highlighted, but not the lowercase letters (including the end of line five).
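The same three searches can be sketched in Python, where re.findall returns every match instead of highlighting it. Matching is case sensitive by default here, just like vi with ignore case turned off.

```python
import re

text = """The quick brown fox jumped over the lazy dog.
Simple test
Harder test
Extreme test case
ABC 123 abc 567
The dog is lazy"""

# Without brackets, the pattern looks for the exact two-character string "tT".
exact = re.findall(r"tT", text)
print(exact)  # []

# With brackets, either t or T matches at each position.
either = re.findall(r"[tT]", text)
print(sorted(set(either)))  # ['T', 't']

# Two ranges chained in one class: capital letters plus the digits 1 to 3.
caps_and_123 = re.findall(r"[A-Z1-3]", text)
print("".join(caps_and_123))  # TSHEABC123T
```

The digits 5, 6, and 7 on line five are skipped because they fall outside the 1-3 range.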

Flags

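The last step in your beginning regex training is to understand the flags (more commonly called metacharacters or shorthand character classes) that let you search for special types of characters without needing to list them in a range.

  • . – any single character
  • \s – whitespace
  • \w – word character (a letter, digit, or underscore)
  • \d – digit (number)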
For example, to find all digits in the example text, use:

/\d

Notice in the example below that all of the numbers are highlighted.

[Image: numbers are highlighted]

To match on the opposite, you usually use the same flag, but in UPPERCASE. For example:

  • \S – not whitespace
  • \W – not a word character
  • \D – not a digit

Notice in the example below that by using \D, all characters EXCEPT the numbers are highlighted.

[Image: all characters EXCEPT the numbers are highlighted]
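Here is a short sketch of these shorthand classes in Python, run against line five of the sample file. Python's re module happens to use the same \d, \D, \w, and \s notation; other tools may differ slightly in what each one covers.

```python
import re

line = "ABC 123 abc 567"

# \d matches each digit on its own.
digits = re.findall(r"\d", line)
print(digits)  # ['1', '2', '3', '5', '6', '7']

# \D is the opposite: every character that is not a digit, spaces included.
non_digits = re.findall(r"\D", line)
print("".join(non_digits))

# \w+ groups runs of word characters; \s splits on whitespace.
words = re.findall(r"\w+", line)
fields = re.split(r"\s", line)
print(words)   # ['ABC', '123', 'abc', '567']
print(fields)  # ['ABC', '123', 'abc', '567']
```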

Searching with sed

A quick note on sed: It's a stream editor, which means you don't interact with a user interface. It takes the stream coming in one side and writes it out the other side.

Using sed is very similar to vi, except that you give it the regex to search and replace, and it returns the output. For example:

sed 's/dog/cat/' examples

will return the following to the screen:

[Image: searching and replacing]
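For comparison, the same substitution can be sketched in Python. sed works through its input one line at a time and, without the g flag, replaces only the first match on each line, so the sketch passes count=1 to re.sub for each line. This mirrors the behavior rather than reproducing how sed is implemented.

```python
import re

text = """The quick brown fox jumped over the lazy dog.
Simple test
Harder test
Extreme test case
ABC 123 abc 567
The dog is lazy"""

# Replace the first "dog" on each line with "cat", like sed's s/dog/cat/.
result = "\n".join(
    re.sub(r"dog", "cat", line, count=1) for line in text.splitlines()
)
print(result)
```

The original text is left untouched; like sed, the sketch only produces new output.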

If you want to save that file, it's only slightly trickier. You'll need to chain a couple of commands together to a) write the output to a new file, and b) move that new file over the top of the first file.

To do this, try:

sed 's/dog/cat/' examples > temp.out && mv temp.out examples

Now, if you look at your examples file, you'll see that the word "dog" has been replaced.

The quick brown fox jumped over the lazy cat.
Simple test
Harder test
Extreme test case
ABC 123 abc 567
The cat is lazy
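The write-then-move pattern above can be sketched in Python as well. The replace_in_file helper below is a hypothetical name introduced for this example; it writes the changed lines to a temporary file next to the original and then moves it into place, which is one reasonable way to do it rather than the only one.

```python
import os
import re
import tempfile
from pathlib import Path


def replace_in_file(path, pattern, replacement):
    """Rewrite the file, replacing the first match of pattern on each line."""
    path = Path(path)
    lines = path.read_text().splitlines(keepends=True)
    new_lines = [re.sub(pattern, replacement, line, count=1) for line in lines]

    # Step a) write the result to a temporary file in the same directory.
    fd, tmp_name = tempfile.mkstemp(dir=str(path.parent))
    with os.fdopen(fd, "w") as tmp:
        tmp.writelines(new_lines)

    # Step b) move the temporary file over the original.
    os.replace(tmp_name, str(path))


# Try it on a throwaway copy of two lines from the sample file.
with tempfile.TemporaryDirectory() as workdir:
    examples = Path(workdir) / "examples"
    examples.write_text(
        "The quick brown fox jumped over the lazy dog.\nThe dog is lazy\n"
    )
    replace_in_file(examples, r"dog", "cat")
    updated = examples.read_text()

print(updated)
```

Keeping the temporary file in the same directory means the final move stays on one filesystem, so the original is swapped out in a single step.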

For more information

I hope this was a helpful overview of regular expressions. Of course, this is just the tip of the iceberg, and I hope you'll continue to learn about this powerful tool by reviewing the additional resources below.

