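Every day we recommend one high-quality open source project on GitHub and one selected English-language article on technology or programming. Follow Open Source Daily (開源日報). QQ group: 202790710; Telegram group: https://t.me/OpeningSourceOrg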
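Today's recommended open source project: "Database of Classical Chinese Poetry: chinese-poetry"
Why we recommend it: a comprehensive database of classical Chinese poetry, covering nearly 14,000 poets from the Tang and Song dynasties, close to 55,000 Tang poems plus 260,000 Song poems, and 21,050 ci lyrics by 1,564 ci poets of the Northern and Southern Song.
Classical poetry is a great treasure of the Chinese nation, but many people do not own classical anthologies, which puts some distance between them and the poems. A convenient, practical electronic edition is very valuable here, and that is why this poetry database exists.
This large database has already supported a number of poetry-related applications, such as the Android app 《離線全唐詩》 (Offline Complete Tang Poems) and pytorch-poetry-gen, which trains a computer to write poetry. Their GitHub links are below: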
https://github.com/justdark/pytorch-poetry-gen
https://github.com/animalize/QuanTangshi
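Today's recommended English article: "Getting started with regular expressions". Author:
Original link: https://opensource.com/article/18/5/getting-started-regular-expressions
Why we recommend it: Regular expressions are a very powerful tool for manipulating strings, and many programming languages support them. This article is a beginner's guide to regular expressions.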
Getting started with regular expressions
Regular expressions can be one of the most powerful tools in your toolbox as a Linux user, system administrator, or even as a programmer. They can also be one of the most daunting things to learn, but it doesn't have to be that way! While there are an infinite number of ways to write an expression, you don't have to learn every single switch and flag. In this short how-to, I'll show you a few simple ways to use regex that will have you running in no time and share some follow-up resources that will make you a regex master if you want to be.
A quick overview
Regular expressions, also referred to as "regex" patterns or even "regular statements," are in simple terms "a sequence of characters that define a search pattern." The idea came about in the 1950s when Stephen Cole Kleene wrote a description of an idea he called a "regular language," of which part came to be known as "Kleene's theorem." At a very high level, it says if the elements of the language can be defined, then an expression can be written to match patterns within that language.
Since then, regular expressions have been part of even the earliest Unix programs, including vi, sed, awk, grep, and others. In fact, the word grep is derived from the command that was used in the earliest "ed" editor, namely g/re/p, which essentially means "do a global search for this regular expression and print the lines." Cool!
Why we need regular expressions
As mentioned above, regular expressions are used to define a pattern to help us match on or "find" objects that match that pattern. Those objects can be files in a filesystem when using the find command for instance, or a block of text in a file which we might search using grep, awk, vi, or sed, for example.
Start with the basics
Let's start at the very beginning; it's a very good place to start.
The first regex everyone seems to learn is probably one you already know and didn't realize what it was. Have you ever wanted to print out a list of files in a directory, but it was too long? Maybe you've seen someone type *.gif as part of a command such as ls to list only the GIF images in a directory.
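That's pattern matching at work! Strictly speaking, the shell calls this kind of pattern a glob rather than a regular expression, but the idea is the same.
When writing regular expressions, certain characters have special meaning to allow us to move beyond matching just characters to matching entire sets of characters. In this case, the * character, also called "star" or "splat," takes the place of filenames and allows you to match all files ending with .gif. (In a regular expression proper, * instead means "zero or more of the preceding character," as the frequency table later in this article shows.)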
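As a minimal sketch of the wildcard described above, the following creates a few empty files in a temporary directory and lists only the GIFs. The file names are hypothetical and exist only for this demonstration.

```shell
# Work in a throwaway directory so no real files are touched.
workdir=$(mktemp -d)
cd "$workdir" || exit 1

# Hypothetical file names, created only for this demonstration.
touch cat.gif dog.gif notes.txt photo.png

# The shell expands *.gif to every name ending in .gif before ls runs.
gifs=$(ls *.gif)
echo "$gifs"
```

Only the two .gif names are listed; the .txt and .png files do not match the pattern.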
Search for patterns in a file
The next step in your regex foo training is searching for patterns within a file, especially using the replace pattern to make quick changes.
Two common ways to do this are:
- Use vi to open the file, search for a pattern, and make the change (even automatically using replace).
- Use the "stream editor," aka sed, to programmatically search within the file and make the change.
Let's start by learning some regex by using vi to edit the following file:
Simple test
Harder test
Extreme test case
ABC 123 abc 567
The dog is lazy
Now, with this file open in vi, let's look at some regex examples that will help us find some matching strings inside and even replace them automatically.
To make things easier, let's set vi to ignore case. Type set ic to enable case-insensitive searching.
Now, to start searching in vi, type the / character followed by your search pattern.
Search for things at the beginning or end of a line
To find a line that starts with "Simple," put the caret symbol (^), the regex equivalent of "starts with," in front of the word in your search pattern. Only the line starting with "Simple" is then highlighted.
Next, let's use the $ symbol, which in regex speak is "ends with." Searching for "test" followed by $ highlights both lines that end in "test." Notice that the third line also has the word test in it, but not at the end, so this line is not highlighted.
This is the power of regular expressions, giving you the ability to quickly look across a great number of matches with ease but specifically drill down on only exact matches.
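The same two anchors can be tried outside vi. The sketch below is an assumption-laden stand-in for the vi searches: it uses grep instead of the editor and recreates the sample file in a temporary location, with -i playing the role of vi's ignore-case setting.

```shell
# Recreate the sample file from the article in a temporary location.
sample=$(mktemp)
printf '%s\n' 'Simple test' 'Harder test' 'Extreme test case' \
  'ABC 123 abc 567' 'The dog is lazy' > "$sample"

# ^ anchors the match to the start of a line.
starts=$(grep '^Simple' "$sample")

# $ anchors the match to the end of a line; -i ignores case.
ends=$(grep -i 'test$' "$sample")

echo "Starts with Simple:"
echo "$starts"
echo "Ends with test:"
echo "$ends"

rm -f "$sample"
```

The line "Extreme test case" contains "test" but does not end with it, so the second search skips it.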
Test for the frequency of occurrence
To further extend your skills in regular expressions, let's take a look at some more common special characters that allow us to look for not just matching text, but also patterns of matches.
Frequency matching characters:
Character | Meaning | Example |
---|---|---|
* | Zero or more | ab* – the letter a followed by zero or more b's |
+ | One or more | ab+ – the letter a followed by one or more b's |
? | Zero or one | ab? – zero or just one b |
{n} | Given a number, find exactly that number | ab{2} – the letter a followed by exactly two b's |
{n,} | Given a number, find at least that number | ab{2,} – the letter a followed by at least two b's |
{n,y} | Given two numbers, find a range of that number | ab{1,3} – the letter a followed by between one and three b's |
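To see the rows of the table side by side, here is a small sketch that runs each pattern against one test string. It assumes grep with -E (extended regular expressions); the helper name show_matches and the test string are hypothetical, made up for this illustration. In vi, and in grep or sed without -E, some of these characters generally need a backslash in front of them.

```shell
# Print every match of an extended regex ($1) in a string ($2), joined by spaces.
# show_matches is a hypothetical helper written for this sketch.
show_matches() {
  printf '%s\n' "$2" | grep -E -o "$1" | paste -s -d ' ' -
}

# Hypothetical test string containing a, ab, abb, and abbb.
text='a ab abb abbb'

for pattern in 'ab*' 'ab+' 'ab?' 'ab{2}' 'ab{2,}' 'ab{1,3}'; do
  printf '%-8s -> %s\n' "$pattern" "$(show_matches "$pattern" "$text")"
done
```

Each output line shows which parts of the string a pattern picks up; for instance, ab+ skips the lone "a" because it needs at least one b.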
Find classes of characters
The next step in regex training is to use classes of characters in our pattern matching. What's important to note here is that these classes can be written either as a list, such as [adxz], or as a range, such as [a-z], and that characters are usually case sensitive. The characters in a list are written without separators; a comma inside the brackets would itself be matched.
To see this work in vi, we'll need to turn off the ignore case we set earlier. Let's type: set noic to turn ignore case off again.
Some common classes of characters that are used as ranges are:
- a-z – all lowercase characters
- A-Z – all UPPERCASE characters
- 0-9 – numbers
Now, let's try a search similar to one we ran earlier, this time for tT. Do you notice that it finds nothing? That's because this regex looks for exactly "tT." If we replace it with the class [tT], we'll see that both the lowercase and UPPERCASE T's are matched across the document.
Now, let's chain a couple of class ranges together and see what we get. Try:
Notice that the capital letters and 123 are highlighted, but not the lowercase letters (including the end of line five).
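The difference between a literal pattern and a class can also be checked with grep. This is only a sketch: the chained range [A-Z1-3] is a guess at the kind of pattern described above, not necessarily the one used in the original screenshots, and LC_ALL=C is set because ranges such as [A-Z] can behave differently in some locales.

```shell
# Force byte-order ranges; in some locales [A-Z] can behave differently.
export LC_ALL=C

# Two lines from the article's sample file.
lines=$(printf '%s\n' 'Harder test' 'The dog is lazy')

# The literal pattern tT needs "t" directly followed by "T", so no line matches.
literal_hits=$(printf '%s\n' "$lines" | grep -c 'tT' || true)

# The class [tT] matches either letter, one character at a time.
class_hits=$(printf '%s\n' "$lines" | grep -o '[tT]' | paste -s -d ' ' -)

# Chained ranges: uppercase letters plus the digits 1 through 3.
chained=$(printf '%s\n' 'ABC 123 abc 567' | grep -o '[A-Z1-3]' | paste -s -d ' ' -)

echo "Lines matching tT: $literal_hits"
echo "Matches for [tT]: $class_hits"
echo "Matches for [A-Z1-3]: $chained"
```

With this assumed range, the lowercase letters and the digits 5, 6, and 7 fall outside the class and are left unmatched.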
Flags
The last step in your beginning regex training is to understand flags that exist to search for special types of characters without needing to list them in a range.
- . – any character
- \s – whitespace
- \w – word
- \d – digit (number)
For example, to find all digits in the example text, search for \d. All of the numbers are then highlighted.
To match on the opposite, you usually use the same flag, but in UPPERCASE. For example:
- \S – not a space
- \W – not a word
- \D – not a digit
By using \D, all characters EXCEPT the numbers are highlighted.
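Not every tool understands these backslash shortcuts; grep and sed, for instance, do not reliably accept \d. A portable alternative, sketched below under the assumption that grep -E is available, is the POSIX bracketed class names, which cover the same ground. The comma-joined output format is just a choice made for this sketch.

```shell
# Use plain byte-order matching so the classes behave predictably.
export LC_ALL=C

line='ABC 123 abc 567'

# [[:digit:]] plays the role of \d: runs of digits.
digits=$(printf '%s\n' "$line" | grep -E -o '[[:digit:]]+' | paste -s -d ',' -)

# [[:alnum:]_] plays the role of \w: runs of letters, digits, or underscores.
words=$(printf '%s\n' "$line" | grep -E -o '[[:alnum:]_]+' | paste -s -d ',' -)

# A leading ^ inside the brackets negates the class, much like \D negates \d.
# Whitespace is excluded as well, only to keep the printed output easy to read.
non_digits=$(printf '%s\n' "$line" | grep -E -o '[^[:digit:][:space:]]+' | paste -s -d ',' -)

echo "digits:     $digits"
echo "words:      $words"
echo "non-digits: $non_digits"
```

The negated class is the bracket-expression counterpart of the UPPERCASE flags listed above.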
Searching with sed
A quick note on sed: It's a stream editor, which means you don't interact with a user interface. It takes the stream coming in one side and writes it out the other side.
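Using sed is very similar to vi, except that you give it the regex to search and replace, and it returns the output. For example, a substitution that replaces "dog" with "cat" in the examples file prints the edited text to the screen and leaves the file itself unchanged.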
If you want to save that file, it's only slightly more tricky. You'll need to chain a couple of commands together to a) write the output to a new file, and b) copy it over the top of the first file.
Now, if you look at your examples file, you'll see that the word "dog" has been replaced.
Simple test
Harder test
Extreme test case
ABC 123 abc 567
The cat is lazy
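The two steps described above might look like the following sketch. It recreates the examples file in a temporary directory so it can be run safely; the intermediate file name temp.out is an assumption chosen for illustration.

```shell
# Work in a throwaway directory and recreate the article's sample file.
workdir=$(mktemp -d)
cd "$workdir" || exit 1
printf '%s\n' 'Simple test' 'Harder test' 'Extreme test case' \
  'ABC 123 abc 567' 'The dog is lazy' > examples

# Step 1: print the edited text to the screen; the file itself is unchanged.
sed 's/dog/cat/' examples

# Step 2: write the result to a temporary file, then move it over the original.
# temp.out is a hypothetical name for the intermediate file.
sed 's/dog/cat/' examples > temp.out && mv temp.out examples

last_line=$(tail -n 1 examples)
echo "$last_line"
```

Using && means the original file is replaced only if sed finished without an error.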
For more information
I hope this was a helpful overview of regular expressions. Of course, this is just the tip of the iceberg, and I hope you'll continue to learn about this powerful tool by reviewing the additional resources below.