开源日报每天推荐一个 GitHub 优质开源项目和一篇精选英文科技或编程文章原文，坚持阅读《开源日报》，保持每日学习的好习惯。
今日推荐开源项目：《工具箱 hutool》
今日推荐英文原文：《Should You Use Machine Learning?》

今日推荐开源项目：《工具箱 hutool》传送门：GitHub链接
推荐理由：一个为 Java 语言提供的工具包，旨在将一些常用的代码融合为一个函数方便使用。诸如发送邮件，使用 md5 加密等等代码在实际使用中并不算是少见，与其复制这些代码到各处，不如直接作为函数使用它们，于是就有了这些函数组成的工具包。如果刚好需要这其中的一些功能的话，引入它们所属的模块来简化操作是个不错的选择。

今日推荐英文原文：《Should You Use Machine Learning?》作者：Devin Soni
原文链接：https://medium.com/better-programming/should-you-use-machine-learning-73a7746f7280
推荐理由：机器学习可不是解决问题的万能药

Should You Use Machine Learning?

How to know if machine learning can solve your engineering or business problem

Introduction

It’s become very trendy to use machine learning to solve technical problems. Companies spend a lot of time and effort finding employees who can leverage these techniques to approach complex tasks which may have previously been unsolvable with traditional methods.

However, how do you know if you really need machine learning?

Even though these techniques can be very useful, they don’t fit every situation. If you try to apply machine learning to an inappropriate task, you may waste time and money, and could end up with a poorly-performing model that is not useful.

In this article, I will go through some questions you should ask when determining whether or not machine learning is right for your situation. These should act as a framework to guide your decision-making process.

The Task

The first thing to think about is whether or not your task is well suited to machine learning.

Do you have a well-defined problem with clear inputs and outputs? It is essential that you have a clear idea of what your model would have as inputs and outputs. Otherwise, you may have a difficult time in the feature engineering and evaluation stages of producing a machine learning model.

Do you have metrics that you can use to evaluate a model’s performance and to compare different models? Without an easy way to evaluate models, you will have a difficult time determining whether your model was successful, and in choosing which model to use. You will also have a difficult time iterating upon, and improving, your model as your use-case evolves over time.

Does the problem require an approximate solution? Most machine learning algorithms are used in situations where there is no exact way to find a solution, or the exact solution is too costly to implement. If your problem does have a method to solve it exactly, such as through the use of regular expressions, classical optimization techniques such as linear programming, or older AI techniques such as constraint satisfaction problems, then you may be better off using these methods instead.

Does the problem fit the machine learning paradigm? Most machine learning algorithms rely on the idea that current data will be useful in predicting or classifying future data. If your situation is prone to external events invalidating previous data, then machine learning will most likely not be effective. Similarly, if previous data has no relevance to future data, your model will not learn any useful trends that help you understand incoming data in a real-world setting. It is essential that your model sees relevant past data in order to use machine learning effectively.

The Data

Next, you must determine whether or not your data is suitable for machine learning.

Do you have reliable data labels? Most machine learning methods (the supervised kind) rely on the presence of labels for each data point you have. These labels should be as free of noise as possible, and should be obtainable at a reasonable cost. If your labels are too noisy, either due to inherent situational difficulty in data collection, or due to poor labeling quality, then your models will most likely fail to properly learn the relationships in your data. Additionally, if it is too costly to obtain labels, you may not be able to obtain enough training data over your model’s lifetime for it to be able to learn properly.

Does the data suit machine learning? The data you use to train your model must accurately represent the real-world data that it will be used on. This does not mean that it must perfectly reflect it, but the closer it does, the more useful and accurate your model will be. Even though there are techniques to ameliorate issues surrounding class imbalance and lack of data availability, it is always best if you can sufficiently supply your models with training data that reflects its real-world inputs. If you train your model with biased training data, and the available feature engineering and preprocessing methods are not sufficient, your model may perform unexpectedly poorly when it faces real-world data. For example, this may occur if you train your model with a heavily imbalanced data set in a classification setting. If your model is expecting to see 1% class A and 99% class B based on its training data, it will perform poorly if the real-world situation has 50% class A and 50% class B (assuming your goal is to maximize accuracy).

The Model

Finally, you must investigate the events that may occur during the life cycle of your machine learning model.

Are the effects and risks of the model well-understood? Are you fully aware of the societal effects of your machine learning model? For example, have you researched whether or not this model may further socioeconomic inequality, or if it may create divisions between people in different socioeconomic groups? It is important that you fully understand the effects of your model beyond your specific engineering or business problem. If your model is prone to algorithmic bias, it is important that you try to address this problem ahead of time, and try to create a training pipeline that removes as much bias as possible from the data. In addition, it is important to be aware of how this model may be abused by adversaries. Are there any ways for personal information to be obtained by reverse engineering the model’s outputs? Can industry secrets be leaked? With respect to these issues, it is important that you understand what data is visible through the outputs of your model, and who is able to access these outputs directly. It may be useful to obfuscate the outputs, so that you can control what information is revealed, and provide strictly what is necessary.

Can you maintain the model over time? In most cases, machine learning models are used throughout time, and are not confined to a single instance of usage. So, it is important that your organization has people who can maintain the model over time. As real-world situations change and drift over time, you are likely to need to retrain your model on current data. For example, a model that inputs natural language will need to be retrained periodically to incorporate changes in language usage such as slang. Even if the real-world data does not change over time, you may want to study the model’s errors and continuously iterate on the model in order to improve performance. Therefore, you most likely need a dedicated group of employees who are able to monitor and improve upon the model. Otherwise, it may quickly become obsolete or even useless, depending on how prone its domain is to change.

下载开源日报APP：https://openingsource.org/2579/
加入我们：https://openingsource.org/about/join/
关注我们：https://openingsource.org/about/love/

hutool Java

开源日报第495期：《工具箱 hutool》