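Open Source Daily recommends one high-quality open source project on GitHub and one selected English-language technology or programming article every day. Keep reading Open Source Daily and maintain the good habit of learning every day.
Today's recommended open source project: "Toolbox hutool"
Today's recommended English article: "Should You Use Machine Learning?"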

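Today's recommended open source project: "Toolbox hutool" Link: GitHub link
Why we recommend it: hutool is a utility toolkit for Java that wraps commonly used code into single functions for convenience. Code for tasks such as sending email or computing MD5 hashes is hardly rare in practice. Rather than copying that code into every project, it is better to call it as a function, and that is how this toolkit of functions came about. If you happen to need some of these features, pulling in the module they belong to is a good way to simplify your work.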

Today's recommended English article: "Should You Use Machine Learning?" Author: Devin Soni
Original link: https://medium.com/better-programming/should-you-use-machine-learning-73a7746f7280
Why we recommend it: Machine learning is no cure-all for every problem.

Should You Use Machine Learning?

How to know if machine learning can solve your engineering or business problem

Introduction

It's become very trendy to use machine learning to solve technical problems. Companies spend a lot of time and effort finding employees who can leverage these techniques to approach complex tasks that may have previously been unsolvable with traditional methods.

However, how do you know if you really need machine learning?

Even though these techniques can be very useful, they don't fit every situation. If you try to apply machine learning to an inappropriate task, you may waste time and money, and could end up with a poorly performing model that is not useful.

In this article, I will go through some questions you should ask when determining whether or not machine learning is right for your situation. These should act as a framework to guide your decision-making process.

The Task

The first thing to think about is whether or not your task is well suited to machine learning.

Do you have a well-defined problem with clear inputs and outputs? It is essential that you have a clear idea of what your model would have as inputs and outputs. Otherwise, you may have a difficult time in the feature engineering and evaluation stages of producing a machine learning model.

Do you have metrics that you can use to evaluate a model's performance and to compare different models? Without an easy way to evaluate models, you will have a difficult time determining whether your model was successful and choosing which model to use. You will also have a difficult time iterating on and improving your model as your use case evolves over time.
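As a rough illustration of what such metrics can look like for a binary classifier, the sketch below computes precision, recall, and accuracy from true and predicted labels. The function names and the toy labels are made up for this example and are not taken from the original article; the right metric for a given task depends on the cost of each kind of error.

```python
def confusion_counts(y_true, y_pred, positive=1):
    """Count true/false positives and negatives for one positive class."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == positive and p == positive)
    fp = sum(1 for t, p in pairs if t != positive and p == positive)
    fn = sum(1 for t, p in pairs if t == positive and p != positive)
    tn = sum(1 for t, p in pairs if t != positive and p != positive)
    return tp, fp, fn, tn


def precision_recall_accuracy(y_true, y_pred, positive=1):
    """Return (precision, recall, accuracy); undefined ratios fall back to 0.0."""
    tp, fp, fn, tn = confusion_counts(y_true, y_pred, positive)
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    accuracy = (tp + tn) / len(y_true)
    return precision, recall, accuracy


if __name__ == "__main__":
    # Toy labels, invented for illustration only.
    y_true = [1, 0, 1, 1, 0, 0]
    model_a = [1, 0, 0, 1, 0, 1]
    model_b = [1, 0, 1, 1, 0, 0]
    for name, preds in (("model_a", model_a), ("model_b", model_b)):
        p, r, a = precision_recall_accuracy(y_true, preds)
        print(f"{name}: precision={p:.2f} recall={r:.2f} accuracy={a:.2f}")
```

With numbers like these computed on the same held-out data, two candidate models can be compared directly instead of by impression.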

Does the problem require an approximate solution? Most machine learning algorithms are used in situations where there is no exact way to find a solution, or the exact solution is too costly to implement. If your problem does have a method to solve it exactly, such as through the use of regular expressions, classical optimization techniques such as linear programming, or older AI techniques such as constraint satisfaction problems, then you may be better off using these methods instead.
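To make the regular expression case concrete, here is a minimal sketch of a task that has an exact rule-based solution: pulling dates written as YYYY-MM-DD out of free text. The task, the pattern, and the function name are assumptions chosen for illustration; the pattern checks only the shape of the date, not whether the month or day is valid.

```python
import re

# Four digits, a hyphen, two digits, a hyphen, two digits, on word boundaries.
ISO_DATE = re.compile(r"\b\d{4}-\d{2}-\d{2}\b")


def extract_iso_dates(text):
    """Return every YYYY-MM-DD shaped substring found in the text, in order."""
    return ISO_DATE.findall(text)


if __name__ == "__main__":
    print(extract_iso_dates("Shipped 2019-08-01, returned 2019-08-15."))
    # prints ['2019-08-01', '2019-08-15']
```

No training data, labels, or evaluation loop is needed here, which is the point: when a rule captures the problem exactly, a learned model adds cost without adding accuracy.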

Does the problem fit the machine learning paradigm? Most machine learning algorithms rely on the idea that current data will be useful in predicting or classifying future data. If your situation is prone to external events invalidating previous data, then machine learning will most likely not be effective. Similarly, if previous data has no relevance to future data, your model will not learn any useful trends that help you understand incoming data in a real-world setting. It is essential that your model sees relevant past data in order to use machine learning effectively.

The Data

Next, you must determine whether or not your data is suitable for machine learning.

Do you have reliable data labels? Most machine learning methods (the supervised kind) rely on the presence of labels for each data point you have. These labels should be as free of noise as possible, and should be obtainable at a reasonable cost. If your labels are too noisy, either due to inherent situational difficulty in data collection, or due to poor labeling quality, then your models will most likely fail to properly learn the relationships in your data. Additionally, if it is too costly to obtain labels, you may not be able to obtain enough training data over your model's lifetime for it to be able to learn properly.

Does the data suit machine learning? The data you use to train your model must accurately represent the real-world data that it will be used on. This does not mean that it must perfectly reflect it, but the closer it does, the more useful and accurate your model will be. Even though there are techniques to ameliorate issues surrounding class imbalance and lack of data availability, it is always best if you can supply your models with enough training data that reflects their real-world inputs. If you train your model with biased training data, and the available feature engineering and preprocessing methods are not sufficient, your model may perform unexpectedly poorly when it faces real-world data. For example, this may occur if you train your model with a heavily imbalanced data set in a classification setting. If your model is expecting to see 1% class A and 99% class B based on its training data, it will perform poorly if the real-world situation has 50% class A and 50% class B (assuming your goal is to maximize accuracy).
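The arithmetic behind the 1%/99% example can be worked through with a deliberately extreme stand-in for a model: one that has learned nothing except the majority class of its training data. This is an assumption made to keep the example small, not a claim about how any real model behaves, and the function name is hypothetical.

```python
from collections import Counter


def majority_class_accuracy(train_labels, test_labels):
    """Accuracy of always predicting the most common training label."""
    majority = Counter(train_labels).most_common(1)[0][0]
    correct = sum(1 for label in test_labels if label == majority)
    return correct / len(test_labels)


if __name__ == "__main__":
    train = ["A"] * 1 + ["B"] * 99          # 1% class A, 99% class B
    same_mix = ["A"] * 1 + ["B"] * 99       # deployment matches training
    balanced = ["A"] * 50 + ["B"] * 50      # deployment is 50/50

    print(majority_class_accuracy(train, same_mix))  # prints 0.99
    print(majority_class_accuracy(train, balanced))  # prints 0.5
```

The same predictor scores 99% when the real-world mix matches the training mix and 50% when it does not, so a high accuracy measured on imbalanced data says little about performance under a different class distribution.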

The Model

Finally, you must investigate the events that may occur during the life cycle of your machine learning model.

Are the effects and risks of the model well-understood? Are you fully aware of the societal effects of your machine learning model? For example, have you researched whether or not this model may further socioeconomic inequality, or if it may create divisions between people in different socioeconomic groups? It is important that you fully understand the effects of your model beyond your specific engineering or business problem. If your model is prone to algorithmic bias, it is important that you try to address this problem ahead of time, and try to create a training pipeline that removes as much bias as possible from the data. In addition, it is important to be aware of how this model may be abused by adversaries. Are there any ways for personal information to be obtained by reverse engineering the model's outputs? Can industry secrets be leaked? With respect to these issues, it is important that you understand what data is visible through the outputs of your model, and who is able to access these outputs directly. It may be useful to obfuscate the outputs, so that you can control what information is revealed, and provide strictly what is necessary.
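One simple way to limit what the outputs reveal is to expose only the predicted label and a coarse confidence bucket instead of the full probability vector. The sketch below assumes a model that produces a dictionary of class probabilities; the function name, the threshold, and the bucket names are hypothetical. Coarsening outputs reduces the information available to a caller but is not, on its own, a privacy guarantee.

```python
def public_prediction(probabilities, threshold=0.8):
    """Reduce a class-probability dict to a label and a coarse confidence.

    probabilities: mapping from class name to the model's probability.
    threshold: hypothetical cut-off separating "high" from "low" confidence.
    """
    label = max(probabilities, key=probabilities.get)
    confidence = "high" if probabilities[label] >= threshold else "low"
    return {"label": label, "confidence": confidence}


if __name__ == "__main__":
    raw = {"spam": 0.93, "ham": 0.07}  # internal output, never returned as-is
    print(public_prediction(raw))
    # prints {'label': 'spam', 'confidence': 'high'}
```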

Can you maintain the model over time? In most cases, machine learning models stay in use for a long period rather than being run once. So, it is important that your organization has people who can maintain the model over time. As real-world situations change and drift, you are likely to need to retrain your model on current data. For example, a model that takes natural language as input will need to be retrained periodically to incorporate changes in language usage such as slang. Even if the real-world data does not change over time, you may want to study the model's errors and continuously iterate on the model in order to improve performance. Therefore, you most likely need a dedicated group of employees who are able to monitor and improve upon the model. Otherwise, it may quickly become obsolete or even useless, depending on how prone its domain is to change.
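Part of that monitoring can be automated. The sketch below flags a model for retraining when its accuracy on recent, freshly labeled predictions falls more than a tolerance below the accuracy measured at deployment. The function name, the tolerance value, and the use of accuracy as the tracked metric are all assumptions for illustration; a real monitoring setup would also need a way to obtain the fresh labels.

```python
def needs_retraining(baseline_accuracy, recent_outcomes, tolerance=0.05):
    """Return True when recent accuracy drops below baseline minus tolerance.

    recent_outcomes: list of booleans, True where a recent prediction
    turned out to be correct.
    """
    if not recent_outcomes:
        return False  # nothing to judge yet
    recent_accuracy = sum(recent_outcomes) / len(recent_outcomes)
    return recent_accuracy < baseline_accuracy - tolerance


if __name__ == "__main__":
    healthy = [True] * 88 + [False] * 12   # 88% correct recently
    drifted = [True] * 80 + [False] * 20   # 80% correct recently
    print(needs_retraining(0.90, healthy))  # prints False
    print(needs_retraining(0.90, drifted))  # prints True
```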

Download the Open Source Daily app: https://openingsource.org/2579/
Join us: https://openingsource.org/about/join/
Follow us: https://openingsource.org/about/love/

Open Source Daily, Issue 495: "Toolbox hutool"