- Subject: random forest vs support vector machine
Train on the training set.
Predict on the testing set.
The training and testing sets are independent.
In your experience, which gives higher prediction accuracy on the testing set?
--
FROM 129.120.103.*
Reposting a reference answer:
I would say, the choice depends very much on what data you have and what your purpose is. A few "rules of thumb":
Random Forest is intrinsically suited for multiclass problems, while SVM is intrinsically two-class. For a multiclass problem, SVM requires reducing it into multiple binary classification problems (e.g., one-vs-rest or one-vs-one).
Random Forest works well with a mixture of numerical and categorical features. It is also fine when features are on very different scales. Roughly speaking, with Random Forest you can use the data as they are. SVM maximizes the "margin" and thus relies on a notion of "distance" between points; it is up to you to decide whether that "distance" is meaningful. As a consequence, one-hot encoding for categorical features is a must, and min-max or other scaling is highly recommended at the preprocessing step.
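A minimal sketch of that preprocessing difference, assuming scikit-learn (the toy columns `income`, `age`, `city` are made up for illustration):

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.ensemble import RandomForestClassifier
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import MinMaxScaler, OneHotEncoder
from sklearn.svm import SVC

# Hypothetical data with mixed numerical and categorical features.
rng = np.random.default_rng(0)
X = pd.DataFrame({
    "income": rng.uniform(1e4, 1e6, 200),      # large scale
    "age": rng.integers(18, 90, 200),          # small scale
    "city": rng.choice(["a", "b", "c"], 200),  # categorical
})
y = rng.integers(0, 2, 200)

# Random Forest: features can be used essentially as they are
# (the categorical column just needs some integer encoding).
rf = RandomForestClassifier(n_estimators=100, random_state=0)
rf.fit(X.assign(city=X["city"].astype("category").cat.codes), y)

# SVM: one-hot encode categoricals and rescale numerical features,
# because the margin is defined in terms of distances.
svm = Pipeline([
    ("prep", ColumnTransformer([
        ("num", MinMaxScaler(), ["income", "age"]),
        ("cat", OneHotEncoder(), ["city"]),
    ])),
    ("clf", SVC(kernel="rbf")),
])
svm.fit(X, y)
```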
If you have data with n points and m features, an intermediate step in SVM is constructing an n×n matrix (think about the memory required to store it) by computing n² dot products (computational complexity). Therefore, as a rule of thumb, SVM is hardly scalable beyond roughly 10^5 points. A large number of features (homogeneous features with a meaningful distance; the pixels of an image would be a perfect example) is generally not a problem.
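The memory side of that rule of thumb is easy to check with back-of-the-envelope arithmetic: a dense n×n kernel matrix of float64 values takes 8·n² bytes, which is already about 75 GiB at n = 10^5 (actual SVM solvers cache only part of the matrix, but the scaling is the point):

```python
# Dense n x n float64 kernel matrix: 8 * n^2 bytes.
for n in (10**4, 10**5, 10**6):
    gib = 8 * n**2 / 2**30
    print(f"n = {n:>7}: {gib:10.1f} GiB")
```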
For a classification problem, Random Forest gives you the probability of belonging to each class. SVM gives you the distance to the boundary; if you need a probability, you still have to convert that distance into one somehow.
For those problems where SVM applies, it generally performs better than Random Forest.
SVM gives you "support vectors", that is, the points in each class that are closest to the boundary between the classes. They may be of interest in themselves for interpretation.
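The probability and support-vector points above can be seen directly in scikit-learn; a sketch on a synthetic dataset:

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.svm import SVC

X, y = make_classification(n_samples=300, n_features=5, random_state=0)

# Random Forest: class-membership probabilities out of the box.
rf = RandomForestClassifier(random_state=0).fit(X, y)
proba = rf.predict_proba(X[:3])      # one row per sample, rows sum to 1

# SVM: signed distance to the decision boundary; getting probabilities
# needs extra calibration (SVC(probability=True) does Platt scaling).
svm = SVC(kernel="rbf").fit(X, y)
dist = svm.decision_function(X[:3])  # real-valued margins, not probabilities

# The "support vectors": training points closest to the boundary.
sv = svm.support_vectors_
print(proba.shape, dist.shape, sv.shape)
```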
--
FROM 174.113.17.*
For the datasets you have actually used, in your first-hand experience, which performs better?
[Quoted from jfgao's post:]
: Reposting a reference answer:
:
: I would say, the choice depends very much on what data you have and what your purpose is. A few "rules of thumb":
: ...................
--
FROM 129.120.103.*
Why compare these two? RF should really be compared with GBM.
SVM is very different from both, so there is little point in comparing them.
[Quoted from MicroSat's post:]
: Train on the training set.
: Predict on the testing set.
: The training and testing sets are independent.
: ...................
--
FROM 125.33.244.*
Are you saying SVM's performance is far worse than RF's, so there is no point comparing it with RF?
And that RF should be compared with GBM instead?
[Quoted from Bernstein's post:]
: Why compare these two? RF should really be compared with GBM.
: SVM is very different from both, so there is little point in comparing them.
:
--
FROM 129.120.103.*
I'm not sure about the performance difference,
but RF and GBM are both decision-tree-based ensemble methods, so the two are probably more comparable.
[Quoted from MicroSat's post:]
: Are you saying SVM's performance is far worse than RF's, so there is no point comparing it with RF?
: And that RF should be compared with GBM instead?
--
FROM 125.33.244.*
In my own experience, the two are about the same; the difference in accuracy is on the order of 91% vs. 92%.
[Quoted from MicroSat's post:]
: Train on the training set.
: Predict on the testing set.
: The training and testing sets are independent.
: ...................
--
FROM 92.196.119.*
Thanks! Where can I find such training and testing sets, along with accuracy values that others have measured on them?
Kaggle does not release the true labels for its test sets, so they cannot be used to compute prediction accuracy.
[Quoted from shong's post:]
: In my own experience, the two are about the same; the difference in accuracy is on the order of 91% vs. 92%.
--
FROM 129.120.103.*
The data came from my own research project.
You can also look around on GitHub; some repositories include their datasets alongside the code.
[Quoted from MicroSat's post:]
: Thanks! Where can I find such training and testing sets, along with accuracy values that others have measured on them?
: Kaggle does not release the true labels for its test sets, so they cannot be used to compute prediction accuracy.
:
--
FROM 92.196.119.*