labeled
Labeled point
A labeled point is a local vector, either dense or sparse, associated with a label/response. In MLlib, labeled points are used in supervised learning algorithms. We use a double to store a label, so we can use labeled points in both regression and classification. For binary classification, a label should be either 0 (negative) or 1 (positive). For multiclass classification, labels should be class indices starting from zero: 0, 1, 2, ….
标记点是与标签/响应相关联的密集或稀疏的局部矢量。在MLlib中,标记点用于监督学习算法。我们使用double来存储标签,因此我们可以在回归和分类中使用标记点。对于二进制分类,标签应为0(负)或1(正)。对于多类分类,标签应该是从零开始的类索引:0, 1, 2, …。
A labeled point is represented by LabeledPoint.
标记点表示为 LabeledPoint。
Refer to the LabeledPoint Python docs for more details on the API.
有关API的更多详细信息,请参阅LabeledPointPython文档。
from pyspark.mllib.linalg import SparseVector
from pyspark.mllib.regression import LabeledPoint
# Create a labeled point with a positive label and a dense feature vector.
pos = LabeledPoint(1.0, [1.0, 0.0, 3.0])
# Create a labeled point with a negative label and a sparse feature vector.
neg = LabeledPoint(0.0, SparseVector(3, [0, 2], [1.0, 3.0]))
Sparse data稀疏数据
It is very common in practice to have sparse training data. MLlib supports reading training examples stored in LIBSVM format, which is the default format used by LIBSVM and LIBLINEAR. It is a text format in which each line represents a labeled sparse feature vector using the following format:
在实践中很常见的是具有稀疏的训练数据。MLlib支持读取以LIBSVM格式存储的训练样例,格式是LIBSVM和 使用的默认格式 LIBLINEAR。它是一种文本格式,其中每一行使用以下格式表示标记的稀疏特征向量:
label index1:value1 index2:value2 ...
where the indices are one-based and in ascending order. After loading, the feature indices are converted to zero-based.
其中索引是一个从1 开始升序的。加载后,要素索引将转换为从零开始。
Python
MLUtils.loadLibSVMFile reads training examples stored in LIBSVM format.
MLUtils.loadLibSVMFile 读取以LIBSVM格式存储的训练样例。
Refer to the MLUtils Python docs for more details on the API.
有关API的更多详细信息,请参阅MLUtilsPython文档。
from pyspark.mllib.util import MLUtils
examples = MLUtils.loadLibSVMFile(sc, "data/mllib/sample_libsvm_data.txt")
内容来自http://spark.apache.org/docs/latest/mllib-data-types.html