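Preface
LibFFM has achieved very good results in several past CTR competitions. In practice, though, its biggest hurdle is its special libffm data format. This post covers two topics:
* A brief description of the libffm data format
* A single-machine implementation that converts a pandas DataFrame to libffm (optimized)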
What is the libffm format
This section is mainly based on: https://www.jianshu.com/p/9c2c2421ef2e
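Suppose we have the following data (a single sample with label 1):
User: YuChin
Movie: 3Idiots
Genre: Comedy, Drama
Price: 9.99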
Where:
- User / Movie / Genre are categorical and can be one-hot encoded
- Price is a continuous value and needs no one-hot conversion
Conversion process
First, identify the fields and the features.
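For the dataset above:
- Fields:
User -> 0
Movie -> 1
Genre -> 2
Price -> 3
- Features:
User-YuChin -> 0
Movie-3Idiots -> 1
Genre-Comedy -> 2
Genre-Drama -> 3
Price -> 4
Conversion result
Each line has the form label field:feature:value, with one field:feature:value token per non-zero feature. Categorical features get the value 1, and continuous features keep their raw value. For the sample above, with label 1:
1 0:0:1 1:1:1 2:2:1 2:3:1 3:4:9.99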
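As a minimal sketch of this mapping, the snippet below encodes the single sample by hand. The tables `field_ids` and `feature_ids` and the helper `encode_row` are hypothetical names used only for illustration, and the sketch assumes that ids are numbered consecutively from 0.

```python
# Hypothetical id tables for the single sample (0-based, consecutive).
field_ids = {"User": 0, "Movie": 1, "Genre": 2, "Price": 3}
feature_ids = {
    "User-YuChin": 0,
    "Movie-3Idiots": 1,
    "Genre-Comedy": 2,
    "Genre-Drama": 3,
    "Price": 4,
}


def encode_row(label, categorical, numeric):
    """Build one libffm line: label field:feature:value ...

    categorical: list of (field, category) pairs, each encoded with value 1
    numeric: list of (field, value) pairs, encoded with the raw value
    """
    tokens = [str(label)]
    for field, category in categorical:
        feature = feature_ids[f"{field}-{category}"]
        tokens.append(f"{field_ids[field]}:{feature}:1")
    for field, value in numeric:
        tokens.append(f"{field_ids[field]}:{feature_ids[field]}:{value}")
    return " ".join(tokens)


line = encode_row(
    label=1,
    categorical=[("User", "YuChin"), ("Movie", "3Idiots"),
                 ("Genre", "Comedy"), ("Genre", "Drama")],
    numeric=[("Price", 9.99)],
)
print(line)
```

Note that Genre contributes two tokens with the same field id, one per category that is present.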
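Conversion code
A solution from Kaggle
Someone on Kaggle shared an implementation:
https://www.kaggle.com/mpearmain/pandas-to-libffm
I won't paste the code here, but it uses two nested for loops, so it is not going to be fast, especially for advertising data, which is usually very large.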
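To make the cost concrete, here is a rough sketch of what a row-by-row conversion with two nested loops can look like. It is not the code from the Kaggle notebook: `naive_df2libffm`, its arguments, and the toy column names are hypothetical and exist only to illustrate the pattern. Every cell passes through Python-level work, which is what hurts on tens of millions of rows.

```python
import pandas as pd


def naive_df2libffm(df, y, non_categorical_cols=()):
    """Row-by-row conversion with two nested loops (illustration only)."""
    feature_cols = [c for c in df.columns if c != y]
    feature_ids = {}  # (column, value) or (column,) -> feature id
    lines = []
    for record in df.to_dict("records"):                  # outer loop: rows
        tokens = [str(record[y])]
        for field_id, col in enumerate(feature_cols):     # inner loop: columns
            if col in non_categorical_cols:
                # Numeric column: one feature id, raw value kept.
                key, value = (col,), record[col]
            else:
                # Categorical column: one feature id per distinct value.
                key, value = (col, record[col]), 1
            # Ids are handed out in order of first appearance.
            feature_id = feature_ids.setdefault(key, len(feature_ids))
            tokens.append(f"{field_id}:{feature_id}:{value}")
        lines.append(" ".join(tokens))
    return lines


demo = pd.DataFrame({
    "click": [1, 0],
    "site": ["a", "b"],
    "price": [0.5, 1.5],
})
lines = naive_df2libffm(demo, y="click", non_categorical_cols=["price"])
print("\n".join(lines))
```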
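A parallel implementation
https://blog.csdn.net/songbinxu/article/details/80298195
Main idea:
Split the data into several parts, let each core process one part on its own, and merge the results at the end.
In my view, the biggest problem with this approach is distributing the data to the cores: the serialization step is not very fast.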
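The split, compute, merge idea can be sketched as follows. This is not the code from the linked post: `build_feature_ids`, `encode_chunk`, `parallel_df2libffm`, and the `mapper` parameter are hypothetical, and the sketch assumes every feature column is categorical. The id table is built once over the full data so that all chunks agree on feature ids. The demo runs the chunks with the built-in `map`; passing the `map` method of a `multiprocessing.Pool` instead would send each task to a worker process, and pickling those tasks is the serialization cost mentioned above.

```python
import pandas as pd


def build_feature_ids(df, cols):
    """Assign one global feature id per (column, value) pair."""
    feature_ids, base = {}, 0
    for col in cols:
        values = sorted(df[col].unique())
        feature_ids[col] = {v: base + i for i, v in enumerate(values)}
        base += len(values)
    return feature_ids


def encode_chunk(task):
    """Worker: turn one chunk into libffm lines using the shared id table."""
    chunk, y, cols, feature_ids = task
    lines = []
    for record in chunk.to_dict("records"):
        tokens = [str(record[y])]
        for field_id, col in enumerate(cols):
            tokens.append(f"{field_id}:{feature_ids[col][record[col]]}:1")
        lines.append(" ".join(tokens))
    return lines


def parallel_df2libffm(df, y, n_chunks=2, mapper=map):
    cols = [c for c in df.columns if c != y]
    feature_ids = build_feature_ids(df, cols)
    size = max(1, -(-len(df) // n_chunks))  # ceiling division
    tasks = [(df.iloc[i:i + size], y, cols, feature_ids)
             for i in range(0, len(df), size)]
    # Each task (chunk plus id table) is what a process pool would pickle.
    results = mapper(encode_chunk, tasks)
    # Merge the per-chunk results in their original order.
    return [line for part in results for line in part]


demo = pd.DataFrame({
    "click": [1, 0, 1],
    "site": ["a", "b", "a"],
    "app": ["x", "x", "y"],
})
lines = parallel_df2libffm(demo, y="click", n_chunks=2)
print("\n".join(lines))
```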
My single-core implementation:
```python
import datetime

import pandas as pd


def df2libffm(df, save_file=None, y=None, non_categorical_cols=None):
    """
    df: data source, should be a pandas DataFrame
    save_file: output file name; if None, saving is skipped
    y: name of the label column in the DataFrame
    non_categorical_cols: columns in this list will not be one-hot encoded.
        example:
            price: 0.99$
            click rate: 0.03
    """
    assert y is not None
    # Avoid a mutable default argument.
    non_categorical_cols = non_categorical_cols or []

    # Use a fresh RangeIndex so the Series built below align row by row.
    df = df.reset_index(drop=True)
    row_cnt = df.shape[0]
    out = pd.DataFrame({y: df[y]})
    print("out.shape:", out.shape)

    df = df.drop(columns=[y])
    feature_base = 0
    for idx, col in enumerate(df.columns.tolist()):
        dt = datetime.datetime.now()
        print(str(dt), idx, col)

        cur_field_id = idx
        field_series = pd.Series([cur_field_id] * row_cnt).astype(str)
        if col in non_categorical_cols:
            # Non-categorical feature: a single feature id, no matter how
            # many distinct values the column has.
            feature_series = pd.Series([feature_base] * row_cnt).astype(str)
            feature_base += 1
            value_series = df[col].astype(str)
            new_col = field_series + ":" + feature_series + ":" + value_series
            out[str(cur_field_id)] = new_col.values
        else:
            # Categorical feature: one feature id per distinct value.
            # Cast the codes to int64 so adding a large base cannot overflow.
            codes = df[col].astype("category").cat.codes.values.astype("int64")
            feature_series = pd.Series(codes + feature_base).astype(str)
            feature_base += feature_series.unique().shape[0]
            print("next feature base:", feature_base)
            new_col = field_series + ":" + feature_series + ":1"
            out[str(cur_field_id)] = new_col.values

    if save_file:
        file_name = save_file
        if not file_name.endswith(".txt"):
            file_name += ".txt"
        print("save file name:", file_name)
        out.to_csv(file_name, sep=" ", header=False, index=False)
    return out


# Usage sample
data = pd.read_parquet("xxx.parquet")  # your data
df_ffm = df2libffm(data, y="click", save_file="data/train_int.ffm",
                   non_categorical_cols=["hour"])
```
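The speed of the function above comes from handling a whole column at once instead of touching each cell in Python. A small sketch of that per-column step on a toy DataFrame follows; the columns `site` and `price` and the variable names are made up for illustration.

```python
import pandas as pd

toy = pd.DataFrame({"site": ["a", "b", "a"], "price": [0.5, 1.5, 2.0]})

feature_base = 0

# Categorical column (field 0): category codes become feature ids.
codes = toy["site"].astype("category").cat.codes.astype("int64")
site_tokens = "0:" + (codes + feature_base).astype(str) + ":1"
feature_base += codes.nunique()

# Numeric column (field 1): a single feature id, raw value kept.
price_tokens = "1:" + str(feature_base) + ":" + toy["price"].astype(str)

print(site_tokens.tolist())
print(price_tokens.tolist())
```

The string concatenation runs over the whole Series in one call, so the only Python-level loop left is the one over columns.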
Actual performance:
Data used: the training data of the 2014 Kaggle Avazu competition (for convenience, the test data was not used)
data.shape: (40428967, 24)
Run log:
```
('out.shape:', (40428967, 1))
('2019-02-19 08:17:50.134204', 0, 'hour')
('2019-02-19 08:19:31.047506', 1, 'C1')
next feature base: 8
('2019-02-19 08:20:36.779871', 2, 'banner_pos')
next feature base: 15
('2019-02-19 08:21:44.295403', 3, 'site_id')
next feature base: 4752
('2019-02-19 08:22:55.094539', 4, 'site_domain')
next feature base: 12497
('2019-02-19 08:24:05.746931', 5, 'site_category')
next feature base: 12523
('2019-02-19 08:25:16.551758', 6, 'app_id')
next feature base: 21075
('2019-02-19 08:26:27.508076', 7, 'app_domain')
next feature base: 21634
('2019-02-19 08:27:38.011073', 8, 'app_category')
next feature base: 21670
('2019-02-19 08:28:47.997322', 9, 'device_id')
next feature base: 2708078
('2019-02-19 08:30:03.525996', 10, 'device_ip')
next feature base: 9437564
('2019-02-19 08:31:28.049253', 11, 'device_model')
next feature base: 9445815
('2019-02-19 08:32:43.284236', 12, 'device_type')
next feature base: 9445820
('2019-02-19 08:33:56.265743', 13, 'device_conn_type')
next feature base: 9445824
('2019-02-19 08:35:10.996536', 14, 'C14')
next feature base: 9448450
('2019-02-19 08:36:25.697655', 15, 'C15')
next feature base: 9448458
('2019-02-19 08:37:39.690792', 16, 'C16')
next feature base: 9448467
('2019-02-19 08:38:55.148370', 17, 'C17')
next feature base: 9448902
('2019-02-19 08:40:11.580086', 18, 'C18')
```
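As the log shows, each column takes between one and two minutes to process (about 65 to 100 seconds here), which easily beats the two nested for loops above!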
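The per-column time can be read off the log by subtracting consecutive timestamps. A small sketch using the first four timestamps from the log (the variable names are only for illustration):

```python
from datetime import datetime

# First four column-start timestamps copied from the run log.
stamps = [
    ("hour", "2019-02-19 08:17:50.134204"),
    ("C1", "2019-02-19 08:19:31.047506"),
    ("banner_pos", "2019-02-19 08:20:36.779871"),
    ("site_id", "2019-02-19 08:21:44.295403"),
]

times = [datetime.strptime(ts, "%Y-%m-%d %H:%M:%S.%f") for _, ts in stamps]

# A column's duration is the gap until the next column starts.
durations = [
    (later - earlier).total_seconds()
    for earlier, later in zip(times, times[1:])
]

for (name, _), seconds in zip(stamps, durations):
    print(f"{name}: {seconds:.1f} s")
```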
