There is little difference between using the external memory version and the in-memory version. The only difference is the filename format.
The external memory version takes a filename in the following format:
filename#cacheprefix
The filename is the normal path to the libsvm file you want to load, and cacheprefix is a path to a cache file that XGBoost will use for the external memory cache.
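To make the filename#cacheprefix convention concrete, here is a minimal sketch of composing and splitting such a string. The helper names make_external_memory_uri and split_external_memory_uri are hypothetical and are not part of XGBoost; they only illustrate the string format described above.

```python
def make_external_memory_uri(filename, cacheprefix):
    """Join a data path and a cache prefix as 'filename#cacheprefix'."""
    # '#' is the separator, so it must not appear in either part.
    if "#" in filename or "#" in cacheprefix:
        raise ValueError("'#' is reserved as the separator")
    return f"{filename}#{cacheprefix}"


def split_external_memory_uri(uri):
    """Split 'filename#cacheprefix' into (filename, cacheprefix).

    Returns (uri, None) when no cache prefix is present, which
    corresponds to the plain in-memory form of the path.
    """
    filename, sep, cacheprefix = uri.partition("#")
    return filename, (cacheprefix if sep else None)


uri = make_external_memory_uri("../data/agaricus.txt.train", "dtrain.cache")
print(uri)
print(split_external_memory_uri(uri))
```

A path without the # part is just an ordinary in-memory path, which is why the second helper returns None for the cache prefix in that case.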
Note
External memory is not available with GPU algorithms: it cannot be used when tree_method is set to gpu_exact or gpu_hist.
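One way to catch this restriction early is to check the parameters before training. The sketch below is illustrative only: check_external_memory_params is a hypothetical helper, not an XGBoost function, and it assumes that the presence of # in the data path is how external memory is requested.

```python
# tree_method values named in the note above as incompatible with external memory.
GPU_TREE_METHODS = {"gpu_exact", "gpu_hist"}


def check_external_memory_params(params, data_uri):
    """Raise ValueError if a GPU tree_method is combined with a cached path."""
    uses_external_memory = "#" in data_uri
    if uses_external_memory and params.get("tree_method") in GPU_TREE_METHODS:
        raise ValueError(
            "external memory is not available with tree_method="
            + repr(params["tree_method"])
        )
    return params


# Passes: no GPU tree_method is requested.
check_external_memory_params({"nthread": 4}, "train.txt#dtrain.cache")
```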
The following code was extracted from demo/guide-python/external_memory.py:
dtrain = xgb.DMatrix('../data/agaricus.txt.train#dtrain.cache')
You can see that there is an additional #dtrain.cache following the libsvm file; this is the name of the cache file.
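For context, a minimal end-to-end sketch of training from such a DMatrix might look like the following. It assumes an XGBoost version that accepts the filename#cacheprefix form described here. The function name train_with_external_memory and the parameter values are illustrative assumptions, not taken from the demo script.

```python
def train_with_external_memory(data_path, cache_prefix, num_round=2, nthread=4):
    """Train a small model from a libsvm file using the external memory cache."""
    # Imported lazily so that reading this sketch does not require XGBoost.
    import xgboost as xgb

    # Appending '#<cacheprefix>' to the path requests external memory mode.
    dtrain = xgb.DMatrix(f"{data_path}#{cache_prefix}")

    # Illustrative parameters (assumptions, not from the demo script).
    params = {"objective": "binary:logistic", "nthread": nthread}
    return xgb.train(params, dtrain, num_boost_round=num_round)
```

Calling train_with_external_memory('../data/agaricus.txt.train', 'dtrain.cache') from the demo directory would load the same file as the line above, with dtrain.cache as the cache prefix.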
For the CLI version, simply add the cache suffix, e.g. "../data/agaricus.txt.train#dtrain.cache".
nthread should be set to the number of physical cores. Most modern CPUs use hyper-threading, so a 4-core CPU may expose 8 threads; set nthread to 4 for maximum performance in such a case.
The external memory mode naturally works with the distributed version; you can simply set the path like:
data = "hdfs://path-to-data/#dtrain.cache"
XGBoost will cache the data to a local location. When you run on YARN, the current folder is temporary, so you can directly use dtrain.cache to cache to the current folder.
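Returning to the nthread recommendation, the following is a rough sketch of choosing a value from the logical CPU count. It rests on a stated assumption: each physical core exposes two hardware threads (typical hyper-threading), which does not hold on every machine. The function name suggested_nthread is hypothetical.

```python
import os


def suggested_nthread(logical_cpus=None, threads_per_core=2):
    """Estimate the number of physical cores from the logical CPU count.

    threads_per_core=2 is an assumption; pass 1 if hyper-threading
    is disabled or unavailable on the machine.
    """
    if logical_cpus is None:
        # os.cpu_count() reports logical CPUs and may return None.
        logical_cpus = os.cpu_count() or 1
    return max(1, logical_cpus // threads_per_core)


# A 4-core CPU exposing 8 hardware threads yields nthread = 4.
params = {"nthread": suggested_nthread(logical_cpus=8)}
print(params)
```

When the real core count is known, setting nthread directly is more reliable than this estimate.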