pyspark.RDD.distinct

RDD.distinct(numPartitions=None)

Return a new RDD containing the distinct elements in this RDD.

New in version 0.7.0.

Parameters
numPartitions : int, optional

the number of partitions in the new RDD

Returns
RDD

a new RDD containing the distinct elements

Examples

>>> sorted(sc.parallelize([1, 1, 2, 3]).distinct().collect())
[1, 2, 3]
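
The example above sorts the result because distinct() involves a shuffle and makes no promise about element order. To illustrate why, and what numPartitions controls, here is a minimal pure-Python sketch of a hash-shuffle deduplication. It is a model for intuition only, not Spark's code: `distinct_sketch` is a hypothetical helper, the list of lists stands in for an RDD's partitions, and defaulting to the input partition count is a simplifying assumption.

```python
def distinct_sketch(partitions, num_partitions=None):
    """Model a shuffle-based distinct over a list of in-memory partitions.

    `partitions` is a list of lists standing in for an RDD's partitions.
    Elements must be hashable.
    """
    if num_partitions is None:
        # Assumption for this sketch: keep the input partition count.
        num_partitions = len(partitions)

    # One dict per output partition; dict keys deduplicate the elements.
    buckets = [dict() for _ in range(num_partitions)]

    for part in partitions:
        for x in part:
            # Equal elements hash to the same bucket, so every duplicate
            # meets in one output partition and collapses into one key.
            buckets[hash(x) % num_partitions].setdefault(x, None)

    # Drop the placeholder values and keep only the unique elements.
    return [list(bucket.keys()) for bucket in buckets]


parts = [[1, 1, 2], [3, 2, 1]]
result = distinct_sketch(parts, num_partitions=2)
flattened = sorted(x for part in result for x in part)
print(len(result), flattened)
```

The output order within and across partitions follows the hash placement rather than the input order, which is why callers that need a stable order sort after collecting.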