I was watching Will's video, 'Transforming data with PySpark', and noticed the use of the dropDuplicates() method to remove duplicate rows from a DataFrame. I also discovered that the distinct() method returns only the unique rows, which achieves the same result. I have experimented with both, and they work the same when dropDuplicates() is called with no arguments. You may want to try it out, folks.