最近 review 公司代码，发现在一个导入 Excel 文档去重业务上，使用的是 List.Contain(T) 方法。性能和效率双低，所以来实现一下优化一下代码。

1 使用 Contain 方法实现去重

先来看看使用Contain方法去重的代码

public static void deduplicateDataByContain(List<String> testDate) {
    stopWatch.start();
    System.out.println("Contain去重之前条数：" + testDate.size());
    List<String> result = new ArrayList<>();
    for (String s : testDate) {
        if (!result.contains(s)) {
            result.add(s);
        }
    }
    System.out.println("Contain去重之前条数：" + result.size());
    stopWatch.stop();
    System.out.println("Contain去重耗时：" + stopWatch.getTotalTimeMillis());
}

在测试之前，我们先生成一个含有10万条数据的 List容器来模拟去重情况，其中将会有5万条重复数据。生成模拟数据容器代码如下：

public static List<String> generateData() {
    List<String> testData = new ArrayList<>();
    Random random = new Random();

    for (int i = 0; i < 50000; i++) {
        testData.add("UniqueData" + i);
    }

    for (int i = 0; i < 50000; i++) {
        int randomIndex = random.nextInt(50000);
        testData.add(testData.get(randomIndex));
    }
    return testData;
}

开始调用，计算一下耗时：

public static void main(String[] args) {
    List<String> testData = generateData();
    deduplicateDataByContain(testData);
}

得到最终结果：

Contain去重之前条数：100000
Contain去重之前条数：50000
Contain去重耗时：11051

Process finished with exit code 0

可以看到耗时为11051毫秒，即11秒多，效率过低。为什么会出现这种情况呢？我们来分析一下该方法中各操作的时间复杂度。

创建了一个新的 ArrayList 来存储去重后的结果，初始为空，时间复杂度为O(1)。

遍历 testDate 中的每个元素，遍历的时间复杂度为O(n)。

在 result 中使用 contains 方法来检查是否已经包含当前元素，每次 contains 操作最坏情况下需要遍历 result 中所有的元素，时间复杂度为 O(m)，其中 m 是 result 的大小。

如果当前元素不在 result 中，会执行 add 操作将元素添加到 result 中，时间复杂度为 O(1)（平均情况下）。

因此，总的时间复杂度为 O(n^2)，因为 contains 操作在每次遍历时都可能需要遍历 result。

显然该方法并不适用在实际开发中，在Java中，我们还有其他更高效的方法来实现去重这个需求。

2 使用 HashSet 实现去重

HashSet是一个不包含重复元素的 collection ，所以我们可以通过使用 HashSet 的 add 方法来实现去重这个功能需求。具体代码实现如下：

public static void deduplicateDataByHashSet(List<String> testDate) {
    stopWatch.start();
    System.out.println("HashSet去重之前条数：" + testDate.size());
    List<String> result = new ArrayList<>(new HashSet<>(testDate));
    System.out.println("HashSet去重之前条数：" + result.size());
    stopWatch.stop();
    System.out.println("HashSet去重耗时：" + stopWatch.getTotalTimeMillis());
}

让我们调用一下该方法，计算一下耗时。

public static void main(String[] args) {
    List<String> testData = generateData();
    deduplicateDataByHashSet(testData);
}

得到最终结果：

HashSet去重之前条数：100000
HashSet去重之前条数：50000
HashSet去重耗时：30

Process finished with exit code 0

可以看到耗时达到惊人的30毫秒，与 Contain 方法相差了恐怖的368倍！！！为什么会有如此之大的差异呢，我们来分析一下该方法中各操作的时间复杂度。

创建HashSet时，需要遍历整个testList，并将元素加入HashSet中。遍历的时间复杂度为O(n)，每次加入元素的操作平均时间复杂度为O(1)。所以，创建HashSet的时间复杂度为O(n)。

创建新的ArrayList时，需要将HashSet中的元素复制到新的ArrayList中，遍历的时间复杂度为O(n)。

因此，总的时间复杂度为O(n)。

可见从理论上来说，在不考虑其他原因素的情况下，HashSet 与 Ccontain 方法的差距为平方倍，且随着数据越大，差距可能越大。最后，我们再来看看其他去重方法。

3 使用 Stream API 实现去重

使用Java 8引入的Stream API。这个方法相比之前的HashSet方法更简洁，也更符合函数式编程的风格：

public static void deduplicateDataByStream(List<String> testDate) {
     stopWatch.start();
     System.out.println("Stream去重之前条数：" + testDate.size());
     List<String> result = testDate.stream().distinct().collect(Collectors.toList());
     System.out.println("Stream去重之前条数：" + result.size());
     stopWatch.stop();
     System.out.println("Stream去重耗时：" + stopWatch.getTotalTimeMillis());
 }

老规矩，调用一下该方法并得出最终结果：

public static void main(String[] args) {
    List<String> testData = generateData();
    deduplicateDataByStream(testData);
}

Stream去重之前条数：100000
Stream去重之前条数：50000
Stream去重耗时：60

Process finished with exit code 0

可以看得出，相比较 Contain 方法，依然出现极大的优化情况。我们再来分析一下该方法中各操作的时间复杂度。

testDate.stream() 操作创建一个新的 Stream ，时间复杂度为 O(1)。

distinct() 操作对 Stream 进行去重，需要遍历整个 Stream，并使用哈希表或其他数据结构来存储已经出现过的元素，以判断重复，时间复杂度为 O(n)。

collect(Collectors.toList()) 操作将 Stream 转换为 List，需要遍历整个 Stream，并将元素添加到新的 List 中，时间复杂度为 O(n)。

因此，总的时间复杂度为 O(n)。

4 使用 Collectors.toSet() 实现去重

使用 Java 8的 Collectors.toSet()：与 Stream API 方法类似，但我们可以直接使用 Collectors.toSet() 来收集元素并去重。然后再将 Set 转换回 List。具体代码实现如下：

public static void deduplicateDataByToSet(List<String> testDate) {
    stopWatch.start();
    System.out.println("ToSet去重之前条数：" + testDate.size());
    Set<String> uniqueData = testDate.stream().collect(Collectors.toSet());
    List<String> result = new ArrayList<>(uniqueData);
    System.out.println("ToSet去重之前条数：" + result.size());
    stopWatch.stop();
    System.out.println("ToSet去重耗时：" + stopWatch.getTotalTimeMillis());
}

还是老规矩，调用一下该方法并得出最终结果：

public static void main(String[] args) {
    List<String> testData = generateData();
    deduplicateDataByToSet(testData);
}

ToSet去重之前条数：100000
ToSet去重之前条数：50000
ToSet去重耗时：90

Process finished with exit code 0

该方法相比 Stream API 效率稍次一些，我们来分析一下该方法中各操作的时间复杂度得出原因。

data.stream()操作创建一个新的Stream，时间复杂度为O(1)。

collect(Collectors.toSet())操作对Stream进行去重，并将元素收集到一个Set中，需要遍历整个Stream，时间复杂度为O(n)。这里的n是输入List的大小。

创建一个新的ArrayList对象deduplicatedData，并将Set中的元素复制到该ArrayList中，复制的时间复杂度为O(m)，其中m是去重后的元素个数，即不重复元素的个数。

因此，总的时间复杂度为O(n + m)。

5 跳过相邻的重复元素实现去重

public static void deduplicateDataBySort(List<String> testDate) {
    stopWatch.start();
    System.out.println("Sort去重之前条数：" + testDate.size());
    // 对数据进行排序（前提是数据已经有序）
    testDate.sort(String::compareTo);

    List<String> result = new ArrayList<>();
    int dataSize = testDate.size();
    int index = 0;

    while (index < dataSize) {
        String current = testDate.get(index);
        result.add(current);

        // 跳过相邻的重复元素
        while (index < dataSize && testDate.get(index).equals(current)) {
            index++;
        }
    }
    System.out.println("Sort去重之前条数：" + result.size());
    stopWatch.stop();
    System.out.println("Sort去重耗时：" + stopWatch.getTotalTimeMillis());
}

真的是最后一次了……调用一下该方法并得出最终结果：

public static void main(String[] args) {
    List<String> testData = generateData();
    deduplicateDataBySort(testData);
}

Sort去重之前条数：100000
Sort去重之前条数：50000
Sort去重耗时：104

Process finished with exit code 0

最后的最后，再分析一下该方法中各操作的时间复杂度得出原因。

testDate.sort(String::compareTo)操作对数据进行排序，排序的时间复杂度为O(n log n)，其中n是输入List的大小。

创建一个新的ArrayList对象result，并遍历排序后的testDate，对每个元素执行以下操作：

1. 添加当前元素到result中，这个操作的时间复杂度为O(1)。

2. 使用while循环跳过相邻的重复元素，最坏情况下需要遍历相邻的重复元素直到不再重复，所以这个操作的时间复杂度为O(k)，其中k是相邻重复元素的个数。

因此，总的时间复杂度为排序的时间复杂度加上遍历去重的时间复杂度，即O(n log n + k)。

需要注意的是，排序操作的时间复杂度主要是决定该方法的性能瓶颈。在一般情况下，排序的时间复杂度为O(n log n)，并且可以认为k通常不会超过n，因此，可以简化为O(n log n)

去重需求还有很多很多的方法可以去实现，例如上方代码通过遍历数据并跳过相邻的重复元素来去重（前提是数据已经有序），或使用Bloom Filter的Java库，例如Google Guava库或Apache Commons Collections库，本文就不再一一分析与实现。具体实现方法应根据业务具体需求和数据量，选择最佳的方法来实现。瑞斯拜！

最后附上本文所写源代码：去重业务的实现与优化

YouFeng

关于 Java 去重业务的优化