Finding Duplicate Linking in Multi-Module Flex Apps

Posted on May 5, 2014

The Setup

Thirty-eight megabytes. Over time our application had somehow grown to 38 megabytes. The road to that point made sense: features needed to be added, and with features came new libraries each with a few dependencies. Still, somehow the path of good intentions had led us to an application that weighed in at a colossal 38 megabytes. The would have actually been a reasonable size if it weren't for one slight detail: it was a web app. The proper functioning of our system required the end-user to load 38 megabytes of swfs over what very possibly might be a slow DSL connection.

Our Flex application was composed of a loader and roughly two dozen modules. Something seemed a bit off though: several of the modules fell within a few kilobytes of 1.5MB even if they only had a few hundred lines of code and only depended on our core libraries. This couldn't be right. There was no way that a few hundred lines of code that only used core should cause a 1.5MB binary. The only thing that made sense was that something inside core was being embedded into every one of our modules.

Quick Validation

So, we've got a problem: we suspect that something is being duplicated in all of our modules. Great, but where to from here? The easiest way to verify would be to analyze the linker reports we used in Detecting Unreferenced Files in Flex.

<script name="C:\Users\CaptainJack\Source\DemoProject\src\com\example\demo\Frobnitz.as"
        mod="1338614991879"
        size="722"
        optimizedsize="325">
  <def id="com.example.demo:Frobnitz" />
  <pre id="Object" />
  <pre id="com.example.demo:Unobtainium" />
  <dep id="com.example.demo:Bar" />
</script>

In our linker report we'll end up having one for every item that gets embedded. Thus, if an item gets embedded more than once, we should see the same line in multiple linker reports. We'll quickly scan for every , and then use uniq to show us how many (-c) duplicates (-d) we have.

grep '<def id' *.xml | sort | uniq -c -d
      23 <def id="com.example.demo:Frobnitz" />

Something is being embedded 23 times, but we still haven't proven that duplication is the cause of our bloated binaries. Things are looking promising, but we're going to have to dig a bit further yet.

Measuring Duplication

The way forward is going to require actual parsing of all the link reports created during compilation. The link reports have a great deal of information about everything that gets linked, but in our case there are only two pieces of information that we're going to care about regarding our definitions: what is being defined and how much space it actually takes up.

class Definition
  def initialize(name, size)
    @name = name
    @size = size
  end
  
  attr_accessor :name, :size
end

Now that we know the content that we'll be looking for, let's add the code to extract it from an XML file. In order to make the parsing as painless as possible we're going to use the nokogiri gem. First we're going to need to open the XML file and find each of the <script> tags. This should provide us with the optimizedsize, and a containing the name of the symbol. We'll use this to create an instance of Definition for each of the symbols defined in the document.

gem install nokogiri
require 'nokogiri'

class Log
  def initialize(file_name)
    @xml_doc = Nokogiri::XML(File.new file_name)
  end
  
  def definitions
    @xml_doc.search('script').map { |script| build_definition script }
  end
  
  private
  def build_definition(script)
    name = script.search('def').first[:id]
    size = script[:optimizedsize]
    Definition.new name, size.to_i
  end
end

There we go. We can open a log file with Log.new “log_file.xml”. Additionally, we can get a list of all of the definitions contained in it by calling .definitions off of the resultant Log object. This is definitely a good start, but we're going to need to read several files since an object can't be defined multiple times within one linker report. For this, let's create a Folder class. Instead of taking a file name like our Log class, it will take a folder name. When we call .definitions on it, we'll look for all of the linker reports, concatenate the lists together, and return the results. 1

class Folder
  def initialize(folder_path)
    @folder_path = folder_path
  end
  
  def definitions
    @definitions ||= Dir.glob("#{@folder_path}/*.xml").
                         map { |file_name| Log.new(file_name).definitions }.
                         flatten
  end
end

As we've gotten the ability to parse our linker reports now, we can start to go after our main goal: to identify how much space is being wasted by symbols being defined multiple times. Ideally each class should be linked exactly one time, so we'll define the amount of wasted space as the number of bytes in all but the first instance. We'll create a class to represent these duplications.

class Duplicate
  def initialize(definition, count)
    @definition = definition
    @count = count
  end
  
  def wasted
    (@count - 1) * @definition.size
  end
  
  attr_accessor :definition, :count
end

With the concept of duplicates and the ability to load linker reports in place, we can go back to Folder and give it the capability to quickly identify the duplicates for us. We're going to count how many times each object appears, build our duplicate objects, and then pare down the list to only items that have multiple definitions. Though we could do it with less code, I've chosen this implementation so that we can maintain O(n) performance.

class Folder
   ...
  def duplicates
    processed = Hash.new 0
    definitions.each { |definition| processed[definition.name] += 1 }
    definitions.uniq { |definition| definition.name }.
                map { |definition| Duplicate.new definition, processed[definition.name] }.
                select { |duplicate| duplicate.count > 1 }
  end
end

Finally, we'll just create a little runner to sort the list and show us how much space is actually being wasted.

class App
  def self.run(folder)
    duplicates = Folder.new(folder).duplicates
    duplicates.sort { |a, b| a.wasted <=> b.wasted }.
               each { |dup| puts "#{dup.definition.name}: duplicated #{dup.count - 1}" +
                                 " times (#{dup.wasted} bytes)" }    
  end
end

The Results

We had some major problems with symbols being linked into our modules; large portions of our core libraries and their dependencies were being linked into every module. Roughly 31 megabytes of the application was the direct result of duplicate definitions2. If we could somehow remove the duplication, we could get our web-app down to roughly 7 megs. It looked possible though, if we could learn to build RSLs3. That, however, is a tale for another post.

Completed Script

gem install nokogiri
require 'nokogiri'

class Definition
  def initialize(name, size)
    @name = name
    @size = size
  end
  
  attr_accessor :name, :size
end

class Duplicate
  def initialize(definition, count)
    @definition = definition
    @count = count
  end
  
  def wasted
    (@count - 1) * @definition.size
  end
  
  attr_accessor :definition, :count
end

class Log
  def initialize(file_name)
    @xml_doc = Nokogiri::XML(File.new file_name)
  end
  
  def definitions
    @xml_doc.search('script').map { |script| build_definition script }
  end
  
  private
  def build_definition(script)
    name = script.search('def').first[:id]
    size = script[:optimizedsize]
    Definition.new name, size.to_i
  end
end

class Folder
  def initialize(folder_path)
    @folder_path = folder_path
  end
  
  def definitions
    @definitions ||= Dir.glob("#{@folder_path}/*.xml").
                         map { |file_name| Log.new(file_name).definitions }.
                         flatten
  end
  
  def duplicates
    processed = Hash.new 0
    definitions.each { |definition| processed[definition.name] += 1 }
    definitions.uniq { |definition| definition.name }.
                map { |definition| Duplicate.new definition, processed[definition.name] }.
                select { |duplicate| duplicate.count > 1 }
  end
end

class App
  def self.run(folder)
    duplicates = Folder.new(folder).duplicates
    duplicates.sort { |a, b| a.wasted <=> b.wasted }.
               each { |dup| puts "#{dup.definition.name}: duplicated #{dup.count - 1}" +
                                 " times (#{dup.wasted} bytes)" }    
  end
end

App.run ARGV[0]
ruby dupe.rb linker_reports/
...
mx.resources:LocaleSorter duplicated 21 times (143934 bytes)
mx.resources:ResourceManagerImpl duplicated 21 times (149625 bytes)
en_US$ESRIMessages_properties duplicated 21 times (250593 bytes)

Footnotes


  1. In case the syntax @variable_name ||= method_name() doesn't look familiar, we're effectively telling Ruby to return the cached data if available. Otherwise, calculate, cache, and return. ↩︎

  2. At the time I didn't run this script, as it was rather obvious to me what was going on, so I didn't know the actual degree of duplication till later on. I have run it a few times since though to identify parts of our binaries that could be easily optimized. I suppose this would fit Hollywood's concept of “Based on a True Story”. ↩︎

  3. More info about RSLs can be found at Adobe's Introduction to RSLs. ↩︎