Diagnosing the Internet

A major role of a compiler is the code analysis it performs before turning a programming language into machine code. During this process, a variety of diagnostics are performed, which provide valuable insights into the code that’s being compiled. Let’s take a look at what we found when compiling the PHP world.

PeachPie is about code analysis, compilation, and running regular PHP code on .NET, in a way C# developers are used to. Those two worlds are very different, but they also share many language features and libraries. We’ve been able to run WordPress for a while now, and we filed a few bug reports with the core development team as we were getting it to run. But what makes the WordPress ecosystem so powerful is its vast array of plugins and themes. Does anyone even check those for bugs? And what about the thousands of PHP packages all over the Internet that power somewhere around 80% of websites?

A good portion of the PHP world’s code is available on GitHub, so we tried to run our compiler on it. Listing all the libraries, frameworks, plugins, and themes, and all of their dependencies generated quite an extensive catalog.

It’s important to note that we are not evaluating PHP’s non-strict behavior itself; rather, we are pointing out what kinds of issues applications are allowed to have, and which ones we have to handle in order to provide a corresponding .NET solution.

Issue statistics

The charts below depict the proportions of various issues found in the code of different PHP applications we scanned. Here’s how the issues are categorized:

  • Fatal Error shows occurrences of code that will certainly cause the program to crash, e.g. syntax errors, calling nonexistent methods, implementing a type that is not an interface, using $this out of class scope, instantiating abstract classes, using parent:: when there is no parent class, etc.
  • Warning lists issues that may be a result of bad refactoring or unintended code, such as overflowing an integer number to a float, missing arguments, providing too many arguments to a function or invalid regular expressions.
  • Deprecations flag the usage of functions that are annotated with the @deprecated tag or that have been deprecated by the latest version of PHP.
  • Notice contains code that does not make much sense, e.g. duplicate “switch” cases, duplicate array keys, defining a class that is already defined, expressions that do nothing (like assigning a variable to itself, or expressions whose result is never read), or foreach over a type that does not support iteration (like string).
  • Informational denotes occurrences of code that might be nice to clean up, such as unnecessary casting.
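
To make the notion of "proportions" concrete, here is a minimal Python sketch of how a list of reported diagnostics could be tallied into per-category shares. The `Severity` enum and the `proportions` helper are hypothetical names chosen for illustration; they are not part of PeachPie.

```python
from collections import Counter
from enum import Enum


class Severity(Enum):
    """The five categories described above (names are illustrative)."""
    FATAL_ERROR = "Fatal Error"
    WARNING = "Warning"
    DEPRECATIONS = "Deprecations"
    NOTICE = "Notice"
    INFORMATIONAL = "Informational"


def proportions(diagnostics):
    """Return the share of each category among the given diagnostics.

    `diagnostics` is an iterable of Severity values, one per reported issue.
    Categories with no occurrences are left out of the result.
    """
    counts = Counter(diagnostics)
    total = sum(counts.values())
    if total == 0:
        return {}
    return {
        severity: counts[severity] / total
        for severity in Severity
        if counts[severity]
    }
```

Feeding it two warnings, one notice, and one fatal error would give Warning a share of one half and the other two a quarter each.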

WordPress

First, listing all the freely available plugins and themes for WordPress gave us thousands of repositories. We took the latest versions of roughly the first 3,000, considering only those updated within the last year, and we compiled them.

WordPress plugins:

The chart above shows the diagnostics of WordPress plugins (about 1000 plugins). Here’s what we notice:

  • There are lots of deprecations; since the PeachPie compiler tracks deprecations in the code (functions annotated with the @deprecated PHPDoc tag) and reports any usage of such a function, we have a neat list of plugins and themes using deprecated (and thus unmaintained) code. On average, each plugin calls several deprecated WordPress Core functions!
  • Dependencies on other plugins; WordPress plugins and themes do not state their dependencies declaratively in a manifest file. As a result, users always have to check the Readme for additional notes. If they don’t, they might install a plugin that requires another plugin. It’s the responsibility of the plugin itself to check for and report a missing dependency. Those checks are inconsistent and prevent further code analysis.
  • Use of eval(); plugin authors love to use eval() even when it may not be necessary. This opens the door to a plethora of security and performance issues.
  • Other bugs in the code;
    • reassigning $this; an error that’s often silently ignored, but it will crash the plugin if execution ever reaches that code.
    • calling member functions on scalar types (a runtime error)
    • errors in PCRE patterns (a runtime error that is usually silently ignored)
    • tons of warnings; the code, in general, doesn’t seem too clean. Often, arguments are marked as optional (they have a default value) even though they cannot be used as optional arguments. Array initializers have duplicate keys and “switch” statements have duplicate cases, all of which cause unnecessary overhead at run time.
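
As a rough illustration of two of the findings above (calls to deprecated functions and uses of eval()), here is a deliberately naive Python sketch that scans PHP source text line by line. It is not how PeachPie works: a real compiler analyzes the syntax tree and resolved symbols, whereas this regex will also match inside comments and strings. The `DEPRECATED_FUNCTIONS` set is a small hand-picked sample assumed for the example; in practice the list would come from @deprecated annotations.

```python
import re

# Illustrative sample only (an assumption for this sketch); a real list
# would be collected from @deprecated annotations in the analyzed code.
DEPRECATED_FUNCTIONS = {"get_currentuserinfo", "wp_get_sites"}

# A bare identifier followed by "(", not preceded by a word character,
# "$", ">" or ":" so that variables, ->method() and ::method() calls
# are skipped.
CALL_PATTERN = re.compile(r"(?<![\w$>:])([A-Za-z_]\w*)\s*\(")


def scan(source, deprecated=DEPRECATED_FUNCTIONS):
    """Return (line_number, category, name) for each flagged call."""
    findings = []
    for line_number, line in enumerate(source.splitlines(), start=1):
        for match in CALL_PATTERN.finditer(line):
            # PHP function names are case-insensitive.
            name = match.group(1).lower()
            if name == "eval":
                findings.append((line_number, "eval", name))
            elif name in deprecated:
                findings.append((line_number, "deprecated", name))
    return findings
```

Even a crude textual scan like this one surfaces many call sites, which hints at how common these patterns are; the compiler's diagnostics are far more precise because they work on analyzed code rather than raw text.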

WordPress themes:

The chart above shows our diagnostics of WordPress themes (about 1000 themes).

Dependencies of Laravel, Twig, etc.

Next, we checked the Laravel framework, plus its dependencies recursively, including the suggested dependencies. We ended up with thousands of packages (this number includes all major and minor versions of those packages, considering only stable releases). The chart below shows the diagnostics of about 1300 packages.

Laravel packages:

In general, the code quality here is much better compared to WordPress. The most notable problem in the Laravel codebase is, of course, its dynamic behavior. Below are some notes:

  • Missing dependencies; package authors often use classes only if they happen to be present. If they are not, the code simply doesn’t use them. This is a huge deal for the compiler, but nothing that can’t be fixed.
  • Circular dependencies; in PHP, dependency management works differently: it allows circular dependencies because they get flattened. For .NET, this actually matters, so the compiler has to work around them.
  • Syntax and other fatal errors; some packages that were released as stable and are actually used in production contain plain syntax errors. Also, a number of packages contain continue; outside a loop construct; this code is a fatal error and thus could never run.
  • Deprecated and non-standard SPDX license expressions; every package should specify a license using the identifiers listed at https://spdx.org/licenses/. But very often, the developer uses naive license names, such as “PHP”, “GPL”, or “GPL 2”. Since no toolchain reports an error before a package with such a license name is published, all those made-up names go to production. It is then the responsibility of the tool that consumes such a package to understand the non-standard names and deal with them. We have tried to perform this step at build time already, and we report any issues we find.

Note that when packing a C# project (during the phase when the NuGet task prepares the NuSpec file for you), the license is checked, and even the use of a deprecated license identifier causes the build to fail. This behavior is very strict.
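
The kind of license check discussed above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: `VALID_IDS` and `DEPRECATED_IDS` are small hand-picked subsets of the list at https://spdx.org/licenses/, the `classify_license` helper is a hypothetical name, and the sketch handles a single identifier compared exactly, not full SPDX expressions with operators.

```python
# Illustrative subsets only (assumptions for this sketch); the
# authoritative list is published at https://spdx.org/licenses/.
VALID_IDS = {
    "MIT",
    "Apache-2.0",
    "BSD-3-Clause",
    "GPL-2.0-only",
    "GPL-2.0-or-later",
    "PHP-3.01",
}
DEPRECATED_IDS = {"GPL-2.0", "GPL-2.0+", "GPL-3.0", "LGPL-2.1"}


def classify_license(name):
    """Classify one license string as 'valid', 'deprecated' or 'unknown'.

    Simplification: exact comparison of a single identifier, with
    surrounding whitespace ignored.
    """
    candidate = name.strip()
    if candidate in VALID_IDS:
        return "valid"
    if candidate in DEPRECATED_IDS:
        return "deprecated"
    return "unknown"
```

With these sample sets, made-up names such as “PHP”, “GPL”, or “GPL 2” come back as unknown, while “GPL-2.0” is recognized but flagged as deprecated. A strict tool could fail the build on both outcomes; a lenient one might only warn on the deprecated case.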

Conclusion

By checking the Internet, our compiler learned a lot of new things. It is now able to compile a wide range of code, providing individual PHP libraries as .NET packages. The great part is that we only provide safely checked packages and filter out those that have fatal issues. It is interesting to observe what kinds of issues our compiler’s static analysis recognized in the ecosystem of PHP libraries that powers the majority of the Internet.

Posted on June 13, 2020, in category Information, Security