Linux Graphics Feed Aggregator
I’ve had this post sitting in my drafts for the last 7 months. The code is stale, but the concepts are correct. I had intended to add some pictures before posting, but it’s clear that won’t happen now. Words are better than nothing, I suppose…
Recently I pushed an intel-gpu-tools tool for modifying GPU frequencies. It’s meant to simplify using the sysfs interfaces, similar to many other such helper tools in Linux (cpupower has become one of my recent favorites). I wrote this fancy getopt example^w^w^w program to address a specific need I had, but saw its uses for the community as well. Some time after upstreaming the tool, I accidentally put the name of this tool into my favorite search engine (yes, after all these years X clipboard still confuses me). Surprisingly I was greeted by a discussion about the tool. None of it was terribly misinformed, but I figured I might set the record straight anyway.
Introduction
Dynamically changing frequencies is a difficult thing to accomplish. Typically there are multiple clock domains in a chip, and coordinating all the necessary ones in order to modify one probably requires a bunch of state which I’d not understand, much less be able to explain. To facilitate this (on Gen6+) there is firmware which does whatever it is that firmwares do to change frequencies. When we talk about changing frequencies from a Linux kernel driver perspective, it means we’re asking the firmware for a frequency. It can, and does, overrule, balk at and ignore our frequency requests.
A brief explanation on frequencies in i915
The term used within the kernel driver and docs is “RPS”, which is [I believe] short for Render P-States. They are analogous to CPU P-states in that lower numbers are faster and higher numbers are slower. Conveniently, we only have two numbers, 0 and 1, on GEN. Past that, I don’t know how CPU P-States work, so I’m not sure how much else is similar.
There are roughly 4 generations of RPS:
- IPS (not to be confused with Intermediate Pixel Storage). The implementation of this predates my knowledge of this part of the hardware, so I can’t speak much about it. It stopped being a thing after Ironlake (Gen5). I don’t care to look, but you can if you want: drivers/platform/x86/intel_ips.c
- RPS (Sandybridge, Ivybridge)
There are 4 numbers of interest: RP0, RP1, RPn, “hw max”. The first 3 are directly read from a register:
    rp_state_cap = I915_READ(GEN6_RP_STATE_CAP);
    dev_priv->rps.rp0_freq = (rp_state_cap >> 0) & 0xff;
    dev_priv->rps.rp1_freq = (rp_state_cap >> 8) & 0xff;
    dev_priv->rps.min_freq = (rp_state_cap >> 16) & 0xff;
RP0 is the maximum value the driver can request from the firmware. It’s the highest non-overclocked frequency supported.
RPn is the minimum value the driver can request from the firmware. It’s the lowest frequency supported.
RP1 is the most efficient frequency.
hw_max is RP0 if the system does not support overclocking.
Otherwise, it is read through a special set of commands where we’re told by the firmware the real max. The overclocking max typically cannot be sustained for long durations due to thermal considerations, but that is transparent to software.
- RPS (HSW+)
Similar to the previous RPS, except there is an RPe (for efficient). I don’t know how it differs from RP1. I just learned about this after posting the tool – but just be aware it’s different. Baytrail and Cherryview also have a concept of RPe.
- The Atom based stuff (Baytrail, Cherryview)
I don’t pretend to know how this works [or doesn’t]. They use similar terms.
They have the same set of names as the HSW+ RPS.
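The register decode shown for Sandybridge and Ivybridge can be tried outside the kernel. What follows is a minimal sketch in C, not driver code: it assumes the field layout from the snippet above (RP0 in bits 7:0, RP1 in bits 15:8, RPn in bits 23:16) and assumes one hardware unit corresponds to 50 MHz. The struct name, function names and the multiplier constant are hypothetical and chosen only for illustration.

```c
#include <assert.h>
#include <stdint.h>

/* Assumption: one hardware frequency unit is 50 MHz. */
#define ASSUMED_MHZ_PER_UNIT 50u

/* Hypothetical container for the decoded fields; not an i915 structure. */
struct rps_caps {
    uint8_t rp0; /* highest non-overclocked frequency the driver can request */
    uint8_t rp1; /* most efficient frequency */
    uint8_t rpn; /* lowest frequency the driver can request */
};

/* Split a raw RP_STATE_CAP value into its three 8-bit fields. */
struct rps_caps decode_rp_state_cap(uint32_t rp_state_cap)
{
    struct rps_caps caps;

    caps.rp0 = (rp_state_cap >> 0) & 0xff;
    caps.rp1 = (rp_state_cap >> 8) & 0xff;
    caps.rpn = (rp_state_cap >> 16) & 0xff;

    /* Sanity check for this sketch: the minimum should not exceed the maximum. */
    assert(caps.rpn <= caps.rp0);
    return caps;
}

/* Convert hardware units to MHz under the 50 MHz assumption. */
unsigned int units_to_mhz(uint8_t units)
{
    return units * ASSUMED_MHZ_PER_UNIT;
}
```

For a made-up register value of 0x00070b16, this decodes to RP0 = 0x16, RP1 = 0x0b and RPn = 0x07, which under the 50 MHz assumption would be 1100, 550 and 350 MHz.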
The driver can make requests to the firmware for a desired frequency:
    if (IS_HASWELL(dev) || IS_BROADWELL(dev))
        I915_WRITE(GEN6_RPNSWREQ, HSW_FREQUENCY(val));
    else
        I915_WRITE(GEN6_RPNSWREQ,
                   GEN6_FREQUENCY(val) |
                   GEN6_OFFSET(0) |
                   GEN6_AGGRESSIVE_TURBO);
This register interface doesn’t provide a way to determine whether the request was granted other than reading back the current frequency, which is error prone, as explained below.
By default, the driver will request frequencies between the efficient frequency (RP1, or RPe), and the max frequency (RP0 or hwmax) based on the system’s busyness. The busyness can be calculated either by software, or by the hardware. For the former, the driver can periodically read a register to get busyness and make decisions based on that:
    render_count = I915_READ(VLV_RENDER_C0_COUNT_REG);
In the latter case, the firmware will itself measure busyness and give the driver an interrupt when it determines that the GPU is sufficiently overworked or underworked. At each interrupt the driver would raise or lower the frequency by the smallest step size (typically 50MHz), and continue on its way. The most complex thing we did (which we still managed to screw up) was disable interrupts telling us to go up when we were already at max, and the equivalent for down.
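The interrupt-driven scheme described above can be sketched as a tiny state machine. This is an illustration in C under stated assumptions, not the i915 implementation: the type and function names are hypothetical, frequencies are kept in MHz, and the step is fixed at the typical 50 MHz mentioned above.

```c
#include <assert.h>
#include <stdbool.h>

/* Assumption: the smallest step is 50 MHz. */
#define STEP_MHZ 50

/* The two kinds of interrupt the firmware can raise in this sketch. */
enum rps_event { RPS_UP, RPS_DOWN };

/* Hypothetical software state; not an i915 structure. */
struct rps_state {
    int cur_mhz;
    int min_mhz;
    int max_mhz;
    bool up_irq_enabled;   /* "go faster" interrupts wanted */
    bool down_irq_enabled; /* "go slower" interrupts wanted */
};

/* React to one interrupt: step once, clamp, then mask useless interrupts. */
void rps_handle_event(struct rps_state *s, enum rps_event ev)
{
    assert(s->min_mhz <= s->cur_mhz && s->cur_mhz <= s->max_mhz);

    if (ev == RPS_UP)
        s->cur_mhz += STEP_MHZ;
    else
        s->cur_mhz -= STEP_MHZ;

    /* Never leave the [min, max] window. */
    if (s->cur_mhz > s->max_mhz)
        s->cur_mhz = s->max_mhz;
    if (s->cur_mhz < s->min_mhz)
        s->cur_mhz = s->min_mhz;

    /* No point being told to go up at max, or down at min. */
    s->up_irq_enabled = s->cur_mhz < s->max_mhz;
    s->down_irq_enabled = s->cur_mhz > s->min_mhz;
}
```

Setting min_mhz equal to max_mhz in this sketch disables both interrupt directions, which is one way to picture what locking the frequency amounts to.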
It seems obvious that there are usual trends: if you’re incrementing the frequency, you’re more likely to increment again in the near future, and such. Since I left for sabbatical and returned to work on mesa, Chris Wilson has added a lot of complexity here, things which mimic concepts like the ondemand CPU frequency governor. I never looked much into those, so I can’t talk knowledgeably about them – just realize it’s not as naive as it once was, and should do a better job as a result.
The flow seems a bit ping-pongish.
The benefit, though, is that the firmware can do all the dirty work, and the CPU can sleep. Particularly when there’s nothing going on in the system, that should provide significant power savings.
Why you shouldn’t set --max or --min
First, what does it do? --max and --min lock the GPU frequency to max and min respectively. What this actually means is that in the driver, even if we get interrupts to throttle up, or throttle down, we ignore them (hopefully the driver will disable such useless interrupts, but I am too scared to check). I also didn’t check what this does when we’re using software to determine busyness, but it should be very similar.
I should mention now that this is what I’ve inferred through careful observation^w^w random guessing. In the world of thermals and overheating, things can go from good, to bad, to broke faster than the CPU can take an interrupt and adjust things. As a result, the firmware can and will lower frequencies even if it has previously acknowledged that it can give a specific frequency.
As an example, if you do:
    intel_gpu_frequency --set X
    assert (intel_gpu_frequency --get == X)
There is a period in the middle, and after the assert, where the firmware may have done who knows what. Furthermore, if we try to lock to --max, the GPU is more likely to hit these overheating conditions and throttle you down. So --max doesn’t even necessarily get you max. It’s sort of an interesting implication there since one would think these same conditions (probably the CPU is heavily used as well) would end up clocking up all the way anyway, and we’d get into the same spot, but I’m not really sure. Perhaps the firmware won’t tell us to throttle up so aggressively when it’s near its thermal limits. Using --max can actually result in non-optimal performance, and I have some good theories why, but I’ll keep them to myself since I really have no proof.
--min on the other hand is equally stupid for a different and more obvious reason. As I said above, it’s guaranteed to provide the worst possible performance and not guaranteed to provide optimal power usage.
What you should do with it
The usages are primarily benchmarking and performance debug. Assuming you can sustain decent cooling, locking the GPU frequency to anything will give you the most consistent results from run to run. Presumably max, or near max, will be optimal.
min is useful to get a measure of what the worst possible performance is to see how it might impact various optimizations. It can help you change the ratio of GPU to CPU frequency drastically (and easily). I don’t actually expect anyone to use this very much.
How not to get shot in the foot
If you are a developer trying to get performance optimizations or measurements which you think can be negatively impacted by the GPU throttling, you can set this value. Again because of the thermal considerations, as you go from run to run making tweaks, I’d recommend setting something maybe 75% or so of max – that is a total ballpark figure. When you’re ready to get some absolute numbers, you can try setting --max, and the few frequencies near max, to see if you have any unexpected results. Take the highest value when done.
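To make the "75% or so of max" suggestion concrete, here is a small sketch in C that picks a benchmarking frequency. It is only an illustration: the function name is hypothetical, the caller supplies min, max and step in MHz (for example 350, 1100 and 50, which are made-up values), and rounding down to a multiple of the step is my own choice rather than anything the tool does.

```c
#include <assert.h>

/*
 * Pick a frequency at roughly `percent` of max, rounded down to a
 * multiple of `step_mhz`, and never below `min_mhz`.
 * Hypothetical helper for illustration; not part of intel-gpu-tools.
 */
int pick_bench_freq_mhz(int min_mhz, int max_mhz, int step_mhz, int percent)
{
    assert(step_mhz > 0);
    assert(min_mhz <= max_mhz);
    assert(percent >= 0 && percent <= 100);

    int target = max_mhz * percent / 100;

    /* Round down to a multiple of the step size. */
    target -= target % step_mhz;

    /* Stay inside the range the driver is allowed to request. */
    if (target < min_mhz)
        target = min_mhz;
    return target;
}
```

With the made-up values above, 75% of 1100 MHz is 825 MHz, which rounds down to 800 MHz; that number could then be handed to intel_gpu_frequency --set.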
The sysfs entries require root for a reason. Any time
Footnotes
On the surface it would seem that the minimum frequency should always use the least amount of power. At idle, I’d assert that is always true. The corollary to that is that at minimum frequency the GPU also finishes the workload the slowest. Intel GPUs don’t idle for long; they go into RC6. The efficient frequency is a blend of maintaining a low frequency and winning the race to idle. AFAIK, it’s a number selected after a whole lot of tests run – we could ignore it.
Anyway, here are the slides for my gfx kernel upstreaming requirements training.
First looking at the modeset side of the driver, the big overall thing is all the work to convert i915 to atomic. In this release there's code from Ander Conselvan de Oliveira to have a struct drm_atomic_state allocated for all the legacy modeset code paths in the driver. With that we can switch the internals to start using atomic state objects and gradually convert over everything to atomic on the modeset side. Matt Roper on the other hand was busy preparing the plane code so that the atomic watermark update code can land. Damien has reworked the initial plane configuration code used for fastboot, which also needs to be adapted to the atomic world.
For more specific feature work there's the DRRS (dynamic refresh rate switching) from Sonika, Vandana and more people, which is now enabled where supported. The idea is to reduce the refresh rate of the panel to save power when nothing changes on the screen. And Paulo Zanoni has provided patches to improve the FBC code; hopefully we can enable that by default soon too. Under the hood Ville has refactored the DP link rate computation and the sprite color key handling, both to prepare for future work and platform enabling. Intermediate link rate support for eDP 1.4 from Sonika is built on top of this. Imre Deak has also reworked the Baytrail/Braswell DPLL code to prepare for Broxton.
Speaking of platforms, Skylake has gained runtime PM support from Damien, and RPS (render turbo and sleep states) from Akash. Another SKL exclusive is support for scanout of Y-tiled buffers and scanning out buffers rotated by 90°/270° (instead of just normal and rotated by 180°) from Tvrtko and Damien. Well, the rotation support didn't quite land yet, but Tvrtko's support for the special pagetable binding needed for that feature, in the form of rotated GGTT views, did. Finally Nick Hoath and Damien also submitted a lot of workaround patches for SKL.
Moving on to Braswell/Cherryview, there have been tons of fixes to the DPLL and watermark code from Vijay and Ville, and BSW has left preliminary hw support. And also for the SoC platforms Chris Wilson has supplied a pile of patches to tune the rps code and bring it more in line with the big core platforms.
On the GT side the big ongoing work is dynamic pagetable allocations from Michel Thierry, based upon patches from Ben Widawsky. With per-process address spaces, and even more so with the big address spaces gen8+ supports, it would be wasteful if not impossible to allocate pagetables for the entire address space upfront. But changing the code to handle another possible memory allocation failure point needed a lot of work. Most of that has landed now, but the benefits of enabling bigger address spaces haven't made it into 4.1.
Another big piece of work is XenGT client-side support from Yu Zhang and team. This is paravirtualization to allow virtual machines to tap into the render engines without requiring exclusive access, but also with a lot less overhead than full virtual hardware like vmware or virgil would provide. The host-side code is also submitted already, but needs a bit more work still to integrate cleanly into the driver.
And of course there's been lots of other smaller work all over, as usual. Internal documentation for the shrinker, more dead UMS code removed, the vblank interrupt code cleaned up and more.
Codes of conduct seem to be in the news a bit recently, and I realized that I've never really documented how we run things. It's different from the kernel's overall CodeOfConflict and also differs from the official Intel/OTC one in small details about handling issues. And for completeness there's also the Xorg Foundation event policy. Anyway, I think this is worth clarifying and here it goes.
It's simple: Be respectful, open and excellent to each other.
Which doesn't mean we want to sacrifice quality to be nice. Striving for technical excellence very much doesn't exclude being excellent to someone else, and in our experience it tends to go hand in hand.
Unfortunately things go south occasionally. So if you feel threatened, personally abused or otherwise uncomfortable, even and especially when you didn't participate in a discussion yourself, then please raise this in private with the drm/i915 maintainers (currently Daniel Vetter and Jani Nikula, see MAINTAINERS for contact information). And the "in private" part is important: Humans screw up, and disciplining minor fumbles by tarnishing someone's google-able track record forever is out of proportion.
Still there are some teeth to this code of conduct:
1. First time around minor issues will be raised in private.
2. On repeat cases a public reply in the discussion will enforce that respectful behavior is expected.
3. We'll ban people who don't get it.
And severe cases will be escalated much quicker.
This applies to all community communication channels (irc, mailing list and bugzilla). And as mentioned this really just is a public clarification of the rules already in place - you can't see that though since we never had to go further than step 1.
Let's keep it at that.
And in case you have a problem with an individual drm/i915 maintainer and don't want to raise it with the other one there's the Xorg BoD, linux foundation TAB and the drm upstream maintainer Dave Airlie.
The 2015Q1 release brings new features and many fixes for all platforms. As usual, these release notes highlight the most important features and bug fixes and also list all known issues.
This release introduces the “Upcoming Platforms” section for announcements related to platforms that are under development and not available for end users yet.
- Linux Kernel – 3.19.2
- Mesa – 10.5.1
- xf86-video-intel – 2.99.917
- Libdrm – 2.4.59
- Libva – 1.5.1
- vaapi intel-driver – 1.5.1
- Cairo – 1.14.2
- Xorg Xserver – 1.17.1
- Intel-gpu-tools – 1.10
- Linux Kernel for Upcoming Platforms (BSW and SKL) – drm-intel-testing
For more details read the full release notes: https://01.org/linuxgraphics/downloads/2015/2015q1-intel-graphics-stack-release
Let's first start with all the driver internal rework to support atomic. The big thing with atomic is that it requires a clean split between code that checks display updates and the code that commits a new display state to the hardware. The corollary from that is that any derived state that's computed in the validation code and needed in the commit code must be stored somewhere in the state object. Gustavo Padovan and Matt Roper have done all that work to support atomic plane updates. This is the code that's now in 3.20 as a tech preview. The big things missing for proper atomic plane updates are async commit support (which has already landed for 3.21) and support to adjust watermark settings on the fly. Patches for that from Ville have been around for a long time, but need to be rebased, reviewed and extended for recently added platforms.
On the modeset side Ander Conselvan de Oliveira has done a lot of the necessary work already. Unfortunately converting the modeset code is much more involved for mainly two reasons: First there's a lot more derived state that needs to be handled, and the existing code already has structures and code for this. Conceptually the code has been prepared for an atomic world since the big display rewrite and the introduction of CRTC configuration structures. But converting the i915 modeset code to the new DRM atomic structures and interface is still a lot of work. Most of these refactorings have landed in 3.20. The other hold-up is shared resources and the software state to handle that. This is mostly for handling display PLLs, but also other shared resources like the display FIFO buffers. Patches to handle this are still in-flight.
Continuing with modeset work, Jani Nikula has reworked the DSI support to use the shared DSI helpers from the DRM core. Jani also reworked the DSI code in preparation for dual-link DSI support, which Gaurav Singh implemented. Rodrigo Vivi and others provided a lot of patches to improve PSR support and enable it for Baytrail/Braswell. Unfortunately there are still issues with the automated testcase, so PSR stays disabled by default for now. Rodrigo also wrote some nice DocBook documentation for fbc, another step towards fully documenting i915 driver internals.
Moving on to platform enabling, there has been a lot of work from Ville on Cherryview: RPS/gpu turbo and pipe CRC support (used for automated display testing) are both improved. On Skylake almost all the basic enabling is merged now: PM patches, enabling mmio pageflips and fastboot support from Damien have landed. Tvrtko Ursulin also created the infrastructure for global GTT views. This will be used for some upcoming features on Skylake. And to simplify enabling future platforms Chris Wilson and Mika Kuoppala have completely rewritten the forcewake domains handling code.
Also really important for Skylake is that the execlist support for gen8+ command submission is finally in good enough shape to be used by default - on Skylake that's the only supported path, legacy ring submission has been deprecated. Around that feature and a few other closely related ones a lot of code was refactored: John Harrison implemented the conversion from raw sequence numbers to request objects for gpu progress tracking. This is also prep work for landing a gpu scheduler. Nick Hoath removed the special execlist request tracking structures, simplifying the code. The engine initialization code was also refactored for a cleaner split between software and hardware initialization, leading to more robust code for driver load and system resume. Dave Gordon has also reworked the code tracking and computing the ring free space. On top of that we've also enabled full ppgtt again, but for now only where execlists are available since there are still issues with the legacy ring-based pagetable loading.
For generic GEM work there's the really nice support for write-combine cpu memory mappings from Akash Goel and Chris Wilson. On Atom SoC platforms, which lack the giant LLC that bigger chips have, this gives a really fast way to upload textures. And even with LLC it's useful for uploading to scanout targets since those are always uncached. But like the special-purpose upload paths for LLC platforms, the cpu mmap views do not detile textures, hence special-purpose fastpaths need to be written in mesa and other userspace to fully exploit this. In other GEM features the shadow batch copying code for the command parser has now also landed.
Finally there's the redesign from Imre Deak to use the component framework for the snd-hda/i915 interactions. Modern platforms need a lot of coordination between the graphics and sound driver side for audio over hdmi, and thus far this was done with some ad-hoc dynamic symbol lookups, which results in a lot of headaches to get the ordering correct for driver load or system suspend and resume. With the component framework this dependency is now explicit, which means we will be able to get rid of a few hacks. It's also much easier to extend for the future - new platforms tend to integrate different components even more.