Linux Graphics Feed Aggregator

Rodrigo Vivi - 02 Jul, 2015 - 01:19pm

This release brings new features and many fixes for all platforms. As usual, these release notes highlight the most important features and bug fixes and also list all known issues. The 2015Q2 highlights are:

  • Execlists and full PPGTT enabled by default on all platforms that support them: Broadwell, Skylake and Cherryview/Braswell.
  • Cherryview/Braswell is out of preliminary_hw_support protection, which means it has achieved good stability and confidence levels.

Another important thing to highlight this quarter is that we published hardware specifications for Broadwell and Cherryview/Braswell.

  • Linux Kernel – 4.0.4
  • Mesa – 10.6.0
  • xf86-video-intel – 2.99.917
  • Libdrm – 2.4.60
  • Libva – 1.6.0
  • vaapi intel-driver – 1.6.0
  • Cairo – 1.14.2
  • Xorg Xserver – 1.17.1
  • Intel-gpu-tools – 1.11
  • Linux Kernel for Upcoming Platforms (BSW and SKL) – drm-intel-testing

For more details read the full release notes: https://01.org/linuxgraphics/downloads/2015q2-intel-graphics-stack-release

 


Daniel Vetter - 01 Jun, 2015 - 02:14am
The 4.1 kernel release is still a few weeks off and hence a bit early to talk about 4.2. But the drm subsystem feature cut-off already passed and I'm going on vacation for 2 weeks, so here we go.

First things first: No, i915 does not yet support atomic modesets. But a lot of progress has been made again towards enabling it. As I explained last time around, the trouble is that the intel driver has grown its own almost-atomic modeset infrastructure over the past few years. And now we need to convert that to the slightly different proper atomic support infrastructure merged into the drm core, which means lots and lots of small changes all over the driver. A big part merged in this release is the removal of the ->new_config pointer by Ander, Matt & Maarten. This was the old i915-specific pointer to the staged new configuration. Removing it required switching all the CRTC code over to handling the staged configuration stored in the struct drm_atomic_state to be compatible with the atomic core. Unfortunately we still need to do the same for encoder/connector states and for plane states, so there's still lots of shuffling pending for 4.3.

There has also been other feature work going on on the modeset side: Ville cleaned and fixed up the CDCLK support in anticipation of implementing dynamic display clock frequency scaling. Unfortunately that part of his patches hasn't landed yet. Ville has also merged patches to fix up some details in the CPT modeset sequence; maybe this will finally fix the remaining "DP port stuck" issues we still seem to have.

Looking at newer platforms the interesting bit is rotation support for SKL from Sonika and Tvrtko. Compared to older platforms SKL now also supports 90° and 270° rotation in the scanout engines, but only when the framebuffer uses a special tiling layout (which has been enabled in 4.0). A related feature is support for plane/CRTC scalers on SKL, provided by Chandra. Skylake has also gained support for the new low-power display states DC5/6. For Broxton basic enabling has landed, but there's nothing too interesting yet besides piles of small adjustments all over. This is because Broxton and Skylake have a common display block (similar to how the render block for Atom chips was already shared since Baytrail) and hence share a lot of the infrastructure code. Unfortunately neither of these platforms has yet left the preliminary hardware support label for the i915 driver.

There's also a few minor features in the display code worth mentioning: DP compliance testing infrastructure from Todd Previte - DP compliance test devices have a special DP AUX sidechannel protocol for requesting certain test procedures and hence need a bit of driver support. Most of this will be in userspace though, with the kernel just forwarding requests and handing back results. Mika Kahola has optimized the DP link training: the kernel will now first try to use the current values (either from a previous modeset or set up by the firmware). PSR has also seen some more work, unfortunately it's still not yet enabled by default. And finally there's been lots of cleanups and improvements under the hood all over, as usual.

A big feature is the dynamic pagetable allocation for gen8+ from Michel Thierry and Ben Widawsky. This will greatly reduce the overhead of PPGTT and is a requirement for 48bit address space support - with that big a VM preallocating all the pagetables is just not possible any more. The gen7 cmd parser is now finally fixed up and enabled by default (thanks to Rebecca Palmer for one crucial fix), which means finally some newer GL extensions can be used without adding kernel hacks. And Chris Wilson has fine-tuned the cmd parser with a big pile of patches to reduce the overhead. And Chris has tuned the RPS boost code more, it should now no longer erratically boost the GPU's clock when it's inappropriate. He has also written a lot of patches to reduce the overhead of execlist command submission, and some of those patches have been merged into this release.

Finally two pieces of prep work: A few patches from John Harrison to prepare for removing the outstanding lazy request. We added this years ago as a cheap way out of a memory and ringbuffer space preallocation issue and have ever since paid the price for this with added complexity leaking all over the GEM code. Unfortunately the actual removal is still pending. And then Joonas Lahtinen has implemented partial GTT mmap support. This is needed for virtual environments like XenGT where the GTT is cut up between the different guests and hence badly fragmented. The merged code only supports linear views and still needs support for fenced buffer objects to be actually useful.


Paulo Zanoni - 29 May, 2015 - 02:54pm

This is the third part of the documentation for the Runtime Power Management (RPM) subsystem of the Intel Linux Kernel Graphics Driver, also known as i915.ko. In this part I will explain the power management interactions between the graphics device and the rest of the system. Part 1 gave an overview of the feature and part 2 explained the implementation details.

I should also point out that although I am a graphics developer and I believe I have a good knowledge of the graphics part of the problem, I am not an expert on general power management, so I may say something wrong. If you notice that, please correct me!

Disclaimer

I work for Intel and I am a member of its Linux Kernel Graphics team. On this text I will talk about things that are related to my work at Intel, but this text reflects only my own opinion, not necessarily Intel’s opinion. Everything I talk about here is already public information.

Basic concepts

The very first thing you need to do is to learn the basic concepts related to power management: Core C states – or just C States – and Package C states – or PC states. Since descriptions of these concepts were already written by people who know more about them than I do, I will just provide you some pointers:

The small detail that makes a lot of power management comparisons wrong

Now that you know what the PC states are, you understand that the current PC state directly affects the system’s total power consumption. You also understand that if just one little piece of the whole system is not allowing a certain PC state, then the whole system will not enter that PC state. And this is the big problem.

Let’s say you have a system with a disk that is only allowing up to PC3 state, and your graphics device is allowing up to the PC6 state. That means that you are stuck at PC3: you will not reach PC6. Then, you enable a certain graphics power management feature – let’s say, for example, Framebuffer Compression (FBC) – and the graphics device starts allowing PC7 instead of PC6. Since the disk is still there, preventing anything deeper than PC3, your system will stay at PC3 or worse. Then you try to measure power consumption both before and after enabling FBC to see how much it changes. You will conclude that the difference is zero! Why? Because the disk is keeping your machine at PC3. Does that mean the FBC feature does not save any power? No. What this means is that you have to fix your disk device power management configuration first.

Then let’s say you fix the disk power management policies so it starts allowing PC7. This way, when you disable FBC your machine will be able to reach only up to PC6 state, but when you enable FBC you will reach up to PC7. And now, if you measure the power consumption, you will notice a difference. But even this case can allow you to reach wrong conclusions: if you’re trying to check how much you can save by properly configuring your disk power management policies, you will reach different conclusions depending on whether FBC is enabled or not! Power management comparisons are really not easy, and you usually cannot look at just one feature: you have to consider the whole platform, its states and their residencies. You always have to know what exactly is preventing you from getting into the deeper states before concluding anything.
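The rule behind the two paragraphs above (the package only goes as deep as its most restrictive device allows) can be sketched in a few lines of shell. This is only an illustrative model: the function name deepest_package_state and the idea of passing each device's limit as a bare number are assumptions made for the example, not something PowerTOP or the Kernel provides.

```shell
# Illustrative model only: each argument is the deepest package C state
# (as a number, e.g. 3 for PC3) that one device allows.
# The package as a whole can only reach the shallowest of those limits.
deepest_package_state() {
    deepest=$1
    shift
    for limit in "$@"; do
        if [ "$limit" -lt "$deepest" ]; then
            deepest=$limit
        fi
    done
    echo "PC$deepest"
}

# Disk allows PC3; graphics allows PC6 without FBC and PC7 with FBC.
deepest_package_state 3 6   # prints PC3
deepest_package_state 3 7   # prints PC3: enabling FBC changes nothing
# After fixing the disk policy so that it allows PC7:
deepest_package_state 7 6   # prints PC6
deepest_package_state 7 7   # prints PC7
```

The second and fourth calls differ only in the disk limit, which is why the measured benefit of FBC depends on what the rest of the platform allows.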

Another common misconception

Another common misconception that I have to mention here is the relationship between power management and performance. A lot of people assume that enabling power management features sacrifices performance. While this may be true for some cases, it is also false for other cases! Power management features help your system not only draw less power when idle, but also stay at cooler temperatures. This means that when you really need performance, you will be less limited by the thermal management system, since it will take you more time to reach the high temperatures, so you will be able to run at higher clocks for more time.

I am not an expert in this area so I don’t really want to write too much because I don’t want to risk saying something wrong. But I suggest you read the Thermal Management chapter of the datasheet I referenced earlier and pay a lot of attention while reading the Thermal Considerations and Turbo Boost sections!

The tools

After reading the last sections you may be wondering what you need to do to know which device is preventing the system from getting into deeper package C states. Unfortunately, as far as I know, there is really no way to discover that today. But we do have a tool that allows you to see how much time your system is spending on each state and that also gives you some hints on changes you could try to make to reach deeper PC states: PowerTOP.

Since you already read PowerTOP’s user manual – referenced earlier in this text – you already know how to use it. I suggest you play with the tunables tab and try to see how each parameter affects the system. Do you see a change in the PC state residencies after toggling a certain parameter? Is the difference for one parameter only noticeable after you toggle another parameter? Do you have a certain group of parameters that need to be toggled together to allow deeper PC states? Also, since you’re basically changing the default parameters of the devices, you may see undesired side effects, such as a mouse that refuses to go back to life after you move it. Unfortunately, some power saving features are disabled exactly because they can cause bad side effects: that’s why you have to change the defaults. On the good side, not everybody is affected by these side effects, so your system may be just fine.

Another nice tool is Turbostat, which is located inside the Kernel source tree, under tools/power/x86/turbostat. When I am trying to discover if a certain change makes a difference in the PC state residencies, I find it much easier to look at Turbostat’s output rather than at PowerTOP’s output, simply because Turbostat just prints all the values on new lines, allowing you to quickly scroll up to see the past. I usually have one terminal on PowerTOP’s tunables tab, and another terminal with Turbostat running, so I just keep toggling the tunables and see if the residencies change.

Back to the graphics device

Now that you understand PC states, I can tell you that there are many graphics features that affect the deepest possible PC state. I won’t really list every possible feature here, but the main rule is: the more pixels you process per second, the more power you will draw. On the datasheet provided at the beginning of this text, please read Section 4.2.6: Package C-States and Display Resolutions.

Did you read it? Ok, we can proceed. I know you just read that, but I need to emphasize: the screen resolution is not the only thing that can limit the deepest possible PC state. So, for example, if you have a lot of rendering, or if some feature such as FBC is disabled, your deepest possible PC state may be limited.

Screen on vs screen off

If you really read that datasheet, you may have noticed that, on those specific processors, the deepest PC state you can reach with the screen on is PC7, while the deepest PC state you can reach with the screen off is PC10. When there are no pixels to process, everything gets easier.

From the graphics driver point of view, it is much easier to make sure we allow the deepest possible PC state while the screen is off, and you can usually expect the state of i915.ko to be really good on this aspect. You can use this information as a way to find out if the graphics device is the limiting factor on your PC states. First, you close all the applications that could be doing off-screen rendering. Then you launch Turbostat and check how deep a PC state you can reach. Then you disable the screen – see Part 1 for possible commands –, wait about 30 seconds for things to settle down and for Turbostat to print the output, and then you check if you were able to reach deeper PC states. If the answer is yes, then the graphics device is the limiting factor on your system. If not, then there’s probably something else limiting how deep you can go. Of course, i915.ko runtime PM needs to be properly configured and enabled while you’re doing this.

If you really want to guarantee that no bad applications are interfering with your test, I suggest you close your display manager and all applications, and run tests/pm_rpm --stay from intel-gpu-tools. Then you can run the experiment with Turbostat.
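The screen-on versus screen-off comparison described above could be scripted roughly as follows. This is a sketch under several assumptions: you run it as root inside an X session, turbostat is in your PATH, and your Kernel supports runtime PM with the screens in DPMS off. It relies on Turbostat running a command given on its command line and reporting the statistics gathered until that command exits. The TURBOSTAT and XSET variables and both function names are hypothetical, added only so the commands can be swapped out.

```shell
# Hypothetical override variables, so the sketch can be tried with stubs.
TURBOSTAT=${TURBOSTAT:-turbostat}
XSET=${XSET:-xset}

# Report statistics for a window of $1 seconds: turbostat runs the given
# command ("sleep N") and prints what it gathered once the command exits.
measure_window() {
    "$TURBOSTAT" sleep "$1"
}

compare_screen_on_off() {
    window=${1:-30}
    echo "== screen on =="
    measure_window "$window"
    "$XSET" dpms force off      # one of the commands from Part 1
    sleep "$window"             # wait for things to settle down
    echo "== screen off =="
    measure_window "$window"
}
```

Calling compare_screen_on_off prints two reports. If the second one shows residency in deeper package C states than the first, the graphics device was the limiting factor while the screen was on.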

Common bottlenecks I observed

If you go back to Part 1 you will notice I already mentioned the power management bottlenecks that I usually observe on my own systems: the disk and audio power management policies. I usually find that if I only change these policies I can already get to the deepest possible PC states on my development machines. I won’t get great residencies without tuning everything else, but I will at least be able to reach those deepest PC states.

But remember: every system is different, and I have no idea what is attached to your machine, so I can’t really guarantee that the recipe that worked for me will also work for you. I don’t know which wireless card you have, I don’t know which devices are plugged into your system, I don’t know how fast your memory is, so I can’t really provide you a universal tool for power management. The most I can do is suggest that you run PowerTOP’s auto tune feature.

The good news is that there are people trying to solve both the disk and audio problems. Want to help on the disk side? See this blog post from Matthew Garrett.

Back to graphics Runtime PM

So what does the graphics runtime PM implementation have to do with all these things I just explained? It’s very simple: when we runtime suspend i915.ko, we completely disable the graphics device. So, in theory, it should start allowing the deepest possible PC states. But, as I explained, even though this allows power savings, it does not really guarantee that you will save power, because the other devices might be preventing you from reaching the deepest PC states. But at least now we – the graphics team – probably won’t be the reason why you’re not saving more power, so you will have to blame other people.

Conclusion

My main goal with these posts was to teach you the little things I know about power management. I really hope that you use this acquired knowledge to make our world a greener place. I also have to ask you: please help us improve the state of power management on the Linux ecosystem! Please help test all these features. Please find the main bottlenecks of your system. Please engage with the upstream developers of the many subsystems and help them change the default parameters and policies of the devices so they can save more power. Please initiate power management discussions with the distributions, so they can review their tools and default values too.

And before I forget, here’s a link to a comic that is going to give you even more motivation to help you make the world consume less power. You can be like Superman! http://www.smbc-comics.com/?id=2305


Paulo Zanoni - 22 May, 2015 - 07:53am

Welcome to the second part of the Runtime PM (RPM) documentation. In the first part I gave an overview of the Intel Linux Kernel Graphics Runtime PM infrastructure. In this part I will explain a little bit about the feature design, debugging and our test suite. If you’re someone who has to maintain i915.ko for your users, you should definitely read this.

Disclaimer

I work for Intel and I am a member of its Linux Kernel Graphics team. On this text I will talk about things that are related to my work at Intel, but this text reflects only my own opinion, not necessarily Intel’s opinion. Everything I talk about here is already public information.

Feature design

The design of the i915 RPM is based on reference counting: whenever some code needs to use the graphics device, it needs to call intel_runtime_pm_get(), and whenever the code is done using the hardware, it can call intel_runtime_pm_put(). Whenever the refcount is zero we can runtime suspend the device. This is the most basic concept, but we have some things on top.
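The reference counting idea can be illustrated with a toy model. The real intel_runtime_pm_get() and intel_runtime_pm_put() are C functions inside the Kernel; the shell functions below only mimic the bookkeeping, and their names are made up for this sketch.

```shell
# Toy model of the reference counting described above (not driver code).
rpm_refcount=0

rpm_get() {
    rpm_refcount=$((rpm_refcount + 1))
}

rpm_put() {
    if [ "$rpm_refcount" -le 0 ]; then
        echo "unbalanced put" >&2
        return 1
    fi
    rpm_refcount=$((rpm_refcount - 1))
}

# The device may runtime suspend only while nobody holds a reference.
rpm_can_suspend() {
    [ "$rpm_refcount" -eq 0 ]
}

rpm_get                                   # some code starts using the hardware
rpm_can_suspend || echo "device must stay awake"
rpm_put                                   # that code is done
rpm_can_suspend && echo "device may runtime suspend"
```

Every get must be paired with exactly one put: a missing put keeps the device awake forever, and a missing get lets it suspend while it is still in use.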

An interesting detail of some of our hardware generations is that we have the concept of power wells, which are specific pieces of the hardware that can be powered off independently of the others, saving some power. For example, on Haswell, if you’re just using the eDP panel on pipe A, you can turn off the power well responsible for pipes B and C, saving some power on a usage case that is very common for laptop owners. Check the diagram on page 139 of the public Haswell documentation to see which pieces of the hardware are affected by the power well.

As you can see on drivers/gpu/drm/i915/intel_runtime_pm.c, different platforms have different sets of power wells, so we created the power domain concept in order to map our code abstractions to the different power wells on different platforms. For example, the display code has to grab POWER_DOMAIN_PIPE_B when it uses the pipe B, but the real power wells that will actually be grabbed when this power domain is grabbed depend on the platform. Just like the runtime PM subsystem, the power domains framework is based on reference counting. It is also worth mentioning that whenever any power domain is grabbed, we also call intel_runtime_pm_get().
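To make the mapping idea more concrete, here is a toy lookup from power domain to power wells. The domain names mirror the driver's naming, but the table itself is a simplified assumption for illustration and not a copy of intel_runtime_pm.c. The "example" platform and its wells are entirely made up.

```shell
# Toy model: resolve a power domain to the wells backing it on a platform.
wells_for_domain() {
    platform=$1
    domain=$2
    case "$platform:$domain" in
        hsw:POWER_DOMAIN_PIPE_A) echo "always-on" ;;
        hsw:POWER_DOMAIN_PIPE_B) echo "display" ;;
        hsw:POWER_DOMAIN_PIPE_C) echo "display" ;;
        example:POWER_DOMAIN_PIPE_B) echo "well-1 well-2" ;;  # made-up platform
        *) echo "unknown domain $domain on $platform" >&2; return 1 ;;
    esac
}

# The same request from the display code resolves to different wells.
wells_for_domain hsw POWER_DOMAIN_PIPE_B       # prints: display
wells_for_domain example POWER_DOMAIN_PIPE_B   # prints: well-1 well-2
```

The display code only ever names the domain, so platform differences stay in one table instead of being spread across every caller.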

Besides the power domains, there are other pieces of our code that grab RPM references, such as forcewake code – which prevents RC6 from being enabled – and a few others. To discover them, just grep our code for intel_runtime_pm_get() calls.

The difficulties

Developers: a new paradigm

Before all this got implemented, the state of power management on our driver was as simple as it could be: the hardware was always there for the driver, so any time the driver needed to interact with it – such as when reading or writing registers – it would succeed. After RPM and the power domains were implemented, the situation changed: the hardware might decide to go away for a nap, so if the driver does not explicitly prevent it from sleeping or explicitly wake it up, all the communication with the hardware will be ignored.

It is also important to remember that when the hardware gets runtime suspended – or when a specific power well is powered down – it may lose some of its state. So if you write a register, runtime suspend, resume, and then read the register, the value you read may not be the value you wrote. So now we need to make sure all the relevant hardware state is reprogrammed whenever we resume – or that we just don’t runtime suspend whenever there is state we need to keep.

And since the developers were all used to the previous model where they never needed to think about grabbing reference counts before doing things, we had a long period of time where regression after regression was added. This was a big pain: the developers and reviewers always forgot to grab the refcounts and forgot that the hardware might just go away and lose some of its state. Today the situation is much better since the power management concepts are now usually remembered during code writing or reviewing, but we’re still not regression-free, and we’ll probably never be – unless the developers get replaced by robots or something else.

Driver entry points

Another major problem contributing to the difficulty of RPM is that the graphics driver has way too many entry points. We have a massive interaction with drm.ko, so it calls many of our functions and we call many of its functions. We have a huge number of IOCTLs, some defined by our own driver and some inherited from drm.ko. We also have files in different places, such as sysfs and debugfs. We have the command submission infrastructure, which has IOCTLs as its entry points, but requires the hardware to be awake even after the IOCTL is over. We allow the user space to map graphics memory. We have many workqueues and delayed work functions. All this and more. Most of these interfaces require the hardware to be awake at some point – or to stay awake even after the interface is used – so it is really hard to guarantee that all the possible holes are covered by our reference count get() and put() calls.

Debugging

Based on all the difficulties listed above, it is easy to see that we can’t really be sure that we covered all the holes and that they will stay covered forever, leading to high anxiety levels for the developers and a lot of work for the bug triagers. So in order to reduce our anxiety levels we decided to add code to help us catch these possible problems.

While the upper layer and the driver entry points are many and difficult to check, we have a few specific functions at the end of our call stack that are responsible for actually touching the hardware. So on these functions we added some assertions. One of these assertions is a function called assert_device_not_suspended(), which is called whenever we do a register read or write, and also whenever we release forcewake. Another assertion we have is the unclaimed register checking code: for certain pieces of our hardware, if we try to read from or write to a register that does not exist, a specific bit in a specific register will change, so we can know that we did an operation that was ignored by the hardware. We also have a few other assertions that I can’t remember right now, but the ones explained are the most important. We are also probably missing assertions like these at other points of our code, and patches are welcome. Bug reports too.

As part of the Kernel, we also have some debugfs files that print information about the current state of some features, such as the reference counts for the power wells, among other things.

In addition to the Kernel code, we also use intel-gpu-tools (IGT) to try to catch RPM bugs. First of all, all the existing IGT tests can potentially trigger the assertions above, so all IGT tests are somehow helping test RPM. Second, we also added RPM subtests to some of the tests that already existed before RPM was implemented. Usually these subtests have the word suspend as part of their names. Third, we added a test called pm_rpm that has the goal to explicitly test the areas not covered by the other tests.

Most of the subtests of pm_rpm follow the same script: (i) put the device in a state where it can runtime suspend; (ii) assert it is runtime suspended; (iii) use one of the driver interfaces; (iv) make sure everything is fine, and the operation just done had the desired effect; and finally (v) make sure the device runtime suspends after everything is finished. We do have variations and other special things, but the main idea is this. If you’re interested in finding out more about the many interfaces between the Kernel and i915.ko, this test is a very good place to look.
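Steps (ii) and (v) boil down to waiting until the device reports that it is suspended. pm_rpm itself is written in C, so the helper below is only a sketch of the idea; its name and arguments are assumptions. It polls the runtime_status file of the standard Kernel runtime PM sysfs interface, which Part 1 describes.

```shell
# Poll runtime_status until it matches the wanted state or a timeout expires.
# $1: the device's power directory in sysfs
# $2: wanted status, e.g. "suspended"
# $3: timeout in seconds
wait_for_rpm_status() {
    power_dir=$1
    wanted=$2
    remaining=$3
    while :; do
        status=$(cat "$power_dir/runtime_status") || return 1
        [ "$status" = "$wanted" ] && return 0
        [ "$remaining" -le 0 ] && return 1
        sleep 1
        remaining=$((remaining - 1))
    done
}
```

For example, wait_for_rpm_status /sys/devices/pci0000:00/0000:00:02.0/power suspended 15 succeeds if the graphics device runtime suspends within 15 seconds and fails otherwise.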

So if you’re some distribution maintainer or someone else interested in making use of i915.ko runtime suspend, please run intel-gpu-tools in order to check the driver sanity. If you can’t run it all, please make sure you at least pass all tests of pm_rpm, and also make sure that none of these tests produce Kernel WARNs, BUGs or any other messages at the error level (hint: dmesg | egrep "(WARN|ERR|BUG)").
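If you want to turn the hint above into a pass/fail check for a script, something like this could work. The pattern is the one from the hint; the function name is an assumption, and grep -E is used as the equivalent of egrep.

```shell
# Read a kernel log on stdin, print the suspicious lines, and fail if any exist.
check_kernel_log() {
    if grep -E '(WARN|ERR|BUG)'; then
        return 1
    fi
    return 0
}
```

A possible use is dmesg | check_kernel_log && echo "log is clean" after a pm_rpm run. The pattern is deliberately broad, so an occasional false positive is to be expected.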

I found a problem, your driver sucks!

Well, congratulations! You’re one step away from becoming a Linux Kernel Contributor! That is going to look good on your resumé, isn’t it? You’re welcome.

If you read everything up to this point you can already imagine how complex everything is, so I hope you’re not mad at us. The very first thing which you need to do is to report the problem. While we do look at bugs reported through email, the best way to guarantee that your problem won’t be forgotten is to report a bug on freedesktop.org.

Now if you’re feeling adventurous, you can try to discover how to reproduce the bug. Once you do that, you can try to write a test case for intel-gpu-tools – possibly a subtest for pm_rpm.c. And if you really care, you can try to implement new assertions on our Kernel driver to prevent similar bugs. And now your resumé is going to look really good!


Paulo Zanoni - 12 May, 2015 - 01:23pm

For more than a year the Intel Linux Graphics team has been working on the Runtime Power Management (RPM) implementation for our Kernel driver, i915.ko. Since a feature is useless if nobody knows about it, I figured it’s time to document things a little bit in order to spread the knowledge.

Disclaimer

I work for Intel and I am a member of its Linux Kernel Graphics team. On this text I will talk about things that are related to my work at Intel, but this text reflects only my own opinion, not necessarily Intel’s opinion. Everything I talk about here is already publicly documented somewhere else.

What is it?

The basic goal of the i915.ko RPM implementation is to try to reduce the power consumption when all screens are disabled and the graphics device is idle. We basically shut down the whole graphics device, and put it in the PCI D3 state. This implementation uses the Kernel standard runtime PM framework, so the way you control i915.ko RPM should be the same way you control RPM for the other devices.

There are quite a few documents and presentations explaining Kernel RPM, you can learn a lot from them. Use your favorite search engine!

Which platforms support it?

For now, the platforms that support runtime suspend are just SNB and the newer ones, except for IVB. The power saving features make more sense on the HSW/VLV and newer platforms, but SNB support was added as a proof of concept that the RPM code could support multiple platforms. Also note that we don’t yet have support for IVB (which is between SNB and HSW), but adding this support would be very simple and would probably share most (all?) of the SNB code. In order to be 100% sure of what is supported, grab the Kernel source code, then look at the definition of the HAS_RUNTIME_PM macro inside drivers/gpu/drm/i915/i915_drv.h.

When do we runtime suspend?

First of all, you need a very recent Kernel, because the older ones don’t support RPM for our graphics device. Then you have to enable RPM for the graphics device. Then, all screens have to be disabled and no applications can be using the graphics device. After all these conditions are met, we start counting time. If no screen is re-enabled and no user-space application uses the GPU after some time, we runtime suspend. If something uses the GPU before the timeout, the timeout is reset after the application finishes using the GPU. In other words: we use the autosuspend delay from the Kernel RPM infrastructure. The current default autosuspend delay is 10 seconds, but it can be changed at runtime.

When I say “all screens have to be disabled”, I mean that, on Kernels older than 3.16, all CRTCs have to be completely unset, without attached framebuffers. But on Kernel 3.16 and newer, RPM can also happen if the screens are in DPMS state, which is a much more common case – just leave your machine idle and it will probably set the DPMS state at some point.

Why ten seconds?

Currently, the default autosuspend delay is set to 10 seconds, which means we will only runtime suspend if nothing uses the GPU for 10 seconds. This is a very conservative value: you won’t get to the runtime suspended state unless your applications are really quiet. Some applications keep waking up the GPU consistently every second, so on these environments you won’t really see RPM.
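The effect of periodic wakeups can be expressed as a tiny model. It is a deliberate simplification that treats the time the GPU spends working as zero, and the function name is made up for this sketch.

```shell
# Toy model: the autosuspend delay is $1 milliseconds and an application
# wakes the GPU every $2 milliseconds. Each wakeup restarts the countdown,
# so the device only suspends if the idle gap is longer than the delay.
reaches_runtime_suspend() {
    delay_ms=$1
    wake_interval_ms=$2
    [ "$wake_interval_ms" -gt "$delay_ms" ]
}

reaches_runtime_suspend 10000 1000 || echo "10s delay, 1s wakeups: never suspends"
reaches_runtime_suspend 500 1000 && echo "500ms delay, 1s wakeups: suspends between wakeups"
```

With the 10 second default, an application that touches the GPU once per second is enough to keep the device awake all the time.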

Even though I arbitrarily decided 10s to be the default value, I admit this value is too conservative, but there are a few reasons why I chose it.

  • With a big timeout, RPM is less likely to start happening on everybody’s machines at the same time, which minimizes the damage caused by possible bugs. When I proposed the 10s timeout, RPM was a new feature, and new features usually bring new bugs.
  • If the timeout value is too low, there’s a possibility that we may do way too many runtime suspend + resume cycles per second, and although I never measured this, it is pretty safe to assume that doing all these suspend + resume cycles will waste more power than just not doing them at all. If you’ve been idle for a whole ten seconds, it’s likely that you’ll remain idle for at least a few more seconds.
  • I was expecting someone – maybe even me – would do some very thorough experiments and conclude some ideal value: one that makes sure we don’t suspend + resume so frequently that we would waste power doing it, but that also makes sure we take all the opportunities we can to save power.
  • A conservative value would encourage people to analyze their user-space apps and tune them so they would wake-up the GPU – and maybe even the CPU – only when absolutely necessary, which would benefit not only the cases where we have RPM enabled, but also the cases where we are just using the applications.
  • The value can be changed at runtime! Do you want 1ms? You can have it!

Of course, due to all the complexities involved, I imagine that the ideal runtime suspend delay depends on the set of user-space applications that is running, so in theory the value should be different for every single use case.

Update: after I wrote this text, but before I published it, Daniel Vetter sent a patch proposing a new timeout of 100ms.

How do I play with it?

To be able to change most of these parameters you will need to be root. Since our graphics device is always PCI device 00:02.0, the files will always have the same locations:

  • /sys/devices/pci0000:00/0000:00:02.0/power/control: this is the file that you use to enable and disable graphics RPM. The values are non-trivial: “auto” means “RPM enabled”, while “on” means “RPM disabled”. Write to this file using “echo” to change the parameters, read it to check the current state.
  • /sys/devices/pci0000:00/0000:00:02.0/power/autosuspend_delay_ms: this is the file that controls how long the device has to be idle before we runtime suspend. Remember that the number is in milliseconds, so “10” means 10ms, not 10s. Write to this file to change the parameter, read it to check the current value.
  • /sys/devices/pci0000:00/0000:00:02.0/power/runtime_status: this is the file you use to check the RPM status. It can print “suspended”, “suspending”, “resuming” and “active”.
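
The three files above can be wrapped in a few small helpers. This is only a minimal sketch: the function names and the RPM_DIR variable are made up for illustration, and the default path assumes the graphics device sits at 00:02.0 as described above. Writing to the files needs root.

```shell
#!/bin/bash
# Hypothetical helpers around the sysfs files listed above.
# RPM_DIR is an illustrative variable; override it if your device lives elsewhere.
RPM_DIR="${RPM_DIR:-/sys/devices/pci0000:00/0000:00:02.0/power}"

# "auto" means RPM enabled, "on" means RPM disabled.
rpm_enable()  { echo auto > "$RPM_DIR/control"; }
rpm_disable() { echo on > "$RPM_DIR/control"; }

# Set the autosuspend delay; the argument is in milliseconds.
rpm_set_delay_ms() { echo "$1" > "$RPM_DIR/autosuspend_delay_ms"; }

# Print the current values of all three files on one line.
rpm_show() {
    printf 'control=%s delay_ms=%s status=%s\n' \
        "$(cat "$RPM_DIR/control")" \
        "$(cat "$RPM_DIR/autosuspend_delay_ms")" \
        "$(cat "$RPM_DIR/runtime_status")"
}
```

For example, running rpm_enable, then rpm_set_delay_ms 100, then rpm_show as root would enable RPM with a 100ms delay and print the resulting state.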

Due to some funny interactions between the graphics and audio devices (needed for things like HDMI audio), you will also need to enable audio runtime PM. I won’t go into details of how audio RPM works because I really don’t know much about it, but I know that to enable it you run:

echo 1 > /sys/module/snd_hda_intel/parameters/power_save
echo auto > /sys/bus/pci/devices/0000:00:03.0/power/control

Or you can just blacklist the snd_hda_intel module, but that means you will lose its features.

As an alternative to all the instructions above, you can also try to just run:

sudo powertop --auto-tune

This command will not only enable graphics and audio RPM, but also try to tune your whole system to use less power. While this is good, it can also have some bad effects, such as disabling your USB mouse. If something annoys you, you can then run sudo powertop, go to the tunables tab and undo what you want.

After all this, to force runtime PM you can do this:

  • If your Kernel supports runtime PM for DPMS, you can just run:
    xset dpms force off
    In this case, if you want to enable the screens again, you just need to move the mouse.
  • If your Kernel is older, you can run:
    xrandr
    Then, for each connected output, you run:
    xrandr --output $output --off
    In this case, if you want to enable the screens again, you will have to run, for each output:
    xrandr --output $output --auto
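
The "for each connected output" step for older Kernels can be scripted. A minimal sketch, assuming the usual xrandr listing where each output line starts with the output name followed by the word "connected" or "disconnected"; the helper name connected_outputs is made up for illustration:

```shell
#!/bin/bash
# Hypothetical helper: read `xrandr` output on stdin and print the name of
# every output whose second field is exactly "connected".
connected_outputs() {
    awk '$2 == "connected" { print $1 }'
}

# Usage, turning every connected output off and later back on:
#   for output in $(xrandr | connected_outputs); do xrandr --output "$output" --off; done
#   for output in $(xrandr | connected_outputs); do xrandr --output "$output" --auto; done
```

Note that once the outputs are off you will be typing the second loop blind, so it may be easier to run it from a script or over ssh.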

If you boot with drm.debug=0xe, you can look at the output of dmesg and look for “Device suspended” and “Device resumed” messages. This can be done with:

dmesg | grep intel_runtime_
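
To get a quick feel for how often the device actually suspends, the messages can be counted instead of read one by one. A minimal sketch, assuming the log lines contain the literal strings “Device suspended” and “Device resumed” mentioned above; the function name count_rpm_events is made up for illustration:

```shell
#!/bin/bash
# Hypothetical helper: count runtime suspend and resume messages in a kernel
# log read from stdin.
count_rpm_events() {
    awk '/Device suspended/ { s++ }
         /Device resumed/   { r++ }
         END { printf "suspended=%d resumed=%d\n", s, r }'
}

# Usage: dmesg | count_rpm_events
```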

What is the difference between runtime suspend and the usual suspend?

The “suspend” that everybody is used to is something that happens for the whole platform. All devices and applications stop, the machine enters the S3 state, and you just can’t use it while it remains in that state. The runtime suspend feature allows only the devices that are currently unused on the system to be suspended, so all the other devices still work, and user space also keeps working. Then, when something (such as user space) needs access to the device, it gets resumed.

What’s next?

In the next part of our documentation I will explain the development aspect of the RPM feature, including the feature design, its problems and how to debug them.

Credits

I have to mention that even though I did a lot of work on the RPM subsystem of our driver, a large number of other developers did huge amounts of work here too, so they have to take the credit! Since pretty much everybody on our team helped in one way or another – writing features, enabling specific platforms, giving design feedback, reviewing patches, reporting bugs, adding regressions, etc. – I am not going to name specific people: everybody deserves the credit. Also, a lot of people who are not on the Intel Linux Kernel Graphics team contributed to this work too.


Ben Widawsky - 08 May, 2015 - 12:33pm

I’ve had this post sitting in my drafts for the last 7 months. The code is stale, but the concepts are correct. I had intended to add some pictures before posting, but it’s clear that won’t happen now. Words are better than nothing, I suppose…

Recently I pushed an intel-gpu-tools tool for modifying GPU frequencies. It’s meant to simplify using the sysfs interfaces, similar to many other such helper tools in Linux (cpupower has become one of my recent favorites). I wrote this fancy getopt example^w^w^w program to address a specific need I had, but saw its uses for the community as well. Some time after upstreaming the tool, I accidentally put the name of this tool into my favorite search engine (yes, after all these years X clipboard still confuses me). Surprisingly I was greeted by a discussion about the tool. None of it was terribly misinformed, but I figured I might set the record straight anyway.

Introduction

Dynamically changing frequencies is a difficult thing to accomplish. Typically there are multiple clock domains in a chip, and coordinating all the necessary ones in order to modify one probably requires a bunch of state which I’d not understand, much less be able to explain. To facilitate this (on Gen6+) there is firmware which does whatever it is that firmwares do to change frequencies. When we talk about changing frequencies from a Linux kernel driver perspective, it means we’re asking the firmware for a frequency. It can, and does, overrule, balk at and ignore our frequency requests.

A brief explanation on frequencies in i915

The term that is used within the kernel driver and docs is “RPS”, which is [I believe] short for Render P-States. They are analogous to CPU P-states in that lower numbers are faster, higher numbers are slower. Conveniently, we only have two numbers, 0 and 1, on GEN. Past that, I don’t know how CPU P-States work, so I’m not sure how much else is similar.

There are roughly 4 generations of RPS:

  1. IPS (not to be confused with Intermediate Pixel Storage). The implementation of this predates my knowledge of this part of the hardware, so I can’t speak much about it. It stopped being a thing after Ironlake (Gen5). I don’t care to look, but you can if you want: drivers/platform/x86/intel_ips.c
  2. RPS (Sandybridge, Ivybridge)
    There are 4 numbers of interest: RP0, RP1, RPn, “hw max”. The first 3 are directly read from a register:

    rp_state_cap = I915_READ(GEN6_RP_STATE_CAP);
    dev_priv->rps.rp0_freq = (rp_state_cap >> 0) & 0xff;
    dev_priv->rps.rp1_freq = (rp_state_cap >> 8) & 0xff;
    dev_priv->rps.min_freq = (rp_state_cap >> 16) & 0xff;

    RP0 is the maximum value the driver can request from the firmware. It’s the highest non-overclocked frequency supported.
    RPn is the minimum value the driver can request from the firmware. It’s the lowest frequency supported.
    RP1 is the most efficient frequency.
    hw_max is RP0 if the system does not support overclocking.
    Otherwise, it is read through a special set of commands where we’re told by the firmware the real max. The overclocking max typically cannot be sustained for long durations due to thermal considerations, but that is transparent to software.

  3. RPS (HSW+)
    Similar to the previous RPS, except there is an RPe (for efficient). I don’t know how it differs from RP1. I just learned about this after posting the tool – but just be aware it’s different. Baytrail and Cherryview also have a concept of RPe.
  4. The Atom based stuff (Baytrail, Cherryview)
    I don’t pretend to know how this works [or doesn’t]. They use similar terms.
    They have the same set of names as the HSW+ RPS.
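
The register decoding shown for Sandybridge and Ivybridge can be tried out without a kernel build. This is a minimal sketch, not the driver's code: the function name is made up, the example register value is invented, and the 50MHz multiplier is an assumption based on the step size mentioned later in this post.

```shell
#!/bin/bash
# Hypothetical helper: split a raw RP_STATE_CAP value into RP0, RP1 and RPn,
# using the same shifts and masks as the driver snippet above.
# $1: raw register value (decimal or 0x-prefixed hex)
# $2: MHz per unit; the default of 50 is an assumption, not taken from a spec
decode_rp_state_cap() {
    local cap=$(( $1 ))
    local mult=${2:-50}
    local rp0=$(( (cap >> 0) & 0xff ))
    local rp1=$(( (cap >> 8) & 0xff ))
    local rpn=$(( (cap >> 16) & 0xff ))
    printf 'RP0=%dMHz RP1=%dMHz RPn=%dMHz\n' \
        $(( rp0 * mult )) $(( rp1 * mult )) $(( rpn * mult ))
}

# Example with an invented register value:
#   decode_rp_state_cap 0x070d16
#   prints: RP0=1100MHz RP1=650MHz RPn=350MHz
```
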

How the driver uses it

The driver can make requests to the firmware for a desired frequency:

if (IS_HASWELL(dev) || IS_BROADWELL(dev))
        I915_WRITE(GEN6_RPNSWREQ, HSW_FREQUENCY(val));
else
        I915_WRITE(GEN6_RPNSWREQ,
                   GEN6_FREQUENCY(val) |
                   GEN6_OFFSET(0) |
                   GEN6_AGGRESSIVE_TURBO);

This register interface doesn’t provide a way to determine whether the request was granted other than reading back the current frequency, which is error prone, as explained below.

By default, the driver will request frequencies between the efficient frequency (RP1, or RPe), and the max frequency (RP0 or hwmax) based on the system’s busyness. The busyness can be calculated either by software, or by the hardware. For the former, the driver can periodically read a register to get busyness and make decisions based on that:

render_count = I915_READ(VLV_RENDER_C0_COUNT_REG);

In the latter case, the firmware will itself measure busyness and give the driver an interrupt when it determines that the GPU is sufficiently overworked or underworked. At each interrupt the driver would raise or lower the frequency by the smallest step size (typically 50MHz), and continue on its way. The most complex thing we did (which we still managed to screw up) was disable interrupts telling us to go up when we were already at max, and the equivalent for down.
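
The naive interrupt-driven policy described above boils down to "step once, then clamp". A minimal sketch of that logic as a toy model, not the driver's implementation; the function name and argument order are made up for illustration:

```shell
#!/bin/bash
# Toy model of the naive policy: on an "up" or "down" event, move the current
# frequency by one step and clamp it to the [min, max] range.
# Usage: next_freq CUR EVENT MIN MAX [STEP]   (STEP defaults to 50, in MHz)
next_freq() {
    local cur=$1 event=$2 min=$3 max=$4 step=${5:-50}
    case "$event" in
        up)   cur=$(( cur + step )) ;;
        down) cur=$(( cur - step )) ;;
    esac
    if (( cur > max )); then cur=$max; fi
    if (( cur < min )); then cur=$min; fi
    echo "$cur"
}
```

An "up" event at max returns the same frequency, which is why the driver is better off disabling up interrupts once it is already at max, and likewise for down at min.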

It seems obvious that there are usual trends: if you’re incrementing the frequency, you’re more likely to increment again in the near future, and such. Since I left for sabbatical and returned to work on Mesa, Chris Wilson has added a lot of complexity here, things which mimic concepts like the ondemand CPU frequency governor. I never looked much into those, so I can’t talk knowledgeably about them – just realize it’s not as naive as it once was, and should do a better job as a result.

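The flow seems a bit ping-pongish: the firmware interrupts the driver, the driver asks the firmware for a new frequency, and the firmware decides what to actually do with the request.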

The benefit, though, is that the firmware can do all the dirty work, and the CPU can sleep. Particularly when there’s nothing going on in the system, that should provide significant power savings.

Why you shouldn’t set --max or --min

First, what does it do? --max and --min lock the GPU frequency to max and min respectively. What this actually means is that in the driver, even if we get interrupts to throttle up, or throttle down, we ignore them (hopefully the driver will disable such useless interrupts, but I am too scared to check). I also didn’t check what this does when we’re using software to determine busyness, but it should be very similar.

I should mention now that this is what I’ve inferred through careful observation^w^w random guessing. In the world of thermals and overheating, things can go from good, to bad, to broke faster than the CPU can take an interrupt and adjust things. As a result, the firmware can and will lower frequencies even if it has previously acknowledged that it can give a specific frequency.

As an example if you do:

intel_gpu_frequency --set X
assert (intel_gpu_frequency --get == X)

There is a period in the middle, and another after the assert, where the firmware may have done who knows what. Furthermore, if we try to lock to --max, the GPU is more likely to hit these overheating conditions and throttle you down. So --max doesn’t even necessarily get you max. It’s sort of an interesting implication, since one would think these same conditions (probably the CPU is heavily used as well) would end up clocking up all the way anyway, and we’d get into the same spot, but I’m not really sure. Perhaps the firmware won’t tell us to throttle up so aggressively when it’s near its thermal limits. Using --max can actually result in non-optimal performance, and I have some good theories why, but I’ll keep them to myself since I really have no proof.

--min on the other hand is equally stupid for a different and more obvious reason: it’s guaranteed to provide the worst possible performance and not guaranteed to provide optimal power usage.

What you should do with it

The usages are primarily benchmarking and performance debug. Assuming you can sustain decent cooling, locking the GPU frequency to anything will give you the most consistent results from run to run. Presumably max, or near max, will be optimal.

--min is useful to get a measure of what the worst possible performance is to see how it might impact various optimizations. It can help you change the ratio of GPU to CPU frequency drastically (and easily). I don’t actually expect anyone to use this very much.

How not to get shot in the foot

If you are a developer trying to get performance optimizations or measurements which you think can be negatively impacted by the GPU throttling, you can set this value. Again because of the thermal considerations, as you go from run to run making tweaks, I’d recommend setting something maybe 75% or so of max – that is a total ballpark figure. When you’re ready to get some absolute numbers, you can try setting --max, and the few frequencies near max, to see if you have any unexpected results. Take the highest value when done.
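
The “75% or so of max” ballpark can be turned into a concrete number to lock to. A minimal sketch: the function name is made up, and rounding down to a multiple of 50MHz is an assumption based on the typical step size mentioned earlier.

```shell
#!/bin/bash
# Hypothetical helper: compute a ballpark frequency to lock to.
# $1: max frequency in MHz
# $2: percentage of max (default 75, the ballpark suggested above)
# $3: step size in MHz to round down to (default 50, an assumption)
ballpark_freq() {
    local max=$1 pct=${2:-75} step=${3:-50}
    local target=$(( max * pct / 100 ))
    echo $(( target / step * step ))
}

# Example: with a max of 1100MHz, 75% is 825MHz, rounded down to 800MHz.
#   intel_gpu_frequency --set "$(ballpark_freq 1100)"
```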

The sysfs entries require root for a reason. Any time

Footnotes

On the surface it would seem that the minimum frequency should always use the least amount of power. At idle, I’d assert that is always true. The corollary to that is that at minimum frequency the GPU also finishes the workload the slowest. Intel GPUs don’t idle for long, they go into RC6. The efficient frequency is a blend of maintaining a low frequency and winning the race to idle. AFAIK, it’s a number selected after a whole lot of test runs – we could ignore it.


Daniel Vetter - 08 May, 2015 - 06:21am
Upstreaming requirements for the DRM subsystem are a bit special since Dave Airlie requires a full-blown open-source implementation as a demonstration vehicle for any new interfaces added. I've figured it's better to clear this up once instead of dealing with the fallout from surprises and made a few slides for a training session. Dave reviewed and acked them, hence this should be the up-to-date rules - the old mails and blogs back from when some ARM SoC vendors tried to push drm drivers for blob userspace to upstream are a bit outdated.

And here are the slides for my gfx kernel upstreaming requirements training.

Daniel Vetter - 19 Apr, 2015 - 08:57pm
With Linux kernel v3.20^W v4.0 already out the door my overview of what's in 4.1 for drm/i915 is way overdue.

First looking at the modeset side of the driver the big overall thing is all the work to convert i915 to atomic. In this release there's code from Ander Conselvan de Oliveira to have a struct drm_atomic_state allocated for all the legacy modeset code paths in the driver. With that we can switch the internals to start using atomic state objects and gradually convert over everything to atomic on the modeset side. Matt Roper on the other hand was busy preparing the plane code to land the atomic watermark update code. Damien has reworked the initial plane configuration code used for fastboot, which also needs to be adapted to the atomic world.


For more specific feature work there's the DRRS (dynamic refresh rate switching) from Sonika, Vandana and more people, which is now enabled where supported. The idea is to reduce the refresh rate of the panel to save power when nothing changes on the screen. And Paulo Zanoni has provided patches to improve the FBC code, so hopefully we can enable that by default soon too. Under the hood Ville has refactored the DP link rate computation and the sprite color key handling, both to prepare for future work and platform enabling. Intermediate link rate support for eDP 1.4 from Sonika builds on top of this. Imre Deak has also reworked the Baytrail/Braswell DPLL code to prepare for Broxton.

Speaking of platforms, Skylake has gained runtime PM support from Damien, and RPS (render turbo and sleep states) from Akash. Another SKL exclusive is support for scanout of Y-tiled buffers and scanning out buffers rotated by 90°/270° (instead of just normal and rotated by 180°) from Tvrtko and Damien. Well, the rotation support didn't quite land yet, but Tvrtko's support for the special pagetable binding needed for that feature, in the form of rotated GGTT views, did. Finally Nick Hoath and Damien also submitted a lot of workaround patches for SKL.

Moving on to Braswell/Cherryview there have been tons of fixes to the DPLL and watermark code from Vijay and Ville, and BSW has left preliminary hw support. And also for the SoC platforms Chris Wilson has supplied a pile of patches to tune the rps code and bring it more in-line with the big core platforms.

On the GT side the big ongoing work is dynamic pagetable allocations from Michel Thierry, based upon patches from Ben Widawsky. With per-process address spaces and even more so with the big address spaces gen8+ supports it would be wasteful if not impossible to allocate pagetables for the entire address space upfront. But changing the code to handle another possible memory allocation failure point needed a lot of work. Most of that has landed now, but the benefits of enabling bigger address spaces haven't made it into 4.1.

Another big piece of work is XenGT client-side support from Yu Zhang and team. This is paravirtualization to allow virtual machines to tap into the render engines without requiring exclusive access, but also with a lot less overhead than full virtual hardware like vmware or virgil would provide. The host-side code is also submitted already, but needs a bit more work still to integrate cleanly into the driver.

And of course there's been lots of other smaller work all over, as usual. Internal documentation for the shrinker, more dead UMS code removed, the vblank interrupt code cleaned up and more.