linux

Commit Graph

Author	SHA1	Message	Date
Stratos Karafotis	00bfe05889	cpufreq: conservative: Decrease frequency faster for deferred updates Conservative governor changes the CPU frequency in steps. That means that if a CPU runs at max frequency, it will need several sampling periods to return to min frequency when the workload is finished. If the update function that calculates the load and target frequency is deferred, the governor might need even more time to decrease the frequency. This may have impact to power consumption and after all conservative should decrease the frequency if there is no workload at every sampling rate. To resolve the above issue calculate the number of sampling periods that the update is deferred. Considering that for each sampling period conservative should drop the frequency by a freq_step because the CPU was idle apply the proper subtraction to requested frequency. Below, the kernel trace with and without this patch. First an intensive workload is applied on a specific CPU. Then the workload is removed and the CPU goes to idle. WITHOUT <idle>-0 [007] dN.. 620.329153: cpu_idle: state=4294967295 cpu_id=7 kworker/7:2-556 [007] .... 620.350857: cpu_frequency: state=1700000 cpu_id=7 kworker/7:2-556 [007] .... 620.370856: cpu_frequency: state=1900000 cpu_id=7 kworker/7:2-556 [007] .... 620.390854: cpu_frequency: state=2100000 cpu_id=7 kworker/7:2-556 [007] .... 620.411853: cpu_frequency: state=2200000 cpu_id=7 kworker/7:2-556 [007] .... 620.432854: cpu_frequency: state=2400000 cpu_id=7 kworker/7:2-556 [007] .... 620.453854: cpu_frequency: state=2600000 cpu_id=7 kworker/7:2-556 [007] .... 620.494856: cpu_frequency: state=2900000 cpu_id=7 kworker/7:2-556 [007] .... 620.515856: cpu_frequency: state=3100000 cpu_id=7 kworker/7:2-556 [007] .... 620.536858: cpu_frequency: state=3300000 cpu_id=7 kworker/7:2-556 [007] .... 620.557857: cpu_frequency: state=3401000 cpu_id=7 <idle>-0 [007] d... 669.591363: cpu_idle: state=4 cpu_id=7 <idle>-0 [007] d... 669.591939: cpu_idle: state=4294967295 cpu_id=7 <idle>-0 [007] d... 669.591980: cpu_idle: state=4 cpu_id=7 <idle>-0 [007] dN.. 669.591989: cpu_idle: state=4294967295 cpu_id=7 ... <idle>-0 [007] d... 670.201224: cpu_idle: state=4 cpu_id=7 <idle>-0 [007] d... 670.221975: cpu_idle: state=4294967295 cpu_id=7 kworker/7:2-556 [007] .... 670.222016: cpu_frequency: state=3300000 cpu_id=7 <idle>-0 [007] d... 670.222026: cpu_idle: state=4 cpu_id=7 <idle>-0 [007] d... 670.234964: cpu_idle: state=4294967295 cpu_id=7 ... <idle>-0 [007] d... 670.801251: cpu_idle: state=4 cpu_id=7 <idle>-0 [007] d... 671.236046: cpu_idle: state=4294967295 cpu_id=7 kworker/7:2-556 [007] .... 671.236073: cpu_frequency: state=3100000 cpu_id=7 <idle>-0 [007] d... 671.236112: cpu_idle: state=4 cpu_id=7 <idle>-0 [007] d... 671.393437: cpu_idle: state=4294967295 cpu_id=7 ... <idle>-0 [007] d... 671.401277: cpu_idle: state=4 cpu_id=7 <idle>-0 [007] d... 671.404083: cpu_idle: state=4294967295 cpu_id=7 kworker/7:2-556 [007] .... 671.404111: cpu_frequency: state=2900000 cpu_id=7 <idle>-0 [007] d... 671.404125: cpu_idle: state=4 cpu_id=7 <idle>-0 [007] d... 671.404974: cpu_idle: state=4294967295 cpu_id=7 ... <idle>-0 [007] d... 671.501180: cpu_idle: state=4 cpu_id=7 <idle>-0 [007] d... 671.995414: cpu_idle: state=4294967295 cpu_id=7 kworker/7:2-556 [007] .... 671.995459: cpu_frequency: state=2800000 cpu_id=7 <idle>-0 [007] d... 671.995469: cpu_idle: state=4 cpu_id=7 <idle>-0 [007] d... 671.996287: cpu_idle: state=4294967295 cpu_id=7 ... <idle>-0 [007] d... 672.001305: cpu_idle: state=4 cpu_id=7 <idle>-0 [007] d... 672.078374: cpu_idle: state=4294967295 cpu_id=7 kworker/7:2-556 [007] .... 672.078410: cpu_frequency: state=2600000 cpu_id=7 <idle>-0 [007] d... 672.078419: cpu_idle: state=4 cpu_id=7 <idle>-0 [007] d... 672.158020: cpu_idle: state=4294967295 cpu_id=7 kworker/7:2-556 [007] .... 672.158040: cpu_frequency: state=2400000 cpu_id=7 <idle>-0 [007] d... 672.158044: cpu_idle: state=4 cpu_id=7 <idle>-0 [007] d... 672.160038: cpu_idle: state=4294967295 cpu_id=7 ... <idle>-0 [007] d... 672.234557: cpu_idle: state=4 cpu_id=7 <idle>-0 [007] d... 672.237121: cpu_idle: state=4294967295 cpu_id=7 kworker/7:2-556 [007] .... 672.237174: cpu_frequency: state=2100000 cpu_id=7 <idle>-0 [007] d... 672.237186: cpu_idle: state=4 cpu_id=7 <idle>-0 [007] d... 672.237778: cpu_idle: state=4294967295 cpu_id=7 ... <idle>-0 [007] d... 672.267902: cpu_idle: state=4 cpu_id=7 <idle>-0 [007] d... 672.269860: cpu_idle: state=4294967295 cpu_id=7 kworker/7:2-556 [007] .... 672.269906: cpu_frequency: state=1900000 cpu_id=7 <idle>-0 [007] d... 672.269914: cpu_idle: state=4 cpu_id=7 <idle>-0 [007] d... 672.271902: cpu_idle: state=4294967295 cpu_id=7 ... <idle>-0 [007] d... 672.751342: cpu_idle: state=4 cpu_id=7 <idle>-0 [007] d... 672.823056: cpu_idle: state=4294967295 cpu_id=7 kworker/7:2-556 [007] .... 672.823095: cpu_frequency: state=1600000 cpu_id=7 WITH <idle>-0 [007] dN.. 4380.928009: cpu_idle: state=4294967295 cpu_id=7 kworker/7:2-399 [007] .... 4380.949767: cpu_frequency: state=2000000 cpu_id=7 kworker/7:2-399 [007] .... 4380.969765: cpu_frequency: state=2200000 cpu_id=7 kworker/7:2-399 [007] .... 4381.009766: cpu_frequency: state=2500000 cpu_id=7 kworker/7:2-399 [007] .... 4381.029767: cpu_frequency: state=2600000 cpu_id=7 kworker/7:2-399 [007] .... 4381.049769: cpu_frequency: state=2800000 cpu_id=7 kworker/7:2-399 [007] .... 4381.069769: cpu_frequency: state=3000000 cpu_id=7 kworker/7:2-399 [007] .... 4381.089771: cpu_frequency: state=3100000 cpu_id=7 kworker/7:2-399 [007] .... 4381.109772: cpu_frequency: state=3400000 cpu_id=7 kworker/7:2-399 [007] .... 4381.129773: cpu_frequency: state=3401000 cpu_id=7 <idle>-0 [007] d... 4428.226159: cpu_idle: state=1 cpu_id=7 <idle>-0 [007] d... 4428.226176: cpu_idle: state=4294967295 cpu_id=7 <idle>-0 [007] d... 4428.226181: cpu_idle: state=4 cpu_id=7 <idle>-0 [007] d... 4428.227177: cpu_idle: state=4294967295 cpu_id=7 ... <idle>-0 [007] d... 4428.551640: cpu_idle: state=4 cpu_id=7 <idle>-0 [007] d... 4428.649239: cpu_idle: state=4294967295 cpu_id=7 kworker/7:2-399 [007] .... 4428.649268: cpu_frequency: state=2800000 cpu_id=7 <idle>-0 [007] d... 4428.649278: cpu_idle: state=4 cpu_id=7 <idle>-0 [007] d... 4428.689856: cpu_idle: state=4294967295 cpu_id=7 ... <idle>-0 [007] d... 4428.799542: cpu_idle: state=4 cpu_id=7 <idle>-0 [007] d... 4428.801683: cpu_idle: state=4294967295 cpu_id=7 kworker/7:2-399 [007] .... 4428.801748: cpu_frequency: state=1700000 cpu_id=7 <idle>-0 [007] d... 4428.801761: cpu_idle: state=4 cpu_id=7 <idle>-0 [007] d... 4428.806545: cpu_idle: state=4294967295 cpu_id=7 ... <idle>-0 [007] d... 4429.051880: cpu_idle: state=4 cpu_id=7 <idle>-0 [007] d... 4429.086240: cpu_idle: state=4294967295 cpu_id=7 kworker/7:2-399 [007] .... 4429.086293: cpu_frequency: state=1600000 cpu_id=7 Without the patch the CPU dropped to min frequency after 3.2s With the patch applied the CPU dropped to min frequency after 0.86s Signed-off-by: Stratos Karafotis <stratosk@semaphore.gr> Acked-by: Viresh Kumar <viresh.kumar@linaro.org> Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>	2016-11-16 23:15:56 +01:00
Viresh Kumar	26f0dbc9ab	cpufreq: governor: Don't use 'timer' keyword The earlier implementation of governors used background timers and so functions, mutex, etc had 'timer' keyword in their names. But that's not true anymore. Replace 'timer' with 'update', as those functions, variables are based around updates to frequency. Signed-off-by: Viresh Kumar <viresh.kumar@linaro.org> Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>	2016-11-11 01:48:33 +01:00
Rafael J. Wysocki	58919e83c8	cpufreq / sched: Pass flags to cpufreq_update_util() It is useful to know the reason why cpufreq_update_util() has just been called and that can be passed as flags to cpufreq_update_util() and to the ->func() callback in struct update_util_data. However, doing that in addition to passing the util and max arguments they already take would be clumsy, so avoid it. Instead, use the observation that the schedutil governor is part of the scheduler proper, so it can access scheduler data directly. This allows the util and max arguments of cpufreq_update_util() and the ->func() callback in struct update_util_data to be replaced with a flags one, but schedutil has to be modified to follow. Thus make the schedutil governor obtain the CFS utilization information from the scheduler and use the "RT" and "DL" flags instead of the special utilization value of ULONG_MAX to track updates from the RT and DL sched classes. Make it non-modular too to avoid having to export scheduler variables to modules at large. Next, update all of the other users of cpufreq_update_util() and the ->func() callback in struct update_util_data accordingly. Suggested-by: Peter Zijlstra <peterz@infradead.org> Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com> Acked-by: Peter Zijlstra (Intel) <peterz@infradead.org> Acked-by: Viresh Kumar <viresh.kumar@linaro.org>	2016-08-16 22:14:55 +02:00
Rafael J. Wysocki	f6709b8aa7	cpufreq: governor: Drop gov_cancel_work() There's no reason for gov_cancel_work() to exist at all, as it only has one caller and the only thing done by that caller is to invoke gov_cancel_work(). Accordingly, drop gov_cancel_work() and move its contents to the caller. Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com> Acked-by: Viresh Kumar <viresh.kumar@linaro.org>	2016-06-09 23:39:29 +02:00
Rafael J. Wysocki	9a15fb2c79	cpufreq: Drop the 'initialized' field from struct cpufreq_governor The 'initialized' field in struct cpufreq_governor is only used by the conservative governor (as a usage counter) and the way that happens is far from straightforward and arguably incorrect. Namely, the value of 'initialized' is checked by cpufreq_dbs_governor_init() and cpufreq_dbs_governor_exit() and the results of those checks are passed (as the second argument) to the ->init() and ->exit() callbacks in struct dbs_governor. Those callbacks are only implemented by the ondemand and conservative governors and ondemand doesn't use their second argument at all. In turn, the conservative governor uses it to decide whether or not to either register or unregister a transition notifier. That whole mechanism is not only unnecessarily convoluted, but also racy, because the 'initialized' field of struct cpufreq_governor is updated in cpufreq_init_governor() and cpufreq_exit_governor() under policy->rwsem which doesn't help if one of these functions is run twice in parallel for different policies (which isn't impossible in principle), for example. Instead of it, add a proper usage counter to the conservative governor and update it from cs_init() and cs_exit() which is guaranteed to be non-racy, as those functions are only called under gov_dbs_data_mutex which is global. With that in place, drop the 'initialized' field from struct cpufreq_governor as it is not used any more. Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com> Acked-by: Viresh Kumar <viresh.kumar@linaro.org>	2016-06-02 23:24:39 +02:00
Viresh Kumar	bf2be2de84	cpufreq: governor: Create cpufreq_policy_apply_limits() Create a new helper to avoid code duplication across governors. Signed-off-by: Viresh Kumar <viresh.kumar@linaro.org> Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>	2016-06-02 23:24:39 +02:00
Viresh Kumar	666f4ccc5d	cpufreq: governor: Remove unnecessary bits from print message pr_*() helpers already prefix the print messages with "cpufreq_governor:" and similar details aren't required in the actual message. For example, the print message getting fixed looks like this before this patch: cpufreq_governor: cpufreq: Governor initialization failed (dbs_data kobject init error 0) Signed-off-by: Viresh Kumar <viresh.kumar@linaro.org> Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>	2016-06-02 23:24:38 +02:00
Rafael J. Wysocki	e788892ba3	cpufreq: governor: Get rid of governor events The design of the cpufreq governor API is not very straightforward, as struct cpufreq_governor provides only one callback to be invoked from different code paths for different purposes. The purpose it is invoked for is determined by its second "event" argument, causing it to act as a "callback multiplexer" of sorts. Unfortunately, that leads to extra complexity in governors, some of which implement the ->governor() callback as a switch statement that simply checks the event argument and invokes a separate function to handle that specific event. That extra complexity can be eliminated by replacing the all-purpose ->governor() callback with a family of callbacks to carry out specific governor operations: initialization and exit, start and stop and policy limits updates. That also turns out to reduce the code size too, so do it. Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com> Acked-by: Viresh Kumar <viresh.kumar@linaro.org>	2016-06-02 23:24:15 +02:00
Rafael J. Wysocki	9485e4ca0b	cpufreq: governor: Fix handling of special cases in dbs_update() As reported in KBZ 69821: "With CONFIG_HZ_PERIODIC=y cpu stays at the lowest frequcency 800MHz even if usage goes to 100%, frequency does not scale up, the governor in use is ondemand. Neither works conservative. Performance and userspace governors work as expected. With CONFIG_NO_HZ_IDLE or CONFIG_NO_HZ_FULL cpu scales up with ondemand as expected." Analysis carried out by Chen Yu leads to the conclusion that the observed issue is due to idle_time in dbs_update() representing a negative number in which case the function will return 0 as the load (unless load is greater than 0 for another CPU sharing the policy), although that need not be the right choice. Indeed, idle_time representing a negative number means that during the last sampling interval the CPU was almost 100% busy on the rough average, so 100 should be returned as the load in that case. Modify the code accordingly and rearrange it to clarify the handling of all of the special cases in it. While at it, also avoid returning zero as the load if time_elapsed is 0 (it doesn't really make sense to return 0 then). Link: https://bugzilla.kernel.org/show_bug.cgi?id=69821 Tested-by: Chen Yu <yu.c.chen@intel.com> Tested-by: Timo Valtoaho <timo.valtoaho@gmail.com> Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com> Acked-by: Viresh Kumar <viresh.kumar@linaro.org>	2016-05-06 14:24:23 +02:00
Rafael J. Wysocki	b4f4b4b371	cpufreq: governor: Change confusing struct field and variable names The name of the prev_cpu_wall field in struct cpu_dbs_info is confusing, because it doesn't represent wall time, but the previous update time as returned by get_cpu_idle_time() (that may be the current value of jiffies_64 in some cases, for example). Moreover, the names of some related variables in dbs_update() take that confusion further. Rename all of those things to make their names reflect the purpose more accurately. While at it, drop unnecessary parens from one of the updated expressions. No functional changes. Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com> Acked-by: Viresh Kumar <viresh.kumar@linaro.org> Acked-by: Chen Yu <yu.c.chen@intel.com>	2016-04-28 15:10:08 +02:00
Rafael J. Wysocki	ba1ca654f3	cpufreq: governor: Fix prev_load initialization in cpufreq_governor_start() The way cpufreq_governor_start() initializes j_cdbs->prev_load is questionable. First off, j_cdbs->prev_cpu_wall used as a denominator in the computation may be zero. The case this happens is when get_cpu_idle_time_us() returns -1 and get_cpu_idle_time_jiffy() used to return that number is called exactly at the jiffies_64 wrap time. It is rather hard to trigger that error, but it is not impossible and it will just crash the kernel then. Second, j_cdbs->prev_load is computed as the average load during the entire time since the system started and it may not reflect the load in the previous sampling period (as it is expected to). That doesn't play well with the way dbs_update() uses that value. Namely, if the update time delta (wall_time) happens do be greater than twice the sampling rate on the first invocation of it, the initial value of j_cdbs->prev_load (which may be completely off) will be returned to the caller as the current load (unless it is equal to zero and unless another CPU sharing the same policy object has a greater load value). For this reason, notice that the prev_load field of struct cpu_dbs_info is only used by dbs_update() and only in that one place, so if cpufreq_governor_start() is modified to always initialize it to 0, it will make dbs_update() always compute the actual load first time it checks the update time delta against the doubled sampling rate (after initialization) and there won't be any side effects of it. Consequently, modify cpufreq_governor_start() as described. Fixes: `18b46abd00` (cpufreq: governor: Be friendly towards latency-sensitive bursty workloads) Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com> Acked-by: Viresh Kumar <viresh.kumar@linaro.org>	2016-04-25 16:21:34 +02:00
Rafael J. Wysocki	1cbc99dfe5	Merge back cpufreq changes for v4.7.	2016-04-25 15:44:01 +02:00
Rafael J. Wysocki	94862a62df	Revert "cpufreq: governor: Fix negative idle_time when configured with CONFIG_HZ_PERIODIC" Revert commit `0df35026c6` (cpufreq: governor: Fix negative idle_time when configured with CONFIG_HZ_PERIODIC) that introduced a regression by causing the ondemand cpufreq governor to misbehave for CONFIG_TICK_CPU_ACCOUNTING unset (the frequency goes up to the max at one point and stays there indefinitely). The revert takes subsequent modifications of the code in question into account. Fixes: `0df35026c6` (cpufreq: governor: Fix negative idle_time when configured with CONFIG_HZ_PERIODIC) Link: https://bugzilla.kernel.org/show_bug.cgi?id=115261 Reported-and-tested-by: Timo Valtoaho <timo.valtoaho@gmail.com> Cc: 4.5+ <stable@vger.kernel.org> # 4.5+ Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com> Acked-by: Viresh Kumar <viresh.kumar@linaro.org>	2016-04-25 02:39:20 +02:00
Rafael J. Wysocki	2d0c58ad60	cpufreq: governor: Move abstract gov_attr_set code to seperate file Move abstract code related to struct gov_attr_set to a separate (new) file so it can be shared with (future) goverernors that won't share more code with "ondemand" and "conservative". No intentional functional changes. Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com> Acked-by: Viresh Kumar <viresh.kumar@linaro.org>	2016-04-02 01:09:01 +02:00
Rafael J. Wysocki	0dd3c1d678	cpufreq: governor: New data type for management part of dbs_data In addition to fields representing governor tunables, struct dbs_data contains some fields needed for the management of objects of that type. As it turns out, that part of struct dbs_data may be shared with (future) governors that won't use the common code used by "ondemand" and "conservative", so move it to a separate struct type and modify the code using struct dbs_data to follow. Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com> Acked-by: Viresh Kumar <viresh.kumar@linaro.org>	2016-04-02 01:09:00 +02:00
Rafael J. Wysocki	0bed612be6	cpufreq: sched: Helpers to add and remove update_util hooks Replace the single helper for adding and removing cpufreq utilization update hooks, cpufreq_set_update_util_data(), with a pair of helpers, cpufreq_add_update_util_hook() and cpufreq_remove_update_util_hook(), and modify the users of cpufreq_set_update_util_data() accordingly. With the new helpers, the code using them doesn't need to worry about the internals of struct update_util_data and in particular it doesn't need to worry about populating the func field in it properly upfront. Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com> Acked-by: Viresh Kumar <viresh.kumar@linaro.org> Acked-by: Peter Zijlstra (Intel) <peterz@infradead.org>	2016-04-02 01:08:43 +02:00
Rafael J. Wysocki	539a4c4247	cpufreq: governor: Always schedule work on the CPU running update Modify dbs_irq_work() to always schedule the process-context work on the current CPU which also ran the dbs_update_util_handler() that the irq_work being handled came from. This causes the entire frequency update handling (involving the "ondemand" or "conservative" governors) to be carried out by the CPU whose frequency is to be updated and reduces the overall amount of inter-CPU noise related to cpufreq. Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>	2016-03-22 23:13:36 +01:00
Rafael J. Wysocki	adaf9fcd13	cpufreq: Move scheduler-related code to the sched directory Create cpufreq.c under kernel/sched/ and move the cpufreq code related to the scheduler to that file and to sched.h. Redefine cpufreq_update_util() as a static inline function to avoid function calls at its call sites in the scheduler code (as suggested by Peter Zijlstra). Also move the definition of struct update_util_data and declaration of cpufreq_set_update_util_data() from include/linux/cpufreq.h to include/linux/sched.h. Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com> Acked-by: Peter Zijlstra (Intel) <peterz@infradead.org>	2016-03-10 20:44:47 +01:00
Rafael J. Wysocki	08f511fd41	cpufreq: Reduce cpufreq_update_util() overhead a bit Use the observation that cpufreq_update_util() is only called by the scheduler with rq->lock held, so the callers of cpufreq_set_update_util_data() can use synchronize_sched() instead of synchronize_rcu() to wait for cpufreq_update_util() to complete. Moreover, if they are updated to do that, rcu_read_(un)lock() calls in cpufreq_update_util() might be replaced with rcu_read_(un)lock_sched(), respectively, but those aren't really necessary, because the scheduler calls that function from RCU-sched read-side critical sections already. In addition to that, if cpufreq_set_update_util_data() checks the func field in the struct update_util_data before setting the per-CPU pointer to it, the data->func check may be dropped from cpufreq_update_util() as well. Make the above changes to reduce the overhead from cpufreq_update_util() in the scheduler paths invoking it and to make the cleanup after removing its callbacks less heavy-weight somewhat. Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com> Acked-by: Viresh Kumar <viresh.kumar@linaro.org> Acked-by: Peter Zijlstra (Intel) <peterz@infradead.org>	2016-03-09 15:07:58 +01:00
Viresh Kumar	f737236b12	cpufreq: governor: Drop unnecessary checks from show() and store() The show() and store() routines in the cpufreq-governor core don't need to check if the struct governor_attr they want to use really provides the callbacks they need as expected (if that's not the case, it means a bug in the code anyway), so change them to avoid doing that. Also change the error value to -EBUSY, if the governor is getting removed and we aren't allowed to store any more changes. Signed-off-by: Viresh Kumar <viresh.kumar@linaro.org> Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>	2016-03-09 14:41:11 +01:00
Rafael J. Wysocki	27de348239	cpufreq: governor: Fix race in dbs_update_util_handler() There is a scenario that may lead to undesired results in dbs_update_util_handler(). Namely, if two CPUs sharing a policy enter the funtion at the same time, pass the sample delay check and then one of them is stalled until dbs_work_handler() (queued up by the other CPU) clears the work counter, it may update the work counter and queue up another work item prematurely. To prevent that from happening, use the observation that the CPU queuing up a work item in dbs_update_util_handler() updates the last sample time. This means that if another CPU was stalling after passing the sample delay check and now successfully updated the work counter as a result of the race described above, it will see the new value of the last sample time which is different from what it used in the sample delay check before. If that happens, the sample delay check passed previously is not valid any more, so the CPU should not continue. Fixes: f17cbb53783c (cpufreq: governor: Avoid atomic operations in hot paths) Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com> Acked-by: Viresh Kumar <viresh.kumar@linaro.org>	2016-03-09 14:41:10 +01:00
Rafael J. Wysocki	94ab5e030f	cpufreq: governor: Make gov_set_update_util() static The gov_set_update_util() routine is only used internally by the common governor code and it doesn't need to be exported, so make it static. No functional changes. Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com> Acked-by: Viresh Kumar <viresh.kumar@linaro.org>	2016-03-09 14:41:10 +01:00
Rafael J. Wysocki	1112e9d83e	cpufreq: governor: Narrow down the dbs_data_mutex coverage Since cpufreq_governor_dbs() is now always called with policy->rwsem held, it cannot be executed twice in parallel for the same policy. Thus it is not necessary to hold dbs_data_mutex around the invocations of cpufreq_governor_start/stop/limits() from it as those functions never modify any data that can be shared between different policies. However, cpufreq_governor_dbs() may be executed twice in parallal for different policies using the same gov->gdbs_data object and dbs_data_mutex is still necessary to protect that object against concurrent updates. For this reason, narrow down the dbs_data_mutex locking to cpufreq_governor_init/exit() where it is needed and rename the mutex to gov_dbs_data_mutex to reflect its purpose. Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com> Acked-by: Viresh Kumar <viresh.kumar@linaro.org>	2016-03-09 14:41:10 +01:00
Rafael J. Wysocki	e3f5ed9393	cpufreq: governor: Make dbs_data_mutex static That mutex is only used by cpufreq_governor_dbs() and it doesn't need to be exported to modules, so make it static and drop the export incantation. No functional changes. Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com> Acked-by: Viresh Kumar <viresh.kumar@linaro.org>	2016-03-09 14:41:09 +01:00
Rafael J. Wysocki	8c8f77fd07	cpufreq: governor: Move per-CPU data to the common code After previous changes there is only one piece of code in the ondemand governor making references to per-CPU data structures, but it can be easily modified to avoid doing that, so modify it accordingly and move the definition of per-CPU data used by the ondemand and conservative governors to the common code. Next, change that code to access the per-CPU data structures directly rather than via a governor callback. This causes the ->get_cpu_cdbs governor callback to become unnecessary, so drop it along with the macro and function definitions related to it. Finally, drop the definitions of struct od_cpu_dbs_info_s and struct cs_cpu_dbs_info_s that aren't necessary any more. Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com> Acked-by: Viresh Kumar <viresh.kumar@linaro.org>	2016-03-09 14:41:09 +01:00
Rafael J. Wysocki	7d5a9956af	cpufreq: governor: Make governor private data per-policy Some fields in struct od_cpu_dbs_info_s and struct cs_cpu_dbs_info_s are only used for a limited set of CPUs. Namely, if a policy is shared between multiple CPUs, those fields will only be used for one of them (policy->cpu). This means that they really are per-policy rather than per-CPU and holding room for them in per-CPU data structures is generally wasteful. Also moving those fields into per-policy data structures will allow some significant simplifications to be made going forward. For this reason, introduce struct cs_policy_dbs_info and struct od_policy_dbs_info to hold those fields. Define each of the new structures as an extension of struct policy_dbs_info (such that struct policy_dbs_info is embedded in each of them) and introduce new ->alloc and ->free governor callbacks to allocate and free those structures, respectively, such that ->alloc() will return a pointer to the struct policy_dbs_info embedded in the allocated data structure and ->free() will take that pointer as its argument. With that, modify the code accessing the data fields in question in per-CPU data objects to look for them in the new structures via the struct policy_dbs_info pointer available to it and drop them from struct od_cpu_dbs_info_s and struct cs_cpu_dbs_info_s. Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com> Acked-by: Viresh Kumar <viresh.kumar@linaro.org>	2016-03-09 14:41:08 +01:00
Rafael J. Wysocki	a33cce1c6c	cpufreq: governor: Fix CPU load information updates via ->store The ->store() callbacks of some tunable sysfs attributes of the ondemand and conservative governors trigger immediate updates of the CPU load information for all CPUs "governed" by the given dbs_data by walking the cpu_dbs_info structures for all online CPUs in the system and updating them. This is questionable for two reasons. First, it may lead to a lot of extra overhead on a system with many CPUs if the given dbs_data is only associated with a few of them. Second, if governor tunables are per-policy, the CPUs associated with the other sets of governor tunables should not be updated. To address this issue, use the observation that in all of the places in question the update operation may be carried out in the same way (because all of the tunables involved are now located in struct dbs_data and readily available to the common code) and make the code in those places invoke the same (new) helper function that will carry out the update correctly. That new function always checks the ignore_nice_load tunable value and updates the CPUs' prev_cpu_nice data fields if that's set, which wasn't done by the original code in store_io_is_busy(), but it should have been done in there too. Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com> Acked-by: Viresh Kumar <viresh.kumar@linaro.org>	2016-03-09 14:41:08 +01:00
Rafael J. Wysocki	702c9e542a	cpufreq: governor: Add a ->start callback for governors To avoid having to check the governor type explicitly in the common code in order to initialize data structures specific to the governor type properly, add a ->start callback to struct dbs_governor and use it to initialize those data structures for the ondemand and conservative governors. Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com> Acked-by: Viresh Kumar <viresh.kumar@linaro.org>	2016-03-09 14:41:07 +01:00
Rafael J. Wysocki	8847e038c1	cpufreq: governor: Move io_is_busy to struct dbs_data The io_is_busy governor tunable is only used by the ondemand governor and is located in the ondemand-specific data structure, but it is looked at by the common governor code that has to do ugly things to get to that value, so move it to struct dbs_data and modify ondemand accordingly. Since the conservative governor never touches that field, it will be always 0 for that governor and it won't have any effect on the results of computations in that case. Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com> Acked-by: Viresh Kumar <viresh.kumar@linaro.org>	2016-03-09 14:41:06 +01:00
Rafael J. Wysocki	574ef14d5d	cpufreq: governor: Close dbs_data update race condition It is possible for a dbs_data object to be updated after its usage counter has become 0. That may happen if governor_store() runs (via a govenor tunable sysfs attribute write) in parallel with cpufreq_governor_exit() called for the last cpufreq policy associated with the dbs_data in question. In that case, if governor_store() acquires dbs_data->mutex right after cpufreq_governor_exit() has released it, the ->store() callback invoked by it may operate on dbs_data with no users. Although sysfs will cause the kobject_put() in cpufreq_governor_exit() to block until governor_store() has returned, that situation may lead to some unexpected results, depending on the implementation of the ->store callback, and therefore it should be avoided. To that end, modify governor_store() to check the dbs_data's usage count before invoking the ->store() callback and return an error if it is 0 at that point. Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com> Acked-by: Viresh Kumar <viresh.kumar@linaro.org>	2016-03-09 14:41:06 +01:00
Rafael J. Wysocki	07aa4402a0	cpufreq: governor: Use microseconds in sample delay computations Do not convert microseconds to jiffies and the other way around in governor computations related to the sampling rate and sample delay and drop delay_for_sampling_rate() which isn't of any use then. Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com> Acked-by: Viresh Kumar <viresh.kumar@linaro.org>	2016-03-09 14:41:05 +01:00
Rafael J. Wysocki	57dc3bcd45	cpufreq: governor: Move rate_mult to struct policy_dbs The rate_mult field in struct od_cpu_dbs_info_s is used by the code shared with the conservative governor and to access it that code has to do an ugly governor type check. However, first of all it is ever only used for policy->cpu, so it is per-policy rather than per-CPU and second, it is initialized to 1 by cpufreq_governor_start(), so if the conservative governor never modifies it, it will have no effect on the results of any computations. For these reasons, move rate_mult to struct policy_dbs_info (as a common field). Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com> Acked-by: Viresh Kumar <viresh.kumar@linaro.org>	2016-03-09 14:41:04 +01:00
Rafael J. Wysocki	78347cdb89	cpufreq: governor: Reset sample delay in store_sampling_rate() If store_sampling_rate() updates the sample delay when the ondemand governor is in the middle of its high/low dance (OD_SUB_SAMPLE sample type is set), the governor will still do the bottom half of the previous sample which may take too much time. To prevent that from happening, change store_sampling_rate() to always reset the sample delay to 0 which also is consistent with the new behavior of cpufreq_governor_limits(). Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com> Acked-by: Viresh Kumar <viresh.kumar@linaro.org>	2016-03-09 14:41:04 +01:00
Rafael J. Wysocki	4cccf75557	cpufreq: governor: Get rid of the ->gov_check_cpu callback The way the ->gov_check_cpu governor callback is used by the ondemand and conservative governors is not really straightforward. Namely, the governor calls dbs_check_cpu() that updates the load information for the policy and the invokes ->gov_check_cpu() for the governor. To get rid of that entanglement, notice that cpufreq_governor_limits() doesn't need to call dbs_check_cpu() directly. Instead, it can simply reset the sample delay to 0 which will cause a sample to be taken immediately. The result of that is practically equivalent to calling dbs_check_cpu() except that it will trigger a full update of governor internal state and not just the ->gov_check_cpu() part. Following that observation, make cpufreq_governor_limits() reset the sample delay and turn dbs_check_cpu() into a function that will simply evaluate the load and return the result called dbs_update(). That function can now be called by governors from the routines that previously were pointed to by ->gov_check_cpu and those routines can be called directly by each governor instead of dbs_check_cpu(). This way ->gov_check_cpu becomes unnecessary, so drop it. Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com> Acked-by: Viresh Kumar <viresh.kumar@linaro.org>	2016-03-09 14:41:04 +01:00
Rafael J. Wysocki	57eb832f90	cpufreq: governor: Clean up load-related computations Clean up some load-related computations in dbs_check_cpu() and cpufreq_governor_start() to get rid of unnecessary operations and type casts and make the code easier to read. No functional changes. Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com> Acked-by: Viresh Kumar <viresh.kumar@linaro.org>	2016-03-09 14:41:03 +01:00
Rafael J. Wysocki	679b8fe43a	cpufreq: governor: Fix nice contribution computation in dbs_check_cpu() The contribution of the CPU nice time to the idle time in dbs_check_cpu() is computed in a bogus way, as the code may subtract current and previous nice values for different CPUs. That doesn't matter for cases when cpufreq policies are not shared, but may lead to problems otherwise. Fix the computation and simplify it to avoid taking unnecessary steps. Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com> Acked-by: Viresh Kumar <viresh.kumar@linaro.org>	2016-03-09 14:41:03 +01:00
Rafael J. Wysocki	e4db2813d2	cpufreq: governor: Avoid atomic operations in hot paths Rework the handling of work items by dbs_update_util_handler() and dbs_work_handler() so the former (which is executed in scheduler paths) only uses atomic operations when absolutely necessary. That is, when the policy is shared and dbs_update_util_handler() has already decided that this is the time to queue up a work item. In particular, this avoids the atomic ops entirely on platforms where policy objects are never shared. Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com> Acked-by: Viresh Kumar <viresh.kumar@linaro.org>	2016-03-09 14:41:03 +01:00
Rafael J. Wysocki	f62b93740c	cpufreq: governor: Simplify gov_cancel_work() slightly The atomic work counter incrementation in gov_cancel_work() is not necessary any more, because work items won't be queued up after gov_clear_update_util() anyway, so drop it along with the comment about how it may be missed by the gov_clear_update_util(). Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com> Acked-by: Viresh Kumar <viresh.kumar@linaro.org>	2016-03-09 14:41:02 +01:00
Rafael J. Wysocki	b9db42730a	cpufreq: governor: Avoid irq_work_queue_on() crash on non-SMP ARM As it turns out, irq_work_queue_on() will crash if invoked on non-SMP ARM platforms, but in fact it is not necessary to use that function in the cpufreq governor code (as it doesn't matter to that code which CPU will handle the irq_work), so change it to always use irq_work_queue(). Fixes: 8fb47ff100af (cpufreq: governor: Replace timers with utilization update callbacks) Reported-and-tested-by: Guenter Roeck <linux@roeck-us.net> Reported-and-tested-by: Tony Lindgren <tony@atomide.com> Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>	2016-03-09 14:41:02 +01:00
Viresh Kumar	aded387b94	cpufreq: conservative: Update sample_delay_ns immediately The ondemand governor already updates sample_delay_ns immediately on updates to the sampling rate, but conservative doesn't do that. It was left out earlier as the code was really too complex to get that done easily. Things are sorted out very well now, however, and the conservative governor can be modified to follow ondemand in that respect. Moreover, since the code needed to implement that in the conservative governor would be identical to the corresponding ondemand governor's code, make that code common and change both governors to use it. Signed-off-by: Viresh Kumar <viresh.kumar@linaro.org> Tested-by: Juri Lelli <juri.lelli@arm.com> Tested-by: Shilpasri G Bhat <shilpa.bhat@linux.vnet.ibm.com> [ rjw: Changelog ] Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>	2016-03-09 14:41:01 +01:00
Viresh Kumar	581c214b21	cpufreq: governor: No need to manage state machine now The cpufreq core now guarantees that policy->rwsem won't be dropped while running the ->governor callback for the CPUFREQ_GOV_POLICY_EXIT event and will be held acquired until the complete sequence of governor state changes has finished. This allows governor state machine checks to be dropped from multiple functions in cpufreq_governor.c. This also means that policy_dbs->policy can be initialized upfront, so the entire initialization of struct policy_dbs can be carried out in one place. Signed-off-by: Viresh Kumar <viresh.kumar@linaro.org> Tested-by: Juri Lelli <juri.lelli@arm.com> Tested-by: Shilpasri G Bhat <shilpa.bhat@linux.vnet.ibm.com> [ rjw: Changelog ] Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>	2016-03-09 14:41:01 +01:00
Viresh Kumar	c54df07184	cpufreq: governor: Create and traverse list of policy_dbs to avoid deadlock The dbs_data_mutex lock is currently used in two places. First, cpufreq_governor_dbs() uses it to guarantee mutual exclusion between invocations of governor operations from the core. Second, it is used by ondemand governor's update_sampling_rate() to ensure the stability of data structures walked by it. The second usage is quite problematic, because update_sampling_rate() is called from a governor sysfs attribute's ->store callback and that leads to a deadlock scenario involving cpufreq_governor_exit() which runs under dbs_data_mutex. Thus it is better to rework the code so update_sampling_rate() doesn't need to acquire dbs_data_mutex. To that end, rework update_sampling_rate() to walk a list of policy_dbs objects supported by the dbs_data one it has been called for (instead of walking cpu_dbs_info object for all CPUs). The list manipulation is protected with dbs_data->mutex which also is held around the execution of update_sampling_rate(), it is not necessary to hold dbs_data_mutex in that function any more. Reported-by: Juri Lelli <juri.lelli@arm.com> Reported-by: Shilpasri G Bhat <shilpa.bhat@linux.vnet.ibm.com> Signed-off-by: Viresh Kumar <viresh.kumar@linaro.org> [ rjw: Subject & changelog ] Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>	2016-03-09 14:40:59 +01:00
Viresh Kumar	c443563036	cpufreq: governor: New sysfs show/store callbacks for governor tunables The ondemand and conservative governors use the global-attr or freq-attr structures to represent sysfs attributes corresponding to their tunables (which of them is actually used depends on whether or not different policy objects can use the same governor with different tunables at the same time and, consequently, on where those attributes are located in sysfs). Unfortunately, in the freq-attr case, the standard cpufreq show/store sysfs attribute callbacks are applied to the governor tunable attributes and they always acquire the policy->rwsem lock before carrying out the operation. That may lead to an ABBA deadlock if governor tunable attributes are removed under policy->rwsem while one of them is being accessed concurrently (if sysfs attributes removal wins the race, it will wait for the access to complete with policy->rwsem held while the attribute callback will block on policy->rwsem indefinitely). We attempted to address this issue by dropping policy->rwsem around governor tunable attributes removal (that is, around invocations of the ->governor callback with the event arg equal to CPUFREQ_GOV_POLICY_EXIT) in cpufreq_set_policy(), but that opened up race conditions that had not been possible with policy->rwsem held all the time. Therefore policy->rwsem cannot be dropped in cpufreq_set_policy() at any point, but the deadlock situation described above must be avoided too. To that end, use the observation that in principle governor tunables may be represented by the same data type regardless of whether the governor is system-wide or per-policy and introduce a new structure, struct governor_attr, for representing them and new corresponding macros for creating show/store sysfs callbacks for them. Also make their parent kobject use a new kobject type whose default show/store callbacks are not related to the standard core cpufreq ones in any way (and they don't acquire policy->rwsem in particular). Signed-off-by: Viresh Kumar <viresh.kumar@linaro.org> Tested-by: Juri Lelli <juri.lelli@arm.com> Tested-by: Shilpasri G Bhat <shilpa.bhat@linux.vnet.ibm.com> [ rjw: Subject & changelog + rebase ] Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>	2016-03-09 14:40:58 +01:00
Viresh Kumar	ff4b17895e	cpufreq: governor: Move common tunables to 'struct dbs_data' There are a few common tunables shared between the ondemand and conservative governors. Move them to struct dbs_data to simplify code. Signed-off-by: Viresh Kumar <viresh.kumar@linaro.org> Tested-by: Juri Lelli <juri.lelli@arm.com> Tested-by: Shilpasri G Bhat <shilpa.bhat@linux.vnet.ibm.com> [ rjw: Changelog ] Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>	2016-03-09 14:40:58 +01:00
Rafael J. Wysocki	fafd5e8ab2	cpufreq: governor: Drop pointless goto from cpufreq_governor_init() It is silly to jump around "return 0", so don't do that. Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com> Acked-by: Viresh Kumar <viresh.kumar@linaro.org>	2016-03-09 14:40:57 +01:00
Rafael J. Wysocki	686cc637c9	cpufreq: governor: Rename skip_work to work_count The skip_work field in struct policy_dbs_info technically is a counter, so give it a new name to reflect that. No functional changes. Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com> Acked-by: Viresh Kumar <viresh.kumar@linaro.org>	2016-03-09 14:40:57 +01:00
Rafael J. Wysocki	cea6a9e772	cpufreq: governor: Symmetrize cpu_dbs_info initialization and cleanup Make the initialization of struct cpu_dbs_info objects in alloc_policy_dbs_info() and the code that cleans them up in free_policy_dbs_info() more symmetrical. In particular, set/clear the update_util.func field in those functions along with the policy_dbs field. Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com> Acked-by: Viresh Kumar <viresh.kumar@linaro.org>	2016-03-09 14:40:56 +01:00
Rafael J. Wysocki	bc505475b8	cpufreq: governor: Rearrange governor data structures The struct policy_dbs_info objects representing per-policy governor data are not accessible directly from the corresponding policy objects. To access them, one has to get a pointer to the struct cpu_dbs_info of policy->cpu and use the policy_dbs field of that which isn't really straightforward. To address that rearrange the governor data structures so the governor_data pointer in struct cpufreq_policy will point to struct policy_dbs_info (instead of struct dbs_data) and that will contain a pointer to struct dbs_data. Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com> Acked-by: Viresh Kumar <viresh.kumar@linaro.org>	2016-03-09 14:40:56 +01:00
Rafael J. Wysocki	e975189400	cpufreq: governor: Simplify cpufreq_governor_limits() Use the observation that cpufreq_governor_limits() doesn't have to get to the policy object it wants to manipulate by walking the reference chain cdbs->policy_dbs->policy, as the final pointer is actually equal to its argument, and make it access the policy object directy via its argument. Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com> Acked-by: Viresh Kumar <viresh.kumar@linaro.org>	2016-03-09 14:40:56 +01:00
Rafael J. Wysocki	d10b5eb5fc	cpufreq: governor: Drop cpu argument from dbs_check_cpu() Since policy->cpu is always passed as the second argument to dbs_check_cpu(), it is not really necessary to pass it, because the function can obtain that value via its first argument just fine. Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com> Acked-by: Viresh Kumar <viresh.kumar@linaro.org>	2016-03-09 14:40:55 +01:00

1 2 3

116 Commits