Commit Graph

968318 Commits

Author SHA1 Message Date
Ofir Bitton bd2f477f20 habanalabs: add support for cs with timestamp
add support for user to request a timestamp upon
cs completion.

Signed-off-by: Ofir Bitton <obitton@habana.ai>
Reviewed-by: Oded Gabbay <ogabbay@kernel.org>
Signed-off-by: Oded Gabbay <ogabbay@kernel.org>
2020-11-30 10:47:37 +02:00
Ofir Bitton 9d127ad571 habanalabs: indicate to user that a cs is gone
We want to indicate to the user that a certain command submission
is finished long time ago and it is no longer in database.
This means no further information regarding this cs can be obtained.

Signed-off-by: Ofir Bitton <obitton@habana.ai>
Reviewed-by: Oded Gabbay <ogabbay@kernel.org>
Signed-off-by: Oded Gabbay <ogabbay@kernel.org>
2020-11-30 10:47:36 +02:00
Oded Gabbay 64a9d5ab2c habanalabs/gaudi: print ECC type field
We have the ECC type field from the firmware but the driver didn't
print it, so we need to add that field to the ECC print message.

Signed-off-by: Oded Gabbay <ogabbay@kernel.org>
2020-11-30 10:47:36 +02:00
Oded Gabbay 051504d9f6 habanalabs: update firmware files
Update various firmware header files with new defines.

Signed-off-by: Oded Gabbay <ogabbay@kernel.org>
2020-11-30 10:47:36 +02:00
kernel test robot 293744d92c habanalabs: gaudi_ctx_fini() can be static
Make a function in gaudi.c to be static

Reported-by: kernel test robot <lkp@intel.com>
Signed-off-by: kernel test robot <lkp@intel.com>
Reviewed-by: Oded Gabbay <ogabbay@kernel.org>
Signed-off-by: Oded Gabbay <ogabbay@kernel.org>
2020-11-30 10:47:36 +02:00
kernel test robot 2a570736ef habanalabs: goya_reset_sob_group() can be static
Make some functions static

Reported-by: kernel test robot <lkp@intel.com>
Signed-off-by: kernel test robot <lkp@intel.com>
Reviewed-by: Oded Gabbay <ogabbay@kernel.org>
Signed-off-by: Oded Gabbay <ogabbay@kernel.org>
2020-11-30 10:47:36 +02:00
Alon Mizrahi 4147864e8d habanalabs: fetch pll frequency from firmware
Once firmware security is enabled, driver must fetch pll frequencies
through the firmware message interface instead of reading the registers
directly.

Signed-off-by: Alon Mizrahi <amizrahi@habana.ai>
Reviewed-by: Oded Gabbay <ogabbay@kernel.org>
Signed-off-by: Oded Gabbay <ogabbay@kernel.org>
2020-11-30 10:47:36 +02:00
Ofir Bitton 5c05487f15 habanalabs: mmu map wrapper for sizes larger than a page
We introduce a new wrapper which allows us to mmu map any size
to any host va_range available. In addition we remove duplicated
code from various places in driver and using this new wrapper
instead.
This wrapper supports mapping only contiguous physical
memory blocks and will be used for mappings that are done to the
driver ASID.

Signed-off-by: Ofir Bitton <obitton@habana.ai>
Reviewed-by: Oded Gabbay <ogabbay@kernel.org>
Signed-off-by: Oded Gabbay <ogabbay@kernel.org>
2020-11-30 10:47:36 +02:00
Oded Gabbay 5e5867e51d habanalabs: print CS type when it is stuck
We have several types of command submissions and the user wants to know
which type of command submission has not finished in time when that
event occurs. This is very helpful for debug.

Signed-off-by: Oded Gabbay <ogabbay@kernel.org>
2020-11-30 10:47:36 +02:00
Ofir Bitton b90c894434 habanalabs/gaudi: align to new FW reset scheme
As part of the security effort in which FW will be handling
sensitive HW registers, hard reset flow will be done by FW
and will be triggered by driver.

Signed-off-by: Ofir Bitton <obitton@habana.ai>
Reviewed-by: Oded Gabbay <ogabbay@kernel.org>
Signed-off-by: Oded Gabbay <ogabbay@kernel.org>
2020-11-30 10:47:35 +02:00
Alon Mizrahi 439bc47b8e habanalabs: firmware returns 64bit argument
F/W message returns 64bit value but up until now we casted it to
a 32bit variable, instead of receiving 64bit in the first place.

Signed-off-by: Alon Mizrahi <amizrahi@habana.ai>
Reviewed-by: Oded Gabbay <ogabbay@kernel.org>
Signed-off-by: Oded Gabbay <ogabbay@kernel.org>
2020-11-30 10:47:35 +02:00
Moti Haimovski 00e1b59c8b habanalabs: fix MMU debugfs operations
After the MMU-code refactoring, the existing MMU debugfs operations
are no longer working so we need to fix them.
In addition, remove the duplicate code that was in the debugfs code
and use the already existing MMU-code.

Signed-off-by: Moti Haimovski <mhaimovski@habana.ai>
Reviewed-by: Oded Gabbay <ogabbay@kernel.org>
Signed-off-by: Oded Gabbay <ogabbay@kernel.org>
2020-11-30 10:47:35 +02:00
Moti Haimovski fe2bc2d249 habanalabs: share a single ctx-mutex between all MMUs
Multiple locks are usually a source of problems, which in the MMU
case can be avoided since it is relatively rare that both MMU
tables are updated at the same time.

Therefore, use a single shared lock instead of two separate ones.

Signed-off-by: Moti Haimovski <mhaimovski@habana.ai>
Reviewed-by: Oded Gabbay <ogabbay@kernel.org>
Signed-off-by: Oded Gabbay <ogabbay@kernel.org>
2020-11-30 10:47:35 +02:00
Ofir Bitton 412c41fcd5 habanalabs: support reserving aligned va block
Add support for reserving va block with alignment different than
page size. This is a pre-requisite for allocations needed in future
ASICs

Signed-off-by: Ofir Bitton <obitton@habana.ai>
Reviewed-by: Oded Gabbay <ogabbay@kernel.org>
Signed-off-by: Oded Gabbay <ogabbay@kernel.org>
2020-11-30 10:47:35 +02:00
Guy Nisan b2d09622be habanalabs: add boot errors prints
Add log prints for security and eFuse boot error bits

Signed-off-by: Guy Nisan <gnisan@habana.ai>
Reviewed-by: Oded Gabbay <ogabbay@kernel.org>
Signed-off-by: Oded Gabbay <ogabbay@kernel.org>
2020-11-30 10:47:35 +02:00
Oded Gabbay 92ede12a07 habanalabs: print message with correct device
During hard-reset, the driver rejects further IOCTL calls and prints
an error message. That error message should be printed with the correct
device instead of using only the control device.

Signed-off-by: Oded Gabbay <ogabbay@kernel.org>
2020-11-30 10:47:35 +02:00
Ofir Bitton 5a2998f46c habanalabs/gaudi: fetch HBM ecc info from FW
Once FW security is enabled there is no access to HBM ecc registers,
need to read values from FW using a dedicated interface.

Signed-off-by: Ofir Bitton <obitton@habana.ai>
Reviewed-by: Oded Gabbay <ogabbay@kernel.org>
Signed-off-by: Oded Gabbay <ogabbay@kernel.org>
2020-11-30 10:47:34 +02:00
Ofir Bitton d611b9f0b1 habanalabs: fetch hard reset capability from FW
Driver must fetch FW hard reset capability during boot time,
in order to skip the hard reset flow if necessary.

Signed-off-by: Ofir Bitton <obitton@habana.ai>
Reviewed-by: Oded Gabbay <ogabbay@kernel.org>
Signed-off-by: Oded Gabbay <ogabbay@kernel.org>
2020-11-30 10:47:34 +02:00
Oded Gabbay 7f070c913c habanalabs: move asic property to correct structure
Whether an ASIC has MMU towards its DRAM is an ASIC property, so
move it to the asic fixed properties structure.

Signed-off-by: Oded Gabbay <ogabbay@kernel.org>
2020-11-30 10:47:34 +02:00
Ofir Bitton be91b91fa4 habanalabs: use host va range for internal pools
Instead of using a dedicated va range for each internal pool,
we introduce a new way for reserving a va block from an existing
va range. This is a more generic way of reserving va blocks for
future use.

Signed-off-by: Ofir Bitton <obitton@habana.ai>
Reviewed-by: Oded Gabbay <ogabbay@kernel.org>
Signed-off-by: Oded Gabbay <ogabbay@kernel.org>
2020-11-30 10:47:34 +02:00
Ofir Bitton adb51298fd habanalabs: improve hard reset procedure
We want to handle the scenario in which the driver was not able
to kill all user processes due to many memory mappings.
We need to retry again after some period while releasing the cores.
The devices will be unusable and "in-reset" status during that time.

Signed-off-by: Ofir Bitton <obitton@habana.ai>
Reviewed-by: Oded Gabbay <ogabbay@kernel.org>
Signed-off-by: Oded Gabbay <ogabbay@kernel.org>
2020-11-30 10:47:34 +02:00
Tomer Tayar 804a72276c habanalabs: Rename hw_queues_mirror to cs_mirror
Future command submission types might be submitted to HW not via the
QMAN queues path. However, it would be still required to have the TDR
mechanism for these CS, and thus the patch renames the TDR fields and
replaces the hw_queues_ prefix with cs_.

Signed-off-by: Tomer Tayar <ttayar@habana.ai>
Reviewed-by: Oded Gabbay <ogabbay@kernel.org>
Signed-off-by: Oded Gabbay <ogabbay@kernel.org>
2020-11-30 10:47:34 +02:00
Ofir Bitton 784b916dad habanalabs: refactor mmu va_range db structure
Use an array of va_ranges instead of keeping each va_range separately,
we do this for better readability and in order to support access to
a specific range in a much elegant manner.

Signed-off-by: Ofir Bitton <obitton@habana.ai>
Reviewed-by: Oded Gabbay <ogabbay@kernel.org>
Signed-off-by: Oded Gabbay <ogabbay@kernel.org>
2020-11-30 10:47:34 +02:00
Ofir Bitton d1ddd90551 habanalabs: move HW dirty check to a proper location
Driver must verify if HW is dirty before trying to fetch preboot
information. Hence, we move this validation to a prior stage of
the boot sequence.

Signed-off-by: Ofir Bitton <obitton@habana.ai>
Reviewed-by: Oded Gabbay <ogabbay@kernel.org>
Signed-off-by: Oded Gabbay <ogabbay@kernel.org>
2020-11-30 10:47:33 +02:00
Oded Gabbay 28e052c952 habanalabs: restore vm_pgoff after mmap
Due to using dma_mmap_coherent() to perform mmap of dma memory, we
had to clear the vm_pgoff field before calling that function.

However, that broke the userspace (profiler tool) as they relied
on searching the /proc/self/maps for these values to correctly
"disassemble" the topology recipe.

To re-enable that functionality, the driver can simply restore the
value of vm_pgoff before returning to userspace but after calling
dma_mmap_coherent().

Signed-off-by: Oded Gabbay <ogabbay@kernel.org>
2020-11-30 10:47:33 +02:00
Ofir Bitton 66a76401c5 habanalabs: add 'needs reset' state in driver
The new state indicates that device should be reset in order
to re-gain funcionality.
This unique state can occur if reset_on_lockup is disabled
and an actual lockup has occurred.

Signed-off-by: Ofir Bitton <obitton@habana.ai>
Reviewed-by: Oded Gabbay <ogabbay@kernel.org>
Signed-off-by: Oded Gabbay <ogabbay@kernel.org>
2020-11-30 10:47:33 +02:00
Omer Shpigelman f2d032ee13 habanalabs: fix hard reset print and comment
One of the first steps of a hard reset flow is to close all open user
contexts. This user process teradown might take some time due to long
cleanup in our driver or some other reason even before our cleanup flow.
Hence fix the relevant print and comment to be more accurate.

Signed-off-by: Omer Shpigelman <oshpigelman@habana.ai>
Reviewed-by: Oded Gabbay <ogabbay@kernel.org>
Signed-off-by: Oded Gabbay <ogabbay@kernel.org>
2020-11-30 10:47:33 +02:00
Igor Grinberg b726a2f7c0 habanalabs/gaudi: remove pcie_en strap toggle
Since the very large grace period is over and this functionality
prevents us to implement the new reset sequence and apply security
settings, we need to remove the code toggling the PCIE_EN bit in the
straps register.
Remove it for good.

Signed-off-by: Igor Grinberg <igrinberg@habana.ai>
Reviewed-by: Oded Gabbay <ogabbay@kernel.org>
Signed-off-by: Oded Gabbay <ogabbay@kernel.org>
2020-11-30 10:47:33 +02:00
Oded Gabbay 66bfcccdb8 habanalabs: remove duplicate print
We print twice the firmware status regarding security, once in
common code and once in asic code. Remove the print in asic code
and leave the common code print.

Signed-off-by: Oded Gabbay <ogabbay@kernel.org>
2020-11-30 10:47:33 +02:00
Tomer Tayar 649c459212 habanalabs: Separate CS job completion from its deallocation
Current CS jobs are no longer needed after their completion.
However, jobs of future workload might be in use even after they are
completed. To allow that, the patch adds a refcount to the job object,
and decouples its completion handling from its deallocation.

Signed-off-by: Tomer Tayar <ttayar@habana.ai>
Reviewed-by: Oded Gabbay <ogabbay@kernel.org>
Signed-off-by: Oded Gabbay <ogabbay@kernel.org>
2020-11-30 10:47:33 +02:00
Oded Gabbay 0da5698bf4 habanalabs/gaudi: increase MAX CS to 16K
We need to have the MAX CS be much larger than the size of the
different queues. In GAUDI we have around 8 groups of queues, and each
group has 1K queue size. To prevent head-of-the-line blocking, we need
to make sure there is sufficient number of available CS allocations
even if one or more of those queues are full.

Signed-off-by: Oded Gabbay <ogabbay@kernel.org>
2020-11-30 10:47:32 +02:00
farah kassabri eb10b897e4 habanalabs: reset device upon fw read failure
failure in reading pre-boot verion is not handled correctly,
upon failure we need to reset the device in order to be able
to reinstall the driver.

Signed-off-by: farah kassabri <fkassabri@habana.ai>
Reviewed-by: Oded Gabbay <ogabbay@kernel.org>
Signed-off-by: Oded Gabbay <ogabbay@kernel.org>
2020-11-30 10:47:32 +02:00
Tomer Tayar ba7e389c30 habanalabs: Move repeatedly included headers to habanalabs.h
Several header files are repeatedly included in many files.
Move these files to habanalabs.h which is included by all.

Signed-off-by: Tomer Tayar <ttayar@habana.ai>
Reviewed-by: Oded Gabbay <ogabbay@kernel.org>
Signed-off-by: Oded Gabbay <ogabbay@kernel.org>
2020-11-30 10:47:32 +02:00
Ofir Bitton c1d505a922 habanalabs: release signal if collective wait was dropped
As in standard wait cs, we must release a signal fence once
a collective wait cs was dropped and not submitted.

Signed-off-by: Ofir Bitton <obitton@habana.ai>
Reviewed-by: Oded Gabbay <ogabbay@kernel.org>
Signed-off-by: Oded Gabbay <ogabbay@kernel.org>
2020-11-30 10:47:32 +02:00
Tomer Tayar 4ba1b227b6 habanalabs: Skip updating CI of internal queues if not in use
There are no internal queues if H/W queues are being used.
In this case we can skip the redundant traversal over the queues array,
looking for internal queues.

Signed-off-by: Tomer Tayar <ttayar@habana.ai>
Reviewed-by: Oded Gabbay <ogabbay@kernel.org>
Signed-off-by: Oded Gabbay <ogabbay@kernel.org>
2020-11-30 10:47:32 +02:00
Tomer Tayar ea6ee260cb habanalabs: Small refactoring of cs_do_release()
Slightly refactor the cs_do_release() function, to reduce nesting level
and to ease the handling of future CS types.

Signed-off-by: Tomer Tayar <ttayar@habana.ai>
Reviewed-by: Oded Gabbay <ogabbay@kernel.org>
Signed-off-by: Oded Gabbay <ogabbay@kernel.org>
2020-11-30 10:47:32 +02:00
Tomer Tayar 6de3d769fd habanalabs: Small refactoring of CS IOCTL handling
Refactor the CS IOCTL handling by gathering common code into
sub-functions, in order to ease future additions of new CS types.

Signed-off-by: Tomer Tayar <ttayar@habana.ai>
Reviewed-by: Oded Gabbay <ogabbay@kernel.org>
Signed-off-by: Oded Gabbay <ogabbay@kernel.org>
2020-11-30 10:47:32 +02:00
Ofir Bitton 1cbca899fa habanalabs/gaudi: fetch PLL info from FW
Once FW security is enabled there is no access to PLL registers,
need to read values from FW using a dedicated interface.

Signed-off-by: Ofir Bitton <obitton@habana.ai>
Reviewed-by: Oded Gabbay <ogabbay@kernel.org>
Signed-off-by: Oded Gabbay <ogabbay@kernel.org>
2020-11-30 10:47:31 +02:00
Moti Haimovski ccf979ee33 habanalabs: refactor MMU to support dual residency MMU
This commit refactors the MMU code to support PCI MMU page tables
residing on host and DCORE MMU residing on the device DRAM at the
same time.

This is needed for future devices as on GAUDI and GOYA we have
a single MMU where its page tables always reside on DRAM.

Signed-off-by: Moti Haimovski <mhaimovski@habana.ai>
Reviewed-by: Oded Gabbay <ogabbay@kernel.org>
Signed-off-by: Oded Gabbay <ogabbay@kernel.org>
2020-11-30 10:47:31 +02:00
Moti Haimovski a6722d6a97 habanalabs: fix MMU print message
This commit fixes an incorrect error message

Signed-off-by: Moti Haimovski <mhaimovski@habana.ai>
Reviewed-by: Oded Gabbay <ogabbay@kernel.org>
Signed-off-by: Oded Gabbay <ogabbay@kernel.org>
2020-11-30 10:47:31 +02:00
farah kassabri 03df136bc5 habanalabs/gaudi: scrub all memory upon closing FD
In cases of multi-tenants, administrators may want to prevent data
leakage between users running on the same device one after another.

To do that the driver can scrub the internal memory (both SRAM and
DRAM) after a user finish to use the memory.

Because in GAUDI the driver allows only one application to use the
device at a time, it can scrub the memory when user app close FD.

In future devices where we have MMU on the DRAM, we can scrub the DRAM
memory with a finer granularity (page granularity) when the user
allocates the memory.

This feature is not supported in Goya.

To allow users that want to debug their applications, we add a kernel
module parameter to load the driver with this feature disabled.

Signed-off-by: farah kassabri <fkassabri@habana.ai>
Reviewed-by: Oded Gabbay <ogabbay@kernel.org>
Signed-off-by: Oded Gabbay <ogabbay@kernel.org>
2020-11-30 10:47:31 +02:00
Ofir Bitton c692dec703 habanalabs/gaudi: add support for FW security
Skip relevant HW configurations once FW security is enabled
because these configurations are being performed by FW.

Signed-off-by: Ofir Bitton <obitton@habana.ai>
Reviewed-by: Oded Gabbay <ogabbay@kernel.org>
Signed-off-by: Oded Gabbay <ogabbay@kernel.org>
2020-11-30 10:47:31 +02:00
Ofir Bitton 323b726706 habanalabs: fetch security indication from FW
Add support for fetching security indication from FW.
This indication is needed in order to skip unnecessary
initializations done by FW.

Signed-off-by: Ofir Bitton <obitton@habana.ai>
Reviewed-by: Oded Gabbay <ogabbay@kernel.org>
Signed-off-by: Oded Gabbay <ogabbay@kernel.org>
2020-11-30 10:47:31 +02:00
farah kassabri e753643d51 habanalabs: fix cs counters structure
Fix cs counters structure in uapi to be one flat structure instead
of two instances of the same other structure.
use atomic read/increment for context counters so we could use
one structure for both aggregated and context counters.

Signed-off-by: farah kassabri <fkassabri@habana.ai>
Reviewed-by: Oded Gabbay <ogabbay@kernel.org>
Signed-off-by: Oded Gabbay <ogabbay@kernel.org>
2020-11-30 10:47:31 +02:00
Ofir Bitton 9bb86b63d8 habanalabs: advanced FW loading
Today driver is able to load a whole FW binary into a specific
location on ASIC. We add support for loading sections from the
same FW binary into different loactions.

Signed-off-by: Ofir Bitton <obitton@habana.ai>
Reviewed-by: Oded Gabbay <ogabbay@kernel.org>
Signed-off-by: Oded Gabbay <ogabbay@kernel.org>
2020-11-30 10:47:31 +02:00
Oded Gabbay 977d53a614 habanalabs: initialize variable before use
GCC 7.3.1 20180303 (Red Hat 7.3.1-5) complains that collective_engine_id
might be used uninitialized.

Signed-off-by: Oded Gabbay <ogabbay@kernel.org>
2020-11-30 10:47:30 +02:00
Ofir Bitton 71a984f9ae habanalabs/gaudi: remove unreachable code
Remove unreachable code in gaudi collective flow.

Signed-off-by: Ofir Bitton <obitton@habana.ai>
Reviewed-by: Oded Gabbay <ogabbay@kernel.org>
Signed-off-by: Oded Gabbay <ogabbay@kernel.org>
2020-11-30 10:47:30 +02:00
Oded Gabbay e716ad3c76 habanalabs: make sure cs type is valid in cs_ioctl_signal_wait
Although we get a valid cs type from the callee, in case new values
will be added in the future, it is best to check the expected values
in that function.

Signed-off-by: Oded Gabbay <ogabbay@kernel.org>
2020-11-30 10:47:30 +02:00
Oded Gabbay 3e62299657 habanalabs/gaudi: monitor device memory usage
In GAUDI we don't have an MMU towards the HBM device memory. Therefore,
the user access that memory directly through physical address (via the
different engines) without the need to go through the driver to
allocate/free memory on the HBM.

For system monitoring purposes, the driver will keep track of the HBM
usage. This can be done as long as the user accurately reports the
allocations and releases of HBM memory, through the existing MEMORY
IOCTL uapi.

Signed-off-by: Oded Gabbay <ogabbay@kernel.org>
2020-11-30 10:47:30 +02:00
Ofir Bitton 5de406c0b5 habanalabs: sync stream collective support
Implement sync stream collective for GAUDI. Need to allocate additional
resources for that and add ctx_fini() to clean up those resources.

Signed-off-by: Ofir Bitton <obitton@habana.ai>
Reviewed-by: Oded Gabbay <ogabbay@kernel.org>
Signed-off-by: Oded Gabbay <ogabbay@kernel.org>
2020-11-30 10:47:30 +02:00