Overhead of Iterating T[] cast to IList&lt;T&gt; in C#

I've noticed a performance hit when iterating over a primitive collection (T[]) that has been cast to a generic interface collection (IList&lt;T&gt; or IEnumerable&lt;T&gt;).
For example:
private static int Sum(int[] array)
{
    int sum = 0;
    foreach (int i in array)
        sum += i;
    return sum;
}
The above code executes significantly faster than the code below, where the parameter is changed to type IList&lt;int&gt; (or IEnumerable&lt;int&gt;):
private static int Sum(IList&lt;int&gt; array)
{
    int sum = 0;
    foreach (int i in array)
        sum += i;
    return sum;
}
The performance hit still occurs when the object passed in is a primitive array, and also if I change the loop to a for loop instead of a foreach loop.
I can get around the performance hit by coding it like so:
private static int Sum(IList&lt;int&gt; array)
{
    int sum = 0;
    if (array is int[])
        foreach (int i in (int[])array)
            sum += i;
    else
        foreach (int i in array)
            sum += i;
    return sum;
}
Is there a more elegant way of solving this issue? Thank you for your time.
Edit: My benchmark code:
static void Main(string[] args)
{
    int[] values = Enumerable.Range(0, 10000000).ToArray();
    Stopwatch sw = new Stopwatch();
    sw.Start();
    Sum(values);
    //Sum((IList&lt;int&gt;)values);
    sw.Stop();
    Console.WriteLine("Elapsed: {0} ms", sw.ElapsedMilliseconds);
    Console.Read();
}

Your best bet is to create an overload of Sum that takes int[] as the argument, if this method is performance-critical. The CLR's JIT can detect foreach-style iteration over an array, skip the range checking, and address each element directly. Each loop iteration takes 3-5 instructions on x86, with only one memory lookup.
When using IList&lt;int&gt;, the JIT has no knowledge of the underlying collection's structure and ends up going through IEnumerator&lt;int&gt;. Each loop iteration makes two interface invocations: one for Current, one for MoveNext (2-3 memory lookups and a call for each of those). This most likely ends up at ~20 quite expensive instructions per iteration, and there is not much you can do about it.
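For illustration, here is a minimal sketch of that overload pair; callers whose argument is statically typed as int[] bind to the fast overload at compile time, while everything else falls back to the interface version:
// Fast path: the JIT recognizes foreach over a real array and elides the range checks.
private static int Sum(int[] array)
{
    int sum = 0;
    foreach (int i in array)
        sum += i;
    return sum;
}

// General fallback for any other IList&lt;int&gt; implementation (goes through IEnumerator&lt;int&gt;).
private static int Sum(IList&lt;int&gt; array)
{
    int sum = 0;
    foreach (int i in array)
        sum += i;
    return sum;
}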
Edit: If you are curious about the actual machine code emitted by the JIT from a Release build without a debugger attached, here it is.
Array version
int s = 0;
00000000 push ebp
00000001 mov ebp,esp
00000003 push edi
00000004 push esi
00000005 xor esi,esi
foreach (int i in arg)
00000007 xor edx,edx
00000009 mov edi,dword ptr [ecx+4]
0000000c test edi,edi
0000000e jle 0000001B
00000010 mov eax,dword ptr [ecx+edx*4+8]
s += i;
00000014 add esi,eax
00000016 inc edx
foreach (int i in arg)
00000017 cmp edi,edx
00000019 jg 00000010
IEnumerable version
int s = 0;
00000000 push ebp
00000001 mov ebp,esp
00000003 push edi
00000004 push esi
00000005 push ebx
00000006 sub esp,1Ch
00000009 mov esi,ecx
0000000b lea edi,[ebp-28h]
0000000e mov ecx,6
00000013 xor eax,eax
00000015 rep stos dword ptr es:[edi]
00000017 mov ecx,esi
00000019 xor eax,eax
0000001b mov dword ptr [ebp-18h],eax
0000001e xor edx,edx
00000020 mov dword ptr [ebp-24h],edx
foreach (int i in arg)
00000023 call dword ptr ds:[009E0010h]
00000029 mov dword ptr [ebp-28h],eax
0000002c mov ecx,dword ptr [ebp-28h]
0000002f call dword ptr ds:[009E0090h]
00000035 test eax,eax
00000037 je 00000052
00000039 mov ecx,dword ptr [ebp-28h]
0000003c call dword ptr ds:[009E0110h]
s += i;
00000042 add dword ptr [ebp-24h],eax
foreach (int i in arg)
00000045 mov ecx,dword ptr [ebp-28h]
00000048 call dword ptr ds:[009E0090h]
0000004e test eax,eax
00000050 jne 00000039
00000052 mov dword ptr [ebp-1Ch],0
00000059 mov dword ptr [ebp-18h],0FCh
00000060 push 0F403BCh
00000065 jmp 00000067
00000067 cmp dword ptr [ebp-28h],0
0000006b je 00000076
0000006d mov ecx,dword ptr [ebp-28h]
00000070 call dword ptr ds:[009E0190h]

Welcome to optimization. Things aren't always obvious here!
Basically, as you've found, when the compiler can prove that certain safety constraints hold, it can issue enormously more efficient code under full optimization. Here (as MagnatLU shows) knowing you've got an array allows all sorts of assumptions to be made about the size being fixed, and it allows memory to be accessed directly (which also integrates maximally efficiently with the CPU's memory prefetching, for bonus speed). When the compiler doesn't have the proof it needs to generate the super-fast code, it plays it safe. (This is the right thing to do.)
As a general comment, your workaround code is pretty simple as code-written-for-optimization goes (where making the code super-readable and maintainable isn't always the first consideration). I don't really see how you could better it without making your class's API more complex (not a win!). Moreover, just adding a comment inside the body saying why you've done it, as in the sketch below, would solve the maintenance issue; this is in fact one of the best uses for (non-doc) comments in code in the first place. Given that the use case is large arrays (i.e., that it's reasonable to optimize at all), I'd say you have a great solution right there.
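For instance, the question's workaround with such a comment in place (a sketch; the logic is unchanged):
private static int Sum(IList&lt;int&gt; array)
{
    int sum = 0;
    // PERF: foreach over a concrete int[] lets the JIT skip range checks and
    // interface dispatch; the generic IList&lt;int&gt; path costs several times more
    // per element (see MagnatLU's disassembly above).
    if (array is int[])
        foreach (int i in (int[])array)
            sum += i;
    else
        foreach (int i in array)
            sum += i;
    return sum;
}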

As an alternative to @MagnatLU's answer above, you can use for instead of foreach and cache the list's Count. There is still overhead compared to int[], but not quite as much: the IList&lt;int&gt; overload's duration decreased by ~50% using your test code on my machine.
private static int Sum(IList&lt;int&gt; array)
{
    int sum = 0;
    int count = array.Count;
    for (int i = 0; i &lt; count; i++)
        sum += array[i];
    return sum;
}

Related

Why is the enumeration value from a multi-dimensional array not equal to itself?

Consider:
using System;

public class Test
{
    enum State : sbyte { OK = 0, BUG = -1 }

    static void Main(string[] args)
    {
        var s = new State[1, 1];
        s[0, 0] = State.BUG;
        State a = s[0, 0];
        Console.WriteLine(a == s[0, 0]); // False
    }
}
How can this be explained? It occurs in debug builds in Visual Studio 2015 when running in the x86 JIT. A release build or running in the x64 JIT prints True as expected.
To reproduce from the command line:
csc Test.cs /platform:x86 /debug
(/debug:pdbonly, /debug:portable and /debug:full also reproduce.)
You found a code generation bug in the .NET 4 x86 jitter. It is a very unusual one: it only fails when the code is not optimized. The machine code looks like this:
State a = s[0, 0];
013F04A9 push 0 ; index 2 = 0
013F04AB mov ecx,dword ptr [ebp-40h] ; s[] reference
013F04AE xor edx,edx ; index 1 = 0
013F04B0 call 013F0058 ; eax = s[0, 0]
013F04B5 mov dword ptr [ebp-4Ch],eax ; $temp1 = eax
013F04B8 movsx eax,byte ptr [ebp-4Ch] ; convert sbyte to int
013F04BC mov dword ptr [ebp-44h],eax ; a = s[0, 0]
Console.WriteLine(a == s[0, 0]); // False
013F04BF mov eax,dword ptr [ebp-44h] ; a
013F04C2 mov dword ptr [ebp-50h],eax ; $temp2 = a
013F04C5 push 0 ; index 2 = 0
013F04C7 mov ecx,dword ptr [ebp-40h] ; s[] reference
013F04CA xor edx,edx ; index 1 = 0
013F04CC call 013F0058 ; eax = s[0, 0]
013F04D1 mov dword ptr [ebp-54h],eax ; $temp3 = eax
; <=== Bug here!
013F04D4 mov eax,dword ptr [ebp-50h] ; a == s[0, 0]
013F04D7 cmp eax,dword ptr [ebp-54h]
013F04DA sete cl
013F04DD movzx ecx,cl
013F04E0 call 731C28F4
A plodding affair with lots of temporaries and code duplication; that's normal for unoptimized code. The instruction at 013F04B8 is notable: that is where the necessary conversion from sbyte to a 32-bit integer occurs. The array getter helper function returned 0x000000FF, equal to State.BUG, and that needs to be converted to -1 (0xFFFFFFFF) before the value can be compared. The MOVSX instruction is a Sign eXtension instruction.
The same thing happens again at 013F04CC, but this time there is no MOVSX instruction to make the same conversion. That's where the chips fall down: the CMP instruction compares 0xFFFFFFFF with 0x000000FF, and that is false. So this is an error of omission; the code generator failed to emit MOVSX again to perform the same sbyte-to-int conversion.
What is particularly unusual about this bug is that it works correctly when you enable the optimizer; it then knows to use MOVSX in both cases.
The probable reason this bug went undetected for so long is the use of sbyte as the base type of the enum, which is quite rare to do. Using a multi-dimensional array is instrumental as well; the combination is fatal.
Otherwise a pretty critical bug, I'd say. How widespread it might be is hard to guess; I only have the 4.6.1 x86 jitter to test. The x64 and the 3.5 x86 jitters generate very different code and avoid this bug. The temporary workaround to keep going is to remove sbyte as the enum base type and let it be the default, int, so no sign extension is necessary.
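That workaround is a one-line change to the declaration (a sketch, assuming nothing else depends on the enum's storage size):
// Default underlying type (int): the element is already 32 bits wide,
// so no sbyte-to-int sign extension is needed when reading it back.
enum State { OK = 0, BUG = -1 }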
You can file the bug at connect.microsoft.com, linking to this Q+A should be enough to tell them everything they need to know. Let me know if you don't want to take the time and I'll take care of it.
Let's consider OP's declaration:
enum State : sbyte { OK = 0, BUG = -1 }
Since the bug only occurs when BUG is negative (from -128 to -1) and State is an enum backed by a signed byte, I started to suspect a cast issue somewhere.
If you run this:
Console.WriteLine((sbyte)s[0, 0]);
Console.WriteLine((sbyte)State.BUG);
Console.WriteLine(s[0, 0]);
unchecked
{
    Console.WriteLine((byte)State.BUG);
}
it will output:
255
-1
BUG
255
For a reason unknown to me (as of now), s[0, 0] is treated as a byte before evaluation, and that's why it claims that a == s[0, 0] is false.

What are these extra disassembly instructions when using SIMD intrinsics?

I'm testing what sort of speedup I can get from using SIMD instructions with RyuJIT and I'm seeing some disassembly instructions that I don't expect. I'm basing the code on this blog post from the RyuJIT team's Kevin Frei, and a related post here. Here's the function:
static void AddPointwiseSimd(float[] a, float[] b) {
    int simdLength = Vector&lt;float&gt;.Count;
    int i = 0;
    for (i = 0; i &lt; a.Length - simdLength; i += simdLength) {
        Vector&lt;float&gt; va = new Vector&lt;float&gt;(a, i);
        Vector&lt;float&gt; vb = new Vector&lt;float&gt;(b, i);
        va += vb;
        va.CopyTo(a, i);
    }
}
The section of disassembly I'm querying copies the array values into the Vector<float>. Most of the disassembly is similar to that in Kevin and Sasha's posts, but I've highlighted some extra instructions (along with my confused annotations) that don't appear in their disassemblies:
;// Vector<float> va = new Vector<float>(a, i);
cmp eax,r8d ; <-- Unexpected - Compare a.Length to i?
jae 00007FFB17DB6D5F ; <-- Unexpected - Jump to range check failure
lea r10d,[rax+3]
cmp r10d,r8d
jae 00007FFB17DB6D5F
mov r11,rcx ; <-- Unexpected - Extra register copy?
movups xmm0,xmmword ptr [r11+rax*4+10h]
;// Vector<float> vb = new Vector<float>(b, i);
cmp eax,r9d ; <-- Unexpected - Compare b.Length to i?
jae 00007FFB17DB6D5F ; <-- Unexpected - Jump to range check failure
cmp r10d,r9d
jae 00007FFB17DB6D5F
movups xmm1,xmmword ptr [rdx+rax*4+10h]
Note the loop range check is as expected:
;// for (i = 0; i < a.Length - simdLength; i += simdLength) {
add eax,4
cmp r9d,eax
jg loop
so I don't know why there are extra comparisons to eax. Can anyone explain why I'm seeing these extra instructions, and whether it's possible to get rid of them?
In case it's related to the project settings I've got a very similar project that shows the same issue here on github (see FloatSimdProcessor.HwAcceleratedSumInPlace() or UShortSimdProcessor.HwAcceleratedSumInPlaceUnchecked()).
I'll annotate the code generation that I see for a processor that supports AVX2, such as Haswell; it can move 8 floats at a time:
00007FFA1ECD4E20 push rsi
00007FFA1ECD4E21 sub rsp,20h
00007FFA1ECD4E25 xor eax,eax ; i = 0
00007FFA1ECD4E27 mov r8d,dword ptr [rcx+8] ; a.Length
00007FFA1ECD4E2B lea r9d,[r8-8] ; a.Length - simdLength
00007FFA1ECD4E2F test r9d,r9d ; if (i >= a.Length - simdLength)
00007FFA1ECD4E32 jle 00007FFA1ECD4E75 ; then skip loop
00007FFA1ECD4E34 mov r10d,dword ptr [rdx+8] ; b.Length
00007FFA1ECD4E38 cmp eax,r8d ; if (i >= a.Length)
00007FFA1ECD4E3B jae 00007FFA1ECD4E7B ; then OutOfRangeException
00007FFA1ECD4E3D lea r11d,[rax+7] ; i+7
00007FFA1ECD4E41 cmp r11d,r8d ; if (i+7 >= a.Length)
00007FFA1ECD4E44 jae 00007FFA1ECD4E7B ; then OutOfRangeException
00007FFA1ECD4E46 mov rsi,rcx ; move a[i..i+7]
00007FFA1ECD4E49 vmovupd ymm0,ymmword ptr [rsi+rax*4+10h]
00007FFA1ECD4E50 cmp eax,r10d ; same as above
00007FFA1ECD4E53 jae 00007FFA1ECD4E7B ; but for b
00007FFA1ECD4E55 cmp r11d,r10d
00007FFA1ECD4E58 jae 00007FFA1ECD4E7B
00007FFA1ECD4E5A vmovupd ymm1,ymmword ptr [rdx+rax*4+10h]
00007FFA1ECD4E61 vaddps ymm0,ymm0,ymm1 ; a[i..] + b[i...]
00007FFA1ECD4E66 vmovupd ymmword ptr [rsi+rax*4+10h],ymm0
00007FFA1ECD4E6D add eax,8 ; i += 8
00007FFA1ECD4E70 cmp r9d,eax ; if (i < a.Length)
00007FFA1ECD4E73 jg 00007FFA1ECD4E38 ; then loop
00007FFA1ECD4E75 add rsp,20h
00007FFA1ECD4E79 pop rsi
00007FFA1ECD4E7A ret
So the eax compares are those "pesky bound checks" the blog post talks about. The blog post shows an optimized version that is not actually implemented (yet); the real code right now checks both the first and the last index of the 8 floats that are moved at the same time. The blog post's comment "Hopefully, we'll get our bounds-check elimination work strengthened enough" is an uncompleted task :)
The mov rsi,rcx instruction is present in the blog post as well and appears to be a limitation in the register allocator, probably influenced by RCX being an important register (it normally stores this). Not important enough to do the work to get it optimized away, I'd assume; register-to-register moves take 0 cycles since they only affect register renaming.
Note how ugly the difference between SSE2 and AVX2 gets: while the code moves and adds 8 floats at a time, it only actually uses 4 of them. Vector&lt;float&gt;.Count is 4 regardless of the processor flavor, leaving 2x perf on the table. Hard to hide that implementation detail, I guess.
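As an aside, if you want to see what your own runtime reports, Vector.IsHardwareAccelerated and Vector&lt;float&gt;.Count are the two values to print. A small sketch; acceleration requires RyuJIT and a Release build run without the debugger attached:
using System;
using System.Numerics;

class VectorInfo
{
    static void Main()
    {
        // False under the legacy JIT or when the debugger suppresses optimization.
        Console.WriteLine("Accelerated: {0}", Vector.IsHardwareAccelerated);
        // The lane count the JIT will use for float vectors.
        Console.WriteLine("Vector<float>.Count: {0}", Vector<float>.Count);
    }
}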

Assigning a constant cast to a var in C#

How smart is the C# compiler, given the following:
float a = 1; //A
var b = 1f; //B
var c = (float)1; //C - Is this line equivalent to A and B?
var d = Convert.ToSingle(1); //D - Is this line equivalent to A and B?
As far as I know, A and B are equivalent after the compilation. What about the other lines?
Are C and D optimized in compile-time to be equivalent to A and B or are they going to be assigned only in run-time, causing more processing to perform the assignment?
I suppose the cast (C) must be optimized away, while the function call (D) must not.
In any case, how could I investigate and compare the generated assembly code using VS2012?
The first three lines are equivalent; in fact they compile down to the same IL (at least with the .NET 4 compiler that I used).
The fourth is a runtime conversion performed by calling a method, which is a completely different beast.
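For the curious, the IL for the four statements looks roughly like this (a sketch from a Release build, locals simplified); the first three load the same float constant, only the fourth emits a call:
ldc.r4 1          // float a = 1;      literal converted at compile time
stloc.0
ldc.r4 1          // var b = 1f;
stloc.1
ldc.r4 1          // var c = (float)1; cast folded away by the compiler
stloc.2
ldc.i4.1          // var d = Convert.ToSingle(1);
call float32 [mscorlib]System.Convert::ToSingle(int32)
stloc.3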
Regarding the inspection of generated IL, take a look at A tool for easy IL code inspection.
how could I investigate and compare the generated assembly code using VS2012?
Right-click while stopped at a breakpoint and choose "Go To Disassembly" (or press Ctrl+Alt+D). Answers are below:
float x = 1; //A
00000061 fld1
00000063 fstp dword ptr [ebp-40h]
var x1 = 1f; //B
00000066 fld1
00000068 fstp dword ptr [ebp-44h]
var x2 = (float)1; //C - Is this line equivalent to A and B?
0000006b fld1
0000006d fstp dword ptr [ebp-48h]
var x3= Convert.ToSingle(1); //D - Is this line equivalent to A and B?
00000070 mov ecx,1
00000075 call 5FB7A2DC
0000007a fstp dword ptr [ebp-50h]
0000007d fld dword ptr [ebp-50h]
00000080 fstp dword ptr [ebp-4Ch]

Why does this addition of byte* and uint fail to carry into the higher dword?

Now filed on Microsoft Connect; please upvote if you feel it needs fixing. I've also simplified the test case a lot:
byte* data = (byte*) 0x76543210;
uint offset = 0x80000000;
byte* wrong = data + offset;
byte* correct = data + (uint) 0x80000000;
// "wrong" is now 0xFFFFFFFFF6543210 (!)
// "correct" is 0xF6543210
Looking at the IL, as far as I can tell, the C# compiler did everything right, and the bug lies in the JITter.
Original question: What is going on here?
byte* data = (byte*)Marshal.AllocHGlobal(0x100);

uint uioffset = 0xFFFF0000;
byte* uiptr1 = data + uioffset;
byte* uiptr2 = data + (uint)0xFFFF0000;

ulong uloffset = 0xFFFF0000;
byte* ulptr1 = data + uloffset;
byte* ulptr2 = data + (ulong)0xFFFF0000;

Action&lt;string, ulong&gt; dumpValue =
    (name, value) =&gt; Console.WriteLine("{0,8}: {1:x16}", name, value);

dumpValue("data", (ulong)data);
dumpValue("uiptr1", (ulong)uiptr1);
dumpValue("uiptr2", (ulong)uiptr2);
dumpValue("ulptr1", (ulong)ulptr1);
dumpValue("ulptr2", (ulong)ulptr2);
This test requires a 64-bit OS, with the build targeting the x64 platform.
Output:
data: 000000001c00a720 (original pointer)
uiptr1: 000000001bffa720 (pointer with a failed carry into the higher dword)
uiptr2: 000000011bffa720 (pointer with a correct carry into the higher dword)
ulptr1: 000000011bffa720 (pointer with a correct carry into the higher dword)
ulptr2: 000000011bffa720 (pointer with a correct carry into the higher dword)
^
look here
So is this a bug or did I mess something up?
I think you are encountering this C# compiler bug: https://connect.microsoft.com/VisualStudio/feedback/details/675205/c-compiler-performs-sign-extension-during-unsigned-pointer-arithmetic
Which was filed as a result of this question: 64-bit pointer arithmetic in C#, Check for arithmetic overflow changes behavior
I checked the emitted x64 asm and these are my observations:
Base pointer:
data:
00000000024539E0
Pointer with correct carry:
data + (uint)0xFFFF0000:
00000001024439E0
Disassembly of the instructions:
byte* ptr2 = data + ((uint)0xFFFF0000); // redundant cast to be extra sure
00000084 mov ecx,0FFFF0000h
00000089 mov rax,qword ptr [rsp+20h]
0000008e add rax,rcx
00000091 mov qword ptr [rsp+38h],rax
Pointer with incorrect carry:
data + offset:
00000000024439E0
Disassembly of the instructions:
uint offset = 0xFFFF0000;
0000006a mov dword ptr [rsp+28h],0FFFF0000h
byte* ptr1 = data + offset;
00000072 movsxd rcx,dword ptr [rsp+28h] ; (1)
00000077 mov rax,qword ptr [rsp+20h]
0000007c add rax,rcx
0000007f mov qword ptr [rsp+30h],rax
The instruction (1) converts an unsigned int32 into a signed long with sign extension (bug or feature?). Therefore rcx contains 0xFFFFFFFFFFFF0000, while it should contain 0x00000000FFFF0000 for the addition to work properly.
And according to 64-bit arithmetic:
0xFFFFFFFFFFFF0000 +
0x00000000024539E0 =
0x00000000024439E0
The add overflows indeed.
I don't know if this is a bug or intended behavior; I'm going to check the SSCLI before trying to give any conclusion. EDIT: See Ben Voigt's answer.
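In the meantime, a minimal workaround sketch is to widen the offset to ulong yourself before the addition, matching the ulptr cases in the question that carry correctly:
using System;
using System.Runtime.InteropServices;

static class PointerWorkaround
{
    // Compile with /unsafe and target x64, as in the question.
    static unsafe void Demo()
    {
        byte* data = (byte*)Marshal.AllocHGlobal(0x100);
        uint offset = 0xFFFF0000;

        // Widening to ulong forces a zero extension up front, so the JIT
        // never gets the chance to emit the sign-extending movsxd.
        byte* ok = data + (ulong)offset;   // carries into the high dword correctly

        Marshal.FreeHGlobal((IntPtr)data);
    }
}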

Why the performance difference between C# (quite a bit slower) and Win32/C?

We are looking to migrate a performance-critical application to .NET and find that the C# version is 30% to 100% slower than the Win32/C version, depending on the processor (the difference is more marked on a mobile T7200 processor). I have a very simple sample of code that demonstrates this; for brevity I shall just show the C version, as the C# is a direct translation:
#include "stdafx.h"
#include "Windows.h"

int array1[100000];
int array2[100000];

int Test();

int main(int argc, char* argv[])
{
    int res = Test();
    return 0;
}

int Test()
{
    int calc, i, k;
    calc = 0;
    for (i = 0; i &lt; 50000; i++) array1[i] = i + 2;
    for (i = 0; i &lt; 50000; i++) array2[i] = 2 * i - 2;
    for (i = 0; i &lt; 50000; i++)
    {
        for (k = 0; k &lt; 50000; k++)
        {
            if (array1[i] == array2[k]) calc = calc - array2[i] + array1[k];
            else calc = calc + array1[i] - array2[k];
        }
    }
    return calc;
}
If we look at the disassembly in Win32 for the 'else' we have:
35: else calc = calc + array1[i] - array2[k];
004011A0 jmp Test+0FCh (004011bc)
004011A2 mov eax,dword ptr [ebp-8]
004011A5 mov ecx,dword ptr [ebp-4]
004011A8 add ecx,dword ptr [eax*4+48DA70h]
004011AF mov edx,dword ptr [ebp-0Ch]
004011B2 sub ecx,dword ptr [edx*4+42BFF0h]
004011B9 mov dword ptr [ebp-4],ecx
(this is in debug but bear with me)
The disassembly for the optimised c# version using the CLR debugger on the optimised exe:
else calc = calc + pev_tmp[i] - gat_tmp[k];
000000a7 mov eax,dword ptr [ebp-4]
000000aa mov edx,dword ptr [ebp-8]
000000ad mov ecx,dword ptr [ebp-10h]
000000b0 mov ecx,dword ptr [ecx]
000000b2 cmp edx,dword ptr [ecx+4]
000000b5 jb 000000BC
000000b7 call 792BC16C
000000bc add eax,dword ptr [ecx+edx*4+8]
000000c0 mov edx,dword ptr [ebp-0Ch]
000000c3 mov ecx,dword ptr [ebp-14h]
000000c6 mov ecx,dword ptr [ecx]
000000c8 cmp edx,dword ptr [ecx+4]
000000cb jb 000000D2
000000cd call 792BC16C
000000d2 sub eax,dword ptr [ecx+edx*4+8]
000000d6 mov dword ptr [ebp-4],eax
Many more instructions, presumably the cause of the performance difference.
So, three questions really:
Am I looking at the correct disassembly for the two programs, or are the tools misleading me?
If the difference in the number of generated instructions is not the cause of the difference, what is?
What can we possibly do about it, other than keeping all our performance-critical code in a native DLL?
Thanks in advance
Steve
PS I did receive an invite recently to a joint MS/Intel seminar entitled something like 'Building performance critical native applications' Hmm...
I believe your main issue in this code is going to be bounds checking on your arrays.
If you switch to using unsafe code in C#, and use pointer math, you should be able to achieve the same (or potentially faster) code.
This same issue was previously discussed in detail in this question.
I believe you are seeing the results of bounds checks on the arrays. You can avoid the bounds checks by using unsafe code.
I believe the JITer can recognize patterns like for loops that go up to array.Length and avoid the bounds check, but it doesn't look like your code can utilize that.
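For illustration, a sketch of the unsafe variant of the question's inner loops (pinning both arrays and indexing through raw pointers, so no range checks are emitted; compile with /unsafe):
static unsafe int Test(int[] array1, int[] array2)
{
    int calc = 0;
    // array1/array2 are the 100000-element arrays from the question.
    fixed (int* p1 = array1)
    fixed (int* p2 = array2)
    {
        for (int i = 0; i < 50000; i++)
        {
            for (int k = 0; k < 50000; k++)
            {
                // Raw pointer indexing: no bounds checks, same arithmetic as the C version.
                if (p1[i] == p2[k]) calc = calc - p2[i] + p1[k];
                else calc = calc + p1[i] - p2[k];
            }
        }
    }
    return calc;
}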
As others have said, one of the aspects is bounds checking. There's also some redundancy in your code in terms of array access. I've managed to improve the performance somewhat by changing the inner block to:
int tmp1 = array1[i];
int tmp2 = array2[k];
if (tmp1 == tmp2)
{
    calc = calc - array2[i] + array1[k];
}
else
{
    calc = calc + tmp1 - tmp2;
}
That change knocked the total time down from ~8.8s to ~5s.
Just for fun, I tried building this in C# in Visual Studio 2010, and took a look at the JITed disassembly:
else
calc = calc + array1[i] - array2[k];
000000cf mov eax,dword ptr [ebp-10h]
000000d2 add eax,dword ptr [ebp-14h]
000000d5 sub eax,edx
000000d7 mov dword ptr [ebp-10h],eax
They made a number of improvements to the jitter in the 4.0 release of the CLR.
C# is doing bounds checking.
When running the calculation part in C# with unsafe code, does it perform as well as the native implementation?
If your application's performance-critical path consists entirely of unchecked array processing, I'd advise you not to rewrite it in C#.
But then, if your application already works fine in language X, I'd advise you not to rewrite it in language Y.
What do you want to achieve from the rewrite? At the very least, give serious consideration to a mixed language solution, using your already-debugged C code for the high performance sections and using C# to get a nice user interface or convenient integration with the latest rich .NET libraries.
A longer answer on a possibly related theme.
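The mixed-language shape is straightforward; a minimal sketch, assuming the existing C Test() function is built into a native DLL (both the DLL name and the export here are hypothetical):
using System.Runtime.InteropServices;

static class NativeCalc
{
    // Hypothetical export: the already-debugged C Test() compiled into NativeCalc.dll.
    [DllImport("NativeCalc.dll", CallingConvention = CallingConvention.Cdecl)]
    public static extern int Test();
}

// Usage from the C# side: int result = NativeCalc.Test();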
I am sure the optimization for C is different from that for C#. Also, you have to expect at least a little performance slowdown, as .NET adds another layer to the application with the framework.
The trade-off is more rapid development and huge libraries of functionality, for (what should be) a small loss of speed.
